Using Azure AI Document Intelligence and Azure OpenAI to extract structured data from documents (2024)

Addressing the challenges of efficient document processing, explore a novel solution to extract structured data from documents using Azure AI Document Intelligence and Azure OpenAI.

Context

In today’s data-driven landscape, efficient document processing is crucial for most organizations worldwide. Accurate document analysis is essential to provide much needed streamlining of business workflows to enhance productivity.

In this article, we’ll explore the key challenges that solution providers face with extracting relevant, structured data from documents. We'll also showcase a novel solution to solve these challenges using Azure AI Document Intelligence and Azure OpenAI.

Key challenges of effective document data extraction

ISVs and Digital Natives building document data extraction solutions often grapple with the complexities of finding a reliable mechanism to parse their customer’s documents. The key challenges include:

Benefits of using Azure AI Document Intelligence with Azure OpenAI

As solution providers for document data extraction capabilities, the following approach enables these benefits over other approaches:

No requirement to train a custom model. Combining these Azure AI services allows you to extract structured data without the need to train a custom model for the various document formats and layouts that your solution may receive. Instead, you tailor natural language prompts to your specific needs.
Define your own schema. The capabilities of GPT models enables you to extract data that matches or closely matches a schema that you define. This is a major benefit over alternative approach, particularly when each document’s domain jargon differs. This makes it easier to extract structured data accurately for your downstream processes post-extraction.
Out-of-the-box support for multiple file types. This approach supports a variety of document types, including PDFs, Office file types, HTML, and images. This flexibility allows you to extract structure data from a variety of sources without the need for custom logic in your application for each file type.

Let’s explore how to extract structured data from documents with both Azure AI Document Intelligence and Azure OpenAI in more detail.

Understanding layout analysis to Markdown with Azure AI Document Intelligence

Updated in March 2024, the pre-built layout model in Azure AI Document Intelligence gained new capabilities to extract content and structure from Office file types (Word, PowerPoint, and Excel) and HTML, alongside the existing PDF and image capabilities.

This introduced the capability for document processing solutions to take any document, such as a contract or invoice, with any layout or file format, and convert it into a structured Markdown output. This has the significant benefit of maintaining the content’s hierarchy when extracted.

This is important when we consider the capabilities of the Azure OpenAI GPT models. GPT models are pre-trained on vast amounts of natural language data, which helps them to understand structures and semantic patterns. The simplicity of Markdown’s markup allows GPT models to interpret structures such as headings, lists, and tables, as well as formatting such as links, emphasis (italic/bold), and code blocks.

Combining Azure AI Document Intelligence layout analysis with GPT prompting for data extraction

The following diagram illustrates this novel approach, introducing the new Markdown capabilities of Azure AI Document Intelligence’s pre-built layout model with completion requests to Azure OpenAI to extract the data.

A novel approach to efficient data extraction from documents using Azure AI Document Intelligence and Azure OpenAI

This approach is achieved in the following way:

A customer uploads their files to analyze for data extraction. This could be of any supported file type, including PDF, image, or Word document.
The application makes a request to the Azure AI Document Intelligence’s analyze API using the pre-built layout model with the output content format flag set to Markdown. The document data is provided in the request either as a base64 source or a URI.
- If you are processing many, large documents, it is recommended to use a URI to reduce the memory utilization which will prevent unexpected behavior in your application. You can achieve this approach by uploading your documents to an Azure Blob Storage container and providing a SAS URI to the document.
With the Markdown result as context, prompt the Azure OpenAI completions API with specific instruction to extract the structured data you require in a JSON format. With a now structured data response, you can store this data however you require for the needs of your application.

For a full code sample demonstrating this capability, check out theusing Azure AI Document Intelligence and Azure OpenAI GPT-3.5 Turbo to extract structured data from documentssample on GitHub. Along with the code, this sample includes the necessary infrastructure-as-code Bicep templates to deploy the Azure resources for testing.

Conclusion

Adopting Azure AI Document Intelligence and Azure OpenAI to extract structured data from documents simplifies the challenges of document processing today. This well-rounded solution offers significant benefits over alternatives, removing the requirement to train custom models and improving overall accuracy of data extraction in most use cases.

Consider the following recommendations to maximize the benefits of this approach:

Experiment with prompting for data extraction. The provided code sample provides a well-rounded starting point for structure data extraction. Consider experimenting with the prompt and JSON schemas to incorporate domain specific language to capture the nuances in your documents to improve accuracy further.
Optimize the document processing workflow. As you scale out this approach to production, consider the host resource requirements for your application to process a large quantity of documents. Optimize this approach by maximizing CPU and memory usage by offloading the loading of documents to Azure AI Document Intelligence using URIs.

By adopting this approach, solution providers can streamline their document processing workflows, enhancing productivity for themselves and their customers.