SchemaForge: Extracting Structured Data from Unstructured Documents with Azure AI

The Problem with Unstructured Documents

An enterprise generates thousands of documents per month: invoices, compliance reports, technical specifications, contracts, procurement forms. Each document type has a different structure. Some are PDFs from legacy systems. Some are scanned paper. Some are Word documents exported to PDF with inconsistent formatting.

The question is: how do you extract specific, structured data from hundreds of different document layouts without writing a custom parser for each one?

That was the challenge behind SchemaForge.

What Azure Document Intelligence Provides

Azure Document Intelligence (formerly Form Recognizer) uses computer vision and layout analysis to understand document structure — not just text, but where the text is and what it means structurally.

The key models:

| Model | Best For |
|---|---|
| | General key-value extraction, tables |
| | Invoices with standard fields |

For SchemaForge, we primarily use the layout model combined with an LLM post-processing step, because the real value is in the semantic understanding, not just the OCR.

The Two-Stage Pipeline

Stage 1: Layout Analysis

Setting the output format to Markdown is a key optimisation. The model then returns document content as structured Markdown, which preserves tables, headers, and document hierarchy far better than plain text, and it is the format the LLM stage works best with.
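A minimal sketch of this stage in Python, assuming the `azure-ai-documentintelligence` SDK (1.0 or later, where the document is passed as `body`). The model ID `prebuilt-layout` and the `output_content_format="markdown"` option are assumptions about the setup described above, and the helper names `make_client` and `analyze_layout` are hypothetical.

```python
from typing import Any, BinaryIO


def make_client(endpoint: str, key: str) -> Any:
    """Build a Document Intelligence client for the given resource."""
    # Imported lazily so this module loads even without the SDK installed.
    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.core.credentials import AzureKeyCredential

    return DocumentIntelligenceClient(
        endpoint=endpoint, credential=AzureKeyCredential(key)
    )


def analyze_layout(
    client: Any, document: BinaryIO, model_id: str = "prebuilt-layout"
) -> str:
    """Run layout analysis and return the document content as Markdown."""
    poller = client.begin_analyze_document(
        model_id,
        body=document,                     # open binary stream (PDF, image, ...)
        output_content_format="markdown",  # assumed setting: Markdown output
    )
    # The call is a long-running operation; result() blocks until it finishes.
    return poller.result().content
```

Passing the client in as an argument keeps the function easy to exercise with a stub. A real call would look like `analyze_layout(make_client(endpoint, key), open("invoice.pdf", "rb"))`, where the endpoint, key, and file name are placeholders.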

Stage 2: Schema Mapping with GPT-4o

The user defines a JSON schema for what they want to extract:
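As an illustration, a schema for invoices might look like the following. Every field name here is hypothetical and not taken from SchemaForge itself. It is written as a Python dictionary so it can be serialised with `json.dumps` and embedded in a prompt.

```python
import json

# Hypothetical extraction schema for an invoice (JSON Schema style).
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "invoice_date": {"type": "string", "format": "date"},
        "vendor_name": {"type": "string"},
        "total_amount": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
            },
        },
    },
    "required": ["invoice_number", "invoice_date", "total_amount"],
}

if __name__ == "__main__":
    # Print the schema as the JSON text a user would supply.
    print(json.dumps(INVOICE_SCHEMA, indent=2))
```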

GPT-4o maps the extracted layout content to this schema:
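A sketch of how the mapping step could be wired up, under stated assumptions: the prompt wording is illustrative, and `complete` is a hypothetical callable that takes a system prompt and a user prompt and returns the model's reply as text. In practice it would wrap a GPT-4o chat call configured to return JSON.

```python
import json
from typing import Any, Callable, Dict

# Illustrative prompt, not the wording SchemaForge uses.
SYSTEM_PROMPT = (
    "You extract structured data from documents. "
    "Reply with a single JSON object that conforms to the given JSON schema. "
    "Use null for any field that is not present in the document."
)


def build_user_prompt(schema: Dict[str, Any], markdown: str) -> str:
    """Combine the target schema and the Markdown layout output."""
    return (
        "JSON schema:\n"
        + json.dumps(schema, indent=2)
        + "\n\nDocument (Markdown):\n"
        + markdown
    )


def map_to_schema(
    schema: Dict[str, Any],
    markdown: str,
    complete: Callable[[str, str], str],
) -> Dict[str, Any]:
    """Ask the model to fill the schema, then normalise its reply."""
    raw = complete(SYSTEM_PROMPT, build_user_prompt(schema, markdown))
    data = json.loads(raw)  # raises ValueError if the reply is not valid JSON
    if not isinstance(data, dict):
        raise ValueError("model did not return a JSON object")
    # Keep only declared fields; fields the model omitted become None.
    return {name: data.get(name) for name in schema.get("properties", {})}
```

Dropping undeclared keys and filling missing ones with `None` gives downstream validation a fixed set of fields to check, whatever the model returned.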

The Hard Problems

Rotated or Skewed Scans

Document Intelligence handles moderate skew (< 15°) automatically. For severely rotated documents, we pre-process with OpenCV:
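The text does not say which OpenCV technique is used, so the following is one possible approach rather than the SchemaForge implementation: a projection-profile search. It rotates a binarised copy of the page through candidate angles and keeps the angle at which the row sums alternate most sharply, which happens when text lines are level. The function names, the ±45° search range, and the 0.5° step are assumptions.

```python
from typing import List, Sequence


def candidate_angles(max_angle: float = 45.0, step: float = 0.5) -> List[float]:
    """Angles in degrees from -max_angle to +max_angle inclusive."""
    count = int(round(2 * max_angle / step))
    return [-max_angle + i * step for i in range(count + 1)]


def profile_score(row_sums: Sequence[float]) -> float:
    """Sharpness of a horizontal projection profile.

    Level text lines give alternating inked and blank rows, so the squared
    differences between neighbouring row sums are large.
    """
    return float(sum((b - a) ** 2 for a, b in zip(row_sums, row_sums[1:])))


def deskew(image, max_angle: float = 45.0, step: float = 0.5):
    """Return (rotated image, applied angle in degrees) for a BGR page scan."""
    import cv2  # third-party: opencv-python

    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Otsu threshold, inverted so ink is non-zero on a black background.
    _, binary = cv2.threshold(
        gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU
    )
    height, width = binary.shape
    center = (width / 2.0, height / 2.0)

    def rotate(img, angle, border):
        matrix = cv2.getRotationMatrix2D(center, angle, 1.0)
        return cv2.warpAffine(
            img, matrix, (width, height),
            flags=cv2.INTER_LINEAR, borderValue=border,
        )

    def score(angle):
        rows = rotate(binary, angle, 0).sum(axis=1).tolist()
        # Ties (for example a blank page) go to the smallest rotation.
        return (profile_score(rows), -abs(angle))

    best = max(candidate_angles(max_angle, step), key=score)
    # Rotate the original page, filling exposed corners with white.
    return rotate(image, best, (255, 255, 255)), best
```

Because the search simply keeps whichever rotation scores best, it does not depend on the sign convention of any angle estimate. It is slow on full-resolution scans, so running the search on a downscaled copy and applying the chosen angle to the original is a reasonable refinement.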

Validation and Confidence Scoring

Every extraction gets a confidence score based on field coverage, numerical consistency, and format validation:
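A minimal sketch of such a score, assuming an invoice-style result. The field names, the format rules, and the 0.5 / 0.3 / 0.2 weights are all hypothetical; the source names the three components but not how they are computed or combined.

```python
import re
from typing import Any, Dict, Sequence

# Hypothetical format rules: field name -> pattern the value must fully match.
FORMAT_RULES = {
    "invoice_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "invoice_number": re.compile(r"[A-Z0-9-]+"),
}


def field_coverage(extracted: Dict[str, Any], required: Sequence[str]) -> float:
    """Fraction of required fields that are present and non-empty."""
    if not required:
        return 1.0
    present = sum(1 for name in required if extracted.get(name) not in (None, "", []))
    return present / len(required)


def numerical_consistency(extracted: Dict[str, Any], tolerance: float = 0.01) -> float:
    """1.0 if line-item amounts add up to the total, else 0.0."""
    total = extracted.get("total_amount")
    items = extracted.get("line_items") or []
    amounts = [item.get("amount") for item in items]
    if not isinstance(total, (int, float)) or not amounts:
        return 0.0  # cannot be cross-checked, so no credit
    if not all(isinstance(a, (int, float)) for a in amounts):
        return 0.0
    return 1.0 if abs(sum(amounts) - total) <= tolerance else 0.0


def format_validity(extracted: Dict[str, Any]) -> float:
    """Fraction of rule-covered string fields whose value matches its rule."""
    checked = [
        (rule, extracted[name])
        for name, rule in FORMAT_RULES.items()
        if isinstance(extracted.get(name), str)
    ]
    if not checked:
        return 0.0
    return sum(1 for rule, value in checked if rule.fullmatch(value)) / len(checked)


def confidence_score(extracted: Dict[str, Any], required: Sequence[str]) -> float:
    """Weighted combination of the three checks, in the range 0 to 1."""
    return (
        0.5 * field_coverage(extracted, required)
        + 0.3 * numerical_consistency(extracted)
        + 0.2 * format_validity(extracted)
    )
```

For example, a result with three of four required fields filled, line items that sum to the total, and well-formed date and invoice number would score 0.5 × 0.75 + 0.3 + 0.2 = 0.875 under these weights.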

In production, SchemaForge achieves 94% extraction accuracy on invoice documents and 89% on technical specification sheets — compared to 60–70% for pure OCR approaches without the LLM mapping stage.