The Problem with Unstructured Documents
An enterprise generates thousands of documents per month: invoices, compliance reports, technical specifications, contracts, procurement forms. Each document type has a different structure. Some are PDFs from legacy systems. Some are scanned paper. Some are Word documents exported to PDF with inconsistent formatting.
The question is: how do you extract specific, structured data from hundreds of different document layouts without writing a custom parser for each one?
That was the challenge behind SchemaForge.
What Azure Document Intelligence Provides
Azure Document Intelligence (formerly Form Recognizer) uses computer vision and layout analysis to understand document structure — not just text, but where the text is and what it means structurally.
The key models:
| Model | Best For |
|---|---|
| | General key-value extraction, tables |
| | Invoices with standard fields |
| | Page structure, tables, paragraphs, reading order |
| Custom model | Your specific document type with training data |
For SchemaForge, we primarily use combined with an LLM post-processing step — because the real value is in the semantic understanding, not just the OCR.
The Two-Stage Pipeline
Stage 1: Layout Analysis
Setting is a key optimisation. The model returns document content as structured Markdown, which preserves tables, headers, and document hierarchy far better than plain text — and it's what the LLM stage works best with.
Stage 2: Schema Mapping with GPT-4o
The user defines a JSON schema for what they want to extract:
GPT-4o maps the extracted layout content to this schema:
The Hard Problems
Rotated or Skewed Scans
Document Intelligence handles moderate skew (< 15°) automatically. For severely rotated documents, we pre-process with OpenCV:
Validation and Confidence Scoring
Every extraction gets a confidence score based on field coverage, numerical consistency, and format validation:
In production, SchemaForge achieves 94% extraction accuracy on invoice documents and 89% on technical specification sheets — compared to 60–70% for pure OCR approaches without the LLM mapping stage.