Building a Production RAG Pipeline with Azure AI Search and GPT-4

What RAG Actually Solves

Large Language Models hallucinate. They also have a knowledge cutoff. And they can't know about your internal documentation, your product catalogue, or your VM specification database.

Retrieval-Augmented Generation solves all three: instead of asking the model to recall facts from training data, you retrieve the relevant facts at query time and put them directly in the prompt context.

This post documents the end-to-end RAG pipeline I built for VirtuAI — a platform that answers questions about Azure VM configurations, pricing, and workload fit using a proprietary specification database.

The Architecture

Step 1: Indexing VM Specifications

The data source is a structured JSON database of Azure VM SKUs enriched with workload benchmarks, real-world performance characteristics, and pricing data.

Chunking Strategy

Chunking is the most important decision in a RAG pipeline. Chunk too large and retrieval precision suffers. Chunk too small and you lose context.

For structured VM spec data, we used a document-per-SKU strategy rather than fixed-size text chunking. Each SKU becomes one index document:

Generating Embeddings

Step 2: Hybrid Retrieval (Vector + Keyword)

Pure vector search misses exact matches (e.g., "D4s_v3"). Pure keyword search misses semantic matches ("4 CPU 16 GB general purpose VM"). Hybrid search wins:

Step 3: Prompt Construction and Generation

Evaluation: What Makes a RAG Pipeline "Good"?

We measure four metrics in production:

| Metric | Description | Target | |---|---|---| | Retrieval Recall@5 | Correct doc in top 5 results | > 92% | | Answer Faithfulness | Answer grounded in context | > 95% | | Answer Relevance | Answers the actual question | > 90% | | Latency P95 | End-to-end response time | < 3s |

UseLLM-as-judge evaluation: have GPT-4 score faithfulness and relevance on a sample of queries. It's cheap, fast, and correlates well with human evaluation at scale.

The hybrid retrieval approach improved Recall@5 from 78% (vector-only) to 94% — a significant jump for minimal implementation complexity.