From OCR to Intelligence: Building a Contract Analysis Platform That Actually Works
A deep dive into building document intelligence systems that understand context, relationships, and the difference between extracting text and extracting meaning.
The Journey
From Basic OCR to True Document Intelligence
I'm knee-deep in OCR right now, and let me tell you, there's a world of difference between extracting text and extracting intelligence. After three weeks of building what started as a simple PDF-to-search pipeline, I've ended up with something far more sophisticated: a contract intelligence platform that can answer questions like "What are the current payment terms for Client X?" while respecting the full hierarchy of master agreements, addendums, and amendments.
The journey from basic OCR to true document intelligence revealed fundamental insights about LLM-based extraction that I wish someone had told me at the start. Most importantly: Schema = Instructions. The LLM only extracts what you explicitly ask for, nothing more. This single realization drove everything that followed.
The Problem: Legacy Contracts Are a Nightmare
Picture this: thousands of pages of legal contracts that are scanned, faint, multi-generation fax copies with inconsistent layouts. Account managers need to answer high-stakes questions about feature deprecations, module end-of-life, and customer commitments. A missed clause can mean financial penalties or compliance violations.
The current process? Manual PDF searching, institutional memory, and crossed fingers.
I needed a system that could:
Extract text from degraded scans with high recall
Structure clauses with type classification and risk levels
Answer questions with evidence-backed citations
Track contract relationships (which addendum supersedes which clause)
Maintain full auditability for compliance
Simple OCR wasn't going to cut it. Neither was throwing everything into a vector database and hoping semantic search would figure it out.
The Multi-Model OCR Landscape: What Actually Works
THE PROBLEM: Not all OCR engines are created equal. Degraded scans, complex layouts, and the need for structured extraction each demand different capabilities. I needed to find which tools actually delivered on their promises for contract intelligence.
THE OPTIONS: I tested five different approaches, each with distinct strengths for different parts of the document processing pipeline.
LandingAI: The Unified Pipeline Champion
What makes it different: Single API for both parse and extract with native JSON schema support.
Why I chose it: LandingAI became my primary platform because it's purpose-built for document extraction workflows. You can parse a PDF to markdown, then run structured extraction with custom schemas in one unified pipeline. The markdown output preserves document structure naturally, and you get grounding information (bounding boxes) for citations.
Performance notes: They handle degraded scans remarkably well. I threw everything from faint faxes to rotated pages at it, and it consistently outperformed general-purpose solutions.
Trade-offs: Per-page credit pricing scales with volume, but the unified workflow saves significant development time compared to stitching together multiple services.
Google Gemini: The Multimodal Powerhouse
What makes it different: Native image processing with strong layout understanding.
Why I considered it: Gemini excels at understanding document structure. It can process images as first-class inputs and has excellent printed text accuracy. For general OCR tasks, it's impressive.
Performance notes: Good for general OCR and document Q&A where you don't need structured outputs.
Trade-offs: No structured extraction API. You have to prompt for structure, which is non-deterministic and expensive. I also had to implement repetition detection for looping output, and temperature turned out to matter more for extraction accuracy than I expected.
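A minimal sketch of what repetition detection can look like, assuming repeated word n-grams are a good enough signal that the model has started looping. The function name and both thresholds are illustrative assumptions, not the platform's actual code.

```python
from collections import Counter


def has_runaway_repetition(text: str, ngram_size: int = 8, max_repeats: int = 3) -> bool:
    """Return True if any word n-gram occurs more than max_repeats times.

    ngram_size and max_repeats are hypothetical defaults; tune them on real output.
    """
    words = text.split()
    if len(words) < ngram_size:
        return False
    # Slide a window of ngram_size words across the text and count each window.
    ngrams = (
        tuple(words[i:i + ngram_size])
        for i in range(len(words) - ngram_size + 1)
    )
    counts = Counter(ngrams)
    return max(counts.values()) > max_repeats
```

A check like this can run on every OCR response, so a looping page is retried or routed elsewhere instead of being stored.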
AWS Textract: The Enterprise Specialist
What makes it different: Best-in-class for forms and tables.
Why I use it: Textract is the go-to for structured layouts. If you have pricing sheets, order forms, or expense documents, this is your tool. The table extraction is native and cell-level accurate.
Performance notes: Excellent for anything with clear tabular structure—pricing sheets, schedules, order forms.
Trade-offs: Weaker on dense paragraphs and legal prose. Higher cost at scale compared to other options.
Marker: The Open Source Layout Expert
What makes it different: PDF-to-markdown conversion optimized for reading order.
Why it surprised me: Marker achieved a 96.69% heuristic score on legal documents and excels at reading order detection—crucial for multi-column layouts. It works entirely locally (no API costs) and is fast (~0.18s/page on GPU).
Performance notes: Best for layout reconstruction when you need to preserve document structure for downstream processing.
Trade-offs: Not a top-tier OCR engine for degraded scans. It relies on PDF text layers, so scanned documents need preprocessing with actual OCR first.
Tesseract: The Reliable Baseline
What makes it different: It's the proven, lightweight option that's been around forever.
Why I use it: Tesseract provides fast baseline extraction and good character count benchmarking. It's what I use for sanity checks and coverage comparison during development.
Performance notes: Fast and lightweight for development testing and baseline metrics.
Trade-offs: Weak on complex layouts and poor confidence scoring. Not suitable for production contract extraction but valuable for testing.
THE TAKEAWAY: Multi-model OCR isn't about finding one perfect tool—it's about knowing which tool solves which problem and building intelligent routing between them.
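To make "intelligent routing" concrete, here is a rough sketch that picks an engine per page from a few page-level signals. The PageProfile fields, the 0.5 table threshold, and the rule order are all assumptions for illustration; they only mirror the strengths described above (Textract for tables, LandingAI for degraded scans, Marker for clean PDFs that already have a text layer).

```python
from dataclasses import dataclass


@dataclass
class PageProfile:
    """Hypothetical per-page signals computed before OCR."""
    has_text_layer: bool      # the PDF already contains selectable text
    table_area_ratio: float   # fraction of the page covered by tables (0..1)
    is_degraded_scan: bool    # faint, skewed, or multi-generation fax


def choose_engine(page: PageProfile) -> str:
    """Pick an OCR engine name for one page (illustrative rules only)."""
    if page.table_area_ratio >= 0.5:
        return "textract"   # forms and tables
    if page.is_degraded_scan or not page.has_text_layer:
        return "landingai"  # degraded scans need real OCR
    return "marker"         # clean digital PDF: local layout reconstruction
```

Tesseract is left out of the router on purpose, since it serves as a baseline for sanity checks rather than as a production path.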
The Schema Evolution: From Static to Dynamic
The real breakthrough came when I realized that static schemas fundamentally don't work for diverse document types. Here's the journey that led to that insight.
The Static Schema Trap
My initial approach was textbook: create one comprehensive schema that extracts everything from any contract type. I built a RAG-optimized schema with clauses, risk levels, keywords, and source pages. Testing it on a service provider agreement worked beautifully:
11 clauses extracted
179 keywords identified
5 high-risk clauses flagged
Perfect type classification
I was feeling pretty good about this until I tested it on a different document—a Progress Software Addendum containing pricing tables, 14 partner airlines with passenger numbers, and license schedules.
Static schema results: 13 clauses, 0 tables, 0 partner airlines, 0 license schedules.
The LLM did exactly what I asked for: clauses. It completely ignored the structured data because my schema didn't mention tables.
The "Schema = Instructions" Revelation
That's when it hit me: Schema is not just configuration—it's instruction. The LLM extracts exactly what you ask for, nothing more. If you don't ask for pricing tables, you won't get pricing tables, even if they're the most important part of the document.
This led to the obvious-in-hindsight question: "How do we know what to extract from a document we haven't seen yet?"
Dynamic Schema Generation: The Solution
The answer was a three-phase pipeline:
Phase 1: Parse. Convert the PDF to markdown using LandingAI Parse.
Phase 2: Discover. An LLM analyzes the structure and identifies tables, lists, and entities.
Phase 3: Extract. Generate a schema from the discovery output, then run extraction.
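The schema-generation step can be sketched as a base schema plus per-document extensions. The shape of the discovery dictionary below is a made-up assumption (the real discovery output is not shown here), and no LandingAI calls are made; the point is only that every discovered table becomes an explicit field the LLM is asked to fill.

```python
import copy

# Core fields every contract gets, following the clause-oriented static schema.
BASE_SCHEMA = {
    "type": "object",
    "properties": {
        "clauses": {
            "type": "array",
            "description": "Contract clauses with type and risk level",
            "items": {
                "type": "object",
                "properties": {
                    "text": {"type": "string"},
                    "clause_type": {"type": "string"},
                    "risk_level": {"type": "string", "enum": ["low", "medium", "high"]},
                    "source_page": {"type": "integer"},
                },
            },
        }
    },
}


def build_schema(discovery: dict) -> dict:
    """Extend the base schema with one array field per discovered table.

    `discovery` uses a hypothetical format:
    {"tables": [{"name": ..., "description": ..., "columns": [...]}]}
    """
    schema = copy.deepcopy(BASE_SCHEMA)  # never mutate the shared base
    for table in discovery.get("tables", []):
        schema["properties"][table["name"]] = {
            "type": "array",
            "description": table["description"],
            "items": {
                "type": "object",
                "properties": {col: {"type": "string"} for col in table["columns"]},
            },
        }
    return schema


discovery = {
    "tables": [
        {
            "name": "partner_airlines",
            "description": "Partner airlines with passenger numbers",
            "columns": ["airline", "passengers"],
        }
    ]
}
schema = build_schema(discovery)
```

With the static schema, partner_airlines is never requested, so it is never returned. With the generated schema it is an explicit instruction.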
I then tested the dynamic approach on the same Progress Software Addendum. The extraction phase actually found more than discovery did, because the generated schema prompted the LLM to find additional instances of the discovered patterns.
Cost trade-off: 9 credits vs 6 for static (50% more), but dramatically better results. For contract intelligence, accuracy matters more than cost savings.
Why RAG Alone Isn't Enough: Enter GraphRAG
Traditional RAG hits a fundamental wall with contract hierarchies. Real contracts don't exist in isolation—you have master agreements, addendums, amendments, and supersessions. When someone asks "What are the payment terms for Client X?", they need the current terms, not the original ones that were changed by Addendum B six months ago.
Traditional RAG treats each document independently. It can't answer "What's current?" because it doesn't understand relationships.
The Contract Hierarchy Problem
Consider this structure:
1. Master Agreement (Net 30)
2. Addendum A (Net 45)
3. Addendum B (Net 60)
4. Amendment C (2-year extension)
When querying payment terms, you need:
The master agreement (baseline)
All addendums (modifications)
The most recent term (Net 60, not Net 30)
Supersession awareness (which clause replaces which)
GraphRAG Solution
I implemented a knowledge graph that explicitly tracks relationships. The graph stores clients, contracts, clauses, and terms with explicit relationships. Clauses are categorized (payment_terms, liability, termination) and embedded for semantic search. The system can traverse the graph to find the most recent effective term for any category.
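As a rough illustration of the traversal, the sketch below models supersession as an explicit edge from a clause to the clause it replaces, using the Net 30 / Net 45 / Net 60 chain from the example hierarchy. The Clause fields, the clause IDs, and the "term" category are assumptions made for the example, not the platform's data model, and the chain is assumed to be acyclic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Clause:
    clause_id: str
    document: str
    category: str
    text: str
    supersedes: Optional[str] = None  # clause_id of the clause this one replaces


def current_clauses(clauses: list[Clause], category: str) -> list[Clause]:
    """Clauses in a category that no other clause supersedes."""
    superseded = {c.supersedes for c in clauses if c.supersedes}
    return [
        c for c in clauses
        if c.category == category and c.clause_id not in superseded
    ]


def history(clauses: list[Clause], clause_id: str) -> list[Clause]:
    """Walk SUPERSEDES edges backwards; returns the chain oldest first."""
    by_id = {c.clause_id: c for c in clauses}
    chain = []
    current = by_id.get(clause_id)
    while current is not None:
        chain.append(current)
        current = by_id.get(current.supersedes) if current.supersedes else None
    return list(reversed(chain))


clauses = [
    Clause("ma-pay", "Master Agreement", "payment_terms", "Net 30"),
    Clause("a-pay", "Addendum A", "payment_terms", "Net 45", supersedes="ma-pay"),
    Clause("b-pay", "Addendum B", "payment_terms", "Net 60", supersedes="a-pay"),
    Clause("c-term", "Amendment C", "term", "2-year extension"),
]
```

Asking for the current payment terms returns only the Addendum B clause, while history gives the full Net 30, Net 45, Net 60 sequence for temporal questions.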
Key queries this enables:
"What are the current payment terms?" (respects supersession)
"How have liability caps changed over time?" (temporal analysis)
"Which contracts reference Feature X?" (cross-contract search)
Auto-Import and Linking
The graph auto-populates from extraction results using pattern detection:
Regex matching for explicit section references
Semantic similarity for implicit supersessions (75% threshold)
Document relationship detection (AMENDS/SUPERSEDES)
This caught supersessions that manual review would miss, increasing detected relationships from 1 to 18 in our test corpus.
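The first two detection signals can be sketched as follows. The regex is a simplified assumption about how section references are written, and reading the 75% threshold as a cosine similarity of 0.75 between clause embeddings is my interpretation; the vectors in the example are toy values rather than real embeddings.

```python
import math
import re

# Hypothetical pattern for explicit references such as "Section 4.2" or "clause 7".
SECTION_REF = re.compile(r"\b(?:Section|Clause|Article)\s+(\d+(?:\.\d+)*)", re.IGNORECASE)

SIMILARITY_THRESHOLD = 0.75  # assumed to mean cosine similarity


def explicit_references(text: str) -> list[str]:
    """Return the section numbers a clause refers to explicitly."""
    return SECTION_REF.findall(text)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_implicit_supersession(new_vec: list[float], old_vec: list[float]) -> bool:
    """Flag a candidate supersession when two clause embeddings are close enough."""
    return cosine(new_vec, old_vec) >= SIMILARITY_THRESHOLD
```

In practice a similarity match is only a candidate link: the newer clause must also come from a later document in the same contract family before it is recorded as SUPERSEDES.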
Modular Architecture
The GraphRAG service evolved from a 4,215-line monolith into a modular architecture with 9 focused modules:
graph_rag/
├── __init__.py              # Public API composition
├── models.py                # Data models
├── utils.py                 # Helpers
├── core.py                  # CRUD operations
├── persistence.py           # Save/load
├── query_engine.py          # Queries & search
├── import_service.py        # Import & parsing
├── linkage_detector.py      # Auto-linking algorithms
├── cross_contract_query.py  # Portfolio-wide queries
└── db_repository.py         # PostgreSQL storage
This refactoring reduced the largest module by 52% and improved developer onboarding from 3 days to 1 day.
The Storage Consolidation Journey
What started as a local development setup needed to become a shared deployment. The architecture evolution tells an important story about scaling document intelligence systems.
The Multi-Storage Problem
Initially, I had data scattered across six different storage mechanisms:
PostgreSQL for documents and metadata
JSON files for GraphRAG (17MB single file, last-writer-wins)
File-based RAG indices (125 JSON files, ~2.5GB loaded into memory)
MinIO for PDFs and images
Redis for sessions
Various directories for job status and query history
This works fine for single-instance development but breaks completely in multi-instance deployments.
PostgreSQL-First Strategy
The solution was consolidating everything possible into PostgreSQL with pgvector. I completed all four phases of migration:
Phase 1: GraphRAG → PostgreSQL. Moved 50 clients, 111 contracts, and 950 clauses from JSON to PostgreSQL tables with proper relationships and foreign keys.
Phase 2: RAG → pgvector. Migrated 82,180 chunks with embeddings to pgvector, enabling native similarity search. I implemented dual embedding columns (1536 dimensions for OpenAI's text-embedding-3-large and 768 dimensions for Google's embeddings) to support multiple providers.
Phase 3: Redis query caching. Added a cache layer for RAG query results with a 1-hour TTL. Cache lookups are sub-millisecond, and entries are invalidated when a document is re-indexed. The system includes management endpoints for cache statistics and selective invalidation.
Phase 4: Hybrid search. Combined pgvector semantic search with PostgreSQL full-text search using Reciprocal Rank Fusion (RRF) scoring:
\text{score}(d) = \sum_{i \in \{\text{semantic},\ \text{keyword}\}} \frac{w_i}{k + \text{rank}_i(d)}
Default weights: 70% semantic, 30% keyword. This catches both semantic matches and exact terminology that semantic search might miss. Each result returns both semantic_score and keyword_score for transparency.
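Here is a small sketch of weighted RRF over two ranked lists of chunk IDs, using the 70/30 default weights. The constant k = 60 is the value commonly used for RRF and is an assumption here, as is treating semantic_score and keyword_score as each list's contribution to the fused score.

```python
def hybrid_rrf(
    semantic_ids: list[str],
    keyword_ids: list[str],
    semantic_weight: float = 0.7,
    keyword_weight: float = 0.3,
    k: int = 60,
) -> list[dict]:
    """Fuse two rankings (best first) with weighted Reciprocal Rank Fusion."""
    results: dict[str, dict] = {}
    for weight, score_name, ranking in (
        (semantic_weight, "semantic_score", semantic_ids),
        (keyword_weight, "keyword_score", keyword_ids),
    ):
        for rank, doc_id in enumerate(ranking, start=1):
            entry = results.setdefault(
                doc_id, {"id": doc_id, "semantic_score": 0.0, "keyword_score": 0.0}
            )
            # Contribution of this list: weight / (k + rank), with rank starting at 1.
            entry[score_name] = weight / (k + rank)
    for entry in results.values():
        entry["score"] = entry["semantic_score"] + entry["keyword_score"]
    return sorted(results.values(), key=lambda e: e["score"], reverse=True)


ranked = hybrid_rrf(["a", "b", "c"], ["c", "a", "d"])
```

Because RRF uses ranks rather than raw scores, cosine distances and full-text rank values never need to be normalized onto a common scale.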
The pgvector Trade-off
One important limitation: OpenAI's text-embedding-3-large produces 3072 dimensions by default, but pgvector's HNSW index maxes out at 2,000 dimensions for the vector type. I used IVFFlat indexing for the larger dimensions, which is less optimal but functional. The lesson: always check dimension limits before choosing your vector index strategy.
Performance Insights: When RAG Underperforms
Testing revealed surprising cases where traditional RAG struggled compared to direct LLM queries with full context.
The Cross-Document Challenge
Query: "What are the termination clauses across all contracts?"
RAG result: "I cannot find this information in the retrieved sections. The context provided only includes repeated fragments..."
Direct query result: Correctly identifies and explains all termination provisions across documents.
Root Cause:
Chunking fragmentation split long clauses into meaningless fragments
Same-document dominance (all 10 retrieved chunks from one document)
Semantic mismatch between query and fragmented text
Low similarity scores (average 0.54, below useful threshold)
Chunking Quality is Critical
Poor chunking creates silent failures. When a clause gets split into fragments like "tomers.", "ther customers.", "mers.", these fragments don't carry semantic meaning for search, but they show up in results and waste context window space.
Solutions implemented:
Clause-aware boundaries: never split mid-clause
Header preservation: repeat the clause header in continuation chunks
Minimum chunk size: enforced, so tiny fragments never reach the index
Overlap strategies: for context preservation
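The first three of these rules can be sketched in a few lines. This is not the UnifiedChunkingService implementation: the heading regex assumes clauses start with a number, a period, and a capitalized title (for example "4.2. Late Fees"), the size limits are arbitrary, and overlap is left out to keep the example short.

```python
import re

# Assumed heading style: "1. Payment Terms", "4.2. Late Fees" at the start of a line.
CLAUSE_HEADING = re.compile(r"^\d+(?:\.\d+)*\.\s+[A-Z]", re.MULTILINE)


def split_clauses(text: str) -> list[str]:
    """Split at clause headings so every piece is one whole clause."""
    starts = [m.start() for m in CLAUSE_HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    bounds = zip(starts, starts[1:] + [len(text)])
    return [text[a:b].strip() for a, b in bounds if text[a:b].strip()]


def chunk_contract(text: str, max_chars: int = 1500, min_chars: int = 200) -> list[str]:
    chunks: list[str] = []
    for clause in split_clauses(text):
        header, _, body = clause.partition("\n")
        if len(clause) <= max_chars:
            pieces = [clause]
        else:
            # Oversized clause: break the body at line boundaries and repeat
            # the header so continuation chunks keep their context.
            pieces, current = [], header
            for para in body.split("\n"):
                if len(current) + len(para) + 1 > max_chars and current != header:
                    pieces.append(current)
                    current = f"{header} (cont.)"
                current = f"{current}\n{para}"
            pieces.append(current)
        for piece in pieces:
            # Minimum size: merge small pieces into the previous chunk
            # instead of indexing fragments.
            if chunks and len(piece) < min_chars:
                chunks[-1] = f"{chunks[-1]}\n\n{piece}"
            else:
                chunks.append(piece)
    return chunks
```

A short clause ends up attached to its predecessor rather than indexed alone, which is the opposite of the "tomers." failure mode: chunks may run slightly long, but they never fall below a meaningful size.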
When Each Approach Works Best
RAG excels when:
The query is a specific fact-finding question
The scope is a single document
The query and the content align well semantically
The sections are well-chunked and coherent
Direct queries excel when:
The answer requires cross-referencing multiple sections
Comparative analysis is needed
Terminology mismatches exist
The documents fit in the context window
The hybrid strategy: Start with RAG for efficiency, fall back to direct queries for complex cases.
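One way to express that fallback is a small router keyed on retrieval quality. The 0.6 cutoff is a placeholder assumption (the failing query above averaged 0.54), and retrieve, rag_answer, and direct_answer are hypothetical callables standing in for the real services.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hit:
    document_id: str
    text: str
    similarity: float


def answer(
    question: str,
    retrieve: Callable[[str], list[Hit]],
    rag_answer: Callable[[str, list[Hit]], str],
    direct_answer: Callable[[str], str],
    min_avg_similarity: float = 0.6,  # assumed cutoff, tune on real queries
) -> tuple[str, str]:
    """Try RAG first; fall back to a full-context query when retrieval looks weak.

    Returns (route, answer_text) so the chosen route can be logged.
    """
    hits = retrieve(question)
    if hits:
        avg = sum(h.similarity for h in hits) / len(hits)
        if avg >= min_avg_similarity:
            return "rag", rag_answer(question, hits)
    return "direct", direct_answer(question)
```

Returning the route alongside the answer makes it easy to audit how often the cheaper path was good enough.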
Technical Architecture: What Actually Scales
The final architecture reflects lessons learned about what works in production document intelligence systems:
Backend: FastAPI + SQLAlchemy 2.0
~40 service files organized by concern, 17 API router groups, 14 database models. The modular structure emerged from refactoring a 4,215-line monolithic service into focused components.
Key services:
UnifiedChunkingService: Handles all chunking strategies (smart_rag, clause_boundary, simple)
RAGDBService: PostgreSQL-backed vector operations with caching
GraphRAGService: Modular composition using multiple inheritance
EmbeddingService: Provider abstraction with fallback chains
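For readers unfamiliar with the pattern, composition through multiple inheritance can look roughly like this. Only the GraphRAGService name comes from the platform; the mixin names and methods are invented for the illustration.

```python
class CoreMixin:
    """CRUD operations (hypothetical stand-in for core.py)."""

    def __init__(self) -> None:
        self.clauses: dict[str, dict] = {}

    def add_clause(self, clause_id: str, category: str, text: str) -> None:
        self.clauses[clause_id] = {"category": category, "text": text}


class QueryMixin:
    """Read-side queries (hypothetical stand-in for query_engine.py)."""

    def clauses_by_category(self, category: str) -> list[str]:
        # Relies on self.clauses, which CoreMixin sets up.
        return [c["text"] for c in self.clauses.values() if c["category"] == category]


class GraphRAGService(CoreMixin, QueryMixin):
    """Public API composed from focused mixins; adds no logic of its own."""


service = GraphRAGService()
service.add_clause("b-pay", "payment_terms", "Net 60")
```

Each module stays small and testable on its own, while callers still see a single service object.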
Frontend: React 19 + TypeScript
Five main interfaces reflecting the workflow:
Document Library: Upload with content-hash deduplication
Dynamic Lab: Full extraction pipeline with real-time progress
RAG Lab: Semantic search with similarity scores and citations
GraphRAG: Knowledge graph visualization and queries
Cross-Contract Query: Portfolio-wide search with supersession awareness
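Content-hash deduplication in the Document Library can be sketched as below. SHA-256 and the in-memory dictionary are assumptions for the example; a real deployment would keep the hash in a uniquely indexed database column.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hex digest of the file bytes; identical files give identical hashes."""
    return hashlib.sha256(data).hexdigest()


class DocumentLibrary:
    """Toy in-memory library that stores each distinct file once."""

    def __init__(self) -> None:
        self._by_hash: dict[str, str] = {}

    def upload(self, filename: str, data: bytes) -> tuple[str, bool]:
        """Return (stored_filename, is_new). Duplicates map to the first upload."""
        digest = content_hash(data)
        if digest in self._by_hash:
            return self._by_hash[digest], False
        self._by_hash[digest] = filename
        return filename, True
```

Hashing the bytes instead of comparing filenames means a re-upload of the same scan under a new name does not trigger another paid OCR run.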
Database Design
PostgreSQL handles everything:
Core entities: Documents, pages, extractions, users
GraphRAG tables: Clients, contracts, clauses, relationships
RAG chunks: 82,180 chunks with pgvector embeddings (dual dimensions)
Audit trails: Full logging for compliance
The unified storage approach eliminates sync issues and enables complex queries across data types.
From Development to Deployment
Building a working prototype is one thing. Getting it into the hands of users who actually need it—account managers answering client questions, sales teams preparing for renewals—is another challenge entirely.
Cloud Architecture
The platform runs on AWS with a production-grade setup:
Compute: ECS Fargate for both backend (FastAPI) and frontend (React/nginx)
Database: RDS PostgreSQL with pgvector extension for unified storage
Caching: ElastiCache Redis for query caching and sessions
Storage: S3 for document files (PDFs, images, OCR results)
Load Balancing: Application Load Balancer with path-based routing
The ALB routes /api/* to the backend service and everything else to the frontend—simple but effective for this architecture.
Security Model
For an internal tool handling sensitive contract data, security was non-negotiable:
Browser access: Cognito-based authentication integrated with the load balancer
Programmatic access: API key authentication for system-to-system integration
Network isolation: Database and cache in private subnets with no public access
Encryption: TLS everywhere, secrets managed outside the codebase
Deployment Challenges
A few gotchas that cost me time:
Platform mismatch: Docker images built on an Apple Silicon Mac (ARM64) won't run on Fargate tasks that use the default x86_64 (AMD64) architecture. Always specify --platform linux/amd64 when building for that target.
pgvector installation: RDS PostgreSQL doesn't auto-enable extensions. You have to explicitly run CREATE EXTENSION vector from within the VPC—which means spinning up a temporary container just to run that command if your database isn't publicly accessible.
Nginx routing: The frontend container tried to proxy API requests to a hostname that doesn't resolve in ECS networking. Solution: let the load balancer handle all routing, don't duplicate it in nginx.
Meeting Users Where They Are
Here's the thing about enterprise tools: nobody wants another application to check. Account managers live in email and chat. Sales lives in their CRM. Asking people to context-switch to a new web UI is asking for low adoption.
So I integrated the platform with BrainTrust, our internal agentic AI system. BrainTrust puts an AI agent directly in Google Chat—where our teams already spend their day. The agent connects to the contract intelligence API via MCP (Model Context Protocol), which means users can ask natural language questions like:
"What are the termination terms for Acme Corp?"
"Find all contracts expiring in Q1"
"Has Client X's liability cap changed since the original agreement?"
...and get answers without ever leaving their chat window. The agent handles authentication, queries the right endpoints, and formats responses for chat consumption.
This "meet users where they are" approach has been the difference between a tool that sits unused and one that actually gets adopted. The web UI is still there for power users who want to explore the knowledge graph or run complex cross-contract queries, but for day-to-day questions, chat is king.
Lessons Learned: What I Wish I Knew Starting Out
About Schema Design
Schema = Instructions: LLM extracts exactly what you ask for, nothing more
Dynamic > Static: Per-document schema generation captures unique content
Discovery enables accuracy: Analyze structure before extracting
Base + Extensions: Start with core schema, add document-specific fields
About RAG for Contracts
Chunking quality determines success: Poor chunks = silent failures
Clause boundaries are sacred: Never split clauses across chunks
High top-k required: Legal documents need recall-first retrieval (15-30)
Hybrid search improves recall: Semantic + keyword catches more
About Contract Relationships
Traditional RAG can't handle hierarchies: Needs explicit relationship tracking
GraphRAG solves supersession: Makes "current" vs "original" explicit
Auto-linking scales: Manual relationship entry doesn't scale
About Development Process
Save everything: Every extraction, schema, query—invaluable for debugging
Test with real documents: Schema problems only show with real data
Expect LLM variability: Same document can produce different results
Prove it works: Evidence of success is as important as success
The Platform Today
The current system handles the complete contract intelligence workflow:
Data Processing: Multi-model OCR with dynamic schema generation, achieving comprehensive extraction across diverse document types.
Storage Architecture: PostgreSQL-first with pgvector for unified data management, eliminating synchronization issues while enabling hybrid search.
Query Capabilities: Both traditional RAG for fact-finding and GraphRAG for relationship-aware queries, with direct LLM fallback for complex cases.
Deployment: Production AWS infrastructure with proper security, plus agent integration for chat-based access.
Performance Metrics
82,180 chunks indexed, across 125 documents
<1s query response, with Redis caching
950 clauses tracked, with relationship tracking in GraphRAG
50 clients, with 111 contracts fully modeled
This isn't just OCR—it's document intelligence that understands context, relationships, and the difference between "what's written" and "what's current."
What's Next: The Evolution Continues
The foundation is solid, but document intelligence is still early innings. The next frontier involves:
Temporal analysis: Understanding how terms change over time and surfacing trends
Cross-client pattern recognition: "Which clients have similar liability structures?"
Better auto-linking algorithms: Detecting implicit relationships that aren't explicitly stated
Proactive alerting: Notifying teams when contracts approach expiration or when terms conflict
The key insight driving everything forward: documents don't exist in isolation. They're part of networks of relationships, temporal sequences, and business contexts. The systems that crack this—that can reason about document hierarchies as easily as they extract text—will transform how organizations understand their own commitments.
Ready to build your own document intelligence system?
Start with the schema insight: what you ask for is exactly what you'll get. Design accordingly.