# Multi-Modal Retrieval-Augmented Generation (MMRAG)

## Enterprise Architecture, Ingestion Pipeline, Vector Database Design, and Search Framework

**Version:** 1.0
**Author:** Vishwanathan

---

# 1. Introduction

MMRAG (Multi-Modal Retrieval-Augmented Generation) extends traditional Retrieval-Augmented Generation by enabling retrieval across multiple data modalities:

* Text
* Images
* Scanned Documents
* OCR Content
* Tables
* Future Audio Transcriptions

Unlike conventional RAG systems that store only text embeddings, MMRAG creates a dedicated vector space for each modality and stores them as named vectors within Qdrant.

The platform supports:

* Semantic Search
* Reverse Image Search
* OCR Search
* Table Search
* Multi-Document Search
* Page-Level Citations
* Source Traceability

---

# 2. High-Level Architecture

```text
                 ┌───────────────────┐
                 │   PDF / Images    │
                 └─────────┬─────────┘
                           │
                           ▼
                 ┌───────────────────┐
                 │ Extraction Layer  │
                 │                   │
                 │ PyMuPDF           │
                 │ OCR               │
                 │ Table Extractor   │
                 └─────────┬─────────┘
                           │
                           ▼
                 ┌───────────────────┐
                 │  Chunking Layer   │
                 └─────────┬─────────┘
                           │
         ┌─────────────────┴─────────────────┐
         │                                   │
         ▼                                   ▼
┌─────────────────┐                 ┌─────────────────┐
│ Text Embedding  │                 │ Image Embedding │
│ BAAI BGE        │                 │ CLIP ViT-B/32   │
│ 768 Dimensions  │                 │ 512 Dimensions  │
└────────┬────────┘                 └────────┬────────┘
         │                                   │
         └─────────────────┬─────────────────┘
                           ▼
                  ┌──────────────────┐
                  │ Qdrant Database  │
                  └────────┬─────────┘
                           │
                           ▼
                  ┌──────────────────┐
                  │  Search Engine   │
                  └────────┬─────────┘
                           │
                           ▼
                  ┌──────────────────┐
                  │  User Interface  │
                  └──────────────────┘
```

---

# 3. Why Qdrant?

Qdrant is selected because it provides:

* Multi-vector support
* High-performance Approximate Nearest Neighbor (ANN) search
* Metadata filtering
* Local persistent storage
* Docker deployment
* REST API
* Python SDK
* Hybrid retrieval architecture

---

# 4. Qdrant Collection Design

Collection Name:

```text
mmrag
```

Schema:

```python
from qdrant_client.models import Distance, VectorParams

# One named vector per modality; sizes match the embedding models used.
vectors_config = {
    "text": VectorParams(
        size=768,
        distance=Distance.COSINE
    ),
    "image": VectorParams(
        size=512,
        distance=Distance.COSINE
    )
}
```

---

# 5. Why Cosine Similarity?

Cosine similarity compares vector direction rather than magnitude.

Formula:

```text
cos(θ) = A·B / (|A| × |B|)
```

Advantages:

* Length independent
* Better semantic retrieval
* Standard for embedding models
* Less sensitive to vector magnitude than Euclidean distance

---

# 6. Text Embedding Models

## Option 1: BAAI/bge-base-en-v1.5

Dimensions:

```text
768
```

Advantages:

* Superior semantic understanding
* Better contextual retrieval
* Strong long-document performance
* High benchmark retrieval accuracy

Example:

Query:

```text
Principles of mental freedom
```

Retrieved Content:

```text
Freedom is achieved through mastery of one's thoughts...
```

Recommended for:

* Enterprise RAG
* Research papers
* Books
* Technical documentation
* Legal content

---

## Option 2: all-MiniLM-L6-v2

Dimensions:

```text
384
```

Advantages:

* Lightweight
* Fast inference
* Lower RAM usage

Disadvantages:

* Lower semantic richness
* Reduced retrieval quality

### Comparison

| Feature        | MiniLM | BGE       |
| -------------- | ------ | --------- |
| Dimensions     | 384    | 768       |
| Accuracy       | Medium | High      |
| Speed          | Faster | Slower    |
| Memory         | Low    | Medium    |
| Enterprise RAG | Good   | Excellent |

### Recommendation

Use:

```text
BAAI/bge-base-en-v1.5
```

---

# 7. Image Embedding Models

## CLIP Only (Current Implementation)

Model:

```text
ViT-B/32
```

Dimensions:

```text
512
```

Workflow:

```text
Image
↓
CLIP Encoder
↓
512-D Vector
↓
Qdrant
```

Advantages:

* Fast
* Robust
* Industry standard

---

## CLIP + Caption Architecture

Workflow:

```text
Image
↓
Caption Generator
↓
Caption Text
↓
BGE Embedding
↓
Qdrant
```

Example Image:

```text
Electrical Circuit Diagram
```

Generated Caption:

```text
Electrical circuit showing resistor network and voltage source.
```

Embedded using:

```text
BAAI/bge-base-en-v1.5
```

Advantages:

* Better semantic retrieval
* Human explainability
* Better cross-modal search

Disadvantages:

* Extra inference stage
* Higher latency
* Increased storage

### Comparison

| Feature         | CLIP Only | CLIP + Caption |
| --------------- | --------- | -------------- |
| Speed           | Faster    | Slower         |
| Storage         | Lower     | Higher         |
| Semantic Search | Medium    | Excellent      |
| Explainability  | Low       | High           |

### Recommended Enterprise Strategy

Store both:

```python
{
    "text": bge_embedding,
    "image": clip_embedding
}
```

---

# 8. Table Processing

Tables should NOT be stored as images.

Recommended workflow:

```text
PDF
↓
Table Detection
↓
Table Extraction
↓
Convert To Text
↓
BGE Embedding
```

Example Table

| Product | Revenue |
| ------- | ------- |
| A       | 100     |
| B       | 250     |

Converted To:

```text
Product A Revenue 100.
Product B Revenue 250.
```

Then embedded using:

```text
BAAI/bge-base-en-v1.5
```

Benefits:

* Searchable
* Explainable
* Better retrieval quality

---

# 9. PDF Processing Scenarios

## Scenario 1: Digital PDF

```python
page.get_text()
```

Workflow:

```text
PDF
↓
PyMuPDF
↓
Direct Text
```

---

## Scenario 2: Scanned PDF

```text
PDF
↓
Render Page
↓
RapidOCR
↓
Text
```

---

## Scenario 3: Image-Only PDF

```text
PDF
↓
Render Page
↓
CLIP Embedding
```

---

## Scenario 4: Mixed Content PDF

```text
PDF
↓
Text Extraction
↓
OCR
↓
Table Extraction
↓
CLIP Image Embedding
```

Most enterprise PDFs fall into this category.

---

# 10. Page Loading vs Chunk Loading

## Page-Level Processing

```text
PDF
↓
Page 1
Page 2
Page 3
...
```

Advantages:

* Preserves page references
* Easier citations

Stored Metadata:

```json
{
    "source_file": "physics.pdf",
    "page_no": 23
}
```

---

## Chunk-Level Processing

Configuration:

```python
CHUNK_SIZE = 800
CHUNK_OVERLAP = 100
```

Example:

```text
Chunk 1: 0 - 800
Chunk 2: 700 - 1500
Chunk 3: 1400 - 2200
```

Advantages:

* Better retrieval precision
* Preserves context
* Faster vector search

---

# 11. Ingestion Workflow

```text
PDF
↓
MD5 Hash
↓
Registry Check
↓
Extract Text
↓
OCR If Required
↓
Extract Tables
↓
Render Page Images
↓
Generate BGE Embeddings
↓
Generate CLIP Embeddings
↓
Qdrant Upsert
↓
Save Registry
```

---

# 12. Metadata Structure

```json
{
    "source_file": "physics.pdf",
    "page_no": 23,
    "chunk_id": 14,
    "chunk_length": 785,
    "ingested_at": "2026-06-24T11:00:00"
}
```

---

# 13. Search Modes

## Semantic Search

Input:

```text
Explain electrostatic force
```

Workflow:

```text
Query
↓
BGE Embedding
↓
Qdrant Text Search
```

---

## Reverse Image Search

Input:

```text
Circuit Diagram Image
```

Workflow:

```text
Image
↓
CLIP
↓
Qdrant Image Search
```

---

## OCR Search

Input:

```text
Handwritten Formula
```

Workflow:

```text
OCR
↓
BGE
↓
Qdrant Search
```

---

## Table Search

Input:

```text
Revenue of Product B
```

Workflow:

```text
Query
↓
BGE
↓
Table Text
↓
Qdrant Search
```

---

## Future Audio Search

Input:

```text
Voice Question
```

Workflow:

```text
Audio
↓
Whisper
↓
Text
↓
BGE
↓
Qdrant Search
```

---

# 14. Search Threshold Recommendations

## Text Search

| Score | Meaning |
| ----- | ------- |
| 0.90+ | Exact   |
| 0.80+ | Strong  |
| 0.70+ | Good    |
| 0.60+ | Broad   |

---

## Image Search

| Score | Meaning |
| ----- | ------- |
| 0.35+ | Strong  |
| 0.25+ | Good    |
| 0.15+ | Broad   |

---

# 15. Example Output Format

```text
🎯 Match #1
Score: 0.8942
Source: Physics.pdf
Page: 143

Extracted Context:
Electrostatic force between two charges...
```

---

# 16. Future Enhancements

## Multi-Vector Retrieval

```python
{
    "text": 768,
    "image": 512,
    "table": 768
}
```

---

## Reranking Layer

Recommended:

```text
BAAI/bge-reranker-large
```

Workflow:

```text
Retrieve Top 20
↓
Rerank
↓
Return Top 5
```

---

## Hybrid Retrieval

```text
Dense Search
+
Metadata Filters
+
Keyword Search
```

---

# 17. Final Enterprise Architecture

```text
PDF / DOCX / PPT / IMAGE / AUDIO
               │
               ▼
        Extraction Layer
               │
               ▼
      OCR + Table Parsing
               │
               ▼
      Chunking & Metadata
               │
     ┌─────────┴─────────┐
     ▼                   ▼
  BGE 768            CLIP 512
     ▼                   ▼
             QDRANT
 text vector        image vector
               ▼
         Hybrid Search
               ▼
         User Interface
```

---

# Conclusion

MMRAG provides a scalable enterprise-grade multi-modal retrieval framework capable of semantic search across text, images, OCR documents, tables, and future audio sources.

By combining BGE embeddings for textual understanding and CLIP embeddings for visual understanding within Qdrant's multi-vector architecture, the system delivers accurate, explainable, and source-traceable retrieval suitable for production RAG deployments.
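
The cosine formula in Section 5 can be checked with a few lines of plain Python. This is an illustrative sketch only: `cosine_similarity` is a hypothetical helper, and in the pipeline itself Qdrant computes the score when a vector is configured with `Distance.COSINE`.

```python
import math


def cosine_similarity(a, b):
    """Return cos(theta) = A.B / (|A| * |B|) for two equal-length vectors.

    Hypothetical helper for illustration; not part of the MMRAG codebase.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Scaling a vector changes its length but not its direction,
# so the similarity stays at (approximately) 1.0.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Perpendicular vectors share no direction and score 0.0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

The first call illustrates the "length independent" property: doubling every component leaves the score unchanged up to floating-point rounding.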
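
The "Convert To Text" step in Section 8 can be sketched as follows. The function name `table_to_text` and the list-of-rows input shape are assumptions made for illustration; the output format mirrors the example given in that section.

```python
def table_to_text(headers, rows):
    """Flatten a table into one sentence per row, pairing each header with its cell.

    Hypothetical helper; assumes the table extractor yields a header list
    and a list of rows with one cell per header.
    """
    sentences = []
    for row in rows:
        if len(row) != len(headers):
            raise ValueError("row length does not match header length")
        pairs = [f"{header} {cell}" for header, cell in zip(headers, row)]
        sentences.append(" ".join(pairs) + ".")
    return " ".join(sentences)


table_text = table_to_text(
    ["Product", "Revenue"],
    [["A", "100"], ["B", "250"]],
)
```

Keeping the header next to every value means each sentence remains meaningful on its own once the text is chunked and embedded with BGE.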
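
The chunk boundaries listed in Section 10 follow from a sliding window whose stride is `CHUNK_SIZE - CHUNK_OVERLAP`. A minimal sketch, assuming the sizes are measured in characters (the document does not state the unit) and using a hypothetical helper `chunk_spans`:

```python
CHUNK_SIZE = 800
CHUNK_OVERLAP = 100


def chunk_spans(text_length, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Return (start, end) offsets of overlapping chunks.

    Hypothetical helper; offsets are assumed to be character positions.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than the chunk size")
    step = size - overlap  # each chunk starts this far after the previous one
    spans = []
    start = 0
    while start < text_length:
        end = min(start + size, text_length)
        spans.append((start, end))
        if end == text_length:
            break  # the final chunk reached the end of the text
        start += step
    return spans


def chunk_text(text, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Slice the text using the spans computed above."""
    return [text[start:end] for start, end in chunk_spans(len(text), size, overlap)]


spans = chunk_spans(2200)
```

For a 2200-character text this yields the same three spans as the example in Section 10, with each chunk repeating the last 100 characters of the previous one.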
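
The "MD5 Hash" and "Registry Check" steps at the start of the ingestion workflow in Section 11 can be sketched with the standard library. The registry is assumed here to be a simple mapping from file hash to source file name; the document does not specify its actual format, and all function names are hypothetical.

```python
import hashlib


def file_md5(path, block_size=1 << 20):
    """Hash a file in blocks so large PDFs are not read into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def bytes_md5(data):
    """Hash content that is already in memory."""
    return hashlib.md5(data).hexdigest()


def needs_ingestion(file_hash, registry):
    """A file is ingested only if its hash has not been seen before."""
    return file_hash not in registry


def record_ingestion(file_hash, source_file, registry):
    """Remember the hash after a successful Qdrant upsert (assumed dict registry)."""
    registry[file_hash] = source_file


registry = {}
first_hash = bytes_md5(b"example pdf bytes")
first_check = needs_ingestion(first_hash, registry)
record_ingestion(first_hash, "physics.pdf", registry)
second_check = needs_ingestion(first_hash, registry)
```

Hashing the file content rather than its name means a renamed copy of an already ingested PDF is skipped, while a changed file with the same name is ingested again.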
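
The threshold tables in Section 14 can be applied to search results with a small lookup. This is a sketch under stated assumptions: `SCORE_BANDS` and `label_score` are hypothetical names, and scores below the lowest band are treated as unlabeled because the tables do not say how to handle them.

```python
# Bands copied from the Section 14 tables, highest threshold first.
SCORE_BANDS = {
    "text": [(0.90, "Exact"), (0.80, "Strong"), (0.70, "Good"), (0.60, "Broad")],
    "image": [(0.35, "Strong"), (0.25, "Good"), (0.15, "Broad")],
}


def label_score(score, mode):
    """Return the label of the first band the score reaches, or None if below all bands."""
    if mode not in SCORE_BANDS:
        raise ValueError(f"unknown search mode: {mode}")
    for minimum, label in SCORE_BANDS[mode]:
        if score >= minimum:
            return label
    return None


text_label = label_score(0.8942, "text")
image_label = label_score(0.30, "image")
```

With these bands, the score of 0.8942 shown in the example output of Section 15 falls into the "Strong" band for text search.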