# Multi-Modal Retrieval-Augmented Generation (MMRAG)

## Enterprise Architecture, Ingestion Pipeline, Vector Database Design, and Search Framework

**Version:** 1.0
**Author:** Vishwanathan

---

# 1. Introduction

MMRAG (Multi-Modal Retrieval-Augmented Generation) extends traditional Retrieval-Augmented Generation by enabling retrieval across multiple data modalities:

* Text
* Images
* Scanned Documents
* OCR Content
* Tables
* Audio Transcriptions (planned)

Unlike conventional RAG systems that store only text embeddings, MMRAG creates dedicated vector spaces for different modalities and stores them within Qdrant.

The platform supports:

* Semantic Search
* Reverse Image Search
* OCR Search
* Table Search
* Multi-Document Search
* Page-Level Citations
* Source Traceability

---

# 2. High-Level Architecture

```text
                  ┌───────────────────┐
                  │ PDF / Images      │
                  └─────────┬─────────┘
                            │
                            ▼
                  ┌───────────────────┐
                  │ Extraction Layer  │
                  │                   │
                  │ PyMuPDF           │
                  │ OCR               │
                  │ Table Extractor   │
                  └─────────┬─────────┘
                            │
                            ▼
                  ┌───────────────────┐
                  │ Chunking Layer    │
                  └─────────┬─────────┘
                            │
          ┌─────────────────┴─────────────────┐
          │                                   │
          ▼                                   ▼

 ┌─────────────────┐                 ┌─────────────────┐
 │ Text Embedding  │                 │ Image Embedding │
 │ BAAI BGE        │                 │ CLIP ViT-B/32   │
 │ 768 Dimensions  │                 │ 512 Dimensions  │
 └────────┬────────┘                 └────────┬────────┘
          │                                   │
          └──────────────┬────────────────────┘
                         ▼

               ┌──────────────────┐
               │ Qdrant Database  │
               └────────┬─────────┘
                        │
                        ▼

               ┌──────────────────┐
               │ Search Engine    │
               └────────┬─────────┘
                        │
                        ▼

               ┌──────────────────┐
               │ User Interface   │
               └──────────────────┘
```

---

# 3. Why Qdrant?

Qdrant is selected because it provides:

* Named vectors (several vectors per point, one per modality)
* High-performance Approximate Nearest Neighbor (ANN) search
* Metadata filtering
* Local persistent storage
* Docker deployment
* REST API
* Python SDK
* Hybrid retrieval (dense search combined with metadata filters and keyword matching)

---

# 4. Qdrant Collection Design

Collection Name:

```python
COLLECTION_NAME = "mmrag"
```

Schema:

```python
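from qdrant_client.models import Distance, VectorParams

# One named vector per modality; each size must match its embedding model.
vectors_config = {
    "text": VectorParams(
        size=768,  # BAAI/bge-base-en-v1.5
        distance=Distance.COSINE,
    ),
    "image": VectorParams(
        size=512,  # CLIP ViT-B/32
        distance=Distance.COSINE,
    ),
}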
```

---

# 5. Why Cosine Similarity?

Cosine similarity compares vector direction rather than magnitude.

Formula:

```text
cos(θ) = A·B / (|A| × |B|)
```

Advantages:

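* Independent of vector magnitude
* Well suited to semantic retrieval
* Standard metric for text and image embedding models
* Unaffected by differences in vector norm, unlike Euclidean distance (for unit-normalized vectors the two produce the same ranking)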
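
The magnitude independence listed above can be checked with a few lines of plain Python. This is a minimal sketch; the helper name `cosine_similarity` and the sample vectors are illustrative and not part of the MMRAG codebase.

```python
import math

def cosine_similarity(a, b):
    """Return cos(θ) = A·B / (|A| × |B|) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

short = [1.0, 2.0, 3.0]
scaled = [10.0, 20.0, 30.0]  # same direction, ten times the magnitude
other = [3.0, -1.0, 0.5]     # different direction

same_direction = cosine_similarity(short, scaled)
different_direction = cosine_similarity(short, other)
```

Scaling a vector leaves its score against the original at 1.0 (up to floating-point error), while a vector pointing elsewhere scores lower.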

---

# 6. Text Embedding Models

## Option 1: BAAI/bge-base-en-v1.5

Dimensions:

```text
768
```

Advantages:

* Superior semantic understanding
* Better contextual retrieval
* Works well on long documents once they are chunked (the model accepts at most 512 input tokens)
* High benchmark retrieval accuracy

Example:

Query:

```text
Principles of mental freedom
```

Retrieved Content:

```text
Freedom is achieved through mastery of one's thoughts...
```

Recommended for:

* Enterprise RAG
* Research papers
* Books
* Technical documentation
* Legal content

---

## Option 2: all-MiniLM-L6-v2

Dimensions:

```text
384
```

Advantages:

* Lightweight
* Fast inference
* Lower RAM usage

Disadvantages:

* Lower semantic richness
* Reduced retrieval quality

### Comparison

| Feature        | MiniLM | BGE       |
| -------------- | ------ | --------- |
| Dimensions     | 384    | 768       |
| Accuracy       | Medium | High      |
| Speed          | Faster | Slower    |
| Memory         | Low    | Medium    |
| Enterprise RAG | Good   | Excellent |
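
The Memory row can be made concrete with simple arithmetic. The sketch below assumes vectors are stored as 32-bit floats (4 bytes per dimension) and counts raw vector data only, ignoring index and payload overhead. The corpus size of one million chunks is a hypothetical figure chosen for illustration.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32

def raw_vector_bytes(dimensions, num_vectors):
    """Raw storage for num_vectors embeddings, excluding index and payload overhead."""
    return dimensions * BYTES_PER_FLOAT32 * num_vectors

NUM_CHUNKS = 1_000_000  # hypothetical corpus size

minilm_bytes = raw_vector_bytes(384, NUM_CHUNKS)
bge_bytes = raw_vector_bytes(768, NUM_CHUNKS)

print(f"MiniLM: {minilm_bytes / 1e9:.2f} GB")
print(f"BGE:    {bge_bytes / 1e9:.2f} GB")
```

Under these assumptions BGE needs exactly twice the raw vector storage of MiniLM, which is the cost paid for the higher retrieval quality.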

### Recommendation

Use:

```text
BAAI/bge-base-en-v1.5
```

---

# 7. Image Embedding Models

## CLIP Only (Current Implementation)

Model:

```text
ViT-B/32
```

Dimensions:

```text
512
```

Workflow:

```text
Image
  ↓
CLIP Encoder
  ↓
512-D Vector
  ↓
Qdrant
```

Advantages:

* Fast
* Robust
* Industry standard

---

## CLIP + Caption Architecture

Workflow for the caption path (the CLIP path is the same as in the CLIP-only setup):

```text
Image
  ↓
Caption Generator
  ↓
Caption Text
  ↓
BGE Embedding
  ↓
Qdrant
```

Example

Image:

```text
Electrical Circuit Diagram
```

Generated Caption:

```text
Electrical circuit showing resistor network and voltage source.
```

Embedded using:

```text
BAAI/bge-base-en-v1.5
```

Advantages:

* Better semantic retrieval
* Human explainability
* Better cross-modal search

Disadvantages:

* Extra inference stage
* Higher latency
* Increased storage

### Comparison

| Feature         | CLIP Only | CLIP + Caption |
| --------------- | --------- | -------------- |
| Speed           | Faster    | Slower         |
| Storage         | Lower     | Higher         |
| Semantic Search | Medium    | Excellent      |
| Explainability  | Low       | High           |

### Recommended Enterprise Strategy

Store both embeddings as named vectors on the same point:

```python
{
    "text": bge_embedding,
    "image": clip_embedding
}
```
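
A point carrying both vectors can be assembled and checked before it is sent to Qdrant. The sketch below uses plain dictionaries so that it runs without any dependencies. `build_point`, `EXPECTED_DIMS`, and the zero-filled placeholder vectors are assumptions for illustration; the two sizes are taken from the collection schema in Section 4.

```python
EXPECTED_DIMS = {"text": 768, "image": 512}  # sizes from the collection schema

def build_point(point_id, bge_embedding, clip_embedding, payload):
    """Bundle both named vectors into one point, rejecting wrong dimensions."""
    vectors = {"text": list(bge_embedding), "image": list(clip_embedding)}
    for name, vector in vectors.items():
        expected = EXPECTED_DIMS[name]
        if len(vector) != expected:
            raise ValueError(
                f"{name} vector has {len(vector)} dimensions, expected {expected}"
            )
    return {"id": point_id, "vector": vectors, "payload": dict(payload)}

# Placeholder vectors stand in for real BGE and CLIP outputs.
point = build_point(
    point_id=1,
    bge_embedding=[0.0] * 768,
    clip_embedding=[0.0] * 512,
    payload={"source_file": "physics.pdf", "page_no": 23},
)
```

The resulting dictionary follows the id / vector / payload layout of a Qdrant point, so a dimension mismatch is caught locally rather than at upsert time.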

---

# 8. Table Processing

Tables should NOT be stored as images.

Recommended workflow:

```text
PDF
 ↓
Table Detection
 ↓
Table Extraction
 ↓
Convert To Text
 ↓
BGE Embedding
```

Example Table

| Product | Revenue |
| ------- | ------- |
| A       | 100     |
| B       | 250     |

Converted To:

```text
Product A Revenue 100.
Product B Revenue 250.
```

Then embedded using:

```text
BAAI/bge-base-en-v1.5
```

Benefits:

* Searchable
* Explainable
* Better retrieval quality
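
The "Convert To Text" step can be sketched as a small row-to-sentence function. This is one possible serialization, assuming the table has already been extracted into a header list and row lists; the function name `table_to_sentences` is hypothetical.

```python
def table_to_sentences(header, rows):
    """Turn each table row into one sentence of 'column value' pairs."""
    sentences = []
    for row in rows:
        if len(row) != len(header):
            raise ValueError("row length does not match header length")
        pairs = " ".join(f"{column} {value}" for column, value in zip(header, row))
        sentences.append(pairs + ".")
    return sentences

header = ["Product", "Revenue"]
rows = [["A", 100], ["B", 250]]

table_text = "\n".join(table_to_sentences(header, rows))
print(table_text)
```

For the example table this reproduces the two sentences shown above, which can then be embedded like any other text chunk.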

---

# 9. PDF Processing Scenarios

## Scenario 1 — Digital PDF

```python
page.get_text()
```

Workflow:

```text
PDF
 ↓
PyMuPDF
 ↓
Direct Text
```

---

## Scenario 2 — Scanned PDF

```text
PDF
 ↓
Render Page
 ↓
RapidOCR
 ↓
Text
```

---

## Scenario 3 — Image-Only PDF

```text
PDF
 ↓
Render Page
 ↓
CLIP Embedding
```

---

## Scenario 4 — Mixed Content PDF

```text
PDF
 ↓
Text Extraction
 ↓
OCR
 ↓
Table Extraction
 ↓
CLIP Image Embedding
```

Most enterprise PDFs fall into this category.
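
One way to route a page to the four scenarios is a simple heuristic on how much text each extraction path returns. The sketch below is an assumption, not the documented MMRAG logic: the function name, the inputs (character counts from the native text layer and from OCR, plus an embedded image count), and the `min_chars` cutoff of 20 are all hypothetical.

```python
# Processing steps per scenario, taken from the workflows in this section.
SCENARIO_STEPS = {
    "digital": ["PyMuPDF text"],
    "scanned": ["render page", "RapidOCR"],
    "image_only": ["render page", "CLIP embedding"],
    "mixed": ["text extraction", "OCR", "table extraction", "CLIP image embedding"],
}

def classify_page(native_chars, ocr_chars, image_count, min_chars=20):
    """Pick a scenario from text-layer size, OCR output size, and image count."""
    has_text_layer = native_chars >= min_chars
    if has_text_layer and image_count == 0:
        return "digital"
    if has_text_layer:
        return "mixed"        # real text plus embedded images
    if ocr_chars >= min_chars:
        return "scanned"      # no text layer, but OCR recovers text
    return "image_only"       # nothing readable: embed the rendered page

scenario = classify_page(native_chars=600, ocr_chars=0, image_count=2)
steps = SCENARIO_STEPS[scenario]
```

Classifying per page rather than per file lets a single PDF mix digital and scanned pages without forcing OCR on every page.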

---

# 10. Page Loading vs Chunk Loading

## Page-Level Processing

```text
PDF
 ↓
Page 1
Page 2
Page 3
...
```

Advantages:

* Preserves page references
* Easier citations

Stored Metadata:

```json
{
  "source_file": "physics.pdf",
  "page_no": 23
}
```

---

## Chunk-Level Processing

Configuration:

```python
CHUNK_SIZE = 800
CHUNK_OVERLAP = 100
```

Example (start and end offsets, in the same units as `CHUNK_SIZE`):

```text
Chunk 1:
0 - 800

Chunk 2:
700 - 1500

Chunk 3:
1400 - 2200
```

Advantages:

* Better retrieval precision
* Overlap preserves context across chunk boundaries
* Faster vector search
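
The offsets in the example follow from a sliding window whose step is `CHUNK_SIZE - CHUNK_OVERLAP`. A minimal sketch, assuming the offsets are character positions in a string (the unit is not stated in this document) and using hypothetical helper names:

```python
CHUNK_SIZE = 800
CHUNK_OVERLAP = 100

def chunk_spans(length, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Return (start, end) offsets for overlapping windows over `length` units."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    spans = []
    start = 0
    while start < length:
        end = min(start + size, length)
        spans.append((start, end))
        if end == length:  # last window reached the end of the text
            break
        start += step
    return spans

def chunk_text(text, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Slice a string into overlapping chunks using chunk_spans."""
    return [text[start:end] for start, end in chunk_spans(len(text), size, overlap)]

spans = chunk_spans(2200)
print(spans)
```

For a 2200-unit text this yields the three windows listed in the example; the final window is simply truncated when the text length is not a multiple of the step.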

---

# 11. Ingestion Workflow

```text
PDF
 ↓
MD5 Hash
 ↓
Registry Check
 ↓
Extract Text
 ↓
OCR If Required
 ↓
Extract Tables
 ↓
Render Page Images
 ↓
Generate BGE Embeddings
 ↓
Generate CLIP Embeddings
 ↓
Qdrant Upsert
 ↓
Save Registry
```
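
The first steps of the workflow (MD5 hash, registry check, save registry) can be sketched with the standard library. The registry format here, a JSON file mapping content hash to file name, is an assumption, as are the function names.

```python
import hashlib
import json
from pathlib import Path

def file_md5(path, block_size=1 << 20):
    """Hash the file in blocks so large PDFs are not loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

def load_registry(registry_path):
    """Return the hash -> file name mapping, or an empty one on first run."""
    path = Path(registry_path)
    return json.loads(path.read_text()) if path.exists() else {}

def needs_ingestion(pdf_path, registry):
    """A file is new if its content hash is not yet in the registry."""
    return file_md5(pdf_path) not in registry

def mark_ingested(pdf_path, registry, registry_path):
    """Record the file after a successful upsert and persist the registry."""
    registry[file_md5(pdf_path)] = Path(pdf_path).name
    Path(registry_path).write_text(json.dumps(registry, indent=2))
```

Keying the registry on the content hash rather than the file name means a renamed copy is skipped, while an edited file with the same name is ingested again.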

---

# 12. Metadata Structure

```json
{
  "source_file": "physics.pdf",
  "page_no": 23,
  "chunk_id": 14,
  "chunk_length": 785,
  "ingested_at": "2026-06-24T11:00:00"
}
```
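
A payload of this shape could be produced by a small builder function. The sketch assumes `chunk_length` is the character count of the chunk text and that `ingested_at` is a local ISO 8601 timestamp with second precision; `build_payload` is a hypothetical name.

```python
from datetime import datetime

def build_payload(source_file, page_no, chunk_id, chunk_text, ingested_at=None):
    """Assemble the metadata stored alongside each chunk's vectors."""
    timestamp = ingested_at or datetime.now()
    return {
        "source_file": source_file,
        "page_no": page_no,
        "chunk_id": chunk_id,
        "chunk_length": len(chunk_text),  # assumption: length in characters
        "ingested_at": timestamp.isoformat(timespec="seconds"),
    }

payload = build_payload(
    source_file="physics.pdf",
    page_no=23,
    chunk_id=14,
    chunk_text="x" * 785,  # placeholder text with the example's length
    ingested_at=datetime(2026, 6, 24, 11, 0, 0),
)
```

Passing the timestamp explicitly keeps the function deterministic in tests, while production callers can omit it.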

---

# 13. Search Modes

## Semantic Search

Input:

```text
Explain electrostatic force
```

Workflow:

```text
Query
 ↓
BGE Embedding
 ↓
Qdrant Text Search
```

---

## Reverse Image Search

Input:

```text
Circuit Diagram Image
```

Workflow:

```text
Image
 ↓
CLIP
 ↓
Qdrant Image Search
```

---

## OCR Search

Input:

```text
Handwritten Formula
```

Workflow:

```text
OCR
 ↓
BGE
 ↓
Qdrant Search
```

---

## Table Search

Input:

```text
Revenue of Product B
```

Workflow:

```text
Query
 ↓
BGE
 ↓
Qdrant Search (over embedded table text)
```

---

## Future Audio Search

Input:

```text
Voice Question
```

Workflow:

```text
Audio
 ↓
Whisper
 ↓
Text
 ↓
BGE
 ↓
Qdrant Search
```

---

# 14. Search Threshold Recommendations

## Text Search

| Score | Meaning |
| ----- | ------- |
| 0.90+ | Exact   |
| 0.80+ | Strong  |
| 0.70+ | Good    |
| 0.60+ | Broad   |

---

## Image Search

| Score | Meaning |
| ----- | ------- |
| 0.35+ | Strong  |
| 0.25+ | Good    |
| 0.15+ | Broad   |
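
The two tables can be applied in code as ordered score bands. This is a minimal sketch with hypothetical names (`THRESHOLDS`, `label_score`); the cutoffs are copied from the tables above and should be tuned against real queries.

```python
# (minimum score, label) pairs, highest first, copied from the tables above.
THRESHOLDS = {
    "text": [(0.90, "Exact"), (0.80, "Strong"), (0.70, "Good"), (0.60, "Broad")],
    "image": [(0.35, "Strong"), (0.25, "Good"), (0.15, "Broad")],
}

def label_score(score, modality):
    """Return the label for a similarity score, or None if it is below every band."""
    for minimum, label in THRESHOLDS[modality]:
        if score >= minimum:
            return label
    return None

text_label = label_score(0.8942, "text")
image_label = label_score(0.30, "image")
```

Keeping separate bands per modality matters because the same numeric score means different things in the text and image vector spaces.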

---

# 15. Example Output Format

```text
🎯 Match #1

Score:
0.8942

Source:
physics.pdf

Page:
143

Extracted Context:
Electrostatic force between two charges...
```
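
A formatter for this layout might look like the sketch below. The function name and argument names are hypothetical; the score is rendered with four decimal places to match the example.

```python
def format_match(rank, score, source_file, page_no, context):
    """Render one search hit in the display layout shown above."""
    return (
        f"🎯 Match #{rank}\n\n"
        f"Score:\n{score:.4f}\n\n"
        f"Source:\n{source_file}\n\n"
        f"Page:\n{page_no}\n\n"
        f"Extracted Context:\n{context}"
    )

report = format_match(
    rank=1,
    score=0.8942,
    source_file="physics.pdf",
    page_no=143,
    context="Electrostatic force between two charges...",
)
print(report)
```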

---

# 16. Future Enhancements

## Multi-Vector Retrieval

```python
{
  "text": 768,
  "image": 512,
  "table": 768
}
```

---

## Reranking Layer

Recommended:

```text
BAAI/bge-reranker-large
```

Workflow:

```text
Retrieve Top 20
 ↓
Rerank
 ↓
Return Top 5
```
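
The retrieve, rerank, return pattern can be sketched independently of the reranker model. In the sketch below the scoring function is pluggable: `token_overlap` is a toy stand-in used only so the example runs without a model, and in practice a cross-encoder such as the one recommended above would supply the scores. All names are hypothetical.

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Re-score retrieved candidates against the query and keep the best top_k."""
    scored = [(score_fn(query, text), text) for text in candidates]
    # Python's sort is stable, so ties keep their original retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]

def token_overlap(query, text):
    """Toy relevance score: fraction of query tokens that appear in the text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & text_tokens) / len(query_tokens)

retrieved = [
    "Magnetic fields around a wire",
    "Electrostatic force between two charges",
    "Force and acceleration",
]
best = rerank("electrostatic force between charges", retrieved, token_overlap, top_k=2)
```

Retrieving more candidates than are returned gives the reranker room to promote passages that the vector search ranked too low.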

---

## Hybrid Retrieval

```text
Dense Search
 +
Metadata Filters
 +
Keyword Search
```
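
This document does not specify how the three signals are combined. One common option, shown here purely as an assumption, is to apply the metadata filter to each ranked list and then merge the dense and keyword rankings with reciprocal rank fusion (RRF). The function names, the constant `k=60`, and the sample IDs are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists; IDs ranked high in several lists score highest."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

def apply_filter(ranking, payloads, **required):
    """Keep only IDs whose payload matches every required key/value pair."""
    return [
        doc_id
        for doc_id in ranking
        if all(payloads[doc_id].get(key) == value for key, value in required.items())
    ]

payloads = {
    "c1": {"source_file": "physics.pdf"},
    "c2": {"source_file": "physics.pdf"},
    "c3": {"source_file": "chemistry.pdf"},
    "c4": {"source_file": "physics.pdf"},
}
dense_ranking = ["c1", "c3", "c2", "c4"]    # from vector search
keyword_ranking = ["c2", "c4", "c1", "c3"]  # from keyword search

filtered = [
    apply_filter(ranking, payloads, source_file="physics.pdf")
    for ranking in (dense_ranking, keyword_ranking)
]
fused = reciprocal_rank_fusion(filtered)
```

RRF works on ranks rather than raw scores, so dense cosine scores and keyword scores do not need to be normalized onto a common scale.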

---

# 17. Final Enterprise Architecture

```text
PDF / DOCX / PPT / IMAGE / AUDIO
                │
                ▼
       Extraction Layer
                │
                ▼
      OCR + Table Parsing
                │
                ▼
      Chunking & Metadata
                │
      ┌─────────┴─────────┐
      ▼                   ▼

 BGE 768           CLIP 512

      ▼                   ▼

         QDRANT

      text vector
      image vector

                ▼

        Hybrid Search

                ▼

          User Interface
```

---

# Conclusion

MMRAG provides a scalable, enterprise-grade multi-modal retrieval framework for semantic search across text, images, OCR documents, and tables, with audio sources planned. By storing BGE embeddings for text and CLIP embeddings for images as named vectors in a single Qdrant collection, the system delivers accurate, explainable, and source-traceable retrieval suitable for production RAG deployments.