Tags: RAG · Python · FastAPI · Vector Search · Multi-Tenancy · Microservices

Building MiniRAG: A Multi-Tenant RAG Platform from Scratch (Part 1: The Foundation)

Oli


The Vision

What if you could build a Retrieval-Augmented Generation (RAG) platform that's truly modular, provider-agnostic, and multi-tenant from day one? That's exactly what I set out to do with MiniRAG – a platform that lets multiple organizations securely manage their own knowledge bases, chat histories, and AI configurations in a single deployment.

After completing the first four major milestones, I wanted to share the journey, the technical decisions, and most importantly, the lessons learned along the way.

What We Built So Far

Step 1: The Foundation 🏗️

Every great platform starts with solid infrastructure. I chose a modern Python stack:

  • FastAPI for lightning-fast async APIs
  • SQLModel (Pydantic + SQLAlchemy) for type-safe database models
  • PostgreSQL for reliable data persistence
  • Qdrant for vector similarity search
  • Redis for task queuing and caching
  • Docker Compose to orchestrate it all
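
To show how these pieces might fit together, here's a hypothetical docker-compose.yml sketch. The service names, images, and commands are illustrative assumptions, not MiniRAG's actual file:

```yaml
# Illustrative sketch only -- names and versions are assumptions
services:
  api:
    build: .
    ports: ["8000:8000"]
    depends_on: [db, qdrant, redis]
  worker:
    build: .
    command: arq app.workers.main.WorkerSettings  # hypothetical ARQ entrypoint
    depends_on: [db, qdrant, redis]
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example
  qdrant:
    image: qdrant/qdrant:latest
  redis:
    image: redis:7
```

One compose file giving every developer the full stack locally is a large part of why the later integration work stays manageable.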

The core models emerged naturally: Tenant, User, and ApiToken form the security backbone, while the config system handles environment-specific settings gracefully.

Step 2: Authentication & Multi-Tenancy 🔐

Multi-tenancy isn't just about data isolation – it's about making it invisible to developers using your platform. I implemented a dual-token authentication system:

# In app/api/deps.py
class AuthContext:
    tenant_id: str
    user_id: str | None
    is_service_account: bool

The beauty is in the deps.py module – every endpoint can simply depend on AuthContext, and the tenant isolation happens automatically. No more "oops, I forgot to filter by tenant_id" bugs.

Step 3: Bot Profiles & Data Sources 🤖

Here's where things got interesting. Each tenant needed to configure their own AI providers (OpenAI, Anthropic, etc.) without exposing credentials to other tenants. Enter Fernet encryption:

# Credentials encrypted at rest, decrypted only when needed
class BotProfile(SQLModel, table=True):
    id: int | None = Field(default=None, primary_key=True)  # Field from sqlmodel
    tenant_id: str
    provider: str
    encrypted_credentials: str  # Fernet-encrypted JSON
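
The round-trip looks roughly like this sketch (the key management and field names are assumptions; in practice the Fernet key comes from app configuration, never from the database):

```python
import json
from cryptography.fernet import Fernet

# Assumed setup: in production the key is loaded from config/secrets,
# not generated per process.
key = Fernet.generate_key()
fernet = Fernet(key)

credentials = {"api_key": "sk-example", "org": "acme"}

# Encrypted blob is what gets stored in encrypted_credentials
encrypted = fernet.encrypt(json.dumps(credentials).encode())

# Decrypted only at call time, just before talking to the provider
decrypted = json.loads(fernet.decrypt(encrypted))
```

Even a full database dump leaks nothing usable without the key, which is exactly the property you want for per-tenant provider credentials.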

The Source model handles data ingestion with proper lifecycle management – tracking whether sources are PENDING, PROCESSING, COMPLETED, or FAILED. Clean state machines make debugging so much easier.
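
A small sketch of how that lifecycle might be enforced (the transition table and helper are hypothetical; the post only names the four states):

```python
from enum import Enum

class SourceStatus(str, Enum):
    PENDING = "PENDING"
    PROCESSING = "PROCESSING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"

# Hypothetical transition table: illegal jumps (e.g. COMPLETED ->
# PROCESSING) fail loudly instead of silently corrupting state.
ALLOWED = {
    SourceStatus.PENDING: {SourceStatus.PROCESSING},
    SourceStatus.PROCESSING: {SourceStatus.COMPLETED, SourceStatus.FAILED},
    SourceStatus.COMPLETED: set(),
    SourceStatus.FAILED: {SourceStatus.PENDING},  # allow retry
}

def transition(current: SourceStatus, new: SourceStatus) -> SourceStatus:
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current} -> {new}")
    return new
```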

Step 4: The Ingestion Pipeline 🔄

This is where the magic happens. Raw documents need to become searchable knowledge, and that journey involves several critical steps:

Text Chunking: Not all text is created equal. I built a recursive chunker that respects sentence boundaries while maintaining consistent chunk sizes (512 tokens with 64-token overlap by default).
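
The size-and-overlap mechanics can be sketched with a simple sliding window (the real chunker is recursive and sentence-aware; this simplified version treats tokens as an already-split list):

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Sliding window over a token list; each chunk shares `overlap`
    tokens with its predecessor so context isn't cut mid-thought."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks
```

With the defaults, a 1000-token document yields three chunks, and each chunk's first 64 tokens repeat the previous chunk's last 64.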

Embedding Generation: Using LiteLLM for provider abstraction, with intelligent batching (max 128 chunks per API call) to balance speed and rate limits.
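
The batching itself is simple; the sketch below shows the splitting logic, with a LiteLLM-style call left as a hedged comment (function names there are assumptions):

```python
from typing import Iterator

def batched(items: list[str], batch_size: int = 128) -> Iterator[list[str]]:
    """Yield fixed-size batches so a single embedding request never
    exceeds the provider's per-call limit."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Hypothetical usage with a LiteLLM-style async embedding call:
# for batch in batched(chunks):
#     response = await litellm.aembedding(model=embed_model, input=batch)
```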

Vector Storage: Qdrant handles the heavy lifting, but the key insight was using a single collection with tenant-based payload filtering rather than per-tenant collections. Much simpler operations.
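
A toy in-memory stand-in illustrates the single-collection design (in production this is a Qdrant payload filter on tenant_id applied before ranking; the data shapes here are invented for illustration):

```python
# Every point carries a tenant_id payload; search filters first, then ranks.
points = [
    {"id": 1, "vector": [0.1, 0.9], "payload": {"tenant_id": "acme", "text": "alpha"}},
    {"id": 2, "vector": [0.8, 0.2], "payload": {"tenant_id": "globex", "text": "beta"}},
    {"id": 3, "vector": [0.2, 0.8], "payload": {"tenant_id": "acme", "text": "gamma"}},
]

def search(query: list[float], tenant_id: str, top_k: int = 2) -> list[dict]:
    # Filter by tenant (like a Qdrant payload filter), then rank by dot product
    candidates = [p for p in points if p["payload"]["tenant_id"] == tenant_id]
    candidates.sort(key=lambda p: -sum(q * v for q, v in zip(query, p["vector"])))
    return candidates[:top_k]
```

Because the filter runs inside the search, a tenant can never see another tenant's chunks, and there is only one collection to back up, monitor, and reindex.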

The entire pipeline runs as an async ARQ task:

# app/workers/ingest.py
async def ingest_source(ctx, source_id: str) -> str:
    # content → chunks → embeddings → vector store → done
    source = await get_source(source_id)
    chunks = chunk_text(source.content)
    embeddings = await generate_embeddings(chunks)
    await vector_store.upsert(chunks, embeddings, source.tenant_id)
    await update_source_status(source_id, "COMPLETED")
    return source_id

Lessons Learned (The Pain Points) 💡

Database Connection Woes

The Problem: Integration tests were trying to connect to a PostgreSQL instance that didn't exist locally, causing mysterious socket.gaierror exceptions.

The Solution: Create a test-specific session factory and patch it at the module level. This gave us clean, isolated tests without requiring a full database setup.

# In conftest.py
@pytest.fixture(scope="session")
def test_session_factory():
    # Async SQLite (via aiosqlite) stands in for Postgres, matching the
    # async session factory the worker code expects
    engine = create_async_engine("sqlite+aiosqlite:///:memory:")
    return async_sessionmaker(bind=engine, expire_on_commit=False)

# In tests
async def test_ingest_task(test_session_factory):
    with patch("app.workers.ingest.async_session_factory", test_session_factory):
        # Now the test uses SQLite instead of Postgres
        result = await ingest_source({}, source_id)

Text Normalization Gotchas

The Problem: Text preprocessing is trickier than it looks. My initial normalize_text() function was collapsing spaces but leaving them around newlines, breaking paragraph structure.

The Solution: Regex to the rescue! re.sub(r" *\n *", "\n", text) cleans up spacing around line breaks while preserving intentional formatting.
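
Putting the pieces together, a minimal normalize_text might look like this (the exact original implementation isn't shown in the post; this sketch combines space collapsing with the newline-aware fix described above):

```python
import re

def normalize_text(text: str) -> str:
    # Collapse runs of spaces/tabs first (deliberately excluding \n),
    # then strip spaces hugging newlines so paragraph breaks survive
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" *\n *", "\n", text)
    return text
```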

The Numbers Don't Lie

After four major development phases:

  • 26 tests passing
  • 4 core services (web, worker, database, vector store)
  • 8 database models with proper relationships
  • 3 major service modules (chunking, embedding, vector store)
  • Full Docker orchestration ready for production

What's Next: The Chat Interface

Step 5 is where everything comes together – building the actual chat experience. This involves:

  1. Chat & Message models for conversation history
  2. Usage tracking for token consumption and billing
  3. The orchestrator service – the brain that coordinates query rewriting, retrieval, context assembly, and LLM generation
  4. Streaming responses via Server-Sent Events
  5. Comprehensive testing of the full chat pipeline

The orchestrator is particularly exciting because it's where all our earlier work pays off:

async def chat_completion(query: str, context: AuthContext):
    # 1. Rewrite query for better retrieval
    # 2. Search vector store for relevant chunks
    # 3. Assemble context with source attribution
    # 4. Generate streaming LLM response
    # 5. Track token usage for billing
    ...  # implementation lands in Step 5

Why This Approach Matters

Building a RAG platform isn't just about connecting an LLM to a vector database. The real challenges are:

  • Multi-tenancy: Secure data isolation at scale
  • Provider flexibility: Not being locked into a single AI vendor
  • Observability: Understanding what's happening when things go wrong
  • Testing: Building confidence in a complex async system

MiniRAG addresses each of these from the ground up, creating a foundation that can grow with real-world demands.


Want to follow along with the development? The next post will dive deep into the chat orchestrator and streaming response implementation. Until then, happy coding! 🚀