Building MiniRAG: A Multi-Tenant RAG Platform from Scratch (Part 1: The Foundation)
The Vision
What if you could build a Retrieval-Augmented Generation (RAG) platform that's truly modular, provider-agnostic, and multi-tenant from day one? That's exactly what I set out to do with MiniRAG – a platform that lets multiple organizations securely manage their own knowledge bases, chat histories, and AI configurations in a single deployment.
After completing the first four major milestones, I wanted to share the journey, the technical decisions, and most importantly, the lessons learned along the way.
What We Built So Far
Step 1: The Foundation 🏗️
Every great platform starts with solid infrastructure. I chose a modern Python stack:
- FastAPI for lightning-fast async APIs
- SQLModel (Pydantic + SQLAlchemy) for type-safe database models
- PostgreSQL for reliable data persistence
- Qdrant for vector similarity search
- Redis for task queuing and caching
- Docker Compose to orchestrate it all
The core models emerged naturally: Tenant, User, and ApiToken form the security backbone, while the config system handles environment-specific settings gracefully.
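A rough sketch of what that backbone can look like in SQLModel (the field names here are illustrative, not the exact schema):
    from sqlmodel import Field, SQLModel

    class Tenant(SQLModel, table=True):
        id: str = Field(primary_key=True)
        name: str

    class User(SQLModel, table=True):
        id: str = Field(primary_key=True)
        tenant_id: str = Field(foreign_key="tenant.id", index=True)
        email: str

    class ApiToken(SQLModel, table=True):
        id: str = Field(primary_key=True)
        tenant_id: str = Field(foreign_key="tenant.id", index=True)
        token_hash: str                   # store a hash, never the raw token
        is_service_account: bool = False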
Step 2: Authentication & Multi-Tenancy 🔐
Multi-tenancy isn't just about data isolation – it's about making it invisible to developers using your platform. I implemented a dual-token authentication system:
# In app/api/deps.py
from dataclasses import dataclass

@dataclass
class AuthContext:
    tenant_id: str
    user_id: str | None        # None for service-account tokens
    is_service_account: bool
The beauty is in the deps.py module – every endpoint can simply depend on AuthContext, and the tenant isolation happens automatically. No more "oops, I forgot to filter by tenant_id" bugs.
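In practice that looks something like this (get_auth_context is an illustrative name for the dependency, and the route is hypothetical):
    from fastapi import APIRouter, Depends
    from app.api.deps import AuthContext, get_auth_context  # hypothetical dependency name

    router = APIRouter()

    @router.get("/sources")
    async def list_sources(auth: AuthContext = Depends(get_auth_context)):
        # The dependency has already validated the token and resolved the tenant,
        # so this handler can't forget to scope its queries
        return {"tenant_id": auth.tenant_id}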
Step 3: Bot Profiles & Data Sources 🤖
Here's where things got interesting. Each tenant needed to configure their own AI providers (OpenAI, Anthropic, etc.) without exposing credentials to other tenants. Enter Fernet encryption:
from sqlmodel import Field, SQLModel

# Credentials encrypted at rest, decrypted only when needed
class BotProfile(SQLModel, table=True):
    id: str = Field(primary_key=True)
    tenant_id: str = Field(index=True)
    provider: str                   # e.g. "openai", "anthropic"
    encrypted_credentials: str      # Fernet-encrypted JSON
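To make that concrete, here's a minimal sketch of the encrypt/decrypt helpers (the environment variable name is an assumption):
    import json
    import os
    from cryptography.fernet import Fernet

    fernet = Fernet(os.environ["CREDENTIALS_KEY"])  # assumed key name; generate with Fernet.generate_key()

    def encrypt_credentials(creds: dict) -> str:
        # Serialize the provider credentials and encrypt them for storage
        return fernet.encrypt(json.dumps(creds).encode()).decode()

    def decrypt_credentials(blob: str) -> dict:
        # Decrypt only at call time; the plaintext never touches the database
        return json.loads(fernet.decrypt(blob.encode()))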
The Source model handles data ingestion with proper lifecycle management – tracking whether sources are PENDING, PROCESSING, COMPLETED, or FAILED. Clean state machines make debugging so much easier.
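Those states map naturally onto a string enum (a sketch; the real model may carry more states):
    from enum import Enum

    class SourceStatus(str, Enum):
        PENDING = "PENDING"
        PROCESSING = "PROCESSING"
        COMPLETED = "COMPLETED"
        FAILED = "FAILED"
Inheriting from str means the values serialize cleanly into JSON responses and database columns.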
Step 4: The Ingestion Pipeline 🔄
This is where the magic happens. Raw documents need to become searchable knowledge, and that journey involves several critical steps:
Text Chunking: Not all text is created equal. I built a recursive chunker that respects sentence boundaries while maintaining consistent chunk sizes (512 tokens with 64-token overlap by default).
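Here's a deliberately simplified sliding-window version of the idea – whitespace-split words stand in for real model tokens, and the sentence-boundary logic is omitted:
    def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
        tokens = text.split()            # simplification: words, not model tokens
        step = chunk_size - overlap      # each chunk adds 448 fresh tokens
        chunks = []
        for start in range(0, len(tokens), step):
            chunks.append(" ".join(tokens[start:start + chunk_size]))
            if start + chunk_size >= len(tokens):
                break
        return chunks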
Embedding Generation: Using LiteLLM for provider abstraction, with intelligent batching (max 128 chunks per API call) to balance speed and rate limits.
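A sketch of the batching loop, assuming LiteLLM's OpenAI-style response shape (the model name is a placeholder – in reality it comes from the tenant's BotProfile):
    import litellm

    async def generate_embeddings(
        chunks: list[str],
        model: str = "text-embedding-3-small",  # placeholder; resolved from BotProfile
        batch_size: int = 128,
    ) -> list[list[float]]:
        vectors: list[list[float]] = []
        for i in range(0, len(chunks), batch_size):
            # One API call per batch keeps us under provider rate limits
            response = await litellm.aembedding(model=model, input=chunks[i:i + batch_size])
            vectors.extend(item["embedding"] for item in response.data)
        return vectors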
Vector Storage: Qdrant handles the heavy lifting, but the key insight was using a single collection with tenant-based payload filtering rather than per-tenant collections – far simpler to operate, as the sketch below shows.
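A sketch of that filtering with qdrant-client (the collection name and URL are assumptions):
    from qdrant_client import QdrantClient
    from qdrant_client.models import FieldCondition, Filter, MatchValue

    client = QdrantClient(url="http://localhost:6333")  # assumed local default

    def search_chunks(query_vector: list[float], tenant_id: str, limit: int = 5):
        # One shared collection; isolation comes entirely from the payload filter
        return client.search(
            collection_name="chunks",  # hypothetical collection name
            query_vector=query_vector,
            query_filter=Filter(
                must=[FieldCondition(key="tenant_id", match=MatchValue(value=tenant_id))]
            ),
            limit=limit,
        )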
The entire pipeline runs as an async ARQ task:
# app/workers/ingest.py
async def ingest_source(ctx, source_id: str) -> str:
    # content → chunks → embeddings → vector store → done
    source = await get_source(source_id)
    chunks = chunk_text(source.content)
    embeddings = await generate_embeddings(chunks)
    await vector_store.upsert(chunks, embeddings, source.tenant_id)
    await update_source_status(source_id, "COMPLETED")
    return source_id  # ARQ stores the return value as the job result
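For completeness, registering the task follows ARQ's standard WorkerSettings pattern (the Redis host is an assumption based on the Compose setup):
    from arq.connections import RedisSettings

    class WorkerSettings:
        functions = [ingest_source]
        redis_settings = RedisSettings(host="redis")  # assumed Compose service name

    # Run the worker with: arq app.workers.ingest.WorkerSettings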
Lessons Learned (The Pain Points) 💡
Database Connection Woes
The Problem: Integration tests were trying to connect to a PostgreSQL instance that didn't exist locally, causing mysterious socket.gaierror exceptions.
The Solution: Create a test-specific session factory and patch it at the module level. This gave us clean, isolated tests without requiring a full database setup.
# In conftest.py
import pytest
from sqlalchemy.ext.asyncio import async_sessionmaker, create_async_engine

@pytest.fixture(scope="session")
def test_session_factory():
    # In-memory SQLite via aiosqlite, matching the async session interface
    engine = create_async_engine("sqlite+aiosqlite:///:memory:")
    return async_sessionmaker(bind=engine)

# In tests
from unittest.mock import patch

async def test_ingest_task(test_session_factory):
    with patch("app.workers.ingest.async_session_factory", test_session_factory):
        # The task now talks to SQLite instead of Postgres
        result = await ingest_source({}, source_id)
Text Normalization Gotchas
The Problem: Text preprocessing is trickier than it looks. My initial normalize_text() function was collapsing spaces but leaving them around newlines, breaking paragraph structure.
The Solution: Regex to the rescue! re.sub(r" *\n *", "\n", text) cleans up spacing around line breaks while preserving intentional formatting.
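Put together, a minimal version of the normalizer (the real function likely does more, such as Unicode cleanup):
    import re

    def normalize_text(text: str) -> str:
        text = re.sub(r"[ \t]+", " ", text)   # collapse runs of spaces/tabs, leave newlines alone
        return re.sub(r" *\n *", "\n", text)  # strip stray spaces hugging each line break
For example, normalize_text("hello   world \n  next") returns "hello world\nnext" – spaces collapsed, the paragraph break intact.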
The Numbers Don't Lie
After four major development phases:
- 26 tests passing ✅
- 4 core services (web, worker, database, vector store)
- 8 database models with proper relationships
- 3 major service modules (chunking, embedding, vector store)
- Full Docker orchestration ready for production
What's Next: The Chat Interface
Step 5 is where everything comes together – building the actual chat experience. This involves:
- Chat & Message models for conversation history
- Usage tracking for token consumption and billing
- The orchestrator service – the brain that coordinates query rewriting, retrieval, context assembly, and LLM generation
- Streaming responses via Server-Sent Events
- Comprehensive testing of the full chat pipeline
The orchestrator is particularly exciting because it's where all our earlier work pays off:
async def chat_completion(query: str, context: AuthContext):
    # 1. Rewrite query for better retrieval
    # 2. Search vector store for relevant chunks
    # 3. Assemble context with source attribution
    # 4. Generate streaming LLM response
    # 5. Track token usage for billing
    ...
Why This Approach Matters
Building a RAG platform isn't just about connecting an LLM to a vector database. The real challenges are:
- Multi-tenancy: Secure data isolation at scale
- Provider flexibility: Not being locked into a single AI vendor
- Observability: Understanding what's happening when things go wrong
- Testing: Building confidence in a complex async system
MiniRAG addresses each of these from the ground up, creating a foundation that can grow with real-world demands.
Want to follow along with the development? The next post will dive deep into the chat orchestrator and streaming response implementation. Until then, happy coding! 🚀