
Building MiniRAG: From Code to Production with FastAPI and Vector Search

Oli

After seven development sessions and countless cups of coffee, I'm excited to share that MiniRAG is now a fully functional multi-tenant RAG (Retrieval-Augmented Generation) chat backend. What started as an ambitious idea to combine FastAPI, PostgreSQL, Qdrant vector database, Redis task queues, and LiteLLM has evolved into a robust system that's passing all tests and ready for the real world.

What We Built

MiniRAG is a complete chat backend that allows multiple tenants to:

  • Upload and process documents into searchable knowledge bases
  • Chat with their documents using AI models (OpenAI, Anthropic, etc.)
  • Manage users, tenants, and permissions through a clean REST API
  • Process documents asynchronously with background tasks

The tech stack brings together some powerful tools:

  • FastAPI for lightning-fast API development
  • PostgreSQL for reliable data persistence
  • Qdrant for vector similarity search
  • Redis + ARQ for background job processing
  • LiteLLM for unified access to multiple AI providers
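These pieces get wired together through a single settings object. Here's a minimal sketch using stdlib dataclasses — the field names and defaults are illustrative, and the real project presumably uses something like pydantic-settings with environment variables:

```python
import os
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Settings:
    """Connection settings for each backing service (illustrative defaults)."""

    database_url: str = field(
        default_factory=lambda: os.getenv("DATABASE_URL", "postgresql://localhost/minirag")
    )
    qdrant_url: str = field(
        default_factory=lambda: os.getenv("QDRANT_URL", "http://localhost:6333")
    )
    redis_url: str = field(
        default_factory=lambda: os.getenv("REDIS_URL", "redis://localhost:6379")
    )
    llm_model: str = field(
        default_factory=lambda: os.getenv("LLM_MODEL", "gpt-4o-mini")
    )


settings = Settings()
```

Keeping every connection string in one frozen object means each service class takes its configuration from the same place, which pays off later when tests need to point at throwaway containers.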

The Victory Moment

Nothing beats the satisfaction of seeing this in your terminal:

newman run postman_collection.json -e postman_environment.json

┌────────────────────────────────────────┐
│                                        │
│   30 requests, 58 assertions           │
│   ✓ 30 requests succeeded, 0 failed    │
│   ✓ 58 assertions passed, 0 failed     │
│                                        │
└────────────────────────────────────────┘

This comprehensive test suite covers everything from user authentication to live RAG conversations with real LLM calls. A fully green run like this gives me confidence that every component is working together harmoniously.

Lessons Learned: When APIs Evolve Faster Than Documentation

The most interesting challenge came from an unexpected source: dependency version mismatches. Just when everything seemed to be working perfectly, I hit a wall with the Qdrant client library.

The Problem

# This code worked in development...
results = await client.search(
    collection_name="documents",
    query_vector=query_embedding,
    limit=limit
)

for hit in results:
    ...  # process each scored point

But it suddenly started throwing this error:

AttributeError: 'AsyncQdrantClient' object has no attribute 'search'

The Investigation

It turns out that qdrant-client version 1.16.2 removed the .search() method entirely, replacing it with a new .query_points() API. The Docker image I was using (Qdrant 1.13.2) was also slightly behind, creating a version compatibility dance.

The Solution

The fix required updating both the method call and response handling:

# New approach with query_points
response = await client.query_points(
    collection_name="documents", 
    query=query_embedding,  # Note: 'query' not 'query_vector'
    limit=limit
)

# Response structure changed too
for hit in response.points:  # iterate .points, not the response itself
    ...  # process each scored point

I also had to suppress the version warning:

client = AsyncQdrantClient(
    url=settings.qdrant_url,
    check_compatibility=False  # Skip version mismatch warnings
)
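Another way to insulate application code from this kind of API churn is a small compatibility wrapper that prefers the new method and falls back to the old one. This is a hedged sketch, not the project's actual code — `vector_search` is a name I'm inventing here:

```python
# Hypothetical shim: prefer the new query_points() API, fall back to search().
async def vector_search(client, collection_name, query_vector, limit=5):
    if hasattr(client, "query_points"):
        response = await client.query_points(
            collection_name=collection_name,
            query=query_vector,  # new API: 'query', not 'query_vector'
            limit=limit,
        )
        return response.points   # new API nests hits under .points
    # Old API (pre-1.16): returns the list of hits directly
    return await client.search(
        collection_name=collection_name,
        query_vector=query_vector,
        limit=limit,
    )
```

Callers then get a stable signature regardless of which client version is installed, at the cost of one `hasattr` check per query.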

The Takeaway

This experience reinforced a crucial lesson: pin your dependencies in production. While staying current with updates is important, breaking changes in minor versions can catch you off guard. Always test thoroughly after dependency updates, and maintain comprehensive test suites to catch these issues early.
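In practice, pinning means exact versions (or at least upper bounds) for anything whose API you depend on. A sketch of what that might look like in a requirements.txt — the version numbers here are illustrative, not a recommendation:

```text
fastapi==0.115.0        # illustrative pins; use the versions you have actually tested
qdrant-client==1.13.1   # keep in step with the Qdrant server image you deploy
arq==0.26.0
litellm==1.48.0
```

The Qdrant pin is the important one in this story: matching the client to the server image would have avoided both the removed method and the compatibility warning.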

Architecture Highlights

One aspect I'm particularly proud of is the clean separation of concerns:

Service Layer: Each major component (vector store, document processing, chat) has its own service class with clear interfaces.
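A minimal sketch of what such an interface might look like, using `typing.Protocol` — the names here are illustrative, not the project's actual classes:

```python
from typing import Protocol


class VectorStore(Protocol):
    """Interface a vector-store service is expected to satisfy (illustrative)."""

    async def upsert(self, tenant_id: str, doc_id: str, vectors: list[list[float]]) -> None: ...
    async def search(self, tenant_id: str, query: list[float], limit: int = 5) -> list[dict]: ...


class InMemoryVectorStore:
    """Toy implementation, used here only to show the interface is satisfiable."""

    def __init__(self) -> None:
        self._docs = {}  # (tenant_id, doc_id) -> vectors

    async def upsert(self, tenant_id, doc_id, vectors):
        self._docs[(tenant_id, doc_id)] = vectors

    async def search(self, tenant_id, query, limit=5):
        # Toy behaviour: return ids for the tenant, ignoring similarity.
        return [{"doc_id": d} for (t, d) in self._docs if t == tenant_id][:limit]
```

Coding the chat and ingestion services against the protocol rather than the Qdrant client directly is also what makes API churn like the `.search()` removal a one-file fix.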

Background Processing: Document ingestion happens asynchronously, so users don't wait for embeddings to be generated.
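Behind that queue sits a pipeline of roughly: split the document into chunks, embed each chunk, upsert the vectors. The splitting step might look like this overlapping-window chunker — the sizes are illustrative, and the real project's parameters may differ:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows for embedding."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = [text[i : i + chunk_size] for i in range(0, len(text), step)]
    # Drop a trailing fragment that is entirely contained in the previous chunk.
    if len(chunks) > 1 and len(chunks[-1]) <= overlap:
        chunks.pop()
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both sides, at the cost of a little redundant embedding work.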

Multi-tenancy: Everything is properly scoped to tenants, ensuring data isolation and scalability.
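Concretely, every vector query carries a tenant filter so one tenant can never retrieve another tenant's chunks. A sketch of building that filter as a plain payload dict, mirroring Qdrant's REST filter shape — the real code would presumably use `qdrant_client`'s typed `Filter` models instead:

```python
def tenant_filter(tenant_id: str) -> dict:
    """Build a Qdrant-style payload filter scoping results to one tenant."""
    return {
        "must": [
            {"key": "tenant_id", "match": {"value": tenant_id}},
        ]
    }
```

Applying the filter at the database layer, rather than post-filtering results in Python, means isolation holds even if a service-layer bug forgets to check ownership.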

LLM Flexibility: Using LiteLLM means switching from OpenAI to Anthropic (or any other provider) is just a configuration change.
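Because LiteLLM encodes the provider in the model string, the switch really is just configuration. A small sketch — the model names are illustrative, and the commented call shows where the chat service would plug in:

```python
import os

# LiteLLM routes on the model string, e.g. "gpt-4o-mini" vs
# "anthropic/claude-3-haiku-20240307", so changing providers is an env change.
def resolve_model(default: str = "gpt-4o-mini") -> str:
    """Pick the LLM model from configuration, falling back to a default."""
    return os.getenv("LLM_MODEL", default)

# The chat service would then call something like:
#   response = await litellm.acompletion(model=resolve_model(), messages=messages)
```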

What's Next

The foundation is solid, but there are exciting enhancements on the horizon:

  1. Streaming Responses: Implementing Server-Sent Events for real-time chat experiences
  2. Production Deployment: Adding Alembic database migrations and containerization
  3. Advanced Features: Rate limiting, usage quotas, and tenant analytics
  4. Performance Optimization: Caching strategies and query optimization

Try It Yourself

The complete codebase is available, along with a full Postman collection for testing. Whether you're building your own RAG system or just curious about modern Python API development, MiniRAG demonstrates patterns that scale.

The journey from concept to working system taught me that the devil truly is in the details—but with good testing practices and patience for debugging, even the trickiest integration challenges become stepping stones to a better architecture.


Have you built similar systems? I'd love to hear about your experiences with vector databases and multi-tenant architectures. The RAG space is evolving rapidly, and there's always more to learn.