Building Auto-Refresh for MiniRAG: Scheduled Content Updates and Hard-Won Lessons
Last night I wrapped up implementing one of those features that sounds simple on paper but turns into a rabbit hole of architectural decisions: automatic content refresh for my MiniRAG system. The goal was straightforward—let users schedule their URL-based sources to automatically re-ingest content on a regular basis. The implementation? Well, that's where things got interesting.
The Feature: Making RAG Systems Self-Updating
If you're building a Retrieval-Augmented Generation (RAG) system, you've probably faced this problem: content goes stale. That blog post you ingested last month got updated, that documentation page changed, and suddenly your AI is giving answers based on outdated information.
The solution seemed obvious: add scheduling. Let users say "refresh this URL every hour" or "check for updates daily" and handle it automatically in the background.
Here's what I built:
1. Database Schema Updates
First, I extended my Source model with scheduling fields:
```python
# app/models/source.py
from datetime import datetime
from enum import Enum
from typing import Optional

from sqlmodel import SQLModel


class RefreshSchedule(str, Enum):
    NEVER = "never"
    HOURLY = "hourly"
    DAILY = "daily"
    WEEKLY = "weekly"


class Source(SQLModel, table=True):
    # ... existing fields
    refresh_schedule: RefreshSchedule = RefreshSchedule.NEVER
    last_refreshed_at: Optional[datetime] = None
```
2. HTML Content Extraction
Since I'm dealing with URLs, I needed robust HTML-to-text extraction. Rather than pulling in another dependency, I built a lightweight parser using Python's standard library:
```python
# app/services/html_extract.py
from html.parser import HTMLParser


class HTMLToTextParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text_parts = []
        # meta and link carry no text content, so only script/style
        # need to be tracked for skipping
        self.ignore_tags = {'script', 'style'}
        self._skip_depth = 0  # nesting depth inside ignored tags

    def handle_starttag(self, tag, attrs):
        if tag in self.ignore_tags:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.ignore_tags and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we're outside all ignored tags
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())
```
3. Background Job Scheduling
This is where it gets fun. I'm using ARQ (Async Redis Queue) for background jobs, so I needed a cron job that runs every 15 minutes to check which sources need refreshing:
```python
# app/workers/refresh.py
async def check_refresh_schedules(ctx):
    redis_pool = ctx["redis"]  # ARQ provides this

    # Query sources that need refreshing based on schedule
    sources_to_refresh = await get_sources_needing_refresh()
    for source in sources_to_refresh:
        await redis_pool.enqueue_job(
            'ingest_source',
            source_id=source.id,
            is_refresh=True,
        )
```
The cron job registration was clean:
```python
# app/workers/main.py
from arq import cron

cron_jobs = [
    cron(check_refresh_schedules, minute={0, 15, 30, 45}),
]
```
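The `get_sources_needing_refresh` helper isn't shown above, but the due-check it performs boils down to comparing `last_refreshed_at` against the chosen interval. Here's a minimal sketch of that logic as a pure function — the `INTERVALS` mapping and the `needs_refresh` name are my assumptions, not MiniRAG's actual code:

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical interval mapping -- the real values live in the app code
INTERVALS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


def needs_refresh(schedule: str, last_refreshed_at: Optional[datetime],
                  now: Optional[datetime] = None) -> bool:
    """Return True if a source is due for re-ingestion."""
    if schedule not in INTERVALS:  # covers "never" and unknown values
        return False
    if last_refreshed_at is None:
        return True  # scheduled but never refreshed: due immediately
    now = now or datetime.utcnow()
    return now - last_refreshed_at >= INTERVALS[schedule]


now = datetime(2024, 1, 2, 12, 0)
print(needs_refresh("hourly", datetime(2024, 1, 2, 10, 0), now))  # True
print(needs_refresh("daily", datetime(2024, 1, 2, 10, 0), now))   # False
print(needs_refresh("never", None, now))                          # False
```

Keeping this as a pure function of `(schedule, last_refreshed_at, now)` also makes the scheduling logic trivially unit-testable without a database.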
Lessons Learned: The Hard Way
The Circular Import Trap
Here's where I learned a valuable lesson about Python imports. My first attempt looked like this:
```python
# refresh.py trying to import Redis settings
from app.workers.main import _redis_settings

# But main.py was importing the refresh function
from app.workers.refresh import check_refresh_schedules
```
Boom. Circular import. The error message was clear, but the solution took some thinking.
The fix was elegant: ARQ automatically provides a Redis connection pool in the job context. Instead of trying to create my own pool, I used what was already there:
```python
# Clean solution - use ARQ's provided context
async def check_refresh_schedules(ctx):
    redis_pool = ctx["redis"]  # No imports needed!
```
Lesson: When working with job queues, check what the framework provides before rolling your own connections.
Database Migration Reality Check
The second gotcha was more operational. I added new columns to my SQLModel, but SQLAlchemy's create_all() doesn't add columns to existing tables—it only creates missing tables entirely.
Running my test suite against a local database with existing data failed spectacularly:
```
500 Internal Server Error - column "refresh_schedule" does not exist
```
Quick fix for development:
```sql
ALTER TABLE sources ADD COLUMN IF NOT EXISTS refresh_schedule VARCHAR(20);
ALTER TABLE sources ADD COLUMN IF NOT EXISTS last_refreshed_at TIMESTAMP;
```
But this highlighted a gap: I need proper Alembic migrations for any environment with persistent data.
Lesson: Schema changes are easy in development, trickier in production. Plan your migration strategy early.
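For anyone following along, the proper Alembic version of that quick fix would look roughly like this — the revision filename, column types, and server default are illustrative, not the actual migration:

```python
# versions/xxxx_add_refresh_fields.py -- sketch, not the real migration file
from alembic import op
import sqlalchemy as sa


def upgrade():
    op.add_column('sources', sa.Column(
        'refresh_schedule', sa.String(20),
        nullable=False, server_default='never'))
    op.add_column('sources', sa.Column(
        'last_refreshed_at', sa.DateTime(), nullable=True))


def downgrade():
    op.drop_column('sources', 'last_refreshed_at')
    op.drop_column('sources', 'refresh_schedule')
```

The `server_default` matters here: without it, adding a non-nullable column to a table with existing rows fails.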
The Results: Feature Complete
After working through these challenges, here's what I ended up with:
- ✅ 108 pytest tests passing (including 13 new tests for URL ingestion and refresh logic)
- ✅ Newman API tests at 88/107 (19 failures were pre-existing, related to missing LLM API keys)
- ✅ Clean dashboard UI with schedule picker and refresh status columns
- ✅ Deployed to production via GitHub Actions (13-second deploy time!)
The refresh system handles:
- Fetching URL content via `httpx`
- HTML-to-text conversion without external dependencies
- Smart scheduling based on last refresh time and chosen interval
- Proper error handling and status tracking
What's Next?
This feature opens up some interesting possibilities:
- Webhook notifications when refresh succeeds or fails
- Content change detection to avoid unnecessary re-indexing
- Backoff strategies for sources that consistently fail
- Analytics on refresh patterns and success rates
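Content change detection in particular has a cheap first cut: fingerprint the extracted text and skip re-indexing when the fingerprint hasn't changed. A sketch of the idea (this is my speculation about a future direction, not something MiniRAG does yet):

```python
import hashlib


def content_fingerprint(text: str) -> str:
    """Stable fingerprint of extracted text, ignoring whitespace noise."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


old = content_fingerprint("Hello  world\n")
new = content_fingerprint("Hello world")
print(old == new)  # True -- whitespace-only changes don't trigger re-indexing
print(old == content_fingerprint("Hello brave new world"))  # False
```

Storing one 64-character hash per source is far cheaper than re-embedding unchanged content on every refresh.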
Key Takeaways
Building this feature reinforced a few important principles:
- Leverage your framework: ARQ's context system was cleaner than managing my own Redis connections
- Test thoroughly: Having comprehensive tests caught the database schema issue early
- Plan for persistence: Schema changes need proper migration strategies
- Start simple: Using the stdlib `HTMLParser` instead of heavy dependencies kept things lightweight
Sometimes the best learning happens when you're knee-deep in circular imports at 2 AM, but that's what makes these late-night coding sessions worthwhile.
MiniRAG is my experimental RAG system built with FastAPI, PostgreSQL, and Redis. You can follow the development journey in my technical blog series.