Building Auto-Refresh for MiniRAG: Scheduled Content Updates and Hard-Won Lessons
Last night I wrapped up implementing one of those features that sounds simple on paper but turns into a rabbit hole of architectural decisions: automatic content refresh for my MiniRAG system. The goal was straightforward—let users schedule their URL-based sources to automatically re-ingest content on a regular basis. The implementation? Well, that's where things got interesting.
The Feature: Making RAG Systems Self-Updating
If you're building a Retrieval-Augmented Generation (RAG) system, you've probably faced this problem: content goes stale. That blog post you ingested last month got updated, that documentation page changed, and suddenly your AI is giving answers based on outdated information.
The solution seemed obvious: add scheduling. Let users say "refresh this URL every hour" or "check for updates daily" and handle it automatically in the background.
Here's what I built:
1. Database Schema Updates
First, I extended my Source model with scheduling fields:
```python
# app/models/source.py
from datetime import datetime
from enum import Enum
from typing import Optional

from sqlmodel import SQLModel


class RefreshSchedule(str, Enum):
    NEVER = "never"
    HOURLY = "hourly"
    DAILY = "daily"
    WEEKLY = "weekly"


class Source(SQLModel, table=True):
    # ... existing fields
    refresh_schedule: RefreshSchedule = RefreshSchedule.NEVER
    last_refreshed_at: Optional[datetime] = None
```
2. HTML Content Extraction
Since I'm dealing with URLs, I needed robust HTML-to-text extraction. Rather than pulling in another dependency, I built a lightweight parser using Python's standard library:
```python
# app/services/html_extract.py
from html.parser import HTMLParser


class HTMLToTextParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text_parts = []
        # meta and link carry no text content, so only script/style
        # need to be tracked for skipping
        self.ignore_tags = {'script', 'style'}
        self._skip_depth = 0  # nesting depth inside ignored tags

    def handle_starttag(self, tag, attrs):
        if tag in self.ignore_tags:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.ignore_tags and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we're outside all ignored tags
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())
```
3. Background Job Scheduling
This is where it gets fun. I'm using ARQ (Async Redis Queue) for background jobs, so I needed a cron job that runs every 15 minutes to check which sources need refreshing:
```python
# app/workers/refresh.py
async def check_refresh_schedules(ctx):
    redis_pool = ctx["redis"]  # ARQ provides this

    # Query sources that need refreshing based on schedule
    sources_to_refresh = await get_sources_needing_refresh()
    for source in sources_to_refresh:
        await redis_pool.enqueue_job(
            'ingest_source',
            source_id=source.id,
            is_refresh=True,
        )
```
The cron job registration was clean:
```python
# app/workers/main.py
from arq import cron

cron_jobs = [
    cron(check_refresh_schedules, minute={0, 15, 30, 45}),
]
```
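The `get_sources_needing_refresh` helper isn't shown above, but the due-check it performs boils down to comparing `last_refreshed_at` against the chosen interval. Here's a minimal sketch of that logic as a pure function — the `INTERVALS` mapping and the `needs_refresh` name are my assumptions, not MiniRAG's actual code:

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical interval mapping -- the real values live in the app code
INTERVALS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


def needs_refresh(schedule: str, last_refreshed_at: Optional[datetime],
                  now: Optional[datetime] = None) -> bool:
    """Return True if a source is due for re-ingestion."""
    if schedule not in INTERVALS:  # covers "never" and unknown values
        return False
    if last_refreshed_at is None:
        return True  # scheduled but never refreshed: due immediately
    now = now or datetime.utcnow()
    return now - last_refreshed_at >= INTERVALS[schedule]


now = datetime(2024, 1, 2, 12, 0)
print(needs_refresh("hourly", datetime(2024, 1, 2, 10, 0), now))  # True
print(needs_refresh("daily", datetime(2024, 1, 2, 10, 0), now))   # False
print(needs_refresh("never", None, now))                          # False
```

Keeping this as a pure function of `(schedule, last_refreshed_at, now)` also makes the scheduling logic trivially unit-testable without a database.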
Lessons Learned: The Hard Way
The Circular Import Trap
Here's where I learned a valuable lesson about Python imports. My first attempt looked like this:
```python
# refresh.py trying to import Redis settings
from app.workers.main import _redis_settings

# But main.py was importing the refresh function
from app.workers.refresh import check_refresh_schedules
```
Boom. Circular import. The error message was clear, but the solution took some thinking.
The fix was elegant: ARQ automatically provides a Redis connection pool in the job context. Instead of trying to create my own pool, I used what was already there:
```python
# Clean solution - use ARQ's provided context
async def check_refresh_schedules(ctx):
    redis_pool = ctx["redis"]  # No imports needed!
```
Lesson: When working with job queues, check what the framework provides before rolling your own connections.
Database Migration Reality Check
The second gotcha was more operational. I added new columns to my SQLModel, but SQLAlchemy's create_all() doesn't add columns to existing tables—it only creates missing tables entirely.
Running my test suite against a local database with existing data failed spectacularly:
```
500 Internal Server Error - column "refresh_schedule" does not exist
```
Quick fix for development:
```sql
ALTER TABLE sources ADD COLUMN IF NOT EXISTS refresh_schedule VARCHAR(20);
ALTER TABLE sources ADD COLUMN IF NOT EXISTS last_refreshed_at TIMESTAMP;
```
But this highlighted a gap: I need proper Alembic migrations for any environment with persistent data.
Lesson: Schema changes are easy in development, trickier in production. Plan your migration strategy early.
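For anyone following along, the proper Alembic version of that quick fix would look roughly like this — the revision filename, column types, and server default are illustrative, not the actual migration:

```python
# versions/xxxx_add_refresh_fields.py -- sketch, not the real migration file
from alembic import op
import sqlalchemy as sa


def upgrade():
    op.add_column('sources', sa.Column(
        'refresh_schedule', sa.String(20),
        nullable=False, server_default='never'))
    op.add_column('sources', sa.Column(
        'last_refreshed_at', sa.DateTime(), nullable=True))


def downgrade():
    op.drop_column('sources', 'last_refreshed_at')
    op.drop_column('sources', 'refresh_schedule')
```

The `server_default` matters here: without it, adding a non-nullable column to a table with existing rows fails.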
The Results: Feature Complete
After working through these challenges, here's what I ended up with:
- ✅ 108 pytest tests passing (including 13 new tests for URL ingestion and refresh logic)
- ✅ Newman API tests at 88/107 (19 failures were pre-existing, related to missing LLM API keys)
- ✅ Clean dashboard UI with schedule picker and refresh status columns
- ✅ Deployed to production via GitHub Actions (13-second deploy time!)
The refresh system handles:
- Fetching URL content via `httpx`
- HTML-to-text conversion without external dependencies
- Smart scheduling based on last refresh time and chosen interval
- Proper error handling and status tracking
What's Next?
This feature opens up some interesting possibilities:
- Webhook notifications when refresh succeeds or fails
- Content change detection to avoid unnecessary re-indexing
- Backoff strategies for sources that consistently fail
- Analytics on refresh patterns and success rates
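Content change detection in particular has a cheap first cut: fingerprint the extracted text and skip re-indexing when the fingerprint hasn't changed. A sketch of the idea (this is my speculation about a future direction, not something MiniRAG does yet):

```python
import hashlib


def content_fingerprint(text: str) -> str:
    """Stable fingerprint of extracted text, ignoring whitespace noise."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


old = content_fingerprint("Hello  world\n")
new = content_fingerprint("Hello world")
print(old == new)  # True -- whitespace-only changes don't trigger re-indexing
print(old == content_fingerprint("Hello brave new world"))  # False
```

Storing one 64-character hash per source is far cheaper than re-embedding unchanged content on every refresh.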
Key Takeaways
Building this feature reinforced a few important principles:
- Leverage your framework: ARQ's context system was cleaner than managing my own Redis connections
- Test thoroughly: Having comprehensive tests caught the database schema issue early
- Plan for persistence: Schema changes need proper migration strategies
- Start simple: Using the stdlib `HTMLParser` instead of heavy dependencies kept things lightweight
Sometimes the best learning happens when you're knee-deep in circular imports at 2 AM, but that's what makes these late-night coding sessions worthwhile.
MiniRAG is my experimental RAG system built with FastAPI, PostgreSQL, and Redis. You can follow the development journey in my technical blog series.