Building Hierarchical Data Management: Parent-Child Relationships in a RAG System

When building data-intensive applications, there often comes a moment when flat data structures no longer serve your users' needs. Recently, I tackled exactly this challenge while working on a RAG (Retrieval-Augmented Generation) system that manages document sources for AI chatbots.

The problem was clear: users wanted to group related sources together, manage them as collections, and get aggregated insights about their status. What started as a simple feature request evolved into a comprehensive hierarchical data management system with batch operations, cascade deletion, and an intuitive dashboard interface.

The Challenge: From Flat to Hierarchical

Our existing system treated each document source as an independent entity. Users could upload PDFs, add web URLs, and manage individual sources, but there was no way to group related content. Imagine having 20 company policy documents scattered across your source list with no way to organize them as a cohesive "Company Policies" collection.

The solution required implementing a parent-child relationship where:

Parent sources act as containers that group related content
Child sources contain the actual documents or URLs
Aggregated status rolls up from children to show overall progress
Batch operations allow managing multiple sources as a unit

Database Design: Keeping It Simple

The database changes were surprisingly minimal. I added a single parent_id column to the existing sources table:

ALTER TABLE sources ADD COLUMN parent_id UUID REFERENCES sources(id);
CREATE INDEX ix_sources_parent_id ON sources(parent_id);

This self-referencing foreign key approach offers several advantages:

Simplicity: One table handles both parents and children
Flexibility: Easy to query hierarchies with standard SQL
Scalability: Indexed foreign key ensures fast lookups
Integrity: The foreign key constraint prevents a child from referencing a parent that does not exist
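To make the self-referencing pattern concrete, here is a minimal sketch that uses Python's built-in sqlite3 module so it runs anywhere. SQLite, the TEXT ids, the name column, and the sample rows are assumptions for the demo only; the real table uses UUID keys as in the migration above.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table shaped like the sources table (demo only)."""
    conn = sqlite3.connect(":memory:")
    # SQLite leaves foreign key enforcement off unless asked.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        """
        CREATE TABLE sources (
            id TEXT PRIMARY KEY,
            name TEXT NOT NULL,
            parent_id TEXT REFERENCES sources(id)
        )
        """
    )
    conn.execute("CREATE INDEX ix_sources_parent_id ON sources(parent_id)")
    conn.executemany(
        "INSERT INTO sources (id, name, parent_id) VALUES (?, ?, ?)",
        [
            ("p1", "Company Policies", None),
            ("c1", "vacation.pdf", "p1"),
            ("c2", "expenses.pdf", "p1"),
            ("s1", "standalone.pdf", None),
        ],
    )
    return conn


def top_level_with_counts(conn: sqlite3.Connection) -> list:
    """Return top-level sources with their child counts via a self-join."""
    return conn.execute(
        """
        SELECT p.id, p.name, COUNT(c.id) AS child_count
        FROM sources AS p
        LEFT JOIN sources AS c ON c.parent_id = p.id
        WHERE p.parent_id IS NULL
        GROUP BY p.id, p.name
        ORDER BY p.id
        """
    ).fetchall()


def insert_with_missing_parent(conn: sqlite3.Connection) -> bool:
    """Try to insert a child whose parent does not exist; report success."""
    try:
        conn.execute(
            "INSERT INTO sources (id, name, parent_id) VALUES (?, ?, ?)",
            ("c3", "orphan.pdf", "missing"),
        )
    except sqlite3.IntegrityError:
        return False
    return True
```

A single LEFT JOIN against the same table is enough to produce the "(N items)" counts for a dashboard, and the constraint rejects the dangling reference without any application code.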

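I deliberately limited the hierarchy to two levels (a child cannot itself be a parent) to keep the UI and business logic manageable while still covering, by my estimate, about 95% of real-world use cases. The parent_id column by itself does not enforce this limit, so it has to be checked in application logic.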
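A two-level limit is straightforward to check before assigning a parent. The following is a sketch only: HierarchyError, validate_parent_assignment, and the plain dict that maps each source id to its parent id are hypothetical names, not the code from the actual system.

```python
from typing import Dict, Optional


class HierarchyError(ValueError):
    """Raised when an assignment would break the two-level rule."""


def validate_parent_assignment(
    parent_of: Dict[str, Optional[str]],
    source_id: str,
    new_parent_id: Optional[str],
) -> None:
    """Raise HierarchyError if source_id may not be placed under new_parent_id.

    parent_of maps every known source id to its parent id (None = top level).
    """
    if new_parent_id is None:
        return  # Top-level sources are always allowed.
    if new_parent_id == source_id:
        raise HierarchyError("a source cannot be its own parent")
    if new_parent_id not in parent_of:
        raise HierarchyError("parent does not exist")
    if parent_of[new_parent_id] is not None:
        raise HierarchyError("parent is itself a child; only two levels are allowed")
    if any(p == source_id for p in parent_of.values()):
        raise HierarchyError("source already has children and cannot become a child")
```

With only two levels, these four checks also rule out cycles, so no recursive traversal is needed.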

API Design: Batch Operations and Smart Defaults

The API evolved to handle both individual sources and batch operations seamlessly. Here's how the key endpoints work:

Listing Sources with Hierarchy Awareness

from typing import Optional
from uuid import UUID

@router.get("/sources", response_model=list[SourceRead])
async def list_sources(
    parent_id: Optional[UUID] = None,
    include_children: bool = False,
    # ... other params
):
    # Default: return only top-level sources
    # With parent_id: return children of specific parent
    # With include_children: return full hierarchy
    ...  # query logic omitted here

This approach provides three distinct views:

Dashboard view (default): Only top-level sources for clean organization
Parent detail view: Children of a specific parent
Full hierarchy view: Everything together when needed
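The three views boil down to a small filtering rule. Here is a minimal sketch over an in-memory list, assuming that parent_id takes precedence when both parameters are supplied (the real endpoint may resolve that case differently). The Source dataclass and the select_sources name are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Source:
    id: str
    parent_id: Optional[str] = None


def select_sources(
    sources: List[Source],
    parent_id: Optional[str] = None,
    include_children: bool = False,
) -> List[Source]:
    """Apply the dashboard, parent detail, or full hierarchy view."""
    if parent_id is not None:
        # Parent detail view: only the children of one parent.
        return [s for s in sources if s.parent_id == parent_id]
    if include_children:
        # Full hierarchy view: everything, parents and children alike.
        return list(sources)
    # Dashboard view (default): top-level sources only.
    return [s for s in sources if s.parent_id is None]
```

Standalone sources and parents both have no parent_id, so the default view returns them side by side.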

Batch Creation with Smart Validation

@router.post("/sources/batch", response_model=BatchSourceResponse)
async def create_batch_source(request: BatchSourceCreate):
    # Create parent source
    # Validate all children against same tenant/bot
    # Create children with parent_id reference
    # Return aggregated response
    ...  # implementation omitted here

The batch endpoint handles the common case where users want to add multiple URLs or documents as a group, automatically creating the parent container and linking all children.
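The batch flow can be sketched without a database. ChildSpec, Source, and create_batch below are hypothetical stand-ins for the real request and model classes, and the field names are assumptions. The sketch validates every child before creating anything, so a single mismatched child cannot leave a half-built collection behind.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from uuid import UUID, uuid4


@dataclass
class ChildSpec:
    url: str
    tenant_id: str
    bot_id: str


@dataclass
class Source:
    name: str
    tenant_id: str
    bot_id: str
    parent_id: Optional[UUID] = None
    id: UUID = field(default_factory=uuid4)


def create_batch(
    name: str, tenant_id: str, bot_id: str, children: List[ChildSpec]
) -> List[Source]:
    """Return the parent followed by its children, or raise ValueError."""
    if not children:
        raise ValueError("a batch needs at least one child")
    # Validate first: every child must share the parent's tenant and bot.
    mismatched = [
        c.url for c in children if (c.tenant_id, c.bot_id) != (tenant_id, bot_id)
    ]
    if mismatched:
        raise ValueError(f"children belong to a different tenant or bot: {mismatched}")
    parent = Source(name=name, tenant_id=tenant_id, bot_id=bot_id)
    kids = [
        Source(name=c.url, tenant_id=tenant_id, bot_id=bot_id, parent_id=parent.id)
        for c in children
    ]
    return [parent, *kids]
```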

Aggregated Status Magic

One of the trickiest parts was calculating meaningful status for parent sources. A parent's status aggregates from its children:

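from typing import List

def _aggregate_status(children: List[Source]) -> SourceStatus:
    # A parent with no children has nothing to process yet
    if not children:
        return SourceStatus.CREATED

    statuses = [child.status for child in children]

    # Any failed child marks the whole collection as failed
    if any(s == SourceStatus.FAILED for s in statuses):
        return SourceStatus.FAILED
    # Otherwise, unfinished children keep the parent in PROCESSING
    if any(s in (SourceStatus.PROCESSING, SourceStatus.CREATED) for s in statuses):
        return SourceStatus.PROCESSING
    # Every child has completed
    return SourceStatus.COMPLETED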

This gives users immediate insight into their collection's overall health without diving into individual items.
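To see the precedence in action, here is a self-contained restatement of the same rule that works directly on statuses. The enum values are assumptions (only the member names come from the code above), and aggregate_status is an illustrative name.

```python
from enum import Enum
from typing import Iterable


class SourceStatus(Enum):
    CREATED = "created"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


def aggregate_status(statuses: Iterable[SourceStatus]) -> SourceStatus:
    """Roll child statuses up into one parent status.

    Precedence: FAILED, then PROCESSING (which also covers CREATED),
    then COMPLETED. An empty collection reports CREATED.
    """
    seen = set(statuses)
    if not seen:
        return SourceStatus.CREATED
    if SourceStatus.FAILED in seen:
        return SourceStatus.FAILED
    if seen & {SourceStatus.PROCESSING, SourceStatus.CREATED}:
        return SourceStatus.PROCESSING
    return SourceStatus.COMPLETED
```

For example, a collection with one COMPLETED, one PROCESSING, and one FAILED child reports FAILED, because a single failure outranks work that is still in flight.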

Frontend: Expand/Collapse with Visual Hierarchy

The dashboard UI transformation was equally important. Parent sources now display with:

Expandable chevrons to show/hide children
Item count badges showing "(N items)"
Indented child rows with visual arrows when expanded
Aggregate status indicators that reflect the collection's overall state

The JavaScript that manages the expand/collapse state is short:

// Toggle parent row expansion
if (row.classList.contains('parent-row')) {
    const isExpanded = row.classList.contains('expanded');
    if (isExpanded) {
        // Hide children, update chevron
        hideChildren(source.id);
    } else {
        // Load and show children
        showChildren(source.id);
    }
}

File uploads automatically detect when users select multiple files and create a parent container, while single files remain standalone sources.
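Wherever that check lives, the rule itself is tiny. A sketch in Python, assuming a hypothetical plan_upload helper and a made-up default collection name; neither comes from the actual system.

```python
from typing import List, Optional, Tuple


def plan_upload(
    filenames: List[str], collection_name: Optional[str] = None
) -> Tuple[Optional[str], List[str]]:
    """Decide whether an upload needs a parent container.

    Returns (parent_name, child_filenames). parent_name is None when a
    single file should stay a standalone source.
    """
    if not filenames:
        raise ValueError("no files selected")
    if len(filenames) == 1:
        return None, list(filenames)
    # Hypothetical default name, used when the user did not provide one.
    name = collection_name or f"Upload ({len(filenames)} items)"
    return name, list(filenames)
```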

Cascade Operations: Delete with Confidence

Deleting hierarchical data requires careful consideration. When users delete a parent source, they typically expect all related content to disappear with it. I implemented cascade soft-deletion with an explicit confirmation prompt:

async def delete_source(source_id: UUID, current_user: User):
    source = await get_source_or_404(source_id)

    # Cascade the soft-delete to children. Only top-level sources can
    # have children; a standalone source simply yields an empty list.
    if source.parent_id is None:
        children = await _get_children(source_id)
        for child in children:
            child.is_active = False

    source.is_active = False
    await db.commit()

The UI confirms cascade operations with a message such as "This will also delete 5 child sources. Continue?", which gives users clear expectations about the operation's scope.
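The cascade and its confirmation text can be exercised together in memory. This is a sketch under assumptions: the Source dataclass, the dict used as a store, and the wording for the no-children case are all illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Source:
    id: str
    parent_id: Optional[str] = None
    is_active: bool = True


def active_children(sources: Dict[str, Source], source_id: str) -> List[Source]:
    """Return the children of source_id that are not yet soft-deleted."""
    return [s for s in sources.values() if s.parent_id == source_id and s.is_active]


def confirmation_message(sources: Dict[str, Source], source_id: str) -> str:
    """Build the prompt shown before a delete, based on the child count."""
    n = len(active_children(sources, source_id))
    if n == 0:
        return "Delete this source?"  # hypothetical wording for the simple case
    noun = "child source" if n == 1 else "child sources"
    return f"This will also delete {n} {noun}. Continue?"


def soft_delete(sources: Dict[str, Source], source_id: str) -> List[str]:
    """Mark the source and its active children inactive; return affected ids."""
    target = sources[source_id]
    affected = [target, *active_children(sources, source_id)]
    for s in affected:
        s.is_active = False
    return [s.id for s in affected]
```

Returning the affected ids makes it easy to assert in a test that the number in the prompt matches what was actually deleted.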

Lessons Learned

The implementation went surprisingly smoothly, but a few insights emerged:

Default Filtering Changes Behavior: The list_sources() endpoint now returns only top-level sources by default. This is a breaking change from the original "return everything" behavior, but it creates a much better user experience. The existing tests still passed because none of them relied on the old behavior.

Keep Hierarchies Shallow: Limiting to two levels eliminated a whole class of complexity around recursive queries, circular references, and UI visualization challenges. The 80/20 rule applies strongly to hierarchical data.

Batch Operations Need Validation: When creating multiple sources simultaneously, validating that they all belong to the same tenant and bot profile prevents confusing mixed-context collections.

Aggregated Status is Nuanced: Deciding when a parent should show "FAILED" vs "PROCESSING" when children have mixed statuses required careful thought about user expectations.

Testing: Comprehensive Coverage

I added 12 new test cases covering the hierarchy functionality, bringing the total to 77 passing tests. The test suite covers:

Parent-child relationship creation and validation
Cascade deletion behavior
Aggregated status calculation
Batch creation with error handling
API endpoint parameter combinations

Additionally, I updated the Postman collection with 9 new requests that test the full API surface, maintaining a 100% assertion success rate.

What's Next?

The foundation is solid, but there are opportunities for enhancement:

Real-time updates: Auto-refresh parent status when children finish processing
Search and filtering: Find sources within large hierarchies
Bulk operations: Move sources between parents, merge collections
Advanced aggregation: Show progress percentages, error summaries

Wrapping Up

Building hierarchical data management taught me that the best solutions often involve minimal database changes paired with thoughtful API design and clear UI patterns. By keeping the data model simple and focusing on user workflows, we transformed a flat source list into an organized, intuitive content management system.

The key was recognizing that hierarchy isn't just about data relationships—it's about giving users mental models that match how they think about their content. Sometimes the most powerful features are the ones that feel obvious once they exist.

The complete implementation includes database migrations, API endpoints, frontend UI, and comprehensive tests. All code is production-ready and has been deployed to the live system.