Skip to content

Relevance Ranking System

Unified Ranking Architecture

graph TD
    subgraph "Ranking Strategies"
        FAST[Fast Strategy<br/>Fastest<br/>Keyword + Path Only]
        BALANCED[Balanced Strategy<br/>1.5x slower<br/>BM25 + Structure]
        THOROUGH[Thorough Strategy<br/>4x slower<br/>Full Analysis + ML]
        ML_STRAT[ML Strategy<br/>5x slower<br/>Semantic Embeddings]
    end

    subgraph "Text Analysis (40% in Balanced)"
        KEY_MATCH[Keyword Matching<br/>20%<br/>Direct term hits]
        BM25_SIM[BM25 Similarity<br/>20%<br/>Statistical relevance]
        BM25_SCORE[BM25 Score<br/>15%<br/>Probabilistic ranking]
    end

    subgraph "Code Structure Analysis (25% in Balanced)"
        PATH_REL[Path Relevance<br/>15%<br/>Directory structure]
        IMP_CENT[Import Centrality<br/>10%<br/>Dependency importance]
    end

    subgraph "File Characteristics (15% in Balanced)"
        COMPLEXITY_REL[Complexity Relevance<br/>5%<br/>Code complexity signals]
        FILE_TYPE[File Type Relevance<br/>5%<br/>Extension/type matching]
        CODE_PAT[Code Patterns<br/>5%<br/>AST pattern matching]
    end

    subgraph "Git Signals (10% in Balanced)"
        GIT_REC[Git Recency<br/>5%<br/>Recent changes]
        GIT_FREQ[Git Frequency<br/>5%<br/>Change frequency]
    end

    subgraph "ML Enhancement (Only in ML Strategy)"
        SEM_SIM[Semantic Similarity<br/>25%<br/>Embedding-based understanding]
        LOCAL_EMB[Local Embeddings<br/>sentence-transformers]
        EMBED_CACHE[Embedding Cache<br/>Performance optimization]
    end

    subgraph "Unified Pipeline"
        FILE_DISCOVERY[File Discovery<br/>Scanner + Filters]
        ANALYSIS[Code Analysis<br/>AST + Structure]
        RANKING[Multi-Factor Ranking<br/>Strategy-specific weights]
        AGGREGATION[Context Aggregation<br/>Token optimization]
    end

    FAST --> KEY_MATCH
    BALANCED --> BM25_SIM
    BALANCED --> BM25_SCORE
    THOROUGH --> IMP_CENT
    ML_STRAT --> SEM_SIM

    FILE_DISCOVERY --> ANALYSIS
    ANALYSIS --> RANKING
    RANKING --> AGGREGATION

    KEY_MATCH --> RANKING
    BM25_SIM --> RANKING
    BM25_SCORE --> RANKING
    PATH_REL --> RANKING
    IMP_CENT --> RANKING
    COMPLEXITY_REL --> RANKING
    FILE_TYPE --> RANKING
    CODE_PAT --> RANKING
    GIT_REC --> RANKING
    GIT_FREQ --> RANKING

    SEM_SIM --> LOCAL_EMB
    LOCAL_EMB --> EMBED_CACHE
    EMBED_CACHE --> RANKING

Strategy Comparison

StrategySpeedAccuracyUse CasesFactors Used
FastFastestBasicQuick file discoveryKeyword (60%), Path (30%), File type (10%)
Balanced1.5x slowerGoodDEFAULT Production usageKeyword (20%), BM25 (35%), Structure (25%), Git (10%)
Thorough4x slowerHighComplex codebasesAll balanced factors + enhanced analysis
ML5x slowerHighestSemantic searchEmbeddings (25%) + all thorough factors

Factor Calculation Details

graph LR
    subgraph "Semantic Similarity Calculation"
        CHUNK[Chunk Long Files<br/>1000 chars, 100 overlap]
        EMBED[Generate Embeddings<br/>Local model]
        COSINE[Cosine Similarity]
        CACHE_SEM[Cache Results]
    end

    subgraph "Keyword Matching"
        FILENAME[Filename Match<br/>Weight: 0.4]
        IMPORT_M[Import Match<br/>Weight: 0.3]
        CLASS_FN[Class/Function Name<br/>Weight: 0.25]
        POSITION[Position Weight<br/>Early lines favored]
    end

    subgraph "Import Centrality"
        IN_EDGES[Incoming Edges<br/>Files importing this<br/>70% weight]
        OUT_EDGES[Outgoing Edges<br/>Files this imports<br/>30% weight]
        LOG_SCALE[Logarithmic Scaling<br/>High-degree nodes]
        NORMALIZE[Normalize 0-1]
    end

    subgraph "Git Signals"
        RECENCY[Recency Score<br/>Exponential decay<br/>30-day half-life]
        FREQUENCY[Frequency Score<br/>Log of commit count]
        EXPERTISE[Author Expertise<br/>Contribution volume]
        CHURN[Recent Churn<br/>Lines changed]
    end

    CHUNK --> EMBED
    EMBED --> COSINE
    COSINE --> CACHE_SEM

    FILENAME --> POSITION
    IMPORT_M --> POSITION
    CLASS_FN --> POSITION

    IN_EDGES --> LOG_SCALE
    OUT_EDGES --> LOG_SCALE
    LOG_SCALE --> NORMALIZE

    RECENCY --> EXPERTISE
    FREQUENCY --> EXPERTISE
    EXPERTISE --> CHURN