bm25

Full name: tenets.core.nlp.bm25

BM25 ranking algorithm implementation.

BM25 (Best Matching 25) is a probabilistic ranking function that improves upon TF-IDF for information retrieval. This module provides a robust, well-documented implementation optimized for code search.

Key Features
  • Term frequency saturation to prevent over-weighting repeated terms
  • Sophisticated document length normalization
  • Configurable parameters for different document types
  • Efficient sparse representation for large corpora
  • Cache-friendly design for repeated queries
Mathematical Foundation

BM25 score for document D given query Q:

Score(D,Q) = Σ IDF(qi) × [f(qi,D) × (k1 + 1)] / [f(qi,D) + k1 × (1 - b + b × |D|/avgdl)]

Where:
    qi      = each query term
    f(qi,D) = frequency of term qi in document D
    |D|     = length of document D in tokens
    avgdl   = average document length in the corpus
    k1      = term frequency saturation parameter (default: 1.2)
    b       = length normalization parameter (default: 0.75)

IDF Component: IDF(qi) = log[(N - df(qi) + 0.5) / (df(qi) + 0.5) + 1]

Where:
    N = total number of documents
    df(qi) = number of documents containing term qi
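Worked through once by hand, the formulas are mechanical; the sketch below scores a single term against a toy document using nothing but the definitions above (plain Python, independent of this module; all numbers are made up for illustration):

Python
import math

# Toy statistics for one term and one document (illustrative values only)
N = 3                     # total documents in the corpus
df = 2                    # documents containing the term
tf = 2                    # occurrences of the term in the scored document
doc_len, avgdl = 6, 5.0   # document length and corpus-average length
k1, b = 1.2, 0.75         # default parameters

idf = math.log((N - df + 0.5) / (df + 0.5) + 1)    # smoothed IDF, always positive
norm = 1 - b + b * (doc_len / avgdl)               # length-normalization factor
score = idf * (tf * (k1 + 1)) / (tf + k1 * norm)   # per-term contribution
print(f"idf={idf:.3f} norm={norm:.3f} score={score:.3f}")
# The full Score(D,Q) is the sum of this quantity over all query terms.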
Usage

Python
from tenets.core.nlp.bm25 import BM25Calculator

# Initialize the calculator
bm25 = BM25Calculator(k1=1.2, b=0.75)

# Build the corpus
documents = [
    ("doc1", "Python web framework Django"),
    ("doc2", "Flask is a lightweight Python framework"),
    ("doc3", "JavaScript React framework for UI"),
]
bm25.build_corpus(documents)

# Score documents for a query
scores = bm25.get_scores("Python framework")
for doc_id, score in scores:
    print(f"{doc_id}: {score:.3f}")

References
  • Robertson & Walker (1994): "Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval"
  • Trotman et al. (2014): "Improvements to BM25 and language models examined"

Classes

BM25Calculator

Python
BM25Calculator(k1: float = 1.2, b: float = 0.75, epsilon: float = 0.25, use_stopwords: bool = False, stopword_set: str = 'code')

BM25 ranking algorithm with advanced features for code search.

This implementation provides:
  • Configurable term saturation (k1) and length normalization (b)
  • Efficient tokenization with optional stopword filtering
  • IDF caching for performance
  • Support for incremental corpus updates
  • Query expansion capabilities
  • Detailed scoring explanations for debugging
Attributes:
    k1 (float): Controls term frequency saturation. Higher values mean less saturation (more weight to term frequency). Typical range: 0.5-2.0; default: 1.2.
    b (float): Controls document length normalization. 0 = no normalization, 1 = full normalization. Typical range: 0.5-0.8; default: 0.75.
    epsilon (float): Small constant to prevent division by zero.

Initialize BM25 calculator with configurable parameters.

Parameters:
    k1 (float, default: 1.2): Term frequency saturation parameter. Lower values (0.5-1.0) work well for short queries, higher values (1.5-2.0) for longer queries. 1.2 is a good general-purpose value.
    b (float, default: 0.75): Length normalization parameter. Set to 0 to disable length normalization, 1 for full normalization. 0.75 gives moderate normalization, good for mixed-length documents.
    epsilon (float, default: 0.25): Small constant for numerical stability.
    use_stopwords (bool, default: False): Whether to filter common words.
    stopword_set (str, default: 'code'): Which stopword set to use ('code' for programming, 'english' for natural language).
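As a rough illustration of these ranges, the configurations below are plausible starting points rather than tuned recommendations:

Python
from tenets.core.nlp.bm25 import BM25Calculator

# Short keyword queries over code: mild saturation, moderate normalization
code_bm25 = BM25Calculator(k1=1.0, b=0.75, stopword_set="code")

# Longer natural-language queries: give term frequency more weight
prose_bm25 = BM25Calculator(k1=1.8, b=0.6, use_stopwords=True, stopword_set="english")

# b=0 disables length normalization entirely
flat_bm25 = BM25Calculator(b=0.0)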

Source code in tenets/core/nlp/bm25.py
Python
def __init__(
    self,
    k1: float = 1.2,
    b: float = 0.75,
    epsilon: float = 0.25,
    use_stopwords: bool = False,
    stopword_set: str = "code",
):
    """Initialize BM25 calculator with configurable parameters.

    Args:
        k1: Term frequency saturation parameter. Lower values (0.5-1.0)
            work well for short queries, higher values (1.5-2.0) for
            longer queries. Default: 1.2 (good general purpose value)
        b: Length normalization parameter. Set to 0 to disable length
           normalization, 1 for full normalization. Default: 0.75
           (moderate normalization, good for mixed-length documents)
        epsilon: Small constant for numerical stability
        use_stopwords: Whether to filter common words
        stopword_set: Which stopword set to use ('code' for programming,
                     'english' for natural language)
    """
    self.logger = get_logger(__name__)

    # Validate and set parameters
    if k1 < 0:
        raise ValueError(f"k1 must be non-negative, got {k1}")
    if not 0 <= b <= 1:
        raise ValueError(f"b must be between 0 and 1, got {b}")

    self.k1 = k1
    self.b = b
    self.epsilon = epsilon
    self.use_stopwords = use_stopwords
    self.stopword_set = stopword_set

    # Initialize tokenizer
    from .tokenizer import CodeTokenizer

    self.tokenizer = CodeTokenizer(use_stopwords=use_stopwords)

    # Core data structures
    self.document_count = 0
    self.document_frequency: Dict[str, int] = defaultdict(int)
    self.document_lengths: Dict[str, int] = {}
    self.document_tokens: Dict[str, List[str]] = {}
    self.average_doc_length = 0.0
    self.vocabulary: Set[str] = set()

    # Caching structures for performance
    self.idf_cache: Dict[str, float] = {}
    self._score_cache: Dict[Tuple[str, str], float] = {}

    # Statistics tracking
    self.stats = {
        "queries_processed": 0,
        "cache_hits": 0,
        "cache_misses": 0,
        "documents_added": 0,
    }

    self.logger.info(
        f"BM25 initialized with k1={k1}, b={b}, "
        f"stopwords={'enabled' if use_stopwords else 'disabled'}"
    )
Functions
tokenize
Python
tokenize(text: str) -> List[str]

Tokenize text using code-aware tokenizer.

Handles various code constructs:
  • CamelCase and snake_case splitting
  • Preservation of important symbols
  • Number and identifier extraction
Parameters:
    text (str): Input text to tokenize.

Returns:
    List[str]: List of tokens, lowercased and filtered.
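A minimal sketch of calling the tokenizer; the exact token list depends on CodeTokenizer's splitting rules, so the output described in the comment is only indicative:

Python
from tenets.core.nlp.bm25 import BM25Calculator

bm25 = BM25Calculator()
tokens = bm25.tokenize("HTTPClient.send_request(timeout=30)")
print(tokens)
# Indicatively, CamelCase and snake_case are split and lowercased,
# yielding pieces like "http", "client", "send", "request", "timeout", "30"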

Source code in tenets/core/nlp/bm25.py
Python
def tokenize(self, text: str) -> List[str]:
    """Tokenize text using code-aware tokenizer.

    Handles various code constructs:
        - CamelCase and snake_case splitting
        - Preservation of important symbols
        - Number and identifier extraction

    Args:
        text: Input text to tokenize

    Returns:
        List of tokens, lowercased and filtered
    """
    return self.tokenizer.tokenize(text)
add_document
Python
add_document(doc_id: str, text: str) -> None

Add a document to the BM25 corpus.

Updates all corpus statistics including document frequency, average document length, and vocabulary.

Parameters:
    doc_id (str): Unique identifier for the document.
    text (str): Document content.

Note

Adding documents invalidates the IDF and score caches. For bulk loading, use build_corpus() instead.
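For example, growing a live corpus one document at a time (a sketch with illustrative contents; per the note above, each call invalidates the caches, so bulk loads belong in build_corpus()):

Python
from tenets.core.nlp.bm25 import BM25Calculator

bm25 = BM25Calculator()
bm25.build_corpus([("a.py", "django web framework"),
                   ("b.py", "flask is a python framework")])

# Incremental update: corpus statistics are refreshed, caches are cleared
bm25.add_document("c.py", "fastapi async python framework")
print(bm25.get_scores("python framework"))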

Source code in tenets/core/nlp/bm25.py
Python
def add_document(self, doc_id: str, text: str) -> None:
    """Add a document to the BM25 corpus.

    Updates all corpus statistics including document frequency,
    average document length, and vocabulary.

    Args:
        doc_id: Unique identifier for the document
        text: Document content

    Note:
        Adding documents invalidates the IDF and score caches.
        For bulk loading, use build_corpus() instead.
    """
    tokens = self.tokenize(text)

    # Handle empty documents
    if not tokens:
        self.document_lengths[doc_id] = 0
        self.document_tokens[doc_id] = []
        self.logger.debug(f"Added empty document: {doc_id}")
        return

    # Remove old version if updating
    if doc_id in self.document_tokens:
        self._remove_document(doc_id)

    # Update corpus statistics
    self.document_count += 1
    self.document_lengths[doc_id] = len(tokens)
    self.document_tokens[doc_id] = tokens

    # Update document frequency for unique terms
    unique_terms = set(tokens)
    for term in unique_terms:
        self.document_frequency[term] += 1
        self.vocabulary.add(term)

    # Update average document length incrementally
    total_length = sum(self.document_lengths.values())
    self.average_doc_length = total_length / max(1, self.document_count)

    # Invalidate caches
    self.idf_cache.clear()
    self._score_cache.clear()

    self.stats["documents_added"] += 1

    self.logger.debug(
        f"Added document {doc_id}: {len(tokens)} tokens, "
        f"corpus now has {self.document_count} docs"
    )
build_corpus
Python
build_corpus(documents: List[Tuple[str, str]]) -> None

Build BM25 corpus from multiple documents efficiently.

More efficient than repeated add_document() calls as it calculates statistics once at the end.

Parameters:
    documents (List[Tuple[str, str]]): List of (doc_id, text) tuples.

Example

Python
documents = [
    ("file1.py", "import os\nclass FileHandler"),
    ("file2.py", "from pathlib import Path"),
]
bm25.build_corpus(documents)

Source code in tenets/core/nlp/bm25.py
Python
def build_corpus(self, documents: List[Tuple[str, str]]) -> None:
    """Build BM25 corpus from multiple documents efficiently.

    More efficient than repeated add_document() calls as it
    calculates statistics once at the end.

    Args:
        documents: List of (doc_id, text) tuples

    Example:
        >>> documents = [
        ...     ("file1.py", "import os\\nclass FileHandler"),
        ...     ("file2.py", "from pathlib import Path")
        ... ]
        >>> bm25.build_corpus(documents)
    """
    self.logger.info(f"Building corpus from {len(documents)} documents")

    # Clear existing data
    self.document_count = 0
    self.document_frequency.clear()
    self.document_lengths.clear()
    self.document_tokens.clear()
    self.vocabulary.clear()
    self.idf_cache.clear()
    self._score_cache.clear()

    # Process all documents
    total_length = 0
    for doc_id, text in documents:
        tokens = self.tokenize(text)

        if not tokens:
            self.document_lengths[doc_id] = 0
            self.document_tokens[doc_id] = []
            continue

        self.document_count += 1
        self.document_lengths[doc_id] = len(tokens)
        self.document_tokens[doc_id] = tokens
        total_length += len(tokens)

        # Update document frequency
        unique_terms = set(tokens)
        for term in unique_terms:
            self.document_frequency[term] += 1
            self.vocabulary.add(term)

    # Calculate average document length
    self.average_doc_length = total_length / max(1, self.document_count)

    self.stats["documents_added"] = self.document_count

    self.logger.info(
        f"Corpus built: {self.document_count} docs, "
        f"{len(self.vocabulary)} unique terms, "
        f"avg length: {self.average_doc_length:.1f} tokens"
    )
compute_idf
Python
compute_idf(term: str) -> float

Compute IDF (Inverse Document Frequency) for a term.

Uses the standard BM25 IDF formula with smoothing to handle edge cases and prevent negative values.

Formula

IDF(term) = log[(N - df + 0.5) / (df + 0.5) + 1]

Parameters:
    term (str): Term to compute IDF for.

Returns:
    float: IDF value (always positive due to the +1 in the formula).
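To see how the smoothing behaves at the extremes, the formula can be evaluated standalone (plain Python; the corpus size is an arbitrary illustration):

Python
import math

def idf(N: int, df: int) -> float:
    """IDF(term) = log[(N - df + 0.5) / (df + 0.5) + 1]."""
    return math.log((N - df + 0.5) / (df + 0.5) + 1)

N = 1000
print(idf(N, 1))     # rare term: ~6.50, weighted heavily
print(idf(N, 500))   # term in half the corpus: ~0.69 (log 2)
print(idf(N, 1000))  # term in every document: ~0.0005, near zero but still positive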

Source code in tenets/core/nlp/bm25.py
Python
def compute_idf(self, term: str) -> float:
    """Compute IDF (Inverse Document Frequency) for a term.

    Uses the standard BM25 IDF formula with smoothing to handle
    edge cases and prevent negative values.

    Formula:
        IDF(term) = log[(N - df + 0.5) / (df + 0.5) + 1]

    Args:
        term: Term to compute IDF for

    Returns:
        IDF value (always positive due to +1 in formula)
    """
    # Check cache first
    if term in self.idf_cache:
        self.stats["cache_hits"] += 1
        return self.idf_cache[term]

    self.stats["cache_misses"] += 1

    # Get document frequency
    df = self.document_frequency.get(term, 0)

    # BM25 IDF formula with smoothing
    # Adding 1 ensures IDF is always positive
    numerator = self.document_count - df + 0.5
    denominator = df + 0.5
    idf = math.log(numerator / denominator + 1)

    # Cache the result
    self.idf_cache[term] = idf

    return idf
score_document
Python
score_document(query_tokens: List[str], doc_id: str, explain: bool = False) -> float

Calculate BM25 score for a document given query tokens.

Implements the full BM25 scoring formula with term saturation and length normalization.

Parameters:
    query_tokens (List[str]): Tokenized query terms.
    doc_id (str): Document identifier to score.
    explain (bool, default: False): If True, return a detailed scoring breakdown.

Returns:
    float: BM25 score (higher is more relevant). If explain=True, returns a tuple of (score, explanation_dict) instead.
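A sketch of using explain=True to inspect per-term contributions (corpus contents are illustrative):

Python
from tenets.core.nlp.bm25 import BM25Calculator

bm25 = BM25Calculator()
bm25.build_corpus([("doc1", "Python web framework Django"),
                   ("doc2", "Flask is a lightweight Python framework")])

query_tokens = bm25.tokenize("python framework")
score, detail = bm25.score_document(query_tokens, "doc1", explain=True)

# Each matched term reports its tf, idf, tf_component, and score contribution
for term, parts in detail["term_scores"].items():
    print(term, parts["tf"], round(parts["idf"], 3), round(parts["score"], 3))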

Source code in tenets/core/nlp/bm25.py
Python
def score_document(self, query_tokens: List[str], doc_id: str, explain: bool = False) -> float:
    """Calculate BM25 score for a document given query tokens.

    Implements the full BM25 scoring formula with term saturation
    and length normalization.

    Args:
        query_tokens: Tokenized query terms
        doc_id: Document identifier to score
        explain: If True, return detailed scoring breakdown

    Returns:
        BM25 score (higher is more relevant)
        If explain=True, returns tuple of (score, explanation_dict)
    """
    # Check if document exists
    if doc_id not in self.document_tokens:
        return (0.0, {}) if explain else 0.0

    doc_tokens = self.document_tokens[doc_id]
    if not doc_tokens:
        return (0.0, {"empty_doc": True}) if explain else 0.0

    # Check score cache
    cache_key = (tuple(query_tokens), doc_id)
    if cache_key in self._score_cache and not explain:
        self.stats["cache_hits"] += 1
        return self._score_cache[cache_key]

    self.stats["cache_misses"] += 1

    # Get document statistics
    doc_length = self.document_lengths[doc_id]
    doc_tf = Counter(doc_tokens)

    # Length normalization factor
    if self.average_doc_length > 0:
        norm_factor = 1 - self.b + self.b * (doc_length / self.average_doc_length)
    else:
        norm_factor = 1.0

    # Calculate score
    score = 0.0
    term_scores = {} if explain else None

    for term in set(query_tokens):  # Use set to handle repeated query terms
        if term not in self.vocabulary:
            continue

        # Get term frequency in document
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue

        # IDF component
        idf = self.compute_idf(term)

        # BM25 term frequency component with saturation
        tf_component = (tf * (self.k1 + 1)) / (tf + self.k1 * norm_factor)

        # Term contribution to score
        term_score = idf * tf_component
        score += term_score

        if explain:
            term_scores[term] = {
                "tf": tf,
                "idf": idf,
                "tf_component": tf_component,
                "score": term_score,
            }

    # Cache the score
    self._score_cache[cache_key] = score

    if explain:
        explanation = {
            "total_score": score,
            "doc_length": doc_length,
            "avg_doc_length": self.average_doc_length,
            "norm_factor": norm_factor,
            "term_scores": term_scores,
        }
        return score, explanation

    return score
get_scores
Python
get_scores(query: str, doc_ids: Optional[List[str]] = None) -> List[Tuple[str, float]]

Get BM25 scores for all documents or a subset.

Parameters:
    query (str): Search query string.
    doc_ids (Optional[List[str]], default: None): Optional list of document IDs to score. If None, scores all documents.

Returns:
    List[Tuple[str, float]]: List of (doc_id, score) tuples sorted by score (descending).
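Scoring a pre-filtered candidate subset rather than the whole corpus (sketch with an illustrative corpus):

Python
from tenets.core.nlp.bm25 import BM25Calculator

bm25 = BM25Calculator()
bm25.build_corpus([("doc1", "Python web framework Django"),
                   ("doc2", "Flask is a lightweight Python framework"),
                   ("doc3", "JavaScript React framework for UI")])

# Only doc1 and doc3 are scored; doc2 is never considered
for doc_id, score in bm25.get_scores("python framework", doc_ids=["doc1", "doc3"]):
    print(f"{doc_id}: {score:.3f}")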

Source code in tenets/core/nlp/bm25.py
Python
def get_scores(
    self, query: str, doc_ids: Optional[List[str]] = None
) -> List[Tuple[str, float]]:
    """Get BM25 scores for all documents or a subset.

    Args:
        query: Search query string
        doc_ids: Optional list of document IDs to score.
                If None, scores all documents.

    Returns:
        List of (doc_id, score) tuples sorted by score (descending)
    """
    self.stats["queries_processed"] += 1

    # Tokenize query
    query_tokens = self.tokenize(query)
    if not query_tokens:
        self.logger.warning(f"Empty query after tokenization: '{query}'")
        return []

    # Determine documents to score
    if doc_ids is None:
        doc_ids = list(self.document_tokens.keys())

    # Calculate scores
    scores = []
    for doc_id in doc_ids:
        score = self.score_document(query_tokens, doc_id)
        if score > 0:  # Only include documents with positive scores
            scores.append((doc_id, score))

    # Sort by score (descending)
    scores.sort(key=lambda x: x[1], reverse=True)

    return scores
get_top_k
Python
get_top_k(query: str, k: int = 10, threshold: float = 0.0) -> List[Tuple[str, float]]

Get top-k documents by BM25 score.

Parameters:
    query (str): Search query.
    k (int, default: 10): Number of top documents to return.
    threshold (float, default: 0.0): Minimum score; documents scoring below it are filtered out.

Returns:
    List[Tuple[str, float]]: List of top-k (doc_id, score) tuples.
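For instance, capping results and dropping weak matches in one call (illustrative corpus; the threshold value is arbitrary):

Python
from tenets.core.nlp.bm25 import BM25Calculator

bm25 = BM25Calculator()
bm25.build_corpus([("doc1", "Python web framework Django"),
                   ("doc2", "JavaScript React framework for UI")])

# At most 5 results, and anything scoring below 0.25 is dropped
for doc_id, score in bm25.get_top_k("python framework", k=5, threshold=0.25):
    print(doc_id, score)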

Source code in tenets/core/nlp/bm25.py
Python
def get_top_k(self, query: str, k: int = 10, threshold: float = 0.0) -> List[Tuple[str, float]]:
    """Get top-k documents by BM25 score.

    Args:
        query: Search query
        k: Number of top documents to return
        threshold: Minimum score threshold (documents below are filtered)

    Returns:
        List of top-k (doc_id, score) tuples
    """
    scores = self.get_scores(query)

    # Filter by threshold
    if threshold > 0:
        scores = [(doc_id, score) for doc_id, score in scores if score >= threshold]

    return scores[:k]
compute_similarity
Python
compute_similarity(query: str, doc_id: str) -> float

Compute normalized similarity score between query and document.

Returns a value between 0 and 1 for consistency with other similarity measures.

Parameters:
    query (str): Query text.
    doc_id (str): Document identifier.

Returns:
    float: Normalized similarity score (0-1).
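Per the source below, the normalization maps a raw BM25 score s to s / (s + 10), a saturating curve that approaches but never reaches 1. A quick check of the mapping:

Python
# Raw score -> normalized similarity via s / (s + 10.0)
for raw in (0.0, 1.0, 5.0, 10.0, 40.0):
    print(raw, round(raw / (raw + 10.0), 3))
# 0.0 -> 0.0, 1.0 -> 0.091, 5.0 -> 0.333, 10.0 -> 0.5, 40.0 -> 0.8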

Source code in tenets/core/nlp/bm25.py
Python
def compute_similarity(self, query: str, doc_id: str) -> float:
    """Compute normalized similarity score between query and document.

    Returns a value between 0 and 1 for consistency with other
    similarity measures.

    Args:
        query: Query text
        doc_id: Document identifier

    Returns:
        Normalized similarity score (0-1)
    """
    query_tokens = self.tokenize(query)
    if not query_tokens:
        return 0.0

    # Get raw BM25 score
    score = self.score_document(query_tokens, doc_id)

    # Normalize score to 0-1 range
    # Using sigmoid-like normalization for better distribution
    normalized = score / (score + 10.0)  # 10.0 is empirically chosen

    return min(1.0, normalized)
explain_score
Python
explain_score(query: str, doc_id: str) -> Dict

Get detailed explanation of BM25 scoring for debugging.

Parameters:
    query (str): Query text.
    doc_id (str): Document to explain scoring for.

Returns:
    Dict: Dictionary with detailed scoring breakdown.

Source code in tenets/core/nlp/bm25.py
Python
def explain_score(self, query: str, doc_id: str) -> Dict:
    """Get detailed explanation of BM25 scoring for debugging.

    Args:
        query: Query text
        doc_id: Document to explain scoring for

    Returns:
        Dictionary with detailed scoring breakdown
    """
    query_tokens = self.tokenize(query)

    if not query_tokens:
        return {"error": "Empty query after tokenization"}

    score, explanation = self.score_document(query_tokens, doc_id, explain=True)

    # Add query information
    explanation["query"] = query
    explanation["query_tokens"] = query_tokens
    explanation["parameters"] = {"k1": self.k1, "b": self.b}

    return explanation
get_stats
Python
get_stats() -> Dict

Get calculator statistics for monitoring.

Returns:
    Dict: Dictionary with usage statistics.

Source code in tenets/core/nlp/bm25.py
Python
def get_stats(self) -> Dict:
    """Get calculator statistics for monitoring.

    Returns:
        Dictionary with usage statistics
    """
    return {
        **self.stats,
        "corpus_size": self.document_count,
        "vocabulary_size": len(self.vocabulary),
        "avg_doc_length": self.average_doc_length,
        "idf_cache_size": len(self.idf_cache),
        "score_cache_size": len(self._score_cache),
        "cache_hit_rate": (
            self.stats["cache_hits"]
            / max(1, self.stats["cache_hits"] + self.stats["cache_misses"])
        ),
    }
clear_cache
Python
clear_cache() -> None

Clear all caches to free memory.

Source code in tenets/core/nlp/bm25.py
Python
def clear_cache(self) -> None:
    """Clear all caches to free memory."""
    self.idf_cache.clear()
    self._score_cache.clear()
    self.logger.debug("Caches cleared")

Functions

create_bm25

Python
create_bm25(documents: List[Tuple[str, str]], **kwargs) -> BM25Calculator

Create and initialize a BM25 calculator with documents.

Parameters:
    documents (List[Tuple[str, str]]): List of (doc_id, text) tuples.
    **kwargs: Additional arguments passed to BM25Calculator.

Returns:
    BM25Calculator: Initialized BM25Calculator with the corpus built.
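A one-step equivalent of the Usage example above (sketch):

Python
from tenets.core.nlp.bm25 import create_bm25

bm25 = create_bm25(
    [("doc1", "Python web framework Django"),
     ("doc2", "Flask is a lightweight Python framework")],
    k1=1.2, b=0.75,
)
print(bm25.get_top_k("python framework", k=2))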

Source code in tenets/core/nlp/bm25.py
Python
def create_bm25(documents: List[Tuple[str, str]], **kwargs) -> BM25Calculator:
    """Create and initialize a BM25 calculator with documents.

    Args:
        documents: List of (doc_id, text) tuples
        **kwargs: Additional arguments for BM25Calculator

    Returns:
        Initialized BM25Calculator with corpus built
    """
    calculator = BM25Calculator(**kwargs)
    calculator.build_corpus(documents)
    return calculator