tfidf

Full name: tenets.core.nlp.tfidf

TF-IDF calculator for relevance ranking.

This module provides TF-IDF text similarity as an optional fallback to the primary BM25 ranking algorithm. The TF-IDF implementation reuses centralized logic from keyword_extractor.

Classes

TFIDFCalculator

Python
TFIDFCalculator(use_stopwords: bool = False)

TF-IDF calculator for ranking.

A simplified wrapper around the NLP TFIDFCalculator that maintains the existing ranking API while delegating to centralized logic.

Initialize TF-IDF calculator.

Parameters:

    use_stopwords (bool, default False): Whether to filter stopwords (uses the 'code' set).

Source code in tenets/core/nlp/tfidf.py
Python
def __init__(self, use_stopwords: bool = False):
    """Initialize TF-IDF calculator.

    Args:
        use_stopwords: Whether to filter stopwords (uses 'code' set)
    """
    self.logger = get_logger(__name__)
    self.use_stopwords = use_stopwords

    # Use centralized NLP TF-IDF calculator
    from tenets.core.nlp.keyword_extractor import TFIDFCalculator as NLPTFIDFCalculator

    self._calculator = NLPTFIDFCalculator(
        use_stopwords=use_stopwords,
        stopword_set="code",  # Use minimal stopwords for code/code-search
    )

    # Expose a mutable stopword set expected by tests; we'll additionally
    # filter tokens against this set in tokenize() when enabled
    if use_stopwords:
        try:
            from tenets.core.nlp.stopwords import StopwordManager

            sw = StopwordManager().get_set("code")
            self.stopwords: Set[str] = set(sw.words) if sw else set()
        except Exception:
            self.stopwords = set()
    else:
        self.stopwords = set()
Attributes
document_vectors property
Python
document_vectors: Dict[str, Dict[str, float]]

Get document vectors.

document_norms property
Python
document_norms: Dict[str, float]

Get document vector norms.

vocabulary property
Python
vocabulary: set

Get vocabulary.

Functions
tokenize
Python
tokenize(text: str) -> List[str]

Tokenize text using NLP tokenizer.

Parameters:

    text (str): Input text.

Returns:

    List[str]: List of tokens.

Source code in tenets/core/nlp/tfidf.py
Python
def tokenize(self, text: str) -> List[str]:
    """Tokenize text using NLP tokenizer.

    Args:
        text: Input text

    Returns:
        List of tokens
    """
    tokens = self._calculator.tokenize(text)
    if self.use_stopwords and self.stopwords:
        sw = self.stopwords
        tokens = [t for t in tokens if t not in sw and t.lower() not in sw]
    return tokens
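
The extra filtering step above drops a token if either its exact form or its lowercase form is a stopword. A minimal sketch of that filter, assuming a plain whitespace tokenizer stands in for the NLP tokenizer:

```python
def filter_tokens(tokens, stopwords):
    # Drop a token when its exact form or its lowercase form is a stopword,
    # so capitalized variants like "Returns" are also filtered.
    return [t for t in tokens if t not in stopwords and t.lower() not in stopwords]

tokens = "The parser Returns THE result".split()
print(filter_tokens(tokens, {"the", "returns"}))
# ['parser', 'result']
```

The double membership check avoids lowercasing the whole token stream, so case-sensitive identifiers survive intact.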
add_document
Python
add_document(doc_id: str, text: str) -> Dict[str, float]

Add document to corpus.

Parameters:

    doc_id (str): Document identifier.
    text (str): Document content.

Returns:

    Dict[str, float]: TF-IDF vector for the document.

Source code in tenets/core/nlp/tfidf.py
Python
def add_document(self, doc_id: str, text: str) -> Dict[str, float]:
    """Add document to corpus.

    Args:
        doc_id: Document identifier
        text: Document content

    Returns:
        TF-IDF vector for document
    """
    # Invalidate IDF cache before/after adding a document to reflect corpus change
    try:
        if hasattr(self._calculator, "idf_cache"):
            self._calculator.idf_cache = {}
    except Exception:
        pass
    result = self._calculator.add_document(doc_id, text)
    try:
        if hasattr(self._calculator, "idf_cache"):
            self._calculator.idf_cache = {}
    except Exception:
        pass
    return result
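
The cache invalidation exists because IDF depends on both corpus size and per-term document frequency, and both change whenever a document is added. A small illustration with hypothetical counts, assuming the common smoothed form idf = log(1 + N/df):

```python
import math

def idf(n_docs, doc_freq):
    # Smoothed inverse document frequency: rarer terms score higher.
    return math.log(1 + n_docs / doc_freq)

before = idf(10, 2)  # term appears in 2 of 10 docs
after = idf(11, 3)   # a newly added doc also contains the term
print(before > after)
# True: the term became more common, so its IDF (and cached value) must drop
```

Any IDF values cached before the addition would silently overweight terms like this one, which is why the wrapper clears the cache on every `add_document` call.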
compute_similarity
Python
compute_similarity(query_text: str, doc_id: str) -> float

Compute similarity between query and document.

Parameters:

    query_text (str): Query text.
    doc_id (str): Document identifier.

Returns:

    float: Cosine similarity score (0-1).

Source code in tenets/core/nlp/tfidf.py
Python
def compute_similarity(self, query_text: str, doc_id: str) -> float:
    """Compute similarity between query and document.

    Args:
        query_text: Query text
        doc_id: Document identifier

    Returns:
        Cosine similarity score (0-1)
    """
    return self._calculator.compute_similarity(query_text, doc_id)
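
The delegated computation is cosine similarity over sparse TF-IDF vectors. A self-contained sketch of that operation, using hypothetical term weights rather than output of the real calculator:

```python
import math

def cosine(a: dict, b: dict) -> float:
    # Dot product over shared terms, divided by the product of vector norms.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

query = {"parse": 0.8, "json": 0.6}
doc = {"parse": 0.5, "json": 0.5, "error": 0.2}
print(round(cosine(query, doc), 3))
# 0.953
```

Because both vectors carry non-negative weights, the result always lands in [0, 1], matching the documented score range.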
get_top_terms
Python
get_top_terms(doc_id: str, n: int = 10) -> List[Tuple[str, float]]

Return the top-n TF-IDF terms for a given document.

Parameters:

    doc_id (str): Document identifier.
    n (int, default 10): Maximum number of terms to return.

Returns:

    List[Tuple[str, float]]: List of (term, score) pairs sorted by score descending.

Source code in tenets/core/nlp/tfidf.py
Python
def get_top_terms(self, doc_id: str, n: int = 10) -> List[Tuple[str, float]]:
    """Return the top-n TF-IDF terms for a given document.

    Args:
        doc_id: Document identifier
        n: Maximum number of terms to return

    Returns:
        List of (term, score) sorted by score descending
    """
    vec = self._calculator.document_vectors.get(doc_id, {})
    if not vec:
        return []
    # Already normalized; just sort and take top-n
    return sorted(vec.items(), key=lambda x: x[1], reverse=True)[:n]
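
The top-n selection is a plain sort over the stored vector. A sketch with illustrative weights (not produced by the real calculator):

```python
# Sort a (term -> weight) mapping by weight descending and keep the first n.
vec = {"tfidf": 0.9, "rank": 0.4, "token": 0.7, "idf": 0.1}
top2 = sorted(vec.items(), key=lambda x: x[1], reverse=True)[:2]
print(top2)
# [('tfidf', 0.9), ('token', 0.7)]
```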
build_corpus
Python
build_corpus(documents: List[Tuple[str, str]]) -> None

Build corpus from documents.

Parameters:

    documents (List[Tuple[str, str]]): List of (doc_id, text) tuples.

Source code in tenets/core/nlp/tfidf.py
Python
def build_corpus(self, documents: List[Tuple[str, str]]) -> None:
    """Build corpus from documents.

    Args:
        documents: List of (doc_id, text) tuples
    """
    self._calculator.build_corpus(documents)
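
As a rough mental model of what a corpus build produces, here is a minimal, self-contained TF-IDF construction over (doc_id, text) tuples. It is a sketch only: it assumes a naive whitespace tokenizer and the smoothed weighting tf * log(1 + N/df), whereas the real class delegates tokenization and weighting to the NLP TFIDFCalculator.

```python
import math
from collections import Counter

def build_vectors(documents):
    # Tokenize each (doc_id, text) pair with a naive whitespace tokenizer.
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents}
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    # Per-document TF-IDF weights, with smoothed IDF.
    return {
        doc_id: {
            term: (count / len(tokens)) * math.log(1 + n / df[term])
            for term, count in Counter(tokens).items()
        }
        for doc_id, tokens in tokenized.items()
    }

vectors = build_vectors([("a", "parse json input"), ("b", "parse xml input fast")])
# "json" appears only in doc "a", so it outweighs the shared term "parse" there.
print(vectors["a"]["json"] > vectors["a"]["parse"])
# True
```

After a build like this, similarity queries reduce to cosine comparisons against the stored per-document vectors.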
