tfidf¶
Full name: tenets.core.nlp.tfidf
TF-IDF calculator for relevance ranking.
This module provides TF-IDF text similarity as an optional fallback to the primary BM25 ranking algorithm. The TF-IDF implementation reuses centralized logic from keyword_extractor.
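For orientation, the textbook TF-IDF weight multiplies a term's frequency within a document by the log of its rarity across the corpus; the centralized calculator may apply additional smoothing and normalization beyond this minimal sketch:

```python
import math

def tfidf_weight(tf: int, df: int, n_docs: int) -> float:
    """Textbook TF-IDF: term frequency scaled by inverse document frequency.

    Assumes df >= 1 (the term occurs in at least one document).
    """
    return tf * math.log(n_docs / df)
```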
Classes¶
TFIDFCalculator¶
TF-IDF calculator for ranking.
Simplified wrapper around NLP TFIDFCalculator to maintain existing ranking API while using centralized logic.
Initialize TF-IDF calculator.
| PARAMETER | DESCRIPTION |
| --- | --- |
| `use_stopwords` | Whether to filter stopwords (uses the 'code' set). TYPE: `bool` |
Source code in tenets/core/nlp/tfidf.py
```python
def __init__(self, use_stopwords: bool = False):
    """Initialize TF-IDF calculator.

    Args:
        use_stopwords: Whether to filter stopwords (uses 'code' set)
    """
    self.logger = get_logger(__name__)
    self.use_stopwords = use_stopwords
    # Use centralized NLP TF-IDF calculator
    from tenets.core.nlp.keyword_extractor import TFIDFCalculator as NLPTFIDFCalculator

    self._calculator = NLPTFIDFCalculator(
        use_stopwords=use_stopwords,
        stopword_set="code",  # Use minimal stopwords for code/code-search
    )
    # Expose a mutable stopword set expected by tests; we'll additionally
    # filter tokens against this set in tokenize() when enabled
    if use_stopwords:
        try:
            from tenets.core.nlp.stopwords import StopwordManager

            sw = StopwordManager().get_set("code")
            self.stopwords: Set[str] = set(sw.words) if sw else set()
        except Exception:
            self.stopwords = set()
    else:
        self.stopwords = set()
```
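A minimal usage sketch (the document IDs and contents below are illustrative):

```python
calc = TFIDFCalculator(use_stopwords=True)

# Build up a small corpus; add_document returns each document's TF-IDF vector.
calc.add_document("auth.py", "def login(user, password): ...")
calc.add_document("db.py", "def connect(dsn): ...")
```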
Functions¶
tokenize¶
Tokenize text using NLP tokenizer.
| PARAMETER | DESCRIPTION |
| --- | --- |
| `text` | Input text. TYPE: `str` |

| RETURNS | DESCRIPTION |
| --- | --- |
| `List[str]` | List of tokens |
Source code in tenets/core/nlp/tfidf.py
```python
def tokenize(self, text: str) -> List[str]:
    """Tokenize text using NLP tokenizer.

    Args:
        text: Input text

    Returns:
        List of tokens
    """
    tokens = self._calculator.tokenize(text)
    if self.use_stopwords and self.stopwords:
        sw = self.stopwords
        tokens = [t for t in tokens if t not in sw and t.lower() not in sw]
    return tokens
```
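For example, with stopword filtering enabled (the exact tokens depend on the underlying NLP tokenizer):

```python
calc = TFIDFCalculator(use_stopwords=True)
tokens = calc.tokenize("parse the config file")
# Words in the 'code' stopword set, such as "the", are filtered
# both case-sensitively and case-insensitively.
```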
add_document¶
Add document to corpus.
| PARAMETER | DESCRIPTION |
| --- | --- |
| `doc_id` | Document identifier. TYPE: `str` |
| `text` | Document content. TYPE: `str` |

| RETURNS | DESCRIPTION |
| --- | --- |
| `Dict[str, float]` | TF-IDF vector for the document |
Source code in tenets/core/nlp/tfidf.py
```python
def add_document(self, doc_id: str, text: str) -> Dict[str, float]:
    """Add document to corpus.

    Args:
        doc_id: Document identifier
        text: Document content

    Returns:
        TF-IDF vector for document
    """
    # Invalidate IDF cache before/after adding a document to reflect corpus change
    try:
        if hasattr(self._calculator, "idf_cache"):
            self._calculator.idf_cache = {}
    except Exception:
        pass
    result = self._calculator.add_document(doc_id, text)
    try:
        if hasattr(self._calculator, "idf_cache"):
            self._calculator.idf_cache = {}
    except Exception:
        pass
    return result
```
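Clearing idf_cache both before and after the delegation guards against stale inverse-document-frequency values surviving the corpus change. A brief usage sketch:

```python
vec = calc.add_document("readme.md", "installation guide for the cli")
# vec maps each token to its TF-IDF weight within this document
```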
compute_similarity¶
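As background, TF-IDF similarity is conventionally the cosine of the angle between two sparse vectors. The standalone sketch below illustrates that computation; it is not necessarily this method's actual implementation:

```python
import math
from typing import Dict

def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse TF-IDF vectors."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```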
get_top_terms¶
Return the top-n TF-IDF terms for a given document.
| PARAMETER | DESCRIPTION |
| --- | --- |
| `doc_id` | Document identifier. TYPE: `str` |
| `n` | Maximum number of terms to return. TYPE: `int` |

| RETURNS | DESCRIPTION |
| --- | --- |
| `List[Tuple[str, float]]` | List of (term, score) pairs sorted by score, descending |
Source code in tenets/core/nlp/tfidf.py
```python
def get_top_terms(self, doc_id: str, n: int = 10) -> List[Tuple[str, float]]:
    """Return the top-n TF-IDF terms for a given document.

    Args:
        doc_id: Document identifier
        n: Maximum number of terms to return

    Returns:
        List of (term, score) sorted by score descending
    """
    vec = self._calculator.document_vectors.get(doc_id, {})
    if not vec:
        return []
    # Already normalized; just sort and take top-n
    return sorted(vec.items(), key=lambda x: x[1], reverse=True)[:n]
```
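Example usage (the document ID is illustrative):

```python
top = calc.get_top_terms("auth.py", n=5)
for term, score in top:
    # At most five (term, score) pairs, highest score first
    print(f"{term}: {score:.3f}")
```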
build_corpus¶
Build corpus from documents.
| PARAMETER | DESCRIPTION |
| --- | --- |
| `documents` | List of (doc_id, text) tuples |
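A usage sketch, assuming build_corpus simply feeds each (doc_id, text) pair through add_document:

```python
calc = TFIDFCalculator()
calc.build_corpus([
    ("auth.py", "def login(user, password): ..."),
    ("db.py", "def connect(dsn): ..."),
])
```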