tfidf¶
Full name: tenets.core.nlp.tfidf
TF-IDF calculator for relevance ranking.
This module provides TF-IDF text similarity as an optional fallback to the primary BM25 ranking algorithm. The TF-IDF implementation reuses centralized logic from keyword_extractor.
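For orientation, the textbook TF-IDF weight multiplies a term's frequency within a document by the log of its rarity across the corpus; the centralized calculator may apply additional smoothing and normalization beyond this minimal sketch:

```python
import math

def tfidf_weight(tf: int, df: int, n_docs: int) -> float:
    """Textbook TF-IDF: term frequency scaled by inverse document frequency.

    Assumes df >= 1 (the term occurs in at least one document).
    """
    return tf * math.log(n_docs / df)
```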
Classes¶
TFIDFCalculator¶
TF-IDF calculator for ranking.
Simplified wrapper around NLP TFIDFCalculator to maintain existing ranking API while using centralized logic.
Initialize TF-IDF calculator.
| PARAMETER | DESCRIPTION |
| --- | --- |
| `use_stopwords` | Whether to filter stopwords (uses the 'code' set). TYPE: `bool` |
Source code in tenets/core/nlp/tfidf.py
```python
def __init__(self, use_stopwords: bool = False):
    """Initialize TF-IDF calculator.

    Args:
        use_stopwords: Whether to filter stopwords (uses 'code' set)
    """
    self.logger = get_logger(__name__)
    self.use_stopwords = use_stopwords
    # Use centralized NLP TF-IDF calculator
    from tenets.core.nlp.keyword_extractor import TFIDFCalculator as NLPTFIDFCalculator

    self._calculator = NLPTFIDFCalculator(
        use_stopwords=use_stopwords,
        stopword_set="code",  # Use minimal stopwords for code/code-search
    )
    # Expose a mutable stopword set expected by tests; we'll additionally
    # filter tokens against this set in tokenize() when enabled
    if use_stopwords:
        try:
            from tenets.core.nlp.stopwords import StopwordManager

            sw = StopwordManager().get_set("code")
            self.stopwords: Set[str] = set(sw.words) if sw else set()
        except Exception:
            self.stopwords = set()
    else:
        self.stopwords = set()
```
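A minimal usage sketch (the document IDs and contents below are illustrative):

```python
calc = TFIDFCalculator(use_stopwords=True)

# Build up a small corpus; add_document returns each document's TF-IDF vector.
calc.add_document("auth.py", "def login(user, password): ...")
calc.add_document("db.py", "def connect(dsn): ...")
```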
Functions¶
tokenize¶
Tokenize text using NLP tokenizer.
| PARAMETER | DESCRIPTION |
| --- | --- |
| `text` | Input text. TYPE: `str` |

| RETURNS | DESCRIPTION |
| --- | --- |
| `List[str]` | List of tokens |
Source code in tenets/core/nlp/tfidf.py
```python
def tokenize(self, text: str) -> List[str]:
    """Tokenize text using NLP tokenizer.

    Args:
        text: Input text

    Returns:
        List of tokens
    """
    tokens = self._calculator.tokenize(text)
    if self.use_stopwords and self.stopwords:
        sw = self.stopwords
        tokens = [t for t in tokens if t not in sw and t.lower() not in sw]
    return tokens
```
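For example, with stopword filtering enabled (the exact tokens depend on the underlying NLP tokenizer):

```python
calc = TFIDFCalculator(use_stopwords=True)
tokens = calc.tokenize("parse the config file")
# Words in the 'code' stopword set, such as "the", are filtered
# both case-sensitively and case-insensitively.
```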
add_document¶
Add document to corpus.
| PARAMETER | DESCRIPTION |
| --- | --- |
| `doc_id` | Document identifier. TYPE: `str` |
| `text` | Document content. TYPE: `str` |

| RETURNS | DESCRIPTION |
| --- | --- |
| `Dict[str, float]` | TF-IDF vector for the document |
Source code in tenets/core/nlp/tfidf.py
```python
def add_document(self, doc_id: str, text: str) -> Dict[str, float]:
    """Add document to corpus.

    Args:
        doc_id: Document identifier
        text: Document content

    Returns:
        TF-IDF vector for document
    """
    # Invalidate IDF cache before/after adding a document to reflect corpus change
    try:
        if hasattr(self._calculator, "idf_cache"):
            self._calculator.idf_cache = {}
    except Exception:
        pass
    result = self._calculator.add_document(doc_id, text)
    try:
        if hasattr(self._calculator, "idf_cache"):
            self._calculator.idf_cache = {}
    except Exception:
        pass
    return result
```
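Clearing idf_cache both before and after the delegation guards against stale inverse-document-frequency values surviving the corpus change. A brief usage sketch:

```python
vec = calc.add_document("readme.md", "installation guide for the cli")
# vec maps each token to its TF-IDF weight within this document
```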
compute_similarity¶
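As background, TF-IDF similarity is conventionally the cosine of the angle between two sparse vectors. The standalone sketch below illustrates that computation; it is not necessarily this method's actual implementation:

```python
import math
from typing import Dict

def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse TF-IDF vectors."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```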
get_top_terms¶
Return the top-n TF-IDF terms for a given document.
| PARAMETER | DESCRIPTION |
| --- | --- |
| `doc_id` | Document identifier. TYPE: `str` |
| `n` | Maximum number of terms to return. TYPE: `int` |

| RETURNS | DESCRIPTION |
| --- | --- |
| `List[Tuple[str, float]]` | List of (term, score) pairs sorted by score, descending |
Source code in tenets/core/nlp/tfidf.py
```python
def get_top_terms(self, doc_id: str, n: int = 10) -> List[Tuple[str, float]]:
    """Return the top-n TF-IDF terms for a given document.

    Args:
        doc_id: Document identifier
        n: Maximum number of terms to return

    Returns:
        List of (term, score) sorted by score descending
    """
    vec = self._calculator.document_vectors.get(doc_id, {})
    if not vec:
        return []
    # Already normalized; just sort and take top-n
    return sorted(vec.items(), key=lambda x: x[1], reverse=True)[:n]
```
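Example usage (the document ID is illustrative):

```python
top = calc.get_top_terms("auth.py", n=5)
for term, score in top:
    # At most five (term, score) pairs, highest score first
    print(f"{term}: {score:.3f}")
```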
build_corpus¶
Build corpus from documents.
| PARAMETER | DESCRIPTION |
| --- | --- |
| `documents` | List of (doc_id, text) tuples |
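A usage sketch, assuming build_corpus simply feeds each (doc_id, text) pair through add_document:

```python
calc = TFIDFCalculator()
calc.build_corpus([
    ("auth.py", "def login(user, password): ..."),
    ("db.py", "def connect(dsn): ..."),
])
```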