tenets.core.nlp Package¶
Natural Language Processing and Machine Learning utilities.
This package provides all NLP/ML functionality for Tenets, including:

- Tokenization and text processing
- Keyword extraction (YAKE, TF-IDF)
- Stopword management
- Embedding generation and caching
- Semantic similarity calculation
All ML features are optional and gracefully degrade when not available.
Attributes¶
- ML_AVAILABLE (module attribute)
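A minimal usage sketch of feature detection. It assumes that ML_AVAILABLE is a boolean flag and that the classes documented below are importable directly from tenets.core.nlp; both are assumptions, not confirmed API details:

```python
# Usage sketch (assumed: package-level re-exports, ML_AVAILABLE is a bool flag
# reporting whether the optional ML dependencies could be imported).
from tenets.core.nlp import ML_AVAILABLE, KeywordExtractor

extractor = KeywordExtractor()  # keyword extraction works without the ML extras
keywords = extractor.extract("add OAuth2 login support")

if ML_AVAILABLE:
    # Only attempt embedding-based features when the ML extras are installed.
    from tenets.core.nlp import LocalEmbeddings
    embedder = LocalEmbeddings()
```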
Classes¶
KeywordExtractor¶
KeywordExtractor(use_rake: bool = True, use_yake: bool = True, language: str = 'en', use_stopwords: bool = True, stopword_set: str = 'prompt')
Multi-method keyword extraction with automatic fallback.
Provides robust keyword extraction using multiple algorithms with automatic fallback based on availability and Python version compatibility. Prioritizes fast, accurate methods while ensuring compatibility across Python versions.
Methods are attempted in the following order (see the fallback sketch after this list):
- RAKE (Rapid Automatic Keyword Extraction) - Primary method, fast and Python 3.13+ compatible
- YAKE (Yet Another Keyword Extractor) - Secondary method, only for Python < 3.13 due to compatibility issues
- TF-IDF - Custom implementation, always available
- Frequency-based - Final fallback, simple but effective
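Conceptually, the fallback chain works like the sketch below. This is an illustration of the pattern only, not the library's actual implementation; the extract_with_fallback and _frequency names are hypothetical:

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Method = Callable[[str, int], List[Tuple[str, float]]]

def _frequency(text: str, k: int) -> List[Tuple[str, float]]:
    """Final fallback: rank words by raw frequency, scores normalized to [0, 1]."""
    words = [w.lower() for w in text.split() if w.isalpha()]
    counts = Counter(words).most_common(k)
    top = counts[0][1] if counts else 1
    return [(word, count / top) for word, count in counts]

def extract_with_fallback(text: str, k: int = 20,
                          preferred: Sequence[Method] = ()) -> List[Tuple[str, float]]:
    """Try preferred backends (e.g. RAKE, then YAKE on Python < 3.13, then TF-IDF)
    in order; any failure or empty result falls through to the next method."""
    for method in (*preferred, _frequency):
        try:
            results = method(text, k)
            if results:
                return results
        except Exception:
            continue  # a missing or failing backend degrades gracefully
    return []

print(extract_with_fallback("token token refresh auth auth auth"))
```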
| ATTRIBUTE | DESCRIPTION |
| --- | --- |
| use_rake | Whether RAKE extraction is enabled and available. TYPE: bool |
| use_yake | Whether YAKE extraction is enabled and available. TYPE: bool |
| language | Language code for extraction (e.g., 'en' for English). TYPE: str |
| use_stopwords | Whether to filter stopwords during extraction. TYPE: bool |
| stopword_set | Which stopword set to use ('code' or 'prompt'). TYPE: str |
| rake_extractor | RAKE extractor instance if available. |
| yake_extractor | YAKE instance if available. |
| tokenizer | Tokenizer for fallback extraction. |
| stopwords | Set of stopwords if filtering is enabled. |
Example

>>> extractor = KeywordExtractor()
>>> keywords = extractor.extract("implement OAuth2 authentication")
>>> print(keywords)
['oauth2 authentication', 'implement', 'authentication']

Get keywords with scores:

>>> keywords_with_scores = extractor.extract(
...     "implement OAuth2 authentication",
...     include_scores=True
... )
>>> print(keywords_with_scores)
[('oauth2 authentication', 0.9), ('implement', 0.7), ...]
Note
On Python 3.13+, YAKE is automatically disabled due to a known infinite loop bug. RAKE is used as the primary extractor instead, providing similar quality with better performance.
Initialize keyword extractor with configurable extraction methods.
| PARAMETER | DESCRIPTION |
| --- | --- |
| use_rake | Enable RAKE extraction if available. RAKE is fast and works well with technical text. Defaults to True. TYPE: bool |
| use_yake | Enable YAKE extraction if available. Automatically disabled on Python 3.13+ due to compatibility issues. Defaults to True. TYPE: bool |
| language | Language code for extraction algorithms. Currently supports 'en' (English). Other languages may work but are not officially tested. Defaults to 'en'. TYPE: str |
| use_stopwords | Whether to filter common stopwords during extraction. This can improve keyword quality but may miss some contextual phrases. Defaults to True. TYPE: bool |
| stopword_set | Which stopword set to use: 'prompt' (aggressive filtering for user prompts, 200+ words) or 'code' (minimal filtering for code analysis, 30 words). Defaults to 'prompt'. TYPE: str |
| RAISES | DESCRIPTION |
| --- | --- |
| None | Gracefully handles missing dependencies and logs warnings. |
Note
The extractor automatically detects available libraries and Python version to choose the best extraction method. If RAKE and YAKE are unavailable, it falls back to TF-IDF and frequency-based extraction.
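For example, an extractor tuned for code analysis might be configured as follows. This is a usage sketch based on the constructor signature above; the import path is an assumption:

```python
from tenets.core.nlp import KeywordExtractor  # assumed package-level re-export

# Minimal stopword filtering for source code, per the stopword_set options above.
code_extractor = KeywordExtractor(
    use_rake=True,
    use_yake=False,          # skip YAKE entirely, e.g. on Python 3.13+
    language="en",
    use_stopwords=True,
    stopword_set="code",
)
keywords = code_extractor.extract("def authenticate_user(token): ...", max_keywords=10)
```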
Attributes¶

- logger (instance attribute)
- use_rake (instance attribute)
- use_yake (instance attribute)
- language (instance attribute)
- use_stopwords (instance attribute)
- stopword_set (instance attribute)
- rake_extractor (instance attribute)
- yake_extractor (instance attribute)
- tokenizer (instance attribute)
- stopwords (instance attribute)
Functions¶
extract¶
extract(text: str, max_keywords: int = 20, include_scores: bool = False) -> Union[List[str], List[Tuple[str, float]]]
Extract keywords from text using the best available method.
Attempts extraction methods in priority order (RAKE → YAKE → TF-IDF → Frequency) until one succeeds. Each method returns normalized scores between 0 and 1, with higher scores indicating more relevant keywords.
| PARAMETER | DESCRIPTION |
| --- | --- |
| text | Input text to extract keywords from. Can be any length, but very long texts may be truncated by some algorithms. TYPE: str |
| max_keywords | Maximum number of keywords to return. Keywords are sorted by relevance score. Defaults to 20. TYPE: int |
| include_scores | If True, return (keyword, score) tuples. If False, return only keyword strings. Defaults to False. TYPE: bool |

| RETURNS | DESCRIPTION |
| --- | --- |
| Union[List[str], List[Tuple[str, float]]] | If include_scores=False: a list of keyword strings sorted by relevance (e.g., ['oauth2', 'authentication', 'implement']). If include_scores=True: a list of (keyword, score) tuples where scores are normalized between 0 and 1 (e.g., [('oauth2', 0.95), ('authentication', 0.87), ...]). |
Examples:
>>> extractor = KeywordExtractor()
>>> # Simple keyword extraction
>>> keywords = extractor.extract("Python web framework Django")
>>> print(keywords)
['django', 'python web framework', 'web framework']
>>> # With scores for ranking
>>> scored = extractor.extract("Python web framework Django",
... max_keywords=5, include_scores=True)
>>> for keyword, score in scored:
... print(f"{keyword}: {score:.2f}")
django: 0.95
python web framework: 0.87
web framework: 0.82
Note
Empty input returns an empty list. All extraction methods handle various text formats including code, documentation, and natural language. Scores are normalized for consistency across methods.
TFIDFExtractor¶
Simple TF-IDF vectorizer with NLP tokenization.
Provides a scikit-learn-like interface with fit/transform methods returning dense vectors. Uses TextTokenizer for general text.
Initialize the extractor.
| PARAMETER | DESCRIPTION |
| --- | --- |
| use_stopwords | Whether to filter stopwords. TYPE: bool |
| stopword_set | Which stopword set to use ('prompt' or 'code'). TYPE: str |
Attributes¶

- logger (instance attribute)
- use_stopwords (instance attribute)
- stopword_set (instance attribute)
- tokenizer (instance attribute)
Functions¶
fit¶
Learn vocabulary and IDF from documents.
| PARAMETER | DESCRIPTION |
| --- | --- |
| documents | List of input texts. |

| RETURNS | DESCRIPTION |
| --- | --- |
| TFIDFExtractor | self |
transform¶
fit_transform¶
Fit to documents, then transform them.
get_feature_names¶
Return the learned vocabulary as a list of feature names.
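A usage sketch for the TF-IDF extractor, assuming the package-level import path and that transform/fit_transform return one dense vector per document (as described above):

```python
from tenets.core.nlp import TFIDFExtractor  # assumed package-level re-export

docs = [
    "implement OAuth2 authentication flow",
    "refresh token rotation for OAuth2",
    "render the settings page template",
]

tfidf = TFIDFExtractor(use_stopwords=True, stopword_set="prompt")
vectors = tfidf.fit_transform(docs)          # scikit-learn-like fit + transform
vocab = tfidf.get_feature_names()            # learned vocabulary terms

print(len(vectors), "documents,", len(vocab), "features")
```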
StopwordManager¶
Manages multiple stopword sets for different contexts.
Initialize stopword manager.
| PARAMETER | DESCRIPTION |
| --- | --- |
| data_dir | Directory containing stopword files. |
Attributes¶

- DEFAULT_DATA_DIR (class attribute, instance attribute)
- logger (instance attribute)
- data_dir (instance attribute)
Functions¶
get_set¶
Get a stopword set by name.
| PARAMETER | DESCRIPTION |
| --- | --- |
| name | Name of stopword set ('code', 'prompt', etc.). TYPE: str |

| RETURNS | DESCRIPTION |
| --- | --- |
| Optional[StopwordSet] | StopwordSet or None if not found. |
add_custom_set¶
Add a custom stopword set.
| PARAMETER | DESCRIPTION |
| --- | --- |
| name | Name for the set. TYPE: str |
| words | Set of stopword strings. |
| description | What this set is for. TYPE: str |

| RETURNS | DESCRIPTION |
| --- | --- |
| StopwordSet | Created StopwordSet. |
combine_sets¶
Combine multiple stopword sets.
| PARAMETER | DESCRIPTION |
| --- | --- |
| sets | Names of sets to combine. |
| name | Name for the combined set. TYPE: str |

| RETURNS | DESCRIPTION |
| --- | --- |
| StopwordSet | Combined StopwordSet. |
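A usage sketch for the manager. The method names match the documentation above, but the import path, the option to omit data_dir, and the exact argument shapes for combine_sets are assumptions:

```python
from tenets.core.nlp import StopwordManager  # assumed package-level re-export

manager = StopwordManager()                  # assumed to fall back to DEFAULT_DATA_DIR

prompt_set = manager.get_set("prompt")       # None if the set is unknown
if prompt_set is not None:
    print(len(prompt_set.words), "prompt stopwords")

# Register a project-specific set and merge it with the built-in code set.
manager.add_custom_set(
    name="project",
    words={"acme", "widgetco"},
    description="Project-specific noise words",
)
combined = manager.combine_sets(["code", "project"], name="code_plus_project")
```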
StopwordSet dataclass ¶
A set of stopwords with metadata.
| ATTRIBUTE | DESCRIPTION |
| --- | --- |
| name | Name of this stopword set. TYPE: str |
| words | Set of stopword strings. |
| description | What this set is used for. TYPE: str |
| source_file | Path to source file. |
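The dataclass roughly corresponds to the following shape. This is an approximation reconstructed from the attribute table; the field types for words and source_file and the defaults shown are assumptions:

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional, Set

@dataclass
class StopwordSet:                      # approximation of the documented dataclass
    name: str                           # name of this stopword set
    words: Set[str] = field(default_factory=set)     # the stopword strings
    description: str = ""               # what this set is used for
    source_file: Optional[Path] = None  # path to the source file, if loaded from disk
```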
CodeTokenizer¶
Tokenizer optimized for source code.
Handles:

- camelCase and PascalCase splitting
- snake_case splitting
- Preserves original tokens for exact matching
- Language-specific keywords
- Optional stopword filtering
Initialize code tokenizer.
| PARAMETER | DESCRIPTION |
| --- | --- |
| use_stopwords | Whether to filter stopwords. TYPE: bool |
Attributes¶

- logger (instance attribute)
- use_stopwords (instance attribute)
- stopwords (instance attribute)
- token_pattern (instance attribute)
- camel_case_pattern (instance attribute)
- snake_case_pattern (instance attribute)
Functions¶
tokenize¶
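A usage sketch for code tokenization. The tokenize method is listed above but its exact signature is not; here it is assumed to take a string and return a list of tokens, and the import path is an assumption:

```python
from tenets.core.nlp import CodeTokenizer  # assumed package-level re-export

tokenizer = CodeTokenizer(use_stopwords=False)

# camelCase and snake_case identifiers are split into their parts while the
# original tokens are preserved for exact matching (per the class description).
tokens = tokenizer.tokenize("def getUserToken(refresh_token): ...")
print(tokens)
```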
TextTokenizer¶
Tokenizer for natural language text (prompts, comments, docs).
More aggressive than CodeTokenizer, designed for understanding user intent rather than exact matching.
Initialize text tokenizer.
| PARAMETER | DESCRIPTION |
| --- | --- |
| use_stopwords | Whether to filter stopwords (default True). TYPE: bool |
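For contrast with CodeTokenizer, a sketch of tokenizing a natural-language prompt. That TextTokenizer exposes a tokenize method with the same shape, and the import path, are assumptions:

```python
from tenets.core.nlp import TextTokenizer  # assumed package-level re-export

text_tokenizer = TextTokenizer(use_stopwords=True)

# Aggressive stopword filtering keeps only the intent-bearing terms of the prompt.
tokens = text_tokenizer.tokenize("please implement the OAuth2 login flow for the admin page")
print(tokens)
```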
LocalEmbeddings¶
LocalEmbeddings(model_name: str = 'all-MiniLM-L6-v2', device: Optional[str] = None, cache_dir: Optional[Path] = None)
Bases: EmbeddingModel
Local embedding generation using sentence transformers.
This runs completely locally with no external API calls. Models are downloaded and cached by sentence-transformers.
Initialize local embeddings.
| PARAMETER | DESCRIPTION |
| --- | --- |
| model_name | Sentence transformer model name. TYPE: str |
| device | Device to use ('cpu', 'cuda', or None for auto). |
| cache_dir | Directory to cache models. |
Attributes¶

- device (instance attribute)
- model (instance attribute)
- embedding_dim (instance attribute)
Functions¶
encode¶
encode(texts: Union[str, List[str]], batch_size: int = 32, show_progress: bool = False, normalize: bool = True) -> np.ndarray
encode_file¶
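A usage sketch based on the encode signature above. The import path is an assumption; models are downloaded and cached by sentence-transformers on first use:

```python
import numpy as np

from tenets.core.nlp import LocalEmbeddings  # assumed package-level re-export

embedder = LocalEmbeddings(model_name="all-MiniLM-L6-v2", device=None)

vectors = embedder.encode(
    ["implement OAuth2 authentication", "render the settings template"],
    batch_size=32,
    normalize=True,          # normalized vectors make the dot product a cosine similarity
)
similarity = float(np.dot(vectors[0], vectors[1]))
print(f"cosine similarity: {similarity:.3f}")
```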
SemanticSimilarity¶
Stub used when ML features are not available.
Functions¶
sparse_cosine_similarity¶
Stub used when ML features are not available.
extract_keywords¶
extract_keywords(text: str, max_keywords: int = 20, use_yake: bool = True, language: str = 'en') -> List[str]
Extract keywords from text using the best available method.
| PARAMETER | DESCRIPTION |
| --- | --- |
| text | Input text. TYPE: str |
| max_keywords | Maximum keywords to extract. TYPE: int |
| use_yake | Try YAKE first if available. TYPE: bool |
| language | Language for YAKE. TYPE: str |

| RETURNS | DESCRIPTION |
| --- | --- |
| List[str] | List of extracted keywords. |
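A quick usage sketch of this module-level convenience function, matching the signature above; the import path is an assumption:

```python
from tenets.core.nlp import extract_keywords  # assumed package-level re-export

keywords = extract_keywords(
    "implement OAuth2 authentication with refresh token rotation",
    max_keywords=5,
)
print(keywords)  # e.g. ['oauth2 authentication', 'refresh token rotation', ...]
```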
tokenize_code¶
compute_similarity¶
Modules¶
- bm25: Bm25 module
- cache: Cache module
- embeddings: Embeddings module
- keyword_extractor: Keyword Extractor module
- ml_utils: Ml Utils module
- programming_patterns: Programming Patterns module
- similarity: Similarity module
- stopwords: Stopwords module
- tfidf: Tfidf module
- tokenizer: Tokenizer module