tenets.core.nlp Package

Natural Language Processing and Machine Learning utilities.

This package provides all NLP/ML functionality for Tenets, including:

  - Tokenization and text processing
  - Keyword extraction (YAKE, TF-IDF)
  - Stopword management
  - Embedding generation and caching
  - Semantic similarity calculation

All ML features are optional and degrade gracefully when their dependencies are not installed.

Attributes

ML_AVAILABLE (module attribute)

Python
ML_AVAILABLE = True
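
The flag can be used to guard optional semantic features at call sites. A minimal sketch, assuming ML_AVAILABLE, TFIDFExtractor, and LocalEmbeddings are importable from tenets.core.nlp as documented on this page:

Python
from tenets.core.nlp import ML_AVAILABLE, TFIDFExtractor

def vectorize(documents):
    # Prefer semantic embeddings when the ML extras are installed; otherwise
    # fall back to the pure-Python TF-IDF path. Note the return types differ
    # in this sketch (numpy array vs. list of lists).
    if ML_AVAILABLE:
        from tenets.core.nlp import LocalEmbeddings
        return LocalEmbeddings().encode(documents)
    return TFIDFExtractor().fit_transform(documents)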

Classes

KeywordExtractor

Python
KeywordExtractor(use_rake: bool = True, use_yake: bool = True, language: str = 'en', use_stopwords: bool = True, stopword_set: str = 'prompt')

Multi-method keyword extraction with automatic fallback.

Provides robust keyword extraction using multiple algorithms with automatic fallback based on availability and Python version compatibility. Prioritizes fast, accurate methods while ensuring compatibility across Python versions.

Methods are attempted in the following order (see the sketch after this list):
  1. RAKE (Rapid Automatic Keyword Extraction) - Primary method, fast and Python 3.13+ compatible
  2. YAKE (Yet Another Keyword Extractor) - Secondary method, only for Python < 3.13 due to compatibility issues
  3. TF-IDF - Custom implementation, always available
  4. Frequency-based - Final fallback, simple but effective
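
The fallback chain can be pictured as a loop over candidate extractors. This is only an illustrative sketch of the behavior described above, not the package's actual internals; the _extract_* helper names are hypothetical.

Python
def extract_with_fallback(extractor, text, max_keywords=20):
    # Try each method in priority order; skip methods disabled on this
    # Python version and fall through on errors or empty results.
    candidates = [
        (extractor._extract_rake, extractor.use_rake),       # hypothetical helpers
        (extractor._extract_yake, extractor.use_yake),
        (extractor._extract_tfidf, True),
        (extractor._extract_frequency, True),
    ]
    for method, enabled in candidates:
        if not enabled:
            continue
        try:
            keywords = method(text, max_keywords)
        except Exception:
            continue
        if keywords:
            return keywords
    return []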
ATTRIBUTE | DESCRIPTION

use_rake
    Whether RAKE extraction is enabled and available.
    TYPE: bool

use_yake
    Whether YAKE extraction is enabled and available.
    TYPE: bool

language
    Language code for extraction (e.g., 'en' for English).
    TYPE: str

use_stopwords
    Whether to filter stopwords during extraction.
    TYPE: bool

stopword_set
    Which stopword set to use ('code' or 'prompt').
    TYPE: str

rake_extractor
    RAKE extractor instance if available.
    TYPE: Rake | None

yake_extractor
    YAKE instance if available.
    TYPE: KeywordExtractor | None

tokenizer
    Tokenizer for fallback extraction.
    TYPE: TextTokenizer

stopwords
    Set of stopwords if filtering is enabled.
    TYPE: Set[str] | None

Example

Python Console Session
>>> extractor = KeywordExtractor()
>>> keywords = extractor.extract("implement OAuth2 authentication")
>>> print(keywords)
['oauth2 authentication', 'implement', 'authentication']

Get keywords with scores:

Python Console Session
>>> keywords_with_scores = extractor.extract(
...     "implement OAuth2 authentication",
...     include_scores=True
... )
>>> print(keywords_with_scores)
[('oauth2 authentication', 0.9), ('implement', 0.7), ...]

Note

On Python 3.13+, YAKE is automatically disabled due to a known infinite loop bug. RAKE is used as the primary extractor instead, providing similar quality with better performance.

Initialize keyword extractor with configurable extraction methods.

PARAMETER | DESCRIPTION

use_rake
    Enable RAKE extraction if available. RAKE is fast and works well with technical text. Defaults to True.
    TYPE: bool | DEFAULT: True

use_yake
    Enable YAKE extraction if available. Automatically disabled on Python 3.13+ due to compatibility issues. Defaults to True.
    TYPE: bool | DEFAULT: True

language
    Language code for extraction algorithms. Currently supports 'en' (English). Other languages may work but are not officially tested. Defaults to 'en'.
    TYPE: str | DEFAULT: 'en'

use_stopwords
    Whether to filter common stopwords during extraction. This can improve keyword quality but may miss some contextual phrases. Defaults to True.
    TYPE: bool | DEFAULT: True

stopword_set
    Which stopword set to use. Options are:
      - 'prompt': aggressive filtering for user prompts (200+ words)
      - 'code': minimal filtering for code analysis (30 words)
    Defaults to 'prompt'.
    TYPE: str | DEFAULT: 'prompt'

RAISES | DESCRIPTION

None
    Gracefully handles missing dependencies and logs warnings.

Note

The extractor automatically detects available libraries and Python version to choose the best extraction method. If RAKE and YAKE are unavailable, it falls back to TF-IDF and frequency-based extraction.

Attributes

logger (instance attribute)
Python
logger = get_logger(__name__)

use_rake (instance attribute)
Python
use_rake = use_rake and RAKE_AVAILABLE

use_yake (instance attribute)
Python
use_yake = use_yake and YAKE_AVAILABLE

language (instance attribute)
Python
language = language

use_stopwords (instance attribute)
Python
use_stopwords = use_stopwords

stopword_set (instance attribute)
Python
stopword_set = stopword_set

rake_extractor (instance attribute)
Python
rake_extractor = SimpleRAKE(stopwords=stopwords, max_length=3)

yake_extractor (instance attribute)
Python
yake_extractor = KeywordExtractor(lan=language, n=3, dedupLim=0.7, dedupFunc='seqm', windowsSize=1, top=30)

tokenizer (instance attribute)
Python
tokenizer = TextTokenizer(use_stopwords=use_stopwords)

stopwords (instance attribute)
Python
stopwords = get_set(stopword_set)

Functions

extract
Python
extract(text: str, max_keywords: int = 20, include_scores: bool = False) -> Union[List[str], List[Tuple[str, float]]]

Extract keywords from text using the best available method.

Attempts extraction methods in priority order (RAKE → YAKE → TF-IDF → Frequency) until one succeeds. Each method returns normalized scores between 0 and 1, with higher scores indicating more relevant keywords.

PARAMETER | DESCRIPTION

text
    Input text to extract keywords from. Can be any length, but very long texts may be truncated by some algorithms.
    TYPE: str

max_keywords
    Maximum number of keywords to return. Keywords are sorted by relevance score. Defaults to 20.
    TYPE: int | DEFAULT: 20

include_scores
    If True, return (keyword, score) tuples. If False, return only keyword strings. Defaults to False.
    TYPE: bool | DEFAULT: False

RETURNS | DESCRIPTION

Union[List[str], List[Tuple[str, float]]]
    - If include_scores=False: list of keyword strings sorted by relevance (e.g., ['oauth2', 'authentication', 'implement'])
    - If include_scores=True: list of (keyword, score) tuples where scores are normalized between 0 and 1 (e.g., [('oauth2', 0.95), ('authentication', 0.87), ...])

Examples:

Python Console Session
>>> extractor = KeywordExtractor()
>>> # Simple keyword extraction
>>> keywords = extractor.extract("Python web framework Django")
>>> print(keywords)
['django', 'python web framework', 'web framework']
Python Console Session
>>> # With scores for ranking
>>> scored = extractor.extract("Python web framework Django",
...                           max_keywords=5, include_scores=True)
>>> for keyword, score in scored:
...     print(f"{keyword}: {score:.2f}")
django: 0.95
python web framework: 0.87
web framework: 0.82
Note

Empty input returns an empty list. All extraction methods handle various text formats including code, documentation, and natural language. Scores are normalized for consistency across methods.

TFIDFExtractor

Python
TFIDFExtractor(use_stopwords: bool = True, stopword_set: str = 'prompt')

Simple TF-IDF vectorizer with NLP tokenization.

Provides a scikit-learn-like interface with fit/transform methods returning dense vectors. Uses TextTokenizer for general text.

Initialize the extractor.

PARAMETER | DESCRIPTION

use_stopwords
    Whether to filter stopwords.
    TYPE: bool | DEFAULT: True

stopword_set
    Which stopword set to use ('prompt' | 'code').
    TYPE: str | DEFAULT: 'prompt'

Attributes

logger (instance attribute)
Python
logger = get_logger(__name__)

use_stopwords (instance attribute)
Python
use_stopwords = use_stopwords

stopword_set (instance attribute)
Python
stopword_set = stopword_set

tokenizer (instance attribute)
Python
tokenizer = TextTokenizer(use_stopwords=use_stopwords)

Functions

fit
Python
fit(documents: List[str]) -> TFIDFExtractor

Learn vocabulary and IDF from documents.

PARAMETER | DESCRIPTION

documents
    List of input texts.
    TYPE: List[str]

RETURNS | DESCRIPTION

TFIDFExtractor
    self

transform
Python
transform(documents: List[str]) -> List[List[float]]

Transform documents to dense TF-IDF vectors.

PARAMETER | DESCRIPTION

documents
    List of input texts.
    TYPE: List[str]

RETURNS | DESCRIPTION

List[List[float]]
    List of dense vectors (each aligned to the learned vocabulary).

fit_transform
Python
fit_transform(documents: List[str]) -> List[List[float]]

Fit to documents, then transform them.

get_feature_names
Python
get_feature_names() -> List[str]

Return the learned vocabulary as a list of feature names.
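
A minimal usage sketch of the fit/transform interface, assuming TFIDFExtractor is importable from tenets.core.nlp as documented on this page:

Python
from tenets.core.nlp import TFIDFExtractor

extractor = TFIDFExtractor(use_stopwords=True, stopword_set='prompt')
docs = [
    "implement OAuth2 authentication flow",
    "refactor the authentication middleware",
]
vectors = extractor.fit_transform(docs)   # one dense vector per document
vocab = extractor.get_feature_names()     # column order of each vector
assert len(vectors) == len(docs) and len(vectors[0]) == len(vocab)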

StopwordManager

Python
StopwordManager(data_dir: Optional[Path] = None)

Manages multiple stopword sets for different contexts.

Initialize stopword manager.

PARAMETER | DESCRIPTION

data_dir
    Directory containing stopword files.
    TYPE: Optional[Path] | DEFAULT: None

Attributes

DEFAULT_DATA_DIR (class attribute, instance attribute)
Python
DEFAULT_DATA_DIR = parent / 'data' / 'stopwords'

logger (instance attribute)
Python
logger = get_logger(__name__)

data_dir (instance attribute)
Python
data_dir = data_dir or DEFAULT_DATA_DIR

Functions

get_set
Python
get_set(name: str) -> Optional[StopwordSet]

Get a stopword set by name.

PARAMETER | DESCRIPTION

name
    Name of stopword set ('code', 'prompt', etc.).
    TYPE: str

RETURNS | DESCRIPTION

Optional[StopwordSet]
    StopwordSet, or None if not found.

add_custom_set
Python
add_custom_set(name: str, words: Set[str], description: str = '') -> StopwordSet

Add a custom stopword set.

PARAMETER | DESCRIPTION

name
    Name for the set.
    TYPE: str

words
    Set of stopword strings.
    TYPE: Set[str]

description
    What this set is for.
    TYPE: str | DEFAULT: ''

RETURNS | DESCRIPTION

StopwordSet
    Created StopwordSet.

combine_sets
Python
combine_sets(sets: List[str], name: str = 'combined') -> StopwordSet

Combine multiple stopword sets.

PARAMETER | DESCRIPTION

sets
    Names of sets to combine.
    TYPE: List[str]

name
    Name for the combined set.
    TYPE: str | DEFAULT: 'combined'

RETURNS | DESCRIPTION

StopwordSet
    Combined StopwordSet.
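
A short sketch of building and combining stopword sets, assuming StopwordManager is importable from tenets.core.nlp and the bundled 'prompt' set is present:

Python
from tenets.core.nlp import StopwordManager

manager = StopwordManager()                 # uses DEFAULT_DATA_DIR
prompt_set = manager.get_set('prompt')      # None if the set cannot be found
custom = manager.add_custom_set('project', {'todo', 'fixme'},
                                description='Project-specific noise words')
merged = manager.combine_sets(['prompt', 'project'], name='prompt_plus_project')
print(len(merged.words))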

StopwordSet (dataclass)

Python
StopwordSet(name: str, words: Set[str], description: str, source_file: Optional[Path] = None)

A set of stopwords with metadata.

ATTRIBUTE | DESCRIPTION

name
    Name of this stopword set.
    TYPE: str

words
    Set of stopword strings.
    TYPE: Set[str]

description
    What this set is used for.
    TYPE: str

source_file
    Path to source file.
    TYPE: Optional[Path]

Attributes

name (instance attribute)
Python
name: str

words (instance attribute)
Python
words: Set[str]

description (instance attribute)
Python
description: str

source_file (class attribute, instance attribute)
Python
source_file: Optional[Path] = None

Functions

filter
Python
filter(words: List[str]) -> List[str]

Filter stopwords from word list.

PARAMETER | DESCRIPTION

words
    List of words to filter.
    TYPE: List[str]

RETURNS | DESCRIPTION

List[str]
    Filtered list without stopwords.
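
A minimal sketch of filtering a token list through a named set, assuming the 'code' set loads as documented for StopwordManager.get_set:

Python
from tenets.core.nlp import StopwordManager

code_set = StopwordManager().get_set('code')
if code_set is not None:
    tokens = ['the', 'parse', 'config', 'and', 'return']
    print(code_set.filter(tokens))   # common words such as 'the' and 'and' removed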

CodeTokenizer

Python
CodeTokenizer(use_stopwords: bool = False)

Tokenizer optimized for source code.

Handles:
  - camelCase and PascalCase splitting
  - snake_case splitting
  - Preserves original tokens for exact matching
  - Language-specific keywords
  - Optional stopword filtering

Initialize code tokenizer.

PARAMETER | DESCRIPTION

use_stopwords
    Whether to filter stopwords.
    TYPE: bool | DEFAULT: False

Attributes

logger (instance attribute)
Python
logger = get_logger(__name__)

use_stopwords (instance attribute)
Python
use_stopwords = use_stopwords

stopwords (instance attribute)
Python
stopwords = get_set('code')

token_pattern (instance attribute)
Python
token_pattern = compile('\\b[a-zA-Z_][a-zA-Z0-9_]*\\b')

camel_case_pattern (instance attribute)
Python
camel_case_pattern = compile('[A-Z][a-z]+|[a-z]+|[A-Z]+(?=[A-Z][a-z]|\\b)')

snake_case_pattern (instance attribute)
Python
snake_case_pattern = compile('[a-z]+|[A-Z]+')

Functions

tokenize
Python
tokenize(text: str, language: Optional[str] = None, preserve_original: bool = True) -> List[str]

Tokenize code text.

PARAMETER | DESCRIPTION

text
    Code to tokenize.
    TYPE: str

language
    Programming language (for language-specific handling).
    TYPE: Optional[str] | DEFAULT: None

preserve_original
    Keep original tokens alongside splits.
    TYPE: bool | DEFAULT: True

RETURNS | DESCRIPTION

List[str]
    List of tokens.

tokenize_identifier
Python
tokenize_identifier(identifier: str) -> List[str]

Tokenize a single identifier (function/class/variable name).

PARAMETER | DESCRIPTION

identifier
    Identifier to tokenize.
    TYPE: str

RETURNS | DESCRIPTION

List[str]
    List of component tokens.
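
A short sketch of identifier splitting, assuming CodeTokenizer is importable from tenets.core.nlp; the exact ordering of the output tokens is illustrative:

Python
from tenets.core.nlp import CodeTokenizer

tokenizer = CodeTokenizer()
# camelCase and snake_case identifiers are split into components; with
# preserve_original=True the full identifiers are kept alongside the splits.
print(tokenizer.tokenize("parseHttpRequest(user_id)", language="python"))
print(tokenizer.tokenize_identifier("parseHttpRequest"))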

TextTokenizer

Python
TextTokenizer(use_stopwords: bool = True)

Tokenizer for natural language text (prompts, comments, docs).

More aggressive than CodeTokenizer, designed for understanding user intent rather than exact matching.

Initialize text tokenizer.

PARAMETER | DESCRIPTION

use_stopwords
    Whether to filter stopwords.
    TYPE: bool | DEFAULT: True

Attributes

logger (instance attribute)
Python
logger = get_logger(__name__)

use_stopwords (instance attribute)
Python
use_stopwords = use_stopwords

stopwords (instance attribute)
Python
stopwords = get_set('prompt')

token_pattern (instance attribute)
Python
token_pattern = compile('\\b[a-zA-Z][a-zA-Z0-9]*\\b')

Functions

tokenize
Python
tokenize(text: str, min_length: int = 2) -> List[str]

Tokenize natural language text.

PARAMETER | DESCRIPTION

text
    Text to tokenize.
    TYPE: str

min_length
    Minimum token length.
    TYPE: int | DEFAULT: 2

RETURNS | DESCRIPTION

List[str]
    List of tokens.

extract_ngrams
Python
extract_ngrams(text: str, n: int = 2) -> List[str]

Extract n-grams from text.

PARAMETER | DESCRIPTION

text
    Input text.
    TYPE: str

n
    Size of n-grams.
    TYPE: int | DEFAULT: 2

RETURNS | DESCRIPTION

List[str]
    List of n-grams.
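
A minimal sketch combining tokenization and bigram extraction, assuming TextTokenizer is importable from tenets.core.nlp; the printed results are illustrative:

Python
from tenets.core.nlp import TextTokenizer

tokenizer = TextTokenizer(use_stopwords=True)
text = "implement OAuth2 authentication for the admin dashboard"
print(tokenizer.tokenize(text))              # stopwords like 'the' and 'for' filtered
print(tokenizer.extract_ngrams(text, n=2))   # adjacent word pairs (bigrams)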

LocalEmbeddings

Python
LocalEmbeddings(model_name: str = 'all-MiniLM-L6-v2', device: Optional[str] = None, cache_dir: Optional[Path] = None)

Bases: EmbeddingModel

Local embedding generation using sentence transformers.

This runs completely locally with no external API calls. Models are downloaded and cached by sentence-transformers.

Initialize local embeddings.

PARAMETER | DESCRIPTION

model_name
    Sentence transformer model name.
    TYPE: str | DEFAULT: 'all-MiniLM-L6-v2'

device
    Device to use ('cpu', 'cuda', or None for auto).
    TYPE: Optional[str] | DEFAULT: None

cache_dir
    Directory to cache models.
    TYPE: Optional[Path] | DEFAULT: None

Attributes

device (instance attribute)
Python
device = device

model (instance attribute)
Python
model = SentenceTransformer(model_name, device=device, cache_folder=str(cache_dir) if cache_dir else None)

embedding_dim (instance attribute)
Python
embedding_dim = get_sentence_embedding_dimension()

Functions

encode
Python
encode(texts: Union[str, List[str]], batch_size: int = 32, show_progress: bool = False, normalize: bool = True) -> np.ndarray

Encode texts to embeddings.

PARAMETER | DESCRIPTION

texts
    Text or list of texts.
    TYPE: Union[str, List[str]]

batch_size
    Batch size for encoding.
    TYPE: int | DEFAULT: 32

show_progress
    Show a progress bar.
    TYPE: bool | DEFAULT: False

normalize
    L2-normalize the embeddings.
    TYPE: bool | DEFAULT: True

RETURNS | DESCRIPTION

ndarray
    NumPy array of embeddings.

encode_file
Python
encode_file(file_path: Path, chunk_size: int = 1000, overlap: int = 100) -> np.ndarray

Encode a file with chunking for long files.

PARAMETER | DESCRIPTION

file_path
    Path to the file.
    TYPE: Path

chunk_size
    Characters per chunk.
    TYPE: int | DEFAULT: 1000

overlap
    Overlap between chunks, in characters.
    TYPE: int | DEFAULT: 100

RETURNS | DESCRIPTION

ndarray
    Mean-pooled embedding for the file.
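
A usage sketch, assuming the optional ML dependencies (sentence-transformers) are installed and LocalEmbeddings is importable from tenets.core.nlp; similarity is computed here directly with NumPy rather than a package helper:

Python
import numpy as np
from tenets.core.nlp import LocalEmbeddings

embedder = LocalEmbeddings(model_name='all-MiniLM-L6-v2')
vectors = embedder.encode(
    ["implement OAuth2 login", "add OAuth2 authentication"],
    normalize=True,
)
# With L2-normalized embeddings, cosine similarity is just a dot product.
print(float(np.dot(vectors[0], vectors[1])))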

EmbeddingModel

Python
EmbeddingModel(*args, **kwargs)

Stub used when ML features are not available.

SemanticSimilarity

Python
SemanticSimilarity(*args, **kwargs)

Stub used when ML features are not available.

EmbeddingCache

Python
EmbeddingCache(*args, **kwargs)

Stub used when ML features are not available.

Functions

cosine_similarity

Python
cosine_similarity(a, b)

Stub used when ML features are not available.

sparse_cosine_similarity

Python
sparse_cosine_similarity(a, b)

Stub used when ML features are not available.

euclidean_distance

Python
euclidean_distance(a, b)

Stub used when ML features are not available.

manhattan_distance

Python
manhattan_distance(a, b)

Stub used when ML features are not available.

extract_keywords

Python
extract_keywords(text: str, max_keywords: int = 20, use_yake: bool = True, language: str = 'en') -> List[str]

Extract keywords from text using best available method.

PARAMETER | DESCRIPTION

text
    Input text.
    TYPE: str

max_keywords
    Maximum number of keywords to extract.
    TYPE: int | DEFAULT: 20

use_yake
    Try YAKE first if available.
    TYPE: bool | DEFAULT: True

language
    Language for YAKE.
    TYPE: str | DEFAULT: 'en'

RETURNS | DESCRIPTION

List[str]
    List of extracted keywords.

tokenize_code

Python
tokenize_code(code: str, language: Optional[str] = None, use_stopwords: bool = False) -> List[str]

Tokenize code with language-aware processing.

PARAMETER | DESCRIPTION

code
    Source code to tokenize.
    TYPE: str

language
    Programming language (auto-detected if None).
    TYPE: Optional[str] | DEFAULT: None

use_stopwords
    Whether to filter stopwords.
    TYPE: bool | DEFAULT: False

RETURNS | DESCRIPTION

List[str]
    List of tokens.

compute_similarity

Python
compute_similarity(text1: str, text2: str, method: str = 'auto') -> float

Compute similarity between two texts.

PARAMETER | DESCRIPTION

text1
    First text.
    TYPE: str

text2
    Second text.
    TYPE: str

method
    One of 'semantic', 'tfidf', or 'auto'.
    TYPE: str | DEFAULT: 'auto'

RETURNS | DESCRIPTION

float
    Similarity score in the range 0-1.
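
A combined sketch of the module-level convenience functions, assuming they are importable from tenets.core.nlp as documented on this page; the printed values are illustrative:

Python
from tenets.core.nlp import compute_similarity, extract_keywords, tokenize_code

keywords = extract_keywords("implement OAuth2 authentication", max_keywords=5)
tokens = tokenize_code("def parse_http_request(raw_bytes): ...", language="python")
score = compute_similarity("add OAuth2 login",
                           "implement OAuth2 authentication",
                           method="auto")   # 'auto' picks the best available backend
print(keywords, tokens, round(score, 2))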

Modules