tenets.core.nlp Package

Natural Language Processing and Machine Learning utilities.

This package provides all NLP/ML functionality for Tenets, including:

  - Tokenization and text processing
  - Keyword extraction (YAKE, TF-IDF)
  - Stopword management
  - Embedding generation and caching
  - Semantic similarity calculation

All ML features are optional and degrade gracefully when their dependencies are not installed.

Attributes

ML_AVAILABLE (module attribute)

Python
ML_AVAILABLE = True
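
The flag can be used to guard optional semantic features at call sites. A minimal sketch, assuming ML_AVAILABLE, TFIDFExtractor, and LocalEmbeddings are importable from tenets.core.nlp as documented on this page:

Python
from tenets.core.nlp import ML_AVAILABLE, TFIDFExtractor

def vectorize(documents):
    # Prefer semantic embeddings when the ML extras are installed; otherwise
    # fall back to the pure-Python TF-IDF path. Note the return types differ
    # in this sketch (numpy array vs. list of lists).
    if ML_AVAILABLE:
        from tenets.core.nlp import LocalEmbeddings
        return LocalEmbeddings().encode(documents)
    return TFIDFExtractor().fit_transform(documents)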

Classes

KeywordExtractor

Python
KeywordExtractor(use_rake: bool = True, use_yake: bool = True, language: str = 'en', use_stopwords: bool = True, stopword_set: str = 'prompt')

Multi-method keyword extraction with automatic fallback.

Provides robust keyword extraction using multiple algorithms with automatic fallback based on availability and Python version compatibility. Prioritizes fast, accurate methods while ensuring compatibility across Python versions.

Methods are attempted in the following order (see the sketch after this list):
  1. RAKE (Rapid Automatic Keyword Extraction) - Primary method, fast and Python 3.13+ compatible
  2. YAKE (Yet Another Keyword Extractor) - Secondary method, only for Python < 3.13 due to compatibility issues
  3. TF-IDF - Custom implementation, always available
  4. Frequency-based - Final fallback, simple but effective
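
The fallback chain can be pictured as a loop over candidate extractors. This is only an illustrative sketch of the behavior described above, not the package's actual internals; the _extract_* helper names are hypothetical.

Python
def extract_with_fallback(extractor, text, max_keywords=20):
    # Try each method in priority order; skip methods disabled on this
    # Python version and fall through on errors or empty results.
    candidates = [
        (extractor._extract_rake, extractor.use_rake),       # hypothetical helpers
        (extractor._extract_yake, extractor.use_yake),
        (extractor._extract_tfidf, True),
        (extractor._extract_frequency, True),
    ]
    for method, enabled in candidates:
        if not enabled:
            continue
        try:
            keywords = method(text, max_keywords)
        except Exception:
            continue
        if keywords:
            return keywords
    return []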
ATTRIBUTE | DESCRIPTION

use_rake
    Whether RAKE extraction is enabled and available.
    TYPE: bool

use_yake
    Whether YAKE extraction is enabled and available.
    TYPE: bool

language
    Language code for extraction (e.g., 'en' for English).
    TYPE: str

use_stopwords
    Whether to filter stopwords during extraction.
    TYPE: bool

stopword_set
    Which stopword set to use ('code' or 'prompt').
    TYPE: str

rake_extractor
    RAKE extractor instance if available.
    TYPE: Rake | None

yake_extractor
    YAKE instance if available.
    TYPE: KeywordExtractor | None

tokenizer
    Tokenizer for fallback extraction.
    TYPE: TextTokenizer

stopwords
    Set of stopwords if filtering is enabled.
    TYPE: Set[str] | None

Example

Python Console Session
>>> extractor = KeywordExtractor()
>>> keywords = extractor.extract("implement OAuth2 authentication")
>>> print(keywords)
['oauth2 authentication', 'implement', 'authentication']

Get keywords with scores:

Python Console Session
>>> keywords_with_scores = extractor.extract(
...     "implement OAuth2 authentication",
...     include_scores=True
... )
>>> print(keywords_with_scores)
[('oauth2 authentication', 0.9), ('implement', 0.7), ...]

Note

On Python 3.13+, YAKE is automatically disabled due to a known infinite loop bug. RAKE is used as the primary extractor instead, providing similar quality with better performance.

Initialize keyword extractor with configurable extraction methods.

PARAMETER | DESCRIPTION

use_rake
    Enable RAKE extraction if available. RAKE is fast and works well with technical text. Defaults to True.
    TYPE: bool | DEFAULT: True

use_yake
    Enable YAKE extraction if available. Automatically disabled on Python 3.13+ due to compatibility issues. Defaults to True.
    TYPE: bool | DEFAULT: True

language
    Language code for extraction algorithms. Currently supports 'en' (English). Other languages may work but are not officially tested. Defaults to 'en'.
    TYPE: str | DEFAULT: 'en'

use_stopwords
    Whether to filter common stopwords during extraction. This can improve keyword quality but may miss some contextual phrases. Defaults to True.
    TYPE: bool | DEFAULT: True

stopword_set
    Which stopword set to use. Options are:
      - 'prompt': aggressive filtering for user prompts (200+ words)
      - 'code': minimal filtering for code analysis (30 words)
    Defaults to 'prompt'.
    TYPE: str | DEFAULT: 'prompt'

RAISES | DESCRIPTION

None
    Gracefully handles missing dependencies and logs warnings.

Note

The extractor automatically detects available libraries and Python version to choose the best extraction method. If RAKE and YAKE are unavailable, it falls back to TF-IDF and frequency-based extraction.

Attributes

logger (instance attribute)
Python
logger = get_logger(__name__)

use_rake (instance attribute)
Python
use_rake = use_rake and RAKE_AVAILABLE

use_yake (instance attribute)
Python
use_yake = use_yake and YAKE_AVAILABLE

language (instance attribute)
Python
language = language

use_stopwords (instance attribute)
Python
use_stopwords = use_stopwords

stopword_set (instance attribute)
Python
stopword_set = stopword_set

rake_extractor (instance attribute)
Python
rake_extractor = SimpleRAKE(stopwords=stopwords, max_length=3)

yake_extractor (instance attribute)
Python
yake_extractor = KeywordExtractor(lan=language, n=3, dedupLim=0.7, dedupFunc='seqm', windowsSize=1, top=30)

tokenizer (instance attribute)
Python
tokenizer = TextTokenizer(use_stopwords=use_stopwords)

stopwords (instance attribute)
Python
stopwords = get_set(stopword_set)

Functions

extract
Python
extract(text: str, max_keywords: int = 20, include_scores: bool = False) -> Union[List[str], List[Tuple[str, float]]]

Extract keywords from text using the best available method.

Attempts extraction methods in priority order (RAKE → YAKE → TF-IDF → Frequency) until one succeeds. Each method returns normalized scores between 0 and 1, with higher scores indicating more relevant keywords.

PARAMETER | DESCRIPTION

text
    Input text to extract keywords from. Can be any length, but very long texts may be truncated by some algorithms.
    TYPE: str

max_keywords
    Maximum number of keywords to return. Keywords are sorted by relevance score. Defaults to 20.
    TYPE: int | DEFAULT: 20

include_scores
    If True, return (keyword, score) tuples. If False, return only keyword strings. Defaults to False.
    TYPE: bool | DEFAULT: False

RETURNS | DESCRIPTION

Union[List[str], List[Tuple[str, float]]]
    - If include_scores=False: list of keyword strings sorted by relevance (e.g., ['oauth2', 'authentication', 'implement'])
    - If include_scores=True: list of (keyword, score) tuples where scores are normalized between 0 and 1 (e.g., [('oauth2', 0.95), ('authentication', 0.87), ...])

Examples:

Python Console Session
>>> extractor = KeywordExtractor()
>>> # Simple keyword extraction
>>> keywords = extractor.extract("Python web framework Django")
>>> print(keywords)
['django', 'python web framework', 'web framework']
Python Console Session
>>> # With scores for ranking
>>> scored = extractor.extract("Python web framework Django",
...                           max_keywords=5, include_scores=True)
>>> for keyword, score in scored:
...     print(f"{keyword}: {score:.2f}")
django: 0.95
python web framework: 0.87
web framework: 0.82
Note

Empty input returns an empty list. All extraction methods handle various text formats including code, documentation, and natural language. Scores are normalized for consistency across methods.

TFIDFExtractor

Python
TFIDFExtractor(use_stopwords: bool = True, stopword_set: str = 'prompt')

Simple TF-IDF vectorizer with NLP tokenization.

Provides a scikit-learn-like interface with fit/transform methods returning dense vectors. Uses TextTokenizer for general text.

Initialize the extractor.

PARAMETER | DESCRIPTION

use_stopwords
    Whether to filter stopwords.
    TYPE: bool | DEFAULT: True

stopword_set
    Which stopword set to use ('prompt' | 'code').
    TYPE: str | DEFAULT: 'prompt'

Attributes

logger (instance attribute)
Python
logger = get_logger(__name__)

use_stopwords (instance attribute)
Python
use_stopwords = use_stopwords

stopword_set (instance attribute)
Python
stopword_set = stopword_set

tokenizer (instance attribute)
Python
tokenizer = TextTokenizer(use_stopwords=use_stopwords)

Functions

fit
Python
fit(documents: List[str]) -> TFIDFExtractor

Learn vocabulary and IDF from documents.

PARAMETER | DESCRIPTION

documents
    List of input texts.
    TYPE: List[str]

RETURNS | DESCRIPTION

TFIDFExtractor
    self

transform
Python
transform(documents: List[str]) -> List[List[float]]

Transform documents to dense TF-IDF vectors.

PARAMETER | DESCRIPTION

documents
    List of input texts.
    TYPE: List[str]

RETURNS | DESCRIPTION

List[List[float]]
    List of dense vectors (each aligned to the learned vocabulary).

fit_transform
Python
fit_transform(documents: List[str]) -> List[List[float]]

Fit to documents, then transform them.

get_feature_names
Python
get_feature_names() -> List[str]

Return the learned vocabulary as a list of feature names.
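
A minimal usage sketch of the fit/transform interface, assuming TFIDFExtractor is importable from tenets.core.nlp as documented on this page:

Python
from tenets.core.nlp import TFIDFExtractor

extractor = TFIDFExtractor(use_stopwords=True, stopword_set='prompt')
docs = [
    "implement OAuth2 authentication flow",
    "refactor the authentication middleware",
]
vectors = extractor.fit_transform(docs)   # one dense vector per document
vocab = extractor.get_feature_names()     # column order of each vector
assert len(vectors) == len(docs) and len(vectors[0]) == len(vocab)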

StopwordManager

Python
StopwordManager(data_dir: Optional[Path] = None)

Manages multiple stopword sets for different contexts.

Initialize stopword manager.

PARAMETER | DESCRIPTION

data_dir
    Directory containing stopword files.
    TYPE: Optional[Path] | DEFAULT: None

Attributes

DEFAULT_DATA_DIR (class attribute, instance attribute)
Python
DEFAULT_DATA_DIR = parent / 'data' / 'stopwords'

logger (instance attribute)
Python
logger = get_logger(__name__)

data_dir (instance attribute)
Python
data_dir = data_dir or DEFAULT_DATA_DIR

Functions

get_set
Python
get_set(name: str) -> Optional[StopwordSet]

Get a stopword set by name.

PARAMETER | DESCRIPTION

name
    Name of stopword set ('code', 'prompt', etc.).
    TYPE: str

RETURNS | DESCRIPTION

Optional[StopwordSet]
    StopwordSet, or None if not found.

add_custom_set
Python
add_custom_set(name: str, words: Set[str], description: str = '') -> StopwordSet

Add a custom stopword set.

PARAMETER | DESCRIPTION

name
    Name for the set.
    TYPE: str

words
    Set of stopword strings.
    TYPE: Set[str]

description
    What this set is for.
    TYPE: str | DEFAULT: ''

RETURNS | DESCRIPTION

StopwordSet
    Created StopwordSet.

combine_sets
Python
combine_sets(sets: List[str], name: str = 'combined') -> StopwordSet

Combine multiple stopword sets.

PARAMETER | DESCRIPTION

sets
    Names of sets to combine.
    TYPE: List[str]

name
    Name for the combined set.
    TYPE: str | DEFAULT: 'combined'

RETURNS | DESCRIPTION

StopwordSet
    Combined StopwordSet.
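
A short sketch of building and combining stopword sets, assuming StopwordManager is importable from tenets.core.nlp and the bundled 'prompt' set is present:

Python
from tenets.core.nlp import StopwordManager

manager = StopwordManager()                 # uses DEFAULT_DATA_DIR
prompt_set = manager.get_set('prompt')      # None if the set cannot be found
custom = manager.add_custom_set('project', {'todo', 'fixme'},
                                description='Project-specific noise words')
merged = manager.combine_sets(['prompt', 'project'], name='prompt_plus_project')
print(len(merged.words))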

StopwordSet (dataclass)

Python
StopwordSet(name: str, words: Set[str], description: str, source_file: Optional[Path] = None)

A set of stopwords with metadata.

ATTRIBUTE | DESCRIPTION

name
    Name of this stopword set.
    TYPE: str

words
    Set of stopword strings.
    TYPE: Set[str]

description
    What this set is used for.
    TYPE: str

source_file
    Path to source file.
    TYPE: Optional[Path]

Attributes

name (instance attribute)
Python
name: str

words (instance attribute)
Python
words: Set[str]

description (instance attribute)
Python
description: str

source_file (class attribute, instance attribute)
Python
source_file: Optional[Path] = None

Functions

filter
Python
filter(words: List[str]) -> List[str]

Filter stopwords from word list.

PARAMETER | DESCRIPTION

words
    List of words to filter.
    TYPE: List[str]

RETURNS | DESCRIPTION

List[str]
    Filtered list without stopwords.
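
A minimal sketch of filtering a token list through a named set, assuming the 'code' set loads as documented for StopwordManager.get_set:

Python
from tenets.core.nlp import StopwordManager

code_set = StopwordManager().get_set('code')
if code_set is not None:
    tokens = ['the', 'parse', 'config', 'and', 'return']
    print(code_set.filter(tokens))   # common words such as 'the' and 'and' removed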

CodeTokenizer

Python
CodeTokenizer(use_stopwords: bool = False)

Tokenizer optimized for source code.

Handles:
  - camelCase and PascalCase splitting
  - snake_case splitting
  - Preserves original tokens for exact matching
  - Language-specific keywords
  - Optional stopword filtering

Initialize code tokenizer.

PARAMETER | DESCRIPTION

use_stopwords
    Whether to filter stopwords.
    TYPE: bool | DEFAULT: False

Attributes

logger (instance attribute)
Python
logger = get_logger(__name__)

use_stopwords (instance attribute)
Python
use_stopwords = use_stopwords

stopwords (instance attribute)
Python
stopwords = get_set('code')

token_pattern (instance attribute)
Python
token_pattern = compile('\\b[a-zA-Z_][a-zA-Z0-9_]*\\b')

camel_case_pattern (instance attribute)
Python
camel_case_pattern = compile('[A-Z][a-z]+|[a-z]+|[A-Z]+(?=[A-Z][a-z]|\\b)')

snake_case_pattern (instance attribute)
Python
snake_case_pattern = compile('[a-z]+|[A-Z]+')

Functions

tokenize
Python
tokenize(text: str, language: Optional[str] = None, preserve_original: bool = True) -> List[str]

Tokenize code text.

PARAMETER | DESCRIPTION

text
    Code to tokenize.
    TYPE: str

language
    Programming language (for language-specific handling).
    TYPE: Optional[str] | DEFAULT: None

preserve_original
    Keep original tokens alongside splits.
    TYPE: bool | DEFAULT: True

RETURNS | DESCRIPTION

List[str]
    List of tokens.

tokenize_identifier
Python
tokenize_identifier(identifier: str) -> List[str]

Tokenize a single identifier (function/class/variable name).

PARAMETER | DESCRIPTION

identifier
    Identifier to tokenize.
    TYPE: str

RETURNS | DESCRIPTION

List[str]
    List of component tokens.
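
A short sketch of identifier splitting, assuming CodeTokenizer is importable from tenets.core.nlp; the exact ordering of the output tokens is illustrative:

Python
from tenets.core.nlp import CodeTokenizer

tokenizer = CodeTokenizer()
# camelCase and snake_case identifiers are split into components; with
# preserve_original=True the full identifiers are kept alongside the splits.
print(tokenizer.tokenize("parseHttpRequest(user_id)", language="python"))
print(tokenizer.tokenize_identifier("parseHttpRequest"))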

TextTokenizer

Python
TextTokenizer(use_stopwords: bool = True)

Tokenizer for natural language text (prompts, comments, docs).

More aggressive than CodeTokenizer, designed for understanding user intent rather than exact matching.

Initialize text tokenizer.

PARAMETER | DESCRIPTION

use_stopwords
    Whether to filter stopwords.
    TYPE: bool | DEFAULT: True

Attributes

logger (instance attribute)
Python
logger = get_logger(__name__)

use_stopwords (instance attribute)
Python
use_stopwords = use_stopwords

stopwords (instance attribute)
Python
stopwords = get_set('prompt')

token_pattern (instance attribute)
Python
token_pattern = compile('\\b[a-zA-Z][a-zA-Z0-9]*\\b')

Functions

tokenize
Python
tokenize(text: str, min_length: int = 2) -> List[str]

Tokenize natural language text.

PARAMETER | DESCRIPTION

text
    Text to tokenize.
    TYPE: str

min_length
    Minimum token length.
    TYPE: int | DEFAULT: 2

RETURNS | DESCRIPTION

List[str]
    List of tokens.

extract_ngrams
Python
extract_ngrams(text: str, n: int = 2) -> List[str]

Extract n-grams from text.

PARAMETER | DESCRIPTION

text
    Input text.
    TYPE: str

n
    Size of n-grams.
    TYPE: int | DEFAULT: 2

RETURNS | DESCRIPTION

List[str]
    List of n-grams.
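
A minimal sketch combining tokenization and bigram extraction, assuming TextTokenizer is importable from tenets.core.nlp; the printed results are illustrative:

Python
from tenets.core.nlp import TextTokenizer

tokenizer = TextTokenizer(use_stopwords=True)
text = "implement OAuth2 authentication for the admin dashboard"
print(tokenizer.tokenize(text))              # stopwords like 'the' and 'for' filtered
print(tokenizer.extract_ngrams(text, n=2))   # adjacent word pairs (bigrams)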

LocalEmbeddings

Python
LocalEmbeddings(model_name: str = 'all-MiniLM-L6-v2', device: Optional[str] = None, cache_dir: Optional[Path] = None)

Bases: EmbeddingModel

Local embedding generation using sentence transformers.

This runs completely locally with no external API calls. Models are downloaded and cached by sentence-transformers.

Initialize local embeddings.

PARAMETER | DESCRIPTION

model_name
    Sentence transformer model name.
    TYPE: str | DEFAULT: 'all-MiniLM-L6-v2'

device
    Device to use ('cpu', 'cuda', or None for auto).
    TYPE: Optional[str] | DEFAULT: None

cache_dir
    Directory to cache models.
    TYPE: Optional[Path] | DEFAULT: None

Attributes

device (instance attribute)
Python
device = device

model (instance attribute)
Python
model = SentenceTransformer(model_name, device=device, cache_folder=str(cache_dir) if cache_dir else None)

embedding_dim (instance attribute)
Python
embedding_dim = get_sentence_embedding_dimension()

Functions

encode
Python
encode(texts: Union[str, List[str]], batch_size: int = 32, show_progress: bool = False, normalize: bool = True) -> np.ndarray

Encode texts to embeddings.

PARAMETER | DESCRIPTION

texts
    Text or list of texts.
    TYPE: Union[str, List[str]]

batch_size
    Batch size for encoding.
    TYPE: int | DEFAULT: 32

show_progress
    Show a progress bar.
    TYPE: bool | DEFAULT: False

normalize
    L2-normalize the embeddings.
    TYPE: bool | DEFAULT: True

RETURNS | DESCRIPTION

ndarray
    NumPy array of embeddings.

encode_file
Python
encode_file(file_path: Path, chunk_size: int = 1000, overlap: int = 100) -> np.ndarray

Encode a file with chunking for long files.

PARAMETER | DESCRIPTION

file_path
    Path to the file.
    TYPE: Path

chunk_size
    Characters per chunk.
    TYPE: int | DEFAULT: 1000

overlap
    Overlap between chunks, in characters.
    TYPE: int | DEFAULT: 100

RETURNS | DESCRIPTION

ndarray
    Mean-pooled embedding for the file.
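
A usage sketch, assuming the optional ML dependencies (sentence-transformers) are installed and LocalEmbeddings is importable from tenets.core.nlp; similarity is computed here directly with NumPy rather than a package helper:

Python
import numpy as np
from tenets.core.nlp import LocalEmbeddings

embedder = LocalEmbeddings(model_name='all-MiniLM-L6-v2')
vectors = embedder.encode(
    ["implement OAuth2 login", "add OAuth2 authentication"],
    normalize=True,
)
# With L2-normalized embeddings, cosine similarity is just a dot product.
print(float(np.dot(vectors[0], vectors[1])))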

EmbeddingModel

Python
EmbeddingModel(*args, **kwargs)

Stub used when ML features are not available.

SemanticSimilarity

Python
SemanticSimilarity(*args, **kwargs)

Stub used when ML features are not available.

EmbeddingCache

Python
EmbeddingCache(*args, **kwargs)

Stub used when ML features are not available.

Functions

cosine_similarity

Python
cosine_similarity(a, b)

Stub used when ML features are not available.

sparse_cosine_similarity

Python
sparse_cosine_similarity(a, b)

Stub used when ML features are not available.

euclidean_distance

Python
euclidean_distance(a, b)

Stub used when ML features are not available.

manhattan_distance

Python
manhattan_distance(a, b)

Stub used when ML features are not available.

extract_keywords

Python
extract_keywords(text: str, max_keywords: int = 20, use_yake: bool = True, language: str = 'en') -> List[str]

Extract keywords from text using best available method.

PARAMETER | DESCRIPTION

text
    Input text.
    TYPE: str

max_keywords
    Maximum number of keywords to extract.
    TYPE: int | DEFAULT: 20

use_yake
    Try YAKE first if available.
    TYPE: bool | DEFAULT: True

language
    Language for YAKE.
    TYPE: str | DEFAULT: 'en'

RETURNS | DESCRIPTION

List[str]
    List of extracted keywords.

tokenize_code

Python
tokenize_code(code: str, language: Optional[str] = None, use_stopwords: bool = False) -> List[str]

Tokenize code with language-aware processing.

PARAMETER | DESCRIPTION

code
    Source code to tokenize.
    TYPE: str

language
    Programming language (auto-detected if None).
    TYPE: Optional[str] | DEFAULT: None

use_stopwords
    Whether to filter stopwords.
    TYPE: bool | DEFAULT: False

RETURNS | DESCRIPTION

List[str]
    List of tokens.

compute_similarity

Python
compute_similarity(text1: str, text2: str, method: str = 'auto') -> float

Compute similarity between two texts.

PARAMETER | DESCRIPTION

text1
    First text.
    TYPE: str

text2
    Second text.
    TYPE: str

method
    One of 'semantic', 'tfidf', or 'auto'.
    TYPE: str | DEFAULT: 'auto'

RETURNS | DESCRIPTION

float
    Similarity score in the range 0-1.
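
A combined sketch of the module-level convenience functions, assuming they are importable from tenets.core.nlp as documented on this page; the printed values are illustrative:

Python
from tenets.core.nlp import compute_similarity, extract_keywords, tokenize_code

keywords = extract_keywords("implement OAuth2 authentication", max_keywords=5)
tokens = tokenize_code("def parse_http_request(raw_bytes): ...", language="python")
score = compute_similarity("add OAuth2 login",
                           "implement OAuth2 authentication",
                           method="auto")   # 'auto' picks the best available backend
print(keywords, tokens, round(score, 2))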

Modules