DensityClassifierFull

```
DensityClassifierFull(
    embedding_dim=384,
    num_bits=14,
    seed=31,
    isolation_k=10,
    isolation_sample_size=1000,
    num_stability_seeds=0,
)
```

Density classifier using PCA-based LSH.

Returns raw density metrics per item; classification is up to you.
Example

```
classifier = DensityClassifier(embedding_dim=384)
classifier.fit(embeddings, categories=categories)
print(classifier.report())

# Get raw metrics
labels = classifier.get_labels()
sparse = labels.filter(pl.col('bucket_size') < 10)
isolated = labels.filter(pl.col('isolation_score') > 0.5)
```
Methods
| Name | Description |
|---|---|
| fit | Fit the density classifier. |
| from_pandas | Create classifier from a Pandas DataFrame. |
| from_polars | Create classifier from a Polars DataFrame. |
| from_texts | Create classifier from raw texts using TF-IDF + SVD embeddings. |
| get_bucket_ids | Get bucket IDs for all items. |
| get_bucket_sizes | Get bucket sizes for all items. |
| get_centroid_similarities | Get centroid similarities for all items. |
| get_isolation_scores | Get isolation scores for all items. |
| get_labels | Get per-record labels as a Polars DataFrame. |
| get_stability_scores | Get stability scores for all items (0-1, higher = more stable). |
| label_buckets | Generate descriptive labels for buckets using a local LLM. |
| label_buckets_keywords | Generate labels for buckets using keyword extraction (no LLM required). |
| report | Get the density classification report. |
| to_pandas | Return source DataFrame with density metrics columns added. |
| to_polars | Return source DataFrame with density metrics columns added. |
fit

```
DensityClassifierFull.fit(
    embeddings,
    categories=None,
    texts=None,
    normalize=True,
    verbose=True,
)
```

Fit the density classifier.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| embeddings | np.ndarray | (n, d) array of embedding vectors | required |
| categories | Optional[List[str]] | Optional list of category labels | None |
| texts | Optional[List[str]] | Optional list of text content | None |
| normalize | bool | Whether to L2-normalize embeddings | True |
| verbose | bool | Print progress | True |
Returns
| Name | Type | Description |
|---|---|---|
| | DensityClassifier | self (for chaining) |
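The mechanics of the PCA-based LSH bucketing aren't spelled out here; below is a minimal numpy sketch of one plausible scheme, assuming sign-bit hashing over principal components (`lsh_bucket_ids` is a hypothetical helper for illustration, not part of this API):

```python
import numpy as np

def lsh_bucket_ids(embeddings, num_bits=14):
    # Hypothetical sketch: L2-normalize rows, project onto the top
    # principal directions, and pack the sign bit of each projection
    # into a single integer bucket ID per row.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # principal directions
    planes = Vt[:num_bits]                    # (num_bits, d) hyperplanes
    bits = (X @ planes.T) > 0                 # (n, num_bits) sign bits
    return bits @ (1 << np.arange(num_bits))  # pack bits -> bucket ID
```

Under this reading, nearby vectors tend to share sign bits and so land in the same bucket, and bucket size serves as a density proxy.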
from_pandas

```
DensityClassifierFull.from_pandas(
    df,
    embedding_col,
    category_col=None,
    text_col=None,
    **kwargs,
)
```

Create classifier from a Pandas DataFrame.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| df | pd.DataFrame | Pandas DataFrame with embeddings | required |
| embedding_col | str | Column name containing embedding vectors (list of floats) | required |
| category_col | Optional[str] | Optional column name for category labels | None |
| text_col | Optional[str] | Optional column name for text content (enables labeling) | None |
| **kwargs | | Additional args passed to init (num_bits, seed, etc.) | {} |
Returns
| Name | Type | Description |
|---|---|---|
| | DensityClassifier | Fitted DensityClassifier instance |
Example

```
df = pd.read_parquet("embeddings.parquet")
classifier = DensityClassifier.from_pandas(
    df,
    embedding_col="embedding",
    category_col="category",
)
result = classifier.to_pandas()
```
from_polars

```
DensityClassifierFull.from_polars(
    df,
    embedding_col,
    category_col=None,
    text_col=None,
    **kwargs,
)
```

Create classifier from a Polars DataFrame.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| df | pl.DataFrame | Polars DataFrame with embeddings | required |
| embedding_col | str | Column name containing embedding vectors (list of floats) | required |
| category_col | Optional[str] | Optional column name for category labels | None |
| text_col | Optional[str] | Optional column name for text content (enables labeling) | None |
| **kwargs | | Additional args passed to init (num_bits, seed, etc.) | {} |
Returns
| Name | Type | Description |
|---|---|---|
| | DensityClassifier | Fitted DensityClassifier instance |
Example

```
df = pl.read_parquet("embeddings.parquet")
classifier = DensityClassifier.from_polars(
    df,
    embedding_col="embedding",
    category_col="category",
)
result = classifier.to_polars()
```
from_texts

```
DensityClassifierFull.from_texts(
    texts,
    categories=None,
    embedding_dim=128,
    max_features=10000,
    min_df=2,
    max_df=0.95,
    ngram_range=(1, 2),
    num_bits=12,
    verbose=True,
    **kwargs,
)
```

Create classifier from raw texts using TF-IDF + SVD embeddings.
No external embedding model required. Uses sklearn’s TfidfVectorizer and TruncatedSVD to create dense embeddings from text.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| texts | List[str] | List of text documents | required |
| categories | Optional[List[str]] | Optional category labels | None |
| embedding_dim | int | SVD output dimensions (default 128) | 128 |
| max_features | int | Max vocabulary size for TF-IDF | 10000 |
| min_df | int | Min document frequency for terms | 2 |
| max_df | float | Max document frequency for terms | 0.95 |
| ngram_range | Tuple[int, int] | N-gram range (default unigrams + bigrams) | (1, 2) |
| num_bits | int | Bits for PCA LSH | 12 |
| verbose | bool | Print progress | True |
| **kwargs | | Additional args passed to init | {} |
Returns
| Name | Type | Description |
|---|---|---|
| | DensityClassifier | Fitted DensityClassifier instance |
Example

```
texts = ["doc about machine learning", "another ml paper", ...]
classifier = DensityClassifier.from_texts(texts, categories=cats)
print(classifier.report())
```
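Since the docstring names sklearn's TfidfVectorizer and TruncatedSVD, the embedding step can be approximated standalone like this (`texts_to_embeddings` is an illustrative helper mirroring the stated pipeline, not the library's actual code):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def texts_to_embeddings(texts, embedding_dim=128, max_features=10000,
                        min_df=2, max_df=0.95, ngram_range=(1, 2)):
    # Sparse TF-IDF matrix -> dense low-rank embeddings via truncated SVD.
    vec = TfidfVectorizer(max_features=max_features, min_df=min_df,
                          max_df=max_df, ngram_range=ngram_range)
    tfidf = vec.fit_transform(texts)
    # SVD rank cannot exceed the vocabulary size.
    n_components = min(embedding_dim, tfidf.shape[1] - 1)
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    return svd.fit_transform(tfidf)
```

Note that with a small corpus, `min_df=2` can prune most of the vocabulary, which is why the rank is clamped to the surviving feature count.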
get_bucket_ids

```
DensityClassifierFull.get_bucket_ids()
```

Get bucket IDs for all items.
get_bucket_sizes

```
DensityClassifierFull.get_bucket_sizes()
```

Get bucket sizes for all items.
get_centroid_similarities

```
DensityClassifierFull.get_centroid_similarities()
```

Get centroid similarities for all items.
get_isolation_scores

```
DensityClassifierFull.get_isolation_scores()
```

Get isolation scores for all items.
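The isolation score is only described as "how isolated the item is"; one plausible reading, given the `isolation_k` constructor parameter, is the mean cosine distance to the k most similar items. A sketch under that assumption (`knn_isolation_scores` is a hypothetical helper, not this API):

```python
import numpy as np

def knn_isolation_scores(embeddings, k=10):
    # Assumed definition: average cosine distance to the k most similar
    # items; higher means the item sits farther from its neighborhood.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)        # exclude self-similarity
    top_k = np.sort(sims, axis=1)[:, -k:]  # similarities of k neighbors
    return 1.0 - top_k.mean(axis=1)        # mean cosine distance
```

This naive version builds the full n-by-n similarity matrix; the `isolation_sample_size` parameter suggests the library instead scores against a sample to keep memory bounded.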
get_labels

```
DensityClassifierFull.get_labels()
```

Get per-record labels as a Polars DataFrame.

Returns a DataFrame with columns:
- index: Record index (0-based)
- bucket_id: Primary LSH bucket ID
- bucket_size: Number of items in same bucket
- centroid_similarity: Cosine similarity to bucket centroid
- isolation_score: How isolated the item is
- stability_score: How stable bucket assignment is (0-1)
- category: Category label if provided during fit
Example

```
classifier.fit(embeddings, categories=categories)
labels = classifier.get_labels()
sparse = labels.filter(pl.col('bucket_size') < 10)
```
get_stability_scores

```
DensityClassifierFull.get_stability_scores()
```

Get stability scores for all items (0-1, higher = more stable).
label_buckets

```
DensityClassifierFull.label_buckets(
    base_url='http://localhost:11434/v1',
    model='qwen2.5:7b',
    samples_per_bucket=5,
    max_text_len=200,
    min_bucket_size=5,
    verbose=True,
)
```

Generate descriptive labels for buckets using a local LLM.
Uses an OpenAI-compatible API (works with Ollama or an MLX server).
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| base_url | str | API endpoint (Ollama: localhost:11434, MLX: localhost:8080) | 'http://localhost:11434/v1' |
| model | str | Model name | 'qwen2.5:7b' |
| samples_per_bucket | int | Number of representative texts to send | 5 |
| max_text_len | int | Max length of each sample text | 200 |
| min_bucket_size | int | Only label buckets with at least this many items | 5 |
| verbose | bool | Print progress | True |
Returns
| Name | Type | Description |
|---|---|---|
| | Dict[int, Dict] | Dict mapping bucket_id -> {'label': str, 'size': int, 'samples': List[str]} |
Example

```
labels = classifier.label_buckets(
    base_url="http://localhost:11434/v1",
    model="qwen2.5:7b",
)
print(labels[1234]['label'])
# 'Reinforcement Learning'
```
label_buckets_keywords

```
DensityClassifierFull.label_buckets_keywords(
    top_k=3,
    min_bucket_size=5,
    stopwords=None,
    min_word_len=3,
    use_tfidf=True,
)
```

Generate labels for buckets using keyword extraction (no LLM required).
Extracts top keywords from bucket texts using TF-IDF or frequency analysis.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| top_k | int | Number of top keywords to include in label | 3 |
| min_bucket_size | int | Only label buckets with at least this many items | 5 |
| stopwords | Optional[set] | Set of words to exclude (uses default if None) | None |
| min_word_len | int | Minimum word length to consider | 3 |
| use_tfidf | bool | Use TF-IDF weighting (vs simple frequency) | True |
Returns
| Name | Type | Description |
|---|---|---|
| | Dict[int, Dict] | Dict mapping bucket_id -> {'label': str, 'keywords': List[Tuple[str, float]], 'size': int} |
Example

```
labels = classifier.label_buckets_keywords()
print(labels[1234]['label'])
# 'neural network training'
```
report

```
DensityClassifierFull.report()
```

Get the density classification report.
to_pandas

```
DensityClassifierFull.to_pandas()
```

Return source DataFrame with density metrics columns added.

Returns a DataFrame with the original columns plus:
- bucket_id: LSH bucket ID
- bucket_size: Number of items in same bucket
- centroid_similarity: Cosine similarity to bucket centroid (0-1)
- isolation_score: How isolated the item is
- stability_score: How stable bucket assignment is (0-1)
Example

```
classifier = DensityClassifier.from_pandas(df, "embedding")
result = classifier.to_pandas()
sparse = result[result["bucket_size"] < 5]
```
to_polars

```
DensityClassifierFull.to_polars()
```

Return source DataFrame with density metrics columns added.

Returns a DataFrame with the original columns plus:
- bucket_id: LSH bucket ID
- bucket_size: Number of items in same bucket
- centroid_similarity: Cosine similarity to bucket centroid (0-1)
- isolation_score: How isolated the item is
- stability_score: How stable bucket assignment is (0-1)
Example

```
classifier = DensityClassifier.from_polars(df, "embedding")
result = classifier.to_polars()
sparse = result.filter(pl.col("bucket_size") < 5)
```