DensityClassifierFull

DensityClassifierFull(
    embedding_dim=384,
    num_bits=14,
    seed=31,
    isolation_k=10,
    isolation_sample_size=1000,
    num_stability_seeds=0,
)

Density classifier using PCA-based locality-sensitive hashing (LSH).

Returns raw density metrics per item; classification is up to you.

Example

classifier = DensityClassifier(embedding_dim=384)
classifier.fit(embeddings, categories=categories)
print(classifier.report())

Get raw metrics

labels = classifier.get_labels()
sparse = labels.filter(pl.col("bucket_size") < 10)
isolated = labels.filter(pl.col("isolation_score") > 0.5)
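To make the raw metrics concrete, here is a minimal numpy sketch of PCA-based LSH bucketing on toy data. It is illustrative only, not the library's exact implementation: each item is projected onto the top principal directions, the sign of each projection becomes one hash bit, and the packed bits form a bucket ID.

```python
import numpy as np

rng = np.random.default_rng(31)
X = rng.normal(size=(200, 16))                 # toy "embeddings"
X /= np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize rows

num_bits = 4
Xc = X - X.mean(axis=0)                        # center before PCA
# Rows of Vt are the principal directions (PCA via SVD)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
proj = Xc @ Vt[:num_bits].T                    # (n, num_bits) projections
bits = (proj > 0).astype(int)                  # one sign bit per component
bucket_id = bits @ (1 << np.arange(num_bits))  # pack bits into an integer bucket ID

# Per-item bucket size, analogous to what get_bucket_sizes() reports
_, inverse, counts = np.unique(bucket_id, return_inverse=True, return_counts=True)
bucket_size = counts[inverse]
```

Items whose projections land on the same side of every principal hyperplane share a bucket, so dense regions produce large buckets and outliers produce small ones.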

Methods

Name                       Description
fit                        Fit the density classifier.
from_pandas                Create classifier from a Pandas DataFrame.
from_polars                Create classifier from a Polars DataFrame.
from_texts                 Create classifier from raw texts using TF-IDF + SVD embeddings.
get_bucket_ids             Get bucket IDs for all items.
get_bucket_sizes           Get bucket sizes for all items.
get_centroid_similarities  Get centroid similarities for all items.
get_isolation_scores       Get isolation scores for all items.
get_labels                 Get per-record labels as a Polars DataFrame.
get_stability_scores       Get stability scores for all items (0-1, higher = more stable).
label_buckets              Generate descriptive labels for buckets using a local LLM.
label_buckets_keywords     Generate labels for buckets using keyword extraction (no LLM required).
report                     Get the density classification report.
to_pandas                  Return source DataFrame with density metrics columns added.
to_polars                  Return source DataFrame with density metrics columns added.

fit

DensityClassifierFull.fit(
    embeddings,
    categories=None,
    texts=None,
    normalize=True,
    verbose=True,
)

Fit the density classifier.

Parameters

embeddings (np.ndarray): (n, d) array of embedding vectors. Required.
categories (Optional[List[str]]): Optional list of category labels. Default: None.
texts (Optional[List[str]]): Optional list of text content. Default: None.
normalize (bool): Whether to L2-normalize embeddings. Default: True.
verbose (bool): Print progress. Default: True.

Returns

DensityClassifier: self (for chaining).

from_pandas

DensityClassifierFull.from_pandas(
    df,
    embedding_col,
    category_col=None,
    text_col=None,
    **kwargs,
)

Create classifier from a Pandas DataFrame.

Parameters

df (pd.DataFrame): Pandas DataFrame with embeddings. Required.
embedding_col (str): Column name containing embedding vectors (list of floats). Required.
category_col (Optional[str]): Optional column name for category labels. Default: None.
text_col (Optional[str]): Optional column name for text content (enables labeling). Default: None.
**kwargs: Additional args passed to init (num_bits, seed, etc.).

Returns

DensityClassifier: Fitted DensityClassifier instance.

Example

df = pd.read_parquet("embeddings.parquet")
classifier = DensityClassifier.from_pandas(
    df,
    embedding_col="embedding",
    category_col="category",
)
result = classifier.to_pandas()

from_polars

DensityClassifierFull.from_polars(
    df,
    embedding_col,
    category_col=None,
    text_col=None,
    **kwargs,
)

Create classifier from a Polars DataFrame.

Parameters

df (pl.DataFrame): Polars DataFrame with embeddings. Required.
embedding_col (str): Column name containing embedding vectors (list of floats). Required.
category_col (Optional[str]): Optional column name for category labels. Default: None.
text_col (Optional[str]): Optional column name for text content (enables labeling). Default: None.
**kwargs: Additional args passed to init (num_bits, seed, etc.).

Returns

DensityClassifier: Fitted DensityClassifier instance.

Example

df = pl.read_parquet("embeddings.parquet")
classifier = DensityClassifier.from_polars(
    df,
    embedding_col="embedding",
    category_col="category",
)
result = classifier.to_polars()

from_texts

DensityClassifierFull.from_texts(
    texts,
    categories=None,
    embedding_dim=128,
    max_features=10000,
    min_df=2,
    max_df=0.95,
    ngram_range=(1, 2),
    num_bits=12,
    verbose=True,
    **kwargs,
)

Create classifier from raw texts using TF-IDF + SVD embeddings.

No external embedding model required. Uses sklearn's TfidfVectorizer and TruncatedSVD to create dense embeddings from text.

Parameters

texts (List[str]): List of text documents. Required.
categories (Optional[List[str]]): Optional category labels. Default: None.
embedding_dim (int): SVD output dimensions. Default: 128.
max_features (int): Max vocabulary size for TF-IDF. Default: 10000.
min_df (int): Min document frequency for terms. Default: 2.
max_df (float): Max document frequency for terms. Default: 0.95.
ngram_range (Tuple[int, int]): N-gram range (unigrams + bigrams by default). Default: (1, 2).
num_bits (int): Bits for PCA LSH. Default: 12.
verbose (bool): Print progress. Default: True.
**kwargs: Additional args passed to init.

Returns

DensityClassifier: Fitted DensityClassifier instance.

Example

texts = ["doc about machine learning", "another ml paper", ...]
classifier = DensityClassifier.from_texts(texts, categories=cats)
print(classifier.report())
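The TF-IDF + SVD pipeline this method wraps can be sketched with sklearn directly. The toy corpus below uses min_df=1 only because it is tiny (the method defaults to min_df=2), and n_components=2 stands in for embedding_dim:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

texts = [
    "machine learning for text",
    "deep learning paper",
    "gradient descent tutorial",
    "cooking pasta at home",
    "baking bread recipes",
    "text classification with svm",
]
# Sparse TF-IDF matrix over unigrams and bigrams
vec = TfidfVectorizer(max_features=10000, min_df=1, max_df=0.95, ngram_range=(1, 2))
tfidf = vec.fit_transform(texts)
# TruncatedSVD works on sparse input, producing dense embeddings
svd = TruncatedSVD(n_components=2, random_state=0)
emb = svd.fit_transform(tfidf)  # (n_docs, embedding_dim) dense array
```

The resulting dense array is what the classifier then buckets with PCA LSH, exactly as if you had passed precomputed embeddings to fit().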

get_bucket_ids

DensityClassifierFull.get_bucket_ids()

Get bucket IDs for all items.

get_bucket_sizes

DensityClassifierFull.get_bucket_sizes()

Get bucket sizes for all items.

get_centroid_similarities

DensityClassifierFull.get_centroid_similarities()

Get centroid similarities for all items.
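Centroid similarity measures how close an item sits to the mean of its bucket. A minimal numpy sketch on toy data with random bucket assignments (the real buckets come from the LSH step):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm rows
bucket_id = rng.integers(0, 4, size=50)         # stand-in bucket assignments

sims = np.empty(len(X))
for b in np.unique(bucket_id):
    members = X[bucket_id == b]
    centroid = members.mean(axis=0)
    centroid /= np.linalg.norm(centroid)        # normalize the centroid
    sims[bucket_id == b] = members @ centroid   # cosine sim (rows already unit-norm)
```

High values mean the item is typical of its bucket; low values flag items that were hashed together but sit far from the bucket's core.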

get_isolation_scores

DensityClassifierFull.get_isolation_scores()

Get isolation scores for all items.
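The isolation_k constructor parameter suggests a k-nearest-neighbour notion of isolation. A hedged numpy sketch of one plausible definition (not necessarily the library's exact formula): one minus the mean cosine similarity to the k nearest neighbours, so items with no close neighbours score high.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)

k = 10
S = X @ X.T                          # pairwise cosine similarities
np.fill_diagonal(S, -np.inf)         # exclude self-similarity
topk = np.sort(S, axis=1)[:, -k:]    # k largest similarities per row
isolation = 1.0 - topk.mean(axis=1)  # low neighbour similarity -> high isolation
```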

get_labels

DensityClassifierFull.get_labels()

Get per-record labels as a Polars DataFrame.

Returns DataFrame with columns

  • index: Record index (0-based)
  • bucket_id: Primary LSH bucket ID
  • bucket_size: Number of items in same bucket
  • centroid_similarity: Cosine similarity to bucket centroid
  • isolation_score: How isolated the item is
  • stability_score: How stable bucket assignment is (0-1)
  • category: Category label if provided during fit

Example

classifier.fit(embeddings, categories=categories)
labels = classifier.get_labels()
sparse = labels.filter(pl.col("bucket_size") < 10)

get_stability_scores

DensityClassifierFull.get_stability_scores()

Get stability scores for all items (0-1, higher = more stable).
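Given the num_stability_seeds constructor parameter, one plausible way to compute such a score is to re-hash under several seeds and measure how often an item stays co-bucketed with its nearest neighbour. The sketch below uses random hyperplane hashes rather than the library's PCA hashes, and the exact formula is an assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Nearest neighbour of each item by cosine similarity
S = X @ X.T
np.fill_diagonal(S, -np.inf)
nn = S.argmax(axis=1)

num_bits, num_seeds = 4, 8
agree = np.zeros(len(X))
for seed in range(num_seeds):
    planes = np.random.default_rng(seed).normal(size=(num_bits, X.shape[1]))
    bits = (X @ planes.T > 0).astype(int)
    bucket = bits @ (1 << np.arange(num_bits))
    agree += bucket == bucket[nn]     # does the item co-bucket with its NN?
stability = agree / num_seeds         # fraction of seeds, in [0, 1]
```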

label_buckets

DensityClassifierFull.label_buckets(
    base_url='http://localhost:11434/v1',
    model='qwen2.5:7b',
    samples_per_bucket=5,
    max_text_len=200,
    min_bucket_size=5,
    verbose=True,
)

Generate descriptive labels for buckets using a local LLM.

Uses OpenAI-compatible API (works with Ollama or MLX server).

Parameters

base_url (str): API endpoint (Ollama: localhost:11434, MLX: localhost:8080). Default: 'http://localhost:11434/v1'.
model (str): Model name. Default: 'qwen2.5:7b'.
samples_per_bucket (int): Number of representative texts to send. Default: 5.
max_text_len (int): Max length of each sample text. Default: 200.
min_bucket_size (int): Only label buckets with at least this many items. Default: 5.
verbose (bool): Print progress. Default: True.

Returns

Dict[int, Dict]: Dict mapping bucket_id -> {'label': str, 'size': int, 'samples': List[str]}.

Example

labels = classifier.label_buckets(
    base_url="http://localhost:11434/v1",
    model="qwen2.5:7b",
)
print(labels[1234]["label"])
# 'Reinforcement Learning'

label_buckets_keywords

DensityClassifierFull.label_buckets_keywords(
    top_k=3,
    min_bucket_size=5,
    stopwords=None,
    min_word_len=3,
    use_tfidf=True,
)

Generate labels for buckets using keyword extraction (no LLM required).

Extracts top keywords from bucket texts using TF-IDF or frequency analysis.

Parameters

top_k (int): Number of top keywords to include in label. Default: 3.
min_bucket_size (int): Only label buckets with at least this many items. Default: 5.
stopwords (Optional[set]): Set of words to exclude (uses default if None). Default: None.
min_word_len (int): Minimum word length to consider. Default: 3.
use_tfidf (bool): Use TF-IDF weighting (vs simple frequency). Default: True.

Returns

Dict[int, Dict]: Dict mapping bucket_id -> {'label': str, 'keywords': List[Tuple[str, float]], 'size': int}.

Example

labels = classifier.label_buckets_keywords()
print(labels[1234]["label"])
# 'neural network training'
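The simple-frequency variant (the use_tfidf=False path) can be sketched in pure Python. The helper name and the stopword set here are illustrative, not the library's defaults:

```python
from collections import Counter
import re

STOP = {"the", "a", "an", "of", "and", "to", "in", "for", "with", "on"}

def bucket_keywords(texts, top_k=3, min_word_len=3):
    """Frequency-based keyword label for one bucket's texts."""
    counts = Counter(
        w for t in texts
        for w in re.findall(r"[a-z]+", t.lower())
        if len(w) >= min_word_len and w not in STOP
    )
    # Join the top_k most frequent surviving words into a label
    return " ".join(w for w, _ in counts.most_common(top_k))

label = bucket_keywords([
    "training a neural network",
    "neural network training tips",
    "network training schedules",
])
```

TF-IDF weighting differs only in down-weighting words that are common across all buckets, which tends to produce more distinctive labels.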

report

DensityClassifierFull.report()

Get the density classification report.

to_pandas

DensityClassifierFull.to_pandas()

Return source DataFrame with density metrics columns added.

Returns DataFrame with original columns plus

  • bucket_id: LSH bucket ID
  • bucket_size: Number of items in same bucket
  • centroid_similarity: Cosine similarity to bucket centroid (0-1)
  • isolation_score: How isolated the item is
  • stability_score: How stable bucket assignment is (0-1)

Example

classifier = DensityClassifier.from_pandas(df, "embedding")
result = classifier.to_pandas()
sparse = result[result["bucket_size"] < 5]

to_polars

DensityClassifierFull.to_polars()

Return source DataFrame with density metrics columns added.

Returns DataFrame with original columns plus

  • bucket_id: LSH bucket ID
  • bucket_size: Number of items in same bucket
  • centroid_similarity: Cosine similarity to bucket centroid (0-1)
  • isolation_score: How isolated the item is
  • stability_score: How stable bucket assignment is (0-1)

Example

classifier = DensityClassifier.from_polars(df, "embedding")
result = classifier.to_polars()
sparse = result.filter(pl.col("bucket_size") < 5)