How it works
DYF uses a two-stage PCA-based LSH to classify every item in your corpus. Bucket sizes reveal density, centroid similarity reveals centrality, and stability under perturbation reveals bridges.
Turn any collection of embeddings into a map — find what's common, what connects, and what's unique.
Dense: core items in well-populated semantic regions. The backbone of your corpus.
Bridge: transitional items connecting different clusters. Critical for semantic navigation.
Orphan: unique items with no semantic neighbors. Anomalies worth investigating.
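The two-stage scheme above can be sketched in miniature. Everything here (toy data, component count, noise scale) is an illustrative assumption, not DYF's actual pipeline:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
# Toy embeddings: two tight blobs plus a few scattered outliers.
X = np.vstack([
    rng.normal(1.0, 0.1, (50, 8)),
    rng.normal(-1.0, 0.1, (50, 8)),
    rng.normal(0.0, 2.0, (5, 8)),
])

# Stage 1: PCA — center, then project onto the top principal directions.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
proj = Xc @ Vt[:2].T  # keep 2 components, giving 2-bit bucket keys

# Stage 2: LSH — sign-quantize each component into a hash key.
keys = [tuple(int(v > 0) for v in row) for row in proj]

# Bucket sizes reveal density: items in large buckets are "dense".
sizes = Counter(keys)
density = np.array([sizes[k] for k in keys])

# Stability under perturbation: an item whose key flips under small
# noise sits near a bucket boundary — a candidate "bridge".
jitter = rng.normal(0.0, 0.05, proj.shape)
flips = [tuple(int(v > 0) for v in row) != k
         for row, k in zip(proj + jitter, keys)]
```

The blob members crowd into a few large buckets while outliers land in sparse ones, which is the density signal the report summarizes.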
Fit a classifier, inspect the report, retrieve items by category.
from dyf import DensityClassifierFull
classifier = DensityClassifierFull.from_texts(
    texts=documents,        # any list of strings
    categories=categories,  # optional grouping
)
# What did we find?
print(classifier.report())
# Dense: 8,420 (84.2%) — core topics
# Bridge: 980 (9.8%) — cross-topic items
# Orphan: 600 (6.0%) — unique outliers
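DYF's own accessor for retrieving items by category isn't shown above, so here is a library-independent sketch of the pattern: group each item under its assigned label (documents and labels below are made up for illustration):

```python
from collections import defaultdict

# Hypothetical per-item labels, as a density classifier might assign.
docs = ["intro to PCA", "PCA in practice", "PCA meets LSH", "a recipe for flan"]
labels = ["dense", "dense", "bridge", "orphan"]

# Group items by category so each class can be inspected on its own.
by_category = defaultdict(list)
for doc, label in zip(docs, labels):
    by_category[label].append(doc)

print(by_category["bridge"])  # → ['PCA meets LSH']
```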
Trace paths between concepts through bridge items that connect distant clusters.
See how your data organizes itself — no k to pick, no taxonomy to impose.
Surface orphans that don't fit anywhere and bridges that span multiple domains.
Build .dyf files for instant-open, memory-mapped approximate search.
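The .dyf layout itself isn't specified here, but the memory-mapped idea behind "instant-open" search can be sketched with numpy.memmap. File name, shape, and scoring below are illustrative assumptions, not the real format:

```python
import os
import tempfile
import numpy as np

# Illustrative only — not the actual .dyf layout. Write unit-normalized
# vectors to disk so a dot product doubles as cosine similarity.
rng = np.random.default_rng(1)
vecs = rng.normal(size=(1000, 16)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

path = os.path.join(tempfile.mkdtemp(), "index.vec")
vecs.tofile(path)

# mode="r" maps the file read-only; opening reads no data up front,
# and pages are loaded lazily as the search touches them.
index = np.memmap(path, dtype=np.float32, mode="r", shape=(1000, 16))
query = np.asarray(index[0])   # use the first stored vector as the query
scores = index @ query         # cosine similarity against every row
best = int(np.argmax(scores))  # the query matches itself first
```

The design point is that open time is independent of index size: only the pages a query actually touches are read from disk.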
# Core
pip install dyf
# With serialization
pip install "dyf[io]"
# Full features (embeddings, LLM labeling)
pip install "dyf[full]"