DYF discovers structure in embedding spaces using density-based LSH. Every item gets classified as Dense (core), Bridge (connector), or Orphan (unique) — no k to pick, no hyperparameters to tune.
pip install dyfGitHub | PyPI | How It Works | API Reference |
Quick start
from dyf import DensityClassifierFull
classifier = DensityClassifierFull.from_texts(
texts=documents, # any list of strings
categories=categories, # optional grouping
)
print(classifier.report())
# Dense: 8,420 (84.2%) — core topics
# Bridge: 980 (9.8%) — cross-topic items
# Orphan: 600 (6.0%) — unique outliersBuild & search indexes
DYF writes .dyf files — memory-mapped indexes with zero startup cost:
from dyf import build_dyf_tree, write_lazy_index, LazyIndex
tree = build_dyf_tree(embeddings, max_depth=4, num_bits=3, min_leaf_size=8)
write_lazy_index(tree, embeddings, "index.dyf",
quantization="float16",
stored_fields={"title": titles},
metadata={"model": "nomic-embed-text-v1.5"})
with LazyIndex("index.dyf") as idx:
result = idx.search(query_embedding, k=10, nprobe="auto")
print(result.indices, result.scores)
print(result.fields["title"])How it works
Two-stage PCA-based LSH:
- Random bucketing — throw random hyperplanes through embedding space
- Compute centroids — average embeddings per bucket
- PCA on centroids — learn directions that separate clusters (fast: hundreds of rows, not millions)
- Re-hash — project all embeddings onto learned hyperplanes
Bucket sizes reveal density. Centroid similarity reveals centrality. Stability under perturbation reveals bridges.
See How It Works for the full algorithm, metrics, and the connection to Krapivin et al. (2025).
What you can do with it
- Semantic navigation — trace paths between concepts through bridge items
- Structure discovery — see how data organizes itself without imposing a taxonomy
- Anomaly detection — surface orphans and bridges that span multiple domains
- Index building —
.dyffiles for instant-open, memory-mapped approximate search
Performance
| Dataset | Time | Per item |
|---|---|---|
| 60K embeddings (384d) | ~60ms | 1.0 µs |
Rust-accelerated via PyO3.
Installation
pip install dyf # core
pip install dyf[io] # + serialization
pip install dyf[full] # + embedding generation, LLM labeling