flowchart LR
A["<b>Step 1</b><br/>Random<br/>hyperplanes"] --> B["<b>Step 2</b><br/>Compute bucket<br/>centroids"]
B --> C["<b>Step 3</b><br/>PCA on<br/>centroids"]
C --> D["<b>Step 4</b><br/>Re-hash with<br/>learned planes"]
style A stroke:#4a6fa5,stroke-width:2
style B stroke:#4a6fa5,stroke-width:2
style C stroke:#c47f17,stroke-width:2
style D stroke:#2d8a2d,stroke-width:2
How DYF Works
DYF discovers structure in embedding spaces — dense clusters, bridge connections, and orphan outliers — without requiring you to choose the number of clusters. This page explains the algorithm, the metrics, and the tools built on top of them.
The Problem
Embedding models turn text, images, or other data into high-dimensional vectors. Similar items end up near each other, but the structure of that space — where the clusters are, what connects them, which items are truly unique — is invisible. You have a cloud of points and no map.
Traditional clustering forces choices. k-means requires picking k upfront, and the result is just a partition — it doesn’t tell you which items are central to their cluster, which ones sit between clusters, or which ones don’t belong anywhere. HDBSCAN handles variable-density clusters but scales poorly and still gives you a flat labeling.
DYF takes a different approach. Instead of assigning every point to exactly one cluster, it characterizes each point’s role in the space: Dense (core member of a region), Bridge (connector between regions), or Orphan (genuinely unique). It does this without choosing k, and it reveals the topology of your data — not just its partitions.
The Algorithm: Four Steps
DYF uses a two-pass approach: start with random structure, then learn from it.
Step 1: Random bucketing
Throw random hyperplanes through the embedding space. Each hyperplane splits the space in two — items on the same side of all hyperplanes land in the same bucket. With 3 hyperplanes you get up to 8 buckets, with 6 you get up to 64.
This is fast but noisy. The hyperplanes don’t know anything about your data’s structure.
Step 2: Compute centroids
Average the embeddings in each bucket. These centroids are a compressed view of how the data is distributed — a rough sketch of its topology.
Step 3: PCA on centroids
Run PCA on the centroids, not the full dataset. This is the key move — centroids are a small matrix (hundreds of rows, not millions), so PCA is fast. And because each centroid represents a cluster of similar items, the principal components capture the directions that separate clusters. These become the new hyperplanes.
Step 4: Re-hash
Project all embeddings onto the learned hyperplanes using the same sign-bit scheme. Now the buckets reflect actual semantic regions rather than arbitrary splits.
DYF uses randomness to discover structure, then learns from it. The final hyperplanes aren’t random — they’re fit to your data.
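The two-pass idea can be sketched in plain NumPy. This is a toy illustration of the four steps, not DYF's implementation; the function name `two_pass_hash` is made up here:

```python
import numpy as np

def two_pass_hash(X, num_bits=3, seed=0):
    """Toy two-pass hash: random planes -> centroids -> PCA planes."""
    rng = np.random.default_rng(seed)
    packer = 1 << np.arange(num_bits)  # bit weights for packing sign bits

    # Step 1: random hyperplanes give noisy buckets
    H = rng.standard_normal((num_bits, X.shape[1]))
    ids = (X @ H.T > 0).astype(int) @ packer

    # Step 2: average each non-empty bucket into a centroid
    centroids = np.stack([X[ids == b].mean(axis=0) for b in np.unique(ids)])

    # Step 3: PCA on the (small) centroid matrix via SVD
    centered = centroids - centroids.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    learned = Vt[:num_bits]  # top principal directions separate clusters

    # Step 4: re-hash every embedding with the learned hyperplanes
    return (X @ learned.T > 0).astype(int) @ packer
```

The expensive part of standard PCA is avoided: the SVD runs on a matrix with one row per bucket, not one row per item.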
How the hash works: sign-bit LSH
Each hyperplane is a unit vector in embedding space. To assign an item to a bucket, DYF computes the dot product of the embedding with each hyperplane and takes the sign:
- Positive dot product → bit = 1 (item is on the “positive” side)
- Negative dot product → bit = 0 (item is on the “negative” side)
The bits from all hyperplanes are packed into a single integer — the bucket ID. With 3 hyperplanes, each item gets a 3-bit ID (0–7). Items that share a bucket ID are on the same side of every hyperplane, meaning they’re in the same region of the space.
flowchart LR
E["Embedding"] --> D1["dot with h1<br/>→ positive → 1"]
E --> D2["dot with h2<br/>→ negative → 0"]
E --> D3["dot with h3<br/>→ positive → 1"]
D1 --> ID["bucket = 101₂ = 5"]
D2 --> ID
D3 --> ID
style E stroke:#4a6fa5,stroke-width:2
style ID stroke:#2d8a2d,stroke-width:2
This is a form of Locality Sensitive Hashing (LSH) — nearby embeddings are likely to land in the same bucket because they’ll be on the same side of most hyperplanes. What makes DYF different from standard random-projection LSH is the second pass: the hyperplanes are learned from the data’s own structure via PCA, so the splits align with natural cluster boundaries instead of arbitrary directions.
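The sign-bit assignment above fits in a few lines. The hyperplanes and vectors below are toy values chosen to reproduce the diagram's bucket 5, not DYF's API:

```python
import numpy as np

# three hypothetical axis-aligned unit hyperplanes in a 4-d toy space
planes = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])

def bucket_id(x, planes):
    bits = (planes @ x > 0).astype(int)               # sign bit per hyperplane
    return int(bits @ (1 << np.arange(len(planes))))  # pack bits into one integer

x = np.array([0.9, -0.2, 0.4, 0.1])
y = np.array([0.8, -0.1, 0.5, 0.2])  # a nearby vector
print(bucket_id(x, planes), bucket_id(y, planes))  # prints "5 5": nearby points collide
```

Because `x` and `y` fall on the same side of all three planes, they pack to the same bucket ID, which is the LSH property the paragraph above describes.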
Improving the hash: ITQ rotation
The sign function is lossy — embeddings near a decision boundary can flip to the wrong side. Iterative Quantization (ITQ) reduces this loss by learning a rotation matrix that aligns the PCA projections with the sign thresholds.
The idea: after PCA, the projections are continuous values. ITQ finds a rotation that makes those values as far from zero as possible — pushing them away from the decision boundary before the sign function is applied. This typically improves recall by 5–12% with minimal extra build cost.
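The rotation can be learned with the classic ITQ alternation: fix the rotation and snap projections to binary codes, then fix the codes and solve an orthogonal Procrustes problem for the rotation. A compact sketch under those standard assumptions (illustrative, not DYF's internal code):

```python
import numpy as np

def itq_rotation(V, n_iter=50, seed=0):
    """Learn an orthogonal R that minimizes ||sign(V @ R) - V @ R||_F."""
    rng = np.random.default_rng(seed)
    c = V.shape[1]
    R, _ = np.linalg.qr(rng.standard_normal((c, c)))  # random orthogonal init
    for _ in range(n_iter):
        B = np.sign(V @ R)                 # fix R: snap projections to codes
        U, _, Wt = np.linalg.svd(V.T @ B)  # fix B: orthogonal Procrustes step
        R = U @ Wt
    return R
```

Each iteration can only lower (or keep) the quantization loss, so projections end up farther from the sign threshold than under raw PCA.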
# Standard PCA-based hash
classifier.fit(embeddings)
# ITQ-refined hash (better quantization)
classifier.fit_itq(embeddings)
Why Hashing Yields Density
Most hash tables try to spread items evenly across buckets — collisions are a bug. DYF flips this: uneven bucket sizes are the signal. When you hash embeddings with data-adaptive hyperplanes, the resulting bucket distribution directly encodes the density structure of the embedding space.
flowchart LR
subgraph hash ["Data-adaptive hash"]
direction TB
H1["PCA hyperplanes<br/>split along cluster<br/>boundaries"]
end
subgraph buckets ["Bucket sizes encode density"]
direction TB
B1["Bucket 3: 1,200 items<br/>→ dense region"]
B2["Bucket 7: 45 items<br/>→ sparse region"]
B3["Bucket 12: 2 items<br/>→ orphan territory"]
end
hash --> buckets
style hash stroke:#4a6fa5,stroke-width:2
style buckets stroke:#2d8a2d,stroke-width:2
A large bucket means many embeddings landed on the same side of every hyperplane — they occupy a coherent, well-populated region of the space. A tiny bucket means those items ended up in a region with almost no neighbors. The bucket size is a density estimate, computed as a byproduct of the hash with no additional distance calculations.
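Reading density out of the hash is a single histogram pass over the bucket IDs. A minimal sketch with hypothetical assignments:

```python
import numpy as np

# hypothetical bucket assignments for eight items
bucket_ids = np.array([3, 3, 3, 7, 3, 12, 3, 7])

sizes = np.bincount(bucket_ids)   # occupancy per bucket
density = sizes[bucket_ids]       # per-item density estimate, no distances computed
print(density)  # prints [5 5 5 2 5 1 5 2]
```

Items in bucket 3 score 5 (a dense region), bucket 7 scores 2, and the lone item in bucket 12 scores 1 (orphan territory).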
This idea draws from Krapivin et al. (2025), which analyzes hash table occupancy patterns and probe complexity under non-uniform distributions. DYF applies this principle to embedding spaces: instead of fighting non-uniform bucket sizes, it treats them as a feature — the load pattern of the hash table reveals the topology of the data.
DYF can also read density at multiple scales by masking bucket IDs to fewer bits. With 14-bit hashes, masking to 8 bits merges neighboring buckets — items that looked sparse at fine resolution may cluster at coarser resolution. This multi-resolution view separates true orphans (sparse at every scale) from items that belong to broad, diffuse regions.
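Masking to fewer bits is a right shift. The sketch below assumes the most significant bits come from the top hyperplanes, so dropping low-order bits merges neighboring fine buckets; the bit layout is an assumption of this example, not documented DYF behavior:

```python
import numpy as np

FINE_BITS, COARSE_BITS = 14, 8

# two fine buckets that differ only in low-order bits, plus a distant one
bucket_ids = np.array([0b10110100101101, 0b10110100101100, 0b00000000000011])

# drop the 6 least-significant bits: neighboring fine buckets merge
coarse_ids = bucket_ids >> (FINE_BITS - COARSE_BITS)
print(coarse_ids)  # first two items now share a coarse bucket
```

An item that is alone at 14 bits but lands in a large bucket at 8 bits belongs to a broad, diffuse region; a true orphan stays alone at every resolution.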
What the Metrics Tell You
After fitting, DYF computes per-item metrics that characterize each point’s role:
| Metric | What it means | Analogy |
|---|---|---|
| `bucket_id` | Which region this item belongs to | Which neighborhood you live in |
| `bucket_size` | How many items share this bucket | How many neighbors you have |
| `centroid_similarity` | Cosine similarity to bucket centroid | How central you are (1 = city center, 0 = edge of town) |
| `isolation_score` | Distance to nearest item outside the bucket | How different you are from everyone around you |
| `stability_score` | Fraction of reruns where this item stays in the same bucket | If we redrew the map, would you stay put? |
from dyf import DensityClassifier
import numpy as np
classifier = DensityClassifier(embedding_dim=384)
classifier.fit(embeddings)
bucket_ids = classifier.get_bucket_ids()
bucket_sizes = classifier.get_bucket_sizes()
centroid_sims = classifier.get_centroid_similarities()
isolation = classifier.get_isolation_scores()
stability = classifier.get_stability_scores()
Dense, Bridge, Orphan
DYF classifies every item into one of three roles based on its metrics:
flowchart TD
subgraph dense ["Dense"]
direction TB
D1["Large bucket"]
D2["High centroid similarity"]
D3["Core of a topic"]
end
subgraph bridge ["Bridge"]
direction TB
B1["Changes buckets<br/>under perturbation"]
B2["Low stability score"]
B3["Connects two regions"]
end
subgraph orphan ["Orphan"]
direction TB
O1["Small bucket"]
O2["High isolation"]
O3["Genuinely unique"]
end
style dense stroke:#2d8a2d,stroke-width:2
style bridge stroke:#c47f17,stroke-width:2
style orphan stroke:#c44a4a,stroke-width:2
Dense items sit in large buckets with high centroid similarity. They’re the core members of a semantic region — the papers that are clearly about machine learning, the products that are obviously kitchen appliances.
Bridge items are unstable under perturbation. If you re-run the classifier with slightly different hyperplanes, they change buckets. This means they sit near a decision boundary — a paper about “ML for medical imaging” bridges the ML cluster and the Medicine cluster. Bridges reveal the connections in your data’s topology.
Orphan items land in small buckets with high isolation scores. They’re genuinely unlike everything else — not noise, but unique.
dense_indices = classifier.get_dense()
bridge_indices = classifier.get_bridge()
orphan_indices = classifier.get_orphans()
Going Deeper: The DYF Tree
One level of bucketing captures coarse structure. For finer resolution, DYF builds a tree: fit a DensityClassifier at each node, then recurse into each bucket.
graph TD
R["Root<br/>10,000 items<br/>3 bits → 8 buckets"] --> A["Bucket 0<br/>2,100 items"]
R --> B["Bucket 1<br/>1,800 items"]
R --> C["..."]
R --> D["Bucket 7<br/>900 items"]
A --> A1["Sub-bucket 0<br/>450 items"]
A --> A2["Sub-bucket 1<br/>380 items"]
A --> A3["..."]
style R stroke:#4a6fa5,stroke-width:2
style A stroke:#2d8a2d,stroke-width:2
style B stroke:#2d8a2d,stroke-width:2
style D stroke:#2d8a2d,stroke-width:2
style A1 stroke:#c47f17,stroke-width:2
style A2 stroke:#c47f17,stroke-width:2
Each level re-fits PCA to the local data at that node. This means deeper levels discover structure within clusters — sub-topics that wouldn’t be visible at the top level. With num_bits=3, each node splits up to 8 ways, so a tree of depth 4 can produce up to 4,096 leaf buckets.
from dyf import build_dyf_tree
tree = build_dyf_tree(
    embeddings,
    max_depth=4,
    num_bits=3,
    min_leaf_size=8,
    fit_method="itq",  # ITQ gives +5-12% recall over raw PCA
)
From Tree to Search Index
The DYF tree can be serialized to a .dyf file for fast approximate nearest neighbor search.
flowchart LR
Q["Query<br/>embedding"] --> H["Hash through<br/>tree"]
H --> L["Find matching<br/>leaf bucket"]
L --> S["Score embeddings<br/>in leaf"]
S --> R["Return<br/>top-k"]
H -.->|"near boundary?"| P["Probe<br/>additional leaves"]
P --> S
style Q stroke:#4a6fa5,stroke-width:2
style R stroke:#2d8a2d,stroke-width:2
style P stroke:#c47f17,stroke-width:2
Zero startup cost: metadata lives in a small header; embeddings stay on disk (memory-mapped) until a query actually needs them. Opening a .dyf file is instant regardless of index size.
Adaptive probing: when a query embedding is close to a decision boundary (low margin on a hyperplane), the search automatically checks neighboring leaves. This trades a small amount of speed for significantly better recall on ambiguous queries.
from dyf import write_lazy_index, LazyIndex
# Build and write index
write_lazy_index(
    tree, embeddings, "index.dyf",
    quantization="float16",
    stored_fields={"title": titles},
)
# Search
with LazyIndex("index.dyf") as idx:
    result = idx.search(query_embedding, k=10, nprobe="auto")
    print(result.indices, result.scores)
    print(result.fields["title"])
Next Steps
- Getting Started — hands-on code recipes
- API Reference — full documentation for all classes and functions