flowchart LR
A["<b>Step 1</b><br/>Random<br/>hyperplanes"] --> B["<b>Step 2</b><br/>Compute bucket<br/>centroids"]
B --> C["<b>Step 3</b><br/>PCA on<br/>centroids"]
C --> D["<b>Step 4</b><br/>Re-hash with<br/>learned planes"]
style A stroke:#4a6fa5,stroke-width:2
style B stroke:#4a6fa5,stroke-width:2
style C stroke:#c47f17,stroke-width:2
style D stroke:#2d8a2d,stroke-width:2
How DYF Works
DYF discovers structure in embedding spaces — dense clusters, bridge connections, and orphan outliers — without requiring you to choose the number of clusters. This page explains the algorithm, the metrics, and the tools built on top of them.
The Problem
Embedding models turn text, images, or other data into high-dimensional vectors. Similar items end up near each other, but the structure of that space — where the clusters are, what connects them, which items are truly unique — is invisible. You have a cloud of points and no map.
Traditional clustering forces choices. k-means requires picking k upfront, and the result is just a partition — it doesn’t tell you which items are central to their cluster, which ones sit between clusters, or which ones don’t belong anywhere. HDBSCAN handles variable-density clusters but scales poorly and still gives you a flat labeling.
DYF takes a different approach. Instead of assigning every point to exactly one cluster, it characterizes each point’s role in the space: Dense (core member of a region), Bridge (connector between regions), or Orphan (genuinely unique). It does this without choosing k, and it reveals the topology of your data — not just its partitions.
The Algorithm: Four Steps
DYF uses a two-pass approach: start with random structure, then learn from it.
Step 1: Random bucketing
Throw random hyperplanes through the embedding space. Each hyperplane splits the space in two — items on the same side of all hyperplanes land in the same bucket. With 3 hyperplanes you get up to 8 buckets, with 6 you get up to 64.
This is fast but noisy. The hyperplanes don’t know anything about your data’s structure.
Step 2: Compute centroids
Average the embeddings in each bucket. These centroids are a compressed view of how the data is distributed — a rough sketch of its topology.
Step 3: PCA on centroids
Run PCA on the centroids, not the full dataset. This is the key move — centroids are a small matrix (hundreds of rows, not millions), so PCA is fast. And because each centroid represents a cluster of similar items, the principal components capture the directions that separate clusters. These become the new hyperplanes.
Step 4: Re-hash
Project all embeddings onto the learned hyperplanes using the same sign-bit scheme. Now the buckets reflect actual semantic regions rather than arbitrary splits.
DYF uses randomness to discover structure, then learns from it. The final hyperplanes aren’t random — they’re fit to your data.
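The two-pass idea can be sketched in plain NumPy. This is a toy illustration of the four steps, not DYF's implementation; the function name `two_pass_hash` is made up here:

```python
import numpy as np

def two_pass_hash(X, num_bits=3, seed=0):
    """Toy two-pass hash: random planes -> centroids -> PCA planes."""
    rng = np.random.default_rng(seed)
    packer = 1 << np.arange(num_bits)  # bit weights for packing sign bits

    # Step 1: random hyperplanes give noisy buckets
    H = rng.standard_normal((num_bits, X.shape[1]))
    ids = (X @ H.T > 0).astype(int) @ packer

    # Step 2: average each non-empty bucket into a centroid
    centroids = np.stack([X[ids == b].mean(axis=0) for b in np.unique(ids)])

    # Step 3: PCA on the (small) centroid matrix via SVD
    centered = centroids - centroids.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    learned = Vt[:num_bits]  # top principal directions separate clusters

    # Step 4: re-hash every embedding with the learned hyperplanes
    return (X @ learned.T > 0).astype(int) @ packer
```

The expensive part of standard PCA is avoided: the SVD runs on a matrix with one row per bucket, not one row per item.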
How the hash works: sign-bit LSH
Each hyperplane is a unit vector in embedding space. To assign an item to a bucket, DYF computes the dot product of the embedding with each hyperplane and takes the sign:
- Positive dot product → bit = 1 (item is on the “positive” side)
- Negative dot product → bit = 0 (item is on the “negative” side)
The bits from all hyperplanes are packed into a single integer — the bucket ID. With 3 hyperplanes, each item gets a 3-bit ID (0–7). Items that share a bucket ID are on the same side of every hyperplane, meaning they’re in the same region of the space.
flowchart LR
E["Embedding"] --> D1["dot with h1<br/>→ positive → 1"]
E --> D2["dot with h2<br/>→ negative → 0"]
E --> D3["dot with h3<br/>→ positive → 1"]
D1 --> ID["bucket = 101₂ = 5"]
D2 --> ID
D3 --> ID
style E stroke:#4a6fa5,stroke-width:2
style ID stroke:#2d8a2d,stroke-width:2
This is a form of Locality Sensitive Hashing (LSH) — nearby embeddings are likely to land in the same bucket because they’ll be on the same side of most hyperplanes. What makes DYF different from standard random-projection LSH is the second pass: the hyperplanes are learned from the data’s own structure via PCA, so the splits align with natural cluster boundaries instead of arbitrary directions.
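The sign-bit assignment above fits in a few lines. The hyperplanes and vectors below are toy values chosen to reproduce the diagram's bucket 5, not DYF's API:

```python
import numpy as np

# three hypothetical axis-aligned unit hyperplanes in a 4-d toy space
planes = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])

def bucket_id(x, planes):
    bits = (planes @ x > 0).astype(int)               # sign bit per hyperplane
    return int(bits @ (1 << np.arange(len(planes))))  # pack bits into one integer

x = np.array([0.9, -0.2, 0.4, 0.1])
y = np.array([0.8, -0.1, 0.5, 0.2])  # a nearby vector
print(bucket_id(x, planes), bucket_id(y, planes))  # prints "5 5": nearby points collide
```

Because `x` and `y` fall on the same side of all three planes, they pack to the same bucket ID, which is the LSH property the paragraph above describes.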
Improving the hash: ITQ rotation
The sign function is lossy — embeddings near a decision boundary can flip to the wrong side. Iterative Quantization (ITQ) reduces this loss by learning a rotation matrix that aligns the PCA projections with the sign thresholds.
The idea: after PCA, the projections are continuous values. ITQ finds a rotation that makes those values as far from zero as possible — pushing them away from the decision boundary before the sign function is applied. This typically improves recall by 5–12% with minimal extra build cost.
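The rotation can be learned with the classic ITQ alternation: fix the rotation and snap projections to binary codes, then fix the codes and solve an orthogonal Procrustes problem for the rotation. A compact sketch under those standard assumptions (illustrative, not DYF's internal code):

```python
import numpy as np

def itq_rotation(V, n_iter=50, seed=0):
    """Learn an orthogonal R that minimizes ||sign(V @ R) - V @ R||_F."""
    rng = np.random.default_rng(seed)
    c = V.shape[1]
    R, _ = np.linalg.qr(rng.standard_normal((c, c)))  # random orthogonal init
    for _ in range(n_iter):
        B = np.sign(V @ R)                 # fix R: snap projections to codes
        U, _, Wt = np.linalg.svd(V.T @ B)  # fix B: orthogonal Procrustes step
        R = U @ Wt
    return R
```

Each iteration can only lower (or keep) the quantization loss, so projections end up farther from the sign threshold than under raw PCA.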
# Standard PCA-based hash
classifier.fit(embeddings)
# ITQ-refined hash (better quantization)
classifier.fit_itq(embeddings)
Why Hashing Yields Density
Most hash tables try to spread items evenly across buckets — collisions are a bug. DYF flips this: uneven bucket sizes are the signal. When you hash embeddings with data-adaptive hyperplanes, the resulting bucket distribution directly encodes the density structure of the embedding space.
flowchart LR
subgraph hash ["Data-adaptive hash"]
direction TB
H1["PCA hyperplanes<br/>split along cluster<br/>boundaries"]
end
subgraph buckets ["Bucket sizes encode density"]
direction TB
B1["Bucket 3: 1,200 items<br/>→ dense region"]
B2["Bucket 7: 45 items<br/>→ sparse region"]
B3["Bucket 12: 2 items<br/>→ orphan territory"]
end
hash --> buckets
style hash stroke:#4a6fa5,stroke-width:2
style buckets stroke:#2d8a2d,stroke-width:2
A large bucket means many embeddings landed on the same side of every hyperplane — they occupy a coherent, well-populated region of the space. A tiny bucket means those items ended up in a region with almost no neighbors. The bucket size is a density estimate, computed as a byproduct of the hash with no additional distance calculations.
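Reading density out of the hash is a single histogram pass over the bucket IDs. A minimal sketch with hypothetical assignments:

```python
import numpy as np

# hypothetical bucket assignments for eight items
bucket_ids = np.array([3, 3, 3, 7, 3, 12, 3, 7])

sizes = np.bincount(bucket_ids)   # occupancy per bucket
density = sizes[bucket_ids]       # per-item density estimate, no distances computed
print(density)  # prints [5 5 5 2 5 1 5 2]
```

Items in bucket 3 score 5 (a dense region), bucket 7 scores 2, and the lone item in bucket 12 scores 1 (orphan territory).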
This idea draws from Krapivin et al. (2025), which analyzes hash table occupancy patterns and probe complexity under non-uniform distributions. DYF applies this principle to embedding spaces: instead of fighting non-uniform bucket sizes, it treats them as a feature — the load pattern of the hash table reveals the topology of the data.
DYF can also read density at multiple scales by masking bucket IDs to fewer bits. With 14-bit hashes, masking to 8 bits merges neighboring buckets — items that looked sparse at fine resolution may cluster at coarser resolution. This multi-resolution view separates true orphans (sparse at every scale) from items that belong to broad, diffuse regions.
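Masking to fewer bits is a right shift. The sketch below assumes the most significant bits come from the top hyperplanes, so dropping low-order bits merges neighboring fine buckets; the bit layout is an assumption of this example, not documented DYF behavior:

```python
import numpy as np

FINE_BITS, COARSE_BITS = 14, 8

# two fine buckets that differ only in low-order bits, plus a distant one
bucket_ids = np.array([0b10110100101101, 0b10110100101100, 0b00000000000011])

# drop the 6 least-significant bits: neighboring fine buckets merge
coarse_ids = bucket_ids >> (FINE_BITS - COARSE_BITS)
print(coarse_ids)  # first two items now share a coarse bucket
```

An item that is alone at 14 bits but lands in a large bucket at 8 bits belongs to a broad, diffuse region; a true orphan stays alone at every resolution.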
What the Metrics Tell You
After fitting, DYF computes per-item metrics that characterize each point’s role:
| Metric | What it means | Analogy |
|---|---|---|
| `bucket_id` | Which region this item belongs to | Which neighborhood you live in |
| `bucket_size` | How many items share this bucket | How many neighbors you have |
| `centroid_similarity` | Cosine similarity to bucket centroid | How central you are (1 = city center, 0 = edge of town) |
| `isolation_score` | Distance to nearest item outside the bucket | How different you are from everyone around you |
| `stability_score` | Fraction of reruns where this item stays in the same bucket | If we redrew the map, would you stay put? |
from dyf import DensityClassifier
import numpy as np
classifier = DensityClassifier(embedding_dim=384)
classifier.fit(embeddings)
bucket_ids = classifier.get_bucket_ids()
bucket_sizes = classifier.get_bucket_sizes()
centroid_sims = classifier.get_centroid_similarities()
isolation = classifier.get_isolation_scores()
stability = classifier.get_stability_scores()
Dense, Bridge, Orphan
DYF classifies every item into one of three roles based on its metrics:
flowchart TD
subgraph dense ["Dense"]
direction TB
D1["Large bucket"]
D2["High centroid similarity"]
D3["Core of a topic"]
end
subgraph bridge ["Bridge"]
direction TB
B1["Changes buckets<br/>under perturbation"]
B2["Low stability score"]
B3["Connects two regions"]
end
subgraph orphan ["Orphan"]
direction TB
O1["Small bucket"]
O2["High isolation"]
O3["Genuinely unique"]
end
style dense stroke:#2d8a2d,stroke-width:2
style bridge stroke:#c47f17,stroke-width:2
style orphan stroke:#c44a4a,stroke-width:2
Dense items sit in large buckets with high centroid similarity. They’re the core members of a semantic region — the papers that are clearly about machine learning, the products that are obviously kitchen appliances.
Bridge items are unstable under perturbation. If you re-run the classifier with slightly different hyperplanes, they change buckets. This means they sit near a decision boundary — a paper about “ML for medical imaging” bridges the ML cluster and the Medicine cluster. Bridges reveal the connections in your data’s topology.
Orphan items land in small buckets with high isolation scores. They’re genuinely unlike everything else — not noise, but unique.
dense_indices = classifier.get_dense()
bridge_indices = classifier.get_bridge()
orphan_indices = classifier.get_orphans()
Going Deeper: The DYF Tree
One level of bucketing captures coarse structure. For finer resolution, DYF builds a tree: fit a DensityClassifier at each node, then recurse into each bucket.
graph TD
R["Root<br/>10,000 items<br/>3 bits → 8 buckets"] --> A["Bucket 0<br/>2,100 items"]
R --> B["Bucket 1<br/>1,800 items"]
R --> C["..."]
R --> D["Bucket 7<br/>900 items"]
A --> A1["Sub-bucket 0<br/>450 items"]
A --> A2["Sub-bucket 1<br/>380 items"]
A --> A3["..."]
style R stroke:#4a6fa5,stroke-width:2
style A stroke:#2d8a2d,stroke-width:2
style B stroke:#2d8a2d,stroke-width:2
style D stroke:#2d8a2d,stroke-width:2
style A1 stroke:#c47f17,stroke-width:2
style A2 stroke:#c47f17,stroke-width:2
Each level re-fits PCA to the local data at that node. This means deeper levels discover structure within clusters — sub-topics that wouldn’t be visible at the top level. With num_bits=3, each node splits up to 8 ways, so a tree of depth 4 can produce up to 4,096 leaf buckets.
from dyf import build_dyf_tree
tree = build_dyf_tree(
    embeddings,
    max_depth=4,
    num_bits=3,
    min_leaf_size=8,
    fit_method="itq",  # ITQ gives +5-12% recall over raw PCA
)
From Tree to Search Index
The DYF tree can be serialized to a .dyf file for fast approximate nearest neighbor search.
flowchart LR
Q["Query<br/>embedding"] --> H["Hash through<br/>tree"]
H --> L["Find matching<br/>leaf bucket"]
L --> S["Score embeddings<br/>in leaf"]
S --> R["Return<br/>top-k"]
H -.->|"near boundary?"| P["Probe<br/>additional leaves"]
P --> S
style Q stroke:#4a6fa5,stroke-width:2
style R stroke:#2d8a2d,stroke-width:2
style P stroke:#c47f17,stroke-width:2
Zero startup cost: metadata lives in a small header; embeddings stay on disk (memory-mapped) until a query actually needs them. Opening a .dyf file is instant regardless of index size.
Adaptive probing: when a query embedding is close to a decision boundary (low margin on a hyperplane), the search automatically checks neighboring leaves. This trades a small amount of speed for significantly better recall on ambiguous queries.
from dyf import write_lazy_index, LazyIndex
# Build and write index
write_lazy_index(
    tree, embeddings, "index.dyf",
    quantization="float16",
    stored_fields={"title": titles},
)
# Search
with LazyIndex("index.dyf") as idx:
    result = idx.search(query_embedding, k=10, nprobe="auto")
    print(result.indices, result.scores)
    print(result.fields["title"])
Next Steps
- Getting Started — hands-on code recipes
- API Reference — full documentation for all classes and functions