Getting Started

This guide walks through the core DYF workflow: fitting a classifier, inspecting the results, building and searching indexes, and re-ranking search results.

Basic Usage

The fastest way to get started is with DensityClassifier (Rust-accelerated):

import numpy as np
from dyf import DensityClassifier

# Load or generate embeddings
embeddings = np.random.randn(10000, 384).astype(np.float32)

# Create classifier and fit
classifier = DensityClassifier(embedding_dim=384)
classifier.fit(embeddings)

# Inspect the report
report = classifier.report()
print(report)

The report shows how items are distributed across the Dense, Bridge, and Orphan categories.

Retrieving Items

# Get item indices by category
dense_items = classifier.get_dense()
bridge_items = classifier.get_bridge()
orphan_items = classifier.get_orphans()

# Get per-item metrics
bucket_ids = classifier.get_bucket_ids()
bucket_sizes = classifier.get_bucket_sizes()
isolation_scores = classifier.get_isolation_scores()
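
To work with the categories as a single column (for example, before loading them into a table), the three index arrays can be merged into one per-item label array. The sketch below is not part of DYF: `label_items` is a hypothetical helper, and the stand-in arrays assume that the `get_*` methods return integer index arrays, as the comments above suggest.

```python
import numpy as np

def label_items(n_items, dense, bridge, orphans):
    """Return an array of category names, one per item (hypothetical helper)."""
    labels = np.full(n_items, "unassigned", dtype=object)
    labels[np.asarray(dense, dtype=int)] = "dense"
    labels[np.asarray(bridge, dtype=int)] = "bridge"
    labels[np.asarray(orphans, dtype=int)] = "orphan"
    return labels

# Stand-in index arrays; in practice these would come from
# classifier.get_dense(), get_bridge(), and get_orphans().
dense_idx = np.array([0, 1, 2, 5])
bridge_idx = np.array([3])
orphan_idx = np.array([4])

labels = label_items(6, dense_idx, bridge_idx, orphan_idx)

# Count how many items fall into each category
names, counts = np.unique(labels.astype(str), return_counts=True)
summary = dict(zip(names.tolist(), counts.tolist()))
print(summary)
```

Any item that appears in none of the three arrays keeps the `"unassigned"` label, which makes gaps in the categorization easy to spot.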

Building & Searching Indexes

Build a .dyf index file for fast, memory-mapped search:

from dyf import build_dyf_tree, write_lazy_index, LazyIndex

# Build a DYF tree
tree = build_dyf_tree(embeddings, max_depth=4, num_bits=3, min_leaf_size=8)

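# Placeholder titles, one per embedding row (assumed here to be a list of strings)
titles = [f"Document {i}" for i in range(len(embeddings))]

# Write to disk
write_lazy_index(tree, embeddings, "index.dyf",
                 quantization="float16", compression="zstd",
                 stored_fields={"title": titles},
                 metadata={"model": "nomic-embed-text-v1.5"})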

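# Placeholder query vector with the same dimensionality as the embeddings
query_embedding = np.random.randn(384).astype(np.float32)

# Search (the index is memory-mapped, so opening it is fast)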
with LazyIndex("index.dyf") as idx:
    result = idx.search(query_embedding, k=10, nprobe=3)
    print(result.indices, result.scores)
    print(result.fields["title"])
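
The `quantization="float16"` option trades precision for size. How DYF encodes vectors on disk is not described in this guide, so the sketch below uses plain NumPy only to illustrate the general trade-off: casting float32 embeddings to float16 halves their memory footprint and perturbs cosine scores only slightly.

```python
import numpy as np

rng = np.random.default_rng(0)
emb32 = rng.standard_normal((1000, 384)).astype(np.float32)
emb32 /= np.linalg.norm(emb32, axis=1, keepdims=True)  # unit-length rows

# Cast to half precision, as a float16 store would
emb16 = emb32.astype(np.float16)
size_ratio = emb16.nbytes / emb32.nbytes  # 2 bytes vs. 4 bytes per value

# Compare cosine scores for one query before and after the cast
query = emb32[0]
scores32 = emb32 @ query
scores16 = emb16.astype(np.float32) @ query
max_score_error = float(np.max(np.abs(scores32 - scores16)))

print(f"size ratio: {size_ratio}, max score error: {max_score_error:.2e}")
```

Whether the error matters depends on how close the top-ranked scores are to each other; checking it on a sample of real queries is a cheap way to decide.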

Adaptive Probing

Queries near decision boundaries automatically probe more leaves:

from dyf import LazyIndex, AdaptiveProbeConfig

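# Placeholder query vector; use a real query embedding in practice
query = np.random.randn(384).astype(np.float32)

with LazyIndex("index.dyf") as idx:
    # Auto mode: margin-based probe count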
    result = idx.search(query, k=10, nprobe="auto", return_routing=True)
    print(result.routing["adaptive_nprobe"])

    # Custom thresholds
    cfg = AdaptiveProbeConfig(margin_lo=0.005, margin_hi=0.2,
                              min_probes=1, max_probes=8)
    result = idx.search(query, k=10, nprobe=cfg)
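
The exact rule DYF uses to turn a routing margin into a probe count is not spelled out in this guide. One plausible reading of the `AdaptiveProbeConfig` fields, sketched below with a hypothetical `probes_from_margin` function, is a linear interpolation: margins at or below `margin_lo` get `max_probes`, margins at or above `margin_hi` get `min_probes`, and values in between scale linearly.

```python
def probes_from_margin(margin, margin_lo=0.005, margin_hi=0.2,
                       min_probes=1, max_probes=8):
    """Map a routing margin to a probe count (illustrative, not DYF's code)."""
    if margin_hi <= margin_lo:
        raise ValueError("margin_hi must be greater than margin_lo")
    # Position of the margin inside [margin_lo, margin_hi], clipped to [0, 1]
    t = (margin - margin_lo) / (margin_hi - margin_lo)
    t = min(1.0, max(0.0, t))
    # Small margin (ambiguous query) -> many probes; large margin -> few
    return round(max_probes - t * (max_probes - min_probes))

for m in (0.001, 0.05, 0.1, 0.5):
    print(m, probes_from_margin(m))
```

Under this assumed rule, a query that sits almost exactly on a decision boundary is sent to the maximum number of leaves, while a confidently routed query probes only one.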

Tree Methods

Build a hierarchical structure from embeddings for multi-resolution analysis:

from dyf import build_pca_tree, cut_tree_to_labels

# Build a PCA-based binary tree
tree = build_pca_tree(embeddings, max_depth=8)

# Cut to get cluster labels
labels = cut_tree_to_labels(tree, max_depth=8, n_points=len(embeddings), n_clusters=25)
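
As a rough illustration of what a PCA-based binary tree does at each node, the sketch below splits a set of points by the sign of their projection onto the first principal component. It is a minimal NumPy sketch of the general technique, not the `build_pca_tree` implementation: `pca_split` is a hypothetical helper, and splitting at zero (the mean) rather than at some other threshold is an assumption.

```python
import numpy as np

def pca_split(points):
    """Split points into two groups along their first principal component."""
    centered = points - points.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projection = centered @ vt[0]
    return projection >= 0  # boolean mask, one entry per point

rng = np.random.default_rng(0)
# Two well-separated blobs in 16 dimensions
blob_a = rng.standard_normal((50, 16)) + 10.0
blob_b = rng.standard_normal((50, 16)) - 10.0
mask = pca_split(np.vstack([blob_a, blob_b]))

# Each blob lands entirely on one side of the split
print(int(mask[:50].sum()), int(mask[50:].sum()))
```

Applying the same split recursively to each side, down to a maximum depth, yields a binary tree whose levels correspond to progressively finer groupings.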

Bridge-Aware Retrieval (RAG)

Use BridgeIndex for structure-aware approximate nearest neighbor search:

from dyf import BridgeIndex

# Build index
index = BridgeIndex(n_anchors=1000)
index.fit(embeddings)

# Query
query = np.random.randn(384).astype(np.float32)
indices, scores = index.query(query, k=10)

Re-ranking

Apply diversity-aware re-ranking to search results:

from dyf import rerank_mmr, rerank_bridge_mmr

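# Inputs: L2-normalized vectors and candidates from a previous search
# (the `_normed` suffix suggests unit-length rows; that is an assumption here)
embeddings_normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
query_emb = query / np.linalg.norm(query)
candidate_indices = indices  # e.g. the output of index.query(query, k=10) above

# Standard MMR re-ranking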
reranked = rerank_mmr(query_emb, candidate_indices, embeddings_normed, top_k=10)

# Bridge-aware MMR (boosts items that span clusters);
# bridge_scores and cluster_labels are assumed to be prepared beforehand
reranked = rerank_bridge_mmr(
    query_emb, candidate_indices, embeddings_normed,
    bridge_scores, cluster_labels, top_k=10,
)
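
For reference, standard MMR (maximal marginal relevance) greedily picks the candidate that maximizes `lambda * sim(query, item) - (1 - lambda) * max sim(item, selected)`. The sketch below implements that textbook rule in NumPy on unit-length vectors. It is not the `rerank_mmr` implementation: `mmr_sketch` and its `lambda_` parameter are hypothetical.

```python
import numpy as np

def mmr_sketch(query, candidates, vectors, top_k=10, lambda_=0.7):
    """Greedy MMR over `candidates` (indices into unit-length `vectors`)."""
    remaining = list(candidates)
    relevance = {i: float(vectors[i] @ query) for i in remaining}
    selected = []
    while remaining and len(selected) < top_k:
        def mmr_score(i):
            # Redundancy: highest similarity to anything already selected
            redundancy = max((float(vectors[i] @ vectors[j]) for j in selected),
                             default=0.0)
            return lambda_ * relevance[i] - (1.0 - lambda_) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Items 0 and 1 are near-duplicates; item 2 is less relevant but different
vectors = np.array([[1.0, 0.0], [0.999, 0.045], [0.0, 1.0]])
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query = np.array([0.8, 0.6])

print(mmr_sketch(query, [0, 1, 2], vectors, top_k=2, lambda_=0.5))  # [1, 2]
```

With `lambda_=1.0` the rule reduces to plain relevance ranking and returns the two near-duplicates; lowering it trades some relevance for variety in the results.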

Next Steps

See the API Reference for complete documentation of all classes and functions.