# Getting Started
This guide walks through the core DYF workflow: fitting a classifier, inspecting results, and integrating with DataFrames.
## Basic Usage
The fastest way to get started is with `DensityClassifier` (Rust-accelerated):

```python
import numpy as np
from dyf import DensityClassifier

# Load or generate embeddings
embeddings = np.random.randn(10000, 384).astype(np.float32)

# Create classifier and fit
classifier = DensityClassifier(embedding_dim=384)
classifier.fit(embeddings)

# Inspect the report
report = classifier.report()
print(report)
```

The report shows how items distribute across the Dense, Bridge, and Orphan categories.
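To build intuition for the three categories, here is an illustrative toy sketch using plain NumPy: points with many close neighbors behave like Dense items, nearly isolated points like Orphans, and points with intermediate connectivity like Bridges. The radius and thresholds below are invented for illustration only; this is not DYF's actual classification algorithm.

```python
import numpy as np

# Toy neighbor-count heuristic for the Dense / Bridge / Orphan intuition.
# The radius (0.3) and count thresholds are made up for this sketch;
# DYF's real classifier works differently.
rng = np.random.default_rng(0)

# Two tight clusters plus a few scattered points
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
cluster_b = rng.normal(loc=1.0, scale=0.1, size=(50, 2))
stragglers = rng.uniform(-2, 3, size=(5, 2))
points = np.vstack([cluster_a, cluster_b, stragglers]).astype(np.float32)

# Count neighbors within a fixed radius (excluding each point itself)
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
neighbor_counts = (dists < 0.3).sum(axis=1) - 1

dense = np.where(neighbor_counts >= 10)[0]   # many close neighbors
orphan = np.where(neighbor_counts <= 2)[0]   # nearly isolated
bridge_like = np.setdiff1d(np.arange(len(points)),
                           np.concatenate([dense, orphan]))
```

In this toy setup, cluster members end up in `dense` and the scattered stragglers in `orphan`, which mirrors how the report buckets items along the density spectrum.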
## Retrieving Items
```python
# Get item indices by category
dense_items = classifier.get_dense()
bridge_items = classifier.get_bridge()
orphan_items = classifier.get_orphans()

# Get per-item metrics
bucket_ids = classifier.get_bucket_ids()
bucket_sizes = classifier.get_bucket_sizes()
isolation_scores = classifier.get_isolation_scores()
```

## Full-Featured Classifier
`DensityClassifierFull` wraps the Rust core with embedding generation, DataFrame I/O, and LLM-based labeling:

```python
from dyf import DensityClassifierFull, EmbedderConfig, LabelerConfig

# From raw texts (auto-generates TF-IDF embeddings)
classifier = DensityClassifierFull.from_texts(
    texts=["document one", "document two", ...],
    categories=["cat_a", "cat_b", ...],
)

# Label buckets with a local LLM
labels = classifier.label_buckets(**LabelerConfig.MEDIUM.as_kwargs())
```

## DataFrame Integration
Create classifiers directly from Polars or Pandas DataFrames:

```python
import polars as pl

df = pl.DataFrame({
    "text": ["doc1", "doc2", ...],
    "embedding": [emb1, emb2, ...],  # list of numpy arrays
    "category": ["a", "b", ...],
})

# From Polars
classifier = DensityClassifierFull.from_polars(
    df,
    embedding_col="embedding",
    category_col="category",
    text_col="text",
)

# Export results back to Polars
results_df = classifier.to_polars()
```

Pandas works the same way via `from_pandas()` / `to_pandas()`.
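As a minimal sketch, a Pandas frame presumably needs the same column layout as the Polars one above. The sketch below only builds the frame; the `from_pandas()` call is shown as a comment because its keyword arguments are an assumption based on the statement that it mirrors `from_polars()`.

```python
import numpy as np
import pandas as pd

# Same columns as the Polars example: text, embedding, category,
# with one numpy array per row in the embedding column.
emb1, emb2 = np.random.randn(2, 384).astype(np.float32)
df = pd.DataFrame({
    "text": ["doc1", "doc2"],
    "embedding": [emb1, emb2],
    "category": ["a", "b"],
})

# Assumed to mirror from_polars() (kwargs below are an assumption):
# classifier = DensityClassifierFull.from_pandas(
#     df, embedding_col="embedding", category_col="category", text_col="text",
# )
# results_df = classifier.to_pandas()
```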
## Building & Searching Indexes
Build a `.dyf` index file for fast, memory-mapped search:

```python
from dyf import build_dyf_tree, write_lazy_index, LazyIndex

# Build a DYF tree
tree = build_dyf_tree(embeddings, max_depth=4, num_bits=3, min_leaf_size=8)

# Write to disk
write_lazy_index(
    tree, embeddings, "index.dyf",
    quantization="float16", compression="zstd",
    stored_fields={"title": titles},
    metadata={"model": "nomic-embed-text-v1.5"},
)

# Search (instant open, zero startup cost)
with LazyIndex("index.dyf") as idx:
    result = idx.search(query_embedding, k=10, nprobe=3)
    print(result.indices, result.scores)
    print(result.fields["title"])
```

## Adaptive Probing
Queries near decision boundaries automatically probe more leaves:

```python
from dyf import LazyIndex, AdaptiveProbeConfig

with LazyIndex("index.dyf") as idx:
    # Auto mode — margin-based probe count
    result = idx.search(query, k=10, nprobe="auto", return_routing=True)
    print(result.routing["adaptive_nprobe"])

    # Custom thresholds
    cfg = AdaptiveProbeConfig(
        margin_lo=0.005, margin_hi=0.2,
        min_probes=1, max_probes=8,
    )
    result = idx.search(query, k=10, nprobe=cfg)
```

## Tree Methods
Build hierarchical structure from embeddings for multi-resolution analysis:

```python
from dyf import build_pca_tree, cut_tree_to_labels

# Build a PCA-based binary tree
tree = build_pca_tree(embeddings, max_depth=8)

# Cut to get cluster labels
labels = cut_tree_to_labels(tree, max_depth=8, n_points=len(embeddings), n_clusters=25)
```

## Bridge-Aware Retrieval (RAG)
Use `BridgeIndex` for structure-aware approximate nearest neighbor search:

```python
from dyf import BridgeIndex

# Build index
index = BridgeIndex(n_anchors=1000)
index.fit(embeddings)

# Query
query = np.random.randn(384).astype(np.float32)
indices, scores = index.query(query, k=10)
```

## Re-ranking
Apply diversity-aware re-ranking to search results:

```python
from dyf import rerank_mmr, rerank_bridge_mmr

# Standard MMR re-ranking
reranked = rerank_mmr(query_emb, candidate_indices, embeddings_normed, top_k=10)

# Bridge-aware MMR (boosts items that span clusters)
reranked = rerank_bridge_mmr(
    query_emb, candidate_indices, embeddings_normed,
    bridge_scores, cluster_labels, top_k=10,
)
```

## Next Steps
See the API Reference for complete documentation of all classes and functions.