# Getting Started
This guide walks through the core DYF workflow: fitting a classifier, inspecting results, and integrating with DataFrames.
## Basic Usage
The fastest way to get started is with `DensityClassifier` (Rust-accelerated):

```python
import numpy as np
from dyf import DensityClassifier

# Load or generate embeddings
embeddings = np.random.randn(10000, 384).astype(np.float32)

# Create classifier and fit
classifier = DensityClassifier(embedding_dim=384)
classifier.fit(embeddings)

# Inspect the report
report = classifier.report()
print(report)
```

The report shows how items distribute across the Dense, Bridge, and Orphan categories.
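To build intuition for the three categories, here is an illustrative toy sketch using plain NumPy: points with many close neighbors behave like Dense items, nearly isolated points like Orphans, and points with intermediate connectivity like Bridges. The radius and thresholds below are invented for illustration only; this is not DYF's actual classification algorithm.

```python
import numpy as np

# Toy neighbor-count heuristic for the Dense / Bridge / Orphan intuition.
# The radius (0.3) and count thresholds are made up for this sketch;
# DYF's real classifier works differently.
rng = np.random.default_rng(0)

# Two tight clusters plus a few scattered points
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
cluster_b = rng.normal(loc=1.0, scale=0.1, size=(50, 2))
stragglers = rng.uniform(-2, 3, size=(5, 2))
points = np.vstack([cluster_a, cluster_b, stragglers]).astype(np.float32)

# Count neighbors within a fixed radius (excluding each point itself)
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
neighbor_counts = (dists < 0.3).sum(axis=1) - 1

dense = np.where(neighbor_counts >= 10)[0]   # many close neighbors
orphan = np.where(neighbor_counts <= 2)[0]   # nearly isolated
bridge_like = np.setdiff1d(np.arange(len(points)),
                           np.concatenate([dense, orphan]))
```

In this toy setup, cluster members end up in `dense` and the scattered stragglers in `orphan`, which mirrors how the report buckets items along the density spectrum.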
## Retrieving Items
```python
# Get item indices by category
dense_items = classifier.get_dense()
bridge_items = classifier.get_bridge()
orphan_items = classifier.get_orphans()

# Get per-item metrics
bucket_ids = classifier.get_bucket_ids()
bucket_sizes = classifier.get_bucket_sizes()
isolation_scores = classifier.get_isolation_scores()
```

## Full-Featured Classifier
`DensityClassifierFull` wraps the Rust core with embedding generation, DataFrame I/O, and LLM-based labeling:

```python
from dyf import DensityClassifierFull, EmbedderConfig, LabelerConfig

# From raw texts (auto-generates TF-IDF embeddings)
classifier = DensityClassifierFull.from_texts(
    texts=["document one", "document two", ...],
    categories=["cat_a", "cat_b", ...],
)

# Label buckets with a local LLM
labels = classifier.label_buckets(**LabelerConfig.MEDIUM.as_kwargs())
```

## DataFrame Integration
Create classifiers directly from Polars or Pandas DataFrames:

```python
import polars as pl

df = pl.DataFrame({
    "text": ["doc1", "doc2", ...],
    "embedding": [emb1, emb2, ...],  # list of numpy arrays
    "category": ["a", "b", ...],
})

# From Polars
classifier = DensityClassifierFull.from_polars(
    df,
    embedding_col="embedding",
    category_col="category",
    text_col="text",
)

# Export results back to Polars
results_df = classifier.to_polars()
```

Pandas works the same way via `from_pandas()` / `to_pandas()`.
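As a minimal sketch, a Pandas frame presumably needs the same column layout as the Polars one above. The sketch below only builds the frame; the `from_pandas()` call is shown as a comment because its keyword arguments are an assumption based on the statement that it mirrors `from_polars()`.

```python
import numpy as np
import pandas as pd

# Same columns as the Polars example: text, embedding, category,
# with one numpy array per row in the embedding column.
emb1, emb2 = np.random.randn(2, 384).astype(np.float32)
df = pd.DataFrame({
    "text": ["doc1", "doc2"],
    "embedding": [emb1, emb2],
    "category": ["a", "b"],
})

# Assumed to mirror from_polars() (kwargs below are an assumption):
# classifier = DensityClassifierFull.from_pandas(
#     df, embedding_col="embedding", category_col="category", text_col="text",
# )
# results_df = classifier.to_pandas()
```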
## Building & Searching Indexes
Build a `.dyf` index file for fast, memory-mapped search:

```python
from dyf import build_dyf_tree, write_lazy_index, LazyIndex

# Build a DYF tree
tree = build_dyf_tree(embeddings, max_depth=4, num_bits=3, min_leaf_size=8)

# Write to disk
write_lazy_index(
    tree, embeddings, "index.dyf",
    quantization="float16", compression="zstd",
    stored_fields={"title": titles},
    metadata={"model": "nomic-embed-text-v1.5"},
)

# Search (instant open, zero startup cost)
with LazyIndex("index.dyf") as idx:
    result = idx.search(query_embedding, k=10, nprobe=3)
    print(result.indices, result.scores)
    print(result.fields["title"])
```

## Adaptive Probing
Queries near decision boundaries automatically probe more leaves:

```python
from dyf import LazyIndex, AdaptiveProbeConfig

with LazyIndex("index.dyf") as idx:
    # Auto mode — margin-based probe count
    result = idx.search(query, k=10, nprobe="auto", return_routing=True)
    print(result.routing["adaptive_nprobe"])

    # Custom thresholds
    cfg = AdaptiveProbeConfig(
        margin_lo=0.005, margin_hi=0.2,
        min_probes=1, max_probes=8,
    )
    result = idx.search(query, k=10, nprobe=cfg)
```

## Tree Methods
Build hierarchical structure from embeddings for multi-resolution analysis:

```python
from dyf import build_pca_tree, cut_tree_to_labels

# Build a PCA-based binary tree
tree = build_pca_tree(embeddings, max_depth=8)

# Cut to get cluster labels
labels = cut_tree_to_labels(tree, max_depth=8, n_points=len(embeddings), n_clusters=25)
```

## Bridge-Aware Retrieval (RAG)
Use `BridgeIndex` for structure-aware approximate nearest neighbor search:

```python
from dyf import BridgeIndex

# Build index
index = BridgeIndex(n_anchors=1000)
index.fit(embeddings)

# Query
query = np.random.randn(384).astype(np.float32)
indices, scores = index.query(query, k=10)
```

## Re-ranking
Apply diversity-aware re-ranking to search results:

```python
from dyf import rerank_mmr, rerank_bridge_mmr

# Standard MMR re-ranking
reranked = rerank_mmr(query_emb, candidate_indices, embeddings_normed, top_k=10)

# Bridge-aware MMR (boosts items that span clusters)
reranked = rerank_bridge_mmr(
    query_emb, candidate_indices, embeddings_normed,
    bridge_scores, cluster_labels, top_k=10,
)
```

## Next Steps
See the API Reference for complete documentation of all classes and functions.