API Reference

Core

Core density classification and reporting. DensityClassifierFull wraps the Rust-accelerated core with embedding generation, DataFrame I/O, and LLM-based labeling.

DensityClassifierFull Density classifier using PCA-based LSH.

Tree Methods

Hierarchical tree construction for multi-resolution analysis of embeddings.

build_pca_tree Build a PCA bisection tree over embeddings.
cut_tree_to_labels Cut PCA tree at n_clusters and return point labels array.
extract_boundary_persistence Identify points that persist as boundary across multiple tree depths.
boundary_persistence_scores Compute depth-weighted boundary persistence bridge scores.
build_dyf_tree Build a DYF recursive tree over embeddings.
refine_dyf_tree Re-split incoherent tree leaves with rotated LSH seeds.
cut_dyf_tree_to_labels Cut DYF tree into flat cluster labels using agglomerative merge.
refine_clusters Post-processing safety net: eject periphery from incoherent clusters.

Retrieval (RAG)

Bridge-aware retrieval, anchor selection, and ontology construction.

BridgeIndex Two-tier RAG index using bridge-based anchors for efficient retrieval.
find_super_connectors Find super connectors: points with high centrality in both global and local
select_orthogonal_anchors Select k maximally spread anchors using greedy farthest-point sampling.
diversify_by_facet Select k items from k different semantic facets (buckets).
get_kmeans_init Get bridge-seeded initial centroids for k-means/IVF index construction.
compute_neighbor_diversity Compute neighbor diversity for each point.
compute_hub_score Compute hub score for each point in the embedding space.
mine_dag_chains Mine DAG (directed acyclic graph) structures from embedding space.
build_dag_taxonomy Build a navigable DAG taxonomy from embedding space.
build_unified_ontology Build a unified ontology that covers both dense and sparse embedding regions.
build_rog_ontology Recursive Ontological Generation (ROG).
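
As an illustration of the greedy farthest-point sampling named in the select_orthogonal_anchors entry, here is a minimal sketch in plain Python. It is not the library's implementation: the function name farthest_point_sample, the start parameter, and the use of Euclidean distance are assumptions made for the example.

```python
import math

def farthest_point_sample(points, k, start=0):
    """Greedy farthest-point sampling: return indices of k spread-out points.

    Hypothetical helper for illustration; Euclidean distance is an assumption.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # min_dist[i] is the distance from point i to its nearest selected anchor.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        # Pick the point that is farthest from every anchor chosen so far.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected

points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
anchors = farthest_point_sample(points, k=3)
print(anchors)
```

Each step costs one pass over the points, so selecting k anchors from n points takes O(n * k) distance evaluations.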

Result Types

Dataclasses returned by RAG and analysis functions.

SuperConnectorResult Result of finding super connectors in an embedding space.
OrthogonalAnchorResult Result of orthogonal anchor selection.
FacetDiverseResult Result of facet-based diversification.
DAGChain A single DAG chain representing a hierarchical path.
DAGMiningResult Result of DAG mining on an embedding space.
DAGTaxonomy A navigable DAG taxonomy extracted from embedding space.
UnifiedOntologyResult Result of building a unified ontology from embedding space.
HubScoreResult Result of hub score computation.
ROGLayer A single layer in the ROG hierarchy.
ROGResult Result of Recursive Ontological Generation.

Re-ranking

Re-rank search results with diversity and bridge awareness.

rerank_standard Take the top-k most similar candidates.
rerank_mmr Maximal Marginal Relevance re-ranking.
rerank_bridge_boost Re-rank by blending similarity with normalized bridge score.
rerank_bridge_mmr MMR with bridge-weighted diversity and optional meta-cluster awareness.
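
To make the Maximal Marginal Relevance idea behind rerank_mmr concrete, the sketch below shows the textbook greedy MMR loop in plain Python. It is an illustration only, not the library's code: the names cosine and mmr_rerank_sketch, the lam trade-off parameter, and the choice of cosine similarity are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mmr_rerank_sketch(query, candidates, k, lam=0.5):
    """Return indices of up to k candidates chosen greedily by MMR.

    lam=1.0 ranks purely by similarity to the query; lower values
    penalize candidates that resemble items already picked.
    """
    relevance = [cosine(query, c) for c in candidates]
    remaining = list(range(len(candidates)))
    picked = []
    while remaining and len(picked) < k:
        def score(i):
            # Redundancy is the highest similarity to any already-picked item.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in picked),
                default=0.0,
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        picked.append(best)
        remaining.remove(best)
    return picked

query = (1.0, 0.0)
candidates = [(1.0, 0.05), (1.0, 0.06), (0.7, 0.7)]
diverse = mmr_rerank_sketch(query, candidates, k=2, lam=0.3)   # [0, 2]
plain = mmr_rerank_sketch(query, candidates, k=2, lam=1.0)     # [0, 1]
print(diverse, plain)
```

With lam=0.3 the near-duplicate second candidate is skipped in favor of the more distinct third one; with lam=1.0 the result reduces to a plain top-k by similarity.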

Chunk Analysis

Analyze chunk redundancy, document spread, and cluster quality.

DocSpread Per-document chunk distribution across LSH buckets.
chunk_redundancy Count same-doc siblings sharing each point’s bucket.
deduplicate_chunks Boolean mask keeping one representative per (bucket, doc) pair.
doc_spread Per-document chunk distribution analysis.
cluster_quality Identify meta/structural clusters from neighbor coherence.
neighbor_coherence Measure how coherent each point’s neighborhood is.
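
The chunk_redundancy and deduplicate_chunks entries both group chunks by (bucket, doc) pair. The sketch below shows one plausible reading of those two descriptions in plain Python. The function names, the list-based inputs, and the rule of keeping the first chunk seen per pair are assumptions for illustration, not the library's behavior.

```python
from collections import Counter

def sibling_counts(bucket_ids, doc_ids):
    """For each chunk, count other chunks from the same doc in the same bucket."""
    counts = Counter(zip(bucket_ids, doc_ids))
    return [counts[pair] - 1 for pair in zip(bucket_ids, doc_ids)]

def dedup_mask(bucket_ids, doc_ids):
    """Boolean mask keeping the first chunk seen for each (bucket, doc) pair."""
    seen = set()
    mask = []
    for pair in zip(bucket_ids, doc_ids):
        mask.append(pair not in seen)
        seen.add(pair)
    return mask

bucket_ids = [7, 7, 7, 3, 3]
doc_ids = ["a", "a", "b", "a", "a"]
redundancy = sibling_counts(bucket_ids, doc_ids)  # [1, 1, 0, 1, 1]
mask = dedup_mask(bucket_ids, doc_ids)            # [True, False, True, True, False]
print(redundancy, mask)
```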

Lazy Index

Memory-mapped tree index with FlatBuffers + Arrow IPC. Build indexes with write_lazy_index, search with LazyIndex, and use adaptive probing for margin-aware probe counts.

LazyIndex Memory-mapped DYF index for instant-start query serving.
write_lazy_index Write a DYF lazy index file (FlatBuffers tree + Arrow IPC leaf data).
rewrite_lazy_index Rewrite a .dyf file with additional stored fields and/or metadata.
split_dyf3 Split a DYF3 file into chunks for transport (e.g. GitHub Pages).
from_faiss Build a DYF lazy index file from a FAISS IVF index.
SearchResult Search result with indices, scores, and optional stored fields.
AdaptiveProbeConfig Configuration for adaptive probe count based on routing margin.

Fisher Weighting

Per-dimension Fisher ratio weighting for supervised embedding refinement.

compute_fisher_weights Compute sqrt(Fisher ratio) per dimension, L2-normalized.
apply_fisher_weights Multiply embeddings by per-dimension weights.
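
The following sketch illustrates the computation these two entries describe: a per-dimension Fisher ratio, its square root, L2 normalization, and element-wise scaling. It assumes the common definition of the Fisher ratio as between-class scatter divided by within-class scatter. The library's exact definition, function signatures, and the eps guard are not stated in this reference, so the names and parameters here are hypothetical.

```python
import math
from collections import defaultdict

def fisher_weights_sketch(rows, labels, eps=1e-12):
    """sqrt(between-class / within-class scatter) per dimension, L2-normalized."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    weights = []
    for d in range(len(rows[0])):
        overall = sum(r[d] for r in rows) / len(rows)
        between = within = 0.0
        for members in groups.values():
            mean = sum(r[d] for r in members) / len(members)
            between += len(members) * (mean - overall) ** 2
            within += sum((r[d] - mean) ** 2 for r in members)
        # eps avoids division by zero when a dimension has no within-class spread.
        weights.append(math.sqrt(between / (within + eps)))
    norm = math.sqrt(sum(w * w for w in weights)) or 1.0
    return [w / norm for w in weights]

def apply_weights_sketch(rows, weights):
    """Multiply every embedding by the per-dimension weights."""
    return [[x * w for x, w in zip(row, weights)] for row in rows]

rows = [[0.0, 1.0], [0.2, 3.0], [1.0, 2.0], [1.2, 0.0]]
labels = ["a", "a", "b", "b"]
weights = fisher_weights_sketch(rows, labels)
weighted = apply_weights_sketch(rows, weights)
print(weights)
```

In this toy data the first dimension separates the two classes cleanly while the second mostly carries within-class noise, so the first dimension receives the larger weight.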

Split Keywords

TF-IDF keyword extraction from DYF tree splits. Each PCA split naturally separates vocabularies, so these functions extract discriminative terms per split side without LLM calls.

compute_split_keywords Compute discriminative TF-IDF keywords for each side of each tree split.
compute_embedding_keywords Compute discriminative keywords by projecting term embeddings onto split hyperplanes.
label_clusters_frequency Label clusters using TF-IDF token frequency (no LLM).
build_tree_maps Extract tree structure, children_map, and leaf_batches from a LazyIndex.
collect_descendant_indices Recursively collect all item indices under a node.
compute_domain_stopwords Find corpus-level stop words: words appearing in more than a threshold fraction of titles.
format_split_path Return the keyword path for a single item from root to leaf.
tokenize Lowercase, extract words of 3+ chars, filter stop words.
TextDiversityReport Result of text diversity assessment for a collection of titles.
assess_text_diversity Cheaply assess whether a title collection has enough diversity for LLM labeling.
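
The tokenize and compute_domain_stopwords descriptions can be illustrated with a short plain-Python sketch. The regular expression, the small stop-word list, the 0.5 default threshold, and the function names are assumptions chosen for the example; the library's actual tokenization rules and defaults may differ.

```python
import re
from collections import Counter

# Illustrative stop-word list, not the library's.
BASIC_STOPWORDS = frozenset({"the", "and", "for", "with", "from"})

def simple_tokenize(text, stopwords=BASIC_STOPWORDS):
    """Lowercase, keep alphabetic words of 3+ characters, drop stop words."""
    return [w for w in re.findall(r"[a-z]{3,}", text.lower()) if w not in stopwords]

def domain_stopwords_sketch(titles, threshold=0.5):
    """Words that occur in more than `threshold` of the titles."""
    doc_freq = Counter()
    for title in titles:
        # Count each word once per title (document frequency).
        doc_freq.update(set(simple_tokenize(title)))
    return {w for w, n in doc_freq.items() if n / len(titles) > threshold}

titles = [
    "A Survey of Graph Neural Networks",
    "Survey of Retrieval Methods for RAG",
    "The Survey on Hashing",
    "Fast Hashing with PCA",
]
tokens = simple_tokenize(titles[0])           # ['survey', 'graph', 'neural', 'networks']
corpus_stops = domain_stopwords_sketch(titles)  # {'survey'}
print(tokens, corpus_stops)
```

Here "survey" appears in three of four titles and is flagged, while "hashing" appears in exactly half and stays below the strict threshold.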

Cluster Tree

Connect BIRCH spatial clusters to DYF tree nodes for deterministic cluster naming.

build_cluster_tree_dag Build a DAG connecting BIRCH clusters to tree nodes they overlap with.
compute_sibling_keywords Compute contrastive TF-IDF keywords for clusters sharing tree parents.
derive_path_labels Derive deterministic path labels for each cluster from the DAG.
format_cluster_context Format cluster context for an LLM labeling prompt.

Agglomeration

Merge DYF tree leaves into coarser groups via linkage or community detection.

agglomerate_tree_leaves Agglomerate DYF tree leaves into ~n_groups using embedding centroids.
louvain_cluster_leaves Cluster tree leaves via centroid KNN + Louvain community detection.
merge_to_max_k Merge communities to ≤max_k using complete linkage on community centroids.
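
As a rough picture of what merging down to at most max_k groups with complete linkage involves, here is a small plain-Python sketch. It treats each input centroid as its own starting community and uses Euclidean distance; both are simplifying assumptions, and the function name is hypothetical rather than the library's merge_to_max_k.

```python
import math

def complete_linkage_merge(centroids, max_k):
    """Merge groups of centroids until at most max_k groups remain.

    Returns one group label per input centroid.
    """
    groups = [[i] for i in range(len(centroids))]

    def linkage(a, b):
        # Complete linkage: distance between the farthest pair of members.
        return max(math.dist(centroids[i], centroids[j]) for i in a for j in b)

    while len(groups) > max(max_k, 1):
        # Find the pair of groups with the smallest complete-linkage distance.
        _, x, y = min(
            (linkage(groups[x], groups[y]), x, y)
            for x in range(len(groups))
            for y in range(x + 1, len(groups))
        )
        groups[x].extend(groups.pop(y))

    labels = [0] * len(centroids)
    for label, members in enumerate(groups):
        for i in members:
            labels[i] = label
    return labels

centroids = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
labels = complete_linkage_merge(centroids, max_k=3)
print(labels)  # [0, 0, 1, 1, 2]
```

This brute-force version recomputes all pairwise linkages on every merge, which is fine for a few hundred communities but would need a cached distance matrix at larger scale.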

Categorical

Hierarchical taxonomy graphs, label coarsening, and multi-level Fisher weighting.

CategoryGraph DAG of string-labeled category nodes with weighted edges.
coarsen General label extraction / coarsening.
multi_level_fisher_weights Compute Fisher weights averaged across multiple DAG depths.
store_category_graph Serialize a CategoryGraph for .dyf metadata.
load_category_graphs Deserialize CategoryGraphs from .dyf metadata.
AxisDiagnostic Diagnostic result for a single categorical axis.
diagnose_axes Detect which categorical axes are under-served by the embedding.
diagnostics_to_metadata Serialize axis diagnostics for .dyf metadata.
discover_categorical_columns Auto-detect categorical columns from a polars DataFrame.
embed_with_diagnostics Two-pass embedding with axis diagnostics.

Colors

Deterministic color assignment for spatial clusters.

spatial_rgb_map Return dict mapping label -> [r, g, b] with spatially coherent hues.
spatial_color_map Return dict mapping label -> hex color with spatially coherent hues.
tree_rgb_map Assign colors by DFS leaf order of the DYF tree.

Provenance

Artifact identity fingerprinting for DAG-oriented pipelines.

Provenance Identity fingerprint for a pipeline artifact.
create_provenance Create a provenance record for a new artifact.
file_hash Fast partial hash: first 64 KB + file size + mtime.
params_hash Deterministic hash of sorted JSON-serialized params.
check_compatible Check if an upstream artifact is compatible with the current pipeline.
provenance_to_dict Convert a provenance record to a JSON-serializable dict.
provenance_from_dict Reconstruct a provenance record from a dict.
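
The file_hash and params_hash entries describe two fingerprinting techniques that are easy to sketch with the standard library. The code below is a hedged illustration: the function names, the choice of SHA-256, and the exact way size and mtime are mixed into the digest are assumptions, not the library's format.

```python
import hashlib
import json
import os
import tempfile

def partial_file_hash(path, head_bytes=64 * 1024):
    """Hash the first 64 KB plus file size and mtime (not the full contents)."""
    stat = os.stat(path)
    h = hashlib.sha256()
    with open(path, "rb") as f:
        h.update(f.read(head_bytes))
    h.update(f"{stat.st_size}:{stat.st_mtime_ns}".encode())
    return h.hexdigest()

def stable_params_hash(params):
    """Hash of the params serialized as JSON with sorted keys."""
    payload = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Key order does not affect the params hash.
h1 = stable_params_hash({"depth": 8, "seed": 42})
h2 = stable_params_hash({"seed": 42, "depth": 8})

# The file hash is stable as long as the file is untouched.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 100_000)
digest_a = partial_file_hash(tmp.name)
digest_b = partial_file_hash(tmp.name)
os.remove(tmp.name)
print(h1 == h2, digest_a == digest_b)
```

A partial hash like this trades certainty for speed: two large files that share their first 64 KB, size, and modification time would collide, which is usually acceptable for pipeline cache checks but not for integrity verification.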

Configuration

Pre-built configurations for embedding models and LLM labelers.

EmbedderConfig Configuration for text embedding models.
LabelerConfig Configuration for LLM-based bucket labeling.
list_configs Print available embedder and labeler configurations.