API Reference
Core
Core density classification and reporting. DensityClassifierFull wraps the Rust-accelerated core with embedding generation, DataFrame I/O, and LLM-based labeling.
| DensityClassifierFull | Density Classifier using PCA-based LSH. |
Tree Methods
Hierarchical tree construction for multi-resolution analysis of embeddings.
| build_pca_tree | Build a PCA bisection tree over embeddings. |
| cut_tree_to_labels | Cut PCA tree at n_clusters and return point labels array. |
| extract_boundary_persistence | Identify points that persist as boundary across multiple tree depths. |
| boundary_persistence_scores | Compute depth-weighted boundary persistence bridge scores. |
| build_dyf_tree | Build a DYF recursive tree over embeddings. |
| refine_dyf_tree | Re-split incoherent tree leaves with rotated LSH seeds. |
| cut_dyf_tree_to_labels | Cut DYF tree into flat cluster labels using agglomerative merge. |
| refine_clusters | Post-processing safety net: eject periphery from incoherent clusters. |
Retrieval (RAG)
Bridge-aware retrieval, anchor selection, and ontology construction.
| BridgeIndex | Two-tier RAG index using bridge-based anchors for efficient retrieval. |
| find_super_connectors | Find super connectors: points with high centrality in both global and local |
| select_orthogonal_anchors | Select k maximally spread anchors using greedy farthest-point sampling. |
| diversify_by_facet | Select k items from k different semantic facets (buckets). |
| get_kmeans_init | Get bridge-seeded initial centroids for k-means/IVF index construction. |
| compute_neighbor_diversity | Compute neighbor diversity for each point. |
| compute_hub_score | Compute hub score for each point in the embedding space. |
| mine_dag_chains | Mine DAG (directed acyclic graph) structures from embedding space. |
| build_dag_taxonomy | Build a navigable DAG taxonomy from embedding space. |
| build_unified_ontology | Build a unified ontology that covers both dense and sparse embedding regions. |
| build_rog_ontology | Recursive Ontological Generation (ROG). |
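
The greedy farthest-point sampling named in the `select_orthogonal_anchors` summary can be illustrated with a small standalone sketch. This is not the library's implementation: the function name `farthest_point_sample`, the `start` parameter, and the use of Euclidean distance are assumptions made for illustration only.

```python
import math

def farthest_point_sample(points, k, start=0):
    """Greedy farthest-point sampling (hypothetical helper, not the library API).

    Returns the indices of k points, each chosen to maximize its distance
    to the nearest already-chosen point.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    chosen = [start]
    # min_dist[i] is the distance from point i to its nearest chosen anchor.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        # Pick the point that is farthest from every anchor chosen so far.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen

points = [(0, 0), (1, 0), (10, 0), (0, 10), (5, 5)]
anchors = farthest_point_sample(points, k=3)
```

Each round costs one pass over the points, so selecting k anchors from n points takes O(n·k) distance evaluations in this sketch.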
Result Types
Dataclasses returned by RAG and analysis functions.
| SuperConnectorResult | Result of finding super connectors in an embedding space. |
| OrthogonalAnchorResult | Result of orthogonal anchor selection. |
| FacetDiverseResult | Result of facet-based diversification. |
| DAGChain | A single DAG chain representing a hierarchical path. |
| DAGMiningResult | Result of DAG mining on an embedding space. |
| DAGTaxonomy | A navigable DAG taxonomy extracted from embedding space. |
| UnifiedOntologyResult | Result of building a unified ontology from embedding space. |
| HubScoreResult | Result of hub score computation. |
| ROGLayer | A single layer in the ROG hierarchy. |
| ROGResult | Result of Recursive Ontological Generation. |
Re-ranking
Re-rank search results with diversity and bridge awareness.
| rerank_standard | Take the top-k most similar candidates. |
| rerank_mmr | Maximal Marginal Relevance re-ranking. |
| rerank_bridge_boost | Re-rank by blending similarity with normalized bridge score. |
| rerank_bridge_mmr | MMR with bridge-weighted diversity and optional meta-cluster awareness. |
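
As a rough illustration of the Maximal Marginal Relevance idea behind `rerank_mmr`, the sketch below trades query similarity against redundancy with already-selected items. It is a generic textbook-style MMR under stated assumptions, not the library's code: the names `cosine`, `mmr_sketch`, and `lambda_` are hypothetical, and the actual function signatures may differ.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mmr_sketch(query, candidates, k, lambda_=0.5):
    """Return indices of up to k candidates chosen by MMR (hypothetical helper).

    lambda_=1.0 reduces to plain top-k by similarity; lower values
    penalize candidates that resemble ones already selected.
    """
    relevance = [cosine(query, c) for c in candidates]
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_ * relevance[i] - (1 - lambda_) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = (1.0, 0.0)
candidates = [(1.0, 0.0), (0.99, 0.05), (0.6, 0.8)]
top_k = mmr_sketch(query, candidates, k=2, lambda_=1.0)
diverse = mmr_sketch(query, candidates, k=2, lambda_=0.3)
```

With `lambda_=1.0` the near-duplicate candidate is kept because only similarity matters; with `lambda_=0.3` it is skipped in favor of the more distinct third candidate.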
Chunk Analysis
Analyze chunk redundancy, document spread, and cluster quality.
| DocSpread | Per-document chunk distribution across LSH buckets. |
| chunk_redundancy | Count same-doc siblings sharing each point’s bucket. |
| deduplicate_chunks | Boolean mask keeping one representative per (bucket, doc) pair. |
| doc_spread | Per-document chunk distribution analysis. |
| cluster_quality | Identify meta/structural clusters from neighbor coherence. |
| neighbor_coherence | Measure how coherent each point’s neighborhood is. |
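
The `chunk_redundancy` and `deduplicate_chunks` summaries both group chunks by (bucket, doc) pair. A minimal sketch of that grouping follows, assuming plain Python lists of bucket and document identifiers. The function names and the choice to keep the first chunk of each pair as the representative are assumptions, not documented behavior.

```python
from collections import Counter

def redundancy_sketch(bucket_ids, doc_ids):
    """For each chunk, count other chunks from the same doc in the same bucket."""
    counts = Counter(zip(bucket_ids, doc_ids))
    return [counts[key] - 1 for key in zip(bucket_ids, doc_ids)]

def dedup_mask_sketch(bucket_ids, doc_ids):
    """Boolean mask that keeps the first chunk seen for each (bucket, doc) pair."""
    seen = set()
    mask = []
    for key in zip(bucket_ids, doc_ids):
        mask.append(key not in seen)
        seen.add(key)
    return mask

buckets = [0, 0, 1, 0, 1]
docs = ["a", "a", "a", "b", "a"]
siblings = redundancy_sketch(buckets, docs)
keep = dedup_mask_sketch(buckets, docs)
```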
Lazy Index
Memory-mapped tree index with FlatBuffers + Arrow IPC. Build indexes with write_lazy_index, search with LazyIndex, and use adaptive probing for margin-aware probe counts.
| LazyIndex | Memory-mapped DYF index for instant-start query serving. |
| write_lazy_index | Write a DYF lazy index file (FlatBuffers tree + Arrow IPC leaf data). |
| rewrite_lazy_index | Rewrite a .dyf file with additional stored fields and/or metadata. |
| split_dyf3 | Split a DYF3 file into chunks for transport (e.g. GitHub Pages). |
| from_faiss | Build a DYF lazy index file from a FAISS IVF index. |

| SearchResult | Search result with indices, scores, and optional stored fields. |
| AdaptiveProbeConfig | Configuration for adaptive probe count based on routing margin. |
Fisher Weighting
Per-dimension Fisher ratio weighting for supervised embedding refinement.
| compute_fisher_weights | Compute sqrt(Fisher ratio) per dimension, L2-normalized. |
| apply_fisher_weights | Multiply embeddings by per-dimension weights. |
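
To make "sqrt(Fisher ratio) per dimension, L2-normalized" concrete, here is a standalone sketch that takes the Fisher ratio of a dimension to be between-class scatter divided by within-class scatter. That definition, the `eps` guard, and the names `fisher_weights_sketch` and `apply_weights_sketch` are assumptions; the library may define or normalize the ratio differently.

```python
import math
from collections import defaultdict

def fisher_weights_sketch(X, y, eps=1e-12):
    """Per-dimension sqrt(between / within scatter), scaled to unit L2 norm."""
    n, d = len(X), len(X[0])
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    weights = []
    for j in range(d):
        overall = sum(row[j] for row in X) / n
        between = within = 0.0
        for rows in groups.values():
            mean = sum(r[j] for r in rows) / len(rows)
            between += len(rows) * (mean - overall) ** 2
            within += sum((r[j] - mean) ** 2 for r in rows)
        # eps avoids division by zero when a class has no spread on this axis.
        weights.append(math.sqrt(between / (within + eps)))
    norm = math.sqrt(sum(w * w for w in weights)) or 1.0
    return [w / norm for w in weights]

def apply_weights_sketch(X, w):
    """Multiply every embedding by the per-dimension weights."""
    return [[x * wj for x, wj in zip(row, w)] for row in X]

# Dimension 0 separates the two classes; dimension 1 does not.
X = [[0.0, 1.0], [0.2, 3.0], [5.0, 1.1], [5.2, 2.9]]
y = [0, 0, 1, 1]
w = fisher_weights_sketch(X, y)
X_weighted = apply_weights_sketch(X, w)
```

In this toy input nearly all of the weight lands on the first dimension, which is the intended effect: dimensions that discriminate between labels are amplified and the rest are suppressed.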
Split Keywords
TF-IDF keyword extraction from DYF tree splits. Each PCA split tends to separate vocabularies, so these functions extract discriminative terms for each side of a split without LLM calls.
| compute_split_keywords | Compute discriminative TF-IDF keywords for each side of each tree split. |
| compute_embedding_keywords | Compute discriminative keywords by projecting term embeddings onto split hyperplanes. |
| label_clusters_frequency | Label clusters using TF-IDF token frequency (no LLM). |
| build_tree_maps | Extract tree structure, children_map, and leaf_batches from a LazyIndex. |
| collect_descendant_indices | Recursively collect all item indices under a node. |
| compute_domain_stopwords | Compute corpus-level stop words: words appearing in more than a threshold fraction of titles. |
| format_split_path | Return the keyword path for a single item from root to leaf. |
| tokenize | Lowercase, extract words of 3+ chars, filter stop words. |
| TextDiversityReport | Result of text diversity assessment for a collection of titles. |
| assess_text_diversity | Cheaply assess whether a title collection has enough diversity for LLM labeling. |
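
The tokenization and corpus-level stop-word rules described for `tokenize` and `compute_domain_stopwords` can be sketched in a few lines. Everything named here (`tokenize_sketch`, `domain_stopwords_sketch`, the tiny `BASIC_STOPWORDS` set, the ASCII-only word pattern, and the default `threshold`) is an illustrative assumption rather than the library's actual behavior.

```python
import re
from collections import Counter

# Illustrative stop-word list only; the real list is not shown in this reference.
BASIC_STOPWORDS = frozenset({"the", "and", "for", "with", "from"})

def tokenize_sketch(text, stopwords=BASIC_STOPWORDS):
    """Lowercase, keep alphabetic words of 3+ characters, drop stop words."""
    words = re.findall(r"[a-z]{3,}", text.lower())
    return [w for w in words if w not in stopwords]

def domain_stopwords_sketch(titles, threshold=0.5):
    """Words that occur in more than `threshold` of the titles."""
    if not titles:
        return set()
    doc_freq = Counter()
    for title in titles:
        # Count each word once per title (document frequency).
        doc_freq.update(set(tokenize_sketch(title)))
    return {w for w, c in doc_freq.items() if c / len(titles) > threshold}

tokens = tokenize_sketch("The Fast PCA tree for LSH")
titles = ["deep learning survey", "deep learning for vision", "graph learning"]
common = domain_stopwords_sketch(titles, threshold=0.5)
```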
Cluster Tree
Connect BIRCH spatial clusters to DYF tree nodes for deterministic cluster naming.
| build_cluster_tree_dag | Build a DAG connecting BIRCH clusters to tree nodes they overlap with. |
| compute_sibling_keywords | Compute contrastive TF-IDF keywords for clusters sharing tree parents. |
| derive_path_labels | Derive deterministic path labels for each cluster from the DAG. |
| format_cluster_context | Format cluster context for an LLM labeling prompt. |
Agglomeration
Merge DYF tree leaves into coarser groups via linkage or community detection.
| agglomerate_tree_leaves | Agglomerate DYF tree leaves into ~n_groups using embedding centroids. |
| louvain_cluster_leaves | Cluster tree leaves via centroid KNN + Louvain community detection. |
| merge_to_max_k | Merge communities to ≤max_k using complete linkage on community centroids. |
Categorical
Hierarchical taxonomy graphs, label coarsening, and multi-level Fisher weighting.
| CategoryGraph | DAG of string-labeled category nodes with weighted edges. |
| coarsen | General label extraction / coarsening. |
| multi_level_fisher_weights | Compute Fisher weights averaged across multiple DAG depths. |
| store_category_graph | Serialize a CategoryGraph for .dyf metadata. |
| load_category_graphs | Deserialize CategoryGraphs from .dyf metadata. |
| AxisDiagnostic | Diagnostic result for a single categorical axis. |
| diagnose_axes | Detect which categorical axes are under-served by the embedding. |
| diagnostics_to_metadata | Serialize axis diagnostics for .dyf metadata. |
| discover_categorical_columns | Auto-detect categorical columns from a polars DataFrame. |
| embed_with_diagnostics | Two-pass embedding with axis diagnostics. |
Colors
Deterministic color assignment for spatial clusters.
| spatial_rgb_map | Return dict mapping label -> [r, g, b] with spatially coherent hues. |
| spatial_color_map | Return dict mapping label -> hex color with spatially coherent hues. |
| tree_rgb_map | Assign colors by DFS leaf order of the DYF tree. |
Provenance
Artifact identity fingerprinting for DAG-oriented pipelines.
| Provenance | Identity fingerprint for a pipeline artifact. |
| create_provenance | Create a provenance record for a new artifact. |
| file_hash | Fast partial hash: first 64 KB + file size + mtime. |
| params_hash | Deterministic hash of sorted JSON-serialized params. |
| check_compatible | Check if an upstream artifact is compatible with the current pipeline. |
| provenance_to_dict | Convert a Provenance to a JSON-serializable dict. |
| provenance_from_dict | Reconstruct a Provenance from a dict. |
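
The two hashing summaries above (`params_hash` and `file_hash`) describe recipes that are easy to sketch with the standard library. The code below is an assumption-laden illustration: the choice of SHA-256, the exact byte layout, and the names `params_hash_sketch` and `file_hash_sketch` are hypothetical and will not reproduce the library's actual digests.

```python
import hashlib
import json
import os

def params_hash_sketch(params):
    """Hash params so that key order does not affect the result."""
    payload = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def file_hash_sketch(path, head_bytes=64 * 1024):
    """Partial hash: first 64 KB of content plus file size and mtime."""
    st = os.stat(path)
    h = hashlib.sha256()
    with open(path, "rb") as f:
        h.update(f.read(head_bytes))
    # Size and mtime catch changes beyond the first 64 KB without reading them.
    h.update(f"{st.st_size}:{st.st_mtime_ns}".encode("ascii"))
    return h.hexdigest()

a = params_hash_sketch({"n_clusters": 8, "seed": 42})
b = params_hash_sketch({"seed": 42, "n_clusters": 8})
c = params_hash_sketch({"seed": 43, "n_clusters": 8})
```

A partial hash like this is fast on large artifacts, at the cost of missing edits that change neither the first 64 KB, the size, nor the modification time.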
Configuration
Pre-built configurations for embedding models and LLM labelers.
| EmbedderConfig | Configuration for text embedding models. |
| LabelerConfig | Configuration for LLM-based bucket labeling. |
| list_configs | Print available embedder and labeler configurations. |