LazyIndex
LazyIndex(path)
Memory-mapped DYF index for instant-start query serving.
Opens a .dyf file, memory-maps it, parses the FlatBuffers metadata (a header in format version 1, a footer in version 2), and serves queries by traversing the tree and decompressing Arrow IPC batches on demand. Decompressed batches are kept in an LRU cache.
Usage
    idx = LazyIndex("index.dyf")
    results = idx.search(query_vector, k=10, nprobe=3)
    print(idx.tree_summary)
Attributes
| Name | Description |
|---|---|
| format_version | DYF format version: 1 (header FB) or 2 (footer FB, append-friendly). |
| has_stored_fields | True if this index has stored fields. |
| is_pq | True if this index uses Product Quantization. |
| quantization | Read quantization string from BuildParams. |
| stored_field_names | List of stored field names (empty if no stored fields). |
| tree_summary | Return tree stats without touching Arrow data. |
Methods
| Name | Description |
|---|---|
| close | Release mmap and file handle. |
| detect_enrichment_level | Detect enrichment level based on stored fields and metadata. |
| extract_all_fields | Bulk-read all leaf batches, concatenate, sort by item_index. |
| get_item_vector | Extract a single item’s embedding vector from the index. |
| get_leaf | Decompress and return a leaf’s Arrow RecordBatch. |
| get_split_eigenvalues | Return PCA eigenvalues for each internal node. |
| get_split_hyperplanes | Return the PCA hyperplane(s) for each internal node. |
| get_stored_fields | Look up stored fields for given item indices without re-searching. |
| get_tree_structure | Export tree hierarchy for visualization (FlatBuffers only, no Arrow). |
| search | Search the index for nearest neighbors. |
| search_ivf | IVF-style search: rank leaves by centroid similarity, then score. |
| to_faiss | Export dyf index as a FAISS IVF index. |
close
LazyIndex.close()
Release mmap and file handle.
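Because close() releases the mmap and the file handle, it is worth guaranteeing the call even when a query raises. The following is a minimal sketch, assuming only that the index object exposes the documented search() and close() methods. Whether LazyIndex is itself a context manager is not stated here, so the standard library's contextlib.closing is used instead. The helper name search_then_close is hypothetical.

```python
from contextlib import closing


def search_then_close(index, query, k=10, nprobe=3):
    """Run one query and release the index even if search() raises.

    `index` is any object with search() and close(), e.g. an opened LazyIndex.
    """
    with closing(index) as idx:
        # Documented backward-compatible unpacking of the search result.
        indices, scores = idx.search(query, k=k, nprobe=nprobe)
        # Copy into plain lists before close() runs on exit. Whether the
        # results reference mmap'd memory is unknown; copying is a precaution.
        return list(indices), list(scores)
```

For a long-lived server the index would normally stay open across queries; this pattern fits one-off scripts.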
detect_enrichment_level
LazyIndex.detect_enrichment_level()
Detect enrichment level based on stored fields and metadata.
Returns 0-3
- 0: Base (embeddings + tree only)
- 1: Projected (has umap_x, umap_y, umap_z stored fields)
- 2: Clustered (has community_id or cluster_* stored fields)
- 3: Viz-ready (has edge_pairs or tour_narration in metadata)
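The level rules can be sketched as a pure function over the stored field names and metadata keys. This is an illustration, not the library's implementation: it assumes the highest matching level wins, that level 1 requires all three UMAP fields, and that cluster_* means a name prefix. The function name enrichment_level is hypothetical.

```python
def enrichment_level(field_names, metadata_keys):
    """Return 0-3 following the level list above (highest match wins)."""
    fields = set(field_names)
    meta = set(metadata_keys)

    # Level 3: visualization artifacts recorded in metadata.
    if meta & {"edge_pairs", "tour_narration"}:
        return 3
    # Level 2: community_id or any stored field starting with "cluster_".
    if "community_id" in fields or any(f.startswith("cluster_") for f in fields):
        return 2
    # Level 1: all three UMAP coordinates are stored (assumed requirement).
    if {"umap_x", "umap_y", "umap_z"} <= fields:
        return 1
    return 0
```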
extract_all_fields
LazyIndex.extract_all_fields()
Bulk-read all leaf batches, concatenate, sort by item_index.
Returns
| Name | Type | Description |
|---|---|---|
| | ExtractedData | dict with keys: 'embeddings': (n, d) float32 array, or None if the file was written without embeddings (PQ indexes return approximate reconstructions); 'fields': {field_name: array} for all stored fields; 'metadata': dict of metadata key-value pairs. |
get_item_vector
LazyIndex.get_item_vector(item_index)
Extract a single item’s embedding vector from the index.
Scans leaf batches to find the item (building a cached mapping on first call), then returns the reconstructed float32 vector.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| item_index | | The item index (as stored in the item_index column). | required |
Returns
| Name | Type | Description |
|---|---|---|
| | | (dim,) float32 numpy array of the item’s embedding. |
Raises
| Name | Type | Description |
|---|---|---|
| | KeyError | If item_index is not found in the index. |
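The "cached mapping built on first call" can be pictured as a lazily populated dictionary from item_index to a (batch, row) position. The sketch below is an assumption about the general pattern, not the library's code: ItemLocator and the load_item_indices callback are hypothetical stand-ins for leaf decoding.

```python
class ItemLocator:
    """Maps item_index -> (batch_index, row), built lazily on first lookup."""

    def __init__(self, load_item_indices, num_batches):
        # load_item_indices(batch_index) -> iterable of item_index values
        # for that leaf (hypothetical callback standing in for leaf decoding).
        self._load = load_item_indices
        self._num_batches = num_batches
        self._mapping = None  # populated by the first call to locate()

    def locate(self, item_index):
        if self._mapping is None:
            # One full scan over all leaves; later lookups are O(1).
            self._mapping = {
                item: (batch, row)
                for batch in range(self._num_batches)
                for row, item in enumerate(self._load(batch))
            }
        # A plain dict lookup raises KeyError for unknown items,
        # matching the documented error behavior.
        return self._mapping[item_index]
```

The first lookup pays for a scan of every leaf, which is why repeated single-item access is cheap only after the mapping exists.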
get_leaf
LazyIndex.get_leaf(batch_index)
Decompress and return a leaf’s Arrow RecordBatch.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| batch_index | | The batch index (from node.BatchIndex()). | required |
Returns
| Name | Type | Description |
|---|---|---|
| | | pyarrow.RecordBatch with columns: item_index, embedding. |
get_split_eigenvalues
LazyIndex.get_split_eigenvalues()
Return PCA eigenvalues for each internal node.
Returns
| Name | Type | Description |
|---|---|---|
| | dict[int, np.ndarray] | {node_id: (num_bits,) float32 array} for internal nodes that have eigenvalues stored. Empty dict for old .dyf files. |
get_split_hyperplanes
LazyIndex.get_split_hyperplanes()
Return the PCA hyperplane(s) for each internal node.
Returns
| Name | Type | Description |
|---|---|---|
| | dict[int, np.ndarray] | {node_id: (num_bits, dim) float32 array} for internal nodes only. |
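One way to explore the returned hyperplanes is to compute a sign code for a vector at a given node. The sketch below is only an illustration of sign-based hashing against a (num_bits, dim) array. How the library actually routes (centering, thresholds, bit order) is not documented here, so the absence of an offset and the bit packing order are assumptions, and split_code is a hypothetical name.

```python
def split_code(hyperplanes, vector):
    """Pack the signs of the projections onto each hyperplane into an int.

    `hyperplanes` is a (num_bits, dim) nested sequence or array. Bit i is
    set when the dot product with row i is positive. No centering offset
    is applied (an assumption).
    """
    code = 0
    for bit, plane in enumerate(hyperplanes):
        projection = sum(p * x for p, x in zip(plane, vector))
        if projection > 0:
            code |= 1 << bit
    return code
```

With the dictionary returned above, a call would look like split_code(planes[node_id], query) for some internal node_id.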
get_stored_fields
LazyIndex.get_stored_fields(item_indices)
Look up stored fields for given item indices without re-searching.
Scans all leaves to find matching items. Less efficient than getting fields from search results, but useful for post-hoc exploration.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| item_indices | | array-like of item indices to look up. | required |
Returns
| Name | Type | Description |
|---|---|---|
| | | dict mapping field name to list of values (in same order as item_indices). Missing items get None. |
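The column-oriented result is often easier to inspect as one record per item. The helper below is a sketch with a hypothetical name, fields_to_records. It assumes a missing item shows up as None in every field list, which also means an item whose stored values are all genuinely None would be reported as missing.

```python
def fields_to_records(item_indices, fields):
    """Turn {field: [v0, v1, ...]} into one dict per requested item.

    Returns (records, missing), where `missing` lists items whose value
    is None in every field (assumed to mean "not found").
    """
    records, missing = [], []
    for pos, item in enumerate(item_indices):
        row = {name: values[pos] for name, values in fields.items()}
        if row and all(v is None for v in row.values()):
            missing.append(item)
        else:
            records.append({"item_index": item, **row})
    return records, missing
```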
get_tree_structure
LazyIndex.get_tree_structure()
Export tree hierarchy for visualization (FlatBuffers only, no Arrow).
Returns
| Name | Type | Description |
|---|---|---|
| | list[TreeNode] | list of dicts with keys: node_id, parent_id, depth, num_items, is_leaf, batch_index. Cached after first call. |
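Since the result is a flat list of dicts, a visualization usually needs it regrouped first. The sketch below builds a parent-to-children map and counts leaves per depth using only the documented keys. The name summarize_tree is hypothetical, and how the root's parent_id is encoded (for example None or -1) is not specified, so the root simply appears under whatever value it carries.

```python
from collections import Counter, defaultdict


def summarize_tree(nodes):
    """Build a parent -> children map and count leaves per depth."""
    children = defaultdict(list)
    leaves_per_depth = Counter()
    for node in nodes:
        children[node["parent_id"]].append(node["node_id"])
        if node["is_leaf"]:
            leaves_per_depth[node["depth"]] += 1
    return dict(children), dict(leaves_per_depth)
```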
search
LazyIndex.search(query, k=10, nprobe=3, return_routing=False)
Search the index for nearest neighbors.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| query | | (dim,) query vector. | required |
| k | | Number of results to return. | 10 |
| nprobe | | Number of leaf probes. Accepts an int (fixed probe count: 1 = single path, >1 = multi-probe), "auto" (adaptive probing with the default AdaptiveProbeConfig), or an AdaptiveProbeConfig (adaptive probing with custom thresholds). | 3 |
| return_routing | | If True, populate result.routing with diagnostics. | False |
Returns
| Name | Type | Description |
|---|---|---|
| | | SearchResult with indices, scores, and fields. Supports backward-compatible unpacking: indices, scores = idx.search(…) |
search_ivf
LazyIndex.search_ivf(query, k=10, nprobe=3)
IVF-style search: rank leaves by centroid similarity, then score.
Instead of LSH tree traversal, computes query similarity to all leaf centroids and probes the top-nprobe leaves. Combines DYF’s lazy loading with FAISS IVF-quality routing.
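The routing step described above, scoring the query against every leaf centroid and keeping the top nprobe, can be sketched in a few lines. This is an illustration rather than the library's code: the similarity is assumed to be a plain dot product, and rank_leaves is a hypothetical name.

```python
import heapq


def rank_leaves(query, centroids, nprobe=3):
    """Return the indices of the nprobe centroids most similar to query.

    Similarity is a plain dot product (an assumption); with L2-normalized
    vectors this equals cosine similarity.
    """
    scores = [
        sum(q * c for q, c in zip(query, centroid)) for centroid in centroids
    ]
    # Leaf indices ordered by descending score, truncated to nprobe.
    return heapq.nlargest(nprobe, range(len(centroids)), key=scores.__getitem__)
```

The cost is one similarity per leaf for every query, in exchange for routing that does not depend on the tree's split decisions.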
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| query | | (dim,) query vector. | required |
| k | | Number of results to return. | 10 |
| nprobe | | Number of leaves to probe. | 3 |
Returns
| Name | Type | Description |
|---|---|---|
| | | SearchResult with indices, scores, and fields. Supports backward-compatible unpacking: indices, scores = idx.search_ivf(…) |
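To see how much the two routing strategies agree on a given query, the results of search and search_ivf can be compared directly. The sketch below relies only on the documented signatures and the two-value unpacking. The helper name routing_overlap is hypothetical, and the overlap it reports is a rough diagnostic, not a recall measurement against ground truth.

```python
def routing_overlap(idx, query, k=10, nprobe=3):
    """Fraction of tree-search results that IVF-style search also returns."""
    tree_ids, _ = idx.search(query, k=k, nprobe=nprobe)
    ivf_ids, _ = idx.search_ivf(query, k=k, nprobe=nprobe)
    tree_set = {int(i) for i in tree_ids}
    ivf_set = {int(i) for i in ivf_ids}
    if not tree_set:
        return 0.0
    return len(tree_set & ivf_set) / len(tree_set)
```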
to_faiss
LazyIndex.to_faiss(pq_subquantizers=None, pq_bits=8)
Export dyf index as a FAISS IVF index.
Uses dyf’s leaf centroids as the coarse quantizer and populates FAISS inverted lists from dyf’s leaf embeddings. This gives dyf’s PCA-LSH partitioning with FAISS’s optimized search.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| pq_subquantizers | | If set, use IndexIVFPQ with this many subquantizers for compression (e.g., 8 or 16). If None, use IndexIVFFlat (no compression). | None |
| pq_bits | | Bits per subquantizer for PQ (default: 8). | 8 |
Returns
| Name | Type | Description |
|---|---|---|
| | | faiss.IndexIVFFlat or faiss.IndexIVFPQ, ready to search. |