agglomerate_tree_leaves
agglomerate_tree_leaves(idx, coords, embeddings, n_groups=50)

Agglomerate DYF tree leaves into ~n_groups using embedding centroids.
Walks the tree to find leaf nodes, computes per-leaf embedding centroids, then uses complete linkage to merge leaves into n_groups agglomerated clusters. After initial assignment, reassigns individual points to their nearest bucket centroid to clean up impure leaves and bad merges.
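The three steps described above (per-leaf centroids, complete-linkage merging, nearest-centroid reassignment) can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: the function name `agglomerate_leaves_sketch` is hypothetical, it takes a precomputed `leaf_ids` array instead of walking the tree through `idx`, it uses Euclidean distance, and the complete linkage is a naive NumPy loop chosen for readability.

```python
import numpy as np


def agglomerate_leaves_sketch(leaf_ids, embeddings, n_groups):
    """Illustrative only: merge leaves into n_groups, then reassign points.

    leaf_ids   -- (N,) int array, leaf id per item (assumed precomputed)
    embeddings -- (N, D) float array
    """
    n_groups = max(1, n_groups)
    leaves = np.unique(leaf_ids)

    # 1. One embedding centroid per leaf.
    centroids = np.stack(
        [embeddings[leaf_ids == leaf].mean(axis=0) for leaf in leaves]
    )

    # 2. Naive complete linkage over leaf centroids: repeatedly merge the two
    #    clusters whose farthest pair of members is closest.
    dist = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    members = {i: [i] for i in range(len(leaves))}
    while len(members) > n_groups:
        active = sorted(members)
        sub = dist[np.ix_(active, active)]
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        i, j = active[a], active[b]
        # Complete-linkage update: distance to the merged cluster is the max.
        merged = np.maximum(dist[i], dist[j])
        dist[i, :] = merged
        dist[:, i] = merged
        dist[i, i] = np.inf
        # Retire cluster j so it can never be selected again.
        dist[j, :] = np.inf
        dist[:, j] = np.inf
        members[i].extend(members.pop(j))

    # Initial labels: every point inherits the group of its leaf.
    leaf_to_group = {}
    for gid, root in enumerate(sorted(members)):
        for m in members[root]:
            leaf_to_group[int(leaves[m])] = gid
    labels = np.array([leaf_to_group[int(l)] for l in leaf_ids], dtype=np.int32)

    # 3. Reassign each point to its nearest group centroid.
    group_centroids = np.stack(
        [embeddings[labels == g].mean(axis=0) for g in range(len(members))]
    )
    d = np.linalg.norm(
        embeddings[:, None, :] - group_centroids[None, :, :], axis=-1
    )
    return d.argmin(axis=1).astype(np.int32)
```

The reassignment pass is what lets a point escape an impure leaf: its final label depends only on which group centroid is closest, not on the leaf it started in. The pairwise distance matrix in step 3 is (N, n_groups, D) in memory, so a real implementation would likely process points in batches.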
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| idx | | An open LazyIndex handle (used to read tree structure and leaf batches). | required |
| coords | | (N, 2-or-3) UMAP coordinates for every item. | required |
| embeddings | | (N, D) embedding matrix (float32). | required |
| n_groups | | Target number of agglomerated buckets (default 50). | 50 |
Returns
Tuple of (point_labels, lsh_names, lsh_label_data, item_leaf_map, tree_structure) ready for multi_level_data.

* point_labels – int32 array (N,) of bucket ids (0-based).
* lsh_names – {cid: "Bucket <cid>"} placeholder names.
* lsh_label_data – list of dicts with centroid x/y/z, size, cid.
* item_leaf_map – int32 array (N,) mapping each item to its tree leaf node_id (before agglomeration).
* tree_structure – raw tree node list from idx.get_tree_structure().

Returns (None, {}, [], None, tree) when the tree has fewer than two leaves.
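Because the function returns `None` placeholders when the tree has fewer than two leaves, callers need to check for that case before using the labels. A minimal sketch of such a check, operating only on the returned tuple, is shown here. The helper name `bucket_sizes` is hypothetical and not part of the library.

```python
import numpy as np


def bucket_sizes(result):
    """Count items per bucket from the returned 5-tuple.

    Returns an empty dict for the degenerate (fewer than two leaves) case.
    """
    point_labels, lsh_names, lsh_label_data, item_leaf_map, tree_structure = result
    if point_labels is None:
        # Fewer than two leaves: nothing was agglomerated.
        return {}
    ids, counts = np.unique(point_labels, return_counts=True)
    # Fall back to the documented placeholder pattern if a name is missing.
    return {
        lsh_names.get(int(c), f"Bucket {int(c)}"): int(n)
        for c, n in zip(ids, counts)
    }
```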