compute_embedding_keywords
```python
compute_embedding_keywords(
    titles,
    embeddings,
    tree,
    leaf_batches,
    children_map,
    hyperplanes,
    *,
    max_depth_from_root=3,
    min_child_items=50,
    top_k=10,
    domain_stopwords=None,
    min_term_count=3,
)
```

Compute discriminative keywords by projecting term embeddings onto split hyperplanes.
For each internal node, the function builds a “term embedding” for every candidate term (the centroid of all item embeddings whose titles contain that term), then projects these term embeddings onto the node’s PCA hyperplane to find the terms most aligned with each child’s side of the split.
Drop-in replacement for compute_split_keywords(); the output format is identical.
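The core idea can be sketched for a single split. This is a minimal illustration, not the actual implementation: `term_projection_scores` is a hypothetical helper name, terms are naive whitespace tokens, and `hyperplane` is assumed to be a single `(dim,)` normal vector rather than the `(num_bits, dim)` stack the real function receives per node.

```python
import numpy as np

def term_projection_scores(titles, embeddings, hyperplane, min_term_count=3):
    """Score each term by projecting its centroid embedding onto one hyperplane."""
    # Collect item indices per lowercase whitespace-separated term.
    term_items = {}
    for i, title in enumerate(titles):
        for term in set(title.lower().split()):
            term_items.setdefault(term, []).append(i)

    scores = {}
    for term, idxs in term_items.items():
        if len(idxs) < min_term_count:
            continue  # term occurs in too few items to trust
        centroid = embeddings[idxs].mean(axis=0)     # the term's embedding
        scores[term] = float(centroid @ hyperplane)  # signed side of the split
    return scores
```

The sign of each score indicates which side of the hyperplane a term's centroid falls on, so sorting by score (ascending vs. descending) yields the terms most aligned with each child.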
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| titles | list[str] | List of title strings, indexed by item index. | required |
| embeddings | np.ndarray | (n, dim) float32 array of full embeddings. | required |
| tree | list[TreeNode] | Tree structure from idx.get_tree_structure(). | required |
| leaf_batches | dict[int, np.ndarray] | {node_id: item_indices} for leaf nodes. | required |
| children_map | dict[int, list[int]] | {parent_id: [child_ids]} for internal nodes. | required |
| hyperplanes | dict[int, np.ndarray] | {node_id: (num_bits, dim)} from idx.get_split_hyperplanes(). | required |
| max_depth_from_root | int | Maximum tree depth to process. | 3 |
| min_child_items | int | Skip children with fewer items than this. | 50 |
| top_k | int | Number of top keywords to return per child. | 10 |
| domain_stopwords | set[str] \| None | Additional stop words to filter. | None |
| min_term_count | int | Minimum number of items containing a term to include it. | 3 |
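To make the expected shapes concrete, here is a tiny hand-built set of inputs. These are illustrative values only; in practice they come from the index (`idx.get_tree_structure()`, `idx.get_split_hyperplanes()`), and the `tree` argument is omitted here since its node type isn't shown in this reference.

```python
import numpy as np

n, dim = 4, 8  # four items, 8-dimensional embeddings
titles = ["deep learning for vision", "image segmentation nets",
          "protein folding models", "rna structure prediction"]
embeddings = np.random.rand(n, dim).astype(np.float32)        # (n, dim) float32

leaf_batches = {1: np.array([0, 1]), 2: np.array([2, 3])}     # leaf id -> item indices
children_map = {0: [1, 2]}                                    # internal id -> child ids
hyperplanes = {0: np.random.rand(1, dim).astype(np.float32)}  # id -> (num_bits, dim)
```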
Returns
| Name | Type | Description |
|---|---|---|
| dict | dict | Result dictionary; see the key breakdown below. |

- `"domain_stopwords"`: list of the domain stop words that were used.
- `"splits"`: `{node_id: {...}}`, one entry per processed internal node, where each value contains:
    - `"depth"`: depth of the node in the tree.
    - `"children"`: `{child_id: {"count": int, "unigrams": [("word", score), ...]}}`.
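The returned structure can be traversed like this. The `result` literal below is hand-made sample data in the documented shape, not real output:

```python
# Hand-made sample in the documented output shape (not real output).
result = {
    "domain_stopwords": ["paper", "study"],
    "splits": {
        0: {
            "depth": 1,
            "children": {
                1: {"count": 120, "unigrams": [("neural", 0.91), ("vision", 0.80)]},
                2: {"count": 95, "unigrams": [("protein", 0.88), ("folding", 0.74)]},
            },
        }
    },
}

# Summarize the top keywords for each child of each processed split.
summary = []
for node_id, split in result["splits"].items():
    for child_id, info in split["children"].items():
        top = ", ".join(word for word, _ in info["unigrams"][:3])
        summary.append(f"node {node_id} -> child {child_id} ({info['count']} items): {top}")

print("\n".join(summary))
```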