compute_embedding_keywords

compute_embedding_keywords(
    titles,
    embeddings,
    tree,
    leaf_batches,
    children_map,
    hyperplanes,
    *,
    max_depth_from_root=3,
    min_child_items=50,
    top_k=10,
    domain_stopwords=None,
    min_term_count=3,
)

Compute discriminative keywords by projecting term embeddings onto split hyperplanes.

For each internal node, this builds a “term embedding” for every candidate term (the centroid of the embeddings of all items whose titles contain that term), then projects these term embeddings onto the node’s PCA hyperplane to find the terms most aligned with each child’s side of the split.

Drop-in replacement for compute_split_keywords(); the output format is identical.
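The core scoring step described above can be sketched as follows. This is an illustrative, simplified version (the helper name `term_projection_scores`, the whitespace tokenization, and the use of a single hyperplane row are assumptions, not part of the documented API): each term's centroid embedding is projected onto one row of a node's hyperplane matrix, and the signed projection indicates which side of the split the term leans toward.

```python
import numpy as np

def term_projection_scores(titles, embeddings, normal, min_term_count=3):
    """Illustrative sketch: score terms by projecting their centroid embeddings
    onto a single split-hyperplane normal (one row of the node's
    (num_bits, dim) matrix). The sign of the score indicates the side of the
    split a term leans toward; the magnitude indicates how strongly."""
    # Map each lowercase term to the indices of items whose titles contain it.
    term_items = {}
    for i, title in enumerate(titles):
        for term in set(title.lower().split()):
            term_items.setdefault(term, []).append(i)

    scores = {}
    for term, items in term_items.items():
        if len(items) < min_term_count:
            continue  # too rare to be a reliable keyword
        centroid = embeddings[items].mean(axis=0)  # the "term embedding"
        scores[term] = float(centroid @ normal)    # signed alignment with the split
    return scores
```

In the real function this scoring is repeated per internal node and per hyperplane bit, with stop-word filtering applied before ranking the top-k terms per child.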

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `titles` | `list[str]` | List of title strings, indexed by item index. | *required* |
| `embeddings` | `np.ndarray` | `(n, dim)` float32 array of full embeddings. | *required* |
| `tree` | `list[TreeNode]` | Tree structure from `idx.get_tree_structure()`. | *required* |
| `leaf_batches` | `dict[int, np.ndarray]` | `{node_id: item_indices}` for leaf nodes. | *required* |
| `children_map` | `dict[int, list[int]]` | `{parent_id: [child_ids]}` for internal nodes. | *required* |
| `hyperplanes` | `dict[int, np.ndarray]` | `{node_id: (num_bits, dim)}` from `idx.get_split_hyperplanes()`. | *required* |
| `max_depth_from_root` | `int` | Maximum tree depth to process. | `3` |
| `min_child_items` | `int` | Skip children with fewer items than this. | `50` |
| `top_k` | `int` | Number of top keywords to return per child. | `10` |
| `domain_stopwords` | `set[str] \| None` | Additional stop words to filter. | `None` |
| `min_term_count` | `int` | Minimum number of items containing a term for it to be included. | `3` |

Returns

| Name | Type | Description |
|------|------|-------------|
| `dict` | `dict` | Dict with keys `"domain_stopwords"` (list of the domain stop words used) and `"splits"` (`{node_id: {"depth": int, "children": {child_id: {"count": int, "unigrams": [("word", score), …]}}}}`). |
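The returned structure can be consumed like this. All node IDs, counts, and keywords below are illustrative placeholders, not output from a real run:

```python
# Shape of the dict returned by compute_embedding_keywords(...),
# filled with made-up example values.
result = {
    "domain_stopwords": ["product", "item"],
    "splits": {
        7: {
            "depth": 1,
            "children": {
                14: {"count": 120, "unigrams": [("camera", 0.91), ("lens", 0.84)]},
                15: {"count": 98, "unigrams": [("tripod", 0.88)]},
            },
        },
    },
}

# Print the top keywords labeling each side of each split.
for node_id, split in result["splits"].items():
    for child_id, info in split["children"].items():
        top = ", ".join(word for word, _ in info["unigrams"][:3])
        print(f"node {node_id} -> child {child_id} ({info['count']} items): {top}")
```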