compute_embedding_keywords

compute_embedding_keywords(
    titles,
    embeddings,
    tree,
    leaf_batches,
    children_map,
    hyperplanes,
    *,
    max_depth_from_root=3,
    min_child_items=50,
    top_k=10,
    domain_stopwords=None,
    min_term_count=3,
)

Compute discriminative keywords by projecting term embeddings onto split hyperplanes.

For each internal node, this builds a “term embedding” for every candidate term (the centroid of the embeddings of all items whose titles contain that term), then projects these term embeddings onto the node’s PCA hyperplane to find the terms most aligned with each child’s side of the split.

Drop-in replacement for compute_split_keywords(); the output format is identical.
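The core scoring step described above can be sketched as follows. This is an illustrative, simplified version (the helper name `term_projection_scores`, the whitespace tokenization, and the use of a single hyperplane row are assumptions, not part of the documented API): each term's centroid embedding is projected onto one row of a node's hyperplane matrix, and the signed projection indicates which side of the split the term leans toward.

```python
import numpy as np

def term_projection_scores(titles, embeddings, normal, min_term_count=3):
    """Illustrative sketch: score terms by projecting their centroid embeddings
    onto a single split-hyperplane normal (one row of the node's
    (num_bits, dim) matrix). The sign of the score indicates the side of the
    split a term leans toward; the magnitude indicates how strongly."""
    # Map each lowercase term to the indices of items whose titles contain it.
    term_items = {}
    for i, title in enumerate(titles):
        for term in set(title.lower().split()):
            term_items.setdefault(term, []).append(i)

    scores = {}
    for term, items in term_items.items():
        if len(items) < min_term_count:
            continue  # too rare to be a reliable keyword
        centroid = embeddings[items].mean(axis=0)  # the "term embedding"
        scores[term] = float(centroid @ normal)    # signed alignment with the split
    return scores
```

In the real function this scoring is repeated per internal node and per hyperplane bit, with stop-word filtering applied before ranking the top-k terms per child.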

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `titles` | `list[str]` | List of title strings, indexed by item index. | *required* |
| `embeddings` | `np.ndarray` | `(n, dim)` float32 array of full embeddings. | *required* |
| `tree` | `list[TreeNode]` | Tree structure from `idx.get_tree_structure()`. | *required* |
| `leaf_batches` | `dict[int, np.ndarray]` | `{node_id: item_indices}` for leaf nodes. | *required* |
| `children_map` | `dict[int, list[int]]` | `{parent_id: [child_ids]}` for internal nodes. | *required* |
| `hyperplanes` | `dict[int, np.ndarray]` | `{node_id: (num_bits, dim)}` from `idx.get_split_hyperplanes()`. | *required* |
| `max_depth_from_root` | `int` | Maximum tree depth to process. | `3` |
| `min_child_items` | `int` | Skip children with fewer items than this. | `50` |
| `top_k` | `int` | Number of top keywords to return per child. | `10` |
| `domain_stopwords` | `set[str] \| None` | Additional stop words to filter. | `None` |
| `min_term_count` | `int` | Minimum number of items containing a term for it to be included. | `3` |

Returns

| Name | Type | Description |
|------|------|-------------|
| `dict` | `dict` | Dict with keys `"domain_stopwords"` (list of the domain stop words used) and `"splits"` (`{node_id: {"depth": int, "children": {child_id: {"count": int, "unigrams": [("word", score), …]}}}}`). |
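The returned structure can be consumed like this. All node IDs, counts, and keywords below are illustrative placeholders, not output from a real run:

```python
# Shape of the dict returned by compute_embedding_keywords(...),
# filled with made-up example values.
result = {
    "domain_stopwords": ["product", "item"],
    "splits": {
        7: {
            "depth": 1,
            "children": {
                14: {"count": 120, "unigrams": [("camera", 0.91), ("lens", 0.84)]},
                15: {"count": 98, "unigrams": [("tripod", 0.88)]},
            },
        },
    },
}

# Print the top keywords labeling each side of each split.
for node_id, split in result["splits"].items():
    for child_id, info in split["children"].items():
        top = ", ".join(word for word, _ in info["unigrams"][:3])
        print(f"node {node_id} -> child {child_id} ({info['count']} items): {top}")
```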