compute_split_keywords

```python
compute_split_keywords(
    titles,
    tree,
    leaf_batches,
    children_map,
    *,
    max_depth_from_root=3,
    min_child_items=50,
    top_k=10,
    domain_stopwords=None,
    bigram_check=False,
)
```

Compute discriminative TF-IDF keywords for each side of each tree split.
For each internal node (up to max_depth_from_root levels below the root), each child is treated as a "document class", and TF-IDF scores are computed to find the words that best discriminate that child from its siblings.
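The per-child scoring idea can be sketched in plain Python. This is a minimal illustration, not the library's implementation; the helper name `tfidf_top_words`, the whitespace tokenization, and the log-IDF form are all assumptions:

```python
import math
from collections import Counter

def tfidf_top_words(child_titles, top_k=3):
    """Treat each child's concatenated titles as one document and
    score words by TF-IDF; return the top_k words per child."""
    # One term-frequency Counter per child ("document class").
    docs = [Counter(w for t in titles for w in t.lower().split())
            for titles in child_titles]
    n_docs = len(docs)
    # Document frequency: in how many children does each word appear?
    df = Counter()
    for d in docs:
        df.update(d.keys())
    results = []
    for d in docs:
        total = sum(d.values())
        # TF * IDF; words shared by every child get IDF = log(1) = 0,
        # so only discriminative words score highly.
        scores = {w: (c / total) * math.log(n_docs / df[w])
                  for w, c in d.items()}
        results.append(sorted(scores, key=scores.get, reverse=True)[:top_k])
    return results
```

For two children whose titles come from different topics, each child's top words are the ones its sibling lacks.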
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| titles | list[str] | List of title strings, indexed by item index. | required |
| tree | list[TreeNode] | Tree structure from idx.get_tree_structure(). | required |
| leaf_batches | dict[int, np.ndarray] | {node_id: item_indices} for leaf nodes. | required |
| children_map | dict[int, list[int]] | {parent_id: [child_ids]} for internal nodes. | required |
| max_depth_from_root | int | Maximum tree depth to process. | 3 |
| min_child_items | int | Skip children with fewer items than this. | 50 |
| top_k | int | Number of top keywords to return per child. | 10 |
| domain_stopwords | set[str] \| None | Additional stop words to filter. | None |
| bigram_check | bool | Enable PMI-based compound meaning detection. | False |
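The bigram_check option is documented as PMI-based compound detection. A minimal pointwise mutual information scorer over adjacent word pairs might look like the sketch below; the function name, min_count threshold, and tokenization are assumptions, not the library's code:

```python
import math
from collections import Counter

def pmi_bigrams(titles, min_count=2):
    """Score adjacent word pairs by PMI: log(p(w1, w2) / (p(w1) * p(w2))).
    High PMI suggests the pair is a compound (e.g. "new york")."""
    words, pairs = Counter(), Counter()
    for t in titles:
        toks = t.lower().split()
        words.update(toks)
        pairs.update(zip(toks, toks[1:]))
    n_words = sum(words.values())
    n_pairs = sum(pairs.values())
    return {
        (a, b): math.log((c / n_pairs)
                         / ((words[a] / n_words) * (words[b] / n_words)))
        for (a, b), c in pairs.items()
        if c >= min_count  # ignore rare pairs; PMI is noisy at low counts
    }
```

Pairs that co-occur far more often than their component words would by chance get a large positive score.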
Returns
| Name | Type | Description |
|---|---|---|
| dict | dict | Keyword results with two top-level keys, described below. |

- "domain_stopwords": list of domain stop words used.
- "splits": {node_id: split_info}, where each split_info contains:
  - "depth": int.
  - "children": {child_id: {"count": int, "unigrams": [("word", score), …], "bigrams": [("w1_w2", score), …] (only if bigram_check)}}.
  - "bigram_needed": bool (only if bigram_check).
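To show how the returned structure is traversed, here is a hypothetical result value with the documented key layout. All counts, words, and scores are invented for illustration:

```python
# Invented example matching the documented return shape.
result = {
    "domain_stopwords": ["model", "data"],
    "splits": {
        0: {
            "depth": 0,
            "children": {
                1: {"count": 120, "unigrams": [("neural", 0.42), ("vision", 0.31)]},
                2: {"count": 95, "unigrams": [("tax", 0.55), ("finance", 0.29)]},
            },
        },
    },
}

# Walk every split and print each child's discriminative keywords.
lines = []
for node_id, split in result["splits"].items():
    for child_id, info in split["children"].items():
        words = ", ".join(w for w, _ in info["unigrams"])
        lines.append(f"node {node_id} -> child {child_id} "
                     f"({info['count']} items): {words}")
print("\n".join(lines))
```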