compute_split_keywords

```python
compute_split_keywords(
    titles,
    tree,
    leaf_batches,
    children_map,
    *,
    max_depth_from_root=3,
    min_child_items=50,
    top_k=10,
    domain_stopwords=None,
    bigram_check=False,
)
```

Compute discriminative TF-IDF keywords for each side of each tree split.
For each internal node (up to max_depth_from_root levels below the root), each child is treated as a "document class", and TF-IDF scores are computed to find the words that best discriminate that child from its siblings.
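The per-child scoring idea can be sketched in plain Python. This is a minimal illustration, not the library's implementation; the helper name `tfidf_top_words`, the whitespace tokenization, and the log-IDF form are all assumptions:

```python
import math
from collections import Counter

def tfidf_top_words(child_titles, top_k=3):
    """Treat each child's concatenated titles as one document and
    score words by TF-IDF; return the top_k words per child."""
    # One term-frequency Counter per child ("document class").
    docs = [Counter(w for t in titles for w in t.lower().split())
            for titles in child_titles]
    n_docs = len(docs)
    # Document frequency: in how many children does each word appear?
    df = Counter()
    for d in docs:
        df.update(d.keys())
    results = []
    for d in docs:
        total = sum(d.values())
        # TF * IDF; words shared by every child get IDF = log(1) = 0,
        # so only discriminative words score highly.
        scores = {w: (c / total) * math.log(n_docs / df[w])
                  for w, c in d.items()}
        results.append(sorted(scores, key=scores.get, reverse=True)[:top_k])
    return results
```

For two children whose titles come from different topics, each child's top words are the ones its sibling lacks.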
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| titles | list[str] | List of title strings, indexed by item index. | required |
| tree | list[TreeNode] | Tree structure from idx.get_tree_structure(). | required |
| leaf_batches | dict[int, np.ndarray] | {node_id: item_indices} for leaf nodes. | required |
| children_map | dict[int, list[int]] | {parent_id: [child_ids]} for internal nodes. | required |
| max_depth_from_root | int | Maximum tree depth to process. | 3 |
| min_child_items | int | Skip children with fewer items than this. | 50 |
| top_k | int | Number of top keywords to return per child. | 10 |
| domain_stopwords | set[str] \| None | Additional stop words to filter. | None |
| bigram_check | bool | Enable PMI-based compound meaning detection. | False |
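The bigram_check option is documented as PMI-based compound detection. A minimal pointwise mutual information scorer over adjacent word pairs might look like the sketch below; the function name, min_count threshold, and tokenization are assumptions, not the library's code:

```python
import math
from collections import Counter

def pmi_bigrams(titles, min_count=2):
    """Score adjacent word pairs by PMI: log(p(w1, w2) / (p(w1) * p(w2))).
    High PMI suggests the pair is a compound (e.g. "new york")."""
    words, pairs = Counter(), Counter()
    for t in titles:
        toks = t.lower().split()
        words.update(toks)
        pairs.update(zip(toks, toks[1:]))
    n_words = sum(words.values())
    n_pairs = sum(pairs.values())
    return {
        (a, b): math.log((c / n_pairs)
                         / ((words[a] / n_words) * (words[b] / n_words)))
        for (a, b), c in pairs.items()
        if c >= min_count  # ignore rare pairs; PMI is noisy at low counts
    }
```

Pairs that co-occur far more often than their component words would by chance get a large positive score.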
Returns
| Name | Type | Description |
|---|---|---|
| dict | dict | Keyword results with two top-level keys, described below. |

- "domain_stopwords": list of domain stop words used.
- "splits": {node_id: split_info}, where each split_info contains:
  - "depth": int.
  - "children": {child_id: {"count": int, "unigrams": [("word", score), …], "bigrams": [("w1_w2", score), …] (only if bigram_check)}}.
  - "bigram_needed": bool (only if bigram_check).
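To show how the returned structure is traversed, here is a hypothetical result value with the documented key layout. All counts, words, and scores are invented for illustration:

```python
# Invented example matching the documented return shape.
result = {
    "domain_stopwords": ["model", "data"],
    "splits": {
        0: {
            "depth": 0,
            "children": {
                1: {"count": 120, "unigrams": [("neural", 0.42), ("vision", 0.31)]},
                2: {"count": 95, "unigrams": [("tax", 0.55), ("finance", 0.29)]},
            },
        },
    },
}

# Walk every split and print each child's discriminative keywords.
lines = []
for node_id, split in result["splits"].items():
    for child_id, info in split["children"].items():
        words = ", ".join(w for w, _ in info["unigrams"])
        lines.append(f"node {node_id} -> child {child_id} "
                     f"({info['count']} items): {words}")
print("\n".join(lines))
```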