compute_split_keywords

compute_split_keywords(
    titles,
    tree,
    leaf_batches,
    children_map,
    *,
    max_depth_from_root=3,
    min_child_items=50,
    top_k=10,
    domain_stopwords=None,
    bigram_check=False,
)

Compute discriminative TF-IDF keywords for each side of each tree split.

For each internal node within max_depth_from_root of the root, every child is treated as a separate “document class”, and TF-IDF scores over the titles of that child’s items identify the words that best distinguish it from its siblings.
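The per-split scoring can be sketched with plain TF-IDF, treating each child as one long document built from its titles. The function and names below are illustrative, not this library’s internals, and the IDF smoothing is one reasonable choice among several:

```python
import math
import re
from collections import Counter

def discriminative_keywords(class_docs, top_k=3, stopwords=frozenset()):
    """Score words by TF-IDF where each class is one 'document':
    high scores mark words frequent in one class but rare in the others."""
    # Tokenize each class's concatenated titles into lowercase words.
    tokens = {
        cid: [w for w in re.findall(r"[a-z]+", " ".join(docs).lower())
              if w not in stopwords]
        for cid, docs in class_docs.items()
    }
    n_classes = len(tokens)
    # Document frequency: in how many classes does each word appear?
    df = Counter()
    for words in tokens.values():
        df.update(set(words))
    results = {}
    for cid, words in tokens.items():
        counts = Counter(words)
        total = max(len(words), 1)
        # Smoothed TF-IDF: words shared by every class get low weight.
        scored = {
            w: (c / total) * math.log((1 + n_classes) / (1 + df[w]) + 1)
            for w, c in counts.items()
        }
        results[cid] = sorted(scored.items(), key=lambda kv: -kv[1])[:top_k]
    return results
```

For example, splitting titles about machine learning from titles about history surfaces “history” as the top discriminative word for the second class, since it occurs twice there and never in the first.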

Parameters

Name                 Type                   Description                                    Default
titles               list[str]              List of title strings, indexed by item index.  required
tree                 list[TreeNode]         Tree structure from idx.get_tree_structure().  required
leaf_batches         dict[int, np.ndarray]  {node_id: item_indices} for leaf nodes.        required
children_map         dict[int, list[int]]   {parent_id: [child_ids]} for internal nodes.   required
max_depth_from_root  int                    Maximum tree depth to process.                 3
min_child_items      int                    Skip children with fewer items than this.      50
top_k                int                    Number of top keywords to return per child.    10
domain_stopwords     set[str] | None        Additional stop words to filter.               None
bigram_check         bool                   Enable PMI-based compound-meaning detection.   False

Returns

Name  Type  Description
dict  dict  Keyword results, structured as shown below.

Structure of the returned dict:

    {
        "domain_stopwords": [...],            # domain stop words used
        "splits": {
            node_id: {
                "depth": int,
                "children": {
                    child_id: {
                        "count": int,
                        "unigrams": [("word", score), ...],
                        "bigrams": [("w1_w2", score), ...],  # if bigram_check
                    },
                },
                "bigram_needed": bool,        # if bigram_check
            },
        },
    }
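A small walk over a return value of this shape (the node IDs, counts, and words below are invented for illustration, with bigram_check left at its default of False):

```python
# Hypothetical result matching the documented schema.
result = {
    "domain_stopwords": ["the", "a", "of"],
    "splits": {
        0: {
            "depth": 1,
            "children": {
                1: {"count": 120,
                    "unigrams": [("neural", 0.31), ("training", 0.22)]},
                2: {"count": 95,
                    "unigrams": [("history", 0.40), ("roman", 0.18)]},
            },
        },
    },
}

# Report the top keyword for each child of each processed split.
for node_id, split in result["splits"].items():
    for child_id, info in split["children"].items():
        word, score = info["unigrams"][0]
        print(f"split {node_id} -> child {child_id} "
              f"({info['count']} items): {word} ({score:.2f})")
```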