assess_text_diversity

assess_text_diversity(
    titles,
    *,
    min_unique_tokens=50,
    min_token_ratio=0.01,
    min_unique_title_ratio=0.05,
)

Cheaply assess whether a title collection has enough diversity for LLM labeling.

Runs in a single O(n) pass over the titles, collecting the set of unique tokens (via tokenize()) and the set of unique title strings. Returns a report with three signals, one per threshold below, and a boolean gate (is_diverse).
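The single pass described above can be sketched roughly as follows. This is an illustrative sketch, not the library's implementation: `simple_tokenize` is a stand-in for the real tokenize(), `collect_signals` is a hypothetical name, and returning zeros for an empty input is an assumption.

```python
from typing import Callable, Iterable


def simple_tokenize(text: str) -> list[str]:
    """Stand-in tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def collect_signals(
    titles: Iterable[str],
    tokenizer: Callable[[str], list[str]] = simple_tokenize,
) -> tuple[int, float, float]:
    """Return (unique token count, unique-token ratio, unique-title ratio)."""
    unique_tokens: set[str] = set()
    unique_titles: set[str] = set()
    total = 0

    # One pass: each title is tokenized once and added to both sets.
    for title in titles:
        total += 1
        unique_titles.add(title)
        unique_tokens.update(tokenizer(title))

    if total == 0:
        # Assumption: an empty collection yields zeroed signals.
        return 0, 0.0, 0.0

    return (
        len(unique_tokens),
        len(unique_tokens) / total,
        len(unique_titles) / total,
    )


signals = collect_signals(["Red shoes", "Blue shoes", "Red shoes"])
```

Here the three titles contain three distinct tokens and two distinct title strings, so the ratios are 3/3 and 2/3. Set insertion is amortized constant time, which keeps the pass linear in the total amount of text.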

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| titles | list[str] | List of title strings. | required |
| min_unique_tokens | int | Minimum unique token count to be considered diverse. | 50 |
| min_token_ratio | float | Minimum ratio of unique tokens to total items. | 0.01 |
| min_unique_title_ratio | float | Minimum ratio of unique titles to total titles. | 0.05 |

Returns

| Name | Type | Description |
| --- | --- | --- |
| | TextDiversityReport | TextDiversityReport with is_diverse=True if all thresholds pass. |
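To make the gate concrete, the sketch below shows how three thresholds could combine into a single boolean. It is a hedged illustration only: `DiversityCheck` and `apply_gate` are hypothetical names, the fields other than is_diverse are assumed, and treating each threshold as an inclusive minimum (`>=`) is an assumption based on the word "minimum" in the parameter descriptions.

```python
from dataclasses import dataclass


@dataclass
class DiversityCheck:
    # Hypothetical container; only is_diverse is named in the documentation.
    unique_tokens: int
    token_ratio: float
    unique_title_ratio: float
    is_diverse: bool


def apply_gate(
    unique_tokens: int,
    unique_titles: int,
    total_titles: int,
    *,
    min_unique_tokens: int = 50,
    min_token_ratio: float = 0.01,
    min_unique_title_ratio: float = 0.05,
) -> DiversityCheck:
    """Combine the three signals into one pass/fail decision."""
    # Assumption: an empty collection gets zero ratios and fails the gate.
    token_ratio = unique_tokens / total_titles if total_titles else 0.0
    title_ratio = unique_titles / total_titles if total_titles else 0.0

    # All three minimums must be met (assumed inclusive).
    is_diverse = (
        unique_tokens >= min_unique_tokens
        and token_ratio >= min_token_ratio
        and title_ratio >= min_unique_title_ratio
    )
    return DiversityCheck(unique_tokens, token_ratio, title_ratio, is_diverse)


# 1000 titles, 120 unique tokens, but only 30 distinct title strings.
check = apply_gate(unique_tokens=120, unique_titles=30, total_titles=1000)
```

With the default thresholds, this example meets the token count (120 against 50) and the token ratio (0.12 against 0.01) but misses the unique-title ratio (0.03 against 0.05), so the gate is closed. A collection dominated by repeated titles therefore fails even when its vocabulary looks large.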