assess_text_diversity
assess_text_diversity(
titles,
*,
min_unique_tokens=50,
min_token_ratio=0.01,
min_unique_title_ratio=0.05,
)

Cheaply assess whether a title collection has enough diversity for LLM labeling.
Single O(n) pass: collects unique tokens (via tokenize()) and unique title strings. Returns a report with three signals and a boolean gate.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| titles | list[str] | List of title strings. | required |
| min_unique_tokens | int | Minimum unique token count to be considered diverse. | 50 |
| min_token_ratio | float | Minimum ratio of unique tokens to the total number of items in titles. | 0.01 |
| min_unique_title_ratio | float | Minimum ratio of unique titles to total titles. | 0.05 |
Returns
| Name | Type | Description |
|---|---|---|
| | TextDiversityReport | TextDiversityReport with is_diverse=True if all thresholds pass. |
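
The following is a minimal sketch of how the single pass and the three-threshold gate described above might fit together. It is not the library's implementation. `simple_tokenize`, `DiversitySketch`, its field names other than `is_diverse`, and the use of `>=` for the threshold comparisons are all assumptions made for illustration; the real `tokenize()` and `TextDiversityReport` may differ.

```python
from dataclasses import dataclass


@dataclass
class DiversitySketch:
    # Hypothetical stand-in for TextDiversityReport. Only is_diverse is
    # named in the documentation; the other field names are assumptions.
    unique_tokens: int
    token_ratio: float
    unique_title_ratio: float
    is_diverse: bool


def simple_tokenize(text: str) -> list[str]:
    # Assumption: lowercase whitespace split. The library's tokenize()
    # may normalize or split differently.
    return text.lower().split()


def sketch_assess_text_diversity(
    titles: list[str],
    *,
    min_unique_tokens: int = 50,
    min_token_ratio: float = 0.01,
    min_unique_title_ratio: float = 0.05,
) -> DiversitySketch:
    tokens: set[str] = set()
    unique_titles: set[str] = set()
    for title in titles:  # single pass over the input
        tokens.update(simple_tokenize(title))
        unique_titles.add(title)

    n = len(titles)
    # Guard against division by zero for an empty collection.
    token_ratio = len(tokens) / n if n else 0.0
    title_ratio = len(unique_titles) / n if n else 0.0

    # Assumption: a threshold "passes" when the signal is >= the minimum.
    is_diverse = (
        len(tokens) >= min_unique_tokens
        and token_ratio >= min_token_ratio
        and title_ratio >= min_unique_title_ratio
    )
    return DiversitySketch(len(tokens), token_ratio, title_ratio, is_diverse)


# A highly repetitive collection versus a varied one.
repetitive = ["Untitled document"] * 1000
varied = [f"report{i} topic{i % 7}" for i in range(100)]
print(sketch_assess_text_diversity(repetitive).is_diverse)
print(sketch_assess_text_diversity(varied).is_diverse)
```

In this sketch the repetitive collection yields only two unique tokens and a single unique title, so it fails the gate, while the varied collection clears all three thresholds with the default settings.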