deduplicate_chunks

deduplicate_chunks(bucket_ids, doc_ids)

Boolean mask keeping one representative per (bucket, doc) pair.

Multi-topic documents retain multiple representatives, one per bucket they touch. A document whose points all fall in a single bucket collapses to one representative.

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| bucket_ids | | Array-like of bucket assignments per point. | required |
| doc_ids | | Array-like of document identifiers per point (same length as bucket_ids). | required |

Returns

| Name | Type | Description |
|------|------|-------------|
| | np.ndarray | Boolean array of length len(bucket_ids). True for representative points. |
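
Example

The behaviour described above can be illustrated with a minimal sketch. It is not the library's implementation: `deduplicate_chunks_sketch` is a hypothetical stand-in, and it assumes that the representative for each (bucket, doc) pair is the first point encountered, which the reference above does not specify.

```python
import numpy as np


def deduplicate_chunks_sketch(bucket_ids, doc_ids):
    """Hypothetical stand-in: mark the first point seen for each (bucket, doc) pair."""
    bucket_ids = np.asarray(bucket_ids)
    doc_ids = np.asarray(doc_ids)
    if bucket_ids.shape != doc_ids.shape:
        raise ValueError("bucket_ids and doc_ids must have the same length")

    mask = np.zeros(len(bucket_ids), dtype=bool)
    seen = set()
    # Assumption: "representative" means first occurrence in input order.
    for i, pair in enumerate(zip(bucket_ids.tolist(), doc_ids.tolist())):
        if pair not in seen:
            seen.add(pair)
            mask[i] = True
    return mask


# Document "a" spans buckets 0 and 1 (multi-topic); "b" and "c" each sit in bucket 2.
bucket_ids = [0, 0, 1, 2, 2, 2]
doc_ids = ["a", "a", "a", "b", "b", "c"]

mask = deduplicate_chunks_sketch(bucket_ids, doc_ids)
print(mask)  # [ True False  True  True False  True]
```

Document "a" keeps two representatives because it touches two buckets, while the two points of document "b" collapse to one. Documents "b" and "c" share bucket 2 but each keeps its own representative, since deduplication is per pair rather than per bucket.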