refine_clusters

refine_clusters(
    labels,
    embeddings,
    min_coherence=None,
    min_cluster_size=None,
    num_bits=6,
    seed_offset=2000,
)

Post-processing safety net: eject periphery from incoherent clusters.

For each cluster below the coherence threshold, keeps core points (above median similarity to centroid) and ejects periphery into a shared pool. The pool is re-split with a rotated LSH seed. Small resulting clusters are iteratively merged with their nearest neighbor until they reach min_cluster_size, then any remaining runts are absorbed into the nearest large cluster.

Parameters

Name Type Description Default
labels (n,) cluster label array. required
embeddings (n, d) array of embedding vectors. required
min_coherence Coherence threshold. If None, uses 25th percentile. None
min_cluster_size Minimum viable cluster size. If None, uses n_points / n_clusters / 3 (one-third of the expected median). None
num_bits LSH bits for re-splitting the ejected pool. 6
seed_offset Seed for the re-split DensityClassifier. 2000

Returns

Name Type Description
np.ndarray of shape (n,) with refined cluster labels.