get_kmeans_init

    get_kmeans_init(
        embeddings,
        nlist,
        use_stable_bridges=True,
        num_stability_seeds=5,
        stability_threshold=3,
        global_num_bits=12,
        seed=42,
        verbose=False,
    )

Get bridge-seeded initial centroids for k-means/IVF index construction.
Using orthogonal bridge points as k-means initialization provides:

- +1% recall improvement over standard k-means++
- 17-24% fewer bucket probes needed for same recall
- Better bucket coherence (neighbors land in same bucket more often)

The returned centroids can be used with sklearn.cluster.KMeans or FAISS IVF.
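
The recall and probe-count figures above depend on the dataset, so it can be worth measuring them on your own embeddings. The following is a minimal NumPy sketch, not part of `dyf`: `recall_at_nprobe` and `min_probes_for_recall` are hypothetical helper names, the toy centroids are random points standing in for the output of `get_kmeans_init`, and buckets are assigned by inner product on normalized vectors.

```python
import numpy as np


def recall_at_nprobe(embeddings, centroids, queries, true_nn, nprobe):
    """Fraction of queries whose true nearest neighbor lies in one of the
    `nprobe` buckets most similar to the query (inner-product similarity)."""
    # Assign every point to its most similar centroid (its bucket).
    bucket_of = np.argmax(embeddings @ centroids.T, axis=1)
    # For each query, take the nprobe most similar centroids.
    probed = np.argsort(-(queries @ centroids.T), axis=1)[:, :nprobe]
    # A query is a hit if the bucket holding its true neighbor was probed.
    hits = (probed == bucket_of[true_nn][:, None]).any(axis=1)
    return float(hits.mean())


def min_probes_for_recall(embeddings, centroids, queries, true_nn, target):
    """Smallest nprobe that reaches `target` recall, or None if none does."""
    for nprobe in range(1, len(centroids) + 1):
        if recall_at_nprobe(embeddings, centroids, queries, true_nn, nprobe) >= target:
            return nprobe
    return None


# Toy data: 500 normalized vectors, 20 centroids, the first 50 points as queries.
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 16)).astype(np.float32)
x /= np.linalg.norm(x, axis=1, keepdims=True)
# Stand-in for get_kmeans_init(x, nlist=20); random points are an assumption.
centroids = x[rng.choice(len(x), size=20, replace=False)]
queries = x[:50]

# Exact nearest neighbor of each query, excluding the query itself.
sims = queries @ x.T
sims[np.arange(50), np.arange(50)] = -np.inf
true_nn = np.argmax(sims, axis=1)

probes_needed = min_probes_for_recall(x, centroids, queries, true_nn, target=0.9)
```

Running the same measurement once with k-means++ centroids and once with bridge-seeded centroids, after fitting k-means from each, gives a like-for-like comparison of probes needed at a fixed recall target.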
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| embeddings | np.ndarray | Normalized embeddings (n, d) | required |
| nlist | int | Number of clusters/centroids (IVF nlist parameter) | required |
| use_stable_bridges | bool | If True, prefer bridges stable across multiple seeds | True |
| num_stability_seeds | int | Number of seeds for stability testing | 5 |
| stability_threshold | int | Minimum seeds a bridge must appear in (if use_stable_bridges) | 3 |
| global_num_bits | int | LSH bits for bridge detection | 12 |
| seed | int | Random seed | 42 |
| verbose | bool | Print progress | False |
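
To illustrate how `num_stability_seeds` and `stability_threshold` interact, here is a small sketch of a seed-voting filter. It only reflects the parameter descriptions in the table above; how `dyf` detects bridges internally is not shown here, and `stable_candidates` and the index lists are hypothetical.

```python
from collections import Counter


def stable_candidates(runs, stability_threshold):
    """Return the indices that appear in at least `stability_threshold` runs."""
    # set(run) ensures an index is counted at most once per seed.
    counts = Counter(idx for run in runs for idx in set(run))
    return sorted(idx for idx, c in counts.items() if c >= stability_threshold)


# Hypothetical bridge indices found under num_stability_seeds=5 different seeds.
runs = [
    [3, 17, 42, 88],
    [3, 17, 51],
    [3, 42, 88, 90],
    [17, 42, 51],
    [3, 17, 42],
]

stable = stable_candidates(runs, stability_threshold=3)
print(stable)  # → [3, 17, 42]
```

Raising the threshold toward `num_stability_seeds` keeps fewer, more reproducible bridges; lowering it keeps more candidates at the cost of seed sensitivity.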
Returns
| Name | Type | Description |
|---|---|---|
| | np.ndarray | Initial centroids array (nlist, d) suitable for k-means init parameter |
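
Before handing the array to a clustering library, it can help to verify that it really has the documented `(nlist, d)` shape. The helper below is a sketch under stated assumptions: `check_kmeans_init` is a hypothetical name, not a `dyf` function, and the checks are generic sanity tests rather than guarantees made by the library.

```python
import numpy as np


def check_kmeans_init(init, embeddings, nlist):
    """Validate an initial centroid array and return it as float32.

    Raises ValueError if `init` is not a usable (nlist, d) array.
    """
    init = np.asarray(init)
    d = embeddings.shape[1]
    if init.shape != (nlist, d):
        raise ValueError(f"expected shape {(nlist, d)}, got {init.shape}")
    if not np.isfinite(init).all():
        raise ValueError("init contains NaN or inf")
    # Duplicate rows would start k-means with identical centroids.
    if len(np.unique(init, axis=0)) < nlist:
        raise ValueError("init contains duplicate centroids")
    # FAISS expects float32, C-contiguous arrays.
    return np.ascontiguousarray(init, dtype=np.float32)
```

The returned array can then be passed as the `init` argument of a k-means implementation that accepts explicit starting centroids.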
Example
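
    import numpy as np
    from dyf import get_kmeans_init
    from sklearn.cluster import KMeans

    # Get bridge-seeded initialization
    init = get_kmeans_init(embeddings, nlist=1000)

    # Use with sklearn (n_init=1 because the initialization is explicit)
    kmeans = KMeans(n_clusters=1000, init=init, n_init=1)
    kmeans.fit(embeddings)

    # Use with FAISS
    import faiss

    dim = embeddings.shape[1]
    quantizer = faiss.IndexFlatIP(dim)
    # FAISS requires float32, C-contiguous input
    quantizer.add(np.ascontiguousarray(kmeans.cluster_centers_, dtype=np.float32))
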
Note
For high-recall requirements (99%+), consider using candidate_indices=np.arange(len(embeddings)) in select_orthogonal_anchors instead, as all-points expansion provides better long-tail coverage.