diversify_by_facet
    diversify_by_facet(
        query,
        candidate_indices,
        embeddings,
        bucket_ids,
        k=10,
        similarity_weight=0.0,
    )

Select k items from k different semantic facets (buckets).
Unlike MMR (maximal marginal relevance), which maximizes geometric diversity (“different things”), facet diversification maximizes semantic coverage (“different perspectives”). Each result comes from a different DYF bucket, so the selection covers distinct semantic regions.
Facet diversification is preferred 2:1 over MMR for queries with high topical redundancy (the same concept appearing in multiple results).
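The one-result-per-bucket idea can be illustrated with plain NumPy. The following is a minimal sketch, not the library's implementation: the helper name `pick_one_per_bucket` is hypothetical, it assumes dot-product similarity, and it assumes a greedy pass that keeps the most similar candidate from each bucket not yet seen. How `diversify_by_facet` actually orders buckets or applies `similarity_weight` is not covered here.

```python
import numpy as np


def pick_one_per_bucket(query, candidate_indices, embeddings, bucket_ids, k=10):
    """Hypothetical sketch: keep the most similar candidate from each unseen bucket."""
    candidate_indices = np.asarray(candidate_indices)
    # Score each candidate by dot-product similarity to the query (an assumption).
    sims = embeddings[candidate_indices] @ query
    # Visit candidates from most to least similar.
    order = np.argsort(-sims, kind="stable")
    selected, seen = [], set()
    for pos in order:
        idx = int(candidate_indices[pos])
        bucket = int(bucket_ids[idx])
        if bucket in seen:
            continue  # this facet is already represented
        seen.add(bucket)
        selected.append(idx)
        if len(selected) == k:
            break
    return np.array(selected, dtype=int), len(seen)


# Toy data: six 2-D items spread over three buckets.
embeddings = np.array([
    [1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
    [0.0, 1.0], [0.1, 0.9], [0.5, 0.5],
])
bucket_ids = np.array([0, 0, 0, 1, 1, 2])
query = np.array([1.0, 0.0])
candidates = np.arange(6)

selected, n_buckets = pick_one_per_bucket(
    query, candidates, embeddings, bucket_ids, k=3
)
print(selected, n_buckets)
```

In this toy data, the three most similar items all sit in bucket 0, so a plain top-3 would return one facet three times, while the sketch returns one item from each of the three buckets.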
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| query | np.ndarray | Query embedding vector (d,) | required |
| candidate_indices | np.ndarray | Indices of candidate items to diversify | required |
| embeddings | np.ndarray | Full embedding matrix (n, d) | required |
| bucket_ids | np.ndarray | DYF bucket IDs for all items (from classifier.get_bucket_ids()) | required |
| k | int | Number of diverse results to return | 10 |
| similarity_weight | float | Weight for similarity vs bucket order (0.0 = pure bucket diversity, 1.0 = similarity-weighted selection within buckets) | 0.0 |
Returns
| Name | Type | Description |
|---|---|---|
| | FacetDiverseResult | Selected indices and metadata |
Example
    import numpy as np
    from dyf import DensityClassifier, diversify_by_facet

    # Fit classifier
    clf = DensityClassifier(embedding_dim=dim)
    clf.fit(embeddings)
    bucket_ids = clf.get_bucket_ids()

    # Get initial candidates (e.g., top-100 by similarity)
    sims = query @ embeddings.T
    candidates = np.argsort(sims)[-100:][::-1]

    # Diversify by facet
    result = diversify_by_facet(query, candidates, embeddings, bucket_ids, k=10)
    print(f"Selected {len(result)} items from {result.buckets_covered} buckets")
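To sanity-check a selection, one option is to confirm that no two chosen items share a bucket. The helper below is a hypothetical sketch that is not part of `dyf`. It takes a plain array of selected indices, since how `FacetDiverseResult` exposes its indices is not shown here.

```python
import numpy as np


def buckets_are_distinct(selected, bucket_ids):
    """Hypothetical helper: True if no two selected items share a bucket."""
    chosen = np.asarray(bucket_ids)[np.asarray(selected, dtype=int)]
    return len(np.unique(chosen)) == len(chosen)


# Toy bucket assignment for five items.
bucket_ids = np.array([0, 0, 1, 2, 2])
print(buckets_are_distinct([0, 2, 3], bucket_ids))  # items from buckets 0, 1, 2
print(buckets_are_distinct([0, 1, 3], bucket_ids))  # items 0 and 1 share bucket 0
```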