diversify_by_facet

diversify_by_facet(
    query,
    candidate_indices,
    embeddings,
    bucket_ids,
    k=10,
    similarity_weight=0.0,
)

Select k items from k different semantic facets (buckets).

Unlike MMR (Maximal Marginal Relevance), which maximizes geometric diversity (“different things”), facet diversification maximizes semantic coverage (“different perspectives”). Each result comes from a different DYF bucket, so the results cover distinct semantic regions.

This is preferred 2:1 over MMR for queries with high topical redundancy (same concept appearing in multiple results).
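The one-result-per-bucket idea can be illustrated with a minimal NumPy sketch. This is not the library's implementation: `facet_select_sketch` is a hypothetical name, the toy data is made up, and it assumes dot-product similarity with a greedy pass that keeps the most similar candidate from each bucket.

```python
import numpy as np

def facet_select_sketch(query, candidate_indices, embeddings, bucket_ids, k=10):
    """Hypothetical sketch: keep the most similar candidate from each bucket."""
    candidate_indices = np.asarray(candidate_indices)
    # Dot-product similarity of each candidate to the query (an assumption).
    sims = embeddings[candidate_indices] @ query
    # Visit candidates from most to least similar.
    order = np.argsort(-sims, kind="stable")
    selected, seen = [], set()
    for pos in order:
        idx = int(candidate_indices[pos])
        bucket = int(bucket_ids[idx])
        if bucket in seen:
            continue  # this facet is already represented
        seen.add(bucket)
        selected.append(idx)
        if len(selected) == k:
            break
    return selected

# Toy data: six 2-D items spread over three buckets.
embeddings = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.2],
    [0.0, 1.0],
    [0.1, 0.9],
    [0.5, 0.5],
])
bucket_ids = np.array([0, 0, 0, 1, 1, 2])
query = np.array([1.0, 0.0])
candidates = np.arange(6)

picked = facet_select_sketch(query, candidates, embeddings, bucket_ids, k=3)
print(picked)  # [0, 5, 4]
```

A plain top-3 by similarity would return items 0, 1, and 2, all from bucket 0. The sketch instead returns one item from each of the three buckets.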

Parameters

| Name | Type | Description | Default |
|---|---|---|---|
| query | np.ndarray | Query embedding vector (d,) | required |
| candidate_indices | np.ndarray | Indices of candidate items to diversify | required |
| embeddings | np.ndarray | Full embedding matrix (n, d) | required |
| bucket_ids | np.ndarray | DYF bucket IDs for all items (from classifier.get_bucket_ids()) | required |
| k | int | Number of diverse results to return | 10 |
| similarity_weight | float | Weight for similarity vs bucket order (0.0 = pure bucket diversity, 1.0 = similarity-weighted selection within buckets) | 0.0 |
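Because `bucket_ids` covers all items while `candidate_indices` indexes into the full embedding matrix, it can help to check that the arrays line up before calling the function. The helper below is a hypothetical sketch, not part of dyf. It checks only the shapes stated in the table above, and the random arrays are stand-ins for real embeddings and bucket IDs.

```python
import numpy as np

def check_facet_inputs(query, candidate_indices, embeddings, bucket_ids):
    """Hypothetical helper (not part of dyf): validate the documented shapes."""
    if embeddings.ndim != 2:
        raise ValueError("embeddings must have shape (n, d)")
    n, d = embeddings.shape
    if query.shape != (d,):
        raise ValueError(f"query must have shape ({d},), got {query.shape}")
    if bucket_ids.shape != (n,):
        raise ValueError(f"bucket_ids must have shape ({n},), got {bucket_ids.shape}")
    if candidate_indices.ndim != 1:
        raise ValueError("candidate_indices must be one-dimensional")
    if candidate_indices.size and (
        candidate_indices.min() < 0 or candidate_indices.max() >= n
    ):
        raise ValueError("candidate_indices must index rows of embeddings")
    return n, d

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 8))
query = rng.normal(size=8)
bucket_ids = rng.integers(0, 5, size=50)  # stand-in for classifier.get_bucket_ids()
candidates = np.argsort(embeddings @ query)[-10:][::-1]  # top-10 by similarity

shape = check_facet_inputs(query, candidates, embeddings, bucket_ids)
print(shape)  # (50, 8)
```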

Returns

| Name | Type | Description |
|---|---|---|
| | FacetDiverseResult | FacetDiverseResult with selected indices and metadata |

Example

import numpy as np
from dyf import DensityClassifier, diversify_by_facet

# Assumes `embeddings` (n, d) and `query` (d,) are existing NumPy arrays
dim = embeddings.shape[1]

# Fit classifier
clf = DensityClassifier(embedding_dim=dim)
clf.fit(embeddings)
bucket_ids = clf.get_bucket_ids()

# Get initial candidates (e.g., top-100 by similarity)
sims = query @ embeddings.T
candidates = np.argsort(sims)[-100:][::-1]

# Diversify by facet
result = diversify_by_facet(query, candidates, embeddings, bucket_ids, k=10)
print(f"Selected {len(result)} items from {result.buckets_covered} buckets")