scipy.cluster — Clustering¶
The scipy_cluster module wraps scipy.cluster.hierarchy and scipy.cluster.vq as Clausal predicates. It covers hierarchical clustering (linkage, flat cluster assignment, dendrogram, cophenetic analysis) and vector quantisation (k-means).
Import¶
-import_from(scipy_cluster, [Linkage, FlatCluster, Dendrogram,
Cophenet, Inconsistent,
KMeans2, KMeans, VectorQuantize, Whiten,
ResultGet])
Or via the canonical py.* path:
Tier¶
All predicates are Tier 2 — they return result dicts or NumPy arrays. Use ResultGet to access named fields from dict results.
Naming conventions¶
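The Cluster prefix is dropped since these predicates live in the cluster module. Abbreviated scipy names are expanded unless the abbreviation is itself the commonly used name (for example, fcluster becomes FlatCluster and vq becomes VectorQuantize, while Cophenet and KMeans are kept):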
| scipy function | Clausal predicate |
|---|---|
| hierarchy.linkage | Linkage |
| hierarchy.fcluster | FlatCluster |
| hierarchy.dendrogram | Dendrogram |
| hierarchy.cophenet | Cophenet |
| hierarchy.inconsistent | Inconsistent |
| vq.kmeans2 | KMeans2 |
| vq.kmeans | KMeans |
| vq.vq | VectorQuantize |
| vq.whiten | Whiten |
Predicate catalogue¶
Hierarchical clustering¶
Linkage(Y, RESULT)¶
Linkage(Y, METHOD, RESULT)¶
Linkage(Y, METHOD, METRIC, RESULT)¶
Linkage(Y, METHOD, METRIC, OPTIMAL_ORDERING, RESULT)¶
Compute a hierarchical clustering linkage matrix from observation matrix or condensed distance matrix Y.
- METHOD: 'single', 'complete', 'average', 'weighted', 'centroid', 'median', 'ward' (default 'single')
- METRIC: distance metric string (default 'euclidean')
- OPTIMAL_ORDERING: reorder the linkage matrix so that the distance between successive leaves is minimal (default False)
- RESULT: ndarray of shape (n-1, 4), the linkage matrix Z
FlatCluster(Z, T, RESULT)¶
FlatCluster(Z, T, CRITERION, RESULT)¶
FlatCluster(Z, T, CRITERION, DEPTH, RESULT)¶
Form flat clusters from a hierarchical clustering linkage matrix Z.
- T: threshold or number of clusters (depending on CRITERION)
- CRITERION: 'inconsistent', 'distance', 'maxclust', 'monocrit', 'maxclust_monocrit' (default 'inconsistent')
- DEPTH: depth for inconsistency calculation (default 2)
- RESULT: ndarray of shape (n,), the integer cluster assignment for each observation
# skip
Linkage(DATA, 'ward', Z),
FlatCluster(Z, 3, 'maxclust', LABELS),
% LABELS[i] is the cluster number for observation i
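For readers who want to check results against SciPy directly, the following is a minimal Python sketch of the calls that Linkage and FlatCluster presumably forward to. The toy data is made up for illustration, and the exact way the predicates pass their arguments through is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: three well-separated groups of 2-D points.
data = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1],
    [10.0, 0.0], [10.1, 0.2], [10.2, 0.1],
])

# Ward linkage on the raw observations; Z has shape (n-1, 4).
Z = linkage(data, method="ward")

# Cut the tree into at most three flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")

print(Z.shape)
print(labels)
```

SciPy numbers flat clusters from 1, so the smallest value in labels is 1 rather than 0.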
Dendrogram(Z, RESULT)¶
Dendrogram(Z, TRUNCATE_MODE, RESULT)¶
Compute dendrogram layout data from linkage matrix Z. Always uses no_plot=True to avoid a matplotlib dependency.
- TRUNCATE_MODE: None, 'lastp', or 'level'
- RESULT: dict with keys icoord, dcoord, ivl, leaves, color_list
# skip
Linkage(DATA, 'ward', Z),
Dendrogram(Z, D),
ResultGet(D, 'leaves', LEAVES),
% LEAVES is the list of leaf node indices
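The same layout data can be inspected from plain SciPy. This is a sketch under the assumption that Dendrogram simply returns the dict produced by scipy.cluster.hierarchy.dendrogram with no_plot=True; the five sample points are invented.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical sample: two tight pairs and one outlier.
data = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0], [10.0, 0.5]])

Z = linkage(data, method="ward")

# no_plot=True skips rendering, so matplotlib is not needed.
layout = dendrogram(Z, no_plot=True)

leaves = layout["leaves"]   # observation indices in left-to-right leaf order
links = layout["icoord"]    # one list of four x-coordinates per merge
print(leaves, len(links))
```

Recent SciPy versions may add further keys (such as leaves_color_list) beyond the five listed above.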
Cophenet(Z, RESULT)¶
Cophenet(Z, Y, RESULT)¶
Compute cophenetic distances from linkage matrix Z.
- Without Y: RESULT is the condensed cophenetic distance array (ndarray of length n*(n-1)/2)
- With Y (condensed pairwise distances): RESULT is dict {'c': float, 'd': ndarray} where c is the cophenetic correlation coefficient and d is the cophenetic distance array
# skip
Linkage(DATA, 'ward', Z),
Y is ++(pdist(DATA)),
Cophenet(Z, Y, RESULT),
ResultGet(RESULT, 'c', C),
% C is the cophenetic correlation coefficient (1.0 = perfect)
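Both call forms can be sketched in Python as follows. SciPy itself returns a tuple (c, d) when Y is given; the mapping of that tuple to the 'c' and 'd' dict keys is taken from the description above, and the data is illustrative only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Hypothetical sample of six 2-D observations.
data = np.array([
    [0.0, 0.0], [0.5, 0.1], [0.2, 0.6],
    [4.0, 4.0], [4.3, 4.4], [9.0, 1.0],
])

Y = pdist(data)                   # condensed pairwise distances, length n*(n-1)/2
Z = linkage(data, method="ward")

# With Y: correlation coefficient plus the cophenetic distances.
c, d = cophenet(Z, Y)

# Without Y: only the condensed cophenetic distance array.
d_only = cophenet(Z)

print(c, d.shape)
```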
Inconsistent(Z, RESULT)¶
Inconsistent(Z, DEPTH, RESULT)¶
Compute inconsistency statistics for each non-singleton cluster in linkage matrix Z.
- DEPTH: number of levels to consider (default 2)
- RESULT: ndarray of shape (n-1, 4), where each row is [mean, std, count, inconsistency_coefficient]
# skip
Linkage(DATA, 'ward', Z),
Inconsistent(Z, STATS),
% STATS[i, 3] is the inconsistency coefficient for merge i
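A short Python sketch of the underlying SciPy call shows how the four columns can be read. It assumes DEPTH is forwarded as SciPy's d argument; the data is invented.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, inconsistent

# Hypothetical sample: two tight pairs and one distant point.
data = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.5], [20.0, 0.0]])

Z = linkage(data, method="ward")

# d=2 considers each merge together with the merges one level below it.
R = inconsistent(Z, d=2)

mean_height = R[:, 0]   # mean height of the links considered
std_height = R[:, 1]    # standard deviation of those heights
link_count = R[:, 2]    # number of links included in the statistics
coefficient = R[:, 3]   # inconsistency coefficient of each merge
print(R.shape)
```

The first merge always joins two singletons, so its statistics are based on a single link and its coefficient is 0.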
Vector quantisation¶
KMeans2(DATA, K, RESULT)¶
KMeans2(DATA, K, ITERATIONS, RESULT)¶
KMeans2(DATA, K, ITERATIONS, SEED, RESULT)¶
k-means clustering with explicit re-initialisation (scipy.cluster.vq.kmeans2).
- K: number of clusters (integer)
- ITERATIONS: number of iterations (default 10)
- SEED: random seed for reproducibility
- RESULT: dict {'centroid': ndarray shape (K, D), 'label': ndarray shape (N,)}
KMeans2(DATA, 3, RESULT),
ResultGet(RESULT, 'centroid', CENTROIDS),
ResultGet(RESULT, 'label', LABELS),
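The effect of fixing SEED can be sketched with SciPy directly. This assumes that ITERATIONS and SEED are passed to the iter and seed keywords of scipy.cluster.vq.kmeans2 (the seed keyword needs SciPy 1.7 or later); the blob data is synthetic.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Synthetic data: three Gaussian blobs of 20 points each.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 8.0]])
data = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in centers])

# Two runs with the same seed give identical results.
centroid_a, label_a = kmeans2(data, 3, iter=10, seed=0)
centroid_b, label_b = kmeans2(data, 3, iter=10, seed=0)

print(centroid_a.shape, label_a.shape)
print(np.array_equal(label_a, label_b))
```

With the default random initialisation, a centroid can end up with no observations, in which case SciPy emits a warning rather than raising.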
KMeans(OBS, K, RESULT)¶
KMeans(OBS, K, ITERATIONS, RESULT)¶
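Classic k-means (scipy.cluster.vq.kmeans). Each run stops once the change in distortion falls below SciPy's threshold.
- K: number of clusters (integer) or initial codebook (ndarray)
- ITERATIONS: iteration count (default 10). In SciPy the corresponding argument (iter) is the number of complete k-means runs, of which the lowest-distortion codebook is kept; it is ignored when K is an initial codebook
- RESULT: dict {'codebook': ndarray shape (K, D), 'distortion': float}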
# skip
KMeans(DATA, 2, RESULT),
ResultGet(RESULT, 'codebook', CODEBOOK),
ResultGet(RESULT, 'distortion', D),
% D is the mean Euclidean distance to the nearest centroid
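As a cross-check, here is a hedged Python sketch of the SciPy call behind KMeans. It assumes ITERATIONS maps to SciPy's iter keyword; note that SciPy's own default for iter is 20, so the value 10 is passed explicitly to mirror the default documented here. The seed keyword and the data are illustrative additions.

```python
import numpy as np
from scipy.cluster.vq import kmeans

# Synthetic data: two tight groups of 2-D points.
data = np.array([
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2],
    [8.0, 8.0], [8.2, 8.1], [8.1, 8.3], [8.3, 8.2],
])

# iter=10 repeats k-means ten times and keeps the best codebook.
codebook, distortion = kmeans(data, 2, iter=10, seed=0)

print(codebook.shape)
print(distortion)   # mean Euclidean distance to the nearest centroid
```

SciPy may return fewer than K rows in the codebook if a centroid ends up with no observations, so code that consumes the result should read the row count from the array rather than assume K.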
VectorQuantize(OBS, CODE_BOOK, RESULT)¶
Assign each observation in OBS to the nearest code in CODE_BOOK.
- OBS: ndarray of shape (N, D)
- CODE_BOOK: ndarray of shape (K, D)
- RESULT: dict {'code': ndarray shape (N,), 'dist': ndarray shape (N,)}
  - code[i]: index of the nearest centroid for observation i
  - dist[i]: Euclidean distance to that centroid
KMeans(DATA, 2, KR),
ResultGet(KR, 'codebook', CODEBOOK),
VectorQuantize(DATA, CODEBOOK, VQR),
ResultGet(VQR, 'code', CODE),
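The assignment rule is easy to verify by hand on a tiny input. The sketch below calls scipy.cluster.vq.vq directly with a hand-written codebook; the mapping of its tuple result onto the 'code' and 'dist' keys follows the description above.

```python
import numpy as np
from scipy.cluster.vq import vq

# Three observations and a two-entry codebook, chosen so distances are obvious.
obs = np.array([[0.0, 0.0], [1.0, 0.0], [9.0, 10.0]])
code_book = np.array([[0.0, 0.0], [10.0, 10.0]])

code, dist = vq(obs, code_book)

print(code)   # index of the nearest codebook row per observation
print(dist)   # Euclidean distance to that row
```

Here the first two observations are closest to row 0 (at distances 0 and 1) and the third is closest to row 1 (at distance 1).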
Whiten(OBS, RESULT)¶
Rescale observations by dividing each feature by its standard deviation across all observations. The mean is not subtracted.
- OBS: ndarray of shape (N, D)
- RESULT: ndarray of shape (N, D) with each column scaled to unit variance
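The scaling can be reproduced in a few lines of Python. This sketch compares scipy.cluster.vq.whiten with a manual division by the per-column standard deviation; the sample values are arbitrary.

```python
import numpy as np
from scipy.cluster.vq import whiten

# Two features on very different scales.
obs = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

white = whiten(obs)

# Equivalent manual computation: divide each column by its standard deviation.
manual = obs / obs.std(axis=0)

print(np.allclose(white, manual))
print(white.std(axis=0))    # unit standard deviation per column
print(white.mean(axis=0))   # means are rescaled, not removed
```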
Helper¶
ResultGet(RESULT, FIELD, VALUE)¶
Extract a named field from a Tier 2 result dict.
- RESULT: dict returned by KMeans2, KMeans, VectorQuantize, Cophenet (with Y), or Dendrogram
- FIELD: string key
- VALUE: unified with RESULT[FIELD]
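To make the dict shape concrete, the following Python sketch shows one way a tuple-returning SciPy function could be packaged into a named-field result and then read back. The helpers kmeans2_result and result_get are hypothetical names invented for this illustration; they are not part of scipy_cluster, and the module's real implementation may differ.

```python
import numpy as np
from scipy.cluster.vq import kmeans2


def kmeans2_result(data, k, **kwargs):
    """Hypothetical wrapper: name the two values that kmeans2 returns."""
    centroid, label = kmeans2(data, k, **kwargs)
    return {"centroid": centroid, "label": label}


def result_get(result, field):
    """Hypothetical stand-in for ResultGet: a plain key lookup."""
    return result[field]


data = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.1]])
result = kmeans2_result(data, 2, minit="points", seed=0)

centroids = result_get(result, "centroid")
labels = result_get(result, "label")
print(sorted(result))
```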
Typical pipeline¶
# skip
% 1. Whiten the raw data
Whiten(RAW_DATA, DATA),
% 2. Hierarchical clustering to explore structure
Linkage(DATA, 'ward', Z),
FlatCluster(Z, 3, 'maxclust', LABELS),
% 3. k-means for production assignment
KMeans(DATA, 3, KR),
ResultGet(KR, 'codebook', CODEBOOK),
VectorQuantize(DATA, CODEBOOK, VQR),
ResultGet(VQR, 'code', ASSIGNMENTS).
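For comparison, the same pipeline can be written against SciPy in Python. This is a sketch on synthetic data, assuming each predicate forwards to the scipy function named in the table above; the seed argument is added here only to make the run repeatable.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.cluster.vq import kmeans, vq, whiten

# Synthetic raw data: three Gaussian blobs of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 8.0]])
raw_data = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in centers])

# 1. Whiten the raw data.
data = whiten(raw_data)

# 2. Hierarchical clustering to explore structure.
Z = linkage(data, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")

# 3. k-means for production assignment.
codebook, distortion = kmeans(data, 3, seed=0)
assignments, _ = vq(data, codebook)

print(labels.shape, assignments.shape)
```

The two labelings use different numbering (fcluster counts from 1, vq returns 0-based codebook indices), so they should be compared as partitions rather than value by value.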
Notes¶
- KMeans2 uses random initialisation by default; results are non-deterministic unless SEED is fixed.
- KMeans and KMeans2 may warn about empty clusters on small or degenerate data.
- Dendrogram always passes no_plot=True internally: it returns the layout dict but never calls matplotlib. If you need a plot, access the raw data via ResultGet and draw it yourself.
- All predicates fail silently (yield no solutions) on exceptions such as singular matrices or incompatible array shapes.
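The "fail silently" behaviour can be pictured with a Python generator that yields one result on success and nothing when the wrapped call raises. The helper name solutions is hypothetical and only illustrates the idea; it is not how scipy_cluster is known to be implemented.

```python
import numpy as np
from scipy.cluster.vq import vq


def solutions(func, *args, **kwargs):
    """Hypothetical helper: yield func's result once, or nothing if it raises."""
    try:
        result = func(*args, **kwargs)
    except Exception:
        return  # no solutions: the exception is swallowed
    yield result


obs = np.zeros((3, 2))
good_book = np.ones((2, 2))
bad_book = np.ones((2, 3))   # feature count does not match obs

ok = list(solutions(vq, obs, good_book))
failed = list(solutions(vq, obs, bad_book))

print(len(ok), len(failed))
```

Because a shape mismatch produces no solution rather than an error message, it can help to check array shapes before calling a predicate when a goal unexpectedly fails.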