scipy.cluster — Clustering¶

The scipy_cluster module wraps scipy.cluster.hierarchy and scipy.cluster.vq as Clausal predicates. It covers hierarchical clustering (linkage, flat cluster assignment, dendrogram, cophenetic analysis) and vector quantisation (k-means).

Import¶

-import_from(scipy_cluster, [Linkage, FlatCluster, Dendrogram,
                              Cophenet, Inconsistent,
                              KMeans2, KMeans, VectorQuantize, Whiten,
                              ResultGet])

Or via the canonical py.* path:

# skip
-import_from(py.scipy_cluster, [Linkage, FlatCluster, ...])

Tier¶

All predicates are Tier 2 — they return result dicts or NumPy arrays. Use ResultGet to access named fields from dict results.

Naming conventions¶

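The Cluster prefix is dropped since these predicates live in the cluster module. SciPy abbreviations are expanded unless the abbreviation is itself the commonly used name: fcluster becomes FlatCluster and vq becomes VectorQuantize, while Cophenet and KMeans keep their SciPy names.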

| scipy function | Clausal predicate |
|---|---|
| `hierarchy.linkage` | `Linkage` |
| `hierarchy.fcluster` | `FlatCluster` |
| `hierarchy.dendrogram` | `Dendrogram` |
| `hierarchy.cophenet` | `Cophenet` |
| `hierarchy.inconsistent` | `Inconsistent` |
| `vq.kmeans2` | `KMeans2` |
| `vq.kmeans` | `KMeans` |
| `vq.vq` | `VectorQuantize` |
| `vq.whiten` | `Whiten` |

Predicate catalogue¶

Hierarchical clustering¶

`Linkage(Y, RESULT)`¶

`Linkage(Y, METHOD, RESULT)`¶

`Linkage(Y, METHOD, METRIC, RESULT)`¶

`Linkage(Y, METHOD, METRIC, OPTIMAL_ORDERING, RESULT)`¶

Compute a hierarchical clustering linkage matrix from Y, which is either an observation matrix with n rows or a condensed distance matrix of length n*(n-1)/2.

METHOD: 'single', 'complete', 'average', 'weighted', 'centroid', 'median', 'ward' (default 'single')
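METRIC: distance metric string, used only when Y is an observation matrix (default 'euclidean')
OPTIMAL_ORDERING: if True, reorder the linkage matrix so that the distance between successive leaves is minimal; this slows the computation on large inputs (default False)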
RESULT: ndarray of shape (n-1, 4) — the linkage matrix Z

# skip
Linkage(DATA, 'ward', Z),
% Z is the linkage matrix for ward hierarchical clustering
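
For orientation, here is a minimal Python sketch of the SciPy call that Linkage presumably forwards to. The toy data is made up, and the exact way the wrapper maps its arguments onto `scipy.cluster.hierarchy.linkage` is an assumption; only the SciPy behaviour itself is shown.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Made-up toy data: 6 observations, 2 features, forming two tight groups.
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                 [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# Roughly what Linkage(DATA, 'ward', Z) is assumed to compute.
z = linkage(data, method="ward", metric="euclidean", optimal_ordering=False)

# Each row describes one merge:
# [cluster_a, cluster_b, merge_distance, size_of_new_cluster]
print(z.shape)        # (5, 4), i.e. (n-1, 4) for n = 6
print(int(z[-1, 3]))  # 6: the final merge contains every observation
```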

`FlatCluster(Z, T, RESULT)`¶

`FlatCluster(Z, T, CRITERION, RESULT)`¶

`FlatCluster(Z, T, CRITERION, DEPTH, RESULT)`¶

Form flat clusters from a hierarchical clustering linkage matrix Z.

T: threshold or number of clusters (depending on CRITERION)
CRITERION: 'inconsistent', 'distance', 'maxclust', 'monocrit', 'maxclust_monocrit' (default 'inconsistent')
DEPTH: depth for inconsistency calculation (default 2)
RESULT: ndarray of shape (n,) — integer cluster assignment for each observation

# skip
Linkage(DATA, 'ward', Z),
FlatCluster(Z, 3, 'maxclust', LABELS),
% LABELS[i] is the cluster number for observation i (SciPy numbers clusters from 1)
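
The role of T depends on CRITERION, which is easiest to see side by side. The sketch below calls SciPy's `fcluster` directly on made-up data; treating it as equivalent to FlatCluster is an assumption about the wrapper.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Made-up toy data: two tight groups of three points each.
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                 [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
z = linkage(data, method="ward")

# criterion="maxclust": t is the maximum number of flat clusters to form.
labels_by_count = fcluster(z, t=2, criterion="maxclust")

# criterion="distance": t is a cut height compared against the merge
# distances stored in z[:, 2].
labels_by_height = fcluster(z, t=1.0, criterion="distance")

print(labels_by_count)   # one integer label per observation, starting at 1
print(labels_by_height)
```

On this data both criteria recover the same two groups, because the within-group merge distances are far below the cut height of 1.0 and the between-group distance is far above it.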

`Dendrogram(Z, RESULT)`¶

`Dendrogram(Z, TRUNCATE_MODE, RESULT)`¶

Compute dendrogram layout data from linkage matrix Z. The predicate always passes no_plot=True, so it does not depend on matplotlib.

TRUNCATE_MODE: None, 'lastp', or 'level'
RESULT: dict with keys icoord, dcoord, ivl, leaves, color_list

# skip
Linkage(DATA, 'ward', Z),
Dendrogram(Z, D),
ResultGet(D, 'leaves', LEAVES),
% LEAVES is the list of leaf node indices
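
As a rough illustration of what the layout dict contains, the following Python sketch calls SciPy's `dendrogram` with `no_plot=True` on made-up data. Whether the wrapper returns the dict unchanged is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Made-up toy data: two tight groups of three points each.
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                 [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
z = linkage(data, method="ward")

# no_plot=True returns the layout data only; nothing is drawn.
layout = dendrogram(z, no_plot=True)

# The lookup that ResultGet(D, 'leaves', LEAVES) is assumed to perform.
leaves = layout["leaves"]

print(sorted(layout.keys()))
print(leaves)  # a permutation of 0..5: the left-to-right leaf order
```

The `icoord` and `dcoord` entries hold four x and four y coordinates per merge, which is enough to draw the tree with any plotting tool.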

`Cophenet(Z, RESULT)`¶

`Cophenet(Z, Y, RESULT)`¶

Compute cophenetic distances from linkage matrix Z.

Without Y: RESULT is the condensed cophenetic distance array (ndarray of length n*(n-1)/2)
With Y (condensed pairwise distances): RESULT is a dict {'c': float, 'd': ndarray}, where c is the cophenetic correlation coefficient between Y and the cophenetic distances, and d is the cophenetic distance array

# skip
Linkage(DATA, 'ward', Z),
Y is ++(pdist(DATA)),
Cophenet(Z, Y, RESULT),
ResultGet(RESULT, 'c', C),
% C is the cophenetic correlation coefficient (1.0 = perfect)
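
The two call forms correspond to the two return shapes of SciPy's `cophenet`. The sketch below shows both on made-up data; packing the tuple into a dict with keys `c` and `d` mirrors the description above but is an assumption about how the wrapper does it.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

# Made-up toy data: two tight groups of three points each.
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                 [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
z = linkage(data, method="ward")
y = pdist(data)  # condensed pairwise distances, length n*(n-1)/2

# Without Y: only the condensed cophenetic distances are returned.
d_only = cophenet(z)

# With Y: SciPy returns a (c, d) tuple; the dict layout is assumed here.
c, d = cophenet(z, y)
result = {"c": float(c), "d": d}

print(d_only.shape)  # (15,) for n = 6
print(result["c"])   # close to 1 when the tree preserves the distances well
```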

`Inconsistent(Z, RESULT)`¶

`Inconsistent(Z, DEPTH, RESULT)`¶

Compute inconsistency statistics for each non-singleton cluster in linkage matrix Z.

DEPTH: number of levels to consider (default 2)
RESULT: ndarray of shape (n-1, 4) — each row is [mean, std, count, inconsistency_coefficient]

# skip
Linkage(DATA, 'ward', Z),
Inconsistent(Z, STATS),
% STATS[i, 3] is the inconsistency coefficient for merge i
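
To make the column layout concrete, here is a short Python sketch using SciPy's `inconsistent` directly. The data is made up, and equating the call with the Inconsistent predicate is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import inconsistent, linkage

# Made-up toy data: two tight groups of three points each.
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                 [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
z = linkage(data, method="ward")

# d=2 matches the documented default DEPTH.
stats = inconsistent(z, d=2)

# Columns: mean link height, std of link heights, number of links
# included, inconsistency coefficient.
print(stats.shape)  # (5, 4), one row per merge
print(stats[0])     # a merge of two single points has count 1 and coefficient 0
```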

Vector quantisation¶

`KMeans2(DATA, K, RESULT)`¶

`KMeans2(DATA, K, ITERATIONS, RESULT)`¶

`KMeans2(DATA, K, ITERATIONS, SEED, RESULT)`¶

k-means clustering via scipy.cluster.vq.kmeans2, which returns a label for every observation alongside the centroids.

K: number of clusters (integer)
ITERATIONS: number of iterations (default 10)
SEED: random seed for reproducibility
RESULT: dict {'centroid': ndarray shape (K, D), 'label': ndarray shape (N,)}

KMeans2(DATA, 3, RESULT),
ResultGet(RESULT, 'centroid', CENTROIDS),
ResultGet(RESULT, 'label', LABELS),
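
The following Python sketch shows the underlying `kmeans2` call with a fixed seed. The synthetic data and the dict layout are illustrative assumptions; the `seed` keyword requires a reasonably recent SciPy release.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Synthetic data (an assumption for illustration): two well-separated blobs.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
                  rng.normal(5.0, 0.5, size=(20, 2))])

# SciPy returns a (centroid, label) tuple; the dict mirrors the documented RESULT.
centroid, label = kmeans2(data, 2, iter=10, seed=42)
result = {"centroid": centroid, "label": label}

# Repeating the call with the same seed reproduces the same result.
centroid_again, label_again = kmeans2(data, 2, iter=10, seed=42)

print(result["centroid"].shape)  # (2, 2): K rows, D columns
print(result["label"].shape)     # (40,): one label per observation
```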

`KMeans(OBS, K, RESULT)`¶

`KMeans(OBS, K, ITERATIONS, RESULT)`¶

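Classic k-means (scipy.cluster.vq.kmeans). Each run stops once the change in distortion falls below SciPy's threshold; ITERATIONS controls how many runs are made, not the number of steps within a run.

K: number of clusters (integer) or initial codebook (ndarray)
ITERATIONS: number of k-means runs; the codebook with the lowest distortion is kept. SciPy ignores this value when K is an initial codebook (default 10; SciPy's own default is 20)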
RESULT: dict {'codebook': ndarray shape (K, D), 'distortion': float}

# skip
KMeans(DATA, 2, RESULT),
ResultGet(RESULT, 'codebook', CODEBOOK),
ResultGet(RESULT, 'distortion', D),
% D is the mean Euclidean distance to the nearest centroid
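
A minimal Python sketch of the underlying `kmeans` call follows. The synthetic data, the seed, and the dict layout are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.vq import kmeans

# Synthetic data (an assumption for illustration): two well-separated blobs.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
                  rng.normal(5.0, 0.5, size=(20, 2))])

# iter=10 means ten independent runs; the lowest-distortion codebook is kept.
codebook, distortion = kmeans(data, 2, iter=10, seed=42)
result = {"codebook": codebook, "distortion": float(distortion)}

print(result["codebook"].shape)  # at most K rows: SciPy drops empty clusters
print(result["distortion"])      # mean distance to the nearest centroid
```

Unlike `kmeans2`, this function returns no per-observation labels, which is why it is usually paired with VectorQuantize.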

`VectorQuantize(OBS, CODE_BOOK, RESULT)`¶

Assign each observation in OBS to the nearest code in CODE_BOOK.

OBS: ndarray of shape (N, D)
CODE_BOOK: ndarray of shape (K, D)
RESULT: dict {'code': ndarray shape (N,), 'dist': ndarray shape (N,)}
code[i] — index of nearest centroid for observation i
dist[i] — Euclidean distance to that centroid

KMeans(DATA, 2, KR),
ResultGet(KR, 'codebook', CODEBOOK),
VectorQuantize(DATA, CODEBOOK, VQR),
ResultGet(VQR, 'code', CODE),
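
The assignment step can be checked by hand on a tiny example. This Python sketch calls SciPy's `vq` with a hand-written codebook; the data and the dict layout are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.vq import vq

# Made-up observations and a hand-written two-entry codebook.
obs = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
code_book = np.array([[0.0, 0.0], [5.0, 5.0]])

# SciPy returns a (code, dist) tuple; the dict mirrors the documented RESULT.
code, dist = vq(obs, code_book)
result = {"code": code, "dist": dist}

print(result["code"])  # [0 0 0 1 1 1]
print(result["dist"])  # distance from each observation to its assigned code
```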

`Whiten(OBS, RESULT)`¶

Normalise observations by dividing each feature by its standard deviation.

OBS: ndarray of shape (N, D)
RESULT: ndarray of shape (N, D) with each column scaled to unit variance (the column means are not removed)

Whiten(RAW_DATA, NORMALISED),
KMeans2(NORMALISED, 3, RESULT),
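
Whitening is a per-column division, which the short Python sketch below verifies against SciPy's `whiten`. The sample values are made up.

```python
import numpy as np
from scipy.cluster.vq import whiten

# Made-up observations whose two features live on very different scales.
obs = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 200.0],
                [4.0, 400.0]])

normalised = whiten(obs)

# The same result by hand: divide each column by its standard deviation.
by_hand = obs / obs.std(axis=0)

print(np.allclose(normalised, by_hand))  # True
print(normalised.std(axis=0))            # each column now has unit std
print(normalised.mean(axis=0))           # the means are scaled, not zeroed
```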

Helper¶

`ResultGet(RESULT, FIELD, VALUE)`¶

Extract a named field from a Tier 2 result dict.

RESULT: dict returned by KMeans2, KMeans, VectorQuantize, Cophenet (with Y), or Dendrogram
FIELD: string key
VALUE: unified with RESULT[FIELD]

KMeans(DATA, 2, R),
ResultGet(R, 'codebook', CODEBOOK),
ResultGet(R, 'distortion', D),
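
Conceptually, ResultGet is a dictionary lookup. The Python generator below is a hypothetical model of that behaviour, not the module's implementation; in particular, yielding nothing for a missing key is an assumption based on the fail-silently convention described in the Notes.

```python
from typing import Any, Iterator, Mapping


def result_get(result: Mapping[str, Any], field: str) -> Iterator[Any]:
    """Hypothetical model of ResultGet: yield RESULT[FIELD], or nothing."""
    if field in result:
        yield result[field]


# A made-up result dict shaped like the one KMeans is documented to return.
r = {"codebook": [[0.0, 0.0], [5.0, 5.0]], "distortion": 0.07}

print(list(result_get(r, "distortion")))  # [0.07]
print(list(result_get(r, "missing")))     # []
```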

Typical pipeline¶

# skip
% 1. Whiten the raw data
Whiten(RAW_DATA, DATA),

% 2. Hierarchical clustering to explore structure
Linkage(DATA, 'ward', Z),
FlatCluster(Z, 3, 'maxclust', LABELS),

% 3. k-means for production assignment
KMeans(DATA, 3, KR),
ResultGet(KR, 'codebook', CODEBOOK),
VectorQuantize(DATA, CODEBOOK, VQR),
ResultGet(VQR, 'code', ASSIGNMENTS).
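
For readers who want to cross-check the pipeline against SciPy itself, the same three steps can be sketched in Python. The synthetic data, the seed, and the variable names are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.cluster.vq import kmeans, vq, whiten

# Synthetic data (an assumption): three blobs, one feature on a larger scale.
rng = np.random.default_rng(0)
raw = np.vstack([rng.normal(loc, 0.3, size=(15, 2)) for loc in (0.0, 4.0, 8.0)])
raw[:, 1] *= 100.0

# 1. Whiten so both features contribute comparably to distances.
data = whiten(raw)

# 2. Hierarchical clustering to explore structure.
z = linkage(data, method="ward")
labels = fcluster(z, t=3, criterion="maxclust")

# 3. k-means codebook, then nearest-centroid assignment.
codebook, _distortion = kmeans(data, 3, seed=1)
assignments, _dist = vq(data, codebook)

print(len(set(labels.tolist())))  # number of hierarchical clusters found
print(assignments.shape)          # (45,): one codebook index per observation
```

The hierarchical labels start at 1 while the vq codes start at 0, so the two label sets need remapping before they are compared.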

Notes¶

KMeans2 uses random initialisation by default; results are non-deterministic unless SEED is fixed.
KMeans and KMeans2 may warn about empty clusters on small or degenerate data.
Dendrogram always passes no_plot=True internally — it returns the layout dict but never calls matplotlib. If you need a plot, access the raw data via ResultGet and draw it yourself.
All predicates fail silently (yield no solutions) on exceptions such as singular matrices or incompatible array shapes.
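
The fail-silently convention can be pictured as a generator that yields either one solution or none. The Python sketch below is a hypothetical model of that pattern around SciPy's `linkage`; the function name and the broad `except` are assumptions, not the module's actual code.

```python
from typing import Iterator

import numpy as np
from scipy.cluster.hierarchy import linkage


def linkage_solutions(y: np.ndarray, method: str = "single") -> Iterator[np.ndarray]:
    """Hypothetical wrapper: yield one linkage matrix, or nothing if SciPy raises."""
    try:
        z = linkage(y, method=method)
    except Exception:
        return  # swallow the error: the caller simply sees no solutions
    yield z


good = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
bad = np.array([[0.0, np.nan], [1.0, 0.0], [5.0, 5.0]])  # NaN makes linkage raise

print(len(list(linkage_solutions(good))))  # 1
print(len(list(linkage_solutions(bad))))   # 0
```

A practical consequence is that a query which unexpectedly produces no solutions may be hiding a shape or value error, so validating inputs before calling a predicate can save debugging time.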
