If you’ve ever dealt with spatial data—think GPS points, locations of stores, or any coordinates scattered on a map—you know that making sense of it visually only gets you so far. That’s where clustering comes in, and scipy.cluster provides some robust tools to help you group points that are “close” to each other in a meaningful way.
At its core, scipy.cluster offers hierarchical clustering and k-means clustering algorithms, both useful but suited to different scenarios. Hierarchical clustering builds a tree of clusters, which you can cut at any level to get a desired number of clusters or a distance threshold. K-means, on the other hand, explicitly tries to partition your data into a fixed number of groups by minimizing the variance within each cluster.
Let’s start with hierarchical clustering because it’s often more intuitive when you don’t know how many clusters to expect. The primary function you’ll want to look at is scipy.cluster.hierarchy.linkage, which takes a condensed distance matrix or an array of observations and produces a linkage matrix. This matrix encodes the hierarchical clustering structure, which you can then use to extract flat clusters or visualize dendrograms.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Sample spatial data: 2D points (e.g., coordinates)
points = np.array([
    [1.0, 2.0],
    [1.5, 1.8],
    [5.0, 8.0],
    [8.0, 8.0],
    [1.0, 0.6],
    [9.0, 11.0]
])

# Perform hierarchical clustering using Ward's method
Z = linkage(points, method='ward')

# Extract flat clusters with max distance threshold
clusters = fcluster(Z, t=3, criterion='distance')
print(clusters)
Notice the method='ward' parameter in linkage. Ward’s method tries to minimize the variance within clusters, which tends to create compact, spherical clusters—often what you want with spatial data. Other methods like 'single' and 'complete' linkage define cluster distances differently and can produce elongated or chaining clusters.
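To get a feel for the difference, here’s a small sketch that reruns linkage on the points array defined above with a few different methods and cuts each tree into two flat clusters. The exact labels depend on your data, but on elongated or noisy point sets the difference between 'single' and 'ward' quickly becomes visible:

from scipy.cluster.hierarchy import linkage, fcluster

# Reuses the `points` array defined above; cut each tree into 2 flat clusters
for method in ['ward', 'single', 'complete', 'average']:
    Z_method = linkage(points, method=method)
    labels = fcluster(Z_method, t=2, criterion='maxclust')
    print(f"{method:>9}: {labels}")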
One thing to keep in mind: hierarchical clustering needs the full set of pairwise distances, so it costs O(n²) memory and somewhere between O(n²) and O(n³) time depending on the linkage method, which means it can get sluggish with datasets beyond a few thousand points. For bigger datasets, k-means or other clustering techniques are usually a better fit.
Speaking of k-means, scipy.cluster.vq provides a straightforward interface. You give it your data and the number of clusters, and it iteratively refines centroids to minimize the squared distance between points and their assigned centroid.
from scipy.cluster.vq import kmeans, vq

# Use the same points array
centroids, distortion = kmeans(points, 2)  # 2 clusters
idx, _ = vq(points, centroids)
print("Cluster assignments:", idx)
Here, kmeans returns the centroids and a distortion value (the mean Euclidean distance between each observation and its nearest centroid), while vq assigns each point to the nearest centroid. Keep in mind k-means assumes clusters are convex and roughly equal in size—if your data violates this, clusters might not reflect reality.
One last tip: before clustering, always normalize or standardize your spatial coordinates if combining different units or scales. Otherwise, the algorithm might overweight one dimension arbitrarily, skewing your clusters.
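scipy.cluster.vq ships a whiten helper for exactly this: it rescales each feature (column) by its standard deviation so that no single dimension dominates the distance calculation. A minimal sketch, using made-up coordinates in mismatched units:

from scipy.cluster.vq import whiten, kmeans, vq
import numpy as np

# Hypothetical coordinates on wildly different scales (e.g., meters vs. kilometers)
raw = np.array([
    [1200.0, 1.2],
    [1500.0, 1.1],
    [5200.0, 8.3],
    [8100.0, 8.0],
])

# whiten() divides each column by its standard deviation
scaled = whiten(raw)

centroids, _ = kmeans(scaled, 2)
labels, _ = vq(scaled, centroids)
print(labels)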
Getting comfortable with these basics opens up a lot of possibilities—once you have clusters, you can analyze hotspots, identify outliers, or simply reduce noise. Up next, we’ll discuss how to choose the clustering algorithm that fits your problem best, because picking the right tool saves you hours of debugging and frustration.
Choosing the right clustering algorithm for your problem
Choosing the right clustering algorithm hinges on the shape, size, and distribution of your spatial data, as well as your end goal. If your data naturally forms spherical blobs and you know roughly how many clusters you want, k-means is usually the fastest and simplest choice. But if your clusters have irregular shapes, varying densities, or you’re unsure how many clusters exist, hierarchical methods or density-based approaches become more appealing.
For example, hierarchical clustering doesn’t require you to pre-specify the number of clusters. Instead, you can cut the dendrogram at different levels to explore various granularities. This flexibility is great for exploratory analysis, but keep in mind it’s computationally expensive and doesn’t scale well beyond a few thousand points.
Here’s a quick way to visualize how changing the distance threshold affects the number of clusters:
from scipy.cluster.hierarchy import fcluster

# Vary distance thresholds
for t in [1.5, 3, 5]:
    clusters = fcluster(Z, t=t, criterion='distance')
    print(f"Threshold {t}: {len(np.unique(clusters))} clusters")
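If you’d rather ask for a specific number of clusters instead of hunting for a distance threshold, fcluster also accepts criterion='maxclust'. A quick sketch, reusing the same linkage matrix Z:

# Ask directly for (at most) k flat clusters instead of a distance cutoff
for k in [2, 3, 4]:
    clusters = fcluster(Z, t=k, criterion='maxclust')
    print(f"k={k}: {clusters}")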
If your spatial data is huge—think tens or hundreds of thousands of points—k-means or approximate methods like MiniBatch KMeans (from scikit-learn) are more practical. Although scipy.cluster.vq is convenient, it lacks these scalable variants.
Another consideration is noise and outliers. K-means will assign every point to a cluster, even if it’s an outlier far from any centroid. Hierarchical clustering can sometimes isolate outliers into their own clusters, but it’s not guaranteed. For spatial data with noise, density-based clustering algorithms like DBSCAN (not in SciPy but in scikit-learn) often outperform both by explicitly handling noise and discovering clusters of arbitrary shape.
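As a point of comparison, here is a minimal DBSCAN sketch using scikit-learn, reusing the small points array from earlier. The eps and min_samples values are illustrative placeholders you would tune for real coordinates; the key behavior to notice is that points in no dense region get the label -1 instead of being forced into a cluster:

from sklearn.cluster import DBSCAN

# Reuses the `points` array from earlier; eps and min_samples are
# illustrative values you would tune for your own data
db = DBSCAN(eps=2.0, min_samples=2).fit(points)
print(db.labels_)  # -1 marks points treated as noise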
Here’s a simple comparison of k-means cluster centers versus hierarchical clusters on the same data:
import matplotlib.pyplot as plt

plt.scatter(points[:, 0], points[:, 1], c=idx, cmap='viridis', marker='o', label='k-means')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x', s=100, label='Centroids')
plt.title("K-means clustering")
plt.legend()
plt.show()

plt.scatter(points[:, 0], points[:, 1], c=clusters, cmap='tab10', marker='o')
plt.title("Hierarchical clustering")
plt.show()
Notice how k-means tends to produce clusters with roughly equal variance, while hierarchical clustering can capture more nuanced groupings depending on your threshold. If your spatial clusters are not well-separated or have complex boundaries, neither may be ideal, pushing you toward density-based or graph-based methods.
Ultimately, the best algorithm depends on your data’s geometry and your task. For segmentation, k-means might suffice. For discovering natural groupings without assumptions, hierarchical methods shine. For robustness to noise and irregular shapes, look beyond SciPy’s cluster module to libraries like scikit-learn.
One more practical tip: always visualize your clusters after running the algorithm. It’s the quickest sanity check to see if your chosen method fits the data’s structure or if you should try a different approach. For example, plotting clusters overlaid on a map or scatter plot can reveal if clusters correspond to meaningful spatial patterns or are artifacts of the algorithm.
In the next section, we’ll dig into visualization techniques that help you interpret these clusters, turning raw coordinates into actionable insights. But before that, keep in mind that even the best clustering algorithm can be slowed down or derailed by huge datasets. Optimizing performance and scalability often means combining smarter algorithms with efficient data structures and preprocessing, a topic we’ll return to in the final section.
Visualizing clusters to make sense of spatial patterns
So you’ve run your clustering algorithm and have an array of numbers telling you which point belongs to which cluster. Great. But what does that actually *mean*? An array like [0, 0, 1, 1, 0, 1] is abstract; it doesn’t tell you if you’ve found meaningful geographic regions or just arbitrary groupings. This is where visualization is not just a nice-to-have, it’s a mandatory step for sanity-checking your work.
The most basic visualization, as we hinted at before, is a scatter plot where you color each point according to its cluster label. This is your first-pass diagnostic. It immediately tells you if the clusters are spatially coherent. If points in the same cluster are scattered all over the plot, something is wrong—either your choice of algorithm, your parameters, or your distance metric is a poor fit for the data.
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans, vq
from scipy.cluster.hierarchy import linkage, fcluster
import numpy as np

# Let's use a slightly more interesting dataset
np.random.seed(42)
points1 = np.random.randn(50, 2) * 1.5 + np.array([5, 5])
points2 = np.random.randn(50, 2) * 1.0 + np.array([1, 1])
points3 = np.random.randn(50, 2) * 0.5 + np.array([8, 1])
points = np.vstack([points1, points2, points3])

# K-means clustering
centroids, _ = kmeans(points, 3)
idx, _ = vq(points, centroids)

# Hierarchical clustering
Z = linkage(points, 'ward')
h_clusters = fcluster(Z, t=6, criterion='distance')

# Plotting K-means results
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(points[:, 0], points[:, 1], c=idx, cmap='viridis', s=10)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=100, label='Centroids')
plt.title('K-means Clustering (k=3)')
plt.legend()

# Plotting Hierarchical results
plt.subplot(1, 2, 2)
plt.scatter(points[:, 0], points[:, 1], c=h_clusters, cmap='viridis', s=10)
plt.title('Hierarchical Clustering (t=6)')

plt.tight_layout()
plt.show()
This side-by-side comparison is incredibly useful. You can see how k-means creates roughly spherical clusters around its centroids, while hierarchical clustering might produce slightly different boundaries depending on the linkage method and cutoff distance. For hierarchical clustering, however, the real diagnostic tool is the dendrogram. It’s a tree diagram that shows you the entire history of cluster merges, and it’s the single best way to decide on a distance threshold (the t parameter in fcluster).
from scipy.cluster.hierarchy import dendrogram

plt.figure(figsize=(10, 7))
plt.title("Hierarchical Clustering Dendrogram")
dendrogram(Z, truncate_mode='lastp', p=12, leaf_rotation=90., leaf_font_size=8., show_contracted=True)
plt.xlabel("Cluster size")
plt.ylabel("Distance")
plt.axhline(y=6, c='grey', lw=1, linestyle='--')  # Our cutoff line
plt.show()
Look at this dendrogram. The y-axis represents the distance between clusters. The horizontal lines show merges. A longer vertical line means two clusters were merged at a greater distance, implying they are less similar. To pick a number of clusters, you draw a horizontal line that cuts through the vertical lines. The number of vertical lines it intersects is the number of clusters you get. Our dashed line at a distance of 6 cuts through three vertical lines, which is why fcluster with t=6 gave us three clusters. This visualization makes the choice of t much less of a guessing game.
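If you want the labels for several cut levels at once, without calling fcluster repeatedly, scipy.cluster.hierarchy also provides cut_tree. A quick sketch using the same linkage matrix Z (note that cut_tree returns one column of labels per requested cluster count):

from scipy.cluster.hierarchy import cut_tree

# One column of labels per requested number of clusters
labels = cut_tree(Z, n_clusters=[2, 3, 4])
print(labels[:5])  # first five points, labelled at each granularity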
Of course, for true spatial data (latitude/longitude), a scatter plot on a blank canvas is only half the story. The real “aha!” moment comes when you overlay these clusters on an actual map. Is one cluster centered on a downtown area? Does another follow a highway? You need context. Libraries like geopandas and folium are built for this. While they are outside the scope of SciPy, you should know that the typical workflow is: cluster with SciPy, then visualize with a geospatial library. The cluster IDs you generate are just another column of data you can pass to your plotting tool.
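As a rough sketch of that workflow (assuming geopandas is installed and that your points are longitude/latitude pairs, which the synthetic points above are not), the cluster labels simply become another column on a GeoDataFrame:

import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt

# Assume `points` holds [longitude, latitude] pairs and `idx` holds cluster labels
df = pd.DataFrame(points, columns=['lon', 'lat'])
df['cluster'] = idx

gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df['lon'], df['lat']), crs='EPSG:4326')

# Color points by cluster; a basemap can be layered on later with folium or contextily
gdf.plot(column='cluster', categorical=True, legend=True, markersize=10)
plt.show()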
Beyond just looking at the points, you can visualize the quality of the clusters themselves. A silhouette plot, for example, measures how similar a point is to its own cluster compared to other clusters. It gives you a score for each point, and a quick visual inspection can tell you whether your clusters are well separated or overlap significantly. A high silhouette score indicates dense, well-defined clusters, while a low or negative score suggests that a point might be in the wrong cluster, or that the clusters themselves are not very distinct.

This moves you from subjective visual assessment to a more objective, quantifiable measure of cluster validity. If you see a lot of points with low silhouette scores, it’s a strong signal that perhaps you’ve chosen the wrong number of clusters for k-means, or that the data simply doesn’t have a strong clustering structure to begin with. This is crucial because bad clusters are worse than no clusters at all; they lead to wrong conclusions. And when you’re dealing with massive datasets, you can’t just eyeball every point, so these summary measures become essential.
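SciPy doesn’t ship silhouette metrics, but scikit-learn does. Here’s a minimal sketch (assuming scikit-learn is installed) that computes the overall score and the per-point values for the k-means labels from the example above:

from sklearn.metrics import silhouette_score, silhouette_samples

# Overall average: closer to 1 means dense, well-separated clusters
print("Mean silhouette:", silhouette_score(points, idx))

# Per-point scores: negative values flag points that may sit in the wrong cluster
per_point = silhouette_samples(points, idx)
print("Points with negative scores:", int((per_point < 0).sum()))

But even with the best visualizations and diagnostics, performance can become a bottleneck when your dataset grows from thousands to millions of points.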
Optimizing performance and scalability in large datasets
When you move from a cozy dataset of a few thousand points to hundreds of thousands or millions, the algorithms we’ve discussed start to creak and groan. Hierarchical clustering is usually the first casualty. The need to compute a pairwise distance matrix means you’re looking at O(n²) memory and at least O(n²) time complexity, which is a complete non-starter. For a million points, you’d need to store about 500 billion distances. Your laptop will not be happy. SciPy’s k-means is more resilient, with a complexity around O(n*k*i), where n is the number of points, k is the number of clusters, and i is the number of iterations. Still, for very large n, reading all the data into memory and iterating over it repeatedly can become prohibitively slow.
So what do you do? Your first, and often surprisingly effective, line of attack is not to use a fancier algorithm, but to use less data. No, I don’t mean throwing away your hard-won data. I mean subsampling. If your data is reasonably well-distributed, you can often get very good cluster centroids by running k-means on a random sample of, say, 1% of your data. Once you have those centroids, you can efficiently assign every point from the original, full dataset to its nearest centroid. This is a classic trick that scales beautifully.
import numpy as np
from scipy.cluster.vq import kmeans, vq
import time

# Simulate a large dataset of 1 million points
print("Generating 1,000,000 points...")
large_points = np.random.rand(1_000_000, 2)

# Subsample 1% of the data (10,000 points)
sample_size = 10000
sample_indices = np.random.choice(large_points.shape[0], sample_size, replace=False)
sample_points = large_points[sample_indices, :]

print(f"Clustering {sample_size} sample points...")
start_time = time.time()
# Run k-means only on the small sample
centroids, _ = kmeans(sample_points, 10)
end_time = time.time()
print(f"Found centroids in {end_time - start_time:.2f} seconds.")

print("Assigning all 1,000,000 points to the centroids...")
start_time = time.time()
# Use the centroids to assign clusters to the *entire* dataset
all_idx, _ = vq(large_points, centroids)
end_time = time.time()
print(f"Assigned full dataset in {end_time - start_time:.2f} seconds.")
While subsampling is a great pragmatic hack, a more principled approach for large-scale k-means is to use an algorithm designed for it. This is where you’ll want to reach for scikit-learn, which is built to play nicely with SciPy and NumPy. Its MiniBatchKMeans implementation is a godsend for large datasets. Instead of using the entire dataset for each iteration, it uses small, random “mini-batches” of data. This dramatically reduces the computation per iteration and often converges much faster than traditional k-means, with very similar results.
from sklearn.cluster import MiniBatchKMeans

# Let's use a moderately large dataset to see the speed difference
medium_points = np.random.rand(100_000, 2)
k = 10

# Time SciPy's standard k-means
print("Running scipy.cluster.vq.kmeans...")
start_time = time.time()
scipy_centroids, _ = kmeans(medium_points, k)
end_time = time.time()
print(f"SciPy took {end_time - start_time:.4f} seconds.")

# Time scikit-learn's MiniBatchKMeans
print("\nRunning sklearn.cluster.MiniBatchKMeans...")
start_time = time.time()
mbk = MiniBatchKMeans(n_clusters=k, batch_size=1000, n_init='auto')
mbk.fit(medium_points)
sklearn_centroids = mbk.cluster_centers_
end_time = time.time()
print(f"Scikit-learn's MiniBatchKMeans took {end_time - start_time:.4f} seconds.")
Another critical, and often overlooked, optimization is memory usage. Your spatial coordinates are probably stored as 64-bit floating-point numbers (float64), which is the NumPy default. Do you really need that much precision for latitude and longitude? Probably not. Converting your data to 32-bit floats (float32) can halve your memory footprint instantly, which might be the difference between fitting your data in RAM and swapping to disk. This simple change can make everything faster because more data can fit into the CPU caches.
# Default dtype is float64
points_64 = np.random.rand(1_000_000, 2)
print(f"Memory usage (float64): {points_64.nbytes / 1e6:.2f} MB")

# Convert to float32
points_32 = points_64.astype(np.float32)
print(f"Memory usage (float32): {points_32.nbytes / 1e6:.2f} MB")
When you’re pushing the limits of a single machine—when even MiniBatchKMeans and memory-optimized arrays aren’t enough—it’s time to think about parallel and distributed computing. Libraries like Dask can parallelize NumPy and scikit-learn computations across multiple CPU cores or even multiple machines without you having to fundamentally change your code. For datasets that are truly massive (terabytes of spatial data), this is the next logical step, moving from single-node processing to a distributed framework that can handle data that doesn’t fit into one computer’s memory.
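As a rough sketch of what that looks like (assuming dask is installed; the array sizes and chunking here are purely illustrative), you can keep the data in a chunked Dask array, fit centroids on a small in-memory sample as before, and then apply SciPy's vq to each chunk in parallel:

import numpy as np
import dask.array as da
from scipy.cluster.vq import kmeans, vq

# A large point set held as a chunked Dask array (illustrative sizes)
big_points = da.random.random((5_000_000, 2), chunks=(500_000, 2))

# Fit centroids on a small in-memory sample, as in the subsampling trick above
sample = big_points[:10_000].compute()
centroids, _ = kmeans(sample, 10)

# Assign every chunk to its nearest centroid in parallel
def assign(chunk):
    labels, _ = vq(chunk, centroids)
    return labels.astype(np.intp)

all_labels = big_points.map_blocks(assign, drop_axis=1, dtype=np.intp).compute()
print(all_labels[:10])

If you need the fit itself to be distributed rather than just the assignment step, the dask-ml project offers clustering estimators built to work directly on Dask arrays.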
Source: https://www.pythonlore.com/clustering-and-spatial-analysis-with-scipy-cluster/