#install.packages("sf")
#install.packages("dplyr")
Typology construction using K-means clustering in R
Introduction
This tutorial demonstrates how to construct urban typologies in R. Specifically, we use K-means Clustering, a method of unsupervised machine learning, to classify urban units into typologies for urban stream restoration.
Preparation
In the previous session, we introduced the key preparatory steps for typology construction before clustering. Here, we provide a specific example that illustrates how these steps can be applied in a particular context.
Define your objective:
To identify urban stream typologies in areas directly adjacent to stream corridors, where local conditions affect stream ecology and guide restoration design strategies.Define your focused area:
The stream corridors in the whole city of Dresden.Define your spatial unit:
Fixed size, 100 x 100 m square; centered on the stream and rotated according to the local flow direction of the stream; overlapped units are deleted.Define your variables:
impervious
– percentage of impervious surface, such as roads, squares and buildings; related to run-off and water management, may require rain gardens or permeable pavement.
slope
– mean terrain slope (degrees); related to erosion and stability, may require slope planting or retaining walls.
crossing
– the number of points where roads or railways cross a stream, divided by stream length; related to ecological connectivity, may require ecological bridges.
Prepare geometries of spatial unit covering focused area:
Read ingrids_TC.gpkg
, with a columngrid_id
created.Calculate the value of variables in each spatial unit:
Read ingrids_TC.gpkg
, with columnsimpervious
,slope
,crossing
created.
Clustering steps
After the above preparatory steps, now we start the clustering process.
We will:
Step 1. Load R packages and data
Step 2. Standardization
Step 3. Determine optimal k
Step 4. Run K-means clustering
Step 5. Interpret cluster center
Step 6. (Optional) Calculate distance to cluster center
Step 1. Load R packages and data
We begin by loading the necessary R packages.
If you haven’t installed these packages yet, remove the #
and run the following chunk before continuing:
Adding #
at the beginning of a line of code will prevent that line from being executed. We use it here because we only need to run install.packages()
once, when we install a package for the first time. After that we disable that line because we do not need to run it anymore.
Next, load the required packages:
library(sf) # for processing vector data
Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
library(dplyr) # for selecting and transforming data
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
We first need to confirm that the project folder is an RStudio project. This will ensure that the working directory points to the project folder. You can confirm this with getwd()
. Make sure you copied the prepared grid_TC.gpkg
in the root of your working directory folder. Now we read the dataset from a GeoPackage file and display the first few rows using head():
#getwd()
<- st_read("grids_TC.gpkg", quiet = TRUE)
grids
# View the first few rows of the data
head(grids)
Simple feature collection with 6 features and 4 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 420889.5 ymin: 5658844 xmax: 421165.8 ymax: 5659723
Projected CRS: ETRS89 / UTM zone 33N
grid_id impervious slope crossing geom
1 0 0 4.729869 0.8957654 POLYGON ((420982.1 5659723,...
2 1 0 3.581370 0.7567734 POLYGON ((421056.3 5659624,...
3 2 0 3.182152 0.9955550 POLYGON ((421150.8 5659341,...
4 3 0 3.733626 0.9997012 POLYGON ((421142.7 5659083,...
5 4 0 2.914602 0.0000000 POLYGON ((421150.3 5659197,...
6 5 0 3.828850 0.0000000 POLYGON ((421147.5 5658961,...
We can check how many girds in the file.
# Count how many grid units we have
nrow(grids)
[1] 1745
We can also visualize the data to have a better understanding visually.
# Plot impervious
plot(grids["impervious"], border = NA, main = "Impervious Cover")
# Plot slope
plot(grids["slope"], border = NA, main = "Terrain Slope")
# Plot crossing
plot(grids["crossing"], border = NA,
#breaks = "quantile",
main = "Transport-Stream Crossing counts")
By default, plot()
is using equal interval, which might make the differences hard to see. You can change the breaks
type to show the visual difference more clearly.
Step 2. Standardization
We select the relevant features. These features are then standardized to ensure they contribute equally to the clustering algorithm.
<- grids |>
features select(impervious, slope, crossing) |>
st_drop_geometry() # remove geometry column so we just keep a data table
<- scale(features) # Standardize (mean=0, sd=1)
X_scaled
head(X_scaled)
impervious slope crossing
1 -0.6612074 -0.5714982 0.4049445
2 -0.6612074 -0.7437663 0.2509882
3 -0.6612074 -0.8036468 0.5154779
4 -0.6612074 -0.7209288 0.5200705
5 -0.6612074 -0.8437778 -0.5872618
6 -0.6612074 -0.7066458 -0.5872618
|>
passes the result from the left expression to the right function.
Even after selecting variables from an sf
object, the geometry column is still there. Use st_drop_geometry()
to remove it before applying functions like scale()
.
Step 3: Determine the optimal K
We use the elbow method to choose a good number of clusters.
For each value of k
(e.g. 2 to 9), we run K-means and record a value called inertia : the total distance between points and their cluster centers. Lower inertia means tighter (better) clusters.
# Initialize an empty numeric vector to store inertia values
<- numeric()
inertia
# Try k values from 2 to 9
<- 2:9
k_values
# Loop through each k value
for (k in k_values) {
<- kmeans(X_scaled, centers = k, nstart = 20)
km
# tot.withinss is Total within-cluster sum of squares
# This measures how compact the clusters are: lower is better.
<- c(inertia, km$tot.withinss)
inertia
}
# Combine the results into a data frame for plotting
<- data.frame(k = k_values, inertia = inertia)
elbow_df
print(elbow_df)
k inertia
1 2 3490.5663
2 3 2334.0686
3 4 1723.0514
4 5 1428.5589
5 6 1223.6857
6 7 1051.6004
7 8 909.2724
8 9 814.8496
We suggest using a larger nstart
value, such as 20 or 50, to get more reliable results. Setting nstart = 20
makes R try 20 different starting points and choose the one with the lowest tot.withinss
. The tot.withinss
value measures how close points are to their cluster centers. Lower values mean better clustering.
We can visualize how inertia changes with increasing k. After a certain point, adding more clusters doesn’t help much — the curve bends. That bend is called the elbow point, and we use it to choose the best k.
# Make the elbow plot
plot(k_values, inertia,
type = "b", # shown both points + lines
col = "darkblue",
main = "Elbow Method")
Step 4. Run K-Means clustering
Based on the elbow plot, we choose k = 4
as a good number of clusters.
We now run the K-means algorithm and assign each grid to one of the four clusters.
# `set.seed()` sets the random number generator to a fixed state
# Set the seed so the clustering result is always the same when re-run
set.seed(0) # The number 0 is just a fixed choice. You can also use 10, 345, etc.
# Choose the number of clusters based on the elbow plot
<- 4
k
# Run K-means clustering on the standardized data
<- kmeans(X_scaled, centers = k, nstart = 20)
kmeans_result
# Add the cluster labels to the spatial data
$cluster <- as.factor(kmeans_result$cluster) # The result kmeans_result$cluster is a list of cluster labels (1 to 4), in the same order as the original rows in X_scaled and grids
grids
head(grids)
Simple feature collection with 6 features and 5 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 420889.5 ymin: 5658844 xmax: 421165.8 ymax: 5659723
Projected CRS: ETRS89 / UTM zone 33N
grid_id impervious slope crossing geom cluster
1 0 0 4.729869 0.8957654 POLYGON ((420982.1 5659723,... 3
2 1 0 3.581370 0.7567734 POLYGON ((421056.3 5659624,... 3
3 2 0 3.182152 0.9955550 POLYGON ((421150.8 5659341,... 3
4 3 0 3.733626 0.9997012 POLYGON ((421142.7 5659083,... 3
5 4 0 2.914602 0.0000000 POLYGON ((421150.3 5659197,... 3
6 5 0 3.828850 0.0000000 POLYGON ((421147.5 5658961,... 3
# Show how many grids fall into each cluster
print(table(grids$cluster))
1 2 3 4
392 365 932 56
Save the updated grid data (with cluster labels) to a new GeoPackage file. You can use it for further analysis and visualize.
#st_write(grids, "grids_cluster.gpkg")
We can also visualize the spatial pattern.
# Plot clusters with base R
plot(grids["cluster"],
main = "Spatial Pattern of Urban Stream Clusters",
border = NA)
Step 5. Interpret cluster centers
Now we look at the center of each cluster. First, we check the values in standardized form. Then, we convert them back to the original units (e.g. degrees), so they are easier to understand.
# Get the cluster centers (in standardized form)
<- kmeans_result$centers
scaled_centers
# Print them
print("Cluster centers (standardized):")
[1] "Cluster centers (standardized):"
print(scaled_centers)
impervious slope crossing
1 1.6080590 -0.3517408 0.1545016
2 -0.4634886 1.6320976 -0.2651149
3 -0.5563009 -0.4721882 -0.1838377
4 1.0229762 -0.3170328 3.7060541
# Convert the centers back to original scale: x * SD + mean
<- t(apply(
original_centers 1,
scaled_centers, function(x) x * attr(X_scaled, "scaled:scale") + attr(X_scaled, "scaled:center")
))
# Print the real-world values
print("Cluster centers (original):")
[1] "Cluster centers (original):"
print(original_centers)
impervious slope crossing
1 0.76223558 6.194974 0.6696651
2 0.06641279 19.421070 0.2908348
3 0.03523757 5.391960 0.3642119
4 0.56570913 6.426370 3.8760120
To interpret the clustering results, we examine the centers of each cluster in the original data scale. The table below summarizes the environmental characteristics of each cluster based on the original (unscaled) values.
Typology description
Type | Impervious | Slope (°) | Crossing | Description |
---|---|---|---|---|
1 | 0.76 | 6.2 | 0.67 | Urban area with high imperviousness and moderate traffic impact |
2 | 0.07 | 19.4 | 0.29 | Steep natural area with minimal development |
3 | 0.04 | 5.4 | 0.36 | Flat and near-natural stream surroundings |
4 | 0.57 | 6.4 | 3.88 | Urban stream corridor with high crossing disturbance |
Step 6. (Optional) Calculate distance to cluster center
## Step 6 (Optional): Compute distance to cluster center
# Get the cluster center for each row, using the cluster assignment
<- kmeans_result$centers[kmeans_result$cluster, ]
centroids_matrix
# Calculate Euclidean distance between each point and its assigned cluster center
<- sqrt(rowSums((X_scaled - centroids_matrix)^2))
grid_distances
# Save to grids
$cluster_dist <- grid_distances
grids#st_write(grids, "grids_cluster_distance.gpkg")
Summary
- We prepared spatial data representing the surroundings of urban streams.
- We selected key indicators for stream ecology: imperviousness, slope, and crossing frequency.
- We standardized the data to ensure fair clustering.
- We used the elbow method to decide the K.
- We performed K-means clustering to identify distinct urban stream types.
- We analyzed and interpreted each cluster using cluster centers.
- (Optional) We calculated how close each grid is to its cluster center.
This workflow can now be applied to your own datasets to explore meaningful typologies for planning and design. Enjoy clustering!
Appendix
A1. Formula of crossing
crossing
– the number of points where roads or railways cross a stream, divided by stream length (m), multiplied by 100; related to ecological connectivity, may require ecological bridges.
\[ \text{crossing} = \frac{crossing Count}{stream Length (m)} \times 100 \]
A2. Difference between standardization and normalization
There are two common methods of scaling:
Standardization (Z-score Scaling)
\[ x' = \frac{x - \mu}{\sigma} \]
𝜇
is the mean, and𝜎
is the standard deviation.- Centers the data around 0 with a standard deviation of 1.
- Keeps the original distribution shape.
Normalization (Min-Max Scaling)
\[ x' = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}} \]
- Rescales the values to the range [0, 1].
- Sensitive to outliers.
A3. Visualize variable distributions in histogram
# calculate nornalization
<- function(x) (x - min(x)) / (max(x) - min(x))
normalize <- as.data.frame(lapply(features, normalize))
features_norm
# set canvas
par(mfrow = c(3, 3), mar = c(4, 4, 2, 1))
# original raw data
hist(features$impervious,
main = "impervious (original)",
xlab = "value", col = "pink")
hist(features$slope,
main = "slope (original)",
xlab = "value", col = "lightblue")
hist(features$crossing,
main = "crossing (original)",
xlab = "value", col = "lightyellow")
# standardization z-score
hist(X_scaled[, "impervious"],
main = "impervious (standardized)",
xlab = "z-score", col = "red")
hist(X_scaled[, "slope"],
main = "slope (standardized)",
xlab = "z-score", col = "blue")
hist(X_scaled[, "crossing"],
main = "crossing (standardized)",
xlab = "z-score", col = "yellow")
# normalization
hist(features_norm$impervious,
main = "impervious (normalized)",
xlab = "normalized [0–1]", col = "darkred", xlim = c(0, 1))
hist(features_norm$slope,
main = "slope (normalized)",
xlab = "normalized [0–1]", col = "darkblue", xlim = c(0, 1))
hist(features_norm$crossing,
main = "crossing (normalized)",
xlab = "normalized [0–1]", col = "gold", xlim = c(0, 1))
A4. K-means Euclidean distance
For a data point x
with 3 variables ( u, v, w
),
\[
\mathbf{x} = (x_u, x_v, x_w)
\]
and a cluster center
\[
\mathbf{c} = (c_u, c_v, c_w),
\]
the Euclidean distance between the point and the cluster center is calculated as:
\[ \text{Distance} = \sqrt{(x_u - c_u)^2 + (x_v - c_v)^2 + (x_w - c_w)^2} \]
In our case:
\[ \text{Distance} = \sqrt{ (x_{\text{impervious}} - c_{\text{impervious}})^2 + (x_{\text{slope}} - c_{\text{slope}})^2 + (x_{\text{crossing}} - c_{\text{crossing}})^2 } \]
A5. Visualize cluster points in 3D distribution
<- c(
cluster_colors "1" = "#65C3A1",
"2" = "#FC9964",
"3" = "#869DC5",
"4" = "#E597C0"
)
library(plotly)
Loading required package: ggplot2
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
<- plot_ly(
cluster3d_scaled x = X_scaled[, "impervious"],
y = X_scaled[, "slope"],
z = X_scaled[, "crossing"],
type = "scatter3d",
mode = "markers",
color = as.factor(kmeans_result$cluster),
colors = cluster_colors,
marker = list(size = 3)
%>%
) layout(
scene = list(
xaxis = list(title = "impervious"),
yaxis = list(title = "slope"),
zaxis = list(title = "crossing")
)
) cluster3d_scaled