Clustering: Dendrograms and K-Means
Clustering links observations together (for example, customers) by a calculated distance, where that distance is computed from selected attributes of the data (a customer’s neighborhood, age, etc.).
The following is an example of this in R, where I clustered states based on different attributes: population, income, illiteracy, life expectancy, etc. This was analyzed in two ways: by dendrogram, and through the k-means method.
Simply put, a dendrogram groups observations with similar attributes into small clusters, then merges those clusters into progressively larger groups of the smaller groups. It can look like a March Madness bracket!
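To make this concrete, here is a minimal sketch in R, assuming the built-in `state.x77` dataset (which matches the state attributes described in this post):

```r
# Hierarchical clustering of the 50 US states on the built-in
# state.x77 matrix (population, income, illiteracy, life exp, etc.)
d <- dist(state.x77)   # pairwise Euclidean distances between states
hc <- hclust(d)        # complete linkage by default
plot(hc, cex = 0.6)    # draw the dendrogram ("bracket")
```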
In the k-means method, we choose an arbitrary number of centers (k) and then run an iterative process that places each center in the middle of a group of observations, so that the total distance from the observations to their assigned centers is minimized.
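That iterative process is Lloyd's algorithm: assign each observation to its nearest center, move each center to the mean of its assigned observations, and repeat. A hand-rolled sketch of one iteration on toy data (for illustration only; in practice R's `kmeans()` does all of this for you):

```r
set.seed(1)
x <- matrix(rnorm(20), ncol = 2)      # 10 toy observations, 2 attributes
k <- 3
centers <- x[sample(nrow(x), k), ]    # start from k random observations
# Step 1: assign each observation to its nearest center
cl <- apply(x, 1, function(p) which.min(colSums((t(centers) - p)^2)))
# Step 2: move each center to the mean of its assigned observations
centers <- t(sapply(1:k, function(j) colMeans(x[cl == j, , drop = FALSE])))
# Repeating steps 1 and 2 until the assignments stop changing is k-means.
```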
Dendrograms
In the following dendrogram, you can see that Alaska and Texas sit at outlier distances from the rest, probably because area is one of the attributes. Some of the best insights can be taken from the smallest groups, the clusters at the bottom. For example, it considers Colorado, Oregon, and Wyoming all similar, and it would be up to us to find the interpretation, based on the attributes, of why.
Since area probably had an outsized influence on the distance calculations, we are going to normalize the attributes, which will make area only as influential as the rest.
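Normalization can be done with R's `scale()`, which standardizes each column to mean 0 and standard deviation 1 (again assuming the `state.x77` data):

```r
# Standardize every attribute so large-valued columns like Area
# no longer dominate the distance calculation
scaled_data <- scale(state.x77)
hc_scaled <- hclust(dist(scaled_data))
plot(hc_scaled, cex = 0.6)
```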
The normalized data has more clusters at the first level than the non-normalized data, which merges larger groups right away. You can also observe that Alaska and Texas are still far away from every other state, and even from the first few clusters.
To give an even better clustering of the data, perhaps, we will remove area as an attribute.
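Dropping the column can be as simple as (assuming `state.x77` again):

```r
# Remove the Area column, then scale and re-cluster
no_area <- state.x77[, colnames(state.x77) != "Area"]
hc_no_area <- hclust(dist(scale(no_area)))
plot(hc_no_area, cex = 0.6)
```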
The dendrogram for the data without the area column is in fact very different from the former two. For example, Alaska is no longer isolated from most of the groupings, a sign that the other attributes now have more pull.
K-Means
The following is the k-means output in R. I arbitrarily chose 3 centers, so k = 3.
You can see the attributes included, as well as their mean values within each cluster, so you can compare the clusters numerically. Because we normalized the data (each attribute scaled to mean 0 and standard deviation 1), negative values appear: they indicate below-average values for that attribute.
You can also see which cluster each state fell into, as well as the variability within each cluster, shown by the within-cluster sum of squares.
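Output like the one below comes from a call along these lines (k-means starts from random centers, so without a fixed seed the exact cluster labels and sizes can differ from run to run):

```r
set.seed(42)                                # for a repeatable random start
clusters <- kmeans(scale(state.x77), centers = 3)
clusters                                    # prints means, assignments, SS
```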
## K-means clustering with 3 clusters of sizes 24, 11, 15
##
## Cluster means:
## Population Income Illiteracy Life Exp Murder HS Grad
## 1 -0.4873370 0.1329601 -0.641201154 0.7422562 -0.8552439 0.5515044
## 2 -0.2269956 -1.3014617 1.391527063 -1.1773136 1.0919809 -1.4157826
## 3 0.9462026 0.7416690 0.005468667 -0.3242467 0.5676042 0.1558335
## Frost Area
## 1 0.4528591 -0.1729366
## 2 -0.7206500 -0.2340290
## 3 -0.1960979 0.4483198
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 2 3 3 2 3
## Colorado Connecticut Delaware Florida Georgia
## 1 1 1 3 2
## Hawaii Idaho Illinois Indiana Iowa
## 1 1 3 1 1
## Kansas Kentucky Louisiana Maine Maryland
## 1 2 2 1 3
## Massachusetts Michigan Minnesota Mississippi Missouri
## 1 3 1 2 3
## Montana Nebraska Nevada New Hampshire New Jersey
## 1 1 3 1 3
## New Mexico New York North Carolina North Dakota Ohio
## 2 3 2 1 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 1 1 3 1 2
## South Dakota Tennessee Texas Utah Vermont
## 1 2 3 1 1
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 1 2 1 1
##
## Within cluster sum of squares by cluster:
## [1] 67.72742 23.62227 111.66951
## (between_SS / total_SS = 48.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
The sizes of the 3 clusters are 24, 11, and 15 states.
This can be useful if we want to find commonality among groups. In this case, we found commonality between states based on our attributes.
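The sizes (and the state-to-cluster mapping) are stored on the fitted object; a short sketch, assuming a k-means fit like the one above:

```r
set.seed(42)
clusters <- kmeans(scale(state.x77), centers = 3)
clusters$size       # number of states in each cluster
clusters$cluster    # named vector: which cluster each state fell into
```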
Selecting “K”
There is a way to choose k systematically: the elbow method. The code below shows how it is calculated: we compute the total within-cluster sum of squares for a range of k values and plot it. The theory suggests the ideal k is at the inflection point, or the elbow: up to that point each added cluster still reduces the error substantially, while beyond it we are just adding more and more clusters for only a slight decrease in error. If we create too many clusters, we may be individualizing the groups a little too much, which would oppose the reason we decided to cluster in the first place.
# Total within-cluster sum of squares for k = 1 to 25
wss <- numeric(25)
for (k in 1:25){
  clusters_for <- kmeans(scaled_data, k)
  wss[k] <- clusters_for$tot.withinss
}
plot(1:25, wss, type = "b", col = topo.colors(25),
     ylab = "Total Within SS",
     xlab = "K Clusters")
Here the elbow point is not too obvious, so the choice is somewhat arbitrary. I will select k = 6 and re-run the cluster analysis.
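Re-running with k = 6 (again seeding the random start; the cluster numbering in your run may differ from the printout below):

```r
set.seed(42)
clusters6 <- kmeans(scale(state.x77), centers = 6)
split(names(clusters6$cluster), clusters6$cluster)  # list the states per group
```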
## K-means clustering with 6 clusters of sizes 11, 6, 3, 1, 11, 18
##
## Cluster means:
## Population Income Illiteracy Life Exp Murder HS Grad
## 1 0.4862079 0.58900729 -0.44296199 -0.3897109 0.2719240 -0.01449684
## 2 -0.2076952 0.32553803 0.07656133 0.5969933 -0.2107522 0.59948017
## 3 2.8948232 0.48692374 0.65077132 0.1301655 1.0172810 0.13932569
## 4 -0.8693980 3.05824562 0.54139799 -1.1685098 1.0624293 1.68280347
## 5 -0.2269956 -1.30146170 1.39152706 -1.1773136 1.0919809 -1.41578257
## 6 -0.5233464 0.07581964 -0.74373865 0.8018513 -0.9918174 0.55752288
## Frost Area
## 1 0.3059518 -0.32541400
## 2 -1.4196254 -0.03700902
## 3 -1.1310576 0.99272004
## 4 0.9145676 5.80934967
## 5 -0.7206500 -0.23402899
## 6 0.8643354 -0.13397682
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 5 4 2 5 3
## Colorado Connecticut Delaware Florida Georgia
## 6 6 1 2 5
## Hawaii Idaho Illinois Indiana Iowa
## 2 6 1 1 6
## Kansas Kentucky Louisiana Maine Maryland
## 6 5 5 6 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 6 1 6 5 1
## Montana Nebraska Nevada New Hampshire New Jersey
## 6 6 1 6 1
## New Mexico New York North Carolina North Dakota Ohio
## 5 3 5 6 1
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 1 6 5
## South Dakota Tennessee Texas Utah Vermont
## 6 5 3 6 6
## Virginia Washington West Virginia Wisconsin Wyoming
## 1 2 5 6 6
##
## Within cluster sum of squares by cluster:
## [1] 24.68546 17.65636 11.34904 0.00000 23.62227 35.53577
## (between_SS / total_SS = 71.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
These are the groups made:
Colorado, Connecticut, Iowa, Kansas, Rhode Island, South Dakota, Wisconsin, Utah
Alaska
Delaware, Idaho, Indiana, Maine, Missouri, Montana, Nevada, New Hampshire, Oklahoma, Vermont, Wyoming
California, Florida, Illinois, Maryland, Michigan, New Jersey, New York, Ohio, Pennsylvania, Texas, Virginia
Arizona, Hawaii, Oregon, Washington
Alabama, Georgia, Kentucky, Louisiana, Mississippi, New Mexico, North Carolina, South Carolina, Tennessee, West Virginia
Graphing the Cluster
We can actually graph the cluster on a 2D plane.
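One common way is `clusplot()` from the `cluster` package (a recommended package that ships with R), which projects the 8 attributes onto their first two principal components; a sketch, assuming the k = 6 fit from above:

```r
library(cluster)                 # recommended package, ships with R
set.seed(42)
km <- kmeans(scale(state.x77), centers = 6)
clusplot(scale(state.x77), km$cluster, labels = 2, lines = 0,
         main = "States on the first two principal components")
```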
Insight Gained
Since 8 attributes were used, it seems there was no single defining feature; many influences together shaped these groups.
Group 4, which contains Texas, California, and New York, seems grouped mainly by population, and to an extent perhaps income and murder rate.
Group 6 seems to be a Sun Belt group, which might share common income, frost, and area statistics.
Now, the values of the principal components are a little more difficult to interpret. It seems that colder, more northern states fall on the positive side of the x-axis, whereas the Sun Belt group 6 sits far on the negative end.