Clustering in SAS

4 Feb 2014

Clustering is an undirected data mining activity: there is no fixed target variable that we are trying to predict, and no hypothesis testing is involved. It is a popular technique in many business situations, for example fraud detection in the credit card industry, customer segmentation for strategy implementation in the retail industry, and player replacement in sports. Wherever we want to study unusual patterns in the data, or wherever segmenting the data makes the analysis more efficient, we can use cluster analysis.

The purpose of cluster analysis is to place objects into groups suggested by the data, such that data points in a given cluster tend to be similar to each other (low within-cluster variation) and data points in different clusters tend to be dissimilar. Clustering can be done with various statistical tools, including R, Stata, SPSS and SAS/STAT, and SAS is one of the most popular tools for clustering in a corporate setup. In our view, K-Means clustering is more conveniently done in SAS than in R.
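To make "least variation within a cluster" concrete, here is a minimal sketch that scores two groupings of the same four points by their within-cluster sum of squares. It is written in Python rather than SAS purely to show the arithmetic; the points and the function names are made up for illustration.

```python
def centroid(points):
    """Mean of each coordinate across the points in one cluster."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def within_cluster_ss(clusters):
    """Sum of squared distances from every point to its own cluster centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)
    return total


# Hypothetical data: the same four points grouped in two different ways.
tight = [[(1.0, 1.0), (1.0, 2.0)], [(8.0, 8.0), (9.0, 8.0)]]
mixed = [[(1.0, 1.0), (8.0, 8.0)], [(1.0, 2.0), (9.0, 8.0)]]

print(within_cluster_ss(tight))  # small: neighbours are grouped together
print(within_cluster_ss(mixed))  # large: each cluster mixes distant points
```

The grouping with the smaller value is the one a clustering method would prefer, since its members vary less around their cluster centres.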

The SAS procedures for clustering are oriented toward disjoint or hierarchical clusters.

Between them, the SAS clustering procedures cover the following kinds of analysis:

• hierarchical clustering of multivariate data or distance data
• K-Means and hybrid clustering for large multivariate data sets
• disjoint and hierarchical clustering of variables by oblique multiple-group component analysis, providing a least squares fit to the data
• approximate covariance estimation for clustering
• disjoint or hierarchical clustering based on a correlation or covariance matrix
• clustering based on nonparametric density estimates
• tree diagrams

Even though there are many methods of clustering, K-Means and hierarchical clustering are the most commonly used. In SAS, different procedures handle different methods of clustering:

• proc cluster for Hierarchical clustering
• proc fastclus for K-Means clustering
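For readers who want to see what a K-Means pass actually does, the sketch below implements the textbook assign-and-update loop in Python. It is an illustration of the general technique only, not of how proc fastclus is implemented: the procedure has its own seed selection and options that this does not reproduce. The data, the function name kmeans and the choice of starting seeds are assumptions made for the example.

```python
import math


def kmeans(points, seeds, max_iter=20):
    """Plain K-Means: assign each point to its nearest centre, then recompute centres."""
    centers = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the closest centre for every point.
        new_labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centers


# Hypothetical two-variable data with two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = kmeans(points, seeds=[points[0], points[3]])
print(labels)
print(centers)
```

Each observation ends up in exactly one cluster, which is what "disjoint" clustering means; hierarchical methods instead build a nested sequence of merges.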

With proc fastclus, one needs to scale and weight the variables used for clustering. Scaling involves standardising the variables, and proc standard can be used for this purpose. This puts the variables under consideration on an equal scale so that comparison is possible. Weighting involves assigning importance to different variables so that the most important and least important variables are treated appropriately.
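The effect of standardising can be seen in a few lines. The sketch below, in Python rather than SAS, rescales each variable to mean 0 and standard deviation 1, the kind of result one would ask proc standard for. The variable names and values are invented, and the sketch assumes the sample standard deviation is the one wanted.

```python
from statistics import mean, stdev


def standardize(column):
    """Rescale a list of numbers to mean 0 and (sample) standard deviation 1."""
    m, s = mean(column), stdev(column)
    return [(x - m) / s for x in column]


# Hypothetical variables on very different scales.
income = [20000.0, 40000.0, 60000.0]
visits = [2.0, 4.0, 6.0]

income_z = standardize(income)
visits_z = standardize(visits)
print(income_z)
print(visits_z)
```

Before scaling, income differences would dominate any distance calculation simply because the numbers are larger; after scaling, both variables contribute on the same footing.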

Clustering is an art. It is not an exact science. Weighting is one of the things where subjectivity (usually based on experience) comes into play.

Weighting is simply a way for the analyst to tell the algorithm which variables they consider more important than the others. Note that weights are assigned to variables, not to observations.
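One simple way to weight variables, assumed here because the exact mechanism is a matter of choice, is to multiply each standardised variable by its weight before clustering. The Python sketch below shows how that changes distances; the weights, data and function name are hypothetical.

```python
import math


def apply_weights(rows, weights):
    """Multiply every variable (column) by its weight; observations are untouched."""
    return [tuple(x * w for x, w in zip(row, weights)) for row in rows]


# Hypothetical standardised (spend, visits) values for three customers.
rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
weights = (2.0, 1.0)  # assumption: the analyst judges spend twice as important

weighted = apply_weights(rows, weights)
d_spend = math.dist(weighted[0], weighted[1])   # customers differ only in spend
d_visits = math.dist(weighted[0], weighted[2])  # customers differ only in visits
print(d_spend, d_visits)
```

A one-unit difference in the heavily weighted variable now counts for more in the distance than the same difference in the other variable, so it has more influence on which cluster a customer lands in.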

What follows the clustering process in SAS is cluster profiling, which studies the characteristics and attributes of each cluster in order to select the best cluster for implementing business decisions.
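At its simplest, profiling means summarising each variable within each cluster. The sketch below does this in Python for a tiny invented data set, reporting cluster size and per-variable means; the labels are assumed to come from an earlier clustering step, and the function name profile is made up for the example.

```python
from collections import defaultdict


def profile(rows, labels):
    """Return the size and per-variable means of every cluster."""
    groups = defaultdict(list)
    for row, lab in zip(rows, labels):
        groups[lab].append(row)
    return {
        lab: {
            "n": len(members),
            "means": tuple(sum(col) / len(members) for col in zip(*members)),
        }
        for lab, members in sorted(groups.items())
    }


# Hypothetical (monthly spend, visits) per customer, with cluster labels.
rows = [(100.0, 2.0), (120.0, 4.0), (900.0, 10.0), (1100.0, 14.0)]
labels = [0, 0, 1, 1]

summary = profile(rows, labels)
for lab, stats in summary.items():
    print(lab, stats)
```

Comparing these summaries across clusters is what lets the analyst describe each segment in business terms, for example "low spend, infrequent" against "high spend, frequent".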

For an interesting real-life example of clustering in SAS, see:

https://www.sas.com/content/dam/SAS/en_ca/User%20Group%20Presentations/Calgary-User-Group/Chen-RetailSegmentation-Apr2014.pdf

For the syntax of the clustering procedures in SAS, see the links below:

https://support.sas.com/documentation/cdl/en/statug/66103/HTML/default/viewer.htm#statug_cluster_syntax01.htm

https://support.sas.com/documentation/cdl/en/statug/66103/HTML/default/viewer.htm#statug_fastclus_details10.htm
