Clustering is unsupervised learning.
- No predefined classes
- No examples demonstrating how the data should be grouped
Clustering is a method of data exploration.
- A way of looking for patterns or structure in the data that are of interest
- As a stand-alone tool to get insight into data distribution
- As a processing step for other algorithms
Example: Segmenting Customers
- Group them based on what they do
- Group them based on where they live
- Use multiple variables and perform cluster analysis with a similarity/dissimilarity measure
- Cluster them based on their shopping behavior
- Discover distinct groups in the customer data set, then use this knowledge to develop targeted marketing programs (e.g., fresh food lovers, junk food lovers)
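The segmentation idea above can be sketched in a few lines. This is a toy example, assuming scikit-learn is available; the two "segments" and their features (visits per month, average basket size) are invented for illustration.

```python
# Hypothetical customer segmentation: cluster customers by shopping behavior.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic segments: frequent small-basket vs occasional large-basket shoppers
frequent = rng.normal(loc=[12.0, 20.0], scale=1.0, size=(50, 2))
occasional = rng.normal(loc=[2.0, 90.0], scale=5.0, size=(50, 2))
customers = np.vstack([frequent, occasional])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
segments = km.labels_            # each customer assigned to one segment
centroids = km.cluster_centers_  # segment "profiles" usable for targeted marketing
```

The centroids summarize each discovered group, which is what a targeted marketing program would act on.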
Major Clustering Approaches
- Partitioning algorithms
- We can construct various partitions and then evaluate them by some criterion
- Hierarchical algorithms
- We can create a hierarchical decomposition of the set of data using some criterion
- Hard clustering: Each observation belongs to exactly one cluster
- Soft clustering: An observation can belong to more than one cluster to a certain degree (e.g., likelihood of belonging to the cluster)
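The hard/soft distinction can be seen side by side on toy data. A minimal sketch, assuming scikit-learn: K-means gives hard assignments, while a Gaussian mixture model returns a membership probability per cluster.

```python
# Contrast hard and soft cluster assignments on toy 1-D data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([[0.0], [0.5], [1.0], [5.0], [5.5], [6.0]])

# Hard clustering: each observation gets exactly one label
hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Soft clustering: each observation gets a probability of belonging to each cluster
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X)  # one row per observation; each row sums to 1
```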
How to Choose a Clustering Algorithm
Depending on your problem, ask these questions:
- Is the algorithm scalable?
- Does it handle different types of attributes?
- Do you have to specify the number of clusters?
- How much control do you have on the parameters and on the output?
- How does it handle noise and outliers?
- Is it sensitive to order of observations?
- Can it handle high dimensional data?
- Are the results interpretable?
K-means Clustering Summary
Strengths:
- Simple, understandable, efficient
- Items are automatically assigned to clusters
- Can be used as a pre-clustering step: other clustering algorithms can then be applied on the smaller sub-spaces
Weaknesses:
- Must pick the number of clusters k in advance
- All items are forced into a cluster
- Sensitive to outliers and noise
- Does not work well with non-convex cluster shapes
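Two of the weaknesses are easy to demonstrate. A small sketch, assuming scikit-learn: k must be chosen up front, and an outlier is still forced into a cluster, dragging a centroid with it.

```python
# K-means caveats: k is fixed in advance, and outliers distort centroids.
import numpy as np
from sklearn.cluster import KMeans

tight = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
outlier = np.array([[10.0, 10.0]])
X = np.vstack([tight, outlier])

# With k=2, the outlier typically ends up as its own cluster
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# With k=1, the outlier is forced in and pulls the single centroid
km1 = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X)
pulled_centroid = km1.cluster_centers_[0]  # mean of all 5 points = (2.04, 2.04)
```

The tight group sits near the origin, yet the k=1 centroid lands far from it because of the single outlier.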
Similarity vs Dissimilarity
- Depends on what we want to find or emphasize in the data
- Depends on the type of attributes in your data
- Measures the relationship between 2 observations
- Weighting the attributes might be necessary.
- Some of the clustering algorithms use distance matrices as input.
- Example similarity measures:
- Cosine similarity
- The inverse of a distance value
- Example dissimilarity (distance) measures:
- Euclidean distance
- Manhattan distance
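The measures listed above can be computed directly. A short sketch, assuming SciPy's distance functions; note that SciPy's `cosine` returns cosine *distance*, so similarity is one minus it, and `1/(1+d)` is just one common convention for turning a distance into a similarity.

```python
# Distance and similarity measures between two observations.
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

d_euc = euclidean(a, b)        # sqrt(1^2 + 2^2 + 3^2) = sqrt(14)
d_man = cityblock(a, b)        # |1| + |2| + |3| = 6 (Manhattan distance)
cos_sim = 1.0 - cosine(a, b)   # b = 2a, so the vectors point the same way: sim = 1

# One common way to convert a distance into a similarity
sim_from_dist = 1.0 / (1.0 + d_euc)
```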
Internal vs External
How can we tell whether a clustering is good?
- Internal criterion: a good clustering will produce high-quality clusters in which:
- The intra-cluster similarity is high.
- The inter-cluster similarity is low.
- External criterion: quality is measured by the clustering's ability to discover some or all of the hidden patterns or latent classes in gold-standard data
- Assess the clustering with respect to ground truth
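Both kinds of criterion have standard implementations. A sketch, assuming scikit-learn: the silhouette score is an internal measure (no labels needed), while the adjusted Rand index is an external one (compares against ground truth).

```python
# Internal vs external cluster validation on synthetic data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, truth = make_blobs(n_samples=200, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

internal = silhouette_score(X, labels)         # uses only X and labels
external = adjusted_rand_score(truth, labels)  # requires gold-standard labels
```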
Estimating K: Reference Distribution
We can estimate k by comparing the clustering solution on the training data to clustering solutions on data drawn from a reference distribution, using methods such as:
- Aligned box criterion (ABC)
- Gap statistic
- Cubic clustering criterion (CCC)
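Of the three, the gap statistic is the easiest to sketch. This is a toy version, assuming scikit-learn, with within-cluster sum of squares (K-means inertia) as the dispersion measure and a uniform reference distribution; the full procedure (Tibshirani et al.) also averages log-dispersions and accounts for standard errors, which are omitted here.

```python
# Toy gap-statistic sketch: data dispersion vs uniform-reference dispersion.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def gap(X, k, n_refs=5, seed=0):
    rng = np.random.default_rng(seed)
    inertia = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
    ref_inertias = []
    for _ in range(n_refs):
        # Uniform reference sample over the data's bounding box
        ref = rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)
        ref_inertias.append(
            KMeans(n_clusters=k, n_init=10, random_state=seed).fit(ref).inertia_
        )
    return np.log(np.mean(ref_inertias)) - np.log(inertia)

X, _ = make_blobs(n_samples=150, centers=3, random_state=1)
gaps = {k: gap(X, k) for k in range(1, 6)}
best_k = max(gaps, key=gaps.get)  # the k with the largest gap is the candidate
```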
When to Use Clustering
- Anomaly detection
- Outliers typically belong to clusters with 1 observation.
- Identify fraudulent transactions
- Prepare for other techniques
- Summarize documents: group them into clusters and use the centroids as summaries
- Predictive modeling on segments
- Logistic regression results can be improved by fitting a separate model on each smaller cluster
- Missing value imputation
- Decrease dependence between attributes
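The anomaly-detection use is the simplest to sketch: flag observations that land in very small clusters. A toy example, assuming scikit-learn, with one synthetic anomaly placed far from the normal data.

```python
# Anomaly detection via tiny clusters: outliers tend to form singleton clusters.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
normal = rng.normal(0.0, 1.0, size=(60, 2))   # 60 ordinary observations
anomaly = np.array([[15.0, 15.0]])            # one far-away observation (index 60)
X = np.vstack([normal, anomaly])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
sizes = np.bincount(km.labels_)               # observations per cluster

# Flag members of clusters containing only a single observation
flagged = np.flatnonzero(sizes[km.labels_] <= 1)
```

With k=3, the normal mass splits into two clusters and the distant point gets a cluster of its own, so it is the one flagged.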