Unsupervised learning is a branch of machine learning that identifies clusters or groups in an unlabeled dataset, with as little human intervention as possible. In the previous article, we looked at the two main categories of unsupervised models, clustering and association rules, as well as some of the main applications of these models. Please view that previous article if you wish to learn more about these topics. Below, we will dive into the main unsupervised learning algorithms used in everyday applications.
Main unsupervised learning algorithms
a. K-means clustering
K-means clustering is one of the most common clustering algorithms, in which the points of a dataset are assigned to K groups. Here, K represents the number of groups created, based on the distance from each group’s seed. The seed, also known as the starting cluster centroid, is either chosen at random or specified by the data scientist based on the dataset. Once the number of groups (K) and the seeds have been chosen, the model assigns each point to the seed it is closest to, clustering it under the same group or category as that seed. The centroids are then recomputed as the mean of the points in each group, and the assignment step is repeated until the clusters stabilize. The most common method used to calculate the distance between a point and a centroid is the squared Euclidean distance.
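To make this procedure concrete, here is a minimal from-scratch sketch of the assignment and update loop; the synthetic two-blob dataset and the fixed number of iterations are assumptions made for this example, not part of any particular library:

```python
import numpy as np

def kmeans(points, k, n_iters=10, seed=0):
    """Minimal K-means: assign each point to its nearest centroid, then update centroids."""
    rng = np.random.default_rng(seed)
    # Pick K starting centroids ("seeds") at random from the dataset
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Squared Euclidean distance from every point to every centroid
        distances = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)  # nearest-centroid assignment
        # Move each centroid to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Example: two well-separated blobs of synthetic 2-D points
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(size=(50, 2)), rng.normal(size=(50, 2)) + 5])
labels, centroids = kmeans(data, k=2)
print(centroids)
```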
One of the main challenges with K-means clustering is that the number of clusters needs to be specified before the model is run. This can be difficult when the datasets are very large. If a large value for K is chosen, the model produces many small groups with more granularity, whereas a smaller value for K results in fewer, larger groups with less granularity.
This method is most often used for document clustering, image segmentation, or market segmentation.
b. Hierarchical clustering
Hierarchical clustering is another popular clustering algorithm, which creates a tree-like structure known as a dendrogram. This type of clustering can be divided into two categories: divisive and agglomerative hierarchical clustering. In divisive hierarchical clustering, also known as top-down clustering, all the data points start out assigned to a single group, which is recursively split into smaller clusters as the model is fine-tuned, until each observation forms its own cluster. Conversely, in agglomerative hierarchical clustering, also known as bottom-up clustering, each data point starts out as its own cluster, and as the model is fine-tuned, pairs of clusters are merged based on their similarity until one big cluster contains all the observations.
As with K-means clustering, measures of distance are used to evaluate the similarity between points. There are four main methods to measure the similarity between clusters:
i) Single linkage
In this method, the distance between two clusters corresponds to the minimum distance between any two points, one taken from each cluster: d(A, B) = min { d(a, b) : a ∈ A, b ∈ B }.
ii) Complete linkage
In this second method, the distance between two clusters corresponds to the maximum distance between any two points, one taken from each cluster: d(A, B) = max { d(a, b) : a ∈ A, b ∈ B }.
iii) Average linkage
In this third method, the distance between two clusters corresponds to the mean distance over all pairs of points from the two clusters: d(A, B) = the average of d(a, b) over all a ∈ A and b ∈ B.
iv) Ward’s linkage
In this fourth method, the distance between two clusters corresponds to the increase in the total sum of squares after the clusters have been merged. The goal is to minimize the total within-cluster variance at each step.
Within these methods, the Euclidean distance remains the most common metric used to calculate the distances between points in each cluster.
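As a short sketch, the four linkage criteria can be compared with SciPy’s hierarchical clustering utilities; the small random dataset and the choice of three clusters are assumptions made for this example:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))  # small synthetic 2-D dataset

# Agglomerative (bottom-up) clustering under each linkage criterion;
# Euclidean distance is the default metric.
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                    # encodes the dendrogram
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 flat clusters
    print(method, labels)
```

Here `linkage` builds the full dendrogram bottom-up, and `fcluster` cuts it into a flat set of clusters.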
c. Apriori algorithm
Apriori is an unsupervised learning algorithm that belongs to the family of association rule methods. The algorithm uses a bottom-up approach, in which frequent items, or collections of items (itemsets), are identified and used to establish association rules. It is based on the idea that any subset of a frequent itemset must itself be frequent.
The Apriori algorithm gained popularity when it started being used for market basket analysis and for music recommendations in popular streaming apps. Indeed, the algorithm makes it possible to estimate the likelihood of an individual buying or listening to item X, given that they bought or listened to item Y. Hence, this model requires records of previous behaviors or habits in order to make predictions about future items.
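As an illustration, a small market basket analysis could look like the following sketch, assuming the third-party mlxtend library and a made-up set of transactions:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical shopping baskets (one list of items per transaction)
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
    ["bread", "milk", "beer"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Frequent itemsets (bottom-up: every subset of a frequent itemset is frequent)
frequent = apriori(onehot, min_support=0.4, use_colnames=True)

# Rules such as "given Y was bought, how likely is X?"
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```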
d. Principal Component Analysis (PCA)
Principal Component Analysis, commonly referred to as PCA, is a dimensionality reduction method for unsupervised learning. This technique makes it possible to build predictive models with minimal loss of information. To do so, it takes a set of correlated variables and finds the underlying set of mutually orthogonal directions of largest variance. In other words, this method translates and rotates the coordinate system of a given dataset to find a new one.
In contrast to other models we may have seen before, PCA returns a set of vectors (the principal components) rather than a predictive function, which means it can represent any type of dataset and helps to visualize high-dimensional data.
Furthermore, since PCA is a dimensionality reduction method, its goal is to identify the directions in the dataset that drive the patterns observed. These directions are called “principal components”. The first principal component is the direction that maximizes the variance of the dataset, while the second principal component also maximizes variance but is constrained to be orthogonal to the first. Each additional principal component is orthogonal to all the previous ones and captures the largest remaining variance.
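As a minimal sketch with scikit-learn, where the synthetic correlated dataset is an assumption made for this example:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic dataset: the second feature is strongly correlated with the first
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x,
                     2 * x + rng.normal(scale=0.5, size=200),
                     rng.normal(size=200)])

pca = PCA(n_components=2)        # keep the two orthogonal directions of largest variance
X_reduced = pca.fit_transform(X)

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
```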
e. Singular Value Decomposition (SVD)
Singular Value Decomposition, also known as SVD, is another dimensionality reduction method for unsupervised learning. It decomposes an n-by-p matrix X into a product of three matrices, X = U Σ Wᵀ, where:
- Σ is an n-by-p diagonal matrix of positive numbers called the singular values of X.
- U is an n-by-n orthogonal matrix, whose columns are orthogonal unit vectors of length n called the left singular vectors of X.
- W is a p-by-p orthogonal matrix, whose columns are orthogonal unit vectors of length p called the right singular vectors of X.
In a similar fashion to PCA, this method is often used to reduce noise and the dimensionality of a dataset.
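A minimal sketch with NumPy, where the random matrix and the choice to keep two singular values are assumptions made for this example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))            # an n-by-p matrix (n=6, p=4)

# Decomposition X = U @ diag(s) @ Wt, where Wt is W transposed
U, s, Wt = np.linalg.svd(X, full_matrices=False)

# Rank-2 approximation: keep only the two largest singular values,
# reducing noise and the effective dimensionality of the data
k = 2
X_approx = U[:, :k] @ np.diag(s[:k]) @ Wt[:k, :]
print(np.linalg.norm(X - X_approx))    # reconstruction error
```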
What’s next
A variety of unsupervised learning algorithms thus exists. They can be chosen depending on the task at hand and evaluated using the usual evaluation metrics, such as those depicted in the Supervised Machine Learning and Classification articles.
Hopefully this article has given you a good understanding of these algorithms. If you wish to see some real-world examples of unsupervised machine learning, please view the previous article.