GMM EM Algorithm vs K-means

Project


Let's compare the GMM and k-means clustering algorithms. 

Dataset

Reduced data from PCA: (1990, 4)
Number of clusters: 2
Number of samples: 1990
Number of features: 4

I will also apply k-means, another unsupervised clustering algorithm, to the same task.

Before applying GMM, PCA is often used to reduce the dimensionality of the data, since GMM works best with low-dimensional data. K-means is also sensitive to dimensionality: in higher dimensions, distance calculations become less informative and the clusters less meaningful. 

The original MNIST images are 28x28 pixels (784-dimensional vectors), which calls for dimensionality reduction to make the computation cheaper and faster. To balance noise (too many dimensions) against loss of information (too few), I'll project the 784-dimensional vectors onto 4 dimensions. I'll only consider the digits "6" and "2", as this is an experiment to compare model performance.
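As a rough sketch of this preprocessing step (the loading and PCA code is not shown in this post, so I'm assuming the raw pixels sit in an array images of shape (n, 784) with the corresponding digit labels in labels, and using scikit-learn's PCA):

from sklearn.decomposition import PCA

# Keep only the digits 2 and 6 (assumed variable names: images, labels)
mask = (labels == 2) | (labels == 6)
images, labels = images[mask], labels[mask]

# Project the 784-dimensional pixel vectors onto 4 principal components
pca = PCA(n_components=4)
reduced_data = pca.fit_transform(images)

print(reduced_data.shape)  # (1990, 4) for this subset

Keeping the fitted pca object around also makes it easy to map the learned component means back to pixel space later on.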

EM (Expectation-Maximization) Algorithm

Let's first assume that labels are given for each data point. In that case, we can estimate the probability distribution of each class directly. There are various methods for this, and maximum likelihood estimation (MLE) is a very effective approach for estimating the parameters of a distribution. However, MLE can only be applied this way to a labeled dataset, because we need to know which distribution each data point was drawn from. 

To address this, we can start either by randomly assigning labels or by randomly initializing the distribution parameters. 

GMM assumes that the dataset is generated from a mixture of several normal distributions, each with its own parameters (mean, covariance, and mixture weight). To estimate the parameters of these distributions, we need the EM algorithm: GMM cannot use MLE directly because the component assignments (the latent labels) are unknown. 

The EM algorithm is employed for parametric probabilistic clustering, which assumes that the data points are generated from a mixture of several Gaussian distributions. The algorithm iteratively assigns each data point a probability of belonging to each cluster (soft assignment), which handles overlapping clusters effectively by estimating the probability distribution of each cluster. GMM trains its model by iteratively optimizing the parameters through the EM algorithm. 
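Concretely, with K Gaussian components the mixture density of a data point x_i, the E-step responsibilities, and the M-step parameter updates can be written as follows (these quantities correspond to est_phi, est_mu, est_vars, and est_w in the code below):

p(x_i) = \sum_{k=1}^{K} \phi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)

\tau_{ik} = \frac{\phi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \phi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}

\phi_k = \frac{1}{n} \sum_{i=1}^{n} \tau_{ik}, \qquad
\mu_k = \frac{\sum_{i=1}^{n} \tau_{ik} \, x_i}{\sum_{i=1}^{n} \tau_{ik}}, \qquad
\Sigma_k = \frac{\sum_{i=1}^{n} \tau_{ik} \, (x_i - \mu_k)(x_i - \mu_k)^\top}{\sum_{i=1}^{n} \tau_{ik}}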

In contrast, the k-means algorithm is a non-parametric, unsupervised method that does not assume any underlying distribution of the data. It relies solely on distance metrics, such as Euclidean distance, to partition data points into clusters.


Let's break down my code: 
  • GMM parameter initialization
    • est_mu: the initial means of the two Gaussian components are random values between 0 and 1, resulting in a (2 components x 4 features) matrix. 
    • est_vars: each covariance matrix is the dot product of a random matrix (S1 or S2) with its transpose, plus an identity matrix, which ensures the covariance matrix is positive definite (all eigenvalues are positive).
      • The diagonal elements of a covariance matrix are variances, and a variance must always be non-negative. 
    • est_phi: the mixture weights sum to 1 and are initialized evenly (0.5 for each component). 
    • est_w: entry (i, j) is the probability (responsibility) that data point i belongs to Gaussian component j.
      • The matrix is initialized with zeros and is updated in the Expectation (E) step of the EM algorithm. 
    • import numpy as np

      n_components = 2  # number of mixture components: the digits 2 and 6
      n_data = reduced_data.shape[0]  # Number of samples (1990)
      n_features = reduced_data.shape[1]  # Number of features (4)
      est_mu = np.random.rand(n_components, n_features) # Initial means (2,4)

      # Generate two random matrices for covariance initialization
      S1 = np.random.randn(n_features, n_features) # (4,4)
      S2 = np.random.randn(n_features, n_features) # (4,4)

      # Initialize two covariance matrices for each component (2,4,4)
      est_vars = np.array([
          np.dot(S1, S1.T) + np.eye(n_features),
          np.dot(S2, S2.T) + np.eye(n_features)
      ])  

      est_phi = np.full(n_components, 1 / n_components)  # Initial weights(2,)
      est_w = np.zeros((n_data, n_components))  # Responsibilities (1990,2)
  • Expectation (E-step)
    • The goal is to compute the responsibilities for each data point.
    •     from scipy.stats import multivariate_normal

          for i_data in range(n_data):
              # Get the i-th sample
              sample = reduced_data[i_data, :]

              # Calculate the likelihoods (probability density)
              # multivariate_normal.pdf
              # computes the probability density of a multivariate normal dist.
              l1 = multivariate_normal.pdf(sample, mean=est_mu[0, :],
                                           cov=est_vars[0, :, :])
              l2 = multivariate_normal.pdf(sample, mean=est_mu[1, :],
                                           cov=est_vars[1, :, :])
             
              total = l1 * est_phi[0] + l2 * est_phi[1] # total likelihood
              est_w[i_data, 0] = (l1 * est_phi[0]) / total
              est_w[i_data, 1] = (l2 * est_phi[1]) / total

  • Maximization (M-step)
    • Update the current estimates of the parameters. 
      • Mixture weight (est_phi) 
      • Means
      • Covariance
      • Log-likelihood 


    •     # Update mixture weights
          est_phi = np.sum(est_w, axis=0) / n_data

          # Update means (responsibility-weighted averages of the data)
          est_mu[0, :] = np.dot(est_w[:, 0], reduced_data) / np.sum(est_w[:, 0])
          est_mu[1, :] = np.dot(est_w[:, 1], reduced_data) / np.sum(est_w[:, 1])

          # Update covariances (responsibility-weighted outer products)
          est_vars[0, :, :] = np.dot(
              est_w[:, 0] * (reduced_data - est_mu[0, :]).T,
              reduced_data - est_mu[0, :]) / np.sum(est_w[:, 0])
          est_vars[1, :, :] = np.dot(
              est_w[:, 1] * (reduced_data - est_mu[1, :]).T,
              reduced_data - est_mu[1, :]) / np.sum(est_w[:, 1])

          # Log-likelihood of the data under the updated parameters
          mixture_density = (
              est_phi[0] * multivariate_normal.pdf(
                  reduced_data, mean=est_mu[0, :], cov=est_vars[0, :, :])
              + est_phi[1] * multivariate_normal.pdf(
                  reduced_data, mean=est_mu[1, :], cov=est_vars[1, :, :])
          )
          log_likelihood = np.sum(np.log(mixture_density))
          log_likelihoods.append(log_likelihood)
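A rough sketch of how the outer EM loop could be organized is shown below. Here e_step and m_step are hypothetical wrappers around the two code blocks above (with m_step also appending the current log-likelihood to log_likelihoods), and tol and max_iter are illustrative values that do not appear in the original code.

max_iter = 200   # iteration cap (illustrative value)
tol = 1e-6       # convergence tolerance on the log-likelihood (illustrative value)
log_likelihoods = []

for iteration in range(max_iter):
    # E-step: compute responsibilities (hypothetical wrapper around the code above)
    est_w = e_step(reduced_data, est_phi, est_mu, est_vars)
    # M-step: update parameters and record the log-likelihood (hypothetical wrapper)
    est_phi, est_mu, est_vars = m_step(reduced_data, est_w)

    # Stop once the log-likelihood barely changes between iterations
    if iteration > 0 and abs(log_likelihoods[-1] - log_likelihoods[-2]) < tol:
        print(f"Converged after {iteration + 1} iterations")
        break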
I iterate the EM algorithm this way until the changes in the parameters converge. Below is a visualization of the EM process for the first feature; the first image shows the initial distribution with randomly selected parameters. 


Convergence was achieved after 37 iterations. The numerical weights for each component after convergence are shown below: 
  • Component 1: weight = 0.513099597283767
  • Component 2: weight = 0.4869004027162331
These weights represent the relative proportions of each component (cluster) in the mixture model: about 51.3% of the data points are assigned to component 1 (digit 2), while 48.7% are assigned to component 2 (digit 6). 

Now, let's visualize the average image represented by each component (digits 2 and 6) of the GMM to understand how each component corresponds to specific image characteristics. 
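Here is one way this could be done, assuming the fitted pca object from the earlier sketch is still available so that inverse_transform can map the 4-dimensional component means back to 784-dimensional pixel space (this is my reconstruction, not the original plotting code):

import matplotlib.pyplot as plt

# Map each component mean (4-dim) back to pixel space: shape (2, 784)
mean_images = pca.inverse_transform(est_mu)

fig, axes = plt.subplots(1, 2, figsize=(6, 3))
for j, ax in enumerate(axes):
    ax.imshow(mean_images[j].reshape(28, 28), cmap="gray")
    ax.set_title(f"Component {j + 1}")
    ax.axis("off")
plt.show()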

It looks pretty dope. 


How about the accuracy? 
from sklearn.metrics import accuracy_score

# Get the inferred labels using τik (highest responsibility for each data point)
gmm_inferred_labels = np.argmax(est_w, axis=1)

# GMM: compare inferred cluster labels with the true labels,
# mapping digit "2" -> 0 and digit "6" -> 1 in the labels array
# (if the GMM happened to swap the two clusters, the inferred labels
#  would need to be flipped, as is done for K-means below)
true_labels_2_and_6 = labels[(labels == 2) | (labels == 6)]  
gmm_pred_2_and_6 = gmm_inferred_labels[(labels == 2) | (labels == 6)]  

# Convert the true labels to binary: 2 -> 0, 6 -> 1
binary_true_labels = np.where(true_labels_2_and_6 == 2, 0, 1)

# Calculate GMM misclassification rate for 2 and 6
gmm_accuracy = accuracy_score(binary_true_labels, gmm_pred_2_and_6)
gmm_misclassification_rate = 1 - gmm_accuracy

GMM Misclassification Rate: 0.0382

Now, let's compare this accuracy with K-means. 
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(reduced_data)

# K-means predicted labels
kmeans_labels = kmeans.labels_

# Align cluster IDs with the digit labels: if most true "2"s ended up in
# cluster 1, flip the labels so that digit 2 -> 0 and digit 6 -> 1
if np.mean(kmeans_labels[(labels == 2)]) > 0.5:
    kmeans_labels = 1 - kmeans_labels  

# Filter K-means labels for digits 2 and 6
kmeans_pred_2_and_6 = kmeans_labels[(labels == 2) | (labels == 6)]

# Calculate K-means misclassification rate for 2 and 6
kmeans_accuracy = accuracy_score(binary_true_labels, kmeans_pred_2_and_6)
kmeans_misclassification_rate = 1 - kmeans_accuracy

K-means Misclassification Rate: 0.0638

Conclusion

This project demonstrated that GMM outperforms K-means on this task (a misclassification rate of 3.8% versus 6.4%), particularly because its soft assignments handle overlapping clusters better. However, GMM is computationally more expensive and slower to train, making K-means a faster option for large datasets. The choice between the two algorithms should depend on the dataset characteristics and the goals of the analysis, with GMM being more suitable when a detailed model of the data distribution is needed. 



