Classification (LDA, QDA, Naive Bayes, Logistic Regression, KNN)

Project


EDA

The purpose of this project is to verify the conclusions from the EDA about which classification model would perform well on this dataset. The data contains 392 rows (different cars) and 8 columns of car attributes (mpg, cylinders, displacement, horsepower, weight, acceleration, year, origin). To turn the problem into a classification task rather than a regression task, mpg is converted into a binary variable: 0 for below-average fuel efficiency and 1 for above-average.
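Here is a minimal sketch of that setup. The file name Auto.csv, the use of pandas, and the mean as the "average" cutoff are assumptions rather than details taken from the original code (the median is a common alternative cutoff):

```python
import pandas as pd

# Load the Auto dataset (file name/path is an assumption)
df = pd.read_csv("Auto.csv")

# Binary target: 1 if mpg is above the sample average, 0 otherwise
df["mpg01"] = (df["mpg"] > df["mpg"].mean()).astype(int)

# Predictor columns used throughout the rest of this post
features = ["cylinders", "displacement", "horsepower", "weight",
            "acceleration", "year", "origin"]
X = df[features]
y = df["mpg01"]
```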

QQ PLOT

The QQ plot results indicate that only acceleration follows a normal distribution, while other features, such as displacement, horsepower, and weight, exhibit heavy tails. Features with heavy tails deviate from normality, which can hurt models that assume normally distributed features within each class (LDA) or a linear relationship between the features and the log odds (logistic regression).
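For reference, a sketch of how these Q-Q plots can be produced with scipy, reusing df and features from the loading sketch above (the exact tooling and layout are assumptions):

```python
import matplotlib.pyplot as plt
from scipy import stats

# Q-Q plot of each feature against a normal distribution
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
for ax, col in zip(axes.ravel(), features):
    stats.probplot(df[col], dist="norm", plot=ax)
    ax.set_title(col)
axes.ravel()[-1].set_visible(False)  # only 7 features, hide the 8th panel
plt.tight_layout()
plt.show()
```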

SCATTER PLOT

The first scatter plot shows the relationship between cylinders and mpg. It is difficult to discern a clear pattern because the data is heavily concentrated at cylinder values of 4, 6, and 8, though the trend is generally negative.

Displacement, horsepower, and weight all exhibit a negative relationship with mpg, indicating that as these values increase, mpg tends to decrease. Acceleration shows a positive relationship with mpg, but the higher variation suggests that this is not a strong or consistent correlation. The year has a slight positive relationship with mpg, though this effect appears to be minimal.

Aside from these observations, no other distinct patterns are evident from the scatter plot. 
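A minimal sketch for reproducing these scatter plots, again reusing df and features from the loading sketch (styling details are assumptions):

```python
import matplotlib.pyplot as plt

# Scatter plot of each feature against mpg
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
for ax, col in zip(axes.ravel(), features):
    ax.scatter(df[col], df["mpg"], alpha=0.5)
    ax.set_xlabel(col)
    ax.set_ylabel("mpg")
axes.ravel()[-1].set_visible(False)
plt.tight_layout()
plt.show()
```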

BOX PLOT
From the box plot analysis, several key insights can be observed regarding the central tendency (median), spread/variability, outliers, and symmetry/skewness between the two classes (mpg = 0 and mpg = 1). Most factors exhibit significant differences in median values between the two mpg categories, indicating that lower fuel efficiency (mpg = 0) tends to be associated with higher values for features such as displacement, horsepower, weight, and cylinders.

The interquartile range (IQR) is notably larger for features like cylinders and origin, suggesting greater variability in lower mpg (mpg = 0) for cylinders and higher mpg (mpg = 1) for origin. Outliers are also present in the higher mpg class for features like cylinders, displacement, horsepower, weight, and acceleration.

Additionally, skewness is evident for certain features such as cylinders, horsepower, and year, particularly in the lower mpg category, where the longer whiskers suggest a lack of symmetry. This asymmetry may influence the separation between the two mpg categories, especially for features with non-normal distributions.
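A sketch of the box plots, grouped by the binary target (seaborn is assumed as the plotting library):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Box plot of each feature split by the binary mpg class
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
for ax, col in zip(axes.ravel(), features):
    sns.boxplot(x="mpg01", y=col, data=df, ax=ax)
axes.ravel()[-1].set_visible(False)
plt.tight_layout()
plt.show()
```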

KDE Plot
The KDE plots for most features exhibit a clear unimodal peak for higher mpg (orange line), suggesting a strong relationship with higher mpg values. In contrast, the lower mpg group (blue line) tends to show no distinct peak and greater variability, indicating that these features might not be as strongly associated with lower mpg. 

For horsepower and weight, there are signs of bimodal peaks; however, the separation between these peaks is not pronounced enough to confirm them as bimodal. Acceleration shows a similar distribution for both higher and lower mpg, with comparable peak points, suggesting it may not be a strong predictor of fuel efficiency. 

Conversely, the year feature displays greater variability compared to other features, with distinct peaks for both higher and lower mpg. Older cars tend to cluster in the lower mpg category, indicating that year is more strongly associated with mpg, as newer cars generally exhibit higher fuel efficiency.
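A sketch of the KDE plots, with one density curve per class; seaborn's hue argument is assumed to be what produced the blue/orange lines described above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# KDE of each feature, one curve per mpg class
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
for ax, col in zip(axes.ravel(), features):
    sns.kdeplot(data=df, x=col, hue="mpg01", common_norm=False, ax=ax)
axes.ravel()[-1].set_visible(False)
plt.tight_layout()
plt.show()
```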

HEATMAP
The heatmap reveals that mpg is strongly and negatively correlated with cylinders, displacement, horsepower, and weight, indicating that as these features increase, mpg tends to decrease. Additionally, cylinders, displacement, horsepower, and weight are strongly and positively correlated with each other, suggesting that these features are interrelated: vehicles with a higher value in any one of them tend to have higher values in the others.
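A sketch of the correlation heatmap (Pearson correlations via pandas, rendered with seaborn; both are assumptions about the tooling):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pearson correlation matrix of mpg and the predictors
corr = df[["mpg"] + features].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```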

Linear Discriminant Analysis (LDA)

LDA is a supervised classification algorithm used in machine learning and pattern recognition. Its primary objective is to find a linear combination of features that maximizes the separability between classes. LDA makes two key assumptions about the data when it is used for classification:
  1. Gaussian Distribution: The data points within each class follow a Gaussian (normal) distribution.
  2. Identical Covariance Matrices: The variance and relationships between features are the same across all classes.
The decision boundaries are linear, which makes LDA easy to understand and interpret. LDA is also effective at classifying data into multiple classes. However, if the true boundary between classes is nonlinear, LDA might not perform well.

Since most features do not have a linear relationship with the target variable (mpg), LDA is unlikely to be the best choice for this data. In addition, the strong correlations between features shown in the heatmap may further reduce its performance.
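To make the model comparisons concrete, here is a minimal sketch of fitting LDA on a held-out test set with scikit-learn. The 80/20 split and random_state are assumptions (the original split is not shown); the later sketches reuse this same split:

```python
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# One train/test split reused for every model sketch below
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
print("LDA testing error:", 1 - lda.score(X_test, y_test))
```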

Quadratic Discriminant Analysis (QDA)

QDA is another supervised classification algorithm similar to LDA, but it does not assume that the classes share an identical covariance matrix. This allows more flexibility in modeling the relationship between features and lets the model handle situations where the variance or spread of the data differs between classes. QDA produces quadratic (rather than linear) decision boundaries, which can be curved.

There are some non-linear patterns and differences in spread that QDA is able to capture, likely because the covariance matrices differ between the two classes.
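Continuing from the split defined in the LDA sketch, a QDA fit is nearly identical:

```python
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Each class gets its own covariance matrix, so the boundary can curve
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)
print("QDA testing error:", 1 - qda.score(X_test, y_test))
```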

Naive Bayes 

Naive Bayes calculates the prior probability of each class and the likelihood of each feature given that class. For a new observation, the posterior for each class is obtained by multiplying the class prior with the feature likelihoods, and the data point is assigned to the class with the higher posterior.

The reason Naive Bayes is called "naive" is that it treats every feature as contributing independently, simply multiplying all the per-feature probabilities together. Two different features with the same conditional probability contribute identically to the posterior, regardless of how they actually interact. Ignoring these interactions gives the model high bias but low variance.

In any case, Naive Bayes assumes feature independence, which, as the heatmap shows, does not hold in this dataset because of the high correlations between features such as displacement, horsepower, and weight.
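A sketch using the Gaussian variant of Naive Bayes on the same split (the specific variant is an assumption; it is the natural fit for the continuous features here):

```python
from sklearn.naive_bayes import GaussianNB

# Class priors times per-feature Gaussian likelihoods, with features
# treated as conditionally independent given the class
nb = GaussianNB().fit(X_train, y_train)
print("Naive Bayes testing error:", 1 - nb.score(X_test, y_test))
```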

Logistic Regression

Logistic Regression, like LDA, assumes a linear relationship between the features and the log odds of the target variable. Since the scatter plots reveal non-linear relationships between most features and mpg, logistic regression may not perform well. In addition, the multicollinearity among the features may further reduce the model's effectiveness.
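A sketch of the logistic regression fit on the same split (raising max_iter so the default solver converges on the unscaled features is an assumption, not a setting from the original run):

```python
from sklearn.linear_model import LogisticRegression

# Models the log odds of mpg01 as a linear function of the features
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Logistic Regression testing error:", 1 - logreg.score(X_test, y_test))
```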

KNN

KNN is a non-parametric model that makes no assumptions about the distribution of the data, which gives it the flexibility to pick up patterns in non-linear data. Among the models compared here, KNN (with K=3) performs the best.
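A sketch of the KNN fit with K=3 on the same split (no feature scaling is applied here, matching the simplest setup; scaling is often recommended for distance-based models):

```python
from sklearn.neighbors import KNeighborsClassifier

# Distance-based and non-parametric: no distributional assumptions
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("KNN (K=3) testing error:", 1 - knn.score(X_test, y_test))
```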

The presence of outliers and the variability in the higher mpg group suggest that simpler models like LDA and Logistic Regression might struggle to accommodate these variations.

The KDE plots display clear separability between the two classes (mpg01=1 and mpg01=0), with most features showing unimodal peaks, suggesting that the data is relatively well-separated. However, the slight bimodal patterns and transitions for some features indicate the need for more flexible models, such as KNN or QDA, to capture complex patterns in the data. 

Based on the EDA, KNN and QDA are likely to perform well because they are capable of capturing non-linear relationships and variations in the data. KNN, in particular, excels at finding patterns without making strong assumptions about the data distribution, while QDA allows for non-linear decision boundaries by modeling each class with its own covariance structure. These models can adapt to complex data structures, which makes them effective for datasets where linear separability is limited.

RESULT

Classification model               Testing error
Linear Discriminant Analysis       0.12658
Quadratic Discriminant Analysis    0.11392
Naïve Bayes                        0.12658
Logistic Regression                0.12658
KNN (K=3)                          0.07595

I was right :)


Let's try other models that can capture non-linear relationships and variations in the data!

SVM with Non-linear Kernel

When testing the SVM with a linear kernel, the testing error was 0.13924. The linear kernel may not effectively capture the complexities in the data. Additionally, SVM requires feature scaling for optimal performance, since it relies on distance calculations. After scaling the features and using the non-linear 'rbf' kernel, the testing error dropped to 0.12658, but the model still did not perform particularly well. So I tried the 'poly' kernel, and the testing error dropped significantly to 0.0886!
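A sketch comparing the three kernels on the same split. Hyperparameters are left at scikit-learn defaults, and the features are scaled inside a pipeline for every kernel here (the original run scaled them only before the non-linear kernels), so the numbers may differ slightly from those above:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scale the features, then fit an SVM with each kernel
for kernel in ["linear", "rbf", "poly"]:
    svm = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    svm.fit(X_train, y_train)
    print(f"SVM ({kernel}) testing error:", 1 - svm.score(X_test, y_test))
```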

The main difference between the 'poly' and 'rbf' kernels lies in how they map the input data to a higher-dimensional feature space to find the optimal separating hyperplane. I will post separately regarding this topic :) 

Random Forest

Random Forest consists of multiple decision trees, enabling it to capture complex non-linear relationships that simpler models might overlook. This ensemble method also helps smooth out noise and variations in the dataset. The testing error for the Random Forest model was 0.06329, demonstrating its robustness on this data. This is the best model so far!
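A sketch of the Random Forest fit on the same split (n_estimators and random_state are assumptions; the original settings are not reported):

```python
from sklearn.ensemble import RandomForestClassifier

# Ensemble of decision trees; averaging over trees smooths out noise
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Random Forest testing error:", 1 - rf.score(X_test, y_test))
```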
