Classification is the task of predicting a class based on the available data. It is a common and useful problem in many areas, such as medicine, marketing, and banking. A simple example of a classification problem that you face daily is email: a classification algorithm decides whether each message is spam or not. That is binary classification. There are several types of classification in machine learning:
- Binary Classification
- Multi-Class Classification
- Multi-Label Classification
- Imbalanced Classification
In this article, we’ll take a look at classification types in machine learning. We’ll deal with binary classification and imbalanced classes simultaneously.
The following structure applies to every classifier in this article. The first step is to choose the classifier. The second is to train the model: at this stage, the classifier fits the data, matching features to their class labels. The model learns from previously known data and its categories, which is why classification is a supervised learning task. The classifier tries to establish relationships between the input features and their class. After training the model, you can proceed to predicting new data. The last step is to check the quality of the tuned algorithm. We evaluate the models with the F1 score, the harmonic mean of precision and recall. It has the following calculation formula:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
We will also build the confusion matrix by class to check the distribution of predictions between classes.
Representation of a confusion matrix (source)
Processing time is another crucial parameter for comparing classification algorithms.
In this article, we will consider the following four classification algorithms and their implementation in Python:
- Logistic regression
- Naive Bayes
- Support Vector Machines (SVM)
- Random Forest
Dataset
The Bank Marketing Data Set is a dataset from the UCI Machine Learning Repository. The data was collected through calls to customers (features such as job, education, and loans). The goal is to predict whether a client will subscribe to a term deposit or not (label y).
The dataset has 41,188 rows, 20 feature columns, and one target column. Label y has two values, “yes” (1) and “no” (0), so the classification is binary. The majority of y values are 0, which means we are dealing with imbalanced classes.
y column label distribution
Some of the columns are categorical. These columns are converted to numerical form (with LabelEncoder) before applying the classification algorithms. It is also good to standardize the dataset (with StandardScaler) for optimal model performance. Additionally, we define a weight variable that corresponds to the inverse of the y class distribution:
w = {0:11, 1:89}
These weights will be assigned to the classes during model training to compensate for the imbalance. Next, we will look at and apply some of the best classification algorithms.
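A minimal preprocessing sketch covering these steps, assuming the data has been downloaded from the UCI repository as a semicolon-separated CSV (the file name, the 80/20 split, and the random seed are illustrative assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Load the UCI Bank Marketing data (file name is an assumption;
# the repository distributes a semicolon-separated CSV).
df = pd.read_csv("bank-additional-full.csv", sep=";")

# Convert every categorical column, including the label y, to integers.
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])

X = df.drop(columns="y")
y = df["y"]

# Standardize the features for optimal model performance.
X = StandardScaler().fit_transform(X)

# Hold out a test set (an 80/20 split is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Class weights inversely proportional to the label distribution.
w = {0: 11, 1: 89}
```

The later snippets reuse `X_train`, `X_test`, `y_train`, `y_test`, and `w` from this step.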
Logistic Regression
Logistic regression is one of the most basic and standard methods for tackling classification challenges in machine learning. It seeks the most appropriate relationship between a target variable and a set of independent variables. The algorithm is similar to linear regression but additionally applies a logistic function, which maps the output to a probability between 0 and 1; an observation is assigned to the positive class when its probability exceeds 0.5. The logistic curve separating the two classes has an S shape.
Logistic regression line separating classes (source)
Advantages. This classifier is a fairly simple algorithm with high efficiency. It calculates the probability of an observation belonging to a class.
Disadvantages. The basic algorithm works only for binary classification. All predictors are assumed to be independent of each other.
Python implementation of Logistic Regression for our bank dataset:
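A minimal sketch of this step, assuming the preprocessed splits and the weight dictionary `w` from the Dataset section (the `max_iter` value is an assumption to ensure convergence):

```python
import time

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score

start = time.time()

# Weighted logistic regression to compensate for the class imbalance.
lr = LogisticRegression(class_weight=w, max_iter=1000, random_state=42)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

print(confusion_matrix(y_test, y_pred))   # per-class prediction counts
print("F1:", f1_score(y_test, y_pred))    # harmonic mean of precision/recall
print("Time:", time.time() - start)       # processing time in seconds
```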
The confusion matrix shows the distribution of correctly and incorrectly predicted observations. The first row corresponds to class “0” labels: 6,189 observations are marked properly, but 1,105 are not. Class “1” is in the second row, in reverse order: 826 marks are correct, and 118 are not. Note that the model predicted most of the observations of each class properly. We will compare the F1 score and time of all models in the score table at the end.
Naive Bayes
This algorithm is a classifier based on Bayes’ theorem. It assumes all predictors are independent of each other: the presence of a feature and its values does not depend on the other columns in the dataset.
P(A|B) = P(B|A) × P(A) / P(B)

Bayes’ theorem (source)
P(A|B) is the probability of event A occurring when event B occurs.
P(B|A) is the probability of event B occurring when event A occurs.
P(A) and P(B) are the probabilities of events A and B occurring on their own.
Gaussian Naive Bayes is a classification technique based on Naive Bayes under the assumption that the features follow a normal (Gaussian) distribution.
Advantages. The classifier works well even with a small dataset and is fast compared to other classifiers.
Disadvantages. The classifier relies on the features being independent, which is rarely the case in the real world.
Python implementation of Naive Bayes for the bank dataset:
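A minimal sketch of this step, reusing the splits from the Dataset section (note there is no class-weight parameter to pass here):

```python
import time

from sklearn.metrics import confusion_matrix, f1_score
from sklearn.naive_bayes import GaussianNB

start = time.time()

# GaussianNB has no class_weight hyperparameter, so the model is
# trained on the unweighted data.
nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print("F1:", f1_score(y_test, y_pred))
print("Time:", time.time() - start)
```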
As before, the model correctly predicted most class “0” and “1” labels. Let’s compare the distribution of categories with Logistic Regression. The predictions improved for class “0” by almost 150 observations, but label “1” was predicted much worse, by nearly 300. Given the unbalanced classes, this lowered the F1 metric. Note that Gaussian Naive Bayes in scikit-learn doesn’t have a class-weight hyperparameter like the other classifiers in this article.
Support Vector Machine (SVM)
SVM is an algorithm for regression and classification problems that has contributed significantly to the achievements of artificial intelligence. The concept is to find a hyperplane that separates points of different categories, where the points in space represent the training data. The points of one class should be separated from the other class by the widest possible distance, called the margin. New data is similarly transformed into points in this space and labeled according to which side of the hyperplane it falls on.
SVM space with optimal hyperplane (source)
Advantages. The algorithm is efficient in high-dimensional spaces. It works well when there is a clear margin between the categories.
Disadvantages. Noise in the dataset degrades the result. SVM is not suitable for large datasets.
Python implementation of SVM for the bank dataset:
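A minimal sketch of this step with the weighted splits from the Dataset section (the default RBF kernel is an assumption):

```python
import time

from sklearn.metrics import confusion_matrix, f1_score
from sklearn.svm import SVC

start = time.time()

# Weighted support vector classifier; the kernel is left at the
# default RBF, which is an assumption.
svm = SVC(class_weight=w, random_state=42)
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print("F1:", f1_score(y_test, y_pred))
print("Time:", time.time() - start)
```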
Like the previous algorithms, the SVM predicted most of the observations in each class correctly. According to the confusion matrix, its per-class counts of correct predictions fall between those of Logistic Regression and Naive Bayes. An important point is the model’s running time: it is several hundred times longer than that of the previous algorithms.
Random Forest
Random Forest is not one of the basic classification algorithms: it is an ensemble of several Decision Trees combined into one algorithm. So first, let’s discuss the Decision Tree method.
This algorithm solves both classification and regression problems and is widely used in practice. The model builds a tree structure by creating a sequence of rules. Step by step, these rules split the original dataset into smaller and smaller subsets, producing a decision tree with decision nodes and leaf nodes. The algorithm simulates human thinking, and you can easily follow its logic.
Decision tree example with nodes (source)
An ensemble of Decision Trees has greater predictive power than a single tree, and Random Forest combines them for better results. Deep trees can suffer from overfitting, so Random Forest builds each tree on a random subset of the dataset, which mitigates the issue. Finally, all trees vote to make the prediction decision.
Representation of Random Forest algorithm (source)
Advantages. It is a more accurate algorithm than a single Decision Tree and much less subject to overfitting. Because decisions are made by rules, the method works with categorical and continuous data as well as missing values, and the dataset does not need to be normalized.
Disadvantages. A Random Forest consists of many trees, so it requires more computing power and more training time, which slows down real-time prediction. The model can also be complex to tune because of the many hyperparameters to select.
Python implementation of Random Forest for the bank dataset:
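A minimal sketch of this step with the weighted splits from the Dataset section (the number of trees is an assumption):

```python
import time

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score

start = time.time()

# Weighted forest; 100 trees is an illustrative assumption.
rf = RandomForestClassifier(n_estimators=100, class_weight=w, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print("F1:", f1_score(y_test, y_pred))
print("Time:", time.time() - start)
```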
The ensemble of Decision Trees singled out and predicted class “1” very well; the classifier has the most accurate result in this category compared to the previous ones. In contrast, Random Forest predicted class “0” worst of all. Being an ensemble of trees, it completed the task much faster than SVM but still a little slower than the first two algorithms.
K-Nearest Neighbor overview
K-Nearest Neighbor (K-NN) is one of the most popular classification methods in machine learning. It is a supervised learning algorithm that can be used for both regression and classification tasks. This section provides an overview of K-NN, one of the best machine learning algorithms for classification.
The K-NN algorithm works by choosing the number of nearest neighbors (K), calculating the distance between a new data point and the training points, and assigning the new point the class of the majority of its K nearest neighbors. The distance measure used can be the Euclidean, Manhattan, or Minkowski distance.
One of the strengths of K-NN is that it can work well with both small and large datasets. It is a non-parametric algorithm, meaning it makes no assumptions about the underlying data distribution. Moreover, K-NN has no explicit training phase: it stores the training data and defers all computation to prediction time. It can be used for both binary and multi-class classification problems.
However, K-NN has some limitations. It can be sensitive to noisy or irrelevant features, and it can also suffer from the curse of dimensionality. Additionally, K-NN can be computationally expensive for large datasets since it requires calculating the distance between each data point and all the other data points.
Despite these limitations, K-NN remains one of the best machine learning algorithms for classification. It has been used in a wide range of applications, including image recognition, text classification, and anomaly detection.
In summary, K-NN is a simple and effective non-parametric algorithm for classification tasks that works with small and large datasets and with both binary and multi-class problems. Despite its limitations, it remains one of the best machine learning algorithms for classification.
K-Nearest Neighbor code
Implementing K-NN in code is relatively straightforward. One can use various libraries like scikit-learn, TensorFlow, and PyTorch to implement the K-NN algorithm.
For instance, scikit-learn is a popular machine learning library that provides the KNeighborsClassifier class to implement K-NN in Python. Below is a code snippet that demonstrates how to create a K-NN model and predict with it using scikit-learn:
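A minimal sketch using the bank dataset splits from the Dataset section (the hyperparameter values here are illustrative assumptions; “minkowski” with p=2 is equivalent to the Euclidean distance):

```python
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.neighbors import KNeighborsClassifier

# The hyperparameter values are illustrative: 5 neighbors and the
# Minkowski metric with p=2 (i.e., Euclidean distance).
knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
knn.fit(X_train, y_train)          # "training" just stores the data
y_pred = knn.predict(X_test)       # majority vote among the K neighbors

print(confusion_matrix(y_test, y_pred))
print("F1:", f1_score(y_test, y_pred))
```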
In the above code, “n_neighbors” specifies the number of neighbors to consider, “metric” specifies the distance measure to use, and “p” is the power parameter for the Minkowski distance. Once the model is created, it can be trained on the training data using the “fit” method and can be used to predict the labels for the test data using the “predict” method.
Overall, implementing K-NN in code with libraries like scikit-learn is simple. Thanks to these libraries, K-NN has become one of the most accessible classification algorithms in machine learning.
Comparative analysis. Conclusion
All algorithms classified most of the observations in each class correctly. However, their results differed, with each predicting class “0” or class “1” more accurately, as we have seen in the confusion matrices. This influenced the F1 metric of each classifier.
Summary table:
Let’s visualize these parameters.
F1 score of classifiers
Processing time of classifiers
The table and graphs show that Logistic Regression and SVM have the best results in classifying both categories according to the F1 score, and Logistic Regression is much faster than SVM. Overall, Logistic Regression proved to be the optimal classifier for the Bank Marketing Data Set.
It is important to note that Logistic Regression is not the best classification algorithm for all cases. Each problem is individual, with specific features in its dataset and its own goals. You can analyze the data and apply different classification algorithms; evaluating model performance with metrics determines the appropriate classifier.
There are many more types of classifiers for solving classification problems. In this article, we have covered the most basic and popular methods, but there are still many classifiers left for your own research.
FAQ
What is the best classification model?
There is no perfect model for all classification problems. You need to explore the dataset and compare different algorithms to find what works best for you.
How do you choose classification algorithms for comparing?
You need to analyze the dataset, then make a list of algorithms whose advantages and disadvantages suit the specifics of your data.
How do you choose the right model based on a few metrics?
You should find the model with the best metric scores. If no model performs ideally on every metric, seek a balance between the models’ scores. Sometimes you have to sacrifice processing time or accuracy to advance a real-life project.