Hello, Habr! I present to you the translation of the article "9 Key Machine Learning Algorithms Explained in Plain English" by Nick McCullum .

Machine Learning (MO) is already changing the world. Google uses IOs to offer and display responses to user searches. Netflix uses it to recommend you movies for the evening. And Facebook uses it to offer you new friends you may know.

Machine learning has never been so important and, at the same time, so difficult to learn. This area is full of jargons, and the number of different MO algorithms is growing every year.

This article will introduce you to the fundamental concepts in machine learning. More specifically, we will discuss the basic concepts of the 9 most important MO algorithms today.

Recommendation system

To build a complete system of recommendations with 0, deep knowledge in linear algebra is required. Because of this, if you have never studied this discipline, it may be difficult for you to understand some of the concepts in this section.

But don't worry - the scikit-learn Python library makes it pretty easy to build a CP. So you don’t need such deep knowledge in linear algebra to build a working SR.

How does CP work?

There are 2 main types of recommendation system:

  • Content-based
  • Collaborative filtering

A content-based system makes recommendations based on the similarity of the elements you have already used. These systems behave absolutely the way you expect the SR to behave.

Collaborative CP filtering provides recommendations based on knowledge of how the user interacts with elements (* note: interactions with elements of other users similar in behavior to this user are taken as the basis). In other words, they use the “wisdom of the crowd” (hence the “collaborative” in the name of the method).

In the real world, collaborative SR filtering is far more common than a content-based system. This is mainly due to the fact that they usually give a better result. Some specialists also find the collaborative system easier to understand.

Collaborative CP filtering also has a unique feature that is not found in the content-based system. Namely, they have the ability to learn features on their own.

This means that they can even begin to define similarity in elements based on properties or traits that you did not even provide for the operation of this system.

There are 2 subcategories of collaborative filtering:

  • Based on a model
  • Based on the neighborhood

The good news is: you don’t have to know the difference in these two types of collaborative SR filtering in order to be successful in MO. Just know that there are several types.

To summarize

Here is a brief summary of what we learned about the recommendation system in this article:

  • Real-world recommendation system examples
  • Different types of recommendation system and why collaborative filtering is used more often than content-based system
  • The relationship between the recommendation system and linear algebra

Linear Regression

Linear regression is used to predict some value of y based on some set of values ​​of x.

History of linear regression

Linear regression (LR) was invented in 1800 by Francis Galton. Galton was a scientist studying the relationship between parents and children. More specifically, Galton investigated the relationship between the growth of fathers and the growth of their sons. Galton's first discovery was the fact that the growth of sons, as a rule, was approximately the same as the growth of their fathers. Which is not surprising.

Later, Galton discovered something more interesting. The growth of a son, as a rule, was closer to the average-general growth of all people than to the growth of his own father.

Galton gives the name to this phenomenon - regression . In particular, he said: "The growth of the son tends to regress (or to shift in the direction) of average growth."

This has led to a whole field of statistics and machine learning called regression.

Mathematics of linear regression

In the process of creating a regression model, all we are trying to do is draw a line as close to each point in the data set as possible.

A typical example of such an approach is the “least squares method” of linear regression, which calculates the proximity of a line in the up-down direction.

Example to illustrate:

ITKarma picture

When you create a regression model, your final product is an equation with which you can predict y values ​​for x without knowing y in advance.

Logistic Regression

Logistic regression is similar to linear, with the exception of the fact that instead of calculating the value of y, it estimates which category this data point belongs to.

What is logistic regression?

Logistic regression is a machine learning model used to solve classification problems.

Below are a few examples of the classification problems of MO:

  • Email spam (spam or not spam?)
  • Car insurance claim (compensation or repair?)
  • Diagnosis of diseases

Each of these tasks has clearly 2 categories, which makes them examples of binary classification problems.

Logistic regression is well suited for solving binary classification problems — we simply assign the values ​​0 and 1, respectively, to different categories.

Why do we need logistic regression? Because you cannot use linear regression for binary classification forecasts. It just won't fit, as you will try to draw a straight line through a dataset with two possible values.

This image may help to understand why linear regression is not suitable for binary classification:

ITKarma picture

In this image, the y axis represents the probability that the tumor is malignant. A value of 1 represents the likelihood that the tumor is benign. As you can see, the linear regression model works very poorly to predict the probability for most observations in the data set.

This is why a logistic regression model is useful. It has a bend to the line of better fit, which makes it (the model) much more suitable for predicting qualitative (categorical) data.

Here is an example that illustrates a comparison of linear and logistic regression models on the same data:

ITKarma picture

The Sigmoid Function

The reason logistic regression has a bend is the fact that it does not use a linear equation to calculate it. Instead, the logistic regression model is based on the use of a sigmoid (also called a logistic function, because it is used in logistic regression).

It is not necessary for you to thoroughly memorize a sigmoid in order to succeed in MO. Still, having some idea of ​​this feature will be useful.

Sigmoid Formula:

ITKarma picture

The main characteristic of a sigmoid that is worth sorting out is that it doesn’t matter what value you pass to this function, it will always return a value between 0-1.

Using the logistic regression model for predictions

To use logistic regression for predictions, you usually need to accurately determine the cutoff point. This cut-off point is usually 0.5.

Let's use our cancer diagnosis example from the previous chart to see this principle in practice.If the logistic regression model yields a value below 0.5, then this data point will be categorized as a benign tumor. Similarly, if a sigmoid produces a value above 0.5, then the tumor will be classified as malignant.

Using the error matrix to measure the effectiveness of logistic regression

The error matrix can be used as a tool for comparing true positive, truly negative, false positive and false negative indicators in the Moscow Region.

The error matrix is ​​particularly useful when used to measure the effectiveness of a logistic regression model. Here is an example of how we can use the error matrix:

ITKarma picture

In this table, TN means “true negative,” FN “false negative,” FP “false positive,” TP “true positive.”
The error matrix is ​​useful for evaluating the model whether it contains “weak” quadrants in the error matrix. As an example, it can have an abnormally large number of false positives.

It is also quite useful in some cases, in order to make sure that your model works correctly in a particularly dangerous area of ​​the error matrix.

In this example of cancer diagnosis, for example, you would like to be sure that your model does not have too many false positives, because it will mean that you have diagnosed someone’s malignant tumor as benign.

To summarize

In this section, you had your first acquaintance with the MO model - logistic regression.
Here is a summary of what you learned about logistic regression:

  • Types of classification problems that are suitable for solving using logistic regression
  • The logistic function (sigmoid) always gives a value from 0 to 1
  • How to use clipping points for prediction using a logistic regression model
  • Why the error matrix is ​​useful for measuring the effectiveness of a logistic regression model

K-Nearest Neighbors Algorithm

The k-nearest neighbors algorithm can help solve the classification problem when there are more than 2 categories.

What is the k-nearest neighbor algorithm?

This is a classification algorithm that is based on a simple principle. In fact, the principle is so simple that it is better to demonstrate it with an example.

Imagine that you have data on the height and weight of football and basketball players. The k-nearest neighbors algorithm can be used to predict whether the new player is a soccer player or a basketball player. To do this, the algorithm determines the K data points closest to the object of study.

This image demonstrates this principle with the parameter K=3:

ITKarma picture

In this image, the football players are shown with blue marks, and the basketball players with orange marks. The point we are trying to classify is colored green. Since the majority (2 of 3) of the marks closest to the green dot are colored in blue (football players), then the K-nearest neighbors algorithm predicts that the new player will also be a football player.

How to build a K-nearest neighbor algorithm

The main steps to build this algorithm:

  1. Collect all the data
  2. Calculate the Euclidean distance from the new data point x to all other points in the data set
  3. Sort the points from the data set in increasing order of distance to x
  4. Predict the answer using the same category as most K-closest x data

Importance of the variable K in the K-nearest-neighbor algorithm

Although this may not be obvious from the start, changing the K value in this algorithm will change the category that the new data point will fall into.

More specifically, a too small value of K will result in your model predicting accurately on the training data set, but it will be extremely inefficient for test data. Also, having an indicator that is too high K will make the model unreasonably complex.

The illustration below shows this effect perfectly:

ITKarma picture

Pros and Cons of the K-Nearest Neighbor Algorithm

To summarize getting to know this algorithm, let's briefly discuss the advantages and disadvantages of using it.


  • The algorithm is simple and easy to understand
  • Trivial model training on new training data
  • Works with any number of categories in the classification task
  • Easily add more data to a lot of data
  • The model takes only 2 parameters: K and the distance metric that you would like to use (usually this is Euclidean distance)


  • The high cost of computing, because you need to process the entire amount of data
  • Doesn’t work as well with categorical parameters

To summarize

Summary of what you just learned about the K-nearest neighbors algorithm:

  • An example of a classification problem (soccer players or basketball players) that the algorithm can solve
  • How this algorithm uses the Euclidean distance to neighboring points to predict which category the new data point belongs to
  • Why K parameter values ​​are important for forecasting
  • Pros and Cons of Using the K-Nearest Neighbor Algorithm

Decision Trees and Random Forests

The decision tree and the random forest are 2 examples of the tree method. More precisely, decision trees are MO models used to predict one by one cyclically looking at each function in a data set. Random forest is an ensemble (committee) of decision trees that use random orders of objects in a dataset.

What is a tree method?

Before we dive into the theoretical foundations of the tree method in MO, it would be worthwhile to start with an example.

Imagine playing basketball every Monday. Moreover, you always invite the same friend to go play with you. Sometimes a friend really comes, sometimes not. The decision to come or not depends on many factors: what weather, temperature, wind and fatigue. You start to notice these features and track them along with your friend’s decision to play or not.

You can use this data to predict whether your friend will come today or not. One technique you can use is the decision tree. Here's what it looks like:

ITKarma picture

Each decision tree has 2 types of elements:

  • Nodes: the places where the tree is divided depending on the value of a particular parameter
  • Edges: the result of the division leading to the next node

You can see that there are nodes in the diagram for forecasting (outlook), humidity (humidity) and wind
(windy). And also faces for each potential value of each of these parameters.

Here are a couple more definitions you need to understand before we get started:

  • Root - the node from which the tree splitting begins
  • Leaves - final nodes that predict the final result

Now you have a basic understanding of what a decision tree is. We will look at how to build such a tree from scratch in the next section.

How to build a decision tree from scratch

Building a decision tree is more complicated than it might seem. This is because deciding on which branches (characteristics) to share your data (which is a topic from the field of entropy and data acquisition) is a mathematically difficult task.

To resolve it, MO specialists usually use many decision trees, using random sets of characteristics chosen to divide the tree into them. In other words, new random sets of characteristics are selected for each individual tree, on each separate partition. This technique is called random forests.

In the general case, specialists usually choose the size of a random set of characteristics (denoted by m) so that it is the square root of the total number of characteristics in the data set (denoted by p). In short, then m is the square root of p, and then a particular characteristic is randomly selected from m.

Benefits of Using Random Forest

Imagine that you are working with a lot of data, which has one “strong” characteristic. In other words, there is a characteristic in this data set that is much more predictable with respect to the final result than the other characteristics of this set.

If you build a decision tree manually, then it makes sense to use this feature for the “top” partition in your tree. This means that you will have several trees whose forecasts are highly correlated.

We want to avoid this, because using the average of highly correlated variables does not significantly reduce variance. Using random sets of characteristics for each tree in a random forest, we decorrelate the trees and the variance of the resulting model is reduced. This decorrelation is a major advantage in using random forests compared to hand-made decision trees.

To summarize

So, a summary of what you just learned about decision trees and random forests:

  • An example of a problem whose solution can be predicted using the decision tree
  • Elements of the decision tree: nodes, faces, roots and leaves
  • How the use of a random set of characteristics allows us to build a random forest
  • Why using a random forest to decorrelate variables can be useful to reduce the variance of the resulting model

Support Vector Machines

The support vector method is a classification algorithm (although, technically speaking, they can also be used to solve regression problems), which divides a lot of data into categories at the places of the largest “gaps” between categories. This concept will become clearer if you look at the following example.

What is the support vector method?

The support vector method (MOV) is a MO model with a teacher, with appropriate learning algorithms that analyze data and recognize patterns. MOV can be used both for classification problems and for regression analysis. In this article, we will look specifically at using the support vector method to solve classification problems.

How does the MOB work?

Let's dig deeper into how the MOV really works.

We are given a set of training examples, each of which is marked as belonging to one of 2 categories, and with the help of this set the MOB builds a model. This model distributes new examples into one of two categories. This makes the MOB an incredible binary linear classifier.

DOM uses geometry to make predictions by category. More specifically, the reference vector method model compares data points as points in space and divides them into separate categories so that they are separated by as wide a gap as possible. The forecast that new data points belong to a certain category is based on which side of the gap the point is.

Here is an example of visualization that helps you understand the intuition of MOB:

ITKarma picture

As you can see, if a new data point falls to the left side of the green line, then it will be assigned to “red”, and if to the right - then to “blue”. This green line is called the hyperplane, and is an important term for working with MOV.

Let's see the following visual representation of the MOB:

ITKarma picture

In this diagram, the hyperplane is designated as the “optimal hyperplane”. The theory of the support vector method gives such a definition of an optimal hyperplane - it is a hyperplane that maximizes the field between the two nearest data points from different categories.

As you can see, the field boundary really affects 3 data points - 2 from the red category and 1 from the blue. These points, which are in contact with the field boundary, are called support vectors - this is where the name comes from.

To summarize

Here is a brief outline of what you just learned about the support vector method:

  • MOB is an example of a MO algorithm with a teacher
  • The reference vector can be used to solve both classification problems and regression analysis.
  • How MOU categorizes data using a hyperplane that maximizes the field between categories in a dataset
  • That data points that touch the boundaries of a separating field are called reference vectors. From which the name of the method came.

K-Means Clustering Method

The K-means method is a machine-learning algorithm without a teacher. This means that it accepts untagged data and tries to group clusters of similar observations of your data. The K-means method is extremely useful for solving real applied problems. Here are examples of several tasks suitable for this model:

  • Customer segmentation for marketing groups
  • Document Classification
  • Optimize shipping routes for companies like Amazon, UPS, or FedEx
  • Detection and response to criminal locations in the city
  • Professional sports analytics
  • Prediction and Prevention of Cybercrime

The main goal of the K-means method is to divide the data set into distinguishable groups so that the elements within each group are similar to each other.

Here is a visual representation of how it looks in practice:

ITKarma picture

We will study the mathematical component of the K-means method in the next section of this article.

How does the K-means method work?

The first step in the K-means method is to select the number of groups into which you want to divide your data. This quantity is the value of K, reflected in the name of the algorithm. The choice of the K value in the K-means method is very important. A little later, we will discuss how to choose the correct value of K.

Next, you must randomly select a point in the data set and assign it to a random cluster. This will give you the starting position of the data at which you run the next iteration until the clusters stop changing:

  • Calculation of the centroid (center of gravity) of each cluster by taking the average vector of points in this cluster
  • Relocate each data point to a cluster whose centroid is closest to the point

Choosing the appropriate K value in the K-means method

Strictly speaking, choosing the appropriate K value is quite difficult. There is no “right” answer in choosing the “best” value for K. One of the methods that MO experts often use is called the “elbow method.”
To use this method, the first thing you need to do is calculate the sum of squared errors - the standard deviation for your algorithm for the group of values ​​K. The standard deviation in the K-means method is defined as the sum of the squared distances between each data point in the cluster and the center of gravity of this cluster.

As an example of this step, you can calculate the standard deviation for the values ​​of K 2, 4, 6, 8, and 10. Next, you will want to generate a graph of the standard deviation and these values ​​of K. You will see that the deviation decreases with increasing value of K.

And it makes sense: the more categories you create based on a lot of data - the more likely it is that each data point will be close to the center of the cluster of that point.
With this in mind, the main idea of ​​the elbow method is to choose the value of K at which the standard deviation sharply slows down the rate of decline. This sharp decline forms an “elbow” on the chart.

As an example, here is a graph of the standard deviation relative to K. In this case, the elbow method will suggest using a value of K approximately equal to 6.

ITKarma picture

It is important that K=6 is simply an estimate of the acceptable value of K. There is no “better” value of K in the K-means method. Like many things in the field of civil defense, this is a very situation-specific decision.

To summarize

Here is a brief outline of what you just learned in this section:

  • Examples of tasks of the Moscow State University without a teacher that can be solved by the K-means method
  • Basic principles of the K-means method
  • How the K-Means method works
  • How to use the elbow method to select the appropriate value of the parameter K in this algorithm

Principal Component Analysis

The principal component method is used to transform a data set with many parameters into a new data set with fewer parameters, and each new parameter of this data set is a linear combination of earlier existing parameters. These converted data seeks to justify much of the variance of the original data set with much greater simplicity.

What is the main component method?

The Principal Component Method (CIM) is an MO technique that is used to study the relationships between sets of variables. In other words, the CIM is studying sets of variables in order to determine the basic structure of these variables. MGK is also sometimes called factor analysis.

Based on this description, you might think that MGK is very similar to linear regression. But this is not so. In fact, these 2 techniques have several important differences.

Differences between linear regression and CIM

Linear regression determines the line of best fit through a dataset. The principal component method determines several orthogonal lines of best fit for a dataset.

If you are not familiar with the term orthogonal, then it simply means that the lines are at right angles to each other, like north, east, south and west on the map.
Let's look at an example to help you understand this better.

ITKarma picture

Take a look at the axis labels in this image. The main component of the x axis explains 73% of the variance in this dataset. The main component of the y axis explains about 23% of the variance of the dataset.

This means that 4% of the variance remains unexplained. You can reduce this number by adding more major components to your analysis.

To summarize

Summary of what you just learned about the main component method:

  • CIM is trying to find orthogonal factors that determine variability in a dataset
  • The difference between linear regression and CIM
  • What orthogonal main components look like in the dataset
  • That adding additional major components can help explain variance more precisely in the dataset