This piece is the latest in a series, called “Machine Learning Is Not Magic,” covering how to get started in machine learning, using familiar tools such as Excel, Python, Jupyter Notebooks and cloud services from Azure and Amazon Web Services. Check back here each Friday for future installments.
When you are getting started with machine learning algorithms, it’s a great idea to work through the formulas in Excel. It will give you a thorough understanding of the concept behind each algorithm. But to build repeatable machine learning models that work with new data points, we have to use mature frameworks and tools. Once you are familiar with the concepts, you can start using higher-level libraries like NumPy and Scikit-learn in Python. In the upcoming parts of this tutorial, I will walk you through the process of configuring and using Python with the same use case, based on the Stack Overflow salary calculator.
In the last installment of this tutorial, I introduced the concept of linear regression through Microsoft Excel. We used the LINEST function to validate our assumptions, and also used it to predict salary for values falling outside of the limits of the original dataset.
In this part, we will understand how to streamline linear regression for accuracy and precision. In that process, we will explore the “learning” component of machine learning.
Now that we have the basic understanding of Linear Regression, let’s take a closer look at the learning part of Machine Learning.
If a tool like Microsoft Excel can do ML, what’s all the hype and buzz about? If ML is just about applying the right algorithm to data, where is the learning and training aspect? Let’s try to get an answer for that.
Remember, when compared to the actual salary, our prediction was off by about $100 in either direction. Though this may not matter much in simple scenarios, the gap can grow much wider in complex datasets, making the predictions inaccurate and almost useless.
The goal of Machine Learning is to combine existing data with a predefined algorithm like linear regression to minimize the gap between the actual and the predicted values. In our scenario, based on 10 rows, we assumed that the salary increases by $1,800 for each year of experience. But what if Stack Overflow pays better for developers with 10+ years of experience? If it adds $2,200 for each year beyond 10 years of experience, our assumption breaks down. Our formula is hardwired to use $1,800 as the delta, so its predictions drift further from the actual values for any input above 10. This scenario emphasizes the need for additional data. For ML, more data generally means better accuracy, as long as the data is representative. This is one of the reasons why public cloud providers encourage customers to bring their data to their respective platforms.
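To make the hardwired-delta problem concrete, here is a minimal sketch. The $1,800 and $2,200 figures come from the scenario above; the starting salary of $100,000 and the function names are my own assumptions, chosen only for illustration.

```python
# Hypothetical starting salary; an assumption for this sketch only.
BASE_SALARY = 100_000
RAISE_FIRST_DECADE = 1_800   # per-year increase assumed by our formula
RAISE_AFTER_DECADE = 2_200   # the "what if" increase after 10 years


def hardwired_prediction(years):
    """Formula that assumes the same $1,800 raise for every year."""
    return BASE_SALARY + RAISE_FIRST_DECADE * years


def actual_salary(years):
    """Assumed pay scale: $1,800 per year up to 10 years, $2,200 per year after."""
    first_decade = min(years, 10)
    later_years = max(years - 10, 0)
    return (BASE_SALARY
            + RAISE_FIRST_DECADE * first_decade
            + RAISE_AFTER_DECADE * later_years)


def prediction_gap(years):
    """How far the hardwired formula falls short of the assumed actual salary."""
    return actual_salary(years) - hardwired_prediction(years)


for years in (5, 10, 15, 20, 30):
    print(years, prediction_gap(years))
```

With these assumed numbers, the gap is zero up to 10 years and then grows by $400 for every additional year, which is exactly the kind of drift that more data would reveal.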
For now, let’s assume that we have access to data that reflects the pay scale from 0 to 30 years. This is good enough for us to understand the salary variances between 10, 20, and 30 years.
In the above chart, the red dots reflect the predictions and the red dotted lines represent the gap. So, how do we go about minimizing the difference between the actual and predicted value?
With additional data available, we can generate a different combination of y-intercept and slope. By constantly comparing the output of the algorithm (the prediction) with the actual values, we can find the combination of y-intercept and slope that produces the minimum error. In other words, we need to keep trying different values of a and b in the equation y = a + bx till the predicted values fall on a straight line that’s as close as possible to the actual values.
There are multiple metrics, such as Mean Absolute Error (MAE) and Root Mean Squared Deviation (RMSD), that measure how close the predictions are to the actual values. These metrics are also known as cost functions, since they express the cost of the errors an algorithm makes. Optimization techniques like Stochastic Gradient Descent and Nonlinear Conjugate Gradient are then applied to complex and large datasets to drive that cost down and minimize the error in prediction.
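The idea of trying different values of a and b can be sketched as a simple brute-force search. This is only an illustration under my own assumptions: the five data points are hypothetical (not the dataset from this series), the candidate grid is arbitrary, and real tools use far smarter search strategies.

```python
# Hypothetical sample: (years of experience, salary). Not the article's dataset.
data = [(0, 100_000), (5, 109_000), (10, 118_000), (20, 140_000), (30, 162_000)]


def sum_squared_error(a, b, points):
    """Total squared gap between each actual y and the prediction a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in points)


def best_candidate(candidates, points):
    """Return the (a, b) pair with the smallest total squared error."""
    return min(candidates,
               key=lambda ab: sum_squared_error(ab[0], ab[1], points))


# Try a coarse grid of intercepts (a) and slopes (b) around plausible values.
candidates = [(a, b)
              for a in range(95_000, 105_001, 1_000)
              for b in range(1_500, 2_501, 100)]

a, b = best_candidate(candidates, data)
print("best intercept:", a, "best slope:", b)
```

Because the sample includes higher raises after 10 years, the winning pair does at least as well as the original guess of a = 100,000 and b = 1,800, which is also one of the candidates in the grid.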
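For readers who want to see the two error measures in action, here is a small sketch in plain Python. The salary figures are hypothetical and the function names are my own; libraries such as Scikit-learn ship their own versions of these metrics.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of the absolute gaps between actual and predicted values."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_deviation(actual, predicted):
    """Square root of the average squared gap; penalizes large misses more."""
    squared = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Hypothetical actual salaries and predictions, for illustration only.
actual = [100_000, 109_000, 118_000, 140_000]
predicted = [100_100, 108_900, 118_000, 139_600]

print(mean_absolute_error(actual, predicted))
print(round(root_mean_squared_deviation(actual, predicted), 2))
```

For this sample, the MAE is 150 while the RMSD is about 212.13. The single $400 miss pulls the RMSD up more than the MAE, which is why the two measures can rank the same set of predictions differently.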
Since my objective is to help you demystify machine learning, I am staying away from explaining complex math involved in some of these techniques.
The key takeaway for you from this discussion is that ML applies these techniques to the same dataset, over and over, till the gap between actual values and predictions is as small as possible. So, when we take a large dataset, apply a mathematical formula to it, and iterate multiple times to minimize the error, we are training an algorithm. This process results in the algorithm “learning” from existing data to ultimately arrive at accurate values for the y-intercept (a) and slope (b).
Once those two values are derived, we can start applying the proven formula, y = a + bx, to any data point. The final formula with the most accurate parameters (a and b for Linear Regression) that’s applied to production data is called an ML model. This is a tested and trusted equation with all the constants that came from the training process. Since the parameters are tuned based on the historical data, the model is now ready to deal with data points that may be outside of the known values.
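Here is a minimal sketch of what “iterate multiple times to minimize the error” can look like. It uses plain batch gradient descent on the mean squared error, a simpler cousin of the stochastic variant mentioned earlier. The data is hypothetical: experience is measured in decades and salary in thousands of dollars so that small numbers keep the loop stable, and the learning rate and iteration count are assumptions I picked for this toy dataset.

```python
def train_linear_regression(xs, ys, learning_rate=0.05, iterations=5000):
    """Batch gradient descent on mean squared error for y = a + b*x."""
    a, b = 0.0, 0.0          # start with a deliberately bad guess
    n = len(xs)
    for _ in range(iterations):
        # Gap between the current predictions and the actual values.
        errors = [(a + b * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error w.r.t. a and b.
        grad_a = 2 * sum(errors) / n
        grad_b = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        # Nudge both parameters in the direction that reduces the error.
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


# Hypothetical data: decades of experience vs. salary in thousands of dollars.
decades = [0, 1, 2, 3]
salary_k = [100, 118, 136, 154]

a, b = train_linear_regression(decades, salary_k)
print(round(a, 3), round(b, 3))
```

Since the sample points lie exactly on a line, the loop should settle very close to a = 100 and b = 18, that is, a $100,000 base and $1,800 per year. Each pass through the loop is one round of “learning.”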
But how do we know that an ML algorithm has reached the desired level of accuracy? We find out by splitting the dataset into two parts: training data and testing data. It is a common practice to use 75 percent of the dataset for training and the remaining 25 percent for testing. The larger part is used for deriving the inferences and correlations, while the smaller part is held back for comparing the results. When the predictions of the trained algorithm come close to the actual values in the test data, the model is ready to deal with unseen data points.
One of the many mechanisms to measure the accuracy of a model is the coefficient of determination. Without getting into the maths involved, the coefficient of determination assigns a score to the model, which typically falls between 0 and 1 (it can even dip below 0 when a model fits worse than simply predicting the average). If a model scores 0.20, it explains only about 20 percent of the variation in the actual values. The aim is obviously to get a value that’s closer to 1.0, which indicates a better fit.
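For the curious, the coefficient of determination is short enough to sketch in a few lines. The function name and the sample values (salaries in thousands of dollars) are my own assumptions for illustration.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual error / total variation)."""
    mean_actual = sum(actual) / len(actual)
    # Squared gaps between the actual values and the model's predictions.
    ss_residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    # Squared gaps between the actual values and their plain average.
    ss_total = sum((y - mean_actual) ** 2 for y in actual)
    return 1 - ss_residual / ss_total


# Hypothetical salaries in thousands of dollars.
actual = [100, 118, 136, 154]

perfect = [100, 118, 136, 154]      # every prediction is spot on
average_only = [127, 127, 127, 127]  # always predict the average salary
backwards = [154, 136, 118, 100]    # predictions run in the wrong direction

print(r_squared(actual, perfect))
print(r_squared(actual, average_only))
print(r_squared(actual, backwards))
```

The three cases score 1.0, 0.0, and -3.0 respectively: a perfect model, a model no better than guessing the average, and a model that is worse than guessing the average.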
So, the goal of an ML program is to reduce the difference between actual and predicted values. The process of finding the correlation between features and using that to increase the accuracy of a model is the training part of ML.
The below illustration explains the typical workflow involved in Machine Learning:
- Step 1a: The major part of the original dataset is split off into a subset for training.
- Step 1b: The smaller part of the dataset is considered for testing.
- Step 2a: The training dataset is passed to an ML algorithm like Linear Regression.
- Step 2b: The testing dataset is populated and kept ready for evaluating.
- Step 3a: The algorithm is applied to each data point from the training dataset.
- Step 3b: The parameters are then applied to the test dataset to compare the outcome.
- Step 3c: This process is iterated till the algorithm generates values that are a close match to the test data.
- Step 4: The final model is evolved with the right set of parameters tuned for the given dataset.
- Step 5: The model is used in production to predict based on the new data points.
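The steps above can be sketched end to end in plain Python. Treat this as an illustration under stated assumptions: the dataset is synthetic (a hypothetical $100,000 base, $1,800 per year, and a small alternating wobble), the helper names are mine, and the fit uses the closed-form least squares formula, so the iteration in Step 3c collapses into a single calculation.

```python
import random


def split_dataset(rows, train_fraction=0.75, seed=42):
    """Steps 1a and 1b: shuffle, then cut into training and testing subsets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_line(rows):
    """Steps 2a and 3a: ordinary least squares for y = a + b*x."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
         / sum((x - mean_x) ** 2 for x, _ in rows))
    a = mean_y - b * mean_x
    return a, b


def mean_absolute_error(rows, a, b):
    """Steps 2b and 3b: compare predictions with the held-back test rows."""
    return sum(abs(y - (a + b * x)) for x, y in rows) / len(rows)


# Synthetic, hypothetical dataset covering 0 to 30 years of experience.
rows = [(years, 100_000 + 1_800 * years + (50 if years % 2 else -50))
        for years in range(31)]

train_rows, test_rows = split_dataset(rows)
a, b = fit_line(train_rows)
test_error = mean_absolute_error(test_rows, a, b)


def predict(years):
    """Steps 4 and 5: the final model, applied to a new data point."""
    return a + b * years


print(len(train_rows), len(test_rows))
print(round(test_error, 2), round(predict(35)))
```

The split leaves 23 rows for training and 8 for testing. The test rows never influence the values of a and b; they are only used to check how well the trained formula holds up.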
As we can see, the crux of machine learning lies in what happens in step three. Once the right algorithm is chosen, its parameters are tuned iteratively and checked against the test data to arrive at accurate predictions. This process represents the “learning” part of Machine Learning. A learned algorithm transforms into a model that can be used with production data.
Apart from Linear Regression, there are many other popular machine learning algorithms. Supervised examples include logistic regression, the Naïve Bayes classifier, K Nearest Neighbor, the Support Vector Machine, artificial neural networks, and random forests. K Means clustering and the Apriori algorithm are also widely used, though they belong to unsupervised learning because they work without known answers.
Each supervised algorithm is designed to either predict a value or classify data. For example, logistic regression is used to predict a boolean outcome, true or false. Linear regression, as we have seen, is used in scenarios where the prediction is a number. Other algorithms like Support Vector Machine and K Nearest Neighbor are used for classification. Irrespective of the algorithm, the goal is to use existing data to find accurate parameters to evolve the final model.
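To see how a true-or-false prediction differs from a numeric one, here is a small sketch of the idea behind logistic regression. It reuses the familiar a + bx, squeezes the result into a probability between 0 and 1, and then applies a cut-off. The parameters and the “is this developer senior?” question are hypothetical; in practice a and b would come from training.

```python
import math


def sigmoid(z):
    """Squeeze any number into the range 0 to 1."""
    return 1 / (1 + math.exp(-z))


def predict_probability(x, a, b):
    """Probability of the 'true' outcome for input x."""
    return sigmoid(a + b * x)


def classify(x, a, b, threshold=0.5):
    """Turn the probability into a boolean answer."""
    return predict_probability(x, a, b) >= threshold


# Hypothetical parameters: the answer flips from False to True at 5 years.
A, B = -5.0, 1.0

for years in (2, 5, 8):
    print(years, round(predict_probability(years, A, B), 3), classify(years, A, B))
```

With these made-up parameters, two years of experience yields a low probability and a False, while eight years yields a high probability and a True.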
Since I covered a lot of ground, it’s time for us to do a quick recap of the concepts without the jargon:
Machine Learning: The ability to build logic based on existing data without the need for explicit programming.
Supervised Machine Learning: The process of predicting or classifying data based on an existing structure and a set of known values.
Algorithm: The mathematical or statistical formula used for prediction, classification, or grouping of data. In the context of ML, it’s a formula applied to an existing dataset.
Cost Function: The function that measures the difference between actual values and predicted values. Training aims to make it as small as possible.
Coefficient of Determination: A value, typically between 0 and 1, that quantifies how well a model’s predictions fit the actual values. The higher the value, the better the fit.
Model: A fully trained and tested algorithm that can be used with new data points outside of the original dataset used by the algorithm.
In the next part of the series, I will touch upon the importance of mathematics in Machine Learning. You will understand what topics of maths and stats are key for building a career in data science. Stay tuned.