Data / Development / Machine Learning

Machine Learning and Linear Regression for Mere Mortals

21 Apr 2017 8:17am, by

This piece is the latest in a series, called “Machine Learning Is Not Magic,” covering how to get started in machine learning, using familiar tools such as Excel, Python, Jupyter Notebooks and cloud services from Azure and Amazon Web Services. Check back here each Friday for future installments. 

In the previous part of this series, we attempted to solve a problem based on a few assumptions drawn from an existing dataset. We will now validate those assumptions by testing it in Microsoft Excel. It’s also time for me to map the concepts we discussed so far with the official Machine Learning terminology.

Tom M. Mitchell, a renowned American computer scientist and professor at Carnegie Mellon University, is widely quoted for his definition of Machine Learning: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.”

Let’s try to translate the above definition. Machine learning is the ability to build logic based on existing data without the need of explicit programming. Based on certain algorithms, machine learning programs can derive their logic without a programmer explicitly writing the code.

ML algorithms can deal with data that has a reasonably well-defined structure. Our Stack Overflow salary dataset has multiple columns such as category, location, experience, and salary. Though it is a smaller dataset, almost all the rows have complete data from which the algorithms can draw inferences.

The approach of learning from existing data that has been sufficiently labeled along with the known outcome is called as supervised machine learning. In the dataset that we are dealing with, the known outcome — the salary — is already populated for almost all the rows. When this data is used to train an algorithm, it can learn from existing relationship between location, experience and salary of an employee. We can then predict the salary for data that may not be available within the dataset.

On the other hand, when ML is used to identify and group new data based on an existing dataset, it is called unsupervised machine learning. For example, when we input many resumes of technology professionals to an algorithm, it may create a cluster that groups individuals with similar profiles into architects, developers, and administrators. In this case, we don’t train an algorithm with structured data. It’s also important to understand that there is no specific prediction (like salary) except the logical grouping of similar data points.

Though there are other forms of ML, these are the most widely used. Based on my personal experience, I recommend starting with supervised ML.

Given the simplicity, the scope of this introductory series is confined to supervised ML.

Within supervised ML, there are multiple techniques that can be applied to an existing dataset to either predict or classify the data. Each technique is called as an algorithm. Like any other computer science algorithm, ML algorithms are based on mathematical and statistical formulae. Interestingly, these equations and formulae existed for over a century with heavy usage in the field of finance and accounting.

When we apply large data sets to these traditional formulae, we end up getting interesting results. Machine learning helps us tweak and optimize the formulae for each unique data set.

So, it is safe to assume that machine learning is a field of computer science that combines data and mathematics to evolve programs that can predict and classify new data. The learning part of machine learning implies tweaking the formula to create assumptions that are very specific to the input dataset. As we have seen earlier, the assumptions vary with each dataset.

Let’s get back to the use case that we started with. In the Stack Overflow dataset, we have multiple columns that influence one key value — salary. In the official terminology of ML, these columns are called features. The value (salary) that we ultimately end up predicting is called the label.

Deriving Features and Labels from Stack Overflow Dataset.

In the previous lesson, we derived a few assumptions from the existing data set to guestimate the salary:

The assumption was based on a simple fact that the starting salary is $103,100 and the increase with each additional year of experience is roughly $1,800. We then went about applying this assumption to predict the salary of an individual with six years of experience.

Predicted Salary = Starting Salary + 1800 * Years

So, when we wanted to find the salary for six years of experience, the calculation would be:

$113900 = $103100 + $1800 * 6

We can now go about finding the salary for any number of years of experience, and we would get decent results. You can test this assumption with different values and checking the output.

This formula has no clue whether you are finding the salary or house prices or the rainfall. It would just work with any combination of feature and label. We can safely replace the headers, experience, and salary with x and y.

Let’s call the starting salary as a and increase in salary as b to derive a formula that looks like y  = a + bx

Congratulations! You learned your first ML algorithm. The above formula is officially known as linear regression. When we use just one feature (years of experience) to find the value of a label (salary), it is called simple linear regression. Remember that other features such as skill level, location, and position also have an influence on the label (salary). When we consider all of them to predict the label, we use an algorithm called multivariate linear regression. For now, let’s stay focused on understanding simple linear regression with just one feature.

Let’s plot this data in a simple chart to visualize it better.

Technically, the starting point of y in the above chart is called y-intercept. The change in y with each increment in x is called the slope. According to linear regression, a new value of y can be found by multiplying x with the slope and adding it to the y-intercept.

It’s time for us to take our algorithm to the next level through a simple validation.

Do you know that Microsoft Excel does a linear regression? The formula for that existed in Excel for over a decade. Let’s try to implement our new algorithm in the most familiar tool that we use often.

In the above spreadsheet, x represents experience, and y represents the salary. Using an Excel formula called LINEST, we can find the y-intercept and slope of the dataset. LINEST is a function that returns an array, which corresponds to the y-intercept and slope values. We select two cells, type the formula {=LINEST(C3: C23,B3:B23)}, and hit Shift+Control+Enter to populate the selected cells with values.

Based on the first 20 rows, Excel reported 1840.5 as slope and 103118.6 as the y-intercept. Interestingly, these values are not very different from what we found from our assumptions.

With slope (b) and y-intercept (a) in place, we can go ahead and calculate the salary for a given number of years in experience.

I would recommend you to experiment with LINEST in Excel with your dataset. This will validate and clarify your understanding of linear regression. You can access the spreadsheet from GitHub.

In the next part of this tutorial, we will take a closer look at Linear Regression and learn about the techniques to make the predictions more accurate. Stay tuned!

Feature image via Pixabay.

A newsletter digest of the week’s most important stories & analyses.

View / Add Comments

Please stay on topic and be respectful of others. Review our Terms of Use.