Simple Linear Regression
Regression means predicting an output for a new input based on previous learning.
Basically, Regression Analysis allows us to discover whether there’s a relationship
between one or more independent variables and a dependent variable (the target). For
example, in a Simple Linear Regression we want to know if there’s a
relationship between x and y. This is very useful in forecasting (e.g. where is
the trend going) and time series modelling (e.g. temperature levels by year and
whether they show a warming trend).
Here we’ll be dealing with one independent variable and one dependent variable. Later
on we’ll be dealing with multiple variables and show how they can be used to
predict the target (similar to what we talked about predicting something based
on several features/attributes).
For now, let’s see an example of a Simple Linear Regression wherein we
analyze Salary Data (Salary_Data.csv). Here’s the dataset (comma-separated
values; the two columns are years of experience and salary):
YearsExperience,Salary
1.1,39343.00
1.3,46205.00
1.5,37731.00
2.0,43525.00
2.2,39891.00
2.9,56642.00
3.0,60150.00
3.2,54445.00
3.2,64445.00
3.7,57189.00
3.9,63218.00
4.0,55794.00
4.0,56957.00
4.1,57081.00
4.5,61111.00
4.9,67938.00
5.1,66029.00
5.3,83088.00
Here’s the Python code for fitting Simple Linear Regression to the Training set:
# Importing the libraries
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)

# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the Test set results
y_pred = regressor.predict(X_test)

# Visualising the Training set results
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

# Visualising the Test set results
plt.scatter(X_test, y_test, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
The overall goal here is to create a model that will predict Salary based on
Years of Experience. First, we create a model using the Training Set (two-thirds
of the dataset). It will then fit a line that is as close as possible to most of the data.
After the line is created, we then apply that same line to the Test Set (the
remaining one-third of the dataset).
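To make "as close as possible" more concrete: LinearRegression in scikit-learn uses ordinary least squares, which picks the line with the smallest sum of squared vertical distances to the data points. The sketch below computes the slope and intercept by hand with the standard least-squares formulas and checks them against scikit-learn. To keep it short, it assumes only the rows listed above and fits all of them (no Training/Test split); the names b0, b1 and model are just illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The rows listed in the dataset above (years of experience, salary).
years = np.array([1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2,
                  3.7, 3.9, 4.0, 4.0, 4.1, 4.5, 4.9, 5.1, 5.3])
salary = np.array([39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445,
                   57189, 63218, 55794, 56957, 57081, 61111, 67938, 66029, 83088],
                  dtype=float)

# Least-squares estimates for the line: salary = b0 + b1 * years
#   b1 = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
#   b0 = mean_y - b1 * mean_x
x_mean, y_mean = years.mean(), salary.mean()
b1 = np.sum((years - x_mean) * (salary - y_mean)) / np.sum((years - x_mean) ** 2)
b0 = y_mean - b1 * x_mean

# scikit-learn expects a 2D feature array, hence the reshape.
model = LinearRegression().fit(years.reshape(-1, 1), salary)

print(f"by hand:      slope={b1:.2f}, intercept={b0:.2f}")
print(f"scikit-learn: slope={model.coef_[0]:.2f}, intercept={model.intercept_:.2f}")
```

Both lines of output should show the same slope and intercept, since both are solving the same least-squares problem.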
Notice that the line performed well both on the Training Set and the Test Set.
As a result, there’s a good chance that the line or our model will also perform
well on new data.
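Looking at the plots is one way to judge the fit. A rough numerical check is to compare the R² score (the coefficient of determination) on both sets. This is a minimal sketch that assumes only the rows listed above, so the numbers may differ from what you get with your own copy of Salary_Data.csv.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# The rows listed in the dataset above; X must be 2D for scikit-learn.
X = np.array([1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2,
              3.7, 3.9, 4.0, 4.0, 4.1, 4.5, 4.9, 5.1, 5.3]).reshape(-1, 1)
y = np.array([39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445,
              57189, 63218, 55794, 56957, 57081, 61111, 67938, 66029, 83088],
             dtype=float)

# Same split settings as in the code above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

regressor = LinearRegression().fit(X_train, y_train)

# score() returns R^2: 1.0 is a perfect fit, values near 0 mean the line
# does little better than always predicting the mean salary.
r2_train = regressor.score(X_train, y_train)
r2_test = regressor.score(X_test, y_test)
print(f"R^2 on Training Set: {r2_train:.3f}")
print(f"R^2 on Test Set:     {r2_test:.3f}")
```

If the Test Set score is close to the Training Set score, that supports the idea that the line will hold up on new data. A much lower Test Set score would be a warning sign.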
Let’s have a recap of what happened. First, we imported the necessary
libraries (pandas for processing data, matplotlib for data visualization). Next,
we imported the dataset and assigned Years of Experience to X (the independent
variable) and Salary to y (the target). We then split the dataset into Training
Set (2⁄3) and Test Set (1⁄3).
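If you want to verify the import and the split yourself, here is a small sketch. It assumes only the rows listed above and reads them from a string through io.StringIO (a stand-in for the CSV file, so the example runs without Salary_Data.csv on disk).

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for Salary_Data.csv, using the rows listed above.
csv_text = """YearsExperience,Salary
1.1,39343.00
1.3,46205.00
1.5,37731.00
2.0,43525.00
2.2,39891.00
2.9,56642.00
3.0,60150.00
3.2,54445.00
3.2,64445.00
3.7,57189.00
3.9,63218.00
4.0,55794.00
4.0,56957.00
4.1,57081.00
4.5,61111.00
4.9,67938.00
5.1,66029.00
5.3,83088.00
"""
dataset = pd.read_csv(io.StringIO(csv_text))

X = dataset.iloc[:, :-1].values  # every column except the last, kept 2D: (18, 1)
y = dataset.iloc[:, 1].values    # the Salary column as a 1D array: (18,)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

print(X.shape, y.shape)            # (18, 1) (18,)
print(len(X_train), len(X_test))   # 12 6
```

Note that X is kept two-dimensional (one column) because scikit-learn expects features in that shape, while y is a plain one-dimensional array.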
Then, we applied the Linear Regression model and fitted a line (with the help of
scikit-learn, which is a free software machine learning library for the Python
programming language). This is accomplished through the following lines of code:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
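After fit() runs, the fitted line is stored in the regressor as a slope (coef_) and an intercept (intercept_). The sketch below reads them out and predicts the salary for a new input. It assumes only the rows listed above, and the 6.0 years of experience is a made-up input for illustration, not a value from the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# The rows listed in the dataset above; X must be 2D for scikit-learn.
X = np.array([1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2,
              3.7, 3.9, 4.0, 4.0, 4.1, 4.5, 4.9, 5.1, 5.3]).reshape(-1, 1)
y = np.array([39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445,
              57189, 63218, 55794, 56957, 57081, 61111, 67938, 66029, 83088],
             dtype=float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

# The fitted line is: salary = intercept + slope * years
slope = regressor.coef_[0]
intercept = regressor.intercept_
print(f"salary = {intercept:.2f} + {slope:.2f} * years")

# Hypothetical new input: someone with 6.0 years of experience.
new_years = np.array([[6.0]])  # 2D, one row and one column
predicted = regressor.predict(new_years)[0]
manual = intercept + slope * 6.0  # same line, evaluated by hand
print(f"predicted salary for 6.0 years: {predicted:.2f}")
```

Keep in mind that 6.0 years is slightly outside the range of the listed data, so the prediction assumes the straight-line trend continues beyond it.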
After learning from the Training Set (X_train and y_train), we then apply that
regressor to the Test Set (X_test) and compare the results using data visualization.
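The comparison between predictions and actual values can also be made numerically. Here is a minimal sketch that puts the predicted and actual salaries for the Test Set side by side and computes the mean absolute error. It assumes only the rows listed above, and the column names in the comparison table are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# The rows listed in the dataset above; X must be 2D for scikit-learn.
X = np.array([1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2,
              3.7, 3.9, 4.0, 4.0, 4.1, 4.5, 4.9, 5.1, 5.3]).reshape(-1, 1)
y = np.array([39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445,
              57189, 63218, 55794, 56957, 57081, 61111, 67938, 66029, 83088],
             dtype=float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

regressor = LinearRegression().fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# One row per Test Set example: what the model predicted vs. the real salary.
comparison = pd.DataFrame({
    "YearsExperience": X_test.ravel(),
    "ActualSalary": y_test,
    "PredictedSalary": y_pred.round(2),
})
comparison["Error"] = comparison["PredictedSalary"] - comparison["ActualSalary"]
print(comparison)

# Average size of the prediction error, in the same units as Salary.
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean absolute error on Test Set: {mae:.2f}")
```

The mean absolute error is handy here because it is in the same units as the target, so it reads directly as "on average, the prediction is off by this much salary".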
It’s a straightforward approach. Our model learns from the Training Set and
then applies that to the Test Set (to see if the model is good enough). This is
the essential principle of Simple Linear Regression.