Regression techniques such as linear regression and logistic regression all fall under a broader class known as generalized linear models (GLM). Simple linear regression is one of the most basic techniques of supervised machine learning. In this post we will discuss Ordinary Least Squares (OLS) in simple linear regression.
Ordinary Least Squares- Equation
In simple linear regression, the relationship between a dependent variable y and an independent variable x is described by two parameters, a and b (often written alpha and beta), as shown in the equation below:
y = a + bx
where a is called the y-intercept and b is the slope.
To estimate the optimal values of a and b, we use the Ordinary Least Squares (OLS) estimation technique. In this technique, the slope and intercept are chosen such that the sum of squared errors is minimized. Each error is the difference between the actual y value and the predicted y value for one data point.
Mathematically, it can be written as:
$$\mathrm{SSE} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \hat{y}_i = a + b x_i$$
In the above equation, y-hat ($\hat{y}$) is the estimate of the true y value, and e is the error, which is the difference between the actual and the predicted y value. The errors are squared and then summed over all n data points in the data.
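To make the quantity being minimized concrete, here is a minimal R sketch that computes the sum of squared errors for a candidate line. The data values and the helper name `sse` are hypothetical, made up purely for illustration.

```r
# Hypothetical toy data, made up for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Sum of squared errors for a candidate line y-hat = a + b * x
# (sse is an illustrative helper, not a built-in R function)
sse <- function(a, b, x, y) {
  y_hat <- a + b * x   # predicted values
  e <- y - y_hat       # errors: actual minus predicted
  sum(e^2)             # square each error, then sum
}

# Candidate 1: a flat line through the mean of y
sse_flat <- sse(a = 4, b = 0, x = x, y = y)

# Candidate 2: a sloped line (this one happens to be the least-squares line for this data)
sse_sloped <- sse(a = 2.2, b = 0.6, x = x, y = y)

print(c(flat = sse_flat, sloped = sse_sloped))
```

The sloped line gives the smaller sum of squared errors, which is exactly the criterion OLS uses to prefer one line over another.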
Please note that a horizontal bar over x or y denotes the mean value of x or y in the following equations. Using calculus, it can be shown that the values of a and b at which the sum of squared errors is minimal are:
$$b = \frac{\mathrm{cov}(x, y)}{\mathrm{Var}(x)}$$
where cov(x,y) denotes the covariance of x and y and Var(x) denotes the variance of x, and
$$a = \bar{y} - b\,\bar{x}$$
These are also known as the optimal values of a and b.
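For readers who want the missing calculus step, the following is a brief sketch of the standard derivation. The symbol S for the sum of squared errors is notation introduced here for convenience. Setting both partial derivatives of S to zero gives:

$$
\begin{aligned}
S(a, b) &= \sum_{i=1}^{n} (y_i - a - b x_i)^2 \\
\frac{\partial S}{\partial a} &= -2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0
  \;\Rightarrow\; a = \bar{y} - b\,\bar{x} \\
\frac{\partial S}{\partial b} &= -2 \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0
  \;\Rightarrow\; b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
  = \frac{\mathrm{cov}(x, y)}{\mathrm{Var}(x)}
\end{aligned}
$$

The last step substitutes the expression for a into the second condition and uses the fact that deviations from the mean sum to zero. The same normalizing factor appears in both the covariance and the variance, so it cancels in the ratio.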
In R, we can easily compute the covariance and variance of any set of data points, because R has built-in functions for covariance, variance and mean:
- cov(data$column1,data$column2): R function used to find covariance
- var(data$column1): R function used to find variance
- mean(data$column1): R function used to find mean
With these functions available in R, you can now find the optimal values of a and b. Taking column1 as the independent variable x and column2 as the dependent variable y, you can follow these steps (shown as typed at the R prompt) to perform OLS estimation on your data points:
- > b <- cov(data$column1,data$column2)/var(data$column1)
- > b
- > a <- mean(data$column2)-b*mean(data$column1)
- > a
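The steps above can be tried end to end on a small data set. The sketch below uses hypothetical toy values, keeps the placeholder column names column1 (x) and column2 (y) from the steps, and compares the result with R's built-in lm() function as a sanity check.

```r
# Hypothetical toy data; column1 plays the role of x, column2 the role of y
data <- data.frame(
  column1 = c(1, 2, 3, 4, 5),
  column2 = c(2, 4, 5, 4, 5)
)

# Slope: covariance of x and y divided by the variance of x
b <- cov(data$column1, data$column2) / var(data$column1)

# Intercept: mean of y minus slope times mean of x
a <- mean(data$column2) - b * mean(data$column1)

# Cross-check against R's built-in linear model fit
fit <- lm(column2 ~ column1, data = data)

print(c(intercept = a, slope = b))
print(coef(fit))
```

For this toy data both approaches give an intercept of 2.2 and a slope of 0.6. In practice lm() is usually the more convenient choice, since it also reports residuals and other diagnostics.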
Other metrics such as correlation are also very helpful in describing the relationship between the data points. The correlation between two variables tells an analyst how closely the relationship between them follows a straight-line pattern.
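As a small illustration of how correlation ties back to the OLS slope, the sketch below uses R's built-in cor() and sd() functions on hypothetical toy data. It relies on the standard identity that the slope equals the correlation multiplied by the ratio of the standard deviations of y and x.

```r
# Hypothetical toy data, made up for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Pearson correlation: always between -1 and 1
r <- cor(x, y)

# Slope recovered from the correlation and the two standard deviations
b_from_r <- r * sd(y) / sd(x)

# Slope computed directly, as covariance over variance
b_direct <- cov(x, y) / var(x)

print(c(r = r, b_from_r = b_from_r, b_direct = b_direct))
```

The two slope values agree, which shows that correlation and the OLS slope carry the same information about direction and differ only in scale.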
Measuring the strength of a linear relationship is an important part of regression analysis and prediction, so it is necessary to find the line that describes the relationship with the smallest error. OLS is a great way to do so.