Regression analysis is a predictive modelling technique that investigates the relationship between a dependent variable and one or more independent variables. It involves fitting a line through a set of data points so that the line most closely follows the overall shape of the data. Regression shows how changes in the dependent variable on the y-axis relate to changes in the explanatory variable on the x-axis.
Uses of regression
- Determining the strength of predictors
- Forecasting an effect
- Trend Forecasting
Linear regression is a statistical model that attempts to show the relationship between two variables with a linear equation. It is one of the simplest algorithms in machine learning. The fitted line is described by the formula Y = a + bX, where b is the slope of the line and a is the intercept (the value of Y when X = 0).
Using this formula, the model maps every value of X to a predicted value of Y. Linear regression always produces a continuous output: the result is a value of the dependent variable, not a category. How well the line fits the data is measured with the R-squared method.
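As a minimal sketch of how the slope b and intercept a are obtained, the least-squares formulas can be computed directly with NumPy. The data below is made up for illustration:

```python
import numpy as np

# Toy data (hypothetical): x is the independent variable, y the dependent one
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares estimates for the line Y = a + bX
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

y_pred = a + b * x  # predicted Y for every X
print(f"slope b = {b:.2f}, intercept a = {a:.2f}")
```

For this data the fitted slope is close to 2, matching the roughly "add 2 per step" pattern in y.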
Selection of linear regression
- Classification and regression capabilities: A regression model predicts a continuous variable, such as sales made in a day. Linear regression is not a good fit for classification because its predicted value is continuous, not probabilistic.
- Data quality: Each missing value removes a data point that could have improved the fit. In simple linear regression, outliers can significantly distort the outcome; removing (or at least investigating) them usually makes the model perform better.
- Computational complexity: Linear regression is not as computationally expensive as algorithms such as random forests or clustering methods. For n training examples and X features, fitting by ordinary least squares typically costs on the order of O(X²n).
- Comprehensible and transparent: Linear regression models are easily comprehensible and transparent. They can be represented with simple mathematical notation and are easy to interpret.
Where is linear regression used?
- Evaluating trends and sales estimates
- Analyzing the impact of price changes
- Assessment of risk in financial services and insurance domain
Understanding linear regression in depth
Suppose we have an independent variable on the x-axis and a dependent variable on the y-axis. If the dependent variable increases as the independent variable increases, this is positive linear regression: the slope of the fitted line is positive.
If the dependent variable decreases as the independent variable increases, it is negative linear regression: the slope of the fitted line is negative.
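The sign of the fitted slope can be checked directly. The sketch below, with made-up data, fits a line to an increasing and a decreasing trend using NumPy's `polyfit`:

```python
import numpy as np

x = np.arange(10, dtype=float)
y_pos = 2.0 * x + 1.0    # dependent variable increasing with x
y_neg = -1.5 * x + 4.0   # dependent variable decreasing with x

# polyfit with degree 1 returns [slope, intercept]
slope_pos = np.polyfit(x, y_pos, 1)[0]
slope_neg = np.polyfit(x, y_neg, 1)[0]
print(slope_pos, slope_neg)  # positive vs. negative regression
```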
Finding the best fit line
Let’s check how well our model performs. To do that, we use a measure called R-squared.
What actually is R-squared?
- The R-squared value is a statistical measure of how close the data are to the fitted regression line. In general, a model with a higher R-squared value is considered a better fit.
- It is also known as the coefficient of determination, or, in multiple regression, the coefficient of multiple determination.
How is the R-squared value calculated?
It compares the distances of the predicted values from the mean to the distances of the actual values from the mean: R-squared = (sum of squared distances of predicted values from the mean) / (sum of squared distances of actual values from the mean). The numerator is the variation explained by the regression line; the denominator is the total variation in the data.
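This ratio can be computed in a few lines. A sketch with made-up data, fitting the line with NumPy's `polyfit` and then forming the explained-to-total variation ratio:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Fit the regression line Y = a + bX
b, a = np.polyfit(x, y, 1)
y_pred = a + b * x

ss_total = np.sum((y - y.mean()) ** 2)       # actual values vs. the mean
ss_reg = np.sum((y_pred - y.mean()) ** 2)    # predicted values vs. the mean
r2 = ss_reg / ss_total
print(f"R-squared = {r2:.3f}")
```

For an ordinary least-squares fit, this ratio equals the familiar 1 − SS_residual/SS_total form of R-squared.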
Are low r-squared values always bad?
A high or low R-squared isn’t necessarily good or bad, as it doesn’t convey the reliability of the model, nor whether you’ve chosen the right kind of regression. You can get a low R-squared for a good model, or a high R-squared for a poorly fitted one. For example, any field that attempts to predict human behavior, such as psychology, typically has R-squared values lower than 50%. Humans are simply harder to predict than, say, physical processes.