Modern data analysis software has brought regression analysis to many people who have never been formally trained in regression. They may attempt to run a regression without knowing how it works, and then be confused or surprised when they get unexpected results or error messages. The success of a linear regression depends on how it is set up, and models simply won't work under certain conditions. In this article, you will learn the basics of:
- How linear regression fits the data
- What kind of variables are required to run a linear regression
- Conditions (assumptions) that need to be met to run a linear regression
- How to interpret linear regression results
How linear regression fits the data
Linear regression is set up to model a numeric Outcome variable using a number of Predictor variables. It looks at each observation (i.e., a row of your data) and measures the similarities and differences among the Predictors to see how they affect that particular Outcome. To do this, it tries to isolate the change in each Predictor versus the Outcome while the "other" Predictors are held at a constant value. For example, if you're trying to predict a person's weight using height and age, the regression would examine how much weight changes between observations where height is 6 ft and only age varies. By holding the height value constant, it is more likely to learn how just the age of a person affects their weight. Repeat this many times for many different heights and you can start to estimate the general change in weight due to age (the regression coefficient).
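To make this concrete, here is a minimal sketch in Python. The statsmodels library, the variable names, and the coefficient values are all assumptions of this example; the data are synthetic and purely illustrative.

```python
# A minimal sketch of the idea above; everything here is illustrative.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
height_in = rng.normal(68, 3, n)   # height in inches
age_yr = rng.uniform(20, 70, n)    # age in years
# Synthetic "truth": weight rises with both height and age, plus noise.
weight_lb = -200 + 5.0 * height_in + 0.3 * age_yr + rng.normal(0, 10, n)

X = sm.add_constant(np.column_stack([height_in, age_yr]))
model = sm.OLS(weight_lb, X).fit()
print(model.params)  # intercept, height coefficient, age coefficient
# The age coefficient lands near 0.3: the estimated change in weight
# per year of age, with height held constant.
```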
How do regressions do this? There are a few things to keep top of mind about your Predictors to ensure the regression can isolate the changes in one Predictor versus the others and the Outcome (more theoretical assumptions about the data are discussed in detail below):
- Each observation needs to have a value for each Predictor so the regression can accurately measure what's different and what's the same between the Predictors for an Outcome. Please see Imputing Missing Data for considerations around estimating missing data.
- There need to be many more observations than Predictors so the regression has multiple examples of how each Predictor interacts with the Outcome variable (given the other Predictors). A good rule of thumb is 10 times as many cases/observations as Predictors.
- Each Predictor must have more than one value, because if the value never changes, the regression has no examples to measure how the Outcome would be affected if it did change. The more variance a Predictor has, the more examples the regression has of "changes" in the Predictor versus "changes" in the Outcome. For this same reason, the Outcome variable must have more than one value as well.
- The Predictors can't directly determine each other (a problem known as multicollinearity). For example, in the model above, you can't have both height in feet and height in meters as Predictors, because they change exactly with each other and the regression can't isolate which one is actually changing the Outcome (see the sketch after this list).
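One common way to check for this is the variance inflation factor (VIF). The sketch below, again assuming Python with statsmodels and using made-up variable names, shows how an exactly collinear pair stands out:

```python
# Hedged sketch: variance inflation factors (VIFs) flag predictors that
# are (nearly) exact linear combinations of each other.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
height_ft = rng.normal(5.6, 0.3, 100)
height_m = height_ft * 0.3048   # an exact multiple of height_ft
age = rng.uniform(20, 70, 100)

X = sm.add_constant(np.column_stack([height_ft, height_m, age]))
# VIF per predictor column (column 0 is the constant, so skip it).
for i, name in enumerate(["height_ft", "height_m", "age"], start=1):
    print(name, variance_inflation_factor(X, i))
# height_ft and height_m are perfectly collinear, so their VIFs blow up
# (with exact collinearity the value may be reported as inf).
```

A rule of thumb many texts use is that VIFs above roughly 5 to 10 deserve a closer look.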
What kind of variables are required to run a linear regression
Linear regression is used to predict a single numeric dependent variable from one or more numeric independent variables. Linear regression involving a single independent (scale) variable is the simplest case and is called simple linear regression. In other words, one numeric variable, say X, is used to predict another scale variable, say Y.
A regression involving more than one independent variable is called multiple regression and is a direct extension of simple linear regression. When running a multiple regression, one is again concerned with how well the equation fits the data, whether a linear model is the best fit to the data, whether any of the variables are significant predictors, and with estimating the coefficients for the best-fitting prediction equation. In addition, one is interested in the relative importance of the independent variables in predicting the dependent measure.
Conditions (assumptions) that need to be met to run a linear regression
To correctly use multiple regression and apply statistical tests, the following conditions must be met:
- A continuous dependent variable. Likert scale variables may be used as dependent variables as long as they have 5 or more values.
- One or more continuous predictors or dummy (dichotomous) predictors. You can also use Categorical variables as predictors; the procedure will automatically transform the values into dichotomous variables so you can use them in regression.
- Predictor variables must be linearly related to the outcome variable.
- Residuals are normally distributed, and any missing data are missing at random.
- Homogeneity of variance -- residuals are assumed to be independent of the predicted values, implying that the variation of the residuals around the line is homogeneous. The violation of this assumption is known as heteroskedasticity.
- No autocorrelation -- residuals are independent of each other. That is, the error of one observation doesn't influence the error of another. Stock price variables tend to be autocorrelated.
- In multiple regression, absence of multicollinearity -- no exact or nearly exact linear relation between predictor variables. If there is multicollinearity, the standard errors of the coefficients inflate and estimation of the coefficients becomes unstable, because you cannot tell clearly from which variable the effect originates. Because of the large standard errors, the coefficients will most likely not be significant.
The Linear Regression procedure provides many diagnostic statistics, so you can check the assumptions of normality of the residuals, homogeneity of variance, and absence of multicollinearity.
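If your software doesn't surface these diagnostics directly, they are also easy to compute by hand. The sketch below, assuming Python with statsmodels and SciPy, runs three common checks; the model fit is synthetic and exists only to produce residuals.

```python
# Hedged sketch of three common assumption checks.
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0, 1, 200)
model = sm.OLS(y, X).fit()
resid = model.resid

# Normality of residuals: a small p-value suggests non-normality.
print("Shapiro-Wilk:", stats.shapiro(resid))
# Homogeneity of variance: a small p-value suggests heteroskedasticity.
print("Breusch-Pagan:", het_breuschpagan(resid, model.model.exog))
# Autocorrelation of residuals: values near 2 suggest none.
print("Durbin-Watson:", durbin_watson(resid))
```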
How to interpret linear regression results
When there is a single independent variable, the relationship between the independent variable and the dependent variable can be visualized in a scatterplot, and the concept of linear regression can be explained using the scatterplot.
If you have only one predictor, the following equation links the dependent variable to the independent variable.
Predicted Y = a + b * X
The a and b coefficients must be estimated from the data; their meaning (illustrated in the sketch after this list) is:
- a is the constant term, also referred to as the intercept. It represents the predicted value for Y at X = 0. In general, the constant is of little interest. For example, when you predict salary using age as a predictor, you will not be interested in the predicted salary when age = 0.
- b represents the expected change in Y per unit change in X.
- X is the independent variable.
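Here is a minimal sketch of estimating a and b from data, assuming Python with statsmodels; the age/salary numbers are invented to echo the example above:

```python
# Minimal sketch: fitting Predicted Y = a + b * X on synthetic data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(20, 60, 100)                     # e.g., age
y = 25000 + 800 * x + rng.normal(0, 5000, 100)   # e.g., salary

model = sm.OLS(y, sm.add_constant(x)).fit()
a, b = model.params
print(a, b)  # a near 25000 (intercept), b near 800 (change per unit of x)
```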
The multiple regression equation accommodates more than one independent variable:
Predicted Y = a + b1 * X1 + b2 * X2 + ... + bn * Xn
b1 represents the expected change in the dependent variable when X1 increments by one unit, all other independent variables held constant. The same holds true for b2, b3, and so forth.
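You can verify this interpretation directly from a fitted model: moving X1 up by one unit while leaving X2 untouched changes the prediction by exactly b1. A hedged sketch, again assuming statsmodels and synthetic data:

```python
# Hedged sketch: nudging X1 by one unit with X2 fixed moves the
# prediction by exactly the fitted coefficient b1.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x1, x2 = rng.normal(size=200), rng.normal(size=200)
y = 3.0 + 1.5 * x1 - 2.0 * x2 + rng.normal(0, 1, 200)

model = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
a, b1, b2 = model.params
p0 = model.predict([[1.0, 0.0, 0.0]])  # constant, X1 = 0, X2 = 0
p1 = model.predict([[1.0, 1.0, 0.0]])  # constant, X1 = 1, X2 unchanged
print((p1 - p0).item(), b1)            # the two values match
```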