Linear Regression

more for data analytics students, but you can read it too

In the realm of data analysis, we often move beyond simply describing our data to understanding the relationships within it and even predicting future outcomes. One of the foundational techniques for this is simple linear regression. As the name suggests, it’s a straightforward method focused on understanding and predicting the relationship between two numerical variables using a straight line.

At its core, simple linear regression aims to determine if one variable, the independent variable (also known as the predictor variable), can effectively predict the value of another variable, the dependent variable (also known as the criterion variable). Think of it like this: if you feed messenger squirrels different numbers of nuts (independent variable), can you predict how much they will weigh (dependent variable)?

The magic of linear regression lies in its ability to create a linear model, represented by a straight line that best fits the observed data points on a scatter plot. This calculated line is your prediction tool. The equation of this line essentially describes the relationship between the two variables.
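
The best-fitting line is usually found by least squares: pick the intercept and slope that minimize the sum of squared vertical distances between the points and the line. Here is a minimal sketch of that calculation. The squirrel numbers and the fit_line helper are made up for illustration and do not come from a real study.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data: nuts fed per day and squirrel weight in grams.
nuts = [2, 4, 6, 8, 10]
weight = [310, 345, 390, 420, 455]

intercept, slope = fit_line(nuts, weight)

# Use the line as a prediction tool: expected weight at 7 nuts per day.
predicted = intercept + slope * 7
print(f"weight = {intercept:.2f} + {slope:.2f} * nuts")
print(f"predicted weight at 7 nuts: {predicted:.2f} g")
```

With these invented numbers the slope works out to 18.25, so each extra nut per day goes with about 18 more grams of squirrel.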

A key output of a simple linear regression analysis is the coefficient of determination, denoted as R². This value, ranging from 0 to 1 (or 0% to 100%), tells you how much of the variance in your dependent variable is explained by your independent variable. For instance, an R² of 0.92 in a model predicting a pet’s height based on its weight means that 92% of the variation in height can be explained by the variation in weight. A higher R² indicates a better fit of the model to the data, suggesting that the independent variable is a strong predictor of the dependent variable.
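
To make "variance explained" concrete, R² can be computed as 1 minus the ratio of the leftover (residual) variation to the total variation in the dependent variable. The sketch below assumes invented pet measurements, and r_squared is an illustrative helper name.

```python
from statistics import mean

def r_squared(xs, ys):
    """Coefficient of determination for the least-squares line through (xs, ys)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Variation the line fails to explain.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation of y around its mean.
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data: pet weight in kg and height in cm.
weight_kg = [3, 5, 8, 12, 20]
height_cm = [22, 25, 31, 35, 52]

r2 = r_squared(weight_kg, height_cm)
print(f"R² = {r2:.3f}")
```

Points that fall exactly on a line give an R² of 1, and the value drops toward 0 as the scatter around the line grows.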

Another crucial piece of information you get from linear regression is the p-value. In this context, the null hypothesis is that the independent variable is NOT a predictor of the dependent variable, and the alternative hypothesis is that the independent variable IS a predictor. If the p-value is low enough (typically below a chosen significance level like 0.05), we can reject the null hypothesis and conclude that the independent variable is a statistically significant predictor of the dependent variable.
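
Statistical software typically derives this p-value from a t-test on the slope. The sketch below swaps in a different technique, a permutation test, because it needs only the standard library and mirrors the null hypothesis directly: if nuts do not predict weight, then shuffling the weights should produce slopes as steep as the observed one fairly often. The data and the function names are assumptions made for illustration.

```python
import random
from statistics import mean

def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Approximate two-sided p-value for the null hypothesis of no relationship."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(slope(xs, ys))
    shuffled = list(ys)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between x and y
        if abs(slope(xs, shuffled)) >= observed:
            extreme += 1
    # The add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical data: nuts fed per day and squirrel weight in grams.
nuts = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
weight = [301, 322, 335, 361, 370, 398, 410, 433, 441, 468]

p_value = permutation_p_value(nuts, weight)
significant = p_value < 0.05
print(f"p-value ≈ {p_value:.4f}, significant at 0.05: {significant}")
```

A permutation p-value will not match the t-test figure exactly, but for data like this both lead to the same decision about the null hypothesis.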

However, like any statistical technique, simple linear regression relies on several assumptions to ensure the reliability of its results:

  • Linearity—The relationship between the independent and dependent variables should be linear. A scatter plot should roughly resemble a straight line.
  • Normality—The residuals (the differences between the actual and predicted values) should be normally distributed.
  • Independence—The observations should be independent of each other.
  • Homoscedasticity—The variance of the residuals should be roughly constant across all levels of the independent variable.
  • Numerical variables—Both the independent and dependent variables must be numerical.
  • Sample size—While a minimum of 10 observations per independent variable is technically acceptable, a larger sample size (ideally ≥ 100) is generally recommended for better results.
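
Several of these assumptions are checked by looking at the residuals. What follows is only a rough sketch on hypothetical data. The spread_ratio helper is an invented shortcut that compares residual spread in the lower and upper halves of the independent variable, whereas a real analysis would normally rely on residual plots and formal tests.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def residuals(xs, ys):
    """Actual minus predicted value for every observation."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def spread_ratio(xs, ys):
    """Residual spread in the upper half of x divided by spread in the lower half."""
    pairs = sorted(zip(xs, ys))  # order observations by x
    sorted_xs = [x for x, _ in pairs]
    sorted_ys = [y for _, y in pairs]
    res = residuals(sorted_xs, sorted_ys)
    half = len(res) // 2
    return pstdev(res[-half:]) / pstdev(res[:half])

# Hypothetical data: nuts fed per day and squirrel weight in grams.
nuts = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
weight = [301, 322, 335, 361, 370, 398, 410, 433, 441, 468]

res = residuals(nuts, weight)
ratio = spread_ratio(nuts, weight)
print("residuals:", [round(r, 2) for r in res])
print(f"upper/lower residual spread ratio: {ratio:.2f}")
```

A ratio near 1 is consistent with homoscedasticity, while a ratio far from 1 hints that the spread changes with the independent variable. A clear curve in the printed residuals (for example, negative, then positive, then negative again) would point to a linearity problem.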

It’s important to distinguish simple linear regression from correlation. While correlation tells us about the strength and direction of the relationship between two numerical variables, linear regression goes a step further by aiming to build a predictive model. A strong correlation tells you that the two variables move together; regression gives you an equation for estimating the value of one from the other.
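
The two ideas are closely linked. In simple linear regression the slope equals the Pearson correlation coefficient r scaled by the ratio of the two spreads, and R² equals r squared. This short sketch on hypothetical data shows correlation yielding a single number while regression yields an equation you can plug values into.

```python
from math import sqrt
from statistics import mean

def sums_of_squares(xs, ys):
    """Return (sxx, syy, sxy): squared deviations of x, of y, and their cross-product."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxx, syy, sxy

# Hypothetical data: nuts fed per day and squirrel weight in grams.
nuts = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
weight = [301, 322, 335, 361, 370, 398, 410, 433, 441, 468]

sxx, syy, sxy = sums_of_squares(nuts, weight)

# Correlation: one number for strength and direction.
r = sxy / sqrt(sxx * syy)

# Regression: an equation that turns a nut count into a weight estimate.
slope = sxy / sxx
intercept = mean(weight) - slope * mean(nuts)
estimate = intercept + slope * 7

print(f"r = {r:.3f}, r squared = {r * r:.3f}")
print(f"estimated weight at 7 nuts: {estimate:.1f} g")
```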

In conclusion, simple linear regression is a powerful and widely used tool for understanding and predicting the relationship between two numerical variables. By analyzing the regression line, R², and the p-value, and by ensuring the underlying assumptions are met, data analysts can gain valuable insights and make informed predictions across various fields. Whether it’s predicting sales based on advertising spend or estimating product demand based on historical data, simple linear regression provides a fundamental framework for turning data into actionable foresight.