Today we will look into the basics of linear regression. Here we go :
Contents
 Simple Linear Regression (SLR)
 Multiple Linear Regression (MLR)
 Assumptions
1. Simple Linear Regression
Regression is the process of building a relationship between a dependent variable and set of independent variables. Linear Regression restricts this relationship to be linear in terms of coefficients. In SLR, we consider only one independent variable.
Example: The Waist Circumference – Adipose Tissue data

Studies have shown that individuals with excess Adipose tissue (AT) in the abdominal region have a higher risk of cardiovascular diseases

Computed Tomography, commonly called the CT Scan is the only technique that allows for the precise and reliable measurement of the AT (at any site in the body)

The problems with using the CT scan are:
 Many physicians do not have access to this technology
 Irradiation of the patient (suppresses the immune system)
 Expensive

Is there a simpler yet reasonably accurate way to predict the AT area? i.e.
 Easily available
 Risk free
 Inexpensive

A group of researchers conducted a study with the aim of predicting abdominal AT area using simple anthropometric measurements i.e. measurements on the human body

The Waist Circumference – Adipose Tissue data is a part of this study wherein the aim is to study how well waist circumference(WC) predicts the AT area
# Setting working directory
filepath < c("/Users/nkaveti/Documents/Work_Material/Statistics Learning/")
setwd(filepath)
# Reading data
Waist_AT < read.csv("adipose_tissue.csv")
cat("Number of rows: ", nrow(Waist_AT), "\n")
head(Waist_AT)
Number of rows: 109
Waist  AT 

74.75  25.72 
72.60  25.89 
81.80  42.60 
83.95  42.80 
74.65  29.84 
71.85  21.68 
Let's start with a scatter plot of Waist Vs AT, to understand the relationship between these two variables.
plot(AT ~ Waist, data = Waist_AT)
Any observations from above plot?
Now the objective is to find a linear relation between Waist
and AT
. In otherwords, finding the amount of change in AT
per one unit change (increment/decrement) in Waist
.
In SLR, it is equivalent to finding an optimal straight line equation such that the sum of squares of differences between straight line and the points will be minimum. This method of estimation is called as Ordiany Least Squares (OLS).
Where, \(\beta_1\) represents the amount of change in AT
per one unit change in Waist
.
Now our problem becomes an unconstrained optimization problem. We can find optimal values for \(\beta_0\) and \(\beta_1\) using basic calculus.
Lets rewrite above regression equation in matrix form
Where, \( X = [1 \ \ Waist]\) 1 is a vector of ones and \(\beta = (\beta_0, \ \beta_1)\)
Now differentiate this w.r.t to \(\beta\) and equate it to zero. Then we have,
Now we can find the fitted values of model by substituting \(\hat{\beta}\) in above regression equation
Note: We are arriving to above equation through an assumption\(^1\) of \(E(\epsilon)=0\). What happens if this assumption violates?
Let, \(X(X^\intercal X)^{1} X^\intercal = H\)
We call H as an hat matrix, because it transforms \(AT\) into \(\hat{AT}\) :D
# Lets compute the hat matrix
X = cbind(1, Waist_AT$Waist)
temp = solve(t(X) %*% X) %*% t(X)
betahat = temp %*% Waist_AT$AT # Estimated coefficients
cat("Let's compare the computed values with lm() output: \n \n")
cat("Computed Coefficients: \n \n")
print(data.frame(Intercept = betahat[1], Waist = betahat[2]))
cat("======================================================================= \n")
#cat("Optimal value for beta_0 is: ", betahat[1], "and for beta_1 is: ", betahat[2], "\n \n")
fit_lm = lm(AT ~ Waist, data = Waist_AT)
#cat("Compare our computed estimates with lm() estimates", "\n")
print(fit_lm)
cat("======================================================================= \n \n")
H = X %*% temp # Computing hat matrix
AThat = H %*% Waist_AT$AT # Computing predicted values
cat("Therefore, there is a", betahat[2], "increment in AT per one unit change in Waist \n" )
Let's compare the computed values with lm() output:
Computed Coefficients:
Intercept Waist
1 215.9815 3.458859
=======================================================================
Call:
lm(formula = AT ~ Waist, data = Waist_AT)
Coefficients:
(Intercept) Waist
215.981 3.459
=======================================================================
Therefore, there is a 3.458859 increment in AT per one unit change in Waist
What's next?
We succesfully computed estimates for regression coefficients and fitted values.

We are working on only one sample, how can we generalise these results to population?

How to measure model's performance quantitatively?
We are working on only one sample, how can we generalise these results to population?
Let's focus on question 1. Our regression coefficients are computed using only one sample and these values will change, if we change the sample. But how much they vary? We need to estimate the variation for each beta coefficient to check whether the corresponding regressor is consistently explaining the same behaviour even if we change the sample.
Now the big problem is collecting multiple samples to check the above hypothesis. Hence, we use distributions to check statistical significance of regressors.
For our example, we need to test below two hypotheses.
Test Statistic for these hypotheses is,
Test statistic t
follows tdistribution
, assuming\(^2\) dependent variable follows normal distribution
Suggestion: If your not aware of testing of hypothesis, probability distributions and pvalues please browse through the Google.
Let's recall that, \(\hat{\beta} = (X^\intercal X)^{1} X^\intercal AT\)
Note: In the above calculations we assumed\(^3\) \(Var(AT) = \sigma^2\) (Constant). Where, \(\sigma^2\) is variation in population AT.
Suggestion: Try solving \((X^\intercal X)^{1}\) with \(X = [1, \ x]\) where \(x = (x_1, x_2, x_3 ... x_n)\). You will get the following expression.
Diagonal elements of above matrix are varinaces of \(\beta_0\) and \(\beta_1\) respectively. Offdiagonal element is covariance between \(\beta_0\) and \(\beta_1\).
Hence,
One important observation from \(Var(\hat{\beta})\) expressions is, \(Var(x)\) is inversely proportional to \(Var(\hat{\beta})\). That is, we will get more consistent estimators if there is high variation in corresponding predictors.
Recall that, \(\sigma^2\) in above expression is the population variance, not the sample. Hence, we need to estimate this using the sample that we have.
Where, \(e_i = AT_i  \hat{AT}_i\)
# Let's compute variances of beta hat and test statistic 't'
sigmasq = (1/length(AThat[c(1:2)]))*sum((AThat  Waist_AT$AT)^2)
VarBeta0 = (sigmasq * sum(Waist_AT$Waist^2))/(length(AThat) * sum((Waist_AT$Waist  mean(Waist_AT$Waist))^2))
VarBeta1 = sigmasq/sum((Waist_AT$Waist  mean(Waist_AT$Waist))^2)
cat("Let's compare the computed values with lm() output: \n \n")
cat("======================================================================= \n")
cat("Computed Coefficients: \n \n")
res = data.frame(Estimate = betahat, Std.Error = c(sqrt(VarBeta0), sqrt(VarBeta1)), t_value = c(betahat[1]/sqrt(VarBeta0), betahat[2]/sqrt(VarBeta1)))
row.names(res) = c("(Intercept)", "Waist")
res$p_value = 2*pt(abs(res$t_value), nrow(Waist_AT)1, lower.tail = FALSE)
print(res)
cat("=======================================================================")
summary(fit_lm)
cat("=======================================================================")
Let's compare the computed values with lm() output:
=======================================================================
Computed Coefficients:
Estimate Std.Error t_value p_value
(Intercept) 215.981488 21.7962708 9.909103 7.507198e17
Waist 3.458859 0.2346521 14.740376 1.297124e27
=======================================================================
Call:
lm(formula = AT ~ Waist, data = Waist_AT)
Residuals:
Min 1Q Median 3Q Max
107.288 19.143 2.939 16.376 90.342
Coefficients:
Estimate Std. Error t value Pr(>t)
(Intercept) 215.9815 21.7963 9.909 <2e16 ***
Waist 3.4589 0.2347 14.740 <2e16 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 33.06 on 107 degrees of freedom
Multiple Rsquared: 0.67, Adjusted Rsquared: 0.667
Fstatistic: 217.3 on 1 and 107 DF, pvalue: < 2.2e16
=======================================================================
Note: Residual standard error = \(\sqrt{sigmasq}\)
How to measure model's performance quantitatively?
Let's focus on question 2 (How to measure model's performance quantitatively?). Recall that, our objective of building model is to explain the variation in AT
using the variation in Waist
.
Total variation in AT is, \(\sum_{i=1}^n (AT  mean(AT))^2\) this can be splitted into two parts as follows:
Where, \(\sum_{i=1}^n (AT_i  \bar{AT})^2\) is the total variation in AT, \(\sum_{i = 1}^n (\hat{AT_i}  \bar{AT})^2\) is the explained variation in AT, this is also called as Regression Sum of Squares and \(\sum_{i=1}^n (AT_i  \hat{AT_i})^2\) is the unexplained variation in AT, this is also called as Error Sum of Squares
We can measure our model using the proportion of total variation explained by independent variable(s). That is, \(\frac{Regression \ Sum \ of \ Squares}{Total \ Sum \ of \ Squares}\)
The above measure is called as Multiple Rsquared:
Interesting facts: Multiple Rsquared value in SLR is equals to \(r^2\) and (1  Multiple Rsquared) is equals to the variance in residuals.
Where, r is pearson's correlation coefficient between dependent and independent variable.
# Let's compute Multiple Rsquared measure for our example
SSR = sum((AThat  mean(Waist_AT$AT))^2)
SST = sum((Waist_AT$AT  mean(Waist_AT$AT))^2)
MulRSq = SSR/SST
cat("Compute Multiple Rsquared: ", MulRSq, "\n \n")
cat("Note that computed R squared value is matching with lm() Multiple Rsquared value in above output \n \n")
cat("======================================================================= \n \n")
Compute Multiple Rsquared: 0.6700369
Note that computed R squared value is matching with lm() Multiple Rsquared value in above output
=======================================================================
What happens to the Multiple Rsquared value when you add an irrelevant variable to the model?
In the below model, I am generating a random sample of uniform numbers between 1 to 100 and considering this as one of indepedent variable.
set.seed(1234)
fit_lm2 = lm(AT ~ Waist + runif(nrow(Waist_AT), 1, 100), data = Waist_AT)
summary(fit_lm2)
cat("======================================================================= \n \n")
Call:
lm(formula = AT ~ Waist + runif(nrow(Waist_AT), 1, 100), data = Waist_AT)
Residuals:
Min 1Q Median 3Q Max
106.06 17.53 3.63 13.70 91.36
Coefficients:
Estimate Std. Error t value Pr(>t)
(Intercept) 226.2894 23.4350 9.656 3.33e16 ***
Waist 3.5060 0.2376 14.757 < 2e16 ***
runif(nrow(Waist_AT), 1, 100) 0.1397 0.1181 1.183 0.239

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 33 on 106 degrees of freedom
Multiple Rsquared: 0.6743, Adjusted Rsquared: 0.6682
Fstatistic: 109.7 on 2 and 106 DF, pvalue: < 2.2e16
=======================================================================
Multiple Rsquared value increases irrespective of quality of explanation, which is incorrect. We should penalize our model performance if the quality of explanation is poor, that is why we need to adjust our Rsquared value.
To penalize the explained part of AT, we inflate the unexplained part of AT with \(\frac{Total \ degrees \ of \ freedom}{Error \ degrees \ of \ freedom}\). That is,
Where, n = Total number of observations; p = Total number of predictors (excluding intercept)
Adding a new independent variable will increase \(\frac{n1}{np1}\) and \(R^2\). If the amount of increment in \(R^2\) is less than the amount of increment in \(\frac{n1}{np1}\) than it will decrease the Adjusted Rsquared value.
In fit_lm2
model Adjusted Rsquared decreases when we add randomly generated variable into the model.
# Let's compute adjusted Rsquared for our example
TDF = nrow(Waist_AT[1, ]) # Total degrees of freedom
EDF = nrow(Waist_AT[1, ])  1 # Error degrees of freedom, where 1 is the number of predictors
AdjRSq = 1  (1  MulRSq) * (TDF/EDF) # Adjusted R square
cat("Compute Multiple Rsquared: ", AdjRSq, "\n \n")
cat("Note that computed Adjusted Rsquared value is matching with lm() Adjusted Rsquared value in the above output \n \n")
cat("Note: We are comparing with fit_lm model, not fit_lm2 \n")
cat("======================================================================= \n")
Compute Multiple Rsquared: 0.6669531
Note that computed Adjusted Rsquared value is matching with lm() Adjusted Rsquared value in the above output
Note: We are comparing with fit_lm model, not fit_lm2
=======================================================================
Aforementioned measures (Multiple Rsquared & Adjusted Rsquared) for Goodness of fit are functions of sample and these will vary as sample changes. Similar to ttest
for regression coefficeints we need some statistical test to test model's performance for population.
Objective is to compare the Mean sum of squares due to regression and Mean sum of squares due to error. Ftest
is very helpful to compare the variations.
Note: Above expression follows F distribution only if, AT follows Normal Distribution
RDF = TDF  EDF
SSE = SST  SSR
MSR = (1/RDF)*SSR
MSE = (1/EDF)*SSE
F_value = MSR/MSE
cat("Compute F statistic: ", F_value, "\n \n")
cat("Note that computed Fstatistic is matching with lm() Fstatistic value in the above output \n \n")
cat("Note: We are comparing with fit_lm model, not fit_lm2 \n")
cat("======================================================================= \n")
Compute F statistic: 217.2787
Note that computed Fstatistic is matching with lm() Fstatistic value in the above output
Note: We are comparing with fit_lm model, not fit_lm2
=======================================================================
2. Multiple Linear Regression (MLR)
In multiple linear regression we consider more than one predictor and one dependent variable. Most of the above explanation is valid for MLR too.
Example: Car's MPG (Miles Per Gallon) prediction
Our interest is to model the MPG of a car based on the other variables.
Variable Description:
 VOL = cubic feet of cab space
 HP = engine horsepower
 MPG = average miles per gallon
 SP = top speed, miles per hour
 WT = vehicle weight, hundreds of pounds
# Reading Boston housing prices data
car = read.csv("Cars.csv")
cat("Number of rows: ", nrow(car), "\n", "Number of variables: ", ncol(car), "\n")
head(car)
Number of rows: 81
Number of variables: 5
HP  MPG  VOL  SP  WT 

49  53.70068  89  104.1854  28.76206 
55  50.01340  92  105.4613  30.46683 
55  50.01340  92  105.4613  30.19360 
70  45.69632  92  113.4613  30.63211 
53  50.50423  92  104.4613  29.88915 
70  45.69632  89  113.1854  29.59177 
Our objective is to model the variation in MPG
using other independent variables. That is,
Where, \(\beta_1\) represents the amount of change in MPG
per one unit change in VOL
provided other variables are fixed. Let's consider below two cases,
Case1: HP = 49; VOL = 89; SP = 104.1854; WT = 28.76206 => MPG = 104.1854
Case2: HP = 49; VOL = 90; SP = 104.1854; WT = 28.76206 => MPG = 105.2453
then \(\beta_1 = 105.2453  104.1854 = 1.0599\). Similarly, \(\beta_2, \beta_3, \beta_4\)
The above effect is called as Ceteris Paribus Effect
.
But in real world it is very difficult to collect records in above manner. That's why we compute (function of) partial correlation coefficients to quantify the effect of one variable, keeping others constant.
# Let's build MLR model to predict MPG based using other variables
fit_mlr_actual = lm(MPG ~ ., data = car)
summary(fit_mlr_actual)
Call:
lm(formula = MPG ~ ., data = car)
Residuals:
Min 1Q Median 3Q Max
0.94530 0.32792 0.04058 0.24256 1.71034
Coefficients:
Estimate Std. Error t value Pr(>t)
(Intercept) 7.100e17 5.461e02 0.000 1.0000
HP 1.285e+00 2.453e01 5.239 1.4e06 ***
VOL 8.207e01 1.389e+00 0.591 0.5563
SP 6.144e01 2.458e01 2.500 0.0146 *
WT 3.287e01 1.390e+00 0.237 0.8136

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4915 on 76 degrees of freedom
Multiple Rsquared: 0.7705, Adjusted Rsquared: 0.7585
Fstatistic: 63.8 on 4 and 76 DF, pvalue: < 2.2e16
One key observation from above output is, Std. Error for VOL
and WT
is very huge comparing to others and this inflates t values
and p value
. Hence, these two variables becomes very insignificant for the model.
Let's go into deep, what happened to \(Var(\hat{\beta_{VOL}})\) and \(Var(\hat{\beta_{WT}})\)?
Analogy for \(Var(\hat{\beta})\) in MLR is as follows:
Where, \(R_{VOL}^2\) = Multiple Rsquared value obtained by regressing VOL on all other independent variables
Task: To understand it more clearly, take few random samples from cars data and run the MLR model and observe the variation in \(\hat{\beta_{VOL}}\) and \(\hat{\beta_{WT}}\).
# Let's regress VOL on all other independent variables'
fit_mlr = lm(VOL ~ HP + SP + WT, data = car)
summary(fit_mlr)
Call:
lm(formula = VOL ~ HP + SP + WT, data = car)
Residuals:
Min 1Q Median 3Q Max
0.068938 0.031641 0.008794 0.032018 0.077931
Coefficients:
Estimate Std. Error t value Pr(>t)
(Intercept) 6.155e18 4.481e03 0.000 1.000
HP 2.331e02 1.995e02 1.168 0.246
SP 2.294e02 2.000e02 1.147 0.255
WT 9.998e01 4.557e03 219.396 <2e16 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.04033 on 77 degrees of freedom
Multiple Rsquared: 0.9984, Adjusted Rsquared: 0.9984
Fstatistic: 1.637e+04 on 3 and 77 DF, pvalue: < 2.2e16
It's surprising that, \(R_{VOL}^2\) is 0.9984 and also only WT
is significant. That is, these two predictors (VOL
and WT
) are highly correlated. This inflates \(Var(\hat{\beta_{VOL}})\) and thus t value
. We might be missing some of the important information because of high correlation between predictors. This problem is called as Multicollinearity.
One quick solution for this problem is to remove either VOL
or WT
from the model. Let's compute partial correlation coeficient between MPG
and VOL
by removing the effect of WT
(say, \(r_{MV.W}\)) and partial correlation coeficient between MPG
and WT
by removing the effect of VOL
(say, \(r_{MW.V}\)).
To compute \(r_{MV.W}\) we need to compute the correlation between (a) part of VOL
which cannot be explained by WT
(regress VOL
on WT
and take the residuals) and (b) the part of MPG
which cannot be explained by WT
(regress MPG
on WT
and take the residuals)
fit_partial = lm(VOL ~ WT, data = car)
fit_partial2 = lm(MPG ~ WT, data = car)
res1 = fit_partial$residual
res2 = fit_partial2$residual
cat("Partial correlation coefficient between MPG and VOL by removing the effect of WT is: ", cor(res1, res2))
Partial correlation coefficient between MPG and VOL by removing the effect of WT is: 0.08008873
fit_partial3 = lm(WT ~ VOL, data = car)
fit_partial4 = lm(MPG ~ VOL, data = car)
res1 = fit_partia3$residual
res2 = fit_partial4$residual
cat("Partial correlation coefficient between MPG and WT by removing the effect of VOL is: ", cor(res1, res2))
Partial correlation coefficient between MPG and WT by removing the effect of VOL is: 0.05538241
Since, \(abs(r_{MV.W}) >= abs(r_{MW.V})\) we may remove WT
from the model.
# Remove WT and rerun the model
fit_mlr_actual2 = lm(MPG ~ .WT, data = car)
summary(fit_mlr_actual2)
Call:
lm(formula = MPG ~ .  WT, data = car)
Residuals:
Min 1Q Median 3Q Max
0.94036 0.31695 0.03457 0.23316 1.71570
Coefficients:
Estimate Std. Error t value Pr(>t)
(Intercept) 7.910e17 5.427e02 0.000 1.0000
HP 1.293e+00 2.415e01 5.353 8.64e07 ***
VOL 4.925e01 5.516e02 8.928 1.65e13 ***
SP 6.222e01 2.421e01 2.571 0.0121 *

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4884 on 77 degrees of freedom
Multiple Rsquared: 0.7704, Adjusted Rsquared: 0.7614
Fstatistic: 86.11 on 3 and 77 DF, pvalue: < 2.2e16
After eliminating WT
from the model there is an increment of ~0.3% in Adjusted Rsquared and more importantly, VOL
becomes significant at 0 los (level of significance)
3. Assumptions
Linear in Parameters: We assume that there is a linear relation between dependent and set of independent variables
Zero conditional mean: \(E(\epsilon \mid X) = 0\)
Homoskedasticity: \(Var(\epsilon \mid X) = \sigma^2\) (Constant)
No perfect Collinearity: All predecitors must be independent among themselves
No serial correlation in errors: Erros must be uncorrelated among themselves. In otherwords, observations or records must be independent of each other.
We discussed first 4 assumptions in section 1 and 2.
Here is a book that I recommend to learn more about this: