# Bayesian linear regression

# 1 Introduction

For statistical inferences we have tow general approaches or frameworks:

**Frequentist**approach in which the data sampled from the population is considered as random and the population parameter values, known as null hypothesis, as fixed (but unknown). To estimate thus this null hypothesis we look for the sample parameters that maximize the likelihood of the data. However, the data at hand, even it is sampled randomly from the population, it is fixed now, so how can we consider this data as random. The answer is that we assume that the population distribution is known and we work out the maximum likelihood of the data using this distribution. Or we repeat the study many times with different samples then we average the results. So if we get very small value for the likelihood of the data which is known as**p-value**we tend to reject the null hypothesis. The main problem, however, is the misunderstanding and misusing of this p-value when we decide to reject the null hypothesis based on some threshold, from which we wrongly interpreting it as the probability of rejecting the null hypothesis. For more detail about p-value click here.**Bayesian**approach, in contrast, provides true probabilities to quantify the uncertainty about a certain hypothesis, but requires the use of a first belief about how likely this hypothesis is true, known as**prior**, to be able to derive the probability of this hypothesis after seeing the data known as**posterior probability**. This approach called bayesian because it is based on the bayes’ theorem, for instance if a have population parameter to estimate \(\theta\) , and we have some data sampled randomly from this population \(D\), the posterior probability thus will be \[\overbrace{p(\theta/D)}^{Posterior}=\frac{\overbrace{p(D/\theta)}^{Likelihood}.\overbrace{p(\theta)}^{Prior}}{\underbrace{p(D)}_{Evidence}}\] The**Evidence**is the probability of the data at hand regardless the parameter \(\theta\).

# 2 Data preparation

For simplicity we use the **BostonHousing** data from **mlbench** package, For more detail about this data run this command `?BostonHousing`

after calling the package. But first Let’s call all the packages that we need throughout this article.

```
options(warn = -1)
library(mlbench)
library(rstanarm)
```

`## Loading required package: Rcpp`

`## This is rstanarm version 2.21.1`

`## - See https://mc-stan.org/rstanarm/articles/priors for changes to default priors!`

`## - Default priors may change, so it's safest to specify priors, even if equivalent to the defaults.`

`## - For execution on a local, multicore CPU with excess RAM we recommend calling`

`## options(mc.cores = parallel::detectCores())`

```
library(bayestestR)
library(bayesplot)
```

`## This is bayesplot version 1.7.2`

`## - Online documentation and vignettes at mc-stan.org/bayesplot`

`## - bayesplot theme set to bayesplot::theme_default()`

`## * Does _not_ affect other ggplot2 plots`

`## * See ?bayesplot_theme_set for details on theme setting`

```
library(insight)
library(broom)
```

```
data("BostonHousing")
str(BostonHousing)
```

```
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : num 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ b : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
```

To well understand how the Bayesian regression works we keep only three features, two numeric variables **age**, **dis** and one categorical **chas**, with the target variable **medv** the median value of owner-occupied homes.

```
bost <- BostonHousing[, c("medv", "age", "dis", "chas")]
summary(bost)
```

```
## medv age dis chas
## Min. : 5.00 Min. : 2.90 Min. : 1.130 0:471
## 1st Qu.:17.02 1st Qu.: 45.02 1st Qu.: 2.100 1: 35
## Median :21.20 Median : 77.50 Median : 3.207
## Mean :22.53 Mean : 68.57 Mean : 3.795
## 3rd Qu.:25.00 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :50.00 Max. :100.00 Max. :12.127
```

From the summary we do not have any special issues like missing values for example.

# 3 Classical linear regression model

To highlight the difference between the bayesian regression and the traditional linear regression (frequentist approach), Let’s first fit the latter to our data.

```
model_freq <- lm(medv ~ ., data = bost)
tidy(model_freq)
```

```
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 32.7 2.25 14.6 2.33e-40
## 2 age -0.143 0.0198 -7.21 2.09e-12
## 3 dis -0.246 0.265 -0.928 3.54e- 1
## 4 chas1 7.51 1.46 5.13 4.16e- 7
```

Using the p.value of each regressor, all the regressors ar significant. except for the **dis** variable. Since the variable **chas** is categorical with twolevels The coefficient of **chas1** is the different between the madian price of houses on the bounds charles River and that of the others, so the median price of the former are higher about 7.513.

# 4 Bayesian regression

To fit a bayesian regresion we use the function `stan_glm`

from the rstanarm package. This function as the above **lm** function requires providing the **formula** and the data that will be used, and leave all the following arguments with their default values:

**family**: by default this function uses the**gaussian**distribution as we do with the classical`glm`

function to perform`lm`

model.**prior**: The prior distribution for the regression coefficients, By default the normal prior is used. There are subset of functions used for the prior provided by rstanarm like ,**student t family**,**laplace family**…ect. To get the full list with all the details run this command`?priors`

. If we want a flat uniform prior we set this to**NULL**.**prior_intercept**: prior for the intercept, can be normal, student_t , or cauchy. If we want a flat uniform prior we set this to**NULL**.**prior_aux**: prior fo auxiliary parameters such as the error standard deviation for the gaussion family.**algorithm**: The estimating approach to use. The default is "sampling MCMC^{1}.**QR**: FALSE by default, if true QR decomposition applied on the design matrix if we have large number of predictors.**iter**: is the number of iterations if the MCMC method is used, the default is 2000.**chains**: the number of Markov chains, the default is 4.**warmup**: also known as burnin, the number of iterations used for adaptation, and should not be used for inference. By default it is half of the iterations.

`model_bayes <- stan_glm(medv ~ ., data = bost, seed = 111)`

if we print the model we get the following

`print(model_bayes, digits = 3)`

```
## stan_glm
## family: gaussian [identity]
## formula: medv ~ .
## observations: 506
## predictors: 4
## ------
## Median MAD_SD
## (Intercept) 32.834 2.285
## age -0.143 0.020
## dis -0.258 0.257
## chas1 7.543 1.432
##
## Auxiliary parameter(s):
## Median MAD_SD
## sigma 8.324 0.260
##
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
```

The **Median** estimate is the median computed from the MCMC simulation, and **MAD_SD** is the median absolute deviation computed from the same simulation. To well understand how getting these outputs let’s plot the MCMC simulation of each predictor using bayesplot

`mcmc_dens(model_bayes, pars = c("age")) + vline_at(-0.143, col = "red")`

As you see the point estimate of **age** falls on the median of this distribution (red line). The same thing is true for **dis** and **shas** predictors.

`mcmc_dens(model_bayes, pars = c("chas1")) + vline_at(7.496, col = "red")`

`mcmc_dens(model_bayes, pars = c("dis")) + vline_at(-0.244, col = "red")`

Now how can we evaluate the model parameters? The answer is by analyzing the posteriors using some specific statistics. To get the full statistics provided by bayestestR package, we make use of the function `describe_posterior`

.

`describe_posterior(model_bayes)`

`## Possible multicollinearity between dis and age (r = 0.76). This might lead to inappropriate results. See 'Details' in '?rope'.`

```
## # Description of Posterior Distributions
##
## Parameter | Median | 89% CI | pd | 89% ROPE | % in ROPE | Rhat | ESS
## ------------------------------------------------------------------------------------------------
## (Intercept) | 32.834 | [29.218, 36.295] | 1.000 | [-0.920, 0.920] | 0 | 1.002 | 2029.279
## age | -0.143 | [-0.175, -0.112] | 1.000 | [-0.920, 0.920] | 100 | 1.001 | 2052.155
## dis | -0.258 | [-0.667, 0.179] | 0.819 | [-0.920, 0.920] | 100 | 1.002 | 2115.192
## chas1 | 7.543 | [ 5.159, 9.813] | 1.000 | [-0.920, 0.920] | 0 | 1.000 | 3744.403
```

Before starting analyzing the table we shoud first understanding the above various statistics commonly used in bayes regression.

**CI**: Credible Interval, it is used to quantify the uncertainty about the regression coefficients. Ther are tow methods to compute**CI**, the highest density interval`HDI`

which is the default, and the Equal-tailed Interval`ETI`

. with 89% probability (given the data) that a coefficient lies above the**CI_low**value and under**CI_high**value. This strightforward probabilistic interpretation is completely diffrent from the confidence interval used in classical linear regression where the coefficient fall inside this confidence interval (if we choose 95% of confidence) 95 times if we repeat the study 100 times.

**pd**: Probability of Direction , which is the probability that the effect goes to the positive or to the negative direction, and it is considered as the best equivalent for the p-value.**ROPE_CI**: Region of Practical Equivalence, since bayes method deals with true probabilities , it does not make sense to compute the probability of getting the effect equals zero (the null hypothesis) as a point (probability of a point in continuous intervals equal zero ). Thus, we define instead a small range around zero which can be considered practically the same as no effect (zero), this range therefore is called**ROPE**. By default (according to Cohen, 1988) The Rope is [-0.1,0.1] from the standardized coefficients.**Rhat**: scale reduction factor \(\hat R\), it is computed for each scalar quantity of interest, as the standard deviation of that quantity from all the chains included together, divided by the root mean square of the separate within-chain standard deviations. When this value is close to 1 we do not have any convergence problem with MCMC.**ESS**: effective sample size, it captures how many independent draws contain the same amount of information as the dependent sample obtained by the MCMC algorithm, the higher the ESS the better. The threshold used in practice is 400.

Aternatively, we can get the coefficeient estimates (which are the medians by default) separatly by using the package **insight**

```
post <- get_parameters(model_bayes)
print(purrr::map_dbl(post, median), digits = 3)
```

```
## (Intercept) age dis chas1
## 32.834 -0.143 -0.258 7.543
```

We can also compute the Maximum A posteriori (MAP), and the mean as follows

`print(purrr::map_dbl(post, map_estimate), digits = 3)`

```
## (Intercept) age dis chas1
## 33.025 -0.145 -0.295 7.573
```

`print(purrr::map_dbl(post, mean), digits = 3)`

```
## (Intercept) age dis chas1
## 32.761 -0.143 -0.248 7.523
```

As we see the values are closer to each other due to the like normality of the distribution of the posteriors where all the central statistics (mean, median, mode) are closer to each other. Using the following plot to visualize the age coefficient using different statistics as follows

```
mcmc_dens(model_bayes, pars = c("age")) + vline_at(median(post$age), col = "red") +
vline_at(mean(post$age), col = "yellow") + vline_at(map_estimate(post$age), col = "green")
```

As expected they are approximately on top of each other.

# 5 Bayesian inferences

As we do with classical regression (frequentist), we can test the significance of the bayesian regression coefficients by checking whether the corresponding credible interval contains zero or not, if no then this coefficient is significant. Let’s go back to our model and check the significance of each coefficient (using credible based on the default `hdi`

).

`hdi(model_bayes)`

```
## # Highest Density Interval
##
## Parameter | 89% HDI
## ----------------------------
## (Intercept) | [29.22, 36.29]
## age | [-0.18, -0.11]
## dis | [-0.67, 0.18]
## chas1 | [ 5.16, 9.81]
```

And based on the `eti`

`eti(model_bayes)`

```
## # Equal-Tailed Interval
##
## Parameter | 89% ETI
## ----------------------------
## (Intercept) | [29.20, 36.28]
## age | [-0.17, -0.11]
## dis | [-0.67, 0.18]
## chas1 | [ 5.17, 9.83]
```

Using both methods, the only non significant coefficient is **dis** variable, which is inline with the classical regression.

**Note**: this similar result between frequentist and bayesian regression may due to the normality assumption for the former that is well satisfied which gives satisfied results and due to the normal prior used in the latter. However, in real world it is less often to be sure about the normality assumption which may give contradict conclusions between the two approaches.

Another way to test the significance by checking the part of the credible interval that falls inside the ROPE interval. we can get this by calling the `rope`

from **bayestestR** package

`rope(post$age)`

```
## # Proportion of samples inside the ROPE [-0.10, 0.10]:
##
## inside ROPE
## -----------
## 0.00 %
```

For age almost all the credible interval (HDI) is outside the ROPE range, which means that coefficient is highly significant.

`rope(post$chas1)`

```
## # Proportion of samples inside the ROPE [-0.10, 0.10]:
##
## inside ROPE
## -----------
## 0.00 %
```

`rope(post$`(Intercept)`)`

```
## # Proportion of samples inside the ROPE [-0.10, 0.10]:
##
## inside ROPE
## -----------
## 0.00 %
```

The same thing is true for the **chas** and **intercept** variable.

`rope(post$dis)`

```
## # Proportion of samples inside the ROPE [-0.10, 0.10]:
##
## inside ROPE
## -----------
## 20.02 %
```

In contrast, almost the quarter of the credible interval of **dis** variable is inside the ROPE interval. In other words, the probability of this coefficient to be zero is 23.28%.

`rope_range(model_bayes)`

`## [1] -0.9197104 0.9197104`

# 6 PD and P-value

Sometimes we are only interested to check the direction of the coefficient (positive or negative). this is the role of **pd** statistic in the above table, high value means that the associated effect is concentrated on the same side as the median. For our model, since pd’s equal to 1, almost all the posteriors of the two variables **age** and **chas1** and the intercept are on the same side (if median negative all other values are negatives). However, it should be noted that this statistic does not assess the significance of the effect.
Something more important to mention is that it exists a strong relation between this probability and the p-value approximated as follows: \(p-value=1-pd\). let’s check this with our variables.

```
df1 <- dplyr::select(tidy(model_freq), c(term, p.value))
df1$p.value <- round(df1$p.value, digits = 3)
df2 <- 1 - purrr::map_dbl(post, p_direction)
df <- cbind(df1, df2)
df
```

```
## term p.value df2
## (Intercept) (Intercept) 0.000 0.00000
## age age 0.000 0.00000
## dis dis 0.354 0.18075
## chas1 chas1 0.000 0.00000
```

# Conclusion

within the last decade more practitioner , specially in some fields such as medicine and psychology, are turning towards bayesian analysis since almost every thing can be interpreted straightforwardly with a probabilistic manner. However, the Bayesian analysis has also some drawback , like the subjective way to define the priors (which play an important role to compute the posterior), or for problems that do not have conjugate prior, not always the mcmc alghorithm converges easily to the right values (specially with complex data).

Kevin P.murphy, Machine Learning: A Probabilistic Perspective, 2012, page 589↩︎