A sample statistic is random. The values change with different samples. The probability distribution of sample statistic is called sampling distribution. Theoretically, if you are able to collect all possible random samples of the same size from the population, he values of the sample statistic from all the samples form a sampling distribution. In practice, we only have one sample. It is not possible to observe the sampling distribution. But simulation enables us to do so.
We can simulate the process of repeated sampling and observe the distribution of the sample statistics, as well as calculate the mean and variance of the distribution and compare them with theoretical results. In the lecture slides, we see the sampling distribution of $b_0$ and $b_1$. Here we take the sample statistic MSR as an example.
We follow the following steps:
We know when $\beta_1=0, E(MSR)=\sigma^2$ and $\frac{MSR}{\sigma^2}$ follows a $\mathcal{X}_1^2$ distribution.
R Codes, Sigma = 1
n <- 100 # sample size is 100
X <- seq(1,10, length=100) # covariates don't change
msr <- array(NA, dim = 1000) # construct an empty vector to save the results
set.seed(1) # without set.seed, the result will be different each time due to randomness
sigm <- 1 # sigma is 1
for( r in 1:1000) # repeated sampling 1000 times
{
# generate Y sample of size 100 from the true model, sigma is 1
Y <- 1 + rnorm(100, 0, sigm)
ybar <- mean(Y)
mod <- lm(Y ~ X)
# calculate MSR value and record it into the vector named "msr"
msr[r] <- sum((mod$fitted.values-ybar)^2)
}
# find the sample mean of these 1000 MSR values and check if it is close to the theoretical mean of 1.
mean(msr)
## [1] 0.9979618
hist(msr/(sigm^2), prob = TRUE) # look at the shape of the sampling distribution of MSR
plot(function(x) dchisq(x, df=1), 0, 10, ylab="", add = T,
col= 2, lty =2, lwd = 2)
Sigma = 2
n <- 100
X <- seq(1,10, length=100)
msr <- array(NA, dim = 1000)
sigm <- 2 # sigma is 2
for( r in 1:1000)
{
Y <- 1 + rnorm(100, 0, sigm)
ybar <- mean(Y)
mod <- lm(Y ~ X)
msr[r] <- sum((mod$fitted.values-ybar)^2)
}
mean(msr) # compare with theoretical mean of 4.
## [1] 3.880779
hist(msr/(sigm^2), prob = TRUE)
plot(function(x) dchisq(x, df=1), 0, 10, ylab="", add = T,
col= 2, lty =2, lwd = 2)