Simulation

A sample statistic is random: its value changes from sample to sample. The probability distribution of a sample statistic is called its sampling distribution. Theoretically, if you could collect all possible random samples of the same size from the population, the values of the sample statistic from all of these samples would form the sampling distribution. In practice, we only have one sample, so the sampling distribution cannot be observed directly. Simulation, however, enables us to do so.

We can simulate the process of repeated sampling, observe the distribution of the sample statistic, calculate the mean and variance of that distribution, and compare them with the theoretical results. In the lecture slides, we saw the sampling distributions of $b_0$ and $b_1$. Here we take the sample statistic MSR as an example.
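To connect with the slides, the same recipe applies to the slope estimate $b_1$. Below is a minimal sketch, not taken from the slides; the true values $\beta_0 = 2$, $\beta_1 = 0.5$, $\sigma = 1$ and the 1000 replications are illustrative assumptions. The sample mean of the estimates should be close to $\beta_1$ and their sample variance close to $\sigma^2 / \sum_i (X_i - \bar{X})^2$.

n <- 100
X <- seq(1, 10, length = n)          # fixed covariate values
b1 <- array(NA, dim = 1000)          # storage for the 1000 slope estimates
set.seed(1)
for (r in 1:1000)
{
  Y <- 2 + 0.5 * X + rnorm(n, 0, 1)  # one random sample from the assumed true model
  b1[r] <- coef(lm(Y ~ X))[2]        # least squares estimate of the slope
}
mean(b1)                             # should be close to the true slope 0.5
var(b1)                              # should be close to 1 / sum((X - mean(X))^2)
hist(b1, prob = TRUE)                # roughly normal, centred at 0.5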

We follow these steps:

  1. We know the true model; that is, we know the values of all parameters. In this example, we specify $\beta_0 = 1$, $\beta_1 = 0$, and $\sigma = 1, 2, 3$.
  2. We generate a random sample from the true model: $Y_i = 1 + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.
  3. We pretend not to know the values of the parameters and use least squares estimation to obtain estimates of them. In this example, we calculate the sample statistic MSR.
  4. We repeat the previous two steps 1000 times, so that we have many random samples and their corresponding estimates: $(MSR_1, \ldots, MSR_{1000})$.
  5. We look at the distribution of the estimates, calculate their mean and variance, and compare them with the theoretical results.

We know that when $\beta_1 = 0$, $E(MSR) = \sigma^2$ and $\frac{MSR}{\sigma^2}$ follows a $\chi_1^2$ distribution.
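For reference, here is a short sketch of why this holds in simple linear regression with normal errors: the regression sum of squares has one degree of freedom, so

$$MSR = \frac{SSR}{1} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2, \qquad \frac{SSR}{\sigma^2} \sim \chi_1^2 \ \text{ when } \beta_1 = 0,$$

and since $E(\chi_1^2) = 1$ and $\mathrm{Var}(\chi_1^2) = 2$, we get $E(MSR) = \sigma^2$ and $\mathrm{Var}(MSR) = 2\sigma^4$.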

R Code, Sigma = 1

n <- 100 # sample size is 100
X <- seq(1,10, length=100) # covariates don't change
msr <- array(NA, dim = 1000) # construct an empty vector to save the results
set.seed(1) # without set.seed, the result will be different each time due to randomness
sigm <- 1 # sigma is 1
for (r in 1:1000) # repeated sampling, 1000 times
{
  # generate a Y sample of size 100 from the true model, with sigma = 1
  Y <- 1 + rnorm(100, 0, sigm)
  ybar <- mean(Y)
  mod <- lm(Y ~ X)
  # calculate the MSR value and record it in the vector named "msr"
  msr[r] <- sum((mod$fitted.values - ybar)^2)
}
# find the sample mean of these 1000 MSR values and check if it is close to the theoretical mean of 1.
mean(msr)
## [1] 0.9979618
hist(msr/(sigm^2), prob = TRUE) # look at the shape of the sampling distribution of MSR
plot(function(x) dchisq(x, df = 1), 0, 10, ylab = "", add = TRUE,
     col = 2, lty = 2, lwd = 2)

Figure: histogram of $MSR/\sigma^2$ for $\sigma = 1$, with the $\chi_1^2$ density curve overlaid.
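Step 5 also calls for the variance. As a quick check, continuing from the msr vector above, the sample variance should be close to the theoretical value $\mathrm{Var}(MSR) = 2\sigma^4 = 2$:

var(msr)     # sample variance of the 1000 MSR values
2 * sigm^4   # theoretical variance when sigma is 1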

Sigma = 2

n <- 100
X <- seq(1,10, length=100)
msr <- array(NA, dim = 1000)
sigm <- 2 # sigma is 2
for (r in 1:1000)
{
  Y <- 1 + rnorm(100, 0, sigm)
  ybar <- mean(Y)
  mod <- lm(Y ~ X)
  msr[r] <- sum((mod$fitted.values - ybar)^2)
}
mean(msr) # compare with theoretical mean of 4.
## [1] 3.880779
hist(msr/(sigm^2), prob = TRUE)
plot(function(x) dchisq(x, df = 1), 0, 10, ylab = "", add = TRUE,
     col = 2, lty = 2, lwd = 2)
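Step 1 also listed $\sigma = 3$. Below is a minimal sketch that wraps the same simulation in a helper function and runs it with $\sigma = 3$; the function name sim_msr and its default arguments are illustrative, not part of the original code. The theoretical mean is $\sigma^2 = 9$ and the theoretical variance is $2\sigma^4 = 162$.

sim_msr <- function(sigm, nrep = 1000, n = 100)
{
  # repeat the sampling and estimation steps nrep times for a given sigma
  X <- seq(1, 10, length = n)
  msr <- array(NA, dim = nrep)
  for (r in 1:nrep)
  {
    Y <- 1 + rnorm(n, 0, sigm)
    mod <- lm(Y ~ X)
    msr[r] <- sum((mod$fitted.values - mean(Y))^2)
  }
  msr
}
set.seed(1)
msr3 <- sim_msr(3)
mean(msr3)                  # compare with the theoretical mean of 9
var(msr3)                   # compare with the theoretical variance 2 * 3^4 = 162
hist(msr3 / 9, prob = TRUE) # shape should again resemble a chi-squared(1) density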