We're covering the Central Limit Theorem and taking large numbers of samples from a dataset. To my way of thinking, we should be sampling with replacement, but the exercise comes out incorrect if I do it that way. And unfortunately, DataCamp doesn't tell me *why* they're sampling without replacement.
Sample with replacement:
* the underlying data source is the same for every sample
* the distribution of the sample means is approximately normal, so the standard results just work (e.g., ~68% of sample means fall within one SD of their mean, ~95% within two SDs; see the sketch after this list)
* with a large enough set of samples, the total number of sampled observations can exceed the number of actual observations.
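To make that concrete, here's a minimal sketch of the with-replacement case. The `population` array is hypothetical stand-in data (deliberately non-normal), not the DataCamp dataset: because every sample is drawn from the full population, the 68/95 rules hold for the sample means.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical, deliberately skewed population standing in for the dataset.
population = rng.exponential(scale=2.0, size=10_000)

n_samples, sample_size = 5_000, 50
# Every sample is drawn from the same full population, with replacement.
means = np.array([
    rng.choice(population, size=sample_size, replace=True).mean()
    for _ in range(n_samples)
])

mu, sd = means.mean(), means.std()
print(f"within 1 SD: {np.mean(np.abs(means - mu) <= sd):.0%}")      # ~68%
print(f"within 2 SD: {np.mean(np.abs(means - mu) <= 2 * sd):.0%}")  # ~95%
```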
Sample without replacement:
* the underlying data source changes for each succeeding sample
* the mean of the sample means should gradually approach the mean of the source distribution, since as the number of samples increases, the samples come closer to exhausting the underlying source
* exhausting the underlying source means that no more samples can be drawn once the number of sampled observations equals the number of underlying observations (see the sketch after this list).
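And here's a sketch of my reading of the without-replacement case, again with hypothetical stand-in data: each drawn sample is removed from the pool, so the pool shrinks with every sample and the loop stops once the source is exhausted.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical source pool; shrinks as samples are removed from it.
pool = list(rng.exponential(scale=2.0, size=1_000))

sample_size = 50
sample_means = []
while len(pool) >= sample_size:
    # Draw indices without replacement, then delete them from the pool.
    idx = set(rng.choice(len(pool), size=sample_size, replace=False).tolist())
    sample_means.append(np.mean([pool[i] for i in idx]))
    pool = [v for i, v in enumerate(pool) if i not in idx]

print(f"samples drawn before exhaustion: {len(sample_means)}")  # 1000 / 50 = 20
print(f"mean of sample means: {np.mean(sample_means):.3f}")
```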
I can see how the CLT applies when replacement is turned on, but without replacement, each sample is drawn from a different dataset ... smaller and smaller subsets of the original dataset.