
Sample Size Impact on Probability Distribution: A Visualization Project

A probability distribution describes how the values of a random variable are distributed. It provides a model that gives the probabilities of all possible outcomes of a random process.

For continuous random variables, this is represented as a probability density function (pdf), while for discrete random variables, it's represented as a probability mass function (pmf).


The most renowned of the continuous probability distributions is the normal distribution, often referred to as the bell curve because of its bell-like shape. It's completely described by its mean (μ) and standard deviation (σ), with about 68% of the data falling within one standard deviation of the mean, 95% within two, and 99.7% within three.
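The 68–95–99.7 rule above can be checked directly from the normal CDF. A minimal sketch using scipy.stats:

```python
from scipy.stats import norm

# Probability of landing within k standard deviations of the mean:
# P(mu - k*sigma < X < mu + k*sigma) = Phi(k) - Phi(-k) for a standard normal
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} sigma: {p:.4%}")
```

This prints roughly 68.27%, 95.45%, and 99.73%, matching the rule of thumb.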



Construct probability distributions from samples of size 1,000; 10,000; and 1,000,000 (1 million), each drawn from a normal distribution.


Objective:

To understand and visualize how the shape and accuracy of the probability distribution change with varying sample sizes. This project focuses on using a normal distribution as an example.


Requirements:

You need to generate random samples from a normal distribution. The sample sizes to consider are 1,000; 10,000; and 1,000,000. Store the samples as sample1, sample2, and sample3. Use scipy.stats.norm to generate the random samples for the three sample sizes.
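One way to generate the three samples is with norm.rvs. This sketch assumes a standard normal (mean 0, standard deviation 1) and a fixed seed for reproducibility — both are choices, not requirements of the brief:

```python
from scipy.stats import norm

# Sample sizes required by the project: 1,000; 10,000; 1,000,000
SEED = 42  # fixed seed so the figures are reproducible (an assumption)

sample1 = norm.rvs(loc=0, scale=1, size=1_000, random_state=SEED)
sample2 = norm.rvs(loc=0, scale=1, size=10_000, random_state=SEED)
sample3 = norm.rvs(loc=0, scale=1, size=1_000_000, random_state=SEED)

print(sample1.shape, sample2.shape, sample3.shape)
```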


Sample size plays a crucial role in statistical inference, particularly when estimating the true underlying distribution from which the sample data has been drawn.


  • With smaller sample sizes, there is a greater chance of observing extreme values or outliers. The estimated probability distribution can be more erratic and might not represent the true underlying distribution accurately. As a result, any inferences or conclusions drawn from such a sample might not be reliable.


  • As the sample size increases, the estimated distribution begins to resemble the true underlying distribution more closely, thanks to the Law of Large Numbers. This law states that as the size of the sample drawn from a population increases, the average of this sample gets closer and closer to the average of the whole population.


Larger sample sizes provide a more stable and accurate representation of the distribution. They reduce the impact of outliers or extreme values, allowing the sample mean to converge to the population mean and the sample variance to stabilize around the population variance.
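This convergence can be seen numerically: as n grows, the sample mean and standard deviation of a standard normal sample drift toward the population values of 0 and 1. A quick sketch (seed fixed for reproducibility, an assumption):

```python
from scipy.stats import norm

# Errors of the sample mean and sample std versus the true values (0 and 1)
# shrink roughly like 1/sqrt(n) as the sample size grows
for n in (1_000, 10_000, 1_000_000):
    s = norm.rvs(loc=0, scale=1, size=n, random_state=0)
    print(f"n={n:>9,}  |mean|={abs(s.mean()):.5f}  |std - 1|={abs(s.std() - 1):.5f}")
```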

Create a figure with three subplots side by side, each representing a histogram of one of the sample sizes. Overlay the expected probability density function of the normal distribution on each histogram. Each subplot should have an appropriate title indicating the sample size.
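A minimal sketch of that figure, assuming a standard normal, seed 42, and the non-interactive Agg backend (drop the backend line when running interactively):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (an assumption for scripted runs)
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

samples = {
    "n = 1,000": norm.rvs(size=1_000, random_state=42),
    "n = 10,000": norm.rvs(size=10_000, random_state=42),
    "n = 1,000,000": norm.rvs(size=1_000_000, random_state=42),
}

fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
x = np.linspace(-4, 4, 400)
for ax, (title, sample) in zip(axes, samples.items()):
    # density=True normalizes the histogram so it is on the same scale as the PDF
    ax.hist(sample, bins=50, density=True, alpha=0.6)
    ax.plot(x, norm.pdf(x), "r-", lw=2)
    ax.set_title(title)
fig.savefig("sample_size_histograms.png")
```

Using density=True is the key detail: without it the histogram counts would dwarf the PDF curve.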


The histograms and PDFs for the normal distribution showed a clear trend: as the sample size increased, the histograms became smoother and more closely resembled the expected PDF of the normal distribution. This demonstrates the impact of sample size on the accuracy and reliability of estimated distributions.


Repeat the above steps but for a different continuous distribution of your choice. Compare and contrast the histograms and probability density functions between the normal distribution and the new distribution you chose.
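The brief leaves the second distribution open; one possible choice is the exponential distribution, which is skewed and bounded below by zero, so the contrast with the symmetric bell curve is easy to see. A sketch under that assumption (scale 1, seed 42, Agg backend):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (an assumption for scripted runs)
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import expon

sizes = (1_000, 10_000, 1_000_000)
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
x = np.linspace(0, 8, 400)
for ax, n in zip(axes, sizes):
    sample = expon.rvs(scale=1, size=n, random_state=42)
    ax.hist(sample, bins=50, density=True, alpha=0.6)
    ax.plot(x, expon.pdf(x, scale=1), "r-", lw=2)
    ax.set_title(f"Exponential, n = {n:,}")
fig.savefig("exponential_histograms.png")
```

The same trend should appear: the histogram hugs the exponential PDF more tightly as n grows, even though the shape being approached is now asymmetric.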
