2. Population sampling

VCMNA307: level 9: Apply set structures to solve real-world problems

Exploring variation in proportion and means of random samples, drawn from a population

2.1. Population mean

Population mean \(\mu\) = \(\displaystyle \frac{\text{sum of all population values}}{\text{population size}}\)
To get the population mean, collect numerical data about every object in a population and calculate the mean using the formula above.
It is usually impractical or too costly to determine the population mean exactly.
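As a small illustration of the formula, the sketch below computes the population mean of a made-up population of five values. The values and the variable names are assumptions chosen for illustration only.

```python
# Hypothetical population: the heights (in cm) of all five members of a small club
population = [150, 160, 165, 170, 180]

# Population mean = sum of all population values / population size
population_mean = sum(population) / len(population)
print(population_mean)  # 165.0
```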

2.2. Sample Mean

Sample mean \(\bar{x}\) = \(\displaystyle \frac{\text{sum of all sample values}}{\text{sample size}}\)

The sample mean is found by selecting a sample from the population and determining its mean instead.
The sample mean will vary from sample to sample.
The more representative the sample is of the population or the larger the size of the sample, the more likely the sample mean will provide a good estimate of the population mean.
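To see sample-to-sample variation directly, the following sketch draws three samples from a simulated population and prints their means next to the population mean. The seed, the sample size of 25 and the variable names are assumptions made for illustration; the population parameters match those used later in this section.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # seed is an assumption, used for repeatable output

# Simulated population with mean 50 and standard deviation 10
population = rng.normal(loc=50, scale=10, size=10000)

# Draw three samples of 25 values each (without replacement) and take each mean
sample_means = [rng.choice(population, size=25, replace=False).mean() for _ in range(3)]

print("population mean:", round(float(population.mean()), 2))
print("sample means:", [round(float(m), 2) for m in sample_means])
```

Each run with a different seed gives different sample means, all scattered around the population mean.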

2.3. Sample Means: Increasing samples

What is the effect of increasing the number of samples taken on the estimate of the population mean?
The code below generates a population of 10000 individuals with a mean of 50 and a standard deviation of 10.
It then draws 10, 100 and 1000 random samples, each of size 100, from the population and calculates the mean of each sample.
Finally, it plots histograms of the sample means to visualize their variation.

Increasing the number of samples drawn from a population gives a clearer picture of how the sample means are distributed.
With only a few samples, the histogram of sample means is ragged and its centre can sit noticeably away from the true population mean.
As the number of samples increases, the average of the sample means settles closer to the population mean (the Law of Large Numbers), and the histogram fills in towards the smooth bell shape that the Central Limit Theorem predicts for sample means.

The spread of the sample means does not shrink as more samples are taken: it depends on the sample size, not on the number of samples.

This can be seen in the histograms produced by the code below: as the number of samples increases, the histograms become taller and smoother but stay about the same width, centred on the population mean.

"""histograms increasing the number of samples
"""
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt

currfile_dir = Path(__file__).parent

# Set the population parameters
population_size = 10000
population_mean = 50
population_std = 10

# Generate the population
population = np.random.normal(loc=population_mean, scale=population_std, size=population_size)

# Set the sample parameters
sample_size = 100
number_of_samples_list = [10, 100, 1000]

def inc_sample_means(sample_size,number_of_samples_list):
    fig, ax = plt.subplots(1, 3, figsize=(15, 5), sharex=True)
    # Generate the samples
    for i in range(len(number_of_samples_list)):
        samples = np.random.choice(population, size=(number_of_samples_list[i], sample_size))
        # Calculate the sample means
        sample_means = samples.mean(axis=1)
        # Plot the sample means
        
        ax[i].hist(sample_means)
        ax[i].set_title('Sample Means: ' + str(number_of_samples_list[i]) + " samples")

    # Save the figure as a PNG image
    filepath = currfile_dir / ('sample_means_inc_samples.png')
    plt.savefig(filepath, dpi=600)
    plt.show()

inc_sample_means(sample_size,number_of_samples_list)
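A numerical check can complement the histograms. The sketch below repeats the sampling from the code above and records the mean and standard deviation of the sample means for each number of samples. The seed and the `summary` dictionary are assumptions added for illustration. In theory the standard deviation of the sample means depends on the sample size (about 10/√100 = 1 here), not on how many samples are drawn.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # seed is an assumption, used for repeatable output
population = rng.normal(loc=50, scale=10, size=10000)
sample_size = 100

summary = {}  # number of samples -> (mean of sample means, std of sample means)
for number_of_samples in [10, 100, 1000]:
    # Each row is one sample, drawn with replacement as in the code above
    samples = rng.choice(population, size=(number_of_samples, sample_size))
    sample_means = samples.mean(axis=1)
    summary[number_of_samples] = (sample_means.mean(), sample_means.std(ddof=1))
    centre, spread = summary[number_of_samples]
    print(f"{number_of_samples:5d} samples: mean of sample means = {centre:.2f}, "
          f"standard deviation = {spread:.2f}")
```

The mean of the sample means should settle near 50 as more samples are taken, while the standard deviation should stay near 1.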

2.4. Sample Means: Increasing sample size

What is the effect of increasing the sample size on the estimate of the population mean?
The code below generates a population of 10000 individuals with a mean of 50 and a standard deviation of 10.
It then draws 100 random samples at each of three sample sizes (10, 100 and 1000) and calculates the mean of each sample.
Finally, it plots histograms of the sample means to visualize their variation.

Increasing the size of the samples drawn from a population generally leads to more accurate estimates of the population mean.
This is because as the sample size increases, the sample means become more tightly clustered around the true population mean: their spread falls in proportion to one over the square root of the sample size.
The Central Limit Theorem adds that, for large samples, the sample means are approximately normally distributed around the population mean.

As a result, a sample mean from a larger sample is more likely to be close to the population mean.

This can be seen in the histograms of the sample means: as the sample size increases, the histograms become narrower, indicating that the sample means are becoming more concentrated around the population mean.

It’s important to note that increasing the sample size has diminishing returns.
As the sample size gets larger and larger, the improvement in accuracy becomes smaller and smaller.
At some point, increasing the sample size further may not be worth the additional cost and effort.
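The diminishing returns can be put into numbers with the standard formula for the standard error of a sample mean, population standard deviation divided by the square root of the sample size. This formula is a standard result that the text above does not state, so treat the sketch as an illustration under that assumption.

```python
import math

population_std = 10  # same standard deviation as the population in this section

# Standard error of the sample mean: population_std / sqrt(sample size)
standard_errors = {n: population_std / math.sqrt(n) for n in [25, 100, 400, 1600]}

for n, se in standard_errors.items():
    print(f"sample size {n:5d}: standard error = {se:.2f}")
# sample size    25: standard error = 2.00
# sample size   100: standard error = 1.00
# sample size   400: standard error = 0.50
# sample size  1600: standard error = 0.25
```

Each fourfold increase in sample size only halves the standard error, so going from 400 to 1600 values costs 1200 extra measurements for a gain of 0.25.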

"""histograms increasing the sample size
"""
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt

currfile_dir = Path(__file__).parent

# Set the population parameters
population_size = 10000
population_mean = 50
population_std = 10

# Generate the population
population = np.random.normal(loc=population_mean, scale=population_std, size=population_size)

# Set the sample parameters
sample_size_list = [10, 100, 1000]
number_of_samples = 100


def inc_sample_size(sample_size_list,number_of_samples):
    fig, ax = plt.subplots(1, 3, figsize=(15, 5), sharex=True)
    # Generate the samples
    for i in range(len(sample_size_list)):
        samples = np.random.choice(population, size=(number_of_samples, sample_size_list[i]))
        # Calculate the sample means
        sample_means = samples.mean(axis=1)
        # Plot the sample means
        
        ax[i].hist(sample_means)
        ax[i].set_title('Sample Means: size ' + str(sample_size_list[i]))

    # Save the figure as a PNG image
    filepath = currfile_dir / ('sample_means_inc_size.png')
    plt.savefig(filepath, dpi=600)
    plt.show()

inc_sample_size(sample_size_list,number_of_samples)
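The narrowing of the histograms can also be measured. The sketch below repeats the sampling from the code above and compares the observed standard deviation of the sample means with the value predicted by population standard deviation / √(sample size). The seed, the `spread_by_size` dictionary and the use of that formula are assumptions added for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=5)  # seed is an assumption, used for repeatable output
population = rng.normal(loc=50, scale=10, size=10000)
number_of_samples = 100

spread_by_size = {}  # sample size -> (observed spread, predicted spread)
for sample_size in [10, 100, 1000]:
    # Each row is one sample, drawn with replacement as in the code above
    samples = rng.choice(population, size=(number_of_samples, sample_size))
    observed = samples.mean(axis=1).std(ddof=1)
    predicted = population.std() / np.sqrt(sample_size)
    spread_by_size[sample_size] = (observed, predicted)
    print(f"size {sample_size:4d}: observed spread = {observed:.2f}, predicted = {predicted:.2f}")
```

With only 100 samples per size, the observed values will wobble around the predicted ones rather than match them exactly.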

2.5. Population Proportion

Population proportion \(p\) = \(\displaystyle \frac{\text{number of objects with trait in population}}{\text{population size}}\)
The population proportion is found by collecting categorical data about every object in a population and calculating the proportion with a trait.
However, it is usually impractical or too costly to determine the population proportion exactly.

2.6. Sample Proportion

Sample proportion \(\hat{p}\) = \(\displaystyle \frac{\text{number of objects with trait in sample}}{\text{sample size}}\)
The sample proportion is found by selecting a sample of that population and determining the proportion with a trait.
The more representative the sample is of the population or the larger the sample size, the more likely the sample proportion will provide a good estimate of the population proportion.
The sample proportion varies from sample to sample and always lies between 0 and 1.
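The two proportion formulas can be compared side by side in a short sketch. It assumes a hypothetical population in which 100 of 500 objects have the trait (coded as 1) and draws one sample of 20. The coding, the seed and the sample size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(seed=3)  # seed is an assumption, used for repeatable output

# Hypothetical population: 1 marks an object with the trait, 0 an object without it
population = np.array([1] * 100 + [0] * 400)
population_proportion = population.sum() / population.size  # 100 / 500 = 0.2

# One sample of 20 objects, drawn without replacement
sample = rng.choice(population, size=20, replace=False)
sample_proportion = sample.sum() / sample.size

print("population proportion:", population_proportion)
print("sample proportion:", sample_proportion)
```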

2.7. Sample Proportion: Increasing samples

What is the effect of increasing the number of samples taken on the estimate of the population proportion?
The population is made up of 100 blue balls and 400 red balls.
The proportion of blue balls in the population is 100 out of 500, or 0.20.
Each sample contains 10 balls.
The histograms compare the proportions of blue balls in 3 attempts at taking 5 samples and 1 attempt with 1000 samples.

../_images/sample_proportions_inc_samples.png

"""histograms increasing the number of samples
"""
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt

currfile_dir = Path(__file__).parent

# Set the population parameters
population = ['blue'] * 100 + ['red'] * 400

# Set the sample parameters
sample_size = 10
number_of_samples_list = [5, 5, 5, 1000]

def inc_sample_proportions(sample_size,number_of_samples_list):
    fig, ax = plt.subplots(1, 4, figsize=(15, 5), sharex=True)
    # Generate the samples
    for i in range(len(number_of_samples_list)):
        samples = [np.random.choice(population, size=sample_size, replace=False) for _ in range(number_of_samples_list[i])]
        # Calculate the sample proportions
        sample_proportions = [np.mean(sample == 'blue') for sample in samples]
        # Plot the sample proportions
        
        ax[i].hist(sample_proportions)
        ax[i].set_title('Sample Proportions: ' + str(number_of_samples_list[i]) + " samples")
    # Add a title to the figure
    fig.suptitle('Proportion of Blue Balls')
    # Save the figure as a PNG image
    filepath = currfile_dir / ('sample_proportions_inc_samples.png')
    plt.savefig(filepath, dpi=600)
    plt.show()

inc_sample_proportions(sample_size,number_of_samples_list)

The code line population = ['blue'] * 100 + ['red'] * 400 creates a population list that contains 500 elements: 100 'blue' strings followed by 400 'red' strings. This represents a population of 500 balls, where 100 are blue and 400 are red.

The code line samples = [np.random.choice(population, size=sample_size, replace=False) for _ in range(number_of_samples_list[i])] generates a list of random samples from the population list. Each sample has a size of sample_size and is drawn without replacement.

The np.random.choice function generates a single random sample from the population list. The size parameter specifies the size of the sample, and the replace parameter specifies whether sampling is done with replacement (i.e., whether the same element can be selected multiple times). Here replace=False, so each ball can appear at most once in a sample.
The list comprehension [np.random.choice(population, size=sample_size, replace=False) for _ in range(number_of_samples_list[i])] applies this operation number_of_samples_list[i] times to generate a list of number_of_samples_list[i] random samples.
After this line of code is executed, the samples list contains number_of_samples_list[i] random samples from the population list. Each sample is a NumPy array of sample_size elements drawn randomly from the population list without replacement.

The code line sample_proportions = [np.mean(sample == 'blue') for sample in samples] calculates the proportion of blue balls in each sample and stores the results in a list named sample_proportions.

The expression sample == 'blue' creates a Boolean array that has the same shape as sample and contains True where the elements of sample are equal to 'blue' and False elsewhere.
The np.mean function calculates the mean of this Boolean array by treating True as 1 and False as 0. This gives the proportion of blue balls in the sample.
The list comprehension [np.mean(sample == 'blue') for sample in samples] applies this calculation to each sample in the list samples and stores the results in a new list named sample_proportions.
After this line of code is executed, the sample_proportions list contains the proportion of blue balls in each sample.
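The Boolean-mean step can be checked on a tiny sample written out by hand. The five-ball sample below is hypothetical and chosen so the result is easy to verify.

```python
import numpy as np

# A hypothetical sample of five balls: two blue, three red
sample = np.array(['blue', 'red', 'red', 'blue', 'red'])

is_blue = sample == 'blue'          # elementwise comparison gives a Boolean array
proportion_blue = np.mean(is_blue)  # True counts as 1, False as 0

print(is_blue)          # [ True False False  True False]
print(proportion_blue)  # 0.4
```

The elementwise comparison works because np.random.choice returns a NumPy array. If sample were a plain Python list, sample == 'blue' would compare the whole list with the string and give a single False.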

2.8. Sample Proportion: Increasing sample size

What is the effect of increasing the sample size on the estimate of the population proportion?
The population is made up of 100 blue balls and 400 red balls.
The proportion of blue balls in the population is 100 out of 500, or 0.20.
Samples of 10, 20, 40 and 100 balls are taken, with 20 samples drawn at each size.
The histograms compare the proportions of blue balls taken for each sample size.

../_images/sample_proportions_inc_size.png

"""histograms increasing the sample size
"""
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt

currfile_dir = Path(__file__).parent

# Set the population parameters
population = ['blue'] * 100 + ['red'] * 400

# Set the sample parameters
sample_size_list = [10, 20, 40, 100]
number_of_samples = 20

def inc_sample_size_proportions(sample_size_list,number_of_samples):
    fig, ax = plt.subplots(1, 4, figsize=(15, 5), sharex=True)
    # Generate the samples
    for i in range(len(sample_size_list)):
        samples = [np.random.choice(population, size=sample_size_list[i], replace=False) for _ in range(number_of_samples)]
        # Calculate the sample proportions
        sample_proportions = [np.mean(sample == 'blue') for sample in samples]
        # Plot the sample proportions
        
        ax[i].hist(sample_proportions)
        ax[i].set_title('Sample Proportions: size ' + str(sample_size_list[i]))
    # Add a title to the figure
    fig.suptitle('Proportion of Blue Balls')
    # Save the figure as a PNG image
    filepath = currfile_dir / ('sample_proportions_inc_size.png')
    plt.savefig(filepath, dpi=600)
    plt.show()

inc_sample_size_proportions(sample_size_list,number_of_samples)
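The effect of sample size on the sample proportions can also be summarised as a single number per sample size. The sketch below measures the standard deviation of the sample proportions for the same ball population. It uses 1000 samples per size rather than 20 to steady the estimate; that count, the seed and the `spread` dictionary are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=4)  # seed is an assumption, used for repeatable output
population = np.array(['blue'] * 100 + ['red'] * 400)

spread = {}  # sample size -> standard deviation of the sample proportions
for sample_size in [10, 20, 40, 100]:
    proportions = [
        np.mean(rng.choice(population, size=sample_size, replace=False) == 'blue')
        for _ in range(1000)
    ]
    spread[sample_size] = np.std(proportions)
    print(f"sample size {sample_size:3d}: standard deviation of sample proportions = "
          f"{spread[sample_size]:.3f}")
```

The standard deviation should fall as the sample size grows, which matches the narrowing of the histograms.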