Summary
I. The central limit theorem
   A. Introductory example and vocabulary
   B. The central limit theorem
II. Confidence intervals
   A. Principle
   B. Building a confidence interval
III. Introduction to hypothesis testing
   A. Experiment design and formulating hypotheses
   B. Types of errors
   C. Conducting a hypothesis test

The central limit theorem
The main goal of statistics is to use sample data to draw conclusions about an entire population. The central limit theorem is a fundamental theorem of statistics which essentially states that it is possible to draw conclusions about a population based on the information contained in a sample population.
Introductory example and vocabulary
Population
A population is a set of similar items or events which are of interest to a question or experiment.
Suppose we are interested in the heights of humans; the population would then be all people on Earth.
Sample Population
A sample population is a part of the population which is measured or studied in order to draw conclusions about the entire population.
Suppose we are interested in the heights of humans. If we advertised that we are looking for volunteers whose heights we can measure and 300 people volunteered, then those 300 people would be a sample population of all people on Earth.
The central limit theorem
Central limit theorem
Let X_1, ..., X_n be independent random variables that follow the same distribution and have a finite mean \mu and standard deviation \sigma.
Let \overline{X} be equal to \dfrac1n\sum_{i=1}^{n}X_i and let Z be a random variable that follows \mathcal{N} \left(\mu,\dfrac{\sigma}{\sqrt{n}} \right).
When n is large (usually n≥30 ), then for any a\in\mathbb{R} :
P\left(\overline{X}≤a\right)\approx P\left(Z≤a\right)
The central limit theorem establishes that when you perform a large number of identical and independent experiments, the average of the resulting random variables can be well approximated by a random variable that follows a normal distribution.
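As an illustration (a sketch, not part of the lesson itself), the theorem can be checked by simulation: average n uniform random variables many times and observe that the sample means cluster around \mu with spread close to \dfrac{\sigma}{\sqrt{n}}. The variable names and the choice of a uniform distribution are illustrative.

```python
# Central limit theorem sketch: averages of n uniform random variables
# behave approximately like a normal distribution with mean mu and
# standard deviation sigma / sqrt(n).
import random
import statistics

random.seed(0)

n = 30           # sample size (the "n >= 30" rule of thumb)
trials = 10_000  # number of independent repetitions

# Each X_i ~ Uniform(0, 1): mu = 0.5, sigma = sqrt(1/12) ~ 0.2887.
sample_means = [
    statistics.mean(random.random() for _ in range(n))
    for _ in range(trials)
]

# The sample means should be centered near mu = 0.5, with a standard
# deviation close to sigma / sqrt(n) ~ 0.0527, as the theorem predicts.
print(statistics.mean(sample_means))
print(statistics.stdev(sample_means))
```

Plotting a histogram of `sample_means` would show the familiar bell shape, even though each individual X_i is uniform rather than normal.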
Confidence intervals
Principle
One way to draw conclusions about a population is to develop confidence intervals based on a sample population.
Confidence interval
A confidence interval is an interval of values in which, from given observations and for a given confidence level, we can state that a population parameter (such as the population mean) lies.
Suppose we wish to label a new candy bar with the number of calories each candy bar contains. To do so, we can randomly select 50 candy bars and measure the calories of each. From these measurements we can build a range of values in which the average calorie content is likely to lie. That interval of values is the confidence interval.
Confidence level
The confidence level of a confidence interval is the expected percentage of such intervals, built from repeated samples, which contain the true population value.
A 95% confidence level of a confidence interval means that if we repeated the sampling many times, we would expect about 95% of the resulting intervals to contain the true population value.
Building a confidence interval
z-score
The z-score of a data point of a population is the number of standard deviations that data point is away from the mean.
Suppose that:
- In a population of 100 birds, the average weight of all the birds is 50 grams.
- The data set has a standard deviation of 5 grams.
If the weight of a bird x is 53 grams, then the z-score of bird x is:
\dfrac{53-50}{5}=0.6
If the weight of bird y is 42 grams, then the z-score of bird y is:
\dfrac{42-50}{5}=-1.6
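The calculation above is easy to express in code. The following sketch (function name is illustrative) applies the definition of the z-score to the bird example:

```python
# z-score: how many standard deviations a data point lies from the mean.
def z_score(x: float, mean: float, std_dev: float) -> float:
    return (x - mean) / std_dev

# Population of birds: mean weight 50 grams, standard deviation 5 grams.
print(z_score(53, 50, 5))  # bird x: (53 - 50) / 5 = 0.6
print(z_score(42, 50, 5))  # bird y: (42 - 50) / 5 = -1.6
```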
Confidence intervals are typically built around confidence levels of 90\%, 95\%, or 99\%. These confidence levels correspond to z-scores of:
- 1.65 for a confidence level of 90\%.
- 1.96 for a confidence level of 95\%.
- 2.58 for a confidence level of 99\%.
A confidence interval is built from a sample population of size n as follows:
- Select the desired confidence level and let z be the corresponding z-score.
- Find \mu, the mean of the sample population.
- Find \sigma, the standard deviation of the sample population.
- Find the margin of error, which is given by the formula \dfrac{z\cdot \sigma}{\sqrt{n}}.
The confidence interval is the interval:
\left(\mu-\dfrac{z\cdot \sigma}{\sqrt{n}}{,}\ \mu+\dfrac{z\cdot \sigma}{\sqrt{n}}\right)
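The steps above can be sketched in code. The z-score table and the margin-of-error formula \dfrac{z\cdot \sigma}{\sqrt{n}} come from this section; the function name and the candy-bar-style numbers in the usage example are illustrative assumptions.

```python
# Confidence interval sketch: mean +/- z * sigma / sqrt(n).
import math

# z-scores for the commonly used confidence levels (from the section).
Z_SCORES = {0.90: 1.65, 0.95: 1.96, 0.99: 2.58}

def confidence_interval(mean, std_dev, n, level=0.90):
    """Return (lower, upper) bounds for the chosen confidence level."""
    margin = Z_SCORES[level] * std_dev / math.sqrt(n)
    return (mean - margin, mean + margin)

# Hypothetical example: 50 candy bars measured, sample mean 210 calories,
# sample standard deviation 12 calories, 95% confidence level.
low, high = confidence_interval(210, 12, 50, 0.95)
print(low, high)
```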
In a sample population of 100 birds, the weights of the birds have an average of \mu=50 grams with a standard deviation of \sigma=5. The confidence interval with a confidence level of 90\% is built as follows:
- The z-score corresponding to a 90\% confidence level is 1.65.
- The margin of error is \dfrac{z\cdot \sigma}{\sqrt{100}}=\dfrac{1.65\cdot 5}{10}=0.825.
The confidence interval is therefore:
\left(50-0.825{,}50+0.825\right)=\left(49.175{,}50.825\right)
We can therefore state, with 90\% confidence, that the mean weight of the bird population lies in the confidence interval \left(49.175{,}50.825\right).
Introduction to hypothesis testing
Experiment design and formulating hypotheses
Hypothesis
A hypothesis is an educated guess about a population based on information gathered from a sample population.
Suppose we are studying the eating habits of adults and observe that in a sample population of 100 adults the majority of the obese adults drink sodas and do not regularly exercise. We could form a hypothesis that drinking sodas and not exercising regularly leads to obesity.
Every simple hypothesis has a corresponding null hypothesis which is the negation of the simple hypothesis.
Suppose we want to run an experiment on the growth of cucumber plants and our hypothesis is "Watering a cucumber plant 200 milliliters a day will make the cucumber plant grow faster than watering the plant 300 milliliters a day."
The corresponding null hypothesis is: "Watering a cucumber plant 200 milliliters of water a day will not make the plant grow any faster than watering the same plant 300 milliliters of water a day."
p-value
The p-value of an experiment is the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis of the experiment is true. The p-value is determined by the data of the experiment and is then compared with the chosen significance level of the experiment.
Small p-values provide evidence against the null hypothesis. Typically, a p-value below 5\% (or sometimes even 1\%) is an indication that the result of the experiment is statistically significant.
In modern statistics, computing the p-value of an experiment by hand is quite difficult and often requires numerical methods. However, there are easy-to-use and widely available computer programs which compute p-values.
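As a sketch of what such a program does (one common case, not the only method): for a z-test, the two-sided p-value can be computed from the normal cumulative distribution function, which Python's standard library exposes through math.erf. The function names below are illustrative.

```python
# Two-sided p-value for a z-statistic, using only the standard library.
import math

def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_sided_p_value(z: float) -> float:
    # Probability of seeing a result at least this extreme,
    # assuming the null hypothesis is true.
    return 2.0 * (1.0 - normal_cdf(abs(z)))

# A z-statistic of 1.96 corresponds to a p-value of about 5%,
# matching the 95% confidence level seen earlier.
print(two_sided_p_value(1.96))
```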
Biased and unbiased samples
A sample population is unbiased if every member of the population had the same chance of being selected for the sample population. Otherwise, the sample population is said to be biased.
In an experiment to test the average speed a human being can run, suppose the samples are drawn only from people in a particular high school. This would be an example of a biased sample because not every member of the population had an equal chance of being selected to be part of the sample population.
Types of errors
Type I error
A type I error is a false positive. This occurs when the null hypothesis of an experiment is actually true but the experiment rejects it.
Suppose an experiment is being run on people with a cold to determine if medicine X is effective in treating the cold.
- Hypothesis: "Medicine X relieves cold symptoms."
- Null Hypothesis: "Medicine X has no effect on cold symptoms."
Suppose that Medicine X has no effect on treating cold symptoms but our experiment on a sample population indicated that medicine X did relieve cold symptoms. Then our experiment will have produced a false positive and this would be a Type I error.
Type II error
A type II error is a false negative. This occurs when the null hypothesis is false but the experiment fails to reject it.
Suppose an experiment is being run on people with a cold to determine if medicine X is effective in treating the cold.
- Hypothesis: "Medicine X relieves cold symptoms."
- Null Hypothesis: "Medicine X has no effect on cold symptoms."
Suppose that Medicine X does relieve cold symptoms but our experiment on a sample population indicated that medicine X has no effect on cold symptoms. Then our experiment will have produced a false negative and this would be a Type II error.
Conducting a hypothesis test
Level of significance
Suppose that the chosen level of confidence in an experiment is a\%, where a is a number between 0 and 100. Then the level of significance of the experiment is 100\%-a\%.
If an experiment is being run with a chosen confidence level of 99\%, then the level of significance is:
100\%-99\%=1\%
Recall that the level of confidence is usually chosen to be either 90\%, 95\%, or 99\%.
Suppose an experiment with a chosen confidence level, a hypothesis, and a null hypothesis is run on a sample population. Let \alpha be the level of significance of the experiment and p the p-value of the experiment.
- If p\leq \alpha, then we reject the null hypothesis.
- If p \gt \alpha, then we do not reject the null hypothesis.
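The decision rule above can be sketched as a one-line function (the function name is illustrative):

```python
# Decision rule: reject the null hypothesis when the p-value is at most
# the significance level alpha.
def reject_null(p_value: float, alpha: float) -> bool:
    return p_value <= alpha

# With a 95% confidence level, alpha = 0.05.
print(reject_null(0.03, 0.05))  # True: reject the null hypothesis
print(reject_null(0.07, 0.05))  # False: do not reject
```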
Suppose an experiment is being conducted on the amount of water a cucumber plant needs in order to optimize its growth. Suppose the following:
- The hypothesis of the experiment is that 200 milliliters of water per day will make a cucumber grow faster than 300 milliliters of water per day.
- The null hypothesis of the experiment is that 200 milliliters of water per day will not make a cucumber grow faster than 300 milliliters of water per day.
The experiment is being conducted with a confidence level of 95\% so that the corresponding level of significance of the experiment is 5\%.
If an experiment is conducted and the p-value of the experiment is found to be 3\%, then we can reject the null hypothesis with a confidence level of 95\%.
The way an experiment is conducted will affect the results. It is therefore important to design experiments so that there is as little outside influence as possible on the results.
Consider the following hypothesis:
Taking a class in a classroom is more beneficial to the student than taking an online class.
If an experiment were designed to test the above hypothesis by measuring only students' exam scores, then the experiment would be flawed. The experiment would not take into account the students' preparation time before taking the exam.