Summary
I. Introduction to set of data
  A. Quantitative & qualitative data
  B. Discrete or continuous data, levels of measurement
    1. Discrete data
    2. Continuous data
    3. Levels of measurement
  C. Different forms of graphs
    1. Histograms, bar graphs, circle graphs and line graphs
    2. Stem-and-leaf plots
    3. Box-and-whisker plots
II. Key concepts in statistics
  A. Indicators of central tendency
  B. Indicators of dispersion
  C. Indicators of correlation
III. Scatter plots
  A. Definition of a scatter plot
  B. Scatter plots and regression line
Introduction to set of data
Quantitative & qualitative data
Quantitative data
A quantitative data set is a data set which is measured using numbers.
The following data set is the collection of the heights of students in a class.
Student Name | Height |
Amy | 180 cm |
John | 172 cm |
Louis | 193 cm |
Pierre | 201 cm |
Marcus | 162 cm |
Emmy | 160 cm |
Heights are recorded using numbers. Therefore, the above data set is a quantitative data set.
Qualitative data
A data set is qualitative if its data characterizes a property of members of a population without using numbers.
The following data set is the collection of the hair colors of students in a class.
Student Name | Hair color |
Amy | Brown |
John | Brown |
Louis | Blonde |
Pierre | Black |
Marcus | Red |
Emmy | Brown |
Hair color is not recorded using numbers. Therefore, the above data set is a qualitative data set.
Discrete or continuous data, levels of measurement
Discrete data
A discrete data set is a data set where the values a member of the population can take are separate and countable, for example a finite list of possible values.
The following data set is a record of students' responses to the question: "Do you enjoy studying math?"
Student Name | Enjoys math |
Amy | Yes |
John | No |
Louis | Yes |
Pierre | Yes |
Marcus | No |
Emmy | Yes |
There are only two possible responses to the question "Do you enjoy studying math?" Therefore, the above data set is a discrete data set.
Continuous data
A continuous data set is a data set where a member of the population can take any value within a range, so the number of possible values is infinite.
The following data set is the recorded speed of cars traveling along a highway.
Car number | Speed in kph |
1 | 100 |
2 | 95 |
3 | 98.25 |
4 | 78.5 |
5 | 79 |
6 | 110.75 |
7 | 105.25 |
8 | 103.5 |
9 | 100 |
10 | 100.5 |
The speed of a car can be any value within a range, so the set of possible values is infinite. Therefore, the above data set is a continuous data set.
If we measured speeds of cars but rounded the speed to the nearest unit, then the data set would actually be discrete.
Levels of measurement
There are four different levels of measurement.
Nominal level of measurement
A nominal level measurement is a measurement where attributes cannot be ordered.
The following data set is the collection of the hair colors of students in a class.
Student Name | Hair Color |
Amy | Brown |
John | Brown |
Louis | Blonde |
Pierre | Black |
Marcus | Red |
Emmy | Brown |
Hair color categorizes members of the population in a way that cannot be ordered. Therefore, recording people's hair color is a nominal level of measurement.
Ordinal level of measurement
An ordinal level of measurement is a measurement where attributes can be ordered.
In a customer survey, a grocery store asked customers how satisfied they were with their shopping experience by asking them to select the value on the following scale which best represents their experience:
- 5 - Very satisfied
- 4 - Satisfied
- 3 - Okay
- 2 - Unsatisfied
- 1 - Very unsatisfied
Customer number | Survey response |
1 | 4 |
2 | 3 |
3 | 2 |
4 | 5 |
5 | 5 |
6 | 3 |
7 | 3 |
8 | 5 |
9 | 4 |
10 | 1 |
The measurements of the customers' satisfaction can be ordered. Therefore, the measurement of customer satisfaction is an ordinal level of measurement.
Interval level of measurement
An interval level measurement is an ordinal level of measurement where we know the exact differences between measurements.
The following data set is the average recorded outside temperature, in degrees Celsius, of a city over a period of ten days.
Day number | Temperature |
1 | 25 |
2 | 24 |
3 | 20 |
4 | 17 |
5 | 20 |
6 | 21 |
7 | 15 |
8 | 13 |
9 | 12 |
10 | 12 |
The measurements of temperature can be ordered, and we also know how to measure the difference between two measurements. Therefore, the measurement of temperature is an interval level of measurement.
Ratio level of measurement
A ratio level of measurement is an interval level of measurement with a true zero: a value that represents the absence of the quantity being measured, below which no measurement can fall.
The following data set is the recorded speed of cars traveling along a highway.
Car number | Speed in kph |
1 | 100 |
2 | 95 |
3 | 98.25 |
4 | 78.5 |
5 | 79 |
6 | 110.75 |
7 | 105.25 |
8 | 103.5 |
9 | 100 |
10 | 100.5 |
The measurements of the speed of cars can be ordered, and we also know how to measure the difference between two measurements, so the measurement of speed is at least an interval level of measurement. On top of this, speed has a true zero, namely 0 kph. Therefore, the measurement of the speed of cars on a highway is a ratio level of measurement.
Different forms of graphs
Graphs are used to represent data sets.
Histograms, bar graphs, circle graphs and line graphs
Bar Graphs
A bar graph is a pictorial representation of data which compares values assigned to different categories. Each bar represents only one numeric value of data.
The following bar graph represents the number of pages in various books.
Histogram
A histogram is a special kind of bar graph that represents the frequency of numerical data. The data is grouped into equal intervals, and there are no gaps between the bars.
The following histogram represents the number of registered students in each academic year at a certain high school.
Circle Graphs
Circle graphs, also known as "pie charts", are circular graphs which are divided into different sections. Each of the sections denotes a percentage or a value out of the total.
The following circle graph represents the relative size of each grade level of students in a high school as numbers and as percentages.
Line Graphs
A line graph is used to display continuous data. It displays information as a series of data points connected by straight line segments.
The following line graph represents the evolution of temperatures during a week in degrees Fahrenheit.
Stem-and-leaf plots
Stem-and-leaf plot
In a stem-and-leaf plot, the data is organized from least to greatest. It is a special table where each data value is split into:
- A "stem", the first digit or digits of the value.
- A "leaf", which is usually the last digit of the value.
Consider the following dataset:
21, 37, 15, 18, 32, 28, 11
It has 7 data values:
- 21 can be split into a stem of "2" and a leaf of "1".
- 37 can be split into a stem of "3" and a leaf of "7".
- 15 can be split into a stem of "1" and a leaf of "5", and so on.
In the end, the stems are 1, 2, and 3.
For a stem of 1, there is a leaf of 1 (for 11), a leaf of 5 (for 15), and a leaf of 8 (for 18).
This gives the following stem-and-leaf plot:
Stem | Leaf |
---|---|
1 | 1,5,8 |
2 | 1,8 |
3 | 2,7 |
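The splitting and grouping steps above can be sketched in Python. This is only a minimal illustration: the function name `stem_and_leaf` is hypothetical, and it assumes non-negative whole numbers whose leaf is the last digit.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group whole numbers by stem (all digits but the last)."""
    plot = defaultdict(list)
    # Sorting first means the leaves of each stem end up in increasing order.
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 37 -> stem 3, leaf 7
        plot[stem].append(leaf)
    return dict(plot)


data = [21, 37, 15, 18, 32, 28, 11]
plot = stem_and_leaf(data)
for stem in sorted(plot):
    print(stem, "|", ",".join(str(leaf) for leaf in plot[stem]))
```

Run on the data set above, this prints the same three rows as the table: stems 1, 2 and 3 with their leaves.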
Box-and-whisker plots
Box-and-whisker plot
A box-and-whisker plot is a diagram that summarizes data by dividing it into four equal parts (quartiles) using five numbers:
- The smallest number.
- The three quartiles (as defined later in this lesson).
- The biggest number.
The box is drawn from the first quartile to the third quartile, and a vertical line goes through the box at the median. The whiskers extend from the first quartile down to the minimum and from the third quartile up to the maximum.
Consider the following data set:
2, 3, 6, 7, 8, 9, 11
Assume that:
- The median is 7.
- The lower quartile is 3.
- The upper quartile is 9.
The box-and-whisker plot for the above data set is:
Key concepts in statistics
Indicators of central tendency
Mean
The mean of a numerical data set is the average value of the data. To calculate the mean, add up all the numbers (sum) and divide the result by how many numbers there are (count):
\text{Mean}= \dfrac{\text{sum}}{\text{count}}
Consider the following data set:
2, 3, 5, 6, 4
To find its mean, first add all the terms to find the sum :
2 + 3 + 5 + 6 + 4 = 20
The total number of terms (or the count) is 5.
Now calculate the mean:
\dfrac{20}{5}=4
Median
The median of an ordered data set is the middle value of the data set:
- If there is an odd number of terms in a list, then there is one unique value that is in the middle of the sorted list.
- If there is an even number of terms in a list, then there will be two values in the middle. In this case, we take the average of these two values in order to calculate the median.
Consider the following data set:
10, 15, 5, 25, 20
In order to find the median, the set first needs to be sorted in order:
5, 10, 15, 20, 25
There are 5 terms in the set. The middle value is the value that is found in the third place, which is 15.
Consider finding the median for the following data set:
20, 15, 5, 10, 25, 30.
First, the set needs to be sorted in order:
5, 10, 15, 20, 25, 30
The total number of terms in this data set is 6 (which is an even number). There are two numbers in the middle (15 and 20), which are the third and fourth numbers in the set. To find the median, calculate the average of these two values:
\text{Median} = \dfrac{15+20}{2}= \dfrac{35}{2}= 17.5
The median is 17.5.
The mean and the median of a data set can be different.
Consider the following data set:
1, 1, 10
The mean of the data set is:
\dfrac{1+1+10}{3}=\dfrac{12}{3}=4
However, the median is 1.
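The rules for the mean and the median can be sketched in Python as follows. The function names `mean` and `median` are chosen here for illustration only, and the sketch assumes a non-empty list of numbers.

```python
def mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted list, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


print(mean([2, 3, 5, 6, 4]))             # 4.0
print(median([10, 15, 5, 25, 20]))       # 15
print(median([20, 15, 5, 10, 25, 30]))   # 17.5
print(mean([1, 1, 10]), median([1, 1, 10]))  # 4.0 1
```

The last line reproduces the example where the mean (4) and the median (1) differ.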
Mode
The mode is the number that appears the most in the list. To find the mode, it is best to sort the list in order and then count how many times each number appears. The number that appears most often is the mode.
Consider finding the mode for the following list:
12, 11, 43, 17, 12, 11, 34, 38, 43, 12, 72
First, arrange the given data set in order:
11, 11, 12, 12, 12, 17, 34, 38, 43, 43, 72
Now we can see that 12 occurs more times than any other number in this data set. Therefore, the mode is 12.
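Counting how often each number appears can be sketched in Python with `collections.Counter`. The function name `modes` is hypothetical. It returns a list because, as an assumption beyond the text above, several numbers can tie for the highest count.

```python
from collections import Counter


def modes(values):
    """Return every value that appears the greatest number of times."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


data = [12, 11, 43, 17, 12, 11, 34, 38, 43, 12, 72]
print(modes(data))  # [12]
```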
Outlier
An outlier is a value that is drastically larger or smaller than the rest of the values in a list.
Consider the following list of numbers:
2, 3, 5, 9, 28
28 is drastically larger than the other values and is an outlier.
Deciding whether a value is an outlier can be subjective. An outlier may be a genuine extreme value or the result of a mistake in experimentation. Outliers are often removed from data sets to better understand tendencies.
Indicators of dispersion
Range
The range of a data set is the difference between the largest and smallest value in the data set.
The following data set contains the ages of 7 students:
15, 12, 9, 13, 17, 10, 11
- The highest age is 17.
- The lowest age is 9.
The range of the data set is:
17 - 9 = 8
Standard deviation
Standard deviation measures how spread out the numbers in a data set are from the mean value. It is usually denoted by SD or \sigma. For a data set of size N, it is calculated as:
\sigma =\sqrt {\dfrac {\sum\left(x_i-\overline{x}\right)^2}{N}}
- x_i represents each data point.
- \overline{x} represents the mean of the dataset.
- N is the size of the dataset.
Consider finding the standard deviation and variance of the following data set:
22, 25, 24, 27, 28, 24
First, the mean needs to be calculated:
Mean =\overline{x} = \dfrac{22+25+24+27+28+24}{6}=\dfrac{150}{6}=25
The next step is to create a table:
Data point | Difference from the mean | Squared |
x_i | x_i - \overline{x} | \left(x_i-\overline{x}\right)^2 |
22 | -3 | 9 |
25 | 0 | 0 |
24 | -1 | 1 |
27 | 2 | 4 |
28 | 3 | 9 |
24 | -1 | 1 |
Next, the average of the squared differences needs to be calculated. It represents the variance or \sigma^2 :
\sigma^2 = \dfrac{9+0+1+4+9+1}{6}=\dfrac{24}{6}=4
Finally, the standard deviation is the square root of the variance :
\sigma=\sqrt{4}=2
Standard deviation is used to quantify dispersion in a data set or to simply find out how much the numbers in a data set differ from the mean value.
Variance
Variance is the square of the standard deviation:
V=\sigma^2
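The worked example above can be checked with a short Python sketch. The function names are hypothetical. The sketch divides by N, as in the formula given in this lesson. Python's standard library offers `statistics.pvariance` and `statistics.pstdev` for the same calculation.

```python
import math


def population_variance(values):
    """Average of the squared differences from the mean (divides by N)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


def population_std_dev(values):
    """Square root of the population variance."""
    return math.sqrt(population_variance(values))


data = [22, 25, 24, 27, 28, 24]
print(population_variance(data))  # 4.0
print(population_std_dev(data))   # 2.0
```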
First and Third Quartile
- The first quartile, or Q_1, is the median of the lower half of the data set. It is also called the lower quartile.
- The third quartile, or Q_3, is the median of the upper half of the data set. It is also called the upper quartile.
Consider finding the first and third quartiles for the following data set :
3, 9, 13, 8, 4, 5, 5, 10, 7
First, the set needs to be sorted in order:
3, 4, 5, 5, 7, 8, 9, 10, 13
As there are 9 elements in this set, the median is found in the middle (the fifth element in the ordered set). The median is 7.
The first quartile is the median of the lower half of the set, that is, the values below the median: 3, 4, 5, 5. It is the average of the two elements in the middle:
Q_1=\dfrac{4+5}{2}=4.5
The third quartile is the median of the upper half of the set, that is, the values above the median: 8, 9, 10, 13. It is the average of the two elements in the middle:
Q_3=\dfrac{9+10}{2}=9.5
The median of a set is also known as the second quartile or Q_2.
The first quartile (Q_1), the median (Q_2), and the third quartile (Q_3) divide an ordered set into four parts that each contain roughly the same number of values.
Interquartile Range
The interquartile range is the difference between the third and first quartile:
\text{Interquartile range}=Q_3-Q_1
In the previous example, the interquartile range is:
Q_3-Q_1=9.5-4.5 =5
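The quartile method used in this lesson (the median of each half, leaving out the middle value when the count is odd) can be sketched in Python. The function names are hypothetical, and other conventions for quartiles exist, so software may return slightly different values.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # first half of the values
    upper = ordered[-half:]   # last half; the middle value is left out when the count is odd
    return median_of_sorted(lower), median_of_sorted(ordered), median_of_sorted(upper)


q1, q2, q3 = quartiles([3, 9, 13, 8, 4, 5, 5, 10, 7])
print(q1, q2, q3)  # 4.5 7 9.5
print(q3 - q1)     # 5.0

# The data set from the box-and-whisker example gives the values assumed there.
print(quartiles([2, 3, 6, 7, 8, 9, 11]))  # (3, 7, 9)
```

Together with the smallest and largest values, these three quartiles give the five numbers needed to draw a box-and-whisker plot.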
Indicators of correlation
Correlation coefficient
The correlation coefficient indicates the extent to which two variables fluctuate together.
- A positive correlation indicates the extent to which the variables increase together or decrease together.
- A negative correlation indicates the extent to which one variable increases as the other decreases.
Consider the following data:
X | Y |
12 | 20 |
10 | 18 |
9 | 15 |
7 | 14 |
7 | 12 |
5 | 12 |
3 | 10 |
Before calculating the exact correlation coefficient, it is often possible to judge from the data whether a correlation is likely and whether it would be positive or negative. Here, as X decreases, Y decreases or stays the same. Since the two variables move in the same direction, a correlation is likely to exist and to be positive.
The correlation coefficient ranges between -1 and 1. Specifically:
- 1 is a perfect positive correlation.
- 0 is no linear correlation (the values don't seem linked by any straight-line trend).
- -1 is a perfect negative correlation.
The correlation coefficient is denoted by r and can be calculated for two variables x and y using this formula:
r_{xy}=\dfrac{\sum_{i=1}^{n}\left(x_i-\overline{x}\right)\left(y_i-\overline{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_i-\overline{x}\right)^2\sum_{i=1}^{n}\left(y_i-\overline{y}\right)^2}}
Consider finding the correlation coefficient between the following datasets:
- x = \left[ 16, 1, 8, 8, 15 \right]
- y = \left[ 5, 13, 1, 4, 4 \right]
First, calculate the mean of each dataset:
\bar{x} = \dfrac{16+1+8+8+15}{5} = 9.6
\bar{y} = \dfrac{5+13+1+4+4}{5} = 5.4
Then, construct and fill in the following table:
x | y | x_i-\bar{x} | y_i-\bar{y} | \left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right) | \left(x_i-\bar{x}\right)^2 | \left(y_i-\bar{y}\right)^2 |
16 | 5 | 6.4 | -0.4 | -2.56 | 40.96 | 0.16 |
1 | 13 | -8.6 | 7.6 | -65.36 | 73.96 | 57.76 |
8 | 1 | -1.6 | -4.4 | 7.04 | 2.56 | 19.36 |
8 | 4 | -1.6 | -1.4 | 2.24 | 2.56 | 1.96 |
15 | 4 | 5.4 | -1.4 | -7.56 | 29.16 | 1.96 |
Then, calculate the sum of the elements in the fifth column:
\sum_{i=1}^{n} \left(x_i-\bar{x}\right)\left( y_i-\bar{y}\right) =-66.2
Calculate the sum of the elements in the sixth column:
\sum_{i=1}^{n} \left(x_i-\bar{x}\right)^2 = 149.2
Calculate the sum of the elements in the seventh column:
\sum_{i=1}^{n} \left(y_i-\bar{y}\right)^2 = 81.2
Finally, put these values in the formula for the correlation coefficient:
r = \dfrac{-66.2}{\sqrt{149.2\times 81.2}} \approx -0.601
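The same calculation can be sketched in Python by translating the formula term by term. The function name `correlation` is hypothetical. The sketch assumes two lists of equal length in which the values are not all identical, so that the denominator is not zero.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient r computed from the formula above."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of the fifth, sixth and seventh columns of the table.
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


x = [16, 1, 8, 8, 15]
y = [5, 13, 1, 4, 4]
print(round(correlation(x, y), 3))  # -0.601
```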
Correlation does not imply causation. Two values may be correlated without one causing the other.
Sales of hot beverages and sales of umbrellas are likely to be positively correlated, but one does not cause the other. They are both influenced by another element (the weather).
Scatter plots
Definition of a scatter plot
Scatter plot
A scatter plot is a set of points plotted on horizontal and vertical axes. It shows the relationship between two sets of data.
The following scatter plot shows age on the horizontal axis and height on the vertical axis:
Scatter plots and regression line
Regression line
A regression line is a line drawn on a scatter plot that best describes the behavior of a set of data. This line is as close to all the data points as possible and as such represents the best fit for the trend of a given set of data. It can be used to predict values that we do not have.
Here is a regression line drawn on a scatter plot that represents age and height:
Using this regression line, we can predict that a 16-year-old person will be about 75 inches tall.
There are formulas for finding regression lines of data sets, but they are lengthy and typically computed using computer software.
Certain bamboo plants have been observed to grow at surprisingly fast rates. The following table shows the measured height of a bamboo plant at 5:00 pm each day over one week.
Day of the week | Height of bamboo |
Sunday | 10 cm |
Monday | 14 cm |
Tuesday | 19 cm |
Wednesday | 26 cm |
Thursday | 31 cm |
Friday | 36 cm |
Saturday | 40 cm |
We then plot the data points on a graph, with the days of the week on the x-axis and the height of the bamboo plant on the y-axis.
We then use computer software to find the line of best fit. Software will even provide the equation of the line of best fit which can be used to make predictions about future measurements.
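As an illustration of what such software might compute, here is a minimal Python sketch. It uses the ordinary least-squares formulas, which are a standard choice but are not given in this lesson. The function name `least_squares_line` is hypothetical, and numbering the days 0 (Sunday) to 6 (Saturday) is an assumption made for the example.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


days = [0, 1, 2, 3, 4, 5, 6]              # assumed numbering: Sunday = 0, ..., Saturday = 6
heights = [10, 14, 19, 26, 31, 36, 40]    # heights in cm from the table

slope, intercept = least_squares_line(days, heights)
print(round(slope, 3), round(intercept, 3))  # 5.214 9.5

# Predicted height on the following Sunday (day 7), in cm.
print(round(slope * 7 + intercept, 1))       # 46.0
```

Under these assumptions, the line of best fit is roughly y = 5.214x + 9.5, meaning the bamboo grows a little over 5 cm per day during the week observed.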
A prediction made using a regression line may not be accurate. It is the best guess we can make given the available data, but the actual value might vary.
If the regression line from the age and height example above is used to predict the height of a 40-year-old person, the result would be 160 inches. This is obviously impossible, but since the data stopped at age 11, the line can only assume that height keeps increasing at the same rate. Predictions far outside the range of the data are unreliable.