class: center, middle, inverse layout: yes name: inverse ## STAT 305: Chapter 3 ## Elementary Descriptive Statistics ### Amin Shirazi .footnote[Course page: [ashirazist.github.io/stat305_s2020.github.io](https://ashirazist.github.io/stat305_s2020.github.io/)] --- name: inverse layout: true class: center, middle, inverse --- # Section 3.1 ## Elementary Graphical and Tabular Treatment ## of ## Quantitative Data --- layout:false .left-column[ ### Summarizing ### Intro ] .right-column[ ## Summarizing Univariate Data ### Introduction: Creative Writing Workshops Two methods of teaching a creative writing workshop are being studied for their effectiveness of improving writing skills. First, two groups of creative writing students who were randomly assigned to one of two different 3-hour workshops. At the end of the workshop, the students were given a standard creative writing test and their score on the test was recorded. **Exam Scores for Two Groups of Students Following Different Courses** ``` Group 1 Group 2 74 79 77 81 65 77 78 74 68 79 81 76 76 73 71 71 81 80 80 78 86 81 76 89 88 83 79 91 79 78 77 76 79 75 74 73 72 76 75 79 ``` ] --- layout:false .left-column[ ### Summarizing ### Intro ] .right-column[ **Exam Scores for Two Groups of Students Following Different Courses** ``` Group 1 Group 2 74 79 77 81 65 77 78 74 68 79 81 76 76 73 71 71 81 80 80 78 86 81 76 89 88 83 79 91 79 78 77 76 79 75 74 73 72 76 75 79 ``` We may have several questions we are interested in answering using this data. For instance, - Which group did better on average? - Which group has the most consistent scores? - Were there any unusually low or high scores in either group? - If we ignore unusual scores, which group is better? - Which group had the most scores over 80? - ... However, none of these are immediately clear looking at the raw recorded data. ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ] .right-column[ ### The Purpose of Summaries Certain questions can and should be asked across many types of experiments. But seeing data in this kind of *flat* format presents lots of problems for a person trying to understand the relationship between the two groups. **Summaries** are tools (mainly mathematical or graphical) which help researchers quickly understand the data they have collected. The purpose of a summary is to faithfully present aspects of the data in such a way that - we are capable of answering the types of core questions about the data asked on the previous page, - we are able to identify more complicated aspects of the data that we may want to investigate further. **Key Idea**: Good summaries should be quickly interpreted, provide clear insight into the data, and be widely applicable. ] --- layout:true class: center, middle, inverse --- # Descriptive statistics --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ] .right-column[ # Descriptive statistics >Engineering data are always variable. Given precise enough measurement, even constant process conditions produce different responses. Thus, it is not the individual data values that are important, but their **distribution**. We will discuss simple methods that describe important distributional characteristics of data. **Descriptive statistics** is the use of plots and numerical summaries to describe data without drawing any formal conclusions. ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ] .right-column[ # Descriptive statistics Through the use of *descriptive statistics*, we seek to find the following features of data sets: >- **Center** : The point that the data are closest on average >- **Spread**: how wide the data look, how varied the points are >- **Shape **: common patterns/ trends that are present in the data >- **Outliers**: points that lie way beyond the rest of the data (wierd points) ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ] .right-column[ ## Graphical and tabular displays of quantitative data Almost always, the place to start a data analysis is with appropriate graphical and tabular displays. When only a few samples are involved, a good plot can tell most of the story about data and drive an analysis. ### Dot diagrams and stem-and-leaf plots When a study produces a small or moderate amount of **univariate quantitative data**, a *dot diagram* can be useful. >A **dot diagram** shows each observation as a dot placed at the position corresponding to its numerical value along a number line. ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ] .right-column[ ### Dot diagrams and stem-and-leaf plots **Example:**[Heat treating gears, cont'd] Recall the example from Chapter 1. A process engineer is faced with the question, "How should gears be loaded into a continuous carburizing furnace in order to minimize distortion during heat treating?" The engineer conducts a well-thought-out study and obtains the runout values for 38 gears laid and 39 gears hung. <img src= "ch3_files/figure-html/gear-dotplot-1.png"> ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ] .right-column[ ### Simple Graphical Summaries **Exaple:** ``` Group 1 Group 2 74 79 77 81 65 77 78 74 68 79 81 76 76 73 71 71 81 80 80 78 86 81 76 89 88 83 79 91 79 78 77 76 79 75 74 73 72 76 75 79 ``` Simple graphical summaries aim to provide a better view of the entire set of data. The best graphs are able to make important points clearly and give valuable insights with closer study. Producing good graphs is an [art](http://www.edwardtufte.com/tufte/graphics/poster_OrigMinard.gif). **Two common graphical summaries** - Dot Diagrams - Stem and Leaf Diagrams Carries much the same visual information as a dot diagram while preserving the original values exactly ] --- layout:false .left-column[ layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ] .right-column[ ### Simple Graphical Summaries ``` Group 1 Group 2 74 79 77 81 65 77 78 74 68 79 81 76 76 73 71 71 81 80 80 78 86 81 76 89 88 83 79 91 79 78 77 76 79 75 74 73 72 76 75 79 ``` **Dot Diagrams ** <center> <h3></h3> <img src="g1dot.png" alt="dmc logo" height="125"> <img src="g2dot.png" alt="dmc logo" height="125"> </center> ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ] .right-column[ Dot diagrams are good for getting a general feel for the data (and can be done with pencil and paper), but do not allow the recovery of the exact values used to make them. >A **stem-and-leaf plot** is made by using the last few digits of each data point to indicated where it falls. ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ] .right-column[ **Example:** ``` Group 1 Group 2 74 79 77 81 65 77 78 74 68 79 81 76 76 73 71 71 81 80 80 78 86 81 76 89 88 83 79 91 79 78 77 76 79 75 74 73 72 76 75 79 ``` **Stem and Leaf Diagrams** <center> <h3></h3> <img src="g1.png" alt="dmc logo" height="250"> <img src="g2.png" alt="dmc logo" height="250"> </center> ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ] .right-column[ ### Frequency tables and histograms Dot diagrams and stem-and-leaf plots are useful for getting to know a data set, but they are not commonly used in papers and presentations. >A **frequency table** is made by first breaking an interval containing all the data into an appropriate number of smaller intervals of equal length. Then tally marks can be recorded to indicate the number of data points falling into each interval. Finally, frequencies, relative frequencies, and cumulative relative frequencies can be added. ] --- .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ] .right-column[ ### Frequency tables and histograms A frequency table is made by - First breaking an interval containing all the data into an apropriate number of smaller intervals of **equal length. ** - Then tally marks can be recorded to indicate the number of data points falling into each interval. - Finally, add frequency, relative frequency and cumlative relative frequency can be added. <center> <img src= "p1.jpg" alt="dmc logo" height="250"> </center> ] --- .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ] .right-column[ ## Histogram After making a frequency table, it is common to use the organization provided by the table to create a histogram. A **(frequency or relative frequency) histogram** is a kind of bar chart used to portray the shape of a distribution of data points. Guidelines for making histograms: - Use intervals of equal length - Show the entire vertical axis starting at *zero* - Avoid breaking either axes - keep a uniform scale for axes (tick marks) - Center bars of appropriate heights at midpoint of the intervals ] --- .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ] .right-column[ ## Histogram **Example:**[Bullet penetration depth, pg. 67] Sale and Thom compared penetration depths for several types of .45 caliber bullets fired into oak wood from a distance of 15 feet. They recorded the penetration depths (in mm from the target surface to the back of the bullets) for two bullet types. <table> <thead> <tr> <th style="text-align:left;"> 200 grain jacketed bullets </th> <th style="text-align:left;"> 230 grain jacketed bullets </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 63.8, 64.65, 59.5, 60.7, 61.3, 61.5, 59.8, 59.1, 62.95, 63.55, 58.65, 71.7, 63.3, 62.65, 67.75, 62.3, 70.4, 64.05, 65, 58 </td> <td style="text-align:left;"> 40.5, 38.35, 56, 42.55, 38.35, 27.75, 49.85, 43.6, 38.75, 51.25, 47.9, 48.15, 42.9, 43.85, 37.35, 47.3, 41.15, 51.6, 39.75, 41 </td> </tr> </tbody> </table> ] --- .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ] .right-column[ ## Histogram **Example:**[Bullet penetration depth, pg. 67] <img src="ch3_files/figure-html/bullets-hist-1.png" height="250"> ![](https://media.giphy.com/media/mleJtlzo1gITK/giphy.gif) ] --- .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ] .right-column[ **Example:**[Histogram] Suppose you have the following data: `$$74, 79, 77, 81, 68, 79, 81, 76, 81, 80, 80\\, 78, 88, 83, 79, 91, 79, 75, 74, 73$$` Create the corresponding *frequency table* and *frequency histogram}* ] --- layout:true class: middle, center, inverse --- ## Why plotting data? --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ] .right-column[ ## Why plotting data? Why do we plot data? Information on **location**, **spread**, and **shape** is portrayed clearly in a histogram and can give hints as to the functioning of the physical process that is generating the data. ![Common distributional shapes.](ch3_files/figure-html/data-shape-1.png) ] --- .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ] .right-column[ ## Why plotting data? If data on the diameters of machined metal cylinders purchased from a vendor produce a histogram that is decidedly **bimodal**, this suggests >the machining was done on 2 machines or by two operators or at two different times, etc. ... If the histogram is **truncated**, this might suggest >the cylinders have been 100% inspected ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ] .right-column[ ### Scatter plots Dot-diagrams, stem-and-leaf plots, frequency tables, and histograms are univariate tools. But engineering questions often concern multivariate data and *relationships between the quantitative variables*. >A **scatterplot** is a simple and effective way of displaying potential relationships between two quantitative variable by assigning each variable to either the `\(x\)` or `\(y\)` axis and plotting the resulting coordinate points. ] --- .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ] .right-column[ ### Scatter plots **Example:**[Orange trees] Jim and Jane want to know the relationship between an orange tree's age (in days since 1968-12-31) and its circumference (in mm). They recorded the data for `\(35\)` orange trees. <img src ="ch3_files/figure-html/orange-scatter-1.png"> ] --- .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ] .right-column[ ### Scatter plots There are three typical association/relationship between two variables: ![](ch3_files/figure-html/unnamed-chunk-3-1.png)<!-- --> ] --- layout:true class: center, middle, inverse --- # Summaries of Location and Central Tendency --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ] .right-column[ ### Summaries of Location and Central Tendency Motivated by asking what is *normal/common/expected* for this data. There are three main types used: **Mean**: A "fair" center value. The symbol used differs depending on whether we are dealing with a sample or population: | | | Mean | |:----------------|----------------------|--------------------------------------| | | | | | **Population** | | `\(\mu = \frac{1}{N} \sum_1^N x_i \)` | | | | | | **Sample** | | `\(\bar{x} = \frac{1}{n} \sum_1^n x_i \)` | **N** is the population size and **n** is the sample size. **Mode**: The most commonly occurring data value in set. ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ] .right-column[ ### Summaries of Location and Central Tendency **Quantiles**: The number that divides our data values so that the proportion, `\(p\)`, of the data values are below the number and the proportion `\(1-p\)` are above the number. **Median**: The value dividing the data values in half (the middle of the values). The median is just the 50th quantile. **Range**: The difference between the highest and lowest values (Range = max - min) **IQR**: The Interquartile Range, how spread out is the middle 50% (IQR = Q3 - Q1) ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ] .right-column[ ### Summaries of Location and Central Tendency ``` Group 1 Group 2 74 79 77 81 65 77 78 74 68 79 81 76 76 73 71 71 81 80 80 78 86 81 76 89 88 83 79 91 79 78 77 76 79 75 74 73 72 76 75 79 ``` **Calculating Mean** Think of it as an equal division of the total - each value in the data is an "\\(x_i\\)" (\\(i\\) is a **subscript**) - Group 1: \\(x\_1 = 74, x\_2 = 79, ..., x\_{20} = 73\\) - The sum: \\(x\_1 + x\_2 + x\_3 + ... + x\_{20}\\) - divides : \\((x\_1 + x\_2 + x\_3 + ... + x\_{20})/20\\) - Or using summation notation: \\(\frac{1}{20} \sum_{i = 1}^{20} x_i\\) ] --- name: inverse layout: true class: center, middle, inverse --- ## Quantiles --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ] .right-column[ ### Summaries of Location and Central Tendency **The Quantile Function** Two useful pieces of notation: **floor**: `\(\lfloor x \rfloor \)` is the largest integer smaller than or equal to `\( x \)` **ceiling**: `\(\lceil x \rceil \)` is the smallest integer larger than or equal to `\( x \)` **Examples** - `\(\lfloor 55.2 \rfloor = 55\)` - `\(\lceil 55.2 \rceil = 56\)` - `\(\lfloor 19 \rfloor = 19\)` - `\(\lceil 19 \rceil = 19\)` - `\(\lceil -3.2 \rceil = -3\)` - `\(\lfloor -3.2 \rfloor = -4\)` ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ] .right-column[ ### Summaries of Location and Central Tendency ### Quantiles Already familiar with the concept of "percentile". The <font color="red"> `\(p^{th}\)` percentile</font> of a data set is a number greater than `\(p\)`% of the data and less than the rest. > "You scored at the `\(90^{th}\)` percentile on the SAT" means that your score was higher than 90% of the students who took the test and lower than the other 10% > "Zorbit was positioned at the `\(80^{th}\)` percentile of the list of fastest growing companies compiled by INC magazine.” means Zorbit was growing faster than 80% of the companies in the list and slower than the other 20%. ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ] .right-column[ ### Summaries of Location and Central Tendency ### Quantiles - It is more convenient to work in terms of fractions between 0 and 1 rather than percentages between 0 and 100. We then use terminology **Quantiles** rather than percentiles. - For a number **p** between 0 and 1, the **p quantle** of a distribution is a number such that a fraction p of the distribution lies to the left of that value, and a fraction 1-p of the distribution lies to the right. ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ] .right-column[ ### Summaries of Location and Central Tendency ### Quantiles For a data set consisting of `\(n\)` values that when ordered are `\(x_1 \le x_2 \le \cdots \le x_n\)`, - if `\(p = \frac{i - .5}{n}\)` for a positive integer `\(i \le n\)`, the `\(p\)` **quantile** of the data set is `$$Q(p) = Q\left(\frac{i-.5}{n}\right) = x_i$$` (the `\(i\)`th smallest data point will be called the `\(\frac{i-.5}{n}\)` quantile) ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ] .right-column[ ### Summaries of Location and Central Tendency ### Quantiles - for any number `\(p\)` between `\(\frac{.5}{n}\)` and `\(\frac{n-.5}{n}\)` that is not of the form `\(\frac{i-.5}{n}\)` for an integer `\(i\)`, the `\(p\)` **quantile** of the data set will be obtained by linear interpolation between the two values of `\(Q\left(\frac{i-.5}{n}\right)\)` with corresponding `\(\frac{i-.5}{n}\)` that bracket `\(p\)`. In both cases, the notation `\(Q(p)\)` will denote the `\(p\)` quantile. ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ] .right-column[ ### Summaries of Location and Central Tendency **The Quantile Function** For a data set consisting of `\(n\)` values that when ordered are `\(x_1 \le x_2 \le \ldots \le x_n\)` and `\( 0 \le p \le 1\)`. We define the **quantile function** `\(Q(p)\)` as: <span style = "font-size: 50%"> `\[ Q(p) = \begin{cases} \small x_i & \small \lfloor n \cdot p + .5 \rfloor = n \cdot p + .5 \\ \small x_i +\left( n p - i + .5 \right) \left( x_{i+1} - x_i \right) &\small \lfloor n \cdot p + .5 \rfloor \ne n \cdot p + .5 \\ \end{cases} \]` </span> (note: this is the definition used in the book - it's just written using *floor* and *ceiling* instead of in words) ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ] .right-column[ - `\(\small Q\left(\frac{1-.5}{n}\right)\)` is called the **minimum** and `\(\small Q\left(\frac{n-.5}{n}\right)\)` is called the **maximum** of a distribution. - `\(\small Q(.5)\)` is called the **median** of a distribution. `\(\small Q(.25)\)` and `\(\small Q(.75)\)` are called the **first (or lower) quartile** and **third (or upper) quartile** of a distribution, respectively. - The **interquartile range (IQR)** is defined as `$$\small IQR = Q(.75) - Q(.25)$$`. - An **outlier** is a data point that is larger than `\(\small Q(.75) + 1.5*IQR\)` or smaller than `\(\small Q(.25) - 1.5*IQR\)`. ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ] .right-column[ **Example**: Find the median, first quartile, 17th quantile and 65th quantile for the following set of data values: <center> 58, 76, 66, 61, 50, 77, 67, 64, 41, 61 </center> First notice that `\(n = 10\)`. It is possible helpful to set up the following table: - **Step 1: sort the data** | | | | | | | | | | | | |----------|----|----|----|----|----|----|----|----|----|----| | data | 41 | 50 | 58 | 61 | 61 | 64 | 66 | 67 | 76 | 77 | | `\(i\)` | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ] .right-column[ **Example**: Find the median, first quartile, 17th quantile and 65th quantile for the following set of data values: <center> 58, 76, 66, 61, 50, 77, 67, 64, 41, 61 </center> - **Step 2: find `\(\frac{i - .5}{n}\)`** | | | | | | | | | | | | |----------|----|----|----|----|----|----|----|----|----|----| | data | 41 | 50 | 58 | 61 | 61 | 64 | 66 | 67 | 76 | 77 | | `\(i\)` | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | | `\(\frac{i - .5}{n}\)` | 0.05 | 0.15 | 0.25 | 0.35 | 0.45 | 0.55 | 0.65 | 0.75 | 0.85 | 0.95 | ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ] .right-column[ - **Step 3: find `\(Q(p)\)`** | | | | | | | | | | | | |----------|----|----|----|----|----|----|----|----|----|----| | data | 41 | 50 | 58 | 61 | 61 | 64 | 66 | 67 | 76 | 77 | | `\(i\)` | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | | `\(\frac{i - .5}{n}\)` | 0.05 | 0.15 | 0.25 | 0.35 | 0.45 | 0.55 | 0.65 | 0.75 | 0.85 | 0.95 | <span style = "font-size: 80%"> `\[ Q(p) = \begin{cases} \small x_i & \small \lfloor n \cdot p + .5 \rfloor = n \cdot p + .5 \\ \small x_i +\left( n p - i + .5 \right) \left( x_{i+1} - x_i \right) &\small \lfloor n \cdot p + .5 \rfloor \ne n \cdot p + .5 \\ \end{cases} \]` </span> Finding the first **quartile** (`\(Q(.25)\)`): - `\( n p + .5 = 10 \cdot .25 + .5 = 3 \)`. - since `\(\lfloor 3 \rfloor = 3 \)` then `\(i = 3\)` and `\(Q(.25) = x_3 = 58 \)` ] --- name: inverse layout: true class: center, middle, inverse --- ## Your turn ### Find the median --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ] .right-column[ | | | | | | | | | | | | |----------|----|----|----|----|----|----|----|----|----|----| | data | 41 | 50 | 58 | 61 | 61 | 64 | 66 | 67 | 76 | 77 | | `\(i\)` | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | | `\(\frac{i - .5}{n}\)` | 0.05 | 0.15 | 0.25 | 0.35 | 0.45 | 0.55 | 0.65 | 0.75 | 0.85 | 0.95 | <span style = "font-size: 80%"> `\[ Q(p) = \begin{cases} \small x_i & \small \lfloor n \cdot p + .5 \rfloor = n \cdot p + .5 \\ \small x_i +\left( n p - i + .5 \right) \left( x_{i+1} - x_i \right) &\small \lfloor n \cdot p + .5 \rfloor \ne n \cdot p + .5 \\ \end{cases} \]` <span> - `\( n p + .5 = 10 \cdot 0.5 + 0.5 = 5.5 \)`. - since `\(\lfloor 5.5 \rfloor = 5 \)` then `\(i = 5\)` and <span style = "font-size: 70%"> `\[ \begin{align} Q(.5) &= x_i + (n \cdot p - i + .5) \cdot \left( x_{i+1} - x_i \right) \\\\ &= x_5 + (10 \cdot 0.5 - 5 + .5) \cdot \left( x_{5+1} - x_5 \right) \\\\ &= x_5 + (.5) \cdot \left( x_{6} - x_5 \right) \\\\ &= 61 + (.5) \cdot \left( 64 - 61 \right) \\\\ &= 62.5 \end{align} \]` <span> ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ] .right-column[ Finding `\(Q(.17)\)` - `\( n p + .5 = 10 \cdot 0.17 + 0.5 = 2.2 \)`. - since `\(\lfloor 2.2 \rfloor = 2 \)` then `\(i = 2\)` and `\[ \begin{align} Q(.17) &= x_i + (n \cdot p - i + .5) \cdot \left( x_{i+1} - x_i \right) \\\\ &= x_2 + (10 \cdot 0.17 - 2 + .5) \cdot \left( x_{2+1} - x_2 \right) \\\\ &= x_9 + (.2) \cdot \left( x_{3} - x_2 \right) \\\\ &= 50 + (.2) \cdot \left( 58 - 50 \right) \\\\ &= 51.6 \end{align} \]` ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ] .right-column[ Finding `\(Q(.65)\)` - `\( n p + .5 = 10 \cdot 0.65 + 0.5 = 7 \)`. - since `\(\lfloor 7 \rfloor = 7 \)` then `\(i = 7\)` and `\[ \begin{align} Q(.65) &= x_i + (n \cdot p - i + .5) \cdot \left( x_{i+1} - x_i \right) \\\\ &= x_7 + (10 \cdot 0.65 - 7 + .5) \cdot \left( x_{7+1} - x_7 \right) \\\\ &= x_7 + (0) \cdot \left( x_{8} - x_7 \right) \\\\ &= x_7 + 0 \\\\ &= 66 \end{align} \]` ] --- name: inverse layout: true class: center, middle, inverse --- # Section 3.2: Plots --- # Boxplots --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ] .right-column[ ## Boxplots Quantiles are useful in making *boxplots*, an alternative to dot diagrams or histograms. The boxplot shows less information, but many can be placed side by side on a single page for comparisons. <img src="ch3_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ] .right-column[ ## Boxplots A simple plot making use of the first, second and third quartiles (i.e., `\(Q(.25)\)`, `\(Q(.5)\)` and `\(Q(.75)\)`. 1. A box is drawn so that it covers the range from `\(Q(.25)\)` up to `\(Q(.75)\)` with a vertical line at the median. 2. Whiskers extend from the sides of the box to the furthest points within 1.5 IQR of the box edges 3. Any points beyond the whiskers are plotted on their own. ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ] .right-column[ **Example**: Draw boxplots for the groups using quantile function ``` Group 1 Group 2 74 79 77 81 65 77 78 74 68 79 81 76 76 73 71 71 81 80 80 78 86 81 76 89 88 83 79 91 79 78 77 76 79 75 74 73 72 76 75 79 ``` **solution**: First we need the quartile values: | | `\(Q(.25)\)` | `\(Q(.5)\)` | `\(Q(.75)\)` | |----------|--------------|-------------|--------------| | Group 1 | 75.5 | 79 | 81 | | Group 2 | 73.5 | 76 | 78.5 | This means that Group 1 has IQR = 5.5 and - 1.5\*IQR = 8.25 while Group 2 has IQR = 5 and - 1.5\*IQR = 7.5 ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ] .right-column[ **Example**: <img src="ch3_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ] .right-column[ ### Anatomy of a Boxplot <center> <img src="boxplot.JPG" alt="dmc logo" height="300"> </center> ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ] .right-column[ **Example:**[Bullet penetration depths, cont'd] <span style = "font-size: 60%"> <table class="table" style="font-size: 8px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> i </th> <th style="text-align:right;"> `\(\frac{i-.5}{20}\)` </th> <th style="text-align:right;"> 200 grain bullets </th> <th style="text-align:right;"> 230 grain bullets </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0.025 </td> <td style="text-align:right;"> 58.00 </td> <td style="text-align:right;"> 27.75 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0.075 </td> <td style="text-align:right;"> 58.65 </td> <td style="text-align:right;"> 37.35 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 0.125 </td> <td style="text-align:right;"> 59.10 </td> <td style="text-align:right;"> 38.35 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 0.175 </td> <td style="text-align:right;"> 59.50 </td> <td style="text-align:right;"> 38.35 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 0.225 </td> <td style="text-align:right;"> 59.80 </td> <td style="text-align:right;"> 38.75 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 0.275 </td> <td style="text-align:right;"> 60.70 </td> <td style="text-align:right;"> 39.75 </td> </tr> <tr> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 0.325 </td> <td style="text-align:right;"> 61.30 </td> <td style="text-align:right;"> 40.50 </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 0.375 </td> <td style="text-align:right;"> 61.50 </td> <td style="text-align:right;"> 41.00 </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 0.425 </td> <td style="text-align:right;"> 62.30 </td> <td style="text-align:right;"> 41.15 </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 0.475 </td> <td style="text-align:right;"> 62.65 </td> <td style="text-align:right;"> 42.55 </td> </tr> <tr> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 0.525 </td> <td style="text-align:right;"> 62.95 </td> <td style="text-align:right;"> 42.90 </td> </tr> <tr> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 0.575 </td> <td style="text-align:right;"> 63.30 </td> <td style="text-align:right;"> 43.60 </td> </tr> <tr> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 0.625 </td> <td style="text-align:right;"> 63.55 </td> <td style="text-align:right;"> 43.85 </td> </tr> <tr> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 0.675 </td> <td style="text-align:right;"> 63.80 </td> <td style="text-align:right;"> 47.30 </td> </tr> <tr> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 0.725 </td> <td style="text-align:right;"> 64.05 </td> <td style="text-align:right;"> 47.90 </td> </tr> <tr> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 0.775 </td> <td style="text-align:right;"> 64.65 </td> <td style="text-align:right;"> 48.15 </td> </tr> <tr> <td style="text-align:right;"> 17 </td> <td style="text-align:right;"> 0.825 </td> <td style="text-align:right;"> 65.00 </td> <td style="text-align:right;"> 49.85 </td> </tr> <tr> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 0.875 </td> <td style="text-align:right;"> 67.75 </td> <td style="text-align:right;"> 51.25 </td> </tr> <tr> <td style="text-align:right;"> 19 </td> <td style="text-align:right;"> 0.925 </td> <td style="text-align:right;"> 70.40 </td> <td style="text-align:right;"> 51.60 </td> </tr> <tr> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 0.975 </td> <td style="text-align:right;"> 71.70 </td> <td style="text-align:right;"> 56.00 </td> </tr> </tbody> </table> </span> ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ] .right-column[ **Example:**[Bullet penetration depths, cont'd] <img src="ch3_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> ] --- layout: true class: middle, center, inverse --- ### Quantile-quantile (Q-Q) plots --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ] .right-column[ ### Quantile-quantile (Q-Q) plots Often times, we want to compare the shapes of two distributions. A more sensitive way is to make a single plot based on the quantile functions for two distributions. A **Q-Q plot** for two data sets with respective quantile functions `\(Q_1\)` and `\(Q_2\)` is a plot of ordered pairs `\((Q_1(p), Q_2(p))\)` for appropriate values of `\(p\)`. When two data sets of size `\(n\)` are involved, the values of `\(p\)` used to make the plot will be `\(\frac{i-.5}{n}\)` for `\(i = 1, \dots, n\)`. ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ] .right-column[ ### Quantile-quantile (Q-Q) plots **Example:**[Bullet penetration depth, cont'd] **Example:**[Bullet penetration depth cont'd] - 230 Grain penetration (mm) - 200 Grain penetration (mm) ![](ch3_files/figure-html/bullets-qqplot-1.png)<!-- --> Except extreme lower values, it **seems** the two distributions have smiliar shapes; however, it still needs more attention to make a rough decision (consider boxplots). ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ] .right-column[ To make a Q-Q plot for two data sets of the same size, 1. order each from the smallest observation to the largest, 2. pair off corresponding values in the two data sets 3. plot ordered pairs, with the horizontal coordinated coming from the first data set and the vertical ones from the second. **Example:**[Q-Q plot by hand] Make a Q-Q plot for the following small artificial data sets. <table> <thead> <tr> <th style="text-align:left;"> Data set 1 </th> <th style="text-align:left;"> Data set 2 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 3, 5, 4, 7, 3 </td> <td style="text-align:left;"> 15, 7, 9, 7, 11 </td> </tr> </tbody> </table> ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ] .right-column[ ### Quantile-Quantile Plots: **QQ plots** are created by plotting the values of `\(Q(p)\)` for a data set against values of `\(Q(p)\)` coming from some other source. - Compare the shape of two data sets (distributions). - Two data sets having "equal shape" is equivalent to say their quantile functions are "**linearly related**". - If the two data sets have different sizes, the size of smaller set is used for both. - A **QQ plot** that is linear indicates the two distributions have similar shape. - If there are significant departures from linearity, the character of those departures reveals the ways in which the shapes differ. ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ] .right-column[ ### Quantile-Quantile Plots: **Example**: How similar the two data sets are? - Set 1: 36, 15, 35, 34, 18, 13, 19, 21, 39, 35 - Set 2: 37, 39, 79, 31, 69, 71, 43, 27, 73, 71 | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |:-----------------|-----|-----|-----|-----|-----|-----|-----|-----|-----|------| | `\(p\)` | | | | | | | | | | | | | | | | | | | | | | | | Set 1 `\(Q(p)\)` | | | | | | | | | | | | | | | | | | | | | | | | Set 2 `\(Q(p)\)` | | | | | | | | | | | ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ] .right-column[ ### Quantile-Quantile Plots: | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |------------------|------|------|------|------|------|------|------|------|------|------| | `\(p\)` | 0.05 | 0.15 | 0.25 | 0.35 | 0.45 | 0.55 | 0.65 | 0.75 | 0.85 | 0.95 | | `\(Q_1(p)\)` | 13 | 15 | 18 | 19 | 21 | 34 | 35 | 35 | 36 | 39 | | `\(Q_2(p)\)` | 27 | 31 | 37 | 39 | 43 | 69 | 71 | 71 | 73 | 79 | <img src="ch3_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ] .right-column[ ### Quantile-Quantile Plots: **Interpretation** The resulting plot shows some kind of linear pattern This means that the quantiles increase at the same rate, even if the sizes of the values themselves are very different. ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ] .right-column[ ### Quantile-Quantile Plots: **Example 6 of chapter 3**: Bullet penetration depth - 230 Grain penetration (mm) - 200 Grain penetration (mm) <img src="ch3_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> Except extreme lower values, it **seems** the two distributions have smiliar shapes; however, it still needs more attention to make a rough decision (consider boxplots). Might want to figure out what has caused the extereme value ] --- layout: true class: middle, center, inverse --- # Theoretical quantile-quantile plots --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ] .right-column[ ### Theoretical quantile-quantile plots Q-Q plots are useful when comparing two finite data sets, but a Q-Q plot can also be used to compare a data set and an expected shape, or *theoretical distribution*. >A **theoretical Q-Q plot** for a data set of size `\(n\)` and a theoretical distribution, with respective quantile functions `\(Q_1\)` and `\(Q_2\)` is a plot of ordered pairs `\((Q_1(p), Q_2(p))\)` for `\(p = \frac{i-.5}{n}\)` where `\(i = 1, \dots, n\)`. The most famous theoretical Q-Q plot occurs when quantiles for the *standard Normal* or *Gaussian* distribution are used. A simple numerical approximation to the quantile function for the Normal distribution is $$ Q(p) \approx 4.9(p^{.14} - (1-p)^{.14}). $$ ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ] .right-column[ ### Theoretical quantile-quantile plots The standard Normal quantiles can be used to make a theoretical Q-Q plot as a way of assessing how bell-shaped a data set is. The resulting plot is called a **normal Q-Q plot**. ![Histogram and approximate quantile function for a standard Normal distirbution.](ch3_files/figure-html/normal-dist-1.png)![Histogram and approximate quantile function for a standard Normal distirbution.](ch3_files/figure-html/normal-dist-2.png) ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ] .right-column[ ### Quantile-Quantile Plots: ### Summary The idea of QQ plots is most useful when applied to one quantile function that represents data and a second that represents a **theoretical distribution** - Empirical QQ plots: the other source are quantiles from another actual data set. - Theoretical QQ plots: the other source are quantiles from a theoretical set - we know the quantiles without having any data. This allows to ask "Does the data set have a shape similar to the theoretical distribution?" ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ] .right-column[ **Example:**[Breaking strengths of paper towels, pg. 79] Here is a study of the dry breaking strength (in grams) of generic paper towels. <table> <thead> <tr> <th style="text-align:right;"> test </th> <th style="text-align:right;"> strength </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 8577 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 9471 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 9011 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 7583 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 8572 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 10688 </td> </tr> <tr> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 9614 </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 9614 </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 8527 </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 9165 </td> </tr> </tbody> </table> ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ] .right-column[ **Example:**[Breaking strengths of paper towels, pg. 79] <table> <thead> <tr> <th style="text-align:right;"> i </th> <th style="text-align:right;"> `\(\frac{i-.5}{20}\)` </th> <th style="text-align:right;"> Breaking strength Q() </th> <th style="text-align:right;"> Standard Normal Q() </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0.05 </td> <td style="text-align:right;"> 7583 </td> <td style="text-align:right;"> -1.6448536 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0.15 </td> <td style="text-align:right;"> 8527 </td> <td style="text-align:right;"> -1.0364334 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 0.25 </td> <td style="text-align:right;"> 8572 </td> <td style="text-align:right;"> -0.6744898 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 0.35 </td> <td style="text-align:right;"> 8577 </td> <td style="text-align:right;"> -0.3853205 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 0.45 </td> <td style="text-align:right;"> 9011 </td> <td style="text-align:right;"> -0.1256613 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 0.55 </td> <td style="text-align:right;"> 9165 </td> <td style="text-align:right;"> 0.1256613 </td> </tr> <tr> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 0.65 </td> <td style="text-align:right;"> 9471 </td> <td style="text-align:right;"> 0.3853205 </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 0.75 </td> <td style="text-align:right;"> 9614 </td> <td style="text-align:right;"> 0.6744898 </td> </tr> <tr> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 0.85 </td> <td style="text-align:right;"> 9614 </td> <td style="text-align:right;"> 1.0364334 </td> </tr> <tr> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 0.95 </td> <td style="text-align:right;"> 10688 </td> <td style="text-align:right;"> 1.6448536 </td> </tr> </tbody> </table> ] --- layout:false .left-column[ ### Summarizing ### Intro ### Purpose ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ] .right-column[ ### Theoretical quantile-quantile plots **Example:**[Breaking strengths of paper towels, pg. 79] <img src="ch3_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" /> ] --- name: inverse layout: true class: center, middle, inverse --- ## Summarizing data Numerically ### Location and central tendency ### Measures of Spread --- layout:false .left-column[ ### Summarizing ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ### Numerical summaries ] .right-column[ ## Numerical summaries When we have a large amount of data, it can become important to reduce the amount of data to a few informative numerical summary values. Numerical summaries highlight important features of the data A **numerical summary (or statistic)** is a number or list of numbers calculated using the data (and only the data). ### Measures of location An "average" represents the center of a quantitative data set. There are several potential technical meanings for the word "average", and they are all *measures of location*. The **(arithmetic) mean** of a sample of quantitative data `\((x_1, \dots, x_n)\)` is `$$\overline{x} = \frac{1}{n} \sum\limits_{i=1}^nx_i$$` ] --- layout:false .left-column[ ### Summarizing ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ### Numerical summaries ] .right-column[ ## Numerical summaries ### Measures of location The **mode** of a discrete or categorical data set is the most frequently-occurring value. We have also seen the *median*, `\(Q(.5)\)`, which is another measure of location. A shortcut to calculating `\(Q(0.5)\)` is - `\(Q(0.5) = x_{\lceil n/2 \rceil}\)` if `\(n\)` is odd - `\(Q(0.5)\)` = `\((x_{n/2} + x_{n/2+1})/2\)` if `\(n\)` is even. **Example:**[Measures of location] Calculate the three measures of location for the following data. `$$0, 1, 1, 2, 3, 5$$` ] --- layout:false .left-column[ ### Summarizing ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ### Numerical summaries ] .right-column[ ## Numerical summaries ### Measures of spread Quantifying variation in a data set can be as important as measuring its location. Again, there are many way to measure the spread of a data set. The **range** of a data set consisting of ordered values `\(x_1 \le \cdots \le x_n\)` is `$$R = x_n - x_1.$$` The **sample variance** of a data set consisting of values `\(x_1, \dots, x_n\)` is `$$s^2 = \frac{1}{n-1} \sum\limits_{i=1}^n (x_i - \overline{x})^2$$` The **sample standard deviation**, `\(s\)`, is the nonnegative square root of the sample variance. We have also seen the *IQR*, `\(Q(.75) - Q(.25)\)`, which is another measure of spread. ] --- layout:false .left-column[ ### Summarizing ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ### Numerical summaries ] .right-column[ ## Numerical summaries ## Measures of spread **Example:**[Measures of spread] Calculate the four measures of spread for the following data. `$$0, 1, 1, 2, 3, 5$$` ] --- layout:false .left-column[ ### Summarizing ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ### Numerical summaries ] .right-column[ ## Numerical summaries ## Measures of spread **Example:**[Sensitivity to outliers] Which measures of center and spread differ drastically between the `\(x_i\)`s and the `\(y_i\)`s? Which ones are the same? `\begin{align*} x_i:& 0, 1, 1, 2, 3, 5 \\ y_i:& 0, 1, 1, 2, 3, 817263489 \end{align*}` ] --- layout:false .left-column[ ### Summarizing ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ### Numerical summaries ] .right-column[ ### Statistics and parameters It's important now to stop and talk about terminology and notation. - Numerical summarizations of sample data are called (sample) **statistics**. Numerical summarizations of population and theoretical distributions are called (population or model) **parameters**. - If a data set, `\(x_1, \dots, x_N\)`, represents an entire population, then the **population (or true) mean** is defined as `$$\mu = \frac{1}{N}\sum\limits_{i = 1}^N x_i.$$` ] --- layout:false .left-column[ ### Summarizing ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ### Numerical summaries ] .right-column[ ### Statistics and parameters - If a data set, `\(x_1, \dots, x_N\)`, represents an entire population, then the **population (or true) variance** is defined as `$$\sigma^2 = \frac{1}{N}\sum\limits_{i=1}^N(x_i - \mu)^2.$$` The **population (or true) standard deviation**, `\(\sigma\)` is the nonnegative square root of `\(\sigma^2\)`. ] --- layout:false .left-column[ ### Summarizing ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ### Numerical summaries ] .right-column[ ## Categorical and count data So far we have talked mainly about summarizing quantitative, or measurement, data. Sometimes, we have categorical or count data to summarize. In this case, we can revisit the *frequency table* and introduce a new type of plot. **Example:**[Cars] Fuel consumption and 10 aspects of automobile design and performance are available for 32 automobiles (1973–74 models) from 1974 Motor Trend US Magazine. ] --- layout:false .left-column[ ### Summarizing ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ### Numerical summaries ] .right-column[ **Example:**[Cars] <table class="table" style="font-size: 10px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> mpg </th> <th style="text-align:left;"> cyl </th> <th style="text-align:left;"> disp </th> <th style="text-align:left;"> hp </th> <th style="text-align:left;"> drat </th> <th style="text-align:left;"> wt </th> <th style="text-align:left;"> qsec </th> <th style="text-align:left;"> vs </th> <th style="text-align:left;"> am </th> <th style="text-align:left;"> gear </th> <th style="text-align:left;"> carb </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Mazda RX4 </td> <td style="text-align:left;"> 21 </td> <td style="text-align:left;"> 6 </td> <td style="text-align:left;"> 160 </td> <td style="text-align:left;"> 110 </td> <td style="text-align:left;"> 3.9 </td> <td style="text-align:left;"> 2.62 </td> <td style="text-align:left;"> 16.46 </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 4 </td> </tr> <tr> <td style="text-align:left;"> Mazda RX4 Wag </td> <td style="text-align:left;"> 21 </td> <td style="text-align:left;"> 6 </td> <td style="text-align:left;"> 160 </td> <td style="text-align:left;"> 110 </td> <td style="text-align:left;"> 3.9 </td> <td style="text-align:left;"> 2.875 </td> <td style="text-align:left;"> 17.02 </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 4 </td> </tr> <tr> <td style="text-align:left;"> Datsun 710 </td> <td style="text-align:left;"> 22.8 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 108 </td> <td style="text-align:left;"> 93 </td> <td style="text-align:left;"> 3.85 </td> <td style="text-align:left;"> 2.32 </td> <td style="text-align:left;"> 18.61 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 4 </td> <td style="text-align:left;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Hornet 4 Drive </td> <td style="text-align:left;"> 21.4 </td> <td style="text-align:left;"> 6 </td> <td style="text-align:left;"> 258 </td> <td style="text-align:left;"> 110 </td> <td style="text-align:left;"> 3.08 </td> <td style="text-align:left;"> 3.215 </td> <td style="text-align:left;"> 19.44 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Hornet Sportabout </td> <td style="text-align:left;"> 18.7 </td> <td style="text-align:left;"> 8 </td> <td style="text-align:left;"> 360 </td> <td style="text-align:left;"> 175 </td> <td style="text-align:left;"> 3.15 </td> <td style="text-align:left;"> 3.44 </td> <td style="text-align:left;"> 17.02 </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Valiant </td> <td style="text-align:left;"> 18.1 </td> <td style="text-align:left;"> 6 </td> <td style="text-align:left;"> 225 </td> <td style="text-align:left;"> 105 </td> <td style="text-align:left;"> 2.76 </td> <td style="text-align:left;"> 3.46 </td> <td style="text-align:left;"> 20.22 </td> <td style="text-align:left;"> 1 </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> 3 </td> <td style="text-align:left;"> 1 </td> </tr> <tr> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> </tr> </tbody> </table> We can construct a frequency table for the *cylinder* variable. <table class="table" style="font-size: 12px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> cyl </th> <th style="text-align:right;"> Frequency </th> <th style="text-align:right;"> Relative Frequency </th> <th style="text-align:right;"> Cumulative Frequency </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:right;"> 0.34375 </td> <td style="text-align:right;"> 0.34375 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 0.21875 </td> <td style="text-align:right;"> 0.56250 </td> </tr> <tr> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:right;"> 0.43750 </td> <td style="text-align:right;"> 1.00000 </td> </tr> </tbody> </table> From this frequency data, we can summarize the categorical data graphically. ] --- layout:false .left-column[ ### Summarizing ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ### Numerical summaries ] .right-column[ - A **bar plot** presents categorical data with rectangular bars with lengths proportional to the values that they represent (usually frequency of occurrence). **Example:**[Cars, cont'd] ![](ch3_files/figure-html/unnamed-chunk-17-1.png)<!-- --> ] --- layout:false .left-column[ ### Summarizing ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ### Numerical summaries ] .right-column[ **Example**: Taking a sample of size 5 from a population we record the following values: <center> 68, 54, 60, 66, 58 </center> Find the variance and standard deviation of this sample. ] --- layout:false ## Example: Finding the Variance Since we are told it is a sample, we need to use **sample variance**. The mean of 68, 54, 60, 66, 58 is 61.2 <span style = "font-size: 60%"> `\begin{align} \small s^2 &= \frac{1}{n-1}\sum_{i=1}^5 (x_i - \bar{x})^2 \\\\ &=\small \frac{1}{n-1}\left( (x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2 + (x_4 - \bar{x})^2 + (x_5 - \bar{x})^2 \right) \\\\ &=\small \frac{1}{5-1} \left((68 - 61.2)^2 + (54 - 61.2)^2 + (60 - 61.2)^2 + (66 - 61.2)^2 + (58 - 61.2)^2 \right) \\\\ &=\small \frac{1}{4} \left( (6.8)^2 + (-7.2)^2 + (-1.2)^2 + (4.8)^2 + (-3.2)^2 \right) \\\\ &=\small \frac{1}{4} \left( 46.24 + 51.84 + 1.44 + 23.04 + 10.24 \right) \\\\ &= \small 33.2\\\\ \end{align}` <span> --- layout:false .left-column[ ### Summarizing ### Descriptive statistics ### Plots ### Freq Tables ### Histogram ### Scatter plots ### Center Stats ### Quantiles ### Boxplot ### QQ-Plot ### Numerical summaries ] .right-column[ ## Example: Finding the Standard Deviation With `\(s^2\)` known, finding `\(s\)` is simple: <span style = "font-size: 65%"> `\begin{align} s &= \sqrt{s^2} \\\\ &= \sqrt{33.2} \\\\ &= 5.7619441\\\\ \end{align}` <span> ]