class: center, middle, inverse
layout: yes
name: inverse

## STAT 305: Chapter 4
## Part I

### Amin Shirazi

.footnote[Course page: [ashirazist.github.io/stat305_s2020.github.io](https://ashirazist.github.io/stat305_s2020.github.io/)]

---
layout: true
class: center, middle, inverse

---

# Chapter 4, Section 1
## Linear Relationships Between Variables

---
layout: false

.left-column[
### Describing Relationships
### Idea
]
.right-column[
### Describing Relationships between variables

This chapter provides methods that address the more involved problem of describing relationships between variables; these methods require more computation. We start with relationships between two variables and move on to more.

## Fitting a line by least squares

**Goal:** Describe the relationship between two quantitative variables.

>We would like to use an equation to describe how a dependent (response) variable, `\(y\)`, changes in response to a change in one or more independent (experimental) variable(s), `\(x\)`.
]

---
layout: false

.left-column[
### Describing Relationships
### Idea
]
.right-column[
### Describing Relationships between variables
### Line review

Recall a linear equation of the form

`$$y = mx + b$$`

where `\(m\)` is the slope and `\(b\)` is the intercept of the line.

In statistics, we use the notation `\(y = \beta_0 + \beta_1 x + \epsilon\)`, where we assume `\(\beta_0\)` and `\(\beta_1\)` are unknown parameters and `\(\epsilon\)` is some error. The goal is to find estimates `\(b_0\)` (intercept) and `\(b_1\)` (slope) for the parameters.
]

---
layout: false

.left-column[
### Describing Relationships
### Idea
]
.right-column[
### Describing Relationships

We have a standard idea of how our experiment works: bivariate data often arise because a quantitative experimental variable *x* has been varied between several different settings (treatments). It is helpful to have an equation relating *y* (the response) to *x* when the purposes are summarization, interpolation, limited extrapolation, and/or process optimization/adjustment.

*And* we know that with a valid experiment, we can say that changes in our experimental variables actually *cause* changes in our response.

But how do we describe those responses when we know that random error would make each result different...
]

---

.left-column[
### Describing Relationships
### Idea
]
.right-column[
### Types of relationships

<img src="ch4_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />
]

---

.left-column[
### Describing Relationships
### Idea
]
.right-column[
### The Underlying Idea

We start with a valid mathematical model, for instance a line:

\\[ y = \beta_0 + \beta_1 \cdot x \\]

In this case,

- \\(\beta_0\\) is the intercept - when \\(x = 0\\), \\(y = \beta_0\\).
- \\(\beta_1\\) is the slope - when \\(x\\) increases by one unit, \\(y\\) increases by \\(\beta_1\\) units.
]

---

.left-column[
### Describing Relationships
### Idea
### Ex: Bar Stress
]
.right-column[
## Example: Stress on Bars

An experiment examining the effects of **stress** on **time until fracture** is performed by taking a sample of 10 stainless steel rods immersed in 40% CaCl solution at 100 degrees Celsius and applying different amounts of uniaxial stress.
The results are recorded below:

| | | | | | | | | | | |
|--------------------------------------|------|------|------|------|------|------|------|------|------|------|
| **stress** \\((\text{kg/mm}^2)\\) | 2.5 | 5.0 | 10.0 | 15.0 | 17.5 | 20.0 | 25.0 | 30.0 | 35.0 | 40.0 |
| **lifetime** (hours) | 63 | 58 | 55 | 61 | 62 | 37 | 38 | 45 | 46 | 19 |

A good first place to investigate the relationship between our experimental variables (in this case, stress) and the response (in this case, lifetime) is to use a scatterplot and look to see if there might be any basic mathematical function that could describe the relationship between the variables.
]

---

.left-column[
### Describing Relationships
### Idea
### Ex: Bar Stress
]
.right-column[
**Example: Stress on Bars (continued)**

Our data:

| | | | | | | | | | | |
|--------------------------------------|------|------|------|------|------|------|------|------|------|------|
| **stress** \\((\text{kg/mm}^2)\\) | 2.5 | 5.0 | 10.0 | 15.0 | 17.5 | 20.0 | 25.0 | 30.0 | 35.0 | 40.0 |
| **lifetime** (hours) | 63 | 58 | 55 | 61 | 62 | 37 | 38 | 45 | 46 | 19 |

- Plotting stress along the \\(x\\)-axis and plotting lifetime along the \\(y\\)-axis we get

<center>
<img src="figure/stress_lifetime_plot.png" alt="scatterplot of stress vs lifetime" height="250">
</center>
]

---

.left-column[
### Describing Relationships
### Idea
### Ex: Bar Stress
]
.right-column[
**Example: Stress on Bars (continued)**

Our data:

| | | | | | | | | | | |
|--------------------------------------|------|------|------|------|------|------|------|------|------|------|
| **stress** \\((\text{kg/mm}^2)\\) | 2.5 | 5.0 | 10.0 | 15.0 | 17.5 | 20.0 | 25.0 | 30.0 | 35.0 | 40.0 |
| **lifetime** (hours) | 63 | 58 | 55 | 61 | 62 | 37 | 38 | 45 | 46 | 19 |

- Examining the plot, we might determine that there could be a linear relationship between the two. The red line looks like it fits the data pretty well.

<center>
<img src="figure/stress_lifetime_plot_line1.png" alt="scatterplot with one candidate line" height="250">
</center>
]

---

.left-column[
### Describing Relationships
### Idea
### Ex: Bar Stress
]
.right-column[
**Example: Stress on Bars (continued)**

Our data:

| | | | | | | | | | | |
|--------------------------------------|------|------|------|------|------|------|------|------|------|------|
| **stress** \\((\text{kg/mm}^2)\\) | 2.5 | 5.0 | 10.0 | 15.0 | 17.5 | 20.0 | 25.0 | 30.0 | 35.0 | 40.0 |
| **lifetime** (hours) | 63 | 58 | 55 | 61 | 62 | 37 | 38 | 45 | 46 | 19 |

- But there are several other lines that fit the data pretty well, too.

<center>
<img src="figure/stress_lifetime_plot_line2.png" alt="scatterplot with several candidate lines" height="250">
</center>

- How do we decide which is best?
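One way to make "best" concrete is to score each candidate line by how far its predictions fall from the observed points. A minimal sketch in R (the intercepts and slopes below are made-up candidates, not values taken from the plots):

```r
# Score candidate lines by their sum of squared prediction errors
stress   <- c(2.5, 5, 10, 15, 17.5, 20, 25, 30, 35, 40)
lifetime <- c(63, 58, 55, 61, 62, 37, 38, 45, 46, 19)

sse <- function(b0, b1) sum((lifetime - (b0 + b1 * stress))^2)

sse(65, -0.9)   # one candidate line
sse(60, -0.7)   # another; a smaller SSE means a better fit
```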
]

---

.left-column[
### Describing Relationships
### Idea
### Ex: Bars
### Fitting Lines
]
.right-column[
### Where the line comes from

When we are trying to find a line that fits our data, what we are _really_ doing is saying that there is a true physical relationship between our experimental variable \\(x\\) and our response \\(y\\) that has the following form:

**Theoretical Relationship**

\\[ y = \beta_0 + \beta_1 \cdot x \\]

However, the response we observe is also affected by random noise:

**Observed Relationship**

`\begin{align} y &= \beta_0 + \beta_1 \cdot x + \text{errors} \\\\ &= \text{signal} + \text{noise} \end{align}`

If we did a good job, hopefully the errors will be small enough that we can say

\\[ y \approx \beta_0 + \beta_1 \cdot x \\]
]

---

.left-column[
### Describing Relationships
### Idea
### Ex: Bars
### Fitting Lines
]
.right-column[
### Where the line comes from

So, if things have gone well, we are attempting to estimate the values of \\(\beta_0\\) and \\(\beta_1\\) from our observed relationship

\\[ y \approx \beta_0 + \beta_1 \cdot x \\]

Using the following notation:

- \\(b\_0\\) is the estimated value of \\(\beta\_0\\) and
- \\(b\_1\\) is the estimated value of \\(\beta\_1\\)
- \\(\hat{y}\\) is the estimated response

We can write a **fitted relationship**:

\\[ \hat{y} = b\_0 + b\_1 \cdot x \\]

The key here is that we are going from the underlying _true, theoretical_ relationship to an _estimated_ relationship. In other words, we will never get the true values \\(\beta_0\\) and \\(\beta_1\\), but we can estimate them. However, this doesn't tell us _how_ to estimate them.
]

---

.left-column[
### Describing Relationships
### Idea
### Ex: Bars
### Fitting Lines
### Best Estimate
]
.right-column[
### The principle of Least Squares

A good estimate should be based on the data. Suppose that we have observed responses \\(y\_1, y\_2, \ldots, y\_n\\) for experimental variables set at \\(x\_1, x\_2, \ldots, x\_n\\). Then the **Principle of Least Squares** says that the best estimates of \\(\beta\_0\\) and \\(\beta\_1\\) are the values that **minimize**

\\[ \sum_{i = 1}^n (y\_i - \hat{y}\_i)^2 \\]

In our case, since \\( \hat{y}\_i = b\_0 + b\_1 \cdot x\_i \\) we need to choose values for \\(b\_0\\) and \\(b\_1\\) that minimize

\\[ \sum\_{i = 1}^n (y\_i - \hat{y}\_i)^2 = \sum\_{i = 1}^n \left(y\_i - (b\_0 + b\_1 \cdot x\_i ) \right)^2 \\]

In other words, we need to minimize something with respect to two values we get to choose - we can do this by taking derivatives.
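Before the calculus, it can help to see that this really is just a two-variable minimization problem. A quick sketch in R, handing the sum of squared errors to a general-purpose numerical optimizer (`optim`; the starting point `c(60, -1)` is an arbitrary guess):

```r
# Minimize the sum of squared errors numerically over (b0, b1)
stress   <- c(2.5, 5, 10, 15, 17.5, 20, 25, 30, 35, 40)
lifetime <- c(63, 58, 55, 61, 62, 37, 38, 45, 46, 19)

sse <- function(b) sum((lifetime - (b[1] + b[2] * stress))^2)

optim(c(60, -1), sse)$par  # approximately (66.42, -0.90)
```

The derivation that follows gives the same answer in closed form, with no numerical search needed.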
]

---

### Deriving the Least Squares Estimates (Optional reading)

We can rewrite the target we want to minimize so that the variables are less tangled together:

<span style = "font-size: 60%">
`\begin{align} \sum_{i = 1}^n (y_i - \hat{y}_i)^2 &= \sum_{i = 1}^n \left(y_i - (b_0 + b_1 x_i ) \right)^2 \\\\ &= \sum_{i = 1}^n \left(y_i^2 - 2 y_i (b_0 + b_1 x_i ) + (b_0 + b_1 x_i )^2\right) \\\\ &= \sum_{i = 1}^n y_i^2 - \sum_{i = 1}^n 2 y_i (b_0 + b_1 x_i ) + \sum_{i = 1}^n (b_0 + b_1 x_i )^2 \\\\ &= \sum_{i = 1}^n y_i^2 - \sum_{i = 1}^n (2 y_i b_0 + 2 y_i b_1 x_i ) + \sum_{i = 1}^n \left(b_0^2 + 2 b_0 b_1 x_i + (b_1 x_i )^2 \right) \\\\ &= \sum_{i = 1}^n y_i^2 - \sum_{i = 1}^n 2 y_i b_0 - \sum_{i = 1}^n 2 y_i b_1 x_i + \sum_{i = 1}^n b_0^2 + \sum_{i = 1}^n 2 b_0 b_1 x_i + \sum_{i = 1}^n b_1^2 x_i^2 \\\\ &= \sum_{i = 1}^n y_i^2 - 2 b_0 \sum_{i = 1}^n y_i - 2 b_1 \sum_{i = 1}^n y_i x_i + n b_0^2 + 2 b_0 b_1 \sum_{i = 1}^n x_i + b_1^2 \sum_{i = 1}^n x_i^2 \\\\ \end{align}`
</span>

---

.left-column[
### Describing Relationships
### Idea
### Ex: Bars
### Fitting Lines
### Best Estimate
]
.right-column[
### Deriving the Least Squares Estimates (continued)

How do we minimize it?

- Since we have two "variables" we need to take derivatives with respect to both.
- Remember we have our data, so we know every value of \\(x_i\\) and \\(y_i\\) and can treat those parts as constants.

**The derivative with respect to \\(\mathbf{b_0}\\)**:

`\[ -2 \sum_{i = 1}^n y_i + 2 n b_0 + 2 b_1 \sum_{i = 1}^n x_i \]`

**The derivative with respect to \\(\mathbf{b_1}\\)**:

`\[ -2 \sum_{i = 1}^n y_i x_i + 2 b_0 \sum_{i = 1}^n x_i + 2 b_1 \sum_{i = 1}^n x_i^2 \]`
]

---

.left-column[
### Describing Relationships
### Idea
### Ex: Bars
### Fitting Lines
### Best Estimate
]
.right-column[
### Deriving the Least Squares Estimates (continued)

We set both equal to 0 and solve them at the same time:

`\begin{align} -2 \sum_{i = 1}^n y_i + 2 n b_0 + 2 b_1 \sum_{i = 1}^n x_i &= 0 \\\\ -2 \sum_{i = 1}^n y_i x_i + 2 b_0 \sum_{i = 1}^n x_i + 2 b_1 \sum_{i = 1}^n x_i^2 &=0 \\\\ \end{align}`

We can rewrite the first equation as:

`\begin{align} b_0 &= \frac{1}{n} \sum_{i = 1}^n y_i - b_1 \frac{1}{n} \sum_{i = 1}^n x_i \\\\ &= \bar{y} - b_1 \bar{x} \end{align}`

and then replace all \\(b_0\\) in the second equation (there is some algebra along the way, of course).
]

---

.left-column[
### Describing Relationships
### Idea
### Ex: Bars
### Fitting Lines
### Best Estimate
]
.right-column[
### Deriving the Least Squares Estimates (continued)

After a little simplification we arrive at our estimates:

**Least Squares Estimates for Linear Fit**

`\begin{align} b_0 &= \bar{y}- b_1 \bar{x} \\\\ b_1 &= \frac{\sum_{i = 1}^n y_i x_i - n \bar{x} \bar{y}}{\sum_{i = 1}^n x_i^2 - n \bar{x}^2} \\\\ &= \frac{\sum_{i = 1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i = 1}^n (x_i - \bar{x})^2} \end{align}`
]

---

.left-column[
### Describing Relationships
### Idea
### Ex: Bars
### Fitting Lines
### Best Estimate
]
.right-column[
**Wrap Up**

>- Don't try to memorize the derivation. I will never ask you to do that on an exam.
>- Try to understand the simplification steps - the ones that moved constants out of summations, for example.
>- This is one rule - there are others, but **Least Squares Estimates** have some useful properties that will make them the obvious best choice as we continue the course.
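As a check on the formulas above, here is a small R sketch that computes \\(b_1\\) and \\(b_0\\) directly from the closed-form expressions, using the bar-stress data from this chapter:

```r
# Closed-form least squares estimates for a line
stress   <- c(2.5, 5, 10, 15, 17.5, 20, 25, 30, 35, 40)
lifetime <- c(63, 58, 55, 61, 62, 37, 38, 45, 46, 19)

b1 <- sum((stress - mean(stress)) * (lifetime - mean(lifetime))) /
      sum((stress - mean(stress))^2)
b0 <- mean(lifetime) - b1 * mean(stress)

c(b0 = b0, b1 = b1)          # 66.4177, -0.9009
coef(lm(lifetime ~ stress))  # R's built-in fit gives the same answer
```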
]

---

.left-column[
### Describing Relationships
### Idea
### Ex: Bars
### Fitting Lines
### Best Estimate
]
.right-column[
**Example: Stress on Bars**

| | | | | | | | | | | |
|--------------------------------------|------|------|------|------|------|------|------|------|------|------|
| **stress** \\((\text{kg/mm}^2)\\) | 2.5 | 5.0 | 10.0 | 15.0 | 17.5 | 20.0 | 25.0 | 30.0 | 35.0 | 40.0 |
| **lifetime** (hours) | 63 | 58 | 55 | 61 | 62 | 37 | 38 | 45 | 46 | 19 |

Estimating the best slope and intercept using least squares:

<span style = "font-size: 70%">
`\begin{align} b_0 &= \bar{y}- b_1 \bar{x} \\\\ b_1 &= \frac{\sum_{i = 1}^n y_i x_i - n \bar{x} \bar{y}}{\sum_{i = 1}^n x_i^2 - n \bar{x}^2} \\\\ &= \frac{\sum_{i = 1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i = 1}^n (x_i - \bar{x})^2} \end{align}`
</span>

In our case we have the following:

<span style = "font-size: 65%">
`\begin{align} \sum_{i = 1}^{10} y_i = 484, \sum_{i = 1}^{10} x_i = 200, \sum_{i = 1}^{10} x_i y_i = 8407.5, \sum_{i = 1}^{10} x_i^2 = 5412.5 \end{align}`
</span>
]

---

.left-column[
### Describing Relationships
### Idea
### Ex: Bars
### Fitting Lines
### Best Estimate
]
.right-column[
**Example: Stress on Bars**

| | | | | | | | | | | |
|--------------------------------------|------|------|------|------|------|------|------|------|------|------|
| **stress** \\((\text{kg/mm}^2)\\) | 2.5 | 5.0 | 10.0 | 15.0 | 17.5 | 20.0 | 25.0 | 30.0 | 35.0 | 40.0 |
| **lifetime** (hours) | 63 | 58 | 55 | 61 | 62 | 37 | 38 | 45 | 46 | 19 |

<span style = "font-size: 65%">
`\begin{align} \sum_{i = 1}^{10} y_i = 484, \sum_{i = 1}^{10} x_i = 200, \sum_{i = 1}^{10} x_i y_i = 8407.5, \sum_{i = 1}^{10} x_i^2 = 5412.5 \end{align}`
</span>

Using this we can estimate \\(b_1\\):

<span style = "font-size: 75%">
`\begin{align} b_1 &= \frac{\sum_{i = 1}^n y_i x_i - n \bar{x} \bar{y}}{\sum_{i = 1}^n x_i^2 - n \bar{x}^2} \\\\ &= \frac{8407.5 - 10 \left(\frac{200}{10}\right) \left(\frac{484}{10}\right)}{5412.5 - 10 \left(\frac{200}{10}\right)^2} \\\\ &= \frac{-1272.5}{1412.5} \\\\ &\approx -0.9009 \end{align}`
</span>
]

---

.left-column[
### Describing Relationships
### Idea
### Ex: Bars
### Fitting Lines
### Best Estimate
]
.right-column[
**Example: Stress on Bars**

| | | | | | | | | | | |
|--------------------------------------|------|------|------|------|------|------|------|------|------|------|
| **stress** \\((\text{kg/mm}^2)\\) | 2.5 | 5.0 | 10.0 | 15.0 | 17.5 | 20.0 | 25.0 | 30.0 | 35.0 | 40.0 |
| **lifetime** (hours) | 63 | 58 | 55 | 61 | 62 | 37 | 38 | 45 | 46 | 19 |

<span style = "font-size: 70%">
`\begin{align} \sum_{i = 1}^{10} y_i = 484, \sum_{i = 1}^{10} x_i = 200, \sum_{i = 1}^{10} x_i y_i = 8407.5, \sum_{i = 1}^{10} x_i^2 = 5412.5 \end{align}`
</span>

And using \\(b_1\\) we can estimate \\(b_0\\):

<span style = "font-size: 80%">
`\begin{align} b_0 &= \bar{y} - b_1 \bar{x} \\\\ &= \left(\frac{484}{10}\right) - b_1 \left(\frac{200}{10}\right) \\\\ &= 48.4 - \left(\frac{-1272.5}{1412.5}\right) 20.0\\\\ &= 66.4177 \end{align}`
</span>

Which gives us the **Fitted Relationship**:

\\[ \hat{y} = 66.4177 - 0.9009 x \\]
]

---

.left-column[
### Describing Relationships
### Idea
### Ex: Bars
### Fitting Lines
### Best Estimate
]
.right-column[
**Example: Stress on Bars**

| | | | | | | | | | | |
|--------------------------------------|------|------|------|------|------|------|------|------|------|------|
| **stress** \\((\text{kg/mm}^2)\\) | 2.5 | 5.0 | 10.0 | 15.0 | 17.5 | 20.0 | 25.0 | 30.0 | 35.0 | 40.0 |
| **lifetime** (hours) | 63 | 58 | 55 | 61 | 62 | 37 | 38 | 45 | 46 | 19 |
\\[ \hat{y} = 66.4177 - 0.9009 x \\]

<img src="ch4_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />
]

---

.left-column[
### Describing Relationships
### Idea
### Ex: Bars
### Fitting Lines
### Best Estimate
]
.right-column[
**Example: Stress on Bars**

| | | | | | | | | | | |
|--------------------------------------|------|------|------|------|------|------|------|------|------|------|
| **stress** \\((\text{kg/mm}^2)\\) | 2.5 | 5.0 | 10.0 | 15.0 | 17.5 | 20.0 | 25.0 | 30.0 | 35.0 | 40.0 |
| **lifetime** (hours) | 63 | 58 | 55 | 61 | 62 | 37 | 38 | 45 | 46 | 19 |

**Fitted line**

<img src="ch4_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
]

---

.left-column[
### Describing Relationships
### Idea
### Ex: Bars
### Fitting Lines
### Best Estimate
]
.right-column[
When making predictions, don't *extrapolate*.

> **Extrapolation** is when a value of `\(x\)` beyond the range of our actual observations is used to find a predicted value for `\(y\)`. We don't know the behavior of the line beyond our collected data.

>**Interpolation** is when a value of `\(x\)` within the range of our observations is used to find a predicted value for `\(y\)`.
]

---
name: inverse
layout: true
class: center, middle, inverse

---

## Good Fit

---
layout: false

.left-column[
### Describing Relationships
### Idea
### Ex: Bars
### Fitting Lines
### Best Estimate
### Good Fit
]
.right-column[
### Knowing when a relationship fits the data well

So far we have been fitting lines to describe our data. A first question to ask may be something like:

- **Q**: In what kinds of situations can a linear fit be used to describe the relationship between an experimental variable and a response?
- **A**: Any time both the experimental variable and the response variable are numeric.

**However**, not all fits are created equal:

<center>
<img src="line_fits.png" alt="three scatterplots with fitted lines" height="250">
</center>
]

---
name: inverse
layout: true
class: center, middle, inverse

---

## Correlation

---
layout:false

.left-column[
### Describing Relationships
### Idea
### Ex: Bars
### Fitting Lines
### Best Estimate
### Good Fit
### Correlation
]
.right-column[
### Correlation

Visually we can assess if a fitted line does a good job of fitting the data using a scatterplot. However, it is also helpful to have methods of quantifying the quality of that fit.

>**Correlation** gives the strength and direction of the linear relationship between two variables.

For a sample consisting of data pairs \\((x_1, y_1)\\), \\((x_2, y_2)\\), \\((x_3, y_3)\\), ... \\((x_n, y_n)\\), the sample linear correlation, \\(r\\), is defined by

`\[ r = \frac{ \sum_{i = 1}^{n} (x_i - \bar{x}) (y_i - \bar{y}) }{ \sqrt{ \left(\sum_{i = 1}^{n} (x_i - \bar{x})^2 \right) \left(\sum_{i = 1}^{n} (y_i - \bar{y})^2 \right) } } \]`

which can also be written as

`\[ r = \frac{ \sum_{i = 1}^{n} x_i y_i - n \bar{x}\bar{y} }{ \sqrt{ \left(\sum_{i = 1}^{n} x_i^2 - n\bar{x}^2 \right) \left(\sum_{i = 1}^{n} y_i^2 - n \bar{y}^2 \right) } } \]`
]

---
layout:false

.left-column[
### Describing Relationships
### Idea
### Ex: Bars
### Fitting Lines
### Best Estimate
### Good Fit
### Correlation
]
.right-column[
### Correlation

**1. Sample correlation (aka, sample linear correlation)**

The value of \\(r\\) is always between -1 and +1.

- The closer the value is to -1 or +1, the stronger the linear relationship.
- Negative values of \\(r\\) indicate a negative relationship (as \\(x\\) increases, \\(y\\) decreases).
- Positive values of \\(r\\) indicate a positive relationship (as \\(x\\) increases, \\(y\\) increases).
]

---
layout:false

.left-column[
### Describing Relationships
### Idea
### Ex: Bars
### Fitting Lines
### Best Estimate
### Good Fit
### Correlation
]
.right-column[
- One possible rule of thumb:

| Range of \\(r\\) | Strength | Direction |
|-------------------|---------------|-----------|
| 0.9 to 1.0 | Very Strong | Positive |
| 0.7 to 0.9 | Strong | Positive |
| 0.5 to 0.7 | Moderate | Positive |
| 0.3 to 0.5 | Weak | Positive |
| -0.3 to 0.3 | Very Weak/No Relationship | |
| -0.5 to -0.3 | Weak | Negative |
| -0.7 to -0.5 | Moderate | Negative |
| -0.9 to -0.7 | Strong | Negative |
| -1.0 to -0.9 | Very Strong | Negative |
]

---
layout:false

.left-column[
### Describing Relationships
### Idea
### Ex: Bars
### Fitting Lines
### Best Estimate
### Good Fit
### Correlation
]
.right-column[
<center>
<img src="line_fits.png" alt="three scatterplots with fitted lines" height="250">
</center>

The values of \\(r\\), from left to right in the plot above, are:

```
r=0.9998782
r=-0.8523543
r=-0.1347395
```

- In the first case the linear relationship is almost perfect, and we would happily refer to this as a **very strong**, **positive** relationship between \\(x\\) and \\(y\\).
- In the second case the linear relationship seems appropriate - we could safely call it a **strong**, **negative** linear relationship between \\(x\\) and \\(y\\).
- In the third case the value of \\(r\\) indicates that there is **no linear relationship** between the value of \\(x\\) and the value of \\(y\\).

In each case we *can* fit a linear model. However,

- a line is clearly a good choice for the data on the left
- the middle data could be described well by a line, but the relationship is not as obvious as the case on the left.
- a linear relationship is clearly inappropriate for the data on the right (something like \\(x^2\\) would be better).

We need a way to identify the quality of the fit **concretely**.
]

---
layout:false

.left-column[
### Describing Relationships
### Idea
### Ex: Bars
### Fitting Lines
### Best Estimate
### Good Fit
### Correlation
]
.right-column[
**1. Sample correlation (aka, sample linear correlation)**

**Example**: Stress and Lifetime of Bars

From the data we can calculate the following values:

<center>
<span style = "font-size: 60%">
\\[ \sum\_{i = 1}^{10} x\_i = 200, \sum\_{i = 1}^{10} x\_i^2 = 5412.5, \\]
\\[ \sum\_{i = 1}^{10} y\_i = 484, \sum\_{i = 1}^{10} y\_i^2 = 25238, \sum\_{i = 1}^{10} x\_i y\_i = 8407.5 \\]
</span>
</center>

and we can write:

<span style = "font-size: 80%">
`\begin{align} r &= \frac{ \sum_{i = 1}^{n} x_i y_i - n \bar{x}\bar{y} }{ \sqrt{ \left(\sum_{i = 1}^{n} x_i^2 - n\bar{x}^2 \right) \left(\sum_{i = 1}^{n} y_i^2 - n \bar{y}^2 \right) } } \\\\ &= \frac{ 8407.5 - 10 (20) (48.4) }{ \sqrt{ \left(5412.5 - 10 (20)^2 \right) \left(25238 - 10 (48.4)^2 \right) } } \\\\ &\approx -0.7953 \end{align}`
</span>

So we would say that stress applied and lifetime of the bar have a **strong, negative, linear relationship**.
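As a quick sketch in R, the same value falls out of either the shortcut formula or the built-in `cor()` function:

```r
# Sample correlation for the bar-stress data, by formula and built-in
stress   <- c(2.5, 5, 10, 15, 17.5, 20, 25, 30, 35, 40)
lifetime <- c(63, 58, 55, 61, 62, 37, 38, 45, 46, 19)
n <- length(stress)

r <- (sum(stress * lifetime) - n * mean(stress) * mean(lifetime)) /
     sqrt((sum(stress^2) - n * mean(stress)^2) *
          (sum(lifetime^2) - n * mean(lifetime)^2))

c(by_formula = r, built_in = cor(stress, lifetime))  # both about -0.795
```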
]

---
name: inverse
layout: true
class: center, middle, inverse

---

## Residuals

---
layout:false

.left-column[
### Describing Relationships
### Idea
### Ex: Bars
### Fitting Lines
### Best Estimate
### Good Fit
### Correlation
### Residuals
]
.right-column[
### Residuals - The "residue" left over from fitting a line

<center>
<img src="residual_plpt.gif" alt="scatterplot showing residuals around a fitted line" height="250">
</center>

- Each point represents some \\((x_i, y_i)\\) pair from our data
- We use the Least Squares approach to find the best fit line, \\(\hat{y} = b_0 + b_1 x\\)
- For any value \\(x_i\\) in our data set, we can get a fitted (or predicted) value \\(\hat{y}_i = b_0 + b_1 x_i \\)
]

---

.left-column[
### Describing Relationships
### Idea
### Ex: Bars
### Fitting Lines
### Best Estimate
### Good Fit
### Correlation
### Residuals
]
.right-column[
### Residuals

<center>
<img src="residual_plpt.gif" alt="scatterplot showing residuals around a fitted line" height="250">
</center>

- The residual is the difference between the observed data point and the fitted prediction: \\[ e_i = y_i - \hat{y}_i \\]
- **In the linear case**, using \\(\hat{y} = b_0 + b_1 x\\), we can also write \\[ e_i = y_i - \hat{y}_i = y_i - (b_0 + b_1 x_i) \\] for each pair \\((x_i, y_i)\\).
]

---

.left-column[
### Describing Relationships
### Idea
### Ex: Bars
### Fitting Lines
### Best Estimate
### Good Fit
### Correlation
### Residuals
]
.right-column[
### Residuals

<center>
<img src="residual_plpt.gif" alt="scatterplot showing residuals around a fitted line" height="250">
</center>

**ROPe**: **R**esiduals = **O**bserved - **P**redicted (using symbol \\(e_i\\))

- If \\(e\_i > 0\\) then \\(y_i - \hat{y}\_i > 0\\) and \\(y\_i > \hat{y}\_i\\), meaning the observed is larger than the predicted - we are "underpredicting"
- If \\(e\_i < 0\\) then \\(y_i - \hat{y}\_i < 0\\) and \\(y\_i < \hat{y}\_i\\), meaning the observed is smaller than the predicted - we are "overpredicting"

Obviously, we would like our residuals to be small compared to the size of the response values.
]

---
layout: true
class: center, middle, inverse

---

# Assessing Models

---
layout:false

.left-column[
### Describing Relationships
### Idea
### Fitting Lines
### Best Estimate
### Good Fit
### Correlation
### Residuals
### Assessment
]
.right-column[
### Assessing models

When modeling, it's important to assess the (1) **validity** and (2) **usefulness** of your model. To assess the validity of the model, we will look to the residuals. If the fitted equation is a good one, the residuals will be:

- Patternless (cloud-like, random scatter)
- Centered at zero
- Bell-shaped in distribution

To check if these three things hold, we will use two plotting methods.

>A **residual plot** is a plot of the residuals, `\(e = y - \hat{y}\)`, vs. `\(x\)` (or `\(\hat{y}\)` in the case of multiple regression, Section 4.2).
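As a sketch of how such a plot can be built in R for the bar-stress fit (base graphics; any plotting tool works):

```r
# Residual plot for the bar-stress linear fit
stress   <- c(2.5, 5, 10, 15, 17.5, 20, 25, 30, 35, 40)
lifetime <- c(63, 58, 55, 61, 62, 37, 38, 45, 46, 19)

fit <- lm(lifetime ~ stress)
plot(stress, resid(fit),
     xlab = "stress (kg/mm^2)", ylab = "residual",
     main = "Residuals vs x")
abline(h = 0, lty = 2)  # residuals should scatter randomly around this line
```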
]

---
layout:false

.left-column[
### Describing Relationships
### Idea
### Fitting Lines
### Best Estimate
### Good Fit
### Correlation
### Residuals
### Assessment
]
.right-column[
#### Assessing models
#### Residual plot

<img src="ch4_files/figure-html/residual_plots-1.png" width="48%" style="display: block; margin: auto;" /><img src="ch4_files/figure-html/residual_plots-2.png" width="48%" style="display: block; margin: auto;" />
]

---
layout:false

.left-column[
### Describing Relationships
### Idea
### Fitting Lines
### Best Estimate
### Good Fit
### Correlation
### Residuals
### Assessment
]
.right-column[
#### Assessing models
#### Residual plot

<img src="ch4_files/figure-html/unnamed-chunk-5-1.png" width="48%" style="display: block; margin: auto;" /><img src="ch4_files/figure-html/unnamed-chunk-5-2.png" width="48%" style="display: block; margin: auto;" />
]

---
layout:false

.left-column[
### Describing Relationships
### Idea
### Fitting Lines
### Best Estimate
### Good Fit
### Correlation
### Residuals
### Assessment
]
.right-column[
#### Assessing models
#### Residual plot

<img src="ch4_files/figure-html/unnamed-chunk-6-1.png" width="48%" style="display: block; margin: auto;" /><img src="ch4_files/figure-html/unnamed-chunk-6-2.png" width="48%" style="display: block; margin: auto;" />
]

---
layout:false

.left-column[
### Describing Relationships
### Idea
### Fitting Lines
### Best Estimate
### Good Fit
### Correlation
### Residuals
### Assessment
]
.right-column[
#### Normality of residuals

- In addition to the residual versus predicted plot, there are other residual plots we can use to check regression assumptions.
- A **histogram of residuals** and a **normal probability plot (QQ-plot)** of residuals can be used to evaluate whether our residuals are approximately normally distributed.
- However, unless the residuals are far from normal or have an obvious pattern, we generally don't need to be overly concerned about normality.
- Note that we check the residuals for normality. We don't need to check for normality of the raw data. Our response and predictor variables do not need to be normally distributed in order to fit a linear regression model.
]

---
layout:false

.left-column[
### Describing Relationships
### Idea
### Fitting Lines
### Best Estimate
### Good Fit
### Correlation
### Residuals
### Assessment
]
.right-column[
#### Normality of residuals

Draw a histogram of the residuals (review the JMP tutorial for histograms)

<center>
<img src="figure/diag9.png" alt="histogram of residuals" width="400" height="300">
</center>

It seems the residuals are not normally distributed in this example. The residuals have a left-skewed distribution.
]

---
layout:false

.left-column[
### Describing Relationships
### Idea
### Fitting Lines
### Best Estimate
### Good Fit
### Correlation
### Residuals
### Assessment
]
.right-column[
#### Normality of residuals

Following the instructions in the JMP tutorials (and also HW #3), you can draw a **normal QQ-plot** to evaluate whether the residuals meet the normality assumption. Plotting the normal QQ-plot for the same example:

<center>
<img src="figure/diag10.png" alt="normal QQ-plot of residuals" width="300">
</center>

- Again, the QQ-plot confirms that the assumption of normally distributed residuals is violated to some extent in this example.
- More examination is required to fix the issue or to find the problem.
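For reference, the same normality checks are one-liners in R (a sketch, using the bar-stress fit from earlier slides as the model object):

```r
# Normality checks for residuals in R (parallel to the JMP steps)
stress   <- c(2.5, 5, 10, 15, 17.5, 20, 25, 30, 35, 40)
lifetime <- c(63, 58, 55, 61, 62, 37, 38, 45, 46, 19)
fit <- lm(lifetime ~ stress)

hist(resid(fit), main = "Histogram of residuals")  # look for a rough bell shape
qqnorm(resid(fit))   # points near the reference line suggest normality
qqline(resid(fit))
```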
]

---
layout: true
class: middle, center, inverse

---

# Coefficient of Determination

---
layout:false

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
]
.right-column[
#### Coefficient of Determination (\\(R^2\\))

We know that our responses have variability - they are not always the same. We hope that the relationship between our response and our explanatory variables explains some of the variability in our responses.

\\(R^2\\) is the fraction of the total variability in the response (\\(y\\)) accounted for by the fitted relationship.

- When \\(R^2\\) is close to 1 we have explained **almost all** of the variability in our response using the fitted relationship (i.e., the fitted relationship is good).
- When \\(R^2\\) is close to 0 we have explained **almost none** of the variability in our response using the fitted relationship (i.e., the fitted relationship is bad).

There are a number of ways we can calculate \\(R^2\\). Some require you to know more than others or do more work by hand.
]

---

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
]
.right-column[
#### Calculating Coefficient of Determination (\\(R^2\\))

**Method a**. Using the data and our fitted relationship:

For an experiment with response values \\(y_1, y_2, \ldots, y_n\\) and fitted values \\(\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n\\) we calculate the following:

`\[ R^2 = \frac{ \sum_{i=1}^n (y_i - \bar{y})^2 - \sum_{i=1}^n (y_i - \hat{y}_i)^2 }{ \sum_{i=1}^n (y_i - \bar{y})^2 } \]`

- This is the longest way to calculate \\(R^2\\) by hand.
- It requires you to know every response value in the data (\\(y_i\\)) and every fitted value (\\(\hat{y}_i\\))
]

---

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
]
.right-column[
#### Calculating Coefficient of Determination (\\(R^2\\))

**Method b**. Using Sums of Squares

For an experiment with response values \\(y_1, y_2, \ldots, y_n\\) and fitted values \\(\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n\\) we calculate the following:

- Total Sum of Squares (SSTO): a baseline for the variability in our response. \\[ SSTO = \sum_{i=1}^n (y_i - \bar{y})^2 \\]
- Error Sum of Squares (SSE): The variability in the data after fitting the line \\[ SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2 \\]
- Regression Sum of Squares (SSR): The variability in the data accounted for by the fitted relationship \\[ SSR = SSTO - SSE \\]
]

---

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
]
.right-column[
#### Calculating Coefficient of Determination (\\(R^2\\))

**Method b**. Using Sums of Squares

We can write the \\(R^2\\) using these sums of squares:

\\[ R^2 = \frac{SSR}{SSTO} = \frac{SSTO - SSE}{SSTO} = 1 - \frac{SSE}{SSTO} \\]

- **Q**: What's the advantage of using the sums of squares?
- **A**: The values of SSTO, SSE, and SSR are used in many statistical calculations. Because of this, they are commonly reported by statistical software. For instance, fitting a model in JMP produces these as part of the output.
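As a sketch in R (the same arithmetic works on sums of squares copied from JMP output):

```r
# R^2 from sums of squares for the bar-stress fit
stress   <- c(2.5, 5, 10, 15, 17.5, 20, 25, 30, 35, 40)
lifetime <- c(63, 58, 55, 61, 62, 37, 38, 45, 46, 19)
fit <- lm(lifetime ~ stress)

ssto <- sum((lifetime - mean(lifetime))^2)  # total variability
sse  <- sum(resid(fit)^2)                   # left over after the fit
ssr  <- ssto - sse                          # explained by the fit

c(R2 = ssr / ssto, check = summary(fit)$r.squared)  # both about 0.633
```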
]

---

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
]
.right-column[
#### Calculating Coefficient of Determination (\\(R^2\\))

**Method c**. A special case when the relationship is linear

If the relationship we fit between \\(y\\) and \\(x\\) is linear, then we can use the sample correlation, \\(r\\), to get:

\\[ R^2 = (r)^2 \\]

**NOTE**: Please, please, please, understand that this is only true for linear relationships.
]

---

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
]
.right-column[
#### Calculating Coefficient of Determination (\\(R^2\\))

**Example: Stress on Bars**

| | | | | | | | | | | |
|--------------------------------------|------|------|------|------|------|------|------|------|------|------|
| **stress** \\((\text{kg/mm}^2)\\) | 2.5 | 5.0 | 10.0 | 15.0 | 17.5 | 20.0 | 25.0 | 30.0 | 35.0 | 40.0 |
| **lifetime** (hours) | 63 | 58 | 55 | 61 | 62 | 37 | 38 | 45 | 46 | 19 |

Earlier, we found \\(r = -0.7953\\). Since we are describing the relationship using a line, then we can use the special case:

\\[ R^2 = (r)^2 = (-0.7953)^2 = 0.633 \\]

>In other words, 63.3% of the variability in the lifetime of the bars can be explained by the linear relationship between the stress the bars were placed under and the lifetime.
]

---

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
]
.right-column[
### Precautions

Some precautions about Simple Linear Regression (SLR):

- `\(r\)` only measures linear relationships
- `\(R^2\)` and `\(r\)` can be drastically affected by a few unusual data points.

### Using a computer

You can use JMP (or `R`) to fit a linear model. See BlackBoard for videos on fitting a model using JMP.
]

---
layout: true
class: center, middle, inverse

---

## Section 4.2
### Fitting Curves and Surfaces by Least Squares
### Multiple Linear Regression

---
layout:false

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
]
.right-column[
## Linear Relationships

- The idea of simple linear regression can be generalized to produce a powerful engineering tool: **Multiple Linear Regression** (MLR).
- SLR is associated with **line fitting**
- MLR is associated with **curve fitting and surface fitting**
- What we mean by a multiple **linear** relationship is that the relation between the variables and the response is linear **in the parameters**.
- **Multiple linear regression in general:** when there is more than one experimental variable in the experiment `$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2+\cdots+ \beta_k x_k$$`
- **polynomial equation of order k:** `$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3+\cdots+ \beta_k x^k$$`
]

---
layout:false

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
]
.right-column[
## Non-Linear Relationships

- And there are also **non-linear relationships**, where the relationship between the variables and the response is non-linear **in the parameters**, for example:
`\begin{align} y &= \beta_0 + e^{\beta_1 x} \end{align}`

$$ y = \frac{\beta_0}{\beta_1 + \beta_2 x} $$
]

---
layout:false

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
]
.right-column[
## An issue

- The point is that fitting curves and surfaces by the least squares method requires a fair amount of matrix algebra and is difficult to do by hand.
- We need software to fit surfaces and curves.
]

---
name: inverse
layout: true
class: center, middle, inverse

---

### Example

---
layout:false

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
]
.right-column[
#### Example: Compressive Strength of Fly Ash Cylinders as a Function of Amount of Ammonium Phosphate Additive

| Ammonium Phosphate (%) | Compressive Strength (psi) | Ammonium Phosphate (%) | Compressive Strength (psi) |
|------------------------|---------------------------|------------------------|---------------------------|
| 0 | 1221 | 3 | 1609 |
| 0 | 1207 | 3 | 1627 |
| 0 | 1187 | 3 | 1642 |
| 1 | 1555 | 4 | 1451 |
| 1 | 1562 | 4 | 1472 |
| 1 | 1575 | 4 | 1465 |
| 2 | 1827 | 5 | 1321 |
| 2 | 1839 | 5 | 1289 |
| 2 | 1802 | 5 | 1292 |
]

---
layout:false

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
]
.right-column[
#### Example: Compressive Strength of Fly Ash Cylinders as a Function of Amount of Ammonium Phosphate Additive

<img src="ch4_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />
]

---
layout:false

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
]
.right-column[
#### Example: Compressive Strength of Fly Ash Cylinders as a Function of Amount of Ammonium Phosphate Additive

<img src="ch4_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />
]

---
layout:false

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
]
.right-column[
#### Example: Compressive Strength of Fly Ash Cylinders as a Function of Amount of Ammonium Phosphate Additive

<img src="ch4_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />
]

---
name: inverse
layout: true
class: center, middle, inverse

---

### One More Example in Fitting Surfaces and Curves

---
layout:false

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
]
.right-column[
#### Example: Hardness of Alloy

A group of researchers is studying influences on the hardness of a metal alloy. The researchers varied the percent copper and tempering temperature, measuring the hardness on the Rockwell scale.

The goal is to describe a relationship between our response, Hardness, and our two experimental variables, the percent copper (\\(x_1\\)) and tempering temperature (\\(x_2\\)).
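The slides fit this model in JMP; as a parallel sketch, the same fit in R with `lm()`, entering the data from the next slide directly (the variable names `copper`, `temp`, `hardness` are my own labels, not from the JMP file):

```r
# Multiple linear regression: hardness on percent copper and temperature
copper   <- rep(c(0.02, 0.10, 0.18), each = 4)
temp     <- rep(c(1000, 1100, 1200, 1300), times = 3)
hardness <- c(78.9, 65.1, 55.2, 56.4,
              80.9, 69.7, 57.4, 55.4,
              85.3, 71.8, 60.7, 58.9)

fit <- lm(hardness ~ copper + temp)
coef(fit)                # about (161.34, 32.97, -0.0855), matching JMP
summary(fit)$r.squared   # about 0.899
```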
]

---
layout:false

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
]
.right-column[
#### Example: Hardness of Alloy

| Percent Copper | Temperature | Hardness |
|----------------|-------------|----------|
| 0.02 | 1000 | 78.9 |
| | 1100 | 65.1 |
| | 1200 | 55.2 |
| | 1300 | 56.4 |
| 0.10 | 1000 | 80.9 |
| | 1100 | 69.7 |
| | 1200 | 57.4 |
| | 1300 | 55.4 |
| 0.18 | 1000 | 85.3 |
| | 1100 | 71.8 |
| | 1200 | 60.7 |
| | 1300 | 58.9 |
]

---
layout:false

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
]
.right-column[
#### Example: Hardness of Alloy

**Theoretical Relationship**: We start by writing down a theoretical relationship. With one experimental variable, we may start with a line. Extending that idea for two variables, we start with a plane:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 $$

**Observed Relationship**: In our data, the true relationship will be shrouded in error.

`\begin{align} y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \text{errors} \\\\ &= [\ \ \ \ \ \ \ \ \text{signal}\ \ \ \ \ \ \ ] + [\text{noise}] \end{align}`
]

---
layout:false

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
]
.right-column[
#### Example: Hardness of Alloy

**Fitted Relationship**: If we are right about our theoretical relationship, though, and the signal-to-noise ratio is large, we might be able to estimate the relationship:

$$ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 $$
]

---
layout:false

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
]
.right-column[
#### Example: Hardness of Alloy

Enter the data in JMP

<img src="figure/alloy-data-in-jmp.png" alt="alloy data in JMP" width="500">
]

---
layout:false

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
]
.right-column[
#### Example: Hardness of Alloy

In JMP, go to `Analyze > Fit Model` to define the model you are fitting:

<img src="figure/alloy-fit-model.png" alt="JMP Fit Model dialog" width="500">
]

---
layout:false

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
]
.right-column[
#### Example: Hardness of Alloy

After clicking `Run` we get the following model fit results:

<img src="figure/alloy-fit-results.png" alt="JMP model fit results" width="400">
]

---
layout:false

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
]
.right-column[
#### Example: Hardness of Alloy

From this output, we can get the value of `\(R^2\)`, the coefficient of determination:

<img src="figure/alloy-fit-results-detail-001.png" alt="JMP R-squared detail" width="400">

Since `\(R^2 = 0.899073\)`, we can say

>89.9% of the variability in the hardness we observed can be explained by its relationship with temperature and percent copper.
]

---
layout:false

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
]
.right-column[
#### Example: Hardness of Alloy

From this output, we can get the sums of squares.

<img src="figure/alloy-fit-results-detail-002.png" alt="JMP analysis of variance table" width="400">

This "Analysis of Variance" table has the same format across almost all textbooks, journals, software, etc. In our notation,

- `\(SSR = 1152.1888\)`
- `\(SSE = 129.3404\)`
- `\(SSTO = 1281.5292\)`

We can use these for lots of purposes. In this class, we have seen that we can get `\(R^2\)`:

$$ R^2 = 1 - \frac{SSE}{SSTO} = 1 - \frac{129.3404}{1281.5292} = 0.8990734 $$
]

---
layout:false

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
]
.right-column[
#### Example: Hardness of Alloy

The parameter estimates give us the fitted values used in our model:

<img src="figure/alloy-fit-results-detail-003.png" alt="JMP parameter estimates" width="400">

Since we defined percent copper as `\(x_1\)` earlier and temperature as `\(x_2\)`, we can write:

`$$\hat{y} = 161.33646 + 32.96875 \cdot x_1 - 0.0855 \cdot x_2$$`

We can use this to get fitted values. If we use a temperature of 1000 degrees and percent copper of 0.10, then we would predict a hardness of

`\begin{align} \hat{y} &= 161.33646 + 32.96875 \cdot (0.10) - 0.0855 \cdot (1000) \\\\ &= 161.33646 + 3.296875 - 85.5 \\\\ &= 79.13333 \end{align}`
]

---
layout:false

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
]
.right-column[
#### Example: Hardness of Alloy

While our model looks pretty good, we still need to check a few things involving residuals. We can save our residuals from the model fit drop-down menu and analyze them. From `Analyze > Distribution`:

<img src="figure/alloy-residuals.png" alt="saved residuals in JMP" width="400">
]

---
layout:false

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
]
.right-column[
#### Example: Hardness of Alloy

There aren't many residuals here (just 12), but we would like to make sure that the histogram has a rough bell shape (normal residuals are good). I would call this one inconclusive.

<img src="figure/alloy-residuals-plot-001.png" alt="histogram of residuals" width="300" height="350">
]

---
layout:false

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
]
.right-column[
#### Example: Hardness of Alloy

Another way to check if the residuals are approximately normal is to compare the quantiles of our residuals to the theoretical quantiles of a normal distribution.
From the dropdown menu, choose Normal Quantile Plot to get:

<img src="figure/alloy-residuals-plot-002.png" alt="normal quantile plot of residuals" width="400">
]

---
layout:false

.left-column[
#### Describing Relationships
#### Idea
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
]
.right-column[
#### Example: Hardness of Alloy

<img src="figure/alloy-residuals-plot-002.png" alt="normal quantile plot of residuals" width="400">

- If the points all fall on the line, then the residuals have the same spread as the normal distribution (i.e., the residuals follow a bell shape, which is what we want).
- If they stay within the curves, then we can say the residuals follow a rough bell shape (which is good).
- If points fall outside the curves, our model has problems (which is bad).
]

---
layout: true
class: center, middle, inverse

---

## Transformations

---
layout:false

.left-column[
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
#### Transformation
]
.right-column[
#### Transformations: Fitting complicated relationships

Consider the simulated dataset 'transform.csv' in the lecture module. Here's the scatterplot:

<center>
<img src="figure/transform-plot.png" alt="scatterplot of the transform.csv data" width="400">
</center>
]

---
layout:false

.left-column[
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
#### Transformation
]
.right-column[
#### Transformations: Fitting complicated relationships

Consider the residual plot you would get by trying to fit a line. What would that look like?

Now consider the residual plot you would get by trying to fit a quadratic. What would that look like?

What can we do about the size of the residuals? We need a function that can both adjust the scale of our responses and account for the curve!
]

---
layout:false

.left-column[
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
#### Transformation
]
.right-column[
#### Transformations: Fitting complicated relationships

One possible function that could do that: `\(\ln(x)\)`.

<center>
<img src="figure/transform-plot2.png" alt="scatterplot with log-transformed fit" width="300">
</center>

Transforming our variables can allow us to get better fits, but you need to be careful about the meaning of the relationship. For instance, the slope now means "the change in the response when *the natural log of x* is increased by 1" - and the relationship to `\(x\)` itself is not always easy to translate back.
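As a sketch of fitting such a transformed relationship in R (assuming 'transform.csv' has columns named `x` and `y`; those column names are a guess, not taken from the file):

```r
# Fit a response to the natural log of x instead of x itself
dat <- read.csv("transform.csv")   # assumes columns x and y

fit_line <- lm(y ~ x, data = dat)        # straight line in x
fit_log  <- lm(y ~ log(x), data = dat)   # line in ln(x)

# Compare residual plots: the transformed fit should look more patternless
plot(dat$x, resid(fit_line), main = "Residuals: linear in x")
abline(h = 0, lty = 2)
plot(dat$x, resid(fit_log), main = "Residuals: linear in ln(x)")
abline(h = 0, lty = 2)
```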
]

---
layout: true
class: center, middle, inverse

---

## Dangers in Fits

---
layout:false

.left-column[
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
#### Transformation
#### Dangers in Fits
#### Overfitting
]
.right-column[
#### Dangers in Fitting Relationships

**Example**: Stress and Lifetime of Bars

Consider the bars example again

| | | | | | | | | | | |
|--------------------------------------|------|------|------|------|------|------|------|------|------|------|
| **stress** \\((\text{kg/mm}^2)\\) | 2.5 | 5.0 | 10.0 | 15.0 | 17.5 | 20.0 | 25.0 | 30.0 | 35.0 | 40.0 |
| **lifetime** (hours) | 63 | 58 | 55 | 61 | 62 | 37 | 38 | 45 | 46 | 19 |

Here's the linear fit:

<img src="figure/barlife-linear-plot.png" alt="linear fit to the bars data" width="300" height="200">
]

---
layout:false

.left-column[
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
#### Transformation
#### Dangers in Fits
#### Overfitting
]
.right-column[
#### Dangers in Fitting Relationships

**Example**: Stress and Lifetime of Bars

<img src="figure/barlife-linear-plot.png" alt="linear fit to the bars data" width="300">

The fitted line doesn't touch all the points, but we can push our relationship further by adding `\((stress)^2\)`, `\((stress)^3\)`, `\((stress)^4\)`, and so on. Every time we add a new term to the polynomial, we give the fitted relationship the ability to make one more turn.

This leads to a problem called **overfitting**: our model is just following *the data*, including the errors, instead of uncovering *the true relationship*.
]

---
layout:false

.left-column[
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
#### Transformation
#### Overfitting
#### Multicollinearity
]
.right-column[
#### Multicollinearity

Multicollinearity occurs when you have strongly correlated experimental variables.

<center>
<img src="figure/multicollinearity.png" alt="illustration of correlated experimental variables" width="400">
</center>
]

---
layout: false

.left-column[
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
#### Transformation
#### Overfitting
#### Multicollinearity
]
.right-column[
#### Multicollinearity

Multicollinearity can lead to several problems:

- Since the variables are all related to each other, the impact each variable has on the response becomes difficult to determine
- Since disentangling the relationships is difficult, the estimates of the slopes for each variable become very sensitive (different samples lead to very different estimates)
- Since the correlated experimental variables will have similar relationships to the response, most of them are not needed. Including them leads to an overfit.

Ultimately, while it may look like a good fit on paper, the model can be unreliable.
]

---

.left-column[
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
#### Transformation
#### Overfitting
#### Multicollinearity
#### Wrapup
]
.right-column[
#### Finding the Best Fit

- Again, we can use the **Least Squares** principle to find the best estimates, \\(b_0\\), \\(b_1\\), and \\(b_2\\).
- The calculations are fairly advanced now that we have three values to estimate, so they are usually done in statistical software (like JMP).
]

---

.left-column[
#### Fitting Lines
#### Best Estimate
#### Good Fit
#### Correlation
#### Residuals
#### Assessment
#### `\(R^2\)`
#### Fitting Curves
#### MLR
#### Transformation
#### Overfitting
#### Multicollinearity
#### Wrapup
]
.right-column[
#### Judging The Fit

- Not all theoretical relationships we might imagine are real!
- Perhaps a better relationship could be found using \\[ y = \beta\_0 + \beta\_1 x\_1 + \beta\_2 \ln(x\_2) \\] (see the sketch after this list).
- We determine which relationships to try by examining plots of the data, fit statistics (like \\(R^2\\)), and plots of residuals.
- Be careful of overfitting and multicollinearity (when the experimental variables are correlated).
]
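As a sketch of trying such an alternative relationship in R, using the alloy data from earlier (whether the log model is actually better here is left as a question to investigate, not a claim):

```r
# Compare two candidate relationships for the alloy hardness data
copper   <- rep(c(0.02, 0.10, 0.18), each = 4)
temp     <- rep(c(1000, 1100, 1200, 1300), times = 3)
hardness <- c(78.9, 65.1, 55.2, 56.4,
              80.9, 69.7, 57.4, 55.4,
              85.3, 71.8, 60.7, 58.9)

fit_plane <- lm(hardness ~ copper + temp)       # y = b0 + b1 x1 + b2 x2
fit_log   <- lm(hardness ~ copper + log(temp))  # y = b0 + b1 x1 + b2 ln(x2)

c(plane = summary(fit_plane)$r.squared,
  log   = summary(fit_log)$r.squared)
# ...then compare residual plots before choosing between the two fits
```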