For each of the following variables, decide whether the data is categorical or quantitative:

Daily low temperature in New York

Brand of cereal in supermarket

Telephone number

License plates of cars

Weight lost in a weight loss program

Time spent on studying for the class during last week

0.5 , 1 , 1.5 , 1.9 , 2.1 , 2.2 , 2.7 , 2.8 , 3.3 , 3.6 , 3.9 , 3.9 , 3.9 , 4 , 4 , 4.3 , 4.3 , 4.5 , 4.5 , 5 , 5 , 5.1 , 5.1 , 5.5 , 5.6 , 5.9 , 6.2 , 6.3 , 7.1 , 27.1

Find the mean, median, range and standard deviation.

Find the 20^{th} and the 64^{th} percentile of this data set.

Draw the boxplot for this data set.

Using the data from problem 2, find the z score of x = 3.6. What x would have a z score = 1?

Consider the data set for Friday the 13^{th}. This data set has several comparisons of a Friday the 13th and the previous Friday the 6th, for example the number of cars passing through a junction (traffic), shoppers for a supermarket (shopping), or admissions due to transport accidents (accident)

`head(friday13)`

```
## Dataset six thirteen
## 1 traffic 139246 138548
## 2 traffic 134012 132908
## 3 traffic 137055 136018
## 4 traffic 133732 131843
## 5 traffic 123552 121641
## 6 traffic 121139 118723
```

Use R to compute the mean and the standard deviation for the two Fridays and the three data sets separately (so there will be the mean and st. dev. for the number of accidents on Friday the 6^{th}, the mean and st. dev. for the number of accidents on Friday the 13^{th}, the mean and st. dev. for the number of shoppers on Friday the 6^{th} and so on).

Do any of these numbers support the idea that Friday the 13^{th} is special?

Consider the data in AIDS in Americas in 1995.

`head(aids)`

```
## Country AIDS
## 1 Anguilla 0.0
## 2 Antigua Barbuda 7.3
## 3 Argentina 5.6
## 4 Bahamas 131.4
## 5 Barbados 44.1
## 6 Belize 4.5
```

Find the 20^{th} and the 80^{th} percentiles of the AIDS rates.

Find the 5 number summary for the AIDS rates.

According to the boxplot, which countries are outliers in this data set?

Let’s say the WHO wants to use the “average” rate of AIDS infection (together with the number of people living in the Americas) to estimate the number of AIDS infected people in the Americas. Should they use the mean or the median to find the “average”?

In this exercise we study the data set

`head(headache)`

```
## Time Dose Sex BP.Quan
## 1 35 2 0 0.25
## 2 43 2 0 0.50
## 3 55 2 0 0.75
## 4 47 2 1 0.25
## 5 43 2 1 0.50
## 6 57 2 1 0.75
```

What is the type of data of the variables?

Find the mean and standard deviation of Time.

Find the 5-number summary and draw the boxplot of Time

Company XYZ has a contract with a supplier for metal rods. The contract says that all the rods have to be between 15.26cm and 15.47cm long. XYZ just received a shipment of 50000 rods. They randomly select 100 of them and measure the length of each. They find \(\overline{X}=15.344\) and \(s=0.041\). Should they accept this shipment?

Consider the following data set:

```
x <- 1:10
y <- c(3, 0, 0, 10, 8, 4, 5, 8, 14, 9)
kable(data.frame(x=x,y=y))
```

x | y |
---|---|
1 | 3 |
2 | 0 |
3 | 0 |
4 | 10 |
5 | 8 |
6 | 4 |
7 | 5 |
8 | 8 |
9 | 14 |
10 | 9 |

Now we find

`round(cor(x, y), 3)`

`## [1] 0.704`

Find another observation (a, b) such that

`round(cor(c(x, a), c(y, b)), 3)`

`## [1] 0`

For each of the following variables, decide whether the data is categorical or quantitative:

Daily low temperature in New York - quantitative

Brand of cereal in supermarket - categorical

Telephone number - categorical

License plates of cars - categorical

Weight lost in a weight loss program - quantitative

Time spent on studying for the class during last week - quantitative

0.5 , 1 , 1.5 , 1.9 , 2.1 , 2.2 , 2.7 , 2.8 , 3.3 , 3.6 , 3.9 , 3.9 , 3.9 , 4 , 4 , 4.3 , 4.3 , 4.5 , 4.5 , 5 , 5 , 5.1 , 5.1 , 5.5 , 5.6 , 5.9 , 6.2 , 6.3 , 7.1 , 27.1

The data is comma delimited, so after copying it, type the following in R:

`x <- getx(sep=",")`

`mean(x)`

`## [1] 4.76`

`median(x)`

`## [1] 4.15`

`sd(x)`

`## [1] 4.51859`
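`getx` appears to be a course helper rather than a base R function, so as a minimal sketch the same summaries can be reproduced with base R alone by pasting the values into `c()`. The range, which the problem also asks for, is taken here to mean maximum minus minimum.

```r
# Base R stand-in for the course helper getx(): enter the values directly.
x <- c(0.5, 1, 1.5, 1.9, 2.1, 2.2, 2.7, 2.8, 3.3, 3.6,
       3.9, 3.9, 3.9, 4, 4, 4.3, 4.3, 4.5, 4.5, 5,
       5, 5.1, 5.1, 5.5, 5.6, 5.9, 6.2, 6.3, 7.1, 27.1)

mean(x)          # sample mean
median(x)        # sample median
diff(range(x))   # range as a single number: maximum minus minimum
sd(x)            # sample standard deviation (divisor n - 1)
```

The single large value 27.1 pulls the mean above the median and accounts for most of the range.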

- Find the 20^{th} and the 64^{th} percentile of this data set.

`quantile(x, c(0.2, 0.64))`

```
## 20% 64%
## 2.60 4.78
```

- Draw the boxplot for this data set.

`bplot(x)`

Using the data from problem 2, find the z score of x = 3.6. What x would have a z score = 1?

We found \(\overline{X}=4.76\) and \(s= 4.519\), so the z score of x=3.6 is

\(z = (x-\overline{X})/s = (3.6-4.76)/4.519 = -0.2567\)

We want z score =1, so

\(1 = (x-\overline{X})/s\)

or

\(s= x-\overline{X}\)

or

\(x = s+\overline{X} = 4.519+4.76 = 9.279\)
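The same arithmetic can be checked in R. This is a small sketch using the rounded mean and standard deviation quoted above; the helper names `z_score` and `from_z` are made up for illustration.

```r
xbar <- 4.76    # sample mean found above
s    <- 4.519   # sample standard deviation found above (rounded)

# Convert an observation to a z score, and a z score back to an observation.
z_score <- function(x) (x - xbar) / s
from_z  <- function(z) xbar + z * s

round(z_score(3.6), 4)   # -0.2567
from_z(1)                # 9.279
```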

Consider the data set for Friday the 13^{th}. This data set has several comparisons of a Friday the 13th and the previous Friday the 6th, for example the number of cars passing through a junction (traffic), shoppers for a supermarket (shopping), or admissions due to transport accidents (accident)

Use R to compute the mean and the standard deviation for the two Fridays and the three data sets separately (so there will be the mean and st. dev. for the number of accidents on Friday the 6^{th}, the mean and st. dev. for the number of accidents on Friday the 13^{th}, the mean and st. dev. for the number of shoppers on Friday the 6^{th} and so on). Do any of these numbers support the idea that Friday the 13^{th} is special?

```
attach(friday13)
mean(friday13[Dataset=="accident",3])
```

`## [1] 10.83333`

`mean(friday13[Dataset == "traffic", 2])`

`## [1] 128385.3`

`sd(friday13[Dataset == "traffic", 2])`

`## [1] 7259.223`

`mean(friday13[Dataset == "traffic", 3])`

`## [1] 126549.5`

`sd(friday13[Dataset == "traffic", 3])`

`## [1] 7664.282`

`mean(friday13[Dataset == "shopping", 2])`

`## [1] 4970.511`

`sd(friday13[Dataset == "shopping", 2])`

`## [1] 1165.615`

`mean(friday13[Dataset == "accident", 2])`

`## [1] 7.5`

`sd(friday13[Dataset == "accident", 2])`

`## [1] 3.331666`

`mean(friday13[Dataset == "accident", 3])`

`## [1] 10.83333`

`sd(friday13[Dataset == "accident", 3])`

`## [1] 3.600926`

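For traffic and accidents the difference between the two means is at most about one standard deviation, so there does not appear to be anything special about Friday the 13^{th}.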
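The repeated `mean` and `sd` calls can also be collapsed into one grouped summary. The sketch below uses `aggregate` on a toy data frame built only from the six traffic rows printed by `head(friday13)`, so its numbers will not match the full-data results above; the name `toy` is an assumption made for illustration.

```r
# Toy stand-in for friday13: only the six traffic rows shown by head(friday13).
toy <- data.frame(
  Dataset  = rep("traffic", 6),
  six      = c(139246, 134012, 137055, 133732, 123552, 121139),
  thirteen = c(138548, 132908, 136018, 131843, 121641, 118723)
)

# One row per value of Dataset, one column per Friday.
means <- aggregate(cbind(six, thirteen) ~ Dataset, data = toy, FUN = mean)
sds   <- aggregate(cbind(six, thirteen) ~ Dataset, data = toy, FUN = sd)
means
sds
```

Applied to the full `friday13` data frame in place of `toy`, the same two calls would return one row each for accident, shopping and traffic.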

Consider the data in AIDS in Americas in 1995.

- Find the 20^{th} and the 80^{th} percentiles of the AIDS rates.

```
attach(aids)
quantile(AIDS, c(0.2,0.8))
```

```
## 20% 80%
## 0.96 13.82
```

- Find the 5 number summary for the AIDS rates.

`fivenumber(AIDS)`

```
## Minimum Q1 Median Q3 Maximum
## 0 1.5 5.6 10.9 131.4
## IQR = 9.4
```

- According to the boxplot, which countries are outliers in this data set?

`bplot(AIDS)`

The 5 countries with rates over 30 are outliers:

`aids[AIDS > 30,]`

```
## Country AIDS
## 4 Bahamas 131.4
## 5 Barbados 44.1
## 7 Bermuda 77.2
## 21 French Guiana 62.5
## 23 Guadaloupe 31.9
```
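A boxplot usually flags points beyond 1.5 × IQR from the quartiles. Whether the course function `bplot` uses exactly this rule is an assumption, but under it the cutoffs follow directly from the five-number summary above:

```r
q1  <- 1.5    # first quartile from the five-number summary
q3  <- 10.9   # third quartile from the five-number summary
iqr <- q3 - q1

# Conventional boxplot fences: points outside them are drawn as outliers.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
c(lower = lower_fence, upper = upper_fence)
```

Under this rule the upper fence is 25, and all five countries listed above have rates well beyond it.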

- Let’s say the WHO wants to use the “average” rate of AIDS infection (together with the number of people living in the Americas) to estimate the number of AIDS infected people in the Americas.

Should they use the mean or the median to find the “average”?

Mean, because the countries with the highest AIDS rates have to influence our “average”, and they don’t if we use the median.

In this exercise we study the data set **headache**

- What is the type of data of the variables?

Time: quantitative

Dose: quantitative

Sex: categorical

BP.Quan: categorical

- Find the mean and standard deviation of Time.

```
attach(headache)
mean(Time)
```

`## [1] 26.33333`

`sd(Time)`

`## [1] 14.56818`

- Find the 5-number summary and draw the boxplot of Time

`fivenumber(Time)`

```
## Minimum Q1 Median Q3 Maximum
## 3 17.8 26 30.5 57
## IQR = 12.7
```

`bplot(y=Time)`

Company XYZ has a contract with a supplier for metal rods. The contract says that all the rods have to be between 15.26cm and 15.47cm long. XYZ just received a shipment of 50000 rods. They randomly select 100 of them and measure the length of each. They find \(\overline{X}=15.344\) and \(s=0.041\). Should they accept this shipment?

We can use the empirical rule to decide. This requires that the lengths of the rods have a bell-shaped histogram, which of course should be checked. Then

\(\overline{X} \pm 2s = 15.344 \pm 2 \times 0.041 = (15.262, 15.426)\)

The interval is supposed to include 95% of the observations (or lengths of the rods), so we can conclude that 95% of the rods have a length of at least 15.262cm and at most 15.426cm, which is in accordance with the contract. So XYZ should accept the shipment.
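The interval and the comparison with the contract limits can be reproduced in R. This is a sketch of the empirical-rule calculation only; the variable names are chosen for illustration, and it does not check the bell-shape assumption.

```r
xbar <- 15.344        # sample mean length (cm)
s    <- 0.041         # sample standard deviation (cm)
lower_spec <- 15.26   # contract lower limit (cm)
upper_spec <- 15.47   # contract upper limit (cm)

# Empirical rule: about 95% of a bell-shaped data set lies within 2 s of the mean.
interval <- xbar + c(-2, 2) * s
interval   # 15.262 15.426

# TRUE if the whole 95% interval lies inside the contract limits.
within_spec <- interval[1] >= lower_spec && interval[2] <= upper_spec
within_spec
```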

Consider the following data set:

```
x <- 1:10
y <- c(3, 0, 0, 10, 8, 4, 5, 8, 14, 9)
kable(data.frame(x=x,y=y))
```

x | y |
---|---|
1 | 3 |
2 | 0 |
3 | 0 |
4 | 10 |
5 | 8 |
6 | 4 |
7 | 5 |
8 | 8 |
9 | 14 |
10 | 9 |

Now we find

`round(cor(x,y), 3)`

`## [1] 0.704`

Find another observation (a, b) such that

`round(cor( c(x,a), c(y,b)), 3)`

There are many solutions; here is one of them:

`round(cor(c(x, 19.9), c(y, -0.5)), 3)`

`## [1] 0`
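One way to find such a point without trial and error: the correlation is zero exactly when the sample covariance of the extended data is zero, and for a fixed a that condition is linear in b. The sketch below solves it; `solve_b` is a hypothetical helper written for this illustration, not part of the course code.

```r
x <- 1:10
y <- c(3, 0, 0, 10, 8, 4, 5, 8, 14, 9)

# For a chosen a, return the b that makes the sample covariance of the
# extended data (and therefore the correlation) equal to zero.
solve_b <- function(a, x, y) {
  n1 <- length(x) + 1   # sample size after adding (a, b)
  sx <- sum(x) + a      # sum of the extended x values
  # Set sum(x * y) + a * b - sx * (sum(y) + b) / n1 = 0 and solve for b.
  (sx * sum(y) / n1 - sum(x * y)) / (a - sx / n1)
}

a <- 19.9
b <- solve_b(a, x, y)
round(b, 3)                       # about -0.508, close to the -0.5 used above
round(cor(c(x, a), c(y, b)), 3)
```

The formula breaks down when a equals `mean(x)`, because the denominator is then zero; any other a gives a valid b.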