For each of the following variables, decide whether the data is categorical or quantitative:
Daily low temperature in New York
Brand of cereal in supermarket
Telephone number
License plates of cars
Weight lost in a weight loss program
Time spent on studying for the class during last week
0.5 , 1 , 1.5 , 1.9 , 2.1 , 2.2 , 2.7 , 2.8 , 3.3 , 3.6 , 3.9 , 3.9 , 3.9 , 4 , 4 , 4.3 , 4.3 , 4.5 , 4.5 , 5 , 5 , 5.1 , 5.1 , 5.5 , 5.6 , 5.9 , 6.2 , 6.3 , 7.1 , 27.1
Find the mean, median, range and standard deviation.
Find the 20th and the 64th percentile of this data set.
Draw the boxplot for this data set.
Using the data from problem 2, find the z score of x = 3.6. What x would have a z score = 1?
Consider the data set for Friday the 13th. This data set has several comparisons of a Friday the 13th and the previous Friday the 6th, for example the number of cars passing through a junction (traffic), shoppers for a supermarket (shopping), or admissions due to transport accidents (accident)
head(friday13)
## Dataset six thirteen
## 1 traffic 139246 138548
## 2 traffic 134012 132908
## 3 traffic 137055 136018
## 4 traffic 133732 131843
## 5 traffic 123552 121641
## 6 traffic 121139 118723
Use R to compute the mean and the standard deviation for the two Fridays and the three data sets separately (so there will be the mean and st. dev. for the number of accidents on Friday the 6th, the mean and st. dev. for the number of accidents on Friday the 13th, the mean and st. dev. for the number of shoppers on Friday the 6th and so on).
Do any of these numbers support the idea that Friday the 13th is special?
Consider the data on AIDS in the Americas in 1995.
head(aids)
## Country AIDS
## 1 Anguilla 0.0
## 2 Antigua Barbuda 7.3
## 3 Argentina 5.6
## 4 Bahamas 131.4
## 5 Barbados 44.1
## 6 Belize 4.5
Find the 20th and the 80th percentiles of the AIDS rates.
Find the 5-number summary for the AIDS rates.
According to the boxplot, which countries are outliers in this data set?
Let’s say the WHO wants to use the “average” rate of AIDS infection (together with the number of people living in the Americas) to estimate the number of AIDS infected people in the Americas. Should they use the mean or the median to find the “average”?
In this exercise we study the data set
head(headache)
## Time Dose Sex BP.Quan
## 1 35 2 0 0.25
## 2 43 2 0 0.50
## 3 55 2 0 0.75
## 4 47 2 1 0.25
## 5 43 2 1 0.50
## 6 57 2 1 0.75
What is the type of data of the variables?
Find the mean and standard deviation of Time.
Find the 5-number summary and draw the boxplot of Time.
Company XYZ has a contract with a supplier for metal rods. The contract says that all the rods have to be between 15.26cm and 15.47cm long. XYZ just received a shipment of 50000 rods. They randomly select 100 of them and measure the length of each. They find \(\overline{X}=15.344\) and \(s=0.041\). Should they accept this shipment?
Consider the following data set:
x <- 1:10
y <- c(3, 0, 0, 10, 8, 4, 5, 8, 14, 9)
kable(data.frame(x=x,y=y))
x | y |
---|---|
1 | 3 |
2 | 0 |
3 | 0 |
4 | 10 |
5 | 8 |
6 | 4 |
7 | 5 |
8 | 8 |
9 | 14 |
10 | 9 |
Now we find
round(cor(x, y), 3)
## [1] 0.704
Find another observation (a, b) such that
round(cor(c(x, a), c(y, b)), 3)
## [1] 0
For each of the following variables, decide whether the data is categorical or quantitative:
Daily low temperature in New York - quantitative
Brand of cereal in supermarket - categorical
Telephone number - categorical
License plates of cars - categorical
Weight lost in a weight loss program - quantitative
Time spent on studying for the class during last week - quantitative
0.5 , 1 , 1.5 , 1.9 , 2.1 , 2.2 , 2.7 , 2.8 , 3.3 , 3.6 , 3.9 , 3.9 , 3.9 , 4 , 4 , 4.3 , 4.3 , 4.5 , 4.5 , 5 , 5 , 5.1 , 5.1 , 5.5 , 5.6 , 5.9 , 6.2 , 6.3 , 7.1 , 27.1
The data is comma-delimited, so after copying it, type the following in R:
x <- getx(sep=",")
mean(x)
## [1] 4.76
median(x)
## [1] 4.15
sd(x)
## [1] 4.51859
quantile(x, c(0.2, 0.64))
## 20% 64%
## 2.60 4.78
bplot(x)
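The problem also asks for the range, which the output above does not show. As a minimal sketch using base R only (assuming the data is typed in by hand instead of being read with the course helper `getx`), the summary statistics including the range could be computed like this:

```r
# Data from problem 2, entered by hand (base R only; no course helpers assumed)
x <- c(0.5, 1, 1.5, 1.9, 2.1, 2.2, 2.7, 2.8, 3.3, 3.6,
       3.9, 3.9, 3.9, 4, 4, 4.3, 4.3, 4.5, 4.5, 5,
       5, 5.1, 5.1, 5.5, 5.6, 5.9, 6.2, 6.3, 7.1, 27.1)

mean(x)          # arithmetic mean
median(x)        # middle value of the sorted data
sd(x)            # sample standard deviation
range(x)         # smallest and largest observation
diff(range(x))   # range as a single number: largest minus smallest
```

For this data the range is 27.1 - 0.5 = 26.6.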
Using the data from problem 2, find the z score of x = 3.6. What x would have a z score = 1?
We found \(\overline{X}=4.76\) and \(s= 4.519\), so the z score of x=3.6 is
\(z = (x-\overline{X})/s = (3.6-4.76)/4.519 = -0.2567\)
We want z score =1, so
\(1 = (x-\overline{X})/s\)
or
\(s= x-\overline{X}\)
or
\(x = s+\overline{X} = 4.519+4.76 = 9.279\)
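The same arithmetic can be checked in R. This is only a sketch that plugs in the rounded values of the mean and standard deviation quoted above; the variable names `xbar`, `s`, `z` and `x_at_z1` are chosen here for illustration.

```r
# Rounded summary statistics taken from the solution to problem 2
xbar <- 4.76
s    <- 4.519

# z score of the observation x = 3.6
z <- (3.6 - xbar) / s
round(z, 4)

# Observation whose z score is 1: solve 1 = (x - xbar) / s for x
x_at_z1 <- xbar + 1 * s
x_at_z1
```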
Consider the data set for Friday the 13th. This data set has several comparisons of a Friday the 13th and the previous Friday the 6th, for example the number of cars passing through a junction (traffic), shoppers for a supermarket (shopping), or admissions due to transport accidents (accident)
Use R to compute the mean and the standard deviation for the two Fridays and the three data sets separately (so there will be the mean and st. dev. for the number of accidents on Friday the 6th, the mean and st. dev. for the number of accidents on Friday the 13th, the mean and st. dev. for the number of shoppers on Friday the 6th and so on). Do any of these numbers support the idea that Friday the 13th is special?
attach(friday13)
mean(friday13[Dataset == "traffic", 2])
## [1] 128385.3
sd(friday13[Dataset == "traffic", 2])
## [1] 7259.223
mean(friday13[Dataset == "traffic", 3])
## [1] 126549.5
sd(friday13[Dataset == "traffic", 3])
## [1] 7664.282
mean(friday13[Dataset == "shopping", 2])
## [1] 4970.511
sd(friday13[Dataset == "shopping", 2])
## [1] 1165.615
mean(friday13[Dataset == "accident", 2])
## [1] 7.5
sd(friday13[Dataset == "accident", 2])
## [1] 3.331666
mean(friday13[Dataset == "accident", 3])
## [1] 10.83333
sd(friday13[Dataset == "accident", 3])
## [1] 3.600926
There does not appear to be anything special about Friday the 13th.
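The repeated `mean()` and `sd()` calls can be condensed into one call per statistic. The following is a sketch, not the method used in the solution: it uses base R's `aggregate` and, so that it runs on its own, only the six traffic rows printed by `head(friday13)`, stored in a hypothetical data frame `friday13_head`. Its numbers therefore differ from the full-data results above.

```r
# Only the six rows printed by head(friday13); the full data set has more rows
# and also contains the "shopping" and "accident" groups.
friday13_head <- data.frame(
  Dataset  = rep("traffic", 6),
  six      = c(139246, 134012, 137055, 133732, 123552, 121139),
  thirteen = c(138548, 132908, 136018, 131843, 121641, 118723)
)

# Mean and standard deviation of both Fridays, one row per value of Dataset
means <- aggregate(cbind(six, thirteen) ~ Dataset, data = friday13_head, FUN = mean)
sds   <- aggregate(cbind(six, thirteen) ~ Dataset, data = friday13_head, FUN = sd)
means
sds
```

Applied to the full `friday13` data frame, the same two calls should return one row for each of traffic, shopping and accident, covering both Fridays at once.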
Consider the data on AIDS in the Americas in 1995.
attach(aids)
quantile(AIDS, c(0.2,0.8))
## 20% 80%
## 0.96 13.82
fivenumber(AIDS)
## Minimum Q1 Median Q3 Maximum
## 0 1.5 5.6 10.9 131.4
## IQR = 9.4
bplot(AIDS)
The 5 countries with rates over 30 are outliers:
aids[AIDS > 30,]
## Country AIDS
## 4 Bahamas 131.4
## 5 Barbados 44.1
## 7 Bermuda 77.2
## 21 French Guiana 62.5
## 23 Guadaloupe 31.9
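The cutoff can be checked against the conventional boxplot rule, which flags observations more than 1.5 times the IQR beyond the quartiles. The sketch below assumes that `bplot` follows this rule (base R's `boxplot` does by default) and uses the quartiles reported by `fivenumber(AIDS)`; the names `lower_fence` and `upper_fence` are chosen here for illustration.

```r
# Quartiles as reported by fivenumber(AIDS) above
q1  <- 1.5
q3  <- 10.9
iqr <- q3 - q1            # 9.4

# Conventional 1.5 * IQR fences (assumed to be the rule bplot applies)
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
c(lower_fence, upper_fence)
```

The upper fence is 10.9 + 1.5 × 9.4 = 25, and all five countries listed above lie beyond it.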
Should they use the mean or the median to find the “average”?
Mean, because the countries with the highest AIDS rates have to influence our “average”, and they don’t if we use the median.
In this exercise we study the data set headache
Time: quantitative
Dose: quantitative
Sex: categorical
BP.Quan: categorical
attach(headache)
mean(Time)
## [1] 26.33333
sd(Time)
## [1] 14.56818
fivenumber(Time)
## Minimum Q1 Median Q3 Maximum
## 3 17.8 26 30.5 57
## IQR = 12.7
bplot(y=Time)
Company XYZ has a contract with a supplier for metal rods. The contract says that all the rods have to be between 15.26cm and 15.47cm long. XYZ just received a shipment of 50000 rods. They randomly select 100 of them and measure the length of each. They find \(\overline{X}=15.344\) and \(s=0.041\). Should they accept this shipment?
We can use the empirical rule to decide. This requires that the lengths of the rods have a bell-shaped histogram, which of course should be checked. Then
\(\overline{X} \pm 2s = 15.344 \pm 2 \times 0.041 = (15.262, 15.426)\)
The interval is supposed to include about 95% of the observations (the lengths of the rods), so we can conclude that about 95% of the rods have a length of at least 15.262cm and at most 15.426cm, which is in accordance with the contract. So XYZ should accept the shipment.
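The empirical-rule interval and its comparison with the contract limits can be reproduced in R. This is a minimal sketch; the names `interval`, `contract` and `inside` are chosen here for illustration, and the bell-shape assumption still has to be checked separately.

```r
xbar <- 15.344   # sample mean length (cm)
s    <- 0.041    # sample standard deviation (cm)

# Empirical rule: about 95% of a bell-shaped distribution lies within 2 s of the mean
interval <- c(lower = xbar - 2 * s, upper = xbar + 2 * s)
interval

# Contract limits (cm)
contract <- c(lower = 15.26, upper = 15.47)

# TRUE if the 2 s interval lies entirely inside the contract limits
inside <- interval[["lower"]] >= contract[["lower"]] &&
          interval[["upper"]] <= contract[["upper"]]
inside
```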
Consider the following data set:
x <- 1:10
y <- c(3, 0, 0, 10, 8, 4, 5, 8, 14, 9)
kable(data.frame(x=x,y=y))
x | y |
---|---|
1 | 3 |
2 | 0 |
3 | 0 |
4 | 10 |
5 | 8 |
6 | 4 |
7 | 5 |
8 | 8 |
9 | 14 |
10 | 9 |
Now we find
round(cor(x, y), 3)
## [1] 0.704
Find another observation (a, b) such that
round(cor( c(x,a), c(y,b)), 3)
There are many solutions, here is one of them:
round(cor(c(x, 19.9), c(y, -0.5)), 3)
## [1] 0
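One way to construct such a point, sketched here and not taken from the original solution: adding a point (a, b) to n observations changes the sum of cross products from \(S_{xy}\) to \(S_{xy} + \frac{n}{n+1}(a-\overline{x})(b-\overline{y})\). Choosing any a different from \(\overline{x}\) and setting this expression to zero gives b. The variable names below are chosen for illustration.

```r
x <- 1:10
y <- c(3, 0, 0, 10, 8, 4, 5, 8, 14, 9)
n <- length(x)

# Sum of cross products of the original data
sxy <- sum((x - mean(x)) * (y - mean(y)))

# Choose a freely (any value other than mean(x)), then solve for b
a <- 19.9
b <- mean(y) - (n + 1) / n * sxy / (a - mean(x))

b                        # close to the -0.5 used above
cor(c(x, a), c(y, b))    # zero up to floating-point error
```

Any other choice of a works the same way, which is why there are many solutions.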