Chapter 2: Some Basic Ideas and Concepts

Introduction

This page discusses some general concepts of Statistics.

Motivation for learning Statistics

Quote:

If I had only one hour to live, I would choose to live it in statistics class because it would seem to last forever

(A student’s complaint)

Why does Statistics appear to be so boring?

Consider the following questions:

  • Does aspirin lower the risk of heart attacks? (Medical research)
  • Are there fewer Manatees in Puerto Rico today than 10 years ago? (Biology)
  • Does a certain brand of gasoline really clean the engine? (Chemistry and Mechanical Engineering)
  • Do tax cuts work? (Economics and Government Policy)
  • Does anger management training work? (Psychology and Law)
  • Do frequent flier programs increase airline ticket sales? (Business)
  • Do the salaries of men and women differ? (Social Sciences and Law)

Amazingly enough, a person investigating any of these questions might well end up using the same statistical method to answer them! (It is called the one-sample t test and we will study it at some time during this semester)

The power, strength (and beauty) of statistics lies in its universal applicability!

What is Statistics?

Answer 1: Statistics is the Science of data (or information)

  • How to collect data
  • How to analyze data
  • How to present data

Answer 2: Statistics is the Science of Uncertainty

  • where does it come from
  • what types are there
  • how to deal with it

Why everybody should know a little bit about Statistics - Misuse of Statistics

Statistics can be used in many ways to make things appear to be something that they are not (lying!)

Another quote:

There are Lies, Damn Lies and Statistics

(maybe Benjamin Disraeli, probably not)

What can we do with Statistics?

Case Study: WRInc

WR Inc. is a large (fictitious) company. It recently did a survey of all its employees, asking them to fill out a questionnaire with questions regarding their gender, income etc.
In addition they randomly selected 500 employees and asked them some additional questions.

Let’s start by checking what’s in the data set:

head(wrinccensus)   
##   Id.Number Gender Income Job.Level Years Satisfaction
## 1     10001 Female  22800         1     6            4
## 2     10003 Female  18600         1     1            2
## 3     10004 Female  23900         1     4            2
## 4     10008   Male  37200         1    13            4
## 5     10010   Male  29800         1     9            1
## 6     10014   Male  53700         5    16            1

shows us the names of the variables and the first six rows of data.

Note: when something appears in a box like this it means you can type (or copy-paste) this into R and get the same answer.

dim(wrinccensus) 
## [1] 23791     6

shows us that there are 23791 observations, one for each employee, each with 6 pieces of information (variables), for a total of 23791*6 = 142746 values.

Trying to look at so much information is very difficult, so organizing it in some fashion is very useful.

Often just making a little table is a good idea:

attach(wrinccensus) 
table(Gender) 
## Gender
## Female   Male 
##   9510  14281

Sometimes it is better to consider percentages:

length(Gender)
## [1] 23791
table(Gender)/length(Gender)*100
## Gender
##  Female    Male 
## 39.9731 60.0269
round(table(Gender)/length(Gender)*100, 2) 
## Gender
## Female   Male 
##  39.97  60.03

or maybe even both:

Gender      N  Percentage
Female   9510      39.97%
Male    14281      60.03%
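The combined table can also be built in R. The following is a minimal sketch that starts from the two counts reported by table(Gender) above rather than from the full data set; the names counts, percentage and summary_table are made up for this illustration.

```r
# Counts copied from the table(Gender) output above
counts <- c(Female = 9510, Male = 14281)

# Percentages, rounded to two digits as in the text
percentage <- round(counts / sum(counts) * 100, 2)

# Put counts and percentages side by side
summary_table <- data.frame(
  Gender = names(counts),
  N = as.vector(counts),
  Percentage = paste0(format(percentage, nsmall = 2), "%")
)
print(summary_table, row.names = FALSE)
```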

Another good way to study a data set is via graphs. For example, it seems reasonable that there should be a connection between job level and income, after all usually people with a better job make more money. Is this true for our company? For this we can use

splot(Income, Job.Level)

As we will see soon, often we have different ways to do the same thing in Statistics. For example, instead of the scatterplot above we could draw something called a boxplot, which we will discuss at some point in this class. And in fact, this would be a very good idea because there is a serious problem with the scatterplot. Can you see what it is?



An important question would be whether there is job discrimination in this company, that is whether men are paid more than women. How can we find out? Let’s compute the average income of the men and the average income of the women. But before we do we need to understand that

  • the two will not be exactly the same!
  • so one will be higher than the other, just by random chance.

In fact even if there is no job discrimination there is a 50-50 chance that the average income of the men is (a little bit) higher than the average income of the women. Of course if there is job discrimination we would expect the average income of the men to be substantially higher than the average income of the women. What we need to find out is whether the men’s income is statistically significantly higher!
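The 50-50 claim can be checked with a small simulation. This is only a sketch with made-up numbers: both groups are drawn from the same normal distribution (so there is no discrimination by construction), and the group sizes, mean and standard deviation are assumptions chosen for illustration, not values from the WRInc data.

```r
set.seed(1)  # makes the simulation reproducible

n_sim <- 2000
men_higher <- replicate(n_sim, {
  # Same income distribution for both groups: no discrimination
  women <- rnorm(200, mean = 33000, sd = 9400)
  men   <- rnorm(300, mean = 33000, sd = 9400)
  mean(men) > mean(women)
})

# Proportion of simulated companies in which the men's average is higher
mean(men_higher)
```

The proportion should come out close to 0.5: even without any discrimination, the men's average is higher in about half of the simulated companies.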

Something is statistically significant if it cannot be explained by random chance alone.

Example: 4 heads in 4 flips of a fair coin has a probability of 1 in 16 or 6.25%, so this would not be considered unusual.

Example: 10 heads in 10 flips of a fair coin has a probability of 1 in 1024 or about 0.1%, so this would be considered very unusual. In fact one would now conclude that this coin is not a fair coin.
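Both coin probabilities are easy to verify in R. Each flip of a fair coin comes up heads with probability 0.5, and the flips are independent, so the probabilities multiply. The variable names p4 and p10 are made up for this check; dbinom is the binomial probability function in base R.

```r
p4  <- 0.5^4    # probability of 4 heads in 4 flips
p10 <- 0.5^10   # probability of 10 heads in 10 flips

p4        # 0.0625, that is 6.25%
1 / p4    # 16, so "1 in 16"
p10       # a little less than 0.1%
1 / p10   # 1024, so "1 in 1024"

# The same number from the binomial distribution
dbinom(10, size = 10, prob = 0.5)
```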

Note What is and what is not statistically significant is a question of probability.

back to WRInc:

stat.table(Income , Gender, ndigit=-1)
##        Sample Size  Mean Standard Deviation
## Female        9510 33150               9370
## Male         14280 33520               9460

We find that the average income is

Female: \(\$33150\)
Male: \(\$33520\)

so the difference is \(33520 -33150 = \$370\).

  • Is this a “substantial” or a “little” difference?
  • Is this a “statistically significant” difference?
  • Does it “prove” discrimination?

“prove” here has just about the same meaning as it does in a criminal trial: beyond any reasonable doubt.

To answer the question we would need to do a hypothesis test.

One way to answer that question is to use a method called the two-sample t test. You might be surprised to learn that indeed the difference is too large to be due to random chance!
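To get a feeling for where this conclusion comes from, here is a rough sketch that computes a two-sample t statistic by hand from the summary numbers in the stat.table output above. It is an approximation under stated assumptions: the summary values are rounded to the nearest 10, the unequal-variance (Welch) form of the standard error is used, and the p-value uses the normal distribution, which is reasonable only because the samples are so large. The variable names are made up for this illustration.

```r
# Summary values copied from the stat.table output above (rounded)
n_f <- 9510;  mean_f <- 33150; sd_f <- 9370
n_m <- 14280; mean_m <- 33520; sd_m <- 9460

# Standard error of the difference between the two means
std_error <- sqrt(sd_f^2 / n_f + sd_m^2 / n_m)

# t statistic: the observed difference measured in standard errors
t_stat <- (mean_m - mean_f) / std_error

# Two-sided p-value, normal approximation for large samples
p_value <- 2 * pnorm(-abs(t_stat))

t_stat
p_value
```

The difference of $370 is about three standard errors away from zero, which would be very unusual if only random chance were at work. With the raw data one would more likely call R's built-in t.test, for example t.test(Income ~ Gender, data = wrinccensus); its numbers may differ slightly from this sketch because of the rounding.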

Careful, though: although the difference is statistically significant, it still does not mean that there is discrimination, because the difference in salaries might be caused by other factors such as the fact that there are more men at the higher job levels, that men tend to have more years at the company etc. This is an example of one of the major issues in Statistics:

Correlation does not imply Causation or here: there is some connection between gender and income (specifically men are paid more than women) but it is not clear yet why that is. One possible reason is discrimination, but there could be others.
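A small simulation can show how such a difference arises without any discrimination. Everything in this sketch is hypothetical: there are only two job levels, men are assumed to be more likely to hold the higher one, and income depends on job level alone, never on gender.

```r
set.seed(2)
n <- 10000
gender <- sample(c("Female", "Male"), n, replace = TRUE)

# Assumption for illustration: men are more likely to be at the higher job level
p_high <- ifelse(gender == "Male", 0.4, 0.2)
level  <- ifelse(runif(n) < p_high, 2, 1)

# Income depends on job level only, not on gender
income <- 25000 + 15000 * (level - 1) + rnorm(n, sd = 5000)

# Average income by gender, ignoring job level
overall <- tapply(income, gender, mean)

# Average income by gender within each job level
by_level <- tapply(income, list(gender, level), mean)

round(overall)
round(by_level)
```

Overall the men's average is clearly higher, yet within each job level the two averages are almost the same. The gap is produced entirely by the different mix of job levels.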

Some Basic Terminology of Statistics

Population : all of the entities (people, events, things etc.) that are the focus of a study

Example 1 Say we are interested in the average age of the undergraduate students at the Colegio.

Example 2 A company is considering selling a new product in Puerto Rico, but before they do they want to know how many people in Puerto Rico might be interested in buying it.

Example 3 All possible hurricanes, past and future. Clearly this last one is a much more complicated population than the undergraduate students at the Colegio. In order to properly describe it we will need probability.


Census : If all the entities of a population are included in the study.

Example 1 If we ask the Registrar's Office they might give us the ages of all these students, and if we then ask all the students how old they are we would have done a census.

Example 2 impossible for practical reasons (we cannot ask every person in Puerto Rico)

Example 3 impossible for theoretical reasons (future?)


Sample : any subset of the population

Example 1 let’s take the students in the room as a sample.

Example 2 ask all our friends and relatives

Example 3 all the hurricanes during the last 10 years.


Random sample : a sample found through some randomization (flip of a coin, random numbers on computer etc.)

Example 1 Are you a random sample?

Example 2 do a telephone survey

Example 3 yes


Simple Random Sample (SRS) : each “entity” in the population has an equal chance of being chosen for the sample.

Example 1 Are you a simple random sample?

Example 2 depends on exactly how the telephone numbers are chosen, but generally ok

Example 3 all the hurricanes during the last 10 years: yes
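In R a simple random sample can be drawn with the base function sample. The sketch below imitates the WRInc case study, where 500 of the 23791 employees were selected at random. The ID numbers 1 to 23791 are a stand-in invented for this illustration, not the real employee IDs.

```r
set.seed(3)  # makes the random selection reproducible

# Hypothetical population: one ID per employee
employee_ids <- 1:23791

# Simple random sample of 500 employees, drawn without replacement,
# so every employee has the same chance of being chosen
srs <- sample(employee_ids, size = 500)

head(srs)
length(srs)
```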


Data : the collection of many pieces of information

Example 1 a table with the ages of the students in our sample

Example 2 list of answers (yes - I would buy the product, no - I would not)

Example 3 all the data available about a hurricane: track, wind speed, air pressure etc.

Hurricane Maria


Parameter : any numerical quantity associated with a population.

Example 1 If we had the ages of all 10000 or so undergraduate students we could calculate the average, and it would be a parameter.

Example 2 the percentage of all the people in PR who would buy the product - impossible to find exactly because we cannot do a census.

Example 3 the average top wind speed of the strongest hurricane in any one year. This is a number that nobody knows or can know, even theoretically.


Statistic : any numerical quantity associated with a sample.

Example 1 let’s calculate your average age. You are a sample, so this is a statistic.

Example 2 the percentage of people in our sample who say they would buy the product

Example 3 Take the last 10 years as a sample, and calculate the average of the top wind speeds of the strongest hurricane in each year.



Note that there is one value of the population parameter (as long as the population is the same) but there are many different values of the statistic, depending on the sample that was selected.
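The following sketch illustrates this with made-up data: a hypothetical population of 10000 student ages (the age range and the sample size of 40 are assumptions for illustration only). The population mean is a single fixed number, while each new sample gives a different sample mean.

```r
set.seed(4)

# Hypothetical population: ages of 10000 undergraduate students
ages <- round(runif(10000, min = 17, max = 26))

# Parameter: one fixed value for this population
parameter <- mean(ages)

# Statistics: five different samples of 40 students give five different means
sample_means <- replicate(5, mean(sample(ages, 40)))

parameter
sample_means
```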


Statistical Error : the uncertainty in the value of the statistic due to the fact that we only used a sample.

Example 1 If the average age of the students in the classroom is 21.25, what does this tell us about the average age of all the undergraduate students? Is it possible that this might be much higher, maybe even over 22 years?

Example 2 If 30% of the people in the sample say they would buy the product, what might be the number for the whole island?

Example 3 If the strongest hurricane in the last 10 years had wind speeds of 160 miles per hour (Maria!), does this mean we will never have one with 170 miles?


Bias : any systematic difference between the population and the sample with respect to a variable.

Example 1 Are you (the class) a biased sample?

Example 2 depends on how the selection of telephone numbers is done.

Example 3 Are the last 10 years a biased sample?

Avoiding bias is the main reason for using a Simple Random Sample.



Of course we always want a small variance and a small bias, but if that is not possible, what is better (if any)?

Large Variance - Small Bias

or

Small Variance - Large Bias

App:

for an illustration of the bias vs variance issue run

run.app(bias.variance)  
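If the app is not available, the trade-off can also be seen in a short simulation. This is a sketch with invented numbers, not the method used by the app: the population of ages is hypothetical, estimator A is the mean of a very small simple random sample, and estimator B is the mean of a large sample that can only reach the older half of the population.

```r
set.seed(5)

# Hypothetical population of ages
population <- rnorm(10000, mean = 21, sd = 3)
true_mean  <- mean(population)

# Estimator A: small simple random sample (small bias, large variance)
est_a <- replicate(1000, mean(sample(population, 5)))

# Estimator B: large sample from the older half only (large bias, small variance)
older_half <- population[population > median(population)]
est_b <- replicate(1000, mean(sample(older_half, 500)))

# Bias = average estimate minus true value; sd measures the spread
c(bias_a = mean(est_a) - true_mean, sd_a = sd(est_a))
c(bias_b = mean(est_b) - true_mean, sd_b = sd(est_b))
```

Estimator A is right on average but any single estimate can be far off. Estimator B gives nearly the same answer every time, but that answer is systematically too high.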

For a discussion of these terms as well as much more you can look at the web site of the

Australian Bureau of Statistics (http://www.abs.gov.au/websitedbs/a3121120.nsf/home/statistical+language)

Categorical vs. Quantitative Variables

Example Let’s consider again WRInc

  1. Variable Gender: most obvious thing to do: count how many males and females are in the company

  2. Variable Income: find average income of employees.

Why the difference?

In real life we always use a computer for all the calculations. That leaves two tasks for the human being doing Statistics:

Decide

  • what is the best method for analyzing a specific data set?

  • what is the result of the analysis telling you about the experiment?

Most important here are

  • the computer will not do these steps for you

  • (Almost always) the computer will do the analysis you ask it to do, even if this analysis is complete nonsense

In order to know what method to use it is important to understand some basic features of your data. One is its data type:

We categorize variables as follows:

  1. Quantitative data is numeric, and arithmetic makes sense (adding, multiplying etc.)

Example

  1. Yearly income of a family in Puerto Rico
    (typical answers: 35300, 45100, 19500, …)

  2. Temperature in Mayaguez at 12 Noon
    (typical answers: 93, 89, 91, 92 …)

  3. Amount paid for the phone bill
    (typical answers: 75.20, 109.75, 54.34 …)

  2. Categorical data is everything else

Example

  1. A student's major
    (typical answers: English, Biology, Psychology …)

  2. in an experiment to grow rice three different fertilizers were used. They are labeled A, B and C
    (typical answers: B B A C …)

  3. in an experiment to grow rice three different fertilizers were used. They are labeled 1, 2 and 3
    (typical answers: 2 2 1 3 …)

  4. Your student id number
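R does not know by itself that labels such as 1, 2 and 3 are categories, so it helps to tell it. The sketch below uses the typical answers from the examples above; the variable names are made up. Converting the fertilizer labels with factor marks them as categorical, so that counting is the natural operation, while for the quantitative incomes an average makes sense.

```r
# Fertilizer labels from the example above: numbers used only as names
fertilizer <- factor(c(2, 2, 1, 3))

# Categorical variable: count how often each category appears
table(fertilizer)

# Quantitative variable: arithmetic such as the average makes sense
income <- c(35300, 45100, 19500)
mean(income)
```

Calling mean(fertilizer) on the factor returns NA with a warning, which is R's way of saying that an average of labels is not meaningful.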

Note Often whether a variable is categorical or quantitative depends on how (and how precisely) it is measured.

Example In a study on weather we want to include “precipitation” (rain) at 12 noon of each day.

  • Is it raining at all? “Yes” or “No” → categorical

  • We put a cup outside. The cup has marks for each \(1/10^{th}\) of an inch. Our data is the number of marks. Values will be 0, 1, 2 etc. → quantitative

Categorical data comes in one of two versions - ordered or unordered:

Examples

  1. grades in a course: A, B, C, D, F and W - ordered

  2. gender: Male, Female - unordered

  3. Treatments in a clinical trial: A, B, C - unordered

  4. Treatments in a clinical trial: 1, 2, 3 - unordered

  5. blood pressure: low medium high - ordered

  6. directions: north east south west - unordered

One consequence of having an ordering is that it should be used in graphs, tables etc.
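In R the ordering has to be supplied explicitly, because by default categories are sorted alphabetically. A minimal sketch using the blood pressure example, with a few made-up observations:

```r
# A few hypothetical blood pressure readings
bp <- c("medium", "low", "high", "medium", "low")

# Default: categories appear in alphabetical order (high, low, medium)
table(bp)

# Supplying the levels puts tables and graphs in the natural order
bp_ordered <- factor(bp, levels = c("low", "medium", "high"), ordered = TRUE)
table(bp_ordered)

# With an ordered factor, comparisons are meaningful
bp_ordered[2] < bp_ordered[3]   # is "low" below "high"?
```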

Case Study: WRInc

Let’s look at the variables in the survey of wrinccensus

head(wrinccensus)
##   Id.Number Gender Income Job.Level Years Satisfaction
## 1     10001 Female  22800         1     6            4
## 2     10003 Female  18600         1     1            2
## 3     10004 Female  23900         1     4            2
## 4     10008   Male  37200         1    13            4
## 5     10010   Male  29800         1     9            1
## 6     10014   Male  53700         5    16            1