Inference for a Proportion (Percentage) \(\pi\)


In this section we will discuss inference for proportions (or percentages) such as the percentage of people who prefer Coke over Pepsi, who will vote PNP in the next election, who earn more than $50,000 per year etc.

Say we do a survey of n people and ask them “Do you prefer Coke over Pepsi?” If we only allow “Yes” and “No” answers, each answer is a Bernoulli trial with success probability \(\pi\). The object of interest here is \(\pi\), the proportion of people in the whole population who prefer Coke over Pepsi. The natural point estimate of \(\pi\) is the proportion of people in the sample who prefer Coke over Pepsi.

Notation: often in this context we use \(\widehat{p}\) for the sample proportion.

Example Say in a survey of 500 people 312 say they prefer Coke over Pepsi. Then a point estimate for the proportion of people who prefer Coke over Pepsi is

\[ \widehat{p} = \frac{312}{500} = 0.624 \]


Note Most often problems are stated in terms of percentages instead of proportions but all the methods use proportions. Simply multiply by 100% at the end.

Example A point estimate for the percentage of people who prefer Coke over Pepsi is \(62.4\%\)

Note Sometimes problems ask for probabilities; in this context that is the same as a proportion.

Example The probability of a six on a fair die is \(1/6 = 0.167\), or 16.7%

Note: Rounding: generally proportions and probabilities are rounded to 3 digits, percentages to 1 digit.
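As a small illustration of the point estimate and the rounding rule, here is the Coke vs. Pepsi survey in plain R. The variable names are ours, chosen just for this sketch.

```r
# Point estimate from the Coke vs. Pepsi survey: 312 "Yes" answers out of 500
x <- 312
n <- 500
p_hat <- x / n

# Round proportions to 3 digits and percentages to 1 digit
prop_rounded <- round(p_hat, 3)
pct_rounded <- round(100 * p_hat, 1)

cat("proportion:", prop_rounded, "\n")
cat("percentage:", pct_rounded, "%\n")
```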

Method

Exact Binomial

R commands:

  • one.sample.prop confidence intervals and hypothesis testing
  • prop.ps power and sample size

Assumptions

None!

Confidence Interval

A \(100(1-\alpha)\%\) confidence interval for the population proportion \(\pi\) is found with the one.sample.prop command.

Case Study: Binge Drinking in College

Alcohol on college campuses is a very serious problem. But how common is it? A survey of 17,096 students in US four-year colleges collected information on drinking behavior and alcohol-related problems.

(Henry Wechsler et al., “Health and Behavioral Consequences of Binge Drinking in College”, Journal of the American Medical Association, 272 (1994).)

The researchers defined “frequent binge drinking” as having five or more drinks in a row three or more times in the past two weeks. According to this definition 3,314 students were classified as frequent binge drinkers.

Problem: Find a point estimate for the percentage of frequent binge drinkers.

Solution: A point estimate for the proportion of frequent binge drinkers is

\[ \widehat{p} = \frac{3314}{17096} = 0.194 \]

therefore a point estimate for the percentage is 19.4%

Problem: Find a 99% confidence interval for the percentage of frequent binge drinkers.

one.sample.prop(x = 3314, n = 17096, conf.level = 99) 
## A 99% confidence interval for the population proportion is (0.186, 0.202)

or \((18.6, 20.2)\) percent.
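If the course command is not available, base R's binom.test also computes an exact binomial (Clopper-Pearson) interval. This is only a sketch: we are assuming, not claiming, that one.sample.prop uses the same kind of interval, so the last digit could differ. Note that binom.test wants the confidence level as a proportion (0.99), not as a percentage.

```r
# Exact (Clopper-Pearson) interval from base R for the binge drinking data
fit <- binom.test(x = 3314, n = 17096, conf.level = 0.99)
ci <- as.numeric(fit$conf.int)

print(round(ci, 3))        # interval for the proportion
print(round(100 * ci, 1))  # interval for the percentage
```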

Note unlike the one.sample.t command, the one.sample.prop command has no argument shat. As we said before, a Bernoulli trial has only one parameter (\(\pi\)), and the standard deviation follows from it.
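To see why no separate standard deviation is needed, recall that the standard deviation of a single Bernoulli trial is a function of \(\pi\) alone. For the binge drinking data, with \(\widehat{p} = 0.194\):

\[ \sigma = \sqrt{\pi(1-\pi)}, \qquad \sqrt{\widehat{p}(1-\widehat{p})} = \sqrt{0.194 \times 0.806} \approx 0.395 \]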

Case Study: Vacations of Puerto Ricans

The website of the Puerto Rico Tourism Company had the results of a survey of Puerto Ricans and their vacation travel. The study measures short trip leisure travel habits of average Puerto Rican families and allows for the monitoring of consumer preferences on a continuous basis. According to the report for July - September 2009 10% of the respondents had made a trip to Cabo Rojo, the highest number of any place in PR.

Find a 95% CI for the true proportion of PR travelers who visit Cabo Rojo. The survey was based on 400 interviews.

one.sample.prop(x = 40, n = 400, conf.level = 95) 
## A 95% confidence interval for the population proportion is (0.072, 0.134)

Example In a sample of 200 people entering a store, 61 actually bought something. Find a 90% confidence interval for the percentage of “buyers”.

one.sample.prop(x = 61, n = 200, conf.level = 90) 
## A 90% confidence interval for the population proportion is (0.251, 0.363)

or \((25.1\%, 36.3\%)\)

Note: don’t use the % sign in Moodle.

Example In a survey of 1200 likely voters 457 said they would vote for candidate AA. So a \(95\%\) confidence interval for the true percentage of voters for AA is

one.sample.prop(x = 457, n = 1200) 
## A 95% confidence interval for the population proportion is (0.353, 0.409)

or \((35.3\%, 40.9\%)\)

Hypothesis Test

Null Hypothesis: \(H_0: \pi = \pi_0\)

Alternative Hypothesis: Choose one of the following:

  1. \(H_a: \pi < \pi_0\)
  2. \(H_a: \pi > \pi_0\)
  3. \(H_a: \pi \ne \pi_0\)

Again we can use the one.sample.prop command. To get the p value of a test we need to supply the argument pi.null.

Case Study: Jon Kerrich’s Coin

Test at the 5% level of significance whether 5067 heads in 10000 flips are compatible with a fair coin.

  1. Parameter: proportion \(\pi\)
  2. Method: exact binomial
  3. Assumptions: None
  4. \(\alpha = 0.05\)
  5. \(H_0: \pi = 0.5\) (50% of flips result in “Heads”, coin is fair)
  6. \(H_a: \pi \ne 0.5\) (coin is not fair)
  7. \(p = 0.1835\)
one.sample.prop(x = 5067, n = 10000, pi.null = 0.5) 
## p value of test H0: pi=0.5 vs. Ha: pi <> 0.5:  0.1835
  8. \(p = 0.1835 > \alpha=0.05\), so we fail to reject the null hypothesis.
  9. It appears Jon Kerrich’s coin was indeed fair.
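The same exact binomial test can be sketched in base R with binom.test. We are assuming here that this matches what one.sample.prop does internally; the second computation shows where the number comes from. Because the binomial distribution with \(\pi = 0.5\) is symmetric, the two-sided p value is twice the probability of seeing 5067 or more heads.

```r
x <- 5067
n <- 10000
pi_null <- 0.5

# Exact binomial test, two-sided alternative
p_test <- binom.test(x, n, p = pi_null, alternative = "two.sided")$p.value

# By hand: twice the upper tail probability P(X >= x) under H0
p_manual <- 2 * pbinom(x - 1, n, pi_null, lower.tail = FALSE)

print(round(c(binom.test = p_test, by_hand = p_manual), 4))
```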

Example Let’s assume for a moment that Jon Kerrich’s coin was actually not a fair coin but one with \(\pi = 0.505\). How often would he have had to flip his coin to reject the null hypothesis?

Of course now we don’t have any data, so we have to guess what \(\widehat{p}\) might have been. For example, if he had flipped this coin 10000 times we would expect him to get about \(10000\times 0.505 = 5050\) heads. Running the test with these numbers we find:

n <- 10000
one.sample.prop(x = 0.505*n, n = n, pi.null = 0.5)
## p value of test H0: pi=0.5 vs. Ha: pi <> 0.5:  0.3222
n <- 20000
one.sample.prop(x = 0.505*n, n = n, pi.null = 0.5)
## p value of test H0: pi=0.5 vs. Ha: pi <> 0.5:  0.1594
n <- 30000
one.sample.prop(x = 0.505*n, n = n, pi.null = 0.5)
## p value of test H0: pi=0.5 vs. Ha: pi <> 0.5:  0.0843
n <- 40000
one.sample.prop(x = 0.505*n, n = n, pi.null = 0.5)
## p value of test H0: pi=0.5 vs. Ha: pi <> 0.5:  0.046

so if he had flipped his coin about 40000 times he would have rejected the null hypothesis of a fair coin at the 5% level.

Remember: even a small difference (0.5 vs 0.505) will lead to a rejection of the null hypothesis if the sample size is large enough.
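The repeated runs above can be collected in one loop. This is a sketch with base R's binom.test standing in for one.sample.prop (an assumption on our part). The expected number of heads is rounded because binom.test insists on a whole number of successes.

```r
pi_true <- 0.505
pi_null <- 0.5
n_values <- c(10000, 20000, 30000, 40000)

p_values <- sapply(n_values, function(n) {
  x <- round(pi_true * n)  # expected number of heads, as a whole number
  binom.test(x, n, p = pi_null)$p.value
})

# The p value shrinks as the number of flips grows
print(data.frame(n = n_values, p_value = round(p_values, 4)))
```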

Example

Say we roll a die 500 times and get 100 “sixes”. Is this compatible with a fair die? Test at the 5% level.

  1. Parameter: proportion \(\pi\)
  2. Method: exact binomial
  3. Assumptions: None
  4. \(\alpha = 0.05\)
  5. \(H_0: \pi = 1/6\) (die is fair)
  6. \(H_a: \pi \ne 1/6\) (die is not fair)
  7. \(p = 0.0477\)
one.sample.prop(x = 100, n = 500, pi.null = 1/6) 
## p value of test H0: pi=0.166666666666667 vs. Ha: pi <> 0.166666666666667:  0.0477
  8. \(p = 0.0477 < \alpha=0.05\), so we reject the null hypothesis (barely!).
  9. It appears this is not a fair die.

Example If in a survey of 350 people 200 say that they prefer Coke over Pepsi, can Coca-Cola claim that more than half of the people prefer Coke over Pepsi? Test at the 5% level.

  1. Parameter: proportion \(\pi\)
  2. Method: exact binomial
  3. Assumptions: None
  4. \(\alpha = 0.05\)
  5. \(H_0: \pi = 0.5\) (half like Coke, half Pepsi)
  6. \(H_a: \pi > 0.5\) (more than half like Coke)
  7. \(p = 0.0044\)
one.sample.prop(x = 200, n = 350, 
                pi.null = 0.5,
                alternative = "greater") 
## p value of test H0: pi=0.5 vs. Ha: pi > 0.5:  0.0044
  8. \(p = 0.0044 < \alpha=0.05\), so we reject the null hypothesis.
  9. It appears that indeed more than half of the people prefer Coke over Pepsi.
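For a one-sided alternative the p value is a single tail probability. The sketch below uses base R (again assuming it agrees with one.sample.prop) and compares the one-sided with the two-sided p value for the same data.

```r
x <- 200
n <- 350
pi_null <- 0.5

# One-sided test of H0: pi = 0.5 vs. Ha: pi > 0.5
p_greater <- binom.test(x, n, p = pi_null, alternative = "greater")$p.value

# The same thing by hand: P(X >= 200) if X ~ Binomial(350, 0.5)
p_by_hand <- pbinom(x - 1, n, pi_null, lower.tail = FALSE)

# For comparison: the two-sided p value for the same data is larger
p_two_sided <- binom.test(x, n, p = pi_null)$p.value

print(round(c(greater = p_greater, two_sided = p_two_sided), 4))
```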

Power of the Test

Again we need to worry about the power of our test.

Case Study: Jon Kerrich’s Coin

Let’s assume his coin had a probability of 0.505 to come up heads.

What was the power of the test we did above? To find out we can use the prop.ps command:

prop.ps(n = 10000, phat = 0.505, pi.null = 0.5)

## [1] "Power of Test = 16.6%"

so with “just” 10000 flips there was only a very small chance of detecting that his coin was a little unfair.

Again one would probably do this for many values of \(\pi\) and draw a graph.
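Here is one way such a power curve could be computed. The function exact_power is our own helper, not part of the course commands. It finds the power of an equal-tailed exact binomial test by first collecting all outcomes that lead to a rejection under \(H_0\) and then adding up their probabilities under the true \(\pi\). prop.ps may use a different method, so the numbers need not agree to the last digit.

```r
# Exact power of a two-sided, equal-tailed binomial test of H0: pi = pi_null
# (hypothetical helper written for this illustration)
exact_power <- function(n, pi_true, pi_null = 0.5, alpha = 0.05) {
  x <- 0:n
  lower_tail <- pbinom(x, n, pi_null)                          # P(X <= x | H0)
  upper_tail <- pbinom(x - 1, n, pi_null, lower.tail = FALSE)  # P(X >= x | H0)
  reject <- (lower_tail <= alpha / 2) | (upper_tail <= alpha / 2)
  sum(dbinom(x[reject], n, pi_true))  # probability of the rejection region
}

pi_grid <- c(0.5, 0.505, 0.51, 0.515, 0.52)
power_grid <- sapply(pi_grid, function(p) exact_power(10000, p))

print(data.frame(pi = pi_grid, power_percent = round(100 * power_grid, 1)))
# plot(pi_grid, power_grid, type = "b") would draw the power curve
```

At \(\pi = 0.5\) the "power" is just the probability of a false rejection, so it cannot exceed \(\alpha\). It then climbs towards 100% as the coin becomes more unfair.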

Note unlike the t.ps command, the prop.ps command needs both phat and pi.null, not just the difference:

prop.ps(n = 10000, phat = 0.505, pi.null = 0.5)
## [1] "Power of Test = 16.6%"
prop.ps(n = 10000, phat = 0.5, pi.null = 0.505)
## [1] "Power of Test = 17.4%"

Example In the past about \(12\%\) of the students in a class received an A. After some substantial changes to the class we hope that the percentage has gone up to \(20\%\). If we test at the \(5\%\) level and if there are 150 students in the class, what is the power of this test?

We will do the test

  1. Parameter: proportion \(\pi\)
  2. Method: exact binomial
  3. Assumptions: None
  4. \(\alpha = 0.05\)
  5. \(H_0: \pi = 0.12\) (same percentage as before)
  6. \(H_a: \pi > 0.12\) (percentage has gone up)
prop.ps(n=150, phat=0.2, pi.null=0.12,
        alternative = "greater")

## [1] "Power of Test = 81.9%"

so the power is \(81.9\%\).
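To see where a number like this comes from, the power of the one-sided exact test can be worked out by hand in two steps: find the smallest number of A's that would lead to a rejection under \(H_0\), then find the probability of reaching that number if the true proportion is 0.2. This is a sketch in base R under the assumption that prop.ps does something equivalent; the variable names are ours.

```r
n <- 150
pi_null <- 0.12
pi_alt <- 0.20
alpha <- 0.05

x <- 0:n
# One-sided p value P(X >= x | H0) for every possible number of A's
p_vals <- pbinom(x - 1, n, pi_null, lower.tail = FALSE)

# Smallest number of A's for which we would reject H0
x_crit <- min(x[p_vals <= alpha])

# Power: probability of at least x_crit A's if the true proportion is pi_alt
power_exact <- pbinom(x_crit - 1, n, pi_alt, lower.tail = FALSE)

cat("reject if at least", x_crit, "students get an A\n")
cat("power:", round(100 * power_exact, 1), "%\n")
```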

Sample Size Calculation

As with the mean the sample size calculation is different depending on whether we want to do a hypothesis test or find a confidence interval.

Example Same story as above: in the past about \(12\%\) of the students in a class received an A. After some substantial changes to the class we hope that the percentage has gone up to \(20\%\). If we test at the \(5\%\) level and if there are 150 students in the class, we found that we would have a power of \(81.9\%\). What sample size would we need to have a power of \(95\%\)?

prop.ps(power=95, phat=0.2, pi.null=0.12,
        alternative = "greater")
## [1] "Sample size required is  232"

Case Study: Jon Kerrich’s Coin

If indeed his coin had a probability of Heads of 0.505, how often would he have to flip his coin to have a power of 90%?

prop.ps(power = 90, phat = 0.505, pi.null = 0.5)
## [1] "Sample size required is  105281"
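As a rough check on such numbers one can use the textbook normal-approximation formula for the sample size of a test. The function n_for_test below is a hypothetical helper of ours and is not what prop.ps does (prop.ps is based on the exact binomial), so expect answers in the same range as those above, not identical ones.

```r
# Normal-approximation sample size for testing H0: pi = pi_null when the
# true proportion is pi_alt (hypothetical helper, textbook formula)
n_for_test <- function(pi_alt, pi_null, power = 0.90, alpha = 0.05,
                       two_sided = TRUE) {
  tail_prob <- if (two_sided) alpha / 2 else alpha
  z_alpha <- qnorm(1 - tail_prob)
  z_power <- qnorm(power)
  num <- z_alpha * sqrt(pi_null * (1 - pi_null)) +
    z_power * sqrt(pi_alt * (1 - pi_alt))
  ceiling((num / (pi_alt - pi_null))^2)
}

# Kerrich's coin: two-sided test, power 90%
n_coin <- n_for_test(0.505, 0.5, power = 0.90)

# Class example: one-sided test, power 95%
n_class <- n_for_test(0.20, 0.12, power = 0.95, two_sided = FALSE)

print(c(coin = n_coin, class = n_class))
```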

How about if we want to find a confidence interval? As with the mean we have to decide on the error E (half the length of the interval) we want. Then we can use the prop.ps command.

We have the same problem as with the mean here: the sample size depends on the true \(\pi\), but we are trying to estimate \(\pi\)!

The same ideas as with the mean such as doing a pilot study work here as well. In addition we have something else we can do here. It turns out that using phat = 0.5 will lead to a sample size that is always sufficient. prop.ps does this unless another phat is given.

Example You want to do a survey of likely voters for the next election. You want to find a 95% confidence interval for the percentage of voters for the PNP, with an error of E = 0.03. What sample size is required?

prop.ps(E = 0.03)
## [1] "Sample size required is  1068"

Example same as above, but for the PIP. Here we already know that \(\pi\) is around 5%, so

prop.ps(phat = 0.05, E = 0.03)
## [1] "Sample size required is  203"
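Both answers agree with the standard normal-approximation formula \(n = z_{\alpha/2}^2\, \widehat{p}(1-\widehat{p})/E^2\), rounded up. Whether prop.ps uses exactly this formula is an assumption on our part; n_for_ci is a hypothetical helper written for this sketch.

```r
# Sample size so that a confidence interval has error (half-length) E
n_for_ci <- function(E, phat = 0.5, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  ceiling(z^2 * phat * (1 - phat) / E^2)
}

n_pnp <- n_for_ci(E = 0.03)               # worst case phat = 0.5: 1068
n_pip <- n_for_ci(E = 0.03, phat = 0.05)  # pi known to be around 5%: 203

print(c(PNP = n_pnp, PIP = n_pip))
```

The product \(\widehat{p}(1-\widehat{p})\) is largest at \(\widehat{p} = 0.5\), which is why using phat = 0.5 always gives a sufficient sample size.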

Look again at the prop.ps command for the sample size. There is something truly amazing about what is not part of the command!

Example

We want to do a study on the percentage of students at some large university who are female. We want to find a 95% confidence interval with an error of 5%. What sample size will we need?

prop.ps(E = 0.05)
## [1] "Sample size required is  385"

Example A company regularly receives a shipment of electronic parts. Their contract with the supplier says that the shipment can contain up to 5% faulty parts. They suspect that the current shipment has 10% faulty parts. If they plan on randomly selecting parts, testing them and then doing a hypothesis test at the 10% level, how many parts do they need to select so that the hypothesis test has a power of 90%?

prop.ps(power = 90, phat = 0.1, pi.null = 0.05, 
    alpha = 0.1, alternative = "greater")
## [1] "Sample size required is  187"