In 1929 Edwin Hubble published a paper showing a relationship between the distances of "extra-galactic nebulae" (galaxies) and their radial velocities away from Earth. His findings revolutionized astronomy. The "Hubble constant," the slope of the regression of velocity (Y) on distance (X), is still a subject of research and debate.
The data set here contains the values Hubble published in his original paper, along with data from much more recent studies.
head(hubble)
## Velocity Distance Galaxy.NGC. velocity.km.s Distance.Mpc.
## 1 170 0.032 925 553 9.70
## 2 290 0.034 1326A 1831 15.81
## 3 -130 0.214 1365 1636 18.48
## 4 -70 0.263 1425 1510 20.83
## 5 -185 0.275 2090 921 11.57
## 6 -220 0.275 2541 548 12.06
The variables we will use are Velocity and Distance, Hubble's original measurements. Here is some info on these:
Velocity (speed with a sign) is measured in km/sec. How can one measure the speed with which a galaxy moves relative to Earth? This is done using the Doppler effect:
For more on the Doppler Effect go to Doppler Effect
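To make this concrete, here is a small R sketch of the low-speed Doppler formula \(v \approx c\,\Delta\lambda/\lambda\). The observed wavelength below is a made-up value for illustration, not something from the hubble data set:

c.km.sec <- 299792.458        # speed of light in km/sec
lambda.rest <- 656.28         # rest wavelength of the H-alpha line, in nanometers
lambda.obs <- 660             # hypothetical observed (redshifted) wavelength
c.km.sec * (lambda.obs - lambda.rest) / lambda.rest   # radial velocity in km/sec
## [1] 1699.317

A positive velocity means the galaxy is moving away from us (redshift); a negative one means it is approaching (blueshift), which explains the negative velocities in the data above.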
The unit of distance in our data set is one Megaparsec (Mpc), or 1 million parsecs. A parsec is equal to 3.262 light years, or about 19.17 trillion miles.
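To get a feeling for these units, here is the arithmetic in R (the conversion constants are standard values):

3.262 * 5.879e12              # miles in one parsec (light years times miles per light year)
## [1] 1.91773e+13
1e6 * 3.262                   # light years in one Megaparsec
## [1] 3262000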
How does one measure the distance of a galaxy (or a star)? For nearby stars this is done directly using a method called parallax; distances to galaxies are then built up from such measurements.
For more on parallax go here.
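As a quick illustration of the parallax rule (distance in parsecs = 1 / parallax angle in arcseconds), here is a sketch in R; the parallax value is roughly that of Proxima Centauri, the nearest star:

parallax.arcsec <- 0.7685     # parallax angle in arcseconds (roughly Proxima Centauri)
1 / parallax.arcsec           # distance in parsecs
## [1] 1.301236
3.262 / parallax.arcsec       # the same distance in light years
## [1] 4.244632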
attach(hubble)
splot(Velocity, Distance)
The scatterplot of Velocity by Distance shows a strong positive linear relationship.
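splot comes from the routines we use in this class; if you want to reproduce the scatterplot in base R, a minimal sketch is

plot(Distance, Velocity,
     xlab = "Distance (Mpc)", ylab = "Velocity (km/sec)")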
slr(Velocity, Distance, show.tests = TRUE)
## The least squares regression equation is:
## Velocity = -40.784 + 454.158 Distance
## Constant: p = 0.6298
## Distance : p = 0
## R^2 = 62.35%
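slr is also one of our class routines; the same fit and tests can be reproduced in base R with lm (a sketch, using the attached hubble data):

fit <- lm(Velocity ~ Distance)
round(coef(fit), 3)
## (Intercept)    Distance 
##     -40.784     454.158 
summary(fit)                  # the Pr(>|t|) column of the coefficients table holds the two p values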
The two graphs show us that the assumptions of least squares regression are justified. Let's discuss the next part of the output:
Whenever there is a p value, there is a hypothesis test. Here there are two. The first one is for the constant (intercept):
\[ \begin{aligned} &H_0: \beta_0 = 0 \text{ (intercept is zero)} \\ &H_a: \beta_0 \ne 0 \text{ (intercept is not zero)} \end{aligned} \] If we fail to reject \(H_0\), we conclude that the constant is not statistically significantly different from 0 (at the sample size of the data set!).
If we reject \(H_0\), we conclude that the constant is statistically significantly different from 0.
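With the lm fit from the sketch above, this first test can be pulled out directly:

p.constant <- summary(fit)$coefficients["(Intercept)", "Pr(>|t|)"]
p.constant                    # about 0.6298, matching the slr output: we fail to reject H0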
Consequences:
We are fitting the model \[ y=\beta_0+\beta_1x \] If \(H_0\) is true then \(\beta_0=0\), so the model becomes \[ y=\beta_1x \] This is called a no-intercept model. To get this model we have to rerun the regression:
slr(Velocity, Distance, no.intercept=TRUE)
## The least squares regression equation is:
## Velocity = 423.937 Distance
## R^2 = 62.35%
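In base R the constant is dropped by adding - 1 (or equivalently + 0) to the model formula; a sketch matching the slr call above:

fit0 <- lm(Velocity ~ Distance - 1)
round(coef(fit0), 3)
## Distance 
##  423.937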
The slope of this line, 423.9, is an estimate of Hubble's constant, one of the fundamental constants of the universe!
Note that the slope of the no-intercept model (423.9) is NOT the same as the slope of the regular model (454.2).
Note: the decision whether an intercept should be fit is best made based on the background of the data, namely whether it makes sense that y = 0 when x = 0.
One consequence of this model is that if x = 0 then
\[ y=\beta_1x=\beta_1\cdot 0=0 \] so the point (0,0) is always on this line.
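We can verify this with the no-intercept fit from the base R sketch above: the predicted velocity at distance 0 is exactly 0.

predict(fit0, newdata = data.frame(Distance = 0))
## 1 
## 0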
Example: say we have data with x = number of hurricanes in a year and y = dollar amount of damage done by the hurricanes. Now if x = 0 (there were no hurricanes), obviously y = 0 (no damage), so a no-intercept model is appropriate (even if the corresponding hypothesis test says otherwise!).
The second one is for the slope:
\[ \begin{aligned} &H_0: \beta_1 = 0 \text{ (slope is zero)} \\ &H_a: \beta_1 \ne 0 \text{ (slope is not zero)} \end{aligned} \] Consequences: our model is \[ y=\beta_0+\beta_1x \] If \(H_0\) is true then \(\beta_1 =0\), so the model becomes \[ y=\beta_0+0\cdot x=\beta_0 \] But then the predictor x no longer appears in the model! So if we fail to reject \(H_0\), it means that the predictor has no statistically significant relationship with the response (at least not at the sample size of the data set).
If we do reject \(H_0\), we conclude that there is a statistically significant relationship between the predictor x and the response y.
Note: in a simple regression model such as the one we have here, this test is equivalent to the test for Pearson's correlation coefficient.
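We can check this equivalence in base R: both commands below give the same p value as the test for Distance (a sketch, using the fit from above):

cor.test(Velocity, Distance)$p.value
summary(fit)$coefficients["Distance", "Pr(>|t|)"]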
Constant: p = 0.6298, so we fail to reject \(H_0\): the constant is not statistically significantly different from 0.
Distance: p = 0 (that is, p < 0.001), so we reject \(H_0\): there is a statistically significant relationship between Distance and Velocity.
Notice that when I ran the least squares regression command
slr(Velocity, Distance, show.tests = TRUE)
I added the argument show.tests = TRUE. Without it these tests would not be shown. That is because in many ways they are useless:
- whether or not a no-intercept model is what we want should be decided by our understanding of the experiment, not the outcome of the test for the constant
- the test for the slope is the same as Pearson's correlation test, which we likely already did!
I have discussed them here because you will see them in real life, and so you should know what they are.
What are the consequences of all this for our understanding of the universe?