This page contains some basic information on how to use the computer and the R program.
To log on to computers in Ch115:
Username: .\esma (important: do not forget to include “.\” before the word esma)
Password: Mate1234 (important: uppercase letter “M”)
To log on to computers in SH005:
Username: Estudiante
Password: salon005
The class webpages are at http://academic.uprm.edu/wrolke/esmaXXXX (3015, 3101, 3102, 6661 etc)
At the end of each session, log off.
You can get a free version of R for your computer from a number of sources. The download is about 70MB and setup is fully automatic. Here are some links:
After the installation is finished close R (if it is open). From now on ALWAYS open R by clicking on the link to the RESMA3 file on top of the homepage. You can also download and save that file to your own computer and start R from there. The first time you do this the program will download a number of additional things, just let it. A window might also pop up and ask whether to save something; if so, click on yes.
Note
You might be asked several times whether you want to do something (allow access, run a program, save a library, …); always just say yes!
You will need to connect to a reasonably fast internet for these steps.
This will take a few minutes, just wait until the > sign appears.
FOR MAC OS USERS ONLY
There are a few things that are different between MacOS and Windows. Here is one thing you should do:
Download XQuartz - XQuartz-2.7.11.dmg
Open XQuartz
Type the letter R (to make XQuartz run R)
Hit enter
Open R
Run the command .First()
Then, every command should work correctly.
There is a program called RStudio that a lot of people like to use to run R. You can download it at RStudio. Before you can use RStudio with Resma3 you need to run Resma3 JUST ONCE from R itself.
So do this:
Follow ALL the instructions above.
Only if everything is running correctly, install RStudio.
For the purpose of the class R itself is enough, we don’t need RStudio.
If you try to run a command and get an error
could not find function “ggplot”
(or grid or shiny)
first try this: run the command
ls()
You should see a listing of many things (over 200). If you do not, Resma3 did not load correctly. Close R and restart it by clicking on the link to Resma3 on the homepage.
If you do see the listing, type
one.time.setup()
A number of things should be happening, just wait until you see the > again and see whether that fixes the problem.
If this does not work turn off R and restart it with a new version of Resma3 from the top of the class homepage.
If this also does not work send me an email with the explanation of the problem. The best thing to do is to include a screenshot. Here is how:
You can also just use your cell phone to take a picture of the screen, but make sure it is readable!
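When reporting a problem it also helps to include what R itself says about its setup. The following is a minimal sketch, assuming only base R; the package names ggplot2, grid and shiny are inferred from the error messages above and are not taken from the Resma3 setup itself.

```r
# Packages assumed to be behind the "could not find function" errors above
pkgs <- c("ggplot2", "grid", "shiny")

# TRUE means R can find and load the package, FALSE means it is missing
installed <- sapply(pkgs, requireNamespace, quietly = TRUE)
print(installed)

# The R version is also worth mentioning in the email
print(R.version.string)
```

Any package that shows FALSE here is a likely reason for the error.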
I often get an email saying that something is not working, and my answer is simply:
RGDM
This means: Read the God-Damn Manual!
That is, the answer to your problem is somewhere on these pages, and you should have found it there before sending an email!
Throughout this class when you see something like this:
text
it means you should type (or copy-paste) what is in the box (“text”) into the R console. If you type it, be careful to get it exactly right!
To see whether everything is installed correctly copy-paste the following line into R and hit enter:
hplot(rnorm(1000))
You should see a graph like this (called a histogram)
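If hplot is not available (it comes with Resma3, not with R itself), base R's hist function draws a comparable picture. A small sketch, assuming only base R; set.seed is added here just to make the random numbers reproducible.

```r
set.seed(1)            # make the random numbers reproducible
z <- rnorm(1000)       # 1000 random numbers from a standard normal distribution
h <- hist(z)           # draws the histogram and returns the bin counts
sum(h$counts)          # every one of the 1000 values falls into some bin
```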
For a much more extensive introduction to R go here
Once you have started a session the first thing you see is some text, and then the > sign. This is the R prompt, it means R is waiting for you to do something. Sometimes the prompt changes to a different symbol, as we will see.
Let’s start with
ls()
This shows you a “listing” of the files (data, routines etc.).
Everything in R is either a data set or a function. It is a function if it is supposed to do something (maybe calculate something, show you something like a graph, etc.). If it is a function it ALWAYS NEEDS (). Sometimes there is something in between the parentheses, like in the hplot() above. Sometimes there isn’t, like in the ls(). But the () has to be there anyway.
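The following lines illustrate the point with functions that are part of base R (sqrt and Sys.Date are not used elsewhere on this page; they serve only as examples).

```r
sqrt(16)               # something in between the parentheses
Sys.Date()             # nothing in between, but the () is still needed
is.function(sqrt)      # without () we are talking about the function itself
```

Typing just the name without the () does not run the function; R shows what the function is instead.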
If you have worked for a while you might have things you need to save, do that by clicking on
File > Save Workspace
If you quit the program without saving your stuff everything you did will be lost. R has a somewhat unusual file system: everything belonging to the same project (data, routines, graphs etc.) is stored in just one file, with the extension .RData.
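Saving can also be done with commands. The sketch below assumes only base R and uses a temporary folder so that nothing on your computer is overwritten; the file name example.RData is made up for illustration. It saves a single object with save, while save.image() would store the whole workspace in the same kind of file.

```r
x <- c(10, 2, 6, 9)
f <- file.path(tempdir(), "example.RData")  # hypothetical file name
save(x, file = f)      # write x into an .RData file
rm(x)                  # x is now gone from the session
load(f)                # read the file back in
x                      # x is available again
```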
To quit R, type
q()
or click the x in the upper right corner.
R has a nice recall feature, using the up and down arrow keys. Also, typing
history()
shows you the most recent things entered.
R is case-sensitive, so a and A are two different things.
Often during a session you create objects that you need only for a short time. When you no longer need them use rm to get rid of them:
x <- 10
x^2
## [1] 100
rm(x)
The <- is the assignment character in R; it assigns what is on the right to the symbol on the left. (Think of an arrow pointing to the left.)
For a few numbers the easiest thing is to just type them in:
x <- c(10, 2, 6, 9)
x
## [1] 10 2 6 9
c() is a function that takes the objects inside the () and combines them into one single object (a vector).
Most moodle quizzes will require you to transfer data from the quiz to R. This is done with the command get.moodle.data(). There are two steps:
in moodle use the mouse to highlight the data. If it is a table with several columns ALWAYS include the column headers (names of variables).
switch to R and run
get.moodle.data()
Now the data should be in R. It is called x. You can always check by typing x and ENTER.
x
## [1] 10 2 6 9
Here are some examples:
101.6 115.0 100.9 103.8 77.6 102.6 99.6 108.5 100.8 92.5 101.8 81.6 103.7 94.9 103.3 86.7 101.6 106.6 101.5 96.9
highlight the data with the mouse, copy it, go to R and type
get.moodle.data()
x
## [1] 101.6 115.0 100.9 103.8 77.6 102.6 99.6 108.5 100.8 92.5 101.8
## [12] 81.6 103.7 94.9 103.3 86.7 101.6 106.6 101.5 96.9
this also works if the data is not numbers:
Old Old Young Old Young Young
get.moodle.data()
## [1] "Old" "Old" "Young" "Old" "Young" "Young"
sometimes parts of the data are separated by some symbol, for example a comma. In that case you can use the sep argument:
1.5, 2.3, 5.3, 2.4, 7.9, 8.1, 2.7, 4.2
get.moodle.data(sep = ",")
## [1] 1.5 2.3 5.3 2.4 7.9 8.1 2.7 4.2
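get.moodle.data is part of Resma3 and how it works internally is not shown here. For comparison, base R's scan can turn comma-separated text into a vector; in this sketch the text is typed in directly rather than copied from Moodle.

```r
txt <- "1.5, 2.3, 5.3, 2.4, 7.9, 8.1, 2.7, 4.2"
x <- scan(text = txt, sep = ",", quiet = TRUE)  # split at the commas, read as numbers
x
length(x)
```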
Age | Gender |
---|---|
18 | Female |
19 | Female |
20 | Male |
20 | Female |
18 | Female |
25 | Male |
20 | Male |
24 | Female |
21 | Male |
22 | Female |
get.moodle.data()
## Age Gender
## 1 18 Female
## 2 19 Female
## 3 20 Male
## 4 20 Female
## 5 18 Female
## 6 25 Male
## 7 20 Male
## 8 24 Female
## 9 21 Male
## 10 22 Female
Note if the data is a single vector it is given the name x, and you can now do things like
mean(x)
if the data is a table it is immediately attached and you can use the column names, for example
mean(Age)
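The column headers matter because they become the names of the variables. A sketch in base R with the first three rows of the table above typed in by hand; read.table stands in for get.moodle.data here, and the name dta is made up for this example.

```r
dta <- read.table(text = "Age Gender
18 Female
19 Female
20 Male", header = TRUE)   # header = TRUE: the first line holds the column names
colnames(dta)
mean(dta$Age)              # dta$Age is the Age column of dta
```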
Note on rare occasions the routine can fail if the data is a table but everything is text. In that case use the argument is.table=TRUE.
Note sometimes you might get a warning from R; as long as the data is transferred correctly you can ignore that.
The most basic type of data in R is a vector, simply a list of values.
Say we want the numbers 1.5, 3.6, 5.1 and 4.0 in an R vector called x, then we can type
x <- c(1.5, 3.6, 5.1, 4.0)
x
## [1] 1.5 3.6 5.1 4.0
Often the numbers have a structure one can make use of:
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
10:1
## [1] 10 9 8 7 6 5 4 3 2 1
1:20*2
## [1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
c(1:10, 1:10*2)
## [1] 1 2 3 4 5 6 7 8 9 10 2 4 6 8 10 12 14 16 18 20
Sometimes you need parentheses:
n <- 10
1:n-1
## [1] 0 1 2 3 4 5 6 7 8 9
1:(n-1)
## [1] 1 2 3 4 5 6 7 8 9
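When the pattern gets more complicated, base R's seq function states the start, the end and the step size explicitly, which avoids the question of what is calculated first. A short sketch:

```r
n <- 10
seq(2, 40, by = 2)     # same numbers as 1:20*2
seq_len(n - 1)         # same numbers as 1:(n-1)
seq(0, 1, by = 0.25)   # steps need not be whole numbers
```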
The rep (“repeat”) command is very useful:
rep(1, 10)
## [1] 1 1 1 1 1 1 1 1 1 1
rep(1:3, 10)
## [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
rep(1:3, each=3)
## [1] 1 1 1 2 2 2 3 3 3
rep(c("A", "B", "C"), c(4,7,3))
## [1] "A" "A" "A" "A" "B" "B" "B" "B" "B" "B" "B" "C" "C" "C"
what does this do?
rep(1:10, 1:10)
To find out how many elements a vector has use the length command:
x <- c(1.4, 5.1, 2.0, 6.8, 3.5, 2.1, 5.6, 3.3, 6.9, 1.1)
length(x)
## [1] 10
The elements of a vector are accessed with the bracket [ ] notation:
x[3]
## [1] 2
x[1:3]
## [1] 1.4 5.1 2.0
x[c(1, 3, 8)]
## [1] 1.4 2.0 3.3
x[-3]
## [1] 1.4 5.1 6.8 3.5 2.1 5.6 3.3 6.9 1.1
x[-c(1, 2, 5)]
## [1] 2.0 6.8 2.1 5.6 3.3 6.9 1.1
Instead of numbers a vector can also consist of characters (letters, numbers, symbols etc.). These are identified by quotes:
c("A", "B", 7, "%")
## [1] "A" "B" "7" "%"
A vector is either numeric or character, but never both (see how the 7 was changed to “7”).
You can turn one into the other (if possible) as follows:
x <- 1:10
x
## [1] 1 2 3 4 5 6 7 8 9 10
as.character(x)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
x <- c("1", "5", "10", "-3")
x
## [1] "1" "5" "10" "-3"
as.numeric(x)
## [1] 1 5 10 -3
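The “if possible” matters: text that does not look like a number cannot be converted. A small sketch of what R does in that case (the vector is made up for illustration):

```r
x <- c("1", "five", "10")
y <- suppressWarnings(as.numeric(x))  # without suppressWarnings R prints a warning
y                                     # "five" becomes NA, R's code for a missing value
is.na(y)                              # TRUE marks the value that could not be converted
```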
A third type of data is logical, with values either TRUE or FALSE.
x <- 1:10
x
## [1] 1 2 3 4 5 6 7 8 9 10
x > 4
## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
these are often used as conditions:
x[x>4]
## [1] 5 6 7 8 9 10
This, as we will see shortly, is EXTREMELY useful!
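A sketch of a few typical uses, all in base R. It relies on the fact that R counts TRUE as 1 and FALSE as 0 when doing arithmetic:

```r
x <- 1:10
sum(x > 4)             # how many values are greater than 4
mean(x > 4)            # what proportion of the values is greater than 4
x[x > 4 & x < 8]       # values greater than 4 AND less than 8
```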
Data frames are the basic format for data in R. They are essentially vectors put together as columns.
The main things you need to know about working with data frames are the following commands:
Consider the upr data set. This is the application data for all the students who applied and were accepted to UPR-Mayaguez between 2003 and 2013.
dim(upr)
## [1] 23666 16
tells us that there were 23666 applications and that for each student there are 16 pieces of information.
colnames(upr)
## [1] "ID.Code" "Year" "Gender" "Program.Code"
## [5] "Highschool.GPA" "Aptitud.Verbal" "Aptitud.Matem" "Aprov.Ingles"
## [9] "Aprov.Matem" "Aprov.Espanol" "IGS" "Freshmen.GPA"
## [13] "Graduated" "Year.Grad." "Grad..GPA" "Class.Facultad"
shows us the variables
head(upr, 3)
## ID.Code Year Gender Program.Code Highschool.GPA Aptitud.Verbal
## 1 00C2B4EF77 2005 M 502 3.97 647
## 2 00D66CF1BF 2003 M 502 3.80 597
## 3 00AB6118EB 2004 M 1203 4.00 567
## Aptitud.Matem Aprov.Ingles Aprov.Matem Aprov.Espanol IGS Freshmen.GPA
## 1 621 626 672 551 342 3.67
## 2 726 618 718 575 343 2.75
## 3 691 424 616 609 342 3.62
## Graduated Year.Grad. Grad..GPA Class.Facultad
## 1 Si 2012 3.33 INGE
## 2 No NA NA INGE
## 3 No NA NA CIENCIAS
shows us the first three cases.
Let’s say we want to find the number of males and females. We can use the table command for that:
table(Gender)
## Error: object 'Gender' not found
What happened? Right now R does not know what Gender is because it is “hidden” inside the upr data set. Think of upr as a box that is currently closed, so R can’t look inside and see the column names. We need to open the box first:
attach(upr)
table(Gender)
## Gender
## F M
## 11487 12179
There is also a detach command to undo an attach, but this is not usually needed because the attach goes away when you close R.
Note: you need to attach a data frame only once in each session working with R.
Note: Say you are working first with a data set “students 2016” which has a column called Gender, and you attached it. Later (but in the same R session) you start working with a data set “students 2017” which also has a column called Gender, and you are attaching this one as well. If you use Gender now it will be from “students 2017”.
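One way to avoid this kind of mix-up is to say explicitly which data frame a column should come from, either with the $ notation or with the with function. A sketch with two small made-up data frames (the names students2016 and students2017 and their contents are invented for this example):

```r
students2016 <- data.frame(Gender = c("F", "M", "M"))
students2017 <- data.frame(Gender = c("F", "F", "M", "F"))

table(students2016$Gender)          # Gender from the 2016 data
with(students2017, table(Gender))   # Gender from the 2017 data
```

Nothing is attached here, so there is never any doubt which Gender is meant.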
Consider the following data frame (not a real data set):
students
## Age GPA Gender
## 1 22 3.1 Male
## 2 23 3.2 Male
## 3 20 2.1 Male
## 4 22 2.1 Male
## 5 21 2.3 Female
## 6 21 2.9 Male
## 7 18 2.3 Female
## 8 22 3.9 Male
## 9 21 2.6 Female
## 10 18 3.2 Female
Here each single piece of data is identified by its row number and its column number. So for example in row 2, column 2 we have “3.2”, in row 6, column 3 we have “Male”.
As with the vectors before we can use the [ ] notation to access pieces of a data frame, but now we need to give it both the row and the column number, separated by a ,:
students[6, 3]
## [1] "Male"
As before we can pick more than one piece:
students[1:5, 3]
## [1] "Male" "Male" "Male" "Male" "Female"
students[1:5, 1:2]
## Age GPA
## 1 22 3.1
## 2 23 3.2
## 3 20 2.1
## 4 22 2.1
## 5 21 2.3
students[-c(1:5), 3]
## [1] "Male" "Female" "Male" "Female" "Female"
students[1, ]
## Age GPA Gender
## 1 22 3.1 Male
students[, 2]
## [1] 3.1 3.2 2.1 2.1 2.3 2.9 2.3 3.9 2.6 3.2
students[, -3]
## Age GPA
## 1 22 3.1
## 2 23 3.2
## 3 20 2.1
## 4 22 2.1
## 5 21 2.3
## 6 21 2.9
## 7 18 2.3
## 8 22 3.9
## 9 21 2.6
## 10 18 3.2
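Columns can also be picked by name rather than by number, which is often easier to read. A sketch in base R; the data frame is rebuilt here from the listing above so that the code runs on its own:

```r
students <- data.frame(
  Age = c(22, 23, 20, 22, 21, 21, 18, 22, 21, 18),
  GPA = c(3.1, 3.2, 2.1, 2.1, 2.3, 2.9, 2.3, 3.9, 2.6, 3.2),
  Gender = c("Male", "Male", "Male", "Male", "Female",
             "Male", "Female", "Male", "Female", "Female")
)
students[6, "Gender"]               # same as students[6, 3]
students[1:5, c("Age", "GPA")]      # same as students[1:5, 1:2]
students[students$GPA > 3, ]        # rows can be picked by a condition as well
```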
R allows us to apply mathematical functions to a whole vector:
x <- 1:10
2*x
## [1] 2 4 6 8 10 12 14 16 18 20
x^2
## [1] 1 4 9 16 25 36 49 64 81 100
log(x)
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
## [8] 2.0794415 2.1972246 2.3025851
sum(x)
## [1] 55
y <- 21:30
x+y
## [1] 22 24 26 28 30 32 34 36 38 40
x^2+y^2
## [1] 442 488 538 592 650 712 778 848 922 1000
mean(x+y)
## [1] 31
Let’s try something strange:
c(1, 2, 3) + c(1, 2, 3, 4)
## Warning in c(1, 2, 3) + c(1, 2, 3, 4): longer object length is not a
## multiple of shorter object length
## [1] 2 4 6 5
So R notices that we are trying to add a vector of length 3 to a vector of length 4. This should not work, but it actually does!
When it runs out of values in the first vector, R simply starts all over again.
In general this is more likely a mistake by you, so check that this is what you really wanted to do!
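The warning appears only when the longer length is not a multiple of the shorter one. A short sketch showing that R otherwise repeats the shorter vector silently, which is why it pays to compare the lengths first:

```r
a <- c(1, 2)
b <- c(10, 20, 30, 40)
a + b                    # a is used twice: 1+10, 2+20, 1+30, 2+40, and no warning
length(a) == length(b)   # a quick check before adding two vectors
```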
One of the most common tasks in Statistics is to select a part of a data set for further analysis. There is even a name for this: data wrangling.
Description: Daily measurements of air quality in New York, May to September 1973.
A data frame with 153 observations on 6 variables.
Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island
Solar.R: Solar radiation in Langleys in the frequency band 4000-7700 Angstroms from 0800 to 1200 hours at Central Park
Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport
Temp: Maximum daily temperature in degrees Fahrenheit at La Guardia Airport.
Source: The data were obtained from the New York State Department of Conservation (ozone data) and the National Weather Service (meteorological data).
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
Let’s say that instead of looking at the whole data set we want to consider only the months of August and September. Those have Month = 8, 9 and we can select this part of the data set with the [ , ] notation we discussed earlier:
attach(airquality)
airAugSept <- airquality[Month>=8, ]
head(airAugSept)
## Ozone Solar.R Wind Temp Month Day
## 93 39 83 6.9 81 8 1
## 94 9 24 13.8 81 8 2
## 95 16 77 7.4 82 8 3
## 96 78 NA 6.9 86 8 4
## 97 35 NA 7.4 85 8 5
## 98 66 NA 4.6 87 8 6
This task of data wrangling is so important that there are quite a lot of routines that help with it. One of them is isubset.
Here is what you do:
airAugSept<- isubset(airquality)
The app lets you use up to three conditions, we just have one (Month \(\ge\) 8), so we can leave that alone. Now choose the condition and then hit “Click when ready to run”
Here is a screenshot:
Now hit Close App and return to R.
In this example we used a very simple condition: Month \(\ge\) 8. These conditions can be much more complicated using & (AND), | (OR) and !(NOT).
Let’s say we want only those days in August and September with a Temperature less than 80:
airAugSeptTemp80 <- isubset(airquality)
Finally let’s say we want only either those days in August and September with a Temperature less than 80, or days with Wind>10:
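The same selections can be written with the [ , ] notation. A sketch in base R; it writes airquality$Month and so on instead of relying on attach so that it runs on its own, and the name airEither is made up for this example:

```r
# days in August and September with a temperature less than 80
airAugSeptTemp80 <- airquality[airquality$Month >= 8 & airquality$Temp < 80, ]

# either those days, or days with Wind > 10
airEither <- airquality[(airquality$Month >= 8 & airquality$Temp < 80) |
                          airquality$Wind > 10, ]

nrow(airAugSeptTemp80)
nrow(airEither)
```

The parentheses around the first condition make it clear that the & is evaluated before the |.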
Let’s get back to the days in August and September. What we want to do with those days is to find the mean Ozone level:
airAugSept <- isubset(airquality)
mean(Ozone)
## [1] NA
Oh! Something went wrong! The problem is that the column Ozone has missing values, which R codes as NA. These are just what it says, for some days the Ozone level was not measured and so is missing. One way to go is to tell R to ignore the missing values:
mean(Ozone, na.rm=TRUE)
## [1] 42.12931
or we could use:
stat.table(Ozone)
## Warning: 37 missing values were removed!
## Sample Size Mean Standard Deviation
## Ozone 116 42.1 33
OK!
But wait a minute: we are told there are 37 missing values and 116 “good” ones, for a total of 37+116=153. But there are supposed to be only 61 rows (or observations) in airAugSept. Let’s check:
length(Ozone)
## [1] 153
nrow(airAugSept)
## [1] 61
What’s wrong?
The problem is that Ozone still comes from the original airquality data set, but our Ozone is still hidden inside airAugSept. One solution would be to
attach(airAugSept)
## The following objects are masked from airquality:
##
## Day, Month, Ozone, Solar.R, Temp, Wind
but as R is warning us, now there are two Ozones, and it can get quite confusing. To be sure we work with the correct data we can do this:
detach(airquality)
stat.table(Ozone)
## Warning: 6 missing values were removed!
## Sample Size Mean Standard Deviation
## Ozone 55 44.9 35.2
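Another way to avoid the confusion altogether is not to attach anything and to name the data frame each time. A sketch in base R, with the subset built by the bracket notation rather than isubset so that it runs without Resma3:

```r
airAugSept <- airquality[airquality$Month >= 8, ]
nrow(airAugSept)                       # number of days in August and September
sum(is.na(airAugSept$Ozone))           # number of missing Ozone values among them
mean(airAugSept$Ozone, na.rm = TRUE)   # mean of the Ozone values that were measured
```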
Breakdown of the population of USA and Puerto Rico by age and gender, according to the 2000 Census
head(agesex)
## Age Male Female
## 1 Less than 1 29601 28442
## 2 1 29543 28130
## 3 2 30252 28881
## 4 3 30643 28867
## 5 4 31248 29799
## 6 5 31621 29696
tail(agesex)
## Age Male Female
## 98 97 282 418
## 99 98 189 296
## 100 99 123 196
## 101 100 - 104 258 448
## 102 105 - 109 47 59
## 103 Over 110 17 27
shows us that the data set consists of three vectors: the ages, the number of males and the number of females. The first one is a character vector (“Less than 1”) and the other two are numeric.
Let’s answer a few questions about the age and gender in PR in 2000:
attach(agesex)
sum(Male)
## [1] 1833577
sum(Female)
## [1] 1975033
Simple:
sum(Male)+sum(Female)
## [1] 3808610
We will need the sum of the Male and Female counts a few more times, so maybe we should do it this way:
People <- Male + Female
head(People)
## [1] 58043 57673 59133 59510 61047 61317
sum(People)
## [1] 3808610
Note
we now have another variable called People among the data sets, as we can see with
ls()
It will stay there until we close R. If we want to keep it for the next time we use R we need to save everything with File > Save Workspace. If we want to save the workspace but not this variable we first have to
rm(People)
People[1]
## [1] 58043
teenagers (Age from 13 to 19) are in rows 14 - 20, so
sum(People[14:20])
## [1] 433764
sum(Male)/sum(People)*100
## [1] 48.14294
round(sum(Male)/sum(People)*100, 1)
## [1] 48.1
Let’s start with
Male > Female
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE FALSE FALSE FALSE
and now we can find
sum(Male > Female)
## [1] 21
max(People)
## [1] 64795
People==max(People)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE FALSE FALSE FALSE
Age[People==max(People)]
## [1] " 10"
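Note == is the symbol for “is equal to”. The others are < (less than), > (greater than), <= (less than or equal to), >= (greater than or equal to) and != (not equal to).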
So the age group of 10 year olds is the largest. Why is this answer a bit strange?
Here is another way to do this:
order(People, decreasing = TRUE)
## [1] 11 21 19 18 20 10 6 8 17 5 22 23 16 7 13 12 15
## [18] 14 9 4 3 24 1 2 25 26 30 35 36 29 31 37 28 38
## [35] 27 41 40 34 39 33 32 43 44 46 42 45 51 53 47 48 54
## [52] 50 49 52 55 56 57 58 59 61 60 62 63 64 66 65 68 67
## [69] 69 70 72 71 73 74 75 76 77 78 79 80 81 82 83 84 85
## [86] 86 87 88 89 90 91 92 93 94 95 96 97 101 98 99 100 102
## [103] 103
head( agesex[ order(People, decreasing = TRUE), ])
## Age Male Female
## 11 10 33188 31607
## 21 20 32441 32154
## 19 18 32216 31705
## 18 17 32735 31070
## 20 19 32038 31744
## 10 9 31798 30101
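If only the position of the largest value is needed, base R's which.max gives it directly. A sketch using the first six values of People shown earlier (the name counts is made up for this example):

```r
counts <- c(58043, 57673, 59133, 59510, 61047, 61317)
which.max(counts)          # position of the largest value
counts[which.max(counts)]  # the largest value itself, same as max(counts)
```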
Another useful command is sort, which we can use to order one variable, by default from smallest to largest:
sort(People)
## [1] 44 106 319 485 700 706 847 1122 1332 1728 2285
## [12] 2694 3640 4466 5261 6278 7279 8414 8726 9132 10436 11659
## [23] 13449 14211 15293 16657 17514 19403 19673 20588 21421 21865 23123
## [34] 24982 25596 26222 26929 30387 30552 30690 32035 32737 34118 34715
## [45] 36268 38544 39146 40807 44265 45004 45280 45875 45926 46155 46311
## [56] 46579 48142 48987 49262 49499 50003 50009 50828 50951 51259 52213
## [67] 52395 52553 52795 52807 53293 53573 53709 54352 54815 55124 55313
## [78] 55754 56337 57673 58043 58725 59133 59510 60020 60112 60216 60221
## [89] 60456 60695 60707 60748 60786 61047 61221 61231 61317 61899 63782
## [100] 63805 63921 64595 64795
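sort and order are closely related: order gives the positions, sort gives the values. A short sketch with a made-up vector:

```r
v <- c(30, 10, 20)
sort(v)                        # 10 20 30
sort(v, decreasing = TRUE)     # 30 20 10
order(v)                       # 2 3 1: the smallest value is in position 2
v[order(v)]                    # same as sort(v)
```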