To start, run
ls()
This shows you a “listing” of the objects (data, routines etc.) in the current workspace. (Most likely there is nothing there right now.)
Everything in R is either a data set or a function. It is a function if it is supposed to do something (calculate something, show you something like a graph, and so on). If it is a function, it ALWAYS NEEDS (). Sometimes there is something in between the parentheses, as in
mean(x)
## [1] 6
Sometimes there isn’t, as in ls(). But the () has to be there anyway.
If you have worked for a while you might have things you need to save. Do that by clicking on
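A small illustration of why the parentheses matter. The vector `v` is made up for this sketch: with the (), R runs the function; without them, the name just refers to the function object itself.

```r
v <- c(4, 6, 8)      # hypothetical example data

mean(v)              # with (): the function is called and returns 6
is.function(mean)    # without (): 'mean' is only the function object, so this is TRUE
```

Typing `mean` on its own in the console prints the function instead of running it.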
File > Save
RStudio has a nice recall feature, using the up and down arrow keys. Also, clicking on the History tab shows you the recently run commands. Finally, typing the first few letters of a command in the console and then pressing Ctrl+Up (Ctrl together with the up arrow key) shows you a list of the previous commands that start with those letters.
R is case-sensitive, so a and A are two different things.
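A quick check of this, using two throwaway names chosen only for this sketch:

```r
a <- 1     # lower-case a
A <- 2     # upper-case A is a completely separate object

a          # 1
A          # 2
a == A     # FALSE
```

The same holds for function names, so `mean` and `Mean` are not interchangeable.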
Often during a session you create objects that you need only for a short time. When you no longer need them use rm to get rid of them:
x <- 10
x^2
## [1] 100
rm(x)
The <- is the assignment operator in R: it assigns what is on the right to the symbol on the left. (Think of an arrow pointing to the left.)
For a few numbers the easiest thing is to just type them in:
x <- c(10, 2, 6, 9)
x
## [1] 10 2 6 9
c() is a function that takes the objects inside the () and combines them into one single object (a vector).
The most basic type of data in R is a vector, simply a sequence of values.
Say we want the numbers 1.5, 3.6, 5.1 and 4.0 in an R vector called x, then we can type
x <- c(1.5, 3.6, 5.1, 4.0)
x
## [1] 1.5 3.6 5.1 4.0
Often the numbers have a structure one can make use of:
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
10:1
## [1] 10 9 8 7 6 5 4 3 2 1
1:20*2
## [1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
c(1:10, 1:10*2)
## [1] 1 2 3 4 5 6 7 8 9 10 2 4 6 8 10 12 14 16 18 20
Sometimes you need parentheses:
n <- 10
1:n-1
## [1] 0 1 2 3 4 5 6 7 8 9
1:(n-1)
## [1] 1 2 3 4 5 6 7 8 9
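The reason for the difference is operator precedence: R evaluates `:` before `+`, `-`, `*` and `/`. A short sketch, with a hypothetical value `m` so that the `n` above is left alone:

```r
m <- 5                           # hypothetical value, chosen for this sketch

identical(1:m - 1, (1:m) - 1)    # TRUE: the sequence 1:m is built first, then 1 is subtracted
1:(m - 1)                        # 1 2 3 4: the parentheses force the subtraction first
1:m * 2                          # 2 4 6 8 10: the same rule applies to multiplication
```

When in doubt, adding parentheses never hurts.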
The rep (“repeat”) command is very useful:
rep(1, 10)
## [1] 1 1 1 1 1 1 1 1 1 1
rep(1:3, 10)
## [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
rep(1:3, each=3)
## [1] 1 1 1 2 2 2 3 3 3
rep(c("A", "B", "C"), c(4,7,3))
## [1] "A" "A" "A" "A" "B" "B" "B" "B" "B" "B" "B" "C" "C" "C"
What does this do?
rep(1:10, 1:10)
To find out how many elements a vector has use the length command:
x <- c(1.4, 5.1, 2.0, 6.8, 3.5, 2.1, 5.6, 3.3, 6.9, 1.1)
length(x)
## [1] 10
The elements of a vector are accessed with the bracket [ ] notation:
x[3]
## [1] 2
x[1:3]
## [1] 1.4 5.1 2.0
x[c(1, 3, 8)]
## [1] 1.4 2.0 3.3
x[-3]
## [1] 1.4 5.1 6.8 3.5 2.1 5.6 3.3 6.9 1.1
x[-c(1, 2, 5)]
## [1] 2.0 6.8 2.1 5.6 3.3 6.9 1.1
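A few more things the bracket notation can do, shown on a short made-up vector `v` (not part of the notes above):

```r
v <- c(1.4, 5.1, 2.0, 6.8)    # hypothetical short vector

v[length(v)]                  # the last element: 6.8
v[-length(v)]                 # everything except the last: 1.4 5.1 2.0
v[2] <- 9.9                   # brackets on the left of <- change a single value
v                             # 1.4 9.9 2.0 6.8
v[10]                         # NA: there is no tenth element
```

Note that asking for a position that does not exist gives NA rather than an error.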
Instead of numbers, a vector can also consist of characters (letters, numbers, symbols etc.). These are identified by quotes:
c("A", "B", 7, "%")
## [1] "A" "B" "7" "%"
A vector is either numeric or character, but never both (see how the 7 was changed to “7”).
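The `class` function reports which type a vector has, which makes the silent conversion visible. A minimal sketch:

```r
class(c(1, 2, 3))          # "numeric"
class(c("A", "B", "%"))    # "character"
class(c("A", "B", 7))      # "character": one character element turns the whole vector into character
```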
You can turn one into the other (if possible) as follows:
x <- 1:10
x
## [1] 1 2 3 4 5 6 7 8 9 10
as.character(x)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
x <- c("1", "5", "10", "-3")
x
## [1] "1" "5" "10" "-3"
as.numeric(x)
## [1] 1 5 10 -3
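The "if possible" matters: text that does not look like a number cannot be converted. A sketch with a made-up vector `s`, where R returns NA (its code for a missing value) for the entry it cannot convert:

```r
s <- c("1", "5", "ten", "-3")             # hypothetical input; "ten" is not a number

num <- suppressWarnings(as.numeric(s))    # without suppressWarnings() R also prints a warning
num                                       # 1 5 NA -3
is.na(num)                                # FALSE FALSE TRUE FALSE
```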
A third type of data is logical, with values either TRUE or FALSE.
x <- 1:10
x
## [1] 1 2 3 4 5 6 7 8 9 10
x > 4
## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
These are often used as conditions:
x[x>4]
## [1] 5 6 7 8 9 10
This, as we will see shortly, is EXTREMELY useful!
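Logical vectors can also be counted and combined. The following sketch uses the same values as above under a different, hypothetical name `v`:

```r
v <- 1:10

sum(v > 4)          # 6: TRUE counts as 1 and FALSE as 0
mean(v > 4)         # 0.6: the proportion of values greater than 4
which(v > 4)        # 5 6 7 8 9 10: the positions where the condition holds
v[v > 4 & v < 8]    # 5 6 7: conditions are combined with & (and) or | (or)
```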
Data frames are the basic format for data in R. They are essentially vectors of equal length put together as columns.
A data frame can be created as follows:
df <- data.frame(
  Gender=c("M", "M", "F", "F", "F"),
  Age=c(23, 25, 19, 22, 21),
  GPA=c(3.5, 3.7, 2.9, 2.8, 3.1)
)
df
## Gender Age GPA
## 1 M 23 3.5
## 2 M 25 3.7
## 3 F 19 2.9
## 4 F 22 2.8
## 5 F 21 3.1
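Once a data frame exists, a few standard commands describe its shape. The sketch below rebuilds the same small data set under a hypothetical name `d`:

```r
# Hypothetical copy of the data frame above
d <- data.frame(
  Gender = c("M", "M", "F", "F", "F"),
  Age = c(23, 25, 19, 22, 21),
  GPA = c(3.5, 3.7, 2.9, 2.8, 3.1)
)

nrow(d)        # 5: the number of rows
ncol(d)        # 3: the number of columns
colnames(d)    # "Gender" "Age" "GPA"
str(d)         # compact overview: one line per column with its type
```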
The most general data structures are lists. They are simply a collection of objects. There are no restrictions on what those objects are.
lst <- list(
  Gender=c("M", "M", "F", "F", "F"),
  Age=c(23, 25, 19, 22, 21, 26, 34),
  f=function(x) x^2,
  list(A=c(1, 1), B=c("X", "X", "Y"))
)
lst
## $Gender
## [1] "M" "M" "F" "F" "F"
##
## $Age
## [1] 23 25 19 22 21 26 34
##
## $f
## function(x) x^2
##
## [[4]]
## [[4]]$A
## [1] 1 1
##
## [[4]]$B
## [1] "X" "X" "Y"
A data frame is a list with an additional requirement, namely that the elements of the list be of equal length.
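Both halves of that statement can be checked directly. In this sketch `small` is a made-up data frame, and `tryCatch` is used only so that the failed attempt does not stop the script:

```r
small <- data.frame(a = 1:3, b = c("x", "y", "z"))    # hypothetical example

is.list(small)    # TRUE: a data frame is a special kind of list
length(small)     # 2: the number of list elements, that is, columns

# Columns with 3 and 2 values do not fit together, so data.frame() refuses
res <- tryCatch(data.frame(a = 1:3, b = 1:2),
                error = function(e) "error")
res               # "error"
```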
Consider the upr data set. This is the application data for all the students who applied and were accepted to UPR-Mayaguez between 2003 and 2013.
dim(upr)
## [1] 23666 16
This tells us that there were 23666 applications and that for each student there are 16 pieces of information.
colnames(upr)
## [1] "ID.Code" "Year" "Gender" "Program.Code"
## [5] "Highschool.GPA" "Aptitud.Verbal" "Aptitud.Matem" "Aprov.Ingles"
## [9] "Aprov.Matem" "Aprov.Espanol" "IGS" "Freshmen.GPA"
## [13] "Graduated" "Year.Grad." "Grad..GPA" "Class.Facultad"
This shows us the variables.
head(upr, 3)
## ID.Code Year Gender Program.Code Highschool.GPA Aptitud.Verbal
## 1 00C2B4EF77 2005 M 502 3.97 647
## 2 00D66CF1BF 2003 M 502 3.80 597
## 3 00AB6118EB 2004 M 1203 4.00 567
## Aptitud.Matem Aprov.Ingles Aprov.Matem Aprov.Espanol IGS Freshmen.GPA
## 1 621 626 672 551 342 3.67
## 2 726 618 718 575 343 2.75
## 3 691 424 616 609 342 3.62
## Graduated Year.Grad. Grad..GPA Class.Facultad
## 1 Si 2012 3.33 INGE
## 2 No NA NA INGE
## 3 No NA NA CIENCIAS
This shows us the first three cases.
Let’s say we want to find the number of males and females. We can use the table command for that:
table(Gender)
## Error: object 'Gender' not found
What happened? Right now R does not know what Gender is because it is “hidden” inside the upr data set. Think of upr as a box that is currently closed, so R can’t look inside and see the column names. We need to open the box first:
attach(upr)
table(Gender)
## Gender
## F M
## 11487 12179
There is also a detach command to undo an attach, but this is not usually needed because the attach goes away when you close R.
Note: you need to attach a data frame only once in each session working with R.
Note: Say you are working first with a data set “students 2016” which has a column called Gender, and you attached it. Later (but in the same R session) you start working with a data set “students 2017”, which also has a column called Gender, and you attach this one as well. If you use Gender now, it will be the one from “students 2017”.
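When two data sets share a column name, it can be safer to say explicitly which one is meant. The sketch below uses two tiny made-up data frames (`students2016` and `students2017` are hypothetical names) together with the `$` notation and the `with` function, neither of which needs an attach:

```r
# Two hypothetical data frames that share a column name
students2016 <- data.frame(Gender = c("F", "M", "M"))
students2017 <- data.frame(Gender = c("F", "F", "F", "M"))

# The $ notation names the data frame explicitly
table(students2016$Gender)
table(students2017$Gender)

# with() opens the "box" only for the duration of one command
with(students2017, table(Gender))
```

Either form leaves no doubt about which Gender column is being counted.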
Consider the following data frame (not a real data set):
students
## Age GPA Gender
## 1 22 3.1 Male
## 2 23 3.2 Male
## 3 20 2.1 Male
## 4 22 2.1 Male
## 5 21 2.3 Female
## 6 21 2.9 Male
## 7 18 2.3 Female
## 8 22 3.9 Male
## 9 21 2.6 Female
## 10 18 3.2 Female
Here each single piece of data is identified by its row number and its column number. So for example in row 2, column 2 we have “3.2”, and in row 6, column 3 we have “Male”.
As with the vectors before we can use the [ ] notation to access pieces of a data frame, but now we need to give it both the row and the column number, separated by a comma:
students[6, 3]
## [1] "Male"
As before we can pick more than one piece:
students[1:5, 3]
## [1] "Male" "Male" "Male" "Male" "Female"
students[1:5, 1:2]
## Age GPA
## 1 22 3.1
## 2 23 3.2
## 3 20 2.1
## 4 22 2.1
## 5 21 2.3
students[-c(1:5), 3]
## [1] "Male" "Female" "Male" "Female" "Female"
students[1, ]
## Age GPA Gender
## 1 22 3.1 Male
students[, 2]
## [1] 3.1 3.2 2.1 2.1 2.3 2.9 2.3 3.9 2.6 3.2
students[, -3]
## Age GPA
## 1 22 3.1
## 2 23 3.2
## 3 20 2.1
## 4 22 2.1
## 5 21 2.3
## 6 21 2.9
## 7 18 2.3
## 8 22 3.9
## 9 21 2.6
## 10 18 3.2
Another way of subsetting a data frame is by using the $ notation:
students$Gender
## [1] "Male" "Male" "Male" "Male" "Female" "Male" "Female" "Male"
## [9] "Female" "Female"
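The $ notation combines naturally with logical conditions to pick out rows. The sketch below rebuilds the students data frame from the printout above (the construction itself is an assumption, only the values are taken from the table):

```r
# Rebuilt from the printed table above
students <- data.frame(
  Age = c(22, 23, 20, 22, 21, 21, 18, 22, 21, 18),
  GPA = c(3.1, 3.2, 2.1, 2.1, 2.3, 2.9, 2.3, 3.9, 2.6, 3.2),
  Gender = c("Male", "Male", "Male", "Male", "Female",
             "Male", "Female", "Male", "Female", "Female")
)

students[students$Gender == "Female", ]          # all rows for the female students
students$GPA[students$Age > 21]                  # GPAs of students older than 21: 3.1 3.2 2.1 3.9
mean(students$GPA[students$Gender == "Male"])    # mean GPA of the male students
```

Note the comma in the first command: the condition selects rows, and the empty spot after the comma means "all columns".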
The double bracket and the $ notation also work for lists:
lst <- list(
  Gender=c("M", "M", "F", "F", "F"),
  Age=c(23, 25, 19, 22, 21, 26, 34),
  f=function(x) x^2,
  list(A=c(1, 1), B=c("X", "X", "Y"))
)
lst[[4]][[2]]
## [1] "X" "X" "Y"
lst$Gender
## [1] "M" "M" "F" "F" "F"
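Single and double brackets behave differently on a list, which is a common source of confusion. A sketch with a smaller made-up list `l2`:

```r
# Hypothetical list, smaller than the one above
l2 <- list(Gender = c("M", "F"), f = function(x) x^2)

class(l2["Gender"])      # "list": single brackets give back a (shorter) list
class(l2[["Gender"]])    # "character": double brackets give the element itself
l2$f(3)                  # 9: an element that is a function can be called directly
names(l2)                # "Gender" "f"
```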
R allows us to apply mathematical functions to a whole vector:
x <- 1:10
2*x
## [1] 2 4 6 8 10 12 14 16 18 20
x^2
## [1] 1 4 9 16 25 36 49 64 81 100
log(x)
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
## [8] 2.0794415 2.1972246 2.3025851
sum(x)
## [1] 55
y <- 21:30
x+y
## [1] 22 24 26 28 30 32 34 36 38 40
x^2+y^2
## [1] 442 488 538 592 650 712 778 848 922 1000
mean(x+y)
## [1] 31
Let’s try something strange:
c(1, 2, 3) + c(1, 2, 3, 4)
## [1] 2 4 6 5
So R notices that we are trying to add a vector of length 3 to a vector of length 4. This should not work, but it actually does (R only issues a warning)!
When it runs out of values in the shorter vector, R simply starts over again from its beginning.
In general this is more likely a mistake on your part, so check that this is what you really wanted to do!
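This behaviour is usually called recycling. The sketch below shows the harmless cases, where the longer length is a multiple of the shorter one and R stays silent, and then captures the warning from the mismatched case (`tryCatch` is used here only to store the warning text in `msg`):

```r
c(1, 2) + c(10, 20, 30, 40)    # 11 22 31 42: length 4 is a multiple of 2, no warning
1:6 * 2                        # 2 4 6 8 10 12: the single number 2 is reused six times

# Lengths 3 and 4 do not fit evenly, so R issues a warning
msg <- tryCatch(c(1, 2, 3) + c(1, 2, 3, 4),
                warning = function(w) conditionMessage(w))
msg                            # the text of the warning
```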
A very useful routine in R is apply, and its relatives.
Let’s say we have the following matrix:
Age <- matrix(sample(20:30, size=100, replace=TRUE), 10, 10)
Age[1:5, 1:5]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 20 27 27 20 29
## [2,] 25 23 26 25 27
## [3,] 25 26 20 21 24
## [4,] 25 30 23 22 20
## [5,] 23 26 30 30 26
and we want to find the sums of the ages in each column. Easy:
sum(Age[, 1])
## [1] 249
sum(Age[, 2])
## [1] 263
…
sum(Age[, 10])
## [1] 269
or much easier
apply(Age, 2, sum)
## [1] 249 263 252 226 251 248 271 252 271 269
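The second argument of apply decides the direction: 1 means rows, 2 means columns. A small sketch on a made-up matrix `M` (the seed is an arbitrary choice so the example is repeatable):

```r
set.seed(1)                                                # arbitrary seed, for repeatability
M <- matrix(sample(20:30, size = 12, replace = TRUE), nrow = 3, ncol = 4)

apply(M, 2, sum)    # one sum per column (4 values)
apply(M, 1, sum)    # one sum per row (3 values)
colSums(M)          # built-in shortcut that gives the same column sums
apply(M, 2, max)    # any function of a vector can be used, for example the maximum
```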
There are a number of apply routines for different data formats.
Let’s say we want to find the mean Highschool GPA:
mean(Highschool.GPA)
## [1] 3.65861
But what if we want to do this for each year separately? Notice that apply doesn’t work here because the years are not in separate columns. Instead we can use
tapply(Highschool.GPA, Year, mean)
## 2003 2004 2005 2006 2007 2008 2009 2010
## 3.646627 3.642484 3.652774 3.654729 3.628072 3.648552 3.642946 3.665298
## 2011 2012 2013
## 3.685485 3.695046 3.710843
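To see what tapply does without needing the upr data, here is a sketch on a tiny made-up data set (`gpa` and `year` are hypothetical vectors): the first argument holds the values, the second says which group each value belongs to, and the third is the function applied to each group.

```r
# Hypothetical mini data set: six GPAs from three years
gpa  <- c(3.0, 3.4, 3.8, 3.6, 2.9, 3.1)
year <- c(2003, 2003, 2004, 2004, 2005, 2005)

tapply(gpa, year, mean)      # mean GPA per year: 3.2, 3.7 and 3.0
tapply(gpa, year, length)    # number of values per year: 2, 2 and 2
```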
There are some routines that I wrote for myself and use a lot:
kable.nice <- function (x, do.row.names = TRUE, col.names = NA, font.size = 15) {
  # packages needed for the pipe and the table styling
  library(tidyverse)
  library(kableExtra)
  kable(x, row.names = do.row.names, col.names = col.names) %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE,
                  font_size = font.size)
}
ggcurve <- function (plt, fun, A = 0, B = 1,
                     npoints = 250, col = "blue", size = 1.2) {
  # evaluate fun on an evenly spaced grid from A to B
  x <- seq(A, B, length = npoints)
  y <- fun(x)
  dta <- data.frame(x = x, y = y)
  # if no plot is supplied, start a new one
  if (missing(plt))
    plt <- ggplot(aes(x, y), data = dta)
  # add the curve as a line layer
  plt + geom_line(aes(x, y), colour = col, size = size, data = dta)
}