tidyverse.
R consists of many modules, or packages. For example:
mean, sort, cor, lm (linear model) - functions from the packages that come with R;
GPArotation - provides various methods of factor rotation in exploratory factor analysis.
RStudio is an integrated development environment that facilitates programming. It needs to be installed after installing R.
❗️There is a trial online version: https://rstudio.cloud/ - very slow though.
install.packages("readxl") # Run once on the same machine
library("readxl") # Run after every restart of R
# By the way, the hash sign (#) starts a comment; R ignores the rest of the line
You may think of R as a smart calculator.
5+5
[1] 10
The upper field is input (typed by a human), the lower field is output (printed by R).
[1] in the output is the index of the first value shown on that line.
In addition to computation, R can store different data. Every piece of data is called an “object”, and every object has a name.
❗️Object names should begin with a letter (not a number!).
❗️R differentiates between UPPER and lower case. (For example, Mydata and mydata are treated as two different objects.)
Objects can contain any information: numbers, texts, arrays of texts and numbers, tables, images, functions, other objects, etc.
Functions that read data in try to identify the type of each variable automatically, but it is better to check each variable’s class manually. To see the class, or structure, of an object, use str().
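As a small illustration of checking classes, the sketch below builds a toy data frame and inspects it. The object name toy and its columns are made up for this example.

```r
# A made-up data frame, for illustration only
toy <- data.frame(
  height = c(145, 203, 169),
  first.name = c("Ana", "Boris", "Claire"),
  smoker = c(FALSE, TRUE, TRUE)
)

str(toy)                       # structure: one line per variable with its type
class(toy)                     # class of the whole object
classes <- sapply(toy, class)  # class of every variable at once
classes
```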
[1] 1.10 2.50 3.23 4.00 5.00
numeric - used for continuous variables;
[1] "cat" "dog" "mouse"
string, or character - for text variables;
[1] horsebean horsebean horsebean
Levels: casein horsebean linseed meatmeal soybean sunflower
factor - for nominal variables;
[1] Primary Primary
Levels: Primary < Secondary < Higher
ordered factor - for ordinal variables;
[1] FALSE TRUE TRUE TRUE
logical - can take only TRUE or FALSE values.
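Each of these types can also be created by hand. The sketch below is one way to do it. The object names (x.num, x.chr and so on) and the values are invented for the example.

```r
x.num <- c(1.1, 2.5, 3.23, 4, 5)                         # numeric
x.chr <- c("cat", "dog", "mouse")                        # character
x.fac <- factor(c("horsebean", "casein", "horsebean"))   # factor (nominal)
x.ord <- factor(c("Primary", "Higher", "Primary"),
                levels = c("Primary", "Secondary", "Higher"),
                ordered = TRUE)                          # ordered factor (ordinal)
x.log <- x.num > 2                                       # logical

class(x.ord)        # an ordered factor has two classes: "ordered" and "factor"
levels(x.ord)       # levels are kept in the order given above
x.ord < "Higher"    # comparisons make sense only for ordered factors
```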
c() - for separate variables;
data.frame - the most often used table of data;
matrix - also two-dimensional, often used for matrix manipulations;
array - multidimensional tables, for example, a 3-dimensional or layered table;
list - can contain different kinds of data, or other lists.
You can give names to each element in the data structures.
Each object can be created manually with a function of the same name:
c(1,2,3) # 'c' stands for "concatenate", it puts together several values and coerces them to the same type.
[1] 1 2 3
data.frame(
height=c(145,203,169),
first.name=c("Ana", "Boris", "Claire")
)
height first.name
1 145 Ana
2 203 Boris
3 169 Claire
[,1] [,2] [,3] [,4] [,5]
[1,] 1 7 13 19 25
[2,] 2 8 14 20 26
[3,] 3 9 15 21 27
[4,] 4 10 16 22 28
[5,] 5 11 17 23 29
[6,] 6 12 18 24 30
, , 1
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 2 5 8 11 14
[3,] 3 6 9 12 15
, , 2
[,1] [,2] [,3] [,4] [,5]
[1,] 16 19 22 25 28
[2,] 17 20 23 26 29
[3,] 18 21 24 27 30
$`the name of the first element`
[1] 1 2 3 4 5
$`the name is not required though`
[1] "a" "c"
[[3]]
[1] "Just a single string"
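The matrix, array, and list printouts above appear without the calls that produced them. One way to get the same output, offered as a reconstruction rather than the original code, is sketched here. The object names m, a, and l are made up.

```r
m <- matrix(1:30, nrow = 6)          # 6 rows, 5 columns, filled column by column
a <- array(1:30, dim = c(3, 5, 2))   # two layers, each 3 rows by 5 columns
l <- list(
  `the name of the first element` = 1:5,
  `the name is not required though` = c("a", "c"),
  "Just a single string"
)

m
a
l
```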
We can create objects by using operators. Below, two objects, a and b, are assigned values. Then we can use the object names instead of the values to proceed with the analysis:
a <- 5
b <- 6
a + b
[1] 11
<- - assignment;
= - same as <-, but under R conventions it is not recommended for object assignment;
== - checks if two values are equal (do not confuse with a single =!);
-, +, /, *, ^ - basic math operators.
Functions are typical operations saved to objects; they are created to avoid repeating the same code multiple times.
Functions drive R.
For example, mean() computes the mean, sd() the standard deviation, and c() concatenates several values into a vector and coerces them to the same type.
c(1, 2, "R", "", 0)
[1] "1" "2" "R" "" "0"
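Coercion always goes toward the more general type: logical values become numbers, and numbers become text. A short sketch follows, with made-up object names.

```r
mixed1 <- c(TRUE, 2)     # logical + numeric   -> numeric, TRUE becomes 1
mixed2 <- c(1, "R")      # numeric + character -> character
class(mixed1)
class(mixed2)

# Converting text back to numbers: "R" is not a number, so it becomes NA
# (R also gives a warning, silenced here)
back <- suppressWarnings(as.numeric(c("1", "2", "R")))
back
```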
Every function has some input information, or arguments (for example, data) and output (for example, an estimated statistical model).
Most functions have the following form:
FUNCTION_NAME(argument1 = "default value 1",
argument2 = "default value 2",
...)
Some arguments have no default value but are required; the user must specify them. For example, the function c() has no default values at all: every argument is a value.
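To see defaults at work, here is a minimal sketch of a user-defined function. The function my.rescale and its arguments are hypothetical, written only for this example: x has no default and must be supplied, while from and to have defaults.

```r
# Hypothetical function: rescales a numeric vector to the range [from, to]
my.rescale <- function(x, from = 0, to = 1) {
  (x - min(x)) / (max(x) - min(x)) * (to - from) + from
}

r1 <- my.rescale(c(2, 4, 6))             # defaults used: from = 0, to = 1
r2 <- my.rescale(c(2, 4, 6), to = 100)   # one default overridden by name
r1
r2
```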
The function subset takes three main arguments (only x is strictly required):
subset(x, subset, select)
x - the data to subset;
subset - which observations of x should be kept (the filtering condition);
select - which variables of x should be kept.
Functions return a value, print something in the terminal, or draw a plot. What a function returns is usually described in the Help documentation under the section “Value”.
Values can be assigned to arguments only with the = operator.
❗️In order to check equality use ==. For example, a == b returns TRUE or FALSE, whereas a = b assigns to object a whatever value b has.
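The difference is easy to see by running both. In this sketch, before and after are made-up names that store the results of the comparisons.

```r
a <- 5
b <- 6
before <- a == b   # comparison: a and b are left unchanged
a = b              # assignment: a now holds the value of b
after <- a == b
before
after
```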
If an argument has a default value and we agree with it, the argument can be omitted. Names of arguments can be omitted too; in that case the values must be given strictly in the order of the arguments (the order is shown in Help).
# These two lines are equivalent:
subset(x = mydata, subset = age > 65, select = c("health", "income"))
subset( mydata, age > 65, c("health", "income"))
Spaces do not affect the functioning of R. Use spaces freely to format R code.
In most cases you want to save a function output to an object:
# This line saves a subset of older respondents and only two variables into a new object named old.respondents:
old.respondents <- subset(mydata, age > 65, c("health", "income"))
❗️If you don’t save an output into an object, it is printed in the console window and then lost: you can’t access it anymore. This is convenient only when you need to see something small, such as a single number.
If you save the result into an object, nothing is usually printed.
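The calls above need a mydata object to run. The sketch below uses a small made-up data frame in its place (the values are invented) to show that the named and positional forms give the same result, and that a saved result is printed only when asked for.

```r
# Made-up data standing in for mydata
mydata <- data.frame(
  age    = c(70, 34, 81, 66, 25),
  health = c(3, 1, 4, 2, 1),
  income = c(1200, 2500, 900, 1500, 1800)
)

by.name     <- subset(x = mydata, subset = age > 65, select = c("health", "income"))
by.position <- subset(mydata, age > 65, c("health", "income"))
identical(by.name, by.position)   # both forms return the same object

old.respondents <- by.position    # nothing is printed on assignment
old.respondents                   # typing the name prints the stored object
```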
data.frame
Any kind of data object can be indexed. data.frame is indexed by rows and columns:
flat.table[ROWS, COLUMNS]
To access the value stored in the first row and fifth column:
PT[1, 5]
[1] 110158
Rows and columns may be accessed by their names as well. To return the first value of the variable “idno”:
PT[1, "idno"]
[1] 110158
In order to access all the values in the variable, just omit the row specification (but leave the comma!):
PT[, "idno"]
Equivalently, a variable in a data.frame can be accessed with the $ sign:
PT$idno
The colon operator : is useful for creating a sequence of numbers:
1:5
[1] 1 2 3 4 5
To return the first five rows of the PT data and only two variables, idno and cntry:
PT[1:5, c("idno", "cntry")]
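The PT data is not included here, so the sketch below repeats the same indexing patterns on a small made-up data frame. The name toy and all values are invented.

```r
# Made-up stand-in for the PT data
toy <- data.frame(
  idno  = c(101, 102, 103, 104, 105, 106),
  cntry = rep("PT", 6),
  agea  = c(23, 45, 67, 34, 51, 78)
)

toy[1, 3]                              # first row, third column
toy[1, "agea"]                         # the same cell, column addressed by name
toy[, "idno"]                          # the whole variable
toy$idno                               # the same variable via $
first5 <- toy[1:5, c("idno", "cntry")] # first five rows, two variables
first5
```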
Filtering rows/observations by the values of some variable requires two steps.
Step 1. Create a filtering variable of the class logical.
Logical values can be created using the == sign:
2 + 2 == 5
[1] FALSE
The same can be done with a vector (variable):
c(1,2,1) == 2
[1] FALSE TRUE FALSE
Finally, to create a filtering variable that is TRUE for all females and FALSE for all males:
filter.female <- PT$gndr == 2 # double equality sign!
summary(filter.female)
Mode FALSE TRUE
logical 530 740
Step 2. Use the filtering variable at the row index:
portuguese.females <- PT[filter.female, ]
The two steps may be combined in a single expression:
portuguese.females <- PT[PT$gndr == 2, ]
subset()
An equivalent result may be obtained with the subset() function:
portuguese.females <-
subset(
x = PT, #data
subset = gndr == 2 #criteria of row selection
)
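One caveat: the two approaches agree only when the filtering variable has no missing values. Bracket indexing returns a row of NAs for every NA in the filter, while subset() silently drops those rows. A sketch on made-up data (toy and its values are invented):

```r
toy <- data.frame(idno = 1:5, gndr = c(1, 2, NA, 2, 1))

by.index  <- toy[toy$gndr == 2, ]          # the NA in the filter yields a row of NAs
by.subset <- subset(toy, gndr == 2)        # subset() drops rows where the filter is NA
by.which  <- toy[which(toy$gndr == 2), ]   # which() keeps only the TRUE positions

nrow(by.index)
nrow(by.subset)
nrow(by.which)
```

If the data may contain NAs, subset() or which() is usually the safer choice for filtering.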
Lists are indexed using double brackets.
To get the first element of the list:
one.list[[1]]
or, if the list elements have names, by name:
one.list[["my.named.element"]]
or, equivalently:
one.list$my.named.element
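A runnable sketch of these three forms, using a made-up list (its contents are invented for the example):

```r
# Made-up list, for illustration only
one.list <- list(
  my.named.element = 1:5,
  another.element  = c("a", "c"),
  "Just a single string"
)

one.list[[1]]                     # first element, by position
one.list[["my.named.element"]]    # the same element, by name
one.list$my.named.element         # the same element, via $
one.list[[3]]                     # an unnamed element can be reached only by position
```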
If a list element is subsettable, it can be indexed directly. For example, if the second element is a data.frame, one can return its first ten rows:
one.list [[ 2 ]] [1:10, ]
By the same token, if the first element of the list is another list, the fifth element of the nested list is:
one.list[[ 1 ]] [[ 5 ]]
The function names() returns a vector of the names of an object's elements:
names(PT)[1:10]
[1] "name" "essround" "edition" "proddate" "idno" "cntry"
[7] "nwspol" "netusoft" "netustm" "ppltrst"
Variable names can be replaced by direct assignment of new names:
names(PT)[1:3] <- c("IDnumber", "ESSround", "Version")
❗️ Names can include spaces, but it’s safer to keep them short and spaceless. A name with a space has to be wrapped in backticks or quotes:
PT$`ID number`
or PT[, "ID number"]
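The same renaming can be tried on a small made-up data frame (toy and its values are invented):

```r
toy <- data.frame(idno = 1:3, essround = c(8, 8, 8), edition = c("2.1", "2.1", "2.1"))

names(toy)
names(toy)[1:2] <- c("ID number", "ESSround")   # replace the first two names
names(toy)

toy$`ID number`        # a name with a space needs backticks after $
toy[, "ID number"]     # or quotes inside the brackets
```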
plot() is a generic function for building base R plots; it applies different methods depending on the class of the object passed to it.
plot(PT$big.city)
plot(PT$agea, PT$eduyrs)
abline(17.7, -0.1463, col="red")
hist(PT$eduyrs)
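In abline(a, b), the first number is the intercept and the second is the slope of the line. The text does not say where the two numbers above come from. One common source is a fitted linear model, and the sketch below assumes that, using cars, a dataset that ships with R, instead of the PT data.

```r
# 'cars' is a built-in dataset (speed and stopping distance)
fit <- lm(dist ~ speed, data = cars)   # linear model
coef(fit)                              # intercept and slope

plot(cars$speed, cars$dist)            # two numeric variables: scatterplot
abline(fit, col = "red")               # regression line taken from the model
# The same line drawn from the numbers themselves:
# abline(coef(fit)[1], coef(fit)[2], col = "red")
hist(cars$dist)                        # one numeric variable: histogram
```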
library(lattice)
xyplot(eduyrs ~ agea | bigc, PT)
library(lattice)
cloud(impfree.reversed ~ eduyrs*agea | bigc, PT)
levelplot(eduyrs ~ agea*impfree.reversed, PT)
library(lattice)
bwplot(eduyrs ~ big.city, PT, horizontal=FALSE)
Whiskers extend to the most extreme values within 1.5*IQR of the box; the box spans Q1 to Q3; the dot is the median.
histogram(as.vector(PT$eduyrs))
densityplot(as.vector(PT$eduyrs))
car Scatterplot
# Enriched scatterplot
library("car")
scatterplot(eduyrs ~ agea, PT)
corrplot Correlation plots
#install.packages("corrplot")
library("corrplot")
corrplot(cor(PT[,501:511], use = "complete.obs"))
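The use = "complete.obs" argument matters whenever the data has missing values: correlations are then computed only from rows that have no NA in any of the selected variables. A sketch with base R only, on made-up data (toy and its values are invented):

```r
toy <- data.frame(
  x = c(1, 2, 3, 4, NA),
  y = c(2, 4, 5, 8, 10),
  z = c(5, 3, NA, 1, 0)
)

cor(toy)                                 # pairs involving missing values come out as NA
cc <- cor(toy, use = "complete.obs")     # only rows 1, 2 and 4 are used
cc
```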
Next time!