Recap

Read the sample code and try to understand what each line means:

library("car")

a <- 5

PT$impfree2 <- PT$impfree ^ 2

PT2 <- PT[PT$domicil == 6, ]

PT$female <- Recode(PT$gndr, "1 = 0; 2 = 1; else = NA")

plot(PT$agea, PT$eduyrs)

mean(PT$impfun, na.rm = T )

👻 Missing values

Types

The main type of missing values is NA (Not Available).

Other types of missing values: - NaN (Not A Number) - appears as a result of impossible computation, e.g.

log(-1) # logarithm of negative value

[1] NaN

NULL means void (usually a placeholder for some value or removed value), works differently compared to NA. For example, NULL is returned when non-existing variable is requested:

PT$gggggggggg

NULL

# Likewise, to delete variable from data.frame:

PT$impfun<- NULL

Inf - means infinity, but per se cannot be used, thus canbe treated as missing value.

100/0 # divide by zero

[1] Inf

~~It is desirable to have all the missing values in data to be NA.~~

Detect NAs

is.na() returns logical values of the same size as its single argument. TRUE means the value is NA, FALSE for anything else.

var.with.missings <- c(5, NA, NaN, Inf, NULL, "", " ") # creates a vector (variable) of value with different kinds of missing values.

is.na(var.with.missings) # check which of them are NAs

[1] FALSE  TRUE FALSE FALSE FALSE FALSE

Count the number of missing values

Logical vairables (TRUE/FALSE) are automatically converted to numeric 0/1 in R base functions. We can use both is.na() and sum() to compute the number of missing values in the variable.

miss.rev <- is.na(PT$impfree) # created a logical variables with TRUE for missing values.

sum(miss.rev) # sums up the TRUEs in miss.rev, that is, counts the number of missing values.

[1] 9

# Or, the two above lines in one expression:

sum(is.na(PT$impfree))

[1] 9

Operator ! inverts logical variable. Thus, to count the number of non-missing values in variable:

sum(!is.na(PT$impfree))  # how many NON-missing values does PT$impfree contains?

[1] 1261

Delete, replace

LISTWISE deletion

na.omit() excludes all the missing values from the vector. If applied to a data.frame, it excludes missing values listwise (i.e. removes the whole observation, if at least one value is NA. ~~Dangerous!~~):

# For vector/variable

length(PT$impfree) #Check the length of the variable

[1] 1270

imp.free.no.missings <- na.omit(PT$impfree) # drop the missing values from the variable

length(imp.free.no.missings) # Check the length of the variable with excluded missing values.

[1] 1261

#  For data.frame

nrow(PT) #Check the number of rows (observations)

[1] 1270

PT.nomissings <- na.omit(PT) # Drop the missing values listwise

nrow(PT.nomissings)  # Check the number of rows (observations) after listwise exclusion.

[1] 0

Recode

For example, we need to recode variable PT$impfree’s response 6 to NA.

PT$impfree[PT$impfree == 6] <- NA

# or, equivalently
library("car")
PT$impfree <- Recode(PT$impfree, "6 = NA")

In character variables

Whitespaces and empty strings are not considered missing values!

b <- c("One", "Two", "", " ", NA)
is.na(b)

[1] FALSE FALSE FALSE FALSE  TRUE

But you can replace empty strings and whitespaces with NA:

b[b==""] <- NA # filter out empty strings and assign NA 
b[b==" "] <- NA # filter out whitespaces and assign NA 
is.na(b) # check is it worked

[1] FALSE FALSE  TRUE  TRUE  TRUE

In different functions

For example, in table(), argument useNA can take values of always (add column with NAs), ifany (add only if there are NAs), and no (do not add, it is a default).
In aggregate(), argument is called na.action, takes values na.omit (delete before computations) и na.pass (pass to aggregation function).
In cor(), argument use, that takes values of everything, complete.obs, etc.
More often uyou will see argument na.rm, taking TRUE (delete all missing values) or FALSE (leave missing values for computations).

🏷 Object attributes

Every object has some attributes, beside the data itself. Object class is a mandatory attribute. It can be extracted using class() function.

Check class of a object:

a <- 5
class(a)

[1] "numeric"

Or c function:

class(c)

[1] "function"

Other embedded attributes:

a <- c(one=2, two=2, three=3)
  
 length(a) # legnth of the object  (for vector it's number of elements, for data.frame it's a number of variables)

[1] 3

 levels(a)   # in `factors` (all possible values of the variable)

NULL

 names(a)   # names of object's parts (for data.frame - names of variables)

[1] "one"   "two"   "three"

You can specify custom attributes using attr() function.

attr(a, "my.super.useful.comment") <- "This is a!"
str(a)

 atomic [1:3] 2 2 3
 - attr(*, "my.super.useful.comment")= chr "This is a!"

Extraction of labels

When labelled data is read with haven package, every variable has attributes “label” (variable label) and “labels” (value labels).

Any attributed can be extracted with attr():

attr(PT$impfree, "labels")

 Very much like me            Like me   Somewhat like me 
                 1                  2                  3 
  A little like me        Not like me Not like me at all 
                 4                  5                  6 
           Refusal         Don't know          No answer 
                 7                  8                  9

and variable label

attr(PT$impfree, "label")

[1] "Important to make own decisions and be free"

❗️Аttributes can be lost while recoding or subsetting (can be fixed with sjlabelled::copy_labels or using only functions from tidyverse.
❗️Keeping labels might be unnecessary pain. If it is very important, check packages tidyr, dplyr, sjmisc, sjlabelled.

`to_label()`

Function to_labelfrom sjmisc package converts labels to the factor values:

library(sjmisc)

table(PT$domicil, PT$gndr)

PT$home <- to_label(PT$domicil)
PT$gender <- to_label(PT$gndr)

table(PT$home, PT$gender)

                                  
                                   Male Female No answer
  A big city                        107    149         0
  Suburbs or outskirts of big city   94    106         0
  Town or small city                179    246         0
  Country village                   137    222         0
  Farm or home in countryside        12     16         0
  Refusal                             0      0         0
  Don't know                          0      0         0
  No answer                           0      0         0

Frequences and cross-tabs with labels: `sjmisc`

frq(PT$domicil) # frequencies for a single variable


# Domicile, respondent's description (x) <numeric> 
# total N=1270  valid N=1268  mean=2.77  sd=1.13
 
 val                            label frq raw.prc valid.prc cum.prc
   1                       A big city 256   20.16     20.19   20.19
   2 Suburbs or outskirts of big city 200   15.75     15.77   35.96
   3               Town or small city 425   33.46     33.52   69.48
   4                  Country village 359   28.27     28.31   97.79
   5      Farm or home in countryside  28    2.20      2.21  100.00
   7                          Refusal   0    0.00      0.00  100.00
   8                       Don't know   0    0.00      0.00  100.00
   9                        No answer   0    0.00      0.00  100.00
  NA                               NA   2    0.16        NA      NA

flat_table(PT, "domicil", "gndr", margin = "col") # contingency table

                                 gndr  Male Female No answer
domicil                                                     
A big city                            20.23  20.16       NaN
Suburbs or outskirts of big city      17.77  14.34       NaN
Town or small city                    33.84  33.29       NaN
Country village                       25.90  30.04       NaN
Farm or home in countryside            2.27   2.17       NaN
Refusal                                0.00   0.00       NaN
Don't know                             0.00   0.00       NaN
No answer                              0.00   0.00       NaN

📈 Linear regressions

Recap: http://setosa.io/ev/ordinary-least-squares-regression/

Linear regessions

Function lm()

General form

fit1 <-  lm (
    formula = dependent variable ~ independent1 + independent2 + ..., 
    data = object containing all variables,
    subset = logical variable to subset the data object,
    weights = survey weights, 
    na.action = na.omit how to deal with missing values
    )

~ - this is tilde

❗️ ~~Always save a fitted model to the object~~

Example

Fit the model

impfree.reversed - importance of freedom for a resppondent.

M1 <- lm(impfree.reversed ~ agea + gndr + eduyrs,
          data = PT)

Extract parameters and fit

After the model has been successfully fitted and saved to an object, different methods can be applied:

summary(M1)


Call:
lm(formula = impfree.reversed ~ agea + gndr + eduyrs, data = PT)

Residuals:
<Labelled double>: Important to make own decisions and be free
    Min      1Q  Median      3Q     Max 
-3.8971 -0.7315  0.1492  1.0782  1.5934 

Labels:
 value              label
     1  Very much like me
     2            Like me
     3   Somewhat like me
     4   A little like me
     5        Not like me
     6 Not like me at all
     7            Refusal
     8         Don't know
     9          No answer

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.497511   0.182566  24.635  < 2e-16 ***
agea        -0.002568   0.001958  -1.312 0.189892    
gndr         0.096549   0.062903   1.535 0.125067    
eduyrs       0.022556   0.006721   3.356 0.000815 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.097 on 1247 degrees of freedom
  (19 observations deleted due to missingness)
Multiple R-squared:  0.01955,   Adjusted R-squared:  0.01719 
F-statistic: 8.288 on 3 and 1247 DF,  p-value: 1.845e-05

coef(M1)

 (Intercept)         agea         gndr       eduyrs 
 4.497510622 -0.002568256  0.096548556  0.022556206

broom::tidy(M1)

Diagnose

# Multicollinearity
vif(M1)

    agea     gndr   eduyrs 
1.332546 1.002369 1.334964

# Heteroscedasticity
lmtest::bptest(M1)


    studentized Breusch-Pagan test

data:  M1
BP = 22.087, df = 3, p-value = 6.256e-05

Modify model

library("sjmisc")
PT$d <- to_label(PT$domicil)

M2 <- lm(impfree.reversed ~ agea + gndr + eduyrs + d,
          data = PT)

summary(M2)


Call:
lm(formula = impfree.reversed ~ agea + gndr + eduyrs + d, data = PT)

Residuals:
<Labelled double>: Important to make own decisions and be free
    Min      1Q  Median      3Q     Max 
-3.9032 -0.7229  0.1458  1.0579  1.7453 

Labels:
 value              label
     1  Very much like me
     2            Like me
     3   Somewhat like me
     4   A little like me
     5        Not like me
     6 Not like me at all
     7            Refusal
     8         Don't know
     9          No answer

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)
(Intercept)                        4.615136   0.200317  23.039  < 2e-16
agea                              -0.002833   0.001962  -1.444  0.14888
gndr                               0.098646   0.062949   1.567  0.11735
eduyrs                             0.018935   0.006958   2.721  0.00659
dSuburbs or outskirts of big city -0.065634   0.104058  -0.631  0.52832
dTown or small city               -0.018923   0.087945  -0.215  0.82967
dCountry village                  -0.178532   0.093472  -1.910  0.05636
dFarm or home in countryside      -0.275963   0.230911  -1.195  0.23227
                                     
(Intercept)                       ***
agea                                 
gndr                                 
eduyrs                            ** 
dSuburbs or outskirts of big city    
dTown or small city                  
dCountry village                  .  
dFarm or home in countryside         
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.095 on 1241 degrees of freedom
  (21 observations deleted due to missingness)
Multiple R-squared:  0.02477,   Adjusted R-squared:  0.01927 
F-statistic: 4.503 on 7 and 1241 DF,  p-value: 5.732e-05

## equivalent alternative:

M2 <- update(M1, . ~ . + home) # dots stand for "the same"

Interactions

To include a product of two independent variables (interaction) top the model, use star * character.

M3 <- lm(impfree.reversed ~ agea + gndr * eduyrs + home,
          data = PT)

summary(M3)


Call:
lm(formula = impfree.reversed ~ agea + gndr * eduyrs + home, 
    data = PT)

Residuals:
<Labelled double>: Important to make own decisions and be free
    Min      1Q  Median      3Q     Max 
-3.9170 -0.7136  0.2023  1.0257  1.7363 

Labels:
 value              label
     1  Very much like me
     2            Like me
     3   Somewhat like me
     4   A little like me
     5        Not like me
     6 Not like me at all
     7            Refusal
     8         Don't know
     9          No answer

Coefficients:
                                      Estimate Std. Error t value Pr(>|t|)
(Intercept)                           5.120246   0.280541  18.251   <2e-16
agea                                 -0.002727   0.001958  -1.393   0.1639
gndr                                 -0.215391   0.137570  -1.566   0.1177
eduyrs                               -0.030225   0.020379  -1.483   0.1383
homeSuburbs or outskirts of big city -0.077353   0.103925  -0.744   0.4568
homeTown or small city               -0.019675   0.087748  -0.224   0.8226
homeCountry village                  -0.180359   0.093265  -1.934   0.0534
homeFarm or home in countryside      -0.311354   0.230806  -1.349   0.1776
gndr:eduyrs                           0.030658   0.011949   2.566   0.0104
                                        
(Intercept)                          ***
agea                                    
gndr                                    
eduyrs                                  
homeSuburbs or outskirts of big city    
homeTown or small city                  
homeCountry village                  .  
homeFarm or home in countryside         
gndr:eduyrs                          *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.093 on 1240 degrees of freedom
  (21 observations deleted due to missingness)
Multiple R-squared:  0.02992,   Adjusted R-squared:  0.02366 
F-statistic: 4.781 on 8 and 1240 DF,  p-value: 8.336e-06

Export

Now we can compare these models by puttiing them in one table with stargazer package

Either by printing in Console:

library("stargazer")

stargazer(
  M1, M2, M3,          # Models to include
  type="text",         # Output format
  keep.stat = c("n", "adj.rsq", "f"),  # What statistics include in the table?
  style="ajs"          # Style of formatting (American Journal of Sociology)
  )

Or export as html to paste into your paper

stargazer(
  M1, M2, M3,          # Models to include
  type="html",         # Output format
  keep.stat = c("n", "adj.rsq", "f"),  # What statistics include in the table?
  style="ajs",          # Style of formatting (American Journal of Sociology)
  out="data/table1.html"  # File name to save html
   )

Resulting file: table1.html

Other packages: xtables, papaja, apaTables (didn’t work for me though).

💱 Relations between two variables

Correlations

For two variables:

cor(PT$impfree, # first variable
    PT$eduyrs,  #  second variable
    use="pairwise", # how to deal with missing values?
    method="pearson")  # can be also spearman

[1] -0.1278352

For many (>2) variables:

cor(
  PT[, c("impfree", "eduyrs", "agea")], # a subset with numeric variables
  use="pairwise",                       # how missing values should be treated
  method="spearman")                    # type of coefficient

            impfree     eduyrs        agea
impfree  1.00000000 -0.1172985  0.09526061
eduyrs  -0.11729845  1.0000000 -0.53513355
agea     0.09526061 -0.5351336  1.00000000

Covariances

analogous tp correlations:

cov(
  PT[, c("impfree", "eduyrs", "agea")], # a subset with numeric variables
  use="pairwise"                        # how missing values should be treated
  )

           impfree      eduyrs       agea
impfree  1.2275808  -0.7540195   2.025171
eduyrs  -0.7540195  28.5331340 -49.070769
agea     2.0251709 -49.0707694 334.906992

Significance of correlations

Only for two variables:

cor.test(PT$impfree, 
         PT$eduyrs, 
         use="pairwise")


    Pearson's product-moment correlation

data:  PT$impfree and PT$eduyrs
t = -4.5552, df = 1249, p-value = 5.745e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.18196962 -0.07292819
sample estimates:
       cor 
-0.1278352

Chi-square

Chi-square criterion for contingency tables is computed in two steps.

Step 1. Compute cross-tab (contigency table) and save to an object

cross.tab <- table(PT$impfree.reversed, PT$big.city, 
                   useNA="no") # usual chi-square doesn't work with missing, so we have to drop it

Step 2. Compute chi-square.

chisq.test(cross.tab)


    Pearson's Chi-squared test

data:  cross.tab
X-squared = 20.672, df = 10, p-value = 0.0235

Are variables impfree.reversed and big.city significantly related?

t-test

For the two independent samples: older than 60 and younger then 60:

t.test(
  PT$impfree.reversed[PT$agea <  60],  # subsetted variable in group 1
  PT$impfree.reversed[PT$agea >= 60]   # subsetted variable in group 2
  )


    Welch Two Sample t-test

data:  PT$impfree.reversed[PT$agea < 60] and PT$impfree.reversed[PT$agea >= 60]
t = 3.0234, df = 979.13, p-value = 0.002565
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.06899106 0.32420184
sample estimates:
mean of x mean of y 
 4.820078  4.623482

Older or younger people endorse freedom more?

Compare with a single value:

t.test(PT$impfree.reversed, mu=1)


    One Sample t-test

data:  PT$impfree.reversed
t = 119.97, df = 1260, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 1
95 percent confidence interval:
 4.681850 4.804273
sample estimates:
mean of x 
 4.743061

Does the mean importance of freedom differ from 1?

NB. For comparison of matched or paired samples, use argument paired = TRUE.

Wilcoxon test (Mann-Whitney)

Analogous to t-test, but doesn’t assume normal distribution.

wilcox.test(
  PT$impfree.reversed[PT$agea <  60],  # subsetted variable in group 1
  PT$impfree.reversed[PT$agea >= 60]   # subsetted variable in group 2
  )


    Wilcoxon rank sum test with continuity correction

data:  PT$impfree.reversed[PT$agea < 60] and PT$impfree.reversed[PT$agea >= 60]
W = 206720, p-value = 0.004382
alternative hypothesis: true location shift is not equal to 0

NB. For comparison of matched or paired samples, use argument paired = TRUE.

♻️ Multiple comparisons of means

Analysis of variance - ANOVA

an.of.var <- aov(
              formula = impfree.reversed ~ big.city, # formula: dependen variable ~ independent (categorical) variable
              data = PT
             )
summary(an.of.var)

              Df Sum Sq Mean Sq F value  Pr(>F)   
big.city       2     14   7.014   5.759 0.00324 **
Residuals   1256   1530   1.218                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
11 observations deleted due to missingness

Pr(>F) - is a significance of F-stistic.

❗️ ~~Function anova() does NOT compute analysis of variance, it is used for model comparison. For ANOVA use aov().~~

Tukey’s honest significant differences

Pairwise comparison of means across groups adjusted for multiple comparisons.

TukeyHSD(an.of.var)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = impfree.reversed ~ big.city, data = PT)

$big.city
                                       diff         lwr         upr
Rural-Big city or suburbs      -0.231623498 -0.41140714 -0.05183986
Small town-Big city or suburbs -0.004264692 -0.17925272  0.17072333
Small town-Rural                0.227358806  0.04458973  0.41012788
                                   p adj
Rural-Big city or suburbs      0.0071988
Small town-Big city or suburbs 0.9981986
Small town-Rural               0.0099951

Welch test

Does not assume equality of variances across groups.

oneway.test(impfree.reversed ~ big.city, # formula
            data=PT
            )


    One-way analysis of means (not assuming equal variances)

data:  impfree.reversed and big.city
F = 5.6109, num df = 2.00, denom df = 825.55, p-value = 0.003798

Kruskal-Wallis test

Does not assume normal distribution of dependent variable.

kruskal.test(impfree.reversed ~ big.city, PT)


    Kruskal-Wallis rank sum test

data:  impfree.reversed by big.city
Kruskal-Wallis chi-squared = 12.681, df = 2, p-value = 0.001763

🆚 Binary logit regression

PT$never <-  PT$netusoft==1  #new binary variable  representing those who never use Internet/

fit.logit <-
  glm(
   formula = never ~ agea + gndr + eduyrs, 
   data = PT,
   family = binomial(link="logit")  # non-linear function relating probability of response and independent variables
   )

summary(fit.logit)


Call:
glm(formula = never ~ agea + gndr + eduyrs, family = binomial(link = "logit"), 
    data = PT)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4927  -0.3146  -0.0874   0.3467   3.3192  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.625587   0.666559  -5.439 5.35e-08 ***
agea         0.091929   0.008435  10.899  < 2e-16 ***
gndr         0.433239   0.200568   2.160   0.0308 *  
eduyrs      -0.390164   0.029218 -13.354  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1562.46  on 1257  degrees of freedom
Residual deviance:  672.35  on 1254  degrees of freedom
  (12 observations deleted due to missingness)
AIC: 680.35

Number of Fisher Scoring iterations: 6

Interpretation of coefficients in logistic regression

\[ log(\frac{Prob_{never}}{1 - Prob_{never}}) = \beta_0 + \beta_1*Gender + \beta_2*Age + \beta_3Education \]

cf <- coef(fit.logit)
exp(cf)

(Intercept)        agea        gndr      eduyrs 
 0.02663345  1.09628738  1.54224501  0.67694554

library(MASS)
exp(confint(fit.logit))

                  2.5 %     97.5 %
(Intercept) 0.007051159 0.09655775
agea        1.078876330 1.11519925
gndr        1.042485119 2.29097988
eduyrs      0.637811283 0.71532407

📶 Ordinal regression

PT$impfree.reversed <- factor(PT$impfree.reversed, ordered=TRUE)

library(MASS)
fit.ordinal1 <-
  polr(
    impfree.reversed ~ agea + gndr + eduyrs,
    data=PT,
    method = "logistic" # link function
    )

summary(fit.ordinal1)

Call:
polr(formula = impfree.reversed ~ agea + gndr + eduyrs, data = PT, 
    method = "logistic")

Coefficients:
           Value Std. Error t value
agea   -0.004564   0.003228  -1.414
gndr    0.216655   0.103237   2.099
eduyrs  0.031723   0.011038   2.874

Intercepts:
    Value    Std. Error t value 
1|2  -4.4224   0.4316   -10.2474
2|3  -2.8406   0.3279    -8.6624
3|4  -1.6642   0.3055    -5.4484
4|5   0.0294   0.2980     0.0987
5|6   1.2621   0.3005     4.1994

Residual Deviance: 3522.331 
AIC: 3538.331 
(19 observations deleted due to missingness)

How to interpret the ordinal regression coefficients?

Alternatively you may predict probabilities of the outcome conditioned on typical values of independent variables.

predict(fit.ordinal1,
        newdata = # create a fake dataset with typical values. It is advisable to vary only one or two variables, keeping the rest constant (usually equal to a sample mean)
          data.frame(
           agea=c(20, 30, 40, 50),  # different age
           gndr= c(1, 1, 1, 1),     # males
           eduyrs = c(12,12,12,12)  # average education
           ),
        type="p")

            1          2          3         4         5         6
1 0.007185965 0.02681912 0.06844513 0.2805955 0.2974524 0.3195019
2 0.007519038 0.02801770 0.07118743 0.2871642 0.2964509 0.3096608
3 0.007867426 0.02926731 0.07401961 0.2936822 0.2951743 0.2999891
4 0.008231823 0.03056987 0.07694290 0.3001356 0.2936273 0.2904925

Rows - different age groups, columns - probability of response from 1 (“not like me at all”) to 6 (“very much like me”).

🛁 Tidyverse

Tidyverse - a set of packages for data cleaning, structuring, and facilitating R coding.

`%>%` pipes

“Pipes” - send a left-side function output to the right-side function, as a first argument.

library(magrittr) # package that provides operator %>%

c(2, 5, 10) %>% sum()

[1] 17

# if is the first and the only argument one can omit parentheses
c(2,5) %>% sum

[1] 7

# if the argument is not the first, one can use dot `.` symbol to signifiy the output of the left-side function.
c(2,5) %>% table(c(10, 100), .)

Pipes can be as long as you like (but not recommended more than 10 rows, as debugging becomes difficult)

c(2,5) %>% sum %>% sqrt

[1] 2.645751

Результат работы трубы сохраняется как обычно - в объект в начале этой трубы.

pipe.result <- c(2,5) %>% sum %>% log 

pipe.result

[1] 1.94591

Full example of pipe

library(dplyr)
library(magrittr)
library(sjmisc)

Portugal <-   # the result will be saved to object "Portugal"
  PT %>%      # object PT is sent as a first argument to the next function
  dplyr::select("impfree", "cntry", "yrbrn", "gndr") %>%  # select variables
  filter(cntry=="PT") %>%                         # subset/filter respondents
      mutate(age = 2017 - yrbrn,                  # recode
             female = as.numeric(gndr==2 & !is.na(gndr)),   
             impfree.reversed =  
                rec(impfree, rec=c(" 1=6 [very much]; 
                                     2=5 [like me]; 
                                     3=4 [somewhat]; 
                                     4=3 [a little]; 
                                     5=2 [not like me]; 
                                     6=1 [not at all]"), 
               append    = TRUE, #add top the existing data
               as.num    = TRUE, # is it numeric/continuous?
               var.label = "Importance of freedom"))


Portugal

# A tibble: 1,270 x 7
   impfree   cntry     yrbrn     gndr     age      female impfree.reversed
   <dbl+lbl> <chr+lbl> <dbl+lbl> <dbl+lb> <dbl+lb>  <dbl>            <dbl>
 1 " 2"      PT        1958      1        59            0                5
 2 " 2"      PT        1970      1        47            0                5
 3 " 1"      PT        1961      2        56            1                6
 4 " 1"      PT        1993      2        24            1                6
 5 " 3"      PT        1952      1        65            0                4
 6 " 2"      PT        1957      1        60            0                5
 7 NA        PT        1996      2        21            1               NA
 8 " 3"      PT        1962      2        55            1                4
 9 " 3"      PT        1945      2        72            1                4
10 " 2"      PT        1964      2        53            1                5
# ... with 1,260 more rows

Tables and frequencies

library(sjmisc)

# frequencies
Portugal$impfree %>% 
  table %>% 
  prop.table

.
          1           2           3           4           5           6 
0.305313243 0.291038858 0.291038858 0.074544013 0.030134814 0.007930214

  # or

frq(Portugal$impfree)


# Important to make own decisions and be free (x) <numeric> 
# total N=1270  valid N=1261  mean=2.26  sd=1.11
 
 val              label frq raw.prc valid.prc cum.prc
   1  Very much like me 385   30.31     30.53   30.53
   2            Like me 367   28.90     29.10   59.64
   3   Somewhat like me 367   28.90     29.10   88.74
   4   A little like me  94    7.40      7.45   96.19
   5        Not like me  38    2.99      3.01   99.21
   6 Not like me at all  10    0.79      0.79  100.00
   7            Refusal   0    0.00      0.00  100.00
   8         Don't know   0    0.00      0.00  100.00
   9          No answer   0    0.00      0.00  100.00
  NA                 NA   9    0.71        NA      NA

# cross-tabulation
Portugal %>% 
  dplyr::select(impfree, gndr) %>% 
  table %>% 
  prop.table

       gndr
impfree           1           2
      1 0.103885805 0.201427439
      2 0.141157811 0.149881047
      3 0.133227597 0.157811261
      4 0.026962728 0.047581285
      5 0.010309278 0.019825535
      6 0.003172086 0.004758128

  # or

flat_table(Portugal, impfree, gndr,  margin="col")

                   gndr  Male Female No answer
impfree                                       
Very much like me       24.81  34.65       NaN
Like me                 33.71  25.78       NaN
Somewhat like me        31.82  27.15       NaN
A little like me         6.44   8.19       NaN
Not like me              2.46   3.41       NaN
Not like me at all       0.76   0.82       NaN
Refusal                  0.00   0.00       NaN
Don't know               0.00   0.00       NaN
No answer                0.00   0.00       NaN

By group

# means
Portugal %>%
  group_by(gndr) %>%
  summarize(
    mean = mean(impfree, na.rm=T), # [name of column = function]
    sd   = sd(impfree, na.rm=T),
    n    = n(),
    S.E. = sd(impfree, na.rm=T)/sqrt(n())
    )

Maksim Rudnev, 2019 using RMarkdown.

Lecture 2 - Missing values, labels, regressions, stats

31/01/2019