Read the sample code and try to understand what each line means:
library("car")
a <- 5
PT$impfree2 <- PT$impfree ^ 2
PT2 <- PT[PT$domicil == 6, ]
PT$female <- Recode(PT$gndr, "1 = 0; 2 = 1; else = NA")
plot(PT$agea, PT$eduyrs)
mean(PT$impfun, na.rm = T )
The main type of missing values is NA
(Not Available).
Other types of missing values: - NaN
(Not A Number) - appears as a result of impossible computation, e.g.
log(-1) # logarithm of negative value
[1] NaN
NULL
means void (usually a placeholder for some value or removed value), works differently compared to NA. For example, NULL
is returned when non-existing variable is requested:PT$gggggggggg
NULL
# Likewise, to delete variable from data.frame:
PT$impfun<- NULL
Inf
- means infinity, but per se cannot be used, thus canbe treated as missing value.100/0 # divide by zero
[1] Inf
It is desirable to have all the missing values in data to be NA
.
is.na()
returns logical values of the same size as its single argument. TRUE means the value is NA, FALSE for anything else.
var.with.missings <- c(5, NA, NaN, Inf, NULL, "", " ") # creates a vector (variable) of value with different kinds of missing values.
is.na(var.with.missings) # check which of them are NAs
[1] FALSE TRUE FALSE FALSE FALSE FALSE
Logical vairables (TRUE/FALSE) are automatically converted to numeric 0/1 in R base functions. We can use both is.na()
and sum()
to compute the number of missing values in the variable.
miss.rev <- is.na(PT$impfree) # created a logical variables with TRUE for missing values.
sum(miss.rev) # sums up the TRUEs in miss.rev, that is, counts the number of missing values.
[1] 9
# Or, the two above lines in one expression:
sum(is.na(PT$impfree))
[1] 9
Operator !
inverts logical variable. Thus, to count the number of non-missing values in variable:
sum(!is.na(PT$impfree)) # how many NON-missing values does PT$impfree contains?
[1] 1261
na.omit()
excludes all the missing values from the vector. If applied to a data.frame, it excludes missing values listwise (i.e. removes the whole observation, if at least one value is NA. Dangerous!):
# For vector/variable
length(PT$impfree) #Check the length of the variable
[1] 1270
imp.free.no.missings <- na.omit(PT$impfree) # drop the missing values from the variable
length(imp.free.no.missings) # Check the length of the variable with excluded missing values.
[1] 1261
# For data.frame
nrow(PT) #Check the number of rows (observations)
[1] 1270
PT.nomissings <- na.omit(PT) # Drop the missing values listwise
nrow(PT.nomissings) # Check the number of rows (observations) after listwise exclusion.
[1] 0
For example, we need to recode variable PT$impfree
’s response 6
to NA
.
PT$impfree[PT$impfree == 6] <- NA
# or, equivalently
library("car")
PT$impfree <- Recode(PT$impfree, "6 = NA")
Whitespaces and empty strings are not considered missing values!
b <- c("One", "Two", "", " ", NA)
is.na(b)
[1] FALSE FALSE FALSE FALSE TRUE
But you can replace empty strings and whitespaces with NA:
b[b==""] <- NA # filter out empty strings and assign NA
b[b==" "] <- NA # filter out whitespaces and assign NA
is.na(b) # check is it worked
[1] FALSE FALSE TRUE TRUE TRUE
table()
, argument useNA
can take values of always
(add column with NAs), ifany
(add only if there are NAs), and no
(do not add, it is a default).aggregate()
, argument is called na.action
, takes values na.omit
(delete before computations) и na.pass
(pass to aggregation function).cor()
, argument use
, that takes values of everything
, complete.obs
, etc.na.rm
, taking TRUE
(delete all missing values) or FALSE
(leave missing values for computations).Every object has some attributes, beside the data itself. Object class is a mandatory attribute. It can be extracted using class()
function.
Check class of a
object:
a <- 5
class(a)
[1] "numeric"
Or c
function:
class(c)
[1] "function"
Other embedded attributes:
a <- c(one=2, two=2, three=3)
length(a) # legnth of the object (for vector it's number of elements, for data.frame it's a number of variables)
[1] 3
levels(a) # in `factors` (all possible values of the variable)
NULL
names(a) # names of object's parts (for data.frame - names of variables)
[1] "one" "two" "three"
You can specify custom attributes using attr()
function.
attr(a, "my.super.useful.comment") <- "This is a!"
str(a)
atomic [1:3] 2 2 3
- attr(*, "my.super.useful.comment")= chr "This is a!"
When labelled data is read with haven
package, every variable has attributes “label” (variable label) and “labels” (value labels).
Any attributed can be extracted with attr()
:
attr(PT$impfree, "labels")
Very much like me Like me Somewhat like me
1 2 3
A little like me Not like me Not like me at all
4 5 6
Refusal Don't know No answer
7 8 9
and variable label
attr(PT$impfree, "label")
[1] "Important to make own decisions and be free"
tidyr
, dplyr
, sjmisc
, sjlabelled
.to_label()
Function to_label
from sjmisc
package converts labels to the factor values:
library(sjmisc)
table(PT$domicil, PT$gndr)
1 2
1 107 149
2 94 106
3 179 246
4 137 222
5 12 16
PT$home <- to_label(PT$domicil)
PT$gender <- to_label(PT$gndr)
table(PT$home, PT$gender)
Male Female No answer
A big city 107 149 0
Suburbs or outskirts of big city 94 106 0
Town or small city 179 246 0
Country village 137 222 0
Farm or home in countryside 12 16 0
Refusal 0 0 0
Don't know 0 0 0
No answer 0 0 0
sjmisc
frq(PT$domicil) # frequencies for a single variable
# Domicile, respondent's description (x) <numeric>
# total N=1270 valid N=1268 mean=2.77 sd=1.13
val label frq raw.prc valid.prc cum.prc
1 A big city 256 20.16 20.19 20.19
2 Suburbs or outskirts of big city 200 15.75 15.77 35.96
3 Town or small city 425 33.46 33.52 69.48
4 Country village 359 28.27 28.31 97.79
5 Farm or home in countryside 28 2.20 2.21 100.00
7 Refusal 0 0.00 0.00 100.00
8 Don't know 0 0.00 0.00 100.00
9 No answer 0 0.00 0.00 100.00
NA NA 2 0.16 NA NA
flat_table(PT, "domicil", "gndr", margin = "col") # contingency table
gndr Male Female No answer
domicil
A big city 20.23 20.16 NaN
Suburbs or outskirts of big city 17.77 14.34 NaN
Town or small city 33.84 33.29 NaN
Country village 25.90 30.04 NaN
Farm or home in countryside 2.27 2.17 NaN
Refusal 0.00 0.00 NaN
Don't know 0.00 0.00 NaN
No answer 0.00 0.00 NaN
Recap: http://setosa.io/ev/ordinary-least-squares-regression/
Function lm()
General form
fit1 <- lm (
formula = dependent variable ~ independent1 + independent2 + ...,
data = object containing all variables,
subset = logical variable to subset the data object,
weights = survey weights,
na.action = na.omit how to deal with missing values
)
~
- this is tilde
❗️ Always save a fitted model to the object
impfree.reversed
- importance of freedom for a resppondent.
M1 <- lm(impfree.reversed ~ agea + gndr + eduyrs,
data = PT)
After the model has been successfully fitted and saved to an object, different methods can be applied:
summary(M1)
Call:
lm(formula = impfree.reversed ~ agea + gndr + eduyrs, data = PT)
Residuals:
<Labelled double>: Important to make own decisions and be free
Min 1Q Median 3Q Max
-3.8971 -0.7315 0.1492 1.0782 1.5934
Labels:
value label
1 Very much like me
2 Like me
3 Somewhat like me
4 A little like me
5 Not like me
6 Not like me at all
7 Refusal
8 Don't know
9 No answer
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.497511 0.182566 24.635 < 2e-16 ***
agea -0.002568 0.001958 -1.312 0.189892
gndr 0.096549 0.062903 1.535 0.125067
eduyrs 0.022556 0.006721 3.356 0.000815 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.097 on 1247 degrees of freedom
(19 observations deleted due to missingness)
Multiple R-squared: 0.01955, Adjusted R-squared: 0.01719
F-statistic: 8.288 on 3 and 1247 DF, p-value: 1.845e-05
coef(M1)
(Intercept) agea gndr eduyrs
4.497510622 -0.002568256 0.096548556 0.022556206
broom::tidy(M1)
# Multicollinearity
vif(M1)
agea gndr eduyrs
1.332546 1.002369 1.334964
# Heteroscedasticity
lmtest::bptest(M1)
studentized Breusch-Pagan test
data: M1
BP = 22.087, df = 3, p-value = 6.256e-05
library("sjmisc")
PT$d <- to_label(PT$domicil)
M2 <- lm(impfree.reversed ~ agea + gndr + eduyrs + d,
data = PT)
summary(M2)
Call:
lm(formula = impfree.reversed ~ agea + gndr + eduyrs + d, data = PT)
Residuals:
<Labelled double>: Important to make own decisions and be free
Min 1Q Median 3Q Max
-3.9032 -0.7229 0.1458 1.0579 1.7453
Labels:
value label
1 Very much like me
2 Like me
3 Somewhat like me
4 A little like me
5 Not like me
6 Not like me at all
7 Refusal
8 Don't know
9 No answer
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.615136 0.200317 23.039 < 2e-16
agea -0.002833 0.001962 -1.444 0.14888
gndr 0.098646 0.062949 1.567 0.11735
eduyrs 0.018935 0.006958 2.721 0.00659
dSuburbs or outskirts of big city -0.065634 0.104058 -0.631 0.52832
dTown or small city -0.018923 0.087945 -0.215 0.82967
dCountry village -0.178532 0.093472 -1.910 0.05636
dFarm or home in countryside -0.275963 0.230911 -1.195 0.23227
(Intercept) ***
agea
gndr
eduyrs **
dSuburbs or outskirts of big city
dTown or small city
dCountry village .
dFarm or home in countryside
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.095 on 1241 degrees of freedom
(21 observations deleted due to missingness)
Multiple R-squared: 0.02477, Adjusted R-squared: 0.01927
F-statistic: 4.503 on 7 and 1241 DF, p-value: 5.732e-05
## equivalent alternative:
M2 <- update(M1, . ~ . + home) # dots stand for "the same"
To include a product of two independent variables (interaction) top the model, use star *
character.
M3 <- lm(impfree.reversed ~ agea + gndr * eduyrs + home,
data = PT)
summary(M3)
Call:
lm(formula = impfree.reversed ~ agea + gndr * eduyrs + home,
data = PT)
Residuals:
<Labelled double>: Important to make own decisions and be free
Min 1Q Median 3Q Max
-3.9170 -0.7136 0.2023 1.0257 1.7363
Labels:
value label
1 Very much like me
2 Like me
3 Somewhat like me
4 A little like me
5 Not like me
6 Not like me at all
7 Refusal
8 Don't know
9 No answer
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.120246 0.280541 18.251 <2e-16
agea -0.002727 0.001958 -1.393 0.1639
gndr -0.215391 0.137570 -1.566 0.1177
eduyrs -0.030225 0.020379 -1.483 0.1383
homeSuburbs or outskirts of big city -0.077353 0.103925 -0.744 0.4568
homeTown or small city -0.019675 0.087748 -0.224 0.8226
homeCountry village -0.180359 0.093265 -1.934 0.0534
homeFarm or home in countryside -0.311354 0.230806 -1.349 0.1776
gndr:eduyrs 0.030658 0.011949 2.566 0.0104
(Intercept) ***
agea
gndr
eduyrs
homeSuburbs or outskirts of big city
homeTown or small city
homeCountry village .
homeFarm or home in countryside
gndr:eduyrs *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.093 on 1240 degrees of freedom
(21 observations deleted due to missingness)
Multiple R-squared: 0.02992, Adjusted R-squared: 0.02366
F-statistic: 4.781 on 8 and 1240 DF, p-value: 8.336e-06
Now we can compare these models by puttiing them in one table with stargazer
package
Either by printing in Console:
library("stargazer")
stargazer(
M1, M2, M3, # Models to include
type="text", # Output format
keep.stat = c("n", "adj.rsq", "f"), # What statistics include in the table?
style="ajs" # Style of formatting (American Journal of Sociology)
)
Or export as html to paste into your paper
stargazer(
M1, M2, M3, # Models to include
type="html", # Output format
keep.stat = c("n", "adj.rsq", "f"), # What statistics include in the table?
style="ajs", # Style of formatting (American Journal of Sociology)
out="data/table1.html" # File name to save html
)
Resulting file: table1.html
Other packages: xtables
, papaja
, apaTables
(didn’t work for me though).
For two variables:
cor(PT$impfree, # first variable
PT$eduyrs, # second variable
use="pairwise", # how to deal with missing values?
method="pearson") # can be also spearman
[1] -0.1278352
For many (>2) variables:
cor(
PT[, c("impfree", "eduyrs", "agea")], # a subset with numeric variables
use="pairwise", # how missing values should be treated
method="spearman") # type of coefficient
impfree eduyrs agea
impfree 1.00000000 -0.1172985 0.09526061
eduyrs -0.11729845 1.0000000 -0.53513355
agea 0.09526061 -0.5351336 1.00000000
analogous tp correlations:
cov(
PT[, c("impfree", "eduyrs", "agea")], # a subset with numeric variables
use="pairwise" # how missing values should be treated
)
impfree eduyrs agea
impfree 1.2275808 -0.7540195 2.025171
eduyrs -0.7540195 28.5331340 -49.070769
agea 2.0251709 -49.0707694 334.906992
Only for two variables:
cor.test(PT$impfree,
PT$eduyrs,
use="pairwise")
Pearson's product-moment correlation
data: PT$impfree and PT$eduyrs
t = -4.5552, df = 1249, p-value = 5.745e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.18196962 -0.07292819
sample estimates:
cor
-0.1278352
Chi-square criterion for contingency tables is computed in two steps.
cross.tab <- table(PT$impfree.reversed, PT$big.city,
useNA="no") # usual chi-square doesn't work with missing, so we have to drop it
chisq.test(cross.tab)
Pearson's Chi-squared test
data: cross.tab
X-squared = 20.672, df = 10, p-value = 0.0235
Are variables impfree.reversed
and big.city
significantly related?
For the two independent samples: older than 60 and younger then 60:
t.test(
PT$impfree.reversed[PT$agea < 60], # subsetted variable in group 1
PT$impfree.reversed[PT$agea >= 60] # subsetted variable in group 2
)
Welch Two Sample t-test
data: PT$impfree.reversed[PT$agea < 60] and PT$impfree.reversed[PT$agea >= 60]
t = 3.0234, df = 979.13, p-value = 0.002565
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.06899106 0.32420184
sample estimates:
mean of x mean of y
4.820078 4.623482
Older or younger people endorse freedom more?
Compare with a single value:
t.test(PT$impfree.reversed, mu=1)
One Sample t-test
data: PT$impfree.reversed
t = 119.97, df = 1260, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 1
95 percent confidence interval:
4.681850 4.804273
sample estimates:
mean of x
4.743061
Does the mean importance of freedom differ from 1?
NB. For comparison of matched or paired samples, use argument paired = TRUE
.
Analogous to t-test, but doesn’t assume normal distribution.
wilcox.test(
PT$impfree.reversed[PT$agea < 60], # subsetted variable in group 1
PT$impfree.reversed[PT$agea >= 60] # subsetted variable in group 2
)
Wilcoxon rank sum test with continuity correction
data: PT$impfree.reversed[PT$agea < 60] and PT$impfree.reversed[PT$agea >= 60]
W = 206720, p-value = 0.004382
alternative hypothesis: true location shift is not equal to 0
NB. For comparison of matched or paired samples, use argument paired = TRUE
.
an.of.var <- aov(
formula = impfree.reversed ~ big.city, # formula: dependen variable ~ independent (categorical) variable
data = PT
)
summary(an.of.var)
Df Sum Sq Mean Sq F value Pr(>F)
big.city 2 14 7.014 5.759 0.00324 **
Residuals 1256 1530 1.218
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
11 observations deleted due to missingness
Pr(>F) - is a significance of F-stistic.
❗️ Function anova()
does NOT compute analysis of variance, it is used for model comparison. For ANOVA use aov()
.
See also manova()
Pairwise comparison of means across groups adjusted for multiple comparisons.
TukeyHSD(an.of.var)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = impfree.reversed ~ big.city, data = PT)
$big.city
diff lwr upr
Rural-Big city or suburbs -0.231623498 -0.41140714 -0.05183986
Small town-Big city or suburbs -0.004264692 -0.17925272 0.17072333
Small town-Rural 0.227358806 0.04458973 0.41012788
p adj
Rural-Big city or suburbs 0.0071988
Small town-Big city or suburbs 0.9981986
Small town-Rural 0.0099951
Does not assume equality of variances across groups.
oneway.test(impfree.reversed ~ big.city, # formula
data=PT
)
One-way analysis of means (not assuming equal variances)
data: impfree.reversed and big.city
F = 5.6109, num df = 2.00, denom df = 825.55, p-value = 0.003798
Does not assume normal distribution of dependent variable.
kruskal.test(impfree.reversed ~ big.city, PT)
Kruskal-Wallis rank sum test
data: impfree.reversed by big.city
Kruskal-Wallis chi-squared = 12.681, df = 2, p-value = 0.001763
PT$never <- PT$netusoft==1 #new binary variable representing those who never use Internet/
fit.logit <-
glm(
formula = never ~ agea + gndr + eduyrs,
data = PT,
family = binomial(link="logit") # non-linear function relating probability of response and independent variables
)
summary(fit.logit)
Call:
glm(formula = never ~ agea + gndr + eduyrs, family = binomial(link = "logit"),
data = PT)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4927 -0.3146 -0.0874 0.3467 3.3192
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.625587 0.666559 -5.439 5.35e-08 ***
agea 0.091929 0.008435 10.899 < 2e-16 ***
gndr 0.433239 0.200568 2.160 0.0308 *
eduyrs -0.390164 0.029218 -13.354 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1562.46 on 1257 degrees of freedom
Residual deviance: 672.35 on 1254 degrees of freedom
(12 observations deleted due to missingness)
AIC: 680.35
Number of Fisher Scoring iterations: 6
\[ log(\frac{Prob_{never}}{1 - Prob_{never}}) = \beta_0 + \beta_1*Gender + \beta_2*Age + \beta_3Education \]
cf <- coef(fit.logit)
exp(cf)
(Intercept) agea gndr eduyrs
0.02663345 1.09628738 1.54224501 0.67694554
library(MASS)
exp(confint(fit.logit))
2.5 % 97.5 %
(Intercept) 0.007051159 0.09655775
agea 1.078876330 1.11519925
gndr 1.042485119 2.29097988
eduyrs 0.637811283 0.71532407
PT$impfree.reversed <- factor(PT$impfree.reversed, ordered=TRUE)
library(MASS)
fit.ordinal1 <-
polr(
impfree.reversed ~ agea + gndr + eduyrs,
data=PT,
method = "logistic" # link function
)
summary(fit.ordinal1)
Call:
polr(formula = impfree.reversed ~ agea + gndr + eduyrs, data = PT,
method = "logistic")
Coefficients:
Value Std. Error t value
agea -0.004564 0.003228 -1.414
gndr 0.216655 0.103237 2.099
eduyrs 0.031723 0.011038 2.874
Intercepts:
Value Std. Error t value
1|2 -4.4224 0.4316 -10.2474
2|3 -2.8406 0.3279 -8.6624
3|4 -1.6642 0.3055 -5.4484
4|5 0.0294 0.2980 0.0987
5|6 1.2621 0.3005 4.1994
Residual Deviance: 3522.331
AIC: 3538.331
(19 observations deleted due to missingness)
How to interpret the ordinal regression coefficients?
Alternatively you may predict probabilities of the outcome conditioned on typical values of independent variables.
predict(fit.ordinal1,
newdata = # create a fake dataset with typical values. It is advisable to vary only one or two variables, keeping the rest constant (usually equal to a sample mean)
data.frame(
agea=c(20, 30, 40, 50), # different age
gndr= c(1, 1, 1, 1), # males
eduyrs = c(12,12,12,12) # average education
),
type="p")
1 2 3 4 5 6
1 0.007185965 0.02681912 0.06844513 0.2805955 0.2974524 0.3195019
2 0.007519038 0.02801770 0.07118743 0.2871642 0.2964509 0.3096608
3 0.007867426 0.02926731 0.07401961 0.2936822 0.2951743 0.2999891
4 0.008231823 0.03056987 0.07694290 0.3001356 0.2936273 0.2904925
Rows - different age groups, columns - probability of response from 1 (“not like me at all”) to 6 (“very much like me”).
Tidyverse - a set of packages for data cleaning, structuring, and facilitating R coding.
%>%
pipes“Pipes” - send a left-side function output to the right-side function, as a first argument.
library(magrittr) # package that provides operator %>%
c(2, 5, 10) %>% sum()
[1] 17
# if is the first and the only argument one can omit parentheses
c(2,5) %>% sum
[1] 7
# if the argument is not the first, one can use dot `.` symbol to signifiy the output of the left-side function.
c(2,5) %>% table(c(10, 100), .)
.
2 5
10 1 0
100 0 1
Pipes can be as long as you like (but not recommended more than 10 rows, as debugging becomes difficult)
c(2,5) %>% sum %>% sqrt
[1] 2.645751
Результат работы трубы сохраняется как обычно - в объект в начале этой трубы.
pipe.result <- c(2,5) %>% sum %>% log
pipe.result
[1] 1.94591
library(dplyr)
library(magrittr)
library(sjmisc)
Portugal <- # the result will be saved to object "Portugal"
PT %>% # object PT is sent as a first argument to the next function
dplyr::select("impfree", "cntry", "yrbrn", "gndr") %>% # select variables
filter(cntry=="PT") %>% # subset/filter respondents
mutate(age = 2017 - yrbrn, # recode
female = as.numeric(gndr==2 & !is.na(gndr)),
impfree.reversed =
rec(impfree, rec=c(" 1=6 [very much];
2=5 [like me];
3=4 [somewhat];
4=3 [a little];
5=2 [not like me];
6=1 [not at all]"),
append = TRUE, #add top the existing data
as.num = TRUE, # is it numeric/continuous?
var.label = "Importance of freedom"))
Portugal
# A tibble: 1,270 x 7
impfree cntry yrbrn gndr age female impfree.reversed
<dbl+lbl> <chr+lbl> <dbl+lbl> <dbl+lb> <dbl+lb> <dbl> <dbl>
1 " 2" PT 1958 1 59 0 5
2 " 2" PT 1970 1 47 0 5
3 " 1" PT 1961 2 56 1 6
4 " 1" PT 1993 2 24 1 6
5 " 3" PT 1952 1 65 0 4
6 " 2" PT 1957 1 60 0 5
7 NA PT 1996 2 21 1 NA
8 " 3" PT 1962 2 55 1 4
9 " 3" PT 1945 2 72 1 4
10 " 2" PT 1964 2 53 1 5
# ... with 1,260 more rows
library(sjmisc)
# frequencies
Portugal$impfree %>%
table %>%
prop.table
.
1 2 3 4 5 6
0.305313243 0.291038858 0.291038858 0.074544013 0.030134814 0.007930214
# or
frq(Portugal$impfree)
# Important to make own decisions and be free (x) <numeric>
# total N=1270 valid N=1261 mean=2.26 sd=1.11
val label frq raw.prc valid.prc cum.prc
1 Very much like me 385 30.31 30.53 30.53
2 Like me 367 28.90 29.10 59.64
3 Somewhat like me 367 28.90 29.10 88.74
4 A little like me 94 7.40 7.45 96.19
5 Not like me 38 2.99 3.01 99.21
6 Not like me at all 10 0.79 0.79 100.00
7 Refusal 0 0.00 0.00 100.00
8 Don't know 0 0.00 0.00 100.00
9 No answer 0 0.00 0.00 100.00
NA NA 9 0.71 NA NA
# cross-tabulation
Portugal %>%
dplyr::select(impfree, gndr) %>%
table %>%
prop.table
gndr
impfree 1 2
1 0.103885805 0.201427439
2 0.141157811 0.149881047
3 0.133227597 0.157811261
4 0.026962728 0.047581285
5 0.010309278 0.019825535
6 0.003172086 0.004758128
# or
flat_table(Portugal, impfree, gndr, margin="col")
gndr Male Female No answer
impfree
Very much like me 24.81 34.65 NaN
Like me 33.71 25.78 NaN
Somewhat like me 31.82 27.15 NaN
A little like me 6.44 8.19 NaN
Not like me 2.46 3.41 NaN
Not like me at all 0.76 0.82 NaN
Refusal 0.00 0.00 NaN
Don't know 0.00 0.00 NaN
No answer 0.00 0.00 NaN
# means
Portugal %>%
group_by(gndr) %>%
summarize(
mean = mean(impfree, na.rm=T), # [name of column = function]
sd = sd(impfree, na.rm=T),
n = n(),
S.E. = sd(impfree, na.rm=T)/sqrt(n())
)