Introduction to R
This page is maintained by the software tutors. For errors and/or amendments please contact the current tutor supporting the program.
This page provides an easy-to-read starter guide to R. It is not intended to replace the full user manual available on R's homepage.
For any further questions or assistance feel free to contact the R software tutor.
R is an open-source tool for statistics and data modeling that is driven by written commands (syntax). It is one of the most comprehensive environments for statistical and econometric computing. For someone unacquainted with programming languages, the first steps in R are as difficult as learning a new language. However, once users know the basic grammar of R programming, they can become good producers and consumers of R’s output. This means that users can, on their own, understand and write commands to manipulate quantitative and qualitative data easily and present them in compelling ways through customized tables, powerful graphs and other visual elements. The only things you need are patience and discipline in your learning process. In case of doubt or despair, do not hesitate to contact your EUI R tutor(s).
Against this background, this webpage aims to help you get started with R. We suggest answers to the following questions:
 What are the differences between R and other statistical software?
 What sort of analysis can I do in R that is not available in other software?
 Where and how can I start working with R?
 How do I plot objects in R?
 How can I make my R code more efficient?
 How can I make use of parallel computation and the EUIHPC computing cluster with R?
The table below gives a schematic overview of the main differences between well-known statistical software packages: R, MATLAB, SAS, STATA and SPSS:

| | R | MATLAB | SAS/STATA | SPSS |
|---|---|---|---|---|
| Required license? | No | Yes | Yes | Yes |
| Size of library? | Unlimited | Unlimited | Unlimited, but only official ones | Limited |
| Flexibility to compute statistics? | Yes | Yes | Medium | No |
| Speed in computation? | Fast | The fastest | Medium | Slow |
| Simple commands? | Yes | No | Medium | No |
| Customized graph? | Yes | Yes | No | No |
| Limitations in the type of analysis? | No | No | Medium | Yes |

Table 1: Differences among main statistical software
The following table shows which statistical analyses are readily available in each software package, with links to official, widely known and reliable public websites. In most cases, the required R packages and the procedure for implementing the analysis are explained on the same website:
| TYPE OF STATISTICAL ANALYSIS | R | MATLAB | SAS | STATA | SPSS | Examples of R code and written tutorials* | Video tutorials* |
|---|---|---|---|---|---|---|---|
| Nonparametric Tests | + | + | + | + | + | e.g. sample codes for Sign Test, Wilcoxon Signed-Rank Test, Mann-Whitney-Wilcoxon Test, Kruskal-Wallis Test | |
| T-test | + | + | + | + | + | sample code | |
| ANOVA & MANOVA | + | + | + | + | + | sample code | |
| ANCOVA & MANCOVA | + | + | + | + | + | check the link above | |
| Linear Regression | + | + | + | + | + | sample code and overview of related techniques here, here and here | video explaining some basic techniques (~15 min) |
| Generalized Least Squares | + | + | + | + | + | description and example using gls() function from ‘nlme’ package // yet another example | |
| Ridge Regression | + | + | + | limited | limited | | |
| Lasso | + | + | + | limited | | | |
| Generalized Linear Models | + | + | + | + | + | introductory tutorial | |
| Logistic Regression | + | + | + | + | + | examples involving the use of binary logit, ordinal logit, and multinomial logit // presentation with sample code involving the use of binary, ordinal and multinomial logit (incl. calculation of marginal effects) | binary logit (and probit) (~12 min) // ordinal logit (and probit) (~6 min) // multinomial logit (and probit) (~15 min) |
| Mixed Effects Models | + | + | + | + | + | requires ‘lme4’ or ‘nlme’ package // basic introductory guide with sample code using ‘lme4’ package // more comprehensive guide on ‘lme4’ package // example involving the use of ‘nlme’ package | |
| Nonlinear Regression | + | + | + | limited | limited | basic introductory guide with sample code // more comprehensive guide | video demonstration: one (~4 min), two (~5 min) |
| Discriminant Analysis | + | + | + | + | + | requires ‘MASS’ package // sample code and brief overview // more comprehensive guide | video demonstration (~6 min) |
| Factor & Principal Components Analysis | + | + | + | + | + | sample code and brief overview // practical example | video demonstration (~15 min) |
| Canonical Correlation Analysis | + | + | + | + | + | | |
| Copula Models | + | + | | | | | |
| Path Analysis | + | + | + | + | + | requires ‘plsm’ package (for PLS path analysis) // ‘plsm’ package tutorial // for SEM path analysis see below | |
| Structural Equation Modeling (Latent Factors) | + | + | + | + | limited | requires ‘lavaan’ or ‘sem’ package // ‘lavaan’ package tutorial // ‘sem’ package tutorial // short 'sem' demonstration | introduction to ‘lavaan’ (~45 min) // yet another ‘lavaan’ video tutorial (~45 min) |
| Extreme Value Theory | + | + | | | | | |
| Variance Stabilization | + | + | | | | | |
| Bayesian Statistics | + | + | limited | | | introduction to Bayesian statistics in R | |
| Monte Carlo, Classic Methods | + | + | + | + | limited | | |
| Markov Chain Monte Carlo | + | + | + | | | | |
| EM Algorithm | + | + | + | | | | |
| Missing Data Imputation | + | + | + | + | + | overview of several existing packages | |
| Bootstrap & Jackknife | + | + | + | + | + | some approaches demonstrated here, here and here // applications to regression models here and here | bootstrapping video example (~20 min) |
| Outlier Diagnostics | + | + | + | + | + | examples of several basic techniques | video demonstration: one, two, three |
| Robust Estimation | + | + | + | + | | example of estimating robust regressions using ‘MASS’ package // comprehensive guide to robust estimation | |
| Cross-Validation | + | + | + | | | | |
| Longitudinal (Panel) Data | + | + | + | + | limited | requires ‘plm’ or ‘lme4’ package // for ‘lme4’ guides refer to MIXED EFFECTS MODELS above // ‘plm’ sample code demonstration // ‘plm’ package tutorial | tutorial one (~15 min) // tutorial two (~10 min) |
| Survival Analysis | + | + | + | + | + | requires ‘survival’ package // ‘survival’ package tutorial // another tutorial here | introductory video (~15 min) // more in-depth tutorial (~1 hour 20 min) |
| Propensity Score Matching | + | + | limited | limited | | requires ‘MatchIt’ or ‘matching’ package // ‘MatchIt’ package tutorial // ‘matching’ package tutorial | video demonstration using ‘matching’ package (~13 min) // video demonstration using ‘MatchIt’ package (~17 min) |
| Stratified Samples (Survey Data) | + | + | + | + | + | | |
| Experimental Design | + | + | limited | | | | |
| Quality Control | + | + | + | + | + | | |
| Reliability Theory | + | + | + | + | + | | |
| Univariate Time Series | + | + | + | + | limited | comprehensive tutorial | video tutorial (~1 hour) |
| Multivariate Time Series | + | + | + | + | | | |
| Stochastic Volatility Models, Discrete Case | + | + | + | + | limited | | |
| Stochastic Volatility Models, Continuous Case | + | + | limited | limited | | | |
| Diffusions | + | + | | | | | |
| Markov Chains | + | + | | | | | |
| Hidden Markov Models | + | + | | | | | |
| Counting Processes | + | + | + | | | various showcases of estimating count data models here, here and here | count data models (~11 min) |
| Filtering | + | + | limited | limited | | | |
| Instrumental Variables | + | + | + | + | + | tutorial involving ‘ivreg’ function from ‘AER’ package | video demonstration (~12 min) |
| Splines | + | + | + | + | | | |
| Nonparametric Smoothing Methods | + | + | + | + | | | |
| Spatial Statistics | + | + | limited | limited | | | |
| Cluster Analysis | + | + | + | + | + | requires various packages (check tutorials) // brief introduction into R’s cluster analysis capacities // more comprehensive tutorials here and here | a set of videos on hierarchical cluster analysis: one, two, three, four and five // a set of videos on K-means clustering: one and two |
| Neural Networks | + | + | + | | limited | | |
| Classification & Regression Trees | + | + | limited | | limited | | |
| Random Forests | + | + | limited | | | | |
| Support Vector Machines | + | + | + | | | | |
| Signal Processing | + | + | | | | | |
| Wavelet Analysis | + | + | + | | | | |
| Bagging | + | + | + | | | | |
| ROC Curves | + | + | + | + | + | | |
| Meta-analysis | + | + | limited | + | | | |
| Deterministic Optimization | + | + | + | limited | | | |
| Stochastic Optimization | + | + | limited | | | | |
| Content Analysis | + | | limited | limited | limited | short tutorials using ‘RQDA’ package here, here and here // tutorial using ‘tm’ package here | ‘RQDA’ video tutorial (~5 hours) // ‘tm’ video tutorial (~8 min) // extensive video tutorial on text mining in R (follow the playlist covering chapter 7, ~2 hours 20 min) |
| Quantile regression | + | + | + | + | | requires ‘quantreg’ package // sample code and brief overview // more comprehensive guide | video demonstration (~9 min) |
| Seemingly unrelated regression | + | + | + | + | | requires ‘systemfit’ package // sample code and demonstration // ‘systemfit’ package tutorial | video demonstration (~5 min) |
| Tobit and truncated regression | + | + | + | + | | requires ‘censReg’ package // sample code and demonstration // ‘censReg’ package tutorial | video demonstration (~9 min) |
| Qualitative Comparative Analysis (QCA) | + | | | limited | | requires ‘QCA’ package // ‘QCA’ package description // brief demonstration // QCA tutorial (using ‘QCA’ and ‘SetMethods’ package) | |
| Social network analysis | + | + | + | + | + | requires ‘igraph’, ‘statnet’ or ‘sna’ package // ‘igraph’ package tutorial // lab sessions using ‘igraph’ with densely commented code // nice tutorial combining the use of ‘igraph’ and ‘statnet’ // ‘sna’ package description | |
| Log-linear models | + | + | + | + | + | requires ‘gnm’ package // comprehensive overview of the ‘gnm’ package | |
The table is an extended version of the table maintained at http://stanfordphd.com/Statistical_Software.html
* Most tutorials supplied in this table assume prior knowledge of the theory behind a given method, and thus serve primarily as a means of introducing the R tools and syntax required to conduct a given type of analysis.

Table 2: Statistical analyses available in the main statistical software
For instance, look at the Quick-R website (http://www.statmethods.net/) or R-bloggers (http://www.rbloggers.com/generalizedlinearmodelsforpredictingrates/).
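To give a flavour of how a few of the analyses listed in Table 2 look in practice, here is a minimal sketch that uses only functions shipped with base R. The built-in mtcars dataset and the chosen variables are our own assumptions, picked purely for illustration; they do not come from the tutorials linked above.

```r
# Illustrative only: the dataset (mtcars) and the variables are assumptions of this sketch.
data(mtcars)

# T-test: does fuel consumption (mpg) differ between automatic (am = 0) and manual (am = 1) cars?
ttest <- t.test(mpg ~ am, data = mtcars)

# Nonparametric alternative: Mann-Whitney-Wilcoxon test
# (exact = FALSE avoids the warning about ties in the data)
wtest <- wilcox.test(mpg ~ am, data = mtcars, exact = FALSE)

# Linear regression of fuel consumption on weight and horsepower
ols <- lm(mpg ~ wt + hp, data = mtcars)

# Logistic regression (a generalized linear model) for the transmission type
logit <- glm(am ~ wt, data = mtcars, family = binomial)

print(ttest$p.value)
print(wtest$p.value)
print(coef(ols))
print(coef(logit))
```

Calling summary() on the fitted objects (for example summary(ols)) prints the usual regression table with standard errors and p-values.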
We strongly recommend following one of these two courses: Try.R School and Swirl. The former is an online course in which you learn the basic steps of the language and the features for organizing data. The latter is a package that you install in R, and it also teaches the basic features of R. However, Swirl differs from Try.R in two respects: i) its course repertory also offers intermediate (how to run a regression) and advanced (how to establish causal inferences) courses, and ii) it shows no epic pictures while you work through the course. Thus, if this is the first time that you do statistics, we highly recommend starting with Try.R. If you know something about statistics but have never written commands, start with Swirl.
TRY.R School: http://tryr.codeschool.com/
Swirl: http://swirlstats.com/
Once you have completed these courses, we suggest following the videos below to learn useful shortcuts for descriptive analysis. Exercises are available for videos 1-5; contact Adrián del Río (email link).
 Basic statistical inferences: https://www.youtube.com/watch?v=dpPwdjorpg0
 Writing documents with R code in RMarkdown: https://www.youtube.com/watch?v=7qTvOZfK6Cw
 Loops (to execute repetitive code statements a particular number of times): theory https://blog.udemy.com/rtutorial/ and practice https://www.youtube.com/watch?v=p7bJjOJoXLI
 Creating functions (to create your own commands): https://www.youtube.com/watch?v=Fb8E2HZrjUE
 Grouped aggregation in R with tapply, dplyr and sapply: https://www.youtube.com/watch?v=aD4R4ZIkeW0
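As a rough illustration of three of the topics covered by the videos (loops, user-defined functions and grouped aggregation), the following sketch uses base R only. The data and the function name rescale01 are invented for this example.

```r
# Hypothetical data, invented for this sketch
scores <- c(12, 15, 9, 20, 18, 11)
group  <- c("a", "b", "a", "b", "b", "a")

# A loop: accumulate the total of the scores one element at a time
total <- 0
for (s in scores) {
  total <- total + s
}

# A user-defined function: rescale a vector to the range [0, 1]
rescale01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
scaled <- rescale01(scores)

# Grouped aggregation: mean score within each group
group_means <- tapply(scores, group, mean)

print(total)
print(scaled)
print(group_means)
```

In practice sum(scores) replaces the loop; the loop is shown only to illustrate the syntax.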
For instance, suppose you want to compare more than two groups that share a set of attributes. You can then tabulate how legislatures and parties are distributed across different types of authoritarian regimes (hegemonic party, military and personalist) that ended up as democracies, before and after the end of the Cold War:
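| Sample | Hegemonic Party | Military | Personalist | Total |
|---|---|---|---|---|
| Legislatures | | | | |
| 1945-1989: Yes | 0.00% | 22.73% | 77.27% | 100% |
| 1945-1989: No | 17.65% | 47.06% | 35.29% | 100% |
| 1990-2014: Yes | 18.18% | 18.18% | 63.36% | 100% |
| 1990-2014: No | 55.77% | 26.92% | 17.31% | 100% |
| 1945-2014: Yes | 6.61% | 21.21% | 72.72% | 100% |
| 1945-2014: No | 46.37% | 31.88% | 21.73% | 100% |
| Parties | | | | |
| 1945-1989: Zero | 0.00% | 21.74% | 78.26% | 100% |
| 1945-1989: One | 33.33% | 50.00% | 16.66% | 100% |
| 1945-1989: More than one | 10.00% | 50.00% | 40.00% | 100% |
| 1990-2014: Zero | 10.00% | 20.00% | 70.00% | 100% |
| 1990-2014: One | 61.90% | 28.57% | 9.52% | 100% |
| 1990-2014: More than one | 53.12% | 25.00% | 21.87% | 100% |
| 1945-2014: Zero | 3.03% | 21.21% | 75.75% | 100% |
| 1945-2014: One | 55.55% | 33.33% | 11.11% | 100% |
| 1945-2014: More than one | 42.86% | 30.95% | 26.19% | 100% |

| Parties | No legislatures | Legislatures | Total of Parties |
|---|---|---|---|
| Zero | 90.90% | 4.35% | 32.35% |
| One | 3.03% | 37.68% | 26.48% |
| More than one | 6.07% | 57.97% | 41.17% |
| Total of Legislatures | 32.35% | 67.64% | 100% |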

Note: Years covered: 1945-2014. Totals in cells describe country-year observations. The hegemonic-party category includes single-party hybrids, and military regimes include military-personalist hybrids. This decision is suggested by Geddes et al.

Table 3: Democratic formal institutions in autocracies that have experienced a democratic transition
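Row percentages of this kind can be obtained with the base R functions table() and prop.table(). The sketch below assumes a small, made-up data frame of country-year observations; the column names and values are hypothetical and are not the data behind the table above.

```r
# Hypothetical country-year observations, invented for this sketch
regimes <- data.frame(
  regime      = c("Military", "Military", "Personalist", "Personalist",
                  "Personalist", "Hegemonic Party", "Hegemonic Party", "Military"),
  legislature = c("Yes", "No", "Yes", "Yes", "No", "No", "No", "No")
)

# Cross-tabulate the counts: legislature status (rows) by regime type (columns)
counts <- table(regimes$legislature, regimes$regime)

# Row percentages: margin = 1 divides each cell by its row total
row_pct <- round(100 * prop.table(counts, margin = 1), 2)
print(row_pct)
```

Using margin = 2 instead would give column percentages, i.e. the share of each regime type with and without a legislature.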
 Content Analysis: Are you interested in knowing how words are associated in a speech? Under which circumstances is a euphemism more likely to be used in place of a noun? How can you learn more about the content of a text? The R script below shows you how:
 Annerose Nisser’s courtesy: click here to download
 Graham Williams’s text mining tool: http://onepager.togaware.com/TextMiningO.pdf
Text mining uses automated algorithms to learn from massive amounts of text, which can be summarised with the set of commands described in the document.
A similar resource is the introduction to the ‘tm’ package by Ingo Feinerer:
 Ingo Feinerer's tm package: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
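Before turning to a dedicated package, the basic idea of counting words can be sketched in base R. The text below is made up for illustration, and the approach is deliberately simplistic compared with what the resources above describe.

```r
# A made-up snippet of text, used only for illustration
speech <- "The people want reform. Reform is what the people need, and reform they shall have."

# Lower-case the text, strip punctuation and split it into words
clean <- gsub("[[:punct:]]", "", tolower(speech))
words <- strsplit(clean, "\\s+")[[1]]

# Count how often each word occurs and sort by frequency
freq <- sort(table(words), decreasing = TRUE)
print(head(freq, 3))
```

A real analysis would usually also remove very common words and handle word endings, which is where a text mining package becomes useful.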
ggplot2 vs. the base package
As with most things in R, plotting can be done in a variety of ways. The base package has some pretty solid plotting functions, but most people seem to use the “ggplot2” package by Hadley Wickham, who has become something of a pop star in the R community. For a nice overview of what ggplot2 can do, as well as a function reference, visit:
http://zevross.com/blog/2014/08/04/beautifulplottinginraggplot2cheatsheet3/
However, there is also a counter-movement of users who prefer the simplicity of the R base plotting functions to ggplot2:
http://shinyapps.org/apps/RGraphCompendium/index.php
Which approach you use ultimately depends on your personal preference as well as the task at hand. In some situations ggplot2 simplifies life a lot, whereas in others it can be more of a burden.
Plotting time series
Let us start with an example where the base package plotting is clearly simpler. First load the package “datasets”, which contains a couple of nice datasets to consider:
library(datasets)
Plotting data on monthly CO2 concentration is as easy as this:
plot(co2)
“co2” is a time series object, which the built-in “plot” function recognizes, so plot() applies its default method for time series objects and puts the date on the x-axis. Now let us do the same with ggplot2. ggplot2 works on data frames, so we need to create a new variable representing the dates, put it together with co2 into a data frame and create a plot:
library(ggplot2)
dates <- seq(1959.0, 1997 + 11/12, 1/12) # monthly dates in decimal years
df <- data.frame(year = dates, co2 = as.numeric(co2)) # combine dates and values
ggobj <- ggplot(df, aes(x = year, y = co2)) + geom_line() # line plot
print(ggobj)
This all looks very complicated compared to the base R plotting, and it is. However, ggplot2 has different features that make it great for other things, especially for multiple layers in plots. We will see this in the next section.
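The hand-built date vector can be avoided: a minimal sketch, assuming you are happy to let the base R function time() extract the observation dates from the time series object. The ggplot2 part only runs if the package is installed.

```r
library(datasets)

# time() returns the observation dates of a time series as decimal years
df <- data.frame(year = as.numeric(time(co2)), co2 = as.numeric(co2))
print(head(df))

# Draw the plot with ggplot2 only if the package is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  print(ggplot(df, aes(x = year, y = co2)) + geom_line())
}
```

This keeps the dates in sync with the data even if the series is later shortened or replaced by another one.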
By the way, the discussion here directly generalizes to plotting a panel, i.e. multiple time series:
rw <- sapply(rep(1, 10), function(x) cumsum(rnorm(1000))) # create 10 random walks
plot(ts(rw)) # plot
Plotting cross-sectional data
Now consider the data “occupationalStatus”, which is a matrix showing the respective occupational status of fathers and sons in Britain. To handle the data easily, we need the “reshape2” package, which lets you transform data into a form more suitable for plotting:
library(reshape2)
df <- melt(occupationalStatus, id = "origin") # reshape into long format (columns: origin, destination, value)
ggobj <- ggplot(df, aes(x = destination, y = value)) + geom_bar(stat = "identity") # make a basic bar plot
print(ggobj) # print
This does not look great. But consider splitting the individual bars according to the origin of the sons:
ggobj <- ggobj + aes(fill = origin) # colour the bar segments by origin
print(ggobj)
This looks better. This example also shows the amazing side of ggplot2: we can change plots step by step by adding commands to the original object. Another great feature of ggplot2 is that we can easily create multiple plots grouped by a specified variable in the data frame. Why not, for instance, divide the plots by the occupation of the father?
ggobj <- ggobj + facet_grid(origin ~ ., scales = "free_y") + aes(fill = factor(destination)) # one panel per origin
print(ggobj)
This plot seems to lend itself much better to interpretation. For more examples of how to make nice plots with either ggplot2 or the base package, check the links I provided or some of the dozens of nicely written tutorials on the internet.
Exporting graphs to LaTeX
Some of the user-written packages in R are simply outstanding. Try out “tikzDevice”, which will transform your (base or ggplot2) graphs into a TikZ picture (if you don’t know what TikZ is, look it up). Graphs in papers or presentations can hardly look better than that. For our previous example:
library(tikzDevice)
options(tikzDefaultEngine = "xetex")
options(tikzLatexPackages = c(
getOption( "tikzLatexPackages" ),
"\\usepackage{amsmath}",
"\\usepackage{amsfonts}"
)) # change some options
tikz(file="occupationalStatus.tex",width=15.74776/2.54,height=15.74776) # sizes are in inches; the width is converted from cm (1 inch = 2.54 cm)
plot(ggobj)
dev.off()
One disadvantage of R is that it is rather slow (maybe not compared to Stata, but compared to other programming languages; see for instance http://economics.sas.upenn.edu/~jesusfv/comparison_languages.pdf). You might not notice this while running a regression, but you will as soon as you do computationally more involved tasks. The good news, however, is that you can speed things up considerably.
Making your code more efficient
Whenever your code seems to run slowly, check whether you wrote it in an inefficient way. Three simple but effective pieces of advice are:
 With computationally intensive tasks such as looping, avoid using complex objects such as arrays, lists or data frames
 Define objects of fixed length before you loop over them
 Avoid explicit looping when a built-in function does the same job
In the following example, you will see how enormous the gain in speed can be if you follow the advice. We will do exactly the same task in different ways and compare how much time it takes.
A practical example
Assume that for some reason you would like to simulate a random walk of a certain length:
n <- 100000 # length of the series
You could do this by creating an empty list and then consecutively adding random numbers to it:
t0 <- proc.time() # start the stopwatch
y <- list(); y[[1]] <- 0 # empty list that grows inside the loop
for (i in 2:n) {
  y[[i]] <- y[[i - 1]] + rnorm(1)
}
texec1 <- proc.time() - t0 # get the execution time
Now, instead of a list, use a vector:
t0 <- proc.time()
y <- 0 # vector of length one that grows inside the loop
for (i in 2:n) {
  y[i] <- y[i - 1] + rnorm(1)
}
texec2 <- proc.time() - t0
Do the same, but now define the object BEFORE looping over it:
t0 <- proc.time()
y <- rep(0, n) # pre-allocate the full vector
for (i in 2:n) {
  y[i] <- y[i - 1] + rnorm(1) # fill position i from position i - 1
}
texec3 <- proc.time() - t0
Finally, instead of using the for loop, do the same with the built-in “cumsum” command:
t0 <- proc.time()
eps <- rnorm(n - 1) # draw all increments at once
y <- c(0, cumsum(eps)) # start at 0 and accumulate the increments
texec4 <- proc.time() - t0
Now let us compare the elapsed time for the computation:
c(texec1[3]/texec4[3],texec2[3]/texec4[3],texec3[3]/texec4[3],texec4[3]/texec4[3])
The differences are enormous: on my machine, the first method takes 1365 times longer than the most efficient one, the second one 486 times and the third one 15.5 times. So always try to make your code efficient before you try getting more processing power!
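A timing comparison is easier to repeat when each approach is wrapped in a function and timed with system.time(). The following is only a sketch under our own assumptions: the function names rw_loop and rw_cumsum are hypothetical, the series is shorter than in the text so that it runs quickly, and both functions receive the same random draws so that their results can be compared.

```r
n <- 10000  # shorter than in the text so that the sketch runs quickly

# Random walk via a pre-allocated vector and an explicit loop
rw_loop <- function(eps) {
  y <- numeric(length(eps))
  y[1] <- eps[1]
  for (i in seq_along(eps)[-1]) {
    y[i] <- y[i - 1] + eps[i]
  }
  y
}

# Random walk via the vectorized built-in cumsum()
rw_cumsum <- function(eps) cumsum(eps)

set.seed(1)      # make the random draws reproducible
eps <- rnorm(n)

t_loop   <- system.time(y1 <- rw_loop(eps))
t_cumsum <- system.time(y2 <- rw_cumsum(eps))

# Both approaches should give the same path (up to floating-point error)
print(all.equal(y1, y2))
print(rbind(loop = t_loop[1:3], cumsum = t_cumsum[1:3]))
```

Checking that the fast version reproduces the slow one is a cheap safeguard before replacing a loop in real code. Timings vary between machines and runs, so treat any single measurement with caution.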
“apply” vs. “for”
There are two ways of constructing loops in R: explicit loop commands such as “for”, and the “apply” family of functions. Hardcore R users always use apply and will tell you that it is much quicker than “for”. Let us check this in the context of our example: assume that we want to create not just one random walk, but a bunch of them:
n <- 10000 # length of time series
ns <- 10000 # number of paths
eps <- matrix(rnorm(n * ns), nrow = n, ncol = ns) # initialize the errors (n * ns = 1e8 numbers, roughly 800 MB)
First use the “for” loop:
y <- matrix(NA, nrow = n, ncol = ns) # initialize the matrix of paths
t0 <- proc.time()
for (i in 1:ns) {
  y[, i] <- cumsum(eps[, i]) # one random walk per column
}
texec1 <- proc.time() - t0
Now do the same with “apply”:
t0 <- proc.time()
y <- apply(eps, 2, cumsum) # apply cumsum to each column
texec2 <- proc.time() - t0
And compare the computing time:
texec1[3]/texec2[3]
At least for this example, I find that the “for” loop is actually faster. With “apply”, however, the code looks neater. For more details, you can read e.g. http://blog.datacamp.com/tutorialonloopsinr/.
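Whichever construct you prefer, it is worth confirming on a small input that both give the same result. A minimal sketch, with matrix dimensions chosen arbitrarily so that it runs instantly:

```r
set.seed(42)
eps <- matrix(rnorm(200 * 50), nrow = 200, ncol = 50)  # small, for illustration

# Column-wise cumulative sums with an explicit loop
y_for <- matrix(NA_real_, nrow = nrow(eps), ncol = ncol(eps))
for (i in seq_len(ncol(eps))) {
  y_for[, i] <- cumsum(eps[, i])
}

# The same with apply(): MARGIN = 2 means "apply the function to each column"
y_apply <- apply(eps, 2, cumsum)

print(all.equal(y_for, y_apply))
```

Note that seq_len(ncol(eps)) is a safer loop range than 1:ncol(eps), because it yields an empty sequence when the matrix has no columns.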
For information on running parallel R jobs using either several CPU cores or your GPU as well as instructions and example codes for doing this on the EUIHPC cluster, please check the presentation document
https://sites.google.com/site/mschmidtblaicher/downloads/EUIHPCpresentation_R.pdf
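For a first impression of parallel computation on your own machine, the ‘parallel’ package that ships with R is enough. The sketch below is not taken from the presentation and does not cover cluster-specific settings; the function name simulate_endpoint and the choice of two workers are assumptions made for illustration.

```r
library(parallel)  # ships with R

# Simulate the end point of one random walk of length n
# (hypothetical helper, defined only for this sketch)
simulate_endpoint <- function(seed, n = 1000) {
  set.seed(seed)   # one seed per task keeps the results reproducible
  sum(rnorm(n))
}

cl <- makeCluster(2)                          # start two worker processes
res <- parLapply(cl, 1:8, simulate_endpoint)  # one task per seed
stopCluster(cl)                               # always release the workers

endpoints <- unlist(res)
print(length(endpoints))
```

Because every task sets its own seed, the parallel run returns the same numbers as sapply(1:8, simulate_endpoint) executed sequentially. Starting workers has a fixed cost, so parallelization pays off only when each task takes a noticeable amount of time.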
The power of R as a programming language extends beyond statistical analysis. A number of packages have been developed by R users that allow relatively easy and convenient ways of scraping Internet data and preprocessing it for further analysis. Below you will find a list of nice video tutorials introducing the basics of extracting, parsing and processing content from
Note, however, that scraping data from Internet sources might be challenging for people unfamiliar with the basics of HTML architecture.
R also offers amazing capacities for interactive data visualization that can be employed not only for analytic purposes, but also for teaching as well as presentation and popularization of your research results.
Plotly
The first library you might want to get acquainted with is the ‘plotly’ package, which allows you to create a wide variety of fancy-looking interactive plots, both as objects within the R environment and as separate objects that can be embedded in your blog or custom webpage. Check out the following links, which should convince you to use the library and provide an easy start with it:
Shiny
The second library is called ‘shiny’, and it allows you to turn your intimidating R code into nice user-friendly applications that can be interacted with through a more common web interface. Check out the following link for inspiring examples. Conveniently, the library’s official webpage contains a nice and well-explained tutorial, which should get you started quickly (entirely accessible even to inexperienced R users!). What is more amazing is that ‘shiny’ and ‘plotly’ can be used in conjunction with each other, making them an extremely powerful tool for handling and communicating your data.
Note that both ‘shiny’ and ‘plotly’ come with subscription-based online services. Installing and using the libraries within the R environment is completely free of charge, as is the possibility to post a limited number of objects (i.e. apps and graphs) online. However, you might want to check the Shiny and Plotly subscription plans to learn more about the precise subscription terms.
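As a minimal sketch of what an interactive plot looks like in code, the following builds a line chart of the built-in co2 series with plot_ly() from the ‘plotly’ package. The choice of data is ours, and the plotting part only runs if the package is installed.

```r
library(datasets)

# Put the built-in co2 series into a data frame (dates as decimal years)
df <- data.frame(year = as.numeric(time(co2)), co2 = as.numeric(co2))

if (requireNamespace("plotly", quietly = TRUE)) {
  # plot_ly() creates an interactive chart object; the ~ refers to columns of df
  p <- plotly::plot_ly(df, x = ~year, y = ~co2, type = "scatter", mode = "lines")
  # Typing p at the console displays the chart
} else {
  message("Package 'plotly' is not installed; run install.packages(\"plotly\") first.")
}
```

Hovering over the displayed chart shows the individual values, and the toolbar allows zooming and panning.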
Contributors to this webpage:
Adrián del Río
Matthias Schmidtblaicher
Gordey Yastrebov