
Introduction to R

This page is maintained by the software tutors. For errors and/or amendments, please contact the current tutor supporting the program.

 

This guide provides an easy-to-read starter guide to R. It is not intended to replace the full user manual provided at R's homepage.

For any further questions or assistance feel free to contact the  R software tutor.

 

 

Overview

R is an open-source tool for statistics and data modeling driven by syntax. It is by far the most comprehensive environment for statistical and econometric computing. For someone unacquainted with programming languages, the early steps in R are as difficult as learning a new language. However, once users know the basic grammar of R programming, they can be good producers and consumers of R’s output: they can understand and write commands on their own to manipulate quantitative and qualitative data easily and present it in compelling ways through customized tables, powerful graphs and other visual devices. The only things you need are patience and discipline in your learning process. In case of doubt or despair, do not hesitate to contact your EUI R tutor(s).

Against this background, this webpage aims to help you get started in R. We suggest answers to the following questions:

  1. What are the differences between R and other statistical software?
  2. What sort of analysis can I do in R that is not available in other software?
  3. Where and how can I start working with R?
  4. How do I plot objects in R?
  5. How can I make my R code more efficient?
  6. How can I make use of parallel computation and the EUI-HPC computing cluster with R? 

 

 

A Brief Comparison of R With Other Mainstream Statistical Software

 

The table below schematically shows the main differences among well-known statistical software packages: R, MATLAB, SAS, STATA and SPSS:

 

 

| | R | MATLAB | SAS/STATA | SPSS |
|---|---|---|---|---|
| Required license? | No | Yes | Yes | Yes |
| Size of library? | Unlimited | Unlimited | Unlimited, but only official ones | Limited |
| Flexibility to compute statistics? | Yes | Yes | Medium | No |
| Speed in computation? | Fast | The fastest | Medium | Slow |
| Simple commands? | Yes | No | Medium | No |
| Customized graphs? | Yes | Yes | No | No |
| Limitations in the type of analysis? | No | No | Medium | Yes |

Table 1: Differences among main statistical software

 

 

R’s Analytic Capacities Compared to Other Software (with Links to Tutorials Covering Most Widely Used Types of Analysis)

 

The following table shows which types of statistical analysis are readily available in each package, with links to official, widely known and reliable public websites. Most of the time, the required R packages and the procedure for implementing the analysis are well explained on the same website:

| Type of statistical analysis | R | MATLAB | SAS | STATA | SPSS | Examples of R code and written tutorials* | Video tutorials* |
|---|---|---|---|---|---|---|---|
| Nonparametric Tests | + | + | + | + | + | e.g. sample codes for Sign Test, Wilcoxon Signed-Rank Test, Mann-Whitney-Wilcoxon Test, Kruskal-Wallis Test | |
| T-test | + | + | + | + | + | sample code | |
| ANOVA & MANOVA | + | + | + | + | + | sample code | |
| ANCOVA & MANCOVA | + | + | + | + | + | check the link above | |
| Linear Regression | + | + | + | + | + | sample code and overview of related techniques here, here and here | video explaining some basic techniques (~15 min) |
| Generalized Least Squares | + | + | + | + | + | description and example using gls() function from ‘nlme’ package // yet another example | |
| Ridge Regression | + | + | + | limited | limited | | |
| Lasso | + | + | + | limited | | | |
| Generalized Linear Models | + | + | + | + | + | introductory tutorial | |
| Logistic Regression | + | + | + | + | + | examples involving the use of binary logit, ordinal logit, and multinomial logit // presentation with sample code involving the use of binary, ordinal and multinomial logit (incl. calculation of marginal effects) | binary logit (and probit) (~12 min) // ordinal logit (and probit) (~6 min) // multinomial logit (and probit) (~15 min) |
| Mixed Effects Models | + | + | + | + | + | requires ‘lme4’ or ‘nlme’ package // basic introductory guide with sample code using ‘lme4’ package // more comprehensive guide on ‘lme4’ package // example involving the use of ‘nlme’ package | |
| Nonlinear Regression | + | + | + | limited | limited | basic introductory guide with sample code // more comprehensive guide | video demonstration: one (~4 min), two (~5 min) |
| Discriminant Analysis | + | + | + | + | + | requires ‘MASS’ package // sample code and brief overview // more comprehensive guide | video demonstration (~6 min) |
| Factor & Principal Components Analysis | + | + | + | + | + | sample code and brief overview // practical example | video demonstration (~15 min) |
| Canonical Correlation Analysis | + | + | + | + | + | | |
| Copula Models | + | + | | | | | |
| Path Analysis | + | + | + | + | + | requires ‘plsm’ package (for PLS path analysis) // ‘plsm’ package tutorial // for SEM path analysis see below | |
| Structural Equation Modeling (Latent Factors) | + | + | + | + | limited | requires ‘lavaan’ or ‘sem’ package // ‘lavaan’ package tutorial // ‘sem’ package tutorial // short ‘sem’ demonstration | introduction to ‘lavaan’ (~45 min) // yet another ‘lavaan’ video tutorial (~45 min) |
| Extreme Value Theory | + | + | | | | | |
| Variance Stabilization | + | + | | | | | |
| Bayesian Statistics | + | + | limited | | | introduction to Bayesian statistics in R | |
| Monte Carlo, Classic Methods | + | + | + | + | limited | | |
| Markov Chain Monte Carlo | + | + | + | | | | |
| EM Algorithm | + | + | + | | | | |
| Missing Data Imputation | + | + | + | + | + | overview of several existing packages | |
| Bootstrap & Jackknife | + | + | + | + | + | some approaches demonstrated here, here and here // applications to regression models here and here | bootstrapping video example (~20 min) |
| Outlier Diagnostics | + | + | + | + | + | examples of several basic techniques | video demonstration: one, two, three |
| Robust Estimation | + | + | + | + | | example of estimating robust regressions using ‘MASS’ package // comprehensive guide to robust estimation | |
| Cross-Validation | + | + | + | | | | |
| Longitudinal (Panel) Data | + | + | + | + | limited | requires ‘plm’ or ‘lme4’ package // for ‘lme4’ guides refer to Mixed Effects Models above // ‘plm’ sample code demonstration // ‘plm’ package tutorial | tutorial one (~15 min) // tutorial two (~10 min) |
| Survival Analysis | + | + | + | + | + | requires ‘survival’ package // ‘survival’ package tutorial // another tutorial here | introductory video (~15 min) // more in-depth tutorial (~1 hour 20 min) |
| Propensity Score Matching | + | + | limited | limited | | requires ‘MatchIt’ or ‘Matching’ package // ‘MatchIt’ package tutorial // ‘Matching’ package tutorial | video demonstration using ‘Matching’ package (~13 min) // video demonstration using ‘MatchIt’ package (~17 min) |
| Stratified Samples (Survey Data) | + | + | + | + | + | | |
| Experimental Design | + | + | limited | | | | |
| Quality Control | + | + | + | + | + | | |
| Reliability Theory | + | + | + | + | + | | |
| Univariate Time Series | + | + | + | + | limited | comprehensive tutorial | video tutorial (~1 hour) |
| Multivariate Time Series | + | + | + | + | | | |
| Stochastic Volatility Models, Discrete Case | + | + | + | + | limited | | |
| Stochastic Volatility Models, Continuous Case | + | + | limited | limited | | | |
| Diffusions | + | + | | | | | |
| Markov Chains | + | + | | | | | |
| Hidden Markov Models | + | + | | | | | |
| Counting Processes | + | + | + | | | various showcases of estimating count data models here, here and here | count data models (~11 min) |
| Filtering | + | + | limited | limited | | | |
| Instrumental Variables | + | + | + | + | + | tutorial involving ‘ivreg’ function from ‘AER’ package | video demonstration (~12 min) |
| Splines | + | + | + | + | | | |
| Nonparametric Smoothing Methods | + | + | + | + | | | |
| Spatial Statistics | + | + | limited | limited | | | |
| Cluster Analysis | + | + | + | + | + | requires various packages (check tutorials) // brief introduction to R’s cluster analysis capacities // more comprehensive tutorials here and here | a set of videos on hierarchical cluster analysis: one, two, three, four and five // a set of videos on K-means clustering: one and two |
| Neural Networks | + | + | + | | limited | | |
| Classification & Regression Trees | + | + | limited | | limited | | |
| Random Forests | + | + | limited | | | | |
| Support Vector Machines | + | + | + | | | | |
| Signal Processing | + | + | | | | | |
| Wavelet Analysis | + | + | + | | | | |
| Bagging | + | + | + | | | | |
| ROC Curves | + | + | + | + | + | | |
| Meta-analysis | + | + | limited | + | | | |
| Deterministic Optimization | + | + | + | limited | | | |
| Stochastic Optimization | + | + | limited | | | | |
| Content Analysis | + | | limited | limited | limited | short tutorials using ‘RQDA’ package here, here and here // tutorial using ‘tm’ package here | ‘RQDA’ video tutorial (~5 hours) // ‘tm’ video tutorial (~8 min) // extensive video tutorial on text mining in R (follow the playlist covering chapter 7, ~2 hours 20 min) |
| Quantile Regression | + | + | + | + | | requires ‘quantreg’ package // sample code and brief overview // more comprehensive guide | video demonstration (~9 min) |
| Seemingly Unrelated Regression | + | + | + | + | | requires ‘systemfit’ package // sample code and demonstration // ‘systemfit’ package tutorial | video demonstration (~5 min) |
| Tobit and Truncated Regression | + | + | + | + | | requires ‘censReg’ package // sample code and demonstration // ‘censReg’ package tutorial | video demonstration (~9 min) |
| Qualitative Comparative Analysis (QCA) | + | | | limited | | requires ‘QCA’ package // ‘QCA’ package description // brief demonstration // QCA tutorial (using ‘QCA’ and ‘SetMethods’ packages) | |
| Social Network Analysis | + | + | + | + | + | requires ‘igraph’, ‘statnet’ or ‘sna’ package // ‘igraph’ package tutorial // lab sessions using ‘igraph’ with densely commented code // nice tutorial combining the use of ‘igraph’ and ‘statnet’ // ‘sna’ package description | |
| Log-linear Models | + | + | + | + | + | requires ‘gnm’ package // comprehensive overview of the ‘gnm’ package | |

The table is an extended version of the table maintained at http://stanfordphd.com/Statistical_Software.html

* Most tutorials supplied in this table assume prior knowledge of the theory behind a given method, and thus serve primarily as a means of introducing the R tools and syntax required to conduct a given type of analysis.

Table 2: Statistical analyses available in mainstream statistical software

 

For instance, look at the Quick-R website (http://www.statmethods.net/) or R-bloggers (http://www.r-bloggers.com/generalized-linear-models-for-predicting-rates/).

 

 

 

Getting Started

 

We highly recommend following one of these two websites: Try.R (Code School) and Swirl. The former contains a course that teaches you the basic steps of the language and its features for organizing data. The latter is a package you install in R; it also teaches the basic features of R. However, Swirl differs from Try.R in two respects: i) its course repertory offers intermediate courses (e.g. how to run regressions) and advanced ones (e.g. how to establish causal inferences), and ii) it lacks Try.R's playful graphics. Thus, if this is your first time doing statistics, we highly recommend starting with Try.R. If you know some statistics but have never written commands, start with Swirl.

Try.R (Code School): http://tryr.codeschool.com/

Swirl: http://swirlstats.com/
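As a minimal sketch of the Swirl route (these are the standard CRAN installation commands; the course menu is interactive), getting started from the R console looks like this:

```r
# Install the swirl package from CRAN (only needed once)
install.packages("swirl")

# Load the package and launch the interactive course menu
library(swirl)
swirl()
```

From the menu you can then pick a beginner, intermediate or advanced course and work through it directly inside R.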

 

Once you have completed these courses, we suggest following the videos below to learn useful shortcuts for descriptive analysis. Exercises are available for items 1-5; to obtain them, contact Adrián del Río (email link).

  1. Basic statistical inferences: https://www.youtube.com/watch?v=dpPwdjorpg0
     
  2. Writing documents with R code in RMarkdown: https://www.youtube.com/watch?v=7qTvOZfK6Cw
     
  3. Loops (to execute repetitive code statements a particular number of times): theory https://blog.udemy.com/r-tutorial/ and practice https://www.youtube.com/watch?v=p7bJjOJoXLI
     
  4. Creating functions (to create your own commands): https://www.youtube.com/watch?v=Fb8E2HZrjUE
     
  5. Grouped aggregation in R with tapply, dplyr and sapply: https://www.youtube.com/watch?v=aD4R4ZIkeW0
     
    For instance, suppose you want to calculate mean differences between more than two groups that share a set of attributes. You could then tabulate the legislatures and parties of different types of authoritarian regimes (hegemonic-party, military and personalist) that have ended in democratization, before and after the Cold War:
     

| Sample | | Hegemonic Party | Military | Personalist | Total |
|---|---|---|---|---|---|
| **Legislatures** | | | | | |
| 1945-1989 | Yes | 0.00% | 22.73% | 77.27% | 100% |
| 1945-1989 | No | 17.65% | 47.06% | 35.29% | 100% |
| 1990-2014 | Yes | 18.18% | 18.18% | 63.36% | 100% |
| 1990-2014 | No | 55.77% | 26.92% | 17.31% | 100% |
| 1945-2014 | Yes | 6.61% | 21.21% | 72.72% | 100% |
| 1945-2014 | No | 46.37% | 31.88% | 21.73% | 100% |
| **Parties** | | | | | |
| 1945-1989 | Zero | 0.00% | 21.74% | 78.26% | 100% |
| 1945-1989 | One | 33.33% | 50.00% | 16.66% | 100% |
| 1945-1989 | More than one | 10.00% | 50.00% | 40.00% | 100% |
| 1990-2014 | Zero | 10.00% | 20.00% | 70.00% | 100% |
| 1990-2014 | One | 61.90% | 28.57% | 9.52% | 100% |
| 1990-2014 | More than one | 53.12% | 25.00% | 21.87% | 100% |
| 1945-2014 | Zero | 3.03% | 21.21% | 75.75% | 100% |
| 1945-2014 | One | 55.55% | 33.33% | 11.11% | 100% |
| 1945-2014 | More than one | 42.86% | 30.95% | 26.19% | 100% |

| Parties | No legislatures | Legislatures | Total of Parties |
|---|---|---|---|
| Zero | 90.90% | 4.35% | 32.35% |
| One | 3.03% | 37.68% | 26.48% |
| More than one | 6.07% | 57.97% | 41.17% |
| Total of Legislatures | 32.35% | 67.64% | 100% |

Note: Years covered: 1945-2014. Totals in cells describe country-year observations. The hegemonic-party category includes single-party hybrids, and military regimes include military-personalist hybrids. This decision is suggested by Geddes et al.

Table 3: Democratic formal institutions in autocracies that have experienced a democratic transition
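Cross-tabulations like the one above can be produced with base R alone. The sketch below uses the built-in mtcars data as a stand-in (the regime dataset shown above is not bundled with this page), combining tapply for grouped means with table and prop.table for row percentages:

```r
# Grouped aggregation with tapply: mean mpg for each number of cylinders
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)
print(round(group_means, 2))

# A cross-tab with row percentages, like the table above:
# share of transmission types (am) within each cylinder group
counts <- table(cyl = mtcars$cyl, am = mtcars$am)
row_pct <- prop.table(counts, margin = 1) * 100  # margin = 1 => rows sum to 100%
print(round(row_pct, 2))
```

The same pattern extends to dplyr's group_by()/summarise(), covered in the video linked under item 5.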

 

  6. Content Analysis: Are you interested in how words are associated within a speech? Under which circumstances is a euphemism more likely to be used in place of a particular noun? How can you learn more about the content of a text? The R scripts below show you how:
     
     Courtesy of Annerose Nisser: click here to download
     
     - Graham Williams's text mining tool: http://onepager.togaware.com/TextMiningO.pdf
     
     Text mining uses automated algorithms to learn from massive amounts of text, which can be summarized by following the set of commands described in the document.
     
     A similar package is introduced by Ingo Feinerer:
     
     - Ingo Feinerer's tm package: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

 

 

 

Plotting Objects

 

ggplot2 vs. the base package

As with most things in R, there is a variety of ways in which plotting can be done. The base package has some pretty solid plotting functions, but most people seem to use the “ggplot2” package by Hadley Wickham, who has become something of a pop star in the R community. For a nice overview of what ggplot2 can do, as well as a function reference, visit:

http://zevross.com/blog/2014/08/04/beautiful-plotting-in-r-a-ggplot2-cheatsheet-3/

However, there also is a countermovement by users who prefer the simplicity of the R base plots package to ggplot2:

http://shinyapps.org/apps/RGraphCompendium/index.php

Which approach you use ultimately depends on your personal preference as well as the task at hand. In some situations ggplot2 actually simplifies life a lot, whereas in others it can be more of a burden.

 

Plotting time series

Let us start with an example where the base package plotting is clearly simpler. First load the package “datasets”, which contains a couple of nice datasets to consider:

 

library(datasets)

 

Plotting data on monthly CO2 Concentration is as easy as this:

 

plot(co2)

 

“co2” is a time series object, which the built-in plot() function recognizes, so plot() applies its default method for time series objects and puts the date on the x-axis. Now let us do the same with ggplot2. ggplot2 works on data frames, so we need to create a new variable representing the dates, put it together with co2 into a data frame, and create a plot:

 

library(ggplot2)

dates<- seq(1959.0,1997+11/12,1/12)

df<- data.frame(cbind(year=dates,co2=co2))

ggobj<- ggplot(df,aes(x=year,y=co2))+ geom_line()

print(ggobj)

 

This all looks very complicated compared to base R plotting, and it is. However, ggplot2 has features that make it great for other things, especially plots with multiple layers. We will see this in the next section.

 

By the way, the discussion here directly generalizes to plotting a panel, i.e. multiple time series:

 

rw<- sapply(rep(1,10),function(x) cumsum(rnorm(1000))) # create 10 random walks

plot(ts(rw)) # plot

 

Plotting cross-sectional data

Now consider the data “occupationalStatus”, which is a matrix showing the respective occupational status of fathers and sons in Britain. In order to deal easily with the data, we need the “reshape2” package, which allows you to easily transform data frames into a shape more suitable for plotting:

 

library(reshape2)

df<- melt(occupationalStatus,id="origin") # reshape the data frame

ggobj<- ggplot(df,aes(x=destination,y=value))+ geom_bar(stat = "identity") # make a basic bar plot

print(ggobj) # print

 

This does not look great. But consider splitting the individual bars according to the origin of the sons:

 

ggobj<- ggobj + aes(fill=origin)

print(ggobj)

 

This looks better. This example also shows us the amazing side of ggplot2: we can change plots step by step through adding commands to the original object. Another great feature of ggplot2 is that we can easily create multiple plots grouped by a specified variable in the dataframe. Why not for instance divide the plots by the occupation of the father?

 

ggobj<- ggobj + facet_grid(origin ~ .,scales="free_y") + aes(fill = factor(destination))

print(ggobj)

 

This plot seems to lend itself much better to interpretation. For more examples of how to make nice plots with either ggplot2 or the base package, check the links I provided or some of the dozens of nicely written tutorials on the internet.

 

Exporting graphs to LaTeX

Some of the user-written packages in R are simply outstanding. Try out “tikzDevice”, which transforms your (base or ggplot2) graphs into TikZ pictures (if you don’t know what TikZ is, look it up). Graphs in papers or presentations can hardly look better than that. For our previous example:

 

library(tikzDevice)

options(tikzDefaultEngine = "xetex")

options(tikzLatexPackages = c(

getOption( "tikzLatexPackages" ),

  "\\usepackage{amsmath}",

  "\\usepackage{amsfonts}"

)) # change some options

tikz(file="occupationalStatus.tex",width=15.74776/2.54,height=15.74776) # 2.54 inch = 1 cm

plot(ggobj)

dev.off()

 

 

Optimizing Your Code

One disadvantage of R is that it is somewhat slow (maybe not compared to Stata, but compared to other programming languages; see for instance http://economics.sas.upenn.edu/~jesusfv/comparison_languages.pdf). You might not notice this while running a regression, but as soon as you do computationally more involved tasks, you will. The good news, however, is that you can speed up processes considerably.

 

Making your code more efficient

Whenever your code seems to run slowly, check whether you wrote it in an inefficient way.  Three simple but effective pieces of advice are:

  • With computationally intensive tasks such as looping, avoid using complex objects such as arrays, lists or dataframes
  • Define objects of fixed length before you loop over them
  • Avoid explicit looping when a built-in function does the same

In the following example, you will see how enormous the gain in speed can be if you follow the advice. We will do exactly the same task in different ways and compare how much time it takes.

 

A Practical Example

Assume that for some reason you would like to simulate a random walk of a certain length:

 

n <- 100000 # length of the series

 

You could do this by creating an empty list and then consecutively appending random numbers to it:

 

t0 <- proc.time() # start the stopwatch

y<- list(); y[[1]] <- 0

for(i in 2:n){

y[[i]] <- y[[i-1]] + rnorm(1)

}

texec1<- proc.time() - t0 # get the execution time

 

Now, instead of a list, use a vector:

 

t0 <- proc.time()

y <- 0

for(i in 2:n){

  y[i] <- y[i-1] + rnorm(1)

}

texec2 <- proc.time() - t0

 

Do the same, but now  define the object BEFORE looping across it:

 

t0 <- proc.time()

y <- rep(0,n)

for(i in 2:n){

  y[i] <- y[i-1] + rnorm(1)

}

texec3 <- proc.time() - t0

 

Finally, instead of using the for loop, do the same with the built-in “cumsum” command:

 

t0 <- proc.time()

eps<- rnorm(n)

y <- cumsum(eps)

texec4 <- proc.time() - t0

 

Now let us compare the elapsed time for the computation:

 

c(texec1[3]/texec4[3],texec2[3]/texec4[3],texec3[3]/texec4[3],texec4[3]/texec4[3])

 

The differences are enormous: on my machine, the first method takes 1365 times longer than the most efficient one, the second one 486 times and the third one 15.5 times. So always try to make your code efficient before you try getting more processing power!

 

“apply” vs. “for”

There are two ways of constructing loops in R: the commonly used commands such as “for” and the “apply” family of functions. Hardcore R users always use apply and will tell you that it is much quicker than “for”. Let us check this in the context of our example: assume that we do not want to create only one random walk, but a bunch of them:

 

n <- 10000 # length of time series

ns<- 10000 # number of paths

eps<- matrix(rnorm(n*ns),nrow=n,ncol=ns) # initialize the errors

 

First use the “for”  loop:

 

y <- matrix(NA,nrow=n,ncol=ns) # initialize the vectors

t0 <- proc.time()

for(i in 1:ns){

    y[,i] <- cumsum(eps[,i])

}

texec1 <- proc.time() - t0

 

Now do the same with “apply”:

 

t0 <- proc.time()

y <- apply(eps,2,cumsum)

texec2 <- proc.time() - t0

 

And compare the computing time:

 

texec1[3]/texec2[3]

 

At least for this example, I find that the “for” loop is actually faster; the code, however, looks neater with “apply”. For more details, you can read e.g. http://blog.datacamp.com/tutorial-on-loops-in-r/.

 

 

Parallel Computation and Running a Script on the Cluster

For information on running parallel R jobs using either several CPU cores or your GPU as well as instructions and example codes for doing this on the EUI-HPC cluster, please check the presentation document

https://sites.google.com/site/mschmidtblaicher/downloads/EUIHPCpresentation_R.pdf
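As a minimal local sketch using the base ‘parallel’ package (the two-worker cluster and the random-walk task are illustrative choices, not EUI-HPC settings from the presentation), the simulation from the previous section can be spread across CPU cores like this:

```r
library(parallel)

# Start a small local cluster; on a real machine you might use detectCores() - 1
cl <- makeCluster(2)

# Simulate 100 random walks of length 1000 in parallel:
# each worker draws its own innovations and cumulates them
walks <- parLapply(cl, 1:100, function(i) cumsum(rnorm(1000)))

# Always release the workers when you are done
stopCluster(cl)

length(walks)       # number of simulated walks
length(walks[[1]])  # length of each walk
```

On Linux and macOS, mclapply() offers the same idea with even less setup (it forks the current session instead of starting fresh workers).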

 

Data Scraping

The power of R as a programming language extends beyond statistical analysis. A number of user-contributed packages allow relatively easy and convenient ways of scraping Internet data and preprocessing it for further analysis. Below you will find a list of nice video tutorials introducing the basics of extracting, parsing and processing content from web pages.

Note, however, that scraping data from Internet sources might be challenging for people unfamiliar with the basics of HTML architecture.
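To illustrate the kind of workflow those tutorials cover, here is a minimal sketch using the ‘rvest’ package (one of several scraping packages on CRAN; the HTML snippet is invented for the example, and html_elements() assumes a recent rvest version):

```r
library(rvest)  # brings in xml2's HTML parsing as well

# A toy HTML document standing in for a downloaded web page
page <- read_html("<html><body>
  <h1>Election results</h1>
  <ul><li>Party A: 42%</li><li>Party B: 38%</li></ul>
</body></html>")

# Extract the list items with a CSS selector and keep their text
results <- page %>% html_elements("li") %>% html_text()
print(results)
```

In real use, read_html() is pointed at a URL instead of a string, which is where familiarity with the target page's HTML structure becomes essential.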

 

Interactive Data Visualization

R also offers amazing capacities for interactive data visualization that can be employed not only for analytic purposes, but also for teaching as well as presentation and popularization of your research results.

Plotly

The first library you might want to introduce yourself to is the ‘plotly’ package, which allows you to create a wide variety of fancy-looking interactive plots, both as objects within the R environment and as separate objects that can be embedded in your blog or custom web page. Check out the following links, which should convince you to use the library and provide an easy start with it:
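As a minimal sketch (a generic line chart on made-up data, not tied to any EUI example), turning a data frame into an interactive plot takes a single plot_ly() call:

```r
library(plotly)

# Some toy data to visualize
df <- data.frame(year = 2000:2010, value = cumsum(rnorm(11)))

# plot_ly builds an interactive htmlwidget; viewed in RStudio or a browser,
# it supports hovering, zooming and panning with no extra code
p <- plot_ly(df, x = ~year, y = ~value, type = "scatter", mode = "lines+markers")

# htmlwidgets::saveWidget(p, "my_plot.html")  # save for embedding in a web page
p
```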

Shiny

The second library is called ‘shiny’, and it allows you to turn your intimidating R code into nice, user-friendly applications that can be interacted with through a familiar web interface. Check out the following link for inspiring examples. Conveniently, the library’s official web page contains a nice, well-explained tutorial which should get you started quickly (entirely accessible even to inexperienced R users!). What is more amazing is that ‘shiny’ and ‘plotly’ can be used in conjunction with each other, making them an extremely powerful combination for handling and communicating your data.
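To give a flavor of how little code a basic app requires, here is a minimal ‘shiny’ sketch (a generic histogram app following the pattern used in the official tutorial; the slider and plot are illustrative choices):

```r
library(shiny)

# UI: one slider input and one plot output
ui <- fluidPage(
  titlePanel("A minimal Shiny app"),
  sliderInput("n", "Sample size:", min = 10, max = 1000, value = 100),
  plotOutput("hist")
)

# Server: redraw the histogram whenever the slider moves
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = paste("n =", input$n))
  })
}

# shinyApp(ui, server)  # uncomment to launch the app in a browser
```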

Note that both ‘shiny’ and ‘plotly’ offer subscription-based hosting services. Installing and using the libraries within the R environment is completely free of charge, as is posting a limited number of objects (i.e. apps and graphs) online. However, you might want to check the Shiny and Plotly subscription plans to learn more about the precise terms.

 

Contributors to this web-page:

Adrián del Río

Matthias Schmidtblaicher

Gordey Yastrebov

 

 

Page last updated on 26 September 2017