
# Introduction to R

#### This page is maintained by the software tutors. For errors and/or amendments, please contact the current tutor supporting the program.

This guide provides an easy-to-read starter guide to R. It is not intended to replace the full user manual available on R's homepage.

For any further questions or assistance, feel free to contact the R software tutor.

## Overview

R is an open-source tool for statistics and data modeling driven by a command syntax. It is by far the most comprehensive environment for statistical and econometric computing. For someone unacquainted with programming languages, the early steps in R are as difficult as learning a new language. Once users know the basic grammar of R programming, however, they can be both good producers and consumers of R's output: they can understand and write commands on their own to manipulate quantitative and qualitative data and present it in compelling ways through customized tables, powerful graphs and other visual devices. All you need is patience and discipline in your learning process. In case of doubt or despair, do not hesitate to contact your EUI R tutor(s).
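To give a flavor of the syntax before you dive in, here is a minimal sketch of a first R session; the data and variable names are invented purely for illustration:

```r
# Assign values with <- and inspect an object by typing its name
ages <- c(23, 31, 27, 45, 29)  # a numeric vector

mean(ages)  # arithmetic mean
sd(ages)    # standard deviation

# A data frame holds mixed-type columns, like a spreadsheet
students <- data.frame(
  name  = c("Ana", "Ben", "Cleo"),
  score = c(85, 92, 78)
)
summary(students$score)  # min, quartiles, median, mean, max
```

Everything in R works this way: you build objects and apply functions to them, and the results can themselves be stored and reused.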

Against this background, this page aims to ease your start in R. The sections below address the following topics:

## A Brief Comparison of R With Other Mainstream Statistical Software

In a schematic way, the table below shows the main differences among well-known statistical software packages: R, MATLAB, SAS/STATA and SPSS:

| | R | MATLAB | SAS/STATA | SPSS |
|---|---|---|---|---|
| Required license? | No | Yes | Yes | Yes |
| Size of library | Unlimited | Unlimited | Unlimited, but only official ones | Limited |
| Flexibility to compute statistics? | Yes | Yes | Medium | No |
| Speed of computation | Fast | The fastest | Medium | Slow |
| Simple commands? | Yes | No | Medium | No |
| Customized graphs? | Yes | Yes | No | No |
| Limitations in the type of analysis? | No | No | Medium | Yes |

Table 1: Differences among main statistical software

## R’s Analytic Capacities Compared to Other Software (with Links to Tutorials Covering Most Widely Used Types of Analysis)

The following table shows the statistical analyses readily available in each package, with links to official, widely known and reliable public websites. In most cases, the required R packages and the procedure to implement the analysis are well explained on the linked website:

Availability marks are listed per package in the original column order R, MATLAB, SAS, STATA, SPSS ("+" = available, "limited" = limited support); links to examples of R code, written tutorials and video tutorials* follow each entry.

- **Nonparametric Tests** (+ + + + +): e.g. sample codes for Sign Test, Wilcoxon Signed-Rank Test, Mann-Whitney-Wilcoxon Test, Kruskal-Wallis Test
- **T-test** (+ + + + +): sample code
- **ANOVA & MANOVA** (+ + + + +): sample code
- **ANCOVA & MANCOVA** (+ + + + +): check the link above
- **Linear Regression** (+ + + + +): sample code and overview of related techniques here, here and here // video explaining some basic techniques (~15 min)
- **Generalized Least Squares** (+ + + + +): description and example using the gls() function from the ‘nlme’ package // yet another example
- **Ridge Regression** (+ + + limited limited)
- **Lasso** (+ + + limited)
- **Generalized Linear Models** (+ + + + +): introductory tutorial
- **Logistic Regression** (+ + + + +): examples involving the use of binary logit, ordinal logit and multinomial logit // presentation with sample code involving the use of binary, ordinal and multinomial logit (incl. calculation of marginal effects)
- **Mixed Effects Models** (+ + + + +): requires the ‘lme4’ or ‘nlme’ package // basic introductory guide with sample code using ‘lme4’ // more comprehensive guide on ‘lme4’ // example involving the use of ‘nlme’
- **Nonlinear Regression** (+ + + limited limited): video demonstrations: one (~4 min), two (~5 min)
- **Discriminant Analysis** (+ + + + +): requires the ‘MASS’ package // sample code and brief overview // more comprehensive guide // video demonstration (~6 min)
- **Factor & Principal Components Analysis** (+ + + + +): video demonstration (~15 min)
- **Canonical Correlation Analysis** (+ + + + +)
- **Copula Models** (+ +)
- **Path Analysis** (+ + + + +): requires the ‘plsm’ package (for PLS path analysis) // ‘plsm’ package tutorial // for SEM path analysis see below
- **Structural Equation Modeling (Latent Factors)** (+ + + + limited): requires the ‘lavaan’ or ‘sem’ package // ‘lavaan’ package tutorial // ‘sem’ package tutorial // short ‘sem’ demonstration
- **Extreme Value Theory** (+ +)
- **Variance Stabilization** (+ +)
- **Bayesian Statistics** (+ + limited): introduction to Bayesian statistics in R
- **Monte Carlo, Classic Methods** (+ + + + limited)
- **Markov Chain Monte Carlo** (+ + +)
- **EM Algorithm** (+ + +)
- **Missing Data Imputation** (+ + + + +): overview of several existing packages
- **Bootstrap & Jackknife** (+ + + + +): some approaches demonstrated here, here and here // applications to regression models here and here // bootstrapping video example (~20 min)
- **Outlier Diagnostics** (+ + + + +): examples of several basic techniques // video demonstrations: one, two, three
- **Robust Estimation** (+ + + +): example of estimating robust regressions using the ‘MASS’ package // comprehensive guide to robust estimation
- **Cross-Validation** (+ + +)
- **Longitudinal (Panel) Data** (+ + + + limited): requires the ‘plm’ or ‘lme4’ package // for ‘lme4’ guides refer to Mixed Effects Models above // ‘plm’ sample code demonstration // ‘plm’ package tutorial
- **Survival Analysis** (+ + + + +): requires the ‘survival’ package // ‘survival’ package tutorial // another tutorial here
- **Propensity Score Matching** (+ + limited limited): requires the ‘MatchIt’ or ‘matching’ package // ‘MatchIt’ package tutorial // ‘matching’ package tutorial
- **Stratified Samples (Survey Data)** (+ + + + +)
- **Experimental Design** (+ + limited)
- **Quality Control** (+ + + + +)
- **Reliability Theory** (+ + + + +)
- **Univariate Time Series** (+ + + + limited): comprehensive tutorial // video tutorial (~1 hour)
- **Multivariate Time Series** (+ + + +)
- **Stochastic Volatility Models, Discrete Case** (+ + + + limited)
- **Stochastic Volatility Models, Continuous Case** (+ + limited limited)
- **Diffusions** (+ +)
- **Markov Chains** (+ +)
- **Hidden Markov Models** (+ +)
- **Counting Processes** (+ + +): various showcases of estimating count data models here, here and here // count data models video (~11 min)
- **Filtering** (+ + limited limited)
- **Instrumental Variables** (+ + + + +): tutorial involving the ‘ivreg’ function from the ‘AER’ package // video demonstration (~12 min)
- **Splines** (+ + + +)
- **Nonparametric Smoothing Methods** (+ + + +)
- **Spatial Statistics** (+ + limited limited)
- **Cluster Analysis** (+ + + + +): requires various packages (check tutorials) // brief introduction to R’s cluster analysis capacities // more comprehensive tutorials here and here // a set of videos on hierarchical cluster analysis: one, two, three, four and five // a set of videos on K-means clustering: one and two
- **Neural Networks** (+ + + limited)
- **Classification & Regression Trees** (+ + limited limited)
- **Random Forests** (+ + limited)
- **Support Vector Machines** (+ + +)
- **Signal Processing** (+ +)
- **Wavelet Analysis** (+ + +)
- **Bagging** (+ + +)
- **ROC Curves** (+ + + + +)
- **Meta-analysis** (+ + limited +)
- **Deterministic Optimization** (+ + + limited)
- **Stochastic Optimization** (+ + limited)
- **Content Analysis** (+ limited limited limited): short tutorials using the ‘RQDA’ package here, here and here // tutorial using the ‘tm’ package here
- **Quantile regression** (+ + + +): requires the ‘quantreg’ package // sample code and brief overview // more comprehensive guide // video demonstration (~9 min)
- **Seemingly unrelated regression** (+ + + +): requires the ‘systemfit’ package // sample code and demonstration // ‘systemfit’ package tutorial // video demonstration (~5 min)
- **Tobit and truncated regression** (+ + + +): requires the ‘censReg’ package // sample code and demonstration // ‘censReg’ package tutorial // video demonstration (~9 min)
- **Qualitative Comparative Analysis (QCA)** (+ limited): requires the ‘QCA’ package // ‘QCA’ package description // brief demonstration // QCA tutorial (using the ‘QCA’ and ‘SetMethods’ packages)
- **Social network analysis** (+ + + + +): requires the ‘igraph’, ‘statnet’ or ‘sna’ package // ‘igraph’ package tutorial // lab sessions using ‘igraph’ with densely commented code // nice tutorial combining the use of ‘igraph’ and ‘statnet’ // ‘sna’ package description
- **Log-linear models** (+ + + + +): requires the ‘gnm’ package // comprehensive overview of the ‘gnm’ package

The table is an extended version of the table maintained at http://stanfordphd.com/Statistical_Software.html

\* Most tutorials supplied in this table assume prior knowledge of the theory behind a given method, and thus serve primarily as a means of introducing the R tools and syntax required to conduct a given type of analysis.

Table 2: Statistical analyses available in the main statistical software packages

For additional examples, look at the Quick-R website (http://www.statmethods.net/) or R-bloggers (http://www.r-bloggers.com/generalized-linear-models-for-predicting-rates/).

## Getting Started

We strongly recommend following one of these two resources: Try R (Code School) and Swirl. The former is an online course that teaches the basic steps of the language and the features for organizing data. The latter is a package you install within R, and it also teaches the basic features of R. Swirl differs from Try R in two respects: i) its course repertory includes intermediate (how to run regressions) and advanced (how to establish causal inferences) courses, and ii) there are no epic pictures accompanying your analysis during the course. Thus, if this is the first time you are doing statistics, we highly recommend starting with Try R. If you know some statistics but have never written commands, start with Swirl.

Try R (Code School): http://tryr.codeschool.com/

Swirl: http://swirlstats.com/

Once you have completed these courses, we suggest following the videos below to learn useful shortcuts for descriptive analysis. Exercises are available for items 1-5; contact Adrián del Río to obtain them.

1. Basic statistical inferences: https://www.youtube.com/watch?v=dpPwdjorpg0

2. Writing documents with R code in RMarkdown: https://www.youtube.com/watch?v=7qTvOZfK6Cw

3. Loops (to execute repetitive code statements a particular number of times): theory https://blog.udemy.com/r-tutorial/ and practice https://www.youtube.com/watch?v=p7bJjOJoXLI

4. Creating functions (to create your own commands): https://www.youtube.com/watch?v=Fb8E2HZrjUE

5. Grouped aggregation in R with tapply, dplyr and sapply: https://www.youtube.com/watch?v=aD4R4ZIkeW0
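As a taste of item 5, here is a minimal sketch of grouped aggregation using base R's tapply(); the scores and groups are invented for illustration:

```r
# Toy data: test scores for students in three groups
scores <- c(70, 85, 90, 60, 75, 88)
group  <- c("A", "A", "B", "B", "C", "C")

# tapply() applies a function (here, mean) to scores within each group
group_means <- tapply(scores, group, mean)
print(group_means)
#    A    B    C
# 77.5 75.0 81.5
```

The ‘dplyr’ equivalent would chain group_by() and summarise(), as covered in the video above.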

For instance, suppose you want to compare proportions across more than two groups that share a set of attributes. You could then tabulate the parties and legislatures of different types of authoritarian regimes (personalist, military and civilian) that have ended up as democracies, before and after the Cold War:

**Legislatures**

| Sample | Hegemonic Party | Military | Personalist | Total |
|---|---|---|---|---|
| **1945-1989** | | | | |
| Yes | 0.00% | 22.73% | 77.27% | 100% |
| No | 17.65% | 47.06% | 35.29% | 100% |
| **1990-2014** | | | | |
| Yes | 18.18% | 18.18% | 63.36% | 100% |
| No | 55.77% | 26.92% | 17.31% | 100% |
| **1945-2014** | | | | |
| Yes | 6.61% | 21.21% | 72.72% | 100% |
| No | 46.37% | 31.88% | 21.73% | 100% |

**Parties**

| Sample | Hegemonic Party | Military | Personalist | Total |
|---|---|---|---|---|
| **1945-1989** | | | | |
| Zero | 0.00% | 21.74% | 78.26% | 100% |
| One | 33.33% | 50.00% | 16.66% | 100% |
| More than one | 10.00% | 50.00% | 40.00% | 100% |
| **1990-2014** | | | | |
| Zero | 10.00% | 20.00% | 70.00% | 100% |
| One | 61.90% | 28.57% | 9.52% | 100% |
| More than one | 53.12% | 25.00% | 21.87% | 100% |
| **1945-2014** | | | | |
| Zero | 3.03% | 21.21% | 75.75% | 100% |
| One | 55.55% | 33.33% | 11.11% | 100% |
| More than one | 42.86% | 30.95% | 26.19% | 100% |

**Parties by Legislatures**

| Parties | No legislatures | Legislatures | Total of Parties |
|---|---|---|---|
| Zero | 90.90% | 4.35% | 32.35% |
| One | 3.03% | 37.68% | 26.48% |
| More than one | 6.07% | 57.97% | 41.17% |
| Total of Legislatures | 32.35% | 67.64% | 100% |

Note: Years covered: 1945-2014. Totals in cells describe country-year observations. Hegemonic-party systems include single-party hybrids, and military regimes include military-personalist hybrids; this decision follows Geddes et al.

Table 3: Democratic formal institutions in autocracies that have experienced a democratic transition
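A cross-tabulation with row percentages like the one above can be produced in base R with table() and prop.table(); the regime data below are invented purely to illustrate the mechanics:

```r
# Hypothetical regime-type and legislature indicators (illustration only)
regime      <- c("Military", "Personalist", "Military", "Hegemonic Party",
                 "Personalist", "Personalist", "Hegemonic Party", "Military")
legislature <- c("Yes", "Yes", "No", "Yes", "No", "Yes", "No", "Yes")

tab     <- table(legislature, regime)         # raw counts
row_pct <- prop.table(tab, margin = 1) * 100  # percentages within each row
round(row_pct, 2)
```

Setting margin = 1 normalizes within rows so each row sums to 100%; margin = 2 would give column percentages instead.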

Content Analysis: Are you interested in knowing how words are associated in a speech? Under which circumstances a euphemism is more likely to be employed in place of a noun? How can you learn more about the content of a text? The R script below shows you how:

Courtesy of Annerose Nisser: click here to download

- Graham Williams’s text mining tool:  http://onepager.togaware.com/TextMiningO.pdf

Text mining uses automated algorithms to learn from massive amounts of text, which can be summarised by following the set of commands described in the document.
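Before reaching for a dedicated package, the core idea of text mining, counting term frequencies, can be sketched in base R; the two toy "documents" below are invented for illustration:

```r
# Two toy documents
docs <- c("the economy grows while the state reforms",
          "the state reforms the economy")

# Normalize to lower case and split into words
words <- unlist(strsplit(tolower(docs), "\\s+"))

# A term-frequency table is the simplest text-mining summary
freq <- sort(table(words), decreasing = TRUE)
head(freq)
```

The ‘tm’ package linked below generalizes this with corpora, document-term matrices and preprocessing steps such as stemming and stopword removal.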

A similar package is ‘tm’; see the introduction by Ingo Feinerer:

- Ingo Feinerer's tm package:  https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

## Plotting Objects

### ggplot2 vs. the base package

As for most things in R, there is a variety of ways in which plotting can be done. The base package has some pretty solid plotting functions, but most people seem to use the “ggplot2” package by Hadley Wickham, who has become something of a pop star in the R community. For a nice overview of what ggplot2 can do, as well as a function reference, visit:

http://zevross.com/blog/2014/08/04/beautiful-plotting-in-r-a-ggplot2-cheatsheet-3/

However, there also is a countermovement by users who prefer the simplicity of the R base plots package to ggplot2:

http://shinyapps.org/apps/RGraphCompendium/index.php

Which way you use depends in the end on your personal preference as well as the task at hand. There are some situations when using ggplot2 actually simplifies life a lot, whereas in others it can be more of a burden.

### Plotting time series

Let us start with an example where the base package plotting is clearly simpler. First load the package “datasets”, which contains a couple of nice datasets to consider:

library(datasets)

Plotting data on monthly CO2 Concentration is as easy as this:

plot(co2)

“co2” is a time series object which the built-in “plot” function recognizes, so plot() applies its default method for time series objects, plotting the date on the x-axis. Now let us do the same with ggplot2. ggplot2 works on data frames, so we need to create a new variable representing the dates, put it together with co2 into a data frame, and create the plot:

library(ggplot2)

dates<- seq(1959.0,1997+11/12,1/12)

df<- data.frame(cbind(year=dates,co2=co2))

ggobj<- ggplot(df,aes(x=year,y=co2))+ geom_line()

print(ggobj)

This all looks very complicated compared to the base R plotting, and it is. However, ggplot2 has different features that make it great for other things, especially for multiple layers in plots. We will see this in the next section.

By the way, the discussion here directly generalizes to plotting a panel, i.e. multiple time series:

rw<- sapply(rep(1,10),function(x) cumsum(rnorm(1000))) # create 10 random walks

plot(ts(rw)) # plot

### Plotting cross-sectional data

Now consider the data “occupationalStatus”, a cross-tabulation of the occupational status of fathers and their sons in Britain. To handle the data easily, we need the “reshape2” package, which allows you to transform data frames into a shape more suitable for plotting:

library(reshape2)

df<- melt(occupationalStatus,id="origin") # reshape the data frame

ggobj<- ggplot(df,aes(x=destination,y=value))+ geom_bar(stat = "identity") # make a basic bar plot

print(ggobj) # print

This does not look great. But consider splitting the individual bars according to the origin of the sons:

ggobj<- ggobj + aes(fill=origin)

print(ggobj)

This looks better. This example also shows us the amazing side of ggplot2: we can change plots step by step through adding commands to the original object. Another great feature of ggplot2 is that we can easily create multiple plots grouped by a specified variable in the dataframe. Why not for instance divide the plots by the occupation of the father?

ggobj<- ggobj + facet_grid(origin ~ .,scales="free_y") + aes(fill = factor(destination))

print(ggobj)

This plot seems to lend itself much better to interpretation. For more examples of how to make nice plots with either ggplot2 or the base package, check the links I provided or some of the dozens of nicely written tutorials on the internet.

### Exporting graphs to LaTeX

Some of the user-written packages in R are simply outstanding.  Try out  “tikzDevice”, which will transform your (base or ggplot2) graphs into a TikZ picture (if you don’t know what TikZ is, look it up). Graphs in papers or presentations can hardly look better than that. For our previous example:

library(tikzDevice)

options(tikzDefaultEngine = "xetex")

options(tikzLatexPackages = c(

getOption( "tikzLatexPackages" ),

"\\usepackage{amsmath}",

"\\usepackage{amsfonts}"

)) # change some options

tikz(file="occupationalStatus.tex",width=15.74776/2.54,height=15.74776) # dimensions in inches; 1 inch = 2.54 cm

plot(ggobj)

dev.off()

## Optimizing Your Code

One disadvantage of R is that it is somewhat slow (maybe not compared to Stata, but compared to other programming languages; see for instance http://economics.sas.upenn.edu/~jesusfv/comparison_languages.pdf). You might not notice this while running a regression, but you will as soon as you tackle computationally more involved tasks. The good news, however, is that you can speed up processes considerably.

### Making your code more efficient

Whenever your code seems to run slowly, check whether you wrote it in an inefficient way.  Three simple but effective pieces of advice are:

• With computationally intensive tasks such as looping, avoid using complex objects such as arrays, lists or dataframes
• Define objects of fixed length before you loop over them
• Avoid explicit looping when a built-in function does the same

In the following example, you will see how enormous the gain in speed can be if you follow the advice. We will do exactly the same task in different ways and compare how much time it takes.

### A Practical Example

Assume that for some reason you would like to simulate a random walk of a certain length:

n <- 100000 # length of the series

You could do this by creating an empty list and then consecutively adding random numbers to it:

t0 <- proc.time() # start the stopwatch

y<- list(); y[[1]] <- 0

for(i in 2:n){

y[[i]] <- y[[i-1]] + rnorm(1)

}

texec1<- proc.time() - t0 # get the execution time

Now, instead of a list, use a vector:

t0 <- proc.time()

y <- 0

for(i in 2:n){

y[i] <- y[i-1] + rnorm(1)

}

texec2 <- proc.time() - t0

Do the same, but now define the object at its full length BEFORE looping over it:

t0 <- proc.time()

y <- rep(0,n)

for(i in 2:n){

y[i] <- y[i-1] + rnorm(1)

}

texec3 <- proc.time() - t0

Finally, instead of using the for loop, do the same with the built in “cumsum”-command:

t0 <- proc.time()

eps<- rnorm(n)

y <- cumsum(eps)

texec4 <- proc.time() - t0

Now let us compare the elapsed time for the computation:

c(texec1[3]/texec4[3],texec2[3]/texec4[3],texec3[3]/texec4[3],texec4[3]/texec4[3])

The differences are enormous: on my machine, the first method takes 1365 times longer than the most efficient one, the second one 486 times and the third one 15.5 times. So always try to make your code efficient before you try getting more processing power!
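Instead of subtracting proc.time() calls by hand, base R's system.time() wraps the same stopwatch around an expression for you; a minimal sketch:

```r
n <- 100000
# system.time() returns the user, system and elapsed times
# taken to evaluate the expression in braces
timing <- system.time({
  y <- cumsum(rnorm(n))
})
timing["elapsed"]  # wall-clock seconds
```

For repeated, fine-grained timings you may also want to look at dedicated benchmarking packages, but system.time() suffices for comparisons like the ones above.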

### “apply” vs. “for”

There are two ways of constructing loops in R: the commonly used commands such as “for” and the “apply” family of functions. Hardcore R users always use apply and will tell you that it is much quicker than “for”. Let us check this in the context of our example: assume that we do not want to create only one random walk, but a bunch of them:

n <- 10000 # length of time series

ns<- 10000 # number of paths

eps<- matrix(rnorm(n*ns),nrow=n,ncol=ns) # initialize the errors

First use the “for”  loop:

y <- matrix(NA,nrow=n,ncol=ns) # initialize the vectors

t0 <- proc.time()

for(i in 1:ns){

y[,i] <- cumsum(eps[,i])

}

texec1 <- proc.time() - t0

Now do the same with “apply”:

t0 <- proc.time()

y <- apply(eps,2,cumsum)

texec2 <- proc.time() - t0

And compare the computing time:

texec1[3]/texec2[3]

At least for this example, I find that the “for” loop is actually faster. The code does look neater with “apply”, however. For more details, you can read e.g. http://blog.datacamp.com/tutorial-on-loops-in-r/.

## Parallel Computation and Running a Script on the Cluster

For information on running parallel R jobs using either several CPU cores or your GPU, as well as instructions and example code for doing this on the EUI-HPC cluster, please check the presentation document:

https://sites.google.com/site/mschmidtblaicher/downloads/EUIHPCpresentation_R.pdf
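As a minimal local sketch (not the cluster workflow from the presentation), base R's ‘parallel’ package can spread independent simulations over CPU cores. mclapply() forks worker processes on Linux/macOS; on Windows it only runs with mc.cores = 1:

```r
library(parallel)

# Simulate 8 independent random walks, one task per list element
simulate_walk <- function(seed) {
  set.seed(seed)        # make each path reproducible
  cumsum(rnorm(1000))
}

n_cores <- max(1L, detectCores() - 1L)  # leave one core free
paths <- mclapply(1:8, simulate_walk, mc.cores = n_cores)

length(paths)       # 8 walks
length(paths[[1]])  # each of length 1000
```

Because the walks are independent, the work splits cleanly across cores; tasks that share state need the socket-cluster functions (makeCluster(), parLapply()) instead.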

## Data Scraping

The power of R as a programming language extends beyond statistical analysis. A number of packages have been developed by R users that allow relatively easy and convenient ways of scraping Internet data and preprocessing it for further analysis. Below you will find a list of nice video tutorials introducing the basics of extracting, parsing and processing content from the web.

Note, however, that scraping data from Internet sources might be challenging for people unfamiliar with the basics of HTML architecture.
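To illustrate the idea without any extra packages, here is a toy sketch that pulls the href targets out of a small HTML snippet using base R regular expressions; real scraping projects would typically use a dedicated package such as ‘rvest’:

```r
# A toy HTML fragment (in practice this would come from downloading a page)
html <- '<p>See <a href="https://cran.r-project.org">CRAN</a> and
         <a href="https://www.r-project.org">R</a>.</p>'

# Extract every href="..." attribute, then strip the surrounding syntax
matches <- regmatches(html, gregexpr('href="[^"]+"', html))[[1]]
links <- gsub('href="|"', '', matches)
print(links)
# [1] "https://cran.r-project.org" "https://www.r-project.org"
```

Regular expressions quickly break down on real-world HTML; a proper HTML parser, as provided by the packages in the tutorials, is the robust route.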

## Interactive Data Visualization

R also offers amazing capacities for interactive data visualization that can be employed not only for analytic purposes, but also for teaching as well as presentation and popularization of your research results.

### Plotly

The first library you might want to introduce yourself to is the ‘plotly’ package, which allows you to create a wide variety of fancy-looking interactive plots, both as objects within the R environment and as separate objects that can be embedded in your blog or custom web page. Check out the following links, which should convince you to use the library and provide an easy start with it:

### Shiny

The second library is called ‘shiny’, and it allows you to turn your intimidating R code into nice, user-friendly applications that can be used through a familiar web interface. Check out the following link for inspiring examples. Conveniently, the library’s official web page contains a nice, well-explained tutorial which should get you started quickly (entirely accessible even to inexperienced R users!). Even more amazing, ‘shiny’ and ‘plotly’ can be used in conjunction with each other, making them an extremely powerful tool for handling and communicating your data.

Note that both ‘shiny’ and ‘plotly’ offer subscription-based hosting services. Installing and using the libraries within the R environment is completely free of charge, as is posting a limited number of objects (i.e. apps and graphs) online. However, you may want to check the Shiny and Plotly subscription plans to learn the precise subscription terms.

Contributors to this web-page:

Adrián del Río

Matthias Schmidtblaicher

Gordey Yastrebov

Page last updated on 26 September 2017