
Introduction to R

This page is maintained by the software tutors. For errors and/or amendments, please contact the current tutor supporting the program.

 

This guide provides an easy-to-read starter guide to R. It is not intended to replace the full user manual provided at R's homepage.

For any further questions or assistance feel free to contact the  R software tutor.

 

 

Overview

R is an open-source tool for statistics and data modeling driven by syntax. It is by far the most comprehensive environment for statistical and econometric computing. For someone unacquainted with programming languages, the early steps in R are as difficult as learning a new language. However, once users know the basic grammar of R programming, they can be good producers and consumers of R’s output: they can understand and write commands on their own to manipulate quantitative and qualitative data easily and present it in compelling ways through customized tables, powerful graphs and other visual devices. The only things you need are patience and discipline in your learning process. In case of doubt or despair, do not hesitate to contact your EUI R tutor(s).

Against this background, this webpage aims to help you get started in R. We suggest answers to the following questions:

  1. What are the differences between R and other statistical software?
  2. What sort of analysis can I do in R that is not available in other software?
  3. Where and how can I start working with R?
  4. How do I plot objects in R?
  5. How can I make my R code more efficient?
  6. How can I make use of parallel computation and the EUI-HPC computing cluster with R? 

 

 

A Brief Comparison of R With Other Mainstream Statistical Software

 

The table below schematically shows the main differences among well-known statistical software packages: R, MATLAB, SAS, STATA and SPSS:

 

 

| | R | MATLAB | SAS/STATA | SPSS |
|---|---|---|---|---|
| Required license? | No | Yes | Yes | Yes |
| Size of library? | Unlimited | Unlimited | Unlimited, but only official ones | Limited |
| Flexibility to compute statistics? | Yes | Yes | Medium | No |
| Speed in computation? | Fast | The fastest | Medium | Slow |
| Simple commands? | Yes | No | Medium | No |
| Customized graphs? | Yes | Yes | No | No |
| Limitations in the type of analysis? | No | No | Medium | Yes |

Table 1: Differences among main statistical software

 

 

R’s Analytic Capacities Compared to Other Software (with Links to Tutorials Covering Most Widely Used Types of Analysis)

 

The following table shows which types of statistical analysis are readily available in each package, with links to official, widely known and reliable public websites. Most of the time, the required R packages and the procedure for implementing the analysis are well explained on the same website:

| Type of statistical analysis | R | MATLAB | SAS | STATA | SPSS | Examples of R code and written tutorials* | Video tutorials* |
|---|---|---|---|---|---|---|---|
| Nonparametric Tests | + | + | + | + | + | e.g. sample codes for Sign Test, Wilcoxon Signed-Rank Test, Mann-Whitney-Wilcoxon Test, Kruskal-Wallis Test | |
| T-test | + | + | + | + | + | sample code | |
| ANOVA & MANOVA | + | + | + | + | + | sample code | |
| ANCOVA & MANCOVA | + | + | + | + | + | check the link above | |
| Linear Regression | + | + | + | + | + | sample code and overview of related techniques here, here and here | video explaining some basic techniques (~15 min) |
| Generalized Least Squares | + | + | + | + | + | description and example using gls() function from ‘nlme’ package // yet another example | |
| Ridge Regression | + | + | + | limited | limited | | |
| Lasso | + | + | + | limited | | | |
| Generalized Linear Models | + | + | + | + | + | introductory tutorial | |
| Logistic Regression | + | + | + | + | + | examples involving the use of binary logit, ordinal logit, and multinomial logit // presentation with sample code involving the use of binary, ordinal and multinomial logit (incl. calculation of marginal effects) | binary logit (and probit) (~12 min) // ordinal logit (and probit) (~6 min) // multinomial logit (and probit) (~15 min) |
| Mixed Effects Models | + | + | + | + | + | requires ‘lme4’ or ‘nlme’ package // basic introductory guide with sample code using ‘lme4’ package // more comprehensive guide on ‘lme4’ package // example involving the use of ‘nlme’ package | |
| Nonlinear Regression | + | + | + | limited | limited | basic introductory guide with sample code // more comprehensive guide | video demonstration: one (~4 min), two (~5 min) |
| Discriminant Analysis | + | + | + | + | + | requires ‘MASS’ package // sample code and brief overview // more comprehensive guide | video demonstration (~6 min) |
| Factor & Principal Components Analysis | + | + | + | + | + | sample code and brief overview // practical example | video demonstration (~15 min) |
| Canonical Correlation Analysis | + | + | + | + | + | | |
| Copula Models | + | + | | | | | |
| Path Analysis | + | + | + | + | + | requires ‘plsm’ package (for PLS path analysis) // ‘plsm’ package tutorial // for SEM path analysis see below | |
| Structural Equation Modeling (Latent Factors) | + | + | + | + | limited | requires ‘lavaan’ or ‘sem’ package // ‘lavaan’ package tutorial // ‘sem’ package tutorial // short ‘sem’ demonstration | introduction to ‘lavaan’ (~45 min) // yet another ‘lavaan’ video tutorial (~45 min) |
| Extreme Value Theory | + | + | | | | | |
| Variance Stabilization | + | + | | | | | |
| Bayesian Statistics | + | + | limited | | | introduction to Bayesian statistics in R | |
| Monte Carlo, Classic Methods | + | + | + | + | limited | | |
| Markov Chain Monte Carlo | + | + | + | | | | |
| EM Algorithm | + | + | + | | | | |
| Missing Data Imputation | + | + | + | + | + | overview of several existing packages | |
| Bootstrap & Jackknife | + | + | + | + | + | some approaches demonstrated here, here and here // applications to regression models here and here | bootstrapping video example (~20 min) |
| Outlier Diagnostics | + | + | + | + | + | examples of several basic techniques | video demonstration: one, two, three |
| Robust Estimation | + | + | + | + | | example of estimating robust regressions using ‘MASS’ package // comprehensive guide to robust estimation | |
| Cross-Validation | + | + | + | | | | |
| Longitudinal (Panel) Data | + | + | + | + | limited | requires ‘plm’ or ‘lme4’ package // for ‘lme4’ guides refer to Mixed Effects Models above // ‘plm’ sample code demonstration // ‘plm’ package tutorial | tutorial one (~15 min) // tutorial two (~10 min) |
| Survival Analysis | + | + | + | + | + | requires ‘survival’ package // ‘survival’ package tutorial // another tutorial here | introductory video (~15 min) // more in-depth tutorial (~1 hour 20 min) |
| Propensity Score Matching | + | + | limited | limited | | requires ‘MatchIt’ or ‘Matching’ package // ‘MatchIt’ package tutorial // ‘Matching’ package tutorial | video demonstration using ‘Matching’ package (~13 min) // video demonstration using ‘MatchIt’ package (~17 min) |
| Stratified Samples (Survey Data) | + | + | + | + | + | | |
| Experimental Design | + | + | limited | | | | |
| Quality Control | + | + | + | + | + | | |
| Reliability Theory | + | + | + | + | + | | |
| Univariate Time Series | + | + | + | + | limited | comprehensive tutorial | video tutorial (~1 hour) |
| Multivariate Time Series | + | + | + | + | | | |
| Stochastic Volatility Models, Discrete Case | + | + | + | + | limited | | |
| Stochastic Volatility Models, Continuous Case | + | + | limited | limited | | | |
| Diffusions | + | + | | | | | |
| Markov Chains | + | + | | | | | |
| Hidden Markov Models | + | + | | | | | |
| Counting Processes | + | + | + | | | various showcases of estimating count data models here, here and here | count data models (~11 min) |
| Filtering | + | + | limited | limited | | | |
| Instrumental Variables | + | + | + | + | + | tutorial involving ‘ivreg’ function from ‘AER’ package | video demonstration (~12 min) |
| Splines | + | + | + | + | | | |
| Nonparametric Smoothing Methods | + | + | + | + | | | |
| Spatial Statistics | + | + | limited | limited | | | |
| Cluster Analysis | + | + | + | + | + | requires various packages (check tutorials) // brief introduction to R’s cluster analysis capacities // more comprehensive tutorials here and here | a set of videos on hierarchical cluster analysis: one, two, three, four and five // a set of videos on K-means clustering: one and two |
| Neural Networks | + | + | + | | limited | | |
| Classification & Regression Trees | + | + | limited | | limited | | |
| Random Forests | + | + | limited | | | | |
| Support Vector Machines | + | + | + | | | | |
| Signal Processing | + | + | | | | | |
| Wavelet Analysis | + | + | + | | | | |
| Bagging | + | + | + | | | | |
| ROC Curves | + | + | + | + | + | | |
| Meta-analysis | + | + | limited | + | | | |
| Deterministic Optimization | + | + | + | limited | | | |
| Stochastic Optimization | + | + | limited | | | | |
| Content Analysis | + | | limited | limited | limited | short tutorials using ‘RQDA’ package here, here and here // tutorial using ‘tm’ package here | ‘RQDA’ video tutorial (~5 hours) // ‘tm’ video tutorial (~8 min) // extensive video tutorial on text mining in R (follow the playlist covering chapter 7, ~2 hours 20 min) |
| Quantile Regression | + | + | + | + | | requires ‘quantreg’ package // sample code and brief overview // more comprehensive guide | video demonstration (~9 min) |
| Seemingly Unrelated Regression | + | + | + | + | | requires ‘systemfit’ package // sample code and demonstration // ‘systemfit’ package tutorial | video demonstration (~5 min) |
| Tobit and Truncated Regression | + | + | + | + | | requires ‘censReg’ package // sample code and demonstration // ‘censReg’ package tutorial | video demonstration (~9 min) |
| Qualitative Comparative Analysis (QCA) | + | | | limited | | requires ‘QCA’ package // ‘QCA’ package description // brief demonstration // QCA tutorial (using ‘QCA’ and ‘SetMethods’ packages) | |
| Social Network Analysis | + | + | + | + | + | requires ‘igraph’, ‘statnet’ or ‘sna’ package // ‘igraph’ package tutorial // lab sessions using ‘igraph’ with densely commented code // nice tutorial combining the use of ‘igraph’ and ‘statnet’ // ‘sna’ package description | |
| Log-linear Models | + | + | + | + | + | requires ‘gnm’ package // comprehensive overview of the ‘gnm’ package | |

The table is an extended version of the table maintained at http://stanfordphd.com/Statistical_Software.html

* Most tutorials supplied in this table assume prior knowledge of the theory behind a given method, and thus serve primarily as a means of introducing the R tools and syntax required to conduct a given type of analysis.

Table 2: Statistical analyses available in mainstream statistical software

 

For instance, look at the Quick-R website (http://www.statmethods.net/) or R-bloggers (http://www.r-bloggers.com/generalized-linear-models-for-predicting-rates/).

 

 

 

Getting Started

 

We highly recommend following one of these two websites: Try.R (Code School) and Swirl. The former contains a course that teaches you the basic steps of the language and its features for organizing data. The latter is a package you install in R; it also teaches the basic features of R. However, Swirl differs from Try.R in two respects: i) its course repertory offers intermediate courses (e.g. how to run regressions) and advanced ones (e.g. how to establish causal inferences), and ii) it lacks Try.R's playful graphics. Thus, if this is your first time doing statistics, we highly recommend starting with Try.R. If you know some statistics but have never written commands, start with Swirl.

Try.R (Code School): http://tryr.codeschool.com/

Swirl: http://swirlstats.com/
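As a minimal sketch of the Swirl route (these are the standard CRAN installation commands; the course menu is interactive), getting started from the R console looks like this:

```r
# Install the swirl package from CRAN (only needed once)
install.packages("swirl")

# Load the package and launch the interactive course menu
library(swirl)
swirl()
```

From the menu you can then pick a beginner, intermediate or advanced course and work through it directly inside R.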

 

Once you have completed these courses, we suggest following the videos below to learn useful shortcuts for descriptive analysis. Exercises are available for items 1-5; to obtain them, contact Adrián del Río (email link).

  1. Basic statistical inferences: https://www.youtube.com/watch?v=dpPwdjorpg0
     
  2. Writing documents with R code in RMarkdown: https://www.youtube.com/watch?v=7qTvOZfK6Cw
     
  3. Loops (to execute repetitive code statements a particular number of times): theory https://blog.udemy.com/r-tutorial/ and practice https://www.youtube.com/watch?v=p7bJjOJoXLI
     
  4. Creating functions (to create your own commands): https://www.youtube.com/watch?v=Fb8E2HZrjUE
     
  5. Grouped aggregation in R with tapply, dplyr and sapply: https://www.youtube.com/watch?v=aD4R4ZIkeW0
     
    For instance, suppose you want to calculate mean differences between more than two groups that share a set of attributes. You could then tabulate the legislatures and parties of different types of authoritarian regimes (hegemonic-party, military and personalist) that have ended in democratization, before and after the Cold War:
     

| Sample | | Hegemonic Party | Military | Personalist | Total |
|---|---|---|---|---|---|
| **Legislatures** | | | | | |
| 1945-1989 | Yes | 0.00% | 22.73% | 77.27% | 100% |
| 1945-1989 | No | 17.65% | 47.06% | 35.29% | 100% |
| 1990-2014 | Yes | 18.18% | 18.18% | 63.36% | 100% |
| 1990-2014 | No | 55.77% | 26.92% | 17.31% | 100% |
| 1945-2014 | Yes | 6.61% | 21.21% | 72.72% | 100% |
| 1945-2014 | No | 46.37% | 31.88% | 21.73% | 100% |
| **Parties** | | | | | |
| 1945-1989 | Zero | 0.00% | 21.74% | 78.26% | 100% |
| 1945-1989 | One | 33.33% | 50.00% | 16.66% | 100% |
| 1945-1989 | More than one | 10.00% | 50.00% | 40.00% | 100% |
| 1990-2014 | Zero | 10.00% | 20.00% | 70.00% | 100% |
| 1990-2014 | One | 61.90% | 28.57% | 9.52% | 100% |
| 1990-2014 | More than one | 53.12% | 25.00% | 21.87% | 100% |
| 1945-2014 | Zero | 3.03% | 21.21% | 75.75% | 100% |
| 1945-2014 | One | 55.55% | 33.33% | 11.11% | 100% |
| 1945-2014 | More than one | 42.86% | 30.95% | 26.19% | 100% |

| Parties | No legislatures | Legislatures | Total of Parties |
|---|---|---|---|
| Zero | 90.90% | 4.35% | 32.35% |
| One | 3.03% | 37.68% | 26.48% |
| More than one | 6.07% | 57.97% | 41.17% |
| Total of Legislatures | 32.35% | 67.64% | 100% |

Note: Years covered: 1945-2014. Totals in cells describe country-year observations. The hegemonic-party category includes single-party hybrids, and military regimes include military-personalist hybrids. This decision is suggested by Geddes et al.

Table 3: Democratic formal institutions in autocracies that have experienced a democratic transition
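Cross-tabulations like the one above can be produced with base R alone. The sketch below uses the built-in mtcars data as a stand-in (the regime dataset shown above is not bundled with this page), combining tapply for grouped means with table and prop.table for row percentages:

```r
# Grouped aggregation with tapply: mean mpg for each number of cylinders
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)
print(round(group_means, 2))

# A cross-tab with row percentages, like the table above:
# share of transmission types (am) within each cylinder group
counts <- table(cyl = mtcars$cyl, am = mtcars$am)
row_pct <- prop.table(counts, margin = 1) * 100  # margin = 1 => rows sum to 100%
print(round(row_pct, 2))
```

The same pattern extends to dplyr's group_by()/summarise(), covered in the video linked under item 5.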

 

  6. Content Analysis: Are you interested in how words are associated within a speech? Under which circumstances is a euphemism more likely to be used in place of a particular noun? How can you learn more about the content of a text? The R scripts below show you how:
     
     Courtesy of Annerose Nisser: click here to download
     
     - Graham Williams's text mining tool: http://onepager.togaware.com/TextMiningO.pdf
     
     Text mining uses automated algorithms to learn from massive amounts of text, which can be summarized by following the set of commands described in the document.
     
     A similar package is introduced by Ingo Feinerer:
     
     - Ingo Feinerer's tm package: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

 

 

 

Plotting Objects

 

ggplot2 vs. the base package

As with most things in R, there is a variety of ways in which plotting can be done. The base package has some pretty solid plotting functions, but most people seem to use the “ggplot2” package by Hadley Wickham, who has become something of a pop star in the R community. For a nice overview of what ggplot2 can do, as well as a function reference, visit:

http://zevross.com/blog/2014/08/04/beautiful-plotting-in-r-a-ggplot2-cheatsheet-3/

However, there also is a countermovement by users who prefer the simplicity of the R base plots package to ggplot2:

http://shinyapps.org/apps/RGraphCompendium/index.php

Which approach you use ultimately depends on your personal preference as well as the task at hand. In some situations ggplot2 actually simplifies life a lot, whereas in others it can be more of a burden.

 

Plotting time series

Let us start with an example where the base package plotting is clearly simpler. First load the package “datasets”, which contains a couple of nice datasets to consider:

 

library(datasets)

 

Plotting data on monthly CO2 Concentration is as easy as this:

 

plot(co2)

 

“co2” is a time series object, which the built-in plot() function recognizes, so plot() applies its default method for time series objects and puts the date on the x-axis. Now let us do the same with ggplot2. ggplot2 works on data frames, so we need to create a new variable representing the dates, put it together with co2 into a data frame, and create a plot:

 

library(ggplot2)

dates<- seq(1959.0,1997+11/12,1/12)

df<- data.frame(cbind(year=dates,co2=co2))

ggobj<- ggplot(df,aes(x=year,y=co2))+ geom_line()

print(ggobj)

 

This all looks very complicated compared to base R plotting, and it is. However, ggplot2 has features that make it great for other things, especially plots with multiple layers. We will see this in the next section.

 

By the way, the discussion here directly generalizes to plotting a panel, i.e. multiple time series:

 

rw<- sapply(rep(1,10),function(x) cumsum(rnorm(1000))) # create 10 random walks

plot(ts(rw)) # plot

 

Plotting cross-sectional data

Now consider the data “occupationalStatus”, which is a matrix showing the respective occupational status of fathers and sons in Britain. In order to deal easily with the data, we need the “reshape2” package, which allows you to easily transform data frames into a shape more suitable for plotting:

 

library(reshape2)

df<- melt(occupationalStatus,id="origin") # reshape the data frame

ggobj<- ggplot(df,aes(x=destination,y=value))+ geom_bar(stat = "identity") # make a basic bar plot

print(ggobj) # print

 

This does not look great. But consider splitting the individual bars according to the origin of the sons:

 

ggobj<- ggobj + aes(fill=origin)

print(ggobj)

 

This looks better. This example also shows us the amazing side of ggplot2: we can change plots step by step through adding commands to the original object. Another great feature of ggplot2 is that we can easily create multiple plots grouped by a specified variable in the dataframe. Why not for instance divide the plots by the occupation of the father?

 

ggobj<- ggobj + facet_grid(origin ~ .,scales="free_y") + aes(fill = factor(destination))

print(ggobj)

 

This plot seems to lend itself much better to interpretation. For more examples of how to make nice plots with either ggplot2 or the base package, check the links I provided or some of the dozens of nicely written tutorials on the internet.

 

Exporting graphs to LaTeX

Some of the user-written packages in R are simply outstanding. Try out “tikzDevice”, which transforms your (base or ggplot2) graphs into TikZ pictures (if you don’t know what TikZ is, look it up). Graphs in papers or presentations can hardly look better than that. For our previous example:

 

library(tikzDevice)

options(tikzDefaultEngine = "xetex")

options(tikzLatexPackages = c(

getOption( "tikzLatexPackages" ),

  "\\usepackage{amsmath}",

  "\\usepackage{amsfonts}"

)) # change some options

tikz(file="occupationalStatus.tex",width=15.74776/2.54,height=15.74776) # 2.54 inch = 1 cm

plot(ggobj)

dev.off()

 

 

Optimizing Your Code

One disadvantage of R is that it is somewhat slow (maybe not compared to Stata, but compared to other programming languages; see for instance http://economics.sas.upenn.edu/~jesusfv/comparison_languages.pdf). You might not notice this while running a regression, but as soon as you do computationally more involved tasks, you will. The good news, however, is that you can speed up processes considerably.

 

Making your code more efficient

Whenever your code seems to run slowly, check whether you wrote it in an inefficient way.  Three simple but effective pieces of advice are:

  • With computationally intensive tasks such as looping, avoid using complex objects such as arrays, lists or dataframes
  • Define objects of fixed length before you loop over them
  • Avoid explicit looping when a built-in function does the same

In the following example, you will see how enormous the gain in speed can be if you follow the advice. We will do exactly the same task in different ways and compare how much time it takes.

 

A Practical Example

Assume that for some reason you would like to simulate a random walk of a certain length:

 

n <- 100000 # length of the series

 

You could do this by creating an empty list and then consecutively appending random numbers to it:

 

t0 <- proc.time() # start the stopwatch

y<- list(); y[[1]] <- 0

for(i in 2:n){

y[[i]] <- y[[i-1]] + rnorm(1)

}

texec1<- proc.time() - t0 # get the execution time

 

Now, instead of a list, use a vector:

 

t0 <- proc.time()

y <- 0

for(i in 2:n){

  y[i] <- y[i-1] + rnorm(1)

}

texec2 <- proc.time() - t0

 

Do the same, but now  define the object BEFORE looping across it:

 

t0 <- proc.time()

y <- rep(0,n)

for(i in 2:n){

  y[i] <- y[i-1] + rnorm(1)

}

texec3 <- proc.time() - t0

 

Finally, instead of using the for loop, do the same with the built-in “cumsum” command:

 

t0 <- proc.time()

eps<- rnorm(n)

y <- cumsum(eps)

texec4 <- proc.time() - t0

 

Now let us compare the elapsed time for the computation:

 

c(texec1[3]/texec4[3],texec2[3]/texec4[3],texec3[3]/texec4[3],texec4[3]/texec4[3])

 

The differences are enormous: on my machine, the first method takes 1365 times longer than the most efficient one, the second one 486 times and the third one 15.5 times. So always try to make your code efficient before you try getting more processing power!

 

“apply” vs. “for”

There are two ways of constructing loops in R: the commonly used commands such as “for” and the “apply” family of functions. Hardcore R users always use apply and will tell you that it is much quicker than “for”. Let us check this in the context of our example: assume that we do not want to create only one random walk, but a bunch of them:

 

n <- 10000 # length of time series

ns<- 10000 # number of paths

eps<- matrix(rnorm(n*ns),nrow=n,ncol=ns) # initialize the errors

 

First use the “for”  loop:

 

y <- matrix(NA,nrow=n,ncol=ns) # initialize the vectors

t0 <- proc.time()

for(i in 1:ns){

    y[,i] <- cumsum(eps[,i])

}

texec1 <- proc.time() - t0

 

Now do the same with “apply”:

 

t0 <- proc.time()

y <- apply(eps,2,cumsum)

texec2 <- proc.time() - t0

 

And compare the computing time:

 

texec1[3]/texec2[3]

 

At least for this example, I find that the “for” loop is actually faster; the code, however, looks neater with “apply”. For more details, you can read e.g. http://blog.datacamp.com/tutorial-on-loops-in-r/.

 

 

Parallel Computation and Running a Script on the Cluster

For information on running parallel R jobs using either several CPU cores or your GPU as well as instructions and example codes for doing this on the EUI-HPC cluster, please check the presentation document

https://sites.google.com/site/mschmidtblaicher/downloads/EUIHPCpresentation_R.pdf
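As a minimal local sketch using the base ‘parallel’ package (the two-worker cluster and the random-walk task are illustrative choices, not EUI-HPC settings from the presentation), the simulation from the previous section can be spread across CPU cores like this:

```r
library(parallel)

# Start a small local cluster; on a real machine you might use detectCores() - 1
cl <- makeCluster(2)

# Simulate 100 random walks of length 1000 in parallel:
# each worker draws its own innovations and cumulates them
walks <- parLapply(cl, 1:100, function(i) cumsum(rnorm(1000)))

# Always release the workers when you are done
stopCluster(cl)

length(walks)       # number of simulated walks
length(walks[[1]])  # length of each walk
```

On Linux and macOS, mclapply() offers the same idea with even less setup (it forks the current session instead of starting fresh workers).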

 

Data Scraping

The power of R as a programming language extends beyond statistical analysis. A number of user-contributed packages allow relatively easy and convenient ways of scraping Internet data and preprocessing it for further analysis. Below you will find a list of nice video tutorials introducing the basics of extracting, parsing and processing content from web pages.

Note, however, that scraping data from Internet sources might be challenging for people unfamiliar with the basics of HTML architecture.
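To illustrate the kind of workflow those tutorials cover, here is a minimal sketch using the ‘rvest’ package (one of several scraping packages on CRAN; the HTML snippet is invented for the example, and html_elements() assumes a recent rvest version):

```r
library(rvest)  # brings in xml2's HTML parsing as well

# A toy HTML document standing in for a downloaded web page
page <- read_html("<html><body>
  <h1>Election results</h1>
  <ul><li>Party A: 42%</li><li>Party B: 38%</li></ul>
</body></html>")

# Extract the list items with a CSS selector and keep their text
results <- page %>% html_elements("li") %>% html_text()
print(results)
```

In real use, read_html() is pointed at a URL instead of a string, which is where familiarity with the target page's HTML structure becomes essential.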

 

Interactive Data Visualization

R also offers amazing capacities for interactive data visualization that can be employed not only for analytic purposes, but also for teaching as well as presentation and popularization of your research results.

Plotly

The first library you might want to introduce yourself to is the ‘plotly’ package, which allows you to create a wide variety of fancy-looking interactive plots, both as objects within the R environment and as separate objects that can be embedded in your blog or custom web page. Check out the following links, which should convince you to use the library and provide an easy start with it:
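As a minimal sketch (a generic line chart on made-up data, not tied to any EUI example), turning a data frame into an interactive plot takes a single plot_ly() call:

```r
library(plotly)

# Some toy data to visualize
df <- data.frame(year = 2000:2010, value = cumsum(rnorm(11)))

# plot_ly builds an interactive htmlwidget; viewed in RStudio or a browser,
# it supports hovering, zooming and panning with no extra code
p <- plot_ly(df, x = ~year, y = ~value, type = "scatter", mode = "lines+markers")

# htmlwidgets::saveWidget(p, "my_plot.html")  # save for embedding in a web page
p
```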

Shiny

The second library is called ‘shiny’, and it allows you to turn your intimidating R code into nice, user-friendly applications that can be interacted with through a familiar web interface. Check out the following link for inspiring examples. Conveniently, the library’s official web page contains a nice, well-explained tutorial which should get you started quickly (entirely accessible even to inexperienced R users!). What is more amazing is that ‘shiny’ and ‘plotly’ can be used in conjunction with each other, making them an extremely powerful combination for handling and communicating your data.
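To give a flavor of how little code a basic app requires, here is a minimal ‘shiny’ sketch (a generic histogram app following the pattern used in the official tutorial; the slider and plot are illustrative choices):

```r
library(shiny)

# UI: one slider input and one plot output
ui <- fluidPage(
  titlePanel("A minimal Shiny app"),
  sliderInput("n", "Sample size:", min = 10, max = 1000, value = 100),
  plotOutput("hist")
)

# Server: redraw the histogram whenever the slider moves
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = paste("n =", input$n))
  })
}

# shinyApp(ui, server)  # uncomment to launch the app in a browser
```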

Note that both ‘shiny’ and ‘plotly’ offer subscription-based hosting services. Installing and using the libraries within the R environment is completely free of charge, as is posting a limited number of objects (i.e. apps and graphs) online. However, you might want to check the Shiny and Plotly subscription plans to learn more about the precise terms.

 

Contributors to this web-page:

Adrián del Río

Matthias Schmidtblaicher

Gordey Yastrebov

 

 

Page last updated on 26 September 2017