The Role of R in Your Data Science Toolkit

Today, we’ll discuss R, a statistical computing language based on the S language developed at Bell Laboratories. R is a go-to tool for practicing data scientists. Our colleagues here at Seagate recently wrote a blog post on the merits of using Python for analytics; we hope to convince you to include R in your data science toolkit as well.

R first appeared in 1996 and has been under active development ever since. To date, the Comprehensive R Archive Network (CRAN) hosts 5,521 packages, the LinkedIn R user group has more than 30,000 members, and there are 2,889 Meetup groups. Moreover, R is widely used by academic statisticians, so cutting-edge statistical techniques are often available in R well before they appear elsewhere. R is arguably the most comprehensive open-source data analysis tool in existence. A small sampling of the available packages shows their breadth and variety:

  • data manipulation: plyr,

  • visualization: ggplot2 and animation,

  • SQL queries: sqldf,

  • network client: RCurl,

  • R-Java interface: rJava,

  • geography: maps, RgoogleMaps, ggmap,

  • dynamic reporting: knitr,

  • parallel computing: Rmpi,  snow,  multicore, parallel,

  • big data analytics: RHadoop, RHIPE, SparkR.

R was designed by statisticians as a statistical programming language, built to facilitate statistical analysis. This differentiates R from many other languages. Moreover, compared to proprietary software such as SAS JMP, Microsoft Excel, and Tableau, R is open source; as such, it is free, flexible, and evolving rapidly. The large number of packages and the strong statistics support available in R allow a user to get sophisticated analyses up and running in very little time. A data scientist will often find that R already contains a package that performs the analysis she is interested in, so there is frequently no need to “reinvent the wheel”. Many of R’s prepackaged libraries are straightforward to use if one’s modeling needs are standard.
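To give a sense of how little code a standard analysis can take, here is a minimal sketch that fits a linear regression using only base R and the built-in mtcars data set. The choice of data set, the model formula, and the "new car" values are our own assumptions for illustration, not part of any analysis discussed in this post.

```r
# Fit a linear regression on the built-in mtcars data set:
# fuel efficiency (mpg) as a function of weight (wt) and horsepower (hp).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficient estimates, standard errors, and p-values in one call.
print(summary(fit))

# Predict mpg for a hypothetical car. In mtcars, wt is in units of 1,000 lb,
# so this is a 3,000 lb car with 110 horsepower.
new_car <- data.frame(wt = 3.0, hp = 110)
predicted_mpg <- predict(fit, newdata = new_car)
print(predicted_mpg)
```

Both lm() and predict() ship with every R installation, so no additional packages are needed to run this.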

One of R’s most powerful features is its plotting and graphing capability. As an example, the following plots show the density of violent crime in Denver in 2009 superimposed over a map. They were made with the R package ggmap, using data downloaded from the Denver city government website. The data set documents occurrences of the following crimes: robbery, aggravated assault, arson, and murder; any other category of crime could be studied in the same manner. Generating the plots in R was straightforward and required only simple data manipulation. The first plot shows the overall crime density in 2009, while the second breaks down the crimes by day of the week.

Figure 1: Denver’s Violent Crime Density in 2009.

Figure 2: Denver’s Weekly Violent Crime Distribution in 2009.
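The kind of data manipulation behind a day-of-week breakdown can be sketched in base R as follows. This is not the code used to produce the figures: the data frame is synthetic, the column names (date, offense, weekday) are hypothetical, and a plain bar chart stands in for the ggmap overlay, which would additionally require map tiles and incident coordinates.

```r
# Synthetic stand-in for the crime records (NOT the Denver data):
# one row per incident with a date and an offense category.
set.seed(42)
n <- 500
crimes <- data.frame(
  date = as.Date("2009-01-01") + sample(0:364, n, replace = TRUE),
  offense = sample(c("robbery", "aggravated-assault", "arson", "murder"),
                   n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Derive the day of the week. as.POSIXlt()$wday is 0 for Sunday through
# 6 for Saturday and does not depend on the system locale.
day_labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
crimes$weekday <- factor(day_labels[as.POSIXlt(crimes$date)$wday + 1],
                         levels = day_labels)

# Cross-tabulate incidents by offense (rows) and weekday (columns).
counts <- table(crimes$offense, crimes$weekday)
print(counts)

# Stacked bar chart of the weekly distribution, one shade per offense.
barplot(counts, legend.text = TRUE,
        xlab = "Day of week", ylab = "Incidents (synthetic)")
```

With real records, the same table() call would work unchanged once the date column has been parsed with as.Date().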

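While R has many merits, we would be remiss not to mention a few of its limitations. An important one is that R does not support pass-by-reference function arguments: functions receive their arguments by value, so changes made to an argument inside a function are not visible to the caller. This makes R programs simpler to reason about, but it has significant performance implications, because modifying a large object inside a function can force R to copy it. Libraries exist that work around this limitation, e.g. R.oo, which provides objects with reference semantics. Another well-known issue is that R’s scoping rules can produce surprising results: a variable that is not defined inside a function is looked up in the enclosing environments, so the function may silently pick up a value defined elsewhere.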
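The following base-R sketch illustrates both limitations. All function and variable names (set_first, increment, counter, scale_value, factor_value) are made up for this example. The second part uses an environment, which is one way to get reference-like behaviour in base R; it is shown here as an illustration and is not a description of how R.oo works internally.

```r
# 1. Arguments behave as copies: changes made inside a function stay local.
set_first <- function(v) {
  v[1] <- 99   # modifies the function's own copy of v
  v
}
x <- c(1, 2, 3)
y <- set_first(x)
print(x)   # the caller's vector is unchanged: 1 2 3
print(y)   # the modified copy is returned: 99 2 3

# 2. Environments are the exception: they are mutable objects that are
#    not copied when passed to a function.
increment <- function(counter) {
  counter$n <- counter$n + 1
  invisible(NULL)
}
counter <- new.env()
counter$n <- 0
increment(counter)
increment(counter)
print(counter$n)   # 2: both calls updated the same object

# 3. Scoping: factor_value is not defined inside scale_value(), so R looks
#    it up in the environment where the function was created (here, the
#    global environment) each time the function is called.
scale_value <- function(v) v * factor_value
factor_value <- 2
print(scale_value(10))   # 20
factor_value <- 5
print(scale_value(10))   # 50: same call, different result
```

Passing every value a function depends on as an explicit argument avoids the surprise shown in the third part.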

To summarize, given R’s:

  1. unparalleled number of packages for sophisticated data analysis,

  2. excellent community support,

  3. and strong emphasis on statistical analysis,

we recommend that working data scientists give serious consideration to using the R statistical computing language. At the same time, it is important to remember that data science is a quickly growing and changing field, and the landscape of data science tools is ever evolving. One would be wise to keep an open mind, stay involved in the changes, and keep a collection of powerful and diverse tools in one’s toolkit.

Authors: Kuo Liu, Gennady Voronov, Hansu Gu, and Ed Wiley

References

D. Kahle and H. Wickham. “ggmap: Spatial Visualization with ggplot2.” The R Journal, Vol. 5/1, 2013.

R Bloggers. “Simply start over and build something better.”

A. Vance. “Data Analysts Captivated by R’s Power.” The New York Times, 6 Jan. 2009.