R Package: Drawing Quick Plots With ggplot2

Mar 30th, 2015 5:00am by Manjusha Joshi
This is the fourth post in a series about the R programming language. In the third post, Manjusha examined how to understand data graphically through R. In the second post, she explored how to pull data from R for smart analysis. The first post in the series explored data visualization with R.

A picture is worth a thousand words: a complex notion can be expressed as an image, letting the reader absorb large volumes of data visually. ggplot2, a popular R package, draws meaningful graphs from the available data and requires minimal skill to produce a picture of that data. It is designed to represent data accurately without making the user worry about the complexities of building a graph.

The examples in this post use the datasets that are built into R.

Get Started Creating

ggplot2 is a plotting system in R that uses the grammar of graphics. It provides two important commands:

  1. qplot(): a quick plot.
  2. ggplot(): allows for more detailing of the graph. It also allows for layered graphs.

Let’s review how ggplot2 basically works, starting with an in-depth look at the inbuilt data available in R.
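As a sketch of this first step, assuming the built-in women data set that the rest of this section uses, names() lists the columns of a data set:

```r
# women ships with base R (datasets package), so nothing needs to be installed
names(women)
```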


This will display all the column names for the data. With the use of the head command, we can view the data as follows:
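For instance, head() prints the first six rows of a data set by default:

```r
# Show the first six rows of the women data set
head(women)
```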


In the data set “women” there are two parameters: height and weight.

Let’s now pass these parameters into qplot():
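A minimal sketch of that call, assuming ggplot2 is installed and that height goes on the x axis (the axis order is an assumption). Recent ggplot2 releases mark qplot() as deprecated, so a warning may appear:

```r
library(ggplot2)

# Two numeric columns and no geom argument: qplot() draws points
p <- qplot(height, weight, data = women)
print(p)
```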


The resulting graph is shown below:

[Figure: womenplot]

The graph above plots height versus weight, drawn as points by default.

Let’s plot the line graph below by issuing the command:
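One way to write it, assuming the same two columns as before and the geom argument of qplot():

```r
library(ggplot2)

# geom = "line" connects the observations in order of their x value
p <- qplot(height, weight, data = women, geom = "line")
print(p)
```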


[Figure: Wth-lineplot]

In ggplot2 the geometric object used to represent the data is called a “geom.”

If we want both representations, points as well as lines, both need to be passed to the geom argument. Here, geom is the argument that specifies the geometry.

Recall: In R, we combine several values into a collection (a vector) with c().

We are passing two options for the geom, so we have to use a collection.
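
A sketch of such a call, again assuming the women columns used earlier:

```r
library(ggplot2)

# c("point", "line") asks qplot() to draw both geoms on the same plot
p <- qplot(height, weight, data = women, geom = c("point", "line"))
print(p)
```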
To change the color of the graph, say from black to red, we have to issue:
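A sketch of one way to do this. Wrapping the value in I() is an assumption about the intended command: it tells qplot() to use the literal color rather than treat "red" as data to be mapped and shown in a legend.

```r
library(ggplot2)

# I("red") sets a fixed color instead of mapping a variable to color
p <- qplot(height, weight, data = women,
           geom = c("point", "line"), color = I("red"))
print(p)
```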


But that is not the only use of the color option. Try passing a column of the data to it:
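For example, assuming the height column is the one passed to color:

```r
library(ggplot2)

# Mapping a numeric column to color produces a color gradient and a legend
p <- qplot(height, weight, data = women, color = height)
print(p)
```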



[Figure: womenhtplot]

Notice from the graph above that the points are colored along a gradient according to height, and that a legend is automatically added beside the graph.

By adding more options, more legends are included, as indicated below:
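A sketch under the assumption that the extra option is size, mapped to the weight column:

```r
library(ggplot2)

# color and size are mapped to different columns, so each gets its own legend
p <- qplot(height, weight, data = women, color = height, size = weight)
print(p)
```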


Here, there are two legends: one for height, another for weight. Notice that the size of the points changes as the weight increases.

[Figure: wthplot]

More Datasets for More Plots

Let’s take a closer look at the data set “iris”, which contains measurements of iris flowers, as shown below.
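A quick way to inspect it, assuming head() is used as it was for the women data:

```r
# iris is built into R: four numeric measurements plus a Species column
head(iris)
```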


Look closely at the columns in this data. Earlier, we used only two columns. Now we pass a third column to color to see its effect:
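A sketch of such a call. The choice of Sepal.Length and Sepal.Width for the two axes is an assumption; any pair of numeric columns would do:

```r
library(ggplot2)

# Species is a factor, so each species gets its own discrete color
p <- qplot(Sepal.Length, Sepal.Width, data = iris, color = Species)
print(p)
```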


Let’s now find the different species available under the iris data set:
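One way to list them, assuming levels() is used (unique() would serve equally well):

```r
# Species is stored as a factor; levels() returns its distinct categories
levels(iris$Species)
# "setosa" "versicolor" "virginica"
```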


Specifying “color = Species” displays each species in a separate color and adds a matching legend.

[Figure: iris]

More Options to Use

There are many options for a geom. These are some of the most frequently used. Some options work with a single column of data, others with two columns.

data             geom
single column    bar, density
double column    point, line, smooth

Recall: To plot data on a graph we require two coordinates (x,y).

  • When data is available as a single column, the values are plotted against their frequency (count).
  • Two columns represent (x,y) data.

For example:
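A minimal sketch, relying on qplot() choosing a histogram when it is given a single numeric column and no geom:

```r
library(ggplot2)

# One numeric column: the y axis becomes the count of values in each bin
p <- qplot(Sepal.Length, data = iris)
print(p)
```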


This will plot the histogram of Sepal.Length against count.

There are many more interesting options. For the histogram, we are using the single column Sepal.Length, and we add the option “fill=Species” to bring more color-coded information into the graph:
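A sketch of that call, using the option named above:

```r
library(ggplot2)

# fill = Species colors each bar segment by species and adds a legend
p <- qplot(Sepal.Length, data = iris, fill = Species)
print(p)
```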



[Figure: qplotfill]

Let’s explore another data set, called Orange. Take a look at the column names it holds:
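Assuming names() is used as before:

```r
# Orange is built into R and records the growth of orange trees
names(Orange)
# "Tree" "age" "circumference"
```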


To see the data it holds, simply type the command below:
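A sketch assuming head() is the intended command. Typing Orange on its own would print the whole data set instead:

```r
# Show the first six rows of the Orange data set
head(Orange)
```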


Now let us plot age versus circumference, passing “data = Orange” and using the color option:
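A sketch of such a plot. Mapping color to the Tree column is an assumption, suggested by the name of the resulting figure:

```r
library(ggplot2)

# Each tree gets its own color, so individual growth patterns stand out
p <- qplot(age, circumference, data = Orange, color = Tree)
print(p)
```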


[Figure: treedata]

Other Useful Options With qplot

option   effect
xlim     limits for X axis
ylim     limits for Y axis
main     main title for the graph
xlab     label for X axis
ylab     label for Y axis
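
These options can be combined in a single call. The sketch below uses the women data set again; the limits, title and labels are illustrative values, not taken from the original post:

```r
library(ggplot2)

# Axis limits are given as c(lower, upper); titles and labels are strings
p <- qplot(height, weight, data = women,
           xlim = c(55, 75), ylim = c(110, 170),
           main = "Height versus weight",
           xlab = "Height", ylab = "Weight")
print(p)
```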

For further options, visit the following URLs: http://www.cookbook-r.com/Graphs/ and http://docs.ggplot2.org/current/

Storing the Graphics

After some trial and error you will be able to obtain good graphics. R also provides a way to store graphics:

  1. Create a file where graphical output can be redirected.
  2. Type the command for the graphics.
  3. Close the file and end the redirection by issuing graphics.off().
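
The three steps above can be sketched as follows. The pdf() device and the file name are assumptions made for illustration; R offers other file devices as well, such as png() and jpeg():

```r
library(ggplot2)

# Hypothetical output file in the session's temporary directory
out_file <- file.path(tempdir(), "womenplot.pdf")

pdf(out_file)                               # 1. open a file device
print(qplot(height, weight, data = women))  # 2. draw the plot into it
graphics.off()                              # 3. close all open devices
```

Calling print() explicitly ensures the plot is drawn even when the code runs inside a function or a sourced script.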

In this tutorial we have just witnessed the beginning of the world of graphics for data analysis with ggplot2.

In the next post we will take a look at how the ggplot() command helps us plot complex graphs.

The code for the exercises can be found here.

Manjusha Joshi is a freelancer of free, open source software for scientific computing. She is a mathematician and member of the Pune Linux user group.

Featured image via Flickr Creative Commons.
