Data Visualization Basics with the R Programming Language

This is the first post in a series about the R programming language. In the second post, Manjusha explores how to pull data from R for smart analysis. The third post in the series covers how to understand data graphically with the R programming language. In the fourth post, she explains how to use ggplot2 for drawing quick plots.
A flourishing public library can be daunting, with its overwhelming shelves of knowledge. What if there were an easier way to organize all that knowledge and see it in a different way?
One way of achieving this is with R programming.
What is R programming?
The R programming language is free and open source software that is quickly gaining popularity for its data handling capabilities. It can be described as follows:
- It is a scripting language that can process large amounts of data efficiently.
- It can connect to most common databases and interoperate with many other programming languages.
- It has robust built-in graphics functions.
- It is widely used for data mining, information retrieval and data analysis.
- It supports machine learning algorithms, among much else.
- It can be used like a scientific calculator.
- It handles vectors, matrices and lists.
- It includes a large collection of built-in statistical and mathematical functions.
Vectorization in R
In traditional programming, operations are typically performed element by element, usually in a loop. Vectorization means applying an operation directly to an entire vector instead of to its individual elements.
Consider the sum of squares of the first 10,000 integers, computed with a loop:
    > s=0
    > for (i in 1:10000) {s=s+i^2}
    > s
    [1] 333383335000
When handling data, you do not need to run operations element by element. R can work on a whole vector at a time, so the operation can be applied directly to the vector:
    > sum((1:10000)^2)
    [1] 333383335000
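As a small sketch of the same idea, the loop and the vectorized call can be checked against each other, and ordinary arithmetic on two vectors is also applied position by position. The names loop_total, vec_total, a and b are illustrative and do not come from the original post.

```r
# Sum of squares with an explicit loop
loop_total <- 0
for (i in 1:10000) {
  loop_total <- loop_total + i^2
}

# The same computation applied to the whole vector at once
vec_total <- sum((1:10000)^2)

# Both approaches give the same value
print(loop_total == vec_total)

# Arithmetic operators work element by element on whole vectors
a <- c(1, 2, 3)
b <- c(10, 20, 30)
print(a + b)
print(a * b)
```

The vectorized form is shorter and avoids the explicit loop, which usually makes it the more readable choice in R.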
Data objects
Data may be a string, a number, a logical value, a collection of numbers and so on. R provides several types of data objects for storing data in these different forms.
Vector
A vector stores a sequence of values of the same type. Data can be stored in a vector as follows:
    > m<-c(100,-23,3.4,56/34)
    > m
    [1] 100.000000 -23.000000   3.400000   1.647059
c stands for combine: it joins its arguments into a single vector, which is then assigned to m.
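The following sketch shows vectors of the basic types mentioned above and how to inspect them. The variable names num, chr, lgl and mixed are illustrative assumptions, not part of the original post.

```r
# Vectors of the three basic types
num <- c(100, -23, 3.4, 56/34)
chr <- c("data", "mining")
lgl <- c(TRUE, FALSE, TRUE)

# class() reports the type of each vector
class(num)
class(chr)
class(lgl)

# length() gives the number of elements
length(num)

# A vector holds one type only, so mixed input is
# converted to a common type (here: character)
mixed <- c(1, "a", TRUE)
class(mixed)
```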
Quick Data Accessibility
Elements are accessed by their index values, counting from 1:
    > m[2:4]
    [1] -23.000000   3.400000   1.647059
A logical expression can also be used as the index; it selects the elements for which the expression is true.
    > n<-c(23,34,-34,0,-12,10)
    > n[n<0]
    [1] -34 -12
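A few further indexing forms may be useful alongside the two shown above. This is a minimal sketch using the same vector n; the names rest, neg_pos and small_pos are illustrative.

```r
n <- c(23, 34, -34, 0, -12, 10)

# A negative index drops that position instead of selecting it
rest <- n[-1]
rest

# which() returns the positions where the condition holds
neg_pos <- which(n < 0)
neg_pos

# Conditions can be combined with & (and) and | (or)
small_pos <- n[n > 0 & n < 30]
small_pos

# Indexing also works on the left-hand side of an assignment:
# here every negative entry is replaced by 0
n[n < 0] <- 0
n
```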
Matrix
A matrix is a collection of numerical data in tabular format. Large data sets can be held in a single matrix, and the many mathematical functions available for matrices make it possible to process all of that data at once.
The matrix function creates a matrix data object.
    > A<-matrix(m,nrow=2,byrow=TRUE)
    > A
          [,1]       [,2]
    [1,] 100.0 -23.000000
    [2,]   3.4   1.647059
Matrix Related Functions
The ^ operator works element by element, so A^-1 returns the reciprocal of each entry, not the inverse of A:
    > A^-1
              [,1]        [,2]
    [1,] 0.0100000 -0.04347826
    [2,] 0.2941176  0.60714286
The matrix inverse itself is computed with solve(A).
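Because ^ acts on each entry separately, a true matrix inverse needs a different function. The sketch below rebuilds A from the values shown earlier, computes its inverse with solve() and checks the result; A_inv is an illustrative name.

```r
m <- c(100, -23, 3.4, 56/34)
A <- matrix(m, nrow = 2, byrow = TRUE)

# solve() with a single matrix argument returns its inverse
A_inv <- solve(A)

# %*% is matrix multiplication; the product should be the
# identity matrix up to floating-point rounding
round(A %*% A_inv, 10)

# Two other common matrix functions
t(A)      # transpose
det(A)    # determinant
```

Note that A * A_inv (with a plain *) would again multiply entry by entry; matrix multiplication always uses %*%.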
List
A list can hold elements of mixed types and unequal lengths. Character data, numerical data, vectors and even matrices can be combined in a single list.
The list function creates a list data object.
    > v<-list(A,"Try This",pi)
    > v
    [[1]]
          [,1]       [,2]
    [1,] 100.0 -23.000000
    [2,]   3.4   1.647059

    [[2]]
    [1] "Try This"

    [[3]]
    [1] 3.141593
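Once a list exists, its elements are usually retrieved with double brackets. The following is a minimal sketch; the named list w and its element names mat, label and constant are assumptions made for illustration.

```r
A <- matrix(c(100, -23, 3.4, 56/34), nrow = 2, byrow = TRUE)
v <- list(A, "Try This", pi)

# Double brackets return the element itself
v[[2]]          # the character string
v[[1]][1, 2]    # row 1, column 2 of the stored matrix

# Single brackets return a shorter list, not the element
class(v[2])

# Elements can be given names and then accessed with $
w <- list(mat = A, label = "Try This", constant = pi)
w$label
length(w)
```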
Popular Data Object: data.frame
A data frame is a collection of columns that may have different types, such as numerical, logical or character data. The only requirement is that every column has the same number of entries.
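    > word<-c("Data","mining","Algorithms","Prediction","Analysis","Logic","Twitter","Model","Cluster")
    > f<-c(123,34,89,120,32,56,45,111,22)
    > d<-data.frame(Word=word,Freq=f)
    > d
            Word Freq
    1       Data  123
    2     mining   34
    3 Algorithms   89
    4 Prediction  120
    5   Analysis   32
    6      Logic   56
    7    Twitter   45
    8      Model  111
    9    Cluster   22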
Data read from various sources can be stored as a data.frame.
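The sketch below shows a few common operations on such a data frame and a round trip through a CSV file, as one example of an external source. The data frame is rebuilt from the values printed above; the names frequent, sorted, path and d2 are illustrative, and the temporary file stands in for a real data file.

```r
word <- c("Data", "mining", "Algorithms", "Prediction", "Analysis",
          "Logic", "Twitter", "Model", "Cluster")
f <- c(123, 34, 89, 120, 32, 56, 45, 111, 22)
d <- data.frame(Word = word, Freq = f)

# A column is extracted with $
d$Freq

# Rows are filtered with a logical condition on a column
frequent <- d[d$Freq > 100, ]
frequent

# order() sorts the rows, here from most to least frequent
sorted <- d[order(d$Freq, decreasing = TRUE), ]
head(sorted, 3)

# Write the data to a temporary CSV file and read it back
path <- tempfile(fileext = ".csv")
write.csv(d, path, row.names = FALSE)
d2 <- read.csv(path)
str(d2)
```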
Add On Libraries
Additional libraries can be installed as required; they are used for special tasks. A library is downloaded from CRAN (http://cran.r-project.org/) with the command
    install.packages("wordcloud",dependencies=TRUE)
from the R console. The library must then be loaded in each working session.
    library(wordcloud)
Word Cloud: Digging Out Information
The wordcloud library provides the wordcloud function, which takes the words and their frequencies as arguments.
Let x be the collection of words and f the collection of their corresponding frequencies.
To see which words occur most frequently, let us draw the word cloud.
The size of each word grows with its frequency.
    > x<-c("data","stats","Sentiment","Analysis","Social","Networking","Visualize",
    +      "Big data","Graphics","Maths","Algorithms","Machine","Learning",
    +      "Classification","Clustering","Grouping","Mining","Text")
    > f<-c(78,40,172,147,213,101,217,29,149,174,213,166,265,215,56,109,80,260)
    > # max.words is named explicitly; an unnamed Inf would be matched to min.freq
    > wordcloud(x,f,scale=c(4,1),max.words=Inf,random.order=FALSE)
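Independently of the plot, the most frequent words can be read off with base R. This is a minimal sketch that assumes the frequency vector contains the 18 values listed above, one per word; ranking and top3 are illustrative names.

```r
x <- c("data", "stats", "Sentiment", "Analysis", "Social", "Networking",
       "Visualize", "Big data", "Graphics", "Maths", "Algorithms",
       "Machine", "Learning", "Classification", "Clustering", "Grouping",
       "Mining", "Text")
f <- c(78, 40, 172, 147, 213, 101, 217, 29, 149, 174, 213, 166, 265,
       215, 56, 109, 80, 260)

# Each word needs exactly one frequency
length(x) == length(f)

# Positions of the frequencies from largest to smallest
ranking <- order(f, decreasing = TRUE)

# The three most frequent words, which the cloud draws largest
top3 <- x[ranking][1:3]
top3
```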
More information about R
- R has several good GUIs. The best known is RStudio.
- R comes with built-in help and good built-in graphics commands.
- It ships with built-in data sets drawn from real data.
- If a value is missing for an entry of a vector or matrix, R represents it as NA (not available).
- Special values such as NaN (not a number) and Inf (infinity, ∞) also exist in R.
- Because R is free and open source software, many libraries, help resources and documentation for it are available online.
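The special values mentioned in the list above behave as follows. This is a small sketch; the vector scores is a hypothetical example with one missing entry.

```r
# Hypothetical vector with one missing entry
scores <- c(12, NA, 7, 30)

is.na(scores)                # flags the missing position
mean(scores)                 # NA: a missing value propagates
mean(scores, na.rm = TRUE)   # drops NA before averaging

# Division by zero produces special values instead of an error
1 / 0                        # Inf
-1 / 0                       # -Inf
0 / 0                        # NaN

# Helper functions for testing these values
is.nan(0 / 0)
is.finite(c(1, Inf, NA, NaN))
```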
The code for the exercises can be found here: https://github.com/manjushajoshi/R-code/tree/master/DiggingdatawithR-v2
Manjusha Joshi is a freelancer in free open source software for scientific computing. She is a mathematician and member of the Pune Linux User group.
Feature image via Flickr Creative Commons.