There have been a few articles lately posing the age-old question: “Is R or Python a better language to learn for a budding young data scientist?”
The consensus answer appears to be “It depends,” but, in reality, there’s no need to choose between R and Python, because you can have both, with a library called RPy2.
What is “Data Science”?
Before talking about how RPy2 enables “data science,” I’m going to point out that “data science” is a bit of an odd term. All science is “data science.” “Non-data science” is a completely different field: philosophy. “Data science” is just science, which is the discipline of publicly testing ideas by systematic observation, controlled experiment, and Bayesian inference.
The goal of “data science” is to draw statistically valid inferences from the data. The tag “data” is meant to suggest that it doesn’t really matter what data are being used, but this is false: it is difficult-to-impossible to do science without getting up-close-and-personal with the data, to understand the foibles of the systems that produced it, and to deal intelligently and sensitively with the non-idealities that come along with the good stuff.
Any interesting dataset has, at least, some of the following: missing values, outliers, and noise. Missing values are exactly what the name implies. Outliers are weird events that for some reason or other are wildly far outside the envelope of reasonableness. Noise is the distribution that results from the sea of random (or non-random) influences on the measured values. Outliers and noise differ in that noise generally has a well-measured distribution from fairly well-understood causes, while outliers are typically the result of poorly understood processes that happen rarely enough that we can’t get a good measure of the distribution.
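To make the distinction concrete, here is a minimal sketch in plain Python that sorts a handful of made-up readings into missing values, outliers, and ordinary noisy values. The readings, the `classify` helper, and the threshold of five median absolute deviations are all assumptions chosen for illustration, not a recommended rule for any particular dataset.

```python
import statistics


def classify(values, threshold=5.0):
    """Return index lists for missing, outlier, and ordinary (noisy) values."""
    present = [v for v in values if v is not None]
    center = statistics.median(present)
    # Median absolute deviation: a spread estimate that a rare wild value barely moves.
    mad = statistics.median([abs(v - center) for v in present])
    missing, outliers, ordinary = [], [], []
    for i, v in enumerate(values):
        if v is None:
            missing.append(i)
        elif mad > 0 and abs(v - center) / mad > threshold:
            outliers.append(i)
        else:
            ordinary.append(i)
    return missing, outliers, ordinary


# Made-up readings: None marks a missing value, 57.0 is the wild one.
readings = [10.1, 9.8, None, 10.3, 9.9, 57.0, 10.0, None, 10.2]
missing, outliers, ordinary = classify(readings)

# With the outlier set aside, the remaining scatter is a usable noise estimate.
noise = statistics.stdev([readings[i] for i in ordinary])
print(missing, outliers, round(noise, 3))
```

The point of the sketch is only that the noise estimate becomes meaningful once the outlier is set aside; deciding what actually counts as an outlier still takes knowledge of the system that produced the data.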
R, Python, and RPy2 are all useful tools for dealing with these kinds of problems.
Why R is Useful for Data Scientists
R is a delightful little language in the hands of an experienced statistical analyst. It was written by and for statisticians and makes some of the most basic data management tasks very easy. In particular, there are three basic tasks:
- Labeling data
- Filling in missing values
- Filtering outliers
These tasks are all very well-supported by R. Labeling is probably the most important of these. R’s concept of a “data frame,” which carries along dimension and entity labels as column and row headers while letting algorithms work on the purely numerical data inside, is a surprisingly big deal. Traditional numerical programming languages like Python typically leave the kind of book-keeping that data frames do automatically to the programmer. That book-keeping ends up taking a lot of work and is very easy to get wrong.
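As a rough illustration of that book-keeping, here is a sketch in plain Python with no data frame library at all. The sample names, column names, and the `select_rows` helper are hypothetical; the point is only that the labels and the numbers live in separate structures and have to be filtered in lockstep.

```python
# Labels and numbers are stored separately, as they would be without a data frame.
row_labels = ["sample_a", "sample_b", "sample_c"]
col_labels = ["mass_g", "temp_c"]
values = [
    [2.1, 20.5],
    [1.9, 21.0],
    [2.4, 19.8],
]


def select_rows(row_labels, values, keep):
    """Filter rows and their labels together so the two cannot drift apart."""
    pairs = [(label, row) for label, row in zip(row_labels, values) if keep(row)]
    return [label for label, _ in pairs], [row for _, row in pairs]


# Look the column up by name instead of hard-coding its position.
mass = col_labels.index("mass_g")
kept_labels, kept_values = select_rows(
    row_labels, values, lambda row: row[mass] >= 2.0
)
print(kept_labels, kept_values)
```

Every operation that reorders, drops, or merges rows needs the same care, which is exactly the work a data frame takes off your hands.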
Dealing with missing values and filtering outliers (or discarding entities that have too many outliers or missing values) are also two very important basic functions in any data processing task. There are also those cases where something that should be strictly positive (mass values, say) turns out to be negative now and then due to measurement error. How you deal with these things can have a big effect on the outcome of your analysis.
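Here is a small sketch of how much the choice can matter, using made-up mass measurements in plain Python. The sample data, the `drop_sparse` and `mean_mass` helpers, and the three policy names are assumptions for illustration only; none of them is being put forward as the right answer.

```python
import statistics

# Hypothetical mass measurements in grams; None marks a missing value.
samples = {
    "s1": [2.1, 1.9, 2.0, 2.2],
    "s2": [0.1, -0.2, 0.0, 0.3],     # mass near zero, so noise pushes one reading negative
    "s3": [None, None, 1.8, None],   # mostly missing
}


def drop_sparse(table, max_missing=1):
    """Discard entities that have more than max_missing missing values."""
    return {
        name: row
        for name, row in table.items()
        if sum(v is None for v in row) <= max_missing
    }


def mean_mass(row, policy):
    """Average a row under one policy for physically impossible negative values."""
    present = [v for v in row if v is not None]
    if policy == "keep":      # trust the noise to average out
        used = present
    elif policy == "clip":    # force negatives up to zero
        used = [max(v, 0.0) for v in present]
    elif policy == "drop":    # treat negatives as bad readings
        used = [v for v in present if v >= 0]
    else:
        raise ValueError(f"unknown policy: {policy}")
    return statistics.mean(used)


kept = drop_sparse(samples)
for policy in ("keep", "clip", "drop"):
    print(policy, round(mean_mass(kept["s2"], policy), 3))
```

For the near-zero sample, clipping doubles the estimated mean relative to keeping the negative reading, and dropping it inflates the estimate further, so the policy is an analytical decision and not a formality.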
R has a wealth of algorithms for dealing with these sorts of situations that embody the distilled wisdom of centuries of scientific practice, although it still requires a measure of taste and good judgment on the part of the analyst to choose the ones best suited to the data they are dealing with.
Bridging the R-Python Gap
Tom Radcliffe has over 20 years of experience in software development and management in both academia and industry. He is a professional engineer (PEO and APEGBC) and holds a Ph.D. in physics from Queen's University at Kingston. Tom brings a passion for quantitative, data-driven processes to ActiveState. He is deeply committed to the ideas of Bayesian probability theory, and assigns a high Bayesian plausibility to the idea that putting the best software tools in the hands of the most creative and capable people will make the world a better place.
Pandas, the Python data library, has many of the same features as R these days, but RPy2 creates a nice migration path from R to Python and lets you learn a lot about R as an incidental adjunct to learning Python. Moving in the other direction, an experienced analyst can do much of their experimental development in R and then, once they are happy with the results and want to incorporate the algorithm into a Python application for distribution to users, call it through RPy2.
The ability to perform this migration without ever leaving the conceptual model of R is very valuable. On the other side of the fence, it is just as vital to be able to use a truly general-purpose programming language like Python to wrap that conceptual model in a user-friendly application with a variety of complex additional features (printing, networking, USB support, etc.).
For example, I’ve used this approach to create Python applications that read some sensor data, process it via RPy2, and then display it to the user in a variety of ways. I have no clue how I’d read sensor data from R, although there’s probably a way to do it. With Python, there was already a module for doing what I needed, and if there hadn’t been, it would have been easy to write one as an extension.
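A rough sketch of that read, process, display shape is below. It is not the original application: `read_sensor` is a hypothetical stand-in for a real sensor module, the processing step is just R’s `median()` called through `rpy2.robjects`, and the fallback to Python’s standard library when R or rpy2 is not installed is my own assumption so the sketch runs anywhere.

```python
import statistics


def read_sensor():
    """Hypothetical stand-in for a real sensor module; None marks a dropped reading."""
    return [1.0, None, 3.0, 2.0]


def median_via_r(readings):
    """Compute the median with R's median() through rpy2 (needs R and rpy2 installed)."""
    import rpy2.robjects as ro

    # R represents a missing number as NA, so translate Python's None.
    vec = ro.FloatVector([ro.NA_Real if x is None else x for x in readings])
    r_median = ro.r["median"]
    # R argument names that contain dots are passed via dict unpacking.
    return float(r_median(vec, **{"na.rm": True})[0])


def median_reading(readings):
    """Process readings in R when possible, otherwise in plain Python."""
    try:
        return median_via_r(readings)
    except Exception:
        # Assumption for this sketch: if rpy2 or R is unavailable, fall back
        # to the standard library so the example still runs.
        return statistics.median([x for x in readings if x is not None])


# "Display" is reduced to a print call here; a real application would do more.
print(f"median reading: {median_reading(read_sensor())}")
```

In a real application the R side would be whatever algorithm the analyst had already worked out, and everything around it (the device I/O, the user interface) stays in Python.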
So if you don’t already know R, my recommendation is to learn Python and use RPy2 to access R’s functionality. That way you’ll be learning one language but gaining the power of two. Once you’ve learned RPy2, the jump to pure R isn’t a big one, whereas starting from the other end, the migration path isn’t quite so easy.
This post originally appeared on the ActiveState ActiveBlog.
ActiveState is a sponsor of The New Stack.