R Server 9 Adds Machine Learning to Work with Your Data Where It Lives
Built by data scientists, the R programming language has always been a tool for data scientists. But Microsoft’s R Server 9, the first full new version of the commercial package of R since Microsoft bought the company that created this distribution, Revolution Analytics, is also now aimed at a new audience — enterprise customers who have developers and analysts as well as data scientists.
That makes working with data from a wider range of sources key because enterprises have such mixed environments these days.
R Server already supported the Apache Spark 1.6 data processing framework; R Server 9 (which is built on open source R 3.3.2) adds support for Spark 2.0, so you can take advantage of the new options for working with streaming data and the improved memory management subsystem.
“You can intermix calls to massively parallel algorithms in R with calls to native Spark, through the SparkR library,” explained Bill Jacobs, Principal Program Manager on the R Server team.
R Server 9 can also now connect to Apache Hive for real-time queries, and Apache Parquet, which is quickly becoming popular for columnar storage, as a way to load data into Spark DataFrames to be analyzed by Microsoft’s ScaleR functions. ScaleR is designed to deal with datasets too large to fit in memory and it’s available in Azure HDInsight and, soon, the Azure Machine Learning service, as well as in R Server (and in the free Microsoft R Client, for smaller datasets).
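The article does not describe how ScaleR handles data that will not fit in memory, but the general idea behind such out-of-memory algorithms can be sketched in a few lines of Python. This is an illustration only, not ScaleR code: the helper names `read_in_chunks` and `chunked_mean_variance` are hypothetical, and a small list stands in for a file too big to load at once.

```python
from typing import Iterable, Iterator, List, Tuple


def read_in_chunks(values: List[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield fixed-size slices; a stand-in for reading a large file block by block."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def chunked_mean_variance(chunks: Iterable[List[float]]) -> Tuple[float, float]:
    """Return (mean, population variance) while holding only one chunk at a time."""
    count = 0
    total = 0.0
    total_sq = 0.0
    for chunk in chunks:
        # Only running totals survive between chunks, so memory use stays flat.
        count += len(chunk)
        total += sum(chunk)
        total_sq += sum(x * x for x in chunk)
    if count == 0:
        raise ValueError("no data")
    mean = total / count
    # Simple sum-of-squares formula; fine for a demo, not the most numerically robust.
    variance = total_sq / count - mean * mean
    return mean, variance


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
mean, variance = chunked_mean_variance(read_in_chunks(data, chunk_size=2))
print(mean, variance)
```

Because each chunk only updates a few running totals, the same loop could in principle be split across cores or cluster nodes and the totals combined at the end.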
R Server 9 now also runs on Ubuntu, in addition to SUSE, Red Hat and CentOS, and supports the Cloudera, Hortonworks and MapR Hadoop distributions.
As Bharat Sandhu from Microsoft’s Advanced Analytics team put it, “Data in the enterprise is increasing by leaps and bounds and it is on multiple platforms, so customers need this intelligence closer to the data and on multiple platforms. We want to work with what customers have; we want to work with the skills and knowledge they possess, and the systems they have already invested in.”
Machine Learning on Your Data Platform
Those advanced analytics now include machine learning algorithms and data transforms, based on Microsoft’s extensive machine learning work. From Skype Translator to Bing to Exchange, Microsoft is using machine learning in a wide range of products and already provides many of its algorithms as Cognitive Services APIs you can call in your own code.
The new MicrosoftML package in R Server 9 includes six multi-threaded algorithms (based on machine learning used by Microsoft teams, but generalized to be useful for a wider range of scenarios):
- GPU-accelerated deep neural networks, which should perform significantly better than models that use only the CPU. Microsoft says training multi-layer custom networks is up to eight times faster.
- Fast linear learner based on Stochastic Dual Coordinate Ascent (SDCA), with support for L1 and L2 regularization, for binary classification and linear regression. Microsoft says this trains twice as fast as logistic regression.
- Fast boosted decision tree for binary classification and regression.
- Fast random forest for binary classification and regression.
- Logistic regression, with support for L1 and L2 regularization.
- Binary classification using a OneClass Support Vector Machine for anomaly detection.
These algorithms are in R Server for Windows, the free R Client for Windows and SQL Server R Services now; they’ll come to Linux and Hadoop in the first quarter of 2017.
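Several of the learners above mention L1 and L2 regularization. As a rough illustration of what those penalties do, here is a minimal Python sketch of logistic regression trained with plain batch gradient descent. It is not MicrosoftML code and not SDCA; the function names, the toy data and the penalty form (`l1 * |w| + l2 / 2 * w**2`) are assumptions chosen for the example.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(xs: Sequence[Sequence[float]], ys: Sequence[int],
                   l1: float = 0.0, l2: float = 0.0,
                   lr: float = 0.1, epochs: int = 500) -> Tuple[List[float], float]:
    """Batch gradient descent on the mean log loss plus l1*|w| + (l2/2)*w^2."""
    n, n_features = len(xs), len(xs[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        for j in range(n_features):
            sign = (w[j] > 0) - (w[j] < 0)  # subgradient of |w|
            w[j] -= lr * (grad_w[j] / n + l2 * w[j] + l1 * sign)
        b -= lr * grad_b / n  # the bias is left unpenalized
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0


# Toy, linearly separable data with one feature.
xs = [[-2.0], [-1.0], [1.0], [2.0]]
ys = [0, 0, 1, 1]

w_plain, b_plain = train_logistic(xs, ys)
w_ridge, b_ridge = train_logistic(xs, ys, l2=1.0)
predictions = [predict(w_plain, b_plain, x) for x in xs]
print(predictions, w_plain, w_ridge)
```

On this toy data the L2-penalized weight ends up smaller in magnitude than the unpenalized one, which is the shrinkage effect regularization is meant to provide.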
They let you do text classification for, say, sentiment analysis or sorting support tickets; create models for churn prediction, spam filtering, fraud and risk analysis, click-through prediction and demand forecasting; or create neural networks to solve complex machine learning problems like image classification, OCR and handwriting analysis.
Building a six-layer neural network takes just 60 lines of script, and the topology of the neural network can be arbitrarily deep. The only limitation is the computing power at your disposal: more layers usually mean slower training.
MicrosoftML also includes machine learning transform pipelines that let you create a custom set of transformations to featurize your data before you train or test with it, using the following calls:
- concat() combines multiple columns into a single vector-valued column, speeding up training times.
- categoricalHash() converts a categorical value into an indicator array using hashing, which is useful when you have large numbers of categories.
- categorical() converts a categorical value into an indicator array using a dictionary, for a small, fixed number of categories.
- selectFeatures() selects features from a list using count or mutual information modes.
- featurizeText() produces a bag of counts of n-gram word sequences from your text, with language detection, tokenization, text normalization, feature generation, and term weighting — it can also remove ‘stop words’ that are too common to be useful.
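To make the difference between the dictionary and hashing approaches, and the idea of n-gram counts, more concrete, here is a small Python sketch. It only illustrates the underlying techniques and is not the MicrosoftML API: the function names are hypothetical, SHA-256 is an arbitrary choice of hash, and tokenization is a bare whitespace split rather than the fuller text pipeline described above.

```python
import hashlib
from collections import Counter
from typing import FrozenSet, List


def dictionary_indicator(value: str, vocabulary: List[str]) -> List[int]:
    """One slot per known category; suits a small, fixed set of categories."""
    return [1 if value == known else 0 for known in vocabulary]


def hashed_indicator(value: str, num_slots: int) -> List[int]:
    """Fixed-width indicator whose slot is picked by hashing the value.

    No dictionary is needed, so it scales to many categories, at the cost
    of occasional collisions between different values.
    """
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    slot = int.from_bytes(digest[:4], "big") % num_slots
    return [1 if i == slot else 0 for i in range(num_slots)]


def ngram_counts(text: str, n: int,
                 stop_words: FrozenSet[str] = frozenset()) -> Counter:
    """Bag of counts of n-word sequences after lowercasing and stop-word removal."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


colour = dictionary_indicator("red", ["blue", "red", "green"])
hashed = hashed_indicator("customer-48213", num_slots=8)
bigrams = ngram_counts("The cat sat on the mat", n=2,
                       stop_words=frozenset({"the", "on"}))
print(colour, hashed, dict(bigrams))
```

The dictionary version returns all zeros for a value it has never seen, while the hashed version always sets exactly one slot, which is why hashing is the more practical choice when the set of categories is large or open-ended.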
You can use the new algorithms alongside the RevoScaleR functions for importing, cleaning and visualizing your data, and existing open source CRAN R packages.
Using Your Models in More Places
R Server 9 is designed to integrate well with enterprise systems. “The big challenge for R is how to become an operationalizable, embeddable, integratable component of larger applications,” said Jacobs. “Now you can take R models from Spark and move them to a SQL Server system or a Linux box or a Hadoop cluster or a Windows Server system or Teradata.”
One way of doing that is what used to be called the DeployR server; this feature is now integrated into R Server, and it’s now just called ‘operationalization capabilities.’ It’s already available in standalone server installations on Windows, RHEL, CentOS and Ubuntu, and is coming to SLES11, Teradata, Hadoop, and SQL Server R Services in 2017. But Microsoft sees the new R support in SQL Server 2016 as the ideal way for enterprises to turn analytics models into solutions you can use at scale in your business, rather than just reports to look at.
It takes a single-line command from the new sqlrutils package to embed the R script in a T-SQL stored procedure in a SQL Server database, where you can access it through any app or website that connects to the database. If you create a neural network, you get a binary blob that’s a serialized version of the trained neural network that sits in the database.
“It runs in the database where it can run massively parallel, with multiple threads and multiple cores,” explained Jacobs. “It also provides all the security because the data never leaves the database engine.”
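The idea of a trained model living in the database as a serialized binary blob can be sketched with Python's standard library. This is an analogy only: `sqlite3` and `pickle` stand in for SQL Server and R's serialization, the sqlrutils package is not shown, and the table name, model name and tiny linear "model" are all hypothetical.

```python
import pickle
import sqlite3

# Hypothetical "trained model": a plain linear scorer kept as a dict.
model = {"weights": [0.5, -1.25], "bias": 0.1}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE models (name TEXT PRIMARY KEY, payload BLOB NOT NULL)")

# Serialize the model to bytes and store it next to the data it will score.
conn.execute(
    "INSERT INTO models (name, payload) VALUES (?, ?)",
    ("lead_scoring", pickle.dumps(model)),
)
conn.commit()


def score(connection: sqlite3.Connection, name: str, features: list) -> float:
    """Load the serialized model by name and apply it to one row of features."""
    row = connection.execute(
        "SELECT payload FROM models WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    # Only unpickle blobs you wrote yourself; pickle is unsafe on untrusted data.
    loaded = pickle.loads(row[0])
    return sum(w * x for w, x in zip(loaded["weights"], features)) + loaded["bias"]


result = score(conn, "lead_scoring", [2.0, 1.0])
print(result)
```

Any application with a database connection can then call the scoring routine without ever handling the model file itself, which is the appeal of the stored-procedure approach described above.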
That will become more widely relevant when the Linux version of SQL Server ships in 2017, but it’s not the only new option. To speed up how quickly you can start using your R models, and to make sure models in R Server stay useful as technology platforms develop, R Server 9 also makes it easy to expose both models and even arbitrary R scripts as web services that you can call as APIs from any programming language, using the Swagger API framework.
Again, it’s a simple process to create the Swagger JSON document; that’s something data scientists can do themselves, from RStudio, R Tools for Visual Studio or a Jupyter notebook, before sending the file to the developer who’s going to use the model. The app developer then runs another command to generate the Swagger client code for the API, which they can call in their app with a few more lines of code.
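In practice the generated Swagger client hides the HTTP details, but it can help to see roughly what such a call looks like on the wire. The Python sketch below builds (without sending) a JSON POST request to a model published as a web service. The host, port, URL layout, service name, input fields and bearer token are all assumptions made for illustration; the Swagger document for a real service defines the actual paths and parameters.

```python
import json
import urllib.request


def build_scoring_request(base_url: str, service: str, version: str,
                          inputs: dict, token: str) -> urllib.request.Request:
    """Build a JSON POST request for a hypothetical model web service."""
    url = f"{base_url.rstrip('/')}/api/{service}/{version}"
    body = json.dumps(inputs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


request = build_scoring_request(
    "http://localhost:12800",              # assumed host and port
    service="leadScore",                   # hypothetical service name
    version="v1.0",
    inputs={"age": 42, "channel": "email"},  # hypothetical model inputs
    token="EXAMPLE-TOKEN",
)
print(request.get_method(), request.full_url)

# Sending it would look like this (left out here because no server is running):
# with urllib.request.urlopen(request) as response:
#     prediction = json.load(response)
```

Because the contract is just HTTP and JSON, the same call can be made from any language, which is the point of exposing the model through Swagger.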
That’s much quicker than the traditional deployment process with R. Usually, pointed out R Server program manager Carl Nan, “after the data scientists build the R model, it takes the app developer a long time to convert to other programming languages so they can integrate it with business apps in production. It’s error prone process with slow innovation rates that ends up with stale models.”
Exposing models as web services makes the machine learning models you create yourself in R Server as easy to work with as the commercial machine learning APIs from providers like Microsoft and Google.
The new options mean that machine learning and analytics models you create in R aren’t tied to the platforms you build them on, or the platform you currently use them on, Jacobs pointed out. You can train in one environment and deploy in many. “No platform lasts forever. Some of what we’re doing here is providing portability that allows you to build hybrid applications. But over time, the ability to build code in one place and run it another also provides a form of future proofing that abstracts a lot of the data scientists’ work away from the peculiarities of platforms in use and makes their work last a long time.”
On the other hand, for businesses that don’t have data scientists, Microsoft is also producing solution templates for specific problems that are ready to use. “At times, the build-it-yourself approach can leave the uninitiated a little bit in the lurch and having a lot work to do to get their first good results,” Jacobs said.
The first template predicts when leads will convert to customers, creating a dashboard that recommends whether to use email, text message or a phone call to reach those potential customers, and even what time and day to contact them. Modeling the data to produce that would usually take weeks of work. The template is based on the insurance industry, but all the code is on GitHub, so if you do have data scientists they can load it into R and qualify the models against your own data.
On a smaller scale (or for development purposes), several of the key new features of R Server are also in the 3.3.2 version of the free R Client, including the machine learning library and the olapR, mrsdeploy and sqlrutils packages.
The R Server 9 support for machine learning comes in the first Community Technology Preview (CTP) of the next version of SQL Server; the current version has embedded R support, but not the new packages.
Feature image by Nathan Anderson, via Unsplash. Other images from Microsoft.