New Anaconda Package Embeds Python on Cloudera Hadoop
Austin, Texas-based Continuum Analytics, the company behind the data analysis-focused Anaconda Python distribution, has further expanded into the world of big data.
The company has released the Anaconda for Cloudera package, which lets users build and run Python-based applications across a Cloudera cluster and alongside Spark jobs.
Cloudera users previously had to manually install a complete Python data science stack on a Hadoop cluster and manage runtime dependencies themselves. With the Anaconda parcel installed through Cloudera Manager, users no longer have to install Python node by node.
Spark users also stand to benefit from the package. About half of them use Python, according to the company.
“Spark has clearly demonstrated that Python is one of the most important technologies in modern open data science,” wrote Peter Wang, Continuum chief technology officer and co-founder, in the announcement.
“We’re excited about the low-level technology advancements in Hadoop, such as Parquet [columnar data store], as well as the pioneering advancements by Cloudera on Impala and Kudu. These advancements have set the foundation for our next-generation Hadoop innovations, which extend Python from an interface for data science on Hadoop to a full-fledged native analytic computational platform for Hadoop.”
Python was ranked fifth on the Tiobe Index of popular programming languages for February.
“People will say, ‘I can’t get the performance I need out of Python’ even though it’s very easy and very approachable,” said Michele Chambers, CMO and VP Product of Continuum Analytics. Python offers nearly the same performance as C and C++, she argued. Anaconda can also scale up to run on multicore processors and GPUs, and scale out to clusters.
“In the enterprise, data science is really a team sport. You have data scientists, business analysts, statisticians, data engineers, data ops and developers that all have to work together in concert. They have to be carefully orchestrated and work together very fluidly. Yet the tools they have available to them today really don’t allow them to do that,” Chambers said.
“You want to bring them all together to fully use all their data, all the open source analytics tools out there and modern computational infrastructure – Hadoop, Spark, GPUs – and you want to do that in a connected ecosystem,” she said. “The truth is that most enterprises have their data in these legacy environments, so you want to bridge the world for them. Open data science is not about prescribing to them just one approach, but connecting them with their data so they can get insights quickly, very high-impact, high-value insights.”
One advantage to Anaconda, she says, is that it can bring in legacy code — C, C++, Java, Fortran or whatever — and give it new life in modern applications.
Earlier this month, Continuum released Anaconda 2.5, which includes the Intel Math Kernel Library (Intel MKL). An R-Essentials package bundled with Microsoft R Open will be available Feb. 25. The combination of these two packages, the company says, provides as much as a 7x performance boost for math-intensive analytics.
Continuum has also added Anaconda Enterprise Notebooks, which provides the benefits of Jupyter (formerly known as IPython) Notebooks in a governed environment. Notebooks encapsulate code, comments and visualization all in one place. Enterprise features include collaborative locking and version control, among others. The platform includes AnacondaXL, which integrates with Microsoft Excel and gives users access to popular Python packages such as scikit-learn and Pandas for machine learning, predictive analytics and data transformations.
The Open Source Policy Center, for instance, is rewriting proprietary models to create a tax model tool called TaxBrain in Python that will allow non-technical people such as journalists and policy advocates to more easily check, for example, a politician’s claim about the effects of a tax policy.
The center has economists who understand the models and data scientists who are developing them. Notebooks give them a way to collaborate on the models and share results, Chambers explained.
And with the Bokeh (pronounced “bouquet”) interactive visualization library, users have a wider choice of visualizations than those offered through Excel.
“Let’s say you’re an oil and gas production engineer,” Chambers said. “You use Excel all the time. You’re doing some pretty powerful analytics. You might be looking at production quality control or production output, trying to figure out how to optimize [that], yet you have a very limited set of visualizations” in Excel.
“So what you typically do is do a bunch of analysis in Excel, then dump the data into some type of graphing tools that will allow you to create visualizations that that industry uses. With four lines of code, Bokeh allows you to make these beautiful visualizations that allow you to create the axes the way you want, you can make all kinds of charts. They’re rich because they have a lot of data, you can see the details of the data along with the visualization,” Chambers said.
And they’re contextually relevant because they’re familiar to those in that particular industry, she said.
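To give a sense of what "four lines of code" can look like, here is a minimal sketch of a Bokeh line chart with labeled axes. It is not taken from Continuum or from Chambers; the production figures are made up for illustration, and the variable names are hypothetical. It assumes the `bokeh` package is installed, as it is in a standard Anaconda distribution.

```python
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import CDN

# Hypothetical monthly production output; illustrative numbers only.
months = [1, 2, 3, 4, 5, 6]
barrels = [120, 135, 128, 150, 162, 158]

# The plot itself: a figure with custom axis labels, a line, and point markers.
plot = figure(title="Monthly production output (made-up data)",
              x_axis_label="Month", y_axis_label="Barrels (thousands)")
plot.line(months, barrels, line_width=2)
plot.scatter(months, barrels, size=8)

# Render the chart to a standalone HTML string that can be saved or shared.
html = file_html(plot, CDN, "Production output")
```

Writing `html` to a file and opening it in a browser shows an interactive chart with pan and zoom tools, which is the kind of detail-on-demand view the quote describes.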
Founded in 2012, Continuum Analytics has raised $28 million in three rounds. It reported 60 percent year over year customer growth and an 87 percent increase in revenue. It had reached a quarter-million downloads per month in 2015.
The company announced Anaconda for the Enterprise last September at Strata+Hadoop World. Chambers says the company has about 30 enterprise customers. In November, it added that the platform will run on AMD’s Accelerated Processing Units.
In another interesting use case, the Defense Advanced Research Projects Agency’s (DARPA) Memex program uses Continuum Analytics’ technology to peer into the “deep web” in an effort to nab human traffickers.