eBay’s ‘Extreme’ OLAP Engine for Distributed Data Gets Apache Backing
A distributed analytics engine first developed by eBay, called Kylin, has been accepted by the Apache Software Foundation as a full top-level project. The software could help organizations use standard business analysis tools to query big data repositories.
“At eBay, we collect every user behavior on every eBay screen. While other OLAP [Online Analytical Processing] engines struggle with the data volume, Kylin enables query responses in the milliseconds,” explained Wilson Pang, senior director of data services and solutions at eBay. “We are also starting to leverage Kylin for near-real-time data streaming storage and analytics engine. Altogether, Kylin serves as a critical backend component for eBay’s product analytics platform.”
The e-commerce site built the technology internally for processing very large OLAP cubes via ANSI SQL. It reported its largest use case was the analysis of more than 12 billion source records generating 14+ TB cubes. Its 90 percent query latency was less than 5 seconds.
In making it open source, eBay explained in a blog post that the technology to accelerate analytics on Hadoop and allow the use of SQL-compatible tools was not new, but it sought to speed things up while still allowing business users to use familiar tools. It integrates with Tableau, and integration for Microstrategy and Excel are in the works.
The company submitted the code as an Apache Incubator project in October 2014, marking a relatively rapid rise to top-level project status. As a top-level project, Kylin now gets the organizational, legal, and financial support of the Apache Software Foundation, which should speed its development for a broader array of potential users.
Wen Li, senior researcher of engineering technology at Chinese group buying website Meituan.com, lauded features including sub-second query latency on a billion records with high scalability and seamless integration with BI products.
Kylin works by reading data from Hive, running MapReduce for pre-calculations, storing cube data in HBase and using Zookeeper to coordinate jobs.
“Many technologies over the past 30 years have used the same theory to accelerate analytics. These technologies include methods to store pre-calculated results to serve analysis queries, generate each level’s cuboids with all possible combinations of dimensions, and calculate all metrics at different levels,” eBay explained when donating the code.
“When data becomes bigger, the pre-calculation processing becomes impossible – even with powerful hardware. However, with the benefit of Hadoop’s distributed computing power, calculation jobs can leverage hundreds of thousands of nodes. This allows Kylin to perform these calculations in parallel and merge the final result, thereby significantly reducing the processing time.”
It makes use of an open-source the dynamic data management framework Apache Calcite to parse SQL and plugin code. And it supports multiple access methods, including JDBC (Java Database Connectivity), ODBC (Open Database Connectivity), and a REST API for programmatic access.
The platform’s core components are:
- Metadata Manager: As a metadata-driven application, all other components rely on it.
- Job Engine: It handles all offline jobs including shell script, Java API, and MapReduce jobs. It coordinates all jobs and ensures each job executes and handles failures.
- Storage Engine: This engine manages the underlying storage – specifically the cuboids, which are stored as key-value pairs. It uses HBase, but also can be extended to support other K-V systems, such as Redis.
- REST Server: The REST Server is an entry point for applications to develop against Kylin. Applications can submit queries, get results, trigger cube build jobs, get metadata, get user privileges, etc.
- ODBC Driver: The team built and open-sourced a driver to support third-party tools and applications to make it easy for users to get on board.
- Query Engine: Once the cube is ready, the Query Engine receives and parses user queries. It then interacts with other components to return the results to the user.
It touts features such as an easy web interface to manage, build, monitor and query cubes; security capability to set access controls at the cube/project level; and support for LDAP (Lightweight Directory Access Protocol) integration.