Confluent Brings SQL Querying to Kafka Streaming Data
With ever-increasing volumes of data comes an ever-increasing need to process that data. Confluent has made a business out of helping enterprises handle never-ending streams of data with its commercial packaging of Apache Kafka. And now, at Kafka Summit in San Francisco this week, Confluent introduced a new open source project, called KSQL, that it says will allow users to apply SQL queries against streaming data.
In this move, Confluent joins a growing number of companies, such as SQLStream, attempting to bring the rigors of SQL to the world of real-time data analysis.
Neha Narkhede, CTO and co-founder of Confluent, said that KSQL offers a number of potential use cases to enterprises, from processing data as it comes into an organization to handling extract, transform and load (ETL)-like work on data warehouses and data transfers between systems.
Said Narkhede, “KSQL is a completely interactive distributed SQL engine for Apache Kafka. It lets you do all sorts of continuous stream processing and transformations against infinite streams that flow through Kafka.”
Traditionally, processing stream data through Kafka required a developer to write Java or Python code, said Narkhede. KSQL brings data developers and SQL experts into the stream processing fold. KSQL will become an independent product offering down the line, said Narkhede, and is already available on GitHub. The project is currently in developer preview and should be generally available in a few months.
Narkhede said KSQL is similar to, but not entirely compliant with, ANSI SQL. It is a modified version of SQL customized specifically for querying streams of data. In a database, SQL is used to query past transactions.
“This is turning the database inside out in the sense that instead of querying the past, you start querying the future,” Narkhede said. “At this moment the grammar is pretty good, but there are more features we plan to add, such as to insert statements into Kafka topics down the line.”
While KSQL brings SQL to Confluent’s product line for the first time, it is not the first such SQL-on-streams system out there. Companies like Striim, Kinetica, and SQLStream, for example, have offered similar functionality for almost a decade. SQLStream, in fact, already offers SQL on Kafka.
So what makes KSQL different? Narkhede says it’s the distributed processing model. “KSQL builds on top of the Kafka partitioning model. You can easily distribute queries on the cluster so you can actually get away with normal sized boxes, coupling them together like Kafka. It’s integrated very closely with the fundamental building blocks of Kafka, and has the ability to run several queries in parallel. Kafka takes care of the load balancing if one machine goes down, and how queries shift over time.”
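The partitioning idea Narkhede describes can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not Kafka's or KSQL's actual implementation: the CRC32 hash is a stand-in for Kafka's own partitioner, the round-robin assignment is a simplification of how partitions are rebalanced, and the names `partition_for`, `assign_partitions`, and the `box-*` workers are hypothetical.

```python
import zlib
from typing import Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition (CRC32 is a stand-in hash, not Kafka's)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions: int, workers: List[str]) -> Dict[str, List[int]]:
    """Spread partitions round-robin across the workers that are currently alive."""
    assignment: Dict[str, List[int]] = {w: [] for w in workers}
    for p in range(num_partitions):
        assignment[workers[p % len(workers)]].append(p)
    return assignment


# Three hypothetical machines share six partitions of a topic.
before = assign_partitions(6, ["box-a", "box-b", "box-c"])

# If box-b goes down, its partitions are redistributed to the survivors.
after = assign_partitions(6, ["box-a", "box-c"])
```

Because every record with the same key lands in the same partition, each machine can process its share of a query independently, and losing a machine only means reassigning its partitions.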
As an example, here is a KSQL query for ETL work:
CREATE STREAM vip_actions AS
SELECT userid, page, action
FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
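To make the semantics of that query concrete, here is a minimal Python sketch of a continuous stream-table join. The field names follow the query, but the in-memory `users` dictionary, the sample click events, and the function name `vip_actions` are hypothetical and not part of KSQL. Unlike a database query over stored rows, the generator emits a result for each event as it arrives.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical lookup table standing in for the `users` table, keyed by user_id.
users: Dict[str, Dict[str, str]] = {
    "u1": {"user_id": "u1", "level": "Platinum"},
    "u2": {"user_id": "u2", "level": "Gold"},
}


def vip_actions(clickstream: Iterable[Dict[str, str]],
                users: Dict[str, Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield userid, page and action for clicks made by Platinum users."""
    for click in clickstream:
        user = users.get(click["userid"])  # join lookup: None if no matching user
        # The WHERE clause drops unmatched clicks and non-Platinum users.
        if user is not None and user["level"] == "Platinum":
            yield {
                "userid": click["userid"],
                "page": click["page"],
                "action": click["action"],
            }


# Hypothetical sample events; in practice the stream never ends.
clicks = [
    {"userid": "u1", "page": "/home", "action": "view"},
    {"userid": "u2", "page": "/cart", "action": "click"},
    {"userid": "u3", "page": "/home", "action": "view"},
    {"userid": "u1", "page": "/checkout", "action": "purchase"},
]

result = list(vip_actions(clicks, users))
```

Here `result` holds only the two events from user `u1`, the one Platinum user in the sample table.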
Damian Black, CEO and founder of SQLStream, said that Narkhede and her team came to visit a year ago, and were obviously taking notes. Currently, his company’s biggest source of users is Amazon Kinesis, which was built by combining Amazon’s Kafka-like streaming system with SQLStream’s SQL processing system. He said the reason SQLStream is popular with Amazon is its speed.
“The reality is we are so much faster, that you need a fraction of the number of servers. One of our customers had a job take 180 servers [running MapR] three hours. It takes 12 of our servers running at 40 percent to process the data in real time,” said Black.
Black said that one of the biggest issues with building SQL processing inside Kafka is that Kafka is written in Java. SQLStream actually runs inside the JVM, but is built in C++ and highly optimized to the point where it generates no garbage in the JVM. That means SQLStream runs its queries at true real-time speeds.
Black said it’s too early to comment on KSQL’s capabilities, but mentioned that another stream SQL processing engine, that of Apache Spark, is batch-based and cannot handle queries in real time.
SQLStream also runs on Kafka, so Black said his team is familiar with Confluent’s work there. “Confluent is a great messaging system. That’s why we use it. It’s free and it performs well considering it’s written in Java. We also work with Amazon Kinesis and AMQP. We provide a much richer experience. It’s not SQL-like, it’s true standards-based SQL. It’s written in C++ and it’s lock-free. A 32-bit integer is a 32-bit integer,” said Black.
SQLStream has been offering similar capabilities since it launched eight years ago. Black said that he feels like a pioneer of the market and that many enterprises are still just waking up to the potential of stream processing with SQL.
One aspect of the decision process for most enterprises, however, is the cloud in which such a system will be based. Narkhede said that she sees a lot of growth currently coming from Google’s Cloud. That’s being driven by BigQuery adoption, and that, she said, is a reason the team is working to integrate Kafka with BigQuery.
That could mean big changes to the way data is processed, as teams forgo an on-site, IT-managed approach in favor of simply using Google’s own on-demand services. In that sense, KSQL and SQLStream may only be the preferred solution until Google wins over the market. That may sound bad for Confluent, but Narkhede said she’s seeing a large internal push at Google to bring Kafka into the system to work with BigQuery.
“BigQuery is making people go to Google,” said Narkhede. “[Google] wants Kafka to work with BigQuery. There are a lot of people out there asking them for Kafka and BigQuery is the draw. It’s one of the most amazing data systems out there in the world.”