Google’s Cloud Bigtable and the Data Services Ecosystem
Google Cloud Bigtable is a NoSQL database service announced this week and now in beta. It interfaces with clients via the open source Apache HBase API, which is itself modeled on the original Bigtable technology. Bigtable is the database that runs Google Analytics and other Google services.
It’s a move that signals Google’s growing emphasis on enterprise customers and on the broader demand for ways to manage data that now comes mostly from machines, in the form of logs, click streams and other unstructured data. Bigtable also fits a Google pattern of releasing technology that has been in use internally for years.
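For readers unfamiliar with the data model that Bigtable and HBase share, the following is a toy, in-memory Python sketch: a sorted map from row key to column ("family:qualifier") to timestamped values. The class name ToyWideColumnTable and its methods are hypothetical and exist only for illustration; this is not the HBase client API or the Cloud Bigtable API, and it ignores distribution, persistence and everything else that makes the real systems hard to build.

```python
from bisect import bisect_left, insort
from collections import defaultdict
import time


class ToyWideColumnTable:
    """Toy model: row key -> "family:qualifier" -> {timestamp: value}."""

    def __init__(self):
        self._rows = {}          # row key -> {column -> {timestamp -> value}}
        self._sorted_keys = []   # row keys kept in lexicographic order

    def put(self, row_key, column, value, timestamp=None):
        """Write one cell version; a new timestamp adds a version, it does not overwrite."""
        if row_key not in self._rows:
            self._rows[row_key] = defaultdict(dict)
            insort(self._sorted_keys, row_key)
        ts = timestamp if timestamp is not None else time.time_ns()
        self._rows[row_key][column][ts] = value

    def get(self, row_key, column):
        """Return the most recent value of a cell, or None if the cell is absent."""
        versions = self._rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_key, stop_key):
        """Yield (row_key, latest cells) for start_key <= row_key < stop_key."""
        lo = bisect_left(self._sorted_keys, start_key)
        hi = bisect_left(self._sorted_keys, stop_key)
        for key in self._sorted_keys[lo:hi]:
            cells = self._rows[key]
            yield key, {col: vers[max(vers)] for col, vers in cells.items()}


if __name__ == "__main__":
    table = ToyWideColumnTable()
    table.put("com.example/index.html", "stats:clicks", 10, timestamp=1)
    table.put("com.example/index.html", "stats:clicks", 12, timestamp=2)
    table.put("com.example/about.html", "stats:clicks", 3, timestamp=1)
    table.put("org.example/home", "stats:clicks", 7, timestamp=1)

    # Latest version of a single cell.
    print(table.get("com.example/index.html", "stats:clicks"))
    # Range scan over every row whose key starts with "com.example/".
    print(list(table.scan("com.example/", "com.example0")))
```

Because rows are kept sorted by key, related rows can be read with a single range scan, which is why row-key design matters so much in systems of this family.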
In 2004, Google released a research paper on MapReduce, which it had used internally for many years. That paper served as the foundation for Apache Hadoop, the distributed data processing technology that has in turn served as the foundation for a generation of providers such as MapR, Cloudera and Hortonworks.
Apache Drill is an open source, low-latency SQL query engine for Hadoop and NoSQL. It is based on Google Dremel, which uses a columnar storage representation for nested data and combines it with SQL-like functionality.
Google Borg is the most recent technology that the company has released as a research paper. Borg is the basis for Google Kubernetes, Google’s open-source project for managing containerized applications. Borg has also served as the basis for Apache Mesos, the data center operating system used by Twitter, Airbnb and others. Mesos is the core offering behind the Mesosphere data center operating system.
Like Google Kubernetes, Cloud Bigtable is designed for the enterprise and for the growing demand for platforms that allow companies to scale.
“We’ve launched a managed database service for highly scalable enterprise customers,” said Tom Kershaw, the director of product management for Google’s Cloud Platform. “It is for jobs of one petabyte or larger that require large amounts of processing and analytics around read/write operations. Think of it as a fully managed, high-scale database for customers that have large data sets.”
Each Cloud Bigtable cluster has a minimum of three nodes. Each node offers as many as 10,000 queries per second and 10 megabytes per second (MB/s) of throughput, and costs $0.65 per hour. SSD storage is $0.17 per GB per month, and HDD storage (which is not yet available) will be $0.026 per GB per month.
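The pricing lends itself to a back-of-the-envelope estimate. The Python sketch below uses only the node and SSD prices quoted in this article to compute a rough monthly bill and peak query capacity. The helper name estimate_cluster, the 730-hour month and the example storage size are assumptions made for illustration, not part of Google’s pricing documentation.

```python
# Figures quoted in the article (beta pricing at launch).
NODE_PRICE_PER_HOUR = 0.65      # USD per node per hour
SSD_PRICE_PER_GB_MONTH = 0.17   # USD per GB per month
MIN_NODES = 3                   # smallest allowed cluster
MAX_QPS_PER_NODE = 10_000       # "as many as" 10,000 queries per second

HOURS_PER_MONTH = 730           # assumption: average month length in hours


def estimate_cluster(nodes: int, ssd_gb: float) -> dict:
    """Return a rough monthly cost and peak QPS for a cluster (hypothetical helper)."""
    if nodes < MIN_NODES:
        raise ValueError(f"a cluster needs at least {MIN_NODES} nodes")
    node_cost = nodes * NODE_PRICE_PER_HOUR * HOURS_PER_MONTH
    storage_cost = ssd_gb * SSD_PRICE_PER_GB_MONTH
    return {
        "node_cost": round(node_cost, 2),
        "storage_cost": round(storage_cost, 2),
        "total_cost": round(node_cost + storage_cost, 2),
        "peak_qps": nodes * MAX_QPS_PER_NODE,
    }


if __name__ == "__main__":
    # Example: the minimum three-node cluster with 1,000 GB of SSD storage.
    print(estimate_cluster(nodes=3, ssd_gb=1_000))
```

Under these assumptions, the minimum three-node cluster with 1,000 GB of SSD storage comes to about $1,423.50 a month for nodes plus $170 for storage, with a ceiling of 30,000 queries per second.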
The sheer volume of machine data that companies must manage is unprecedented. Much of this emerging data – log files, point of sale terminal data, trading data and the like – is unstructured and therefore can’t be processed in traditional relational databases, even if volume is not an issue.
Other NoSQL providers, such as Couchbase, are also addressing the issues that come with managing unstructured data. They cite research from IDC, which estimates that in 2013 the combined size of the world’s digital data was 4.4 zettabytes (4.4 trillion gigabytes) and that by 2020 it will grow tenfold to 44 zettabytes.
Companies don’t want to invest in platforms that are only sporadically pushed to their limits, and they want to add capacity in a flexible manner. Cloud Bigtable opens new use cases that were previously difficult to realize both for performance and cost reasons, wrote Holger Mueller, principal analyst and vice president of Constellation Research, in response to emailed questions.
Mueller’s comment points to a trend: more cloud services are being used to build scaled-out database services. Amazon Web Services launched Aurora last November and also offers Amazon EMR, a service designed to simplify big data processing by providing a managed Hadoop framework.
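To put the IDC figures in perspective, the short Python sketch below checks the unit conversion and derives the yearly growth rate the estimate implies. It assumes decimal units (1 zettabyte = 10^21 bytes, 1 gigabyte = 10^9 bytes) and steady compound growth between 2013 and 2020. The compounding assumption and the function names are ours, not IDC’s.

```python
GB_PER_ZB = 10**12  # 10^21 bytes per ZB divided by 10^9 bytes per GB

SIZE_2013_ZB = 4.4   # IDC estimate cited in the article
SIZE_2020_ZB = 44.0  # IDC projection cited in the article


def zb_to_gb(zettabytes: float) -> float:
    """Convert zettabytes to gigabytes using decimal (SI) units."""
    return zettabytes * GB_PER_ZB


def annual_growth_rate(start: float, end: float, years: int) -> float:
    """Compound annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years) - 1


years = 2020 - 2013
growth_factor = SIZE_2020_ZB / SIZE_2013_ZB
cagr = annual_growth_rate(SIZE_2013_ZB, SIZE_2020_ZB, years)

if __name__ == "__main__":
    print(f"2013: {zb_to_gb(SIZE_2013_ZB):,.0f} GB")
    print(f"Growth factor over {years} years: {growth_factor:.1f}x")
    print(f"Implied annual growth: {cagr:.1%}")
```

If growth were steady, a tenfold increase over seven years would mean the world’s data expanding by roughly 39 percent every year.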
CCRi is working with Google Bigtable to facilitate storage and querying of “spatiotemporal” data using GeoMesa, according to Director of Operations James Conklin. He said CCRi developed GeoMesa to migrate one of its threat prediction analytics solutions to the Google platform. GeoMesa was open-sourced, he said, when it became clear that it would be of value to others.
Conklin is a fan of the new platform. “Technologically, Cloud Bigtable [has] phenomenal capability,” he wrote in response to emailed questions. “The original paper about Bigtable inspired several, now widely used, database products such as HBase, Cassandra and Accumulo. While powerful, these systems are complex. The Cloud Bigtable system manages many of the low-level details involved with optimizing and tuning distributed databases, which in turn simplifies the code we use to store and access the data.”
Conklin said a benefit of Cloud Bigtable is that it eliminates the time and effort of creating, managing and optimizing such a sophisticated platform for each use case. “Perhaps the most significant technical and business case for Cloud Bigtable … is that as your data grows, you can easily scale up the system. CCRi has gone through the pain of building and maintaining clouds, so I can attest that these benefits are not only real, but valuable.”
The data technologies available today are built by some of the world’s most brilliant engineers. They are a limited talent pool, generally not available to the tens of thousands of companies that will make the transition to being data-driven businesses. Without that talent, most companies will look to data services providers. There’s a dark side to that approach. It means that they are now subject to the terms of service of those companies, their pricing models and all the changes that come with services as they adapt to customer and market demands.
Open source is critical to the data services ecosystem, but for the most part customers will be dependent on the commercial services built on open source platforms. To build the services themselves just doesn’t make sense from an investment perspective. The services do the job just fine.