StreamSets Smooth the Flow of Big Data

20 Jan 2016 6:46am

Big data presents a number of pain points with today’s technology. As data sources mutate over time, the result can be not only broken code but service interruptions and other troublesome issues, especially when working with less-than-ideal tools or third-party systems. StreamSets is a company that aims to solve what it calls ‘data drift’ by letting developers diagnose these breaks, so they can better respond to and repair outages as they happen.

Collecting Data at the Source

StreamSets’ first line of defense for developers and data engineers is the StreamSets Data Collector. The tool overlays a visual UI on existing infrastructure, which developers can then use to connect data sources to destinations. This allows for a more responsive, agile set of transformations that sanitize data while it is in motion.

“It is resistant to data drift because it doesn’t rely on schema, and uses a standard record format that provides complete visibility into the data flow,” said StreamSets Co-Founder and CEO Girish Pancha.
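The schema-free approach Pancha describes can be illustrated with a small sketch. This is not StreamSets code, and the field names are hypothetical; the point is that when records are treated as generic field maps rather than instances of a fixed schema, an upstream change such as an added field flows through rather than breaking the pipeline:

```python
# A minimal sketch of schema-less record handling (illustrative, not StreamSets code).
# Each record is a plain dict of fields, so upstream drift -- new or missing
# fields -- passes through instead of raising errors.

def sanitize(record):
    """Normalize a record without assuming a fixed schema."""
    clean = {}
    for field, value in record.items():
        # Trim string fields; leave every other type untouched.
        clean[field] = value.strip() if isinstance(value, str) else value
    return clean

batch = [
    {"user": " alice ", "clicks": 3},
    {"user": "bob", "clicks": 5, "region": "us-east"},  # drifted: extra field
]
cleaned = [sanitize(r) for r in batch]
```

Because `sanitize` iterates over whatever fields arrive, the second record’s unexpected `region` field survives intact instead of triggering a schema-mismatch failure.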

StreamSets Data Pipeline

Reducing hand-coding is crucial to improving quality of life for data engineers, leaving less time spent on active maintenance of custom code. By taking custom coding out of the process, StreamSets simplifies life for those working heavily with large-scale data processing tools such as Kafka and Flume. It has quickly made a name for itself in the big data space, particularly at Cisco.

“Cisco uses StreamSets as part of their InterCloud offering. They value our ability to automatically handle infrastructure changes, as well as the ability to provide intelligent monitoring and dynamic shaping of their internal operational logs and multi-datacenter data ingestion logs,” said Pancha.

Data is useless when it is inaccurate. StreamSets lets users introspect incoming streaming data, giving them an opportunity to test for anomalous conditions. If data has begun to drift or arrives malformed, StreamSets provides an early warning.
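The early-warning idea can be sketched as a rolling statistical check over incoming values. This is purely illustrative of the technique; StreamSets’ own detection is built into the product:

```python
from collections import deque

def drift_monitor(stream, window=100, threshold=3.0):
    """Yield (value, is_anomaly) pairs, flagging values far from the rolling mean."""
    recent = deque(maxlen=window)
    for value in stream:
        if len(recent) >= 10:  # require some history before judging
            mean = sum(recent) / len(recent)
            var = sum((v - mean) ** 2 for v in recent) / len(recent)
            std = var ** 0.5
            anomaly = std > 0 and abs(value - mean) > threshold * std
        else:
            anomaly = False
        yield value, anomaly
        recent.append(value)

# Steady traffic around 10, then a sudden spike -- the spike gets flagged early.
results = list(drift_monitor([9, 11] * 25 + [1000]))
```

A real deployment would check more than numeric drift (missing fields, type changes, cardinality shifts), but the shape is the same: compare each arriving record against a recent baseline and alert as soon as it deviates.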

StreamSets runs atop a user’s existing Hadoop cluster, working with both YARN and Mesos to ensure both enterprise-level scheduling and scalability. It can also be deployed where data is being produced to optimize bandwidth usage and data movement, functioning in memory so as to minimize its impact on system performance.

In its most recent release, version 1.1.3, StreamSets announced that users can now install, manage, and deploy their StreamSets data parcels and services in Cloudera.

Getting Into the Finer Details

StreamSets can be deployed on edge nodes running in standalone mode, or in a cluster mode that supports both streaming and batch. It also implements a standard record format that is highly optimized for detecting anomalous conditions as well as for transformations. StreamSets has implemented a stateless front end built entirely on a REST API, which allows it to integrate seamlessly with a variety of other cloud-based offerings, such as container monitoring services.
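Because the front end is stateless and everything goes through the REST API, external tools can query pipeline state the same way the UI does. The sketch below shows the idea; the port is Data Collector’s default, but the endpoint path and JSON shape are assumptions for illustration, not taken from StreamSets documentation:

```python
import json
from urllib.parse import urljoin

SDC_URL = "http://localhost:18630"  # default Data Collector port

def pipeline_status_url(pipeline_id):
    # Hypothetical endpoint path -- consult the REST API docs for real routes.
    return urljoin(SDC_URL, "/rest/v1/pipeline/{}/status".format(pipeline_id))

# A canned response in the rough shape such an endpoint might return,
# parsed the way a monitoring integration would parse the live payload.
sample_response = json.loads('{"pipelineId": "logs-to-hdfs", "status": "RUNNING"}')
```

Any tool that can issue HTTP requests and parse JSON can drive or monitor the system this way, which is what makes integrations with container monitoring services straightforward.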

StreamSets Anomaly Monitoring

Deploying StreamSets is as simple as using a drag-and-drop UI to build complex pipelines. Pancha notes that developers can also create complex logic to suit their specific requirements using the Java Expression Language and a catalog of data manipulation functions available throughout the system.

The StreamSets system also supports a variety of common scripting languages, such as Python and JavaScript. These can be plugged into stages within a user’s StreamSets pipeline for free-form manipulation of data. For those who want to dig deeper for a specific use case, a public API allows developers to create custom domain-specific stages that can then be made available throughout a pipeline.
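A scripting stage receives a batch of records and writes them onward. The sketch below mimics the general shape of such a script; the `records` and `output` objects are stand-ins written here so the example runs on its own, since in a real pipeline Data Collector supplies those bindings:

```python
# Stand-in classes imitating the bindings a scripting stage would provide.
# These are assumptions for illustration, not the StreamSets API itself.
class _Record:
    def __init__(self, value):
        self.value = value  # the record's field map, as a script would see it

class _Output:
    def __init__(self):
        self.written = []
    def write(self, record):
        self.written.append(record)

records = [_Record({"path": "/index.html", "bytes": "2048"})]
output = _Output()

# The script body: free-form manipulation of each record in the batch.
for record in records:
    record.value["bytes"] = int(record.value["bytes"])  # cast string field to int
    output.write(record)
```

The script body is the part a user would actually write inside a pipeline stage; everything above it exists only to make the sketch self-contained.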

As more enterprises continue to rely on big data, providing users with the tools to collect, analyze, and monitor this information is crucial to long-term success.

Cisco is a sponsor of The New Stack.

Feature image via Pixabay, under a CC0 license.
