How Goldsky Democratizes Streaming Data for Web3 Developers
Goldsky is a real-time data platform for the Web3 ecosystem. Our platform provides blockchain indexing, subgraphs and data-streaming pipelines that help developers build powerful dApps (decentralized applications). We support more than 10 of the most popular blockchains as sources, as well as ad-hoc user-defined data sources, and we’re used by some of the biggest protocols and networks to power analytics and APIs.
To simplify things for those not in the Web3 world: Our customers fundamentally want to get specific data out of different blockchains, which is not easy to do. Typically, you have to go through various remote procedure call (RPC) providers and APIs, which is a hard, painful and slow process, essentially analogous to old-school API crawling. Goldsky ingests data directly from blockchains, with embedded tools our customers use to build real-time data pipelines that power their apps. We push data directly into our customer’s datastores, allowing them to own the data and configure it for their own use case.
As an example, one of our customers is a popular betting website that tracks real-time betting trends. The bets and auctions it tracks are happening on the blockchain layer to ensure transparency and fairness. We ingest the data from the blockchain, decode it to a form that’s useful, transform it in real time and sync it directly to the customer’s database. This is then used for its API layer to power things like betting order books and statuses.
Let’s take a closer look at the streaming data architecture that powers Goldsky’s platform.
Goldsky’s Streaming-First Architecture
There has been a paradigm shift in the industry recently where people are realizing that you can use data platform technology that was previously used by internal analytics teams to power customer-facing features. The sort of data pipelines that previously only supported reporting and dashboards are now supporting web application functionality.
The trouble is, most app developers are not deeply familiar with the underlying technology stacks for data pipelines, which typically include complex technologies such as Apache Kafka and Apache Flink. And this is where platforms like Goldsky come in, to help democratize developer access to streaming data by providing simple API-driven tools that bring real-time data into applications. These data pipelines as a service allow app developers who don’t know the first thing about managing Apache Kafka to have streaming data at their fingertips.
Streaming data is, of course, fundamental to our stack. In fact, we have a “streaming first” architecture powered by Redpanda (a simpler, faster and more durable Kafka-compatible platform) as our primary streaming-data solution, with Apache Flink used for stream processing. We use Redpanda’s managed service, Redpanda Cloud, to ensure minimal overhead.
Let’s walk through the platform architecture, from sources to sinks.
As mentioned, our sources are blockchains, and we ingest data from them in a few different ways. One way is called direct indexing, which is based on a popular Ethereum ETL (extract, transform, load) project. It’s able to connect directly to blockchain nodes and extract raw, low-level data structures like logs and transactions. Second, we support another popular open source technology called Subgraphs, which makes it easy to write simple TypeScript applications to capture and process smart contract event telemetry.
Raw blockchain data is hard to interpret: You have to deserialize the payloads, and because the logs have data from all smart contracts, you have to write business logic to filter out what is specific to your actual contracts. So part of the value we add is processing the data into a format that is directly usable by our customers.
Internally, all data sets in the end are available as topics backed by Avro schemas.
The middle of our stack is where things get interesting. We use Apache Flink SQL for our transformation layer. Our customers can write simple SQL transformations to perform filtering or projections; they also can implement complex joins or aggregations.
And the magic is that the customer doesn’t need to know what’s under the hood at all, they just use our APIs or our GUI!
In Goldsky’s architecture, Redpanda is not just a broker, it’s where we actually store the blockchain data. It’s our source of truth — the primary database and data lake holding tens of terabytes of data. This is enabled through Redpanda’s S3-compatible tiered storage, which provides near-infinite data retention without incurring a ton of cloud infrastructure costs. We like this approach better than the traditional alternative, which is to set retention limits in the broker and then sink the data to S3 to be read later and combined with real-time topics. Redpanda tiered storage simplifies things and eliminates the usual need for additional resources to maintain archival systems. This is an easy win for us: If our streaming-data solution can also be our storage solution, why not?
We have built-in integrations to support different databases and data stores, wherever customers want to sink their data. The customer can self-service set their sink of choice, like PostgreSQL or S3 or Elasticsearch. We see PostgreSQL frequently, and it’s an extremely good sink because it can easily support both transactional and some analytical use cases. It’s easy to integrate with an existing application. In addition, modern managed PostgreSQL providers make it easy to scale databases for large data sets.
As discussed, Redpanda is foundational for our data infrastructure. Here’s how we landed on it.
When our engineering team was first building Goldsky, they knew they needed a modern streaming-data platform that had minimal complexity and was cost effective, but that could still handle high throughput reliably at scale. Some team members had used Confluent Kafka previously, but the pricing model wasn’t going to work for us because you end up paying for hardware you don’t need.
Redpanda stood out as an alternative to Apache Kafka or Confluent because it achieves the same throughput with less hardware and has a much simpler architecture: It’s a single binary with no JVM dependency, has a low memory footprint, and has schema registry and proxy built in. Plus, the hardware efficiency and Redpanda’s tiered storage capability provides cloud infrastructure savings of three to four times that from Kafka. What’s really made Redpanda an essential part of our stack is its durability. Using Redpanda as our source of truth is only possible because of the data safety it provides through its Jepsen-tested and Raft native architecture, and its cost-efficient tiered storage.
We’re also optimistic about Redpanda’s roadmap, which is very focused on developers, including Data Transforms powered by WebAssembly (Wasm) and Apache Iceberg integration. These innovations will complement our architecture well. For example, in the future, we will be able to simply point a Spark or Flink job to Redpanda tiered storage with Iceberg. And with Wasm data transformations, we will be able to filter relevant blockchain data more quickly by processing on nodes closer to the storage layer.
Goldsky and the Future of Web3
When we started working on Goldsky, we thought that data streaming and stream processing could be useful building blocks. We didn’t realize how much we’ll actually rely on them. Data-streaming concepts work really well for blockchain data. We’re able to solve challenging problems, such as blockchain reorgs, by modeling them as well-known stream-processing problems like retractions. Building on this further lets us support even more advanced use cases, like enriching on-chain data with off-chain data, reliably calculating Top-N aggregations and combining data from multiple blockchains together.