Enterprises building data lakes and warehouses have been turning to third-party vendors to bring in data. Building connectors to applications likes Salesforce is tedious and requires data engineering expertise. Plus, the SaaS platforms change up their data schemas and API details.
These data pipeline vendors extract data from the source systems and load the data into the warehouse. They will also clean up or sync the data between different systems, either before it gets to the warehouse, or after.
Today, more and more companies are opting for the latter.
With the first approach, transforming the data before you get to the warehouse, you lose traceability, said Goutham Belliappa, vice president of artificial intelligence engineering at Capgemini North America.
“When people want to understand the validity of a metric that was transformed in flight, there is no way to audit it and identify the progress of the data,” he said. But when the transformation happens afterwards, there is more control. “The moment can be recreated or alternative algorithms and formulas can be simulated with ease,” he said. “Further, business analysts can validate the data and the algorithm freely and have the ability to make tweaks or modifications.”
If the data transformation step happens before the data gets to the warehouse the technology is called ETL — extract, transform, load. If the transformation happens in the data warehouse itself, it’s called ELT — extract, load, transform.
Either way, the company ends up with a centralized place where data is stored.
“The loading and storage of raw data lets you ingest data once and then share it and distribute it for many uses without having to go back to the source applications,” said Doug Henschen, vice president and principal analyst at Constellation Research.
Now, enterprises can have a birds’ eye view of all of their data, can aggregate their operational or financial metrics, or get a 360-degree view of their customers.
Zeplin Uses Reverse ETL
One company that built just such a data warehouse is Zeplin, an online collaboration platform for UI designers and frontend developers with millions of users.
It uses a Snowflake warehouse to collect operational data, such as how end users interact with its products, and third-party data from SaaS platforms like Salesforce for customer relationship management and Intercom for customer support.
“We collect all this data through a traditional ETL solution and centralize it in a data warehouse,” said Jason Feng, senior growth marketing manager at Zeplin. “Now we have all this information in Snowflake about users are doing, what teams are doing. But the problem is that making the data useful to the rest of the company is hard.”
To get at the view, enterprises typically have data scientists run analytics using business intelligence platforms. The data scientists then produce reports. This is often a slow and time-consuming process and doesn’t put the data in the hands of the employees when they need it.
“For example, when someone converts from being a free user to a paid user, we want to notify the sales team,” said Feng.
To solve this problem, some enterprises are building new data pipelines, sending the reports back out to the systems their employees are actually using.
“But that’s difficult,” said Feng. “You need a data engineer to do it, you need to maintain it. Each time something changes, you need to have the data engineer make the updates.”
At companies where he previously worked, there were data engineers on staff to manually build those data pipelines, he said.
But Zeplin, with fewer than 100 employees, had no engineers to spare.
Instead, when Feng joined the company six months ago, he spearheaded an effort to deploy a third-party “Reverse ETL” tool.
Reverse ETL grabs information from the data warehouse and sends it out to Salesforce or other destinations, where it’s available to the sales team in real time, or close to real time.
He looked for a vendor that could integrate with Zeplin’s Snowflake warehouse and all its SaaS platforms, and could also support customized integrations when they weren’t available out-of-the-box.
And, finally, the vendor the company chose, Hightouch, did not require that the data leave Zeplin’s control.
“That was really great for data security,” he said.
The entire process took about two weeks of part-time work, with two employees working on it from Zeplin’s side, plus support on the vendor side.
“We were able to spin this up without a data team,” Feng said. “We will probably hire a data team, in the future, as we get bigger. But this was a good solution for a scrappy team like ours.”
Now, Zeplin is adding more integrations, and using reverse ETL data feeds for more business processes.
It’s been about three months since the pipeline has been up and running, and it’s changed the way the sales and marketing team works, he said.
“We can make product-informed actions on the sales side that we previously couldn’t do,” he said. “We can reach out to accounts who are very active, or who have a certain number of paid seats.”
That helps grow the business, he said. “I like it. It’s much easier than building custom data pipelines. It lets me be able to use product data within marketing motions and that unlocks a lot of things that marketers traditionally have not been able to do without engineering support.”
Making changes to the data feed does require some SQL knowledge, however.
“You take a SQL query, which spits out different outputs, and then there’s a click-and-drag interface,” he said. “You take your data columns and map them to different fields in the SaaS tools and set up a sync schedule. Maybe every 15 minutes, or every hour.”
Since the data sent to Salesforce only augments existing records, instead of replacing them, there are no data conflicts issues, he said, but it’s something that he’ll keep an eye on as the company grows and more use cases are deployed.
Why Reverse ETL Is Hard to Do
Reverse ETL pipeline engineers have to deal with ever-changing data schemas and API configurations, upload limits, data conflict issues, audit and compliance requirements, and much more.
“Putting data into systems is much more challenging than pulling it out,” said Tejas Manohar, co-founder of reverse ETL company Hightouch. For example, if the data in the SaaS system is different from the data in the warehouse, maybe because it was just updated, or because the warehouse data has been cleaned up, there needs to be a reconciliation step.
“And every company is different — you have to give the companies a lot of flexibility,” he said. “It’s definitely a challenging design problem. You can say that you can only override a field if a value doesn’t already exist in the field in the original system, or you can say that if it’s different, we overwrite it. It’s not a one-size-fits-all problem like ETL is.”
Hightouch currently supports 40 systems, plus offers customers the ability to build their own connectors to internal systems or to SaaS platforms that aren’t yet supported.
The reconciliation requirements also mean that the reverse ETL vendor has to have some way of looking at the data as it flows through the pipeline in order to decide what to do with it.
Hightouch addresses this issue with a hybrid architecture where the data is stored in the customer’s own infrastructure and Hightouch is just the transport pipe.
“It’s all encrypted end to end and it’s not visible to the Hightouch team,” said Manohar. “And no data is stored by Hightouch. This allows us to work with financial and health care companies.”
Another reverse ETL vendor, Grouparoo, addresses the problem with a self-hosted, open-source offering. “Compliance-based companies like medical firms are really into control, and they are doing that,” said Grouparoo CEO Brian Leonard.
But Grouparoo also supports a SaaS-style delivery model as well as a hybrid option.
The company’s open-source approach means that third-party developers can contribute integrations, said Leonard.
“I’ve talked to developers in Brazil and Vietnam,” he said. “The average US-based SaaS company would never make a connector with, say, the Brazilian Mailchimp.”
Grouparoo connects to about 20 applications, like Salesforce, Mailchimp, Intercom and Zendesk, as well as most data warehouses and databases, he said, including Snowflake, Redshift, SQL and Mongo.
The connections to traditional databases are particularly important, he said, because some companies haven’t yet moved to a central data warehouse.
“We’re meeting the customer where they are,” he said. “There’s not that many companies that have already implemented the modern data stack of the future where everything is in Snowflake and they’re ready to sync over. The majority of companies have stuff all over the place and it’s a mess, and we see a lot of value in meeting them at that spot.”
The Industry Landscape
Reverse ETL is very distinct from traditional ELT or ETl, said Constellation Research’s Henschen.
“Don’t confuse it with ELT,” he said. “It’s about application integration.”
“But now there are vendors who do it for you,” he said. “Previously, you had to write it by hand or use heavy enterprise tools that take weeks to set up.”
The new wave of tools can have companies set up reverse ETL pipelines in minutes, he said.
That’s important, because employees have become much more demanding.
“The standards have risen a lot in terms of what people expect internally in companies,” said Suleiman. “Five years ago, it was okay for the sales team not to have visibility in how customers are using the product. It’s no longer okay.”
As a result, reverse ETL vendors began hitting their stride in the past couple of years.
Traditional ELT and ETL vendors are, for the most part, not yet offering reverse ETL because it’s a different, and more challenging problem, leaving the field open to startups.
Enterprises can choose to wait for their ELT or ETL vendor to add the functionality, go with a new reverse ETL startup for just the reverse ETL part, or, if they’re setting up a new data infrastructure — or are looking to switch vendors — use a vendor that handles both sides of the data movement job.
RudderStack, for example, supports traditional ETL, reverse ETL, as well as real-time data streams. RudderStack currently supports 70 integrations, as well as custom integrations. The company also offers a choice of deployment models, either in the customer’s own infrastructure, or as a service, depending on privacy and security requirements.
But for companies that have been slow to deploy a data warehouse and analytics, reverse ETL might not make sense for them yet, he added.
Some experts are expecting some consolidation in the industry, with traditional ETL vendors buying up reverse ETL startups.
“I’m sure that’s something that will happen,” said Mitra.
Other types of companies can get into the game as well. For example, analytics vendor ThoughtSpot purchased of reverse ETL vendor SeekWell last month.
“The venture market is pretty aggressive for this,” he said. “We just raised a hotly contested series A round.”
Census closed a $16 million seed round in February, led by Sequoia Capital, with Andreessen Horowitz also participating, bringing the total raised to more than $20 million.
The market opportunity is big, he said.
“There are around 3,000 Snowflake customers, and probably 50,000 cloud data warehouse customers. Every one of them has made this investment in data infrastructure — but the output of that investment is mostly charts and AI experiments. With tools like ours, you can take that investment and make it available to every aspect of the business.”
Top Use Cases for Reverse ETL
Many data privacy regimes, including Europe’s General Data Protection Regulation and California’s just-passed Privacy Rights Act, individual consumers have the right to ask what data companies collect about them — and ask for that data to be deleted.
The first part is relatively straightforward if a company has a data warehouse that offers a complete view of the customer.
But deleting all that data from the source systems where it resides is a much trickier issue.
“Reverse ETL solutions support deletions, and propagate them to all the solutions,” said Hightouch’s Manohar. “Once you delete from the warehouse, you can send the deletions to all the systems throughout the company. That is a big struggle for many enterprises.”
Compliance was one of the factors that inspired the launch of RudderStack, Mitra said.
“Our product handles user suppression,” he said. “If someone says, ‘don’t track my data,’ we stop tracking their data in their system. All the tracking that happens through RudderStack, we can stop it.”
It’s a hard problem, he said, because some SaaS platforms make it easy to delete records, but others make it very difficult. “Last time I checked, deleting from Salesforce was really painful,” he said. “In a customer relationship management system like Salesforce it’s not just the one record. You also have all the tickets they’ve opened and all the activity they’ve done — it’s all tracked. You have to delete from all the places where the records are available.”
“And if you have a database, and it was backed up into a backup system, we don’t even have access to those systems,” he added.
Replacing Point-to-Point Integrations
Some SaaS platforms try to make it easy to integrate with others. Mailing list management vendor Mailchimp, for example, offers an integration with Salesforce.
But enterprises can have dozens, or hundreds, of SaaS platforms that they are using. Point-to-point connections can quickly multiply out of control.
“You should not have point to point integrations,” said Mitra. “You should be bringing everything to the warehouse and then pushing it out to the destinations. That is the way it should be.”
Sales and Marketing
Zeplin isn’t the only company using reverse ETL to help improve sales and marketing.
“It can help create a unified view of the customer, positively impacting sales, marketing and analytics teams,” said Astasia Myers, venture investor at Redpoint Ventures. “Reverse ETL helps prevent the data warehouse from becoming a data silo and helps make data usable.”
She most often hears about people using reverse ETL to empower sales, marketing, support and operational teams right in the tools they use most often on their jobs. “For example, feeding lead scores from the data warehouse into a custom field in Salesforce or taking customer plan information and pushing it into Zendesk to help prioritize tickets.”
Or, say, for example, a customer isn’t using a company’s product.
If the sales team is aware of the problem, they can contact the customer and find out what’s going on, said Polytomic’s Suleiman.
“If they can intercept the customer before the contract end date, they might save the contract renewal,” he said.
Reverse ETL is particularly useful when analytics tools are used in the data warehouse to generate recommendations, to inject those recommendations into the sales and marketing workflows.
“Say you calculate a lead score,” said Michael Ritchie, CEO at SeekWell, the reverse ETL vendor acquired by ThoughtSpot. “You have a million potential customers in your data warehouse and you want to sort out the best 1,000 customers and send those 1,000 customers an offer. it’s challenging to do and involves a lot of different teams doing different things.”
With reverse ETL, an analyst can set up the query to find the data, do the mapping, then set it to run, say, every hour.
Now, the salesperson can see the results right in Salesforce.
“Companies who have adopted a modern analytics stack and have a 360-degree view of their customer can see immediate and massive value from reverse ETL,” he said.
But for companies that have been slow to deploy a data warehouse and analytics, reverse ETL might not make sense for them yet, he added.