3 Ways to be Wrong about Open-source Data Warehousing Software

Gleb Mezhanskiy

28 June, 2020

I often come across statements from data teams such as:

“We are choosing [open source data warehousing technology] because it’s cheaper than proprietary options, also to avoid vendor lock-in and to be able to extend the system if we need to.”

There are many reasons to use open source software (OSS) in your data stack. Specifically, for data warehousing and processing. Today, we have at our disposal dozens of mature OSS technologies with active communities, including:

  1. Apache Spark – a Swiss-army-knife data processing engine that is also developer-friendly.
  2. Presto provides great performance for SQL ETL and analytical queries while abstracting the user from underlying complexity.
  3. Druid & Clickhouse offer subsecond query performance for interactive analytics.

However, due to the modularity of the OSS ecosystem (the stack is assembled from multiple OSS projects and standards) and distributed ownership of codebase (and of bugs), adopting an open-source technology is often a very different experience than buying a similar proprietary product from a vendor.

It seems that many teams “default” to open-source without properly considering trade-offs and carefully evaluating their assumptions.

In this post I am challenging what I consider top-3 erroneous assumptions about OSS data warehousing technologies.

Before we proceed, let’s define the scope for “data warehousing tech” as delivering the following use cases:

  1. Ingest & store all analytical data Execute data transformations (the “T” of “ELT”)

Serve data to consumers:

  • Dashboards
  • Ad-hoc analysis
  • Consuming applications (ML, microservices etc.)

Back to the original prompt: “We are choosing [open source data warehousing technology] because it’s cheaper than proprietary options, also to avoid vendor lock-in and to be able to extend the system if we need to” – there are actually three statements here, so let’s break them down:

OSS is cheaper

It’s hard to compare pricing for data warehousing tech especially with structurally different pricing models: for example, pay-per-use (BigQuery, Athena) vs. per-per infra time (Snowflake, Databricks). So I am going to suggest a counterexample to that assumption.

To compare pricing, we first need to establish a common denominator in terms of amount of work and performance that we are buying for a given price. Let’s consider an often-cited Fivetran’s TPC benchmark that compared Snowflake – a proprietary DWH product with Presto – a popular open-source data processing engine, among others. In that benchmark, Presto shows average query runtime of 14.78sec vs. 10.74sec (38% difference) for Snowflake, but the medians are roughly the same, so let us assume the performance is comparable.

Here’s a back-on-the envelope cost calculation:

Mean TPC query timePriceSize
Snowflake10.74 sec$16Large$1600
Presto14.78 sec$9.8232x n1-standard-8$12

At the first glance, Presto seems to be ~60% cheaper.

However, in a real production scenario we need to consider total cost of ownership and, in case of OSS, factor in expenses such as devops: products like Presto are sophisticated distributed systems that require proper deployment, monitoring, tuning and maintenance. And while it may seem easy to spin up a cluster with K8S in a couple of clicks, matching Snowflake’s availability SLA of 99.9%is a completely different game.

Let’s consider devops cost factor in two primary scenarios:

Vendor-managed Presto

For example, the leading Presto vendor, Starburst, charges ~40% markup on top of AWS infrastructure cost, which reduced the price gap between Presto & Snowflake to mere 16%.

In-house Presto

If you are up for running Presto in-house, consider 2x senior distributed systems engineers (can’t do with just one because they need to be on-call to maintain the SLA). That’s ~$180K * 2 * 2.7 overhead factor = $111 per calendar hour. (Silicon Valley pricing).

That would be a tremendous overhead for the scale of infrastructure as in the benchmark above, so let’s assume you are running ~110 Presto nodes – a more real scenario. You are then paying a ~100% markup on top of infrastructure costs for the develops labor, making Presto more expensive than Snowflake: $9.82 * 2 = $19.6 vs. $16. if we apply it to the example above. Such an outcome should not be surprising because a vendor like Snowflake with tens of thousands of customers is able to perfect their technology and distribute their devops costs thinly.

You’ve noticed that we’ve been making quite a few assumptions along the way, so the numbers may differ from case to case – but the key takeaway is that there is no magic 1.5x/2x/3x savings we can expect from going the OSS route for a typical team, unless you have your own hardware infrastructure and are really, really, really big.

OSS eliminates vendor lock-in

Unless your company is so small that you don’t care about SLAs or so big (Facebook-big) that you can actually pull it off while benefiting from economies of scale, you will probably contract a vendor to host the OSS data warehouse for you instead of running it in-house.

Many vendors run their custom distributions of OSS which can be significantly behind the master branch of the respective OSS projects. That, in turn, limits your ability to tinker with the OSS and to take advantage of its latest features and optimizations. And now your infrastructure is also integrated with vendor’s APIs making it harder to move out. All in all, you end up with a similar vendor lock in that you tried to avoid by going the OSS route. With one exception: a theoretical ability to integrate some of the community-contributed components – but unless you have very specific requirements here, it’s hardly an advantage as the best proprietary database products have excellent interoperability as well.

OSS gives more flexibility

First, consider all of the challenges with vendor-flavored OSS described above apply here.

Finally, ask yourself (and your Engineering partners) – what exactly and why would you want to add to a database solution that isn’t readily available from serverless offerings such as Snowflake or BigQuery? Is that custom feature so critical to your business that you are going to hire [expensive and rare] engineering talent to develop database tech instead of building your core product?

End-user experience

It didn’t come up in the “3 reasons”, but I am going to mention it anyway. Even most mature OSS products in the DWH space are still rough around the edges when it comes to user experience in comparison to top proprietary products such as Snowflake & BigQuery. If you don’t know what I am talking about, try optimizing the performance of a couple of Spark jobs that take forever to complete for no apparent reason. Or try to see inside your Kafka data queue. In the world where analytics powers the business, and where talent is the scarcest resource, data platform’s user experience matters A LOT, so don’t compromise on it, especially without a good reason.

To conclude, a lot of data warehousing OSS is awesome and can be the right business decision IF you are choosing it for the right goals.

Excited to chat more about the economics and UX of data platforms? Shoot me an email at hello@datafold.com.