Cool! First, thanks for doing this.
You allude to it a little in your text, but I think your cost model:
    (cost_per_hour / 3600) * query_in_seconds / expected_utilization
doesn't (often) work out. Your benchmarks only run one query at a time, so BigQuery will look pretty expensive compared to the smallest Redshift/Snowflake cluster.
Have you tried simulating, say, a team of 10 analysts by sending concurrent queries? (You will see the average runtime go up as you fill the box.) Alternatively, even 18% utilization is probably way too much for a single person/analyst!
P.S.: You seem to have checked in export PGPASSWORD=...
Trying to compare costs between BigQuery and Redshift/Snowflake is really, really hard because of their different pricing models. I tried to be very explicit about what assumptions I am making so people can make their own judgements. You are correct that a single-user sending queries to the warehouse is not realistic, but it should still produce an accurate price comparison. Here's my logic:
* Suppose I were to send 10 times as many concurrent queries to the same warehouse. That would cut the effective cost-per-query to one tenth, which makes BigQuery look 10 times worse.
* But suppose these queries now take 10 times as long. According to my formula, that makes the effective cost-per-query 10x larger, which makes BigQuery look 10 times better.
As long as the performance of the warehouse is linear-ish with respect to number-of-users, these two effects should cancel each other out.
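To make that concrete, here is a toy version of the cancellation argument: the cost formula from the parent comment evaluated twice with made-up numbers (a $2.00/hour cluster, 30-second queries at 18% utilization, then 10x the concurrency modeled as 10x the "utilization" and 10x the runtime). These are illustrative figures, not numbers from the benchmark:

    -- Hypothetical numbers only; any SQL engine will do the arithmetic.
    SELECT
      (2.00 / 3600) * 30  / 0.18 AS cost_per_query_one_analyst,   -- ~$0.093
      (2.00 / 3600) * 300 / 1.80 AS cost_per_query_ten_analysts;  -- ~$0.093, identical
    -- If the slowdown is linear, the 10x-cheaper and 10x-slower factors cancel exactly.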
RE: PS: Don't worry, I made that one up just for that test cluster, which is now gone :)
I should have been more explicit: it won't be linear-ish. As you fill up any per-box sort of system, you start contending for resources (pretend that you simulate 100 analysts), while the BigQuery results will stay closer to linear. It does suck that to benchmark that "accurately", you have to make the queries different enough that no reuse occurs (unless that's reasonable!).
When we've looked at BigQuery, it seemed that if you prepay you essentially get a similar effect to what you're describing. You're given a certain number of "units" of compute, and if you exceed your available concurrent units you end up with the same compute resource contention you would with an improperly scaled Snowflake warehouse or Redshift cluster.
If you're willing to just pay per gigabyte scanned with BigQuery you can scale near linearly I'm sure (although I haven't actually tried it), but you could accomplish the same thing using Snowflake's API to add warehouses as concurrent query load increases. That's what we do (although we just pre-allocate and suspend the warehouses because you only pay when they're on).
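For what it's worth, a rough sketch of that pre-allocate-and-suspend pattern in Snowflake SQL (the warehouse name and size are made up; check the Snowflake docs for the full parameter list):

    -- Pre-create an extra warehouse that costs nothing while suspended.
    CREATE WAREHOUSE IF NOT EXISTS overflow_wh
      WAREHOUSE_SIZE = 'MEDIUM'
      AUTO_SUSPEND = 60            -- suspend after 60 idle seconds
      AUTO_RESUME = TRUE
      INITIALLY_SUSPENDED = TRUE;

    -- Wake it when concurrent load climbs, park it again afterwards.
    ALTER WAREHOUSE overflow_wh RESUME;
    ALTER WAREHOUSE overflow_wh SUSPEND;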
Redshift does suffer from this problem because the compute is tied to the data, but Redshift Spectrum is attempting to rectify that as well. I don't know anything about its performance though.
We're definitely going to revisit this periodically and I would love to address this in the next iteration. Would you mind creating an issue at https://github.com/fivetran/benchmark describing the concurrency trade-off that we're not accurately capturing?
I work at Google and was on the BigQuery/Dremel team.
I've written on this topic in the past at [0]. It turns out analytic SQL workloads are incredibly volatile - we know this from operating these big ol' Dremel/BigQuery clusters in a multi-tenant fashion. In my opinion boulos is correct to point out that the above analysis sets volatility of usage at zero, which does not represent the real world.
Think about it another way. BigQuery on-demand pricing only charges you for the resources you use, rather than provision. This guarantees you don't pay for the luxury of deployed resources that sit idle.
By definition this also means your "cluster utilization" is 100% with BigQuery. It's rare to see provisioned databases get above 50%, even in the cloud. Periscope Data specifically called out running Redshift at 50% utilization as a best practice - in other words, it's a best practice to double your bill just for deployment reasons.
There is, of course, BigQuery flat-rate pricing for larger use cases, which is incredibly cost competitive.
Folks who migrate to BigQuery also specifically call out cost as a major benefit. Some sites that call this out are Kabam [1], ShareThis [2], Yahoo [3], NY Times [3], Motorola [4].
This logic extends to any cloud pay-per-use pricing scheme (queuing, storage, API calls, etc.). It's just that analytics usage tends to be incredibly volatile, which makes the effect extra significant here. Happy to chat further.
I have to consider these results to be meaningless based only on the fact that you do not use "advanced" features like sort and distribution keys. In Redshift especially, sort and distribution keys are not "advanced" features; they are fundamental factors in determining performance.
Your justification for this is that you don't know how queries will run, but that's a poor excuse. With your star-schema example dataset it should be easy to pick distribution keys that match the access patterns of commonly joined tables. There may be trade-offs here, but it's far from impossible, and you are crippling performance by not using these features.
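As a concrete illustration of the kind of declaration being described - a sketch with invented table and column names, not the benchmark's actual TPC-DS schema:

    -- Distribute the fact table on the key it is most often joined on,
    -- and sort it on the column most queries filter by.
    CREATE TABLE store_sales (
      sold_date    DATE,
      customer_id  BIGINT,
      item_id      BIGINT,
      net_paid     DECIMAL(12,2)
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)
    SORTKEY (sold_date);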
It's too late to edit, but this comment feels like it dismisses your hard work, which is not my intent. Having said that, I am concerned that this approach leads to the Ford-vs-Chevy type arguments that don't help anyone make an informed decision.
What do you think about doing a series on the workloads that each database engine is best at? Surely each model has benefits and disadvantages. Finding what each system is best at will help people decide what system they really need.
You are in a unique position to answer this question because as you say, you have no interest in any specific solution and your customers use all of these systems.
Hi mulmen - I work with George (the author). Dist and sort keys are absolutely invaluable tools for optimizing your common BI workloads. What we are seeing with our customers is extensive use of arbitrary querying via tools like Looker or Tableau, which often do not use the sort or dist key. This is becoming the most common workload for Fivetran customers because of the vast number of frequently updating dashboards powered by those queries.
Thank you for your insight. My team uses both Tableau and Redshift but we don't do a lot of frequently updating dashboards.
This sounds like a major limitation of these query tools. Are these arbitrary queries written by users or created by the tools programmatically? Since dist keys are ideally join keys why do these tools (or users) not understand the join keys? Can the queries be identified via WLM and triaged for optimization?
I am skeptical of any solution that promises to be magic and do the hard work for us. I think the "big data" industry does itself a disservice by perpetuating the myth that hard problems can be solved with mouse clicks.
It's not a limitation of the query tools. The problem is that users are running so many different queries against the same schema that you can't optimize for them all. Therefore, it's most important how a data warehouse performs when the optimization "tricks" don't work.
There will always be trade-offs so disagreements are to be expected. My interest is largely in what models work best for a given system, specifically Redshift. What data models have you seen in use? Are the schemas dimensionally modeled or are they flat tables?
Where have you run into limitations with dimensional modeling? Is it a lack of engineering resources on the customer side or is the approach limited in some fundamental way?
Thanks for sharing your invaluable experience. I look forward to part two!
I'm not sure leaving out sort and dist keys is actually being fair to Redshift here.
I get the argument that you're avoiding any "special tuning", but these are basic features that the docs and most guides will tell you are not really optional. In my mind it's more like leaving out the primary key declaration on a table in Postgres or MySQL.
It's a tough call. In my experience, sort and dist keys only really pay off in fewer than half of queries, and it's the rest where you judge the performance of your warehouse.
If we had chosen to use sort and dist keys in Redshift, then we would have also used clustering keys in Snowflake, and it probably would have been somewhat of a wash. BigQuery doesn't really have an equivalent feature yet---they have date partitioning, but that's a very narrow special case that won't work for most queries.
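For reference, the rough equivalents being weighed here, sketched with invented names (Snowflake clustering keys and BigQuery's ingestion-time date partitioning; consult each vendor's docs for the exact semantics):

    -- Snowflake: declare a clustering key on an existing table.
    ALTER TABLE store_sales CLUSTER BY (sold_date);

    -- BigQuery date partitioning: prune partitions via the _PARTITIONTIME
    -- pseudo-column on an ingestion-time partitioned table.
    SELECT SUM(net_paid)
    FROM `project.dataset.store_sales`
    WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2017-01-01') AND TIMESTAMP('2017-01-31');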
It looks like you had to rewrite the queries pretty significantly to get them to run on all three. Redshift/BigQuery don't support ROLLUP / GROUPING SETS / INTERSECT / EXCEPT, etc., which the TPC-DS spec queries use. Is there a reason you don't mention this in the blog post? SQL surface area seems like a pretty important thing to consider when choosing a DW.
I talk about that a little in footnote 5. The changes were mostly just syntax, which I don't think is very important. You may disagree, but surely we can both agree that the important thing is it's all in GitHub so you can see the edit history of the queries :)
BigQuery is one of the few products that truly feels magical. It lives up to the "no-ops" model of just working without thinking about details and handles any scale of data you have.
The only downsides would be the cost-per-run pricing and the lack of low-latency queries.
One point that this article doesn't address (not that it necessarily SHOULD) is the fact that BigQuery achieves these results with zero database administration or schema design.
Redshift is great, but it's not "set and forget".
Also, there are a huge number of use cases out there which are daily reports on large datasets. Fixed vs. Variable cost models are obviously going to shift in favor of variable when this kind of scenario comes into play.
Last, what kind of cost factor do you think comes into play when you have to pay someone to administer your Redshift DB, versus not paying anyone to administer the BigQuery setup? I can tell you from experience that this is a huge, huge factor.
I think you need someone to run and optimize the data warehouse in any case -- figuring out data exports to BQ, table partitioning, etc don't really come for free. Perhaps BQ is easier in that you don't have to figure out how to run and scale a Redshift cluster, but as a BQ user, I do a decent amount of work mucking around with the schemas and figuring out how to do things more performantly or cheaply.
There are several other "no-ops" services that Google offers in GCP that we leverage to great effect (and personally, as the guy who would be doing the "ops" part, really enjoy using). It's quite liberating to be able to use a powerful product with _literally_ zero operations 'cost' in terms of mindshare - not getting sucked backwards to upgrade/fix/whatever service we're using, reliably, reduces a lot of mental burden and makes it easier to keep moving forwards.
This is super well written! I've seen a lot of benchmark queries show up on Hacker News, and I think this one is exceptional for calling out and justifying all of its assumptions, then comparing it to other benchmarks and explaining how they might be different. I really appreciate all of that clarity. That academic but easy-to-read style doesn't make for flashy headlines but it's so much more informative.
I don't know; personally I feel like the scale here is much too low to be relevant. Also I think not investing the time to tune sort and dist keys makes the comparison meaningless.
But maybe that just becomes meaningful at larger data sizes and maybe most people work with less data most of the time.
Do you plan to take into account the developer cost? BigQuery is more managed than Redshift so that definitely factors into the "real world" cost analysis.
Sweet! Another wrinkle to add in: if you're comparing a Redshift system that's >$35k/yr, BQ does have a flat-rate pricing option which might fare better in your TCO analysis. Info here: https://cloud.google.com/bigquery/pricing#flat_rate_pricing
Yes, I’m a biased cloud Googler.
I have been using Redshift for two years and, while every database has its own SQL dialect and its own tricks, it is a really good product. It is a pity that Amazon forked from early PostgreSQL v8 rather than PostgreSQL v9 (with array support). But when I think about the databases I used before (I was the admin of a Netezza instance and a user of an Oracle instance, both data warehouses too), I realize what a big step forward it is that we can now create, with a few clicks and really fast provisioning, a database that used to take days or even months (contracts, waiting for hardware to arrive at the datacenter, DBAs configuring and installing it, etc.) to get up and running. Thinking about Redshift or BigQuery, it feels like another era.
There is a different side to the cost benchmark that's not captured by the description here. If your use case needs a lot of stored data but not necessarily a matching degree of peak CPU (even if your query load is otherwise pretty consistent), Redshift will become really expensive really fast and it will feel like a waste. BigQuery will meanwhile keep costs linear (almost) in your actual query usage with very low storage costs.
For example, you may need to provision a 20-node cluster only because you need the 10+ terabytes in storage across several datasets you need to keep "hot" for sporadic use throughout the day/week, but don't nearly need all that computational capacity around the clock. Unlike BigQuery, Redshift doesn't separate storage from querying. Redshift also doesn't offer a practically acceptable way to scale up/down; resizes at that scale take up to a day, deleting/restoring datasets would cause lots of administrative overhead and even capacity tuning between multiple users is a frequent concern.
Making matters worse, it is common for a small number of tables to be the large "source of truth" tables that you need to keep around to re-populate various intermediate tables even if they themselves don't get queries that often. In Redshift, you will provision a large cluster just to be able to keep them around even though 99% of your queries will hit one of the smaller tables.
That said, I haven't tried the relatively new "query data on S3" Redshift functionality. It doesn't seem quite the equivalent of what BigQuery does, but may perhaps alleviate this issue.
Sidenote: I have been a huge Redshift fan pretty much since its release under AWS. I do however think that it is starting to lose its edge and show its age among the recent advances in the space; I have been increasingly impressed with the ease of use (including intra team and even inter-team collaboration) in the BigQuery camp.
Redshift offers hard disk based nodes with huge amounts of storage at low cost for precisely the use case you mention. The performance of these is actually very good, especially with a little effort applied to choosing sort keys and dist keys.
Spectrum extends that even further, allowing you to have recent and reference data locally stored and keep archival data in S3 available for query at any time.
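Roughly what that looks like in practice - a hedged sketch with hypothetical schema names, IAM role, and S3 path (see the Redshift Spectrum docs for specifics):

    -- Register an external schema backed by the data catalog...
    CREATE EXTERNAL SCHEMA archive
    FROM DATA CATALOG
    DATABASE 'archive_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role';

    -- ...then expose archival data sitting in S3 as an external table.
    CREATE EXTERNAL TABLE archive.events_2016 (
      event_time  TIMESTAMP,
      user_id     BIGINT,
      payload     VARCHAR(65535)
    )
    STORED AS PARQUET
    LOCATION 's3://my-bucket/events/2016/';

    -- Local tables and the S3-resident table can then be joined in one query.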
I've been using Redshift fairly heavily for the past couple of years. I haven't had the time to do comparisons with BigQuery, unfortunately - I've been too strapped for time. And we have only just started experimenting with Snowflake recently.
I would like to try out BigQuery, but experimenting with it has required too much work. I would have to move my entire data set (several petabytes) over to Google.
Redshift has been fine. But with Redshift, we have two big issues: 1) Loading data takes too long. Even with efficient COPY operations from S3, it takes too long to import data at scale. 2) Inefficient queries frequently overflow to disk and consume the entire SSD. I've tried to train my data science team on how to avoid inefficient queries, but it's been impossible.
So, I’m really looking forward to seeing the results from part 2!
I'd really recommend taking a look at BigQuery, as I think it could address at least the issue of inefficient queries triggering extreme load. It'd be great to have you report back your experiences as well.
As far as training your users on efficient queries do you leverage WLM queues to push the worst queries into a longer running queue?
It sounds like you are dealing with a lot of data. You say you are using SSDs but also that you have petabytes of data. The max size for a Redshift cluster with SSDs (dc1.8xlarge) is 326TB. Do you only keep a subset of your data in Redshift? How much data are you loading on a daily basis?
Another comment suggested using EMR to create the rollups, have you looked at using a second cluster of dc2.8xlarge instances for rollups? Have you looked at Spectrum?
I'd love to hear your experiences and solutions because it sounds like you're pushing Redshift to its limits.
Lars here, co-founder of a start-up that specializes in optimizing Amazon Redshift performance. We work with a lot of Fivetran customers who we help maximize their load speeds.
My hunch is that your data science team can solve both issues by using the Workload Management (WLM) feature in Redshift.
Redshift operates on a queuing model. As a user, you can define queues for your different workloads and then assign them a specific concurrency / memory configuration. The default configuration for Redshift is one queue with a concurrency of 5. If you run more than 5 concurrent queries, your queries wait in the queue. That's when the "takes too long" effect kicks in.
The available amount of memory is distributed evenly across the concurrency slots. Say that you have a total of 1GB; with the default configuration, each of the 5 concurrency slots gets 200MB of memory. If you run a query that needs more than 200MB, it falls back to disk. Disk-based queries are really bad for the cluster, because they consume a lot of I/O, which slows down the entire cluster, not just a specific queue.
You can fix both by configuring Redshift specific to your workloads. As a general set-up, we recommend defining 4 queues (load, transform, ad_hoc, catch_all) to isolate your workloads from each other.
-------------
1) data loading takes too long
Create a dedicated "load" queue in the WLM that is only responsible for loading data into Redshift. These are all COPY statements. By separating your data loads from everything else, you make sure that your data loads are protected from e.g. some big ad-hoc queries that some not-so-skilled SQL user writes. The next trick is to figure out how many concurrent slots you need for all your loads.
2) queries frequently overflow to disk
I assume this is true for some large aggregations / roll-ups you're running, or some massive ad-hoc queries. By giving them their own queue, with sufficient memory, you start eliminating the volume of disk-based queries.
--------
The problem is that it's not straightforward to figure out the right concurrency / memory configuration. The information is hidden in the log files (which Redshift deletes on a rolling basis), and sifting through those log files requires writing really complex queries. That just puts more load on the cluster...
However, if you do get it right, you can run blazing fast data loads and queries.
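A minimal sketch of what the routing side looks like once queues exist (the queue definitions themselves live in the cluster's WLM JSON configuration; the 'load' label, table name, and S3/IAM values below are hypothetical):

    -- Route this session's statements into the dedicated load queue.
    SET query_group TO 'load';
    COPY store_sales
    FROM 's3://my-bucket/loads/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-copy-role'
    DELIMITER '|' GZIP;
    RESET query_group;

    -- For a known-heavy transform, temporarily claim extra memory slots
    -- in its queue so it stays off disk.
    SET wlm_query_slot_count TO 3;
    -- ...run the big aggregation here...
    SET wlm_query_slot_count TO 1;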
We struggled with this problem ourselves at our previous company, and so with https://www.intermix.io we built a service to solve that problem. If you ping me at lars at intermix dot io - I can give you an extended free trial.
I'll buy you dinner (in SF, or maybe in Vegas at re:Invent) if we can't solve both issues for you. How is that?
I've recently moved to clickhouse for some analytical work, and it's been awesome. It's not perfect for what I want, and I'm sure I'm using it wrong, but it's torn through what I have to throw at it and I've not spent any time doing perf tweaks.
There's a connector for redash too, so I can make nice little dashboards from the results.
The major problem is the lack of support for updating data. At least loading is fast enough for my use.
You can't do updates, but you can create tables using the AggregatingMergeTree table engine.
I've been working out how to use this for tracking count / first seen / last seen for data sets. With normal SQL you need to do upserts, but with ClickHouse you can just create the table using countState, minState, and maxState columns and insert:
1, now, now
for count, first seen, and last seen; when it compacts the table, it'll apply the aggregations.
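A hedged sketch of that pattern in ClickHouse SQL (table and column names invented, written in current syntax, so details may differ from what's described above):

    CREATE TABLE key_stats
    (
        key         String,
        hits        AggregateFunction(count),
        first_seen  AggregateFunction(min, DateTime),
        last_seen   AggregateFunction(max, DateTime)
    )
    ENGINE = AggregatingMergeTree()
    ORDER BY key;

    -- Each sighting of a key is inserted as partial aggregation state...
    INSERT INTO key_stats
    SELECT 'some-key', countState(), minState(now()), maxState(now());

    -- ...and reads merge the states back into count / first seen / last seen.
    SELECT key, countMerge(hits), minMerge(first_seen), maxMerge(last_seen)
    FROM key_stats
    GROUP BY key;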
Interesting, I'll have an experiment with this today. I think I can see the pattern but I'm not entirely sure it'll work quite as I'm hoping.
So assuming I've got a nice primary key (which thankfully I do for the main data I care about), I could have a table something like (syntax will be wrong but):
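A guess at the kind of table meant here - latest value per key via aggregate states, with invented names (not the commenter's actual attempt):

    CREATE TABLE latest_rows
    (
        key         String,
        latest_val  AggregateFunction(argMax, String, DateTime)
    )
    ENGINE = AggregatingMergeTree()
    ORDER BY key;

    INSERT INTO latest_rows SELECT 'k1', argMaxState('v1', now());

    SELECT key, argMaxMerge(latest_val) AS latest_value
    FROM latest_rows
    GROUP BY key;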
Then when selecting I can run queries efficiently to only get the latest version of a row?
The bit I'm not sure about is that a lot of this feels designed to work nicely with columns, whereas in this case I have a fairly row-like constraint (the latest row for a key).
Should be an interesting day of investigating though, thanks!
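The reply below works against a table called cmt; presumably something like this CollapsingMergeTree sketch (a guess with assumed types, not the commenter's actual DDL):

    CREATE TABLE cmt
    (
        key    String,
        value  String,
        sign   Int8
    )
    ENGINE = CollapsingMergeTree(sign)
    ORDER BY key;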
Now you can 'delete' a row by sending the same row again but with a sign of -1:
insert into cmt (key, value, sign) values ('k1', 'v1', 1)
insert into cmt (key, value, sign) values ('k1', 'v1', -1)
insert into cmt (key, value, sign) values ('k1', 'v1 update', 1)
insert into cmt (key, value, sign) values ('k2', 'just delete this one', 1)
insert into cmt (key, value, sign) values ('k2', 'just delete this one', -1)
As far as I can tell, you have to either add FINAL to the query or optimise the table for this to work, or hope it's been done in the background.
I thought you had to use the sign in the query or it wouldn't work, but creating this example this morning it works fine. If you're not getting the response you expect after optimising, try adding the sign column to the query.
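Concretely, using the same cmt table, those options look something like this (illustrative only):

    -- Collapse rows at read time:
    SELECT key, value FROM cmt FINAL;

    -- Or force the background merge to happen now:
    OPTIMIZE TABLE cmt FINAL;

    -- Or account for sign explicitly, e.g. keep only keys whose rows
    -- haven't cancelled out:
    SELECT key FROM cmt GROUP BY key HAVING sum(sign) > 0;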
:) select key, value, sign from cmt
SELECT
key,
value,
sign
FROM cmt
┌─key─┬─value─────┬─sign─┐
│ k1 │ v1 update │ 1 │
└─────┴───────────┴──────┘
1 rows in set. Elapsed: 0.002 sec.
:)
[disclaimer - I know very little about how to make high performance things in clickhouse, tbh the Log format has been easily fast enough for my data so far]
Having benchmark tests is great as a general guideline for what works under different architectures/schema designs. Unfortunately, benchmarking is highly sensitive to the initial choices. I am a big fan of BigQuery (enough to go through Google's vetting process), but there are plenty of performance issues I've run into that would have been easily resolved with Redshift. Here are some concrete examples:
1) Running a query across several very small tables. It turns out that occasionally querying small tables causes heavy network traffic within Google's distributed system. The solution on Redshift would be to adjust distribution. On Google, however, you don't have any control over this. You just have to hope that Google's algorithms pick up the issue based on usage (they don't).
2) Joining large tables. Avoid joining large tables in BigQuery. In Redshift the join would have been handled by setting the sort key on the column used for the join on the (typically) right table, and by giving the two tables a common dist key (that way the relevant data for both tables lives on the same node); see the sketch after this comment. BigQuery just throws resources at the problem. Well, it turns out that throwing resources at the problem is super slow (think 5-15 Redshift seconds vs. 200 BQ seconds).
Re: Snowflake. Can't speak to it, as I haven't had personal experience. I have worked with data people who had opinions on both the favorable and negative ends of the spectrum. This just suggests to me that, just like Redshift and BigQuery, Snowflake is not a universal solution. You really need to understand:
1) what your goals are for usage among varying consumers
2) what skill sets the various users of the database have
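The Redshift-side join optimization described in point 2 further up (common dist key, sort key on the join column), sketched with invented table and column names:

    -- Co-locate both tables on the join key and sort the right-hand table on it,
    -- so the join runs node-locally over sorted data.
    CREATE TABLE orders (
      customer_id  BIGINT,
      order_date   DATE,
      total        DECIMAL(12,2)
    )
    DISTKEY (customer_id)
    SORTKEY (customer_id);

    CREATE TABLE customers (
      customer_id  BIGINT,
      segment      VARCHAR(32)
    )
    DISTKEY (customer_id)
    SORTKEY (customer_id);

    SELECT c.segment, SUM(o.total) AS revenue
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.segment;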