I think the choice not to go with ClickHouse deserves a bit more explanation than what was given in the article.
Instead of writing all this code to do roll ups they could’ve used an AggregatingMergeTree table over their raw events table and... gotten back to work.
Cloudflare is using ClickHouse for their DNS analytics and soon (maybe even by now) their HTTP analytics. And the system they migrated off of looked a heck of a lot like the one in this article.
Edit: I should add that I am not saying their decision was wrong. I just think the sentence that was given in the article does not justify the decision by itself on an engineering level.
The data load processes of ClickHouse and Citus (in this configuration) are nearly identical: ClickHouse takes CSV files just fine, like Citus. The default settings are fine for the volume mentioned in the article of single-digit billions of records per day. This would probably fit on a single server if you age out the raw logs after your coarsest aggregate is created. Queries over the AggregatingMergeTree table at five-minute resolution will finish in high double-digit to low triple-digit milliseconds if the server is not being hammered with queries and the time range is days to weeks.
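To make the AggregatingMergeTree suggestion concrete, here is a minimal ClickHouse sketch of the pattern being described. The table and column names are hypothetical (not Algolia's actual schema), and the latency figures above will of course depend on hardware and cardinality.

```sql
-- Hypothetical raw events table; names are illustrative only.
CREATE TABLE events
(
    app_id   String,
    event_at DateTime,
    user_id  String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(event_at)
ORDER BY (app_id, event_at);

-- Bulk load from CSV, e.g.:
--   clickhouse-client --query="INSERT INTO events FORMAT CSV" < events.csv

-- Five-minute rollup maintained automatically as rows are inserted.
CREATE MATERIALIZED VIEW events_5m
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMMDD(bucket)
ORDER BY (app_id, bucket)
AS SELECT
    app_id,
    toStartOfFiveMinute(event_at) AS bucket,
    countState()                  AS searches,
    uniqState(user_id)            AS unique_users
FROM events
GROUP BY app_id, bucket;

-- Reads merge the partial aggregate states stored in the rollup.
SELECT
    app_id,
    bucket,
    countMerge(searches)    AS searches,
    uniqMerge(unique_users) AS unique_users
FROM events_5m
WHERE app_id = 'app_42'
  AND bucket >= now() - INTERVAL 7 DAY
GROUP BY app_id, bucket
ORDER BY bucket;
```

The -State columns hold partial aggregation states rather than finished values, which is why the read query has to finish them with countMerge/uniqMerge.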
Hey, sorry if that wasn't clear enough (author here).
We decided not to go with ClickHouse because we were mostly looking for a SaaS solution. That's pretty much why we didn't spend too much time on Druid either.
Choosing Citus meant we could leverage a technology we already had a bit of experience with (Postgres) and not have to really care about the infrastructure underneath it. We're still a fairly small team, and those are meaningful factors to us.
At the end of the day I'm sure any of those systems (ClickHouse or Druid) would do the job fine; we just went for what seemed the easiest to implement and scale.
That makes sense. If you do ever want to check out Clickhouse and want someone to run it for you, Percona or Altinity [1] can probably help. Not affiliated with either, I just read their Clickhouse-related content.
A great article, and I am a big fan of Algolia, Citus and Redshift. However, this article ends up making an odd apples-to-oranges comparison.
They state that "However, achieving sub-second aggregation performances on very large datasets is prohibitively expensive with RedShift", which suggests that they want to do sub-second aggregations across raw event data. However, later in the article, the solution they build uses rollup tables for sub-second responses.
You can also do rollup tables in Redshift, and I can assure you (if you enable the short query acceleration option) you can get sub-second queries from the rolled-up, lower-cardinality tables. If you want even better response times, you can store the rollups in plain old Postgres and use something like dblink or postgres_fdw to perform the periodic aggregations on Redshift and insert into the local rollup tables (see [1]). In this model the solution ends up being very similar to their solution with Citus, and I would predict that this is cheaper than Citus Cloud, as Redshift really is at a great price point for a hosted system.
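As a rough illustration of that dblink pattern (the connection string, table and column names here are made up; the AWS post referenced as [1] is the authoritative version), the periodic job might look something like this, run against the plain Postgres instance that serves the dashboard:

```sql
-- On the Postgres side that serves the dashboard.
CREATE EXTENSION IF NOT EXISTS dblink;

CREATE TABLE rollup_hourly (
    app_id     text,
    bucket     timestamptz,
    searches   bigint,
    uniq_users bigint,
    PRIMARY KEY (app_id, bucket)
);

-- Periodic job (e.g. cron): aggregate on Redshift, pull the small result set
-- over dblink, and upsert it into the local rollup table.
INSERT INTO rollup_hourly (app_id, bucket, searches, uniq_users)
SELECT *
FROM dblink(
    'host=my-cluster.example.redshift.amazonaws.com port=5439 dbname=analytics user=etl',
    $$
      SELECT app_id,
             date_trunc('hour', event_at)         AS bucket,
             count(*)                             AS searches,
             approximate count(DISTINCT user_id)  AS uniq_users
      FROM events
      WHERE event_at >= dateadd(hour, -2, getdate())
      GROUP BY 1, 2
    $$
) AS t(app_id text, bucket timestamptz, searches bigint, uniq_users bigint)
ON CONFLICT (app_id, bucket)
DO UPDATE SET searches = EXCLUDED.searches, uniq_users = EXCLUDED.uniq_users;
```

The upsert keeps the job idempotent if it re-reads a window it has already aggregated.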
So the question of performing sub-second aggregations across the raw data remains unanswered... however that really is the ideal end game as you can then offer way more flexibility in terms of filtering than any rollup based solution.
Right now, research suggests ClickHouse, Redshift or BigQuery are probably the fastest solutions for that. Not sure about Druid, I don't know it. GPU databases appear to be the future of this. I would be interested to see benchmarks of Citus under this use case. I should imagine that Citus is also way better if you have something like a mixed OLAP and OLTP workload (e.g. you need the analytics and the row data to match exactly at all times).
Aside: It would be great to see Citus benchmarked against the 1.1 billion taxi rides benchmark by Mark Litwintschik. Any chance of that?
Similar to your point about mixed workloads, I have a hunch that Mark's benchmarks are not comprehensive enough to correlate well to real-world usage across a lot of different scenarios, even on pure OLAP workloads. It's great that a billion rows can be aggregated in 0.02 seconds, but there's a reason TPC-H uses 22 different queries with varying aggregations and joins, vs. these benchmarks on a single table. (Of course, if your use case is heavy on a specific type of aggregation, it probably makes sense to optimize for that at the expense of other query performance.)
And - perhaps I missed it, but his benchmarks don't seem to utilize rollup/materialization unless the DB does it automatically (or at least easily) on the backend.
As is, it's almost certain that Citus would underperform most of the leaders here. The PG9.5 benchmark actually uses the Citus-developed cstore_fdw extension, and it shows up towards the bottom, albeit running on a single node with hardware a few CPU generations old. (Same as used for the Clickhouse benchmark.) I am curious how Citus/Postgres might perform using the HLL / TopN extensions, though.
Also of note: his Redshift benchmark was run on magnetic drives on ds2 instances, not SSDs. Using SSD-backed instances would almost certainly bump performance up a bit.
Druid's downsides are: a more complex deployment and operational burden due to architectural complexity, lack of full SQL support, limited fault tolerance on the query execution path, and the whole query being bottlenecked by the slowest historical node's data access.
That's a fair point. Indeed, we started looking at doing aggregations across raw events before realizing this was probably ill-fated.
It's very possible we could have done the same with Redshift, but it didn't seem obvious how. With Citus offering extensions like topn and hll, however, we quickly saw how that could work for us.
Yeah, that's a good point. Redshift does not have the same level of 'probabilistic counting' functions that can be used from rollups. Redshift does have HLL (SELECT APPROXIMATE COUNT(DISTINCT ...)), however that can only be applied when scanning the full data; I am not sure it's possible to store an HLL object in a rollup and later aggregate them.
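For what it's worth, storing and later merging sketches is what the hll extension on the Postgres/Citus side is for: the sketch is a column type, so a rollup can keep one per bucket and union them at query time. A minimal sketch, assuming a hypothetical raw events table (not Algolia's real schema):

```sql
CREATE EXTENSION IF NOT EXISTS hll;

-- Hypothetical raw events table on the Postgres/Citus side.
CREATE TABLE events (
    app_id   text,
    event_at timestamptz,
    user_id  text,
    query    text
);

-- Five-minute rollup that stores a mergeable HLL sketch per bucket.
CREATE TABLE rollup_5m (
    app_id    text,
    bucket    timestamptz,
    searches  bigint,
    users_hll hll,
    PRIMARY KEY (app_id, bucket)
);

-- Roll-up step: build the sketch while aggregating raw events.
INSERT INTO rollup_5m (app_id, bucket, searches, users_hll)
SELECT app_id,
       to_timestamp(floor(extract(epoch FROM event_at) / 300) * 300) AS bucket,
       count(*),
       hll_add_agg(hll_hash_text(user_id))
FROM events
WHERE event_at >= now() - interval '5 minutes'
GROUP BY 1, 2;

-- Query time: distinct users over an arbitrary window by unioning the
-- stored sketches, without touching the raw events again.
SELECT app_id,
       hll_cardinality(hll_union_agg(users_hll)) AS approx_unique_users
FROM rollup_5m
WHERE bucket >= now() - interval '7 days'
GROUP BY app_id;
```

Because HLL unions are lossless with respect to the sketch, you can aggregate five-minute buckets into hours or days without re-reading raw data.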
Most of the discussion (rightly so) focused on DB optimization. The decision to build the API in Go was barely mentioned. I’m curious if you evaluated any other frameworks / languages or was Go just an automatic choice?
Pretty much automatic. With the exception of our search engine which is in C++ (as performance is paramount there), Go is becoming our language of choice for most of our backend services. We found in Go a great balance in terms of productivity and performance.
After building the aggregation and ingestion services in Go, sticking with this language for the API sounded like a good idea as well since Go makes it trivial to build an http server and the logic of the API is simple enough that we didn’t see the need for any web framework.
Thanks! I asked because I'm looking at the same problem, though at a smaller scale than you guys.
I decided to build our personalization API using Python's Flask. It worked great at the start because it helped us move quickly when adding new features. Six months later, we have more clients and hence more traffic, and response times have gone up significantly. I ran a benchmark with the Apache benchmark tool (a simple DB query returning the result as JSON) and found my Go implementation to be on the order of 20-25x faster, even compared to Falcon, which is meant to be a lot faster than Flask.
Decided to just go ahead and implement the most performance sensitive parts of the API in Go.
“A request targeting a single customer app will only ever need to target a single Postgres instance.”
This seems remarkably dangerous to me. Isn’t hotspotting a big concern? I suppose they are large enough at this point to know what a “large” customer app looks like, but anytime I see sharding done in this manner alarm bells go off.
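For context, sharding by customer app in Citus typically looks like the sketch below (reusing the hypothetical events/rollup tables from the earlier sketch; the colocation detail is an assumption, not something the article spells out). create_distributed_table hash-partitions rows on app_id, so any query filtered to one app is routed to a single worker.

```sql
-- Assumes the Citus extension and the hypothetical events / rollup_5m tables
-- defined in the earlier sketch.
SELECT create_distributed_table('events', 'app_id');
SELECT create_distributed_table('rollup_5m', 'app_id', colocate_with => 'events');

-- A dashboard query filtered on one app_id is routed to a single worker node.
SELECT bucket, searches
FROM rollup_5m
WHERE app_id = 'app_42'
  AND bucket >= now() - interval '24 hours'
ORDER BY bucket;
```

The hotspotting concern then becomes one of shard placement: a very large app still lives on one worker unless its shards are rebalanced or isolated.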
Happy to see another positive Citus case. I was skeptical a year ago, but they're building up great success stories. We need great options like Citus!
Also, a happy Algolia customer. If you're not using them yet, give them a try!
Seems similar to the approach used by process historians in the industrial control world, i.e. store at native frequency out of the PLC, then periodically aggregate.
What really excites me about this blog post is how PostgreSQL is becoming central across diverse workloads - including real-time analytics.
A few Postgres resources that relate to this blog post are the following.
1. TopN: Several Citus customers were already using the TopN extension. Algolia contributed to revising the public APIs in this extension. With these revised APIs, we felt pretty comfortable in open sourcing the extension for the Postgres community to use (a rough usage sketch follows below): https://github.com/citusdata/postgresql-topn
2. Postgres JIT improvements: Postgres 11 is coming with LLVM JIT improvements. For analytical queries that run in-memory, these changes will improve query performance by up to 3x. This will significantly speed up the roll-ups mentioned in this blog post: https://news.ycombinator.com/item?id=16782052
3. For those interested, this tutorial talks about how to build real-time analytics ingest pipelines with Postgres: https://www.youtube.com/watch?v=daeUsVox8hs
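A rough usage sketch of the TopN extension mentioned in item 1, based on its README. The table, column names and the per-day granularity are illustrative only, and reuse the hypothetical events table from the earlier sketches.

```sql
CREATE EXTENSION IF NOT EXISTS topn;

-- Daily summary storing a compressed top-N of search queries per app.
CREATE TABLE popular_queries_daily (
    app_id      text,
    day         date,
    top_queries jsonb,
    PRIMARY KEY (app_id, day)
);

-- Roll-up step: build one top-N summary per app and day.
INSERT INTO popular_queries_daily (app_id, day, top_queries)
SELECT app_id,
       event_at::date,
       topn_add_agg(query)
FROM events
GROUP BY 1, 2;

-- Merge a week of summaries and unpack the 20 most frequent queries.
SELECT (topn(topn_union_agg(top_queries), 20)).*
FROM popular_queries_daily
WHERE app_id = 'app_42'
  AND day >= current_date - 7;
```

Like the hll sketches, the stored summaries are mergeable, so arbitrary date ranges can be answered from the rollup without rescanning raw events.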