Hacker News

This is a good tutorial.

The problem of out-of-order data becomes more challenging as ingest throughput requirements increase if your storage representation must guarantee a strict total order. In high-throughput designs this is often handled by relaxing the strict total ordering requirement for how the data is represented in storage. As long as the time-series has an approximate total order at ingest time, there are many techniques for inexpensively reconstructing a strict total order at query time.
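One minimal way to realize this (my sketch, not the article's): append datapoints in arrival order and re-establish the total order with a sort at query time. When ingest order is approximately sorted, Python's Timsort exploits the pre-existing runs, so the query-time sort stays cheap.

```python
# Sketch: no ordering guarantee in storage, strict total order at query time.
points = []

def ingest(ts, value):
    points.append((ts, value))  # stored in arrival order

def query():
    return sorted(points)  # reconstruct the strict total order lazily

for ts, v in [(1, 1.0), (2, 2.0), (1.5, 1.5), (3, 3.0)]:
    ingest(ts, v)
assert [ts for ts, _ in query()] == [1, 1.5, 2, 3]
```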



Right, exactly. As a point of reference, within M3DB each unique time series has a list of “in-order” compressed timestamp/float64 tuple streams. When a datapoint is written, the series finds an encoder it can append to while keeping the stream in order (timestamp ascending); if no such stream exists, a new stream is created and becomes writeable for any datapoints that arrive with timestamps greater than the last written point.
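A hypothetical sketch of that write path (the names are mine, not M3DB's, and lists stand in for compressed encoders): each series keeps a list of ascending streams, a write goes to the first stream it can extend, and otherwise opens a new sibling stream.

```python
class Series:
    def __init__(self):
        self.streams = []  # each stream: list of (timestamp, value), ascending

    def write(self, ts, value):
        # Find a stream whose last timestamp is less than the new one.
        for stream in self.streams:
            if not stream or stream[-1][0] < ts:
                stream.append((ts, value))
                return
        # Out of order w.r.t. every existing stream: open a sibling stream.
        self.streams.append([(ts, value)])

s = Series()
for ts, v in [(10, 1.0), (20, 2.0), (15, 1.5), (30, 3.0)]:
    s.write(ts, v)
assert len(s.streams) == 2  # the late (15, 1.5) forced a sibling stream
```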

At query time these streams are read by peeking at the next timestamp of every written stream for a block of time, then repeatedly taking the datapoint with the lowest timestamp across the streams.
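That read path is a k-way merge over sorted inputs, which `heapq.merge` in Python's standard library implements directly (a hedged sketch, using plain lists for the streams):

```python
import heapq

streams = [
    [(10, 1.0), (20, 2.0), (30, 3.0)],  # original in-order stream
    [(15, 1.5), (25, 2.5)],             # sibling stream from out-of-order writes
]
# Repeatedly emit the datapoint with the lowest next timestamp.
merged = list(heapq.merge(*streams))
assert [ts for ts, _ in merged] == [10, 15, 20, 25, 30]
```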

M3DB also runs a background tick, targeted to complete each run within a few minutes, to amortize CPU. During this process each series merges any sibling streams created by out-of-order writes into a single in-order stream. This uses the same process as query-time in-order reads: the datapoints are read in order and written out to a new single compressed stream. This way the extra computation caused by out-of-order writes is amortized, and only if a large percentage of series are written in time-descending order do you end up with significant overhead at write and read time. It also reduces the cost of persisting the current mutable data to a volume on disk (whether for a snapshot or for persisting data for a completed time window).
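An illustrative-only compaction pass, assuming the stream layout sketched above: any series holding sibling streams gets them k-way merged back into one in-order stream, so later reads and flushes see a single stream.

```python
import heapq

def tick(series_streams):
    """series_streams: dict of series id -> list of ascending streams.
    Compact every series that accumulated sibling streams."""
    for sid, streams in series_streams.items():
        if len(streams) > 1:
            series_streams[sid] = [list(heapq.merge(*streams))]

db = {"cpu.user": [[(10, 1.0), (30, 3.0)], [(20, 2.0)]]}
tick(db)
assert db["cpu.user"] == [[(10, 1.0), (20, 2.0), (30, 3.0)]]
```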


For Interana I think we end up doing this by batching writes and sorts, and not really having a strict guarantee on when imported data actually shows up in query results.
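A plausible batch-and-sort sketch of that idea (not Interana's actual code): buffer incoming writes, sort each batch when it is flushed, and only make data queryable after its batch flushes.

```python
buffer, segments = [], []

def write(ts, value):
    buffer.append((ts, value))  # not yet visible to queries

def flush():
    global buffer
    if buffer:
        segments.append(sorted(buffer))  # each flushed segment is in order
        buffer = []

write(20, 2.0)
write(10, 1.0)
flush()  # imported data becomes queryable here
assert segments == [[(10, 1.0), (20, 2.0)]]
```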


Whatever you’re doing, it works great :) We used Interana at a previous company I worked at, and the combo of query flexibility and performance was excellent, really liked the product!


TY! Twitter has since acquired the Interana engineering team & IP, so we’re now doing the same thing at Twitter.


Ah interesting, like as an internal tool there?


Yes.



