Indexes don't solve the same problem as columnar storage: with an index you still have to dereference a pointer, read the page (performing I/O if it's on disk), and locate each tuple you want within the page. The tuples you need to aggregate may be scattered across many pages, and you have to read each whole page into memory even if you only want one column.

With a column store, the values of the same column are stored contiguously, so if you're summing a column, for example, you can sequentially read pages that consist entirely of the values you need. It's much more efficient for aggregations due to cache locality--far fewer cache misses and I/O operations are needed to summarize a column.
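
A toy sketch of the locality argument in plain Python (this is an illustration, not how DuckDB actually lays out pages):

    from array import array

    # Row layout: each record is a tuple, so the "price" values are
    # scattered across many heap objects, like tuples scattered across pages.
    rows = [(i, f"item-{i}", i * 0.5) for i in range(1_000_000)]
    row_sum = sum(r[2] for r in rows)

    # Column layout: all "price" values sit in one contiguous buffer, so a
    # sum is a sequential scan with good cache locality.
    prices = array("d", (i * 0.5 for i in range(1_000_000)))
    col_sum = sum(prices)

    assert row_sum == col_sum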

The major trade-off is that updates and deletes are more expensive in a column store. But this is usually fine for OLAP workloads.



Unsure if DuckDB implements this in its column store, but another common feature is compression.

A page of field values is a much better structure to compress than a page of tuple values. This is especially true of the sort of denormalized data that is typical in analytical workloads. E.g. the dates in a table may span only a few years, which takes much less space to encode than the full allocation for a date data type. Similarly, we may have categorical fields with only a few distinct values. Imagine a product category with only 8 values: entries that might otherwise be stored as full integers can be dictionary-encoded in 3 bits each, plus some space for the dictionary itself.
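
A hedged sketch of that dictionary encoding (engine-agnostic, not DuckDB's actual format; with at most 8 distinct values each code would fit in 3 bits--plain ints here just keep the idea visible):

    column = ["toys", "books", "toys", "games", "tools", "toys"]

    # Build the dictionary and replace each value with a small integer code.
    dictionary = {v: code for code, v in enumerate(sorted(set(column)))}
    codes = [dictionary[v] for v in column]      # [3, 0, 3, 1, 2, 3]

    # Decoding is a cheap lookup into the code -> value table.
    lookup = sorted(dictionary, key=dictionary.get)
    assert [lookup[c] for c in codes] == column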

Another easy form of compression to apply to columns is run-length encoding. Dates, again, serve as a good example. If the data is sorted by date, you can store a structure that captures (start index, date value, end index). If you had 1,000 records on the same date, you could encode them with just three integer-sized values rather than 1,000 repetitions of one.
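
A sketch of that (start index, date value, end index) encoding over hypothetical sorted data:

    from datetime import date
    from itertools import groupby

    dates = [date(2024, 1, 1)] * 1000 + [date(2024, 1, 2)] * 500

    runs, start = [], 0
    for value, group in groupby(dates):
        length = sum(1 for _ in group)
        runs.append((start, value, start + length))   # half-open [start, end)
        start += length

    # 1,500 stored values collapse into two triples.
    assert runs == [(0, date(2024, 1, 1), 1000), (1000, date(2024, 1, 2), 1500)]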

Such compression offers several benefits. First, it reduces data volume, allowing more of the query workload to be served from RAM, whether through DB or filesystem caching or an in-memory architecture. Additionally, the smaller the data, the faster a scan completes. Analytic workloads tend toward queries that require a full table scan (though with a column store, it's more a full column scan), and if the data is smaller, you can stream the entire thing through the query faster.
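
As a small illustration of the scan-speed point, some aggregates can even be computed directly on the run-length-encoded form without decompressing anything (reusing the hypothetical runs from the sketch above):

    from datetime import date

    # Runs in the (start index, value, end index) shape described above.
    runs = [(0, date(2024, 1, 1), 1000), (1000, date(2024, 1, 2), 1500)]

    # COUNT(*) touches two triples instead of 1,500 values...
    total = sum(end - start for start, _, end in runs)
    # ...and a filtered count is just arithmetic on the matching runs.
    jan_1 = sum(end - start for start, v, end in runs if v == date(2024, 1, 1))
    assert (total, jan_1) == (1500, 1000)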


Yep, compressibility is another big win made possible by columnar layout, and it looks like DuckDB implements at least some form of column compression.


Ah yes, embarrassing not to remember basic DB architecture :-)

Do you have an example of a workload that benefits from this structure?


Yes, it's geared toward reporting / analytics / data science workloads where the most common read operation is calculating aggregate metrics like count/min/mean/max and histograms. Statistics stuff.
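
For instance, a minimal sketch with DuckDB's Python API (table and column names are made up for illustration):

    import duckdb

    con = duckdb.connect()  # in-memory database
    con.execute("CREATE TABLE readings (sensor_id INTEGER, value DOUBLE)")
    con.execute("INSERT INTO readings SELECT range % 10, random() FROM range(100000)")

    # One pass over two columns computes all the aggregates at once.
    print(con.execute("""
        SELECT sensor_id, count(*), min(value), avg(value), max(value)
        FROM readings
        GROUP BY sensor_id
    """).fetchall())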


Thank you very much for your patience :-)


Taking stats on a time-ordered set of sensor observations.
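
E.g., a hedged sketch in DuckDB (the schema is hypothetical): daily min/mean/max over a week of per-minute readings. Because the data arrives sorted by time, the timestamp column is also a natural candidate for the run-length encoding discussed above.

    import duckdb

    con = duckdb.connect()
    con.execute("""
        CREATE TABLE observations AS
        SELECT TIMESTAMP '2024-01-01' + range * INTERVAL 1 MINUTE AS ts,
               random() * 100 AS temperature
        FROM range(10080)  -- one reading per minute for a week
    """)

    print(con.execute("""
        SELECT date_trunc('day', ts) AS day,
               min(temperature), avg(temperature), max(temperature)
        FROM observations
        GROUP BY day
        ORDER BY day
    """).fetchall())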



