Indexes don't solve the same problem as columnar storage: with an index you still have to dereference a pointer, read the page (performing I/O if it's on disk), and locate each tuple you want within the page. The tuples you need to aggregate may be scattered across many pages, and you have to read each whole page into memory even if you only want one column.

With a column store, the values of the same column are stored contiguously, so if you're summing a column, for example, you can sequentially read pages that consist entirely of the values you need. It's much more efficient for aggregations due to cache locality--far fewer cache misses and I/O operations are needed to summarize a column.
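
A toy sketch of the locality argument in plain Python (this is an illustration, not how DuckDB actually lays out pages):

    from array import array

    # Row layout: each record is a tuple, so the "price" values are
    # scattered across many heap objects, like tuples scattered across pages.
    rows = [(i, f"item-{i}", i * 0.5) for i in range(1_000_000)]
    row_sum = sum(r[2] for r in rows)

    # Column layout: all "price" values sit in one contiguous buffer, so a
    # sum is a sequential scan with good cache locality.
    prices = array("d", (i * 0.5 for i in range(1_000_000)))
    col_sum = sum(prices)

    assert row_sum == col_sum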

The major trade-off is that updates and deletes are more expensive in a column store. But this is usually fine for OLAP workloads.



Unsure if DuckDB implements this in its column store, but another common feature is compression.

A page of field values is a much better structure to compress than a page of tuple values. This is especially true of the sort of denormalized data that is typical in analytical workloads. E.g. the dates in a table may span only a few years, which takes much less space to encode than the full allocation for a date data type. Similarly, we may have categorical fields with only a few distinct values. Imagine a product category with only 8 values: entries that might otherwise be stored as full integers can be dictionary-encoded in 3 bits each, plus some space for the dictionary itself.
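
A hedged sketch of that dictionary encoding (engine-agnostic, not DuckDB's actual format; with at most 8 distinct values each code would fit in 3 bits--plain ints here just keep the idea visible):

    column = ["toys", "books", "toys", "games", "tools", "toys"]

    # Build the dictionary and replace each value with a small integer code.
    dictionary = {v: code for code, v in enumerate(sorted(set(column)))}
    codes = [dictionary[v] for v in column]      # [3, 0, 3, 1, 2, 3]

    # Decoding is a cheap lookup into the code -> value table.
    lookup = sorted(dictionary, key=dictionary.get)
    assert [lookup[c] for c in codes] == column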

Another easy form of compression to apply to columns is run-length encoding. Dates, again, serve as a good example. If the data is sorted by date, you can store a structure that captures (start index, date value, end index). If you had 1,000 records on the same date, you could encode them with just three integer-sized values rather than 1,000 repetitions of one.
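
A sketch of that (start index, date value, end index) encoding over hypothetical sorted data:

    from datetime import date
    from itertools import groupby

    dates = [date(2024, 1, 1)] * 1000 + [date(2024, 1, 2)] * 500

    runs, start = [], 0
    for value, group in groupby(dates):
        length = sum(1 for _ in group)
        runs.append((start, value, start + length))   # half-open [start, end)
        start += length

    # 1,500 stored values collapse into two triples.
    assert runs == [(0, date(2024, 1, 1), 1000), (1000, date(2024, 1, 2), 1500)]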

Such compression offers several benefits. First, it reduces data volume, allowing more of the query workload to be served from RAM, whether through DB or filesystem caching or an in-memory architecture. Additionally, the smaller the data, the faster a scan completes. Analytic workloads tend toward queries that require a full table scan (though with a column store, it's more a full column scan), and if the data is smaller, you can stream the entire thing through the query faster.
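
As a small illustration of the scan-speed point, some aggregates can even be computed directly on the run-length-encoded form without decompressing anything (reusing the hypothetical runs from the sketch above):

    from datetime import date

    # Runs in the (start index, value, end index) shape described above.
    runs = [(0, date(2024, 1, 1), 1000), (1000, date(2024, 1, 2), 1500)]

    # COUNT(*) touches two triples instead of 1,500 values...
    total = sum(end - start for start, _, end in runs)
    # ...and a filtered count is just arithmetic on the matching runs.
    jan_1 = sum(end - start for start, v, end in runs if v == date(2024, 1, 1))
    assert (total, jan_1) == (1500, 1000)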


Yep, compressibility is another big win made possible by columnar layout, and it looks like DuckDB implements at least some form of column compression.


Ah yes, embarrassing not to remember basic DB architecture :-)

Do you have an example of a workload that benefits from this structure?


Yes, it's geared toward reporting / analytics / data science workloads where the most common read operation is calculating aggregate metrics like count/min/mean/max and histograms. Statistics stuff.
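
For instance, a minimal sketch with DuckDB's Python API (table and column names are made up for illustration):

    import duckdb

    con = duckdb.connect()  # in-memory database
    con.execute("CREATE TABLE readings (sensor_id INTEGER, value DOUBLE)")
    con.execute("INSERT INTO readings SELECT range % 10, random() FROM range(100000)")

    # One pass over two columns computes all the aggregates at once.
    print(con.execute("""
        SELECT sensor_id, count(*), min(value), avg(value), max(value)
        FROM readings
        GROUP BY sensor_id
    """).fetchall())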


Thank you very much for your patience :-)


Taking stats on a time-ordered set of sensor observations.
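
E.g., a hedged sketch in DuckDB (the schema is hypothetical): daily min/mean/max over a week of per-minute readings. Because the data arrives sorted by time, the timestamp column is also a natural candidate for the run-length encoding discussed above.

    import duckdb

    con = duckdb.connect()
    con.execute("""
        CREATE TABLE observations AS
        SELECT TIMESTAMP '2024-01-01' + range * INTERVAL 1 MINUTE AS ts,
               random() * 100 AS temperature
        FROM range(10080)  -- one reading per minute for a week
    """)

    print(con.execute("""
        SELECT date_trunc('day', ts) AS day,
               min(temperature), avg(temperature), max(temperature)
        FROM observations
        GROUP BY day
        ORDER BY day
    """).fetchall())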



