When I was developing my pet project for Web analytics (https://github.com/dusta...

mytherin · on May 24, 2020

Hannes and me developed both MonetDBLite and DuckDB, precisely for the need that you described :) We noticed that there was no easy RDBMS aimed at single-machine analytical workloads, whereas these kind of workloads are very common (e.g. R/Python data science workloads).

MonetDBLite was our initial approach, and is essentially an embedded version of MonetDB. We wrote a paper about it (https://arxiv.org/pdf/1805.08520.pdf). While it works, the system was not built with embeddability in mind, and we had to rewrite a lot of code to get it to work. Because of that we ended up with a fork, as the rewrite was too big to be merged back upstream. This caused a lot of problems with the fork becoming outdated, and a lot of headaches with constantly merging changes.

MonetDBLite had a number of issues stemming from the fact that the original system was made as a stand-alone system. For example, the database system once started in-process could not be shut down, as the regular database system would rely on the process shutting down to clean up certain parts of the system.

In total, the features we wanted that would not be possible to implement in MonetDB without huge rewrites are as follows:

* Multiple active databases in the same process (reading different database files)

* Multiple processes reading the same database file

* In-database shutdown/restart

* Single-file database format

* Dependency free system

* Single compilation file (similar to the SQLite amalgamation https://www.sqlite.org/amalgamation.html)

* Control over resource/memory usage of the database system

* Vectorized execution engine

* Compressed storage and compressed execution

Because of that (and increasing frustration with constantly merging changes) we opted to develop a new system instead of sticking with MonetDB, as rewriting the entire system to get those features would likely be more work than just starting from scratch (and not politically feasible as well ;)).

The result of this is DuckDB. While it is still early in the process, it is relatively stable and we hope to ship a v1.0 sometime this year, along with an updated website :)

scoresmoke · on May 24, 2020

You are doing a fantastic job and I am wishing you the best of luck!

I used only Python API of both DBs and what confused me is the mandatory requirement of NumPy and Pandas. I think ndarray/DataFrame retrieval and conversion should surely be optional. Some applications do not require all these features and can go ahead with the built-in types (mine just uses fetchall()).

mytherin · on May 24, 2020

Good point, we will make it optional :) Thanks for the feedback!

lumost · on May 24, 2020

ooc do you plan on binding in common functions for the DS/ML use cases? Things like

- String similarity measures

- ROC-AUC/MSE/correlation/Precison/Recall etc.

- LSH

- Sampling/joining with random records.

Keeping all of the transformation/prep logic in the sql engine seems like a great performance savings over python, and would also speed up the dev time for building up the code surrounding the ML functionality.

mytherin · on May 24, 2020

We already have a number of statistical ops (e.g. correlation) available, and we are planning to add more. The exact timeline I cannot promise, but feel free to open issues with the specific operations you are interested in/you think will be useful. We are always happy to review PRs as well :)

stilisstuk · on May 24, 2020

Can someone point me to what OLAP is? I've read the wiki page, but don't get. All the databases I have worked with are multidimenaional.(i.e. complex erp systems with 100's of tabled joined together in various ways). Granted most of my analytical work is done in R, after performing queries.

But I've never had the need for 'special sql'.

Is this 'cubes' all over? I understand cubes as basically views precalculated to perform faster on mostly to charge more?

But OK... If need to work on datasets bigger than e.g. 5 GB you begin to eat RAM fast. Is this the use case? Then why not simply say that?

DenisM · on May 24, 2020

Going by example, let's say you're designing a sales system.

OLTP workload is when you create/read one order at a time, or a short list of them.

OLAP workload is when you want to aggregate sales numbers across all orders for a given broad criteria, without particular interest in any one order. For example sales by customer by month, or sales by region by date.

The two cases benefit from different data structures and algorithms to achieve their goals.

A system for OLTP may have a table of orders T0 (id, date-time, customer, $total, address, ship method, contact person, ...) where any one record can be fetched efficiently. However reading all of them will require a lot of disk IO.

A system for OLAP can reduce this table to T1 (id, date-time, customer, $total) or even just T2 (date-time, customer, $total) which is good enough to answer most statistical queries and yet consuming less than half the disk bandwidth.

Further yet, you likely do not care about individual order made on a given date by a customer, so you can add up all orders for a given day before storing the data: T3 (day, customer, $total-for-the-day-for-this-customer), or for the week: T4 (week, customer, $total-for-the-week-for-this-customer) or for the months: T5 (month, customer, $total-for-the-month-for-this-customer). Alternatively if you're a big boss you may not care about individual customers, but may care about regions, so the table you will want to see is T6 (week, region, $total-for-the-week-for-all-customers-in-region).

These smaller tables will answer your questions several orders of magnitudes faster compared to T0. And no, they do not have to fit into the memory - to the contrary the entire purpose of the exercise is to make it work efficiently with disk.

One could also chose to not compute e.g. T6 and derive it on the fly from T4, it's slower than having T6 pre-computed, but allows more flexibility in e.g. computing sales by ZIP code or by proximity to major bodies of water, etc.

stilisstuk · on May 24, 2020

Okay. So its a matter of optimising for speed. Since my work requires flexibilty we just try to approximate normalization (I'm not strong on 1St, 2nd etc) - I guess what you describe a t2. Even with milions of rows its normally pretty fast to aggregate rows for e.g. one customer. Depending on the complexity. It might.take a few seconds.to.filter by delivery.method, group by payment etc...

But any ways... it seems to me, one would allways design for flexibilty (t2). And then just create views for the special cases you describe?

DenisM · on May 24, 2020

>it seems to me, one would always design for flexibilty (t2)

Clearly not - the OLAP field exists and brings in untold billions of dollars, so there is no shortage of takers. OLAP design is about finding an optimal point between two extremes:

1. put effort to pre-compute everything, gain a lot of speed while looking at the result.

2. don't pre-compute anything, spend more time waiting to visualize the result.

If things are fast enough for you in the extreme #2 then you don't need to precompute everything and you can enjoy the flexibility. But for many people this is not the case - they either want results in split-second (highly interactive systems, modeling with varying parameters over an over again), or they want to dramatically increase amount of data (e.g. not just all sales but also all page views).

stilisstuk · on May 24, 2020

Yes off course. It humbles me. I thought i've dealt with pretty big data, but with no need to be speedy, clearly. Thx

scoresmoke · on May 24, 2020

In my case, the typical scenario is running aggregation and analytical functions over a small number of columns, while the number of rows is huge. The SQL syntax is the same, but the columnar storage model better fits this case. A good illustration is available in ClickHouse documentation: https://clickhouse.tech/docs/en/#why-column-oriented-databas....

stilisstuk · on May 24, 2020

Thanks..reading the link.

Edit: really interesting. That makes sense. But also a very strict set of requirements.

Thanks.

scoresmoke · on May 24, 2020

Just as always, there is no silver bullet.