Hannes and I developed both MonetDBLite and DuckDB, precisely for the need that you described :) We noticed that there was no easy-to-use RDBMS aimed at single-machine analytical workloads, even though these kinds of workloads are very common (e.g. R/Python data science workloads).
MonetDBLite was our initial approach, and is essentially an embedded version of MonetDB. We wrote a paper about it (https://arxiv.org/pdf/1805.08520.pdf). While it works, MonetDB was not built with embeddability in mind, and we had to rewrite a lot of code to make it embeddable. Because of that we ended up with a fork, as the rewrite was too big to be merged back upstream, which caused a lot of problems with the fork becoming outdated and a lot of headaches with constantly merging changes.
MonetDBLite also had a number of issues stemming from the fact that the original system was designed as a stand-alone database. For example, once started in-process it could not be shut down again, because the stand-alone system relied on process exit to clean up certain parts of itself.
In total, the features we wanted that would not be possible to implement in MonetDB without huge rewrites are as follows:
* Multiple active databases in the same process (reading different database files; see the sketch after this list)
* Multiple processes reading the same database file
* In-database shutdown/restart
* Single-file database format
* Dependency-free system
* Single compilation file (similar to the SQLite amalgamation https://www.sqlite.org/amalgamation.html)
* Control over resource/memory usage of the database system
* Vectorized execution engine
* Compressed storage and compressed execution
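To make the first and third points concrete, here is a minimal sketch of what that looks like through the Python client (the exact API surface may still change before v1.0; file names are made up):

    import duckdb

    # two independent databases open in the same process
    sales = duckdb.connect('sales.db')
    logs = duckdb.connect('logs.db')

    sales.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, amount DOUBLE)")
    logs.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER, msg VARCHAR)")

    # and both can be shut down in-process, without exiting
    sales.close()
    logs.close()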
Because of that (and increasing frustration with constantly merging changes) we opted to develop a new system instead of sticking with MonetDB, as rewriting the entire system to get those features would likely be more work than just starting from scratch (and would not be politically feasible either ;)).
The result of this is DuckDB. While it is still early in the process, it is relatively stable and we hope to ship a v1.0 sometime this year, along with an updated website :)
You are doing a fantastic job, and I wish you the best of luck!
I have used only the Python API of both DBs, and what confused me is the mandatory dependency on NumPy and Pandas. ndarray/DataFrame retrieval and conversion should surely be optional. Some applications do not need these features and can get by with the built-in types (mine just uses fetchall()).
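For example, something like this is all my application needs (sketched with the current DuckDB Python client; exact calls may differ by version):

    import duckdb

    con = duckdb.connect(':memory:')
    cur = con.cursor()
    cur.execute("SELECT 42 AS answer, 'hello' AS greeting")
    print(cur.fetchall())  # [(42, 'hello')] -- plain Python tuples, no ndarray/DataFrame conversion involved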
Out of curiosity, do you plan on binding in common functions for DS/ML use cases? Things like:
- String similarity measures
- ROC-AUC/MSE/correlation/precision/recall, etc.
- LSH
- Sampling/joining with random records.
Keeping all of the transformation/prep logic in the SQL engine seems like a big performance win over Python, and would also cut down the dev time for building the code surrounding the ML functionality.
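For the sampling/joining point, what I have in mind is roughly this (toy tables and the current DuckDB Python client, just as a sketch):

    import duckdb

    con = duckdb.connect(':memory:')
    # made-up toy tables so the query has something to run against
    con.execute("CREATE TABLE candidates AS SELECT range AS id, random() AS feat FROM range(100000)")
    con.execute("CREATE TABLE labels AS SELECT range AS id, CASE WHEN random() > 0.5 THEN 1 ELSE 0 END AS label FROM range(100000)")

    # sample and join inside the engine; only the final rows cross into Python
    rows = con.execute("""
        SELECT c.id, c.feat, l.label
        FROM (SELECT * FROM candidates ORDER BY random() LIMIT 1000) AS c
        JOIN labels AS l USING (id)
    """).fetchall()
    print(len(rows), rows[0])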
We already have a number of statistical ops (e.g. correlation) available, and we are planning to add more. I cannot promise an exact timeline, but feel free to open issues for the specific operations you are interested in or think would be useful. We are always happy to review PRs as well :)
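For example, correlation can already run entirely inside the engine; roughly like this through the Python client (made-up table, just a sketch):

    import duckdb

    con = duckdb.connect(':memory:')
    # made-up table just to have something to aggregate over
    con.execute("CREATE TABLE obs AS SELECT random() AS x, random() AS y FROM range(1000)")

    # Pearson correlation computed inside the engine, no round-trip through Python
    print(con.execute("SELECT corr(x, y) FROM obs").fetchall())  # ~0 for independent columns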