Before diving into ways SQL could improve, I'd like to give some thanks to a real workhorse that has proved useful for decades, which is an incredibly long time in tech.
Could it be better? Sure, but the author's proposal doesn't address where I usually have problems. What are my pain points, and what would I like to see instead?
1) One giant statement. I really like Hadley Wickham's dplyr [1,2] (think "data pliers"), which has a SQL-like notion of joining tables and selecting values, but separates the filter, mutate, and summarise verbs into separate steps of a pipeline rather than one huge statement. To support transactions, dplyr would have to add an update verb as well.
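To make the contrast concrete, here is a minimal sketch of the same filter/mutate/summarise verbs as separate pipeline steps, written in plain Python over sample rows (the data and field names are made up for illustration; dplyr itself is an R package):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows standing in for an already-joined table.
rows = [
    {"region": "north", "amount": 120},
    {"region": "south", "amount": 80},
    {"region": "north", "amount": 200},
    {"region": "south", "amount": 40},
]

# filter: keep rows with amount >= 50 (analogue of dplyr's filter()).
kept = [r for r in rows if r["amount"] >= 50]

# mutate: add a derived column (analogue of dplyr's mutate()).
mutated = [{**r, "amount_k": r["amount"] / 1000} for r in kept]

# summarise: total amount per region (group_by() + summarise()).
# itertools.groupby needs its input sorted by the grouping key.
mutated.sort(key=itemgetter("region"))
summary = {
    region: sum(r["amount"] for r in group)
    for region, group in groupby(mutated, key=itemgetter("region"))
}
print(summary)  # {'north': 320, 'south': 80}
```

Each step produces an ordinary value you can inspect, which is exactly what one giant SELECT statement makes hard.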
2) Hard to test, especially for more complex ETL. The dplyr approach highlights that a lot of SQL these days is used in ETL applications in addition to the usual retrieval, transactions, and reporting. Expressing a query as a pipeline of operations is easier for me to understand, since execution is conceptually sequential, and I can unit test individual parts in a normal programming-language environment.
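The testing point can be sketched like this: once a transformation step is a plain function, it gets an ordinary unit test with no database in the loop (the function and field names here are illustrative, not a real API):

```python
# A single pipeline step, written as a plain function so it can be
# unit-tested in isolation.
def normalise_amount(row, rate=0.01):
    """Add a 'normalised' field derived from 'amount'."""
    return {**row, "normalised": row["amount"] * rate}

# A unit test for just this step: no database, no full query.
def test_normalise_amount():
    out = normalise_amount({"amount": 250})
    assert out["normalised"] == 2.5
    assert out["amount"] == 250  # input fields are preserved

test_normalise_amount()
```

Doing the equivalent for one sub-expression buried inside a large SQL statement usually means standing up fixture tables and running the whole query.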
3) My data isn't all tabular. I'd like better support and semantics for non-scalar entries, where a value is itself a record, as in JSON, BigQuery, Hive, Presto, etc.
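As a small illustration of what "a value is itself a record" means, here is a JSON row with a nested record and a toy dotted-path accessor in Python (the sample data and `get_path` helper are hypothetical; engines like BigQuery and Presto expose similar dotted-path access in their SQL dialects):

```python
import json

# A record whose 'address' value is itself a record, as you might get
# from a JSON column.
raw = '{"name": "Ada", "address": {"city": "London", "zip": "NW1"}}'
record = json.loads(raw)

def get_path(rec, path):
    """Follow a dotted path such as 'address.city' into nested dicts."""
    for key in path.split("."):
        rec = rec[key]
    return rec

print(get_path(record, "address.city"))  # London
```

In a host language the nested structure is just ordinary traversal; in standard SQL, scalar-only columns make this awkward.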
4) Not that extensible. Better support for user-defined functions (UDFs). More and more often I want to apply some non-trivial operation to data, e.g. run a machine learning model, and it usually makes sense to do that as close to the data as possible. You can get fairly far in SQL itself with window functions, but it is generally painful. You can point Hive at a jar and run a UDF, but in my experience that is also painful to integrate and debug.
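For a sense of what smoother UDF integration can look like, SQLite's Python bindings let you register an ordinary function as a SQL UDF with `create_function`; the `score` function below is a toy stand-in for a model, and the table and data are made up:

```python
import sqlite3

def score(amount):
    """Toy stand-in for a model: flag large amounts."""
    return 1 if amount >= 100 else 0

conn = sqlite3.connect(":memory:")
# Register the Python function as a one-argument SQL function.
conn.create_function("score", 1, score)
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120), ("south", 80), ("north", 200)],
)
result = conn.execute("SELECT region, score(amount) FROM sales").fetchall()
print(result)  # [('north', 1), ('south', 0), ('north', 1)]
```

The function is defined, tested, and debugged in the host language, then called from SQL, which is the experience I'd want from the jar-based Hive workflow.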
[1] https://cran.r-project.org/web/packages/dplyr/vignettes/dply...
[2] https://datacarpentry.org/R-genomics/04-dplyr.html