> A notion as simple as "I want to return a list of customers with additional pr...

victorNicollet · on Oct 21, 2023

I've gone into more detail about this a few years ago in https://nicollet.net/blog/vanishing-schema-paradox.html but here's a short summary. Suppose you have the following analytics query:

    SELECT Customer.Name, sum(Sales.Quantity) 
    FROM Customer
    INNER JOIN Sales ON Customer.A = Sales.A
    GROUP BY Customer.B

Can you tell whether this will return exactly one line for each customer ? If INNER JOIN + GROUP BY was the dedicated construct to do so, then the answer would be "yes", because by definition it is the dedicated construct to do so. That's what a dedicated construct does: it is dedicated to doing that thing.

But both INNER JOIN and GROUP BY are much more versatile than that. In order to return exactly one line per customer, the following must be true: 1. column Customers.B must be a unique key of the Customers table (otherwise you'll get duplicate lines), and 2. each value of column Customer.A must also appear in column Sales.A (otherwise you'll have missing lines). Neither of these properties can be ascertained by looking at the query alone.

A dedicated construct would be something like (imaginary syntax):

   SELECT Customer.Name, sum(Sales.Quantity)
   FROM Customer 
   INNER JOIN Sales ON Customer.A = Sales.A
   GROUP INTO Customer

dragonwriter · on Oct 21, 2023

> Can you tell whether this will return exactly one line for each customer ?

I can tell it will return one row for each Customer.B.

If Customer.B is a candidate key of Customer, that will also be one eow per Customer.

> If INNER JOIN + GROUP BY was the dedicated construct to do so

The dedicated construct to say what you want one row per is GROUP BY. Yes, it operates by columns, not tables, so what it means in table terms is schema-dependent.

> each value of column Customer.A must also appear in column Sales

Well, yes, that’s what INNER JOIN means. The dedicated construct to assure that every row from the first source but not the second source table is included in the result set before filtering by WHERE is LEFT [OUTER] JOIN.

RHSeeger · on Oct 21, 2023

As I read what you wrote, my first thought is that you don't have any such thing as a "customer" in that data model. You can ask for "one line for each customer.B" (which is what you're doing). But the idea that you can ask for "one line per customer" relies on some amount of non-db, domain knowledge.

If you can't define what a "customer" is via the information in your database alone, then you can't query based on "a customer". And if the answer is "each row in Customer with a unique B", then that's part of the definition and reasonable to use as knowledge in getting "one row per customer".

I didn't explain that well, I think... but that's the general thought that was running through my head as I read your writing.

victorNicollet · on Oct 21, 2023

I agree ! This creates a situation where the SQL query does not represent the domain knowledge, but makes assumptions about it and cannot be understood without it. And while this will always be true for the more unusual parts of the domain, it is quite disappointing to be unable to properly represent as simple a concept as "what is a customer?" in SQL.

It's the same as a language forcing you to use `c & ~0x20` because it doesn't have a `Char.toUpper(c)` function. The code works (under the right assumptions) and produces the same result, but it does not convey the concept of converting a letter to uppercase.

What makes it so frustrating is that the DDL portion of SQL spends significant effort on representing such concepts in the database schema ! I can create a Customers table, with its primary key, and its foreign keys into and from other tables, and so on. I can represent "these are all the customers" in DDL, I can represent "every sale must be associated to a customer", and so on. But after the first join, I'm no longer using the Customers table, I'm using a new in-memory relation with no primary or foreign keys, and the concept that "this is the customers table, but with extra fields" is something I need to keep track of in my head, instead of in the language.

civilized · on Oct 21, 2023

In some sense, what you're asking for is straightforward with the tools we already have. It suffices to set a convention that each table always has a column named `id` which is the primary key for the table.

The harder part is how to enforce that within an organization. It sounds like you'd like technology to enforce it.

Existing tools already do the math part: you can set a constraint on a table so that the database maintains the primary key property and throws an error if a transaction would change the table in a way that violates the property.

What you're left to do is get everybody on board with the "every table has a primary key column named id" plan. Some syntactic sugar like GROUP INTO might help with that.

dragonwriter · on Oct 22, 2023

Looking at this again, I think the actual complaint isn't so much about base tables (though those were used in the illustration) but intermediate derived relations created in deeply nested queries (or even regular views), where even though their may in effect be primary/unique keys, they aren’t declared and recognizing them depends on tacit knowledge (and because the functional dependencies aren't recognized by the DB engine, they can’t be leveraged in GROUP BY to omit redundant non-key columns so a GROUP BY needs to specify all the non-aggregate columns with the domain understanding being opaque.

A primary key convention for base tables doesn’t help with this; I also don't think the propsed GROUP INTO solves it, though it requires it to be solved first to work (i.e., unless you are only using it to GROUP INTO base tables rather than intermediate tables formed by arbitrary joins, it requires first having the engine infer, or provide a way of declaring and having the engine validate, keys for those tables.)

civilized · on Oct 22, 2023

Honestly, I'm not sure what all this means. Maybe an example would help?

It sounds like there's an interest in the database inferring something subtle and making some kind of automated decisions based on that. Business stakeholders often make this kind of request - "can't an AI just figure all this out?" kind of thing. It often doesn't go anywhere because it's too far removed from the level of detail needed for a machine to automatically solve a problem.

timacles · on Oct 21, 2023

First of all, all code requires domain knowledge to understand. Some domains are just simple. Business domains never are. Even with your upper case example, if you don’t know what upper case letters are, you are in the same position.

Second, data can be organized in infinite permutations and sql has to accommodate that. People have been complaining about sql since the Dawn of time, but all proposed solutions only fix a subset of problems.

matwood · on Oct 21, 2023

MySQL aside, your first query will end up an error if Customer.Name is not functionally dependent on Customer.B. This highlights that there could be different customers with the same name and that it was a poor query to start.

For number 2, that's exactly what inner join means otherwise use a left join. The first question someone should ask themselves is if they want all customers or only ones that have had sales.