Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Interesting... it's actually associativity that matters for this class of distributed query execution problems (in particular, for AVG). While a/b != b/a indeed violates commutativity, the reason AVG doesn't distribute is that AVG(a, b, c, d, e) != AVG(AVG(a, b), AVG(c, d, e)), i.e. (1 + 2 + 3 + 4 + 5)/5 != ((1 + 2)/2 + (3 + 4 + 5)/3)/2. Notice that we're not reversing the order of any operations, merely the way in which they are grouped. Sum "scales" because a + b + c + d + e = (a + b) + (c + d + e) -> you can imagine computing a + b on one node and c + d + e on another, and then combining their sums together. GROUP_CONCAT (an aggregate that concatenates strings) is a good example of a non-commutative aggregate operation that is still associative. In fact, on a system that is distributed on non-overlapping ranges, you can straightforwardly merge a GROUP_CONCAT() operation because GROUP_CONCAT(a,b,c,d,e) = GROUP_CONCAT(GROUP_CONCAT(a,b), GROUP_CONCAT(c,d,e)).


(Ozgun from Citus Data)

In the blog post, I used the + operator and sum() aggregate function interchangeably to be brief. Actually, those two operations are related, but have different representations in distributed relational algebra. I updated the first footnote in the post to reflect that.

For your comment, we have in fact two questions. First, is the ExtendedOp commutative with the Collect operator? Second, if it isn't, what properties do our transformations need to respect so that we can pull up the Collect? (equivalence property and associativity)

It's hard to be comprehensive about distributed relational algebra in a blog post. For example, the given logical tree doesn't have enough operator primitives to express large table joins. If you'd like, I'd be happy to get together and chat more about the details.


Perhaps I was missing the point, but my first thought with the AVG counterexample was to have each distributed query return (SUM, COUNT), both of which are nicely associative and commutative, and then only at final step do ((SUM of SUMS) / (SUM of COUNTS)).




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: