In practice it's more of a marketing term, and how big "big" is depends on the nature of the data and what you're doing with it.
If it fits in RAM on your laptop, it isn't big data.
If you can't process/handle it in a reasonable time on a single machine, and your methods need to explicitly worry about how to scale to handle the data volumes, it probably is "Big Data".
Problems that are embarrassingly parallel need far more data before I'd consider them big (I'd be in the >10PB camp), whereas for relational data I'd say >1TB.
I've worked on relational databases of similar size. There are two challenges. The first is that maintaining the relational model at that scale is quite tricky; tradeoffs need to be made. The second is that the systems-level management of a deployment that large requires a bit more than standard configuration management.
These days Amazon has Multi-AZ RDS, which should handle the second challenge.
The problem with databases that are 50TB or more is that you soon run into limits with the relational model. I have been reading up on different modeling techniques for converting relational models into Cassandra's column family stores.
You can't practically fit 50TB on one machine and still have reasonable performance, which means multiple machines with the data spread across them.
There are then two potential issues:
1) You're doing 1-to-1 joins across tables in a query; network latency may be an issue at high query rates
2) You're doing 1-to-many or many-to-many joins across tables in a query; the resulting combinatorial explosion of data is too much to handle (a toy calculation of this follows)
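To put rough numbers on that second issue, here's a toy calculation; the key names and per-key row counts are entirely made up, but the multiplicative blow-up per join key is the real problem.

    # Toy illustration of a many-to-many join blowing up: each join key
    # contributes rows_left * rows_right output rows, so a few heavy keys
    # dominate the result size and the data shuffled between machines.
    from collections import Counter

    # Hypothetical per-key row counts on each side of the join.
    left = Counter({"key_a": 10, "key_b": 5_000, "key_c": 200_000})
    right = Counter({"key_a": 3, "key_b": 40_000, "key_c": 1_000_000})

    output_rows = sum(left[k] * right[k] for k in left.keys() & right.keys())
    print(f"{output_rows:,} output rows")  # roughly 200 billion rows from just three keys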
You want to have your inner loops/joins as deep down in the stack as possible. If you can structure things so all the heavy lifting stays inside one rack/machine/NUMA node/processor/core, you'll be able to scale a good bit further.
Designing things not to require joins at all, denormalising the data and putting it in a column store like Cassandra, is also a good approach.
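As a toy sketch of what that denormalisation looks like, here's a Python mock-up with an invented users/orders schema; real Cassandra details (partition keys, clustering columns, CQL) are only imitated with plain dicts.

    # Normalised, relational-style: users and orders live in separate tables
    # and get joined on user_id at query time.
    users = {1: {"name": "alice"}, 2: {"name": "bob"}}
    orders = [
        {"order_id": 10, "user_id": 1, "total": 9.99},
        {"order_id": 11, "user_id": 1, "total": 4.50},
        {"order_id": 12, "user_id": 2, "total": 20.00},
    ]

    # Denormalised, column-family-style: one wide row per user, with the
    # user's attributes and their orders stored under one partition key.
    orders_by_user = {}
    for o in orders:
        row = orders_by_user.setdefault(
            o["user_id"], {"name": users[o["user_id"]]["name"], "orders": {}}
        )
        row["orders"][o["order_id"]] = {"total": o["total"]}

    # "Give me alice and her orders" is now a single-partition read, no join.
    print(orders_by_user[1])

The usual price is write amplification and designing a table per query pattern, since the wide row only answers the question it was laid out for.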
I think one defining feature is needing disk parallelism, because the workload demands table scans and maintaining indexes is impractical due to how dynamic the data is.
Another is not having pockets deep enough to solve it with intellectual property, either in the form of a proprietary parallel RDBMS (expensive) or by implementing clever stuff yourself.
Big data as a technology is about dumb-as-a-brick, cheap-as-chips brute force.
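For what it's worth, a minimal sketch of that brute-force style, assuming the data is already split into chunk files (the file pattern and the predicate are made up): no indexes, just scan everything in parallel and aggregate.

    import glob
    from multiprocessing import Pool

    def scan_chunk(path):
        # Full scan of one chunk file; the "ERROR" substring test is a
        # stand-in for whatever the actual query predicate is.
        hits = 0
        with open(path) as f:
            for line in f:
                if "ERROR" in line:
                    hits += 1
        return hits

    if __name__ == "__main__":
        chunks = glob.glob("data/part-*.log")  # hypothetical chunk layout on disk
        with Pool() as pool:                   # one worker process per core by default
            total = sum(pool.map(scan_chunk, chunks))
        print(total)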