>If you can talk intelligently about the whole grab bag of stuff these teams use, that'll get you in the door. Understanding RDBMSes, data warehousing concepts, and ETL is a big plus for people doing infrastructure work.
This is sadly true, for now. I don't think folks here disagree with the "true" part. Let me explain why it is "sadly" and "for now".
The biggest issue with big data is that most of it sits unused. In many organizations, HDFS ends up as an alternative to NetApp storage servers, holding terabytes of data in the hope that they will be useful one day.
In fact, if you have even gotten to the stage of using HDFS as a storage server, you must have a decent ETL team that can put data into HDFS with a menacing combination of ad hoc scripts and a workflow that looks like a cobweb produced by a deranged spider. For now, knowing the ins and outs of various semi-functional open source components, plus the tenacity, patience, and skill to deal with the gnarliest of ETL tasks, gets you a high-paying data engineering job.
But, in the long term, there will be a big change.
1. Tools are getting better: many data practitioners are realizing there are huge gaps between different data infrastructure components, and they are trying to fill those gaps. A lot of attention goes to query execution engines (Presto, Impala, Spark, etc.), but I find data collection/workflow management tools just as critical (if not higher leverage) right now. Tools like Fluentd (log collector) [1] and Luigi (workflow engine) [2] are OSS projects in this direction (see the Luigi sketch after this list).
2. Data-related cloud services are becoming really, really good: huge kudos to providers like AWS, GCP, and Heroku (through Add-ons). They are quickly building a great ecosystem of data processing/analysis/database components that frankly work better than most self-administered OSS counterparts. (Disclaimer: my perception might be colored here, since I work for a data processing/collaboration SaaS myself [3].)
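To make the workflow point in (1) concrete, here is a minimal Luigi sketch; the task names, paths, and stub logic are all hypothetical, just to show how declared dependencies replace the cron-and-bash cobweb:

```python
import luigi


class ExtractLogs(luigi.Task):
    """Hypothetical extract step: pull one day of raw logs to local disk."""
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget("data/raw/{}.log".format(self.date))

    def run(self):
        # A real task would pull from Fluentd, S3, etc.; stubbed here.
        with self.output().open("w") as f:
            f.write("stub log line\n")


class DailyReport(luigi.Task):
    """Hypothetical aggregate step that depends on ExtractLogs."""
    date = luigi.DateParameter()

    def requires(self):
        # Luigi walks this dependency graph and re-runs only missing
        # targets -- exactly the bookkeeping ad hoc scripts fumble.
        return ExtractLogs(date=self.date)

    def output(self):
        return luigi.LocalTarget("data/reports/{}.txt".format(self.date))

    def run(self):
        with self.input().open() as fin, self.output().open("w") as fout:
            fout.write("lines: {}\n".format(sum(1 for _ in fin)))


if __name__ == "__main__":
    luigi.run()
```

Run it with something like `python pipeline.py DailyReport --date 2015-01-01 --local-scheduler`; Luigi checks each output() target and re-runs only what is missing.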
So, back to the question. I think aspiring data engineers have two distinct career paths:
1. Becoming an expert in a particular data engineering component: this would be building a query execution engine, designing a distributed stream processing system, etc. (It would be awesome if you decided to release it as open source.)
2. Becoming an expert on quickly and effectively deploying cloud services to get the job done: this is the skill most desired among data engineers at startups.
What not to become is one of these OSS DIY bigots: not good enough to build truly differentiating technology, but adamant about building and running their own <up and coming OSS technology>. These folks will be wiped out in the next decade or so.
> The biggest issue with big data is most of it sits unused
This is really variable. If you're at a place where they jumped on the bandwagon, then yes. But there are also lots of companies (and not just Google/FB/LinkedIn) that build mission-critical reporting and ML infrastructure on Hadoop. These companies appreciate the value of workflow coordination, and they wouldn't move ahead without (at least) Oozie/Azkaban in place to give some visibility into their workflows.
> But, in the long term, there will be a big change.
I think more types of work will become commoditized. If you just want log processing, there are lots of on-premises and cloud options. Splunk has been doing this forever. Ostensibly with good-enough BI software you could just focus on ingest, and everything else is drag and drop. On a long enough time frame, hand-rolling pipelines will become obsolete. This is like a 10+ year timeline for any player to get significant market share. In the meantime, people have to actually get stuff done, and their skills will be transferable because they understand distributed systems, ETL, warehousing, and a lot of other stuff that hasn't really changed in a decade.
> Becoming an expert in a particular data engineering component
Are you advocating that nobody writes Spark Streaming jobs, because they should rewrite Spark instead? Don't learn to work with Impala, learn to rewrite Impala? I disagree: the tools are only getting better, and it's going to take more and more work to displace the entrenched players. Working on top of solid tools will make you far more productive than indulging NIH and writing your own SQL engine.
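For contrast, this is what "working on top of" the tool looks like: the canonical Spark Streaming word count over a socket source, using the standard DStream API (the host/port are placeholders):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Two local worker threads; 10-second micro-batches.
sc = SparkContext("local[2]", "StreamingWordCount")
ssc = StreamingContext(sc, 10)

# Placeholder source: a text socket. A production job would read from
# Kafka, Flume, or similar.
lines = ssc.socketTextStream("localhost", 9999)

counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```

A dozen lines of application logic, with the engine handling batching, scheduling, and fault tolerance. Rebuilding that engine yourself is a multi-year project.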
> Becoming an expert on quickly and effectively deploying cloud services to get the job done
Like Redshift, EMR, and Amazon Data Pipeline? They're hardly turnkey solutions. Amazon's Kinesis is just Kafka with paid throughput; you can absolutely reuse your skills in the cloud without having to cave and get locked into a single vendor serving one specific use case.
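To illustrate how directly the skills carry over, here is a hypothetical producer written both ways, with kafka-python and boto3; the broker address, stream/topic names, and event are invented. The mental model, appending a keyed record to a partitioned stream, is identical:

```python
import json

import boto3
from kafka import KafkaProducer

# Same event, two transports; only the client library differs.
event = json.dumps({"user_id": 42, "action": "click"}).encode("utf-8")

# Kafka (self-managed or hosted), via kafka-python.
kafka_producer = KafkaProducer(bootstrap_servers="broker:9092")
kafka_producer.send("clickstream", value=event, key=b"42")
kafka_producer.flush()

# Kinesis (AWS-managed), via boto3.
kinesis = boto3.client("kinesis", region_name="us-east-1")
kinesis.put_record(StreamName="clickstream", Data=event, PartitionKey="42")
```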
> What not to become is one of these OSS DIY bigots: not good enough to build truly differentiating technology, but adamant about building and running their own <up and coming OSS technology>
So in your mind you either pick a vendor to handle all your data for you, or you're an "OSS DIY bigot"? Owning your entire user analytics pipeline isn't mission-critical for a startup, so it's stupid to build it yourself?
> These folks will be wiped out in the next decade or so.
Even though Oracle is amazing and great, lots of people still use Postgres, MySQL, etc. There's always going to be a continuum from "we should buy this turnkey thing" to "we started by rolling our own SQL query engine". You need to be able to identify when each is appropriate, not shoehorn in a one-size-fits-all solution.
[1] https://www.fluentd.org
[2] http://luigi.readthedocs.org
[3] http://www.treasuredata.com