>If you can talk intelligently about the whole grab bag of stuff these teams use, that'll get you in the door. Understanding RDBMSes, data warehousing concepts, and ETL is a big plus for people doing infrastructure work.
This is sadly true, for now. I don't think folks here disagree with the "true" part. Let me explain why it is "sadly" and "for now".
The biggest issue with big data is that most of it sits unused. In many organizations, HDFS ends up as an alternative to NetApp storage servers, holding terabytes of data in the hope that they will be useful one day.
In fact, if you have even gotten to the stage of using HDFS as a storage server, you must have a decent ETL team that can put data into HDFS with a menacing combination of ad hoc scripts and a workflow that looks like a cobweb produced by a deranged spider. For now, knowing the ins and outs of various semi-functional open source components, plus the tenacity, patience, and skill to deal with the gnarliest of ETL tasks, gets you a high-paying data engineering job.
But, in the long term, there will be a big change.
1. Tools are getting better: many data practitioners are realizing there are huge gaps between different data infrastructure components, and they are trying to fill those gaps. A lot of attention goes to query execution engines (Presto, Impala, Spark, etc.), but I find data collection/workflow management tools just as critical (if not higher leverage) right now. Tools like Fluentd (log collector) [1] and Luigi (workflow engine) [2] are OSS projects in this direction (see the Luigi sketch after this list).
2. Data-related cloud services are becoming really, really good: huge kudos to providers like AWS, GCP, and Heroku (through Add-ons). They are quickly building a great ecosystem of data processing/analysis/database components that frankly work better than most self-administered OSS counterparts. (Disclaimer: my perception might be colored here, since I work for a data processing/collaboration SaaS myself [3].)
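To make the workflow point in (1) concrete, here is a minimal Luigi sketch; the task names, paths, and stub logic are all hypothetical, just to show how declared dependencies replace the cron-and-bash cobweb:

```python
import luigi


class ExtractLogs(luigi.Task):
    """Hypothetical extract step: pull one day of raw logs to local disk."""
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget("data/raw/{}.log".format(self.date))

    def run(self):
        # A real task would pull from Fluentd, S3, etc.; stubbed here.
        with self.output().open("w") as f:
            f.write("stub log line\n")


class DailyReport(luigi.Task):
    """Hypothetical aggregate step that depends on ExtractLogs."""
    date = luigi.DateParameter()

    def requires(self):
        # Luigi walks this dependency graph and re-runs only missing
        # targets -- exactly the bookkeeping ad hoc scripts fumble.
        return ExtractLogs(date=self.date)

    def output(self):
        return luigi.LocalTarget("data/reports/{}.txt".format(self.date))

    def run(self):
        with self.input().open() as fin, self.output().open("w") as fout:
            fout.write("lines: {}\n".format(sum(1 for _ in fin)))


if __name__ == "__main__":
    luigi.run()
```

Run it with something like `python pipeline.py DailyReport --date 2015-01-01 --local-scheduler`; Luigi checks each output() target and re-runs only what is missing.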
So, back to the question. I think aspiring data engineers have two distinct career paths:
1. Becoming an expert in a particular data engineering component: this would be building a query execution engine, designing a distributed stream processing system, etc. (It would be awesome if you decided to release it as open source.)
2. Becoming an expert on quickly and effectively deploying cloud services to get the job done: this is the skill most desired among data engineers at startups.
What not to become is one of these OSS DIY bigots: not good enough to build truly differentiating technology, but adamant about building and running their own <up and coming OSS technology>. These folks will be wiped out in the next decade or so.
> The biggest issue with big data is most of it sits unused
This is really variable. If you're at a place where they jumped on the bandwagon, then yes. But there are also lots of companies (and not just Google/FB/LinkedIn) that build mission-critical reporting and ML infrastructure on Hadoop. These companies appreciate the value of workflow coordination, and they wouldn't move ahead without (at least) Oozie/Azkaban in place to give some visibility into their workflows.
> But, in the long term, there will be a big change.
I think more types of work will become commoditized. If you just want log processing, there are lots of on-premises and cloud options. Splunk has been doing this forever. Ostensibly with good-enough BI software you could just focus on ingest, and everything else is drag and drop. On a long enough time frame, hand-rolling pipelines will become obsolete. This is like a 10+ year timeline for any player to get significant market share. In the meantime, people have to actually get stuff done, and their skills will be transferable because they understand distributed systems, ETL, warehousing, and a lot of other stuff that hasn't really changed in a decade.
> Becoming an expert in a particular data engineering component
Are you advocating that nobody writes Spark Streaming jobs, because they should rewrite Spark instead? Don't learn to work with Impala, learn to rewrite Impala? I disagree: the tools are only getting better, and it's going to take more and more work to displace the entrenched players. Working on top of solid tools will make you far more productive than indulging NIH and writing your own SQL engine.
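For contrast, this is what "working on top of" the tool looks like: the canonical Spark Streaming word count over a socket source, using the standard DStream API (the host/port are placeholders):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Two local worker threads; 10-second micro-batches.
sc = SparkContext("local[2]", "StreamingWordCount")
ssc = StreamingContext(sc, 10)

# Placeholder source: a text socket. A production job would read from
# Kafka, Flume, or similar.
lines = ssc.socketTextStream("localhost", 9999)

counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```

A dozen lines of application logic, with the engine handling batching, scheduling, and fault tolerance. Rebuilding that engine yourself is a multi-year project.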
> Becoming an expert on quickly and effectively deploying cloud services to get the job done
Like Redshift, EMR, and Amazon Data Pipeline? They're hardly turnkey solutions. Amazon's Kinesis is just Kafka with paid throughput; you can absolutely reuse your skills in the cloud without having to cave and get locked into a single vendor serving one specific use case.
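To illustrate how directly the skills carry over, here is a hypothetical producer written both ways, with kafka-python and boto3; the broker address, stream/topic names, and event are invented. The mental model, appending a keyed record to a partitioned stream, is identical:

```python
import json

import boto3
from kafka import KafkaProducer

# Same event, two transports; only the client library differs.
event = json.dumps({"user_id": 42, "action": "click"}).encode("utf-8")

# Kafka (self-managed or hosted), via kafka-python.
kafka_producer = KafkaProducer(bootstrap_servers="broker:9092")
kafka_producer.send("clickstream", value=event, key=b"42")
kafka_producer.flush()

# Kinesis (AWS-managed), via boto3.
kinesis = boto3.client("kinesis", region_name="us-east-1")
kinesis.put_record(StreamName="clickstream", Data=event, PartitionKey="42")
```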
> What not to become is one of these OSS DIY bigots: not good enough to build truly differentiating technology, but adamant about building and running their own <up and coming OSS technology>
So in your mind you either pick a vendor to handle all your data for you, or you're an "OSS DIY bigot"? Owning your entire user analytics pipeline isn't mission-critical for a startup, so it's stupid to build it yourself?
> These folks will be wiped out in the next decade or so.
Even though Oracle is amazing and great, lots of people still use Postgres, MySQL, etc. There's always going to be a continuum from "we should buy this turnkey thing" to "we started by rolling our own SQL query engine". You need to be able to identify when each is appropriate, not shoehorn in a one-size-fits-all solution.
[1] https://www.fluentd.org
[2] http://luigi.readthedocs.org
[3] http://www.treasuredata.com