
I really like this!

I bootstrapped the ETL and data pipeline infrastructure at my last company with a combination of Bash, Python, and Node scripts duct-taped together. Super fragile, but effective[3]. It wasn't until about 3 years in (and 5x the initial revenue and volume) that it started having growing pains. Every time I tried to evaluate solutions like Airflow[1] or Luigi[2], there was just so much involved in getting it going reliably and migrating things over that it wasn't worth the effort[4].

This seems like a refreshingly opinionated solution that would have fit my use case perfectly.

[1] https://airflow.apache.org/

[2] https://github.com/spotify/luigi

[3] The operational complexity of real-time, distributed architectures is non-trivial. You'd be amazed how far some basic bash scripts running on cron jobs will take you.

[4] I was a one-man data management/analytics/BI team for the first two years, not a dedicated ETL resource with time to spend weeks getting a PoC based on Airflow or Luigi running. When I finally got our engineering team to spend some time on making the data pipelines less fragile, instead of using one of these open source solutions they took it as an opportunity to create a fancy scalable, distributed, asynchronous data pipeline system built on ECS, AWS Lambda, DynamoDB, and NodeJS. That system was never able to be used in production, as my fragile duct-taped solution turned out to be more robust.



> When I finally got our engineering team to spend some time on making the data pipelines less fragile, instead of using one of these open source solutions they took it as an opportunity to create a fancy scalable, distributed, asynchronous data pipeline system built on ECS, AWS Lambda, DynamoDB, and NodeJS. That system was never able to be used in production, as my fragile duct-taped solution turned out to be more robust.

"An engineer is one who, when asked to make a cup of tea, come up with a device to boil the ocean".

Source: unknown. I think it was in Grady Booch's OO book, or some such book.

Also: "architect astronaut" and "Better is the enemy of good" come to mind...


I think the reference to impossible things is far older than Grady Booch.

  To talk of many things:
  Of shoes--and ships--and sealing-wax--
  Of cabbages--and kings--
  And why the sea is boiling hot--
  And whether pigs have wings
  ~ Lewis Carroll, 1832-1898


You might take a look at Luigi again. It's pretty simple once you realize its "scheduler" is really just a global task lock that prevents the same task from running on multiple machines or processes at the same time.

The most onerous configuration you might have to do for a minimal setup is to make sure your workers know the scheduler URL, and that the scheduler is accessible to the workers. And of course, you have to do the actual scheduling yourself (e.g. cron).

That aside, there are few bits of infrastructure that are so quick and painless to stand up, IMHO. A database isn't even required (though desirable for task history).
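For a sense of how little there is to it, a minimal pipeline is roughly this (the task names, file paths, and sample data are all made up):

  import luigi

  class Extract(luigi.Task):
      date = luigi.DateParameter()

      def output(self):
          # Luigi treats the task as done once this target exists,
          # which is what gives you filesystem checkpointing for free.
          return luigi.LocalTarget("data/raw/{}.csv".format(self.date))

      def run(self):
          # LocalTarget.open("w") writes to a temp file and renames it on
          # close, so a crashed run can't leave a half-written "done" file.
          with self.output().open("w") as f:
              f.write("id,value\n1,42\n")

  class Transform(luigi.Task):
      date = luigi.DateParameter()

      def requires(self):
          return Extract(date=self.date)

      def output(self):
          return luigi.LocalTarget("data/clean/{}.csv".format(self.date))

      def run(self):
          with self.input().open() as src, self.output().open("w") as dst:
              dst.write(src.read().upper())

  if __name__ == "__main__":
      luigi.run()

Assuming that's saved as pipeline.py, running "python pipeline.py Transform --date 2018-06-01 --local-scheduler" works standalone; drop --local-scheduler and pass --scheduler-host once you want the central scheduler's task lock and web UI.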

This Mara project looks really cool too, though.


I started on Luigi and loved the simplicity of it (checkpointing completed tasks on the filesystem instead of a db), but the telemetry in the Airflow webui is more or less essential imho.


The simplicity of Luigi is great, but I fairly quickly found myself in a spot where the features of the Airflow scheduler/webui were really desirable over the rather ad hoc nature of Luigi. But after using Airflow a bit, I found myself really missing some of Luigi's simple niceties. I became pretty annoyed with Airflow's operational complexity and its overall lack of emphasis on idempotent/atomic jobs, at least when compared with Luigi.

To me, Luigi wins when it comes to atomic/idempotent operations and simplicity. It loses on scheduling and visibility. Today I'm still trying to figure out what the better trade off is.

I think I really want something with Luigi-like abstractions (tasks + targets) but with the global scheduling and visibility of Airflow.


What sort of operational complexities did you run into when using Airflow?

Regarding the idempotency of workflows, so much of that comes down to how you develop your DAG files. Having read through both sets of docs, I'd say they both pay lip service to idempotent workflows, but the heavy lifting of actually making your workflows idempotent is up to you.
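For what it's worth, the pattern I have in mind is something like this rough sketch (the DAG id, bucket path, and the load.py script with its flags are all invented); the point is just that every run writes to a partition keyed by its execution date, so re-running a past date overwrites that partition instead of appending:

  from datetime import datetime

  from airflow import DAG
  from airflow.operators.bash_operator import BashOperator

  dag = DAG(
      dag_id="example_idempotent_load",
      start_date=datetime(2018, 1, 1),
      schedule_interval="@daily",
  )

  # {{ ds }} is the run's execution date; re-running 2018-06-01 rebuilds
  # exactly the 2018-06-01 partition and nothing else.
  load_partition = BashOperator(
      task_id="load_partition",
      bash_command=(
          "python load.py --date {{ ds }} "
          "--output s3://my-bucket/events/dt={{ ds }}/ --overwrite"
      ),
      dag=dag,
  )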


Airflow requires task queues (e.g. celery), message broker (e.g. rabbitmq), a web service, a scheduler service, and a database. You also need worker clusters to read from your task queues and execute jobs. All those workers need every library or app that any of your DAGs require. None of these things is necessarily a big deal on its own, but it all adds up to a sizable investment in time, complexity, and cost to get up and running.

You'll probably also want some sort of shared storage to deploy DAGs. And then you have to come up with a deployment procedure/practice to make sure that DAGs aren't upgraded piecemeal while they are running. To be fair though, that is a problem with Luigi (or any distributed workflow system, probably).

Luigi, IMHO, is more "fire and forget" when it comes to atomicity/idempotency, because wherever possible it relies on the medium of the output target for those guarantees. The target class, ideally, abstracts all that stuff away, and the logic of a task can often be reused with a variety of input/output targets. I can easily write one base task and not care whether the output target is going to be a local filesystem, a remote host (via ssh/scp/sftp/ftp), Google Cloud Storage, or S3. With Airflow I always feel like I'm invoking operators like "CopyLocalFileToABigQueryTableButFTPItFirstAndAlsoWriteItToS3" (I'm exaggerating, but still... :P).
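To illustrate (the bucket name, paths, and use_s3 switch are invented, and S3Target assumes boto3/credentials are set up): the task body only ever touches self.output(), so swapping the storage medium doesn't touch the logic.

  import luigi
  from luigi.contrib.s3 import S3Target

  class ExportReport(luigi.Task):
      date = luigi.DateParameter()
      # Hypothetical switch just to illustrate swapping the backend.
      use_s3 = luigi.BoolParameter(default=False)

      def output(self):
          if self.use_s3:
              return S3Target("s3://my-bucket/reports/{}.csv".format(self.date))
          return luigi.LocalTarget("reports/{}.csv".format(self.date))

      def run(self):
          # The target class handles atomic writes for whichever backend
          # is behind it; the task logic stays exactly the same.
          with self.output().open("w") as f:
              f.write("id,value\n1,42\n")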


> Airflow requires task queues (e.g. celery), message broker (e.g. rabbitmq), a web service, a scheduler service, and a database. You also need worker clusters to read from your task queues and execute jobs.

All these are supported but the scheduler is pretty much the only requirement.

Source: been running Airflow for the last two years without a worker cluster, without having Celery/RabbitMQ installed, and sometimes without even an external database (i.e. just a plain SQLite file).


In the environment I work in, our ETL is all over the place. We have steps on SQL Server, steps in Hadoop, steps that need to be reviewed by a real person, some SAP, and a bunch of other technologies.

The solution we came up with is decentralized execution with a centralized log. Basically the entire graph lives in SQL Server, which every technology knows how to talk to. Then we have multiple "runners" (one runner per technology we use) which just ask the SQL server what's next. It also has a mechanism for storing system state caused by events outside the system. It's very simple, and has been very robust.
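A very rough sketch of what one of those runners could look like, assuming a tasks table I'm inventing here for illustration (a real one would add locking hints, retries, backoff, etc.):

  import time
  import pyodbc  # assumes an ODBC driver/DSN for the SQL Server is set up

  RUNNER_TYPE = "hadoop"  # one runner process per technology

  def run_task(payload):
      # Stand-in for the technology-specific execution step.
      print("running:", payload)

  def claim_next_task(conn):
      # Claim the next pending task for this runner type and mark it running.
      row = conn.execute(
          "UPDATE TOP (1) etl.tasks "
          "SET status = 'running' "
          "OUTPUT inserted.task_id, inserted.payload "
          "WHERE status = 'pending' AND runner_type = ?",
          RUNNER_TYPE,
      ).fetchone()
      conn.commit()
      return row

  def main():
      conn = pyodbc.connect("DSN=etl")  # hypothetical DSN
      while True:
          task = claim_next_task(conn)
          if task is None:
              time.sleep(30)  # nothing to do; poll again later
              continue
          task_id, payload = task
          try:
              run_task(payload)
              status = "done"
          except Exception:
              status = "failed"
          conn.execute(
              "UPDATE etl.tasks SET status = ? WHERE task_id = ?",
              (status, task_id),
          )
          conn.commit()

  if __name__ == "__main__":
      main()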



That looks like a nifty service, but we were an AWS shop and the bandwidth intensity of ETL would have made a GCP-hosted service cost-prohibitive. That said, at least once a month I tried to convince our CTO to let me move our data workloads over to GCP due to all the nifty managed services they have available.

I was primarily munging data between FTP drops, S3, RDS, and Redshift, which mainly fell into free buckets for internal data transfer.


Disclaimer: I'm the PM for Composer. :)

If the cost of Composer is an issue, ping me. Running a static environment _does_ have a cost, but for serious ETL it should be pretty inexpensive all things considered. You _should_ be able to use Airflow (in GCP or anywhere else) to call on other services like S3/Redshift so the data itself doesn't move through Airflow, keeping network tx low.

If it's network traffic for the actual data moving to and from, that's unfortunately an artifact of how public clouds price.


QQ, why is it called Google Cloud Composer? I totally would've named it Gooflow


My key advice for any new PM:

Every Dilbert cartoon about naming a product is true.


Engineer at Astronomer.io here. We offer Airflow as a managed cloud service as well as an affordably priced Enterprise Edition to run on your own infrastructure wherever you'd like. Check us out - and feel free to reach out to me personally if you have any questions.


I did multi-cloud, doing the data stuff in GCP (mostly GCS and BigQuery) and the rest in EC2 -- costs weren't really an issue, but I guess if you're moving TBs around daily, that's the problem?


Not quite TBs daily, but close. We were in B2B lead generation, and a lot of my ETL workloads involved heavy text normalization and standardization, then source layering to ultimately stitch together as complete and accurate a record as possible based on the heuristics we had available.

Providers of that type of data essentially live in a world of "dump the dataset to csv[1] periodically, drop the csv onto the FTP account of whoever is paying us for it currently". No deltas for changed or net-new records, no per-customer formatting requests, nothing. So the entire thing had to be re-processed every single time from every single vendor and then upserted into our master data.

[1] Hell, usually not even basic technical information was provided, like the character encoding the data was stored or exported in, or whether it uses database-style escapes (any potential special character is escaped with a backslash) or csv-style escapes (everything is interpreted as a literal except for a double quote, which is escaped with a second double quote).
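For the unfamiliar, the two conventions look like this, and Python's csv module can be told which one it's getting (the sample rows are made up):

  import csv
  import io

  # csv-style: quotes doubled inside a quoted field
  csv_style = 'id,text\n1,"He said ""hi"""\n'
  # database-style: special characters escaped with a backslash
  db_style = 'id,text\n1,He said \\"hi\\"\n'

  print(list(csv.reader(io.StringIO(csv_style))))
  # [['id', 'text'], ['1', 'He said "hi"']]

  print(list(csv.reader(io.StringIO(db_style), doublequote=False, escapechar="\\")))
  # [['id', 'text'], ['1', 'He said "hi"']]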


Tangentially, I've usually seen "multicloud" for that and "hybrid cloud" for combining remote cloud services with on-premises resources in a blended system.


Agree, multicloud is a better term for this.


Did you consider using Singer by chance (https://www.singer.io)? I'm wondering how that compares


Re points 3 and 4: you can use Airflow in a non-distributed manner to just run bash jobs (bash running Python in my case). The telemetry it collects and the web interface give you a lot of visibility that you don't get with cron and plain bash jobs.

I started with Airflow in this capacity -- maybe taking a day or two to set it up, figuring I'd add Celery when we needed it. It's been maybe 1.5 years and task queues may never be needed; we can always just buy a bigger VM on EC2, as it's only 8-core/15GB atm.




