
I really like this!

I bootstrapped the ETL and data pipeline infrastructure at my last company with a combination of Bash, Python, and Node scripts duct-taped together. Super fragile, but effective[3]. It wasn't until about 3 years in (and 5x the initial revenue and volume) that it started having growing pains. Every time I tried to evaluate solutions like Airflow[1] or Luigi[2], there was just so much involved in getting it going reliably and migrating things over that it wasn't worth the effort[4].

This seems like a refreshingly opinionated solution that would have fit my use case perfectly.

[1] https://airflow.apache.org/

[2] https://github.com/spotify/luigi

[3] The operational complexity of real-time, distributed architectures is non-trivial. You'd be amazed how far some basic bash scripts running on cron jobs will take you.

[4] I was a one-man data management/analytics/BI team for the first two years, not a dedicated ETL resource with time to spend weeks getting a PoC based on Airflow or Luigi running. When I finally got our engineering team to spend some time on making the data pipelines less fragile, instead of using one of these open source solutions they took it as an opportunity to create a fancy scalable, distributed, asynchronous data pipeline system built on ECS, AWS Lambda, DynamoDB, and NodeJS. That system was never able to be used in production, as my fragile duct-taped solution turned out to be more robust.



> When I finally got our engineering team to spend some time on making the data pipelines less fragile, instead of using one of these open source solutions they took it as an opportunity to create a fancy scalable, distributed, asynchronous data pipeline system built on ECS, AWS Lambda, DynamoDB, and NodeJS. That system was never able to be used in production, as my fragile duct-taped solution turned out to be more robust.

"An engineer is one who, when asked to make a cup of tea, come up with a device to boil the ocean".

Source: unknown. I think it was in Grady Booch's OO book, or some such book.

Also: "architect astronaut" and "Better is the enemy of good" come to mind...


I think the reference to impossible things is far older than Grady Booch.

  To talk of many things:
  Of shoes--and ships--and sealing-wax--
  Of cabbages--and kings--
  And why the sea is boiling hot--
  And whether pigs have wings
  ~ Lewis Carroll, 1832-1898


You might take a look at Luigi again. It's pretty simple once you realize its "scheduler" is really just a global task lock that prevents the same task from running on multiple machines or processes at the same time.

The most onerous configuration you might have to do for a minimal setup is to make sure your workers know the scheduler URL, and that the scheduler is accessible to the workers. And of course, you have to do the actual scheduling yourself (e.g. cron).

That aside, there are few bits of infrastructure that are so quick and painless to stand up, IMHO. A database isn't even required (though desirable for task history).
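For a sense of how little there is to it, a minimal pipeline is roughly this (the task names, file paths, and sample data are all made up):

  import luigi

  class Extract(luigi.Task):
      date = luigi.DateParameter()

      def output(self):
          # Luigi treats the task as done once this target exists,
          # which is what gives you filesystem checkpointing for free.
          return luigi.LocalTarget("data/raw/{}.csv".format(self.date))

      def run(self):
          # LocalTarget.open("w") writes to a temp file and renames it on
          # close, so a crashed run can't leave a half-written "done" file.
          with self.output().open("w") as f:
              f.write("id,value\n1,42\n")

  class Transform(luigi.Task):
      date = luigi.DateParameter()

      def requires(self):
          return Extract(date=self.date)

      def output(self):
          return luigi.LocalTarget("data/clean/{}.csv".format(self.date))

      def run(self):
          with self.input().open() as src, self.output().open("w") as dst:
              dst.write(src.read().upper())

  if __name__ == "__main__":
      luigi.run()

Assuming that's saved as pipeline.py, running "python pipeline.py Transform --date 2018-06-01 --local-scheduler" works standalone; drop --local-scheduler and pass --scheduler-host once you want the central scheduler's task lock and web UI.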

This Mara project looks really cool too, though.


I started on Luigi and loved the simplicity of it (checkpointing completed tasks on the filesystem instead of a db), but the telemetry in the Airflow webui is more or less essential imho.


The simplicity of Luigi is great, but I fairly quickly found myself in a spot where the features of the Airflow scheduler/webui were really desirable over the rather ad hoc nature of Luigi. But after using Airflow a bit, I found myself really missing some of Luigi's simple niceties. I became pretty annoyed with Airflow's operational complexity and its overall lack of emphasis on idempotent/atomic jobs, at least when compared with Luigi.

To me, Luigi wins when it comes to atomic/idempotent operations and simplicity. It loses on scheduling and visibility. Today I'm still trying to figure out what the better trade off is.

I think I really want something with Luigi-like abstractions (tasks + targets) but with the global scheduling and visibility of Airflow.


What sort of operational complexities did you run into when using Airflow?

Regarding the idempotency of workflows, so much of that comes down to how you develop your DAG files. Having read through both sets of docs, I'd say they both pay lip service to idempotent workflows, but the heavy lifting of actually making your workflows idempotent is up to you.
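For what it's worth, the pattern I have in mind is something like this rough sketch (the DAG id, bucket path, and the load.py script with its flags are all invented); the point is just that every run writes to a partition keyed by its execution date, so re-running a past date overwrites that partition instead of appending:

  from datetime import datetime

  from airflow import DAG
  from airflow.operators.bash_operator import BashOperator

  dag = DAG(
      dag_id="example_idempotent_load",
      start_date=datetime(2018, 1, 1),
      schedule_interval="@daily",
  )

  # {{ ds }} is the run's execution date; re-running 2018-06-01 rebuilds
  # exactly the 2018-06-01 partition and nothing else.
  load_partition = BashOperator(
      task_id="load_partition",
      bash_command=(
          "python load.py --date {{ ds }} "
          "--output s3://my-bucket/events/dt={{ ds }}/ --overwrite"
      ),
      dag=dag,
  )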


Airflow requires task queues (e.g. celery), message broker (e.g. rabbitmq), a web service, a scheduler service, and a database. You also need worker clusters to read from your task queues and execute jobs. All those workers need every library or app that any of your DAGs require. None of these things is necessarily a big deal on its own, but it all adds up to a sizable investment in time, complexity, and cost to get up and running.

You'll probably also want some sort of shared storage to deploy DAGs. And then you have to come up with a deployment procedure/practice to make sure that DAGs aren't upgraded piecemeal while they are running. To be fair though, that is a problem with Luigi (or any distributed workflow system, probably).

Luigi, IMHO, is more "fire and forget" when it comes to atomicity/idempotency, because wherever possible it relies on the medium of the output target for those guarantees. The target class, ideally, abstracts all that stuff away, and the logic of a task can often be reused with a variety of input/output targets. I can easily write one base task and not care whether the output target is going to be a local filesystem, a remote host (via ssh/scp/sftp/ftp), Google Cloud Storage, or S3. With Airflow I always feel like I'm invoking operators like "CopyLocalFileToABigQueryTableButFTPItFirstAndAlsoWriteItToS3" (I'm exaggerating, but still... :P).
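To illustrate (the bucket name, paths, and use_s3 switch are invented, and S3Target assumes boto3/credentials are set up): the task body only ever touches self.output(), so swapping the storage medium doesn't touch the logic.

  import luigi
  from luigi.contrib.s3 import S3Target

  class ExportReport(luigi.Task):
      date = luigi.DateParameter()
      # Hypothetical switch just to illustrate swapping the backend.
      use_s3 = luigi.BoolParameter(default=False)

      def output(self):
          if self.use_s3:
              return S3Target("s3://my-bucket/reports/{}.csv".format(self.date))
          return luigi.LocalTarget("reports/{}.csv".format(self.date))

      def run(self):
          # The target class handles atomic writes for whichever backend
          # is behind it; the task logic stays exactly the same.
          with self.output().open("w") as f:
              f.write("id,value\n1,42\n")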


> Airflow requires task queues (e.g. celery), message broker (e.g. rabbitmq), a web service, a scheduler service, and a database. You also need worker clusters to read from your task queues and execute jobs.

All these are supported but the scheduler is pretty much the only requirement.

Source: been running Airflow for the last two years without a worker cluster, without having Celery/RabbitMQ installed, and sometimes without even an external database (i.e. just a plain SQLite file).


In the environment I work in, our ETL is all over the place. We have steps on SQL Server, steps in Hadoop, steps that need to be reviewed by a real person, some SAP, and a bunch of other technologies.

The solution we came up with is decentralized execution with a centralized log. Basically the entire graph lives in SQL Server, which every technology knows how to talk to. Then we have multiple "runners" (one runner per technology we use) which just ask the SQL server what's next. It also has a mechanism for storing system state caused by events outside the system. It's very simple, and has been very robust.
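A very rough sketch of what one of those runners could look like, assuming a tasks table I'm inventing here for illustration (a real one would add locking hints, retries, backoff, etc.):

  import time
  import pyodbc  # assumes an ODBC driver/DSN for the SQL Server is set up

  RUNNER_TYPE = "hadoop"  # one runner process per technology

  def run_task(payload):
      # Stand-in for the technology-specific execution step.
      print("running:", payload)

  def claim_next_task(conn):
      # Claim the next pending task for this runner type and mark it running.
      row = conn.execute(
          "UPDATE TOP (1) etl.tasks "
          "SET status = 'running' "
          "OUTPUT inserted.task_id, inserted.payload "
          "WHERE status = 'pending' AND runner_type = ?",
          RUNNER_TYPE,
      ).fetchone()
      conn.commit()
      return row

  def main():
      conn = pyodbc.connect("DSN=etl")  # hypothetical DSN
      while True:
          task = claim_next_task(conn)
          if task is None:
              time.sleep(30)  # nothing to do; poll again later
              continue
          task_id, payload = task
          try:
              run_task(payload)
              status = "done"
          except Exception:
              status = "failed"
          conn.execute(
              "UPDATE etl.tasks SET status = ? WHERE task_id = ?",
              (status, task_id),
          )
          conn.commit()

  if __name__ == "__main__":
      main()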



That looks like a nifty service, but we were an AWS shop and the bandwidth intensity of ETL would have made a GCP-hosted service cost-prohibitive. That said, at least once a month I tried to convince our CTO to let me move our data workloads over to GCP due to all the nifty managed services they have available.

I was primarily munging data between FTP drops, S3, RDS, and Redshift, which mainly fell into free buckets for internal data transfer.


Disclaimer: I'm the PM for Composer. :)

If the cost of Composer is an issue, ping me. Running a static environment _does_ have a cost, but for serious ETL it should be pretty inexpensive all things considered. You _should_ be able to use Airflow (in GCP or anywhere else) to call on other services like S3/Redshift so the data itself doesn't move through Airflow, keeping network tx low.

If it's network traffic for the actual data moving to and from, that's unfortunately an artifact of how public clouds price.


QQ, why is it called Google Cloud Composer? I totally would've named it Gooflow


My key advice for any new PM:

Every Dilbert cartoon about naming a product is true.


Engineer at Astronomer.io here. We offer Airflow as a managed cloud service as well as an affordably priced Enterprise Edition to run on your own infrastructure wherever you'd like. Check us out - and feel free to reach out to me personally if you have any questions.


I did multi-cloud, doing the data stuff in GCP (mostly GCS and BigQuery) and the rest in EC2 -- costs weren't really an issue, but I guess if you're moving TBs around daily, that's the problem?


Not quite TBs daily, but close. We were in B2B lead generation, and a lot of my ETL workloads involved heavy text normalization and standardization, then source layering to ultimately stitch together as complete and accurate a record as possible based on the heuristics we had available.

Providers of that type of data essentially live in a world of "dump the dataset to csv[1] periodically, drop the csv onto the FTP account of whoever is paying us for it currently". No deltas for changed or net-new records, no per-customer formatting requests, nothing. So the entire thing had to be re-processed every single time from every single vendor and then upserted into our master data.

[1] Hell, usually not even basic technical information was provided, like the character encoding the data was stored or exported in, or whether it uses database-style escapes (any potential special character is escaped with a backslash) or csv-style escapes (everything is interpreted as a literal except for a double quote, which is escaped with a second double quote).
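For the unfamiliar, the two conventions look like this, and Python's csv module can be told which one it's getting (the sample rows are made up):

  import csv
  import io

  # csv-style: quotes doubled inside a quoted field
  csv_style = 'id,text\n1,"He said ""hi"""\n'
  # database-style: special characters escaped with a backslash
  db_style = 'id,text\n1,He said \\"hi\\"\n'

  print(list(csv.reader(io.StringIO(csv_style))))
  # [['id', 'text'], ['1', 'He said "hi"']]

  print(list(csv.reader(io.StringIO(db_style), doublequote=False, escapechar="\\")))
  # [['id', 'text'], ['1', 'He said "hi"']]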


Tangentially, I've usually seen "multicloud" for that and "hybrid cloud" for combining remote cloud services with on-premises resources in a blended system.


Agree, multicloud is a better term for this.


Did you consider using Singer by chance (https://www.singer.io)? I'm wondering how that compares


Re points 3 and 4: you can use Airflow in a non-distributed manner to just run bash jobs (bash running Python in my case). The telemetry it collects and the web interface give you a lot of visibility that you don't get with cron and plain bash jobs.

I started with Airflow in this capacity -- maybe taking a day or two to set it up, figuring I'd add Celery when we needed it. It's been maybe 1.5 years and task queues may never be needed; we can always just buy a bigger VM on EC2, as it's only 8-core/15GB atm.




