Airflow: scaling out recommendations by Shopify
Shopify runs over 10k DAGs, with 150k DAG runs per day and over 400 tasks running at any given moment on average.
This is a brief overview of their approach. Link to source article.
Fast file access
Problem: reading DAG files from Google Cloud Storage (mounted as a filesystem via GCSFuse) is too slow.
Solution:
— Use a Network File System (NFS) volume instead of reading DAGs directly from Google Cloud Storage.
— For users' convenience, still allow them to upload files to GCS (through the GCS UI/API/CLI), then sync the files to NFS via a separate script (see the sketch below).
Pro: GCS IAM makes it possible to control which users can access which DAG locations.
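A minimal sketch of such a sync step, assuming gsutil is available; the bucket name and NFS mount path are hypothetical, and Shopify's actual script is not public:

```python
import subprocess

GCS_DAGS_URI = "gs://example-airflow-dags"  # hypothetical bucket with user-uploaded DAGs
NFS_DAGS_DIR = "/mnt/nfs/airflow/dags"      # NFS mount visible to the scheduler and workers


def sync_dags_to_nfs() -> None:
    """Mirror the GCS DAG bucket onto the NFS volume, removing deleted files."""
    subprocess.run(
        ["gsutil", "-m", "rsync", "-r", "-d", GCS_DAGS_URI, NFS_DAGS_DIR],
        check=True,
    )


if __name__ == "__main__":
    sync_dags_to_nfs()
```

Run it on a schedule (e.g. a cron job or a sidecar container) so the NFS copy stays close to what users upload.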
Reduce metadata
Problem: Airflow performance degrades due to metadata volume.
Solution: introduce a metadata retention policy (e.g. 28 days) and regularly delete older historical data (DagRuns, TaskInstances, Logs, TaskRetries, etc.) via the ORM (see the sketch below).
Con: Airflow features which rely on durable job history (for example, long-running backfills) become unusable.
Alternative: the airflow db clean command available in Airflow 2.3+.
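A minimal sketch of such an ORM-based cleanup, assuming Airflow 2.x and a 28-day retention window (which models to purge and the exact cutoff column may vary by Airflow version):

```python
from datetime import datetime, timedelta, timezone

from airflow.models import DagRun, Log, TaskInstance
from airflow.utils.session import provide_session

RETENTION = timedelta(days=28)  # assumed retention window


@provide_session
def purge_old_metadata(session=None):
    """Delete metadata rows older than the retention window."""
    cutoff = datetime.now(timezone.utc) - RETENTION
    # Delete children before parents to avoid foreign-key surprises.
    session.query(TaskInstance).filter(TaskInstance.start_date < cutoff).delete(
        synchronize_session=False
    )
    session.query(DagRun).filter(DagRun.execution_date < cutoff).delete(
        synchronize_session=False
    )
    session.query(Log).filter(Log.dttm < cutoff).delete(synchronize_session=False)
```

Running this as a periodic maintenance job keeps the metadata database at a roughly constant size.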
Manifest file
Problems:
— In a multi-tenant setting it is hard to associate DAGs with users and teams.
— Airflow serves users with different access levels ⇒ without proper permission controls, users end up with overly broad access.
Solution:
— Introduce a manifest file containing the DAG owner's email, code source, namespace, access restrictions, resource pool specifications, etc.
— Reject DAGs that violate assertions declared in the manifest file; Shopify raises AirflowClusterPolicyViolation in this case (see the sketch below).
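A minimal sketch of enforcing such checks via Airflow's cluster policy hook (dag_policy in airflow_local_settings.py, Airflow 2.x); the manifest path and field names are hypothetical:

```python
# airflow_local_settings.py
import yaml

from airflow.exceptions import AirflowClusterPolicyViolation

MANIFEST_PATH = "/etc/airflow/manifest.yaml"  # hypothetical manifest location

with open(MANIFEST_PATH) as f:
    # Hypothetical structure: {dag_id: {"owner_email": ..., "namespace": ..., ...}}
    MANIFEST = yaml.safe_load(f)


def dag_policy(dag):
    """Cluster policy hook: called for every parsed DAG; raising rejects it."""
    entry = MANIFEST.get(dag.dag_id)
    if entry is None:
        raise AirflowClusterPolicyViolation(
            f"DAG {dag.dag_id} is not declared in the manifest file."
        )
    if not entry.get("owner_email"):
        raise AirflowClusterPolicyViolation(
            f"DAG {dag.dag_id} has no owner email declared in the manifest."
        )
```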
Consistent distribution of load
Problem: humans tend to schedule DAGs unevenly, e.g. at "round" times (like :00 minutes or :00:00 hours) or via unpredictable schedule_interval=timedelta(…). This creates large surges of traffic which can overload the Airflow scheduler.
Solution: use a deterministically randomized schedule interval. Determinism comes from hashing a constant seed such as the dag_id (see the sketch below).
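A minimal sketch of deriving such an offset; the dag_id is hypothetical, and a stable hash (MD5) is used because Python's built-in hash() is randomized per process:

```python
import hashlib


def deterministic_minute(dag_id: str) -> int:
    """Map a dag_id to a stable minute offset in [0, 59]."""
    digest = hashlib.md5(dag_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 60


dag_id = "orders_hourly_export"  # hypothetical DAG
schedule = f"{deterministic_minute(dag_id)} * * * *"  # e.g. "37 * * * *" instead of "0 * * * *"
```

Because the offset is a pure function of the dag_id, every scheduler parse produces the same cron expression, while the fleet of DAGs as a whole spreads evenly across the hour.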

Resource contention
Problem: there are many potential points of resource contention within Airflow.
Solution: use built-in Airflow concurrency control features (see the sketch after this list), such as:
— pools (useful for reducing disruptions caused by bursts in traffic)
— priority weights (useful for ensuring that latency-sensitive tasks run before lower-priority ones).
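A minimal sketch of combining both features on a task; the DAG, pool name, and command are hypothetical, and the pool itself must already exist (e.g. created via the UI or airflow pools set):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="orders_load",                # hypothetical DAG
    start_date=datetime(2023, 1, 1),
    schedule_interval="37 * * * *",      # offset schedule, as discussed above
    catchup=False,
) as dag:
    load = BashOperator(
        task_id="load_orders",
        bash_command="echo 'load orders'",  # placeholder command
        pool="warehouse_db",      # caps how many tasks hit the warehouse concurrently
        priority_weight=10,       # scheduled ahead of lower-weight tasks competing for slots
    )
```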
Different execution environments
Problem: different tasks may rely on different sets of dependencies (e.g. Python libraries) and requirements (e.g. hardware resources).
Solution: route tasks via Celery queues to differently configured workers (see the sketch below).
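A minimal sketch of routing a task to a dedicated queue; the DAG, queue name, and command are hypothetical, and only workers started against that queue (airflow celery worker --queues ml_training) will pick the task up:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="model_training",             # hypothetical DAG
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
) as dag:
    train = BashOperator(
        task_id="train_model",
        bash_command="echo 'train model'",  # placeholder command
        queue="ml_training",      # routed only to workers subscribed to this queue
    )
```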
---
Brought to you by @pydwh 💛