Design a Job Scheduler
Design a distributed job scheduler that can handle high-throughput job processing (10,000+ jobs per second), support both scheduled (cron-based) and ad-hoc job execution, include retry mechanisms for failed jobs, and maintain execution history for up to one year.
Asked at:
Robinhood
Meta

Netflix
This question asks you to design a distributed job scheduler—think Airflow, Quartz, or AWS EventBridge/CloudWatch Events—that can trigger and run 10,000 jobs per second, with reliable execution semantics and a year of searchable history. Interviewers use this problem to test whether you can separate scheduling from execution, design for high throughput and bursty "top-of-minute" load, and reason about delivery guarantees (at-least-once, idempotency, retries). They also look for pragmatic data modeling for long-term history (hot vs. cold), worker lifecycle management, and strategies to avoid single points of failure while keeping the scheduler effectively stateless.
Common Functional Requirements
Most candidates end up covering this set of core functionalities
Users should be able to create jobs that run immediately, at a specific time in the future, or on a recurring schedule using cron expressions.
Users should be able to submit ad‑hoc job runs in addition to scheduled executions.
Users should be able to view live job status and detailed execution history (timestamps, attempts, outcome, logs) for up to one year.
Users should be able to configure execution policies such as retries with backoff, timeouts, and optional concurrency limits per job.
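To make these requirements concrete, here is a hypothetical job-creation payload; the field names, defaults, and cron syntax shown are illustrative assumptions, not a fixed API.

```python
# Hypothetical job-definition payload covering the requirements above;
# every field name and default here is illustrative.
create_job_request = {
    "name": "nightly-report",
    "schedule": {"cron": "0 2 * * *"},       # or {"run_at": "..."} for one-off, or omitted for run-now
    "payload": {"report": "daily_pnl"},       # opaque arguments passed through to the handler
    "retry_policy": {
        "max_attempts": 5,
        "backoff": "exponential",             # with jitter; see the retry deep dive below
        "initial_interval_seconds": 30,
    },
    "timeout_seconds": 600,                   # abort and retry runs that exceed this
    "max_concurrent_runs": 1,                 # optional per-job concurrency limit
}
```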
Common Deep Dives
Common follow-up questions interviewers like to ask for this question
At this scale, a single naive cron process polling the database creates hot partitions and jitter, and it falls over under bursty "top-of-minute" spikes. You need a two-tier plan that separates long-term storage of future runs from a short-term, time-ordered ready queue, plus sharding and backpressure.
- Use a two-tier scheduler: persist future executions durably (e.g., keyed by scheduled_at), and continuously materialize the next N minutes into a near-term ready queue (e.g., a Redis Sorted Set or a delayed queue). Jobs created to run in the near term can be fast-pathed directly into the ready queue.
- Shard the scan by time buckets or a hash of job_id, and run multiple schedulers active/active with leader election per shard. Control clock drift with NTP and use a small overlap buffer to avoid missed runs.
- Apply backpressure: meter promotions into the ready queue, autoscale workers based on queue depth and execution latency, and bound the queue with rate limits to prevent thundering herds.
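A minimal sketch of the promotion path, assuming a durable store holds future executions and a Redis Sorted Set (via redis-py) serves as the near-term ready queue; the key name, lookahead window, and the fetch_due query are placeholders, not a prescribed schema.

```python
import time
import redis  # assumes the redis-py client is available

READY_SET = "jobs:ready"       # hypothetical sorted-set key, scored by scheduled_at
LOOKAHEAD_SECONDS = 300        # materialize roughly the next 5 minutes of runs

r = redis.Redis(decode_responses=True)

def promote_due_executions(fetch_due):
    """Move executions due within the lookahead window from durable storage
    into the near-term ready queue (a sorted set scored by run time)."""
    horizon = time.time() + LOOKAHEAD_SECONDS
    # fetch_due(horizon) stands in for a query such as:
    #   SELECT execution_id, scheduled_at FROM executions
    #   WHERE scheduled_at <= :horizon AND state = 'PENDING'
    for execution_id, scheduled_at in fetch_due(horizon):
        # ZADD is idempotent per member, so re-scanning an overlap window is safe.
        r.zadd(READY_SET, {execution_id: scheduled_at})

def pop_ready_executions(batch_size=100):
    """Claim executions whose scheduled time has passed."""
    now = time.time()
    claimed = []
    # Read due members, then claim each with ZREM; a ZREM return of 1 means this
    # poller won the race against any other poller reading the same window.
    for member in r.zrangebyscore(READY_SET, 0, now, start=0, num=batch_size):
        if r.zrem(READY_SET, member):
            claimed.append(member)
    return claimed
```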
Distributed systems make "exactly once" unrealistic; retries, crashes, and network splits happen. Interviewers expect you to combine delivery guarantees with idempotent handlers or lightweight deduplication so the business effect is exactly-once even if the job runs more than once.
- Use invisibility timeouts/leases when workers pull jobs. Renew leases with heartbeats for long runs; if a worker dies, the job becomes visible again for retry. Send failures to a dead-letter queue after max attempts.
- Make handlers idempotent via idempotency keys (job_id + scheduled_at) and conditional writes. Consider a small dedupe table/window keyed by that token to ignore replays.
- Add exponential backoff and jitter to retries to reduce contention and avoid stampedes after transient outages.
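A minimal sketch of the idempotency-key approach, assuming Redis (redis-py) as the dedupe store; the key prefix, TTL, and token format are illustrative assumptions.

```python
import hashlib
import redis  # any store with conditional writes works; Redis is assumed here

r = redis.Redis()
DEDUPE_TTL_SECONDS = 24 * 3600  # keep tokens long enough to cover the retry window

def idempotency_key(job_id: str, scheduled_at: int) -> str:
    """One logical run = one token, no matter how many times it is delivered."""
    return hashlib.sha256(f"{job_id}:{scheduled_at}".encode()).hexdigest()

def run_once(job_id: str, scheduled_at: int, handler) -> bool:
    """Execute the handler only if this (job_id, scheduled_at) has not run yet."""
    token = idempotency_key(job_id, scheduled_at)
    # SET NX is the conditional write: the first delivery to claim the token wins;
    # replays and redeliveries see the existing token and skip the business effect.
    if not r.set(f"dedupe:{token}", "in_progress", nx=True, ex=DEDUPE_TTL_SECONDS):
        return False
    try:
        handler()
        r.set(f"dedupe:{token}", "done", ex=DEDUPE_TTL_SECONDS)
        return True
    except Exception:
        # Release the token so a later retry (with backoff) can attempt the run again.
        r.delete(f"dedupe:{token}")
        raise
```

The conditional write is what turns at-least-once delivery into an exactly-once business effect: duplicate deliveries return early instead of re-running the handler.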
Worker management is where many designs become fragile. You need predictable scaling based on load, safe shutdowns that don't drop work, and a plan for failures without a central bottleneck.
- Prefer pull-based workers so the scheduler stays stateless. Implement worker heartbeats, per-worker concurrency limits, and cooperative cancellation/timeout handling.
- Autoscale by queue depth, age of the oldest message, and recent processing latency. Cap concurrency per job or tenant to avoid noisy-neighbor issues.
- Implement graceful draining: stop fetching new work, extend in-flight leases until completion, and use a coordinator (or lightweight membership service) to manage rolling updates without dropping jobs.
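A sketch of a pull-based worker with heartbeat lease renewal and graceful draining; fetch_job, execute, and renew_lease are placeholders for the queue client, job handler, and lease API, which are not specified in the original.

```python
import signal
import threading
import time

class Worker:
    """Pull-based worker with cooperative draining: on SIGTERM it stops
    fetching new work but lets the in-flight job finish before exiting."""

    def __init__(self, fetch_job, execute, renew_lease, heartbeat_interval=10.0):
        self._fetch_job = fetch_job          # e.g., pops from the ready queue
        self._execute = execute              # runs the job's handler
        self._renew_lease = renew_lease      # heartbeat that extends the job's lease
        self._heartbeat_interval = heartbeat_interval
        self._draining = threading.Event()
        signal.signal(signal.SIGTERM, lambda *_: self._draining.set())

    def _heartbeat(self, job, done: threading.Event):
        # Keep the lease alive for long-running jobs so the queue does not
        # redeliver the job to another worker while this one is still working.
        while not done.wait(self._heartbeat_interval):
            self._renew_lease(job)

    def run(self):
        while not self._draining.is_set():
            job = self._fetch_job()
            if job is None:
                time.sleep(1.0)              # idle backoff; tune as needed
                continue
            done = threading.Event()
            hb = threading.Thread(target=self._heartbeat, args=(job, done), daemon=True)
            hb.start()
            try:
                self._execute(job)           # finishes even if draining was requested mid-run
            finally:
                done.set()
        # Reaching here means no job is in flight; the orchestrator can safely
        # terminate the process as part of a rolling update.
```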
History at 10k jobs/sec is a write-heavy firehose. Interviewers want to see hot vs. cold storage, partitioning to avoid hotspots, and efficient retrieval by job_id and time range.
- Separate hot (recent months) and cold (older) data. Use a write-optimized KV/column store with time-bucketed partitions and TTLs; archive aged records and logs to object storage with compact indexes.
- Model keys for common access patterns: partition by job_id and month (or another time bucket) to spread writes, with GSIs/secondary indexes for status or time-range queries.
- Store summaries/metrics (counts, durations, error rates) separately from verbose logs; pre-aggregate rollups to keep queries fast and storage lean.
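One way to shape a history row under these assumptions (a DynamoDB/Cassandra-style key-value layout with a hot-tier TTL and verbose logs offloaded to object storage); all field names and the 90-day hot window are illustrative.

```python
from datetime import datetime, timezone

HOT_TTL_DAYS = 90  # keep ~3 months hot; older records are archived to object storage

def history_record(job_id: str, execution_id: str, started_at: datetime,
                   status: str, attempt: int, duration_ms: int) -> dict:
    """Shape an execution-history row for a write-optimized KV/column store.

    Partitioning by (job_id, month bucket) spreads writes across partitions and
    keeps "show me job X's runs for March" a single-partition range scan.
    """
    month_bucket = started_at.strftime("%Y-%m")                        # e.g. "2025-09"
    return {
        "partition_key": f"{job_id}#{month_bucket}",                   # job + time bucket
        "sort_key": f"{started_at.isoformat()}#{execution_id}",        # time-ordered within partition
        "status": status,                                              # e.g. SUCCEEDED / FAILED / TIMED_OUT
        "attempt": attempt,
        "duration_ms": duration_ms,
        "expires_at": int(started_at.timestamp()) + HOT_TTL_DAYS * 86400,  # TTL for the hot tier
        "log_pointer": f"s3://job-logs/{job_id}/{execution_id}.log",       # verbose logs live in object storage
    }

# Example: a record for one run of a hypothetical "daily-report" job
record = history_record(
    job_id="daily-report",
    execution_id="exec-123",
    started_at=datetime(2025, 9, 15, 4, 0, tzinfo=timezone.utc),
    status="SUCCEEDED",
    attempt=1,
    duration_ms=4200,
)
```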
Relevant Patterns
Relevant patterns that you should know for this question
Jobs may run for seconds to minutes and must survive worker restarts and preemption. The pattern provides leases/visibility timeouts, heartbeats for renewal, retries with exponential backoff, and dead‑letter handling to deliver at‑least‑once execution with clear semantics.
Scheduling and history ingest together can exceed 10k writes/sec. You need partitioning, time bucketing, and write-optimized schemas to avoid hot partitions and to sustain high throughput while keeping history queries fast.
Cron workloads spike at boundaries (e.g., :00 of every minute) and after outages. Designing for backpressure, jittered retries, distributed locks/shards, and fair scheduling prevents thundering herds and overloaded components.
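As one concrete example of jittered retries, the "full jitter" variant draws each retry delay uniformly from zero up to a capped exponential bound, so a burst of simultaneous failures does not retry in lockstep; the base and cap below are illustrative.

```python
import random

def backoff_with_full_jitter(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Retry delay in seconds: exponential growth capped at `cap`, with the
    actual sleep drawn uniformly from [0, capped delay] ("full jitter")."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```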
Relevant Technologies
Relevant technologies that could be used to solve this question
Similar Problems to Practice
Related problems to practice for this question
Both designs center on time-based triggering, distributed execution, retries, and idempotency. Even without DAGs, the core concerns—promoting due work, worker pools, and execution tracking—are the same.
A crawler uses pull-based workers, leases, retries, and deduplication to process large queues reliably, mirroring the scheduler’s execution layer and at-least-once semantics.
Heavy, sustained write throughput and time-bucketed storage requirements are shared: you must partition by time and keys, manage hot vs. cold data, and support efficient time-range queries at scale.
Red Flags to Avoid
Common mistakes that can sink candidates in an interview
Question Timeline
See when this question was last asked and where, including any notes left by other candidates.
Mid September, 2025
Snowflake
Manager
Mid September, 2025
Airbnb
Senior
Early September, 2025

Microsoft
Senior Manager
Design a distributed job scheduler, the same as what we have on the Hello Interview site; the only level-up was that a job can be multi-stage, and you don't want to lose progress if a job fails after completing a few stages.