Design a Job Scheduler
Design a distributed job scheduler that can handle high-throughput job processing (10,000+ jobs per second), support both scheduled (cron-based) and ad-hoc job execution, include retry mechanisms for failed jobs, and maintain execution history for up to one year.
Asked at:
Robinhood
Meta

Netflix
This question asks you to design a distributed job scheduler—think Airflow, Quartz, or AWS EventBridge/CloudWatch Events—that can trigger and run 10,000 jobs per second, with reliable execution semantics and a year of searchable history. Interviewers use this problem to test whether you can separate scheduling from execution, design for high throughput and bursty "top-of-minute" load, and reason about delivery guarantees (at-least-once, idempotency, retries). They also look for pragmatic data modeling for long-term history (hot vs. cold), worker lifecycle management, and strategies to avoid single points of failure while keeping the scheduler effectively stateless.
Common Functional Requirements
Most candidates end up covering this set of core functionalities
Users should be able to create jobs that run immediately, at a specific time in the future, or on a recurring schedule using cron expressions.
Users should be able to submit ad‑hoc job runs in addition to scheduled executions.
Users should be able to view live job status and detailed execution history (timestamps, attempts, outcome, logs) for up to one year.
Users should be able to configure execution policies such as retries with backoff, timeouts, and optional concurrency limits per job.
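To make these requirements concrete, here is a hypothetical job-creation payload; the field names, defaults, and cron syntax shown are illustrative assumptions, not a fixed API.

```python
# Hypothetical job-definition payload covering the requirements above;
# every field name and default here is illustrative.
create_job_request = {
    "name": "nightly-report",
    "schedule": {"cron": "0 2 * * *"},       # or {"run_at": "..."} for one-off, or omitted for run-now
    "payload": {"report": "daily_pnl"},       # opaque arguments passed through to the handler
    "retry_policy": {
        "max_attempts": 5,
        "backoff": "exponential",             # with jitter; see the retry deep dive below
        "initial_interval_seconds": 30,
    },
    "timeout_seconds": 600,                   # abort and retry runs that exceed this
    "max_concurrent_runs": 1,                 # optional per-job concurrency limit
}
```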
Common Deep Dives
Common follow-up questions interviewers like to ask for this question
At this scale, a single naive cron process polling the database creates hot partitions and jitter, and it falls over under bursty "top-of-minute" spikes. You need a two-tier plan that separates long-term storage of future runs from a short-term, time-ordered ready queue, plus sharding and backpressure.
- Use a two-tier scheduler: persist future executions durably (e.g., keyed by scheduled_at), and continuously materialize the next N minutes into a near-term ready queue (e.g., a Redis Sorted Set or a delayed queue). Jobs created to run in the near term can be fast-pathed directly into the ready queue.
- Shard the scan by time buckets or a hash of job_id, and run multiple schedulers active/active with leader election per shard. Control clock drift with NTP and use a small overlap buffer to avoid missed runs.
- Apply backpressure: meter promotions into the ready queue, autoscale workers based on queue depth and execution latency, and bound the queue with rate limits to prevent thundering herds.
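A minimal sketch of the promotion path, assuming a durable store holds future executions and a Redis Sorted Set (via redis-py) serves as the near-term ready queue; the key name, lookahead window, and the fetch_due query are placeholders, not a prescribed schema.

```python
import time
import redis  # assumes the redis-py client is available

READY_SET = "jobs:ready"       # hypothetical sorted-set key, scored by scheduled_at
LOOKAHEAD_SECONDS = 300        # materialize roughly the next 5 minutes of runs

r = redis.Redis(decode_responses=True)

def promote_due_executions(fetch_due):
    """Move executions due within the lookahead window from durable storage
    into the near-term ready queue (a sorted set scored by run time)."""
    horizon = time.time() + LOOKAHEAD_SECONDS
    # fetch_due(horizon) stands in for a query such as:
    #   SELECT execution_id, scheduled_at FROM executions
    #   WHERE scheduled_at <= :horizon AND state = 'PENDING'
    for execution_id, scheduled_at in fetch_due(horizon):
        # ZADD is idempotent per member, so re-scanning an overlap window is safe.
        r.zadd(READY_SET, {execution_id: scheduled_at})

def pop_ready_executions(batch_size=100):
    """Claim executions whose scheduled time has passed."""
    now = time.time()
    claimed = []
    # Read due members, then claim each with ZREM; a ZREM return of 1 means this
    # poller won the race against any other poller reading the same window.
    for member in r.zrangebyscore(READY_SET, 0, now, start=0, num=batch_size):
        if r.zrem(READY_SET, member):
            claimed.append(member)
    return claimed
```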
Distributed systems make "exactly once" unrealistic; retries, crashes, and network splits happen. Interviewers expect you to combine delivery guarantees with idempotent handlers or lightweight deduplication so the business effect is exactly-once even if the job runs more than once.
- Use invisibility timeouts/leases when workers pull jobs. Renew leases with heartbeats for long runs; if a worker dies, the job becomes visible again for retry. Send failures to a dead-letter queue after max attempts.
- Make handlers idempotent via idempotency keys (job_id + scheduled_at) and conditional writes. Consider a small dedupe table/window keyed by that token to ignore replays.
- Add exponential backoff and jitter to retries to reduce contention and avoid stampedes after transient outages.
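A minimal sketch of the idempotency-key approach, assuming Redis (redis-py) as the dedupe store; the key prefix, TTL, and token format are illustrative assumptions.

```python
import hashlib
import redis  # any store with conditional writes works; Redis is assumed here

r = redis.Redis()
DEDUPE_TTL_SECONDS = 24 * 3600  # keep tokens long enough to cover the retry window

def idempotency_key(job_id: str, scheduled_at: int) -> str:
    """One logical run = one token, no matter how many times it is delivered."""
    return hashlib.sha256(f"{job_id}:{scheduled_at}".encode()).hexdigest()

def run_once(job_id: str, scheduled_at: int, handler) -> bool:
    """Execute the handler only if this (job_id, scheduled_at) has not run yet."""
    token = idempotency_key(job_id, scheduled_at)
    # SET NX is the conditional write: the first delivery to claim the token wins;
    # replays and redeliveries see the existing token and skip the business effect.
    if not r.set(f"dedupe:{token}", "in_progress", nx=True, ex=DEDUPE_TTL_SECONDS):
        return False
    try:
        handler()
        r.set(f"dedupe:{token}", "done", ex=DEDUPE_TTL_SECONDS)
        return True
    except Exception:
        # Release the token so a later retry (with backoff) can attempt the run again.
        r.delete(f"dedupe:{token}")
        raise
```

The conditional write is what turns at-least-once delivery into an exactly-once business effect: duplicate deliveries return early instead of re-running the handler.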
Worker management is where many designs become fragile. You need predictable scaling based on load, safe shutdowns that don't drop work, and a plan for failures without a central bottleneck.
- Prefer pull-based workers so the scheduler stays stateless. Implement worker heartbeats, per-worker concurrency limits, and cooperative cancellation/timeout handling.
- Autoscale by queue depth, age of the oldest message, and recent processing latency. Cap concurrency per job or tenant to avoid noisy-neighbor issues.
- Implement graceful draining: stop fetching new work, extend in-flight leases until completion, and use a coordinator (or lightweight membership service) to manage rolling updates without dropping jobs.
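A sketch of a pull-based worker with heartbeat lease renewal and graceful draining; fetch_job, execute, and renew_lease are placeholders for the queue client, job handler, and lease API, which are not specified in the original.

```python
import signal
import threading
import time

class Worker:
    """Pull-based worker with cooperative draining: on SIGTERM it stops
    fetching new work but lets the in-flight job finish before exiting."""

    def __init__(self, fetch_job, execute, renew_lease, heartbeat_interval=10.0):
        self._fetch_job = fetch_job          # e.g., pops from the ready queue
        self._execute = execute              # runs the job's handler
        self._renew_lease = renew_lease      # heartbeat that extends the job's lease
        self._heartbeat_interval = heartbeat_interval
        self._draining = threading.Event()
        signal.signal(signal.SIGTERM, lambda *_: self._draining.set())

    def _heartbeat(self, job, done: threading.Event):
        # Keep the lease alive for long-running jobs so the queue does not
        # redeliver the job to another worker while this one is still working.
        while not done.wait(self._heartbeat_interval):
            self._renew_lease(job)

    def run(self):
        while not self._draining.is_set():
            job = self._fetch_job()
            if job is None:
                time.sleep(1.0)              # idle backoff; tune as needed
                continue
            done = threading.Event()
            hb = threading.Thread(target=self._heartbeat, args=(job, done), daemon=True)
            hb.start()
            try:
                self._execute(job)           # finishes even if draining was requested mid-run
            finally:
                done.set()
        # Reaching here means no job is in flight; the orchestrator can safely
        # terminate the process as part of a rolling update.
```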
History at 10k jobs/sec is a write-heavy firehose. Interviewers want to see hot vs. cold storage, partitioning to avoid hotspots, and efficient retrieval by job_id and time range.
- Separate hot (recent months) and cold (older) data. Use a write-optimized KV/column store with time-bucketed partitions and TTLs; archive aged records and logs to object storage with compact indexes.
- Model keys for common access patterns: partition by job_id and month (or another time bucket) to spread writes, with GSIs/secondary indexes for status or time-range queries.
- Store summaries/metrics (counts, durations, error rates) separately from verbose logs; pre-aggregate rollups to keep queries fast and storage lean.
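One way to shape a history row under these assumptions (a DynamoDB/Cassandra-style key-value layout with a hot-tier TTL and verbose logs offloaded to object storage); all field names and the 90-day hot window are illustrative.

```python
from datetime import datetime, timezone

HOT_TTL_DAYS = 90  # keep ~3 months hot; older records are archived to object storage

def history_record(job_id: str, execution_id: str, started_at: datetime,
                   status: str, attempt: int, duration_ms: int) -> dict:
    """Shape an execution-history row for a write-optimized KV/column store.

    Partitioning by (job_id, month bucket) spreads writes across partitions and
    keeps "show me job X's runs for March" a single-partition range scan.
    """
    month_bucket = started_at.strftime("%Y-%m")                        # e.g. "2025-09"
    return {
        "partition_key": f"{job_id}#{month_bucket}",                   # job + time bucket
        "sort_key": f"{started_at.isoformat()}#{execution_id}",        # time-ordered within partition
        "status": status,                                              # e.g. SUCCEEDED / FAILED / TIMED_OUT
        "attempt": attempt,
        "duration_ms": duration_ms,
        "expires_at": int(started_at.timestamp()) + HOT_TTL_DAYS * 86400,  # TTL for the hot tier
        "log_pointer": f"s3://job-logs/{job_id}/{execution_id}.log",       # verbose logs live in object storage
    }

# Example: a record for one run of a hypothetical "daily-report" job
record = history_record(
    job_id="daily-report",
    execution_id="exec-123",
    started_at=datetime(2025, 9, 15, 4, 0, tzinfo=timezone.utc),
    status="SUCCEEDED",
    attempt=1,
    duration_ms=4200,
)
```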
Relevant Patterns
Relevant patterns that you should know for this question
Jobs may run for seconds to minutes and must survive worker restarts and preemption. The pattern provides leases/visibility timeouts, heartbeats for renewal, retries with exponential backoff, and dead‑letter handling to deliver at‑least‑once execution with clear semantics.
Scheduling and history ingest together can exceed 10k writes/sec. You need partitioning, time bucketing, and write-optimized schemas to avoid hot partitions and to sustain high throughput while keeping history queries fast.
Cron workloads spike at boundaries (e.g., :00 of every minute) and after outages. Designing for backpressure, jittered retries, distributed locks/shards, and fair scheduling prevents thundering herds and overloaded components.
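As one concrete example of jittered retries, the "full jitter" variant draws each retry delay uniformly from zero up to a capped exponential bound, so a burst of simultaneous failures does not retry in lockstep; the base and cap below are illustrative.

```python
import random

def backoff_with_full_jitter(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Retry delay in seconds: exponential growth capped at `cap`, with the
    actual sleep drawn uniformly from [0, capped delay] ("full jitter")."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```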
Relevant Technologies
Relevant technologies that could be used to solve this question
Similar Problems to Practice
Related problems to practice for this question
Both designs center on time-based triggering, distributed execution, retries, and idempotency. Even without DAGs, the core concerns—promoting due work, worker pools, and execution tracking—are the same.
A crawler uses pull-based workers, leases, retries, and deduplication to process large queues reliably, mirroring the scheduler’s execution layer and at-least-once semantics.
Heavy, sustained write throughput and time-bucketed storage requirements are shared: you must partition by time and keys, manage hot vs. cold data, and support efficient time-range queries at scale.
Red Flags to Avoid
Common mistakes that can sink candidates in an interview
Question Timeline
See when this question was last asked and where, including any notes left by other candidates.
Mid September, 2025
Snowflake
Manager
Mid September, 2025
Airbnb
Senior
Early September, 2025

Microsoft
Senior Manager
Design a distributed job scheduler, the same as what we have on the Hello Interview site; the only level-up was that a job can be multi-stage, and you don't want to lose progress if a job fails after completing a few stages.