The status dashboard
Serve millions of status page reads during an incident using layered caching and read path isolation.
Kata overview
You do not need to be an expert to start. This kata keeps the stakes low so you can explore trade-offs, adjust the diagram, and see how the system responds.
Context for this system design kata
This kata keeps the stakes low so you can rehearse trade-offs before taking ideas into production reviews.
Scenario and practice focus
When an incident is declared, users flood the status page. The content is identical for everyone viewing the same page - it only changes when the operator posts an update (every 5-15 minutes). Between updates, every request returns the same HTML/JSON. This is the ideal case for aggressive caching: a CDN at the edge, a fast API behind it, and a read replica so the status database primary handles only writes. The 80ms p95 latency target is achievable only if the CDN serves most requests - without it, the API + DB path alone exceeds the target under load. The cost target is tight because incidents are spiky but infrequent - you can't justify always-on large instances.
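To build intuition for why the CDN is load-bearing here, a back-of-envelope sketch helps. The latency numbers below are illustrative assumptions, not measurements from the kata, and weighted-average latency is only a crude stand-in for p95:

```python
# Back-of-envelope: what CDN hit ratio keeps blended latency under the target?
# EDGE_MS and ORIGIN_MS are assumed values for illustration.

EDGE_MS = 15     # assumed CDN edge response time
ORIGIN_MS = 120  # assumed API + read-replica path under load
TARGET_MS = 80   # the kata's latency target

def blended_latency(hit_ratio: float) -> float:
    """Weighted-average latency across cache hits and misses.
    A rough proxy - real p95 requires percentile math, not a mean."""
    return hit_ratio * EDGE_MS + (1 - hit_ratio) * ORIGIN_MS

# Minimum hit ratio that meets the target, from solving
# hit * EDGE + (1 - hit) * ORIGIN <= TARGET for hit:
min_hit = (ORIGIN_MS - TARGET_MS) / (ORIGIN_MS - EDGE_MS)
print(f"need >= {min_hit:.0%} CDN hit ratio")
```

Even under these generous assumptions, a substantial fraction of traffic must terminate at the edge; with a slower origin under real load, the required hit ratio climbs quickly toward the 90%+ range.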
Difficulty: Intermediate–Advanced. Estimated time: 30–45 min. Domain: SaaS / DevOps.
Constraints to balance
Operational pressure
- No manual scaling during an incident.
- Status page content is identical for all viewers - cache aggressively.
Customer and product constraints
- Updates are infrequent (every 5-15 minutes during an incident).
- Incidents are spiky but rare - you can't justify always-on large instances.
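The "identical for all viewers, updated every 5-15 minutes" pair of constraints often translates into cache headers. A sketch, with assumed values chosen to illustrate the trade-off (a short shared TTL so operator updates propagate quickly, plus stale-while-revalidate so edges keep serving during refresh):

```python
# Sketch: response headers for a status page whose content changes
# every 5-15 minutes. The specific values are assumptions, not
# prescribed by the kata.

def status_page_headers() -> dict[str, str]:
    return {
        # Shared caches (CDN) may serve for 60s without revalidating -
        # short enough that an operator update reaches users quickly,
        # long enough to absorb an incident traffic surge.
        # stale-while-revalidate lets the edge keep answering while it
        # refetches in the background instead of stampeding the origin.
        "Cache-Control": "public, s-maxage=60, stale-while-revalidate=300",
    }
```

The key tension: a longer TTL raises the hit ratio and lowers cost, but delays how fast an operator's update reaches users. With updates every 5-15 minutes, a TTL of tens of seconds is usually an acceptable staleness window.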
Scenarios to explore in the simulator
- Keep status pages fast during incident traffic surges.
- Internal updates must not compete with external reads.
- Serve incident traffic at minimal cost - incidents are spiky but brief.
- Never let the status page itself become an incident.
Learning outcomes
- Build a multi-layer read path (CDN → API → replica) where each layer reduces origin load.
- Configure CDN cache hit ratio based on content update frequency.
- Understand that 80ms p95 is only achievable with CDN - the API + DB path alone takes longer.
- Isolate write path (operator updates) from read path (user views) using replicas.
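The last outcome above can be sketched as two separate database handles: one for operator writes, one for user reads. This is a minimal illustration with hypothetical names, assuming DB-API-style connections (e.g. sqlite3), not the kata's actual implementation:

```python
# Sketch of read/write path isolation. `primary` and `replica` are
# assumed to be DB-API-style connections; in production the replica
# would receive changes via replication from the primary.

class StatusStore:
    def __init__(self, primary, replica):
        self.primary = primary   # handles operator updates only
        self.replica = replica   # serves all user-facing reads

    def post_update(self, incident_id: str, body: str) -> None:
        # Write path: operator updates go to the primary and never
        # compete with the read surge.
        self.primary.execute(
            "INSERT INTO updates (incident_id, body) VALUES (?, ?)",
            (incident_id, body),
        )

    def current_status(self, incident_id: str):
        # Read path: user views hit the replica, so millions of reads
        # cannot block the operator from posting the next update.
        return self.replica.execute(
            "SELECT body FROM updates WHERE incident_id = ? "
            "ORDER BY rowid DESC LIMIT 1",
            (incident_id,),
        ).fetchone()
```

Replication lag is the trade-off to watch: a freshly posted update may take a moment to appear on the replica, which is acceptable for a status page already fronted by a 60-second CDN TTL.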
Give it a try!