This is where the second pass plays out: the last gate before an interview invite lands in your
inbox. The recruiter slows down here, and even then your current role still drives around 95%
of the decision.
That makes sense: nothing tells a hiring team what you can run in production right now the way
your current job does. To earn the "yes", this section has to cover the full SRE role profile,
one bullet per slot you listed in Domain Expertise above. Every bullet has to come from
something you actually owned in production, not a Jira card that wandered past your queue.
1. SLO & Error Budget Engineering
The flagship work of the role. Show the SLOs you designed (latency, availability,
freshness), the error-budget policy you wrote, and the burn-rate alerts behind them. Name
the service tier and the target you set, not "owned SLOs".
Techniques
SLI selection
Burn-rate alerting
Error-budget policy
Tier classification
Tools
Prometheus
Sloth, Nobl9
Grafana SLO panels
Metrics
SLO hit rate
Services on SLO
Error budget defended
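If you need to turn a target into a number for the bullet, the arithmetic behind error budgets and burn rates is short. The sketch below is a minimal illustration, not any tool's API: the function names, the 30-day window, and the sample figures are assumptions chosen for the example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO.

    slo_target is a fraction, e.g. 0.999 for "three nines".
    The 30-day default window is an assumption; use your own policy's window.
    """
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """How fast the budget is being spent relative to plan.

    1.0 means the budget runs out exactly at the end of the window;
    higher values mean it runs out sooner.
    """
    allowed_error_ratio = 1.0 - slo_target
    return observed_error_ratio / allowed_error_ratio


if __name__ == "__main__":
    # Hypothetical service: 99.9% availability target, 1.44% of requests failing.
    print(round(error_budget_minutes(0.999), 1))
    print(round(burn_rate(0.0144, 0.999), 1))
```

With these sample inputs, a 99.9% target over 30 days leaves about 43.2 minutes of budget, and a 1.44% error ratio is a burn rate of about 14.4, which would spend the whole budget in roughly two days if sustained. That is the kind of figure that belongs in the bullet.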
2. Observability & Tracing
What turns a production fire into a debuggable story. Show the metrics, logs, and traces
pipeline you stood up, the dashboards every service inherits, and the tracing coverage
across critical paths. Name the system and what it unblocked, not "used Datadog".
Techniques
RED & USE metrics
Distributed tracing
Structured logging
Cardinality control
Tools
Prometheus, Grafana
OpenTelemetry, Tempo
Datadog, Honeycomb
Metrics
Tracing coverage
Alert noise reduced
Time-to-diagnose cut
3. Incident Response & Command
The discipline that separates an outage from a saga. Show the incident-command program you
ran, the major incident you took point on, and the communication and rotation underneath
it. Name the incident you commanded and the MTTR you cut, not "handled incidents".
Techniques
Incident command
Severity model
Comms cadence
On-call rotation
Tools
PagerDuty, Opsgenie
incident.io, FireHydrant
Statuspage
Metrics
MTTR cut
P0 incidents reduced
Time-to-detect
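An "MTTR cut" claim holds up better when you can show how you computed it. The sketch below is one way to do that from an incident export, under stated assumptions: MTTR is measured from detection to resolution (teams define it differently), and the function names and timestamps are hypothetical.

```python
from datetime import datetime
from statistics import mean


def mttr_minutes(incidents):
    """Mean minutes from detection to resolution.

    incidents is a list of (detected_at, resolved_at) datetime pairs.
    """
    durations = [
        (resolved - detected).total_seconds() / 60
        for detected, resolved in incidents
    ]
    return mean(durations)


def percent_cut(before: float, after: float) -> float:
    """Percentage reduction from a baseline value to a new value."""
    return (before - after) / before * 100


if __name__ == "__main__":
    # Hypothetical incidents from two consecutive quarters.
    last_quarter = [
        (datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 10, 40)),
        (datetime(2024, 2, 3, 14, 0), datetime(2024, 2, 3, 15, 20)),
    ]
    this_quarter = [
        (datetime(2024, 4, 8, 11, 0), datetime(2024, 4, 8, 11, 50)),
        (datetime(2024, 5, 21, 2, 0), datetime(2024, 5, 21, 2, 40)),
    ]
    before = mttr_minutes(last_quarter)
    after = mttr_minutes(this_quarter)
    print(before, after, percent_cut(before, after))
```

Whatever definition you use, state it once in the bullet or be ready to state it in the interview, so the percentage cannot be read as a moved goalpost.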
4. Postmortems & Reliability Improvement
Where SRE turns one outage into ten fewer next quarter. Show the postmortem template you
standardized, the action-tracking system behind it, and the reliability bet that came out
of a real incident. Name the action and the metric it moved, not "wrote postmortems".
Techniques
Blameless postmortems
Action tracking
Reliability bets
Trend analysis
Tools
Notion, Confluence
Jira, Linear
incident.io retros
Metrics
Actions closed
Repeat-incident rate down
SLO regressions caught
5. Capacity Planning & Performance
How the service holds up before the holiday spike. Show the load tests you ran, the
capacity model you wrote, and the bottleneck you found before traffic did. Name the
workload and what you sized for, not "did capacity planning".
Techniques
Load & soak testing
Capacity modeling
Headroom planning
Performance profiling
Tools
k6, Locust
JMeter
pprof, perf
Metrics
Peak load handled
Latency at peak
Cost per RPS cut
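"What you sized for" usually comes down to a small headroom calculation. The sketch below is a deliberately simplified capacity model, assuming throughput scales linearly with instance count; the function name, the 30% default headroom, and the sample numbers are illustrative assumptions, not a recommendation.

```python
import math


def instances_needed(peak_rps: float, rps_per_instance: float,
                     headroom: float = 0.3) -> int:
    """Instances required to serve peak_rps while keeping spare capacity.

    headroom is the fraction of each instance's measured capacity left
    unused at peak (0.3 means run each instance at no more than 70%).
    Assumes linear scaling, which a load test should confirm.
    """
    usable_rps = rps_per_instance * (1.0 - headroom)
    return math.ceil(peak_rps / usable_rps)


if __name__ == "__main__":
    # Hypothetical workload: 12,000 RPS forecast peak, 500 RPS per instance
    # measured in a load test.
    print(instances_needed(12000, 500, headroom=0.3))
```

In a bullet, the inputs matter more than the output: the forecast peak, the per-instance limit you measured, and the headroom you chose.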
6. Toil Reduction & Automation
The discipline that keeps the on-call calm. Show the toil you measured, the automation you
shipped against it, and the hours-per-quarter you returned to engineering work. Name the
chore you killed, not "automated stuff".
Techniques
Toil measurement
Self-healing automation
Runbook codification
Alert hygiene
Tools
Python, Go
Ansible, Terraform
Rundeck, StackStorm
Metrics
Toil hours cut
Pages per shift down
Self-heal rate
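The hours-per-quarter figure is easy to back up if you logged how often the chore ran and how long it took. A minimal sketch, assuming a 13-week quarter and hypothetical task numbers; the function name is illustrative.

```python
def toil_hours_per_quarter(occurrences_per_week: float,
                           minutes_per_occurrence: float,
                           weeks: int = 13) -> float:
    """Hours per quarter spent on one recurring manual task."""
    return occurrences_per_week * minutes_per_occurrence * weeks / 60


if __name__ == "__main__":
    # Hypothetical chore: a manual certificate rotation done 20 times a week,
    # 15 minutes each time.
    print(toil_hours_per_quarter(20, 15))
```

Run it once for the task before automation and once for whatever manual work remains; the difference is the number you returned to engineering.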
7. Chaos Engineering & Resilience Testing
How a senior SRE finds the failure before it finds the user. Show the chaos program you
ran, the failure mode you discovered in a game day, and the gap you closed before the next
quarter. Name the experiment and the weakness it surfaced, not "ran chaos tests".
Techniques
Failure injection
Game days
DR drills
Hypothesis testing
Tools
Chaos Mesh, LitmusChaos
Gremlin
AWS FIS
Metrics
Failure modes closed
Game days run
DR RTO held
8. Tooling & Workflow
The setup that lets one SRE cover the reliability of dozens of services. Show the internal
tooling you shipped (SLO-as-code, runbook libraries, on-call dashboards), the review
patterns that catch reliability regressions at PR time, and the docs that cut on-call
ramp. Name the workflow, not "a modern stack".
Techniques
SLO as code
Production-readiness reviews
Runbook libraries
On-call shadowing
Tools
Git, GitHub
Python, Go, Bash
Backstage
Metrics
SLOs as code
Runbooks maintained
On-call ramp cut