Change Failure Rate (CFR)
Definition: The percentage of production deployments that result in a failure requiring remediation. “Failure” typically means a deployment that caused an incident, outage, rollback, or severe bug that needed an immediate fix. For example, if 5 of 100 deployments in a period required a rollback or hotfix, CFR is 5%. Lower is better.
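The arithmetic in the definition can be sketched in a few lines. This is a minimal illustration, not a standard implementation: the function name change_failure_rate and its parameters are hypothetical, and what counts as a "failed" deployment is whatever your team has agreed on.

```python
def change_failure_rate(total_deployments: int, failed_deployments: int) -> float:
    """Return CFR as a percentage of production deployments that needed remediation.

    Both counts are assumed to cover the same time period.
    """
    if total_deployments <= 0:
        raise ValueError("total_deployments must be positive")
    if not 0 <= failed_deployments <= total_deployments:
        raise ValueError("failed_deployments must be between 0 and total_deployments")
    return 100.0 * failed_deployments / total_deployments


# The example from the definition: 5 failures out of 100 deployments.
print(change_failure_rate(100, 5))  # 5.0
```

The function rejects a period with zero deployments rather than returning 0%, since "no deployments" and "no failures" are different situations.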
What it tells you: CFR is a measure of release quality and stability. It reflects the rigor of testing and validation and the effectiveness of your release process. A low change failure rate means most releases are successful and stable, which is a hallmark of mature DevOps teams. DORA's State of DevOps reports have consistently placed elite performers in the lowest CFR band; the published thresholds vary by report year, from roughly 5% to a band of 0–15%.
What it misses or risks: CFR depends on how you define a "failure". Teams might underreport failures (consciously or not) to look good. If incidents are not tracked, or if a problem is discovered much later, it might never be attributed to a specific deployment, which masks the issue. CFR is also sensitive to deployment volume. If you deploy infrequently, a single failure among a handful of deploys produces a high percentage even though the absolute number of failures is small. Conversely, deploying very often can produce a tiny percentage while still causing many failures in absolute terms. Finally, CFR ignores severity: a single deployment that caused a massive outage is arguably worse than three that caused minor glitches, but CFR counts each as one failure.
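One way to make these blind spots visible is to report the absolute failure count and a severity breakdown alongside the percentage. The sketch below is illustrative only: the summarize function, the severity labels "minor" and "major", and all of the numbers are assumptions, not part of any standard CFR definition.

```python
from collections import Counter


def summarize(deployments: int, failure_severities: list[str]) -> dict:
    """Report CFR together with the context the bare percentage hides.

    failure_severities holds one (hypothetical) severity label per failed deployment.
    """
    if deployments <= 0:
        raise ValueError("deployments must be positive")
    failures = len(failure_severities)
    return {
        "cfr_percent": 100.0 * failures / deployments,
        "failures": failures,
        "by_severity": dict(Counter(failure_severities)),
    }


# Volume effect: one failure in 4 deploys looks far worse than 20 in 1000.
print(summarize(4, ["minor"])["cfr_percent"])          # 25.0
print(summarize(1000, ["minor"] * 20)["cfr_percent"])  # 2.0

# Severity effect: the team with one major outage has the lower CFR.
print(summarize(100, ["major"])["cfr_percent"])        # 1.0
print(summarize(100, ["minor"] * 3)["cfr_percent"])    # 3.0
```

Reading the three fields together keeps a 25% CFR from four deploys, or a 1% CFR that hides a major outage, from being taken at face value.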
Common misinterpretations: One might think “zero failures” is always the goal, but beware of Goodhart’s Law here: a team could reach zero CFR by deploying far less often or by classifying every issue as unrelated to a deployment. Taken to the extreme, if you never release, you never fail. So context is key. A slightly higher CFR might also be acceptable if the organization values pushing boundaries and can recover quickly, that is, if it has a low mean time to restore (MTTR). In other words, CFR shouldn’t be read in isolation; consider it together with MTTR. High CFR + high MTTR is dangerous (lots of failures, slow to fix). High CFR + very low MTTR might indicate a fast-moving, experimental culture (though ideally you still aim for a low CFR).
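Reading CFR and MTTR together can be sketched as a simple quadrant check. The cut-offs below (15% CFR, 60 minutes MTTR) are placeholder assumptions chosen only for illustration, and the classify function is hypothetical; pick thresholds that fit your own context.

```python
def classify(
    cfr_percent: float,
    mttr_minutes: float,
    cfr_threshold: float = 15.0,           # assumed cut-off, not a standard
    mttr_threshold_minutes: float = 60.0,  # assumed cut-off, not a standard
) -> str:
    """Place a team in a CFR/MTTR quadrant using the given thresholds."""
    high_cfr = cfr_percent > cfr_threshold
    high_mttr = mttr_minutes > mttr_threshold_minutes
    if high_cfr and high_mttr:
        return "high CFR, high MTTR: dangerous (many failures, slow to fix)"
    if high_cfr:
        return "high CFR, low MTTR: possibly a fast-moving, experimental culture"
    if high_mttr:
        return "low CFR, high MTTR: failures are rare but slow to recover from"
    return "low CFR, low MTTR: stable releases and quick recovery"


print(classify(cfr_percent=30.0, mttr_minutes=240.0))
print(classify(cfr_percent=30.0, mttr_minutes=10.0))
```

The labels are prompts for discussion rather than verdicts: a team in the "low CFR" quadrants may still be gaming the metric by deploying rarely, so deployment frequency is worth checking alongside.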
Improvement strategies:
- Improve testing and CI quality gates: Ensure robust automated tests (unit, integration, end-to-end) run before every deploy. Consider canary releases or feature-flag rollouts so that issues surface on a subset of users first.
- Peer review and static analysis: Strengthen code review practices, and add linting, static code analysis, and security scans to catch defects before deploy.
- Observability in production: This doesn’t directly lower CFR, but it helps you identify failures more quickly. With good monitoring and logging, you detect a problematic deployment and trigger a rollback sooner (the deployment might still count as a failure, but the impact is smaller).
- Post-deployment validation: Automate smoke tests that run in production right after each deployment to verify key functionality. This catches failures early and can trigger an automated rollback if needed.
- Analyze patterns: Look at past failed changes. Were they tied to certain kinds of changes or certain subsystems? Feed that information back into planning and risk assessment. For example, if Friday deployments have a higher failure rate, consider a no-Friday-deploys practice or additional testing for Friday releases.
- Culture and process: Encourage a blameless post-mortem culture to learn from failures. Sometimes CFR is high because systemic issues (like environment drift or a lack of integration tests) cause repeated failures. Address the root causes identified in retrospectives.
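The "analyze patterns" strategy above can be sketched as a per-weekday breakdown of past deployments. This is a minimal illustration under stated assumptions: the failure_rate_by_weekday function is hypothetical, the input is assumed to be a list of (date, failed) pairs exported from your own deployment records, and the sample data is invented.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def failure_rate_by_weekday(deployments):
    """Group (deploy_date, failed) pairs by weekday.

    Returns {weekday: (failures, total, cfr_percent)} for days with deployments.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for deploy_date, failed in deployments:
        name = WEEKDAYS[deploy_date.weekday()]  # weekday(): Monday is 0
        totals[name] += 1
        if failed:
            failures[name] += 1
    return {
        name: (failures[name], totals[name], 100.0 * failures[name] / totals[name])
        for name in totals
    }


# Invented sample history, for illustration only.
history = [
    (date(2024, 3, 4), False),   # Monday
    (date(2024, 3, 5), False),   # Tuesday
    (date(2024, 3, 8), True),    # Friday
    (date(2024, 3, 11), False),  # Monday
    (date(2024, 3, 15), True),   # Friday
    (date(2024, 3, 22), False),  # Friday
]

for day, (failed, total, pct) in failure_rate_by_weekday(history).items():
    print(f"{day}: {failed}/{total} failed ({pct:.0f}%)")
```

The same grouping works for other dimensions, such as subsystem or change type, by swapping the weekday key for another field. With only a handful of deployments per group, treat the percentages as hints to investigate rather than as conclusions.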