Executive Summary: Different problems need different approaches. The Cynefin Framework provides a practical way to match your response to the problem’s actual nature: simple issues need established procedures, complicated ones require expert analysis, complex problems demand experimentation, and chaos requires immediate action. Understanding which domain you’re in prevents costly mistakes like over-analyzing straightforward issues or trying to engineer solutions for emergent problems.
Understanding Problem Types Through Real Incidents
I’ve watched the same pattern play out dozens of times: a production issue hits, and the team immediately jumps to their preferred approach regardless of what they’re facing. The database experts start analyzing query plans. The infrastructure folks check resource utilization. The application team reviews recent deployments. Or an executive jumps in and throws everyone in the direction he sees. Everyone applies their go-to method, and sometimes it works brilliantly. Other times, we waste hours or days going down the wrong path.
The problem isn’t that these approaches are wrong—it’s that we’re not stopping to ask what kind of problem we’re actually dealing with. A framework developed by Dave Snowden called Cynefin (pronounced kuh-NEV-in, it’s Welsh for “habitat”) has helped me make better decisions about when to dive deep versus when to act fast, when to experiment versus when to follow established procedures.
Cynefin categorizes situations into five domains based on how cause and effect work in each. I’ll walk through them using actual scenarios I’ve encountered.
Simple/Obvious Domain: When the Answer Is Clear
The Pattern: Clear cause-effect relationships exist. One right answer. Problems fit into known categories. Solutions can be standardized.
Real Example: Our web application starts throwing 500 errors. Monitoring shows database connection pool exhaustion—we hit the configured maximum of 100 connections. This happens every few months when traffic spikes, and the fix is always the same: increase the pool size or implement better connection reuse. A junior engineer can handle it by following the documented procedure.
How You Know You’re Here:
- The problem recurs almost identically
- Your documented solution works >95% of the time
- Junior team members diagnose it correctly
- You could fully automate the fix with high confidence
- Extensive documentation exists
The Approach: Sense (observe the issue) → Categorize (it’s connection pool exhaustion) → Respond (apply the standard fix).
Where It Fits: This is fitness perception territory. Pattern recognition, efficient execution, proven procedures. Don’t overthink it.
Complicated Domain: Multiple Right Answers
The Pattern: Cause-effect relationships exist but require expertise to figure out. Multiple viable solutions with different trade-offs. Analysis pays off.
Real Example: A microservice architecture shows periodic performance degradation. No single obvious cause. I need distributed tracing to follow requests across services, analysis of queue depths, resource utilization patterns. Several solutions might work: optimize database queries, implement caching, adjust timeout configurations. Each has different implications for maintainability, cost, and future scaling.
How You Know You’re Here:
- Analysis requires specialized tools (tracing, profiling)
- Multiple solution paths exist with real trade-offs
- Junior engineers struggle without guidance
- The problem decomposes into analyzable components
- Solution development takes significant analysis time
The Approach: Sense (gather comprehensive data) → Analyze (using expert knowledge and tools) → Respond (implement the most appropriate solution).
Where It Fits: Reality perception shines here. You need accurate understanding, expert analysis, deep investigation. The time invested in getting it right pays off.
Complex Domain: Patterns Emerge Only in Hindsight
The Pattern: Cause-effect relationships only visible retrospectively. No clear right answers. Solutions emerge through experimentation. High contextual complexity.
Real Example: Our authentication system shows intermittent failures—anywhere from 0.5% to 3% error rate with no clear pattern. Affects different user segments unpredictably. Multiple system changes preceded the first occurrence. Traditional debugging hasn’t isolated a cause. The issue reproduces in production but not in test environments.
A Complex-Domain Approach:
- Safe-to-fail experiments: Route small percentages of traffic through different authentication paths
- Pattern detection: Capture detailed telemetry to identify emergent patterns
- Amplify success: Gradually increase traffic to configurations showing reduced errors
- Dampen problems: Reduce traffic to configurations showing increased failures
How You Know You’re Here:
- The problem changes when you investigate it
- Multiple attempted solutions have unpredictable effects
- Historical data doesn’t predict current behavior
- The system exhibits emergent behavior
- Many interacting adaptive components
The Approach: Probe (run small experiments) → Sense (observe responses) → Respond (amplify what works, dampen what doesn’t).
Where It Fits: You need both frameworks. Start with fitness-based probes (what works?), then use reality-based analysis to understand patterns. But acknowledge that complete understanding may be impossible.
Chaotic Domain: Crisis Mode
The Pattern: No discernible cause-effect in the moment. No time for analysis. System in acute failure. Focus on stabilization before understanding.
Real Example: A botched deployment causes cascading failures. Multiple services down. Error rates spiking everywhere. Customer data potentially at risk. Alarms triggering faster than we can read them. No time for root cause analysis—we need to stop the bleeding.
The Protocol:
- Contain: Limit impact (redirect traffic, isolate affected systems)
- Stabilize: Establish minimal functioning (revert to last known good)
- Assess: Rapidly gather situational awareness once immediate danger passes
- Transition: Move from chaos to complex or complicated for proper resolution
How You Know You’re Here:
- Multiple critical alerts simultaneously
- Rapid deterioration of core metrics
- Can’t determine initial failure point
- Need immediate executive decisions
- No time for conventional analysis
The Approach: Act (take immediate action) → Sense (check the response) → Respond (address remaining issues in priority order).
Where It Fits: Pure fitness perception initially—stop the bleeding before diagnosing the disease. Reality-based analysis comes later to understand root causes and prevent recurrence.
Disorder: When You Can’t Tell Which Domain You’re In
The Danger Zone: Different stakeholders apply incompatible approaches. Decision paralysis. Political factors override technical considerations. Cognitive biases dominate.
Real Example: After merging with another company, an incident spans both technology stacks. One team wants formal incident response procedures (treating it as complicated). Another wants quick fixes based on similar past incidents (treating it as obvious). Leadership demands immediate resolution (treating it as chaotic). Everyone’s pulling in different directions.
Getting Out:
- Explicitly agree on domain classification
- Break the problem into parts belonging to different domains
- Clarify decision authority
- Ensure everyone has the same information
How Problems Move Between Domains
Here’s something I’ve learned: the same issue can traverse multiple domains during its lifecycle.
Typical Incident Journey:
- Detection (Often Disorder): Alert fires, classification unclear, conflicting interpretations
- First Response (May Be Chaotic): Multiple alarms, unclear causation, focus on containment
- Stabilization (Transitions to Complex): Basic functionality restored, root cause unclear, probe-sense-respond
- Investigation (Moves to Complicated): System stable enough for methodical analysis, expert tools deployed
- Resolution (Reaches Simple/Obvious): Root cause identified, solution documented for future
- Prevention (Engineers for Simple/Obvious): Learnings codified into monitoring and automation
The Cost of Getting the Domain Wrong
Treating Simple as Complicated: Engineers can spend hours deep-diving into a service that just needed restarting, applying complicated-domain analysis to a simple-domain problem. Cost: wasted time, extended outage.
Treating Complicated as Simple: Repeatedly restarting a service without addressing underlying resource issues, applying a simple fix to a complicated problem. Cost: recurring incidents, growing technical debt.
Treating Complex as Complicated: Trying to engineer comprehensive solutions for workloads with highly variable patterns, assuming it could be modelled (complicated) when it was actually emergent (complex). Cost: brittle solutions that broke under slight condition changes.
Treating Chaotic as Complex: Testing new configurations during a major outage, trying experiments (complex-domain approach) during a crisis (chaotic domain). Cost: extended downtime, potential data loss.
Practical Application: Designing for All Domains
The systems I’ve worked on that handle uncertainty best incorporate elements for each domain:
Simple/Obvious Layer:
- Standard components with documented procedures
- Extensive automation for known issues
- Clear runbooks
- Fitness perception dominates
Complicated Layer:
- Specialized services requiring expert knowledge
- Comprehensive monitoring and observability
- Detailed system behavior documentation
- Reality perception dominates
Complex Layer:
- Loosely coupled components with clear boundaries
- Adaptive algorithms and self-tuning
- Extensive telemetry with pattern detection
- Safe-to-fail experimentation infrastructure
- Balanced approach
Chaos Protection:
- Multiple redundancy layers
- Circuit breakers and bulkheads
- Degraded operation modes
- Clear emergency procedures
- Fitness perception initially, reality after stabilization
Applying Cynefin to the Correctness vs. Robustness Spectrum
Remember from earlier posts how some systems prioritize correctness (right answer or no answer) while others prioritize robustness (continued operation despite issues)? Cynefin helps navigate this too:
Simple/Obvious Domain: Correctness typically dominates—problems are well-understood, so we can get it right.
Complicated Domain: Applications requiring precision favor correctness. Those requiring high availability favor robustness.
Complex Domain: Robustness typically dominates—emergent behavior makes absolute correctness elusive. Focus on critical invariants.
Chaotic Domain: Robustness dominates during crisis. Address correctness after stabilization.
Team Structures That Match Domains
I’ve noticed teams naturally organize around domain expertise:
Domain Specialists:
- Simple/Obvious: Operations engineers focused on efficiency
- Complicated: Architects and senior engineers with deep expertise
- Complex: Research engineers working with emergent patterns
- Chaotic: SREs and incident commanders
Building Balanced Capability:
- Cross-train between domains regularly
- Clear protocols for domain transitions during incidents
- Shared understanding of appropriate approaches
- Awareness of domain characteristics
What I’ve Learned
The Cynefin Framework reminds me there’s no one-size-fits-all approach to system operations. By recognizing which domain a problem belongs to, I can choose the right combination of reality and fitness perception.
This helps:
- Avoid applying the wrong methodology
- Recognize when problems shift between domains
- Design systems that handle different challenges appropriately
- Build teams with skills across all domains
- Communicate more effectively about problem contexts
The key insight: not all problems are created equal. The approach that works brilliantly for one type of problem can be completely wrong for another. Stop, assess which domain you’re in, then apply the appropriate response pattern.
Think about your recent incidents: What domain did they fall into? How did your approach match (or mismatch) the problem’s actual nature? Are there cases where domain misclassification led to ineffective solutions? Understanding this might change how you approach the next problem that lands on your desk.