The Cynefin Framework: Matching Your Approach to Problem Complexity

Executive Summary: Different problems need different approaches. The Cynefin Framework provides a practical way to match your response to the problem’s actual nature: simple issues need established procedures, complicated ones require expert analysis, complex problems demand experimentation, and chaos requires immediate action. Understanding which domain you’re in prevents costly mistakes like over-analyzing straightforward issues or trying to engineer solutions for emergent problems.

Understanding Problem Types Through Real Incidents

I’ve watched the same pattern play out dozens of times: a production issue hits, and the team immediately jumps to their preferred approach regardless of what they’re facing. The database experts start analyzing query plans. The infrastructure folks check resource utilization. The application team reviews recent deployments. Or an executive jumps in and throws everyone in the direction he sees. Everyone applies their go-to method, and sometimes it works brilliantly. Other times, we waste hours or days going down the wrong path.

The problem isn’t that these approaches are wrong—it’s that we’re not stopping to ask what kind of problem we’re actually dealing with. A framework developed by Dave Snowden called Cynefin (pronounced kuh-NEV-in, it’s Welsh for “habitat”) has helped me make better decisions about when to dive deep versus when to act fast, when to experiment versus when to follow established procedures.

Cynefin categorizes situations into five domains based on how cause and effect work in each. I’ll walk through them using actual scenarios I’ve encountered.

Simple/Obvious Domain: When the Answer Is Clear

The Pattern: Clear cause-effect relationships exist. One right answer. Problems fit into known categories. Solutions can be standardized.

Real Example: Our web application starts throwing 500 errors. Monitoring shows database connection pool exhaustion—we hit the configured maximum of 100 connections. This happens every few months when traffic spikes, and the fix is always the same: increase the pool size or implement better connection reuse. A junior engineer can handle it by following the documented procedure.

How You Know You’re Here:

The problem recurs almost identically
Your documented solution works >95% of the time
Junior team members diagnose it correctly
You could fully automate the fix with high confidence
Extensive documentation exists

The Approach: Sense (observe the issue) → Categorize (it’s connection pool exhaustion) → Respond (apply the standard fix).

Where It Fits: This is fitness perception territory. Pattern recognition, efficient execution, proven procedures. Don’t overthink it.

Complicated Domain: Multiple Right Answers

The Pattern: Cause-effect relationships exist but require expertise to figure out. Multiple viable solutions with different trade-offs. Analysis pays off.

Real Example: A microservice architecture shows periodic performance degradation. No single obvious cause. I need distributed tracing to follow requests across services, analysis of queue depths, resource utilization patterns. Several solutions might work: optimize database queries, implement caching, adjust timeout configurations. Each has different implications for maintainability, cost, and future scaling.

How You Know You’re Here:

Analysis requires specialized tools (tracing, profiling)
Multiple solution paths exist with real trade-offs
Junior engineers struggle without guidance
The problem decomposes into analyzable components
Solution development takes significant analysis time

The Approach: Sense (gather comprehensive data) → Analyze (using expert knowledge and tools) → Respond (implement the most appropriate solution).

Where It Fits: Reality perception shines here. You need accurate understanding, expert analysis, deep investigation. The time invested in getting it right pays off.

Complex Domain: Patterns Emerge Only in Hindsight

The Pattern: Cause-effect relationships only visible retrospectively. No clear right answers. Solutions emerge through experimentation. High contextual complexity.

Real Example: Our authentication system shows intermittent failures—anywhere from 0.5% to 3% error rate with no clear pattern. Affects different user segments unpredictably. Multiple system changes preceded the first occurrence. Traditional debugging hasn’t isolated a cause. The issue reproduces in production but not in test environments.

A Complex-Domain Approach:

Safe-to-fail experiments: Route small percentages of traffic through different authentication paths
Pattern detection: Capture detailed telemetry to identify emergent patterns
Amplify success: Gradually increase traffic to configurations showing reduced errors
Dampen problems: Reduce traffic to configurations showing increased failures

How You Know You’re Here:

The problem changes when you investigate it
Multiple attempted solutions have unpredictable effects
Historical data doesn’t predict current behavior
The system exhibits emergent behavior
Many interacting adaptive components

The Approach: Probe (run small experiments) → Sense (observe responses) → Respond (amplify what works, dampen what doesn’t).

Where It Fits: You need both frameworks. Start with fitness-based probes (what works?), then use reality-based analysis to understand patterns. But acknowledge that complete understanding may be impossible.

Chaotic Domain: Crisis Mode

The Pattern: No discernible cause-effect in the moment. No time for analysis. System in acute failure. Focus on stabilization before understanding.

Real Example: A botched deployment causes cascading failures. Multiple services down. Error rates spiking everywhere. Customer data potentially at risk. Alarms triggering faster than we can read them. No time for root cause analysis—we need to stop the bleeding.

The Protocol:

Contain: Limit impact (redirect traffic, isolate affected systems)
Stabilize: Establish minimal functioning (revert to last known good)
Assess: Rapidly gather situational awareness once immediate danger passes
Transition: Move from chaos to complex or complicated for proper resolution

How You Know You’re Here:

Multiple critical alerts simultaneously
Rapid deterioration of core metrics
Can’t determine initial failure point
Need immediate executive decisions
No time for conventional analysis

The Approach: Act (take immediate action) → Sense (check the response) → Respond (address remaining issues in priority order).

Where It Fits: Pure fitness perception initially—stop the bleeding before diagnosing the disease. Reality-based analysis comes later to understand root causes and prevent recurrence.

Disorder: When You Can’t Tell Which Domain You’re In

The Danger Zone: Different stakeholders apply incompatible approaches. Decision paralysis. Political factors override technical considerations. Cognitive biases dominate.

Real Example: After merging with another company, an incident spans both technology stacks. One team wants formal incident response procedures (treating it as complicated). Another wants quick fixes based on similar past incidents (treating it as obvious). Leadership demands immediate resolution (treating it as chaotic). Everyone’s pulling in different directions.

Getting Out:

Explicitly agree on domain classification
Break the problem into parts belonging to different domains
Clarify decision authority
Ensure everyone has the same information

How Problems Move Between Domains

Here’s something I’ve learned: the same issue can traverse multiple domains during its lifecycle.

Typical Incident Journey:

Detection (Often Disorder): Alert fires, classification unclear, conflicting interpretations
First Response (May Be Chaotic): Multiple alarms, unclear causation, focus on containment
Stabilization (Transitions to Complex): Basic functionality restored, root cause unclear, probe-sense-respond
Investigation (Moves to Complicated): System stable enough for methodical analysis, expert tools deployed
Resolution (Reaches Simple/Obvious): Root cause identified, solution documented for future
Prevention (Engineers for Simple/Obvious): Learnings codified into monitoring and automation

The Cost of Getting the Domain Wrong

Treating Simple as Complicated: Engineers can spend hours deep-diving into a service that just needed restarting, applying complicated-domain analysis to a simple-domain problem. Cost: wasted time, extended outage.

Treating Complicated as Simple: Repeatedly restarting a service without addressing underlying resource issues, applying a simple fix to a complicated problem. Cost: recurring incidents, growing technical debt.

Treating Complex as Complicated: Trying to engineer comprehensive solutions for workloads with highly variable patterns, assuming it could be modelled (complicated) when it was actually emergent (complex). Cost: brittle solutions that broke under slight condition changes.

Treating Chaotic as Complex: Testing new configurations during a major outage, trying experiments (complex-domain approach) during a crisis (chaotic domain). Cost: extended downtime, potential data loss.

Practical Application: Designing for All Domains

The systems I’ve worked on that handle uncertainty best incorporate elements for each domain:

Simple/Obvious Layer:

Standard components with documented procedures
Extensive automation for known issues
Clear runbooks
Fitness perception dominates

Complicated Layer:

Specialized services requiring expert knowledge
Comprehensive monitoring and observability
Detailed system behavior documentation
Reality perception dominates

Complex Layer:

Loosely coupled components with clear boundaries
Adaptive algorithms and self-tuning
Extensive telemetry with pattern detection
Safe-to-fail experimentation infrastructure
Balanced approach

Chaos Protection:

Multiple redundancy layers
Circuit breakers and bulkheads
Degraded operation modes
Clear emergency procedures
Fitness perception initially, reality after stabilization

Applying Cynefin to the Correctness vs. Robustness Spectrum

Remember from earlier posts how some systems prioritize correctness (right answer or no answer) while others prioritize robustness (continued operation despite issues)? Cynefin helps navigate this too:

Simple/Obvious Domain: Correctness typically dominates—problems are well-understood, so we can get it right.

Complicated Domain: Applications requiring precision favor correctness. Those requiring high availability favor robustness.

Complex Domain: Robustness typically dominates—emergent behavior makes absolute correctness elusive. Focus on critical invariants.

Chaotic Domain: Robustness dominates during crisis. Address correctness after stabilization.

Team Structures That Match Domains

I’ve noticed teams naturally organize around domain expertise:

Domain Specialists:

Simple/Obvious: Operations engineers focused on efficiency
Complicated: Architects and senior engineers with deep expertise
Complex: Research engineers working with emergent patterns
Chaotic: SREs and incident commanders

Building Balanced Capability:

Cross-train between domains regularly
Clear protocols for domain transitions during incidents
Shared understanding of appropriate approaches
Awareness of domain characteristics

What I’ve Learned

The Cynefin Framework reminds me there’s no one-size-fits-all approach to system operations. By recognizing which domain a problem belongs to, I can choose the right combination of reality and fitness perception.

This helps:

Avoid applying the wrong methodology
Recognize when problems shift between domains
Design systems that handle different challenges appropriately
Build teams with skills across all domains
Communicate more effectively about problem contexts

The key insight: not all problems are created equal. The approach that works brilliantly for one type of problem can be completely wrong for another. Stop, assess which domain you’re in, then apply the appropriate response pattern.

Think about your recent incidents: What domain did they fall into? How did your approach match (or mismatch) the problem’s actual nature? Are there cases where domain misclassification led to ineffective solutions? Understanding this might change how you approach the next problem that lands on your desk.