The Next Step in Automation

Executive Summary: When you automate system responses, you’re encoding your philosophy about knowledge into code. Some automation focuses on quick fixes without deep understanding (fitness-based), while other automation tries to model reality comprehensively (reality-based). But there’s a third way: antifragile systems that don’t just survive stress—they get stronger from it. This article explores how these approaches manifest in automation and how antifragile design transcends the apparent choice between speed and understanding.

Aside: What is antifragility?
Antifragility means systems that benefit from stress, volatility, and disorder—the opposite of fragility. Robust systems resist damage. Resilient systems recover from damage. But antifragile systems actually improve because of stress. Think about how muscles grow stronger through exercise, or immune systems develop through pathogen exposure. In technical systems, antifragility shows up as architectures that systematically learn from failures and adapt, becoming stronger rather than just returning to their previous state.

What Your Automation Says About You

I’ve noticed something interesting about automation: it makes explicit the mental models we usually keep implicit. When you write code to automatically respond to system issues, you’re forced to decide: Do I need this automation to understand what’s happening, or just to fix it?

This distinction reveals itself everywhere in modern operations. Consider Kubernetes self-healing. It restarts failed containers based on health checks—no analysis of why they failed, just pattern-based response. Compare that to a distributed tracing system that tries to model how requests flow through your architecture, building comprehensive understanding of system behavior.

Both approaches work. Both are valuable. But they represent fundamentally different philosophies about what automation should do.

The Quick-Fix Automation: Outcomes Over Understanding

I call this “fitness-oriented automation” because it prioritizes what works over why it works. It’s pragmatic, efficient, and surprisingly effective.

How It Works

This automation follows specific patterns:

It recognizes signatures. Not root causes—just known patterns that indicate problems. High latency? Scale up. Connection timeouts? Recycle the pool. Circuit breaker trips? Route around the problem.

It encodes heuristics. Proven response patterns get baked into automated workflows. These are the shortcuts experienced engineers use instinctively, now codified in runbooks and scripts.

It watches key metrics. Response time, error rate, availability—the outcomes that matter. When these cross thresholds, automation kicks in.

It prioritizes speed. Getting users back online matters more than understanding exactly what went wrong.

Examples of Where This Works

A previous employer used auto-scaling groups that added capacity when load increased. The system didn’t know if the traffic was a DDoS attack, a marketing campaign, organic growth, or its driving purpose: everyone working from home due to snow. It just knew latency was rising and responded accordingly. Simple, effective, fast.

Netflix pioneered adaptive concurrency limiters that dynamically adjust based on observed latency. The system doesn’t model why latency increases—database contention, network issues, downstream service problems. It just adapts to maintain optimal throughput. This works brilliantly for their use case.

Cache invalidation routines that trigger on specific error patterns. Database connection pool recycling when query latency spikes. Session redistribution when server health degrades. Circuit breakers that isolate failing services. All fitness-based automation—outcomes focused, not understanding focused.

When This Approach Falls Short

I’ve also seen the failure modes:

Novel problems confuse it. When the system encounters something genuinely new, pattern-based responses may apply the wrong fix.

It can oscillate. Automated responses trigger counter-responses. Scale up causes cost alarms that trigger scale down, which causes performance alarms that trigger scale up again.

It masks deeper issues. The circuit breaker trips, traffic routes elsewhere, metrics look fine. Meanwhile, the underlying problem—maybe a memory leak in that service—continues growing. You’ve treated the symptom, not the disease.

Teams become complacent. When automation handles everything, engineers stop questioning whether responses are appropriate. I’ve seen “just restart it” become organizational policy when that automation should have been a stopgap.

The Deep-Dive Automation: Understanding-Driven Analysis

The contrasting approach—what I call “reality-oriented automation”—tries to codify comprehensive understanding. It’s slower but builds lasting knowledge.

How It Works

This automation follows different patterns:

It models systems. Attempts to represent actual behavior and relationships between components. These aren’t just metrics—they’re structural models of how things interact.

It traces causation. Focuses on root causes, not symptoms. Follows chains of events backward through multiple system layers.

It detects anomalies. Identifies deviations from expected behavior based on comprehensive baselines, not just threshold violations.

It accumulates knowledge. Builds and refines understanding over time, creating organizational memory that persists beyond individual engineers.

It explains decisions. When this automation suggests an action, it can articulate why—what evidence supports that conclusion.

Examples of Where This Works

Google’s Monarch monitoring system exemplifies this approach. Rather than just triggering alerts on threshold violations, it builds comprehensive time-series models of system behavior. It enables high-cardinality dimensional analysis—you can slice metrics by any attribute combination and ask complex questions about system state.

I worked with a distributed tracing platform that modeled request flows through a microservice architecture. When performance degraded, it could identify exactly which service interaction was causing delays and show you the full context—not just “service X is slow” but “service X is slow specifically when called by service Y with this type of request under these conditions.”

Root cause analysis tools that automatically correlate configuration changes with performance shifts. Topology mapping that discovers system dependencies. Incident databases that identify patterns across similar issues. All reality-based—understanding focused, not just outcome focused.

When This Approach Falls Short

I’ve experienced the limitations:

Analysis takes time. When users are down, comprehensive understanding doesn’t help much. I’ve watched engineers spend an hour building causal models while stakeholders asked why the service wasn’t back up yet.

It generates overwhelming data. One system I worked with produced so much telemetry that human operators couldn’t process it. The comprehensive monitoring actually made troubleshooting harder.

Models are always incomplete. Systems have emergent properties that resist complete understanding. I’ve seen detailed architectural documentation that was technically accurate but missed critical runtime behaviors.

It creates false confidence. Engineers believe they fully understand the system because they have comprehensive metrics. Then something unexpected happens that the model didn’t capture.

Both Approaches Have a Place

I don’t see this as either/or. Mature operations teams I’ve worked with use both:

Tiered response. Fitness-based automation handles immediate issues. Reality-based tools support deeper analysis of recurring problems. Get users back online first, then figure out why it happened.

Feedback loops. Reality-based analysis improves fitness-based responses. Once you understand root causes, you can encode better heuristics.

Selective application. Critical services get comprehensive understanding. Less critical components get pattern-based responses. Not everything needs the same level of analysis.

Escalation paths. Start with quick fixes. If those don’t work or the issue recurs, escalate to deeper investigation.

At one company, we had auto-remediation for common issues, detailed tracing for investigating new problems, and regular reviews where we’d convert new learnings into additional automation. Each approach fed the others.

Beyond Quick Fixes and Deep Dives: Antifragile Systems

Here’s where it gets interesting. There’s actually a third approach that transcends both fitness and reality perception.

Nassim Taleb introduced the concept of antifragility in his book “Antifragile: Things That Gain from Disorder.” Most systems are either fragile (break under stress) or robust (withstand stress). But antifragile systems actually improve from stress.

Think about how muscles work. Exercise stresses them. They don’t just recover—they overcompensate and grow stronger. That’s antifragility.

What Makes Systems Antifragile

Antifragile systems borrow from both approaches while adding new dimensions:

From fitness perception:

Quick adaptation to changing conditions
Circuit breakers and backpressure mechanisms
Multiple solution paths for problems
Evolutionary architecture that optimizes based on outcomes

From reality perception:

Stress testing to understand true limitations
Comprehensive modeling of failure modes
Deliberate exploration of edge cases
Detailed post-mortems that extract learning

But they add new elements:

Hormesis - Deliberately introducing controlled stress to build strength
Overcompensation - Responding to problems by becoming stronger than necessary
Optionality - Maintaining multiple approaches to handle unknown challenges
Via Negativa - Improving by removing fragilities rather than adding features

The Barbell Strategy in System Design

Taleb proposes a “barbell strategy” that I’ve found incredibly useful in system architecture. Rather than seeking middle ground, combine extremes:

Ultra-safe core: Build critical functionality with extreme correctness guarantees. Formal verification where possible. Exhaustive testing. Rigorous change control.

Experimental edges: Implement non-critical features with flexibility prioritized. Rapid iteration. Quick adaptation. Learn from failures.

Avoid the middle: Don’t build systems that are neither formally verified nor truly adaptable.

I worked on a payment processing system that followed this pattern. Transaction processing had formally verified state machines and exhaustive testing. The user interface and reporting features used evolutionary deployment with continuous experimentation. Critical correctness where it mattered, adaptive robustness everywhere else.

Implementing Antifragility: Practical Patterns

So how do you actually build systems that get stronger from stress?

1. Chaos Engineering That Teaches

Netflix’s Chaos Monkey randomly kills production instances. But the real insight isn’t the chaos—it’s what you learn from it.

I helped introduce chaos engineering at a previous company. We started small: killing single containers during business hours when the team was ready to respond. Gradually increased scope to entire availability zones.

The key was treating each experiment as a learning opportunity. We didn’t just verify the system survived—we measured how it recovered, what adaptation mechanisms activated, where recovery could be faster. Each chaos experiment made subsequent responses better.

Implementation details I learned:

Start with non-critical services
Gradually increase blast radius
Always have rollback mechanisms
Measure recovery effectiveness, not just survival
Document what you learn

2. Adaptive Architecture That Evolves

Systems that reconfigure themselves based on observed performance.

I built a service mesh that dynamically updated routing rules based on observed latency and error rates. Not just threshold-based responses—the system learned which routes performed best under which conditions and adapted its strategy over time.

This meant:

Client-side adaptive retries that adjusted backoff based on observed success rates
Automatic resource allocation that learned from historical patterns
Dynamic feature flags that disabled expensive features under load

3. Hormetic Deployment Practices

Controlled stress during deployment builds resilience.

Rather than trying to make deployments completely safe, we made them deliberately stressful:

Progressive traffic shifting with automated rollback
Synthetic load injection during deployment
Deliberately degraded test environments
Deployment during business hours (when teams could respond)

Each deployment became an opportunity to strengthen the system.

4. Optionality Engineering

Multiple approaches for critical functions.

For payment processing, we maintained:

Multiple implementation strategies for core transactions
Dynamic algorithm selection based on performance
Data redundancy across different storage technologies
Multi-cloud deployment with actual workload portability

When one approach failed, others remained available. And failures in one approach informed improvements in alternatives.

5. Via Negativa Operations

Sometimes getting stronger means removing things, not adding them.

I’ve seen systems improve dramatically by:

Removing unnecessary dependencies
Eliminating unused features and code
Simplifying deployment processes
Getting rid of single points of failure

Each removal was carefully tested and monitored. But systems often became more resilient with less complexity, not more.

Real-World Antifragility: A Payment System

Let me describe a system I worked on that integrated all these principles:

Core transaction processing (Correctness):

Formally verified state machine for transaction status
Property-based testing of balance calculations
Triple-redundant storage
Cryptographic verification of state changes

Scaling layer (Robustness):

Automatic horizontal scaling based on throughput
Circuit breakers on all downstream dependencies
Graceful degradation of non-essential features
Regional failover with continuous testing

Antifragile elements (Improvement):

Weekly “chaos days” with synthetic failures
Performance data that generated new test scenarios
Multi-algorithm processing that selected optimal paths
Automatic dependency simplification based on failure patterns
Developer incentives tied to resilience metrics

This system didn’t just survive problems—it systematically became more capable each time it faced challenges.

The Evolution Path

I’ve seen systems progress through these stages:

Fragile (Starting state):

Break under stress
Need manual intervention
Fail unpredictably
Little monitoring
Fear change

Robust (Fitness-oriented):

Withstand expected stress
Automated recovery
Predictable failures
Basic monitoring
Controlled change process

Resilient (Reality-oriented):

Recover from unexpected stress
Comprehensive diagnostics
Preserve critical functions
Detailed observability
Architectural flexibility

Antifragile (Transcendent):

Improve from stress exposure
Dynamic adaptation
Convert failures to improvements
Generate insights
Continuous evolution

The progression isn’t always linear, but mature systems tend to move in this direction.

Why This Matters

The reality versus fitness dichotomy presents a false choice. So does correctness versus robustness.

The most sophisticated systems I’ve worked with:

Use fitness-based heuristics for immediate response
Deploy reality-based analysis for understanding
Build architecture that learns from stress
Apply correctness where it matters most
Implement robustness for graceful degradation
Create feedback loops for continuous improvement

They don’t choose between approaches—they transcend the dichotomy by recognizing when each applies and how they work together.

Questions to Consider

How might you introduce controlled stress into your current systems? What would you learn from it?

Where are you stuck in false trade-offs between speed and understanding, or between correctness and robustness?

Are there areas where removing complexity might actually make your systems stronger?

What would it mean for your architecture to actively improve from challenges rather than just recovering from them?