Chapter 5: Error Handling & Recovery

Your agent will fail. The question is whether it fails gracefully or catastrophically.

At 3am on a Tuesday, I got paged because one of our agents had been hammering the Claude API for six hours straight. The original request had timed out—a simple network blip—and our naive retry logic did exactly what we told it to: try again immediately. And again. And again. A thousand times per minute, for six hours. We burned through our rate limit, triggered abuse detection, and accomplished exactly nothing.

The fix took fifteen minutes. The lesson took longer to sink in.

What Actually Goes Wrong

Failures in agentic systems cluster into distinct categories, and each demands a different response.

Infrastructure failures—network timeouts, connection resets, DNS hiccups—are transient. The request was fine; the pipes failed. Retry usually works.

Rate limiting isn't an error in the traditional sense. The API is telling you to slow down. Treating it as a failure and retrying immediately is exactly wrong.

Invalid LLM responses happen because language models are probabilistic. You'll get JSON with trailing commas, tool calls to nonexistent tools, truncated outputs. The model isn't broken; nondeterministic output is inherent to how language models work, so validate every response before acting on it.

Tool failures occur when the agent's actions meet reality. Disk full. Permission denied. Unexpected API response. The agent asked for something reasonable; the world said no.

Context overflow is unique to LLM systems. Unlike traditional software, your agent has a hard ceiling on working memory. Exceed it and there's no graceful fallback—the request simply cannot proceed as-is.

The critical insight: a retry strategy that works for network errors will make rate limiting worse and do nothing for context overflow. You need to know why something failed before you can fix it.
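One way to make "know why it failed" concrete is to classify every error before choosing a response. The sketch below is illustrative only: FailureKind, ToolError, ContextOverflowError, and classify_failure are hypothetical names, and the mapping from Python exception types to the five categories above is an assumption you would adapt to your own client library.

```python
import json
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    INFRASTRUCTURE = "infrastructure"
    RATE_LIMIT = "rate_limit"
    INVALID_RESPONSE = "invalid_response"
    TOOL_FAILURE = "tool_failure"
    CONTEXT_OVERFLOW = "context_overflow"


class ToolError(Exception):
    """Hypothetical: raised when a tool's action is rejected by the environment."""


class ContextOverflowError(Exception):
    """Hypothetical: raised when a request exceeds the model's context window."""


def classify_failure(error: Exception, status: Optional[int] = None) -> FailureKind:
    """Map an error (plus an optional HTTP status) to one of the five categories."""
    if status == 429:
        return FailureKind.RATE_LIMIT
    if isinstance(error, ContextOverflowError):
        return FailureKind.CONTEXT_OVERFLOW
    if isinstance(error, ToolError):
        return FailureKind.TOOL_FAILURE
    if isinstance(error, (ConnectionError, TimeoutError)):
        return FailureKind.INFRASTRUCTURE
    # json.JSONDecodeError is a ValueError subclass, so malformed JSON lands here.
    if isinstance(error, ValueError):
        return FailureKind.INVALID_RESPONSE
    # Anything unrecognized is re-raised rather than guessed at.
    raise error
```

Each recovery strategy in the rest of this chapter can then branch on the category instead of on raw exception types.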

Retries Done Right

The simplest recovery is trying again. But simple doesn't mean naive.

The Backoff Algorithm

Picture a server struggling under load. Now picture a hundred clients, all failing simultaneously, all retrying as fast as possible. You've just DDoS'd an already-struggling system. Exponential backoff avoids this by doubling the wait after each failed attempt, up to a cap:

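function retryWithBackoff(operation, maxAttempts):
    baseDelay = 1 second
    maxDelay = 60 seconds
    
    for attempt in 1 to maxAttempts:
        result = try operation()
        if result.success:
            return result
        
        // Permanent errors will fail again; surface them immediately
        if not isRetryable(result.error):
            throw result.error
        
        // Out of attempts: don't sleep before giving up
        if attempt == maxAttempts:
            break
        
        // Double the wait after each failure, capped at maxDelay
        delay = min(baseDelay * (2 ^ attempt), maxDelay)
        sleep(delay)
    
    throw MaxRetriesExceeded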

The first retry waits 2 seconds, the second 4, the third 8, and so on up to the 60-second cap. You're creating breathing room for the failing system while still making progress.
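To see where the cap kicks in, here is the same delay formula worked out for the first seven attempts. This is a small sketch using the 1-second base and 60-second cap from the pseudocode; backoff_delay is just an illustrative name.

```python
BASE_DELAY = 1.0   # seconds
MAX_DELAY = 60.0   # seconds


def backoff_delay(attempt: int) -> float:
    """Delay before the retry that follows failed attempt number `attempt` (1-based)."""
    return min(BASE_DELAY * 2 ** attempt, MAX_DELAY)


schedule = [backoff_delay(attempt) for attempt in range(1, 8)]
print(schedule)  # [2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

From the sixth failure onward the uncapped value (64 seconds, then 128) exceeds the ceiling, so every later retry waits a flat 60 seconds.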

The Thundering Herd

But there's a problem. Imagine a thousand clients hit the same network partition at the same moment. They all back off for 2 seconds. Then they all retry simultaneously. Another spike. They all back off for 4 seconds. Another synchronized retry.

You need jitter:

delay = min(baseDelay * (2 ^ attempt), maxDelay)
jitter = random(0, delay * 0.25)
sleep(delay + jitter)

That random offset spreads retries across time instead of concentrating them at predictable intervals.
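Putting the two pieces together, a runnable version might look like the sketch below. It assumes the operation signals failure by raising an exception, and that the caller supplies an is_retryable predicate; the sleep parameter is an addition of mine so the function can be exercised without actually waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class MaxRetriesExceeded(Exception):
    """Raised when every attempt failed with a retryable error."""


def retry_with_backoff(
    operation: Callable[[], T],
    is_retryable: Callable[[Exception], bool],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as error:
            # Permanent errors will fail again, so surface them immediately.
            if not is_retryable(error):
                raise
            # Out of attempts: give up without a pointless final sleep.
            if attempt == max_attempts:
                raise MaxRetriesExceeded(f"gave up after {max_attempts} attempts") from error
            delay = min(base_delay * 2 ** attempt, max_delay)
            # Up to 25% extra, so clients that failed together do not retry together.
            sleep(delay + random.uniform(0, delay * 0.25))
    raise MaxRetriesExceeded("max_attempts must be at least 1")
```

A call such as retry_with_backoff(fetch, lambda e: isinstance(e, ConnectionError)) would retry a hypothetical fetch function on connection errors only and let everything else propagate.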

Knowing When to Retry

Not everything deserves a retry:

Status	Meaning	Action
400 Bad Request	Your request is malformed	Fix it, don't retry
401 Unauthorized	Auth failed	Refresh token, then retry
404 Not Found	Resource doesn't exist	Don't retry
429 Too Many Requests	Slow down	Wait (honor Retry-After if present), then retry
500 Internal Server Error	Their problem	Retry with backoff
503 Service Unavailable	Service down	Retry with backoff
529 Overloaded (Anthropic-specific)	At capacity	Retry for user-facing work; give up for background tasks

Connection resets are special—often caused by stale keep-alive connections. Disable connection reuse and retry once before escalating.

Context overflow can't be retried as-is. You need to transform the context first.
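The status table translates fairly directly into code. The following is a minimal sketch, not a complete policy: the function names are hypothetical, 401 is left out because it needs a token refresh rather than a plain retry, and only the delay-in-seconds form of the Retry-After header is parsed (the HTTP-date form is ignored here).

```python
from typing import Mapping, Optional

# Statuses from the table where waiting and retrying can help.
ALWAYS_RETRYABLE = {429, 500, 503}
OVERLOADED = 529  # Anthropic-specific


def is_retryable_status(status: int, user_facing: bool = True) -> bool:
    """Apply the table: retry transient statuses, and 529 only for user-facing work."""
    if status == OVERLOADED:
        return user_facing
    return status in ALWAYS_RETRYABLE


def retry_after_seconds(headers: Mapping[str, str]) -> Optional[float]:
    """Return the server-requested wait in seconds, or None if absent or unparseable."""
    for name, value in headers.items():
        if name.lower() == "retry-after":
            try:
                return max(0.0, float(value))
            except ValueError:
                return None  # HTTP-date form: not handled in this sketch
    return None
```

When retry_after_seconds returns a value, it is usually better to wait that long than to trust your own backoff schedule, since the server knows when capacity will return.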

The Escalation Ladder

When retries don't work, you need increasingly dramatic interventions.

Level 1: Retry the same request. Wait, try again, hope the transient issue resolved.

Level 2: Modify the request. Too many tokens? Truncate input. Model overloaded? Fall back to a different model. Request too complex? Break it apart.

Level 3: Transform the context. When accumulated state is the problem—not the immediate request—reshape that state. Compact conversation history. Summarize old tool results. Drop low-priority context.

Level 4: Escalate to the user. Some problems require human judgment: authentication that can't be auto-resolved, ambiguous situations needing clarification, destructive operations needing approval. This isn't failure; it's correct delegation.

Level 5: Graceful failure with state preservation. When all else fails, fail well. Save state so work isn't lost. Explain clearly what happened. Leave the system recoverable.

function executeWithRecovery(request, context):
    // Level 1: Simple retry
    for attempt in 1 to 3:
        result = try execute(request)
        if result.success:
            return result
        if not isRetryable(result.error):
            break
        // Skip the wait after the final attempt; the next level takes over
        if attempt < 3:
            sleep(exponentialBackoffWithJitter(attempt))
    
    // Level 2: Request modification
    if result.error.type == 'ContextOverflow':
        reducedRequest = truncateInput(request, 0.75)
        result = try execute(reducedRequest)
        if result.success:
            return result
    
    if result.error.type == 'ModelOverloaded':
        result = try execute(request, model: fallbackModel)
        if result.success:
            return result
    
    // Level 3: Context transformation
    if result.error.type == 'ContextOverflow':
        compactedContext = compactHistory(context)
        result = try execute(request, context: compactedContext)
        if result.success:
            return result
    
    // Level 4: User escalation
    if isUserPresent():
        userDecision = promptUser(
            "I encountered an error I can't automatically resolve: " +
            formatUserFriendlyError(result.error) +
            "\nWould you like me to try a different approach?"
        )
        return handleUserDecision(userDecision)
    
    // Level 5: Graceful failure
    savedStateId = saveState(context, request, result.error)
    throw GracefulFailure(
        message: "Unable to complete after multiple recovery attempts",
        state: savedStateId,
        suggestion: "Resume with: /resume " + savedStateId
    )
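The Level 3 step is the least obvious one, so here is one possible shape for it. This sketch assumes messages are plain dictionaries with role and content keys, and it stands in for real summarization with simple truncation of old tool results; compact_history, keep_recent, and max_tool_chars are illustrative names, not part of any particular SDK.

```python
from typing import Dict, List

Message = Dict[str, str]


def compact_history(
    messages: List[Message],
    keep_recent: int = 4,
    max_tool_chars: int = 200,
) -> List[Message]:
    """Shrink old tool results while leaving the most recent messages untouched."""
    cutoff = max(0, len(messages) - keep_recent)
    compacted: List[Message] = []
    for index, message in enumerate(messages):
        is_old = index < cutoff
        is_bulky_tool_result = (
            message["role"] == "tool_result" and len(message["content"]) > max_tool_chars
        )
        if is_old and is_bulky_tool_result:
            # Copy rather than mutate, so the full transcript on disk stays intact.
            shortened = message["content"][:max_tool_chars] + " [truncated]"
            compacted.append({**message, "content": shortened})
        else:
            compacted.append(message)
    return compacted
```

A production version would more likely ask a model to summarize the dropped material, but the structure is the same: protect recent turns, shrink what is old and bulky.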

Tool Error Containment

When a bash command fails, you have a choice: crash the agent loop or feed the failure back as information.

Tool: bash
Input: apt-get install nodejs
Result: {
    "success": false,
    "exit_code": 1,
    "stderr": "E: Could not open lock file - permission denied"
}

A fragile system throws an exception. A resilient system returns this as a tool result and lets the LLM reason about it: "Permission denied—I should try sudo or ask about alternative installation methods."

The implementation is minimal:

function executeTool(tool, input):
    try:
        result = tool.execute(input)
        return { role: 'tool_result', content: result, is_error: false }
    catch error:
        return { role: 'tool_result', content: formatToolError(error), is_error: true }

Errors don't crash the loop. They become part of the conversation.
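For a concrete case, here is roughly what containment could look like for the bash tool, using Python's standard subprocess module. The result shape mirrors the example above; run_bash_tool and the 30-second default timeout are assumptions made for this sketch.

```python
import subprocess
from typing import Any, Dict


def run_bash_tool(command: str, timeout_seconds: float = 30.0) -> Dict[str, Any]:
    """Run a shell command and always return a tool result, never raise."""
    try:
        completed = subprocess.run(
            command,
            shell=True,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # A hung command becomes information for the model, not a crash.
        return {
            "role": "tool_result",
            "is_error": True,
            "content": {
                "success": False,
                "exit_code": None,
                "stderr": f"Command timed out after {timeout_seconds} seconds",
            },
        }
    succeeded = completed.returncode == 0
    return {
        "role": "tool_result",
        "is_error": not succeeded,
        "content": {
            "success": succeeded,
            "exit_code": completed.returncode,
            "stdout": completed.stdout,
            "stderr": completed.stderr,
        },
    }
```

Note that a nonzero exit code is reported with is_error set rather than raised, so the model sees stderr and can decide what to try next.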

Mid-Task Recovery

Agents get interrupted. Process killed. Connection dropped. User closed laptop. When they resume, the first question is: what was I doing?

If you persist the conversation to disk after each turn, recovery becomes possible. The key checks:

Message chain integrity. A transcript ending with assistant tool calls but no tool results indicates an interrupted turn.

File state validation. Were files being edited? Do they still exist? Have they been modified externally?

Plan state restoration. Multi-step plan in progress? Which steps completed? Which need retry?

function resumeSession(transcriptPath):
    transcript = loadTranscript(transcriptPath)
    
    lastMessage = transcript.messages.last()
    if lastMessage.role == 'assistant' and hasToolCalls(lastMessage):
        pendingTools = lastMessage.toolCalls
        
        if not hasMatchingToolResults(transcript, pendingTools):
            for tool in pendingTools:
                result = checkToolStateAndRecover(tool)
                transcript.append(result)
    
    for fileRef in extractFileReferences(transcript):
        if not fileExists(fileRef.path):
            transcript.append(systemMessage(
                "Note: " + fileRef.path + " no longer exists"
            ))
        else if fileModified(fileRef.path, since: fileRef.timestamp):
            transcript.append(systemMessage(
                "Note: " + fileRef.path + " modified externally"
            ))
    
    return transcript

The resumed agent won't pick up exactly where it left off, but it'll understand what happened and make informed decisions about how to proceed.
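The persistence side can be very small. The sketch below assumes one JSON object per line in an append-only file, and a hypothetical message schema in which assistant messages carry a tool_calls list with id fields and tool results carry a matching tool_call_id. Neither the schema nor the function names come from a specific framework.

```python
import json
import os
from typing import Any, Dict, List

Message = Dict[str, Any]


def append_message(path: str, message: Message) -> None:
    """Append one message and force it to disk before the next turn starts."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(message) + "\n")
        handle.flush()
        os.fsync(handle.fileno())


def load_transcript(path: str) -> List[Message]:
    """Load messages, stopping at a half-written line left by a crash."""
    messages: List[Message] = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            try:
                messages.append(json.loads(line))
            except json.JSONDecodeError:
                break  # interrupted mid-write; everything before it is still valid
    return messages


def pending_tool_calls(messages: List[Message]) -> List[Message]:
    """Return tool calls that never received a result (an interrupted turn)."""
    answered = {
        m.get("tool_call_id") for m in messages if m.get("role") == "tool_result"
    }
    pending: List[Message] = []
    for message in messages:
        if message.get("role") == "assistant":
            for call in message.get("tool_calls", []):
                if call["id"] not in answered:
                    pending.append(call)
    return pending
```

Appending line by line means a crash can corrupt at most the final record, which is why the loader tolerates one bad trailing line instead of rejecting the whole file.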

Graceful Degradation

Sometimes the choice isn't success versus failure—it's partial success versus total failure.

An agent refactors 47 of 50 files before three fail with edge cases. Should it roll back everything? Or complete what it can and report what remains?

Model fallback. Primary model unavailable? Try a simpler one. Lower quality beats nothing.

function queryWithFallback(request):
    models = ['claude-sonnet-4-20250514', 'claude-3-5-haiku-20241022']
    
    for model in models:
        try:
            return query(request, model: model)
        catch error if error.type == 'ModelUnavailable':
            continue
    
    throw AllModelsUnavailable

Feature degradation. Caching broken? Disable it. Parallel execution causing issues? Fall back to sequential.

Partial results. Complete what you can. Report successes and failures separately. Let the user decide whether to retry failures or accept what was accomplished.
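For the refactoring scenario, a partial-results loop might look like the sketch below. BatchReport and run_batch are hypothetical names, and the action callable stands in for whatever per-file work your agent performs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class BatchReport:
    succeeded: List[str] = field(default_factory=list)
    failed: Dict[str, str] = field(default_factory=dict)  # item -> reason


def run_batch(items: Iterable[str], action: Callable[[str], None]) -> BatchReport:
    """Apply `action` to every item, recording failures instead of stopping."""
    report = BatchReport()
    for item in items:
        try:
            action(item)
            report.succeeded.append(item)
        except Exception as error:
            # Keep going: one bad file should not discard the other results.
            report.failed[item] = f"{type(error).__name__}: {error}"
    return report
```

The report gives the user both lists, so "47 succeeded, 3 failed, here is why" becomes a decision they can make rather than an all-or-nothing outcome.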

Two Audiences for Errors

Every error has two audiences with different needs.

Users need clarity and actionability:

// Bad
"Error: ECONNRESET at TCP.onStreamRead (node:internal/stream:333:27)"

// Good
"I lost connection to Claude's servers. This usually resolves quickly.
Try again, or check your internet connection if the problem persists."

Developers need debugging context: stack traces, request IDs, timing, system state.

function handleError(error, context):
    log.error({
        message: error.message,
        stack: error.stack,
        requestId: context.requestId,
        timestamp: now(),
        systemState: captureSystemState()
    })
    
    return {
        userMessage: translateToUserFriendly(error),
        suggestion: getSuggestion(error),
        canRetry: isRetryable(error)
    }

Map technical errors to human explanations:

function translateToUserFriendly(error):
    if error.type == 'RateLimitExceeded':
        return "Too many requests. Please wait a moment."
    
    if error.type == 'ContextOverflow':
        return "Our conversation has grown too long. I'll summarize earlier parts."
    
    if error.type == 'AuthenticationFailed':
        return "Session expired. Run /login to reconnect."
    
    return "Something unexpected went wrong. Details logged for investigation."

The Resilience Mindset

When building an agent, keep asking: "What happens when this fails?"

API returns garbage?
Tool takes 10x longer than expected?
Context space exhausted?
User closes tab mid-operation?
File state assumptions wrong?

Each question needs an answer. Retry, escalate, save state and exit gracefully—all valid. "Crash and lose everything" never is.

The agents that earn trust handle adversity well. They say "I hit a problem, here's what I accomplished, here's how we proceed." They preserve work rather than losing it. They explain problems rather than hiding them. They degrade smoothly rather than failing catastrophically.

This isn't glamorous work. But it's the work that separates demos from products, and reliability is what turns clever prototypes into tools people depend on.