Running Kubernetes in production on multiple cloud providers means juggling OpenTofu configurations, Helm charts, cluster health checks, log analysis, and deployment pipelines — often under pressure when something breaks at 2 AM. Over the past months, I've been using Claude Code as an infrastructure copilot, and the workflow has fundamentally changed how I operate our clusters.
This post walks through the practical patterns I've developed, what works well, and how to set up Claude Code so it's genuinely useful for infra work rather than a liability.
The Setup
Our infrastructure currently spans three Kubernetes environments across two cloud providers:
- Staging — on Google Kubernetes Engine
- Production — on Google Kubernetes Engine
- Customer-specific deployment — running on StackIT in the customer's account
Each environment is managed by OpenTofu (the open-source Terraform fork), with a shared Helm chart deploying our application stack. The codebase includes 10 OpenTofu modules, a Helm chart with 17 templates, and CI/CD via GitHub Actions. The two GCP clusters share most of their configuration by using the same OpenTofu modules. The StackIT cluster has a few separate modules for StackIT-specific resources.
Claude Code sits at the center of this, with direct access to kubectl, tofu plan, and a suite of custom skills that turn common ops tasks into conversational interactions.
Principle 1: Safety Guardrails Are Non-Negotiable
The single most important thing when giving an AI tool access to your infrastructure: define what it cannot do.
Our CLAUDE.md file — a project-level instruction file that Claude Code reads automatically — establishes hard boundaries:
For safety reasons, you are not allowed to run `tofu apply`. You can however
suggest `tofu apply` commands to the user. The user will review the command
and run it for you. You are free to run `tofu plan` to study its output.
In addition to CLAUDE.md, dangerous commands are also explicitly forbidden in .claude/settings.json.
This is the key design pattern: read access is generous, write access requires human approval. Claude Code can:
- Run tofu plan and analyze the output
- Run any kubectl get or kubectl describe command, except for secrets
- Read and modify configuration files
- Suggest exact commands for me to execute
But it cannot:
- Apply infrastructure changes (tofu apply)
- Delete resources
- Push to production
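The read/write split above can be sketched as a small classifier. This is only an illustration of the policy, not how Claude Code enforces permissions internally: the function name `classify_command` and the three labels are hypothetical, and the prefix matching is deliberately naive.

```sh
#!/bin/sh
# Hypothetical policy sketch: map a command line to allow / ask / deny.
# Deny rules are checked first so that secrets and applies never fall
# through to the generous read-only rules below them.
# Caveat: this is plain prefix matching, so a reordered command such as
# "kubectl get -n prod secrets" would slip past the secrets rule.
classify_command() {
  case "$1" in
    "tofu apply"*|"tofu destroy"*|"kubectl delete"*|"kubectl get secret"*|"kubectl describe secret"*)
      echo "deny" ;;
    "tofu plan"*|"kubectl get "*|"kubectl describe "*)
      echo "allow" ;;
    *)
      # Anything unrecognized needs a human decision.
      echo "ask" ;;
  esac
}
```

For example, `classify_command "kubectl get pods -A"` prints `allow`, while `classify_command "tofu apply"` prints `deny`. Defaulting unknown commands to `ask` keeps the failure mode conservative.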
The CLAUDE.md file also encodes domain knowledge that prevents subtle mistakes:
Changes to `node_config` (including `resource_labels`, `machine_type`,
`oauth_scopes`, `metadata`, etc.) trigger rolling replacement of all nodes
in the affected node pool, causing temporary pod disruptions.
This means when Claude Code suggests an infrastructure change, it already knows to warn me if modifying a node pool attribute would cause a rolling restart of all nodes. This kind of contextual awareness is where AI infra management goes from “neat demo” to “genuinely useful.”
Principle 2: Custom Skills Turn Repetitive Ops Into Conversations
Claude Code supports custom skills — markdown files with structured workflows that the AI follows when triggered by natural language or a slash command. We've built various skills that cover the most common ops tasks:
Cluster Health Checks
Instead of remembering the exact kubectl incantations to check node status, pending PVCs, failed jobs, and certificate expiry, I just ask:
“How's the dev cluster doing?”
This triggers the k8s-status skill, which runs a shell script checking nodes, pod states, resource usage, PVC status, certificate expiry, and failed jobs. Claude Code then formats it into a structured health report with an overall status (HEALTHY / WARNING / CRITICAL) and actionable recommendations.
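A minimal sketch of how such a script might roll individual checks up into one overall status. The helper names `count_not_ready` and `overall_status`, and the rule for what counts as CRITICAL versus WARNING, are assumptions for illustration and not taken from the actual script.

```sh
#!/bin/sh
# Count nodes whose STATUS column is not exactly "Ready".
# Expects `kubectl get nodes --no-headers` output on stdin.
# Note: a cordoned node ("Ready,SchedulingDisabled") is counted too.
count_not_ready() {
  awk '$2 != "Ready" { n++ } END { print n + 0 }'
}

# Roll four counters up into a single status word.
# Assumed severity: node or pod failures are CRITICAL,
# pending PVCs or failed jobs are only a WARNING.
overall_status() {
  not_ready_nodes=$1
  failing_pods=$2
  pending_pvcs=$3
  failed_jobs=$4
  if [ "$not_ready_nodes" -gt 0 ] || [ "$failing_pods" -gt 0 ]; then
    echo "CRITICAL"
  elif [ "$pending_pvcs" -gt 0 ] || [ "$failed_jobs" -gt 0 ]; then
    echo "WARNING"
  else
    echo "HEALTHY"
  fi
}
```

Against a live cluster this could be wired up as `overall_status "$(kubectl get nodes --no-headers | count_not_ready)" 0 0 0`, with the remaining counters filled in by similar helpers.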
Log Analysis
“Why is the celery deployment failing on prod?”
The k8s-logs skill fetches logs from all pods in the deployment, filters for errors and warnings, provides timestamped output with counts, and suggests next steps. The key here is that the skill wraps a shell script with flags for --errors, --previous, --since, and --container — so Claude Code can intelligently choose the right flags based on what I'm asking.
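A rough sketch of how the wrapper's flags could translate into a `kubectl logs` invocation. The function name `build_logs_cmd`, the label selector argument, and the error filter pattern are assumptions; only the four flag names come from the description above. The function prints the command instead of running it, which makes it easy to inspect.

```sh
#!/bin/sh
# Build (but do not run) a kubectl logs command from wrapper flags.
# $1 is a label selector such as app=celery (hypothetical label).
build_logs_cmd() {
  selector=$1
  shift
  # --tail=-1 because selector queries otherwise return only the
  # last 10 lines per pod.
  cmd="kubectl logs -l $selector --prefix --timestamps --tail=-1"
  filter=""
  while [ $# -gt 0 ]; do
    case "$1" in
      --previous)
        cmd="$cmd --previous" ;;
      --since|--container)
        # These two flags take a value, passed through as --flag=value.
        [ $# -ge 2 ] || { echo "missing value for $1" >&2; return 1; }
        cmd="$cmd $1=$2"
        shift ;;
      --errors)
        # Assumed filter: keep lines mentioning errors or warnings.
        filter=" | grep -Ei 'error|warn'" ;;
      *)
        echo "unknown flag: $1" >&2
        return 1 ;;
    esac
    shift
  done
  echo "$cmd$filter"
}
```

Calling `build_logs_cmd app=celery --previous --since 1h` prints a command ending in `--previous --since=1h`, which is roughly what "why did it crash in the last hour?" should map to.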
Infrastructure Planning
“Run a tofu plan on dev”
The tofu-plan skill handles environment selection, runs the plan with proper environment variable loading via direnv, parses the output into a structured summary (X additions, Y changes, Z destructions), and highlights destructive changes with prominent warnings. It then suggests the exact tofu apply command if I want to proceed — but never runs it.
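The parsing step can be sketched as follows, assuming the plan was run with `-no-color` so the summary line arrives without ANSI escapes. The function name `summarize_plan` and the output wording are hypothetical; the only thing relied on is the standard `Plan: X to add, Y to change, Z to destroy.` line.

```sh
#!/bin/sh
# Read plan output on stdin and print a one-line summary, plus a
# warning when anything would be destroyed.
summarize_plan() {
  line=$(grep '^Plan: ' | tail -n 1)
  if [ -z "$line" ]; then
    echo "No plan summary line found (possibly no changes)"
    return 0
  fi
  # Each counter is the number directly before its phrase.
  add=$(echo "$line" | sed -E 's/.* ([0-9]+) to add.*/\1/')
  change=$(echo "$line" | sed -E 's/.* ([0-9]+) to change.*/\1/')
  destroy=$(echo "$line" | sed -E 's/.* ([0-9]+) to destroy.*/\1/')
  echo "additions=$add changes=$change destructions=$destroy"
  if [ "$destroy" -gt 0 ]; then
    echo "WARNING: plan destroys $destroy resource(s); review before applying"
  fi
}
```

Typical use would be `tofu plan -no-color | summarize_plan`. The sketch only reads plan output; applying stays a manual step.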
Environment Comparison
“What's different between dev and prod?”
The cluster-diff skill compares main.tf and values.yaml between any two environments, highlighting differences in cluster configuration, resource limits, replica counts, and feature flags. This is invaluable for catching configuration drift.
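At its core this kind of comparison can be as simple as a unified diff over the two files in each environment directory. The sketch below assumes both files sit directly inside each directory; the function name `cluster_diff` is hypothetical, and the real skill presumably adds interpretation on top.

```sh
#!/bin/sh
# Diff main.tf and values.yaml between two environment directories.
# Returns 0 when both files are identical, 1 otherwise.
cluster_diff() {
  env_a=$1
  env_b=$2
  status=0
  for f in main.tf values.yaml; do
    echo "== $f =="
    if diff -u "$env_a/$f" "$env_b/$f"; then
      echo "(identical)"
    else
      status=1
    fi
  done
  return $status
}
```

The non-zero exit status on any difference makes the function usable as a gate in a script, for example before promoting a change from one environment to the next.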
Self-Healing Authentication
One of the more creative skills, init-cluster, handles Kubernetes authentication failures. When a kubectl command fails with “certificate has expired” or “Unauthorized,” the skill automatically walks through re-authentication based on the cloud provider: GKE uses gcloud re-auth, while StackIT requires a fresh kubeconfig download.
For StackIT clusters where certificates expire after 7 days, this saves a surprising amount of time. If all else fails, the skill can fire a Slack webhook to ask a team member for help, including the specific error and what's already been tried.
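The decision logic described here might be sketched like this. The function name `auth_remedy`, the provider keys, and the returned action labels are all hypothetical; only the two error strings and the per-provider remedies come from the text above.

```sh
#!/bin/sh
# Given a provider and a kubectl error message, name the next step.
auth_remedy() {
  provider=$1
  error=$2
  # Only react to the two known authentication failure messages.
  case "$error" in
    *"certificate has expired"*|*"Unauthorized"*) ;;
    *) echo "not-an-auth-error"; return 0 ;;
  esac
  case "$provider" in
    gke)     echo "gcloud-reauth" ;;
    stackit) echo "download-fresh-kubeconfig" ;;
    # Unknown provider: fall back to asking a human (e.g. via Slack).
    *)       echo "ask-teammate" ;;
  esac
}
```

Keeping detection and remedy in two separate `case` statements means a new provider only needs one extra line.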
Principle 3: Teach Domain Knowledge, Not Just Commands
The CLAUDE.md file is where we encode the tribal knowledge that makes the difference between a useful assistant and a dangerous one. Some examples from our setup:
Platform-specific gotchas:
Note: The clusterissuer is called `cert-manager`, not `letsencrypt-prod`.
Note: Alloy requires trailing commas in maps and lists, be sure to add these.
Architectural conventions:
If you add outputs to a module, these outputs should usually also be added
to the top-level `main.tf` file in the respective environment.
If a service uses persistent volumes, make sure to include an appropriate
deploymentStrategy value.
Environment awareness:
All commands that interact with a Kubernetes cluster must be run from the
corresponding subdirectory in `opentofu/environments`.
This is the stuff that usually lives in a team wiki nobody reads, or in someone's head. Putting it in CLAUDE.md means it's always applied, every time, without fail. Over time, the file becomes a living document: every time I catch a mistake or realize a convention wasn't documented, I add it to CLAUDE.md, or modify the respective skill. The agent gets better every day.
Principle 4: Permission Layering for Team Use
Claude Code's settings.json supports granular permission controls. Our setup pre-approves read operations while requiring confirmation for anything that modifies state.
This means kubectl get runs without prompts (read-only), the health check and log scripts are pre-approved, but anything destructive still requires explicit confirmation. The settings.json is checked into the repo so the team shares the same permission model, while settings.local.json (gitignored) allows individual developers to add their own trusted commands.
This layered approach means new team members get safe defaults out of the box, while experienced operators can unlock more autonomy locally.
Real Workflow Examples
Debugging a CrashLooping Pod
Me: “The celery workers are crash-looping on dev, can you investigate?”
Claude Code:
- Runs k8s-status to get a cluster overview, confirming the celery pods are in CrashLoopBackOff
- Runs k8s-logs with the --previous flag to get logs from the crashed container
- Runs k8s-describe on the pod to check events, resource limits, and container status
- Identifies the root cause (e.g., OOM kill due to memory limits)
- Suggests a specific fix to values.yaml with the right resource limit change
- Runs tofu plan to verify the change is non-destructive
- Provides the exact tofu apply command for me to run
What would have been 20 minutes of terminal tab-switching becomes a single conversation.
Adding Infrastructure Resources
Me: “I need to add a new Redis instance for the staging cluster”
Claude Code:
- Reads the existing Redis module to understand the pattern
- Reads the dev environment's main.tf to see how other modules are instantiated
- Adds the module reference with appropriate configuration
- Updates outputs in both the module and the environment's main.tf (because CLAUDE.md told it to)
- Checks values.yaml for any Helm chart references needed
- Runs tofu plan and presents a summary
- Warns me if any existing resources would be affected
Pre-Deployment Checklist
Me: “Compare dev and prod configs before I promote”
Claude Code:
- Runs cluster-diff between the two environments
- Highlights differences in replica counts, resource limits, and feature flags
- Notes any configuration that exists in dev but not prod (new features)
- Flags any concerning differences (e.g., dev has a debug flag enabled)
The 2 AM Incident
This is where the setup really pays off. A monitoring alert fires, and I open Claude Code on my laptop:
Me: “Cluster status on prod”
In seconds I have a full health report. Pods are pending.
Me: “Describe the pending pods”
Scheduling failure — nodes are at capacity.
Me: “Check the autoscaler logs”
The node pool autoscaler is stuck. From here, Claude Code can suggest the fix, I approve it, and we're back up. The entire investigation happens in one terminal, in one conversation, with context preserved throughout. No context-switching between dashboards, no re-reading runbooks.
Anatomy of a Skill
For those wanting to build their own, here's what a skill looks like. Each skill is a markdown file in .claude/skills/ with YAML frontmatter:
---
name: k8s-status
description: Get a quick overview of Kubernetes cluster health. Use when
the user asks "how is the cluster", "cluster status", "any issues in k8s",
"check cluster health", or "what's wrong with the cluster".
---
# Kubernetes Cluster Status
Provide a comprehensive health check of a Kubernetes cluster.
## Workflow
### 1. Select Environment
If the user hasn't specified an environment, list available environments...
### 2. Run Health Check Script
Run the health check script with the selected environment:
.claude/skills/k8s-status/k8s-status.sh <environment>
### 3. Present Summary
Format the output as a structured health report...
### 4. Provide Recommendations
For each issue found, suggest diagnostic commands...
The description field is what triggers the skill: Claude Code matches natural language against it. The body is a step-by-step workflow that the AI follows. Shell scripts do the heavy lifting; the skill file orchestrates the narrative.
The key insight is that skills are not rigid scripts. They're structured guidance. Claude Code adapts the workflow based on context — if I already mentioned which environment I'm working with, it skips the selection step. If the health check comes back clean, it skips the recommendations. The markdown format gives you structure without rigidity.
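To get a new skill off the ground, a small scaffold helper can stamp out the frontmatter and workflow headings. This is a sketch under assumptions: the helper name `new_skill` is hypothetical, and the file layout (`<root>/<name>/SKILL.md` next to a `<name>.sh` script) should be checked against the skill layout your Claude Code version expects.

```sh
#!/bin/sh
# Create <root>/<name>/SKILL.md with frontmatter and a workflow skeleton.
# Prints the path of the generated file.
new_skill() {
  root=$1
  name=$2
  description=$3
  dir="$root/$name"
  mkdir -p "$dir" || return 1
  cat > "$dir/SKILL.md" <<EOF
---
name: $name
description: $description
---
# $name
## Workflow
### 1. Select Environment
### 2. Run Script
Run: .claude/skills/$name/$name.sh <environment>
### 3. Present Summary
EOF
  echo "$dir/SKILL.md"
}
```

For example, `new_skill .claude/skills k8s-events "Use when the user asks about recent cluster events"` would create a stub to fill in; the description text matters most, since it decides when the skill is picked up.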
What I've Learned
Start with read-only access. Get comfortable with Claude Code reading your infra before you let it suggest changes. The cluster health checks and log analysis skills provide immediate value with zero risk.
Encode your conventions. Every time you catch yourself saying “oh, and remember to also...” — put it in CLAUDE.md or a skill. The more domain knowledge you encode, the more useful the AI becomes. Think of it as writing a runbook that actually gets followed.
Skills are worth the investment. Writing a skill takes 30 minutes. The first time it saves you from typing a 6-flag kubectl command at 2 AM, it's paid for itself. The structured workflow also means Claude Code follows the same diagnostic process every time, which builds trust and consistency.
The human-in-the-loop for writes is essential. I've never regretted requiring manual approval for tofu apply. I have occasionally been glad the guardrail was there when a seemingly innocuous change would have triggered a node pool recreation.
Multi-environment awareness matters. Having per-environment configurations that Claude Code navigates automatically (via direnv and kubeconfig paths) means I can say “check the logs on prod” without worrying about context switching or accidentally running commands against the wrong cluster.
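One way to make the "wrong cluster" mistake harder is to resolve the environment name to its directory up front and refuse anything unknown. A minimal sketch, assuming the `opentofu/environments/<env>` layout mentioned earlier; the function name `resolve_env_dir` is hypothetical.

```sh
#!/bin/sh
# Map an environment name to its directory under the repo root,
# failing loudly if the environment does not exist.
resolve_env_dir() {
  root=$1
  env=$2
  dir="$root/opentofu/environments/$env"
  if [ -z "$env" ] || [ ! -d "$dir" ]; then
    echo "unknown environment: '$env'" >&2
    return 1
  fi
  echo "$dir"
}
```

A script could then run something like `dir=$(resolve_env_dir . prod) && direnv exec "$dir" kubectl get pods`, so the per-environment variables are loaded from that directory and a typo in the environment name fails before any cluster is contacted.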
The AI gets better as your docs get better. There's a virtuous cycle: using Claude Code for infra management incentivizes you to document your conventions properly, and better documentation makes the AI more useful. After a few months, our CLAUDE.md has become the most accurate and up-to-date piece of documentation in the project.
Getting Started
If you want to try this approach:
- Create a CLAUDE.md in your infra repo with your conventions, gotchas, and hard boundaries (what the AI must never do)
- Start with a health check skill: it's the lowest risk, highest value starting point
- Add settings.json permissions that pre-approve read operations and require confirmation for writes
- Gradually add skills for your most common ops tasks: log checking, resource description, plan analysis
- Encode tribal knowledge as you discover it: every “watch out for...” becomes a line in CLAUDE.md
The goal isn't to replace infrastructure operators. It's to give them a copilot that remembers every convention, never forgets to check for destructive changes, and is available at 2 AM without complaining about it.
All examples in this post are from a production setup managing GKE and StackIT Kubernetes clusters with OpenTofu and Helm.






