Structured Kubernetes Service Investigation: A Read-Only Debugging Workflow
This workflow is extracted from the operational runbooks I use to manage services across multiple production Kubernetes clusters. It enforces observation before action, which prevents the common pattern of restarting something and masking the real problem. It works as a manual checklist, in a runbook management system, or as an AI-assistant-driven procedure.
The Problem
When a service goes down in Kubernetes, the instinct is immediate: kubectl rollout restart deployment/<service>. Within seconds, the pods cycle, endpoints come back online, and the page stops showing errors. But then, three hours later, the same service crashes again. You restart it again. And again.
This workflow breaks that cycle by enforcing a read-only investigation phase before any remediation. The goal is understanding what happened, not just making the error go away. A service that crashes because of OOM will crash again after a restart unless you increase memory limits or fix the memory leak. A service that fails to mount a ConfigMap will keep failing until you fix the ConfigMap or the volume reference that points to it. Restarting hides the symptoms and wastes operational effort.
Philosophy: Read-Only Investigation
This workflow never modifies anything. No restarts, no scaling, no kubectl apply, no secret patching. The investigation phase is strictly diagnostic. This serves two purposes:
- Prevents masking root causes. If you restart and the service stays up, you learn nothing. You might have just gotten lucky and hit the window between crash cycles.
- Avoids making things worse. A hasty scaling operation can starve other pods. A rollout restart on a service with no readiness probe can cascade failures to upstream services. Investigation first gives you the full picture.
Step 1: Gather Context
Before diving into kubectl output, structure the intake. Vague reports ("the service is down") make diagnosis harder. Ask:
| Question | Why it matters |
|---|---|
| Problem category | Down? Errors? Slow? Crashing? Stuck deployment? Networking issues? |
| Service name | Exact deployment / statefulset name. |
| Cluster context | Which Kubernetes cluster? (prod, staging, dev?) |
| Namespace | Exact namespace where the service runs. |
| Affected components | API calls? Web traffic? Scheduled jobs? Background workers? |
| Timeline | When did the problem start? |
| Observability signals | Alert fired? Error logs? Increase in latency? Traffic dropping? |
Record these in a structured format before moving forward.
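One way to keep the intake consistent is a small shell helper that prints the answers as a key-value record you can paste into a ticket. This is a minimal sketch: the function name `intake_record`, the argument order, and the field names are illustrative choices, not part of any tool.

```shell
# Hypothetical helper: print the Step 1 answers as a "key: value" record.
# Arguments, in order: category, service, cluster context, namespace,
# affected components, start time, observed signals.
# Any argument that is missing is recorded as "unknown".
intake_record() {
  cat <<EOF
recorded_at: $(date -u +%Y-%m-%dT%H:%M:%SZ)
category: ${1:-unknown}
service: ${2:-unknown}
context: ${3:-unknown}
namespace: ${4:-unknown}
affected: ${5:-unknown}
started: ${6:-unknown}
signals: ${7:-unknown}
EOF
}
```

For example, `intake_record crashing web prod payments "API calls" "2024-01-01T10:00Z" "5xx alert" > intake.txt` writes the record to a file. Fields left as "unknown" are a prompt to go back to the reporter before continuing.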
Step 2: Identify the Workload
Kubernetes services are managed by higher-level abstractions: Deployments, StatefulSets, DaemonSets, CronJobs, etc. Finding the right one is the first step.
# Search by name across common types
kubectl --context=<CONTEXT> -n <NAMESPACE> get deployments,statefulsets,daemonsets,cronjobs | grep <SERVICE>
# If that does not work, list all pods and look for label patterns
kubectl --context=<CONTEXT> -n <NAMESPACE> get pods -o wide | grep -i <SERVICE_PATTERN>
# Show all of the pod's labels to help trace it back to the owning resource
kubectl --context=<CONTEXT> -n <NAMESPACE> get pod <POD_NAME> --show-labels
# The owner reference will usually be visible in `kubectl describe`; if not, get it explicitly
kubectl --context=<CONTEXT> -n <NAMESPACE> get pod <POD_NAME> -o jsonpath='{.metadata.ownerReferences[0]}'
Once you have identified the workload type and name, you are ready to investigate.
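A pod created by a Deployment is owned by a ReplicaSet, which is in turn owned by the Deployment, so a single owner lookup may not reach the workload you care about. The sketch below follows the owner references upward until it reaches an object with no owner. It assumes `CONTEXT` and `NAMESPACE` are set in your shell; the function name `top_level_owner` is hypothetical.

```shell
# Assumes CONTEXT and NAMESPACE are set in the environment.
# Follow ownerReferences upward from a pod until an object with no owner
# is found, then print it as Kind/name. Read-only: only "kubectl get".
top_level_owner() {
  kind=pod
  name=$1
  while :; do
    # Print every owner as Kind/name, one per line, and keep the first.
    ref=$(kubectl --context="$CONTEXT" -n "$NAMESPACE" get "$kind" "$name" \
      -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}/{.name}{"\n"}{end}' |
      head -n 1)
    if [ -z "$ref" ]; then
      break   # no owner (or lookup failed): this is the top of the chain
    fi
    kind=${ref%%/*}
    name=${ref#*/}
  done
  printf '%s/%s\n' "$kind" "$name"
}
```

Calling `top_level_owner <POD_NAME>` on a Deployment-managed pod would typically print something like `Deployment/<name>`. If the pod has several owners, only the first is followed.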
Step 3: Pod-Level Investigation
This is the core diagnostic phase. Pods are where containers run, and containers are where most failures originate.
Pod Status and Events
kubectl --context=<CONTEXT> -n <NAMESPACE> describe pod <POD_NAME>
The describe output tells you nearly everything. Look for:
Events section: This is gold. Events surface the reason a pod failed to schedule, why it was evicted, why the container terminated, etc. Common reasons you will see in the events and container state, and what they mean:
| Event or reason | Interpretation |
|---|---|
| BackOff | Container exited, kubelet is retrying with exponential backoff. Inspect logs. |
| FailedScheduling | Scheduler could not assign the pod to a node. Check node availability, affinity rules, resource requests vs. available resources. |
| FailedMount | A volume (ConfigMap, Secret, PersistentVolumeClaim) failed to mount. Verify the volume exists and the pod has permission. |
| OOMKilled | Container exceeded its memory limit and was killed. This shows up as the termination reason under the container's Last State rather than as an event. Increase the limit or investigate a memory leak. |
| Evicted | Node ran out of resources and kubelet forcibly terminated the pod. Check node pressure. |
| ImagePullBackOff | Container image not found or pull failed (bad registry, bad credentials, no network route to the registry). Check image URL and registry access. |
Conditions section: Shows Ready, Initialized, etc. If Ready=False, the pod is not serving traffic. Check why.
Container state section: Shows whether the container is Running, Waiting, or Terminated. If Waiting, look at the Reason and Message. If Terminated, that tells you the exit code and any termination reason.
Resource limits and requests: Note the CPU and memory limits. If a pod is consistently hitting its memory limit, it will be OOMKilled. If the limit is not set, the pod can consume all available memory on the node, starving other pods.
Readiness and liveness probes: If the readiness probe is failing, the pod is not in the Service's endpoint list (traffic is not routed to it). Liveness probe failures trigger restarts. Check the probe definition and the container logs to see why it is failing.
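The same information can be pulled out in a compact form when `describe` output gets long. The sketch below counts a pod's events by reason and flags containers whose last termination looks like a kill. It assumes `CONTEXT` and `NAMESPACE` are set; all three function names are hypothetical.

```shell
# Assumes CONTEXT and NAMESPACE are set in the environment.

# Count the events recorded for one pod, grouped by reason, most frequent first.
pod_event_reasons() {
  kubectl --context="$CONTEXT" -n "$NAMESPACE" get events \
    --field-selector "involvedObject.name=$1" \
    -o jsonpath='{range .items[*]}{.reason}{"\n"}{end}' |
    awk 'NF { count[$1]++ } END { for (r in count) print count[r], r }' |
    sort -rn
}

# Print one tab-separated line per container:
# name, restart count, last termination reason, last exit code.
container_states() {
  kubectl --context="$CONTEXT" -n "$NAMESPACE" get pod "$1" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Read container_states output and report containers that were OOM-killed
# or exited with code 137 (128 + SIGKILL).
flag_killed_containers() {
  awk -F '\t' '$3 == "OOMKilled" || $4 == "137" {
    printf "%s: restarts=%s reason=%s exit=%s\n", $1, $2, $3, $4
  }'
}
```

Typical use is `pod_event_reasons <POD_NAME>` followed by `container_states <POD_NAME> | flag_killed_containers`. Events expire after a retention window, so an empty result does not prove nothing happened.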
Container Logs
# Current logs for a pod
kubectl --context=<CONTEXT> -n <NAMESPACE> logs <POD_NAME>
# Previous instance (if the pod restarted)
kubectl --context=<CONTEXT> -n <NAMESPACE> logs <POD_NAME> --previous
# For multi-container pods, specify the container
kubectl --context=<CONTEXT> -n <NAMESPACE> logs <POD_NAME> -c <CONTAINER_NAME>
# Stream logs in real time
kubectl --context=<CONTEXT> -n <NAMESPACE> logs -f <POD_NAME>
# Get logs from the last 1 hour
kubectl --context=<CONTEXT> -n <NAMESPACE> logs <POD_NAME> --since=1h
What to search for in logs:
- Stack traces. Application exceptions usually print a traceback. Search for common patterns: Exception, Error, panic, fatal.
- Connection refused. The application tried to connect to a dependency (database, cache, API) and got refused. Look for the address and port.
- Configuration errors. Missing environment variables, bad config file syntax, invalid secret values.
- Timeout or deadline exceeded. The application waited for something (network, database) and the wait exceeded the timeout.
If logs are not verbose enough, check if the deployment specifies --log-level or similar flags. You might need to reconstruct the debugging context from application metrics or structured logs in a centralized logging system.
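A quick way to triage a large log is to count how many lines fall into each of the categories above before reading anything in detail. This is a rough sketch: the function name `summarize_log_patterns` and the exact regular expressions are assumptions, and a single line can count toward more than one category.

```shell
# Hypothetical helper: read log lines on stdin and count, case-insensitively,
# how many match each broad failure category.
summarize_log_patterns() {
  awk '
    { line = tolower($0) }
    line ~ /exception|error|panic|fatal/            { n["exception_or_panic"]++ }
    line ~ /connection refused/                     { n["connection_refused"]++ }
    line ~ /timeout|timed out|deadline exceeded/    { n["timeout"]++ }
    END {
      printf "exception_or_panic %d\n", n["exception_or_panic"]
      printf "connection_refused %d\n", n["connection_refused"]
      printf "timeout %d\n", n["timeout"]
    }
  '
}
```

For example: `kubectl --context=<CONTEXT> -n <NAMESPACE> logs <POD_NAME> --previous | summarize_log_patterns`. The counts only tell you where to look first; the actual messages still need to be read.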
Resource Usage
The describe output shows requested resources; kubectl top shows actual usage:
kubectl --context=<CONTEXT> -n <NAMESPACE> top pod <POD_NAME>
kubectl --context=<CONTEXT> -n <NAMESPACE> top pod <POD_NAME> --containers
Compare actual usage against the limits:
- If memory usage is >80% of the limit, the pod is under memory pressure and might soon be OOMKilled.
- If CPU usage is consistently near the limit, the pod is CPU-bound and would benefit from a higher limit or a code optimization.
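The memory comparison can be done mechanically. The sketch below reads `kubectl top pod --no-headers` output and reports usage as a percentage of a limit you supply by hand. It assumes the memory column is reported in Mi (for example `450Mi`); the function name `memory_vs_limit` and the 80% threshold wiring are illustrative.

```shell
# Hypothetical helper.
# usage: kubectl top pod <POD_NAME> --no-headers | memory_vs_limit <LIMIT_MI>
# Expects lines of the form: NAME CPU MEMORY, with MEMORY like "450Mi".
memory_vs_limit() {
  awk -v limit="$1" '{
    used = $3
    sub(/Mi$/, "", used)            # assumption: memory is reported in Mi
    pct = used * 100 / limit
    printf "%s %dMi of %dMi (%d%%) %s\n", $1, used, limit, pct, (pct > 80 ? "HIGH" : "ok")
  }'
}
```

The limit itself comes from the describe output, for example `memory_vs_limit 512` for a 512 Mi limit. A single reading is a snapshot; repeat it a few times before concluding the pod is under sustained pressure.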
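If kubectl top returns "unknown" or "unable to get metrics," the metrics server might not be installed or is not ready. kubectl top nodes depends on the same metrics API, so it will fail as well. In that case, fall back to kubectl describe node <NODE_NAME>, which reports allocated (requested) resources and pressure conditions without needing the metrics server.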
Step 4: Service and Networking
A pod can be Running and Ready, but if the Service does not route traffic to it, requests still fail.
Service and Endpoints
kubectl --context=<CONTEXT> -n <NAMESPACE> describe svc <SERVICE_NAME>
# Check endpoints (the actual pods the Service sends traffic to)
kubectl --context=<CONTEXT> -n <NAMESPACE> get endpoints <SERVICE_NAME>
# List the pods that match the Service's selector to verify it selects what you expect
kubectl --context=<CONTEXT> -n <NAMESPACE> get pods --selector=<LABEL_KEY>=<LABEL_VALUE> -o wide
Key checks:
- Endpoints empty? The Service's label selector does not match any pods. Verify the selector in the Service spec matches the pod labels. This is a common mistake.
- Endpoints showing but not Ready? Pods are running but the readiness probe is failing. Fix the probe or the application.
- Endpoints correct but still not routing? Check if there is a NetworkPolicy restricting ingress to the pod.
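Copying the selector by hand is where typos creep in, so it can help to read it straight from the Service and feed it back to `kubectl get pods`. A minimal sketch, assuming `CONTEXT` and `NAMESPACE` are set; the function names `service_selector` and `pods_for_service` are hypothetical.

```shell
# Assumes CONTEXT and NAMESPACE are set in the environment.

# Print a Service's selector as a comma-separated key=value list,
# the form that kubectl's --selector flag accepts.
service_selector() {
  kubectl --context="$CONTEXT" -n "$NAMESPACE" get svc "$1" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}' |
    sed 's/,$//'
}

# List the pods that the Service's selector actually matches.
pods_for_service() {
  selector=$(service_selector "$1")
  if [ -z "$selector" ]; then
    echo "service $1 has no selector" >&2
    return 1
  fi
  kubectl --context="$CONTEXT" -n "$NAMESPACE" get pods --selector="$selector" -o wide
}
```

If `pods_for_service <SERVICE_NAME>` prints no pods, the selector matches nothing. If it prints pods that are missing from the endpoints list, look at their readiness.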
Ingress Configuration
If the service is exposed via Ingress:
kubectl --context=<CONTEXT> -n <NAMESPACE> describe ingress <INGRESS_NAME>
kubectl --context=<CONTEXT> -n <NAMESPACE> get ingress <INGRESS_NAME> -o yaml
Look for:
- TLS certificate validity. If the certificate is expired, browsers will reject it.
- Backend service name and port. Verify they match your Service.
- Host and path routing rules. Verify the rule matches the incoming request.
DNS and Service Discovery
From inside a pod, check if DNS resolution works:
# From within a pod (use `kubectl exec`)
kubectl --context=<CONTEXT> -n <NAMESPACE> exec -it <POD_NAME> -- nslookup <SERVICE_NAME>
kubectl --context=<CONTEXT> -n <NAMESPACE> exec -it <POD_NAME> -- getent hosts <SERVICE_NAME>
# Check if the CoreDNS service is running
kubectl --context=<CONTEXT> -n kube-system get pods -l k8s-app=kube-dns
If DNS is not resolving, CoreDNS might be down or the service DNS name is incorrect. If CoreDNS pods are not ready, check their logs and node resources.
Step 5: Node-Level Investigation
If all pods seem healthy but the service is still degraded, the problem might be at the node level.
When to Check Nodes
Start here if:
- Pods are stuck in Pending state (scheduler cannot assign them).
- Multiple pods are being evicted.
- Resource usage shows unbalanced load across nodes.
Node Conditions and Pressure
kubectl --context=<CONTEXT> get nodes -o wide
# Detailed node info
kubectl --context=<CONTEXT> describe node <NODE_NAME>
# Check for pressure conditions
kubectl --context=<CONTEXT> get nodes -o json | jq '.items[] | {name: .metadata.name, conditions: .status.conditions}'
Look for node conditions:
| Condition | Meaning | Action |
|---|---|---|
| MemoryPressure=True | Node is running low on memory. | Evict pods or add memory. |
| DiskPressure=True | Node is running low on disk space. | Clean up or expand storage. |
| PIDPressure=True | Too many processes on the node. | Likely a container running away with process count. |
| Ready=False | Node is not healthy. Pods may not be scheduled. | Investigate node logs or kubelet status. |
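On a large cluster, scanning every node's conditions by eye is slow. The sketch below prints only the nodes that report a pressure condition or are not Ready, using kubectl's built-in JSONPath output instead of jq. It assumes `CONTEXT` is set; the function names `node_conditions` and `unhealthy_nodes` are hypothetical.

```shell
# Assumes CONTEXT is set in the environment.

# Print one line per node: name, then a tab-separated list of Type=Status.
node_conditions() {
  kubectl --context="$CONTEXT" get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{range .status.conditions[*]}{"\t"}{.type}={.status}{end}{"\n"}{end}'
}

# Read node_conditions output and keep only nodes with a pressure condition
# set to True or with Ready reported as False or Unknown.
unhealthy_nodes() {
  awk -F '\t' '{
    bad = ""
    for (i = 2; i <= NF; i++) {
      if ($i ~ /Pressure=True$/ || $i ~ /^Ready=(False|Unknown)$/) bad = bad " " $i
    }
    if (bad != "") print $1 ":" bad
  }'
}
```

Run it as `node_conditions | unhealthy_nodes`. No output means every node reports Ready with no pressure conditions at that moment.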
Node Resource Allocation
kubectl --context=<CONTEXT> top nodes
kubectl --context=<CONTEXT> describe node <NODE_NAME> | grep -A 20 "Allocated resources"
If a node shows:
- CPU allocation >90%: New pods requesting CPU might not be scheduled.
- Memory allocation >90%: Similar issue for memory.
- Disk usage >90%: Node might enter DiskPressure.
FailedScheduling Diagnosis
If you see FailedScheduling events on pending pods:
kubectl --context=<CONTEXT> describe pod <POD_NAME> | grep -A 20 "Events:"
Common reasons:
- Insufficient resources: Requested CPU/memory exceeds available on all nodes. Check if you can add nodes or reduce requests.
- Node affinity mismatch: Pod specifies nodeSelector or affinity rules that no node satisfies. Verify the node labels.
- PersistentVolumeClaim not bound: The pod requests a PVC that has no PersistentVolume available. Check PVC status.
- Taints and tolerations: Nodes might have taints (e.g., dedicated=gpu:NoSchedule). The pod must have matching tolerations.
Step 6: Centralized Log Correlation
Pod and node diagnostics show local symptoms. Centralized logging lets you search across all container logs, Kubernetes events, and audit logs to find patterns.
Log Search
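If you have centralized logging (Google Cloud Logging, ELK, Splunk, etc.), run the equivalent of the searches below. The filters are written in Google Cloud Logging query syntax; resource types and field names differ on other backends, so adapt them to yours: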
# Search for errors in container logs from all replicas of a service
resource.type="k8s_container"
resource.labels.namespace_name="<NAMESPACE>"
resource.labels.pod_name=~"<SERVICE_PATTERN>.*"
(severity>=ERROR OR "FATAL" OR "panic")
# Correlate with Kubernetes events
resource.type="k8s_pod"
resource.labels.namespace_name="<NAMESPACE>"
resource.labels.pod_name=~"<SERVICE_PATTERN>.*"
protoPayload.resourceName=~"pods/.*"
# Search for recent changes (new deployments, config updates)
resource.type="k8s_object_change"
resource.labels.namespace_name="<NAMESPACE>"
timestamp>="2024-01-01T10:00:00Z"
The goal is to correlate three things:
- Error onset timestamp. When did errors start appearing?
- Deployment timestamp. When was a new version deployed?
- Config change timestamp. When was a ConfigMap or Secret updated?
If a deployment finished at 10:05 and errors started at 10:07, the new version is the prime suspect. If a ConfigMap was updated at 09:55 and errors started at 10:00, look at the config change first. Timing alone is only correlation, so confirm against the error messages before naming a cause.
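The correlation itself is simple enough to sketch: given a list of changes and the error onset time, find the most recent change at or before the onset. The example below assumes each input line starts with an ISO 8601 UTC timestamp in a fixed format, which sorts correctly as plain text; the function name `last_change_before` and the input layout are hypothetical.

```shell
# Hypothetical helper.
# usage: last_change_before <ONSET_TIMESTAMP> < changes.txt
# Each input line: "<ISO 8601 UTC timestamp> <description of the change>".
# Timestamps in one fixed ISO 8601 UTC format compare correctly as strings.
last_change_before() {
  LC_ALL=C sort |
    awk -v onset="$1" '
      ($1 "") <= (onset "") { last = $0 }   # force string comparison
      END { if (last != "") print last }
    '
}
```

With a deployment at 10:05, a ConfigMap update at 09:55, and an onset of `2024-01-01T10:07:00Z`, this prints the 10:05 deployment line. It identifies the closest preceding change, not a proven cause.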
Log Patterns
Common log patterns to search for:
| Pattern | Indicates |
|---|---|
| connection refused | Service cannot reach a dependency (DB, cache, upstream). |
| OOM or out of memory | Memory leak or insufficient memory allocation. |
| timeout | Slow dependency or network latency. |
| permission denied | Missing RBAC, secret, or credentials. |
| invalid config | Bad ConfigMap or environment variable. |
Step 7: Synthesize Findings
After gathering all diagnostics, structure your findings into a root cause analysis.
The Summary Template
Write up your findings in this order:
Probable cause (one sentence):
- Example: "The service is OOMKilled because the Go runtime's memory usage exceeds the 512 Mi limit."
- Example: "The service cannot reach the database because the database IP in the ConfigMap is stale."
Evidence:
- List every piece of evidence that supports the hypothesis. Example:
  - kubectl describe pod shows OOMKilled as the container's termination reason.
  - Container logs show no stack trace, consistent with OOM.
  - kubectl top shows memory usage at ~500 Mi, approaching the 512 Mi limit.
  - No recent code changes that would explain increased memory usage.
Timeline table:
| Timestamp | Event | Component |
|---|---|---|
| 2024-01-01T09:00 | Deployment version bumped to v1.2.3 | Service |
| 2024-01-01T09:05 | Pod initializes, memory usage steady at 450 Mi | Service pod |
| 2024-01-01T12:00 | Memory usage climbs to 490 Mi | Service pod |
| 2024-01-01T12:30 | Memory reaches the 512 Mi limit, container OOMKilled | Service pod |
| 2024-01-01T12:31 | Service unavailable (no healthy endpoints) | Observed impact |
Affected scope:
- Which deployments/services are impacted?
- Are other services in the cluster affected?
- Is customer traffic being dropped or failing?
Suggested next steps (recommendations for whoever remediates; nothing is applied during the investigation):
- Example: "Increase memory limit to 1 Gi and monitor memory growth over 24 hours. If memory continues climbing, there is a memory leak."
- Example: "Verify the database IP in centralized config and update the ConfigMap if it is stale. Redeploy with the corrected ConfigMap."
Key: Do not speculate. If your logs do not show a stack trace, do not assume a specific exception. Stick to what the evidence shows.
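If you collect timeline entries as you go, the table can be generated instead of typed. A minimal sketch, assuming entries are kept as tab-separated lines of timestamp, event, and component; the function name `timeline_table` is hypothetical.

```shell
# Hypothetical helper.
# usage: timeline_table < entries.tsv
# Each input line: timestamp<TAB>event<TAB>component.
# Entries are sorted by timestamp and printed as a pipe-delimited table.
timeline_table() {
  printf '| Timestamp | Event | Component |\n|---|---|---|\n'
  LC_ALL=C sort |
    awk -F '\t' '{ printf "| %s | %s | %s |\n", $1, $2, $3 }'
}
```

Sorting relies on every timestamp using the same ISO 8601 format and time zone, so normalize to UTC when entries come from different sources.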
Step 8: Common Patterns
Most Kubernetes service issues fall into a handful of buckets. Recognizing the pattern speeds up diagnosis.
OOMKilled
Symptoms: Pod restarts every 5-10 minutes. kubectl describe pod shows OOMKilled in container state. Memory usage (from kubectl top) is near the limit.
Root causes:
- Memory limit is too low for the application's normal operation. Increase the limit.
- Application has a memory leak. Check if memory usage grows monotonically over time. If yes, there is a leak.
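- A sidecar container (proxy, log shipper) is leaking memory. Memory limits apply per container, so confirm which container was killed using kubectl top pod --containers and the container state in describe pod.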
Next steps:
- Increase the memory limit temporarily and monitor for memory growth.
- If memory stabilizes at a new baseline, the limit was just too low.
- If memory continues growing, there is a leak. Escalate to the application team with time-series memory graphs.
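The "grows monotonically" check can be made concrete with a handful of samples. The sketch below reads memory readings in Mi, oldest first, and reports whether they only ever go up. The function name `memory_trend` and the input format are assumptions, and a few samples are suggestive rather than proof of a leak.

```shell
# Hypothetical helper.
# usage: memory_trend < samples.txt   (one Mi value per line, oldest first)
memory_trend() {
  awk '
    NR == 1 { first = $1 }
    NR > 1 && $1 + 0 < prev + 0 { drops++ }   # a reading lower than the previous one
    { prev = $1; last = $1 }
    END {
      if (NR < 2) { print "not enough samples"; exit }
      if (drops == 0 && last + 0 > first + 0)
        printf "growing: %sMi -> %sMi\n", first, last
      else
        printf "no steady growth: %sMi -> %sMi (%d drops)\n", first, last, drops
    }
  '
}
```

Samples can be collected by recording the memory column of `kubectl top pod` at regular intervals. Memory that plateaus after the limit is raised points to a limit that was too low; readings that keep climbing point to a leak.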
CrashLoopBackOff
Symptoms: Pod shows CrashLoopBackOff in status. Container logs may be empty or very short. describe pod shows repeated exit codes (usually non-zero, often 1 or 137).
Root causes:
- Application crashes on startup (bad configuration, missing secret, unreachable dependency).
- Application expects a file that does not exist (missing volume mount, missing ConfigMap).
Next steps:
- Check container logs carefully. If the application exited due to a config error, logs will mention it.
- Verify all ConfigMaps and Secrets exist and are mounted correctly.
- Check if a dependency (database, message queue) is reachable from the pod.
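- If logs are empty, check the previous container instance with kubectl logs --previous. If that is empty too, ask the application team to add debug logging to the application startup and redeploy.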
FailedScheduling
Symptoms: Pod stuck in Pending state indefinitely. describe pod events show FailedScheduling with a reason.
Root causes:
- Insufficient resources: Node pool is too small or all nodes are fully allocated.
- Node affinity or label mismatch: Pod specifies nodeSelector labels that no node has.
- PersistentVolume not available: Pod requests a PVC that is stuck in Pending (no PV to bind to).
- Taints: Node has a taint that the pod does not have a toleration for.
Next steps:
- Read the FailedScheduling message carefully. It explicitly states the reason.
- If insufficient resources, either scale up the node pool or reduce the pod's requested resources.
- If affinity mismatch, either add the required labels to a node or relax the pod's affinity rules.
- If PVC not bound, check the PVC's events for clues (usually a storage provisioner issue).
Empty Endpoints
Symptoms: Service exists, but kubectl get endpoints <SERVICE> shows no endpoints. Traffic to the service times out or gets a connection refused.
Root causes:
- Service label selector does not match any pods. Typo in the selector or pods have different labels than expected.
- All pods running the service are not in the Ready state (failing readiness probes).
Next steps:
- Check the Service's .spec.selector. Get all pods with those labels: kubectl get pods --selector=<KEY>=<VALUE>.
- If no pods match, the selector is wrong. Fix it or relabel the pods.
- If pods match but are not in Endpoints, they are not ready. Check why the readiness probe is failing.
Slow or Degraded Service
Symptoms: Service is up and responding, but requests are taking much longer than usual. Error rate is low but latency spikes.
Root causes:
- Node resource contention: Other pods on the same node are consuming CPU or memory, throttling your pod.
- Downstream dependency latency: The service is waiting on a database, API, or cache that is slow.
- Network saturation: The cluster's network or external egress link is overloaded.
Next steps:
- Check kubectl top nodes to see if CPU or memory is near capacity.
- Check application logs for slow queries or API call timings.
- In centralized logging, search for the slowest request times and what operations they were waiting on.
- If node contention, consider scaling the workload to more nodes or reducing the density.
Recent Deployment and Immediate Failure
Symptoms: New version deployed minutes ago. Service immediately starts failing or going to CrashLoopBackOff.
Root causes:
- Bug in the new version. Check the diff between the old and new version.
- Environment variable or secret changed in the deployment (new version expects a different config).
- Dependency contract changed: new version expects a different API response from a downstream service.
Next steps:
- Check the centralized logs starting at the minute of deployment. If the new version is at fault, errors usually begin as soon as the new pods take traffic.
- If errors reference a missing environment variable or config key, the deployment YAML likely changed.
- Correlate the deployment diff with the error messages. Usually, a line-by-line comparison reveals the issue.
Lessons Learned
Never skip the events section. The Kubernetes event log is free, highly structured diagnostic data. "FailedScheduling," "FailedMount," and "BackOff" tell you the story immediately, and the container's last state tells you whether it was OOMKilled. Operators who jump to logs first waste time.
Label selectors are surprisingly fragile. More than once, I have debugged a service with zero endpoints only to discover a typo in the label selector. app: service-a does not match app: servicea. A quick kubectl describe svc, compared against the pod labels, catches this in minutes.
Centralized logging is not optional. Reconstructing a timeline from individual pod logs is tedious and error-prone. With centralized logging, you can ask "what changed in the last 10 minutes?" and get a definitive answer. It turns hours of investigation into minutes.
Do the read-only investigation first. The urge to restart or scale is overwhelming. Resist it. Spend 15 minutes gathering data. Then make an informed decision about remediation. It pays off every time.