Core Troubleshooting Methodology and Diagnostic Tools
Effective Kubernetes troubleshooting follows a systematic approach: identify the problem, gather information, analyze logs, implement a fix, and verify the result.
Essential kubectl Commands
The primary tool is kubectl, which provides several key commands for investigating issues:
- kubectl get: Display resource status quickly
- kubectl describe: Get detailed information and event logs
- kubectl logs: View application output and errors
- kubectl exec: Access running containers for testing
- kubectl events: See recent cluster activity chronologically
- kubectl top: Display CPU and memory usage
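A typical triage session chains these commands together. A minimal sketch, assuming a pod named web-7d4b in the default namespace (both names are placeholders):

```shell
# Cluster-wide view of anything not Running
kubectl get pods --all-namespaces --field-selector=status.phase!=Running

# Drill into one pod: status, events, recent output
kubectl describe pod web-7d4b -n default      # events appear at the bottom
kubectl logs web-7d4b -n default --tail=50    # current container output
kubectl logs web-7d4b -n default --previous   # logs from the last crashed instance

# Resource usage (requires metrics-server to be installed)
kubectl top nodes
kubectl top pods -n default
```

Running describe before logs is usually the fastest path: the event list often names the failure class before any application log is needed.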
Understanding Pod States
A pod's phase takes one of five values, each pointing to a different class of problem. Pending means the pod hasn't started yet, often due to resource constraints. Running means the pod is active but doesn't guarantee the application is healthy. Succeeded indicates the pod completed successfully (common for jobs). Failed means the pod exited with an error. Unknown suggests communication problems with the node.
When a pod is Pending, check resource requests against node availability using kubectl describe node. For CrashLoopBackOff failures, examine application logs with kubectl logs and previous logs with kubectl logs -p.
Probe-Based Diagnostics
Three types of probes affect pod behavior differently. Liveness probes restart unhealthy containers. Readiness probes remove containers from service traffic without restarting them. Startup probes allow time for applications to initialize before other probes run.
Understanding these differences helps diagnose why pods appear running but don't accept traffic.
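All three probe types can appear on one container. A minimal sketch, with illustrative image, paths, and timings:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
  - name: app
    image: nginx:1.25              # placeholder image
    startupProbe:                  # up to 30 x 5s for startup before other probes run
      httpGet: { path: /, port: 80 }
      failureThreshold: 30
      periodSeconds: 5
    livenessProbe:                 # failing this restarts the container
      httpGet: { path: /, port: 80 }
      periodSeconds: 10
    readinessProbe:                # failing this only removes the pod from Service endpoints
      httpGet: { path: /, port: 80 }
      periodSeconds: 5
```

A pod that is Running with READY 0/1 in kubectl get pods is typically failing its readiness probe, not its liveness probe.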
Network and Storage Verification
For network troubleshooting, verify service endpoints match pod selectors. Check DNS resolution within the cluster using nslookup. Test connectivity using kubectl run with netcat images.
For storage issues, verify PersistentVolumeClaim bindings, check StorageClass definitions, and confirm volume mount paths in pod specifications. Node-level problems require checking node status with kubectl get nodes and inspecting kubelet logs on the affected node.
Pod and Container Troubleshooting Scenarios
Pod troubleshooting represents a significant portion of CKA exam tasks. Understanding common failure patterns accelerates diagnosis.
Common Pod Failure States
Three failure states dominate CKA scenarios:
- ImagePullBackOff: Invalid image name or registry authentication failure. Check image specifications in pod manifests and verify registry secrets exist and have correct credentials.
- CrashLoopBackOff: Application exits immediately. Check application logs, environment variables, and command arguments for misconfigurations.
- Pending: Insufficient cluster resources, node selectors conflicting with available nodes, or PersistentVolumeClaim issues. Use kubectl describe pod to see events explaining scheduling failures.
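The checks above can be sketched per failure state; pod and secret names here are placeholders:

```shell
# ImagePullBackOff: confirm the image reference and registry credentials
kubectl describe pod broken-pod | grep -A5 Events
kubectl get secret my-registry-secret -o yaml   # placeholder secret name

# CrashLoopBackOff: the previous container's logs usually hold the real error
kubectl logs broken-pod --previous

# Pending: the scheduler's reason appears as a FailedScheduling event
kubectl describe pod broken-pod | grep -i -A3 FailedScheduling
kubectl describe nodes | grep -A5 "Allocated resources"
```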
Debugging Running Containers
Use kubectl exec to access the container shell and inspect processes, filesystem, and environment. For containers without a shell (such as distroless images), use kubectl debug to attach an ephemeral debug container, or create a temporary debug pod that shares the target pod's namespace and network.
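The escalation path looks like this; pod and container names are placeholders:

```shell
# Works only if the image contains sh
kubectl exec -it web-7d4b -- sh

# Distroless target: attach an ephemeral container sharing the pod's namespaces
kubectl debug -it web-7d4b --image=busybox --target=app

# Or copy the pod and add a debug container, leaving the original untouched
kubectl debug web-7d4b -it --image=busybox --copy-to=web-debug
```

The --target flag lets the ephemeral container see the named container's processes, which is what makes it useful for inspecting shell-less images.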
Init Container Issues
Init containers must complete successfully before the main container starts. Failures here appear as Init:0/1 or similar status. Each init container runs sequentially, so verify each one completes. Check init container logs separately using kubectl logs with the -c flag specifying the container name.
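A short sketch of inspecting a stuck init container; web-7d4b and init-db are placeholder names:

```shell
kubectl get pod web-7d4b               # status like Init:0/1 or Init:CrashLoopBackOff
kubectl logs web-7d4b -c init-db       # -c selects the init container by name
kubectl describe pod web-7d4b          # events show which init container failed and why
```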
Multi-Container Pod Troubleshooting
When pods have multiple containers, troubleshooting becomes more complex. Check logs for each container separately using kubectl logs -c container-name. Verify all containers have compatible resource requirements. Understand the pod's restartPolicy to diagnose why containers continuously restart or stay dead after failure.
For sidecar patterns, coordinate startup sequences and ensure all containers run in the same pod network namespace.
Network Troubleshooting and Service Connectivity
Network issues in Kubernetes affect communication between pods, external access to services, and DNS resolution. A systematic approach prevents wasting time on unlikely causes.
Basic Connectivity Verification
Start with basic connectivity verification. Check if pods can reach each other using kubectl exec with curl or ping (some CNI plugins and network policies drop ICMP, so a failed ping alone is not conclusive). DNS troubleshooting involves testing service name resolution from within pods using nslookup or dig against the cluster DNS service (10.96.0.10 on a default kubeadm cluster).
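These checks can be run from an existing pod or a throwaway one; my-service and web-7d4b are placeholder names:

```shell
# Resolve a service name from inside an existing pod
kubectl exec -it web-7d4b -- nslookup my-service

# Or spin up a disposable pod with DNS tools (deleted on exit)
kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never \
  -- nslookup kubernetes.default.svc.cluster.local

# Confirm the cluster DNS service itself exists and has a ClusterIP
kubectl get svc -n kube-system kube-dns
```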
Service Configuration Checks
Service troubleshooting requires verifying that service selectors match pod labels. Use kubectl get pods --show-labels and compare labels to service selector specifications.
Check service endpoints with kubectl get endpoints service-name. This should list the pod IPs that matched the selector. Port mismatches between containerPort in the pod spec and targetPort in the service definition cause connectivity failures.
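The label-versus-selector comparison looks like this in practice; my-service is a placeholder name:

```shell
# Pod labels on one side...
kubectl get pods --show-labels

# ...the service selector on the other
kubectl get svc my-service -o jsonpath='{.spec.selector}'

# An empty ENDPOINTS column means the selector matched no ready pods
kubectl get endpoints my-service
```

Note that a pod can match the selector and still be missing from the endpoints list if it is failing its readiness probe.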
Advanced Network Issues
Network policies can inadvertently block traffic. Examine network policy rules and ensure their podSelector, namespaceSelector, and ipBlock clauses permit the required ingress and egress traffic between the affected pods.
Common network issues include:
- Misconfigured CNI plugins causing pods to not receive IP addresses
- Service DNS names not resolving outside the service's namespace (cross-namespace lookups require fully qualified names like service.namespace.svc.cluster.local)
- Firewall rules blocking traffic between nodes
- CoreDNS pod failures preventing any DNS resolution in the cluster
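For the cross-namespace case, the fully qualified name is simply the service name, namespace, and cluster domain joined with dots. A minimal sketch using placeholder service and namespace names:

```shell
# Assemble the FQDN for service "web" in namespace "backend"
svc=web; ns=backend
echo "${svc}.${ns}.svc.cluster.local"
# → web.backend.svc.cluster.local
```

When the short name fails from another namespace but the fully qualified form resolves, the DNS infrastructure is healthy and only the lookup is wrong; if neither resolves, check the CoreDNS pods in kube-system.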
Ingress and External Access
Ingress troubleshooting involves checking ingress controller deployment status. Verify ingress resource specifications match service names and ports. Test backend service health independently before debugging ingress routing.
Debug containers with networking tools installed help test connectivity. Use kubectl run with nicolaka/netshoot image for comprehensive networking diagnostics.
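A sketch of running netshoot as a disposable pod; the service name in the comments is a placeholder:

```shell
# Throwaway pod with dig, curl, tcpdump, ss, nc, etc. (removed on exit)
kubectl run netshoot --rm -it --image=nicolaka/netshoot --restart=Never -- bash

# Example checks from inside the pod:
#   curl -sv http://my-service.default.svc.cluster.local:80
#   nc -zv my-service 80
```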
Node and Cluster-Level Troubleshooting
Node-level issues impact multiple pods and cluster stability. These problems require different diagnostic techniques than pod-level issues.
Node Status and Conditions
Check node status with kubectl get nodes to identify NotReady or SchedulingDisabled states. kubectl describe node provides detailed information including conditions like Ready, MemoryPressure, DiskPressure, and PIDPressure.
NotReady conditions indicate the kubelet isn't reporting in. SSH into the node and check kubelet status with systemctl status kubelet and its logs with journalctl -u kubelet (or /var/log/kubelet.log on some distributions). Disk pressure occurs when filesystem usage exceeds eviction thresholds, causing pods to be evicted. Memory pressure indicates available memory is low.
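A typical node-level diagnostic sequence; worker-1 is a placeholder node name:

```shell
# From a machine with kubectl access
kubectl get nodes
kubectl describe node worker-1 | grep -A8 Conditions

# On the affected node itself (via SSH)
systemctl status kubelet
journalctl -u kubelet --since "10 min ago" --no-pager
df -h /var/lib/kubelet     # disk pressure check
free -m                    # memory pressure check
```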
Taints, Tolerations, and Scheduling
For unschedulable nodes despite available resources, check node taints with kubectl describe node. Verify pods have matching tolerations in their specifications. Kubelet certificate expiration causes authentication failures between the node and API server. Renew certificates or configure proper CA bundle paths.
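A toleration that matches a taint looks like this; the taint shown in the comment and all names are illustrative:

```yaml
# Matches a taint applied with: kubectl taint nodes worker-1 dedicated=gpu:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job
spec:
  tolerations:
  - key: dedicated
    operator: Equal
    value: gpu
    effect: NoSchedule
  containers:
  - name: app
    image: nginx:1.25        # placeholder image
```

A toleration only permits scheduling onto the tainted node; pairing it with a nodeSelector or node affinity is what actually steers the pod there.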
Control Plane Component Health
Control plane troubleshooting requires checking status of core components: kube-apiserver, kube-controller-manager, kube-scheduler, and etcd. These typically run as static pods in /etc/kubernetes/manifests on kubeadm clusters or as systemd services.
API server downtime manifests as kubectl commands hanging or returning connection errors. Scheduler failures prevent new pods from being scheduled. Check logs for errors like insufficient resources or constraint violations. Controller manager issues prevent deployments from managing replicas or services from creating endpoints.
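On a kubeadm cluster, these checks can be sketched as follows; the scheduler pod name is a placeholder:

```shell
# Static pod manifests live on the control plane node's filesystem
ls /etc/kubernetes/manifests/    # kube-apiserver.yaml, kube-scheduler.yaml, etcd.yaml, ...

# Component pods and their logs, while the API server is reachable
kubectl get pods -n kube-system
kubectl logs -n kube-system kube-scheduler-cp-1

# If the API server itself is down, fall back to the container runtime
crictl ps -a | grep kube-apiserver
crictl logs <container-id>       # id taken from the previous command
```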
High Availability and Etcd
For high availability clusters, ensure all control plane replicas are healthy and communicating. Check inter-node communication on control plane nodes. Verify sufficient disk space on control plane nodes where etcd stores data. Use etcdctl for direct etcd troubleshooting. Monitor certificate expiration dates on all control plane components and worker node certificates.
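An etcd health check can be sketched as follows; the certificate paths are kubeadm defaults and may differ on other installations:

```shell
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health

# Certificate expiry across control plane components (kubeadm clusters)
kubeadm certs check-expiration
```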
Storage and Resource Troubleshooting Techniques
Storage troubleshooting in Kubernetes involves understanding PersistentVolumes, PersistentVolumeClaims, and StorageClasses. Resource quota issues prevent pod creation and require systematic diagnosis.
PersistentVolume and PersistentVolumeClaim Issues
PVCs in Pending state indicate no available PV matches the claim's requirements. Check storage class provisioners, access modes, and size specifications. Verify the StorageClass exists and its provisioner is correctly configured.
For statically provisioned volumes, ensure PVs are created with matching access modes and storage capacity. Binding issues occur when PVC requirements don't match any available PV. Check kubectl get pvc and kubectl describe pvc to see events explaining binding failures.
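For static provisioning, binding succeeds only when storageClassName, access modes, and capacity all line up between the PV and the PVC. A minimal sketch with illustrative names and a hostPath backend:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-data
spec:
  capacity:
    storage: 5Gi
  accessModes: [ReadWriteOnce]
  storageClassName: manual        # must match the claim below
  hostPath:
    path: /mnt/data               # illustrative local path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-data
spec:
  accessModes: [ReadWriteOnce]    # must be satisfiable by the PV
  storageClassName: manual
  resources:
    requests:
      storage: 5Gi                # must not exceed the PV's capacity
```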
Mount and Storage Access Failures
Mount failures appear in pod events. Verify mount paths are writable and the volume type is compatible with the container runtime. For network storage like NFS or iSCSI, test connectivity from nodes to storage backends separately.
Understanding access modes prevents mounting errors:
- ReadWriteOnce: One node read-write access only
- ReadOnlyMany: Multiple nodes read-only access
- ReadWriteMany: Multiple nodes read-write access
Some storage classes support only specific access modes.
Resource Quotas and Eviction
Resource quota troubleshooting involves checking whether namespace quotas are exhausted, which prevents new pod creation. Use kubectl describe quota -n namespace-name to see current usage against the limits. LimitRange objects can also block pod creation if requests or limits fall outside the specified ranges.
Pods evicted due to resource pressure show Evicted status in kubectl get pods. Increase cluster resources or reduce request and limit overcommitment.
Volume Lifecycle Management
Orphaned PVCs after pod deletion require manual cleanup; the bound PersistentVolume is then released according to its reclaim policy (Retain or Delete, inherited from the StorageClass). For persistent volume expansion, verify the StorageClass sets allowVolumeExpansion to true and use kubectl patch pvc to increase the claim size. Some volume types complete the resize only after the pod is restarted and the volume remounted.
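The expansion workflow can be sketched as follows; the StorageClass, PVC, and pod names are placeholders:

```shell
# Expansion works only if the StorageClass allows it
kubectl get storageclass standard -o jsonpath='{.allowVolumeExpansion}'

# Patch the claim to the new size
kubectl patch pvc pvc-data -p '{"spec":{"resources":{"requests":{"storage":"10Gi"}}}}'

# Many volume types finish the filesystem resize only after a remount
kubectl delete pod web-7d4b      # the controller recreates it and remounts the volume
```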
