Probes
Tell Kubernetes when an app is ready or needs a restart.
Probes decide whether a Pod should receive traffic or be restarted.
Three types
- readiness: can the container receive traffic
- liveness: should the container be restarted
- startup: hold off the other probes while a slow app starts
Example
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /livez
    port: 8080
  periodSeconds: 10
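The startup probe from the list above follows the same shape. A minimal sketch, assuming the same port and an illustrative threshold; while a startup probe is defined and has not yet succeeded, the liveness and readiness checks are held off:
startupProbe:
  httpGet:
    path: /livez
    port: 8080
  periodSeconds: 5
  failureThreshold: 30   # up to 30 x 5s = 150s for startup before restarts kick in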
Common pitfalls
- Using a liveness probe as a traffic switch; gating traffic is readiness's job
- Timeouts shorter than the endpoint's real response time
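If short timeouts are the issue, set them explicitly instead of relying on the 1-second default. The numbers below are illustrative:
livenessProbe:
  httpGet:
    path: /livez
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 3      # default is 1s, often too tight under load
  failureThreshold: 3    # restart only after several consecutive failures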
Practical notes
- Start with a quick inventory: kubectl get nodes, kubectl get pods -A, and kubectl get events -A.
- Compare desired vs. observed state; kubectl describe usually explains drift or failed controllers.
- Keep names, labels, and selectors consistent so Services and controllers can find Pods.
Quick checklist
- The resource matches the intent you described in YAML.
- Namespaces, RBAC, and images are correct for the target environment.
- Health checks and logs are in place before promotion.
Probes in a workload-oriented view
Whether you are defining a Pod, tuning probes, or organizing namespaces, the goal is to make workloads predictable. Probes are one part of how Kubernetes manages lifecycle, alongside the scheduling and isolation mechanisms covered below. Think of them as a tool for turning application intent into a repeatable unit of operation.
Labels, selectors, and ownership
Every workload should be discoverable. Labels are the primary index, and selectors are how controllers and Services find what they manage. Use consistent keys like app, component, and env. Ownership links, such as controller references, determine which objects are recreated when something is deleted. Without a consistent label strategy, even simple troubleshooting becomes slow.
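As a sketch of the idea (names, labels, and image are illustrative), the Service selector has to match the labels the Deployment stamps onto its Pods:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
        component: api
        env: staging
    spec:
      containers:
      - name: api
        image: registry.example.com/demo-app:1.0
---
apiVersion: v1
kind: Service
metadata:
  name: demo-app
spec:
  selector:
    app: demo-app        # must match the Pod template labels above
  ports:
  - port: 80
    targetPort: 8080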
Scheduling and resource requests
The scheduler relies on requests to place Pods. If requests are missing, the cluster cannot make fair placement decisions, and overload becomes likely. For small services, start with conservative requests and measure. For batch jobs, set limits to protect critical workloads. At the namespace level, quotas and limit ranges are how you enforce these rules.
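A minimal per-container sketch; the numbers are placeholders to be replaced with measured values:
resources:
  requests:
    cpu: 100m          # what the scheduler reserves for placement
    memory: 128Mi
  limits:
    cpu: 500m          # hard ceiling; usage above this is throttled
    memory: 256Mi      # exceeding this triggers an OOM kill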
Startup, readiness, and termination
Probes should reflect true readiness, not just process liveness. A container can be running and still not ready to serve traffic. Use readiness probes to gate traffic, and make liveness probes forgiving to avoid restart loops. Shutdown matters too: define terminationGracePeriodSeconds and handle SIGTERM so the app can flush work and release locks.
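A sketch of the shutdown side, assuming an app that handles SIGTERM; the container name, image, and preStop sleep are illustrative, not required values:
spec:
  terminationGracePeriodSeconds: 30      # the default; raise it if flushing work takes longer
  containers:
  - name: api
    image: registry.example.com/demo-app:1.0
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 5"]   # small drain window; SIGTERM follows once preStop completes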
Isolation and security basics
Namespaces separate teams and environments, but they are not a hard boundary. Combine them with RBAC, NetworkPolicy, and Pod security settings. SecurityContext settings like runAsNonRoot, readOnlyRootFilesystem, and drop capabilities are small changes that reduce risk. If a workload needs extra permissions, document why.
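A container-level securityContext sketch covering those settings; it only works if the image actually runs as a non-root user and does not need to write to its own filesystem:
securityContext:
  runAsNonRoot: true
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]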
Resource isolation and noisy neighbors
CPU limits can cause throttling, and memory limits can trigger OOM kills. For latency-sensitive Pods, prefer realistic requests and avoid overly tight limits. For batch workloads, use limits to prevent them from crowding out interactive services. This balance is part of everyday cluster operations.
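Two quick checks, reusing the demo placeholder names; kubectl top assumes metrics-server is installed:
kubectl top pod demo-app -n demo              # live CPU and memory usage
kubectl describe pod demo-app -n demo         # look for Last State: Terminated, Reason: OOMKilled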
Config and secret lifecycle
ConfigMaps and Secrets should be treated as part of the workload contract. Decide whether configuration changes should trigger a rollout or be hot reloaded. Keep sensitive data in Secrets and limit access with RBAC. Document how config changes are promoted between environments.
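A common pattern, with illustrative names: non-sensitive settings from a ConfigMap, credentials from a Secret. Environment variables are not hot reloaded, so this shape implies a rollout on change:
containers:
- name: api
  image: registry.example.com/demo-app:1.0
  envFrom:
  - configMapRef:
      name: demo-app-config        # plain configuration keys become env vars
  env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: demo-app-secrets
        key: db-password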
Debugging workflow
A steady workflow saves time. Start with describe for events, then logs, then exec into a container if needed. For probes, check the endpoint directly from inside the Pod to confirm it works. For namespace or quota issues, inspect ResourceQuota and LimitRange objects to see why a Pod was rejected.
kubectl get pods -n demo
kubectl describe pod demo-app -n demo
kubectl logs demo-app -n demo --tail=200
kubectl exec -it demo-app -n demo -- sh
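To check the probe endpoint itself, assuming the /readyz path and port from the earlier example and an image that ships wget or curl:
kubectl exec demo-app -n demo -- wget -qO- http://localhost:8080/readyz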
Observability signals
Events explain scheduling and startup failures. Logs tell you application behavior. Metrics show trends like CPU spikes or memory growth. Combine these three signals before guessing. A short habit of checking all three saves long debugging cycles.
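One command per signal, using the same demo names; kubectl top assumes metrics-server:
kubectl get events -n demo --sort-by=.lastTimestamp   # scheduling and startup failures
kubectl logs demo-app -n demo --since=15m             # recent application behavior
kubectl top pod -n demo                               # CPU and memory trends right now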
Practical stability checklist
Make sure each workload has labels, requests, and probes. Ensure Services can find Pods via selectors. Verify that namespaces have the right RBAC bindings and quotas. Finally, confirm that termination and startup behavior matches your real traffic patterns.
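Quick ways to verify each point; the namespace and service account names are placeholders:
kubectl get endpoints demo-app -n demo        # empty endpoints usually means a selector mismatch
kubectl get resourcequota,limitrange -n demo
kubectl auth can-i list pods -n demo --as=system:serviceaccount:demo:demo-app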
Field checklist
When you move from a quick lab to real traffic, confirm the basics every time. Check resource requests, readiness behavior, log coverage, alerting, and clear rollback steps. A checklist prevents skipping the boring steps that keep services stable. Keep it short, repeatable, and stored with the repo so it evolves with the service and stays close to the code.
Troubleshooting flow
Start from symptoms, not guesses. Review recent events for scheduling, image, or probe failures, then scan logs for application errors. If traffic is failing, confirm readiness, verify endpoints, and trace the request path hop by hop. When data looks wrong, validate the active version and configuration against the release plan. Always record what you changed so a rollback is fast and a postmortem is accurate.
Small exercises to build confidence
Practice common operations in a safe environment. Scale the workload up and down and observe how quickly it stabilizes. Restart a single Pod and watch how the service routes around it. Change one configuration value and verify that the change is visible in logs or metrics. These small drills teach how the system behaves under real operations without waiting for an outage.
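The drills map to a few commands, assuming a Deployment named demo-app:
kubectl scale deployment demo-app -n demo --replicas=3
kubectl get pods -n demo -w                      # watch the new replicas come up and stabilize
kubectl delete pod <one-demo-app-pod> -n demo    # restart a single Pod; the Service routes around it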
Production guardrails
Introduce limits gradually. Resource quotas, PodDisruptionBudgets, and network policies should be tested in staging before production. Keep backups and restore procedures documented, even for stateless services, because dependencies often are not stateless. Align monitoring with user outcomes so you catch regressions before they become incidents.
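A minimal PodDisruptionBudget sketch for the demo workload; minAvailable is an example value to tune per service:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: demo-app
  namespace: demo
spec:
  minAvailable: 1                # keep at least one Pod up during voluntary disruptions
  selector:
    matchLabels:
      app: demo-app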
Documentation and ownership
Write down who owns the service, what success looks like, and which dashboards to use. Include the on-call rotation, escalation path, and basic runbooks for common failures. A small amount of documentation removes a lot of guesswork during incidents and helps new team members ramp up quickly.
Quick validation
After any change, validate the system the same way a user would. Hit the main endpoint, check latency, and watch for error spikes. Confirm that new pods are ready, old ones are gone, and metrics are stable. If the change touched storage, verify disk usage and cleanup behavior. If it touched networking, confirm DNS names and endpoint lists are correct.
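A quick post-change pass with the same demo names:
kubectl rollout status deployment demo-app -n demo   # new Pods ready, old ones gone
kubectl get pods -n demo                             # everything Ready, no restart loops
kubectl get endpoints demo-app -n demo               # the Service routes to the new Pods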
Release notes
Write a short note with what changed, why it changed, and how to roll back. This is not bureaucracy; it prevents confusion during incidents. Even a few bullets help future you remember intent and context.
Capacity check
Compare current usage to requests and limits. If the service is close to limits, plan a small scaling adjustment before traffic grows. Capacity planning is easier when it is incremental rather than reactive.
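Usage versus what is reserved, assuming metrics-server and a known node name:
kubectl top pod -n demo --containers     # live usage per container
kubectl describe node <node-name>        # Allocated resources: requests and limits vs. node capacity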
Final reminder
Keep changes small and observable. If a release is risky, reduce scope and validate in staging first. Prefer frequent small updates over rare large ones. When in doubt, pick the option that simplifies rollback and reduces time to detect issues. The goal is not perfect config, but predictable operations.