Probes
Tell Kubernetes when an app is ready or needs a restart.
Probes decide whether a Pod should receive traffic or be restarted.
Three types
- readiness: can the container receive traffic
- liveness: should the container be restarted
- startup: hold off the other probes while a slow app starts
Example
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /livez
    port: 8080
  periodSeconds: 10
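The startup probe from the list above follows the same shape. A minimal sketch, assuming the same port and an illustrative threshold; while a startup probe is defined and has not yet succeeded, the liveness and readiness checks are held off:
startupProbe:
  httpGet:
    path: /livez
    port: 8080
  periodSeconds: 5
  failureThreshold: 30   # up to 30 x 5s = 150s for startup before restarts kick in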
Common pitfalls
- Using a liveness probe as a traffic switch; gating traffic is readiness's job
- Timeouts shorter than the endpoint's real response time
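If short timeouts are the issue, set them explicitly instead of relying on the 1-second default. The numbers below are illustrative:
livenessProbe:
  httpGet:
    path: /livez
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 3      # default is 1s, often too tight under load
  failureThreshold: 3    # restart only after several consecutive failures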
Practical notes
- Start with a quick inventory: kubectl get nodes, kubectl get pods -A, and kubectl get events -A.
- Compare desired vs. observed state; kubectl describe usually explains drift or failed controllers.
- Keep names, labels, and selectors consistent so Services and controllers can find Pods.
Quick checklist
- The resource matches the intent you described in YAML.
- Namespaces, RBAC, and images are correct for the target environment.
- Health checks and logs are in place before promotion.
Probes in a workload-oriented view
Whether you are defining a Pod, tuning probes, or organizing namespaces, the goal is to make workloads predictable. Probes are one part of how Kubernetes manages lifecycle, alongside the scheduling and isolation mechanisms covered below. Think of them as a tool for turning application intent into a repeatable unit of operation.
Labels, selectors, and ownership
Every workload should be discoverable. Labels are the primary index, and selectors are how controllers and Services find what they manage. Use consistent keys like app, component, and env. Ownership links, such as controller references, determine which objects are recreated when something is deleted. Without a consistent label strategy, even simple troubleshooting becomes slow.
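As a sketch of the idea (names, labels, and image are illustrative), the Service selector has to match the labels the Deployment stamps onto its Pods:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
        component: api
        env: staging
    spec:
      containers:
      - name: api
        image: registry.example.com/demo-app:1.0
---
apiVersion: v1
kind: Service
metadata:
  name: demo-app
spec:
  selector:
    app: demo-app        # must match the Pod template labels above
  ports:
  - port: 80
    targetPort: 8080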
Scheduling and resource requests
The scheduler relies on requests to place Pods. If requests are missing, the cluster cannot make fair placement decisions, and overload becomes likely. For small services, start with conservative requests and measure. For batch jobs, set limits to protect critical workloads. At the namespace level, quotas and limit ranges are how you enforce these rules.
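A minimal per-container sketch; the numbers are placeholders to be replaced with measured values:
resources:
  requests:
    cpu: 100m          # what the scheduler reserves for placement
    memory: 128Mi
  limits:
    cpu: 500m          # hard ceiling; usage above this is throttled
    memory: 256Mi      # exceeding this triggers an OOM kill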
Startup, readiness, and termination
Probes should reflect true readiness, not just process liveness. A container can be running and still not ready to serve traffic. Use readiness probes to gate traffic, and make liveness probes forgiving to avoid restart loops. Shutdown matters too: define terminationGracePeriodSeconds and handle SIGTERM so the app can flush work and release locks.
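A sketch of the shutdown side, assuming an app that handles SIGTERM; the container name, image, and preStop sleep are illustrative, not required values:
spec:
  terminationGracePeriodSeconds: 30      # the default; raise it if flushing work takes longer
  containers:
  - name: api
    image: registry.example.com/demo-app:1.0
    lifecycle:
      preStop:
        exec:
          command: ["sh", "-c", "sleep 5"]   # small drain window; SIGTERM follows once preStop completes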
Isolation and security basics
Namespaces separate teams and environments, but they are not a hard boundary. Combine them with RBAC, NetworkPolicy, and Pod security settings. SecurityContext settings like runAsNonRoot, readOnlyRootFilesystem, and drop capabilities are small changes that reduce risk. If a workload needs extra permissions, document why.
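A container-level securityContext sketch covering those settings; it only works if the image actually runs as a non-root user and does not need to write to its own filesystem:
securityContext:
  runAsNonRoot: true
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]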
Resource isolation and noisy neighbors
CPU limits can cause throttling, and memory limits can trigger OOM kills. For latency-sensitive Pods, prefer realistic requests and avoid overly tight limits. For batch workloads, use limits to prevent them from crowding out interactive services. This balance is part of everyday cluster operations.
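Two quick checks, reusing the demo placeholder names; kubectl top assumes metrics-server is installed:
kubectl top pod demo-app -n demo              # live CPU and memory usage
kubectl describe pod demo-app -n demo         # look for Last State: Terminated, Reason: OOMKilled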
Config and secret lifecycle
ConfigMaps and Secrets should be treated as part of the workload contract. Decide whether configuration changes should trigger a rollout or be hot reloaded. Keep sensitive data in Secrets and limit access with RBAC. Document how config changes are promoted between environments.
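A common pattern, with illustrative names: non-sensitive settings from a ConfigMap, credentials from a Secret. Environment variables are not hot reloaded, so this shape implies a rollout on change:
containers:
- name: api
  image: registry.example.com/demo-app:1.0
  envFrom:
  - configMapRef:
      name: demo-app-config        # plain configuration keys become env vars
  env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: demo-app-secrets
        key: db-password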
Debugging workflow
A steady workflow saves time. Start with describe for events, then logs, then exec into a container if needed. For probes, check the endpoint directly from inside the Pod to confirm it works. For namespace or quota issues, inspect ResourceQuota and LimitRange objects to see why a Pod was rejected.
kubectl get pods -n demo
kubectl describe pod demo-app -n demo
kubectl logs demo-app -n demo --tail=200
kubectl exec -it demo-app -n demo -- sh
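To check the probe endpoint itself, assuming the /readyz path and port from the earlier example and an image that ships wget or curl:
kubectl exec demo-app -n demo -- wget -qO- http://localhost:8080/readyz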
Observability signals
Events explain scheduling and startup failures. Logs tell you application behavior. Metrics show trends like CPU spikes or memory growth. Combine these three signals before guessing. A short habit of checking all three saves long debugging cycles.
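One command per signal, using the same demo names; kubectl top assumes metrics-server:
kubectl get events -n demo --sort-by=.lastTimestamp   # scheduling and startup failures
kubectl logs demo-app -n demo --since=15m             # recent application behavior
kubectl top pod -n demo                               # CPU and memory trends right now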
Practical stability checklist
Make sure each workload has labels, requests, and probes. Ensure Services can find Pods via selectors. Verify that namespaces have the right RBAC bindings and quotas. Finally, confirm that termination and startup behavior matches your real traffic patterns.
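Quick ways to verify each point; the namespace and service account names are placeholders:
kubectl get endpoints demo-app -n demo        # empty endpoints usually means a selector mismatch
kubectl get resourcequota,limitrange -n demo
kubectl auth can-i list pods -n demo --as=system:serviceaccount:demo:demo-app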
Field checklist
When you move from a quick lab to real traffic, confirm the basics every time. Check resource requests, readiness behavior, log coverage, alerting, and clear rollback steps. A checklist prevents skipping the boring steps that keep services stable. Keep it short, repeatable, and stored with the repo so it evolves with the service and stays close to the code.
Troubleshooting flow
Start from symptoms, not guesses. Review recent events for scheduling, image, or probe failures, then scan logs for application errors. If traffic is failing, confirm readiness, verify endpoints, and trace the request path hop by hop. When data looks wrong, validate the active version and configuration against the release plan. Always record what you changed so a rollback is fast and a postmortem is accurate.
Small exercises to build confidence
Practice common operations in a safe environment. Scale the workload up and down and observe how quickly it stabilizes. Restart a single Pod and watch how the service routes around it. Change one configuration value and verify that the change is visible in logs or metrics. These small drills teach how the system behaves under real operations without waiting for an outage.
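The drills map to a few commands, assuming a Deployment named demo-app:
kubectl scale deployment demo-app -n demo --replicas=3
kubectl get pods -n demo -w                      # watch the new replicas come up and stabilize
kubectl delete pod <one-demo-app-pod> -n demo    # restart a single Pod; the Service routes around it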
Production guardrails
Introduce limits gradually. Resource quotas, PodDisruptionBudgets, and network policies should be tested in staging before production. Keep backups and restore procedures documented, even for stateless services, because dependencies often are not stateless. Align monitoring with user outcomes so you catch regressions before they become incidents.
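A minimal PodDisruptionBudget sketch for the demo workload; minAvailable is an example value to tune per service:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: demo-app
  namespace: demo
spec:
  minAvailable: 1                # keep at least one Pod up during voluntary disruptions
  selector:
    matchLabels:
      app: demo-app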
Documentation and ownership
Write down who owns the service, what success looks like, and which dashboards to use. Include the on-call rotation, escalation path, and basic runbooks for common failures. A small amount of documentation removes a lot of guesswork during incidents and helps new team members ramp up quickly.
Quick validation
After any change, validate the system the same way a user would. Hit the main endpoint, check latency, and watch for error spikes. Confirm that new pods are ready, old ones are gone, and metrics are stable. If the change touched storage, verify disk usage and cleanup behavior. If it touched networking, confirm DNS names and endpoint lists are correct.
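A quick post-change pass with the same demo names:
kubectl rollout status deployment demo-app -n demo   # new Pods ready, old ones gone
kubectl get pods -n demo                             # everything Ready, no restart loops
kubectl get endpoints demo-app -n demo               # the Service routes to the new Pods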
Release notes
Write a short note with what changed, why it changed, and how to roll back. This is not bureaucracy; it prevents confusion during incidents. Even a few bullets help future you remember intent and context.
Capacity check
Compare current usage to requests and limits. If the service is close to limits, plan a small scaling adjustment before traffic grows. Capacity planning is easier when it is incremental rather than reactive.
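Usage versus what is reserved, assuming metrics-server and a known node name:
kubectl top pod -n demo --containers     # live usage per container
kubectl describe node <node-name>        # Allocated resources: requests and limits vs. node capacity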
Final reminder
Keep changes small and observable. If a release is risky, reduce scope and validate in staging first. Prefer frequent small updates over rare large ones. When in doubt, pick the option that simplifies rollback and reduces time to detect issues. The goal is not perfect config, but predictable operations.