What we find in most Kubernetes cluster reviews
1/12/2026 • Mico Boje
We review production Kubernetes clusters for organisations in fintech, defence, and critical infrastructure. The environments vary. The problems do not.
Most clusters we see were built under delivery pressure by teams that knew enough to get things running. That is not a criticism - it is the reality of shipping. But what works for a proof of concept becomes a liability in production, especially in regulated environments where a breach has consequences beyond a PR incident.
These are the five issues we find most often.
1. No network segmentation
The single most common finding. The cluster has no Kubernetes NetworkPolicies, which means every pod can talk to every other pod by default. If an attacker compromises one container, they can reach the database, sniff traffic between services, and move laterally across the entire cluster without restriction.
We regularly find production clusters where the only thing separating application namespaces is a naming convention. No ingress rules, no egress restrictions, no default-deny baseline.
In one review, we found that NSG rules allowed inbound traffic from any source IP, there was no IP whitelisting on the production ingress, and no WAF or application gateway sat in front of the cluster. The production IngressRoute files were named "ingressroute-dev.yaml" - a naming artefact from when someone copied the dev config and never renamed it.
What to do about it: Start with a default-deny policy in every namespace. Then add explicit allow rules for the traffic flows you actually need. This is not a weekend project, but you can roll it out namespace by namespace without downtime.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
Add namespace-specific allow policies after this is in place. The goal is that every allowed traffic path is documented in code, not assumed by default.
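As a sketch of what those explicit allow rules look like, here is a policy that permits only frontend pods to reach an API service on its serving port. The `app: frontend` and `app: api` labels and port 8080 are illustrative - substitute your own selectors:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api   # illustrative name
spec:
  podSelector:
    matchLabels:
      app: api                  # pods this policy protects (hypothetical label)
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend         # the only permitted client (hypothetical label)
    ports:
    - protocol: TCP
      port: 8080                # illustrative serving port
```

With the default-deny policy in place, a pod receives traffic only if a policy like this selects it - which is exactly what makes each allowed path visible in code review.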
2. Secrets handled carelessly
Kubernetes secrets are base64-encoded, not encrypted. This is well known and still widely ignored. We find clusters where credentials are generated inside the cluster by init containers, stored as plain Kubernetes secrets, and never rotated.
A typical pattern: a job runs at startup, generates a random password for a database or message broker, and writes it to a Kubernetes secret. That secret then lives indefinitely. There is no rotation schedule, no expiry, no revocation capability. If the job restarts and the secret already exists, it reuses the old one.
We also find Vault tokens exported as environment variables. Anyone with permission to exec into a pod or describe it in the same namespace can read them. The exposure window is short but it exists, and in a targeted attack, short is enough.
Other common findings: Redis running with no authentication enabled, ArgoCD running in insecure mode, no connection between RBAC in the identity provider and RBAC in the cluster tooling.
What to do about it: Encrypt secrets at rest in etcd. Lock them down with RBAC so only the services that need them can read them. Introduce rotation - even a 90-day manual rotation is better than secrets that live forever. For anything sensitive, use an external secret manager (Vault, Azure Key Vault, AWS Secrets Manager) with short-lived, scoped tokens rather than long-lived environment variables.
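On self-managed clusters, encryption at rest is enabled by passing an EncryptionConfiguration to the API server via --encryption-provider-config. A minimal sketch (the key itself must be generated and managed securely; on managed clusters such as AKS, EKS, or GKE, envelope encryption is configured through the cloud provider instead):

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
  - secrets
  providers:
  - aescbc:
      keys:
      - name: key1
        secret: <base64-encoded 32-byte key>   # placeholder - generate and store securely
  - identity: {}   # fallback so existing unencrypted secrets remain readable
```

After enabling this, existing secrets must be rewritten (for example with kubectl get secrets --all-namespaces -o json | kubectl replace -f -) so they are re-stored encrypted.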
3. Change management is invisible
Most clusters we review have a similar structure: one folder per environment, each containing a full copy of the Helm values or manifests. Dev, test, staging, production - four folders, four copies of nearly identical configuration.
This makes it nearly impossible to see what actually differs between environments. Promoting a change from dev to production means manually copying files between folders and hoping nothing was missed. There is no natural PR path, no clear diff, and no audit trail for why production looks different from staging.
The teams know this is a problem. It is usually the thing they ask us about first.
What to do about it: Move to a single-source structure with environment-specific overlays. One base values file contains all defaults. Each environment gets a small overlay file containing only what is different.
charts/
  my-service/
    Chart.yaml
    values.yaml         # Base / shared values
    values-dev.yaml     # Dev overrides only
    values-test.yaml    # Test overrides only
    values-stage.yaml   # Stage overrides only
    values-prod.yaml    # Prod overrides only
With this structure, git diff values-dev.yaml values-prod.yaml shows you exactly what differs. Changes to the base automatically propagate to all environments unless explicitly overridden. PRs become meaningful because the diff is small and reviewable.
Pair this with ArgoCD pointing each environment at its own overlay file on the same branch (typically main). ArgoCD recommends against long-lived environment branches because they drift and create merge conflicts. Instead, all environments track a single branch and diverge only through their overlay values. You get sync enforcement, drift detection, and rollback for free.
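Wiring this up means one ArgoCD Application per environment, each layering its overlay on top of the base values. A sketch for production (the repo URL, chart path, and namespace are hypothetical):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service-prod          # illustrative name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-repo   # hypothetical repo
    targetRevision: main         # all environments track the same branch
    path: charts/my-service
    helm:
      valueFiles:
      - values.yaml              # base / shared values
      - values-prod.yaml         # prod overrides layered on top
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service        # illustrative target namespace
  syncPolicy:
    automated:
      prune: true                # remove resources deleted from git
      selfHeal: true             # revert manual drift in the cluster
```

The dev, test, and stage Applications differ only in the final valueFiles entry and the destination namespace.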
4. Pod security is an afterthought
Containers running as root. No resource requests or limits. No policy enforcement preventing manual pod creation outside of the GitOps pipeline. No image scanning in CI/CD before images are pushed to the registry.
We find all of these in most reviews. The combination is dangerous: if a pod runs as root with no network restrictions, an attacker who gets code execution in that container can install tools, escalate privileges, access the host filesystem, and pivot to other nodes.
Third-party dependency management is often weak too. Dockerfiles pulling images by mutable tag - :latest, or even version tags like :3.19.0, which a registry will happily let someone re-push - rather than by immutable SHA digest. Package files with version ranges rather than exact locks. No licence scanning, no Dependabot or equivalent, no policy preventing beta or release-candidate versions in production.
What to do about it: Start with policy enforcement. Kyverno or OPA Gatekeeper can block manual pod creation (enforcing GitOps-only deployment), prevent privileged containers, and require resource limits on every pod.
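As an example of what enforcement looks like in practice, here is a Kyverno ClusterPolicy (closely following Kyverno's documented samples) that rejects any pod whose containers lack CPU and memory limits:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce   # block non-compliant pods, not just audit
  rules:
  - name: require-limits
    match:
      any:
      - resources:
          kinds:
          - Pod
    validate:
      message: "CPU and memory limits are required on every container."
      pattern:
        spec:
          containers:
          - resources:
              limits:
                cpu: "?*"       # any non-empty value
                memory: "?*"
```

Roll policies like this out in Audit mode first to surface existing violations, then switch to Enforce once workloads are compliant.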
For images, switch to minimal base images (distroless, chiselled, or Alpine variants) and pin every dependency by SHA digest. Enable Dependabot or Renovate on all repositories. Add image scanning to your CI/CD pipeline so vulnerabilities are caught before they reach the registry, not after they are running in production.
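Policy can back this up by refusing mutable tags at admission time. A sketch of a Kyverno rule, based on Kyverno's disallow-latest-tag sample, that rejects pods using the :latest tag:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Enforce
  rules:
  - name: disallow-latest
    match:
      any:
      - resources:
          kinds:
          - Pod
    validate:
      message: "The :latest tag is mutable; pin images by digest instead."
      pattern:
        spec:
          containers:
          - image: "!*:latest"   # reject any image ending in :latest
```

A stricter variant could require that every image reference contains an @sha256: digest, though that needs images resolved to digests in your pipeline first.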
Add a HorizontalPodAutoscaler and explicit resource requests/limits to your Helm charts. This is also a cost control measure - without limits, a single misbehaving pod can consume an entire node.
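A minimal HPA sketch, scaling an illustrative Deployment on CPU utilisation (names and thresholds are placeholders; the target Deployment must have CPU requests set for utilisation metrics to work):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service              # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service            # hypothetical target Deployment
  minReplicas: 2                # keep a floor for availability
  maxReplicas: 6                # cap blast radius and cost
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75  # scale out above 75% of requested CPU
```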
5. Observability gaps
Logs exist. Metrics exist. Traces might exist. But they are not connected to each other, and nobody has set up the correlation that makes them useful.
A typical finding: the cluster has a logging stack, but it has no backpressure or buffer configuration, so logs get dropped under load (exactly when you need them most). Metrics are collected but there are no meaningful alerts - or the alerts that exist fire so often that the team ignores them. Traces are either not configured or only cover a fraction of services.
Runtime security monitoring is usually absent entirely. No Falco or equivalent watching for privilege escalation, unexpected shell execution, sensitive file modification, or kernel module loading.
What to do about it: Get log-metric-trace correlation working. This means consistent request IDs propagated across services, proper buffer and backpressure configuration on your log collectors, and a single pane where an operator can go from an alert to the relevant logs to the trace that shows what happened.
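As one sketch of backpressure-aware collection, an OpenTelemetry Collector logs pipeline with a memory limiter and batching in front of the exporter (the endpoint and limits are illustrative, not a recommendation for your sizing):

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512              # illustrative cap; refuse data before OOM-killing the collector
  batch:
    timeout: 5s                 # flush at most every 5s under light load
exporters:
  otlphttp:
    endpoint: https://logs.example.internal:4318   # hypothetical backend
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
```

The memory_limiter pushes back on producers instead of silently dropping data under load - which is precisely the failure mode described above.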
For runtime security, Falco rules should cover at minimum: new pods with escalated privileges, execution of setuid binaries, modifications to sensitive files like /etc/passwd, mounting of host sockets, and shell execution as root. These are the indicators of an active compromise, and without them you are flying blind.
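As an example of the last item, a Falco rule detecting a root shell inside a container. This assumes the spawned_process and container macros from Falco's default ruleset are loaded; the rule name and priority are illustrative:

```yaml
- rule: Root Shell in Container
  desc: Detect a shell executed as root inside a container
  condition: >
    spawned_process and container
    and proc.name in (bash, sh, zsh)
    and user.uid=0
  output: >
    Root shell spawned in container
    (user=%user.name container=%container.name command=%proc.cmdline)
  priority: WARNING
```

Falco's default ruleset already ships detections for most of the indicators listed above; the work is usually tuning them to your workloads and routing alerts somewhere the team actually watches.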
These are not exotic problems
Nothing in this list is a novel attack vector or a cutting-edge failure mode. These are the default state of most clusters that grew organically under delivery pressure. The teams running them are usually aware of the gaps - they just have not had the time or mandate to address them.
The fix is not a rewrite. It is a focused review, a prioritised list, and incremental hardening over a few weeks. Default-deny network policies, encrypted secrets with rotation, single-source change management, enforced pod security, and correlated observability. None of these require downtime. All of them dramatically reduce the blast radius when something goes wrong.
If your cluster has been running in production for more than six months without a security review, it almost certainly has some combination of these issues. The question is whether you find them before someone else does.