Responding to Kubernetes incidents

Kubernetes IR

During winter break, I started to think about Kubernetes incident response. This is outside my usual focus on host and network forensics, so I wanted to take some time to read up on what it would take to be prepared in case an incident does occur. After a week of reading through various articles, here are some of my takeaways on the subject.

Always have a plan

  • Who to contact
  • What action to take
  • Identify what data to collect, and how to collect it
  • Identify critical systems to keep the business running
  • Communication Plan

Alerting

Determine what constitutes an actionable alert.

  • SSH from a foreign country after work hours?
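
As a hypothetical sketch of turning the example above into something checkable, here is a one-liner that flags SSH logins outside working hours in a node's auth log. The log path and the 08:00-18:00 window are assumptions, and the "foreign country" part would come from geolocation in your SIEM or log pipeline rather than an ad-hoc command like this.

    # Flag sshd logins that occurred outside 08:00-18:00
    # (assumes a Debian/Ubuntu-style /var/log/auth.log on the worker node).
    awk '/sshd.*Accepted/ { split($3, t, ":"); h = t[1] + 0;
         if (h < 8 || h >= 18) print }' /var/log/auth.log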

Response

  • Identify the offending Pod/worker node
  • Isolate the pod by creating a NetworkPolicy that denies all ingress/egress traffic to it
  • Revoke temporary security credentials assigned to the pod or worker node if necessary
  • Cordon impacted nodes
  • Enable termination protection on impacted worker node
  • Label the offending pod/node to indicate it’s part of an active investigation
  • Capture volatile artifacts on the worker node (processes/network), pause the container for forensic capture, and snapshot the instance’s EBS volume (a command-level sketch follows this list).
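
To ground the containment steps above, here is a minimal sketch assuming kubectl access and an AWS/EKS-style cluster; the namespace, labels, node name, and instance/volume IDs are placeholders rather than values from any real environment.

    # quarantine-networkpolicy.yaml: deny-all policy that selects only the suspect pod
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: quarantine-suspect-pod
      namespace: prod
    spec:
      podSelector:
        matchLabels:
          app: suspect-app
      policyTypes: ["Ingress", "Egress"]   # no ingress/egress rules defined, so all traffic is denied

Then, from a workstation with cluster and AWS access:

    # Apply the policy, cordon the node, and mark the pod/node as under investigation
    kubectl apply -f quarantine-networkpolicy.yaml
    kubectl cordon ip-10-0-1-23.ec2.internal
    kubectl label pod suspect-pod-abc123 -n prod investigation=active
    kubectl label node ip-10-0-1-23.ec2.internal investigation=active

    # Protect the EC2 instance from termination and snapshot its EBS volume for forensics
    aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --disable-api-termination
    aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "IR evidence: suspect-pod-abc123"

Note that the deny-all effect relies on the cluster’s CNI plugin actually enforcing NetworkPolicies (e.g., Calico or Cilium); on a CNI without policy support the object is accepted but does nothing.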

Investigation

  • How was the container launched?
  • Are there unexpected commands being run? (ln, mv, cp, cat, *.sh, tar, curl, wget)
  • Are there any files in /dev or /proc being touched?
  • Any unexpected network traffic or increased egress traffic from a particular node?
  • Have any binaries changed?
  • Any unexpected files?
  • What interesting things happened on the system? (processes/system calls/files/network/I/O/users)
  • Logging - audit.k8s.io

    • Metadata-level audit logs, at a minimum, should be retained for all cluster events (record which API actions were performed by whom)
    • Full request logs should be retained to provide detailed introspection into past API actions performed by users
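
As a sketch of what that logging point could look like in practice, here is a kube-apiserver audit policy that keeps Metadata for every request and full request/response bodies for writes. Treat it as an illustrative starting point rather than a recommended production policy; the file path is an assumption, and where the policy lives depends on how your control plane is managed.

    # /etc/kubernetes/audit-policy.yaml
    apiVersion: audit.k8s.io/v1
    kind: Policy
    rules:
      # Record full request and response bodies for mutating calls
      - level: RequestResponse
        verbs: ["create", "update", "patch", "delete"]
      # Everything else: who did what, when, and from where (no request bodies)
      - level: Metadata

    # Enable it on the API server with:
    #   --audit-policy-file=/etc/kubernetes/audit-policy.yaml
    #   --audit-log-path=/var/log/kubernetes/audit.log

On a managed control plane such as EKS you don’t run kube-apiserver yourself; there you would turn on the provider’s control-plane audit logging instead.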

Playbooks

Have some playbooks readily available in case of incident:

  • Snapshot for forensic evidence collection (sketched below, after this list)

    • Identify what types of artifacts you would like to collect (e.g., process list, network logs, etc.)
    • Where to store those artifacts? (e.g., S3?)
  • Pod isolation - How to scale down/kill/restart suspicious pods

    • Cordon/drain the node (remove it from service) or limit network access to the VM hosting the compromised container
    • Change the pod’s labels; mark it as quarantine=true
    • (If breached, scale suspicious pods to zero to kill them, then restart clean instances of the application)
  • Container redeployment
  • Workload deletion
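
Below is a rough sketch of the first two playbooks, assuming Docker as the container runtime, the AWS CLI available on the node, and S3 as the evidence store; container IDs, node names, volume IDs, and bucket names are all placeholders.

    # Forensic evidence collection on the compromised worker node
    mkdir -p /tmp/ir
    ps auxf > /tmp/ir/process-list.txt              # running processes
    ss -tanp > /tmp/ir/network-connections.txt      # open/active network connections
    docker pause abc123def456                       # freeze the suspect container's processes
    aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "IR: ip-10-0-1-23"
    aws s3 cp /tmp/ir/ s3://example-ir-evidence-bucket/case-001/ --recursive

    # Pod isolation and scale-down/redeploy
    kubectl label pod suspect-pod-abc123 -n prod quarantine=true --overwrite
    kubectl drain ip-10-0-1-23.ec2.internal --ignore-daemonsets --delete-emptydir-data
    kubectl scale deployment suspect-app -n prod --replicas=0   # kill the breached pods
    kubectl scale deployment suspect-app -n prod --replicas=3   # redeploy once the image/config is fixed

Draining and scaling down are destructive to volatile evidence, so capture the snapshot and the process/network state first, as shown in the ordering above.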

The Do Not’s

This is probably pretty self-explanatory. But just in case…

  • Panic; terminate and delete all nodes/containers/disks
  • Log in to the server/container to see if you can track it down

Enjoy!