Responding to Kubernetes incidents

Kubernetes IR

During winter break, I started to think about Kubernetes incident response. This is outside my usual focus on host and network forensics, so I wanted to take some time to read up on what it would take to be prepared in case an incident does occur. After a week of reading through various articles, here are some of my takeaways on the subject.

Always have a plan

  • Who to contact
  • What action to take
  • Identify what data to collect, and how to collect it
  • Identify critical systems to keep the business running
  • Communication Plan

Alerting

Determine what constitutes an actionable alert.

  • SSH from a foreign country after work hours?
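
As a hypothetical sketch of turning the example above into something checkable, here is a one-liner that flags SSH logins outside working hours in a node's auth log. The log path and the 08:00-18:00 window are assumptions, and the "foreign country" part would come from geolocation in your SIEM or log pipeline rather than an ad-hoc command like this.

    # Flag sshd logins that occurred outside 08:00-18:00
    # (assumes a Debian/Ubuntu-style /var/log/auth.log on the worker node).
    awk '/sshd.*Accepted/ { split($3, t, ":"); h = t[1] + 0;
         if (h < 8 || h >= 18) print }' /var/log/auth.log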

Response

  • Identify the offending Pod/worker node
  • Isolate the pod by creating a NetworkPolicy that denies all ingress/egress traffic to it
  • Revoke temporary security credentials assigned to the pod or worker node if necessary
  • Cordon impacted nodes
  • Enable termination protection on impacted worker node
  • Label the offending pod/node to indicate it’s part of an active investigation
  • Capture volatile artifacts on the worker node (processes/network), pause the container for forensic capture, and snapshot the instance’s EBS volume (a command-level sketch follows this list).
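
To ground the containment steps above, here is a minimal sketch assuming kubectl access and an AWS/EKS-style cluster; the namespace, labels, node name, and instance/volume IDs are placeholders rather than values from any real environment.

    # quarantine-networkpolicy.yaml: deny-all policy that selects only the suspect pod
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: quarantine-suspect-pod
      namespace: prod
    spec:
      podSelector:
        matchLabels:
          app: suspect-app
      policyTypes: ["Ingress", "Egress"]   # no ingress/egress rules defined, so all traffic is denied

Then, from a workstation with cluster and AWS access:

    # Apply the policy, cordon the node, and mark the pod/node as under investigation
    kubectl apply -f quarantine-networkpolicy.yaml
    kubectl cordon ip-10-0-1-23.ec2.internal
    kubectl label pod suspect-pod-abc123 -n prod investigation=active
    kubectl label node ip-10-0-1-23.ec2.internal investigation=active

    # Protect the EC2 instance from termination and snapshot its EBS volume for forensics
    aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --disable-api-termination
    aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "IR evidence: suspect-pod-abc123"

Note that the deny-all effect relies on the cluster’s CNI plugin actually enforcing NetworkPolicies (e.g., Calico or Cilium); on a CNI without policy support the object is accepted but does nothing.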

Investigation

  • How was the container launched?
  • Are there unexpected commands being run? (ln, mv, cp, cat, *.sh, tar, curl, wget)
  • Are there any files in /dev or /proc being touched?
  • Any unexpected network traffic or increased egress traffic from a particular node?
  • Have any binaries changed?
  • Any unexpected files?
  • What interesting things happened on the system? (processes/system calls/files/network/I/O/users)
  • Logging - audit.k8s.io

    • Metadata-level audit logs, at a minimum, should be retained for all cluster events (record which API actions were performed by whom)
    • Full request logs should be retained to provide detailed introspection into past API actions performed by users
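
As a sketch of what that logging point could look like in practice, here is a kube-apiserver audit policy that keeps Metadata for every request and full request/response bodies for writes. Treat it as an illustrative starting point rather than a recommended production policy; the file path is an assumption, and where the policy lives depends on how your control plane is managed.

    # /etc/kubernetes/audit-policy.yaml
    apiVersion: audit.k8s.io/v1
    kind: Policy
    rules:
      # Record full request and response bodies for mutating calls
      - level: RequestResponse
        verbs: ["create", "update", "patch", "delete"]
      # Everything else: who did what, when, and from where (no request bodies)
      - level: Metadata

    # Enable it on the API server with:
    #   --audit-policy-file=/etc/kubernetes/audit-policy.yaml
    #   --audit-log-path=/var/log/kubernetes/audit.log

On a managed control plane such as EKS you don’t run kube-apiserver yourself; there you would turn on the provider’s control-plane audit logging instead.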

Playbooks

Have some playbooks readily available in case of incident:

  • Snapshot for forensic evidence collection (sketched below, after this list)

    • Identify what types of artifacts you would like to collect (e.g., process list, network logs, etc.)
    • Where to store those artifacts? (e.g., S3?)
  • Pod isolation - How to scale down/kill/restart suspicious pods

    • Cordon/drain the node (remove it from service) or limit network access to the VM hosting the compromised container
    • Change the pod’s labels; mark it as quarantine=true
    • (If breached, scale suspicious pods to zero to kill them, then restart clean instances of the application)
  • Container redeployment
  • Workload deletion
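
Below is a rough sketch of the first two playbooks, assuming Docker as the container runtime, the AWS CLI available on the node, and S3 as the evidence store; container IDs, node names, volume IDs, and bucket names are all placeholders.

    # Forensic evidence collection on the compromised worker node
    mkdir -p /tmp/ir
    ps auxf > /tmp/ir/process-list.txt              # running processes
    ss -tanp > /tmp/ir/network-connections.txt      # open/active network connections
    docker pause abc123def456                       # freeze the suspect container's processes
    aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "IR: ip-10-0-1-23"
    aws s3 cp /tmp/ir/ s3://example-ir-evidence-bucket/case-001/ --recursive

    # Pod isolation and scale-down/redeploy
    kubectl label pod suspect-pod-abc123 -n prod quarantine=true --overwrite
    kubectl drain ip-10-0-1-23.ec2.internal --ignore-daemonsets --delete-emptydir-data
    kubectl scale deployment suspect-app -n prod --replicas=0   # kill the breached pods
    kubectl scale deployment suspect-app -n prod --replicas=3   # redeploy once the image/config is fixed

Draining and scaling down are destructive to volatile evidence, so capture the snapshot and the process/network state first, as shown in the ordering above.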

The Do Not’s

This is probably pretty self-explanatory. But just in case…

  • Panic; terminate and delete all nodes/containers/disks
  • Log in to the server/container to see if you can track it down

Enjoy!