During winter break, I started to think about Kubernetes incident response. This is beyond my normal learning scope of host and network forensics, so I wanted to take some time to read through what it would take to be prepared in case an incident does occur. After a week of reading through articles, here are some of my takeaways on the subject.

Have an incident response plan that covers:
- Who to contact
- What action to take
- Identify what data to collect, and how to collect it
- Identify critical systems to keep the business running
- Communication Plan
Determine what constitutes an actionable alert.
- An SSH login from a foreign country after work hours?
- Identify the offending Pod/worker node
- Isolate the pod by creating a NetworkPolicy that denies all ingress/egress traffic to it
- Revoke temporary security credentials assigned to the pod or worker node if necessary
- Cordon impacted nodes
- Enable termination protection on impacted worker node
- Label the offending pod/node to indicate it is part of an active investigation
- Capture volatile artifacts on the worker node (processes, network connections), pause the container for forensic capture, and snapshot the instance's EBS volume.
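The containment steps above can be sketched as a single playbook function. This is a minimal sketch, assuming kubectl access to the cluster; the namespace, pod, and node names are illustrative:

```shell
#!/bin/sh
# Quarantine playbook sketch: label, network-isolate, and cordon.
# All names passed in are illustrative; assumes kubectl is configured.
isolate_pod() {
  ns="$1"; pod="$2"; node="$3"

  # 1. Label the pod so responders (and the NetworkPolicy below) can find it
  kubectl -n "$ns" label pod "$pod" quarantine=true --overwrite

  # 2. Deny all ingress/egress to anything carrying the quarantine label
  #    (a NetworkPolicy with both policyTypes and no rules drops all traffic)
  kubectl -n "$ns" apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine-deny-all
spec:
  podSelector:
    matchLabels:
      quarantine: "true"
  policyTypes: ["Ingress", "Egress"]
EOF

  # 3. Stop the scheduler from placing new pods on the impacted node
  kubectl cordon "$node"
}
```

Labeling first, then selecting on the label in the NetworkPolicy, means the same policy quarantines any future pod you tag during the investigation.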
Questions to ask during the investigation:
- How was the container launched?
- Are there unexpected commands being run? (ln, mv, cp, cat, *.sh, tar, curl, wget)
- Are there any files in /dev or /proc being touched?
- Any unexpected network traffic or increased egress traffic from a particular node?
- Have any binaries changed?
- Any unexpected files?
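Several of these questions are answered from volatile data, which you can pull straight from /proc before touching anything else. A minimal capture sketch (output paths and the set of directories scanned are illustrative):

```shell
#!/bin/sh
# Volatile-data capture sketch; run this before pausing the container.
OUT="/tmp/triage-$(date +%s)"
mkdir -p "$OUT"

# Process list read directly from /proc, so it works even on minimal
# images that ship without ps
for p in /proc/[0-9]*/cmdline; do
  pid=${p#/proc/}; pid=${pid%/cmdline}
  printf '%s\t' "$pid"
  tr '\0' ' ' < "$p"
  echo
done > "$OUT/processes.txt"

# Open TCP/UDP sockets (hex-encoded addresses; decode offline)
cat /proc/net/tcp /proc/net/udp > "$OUT/connections.txt" 2>/dev/null

# Binaries modified in the last day -- a first pass at "have any binaries changed?"
find /usr/bin /usr/sbin -mtime -1 > "$OUT/recent-binaries.txt" 2>/dev/null

echo "$OUT"
```

Everything lands in one timestamped directory so it can be hashed and shipped off the node as a unit.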
- What interesting things happened on the system? (processes, system calls, files, network, I/O, users)

Logging - audit.k8s.io
- At minimum, Metadata-level audit logs should be retained for all cluster events (recording which API actions were performed by whom)
- Full request logs should be retained to provide detailed introspection into past API actions performed by users
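One way to express that retention guidance is an audit policy file wired into the API server via its `--audit-policy-file` and `--audit-log-path` flags. The split between rules here is illustrative, not a recommendation from the original articles:

```shell
#!/bin/sh
# Write a sketch of an audit policy: full request bodies for writes,
# metadata (who did what, when) for everything else.
cat > /tmp/audit-policy.yaml <<'EOF'
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Full request bodies for state-changing API actions
  - level: Request
    verbs: ["create", "update", "patch", "delete"]
  # At minimum, metadata for every other cluster event
  - level: Metadata
EOF
```

Rules are evaluated top to bottom, so the broad Metadata rule must come last or it would shadow the Request rule.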
Have some playbooks readily available in case of an incident:
Snapshotting for forensic evidence collection
- Identify what types of artifacts you would like to collect (e.g., process list, network logs, etc.)
- Where will those artifacts be stored? (e.g., S3)
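A snapshot playbook along those lines might look like the sketch below. It assumes the AWS CLI is configured; the volume ID, case ID, artifact directory, and bucket are all placeholders you would fill in per incident:

```shell
#!/bin/sh
# Forensic evidence collection sketch: snapshot the disk, ship the
# volatile artifacts. All arguments are illustrative placeholders.
snapshot_for_forensics() {
  volume_id="$1"; case_id="$2"; artifact_dir="$3"; bucket="$4"

  # Snapshot the worker node's EBS volume for offline disk analysis
  aws ec2 create-snapshot \
    --volume-id "$volume_id" \
    --description "IR case $case_id - do not delete"

  # Ship already-collected artifacts (process list, network logs) to S3
  aws s3 cp "$artifact_dir" "s3://$bucket/$case_id/" --recursive
}
```

Tagging the description with the case ID makes the snapshot easy to find later and signals to teammates that it should not be cleaned up.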
Pod isolation - How to scale down/kill/restart suspicious pods
- Cordoning/draining (removing the node from service)/limiting network access to the VM hosting the compromised container
- Change the pod's label to mark it as quarantine=true
- (If breached, scale suspicious pods to zero to kill them, then restart instances of the breached application)
- Container redeployment
- Workload deletion
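The scale-down/restart options above can be sketched against a Deployment as follows. The namespace, deployment name, and replica count are illustrative, and each command is a separate playbook step rather than something to run back-to-back:

```shell
#!/bin/sh
# Containment-by-redeployment sketch; assumes kubectl access.
contain_workload() {
  ns="$1"; deploy="$2"

  # Kill all pods of the breached application by scaling to zero
  kubectl -n "$ns" scale deployment "$deploy" --replicas=0

  # Once the image/config is verified clean, bring fresh instances back
  kubectl -n "$ns" scale deployment "$deploy" --replicas=3

  # Alternatively, restart in place: recreates every pod from the current spec
  kubectl -n "$ns" rollout restart deployment "$deploy"
}
```

Scaling to zero is the blunter option; `rollout restart` keeps the service definition intact while still cycling every running pod.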
What not to do is probably pretty self-explanatory. But just in case, avoid the following:
- Panic; terminate and delete all nodes/containers/disks
- Log in to the server/container to see if you can track it down yourself