The Kubernetes network stack has many layers. One of the most important is conntrack.
Table of contents:
· What is conntrack?
· What is the conntrack limit?
· Why does conntrack get full?
· In Kubernetes, DNS queries are not so simple
· How to fix this issue?
· Solution 1: Add RAM
· Solution 2: Put a dot
· Solution 3: Modify the nf_conntrack_max value via DaemonSet
· Solution 4: Reduce the number of DNS connections
· Solution 5: Send sequential DNS requests
· Solution 6: UDP is fragile; consider TCP
· Solution 7: Use NodeLocal DNS Cache
You may have seen this error log on your system:
nf_conntrack: table full, dropping packets
What is conntrack?
conntrack stands for “Connection Tracking”. It is a component of the Netfilter framework in the Linux kernel and functions as a database of the network connections that travel through a Linux server. It tracks the state of each network connection (TCP, UDP, ICMP, etc.).
The state of a network connection is critical for packet filtering and NAT rules in Kubernetes.
What is the conntrack limit?
Conntrack has a maximum number of entries it can hold for active network traffic. The Linux kernel automatically sets this maximum based on the server’s RAM (source code).
However, each Linux distribution or cloud provider may assign a different value.
Log in to your Kubernetes node and check its limit:
cat /proc/sys/net/netfilter/nf_conntrack_max
Get the current number of entries in conntrack:
cat /proc/sys/net/netfilter/nf_conntrack_count
In your monitoring system, you may create an alarm like this:
(<nf_conntrack_count>/<nf_conntrack_max>) * 100 >= 95
Alert: 95% of the maximum conntrack limit is reached.
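For example, if you scrape node_exporter with Prometheus (which exposes the conntrack counters as node_nf_conntrack_entries and node_nf_conntrack_entries_limit), a minimal alerting rule sketch could look like this:

groups:
- name: conntrack
  rules:
  - alert: ConntrackTableAlmostFull
    expr: (node_nf_conntrack_entries / node_nf_conntrack_entries_limit) * 100 >= 95
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "95% of the maximum conntrack limit is reached on {{ $labels.instance }}"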
Why does conntrack get full?
By default, conntrack retains the entry for a closed TCP connection for 120 seconds and the entry for a UDP flow for 30 seconds. Even if no further packets arrive in the flow, the entry is removed only after that timeout expires.
As a result, many short-lived connections are bad for conntrack. For example, the entry for a single DNS request stays in conntrack’s memory for 30 seconds.
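You can check these timeouts on the node itself; the relevant sysctls (names may vary slightly between kernel versions) are:

cat /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_time_wait   # 120 seconds by default
cat /proc/sys/net/netfilter/nf_conntrack_udp_timeout             # 30 seconds by default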
In Kubernetes, DNS queries are not so simple
Before reviewing DNS requests in Kubernetes, it is worth looking at the /etc/resolv.conf file inside a Kubernetes pod (my test pod runs on EKS).
root@web-debug:~# cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local eu-west-1.compute.internal
nameserver 10.100.0.10
options ndots:5
If the domain name you query has fewer than 5 dots, the resolver first appends each entry in the search list (default.svc.cluster.local, svc.cluster.local, etc.) and tries those names one by one.
If the Kubernetes DNS server cannot find a record for any of them, the name is finally queried as-is and the request is forwarded to an external DNS server.
Does this seem complicated? I’ll enable CoreDNS logs to check what’s going on.
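Query logging is disabled by default in CoreDNS. A quick way to turn it on (assuming the standard coredns ConfigMap in kube-system) is to add the log plugin to the Corefile:

kubectl -n kube-system edit configmap coredns

Then add log at the top of the server block in the Corefile; the rest of your existing plugins stay unchanged:

.:53 {
    log
    errors
    ...
}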
I’ll execute a simple curl command:
root@web-debug:~# curl google.com
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
...
CoreDNS logs:
What on earth is going on in there!?
google.com was looked up in many domain zones (default.svc.cluster.local, svc.cluster.local, etc.).
CoreDNS couldn’t find any DNS entries in those zones, so it returned NXDOMAIN (non-existent domain). Finally, the DNS request was forwarded to an external DNS server, which responded successfully (NOERROR).
By default, glibc’s getaddrinfo function sends out concurrent A and AAAA requests.
By default, CoreDNS returns a TTL of 30 seconds, so DNS responses are cached on the client side for 30 seconds.
Assume you are operating a service that repeatedly calls internal/external services/domains. Boom! Conntrack’s capacity will run out.
DNS connections stored in the conntrack memory on the Kubernetes node:
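If the conntrack CLI (from the conntrack-tools package) is installed on the node, you can list and count those entries yourself:

conntrack -L -p udp --dport 53 | head     # list UDP flows towards port 53 (DNS)
conntrack -L -p udp --dport 53 | wc -l    # count them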
How to fix this issue?
Either you increase the conntrack limit or you make fewer DNS queries.
Solution 1: Add RAM
Quick fix: add extra RAM to the Kubernetes node, or move to a larger instance type if you are in the cloud.
My t2.medium (4 GiB memory) Kubernetes node has an nf_conntrack_max value of 131072.
My t3.xlarge (16 GiB memory) Kubernetes node has an nf_conntrack_max value of 262144.
Solution 2: Put a dot
If the domain name ends with a dot, it is treated as fully qualified and DNS clients do not go through the search list.
curl google.com.
CoreDNS logs:
Solution 3: Modify the nf_conntrack_max value via DaemonSet
It is possible to increase the Kubernetes node’s nf_conntrack_max value via DaemonSet:
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nf-conntrack-fix
spec:
  selector:
    matchLabels:
      app: nf-conntrack-fix
  template:
    metadata:
      labels:
        app: nf-conntrack-fix
    spec:
      hostNetwork: true
      hostPID: true
      initContainers:
      - name: dependency-install
        command: ["/bin/sh"]
        args: ["-c", "nsenter --mount=/proc/1/ns/mnt -- sh -c 'sysctl -w net.nf_conntrack_max=140000'"]
        image: alpine:latest
        securityContext:
          privileged: true
      containers:
      - name: pause
        image: public.ecr.aws/eks-distro/kubernetes/pause:3.8
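Once the DaemonSet pods have started, you can verify the new value on a node (net.nf_conntrack_max and net.netfilter.nf_conntrack_max point to the same setting):

cat /proc/sys/net/netfilter/nf_conntrack_max   # should now print 140000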
Solution 4: Reduce the number of DNS connections
Let’s assume you call an external API with this domain: api.site.com
If you lower the ndots option to two, a name with at least two dots (such as api.site.com) is queried as-is and does not go through the search list.
Example yaml:
---
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: web-debug
  name: web-debug
spec:
  containers:
  - image: ailhan/web-debug
    name: web-debug
    imagePullPolicy: Always
  dnsConfig:
    options:
    - name: ndots
      value: "2"
CoreDNS logs:
Solution 5: Send sequential DNS requests
By default, DNS clients send the A and AAAA queries from the same socket (note the IP:port column in the CoreDNS logs). Some clients, servers, or network hardware cannot handle these concurrent requests properly, which is why the DNS client keeps re-querying the same domain.
The fix is to send the two queries from different sockets (and therefore different source ports), which is what the single-request-reopen resolver option does:
spec.template.spec:
  dnsConfig:
    options:
    - name: single-request-reopen
Yaml:
---
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: web-debug
  name: web-debug
spec:
  containers:
  - image: ailhan/web-debug
    name: web-debug
    imagePullPolicy: Always
  dnsConfig:
    options:
    - name: single-request-reopen
/etc/resolv.conf should look as follows:
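(Roughly like this on my EKS test pod; the exact search list depends on your cluster. The important part is that single-request-reopen now appears among the resolver options:)

search default.svc.cluster.local svc.cluster.local cluster.local eu-west-1.compute.internal
nameserver 10.100.0.10
options ndots:5 single-request-reopen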
CoreDNS Logs:
Solution 6: UDP is fragile; consider TCP
UDP packets can be dropped without any error, so DNS clients end up retrying the query. The use-vc option forces the glibc resolver to use TCP for DNS lookups.
YAML:
---
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: web-debug
  name: web-debug
spec:
  containers:
  - image: ailhan/web-debug
    name: web-debug
    imagePullPolicy: Always
  dnsConfig:
    options:
    - name: use-vc
CoreDNS logs:
Solution 7: Use NodeLocal DNS Cache
NodeLocal DNSCache improves Cluster DNS performance by running a DNS caching agent on cluster nodes as a DaemonSet.
The DNS caching agent is very simple to install.
If you are using EKS, you can get variable values from here.
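A rough outline of the installation, based on the upstream manifest from the Kubernetes documentation (this sketch assumes kube-proxy runs in iptables mode; 169.254.20.10 is the link-local address commonly used for the node-local cache):

curl -sLO https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml

# fill in the cluster-specific values
kubedns=$(kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}')
domain=cluster.local
localdns=169.254.20.10

sed -i "s/__PILLAR__LOCAL__DNS__/$localdns/g; s/__PILLAR__DNS__DOMAIN__/$domain/g; s/__PILLAR__DNS__SERVER__/$kubedns/g" nodelocaldns.yaml

kubectl apply -f nodelocaldns.yaml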