| # NFTables kube-proxy |
| |
| This is an implementation of service proxying via the nftables API of |
| the kernel netfilter subsystem. |
| |
| ## General theory of netfilter |
| |
| Packet flow through netfilter looks something like: |
| |
| ```text |
| +================+ +=====================+ |
| | hostNetwork IP | | hostNetwork process | |
| +================+ +=====================+ |
| ^ | |
| - - - - - - - - | - - - - - [*] - - - - - - - - - |
| | v |
| +-------+ +--------+ |
| | input | | output | |
| +-------+ +--------+ |
| ^ | |
| +------------+ | +---------+ v +-------------+ |
| | prerouting |-[*]-+-->| forward |--+-[*]->| postrouting | |
| +------------+ +---------+ +-------------+ |
| ^ | |
| - - - - | - - - - - - - - - - - - - - | - - - - |
| | v |
| +---------+ +--------+ |
| --->| ingress | | egress |---> |
| +---------+ +--------+ |
| ``` |
| |
| where the `[*]` represents a routing decision, and all of the boxes except in the top row |
| represent netfilter hooks. More detailed versions of this diagram can be seen at |
| https://en.wikipedia.org/wiki/Netfilter#/media/File:Netfilter-packet-flow.svg and |
| https://wiki.nftables.org/wiki-nftables/index.php/Netfilter_hooks but note that in the the |
| standard version of this diagram, the top two boxes are squished together into "local |
| process" which (a) fails to make a few important distinctions, and (b) makes it look like |
| a single packet can go `input` -> "local process" -> `output`, which it cannot. Note also |
| that the `ingress` and `egress` hooks are special and mostly not available to us; |
| kube-proxy lives in the middle section of diagram, with the five main netfilter hooks. |
| |
| There are three paths through the diagram, called the "input", "forward", and "output" |
| paths, depending on which of those hooks it passes through. Packets coming from host |
| network namespace processes always take the output path, while packets coming in from |
| outside the host network namespace (whether that's from an external host or from a pod |
| network namespace) arrive via `ingress` and take the input or forward path, depending on |
| the routing decision made after `prerouting`; packets destined for an IP which is assigned |
| to a network interface in the host network namespace get routed along the input path; |
| anything else (including, in particular, packets destined for a pod IP) gets routed along |
| the forward path. |
| |
| ## kube-proxy's use of nftables hooks |
| |
| Kube-proxy uses nftables for seven things: |
| |
| - Using DNAT to rewrite traffic from service IPs (cluster IPs, external IPs, load balancer |
| IP, and NodePorts on node IPs) to the corresponding endpoint IPs. |
| |
| - Using SNAT to masquerade traffic as needed to ensure that replies to it will come back |
| to this node/namespace (so that they can be un-DNAT-ed). |
| |
| - Dropping packets that are filtered out by the `LoadBalancerSourceRanges` feature. |
| |
| - Dropping packets for services with `Local` traffic policy but no local endpoints. |
| |
| - Rejecting packets for services with no local or remote endpoints. |
| |
| - Dropping packets to ClusterIPs which are not yet allocated. |
| |
| - Rejecting packets to undefined ports of ClusterIPs. |
| |
| This is implemented as follows: |
| |
| - We do the DNAT for inbound traffic in `prerouting`: this covers traffic coming from |
| off-node to all types of service IPs, and traffic coming from pods to all types of |
| service IPs. (We *must* do this in `prerouting`, because the choice of endpoint IP may |
| affect whether the packet then gets routed along the input path or the forward path.) |
| |
| - We do the DNAT for outbound traffic in `output`: this covers traffic coming from |
| host-network processes to all types of service IPs. Regardless of the final |
| destination, the traffic will take the "output path". (In the case where a |
| host-network process connects to a service IP that DNATs it to a host-network endpoint |
| IP, the traffic will still initially take the "output path", but then reappear on the |
| "input path".) |
| |
| - `LoadBalancerSourceRanges` firewalling has to happen before service DNAT, so we do |
| that on `prerouting` and `output` as well, with a lower (i.e. more urgent) priority |
| than the DNAT chains. |
| |
| - The `drop` and `reject` rules for services with no endpoints don't need to happen |
| explicitly before or after any other rules (since they match packets that wouldn't be |
| matched by any other rules). But with kernels before 5.9, `reject` is not allowed in |
| `prerouting`, so we can't just do them in the same place as the source ranges |
| firewall. So we do these checks from `input`, `forward`, and `output` for |
| `@no-endpoint-services` and from `input` for `@no-endpoint-nodeports` to cover all |
| the possible paths. |
| |
| - Masquerading has to happen in the `postrouting` hook, because "masquerade" means "SNAT |
| to the IP of the interface the packet is going out on", so it has to happen after the |
| final routing decision. (We don't need to masquerade packets that are going to a host |
| network IP, because masquerading is about ensuring that the packet eventually gets |
| routed back to the host network namespace on this node, so if it's never getting |
| routed away from there, there's nothing to do.) |
| |
| - We install a `reject` rule for ClusterIPs matching `@cluster-ips` set and a `drop` |
| rule for ClusterIPs belonging to any of the ServiceCIDRs in `forward` and `output` hook, with a |
| higher (i.e. less urgent) priority than the DNAT chains making sure all valid |
| traffic directed for ClusterIPs is already DNATed. Drop rule will only |
| be installed if `MultiCIDRServiceAllocator` feature is enabled. |