| .. SPDX-License-Identifier: GPL-2.0 | 
 |  | 
 | ==================================== | 
 | Netfilter's flowtable infrastructure | 
 | ==================================== | 
 |  | 
 | This documentation describes the Netfilter flowtable infrastructure which allows | 
 | you to define a fastpath through the flowtable datapath. This infrastructure | 
 | also provides hardware offload support. The flowtable supports the layer 3 | 
 | IPv4 and IPv6 protocols and the layer 4 TCP and UDP protocols. | 
 |  | 
 | Overview | 
 | -------- | 
 |  | 
 | Once the first packet of the flow successfully goes through the IP forwarding | 
 | path, from the second packet on, you might decide to offload the flow to the | 
 | flowtable through your ruleset. The flowtable infrastructure provides a rule | 
 | action that allows you to specify when to add a flow to the flowtable. | 
 |  | 
 | A packet that finds a matching entry in the flowtable (i.e. a flowtable hit) is | 
 | transmitted to the output netdevice via neigh_xmit(). Such packets bypass the | 
 | classic IP forwarding path; the visible effect is that you do not see them | 
 | from any of the Netfilter hooks coming after ingress. If there is no matching | 
 | entry in the flowtable (i.e. a flowtable miss), the packet follows the classic | 
 | IP forwarding path. | 
 |  | 
 | The flowtable uses a resizable hashtable. Lookups are based on the following | 
 | n-tuple selectors: layer 2 protocol encapsulation (VLAN and PPPoE), layer 3 | 
 | source and destination addresses, layer 4 source and destination ports and the | 
 | input interface (useful in case there are several conntrack zones in place). | 
 |  | 
 | The 'flow add' action allows you to populate the flowtable: the user selectively | 
 | specifies which flows are placed into the flowtable. Hence, packets follow the | 
 | classic IP forwarding path unless the user explicitly instructs flows to use | 
 | this alternative forwarding path via policy. | 
 |  | 
 | The flowtable datapath is represented in Fig.1, which describes the classic IP | 
 | forwarding path including the Netfilter hooks and the flowtable fastpath bypass. | 
 |  | 
 | :: | 
 |  | 
 | 					 userspace process | 
 | 					  ^              | | 
 | 					  |              | | 
 | 				     _____|____     ____\/___ | 
 | 				    /          \   /         \ | 
 | 				    |   input   |  |  output  | | 
 | 				    \__________/   \_________/ | 
 | 					 ^               | | 
 | 					 |               | | 
 |       _________      __________      ---------     _____\/_____ | 
 |      /         \    /          \     |Routing |   /            \ | 
 |   -->  ingress  ---> prerouting ---> |decision|   | postrouting |--> neigh_xmit | 
 |      \_________/    \__________/     ----------   \____________/          ^ | 
 |        |      ^                          |               ^                | | 
 |    flowtable  |                     ____\/___            |                | | 
 |        |      |                    /         \           |                | | 
 |     __\/___   |                    | forward |------------                | | 
 |     |-----|   |                    \_________/                            | | 
 |     |-----|   |                 'flow offload' rule                       | | 
 |     |-----|   |                   adds entry to                           | | 
 |     |_____|   |                     flowtable                             | | 
 |        |      |                                                           | | 
 |       / \     |                                                           | | 
 |      /hit\_no_|                                                           | | 
 |      \ ? /                                                                | | 
 |       \ /                                                                 | | 
 |        |__yes_________________fastpath bypass ____________________________| | 
 |  | 
 | 	       Fig.1 Netfilter hooks and flowtable interactions | 
 |  | 
 | The flowtable entry also stores the NAT configuration, so all packets are | 
 | mangled according to the NAT policy that is specified from the classic IP | 
 | forwarding path. The TTL is decremented before calling neigh_xmit(). Fragmented | 
 | traffic is passed up to follow the classic IP forwarding path: the transport | 
 | header is missing, so flowtable lookups are not possible in this case. | 
 | TCP RST and FIN packets are also passed up to the classic IP forwarding path to | 
 | release the flow gracefully. Packets that exceed the MTU are also passed up to | 
 | the classic forwarding path to report packet-too-big ICMP errors to the sender. | 
 |  | 
 | Example configuration | 
 | --------------------- | 
 |  | 
 | Enabling the flowtable bypass is relatively easy: you only need to create a | 
 | flowtable and add one rule to your forward chain:: | 
 |  | 
 | 	table inet x { | 
 | 		flowtable f { | 
 | 			hook ingress priority 0; devices = { eth0, eth1 }; | 
 | 		} | 
 | 		chain y { | 
 | 			type filter hook forward priority 0; policy accept; | 
 | 			ip protocol tcp flow add @f | 
 | 			counter packets 0 bytes 0 | 
 | 		} | 
 | 	} | 
 |  | 
 | This example adds the flowtable 'f' to the ingress hook of the eth0 and eth1 | 
 | netdevices. You can create as many flowtables as you want in case you need to | 
 | perform resource partitioning. The flowtable priority defines the order in which | 
 | hooks are run in the pipeline. This is convenient if you already have an | 
 | nftables ingress chain: make sure the flowtable priority is smaller than the | 
 | priority of the nftables ingress chain so that the flowtable runs first. | 
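 |  | 
 | As a minimal sketch of this ordering, assume a netdev table 'z' with an ingress | 
 | chain 'c' on eth0 (both names are hypothetical and not part of the example | 
 | above). The flowtable keeps priority 0 and the ingress chain uses the larger | 
 | priority 10, so the flowtable lookup runs before the chain:: | 
 |  | 
 | 	table inet x { | 
 | 		flowtable f { | 
 | 			# smaller priority: runs first at ingress | 
 | 			hook ingress priority 0; devices = { eth0, eth1 }; | 
 | 		} | 
 | 	} | 
 | 	table netdev z { | 
 | 		chain c { | 
 | 			# larger priority: runs after the flowtable | 
 | 			type filter hook ingress device eth0 priority 10; policy accept; | 
 | 		} | 
 | 	} | 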
 |  | 
 | The 'flow add' action from the forward chain 'y' adds an entry to the | 
 | flowtable for the TCP SYN-ACK packet coming in the reply direction. Once the | 
 | flow is offloaded, you will observe that the counter rule in the example above | 
 | does not get updated for the packets that are being forwarded through the | 
 | forwarding bypass. | 
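 |  | 
 | The rule in the example matches IPv4 TCP traffic only. As a sketch, assuming | 
 | you want to offload TCP and UDP flows for both IPv4 and IPv6 from the same inet | 
 | table, a variation of the forward chain could match on the layer 4 protocol | 
 | instead:: | 
 |  | 
 | 	table inet x { | 
 | 		flowtable f { | 
 | 			hook ingress priority 0; devices = { eth0, eth1 }; | 
 | 		} | 
 | 		chain y { | 
 | 			type filter hook forward priority 0; policy accept; | 
 | 			# matches TCP and UDP regardless of the layer 3 protocol | 
 | 			meta l4proto { tcp, udp } flow add @f | 
 | 		} | 
 | 	} | 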
 |  | 
 | You can identify offloaded flows through the [OFFLOAD] tag when listing your | 
 | connection tracking table. | 
 |  | 
 | :: | 
 |  | 
 | 	# conntrack -L | 
 | 	tcp      6 src=10.141.10.2 dst=192.168.10.2 sport=52728 dport=5201 src=192.168.10.2 dst=192.168.10.1 sport=5201 dport=52728 [OFFLOAD] mark=0 use=2 | 
 |  | 
 |  | 
 | Layer 2 encapsulation | 
 | --------------------- | 
 |  | 
 | Since Linux kernel 5.13, the flowtable infrastructure discovers the real | 
 | netdevice behind VLAN and PPPoE netdevices. The flowtable software datapath | 
 | parses the VLAN and PPPoE layer 2 headers to extract the ethertype and the | 
 | VLAN ID / PPPoE session ID which are used for the flowtable lookups. The | 
 | flowtable datapath also deals with layer 2 decapsulation. | 
 |  | 
 | You do not need to add the PPPoE and the VLAN devices to your flowtable; | 
 | the real device is sufficient for the flowtable to track your flows. | 
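 |  | 
 | For illustration, assume a VLAN device eth0.100 stacked on top of eth0 (the | 
 | VLAN device name is hypothetical). A sketch of the flowtable definition then | 
 | lists only the real devices:: | 
 |  | 
 | 	table inet x { | 
 | 		flowtable f { | 
 | 			# eth0.100 is not listed, the real device eth0 is enough | 
 | 			hook ingress priority 0; devices = { eth0, eth1 }; | 
 | 		} | 
 | 	} | 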
 |  | 
 | Bridge and IP forwarding | 
 | ------------------------ | 
 |  | 
 | Since Linux kernel 5.13, you can add bridge ports to the flowtable. The | 
 | flowtable infrastructure discovers the topology behind the bridge device. This | 
 | allows the flowtable to define a fastpath bypass between the bridge ports | 
 | (represented as eth1 and eth2 in the example figure below) and the gateway | 
 | device (represented as eth0) in your switch/router. | 
 |  | 
 | :: | 
 |  | 
 |                       fastpath bypass | 
 |                .-------------------------. | 
 |               /                           \ | 
 |               |           IP forwarding   | | 
 |               |          /             \ \/ | 
 |               |       br0               eth0 ..... eth0 | 
 |               .       / \                          *host B* | 
 |                -> eth1  eth2 | 
 |                    .           *switch/router* | 
 |                    . | 
 |                    . | 
 |                  eth0 | 
 |                *host A* | 
 |  | 
 | The flowtable infrastructure also supports bridge VLAN filtering actions | 
 | such as PVID and untagged. You can also stack a classic VLAN device on top of | 
 | your bridge port. | 
 |  | 
 | If you want your flowtable to define a fastpath between your bridge ports and | 
 | your IP forwarding path, you have to add your bridge ports (as represented by | 
 | the real netdevice) to your flowtable definition. | 
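 |  | 
 | A minimal sketch for the topology in the figure above, assuming eth1 and eth2 | 
 | are the ports of br0 and eth0 is the gateway device. The table, flowtable and | 
 | chain names are reused from the earlier example for illustration only:: | 
 |  | 
 | 	table inet x { | 
 | 		flowtable f { | 
 | 			# bridge ports eth1 and eth2 are listed, br0 itself is not | 
 | 			hook ingress priority 0; devices = { eth0, eth1, eth2 }; | 
 | 		} | 
 | 		chain y { | 
 | 			type filter hook forward priority 0; policy accept; | 
 | 			ip protocol tcp flow add @f | 
 | 		} | 
 | 	} | 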
 |  | 
 | Counters | 
 | -------- | 
 |  | 
 | The flowtable can synchronize packet and byte counters with the existing | 
 | connection tracking entry by specifying the counter statement in your flowtable | 
 | definition, e.g. | 
 |  | 
 | :: | 
 |  | 
 | 	table inet x { | 
 | 		flowtable f { | 
 | 			hook ingress priority 0; devices = { eth0, eth1 }; | 
 | 			counter | 
 | 		} | 
 | 	} | 
 |  | 
 | Counter support is available since Linux kernel 5.7. | 
 |  | 
 | Hardware offload | 
 | ---------------- | 
 |  | 
 | If your network device provides hardware offload support, you can turn it on by | 
 | means of the 'offload' flag in your flowtable definition, e.g. | 
 |  | 
 | :: | 
 |  | 
 | 	table inet x { | 
 | 		flowtable f { | 
 | 			hook ingress priority 0; devices = { eth0, eth1 }; | 
 | 			flags offload; | 
 | 		} | 
 | 	} | 
 |  | 
 | There is a workqueue that adds the flows to the hardware. Note that a few | 
 | packets might still run over the flowtable software path until the workqueue has | 
 | a chance to offload the flow to the network device. | 
 |  | 
 | You can identify hardware offloaded flows through the [HW_OFFLOAD] tag when | 
 | listing your connection tracking table. Note that the [OFFLOAD] tag refers to | 
 | the software flowtable fastpath, whereas the [HW_OFFLOAD] tag indicates that the | 
 | flow is using the hardware offload datapath. | 
 |  | 
 | The flowtable hardware offload infrastructure also supports DSA (Distributed | 
 | Switch Architecture). | 
 |  | 
 | Limitations | 
 | ----------- | 
 |  | 
 | The flowtable behaves like a cache. The flowtable entries might get stale if | 
 | either the destination MAC address or the egress netdevice that is used for | 
 | transmission changes. | 
 |  | 
 | This might be a problem if: | 
 |  | 
 | - You run the flowtable in software mode and you combine bridge and IP | 
 |   forwarding in your setup. | 
 | - Hardware offload is enabled. | 
 |  | 
 | More reading | 
 | ------------ | 
 |  | 
 | This documentation is based on the LWN.net articles [1]_\ [2]_. Rafal Milecki | 
 | also wrote a comprehensive summary called "A state of network acceleration" | 
 | that describes how things were before this infrastructure was mainlined [3]_, | 
 | along with a rough summary of this work [4]_. | 
 |  | 
 | .. [1] https://lwn.net/Articles/738214/ | 
 | .. [2] https://lwn.net/Articles/742164/ | 
 | .. [3] http://lists.infradead.org/pipermail/lede-dev/2018-January/010830.html | 
 | .. [4] http://lists.infradead.org/pipermail/lede-dev/2018-January/010829.html |