merge-upstream/v5.10.65 from branch/tag: upstream/v5.10.65 into branch: cos-5.10 Changelog: ------------------------------------------------------------- Abhishek Naik (1): iwlwifi: skip first element in the WTAS ACPI table Ahmad Fatoum (1): brcmfmac: pcie: fix oops on failure to resume and reprobe Alexander Gordeev (1): s390/kasan: fix large PMD pages address alignment check Amit Engel (1): nvmet: pass back cntlid on successful completion Anand Moon (3): ARM: dts: meson8b: odroidc1: Fix the pwm regulator supply properties ARM: dts: meson8b: mxq: Fix the pwm regulator supply properties ARM: dts: meson8b: ec100: Fix the pwm regulator supply properties Andrey Ignatov (1): bpf: Fix possible out of bound write in narrow load handling Andrii Nakryiko (1): libbpf: Re-build libbpf.so when libbpf.map changes Andy Duan (1): tty: serial: fsl_lpuart: fix the wrong mapbase value Andy Shevchenko (1): leds: lt3593: Put fwnode in any case during ->probe() Austin Kim (1): IMA: remove -Wmissing-prototypes warning Aya Levin (1): net/mlx5: Register to devlink ingress VLAN filter trap Babu Moger (1): x86/resctrl: Fix a maybe-uninitialized build warning treated as error Ben Hutchings (1): crypto: omap - Fix inconsistent locking of device lists Benjamin Coddington (1): lockd: Fix invalid lockowner cast after vfs_test_lock Biju Das (1): arm64: dts: renesas: hihope-rzg2-ex: Add EtherAVB internal rx delay Bjorn Andersson (1): soc: qcom: rpmhpd: Use corner in power_off Bob Peterson (1): gfs2: init system threads before freeze lock Borislav Petkov (1): x86/mce: Defer processing of early errors Brett Creeley (1): ice: Only lock to update netdev dev_addr Cezary Rojewski (3): ASoC: Intel: kbl_da7219_max98927: Fix format selection for max98373 ASoC: Intel: Skylake: Leave data as is when invoking TLV IPCs ASoC: Intel: Skylake: Fix module resource and format selection Chen-Yu Tsai (3): irqchip/gic-v3: Fix priority comparison when non-secure priorities are used regulator: vctrl: Use locked regulator_get_voltage in probe path regulator: vctrl: Avoid lockdep warning in enable/disable ops Chih-Kang Chang (1): mac80211: Fix insufficient headroom issue for AMSDU Christoph Hellwig (1): bcache: add proper error unwinding in bcache_device_init Christophe JAILLET (9): spi: coldfire-qspi: Use clk_disable_unprepare in the remove function media: cxd2880-spi: Fix an error handling path drm/msm/dsi: Fix some reference counted resource leaks firmware: raspberrypi: Fix a leak in 'rpi_firmware_get()' usb: bdc: Fix an error handling path in 'bdc_probe()' when no suitable DMA config is available usb: bdc: Fix a resource leak in the error handling path of 'bdc_probe()' ASoC: wcd9335: Fix a double irq free in the remove function ASoC: wcd9335: Fix a memory leak in the error handling path of the probe function ASoC: wcd9335: Disable irq on slave ports in the remove function Chunguang Xu (1): blk-throtl: optimize IOPS throttle for large IO scenarios Chunyan Zhang (1): spi: sprd: Fix the wrong WDG_LOAD_VAL Claudiu Beznea (1): ARM: dts: at91: add pinctrl-{names, 0} for all gpios Colin Ian King (4): gfs2: Fix memory leak of object lsi on error return path 6lowpan: iphc: Fix an off-by-one check of array index media: venus: venc: Fix potential null pointer dereference on pointer fmt Bluetooth: increase BTNAMSIZ to 21 chars to fix potential buffer overflow Curtis Malainey (1): ASoC: Intel: Fix platform ID matching Damien Le Moal (1): libata: fix ata_host_start() Dan Carpenter (5): media: rockchip/rga: fix error handling in probe Bluetooth: sco: prevent information leak in sco_conn_defer_accept() rsi: fix error code in rsi_load_9116_firmware() rsi: fix an error code in rsi_probe() ath6kl: wmi: fix an error code in ath6kl_wmi_sync_point() Daniel Thompson (1): backlight: pwm_bl: Improve bootloader/kernel device handover David Heidelberg (2): drm/msm/mdp4: refactor HW revision detection into read_mdp_hw_revision drm/msm/mdp4: move HW revision detection to earlier phase Desmond Cheong Zhi Xi (2): fcntl: fix potential deadlock for &fasync_struct.fa_lock Bluetooth: fix repeated calls to sco_sock_kill Dietmar Eggemann (1): sched/deadline: Fix missing clock update in migrate_task_rq_dl() Dmitry Baryshkov (1): drm/msm/dpu: make dpu_hw_ctl_clear_all_blendstages clear necessary LMs Dmitry Osipenko (2): regulator: tps65910: Silence deferred probe error power: supply: smb347-charger: Add missing pin control activation Dongliang Mu (4): media: dvb-usb: fix uninit-value in dvb_usb_adapter_dvb_init media: dvb-usb: fix uninit-value in vp702x_read_mac_addr media: dvb-usb: Fix error handling in dvb_usb_i2c_init media: em28xx-input: fix refcount bug in em28xx_usb_disconnect Douglas Anderson (2): ASoC: rt5682: Properly turn off regulators if wrong device ID ASoC: rt5682: Remove unused variable in rt5682_i2c_remove() Dylan Hung (1): ARM: dts: aspeed-g6: Fix HVI3C function-group in pinctrl dtsi Emmanuel Grumbach (1): iwlwifi: follow the new inclusive terminology Eric Biggers (1): blk-crypto: fix check for too-large dun_bytes Eric Dumazet (3): ipv6: make exception cache less predictible ipv4: make exception cache less predictible ipv4: fix endianness issue in inet_rtm_getroute_build_skb() Evgeny Novikov (1): usb: ehci-orion: Handle errors of clk_prepare_enable() in probe Frederic Weisbecker (1): posix-cpu-timers: Force next expiration recalc after itimer reset Geert Uytterhoeven (5): m68k: Fix invalid RMW_INSNS on CPUs that lack CAS soc: rockchip: ROCKCHIP_GRF should not default to y, unconditionally arm64: dts: renesas: r8a77995: draak: Remove bogus adv7511w properties arm64: dts: renesas: rzg2: Convert EtherAVB to explicit delay handling usb: gadget: udc: renesas_usb3: Fix soc_device_match() abuse Giovanni Cabiddu (4): crypto: qat - do not ignore errors from enable_vf2pf_comms() crypto: qat - handle both source of interrupt in VF ISR crypto: qat - do not export adf_iov_putmsg() crypto: qat - use proper type for vf_mask Greg Kroah-Hartman (1): Linux 5.10.65 Haiyue Wang (1): gve: fix the wrong AdminQ buffer overflow check Halil Pasic (1): KVM: s390: index kvm->arch.idle_mask by vcpu_idx Hans de Goede (2): power: supply: axp288_fuel_gauge: Report register-address on readb / writeb errors leds: trigger: audio: Add an activate callback to ensure the initial brightness is set Harald Freudenberger (2): s390/zcrypt: fix wrong offset index for APKA master key valid state s390/ap: fix state machine hang after failure to enable irq Harshvardhan Jha (1): drm/gma500: Fix end of loop tests for list_for_each_entry He Fengqing (1): bpf: Fix potential memleak and UAF in the verifier. Hongbo Li (1): lib/mpi: use kcalloc in mpi_resize Huacai Chen (1): irqchip/loongson-pch-pic: Improve edge triggered interrupt support Ilya Leoshkevich (1): selftests/bpf: Fix test_core_autosize on big-endian machines J. Bruce Fields (1): nfsd4: Fix forced-expiry locking Jaegeuk Kim (1): f2fs: guarantee to write dirty data when enabling checkpoint back Jan Kara (1): udf: Check LVID earlier Jens Axboe (1): io_uring: IORING_OP_WRITE needs hash_reg_file set Jeongtae Park (1): regmap: fix the offset of register error log Jiapeng Chong (2): leds: is31fl32xx: Fix missing error code in is31fl32xx_parse_dt() net/mlx5: Fix missing return value in mlx5_devlink_eswitch_inline_mode_set() Jose Blanquicet (1): selftests/bpf: Fix bpf-iter-tcp4 test to print correctly the dest IP Juhee Kang (1): samples: pktgen: add missing IPv6 option to pktgen scripts Julia Lawall (1): drm/of: free the right object Justin M. Forbes (1): iwlwifi Add support for ax201 in Samsung Galaxy Book Flex2 Alpha Kai-Heng Feng (2): drm/amdgpu/acp: Make PM domain really work Bluetooth: Move shutdown callback before flushing tx and rx queue Kevin Mitchell (1): lkdtm: replace SCSI_DISPATCH_CMD with SCSI_QUEUE_RQ Kim Phillips (1): perf/x86/amd/ibs: Extend PERF_PMU_CAP_NO_EXCLUDE to IBS Op Krzysztof Hałasa (1): media: TDA1997x: enable EDID support Krzysztof Kozlowski (1): arm64: dts: exynos: correct GIC CPU interfaces address range on Exynos7 Kuniyuki Iwashima (1): bpf: Fix a typo of reuseport map in bpf.h. Len Baker (1): CIFS: Fix a potencially linear read overflow Leon Romanovsky (3): ionic: cleanly release devlink instance devlink: Break parameter notification sequence to be before/after unload/load driver devlink: Clear whole devlink_flash_notify struct Linus Walleij (1): clk: kirkwood: Fix a clocking boot regression Lukas Bulwahn (1): clk: staging: correct reference to config IOMEM to config HAS_IOMEM Lukas Hannen (1): time: Handle negative seconds correctly in timespec64_to_ns() Lukasz Luba (1): PM: EM: Increase energy calculation precision Marco Chiappero (2): crypto: qat - fix reuse of completion variable crypto: qat - fix naming for init/shutdown VF to PF notifications Marek Vasut (3): drm: mxsfb: Enable recovery on underflow drm: mxsfb: Increase number of outstanding requests on V4 and newer HW drm: mxsfb: Clear FIFO_CLEAR bit Martin Blumenstingl (1): ARM: dts: meson8: Use a higher default GPU clock frequency Martin KaFai Lau (1): tcp: seq_file: Avoid skipping sk during tcp_seek_last_pos Martynas Pumputis (1): libbpf: Fix removal of inner map in bpf_object__create_map Matija Glavinic Pecotic (1): spi: davinci: invoke chipselect callback Matthew Cover (1): bpf, samples: Add missing mprog-disable to xdp_redirect_cpu's optstring Mauro Carvalho Chehab (1): media: rockchip/rga: use pm_runtime_resume_and_get() Maxim Levitsky (1): KVM: VMX: avoid running vmx_handle_exit_irqoff in case of emulation Maxim Mikityanskiy (2): net/mlx5e: Prohibit inner indir TIRs in IPoIB net/mlx5e: Block LRO if firmware asks for tunneled LRO Mika Penttilä (1): sched/numa: Fix is_core_idle() Miklos Szeredi (2): fuse: truncate pagecache on atomic_o_trunc fuse: flush extending writes Ming Lei (1): block: return ELEVATOR_DISCARD_MERGE if possible Nadezda Lutovinova (1): usb: gadget: mv_u3d: request_irq() after initializing UDC Nguyen Dinh Phi (1): tty: Fix data race between tiocsti() and flush_to_ldisc() Nicolas Saenz Julienne (1): firmware: raspberrypi: Keep count of all consumers Niklas Schnelle (1): s390/pci: fix misleading rc in clp_set_pci_fn() Pali Rohár (3): udf: Fix iocharset=utf8 mount option isofs: joliet: Fix iocharset=utf8 mount option arm64: dts: marvell: armada-37xx: Extend PCIe MEM space Parav Pandit (1): net/mlx5: Fix unpublish devlink parameters Paul E. McKenney (1): rcu: Add lockdep_assert_irqs_disabled() to rcu_sched_clock_irq() and callees Pavel Begunkov (1): bio: fix page leak bio_add_hw_page failure Pavel Skripkin (6): m68k: emu: Fix invalid free in nfeth_cleanup() block: nbd: add sanity check for first_minor media: go7007: fix memory leak in go7007_usb_probe media: go7007: remove redundant initialization net: cipso: fix warnings in netlbl_cipsov4_add_std Bluetooth: add timeout sanity check to hci_inquiry Peter Oberparleiter (2): s390/debug: keep debug data on resize s390/debug: fix debug area life cycle Peter Robinson (1): power: supply: cw2015: use dev_err_probe to allow deferred probe Peter Zijlstra (2): locking/mutex: Fix HANDOFF condition locking/lockdep: Mark local_lock_t Philipp Zabel (1): media: coda: fix frame_mem_ctrl for YUV420 and YVU420 formats Phong Hoang (1): clocksource/drivers/sh_cmt: Fix wrong setting if don't request IRQ for clock source channel Qiuxu Zhuo (1): EDAC/i10nm: Fix NVDIMM detection Quanyang Wang (1): spi: spi-zynq-qspi: use wait_for_completion_timeout to make zynq_qspi_exec_mem_op not interruptible Quentin Perret (2): sched/deadline: Fix reset_on_fork reporting of DL tasks sched: Fix UCLAMP_FLAG_IDLE setting Rafael J. Wysocki (2): PCI: PM: Avoid forcing PCI_D0 for wakeup reasons inconsistently PCI: PM: Enable PME if it can be signaled from D3cold Ruozhu Li (2): nvme-tcp: don't update queue count when failing to set io queues nvme-rdma: don't update queue count when failing to set io queues Sean Anderson (1): crypto: mxs-dcp - Check for DMA mapping errors Sean Christopherson (2): Revert "KVM: x86: mmu: Add guest physical address check in translate_gpa()" KVM: nVMX: Unconditionally clear nested.pi_pending on nested VM-Enter Sebastian Krzyszkowiak (1): power: supply: max17042_battery: fix typo in MAx17042_TOFF Sergey Senozhatsky (1): rcu/tree: Handle VM stoppage in stall detection Sergey Shtylyov (15): i2c: highlander: add IRQ check usb: dwc3: meson-g12a: add IRQ check usb: dwc3: qcom: add IRQ check usb: gadget: udc: at91: add IRQ check usb: gadget: udc: s3c2410: add IRQ check usb: phy: fsl-usb: add IRQ check usb: phy: twl6030: add IRQ checks usb: host: ohci-tmio: add IRQ check usb: phy: tahvo: add IRQ check i2c: synquacer: fix deferred probing i2c: iop3xx: fix deferred probing i2c: s3c2410: fix IRQ check i2c: hix5hd2: fix IRQ check i2c: mt65xx: fix IRQ check i2c: xlp9xx: fix main IRQ check Shuyi Cheng (1): libbpf: Fix the possible memory leak on error Smita Koralahalli (1): EDAC/mce_amd: Do not load edac_mce_amd module on guests Stefan Assmann (1): i40e: improve locking of mac_filter_hash Stefan Berger (2): certs: Trigger creation of RSA module signing key if it's not an RSA key tpm: ibmvtpm: Avoid error message when process gets signal while waiting Stefan Wahren (1): net: qualcomm: fix QCA7000 checksum handling Stephan Gerhold (1): soc: qcom: smsm: Fix missed interrupts if state changes while masked Stephen Boyd (1): ASoC: rt5682: Implement remove callback Steven Price (1): drm/of: free the iterator object on failure Stian Skjelstad (1): udf_get_extendedattr() had no boundary checks. Subbaraya Sundeep (2): octeontx2-af: Fix loop in free and unmap counter octeontx2-af: Fix static code analyzer reported issues Sudarsana Reddy Kalluru (1): atlantic: Fix driver resume flow. Sunil Goutham (1): octeontx2-af: Set proper errorcode for IPv4 checksum errors Sven Eckelmann (1): debugfs: Return error during {full/open}_proxy_open() on rmmod THOBY Simon (1): IMA: remove the dependency on CRYPTO_MD5 Tedd Ho-Jeong An (1): Bluetooth: mgmt: Fix wrong opcode in the response for add_adv cmd Tetsuo Handa (1): fbmem: don't allow too huge resolutions Thomas Gleixner (3): hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns() hrtimer: Ensure timerfd notification for HIGHRES=n locking/local_lock: Add missing owner initialization Tian Tao (1): i2c: fix platform_get_irq.cocci warnings Tony Lindgren (6): crypto: omap-sham - clear dma flags only after omap_sham_update_dma_stop() spi: spi-fsl-dspi: Fix issue with uninitialized dma_slave_config spi: spi-pic32: Fix issue with uninitialized dma_slave_config mmc: sdhci: Fix issue with uninitialized dma_slave_config mmc: dw_mmc: Fix issue with uninitialized dma_slave_config mmc: moxart: Fix issue with uninitialized dma_slave_config Valentin Schneider (1): PM: cpu: Make notifier chain use a raw_spinlock_t Vineeth Vijayan (1): s390/cio: add dev_busid sysfs entry for each subchannel Waiman Long (3): cgroup/cpuset: Fix a partition bug with hotplug cgroup/cpuset: Miscellaneous code cleanup cgroup/cpuset: Fix violation of cpuset locking rule Wei Yongjun (1): drm/panfrost: Fix missing clk_disable_unprepare() on error in panfrost_clk_init() William Breathitt Gray (1): counter: 104-quad-8: Return error when invalid mode during ceiling_write Xiyu Yang (1): net: sched: Fix qdisc_rate_table refcount leak when get tcf_block failed Xu Yu (1): mm/swap: consider max pages in iomap_swapfile_add_extent Yanfei Xu (2): rcu: Fix to include first blocked task in stall warning rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock Yizhuo (1): media: atomisp: fix the uninitialized use and rename "retvalue" Zelin Deng (1): KVM: x86: Update vCPU's hv_clock before back to guest when tsc_offset is adjusted Zenghui Yu (1): bcma: Fix memory leak for internally-handled cores Zhang Qilong (1): ASoC: mediatek: mt8183: Fix Unbalanced pm_runtime_enable in mt8183_afe_pcm_dev_probe Zhen Lei (3): genirq/timings: Fix error return code in irq_timings_test_irqs() firmware: fix theoretical UAF race with firmware cache and resume driver core: Fix error return code in really_probe() BUG=b/200002173 TEST=tryjob, validation and K8s e2e RELEASE_NOTE=Updated the Linux kernel to v5.10.65. Signed-off-by: COS Kernel Merge Bot <cloud-image-merge-automation@prod.google.com> Change-Id: Ib9d744bd17d0cf92e78b29ef072157c57439d09a

commit: 80b9fac18917cdba3d0f9147f85e2bcea3749927 [log] [tgz]
author: COS Kernel Merge Bot <cloud-image-merge-automation@prod.google.com> Wed Sep 15 01:57:34 2021 -0700
committer: COS Kernel Merge Bot <cloud-image-merge-automation@prod.google.com> Wed Sep 15 01:57:34 2021 -0700
tree: f279a8a6a8d5af3c508bec2a4a5a2e611335cd8b
parent: c6be0351b9fe8770372e090885c83f50af4e03ca [diff]
parent: c31c2cca229aa5280d108618bb264c713840a4c2 [diff]
diff --git a/Documentation/block/queue-sysfs.rst b/Documentation/block/queue-sysfs.rst
index 2638d34..4874c52 100644
--- a/Documentation/block/queue-sysfs.rst
+++ b/Documentation/block/queue-sysfs.rst

@@ -273,4 +273,13 @@
 do not support zone commands, they will be treated as regular block devices
 and zoned will report "none".
 
+split_alignment (RW)
+----------------------
+This is the alignment in bytes sectors at which the requeust queue is
+allowed to split IO requests. Once this value is set, the requeust
+queue splits IOs such that the individual IOs are aligned to
+split_alignment. The value of 0 indicates that an IO request can be
+split anywhere. This value must be a power of 2 and greater than or
+equal to 512.
+
 Jens Axboe <jens.axboe@oracle.com>, February 2009

diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
index 849d5b1..5fad388 100644
--- a/Documentation/filesystems/ext4/journal.rst
+++ b/Documentation/filesystems/ext4/journal.rst

@@ -4,14 +4,14 @@
 --------------
 
 Introduced in ext3, the ext4 filesystem employs a journal to protect the
-filesystem against corruption in the case of a system crash. A small
-continuous region of disk (default 128MiB) is reserved inside the
-filesystem as a place to land “important” data writes on-disk as quickly
-as possible. Once the important data transaction is fully written to the
-disk and flushed from the disk write cache, a record of the data being
-committed is also written to the journal. At some later point in time,
-the journal code writes the transactions to their final locations on
-disk (this could involve a lot of seeking or a lot of small
+filesystem against metadata inconsistencies in the case of a system crash. Up
+to 10,240,000 file system blocks (see man mke2fs(8) for more details on journal
+size limits) can be reserved inside the filesystem as a place to land
+“important” data writes on-disk as quickly as possible. Once the important
+data transaction is fully written to the disk and flushed from the disk write
+cache, a record of the data being committed is also written to the journal. At
+some later point in time, the journal code writes the transactions to their
+final locations on disk (this could involve a lot of seeking or a lot of small
 read-write-erases) before erasing the commit record. Should the system
 crash during the second slow write, the journal can be replayed all the
 way to the latest commit record, guaranteeing the atomicity of whatever
@@ -681,3 +681,76 @@
      - Stores the TID of the commit, CRC of the fast commit of which this tag
        represents the end of
 
+Fast Commit Replay Idempotence
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Fast commits tags are idempotent in nature provided the recovery code follows
+certain rules. The guiding principle that the commit path follows while
+committing is that it stores the result of a particular operation instead of
+storing the procedure.
+
+Let's consider this rename operation: 'mv /a /b'. Let's assume dirent '/a'
+was associated with inode 10. During fast commit, instead of storing this
+operation as a procedure "rename a to b", we store the resulting file system
+state as a "series" of outcomes:
+
+- Link dirent b to inode 10
+- Unlink dirent a
+- Inode 10 with valid refcount
+
+Now when recovery code runs, it needs "enforce" this state on the file
+system. This is what guarantees idempotence of fast commit replay.
+
+Let's take an example of a procedure that is not idempotent and see how fast
+commits make it idempotent. Consider following sequence of operations:
+
+1) rm A
+2) mv B A
+3) read A
+
+If we store this sequence of operations as is then the replay is not idempotent.
+Let's say while in replay, we crash after (2). During the second replay,
+file A (which was actually created as a result of "mv B A" operation) would get
+deleted. Thus, file named A would be absent when we try to read A. So, this
+sequence of operations is not idempotent. However, as mentioned above, instead
+of storing the procedure fast commits store the outcome of each procedure. Thus
+the fast commit log for above procedure would be as follows:
+
+(Let's assume dirent A was linked to inode 10 and dirent B was linked to
+inode 11 before the replay)
+
+1) Unlink A
+2) Link A to inode 11
+3) Unlink B
+4) Inode 11
+
+If we crash after (3) we will have file A linked to inode 11. During the second
+replay, we will remove file A (inode 11). But we will create it back and make
+it point to inode 11. We won't find B, so we'll just skip that step. At this
+point, the refcount for inode 11 is not reliable, but that gets fixed by the
+replay of last inode 11 tag. Thus, by converting a non-idempotent procedure
+into a series of idempotent outcomes, fast commits ensured idempotence during
+the replay.
+
+Journal Checkpoint
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Checkpointing the journal ensures all transactions and their associated buffers
+are submitted to the disk. In-progress transactions are waited upon and included
+in the checkpoint. Checkpointing is used internally during critical updates to
+the filesystem including journal recovery, filesystem resizing, and freeing of
+the journal_t structure.
+
+A journal checkpoint can be triggered from userspace via the ioctl
+EXT4_IOC_CHECKPOINT. This ioctl takes a single, u64 argument for flags.
+Currently, three flags are supported. First, EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN
+can be used to verify input to the ioctl. It returns error if there is any
+invalid input, otherwise it returns success without performing
+any checkpointing. This can be used to check whether the ioctl exists on a
+system and to verify there are no issues with arguments or flags. The
+other two flags are EXT4_IOC_CHECKPOINT_FLAG_DISCARD and
+EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT. These flags cause the journal blocks to be
+discarded or zero-filled, respectively, after the journal checkpoint is
+complete. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
+cannot both be set. The ioctl may be useful when snapshotting a system or for
+complying with content deletion SLOs.

diff --git a/Documentation/networking/filter.rst b/Documentation/networking/filter.rst
index debb59e..1583d59 100644
--- a/Documentation/networking/filter.rst
+++ b/Documentation/networking/filter.rst

@@ -1006,13 +1006,13 @@
 
 Mode modifier is one of::
 
-  BPF_IMM  0x00  /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
-  BPF_ABS  0x20
-  BPF_IND  0x40
-  BPF_MEM  0x60
-  BPF_LEN  0x80  /* classic BPF only, reserved in eBPF */
-  BPF_MSH  0xa0  /* classic BPF only, reserved in eBPF */
-  BPF_XADD 0xc0  /* eBPF only, exclusive add */
+  BPF_IMM     0x00  /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
+  BPF_ABS     0x20
+  BPF_IND     0x40
+  BPF_MEM     0x60
+  BPF_LEN     0x80  /* classic BPF only, reserved in eBPF */
+  BPF_MSH     0xa0  /* classic BPF only, reserved in eBPF */
+  BPF_ATOMIC  0xc0  /* eBPF only, atomic operations */
 
 eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and
 (BPF_IND | <size> | BPF_LD) which are used to access packet data.
@@ -1044,11 +1044,19 @@
     BPF_MEM | <size> | BPF_STX:  *(size *) (dst_reg + off) = src_reg
     BPF_MEM | <size> | BPF_ST:   *(size *) (dst_reg + off) = imm32
     BPF_MEM | <size> | BPF_LDX:  dst_reg = *(size *) (src_reg + off)
-    BPF_XADD | BPF_W  | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg
-    BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg
 
-Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
-2 byte atomic increments are not supported.
+Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW.
+
+It also includes atomic operations, which use the immediate field for extra
+encoding.
+
+   .imm = BPF_ADD, .code = BPF_ATOMIC | BPF_W  | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg
+   .imm = BPF_ADD, .code = BPF_ATOMIC | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg
+
+Note that 1 and 2 byte atomic operations are not supported.
+
+You may encounter BPF_XADD - this is a legacy name for BPF_ATOMIC, referring to
+the exclusive-add operation encoded when the immediate field is zero.
 
 eBPF has one 16-byte instruction: BPF_LD | BPF_DW | BPF_IMM which consists
 of two consecutive ``struct bpf_insn`` 8-byte blocks and interpreted as single

diff --git a/PRESUBMIT.cfg b/PRESUBMIT.cfg
new file mode 100644
index 0000000..0d7269b
--- /dev/null
+++ b/PRESUBMIT.cfg

@@ -0,0 +1,11 @@
+[Hook Overrides]
+aosp_license_check: false
+cros_license_check: false
+long_line_check: false
+stray_whitespace_check: false
+tab_check: false
+tabbed_indent_required_check: false
+signoff_check: true
+
+# Make sure RELEASE_NOTE field is present.
+release_note_field_check: true

diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
index ce8b043..a951276 100644
--- a/arch/arm/net/bpf_jit_32.c
+++ b/arch/arm/net/bpf_jit_32.c

@@ -1623,10 +1623,9 @@
 		}
 		emit_str_r(dst_lo, tmp2, off, ctx, BPF_SIZE(code));
 		break;
-	/* STX XADD: lock *(u32 *)(dst + off) += src */
-	case BPF_STX | BPF_XADD | BPF_W:
-	/* STX XADD: lock *(u64 *)(dst + off) += src */
-	case BPF_STX | BPF_XADD | BPF_DW:
+	/* Atomic ops */
+	case BPF_STX | BPF_ATOMIC | BPF_W:
+	case BPF_STX | BPF_ATOMIC | BPF_DW:
 		goto notyet;
 	/* STX: *(size *)(dst + off) = src */
 	case BPF_STX | BPF_MEM | BPF_W:

diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index 345066b..5a876af 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c

@@ -888,10 +888,18 @@
 		}
 		break;
 
-	/* STX XADD: lock *(u32 *)(dst + off) += src */
-	case BPF_STX | BPF_XADD | BPF_W:
-	/* STX XADD: lock *(u64 *)(dst + off) += src */
-	case BPF_STX | BPF_XADD | BPF_DW:
+	case BPF_STX | BPF_ATOMIC | BPF_W:
+	case BPF_STX | BPF_ATOMIC | BPF_DW:
+		if (insn->imm != BPF_ADD) {
+			pr_err_once("unknown atomic op code %02x\n", insn->imm);
+			return -EINVAL;
+		}
+
+		/* STX XADD: lock *(u32 *)(dst + off) += src
+		 * and
+		 * STX XADD: lock *(u64 *)(dst + off) += src
+		 */
+
 		if (!off) {
 			reg = dst;
 		} else {

diff --git a/arch/mips/net/ebpf_jit.c b/arch/mips/net/ebpf_jit.c
index b31b91e..3a73e93 100644
--- a/arch/mips/net/ebpf_jit.c
+++ b/arch/mips/net/ebpf_jit.c

@@ -1426,8 +1426,8 @@
 	case BPF_STX | BPF_H | BPF_MEM:
 	case BPF_STX | BPF_W | BPF_MEM:
 	case BPF_STX | BPF_DW | BPF_MEM:
-	case BPF_STX | BPF_W | BPF_XADD:
-	case BPF_STX | BPF_DW | BPF_XADD:
+	case BPF_STX | BPF_W | BPF_ATOMIC:
+	case BPF_STX | BPF_DW | BPF_ATOMIC:
 		if (insn->dst_reg == BPF_REG_10) {
 			ctx->flags |= EBPF_SEEN_FP;
 			dst = MIPS_R_SP;
@@ -1441,7 +1441,12 @@
 		src = ebpf_to_mips_reg(ctx, insn, src_reg_no_fp);
 		if (src < 0)
 			return src;
-		if (BPF_MODE(insn->code) == BPF_XADD) {
+		if (BPF_MODE(insn->code) == BPF_ATOMIC) {
+			if (insn->imm != BPF_ADD) {
+				pr_err("ATOMIC OP %02x NOT HANDLED\n", insn->imm);
+				return -EINVAL;
+			}
+
 			/*
 			 * If mem_off does not fit within the 9 bit ll/sc
 			 * instruction immediate field, use a temp reg.

diff --git a/arch/powerpc/net/bpf_jit_comp64.c b/arch/powerpc/net/bpf_jit_comp64.c
index 658ca2b..6753754 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c

@@ -689,10 +689,18 @@
 			break;
 
 		/*
-		 * BPF_STX XADD (atomic_add)
+		 * BPF_STX ATOMIC (atomic ops)
 		 */
-		/* *(u32 *)(dst + off) += src */
-		case BPF_STX | BPF_XADD | BPF_W:
+		case BPF_STX | BPF_ATOMIC | BPF_W:
+			if (insn->imm != BPF_ADD) {
+				pr_err_ratelimited(
+					"eBPF filter atomic op code %02x (@%d) unsupported\n",
+					code, i);
+				return -ENOTSUPP;
+			}
+
+			/* *(u32 *)(dst + off) += src */
+
 			/* Get EA into TMP_REG_1 */
 			EMIT(PPC_RAW_ADDI(b2p[TMP_REG_1], dst_reg, off));
 			tmp_idx = ctx->idx * 4;
@@ -705,8 +713,15 @@
 			/* we're done if this succeeded */
 			PPC_BCC_SHORT(COND_NE, tmp_idx);
 			break;
-		/* *(u64 *)(dst + off) += src */
-		case BPF_STX | BPF_XADD | BPF_DW:
+		case BPF_STX | BPF_ATOMIC | BPF_DW:
+			if (insn->imm != BPF_ADD) {
+				pr_err_ratelimited(
+					"eBPF filter atomic op code %02x (@%d) unsupported\n",
+					code, i);
+				return -ENOTSUPP;
+			}
+			/* *(u64 *)(dst + off) += src */
+
 			EMIT(PPC_RAW_ADDI(b2p[TMP_REG_1], dst_reg, off));
 			tmp_idx = ctx->idx * 4;
 			EMIT(PPC_RAW_LDARX(b2p[TMP_REG_2], 0, b2p[TMP_REG_1], 0));

diff --git a/arch/riscv/net/bpf_jit_comp32.c b/arch/riscv/net/bpf_jit_comp32.c
index f300f93..e649742 100644
--- a/arch/riscv/net/bpf_jit_comp32.c
+++ b/arch/riscv/net/bpf_jit_comp32.c

@@ -881,7 +881,7 @@
 	const s8 *rd = bpf_get_reg64(dst, tmp1, ctx);
 	const s8 *rs = bpf_get_reg64(src, tmp2, ctx);
 
-	if (mode == BPF_XADD && size != BPF_W)
+	if (mode == BPF_ATOMIC && size != BPF_W)
 		return -1;
 
 	emit_imm(RV_REG_T0, off, ctx);
@@ -899,7 +899,7 @@
 		case BPF_MEM:
 			emit(rv_sw(RV_REG_T0, 0, lo(rs)), ctx);
 			break;
-		case BPF_XADD:
+		case BPF_ATOMIC: /* Only BPF_ADD supported */
 			emit(rv_amoadd_w(RV_REG_ZERO, lo(rs), RV_REG_T0, 0, 0),
 			     ctx);
 			break;
@@ -1264,7 +1264,6 @@
 	case BPF_STX | BPF_MEM | BPF_H:
 	case BPF_STX | BPF_MEM | BPF_W:
 	case BPF_STX | BPF_MEM | BPF_DW:
-	case BPF_STX | BPF_XADD | BPF_W:
 		if (BPF_CLASS(code) == BPF_ST) {
 			emit_imm32(tmp2, imm, ctx);
 			src = tmp2;
@@ -1275,8 +1274,21 @@
 			return -1;
 		break;
 
+	case BPF_STX | BPF_ATOMIC | BPF_W:
+		if (insn->imm != BPF_ADD) {
+			pr_info_once(
+				"bpf-jit: not supported: atomic operation %02x ***\n",
+				insn->imm);
+			return -EFAULT;
+		}
+
+		if (emit_store_r64(dst, src, off, ctx, BPF_SIZE(code),
+				   BPF_MODE(code)))
+			return -1;
+		break;
+
 	/* No hardware support for 8-byte atomics in RV32. */
-	case BPF_STX | BPF_XADD | BPF_DW:
+	case BPF_STX | BPF_ATOMIC | BPF_DW:
 		/* Fallthrough. */
 
 notsupported:

diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c
index c113ae8..8ce394f 100644
--- a/arch/riscv/net/bpf_jit_comp64.c
+++ b/arch/riscv/net/bpf_jit_comp64.c

@@ -1031,10 +1031,18 @@
 		emit_add(RV_REG_T1, RV_REG_T1, rd, ctx);
 		emit_sd(RV_REG_T1, 0, rs, ctx);
 		break;
-	/* STX XADD: lock *(u32 *)(dst + off) += src */
-	case BPF_STX | BPF_XADD | BPF_W:
-	/* STX XADD: lock *(u64 *)(dst + off) += src */
-	case BPF_STX | BPF_XADD | BPF_DW:
+	case BPF_STX | BPF_ATOMIC | BPF_W:
+	case BPF_STX | BPF_ATOMIC | BPF_DW:
+		if (insn->imm != BPF_ADD) {
+			pr_err("bpf-jit: not supported: atomic operation %02x ***\n",
+			       insn->imm);
+			return -EINVAL;
+		}
+
+		/* atomic_add: lock *(u32 *)(dst + off) += src
+		 * atomic_add: lock *(u64 *)(dst + off) += src
+		 */
+
 		if (off) {
 			if (is_12b_int(off)) {
 				emit_addi(RV_REG_T1, rd, off, ctx);

diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c
index dee01d3..9e3363d 100644
--- a/arch/s390/net/bpf_jit_comp.c
+++ b/arch/s390/net/bpf_jit_comp.c

@@ -1210,18 +1210,23 @@
 		jit->seen |= SEEN_MEM;
 		break;
 	/*
-	 * BPF_STX XADD (atomic_add)
+	 * BPF_ATOMIC
 	 */
-	case BPF_STX | BPF_XADD | BPF_W: /* *(u32 *)(dst + off) += src */
-		/* laal %w0,%src,off(%dst) */
-		EMIT6_DISP_LH(0xeb000000, 0x00fa, REG_W0, src_reg,
-			      dst_reg, off);
-		jit->seen |= SEEN_MEM;
-		break;
-	case BPF_STX | BPF_XADD | BPF_DW: /* *(u64 *)(dst + off) += src */
-		/* laalg %w0,%src,off(%dst) */
-		EMIT6_DISP_LH(0xeb000000, 0x00ea, REG_W0, src_reg,
-			      dst_reg, off);
+	case BPF_STX | BPF_ATOMIC | BPF_DW:
+	case BPF_STX | BPF_ATOMIC | BPF_W:
+		if (insn->imm != BPF_ADD) {
+			pr_err("Unknown atomic operation %02x\n", insn->imm);
+			return -1;
+		}
+
+		/* *(u32/u64 *)(dst + off) += src
+		 *
+		 * BFW_W:  laal  %w0,%src,off(%dst)
+		 * BPF_DW: laalg %w0,%src,off(%dst)
+		 */
+		EMIT6_DISP_LH(0xeb000000,
+			      BPF_SIZE(insn->code) == BPF_W ? 0x00fa : 0x00ea,
+			      REG_W0, src_reg, dst_reg, off);
 		jit->seen |= SEEN_MEM;
 		break;
 	/*

diff --git a/arch/sparc/net/bpf_jit_comp_64.c b/arch/sparc/net/bpf_jit_comp_64.c
index fef7344..9a2f20c 100644
--- a/arch/sparc/net/bpf_jit_comp_64.c
+++ b/arch/sparc/net/bpf_jit_comp_64.c

@@ -1369,12 +1369,18 @@
 		break;
 	}
 
-	/* STX XADD: lock *(u32 *)(dst + off) += src */
-	case BPF_STX | BPF_XADD | BPF_W: {
+	case BPF_STX | BPF_ATOMIC | BPF_W: {
 		const u8 tmp = bpf2sparc[TMP_REG_1];
 		const u8 tmp2 = bpf2sparc[TMP_REG_2];
 		const u8 tmp3 = bpf2sparc[TMP_REG_3];
 
+		if (insn->imm != BPF_ADD) {
+			pr_err_once("unknown atomic op %02x\n", insn->imm);
+			return -EINVAL;
+		}
+
+		/* lock *(u32 *)(dst + off) += src */
+
 		if (insn->dst_reg == BPF_REG_FP)
 			ctx->saw_frame_pointer = true;
 
@@ -1393,11 +1399,16 @@
 		break;
 	}
 	/* STX XADD: lock *(u64 *)(dst + off) += src */
-	case BPF_STX | BPF_XADD | BPF_DW: {
+	case BPF_STX | BPF_ATOMIC | BPF_DW: {
 		const u8 tmp = bpf2sparc[TMP_REG_1];
 		const u8 tmp2 = bpf2sparc[TMP_REG_2];
 		const u8 tmp3 = bpf2sparc[TMP_REG_3];
 
+		if (insn->imm != BPF_ADD) {
+			pr_err_once("unknown atomic op %02x\n", insn->imm);
+			return -EINVAL;
+		}
+
 		if (insn->dst_reg == BPF_REG_FP)
 			ctx->saw_frame_pointer = true;
 

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 28c89fc..9b9e24c 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c

@@ -492,6 +492,10 @@
 
 	/* 0 means: find the address automatically */
 	if (!crash_base) {
+		unsigned long long max_addr = high ? CRASH_ADDR_HIGH_MAX
+						   : CRASH_ADDR_LOW_MAX;
+		unsigned long long base = CRASH_ALIGN;
+
 		/*
 		 * Set CRASH_ADDR_LOW_MAX upper bound for crash memory,
 		 * crashkernel=x,high reserves memory over 4G, also allocates
@@ -499,15 +503,14 @@
 		 * But the extra memory is not required for all machines.
 		 * So try low memory first and fall back to high memory
 		 * unless "crashkernel=size[KMG],high" is specified.
+		 * To conserve memory in crash-capture kernel try
+		 * to allocate crash_base at the lowest address possible.
 		 */
-		if (!high)
+		do {
 			crash_base = memblock_phys_alloc_range(crash_size,
-						CRASH_ALIGN, CRASH_ALIGN,
-						CRASH_ADDR_LOW_MAX);
-		if (!crash_base)
-			crash_base = memblock_phys_alloc_range(crash_size,
-						CRASH_ALIGN, CRASH_ALIGN,
-						CRASH_ADDR_HIGH_MAX);
+				CRASH_ALIGN, base, base + crash_size);
+			base += CRASH_ALIGN;
+		} while (!crash_base && base + crash_size <= max_addr);
 		if (!crash_base) {
 			pr_info("crashkernel reservation failed - No suitable area found.\n");
 			return;

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 0a962cd..2f304b3 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c

@@ -205,6 +205,18 @@
 	return byte + reg2hex[dst_reg] + (reg2hex[src_reg] << 3);
 }
 
+/* Some 1-byte opcodes for binary ALU operations */
+static u8 simple_alu_opcodes[] = {
+	[BPF_ADD] = 0x01,
+	[BPF_SUB] = 0x29,
+	[BPF_AND] = 0x21,
+	[BPF_OR] = 0x09,
+	[BPF_XOR] = 0x31,
+	[BPF_LSH] = 0xE0,
+	[BPF_RSH] = 0xE8,
+	[BPF_ARSH] = 0xF8,
+};
+
 static void jit_fill_hole(void *area, unsigned int size)
 {
 	/* Fill whole space with INT3 instructions */
@@ -684,6 +696,42 @@
 	*pprog = prog;
 }
 
+/* Emit the suffix (ModR/M etc) for addressing *(ptr_reg + off) and val_reg */
+static void emit_insn_suffix(u8 **pprog, u32 ptr_reg, u32 val_reg, int off)
+{
+	u8 *prog = *pprog;
+	int cnt = 0;
+
+	if (is_imm8(off)) {
+		/* 1-byte signed displacement.
+		 *
+		 * If off == 0 we could skip this and save one extra byte, but
+		 * special case of x86 R13 which always needs an offset is not
+		 * worth the hassle
+		 */
+		EMIT2(add_2reg(0x40, ptr_reg, val_reg), off);
+	} else {
+		/* 4-byte signed displacement */
+		EMIT1_off32(add_2reg(0x80, ptr_reg, val_reg), off);
+	}
+	*pprog = prog;
+}
+
+/*
+ * Emit a REX byte if it will be necessary to address these registers
+ */
+static void maybe_emit_mod(u8 **pprog, u32 dst_reg, u32 src_reg, bool is64)
+{
+	u8 *prog = *pprog;
+	int cnt = 0;
+
+	if (is64)
+		EMIT1(add_2mod(0x48, dst_reg, src_reg));
+	else if (is_ereg(dst_reg) || is_ereg(src_reg))
+		EMIT1(add_2mod(0x40, dst_reg, src_reg));
+	*pprog = prog;
+}
+
 /* LDX: dst_reg = *(u8*)(src_reg + off) */
 static void emit_ldx(u8 **pprog, u32 size, u32 dst_reg, u32 src_reg, int off)
 {
@@ -711,15 +759,7 @@
 		EMIT2(add_2mod(0x48, src_reg, dst_reg), 0x8B);
 		break;
 	}
-	/*
-	 * If insn->off == 0 we can save one extra byte, but
-	 * special case of x86 R13 which always needs an offset
-	 * is not worth the hassle
-	 */
-	if (is_imm8(off))
-		EMIT2(add_2reg(0x40, src_reg, dst_reg), off);
-	else
-		EMIT1_off32(add_2reg(0x80, src_reg, dst_reg), off);
+	emit_insn_suffix(&prog, src_reg, dst_reg, off);
 	*pprog = prog;
 }
 
@@ -754,13 +794,53 @@
 		EMIT2(add_2mod(0x48, dst_reg, src_reg), 0x89);
 		break;
 	}
-	if (is_imm8(off))
-		EMIT2(add_2reg(0x40, dst_reg, src_reg), off);
-	else
-		EMIT1_off32(add_2reg(0x80, dst_reg, src_reg), off);
+	emit_insn_suffix(&prog, dst_reg, src_reg, off);
 	*pprog = prog;
 }
 
+static int emit_atomic(u8 **pprog, u8 atomic_op,
+		       u32 dst_reg, u32 src_reg, s16 off, u8 bpf_size)
+{
+	u8 *prog = *pprog;
+	int cnt = 0;
+
+	EMIT1(0xF0); /* lock prefix */
+
+	maybe_emit_mod(&prog, dst_reg, src_reg, bpf_size == BPF_DW);
+
+	/* emit opcode */
+	switch (atomic_op) {
+	case BPF_ADD:
+	case BPF_SUB:
+	case BPF_AND:
+	case BPF_OR:
+	case BPF_XOR:
+		/* lock *(u32/u64*)(dst_reg + off) <op>= src_reg */
+		EMIT1(simple_alu_opcodes[atomic_op]);
+		break;
+	case BPF_ADD | BPF_FETCH:
+		/* src_reg = atomic_fetch_add(dst_reg + off, src_reg); */
+		EMIT2(0x0F, 0xC1);
+		break;
+	case BPF_XCHG:
+		/* src_reg = atomic_xchg(dst_reg + off, src_reg); */
+		EMIT1(0x87);
+		break;
+	case BPF_CMPXCHG:
+		/* r0 = atomic_cmpxchg(dst_reg + off, r0, src_reg); */
+		EMIT2(0x0F, 0xB1);
+		break;
+	default:
+		pr_err("bpf_jit: unknown atomic opcode %02x\n", atomic_op);
+		return -EFAULT;
+	}
+
+	emit_insn_suffix(&prog, dst_reg, src_reg, off);
+
+	*pprog = prog;
+	return 0;
+}
+
 static bool ex_handler_bpf(const struct exception_table_entry *x,
 			   struct pt_regs *regs, int trapnr,
 			   unsigned long error_code, unsigned long fault_addr)
@@ -805,6 +885,7 @@
 	int i, cnt = 0, excnt = 0;
 	int proglen = 0;
 	u8 *prog = temp;
+	int err;
 
 	detect_reg_usage(insn, insn_cnt, callee_regs_used,
 			 &tail_call_seen);
@@ -840,17 +921,9 @@
 		case BPF_ALU64 | BPF_AND | BPF_X:
 		case BPF_ALU64 | BPF_OR | BPF_X:
 		case BPF_ALU64 | BPF_XOR | BPF_X:
-			switch (BPF_OP(insn->code)) {
-			case BPF_ADD: b2 = 0x01; break;
-			case BPF_SUB: b2 = 0x29; break;
-			case BPF_AND: b2 = 0x21; break;
-			case BPF_OR: b2 = 0x09; break;
-			case BPF_XOR: b2 = 0x31; break;
-			}
-			if (BPF_CLASS(insn->code) == BPF_ALU64)
-				EMIT1(add_2mod(0x48, dst_reg, src_reg));
-			else if (is_ereg(dst_reg) || is_ereg(src_reg))
-				EMIT1(add_2mod(0x40, dst_reg, src_reg));
+			maybe_emit_mod(&prog, dst_reg, src_reg,
+				       BPF_CLASS(insn->code) == BPF_ALU64);
+			b2 = simple_alu_opcodes[BPF_OP(insn->code)];
 			EMIT2(b2, add_2reg(0xC0, dst_reg, src_reg));
 			break;
 
@@ -1030,12 +1103,7 @@
 			else if (is_ereg(dst_reg))
 				EMIT1(add_1mod(0x40, dst_reg));
 
-			switch (BPF_OP(insn->code)) {
-			case BPF_LSH: b3 = 0xE0; break;
-			case BPF_RSH: b3 = 0xE8; break;
-			case BPF_ARSH: b3 = 0xF8; break;
-			}
-
+			b3 = simple_alu_opcodes[BPF_OP(insn->code)];
 			if (imm32 == 1)
 				EMIT2(0xD1, add_1reg(b3, dst_reg));
 			else
@@ -1069,11 +1137,7 @@
 			else if (is_ereg(dst_reg))
 				EMIT1(add_1mod(0x40, dst_reg));
 
-			switch (BPF_OP(insn->code)) {
-			case BPF_LSH: b3 = 0xE0; break;
-			case BPF_RSH: b3 = 0xE8; break;
-			case BPF_ARSH: b3 = 0xF8; break;
-			}
+			b3 = simple_alu_opcodes[BPF_OP(insn->code)];
 			EMIT2(0xD3, add_1reg(b3, dst_reg));
 
 			if (src_reg != BPF_REG_4)
@@ -1240,21 +1304,60 @@
 			}
 			break;
 
-			/* STX XADD: lock *(u32*)(dst_reg + off) += src_reg */
-		case BPF_STX | BPF_XADD | BPF_W:
-			/* Emit 'lock add dword ptr [rax + off], eax' */
-			if (is_ereg(dst_reg) || is_ereg(src_reg))
-				EMIT3(0xF0, add_2mod(0x40, dst_reg, src_reg), 0x01);
-			else
-				EMIT2(0xF0, 0x01);
-			goto xadd;
-		case BPF_STX | BPF_XADD | BPF_DW:
-			EMIT3(0xF0, add_2mod(0x48, dst_reg, src_reg), 0x01);
-xadd:			if (is_imm8(insn->off))
-				EMIT2(add_2reg(0x40, dst_reg, src_reg), insn->off);
-			else
-				EMIT1_off32(add_2reg(0x80, dst_reg, src_reg),
-					    insn->off);
+		case BPF_STX | BPF_ATOMIC | BPF_W:
+		case BPF_STX | BPF_ATOMIC | BPF_DW:
+			if (insn->imm == (BPF_AND | BPF_FETCH) ||
+			    insn->imm == (BPF_OR | BPF_FETCH) ||
+			    insn->imm == (BPF_XOR | BPF_FETCH)) {
+				u8 *branch_target;
+				bool is64 = BPF_SIZE(insn->code) == BPF_DW;
+				u32 real_src_reg = src_reg;
+
+				/*
+				 * Can't be implemented with a single x86 insn.
+				 * Need to do a CMPXCHG loop.
+				 */
+
+				/* Will need RAX as a CMPXCHG operand so save R0 */
+				emit_mov_reg(&prog, true, BPF_REG_AX, BPF_REG_0);
+				if (src_reg == BPF_REG_0)
+					real_src_reg = BPF_REG_AX;
+
+				branch_target = prog;
+				/* Load old value */
+				emit_ldx(&prog, BPF_SIZE(insn->code),
+					 BPF_REG_0, dst_reg, insn->off);
+				/*
+				 * Perform the (commutative) operation locally,
+				 * put the result in the AUX_REG.
+				 */
+				emit_mov_reg(&prog, is64, AUX_REG, BPF_REG_0);
+				maybe_emit_mod(&prog, AUX_REG, real_src_reg, is64);
+				EMIT2(simple_alu_opcodes[BPF_OP(insn->imm)],
+				      add_2reg(0xC0, AUX_REG, real_src_reg));
+				/* Attempt to swap in new value */
+				err = emit_atomic(&prog, BPF_CMPXCHG,
+						  dst_reg, AUX_REG, insn->off,
+						  BPF_SIZE(insn->code));
+				if (WARN_ON(err))
+					return err;
+				/*
+				 * ZF tells us whether we won the race. If it's
+				 * cleared we need to try again.
+				 */
+				EMIT2(X86_JNE, -(prog - branch_target) - 2);
+				/* Return the pre-modification value */
+				emit_mov_reg(&prog, is64, real_src_reg, BPF_REG_0);
+				/* Restore R0 after clobbering RAX */
+				emit_mov_reg(&prog, true, BPF_REG_0, BPF_REG_AX);
+				break;
+
+			}
+
+			err = emit_atomic(&prog, insn->imm, dst_reg, src_reg,
+						  insn->off, BPF_SIZE(insn->code));
+			if (err)
+				return err;
 			break;
 
 			/* call */
@@ -1305,20 +1408,16 @@
 		case BPF_JMP32 | BPF_JSGE | BPF_X:
 		case BPF_JMP32 | BPF_JSLE | BPF_X:
 			/* cmp dst_reg, src_reg */
-			if (BPF_CLASS(insn->code) == BPF_JMP)
-				EMIT1(add_2mod(0x48, dst_reg, src_reg));
-			else if (is_ereg(dst_reg) || is_ereg(src_reg))
-				EMIT1(add_2mod(0x40, dst_reg, src_reg));
+			maybe_emit_mod(&prog, dst_reg, src_reg,
+				       BPF_CLASS(insn->code) == BPF_JMP);
 			EMIT2(0x39, add_2reg(0xC0, dst_reg, src_reg));
 			goto emit_cond_jmp;
 
 		case BPF_JMP | BPF_JSET | BPF_X:
 		case BPF_JMP32 | BPF_JSET | BPF_X:
 			/* test dst_reg, src_reg */
-			if (BPF_CLASS(insn->code) == BPF_JMP)
-				EMIT1(add_2mod(0x48, dst_reg, src_reg));
-			else if (is_ereg(dst_reg) || is_ereg(src_reg))
-				EMIT1(add_2mod(0x40, dst_reg, src_reg));
+			maybe_emit_mod(&prog, dst_reg, src_reg,
+				       BPF_CLASS(insn->code) == BPF_JMP);
 			EMIT2(0x85, add_2reg(0xC0, dst_reg, src_reg));
 			goto emit_cond_jmp;
 
@@ -1354,10 +1453,8 @@
 		case BPF_JMP32 | BPF_JSLE | BPF_K:
 			/* test dst_reg, dst_reg to save one extra byte */
 			if (imm32 == 0) {
-				if (BPF_CLASS(insn->code) == BPF_JMP)
-					EMIT1(add_2mod(0x48, dst_reg, dst_reg));
-				else if (is_ereg(dst_reg))
-					EMIT1(add_2mod(0x40, dst_reg, dst_reg));
+				maybe_emit_mod(&prog, dst_reg, dst_reg,
+					       BPF_CLASS(insn->code) == BPF_JMP);
 				EMIT2(0x85, add_2reg(0xC0, dst_reg, dst_reg));
 				goto emit_cond_jmp;
 			}

diff --git a/arch/x86/net/bpf_jit_comp32.c b/arch/x86/net/bpf_jit_comp32.c
index 4bd0f98..e610828 100644
--- a/arch/x86/net/bpf_jit_comp32.c
+++ b/arch/x86/net/bpf_jit_comp32.c

@@ -2249,10 +2249,8 @@
 				return -EFAULT;
 			}
 			break;
-		/* STX XADD: lock *(u32 *)(dst + off) += src */
-		case BPF_STX | BPF_XADD | BPF_W:
-		/* STX XADD: lock *(u64 *)(dst + off) += src */
-		case BPF_STX | BPF_XADD | BPF_DW:
+		case BPF_STX | BPF_ATOMIC | BPF_W:
+		case BPF_STX | BPF_ATOMIC | BPF_DW:
 			goto notyet;
 		case BPF_JMP | BPF_EXIT:
 			if (seen_exit) {

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 26f4bcc..e59ed04 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c

@@ -106,15 +106,18 @@
 static struct bio *blk_bio_write_zeroes_split(struct request_queue *q,
 		struct bio *bio, struct bio_set *bs, unsigned *nsegs)
 {
+	sector_t split;
+
 	*nsegs = 0;
 
-	if (!q->limits.max_write_zeroes_sectors)
+	split = q->limits.max_write_zeroes_sectors;
+	if (split && q->split_alignment)
+		split = round_down(split, q->split_alignment);
+
+	if (!split || bio_sectors(bio) <= split)
 		return NULL;
 
-	if (bio_sectors(bio) <= q->limits.max_write_zeroes_sectors)
-		return NULL;
-
-	return bio_split(bio, q->limits.max_write_zeroes_sectors, GFP_NOIO, bs);
+	return bio_split(bio, split, GFP_NOIO, bs);
 }
 
 static struct bio *blk_bio_write_same_split(struct request_queue *q,
@@ -122,15 +125,18 @@
 					    struct bio_set *bs,
 					    unsigned *nsegs)
 {
+	sector_t split;
+
 	*nsegs = 1;
 
-	if (!q->limits.max_write_same_sectors)
+	split = q->limits.max_write_same_sectors;
+	if (split && q->split_alignment)
+		split = round_down(split, q->split_alignment);
+
+	if (!split || bio_sectors(bio) <= split)
 		return NULL;
 
-	if (bio_sectors(bio) <= q->limits.max_write_same_sectors)
-		return NULL;
-
-	return bio_split(bio, q->limits.max_write_same_sectors, GFP_NOIO, bs);
+	return bio_split(bio, split, GFP_NOIO, bs);
 }
 
 /*
@@ -249,7 +255,9 @@
 {
 	struct bio_vec bv, bvprv, *bvprvp = NULL;
 	struct bvec_iter iter;
-	unsigned nsegs = 0, sectors = 0;
+	unsigned int nsegs = 0, nsegs_aligned = 0;
+	unsigned int sectors = 0, sectors_aligned = 0, before = 0, after = 0;
+	unsigned int sector_alignment = q->split_alignment;
 	const unsigned max_sectors = get_max_io_size(q, bio);
 	const unsigned max_segs = queue_max_segments(q);
 
@@ -265,12 +273,31 @@
 		    sectors + (bv.bv_len >> 9) <= max_sectors &&
 		    bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
 			nsegs++;
-			sectors += bv.bv_len >> 9;
-		} else if (bvec_split_segs(q, &bv, &nsegs, &sectors, max_segs,
-					 max_sectors)) {
-			goto split;
+			before = round_down(sectors, sector_alignment);
+			sectors += (bv.bv_len >> 9);
+			after = round_down(sectors, sector_alignment);
+			if (sector_alignment && before != after) {
+				/* This is a valid split point */
+				nsegs_aligned = nsegs;
+				sectors_aligned = after;
+			}
+			goto next;
 		}
-
+		if (sector_alignment) {
+			before = round_down(sectors, sector_alignment);
+			after = round_down(sectors + (bv.bv_len >> 9),
+					  sector_alignment);
+			if ((nsegs < max_segs) && before != after &&
+			    ((after - before) << 9) + bv.bv_offset <=  PAGE_SIZE
+			    && after <= max_sectors) {
+				sectors_aligned = after;
+				nsegs_aligned = nsegs + 1;
+			}
+		}
+		if (bvec_split_segs(q, &bv, &nsegs, &sectors, max_segs,
+				    max_sectors))
+			goto split;
+next:
 		bvprv = bv;
 		bvprvp = &bvprv;
 	}
@@ -279,7 +306,13 @@
 	return NULL;
 split:
 	*segs = nsegs;
-	return bio_split(bio, sectors, GFP_NOIO, bs);
+	if (sector_alignment && sectors_aligned == 0)
+		return NULL;
+
+	*segs = sector_alignment ? nsegs_aligned : nsegs;
+
+	return bio_split(bio, sector_alignment ? sectors_aligned : sectors,
+			 GFP_NOIO, bs);
 }
 
 /**

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index b513f16..bdc214d 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c

@@ -548,6 +548,33 @@
 	return queue_var_show(blk_queue_dax(q), page);
 }
 
+static ssize_t queue_split_alignment_show(struct request_queue *q, char *page)
+{
+	return queue_var_show((q->split_alignment << 9), page);
+}
+
+static ssize_t queue_split_alignment_store(struct request_queue *q, const char *page,
+						size_t count)
+{
+	unsigned long split_alignment;
+	int ret;
+
+	ret = queue_var_store(&split_alignment, page, count);
+	if (ret < 0)
+		return ret;
+
+	/* split_alignment can only be a power of 2 */
+	if (split_alignment & (split_alignment - 1))
+		return -EINVAL;
+
+	/* ..and it should be greater than 512 */
+	if (!(split_alignment >> 9))
+		return -EINVAL;
+
+	q->split_alignment = split_alignment >> 9;
+	return count;
+}
+
 #define QUEUE_RO_ENTRY(_prefix, _name)			\
 static struct queue_sysfs_entry _prefix##_entry = {	\
 	.attr	= { .name = _name, .mode = 0444 },	\
@@ -600,6 +627,7 @@
 QUEUE_RO_ENTRY(queue_dax, "dax");
 QUEUE_RW_ENTRY(queue_io_timeout, "io_timeout");
 QUEUE_RW_ENTRY(queue_wb_lat, "wbt_lat_usec");
+QUEUE_RW_ENTRY(queue_split_alignment, "split_alignment");
 
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
 QUEUE_RW_ENTRY(blk_throtl_sample_time, "throttle_sample_time");
@@ -659,6 +687,7 @@
 #ifdef CONFIG_BLK_DEV_THROTTLING_LOW
 	&blk_throtl_sample_time_entry.attr,
 #endif
+	&queue_split_alignment_entry.attr,
 	NULL,
 };
 

diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 8d70017..817fed4 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig

@@ -59,6 +59,15 @@
 	  rescue mode with init=/bin/sh, even when the /dev directory
 	  on the rootfs is completely empty.
 
+config DEVTMPFS_SAFE
+	bool "Automount devtmpfs with nosuid/noexec"
+	depends on DEVTMPFS_MOUNT
+	default y
+	help
+	  This instructs the kernel to automount devtmpfs with the
+	  MS_NOEXEC and MS_NOSUID mount flags, which can prevent
+	  certain kinds of code-execution attack on embedded platforms.
+
 config STANDALONE
 	bool "Select only drivers that don't need compile-time external firmware"
 	default y

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index a71d141..6bf82b3 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c

@@ -353,6 +353,7 @@
 int __init devtmpfs_mount(void)
 {
 	int err;
+	int mflags = MS_SILENT;
 
 	if (!mount_dev)
 		return 0;
@@ -360,7 +361,10 @@
 	if (!thread)
 		return 0;
 
-	err = init_mount("devtmpfs", "dev", "devtmpfs", MS_SILENT, NULL);
+#ifdef CONFIG_DEVTMPFS_SAFE
+	mflags |= MS_NOEXEC | MS_NOSUID;
+#endif
+	err = init_mount("devtmpfs", "dev", "devtmpfs", mflags, NULL);
 	if (err)
 		printk(KERN_INFO "devtmpfs: error mounting %i\n", err);
 	else

diff --git a/drivers/md/dm-init.c b/drivers/md/dm-init.c
index b0c45c6..188b23c 100644
--- a/drivers/md/dm-init.c
+++ b/drivers/md/dm-init.c

@@ -301,3 +301,254 @@
 
 module_param(create, charp, 0);
 MODULE_PARM_DESC(create, "Create a mapped device in early boot");
+
+/* ---------------------------------------------------------------
+ * ChromeOS shim - convert dm= format to dm-mod.create= format
+ * ---------------------------------------------------------------
+ */
+
+struct dm_chrome_target {
+	char *field[4];
+};
+
+struct dm_chrome_dev {
+	char *name, *uuid, *mode;
+	unsigned int num_targets;
+	struct dm_chrome_target targets[DM_MAX_TARGETS];
+};
+
+static char __init *dm_chrome_parse_target(char *str, struct dm_chrome_target *tgt)
+{
+	unsigned int i;
+
+	tgt->field[0] = str;
+	/* Delimit first 3 fields that are separated by space */
+	for (i = 0; i < ARRAY_SIZE(tgt->field) - 1; i++) {
+		tgt->field[i + 1] = str_field_delimit(&tgt->field[i], ' ');
+		if (!tgt->field[i + 1])
+			return NULL;
+	}
+	/* Delimit last field that can be terminated by comma */
+	return str_field_delimit(&tgt->field[i], ',');
+}
+
+static char __init *dm_chrome_parse_dev(char *str, struct dm_chrome_dev *dev)
+{
+	char *target, *num;
+	unsigned int i;
+
+	if (!str)
+		return ERR_PTR(-EINVAL);
+
+	target = str_field_delimit(&str, ',');
+	if (!target)
+		return ERR_PTR(-EINVAL);
+
+	/* Delimit first 3 fields that are separated by space */
+	dev->name = str;
+	dev->uuid = str_field_delimit(&dev->name, ' ');
+	if (!dev->uuid)
+		return ERR_PTR(-EINVAL);
+
+	dev->mode = str_field_delimit(&dev->uuid, ' ');
+	if (!dev->mode)
+		return ERR_PTR(-EINVAL);
+
+	/* num is optional */
+	num = str_field_delimit(&dev->mode, ' ');
+	if (!num)
+		dev->num_targets = 1;
+	else {
+		/* Delimit num and check if it the last field */
+		if(str_field_delimit(&num, ' '))
+			return ERR_PTR(-EINVAL);
+		if (kstrtouint(num, 0, &dev->num_targets))
+			return ERR_PTR(-EINVAL);
+	}
+
+	if (dev->num_targets > DM_MAX_TARGETS) {
+		DMERR("too many targets %u > %d",
+		      dev->num_targets, DM_MAX_TARGETS);
+		return ERR_PTR(-EINVAL);
+	}
+
+	for (i = 0; i < dev->num_targets - 1; i++) {
+		target = dm_chrome_parse_target(target, &dev->targets[i]);
+		if (!target)
+			return ERR_PTR(-EINVAL);
+	}
+	/* The last one can return NULL if it reaches the end of str */
+	return dm_chrome_parse_target(target, &dev->targets[i]);
+}
+
+static char __init *dm_chrome_convert(struct dm_chrome_dev *devs, unsigned int num_devs)
+{
+	char *str = kmalloc(DM_MAX_STR_SIZE, GFP_KERNEL);
+	char *p = str;
+	unsigned int i, j;
+	int ret;
+
+	if (!str)
+		return ERR_PTR(-ENOMEM);
+
+	for (i = 0; i < num_devs; i++) {
+		if (!strcmp(devs[i].uuid, "none"))
+			devs[i].uuid = "";
+		ret = snprintf(p, DM_MAX_STR_SIZE - (p - str),
+			       "%s,%s,,%s",
+			       devs[i].name,
+			       devs[i].uuid,
+			       devs[i].mode);
+		if (ret < 0)
+			goto out;
+		p += ret;
+
+		for (j = 0; j < devs[i].num_targets; j++) {
+			ret = snprintf(p, DM_MAX_STR_SIZE - (p - str),
+				       ",%s %s %s %s",
+				       devs[i].targets[j].field[0],
+				       devs[i].targets[j].field[1],
+				       devs[i].targets[j].field[2],
+				       devs[i].targets[j].field[3]);
+			if (ret < 0)
+				goto out;
+			p += ret;
+		}
+		if (i < num_devs - 1) {
+			ret = snprintf(p, DM_MAX_STR_SIZE - (p - str), ";");
+			if (ret < 0)
+				goto out;
+			p += ret;
+		}
+	}
+
+	return str;
+
+out:
+	kfree(str);
+	return ERR_PTR(ret);
+}
+
+/**
+ * dm_chrome_shim - convert old dm= format used in chromeos to the new
+ * upstream format.
+ *
+ * ChromeOS old format
+ * -------------------
+ * <device>        ::= [<num>] <device-mapper>+
+ * <device-mapper> ::= <head> "," <target>+
+ * <head>          ::= <name> <uuid> <mode> [<num>]
+ * <target>        ::= <start> <length> <type> <options> ","
+ * <mode>          ::= "ro" | "rw"
+ * <uuid>          ::= xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx | "none"
+ * <type>          ::= "verity" | "bootcache" | ...
+ *
+ * Example:
+ * 2 vboot none ro 1,
+ *     0 1768000 bootcache
+ *       device=aa55b119-2a47-8c45-946a-5ac57765011f+1
+ *       signature=76e9be054b15884a9fa85973e9cb274c93afadb6
+ *       cache_start=1768000 max_blocks=100000 size_limit=23 max_trace=20000,
+ *   vroot none ro 1,
+ *     0 1740800 verity payload=254:0 hashtree=254:0 hashstart=1740800 alg=sha1
+ *       root_hexdigest=76e9be054b15884a9fa85973e9cb274c93afadb6
+ *       salt=5b3549d54d6c7a3837b9b81ed72e49463a64c03680c47835bef94d768e5646fe
+ *
+ * Notes:
+ *  1. uuid is a label for the device and we set it to "none".
+ *  2. The <num> field will be optional initially and assumed to be 1.
+ *     Once all the scripts that set these fields have been set, it will
+ *     be made mandatory.
+ */
+
+static char *chrome_create;
+
+static int __init dm_chrome_shim(char *arg) {
+	if (!arg || create)
+		return -EINVAL;
+	chrome_create = arg;
+	return 0;
+}
+
+static int __init dm_chrome_parse_devices(void)
+{
+	struct dm_chrome_dev *devs;
+	unsigned int num_devs, i;
+	char *next, *base_str;
+	int ret = 0;
+
+	/* Verify if dm-mod.create was not used */
+	if (!chrome_create || create)
+		return -EINVAL;
+
+	if (strlen(chrome_create) >= DM_MAX_STR_SIZE) {
+		DMERR("Argument is too big. Limit is %d\n", DM_MAX_STR_SIZE);
+		return -EINVAL;
+	}
+
+	base_str = kstrdup(chrome_create, GFP_KERNEL);
+	if (!base_str)
+		return -ENOMEM;
+
+	next = str_field_delimit(&base_str, ' ');
+	if (!next) {
+		ret = -EINVAL;
+		goto out_str;
+	}
+
+	/* if first field is not the optional <num> field */
+	if (kstrtouint(base_str, 0, &num_devs)) {
+		num_devs = 1;
+		/* rewind next pointer */
+		next = base_str;
+	}
+
+	if (num_devs > DM_MAX_DEVICES) {
+		DMERR("too many devices %u > %d", num_devs, DM_MAX_DEVICES);
+		ret = -EINVAL;
+		goto out_str;
+	}
+
+	devs = kcalloc(num_devs, sizeof(*devs), GFP_KERNEL);
+	if (!devs)
+		return -ENOMEM;
+
+	/* restore string */
+	strcpy(base_str, chrome_create);
+
+	/* parse devices */
+	for (i = 0; i < num_devs; i++) {
+		next = dm_chrome_parse_dev(next, &devs[i]);
+		if (IS_ERR(next)) {
+			DMERR("couldn't parse device");
+			ret = PTR_ERR(next);
+			goto out_devs;
+		}
+	}
+
+	create = dm_chrome_convert(devs, num_devs);
+	if (IS_ERR(create)) {
+		ret = PTR_ERR(create);
+		goto out_devs;
+	}
+
+	DMDEBUG("Converting:\n\tdm=\"%s\"\n\tdm-mod.create=\"%s\"\n",
+		chrome_create, create);
+
+	/* Call upstream code */
+	dm_init_init();
+
+	kfree(create);
+
+out_devs:
+	create = NULL;
+	kfree(devs);
+out_str:
+	kfree(base_str);
+
+	return ret;
+}
+
+late_initcall(dm_chrome_parse_devices);
+
+__setup("dm=", dm_chrome_shim);

diff --git a/drivers/md/dm-verity-target.c b/drivers/md/dm-verity-target.c
index 808a98e..ac5f9ad 100644
--- a/drivers/md/dm-verity-target.c
+++ b/drivers/md/dm-verity-target.c

@@ -16,8 +16,10 @@
 #include "dm-verity.h"
 #include "dm-verity-fec.h"
 #include "dm-verity-verify-sig.h"
+#include <linux/delay.h>
 #include <linux/module.h>
 #include <linux/reboot.h>
+#include <crypto/hash.h>
 
 #define DM_MSG_PREFIX			"verity"
 
@@ -33,8 +35,9 @@
 #define DM_VERITY_OPT_PANIC		"panic_on_corruption"
 #define DM_VERITY_OPT_IGN_ZEROES	"ignore_zero_blocks"
 #define DM_VERITY_OPT_AT_MOST_ONCE	"check_at_most_once"
+#define DM_VERITY_OPT_ERROR_BEHAVIOR	"error_behavior"
 
-#define DM_VERITY_OPTS_MAX		(3 + DM_VERITY_OPTS_FEC + \
+#define DM_VERITY_OPTS_MAX		(4 + DM_VERITY_OPTS_FEC + \
 					 DM_VERITY_ROOT_HASH_VERIFICATION_OPTS)
 
 static unsigned dm_verity_prefetch_cluster = DM_VERITY_DEFAULT_PREFETCH_SIZE;
@@ -48,6 +51,120 @@
 	unsigned n_blocks;
 };
 
+/* Provide a lightweight means of specifying the global default for
+ * error behavior: eio, reboot, or none
+ * Legacy support for 0 = eio, 1 = reboot/panic, 2 = none, 3 = notify.
+ * This is matched to the enum in dm-verity.h.
+ */
+static char *error_behavior_istring[] = { "0", "1", "2", "3" };
+static const char *allowed_error_behaviors[] = { "eio", "panic", "none",
+						 "notify", NULL };
+static char *error_behavior = "eio";
+module_param(error_behavior, charp, 0644);
+MODULE_PARM_DESC(error_behavior, "Behavior on error "
+				 "(eio, panic, none, notify)");
+
+/* Controls whether verity_get_device will wait forever for a device. */
+static int dev_wait;
+module_param(dev_wait, int, 0444);
+MODULE_PARM_DESC(dev_wait, "Wait forever for a backing device");
+
+static BLOCKING_NOTIFIER_HEAD(verity_error_notifier);
+
+int dm_verity_register_error_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&verity_error_notifier, nb);
+}
+EXPORT_SYMBOL_GPL(dm_verity_register_error_notifier);
+
+int dm_verity_unregister_error_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_unregister(&verity_error_notifier, nb);
+}
+EXPORT_SYMBOL_GPL(dm_verity_unregister_error_notifier);
+
+/* If the request is not successful, this handler takes action.
+ * TODO make this call a registered handler.
+ */
+static void verity_error(struct dm_verity *v, struct dm_verity_io *io,
+			 blk_status_t status)
+{
+	const char *message = v->hash_failed ? "integrity" : "block";
+	int error_behavior = DM_VERITY_ERROR_BEHAVIOR_PANIC;
+	dev_t devt = 0;
+	u64 block = ~0;
+	struct dm_verity_error_state error_state;
+	/* If the hash did not fail, then this is likely transient. */
+	int transient = !v->hash_failed;
+
+	devt = v->data_dev->bdev->bd_dev;
+	error_behavior = v->error_behavior;
+
+	DMERR_LIMIT("verification failure occurred: %s failure", message);
+
+	if (error_behavior == DM_VERITY_ERROR_BEHAVIOR_NOTIFY) {
+		error_state.code = status;
+		error_state.transient = transient;
+		error_state.block = block;
+		error_state.message = message;
+		error_state.dev_start = v->data_start;
+		error_state.dev_len = v->data_blocks;
+		error_state.dev = v->data_dev->bdev;
+		error_state.hash_dev_start = v->hash_start;
+		error_state.hash_dev_len = v->hash_blocks;
+		error_state.hash_dev = v->hash_dev->bdev;
+
+		/* Set default fallthrough behavior. */
+		error_state.behavior = DM_VERITY_ERROR_BEHAVIOR_PANIC;
+		error_behavior = DM_VERITY_ERROR_BEHAVIOR_PANIC;
+
+		if (!blocking_notifier_call_chain(
+		    &verity_error_notifier, transient, &error_state)) {
+			error_behavior = error_state.behavior;
+		}
+	}
+
+	switch (error_behavior) {
+	case DM_VERITY_ERROR_BEHAVIOR_EIO:
+		break;
+	case DM_VERITY_ERROR_BEHAVIOR_NONE:
+		break;
+	default:
+		if (!transient)
+			goto do_panic;
+	}
+	return;
+
+do_panic:
+	panic("dm-verity failure: "
+	      "device:%u:%u status:%d block:%llu message:%s",
+	      MAJOR(devt), MINOR(devt), status, (u64)block, message);
+}
+
+/**
+ * verity_parse_error_behavior - parse a behavior charp to the enum
+ * @behavior:	NUL-terminated char array
+ *
+ * Checks if the behavior is valid either as text or as an index digit
+ * and returns the proper enum value in string form or ERR_PTR(-EINVAL)
+ * on error.
+ */
+static char *verity_parse_error_behavior(const char *behavior)
+{
+	const char **allowed = allowed_error_behaviors;
+	int index;
+
+	for (index = 0; *allowed; allowed++, index++)
+		if (!strcmp(*allowed, behavior) || behavior[0] == index + '0')
+			break;
+
+	if (!*allowed)
+		return ERR_PTR(-EINVAL);
+
+	/* Convert to the integer index matching the enum. */
+	return error_behavior_istring[index];
+}
+
 /*
  * Auxiliary structure appended to each dm-bufio buffer. If the value
  * hash_verified is nonzero, hash of the block has been verified.
@@ -554,6 +671,8 @@
 	struct dm_verity *v = io->v;
 	struct bio *bio = dm_bio_from_per_bio_data(io, v->ti->per_io_data_size);
 
+	if (status && !verity_fec_is_enabled(io->v))
+		verity_error(v, io, status);
 	bio->bi_end_io = io->orig_bi_end_io;
 	bio->bi_status = status;
 
@@ -942,6 +1061,22 @@
 				return r;
 			continue;
 
+		} else if (!strcasecmp(arg_name, DM_VERITY_OPT_ERROR_BEHAVIOR)) {
+			int behavior;
+
+			if (!argc) {
+				ti->error = "Missing error behavior parameter";
+				return -EINVAL;
+			}
+			if (kstrtoint(dm_shift_arg(as), 0, &behavior) ||
+			    behavior < 0) {
+				ti->error = "Bad error behavior parameter";
+				return -EINVAL;
+			}
+			v->error_behavior = behavior;
+			argc--;
+			continue;
+
 		} else if (verity_is_fec_opt_arg(arg_name)) {
 			r = verity_fec_parse_opt_args(as, v, &argc, arg_name);
 			if (r)
@@ -964,6 +1099,132 @@
 	return r;
 }
 
+static int verity_get_device(struct dm_target *ti, const char *devname,
+			     struct dm_dev **dm_dev)
+{
+	do {
+		/* Try the normal path first since if everything is ready, it
+		 * will be the fastest.
+		 */
+		if (!dm_get_device(ti, devname,
+				   dm_table_get_mode(ti->table), dm_dev))
+			return 0;
+
+		if (!dev_wait)
+			break;
+
+		/* No need to be too aggressive since this is a slow path. */
+		msleep(500);
+	} while (dev_wait && (driver_probe_done() != 0 || *dm_dev == NULL));
+	return -1;
+}
+
+static void splitarg(char *arg, char **key, char **val)
+{
+	*key = strsep(&arg, "=");
+	*val = strsep(&arg, "");
+}
+
+/* Convert Chrome OS arguments into standard arguments */
+
+static char *chromeos_args(unsigned *pargc, char ***pargv)
+{
+	char *hashstart = NULL;
+	char **argv = *pargv;
+	int argc = *pargc;
+	char *key, *val;
+	int nargc = 10;
+	char **nargv;
+	char *errstr;
+	int i;
+
+	nargv = kcalloc(14, sizeof(char *), GFP_KERNEL);
+	if (!nargv)
+		return "Failed to allocate memory";
+
+	nargv[0] = "0";		/* version */
+	nargv[3] = "4096";	/* hash block size */
+	nargv[4] = "4096";	/* data block size */
+	nargv[9] = "-";		/* salt (optional) */
+
+	for (i = 0; i < argc; ++i) {
+		DMDEBUG("Argument %d: '%s'", i, argv[i]);
+		splitarg(argv[i], &key, &val);
+		if (!key) {
+			DMWARN("Bad argument %d: missing key?", i);
+			errstr = "Bad argument: missing key";
+			goto err;
+		}
+		if (!val) {
+			DMWARN("Bad argument %d='%s': missing value", i, key);
+			errstr = "Bad argument: missing value";
+			goto err;
+		}
+		if (!strcmp(key, "alg")) {
+			nargv[7] = val;
+		} else if (!strcmp(key, "payload")) {
+			nargv[1] = val;
+		} else if (!strcmp(key, "hashtree")) {
+			nargv[2] = val;
+		} else if (!strcmp(key, "root_hexdigest")) {
+			nargv[8] = val;
+		} else if (!strcmp(key, "hashstart")) {
+			unsigned long num;
+
+			if (kstrtoul(val, 10, &num)) {
+				errstr = "Invalid hashstart";
+				goto err;
+			}
+			num >>= (12 - SECTOR_SHIFT);
+			hashstart = kmalloc(24, GFP_KERNEL);
+			if (!hashstart) {
+				errstr = "Failed to allocate memory";
+				goto err;
+			}
+			scnprintf(hashstart, sizeof(hashstart), "%lu", num);
+			nargv[5] = hashstart;
+			nargv[6] = hashstart;
+		} else if (!strcmp(key, "salt")) {
+			nargv[9] = val;
+		} else if (!strcmp(key, DM_VERITY_OPT_ERROR_BEHAVIOR)) {
+			char *behavior = verity_parse_error_behavior(val);
+
+			if (IS_ERR(behavior)) {
+				errstr = "Invalid error behavior";
+				goto err;
+			}
+			nargv[10] = "2";
+			nargv[11] = key;
+			nargv[12] = behavior;
+			nargc = 13;
+		}
+	}
+
+	if (!nargv[1] || !nargv[2] || !nargv[5] || !nargv[7] || !nargv[8]) {
+		errstr = "Missing argument";
+		goto err;
+	}
+
+	*pargc = nargc;
+	*pargv = nargv;
+	return NULL;
+
+err:
+	kfree(nargv);
+	kfree(hashstart);
+	return errstr;
+}
+
+/* Release memory allocated for Chrome OS parameter conversion */
+
+static void free_chromeos_argv(char **argv)
+{
+	if (argv) {
+		kfree(argv[5]);
+		kfree(argv);
+	}
+}
+
 /*
  * Target parameters:
  *	<version>	The current format is version 1.
@@ -990,10 +1251,19 @@
 	sector_t hash_position;
 	char dummy;
 	char *root_hash_digest_to_validate;
+	char **chromeos_argv = NULL;
+
+	if (argc < 10) {
+		ti->error = chromeos_args(&argc, &argv);
+		if (ti->error)
+			return -EINVAL;
+		chromeos_argv = argv;
+	}
 
 	v = kzalloc(sizeof(struct dm_verity), GFP_KERNEL);
 	if (!v) {
 		ti->error = "Cannot allocate verity structure";
+		free_chromeos_argv(chromeos_argv);
 		return -ENOMEM;
 	}
 	ti->private = v;
@@ -1023,13 +1293,13 @@
 	}
 	v->version = num;
 
-	r = dm_get_device(ti, argv[1], FMODE_READ, &v->data_dev);
+	r = verity_get_device(ti, argv[1], &v->data_dev);
 	if (r) {
 		ti->error = "Data device lookup failed";
 		goto bad;
 	}
 
-	r = dm_get_device(ti, argv[2], FMODE_READ, &v->hash_dev);
+	r = verity_get_device(ti, argv[2], &v->hash_dev);
 	if (r) {
 		ti->error = "Hash device lookup failed";
 		goto bad;
@@ -1229,14 +1499,14 @@
 				       __alignof__(struct dm_verity_io));
 
 	verity_verify_sig_opts_cleanup(&verify_args);
-
+	free_chromeos_argv(chromeos_argv);
 	return 0;
 
 bad:
 
 	verity_verify_sig_opts_cleanup(&verify_args);
 	verity_dtr(ti);
-
+	free_chromeos_argv(chromeos_argv);
 	return r;
 }
 

diff --git a/drivers/md/dm-verity.h b/drivers/md/dm-verity.h
index 4e769d13..6ec6ffa 100644
--- a/drivers/md/dm-verity.h
+++ b/drivers/md/dm-verity.h

@@ -14,6 +14,7 @@
 #include <linux/dm-bufio.h>
 #include <linux/device-mapper.h>
 #include <crypto/hash.h>
+#include <linux/notifier.h>
 
 #define DM_VERITY_MAX_LEVELS		63
 
@@ -56,6 +57,7 @@
 	int hash_failed;	/* set to 1 if hash of any block failed */
 	enum verity_mode mode;	/* mode for handling verification errors */
 	unsigned corrupted_errs;/* Number of errors for corrupted blocks */
+	int error_behavior;	/* selects error behavior on io errors */
 
 	struct workqueue_struct *verify_wq;
 
@@ -93,6 +95,40 @@
 	 */
 };
 
+struct verity_result {
+	struct completion completion;
+	int err;
+};
+
+struct dm_verity_error_state {
+	int code;
+	int transient;  /* Likely to not happen after a reboot */
+	u64 block;
+	const char *message;
+
+	sector_t dev_start;
+	sector_t dev_len;
+	struct block_device *dev;
+
+	sector_t hash_dev_start;
+	sector_t hash_dev_len;
+	struct block_device *hash_dev;
+
+	/* Final behavior after all notifications are completed. */
+	int behavior;
+};
+
+/* This enum must be matched to allowed_error_behaviors in dm-verity.c */
+enum dm_verity_error_behavior {
+	DM_VERITY_ERROR_BEHAVIOR_EIO = 0,
+	DM_VERITY_ERROR_BEHAVIOR_PANIC,
+	DM_VERITY_ERROR_BEHAVIOR_NONE,
+	DM_VERITY_ERROR_BEHAVIOR_NOTIFY
+};
+
+int dm_verity_register_error_notifier(struct notifier_block *nb);
+int dm_verity_unregister_error_notifier(struct notifier_block *nb);
+
 static inline struct ahash_request *verity_io_hash_req(struct dm_verity *v,
 						     struct dm_verity_io *io)
 {

diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
index f5c8022..ef8bcd6 100644
--- a/drivers/net/ethernet/google/gve/gve.h
+++ b/drivers/net/ethernet/google/gve/gve.h

@@ -50,6 +50,8 @@
 	struct page *page;
 	void *page_address;
 	u32 page_offset; /* offset to write to in page */
+	int pagecnt_bias; /* expected pagecnt if only the driver has a ref */
+	bool can_flip; /* page can be flipped and reused */
 };
 
 /* A list of pages registered with the device during setup and used by a queue
@@ -68,6 +70,7 @@
 	dma_addr_t data_bus; /* dma mapping of the slots */
 	struct gve_rx_slot_page_info *page_info; /* page info of the buffers */
 	struct gve_queue_page_list *qpl; /* qpl assigned to this queue */
+	bool raw_addressing; /* use raw_addressing? */
 };
 
 struct gve_priv;
@@ -82,11 +85,13 @@
 	u32 cnt; /* free-running total number of completed packets */
 	u32 fill_cnt; /* free-running total number of descs and buffs posted */
 	u32 mask; /* masks the cnt and fill_cnt to the size of the ring */
+	u32 db_threshold; /* threshold for posting new buffs and descs */
 	u64 rx_copybreak_pkt; /* free-running count of copybreak packets */
 	u64 rx_copied_pkt; /* free-running total number of copied packets */
 	u64 rx_skb_alloc_fail; /* free-running count of skb alloc fails */
 	u64 rx_buf_alloc_fail; /* free-running count of buffer alloc fails */
 	u64 rx_desc_err_dropped_pkt; /* free-running count of packets dropped by descriptor error */
+	u64 rx_no_refill_dropped_pkt; /* free-running count of packets dropped because of lack of buffer refill */
 	u32 q_num; /* queue index */
 	u32 ntfy_id; /* notification block index */
 	struct gve_queue_resources *q_resources; /* head and tail pointer idx */
@@ -107,12 +112,20 @@
 	u32 iov_padding; /* padding associated with this segment */
 };
 
+struct gve_tx_dma_buf {
+	DEFINE_DMA_UNMAP_ADDR(dma);
+	DEFINE_DMA_UNMAP_LEN(len);
+};
+
 /* Tracks the memory in the fifo occupied by the skb. Mapped 1:1 to a desc
  * ring entry but only used for a pkt_desc not a seg_desc
  */
 struct gve_tx_buffer_state {
 	struct sk_buff *skb; /* skb for this pkt */
-	struct gve_tx_iovec iov[GVE_TX_MAX_IOVEC]; /* segments of this pkt */
+	union {
+		struct gve_tx_iovec iov[GVE_TX_MAX_IOVEC]; /* segments of this pkt */
+		struct gve_tx_dma_buf buf;
+	};
 };
 
 /* A TX buffer - each queue has one */
@@ -135,13 +148,16 @@
 	__be32 last_nic_done ____cacheline_aligned; /* NIC tail pointer */
 	u64 pkt_done; /* free-running - total packets completed */
 	u64 bytes_done; /* free-running - total bytes completed */
+	u32 dropped_pkt; /* free-running - total packets dropped */
 
 	/* Cacheline 2 -- Read-mostly fields */
 	union gve_tx_desc *desc ____cacheline_aligned;
 	struct gve_tx_buffer_state *info; /* Maps 1:1 to a desc */
 	struct netdev_queue *netdev_txq;
 	struct gve_queue_resources *q_resources; /* head and tail pointer idx */
+	struct device *dev;
 	u32 mask; /* masks req and done down to queue size */
+	bool raw_addressing; /* use raw_addressing? */
 
 	/* Slow-path fields */
 	u32 q_num ____cacheline_aligned; /* queue idx */
@@ -157,13 +173,13 @@
  * associated with that irq.
  */
 struct gve_notify_block {
-	__be32 irq_db_index; /* idx into Bar2 - set by device, must be 1st */
+	__be32 *irq_db_index; /* pointer to idx into Bar2 */
 	char name[IFNAMSIZ + 16]; /* name registered with the kernel */
 	struct napi_struct napi; /* kernel napi struct for this block */
 	struct gve_priv *priv;
 	struct gve_tx_ring *tx; /* tx rings on this block */
 	struct gve_rx_ring *rx; /* rx rings on this block */
-} ____cacheline_aligned;
+};
 
 /* Tracks allowed and current queue settings */
 struct gve_queue_config {
@@ -177,13 +193,18 @@
 	unsigned long *qpl_id_map; /* bitmap of used qpl ids */
 };
 
+struct gve_irq_db {
+	__be32 index;
+} ____cacheline_aligned;
+
 struct gve_priv {
 	struct net_device *dev;
 	struct gve_tx_ring *tx; /* array of tx_cfg.num_queues */
 	struct gve_rx_ring *rx; /* array of rx_cfg.num_queues */
 	struct gve_queue_page_list *qpls; /* array of num qpls */
 	struct gve_notify_block *ntfy_blocks; /* array of num_ntfy_blks */
-	dma_addr_t ntfy_block_bus;
+	struct gve_irq_db *irq_db_indices; /* array of num_ntfy_blks */
+	dma_addr_t irq_db_indices_bus;
 	struct msix_entry *msix_vectors; /* array of num_ntfy_blks + 1 */
 	char mgmt_msix_name[IFNAMSIZ + 16];
 	u32 mgmt_msix_idx;
@@ -194,11 +215,12 @@
 	u16 tx_desc_cnt; /* num desc per ring */
 	u16 rx_desc_cnt; /* num desc per ring */
 	u16 tx_pages_per_qpl; /* tx buffer length */
-	u16 rx_pages_per_qpl; /* rx buffer length */
+	u16 rx_data_slot_cnt; /* rx buffer length */
 	u64 max_registered_pages;
 	u64 num_registered_pages; /* num pages registered with NIC */
 	u32 rx_copybreak; /* copy packets smaller than this */
 	u16 default_num_queues; /* default num queues to set up */
+	bool raw_addressing; /* true if this dev supports raw addressing */
 
 	struct gve_queue_config tx_cfg;
 	struct gve_queue_config rx_cfg;
@@ -415,7 +437,7 @@
 static inline __be32 __iomem *gve_irq_doorbell(struct gve_priv *priv,
 					       struct gve_notify_block *block)
 {
-	return &priv->db_bar2[be32_to_cpu(block->irq_db_index)];
+	return &priv->db_bar2[be32_to_cpu(*block->irq_db_index)];
 }
 
 /* Returns the index into ntfy_blocks of the given tx ring's block
@@ -436,14 +458,22 @@
  */
 static inline u32 gve_num_tx_qpls(struct gve_priv *priv)
 {
-	return priv->tx_cfg.num_queues;
+	if (priv->raw_addressing) {
+		return 0;
+	} else {
+		return priv->tx_cfg.num_queues;
+	}
 }
 
 /* Returns the number of rx queue page lists
  */
 static inline u32 gve_num_rx_qpls(struct gve_priv *priv)
 {
-	return priv->rx_cfg.num_queues;
+	if (priv->raw_addressing) {
+		return 0;
+	} else {
+		return priv->rx_cfg.num_queues;
+	}
 }
 
 /* Returns a pointer to the next available tx qpl in the list of qpls
@@ -497,15 +527,6 @@
 		return DMA_FROM_DEVICE;
 }
 
-/* Returns true if the max mtu allows page recycling */
-static inline bool gve_can_recycle_pages(struct net_device *dev)
-{
-	/* We can't recycle the pages if we can't fit a packet into half a
-	 * page.
-	 */
-	return dev->max_mtu <= PAGE_SIZE / 2;
-}
-
 /* buffers */
 int gve_alloc_page(struct gve_priv *priv, struct device *dev,
 		   struct page **page, dma_addr_t *dma,

diff --git a/drivers/net/ethernet/google/gve/gve_adminq.c b/drivers/net/ethernet/google/gve/gve_adminq.c
index 6009d76..862ca8b 100644
--- a/drivers/net/ethernet/google/gve/gve_adminq.c
+++ b/drivers/net/ethernet/google/gve/gve_adminq.c

@@ -298,7 +298,7 @@
 		.num_counters = cpu_to_be32(num_counters),
 		.irq_db_addr = cpu_to_be64(db_array_bus_addr),
 		.num_irq_dbs = cpu_to_be32(num_ntfy_blks),
-		.irq_db_stride = cpu_to_be32(sizeof(priv->ntfy_blocks[0])),
+		.irq_db_stride = cpu_to_be32(sizeof(*priv->irq_db_indices)),
 		.ntfy_blk_msix_base_idx =
 					cpu_to_be32(GVE_NTFY_BLK_BASE_MSIX_IDX),
 	};
@@ -320,8 +320,12 @@
 {
 	struct gve_tx_ring *tx = &priv->tx[queue_index];
 	union gve_adminq_command cmd;
+        u32 qpl_id;
 	int err;
 
+	qpl_id = priv->raw_addressing ? GVE_RAW_ADDRESSING_QPL_ID :
+	tx->tx_fifo.qpl->id;
+ 
 	memset(&cmd, 0, sizeof(cmd));
 	cmd.opcode = cpu_to_be32(GVE_ADMINQ_CREATE_TX_QUEUE);
 	cmd.create_tx_queue = (struct gve_adminq_create_tx_queue) {
@@ -330,7 +334,7 @@
 		.queue_resources_addr =
 			cpu_to_be64(tx->q_resources_bus),
 		.tx_ring_addr = cpu_to_be64(tx->bus),
-		.queue_page_list_id = cpu_to_be32(tx->tx_fifo.qpl->id),
+		.queue_page_list_id = cpu_to_be32(qpl_id),
 		.ntfy_id = cpu_to_be32(tx->ntfy_id),
 	};
 
@@ -359,8 +363,12 @@
 {
 	struct gve_rx_ring *rx = &priv->rx[queue_index];
 	union gve_adminq_command cmd;
+        u32 qpl_id;
 	int err;
 
+        qpl_id = priv->raw_addressing ? GVE_RAW_ADDRESSING_QPL_ID :
+        rx->data.qpl->id;
+ 
 	memset(&cmd, 0, sizeof(cmd));
 	cmd.opcode = cpu_to_be32(GVE_ADMINQ_CREATE_RX_QUEUE);
 	cmd.create_rx_queue = (struct gve_adminq_create_rx_queue) {
@@ -371,7 +379,7 @@
 		.queue_resources_addr = cpu_to_be64(rx->q_resources_bus),
 		.rx_desc_ring_addr = cpu_to_be64(rx->desc.bus),
 		.rx_data_ring_addr = cpu_to_be64(rx->data.data_bus),
-		.queue_page_list_id = cpu_to_be32(rx->data.qpl->id),
+		.queue_page_list_id = cpu_to_be32(qpl_id),
 	};
 
 	err = gve_adminq_issue_cmd(priv, &cmd);
@@ -462,11 +470,14 @@
 int gve_adminq_describe_device(struct gve_priv *priv)
 {
 	struct gve_device_descriptor *descriptor;
+	struct gve_device_option *dev_opt;
 	union gve_adminq_command cmd;
 	dma_addr_t descriptor_bus;
+	u16 num_options;
 	int err = 0;
 	u8 *mac;
 	u16 mtu;
+	int i;
 
 	memset(&cmd, 0, sizeof(cmd));
 	descriptor = dma_alloc_coherent(&priv->pdev->dev, PAGE_SIZE,
@@ -513,16 +524,61 @@
 	mac = descriptor->mac;
 	dev_info(&priv->pdev->dev, "MAC addr: %pM\n", mac);
 	priv->tx_pages_per_qpl = be16_to_cpu(descriptor->tx_pages_per_qpl);
-	priv->rx_pages_per_qpl = be16_to_cpu(descriptor->rx_pages_per_qpl);
-	if (priv->rx_pages_per_qpl < priv->rx_desc_cnt) {
-		dev_err(&priv->pdev->dev, "rx_pages_per_qpl cannot be smaller than rx_desc_cnt, setting rx_desc_cnt down to %d.\n",
-			priv->rx_pages_per_qpl);
-		priv->rx_desc_cnt = priv->rx_pages_per_qpl;
+        priv->rx_data_slot_cnt = be16_to_cpu(descriptor->rx_pages_per_qpl);
+        if (priv->rx_data_slot_cnt < priv->rx_desc_cnt) {
+                dev_err(&priv->pdev->dev, "rx_pages_per_qpl cannot be smaller than rx_desc_cnt, setting rx_desc_cnt down to %d.\n",
+			priv->rx_data_slot_cnt);
+		priv->rx_desc_cnt = priv->rx_data_slot_cnt;
 	}
 	priv->default_num_queues = be16_to_cpu(descriptor->default_num_queues);
+        dev_opt = (struct gve_device_option *)((void *)descriptor +
+							sizeof(*descriptor));
 
+	num_options = be16_to_cpu(descriptor->num_device_options);
+	for (i = 0; i < num_options; i++) {
+		u16 option_id;
+		u16 option_length;
+
+		if ((void *)dev_opt + sizeof(*dev_opt)  > (void *)descriptor +
+				      be16_to_cpu(descriptor->total_length)) {
+			dev_err(&priv->dev->dev,
+				  "num_options in device_descriptor does not match total length.\n");
+			err = -EINVAL;
+			goto free_device_descriptor;
+		}
+
+		option_id = be16_to_cpu(dev_opt->option_id);
+		option_length = be16_to_cpu(dev_opt->option_length);
+		switch(option_id) {
+		case GVE_DEV_OPT_ID_RAW_ADDRESSING:
+			/* If the length or feature mask doesn't match,
+			 * continue without enabling the feature.
+			 */
+			if (option_length != GVE_DEV_OPT_LEN_RAW_ADDRESSING ||
+			    be32_to_cpu(dev_opt->feat_mask) !=
+			    GVE_DEV_OPT_FEAT_MASK_RAW_ADDRESSING) {
+				dev_info(&priv->pdev->dev,
+					   "Raw addressing device option not enabled, length or features mask did not match expected.\n");
+				priv->raw_addressing = false;
+			} else {
+				dev_info(&priv->pdev->dev,
+					   "Raw addressing device option enabled.\n");
+				priv->raw_addressing = true;
+			}
+			break;
+		default:
+			/* If we don't recognize the option just continue
+			 * without doing anything.
+			 */
+			dev_info(&priv->pdev->dev,
+				   "Unrecognized device option 0x%hx not enabled.\n",
+				   option_id);
+			break;
+		}
+		dev_opt = (void *)dev_opt + sizeof(*dev_opt) + option_length;
+	} 
 free_device_descriptor:
-	dma_free_coherent(&priv->pdev->dev, sizeof(*descriptor), descriptor,
+	dma_free_coherent(&priv->pdev->dev, PAGE_SIZE, descriptor,
 			  descriptor_bus);
 	return err;
 }

diff --git a/drivers/net/ethernet/google/gve/gve_adminq.h b/drivers/net/ethernet/google/gve/gve_adminq.h
index 015796a..d320c2f 100644
--- a/drivers/net/ethernet/google/gve/gve_adminq.h
+++ b/drivers/net/ethernet/google/gve/gve_adminq.h

@@ -79,12 +79,17 @@
 
 static_assert(sizeof(struct gve_device_descriptor) == 40);
 
-struct device_option {
-	__be32 option_id;
-	__be32 option_length;
+struct gve_device_option {
+	__be16 option_id;
+	__be16 option_length;
+	__be32 feat_mask;
 };
 
-static_assert(sizeof(struct device_option) == 8);
+static_assert(sizeof(struct gve_device_option) == 8);
+
+#define GVE_DEV_OPT_ID_RAW_ADDRESSING 0x1
+#define GVE_DEV_OPT_LEN_RAW_ADDRESSING 0x0
+#define GVE_DEV_OPT_FEAT_MASK_RAW_ADDRESSING 0x0
 
 struct gve_adminq_configure_device_resources {
 	__be64 counter_array;
@@ -111,6 +116,8 @@
 
 static_assert(sizeof(struct gve_adminq_unregister_page_list) == 4);
 
+#define GVE_RAW_ADDRESSING_QPL_ID 0xFFFFFFFF
+
 struct gve_adminq_create_tx_queue {
 	__be32 queue_id;
 	__be32 reserved;

diff --git a/drivers/net/ethernet/google/gve/gve_desc.h b/drivers/net/ethernet/google/gve/gve_desc.h
index 54779871..a7da364 100644
--- a/drivers/net/ethernet/google/gve/gve_desc.h
+++ b/drivers/net/ethernet/google/gve/gve_desc.h

@@ -16,9 +16,11 @@
  * Base addresses encoded in seg_addr are not assumed to be physical
  * addresses. The ring format assumes these come from some linear address
  * space. This could be physical memory, kernel virtual memory, user virtual
- * memory. gVNIC uses lists of registered pages. Each queue is assumed
- * to be associated with a single such linear address space to ensure a
- * consistent meaning for seg_addrs posted to its rings.
+ * memory.
+ * If raw dma addressing is not supported then gVNIC uses lists of registered
+ * pages. Each queue is assumed to be associated with a single such linear
+ * address space to ensure a consistent meaning for seg_addrs posted to its
+ * rings.
  */
 
 struct gve_tx_pkt_desc {
@@ -72,12 +74,14 @@
 } __packed;
 static_assert(sizeof(struct gve_rx_desc) == 64);
 
-/* As with the Tx ring format, the qpl_offset entries below are offsets into an
- * ordered list of registered pages.
+/* If the device supports raw dma addressing then the addr in data slot is
+ * the dma address of the buffer.
+ * If the device only supports registered segments than the addr is a byte
+ * offset into the registered segment (an ordered list of pages) where the
+ * buffer is.
  */
 struct gve_rx_data_slot {
-	/* byte offset into the rx registered segment of this slot */
-	__be64 qpl_offset;
+	__be64 addr;
 };
 
 /* GVE Recive Packet Descriptor Seq No */

diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index 0b714b6..1fc660a 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c

@@ -246,15 +246,24 @@
 		dev_err(&priv->pdev->dev, "Did not receive management vector.\n");
 		goto abort_with_msix_enabled;
 	}
-	priv->ntfy_blocks =
+
+	priv->irq_db_indices =
 		dma_alloc_coherent(&priv->pdev->dev,
 				   priv->num_ntfy_blks *
-				   sizeof(*priv->ntfy_blocks),
-				   &priv->ntfy_block_bus, GFP_KERNEL);
-	if (!priv->ntfy_blocks) {
+				   sizeof(*priv->irq_db_indices),
+				   &priv->irq_db_indices_bus, GFP_KERNEL);
+	if (!priv->irq_db_indices) {
 		err = -ENOMEM;
 		goto abort_with_mgmt_vector;
 	}
+
+	priv->ntfy_blocks = kvzalloc(priv->num_ntfy_blks *
+				     sizeof(*priv->ntfy_blocks), GFP_KERNEL);
+	if (!priv->ntfy_blocks) {
+		err = -ENOMEM;
+		goto abort_with_irq_db_indices;
+	}
+
 	/* Setup the other blocks - the first n-1 vectors */
 	for (i = 0; i < priv->num_ntfy_blks; i++) {
 		struct gve_notify_block *block = &priv->ntfy_blocks[i];
@@ -272,6 +281,7 @@
 		}
 		irq_set_affinity_hint(priv->msix_vectors[msix_idx].vector,
 				      get_cpu_mask(i % active_cpus));
+		block->irq_db_index = &priv->irq_db_indices[i].index;
 	}
 	return 0;
 abort_with_some_ntfy_blocks:
@@ -283,10 +293,13 @@
 				      NULL);
 		free_irq(priv->msix_vectors[msix_idx].vector, block);
 	}
-	dma_free_coherent(&priv->pdev->dev, priv->num_ntfy_blks *
-			  sizeof(*priv->ntfy_blocks),
-			  priv->ntfy_blocks, priv->ntfy_block_bus);
+	kvfree(priv->ntfy_blocks);
 	priv->ntfy_blocks = NULL;
+abort_with_irq_db_indices:
+	dma_free_coherent(&priv->pdev->dev, priv->num_ntfy_blks *
+			  sizeof(*priv->irq_db_indices),
+			  priv->irq_db_indices, priv->irq_db_indices_bus);
+	priv->irq_db_indices = NULL;
 abort_with_mgmt_vector:
 	free_irq(priv->msix_vectors[priv->mgmt_msix_idx].vector, priv);
 abort_with_msix_enabled:
@@ -313,11 +326,13 @@
 		}
 		free_irq(priv->msix_vectors[priv->mgmt_msix_idx].vector, priv);
 	}
-	dma_free_coherent(&priv->pdev->dev,
-			  priv->num_ntfy_blks * sizeof(*priv->ntfy_blocks),
-			  priv->ntfy_blocks, priv->ntfy_block_bus);
+        kvfree(priv->ntfy_blocks);
 	priv->ntfy_blocks = NULL;
-	pci_disable_msix(priv->pdev);
+	dma_free_coherent(&priv->pdev->dev, priv->num_ntfy_blks *
+			  sizeof(*priv->irq_db_indices),
+			  priv->irq_db_indices, priv->irq_db_indices_bus);
+	priv->irq_db_indices = NULL; 
+        pci_disable_msix(priv->pdev);
 	kvfree(priv->msix_vectors);
 	priv->msix_vectors = NULL;
 }
@@ -338,7 +353,7 @@
 	err = gve_adminq_configure_device_resources(priv,
 						    priv->counter_array_bus,
 						    priv->num_event_counters,
-						    priv->ntfy_block_bus,
+						    priv->irq_db_indices_bus,
 						    priv->num_ntfy_blks);
 	if (unlikely(err)) {
 		dev_err(&priv->pdev->dev,
@@ -598,6 +613,7 @@
 	if (dma_mapping_error(dev, *dma)) {
 		priv->dma_mapping_error++;
 		put_page(*page);
+		*page = NULL;
 		return -ENOMEM;
 	}
 	return 0;
@@ -680,6 +696,10 @@
 	int i, j;
 	int err;
 
+	/* Raw addressing means no QPLs */
+	if (priv->raw_addressing)
+		return 0;
+
 	priv->qpls = kvzalloc(num_qpls * sizeof(*priv->qpls), GFP_KERNEL);
 	if (!priv->qpls)
 		return -ENOMEM;
@@ -692,7 +712,7 @@
 	}
 	for (; i < num_qpls; i++) {
 		err = gve_alloc_queue_page_list(priv, i,
-						priv->rx_pages_per_qpl);
+						priv->rx_data_slot_cnt);
 		if (err)
 			goto free_qpls;
 	}
@@ -720,6 +740,10 @@
 	int num_qpls = gve_num_tx_qpls(priv) + gve_num_rx_qpls(priv);
 	int i;
 
+	/* Raw addressing means no QPLs */
+	if (priv->raw_addressing)
+		return;
+
 	kvfree(priv->qpl_cfg.qpl_id_map);
 
 	for (i = 0; i < num_qpls; i++)
@@ -1080,6 +1104,7 @@
 	if (skip_describe_device)
 		goto setup_device;
 
+	priv->raw_addressing = false;
 	/* Get the initial information we need from the device */
 	err = gve_adminq_describe_device(priv);
 	if (err) {

diff --git a/drivers/net/ethernet/google/gve/gve_rx.c b/drivers/net/ethernet/google/gve/gve_rx.c
index 008fa89..a660558 100644
--- a/drivers/net/ethernet/google/gve/gve_rx.c
+++ b/drivers/net/ethernet/google/gve/gve_rx.c

@@ -16,12 +16,41 @@
 	block->rx = NULL;
 }
 
+static void gve_rx_free_buffer(struct device *dev,
+			       struct gve_rx_slot_page_info *page_info,
+			       struct gve_rx_data_slot *data_slot) {
+	dma_addr_t dma = (dma_addr_t)(be64_to_cpu(data_slot->addr) -
+				      page_info->page_offset);
+
+	page_ref_sub(page_info->page, page_info->pagecnt_bias - 1);
+	gve_free_page(dev, page_info->page, dma, DMA_FROM_DEVICE);
+}
+
+static void gve_rx_unfill_pages(struct gve_priv *priv, struct gve_rx_ring *rx) {
+	u32 slots = rx->mask + 1;
+	int i;
+
+	if (rx->data.raw_addressing) {
+		for (i = 0; i < slots; i++)
+			gve_rx_free_buffer(&priv->pdev->dev, &rx->data.page_info[i],
+					   &rx->data.data_ring[i]);
+	} else {
+		for (i = 0; i < slots; i++)
+			page_ref_sub(rx->data.page_info[i].page,
+				     rx->data.page_info[i].pagecnt_bias - 1);
+		gve_unassign_qpl(priv, rx->data.qpl->id);
+		rx->data.qpl = NULL;
+	}
+	kfree(rx->data.page_info);
+	rx->data.page_info = NULL;
+}
+
 static void gve_rx_free_ring(struct gve_priv *priv, int idx)
 {
 	struct gve_rx_ring *rx = &priv->rx[idx];
 	struct device *dev = &priv->pdev->dev;
 	size_t bytes;
-	u32 slots;
+	u32 slots = rx->mask + 1;
 
 	gve_rx_remove_from_block(priv, idx);
 
@@ -33,11 +62,8 @@
 			  rx->q_resources, rx->q_resources_bus);
 	rx->q_resources = NULL;
 
-	gve_unassign_qpl(priv, rx->data.qpl->id);
-	rx->data.qpl = NULL;
-	kvfree(rx->data.page_info);
+	gve_rx_unfill_pages(priv, rx);
 
-	slots = rx->mask + 1;
 	bytes = sizeof(*rx->data.data_ring) * slots;
 	dma_free_coherent(dev, bytes, rx->data.data_ring,
 			  rx->data.data_bus);
@@ -52,13 +78,17 @@
 	page_info->page = page;
 	page_info->page_offset = 0;
 	page_info->page_address = page_address(page);
-	slot->qpl_offset = cpu_to_be64(addr);
+	slot->addr = cpu_to_be64(addr);
+	/* The page already has 1 ref */
+	page_ref_add(page, INT_MAX - 1);
+	page_info->pagecnt_bias = INT_MAX;
 }
 
 static int gve_prefill_rx_pages(struct gve_rx_ring *rx)
 {
 	struct gve_priv *priv = rx->gve;
 	u32 slots;
+	int err;
 	int i;
 
 	/* Allocate one page per Rx queue slot. Each page is split into two
@@ -71,12 +101,31 @@
 	if (!rx->data.page_info)
 		return -ENOMEM;
 
-	rx->data.qpl = gve_assign_rx_qpl(priv);
-
+	if (!rx->data.raw_addressing)
+		rx->data.qpl = gve_assign_rx_qpl(priv);
 	for (i = 0; i < slots; i++) {
-		struct page *page = rx->data.qpl->pages[i];
-		dma_addr_t addr = i * PAGE_SIZE;
+		struct page *page;
+		dma_addr_t addr;
 
+		if (rx->data.raw_addressing) {
+			err = gve_alloc_page(priv, &priv->pdev->dev, &page,
+					     &addr, DMA_FROM_DEVICE);
+			if (err) {
+				int j;
+
+				u64_stats_update_begin(&rx->statss);
+				rx->rx_buf_alloc_fail++;
+				u64_stats_update_end(&rx->statss);
+				for (j = 0; j < i; j++)
+					gve_rx_free_buffer(&priv->pdev->dev,
+							 &rx->data.page_info[j],
+							 &rx->data.data_ring[j]);
+				return err;
+			}
+		} else {
+			page = rx->data.qpl->pages[i];
+			addr = i * PAGE_SIZE;
+		}
 		gve_setup_rx_buffer(&rx->data.page_info[i],
 				    &rx->data.data_ring[i], addr, page);
 	}
@@ -110,8 +159,9 @@
 	rx->gve = priv;
 	rx->q_num = idx;
 
-	slots = priv->rx_pages_per_qpl;
+	slots = priv->rx_data_slot_cnt;
 	rx->mask = slots - 1;
+	rx->data.raw_addressing = priv->raw_addressing;
 
 	/* alloc rx data ring */
 	bytes = sizeof(*rx->data.data_ring) * slots;
@@ -156,8 +206,8 @@
 		err = -ENOMEM;
 		goto abort_with_q_resources;
 	}
-	rx->mask = slots - 1;
 	rx->cnt = 0;
+	rx->db_threshold = priv->rx_desc_cnt / 2;
 	rx->desc.seqno = 1;
 	gve_rx_add_to_block(priv, idx);
 
@@ -168,7 +218,7 @@
 			  rx->q_resources, rx->q_resources_bus);
 	rx->q_resources = NULL;
 abort_filled:
-	kvfree(rx->data.page_info);
+	gve_rx_unfill_pages(priv, rx);
 abort_with_slots:
 	bytes = sizeof(*rx->data.data_ring) * slots;
 	dma_free_coherent(hdev, bytes, rx->data.data_ring, rx->data.data_bus);
@@ -225,8 +275,7 @@
 	return PKT_HASH_TYPE_L2;
 }
 
-static struct sk_buff *gve_rx_copy(struct gve_rx_ring *rx,
-				   struct net_device *dev,
+static struct sk_buff *gve_rx_copy(struct net_device *dev,
 				   struct napi_struct *napi,
 				   struct gve_rx_slot_page_info *page_info,
 				   u16 len)
@@ -244,15 +293,10 @@
 
 	skb->protocol = eth_type_trans(skb, dev);
 
-	u64_stats_update_begin(&rx->statss);
-	rx->rx_copied_pkt++;
-	u64_stats_update_end(&rx->statss);
-
 	return skb;
 }
 
-static struct sk_buff *gve_rx_add_frags(struct net_device *dev,
-					struct napi_struct *napi,
+static struct sk_buff *gve_rx_add_frags(struct napi_struct *napi,
 					struct gve_rx_slot_page_info *page_info,
 					u16 len)
 {
@@ -268,14 +312,134 @@
 	return skb;
 }
 
-static void gve_rx_flip_buff(struct gve_rx_slot_page_info *page_info,
-			     struct gve_rx_data_slot *data_ring)
+static int gve_rx_alloc_buffer(struct gve_priv *priv, struct device *dev,
+			       struct gve_rx_slot_page_info *page_info,
+			       struct gve_rx_data_slot *data_slot,
+			       struct gve_rx_ring *rx)
 {
-	u64 addr = be64_to_cpu(data_ring->qpl_offset);
+	struct page *page;
+	dma_addr_t dma;
+	int err;
 
+	err = gve_alloc_page(priv, dev, &page, &dma, DMA_FROM_DEVICE);
+	if (err) {
+		u64_stats_update_begin(&rx->statss);
+		rx->rx_buf_alloc_fail++;
+		u64_stats_update_end(&rx->statss);
+		return err;
+	}
+
+	gve_setup_rx_buffer(page_info, data_slot, dma, page);
+	return 0;
+}
+
+static void gve_rx_flip_buffer(struct gve_rx_slot_page_info *page_info,
+			       struct gve_rx_data_slot *data_slot)
+{
+	u64 addr = be64_to_cpu(data_slot->addr);
+
+	/* "flip" to other packet buffer on this page */
 	page_info->page_offset ^= PAGE_SIZE / 2;
 	addr ^= PAGE_SIZE / 2;
-	data_ring->qpl_offset = cpu_to_be64(addr);
+	data_slot->addr = cpu_to_be64(addr);
+}
+
+static bool gve_rx_can_flip_buffers(struct net_device *netdev) {
+#if PAGE_SIZE == 4096
+	/* We can't flip a buffer if we can't fit a packet
+	 * into half a page.
+	 */
+	if (netdev->max_mtu + GVE_RX_PAD + ETH_HLEN  > PAGE_SIZE / 2)
+		return false;
+	return true;
+#else
+	/* PAGE_SIZE != 4096 - don't try to reuse */
+	return false;
+#endif
+}
+
+static int gve_rx_can_recycle_buffer(struct gve_rx_slot_page_info *page_info)
+{
+	int pagecount = page_count(page_info->page);
+
+	/* This page is not being used by any SKBs - reuse */
+	if (pagecount == page_info->pagecnt_bias) {
+		return 1;
+	/* This page is still being used by an SKB - we can't reuse */
+	} else if (pagecount > page_info->pagecnt_bias) {
+		return 0;
+	} else {
+		WARN(pagecount < page_info->pagecnt_bias,
+		     "Pagecount should never be less than the bias.");
+		return -1;
+	}
+}
+
+static void gve_rx_update_pagecnt_bias(struct gve_rx_slot_page_info *page_info)
+{
+	page_info->pagecnt_bias--;
+	if (page_info->pagecnt_bias == 0) {
+		int pagecount = page_count(page_info->page);
+
+		/* If we have run out of bias - set it back up to INT_MAX
+		 * minus the existing refs.
+		 */
+		page_info->pagecnt_bias = INT_MAX - (pagecount);
+		/* Set pagecount back up to max */
+		page_ref_add(page_info->page, INT_MAX - pagecount);
+	}
+}
+
+static struct sk_buff *
+gve_rx_raw_addressing(struct device *dev, struct net_device *netdev,
+		      struct gve_rx_slot_page_info *page_info, u16 len,
+		      struct napi_struct *napi,
+		      struct gve_rx_data_slot *data_slot, bool can_flip)
+{
+	struct sk_buff *skb = gve_rx_add_frags(napi, page_info, len);
+
+	if (!skb)
+		return NULL;
+
+	/* Optimistically stop the kernel from freeing the page.
+	 * We will check again in refill to determine if we need to alloc a
+	 * new page.
+	 */
+	gve_rx_update_pagecnt_bias(page_info);
+	page_info->can_flip = can_flip;
+
+	return skb;
+}
+
+static struct sk_buff *
+gve_rx_qpl(struct device *dev, struct net_device *netdev,
+	   struct gve_rx_ring *rx, struct gve_rx_slot_page_info *page_info,
+	   u16 len, struct napi_struct *napi,
+	   struct gve_rx_data_slot *data_slot, bool recycle)
+{
+	struct sk_buff *skb;
+	/* if raw_addressing mode is not enabled gvnic can only receive into
+	 * registered segments. If the buffer can't be recycled, our only
+	 * choice is to copy the data out of it so that we can return it to the
+	 * device.
+	 */
+	if (recycle) {
+		skb = gve_rx_add_frags(napi, page_info, len);
+		/* No point in recycling if we didn't get the skb */
+		if (skb) {
+			/* Make sure the networking stack can't free the page */
+			gve_rx_update_pagecnt_bias(page_info);
+			gve_rx_flip_buffer(page_info, data_slot);
+		}
+	} else {
+		skb = gve_rx_copy(netdev, napi, page_info, len);
+		if (skb) {
+			u64_stats_update_begin(&rx->statss);
+			rx->rx_copied_pkt++;
+			u64_stats_update_end(&rx->statss);
+		}
+	}
+	return skb;
 }
 
 static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
@@ -284,9 +448,10 @@
 	struct gve_rx_slot_page_info *page_info;
 	struct gve_priv *priv = rx->gve;
 	struct napi_struct *napi = &priv->ntfy_blocks[rx->ntfy_id].napi;
-	struct net_device *dev = priv->dev;
-	struct sk_buff *skb;
-	int pagecount;
+	struct net_device *netdev = priv->dev;
+	struct gve_rx_data_slot *data_slot;
+	struct sk_buff *skb = NULL;
+	dma_addr_t page_bus;
 	u16 len;
 
 	/* drop this packet */
@@ -294,71 +459,55 @@
 		u64_stats_update_begin(&rx->statss);
 		rx->rx_desc_err_dropped_pkt++;
 		u64_stats_update_end(&rx->statss);
-		return true;
+		return false;
 	}
 
 	len = be16_to_cpu(rx_desc->len) - GVE_RX_PAD;
 	page_info = &rx->data.page_info[idx];
-	dma_sync_single_for_cpu(&priv->pdev->dev, rx->data.qpl->page_buses[idx],
+	data_slot = &rx->data.data_ring[idx];
+	page_bus = (rx->data.raw_addressing) ?
+					be64_to_cpu(data_slot->addr) - page_info->page_offset:
+					rx->data.qpl->page_buses[idx];									
+	dma_sync_single_for_cpu(&priv->pdev->dev, page_bus,
 				PAGE_SIZE, DMA_FROM_DEVICE);
-
-	/* gvnic can only receive into registered segments. If the buffer
-	 * can't be recycled, our only choice is to copy the data out of
-	 * it so that we can return it to the device.
-	 */
-
-	if (PAGE_SIZE == 4096) {
-		if (len <= priv->rx_copybreak) {
-			/* Just copy small packets */
-			skb = gve_rx_copy(rx, dev, napi, page_info, len);
-			u64_stats_update_begin(&rx->statss);
-			rx->rx_copybreak_pkt++;
-			u64_stats_update_end(&rx->statss);
-			goto have_skb;
-		}
-		if (unlikely(!gve_can_recycle_pages(dev))) {
-			skb = gve_rx_copy(rx, dev, napi, page_info, len);
-			goto have_skb;
-		}
-		pagecount = page_count(page_info->page);
-		if (pagecount == 1) {
-			/* No part of this page is used by any SKBs; we attach
-			 * the page fragment to a new SKB and pass it up the
-			 * stack.
-			 */
-			skb = gve_rx_add_frags(dev, napi, page_info, len);
-			if (!skb) {
+ 
+      	if (len <= priv->rx_copybreak) {
+		/* Just copy small packets */
+		skb = gve_rx_copy(netdev, napi, page_info, len);
+		if (skb) {
 				u64_stats_update_begin(&rx->statss);
-				rx->rx_skb_alloc_fail++;
+				rx->rx_copied_pkt++;
+				rx->rx_copybreak_pkt++;
 				u64_stats_update_end(&rx->statss);
-				return true;
 			}
-			/* Make sure the kernel stack can't release the page */
-			get_page(page_info->page);
-			/* "flip" to other packet buffer on this page */
-			gve_rx_flip_buff(page_info, &rx->data.data_ring[idx]);
-		} else if (pagecount >= 2) {
-			/* We have previously passed the other half of this
-			 * page up the stack, but it has not yet been freed.
-			 */
-			skb = gve_rx_copy(rx, dev, napi, page_info, len);
-		} else {
-			WARN(pagecount < 1, "Pagecount should never be < 1");
-			return false;
-		}
 	} else {
-		skb = gve_rx_copy(rx, dev, napi, page_info, len);
-	}
+                bool can_flip = gve_rx_can_flip_buffers(netdev);
+                int recycle = 0;
 
-have_skb:
-	/* We didn't manage to allocate an skb but we haven't had any
-	 * reset worthy failures.
-	 */
+		if (can_flip) {
+			recycle = gve_rx_can_recycle_buffer(page_info);
+			if (recycle < 0) {
+				gve_schedule_reset(priv);
+				return false;
+			}
+		}
+		if (rx->data.raw_addressing) {
+			skb = gve_rx_raw_addressing(&priv->pdev->dev, netdev,
+						    page_info, len, napi,
+						    data_slot,
+						    can_flip && recycle);
+                } else {
+			skb = gve_rx_qpl(&priv->pdev->dev, netdev, rx,
+					 page_info, len, napi, data_slot,
+					 can_flip && recycle);
+                }
+        }
+
 	if (!skb) {
 		u64_stats_update_begin(&rx->statss);
 		rx->rx_skb_alloc_fail++;
 		u64_stats_update_end(&rx->statss);
-		return true;
+		return false;
 	}
 
 	if (likely(feat & NETIF_F_RXCSUM)) {
@@ -380,6 +529,7 @@
 		napi_gro_frags(napi);
 	else
 		napi_gro_receive(napi, skb);
+
 	return true;
 }
 
@@ -392,26 +542,80 @@
 	next_idx = rx->cnt & rx->mask;
 	desc = rx->desc.desc_ring + next_idx;
 
+	/* make sure we have synchronized the seq no with the device */
+	smp_mb();
 	flags_seq = desc->flags_seq;
-	/* Make sure we have synchronized the seq no with the device */
-	smp_rmb();
+
 
 	return (GVE_SEQNO(flags_seq) == rx->desc.seqno);
 }
 
+static bool gve_rx_refill_buffers(struct gve_priv *priv, struct gve_rx_ring *rx)
+{
+	u32 fill_cnt = rx->fill_cnt;
+
+	while ((fill_cnt & rx->mask) != (rx->cnt & rx->mask)) {
+		u32 idx = fill_cnt & rx->mask;
+		struct gve_rx_slot_page_info *page_info =
+						&rx->data.page_info[idx];
+
+		if (page_info->can_flip) {
+			/* The other half of the page is free because it was
+			 * free when we processed the descriptor. Flip to it.
+			 */
+			struct gve_rx_data_slot *data_slot =
+						&rx->data.data_ring[idx];
+
+			gve_rx_flip_buffer(page_info, data_slot);
+			page_info->can_flip = false;
+		} else {
+			/* It is possible that the networking stack has already
+			 * finished processing all outstanding packets in the buffer
+			 * and it can be reused.
+			 * Flipping is unceccessary here - if the networking stack still
+			 * owns half the page it is impossible to tell which half. Either
+			 * the whole page is free or it needs to be replaced.
+			 */
+			int recycle = gve_rx_can_recycle_buffer(page_info);
+
+			if (recycle < 0) {
+				gve_schedule_reset(priv);
+				return false;
+			}
+			if (!recycle) {
+				/* We can't reuse the buffer - alloc a new one*/
+				struct gve_rx_data_slot *data_slot =
+						&rx->data.data_ring[idx];
+				struct device *dev = &priv->pdev->dev;
+
+				gve_rx_free_buffer(dev, page_info, data_slot);
+				page_info->page = NULL;
+				if (gve_rx_alloc_buffer(priv, dev, page_info,
+							data_slot, rx)) {
+					break;
+				}
+			}
+		}
+		fill_cnt++;
+	}
+	rx->fill_cnt = fill_cnt;
+	return true;
+}
+
 bool gve_clean_rx_done(struct gve_rx_ring *rx, int budget,
 		       netdev_features_t feat)
 {
 	struct gve_priv *priv = rx->gve;
+	u32 work_done = 0, packets = 0;
 	struct gve_rx_desc *desc;
 	u32 cnt = rx->cnt;
 	u32 idx = cnt & rx->mask;
-	u32 work_done = 0;
 	u64 bytes = 0;
 
 	desc = rx->desc.desc_ring + idx;
 	while ((GVE_SEQNO(desc->flags_seq) == rx->desc.seqno) &&
 	       work_done < budget) {
+		bool dropped;
 		netif_info(priv, rx_status, priv->dev,
 			   "[%d] idx=%d desc=%p desc->flags_seq=0x%x\n",
 			   rx->q_num, idx, desc, desc->flags_seq);
@@ -419,9 +623,11 @@
 			   "[%d] seqno=%d rx->desc.seqno=%d\n",
 			   rx->q_num, GVE_SEQNO(desc->flags_seq),
 			   rx->desc.seqno);
-		bytes += be16_to_cpu(desc->len) - GVE_RX_PAD;
-		if (!gve_rx(rx, desc, feat, idx))
-			gve_schedule_reset(priv);
+		dropped = !gve_rx(rx, desc, feat, idx);
+		if (!dropped) {
+			bytes += be16_to_cpu(desc->len) - GVE_RX_PAD;
+			packets++;
+		}
 		cnt++;
 		idx = cnt & rx->mask;
 		desc = rx->desc.desc_ring + idx;
@@ -433,13 +639,27 @@
 		return false;
 
 	u64_stats_update_begin(&rx->statss);
-	rx->rpackets += work_done;
+	rx->rpackets += packets;
 	rx->rbytes += bytes;
 	u64_stats_update_end(&rx->statss);
 	rx->cnt = cnt;
-	rx->fill_cnt += work_done;
+	/* restock ring slots */
+	if (!rx->data.raw_addressing) {
+		/* In QPL mode buffs are refilled as the desc are processed */
+		rx->fill_cnt += work_done;
+		dma_wmb();/* Ensure descs are visible before ringing doorbell */
+		gve_rx_write_doorbell(priv, rx);
+	} else if (rx->fill_cnt - cnt <= rx->db_threshold) {
+		/* In raw addressing mode buffs are only refilled if the avail
+		 * falls below a threshold.
+		 */
+		if(!gve_rx_refill_buffers(priv, rx))
+			return false;
+		/* restock desc ring slots */
+		dma_wmb();/* Ensure descs are visible before ringing doorbell */
+		gve_rx_write_doorbell(priv, rx);
+	}
 
-	gve_rx_write_doorbell(priv, rx);
 	return gve_rx_work_pending(rx);
 }
 

diff --git a/drivers/net/ethernet/google/gve/gve_tx.c b/drivers/net/ethernet/google/gve/gve_tx.c
index b653197..53e8caa 100644
--- a/drivers/net/ethernet/google/gve/gve_tx.c
+++ b/drivers/net/ethernet/google/gve/gve_tx.c

@@ -158,9 +158,11 @@
 			  tx->q_resources, tx->q_resources_bus);
 	tx->q_resources = NULL;
 
-	gve_tx_fifo_release(priv, &tx->tx_fifo);
-	gve_unassign_qpl(priv, tx->tx_fifo.qpl->id);
-	tx->tx_fifo.qpl = NULL;
+	if (!tx->raw_addressing) {
+		gve_tx_fifo_release(priv, &tx->tx_fifo);
+		gve_unassign_qpl(priv, tx->tx_fifo.qpl->id);
+		tx->tx_fifo.qpl = NULL;
+	}
 
 	bytes = sizeof(*tx->desc) * slots;
 	dma_free_coherent(hdev, bytes, tx->desc, tx->bus);
@@ -174,12 +176,16 @@
 
 static void gve_tx_add_to_block(struct gve_priv *priv, int queue_idx)
 {
+	unsigned int active_cpus = min_t(int, priv->num_ntfy_blks / 2,
+					 num_online_cpus());
 	int ntfy_idx = gve_tx_idx_to_ntfy(priv, queue_idx);
 	struct gve_notify_block *block = &priv->ntfy_blocks[ntfy_idx];
 	struct gve_tx_ring *tx = &priv->tx[queue_idx];
 
 	block->tx = tx;
 	tx->ntfy_id = ntfy_idx;
+	netif_set_xps_queue(priv->dev, get_cpu_mask(ntfy_idx % active_cpus),
+			    queue_idx);
 }
 
 static int gve_tx_alloc_ring(struct gve_priv *priv, int idx)
@@ -206,14 +212,17 @@
 	if (!tx->desc)
 		goto abort_with_info;
 
-	tx->tx_fifo.qpl = gve_assign_tx_qpl(priv);
-	if (!tx->tx_fifo.qpl)
-		goto abort_with_desc;
+	tx->raw_addressing = priv->raw_addressing;
+	tx->dev = &priv->pdev->dev;
+	if (!tx->raw_addressing) {
+	        tx->tx_fifo.qpl = gve_assign_tx_qpl(priv);
+	        if (!tx->tx_fifo.qpl)
+		        goto abort_with_desc;
 
-	/* map Tx FIFO */
-	if (gve_tx_fifo_init(priv, &tx->tx_fifo))
-		goto abort_with_qpl;
-
+	        /* map Tx FIFO */
+	        if (gve_tx_fifo_init(priv, &tx->tx_fifo))
+        		goto abort_with_qpl;
+        }
 	tx->q_resources =
 		dma_alloc_coherent(hdev,
 				   sizeof(*tx->q_resources),
@@ -230,7 +239,8 @@
 	return 0;
 
 abort_with_fifo:
-	gve_tx_fifo_release(priv, &tx->tx_fifo);
+        if (!tx->raw_addressing)
+	        gve_tx_fifo_release(priv, &tx->tx_fifo);
 abort_with_qpl:
 	gve_unassign_qpl(priv, tx->tx_fifo.qpl->id);
 abort_with_desc:
@@ -310,22 +320,44 @@
  * payload wraps to the beginning of the FIFO.
  */
 #define MAX_TX_DESC_NEEDED	3
+static void gve_tx_unmap_buf(struct device *dev,
+			     struct gve_tx_dma_buf *buf)
+{
+	const int buf_len = (int)dma_unmap_len(buf, len);
+	if (buf_len > 0) {
+		dma_unmap_single(dev, dma_unmap_addr(buf, dma),
+				 dma_unmap_len(buf, len),
+				 DMA_TO_DEVICE);
+		dma_unmap_len_set(buf, len, 0);
+	} else if (buf_len < 0) {
+		dma_unmap_page(dev, dma_unmap_addr(buf, dma),
+			       -dma_unmap_len(buf, len),
+			       DMA_TO_DEVICE);
+		dma_unmap_len_set(buf, len, 0);
+	}
+}
 
 /* Check if sufficient resources (descriptor ring space, FIFO space) are
  * available to transmit the given number of bytes.
  */
 static inline bool gve_can_tx(struct gve_tx_ring *tx, int bytes_required)
 {
-	return (gve_tx_avail(tx) >= MAX_TX_DESC_NEEDED &&
-		gve_tx_fifo_can_alloc(&tx->tx_fifo, bytes_required));
+	bool can_alloc = true;
+
+	if (!tx->raw_addressing)
+		can_alloc = gve_tx_fifo_can_alloc(&tx->tx_fifo, bytes_required);
+
+	return (gve_tx_avail(tx) >= MAX_TX_DESC_NEEDED && can_alloc);
 }
 
 /* Stops the queue if the skb cannot be transmitted. */
 static int gve_maybe_stop_tx(struct gve_tx_ring *tx, struct sk_buff *skb)
 {
-	int bytes_required;
+	int bytes_required = 0;
 
-	bytes_required = gve_skb_fifo_bytes_required(tx, skb);
+	if (!tx->raw_addressing)
+		bytes_required = gve_skb_fifo_bytes_required(tx, skb);
+
 	if (likely(gve_can_tx(tx, bytes_required)))
 		return 0;
 
@@ -394,22 +426,23 @@
 	seg_desc->seg.seg_addr = cpu_to_be64(addr);
 }
 
-static void gve_dma_sync_for_device(struct device *dev, dma_addr_t *page_buses,
-				    u64 iov_offset, u64 iov_len)
+static void gve_dma_sync_for_device(struct gve_priv *priv,
+								dma_addr_t *page_buses,
+								u64 iov_offset, u64 iov_len)
 {
 	u64 last_page = (iov_offset + iov_len - 1) / PAGE_SIZE;
 	u64 first_page = iov_offset / PAGE_SIZE;
-	dma_addr_t dma;
 	u64 page;
 
 	for (page = first_page; page <= last_page; page++) {
-		dma = page_buses[page];
-		dma_sync_single_for_device(dev, dma, PAGE_SIZE, DMA_TO_DEVICE);
+		dma_addr_t dma = page_buses[page];
+		dma_sync_single_for_device(&priv->pdev->dev, dma, PAGE_SIZE,
+					   DMA_TO_DEVICE);
 	}
 }
 
-static int gve_tx_add_skb(struct gve_tx_ring *tx, struct sk_buff *skb,
-			  struct device *dev)
+static int gve_tx_add_skb_copy(struct gve_priv* priv, struct gve_tx_ring *tx,
+								struct sk_buff *skb)
 {
 	int pad_bytes, hlen, hdr_nfrags, payload_nfrags, l4_hdr_offset;
 	union gve_tx_desc *pkt_desc, *seg_desc;
@@ -451,7 +484,7 @@
 	skb_copy_bits(skb, 0,
 		      tx->tx_fifo.base + info->iov[hdr_nfrags - 1].iov_offset,
 		      hlen);
-	gve_dma_sync_for_device(dev, tx->tx_fifo.qpl->page_buses,
+	gve_dma_sync_for_device(priv, tx->tx_fifo.qpl->page_buses,
 				info->iov[hdr_nfrags - 1].iov_offset,
 				info->iov[hdr_nfrags - 1].iov_len);
 	copy_offset = hlen;
@@ -467,7 +500,7 @@
 		skb_copy_bits(skb, copy_offset,
 			      tx->tx_fifo.base + info->iov[i].iov_offset,
 			      info->iov[i].iov_len);
-		gve_dma_sync_for_device(dev, tx->tx_fifo.qpl->page_buses,
+		gve_dma_sync_for_device(priv, tx->tx_fifo.qpl->page_buses,
 					info->iov[i].iov_offset,
 					info->iov[i].iov_len);
 		copy_offset += info->iov[i].iov_len;
@@ -476,6 +509,96 @@
 	return 1 + payload_nfrags;
 }
 
+static int gve_tx_add_skb_no_copy(struct gve_priv *priv, struct gve_tx_ring *tx,
+				  struct sk_buff *skb)
+{
+	const struct skb_shared_info *shinfo = skb_shinfo(skb);
+	int hlen, payload_nfrags, l4_hdr_offset, seg_idx_bias;
+	union gve_tx_desc *pkt_desc, *seg_desc;
+	struct gve_tx_buffer_state *info;
+	bool is_gso = skb_is_gso(skb);
+	u32 idx = tx->req & tx->mask;
+	struct gve_tx_dma_buf *buf;
+	int last_mapped = 0;
+	u64 addr;
+	u32 len;
+	int i;
+
+	info = &tx->info[idx];
+	pkt_desc = &tx->desc[idx];
+
+	l4_hdr_offset = skb_checksum_start_offset(skb);
+	/* If the skb is gso, then we want the tcp header in the first segment
+	 * otherwise we want the linear portion of the skb (which will contain
+	 * the checksum because skb->csum_start and skb->csum_offset are given
+	 * relative to skb->head) in the first segment.
+	 */
+	hlen = is_gso ? l4_hdr_offset + tcp_hdrlen(skb) :
+			skb_headlen(skb);
+	len = skb_headlen(skb);
+
+	info->skb =  skb;
+
+	addr = dma_map_single(tx->dev, skb->data, len, DMA_TO_DEVICE);
+	if (unlikely(dma_mapping_error(tx->dev, addr))) {
+		priv->dma_mapping_error++;
+		goto drop;
+	}
+	buf = &info->buf;
+	dma_unmap_len_set(buf, len, len);
+	dma_unmap_addr_set(buf, dma, addr);
+
+	payload_nfrags = shinfo->nr_frags;
+	if (hlen < len) {
+		/* For gso the rest of the linear portion of the skb needs to
+		 * be in its own descriptor.
+		 */
+		payload_nfrags++;
+		gve_tx_fill_pkt_desc(pkt_desc, skb, is_gso, l4_hdr_offset,
+				     1 + payload_nfrags, hlen, addr);
+
+		len -= hlen;
+		addr += hlen;
+		seg_desc = &tx->desc[(tx->req + 1) & tx->mask];
+		seg_idx_bias = 2;
+		gve_tx_fill_seg_desc(seg_desc, skb, is_gso, len, addr);
+	} else {
+		seg_idx_bias = 1;
+		gve_tx_fill_pkt_desc(pkt_desc, skb, is_gso, l4_hdr_offset,
+				     1 + payload_nfrags, hlen, addr);
+	}
+
+	for (i = 0; i < payload_nfrags - (seg_idx_bias - 1); i++) {
+		const skb_frag_t* frag = &shinfo->frags[i];
+
+		idx = (tx->req + i + seg_idx_bias) & tx->mask;
+		seg_desc = &tx->desc[idx];
+		len = skb_frag_size(frag);
+		addr = skb_frag_dma_map(tx->dev, frag, 0, len, DMA_TO_DEVICE);
+		if (unlikely(dma_mapping_error(tx->dev, addr))) {
+			priv->dma_mapping_error++;
+			goto unmap_drop;
+		}
+		buf = &tx->info[idx].buf;
+		dma_unmap_len_set(buf, len, -len);
+		dma_unmap_addr_set(buf, dma, addr);
+
+		gve_tx_fill_seg_desc(seg_desc, skb, is_gso, len, addr);
+	}
+
+	return 1 + payload_nfrags;
+
+unmap_drop:
+	i--;
+	for (last_mapped = i + seg_idx_bias; last_mapped >= 0; last_mapped--) {
+		idx = (tx->req + last_mapped) & tx->mask;
+		gve_tx_unmap_buf(tx->dev, &tx->info[idx].buf);
+	}
+drop:
+	tx->dropped_pkt++;
+	return 0;
+}
+
 netdev_tx_t gve_tx(struct sk_buff *skb, struct net_device *dev)
 {
 	struct gve_priv *priv = netdev_priv(dev);
@@ -491,20 +614,34 @@
 		 * may have added descriptors without ringing the doorbell.
 		 */
 
+		/* Ensure tx descs from a prior gve_tx are visible before
+		 * ringing doorbell.
+		 */
+		dma_wmb();
 		gve_tx_put_doorbell(priv, tx->q_resources, tx->req);
 		return NETDEV_TX_BUSY;
 	}
-	nsegs = gve_tx_add_skb(tx, skb, &priv->pdev->dev);
+	if (tx->raw_addressing)
+		nsegs = gve_tx_add_skb_no_copy(priv, tx, skb);
+	else
+		nsegs = gve_tx_add_skb_copy(priv, tx, skb);
 
-	netdev_tx_sent_queue(tx->netdev_txq, skb->len);
-	skb_tx_timestamp(skb);
+	/* If the packet is getting sent, we need to update the skb */
+	if (nsegs) {
+		netdev_tx_sent_queue(tx->netdev_txq, skb->len);
+		skb_tx_timestamp(skb);
+	}
 
-	/* give packets to NIC */
+	/* Give packets to NIC. Even if this packet failed to send the doorbell
+	 * might need to be rung because of xmit_more.
+	 */
 	tx->req += nsegs;
 
 	if (!netif_xmit_stopped(tx->netdev_txq) && netdev_xmit_more())
 		return NETDEV_TX_OK;
 
+	/* Ensure tx descs are visible before ringing doorbell */
+	dma_wmb();
 	gve_tx_put_doorbell(priv, tx->q_resources, tx->req);
 	return NETDEV_TX_OK;
 }
@@ -529,24 +666,31 @@
 		info = &tx->info[idx];
 		skb = info->skb;
 
+		/* Unmap the buffer */
+		if (tx->raw_addressing)
+			gve_tx_unmap_buf(tx->dev, &tx->info[idx].buf);
 		/* Mark as free */
 		if (skb) {
 			info->skb = NULL;
 			bytes += skb->len;
 			pkts++;
 			dev_consume_skb_any(skb);
-			/* FIFO free */
-			for (i = 0; i < ARRAY_SIZE(info->iov); i++) {
-				space_freed += info->iov[i].iov_len +
-					       info->iov[i].iov_padding;
-				info->iov[i].iov_len = 0;
-				info->iov[i].iov_padding = 0;
+			if (!tx->raw_addressing) {
+				/* FIFO free */
+				for (i = 0; i < ARRAY_SIZE(info->iov); i++) {
+					space_freed += info->iov[i].iov_len +
+						       info->iov[i].iov_padding;
+					info->iov[i].iov_len = 0;
+					info->iov[i].iov_padding = 0;
+				}
 			}
 		}
 		tx->done++;
 	}
 
-	gve_tx_free_fifo(&tx->tx_fifo, space_freed);
+	if (!tx->raw_addressing) {
+		gve_tx_free_fifo(&tx->tx_fifo, space_freed);
+	}
 	u64_stats_update_begin(&tx->statss);
 	tx->bytes_done += bytes;
 	tx->pkt_done += pkts;

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
index 0a721f6..e31f8fb 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c

@@ -3109,13 +3109,19 @@
 	return 0;
 }
 
-static int mem_xadd4(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
+static int mem_atomic4(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
 {
+	if (meta->insn.imm != BPF_ADD)
+		return -EOPNOTSUPP;
+
 	return mem_xadd(nfp_prog, meta, false);
 }
 
-static int mem_xadd8(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
+static int mem_atomic8(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
 {
+	if (meta->insn.imm != BPF_ADD)
+		return -EOPNOTSUPP;
+
 	return mem_xadd(nfp_prog, meta, true);
 }
 
@@ -3475,8 +3481,8 @@
 	[BPF_STX | BPF_MEM | BPF_H] =	mem_stx2,
 	[BPF_STX | BPF_MEM | BPF_W] =	mem_stx4,
 	[BPF_STX | BPF_MEM | BPF_DW] =	mem_stx8,
-	[BPF_STX | BPF_XADD | BPF_W] =	mem_xadd4,
-	[BPF_STX | BPF_XADD | BPF_DW] =	mem_xadd8,
+	[BPF_STX | BPF_ATOMIC | BPF_W] =	mem_atomic4,
+	[BPF_STX | BPF_ATOMIC | BPF_DW] =	mem_atomic8,
 	[BPF_ST | BPF_MEM | BPF_B] =	mem_st1,
 	[BPF_ST | BPF_MEM | BPF_H] =	mem_st2,
 	[BPF_ST | BPF_MEM | BPF_W] =	mem_st4,

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h b/drivers/net/ethernet/netronome/nfp/bpf/main.h
index fac9c6f..d0e17ee 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h

@@ -428,9 +428,9 @@
 	return is_mbpf_classic_store(meta) && meta->ptr.type == PTR_TO_PACKET;
 }
 
-static inline bool is_mbpf_xadd(const struct nfp_insn_meta *meta)
+static inline bool is_mbpf_atomic(const struct nfp_insn_meta *meta)
 {
-	return (meta->insn.code & ~BPF_SIZE_MASK) == (BPF_STX | BPF_XADD);
+	return (meta->insn.code & ~BPF_SIZE_MASK) == (BPF_STX | BPF_ATOMIC);
 }
 
 static inline bool is_mbpf_mul(const struct nfp_insn_meta *meta)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
index e92ee51..9d235c0 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/verifier.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/verifier.c

@@ -479,7 +479,7 @@
 			pr_vlog(env, "map writes not supported\n");
 			return -EOPNOTSUPP;
 		}
-		if (is_mbpf_xadd(meta)) {
+		if (is_mbpf_atomic(meta)) {
 			err = nfp_bpf_map_mark_used(env, meta, reg,
 						    NFP_MAP_USE_ATOMIC_CNT);
 			if (err)
@@ -523,12 +523,17 @@
 }
 
 static int
-nfp_bpf_check_xadd(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
-		   struct bpf_verifier_env *env)
+nfp_bpf_check_atomic(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
+		     struct bpf_verifier_env *env)
 {
 	const struct bpf_reg_state *sreg = cur_regs(env) + meta->insn.src_reg;
 	const struct bpf_reg_state *dreg = cur_regs(env) + meta->insn.dst_reg;
 
+	if (meta->insn.imm != BPF_ADD) {
+		pr_vlog(env, "atomic op not implemented: %d\n", meta->insn.imm);
+		return -EOPNOTSUPP;
+	}
+
 	if (dreg->type != PTR_TO_MAP_VALUE) {
 		pr_vlog(env, "atomic add not to a map value pointer: %d\n",
 			dreg->type);
@@ -655,8 +660,8 @@
 	if (is_mbpf_store(meta))
 		return nfp_bpf_check_store(nfp_prog, meta, env);
 
-	if (is_mbpf_xadd(meta))
-		return nfp_bpf_check_xadd(nfp_prog, meta, env);
+	if (is_mbpf_atomic(meta))
+		return nfp_bpf_check_atomic(nfp_prog, meta, env);
 
 	if (is_mbpf_alu(meta))
 		return nfp_bpf_check_alu(nfp_prog, meta, env);

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 115a77b..da2752d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h

@@ -152,7 +152,12 @@
 #define EXT4_MB_USE_RESERVED		0x2000
 /* Do strict check for free blocks while retrying block allocation */
 #define EXT4_MB_STRICT_CHECK		0x4000
-
+/* Large fragment size list lookup succeeded at least once for cr = 0 */
+#define EXT4_MB_CR0_OPTIMIZED		0x8000
+/* Avg fragment size rb tree lookup succeeded at least once for cr = 1 */
+#define EXT4_MB_CR1_OPTIMIZED		0x00010000
+/* Perform linear traversal for one group */
+#define EXT4_MB_SEARCH_NEXT_LINEAR	0x00020000
 struct ext4_allocation_request {
 	/* target inode for block we're allocating */
 	struct inode *inode;
@@ -714,6 +719,7 @@
 #define EXT4_IOC_CLEAR_ES_CACHE		_IO('f', 40)
 #define EXT4_IOC_GETSTATE		_IOW('f', 41, __u32)
 #define EXT4_IOC_GET_ES_CACHE		_IOWR('f', 42, struct fiemap)
+#define EXT4_IOC_CHECKPOINT		_IOW('f', 43, __u32)
 
 #define EXT4_IOC_SHUTDOWN _IOR ('X', 125, __u32)
 
@@ -735,6 +741,14 @@
 #define EXT4_STATE_FLAG_NEWENTRY	0x00000004
 #define EXT4_STATE_FLAG_DA_ALLOC_CLOSE	0x00000008
 
+/* flags for ioctl EXT4_IOC_CHECKPOINT */
+#define EXT4_IOC_CHECKPOINT_FLAG_DISCARD	0x1
+#define EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT	0x2
+#define EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN	0x4
+#define EXT4_IOC_CHECKPOINT_FLAG_VALID		(EXT4_IOC_CHECKPOINT_FLAG_DISCARD | \
+						EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT | \
+						EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN)
+
 #if defined(__KERNEL__) && defined(CONFIG_COMPAT)
 /*
  * ioctl commands in 32 bit emulation
@@ -1212,7 +1226,7 @@
 #define EXT4_MOUNT_JOURNAL_CHECKSUM	0x800000 /* Journal checksums */
 #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT	0x1000000 /* Journal Async Commit */
 #define EXT4_MOUNT_WARN_ON_ERROR	0x2000000 /* Trigger WARN_ON on error */
-#define EXT4_MOUNT_PREFETCH_BLOCK_BITMAPS 0x4000000
+#define EXT4_MOUNT_NO_PREFETCH_BLOCK_BITMAPS 0x4000000
 #define EXT4_MOUNT_DELALLOC		0x8000000 /* Delalloc support */
 #define EXT4_MOUNT_DATA_ERR_ABORT	0x10000000 /* Abort on file data write */
 #define EXT4_MOUNT_BLOCK_VALIDITY	0x20000000 /* Block validity checking */
@@ -1237,7 +1251,9 @@
 #define EXT4_MOUNT2_JOURNAL_FAST_COMMIT	0x00000010 /* Journal fast commit */
 #define EXT4_MOUNT2_DAX_NEVER		0x00000020 /* Do not allow Direct Access */
 #define EXT4_MOUNT2_DAX_INODE		0x00000040 /* For printing options only */
-
+#define EXT4_MOUNT2_MB_OPTIMIZE_SCAN	0x00000080 /* Optimize group
+						    * scanning in mballoc
+						    */
 
 #define clear_opt(sb, opt)		EXT4_SB(sb)->s_mount_opt &= \
 						~EXT4_MOUNT_##opt
@@ -1519,9 +1535,14 @@
 	unsigned int s_mb_free_pending;
 	struct list_head s_freed_data_list;	/* List of blocks to be freed
 						   after commit completed */
+	struct rb_root s_mb_avg_fragment_size_root;
+	rwlock_t s_mb_rb_lock;
+	struct list_head *s_mb_largest_free_orders;
+	rwlock_t *s_mb_largest_free_orders_locks;
 
 	/* tunables */
 	unsigned long s_stripe;
+	unsigned int s_mb_max_linear_groups;
 	unsigned int s_mb_stream_request;
 	unsigned int s_mb_max_to_scan;
 	unsigned int s_mb_min_to_scan;
@@ -1541,12 +1562,17 @@
 	atomic_t s_bal_success;	/* we found long enough chunks */
 	atomic_t s_bal_allocated;	/* in blocks */
 	atomic_t s_bal_ex_scanned;	/* total extents scanned */
+	atomic_t s_bal_groups_scanned;	/* number of groups scanned */
 	atomic_t s_bal_goals;	/* goal hits */
 	atomic_t s_bal_breaks;	/* too long searches */
 	atomic_t s_bal_2orders;	/* 2^order hits */
-	spinlock_t s_bal_lock;
-	unsigned long s_mb_buddies_generated;
-	unsigned long long s_mb_generation_time;
+	atomic_t s_bal_cr0_bad_suggestions;
+	atomic_t s_bal_cr1_bad_suggestions;
+	atomic64_t s_bal_cX_groups_considered[4];
+	atomic64_t s_bal_cX_hits[4];
+	atomic64_t s_bal_cX_failed[4];		/* cX loop didn't find blocks */
+	atomic_t s_mb_buddies_generated;	/* number of buddies generated */
+	atomic64_t s_mb_generation_time;
 	atomic_t s_mb_lost_chunks;
 	atomic_t s_mb_preallocated;
 	atomic_t s_mb_discarded;
@@ -2781,8 +2807,10 @@
 
 /* mballoc.c */
 extern const struct seq_operations ext4_mb_seq_groups_ops;
+extern const struct seq_operations ext4_mb_seq_structs_summary_ops;
 extern long ext4_mb_stats;
 extern long ext4_mb_max_to_scan;
+extern int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset);
 extern int ext4_mb_init(struct super_block *);
 extern int ext4_mb_release(struct super_block *);
 extern ext4_fsblk_t ext4_mb_new_blocks(handle_t *,
@@ -3284,11 +3312,14 @@
 	ext4_grpblk_t	bb_free;	/* total free blocks */
 	ext4_grpblk_t	bb_fragments;	/* nr of freespace fragments */
 	ext4_grpblk_t	bb_largest_free_order;/* order of largest frag in BG */
+	ext4_group_t	bb_group;	/* Group number */
 	struct          list_head bb_prealloc_list;
 #ifdef DOUBLE_CHECK
 	void            *bb_bitmap;
 #endif
 	struct rw_semaphore alloc_sem;
+	struct rb_node	bb_avg_fragment_size_rb;
+	struct list_head bb_largest_free_order_node;
 	ext4_grpblk_t	bb_counters[];	/* Nr of free power-of-two-block
 					 * regions, index is order.
 					 * bb_counters[3] = 5 means

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index e00a355..413a848 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c

@@ -5994,7 +5994,6 @@
 			kfree(path);
 			break;
 		}
-		ex = path2[path2->p_depth].p_ext;
 		for (i = 0; i <= max(path->p_depth, path2->p_depth); i++) {
 			cmp1 = cmp2 = 0;
 			if (i <= path->p_depth)

diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 53647fa..03953d6 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c

@@ -103,8 +103,69 @@
  *
  * Replay code should thus check for all the valid tails in the FC area.
  *
+ * Fast Commit Replay Idempotence
+ * ------------------------------
+ *
+ * Fast commits tags are idempotent in nature provided the recovery code follows
+ * certain rules. The guiding principle that the commit path follows while
+ * committing is that it stores the result of a particular operation instead of
+ * storing the procedure.
+ *
+ * Let's consider this rename operation: 'mv /a /b'. Let's assume dirent '/a'
+ * was associated with inode 10. During fast commit, instead of storing this
+ * operation as a procedure "rename a to b", we store the resulting file system
+ * state as a "series" of outcomes:
+ *
+ * - Link dirent b to inode 10
+ * - Unlink dirent a
+ * - Inode <10> with valid refcount
+ *
+ * Now when recovery code runs, it needs "enforce" this state on the file
+ * system. This is what guarantees idempotence of fast commit replay.
+ *
+ * Let's take an example of a procedure that is not idempotent and see how fast
+ * commits make it idempotent. Consider following sequence of operations:
+ *
+ *     rm A;    mv B A;    read A
+ *  (x)     (y)        (z)
+ *
+ * (x), (y) and (z) are the points at which we can crash. If we store this
+ * sequence of operations as is then the replay is not idempotent. Let's say
+ * while in replay, we crash at (z). During the second replay, file A (which was
+ * actually created as a result of "mv B A" operation) would get deleted. Thus,
+ * file named A would be absent when we try to read A. So, this sequence of
+ * operations is not idempotent. However, as mentioned above, instead of storing
+ * the procedure fast commits store the outcome of each procedure. Thus the fast
+ * commit log for above procedure would be as follows:
+ *
+ * (Let's assume dirent A was linked to inode 10 and dirent B was linked to
+ * inode 11 before the replay)
+ *
+ *    [Unlink A]   [Link A to inode 11]   [Unlink B]   [Inode 11]
+ * (w)          (x)                    (y)          (z)
+ *
+ * If we crash at (z), we will have file A linked to inode 11. During the second
+ * replay, we will remove file A (inode 11). But we will create it back and make
+ * it point to inode 11. We won't find B, so we'll just skip that step. At this
+ * point, the refcount for inode 11 is not reliable, but that gets fixed by the
+ * replay of last inode 11 tag. Crashes at points (w), (x) and (y) get handled
+ * similarly. Thus, by converting a non-idempotent procedure into a series of
+ * idempotent outcomes, fast commits ensured idempotence during the replay.
+ *
  * TODOs
  * -----
+ *
+ * 0) Fast commit replay path hardening: Fast commit replay code should use
+ *    journal handles to make sure all the updates it does during the replay
+ *    path are atomic. With that if we crash during fast commit replay, after
+ *    trying to do recovery again, we will find a file system where fast commit
+ *    area is invalid (because new full commit would be found). In order to deal
+ *    with that, fast commit replay code should ensure that the "FC_REPLAY"
+ *    superblock state is persisted before starting the replay, so that after
+ *    the crash, fast commit recovery code can look at that flag and perform
+ *    fast commit recovery even if that area is invalidated by later full
+ *    commits.
+ *
  * 1) Make fast commit atomic updates more fine grained. Today, a fast commit
  *    eligible update must be protected within ext4_fc_start_update() and
  *    ext4_fc_stop_update(). These routines are called at much higher
@@ -548,13 +609,13 @@
 	trace_ext4_fc_track_range(inode, start, end, ret);
 }
 
-static void ext4_fc_submit_bh(struct super_block *sb)
+static void ext4_fc_submit_bh(struct super_block *sb, bool is_tail)
 {
 	int write_flags = REQ_SYNC;
 	struct buffer_head *bh = EXT4_SB(sb)->s_fc_bh;
 
-	/* TODO: REQ_FUA | REQ_PREFLUSH is unnecessarily expensive. */
-	if (test_opt(sb, BARRIER))
+	/* Add REQ_FUA | REQ_PREFLUSH only its tail */
+	if (test_opt(sb, BARRIER) && is_tail)
 		write_flags |= REQ_FUA | REQ_PREFLUSH;
 	lock_buffer(bh);
 	set_buffer_dirty(bh);
@@ -628,7 +689,7 @@
 		*crc = ext4_chksum(sbi, *crc, tl, sizeof(*tl));
 	if (pad_len > 0)
 		ext4_fc_memzero(sb, tl + 1, pad_len, crc);
-	ext4_fc_submit_bh(sb);
+	ext4_fc_submit_bh(sb, false);
 
 	ret = jbd2_fc_get_buf(EXT4_SB(sb)->s_journal, &bh);
 	if (ret)
@@ -685,7 +746,7 @@
 	tail.fc_crc = cpu_to_le32(crc);
 	ext4_fc_memcpy(sb, dst, &tail.fc_crc, sizeof(tail.fc_crc), NULL);
 
-	ext4_fc_submit_bh(sb);
+	ext4_fc_submit_bh(sb, true);
 
 	return 0;
 }
@@ -1771,32 +1832,6 @@
 	return 0;
 }
 
-static inline const char *tag2str(u16 tag)
-{
-	switch (tag) {
-	case EXT4_FC_TAG_LINK:
-		return "TAG_ADD_ENTRY";
-	case EXT4_FC_TAG_UNLINK:
-		return "TAG_DEL_ENTRY";
-	case EXT4_FC_TAG_ADD_RANGE:
-		return "TAG_ADD_RANGE";
-	case EXT4_FC_TAG_CREAT:
-		return "TAG_CREAT_DENTRY";
-	case EXT4_FC_TAG_DEL_RANGE:
-		return "TAG_DEL_RANGE";
-	case EXT4_FC_TAG_INODE:
-		return "TAG_INODE";
-	case EXT4_FC_TAG_PAD:
-		return "TAG_PAD";
-	case EXT4_FC_TAG_TAIL:
-		return "TAG_TAIL";
-	case EXT4_FC_TAG_HEAD:
-		return "TAG_HEAD";
-	default:
-		return "TAG_ERROR";
-	}
-}
-
 static void ext4_fc_set_bitmaps_and_counters(struct super_block *sb)
 {
 	struct ext4_fc_replay_state *state;

diff --git a/fs/ext4/fast_commit.h b/fs/ext4/fast_commit.h
index d8d0998..937c381 100644
--- a/fs/ext4/fast_commit.h
+++ b/fs/ext4/fast_commit.h

@@ -3,6 +3,11 @@
 #ifndef __FAST_COMMIT_H__
 #define __FAST_COMMIT_H__
 
+/*
+ * Note this file is present in e2fsprogs/lib/ext2fs/fast_commit.h and
+ * linux/fs/ext4/fast_commit.h. These file should always be byte identical.
+ */
+
 /* Fast commit tags */
 #define EXT4_FC_TAG_ADD_RANGE		0x0001
 #define EXT4_FC_TAG_DEL_RANGE		0x0002
@@ -50,7 +55,7 @@
 struct ext4_fc_dentry_info {
 	__le32 fc_parent_ino;
 	__le32 fc_ino;
-	u8 fc_dname[0];
+	__u8 fc_dname[0];
 };
 
 /* Value structure for EXT4_FC_TAG_INODE and EXT4_FC_TAG_INODE_PARTIAL. */
@@ -66,19 +71,6 @@
 };
 
 /*
- * In memory list of dentry updates that are performed on the file
- * system used by fast commit code.
- */
-struct ext4_fc_dentry_update {
-	int fcd_op;		/* Type of update create / unlink / link */
-	int fcd_parent;		/* Parent inode number */
-	int fcd_ino;		/* Inode number */
-	struct qstr fcd_name;	/* Dirent name */
-	unsigned char fcd_iname[DNAME_INLINE_LEN];	/* Dirent name string */
-	struct list_head fcd_list;
-};
-
-/*
  * Fast commit reason codes
  */
 enum {
@@ -107,6 +99,20 @@
 	EXT4_FC_REASON_MAX
 };
 
+#ifdef __KERNEL__
+/*
+ * In memory list of dentry updates that are performed on the file
+ * system used by fast commit code.
+ */
+struct ext4_fc_dentry_update {
+	int fcd_op;		/* Type of update create / unlink / link */
+	int fcd_parent;		/* Parent inode number */
+	int fcd_ino;		/* Inode number */
+	struct qstr fcd_name;	/* Dirent name */
+	unsigned char fcd_iname[DNAME_INLINE_LEN];	/* Dirent name string */
+	struct list_head fcd_list;
+};
+
 struct ext4_fc_stats {
 	unsigned int fc_ineligible_reason_count[EXT4_FC_REASON_MAX];
 	unsigned long fc_num_commits;
@@ -145,6 +151,32 @@
 };
 
 #define region_last(__region) (((__region)->lblk) + ((__region)->len) - 1)
+#endif
 
+static inline const char *tag2str(__u16 tag)
+{
+	switch (tag) {
+	case EXT4_FC_TAG_LINK:
+		return "ADD_ENTRY";
+	case EXT4_FC_TAG_UNLINK:
+		return "DEL_ENTRY";
+	case EXT4_FC_TAG_ADD_RANGE:
+		return "ADD_RANGE";
+	case EXT4_FC_TAG_CREAT:
+		return "CREAT_DENTRY";
+	case EXT4_FC_TAG_DEL_RANGE:
+		return "DEL_RANGE";
+	case EXT4_FC_TAG_INODE:
+		return "INODE";
+	case EXT4_FC_TAG_PAD:
+		return "PAD";
+	case EXT4_FC_TAG_TAIL:
+		return "TAIL";
+	case EXT4_FC_TAG_HEAD:
+		return "HEAD";
+	default:
+		return "ERROR";
+	}
+}
 
 #endif /* __FAST_COMMIT_H__ */

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 18a5321..abe6f00 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c

@@ -3224,7 +3224,7 @@
 		ext4_clear_inode_state(inode, EXT4_STATE_JDATA);
 		journal = EXT4_JOURNAL(inode);
 		jbd2_journal_lock_updates(journal);
-		err = jbd2_journal_flush(journal);
+		err = jbd2_journal_flush(journal, 0);
 		jbd2_journal_unlock_updates(journal);
 
 		if (err)
@@ -6014,7 +6014,7 @@
 	if (val)
 		ext4_set_inode_flag(inode, EXT4_INODE_JOURNAL_DATA);
 	else {
-		err = jbd2_journal_flush(journal);
+		err = jbd2_journal_flush(journal, 0);
 		if (err < 0) {
 			jbd2_journal_unlock_updates(journal);
 			percpu_up_write(&sbi->s_writepages_rwsem);

diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index cb54ea6..d39b3a6 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c

@@ -759,7 +759,7 @@
 	err = ext4_group_add(sb, input);
 	if (EXT4_SB(sb)->s_journal) {
 		jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
-		err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal);
+		err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal, 0);
 		jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
 	}
 	if (err == 0)
@@ -815,6 +815,57 @@
 	return error;
 }
 
+static int ext4_ioctl_checkpoint(struct file *filp, unsigned long arg)
+{
+	int err = 0;
+	__u32 flags = 0;
+	unsigned int flush_flags = 0;
+	struct super_block *sb = file_inode(filp)->i_sb;
+	struct request_queue *q;
+
+	if (copy_from_user(&flags, (__u32 __user *)arg,
+				sizeof(__u32)))
+		return -EFAULT;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	/* check for invalid bits set */
+	if ((flags & ~EXT4_IOC_CHECKPOINT_FLAG_VALID) ||
+				((flags & JBD2_JOURNAL_FLUSH_DISCARD) &&
+				(flags & JBD2_JOURNAL_FLUSH_ZEROOUT)))
+		return -EINVAL;
+
+	if (!EXT4_SB(sb)->s_journal)
+		return -ENODEV;
+
+	if (flags & ~JBD2_JOURNAL_FLUSH_VALID)
+		return -EINVAL;
+
+	q = bdev_get_queue(EXT4_SB(sb)->s_journal->j_dev);
+	if (!q)
+		return -ENXIO;
+	if ((flags & JBD2_JOURNAL_FLUSH_DISCARD) && !blk_queue_discard(q))
+		return -EOPNOTSUPP;
+
+	if (flags & EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN)
+		return 0;
+
+	if (flags & EXT4_IOC_CHECKPOINT_FLAG_DISCARD)
+		flush_flags |= JBD2_JOURNAL_FLUSH_DISCARD;
+
+	if (flags & EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT) {
+		flush_flags |= JBD2_JOURNAL_FLUSH_ZEROOUT;
+		pr_info_ratelimited("warning: checkpointing journal with EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT can be slow");
+	}
+
+	jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
+	err = jbd2_journal_flush(EXT4_SB(sb)->s_journal, flush_flags);
+	jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
+
+	return err;
+}
+
 static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
 	struct inode *inode = file_inode(filp);
@@ -941,7 +992,7 @@
 		err = ext4_group_extend(sb, EXT4_SB(sb)->s_es, n_blocks_count);
 		if (EXT4_SB(sb)->s_journal) {
 			jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
-			err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal);
+			err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal, 0);
 			jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
 		}
 		if (err == 0)
@@ -1084,7 +1135,7 @@
 		if (EXT4_SB(sb)->s_journal) {
 			ext4_fc_mark_ineligible(sb, EXT4_FC_REASON_RESIZE);
 			jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
-			err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal);
+			err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal, 0);
 			jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
 		}
 		if (err == 0)
@@ -1315,6 +1366,9 @@
 			return -EOPNOTSUPP;
 		return fsverity_ioctl_measure(filp, (void __user *)arg);
 
+	case EXT4_IOC_CHECKPOINT:
+		return ext4_ioctl_checkpoint(filp, arg);
+
 	default:
 		return -ENOTTY;
 	}
@@ -1402,6 +1456,7 @@
 	case EXT4_IOC_GET_ES_CACHE:
 	case FS_IOC_FSGETXATTR:
 	case FS_IOC_FSSETXATTR:
+	case EXT4_IOC_CHECKPOINT:
 		break;
 	default:
 		return -ENOIOCTLCMD;

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index d7cb7d7..95c1ca7 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c

@@ -127,11 +127,50 @@
  * smallest multiple of the stripe value (sbi->s_stripe) which is
  * greater than the default mb_group_prealloc.
  *
+ * If "mb_optimize_scan" mount option is set, we maintain in memory group info
+ * structures in two data structures:
+ *
+ * 1) Array of largest free order lists (sbi->s_mb_largest_free_orders)
+ *
+ *    Locking: sbi->s_mb_largest_free_orders_locks(array of rw locks)
+ *
+ *    This is an array of lists where the index in the array represents the
+ *    largest free order in the buddy bitmap of the participating group infos of
+ *    that list. So, there are exactly MB_NUM_ORDERS(sb) (which means total
+ *    number of buddy bitmap orders possible) number of lists. Group-infos are
+ *    placed in appropriate lists.
+ *
+ * 2) Average fragment size rb tree (sbi->s_mb_avg_fragment_size_root)
+ *
+ *    Locking: sbi->s_mb_rb_lock (rwlock)
+ *
+ *    This is a red black tree consisting of group infos and the tree is sorted
+ *    by average fragment sizes (which is calculated as ext4_group_info->bb_free
+ *    / ext4_group_info->bb_fragments).
+ *
+ * When "mb_optimize_scan" mount option is set, mballoc consults the above data
+ * structures to decide the order in which groups are to be traversed for
+ * fulfilling an allocation request.
+ *
+ * At CR = 0, we look for groups which have the largest_free_order >= the order
+ * of the request. We directly look at the largest free order list in the data
+ * structure (1) above where largest_free_order = order of the request. If that
+ * list is empty, we look at remaining list in the increasing order of
+ * largest_free_order. This allows us to perform CR = 0 lookup in O(1) time.
+ *
+ * At CR = 1, we only consider groups where average fragment size > request
+ * size. So, we lookup a group which has average fragment size just above or
+ * equal to request size using our rb tree (data structure 2) in O(log N) time.
+ *
+ * If "mb_optimize_scan" mount option is not set, mballoc traverses groups in
+ * linear order which requires O(N) search time for each CR 0 and CR 1 phase.
+ *
  * The regular allocator (using the buddy cache) supports a few tunables.
  *
  * /sys/fs/ext4/<partition>/mb_min_to_scan
  * /sys/fs/ext4/<partition>/mb_max_to_scan
  * /sys/fs/ext4/<partition>/mb_order2_req
+ * /sys/fs/ext4/<partition>/mb_linear_limit
  *
  * The regular allocator uses buddy scan only if the request len is power of
  * 2 blocks and the order of allocation is >= sbi->s_mb_order2_reqs. The
@@ -149,6 +188,16 @@
  * can be used for allocation. ext4_mb_good_group explains how the groups are
  * checked.
  *
+ * When "mb_optimize_scan" is turned on, as mentioned above, the groups may not
+ * get traversed linearly. That may result in subsequent allocations being not
+ * close to each other. And so, the underlying device may get filled up in a
+ * non-linear fashion. While that may not matter on non-rotational devices, for
+ * rotational devices that may result in higher seek times. "mb_linear_limit"
+ * tells mballoc how many groups mballoc should search linearly before
+ * performing consulting above data structures for more efficient lookups. For
+ * non rotational devices, this value defaults to 0 and for rotational devices
+ * this is set to MB_DEFAULT_LINEAR_LIMIT.
+ *
  * Both the prealloc space are getting populated as above. So for the first
  * request we will hit the buddy cache which will result in this prealloc
  * space getting filled. The prealloc space is then later used for the
@@ -299,6 +348,8 @@
  *  - bitlock on a group	(group)
  *  - object (inode/locality)	(object)
  *  - per-pa lock		(pa)
+ *  - cr0 lists lock		(cr0)
+ *  - cr1 tree lock		(cr1)
  *
  * Paths:
  *  - new pa
@@ -328,6 +379,9 @@
  *    group
  *        object
  *
+ *  - allocation path (ext4_mb_regular_allocator)
+ *    group
+ *    cr0/cr1
  */
 static struct kmem_cache *ext4_pspace_cachep;
 static struct kmem_cache *ext4_ac_cachep;
@@ -351,6 +405,9 @@
 						ext4_group_t group);
 static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
 
+static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
+			       ext4_group_t group, int cr);
+
 /*
  * The algorithm using this percpu seq counter goes below:
  * 1. We sample the percpu discard_pa_seq counter before trying for block
@@ -744,6 +801,269 @@
 	}
 }
 
+static void ext4_mb_rb_insert(struct rb_root *root, struct rb_node *new,
+			int (*cmp)(struct rb_node *, struct rb_node *))
+{
+	struct rb_node **iter = &root->rb_node, *parent = NULL;
+
+	while (*iter) {
+		parent = *iter;
+		if (cmp(new, *iter) > 0)
+			iter = &((*iter)->rb_left);
+		else
+			iter = &((*iter)->rb_right);
+	}
+
+	rb_link_node(new, parent, iter);
+	rb_insert_color(new, root);
+}
+
+static int
+ext4_mb_avg_fragment_size_cmp(struct rb_node *rb1, struct rb_node *rb2)
+{
+	struct ext4_group_info *grp1 = rb_entry(rb1,
+						struct ext4_group_info,
+						bb_avg_fragment_size_rb);
+	struct ext4_group_info *grp2 = rb_entry(rb2,
+						struct ext4_group_info,
+						bb_avg_fragment_size_rb);
+	int num_frags_1, num_frags_2;
+
+	num_frags_1 = grp1->bb_fragments ?
+		grp1->bb_free / grp1->bb_fragments : 0;
+	num_frags_2 = grp2->bb_fragments ?
+		grp2->bb_free / grp2->bb_fragments : 0;
+
+	return (num_frags_2 - num_frags_1);
+}
+
+/*
+ * Reinsert grpinfo into the avg_fragment_size tree with new average
+ * fragment size.
+ */
+static void
+mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info *grp)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+	if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_free == 0)
+		return;
+
+	write_lock(&sbi->s_mb_rb_lock);
+	if (!RB_EMPTY_NODE(&grp->bb_avg_fragment_size_rb)) {
+		rb_erase(&grp->bb_avg_fragment_size_rb,
+				&sbi->s_mb_avg_fragment_size_root);
+		RB_CLEAR_NODE(&grp->bb_avg_fragment_size_rb);
+	}
+
+	ext4_mb_rb_insert(&sbi->s_mb_avg_fragment_size_root,
+		&grp->bb_avg_fragment_size_rb,
+		ext4_mb_avg_fragment_size_cmp);
+	write_unlock(&sbi->s_mb_rb_lock);
+}
+
+/*
+ * Choose next group by traversing largest_free_order lists. Updates *new_cr if
+ * cr level needs an update.
+ */
+static void ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
+			int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+	struct ext4_group_info *iter, *grp;
+	int i;
+
+	if (ac->ac_status == AC_STATUS_FOUND)
+		return;
+
+	if (unlikely(sbi->s_mb_stats && ac->ac_flags & EXT4_MB_CR0_OPTIMIZED))
+		atomic_inc(&sbi->s_bal_cr0_bad_suggestions);
+
+	grp = NULL;
+	for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
+		if (list_empty(&sbi->s_mb_largest_free_orders[i]))
+			continue;
+		read_lock(&sbi->s_mb_largest_free_orders_locks[i]);
+		if (list_empty(&sbi->s_mb_largest_free_orders[i])) {
+			read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
+			continue;
+		}
+		grp = NULL;
+		list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i],
+				    bb_largest_free_order_node) {
+			if (sbi->s_mb_stats)
+				atomic64_inc(&sbi->s_bal_cX_groups_considered[0]);
+			if (likely(ext4_mb_good_group(ac, iter->bb_group, 0))) {
+				grp = iter;
+				break;
+			}
+		}
+		read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
+		if (grp)
+			break;
+	}
+
+	if (!grp) {
+		/* Increment cr and search again */
+		*new_cr = 1;
+	} else {
+		*group = grp->bb_group;
+		ac->ac_last_optimal_group = *group;
+		ac->ac_flags |= EXT4_MB_CR0_OPTIMIZED;
+	}
+}
+
+/*
+ * Choose next group by traversing average fragment size tree. Updates *new_cr
+ * if cr lvel needs an update. Sets EXT4_MB_SEARCH_NEXT_LINEAR to indicate that
+ * the linear search should continue for one iteration since there's lock
+ * contention on the rb tree lock.
+ */
+static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
+		int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+	int avg_fragment_size, best_so_far;
+	struct rb_node *node, *found;
+	struct ext4_group_info *grp;
+
+	/*
+	 * If there is contention on the lock, instead of waiting for the lock
+	 * to become available, just continue searching lineraly. We'll resume
+	 * our rb tree search later starting at ac->ac_last_optimal_group.
+	 */
+	if (!read_trylock(&sbi->s_mb_rb_lock)) {
+		ac->ac_flags |= EXT4_MB_SEARCH_NEXT_LINEAR;
+		return;
+	}
+
+	if (unlikely(ac->ac_flags & EXT4_MB_CR1_OPTIMIZED)) {
+		if (sbi->s_mb_stats)
+			atomic_inc(&sbi->s_bal_cr1_bad_suggestions);
+		/* We have found something at CR 1 in the past */
+		grp = ext4_get_group_info(ac->ac_sb, ac->ac_last_optimal_group);
+		for (found = rb_next(&grp->bb_avg_fragment_size_rb); found != NULL;
+		     found = rb_next(found)) {
+			grp = rb_entry(found, struct ext4_group_info,
+				       bb_avg_fragment_size_rb);
+			if (sbi->s_mb_stats)
+				atomic64_inc(&sbi->s_bal_cX_groups_considered[1]);
+			if (likely(ext4_mb_good_group(ac, grp->bb_group, 1)))
+				break;
+		}
+		goto done;
+	}
+
+	node = sbi->s_mb_avg_fragment_size_root.rb_node;
+	best_so_far = 0;
+	found = NULL;
+
+	while (node) {
+		grp = rb_entry(node, struct ext4_group_info,
+			       bb_avg_fragment_size_rb);
+		avg_fragment_size = 0;
+		if (ext4_mb_good_group(ac, grp->bb_group, 1)) {
+			avg_fragment_size = grp->bb_fragments ?
+				grp->bb_free / grp->bb_fragments : 0;
+			if (!best_so_far || avg_fragment_size < best_so_far) {
+				best_so_far = avg_fragment_size;
+				found = node;
+			}
+		}
+		if (avg_fragment_size > ac->ac_g_ex.fe_len)
+			node = node->rb_right;
+		else
+			node = node->rb_left;
+	}
+
+done:
+	if (found) {
+		grp = rb_entry(found, struct ext4_group_info,
+			       bb_avg_fragment_size_rb);
+		*group = grp->bb_group;
+		ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
+	} else {
+		*new_cr = 2;
+	}
+
+	read_unlock(&sbi->s_mb_rb_lock);
+	ac->ac_last_optimal_group = *group;
+}
+
+static inline int should_optimize_scan(struct ext4_allocation_context *ac)
+{
+	if (unlikely(!test_opt2(ac->ac_sb, MB_OPTIMIZE_SCAN)))
+		return 0;
+	if (ac->ac_criteria >= 2)
+		return 0;
+	if (ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS))
+		return 0;
+	return 1;
+}
+
+/*
+ * Return next linear group for allocation. If linear traversal should not be
+ * performed, this function just returns the same group
+ */
+static int
+next_linear_group(struct ext4_allocation_context *ac, int group, int ngroups)
+{
+	if (!should_optimize_scan(ac))
+		goto inc_and_return;
+
+	if (ac->ac_groups_linear_remaining) {
+		ac->ac_groups_linear_remaining--;
+		goto inc_and_return;
+	}
+
+	if (ac->ac_flags & EXT4_MB_SEARCH_NEXT_LINEAR) {
+		ac->ac_flags &= ~EXT4_MB_SEARCH_NEXT_LINEAR;
+		goto inc_and_return;
+	}
+
+	return group;
+inc_and_return:
+	/*
+	 * Artificially restricted ngroups for non-extent
+	 * files makes group > ngroups possible on first loop.
+	 */
+	return group + 1 >= ngroups ? 0 : group + 1;
+}
+
+/*
+ * ext4_mb_choose_next_group: choose next group for allocation.
+ *
+ * @ac        Allocation Context
+ * @new_cr    This is an output parameter. If the there is no good group
+ *            available at current CR level, this field is updated to indicate
+ *            the new cr level that should be used.
+ * @group     This is an input / output parameter. As an input it indicates the
+ *            next group that the allocator intends to use for allocation. As
+ *            output, this field indicates the next group that should be used as
+ *            determined by the optimization functions.
+ * @ngroups   Total number of groups
+ */
+static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
+		int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
+{
+	*new_cr = ac->ac_criteria;
+
+	if (!should_optimize_scan(ac) || ac->ac_groups_linear_remaining)
+		return;
+
+	if (*new_cr == 0) {
+		ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
+	} else if (*new_cr == 1) {
+		ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
+	} else {
+		/*
+		 * TODO: For CR=2, we can arrange groups in an rb tree sorted by
+		 * bb_free. But until that happens, we should never come here.
+		 */
+		WARN_ON(1);
+	}
+}
+
 /*
  * Cache the order of the largest free extent we have available in this block
  * group.
@@ -751,18 +1071,33 @@
 static void
 mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
 {
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	int i;
-	int bits;
 
+	if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
+		write_lock(&sbi->s_mb_largest_free_orders_locks[
+					      grp->bb_largest_free_order]);
+		list_del_init(&grp->bb_largest_free_order_node);
+		write_unlock(&sbi->s_mb_largest_free_orders_locks[
+					      grp->bb_largest_free_order]);
+	}
 	grp->bb_largest_free_order = -1; /* uninit */
 
-	bits = sb->s_blocksize_bits + 1;
-	for (i = bits; i >= 0; i--) {
+	for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
 		if (grp->bb_counters[i] > 0) {
 			grp->bb_largest_free_order = i;
 			break;
 		}
 	}
+	if (test_opt2(sb, MB_OPTIMIZE_SCAN) &&
+	    grp->bb_largest_free_order >= 0 && grp->bb_free) {
+		write_lock(&sbi->s_mb_largest_free_orders_locks[
+					      grp->bb_largest_free_order]);
+		list_add_tail(&grp->bb_largest_free_order_node,
+		      &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
+		write_unlock(&sbi->s_mb_largest_free_orders_locks[
+					      grp->bb_largest_free_order]);
+	}
 }
 
 static noinline_for_stack
@@ -816,10 +1151,9 @@
 	clear_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &(grp->bb_state));
 
 	period = get_cycles() - period;
-	spin_lock(&sbi->s_bal_lock);
-	sbi->s_mb_buddies_generated++;
-	sbi->s_mb_generation_time += period;
-	spin_unlock(&sbi->s_bal_lock);
+	atomic_inc(&sbi->s_mb_buddies_generated);
+	atomic64_add(period, &sbi->s_mb_generation_time);
+	mb_update_avg_fragment_size(sb, grp);
 }
 
 static void mb_regenerate_buddy(struct ext4_buddy *e4b)
@@ -977,7 +1311,7 @@
 			grinfo->bb_fragments = 0;
 			memset(grinfo->bb_counters, 0,
 			       sizeof(*grinfo->bb_counters) *
-				(sb->s_blocksize_bits+2));
+			       (MB_NUM_ORDERS(sb)));
 			/*
 			 * incore got set to the group block bitmap below
 			 */
@@ -1542,6 +1876,7 @@
 
 done:
 	mb_set_largest_free_order(sb, e4b->bd_info);
+	mb_update_avg_fragment_size(sb, e4b->bd_info);
 	mb_check_buddy(e4b);
 }
 
@@ -1679,6 +2014,7 @@
 	}
 	mb_set_largest_free_order(e4b->bd_sb, e4b->bd_info);
 
+	mb_update_avg_fragment_size(e4b->bd_sb, e4b->bd_info);
 	ext4_set_bits(e4b->bd_bitmap, ex->fe_start, len0);
 	mb_check_buddy(e4b);
 
@@ -1954,7 +2290,7 @@
 	int max;
 
 	BUG_ON(ac->ac_2order <= 0);
-	for (i = ac->ac_2order; i <= sb->s_blocksize_bits + 1; i++) {
+	for (i = ac->ac_2order; i < MB_NUM_ORDERS(sb); i++) {
 		if (grp->bb_counters[i] == 0)
 			continue;
 
@@ -2133,7 +2469,7 @@
 		if (free < ac->ac_g_ex.fe_len)
 			return false;
 
-		if (ac->ac_2order > ac->ac_sb->s_blocksize_bits+1)
+		if (ac->ac_2order >= MB_NUM_ORDERS(ac->ac_sb))
 			return true;
 
 		if (grp->bb_largest_free_order < ac->ac_2order)
@@ -2172,6 +2508,8 @@
 	ext4_grpblk_t free;
 	int ret = 0;
 
+	if (sbi->s_mb_stats)
+		atomic64_inc(&sbi->s_bal_cX_groups_considered[ac->ac_criteria]);
 	if (should_lock)
 		ext4_lock_group(sb, group);
 	free = grp->bb_free;
@@ -2339,13 +2677,13 @@
 	 * We also support searching for power-of-two requests only for
 	 * requests upto maximum buddy size we have constructed.
 	 */
-	if (i >= sbi->s_mb_order2_reqs && i <= sb->s_blocksize_bits + 2) {
+	if (i >= sbi->s_mb_order2_reqs && i <= MB_NUM_ORDERS(sb)) {
 		/*
 		 * This should tell if fe_len is exactly power of 2
 		 */
 		if ((ac->ac_g_ex.fe_len & (~(1 << (i - 1)))) == 0)
 			ac->ac_2order = array_index_nospec(i - 1,
-							   sb->s_blocksize_bits + 2);
+							   MB_NUM_ORDERS(sb));
 	}
 
 	/* if stream allocation is enabled, use global goal */
@@ -2371,17 +2709,21 @@
 		 * from the goal value specified
 		 */
 		group = ac->ac_g_ex.fe_group;
+		ac->ac_last_optimal_group = group;
+		ac->ac_groups_linear_remaining = sbi->s_mb_max_linear_groups;
 		prefetch_grp = group;
 
-		for (i = 0; i < ngroups; group++, i++) {
-			int ret = 0;
+		for (i = 0; i < ngroups; group = next_linear_group(ac, group, ngroups),
+			     i++) {
+			int ret = 0, new_cr;
+
 			cond_resched();
-			/*
-			 * Artificially restricted ngroups for non-extent
-			 * files makes group > ngroups possible on first loop.
-			 */
-			if (group >= ngroups)
-				group = 0;
+
+			ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups);
+			if (new_cr != cr) {
+				cr = new_cr;
+				goto repeat;
+			}
 
 			/*
 			 * Batch reads of the block allocation bitmaps
@@ -2446,6 +2788,9 @@
 			if (ac->ac_status != AC_STATUS_CONTINUE)
 				break;
 		}
+		/* Processed all groups and haven't found blocks */
+		if (sbi->s_mb_stats && i == ngroups)
+			atomic64_inc(&sbi->s_bal_cX_failed[cr]);
 	}
 
 	if (ac->ac_b_ex.fe_len > 0 && ac->ac_status != AC_STATUS_FOUND &&
@@ -2475,6 +2820,9 @@
 			goto repeat;
 		}
 	}
+
+	if (sbi->s_mb_stats && ac->ac_status == AC_STATUS_FOUND)
+		atomic64_inc(&sbi->s_bal_cX_hits[ac->ac_criteria]);
 out:
 	if (!err && ac->ac_status != AC_STATUS_FOUND && first_err)
 		err = first_err;
@@ -2574,6 +2922,157 @@
 	.show   = ext4_mb_seq_groups_show,
 };
 
+int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
+{
+	struct super_block *sb = (struct super_block *)seq->private;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+	seq_puts(seq, "mballoc:\n");
+	if (!sbi->s_mb_stats) {
+		seq_puts(seq, "\tmb stats collection turned off.\n");
+		seq_puts(seq, "\tTo enable, please write \"1\" to sysfs file mb_stats.\n");
+		return 0;
+	}
+	seq_printf(seq, "\treqs: %u\n", atomic_read(&sbi->s_bal_reqs));
+	seq_printf(seq, "\tsuccess: %u\n", atomic_read(&sbi->s_bal_success));
+
+	seq_printf(seq, "\tgroups_scanned: %u\n",  atomic_read(&sbi->s_bal_groups_scanned));
+
+	seq_puts(seq, "\tcr0_stats:\n");
+	seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[0]));
+	seq_printf(seq, "\t\tgroups_considered: %llu\n",
+		   atomic64_read(&sbi->s_bal_cX_groups_considered[0]));
+	seq_printf(seq, "\t\tuseless_loops: %llu\n",
+		   atomic64_read(&sbi->s_bal_cX_failed[0]));
+	seq_printf(seq, "\t\tbad_suggestions: %u\n",
+		   atomic_read(&sbi->s_bal_cr0_bad_suggestions));
+
+	seq_puts(seq, "\tcr1_stats:\n");
+	seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[1]));
+	seq_printf(seq, "\t\tgroups_considered: %llu\n",
+		   atomic64_read(&sbi->s_bal_cX_groups_considered[1]));
+	seq_printf(seq, "\t\tuseless_loops: %llu\n",
+		   atomic64_read(&sbi->s_bal_cX_failed[1]));
+	seq_printf(seq, "\t\tbad_suggestions: %u\n",
+		   atomic_read(&sbi->s_bal_cr1_bad_suggestions));
+
+	seq_puts(seq, "\tcr2_stats:\n");
+	seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[2]));
+	seq_printf(seq, "\t\tgroups_considered: %llu\n",
+		   atomic64_read(&sbi->s_bal_cX_groups_considered[2]));
+	seq_printf(seq, "\t\tuseless_loops: %llu\n",
+		   atomic64_read(&sbi->s_bal_cX_failed[2]));
+
+	seq_puts(seq, "\tcr3_stats:\n");
+	seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[3]));
+	seq_printf(seq, "\t\tgroups_considered: %llu\n",
+		   atomic64_read(&sbi->s_bal_cX_groups_considered[3]));
+	seq_printf(seq, "\t\tuseless_loops: %llu\n",
+		   atomic64_read(&sbi->s_bal_cX_failed[3]));
+	seq_printf(seq, "\textents_scanned: %u\n", atomic_read(&sbi->s_bal_ex_scanned));
+	seq_printf(seq, "\t\tgoal_hits: %u\n", atomic_read(&sbi->s_bal_goals));
+	seq_printf(seq, "\t\t2^n_hits: %u\n", atomic_read(&sbi->s_bal_2orders));
+	seq_printf(seq, "\t\tbreaks: %u\n", atomic_read(&sbi->s_bal_breaks));
+	seq_printf(seq, "\t\tlost: %u\n", atomic_read(&sbi->s_mb_lost_chunks));
+
+	seq_printf(seq, "\tbuddies_generated: %u/%u\n",
+		   atomic_read(&sbi->s_mb_buddies_generated),
+		   ext4_get_groups_count(sb));
+	seq_printf(seq, "\tbuddies_time_used: %llu\n",
+		   atomic64_read(&sbi->s_mb_generation_time));
+	seq_printf(seq, "\tpreallocated: %u\n",
+		   atomic_read(&sbi->s_mb_preallocated));
+	seq_printf(seq, "\tdiscarded: %u\n",
+		   atomic_read(&sbi->s_mb_discarded));
+	return 0;
+}
+
+static void *ext4_mb_seq_structs_summary_start(struct seq_file *seq, loff_t *pos)
+{
+	struct super_block *sb = PDE_DATA(file_inode(seq->file));
+	unsigned long position;
+
+	read_lock(&EXT4_SB(sb)->s_mb_rb_lock);
+
+	if (*pos < 0 || *pos >= MB_NUM_ORDERS(sb) + 1)
+		return NULL;
+	position = *pos + 1;
+	return (void *) ((unsigned long) position);
+}
+
+static void *ext4_mb_seq_structs_summary_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct super_block *sb = PDE_DATA(file_inode(seq->file));
+	unsigned long position;
+
+	++*pos;
+	if (*pos < 0 || *pos >= MB_NUM_ORDERS(sb) + 1)
+		return NULL;
+	position = *pos + 1;
+	return (void *) ((unsigned long) position);
+}
+
+static int ext4_mb_seq_structs_summary_show(struct seq_file *seq, void *v)
+{
+	struct super_block *sb = PDE_DATA(file_inode(seq->file));
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	unsigned long position = ((unsigned long) v);
+	struct ext4_group_info *grp;
+	struct rb_node *n;
+	unsigned int count, min, max;
+
+	position--;
+	if (position >= MB_NUM_ORDERS(sb)) {
+		seq_puts(seq, "fragment_size_tree:\n");
+		n = rb_first(&sbi->s_mb_avg_fragment_size_root);
+		if (!n) {
+			seq_puts(seq, "\ttree_min: 0\n\ttree_max: 0\n\ttree_nodes: 0\n");
+			return 0;
+		}
+		grp = rb_entry(n, struct ext4_group_info, bb_avg_fragment_size_rb);
+		min = grp->bb_fragments ? grp->bb_free / grp->bb_fragments : 0;
+		count = 1;
+		while (rb_next(n)) {
+			count++;
+			n = rb_next(n);
+		}
+		grp = rb_entry(n, struct ext4_group_info, bb_avg_fragment_size_rb);
+		max = grp->bb_fragments ? grp->bb_free / grp->bb_fragments : 0;
+
+		seq_printf(seq, "\ttree_min: %u\n\ttree_max: %u\n\ttree_nodes: %u\n",
+			   min, max, count);
+		return 0;
+	}
+
+	if (position == 0) {
+		seq_printf(seq, "optimize_scan: %d\n",
+			   test_opt2(sb, MB_OPTIMIZE_SCAN) ? 1 : 0);
+		seq_puts(seq, "max_free_order_lists:\n");
+	}
+	count = 0;
+	list_for_each_entry(grp, &sbi->s_mb_largest_free_orders[position],
+			    bb_largest_free_order_node)
+		count++;
+	seq_printf(seq, "\tlist_order_%u_groups: %u\n",
+		   (unsigned int)position, count);
+
+	return 0;
+}
+
+static void ext4_mb_seq_structs_summary_stop(struct seq_file *seq, void *v)
+{
+	struct super_block *sb = PDE_DATA(file_inode(seq->file));
+
+	read_unlock(&EXT4_SB(sb)->s_mb_rb_lock);
+}
+
+const struct seq_operations ext4_mb_seq_structs_summary_ops = {
+	.start  = ext4_mb_seq_structs_summary_start,
+	.next   = ext4_mb_seq_structs_summary_next,
+	.stop   = ext4_mb_seq_structs_summary_stop,
+	.show   = ext4_mb_seq_structs_summary_show,
+};
+
 static struct kmem_cache *get_groupinfo_cache(int blocksize_bits)
 {
 	int cache_index = blocksize_bits - EXT4_MIN_BLOCK_LOG_SIZE;
@@ -2676,7 +3175,10 @@
 	INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
 	init_rwsem(&meta_group_info[i]->alloc_sem);
 	meta_group_info[i]->bb_free_root = RB_ROOT;
+	INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node);
+	RB_CLEAR_NODE(&meta_group_info[i]->bb_avg_fragment_size_rb);
 	meta_group_info[i]->bb_largest_free_order = -1;  /* uninit */
+	meta_group_info[i]->bb_group = group;
 
 	mb_group_bb_bitmap_alloc(sb, meta_group_info[i], group);
 	return 0;
@@ -2837,7 +3339,7 @@
 	unsigned max;
 	int ret;
 
-	i = (sb->s_blocksize_bits + 2) * sizeof(*sbi->s_mb_offsets);
+	i = MB_NUM_ORDERS(sb) * sizeof(*sbi->s_mb_offsets);
 
 	sbi->s_mb_offsets = kmalloc(i, GFP_KERNEL);
 	if (sbi->s_mb_offsets == NULL) {
@@ -2845,7 +3347,7 @@
 		goto out;
 	}
 
-	i = (sb->s_blocksize_bits + 2) * sizeof(*sbi->s_mb_maxs);
+	i = MB_NUM_ORDERS(sb) * sizeof(*sbi->s_mb_maxs);
 	sbi->s_mb_maxs = kmalloc(i, GFP_KERNEL);
 	if (sbi->s_mb_maxs == NULL) {
 		ret = -ENOMEM;
@@ -2871,10 +3373,30 @@
 		offset_incr = offset_incr >> 1;
 		max = max >> 1;
 		i++;
-	} while (i <= sb->s_blocksize_bits + 1);
+	} while (i < MB_NUM_ORDERS(sb));
+
+	sbi->s_mb_avg_fragment_size_root = RB_ROOT;
+	sbi->s_mb_largest_free_orders =
+		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head),
+			GFP_KERNEL);
+	if (!sbi->s_mb_largest_free_orders) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	sbi->s_mb_largest_free_orders_locks =
+		kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t),
+			GFP_KERNEL);
+	if (!sbi->s_mb_largest_free_orders_locks) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	for (i = 0; i < MB_NUM_ORDERS(sb); i++) {
+		INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]);
+		rwlock_init(&sbi->s_mb_largest_free_orders_locks[i]);
+	}
+	rwlock_init(&sbi->s_mb_rb_lock);
 
 	spin_lock_init(&sbi->s_md_lock);
-	spin_lock_init(&sbi->s_bal_lock);
 	sbi->s_mb_free_pending = 0;
 	INIT_LIST_HEAD(&sbi->s_freed_data_list);
 
@@ -2925,6 +3447,10 @@
 		spin_lock_init(&lg->lg_prealloc_lock);
 	}
 
+	if (blk_queue_nonrot(bdev_get_queue(sb->s_bdev)))
+		sbi->s_mb_max_linear_groups = 0;
+	else
+		sbi->s_mb_max_linear_groups = MB_DEFAULT_LINEAR_LIMIT;
 	/* init file for buddy data */
 	ret = ext4_mb_init_backend(sb);
 	if (ret != 0)
@@ -2936,6 +3462,8 @@
 	free_percpu(sbi->s_locality_groups);
 	sbi->s_locality_groups = NULL;
 out:
+	kfree(sbi->s_mb_largest_free_orders);
+	kfree(sbi->s_mb_largest_free_orders_locks);
 	kfree(sbi->s_mb_offsets);
 	sbi->s_mb_offsets = NULL;
 	kfree(sbi->s_mb_maxs);
@@ -2992,6 +3520,8 @@
 		kvfree(group_info);
 		rcu_read_unlock();
 	}
+	kfree(sbi->s_mb_largest_free_orders);
+	kfree(sbi->s_mb_largest_free_orders_locks);
 	kfree(sbi->s_mb_offsets);
 	kfree(sbi->s_mb_maxs);
 	iput(sbi->s_buddy_cache);
@@ -3002,17 +3532,18 @@
 				atomic_read(&sbi->s_bal_reqs),
 				atomic_read(&sbi->s_bal_success));
 		ext4_msg(sb, KERN_INFO,
-		      "mballoc: %u extents scanned, %u goal hits, "
+		      "mballoc: %u extents scanned, %u groups scanned, %u goal hits, "
 				"%u 2^N hits, %u breaks, %u lost",
 				atomic_read(&sbi->s_bal_ex_scanned),
+				atomic_read(&sbi->s_bal_groups_scanned),
 				atomic_read(&sbi->s_bal_goals),
 				atomic_read(&sbi->s_bal_2orders),
 				atomic_read(&sbi->s_bal_breaks),
 				atomic_read(&sbi->s_mb_lost_chunks));
 		ext4_msg(sb, KERN_INFO,
-		       "mballoc: %lu generated and it took %Lu",
-				sbi->s_mb_buddies_generated,
-				sbi->s_mb_generation_time);
+		       "mballoc: %u generated and it took %llu",
+				atomic_read(&sbi->s_mb_buddies_generated),
+				atomic64_read(&sbi->s_mb_generation_time));
 		ext4_msg(sb, KERN_INFO,
 		       "mballoc: %u preallocated, %u discarded",
 				atomic_read(&sbi->s_mb_preallocated),
@@ -3607,12 +4138,13 @@
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 
-	if (sbi->s_mb_stats && ac->ac_g_ex.fe_len > 1) {
+	if (sbi->s_mb_stats && ac->ac_g_ex.fe_len >= 1) {
 		atomic_inc(&sbi->s_bal_reqs);
 		atomic_add(ac->ac_b_ex.fe_len, &sbi->s_bal_allocated);
 		if (ac->ac_b_ex.fe_len >= ac->ac_o_ex.fe_len)
 			atomic_inc(&sbi->s_bal_success);
 		atomic_add(ac->ac_found, &sbi->s_bal_ex_scanned);
+		atomic_add(ac->ac_groups_scanned, &sbi->s_bal_groups_scanned);
 		if (ac->ac_g_ex.fe_start == ac->ac_b_ex.fe_start &&
 				ac->ac_g_ex.fe_group == ac->ac_b_ex.fe_group)
 			atomic_inc(&sbi->s_bal_goals);

diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index e75b474..02585e3 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h

@@ -78,6 +78,23 @@
  */
 #define MB_DEFAULT_MAX_INODE_PREALLOC	512
 
+/*
+ * Number of groups to search linearly before performing group scanning
+ * optimization.
+ */
+#define MB_DEFAULT_LINEAR_LIMIT		4
+
+/*
+ * Minimum number of groups that should be present in the file system to perform
+ * group scanning optimizations.
+ */
+#define MB_DEFAULT_LINEAR_SCAN_THRESHOLD	16
+
+/*
+ * Number of valid buddy orders
+ */
+#define MB_NUM_ORDERS(sb)		((sb)->s_blocksize_bits + 2)
+
 struct ext4_free_data {
 	/* this links the free block information from sb_info */
 	struct list_head		efd_list;
@@ -161,11 +178,14 @@
 	/* copy of the best found extent taken before preallocation efforts */
 	struct ext4_free_extent ac_f_ex;
 
+	ext4_group_t ac_last_optimal_group;
+	__u32 ac_groups_considered;
+	__u32 ac_flags;		/* allocation hints */
 	__u16 ac_groups_scanned;
+	__u16 ac_groups_linear_remaining;
 	__u16 ac_found;
 	__u16 ac_tail;
 	__u16 ac_buddy;
-	__u16 ac_flags;		/* allocation hints */
 	__u8 ac_status;
 	__u8 ac_criteria;
 	__u8 ac_2order;		/* if request is to allocate 2^N blocks and

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index f71de6c..f9589e6 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c

@@ -1776,7 +1776,14 @@
 		memcpy (to, de, rec_len);
 		((struct ext4_dir_entry_2 *) to)->rec_len =
 				ext4_rec_len_to_disk(rec_len, blocksize);
+
+		/* wipe dir_entry excluding the rec_len field */
 		de->inode = 0;
+		memset(&de->name_len, 0, ext4_rec_len_from_disk(de->rec_len,
+								blocksize) -
+					 offsetof(struct ext4_dir_entry_2,
+								name_len));
+
 		map++;
 		to += rec_len;
 	}
@@ -2101,6 +2108,7 @@
 	data2 = bh2->b_data;
 
 	memcpy(data2, de, len);
+	memset(de, 0, len); /* wipe old data */
 	de = (struct ext4_dir_entry_2 *) data2;
 	top = data2 + len;
 	while ((char *)(de2 = ext4_next_entry(de, blocksize)) < top)
@@ -2481,15 +2489,27 @@
 					 entry_buf, buf_size, i))
 			return -EFSCORRUPTED;
 		if (de == de_del)  {
-			if (pde)
+			if (pde) {
 				pde->rec_len = ext4_rec_len_to_disk(
 					ext4_rec_len_from_disk(pde->rec_len,
 							       blocksize) +
 					ext4_rec_len_from_disk(de->rec_len,
 							       blocksize),
 					blocksize);
-			else
+
+				/* wipe entire dir_entry */
+				memset(de, 0, ext4_rec_len_from_disk(de->rec_len,
+								blocksize));
+			} else {
+				/* wipe dir_entry excluding the rec_len field */
 				de->inode = 0;
+				memset(&de->name_len, 0,
+					ext4_rec_len_from_disk(de->rec_len,
+								blocksize) -
+					offsetof(struct ext4_dir_entry_2,
+								name_len));
+			}
+
 			inode_inc_iversion(dir);
 			return 0;
 		}

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 099e4af..ce14df9 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c

@@ -1704,7 +1704,7 @@
 	Opt_dioread_nolock, Opt_dioread_lock,
 	Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
 	Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
-	Opt_prefetch_block_bitmaps,
+	Opt_no_prefetch_block_bitmaps, Opt_mb_optimize_scan,
 #ifdef CONFIG_EXT4_DEBUG
 	Opt_fc_debug_max_replay, Opt_fc_debug_force
 #endif
@@ -1787,7 +1787,6 @@
 	{Opt_auto_da_alloc, "auto_da_alloc"},
 	{Opt_noauto_da_alloc, "noauto_da_alloc"},
 	{Opt_dioread_nolock, "dioread_nolock"},
-	{Opt_dioread_lock, "nodioread_nolock"},
 	{Opt_dioread_lock, "dioread_lock"},
 	{Opt_discard, "discard"},
 	{Opt_nodiscard, "nodiscard"},
@@ -1804,7 +1803,9 @@
 	{Opt_inlinecrypt, "inlinecrypt"},
 	{Opt_nombcache, "nombcache"},
 	{Opt_nombcache, "no_mbcache"},	/* for backward compatibility */
-	{Opt_prefetch_block_bitmaps, "prefetch_block_bitmaps"},
+	{Opt_removed, "prefetch_block_bitmaps"},
+	{Opt_no_prefetch_block_bitmaps, "no_prefetch_block_bitmaps"},
+	{Opt_mb_optimize_scan, "mb_optimize_scan=%d"},
 	{Opt_removed, "check=none"},	/* mount option from ext2/3 */
 	{Opt_removed, "nocheck"},	/* mount option from ext2/3 */
 	{Opt_removed, "reservation"},	/* mount option from ext2/3 */
@@ -1837,6 +1838,8 @@
 }
 
 #define DEFAULT_JOURNAL_IOPRIO (IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 3))
+#define DEFAULT_MB_OPTIMIZE_SCAN	(-1)
+
 static const char deprecated_msg[] =
 	"Mount option \"%s\" will be removed by %s\n"
 	"Contact linux-ext4@vger.kernel.org if you think we should keep it.\n";
@@ -2023,8 +2026,9 @@
 	{Opt_max_dir_size_kb, 0, MOPT_GTE0},
 	{Opt_test_dummy_encryption, 0, MOPT_STRING},
 	{Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
-	{Opt_prefetch_block_bitmaps, EXT4_MOUNT_PREFETCH_BLOCK_BITMAPS,
+	{Opt_no_prefetch_block_bitmaps, EXT4_MOUNT_NO_PREFETCH_BLOCK_BITMAPS,
 	 MOPT_SET},
+	{Opt_mb_optimize_scan, EXT4_MOUNT2_MB_OPTIMIZE_SCAN, MOPT_GTE0},
 #ifdef CONFIG_EXT4_DEBUG
 	{Opt_fc_debug_force, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
 	 MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
@@ -2106,9 +2110,15 @@
 	return 1;
 }
 
+struct ext4_parsed_options {
+	unsigned long journal_devnum;
+	unsigned int journal_ioprio;
+	int mb_optimize_scan;
+};
+
 static int handle_mount_opt(struct super_block *sb, char *opt, int token,
-			    substring_t *args, unsigned long *journal_devnum,
-			    unsigned int *journal_ioprio, int is_remount)
+			    substring_t *args, struct ext4_parsed_options *parsed_opts,
+			    int is_remount)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 	const struct mount_opts *m;
@@ -2265,7 +2275,7 @@
 				 "Cannot specify journal on remount");
 			return -1;
 		}
-		*journal_devnum = arg;
+		parsed_opts->journal_devnum = arg;
 	} else if (token == Opt_journal_path) {
 		char *journal_path;
 		struct inode *journal_inode;
@@ -2301,7 +2311,7 @@
 			return -1;
 		}
 
-		*journal_devnum = new_encode_dev(journal_inode->i_rdev);
+		parsed_opts->journal_devnum = new_encode_dev(journal_inode->i_rdev);
 		path_put(&path);
 		kfree(journal_path);
 	} else if (token == Opt_journal_ioprio) {
@@ -2310,7 +2320,7 @@
 				 " (must be 0-7)");
 			return -1;
 		}
-		*journal_ioprio =
+		parsed_opts->journal_ioprio =
 			IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, arg);
 	} else if (token == Opt_test_dummy_encryption) {
 		return ext4_set_test_dummy_encryption(sb, opt, &args[0],
@@ -2400,6 +2410,13 @@
 		sbi->s_mount_opt |= m->mount_opt;
 	} else if (token == Opt_data_err_ignore) {
 		sbi->s_mount_opt &= ~m->mount_opt;
+	} else if (token == Opt_mb_optimize_scan) {
+		if (arg != 0 && arg != 1) {
+			ext4_msg(sb, KERN_WARNING,
+				 "mb_optimize_scan should be set to 0 or 1.");
+			return -1;
+		}
+		parsed_opts->mb_optimize_scan = arg;
 	} else {
 		if (!args->from)
 			arg = 1;
@@ -2427,8 +2444,7 @@
 }
 
 static int parse_options(char *options, struct super_block *sb,
-			 unsigned long *journal_devnum,
-			 unsigned int *journal_ioprio,
+			 struct ext4_parsed_options *ret_opts,
 			 int is_remount)
 {
 	struct ext4_sb_info __maybe_unused *sbi = EXT4_SB(sb);
@@ -2448,8 +2464,8 @@
 		 */
 		args[0].to = args[0].from = NULL;
 		token = match_token(p, tokens, args);
-		if (handle_mount_opt(sb, p, token, args, journal_devnum,
-				     journal_ioprio, is_remount) < 0)
+		if (handle_mount_opt(sb, p, token, args, ret_opts,
+				     is_remount) < 0)
 			return 0;
 	}
 #ifdef CONFIG_QUOTA
@@ -3711,11 +3727,11 @@
 
 	elr->lr_super = sb;
 	elr->lr_first_not_zeroed = start;
-	if (test_opt(sb, PREFETCH_BLOCK_BITMAPS))
-		elr->lr_mode = EXT4_LI_MODE_PREFETCH_BBITMAP;
-	else {
+	if (test_opt(sb, NO_PREFETCH_BLOCK_BITMAPS)) {
 		elr->lr_mode = EXT4_LI_MODE_ITABLE;
 		elr->lr_next_group = start;
+	} else {
+		elr->lr_mode = EXT4_LI_MODE_PREFETCH_BBITMAP;
 	}
 
 	/*
@@ -3746,7 +3762,7 @@
 		goto out;
 	}
 
-	if (!test_opt(sb, PREFETCH_BLOCK_BITMAPS) &&
+	if (test_opt(sb, NO_PREFETCH_BLOCK_BITMAPS) &&
 	    (first_not_zeroed == ngroups || sb_rdonly(sb) ||
 	     !test_opt(sb, INIT_INODE_TABLE)))
 		goto out;
@@ -4020,7 +4036,6 @@
 	ext4_fsblk_t sb_block = get_sb_block(&data);
 	ext4_fsblk_t logical_sb_block;
 	unsigned long offset = 0;
-	unsigned long journal_devnum = 0;
 	unsigned long def_mount_opts;
 	struct inode *root;
 	const char *descr;
@@ -4031,8 +4046,13 @@
 	int needs_recovery, has_huge_files;
 	__u64 blocks_count;
 	int err = 0;
-	unsigned int journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
 	ext4_group_t first_not_zeroed;
+	struct ext4_parsed_options parsed_opts;
+
+	/* Set defaults for the variables that will be set during parsing */
+	parsed_opts.journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
+	parsed_opts.journal_devnum = 0;
+	parsed_opts.mb_optimize_scan = DEFAULT_MB_OPTIMIZE_SCAN;
 
 	if ((data && !orig_data) || !sbi)
 		goto out_free_base;
@@ -4280,8 +4300,7 @@
 					      GFP_KERNEL);
 		if (!s_mount_opts)
 			goto failed_mount;
-		if (!parse_options(s_mount_opts, sb, &journal_devnum,
-				   &journal_ioprio, 0)) {
+		if (!parse_options(s_mount_opts, sb, &parsed_opts, 0)) {
 			ext4_msg(sb, KERN_WARNING,
 				 "failed to parse options in superblock: %s",
 				 s_mount_opts);
@@ -4289,8 +4308,7 @@
 		kfree(s_mount_opts);
 	}
 	sbi->s_def_mount_opt = sbi->s_mount_opt;
-	if (!parse_options((char *) data, sb, &journal_devnum,
-			   &journal_ioprio, 0))
+	if (!parse_options((char *) data, sb, &parsed_opts, 0))
 		goto failed_mount;
 
 #ifdef CONFIG_UNICODE
@@ -4331,9 +4349,9 @@
 #endif
 
 	if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA) {
-		printk_once(KERN_WARNING "EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, O_DIRECT and fast_commit support!\n");
-		/* can't mount with both data=journal and dioread_nolock. */
-		clear_opt(sb, DIOREAD_NOLOCK);
+		printk_once(KERN_WARNING "EXT4-fs: Warning: mounting "
+			    "with data=journal disables delayed "
+			    "allocation, O_DIRECT and fast_commit support!\n");
 		clear_opt2(sb, JOURNAL_FAST_COMMIT);
 		if (test_opt2(sb, EXPLICIT_DELALLOC)) {
 			ext4_msg(sb, KERN_ERR, "can't mount with "
@@ -4786,7 +4804,7 @@
 	 * root first: it may be modified in the journal!
 	 */
 	if (!test_opt(sb, NOLOAD) && ext4_has_feature_journal(sb)) {
-		err = ext4_load_journal(sb, es, journal_devnum);
+		err = ext4_load_journal(sb, es, parsed_opts.journal_devnum);
 		if (err)
 			goto failed_mount3a;
 	} else if (test_opt(sb, NOLOAD) && !sb_rdonly(sb) &&
@@ -4885,7 +4903,7 @@
 		goto failed_mount_wq;
 	}
 
-	set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
+	set_task_ioprio(sbi->s_journal->j_task, parsed_opts.journal_ioprio);
 
 	sbi->s_journal->j_submit_inode_data_buffers =
 		ext4_journal_submit_inode_data_buffers;
@@ -4996,6 +5014,19 @@
 	ext4_fc_replay_cleanup(sb);
 
 	ext4_ext_init(sb);
+
+	/*
+	 * Enable optimize_scan if number of groups is > threshold. This can be
+	 * turned off by passing "mb_optimize_scan=0". This can also be
+	 * turned on forcefully by passing "mb_optimize_scan=1".
+	 */
+	if (parsed_opts.mb_optimize_scan == 1)
+		set_opt2(sb, MB_OPTIMIZE_SCAN);
+	else if (parsed_opts.mb_optimize_scan == 0)
+		clear_opt2(sb, MB_OPTIMIZE_SCAN);
+	else if (sbi->s_groups_count >= MB_DEFAULT_LINEAR_SCAN_THRESHOLD)
+		set_opt2(sb, MB_OPTIMIZE_SCAN);
+
 	err = ext4_mb_init(sb);
 	if (err) {
 		ext4_msg(sb, KERN_ERR, "failed to initialize mballoc (%d)",
@@ -5590,7 +5621,7 @@
 		return 0;
 	}
 	jbd2_journal_lock_updates(journal);
-	err = jbd2_journal_flush(journal);
+	err = jbd2_journal_flush(journal, 0);
 	if (err < 0)
 		goto out;
 
@@ -5732,7 +5763,7 @@
 		 * Don't clear the needs_recovery flag if we failed to
 		 * flush the journal.
 		 */
-		error = jbd2_journal_flush(journal);
+		error = jbd2_journal_flush(journal, 0);
 		if (error < 0)
 			goto out;
 
@@ -5790,13 +5821,16 @@
 	struct ext4_mount_options old_opts;
 	int enable_quota = 0;
 	ext4_group_t g;
-	unsigned int journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
 	int err = 0;
 #ifdef CONFIG_QUOTA
 	int i, j;
 	char *to_free[EXT4_MAXQUOTAS];
 #endif
 	char *orig_data = kstrdup(data, GFP_KERNEL);
+	struct ext4_parsed_options parsed_opts;
+
+	parsed_opts.journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
+	parsed_opts.journal_devnum = 0;
 
 	if (data && !orig_data)
 		return -ENOMEM;
@@ -5827,7 +5861,8 @@
 			old_opts.s_qf_names[i] = NULL;
 #endif
 	if (sbi->s_journal && sbi->s_journal->j_task->io_context)
-		journal_ioprio = sbi->s_journal->j_task->io_context->ioprio;
+		parsed_opts.journal_ioprio =
+			sbi->s_journal->j_task->io_context->ioprio;
 
 	/*
 	 * Some options can be enabled by ext4 and/or by VFS mount flag
@@ -5837,7 +5872,7 @@
 	vfs_flags = SB_LAZYTIME | SB_I_VERSION;
 	sb->s_flags = (sb->s_flags & ~vfs_flags) | (*flags & vfs_flags);
 
-	if (!parse_options(data, sb, NULL, &journal_ioprio, 1)) {
+	if (!parse_options(data, sb, &parsed_opts, 1)) {
 		err = -EINVAL;
 		goto restore_opts;
 	}
@@ -5887,7 +5922,7 @@
 
 	if (sbi->s_journal) {
 		ext4_init_journal_params(sb, sbi->s_journal);
-		set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
+		set_task_ioprio(sbi->s_journal->j_task, parsed_opts.journal_ioprio);
 	}
 
 	if ((bool)(*flags & SB_RDONLY) != sb_rdonly(sb)) {
@@ -6324,7 +6359,7 @@
 		 * otherwise be livelocked...
 		 */
 		jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
-		err = jbd2_journal_flush(EXT4_SB(sb)->s_journal);
+		err = jbd2_journal_flush(EXT4_SB(sb)->s_journal, 0);
 		jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
 		if (err)
 			return err;

diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index f24bef3..68c772a 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c

@@ -221,6 +221,7 @@
 EXT4_RW_ATTR_SBI_UI(mb_stream_req, s_mb_stream_request);
 EXT4_RW_ATTR_SBI_UI(mb_group_prealloc, s_mb_group_prealloc);
 EXT4_RW_ATTR_SBI_UI(mb_max_inode_prealloc, s_mb_max_inode_prealloc);
+EXT4_RW_ATTR_SBI_UI(mb_max_linear_groups, s_mb_max_linear_groups);
 EXT4_RW_ATTR_SBI_UI(extent_max_zeroout_kb, s_extent_max_zeroout_kb);
 EXT4_ATTR(trigger_fs_error, 0200, trigger_test_error);
 EXT4_RW_ATTR_SBI_UI(err_ratelimit_interval_ms, s_err_ratelimit_state.interval);
@@ -269,6 +270,7 @@
 	ATTR_LIST(mb_stream_req),
 	ATTR_LIST(mb_group_prealloc),
 	ATTR_LIST(mb_max_inode_prealloc),
+	ATTR_LIST(mb_max_linear_groups),
 	ATTR_LIST(max_writeback_mb_bump),
 	ATTR_LIST(extent_max_zeroout_kb),
 	ATTR_LIST(trigger_fs_error),
@@ -534,6 +536,10 @@
 					ext4_fc_info_show, sb);
 		proc_create_seq_data("mb_groups", S_IRUGO, sbi->s_proc,
 				&ext4_mb_seq_groups_ops, sb);
+		proc_create_single_data("mb_stats", 0444, sbi->s_proc,
+				ext4_seq_mb_stats_show, sb);
+		proc_create_seq_data("mb_structs_summary", 0444, sbi->s_proc,
+				&ext4_mb_seq_structs_summary_ops, sb);
 	}
 	return 0;
 }

diff --git a/fs/file_table.c b/fs/file_table.c
index 709ada3..fbd45a1a 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c

@@ -279,6 +279,7 @@
 	}
 	if (file->f_op->release)
 		file->f_op->release(inode, file);
+	security_file_pre_free(file);
 	if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
 		     !(mode & FMODE_PATH))) {
 		cdev_put(inode->i_cdev);

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 8bb17b6d..7d18ae0 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c

@@ -501,8 +501,10 @@
 		current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY;
 	io_wq_switch_blkcg(worker, work);
 #ifdef CONFIG_AUDIT
-	current->loginuid = work->identity->loginuid;
-	current->sessionid = work->identity->sessionid;
+	if (current->audit) {
+		current->audit->loginuid = work->identity->loginuid;
+		current->audit->sessionid = work->identity->sessionid;
+	}
 #endif
 }
 
@@ -517,8 +519,10 @@
 	}
 
 #ifdef CONFIG_AUDIT
-	current->loginuid = KUIDT_INIT(AUDIT_UID_UNSET);
-	current->sessionid = AUDIT_SID_UNSET;
+	if (current->audit) {
+		current->audit->loginuid = KUIDT_INIT(AUDIT_UID_UNSET);
+		current->audit->sessionid = AUDIT_SID_UNSET;
+	}
 #endif
 
 	spin_lock_irq(&worker->lock);

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 2009d1c..246c6ae 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c

@@ -1126,8 +1126,8 @@
 	id->fs = current->fs;
 	id->fsize = rlimit(RLIMIT_FSIZE);
 #ifdef CONFIG_AUDIT
-	id->loginuid = current->loginuid;
-	id->sessionid = current->sessionid;
+	id->loginuid = audit_get_loginuid(current);
+	id->sessionid = audit_get_sessionid(current);
 #endif
 	refcount_set(&id->count, 1);
 }
@@ -1380,8 +1380,8 @@
 		req->work.flags |= IO_WQ_WORK_CREDS;
 	}
 #ifdef CONFIG_AUDIT
-	if (!uid_eq(current->loginuid, id->loginuid) ||
-	    current->sessionid != id->sessionid)
+	if (!uid_eq(audit_get_loginuid(current), id->loginuid) ||
+	    audit_get_sessionid(current) != id->sessionid)
 		return false;
 #endif
 	if (!(req->work.flags & IO_WQ_WORK_FS) &&
@@ -6904,8 +6904,10 @@
 			}
 			io_sq_thread_associate_blkcg(ctx, &cur_css);
 #ifdef CONFIG_AUDIT
-			current->loginuid = ctx->loginuid;
-			current->sessionid = ctx->sessionid;
+			if (current->audit) {
+				current->audit->loginuid = ctx->loginuid;
+				current->audit->sessionid = ctx->sessionid;
+			}
 #endif
 
 			ret |= __io_sq_thread(ctx, start_jiffies, cap_entries);
@@ -9409,8 +9411,8 @@
 	ctx->user = user;
 	ctx->creds = get_current_cred();
 #ifdef CONFIG_AUDIT
-	ctx->loginuid = current->loginuid;
-	ctx->sessionid = current->sessionid;
+	ctx->loginuid = audit_get_loginuid(current);
+	ctx->sessionid = audit_get_sessionid(current);
 #endif
 	ctx->sqo_task = get_task_struct(current);
 

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 188f79d..8789c75 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c

@@ -1686,6 +1686,110 @@
 	write_unlock(&journal->j_state_lock);
 }
 
+/**
+ * __jbd2_journal_erase() - Discard or zeroout journal blocks (excluding superblock)
+ * @journal: The journal to erase.
+ * @flags: A discard/zeroout request is sent for each physically contigous
+ *	region of the journal. Either JBD2_JOURNAL_FLUSH_DISCARD or
+ *	JBD2_JOURNAL_FLUSH_ZEROOUT must be set to determine which operation
+ *	to perform.
+ *
+ * Note: JBD2_JOURNAL_FLUSH_ZEROOUT attempts to use hardware offload. Zeroes
+ * will be explicitly written if no hardware offload is available, see
+ * blkdev_issue_zeroout for more details.
+ */
+static int __jbd2_journal_erase(journal_t *journal, unsigned int flags)
+{
+	int err = 0;
+	unsigned long block, log_offset; /* logical */
+	unsigned long long phys_block, block_start, block_stop; /* physical */
+	loff_t byte_start, byte_stop, byte_count;
+	struct request_queue *q = bdev_get_queue(journal->j_dev);
+
+	/* flags must be set to either discard or zeroout */
+	if ((flags & ~JBD2_JOURNAL_FLUSH_VALID) || !flags ||
+			((flags & JBD2_JOURNAL_FLUSH_DISCARD) &&
+			(flags & JBD2_JOURNAL_FLUSH_ZEROOUT)))
+		return -EINVAL;
+
+	if (!q)
+		return -ENXIO;
+
+	if ((flags & JBD2_JOURNAL_FLUSH_DISCARD) && !blk_queue_discard(q))
+		return -EOPNOTSUPP;
+
+	/*
+	 * lookup block mapping and issue discard/zeroout for each
+	 * contiguous region
+	 */
+	log_offset = be32_to_cpu(journal->j_superblock->s_first);
+	block_start =  ~0ULL;
+	for (block = log_offset; block < journal->j_total_len; block++) {
+		err = jbd2_journal_bmap(journal, block, &phys_block);
+		if (err) {
+			pr_err("JBD2: bad block at offset %lu", block);
+			return err;
+		}
+
+		if (block_start == ~0ULL) {
+			block_start = phys_block;
+			block_stop = block_start - 1;
+		}
+
+		/*
+		 * last block not contiguous with current block,
+		 * process last contiguous region and return to this block on
+		 * next loop
+		 */
+		if (phys_block != block_stop + 1) {
+			block--;
+		} else {
+			block_stop++;
+			/*
+			 * if this isn't the last block of journal,
+			 * no need to process now because next block may also
+			 * be part of this contiguous region
+			 */
+			if (block != journal->j_total_len - 1)
+				continue;
+		}
+
+		/*
+		 * end of contiguous region or this is last block of journal,
+		 * take care of the region
+		 */
+		byte_start = block_start * journal->j_blocksize;
+		byte_stop = block_stop * journal->j_blocksize;
+		byte_count = (block_stop - block_start + 1) *
+				journal->j_blocksize;
+
+		truncate_inode_pages_range(journal->j_dev->bd_inode->i_mapping,
+				byte_start, byte_stop);
+
+		if (flags & JBD2_JOURNAL_FLUSH_DISCARD) {
+			err = blkdev_issue_discard(journal->j_dev,
+					byte_start >> SECTOR_SHIFT,
+					byte_count >> SECTOR_SHIFT,
+					GFP_NOFS, 0);
+		} else if (flags & JBD2_JOURNAL_FLUSH_ZEROOUT) {
+			err = blkdev_issue_zeroout(journal->j_dev,
+					byte_start >> SECTOR_SHIFT,
+					byte_count >> SECTOR_SHIFT,
+					GFP_NOFS, 0);
+		}
+
+		if (unlikely(err != 0)) {
+			pr_err("JBD2: (error %d) unable to wipe journal at physical blocks %llu - %llu",
+					err, block_start, block_stop);
+			return err;
+		}
+
+		/* reset start and stop after processing a region */
+		block_start = ~0ULL;
+	}
+
+	return blkdev_issue_flush(journal->j_dev, GFP_NOFS);
+}
 
 /**
  * jbd2_journal_update_sb_errno() - Update error in the journal.
@@ -1869,9 +1973,7 @@
 
 	if (jbd2_has_feature_fast_commit(journal)) {
 		journal->j_fc_last = be32_to_cpu(sb->s_maxlen);
-		num_fc_blocks = be32_to_cpu(sb->s_num_fc_blks);
-		if (!num_fc_blocks)
-			num_fc_blocks = JBD2_MIN_FC_BLOCKS;
+		num_fc_blocks = jbd2_journal_get_num_fc_blks(sb);
 		if (journal->j_last - num_fc_blocks >= JBD2_MIN_JOURNAL_BLOCKS)
 			journal->j_last = journal->j_fc_last - num_fc_blocks;
 		journal->j_fc_first = journal->j_last + 1;
@@ -2102,9 +2204,7 @@
 	journal_superblock_t *sb = journal->j_superblock;
 	unsigned long long num_fc_blks;
 
-	num_fc_blks = be32_to_cpu(sb->s_num_fc_blks);
-	if (num_fc_blks == 0)
-		num_fc_blks = JBD2_MIN_FC_BLOCKS;
+	num_fc_blks = jbd2_journal_get_num_fc_blks(sb);
 	if (journal->j_last - num_fc_blks < JBD2_MIN_JOURNAL_BLOCKS)
 		return -ENOSPC;
 
@@ -2250,13 +2350,18 @@
 /**
  * jbd2_journal_flush() - Flush journal
  * @journal: Journal to act on.
+ * @flags: optional operation on the journal blocks after the flush (see below)
  *
  * Flush all data for a given journal to disk and empty the journal.
  * Filesystems can use this when remounting readonly to ensure that
- * recovery does not need to happen on remount.
+ * recovery does not need to happen on remount. Optionally, a discard or zeroout
+ * can be issued on the journal blocks after flushing.
+ *
+ * flags:
+ *	JBD2_JOURNAL_FLUSH_DISCARD: issues discards for the journal blocks
+ *	JBD2_JOURNAL_FLUSH_ZEROOUT: issues zeroouts for the journal blocks
  */
-
-int jbd2_journal_flush(journal_t *journal)
+int jbd2_journal_flush(journal_t *journal, unsigned int flags)
 {
 	int err = 0;
 	transaction_t *transaction = NULL;
@@ -2310,6 +2415,10 @@
 	 * commits of data to the journal will restore the current
 	 * s_start value. */
 	jbd2_mark_journal_empty(journal, REQ_SYNC | REQ_FUA);
+
+	if (flags)
+		err = __jbd2_journal_erase(journal, flags);
+
 	mutex_unlock(&journal->j_checkpoint_mutex);
 	write_lock(&journal->j_state_lock);
 	J_ASSERT(!journal->j_running_transaction);

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index 7871078..1b41bf9 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c

@@ -6020,7 +6020,7 @@
 	 * Then truncate log will be replayed resulting in cluster double free.
 	 */
 	jbd2_journal_lock_updates(journal->j_journal);
-	status = jbd2_journal_flush(journal->j_journal);
+	status = jbd2_journal_flush(journal->j_journal, 0);
 	jbd2_journal_unlock_updates(journal->j_journal);
 	if (status < 0) {
 		mlog_errno(status);

diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index db52e84..a143854 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c

@@ -310,7 +310,7 @@
 	}
 
 	jbd2_journal_lock_updates(journal->j_journal);
-	status = jbd2_journal_flush(journal->j_journal);
+	status = jbd2_journal_flush(journal->j_journal, 0);
 	jbd2_journal_unlock_updates(journal->j_journal);
 	if (status < 0) {
 		up_write(&journal->j_trans_barrier);
@@ -1002,7 +1002,7 @@
 
 	if (ocfs2_mount_local(osb)) {
 		jbd2_journal_lock_updates(journal->j_journal);
-		status = jbd2_journal_flush(journal->j_journal);
+		status = jbd2_journal_flush(journal->j_journal, 0);
 		jbd2_journal_unlock_updates(journal->j_journal);
 		if (status < 0)
 			mlog_errno(status);
@@ -1072,7 +1072,7 @@
 
 	if (replayed) {
 		jbd2_journal_lock_updates(journal->j_journal);
-		status = jbd2_journal_flush(journal->j_journal);
+		status = jbd2_journal_flush(journal->j_journal, 0);
 		jbd2_journal_unlock_updates(journal->j_journal);
 		if (status < 0)
 			mlog_errno(status);
@@ -1668,7 +1668,7 @@
 
 	/* wipe the journal */
 	jbd2_journal_lock_updates(journal);
-	status = jbd2_journal_flush(journal);
+	status = jbd2_journal_flush(journal, 0);
 	jbd2_journal_unlock_updates(journal);
 	if (status < 0)
 		mlog_errno(status);

diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index c930001..c54bc40 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig

@@ -107,3 +107,11 @@
 config PROC_CPU_RESCTRL
 	def_bool n
 	depends on PROC_FS
+
+config PROC_SELF_MEM_READONLY
+	bool "Force /proc/<pid>/mem paths to be read-only"
+	default y
+	help
+	  When enabled, attempts to open /proc/self/mem for write access
+	  will always fail.  Write access to this file allows bypassing
+	  of memory map permissions (such as modifying read-only code).

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 5d52aea8..0e8c58f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c

@@ -150,6 +150,12 @@
 		NULL, &proc_pid_attr_operations,	\
 		{ .lsm = LSM })
 
+#ifdef CONFIG_PROC_SELF_MEM_READONLY
+# define PROC_PID_MEM_MODE S_IRUSR
+#else
+# define PROC_PID_MEM_MODE S_IRUSR|S_IWUSR
+#endif
+
 /*
  * Count the number of hardlinks for the pid_entry table, excluding the .
  * and .. links.
@@ -896,7 +902,11 @@
 static ssize_t mem_write(struct file *file, const char __user *buf,
 			 size_t count, loff_t *ppos)
 {
+#ifdef CONFIG_PROC_SELF_MEM_READONLY
+	return -EACCES;
+#else
 	return mem_rw(file, (char __user*)buf, count, ppos, 1);
+#endif
 }
 
 loff_t mem_lseek(struct file *file, loff_t offset, int orig)
@@ -3203,7 +3213,7 @@
 #ifdef CONFIG_NUMA
 	REG("numa_maps",  S_IRUGO, proc_pid_numa_maps_operations),
 #endif
-	REG("mem",        S_IRUSR|S_IWUSR, proc_mem_operations),
+	REG("mem",        PROC_PID_MEM_MODE, proc_mem_operations),
 	LNK("cwd",        proc_cwd_link),
 	LNK("root",       proc_root_link),
 	LNK("exe",        proc_exe_link),
@@ -3543,7 +3553,7 @@
 #ifdef CONFIG_NUMA
 	REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations),
 #endif
-	REG("mem",       S_IRUSR|S_IWUSR, proc_mem_operations),
+	REG("mem",       PROC_PID_MEM_MODE, proc_mem_operations),
 	LNK("cwd",       proc_cwd_link),
 	LNK("root",      proc_root_link),
 	LNK("exe",       proc_exe_link),

diff --git a/include/linux/audit.h b/include/linux/audit.h
index b3d8598..3d5b243 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h

@@ -118,6 +118,17 @@
 	AUDIT_NFT_OP_INVALID,
 };
 
+struct audit_task_info {
+	kuid_t			loginuid;
+	unsigned int		sessionid;
+	u64			contid;
+#ifdef CONFIG_AUDITSYSCALL
+	struct audit_context	*ctx;
+#endif
+};
+
+extern struct audit_task_info init_struct_audit;
+
 extern int is_audit_feature_set(int which);
 
 extern int __init audit_register_class(int class, unsigned *list);
@@ -154,6 +165,9 @@
 #ifdef CONFIG_AUDIT
 /* These are defined in audit.c */
 				/* Public API */
+extern int  audit_alloc(struct task_struct *task);
+extern void audit_free(struct task_struct *task);
+extern void __init audit_task_init(void);
 extern __printf(4, 5)
 void audit_log(struct audit_context *ctx, gfp_t gfp_mask, int type,
 	       const char *fmt, ...);
@@ -197,12 +211,25 @@
 
 static inline kuid_t audit_get_loginuid(struct task_struct *tsk)
 {
-	return tsk->loginuid;
+	if (!tsk->audit)
+		return INVALID_UID;
+	return tsk->audit->loginuid;
 }
 
 static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
 {
-	return tsk->sessionid;
+	if (!tsk->audit)
+		return AUDIT_SID_UNSET;
+	return tsk->audit->sessionid;
+}
+
+extern int audit_set_contid(struct task_struct *tsk, u64 contid);
+
+static inline u64 audit_get_contid(struct task_struct *tsk)
+{
+	if (!tsk->audit)
+		return AUDIT_CID_UNSET;
+	return tsk->audit->contid;
 }
 
 extern u32 audit_enabled;
@@ -210,6 +237,14 @@
 extern int audit_signal_info(int sig, struct task_struct *t);
 
 #else /* CONFIG_AUDIT */
+static inline int audit_alloc(struct task_struct *task)
+{
+	return 0;
+}
+static inline void audit_free(struct task_struct *task)
+{ }
+static inline void __init audit_task_init(void)
+{ }
 static inline __printf(4, 5)
 void audit_log(struct audit_context *ctx, gfp_t gfp_mask, int type,
 	       const char *fmt, ...)
@@ -261,6 +296,11 @@
 	return AUDIT_SID_UNSET;
 }
 
+static inline u64 audit_get_contid(struct task_struct *tsk)
+{
+	return AUDIT_CID_UNSET;
+}
+
 #define audit_enabled AUDIT_OFF
 
 static inline int audit_signal_info(int sig, struct task_struct *t)
@@ -285,8 +325,6 @@
 
 /* These are defined in auditsc.c */
 				/* Public API */
-extern int  audit_alloc(struct task_struct *task);
-extern void __audit_free(struct task_struct *task);
 extern void __audit_syscall_entry(int major, unsigned long a0, unsigned long a1,
 				  unsigned long a2, unsigned long a3);
 extern void __audit_syscall_exit(int ret_success, long ret_value);
@@ -306,12 +344,14 @@
 
 static inline void audit_set_context(struct task_struct *task, struct audit_context *ctx)
 {
-	task->audit_context = ctx;
+	task->audit->ctx = ctx;
 }
 
 static inline struct audit_context *audit_context(void)
 {
-	return current->audit_context;
+	if (!current->audit)
+		return NULL;
+	return current->audit->ctx;
 }
 
 static inline bool audit_dummy_context(void)
@@ -319,11 +359,7 @@
 	void *p = audit_context();
 	return !p || *(int *)p;
 }
-static inline void audit_free(struct task_struct *task)
-{
-	if (unlikely(task->audit_context))
-		__audit_free(task);
-}
+
 static inline void audit_syscall_entry(int major, unsigned long a0,
 				       unsigned long a1, unsigned long a2,
 				       unsigned long a3)
@@ -556,12 +592,6 @@
 extern int audit_n_rules;
 extern int audit_signals;
 #else /* CONFIG_AUDITSYSCALL */
-static inline int audit_alloc(struct task_struct *task)
-{
-	return 0;
-}
-static inline void audit_free(struct task_struct *task)
-{ }
 static inline void audit_syscall_entry(int major, unsigned long a0,
 				       unsigned long a1, unsigned long a2,
 				       unsigned long a3)
@@ -694,4 +724,19 @@
 	return uid_valid(audit_get_loginuid(tsk));
 }
 
+static inline bool audit_contid_valid(u64 contid)
+{
+	return contid != AUDIT_CID_UNSET;
+}
+
+static inline bool audit_contid_set(struct task_struct *tsk)
+{
+	return audit_contid_valid(audit_get_contid(tsk));
+}
+
+static inline void audit_log_string(struct audit_buffer *ab, const char *buf)
+{
+	audit_log_n_string(ab, buf, strlen(buf));
+}
+
 #endif

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8aae375..3f553f1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h

@@ -471,6 +471,7 @@
 
 	unsigned int		dma_pad_mask;
 	unsigned int		dma_alignment;
+	unsigned int		split_alignment;
 
 #ifdef CONFIG_BLK_INLINE_ENCRYPTION
 	/* Inline crypto capabilities */

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 3f93a50..38255f2 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h

@@ -311,6 +311,7 @@
 	RET_PTR_TO_BTF_ID_OR_NULL,	/* returns a pointer to a btf_id or NULL */
 	RET_PTR_TO_MEM_OR_BTF_ID_OR_NULL, /* returns a pointer to a valid memory or a btf_id or NULL */
 	RET_PTR_TO_MEM_OR_BTF_ID,	/* returns a pointer to a valid memory or a btf_id */
+	RET_PTR_TO_BTF_ID,		/* returns a pointer to a btf_id */
 };
 
 /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs
@@ -1868,6 +1869,8 @@
 extern const struct bpf_func_proto bpf_snprintf_btf_proto;
 extern const struct bpf_func_proto bpf_per_cpu_ptr_proto;
 extern const struct bpf_func_proto bpf_this_cpu_ptr_proto;
+extern const struct bpf_func_proto bpf_ktime_get_coarse_ns_proto;
+extern const struct bpf_func_proto bpf_get_socket_ptr_cookie_proto;
 
 const struct bpf_func_proto *bpf_tracing_func_proto(
 	enum bpf_func_id func_id, const struct bpf_prog *prog);

diff --git a/include/linux/bpf_lsm.h b/include/linux/bpf_lsm.h
index aaacb6a..0d1c33a 100644
--- a/include/linux/bpf_lsm.h
+++ b/include/linux/bpf_lsm.h

@@ -7,6 +7,7 @@
 #ifndef _LINUX_BPF_LSM_H
 #define _LINUX_BPF_LSM_H
 
+#include <linux/sched.h>
 #include <linux/bpf.h>
 #include <linux/lsm_hooks.h>
 
@@ -26,6 +27,8 @@
 int bpf_lsm_verify_prog(struct bpf_verifier_log *vlog,
 			const struct bpf_prog *prog);
 
+bool bpf_lsm_is_sleepable_hook(u32 btf_id);
+
 static inline struct bpf_storage_blob *bpf_inode(
 	const struct inode *inode)
 {
@@ -35,12 +38,29 @@
 	return inode->i_security + bpf_lsm_blob_sizes.lbs_inode;
 }
 
+static inline struct bpf_storage_blob *bpf_task(
+	const struct task_struct *task)
+{
+	if (unlikely(!task->security))
+		return NULL;
+
+	return task->security + bpf_lsm_blob_sizes.lbs_task;
+}
+
 extern const struct bpf_func_proto bpf_inode_storage_get_proto;
 extern const struct bpf_func_proto bpf_inode_storage_delete_proto;
+extern const struct bpf_func_proto bpf_task_storage_get_proto;
+extern const struct bpf_func_proto bpf_task_storage_delete_proto;
 void bpf_inode_storage_free(struct inode *inode);
+void bpf_task_storage_free(struct task_struct *task);
 
 #else /* !CONFIG_BPF_LSM */
 
+static inline bool bpf_lsm_is_sleepable_hook(u32 btf_id)
+{
+	return false;
+}
+
 static inline int bpf_lsm_verify_prog(struct bpf_verifier_log *vlog,
 				      const struct bpf_prog *prog)
 {
@@ -53,10 +73,20 @@
 	return NULL;
 }
 
+static inline struct bpf_storage_blob *bpf_task(
+	const struct task_struct *task)
+{
+	return NULL;
+}
+
 static inline void bpf_inode_storage_free(struct inode *inode)
 {
 }
 
+static inline void bpf_task_storage_free(struct task_struct *task)
+{
+}
+
 #endif /* CONFIG_BPF_LSM */
 
 #endif /* _LINUX_BPF_LSM_H */

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index a8137bb..e256d6e 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h

@@ -109,6 +109,7 @@
 #endif
 #ifdef CONFIG_BPF_LSM
 BPF_MAP_TYPE(BPF_MAP_TYPE_INODE_STORAGE, inode_storage_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_TASK_STORAGE, task_storage_map_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
 #if defined(CONFIG_XDP_SOCKETS)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 822b701..97534c9 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h

@@ -264,15 +264,32 @@
 		.off   = OFF,					\
 		.imm   = 0 })
 
-/* Atomic memory add, *(uint *)(dst_reg + off16) += src_reg */
 
-#define BPF_STX_XADD(SIZE, DST, SRC, OFF)			\
+/*
+ * Atomic operations:
+ *
+ *   BPF_ADD                  *(uint *) (dst_reg + off16) += src_reg
+ *   BPF_AND                  *(uint *) (dst_reg + off16) &= src_reg
+ *   BPF_OR                   *(uint *) (dst_reg + off16) |= src_reg
+ *   BPF_XOR                  *(uint *) (dst_reg + off16) ^= src_reg
+ *   BPF_ADD | BPF_FETCH      src_reg = atomic_fetch_add(dst_reg + off16, src_reg);
+ *   BPF_AND | BPF_FETCH      src_reg = atomic_fetch_and(dst_reg + off16, src_reg);
+ *   BPF_OR | BPF_FETCH       src_reg = atomic_fetch_or(dst_reg + off16, src_reg);
+ *   BPF_XOR | BPF_FETCH      src_reg = atomic_fetch_xor(dst_reg + off16, src_reg);
+ *   BPF_XCHG                 src_reg = atomic_xchg(dst_reg + off16, src_reg)
+ *   BPF_CMPXCHG              r0 = atomic_cmpxchg(dst_reg + off16, r0, src_reg)
+ */
+
+#define BPF_ATOMIC_OP(SIZE, OP, DST, SRC, OFF)			\
 	((struct bpf_insn) {					\
-		.code  = BPF_STX | BPF_SIZE(SIZE) | BPF_XADD,	\
+		.code  = BPF_STX | BPF_SIZE(SIZE) | BPF_ATOMIC,	\
 		.dst_reg = DST,					\
 		.src_reg = SRC,					\
 		.off   = OFF,					\
-		.imm   = 0 })
+		.imm   = OP })
+
+/* Legacy alias */
+#define BPF_STX_XADD(SIZE, DST, SRC, OFF) BPF_ATOMIC_OP(SIZE, BPF_ADD, DST, SRC, OFF)
 
 /* Memory store, *(uint *) (dst_reg + off16) = imm32 */
 

diff --git a/include/linux/ima.h b/include/linux/ima.h
index 8fa7bcf..7233a27 100644
--- a/include/linux/ima.h
+++ b/include/linux/ima.h

@@ -29,6 +29,7 @@
 			      enum kernel_read_file_id id);
 extern void ima_post_path_mknod(struct dentry *dentry);
 extern int ima_file_hash(struct file *file, char *buf, size_t buf_size);
+extern int ima_inode_hash(struct inode *inode, char *buf, size_t buf_size);
 extern void ima_kexec_cmdline(int kernel_fd, const void *buf, int size);
 
 #ifdef CONFIG_IMA_KEXEC
@@ -115,6 +116,11 @@
 	return -EOPNOTSUPP;
 }
 
+static inline int ima_inode_hash(struct inode *inode, char *buf, size_t buf_size)
+{
+	return -EOPNOTSUPP;
+}
+
 static inline void ima_kexec_cmdline(int kernel_fd, const void *buf, int size) {}
 #endif /* CONFIG_IMA */
 

diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 578ff19..81498a3 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h

@@ -68,7 +68,7 @@
 extern void jbd2_free(void *ptr, size_t size);
 
 #define JBD2_MIN_JOURNAL_BLOCKS 1024
-#define JBD2_MIN_FC_BLOCKS	256
+#define JBD2_DEFAULT_FAST_COMMIT_BLOCKS 256
 
 #ifdef __KERNEL__
 
@@ -1360,6 +1360,10 @@
 						 * mode */
 #define JBD2_FAST_COMMIT_ONGOING	0x100	/* Fast commit is ongoing */
 #define JBD2_FULL_COMMIT_ONGOING	0x200	/* Full commit is ongoing */
+#define JBD2_JOURNAL_FLUSH_DISCARD	0x0001
+#define JBD2_JOURNAL_FLUSH_ZEROOUT	0x0002
+#define JBD2_JOURNAL_FLUSH_VALID	(JBD2_JOURNAL_FLUSH_DISCARD | \
+					JBD2_JOURNAL_FLUSH_ZEROOUT)
 
 /*
  * Function declarations for the journaling transaction and buffer
@@ -1490,7 +1494,7 @@
 				struct page *, unsigned int, unsigned int);
 extern int	 jbd2_journal_try_to_free_buffers(journal_t *journal, struct page *page);
 extern int	 jbd2_journal_stop(handle_t *);
-extern int	 jbd2_journal_flush (journal_t *);
+extern int	 jbd2_journal_flush(journal_t *journal, unsigned int flags);
 extern void	 jbd2_journal_lock_updates (journal_t *);
 extern void	 jbd2_journal_unlock_updates (journal_t *);
 
@@ -1691,6 +1695,13 @@
 	return journal->j_chksum_driver != NULL;
 }
 
+static inline int jbd2_journal_get_num_fc_blks(journal_superblock_t *jsb)
+{
+	int num_fc_blocks = be32_to_cpu(jsb->s_num_fc_blks);
+
+	return num_fc_blocks ? num_fc_blocks : JBD2_DEFAULT_FAST_COMMIT_BLOCKS;
+}
+
 /*
  * Return number of free blocks in the log. Must be called under j_state_lock.
  */

diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index 32a9401..4cd6eb3 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h

@@ -156,6 +156,7 @@
 LSM_HOOK(int, 0, file_permission, struct file *file, int mask)
 LSM_HOOK(int, 0, file_alloc_security, struct file *file)
 LSM_HOOK(void, LSM_RET_VOID, file_free_security, struct file *file)
+LSM_HOOK(void, LSM_RET_VOID, file_pre_free_security, struct file *file)
 LSM_HOOK(int, 0, file_ioctl, struct file *file, unsigned int cmd,
 	 unsigned long arg)
 LSM_HOOK(int, 0, mmap_addr, unsigned long addr)
@@ -173,6 +174,7 @@
 LSM_HOOK(int, 0, file_open, struct file *file)
 LSM_HOOK(int, 0, task_alloc, struct task_struct *task,
 	 unsigned long clone_flags)
+LSM_HOOK(void, LSM_RET_VOID, task_post_alloc, struct task_struct *task)
 LSM_HOOK(void, LSM_RET_VOID, task_free, struct task_struct *task)
 LSM_HOOK(int, 0, cred_alloc_blank, struct cred *cred, gfp_t gfp)
 LSM_HOOK(void, LSM_RET_VOID, cred_free, struct cred *cred)
@@ -211,6 +213,7 @@
 LSM_HOOK(int, 0, task_movememory, struct task_struct *p)
 LSM_HOOK(int, 0, task_kill, struct task_struct *p, struct kernel_siginfo *info,
 	 int sig, const struct cred *cred)
+LSM_HOOK(void, LSM_RET_VOID, task_exit, struct task_struct *p)
 LSM_HOOK(int, -ENOSYS, task_prctl, int option, unsigned long arg2,
 	 unsigned long arg3, unsigned long arg4, unsigned long arg5)
 LSM_HOOK(void, LSM_RET_VOID, task_to_inode, struct task_struct *p,

diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index c503f7a..fa18e56 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h

@@ -513,6 +513,10 @@
  * @file_free_security:
  *	Deallocate and free any security structures stored in file->f_security.
  *	@file contains the file structure being modified.
+ * @file_pre_free_security:
+ *	Perform any logging or LSM state updates for a file being deleted
+ *	using fields of the file before they have been cleared.
+ *	@file contains the file structure being freed
  * @file_ioctl:
  *	@file contains the file structure.
  *	@cmd contains the operation to perform.
@@ -589,6 +593,10 @@
  *	@clone_flags contains the flags indicating what should be shared.
  *	Handle allocation of task-related resources.
  *	Returns a zero on success, negative values on failure.
+ * @task_post_alloc:
+ *	@task task being allocated.
+ *	Handle allocation of task-related resources after all task fields are
+ *	filled in.
  * @task_free:
  *	@task task about to be freed.
  *	Handle release of task-related resources. (Note that this can be called
@@ -759,6 +767,9 @@
  *	@cred contains the cred of the process where the signal originated, or
  *	NULL if the current task is the originator.
  *	Return 0 if permission is granted.
+ * @task_exit:
+ *      Called early when a task is exiting before all state is lost.
+ *      @p contains the task_struct for process.
  * @task_prctl:
  *	Check permission before performing a process control operation on the
  *	current process.

diff --git a/include/linux/mmu_context.h b/include/linux/mmu_context.h
index 03dee12d..2494b64 100644
--- a/include/linux/mmu_context.h
+++ b/include/linux/mmu_context.h

@@ -5,6 +5,11 @@
 #include <asm/mmu_context.h>
 #include <asm/mmu.h>
 
+struct mm_struct;
+
+void use_mm(struct mm_struct *mm);
+void unuse_mm(struct mm_struct *mm);
+
 /* Architectures that care about IRQ state in switch_mm can override this. */
 #ifndef switch_mm_irqs_off
 # define switch_mm_irqs_off switch_mm

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 5a5cb45..1a1ee88 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h

@@ -33,6 +33,9 @@
 	struct ucounts *ucounts;
 	int reboot;	/* group exit code if this pidns was rebooted */
 	struct ns_common ns;
+#ifdef CONFIG_SECURITY_CONTAINER_MONITOR
+	u64 cid;  /* Main container identifier, zero if not assigned. */
+#endif
 } __randomize_layout;
 
 extern struct pid_namespace init_pid_ns;

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2660ee4..fd675de 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h

@@ -36,7 +36,6 @@
 #include <linux/kcsan.h>
 
 /* task_struct member predeclarations (sorted alphabetically): */
-struct audit_context;
 struct backing_dev_info;
 struct bio_list;
 struct blk_plug;
@@ -988,11 +987,7 @@
 	struct callback_head		*task_works;
 
 #ifdef CONFIG_AUDIT
-#ifdef CONFIG_AUDITSYSCALL
-	struct audit_context		*audit_context;
-#endif
-	kuid_t				loginuid;
-	unsigned int			sessionid;
+	struct audit_task_info		*audit;
 #endif
 	struct seccomp			seccomp;
 
@@ -1765,6 +1760,12 @@
 extern int wake_up_process(struct task_struct *tsk);
 extern void wake_up_new_task(struct task_struct *tsk);
 
+/*
+ * Wake up tsk and try to swap it into the current tasks place, which
+ * initially means just trying to migrate it to the current CPU.
+ */
+extern int wake_up_swap(struct task_struct *tsk);
+
 #ifdef CONFIG_SMP
 extern void kick_process(struct task_struct *tsk);
 #else

diff --git a/include/linux/security.h b/include/linux/security.h
index 7ef74d0..7d0ad77 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h

@@ -365,6 +365,7 @@
 int security_file_permission(struct file *file, int mask);
 int security_file_alloc(struct file *file);
 void security_file_free(struct file *file);
+void security_file_pre_free(struct file *file);
 int security_file_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
 int security_mmap_file(struct file *file, unsigned long prot,
 			unsigned long flags);
@@ -379,6 +380,7 @@
 int security_file_receive(struct file *file);
 int security_file_open(struct file *file);
 int security_task_alloc(struct task_struct *task, unsigned long clone_flags);
+void security_task_post_alloc(struct task_struct *task);
 void security_task_free(struct task_struct *task);
 int security_cred_alloc_blank(struct cred *cred, gfp_t gfp);
 void security_cred_free(struct cred *cred);
@@ -416,6 +418,7 @@
 int security_task_movememory(struct task_struct *p);
 int security_task_kill(struct task_struct *p, struct kernel_siginfo *info,
 			int sig, const struct cred *cred);
+void security_task_exit(struct task_struct *p);
 int security_task_prctl(int option, unsigned long arg2, unsigned long arg3,
 			unsigned long arg4, unsigned long arg5);
 void security_task_to_inode(struct task_struct *p, struct inode *inode);
@@ -917,6 +920,9 @@
 static inline void security_file_free(struct file *file)
 { }
 
+static inline void security_file_pre_free(struct file *file)
+{ }
+
 static inline int security_file_ioctl(struct file *file, unsigned int cmd,
 				      unsigned long arg)
 {
@@ -980,6 +986,9 @@
 	return 0;
 }
 
+static inline void security_task_post_alloc(struct task_struct *task)
+{ }
+
 static inline void security_task_free(struct task_struct *task)
 { }
 
@@ -1130,6 +1139,9 @@
 	return 0;
 }
 
+static inline void security_task_exit(struct task_struct *p)
+{ }
+
 static inline int security_task_prctl(int option, unsigned long arg2,
 				      unsigned long arg3,
 				      unsigned long arg4,

diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index cd2d827..26d65d0 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h

@@ -71,6 +71,7 @@
 #define AUDIT_TTY_SET		1017	/* Set TTY auditing status */
 #define AUDIT_SET_FEATURE	1018	/* Turn an audit feature on or off */
 #define AUDIT_GET_FEATURE	1019	/* Get which features are enabled */
+#define AUDIT_CONTAINER_OP	1020	/* Define the container id and info */
 
 #define AUDIT_FIRST_USER_MSG	1100	/* Userspace messages mostly uninteresting to kernel */
 #define AUDIT_USER_AVC		1107	/* We filter this differently */
@@ -495,6 +496,7 @@
 
 #define AUDIT_UID_UNSET (unsigned int)-1
 #define AUDIT_SID_UNSET ((unsigned int)-1)
+#define AUDIT_CID_UNSET ((u64)-1)
 
 /* audit_rule_data supports filter rules with both integer and string
  * fields.  It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 762bf87..9880f34 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h

@@ -19,7 +19,8 @@
 
 /* ld/ldx fields */
 #define BPF_DW		0x18	/* double word (64-bit) */
-#define BPF_XADD	0xc0	/* exclusive add */
+#define BPF_ATOMIC	0xc0	/* atomic memory ops - op type in immediate */
+#define BPF_XADD	0xc0	/* exclusive add - legacy name */
 
 /* alu/jmp fields */
 #define BPF_MOV		0xb0	/* mov reg to reg */
@@ -43,6 +44,11 @@
 #define BPF_CALL	0x80	/* function call */
 #define BPF_EXIT	0x90	/* function return */
 
+/* atomic op type fields (stored in immediate) */
+#define BPF_FETCH	0x01	/* not an opcode on its own, used to build others */
+#define BPF_XCHG	(0xe0 | BPF_FETCH)	/* atomic exchange */
+#define BPF_CMPXCHG	(0xf0 | BPF_FETCH)	/* atomic compare-and-write */
+
 /* Register numbers */
 enum {
 	BPF_REG_0 = 0,
@@ -157,6 +163,7 @@
 	BPF_MAP_TYPE_STRUCT_OPS,
 	BPF_MAP_TYPE_RINGBUF,
 	BPF_MAP_TYPE_INODE_STORAGE,
+	BPF_MAP_TYPE_TASK_STORAGE,
 };
 
 /* Note that tracing related programs such as
@@ -1661,6 +1668,14 @@
  * 	Return
  * 		A 8-byte long non-decreasing number.
  *
+ * u64 bpf_get_socket_cookie(struct sock *sk)
+ * 	Description
+ * 		Equivalent to **bpf_get_socket_cookie**\ () helper that accepts
+ * 		*sk*, but gets socket from a BTF **struct sock**. This helper
+ * 		also works for sleepable programs.
+ * 	Return
+ * 		A 8-byte long unique number or 0 if *sk* is NULL.
+ *
  * u32 bpf_get_socket_uid(struct sk_buff *skb)
  * 	Return
  * 		The owner UID of the socket associated to *skb*. If the socket
@@ -2442,7 +2457,7 @@
  *		running simultaneously.
  *
  *		A user should care about the synchronization by himself.
- *		For example, by using the **BPF_STX_XADD** instruction to alter
+ *		For example, by using the **BPF_ATOMIC** instructions to alter
  *		the shared data.
  *	Return
  *		A pointer to the local storage area.
@@ -3742,6 +3757,80 @@
  * 	Return
  * 		The helper returns **TC_ACT_REDIRECT** on success or
  * 		**TC_ACT_SHOT** on error.
+ *
+ * void *bpf_task_storage_get(struct bpf_map *map, struct task_struct *task, void *value, u64 flags)
+ *	Description
+ *		Get a bpf_local_storage from the *task*.
+ *
+ *		Logically, it could be thought of as getting the value from
+ *		a *map* with *task* as the **key**.  From this
+ *		perspective,  the usage is not much different from
+ *		**bpf_map_lookup_elem**\ (*map*, **&**\ *task*) except this
+ *		helper enforces the key must be an task_struct and the map must also
+ *		be a **BPF_MAP_TYPE_TASK_STORAGE**.
+ *
+ *		Underneath, the value is stored locally at *task* instead of
+ *		the *map*.  The *map* is used as the bpf-local-storage
+ *		"type". The bpf-local-storage "type" (i.e. the *map*) is
+ *		searched against all bpf_local_storage residing at *task*.
+ *
+ *		An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be
+ *		used such that a new bpf_local_storage will be
+ *		created if one does not exist.  *value* can be used
+ *		together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify
+ *		the initial value of a bpf_local_storage.  If *value* is
+ *		**NULL**, the new bpf_local_storage will be zero initialized.
+ *	Return
+ *		A bpf_local_storage pointer is returned on success.
+ *
+ *		**NULL** if not found or there was an error in adding
+ *		a new bpf_local_storage.
+ *
+ * long bpf_task_storage_delete(struct bpf_map *map, struct task_struct *task)
+ *	Description
+ *		Delete a bpf_local_storage from a *task*.
+ *	Return
+ *		0 on success.
+ *
+ *		**-ENOENT** if the bpf_local_storage cannot be found.
+ *
+ * struct task_struct *bpf_get_current_task_btf(void)
+ *	Description
+ *		Return a BTF pointer to the "current" task.
+ *		This pointer can also be used in helpers that accept an
+ *		*ARG_PTR_TO_BTF_ID* of type *task_struct*.
+ *	Return
+ *		Pointer to the current task.
+ *
+ * long bpf_bprm_opts_set(struct linux_binprm *bprm, u64 flags)
+ *	Description
+ *		Set or clear certain options on *bprm*:
+ *
+ *		**BPF_F_BPRM_SECUREEXEC** Set the secureexec bit
+ *		which sets the **AT_SECURE** auxv for glibc. The bit
+ *		is cleared if the flag is not specified.
+ *	Return
+ *		**-EINVAL** if invalid *flags* are passed, zero otherwise.
+ *
+ * u64 bpf_ktime_get_coarse_ns(void)
+ * 	Description
+ * 		Return a coarse-grained version of the time elapsed since
+ * 		system boot, in nanoseconds. Does not include time the system
+ * 		was suspended.
+ *
+ * 		See: **clock_gettime**\ (**CLOCK_MONOTONIC_COARSE**)
+ * 	Return
+ * 		Current *ktime*.
+ *
+ * long bpf_ima_inode_hash(struct inode *inode, void *dst, u32 size)
+ *	Description
+ *		Returns the stored IMA hash of the *inode* (if it's avaialable).
+ *		If the hash is larger than *size*, then only *size*
+ *		bytes will be copied to *dst*
+ *	Return
+ *		The **hash_algo** is returned on success,
+ *		**-EOPNOTSUP** if IMA is disabled or **-EINVAL** if
+ *		invalid arguments are passed.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3900,6 +3989,12 @@
 	FN(per_cpu_ptr),		\
 	FN(this_cpu_ptr),		\
 	FN(redirect_peer),		\
+	FN(task_storage_get),		\
+	FN(task_storage_delete),	\
+	FN(get_current_task_btf),	\
+	FN(bprm_opts_set),		\
+	FN(ktime_get_coarse_ns),	\
+	FN(ima_inode_hash),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -4071,6 +4166,11 @@
 	BPF_LWT_ENCAP_IP,
 };
 
+/* Flags for bpf_bprm_opts_set helper */
+enum {
+	BPF_F_BPRM_SECUREEXEC	= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\

diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index a89eb0a..7e2b8a8 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h

@@ -22,6 +22,8 @@
 #define FUTEX_WAIT_REQUEUE_PI	11
 #define FUTEX_CMP_REQUEUE_PI	12
 
+#define GFUTEX_SWAP		60
+
 #define FUTEX_PRIVATE_FLAG	128
 #define FUTEX_CLOCK_REALTIME	256
 #define FUTEX_CMD_MASK		~(FUTEX_PRIVATE_FLAG | FUTEX_CLOCK_REALTIME)
@@ -41,6 +43,8 @@
 #define FUTEX_CMP_REQUEUE_PI_PRIVATE	(FUTEX_CMP_REQUEUE_PI | \
 					 FUTEX_PRIVATE_FLAG)
 
+#define GFUTEX_SWAP_PRIVATE		(GFUTEX_SWAP | FUTEX_PRIVATE_FLAG)
+
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
  * thread exit time.

diff --git a/init/init_task.c b/init/init_task.c
index 5fa18ed..68bfc90 100644
--- a/init/init_task.c
+++ b/init/init_task.c

@@ -134,8 +134,7 @@
 	.thread_group	= LIST_HEAD_INIT(init_task.thread_group),
 	.thread_node	= LIST_HEAD_INIT(init_signals.thread_head),
 #ifdef CONFIG_AUDIT
-	.loginuid	= INVALID_UID,
-	.sessionid	= AUDIT_SID_UNSET,
+	.audit		= &init_struct_audit,
 #endif
 #ifdef CONFIG_PERF_EVENTS
 	.perf_event_mutex = __MUTEX_INITIALIZER(init_task.perf_event_mutex),

diff --git a/init/main.c b/init/main.c
index dd26a42..d517bd4 100644
--- a/init/main.c
+++ b/init/main.c

@@ -98,6 +98,7 @@
 #include <linux/mem_encrypt.h>
 #include <linux/kcsan.h>
 #include <linux/init_syscalls.h>
+#include <linux/audit.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -1042,6 +1043,7 @@
 	nsfs_init();
 	cpuset_init();
 	cgroup_init();
+	audit_task_init();
 	taskstats_init_early();
 	delayacct_init();
 

diff --git a/kernel/audit.c b/kernel/audit.c
index 68cee3b..e9a3e40 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c

@@ -208,6 +208,75 @@
 	struct sk_buff *skb;
 };
 
+static struct kmem_cache *audit_task_cache;
+
+void __init audit_task_init(void)
+{
+	audit_task_cache = kmem_cache_create("audit_task",
+					     sizeof(struct audit_task_info),
+					     0, SLAB_PANIC, NULL);
+}
+
+/**
+ * audit_alloc - allocate an audit info block for a task
+ * @tsk: task
+ *
+ * Call audit_alloc_syscall to filter on the task information and
+ * allocate a per-task audit context if necessary.  This is called from
+ * copy_process, so no lock is needed.
+ */
+int audit_alloc(struct task_struct *tsk)
+{
+	int ret = 0;
+	struct audit_task_info *info;
+
+	info = kmem_cache_alloc(audit_task_cache, GFP_KERNEL);
+	if (!info) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	info->loginuid = audit_get_loginuid(current);
+	info->sessionid = audit_get_sessionid(current);
+	info->contid = audit_get_contid(current);
+	tsk->audit = info;
+
+	ret = audit_alloc_syscall(tsk);
+	if (ret) {
+		tsk->audit = NULL;
+		kmem_cache_free(audit_task_cache, info);
+	}
+out:
+	return ret;
+}
+
+struct audit_task_info init_struct_audit = {
+	.loginuid = INVALID_UID,
+	.sessionid = AUDIT_SID_UNSET,
+	.contid = AUDIT_CID_UNSET,
+#ifdef CONFIG_AUDITSYSCALL
+	.ctx = NULL,
+#endif
+};
+
+/**
+ * audit_free - free per-task audit info
+ * @tsk: task whose audit info block to free
+ *
+ * Called from copy_process and do_exit
+ */
+void audit_free(struct task_struct *tsk)
+{
+	struct audit_task_info *info = tsk->audit;
+
+	audit_free_syscall(tsk);
+	/* Freeing the audit_task_info struct must be performed after
+	 * audit_log_exit() due to need for loginuid and sessionid.
+	 */
+	info = tsk->audit;
+	tsk->audit = NULL;
+	kmem_cache_free(audit_task_cache, info);
+}
+
 /**
  * auditd_test_task - Check to see if a given task is an audit daemon
  * @task: the task to check
@@ -2322,8 +2391,8 @@
 			sessionid = (unsigned int)atomic_inc_return(&session_id);
 	}
 
-	current->sessionid = sessionid;
-	current->loginuid = loginuid;
+	current->audit->sessionid = sessionid;
+	current->audit->loginuid = loginuid;
 out:
 	audit_log_set_loginuid(oldloginuid, loginuid, oldsessionid, sessionid, rc);
 	return rc;
@@ -2356,6 +2425,62 @@
 	return audit_signal_info_syscall(t);
 }
 
+/*
+ * audit_set_contid - set current task's audit contid
+ * @task: target task
+ * @contid: contid value
+ *
+ * Returns 0 on success, -EPERM on permission failure.
+ *
+ * Called (set) from fs/proc/base.c::proc_contid_write().
+ */
+int audit_set_contid(struct task_struct *task, u64 contid)
+{
+	u64 oldcontid;
+	int rc = 0;
+	struct audit_buffer *ab;
+
+	task_lock(task);
+	/* Can't set if audit disabled */
+	if (!task->audit) {
+		task_unlock(task);
+		return -ENOPROTOOPT;
+	}
+	oldcontid = audit_get_contid(task);
+	read_lock(&tasklist_lock);
+	/* Don't allow the audit containerid to be unset */
+	if (!audit_contid_valid(contid))
+		rc = -EINVAL;
+	/* if we don't have caps, reject */
+	else if (!capable(CAP_AUDIT_CONTROL))
+		rc = -EPERM;
+	/* if task has children or is not single-threaded, deny */
+	else if (!list_empty(&task->children))
+		rc = -EBUSY;
+	else if (!(thread_group_leader(task) && thread_group_empty(task)))
+		rc = -EALREADY;
+	/* if contid is already set, deny */
+	else if (audit_contid_set(task))
+		rc = -ECHILD;
+	read_unlock(&tasklist_lock);
+	if (!rc)
+		task->audit->contid = contid;
+	task_unlock(task);
+
+	if (!audit_enabled)
+		return rc;
+
+	ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_CONTAINER_OP);
+	if (!ab)
+		return rc;
+
+	audit_log_format(ab,
+			 "op=set opid=%d contid=%llu old-contid=%llu",
+			 task_tgid_nr(task), contid, oldcontid);
+	audit_log_end(ab);
+	return rc;
+}
+
 /**
  * audit_log_end - end one audit record
  * @ab: the audit_buffer

diff --git a/kernel/audit.h b/kernel/audit.h
index 3b9c094..c650ed4 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h

@@ -135,6 +135,7 @@
 	kuid_t		    target_uid;
 	unsigned int	    target_sessionid;
 	u32		    target_sid;
+	u64		    target_cid;
 	char		    target_comm[TASK_COMM_LEN];
 
 	struct audit_tree_refs *trees, *first_trees;
@@ -251,6 +252,8 @@
 extern unsigned int audit_serial(void);
 extern int auditsc_get_stamp(struct audit_context *ctx,
 			      struct timespec64 *t, unsigned int *serial);
+extern int audit_alloc_syscall(struct task_struct *tsk);
+extern void audit_free_syscall(struct task_struct *tsk);
 
 extern void audit_put_watch(struct audit_watch *watch);
 extern void audit_get_watch(struct audit_watch *watch);
@@ -292,6 +295,9 @@
 extern struct list_head *audit_killed_trees(void);
 #else /* CONFIG_AUDITSYSCALL */
 #define auditsc_get_stamp(c, t, s) 0
+#define audit_alloc_syscall(t) 0
+#define audit_free_syscall(t) {}
+
 #define audit_put_watch(w) {}
 #define audit_get_watch(w) {}
 #define audit_to_watch(k, p, l, o) (-EINVAL)

diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 8dba8f0..b0867f06 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c

@@ -114,6 +114,7 @@
 	kuid_t			target_uid[AUDIT_AUX_PIDS];
 	unsigned int		target_sessionid[AUDIT_AUX_PIDS];
 	u32			target_sid[AUDIT_AUX_PIDS];
+	u64			target_cid[AUDIT_AUX_PIDS];
 	char 			target_comm[AUDIT_AUX_PIDS][TASK_COMM_LEN];
 	int			pid_count;
 };
@@ -932,23 +933,25 @@
 	return context;
 }
 
-/**
- * audit_alloc - allocate an audit context block for a task
+/*
+ * audit_alloc_syscall - allocate an audit context block for a task
  * @tsk: task
  *
  * Filter on the task information and allocate a per-task audit context
  * if necessary.  Doing so turns on system call auditing for the
- * specified task.  This is called from copy_process, so no lock is
- * needed.
+ * specified task.  This is called from copy_process via audit_alloc, so
+ * no lock is needed.
  */
-int audit_alloc(struct task_struct *tsk)
+int audit_alloc_syscall(struct task_struct *tsk)
 {
 	struct audit_context *context;
 	enum audit_state     state;
 	char *key = NULL;
 
-	if (likely(!audit_ever_enabled))
+	if (likely(!audit_ever_enabled)) {
+		audit_set_context(tsk, NULL);
 		return 0; /* Return if not auditing. */
+	}
 
 	state = audit_filter_task(tsk, &key);
 	if (state == AUDIT_DISABLED) {
@@ -958,7 +961,7 @@
 
 	if (!(context = audit_alloc_context(state))) {
 		kfree(key);
-		audit_log_lost("out of memory in audit_alloc");
+		audit_log_lost("out of memory in audit_alloc_syscall");
 		return -ENOMEM;
 	}
 	context->filterkey = key;
@@ -1603,14 +1606,15 @@
 }
 
 /**
- * __audit_free - free a per-task audit context
+ * audit_free_syscall - free per-task audit context info
  * @tsk: task whose audit context block to free
  *
- * Called from copy_process and do_exit
+ * Called from audit_free
  */
-void __audit_free(struct task_struct *tsk)
+void audit_free_syscall(struct task_struct *tsk)
 {
-	struct audit_context *context = tsk->audit_context;
+	struct audit_task_info *info = tsk->audit;
+	struct audit_context *context = info->ctx;
 
 	if (!context)
 		return;
@@ -1633,7 +1637,6 @@
 		if (context->current_state == AUDIT_RECORD_CONTEXT)
 			audit_log_exit();
 	}
-
 	audit_set_context(tsk, NULL);
 	audit_free_context(context);
 }
@@ -2415,6 +2418,7 @@
 	context->target_uid = task_uid(t);
 	context->target_sessionid = audit_get_sessionid(t);
 	security_task_getsecid(t, &context->target_sid);
+	context->target_cid = audit_get_contid(t);
 	memcpy(context->target_comm, t->comm, TASK_COMM_LEN);
 }
 
@@ -2442,6 +2446,7 @@
 		ctx->target_uid = t_uid;
 		ctx->target_sessionid = audit_get_sessionid(t);
 		security_task_getsecid(t, &ctx->target_sid);
+		ctx->target_cid = audit_get_contid(t);
 		memcpy(ctx->target_comm, t->comm, TASK_COMM_LEN);
 		return 0;
 	}
@@ -2463,6 +2468,7 @@
 	axp->target_uid[axp->pid_count] = t_uid;
 	axp->target_sessionid[axp->pid_count] = audit_get_sessionid(t);
 	security_task_getsecid(t, &axp->target_sid[axp->pid_count]);
+	axp->target_cid[axp->pid_count] = audit_get_contid(t);
 	memcpy(axp->target_comm[axp->pid_count], t->comm, TASK_COMM_LEN);
 	axp->pid_count++;
 

diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index c1b9f71..d124934 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile

@@ -10,6 +10,7 @@
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
 obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
+obj-${CONFIG_BPF_LSM}	  += bpf_task_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
 obj-$(CONFIG_BPF_JIT) += trampoline.o
 obj-$(CONFIG_BPF_SYSCALL) += btf.o

diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c
index 56cc5a9..1622a44 100644
--- a/kernel/bpf/bpf_lsm.c
+++ b/kernel/bpf/bpf_lsm.c

@@ -7,6 +7,7 @@
 #include <linux/filter.h>
 #include <linux/bpf.h>
 #include <linux/btf.h>
+#include <linux/binfmts.h>
 #include <linux/lsm_hooks.h>
 #include <linux/bpf_lsm.h>
 #include <linux/kallsyms.h>
@@ -14,6 +15,7 @@
 #include <net/bpf_sk_storage.h>
 #include <linux/bpf_local_storage.h>
 #include <linux/btf_ids.h>
+#include <linux/ima.h>
 
 /* For every LSM hook that allows attachment of BPF programs, declare a nop
  * function where a BPF program can be attached.
@@ -51,6 +53,52 @@
 	return 0;
 }
 
+/* Mask for all the currently supported BPRM option flags */
+#define BPF_F_BRPM_OPTS_MASK	BPF_F_BPRM_SECUREEXEC
+
+BPF_CALL_2(bpf_bprm_opts_set, struct linux_binprm *, bprm, u64, flags)
+{
+	if (flags & ~BPF_F_BRPM_OPTS_MASK)
+		return -EINVAL;
+
+	bprm->secureexec = (flags & BPF_F_BPRM_SECUREEXEC);
+	return 0;
+}
+
+BTF_ID_LIST_SINGLE(bpf_bprm_opts_set_btf_ids, struct, linux_binprm)
+
+const static struct bpf_func_proto bpf_bprm_opts_set_proto = {
+	.func		= bpf_bprm_opts_set,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_BTF_ID,
+	.arg1_btf_id	= &bpf_bprm_opts_set_btf_ids[0],
+	.arg2_type	= ARG_ANYTHING,
+};
+
+BPF_CALL_3(bpf_ima_inode_hash, struct inode *, inode, void *, dst, u32, size)
+{
+	return ima_inode_hash(inode, dst, size);
+}
+
+static bool bpf_ima_inode_hash_allowed(const struct bpf_prog *prog)
+{
+	return bpf_lsm_is_sleepable_hook(prog->aux->attach_btf_id);
+}
+
+BTF_ID_LIST_SINGLE(bpf_ima_inode_hash_btf_ids, struct, inode)
+
+const static struct bpf_func_proto bpf_ima_inode_hash_proto = {
+	.func		= bpf_ima_inode_hash,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_BTF_ID,
+	.arg1_btf_id	= &bpf_ima_inode_hash_btf_ids[0],
+	.arg2_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+	.allowed	= bpf_ima_inode_hash_allowed,
+};
+
 static const struct bpf_func_proto *
 bpf_lsm_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
@@ -63,11 +111,115 @@
 		return &bpf_sk_storage_get_proto;
 	case BPF_FUNC_sk_storage_delete:
 		return &bpf_sk_storage_delete_proto;
+	case BPF_FUNC_spin_lock:
+		return &bpf_spin_lock_proto;
+	case BPF_FUNC_spin_unlock:
+		return &bpf_spin_unlock_proto;
+	case BPF_FUNC_task_storage_get:
+		return &bpf_task_storage_get_proto;
+	case BPF_FUNC_task_storage_delete:
+		return &bpf_task_storage_delete_proto;
+	case BPF_FUNC_bprm_opts_set:
+		return &bpf_bprm_opts_set_proto;
+	case BPF_FUNC_ima_inode_hash:
+		return prog->aux->sleepable ? &bpf_ima_inode_hash_proto : NULL;
 	default:
 		return tracing_prog_func_proto(func_id, prog);
 	}
 }
 
+/* The set of hooks which are called without pagefaults disabled and are allowed
+ * to "sleep" and thus can be used for sleeable BPF programs.
+ */
+BTF_SET_START(sleepable_lsm_hooks)
+BTF_ID(func, bpf_lsm_bpf)
+BTF_ID(func, bpf_lsm_bpf_map)
+BTF_ID(func, bpf_lsm_bpf_map_alloc_security)
+BTF_ID(func, bpf_lsm_bpf_map_free_security)
+BTF_ID(func, bpf_lsm_bpf_prog)
+BTF_ID(func, bpf_lsm_bprm_check_security)
+BTF_ID(func, bpf_lsm_bprm_committed_creds)
+BTF_ID(func, bpf_lsm_bprm_committing_creds)
+BTF_ID(func, bpf_lsm_bprm_creds_for_exec)
+BTF_ID(func, bpf_lsm_bprm_creds_from_file)
+BTF_ID(func, bpf_lsm_capget)
+BTF_ID(func, bpf_lsm_capset)
+BTF_ID(func, bpf_lsm_cred_prepare)
+BTF_ID(func, bpf_lsm_file_ioctl)
+BTF_ID(func, bpf_lsm_file_lock)
+BTF_ID(func, bpf_lsm_file_open)
+BTF_ID(func, bpf_lsm_file_receive)
+
+#ifdef CONFIG_SECURITY_NETWORK
+BTF_ID(func, bpf_lsm_inet_conn_established)
+#endif /* CONFIG_SECURITY_NETWORK */
+
+BTF_ID(func, bpf_lsm_inode_create)
+BTF_ID(func, bpf_lsm_inode_free_security)
+BTF_ID(func, bpf_lsm_inode_getattr)
+BTF_ID(func, bpf_lsm_inode_getxattr)
+BTF_ID(func, bpf_lsm_inode_mknod)
+BTF_ID(func, bpf_lsm_inode_need_killpriv)
+BTF_ID(func, bpf_lsm_inode_post_setxattr)
+BTF_ID(func, bpf_lsm_inode_readlink)
+BTF_ID(func, bpf_lsm_inode_rename)
+BTF_ID(func, bpf_lsm_inode_rmdir)
+BTF_ID(func, bpf_lsm_inode_setattr)
+BTF_ID(func, bpf_lsm_inode_setxattr)
+BTF_ID(func, bpf_lsm_inode_symlink)
+BTF_ID(func, bpf_lsm_inode_unlink)
+BTF_ID(func, bpf_lsm_kernel_module_request)
+BTF_ID(func, bpf_lsm_kernfs_init_security)
+
+#ifdef CONFIG_KEYS
+BTF_ID(func, bpf_lsm_key_free)
+#endif /* CONFIG_KEYS */
+
+BTF_ID(func, bpf_lsm_mmap_file)
+BTF_ID(func, bpf_lsm_netlink_send)
+BTF_ID(func, bpf_lsm_path_notify)
+BTF_ID(func, bpf_lsm_release_secctx)
+BTF_ID(func, bpf_lsm_sb_alloc_security)
+BTF_ID(func, bpf_lsm_sb_eat_lsm_opts)
+BTF_ID(func, bpf_lsm_sb_kern_mount)
+BTF_ID(func, bpf_lsm_sb_mount)
+BTF_ID(func, bpf_lsm_sb_remount)
+BTF_ID(func, bpf_lsm_sb_set_mnt_opts)
+BTF_ID(func, bpf_lsm_sb_show_options)
+BTF_ID(func, bpf_lsm_sb_statfs)
+BTF_ID(func, bpf_lsm_sb_umount)
+BTF_ID(func, bpf_lsm_settime)
+
+#ifdef CONFIG_SECURITY_NETWORK
+BTF_ID(func, bpf_lsm_socket_accept)
+BTF_ID(func, bpf_lsm_socket_bind)
+BTF_ID(func, bpf_lsm_socket_connect)
+BTF_ID(func, bpf_lsm_socket_create)
+BTF_ID(func, bpf_lsm_socket_getpeername)
+BTF_ID(func, bpf_lsm_socket_getpeersec_dgram)
+BTF_ID(func, bpf_lsm_socket_getsockname)
+BTF_ID(func, bpf_lsm_socket_getsockopt)
+BTF_ID(func, bpf_lsm_socket_listen)
+BTF_ID(func, bpf_lsm_socket_post_create)
+BTF_ID(func, bpf_lsm_socket_recvmsg)
+BTF_ID(func, bpf_lsm_socket_sendmsg)
+BTF_ID(func, bpf_lsm_socket_shutdown)
+BTF_ID(func, bpf_lsm_socket_socketpair)
+#endif /* CONFIG_SECURITY_NETWORK */
+
+BTF_ID(func, bpf_lsm_syslog)
+BTF_ID(func, bpf_lsm_task_alloc)
+BTF_ID(func, bpf_lsm_task_getsecid)
+BTF_ID(func, bpf_lsm_task_prctl)
+BTF_ID(func, bpf_lsm_task_setscheduler)
+BTF_ID(func, bpf_lsm_task_to_inode)
+BTF_SET_END(sleepable_lsm_hooks)
+
+bool bpf_lsm_is_sleepable_hook(u32 btf_id)
+{
+	return btf_id_set_contains(&sleepable_lsm_hooks, btf_id);
+}
+
 const struct bpf_prog_ops lsm_prog_ops = {
 };
 

diff --git a/kernel/bpf/bpf_task_storage.c b/kernel/bpf/bpf_task_storage.c
new file mode 100644
index 0000000..e0da025
--- /dev/null
+++ b/kernel/bpf/bpf_task_storage.c

@@ -0,0 +1,318 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2020 Facebook
+ * Copyright 2020 Google LLC.
+ */
+
+#include <linux/pid.h>
+#include <linux/sched.h>
+#include <linux/rculist.h>
+#include <linux/list.h>
+#include <linux/hash.h>
+#include <linux/types.h>
+#include <linux/spinlock.h>
+#include <linux/bpf.h>
+#include <linux/bpf_local_storage.h>
+#include <linux/filter.h>
+#include <uapi/linux/btf.h>
+#include <linux/bpf_lsm.h>
+#include <linux/btf_ids.h>
+#include <linux/fdtable.h>
+
+DEFINE_BPF_STORAGE_CACHE(task_cache);
+
+static struct bpf_local_storage __rcu **task_storage_ptr(void *owner)
+{
+	struct task_struct *task = owner;
+	struct bpf_storage_blob *bsb;
+
+	bsb = bpf_task(task);
+	if (!bsb)
+		return NULL;
+	return &bsb->storage;
+}
+
+static struct bpf_local_storage_data *
+task_storage_lookup(struct task_struct *task, struct bpf_map *map,
+		    bool cacheit_lockit)
+{
+	struct bpf_local_storage *task_storage;
+	struct bpf_local_storage_map *smap;
+	struct bpf_storage_blob *bsb;
+
+	bsb = bpf_task(task);
+	if (!bsb)
+		return NULL;
+
+	task_storage = rcu_dereference(bsb->storage);
+	if (!task_storage)
+		return NULL;
+
+	smap = (struct bpf_local_storage_map *)map;
+	return bpf_local_storage_lookup(task_storage, smap, cacheit_lockit);
+}
+
+void bpf_task_storage_free(struct task_struct *task)
+{
+	struct bpf_local_storage_elem *selem;
+	struct bpf_local_storage *local_storage;
+	bool free_task_storage = false;
+	struct bpf_storage_blob *bsb;
+	struct hlist_node *n;
+
+	bsb = bpf_task(task);
+	if (!bsb)
+		return;
+
+	rcu_read_lock();
+
+	local_storage = rcu_dereference(bsb->storage);
+	if (!local_storage) {
+		rcu_read_unlock();
+		return;
+	}
+
+	/* Neither the bpf_prog nor the bpf-map's syscall
+	 * could be modifying the local_storage->list now.
+	 * Thus, no elem can be added-to or deleted-from the
+	 * local_storage->list by the bpf_prog or by the bpf-map's syscall.
+	 *
+	 * It is racing with bpf_local_storage_map_free() alone
+	 * when unlinking elem from the local_storage->list and
+	 * the map's bucket->list.
+	 */
+	raw_spin_lock_bh(&local_storage->lock);
+	hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
+		/* Always unlink from map before unlinking from
+		 * local_storage.
+		 */
+		bpf_selem_unlink_map(selem);
+		free_task_storage = bpf_selem_unlink_storage_nolock(
+			local_storage, selem, false);
+	}
+	raw_spin_unlock_bh(&local_storage->lock);
+	rcu_read_unlock();
+
+	/* free_task_storage should always be true as long as
+	 * local_storage->list was non-empty.
+	 */
+	if (free_task_storage)
+		kfree_rcu(local_storage, rcu);
+}
+
+static void *bpf_pid_task_storage_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_local_storage_data *sdata;
+	struct task_struct *task;
+	unsigned int f_flags;
+	struct pid *pid;
+	int fd, err;
+
+	fd = *(int *)key;
+	pid = pidfd_get_pid(fd, &f_flags);
+	if (IS_ERR(pid))
+		return ERR_CAST(pid);
+
+	/* We should be in an RCU read side critical section, it should be safe
+	 * to call pid_task.
+	 */
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	task = pid_task(pid, PIDTYPE_PID);
+	if (!task) {
+		err = -ENOENT;
+		goto out;
+	}
+
+	sdata = task_storage_lookup(task, map, true);
+	put_pid(pid);
+	return sdata ? sdata->data : NULL;
+out:
+	put_pid(pid);
+	return ERR_PTR(err);
+}
+
+static int bpf_pid_task_storage_update_elem(struct bpf_map *map, void *key,
+					    void *value, u64 map_flags)
+{
+	struct bpf_local_storage_data *sdata;
+	struct task_struct *task;
+	unsigned int f_flags;
+	struct pid *pid;
+	int fd, err;
+
+	fd = *(int *)key;
+	pid = pidfd_get_pid(fd, &f_flags);
+	if (IS_ERR(pid))
+		return PTR_ERR(pid);
+
+	/* We should be in an RCU read side critical section, it should be safe
+	 * to call pid_task.
+	 */
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	task = pid_task(pid, PIDTYPE_PID);
+	if (!task || !task_storage_ptr(task)) {
+		err = -ENOENT;
+		goto out;
+	}
+
+	sdata = bpf_local_storage_update(
+		task, (struct bpf_local_storage_map *)map, value, map_flags);
+
+	err = PTR_ERR_OR_ZERO(sdata);
+out:
+	put_pid(pid);
+	return err;
+}
+
+static int task_storage_delete(struct task_struct *task, struct bpf_map *map)
+{
+	struct bpf_local_storage_data *sdata;
+
+	sdata = task_storage_lookup(task, map, false);
+	if (!sdata)
+		return -ENOENT;
+
+	bpf_selem_unlink(SELEM(sdata));
+
+	return 0;
+}
+
+static int bpf_pid_task_storage_delete_elem(struct bpf_map *map, void *key)
+{
+	struct task_struct *task;
+	unsigned int f_flags;
+	struct pid *pid;
+	int fd, err;
+
+	fd = *(int *)key;
+	pid = pidfd_get_pid(fd, &f_flags);
+	if (IS_ERR(pid))
+		return PTR_ERR(pid);
+
+	/* We should be in an RCU read side critical section, it should be safe
+	 * to call pid_task.
+	 */
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	task = pid_task(pid, PIDTYPE_PID);
+	if (!task) {
+		err = -ENOENT;
+		goto out;
+	}
+
+	err = task_storage_delete(task, map);
+out:
+	put_pid(pid);
+	return err;
+}
+
+BPF_CALL_4(bpf_task_storage_get, struct bpf_map *, map, struct task_struct *,
+	   task, void *, value, u64, flags)
+{
+	struct bpf_local_storage_data *sdata;
+
+	if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE))
+		return (unsigned long)NULL;
+
+	/* explicitly check that the task_storage_ptr is not
+	 * NULL as task_storage_lookup returns NULL in this case and
+	 * bpf_local_storage_update expects the owner to have a
+	 * valid storage pointer.
+	 */
+	if (!task || !task_storage_ptr(task))
+		return (unsigned long)NULL;
+
+	sdata = task_storage_lookup(task, map, true);
+	if (sdata)
+		return (unsigned long)sdata->data;
+
+	/* This helper must only be called from places where the lifetime of the task
+	 * is guaranteed. Either by being refcounted or by being protected
+	 * by an RCU read-side critical section.
+	 */
+	if (flags & BPF_LOCAL_STORAGE_GET_F_CREATE) {
+		sdata = bpf_local_storage_update(
+			task, (struct bpf_local_storage_map *)map, value,
+			BPF_NOEXIST);
+		return IS_ERR(sdata) ? (unsigned long)NULL :
+					     (unsigned long)sdata->data;
+	}
+
+	return (unsigned long)NULL;
+}
+
+BPF_CALL_2(bpf_task_storage_delete, struct bpf_map *, map, struct task_struct *,
+	   task)
+{
+	if (!task)
+		return -EINVAL;
+
+	/* This helper must only be called from places where the lifetime of the task
+	 * is guaranteed. Either by being refcounted or by being protected
+	 * by an RCU read-side critical section.
+	 */
+	return task_storage_delete(task, map);
+}
+
+static int notsupp_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+	return -ENOTSUPP;
+}
+
+static struct bpf_map *task_storage_map_alloc(union bpf_attr *attr)
+{
+	struct bpf_local_storage_map *smap;
+
+	smap = bpf_local_storage_map_alloc(attr);
+	if (IS_ERR(smap))
+		return ERR_CAST(smap);
+
+	smap->cache_idx = bpf_local_storage_cache_idx_get(&task_cache);
+	return &smap->map;
+}
+
+static void task_storage_map_free(struct bpf_map *map)
+{
+	struct bpf_local_storage_map *smap;
+
+	smap = (struct bpf_local_storage_map *)map;
+	bpf_local_storage_cache_idx_free(&task_cache, smap->cache_idx);
+	bpf_local_storage_map_free(smap);
+}
+
+static int task_storage_map_btf_id;
+const struct bpf_map_ops task_storage_map_ops = {
+	.map_meta_equal = bpf_map_meta_equal,
+	.map_alloc_check = bpf_local_storage_map_alloc_check,
+	.map_alloc = task_storage_map_alloc,
+	.map_free = task_storage_map_free,
+	.map_get_next_key = notsupp_get_next_key,
+	.map_lookup_elem = bpf_pid_task_storage_lookup_elem,
+	.map_update_elem = bpf_pid_task_storage_update_elem,
+	.map_delete_elem = bpf_pid_task_storage_delete_elem,
+	.map_check_btf = bpf_local_storage_map_check_btf,
+	.map_btf_name = "bpf_local_storage_map",
+	.map_btf_id = &task_storage_map_btf_id,
+	.map_owner_storage_ptr = task_storage_ptr,
+};
+
+BTF_ID_LIST_SINGLE(bpf_task_storage_btf_ids, struct, task_struct)
+
+const struct bpf_func_proto bpf_task_storage_get_proto = {
+	.func = bpf_task_storage_get,
+	.gpl_only = false,
+	.ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL,
+	.arg1_type = ARG_CONST_MAP_PTR,
+	.arg2_type = ARG_PTR_TO_BTF_ID,
+	.arg2_btf_id = &bpf_task_storage_btf_ids[0],
+	.arg3_type = ARG_PTR_TO_MAP_VALUE_OR_NULL,
+	.arg4_type = ARG_ANYTHING,
+};
+
+const struct bpf_func_proto bpf_task_storage_delete_proto = {
+	.func = bpf_task_storage_delete,
+	.gpl_only = false,
+	.ret_type = RET_INTEGER,
+	.arg1_type = ARG_CONST_MAP_PTR,
+	.arg2_type = ARG_PTR_TO_BTF_ID,
+	.arg2_btf_id = &bpf_task_storage_btf_ids[0],
+};

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index d12efb2..51672d8 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c

@@ -1319,8 +1319,8 @@
 	INSN_3(STX, MEM,  H),			\
 	INSN_3(STX, MEM,  W),			\
 	INSN_3(STX, MEM,  DW),			\
-	INSN_3(STX, XADD, W),			\
-	INSN_3(STX, XADD, DW),			\
+	INSN_3(STX, ATOMIC, W),			\
+	INSN_3(STX, ATOMIC, DW),		\
 	/*   Immediate based. */		\
 	INSN_3(ST, MEM, B),			\
 	INSN_3(ST, MEM, H),			\
@@ -1668,13 +1668,59 @@
 	LDX_PROBE(DW, 8)
 #undef LDX_PROBE
 
-	STX_XADD_W: /* lock xadd *(u32 *)(dst_reg + off16) += src_reg */
-		atomic_add((u32) SRC, (atomic_t *)(unsigned long)
-			   (DST + insn->off));
-		CONT;
-	STX_XADD_DW: /* lock xadd *(u64 *)(dst_reg + off16) += src_reg */
-		atomic64_add((u64) SRC, (atomic64_t *)(unsigned long)
-			     (DST + insn->off));
+#define ATOMIC_ALU_OP(BOP, KOP)						\
+		case BOP:						\
+			if (BPF_SIZE(insn->code) == BPF_W)		\
+				atomic_##KOP((u32) SRC, (atomic_t *)(unsigned long) \
+					     (DST + insn->off));	\
+			else						\
+				atomic64_##KOP((u64) SRC, (atomic64_t *)(unsigned long) \
+					       (DST + insn->off));	\
+			break;						\
+		case BOP | BPF_FETCH:					\
+			if (BPF_SIZE(insn->code) == BPF_W)		\
+				SRC = (u32) atomic_fetch_##KOP(		\
+					(u32) SRC,			\
+					(atomic_t *)(unsigned long) (DST + insn->off)); \
+			else						\
+				SRC = (u64) atomic64_fetch_##KOP(	\
+					(u64) SRC,			\
+					(atomic64_t *)(unsigned long) (DST + insn->off)); \
+			break;
+
+	STX_ATOMIC_DW:
+	STX_ATOMIC_W:
+		switch (IMM) {
+		ATOMIC_ALU_OP(BPF_ADD, add)
+		ATOMIC_ALU_OP(BPF_AND, and)
+		ATOMIC_ALU_OP(BPF_OR, or)
+		ATOMIC_ALU_OP(BPF_XOR, xor)
+#undef ATOMIC_ALU_OP
+
+		case BPF_XCHG:
+			if (BPF_SIZE(insn->code) == BPF_W)
+				SRC = (u32) atomic_xchg(
+					(atomic_t *)(unsigned long) (DST + insn->off),
+					(u32) SRC);
+			else
+				SRC = (u64) atomic64_xchg(
+					(atomic64_t *)(unsigned long) (DST + insn->off),
+					(u64) SRC);
+			break;
+		case BPF_CMPXCHG:
+			if (BPF_SIZE(insn->code) == BPF_W)
+				BPF_R0 = (u32) atomic_cmpxchg(
+					(atomic_t *)(unsigned long) (DST + insn->off),
+					(u32) BPF_R0, (u32) SRC);
+			else
+				BPF_R0 = (u64) atomic64_cmpxchg(
+					(atomic64_t *)(unsigned long) (DST + insn->off),
+					(u64) BPF_R0, (u64) SRC);
+			break;
+
+		default:
+			goto default_label;
+		}
 		CONT;
 
 	default_label:
@@ -1684,7 +1730,8 @@
 		 *
 		 * Note, verifier whitelists all opcodes in bpf_opcode_in_insntable().
 		 */
-		pr_warn("BPF interpreter: unknown opcode %02x\n", insn->code);
+		pr_warn("BPF interpreter: unknown opcode %02x (imm: 0x%x)\n",
+			insn->code, insn->imm);
 		BUG_ON(1);
 		return 0;
 }
@@ -2259,6 +2306,7 @@
 const struct bpf_func_proto bpf_get_numa_node_id_proto __weak;
 const struct bpf_func_proto bpf_ktime_get_ns_proto __weak;
 const struct bpf_func_proto bpf_ktime_get_boot_ns_proto __weak;
+const struct bpf_func_proto bpf_ktime_get_coarse_ns_proto __weak;
 
 const struct bpf_func_proto bpf_get_current_pid_tgid_proto __weak;
 const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
@@ -2317,6 +2365,10 @@
 /* Return TRUE if the JIT backend wants verifier to enable sub-register usage
  * analysis code and wants explicit zero extension inserted by verifier.
  * Otherwise, return FALSE.
+ *
+ * The verifier inserts an explicit zero extension after BPF_CMPXCHGs even if
+ * you don't override this. JITs that don't want these extra insns can detect
+ * them using insn_is_zext.
  */
 bool __weak bpf_jit_needs_zext(void)
 {

diff --git a/kernel/bpf/disasm.c b/kernel/bpf/disasm.c
index ff1dd7d..6c64330 100644
--- a/kernel/bpf/disasm.c
+++ b/kernel/bpf/disasm.c

@@ -80,6 +80,13 @@
 	[BPF_END >> 4]  = "endian",
 };
 
+static const char *const bpf_atomic_alu_string[16] = {
+	[BPF_ADD >> 4]  = "add",
+	[BPF_AND >> 4]  = "and",
+	[BPF_OR >> 4]  = "or",
+	[BPF_XOR >> 4]  = "or",
+};
+
 static const char *const bpf_ldst_string[] = {
 	[BPF_W >> 3]  = "u32",
 	[BPF_H >> 3]  = "u16",
@@ -153,14 +160,44 @@
 				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
 				insn->dst_reg,
 				insn->off, insn->src_reg);
-		else if (BPF_MODE(insn->code) == BPF_XADD)
-			verbose(cbs->private_data, "(%02x) lock *(%s *)(r%d %+d) += r%d\n",
+		else if (BPF_MODE(insn->code) == BPF_ATOMIC &&
+			 (insn->imm == BPF_ADD || insn->imm == BPF_AND ||
+			  insn->imm == BPF_OR || insn->imm == BPF_XOR)) {
+			verbose(cbs->private_data, "(%02x) lock *(%s *)(r%d %+d) %s r%d\n",
 				insn->code,
 				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
 				insn->dst_reg, insn->off,
+				bpf_alu_string[BPF_OP(insn->imm) >> 4],
 				insn->src_reg);
-		else
+		} else if (BPF_MODE(insn->code) == BPF_ATOMIC &&
+			   (insn->imm == (BPF_ADD | BPF_FETCH) ||
+			    insn->imm == (BPF_AND | BPF_FETCH) ||
+			    insn->imm == (BPF_OR | BPF_FETCH) ||
+			    insn->imm == (BPF_XOR | BPF_FETCH))) {
+			verbose(cbs->private_data, "(%02x) r%d = atomic%s_fetch_%s((%s *)(r%d %+d), r%d)\n",
+				insn->code, insn->src_reg,
+				BPF_SIZE(insn->code) == BPF_DW ? "64" : "",
+				bpf_atomic_alu_string[BPF_OP(insn->imm) >> 4],
+				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+				insn->dst_reg, insn->off, insn->src_reg);
+		} else if (BPF_MODE(insn->code) == BPF_ATOMIC &&
+			   insn->imm == BPF_CMPXCHG) {
+			verbose(cbs->private_data, "(%02x) r0 = atomic%s_cmpxchg((%s *)(r%d %+d), r0, r%d)\n",
+				insn->code,
+				BPF_SIZE(insn->code) == BPF_DW ? "64" : "",
+				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+				insn->dst_reg, insn->off,
+				insn->src_reg);
+		} else if (BPF_MODE(insn->code) == BPF_ATOMIC &&
+			   insn->imm == BPF_XCHG) {
+			verbose(cbs->private_data, "(%02x) r%d = atomic%s_xchg((%s *)(r%d %+d), r%d)\n",
+				insn->code, insn->src_reg,
+				BPF_SIZE(insn->code) == BPF_DW ? "64" : "",
+				bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
+				insn->dst_reg, insn->off, insn->src_reg);
+		} else {
 			verbose(cbs->private_data, "BUG_%02x\n", insn->code);
+		}
 	} else if (class == BPF_ST) {
 		if (BPF_MODE(insn->code) == BPF_MEM) {
 			verbose(cbs->private_data, "(%02x) *(%s *)(r%d %+d) = %d\n",

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 0efe7c7b..c18b88b 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c

@@ -168,6 +168,17 @@
 	.ret_type	= RET_INTEGER,
 };
 
+BPF_CALL_0(bpf_ktime_get_coarse_ns)
+{
+	return ktime_get_coarse_ns();
+}
+
+const struct bpf_func_proto bpf_ktime_get_coarse_ns_proto = {
+	.func		= bpf_ktime_get_coarse_ns,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+};
+
 BPF_CALL_0(bpf_get_current_pid_tgid)
 {
 	struct task_struct *task = current;
@@ -693,6 +704,8 @@
 		return &bpf_ktime_get_ns_proto;
 	case BPF_FUNC_ktime_get_boot_ns:
 		return &bpf_ktime_get_boot_ns_proto;
+	case BPF_FUNC_ktime_get_coarse_ns:
+		return &bpf_ktime_get_coarse_ns_proto;
 	case BPF_FUNC_ringbuf_output:
 		return &bpf_ringbuf_output_proto;
 	case BPF_FUNC_ringbuf_reserve:

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 9433ab9..f1a146f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c

@@ -773,7 +773,8 @@
 		    map->map_type != BPF_MAP_TYPE_ARRAY &&
 		    map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE &&
 		    map->map_type != BPF_MAP_TYPE_SK_STORAGE &&
-		    map->map_type != BPF_MAP_TYPE_INODE_STORAGE)
+		    map->map_type != BPF_MAP_TYPE_INODE_STORAGE &&
+		    map->map_type != BPF_MAP_TYPE_TASK_STORAGE)
 			return -ENOTSUPP;
 		if (map->spin_lock_off + sizeof(struct bpf_spin_lock) >
 		    map->value_size) {

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index cba1f86..3db53378 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c

@@ -496,6 +496,13 @@
 		func_id == BPF_FUNC_skc_to_tcp_request_sock;
 }
 
+static bool is_cmpxchg_insn(const struct bpf_insn *insn)
+{
+	return BPF_CLASS(insn->code) == BPF_STX &&
+	       BPF_MODE(insn->code) == BPF_ATOMIC &&
+	       insn->imm == BPF_CMPXCHG;
+}
+
 /* string representation of 'enum bpf_reg_type' */
 static const char * const reg_type_str[] = {
 	[NOT_INIT]		= "?",
@@ -1649,7 +1656,11 @@
 	}
 
 	if (class == BPF_STX) {
-		if (reg->type != SCALAR_VALUE)
+		/* BPF_STX (including atomic variants) has multiple source
+		 * operands, one of which is a ptr. Check whether the caller is
+		 * asking about it.
+		 */
+		if (t == SRC_OP && reg->type != SCALAR_VALUE)
 			return true;
 		return BPF_SIZE(code) == BPF_DW;
 	}
@@ -1681,22 +1692,38 @@
 	return true;
 }
 
-/* Return TRUE if INSN doesn't have explicit value define. */
-static bool insn_no_def(struct bpf_insn *insn)
+/* Return the regno defined by the insn, or -1. */
+static int insn_def_regno(const struct bpf_insn *insn)
 {
-	u8 class = BPF_CLASS(insn->code);
-
-	return (class == BPF_JMP || class == BPF_JMP32 ||
-		class == BPF_STX || class == BPF_ST);
+	switch (BPF_CLASS(insn->code)) {
+	case BPF_JMP:
+	case BPF_JMP32:
+	case BPF_ST:
+		return -1;
+	case BPF_STX:
+		if (BPF_MODE(insn->code) == BPF_ATOMIC &&
+		    (insn->imm & BPF_FETCH)) {
+			if (insn->imm == BPF_CMPXCHG)
+				return BPF_REG_0;
+			else
+				return insn->src_reg;
+		} else {
+			return -1;
+		}
+	default:
+		return insn->dst_reg;
+	}
 }
 
 /* Return TRUE if INSN has defined any 32-bit value explicitly. */
 static bool insn_has_def32(struct bpf_verifier_env *env, struct bpf_insn *insn)
 {
-	if (insn_no_def(insn))
+	int dst_reg = insn_def_regno(insn);
+
+	if (dst_reg == -1)
 		return false;
 
-	return !is_reg64(env, insn, insn->dst_reg, NULL, DST_OP);
+	return !is_reg64(env, insn, dst_reg, NULL, DST_OP);
 }
 
 static void mark_insn_zext(struct bpf_verifier_env *env,
@@ -3912,13 +3939,30 @@
 	return err;
 }
 
-static int check_xadd(struct bpf_verifier_env *env, int insn_idx, struct bpf_insn *insn)
+static int check_atomic(struct bpf_verifier_env *env, int insn_idx, struct bpf_insn *insn)
 {
+	int load_reg;
 	int err;
 
-	if ((BPF_SIZE(insn->code) != BPF_W && BPF_SIZE(insn->code) != BPF_DW) ||
-	    insn->imm != 0) {
-		verbose(env, "BPF_XADD uses reserved fields\n");
+	switch (insn->imm) {
+	case BPF_ADD:
+	case BPF_ADD | BPF_FETCH:
+	case BPF_AND:
+	case BPF_AND | BPF_FETCH:
+	case BPF_OR:
+	case BPF_OR | BPF_FETCH:
+	case BPF_XOR:
+	case BPF_XOR | BPF_FETCH:
+	case BPF_XCHG:
+	case BPF_CMPXCHG:
+		break;
+	default:
+		verbose(env, "BPF_ATOMIC uses invalid atomic opcode %02x\n", insn->imm);
+		return -EINVAL;
+	}
+
+	if (BPF_SIZE(insn->code) != BPF_W && BPF_SIZE(insn->code) != BPF_DW) {
+		verbose(env, "invalid atomic operand size\n");
 		return -EINVAL;
 	}
 
@@ -3932,6 +3976,13 @@
 	if (err)
 		return err;
 
+	if (insn->imm == BPF_CMPXCHG) {
+		/* Check comparison of R0 with memory location */
+		err = check_reg_arg(env, BPF_REG_0, SRC_OP);
+		if (err)
+			return err;
+	}
+
 	if (is_pointer_value(env, insn->src_reg)) {
 		verbose(env, "R%d leaks addr into mem\n", insn->src_reg);
 		return -EACCES;
@@ -3941,21 +3992,42 @@
 	    is_pkt_reg(env, insn->dst_reg) ||
 	    is_flow_key_reg(env, insn->dst_reg) ||
 	    is_sk_reg(env, insn->dst_reg)) {
-		verbose(env, "BPF_XADD stores into R%d %s is not allowed\n",
+		verbose(env, "BPF_ATOMIC stores into R%d %s is not allowed\n",
 			insn->dst_reg,
 			reg_type_str[reg_state(env, insn->dst_reg)->type]);
 		return -EACCES;
 	}
 
-	/* check whether atomic_add can read the memory */
+	if (insn->imm & BPF_FETCH) {
+		if (insn->imm == BPF_CMPXCHG)
+			load_reg = BPF_REG_0;
+		else
+			load_reg = insn->src_reg;
+
+		/* check and record load of old value */
+		err = check_reg_arg(env, load_reg, DST_OP);
+		if (err)
+			return err;
+	} else {
+		/* This instruction accesses a memory location but doesn't
+		 * actually load it into a register.
+		 */
+		load_reg = -1;
+	}
+
+	/* check whether we can read the memory */
 	err = check_mem_access(env, insn_idx, insn->dst_reg, insn->off,
-			       BPF_SIZE(insn->code), BPF_READ, -1, true);
+			       BPF_SIZE(insn->code), BPF_READ, load_reg, true);
 	if (err)
 		return err;
 
-	/* check whether atomic_add can write into the same memory */
-	return check_mem_access(env, insn_idx, insn->dst_reg, insn->off,
-				BPF_SIZE(insn->code), BPF_WRITE, -1, true);
+	/* check whether we can write into the same memory */
+	err = check_mem_access(env, insn_idx, insn->dst_reg, insn->off,
+			       BPF_SIZE(insn->code), BPF_WRITE, -1, true);
+	if (err)
+		return err;
+
+	return 0;
 }
 
 /* When register 'regno' is used to read the stack (either directly or through
@@ -4774,6 +4846,11 @@
 		    func_id != BPF_FUNC_inode_storage_delete)
 			goto error;
 		break;
+	case BPF_MAP_TYPE_TASK_STORAGE:
+		if (func_id != BPF_FUNC_task_storage_get &&
+		    func_id != BPF_FUNC_task_storage_delete)
+			goto error;
+		break;
 	default:
 		break;
 	}
@@ -4858,6 +4935,11 @@
 		if (map->map_type != BPF_MAP_TYPE_INODE_STORAGE)
 			goto error;
 		break;
+	case BPF_FUNC_task_storage_get:
+	case BPF_FUNC_task_storage_delete:
+		if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE)
+			goto error;
+		break;
 	default:
 		break;
 	}
@@ -5490,11 +5572,14 @@
 				PTR_TO_BTF_ID : PTR_TO_BTF_ID_OR_NULL;
 			regs[BPF_REG_0].btf_id = meta.ret_btf_id;
 		}
-	} else if (fn->ret_type == RET_PTR_TO_BTF_ID_OR_NULL) {
+	} else if (fn->ret_type == RET_PTR_TO_BTF_ID_OR_NULL ||
+		   fn->ret_type == RET_PTR_TO_BTF_ID) {
 		int ret_btf_id;
 
 		mark_reg_known_zero(env, regs, BPF_REG_0);
-		regs[BPF_REG_0].type = PTR_TO_BTF_ID_OR_NULL;
+		regs[BPF_REG_0].type = fn->ret_type == RET_PTR_TO_BTF_ID ?
+						     PTR_TO_BTF_ID :
+						     PTR_TO_BTF_ID_OR_NULL;
 		ret_btf_id = *fn->ret_btf_id;
 		if (ret_btf_id == 0) {
 			verbose(env, "invalid return type %d of func %s#%d\n",
@@ -9890,14 +9975,19 @@
 		} else if (class == BPF_STX) {
 			enum bpf_reg_type *prev_dst_type, dst_reg_type;
 
-			if (BPF_MODE(insn->code) == BPF_XADD) {
-				err = check_xadd(env, env->insn_idx, insn);
+			if (BPF_MODE(insn->code) == BPF_ATOMIC) {
+				err = check_atomic(env, env->insn_idx, insn);
 				if (err)
 					return err;
 				env->insn_idx++;
 				continue;
 			}
 
+			if (BPF_MODE(insn->code) != BPF_MEM || insn->imm != 0) {
+				verbose(env, "BPF_STX uses reserved fields\n");
+				return -EINVAL;
+			}
+
 			/* check src1 operand */
 			err = check_reg_arg(env, insn->src_reg, SRC_OP);
 			if (err)
@@ -10222,11 +10312,21 @@
 		verbose(env, "trace type programs with run-time allocated hash maps are unsafe. Switch to preallocated hash maps.\n");
 	}
 
-	if ((is_tracing_prog_type(prog_type) ||
-	     prog_type == BPF_PROG_TYPE_SOCKET_FILTER) &&
-	    map_value_has_spin_lock(map)) {
-		verbose(env, "tracing progs cannot use bpf_spin_lock yet\n");
-		return -EINVAL;
+	if (map_value_has_spin_lock(map)) {
+		if (prog_type == BPF_PROG_TYPE_SOCKET_FILTER) {
+			verbose(env, "socket filter progs cannot use bpf_spin_lock yet\n");
+			return -EINVAL;
+		}
+
+		if (is_tracing_prog_type(prog_type)) {
+			verbose(env, "tracing progs cannot use bpf_spin_lock yet\n");
+			return -EINVAL;
+		}
+
+		if (prog->aux->sleepable) {
+			verbose(env, "sleepable progs cannot use bpf_spin_lock yet\n");
+			return -EINVAL;
+		}
 	}
 
 	if ((bpf_prog_is_dev_bound(prog->aux) || bpf_map_is_dev_bound(map)) &&
@@ -10251,9 +10351,11 @@
 				return -EINVAL;
 			}
 			break;
+		case BPF_MAP_TYPE_RINGBUF:
+			break;
 		default:
 			verbose(env,
-				"Sleepable programs can only use array and hash maps\n");
+				"Sleepable programs can only use array, hash, and ringbuf maps\n");
 			return -EINVAL;
 		}
 
@@ -10290,13 +10392,6 @@
 			return -EINVAL;
 		}
 
-		if (BPF_CLASS(insn->code) == BPF_STX &&
-		    ((BPF_MODE(insn->code) != BPF_MEM &&
-		      BPF_MODE(insn->code) != BPF_XADD) || insn->imm != 0)) {
-			verbose(env, "BPF_STX uses reserved fields\n");
-			return -EINVAL;
-		}
-
 		if (insn[0].code == (BPF_LD | BPF_IMM | BPF_DW)) {
 			struct bpf_insn_aux_data *aux;
 			struct bpf_map *map;
@@ -10821,8 +10916,10 @@
 	for (i = 0; i < len; i++) {
 		int adj_idx = i + delta;
 		struct bpf_insn insn;
+		int load_reg;
 
 		insn = insns[adj_idx];
+		load_reg = insn_def_regno(&insn);
 		if (!aux[adj_idx].zext_dst) {
 			u8 code, class;
 			u32 imm_rnd;
@@ -10832,14 +10929,14 @@
 
 			code = insn.code;
 			class = BPF_CLASS(code);
-			if (insn_no_def(&insn))
+			if (load_reg == -1)
 				continue;
 
 			/* NOTE: arg "reg" (the fourth one) is only used for
-			 *       BPF_STX which has been ruled out in above
-			 *       check, it is safe to pass NULL here.
+			 *       BPF_STX + SRC_OP, so it is safe to pass NULL
+			 *       here.
 			 */
-			if (is_reg64(env, &insn, insn.dst_reg, NULL, DST_OP)) {
+			if (is_reg64(env, &insn, load_reg, NULL, DST_OP)) {
 				if (class == BPF_LD &&
 				    BPF_MODE(code) == BPF_IMM)
 					i++;
@@ -10854,18 +10951,33 @@
 			imm_rnd = get_random_int();
 			rnd_hi32_patch[0] = insn;
 			rnd_hi32_patch[1].imm = imm_rnd;
-			rnd_hi32_patch[3].dst_reg = insn.dst_reg;
+			rnd_hi32_patch[3].dst_reg = load_reg;
 			patch = rnd_hi32_patch;
 			patch_len = 4;
 			goto apply_patch_buffer;
 		}
 
-		if (!bpf_jit_needs_zext())
+		/* Add in an zero-extend instruction if a) the JIT has requested
+		 * it or b) it's a CMPXCHG.
+		 *
+		 * The latter is because: BPF_CMPXCHG always loads a value into
+		 * R0, therefore always zero-extends. However some archs'
+		 * equivalent instruction only does this load when the
+		 * comparison is successful. This detail of CMPXCHG is
+		 * orthogonal to the general zero-extension behaviour of the
+		 * CPU, so it's treated independently of bpf_jit_needs_zext.
+		 */
+		if (!bpf_jit_needs_zext() && !is_cmpxchg_insn(&insn))
 			continue;
 
+		if (WARN_ON(load_reg == -1)) {
+			verbose(env, "verifier bug. zext_dst is set, but no reg is defined\n");
+			return -EFAULT;
+		}
+
 		zext_patch[0] = insn;
-		zext_patch[1].dst_reg = insn.dst_reg;
-		zext_patch[1].src_reg = insn.dst_reg;
+		zext_patch[1].dst_reg = load_reg;
+		zext_patch[1].src_reg = load_reg;
 		patch = zext_patch;
 		patch_len = 2;
 apply_patch_buffer:
@@ -11929,20 +12041,6 @@
 	return -EINVAL;
 }
 
-/* non exhaustive list of sleepable bpf_lsm_*() functions */
-BTF_SET_START(btf_sleepable_lsm_hooks)
-#ifdef CONFIG_BPF_LSM
-BTF_ID(func, bpf_lsm_bprm_committed_creds)
-#else
-BTF_ID_UNUSED
-#endif
-BTF_SET_END(btf_sleepable_lsm_hooks)
-
-static int check_sleepable_lsm_hook(u32 btf_id)
-{
-	return btf_id_set_contains(&btf_sleepable_lsm_hooks, btf_id);
-}
-
 /* list of non-sleepable functions that are otherwise on
  * ALLOW_ERROR_INJECTION list
  */
@@ -12164,7 +12262,7 @@
 				/* LSM progs check that they are attached to bpf_lsm_*() funcs.
 				 * Only some of them are sleepable.
 				 */
-				if (check_sleepable_lsm_hook(btf_id))
+				if (bpf_lsm_is_sleepable_hook(btf_id))
 					ret = 0;
 				break;
 			default:

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index e289e67..777f2ff 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c

@@ -65,6 +65,8 @@
 
 	if (unlikely(ti_work & _TIF_SYSCALL_TRACEPOINT))
 		trace_sys_enter(regs, syscall);
+	/* tace_sys_enter might have changed the syscall number */
+	syscall = syscall_get_nr(current, regs);
 
 	syscall_enter_audit(regs, syscall);
 

diff --git a/kernel/exit.c b/kernel/exit.c
index d13d67f..7ecf893 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c

@@ -64,6 +64,7 @@
 #include <linux/rcuwait.h>
 #include <linux/compat.h>
 #include <linux/io_uring.h>
+#include <linux/security.h>
 
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
@@ -786,6 +787,8 @@
 #endif
 		if (tsk->mm)
 			setmax_mm_hiwater_rss(&tsk->signal->maxrss, tsk->mm);
+
+		security_task_exit(tsk);
 	}
 	acct_collect(code, group_dead);
 	if (group_dead)

diff --git a/kernel/fork.c b/kernel/fork.c
index 9705439..b4bc619 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c

@@ -2033,7 +2033,6 @@
 	posix_cputimers_init(&p->posix_cputimers);
 
 	p->io_context = NULL;
-	audit_set_context(p, NULL);
 	cgroup_fork(p);
 #ifdef CONFIG_NUMA
 	p->mempolicy = mpol_dup(p->mempolicy);
@@ -2317,6 +2316,7 @@
 	uprobe_copy_process(p, clone_flags);
 
 	copy_oom_score_adj(clone_flags, p);
+	security_task_post_alloc(p);
 
 	return p;
 

diff --git a/kernel/futex.c b/kernel/futex.c
index 98a6e1b..fc4cec1 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c

@@ -1584,16 +1584,16 @@
 }
 
 /*
- * Wake up waiters matching bitset queued on this futex (uaddr).
+ * Prepare wake queue matching bitset queued on this futex (uaddr).
  */
 static int
-futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
+prepare_wake_q(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset,
+		struct wake_q_head *wake_q)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
 	union futex_key key = FUTEX_KEY_INIT;
 	int ret;
-	DEFINE_WAKE_Q(wake_q);
 
 	if (!bitset)
 		return -EINVAL;
@@ -1621,14 +1621,28 @@
 			if (!(this->bitset & bitset))
 				continue;
 
-			mark_wake_futex(&wake_q, this);
+			mark_wake_futex(wake_q, this);
 			if (++ret >= nr_wake)
 				break;
 		}
 	}
 
 	spin_unlock(&hb->lock);
-	wake_up_q(&wake_q);
+	return ret;
+}
+
+/*
+ * Wake up waiters matching bitset queued on this futex (uaddr).
+ */
+static int
+futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
+{
+	int ret;
+	DEFINE_WAKE_Q(wake_q);
+
+	ret = prepare_wake_q(uaddr, flags, nr_wake, bitset, &wake_q);
+	if (ret > 0)
+		wake_up_q(&wake_q);
 	return ret;
 }
 
@@ -2576,9 +2590,12 @@
  * @hb:		the futex hash bucket, must be locked by the caller
  * @q:		the futex_q to queue up on
  * @timeout:	the prepared hrtimer_sleeper, or null for no timeout
+ * @next:	if present, wake next and hint to the scheduler that we'd
+ *		prefer to execute it locally.
  */
 static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
-				struct hrtimer_sleeper *timeout)
+				struct hrtimer_sleeper *timeout,
+				struct task_struct *next)
 {
 	/*
 	 * The task state is guaranteed to be set before another task can
@@ -2603,10 +2620,25 @@
 		 * flagged for rescheduling. Only call schedule if there
 		 * is no timeout, or if it has yet to expire.
 		 */
-		if (!timeout || timeout->task)
+		if (!timeout || timeout->task) {
+			if (next) {
+#ifdef CONFIG_SMP
+				wake_up_swap(next);
+#else
+				wake_up_process(next);
+#endif
+				put_task_struct(next);
+				next = NULL;
+			}
 			freezable_schedule();
+		}
 	}
 	__set_current_state(TASK_RUNNING);
+
+	if (next) {
+		wake_up_process(next);
+		put_task_struct(next);
+	}
 }
 
 /**
@@ -2682,7 +2714,7 @@
 }
 
 static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
-		      ktime_t *abs_time, u32 bitset)
+		      ktime_t *abs_time, u32 bitset, struct task_struct *next)
 {
 	struct hrtimer_sleeper timeout, *to;
 	struct restart_block *restart;
@@ -2706,7 +2738,8 @@
 		goto out;
 
 	/* queue_me and wait for wakeup, timeout, or a signal. */
-	futex_wait_queue_me(hb, &q, to);
+	futex_wait_queue_me(hb, &q, to, next);
+	next = NULL;
 
 	/* If we were woken (and unqueued), we succeeded, whatever. */
 	ret = 0;
@@ -2738,6 +2771,10 @@
 	ret = set_restart_fn(restart, futex_wait_restart);
 
 out:
+	if (next) {
+		wake_up_process(next);
+		put_task_struct(next);
+	}
 	if (to) {
 		hrtimer_cancel(&to->timer);
 		destroy_hrtimer_on_stack(&to->timer);
@@ -2757,10 +2794,30 @@
 	}
 	restart->fn = do_no_restart_syscall;
 
-	return (long)futex_wait(uaddr, restart->futex.flags,
-				restart->futex.val, tp, restart->futex.bitset);
+	return (long)futex_wait(uaddr, restart->futex.flags, restart->futex.val,
+				tp, restart->futex.bitset, NULL);
 }
 
+static int futex_swap(u32 __user *uaddr, unsigned int flags, u32 val,
+		      ktime_t *abs_time, u32 __user *uaddr2)
+{
+	u32 bitset = FUTEX_BITSET_MATCH_ANY;
+	struct task_struct *next = NULL;
+	DEFINE_WAKE_Q(wake_q);
+	int ret;
+
+	ret = prepare_wake_q(uaddr2, flags, 1, bitset, &wake_q);
+	if (ret < 0)
+		return ret;
+	if (wake_q.first != WAKE_Q_TAIL) {
+		WARN_ON(ret != 1);
+		/* At most one wakee can be present. Pull it out. */
+		next = container_of(wake_q.first, struct task_struct, wake_q);
+		next->wake_q.next = NULL;
+	}
+
+	return futex_wait(uaddr, flags, val, abs_time, bitset, next);
+}
 
 /*
  * Userspace tried a 0 -> TID atomic transition of the futex value
@@ -3222,7 +3279,7 @@
 	}
 
 	/* Queue the futex_q, drop the hb lock, wait for wakeup. */
-	futex_wait_queue_me(hb, &q, to);
+	futex_wait_queue_me(hb, &q, to, NULL);
 
 	spin_lock(&hb->lock);
 	ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
@@ -3732,7 +3789,7 @@
 		val3 = FUTEX_BITSET_MATCH_ANY;
 		fallthrough;
 	case FUTEX_WAIT_BITSET:
-		return futex_wait(uaddr, flags, val, timeout, val3);
+		return futex_wait(uaddr, flags, val, timeout, val3, NULL);
 	case FUTEX_WAKE:
 		val3 = FUTEX_BITSET_MATCH_ANY;
 		fallthrough;
@@ -3756,6 +3813,8 @@
 					     uaddr2);
 	case FUTEX_CMP_REQUEUE_PI:
 		return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
+	case GFUTEX_SWAP:
+		return futex_swap(uaddr, flags, val, timeout, uaddr2);
 	}
 	return -ENOSYS;
 }
@@ -3772,7 +3831,7 @@
 
 	if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI ||
 		      cmd == FUTEX_WAIT_BITSET ||
-		      cmd == FUTEX_WAIT_REQUEUE_PI)) {
+		      cmd == FUTEX_WAIT_REQUEUE_PI || cmd == GFUTEX_SWAP)) {
 		if (unlikely(should_fail_futex(!(op & FUTEX_PRIVATE_FLAG))))
 			return -EFAULT;
 		if (get_timespec64(&ts, utime))
@@ -3781,7 +3840,7 @@
 			return -EINVAL;
 
 		t = timespec64_to_ktime(ts);
-		if (cmd == FUTEX_WAIT)
+		if (cmd == FUTEX_WAIT || cmd == GFUTEX_SWAP)
 			t = ktime_add_safe(ktime_get(), t);
 		else if (cmd != FUTEX_LOCK_PI && !(op & FUTEX_CLOCK_REALTIME))
 			t = timens_ktime_to_host(CLOCK_MONOTONIC, t);

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 508fe52..3761719 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c

@@ -1308,12 +1308,11 @@
  * kthread_use_mm - make the calling kthread operate on an address space
  * @mm: address space to operate on
  */
-void kthread_use_mm(struct mm_struct *mm)
+void use_mm(struct mm_struct *mm)
 {
 	struct mm_struct *active_mm;
 	struct task_struct *tsk = current;
 
-	WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
 	WARN_ON_ONCE(tsk->mm);
 
 	task_lock(tsk);
@@ -1334,6 +1333,16 @@
 
 	if (active_mm != mm)
 		mmdrop(active_mm);
+}
+EXPORT_SYMBOL_GPL(use_mm);
+
+void kthread_use_mm(struct mm_struct *mm)
+{
+	struct task_struct *tsk = current;
+
+	WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
+
+	use_mm(mm);
 
 	to_kthread(tsk)->oldfs = force_uaccess_begin();
 }
@@ -1348,10 +1357,18 @@
 	struct task_struct *tsk = current;
 
 	WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
-	WARN_ON_ONCE(!tsk->mm);
-
 	force_uaccess_end(to_kthread(tsk)->oldfs);
 
+	unuse_mm(mm);
+}
+EXPORT_SYMBOL_GPL(kthread_unuse_mm);
+
+void unuse_mm(struct mm_struct *mm)
+{
+	struct task_struct *tsk = current;
+
+	WARN_ON_ONCE(!tsk->mm);
+
 	task_lock(tsk);
 	sync_mm_rss(mm);
 	local_irq_disable();
@@ -1361,7 +1378,7 @@
 	local_irq_enable();
 	task_unlock(tsk);
 }
-EXPORT_SYMBOL_GPL(kthread_unuse_mm);
+EXPORT_SYMBOL_GPL(unuse_mm);
 
 #ifdef CONFIG_BLK_CGROUP
 /**

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6db20a6..e0495b0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c

@@ -3058,6 +3058,11 @@
 }
 EXPORT_SYMBOL(wake_up_process);
 
+int wake_up_swap(struct task_struct *tsk)
+{
+	return try_to_wake_up(tsk, TASK_NORMAL, WF_CURRENT_CPU);
+}
+
 int wake_up_state(struct task_struct *p, unsigned int state)
 {
 	return try_to_wake_up(p, state, 0);

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c004e3b..fdf4ff6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c

@@ -6740,6 +6740,9 @@
 	int want_affine = 0;
 	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
 
+	if ((wake_flags & WF_CURRENT_CPU) && cpumask_test_cpu(cpu, p->cpus_ptr))
+		return cpu;
+
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
 

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 08db8e0..a7c9480 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h

@@ -1725,6 +1725,7 @@
 #define WF_FORK			0x02		/* Child wakeup after fork */
 #define WF_MIGRATED		0x04		/* Internal use, task got migrated */
 #define WF_ON_CPU		0x08		/* Wakee is on_cpu */
+#define WF_CURRENT_CPU		0x200		/* Prefer to move wakee to the current CPU */
 
 /*
  * To aid in avoiding the subversion of "niceness" due to uneven distribution

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index ba64476..9deeb9d 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c

@@ -16,6 +16,8 @@
 #include <linux/syscalls.h>
 #include <linux/error-injection.h>
 #include <linux/btf_ids.h>
+#include <linux/bpf_lsm.h>
+#include <net/bpf_sk_storage.h>
 
 #include <uapi/linux/bpf.h>
 #include <uapi/linux/btf.h>
@@ -1017,6 +1019,20 @@
 	.ret_type	= RET_INTEGER,
 };
 
+BPF_CALL_0(bpf_get_current_task_btf)
+{
+	return (unsigned long) current;
+}
+
+BTF_ID_LIST_SINGLE(bpf_get_current_btf_ids, struct, task_struct)
+
+static const struct bpf_func_proto bpf_get_current_task_btf_proto = {
+	.func		= bpf_get_current_task_btf,
+	.gpl_only	= true,
+	.ret_type	= RET_PTR_TO_BTF_ID,
+	.ret_btf_id	= &bpf_get_current_btf_ids[0],
+};
+
 BPF_CALL_2(bpf_current_task_under_cgroup, struct bpf_map *, map, u32, idx)
 {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
@@ -1159,7 +1175,11 @@
 
 static bool bpf_d_path_allowed(const struct bpf_prog *prog)
 {
-	return btf_id_set_contains(&btf_allowlist_d_path, prog->aux->attach_btf_id);
+	if (prog->type == BPF_PROG_TYPE_LSM)
+		return bpf_lsm_is_sleepable_hook(prog->aux->attach_btf_id);
+
+	return btf_id_set_contains(&btf_allowlist_d_path,
+				   prog->aux->attach_btf_id);
 }
 
 BTF_ID_LIST_SINGLE(bpf_d_path_btf_ids, struct, path)
@@ -1254,12 +1274,16 @@
 		return &bpf_ktime_get_ns_proto;
 	case BPF_FUNC_ktime_get_boot_ns:
 		return &bpf_ktime_get_boot_ns_proto;
+	case BPF_FUNC_ktime_get_coarse_ns:
+		return &bpf_ktime_get_coarse_ns_proto;
 	case BPF_FUNC_tail_call:
 		return &bpf_tail_call_proto;
 	case BPF_FUNC_get_current_pid_tgid:
 		return &bpf_get_current_pid_tgid_proto;
 	case BPF_FUNC_get_current_task:
 		return &bpf_get_current_task_proto;
+	case BPF_FUNC_get_current_task_btf:
+		return &bpf_get_current_task_btf_proto;
 	case BPF_FUNC_get_current_uid_gid:
 		return &bpf_get_current_uid_gid_proto;
 	case BPF_FUNC_get_current_comm:
@@ -1719,6 +1743,8 @@
 		return &bpf_skc_to_tcp_request_sock_proto;
 	case BPF_FUNC_skc_to_udp6_sock:
 		return &bpf_skc_to_udp6_sock_proto;
+	case BPF_FUNC_get_socket_cookie:
+		return &bpf_get_socket_ptr_cookie_proto;
 #endif
 	case BPF_FUNC_seq_printf:
 		return prog->expected_attach_type == BPF_TRACE_ITER ?

diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index ca7d635..49ec9e8 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c

@@ -4295,13 +4295,13 @@
 		{ { 0, 0xffffffff } },
 		.stack_depth = 40,
 	},
-	/* BPF_STX | BPF_XADD | BPF_W/DW */
+	/* BPF_STX | BPF_ATOMIC | BPF_W/DW */
 	{
 		"STX_XADD_W: Test: 0x12 + 0x10 = 0x22",
 		.u.insns_int = {
 			BPF_ALU32_IMM(BPF_MOV, R0, 0x12),
 			BPF_ST_MEM(BPF_W, R10, -40, 0x10),
-			BPF_STX_XADD(BPF_W, R10, R0, -40),
+			BPF_ATOMIC_OP(BPF_W, BPF_ADD, R10, R0, -40),
 			BPF_LDX_MEM(BPF_W, R0, R10, -40),
 			BPF_EXIT_INSN(),
 		},
@@ -4316,7 +4316,7 @@
 			BPF_ALU64_REG(BPF_MOV, R1, R10),
 			BPF_ALU32_IMM(BPF_MOV, R0, 0x12),
 			BPF_ST_MEM(BPF_W, R10, -40, 0x10),
-			BPF_STX_XADD(BPF_W, R10, R0, -40),
+			BPF_ATOMIC_OP(BPF_W, BPF_ADD, R10, R0, -40),
 			BPF_ALU64_REG(BPF_MOV, R0, R10),
 			BPF_ALU64_REG(BPF_SUB, R0, R1),
 			BPF_EXIT_INSN(),
@@ -4331,7 +4331,7 @@
 		.u.insns_int = {
 			BPF_ALU32_IMM(BPF_MOV, R0, 0x12),
 			BPF_ST_MEM(BPF_W, R10, -40, 0x10),
-			BPF_STX_XADD(BPF_W, R10, R0, -40),
+			BPF_ATOMIC_OP(BPF_W, BPF_ADD, R10, R0, -40),
 			BPF_EXIT_INSN(),
 		},
 		INTERNAL,
@@ -4352,7 +4352,7 @@
 		.u.insns_int = {
 			BPF_ALU32_IMM(BPF_MOV, R0, 0x12),
 			BPF_ST_MEM(BPF_DW, R10, -40, 0x10),
-			BPF_STX_XADD(BPF_DW, R10, R0, -40),
+			BPF_ATOMIC_OP(BPF_DW, BPF_ADD, R10, R0, -40),
 			BPF_LDX_MEM(BPF_DW, R0, R10, -40),
 			BPF_EXIT_INSN(),
 		},
@@ -4367,7 +4367,7 @@
 			BPF_ALU64_REG(BPF_MOV, R1, R10),
 			BPF_ALU32_IMM(BPF_MOV, R0, 0x12),
 			BPF_ST_MEM(BPF_DW, R10, -40, 0x10),
-			BPF_STX_XADD(BPF_DW, R10, R0, -40),
+			BPF_ATOMIC_OP(BPF_DW, BPF_ADD, R10, R0, -40),
 			BPF_ALU64_REG(BPF_MOV, R0, R10),
 			BPF_ALU64_REG(BPF_SUB, R0, R1),
 			BPF_EXIT_INSN(),
@@ -4382,7 +4382,7 @@
 		.u.insns_int = {
 			BPF_ALU32_IMM(BPF_MOV, R0, 0x12),
 			BPF_ST_MEM(BPF_DW, R10, -40, 0x10),
-			BPF_STX_XADD(BPF_DW, R10, R0, -40),
+			BPF_ATOMIC_OP(BPF_DW, BPF_ADD, R10, R0, -40),
 			BPF_EXIT_INSN(),
 		},
 		INTERNAL,

diff --git a/net/core/filter.c b/net/core/filter.c
index 7ea752a..e863b66 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c

@@ -4624,6 +4624,18 @@
 	.arg1_type	= ARG_PTR_TO_CTX,
 };
 
+BPF_CALL_1(bpf_get_socket_ptr_cookie, struct sock *, sk)
+{
+	return sk ? sock_gen_cookie(sk) : 0;
+}
+
+const struct bpf_func_proto bpf_get_socket_ptr_cookie_proto = {
+	.func		= bpf_get_socket_ptr_cookie,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_BTF_ID_SOCK_COMMON,
+};
+
 BPF_CALL_1(bpf_get_socket_cookie_sock_ops, struct bpf_sock_ops_kern *, ctx)
 {
 	return __sock_gen_cookie(ctx->sk);

diff --git a/samples/bpf/bpf_insn.h b/samples/bpf/bpf_insn.h
index 5442379..db67a28 100644
--- a/samples/bpf/bpf_insn.h
+++ b/samples/bpf/bpf_insn.h

@@ -138,11 +138,11 @@
 
 #define BPF_STX_XADD(SIZE, DST, SRC, OFF)			\
 	((struct bpf_insn) {					\
-		.code  = BPF_STX | BPF_SIZE(SIZE) | BPF_XADD,	\
+		.code  = BPF_STX | BPF_SIZE(SIZE) | BPF_ATOMIC,	\
 		.dst_reg = DST,					\
 		.src_reg = SRC,					\
 		.off   = OFF,					\
-		.imm   = 0 })
+		.imm   = BPF_ADD })
 
 /* Memory store, *(uint *) (dst_reg + off16) = imm32 */
 

diff --git a/samples/bpf/cookie_uid_helper_example.c b/samples/bpf/cookie_uid_helper_example.c
index deb0e3e..c5ff7a1 100644
--- a/samples/bpf/cookie_uid_helper_example.c
+++ b/samples/bpf/cookie_uid_helper_example.c

@@ -147,12 +147,12 @@
 		 */
 		BPF_MOV64_REG(BPF_REG_9, BPF_REG_0),
 		BPF_MOV64_IMM(BPF_REG_1, 1),
-		BPF_STX_XADD(BPF_DW, BPF_REG_9, BPF_REG_1,
-				offsetof(struct stats, packets)),
+		BPF_ATOMIC_OP(BPF_DW, BPF_ADD, BPF_REG_9, BPF_REG_1,
+			      offsetof(struct stats, packets)),
 		BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_6,
 				offsetof(struct __sk_buff, len)),
-		BPF_STX_XADD(BPF_DW, BPF_REG_9, BPF_REG_1,
-				offsetof(struct stats, bytes)),
+		BPF_ATOMIC_OP(BPF_DW, BPF_ADD, BPF_REG_9, BPF_REG_1,
+			      offsetof(struct stats, bytes)),
 		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_6,
 				offsetof(struct __sk_buff, len)),
 		BPF_EXIT_INSN(),

diff --git a/samples/bpf/sock_example.c b/samples/bpf/sock_example.c
index 00aae1d..23d1930 100644
--- a/samples/bpf/sock_example.c
+++ b/samples/bpf/sock_example.c

@@ -54,7 +54,7 @@
 		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
 		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
 		BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
-		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+		BPF_ATOMIC_OP(BPF_DW, BPF_ADD, BPF_REG_0, BPF_REG_1, 0),
 		BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
 		BPF_EXIT_INSN(),
 	};

diff --git a/samples/bpf/test_cgrp2_attach.c b/samples/bpf/test_cgrp2_attach.c
index 20fbd12..390ff38 100644
--- a/samples/bpf/test_cgrp2_attach.c
+++ b/samples/bpf/test_cgrp2_attach.c

@@ -53,7 +53,7 @@
 		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
 		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
 		BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
-		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+		BPF_ATOMIC_OP(BPF_DW, BPF_ADD, BPF_REG_0, BPF_REG_1, 0),
 
 		/* Count bytes */
 		BPF_MOV64_IMM(BPF_REG_0, MAP_KEY_BYTES), /* r0 = 1 */
@@ -64,7 +64,8 @@
 		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
 		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
 		BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_6, offsetof(struct __sk_buff, len)), /* r1 = skb->len */
-		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+
+		BPF_ATOMIC_OP(BPF_DW, BPF_ADD, BPF_REG_0, BPF_REG_1, 0),
 
 		BPF_MOV64_IMM(BPF_REG_0, verdict), /* r0 = verdict */
 		BPF_EXIT_INSN(),

diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py
index 3148437..8b829748 100755
--- a/scripts/bpf_helpers_doc.py
+++ b/scripts/bpf_helpers_doc.py

@@ -418,6 +418,7 @@
             'struct bpf_tcp_sock',
             'struct bpf_tunnel_key',
             'struct bpf_xfrm_state',
+            'struct linux_binprm',
             'struct pt_regs',
             'struct sk_reuseport_md',
             'struct sockaddr',
@@ -435,6 +436,7 @@
             'struct xdp_md',
             'struct path',
             'struct btf_ptr',
+            'struct inode',
     ]
     known_types = {
             '...',
@@ -465,6 +467,7 @@
             'struct bpf_tcp_sock',
             'struct bpf_tunnel_key',
             'struct bpf_xfrm_state',
+            'struct linux_binprm',
             'struct pt_regs',
             'struct sk_reuseport_md',
             'struct sockaddr',
@@ -478,6 +481,7 @@
             'struct task_struct',
             'struct path',
             'struct btf_ptr',
+            'struct inode',
     }
     mapped_types = {
             'u8': '__u8',

diff --git a/scripts/module.lds.S b/scripts/module.lds.S
index 69b9b71..18d5b84 100644
--- a/scripts/module.lds.S
+++ b/scripts/module.lds.S

@@ -23,6 +23,30 @@
 	.init_array		0 : ALIGN(8) { *(SORT(.init_array.*)) *(.init_array) }
 
 	__jump_table		0 : ALIGN(8) { KEEP(*(__jump_table)) }
+
+	__patchable_function_entries : { *(__patchable_function_entries) }
+
+	/*
+	 * With CONFIG_LTO_CLANG, LLD always enables -fdata-sections and
+	 * -ffunction-sections, which increases the size of the final module.
+	 * Merge the split sections in the final binary.
+	 */
+	.bss : {
+		*(.bss .bss.[0-9a-zA-Z_]*)
+		*(.bss..L*)
+	}
+
+	.data : {
+		*(.data .data.[0-9a-zA-Z_]*)
+		*(.data..L*)
+	}
+
+	.rodata : {
+		*(.rodata .rodata.[0-9a-zA-Z_]*)
+		*(.rodata..L*)
+	}
+
+	.text : { *(.text .text.[0-9a-zA-Z_]*) }
 }
 
 /* bring in arch-specific sections */

diff --git a/security/Kconfig b/security/Kconfig
index 7561f6f..98355ed 100644
--- a/security/Kconfig
+++ b/security/Kconfig

@@ -236,6 +236,7 @@
 source "security/apparmor/Kconfig"
 source "security/loadpin/Kconfig"
 source "security/yama/Kconfig"
+source "security/container/Kconfig"
 source "security/safesetid/Kconfig"
 source "security/lockdown/Kconfig"
 

diff --git a/security/Makefile b/security/Makefile
index 3baf435..f54e22d 100644
--- a/security/Makefile
+++ b/security/Makefile

@@ -10,6 +10,7 @@
 subdir-$(CONFIG_SECURITY_APPARMOR)	+= apparmor
 subdir-$(CONFIG_SECURITY_YAMA)		+= yama
 subdir-$(CONFIG_SECURITY_LOADPIN)	+= loadpin
+subdir-$(CONFIG_SECURITY_CONTAINER_MONITOR) += container
 subdir-$(CONFIG_SECURITY_SAFESETID)    += safesetid
 subdir-$(CONFIG_SECURITY_LOCKDOWN_LSM)	+= lockdown
 subdir-$(CONFIG_BPF_LSM)		+= bpf
@@ -32,6 +33,7 @@
 obj-$(CONFIG_SECURITY_LOCKDOWN_LSM)	+= lockdown/
 obj-$(CONFIG_CGROUPS)			+= device_cgroup.o
 obj-$(CONFIG_BPF_LSM)			+= bpf/
+obj-$(CONFIG_SECURITY_CONTAINER_MONITOR) += container/
 
 # Object integrity file lists
 subdir-$(CONFIG_INTEGRITY)		+= integrity

diff --git a/security/bpf/hooks.c b/security/bpf/hooks.c
index 788667d..e5971fa 100644
--- a/security/bpf/hooks.c
+++ b/security/bpf/hooks.c

@@ -12,6 +12,7 @@
 	#include <linux/lsm_hook_defs.h>
 	#undef LSM_HOOK
 	LSM_HOOK_INIT(inode_free_security, bpf_inode_storage_free),
+	LSM_HOOK_INIT(task_free, bpf_task_storage_free),
 };
 
 static int __init bpf_lsm_init(void)
@@ -23,6 +24,7 @@
 
 struct lsm_blob_sizes bpf_lsm_blob_sizes __lsm_ro_after_init = {
 	.lbs_inode = sizeof(struct bpf_storage_blob),
+	.lbs_task = sizeof(struct bpf_storage_blob),
 };
 
 DEFINE_LSM(bpf) = {

diff --git a/security/container/Kconfig b/security/container/Kconfig
new file mode 100644
index 0000000..72a51eb
--- /dev/null
+++ b/security/container/Kconfig

@@ -0,0 +1,17 @@
+config SECURITY_CONTAINER_MONITOR
+	bool "Monitor containerized processes"
+	depends on SECURITY
+	depends on MMU
+	depends on X86_64
+	select SECURITYFS
+	help
+	  Instrument the Linux kernel to collect more information about containers
+	  and identify security threats.
+
+config SECURITY_CONTAINER_MONITOR_DEBUG
+    bool "Enable debug pr_devel logs"
+	depends on SECURITY_CONTAINER_MONITOR
+	help
+	  Define DEBUG for CSM files to compile verbose debugging messages.
+
+	  Only for debugging/testing do not enable for production.

diff --git a/security/container/Makefile b/security/container/Makefile
new file mode 100644
index 0000000..9be2528
--- /dev/null
+++ b/security/container/Makefile

@@ -0,0 +1,16 @@
+PB_CCFLAGS := -DPB_SYSTEM_HEADER="<pbsystem.h>" \
+	-DPB_NO_ERRMSG \
+	-DPB_FIELD_16BIT \
+	-DPB_BUFFER_ONLY
+export PB_CCFLAGS
+
+subdir-$(CONFIG_SECURITY_CONTAINER_MONITOR) += protos
+
+obj-$(CONFIG_SECURITY_CONTAINER_MONITOR) += protos/
+obj-$(CONFIG_SECURITY_CONTAINER_MONITOR) += monitor.o pb.o process.o pipe.o
+
+ccflags-y := -I$(srctree)/security/container/protos \
+	-I$(srctree)/security/container/protos/nanopb \
+	-I$(srctree)/fs \
+	$(PB_CCFLAGS)
+ccflags-$(CONFIG_SECURITY_CONTAINER_MONITOR_DEBUG) += -DDEBUG

diff --git a/security/container/monitor.c b/security/container/monitor.c
new file mode 100644
index 0000000..63b6a66
--- /dev/null
+++ b/security/container/monitor.c

@@ -0,0 +1,747 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Container Security Monitor module
+ *
+ * Copyright (c) 2018 Google, Inc
+ */
+
+#include "monitor.h"
+#include "process.h"
+
+#include <linux/audit.h>
+#include <linux/file.h>
+#include <linux/lsm_hooks.h>
+#include <linux/module.h>
+#include <linux/pipe_fs_i.h>
+#include <linux/poll.h>
+#include <linux/rwsem.h>
+#include <linux/seq_file.h>
+#include <linux/string.h>
+#include <linux/sysctl.h>
+
+/* protects csm_*_enabled and configurations. */
+DECLARE_RWSEM(csm_rwsem_config);
+
+/* queue used for poll wait on config changes. */
+static DECLARE_WAIT_QUEUE_HEAD(config_wait);
+
+/* increase each time a new configuration is applied. */
+static unsigned long config_version;
+
+/* Stats gathered from the LSM. */
+struct container_stats csm_stats;
+
+struct container_stats_mapping {
+	const char *key;
+	size_t *value;
+};
+
+/* Key value pair mapping for the sysfs entry. */
+struct container_stats_mapping csm_stats_mapping[] = {
+	{ "ProtoEncodingFailed", &csm_stats.proto_encoding_failed },
+	{ "WorkQueueFailed", &csm_stats.workqueue_failed },
+	{ "EventWritingFailed", &csm_stats.event_writing_failed },
+	{ "SizePickingFailed", &csm_stats.size_picking_failed },
+	{ "PipeAlreadyOpened", &csm_stats.pipe_already_opened },
+};
+
+/*
+ * Is monitoring enabled? Defaults to disabled.
+ * These variables might be used without locking csm_rwsem_config to check if an
+ * LSM hook can bail quickly. The semaphore is taken later to ensure CSM is
+ * still enabled.
+ *
+ * csm_enabled is true if any collector is enabled.
+ */
+bool csm_enabled;
+static bool csm_container_enabled;
+bool csm_execute_enabled;
+bool csm_memexec_enabled;
+
+/* securityfs control files */
+static struct dentry *csm_dir;
+static struct dentry *csm_enabled_file;
+static struct dentry *csm_container_file;
+static struct dentry *csm_config_file;
+static struct dentry *csm_config_vers_file;
+static struct dentry *csm_pipe_file;
+static struct dentry *csm_stats_file;
+
+/* Pipes to forward data to user-mode. */
+DECLARE_RWSEM(csm_rwsem_pipe);
+static struct file *csm_user_read_pipe;
+struct file *csm_user_write_pipe;
+
+/* Option to disable the CSM features at boot. */
+static bool cmdline_boot_disabled;
+bool cmdline_boot_vsock_enabled;
+
+/* Options disabled by default. */
+static bool cmdline_boot_pipe_enabled;
+static bool cmdline_boot_config_enabled;
+
+/* Option to fully enabled the LSM at boot for automated testing. */
+static bool cmdline_default_enabled;
+
+static int csm_boot_disabled_setup(char *str)
+{
+	return kstrtobool(str, &cmdline_boot_disabled);
+}
+early_param("csm.disabled", csm_boot_disabled_setup);
+
+static int csm_default_enabled_setup(char *str)
+{
+	return kstrtobool(str, &cmdline_default_enabled);
+}
+early_param("csm.default.enabled", csm_default_enabled_setup);
+
+static int csm_boot_vsock_enabled_setup(char *str)
+{
+	return kstrtobool(str, &cmdline_boot_vsock_enabled);
+}
+early_param("csm.vsock.enabled", csm_boot_vsock_enabled_setup);
+
+static int csm_boot_pipe_enabled_setup(char *str)
+{
+	return kstrtobool(str, &cmdline_boot_pipe_enabled);
+}
+early_param("csm.pipe.enabled", csm_boot_pipe_enabled_setup);
+
+static int csm_boot_config_enabled_setup(char *str)
+{
+	return kstrtobool(str, &cmdline_boot_config_enabled);
+}
+early_param("csm.config.enabled", csm_boot_config_enabled_setup);
+
+static bool pipe_in_use(void)
+{
+	struct pipe_inode_info *pipe;
+
+	lockdep_assert_held_write(&csm_rwsem_config);
+	if (csm_user_read_pipe) {
+		pipe = get_pipe_info(csm_user_read_pipe, false);
+		if (pipe)
+			return READ_ONCE(pipe->readers) > 1;
+	}
+	return false;
+}
+
+/* Close pipe, force has to be true to close pipe if it is still being used. */
+int close_pipe_files(bool force)
+{
+	if (csm_user_read_pipe) {
+		/* Pipe is still used. */
+		if (pipe_in_use()) {
+			if (!force)
+				return -EBUSY;
+			pr_warn("pipe is closed while it is still being used.\n");
+		}
+
+		fput(csm_user_read_pipe);
+		fput(csm_user_write_pipe);
+		csm_user_read_pipe = NULL;
+		csm_user_write_pipe = NULL;
+	}
+	return 0;
+}
+
+static void csm_update_config(schema_ConfigurationRequest *req)
+{
+	schema_ExecuteCollectorConfig *econf;
+	size_t i;
+	bool enumerate_processes = false;
+
+	/* Expect the lock to be held for write before this call. */
+	lockdep_assert_held_write(&csm_rwsem_config);
+
+	/* This covers the scenario where a client is connected and the config
+	 * transitions the execute collector from disabled to enabled. In that
+	 * case there may have been execute events not sent. So they are
+	 * enumerated.
+	 */
+	if (!csm_execute_enabled && req->execute_config.enabled &&
+	    pipe_in_use())
+		enumerate_processes = true;
+
+	csm_container_enabled = req->container_config.enabled;
+	csm_execute_enabled = req->execute_config.enabled;
+	csm_memexec_enabled = req->memexec_config.enabled;
+
+	/* csm_enabled is true if any collector is enabled. */
+	csm_enabled = csm_container_enabled || csm_execute_enabled ||
+		csm_memexec_enabled;
+
+	/* Clean-up existing configurations. */
+	kfree(csm_execute_config.envp_allowlist);
+	memset(&csm_execute_config, 0, sizeof(csm_execute_config));
+
+	if (csm_execute_enabled) {
+		econf = &req->execute_config;
+		csm_execute_config.argv_limit = econf->argv_limit;
+		csm_execute_config.envp_limit = econf->envp_limit;
+
+		/* Swap the allowlist so it is not freed on return. */
+		csm_execute_config.envp_allowlist = econf->envp_allowlist.arg;
+		econf->envp_allowlist.arg = NULL;
+	}
+
+	/* Reset all stats and close pipe if disabled. */
+	if (!csm_enabled) {
+		for (i = 0; i < ARRAY_SIZE(csm_stats_mapping); i++)
+			*csm_stats_mapping[i].value = 0;
+
+		close_pipe_files(true);
+	}
+
+	config_version++;
+	if (enumerate_processes)
+		csm_enumerate_processes();
+	wake_up(&config_wait);
+}
+
+int csm_update_config_from_buffer(void *data, size_t size)
+{
+	schema_ConfigurationRequest c = {};
+	pb_istream_t istream;
+
+	c.execute_config.envp_allowlist.funcs.decode = pb_decode_string_array;
+
+	istream = pb_istream_from_buffer(data, size);
+	if (!pb_decode(&istream, schema_ConfigurationRequest_fields, &c)) {
+		kfree(c.execute_config.envp_allowlist.arg);
+		return -EINVAL;
+	}
+
+	down_write(&csm_rwsem_config);
+	csm_update_config(&c);
+	up_write(&csm_rwsem_config);
+
+	return 0;
+}
+
+static ssize_t csm_config_write(struct file *file, const char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	ssize_t err = 0;
+	void *mem;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	/* No partial writes. */
+	if (*ppos != 0)
+		return -EINVAL;
+
+	/* Duplicate user memory to safely parse protobuf. */
+	mem = memdup_user(buf, count);
+	if (IS_ERR(mem))
+		return PTR_ERR(mem);
+
+	err = csm_update_config_from_buffer(mem, count);
+	if (!err)
+		err = count;
+
+	kfree(mem);
+	return err;
+}
+
+static const struct file_operations csm_config_fops = {
+	.write = csm_config_write,
+};
+
+static void csm_enable(void)
+{
+	schema_ConfigurationRequest req = {};
+
+	/* Expect the lock to be held for write before this call. */
+	lockdep_assert_held_write(&csm_rwsem_config);
+
+	/* Default configuration */
+	req.container_config.enabled = true;
+	req.execute_config.enabled = true;
+	req.execute_config.argv_limit = UINT_MAX;
+	req.execute_config.envp_limit = UINT_MAX;
+	req.memexec_config.enabled = true;
+	csm_update_config(&req);
+}
+
+static void csm_disable(void)
+{
+	schema_ConfigurationRequest req = {};
+
+	/* Expect the lock to be held for write before this call. */
+	lockdep_assert_held_write(&csm_rwsem_config);
+
+	/* Zero configuration disable all collectors. */
+	csm_update_config(&req);
+	pr_info("disabled\n");
+}
+
+static ssize_t csm_enabled_read(struct file *file, char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	const char *str = csm_enabled ? "1\n" : "0\n";
+
+	return simple_read_from_buffer(buf, count, ppos, str, 2);
+}
+
+static ssize_t csm_enabled_write(struct file *file, const char __user *buf,
+				 size_t count, loff_t *ppos)
+{
+	bool enabled;
+	int err;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (count <= 0 || count > PAGE_SIZE || *ppos)
+		return -EINVAL;
+
+	err = kstrtobool_from_user(buf, count, &enabled);
+	if (err)
+		return err;
+
+	down_write(&csm_rwsem_config);
+
+	if (enabled)
+		csm_enable();
+	else
+		csm_disable();
+
+	up_write(&csm_rwsem_config);
+
+	return count;
+}
+
+static const struct file_operations csm_enabled_fops = {
+	.read = csm_enabled_read,
+	.write = csm_enabled_write,
+};
+
+static int csm_config_version_open(struct inode *inode, struct file *file)
+{
+	/* private_data is used to keep the latest config version read. */
+	file->private_data = (void*)-1;
+	return 0;
+}
+
+static ssize_t csm_config_version_read(struct file *file, char __user *buf,
+				       size_t count, loff_t *ppos)
+{
+	unsigned long version = config_version;
+	file->private_data = (void*)version;
+	return simple_read_from_buffer(buf, count, ppos, &version,
+				       sizeof(version));
+}
+
+static __poll_t csm_config_version_poll(struct file *file,
+					struct poll_table_struct *poll_tab)
+{
+	if ((unsigned long)file->private_data != config_version)
+		return EPOLLIN;
+	poll_wait(file, &config_wait, poll_tab);
+	if ((unsigned long)file->private_data != config_version)
+		return EPOLLIN;
+	return 0;
+}
+
+static const struct file_operations csm_config_version_fops = {
+	.open = csm_config_version_open,
+	.read = csm_config_version_read,
+	.poll = csm_config_version_poll,
+};
+
+static int csm_pipe_open(struct inode *inode, struct file *file)
+{
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	if (!csm_enabled)
+		return -EAGAIN;
+	return 0;
+}
+
+/* Similar to file_clone_open that is available only in 4.19 and up. */
+static inline struct file *pipe_clone_open(struct file *file)
+{
+	return dentry_open(&file->f_path, file->f_flags, file->f_cred);
+}
+
+/* Check if the pipe is still used, else recreate and dup it. */
+static struct file *csm_dup_pipe(void)
+{
+	long pipe_size = 1024 * PAGE_SIZE;
+	long actual_size;
+	struct file *pipes[2] = {NULL, NULL};
+	struct file *ret;
+	int err;
+
+	down_write(&csm_rwsem_pipe);
+
+	err = close_pipe_files(false);
+	if (err) {
+		ret = ERR_PTR(err);
+		csm_stats.pipe_already_opened++;
+		goto out;
+	}
+
+	err = create_pipe_files(pipes, O_NONBLOCK);
+	if (err) {
+		ret = ERR_PTR(err);
+		goto out;
+	}
+
+	/*
+	 * Try to increase the pipe size to 1024 pages, if there is not
+	 * enough memory, pipes will stay unchanged.
+	 */
+	actual_size = pipe_fcntl(pipes[0], F_SETPIPE_SZ, pipe_size);
+	if (actual_size != pipe_size)
+		pr_err("failed to resize pipe to 1024 pages, error: %ld, fallback to the default value\n",
+		       actual_size);
+
+	csm_user_read_pipe = pipes[0];
+	csm_user_write_pipe = pipes[1];
+
+	/* Clone the file so we can track if the reader is still used. */
+	ret = pipe_clone_open(csm_user_read_pipe);
+
+out:
+	up_write(&csm_rwsem_pipe);
+	return ret;
+}
+
+static ssize_t csm_pipe_read(struct file *file, char __user *buf,
+				       size_t count, loff_t *ppos)
+{
+	int fd;
+	ssize_t err;
+	struct file *local_pipe;
+
+	/* No partial reads. */
+	if (*ppos != 0)
+		return -EINVAL;
+
+	fd = get_unused_fd_flags(0);
+	if (fd < 0)
+		return fd;
+
+	local_pipe = csm_dup_pipe();
+	if (IS_ERR(local_pipe)) {
+		err = PTR_ERR(local_pipe);
+		local_pipe = NULL;
+		goto error;
+	}
+
+	err = simple_read_from_buffer(buf, count, ppos, &fd, sizeof(fd));
+	if (err < 0)
+		goto error;
+
+	if (err < sizeof(fd)) {
+		err = -EINVAL;
+		goto error;
+	}
+
+	/* Install the file descriptor when we know everything succeeded. */
+	fd_install(fd, local_pipe);
+
+	csm_enumerate_processes();
+
+	return err;
+
+error:
+	if (local_pipe)
+		fput(local_pipe);
+	put_unused_fd(fd);
+	return err;
+}
+
+
+static const struct file_operations csm_pipe_fops = {
+	.open = csm_pipe_open,
+	.read = csm_pipe_read,
+};
+
+static void set_container_decode_callbacks(schema_Container *container)
+{
+	container->pod_namespace.funcs.decode = pb_decode_string_field;
+	container->pod_name.funcs.decode = pb_decode_string_field;
+	container->container_name.funcs.decode = pb_decode_string_field;
+	container->container_image_uri.funcs.decode = pb_decode_string_field;
+	container->labels.funcs.decode = pb_decode_string_array;
+}
+
+static void set_container_encode_callbacks(schema_Container *container)
+{
+	container->pod_namespace.funcs.encode = pb_encode_string_field;
+	container->pod_name.funcs.encode = pb_encode_string_field;
+	container->container_name.funcs.encode = pb_encode_string_field;
+	container->container_image_uri.funcs.encode = pb_encode_string_field;
+	container->labels.funcs.encode = pb_encode_string_array;
+}
+
+static void free_container_callbacks_args(schema_Container *container)
+{
+	kfree(container->pod_namespace.arg);
+	kfree(container->pod_name.arg);
+	kfree(container->container_name.arg);
+	kfree(container->container_image_uri.arg);
+	kfree(container->labels.arg);
+}
+
+static ssize_t csm_container_write(struct file *file, const char __user *buf,
+				   size_t count, loff_t *ppos)
+{
+	ssize_t err = 0;
+	void *mem;
+	u64 cid;
+	pb_istream_t istream;
+	struct task_struct *task;
+	schema_ContainerReport report = {};
+	schema_Event event = {};
+	schema_Container *container;
+	char *uuid = NULL;
+
+	/* Notify that this collector is not yet enabled. */
+	if (!csm_container_enabled)
+		return -EAGAIN;
+
+	/* No partial writes. */
+	if (*ppos != 0)
+		return -EINVAL;
+
+	/* Duplicate user memory to safely parse protobuf. */
+	mem = memdup_user(buf, count);
+	if (IS_ERR(mem))
+		return PTR_ERR(mem);
+
+	/* Callback to decode string in protobuf. */
+	set_container_decode_callbacks(&report.container);
+
+	istream = pb_istream_from_buffer(mem, count);
+	if (!pb_decode(&istream, schema_ContainerReport_fields, &report)) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	/* Check protobuf is as expected */
+	if (report.pid == 0 ||
+	    report.container.container_id != 0) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	/* Find if the process id is linked to an existing container-id. */
+	rcu_read_lock();
+	task = find_task_by_pid_ns(report.pid, &init_pid_ns);
+	if (task) {
+		cid = audit_get_contid(task);
+		if (cid == AUDIT_CID_UNSET)
+			err = -ENOENT;
+	} else {
+		err = -ENOENT;
+	}
+	rcu_read_unlock();
+
+	if (err)
+		goto out;
+
+	uuid = kzalloc(PROCESS_UUID_SIZE, GFP_KERNEL);
+	if (!uuid)
+		goto out;
+
+	/* Provide the uuid for the top process of the container. */
+	err = get_process_uuid_by_pid(report.pid, uuid, PROCESS_UUID_SIZE);
+	if (err)
+		goto out;
+
+	/* Correct the container-id and feed the event to pipe */
+	report.has_container = true;
+	report.container.container_id = cid;
+	report.container.init_uuid.funcs.encode = pb_encode_uuid_field;
+	report.container.init_uuid.arg = uuid;
+	container = &event.event.container.container;
+	*container = report.container;
+
+	/* Use encode callback to generate the final proto. */
+	set_container_encode_callbacks(container);
+
+	event.which_event = schema_Event_container_tag;
+	event.event.container.has_container = true;
+
+	err = csm_sendeventproto(schema_Event_fields, &event);
+	if (!err)
+		err = count;
+
+out:
+	/* Free any allocated nanopb callback arguments. */
+	free_container_callbacks_args(&report.container);
+	kfree(uuid);
+	kfree(mem);
+	return err;
+}
+
+static const struct file_operations csm_container_fops = {
+	.write = csm_container_write,
+};
+
+static int csm_show_stats(struct seq_file *p, void *v)
+{
+	size_t i;
+
+	for (i = 0; i < ARRAY_SIZE(csm_stats_mapping); i++) {
+		seq_printf(p, "%s:\t%zu\n",
+			   csm_stats_mapping[i].key,
+			   *csm_stats_mapping[i].value);
+	}
+
+	return 0;
+}
+
+static int csm_stats_open(struct inode *inode, struct file *file)
+{
+	size_t i, size = 1; /* Start at one for the null byte. */
+
+	for (i = 0; i < ARRAY_SIZE(csm_stats_mapping); i++) {
+		/*
+		 * Calculate the maximum length:
+		 * - Length of the key
+		 * - 3 additional chars :\t\n
+		 * - longest unsigned 64-bit integer.
+		 */
+		size += strlen(csm_stats_mapping[i].key)
+			+ 3 + sizeof("18446744073709551615");
+	}
+
+	return single_open_size(file, csm_show_stats, NULL, size);
+}
+
+static const struct file_operations csm_stats_fops = {
+	.open		= csm_stats_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int csm_setxattr(struct dentry *dentry, const char *name,
+			const void *value, size_t size, int flags)
+{
+	if (csm_enabled && !strcmp(name, XATTR_SECURITY_CSM))
+		return -EPERM;
+	return 0;
+}
+
+static struct security_hook_list csm_hooks[] __lsm_ro_after_init = {
+	/* Track process execution. */
+	LSM_HOOK_INIT(bprm_check_security, csm_bprm_check_security),
+	LSM_HOOK_INIT(task_post_alloc, csm_task_post_alloc),
+	LSM_HOOK_INIT(task_exit, csm_task_exit),
+
+	/* Track memory execution */
+	LSM_HOOK_INIT(file_mprotect, csm_mprotect),
+	LSM_HOOK_INIT(mmap_file, csm_mmap_file),
+
+	/* Track file modification provenance. */
+	LSM_HOOK_INIT(file_pre_free_security, csm_file_pre_free),
+
+	/* Block modyfing csm xattr. */
+	LSM_HOOK_INIT(inode_setxattr, csm_setxattr),
+};
+
+static int __init csm_init(void)
+{
+	int err;
+
+	if (cmdline_boot_disabled)
+		return 0;
+
+	if (cmdline_boot_vsock_enabled)
+		pr_debug("vsock is deprecated, but was enabled at boot\n");
+
+	csm_dir = securityfs_create_dir("container_monitor", NULL);
+	if (IS_ERR(csm_dir)) {
+		err = PTR_ERR(csm_dir);
+		goto error;
+	}
+
+	csm_enabled_file = securityfs_create_file("enabled", 0644, csm_dir,
+						  NULL, &csm_enabled_fops);
+	if (IS_ERR(csm_enabled_file)) {
+		err = PTR_ERR(csm_enabled_file);
+		goto error_rmdir;
+	}
+
+	csm_container_file = securityfs_create_file("container", 0200, csm_dir,
+						  NULL, &csm_container_fops);
+	if (IS_ERR(csm_container_file)) {
+		err = PTR_ERR(csm_container_file);
+		goto error_rm_enabled;
+	}
+
+	csm_config_vers_file = securityfs_create_file("config_version", 0400,
+						      csm_dir, NULL,
+						      &csm_config_version_fops);
+	if (IS_ERR(csm_config_vers_file)) {
+		err = PTR_ERR(csm_config_vers_file);
+		goto error_rm_container;
+	}
+
+	if (cmdline_boot_config_enabled) {
+		csm_config_file = securityfs_create_file("config", 0200,
+							 csm_dir, NULL,
+							 &csm_config_fops);
+		if (IS_ERR(csm_config_file)) {
+			err = PTR_ERR(csm_config_file);
+			goto error_rm_config_vers;
+		}
+	}
+
+	if (cmdline_boot_pipe_enabled) {
+		csm_pipe_file = securityfs_create_file("pipe", 0400, csm_dir,
+						       NULL, &csm_pipe_fops);
+		if (IS_ERR(csm_pipe_file)) {
+			err = PTR_ERR(csm_pipe_file);
+			goto error_rm_config;
+		}
+	}
+
+	csm_stats_file = securityfs_create_file("stats", 0400, csm_dir,
+						 NULL, &csm_stats_fops);
+	if (IS_ERR(csm_stats_file)) {
+		err = PTR_ERR(csm_stats_file);
+		goto error_rm_pipe;
+	}
+
+	pr_debug("created securityfs control files\n");
+
+	security_add_hooks(csm_hooks, ARRAY_SIZE(csm_hooks), "csm");
+	pr_debug("registered hooks\n");
+
+	/* Off-by-default, only used for testing images. */
+	if (cmdline_default_enabled) {
+		down_write(&csm_rwsem_config);
+		csm_enable();
+		up_write(&csm_rwsem_config);
+	}
+
+	return 0;
+
+error_rm_pipe:
+	if (cmdline_boot_pipe_enabled)
+		securityfs_remove(csm_pipe_file);
+error_rm_config:
+	if (cmdline_boot_config_enabled)
+		securityfs_remove(csm_config_file);
+error_rm_config_vers:
+	securityfs_remove(csm_config_vers_file);
+error_rm_container:
+	securityfs_remove(csm_container_file);
+error_rm_enabled:
+	securityfs_remove(csm_enabled_file);
+error_rmdir:
+	securityfs_remove(csm_dir);
+error:
+	pr_warn("fs initialization error: %d", err);
+	return err;
+}
+
+late_initcall(csm_init);

diff --git a/security/container/monitor.h b/security/container/monitor.h
new file mode 100644
index 0000000..221d7f5
--- /dev/null
+++ b/security/container/monitor.h

@@ -0,0 +1,109 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Container Security Monitor module
+ *
+ * Copyright (c) 2018 Google, Inc
+ */
+
+#define pr_fmt(fmt)	"container-security-monitor: " fmt
+
+#include <linux/kernel.h>
+#include <linux/security.h>
+#include <linux/fs.h>
+#include <linux/rwsem.h>
+#include <linux/binfmts.h>
+#include <linux/xattr.h>
+#include <config.pb.h>
+#include <event.pb.h>
+#include <pb_encode.h>
+#include <pb_decode.h>
+
+#include "monitoring_protocol.h"
+
+/* Part of the CSM configuration response. */
+#define CSM_VERSION 1
+
+/* protects csm_*_enabled and configurations. */
+extern struct rw_semaphore csm_rwsem_config;
+
+/*
+ * Is monitoring enabled? Defaults to disabled.
+ * These variables might be used as gates without locking (as processor ensures
+ * valid proper access for native scalar values) so it can bail quickly.
+ */
+extern bool csm_enabled;
+extern bool csm_execute_enabled;
+extern bool csm_memexec_enabled;
+
+/* Configuration options for execute collector. */
+struct execute_config {
+	size_t argv_limit;
+	size_t envp_limit;
+	char *envp_allowlist;
+};
+
+extern struct execute_config csm_execute_config;
+
+/* pipe to forward events to user-mode. */
+extern struct rw_semaphore csm_rwsem_pipe;
+extern struct file *csm_user_write_pipe;
+
+/* Stats on LSM events. */
+struct container_stats {
+	size_t proto_encoding_failed;
+	size_t event_writing_failed;
+	size_t workqueue_failed;
+	size_t size_picking_failed;
+	size_t pipe_already_opened;
+};
+
+extern struct container_stats csm_stats;
+
+/* Streams file numbers are unknown from the kernel */
+#define STDIN_FILENO	0
+#define STDOUT_FILENO	1
+#define STDERR_FILENO	2
+
+/* security attribute for file provenance. */
+#define XATTR_SECURITY_CSM XATTR_SECURITY_PREFIX "csm"
+
+/* monitor functions */
+int csm_update_config_from_buffer(void *data, size_t size);
+
+/* send event to userland */
+int csm_sendeventproto(const pb_msgdesc_t *fields, schema_Event *event);
+
+/* process events functions */
+int csm_bprm_check_security(struct linux_binprm *bprm);
+void csm_task_exit(struct task_struct *task);
+void csm_task_post_alloc(struct task_struct *task);
+int get_process_uuid_by_pid(pid_t pid_nr, char *buffer, size_t size);
+
+/* memory execution events functions */
+int csm_mprotect(struct vm_area_struct *vma, unsigned long reqprot,
+					  unsigned long prot);
+int csm_mmap_file(struct file *file, unsigned long reqprot,
+				  unsigned long prot, unsigned long flags);
+
+/* Tracking of file modification provenance. */
+void csm_file_pre_free(struct file *file);
+
+/* nano functions */
+bool pb_encode_string_field(pb_ostream_t *stream, const pb_field_t *field,
+			    void * const *arg);
+bool pb_decode_string_field(pb_istream_t *stream, const pb_field_t *field,
+		      void **arg);
+ssize_t pb_encode_string_field_limit(pb_ostream_t *stream,
+				     const pb_field_t *field,
+				     void * const *arg, size_t limit);
+bool pb_encode_string_array(pb_ostream_t *stream, const pb_field_t *field,
+			    void * const *arg);
+bool pb_decode_string_array(pb_istream_t *stream, const pb_field_t *field,
+			    void **arg);
+bool pb_encode_uuid_field(pb_ostream_t *stream, const pb_field_t *field,
+			  void * const *arg);
+bool pb_encode_ip4(pb_ostream_t *stream, const pb_field_t *field,
+		   void * const *arg);
+bool pb_encode_ip6(pb_ostream_t *stream, const pb_field_t *field,
+		   void * const *arg);
+

diff --git a/security/container/monitoring_protocol.h b/security/container/monitoring_protocol.h
new file mode 100644
index 0000000..dbdfc9c
--- /dev/null
+++ b/security/container/monitoring_protocol.h

@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+
+/* Container security monitoring protocol definitions */
+
+#include <linux/types.h>
+
+enum csm_msgtype {
+	CSM_MSG_TYPE_HEARTBEAT = 1,
+	CSM_MSG_EVENT_PROTO = 2,
+	CSM_MSG_CONFIG_REQUEST_PROTO = 3,
+	CSM_MSG_CONFIG_RESPONSE_PROTO = 4,
+};
+
+struct csm_msg_hdr {
+	__le32 msg_type;
+	__le32 msg_length;
+};
+
+/* The process uuid is a 128-bits identifier */
+#define PROCESS_UUID_SIZE 16
+
+/* The entire structure forms the collision domain. */
+union process_uuid {
+	struct {
+		__u32 machineid;
+		__u64 start_time;
+		__u32 tgid;
+	} __attribute__((packed));
+	__u8 data[PROCESS_UUID_SIZE];
+};

diff --git a/security/container/pb.c b/security/container/pb.c
new file mode 100644
index 0000000..1cc7ecf
--- /dev/null
+++ b/security/container/pb.c

@@ -0,0 +1,174 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Container Security Monitor module
+ *
+ * Copyright (c) 2018 Google, Inc
+ */
+
+#include "monitor.h"
+
+#include <linux/string.h>
+#include <net/tcp.h>
+#include <net/ipv6.h>
+
+bool pb_encode_string_field(pb_ostream_t *stream, const pb_field_t *field,
+			    void * const *arg)
+{
+	const uint8_t *str = (const uint8_t *)*arg;
+
+	/* If the string is not set, skip this string. */
+	if (!str)
+		return true;
+
+	if (!pb_encode_tag_for_field(stream, field))
+		return false;
+
+	return pb_encode_string(stream, str, strlen(str));
+}
+
+bool pb_decode_string_field(pb_istream_t *stream, const pb_field_t *field,
+			    void **arg)
+{
+	size_t size;
+	void *data;
+
+	*arg = NULL;
+
+	size = stream->bytes_left;
+
+	/* Ensure a null-byte at the end */
+	if (size + 1 < size)
+		return false;
+
+	data = kzalloc(size + 1, GFP_KERNEL);
+	if (!data)
+		return false;
+
+	if (!pb_read(stream, data, size)) {
+		kfree(data);
+		return false;
+	}
+
+	*arg = data;
+
+	return true;
+}
+
+bool pb_encode_string_array(pb_ostream_t *stream, const pb_field_t *field,
+			    void * const *arg)
+{
+	char *strs = (char *)*arg;
+
+	/* If the string array is not set, skip this string array. */
+	if (!strs)
+		return true;
+
+	do {
+		if (!pb_encode_string_field(stream, field,
+					    (void * const *) &strs))
+			return false;
+
+		strs += strlen(strs) + 1;
+	} while (*strs != 0);
+
+	return true;
+}
+
+/* Limit the encoded string size and return how many characters were added. */
+ssize_t pb_encode_string_field_limit(pb_ostream_t *stream,
+				     const pb_field_t *field,
+				     void * const *arg, size_t limit)
+{
+	char *str = (char *)*arg;
+	size_t length;
+
+	/* If the string is not set, skip this string. */
+	if (!str)
+		return 0;
+
+	if (!pb_encode_tag_for_field(stream, field))
+		return -EINVAL;
+
+	length = strlen(str);
+	if (length > limit)
+		length = limit;
+
+	if (!pb_encode_string(stream, (uint8_t *)str, length))
+		return -EINVAL;
+
+	return length;
+}
+
+bool pb_decode_string_array(pb_istream_t *stream, const pb_field_t *field,
+			    void **arg)
+{
+	size_t needed, used = 0;
+	char *data, *strs;
+
+	/* String length, and two null-bytes for the end of the list. */
+	needed = stream->bytes_left + 2;
+	if (needed < stream->bytes_left)
+		return false;
+
+	if (*arg) {
+		/* Calculate used space from the current list. */
+		strs = (char *)*arg;
+		do {
+			used += strlen(strs + used) + 1;
+		} while (strs[used] != 0);
+
+		if (used + needed < needed)
+			return false;
+	}
+
+	data = krealloc(*arg, used + needed, GFP_KERNEL);
+	if (!data)
+		return false;
+
+	/* Will always be freed by the caller */
+	*arg = data;
+
+	/* Reset the new part of the buffer. */
+	memset(data + used, 0, needed);
+
+	/* Read what's in the stream buffer only. */
+	if (!pb_read(stream, data + used, stream->bytes_left))
+		return false;
+
+	return true;
+}
+
+bool pb_encode_fixed_string(pb_ostream_t *stream, const pb_field_t *field,
+			    const uint8_t *data, size_t length)
+{
+	/* If the data is not set, skip this string. */
+	if (!data)
+		return true;
+
+	if (!pb_encode_tag_for_field(stream, field))
+		return false;
+
+	return pb_encode_string(stream, data, length);
+}
+
+
+bool pb_encode_uuid_field(pb_ostream_t *stream, const pb_field_t *field,
+			  void * const *arg)
+{
+	return pb_encode_fixed_string(stream, field, (const uint8_t *)*arg,
+				      PROCESS_UUID_SIZE);
+}
+
+bool pb_encode_ip4(pb_ostream_t *stream, const pb_field_t *field,
+		   void * const *arg)
+{
+	return pb_encode_fixed_string(stream, field, (const uint8_t *)*arg,
+				      sizeof(struct in_addr));
+}
+
+bool pb_encode_ip6(pb_ostream_t *stream, const pb_field_t *field,
+		   void * const *arg)
+{
+	return pb_encode_fixed_string(stream, field, (const uint8_t *)*arg,
+				      sizeof(struct in6_addr));
+}

diff --git a/security/container/pipe.c b/security/container/pipe.c
new file mode 100644
index 0000000..e78ddde
--- /dev/null
+++ b/security/container/pipe.c

@@ -0,0 +1,218 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Container Security Monitor module
+ *
+ * Copyright (c) 2018 Google, Inc
+ */
+
+#include "monitor.h"
+
+#include <linux/pipe_fs_i.h>
+#include <linux/printk.h>
+#include <linux/ratelimit.h>
+#include <linux/uio.h>
+#include <linux/workqueue.h>
+
+/* csm protobuf work */
+static void csm_sendmsg_pipe_handler(struct work_struct *work);
+
+/* csm message work container */
+struct msg_work_data {
+	struct work_struct msg_work;
+	size_t pos_bytes_written;
+	char msg[];
+};
+
+/* Mutex to ensure sequential dumping of protos */
+static DEFINE_MUTEX(protodump);
+
+static ssize_t csm_user_pipe_write(struct kvec *vecs, size_t vecs_size,
+				   size_t total_length)
+{
+	ssize_t perr = 0;
+	struct iov_iter io = { };
+	loff_t pos = 0;
+	struct pipe_inode_info *pipe;
+	unsigned int readers;
+
+	if (!csm_user_write_pipe)
+		return 0;
+
+	down_read(&csm_rwsem_pipe);
+
+	if (csm_user_write_pipe == NULL)
+		goto end;
+
+	/* The pipe info is the same for reader and write files. */
+	pipe = get_pipe_info(csm_user_write_pipe, false);
+
+	/* If nobody is listening, don't write events. */
+	readers = READ_ONCE(pipe->readers);
+	if (readers <= 1) {
+		WARN_ON(readers == 0);
+		goto end;
+	}
+
+
+	iov_iter_kvec(&io, WRITE, vecs, vecs_size, total_length);
+
+	file_start_write(csm_user_write_pipe);
+	perr = vfs_iter_write(csm_user_write_pipe, &io, &pos, 0);
+	file_end_write(csm_user_write_pipe);
+
+end:
+	up_read(&csm_rwsem_pipe);
+	return perr;
+}
+
+static int csm_sendmsg(int type, const void *buf, size_t len)
+{
+	struct csm_msg_hdr hdr = {
+		.msg_type = cpu_to_le32(type),
+		.msg_length = cpu_to_le32(sizeof(hdr) + len),
+	};
+	struct kvec vecs[] = {
+		{
+			.iov_base = &hdr,
+			.iov_len = sizeof(hdr),
+		}, {
+			.iov_base = (void *)buf,
+			.iov_len = len,
+		}
+	};
+	ssize_t perr;
+
+	perr = csm_user_pipe_write(vecs, ARRAY_SIZE(vecs),
+				   le32_to_cpu(hdr.msg_length));
+	if (perr < 0) {
+		pr_warn_ratelimited("vfs_iter_write error (msg_type=%d, msg_length=%u): %zd\n",
+				    type, le32_to_cpu(hdr.msg_length), perr);
+		csm_stats.event_writing_failed++;
+	}
+
+	return perr;
+}
+
+static bool csm_get_expected_size(size_t *size, const pb_msgdesc_t *fields,
+				    const void *src_struct)
+{
+	schema_Event *event;
+
+	if (fields != schema_Event_fields)
+		goto other;
+
+	/* Size above 99% of the 100 containers tested running k8s. */
+	event = (schema_Event *)src_struct;
+	switch (event->which_event) {
+	case schema_Event_execute_tag:
+		*size = 3344;
+		return true;
+	case schema_Event_memexec_tag:
+		*size = 176;
+		return true;
+	case schema_Event_clone_tag:
+		*size = 50;
+		return true;
+	case schema_Event_exit_tag:
+		*size = 30;
+		return true;
+	}
+
+other:
+	/* If unknown, do the pre-computation. */
+	return pb_get_encoded_size(size, fields, src_struct);
+}
+
+static struct msg_work_data *csm_encodeproto(size_t size,
+					     const pb_msgdesc_t *fields,
+					     const void *src_struct)
+{
+	pb_ostream_t pos;
+	struct msg_work_data *wd;
+	size_t total;
+
+	total = size + sizeof(*wd);
+	if (total < size)
+		return ERR_PTR(-EINVAL);
+
+	wd = kmalloc(total, GFP_KERNEL);
+	if (!wd)
+		return ERR_PTR(-ENOMEM);
+
+	pos = pb_ostream_from_buffer(wd->msg, size);
+	if (!pb_encode(&pos, fields, src_struct)) {
+		kfree(wd);
+		return ERR_PTR(-EINVAL);
+	}
+
+	INIT_WORK(&wd->msg_work, csm_sendmsg_pipe_handler);
+	wd->pos_bytes_written = pos.bytes_written;
+	return wd;
+}
+
+static int csm_sendproto(int type, const pb_msgdesc_t *fields,
+			 const void *src_struct)
+{
+	int err = 0;
+	size_t size, previous_size;
+	struct msg_work_data *wd;
+
+	/* Use the expected size first. */
+	if (!csm_get_expected_size(&size, fields, src_struct))
+		return -EINVAL;
+
+	wd = csm_encodeproto(size, fields, src_struct);
+	if (IS_ERR(wd)) {
+		/* If it failed, retry with the exact size. */
+		csm_stats.size_picking_failed++;
+		previous_size = size;
+
+		if (!pb_get_encoded_size(&size, fields, src_struct))
+			return -EINVAL;
+
+		wd = csm_encodeproto(size, fields, src_struct);
+		if (IS_ERR(wd)) {
+			csm_stats.proto_encoding_failed++;
+			return PTR_ERR(wd);
+		}
+
+		pr_debug("size picking failed %lu vs %lu\n", previous_size,
+			 size);
+	}
+
+	/* The work handler takes care of cleanup, if successfully scheduled. */
+	if (likely(schedule_work(&wd->msg_work)))
+		return 0;
+
+	csm_stats.workqueue_failed++;
+	pr_err_ratelimited("Sent msg to workqueue unsuccessfully (assume dropped).\n");
+
+	kfree(wd);
+	return err;
+}
+
+static void csm_sendmsg_pipe_handler(struct work_struct *work)
+{
+	int err;
+	int type = CSM_MSG_EVENT_PROTO;
+	struct msg_work_data *wd = container_of(work, struct msg_work_data,
+						msg_work);
+
+	err = csm_sendmsg(type, wd->msg, wd->pos_bytes_written);
+	if (err < 0)
+		pr_err_ratelimited("csm_sendmsg failed in work handler %s\n",
+				   __func__);
+
+	kfree(wd);
+}
+
+int csm_sendeventproto(const pb_msgdesc_t *fields, schema_Event *event)
+{
+	/* Last check before generating and sending an event. */
+	if (!csm_enabled)
+		return -ENOTSUPP;
+
+	event->timestamp = ktime_get_real_ns();
+
+	return csm_sendproto(CSM_MSG_EVENT_PROTO, fields, event);
+}

diff --git a/security/container/process.c b/security/container/process.c
new file mode 100644
index 0000000..fbb2944
--- /dev/null
+++ b/security/container/process.c

@@ -0,0 +1,1167 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Container Security Monitor module
+ *
+ * Copyright (c) 2018 Google, Inc
+ */
+
+#include "monitor.h"
+
+#include <linux/atomic.h>
+#include <linux/audit.h>
+#include <linux/file.h>
+#include <linux/highmem.h>
+#include <linux/mempool.h>
+#include <linux/mm.h>
+#include <linux/mount.h>
+#include <linux/notifier.h>
+#include <linux/net.h>
+#include <linux/path.h>
+#include <linux/pid.h>
+#include <linux/pid_namespace.h>
+#include <linux/random.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/task.h>
+#include <linux/slab.h>
+#include <linux/socket.h>
+#include <linux/timekeeping.h>
+#include <linux/vmalloc.h>
+#include <linux/workqueue.h>
+#include <linux/xattr.h>
+#include <net/ipv6.h>
+#include <net/sock.h>
+#include <net/tcp.h>
+#include <overlayfs/overlayfs.h>
+#include <uapi/linux/magic.h>
+#include <uapi/asm/mman.h>
+
+/* Configuration options for execute collector. */
+struct execute_config csm_execute_config;
+
+/* unique atomic value for the machine boot instance */
+static atomic_t machine_rand = ATOMIC_INIT(0);
+
+/* sequential container identifier */
+static atomic_t contid = ATOMIC_INIT(0);
+
+/* Generation id for each enumeration invocation. */
+static atomic_t enumeration_count = ATOMIC_INIT(0);
+
+struct file_provenance {
+	/* pid of the process doing the first write. */
+	pid_t tgid;
+	/* start_time of the process to uniquely identify it. */
+	u64 start_time;
+};
+
+struct csm_enumerate_processes_work_data {
+	struct work_struct work;
+	int enumeration_count;
+};
+
+static void *kmap_argument_stack(struct linux_binprm *bprm, void **ctx)
+{
+	char *argv;
+	int err;
+	unsigned long i, pos, count;
+	void *map;
+	struct page *page;
+
+	/* vma_pages() returns the number of pages reserved for the stack */
+	count = vma_pages(bprm->vma);
+
+	if (likely(count == 1)) {
+		err = get_user_pages_remote(bprm->mm, bprm->p, 1,
+					    FOLL_FORCE, &page, NULL, NULL);
+		if (err != 1)
+			return NULL;
+
+		argv = kmap(page);
+		*ctx = page;
+	} else {
+		/*
+		 * If more than one pages is needed, copy all of them to a set
+		 * of pages. Parsing the argument across kmap pages in different
+		 * addresses would make it impractical.
+		 */
+		argv = vmalloc(count * PAGE_SIZE);
+		if (!argv)
+			return NULL;
+
+		for (i = 0; i < count; i++) {
+			pos = ALIGN_DOWN(bprm->p, PAGE_SIZE) + i * PAGE_SIZE;
+			err = get_user_pages_remote(bprm->mm, pos, 1,
+						    FOLL_FORCE, &page, NULL,
+						    NULL);
+			if (err <= 0) {
+				vfree(argv);
+				return NULL;
+			}
+
+			map = kmap(page);
+			memcpy(argv + i * PAGE_SIZE, map, PAGE_SIZE);
+			kunmap(page);
+			put_page(page);
+		}
+		*ctx = bprm;
+	}
+
+	return argv;
+}
+
+static void kunmap_argument_stack(struct linux_binprm *bprm, void *addr,
+				  void *ctx)
+{
+	struct page *page;
+
+	if (!addr)
+		return;
+
+	if (likely(vma_pages(bprm->vma) == 1)) {
+		page = (struct page *)ctx;
+		kunmap(page);
+		put_page(ctx);
+	} else {
+		vfree(addr);
+	}
+}
+
+static char *find_array_next_entry(char *array, unsigned long *offset,
+				   unsigned long end)
+{
+	char *entry;
+	unsigned long off = *offset;
+
+	if (off >= end)
+		return NULL;
+
+	/* Check the entry is null terminated and in bound */
+	entry = array + off;
+	while (array[off]) {
+		if (++off >= end)
+			return NULL;
+	}
+
+	/* Pass the null byte for the next iteration */
+	*offset = off + 1;
+
+	return entry;
+}
+
+struct string_arr_ctx {
+	struct linux_binprm *bprm;
+	void *stack;
+};
+
+static size_t get_config_limit(size_t *config_ptr)
+{
+	lockdep_assert_held_read(&csm_rwsem_config);
+
+	/*
+	 * If execute is not enabled, do not capture arguments.
+	 * The event proto won't be sent anyway.
+	 */
+	if (!csm_execute_enabled)
+		return 0;
+
+	return *config_ptr;
+}
+
+static bool encode_current_argv(pb_ostream_t *stream, const pb_field_t *field,
+				void * const *arg)
+{
+	struct string_arr_ctx *ctx = (struct string_arr_ctx *)*arg;
+	int i;
+	struct linux_binprm *bprm = ctx->bprm;
+	unsigned long offset = bprm->p % PAGE_SIZE;
+	unsigned long end = vma_pages(bprm->vma) * PAGE_SIZE;
+	char *argv = ctx->stack;
+	char *entry;
+	size_t limit, used = 0;
+	ssize_t ret;
+
+	limit = get_config_limit(&csm_execute_config.argv_limit);
+	if (!limit)
+		return true;
+
+	for (i = 0; i < bprm->argc; i++) {
+		entry = find_array_next_entry(argv, &offset, end);
+		if (!entry)
+			return false;
+
+		ret = pb_encode_string_field_limit(stream, field,
+						   (void * const *)&entry,
+						   limit - used);
+		if (ret < 0)
+			return false;
+
+		used += ret;
+
+		if (used >= limit)
+			break;
+	}
+
+	return true;
+}
+
+static bool check_envp_allowlist(char *envp)
+{
+	bool ret = false;
+	char *strs, *equal;
+	size_t str_size, equal_pos;
+
+	/* If execute is not enabled, skip all. */
+	if (!csm_execute_enabled)
+		goto out;
+
+	/* No filter, allow all. */
+	strs = csm_execute_config.envp_allowlist;
+	if (!strs) {
+		ret = true;
+		goto out;
+	}
+
+	/*
+	 * Identify the key=value separation.
+	 * If none exists use the whole string as a key.
+	 */
+	equal = strchr(envp, '=');
+	equal_pos = equal ? (equal - envp) : strlen(envp);
+
+	/* Default to skip if no match found. */
+	ret = false;
+
+	do {
+		str_size = strlen(strs);
+
+		/*
+		 * If the filter length align with the key value equal sign,
+		 * it might be a match, check the key value.
+		 */
+		if (str_size == equal_pos &&
+		    !strncmp(strs, envp, str_size)) {
+			ret = true;
+			goto out;
+		}
+
+		strs += str_size + 1;
+	} while (*strs != 0);
+
+out:
+	return ret;
+}
+
+static bool encode_current_envp(pb_ostream_t *stream, const pb_field_t *field,
+				void * const *arg)
+{
+	struct string_arr_ctx *ctx = (struct string_arr_ctx *)*arg;
+	int i;
+	struct linux_binprm *bprm = ctx->bprm;
+	unsigned long offset = bprm->p % PAGE_SIZE;
+	unsigned long end = vma_pages(bprm->vma) * PAGE_SIZE;
+	char *argv = ctx->stack;
+	char *entry;
+	size_t limit, used = 0;
+	ssize_t ret;
+
+	limit = get_config_limit(&csm_execute_config.envp_limit);
+	if (!limit)
+		return true;
+
+	/* Skip arguments */
+	for (i = 0; i < bprm->argc; i++) {
+		if (!find_array_next_entry(argv, &offset, end))
+			return false;
+	}
+
+	for (i = 0; i < bprm->envc; i++) {
+		entry = find_array_next_entry(argv, &offset, end);
+		if (!entry)
+			return false;
+
+		if (!check_envp_allowlist(entry))
+			continue;
+
+		ret = pb_encode_string_field_limit(stream, field,
+						   (void * const *)&entry,
+						   limit - used);
+		if (ret < 0)
+			return false;
+
+		used += ret;
+
+		if (used >= limit)
+			break;
+	}
+
+	return true;
+}
+
+static bool is_overlayfs_mounted(struct file *file)
+{
+	struct vfsmount *mnt;
+	struct super_block *mnt_sb;
+
+	mnt = file->f_path.mnt;
+	if (mnt == NULL)
+		return false;
+
+	mnt_sb = mnt->mnt_sb;
+	if (mnt_sb == NULL || mnt_sb->s_magic != OVERLAYFS_SUPER_MAGIC)
+		return false;
+
+	return true;
+}
+
+/*
+ * Before the process starts, identify a possible container by checking if the
+ * task is on a pid namespace and the target file is using an overlayfs mounting
+ * point. This check is valid for COS and GKE but not all existing containers.
+ */
+static bool is_possible_container(struct task_struct *task,
+				  struct file *file)
+{
+	if (task_active_pid_ns(task) == &init_pid_ns)
+		return false;
+
+	return is_overlayfs_mounted(file);
+}
+
+/*
+ * Generates a random identifier for this boot instance.
+ * This identifier is generated only when needed to increase the entropy
+ * available compared to doing it at early boot.
+ */
+static u32 get_machine_id(void)
+{
+	int machineid, old;
+
+	machineid = atomic_read(&machine_rand);
+
+	if (unlikely(machineid == 0)) {
+		machineid = (int)get_random_int();
+		if (machineid == 0)
+			machineid = 1;
+		old = atomic_cmpxchg(&machine_rand, 0, machineid);
+
+		/* If someone beat us, use their value. */
+		if (old != 0)
+			machineid = old;
+	}
+
+	return (u32)machineid;
+}
+
+/*
+ * Generate a 128-bit unique identifier for the process by appending:
+ *  - A machine identifier unique per boot.
+ *  - The start time of the process in nanoseconds.
+ *  - The tgid for the set of threads in a process.
+ */
+static int get_process_uuid(struct task_struct *task, char *buffer, size_t size)
+{
+	union process_uuid *id = (union process_uuid *)buffer;
+
+	memset(buffer, 0, size);
+
+	if (WARN_ON(size < PROCESS_UUID_SIZE))
+		return -EINVAL;
+
+	id->machineid = get_machine_id();
+	id->start_time = ktime_mono_to_real(task->group_leader->start_time);
+	id->tgid = task_tgid_nr(task);
+
+	return 0;
+}
+
+int get_process_uuid_by_pid(pid_t pid_nr, char *buffer, size_t size)
+{
+	int err;
+	struct task_struct *task = NULL;
+
+	rcu_read_lock();
+	task = find_task_by_pid_ns(pid_nr, &init_pid_ns);
+	if (!task) {
+		err = -ENOENT;
+		goto out;
+	}
+	err = get_process_uuid(task, buffer, size);
+out:
+	rcu_read_unlock();
+	return err;
+}
+
+static int get_process_uuid_from_xattr(struct file *file, char *buffer,
+				       size_t size)
+{
+	struct dentry *dentry;
+	int err;
+	struct file_provenance prov;
+	union process_uuid *id = (union process_uuid *)buffer;
+
+	memset(buffer, 0, size);
+
+	if (WARN_ON(size < PROCESS_UUID_SIZE))
+		return -EINVAL;
+
+	/* The file is part of overlayfs on the upper layer. */
+	if (!is_overlayfs_mounted(file))
+		return -ENODATA;
+
+	dentry = ovl_dentry_upper(file->f_path.dentry);
+	if (!dentry)
+		return -ENODATA;
+
+	err = __vfs_getxattr(dentry, dentry->d_inode,
+			     XATTR_SECURITY_CSM, &prov, sizeof(prov));
+	/* returns -ENODATA if the xattr does not exist. */
+	if (err < 0)
+		return err;
+	if (err != sizeof(prov)) {
+		pr_err("unexpected size for xattr: %zu -> %d\n",
+		       size, err);
+		return -ENODATA;
+	}
+
+	id->machineid = get_machine_id();
+	id->start_time = prov.start_time;
+	id->tgid = prov.tgid;
+	return 0;
+}
+
+u64 csm_set_contid(struct task_struct *task)
+{
+	u64 cid;
+	struct pid_namespace *ns;
+
+	ns = task_active_pid_ns(task);
+	if (WARN_ON(!task->audit) || WARN_ON(!ns))
+		return AUDIT_CID_UNSET;
+
+	cid = atomic_inc_return(&contid);
+	task->audit->contid = cid;
+
+	/*
+	 * If the namespace container-id is not set, use the one assigned
+	 * to the first process created.
+	 */
+	cmpxchg(&ns->cid, 0, cid);
+	return cid;
+}
+
+u64 csm_get_ns_contid(struct pid_namespace *ns)
+{
+	if (!ns || !ns->cid)
+		return AUDIT_CID_UNSET;
+
+	return ns->cid;
+}
+
+union ip_data {
+	struct in_addr ip4;
+	struct in6_addr ip6;
+};
+
+struct file_data {
+	void *allocated;
+	union ip_data local;
+	union ip_data remote;
+	char modified_uuid[PROCESS_UUID_SIZE];
+};
+
+static void free_file_data(struct file_data *fdata)
+{
+	free_page((unsigned long)fdata->allocated);
+	fdata->allocated = NULL;
+}
+
+static void fill_socket_description(struct sockaddr_storage *saddr,
+				   union ip_data *idata,
+				   schema_SocketIp *schema_socketip)
+{
+	struct sockaddr_in *sin4 = (struct sockaddr_in *)saddr;
+	struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)saddr;
+
+	schema_socketip->family = saddr->ss_family;
+
+	switch (saddr->ss_family) {
+	case AF_INET:
+		schema_socketip->port = ntohs(sin4->sin_port);
+		idata->ip4 = sin4->sin_addr;
+		schema_socketip->ip.funcs.encode = pb_encode_ip4;
+		schema_socketip->ip.arg = &idata->ip4;
+		break;
+	case AF_INET6:
+		schema_socketip->port = ntohs(sin6->sin6_port);
+		idata->ip6 = sin6->sin6_addr;
+		schema_socketip->ip.funcs.encode = pb_encode_ip6;
+		schema_socketip->ip.arg = &idata->ip6;
+		break;
+	}
+}
+
+static int fill_file_overlayfs(struct file *file, schema_File *schema_file,
+			       struct file_data *fdata)
+{
+	struct dentry *dentry;
+	int err;
+	schema_Overlay *overlayfs;
+
+	/* If not an overlayfs superblock, done. */
+	if (!is_overlayfs_mounted(file))
+		return 0;
+
+	dentry = file->f_path.dentry;
+	schema_file->which_filesystem = schema_File_overlayfs_tag;
+	overlayfs = &schema_file->filesystem.overlayfs;
+	overlayfs->lower_layer = ovl_dentry_lower(dentry);
+	overlayfs->upper_layer = ovl_dentry_upper(dentry);
+
+	err = get_process_uuid_from_xattr(file, fdata->modified_uuid,
+					  sizeof(fdata->modified_uuid));
+	/* If there is no xattr, just skip the modified_uuid field. */
+	if (err == -ENODATA)
+		return 0;
+	if (err < 0)
+		return err;
+
+	overlayfs->modified_uuid.funcs.encode = pb_encode_uuid_field;
+	overlayfs->modified_uuid.arg = fdata->modified_uuid;
+	return 0;
+}
+
+static int fill_file_description(struct file *file, schema_File *schema_file,
+				 struct file_data *fdata)
+{
+	char *buf;
+	int err;
+	u32 mode;
+	char *path;
+	struct socket *socket;
+	schema_Socket *socketfs;
+	struct sockaddr_storage saddr;
+
+	memset(fdata, 0, sizeof(*fdata));
+
+	if (file == NULL)
+		return 0;
+
+	schema_file->ino = file_inode(file)->i_ino;
+	mode = file_inode(file)->i_mode;
+
+	/* For pipes, no need to resolve the path. */
+	if (S_ISFIFO(mode))
+		return 0;
+
+	if (S_ISSOCK(mode)) {
+		socket = (struct socket *)file->private_data;
+		socketfs = &schema_file->filesystem.socket;
+
+		/* Local socket */
+		err = kernel_getsockname(socket, (struct sockaddr *)&saddr);
+		if (err >= 0) {
+			socketfs->has_local = true;
+			fill_socket_description(&saddr, &fdata->local,
+						&socketfs->local);
+		}
+
+		/* Remote socket, might not be connected. */
+		err = kernel_getpeername(socket, (struct sockaddr *)&saddr);
+		if (err >= 0) {
+			socketfs->has_remote = true;
+			fill_socket_description(&saddr, &fdata->remote,
+						&socketfs->remote);
+		}
+
+		schema_file->which_filesystem = schema_File_socket_tag;
+		return 0;
+	}
+
+	/*
+	 * From this point, we care about all the other types of files as their
+	 * path provides interesting insight.
+	 */
+	buf = (char *)__get_free_page(GFP_KERNEL);
+	if (buf == NULL)
+		return -ENOMEM;
+
+	fdata->allocated = buf;
+
+	path = d_path(&file->f_path, buf, PAGE_SIZE);
+	if (IS_ERR(path)) {
+		free_file_data(fdata);
+		return PTR_ERR(path);
+	}
+
+	schema_file->fullpath.funcs.encode = pb_encode_string_field;
+	schema_file->fullpath.arg = path; /* buf is freed in free_file_data. */
+
+	err = fill_file_overlayfs(file, schema_file, fdata);
+	if (err) {
+		free_file_data(fdata);
+		return err;
+	}
+
+	return 0;
+}
+
+static int fill_stream_description(schema_Descriptor *desc, int fd,
+				   struct file_data *fdata)
+{
+	struct fd sfd;
+	struct file *file;
+	int err = 0;
+
+	sfd = fdget(fd);
+	file = sfd.file;
+
+	if (file == NULL) {
+		memset(fdata, 0, sizeof(*fdata));
+		goto end;
+	}
+
+	desc->mode = file_inode(file)->i_mode;
+	desc->has_file = true;
+	err = fill_file_description(file, &desc->file, fdata);
+
+end:
+	fdput(sfd);
+	return err;
+}
+
+static int populate_proc_uuid_common(schema_Process *proc, char *uuid,
+				     size_t uuid_size, char *parent_uuid,
+				     size_t parent_uuid_size,
+				     struct task_struct *task)
+{
+	int err;
+	struct task_struct *parent;
+	/* Generate unique identifier for the process and its parent */
+	err = get_process_uuid(task, uuid, uuid_size);
+	if (err)
+		return err;
+
+	proc->uuid.funcs.encode = pb_encode_uuid_field;
+	proc->uuid.arg = uuid;
+
+	rcu_read_lock();
+
+	if (!pid_alive(task))
+		goto out;
+	/*
+	 * I don't think this needs to be task_rcu_dereference because
+	 * real_parent is only supposed to be accessed using RCU.
+	 */
+	parent = rcu_dereference(task->real_parent);
+
+	if (parent) {
+		err = get_process_uuid(parent, parent_uuid, parent_uuid_size);
+		if (!err) {
+			proc->parent_uuid.funcs.encode = pb_encode_uuid_field;
+			proc->parent_uuid.arg = parent_uuid;
+		}
+	}
+
+out:
+	rcu_read_unlock();
+
+	return err;
+}
+
+/* Populate the fields that we always want to set in Process messages. */
+static int populate_proc_common(schema_Process *proc, char *uuid,
+				size_t uuid_size, char *parent_uuid,
+				size_t parent_uuid_size,
+				struct task_struct *task)
+{
+	u64 cid;
+	struct pid_namespace *ns = task_active_pid_ns(task);
+
+	/* Container identifier for the current namespace. */
+	proc->container_id = csm_get_ns_contid(ns);
+
+	/*
+	 * If the process container-id is different, the process tree is part of
+	 * a different session within the namespace (kubectl/docker exec,
+	 * liveness probe or others).
+	 */
+	cid = audit_get_contid(task);
+	if (proc->container_id != cid)
+		proc->exec_session_id = cid;
+
+	/* Add information about pid in different namespaces */
+	proc->pid = task_tgid_nr(task);
+	proc->parent_pid = task_ppid_nr(task);
+	proc->container_pid = task_tgid_nr_ns(task, ns);
+	proc->container_parent_pid = task_ppid_nr_ns(task, ns);
+
+	return populate_proc_uuid_common(proc, uuid, uuid_size, parent_uuid,
+					 parent_uuid_size, task);
+}
+
+int csm_bprm_check_security(struct linux_binprm *bprm)
+{
+	char uuid[PROCESS_UUID_SIZE];
+	char parent_uuid[PROCESS_UUID_SIZE];
+	int err;
+	schema_Event event = {};
+	schema_Process *proc;
+	struct string_arr_ctx argv_ctx;
+	void *stack = NULL, *ctx = NULL;
+	u64 cid;
+	struct file_data path_data = {};
+	struct file_data stdin_data = {};
+	struct file_data stdout_data = {};
+	struct file_data stderr_data = {};
+
+	/*
+	 * Always create a container-id for containerized processes.
+	 * If the LSM is enabled later, we can track existing containers.
+	 */
+	cid = audit_get_contid(current);
+
+	if (cid == AUDIT_CID_UNSET) {
+		if (!is_possible_container(current, bprm->file))
+			return 0;
+
+		cid = csm_set_contid(current);
+
+		if (cid == AUDIT_CID_UNSET)
+			return 0;
+	}
+
+	if (!csm_execute_enabled)
+		return 0;
+
+	/* The interpreter will call us again with more context. */
+	if (bprm->buf[0] == '#' && bprm->buf[1] == '!')
+		return 0;
+
+	proc = &event.event.execute.proc;
+	err = populate_proc_common(proc, uuid, sizeof(uuid), parent_uuid,
+				   sizeof(parent_uuid), current);
+	if (err)
+		goto out_free_buf;
+
+	proc->creation_timestamp = ktime_get_real_ns();
+
+	/* Provide information about the launched binary. */
+	proc->has_binary = true;
+	err = fill_file_description(bprm->file, &proc->binary, &path_data);
+	if (err)
+		goto out_free_buf;
+
+	/* Information about streams */
+	proc->has_streams = true;
+
+	proc->streams.has_stdin = true;
+	err = fill_stream_description(&proc->streams.stdin, STDIN_FILENO,
+				      &stdin_data);
+	if (err)
+		goto out_free_buf;
+
+	proc->streams.has_stdout = true;
+	err = fill_stream_description(&proc->streams.stdout, STDOUT_FILENO,
+				      &stdout_data);
+	if (err)
+		goto out_free_buf;
+
+	proc->streams.has_stderr = true;
+	err = fill_stream_description(&proc->streams.stderr, STDERR_FILENO,
+				      &stderr_data);
+	if (err)
+		goto out_free_buf;
+
+	stack = kmap_argument_stack(bprm, &ctx);
+	if (!stack) {
+		err = -EFAULT;
+		goto out_free_buf;
+	}
+
+	/* Capture process argument */
+	argv_ctx.bprm = bprm;
+	argv_ctx.stack = stack;
+	proc->args.argv.funcs.encode = encode_current_argv;
+	proc->args.argv.arg = &argv_ctx;
+
+	/* Capture process environment variables */
+	proc->args.envp.funcs.encode = encode_current_envp;
+	proc->args.envp.arg = &argv_ctx;
+
+	event.which_event = schema_Event_execute_tag;
+	event.event.execute.has_proc = true;
+	proc->has_args = true;
+
+	/*
+	 * Configurations options are checked when computing the serialized
+	 * protobufs.
+	 */
+	down_read(&csm_rwsem_config);
+	err = csm_sendeventproto(schema_Event_fields, &event);
+	up_read(&csm_rwsem_config);
+
+	if (err)
+		pr_err("csm_sendeventproto returned %d on execve\n", err);
+	err = 0;
+
+out_free_buf:
+	kunmap_argument_stack(bprm, stack, ctx);
+	free_file_data(&path_data);
+	free_file_data(&stdin_data);
+	free_file_data(&stdout_data);
+	free_file_data(&stderr_data);
+
+	/*
+	 * On failure, enforce it only if the execute config is enabled.
+	 * If the collector was disabled, prefer to succeed to not impact the
+	 * system.
+	 */
+	if (unlikely(err < 0 && !csm_execute_enabled))
+		err = 0;
+
+	return err;
+}
+
+/* Create a clone event when a new task leader is created. */
+void csm_task_post_alloc(struct task_struct *task)
+{
+	int err;
+	char uuid[PROCESS_UUID_SIZE];
+	char parent_uuid[PROCESS_UUID_SIZE];
+	schema_Event event = {};
+	schema_Process *proc;
+
+	if (!csm_execute_enabled ||
+	    audit_get_contid(task) == AUDIT_CID_UNSET ||
+	    !thread_group_leader(task))
+		return;
+
+	proc = &event.event.clone.proc;
+
+	err = populate_proc_uuid_common(proc, uuid, sizeof(uuid), parent_uuid,
+					sizeof(parent_uuid), task);
+
+	event.which_event = schema_Event_clone_tag;
+	event.event.clone.has_proc = true;
+	err = csm_sendeventproto(schema_Event_fields, &event);
+	if (err)
+		pr_err("csm_sendeventproto returned %d on exit\n", err);
+}
+
+/*
+ * This LSM hook callback doesn't exist upstream and is called only when the
+ * last thread of a thread group exit.
+ */
+void csm_task_exit(struct task_struct *task)
+{
+	int err;
+	schema_Event event = {};
+	schema_ExitEvent *exit;
+	char uuid[PROCESS_UUID_SIZE];
+
+	if (!csm_execute_enabled ||
+	    audit_get_contid(task) == AUDIT_CID_UNSET)
+		return;
+
+	exit = &event.event.exit;
+
+	/* Fetch the unique identifier for this process */
+	err = get_process_uuid(task, uuid, sizeof(uuid));
+	if (err) {
+		pr_err("failed to get process uuid on exit\n");
+		return;
+	}
+
+	exit->process_uuid.funcs.encode = pb_encode_uuid_field;
+	exit->process_uuid.arg = uuid;
+
+	event.which_event = schema_Event_exit_tag;
+
+	err = csm_sendeventproto(schema_Event_fields, &event);
+	if (err)
+		pr_err("csm_sendeventproto returned %d on exit\n", err);
+}
+
+int csm_mprotect(struct vm_area_struct *vma, unsigned long reqprot,
+		unsigned long prot)
+{
+	char uuid[PROCESS_UUID_SIZE];
+	char parent_uuid[PROCESS_UUID_SIZE];
+	int err;
+	schema_Event event = {};
+	schema_MemoryExecEvent *memexec;
+	u64 cid;
+	struct file_data path_data = {};
+
+	cid = audit_get_contid(current);
+
+	if (!csm_memexec_enabled ||
+	    !(prot & PROT_EXEC) ||
+	    vma->vm_file == NULL ||
+	    cid == AUDIT_CID_UNSET)
+		return 0;
+
+	memexec = &event.event.memexec;
+
+	err = fill_file_description(vma->vm_file, &memexec->mapped_file,
+				    &path_data);
+	if (err)
+		return err;
+
+	err = populate_proc_common(&memexec->proc, uuid, sizeof(uuid),
+				   parent_uuid, sizeof(parent_uuid), current);
+	if (err)
+		goto out;
+
+	memexec->prot_exec_timestamp = ktime_get_real_ns();
+	memexec->new_flags = prot;
+	memexec->req_flags = reqprot;
+	memexec->old_vm_flags = vma->vm_flags;
+
+	memexec->action = schema_MemoryExecEvent_Action_MPROTECT;
+	memexec->start_addr = vma->vm_start;
+	memexec->end_addr = vma->vm_end;
+
+	event.which_event = schema_Event_memexec_tag;
+	event.event.memexec.has_proc = true;
+	event.event.memexec.has_mapped_file = true;
+
+	err = csm_sendeventproto(schema_Event_fields, &event);
+	if (err)
+		pr_err("csm_sendeventproto returned %d on mprotect\n", err);
+	err = 0;
+
+	if (unlikely(err < 0 && !csm_memexec_enabled))
+		err = 0;
+
+out:
+	free_file_data(&path_data);
+	return err;
+}
+
+int csm_mmap_file(struct file *file, unsigned long reqprot,
+		unsigned long prot, unsigned long flags)
+{
+	char uuid[PROCESS_UUID_SIZE];
+	char parent_uuid[PROCESS_UUID_SIZE];
+	int err;
+	schema_Event event = {};
+	schema_MemoryExecEvent *memexec;
+	struct file *exe_file;
+	u64 cid;
+	struct file_data path_data = {};
+
+	cid = audit_get_contid(current);
+
+	if (!csm_memexec_enabled ||
+	    !(prot & PROT_EXEC) ||
+	    file == NULL ||
+	    cid == AUDIT_CID_UNSET)
+		return 0;
+
+	memexec = &event.event.memexec;
+	err = fill_file_description(file, &memexec->mapped_file,
+				    &path_data);
+	if (err)
+		return err;
+
+	err = populate_proc_common(&memexec->proc, uuid, sizeof(uuid),
+				   parent_uuid, sizeof(parent_uuid), current);
+	if (err)
+		goto out;
+
+	/* get_mm_exe_file does its own locking on mm_sem. */
+	exe_file = get_mm_exe_file(current->mm);
+	if (exe_file) {
+		if (path_equal(&file->f_path, &exe_file->f_path))
+			memexec->is_initial_mmap = 1;
+		fput(exe_file);
+	}
+
+	memexec->prot_exec_timestamp = ktime_get_real_ns();
+	memexec->new_flags = prot;
+	memexec->req_flags = reqprot;
+	memexec->mmap_flags = flags;
+	memexec->action = schema_MemoryExecEvent_Action_MMAP_FILE;
+	event.which_event = schema_Event_memexec_tag;
+	event.event.memexec.has_proc = true;
+	event.event.memexec.has_mapped_file = true;
+
+	err = csm_sendeventproto(schema_Event_fields, &event);
+	if (err)
+		pr_err("csm_sendeventproto returned %d on mmap_file\n", err);
+	err = 0;
+
+	if (unlikely(err < 0 && !csm_memexec_enabled))
+		err = 0;
+
+out:
+	free_file_data(&path_data);
+	return err;
+}
+
+void csm_file_pre_free(struct file *file)
+{
+	struct dentry *dentry;
+	int err;
+	struct file_provenance prov;
+
+	/* The file was opened to be modified and the LSM is enabled */
+	if (!(file->f_mode & FMODE_WRITE) ||
+	    !csm_enabled)
+		return;
+
+	/* The current process is containerized. */
+	if (audit_get_contid(current) == AUDIT_CID_UNSET)
+		return;
+
+	/* The file is part of overlayfs on the upper layer. */
+	if (!is_overlayfs_mounted(file))
+		return;
+
+	dentry = ovl_dentry_upper(file->f_path.dentry);
+	if (!dentry)
+		return;
+
+	err = __vfs_getxattr(dentry, dentry->d_inode, XATTR_SECURITY_CSM,
+			     NULL, 0);
+	if (err != -ENODATA) {
+		if (err < 0)
+			pr_err("failed to get security attribute: %d\n", err);
+		return;
+	}
+
+	prov.tgid = task_tgid_nr(current);
+	prov.start_time = ktime_mono_to_real(current->group_leader->start_time);
+
+	err = __vfs_setxattr(dentry, dentry->d_inode, XATTR_SECURITY_CSM, &prov,
+			     sizeof(prov), 0);
+	if (err < 0)
+		pr_err("failed to set security attribute: %d\n", err);
+}
+
+/*
+ * Based off of fs/proc/base.c:next_tgid
+ *
+ * next_thread_group_leader returns the task_struct of the next task with a pid
+ * greater than or equal to tgid. The reference count is increased so that
+ * rcu_read_unlock may be called, and preemption reenabled.
+ */
+static struct task_struct *next_thread_group_leader(pid_t *tgid)
+{
+	struct pid *pid;
+	struct task_struct *task;
+
+	cond_resched();
+	rcu_read_lock();
+retry:
+	task = NULL;
+	pid = find_ge_pid(*tgid, &init_pid_ns);
+	if (pid) {
+		*tgid = pid_nr_ns(pid, &init_pid_ns);
+		task = pid_task(pid, PIDTYPE_PID);
+		if (!task || !thread_group_leader(task) ||
+		    audit_get_contid(task) == AUDIT_CID_UNSET) {
+			(*tgid) += 1;
+			goto retry;
+		}
+
+		/*
+		 * Increment the reference count on the task before leaving
+		 * the RCU grace period.
+		 */
+		get_task_struct(task);
+		(*tgid) += 1;
+	}
+
+	rcu_read_unlock();
+	return task;
+}
+
+void delayed_enumerate_processes(struct work_struct *work)
+{
+	pid_t tgid = 0;
+	struct task_struct *task;
+	struct csm_enumerate_processes_work_data *wd = container_of(
+		work, struct csm_enumerate_processes_work_data, work);
+	int wd_enumeration_count = wd->enumeration_count;
+
+	kfree(wd);
+	wd = NULL;
+	work = NULL;
+
+	/*
+	 * Try for only a single enumeration routine at a time, as long as the
+	 * execute collector is enabled.
+	 */
+	while ((wd_enumeration_count == atomic_read(&enumeration_count)) &&
+	       READ_ONCE(csm_execute_enabled) &&
+	       (task = next_thread_group_leader(&tgid))) {
+		int err;
+		char uuid[PROCESS_UUID_SIZE];
+		char parent_uuid[PROCESS_UUID_SIZE];
+		struct file *exe_file = NULL;
+		struct file_data path_data = {};
+		schema_Event event = {};
+		schema_Process *proc = &event.event.enumproc.proc;
+
+		exe_file = get_task_exe_file(task);
+		if (!exe_file) {
+			pr_err("failed to get enumerated process executable, pid: %u\n",
+			       task_pid_nr(task));
+			goto next;
+		}
+
+		proc->has_binary = true;
+		err = fill_file_description(exe_file, &proc->binary,
+					    &path_data);
+		if (err) {
+			pr_err("failed to fill enumerated process %u executable description: %d\n",
+			       task_pid_nr(task), err);
+			goto next;
+		}
+
+		err = populate_proc_common(proc, uuid, sizeof(uuid),
+					   parent_uuid, sizeof(parent_uuid),
+					   task);
+		if (err) {
+			pr_err("failed to set pid %u common fields: %d\n",
+			       task_pid_nr(task), err);
+			goto next;
+		}
+
+		if (task->flags & PF_EXITING)
+			goto next;
+
+		event.which_event = schema_Event_enumproc_tag;
+		event.event.execute.has_proc = true;
+		err = csm_sendeventproto(schema_Event_fields,
+					 &event);
+		if (err) {
+			pr_err("failed to send pid %u enumerated process: %d\n",
+			       task_pid_nr(task), err);
+			goto next;
+		}
+next:
+		free_file_data(&path_data);
+		if (exe_file)
+			fput(exe_file);
+
+		put_task_struct(task);
+	}
+}
+
+void csm_enumerate_processes(unsigned long const config_version)
+{
+	struct csm_enumerate_processes_work_data *wd;
+
+	wd = kmalloc(sizeof(*wd), GFP_KERNEL);
+	if (!wd)
+		return;
+
+	INIT_WORK(&wd->work, delayed_enumerate_processes);
+	wd->enumeration_count = atomic_add_return(1, &enumeration_count);
+	schedule_work(&wd->work);
+}

diff --git a/security/container/process.h b/security/container/process.h
new file mode 100644
index 0000000..1c98134
--- /dev/null
+++ b/security/container/process.h

@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Container Security Monitor module
+ *
+ * Copyright (c) 2019 Google, Inc
+ */
+
+void csm_enumerate_processes(void);

diff --git a/security/container/protos/Makefile b/security/container/protos/Makefile
new file mode 100644
index 0000000..a88068b
--- /dev/null
+++ b/security/container/protos/Makefile

@@ -0,0 +1,10 @@
+subdir-$(CONFIG_SECURITY_CONTAINER_MONITOR) += nanopb
+
+obj-$(CONFIG_SECURITY_CONTAINER_MONITOR) += nanopb/
+obj-$(CONFIG_SECURITY_CONTAINER_MONITOR) += protos.o
+
+protos-y := config.pb.o event.pb.o
+
+ccflags-y := -I$(srctree)/security/container/protos \
+	-I$(srctree)/security/container/protos/nanopb \
+	$(PB_CCFLAGS)

diff --git a/security/container/protos/README b/security/container/protos/README
new file mode 100644
index 0000000..1b0628a
--- /dev/null
+++ b/security/container/protos/README

@@ -0,0 +1,18 @@
+This document provides guidance on how to change the protos used in this directory.
+
+Any change made to a proto file require to reformat it and regenerate nanopb
+sources. It also requires the proto files to be compatible to previously released versions.
+
+To reformat any proto file run: "clang-format -style=Google -i <file.proto>"
+
+To regenerate nanopb files:
+ - Install protoc
+   - apt-get install protobuf-compiler
+ - Clone/setup nanopb for version 0.3.9.1 (or clone the internal depot)
+   - git clone --depth=1 https://github.com/nanopb/nanopb.git
+   - cd nanopb
+   - git fetch --tags
+   - git checkout tags/0.3.9.1
+   - make -C generator/proto
+ - Run protoc with the nanopb definition
+   - protoc --plugin=<path_to_nanopb>/generator/protoc-gen-nanopb --nanopb_out=<path_to_linux>/security/container/protos/ <path_to_linux>/security/container/protos/<file.proto> --proto_path=<path_to_linux>/security/container/protos

diff --git a/security/container/protos/config.pb.c b/security/container/protos/config.pb.c
new file mode 100644
index 0000000..08436ee
--- /dev/null
+++ b/security/container/protos/config.pb.c

@@ -0,0 +1,25 @@
+/* Automatically generated nanopb constant definitions */
+/* Generated by nanopb-0.4.5 */
+
+#include "config.pb.h"
+#if PB_PROTO_HEADER_VERSION != 40
+#error Regenerate this file with the current version of nanopb generator.
+#endif
+
+PB_BIND(schema_ContainerCollectorConfig, schema_ContainerCollectorConfig, AUTO)
+
+
+PB_BIND(schema_ExecuteCollectorConfig, schema_ExecuteCollectorConfig, AUTO)
+
+
+PB_BIND(schema_MemExecCollectorConfig, schema_MemExecCollectorConfig, AUTO)
+
+
+PB_BIND(schema_ConfigurationRequest, schema_ConfigurationRequest, AUTO)
+
+
+PB_BIND(schema_ConfigurationResponse, schema_ConfigurationResponse, AUTO)
+
+
+
+

diff --git a/security/container/protos/config.pb.h b/security/container/protos/config.pb.h
new file mode 100644
index 0000000..893961e
--- /dev/null
+++ b/security/container/protos/config.pb.h

@@ -0,0 +1,157 @@
+/* Automatically generated nanopb header */
+/* Generated by nanopb-0.4.5 */
+
+#ifndef PB_SCHEMA_CONFIG_PB_H_INCLUDED
+#define PB_SCHEMA_CONFIG_PB_H_INCLUDED
+#include <pb.h>
+
+#if PB_PROTO_HEADER_VERSION != 40
+#error Regenerate this file with the current version of nanopb generator.
+#endif
+
+/* Enum definitions */
+typedef enum _schema_ConfigurationResponse_ErrorCode { 
+    schema_ConfigurationResponse_ErrorCode_NO_ERROR = 0, 
+    schema_ConfigurationResponse_ErrorCode_UNKNOWN = 2 
+} schema_ConfigurationResponse_ErrorCode;
+
+/* Struct definitions */
+/* Report success or failure of previous ConfigurationRequest */
+typedef struct _schema_ConfigurationResponse { 
+    schema_ConfigurationResponse_ErrorCode error; 
+    pb_callback_t msg; 
+    uint64_t version; /* Version of the LSM */
+    uint32_t kernel_version; /* LINUX_VERSION_CODE */
+} schema_ConfigurationResponse;
+
+/* Collect information about running containers */
+typedef struct _schema_ContainerCollectorConfig { 
+    bool enabled; 
+} schema_ContainerCollectorConfig;
+
+typedef struct _schema_ExecuteCollectorConfig { 
+    bool enabled; 
+    /* truncate argv/envp if cumulative length exceeds limit */
+    uint32_t argv_limit; 
+    uint32_t envp_limit; 
+    /* If specified, only report the named environment variables.  An
+ empty envp_allowlist indicates that all environment variables
+ should be reported up to a cumulative total of envp_limit bytes. */
+    pb_callback_t envp_allowlist; 
+} schema_ExecuteCollectorConfig;
+
+/* Collect information about executable memory mappings. */
+typedef struct _schema_MemExecCollectorConfig { 
+    bool enabled; 
+} schema_MemExecCollectorConfig;
+
+/* Convey configuration information to Guest LSM */
+typedef struct _schema_ConfigurationRequest { 
+    bool has_container_config;
+    schema_ContainerCollectorConfig container_config; 
+    bool has_execute_config;
+    schema_ExecuteCollectorConfig execute_config; 
+    bool has_memexec_config;
+    schema_MemExecCollectorConfig memexec_config; 
+} schema_ConfigurationRequest;
+
+
+/* Helper constants for enums */
+#define _schema_ConfigurationResponse_ErrorCode_MIN schema_ConfigurationResponse_ErrorCode_NO_ERROR
+#define _schema_ConfigurationResponse_ErrorCode_MAX schema_ConfigurationResponse_ErrorCode_UNKNOWN
+#define _schema_ConfigurationResponse_ErrorCode_ARRAYSIZE ((schema_ConfigurationResponse_ErrorCode)(schema_ConfigurationResponse_ErrorCode_UNKNOWN+1))
+
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/* Initializer values for message structs */
+#define schema_ContainerCollectorConfig_init_default {0}
+#define schema_ExecuteCollectorConfig_init_default {0, 0, 0, {{NULL}, NULL}}
+#define schema_MemExecCollectorConfig_init_default {0}
+#define schema_ConfigurationRequest_init_default {false, schema_ContainerCollectorConfig_init_default, false, schema_ExecuteCollectorConfig_init_default, false, schema_MemExecCollectorConfig_init_default}
+#define schema_ConfigurationResponse_init_default {_schema_ConfigurationResponse_ErrorCode_MIN, {{NULL}, NULL}, 0, 0}
+#define schema_ContainerCollectorConfig_init_zero {0}
+#define schema_ExecuteCollectorConfig_init_zero  {0, 0, 0, {{NULL}, NULL}}
+#define schema_MemExecCollectorConfig_init_zero  {0}
+#define schema_ConfigurationRequest_init_zero    {false, schema_ContainerCollectorConfig_init_zero, false, schema_ExecuteCollectorConfig_init_zero, false, schema_MemExecCollectorConfig_init_zero}
+#define schema_ConfigurationResponse_init_zero   {_schema_ConfigurationResponse_ErrorCode_MIN, {{NULL}, NULL}, 0, 0}
+
+/* Field tags (for use in manual encoding/decoding) */
+#define schema_ConfigurationResponse_error_tag   1
+#define schema_ConfigurationResponse_msg_tag     2
+#define schema_ConfigurationResponse_version_tag 3
+#define schema_ConfigurationResponse_kernel_version_tag 4
+#define schema_ContainerCollectorConfig_enabled_tag 1
+#define schema_ExecuteCollectorConfig_enabled_tag 1
+#define schema_ExecuteCollectorConfig_argv_limit_tag 2
+#define schema_ExecuteCollectorConfig_envp_limit_tag 3
+#define schema_ExecuteCollectorConfig_envp_allowlist_tag 4
+#define schema_MemExecCollectorConfig_enabled_tag 1
+#define schema_ConfigurationRequest_container_config_tag 1
+#define schema_ConfigurationRequest_execute_config_tag 2
+#define schema_ConfigurationRequest_memexec_config_tag 3
+
+/* Struct field encoding specification for nanopb */
+#define schema_ContainerCollectorConfig_FIELDLIST(X, a) \
+X(a, STATIC,   SINGULAR, BOOL,     enabled,           1)
+#define schema_ContainerCollectorConfig_CALLBACK NULL
+#define schema_ContainerCollectorConfig_DEFAULT NULL
+
+#define schema_ExecuteCollectorConfig_FIELDLIST(X, a) \
+X(a, STATIC,   SINGULAR, BOOL,     enabled,           1) \
+X(a, STATIC,   SINGULAR, UINT32,   argv_limit,        2) \
+X(a, STATIC,   SINGULAR, UINT32,   envp_limit,        3) \
+X(a, CALLBACK, REPEATED, STRING,   envp_allowlist,    4)
+#define schema_ExecuteCollectorConfig_CALLBACK pb_default_field_callback
+#define schema_ExecuteCollectorConfig_DEFAULT NULL
+
+#define schema_MemExecCollectorConfig_FIELDLIST(X, a) \
+X(a, STATIC,   SINGULAR, BOOL,     enabled,           1)
+#define schema_MemExecCollectorConfig_CALLBACK NULL
+#define schema_MemExecCollectorConfig_DEFAULT NULL
+
+#define schema_ConfigurationRequest_FIELDLIST(X, a) \
+X(a, STATIC,   OPTIONAL, MESSAGE,  container_config,   1) \
+X(a, STATIC,   OPTIONAL, MESSAGE,  execute_config,    2) \
+X(a, STATIC,   OPTIONAL, MESSAGE,  memexec_config,    3)
+#define schema_ConfigurationRequest_CALLBACK NULL
+#define schema_ConfigurationRequest_DEFAULT NULL
+#define schema_ConfigurationRequest_container_config_MSGTYPE schema_ContainerCollectorConfig
+#define schema_ConfigurationRequest_execute_config_MSGTYPE schema_ExecuteCollectorConfig
+#define schema_ConfigurationRequest_memexec_config_MSGTYPE schema_MemExecCollectorConfig
+
+#define schema_ConfigurationResponse_FIELDLIST(X, a) \
+X(a, STATIC,   SINGULAR, UENUM,    error,             1) \
+X(a, CALLBACK, SINGULAR, STRING,   msg,               2) \
+X(a, STATIC,   SINGULAR, UINT64,   version,           3) \
+X(a, STATIC,   SINGULAR, UINT32,   kernel_version,    4)
+#define schema_ConfigurationResponse_CALLBACK pb_default_field_callback
+#define schema_ConfigurationResponse_DEFAULT NULL
+
+extern const pb_msgdesc_t schema_ContainerCollectorConfig_msg;
+extern const pb_msgdesc_t schema_ExecuteCollectorConfig_msg;
+extern const pb_msgdesc_t schema_MemExecCollectorConfig_msg;
+extern const pb_msgdesc_t schema_ConfigurationRequest_msg;
+extern const pb_msgdesc_t schema_ConfigurationResponse_msg;
+
+/* Defines for backwards compatibility with code written before nanopb-0.4.0 */
+#define schema_ContainerCollectorConfig_fields &schema_ContainerCollectorConfig_msg
+#define schema_ExecuteCollectorConfig_fields &schema_ExecuteCollectorConfig_msg
+#define schema_MemExecCollectorConfig_fields &schema_MemExecCollectorConfig_msg
+#define schema_ConfigurationRequest_fields &schema_ConfigurationRequest_msg
+#define schema_ConfigurationResponse_fields &schema_ConfigurationResponse_msg
+
+/* Maximum encoded size of messages (where known) */
+/* schema_ExecuteCollectorConfig_size depends on runtime parameters */
+/* schema_ConfigurationRequest_size depends on runtime parameters */
+/* schema_ConfigurationResponse_size depends on runtime parameters */
+#define schema_ContainerCollectorConfig_size     2
+#define schema_MemExecCollectorConfig_size       2
+
+#ifdef __cplusplus
+} /* extern "C" */
+#endif
+
+#endif

diff --git a/security/container/protos/config.proto b/security/container/protos/config.proto
new file mode 100644
index 0000000..e32a517
--- /dev/null
+++ b/security/container/protos/config.proto

@@ -0,0 +1,51 @@
+syntax = "proto3";
+
+package schema;
+
+// Collect information about running containers
+message ContainerCollectorConfig {
+  bool enabled = 1;
+}
+
+message ExecuteCollectorConfig {
+  bool enabled = 1;
+
+  // truncate argv/envp if cumulative length exceeds limit
+  uint32 argv_limit = 2;
+  uint32 envp_limit = 3;
+
+  // If specified, only report the named environment variables.  An
+  // empty envp_allowlist indicates that all environment variables
+  // should be reported up to a cumulative total of envp_limit bytes.
+  repeated string envp_allowlist = 4;
+}
+
+// Collect information about executable memory mappings.
+message MemExecCollectorConfig {
+  bool enabled = 1;
+}
+
+// Convey configuration information to Guest LSM
+message ConfigurationRequest {
+  ContainerCollectorConfig container_config = 1;
+  ExecuteCollectorConfig execute_config = 2;
+  MemExecCollectorConfig memexec_config = 3;
+
+  // Additional configuration messages will be added as new collectors
+  // are implemented
+}
+
+// Report success or failure of previous ConfigurationRequest
+message ConfigurationResponse {
+  enum ErrorCode {
+    // Keep values in sync with
+    // https://github.com/googleapis/googleapis/blob/master/google/rpc/code.proto
+    NO_ERROR = 0;
+    UNKNOWN = 2;
+  }
+
+  ErrorCode error = 1;
+  string msg = 2;
+  uint64 version = 3;         // Version of the LSM
+  uint32 kernel_version = 4;  // LINUX_VERSION_CODE
+}

diff --git a/security/container/protos/event.pb.c b/security/container/protos/event.pb.c
new file mode 100644
index 0000000..2293566
--- /dev/null
+++ b/security/container/protos/event.pb.c

@@ -0,0 +1,61 @@
+/* Automatically generated nanopb constant definitions */
+/* Generated by nanopb-0.4.5 */
+
+#include "event.pb.h"
+#if PB_PROTO_HEADER_VERSION != 40
+#error Regenerate this file with the current version of nanopb generator.
+#endif
+
+PB_BIND(schema_SocketIp, schema_SocketIp, AUTO)
+
+
+PB_BIND(schema_Socket, schema_Socket, AUTO)
+
+
+PB_BIND(schema_Overlay, schema_Overlay, AUTO)
+
+
+PB_BIND(schema_File, schema_File, AUTO)
+
+
+PB_BIND(schema_ProcessArguments, schema_ProcessArguments, AUTO)
+
+
+PB_BIND(schema_Descriptor, schema_Descriptor, AUTO)
+
+
+PB_BIND(schema_Streams, schema_Streams, 2)
+
+
+PB_BIND(schema_Process, schema_Process, 2)
+
+
+PB_BIND(schema_Container, schema_Container, AUTO)
+
+
+PB_BIND(schema_ExecuteEvent, schema_ExecuteEvent, 2)
+
+
+PB_BIND(schema_CloneEvent, schema_CloneEvent, 2)
+
+
+PB_BIND(schema_EnumerateProcessEvent, schema_EnumerateProcessEvent, 2)
+
+
+PB_BIND(schema_MemoryExecEvent, schema_MemoryExecEvent, 2)
+
+
+PB_BIND(schema_ContainerInfoEvent, schema_ContainerInfoEvent, AUTO)
+
+
+PB_BIND(schema_ExitEvent, schema_ExitEvent, AUTO)
+
+
+PB_BIND(schema_Event, schema_Event, 2)
+
+
+PB_BIND(schema_ContainerReport, schema_ContainerReport, AUTO)
+
+
+
+

diff --git a/security/container/protos/event.pb.h b/security/container/protos/event.pb.h
new file mode 100644
index 0000000..9535068
--- /dev/null
+++ b/security/container/protos/event.pb.h

@@ -0,0 +1,518 @@
+/* Automatically generated nanopb header */
+/* Generated by nanopb-0.4.5 */
+
+#ifndef PB_SCHEMA_EVENT_PB_H_INCLUDED
+#define PB_SCHEMA_EVENT_PB_H_INCLUDED
+#include <pb.h>
+
+#if PB_PROTO_HEADER_VERSION != 40
+#error Regenerate this file with the current version of nanopb generator.
+#endif
+
+/* Enum definitions */
+typedef enum _schema_MemoryExecEvent_Action { 
+    schema_MemoryExecEvent_Action_UNDEFINED = 0, 
+    schema_MemoryExecEvent_Action_MPROTECT = 1, 
+    schema_MemoryExecEvent_Action_MMAP_FILE = 2 
+} schema_MemoryExecEvent_Action;
+
+/* Struct definitions */
+/* The process with the indicated pid has exited. */
+typedef struct _schema_ExitEvent { 
+    pb_callback_t process_uuid; 
+} schema_ExitEvent;
+
+typedef struct _schema_Container { 
+    uint64_t creation_timestamp; /* container create time in ns */
+    pb_callback_t pod_namespace; 
+    pb_callback_t pod_name; 
+    uint64_t container_id; /* unique across lifetime of Node */
+    pb_callback_t container_name; 
+    pb_callback_t container_image_uri; 
+    pb_callback_t labels; 
+    pb_callback_t init_uuid; 
+    pb_callback_t container_image_id; 
+} schema_Container;
+
+typedef struct _schema_Overlay { 
+    bool lower_layer; 
+    bool upper_layer; 
+    pb_callback_t modified_uuid; /* The process who first modified the file. */
+} schema_Overlay;
+
+typedef struct _schema_ProcessArguments { 
+    pb_callback_t argv; /* process arguments */
+    uint32_t argv_truncated; /* number of characters truncated from argv */
+    pb_callback_t envp; /* process environment variables */
+    uint32_t envp_truncated; /* number of characters truncated from envp */
+} schema_ProcessArguments;
+
+typedef struct _schema_SocketIp { 
+    uint32_t family; /* AF_* for socket type. */
+    pb_callback_t ip; /* ip4 or ip6 address. */
+    uint32_t port; /* port bind or connected. */
+} schema_SocketIp;
+
+/* Associate the following container information with all processes
+ that have the indicated container_id. */
+typedef struct _schema_ContainerInfoEvent { 
+    bool has_container;
+    schema_Container container; 
+} schema_ContainerInfoEvent;
+
+/* Message sent by the daemonset to the LSM for container enlightenment. */
+typedef struct _schema_ContainerReport { 
+    uint32_t pid; /* Top pid of the running container. */
+    bool has_container;
+    schema_Container container; /* Information collected about the container. */
+} schema_ContainerReport;
+
+typedef struct _schema_Socket { 
+    bool has_local;
+    schema_SocketIp local; 
+    bool has_remote;
+    schema_SocketIp remote; /* unset if not connected. */
+} schema_Socket;
+
+typedef struct _schema_File { 
+    pb_callback_t fullpath; 
+    pb_size_t which_filesystem;
+    union {
+        schema_Overlay overlayfs;
+        schema_Socket socket;
+    } filesystem; /* inode number. */
+    uint32_t ino; 
+    uint64_t ctime; 
+} schema_File;
+
+typedef struct _schema_Descriptor { 
+    uint32_t mode; /* file mode (stat st_mode) */
+    bool has_file;
+    schema_File file; 
+} schema_Descriptor;
+
+typedef struct _schema_Streams { 
+    bool has_stdin;
+    schema_Descriptor stdin; 
+    bool has_stdout;
+    schema_Descriptor stdout; 
+    bool has_stderr;
+    schema_Descriptor stderr; 
+} schema_Streams;
+
+typedef struct _schema_Process { 
+    uint64_t creation_timestamp; /* Only populated in ExecuteEvent, in ns. */
+    pb_callback_t uuid; 
+    uint32_t pid; 
+    bool has_binary;
+    schema_File binary; /* Only populated in ExecuteEvent. */
+    uint32_t parent_pid; 
+    pb_callback_t parent_uuid; 
+    uint64_t container_id; /* unique id of process's container */
+    uint32_t container_pid; /* pid inside the container namespace pid */
+    uint32_t container_parent_pid; /* optional */
+    bool has_args;
+    schema_ProcessArguments args; /* Only populated in ExecuteEvent. */
+    bool has_streams;
+    schema_Streams streams; /* Only populated in ExecuteEvent. */
+    uint64_t exec_session_id; /* identifier set for kubectl exec sessions. */
+} schema_Process;
+
+/* A process clone is being created. This message means that a cloning operation
+ is being attempted. It may be sent even if fork fails. */
+typedef struct _schema_CloneEvent { 
+    bool has_proc;
+    schema_Process proc; 
+} schema_CloneEvent;
+
+/* Processes that are enumerated at startup will be sent with this event. There
+ is no distinction from events we would have seen from fork or exec. */
+typedef struct _schema_EnumerateProcessEvent { 
+    bool has_proc;
+    schema_Process proc; 
+} schema_EnumerateProcessEvent;
+
+/* A binary being executed.
+ e.g., execve() */
+typedef struct _schema_ExecuteEvent { 
+    bool has_proc;
+    schema_Process proc; 
+} schema_ExecuteEvent;
+
+/* Collect information about mmap/mprotect calls with the PROT_EXEC flag set. */
+typedef struct _schema_MemoryExecEvent { 
+    bool has_proc;
+    schema_Process proc; /* The origin process */
+    /* The timestamp in ns when the memory was set executable */
+    uint64_t prot_exec_timestamp; 
+    /* The prot flags granted by the kernel for the operation */
+    uint64_t new_flags; 
+    /* The prot flags requested for the mprotect/mmap operation */
+    uint64_t req_flags; 
+    /* The vm_flags prior to the mprotect operation, if relevant */
+    uint64_t old_vm_flags; 
+    /* The operational flags for the mmap operation, if relevant */
+    uint64_t mmap_flags; 
+    /* Derived from the file struct describing the fd being mapped */
+    bool has_mapped_file;
+    schema_File mapped_file; 
+    schema_MemoryExecEvent_Action action; 
+    uint64_t start_addr; /* The executable memory region start addr */
+    uint64_t end_addr; /* The executable memory region end addr */
+    /* True if this event is a mmap of the process' binary */
+    bool is_initial_mmap; 
+} schema_MemoryExecEvent;
+
+/* Next ID: 8 */
+typedef struct _schema_Event { 
+    pb_size_t which_event;
+    union {
+        schema_ExecuteEvent execute;
+        schema_ContainerInfoEvent container;
+        schema_ExitEvent exit;
+        schema_MemoryExecEvent memexec;
+        schema_CloneEvent clone;
+        schema_EnumerateProcessEvent enumproc;
+    } event; 
+    uint64_t timestamp; 
+} schema_Event;
+
+
+/* Helper constants for enums */
+#define _schema_MemoryExecEvent_Action_MIN schema_MemoryExecEvent_Action_UNDEFINED
+#define _schema_MemoryExecEvent_Action_MAX schema_MemoryExecEvent_Action_MMAP_FILE
+#define _schema_MemoryExecEvent_Action_ARRAYSIZE ((schema_MemoryExecEvent_Action)(schema_MemoryExecEvent_Action_MMAP_FILE+1))
+
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/* Initializer values for message structs */
+#define schema_SocketIp_init_default             {0, {{NULL}, NULL}, 0}
+#define schema_Socket_init_default               {false, schema_SocketIp_init_default, false, schema_SocketIp_init_default}
+#define schema_Overlay_init_default              {0, 0, {{NULL}, NULL}}
+#define schema_File_init_default                 {{{NULL}, NULL}, 0, {schema_Overlay_init_default}, 0, 0}
+#define schema_ProcessArguments_init_default     {{{NULL}, NULL}, 0, {{NULL}, NULL}, 0}
+#define schema_Descriptor_init_default           {0, false, schema_File_init_default}
+#define schema_Streams_init_default              {false, schema_Descriptor_init_default, false, schema_Descriptor_init_default, false, schema_Descriptor_init_default}
+#define schema_Process_init_default              {0, {{NULL}, NULL}, 0, false, schema_File_init_default, 0, {{NULL}, NULL}, 0, 0, 0, false, schema_ProcessArguments_init_default, false, schema_Streams_init_default, 0}
+#define schema_Container_init_default            {0, {{NULL}, NULL}, {{NULL}, NULL}, 0, {{NULL}, NULL}, {{NULL}, NULL}, {{NULL}, NULL}, {{NULL}, NULL}, {{NULL}, NULL}}
+#define schema_ExecuteEvent_init_default         {false, schema_Process_init_default}
+#define schema_CloneEvent_init_default           {false, schema_Process_init_default}
+#define schema_EnumerateProcessEvent_init_default {false, schema_Process_init_default}
+#define schema_MemoryExecEvent_init_default      {false, schema_Process_init_default, 0, 0, 0, 0, 0, false, schema_File_init_default, _schema_MemoryExecEvent_Action_MIN, 0, 0, 0}
+#define schema_ContainerInfoEvent_init_default   {false, schema_Container_init_default}
+#define schema_ExitEvent_init_default            {{{NULL}, NULL}}
+#define schema_Event_init_default                {0, {schema_ExecuteEvent_init_default}, 0}
+#define schema_ContainerReport_init_default      {0, false, schema_Container_init_default}
+#define schema_SocketIp_init_zero                {0, {{NULL}, NULL}, 0}
+#define schema_Socket_init_zero                  {false, schema_SocketIp_init_zero, false, schema_SocketIp_init_zero}
+#define schema_Overlay_init_zero                 {0, 0, {{NULL}, NULL}}
+#define schema_File_init_zero                    {{{NULL}, NULL}, 0, {schema_Overlay_init_zero}, 0, 0}
+#define schema_ProcessArguments_init_zero        {{{NULL}, NULL}, 0, {{NULL}, NULL}, 0}
+#define schema_Descriptor_init_zero              {0, false, schema_File_init_zero}
+#define schema_Streams_init_zero                 {false, schema_Descriptor_init_zero, false, schema_Descriptor_init_zero, false, schema_Descriptor_init_zero}
+#define schema_Process_init_zero                 {0, {{NULL}, NULL}, 0, false, schema_File_init_zero, 0, {{NULL}, NULL}, 0, 0, 0, false, schema_ProcessArguments_init_zero, false, schema_Streams_init_zero, 0}
+#define schema_Container_init_zero               {0, {{NULL}, NULL}, {{NULL}, NULL}, 0, {{NULL}, NULL}, {{NULL}, NULL}, {{NULL}, NULL}, {{NULL}, NULL}, {{NULL}, NULL}}
+#define schema_ExecuteEvent_init_zero            {false, schema_Process_init_zero}
+#define schema_CloneEvent_init_zero              {false, schema_Process_init_zero}
+#define schema_EnumerateProcessEvent_init_zero   {false, schema_Process_init_zero}
+#define schema_MemoryExecEvent_init_zero         {false, schema_Process_init_zero, 0, 0, 0, 0, 0, false, schema_File_init_zero, _schema_MemoryExecEvent_Action_MIN, 0, 0, 0}
+#define schema_ContainerInfoEvent_init_zero      {false, schema_Container_init_zero}
+#define schema_ExitEvent_init_zero               {{{NULL}, NULL}}
+#define schema_Event_init_zero                   {0, {schema_ExecuteEvent_init_zero}, 0}
+#define schema_ContainerReport_init_zero         {0, false, schema_Container_init_zero}
+
+/* Field tags (for use in manual encoding/decoding) */
+#define schema_ExitEvent_process_uuid_tag        1
+#define schema_Container_creation_timestamp_tag  1
+#define schema_Container_pod_namespace_tag       2
+#define schema_Container_pod_name_tag            3
+#define schema_Container_container_id_tag        4
+#define schema_Container_container_name_tag      5
+#define schema_Container_container_image_uri_tag 6
+#define schema_Container_labels_tag              7
+#define schema_Container_init_uuid_tag           8
+#define schema_Container_container_image_id_tag  9
+#define schema_Overlay_lower_layer_tag           1
+#define schema_Overlay_upper_layer_tag           2
+#define schema_Overlay_modified_uuid_tag         3
+#define schema_ProcessArguments_argv_tag         1
+#define schema_ProcessArguments_argv_truncated_tag 2
+#define schema_ProcessArguments_envp_tag         3
+#define schema_ProcessArguments_envp_truncated_tag 4
+#define schema_SocketIp_family_tag               1
+#define schema_SocketIp_ip_tag                   2
+#define schema_SocketIp_port_tag                 3
+#define schema_ContainerInfoEvent_container_tag  1
+#define schema_ContainerReport_pid_tag           1
+#define schema_ContainerReport_container_tag     2
+#define schema_Socket_local_tag                  1
+#define schema_Socket_remote_tag                 2
+#define schema_File_fullpath_tag                 1
+#define schema_File_overlayfs_tag                2
+#define schema_File_socket_tag                   4
+#define schema_File_ino_tag                      3
+#define schema_File_ctime_tag                    5
+#define schema_Descriptor_mode_tag               1
+#define schema_Descriptor_file_tag               2
+#define schema_Streams_stdin_tag                 1
+#define schema_Streams_stdout_tag                2
+#define schema_Streams_stderr_tag                3
+#define schema_Process_creation_timestamp_tag    1
+#define schema_Process_uuid_tag                  2
+#define schema_Process_pid_tag                   3
+#define schema_Process_binary_tag                4
+#define schema_Process_parent_pid_tag            5
+#define schema_Process_parent_uuid_tag           6
+#define schema_Process_container_id_tag          7
+#define schema_Process_container_pid_tag         8
+#define schema_Process_container_parent_pid_tag  9
+#define schema_Process_args_tag                  10
+#define schema_Process_streams_tag               11
+#define schema_Process_exec_session_id_tag       12
+#define schema_CloneEvent_proc_tag               1
+#define schema_EnumerateProcessEvent_proc_tag    1
+#define schema_ExecuteEvent_proc_tag             1
+#define schema_MemoryExecEvent_proc_tag          1
+#define schema_MemoryExecEvent_prot_exec_timestamp_tag 2
+#define schema_MemoryExecEvent_new_flags_tag     3
+#define schema_MemoryExecEvent_req_flags_tag     4
+#define schema_MemoryExecEvent_old_vm_flags_tag  5
+#define schema_MemoryExecEvent_mmap_flags_tag    6
+#define schema_MemoryExecEvent_mapped_file_tag   7
+#define schema_MemoryExecEvent_action_tag        8
+#define schema_MemoryExecEvent_start_addr_tag    9
+#define schema_MemoryExecEvent_end_addr_tag      10
+#define schema_MemoryExecEvent_is_initial_mmap_tag 11
+#define schema_Event_execute_tag                 1
+#define schema_Event_container_tag               2
+#define schema_Event_exit_tag                    3
+#define schema_Event_memexec_tag                 4
+#define schema_Event_clone_tag                   5
+#define schema_Event_enumproc_tag                7
+#define schema_Event_timestamp_tag               6
+
+/* Struct field encoding specification for nanopb */
+#define schema_SocketIp_FIELDLIST(X, a) \
+X(a, STATIC,   SINGULAR, UINT32,   family,            1) \
+X(a, CALLBACK, SINGULAR, BYTES,    ip,                2) \
+X(a, STATIC,   SINGULAR, UINT32,   port,              3)
+#define schema_SocketIp_CALLBACK pb_default_field_callback
+#define schema_SocketIp_DEFAULT NULL
+
+#define schema_Socket_FIELDLIST(X, a) \
+X(a, STATIC,   OPTIONAL, MESSAGE,  local,             1) \
+X(a, STATIC,   OPTIONAL, MESSAGE,  remote,            2)
+#define schema_Socket_CALLBACK NULL
+#define schema_Socket_DEFAULT NULL
+#define schema_Socket_local_MSGTYPE schema_SocketIp
+#define schema_Socket_remote_MSGTYPE schema_SocketIp
+
+#define schema_Overlay_FIELDLIST(X, a) \
+X(a, STATIC,   SINGULAR, BOOL,     lower_layer,       1) \
+X(a, STATIC,   SINGULAR, BOOL,     upper_layer,       2) \
+X(a, CALLBACK, SINGULAR, BYTES,    modified_uuid,     3)
+#define schema_Overlay_CALLBACK pb_default_field_callback
+#define schema_Overlay_DEFAULT NULL
+
+#define schema_File_FIELDLIST(X, a) \
+X(a, CALLBACK, SINGULAR, BYTES,    fullpath,          1) \
+X(a, STATIC,   ONEOF,    MESSAGE,  (filesystem,overlayfs,filesystem.overlayfs),   2) \
+X(a, STATIC,   SINGULAR, UINT32,   ino,               3) \
+X(a, STATIC,   ONEOF,    MESSAGE,  (filesystem,socket,filesystem.socket),   4) \
+X(a, STATIC,   SINGULAR, UINT64,   ctime,             5)
+#define schema_File_CALLBACK pb_default_field_callback
+#define schema_File_DEFAULT NULL
+#define schema_File_filesystem_overlayfs_MSGTYPE schema_Overlay
+#define schema_File_filesystem_socket_MSGTYPE schema_Socket
+
+#define schema_ProcessArguments_FIELDLIST(X, a) \
+X(a, CALLBACK, REPEATED, BYTES,    argv,              1) \
+X(a, STATIC,   SINGULAR, UINT32,   argv_truncated,    2) \
+X(a, CALLBACK, REPEATED, BYTES,    envp,              3) \
+X(a, STATIC,   SINGULAR, UINT32,   envp_truncated,    4)
+#define schema_ProcessArguments_CALLBACK pb_default_field_callback
+#define schema_ProcessArguments_DEFAULT NULL
+
+#define schema_Descriptor_FIELDLIST(X, a) \
+X(a, STATIC,   SINGULAR, UINT32,   mode,              1) \
+X(a, STATIC,   OPTIONAL, MESSAGE,  file,              2)
+#define schema_Descriptor_CALLBACK NULL
+#define schema_Descriptor_DEFAULT NULL
+#define schema_Descriptor_file_MSGTYPE schema_File
+
+#define schema_Streams_FIELDLIST(X, a) \
+X(a, STATIC,   OPTIONAL, MESSAGE,  stdin,             1) \
+X(a, STATIC,   OPTIONAL, MESSAGE,  stdout,            2) \
+X(a, STATIC,   OPTIONAL, MESSAGE,  stderr,            3)
+#define schema_Streams_CALLBACK NULL
+#define schema_Streams_DEFAULT NULL
+#define schema_Streams_stdin_MSGTYPE schema_Descriptor
+#define schema_Streams_stdout_MSGTYPE schema_Descriptor
+#define schema_Streams_stderr_MSGTYPE schema_Descriptor
+
+#define schema_Process_FIELDLIST(X, a) \
+X(a, STATIC,   SINGULAR, UINT64,   creation_timestamp,   1) \
+X(a, CALLBACK, SINGULAR, BYTES,    uuid,              2) \
+X(a, STATIC,   SINGULAR, UINT32,   pid,               3) \
+X(a, STATIC,   OPTIONAL, MESSAGE,  binary,            4) \
+X(a, STATIC,   SINGULAR, UINT32,   parent_pid,        5) \
+X(a, CALLBACK, SINGULAR, BYTES,    parent_uuid,       6) \
+X(a, STATIC,   SINGULAR, UINT64,   container_id,      7) \
+X(a, STATIC,   SINGULAR, UINT32,   container_pid,     8) \
+X(a, STATIC,   SINGULAR, UINT32,   container_parent_pid,   9) \
+X(a, STATIC,   OPTIONAL, MESSAGE,  args,             10) \
+X(a, STATIC,   OPTIONAL, MESSAGE,  streams,          11) \
+X(a, STATIC,   SINGULAR, UINT64,   exec_session_id,  12)
+#define schema_Process_CALLBACK pb_default_field_callback
+#define schema_Process_DEFAULT NULL
+#define schema_Process_binary_MSGTYPE schema_File
+#define schema_Process_args_MSGTYPE schema_ProcessArguments
+#define schema_Process_streams_MSGTYPE schema_Streams
+
+#define schema_Container_FIELDLIST(X, a) \
+X(a, STATIC,   SINGULAR, UINT64,   creation_timestamp,   1) \
+X(a, CALLBACK, SINGULAR, BYTES,    pod_namespace,     2) \
+X(a, CALLBACK, SINGULAR, BYTES,    pod_name,          3) \
+X(a, STATIC,   SINGULAR, UINT64,   container_id,      4) \
+X(a, CALLBACK, SINGULAR, BYTES,    container_name,    5) \
+X(a, CALLBACK, SINGULAR, BYTES,    container_image_uri,   6) \
+X(a, CALLBACK, REPEATED, BYTES,    labels,            7) \
+X(a, CALLBACK, SINGULAR, BYTES,    init_uuid,         8) \
+X(a, CALLBACK, SINGULAR, BYTES,    container_image_id,   9)
+#define schema_Container_CALLBACK pb_default_field_callback
+#define schema_Container_DEFAULT NULL
+
+#define schema_ExecuteEvent_FIELDLIST(X, a) \
+X(a, STATIC,   OPTIONAL, MESSAGE,  proc,              1)
+#define schema_ExecuteEvent_CALLBACK NULL
+#define schema_ExecuteEvent_DEFAULT NULL
+#define schema_ExecuteEvent_proc_MSGTYPE schema_Process
+
+#define schema_CloneEvent_FIELDLIST(X, a) \
+X(a, STATIC,   OPTIONAL, MESSAGE,  proc,              1)
+#define schema_CloneEvent_CALLBACK NULL
+#define schema_CloneEvent_DEFAULT NULL
+#define schema_CloneEvent_proc_MSGTYPE schema_Process
+
+#define schema_EnumerateProcessEvent_FIELDLIST(X, a) \
+X(a, STATIC,   OPTIONAL, MESSAGE,  proc,              1)
+#define schema_EnumerateProcessEvent_CALLBACK NULL
+#define schema_EnumerateProcessEvent_DEFAULT NULL
+#define schema_EnumerateProcessEvent_proc_MSGTYPE schema_Process
+
+#define schema_MemoryExecEvent_FIELDLIST(X, a) \
+X(a, STATIC,   OPTIONAL, MESSAGE,  proc,              1) \
+X(a, STATIC,   SINGULAR, UINT64,   prot_exec_timestamp,   2) \
+X(a, STATIC,   SINGULAR, UINT64,   new_flags,         3) \
+X(a, STATIC,   SINGULAR, UINT64,   req_flags,         4) \
+X(a, STATIC,   SINGULAR, UINT64,   old_vm_flags,      5) \
+X(a, STATIC,   SINGULAR, UINT64,   mmap_flags,        6) \
+X(a, STATIC,   OPTIONAL, MESSAGE,  mapped_file,       7) \
+X(a, STATIC,   SINGULAR, UENUM,    action,            8) \
+X(a, STATIC,   SINGULAR, UINT64,   start_addr,        9) \
+X(a, STATIC,   SINGULAR, UINT64,   end_addr,         10) \
+X(a, STATIC,   SINGULAR, BOOL,     is_initial_mmap,  11)
+#define schema_MemoryExecEvent_CALLBACK NULL
+#define schema_MemoryExecEvent_DEFAULT NULL
+#define schema_MemoryExecEvent_proc_MSGTYPE schema_Process
+#define schema_MemoryExecEvent_mapped_file_MSGTYPE schema_File
+
+#define schema_ContainerInfoEvent_FIELDLIST(X, a) \
+X(a, STATIC,   OPTIONAL, MESSAGE,  container,         1)
+#define schema_ContainerInfoEvent_CALLBACK NULL
+#define schema_ContainerInfoEvent_DEFAULT NULL
+#define schema_ContainerInfoEvent_container_MSGTYPE schema_Container
+
+#define schema_ExitEvent_FIELDLIST(X, a) \
+X(a, CALLBACK, SINGULAR, BYTES,    process_uuid,      1)
+#define schema_ExitEvent_CALLBACK pb_default_field_callback
+#define schema_ExitEvent_DEFAULT NULL
+
+#define schema_Event_FIELDLIST(X, a) \
+X(a, STATIC,   ONEOF,    MESSAGE,  (event,execute,event.execute),   1) \
+X(a, STATIC,   ONEOF,    MESSAGE,  (event,container,event.container),   2) \
+X(a, STATIC,   ONEOF,    MESSAGE,  (event,exit,event.exit),   3) \
+X(a, STATIC,   ONEOF,    MESSAGE,  (event,memexec,event.memexec),   4) \
+X(a, STATIC,   ONEOF,    MESSAGE,  (event,clone,event.clone),   5) \
+X(a, STATIC,   SINGULAR, UINT64,   timestamp,         6) \
+X(a, STATIC,   ONEOF,    MESSAGE,  (event,enumproc,event.enumproc),   7)
+#define schema_Event_CALLBACK NULL
+#define schema_Event_DEFAULT NULL
+#define schema_Event_event_execute_MSGTYPE schema_ExecuteEvent
+#define schema_Event_event_container_MSGTYPE schema_ContainerInfoEvent
+#define schema_Event_event_exit_MSGTYPE schema_ExitEvent
+#define schema_Event_event_memexec_MSGTYPE schema_MemoryExecEvent
+#define schema_Event_event_clone_MSGTYPE schema_CloneEvent
+#define schema_Event_event_enumproc_MSGTYPE schema_EnumerateProcessEvent
+
+#define schema_ContainerReport_FIELDLIST(X, a) \
+X(a, STATIC,   SINGULAR, UINT32,   pid,               1) \
+X(a, STATIC,   OPTIONAL, MESSAGE,  container,         2)
+#define schema_ContainerReport_CALLBACK NULL
+#define schema_ContainerReport_DEFAULT NULL
+#define schema_ContainerReport_container_MSGTYPE schema_Container
+
+extern const pb_msgdesc_t schema_SocketIp_msg;
+extern const pb_msgdesc_t schema_Socket_msg;
+extern const pb_msgdesc_t schema_Overlay_msg;
+extern const pb_msgdesc_t schema_File_msg;
+extern const pb_msgdesc_t schema_ProcessArguments_msg;
+extern const pb_msgdesc_t schema_Descriptor_msg;
+extern const pb_msgdesc_t schema_Streams_msg;
+extern const pb_msgdesc_t schema_Process_msg;
+extern const pb_msgdesc_t schema_Container_msg;
+extern const pb_msgdesc_t schema_ExecuteEvent_msg;
+extern const pb_msgdesc_t schema_CloneEvent_msg;
+extern const pb_msgdesc_t schema_EnumerateProcessEvent_msg;
+extern const pb_msgdesc_t schema_MemoryExecEvent_msg;
+extern const pb_msgdesc_t schema_ContainerInfoEvent_msg;
+extern const pb_msgdesc_t schema_ExitEvent_msg;
+extern const pb_msgdesc_t schema_Event_msg;
+extern const pb_msgdesc_t schema_ContainerReport_msg;
+
+/* Defines for backwards compatibility with code written before nanopb-0.4.0 */
+#define schema_SocketIp_fields &schema_SocketIp_msg
+#define schema_Socket_fields &schema_Socket_msg
+#define schema_Overlay_fields &schema_Overlay_msg
+#define schema_File_fields &schema_File_msg
+#define schema_ProcessArguments_fields &schema_ProcessArguments_msg
+#define schema_Descriptor_fields &schema_Descriptor_msg
+#define schema_Streams_fields &schema_Streams_msg
+#define schema_Process_fields &schema_Process_msg
+#define schema_Container_fields &schema_Container_msg
+#define schema_ExecuteEvent_fields &schema_ExecuteEvent_msg
+#define schema_CloneEvent_fields &schema_CloneEvent_msg
+#define schema_EnumerateProcessEvent_fields &schema_EnumerateProcessEvent_msg
+#define schema_MemoryExecEvent_fields &schema_MemoryExecEvent_msg
+#define schema_ContainerInfoEvent_fields &schema_ContainerInfoEvent_msg
+#define schema_ExitEvent_fields &schema_ExitEvent_msg
+#define schema_Event_fields &schema_Event_msg
+#define schema_ContainerReport_fields &schema_ContainerReport_msg
+
+/* Maximum encoded size of messages (where known) */
+/* schema_SocketIp_size depends on runtime parameters */
+/* schema_Socket_size depends on runtime parameters */
+/* schema_Overlay_size depends on runtime parameters */
+/* schema_File_size depends on runtime parameters */
+/* schema_ProcessArguments_size depends on runtime parameters */
+/* schema_Descriptor_size depends on runtime parameters */
+/* schema_Streams_size depends on runtime parameters */
+/* schema_Process_size depends on runtime parameters */
+/* schema_Container_size depends on runtime parameters */
+/* schema_ExecuteEvent_size depends on runtime parameters */
+/* schema_CloneEvent_size depends on runtime parameters */
+/* schema_EnumerateProcessEvent_size depends on runtime parameters */
+/* schema_MemoryExecEvent_size depends on runtime parameters */
+/* schema_ContainerInfoEvent_size depends on runtime parameters */
+/* schema_ExitEvent_size depends on runtime parameters */
+/* schema_Event_size depends on runtime parameters */
+/* schema_ContainerReport_size depends on runtime parameters */
+
+#ifdef __cplusplus
+} /* extern "C" */
+#endif
+
+#endif

diff --git a/security/container/protos/event.proto b/security/container/protos/event.proto
new file mode 100644
index 0000000..dfe483f
--- /dev/null
+++ b/security/container/protos/event.proto

@@ -0,0 +1,151 @@
+syntax = "proto3";
+
+package schema;
+
+message SocketIp {
+  uint32 family = 1;  // AF_* for socket type.
+  bytes ip = 2;       // ip4 or ip6 address.
+  uint32 port = 3;    // port bind or connected.
+}
+
+message Socket {
+  SocketIp local = 1;
+  SocketIp remote = 2;  // unset if not connected.
+}
+
+message Overlay {
+  bool lower_layer = 1;
+  bool upper_layer = 2;
+  bytes modified_uuid = 3;  // The process who first modified the file.
+}
+
+message File {
+  bytes fullpath = 1;
+  uint32 ino = 3;  // inode number.
+  oneof filesystem {
+    Overlay overlayfs = 2;
+    Socket socket = 4;
+  }
+}
+
+message ProcessArguments {
+  repeated bytes argv = 1;    // process arguments
+  uint32 argv_truncated = 2;  // number of characters truncated from argv
+  repeated bytes envp = 3;    // process environment variables
+  uint32 envp_truncated = 4;  // number of characters truncated from envp
+}
+
+message Descriptor {
+  uint32 mode = 1;  // file mode (stat st_mode)
+  File file = 2;
+}
+
+message Streams {
+  Descriptor stdin = 1;
+  Descriptor stdout = 2;
+  Descriptor stderr = 3;
+}
+
+message Process {
+  uint64 creation_timestamp = 1;  // Only populated in ExecuteEvent, in ns.
+  bytes uuid = 2;
+  uint32 pid = 3;
+  File binary = 4;  // Only populated in ExecuteEvent.
+  uint32 parent_pid = 5;
+  bytes parent_uuid = 6;
+  uint64 container_id = 7;          // unique id of process's container
+  uint32 container_pid = 8;         // pid inside the container namespace pid
+  uint32 container_parent_pid = 9;  // optional
+  ProcessArguments args = 10;       // Only populated in ExecuteEvent.
+  Streams streams = 11;             // Only populated in ExecuteEvent.
+  uint64 exec_session_id = 12;      // identifier set for kubectl exec sessions.
+}
+
+message Container {
+  uint64 creation_timestamp = 1;  // container create time in ns
+  bytes pod_namespace = 2;
+  bytes pod_name = 3;
+  uint64 container_id = 4;  // unique across lifetime of Node
+  bytes container_name = 5;
+  bytes container_image_uri = 6;
+  repeated bytes labels = 7;
+  bytes init_uuid = 8;
+  bytes container_image_id = 9;
+}
+
+// A binary being executed.
+// e.g., execve()
+message ExecuteEvent {
+  Process proc = 1;
+}
+
+// A process clone is being created. This message means that a cloning operation
+// is being attempted. It may be sent even if fork fails.
+message CloneEvent {
+  Process proc = 1;
+}
+
+// Processes that are enumerated at startup will be sent with this event. There
+// is no distinction from events we would have seen from fork or exec.
+message EnumerateProcessEvent {
+  Process proc = 1;
+}
+
+// Collect information about mmap/mprotect calls with the PROT_EXEC flag set.
+message MemoryExecEvent {
+  Process proc = 1;  // The origin process
+  // The timestamp in ns when the memory was set executable
+  uint64 prot_exec_timestamp = 2;
+  // The prot flags granted by the kernel for the operation
+  uint64 new_flags = 3;
+  // The prot flags requested for the mprotect/mmap operation
+  uint64 req_flags = 4;
+  // The vm_flags prior to the mprotect operation, if relevant
+  uint64 old_vm_flags = 5;
+  // The operational flags for the mmap operation, if relevant
+  uint64 mmap_flags = 6;
+  // Derived from the file struct describing the fd being mapped
+  File mapped_file = 7;
+  enum Action {
+    UNDEFINED = 0;
+    MPROTECT = 1;
+    MMAP_FILE = 2;
+  }
+  Action action = 8;
+
+  uint64 start_addr = 9;  // The executable memory region start addr
+  uint64 end_addr = 10;   // The executable memory region end addr
+  // True if this event is a mmap of the process' binary
+  bool is_initial_mmap = 11;
+}
+
+// Associate the following container information with all processes
+// that have the indicated container_id.
+message ContainerInfoEvent {
+  Container container = 1;
+}
+
+// The process with the indicated pid has exited.
+message ExitEvent {
+  bytes process_uuid = 1;
+}
+
+// Next ID: 8
+message Event {
+  oneof event {
+    ExecuteEvent execute = 1;
+    ContainerInfoEvent container = 2;
+    ExitEvent exit = 3;
+    MemoryExecEvent memexec = 4;
+    CloneEvent clone = 5;
+    EnumerateProcessEvent enumproc = 7;
+  }
+
+  uint64 timestamp = 6;  // In nanoseconds
+}
+
+// Message sent by the daemonset to the LSM for container enlightenment.
+message ContainerReport {
+  uint32 pid = 1;           // Top pid of the running container.
+  Container container = 2;  // Information collected about the container.
+}

diff --git a/security/container/protos/nanopb/LICENSE b/security/container/protos/nanopb/LICENSE
new file mode 100644
index 0000000..a83630a
--- /dev/null
+++ b/security/container/protos/nanopb/LICENSE

@@ -0,0 +1,20 @@
+Copyright (c) 2011 Petteri Aimonen <jpa at nanopb.mail.kapsi.fi>
+
+This software is provided 'as-is', without any express or
+implied warranty. In no event will the authors be held liable
+for any damages arising from the use of this software.
+
+Permission is granted to anyone to use this software for any
+purpose, including commercial applications, and to alter it and
+redistribute it freely, subject to the following restrictions:
+
+1. The origin of this software must not be misrepresented; you
+   must not claim that you wrote the original software. If you use
+   this software in a product, an acknowledgment in the product
+   documentation would be appreciated but is not required.
+
+2. Altered source versions must be plainly marked as such, and
+   must not be misrepresented as being the original software.
+
+3. This notice may not be removed or altered from any source
+   distribution.

diff --git a/security/container/protos/nanopb/METADATA b/security/container/protos/nanopb/METADATA
new file mode 100644
index 0000000..6b85630
--- /dev/null
+++ b/security/container/protos/nanopb/METADATA

@@ -0,0 +1,23 @@
+name: "nanopb"
+description: "Nanopb is a C library for encoding and decoding protocol buffers."
+
+third_party {
+  url {
+    type: GIT
+    value: "https://github.com/nanopb/nanopb/"
+  }
+  version: "0.4.5"
+  last_upgrade_date: {
+    year: 2021
+    month: 8
+    day: 12
+  }
+  license_type: NOTICE
+  security {
+    category: REVIEWED_AND_SECURE
+    note: "https://buganizer.corp.google.com/u/0/issues/19409596, https://buganizer.corp.google.com/u/0/issues/120506242"
+    tag: "NVD-CPE2.3:cpe:/a:nanopb_project:nanopb"
+    tag: "vuln_reporting:buganizer_component:588910"
+    tag: "vuln_reporting:contact_emails:"  # Blunderbuss will assign bugs.
+  }
+}

diff --git a/security/container/protos/nanopb/Makefile b/security/container/protos/nanopb/Makefile
new file mode 100644
index 0000000..b7e15f8
--- /dev/null
+++ b/security/container/protos/nanopb/Makefile

@@ -0,0 +1,7 @@
+obj-$(CONFIG_SECURITY_CONTAINER_MONITOR) += nanopb.o
+
+nanopb-y := pb_encode.o pb_decode.o pb_common.o
+
+ccflags-y := -I$(srctree)/security/container/protos \
+	-I$(srctree)/security/container/protos/nanopb \
+	$(PB_CCFLAGS)

diff --git a/security/container/protos/nanopb/pb.h b/security/container/protos/nanopb/pb.h
new file mode 100644
index 0000000..be7c067
--- /dev/null
+++ b/security/container/protos/nanopb/pb.h

@@ -0,0 +1,875 @@
+/* Common parts of the nanopb library. Most of these are quite low-level
+ * stuff. For the high-level interface, see pb_encode.h and pb_decode.h.
+ */
+
+#ifndef PB_H_INCLUDED
+#define PB_H_INCLUDED
+
+/*****************************************************************
+ * Nanopb compilation time options. You can change these here by *
+ * uncommenting the lines, or on the compiler command line.      *
+ *****************************************************************/
+
+/* Enable support for dynamically allocated fields */
+/* #define PB_ENABLE_MALLOC 1 */
+
+/* Define this if your CPU / compiler combination does not support
+ * unaligned memory access to packed structures. */
+/* #define PB_NO_PACKED_STRUCTS 1 */
+
+/* Increase the number of required fields that are tracked.
+ * A compiler warning will tell if you need this. */
+/* #define PB_MAX_REQUIRED_FIELDS 256 */
+
+/* Add support for tag numbers > 65536 and fields larger than 65536 bytes. */
+/* #define PB_FIELD_32BIT 1 */
+
+/* Disable support for error messages in order to save some code space. */
+/* #define PB_NO_ERRMSG 1 */
+
+/* Disable support for custom streams (support only memory buffers). */
+/* #define PB_BUFFER_ONLY 1 */
+
+/* Disable support for 64-bit datatypes, for compilers without int64_t
+   or to save some code space. */
+/* #define PB_WITHOUT_64BIT 1 */
+
+/* Don't encode scalar arrays as packed. This is only to be used when
+ * the decoder on the receiving side cannot process packed scalar arrays.
+ * Such example is older protobuf.js. */
+/* #define PB_ENCODE_ARRAYS_UNPACKED 1 */
+
+/* Enable conversion of doubles to floats for platforms that do not
+ * support 64-bit doubles. Most commonly AVR. */
+/* #define PB_CONVERT_DOUBLE_FLOAT 1 */
+
+/* Check whether incoming strings are valid UTF-8 sequences. Slows down
+ * the string processing slightly and slightly increases code size. */
+/* #define PB_VALIDATE_UTF8 1 */
+
+/******************************************************************
+ * You usually don't need to change anything below this line.     *
+ * Feel free to look around and use the defined macros, though.   *
+ ******************************************************************/
+
+
+/* Version of the nanopb library. Just in case you want to check it in
+ * your own program. */
+#define NANOPB_VERSION nanopb-0.4.5
+
+/* Include all the system headers needed by nanopb. You will need the
+ * definitions of the following:
+ * - strlen, memcpy, memset functions
+ * - [u]int_least8_t, uint_fast8_t, [u]int_least16_t, [u]int32_t, [u]int64_t
+ * - size_t
+ * - bool
+ *
+ * If you don't have the standard header files, you can instead provide
+ * a custom header that defines or includes all this. In that case,
+ * define PB_SYSTEM_HEADER to the path of this file.
+ */
+#ifdef PB_SYSTEM_HEADER
+#include PB_SYSTEM_HEADER
+#else
+#include <stdint.h>
+#include <stddef.h>
+#include <stdbool.h>
+#include <string.h>
+#include <limits.h>
+
+#ifdef PB_ENABLE_MALLOC
+#include <stdlib.h>
+#endif
+#endif
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/* Macro for defining packed structures (compiler dependent).
+ * This just reduces memory requirements, but is not required.
+ */
+#if defined(PB_NO_PACKED_STRUCTS)
+    /* Disable struct packing */
+#   define PB_PACKED_STRUCT_START
+#   define PB_PACKED_STRUCT_END
+#   define pb_packed
+#elif defined(__GNUC__) || defined(__clang__)
+    /* For GCC and clang */
+#   define PB_PACKED_STRUCT_START
+#   define PB_PACKED_STRUCT_END
+#   define pb_packed __attribute__((packed))
+#elif defined(__ICCARM__) || defined(__CC_ARM)
+    /* For IAR ARM and Keil MDK-ARM compilers */
+#   define PB_PACKED_STRUCT_START _Pragma("pack(push, 1)")
+#   define PB_PACKED_STRUCT_END _Pragma("pack(pop)")
+#   define pb_packed
+#elif defined(_MSC_VER) && (_MSC_VER >= 1500)
+    /* For Microsoft Visual C++ */
+#   define PB_PACKED_STRUCT_START __pragma(pack(push, 1))
+#   define PB_PACKED_STRUCT_END __pragma(pack(pop))
+#   define pb_packed
+#else
+    /* Unknown compiler */
+#   define PB_PACKED_STRUCT_START
+#   define PB_PACKED_STRUCT_END
+#   define pb_packed
+#endif
+
+/* Handly macro for suppressing unreferenced-parameter compiler warnings. */
+#ifndef PB_UNUSED
+#define PB_UNUSED(x) (void)(x)
+#endif
+
+/* Harvard-architecture processors may need special attributes for storing
+ * field information in program memory. */
+#ifndef PB_PROGMEM
+#ifdef __AVR__
+#include <avr/pgmspace.h>
+#define PB_PROGMEM             PROGMEM
+#define PB_PROGMEM_READU32(x)  pgm_read_dword(&x)
+#else
+#define PB_PROGMEM
+#define PB_PROGMEM_READU32(x)  (x)
+#endif
+#endif
+
+/* Compile-time assertion, used for checking compatible compilation options.
+ * If this does not work properly on your compiler, use
+ * #define PB_NO_STATIC_ASSERT to disable it.
+ *
+ * But before doing that, check carefully the error message / place where it
+ * comes from to see if the error has a real cause. Unfortunately the error
+ * message is not always very clear to read, but you can see the reason better
+ * in the place where the PB_STATIC_ASSERT macro was called.
+ */
+#ifndef PB_NO_STATIC_ASSERT
+#  ifndef PB_STATIC_ASSERT
+#    if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
+       /* C11 standard _Static_assert mechanism */
+#      define PB_STATIC_ASSERT(COND,MSG) _Static_assert(COND,#MSG);
+#    else
+       /* Classic negative-size-array static assert mechanism */
+#      define PB_STATIC_ASSERT(COND,MSG) typedef char PB_STATIC_ASSERT_MSG(MSG, __LINE__, __COUNTER__)[(COND)?1:-1];
+#      define PB_STATIC_ASSERT_MSG(MSG, LINE, COUNTER) PB_STATIC_ASSERT_MSG_(MSG, LINE, COUNTER)
+#      define PB_STATIC_ASSERT_MSG_(MSG, LINE, COUNTER) pb_static_assertion_##MSG##_##LINE##_##COUNTER
+#    endif
+#  endif
+#else
+   /* Static asserts disabled by PB_NO_STATIC_ASSERT */
+#  define PB_STATIC_ASSERT(COND,MSG)
+#endif
+
+/* Number of required fields to keep track of. */
+#ifndef PB_MAX_REQUIRED_FIELDS
+#define PB_MAX_REQUIRED_FIELDS 64
+#endif
+
+#if PB_MAX_REQUIRED_FIELDS < 64
+#error You should not lower PB_MAX_REQUIRED_FIELDS from the default value (64).
+#endif
+
+#ifdef PB_WITHOUT_64BIT
+#ifdef PB_CONVERT_DOUBLE_FLOAT
+/* Cannot use doubles without 64-bit types */
+#undef PB_CONVERT_DOUBLE_FLOAT
+#endif
+#endif
+
+/* List of possible field types. These are used in the autogenerated code.
+ * Least-significant 4 bits tell the scalar type
+ * Most-significant 4 bits specify repeated/required/packed etc.
+ */
+
+typedef uint_least8_t pb_type_t;
+
+/**** Field data types ****/
+
+/* Numeric types */
+#define PB_LTYPE_BOOL    0x00U /* bool */
+#define PB_LTYPE_VARINT  0x01U /* int32, int64, enum, bool */
+#define PB_LTYPE_UVARINT 0x02U /* uint32, uint64 */
+#define PB_LTYPE_SVARINT 0x03U /* sint32, sint64 */
+#define PB_LTYPE_FIXED32 0x04U /* fixed32, sfixed32, float */
+#define PB_LTYPE_FIXED64 0x05U /* fixed64, sfixed64, double */
+
+/* Marker for last packable field type. */
+#define PB_LTYPE_LAST_PACKABLE 0x05U
+
+/* Byte array with pre-allocated buffer.
+ * data_size is the length of the allocated PB_BYTES_ARRAY structure. */
+#define PB_LTYPE_BYTES 0x06U
+
+/* String with pre-allocated buffer.
+ * data_size is the maximum length. */
+#define PB_LTYPE_STRING 0x07U
+
+/* Submessage
+ * submsg_fields is pointer to field descriptions */
+#define PB_LTYPE_SUBMESSAGE 0x08U
+
+/* Submessage with pre-decoding callback
+ * The pre-decoding callback is stored as pb_callback_t right before pSize.
+ * submsg_fields is pointer to field descriptions */
+#define PB_LTYPE_SUBMSG_W_CB 0x09U
+
+/* Extension pseudo-field
+ * The field contains a pointer to pb_extension_t */
+#define PB_LTYPE_EXTENSION 0x0AU
+
+/* Byte array with inline, pre-allocated byffer.
+ * data_size is the length of the inline, allocated buffer.
+ * This differs from PB_LTYPE_BYTES by defining the element as
+ * pb_byte_t[data_size] rather than pb_bytes_array_t. */
+#define PB_LTYPE_FIXED_LENGTH_BYTES 0x0BU
+
+/* Number of declared LTYPES */
+#define PB_LTYPES_COUNT 0x0CU
+#define PB_LTYPE_MASK 0x0FU
+
+/**** Field repetition rules ****/
+
+#define PB_HTYPE_REQUIRED 0x00U
+#define PB_HTYPE_OPTIONAL 0x10U
+#define PB_HTYPE_SINGULAR 0x10U
+#define PB_HTYPE_REPEATED 0x20U
+#define PB_HTYPE_FIXARRAY 0x20U
+#define PB_HTYPE_ONEOF    0x30U
+#define PB_HTYPE_MASK     0x30U
+
+/**** Field allocation types ****/
+ 
+#define PB_ATYPE_STATIC   0x00U
+#define PB_ATYPE_POINTER  0x80U
+#define PB_ATYPE_CALLBACK 0x40U
+#define PB_ATYPE_MASK     0xC0U
+
+#define PB_ATYPE(x) ((x) & PB_ATYPE_MASK)
+#define PB_HTYPE(x) ((x) & PB_HTYPE_MASK)
+#define PB_LTYPE(x) ((x) & PB_LTYPE_MASK)
+#define PB_LTYPE_IS_SUBMSG(x) (PB_LTYPE(x) == PB_LTYPE_SUBMESSAGE || \
+                               PB_LTYPE(x) == PB_LTYPE_SUBMSG_W_CB)
+
+/* Data type used for storing sizes of struct fields
+ * and array counts.
+ */
+#if defined(PB_FIELD_32BIT)
+    typedef uint32_t pb_size_t;
+    typedef int32_t pb_ssize_t;
+#else
+    typedef uint_least16_t pb_size_t;
+    typedef int_least16_t pb_ssize_t;
+#endif
+#define PB_SIZE_MAX ((pb_size_t)-1)
+
+/* Data type for storing encoded data and other byte streams.
+ * This typedef exists to support platforms where uint8_t does not exist.
+ * You can regard it as equivalent on uint8_t on other platforms.
+ */
+typedef uint_least8_t pb_byte_t;
+
+/* Forward declaration of struct types */
+typedef struct pb_istream_s pb_istream_t;
+typedef struct pb_ostream_s pb_ostream_t;
+typedef struct pb_field_iter_s pb_field_iter_t;
+
+/* This structure is used in auto-generated constants
+ * to specify struct fields.
+ */
+typedef struct pb_msgdesc_s pb_msgdesc_t;
+struct pb_msgdesc_s {
+    const uint32_t *field_info;
+    const pb_msgdesc_t * const * submsg_info;
+    const pb_byte_t *default_value;
+
+    bool (*field_callback)(pb_istream_t *istream, pb_ostream_t *ostream, const pb_field_iter_t *field);
+
+    pb_size_t field_count;
+    pb_size_t required_field_count;
+    pb_size_t largest_tag;
+};
+
+/* Iterator for message descriptor */
+struct pb_field_iter_s {
+    const pb_msgdesc_t *descriptor;  /* Pointer to message descriptor constant */
+    void *message;                   /* Pointer to start of the structure */
+
+    pb_size_t index;                 /* Index of the field */
+    pb_size_t field_info_index;      /* Index to descriptor->field_info array */
+    pb_size_t required_field_index;  /* Index that counts only the required fields */
+    pb_size_t submessage_index;      /* Index that counts only submessages */
+
+    pb_size_t tag;                   /* Tag of current field */
+    pb_size_t data_size;             /* sizeof() of a single item */
+    pb_size_t array_size;            /* Number of array entries */
+    pb_type_t type;                  /* Type of current field */
+
+    void *pField;                    /* Pointer to current field in struct */
+    void *pData;                     /* Pointer to current data contents. Different than pField for arrays and pointers. */
+    void *pSize;                     /* Pointer to count/has field */
+
+    const pb_msgdesc_t *submsg_desc; /* For submessage fields, pointer to field descriptor for the submessage. */
+};
+
+/* For compatibility with legacy code */
+typedef pb_field_iter_t pb_field_t;
+
+/* Make sure that the standard integer types are of the expected sizes.
+ * Otherwise fixed32/fixed64 fields can break.
+ *
+ * If you get errors here, it probably means that your stdint.h is not
+ * correct for your platform.
+ */
+#ifndef PB_WITHOUT_64BIT
+PB_STATIC_ASSERT(sizeof(int64_t) == 2 * sizeof(int32_t), INT64_T_WRONG_SIZE)
+PB_STATIC_ASSERT(sizeof(uint64_t) == 2 * sizeof(uint32_t), UINT64_T_WRONG_SIZE)
+#endif
+
+/* This structure is used for 'bytes' arrays.
+ * It has the number of bytes in the beginning, and after that an array.
+ * Note that actual structs used will have a different length of bytes array.
+ */
+#define PB_BYTES_ARRAY_T(n) struct { pb_size_t size; pb_byte_t bytes[n]; }
+#define PB_BYTES_ARRAY_T_ALLOCSIZE(n) ((size_t)n + offsetof(pb_bytes_array_t, bytes))
+
+struct pb_bytes_array_s {
+    pb_size_t size;
+    pb_byte_t bytes[1];
+};
+typedef struct pb_bytes_array_s pb_bytes_array_t;
+
+/* This structure is used for giving the callback function.
+ * It is stored in the message structure and filled in by the method that
+ * calls pb_decode.
+ *
+ * The decoding callback will be given a limited-length stream
+ * If the wire type was string, the length is the length of the string.
+ * If the wire type was a varint/fixed32/fixed64, the length is the length
+ * of the actual value.
+ * The function may be called multiple times (especially for repeated types,
+ * but also otherwise if the message happens to contain the field multiple
+ * times.)
+ *
+ * The encoding callback will receive the actual output stream.
+ * It should write all the data in one call, including the field tag and
+ * wire type. It can write multiple fields.
+ *
+ * The callback can be null if you want to skip a field.
+ */
+typedef struct pb_callback_s pb_callback_t;
+struct pb_callback_s {
+    /* Callback functions receive a pointer to the arg field.
+     * You can access the value of the field as *arg, and modify it if needed.
+     */
+    union {
+        bool (*decode)(pb_istream_t *stream, const pb_field_t *field, void **arg);
+        bool (*encode)(pb_ostream_t *stream, const pb_field_t *field, void * const *arg);
+    } funcs;
+    
+    /* Free arg for use by callback */
+    void *arg;
+};
+
+extern bool pb_default_field_callback(pb_istream_t *istream, pb_ostream_t *ostream, const pb_field_t *field);
+
+/* Wire types. Library user needs these only in encoder callbacks. */
+typedef enum {
+    PB_WT_VARINT = 0,
+    PB_WT_64BIT  = 1,
+    PB_WT_STRING = 2,
+    PB_WT_32BIT  = 5
+} pb_wire_type_t;
+
+/* Structure for defining the handling of unknown/extension fields.
+ * Usually the pb_extension_type_t structure is automatically generated,
+ * while the pb_extension_t structure is created by the user. However,
+ * if you want to catch all unknown fields, you can also create a custom
+ * pb_extension_type_t with your own callback.
+ */
+typedef struct pb_extension_type_s pb_extension_type_t;
+typedef struct pb_extension_s pb_extension_t;
+struct pb_extension_type_s {
+    /* Called for each unknown field in the message.
+     * If you handle the field, read off all of its data and return true.
+     * If you do not handle the field, do not read anything and return true.
+     * If you run into an error, return false.
+     * Set to NULL for default handler.
+     */
+    bool (*decode)(pb_istream_t *stream, pb_extension_t *extension,
+                   uint32_t tag, pb_wire_type_t wire_type);
+    
+    /* Called once after all regular fields have been encoded.
+     * If you have something to write, do so and return true.
+     * If you do not have anything to write, just return true.
+     * If you run into an error, return false.
+     * Set to NULL for default handler.
+     */
+    bool (*encode)(pb_ostream_t *stream, const pb_extension_t *extension);
+    
+    /* Free field for use by the callback. */
+    const void *arg;
+};
+
+struct pb_extension_s {
+    /* Type describing the extension field. Usually you'll initialize
+     * this to a pointer to the automatically generated structure. */
+    const pb_extension_type_t *type;
+    
+    /* Destination for the decoded data. This must match the datatype
+     * of the extension field. */
+    void *dest;
+    
+    /* Pointer to the next extension handler, or NULL.
+     * If this extension does not match a field, the next handler is
+     * automatically called. */
+    pb_extension_t *next;
+
+    /* The decoder sets this to true if the extension was found.
+     * Ignored for encoding. */
+    bool found;
+};
+
+#define pb_extension_init_zero {NULL,NULL,NULL,false}
+
+/* Memory allocation functions to use. You can define pb_realloc and
+ * pb_free to custom functions if you want. */
+#ifdef PB_ENABLE_MALLOC
+#   ifndef pb_realloc
+#       define pb_realloc(ptr, size) realloc(ptr, size)
+#   endif
+#   ifndef pb_free
+#       define pb_free(ptr) free(ptr)
+#   endif
+#endif
+
+/* This is used to inform about need to regenerate .pb.h/.pb.c files. */
+#define PB_PROTO_HEADER_VERSION 40
+
+/* These macros are used to declare pb_field_t's in the constant array. */
+/* Size of a structure member, in bytes. */
+#define pb_membersize(st, m) (sizeof ((st*)0)->m)
+/* Number of entries in an array. */
+#define pb_arraysize(st, m) (pb_membersize(st, m) / pb_membersize(st, m[0]))
+/* Delta from start of one member to the start of another member. */
+#define pb_delta(st, m1, m2) ((int)offsetof(st, m1) - (int)offsetof(st, m2))
+
+/* Force expansion of macro value */
+#define PB_EXPAND(x) x
+
+/* Binding of a message field set into a specific structure */
+#define PB_BIND(msgname, structname, width) \
+    const uint32_t structname ## _field_info[] PB_PROGMEM = \
+    { \
+        msgname ## _FIELDLIST(PB_GEN_FIELD_INFO_ ## width, structname) \
+        0 \
+    }; \
+    const pb_msgdesc_t* const structname ## _submsg_info[] = \
+    { \
+        msgname ## _FIELDLIST(PB_GEN_SUBMSG_INFO, structname) \
+        NULL \
+    }; \
+    const pb_msgdesc_t structname ## _msg = \
+    { \
+       structname ## _field_info, \
+       structname ## _submsg_info, \
+       msgname ## _DEFAULT, \
+       msgname ## _CALLBACK, \
+       0 msgname ## _FIELDLIST(PB_GEN_FIELD_COUNT, structname), \
+       0 msgname ## _FIELDLIST(PB_GEN_REQ_FIELD_COUNT, structname), \
+       0 msgname ## _FIELDLIST(PB_GEN_LARGEST_TAG, structname), \
+    }; \
+    msgname ## _FIELDLIST(PB_GEN_FIELD_INFO_ASSERT_ ## width, structname)
+
+#define PB_GEN_FIELD_COUNT(structname, atype, htype, ltype, fieldname, tag) +1
+#define PB_GEN_REQ_FIELD_COUNT(structname, atype, htype, ltype, fieldname, tag) \
+    + (PB_HTYPE_ ## htype == PB_HTYPE_REQUIRED)
+#define PB_GEN_LARGEST_TAG(structname, atype, htype, ltype, fieldname, tag) \
+    * 0 + tag
+
+/* X-macro for generating the entries in struct_field_info[] array. */
+#define PB_GEN_FIELD_INFO_1(structname, atype, htype, ltype, fieldname, tag) \
+    PB_FIELDINFO_1(tag, PB_ATYPE_ ## atype | PB_HTYPE_ ## htype | PB_LTYPE_MAP_ ## ltype, \
+                   PB_DATA_OFFSET_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_DATA_SIZE_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_SIZE_OFFSET_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_ARRAY_SIZE_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname))
+
+#define PB_GEN_FIELD_INFO_2(structname, atype, htype, ltype, fieldname, tag) \
+    PB_FIELDINFO_2(tag, PB_ATYPE_ ## atype | PB_HTYPE_ ## htype | PB_LTYPE_MAP_ ## ltype, \
+                   PB_DATA_OFFSET_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_DATA_SIZE_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_SIZE_OFFSET_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_ARRAY_SIZE_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname))
+
+#define PB_GEN_FIELD_INFO_4(structname, atype, htype, ltype, fieldname, tag) \
+    PB_FIELDINFO_4(tag, PB_ATYPE_ ## atype | PB_HTYPE_ ## htype | PB_LTYPE_MAP_ ## ltype, \
+                   PB_DATA_OFFSET_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_DATA_SIZE_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_SIZE_OFFSET_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_ARRAY_SIZE_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname))
+
+#define PB_GEN_FIELD_INFO_8(structname, atype, htype, ltype, fieldname, tag) \
+    PB_FIELDINFO_8(tag, PB_ATYPE_ ## atype | PB_HTYPE_ ## htype | PB_LTYPE_MAP_ ## ltype, \
+                   PB_DATA_OFFSET_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_DATA_SIZE_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_SIZE_OFFSET_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_ARRAY_SIZE_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname))
+
+#define PB_GEN_FIELD_INFO_AUTO(structname, atype, htype, ltype, fieldname, tag) \
+    PB_FIELDINFO_AUTO2(PB_FIELDINFO_WIDTH_AUTO(_PB_ATYPE_ ## atype, _PB_HTYPE_ ## htype, _PB_LTYPE_ ## ltype), \
+                   tag, PB_ATYPE_ ## atype | PB_HTYPE_ ## htype | PB_LTYPE_MAP_ ## ltype, \
+                   PB_DATA_OFFSET_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_DATA_SIZE_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_SIZE_OFFSET_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_ARRAY_SIZE_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname))
+
+#define PB_FIELDINFO_AUTO2(width, tag, type, data_offset, data_size, size_offset, array_size) \
+    PB_FIELDINFO_AUTO3(width, tag, type, data_offset, data_size, size_offset, array_size)
+
+#define PB_FIELDINFO_AUTO3(width, tag, type, data_offset, data_size, size_offset, array_size) \
+    PB_FIELDINFO_ ## width(tag, type, data_offset, data_size, size_offset, array_size)
+
+/* X-macro for generating asserts that entries fit in struct_field_info[] array.
+ * The structure of macros here must match the structure above in PB_GEN_FIELD_INFO_x(),
+ * but it is not easily reused because of how macro substitutions work. */
+#define PB_GEN_FIELD_INFO_ASSERT_1(structname, atype, htype, ltype, fieldname, tag) \
+    PB_FIELDINFO_ASSERT_1(tag, PB_ATYPE_ ## atype | PB_HTYPE_ ## htype | PB_LTYPE_MAP_ ## ltype, \
+                   PB_DATA_OFFSET_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_DATA_SIZE_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_SIZE_OFFSET_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_ARRAY_SIZE_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname))
+
+#define PB_GEN_FIELD_INFO_ASSERT_2(structname, atype, htype, ltype, fieldname, tag) \
+    PB_FIELDINFO_ASSERT_2(tag, PB_ATYPE_ ## atype | PB_HTYPE_ ## htype | PB_LTYPE_MAP_ ## ltype, \
+                   PB_DATA_OFFSET_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_DATA_SIZE_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_SIZE_OFFSET_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_ARRAY_SIZE_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname))
+
+#define PB_GEN_FIELD_INFO_ASSERT_4(structname, atype, htype, ltype, fieldname, tag) \
+    PB_FIELDINFO_ASSERT_4(tag, PB_ATYPE_ ## atype | PB_HTYPE_ ## htype | PB_LTYPE_MAP_ ## ltype, \
+                   PB_DATA_OFFSET_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_DATA_SIZE_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_SIZE_OFFSET_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_ARRAY_SIZE_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname))
+
+#define PB_GEN_FIELD_INFO_ASSERT_8(structname, atype, htype, ltype, fieldname, tag) \
+    PB_FIELDINFO_ASSERT_8(tag, PB_ATYPE_ ## atype | PB_HTYPE_ ## htype | PB_LTYPE_MAP_ ## ltype, \
+                   PB_DATA_OFFSET_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_DATA_SIZE_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_SIZE_OFFSET_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_ARRAY_SIZE_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname))
+
+#define PB_GEN_FIELD_INFO_ASSERT_AUTO(structname, atype, htype, ltype, fieldname, tag) \
+    PB_FIELDINFO_ASSERT_AUTO2(PB_FIELDINFO_WIDTH_AUTO(_PB_ATYPE_ ## atype, _PB_HTYPE_ ## htype, _PB_LTYPE_ ## ltype), \
+                   tag, PB_ATYPE_ ## atype | PB_HTYPE_ ## htype | PB_LTYPE_MAP_ ## ltype, \
+                   PB_DATA_OFFSET_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_DATA_SIZE_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_SIZE_OFFSET_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname), \
+                   PB_ARRAY_SIZE_ ## atype(_PB_HTYPE_ ## htype, structname, fieldname))
+
+#define PB_FIELDINFO_ASSERT_AUTO2(width, tag, type, data_offset, data_size, size_offset, array_size) \
+    PB_FIELDINFO_ASSERT_AUTO3(width, tag, type, data_offset, data_size, size_offset, array_size)
+
+#define PB_FIELDINFO_ASSERT_AUTO3(width, tag, type, data_offset, data_size, size_offset, array_size) \
+    PB_FIELDINFO_ASSERT_ ## width(tag, type, data_offset, data_size, size_offset, array_size)
+
+#define PB_DATA_OFFSET_STATIC(htype, structname, fieldname) PB_DO ## htype(structname, fieldname)
+#define PB_DATA_OFFSET_POINTER(htype, structname, fieldname) PB_DO ## htype(structname, fieldname)
+#define PB_DATA_OFFSET_CALLBACK(htype, structname, fieldname) PB_DO ## htype(structname, fieldname)
+#define PB_DO_PB_HTYPE_REQUIRED(structname, fieldname) offsetof(structname, fieldname)
+#define PB_DO_PB_HTYPE_SINGULAR(structname, fieldname) offsetof(structname, fieldname)
+#define PB_DO_PB_HTYPE_ONEOF(structname, fieldname) offsetof(structname, PB_ONEOF_NAME(FULL, fieldname))
+#define PB_DO_PB_HTYPE_OPTIONAL(structname, fieldname) offsetof(structname, fieldname)
+#define PB_DO_PB_HTYPE_REPEATED(structname, fieldname) offsetof(structname, fieldname)
+#define PB_DO_PB_HTYPE_FIXARRAY(structname, fieldname) offsetof(structname, fieldname)
+
+#define PB_SIZE_OFFSET_STATIC(htype, structname, fieldname) PB_SO ## htype(structname, fieldname)
+#define PB_SIZE_OFFSET_POINTER(htype, structname, fieldname) PB_SO_PTR ## htype(structname, fieldname)
+#define PB_SIZE_OFFSET_CALLBACK(htype, structname, fieldname) PB_SO_CB ## htype(structname, fieldname)
+#define PB_SO_PB_HTYPE_REQUIRED(structname, fieldname) 0
+#define PB_SO_PB_HTYPE_SINGULAR(structname, fieldname) 0
+#define PB_SO_PB_HTYPE_ONEOF(structname, fieldname) PB_SO_PB_HTYPE_ONEOF2(structname, PB_ONEOF_NAME(FULL, fieldname), PB_ONEOF_NAME(UNION, fieldname))
+#define PB_SO_PB_HTYPE_ONEOF2(structname, fullname, unionname) PB_SO_PB_HTYPE_ONEOF3(structname, fullname, unionname)
+#define PB_SO_PB_HTYPE_ONEOF3(structname, fullname, unionname) pb_delta(structname, fullname, which_ ## unionname)
+#define PB_SO_PB_HTYPE_OPTIONAL(structname, fieldname) pb_delta(structname, fieldname, has_ ## fieldname)
+#define PB_SO_PB_HTYPE_REPEATED(structname, fieldname) pb_delta(structname, fieldname, fieldname ## _count)
+#define PB_SO_PB_HTYPE_FIXARRAY(structname, fieldname) 0
+#define PB_SO_PTR_PB_HTYPE_REQUIRED(structname, fieldname) 0
+#define PB_SO_PTR_PB_HTYPE_SINGULAR(structname, fieldname) 0
+#define PB_SO_PTR_PB_HTYPE_ONEOF(structname, fieldname) PB_SO_PB_HTYPE_ONEOF(structname, fieldname)
+#define PB_SO_PTR_PB_HTYPE_OPTIONAL(structname, fieldname) 0
+#define PB_SO_PTR_PB_HTYPE_REPEATED(structname, fieldname) PB_SO_PB_HTYPE_REPEATED(structname, fieldname)
+#define PB_SO_PTR_PB_HTYPE_FIXARRAY(structname, fieldname) 0
+#define PB_SO_CB_PB_HTYPE_REQUIRED(structname, fieldname) 0
+#define PB_SO_CB_PB_HTYPE_SINGULAR(structname, fieldname) 0
+#define PB_SO_CB_PB_HTYPE_ONEOF(structname, fieldname) PB_SO_PB_HTYPE_ONEOF(structname, fieldname)
+#define PB_SO_CB_PB_HTYPE_OPTIONAL(structname, fieldname) 0
+#define PB_SO_CB_PB_HTYPE_REPEATED(structname, fieldname) 0
+#define PB_SO_CB_PB_HTYPE_FIXARRAY(structname, fieldname) 0
+
+#define PB_ARRAY_SIZE_STATIC(htype, structname, fieldname) PB_AS ## htype(structname, fieldname)
+#define PB_ARRAY_SIZE_POINTER(htype, structname, fieldname) PB_AS_PTR ## htype(structname, fieldname)
+#define PB_ARRAY_SIZE_CALLBACK(htype, structname, fieldname) 1
+#define PB_AS_PB_HTYPE_REQUIRED(structname, fieldname) 1
+#define PB_AS_PB_HTYPE_SINGULAR(structname, fieldname) 1
+#define PB_AS_PB_HTYPE_OPTIONAL(structname, fieldname) 1
+#define PB_AS_PB_HTYPE_ONEOF(structname, fieldname) 1
+#define PB_AS_PB_HTYPE_REPEATED(structname, fieldname) pb_arraysize(structname, fieldname)
+#define PB_AS_PB_HTYPE_FIXARRAY(structname, fieldname) pb_arraysize(structname, fieldname)
+#define PB_AS_PTR_PB_HTYPE_REQUIRED(structname, fieldname) 1
+#define PB_AS_PTR_PB_HTYPE_SINGULAR(structname, fieldname) 1
+#define PB_AS_PTR_PB_HTYPE_OPTIONAL(structname, fieldname) 1
+#define PB_AS_PTR_PB_HTYPE_ONEOF(structname, fieldname) 1
+#define PB_AS_PTR_PB_HTYPE_REPEATED(structname, fieldname) 1
+#define PB_AS_PTR_PB_HTYPE_FIXARRAY(structname, fieldname) pb_arraysize(structname, fieldname[0])
+
+#define PB_DATA_SIZE_STATIC(htype, structname, fieldname) PB_DS ## htype(structname, fieldname)
+#define PB_DATA_SIZE_POINTER(htype, structname, fieldname) PB_DS_PTR ## htype(structname, fieldname)
+#define PB_DATA_SIZE_CALLBACK(htype, structname, fieldname) PB_DS_CB ## htype(structname, fieldname)
+#define PB_DS_PB_HTYPE_REQUIRED(structname, fieldname) pb_membersize(structname, fieldname)
+#define PB_DS_PB_HTYPE_SINGULAR(structname, fieldname) pb_membersize(structname, fieldname)
+#define PB_DS_PB_HTYPE_OPTIONAL(structname, fieldname) pb_membersize(structname, fieldname)
+#define PB_DS_PB_HTYPE_ONEOF(structname, fieldname) pb_membersize(structname, PB_ONEOF_NAME(FULL, fieldname))
+#define PB_DS_PB_HTYPE_REPEATED(structname, fieldname) pb_membersize(structname, fieldname[0])
+#define PB_DS_PB_HTYPE_FIXARRAY(structname, fieldname) pb_membersize(structname, fieldname[0])
+#define PB_DS_PTR_PB_HTYPE_REQUIRED(structname, fieldname) pb_membersize(structname, fieldname[0])
+#define PB_DS_PTR_PB_HTYPE_SINGULAR(structname, fieldname) pb_membersize(structname, fieldname[0])
+#define PB_DS_PTR_PB_HTYPE_OPTIONAL(structname, fieldname) pb_membersize(structname, fieldname[0])
+#define PB_DS_PTR_PB_HTYPE_ONEOF(structname, fieldname) pb_membersize(structname, PB_ONEOF_NAME(FULL, fieldname)[0])
+#define PB_DS_PTR_PB_HTYPE_REPEATED(structname, fieldname) pb_membersize(structname, fieldname[0])
+#define PB_DS_PTR_PB_HTYPE_FIXARRAY(structname, fieldname) pb_membersize(structname, fieldname[0][0])
+#define PB_DS_CB_PB_HTYPE_REQUIRED(structname, fieldname) pb_membersize(structname, fieldname)
+#define PB_DS_CB_PB_HTYPE_SINGULAR(structname, fieldname) pb_membersize(structname, fieldname)
+#define PB_DS_CB_PB_HTYPE_OPTIONAL(structname, fieldname) pb_membersize(structname, fieldname)
+#define PB_DS_CB_PB_HTYPE_ONEOF(structname, fieldname) pb_membersize(structname, PB_ONEOF_NAME(FULL, fieldname))
+#define PB_DS_CB_PB_HTYPE_REPEATED(structname, fieldname) pb_membersize(structname, fieldname)
+#define PB_DS_CB_PB_HTYPE_FIXARRAY(structname, fieldname) pb_membersize(structname, fieldname)
+
+#define PB_ONEOF_NAME(type, tuple) PB_EXPAND(PB_ONEOF_NAME_ ## type tuple)
+#define PB_ONEOF_NAME_UNION(unionname,membername,fullname) unionname
+#define PB_ONEOF_NAME_MEMBER(unionname,membername,fullname) membername
+#define PB_ONEOF_NAME_FULL(unionname,membername,fullname) fullname
+
+#define PB_GEN_SUBMSG_INFO(structname, atype, htype, ltype, fieldname, tag) \
+    PB_SUBMSG_INFO_ ## htype(_PB_LTYPE_ ## ltype, structname, fieldname)
+
+#define PB_SUBMSG_INFO_REQUIRED(ltype, structname, fieldname) PB_SI ## ltype(structname ## _ ## fieldname ## _MSGTYPE)
+#define PB_SUBMSG_INFO_SINGULAR(ltype, structname, fieldname) PB_SI ## ltype(structname ## _ ## fieldname ## _MSGTYPE)
+#define PB_SUBMSG_INFO_OPTIONAL(ltype, structname, fieldname) PB_SI ## ltype(structname ## _ ## fieldname ## _MSGTYPE)
+#define PB_SUBMSG_INFO_ONEOF(ltype, structname, fieldname) PB_SUBMSG_INFO_ONEOF2(ltype, structname, PB_ONEOF_NAME(UNION, fieldname), PB_ONEOF_NAME(MEMBER, fieldname))
+#define PB_SUBMSG_INFO_ONEOF2(ltype, structname, unionname, membername) PB_SUBMSG_INFO_ONEOF3(ltype, structname, unionname, membername)
+#define PB_SUBMSG_INFO_ONEOF3(ltype, structname, unionname, membername) PB_SI ## ltype(structname ## _ ## unionname ## _ ## membername ## _MSGTYPE)
+#define PB_SUBMSG_INFO_REPEATED(ltype, structname, fieldname) PB_SI ## ltype(structname ## _ ## fieldname ## _MSGTYPE)
+#define PB_SUBMSG_INFO_FIXARRAY(ltype, structname, fieldname) PB_SI ## ltype(structname ## _ ## fieldname ## _MSGTYPE)
+#define PB_SI_PB_LTYPE_BOOL(t)
+#define PB_SI_PB_LTYPE_BYTES(t)
+#define PB_SI_PB_LTYPE_DOUBLE(t)
+#define PB_SI_PB_LTYPE_ENUM(t)
+#define PB_SI_PB_LTYPE_UENUM(t)
+#define PB_SI_PB_LTYPE_FIXED32(t)
+#define PB_SI_PB_LTYPE_FIXED64(t)
+#define PB_SI_PB_LTYPE_FLOAT(t)
+#define PB_SI_PB_LTYPE_INT32(t)
+#define PB_SI_PB_LTYPE_INT64(t)
+#define PB_SI_PB_LTYPE_MESSAGE(t)  PB_SUBMSG_DESCRIPTOR(t)
+#define PB_SI_PB_LTYPE_MSG_W_CB(t) PB_SUBMSG_DESCRIPTOR(t)
+#define PB_SI_PB_LTYPE_SFIXED32(t)
+#define PB_SI_PB_LTYPE_SFIXED64(t)
+#define PB_SI_PB_LTYPE_SINT32(t)
+#define PB_SI_PB_LTYPE_SINT64(t)
+#define PB_SI_PB_LTYPE_STRING(t)
+#define PB_SI_PB_LTYPE_UINT32(t)
+#define PB_SI_PB_LTYPE_UINT64(t)
+#define PB_SI_PB_LTYPE_EXTENSION(t)
+#define PB_SI_PB_LTYPE_FIXED_LENGTH_BYTES(t)
+#define PB_SUBMSG_DESCRIPTOR(t)    &(t ## _msg),
+
+/* The field descriptors use a variable width format, with width of either
+ * 1, 2, 4 or 8 of 32-bit words. The two lowest bytes of the first byte always
+ * encode the descriptor size, 6 lowest bits of field tag number, and 8 bits
+ * of the field type.
+ *
+ * Descriptor size is encoded as 0 = 1 word, 1 = 2 words, 2 = 4 words, 3 = 8 words.
+ *
+ * Formats, listed starting with the least significant bit of the first word.
+ * 1 word:  [2-bit len] [6-bit tag] [8-bit type] [8-bit data_offset] [4-bit size_offset] [4-bit data_size]
+ *
+ * 2 words: [2-bit len] [6-bit tag] [8-bit type] [12-bit array_size] [4-bit size_offset]
+ *          [16-bit data_offset] [12-bit data_size] [4-bit tag>>6]
+ *
+ * 4 words: [2-bit len] [6-bit tag] [8-bit type] [16-bit array_size]
+ *          [8-bit size_offset] [24-bit tag>>6]
+ *          [32-bit data_offset]
+ *          [32-bit data_size]
+ *
+ * 8 words: [2-bit len] [6-bit tag] [8-bit type] [16-bit reserved]
+ *          [8-bit size_offset] [24-bit tag>>6]
+ *          [32-bit data_offset]
+ *          [32-bit data_size]
+ *          [32-bit array_size]
+ *          [32-bit reserved]
+ *          [32-bit reserved]
+ *          [32-bit reserved]
+ */
+
+#define PB_FIELDINFO_1(tag, type, data_offset, data_size, size_offset, array_size) \
+    (0 | (((tag) << 2) & 0xFF) | ((type) << 8) | (((uint32_t)(data_offset) & 0xFF) << 16) | \
+     (((uint32_t)(size_offset) & 0x0F) << 24) | (((uint32_t)(data_size) & 0x0F) << 28)),
+
+#define PB_FIELDINFO_2(tag, type, data_offset, data_size, size_offset, array_size) \
+    (1 | (((tag) << 2) & 0xFF) | ((type) << 8) | (((uint32_t)(array_size) & 0xFFF) << 16) | (((uint32_t)(size_offset) & 0x0F) << 28)), \
+    (((uint32_t)(data_offset) & 0xFFFF) | (((uint32_t)(data_size) & 0xFFF) << 16) | (((uint32_t)(tag) & 0x3c0) << 22)),
+
+#define PB_FIELDINFO_4(tag, type, data_offset, data_size, size_offset, array_size) \
+    (2 | (((tag) << 2) & 0xFF) | ((type) << 8) | (((uint32_t)(array_size) & 0xFFFF) << 16)), \
+    ((uint32_t)(int_least8_t)(size_offset) | (((uint32_t)(tag) << 2) & 0xFFFFFF00)), \
+    (data_offset), (data_size),
+
+#define PB_FIELDINFO_8(tag, type, data_offset, data_size, size_offset, array_size) \
+    (3 | (((tag) << 2) & 0xFF) | ((type) << 8)), \
+    ((uint32_t)(int_least8_t)(size_offset) | (((uint32_t)(tag) << 2) & 0xFFFFFF00)), \
+    (data_offset), (data_size), (array_size), 0, 0, 0,
+
+/* These assertions verify that the field information fits in the allocated space.
+ * The generator tries to automatically determine the correct width that can fit all
+ * data associated with a message. These asserts will fail only if there has been a
+ * problem in the automatic logic - this may be worth reporting as a bug. As a workaround,
+ * you can increase the descriptor width by defining PB_FIELDINFO_WIDTH or by setting
+ * descriptorsize option in .options file.
+ */
+#define PB_FITS(value,bits) ((uint32_t)(value) < ((uint32_t)1<<bits))
+#define PB_FIELDINFO_ASSERT_1(tag, type, data_offset, data_size, size_offset, array_size) \
+    PB_STATIC_ASSERT(PB_FITS(tag,6) && PB_FITS(data_offset,8) && PB_FITS(size_offset,4) && PB_FITS(data_size,4) && PB_FITS(array_size,1), FIELDINFO_DOES_NOT_FIT_width1_field ## tag)
+
+#define PB_FIELDINFO_ASSERT_2(tag, type, data_offset, data_size, size_offset, array_size) \
+    PB_STATIC_ASSERT(PB_FITS(tag,10) && PB_FITS(data_offset,16) && PB_FITS(size_offset,4) && PB_FITS(data_size,12) && PB_FITS(array_size,12), FIELDINFO_DOES_NOT_FIT_width2_field ## tag)
+
+#ifndef PB_FIELD_32BIT
+/* Maximum field sizes are still 16-bit if pb_size_t is 16-bit */
+#define PB_FIELDINFO_ASSERT_4(tag, type, data_offset, data_size, size_offset, array_size) \
+    PB_STATIC_ASSERT(PB_FITS(tag,16) && PB_FITS(data_offset,16) && PB_FITS((int_least8_t)size_offset,8) && PB_FITS(data_size,16) && PB_FITS(array_size,16), FIELDINFO_DOES_NOT_FIT_width4_field ## tag)
+
+#define PB_FIELDINFO_ASSERT_8(tag, type, data_offset, data_size, size_offset, array_size) \
+    PB_STATIC_ASSERT(PB_FITS(tag,16) && PB_FITS(data_offset,16) && PB_FITS((int_least8_t)size_offset,8) && PB_FITS(data_size,16) && PB_FITS(array_size,16), FIELDINFO_DOES_NOT_FIT_width8_field ## tag)
+#else
+/* Up to 32-bit fields supported.
+ * Note that the checks are against 31 bits to avoid compiler warnings about shift wider than type in the test.
+ * I expect that there is no reasonable use for >2GB messages with nanopb anyway.
+ */
+#define PB_FIELDINFO_ASSERT_4(tag, type, data_offset, data_size, size_offset, array_size) \
+    PB_STATIC_ASSERT(PB_FITS(tag,30) && PB_FITS(data_offset,31) && PB_FITS(size_offset,8) && PB_FITS(data_size,31) && PB_FITS(array_size,16), FIELDINFO_DOES_NOT_FIT_width4_field ## tag)
+
+#define PB_FIELDINFO_ASSERT_8(tag, type, data_offset, data_size, size_offset, array_size) \
+    PB_STATIC_ASSERT(PB_FITS(tag,30) && PB_FITS(data_offset,31) && PB_FITS(size_offset,8) && PB_FITS(data_size,31) && PB_FITS(array_size,31), FIELDINFO_DOES_NOT_FIT_width8_field ## tag)
+#endif
+
+
+/* Automatic picking of FIELDINFO width:
+ * Uses width 1 when possible, otherwise resorts to width 2.
+ * This is used when PB_BIND() is called with "AUTO" as the argument.
+ * The generator will give explicit size argument when it knows that a message
+ * structure grows beyond 1-word format limits.
+ */
+#define PB_FIELDINFO_WIDTH_AUTO(atype, htype, ltype) PB_FI_WIDTH ## atype(htype, ltype)
+#define PB_FI_WIDTH_PB_ATYPE_STATIC(htype, ltype) PB_FI_WIDTH ## htype(ltype)
+#define PB_FI_WIDTH_PB_ATYPE_POINTER(htype, ltype) PB_FI_WIDTH ## htype(ltype)
+#define PB_FI_WIDTH_PB_ATYPE_CALLBACK(htype, ltype) 2
+#define PB_FI_WIDTH_PB_HTYPE_REQUIRED(ltype) PB_FI_WIDTH ## ltype
+#define PB_FI_WIDTH_PB_HTYPE_SINGULAR(ltype) PB_FI_WIDTH ## ltype
+#define PB_FI_WIDTH_PB_HTYPE_OPTIONAL(ltype) PB_FI_WIDTH ## ltype
+#define PB_FI_WIDTH_PB_HTYPE_ONEOF(ltype) PB_FI_WIDTH ## ltype
+#define PB_FI_WIDTH_PB_HTYPE_REPEATED(ltype) 2
+#define PB_FI_WIDTH_PB_HTYPE_FIXARRAY(ltype) 2
+#define PB_FI_WIDTH_PB_LTYPE_BOOL      1
+#define PB_FI_WIDTH_PB_LTYPE_BYTES     2
+#define PB_FI_WIDTH_PB_LTYPE_DOUBLE    1
+#define PB_FI_WIDTH_PB_LTYPE_ENUM      1
+#define PB_FI_WIDTH_PB_LTYPE_UENUM     1
+#define PB_FI_WIDTH_PB_LTYPE_FIXED32   1
+#define PB_FI_WIDTH_PB_LTYPE_FIXED64   1
+#define PB_FI_WIDTH_PB_LTYPE_FLOAT     1
+#define PB_FI_WIDTH_PB_LTYPE_INT32     1
+#define PB_FI_WIDTH_PB_LTYPE_INT64     1
+#define PB_FI_WIDTH_PB_LTYPE_MESSAGE   2
+#define PB_FI_WIDTH_PB_LTYPE_MSG_W_CB  2
+#define PB_FI_WIDTH_PB_LTYPE_SFIXED32  1
+#define PB_FI_WIDTH_PB_LTYPE_SFIXED64  1
+#define PB_FI_WIDTH_PB_LTYPE_SINT32    1
+#define PB_FI_WIDTH_PB_LTYPE_SINT64    1
+#define PB_FI_WIDTH_PB_LTYPE_STRING    2
+#define PB_FI_WIDTH_PB_LTYPE_UINT32    1
+#define PB_FI_WIDTH_PB_LTYPE_UINT64    1
+#define PB_FI_WIDTH_PB_LTYPE_EXTENSION 1
+#define PB_FI_WIDTH_PB_LTYPE_FIXED_LENGTH_BYTES 2
+
+/* The mapping from protobuf types to LTYPEs is done using these macros. */
+#define PB_LTYPE_MAP_BOOL               PB_LTYPE_BOOL
+#define PB_LTYPE_MAP_BYTES              PB_LTYPE_BYTES
+#define PB_LTYPE_MAP_DOUBLE             PB_LTYPE_FIXED64
+#define PB_LTYPE_MAP_ENUM               PB_LTYPE_VARINT
+#define PB_LTYPE_MAP_UENUM              PB_LTYPE_UVARINT
+#define PB_LTYPE_MAP_FIXED32            PB_LTYPE_FIXED32
+#define PB_LTYPE_MAP_FIXED64            PB_LTYPE_FIXED64
+#define PB_LTYPE_MAP_FLOAT              PB_LTYPE_FIXED32
+#define PB_LTYPE_MAP_INT32              PB_LTYPE_VARINT
+#define PB_LTYPE_MAP_INT64              PB_LTYPE_VARINT
+#define PB_LTYPE_MAP_MESSAGE            PB_LTYPE_SUBMESSAGE
+#define PB_LTYPE_MAP_MSG_W_CB           PB_LTYPE_SUBMSG_W_CB
+#define PB_LTYPE_MAP_SFIXED32           PB_LTYPE_FIXED32
+#define PB_LTYPE_MAP_SFIXED64           PB_LTYPE_FIXED64
+#define PB_LTYPE_MAP_SINT32             PB_LTYPE_SVARINT
+#define PB_LTYPE_MAP_SINT64             PB_LTYPE_SVARINT
+#define PB_LTYPE_MAP_STRING             PB_LTYPE_STRING
+#define PB_LTYPE_MAP_UINT32             PB_LTYPE_UVARINT
+#define PB_LTYPE_MAP_UINT64             PB_LTYPE_UVARINT
+#define PB_LTYPE_MAP_EXTENSION          PB_LTYPE_EXTENSION
+#define PB_LTYPE_MAP_FIXED_LENGTH_BYTES PB_LTYPE_FIXED_LENGTH_BYTES
+
+/* These macros are used for giving out error messages.
+ * They are mostly a debugging aid; the main error information
+ * is the true/false return value from functions.
+ * Some code space can be saved by disabling the error
+ * messages if not used.
+ *
+ * PB_SET_ERROR() sets the error message if none has been set yet.
+ *                msg must be a constant string literal.
+ * PB_GET_ERROR() always returns a pointer to a string.
+ * PB_RETURN_ERROR() sets the error and returns false from current
+ *                   function.
+ */
+#ifdef PB_NO_ERRMSG
+#define PB_SET_ERROR(stream, msg) PB_UNUSED(stream)
+#define PB_GET_ERROR(stream) "(errmsg disabled)"
+#else
+#define PB_SET_ERROR(stream, msg) (stream->errmsg = (stream)->errmsg ? (stream)->errmsg : (msg))
+#define PB_GET_ERROR(stream) ((stream)->errmsg ? (stream)->errmsg : "(none)")
+#endif
+
+#define PB_RETURN_ERROR(stream, msg) return PB_SET_ERROR(stream, msg), false
+
+#ifdef __cplusplus
+} /* extern "C" */
+#endif
+
+#ifdef __cplusplus
+#if __cplusplus >= 201103L
+#define PB_CONSTEXPR constexpr
+#else  // __cplusplus >= 201103L
+#define PB_CONSTEXPR
+#endif  // __cplusplus >= 201103L
+
+#if __cplusplus >= 201703L
+#define PB_INLINE_CONSTEXPR inline constexpr
+#else  // __cplusplus >= 201703L
+#define PB_INLINE_CONSTEXPR PB_CONSTEXPR
+#endif  // __cplusplus >= 201703L
+
+namespace nanopb {
+// Each type will be partially specialized by the generator.
+template <typename GenMessageT> struct MessageDescriptor;
+}  // namespace nanopb
+#endif  /* __cplusplus */
+
+#endif
+

diff --git a/security/container/protos/nanopb/pb_common.c b/security/container/protos/nanopb/pb_common.c
new file mode 100644
index 0000000..6aee76b
--- /dev/null
+++ b/security/container/protos/nanopb/pb_common.c

@@ -0,0 +1,388 @@
+/* pb_common.c: Common support functions for pb_encode.c and pb_decode.c.
+ *
+ * 2014 Petteri Aimonen <jpa@kapsi.fi>
+ */
+
+#include "pb_common.h"
+
+static bool load_descriptor_values(pb_field_iter_t *iter)
+{
+    uint32_t word0;
+    uint32_t data_offset;
+    int_least8_t size_offset;
+
+    if (iter->index >= iter->descriptor->field_count)
+        return false;
+
+    word0 = PB_PROGMEM_READU32(iter->descriptor->field_info[iter->field_info_index]);
+    iter->type = (pb_type_t)((word0 >> 8) & 0xFF);
+
+    switch(word0 & 3)
+    {
+        case 0: {
+            /* 1-word format */
+            iter->array_size = 1;
+            iter->tag = (pb_size_t)((word0 >> 2) & 0x3F);
+            size_offset = (int_least8_t)((word0 >> 24) & 0x0F);
+            data_offset = (word0 >> 16) & 0xFF;
+            iter->data_size = (pb_size_t)((word0 >> 28) & 0x0F);
+            break;
+        }
+
+        case 1: {
+            /* 2-word format */
+            uint32_t word1 = PB_PROGMEM_READU32(iter->descriptor->field_info[iter->field_info_index + 1]);
+
+            iter->array_size = (pb_size_t)((word0 >> 16) & 0x0FFF);
+            iter->tag = (pb_size_t)(((word0 >> 2) & 0x3F) | ((word1 >> 28) << 6));
+            size_offset = (int_least8_t)((word0 >> 28) & 0x0F);
+            data_offset = word1 & 0xFFFF;
+            iter->data_size = (pb_size_t)((word1 >> 16) & 0x0FFF);
+            break;
+        }
+
+        case 2: {
+            /* 4-word format */
+            uint32_t word1 = PB_PROGMEM_READU32(iter->descriptor->field_info[iter->field_info_index + 1]);
+            uint32_t word2 = PB_PROGMEM_READU32(iter->descriptor->field_info[iter->field_info_index + 2]);
+            uint32_t word3 = PB_PROGMEM_READU32(iter->descriptor->field_info[iter->field_info_index + 3]);
+
+            iter->array_size = (pb_size_t)(word0 >> 16);
+            iter->tag = (pb_size_t)(((word0 >> 2) & 0x3F) | ((word1 >> 8) << 6));
+            size_offset = (int_least8_t)(word1 & 0xFF);
+            data_offset = word2;
+            iter->data_size = (pb_size_t)word3;
+            break;
+        }
+
+        default: {
+            /* 8-word format */
+            uint32_t word1 = PB_PROGMEM_READU32(iter->descriptor->field_info[iter->field_info_index + 1]);
+            uint32_t word2 = PB_PROGMEM_READU32(iter->descriptor->field_info[iter->field_info_index + 2]);
+            uint32_t word3 = PB_PROGMEM_READU32(iter->descriptor->field_info[iter->field_info_index + 3]);
+            uint32_t word4 = PB_PROGMEM_READU32(iter->descriptor->field_info[iter->field_info_index + 4]);
+
+            iter->array_size = (pb_size_t)word4;
+            iter->tag = (pb_size_t)(((word0 >> 2) & 0x3F) | ((word1 >> 8) << 6));
+            size_offset = (int_least8_t)(word1 & 0xFF);
+            data_offset = word2;
+            iter->data_size = (pb_size_t)word3;
+            break;
+        }
+    }
+
+    if (!iter->message)
+    {
+        /* Avoid doing arithmetic on null pointers, it is undefined */
+        iter->pField = NULL;
+        iter->pSize = NULL;
+    }
+    else
+    {
+        iter->pField = (char*)iter->message + data_offset;
+
+        if (size_offset)
+        {
+            iter->pSize = (char*)iter->pField - size_offset;
+        }
+        else if (PB_HTYPE(iter->type) == PB_HTYPE_REPEATED &&
+                 (PB_ATYPE(iter->type) == PB_ATYPE_STATIC ||
+                  PB_ATYPE(iter->type) == PB_ATYPE_POINTER))
+        {
+            /* Fixed count array */
+            iter->pSize = &iter->array_size;
+        }
+        else
+        {
+            iter->pSize = NULL;
+        }
+
+        if (PB_ATYPE(iter->type) == PB_ATYPE_POINTER && iter->pField != NULL)
+        {
+            iter->pData = *(void**)iter->pField;
+        }
+        else
+        {
+            iter->pData = iter->pField;
+        }
+    }
+
+    if (PB_LTYPE_IS_SUBMSG(iter->type))
+    {
+        iter->submsg_desc = iter->descriptor->submsg_info[iter->submessage_index];
+    }
+    else
+    {
+        iter->submsg_desc = NULL;
+    }
+
+    return true;
+}
+
+static void advance_iterator(pb_field_iter_t *iter)
+{
+    iter->index++;
+
+    if (iter->index >= iter->descriptor->field_count)
+    {
+        /* Restart */
+        iter->index = 0;
+        iter->field_info_index = 0;
+        iter->submessage_index = 0;
+        iter->required_field_index = 0;
+    }
+    else
+    {
+        /* Increment indexes based on previous field type.
+         * All field info formats have the following fields:
+         * - lowest 2 bits tell the amount of words in the descriptor (2^n words)
+         * - bits 2..7 give the lowest bits of tag number.
+         * - bits 8..15 give the field type.
+         */
+        uint32_t prev_descriptor = PB_PROGMEM_READU32(iter->descriptor->field_info[iter->field_info_index]);
+        pb_type_t prev_type = (prev_descriptor >> 8) & 0xFF;
+        pb_size_t descriptor_len = (pb_size_t)(1 << (prev_descriptor & 3));
+
+        /* Add to fields.
+         * The cast to pb_size_t is needed to avoid -Wconversion warning.
+         * Because the data is is constants from generator, there is no danger of overflow.
+         */
+        iter->field_info_index = (pb_size_t)(iter->field_info_index + descriptor_len);
+        iter->required_field_index = (pb_size_t)(iter->required_field_index + (PB_HTYPE(prev_type) == PB_HTYPE_REQUIRED));
+        iter->submessage_index = (pb_size_t)(iter->submessage_index + PB_LTYPE_IS_SUBMSG(prev_type));
+    }
+}
+
+bool pb_field_iter_begin(pb_field_iter_t *iter, const pb_msgdesc_t *desc, void *message)
+{
+    memset(iter, 0, sizeof(*iter));
+
+    iter->descriptor = desc;
+    iter->message = message;
+
+    return load_descriptor_values(iter);
+}
+
+bool pb_field_iter_begin_extension(pb_field_iter_t *iter, pb_extension_t *extension)
+{
+    const pb_msgdesc_t *msg = (const pb_msgdesc_t*)extension->type->arg;
+    bool status;
+
+    uint32_t word0 = PB_PROGMEM_READU32(msg->field_info[0]);
+    if (PB_ATYPE(word0 >> 8) == PB_ATYPE_POINTER)
+    {
+        /* For pointer extensions, the pointer is stored directly
+         * in the extension structure. This avoids having an extra
+         * indirection. */
+        status = pb_field_iter_begin(iter, msg, &extension->dest);
+    }
+    else
+    {
+        status = pb_field_iter_begin(iter, msg, extension->dest);
+    }
+
+    iter->pSize = &extension->found;
+    return status;
+}
+
+bool pb_field_iter_next(pb_field_iter_t *iter)
+{
+    advance_iterator(iter);
+    (void)load_descriptor_values(iter);
+    return iter->index != 0;
+}
+
+bool pb_field_iter_find(pb_field_iter_t *iter, uint32_t tag)
+{
+    if (iter->tag == tag)
+    {
+        return true; /* Nothing to do, correct field already. */
+    }
+    else if (tag > iter->descriptor->largest_tag)
+    {
+        return false;
+    }
+    else
+    {
+        pb_size_t start = iter->index;
+        uint32_t fieldinfo;
+
+        if (tag < iter->tag)
+        {
+            /* Fields are in tag number order, so we know that tag is between
+             * 0 and our start position. Setting index to end forces
+             * advance_iterator() call below to restart from beginning. */
+            iter->index = iter->descriptor->field_count;
+        }
+
+        do
+        {
+            /* Advance iterator but don't load values yet */
+            advance_iterator(iter);
+
+            /* Do fast check for tag number match */
+            fieldinfo = PB_PROGMEM_READU32(iter->descriptor->field_info[iter->field_info_index]);
+
+            if (((fieldinfo >> 2) & 0x3F) == (tag & 0x3F))
+            {
+                /* Good candidate, check further */
+                (void)load_descriptor_values(iter);
+
+                if (iter->tag == tag &&
+                    PB_LTYPE(iter->type) != PB_LTYPE_EXTENSION)
+                {
+                    /* Found it */
+                    return true;
+                }
+            }
+        } while (iter->index != start);
+
+        /* Searched all the way back to start, and found nothing. */
+        (void)load_descriptor_values(iter);
+        return false;
+    }
+}
+
+bool pb_field_iter_find_extension(pb_field_iter_t *iter)
+{
+    if (PB_LTYPE(iter->type) == PB_LTYPE_EXTENSION)
+    {
+        return true;
+    }
+    else
+    {
+        pb_size_t start = iter->index;
+        uint32_t fieldinfo;
+
+        do
+        {
+            /* Advance iterator but don't load values yet */
+            advance_iterator(iter);
+
+            /* Do fast check for field type */
+            fieldinfo = PB_PROGMEM_READU32(iter->descriptor->field_info[iter->field_info_index]);
+
+            if (PB_LTYPE((fieldinfo >> 8) & 0xFF) == PB_LTYPE_EXTENSION)
+            {
+                return load_descriptor_values(iter);
+            }
+        } while (iter->index != start);
+
+        /* Searched all the way back to start, and found nothing. */
+        (void)load_descriptor_values(iter);
+        return false;
+    }
+}
+
+static void *pb_const_cast(const void *p)
+{
+    /* Note: this casts away const, in order to use the common field iterator
+     * logic for both encoding and decoding. The cast is done using union
+     * to avoid spurious compiler warnings. */
+    union {
+        void *p1;
+        const void *p2;
+    } t;
+    t.p2 = p;
+    return t.p1;
+}
+
+bool pb_field_iter_begin_const(pb_field_iter_t *iter, const pb_msgdesc_t *desc, const void *message)
+{
+    return pb_field_iter_begin(iter, desc, pb_const_cast(message));
+}
+
+bool pb_field_iter_begin_extension_const(pb_field_iter_t *iter, const pb_extension_t *extension)
+{
+    return pb_field_iter_begin_extension(iter, (pb_extension_t*)pb_const_cast(extension));
+}
+
+bool pb_default_field_callback(pb_istream_t *istream, pb_ostream_t *ostream, const pb_field_t *field)
+{
+    if (field->data_size == sizeof(pb_callback_t))
+    {
+        pb_callback_t *pCallback = (pb_callback_t*)field->pData;
+
+        if (pCallback != NULL)
+        {
+            if (istream != NULL && pCallback->funcs.decode != NULL)
+            {
+                return pCallback->funcs.decode(istream, field, &pCallback->arg);
+            }
+
+            if (ostream != NULL && pCallback->funcs.encode != NULL)
+            {
+                return pCallback->funcs.encode(ostream, field, &pCallback->arg);
+            }
+        }
+    }
+
+    return true; /* Success, but didn't do anything */
+
+}
+
+#ifdef PB_VALIDATE_UTF8
+
+/* This function checks whether a string is valid UTF-8 text.
+ *
+ * Algorithm is adapted from https://www.cl.cam.ac.uk/~mgk25/ucs/utf8_check.c
+ * Original copyright: Markus Kuhn <http://www.cl.cam.ac.uk/~mgk25/> 2005-03-30
+ * Licensed under "Short code license", which allows use under MIT license or
+ * any compatible with it.
+ */
+
+bool pb_validate_utf8(const char *str)
+{
+    const pb_byte_t *s = (const pb_byte_t*)str;
+    while (*s)
+    {
+        if (*s < 0x80)
+        {
+            /* 0xxxxxxx */
+            s++;
+        }
+        else if ((s[0] & 0xe0) == 0xc0)
+        {
+            /* 110XXXXx 10xxxxxx */
+            if ((s[1] & 0xc0) != 0x80 ||
+                (s[0] & 0xfe) == 0xc0)                        /* overlong? */
+                return false;
+            else
+                s += 2;
+        }
+        else if ((s[0] & 0xf0) == 0xe0)
+        {
+            /* 1110XXXX 10Xxxxxx 10xxxxxx */
+            if ((s[1] & 0xc0) != 0x80 ||
+                (s[2] & 0xc0) != 0x80 ||
+                (s[0] == 0xe0 && (s[1] & 0xe0) == 0x80) ||    /* overlong? */
+                (s[0] == 0xed && (s[1] & 0xe0) == 0xa0) ||    /* surrogate? */
+                (s[0] == 0xef && s[1] == 0xbf &&
+                (s[2] & 0xfe) == 0xbe))                 /* U+FFFE or U+FFFF? */
+                return false;
+            else
+                s += 3;
+        }
+        else if ((s[0] & 0xf8) == 0xf0)
+        {
+            /* 11110XXX 10XXxxxx 10xxxxxx 10xxxxxx */
+            if ((s[1] & 0xc0) != 0x80 ||
+                (s[2] & 0xc0) != 0x80 ||
+                (s[3] & 0xc0) != 0x80 ||
+                (s[0] == 0xf0 && (s[1] & 0xf0) == 0x80) ||    /* overlong? */
+                (s[0] == 0xf4 && s[1] > 0x8f) || s[0] > 0xf4) /* > U+10FFFF? */
+                return false;
+            else
+                s += 4;
+        }
+        else
+        {
+            return false;
+        }
+    }
+
+    return true;
+}
+
+#endif
+

diff --git a/security/container/protos/nanopb/pb_common.h b/security/container/protos/nanopb/pb_common.h
new file mode 100644
index 0000000..58aa90f
--- /dev/null
+++ b/security/container/protos/nanopb/pb_common.h

@@ -0,0 +1,49 @@
+/* pb_common.h: Common support functions for pb_encode.c and pb_decode.c.
+ * These functions are rarely needed by applications directly.
+ */
+
+#ifndef PB_COMMON_H_INCLUDED
+#define PB_COMMON_H_INCLUDED
+
+#include "pb.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/* Initialize the field iterator structure to beginning.
+ * Returns false if the message type is empty. */
+bool pb_field_iter_begin(pb_field_iter_t *iter, const pb_msgdesc_t *desc, void *message);
+
+/* Get a field iterator for extension field. */
+bool pb_field_iter_begin_extension(pb_field_iter_t *iter, pb_extension_t *extension);
+
+/* Same as pb_field_iter_begin(), but for const message pointer.
+ * Note that the pointers in pb_field_iter_t will be non-const but shouldn't
+ * be written to when using these functions. */
+bool pb_field_iter_begin_const(pb_field_iter_t *iter, const pb_msgdesc_t *desc, const void *message);
+bool pb_field_iter_begin_extension_const(pb_field_iter_t *iter, const pb_extension_t *extension);
+
+/* Advance the iterator to the next field.
+ * Returns false when the iterator wraps back to the first field. */
+bool pb_field_iter_next(pb_field_iter_t *iter);
+
+/* Advance the iterator until it points at a field with the given tag.
+ * Returns false if no such field exists. */
+bool pb_field_iter_find(pb_field_iter_t *iter, uint32_t tag);
+
+/* Find a field with type PB_LTYPE_EXTENSION, or return false if not found.
+ * There can be only one extension range field per message. */
+bool pb_field_iter_find_extension(pb_field_iter_t *iter);
+
+#ifdef PB_VALIDATE_UTF8
+/* Validate UTF-8 text string */
+bool pb_validate_utf8(const char *s);
+#endif
+
+#ifdef __cplusplus
+} /* extern "C" */
+#endif
+
+#endif
+

diff --git a/security/container/protos/nanopb/pb_decode.c b/security/container/protos/nanopb/pb_decode.c
new file mode 100644
index 0000000..b194825
--- /dev/null
+++ b/security/container/protos/nanopb/pb_decode.c

@@ -0,0 +1,1709 @@
+/* pb_decode.c -- decode a protobuf using minimal resources
+ *
+ * 2011 Petteri Aimonen <jpa@kapsi.fi>
+ */
+
+/* Use the GCC warn_unused_result attribute to check that all return values
+ * are propagated correctly. On other compilers and gcc before 3.4.0 just
+ * ignore the annotation.
+ */
+#if !defined(__GNUC__) || ( __GNUC__ < 3) || (__GNUC__ == 3 && __GNUC_MINOR__ < 4)
+    #define checkreturn
+#else
+    #define checkreturn __attribute__((warn_unused_result))
+#endif
+
+#include "pb.h"
+#include "pb_decode.h"
+#include "pb_common.h"
+
+/**************************************
+ * Declarations internal to this file *
+ **************************************/
+
+static bool checkreturn buf_read(pb_istream_t *stream, pb_byte_t *buf, size_t count);
+static bool checkreturn pb_decode_varint32_eof(pb_istream_t *stream, uint32_t *dest, bool *eof);
+static bool checkreturn read_raw_value(pb_istream_t *stream, pb_wire_type_t wire_type, pb_byte_t *buf, size_t *size);
+static bool checkreturn decode_basic_field(pb_istream_t *stream, pb_wire_type_t wire_type, pb_field_iter_t *field);
+static bool checkreturn decode_static_field(pb_istream_t *stream, pb_wire_type_t wire_type, pb_field_iter_t *field);
+static bool checkreturn decode_pointer_field(pb_istream_t *stream, pb_wire_type_t wire_type, pb_field_iter_t *field);
+static bool checkreturn decode_callback_field(pb_istream_t *stream, pb_wire_type_t wire_type, pb_field_iter_t *field);
+static bool checkreturn decode_field(pb_istream_t *stream, pb_wire_type_t wire_type, pb_field_iter_t *field);
+static bool checkreturn default_extension_decoder(pb_istream_t *stream, pb_extension_t *extension, uint32_t tag, pb_wire_type_t wire_type);
+static bool checkreturn decode_extension(pb_istream_t *stream, uint32_t tag, pb_wire_type_t wire_type, pb_extension_t *extension);
+static bool pb_field_set_to_default(pb_field_iter_t *field);
+static bool pb_message_set_to_defaults(pb_field_iter_t *iter);
+static bool checkreturn pb_dec_bool(pb_istream_t *stream, const pb_field_iter_t *field);
+static bool checkreturn pb_dec_varint(pb_istream_t *stream, const pb_field_iter_t *field);
+static bool checkreturn pb_dec_bytes(pb_istream_t *stream, const pb_field_iter_t *field);
+static bool checkreturn pb_dec_string(pb_istream_t *stream, const pb_field_iter_t *field);
+static bool checkreturn pb_dec_submessage(pb_istream_t *stream, const pb_field_iter_t *field);
+static bool checkreturn pb_dec_fixed_length_bytes(pb_istream_t *stream, const pb_field_iter_t *field);
+static bool checkreturn pb_skip_varint(pb_istream_t *stream);
+static bool checkreturn pb_skip_string(pb_istream_t *stream);
+
+#ifdef PB_ENABLE_MALLOC
+static bool checkreturn allocate_field(pb_istream_t *stream, void *pData, size_t data_size, size_t array_size);
+static void initialize_pointer_field(void *pItem, pb_field_iter_t *field);
+static bool checkreturn pb_release_union_field(pb_istream_t *stream, pb_field_iter_t *field);
+static void pb_release_single_field(pb_field_iter_t *field);
+#endif
+
+#ifdef PB_WITHOUT_64BIT
+#define pb_int64_t int32_t
+#define pb_uint64_t uint32_t
+#else
+#define pb_int64_t int64_t
+#define pb_uint64_t uint64_t
+#endif
+
+#define PB_WT_PACKED ((pb_wire_type_t)0xFF)
+
+typedef struct {
+    uint32_t bitfield[(PB_MAX_REQUIRED_FIELDS + 31) / 32];
+} pb_fields_seen_t;
+
+/*******************************
+ * pb_istream_t implementation *
+ *******************************/
+
+static bool checkreturn buf_read(pb_istream_t *stream, pb_byte_t *buf, size_t count)
+{
+    size_t i;
+    const pb_byte_t *source = (const pb_byte_t*)stream->state;
+    stream->state = (pb_byte_t*)stream->state + count;
+    
+    if (buf != NULL)
+    {
+        for (i = 0; i < count; i++)
+            buf[i] = source[i];
+    }
+    
+    return true;
+}
+
+bool checkreturn pb_read(pb_istream_t *stream, pb_byte_t *buf, size_t count)
+{
+    if (count == 0)
+        return true;
+
+#ifndef PB_BUFFER_ONLY
+	if (buf == NULL && stream->callback != buf_read)
+	{
+		/* Skip input bytes */
+		pb_byte_t tmp[16];
+		while (count > 16)
+		{
+			if (!pb_read(stream, tmp, 16))
+				return false;
+			
+			count -= 16;
+		}
+		
+		return pb_read(stream, tmp, count);
+	}
+#endif
+
+    if (stream->bytes_left < count)
+        PB_RETURN_ERROR(stream, "end-of-stream");
+    
+#ifndef PB_BUFFER_ONLY
+    if (!stream->callback(stream, buf, count))
+        PB_RETURN_ERROR(stream, "io error");
+#else
+    if (!buf_read(stream, buf, count))
+        return false;
+#endif
+    
+    stream->bytes_left -= count;
+    return true;
+}
+
+/* Read a single byte from input stream. buf may not be NULL.
+ * This is an optimization for the varint decoding. */
+static bool checkreturn pb_readbyte(pb_istream_t *stream, pb_byte_t *buf)
+{
+    if (stream->bytes_left == 0)
+        PB_RETURN_ERROR(stream, "end-of-stream");
+
+#ifndef PB_BUFFER_ONLY
+    if (!stream->callback(stream, buf, 1))
+        PB_RETURN_ERROR(stream, "io error");
+#else
+    *buf = *(const pb_byte_t*)stream->state;
+    stream->state = (pb_byte_t*)stream->state + 1;
+#endif
+
+    stream->bytes_left--;
+    
+    return true;    
+}
+
+pb_istream_t pb_istream_from_buffer(const pb_byte_t *buf, size_t msglen)
+{
+    pb_istream_t stream;
+    /* Cast away the const from buf without a compiler error.  We are
+     * careful to use it only in a const manner in the callbacks.
+     */
+    union {
+        void *state;
+        const void *c_state;
+    } state;
+#ifdef PB_BUFFER_ONLY
+    stream.callback = NULL;
+#else
+    stream.callback = &buf_read;
+#endif
+    state.c_state = buf;
+    stream.state = state.state;
+    stream.bytes_left = msglen;
+#ifndef PB_NO_ERRMSG
+    stream.errmsg = NULL;
+#endif
+    return stream;
+}
+
+/********************
+ * Helper functions *
+ ********************/
+
+static bool checkreturn pb_decode_varint32_eof(pb_istream_t *stream, uint32_t *dest, bool *eof)
+{
+    pb_byte_t byte;
+    uint32_t result;
+    
+    if (!pb_readbyte(stream, &byte))
+    {
+        if (stream->bytes_left == 0)
+        {
+            if (eof)
+            {
+                *eof = true;
+            }
+        }
+
+        return false;
+    }
+    
+    if ((byte & 0x80) == 0)
+    {
+        /* Quick case, 1 byte value */
+        result = byte;
+    }
+    else
+    {
+        /* Multibyte case */
+        uint_fast8_t bitpos = 7;
+        result = byte & 0x7F;
+        
+        do
+        {
+            if (!pb_readbyte(stream, &byte))
+                return false;
+            
+            if (bitpos >= 32)
+            {
+                /* Note: The varint could have trailing 0x80 bytes, or 0xFF for negative. */
+                pb_byte_t sign_extension = (bitpos < 63) ? 0xFF : 0x01;
+                bool valid_extension = ((byte & 0x7F) == 0x00 ||
+                         ((result >> 31) != 0 && byte == sign_extension));
+
+                if (bitpos >= 64 || !valid_extension)
+                {
+                    PB_RETURN_ERROR(stream, "varint overflow");
+                }
+            }
+            else
+            {
+                result |= (uint32_t)(byte & 0x7F) << bitpos;
+            }
+            bitpos = (uint_fast8_t)(bitpos + 7);
+        } while (byte & 0x80);
+        
+        if (bitpos == 35 && (byte & 0x70) != 0)
+        {
+            /* The last byte was at bitpos=28, so only bottom 4 bits fit. */
+            PB_RETURN_ERROR(stream, "varint overflow");
+        }
+   }
+   
+   *dest = result;
+   return true;
+}
+
+bool checkreturn pb_decode_varint32(pb_istream_t *stream, uint32_t *dest)
+{
+    return pb_decode_varint32_eof(stream, dest, NULL);
+}
+
+#ifndef PB_WITHOUT_64BIT
+bool checkreturn pb_decode_varint(pb_istream_t *stream, uint64_t *dest)
+{
+    pb_byte_t byte;
+    uint_fast8_t bitpos = 0;
+    uint64_t result = 0;
+    
+    do
+    {
+        if (bitpos >= 64)
+            PB_RETURN_ERROR(stream, "varint overflow");
+        
+        if (!pb_readbyte(stream, &byte))
+            return false;
+
+        result |= (uint64_t)(byte & 0x7F) << bitpos;
+        bitpos = (uint_fast8_t)(bitpos + 7);
+    } while (byte & 0x80);
+    
+    *dest = result;
+    return true;
+}
+#endif
+
+bool checkreturn pb_skip_varint(pb_istream_t *stream)
+{
+    pb_byte_t byte;
+    do
+    {
+        if (!pb_read(stream, &byte, 1))
+            return false;
+    } while (byte & 0x80);
+    return true;
+}
+
+bool checkreturn pb_skip_string(pb_istream_t *stream)
+{
+    uint32_t length;
+    if (!pb_decode_varint32(stream, &length))
+        return false;
+    
+    if ((size_t)length != length)
+    {
+        PB_RETURN_ERROR(stream, "size too large");
+    }
+
+    return pb_read(stream, NULL, (size_t)length);
+}
+
+bool checkreturn pb_decode_tag(pb_istream_t *stream, pb_wire_type_t *wire_type, uint32_t *tag, bool *eof)
+{
+    uint32_t temp;
+    *eof = false;
+    *wire_type = (pb_wire_type_t) 0;
+    *tag = 0;
+    
+    if (!pb_decode_varint32_eof(stream, &temp, eof))
+    {
+        return false;
+    }
+    
+    *tag = temp >> 3;
+    *wire_type = (pb_wire_type_t)(temp & 7);
+    return true;
+}
+
+bool checkreturn pb_skip_field(pb_istream_t *stream, pb_wire_type_t wire_type)
+{
+    switch (wire_type)
+    {
+        case PB_WT_VARINT: return pb_skip_varint(stream);
+        case PB_WT_64BIT: return pb_read(stream, NULL, 8);
+        case PB_WT_STRING: return pb_skip_string(stream);
+        case PB_WT_32BIT: return pb_read(stream, NULL, 4);
+        default: PB_RETURN_ERROR(stream, "invalid wire_type");
+    }
+}
+
+/* Read a raw value to buffer, for the purpose of passing it to callback as
+ * a substream. Size is maximum size on call, and actual size on return.
+ */
+static bool checkreturn read_raw_value(pb_istream_t *stream, pb_wire_type_t wire_type, pb_byte_t *buf, size_t *size)
+{
+    size_t max_size = *size;
+    switch (wire_type)
+    {
+        case PB_WT_VARINT:
+            *size = 0;
+            do
+            {
+                (*size)++;
+                if (*size > max_size)
+                    PB_RETURN_ERROR(stream, "varint overflow");
+
+                if (!pb_read(stream, buf, 1))
+                    return false;
+            } while (*buf++ & 0x80);
+            return true;
+            
+        case PB_WT_64BIT:
+            *size = 8;
+            return pb_read(stream, buf, 8);
+        
+        case PB_WT_32BIT:
+            *size = 4;
+            return pb_read(stream, buf, 4);
+        
+        case PB_WT_STRING:
+            /* Calling read_raw_value with a PB_WT_STRING is an error.
+             * Explicitly handle this case and fallthrough to default to avoid
+             * compiler warnings.
+             */
+
+        default: PB_RETURN_ERROR(stream, "invalid wire_type");
+    }
+}
+
+/* Decode string length from stream and return a substream with limited length.
+ * Remember to close the substream using pb_close_string_substream().
+ */
+bool checkreturn pb_make_string_substream(pb_istream_t *stream, pb_istream_t *substream)
+{
+    uint32_t size;
+    if (!pb_decode_varint32(stream, &size))
+        return false;
+    
+    *substream = *stream;
+    if (substream->bytes_left < size)
+        PB_RETURN_ERROR(stream, "parent stream too short");
+    
+    substream->bytes_left = (size_t)size;
+    stream->bytes_left -= (size_t)size;
+    return true;
+}
+
+bool checkreturn pb_close_string_substream(pb_istream_t *stream, pb_istream_t *substream)
+{
+    if (substream->bytes_left) {
+        if (!pb_read(substream, NULL, substream->bytes_left))
+            return false;
+    }
+
+    stream->state = substream->state;
+
+#ifndef PB_NO_ERRMSG
+    stream->errmsg = substream->errmsg;
+#endif
+    return true;
+}
+
+/*************************
+ * Decode a single field *
+ *************************/
+
+static bool checkreturn decode_basic_field(pb_istream_t *stream, pb_wire_type_t wire_type, pb_field_iter_t *field)
+{
+    switch (PB_LTYPE(field->type))
+    {
+        case PB_LTYPE_BOOL:
+            if (wire_type != PB_WT_VARINT && wire_type != PB_WT_PACKED)
+                PB_RETURN_ERROR(stream, "wrong wire type");
+
+            return pb_dec_bool(stream, field);
+
+        case PB_LTYPE_VARINT:
+        case PB_LTYPE_UVARINT:
+        case PB_LTYPE_SVARINT:
+            if (wire_type != PB_WT_VARINT && wire_type != PB_WT_PACKED)
+                PB_RETURN_ERROR(stream, "wrong wire type");
+
+            return pb_dec_varint(stream, field);
+
+        case PB_LTYPE_FIXED32:
+            if (wire_type != PB_WT_32BIT && wire_type != PB_WT_PACKED)
+                PB_RETURN_ERROR(stream, "wrong wire type");
+
+            return pb_decode_fixed32(stream, field->pData);
+
+        case PB_LTYPE_FIXED64:
+            if (wire_type != PB_WT_64BIT && wire_type != PB_WT_PACKED)
+                PB_RETURN_ERROR(stream, "wrong wire type");
+
+#ifdef PB_CONVERT_DOUBLE_FLOAT
+            if (field->data_size == sizeof(float))
+            {
+                return pb_decode_double_as_float(stream, (float*)field->pData);
+            }
+#endif
+
+#ifdef PB_WITHOUT_64BIT
+            PB_RETURN_ERROR(stream, "invalid data_size");
+#else
+            return pb_decode_fixed64(stream, field->pData);
+#endif
+
+        case PB_LTYPE_BYTES:
+            if (wire_type != PB_WT_STRING)
+                PB_RETURN_ERROR(stream, "wrong wire type");
+
+            return pb_dec_bytes(stream, field);
+
+        case PB_LTYPE_STRING:
+            if (wire_type != PB_WT_STRING)
+                PB_RETURN_ERROR(stream, "wrong wire type");
+
+            return pb_dec_string(stream, field);
+
+        case PB_LTYPE_SUBMESSAGE:
+        case PB_LTYPE_SUBMSG_W_CB:
+            if (wire_type != PB_WT_STRING)
+                PB_RETURN_ERROR(stream, "wrong wire type");
+
+            return pb_dec_submessage(stream, field);
+
+        case PB_LTYPE_FIXED_LENGTH_BYTES:
+            if (wire_type != PB_WT_STRING)
+                PB_RETURN_ERROR(stream, "wrong wire type");
+
+            return pb_dec_fixed_length_bytes(stream, field);
+
+        default:
+            PB_RETURN_ERROR(stream, "invalid field type");
+    }
+}
+
+static bool checkreturn decode_static_field(pb_istream_t *stream, pb_wire_type_t wire_type, pb_field_iter_t *field)
+{
+    switch (PB_HTYPE(field->type))
+    {
+        case PB_HTYPE_REQUIRED:
+            return decode_basic_field(stream, wire_type, field);
+            
+        case PB_HTYPE_OPTIONAL:
+            if (field->pSize != NULL)
+                *(bool*)field->pSize = true;
+            return decode_basic_field(stream, wire_type, field);
+    
+        case PB_HTYPE_REPEATED:
+            if (wire_type == PB_WT_STRING
+                && PB_LTYPE(field->type) <= PB_LTYPE_LAST_PACKABLE)
+            {
+                /* Packed array */
+                bool status = true;
+                pb_istream_t substream;
+                pb_size_t *size = (pb_size_t*)field->pSize;
+                field->pData = (char*)field->pField + field->data_size * (*size);
+
+                if (!pb_make_string_substream(stream, &substream))
+                    return false;
+
+                while (substream.bytes_left > 0 && *size < field->array_size)
+                {
+                    if (!decode_basic_field(&substream, PB_WT_PACKED, field))
+                    {
+                        status = false;
+                        break;
+                    }
+                    (*size)++;
+                    field->pData = (char*)field->pData + field->data_size;
+                }
+
+                if (substream.bytes_left != 0)
+                    PB_RETURN_ERROR(stream, "array overflow");
+                if (!pb_close_string_substream(stream, &substream))
+                    return false;
+
+                return status;
+            }
+            else
+            {
+                /* Repeated field */
+                pb_size_t *size = (pb_size_t*)field->pSize;
+                field->pData = (char*)field->pField + field->data_size * (*size);
+
+                if ((*size)++ >= field->array_size)
+                    PB_RETURN_ERROR(stream, "array overflow");
+
+                return decode_basic_field(stream, wire_type, field);
+            }
+
+        case PB_HTYPE_ONEOF:
+            if (PB_LTYPE_IS_SUBMSG(field->type) &&
+                *(pb_size_t*)field->pSize != field->tag)
+            {
+                /* We memset to zero so that any callbacks are set to NULL.
+                 * This is because the callbacks might otherwise have values
+                 * from some other union field.
+                 * If callbacks are needed inside oneof field, use .proto
+                 * option submsg_callback to have a separate callback function
+                 * that can set the fields before submessage is decoded.
+                 * pb_dec_submessage() will set any default values. */
+                memset(field->pData, 0, (size_t)field->data_size);
+
+                /* Set default values for the submessage fields. */
+                if (field->submsg_desc->default_value != NULL ||
+                    field->submsg_desc->field_callback != NULL ||
+                    field->submsg_desc->submsg_info[0] != NULL)
+                {
+                    pb_field_iter_t submsg_iter;
+                    if (pb_field_iter_begin(&submsg_iter, field->submsg_desc, field->pData))
+                    {
+                        if (!pb_message_set_to_defaults(&submsg_iter))
+                            PB_RETURN_ERROR(stream, "failed to set defaults");
+                    }
+                }
+            }
+            *(pb_size_t*)field->pSize = field->tag;
+
+            return decode_basic_field(stream, wire_type, field);
+
+        default:
+            PB_RETURN_ERROR(stream, "invalid field type");
+    }
+}
+
+#ifdef PB_ENABLE_MALLOC
+/* Allocate storage for the field and store the pointer at iter->pData.
+ * array_size is the number of entries to reserve in an array.
+ * Zero size is not allowed, use pb_free() for releasing.
+ */
+static bool checkreturn allocate_field(pb_istream_t *stream, void *pData, size_t data_size, size_t array_size)
+{    
+    void *ptr = *(void**)pData;
+    
+    if (data_size == 0 || array_size == 0)
+        PB_RETURN_ERROR(stream, "invalid size");
+    
+#ifdef __AVR__
+    /* Workaround for AVR libc bug 53284: http://savannah.nongnu.org/bugs/?53284
+     * Realloc to size of 1 byte can cause corruption of the malloc structures.
+     */
+    if (data_size == 1 && array_size == 1)
+    {
+        data_size = 2;
+    }
+#endif
+
+    /* Check for multiplication overflows.
+     * This code avoids the costly division if the sizes are small enough.
+     * Multiplication is safe as long as only half of bits are set
+     * in either multiplicand.
+     */
+    {
+        const size_t check_limit = (size_t)1 << (sizeof(size_t) * 4);
+        if (data_size >= check_limit || array_size >= check_limit)
+        {
+            const size_t size_max = (size_t)-1;
+            if (size_max / array_size < data_size)
+            {
+                PB_RETURN_ERROR(stream, "size too large");
+            }
+        }
+    }
+    
+    /* Allocate new or expand previous allocation */
+    /* Note: on failure the old pointer will remain in the structure,
+     * the message must be freed by caller also on error return. */
+    ptr = pb_realloc(ptr, array_size * data_size);
+    if (ptr == NULL)
+        PB_RETURN_ERROR(stream, "realloc failed");
+    
+    *(void**)pData = ptr;
+    return true;
+}
+
+/* Clear a newly allocated item in case it contains a pointer, or is a submessage. */
+static void initialize_pointer_field(void *pItem, pb_field_iter_t *field)
+{
+    if (PB_LTYPE(field->type) == PB_LTYPE_STRING ||
+        PB_LTYPE(field->type) == PB_LTYPE_BYTES)
+    {
+        *(void**)pItem = NULL;
+    }
+    else if (PB_LTYPE_IS_SUBMSG(field->type))
+    {
+        /* We memset to zero so that any callbacks are set to NULL.
+         * Default values will be set by pb_dec_submessage(). */
+        memset(pItem, 0, field->data_size);
+    }
+}
+#endif
+
+static bool checkreturn decode_pointer_field(pb_istream_t *stream, pb_wire_type_t wire_type, pb_field_iter_t *field)
+{
+#ifndef PB_ENABLE_MALLOC
+    PB_UNUSED(wire_type);
+    PB_UNUSED(field);
+    PB_RETURN_ERROR(stream, "no malloc support");
+#else
+    switch (PB_HTYPE(field->type))
+    {
+        case PB_HTYPE_REQUIRED:
+        case PB_HTYPE_OPTIONAL:
+        case PB_HTYPE_ONEOF:
+            if (PB_LTYPE_IS_SUBMSG(field->type) && *(void**)field->pField != NULL)
+            {
+                /* Duplicate field, have to release the old allocation first. */
+                /* FIXME: Does this work correctly for oneofs? */
+                pb_release_single_field(field);
+            }
+        
+            if (PB_HTYPE(field->type) == PB_HTYPE_ONEOF)
+            {
+                *(pb_size_t*)field->pSize = field->tag;
+            }
+
+            if (PB_LTYPE(field->type) == PB_LTYPE_STRING ||
+                PB_LTYPE(field->type) == PB_LTYPE_BYTES)
+            {
+                /* pb_dec_string and pb_dec_bytes handle allocation themselves */
+                field->pData = field->pField;
+                return decode_basic_field(stream, wire_type, field);
+            }
+            else
+            {
+                if (!allocate_field(stream, field->pField, field->data_size, 1))
+                    return false;
+                
+                field->pData = *(void**)field->pField;
+                initialize_pointer_field(field->pData, field);
+                return decode_basic_field(stream, wire_type, field);
+            }
+    
+        case PB_HTYPE_REPEATED:
+            if (wire_type == PB_WT_STRING
+                && PB_LTYPE(field->type) <= PB_LTYPE_LAST_PACKABLE)
+            {
+                /* Packed array, multiple items come in at once. */
+                bool status = true;
+                pb_size_t *size = (pb_size_t*)field->pSize;
+                size_t allocated_size = *size;
+                pb_istream_t substream;
+                
+                if (!pb_make_string_substream(stream, &substream))
+                    return false;
+                
+                while (substream.bytes_left)
+                {
+                    if (*size == PB_SIZE_MAX)
+                    {
+#ifndef PB_NO_ERRMSG
+                        stream->errmsg = "too many array entries";
+#endif
+                        status = false;
+                        break;
+                    }
+
+                    if ((size_t)*size + 1 > allocated_size)
+                    {
+                        /* Allocate more storage. This tries to guess the
+                         * number of remaining entries. Round the division
+                         * upwards. */
+                        size_t remain = (substream.bytes_left - 1) / field->data_size + 1;
+                        if (remain < PB_SIZE_MAX - allocated_size)
+                            allocated_size += remain;
+                        else
+                            allocated_size += 1;
+                        
+                        if (!allocate_field(&substream, field->pField, field->data_size, allocated_size))
+                        {
+                            status = false;
+                            break;
+                        }
+                    }
+
+                    /* Decode the array entry */
+                    field->pData = *(char**)field->pField + field->data_size * (*size);
+                    initialize_pointer_field(field->pData, field);
+                    if (!decode_basic_field(&substream, PB_WT_PACKED, field))
+                    {
+                        status = false;
+                        break;
+                    }
+                    
+                    (*size)++;
+                }
+                if (!pb_close_string_substream(stream, &substream))
+                    return false;
+                
+                return status;
+            }
+            else
+            {
+                /* Normal repeated field, i.e. only one item at a time. */
+                pb_size_t *size = (pb_size_t*)field->pSize;
+
+                if (*size == PB_SIZE_MAX)
+                    PB_RETURN_ERROR(stream, "too many array entries");
+                
+                if (!allocate_field(stream, field->pField, field->data_size, (size_t)(*size + 1)))
+                    return false;
+            
+                field->pData = *(char**)field->pField + field->data_size * (*size);
+                (*size)++;
+                initialize_pointer_field(field->pData, field);
+                return decode_basic_field(stream, wire_type, field);
+            }
+
+        default:
+            PB_RETURN_ERROR(stream, "invalid field type");
+    }
+#endif
+}
+
+static bool checkreturn decode_callback_field(pb_istream_t *stream, pb_wire_type_t wire_type, pb_field_iter_t *field)
+{
+    if (!field->descriptor->field_callback)
+        return pb_skip_field(stream, wire_type);
+
+    if (wire_type == PB_WT_STRING)
+    {
+        pb_istream_t substream;
+        size_t prev_bytes_left;
+        
+        if (!pb_make_string_substream(stream, &substream))
+            return false;
+        
+        do
+        {
+            prev_bytes_left = substream.bytes_left;
+            if (!field->descriptor->field_callback(&substream, NULL, field))
+                PB_RETURN_ERROR(stream, "callback failed");
+        } while (substream.bytes_left > 0 && substream.bytes_left < prev_bytes_left);
+        
+        if (!pb_close_string_substream(stream, &substream))
+            return false;
+
+        return true;
+    }
+    else
+    {
+        /* Copy the single scalar value to stack.
+         * This is required so that we can limit the stream length,
+         * which in turn allows to use same callback for packed and
+         * not-packed fields. */
+        pb_istream_t substream;
+        pb_byte_t buffer[10];
+        size_t size = sizeof(buffer);
+        
+        if (!read_raw_value(stream, wire_type, buffer, &size))
+            return false;
+        substream = pb_istream_from_buffer(buffer, size);
+        
+        return field->descriptor->field_callback(&substream, NULL, field);
+    }
+}
+
+static bool checkreturn decode_field(pb_istream_t *stream, pb_wire_type_t wire_type, pb_field_iter_t *field)
+{
+#ifdef PB_ENABLE_MALLOC
+    /* When decoding an oneof field, check if there is old data that must be
+     * released first. */
+    if (PB_HTYPE(field->type) == PB_HTYPE_ONEOF)
+    {
+        if (!pb_release_union_field(stream, field))
+            return false;
+    }
+#endif
+
+    switch (PB_ATYPE(field->type))
+    {
+        case PB_ATYPE_STATIC:
+            return decode_static_field(stream, wire_type, field);
+        
+        case PB_ATYPE_POINTER:
+            return decode_pointer_field(stream, wire_type, field);
+        
+        case PB_ATYPE_CALLBACK:
+            return decode_callback_field(stream, wire_type, field);
+        
+        default:
+            PB_RETURN_ERROR(stream, "invalid field type");
+    }
+}
+
+/* Default handler for extension fields. Expects to have a pb_msgdesc_t
+ * pointer in the extension->type->arg field, pointing to a message with
+ * only one field in it.  */
+static bool checkreturn default_extension_decoder(pb_istream_t *stream,
+    pb_extension_t *extension, uint32_t tag, pb_wire_type_t wire_type)
+{
+    pb_field_iter_t iter;
+
+    if (!pb_field_iter_begin_extension(&iter, extension))
+        PB_RETURN_ERROR(stream, "invalid extension");
+
+    if (iter.tag != tag || !iter.message)
+        return true;
+
+    extension->found = true;
+    return decode_field(stream, wire_type, &iter);
+}
+
+/* Try to decode an unknown field as an extension field. Tries each extension
+ * decoder in turn, until one of them handles the field or loop ends. */
+static bool checkreturn decode_extension(pb_istream_t *stream,
+    uint32_t tag, pb_wire_type_t wire_type, pb_extension_t *extension)
+{
+    size_t pos = stream->bytes_left;
+    
+    while (extension != NULL && pos == stream->bytes_left)
+    {
+        bool status;
+        if (extension->type->decode)
+            status = extension->type->decode(stream, extension, tag, wire_type);
+        else
+            status = default_extension_decoder(stream, extension, tag, wire_type);
+
+        if (!status)
+            return false;
+        
+        extension = extension->next;
+    }
+    
+    return true;
+}
+
+/* Initialize message fields to default values, recursively */
+static bool pb_field_set_to_default(pb_field_iter_t *field)
+{
+    pb_type_t type;
+    type = field->type;
+
+    if (PB_LTYPE(type) == PB_LTYPE_EXTENSION)
+    {
+        pb_extension_t *ext = *(pb_extension_t* const *)field->pData;
+        while (ext != NULL)
+        {
+            pb_field_iter_t ext_iter;
+            if (pb_field_iter_begin_extension(&ext_iter, ext))
+            {
+                ext->found = false;
+                if (!pb_message_set_to_defaults(&ext_iter))
+                    return false;
+            }
+            ext = ext->next;
+        }
+    }
+    else if (PB_ATYPE(type) == PB_ATYPE_STATIC)
+    {
+        bool init_data = true;
+        if (PB_HTYPE(type) == PB_HTYPE_OPTIONAL && field->pSize != NULL)
+        {
+            /* Set has_field to false. Still initialize the optional field
+             * itself also. */
+            *(bool*)field->pSize = false;
+        }
+        else if (PB_HTYPE(type) == PB_HTYPE_REPEATED ||
+                 PB_HTYPE(type) == PB_HTYPE_ONEOF)
+        {
+            /* REPEATED: Set array count to 0, no need to initialize contents.
+               ONEOF: Set which_field to 0. */
+            *(pb_size_t*)field->pSize = 0;
+            init_data = false;
+        }
+
+        if (init_data)
+        {
+            if (PB_LTYPE_IS_SUBMSG(field->type) &&
+                (field->submsg_desc->default_value != NULL ||
+                 field->submsg_desc->field_callback != NULL ||
+                 field->submsg_desc->submsg_info[0] != NULL))
+            {
+                /* Initialize submessage to defaults.
+                 * Only needed if it has default values
+                 * or callback/submessage fields. */
+                pb_field_iter_t submsg_iter;
+                if (pb_field_iter_begin(&submsg_iter, field->submsg_desc, field->pData))
+                {
+                    if (!pb_message_set_to_defaults(&submsg_iter))
+                        return false;
+                }
+            }
+            else
+            {
+                /* Initialize to zeros */
+                memset(field->pData, 0, (size_t)field->data_size);
+            }
+        }
+    }
+    else if (PB_ATYPE(type) == PB_ATYPE_POINTER)
+    {
+        /* Initialize the pointer to NULL. */
+        *(void**)field->pField = NULL;
+
+        /* Initialize array count to 0. */
+        if (PB_HTYPE(type) == PB_HTYPE_REPEATED ||
+            PB_HTYPE(type) == PB_HTYPE_ONEOF)
+        {
+            *(pb_size_t*)field->pSize = 0;
+        }
+    }
+    else if (PB_ATYPE(type) == PB_ATYPE_CALLBACK)
+    {
+        /* Don't overwrite callback */
+    }
+
+    return true;
+}
+
+static bool pb_message_set_to_defaults(pb_field_iter_t *iter)
+{
+    pb_istream_t defstream = PB_ISTREAM_EMPTY;
+    uint32_t tag = 0;
+    pb_wire_type_t wire_type = PB_WT_VARINT;
+    bool eof;
+
+    if (iter->descriptor->default_value)
+    {
+        defstream = pb_istream_from_buffer(iter->descriptor->default_value, (size_t)-1);
+        if (!pb_decode_tag(&defstream, &wire_type, &tag, &eof))
+            return false;
+    }
+
+    do
+    {
+        if (!pb_field_set_to_default(iter))
+            return false;
+
+        if (tag != 0 && iter->tag == tag)
+        {
+            /* We have a default value for this field in the defstream */
+            if (!decode_field(&defstream, wire_type, iter))
+                return false;
+            if (!pb_decode_tag(&defstream, &wire_type, &tag, &eof))
+                return false;
+
+            if (iter->pSize)
+                *(bool*)iter->pSize = false;
+        }
+    } while (pb_field_iter_next(iter));
+
+    return true;
+}
+
+/*********************
+ * Decode all fields *
+ *********************/
+
+static bool checkreturn pb_decode_inner(pb_istream_t *stream, const pb_msgdesc_t *fields, void *dest_struct, unsigned int flags)
+{
+    uint32_t extension_range_start = 0;
+    pb_extension_t *extensions = NULL;
+
+    /* 'fixed_count_field' and 'fixed_count_size' track position of a repeated fixed
+     * count field. This can only handle _one_ repeated fixed count field that
+     * is unpacked and unordered among other (non repeated fixed count) fields.
+     */
+    pb_size_t fixed_count_field = PB_SIZE_MAX;
+    pb_size_t fixed_count_size = 0;
+    pb_size_t fixed_count_total_size = 0;
+
+    pb_fields_seen_t fields_seen = {{0, 0}};
+    const uint32_t allbits = ~(uint32_t)0;
+    pb_field_iter_t iter;
+
+    if (pb_field_iter_begin(&iter, fields, dest_struct))
+    {
+        if ((flags & PB_DECODE_NOINIT) == 0)
+        {
+            if (!pb_message_set_to_defaults(&iter))
+                PB_RETURN_ERROR(stream, "failed to set defaults");
+        }
+    }
+
+    while (stream->bytes_left)
+    {
+        uint32_t tag;
+        pb_wire_type_t wire_type;
+        bool eof;
+
+        if (!pb_decode_tag(stream, &wire_type, &tag, &eof))
+        {
+            if (eof)
+                break;
+            else
+                return false;
+        }
+
+        if (tag == 0)
+        {
+          if (flags & PB_DECODE_NULLTERMINATED)
+          {
+            break;
+          }
+          else
+          {
+            PB_RETURN_ERROR(stream, "zero tag");
+          }
+        }
+
+        if (!pb_field_iter_find(&iter, tag) || PB_LTYPE(iter.type) == PB_LTYPE_EXTENSION)
+        {
+            /* No match found, check if it matches an extension. */
+            if (extension_range_start == 0)
+            {
+                if (pb_field_iter_find_extension(&iter))
+                {
+                    extensions = *(pb_extension_t* const *)iter.pData;
+                    extension_range_start = iter.tag;
+                }
+
+                if (!extensions)
+                {
+                    extension_range_start = (uint32_t)-1;
+                }
+            }
+
+            if (tag >= extension_range_start)
+            {
+                size_t pos = stream->bytes_left;
+
+                if (!decode_extension(stream, tag, wire_type, extensions))
+                    return false;
+
+                if (pos != stream->bytes_left)
+                {
+                    /* The field was handled */
+                    continue;
+                }
+            }
+
+            /* No match found, skip data */
+            if (!pb_skip_field(stream, wire_type))
+                return false;
+            continue;
+        }
+
+        /* If a repeated fixed count field was found, get size from
+         * 'fixed_count_field' as there is no counter contained in the struct.
+         */
+        if (PB_HTYPE(iter.type) == PB_HTYPE_REPEATED && iter.pSize == &iter.array_size)
+        {
+            if (fixed_count_field != iter.index) {
+                /* If the new fixed count field does not match the previous one,
+                 * check that the previous one is NULL or that it finished
+                 * receiving all the expected data.
+                 */
+                if (fixed_count_field != PB_SIZE_MAX &&
+                    fixed_count_size != fixed_count_total_size)
+                {
+                    PB_RETURN_ERROR(stream, "wrong size for fixed count field");
+                }
+
+                fixed_count_field = iter.index;
+                fixed_count_size = 0;
+                fixed_count_total_size = iter.array_size;
+            }
+
+            iter.pSize = &fixed_count_size;
+        }
+
+        if (PB_HTYPE(iter.type) == PB_HTYPE_REQUIRED
+            && iter.required_field_index < PB_MAX_REQUIRED_FIELDS)
+        {
+            uint32_t tmp = ((uint32_t)1 << (iter.required_field_index & 31));
+            fields_seen.bitfield[iter.required_field_index >> 5] |= tmp;
+        }
+
+        if (!decode_field(stream, wire_type, &iter))
+            return false;
+    }
+
+    /* Check that all elements of the last decoded fixed count field were present. */
+    if (fixed_count_field != PB_SIZE_MAX &&
+        fixed_count_size != fixed_count_total_size)
+    {
+        PB_RETURN_ERROR(stream, "wrong size for fixed count field");
+    }
+
+    /* Check that all required fields were present. */
+    {
+        pb_size_t req_field_count = iter.descriptor->required_field_count;
+
+        if (req_field_count > 0)
+        {
+            pb_size_t i;
+
+            if (req_field_count > PB_MAX_REQUIRED_FIELDS)
+                req_field_count = PB_MAX_REQUIRED_FIELDS;
+
+            /* Check the whole words */
+            for (i = 0; i < (req_field_count >> 5); i++)
+            {
+                if (fields_seen.bitfield[i] != allbits)
+                    PB_RETURN_ERROR(stream, "missing required field");
+            }
+
+            /* Check the remaining bits (if any) */
+            if ((req_field_count & 31) != 0)
+            {
+                if (fields_seen.bitfield[req_field_count >> 5] !=
+                    (allbits >> (uint_least8_t)(32 - (req_field_count & 31))))
+                {
+                    PB_RETURN_ERROR(stream, "missing required field");
+                }
+            }
+        }
+    }
+
+    return true;
+}
+
+bool checkreturn pb_decode_ex(pb_istream_t *stream, const pb_msgdesc_t *fields, void *dest_struct, unsigned int flags)
+{
+    bool status;
+
+    if ((flags & PB_DECODE_DELIMITED) == 0)
+    {
+      status = pb_decode_inner(stream, fields, dest_struct, flags);
+    }
+    else
+    {
+      pb_istream_t substream;
+      if (!pb_make_string_substream(stream, &substream))
+        return false;
+
+      status = pb_decode_inner(&substream, fields, dest_struct, flags);
+
+      if (!pb_close_string_substream(stream, &substream))
+        return false;
+    }
+    
+#ifdef PB_ENABLE_MALLOC
+    if (!status)
+        pb_release(fields, dest_struct);
+#endif
+    
+    return status;
+}
+
+bool checkreturn pb_decode(pb_istream_t *stream, const pb_msgdesc_t *fields, void *dest_struct)
+{
+    bool status;
+
+    status = pb_decode_inner(stream, fields, dest_struct, 0);
+
+#ifdef PB_ENABLE_MALLOC
+    if (!status)
+        pb_release(fields, dest_struct);
+#endif
+
+    return status;
+}
+
+#ifdef PB_ENABLE_MALLOC
+/* Given an oneof field, if there has already been a field inside this oneof,
+ * release it before overwriting with a different one. */
+static bool pb_release_union_field(pb_istream_t *stream, pb_field_iter_t *field)
+{
+    pb_field_iter_t old_field = *field;
+    pb_size_t old_tag = *(pb_size_t*)field->pSize; /* Previous which_ value */
+    pb_size_t new_tag = field->tag; /* New which_ value */
+
+    if (old_tag == 0)
+        return true; /* Ok, no old data in union */
+
+    if (old_tag == new_tag)
+        return true; /* Ok, old data is of same type => merge */
+
+    /* Release old data. The find can fail if the message struct contains
+     * invalid data. */
+    if (!pb_field_iter_find(&old_field, old_tag))
+        PB_RETURN_ERROR(stream, "invalid union tag");
+
+    pb_release_single_field(&old_field);
+
+    if (PB_ATYPE(field->type) == PB_ATYPE_POINTER)
+    {
+        /* Initialize the pointer to NULL to make sure it is valid
+         * even in case of error return. */
+        *(void**)field->pField = NULL;
+        field->pData = NULL;
+    }
+
+    return true;
+}
+
+static void pb_release_single_field(pb_field_iter_t *field)
+{
+    pb_type_t type;
+    type = field->type;
+
+    if (PB_HTYPE(type) == PB_HTYPE_ONEOF)
+    {
+        if (*(pb_size_t*)field->pSize != field->tag)
+            return; /* This is not the current field in the union */
+    }
+
+    /* Release anything contained inside an extension or submsg.
+     * This has to be done even if the submsg itself is statically
+     * allocated. */
+    if (PB_LTYPE(type) == PB_LTYPE_EXTENSION)
+    {
+        /* Release fields from all extensions in the linked list */
+        pb_extension_t *ext = *(pb_extension_t**)field->pData;
+        while (ext != NULL)
+        {
+            pb_field_iter_t ext_iter;
+            if (pb_field_iter_begin_extension(&ext_iter, ext))
+            {
+                pb_release_single_field(&ext_iter);
+            }
+            ext = ext->next;
+        }
+    }
+    else if (PB_LTYPE_IS_SUBMSG(type) && PB_ATYPE(type) != PB_ATYPE_CALLBACK)
+    {
+        /* Release fields in submessage or submsg array */
+        pb_size_t count = 1;
+        
+        if (PB_ATYPE(type) == PB_ATYPE_POINTER)
+        {
+            field->pData = *(void**)field->pField;
+        }
+        else
+        {
+            field->pData = field->pField;
+        }
+        
+        if (PB_HTYPE(type) == PB_HTYPE_REPEATED)
+        {
+            count = *(pb_size_t*)field->pSize;
+
+            if (PB_ATYPE(type) == PB_ATYPE_STATIC && count > field->array_size)
+            {
+                /* Protect against corrupted _count fields */
+                count = field->array_size;
+            }
+        }
+        
+        if (field->pData)
+        {
+            for (; count > 0; count--)
+            {
+                pb_release(field->submsg_desc, field->pData);
+                field->pData = (char*)field->pData + field->data_size;
+            }
+        }
+    }
+    
+    if (PB_ATYPE(type) == PB_ATYPE_POINTER)
+    {
+        if (PB_HTYPE(type) == PB_HTYPE_REPEATED &&
+            (PB_LTYPE(type) == PB_LTYPE_STRING ||
+             PB_LTYPE(type) == PB_LTYPE_BYTES))
+        {
+            /* Release entries in repeated string or bytes array */
+            void **pItem = *(void***)field->pField;
+            pb_size_t count = *(pb_size_t*)field->pSize;
+            for (; count > 0; count--)
+            {
+                pb_free(*pItem);
+                *pItem++ = NULL;
+            }
+        }
+        
+        if (PB_HTYPE(type) == PB_HTYPE_REPEATED)
+        {
+            /* We are going to release the array, so set the size to 0 */
+            *(pb_size_t*)field->pSize = 0;
+        }
+        
+        /* Release main pointer */
+        pb_free(*(void**)field->pField);
+        *(void**)field->pField = NULL;
+    }
+}
+
+void pb_release(const pb_msgdesc_t *fields, void *dest_struct)
+{
+    pb_field_iter_t iter;
+    
+    if (!dest_struct)
+        return; /* Ignore NULL pointers, similar to free() */
+
+    if (!pb_field_iter_begin(&iter, fields, dest_struct))
+        return; /* Empty message type */
+    
+    do
+    {
+        pb_release_single_field(&iter);
+    } while (pb_field_iter_next(&iter));
+}
+#endif
+
+/* Field decoders */
+
+bool pb_decode_bool(pb_istream_t *stream, bool *dest)
+{
+    uint32_t value;
+    if (!pb_decode_varint32(stream, &value))
+        return false;
+
+    *(bool*)dest = (value != 0);
+    return true;
+}
+
+bool pb_decode_svarint(pb_istream_t *stream, pb_int64_t *dest)
+{
+    pb_uint64_t value;
+    if (!pb_decode_varint(stream, &value))
+        return false;
+    
+    if (value & 1)
+        *dest = (pb_int64_t)(~(value >> 1));
+    else
+        *dest = (pb_int64_t)(value >> 1);
+    
+    return true;
+}
+
+bool pb_decode_fixed32(pb_istream_t *stream, void *dest)
+{
+    union {
+        uint32_t fixed32;
+        pb_byte_t bytes[4];
+    } u;
+
+    if (!pb_read(stream, u.bytes, 4))
+        return false;
+
+#if defined(__BYTE_ORDER) && __BYTE_ORDER == __LITTLE_ENDIAN && CHAR_BIT == 8
+    /* fast path - if we know that we're on little endian, assign directly */
+    *(uint32_t*)dest = u.fixed32;
+#else
+    *(uint32_t*)dest = ((uint32_t)u.bytes[0] << 0) |
+                       ((uint32_t)u.bytes[1] << 8) |
+                       ((uint32_t)u.bytes[2] << 16) |
+                       ((uint32_t)u.bytes[3] << 24);
+#endif
+    return true;
+}
+
+#ifndef PB_WITHOUT_64BIT
+bool pb_decode_fixed64(pb_istream_t *stream, void *dest)
+{
+    union {
+        uint64_t fixed64;
+        pb_byte_t bytes[8];
+    } u;
+
+    if (!pb_read(stream, u.bytes, 8))
+        return false;
+
+#if defined(__BYTE_ORDER) && __BYTE_ORDER == __LITTLE_ENDIAN && CHAR_BIT == 8
+    /* fast path - if we know that we're on little endian, assign directly */
+    *(uint64_t*)dest = u.fixed64;
+#else
+    *(uint64_t*)dest = ((uint64_t)u.bytes[0] << 0) |
+                       ((uint64_t)u.bytes[1] << 8) |
+                       ((uint64_t)u.bytes[2] << 16) |
+                       ((uint64_t)u.bytes[3] << 24) |
+                       ((uint64_t)u.bytes[4] << 32) |
+                       ((uint64_t)u.bytes[5] << 40) |
+                       ((uint64_t)u.bytes[6] << 48) |
+                       ((uint64_t)u.bytes[7] << 56);
+#endif
+    return true;
+}
+#endif
+
+static bool checkreturn pb_dec_bool(pb_istream_t *stream, const pb_field_iter_t *field)
+{
+    return pb_decode_bool(stream, (bool*)field->pData);
+}
+
+static bool checkreturn pb_dec_varint(pb_istream_t *stream, const pb_field_iter_t *field)
+{
+    if (PB_LTYPE(field->type) == PB_LTYPE_UVARINT)
+    {
+        pb_uint64_t value, clamped;
+        if (!pb_decode_varint(stream, &value))
+            return false;
+
+        /* Cast to the proper field size, while checking for overflows */
+        if (field->data_size == sizeof(pb_uint64_t))
+            clamped = *(pb_uint64_t*)field->pData = value;
+        else if (field->data_size == sizeof(uint32_t))
+            clamped = *(uint32_t*)field->pData = (uint32_t)value;
+        else if (field->data_size == sizeof(uint_least16_t))
+            clamped = *(uint_least16_t*)field->pData = (uint_least16_t)value;
+        else if (field->data_size == sizeof(uint_least8_t))
+            clamped = *(uint_least8_t*)field->pData = (uint_least8_t)value;
+        else
+            PB_RETURN_ERROR(stream, "invalid data_size");
+
+        if (clamped != value)
+            PB_RETURN_ERROR(stream, "integer too large");
+
+        return true;
+    }
+    else
+    {
+        pb_uint64_t value;
+        pb_int64_t svalue;
+        pb_int64_t clamped;
+
+        if (PB_LTYPE(field->type) == PB_LTYPE_SVARINT)
+        {
+            if (!pb_decode_svarint(stream, &svalue))
+                return false;
+        }
+        else
+        {
+            if (!pb_decode_varint(stream, &value))
+                return false;
+
+            /* See issue 97: Google's C++ protobuf allows negative varint values to
+            * be cast as int32_t, instead of the int64_t that should be used when
+            * encoding. Nanopb versions before 0.2.5 had a bug in encoding. In order to
+            * not break decoding of such messages, we cast <=32 bit fields to
+            * int32_t first to get the sign correct.
+            */
+            if (field->data_size == sizeof(pb_int64_t))
+                svalue = (pb_int64_t)value;
+            else
+                svalue = (int32_t)value;
+        }
+
+        /* Cast to the proper field size, while checking for overflows */
+        if (field->data_size == sizeof(pb_int64_t))
+            clamped = *(pb_int64_t*)field->pData = svalue;
+        else if (field->data_size == sizeof(int32_t))
+            clamped = *(int32_t*)field->pData = (int32_t)svalue;
+        else if (field->data_size == sizeof(int_least16_t))
+            clamped = *(int_least16_t*)field->pData = (int_least16_t)svalue;
+        else if (field->data_size == sizeof(int_least8_t))
+            clamped = *(int_least8_t*)field->pData = (int_least8_t)svalue;
+        else
+            PB_RETURN_ERROR(stream, "invalid data_size");
+
+        if (clamped != svalue)
+            PB_RETURN_ERROR(stream, "integer too large");
+
+        return true;
+    }
+}
+
+static bool checkreturn pb_dec_bytes(pb_istream_t *stream, const pb_field_iter_t *field)
+{
+    uint32_t size;
+    size_t alloc_size;
+    pb_bytes_array_t *dest;
+    
+    if (!pb_decode_varint32(stream, &size))
+        return false;
+    
+    if (size > PB_SIZE_MAX)
+        PB_RETURN_ERROR(stream, "bytes overflow");
+    
+    alloc_size = PB_BYTES_ARRAY_T_ALLOCSIZE(size);
+    if (size > alloc_size)
+        PB_RETURN_ERROR(stream, "size too large");
+    
+    if (PB_ATYPE(field->type) == PB_ATYPE_POINTER)
+    {
+#ifndef PB_ENABLE_MALLOC
+        PB_RETURN_ERROR(stream, "no malloc support");
+#else
+        if (stream->bytes_left < size)
+            PB_RETURN_ERROR(stream, "end-of-stream");
+
+        if (!allocate_field(stream, field->pData, alloc_size, 1))
+            return false;
+        dest = *(pb_bytes_array_t**)field->pData;
+#endif
+    }
+    else
+    {
+        if (alloc_size > field->data_size)
+            PB_RETURN_ERROR(stream, "bytes overflow");
+        dest = (pb_bytes_array_t*)field->pData;
+    }
+
+    dest->size = (pb_size_t)size;
+    return pb_read(stream, dest->bytes, (size_t)size);
+}
+
+static bool checkreturn pb_dec_string(pb_istream_t *stream, const pb_field_iter_t *field)
+{
+    uint32_t size;
+    size_t alloc_size;
+    pb_byte_t *dest = (pb_byte_t*)field->pData;
+
+    if (!pb_decode_varint32(stream, &size))
+        return false;
+
+    if (size == (uint32_t)-1)
+        PB_RETURN_ERROR(stream, "size too large");
+
+    /* Space for null terminator */
+    alloc_size = (size_t)(size + 1);
+
+    if (alloc_size < size)
+        PB_RETURN_ERROR(stream, "size too large");
+
+    if (PB_ATYPE(field->type) == PB_ATYPE_POINTER)
+    {
+#ifndef PB_ENABLE_MALLOC
+        PB_RETURN_ERROR(stream, "no malloc support");
+#else
+        if (stream->bytes_left < size)
+            PB_RETURN_ERROR(stream, "end-of-stream");
+
+        if (!allocate_field(stream, field->pData, alloc_size, 1))
+            return false;
+        dest = *(pb_byte_t**)field->pData;
+#endif
+    }
+    else
+    {
+        if (alloc_size > field->data_size)
+            PB_RETURN_ERROR(stream, "string overflow");
+    }
+    
+    dest[size] = 0;
+
+    if (!pb_read(stream, dest, (size_t)size))
+        return false;
+
+#ifdef PB_VALIDATE_UTF8
+    if (!pb_validate_utf8((const char*)dest))
+        PB_RETURN_ERROR(stream, "invalid utf8");
+#endif
+
+    return true;
+}
+
+static bool checkreturn pb_dec_submessage(pb_istream_t *stream, const pb_field_iter_t *field)
+{
+    bool status = true;
+    bool submsg_consumed = false;
+    pb_istream_t substream;
+
+    if (!pb_make_string_substream(stream, &substream))
+        return false;
+    
+    if (field->submsg_desc == NULL)
+        PB_RETURN_ERROR(stream, "invalid field descriptor");
+    
+    /* Submessages can have a separate message-level callback that is called
+     * before decoding the message. Typically it is used to set callback fields
+     * inside oneofs. */
+    if (PB_LTYPE(field->type) == PB_LTYPE_SUBMSG_W_CB && field->pSize != NULL)
+    {
+        /* Message callback is stored right before pSize. */
+        pb_callback_t *callback = (pb_callback_t*)field->pSize - 1;
+        if (callback->funcs.decode)
+        {
+            status = callback->funcs.decode(&substream, field, &callback->arg);
+
+            if (substream.bytes_left == 0)
+            {
+                submsg_consumed = true;
+            }
+        }
+    }
+
+    /* Now decode the submessage contents */
+    if (status && !submsg_consumed)
+    {
+        unsigned int flags = 0;
+
+        /* Static required/optional fields are already initialized by top-level
+         * pb_decode(), no need to initialize them again. */
+        if (PB_ATYPE(field->type) == PB_ATYPE_STATIC &&
+            PB_HTYPE(field->type) != PB_HTYPE_REPEATED)
+        {
+            flags = PB_DECODE_NOINIT;
+        }
+
+        status = pb_decode_inner(&substream, field->submsg_desc, field->pData, flags);
+    }
+    
+    if (!pb_close_string_substream(stream, &substream))
+        return false;
+
+    return status;
+}
+
+static bool checkreturn pb_dec_fixed_length_bytes(pb_istream_t *stream, const pb_field_iter_t *field)
+{
+    uint32_t size;
+
+    if (!pb_decode_varint32(stream, &size))
+        return false;
+
+    if (size > PB_SIZE_MAX)
+        PB_RETURN_ERROR(stream, "bytes overflow");
+
+    if (size == 0)
+    {
+        /* As a special case, treat empty bytes string as all zeros for fixed_length_bytes. */
+        memset(field->pData, 0, (size_t)field->data_size);
+        return true;
+    }
+
+    if (size != field->data_size)
+        PB_RETURN_ERROR(stream, "incorrect fixed length bytes size");
+
+    return pb_read(stream, (pb_byte_t*)field->pData, (size_t)field->data_size);
+}
+
+#ifdef PB_CONVERT_DOUBLE_FLOAT
+bool pb_decode_double_as_float(pb_istream_t *stream, float *dest)
+{
+    uint_least8_t sign;
+    int exponent;
+    uint32_t mantissa;
+    uint64_t value;
+    union { float f; uint32_t i; } out;
+
+    if (!pb_decode_fixed64(stream, &value))
+        return false;
+
+    /* Decompose input value */
+    sign = (uint_least8_t)((value >> 63) & 1);
+    exponent = (int)((value >> 52) & 0x7FF) - 1023;
+    mantissa = (value >> 28) & 0xFFFFFF; /* Highest 24 bits */
+
+    /* Figure if value is in range representable by floats. */
+    if (exponent == 1024)
+    {
+        /* Special value */
+        exponent = 128;
+        mantissa >>= 1;
+    }
+    else
+    {
+        if (exponent > 127)
+        {
+            /* Too large, convert to infinity */
+            exponent = 128;
+            mantissa = 0;
+        }
+        else if (exponent < -150)
+        {
+            /* Too small, convert to zero */
+            exponent = -127;
+            mantissa = 0;
+        }
+        else if (exponent < -126)
+        {
+            /* Denormalized */
+            mantissa |= 0x1000000;
+            mantissa >>= (-126 - exponent);
+            exponent = -127;
+        }
+
+        /* Round off mantissa */
+        mantissa = (mantissa + 1) >> 1;
+
+        /* Check if mantissa went over 2.0 */
+        if (mantissa & 0x800000)
+        {
+            exponent += 1;
+            mantissa &= 0x7FFFFF;
+            mantissa >>= 1;
+        }
+    }
+
+    /* Combine fields */
+    out.i = mantissa;
+    out.i |= (uint32_t)(exponent + 127) << 23;
+    out.i |= (uint32_t)sign << 31;
+
+    *dest = out.f;
+    return true;
+}
+#endif

diff --git a/security/container/protos/nanopb/pb_decode.h b/security/container/protos/nanopb/pb_decode.h
new file mode 100644
index 0000000..824acd4
--- /dev/null
+++ b/security/container/protos/nanopb/pb_decode.h

@@ -0,0 +1,199 @@
+/* pb_decode.h: Functions to decode protocol buffers. Depends on pb_decode.c.
+ * The main function is pb_decode. You also need an input stream, and the
+ * field descriptions created by nanopb_generator.py.
+ */
+
+#ifndef PB_DECODE_H_INCLUDED
+#define PB_DECODE_H_INCLUDED
+
+#include "pb.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/* Structure for defining custom input streams. You will need to provide
+ * a callback function to read the bytes from your storage, which can be
+ * for example a file or a network socket.
+ * 
+ * The callback must conform to these rules:
+ *
+ * 1) Return false on IO errors. This will cause decoding to abort.
+ * 2) You can use state to store your own data (e.g. buffer pointer),
+ *    and rely on pb_read to verify that no-body reads past bytes_left.
+ * 3) Your callback may be used with substreams, in which case bytes_left
+ *    is different than from the main stream. Don't use bytes_left to compute
+ *    any pointers.
+ */
+struct pb_istream_s
+{
+#ifdef PB_BUFFER_ONLY
+    /* Callback pointer is not used in buffer-only configuration.
+     * Having an int pointer here allows binary compatibility but
+     * gives an error if someone tries to assign callback function.
+     */
+    int *callback;
+#else
+    bool (*callback)(pb_istream_t *stream, pb_byte_t *buf, size_t count);
+#endif
+
+    void *state; /* Free field for use by callback implementation */
+    size_t bytes_left;
+    
+#ifndef PB_NO_ERRMSG
+    const char *errmsg;
+#endif
+};
+
+#ifndef PB_NO_ERRMSG
+#define PB_ISTREAM_EMPTY {0,0,0,0}
+#else
+#define PB_ISTREAM_EMPTY {0,0,0}
+#endif
+
+/***************************
+ * Main decoding functions *
+ ***************************/
+ 
+/* Decode a single protocol buffers message from input stream into a C structure.
+ * Returns true on success, false on any failure.
+ * The actual struct pointed to by dest must match the description in fields.
+ * Callback fields of the destination structure must be initialized by caller.
+ * All other fields will be initialized by this function.
+ *
+ * Example usage:
+ *    MyMessage msg = {};
+ *    uint8_t buffer[64];
+ *    pb_istream_t stream;
+ *    
+ *    // ... read some data into buffer ...
+ *
+ *    stream = pb_istream_from_buffer(buffer, count);
+ *    pb_decode(&stream, MyMessage_fields, &msg);
+ */
+bool pb_decode(pb_istream_t *stream, const pb_msgdesc_t *fields, void *dest_struct);
+
+/* Extended version of pb_decode, with several options to control
+ * the decoding process:
+ *
+ * PB_DECODE_NOINIT:         Do not initialize the fields to default values.
+ *                           This is slightly faster if you do not need the default
+ *                           values and instead initialize the structure to 0 using
+ *                           e.g. memset(). This can also be used for merging two
+ *                           messages, i.e. combine already existing data with new
+ *                           values.
+ *
+ * PB_DECODE_DELIMITED:      Input message starts with the message size as varint.
+ *                           Corresponds to parseDelimitedFrom() in Google's
+ *                           protobuf API.
+ *
+ * PB_DECODE_NULLTERMINATED: Stop reading when field tag is read as 0. This allows
+ *                           reading null terminated messages.
+ *                           NOTE: Until nanopb-0.4.0, pb_decode() also allows
+ *                           null-termination. This behaviour is not supported in
+ *                           most other protobuf implementations, so PB_DECODE_DELIMITED
+ *                           is a better option for compatibility.
+ *
+ * Multiple flags can be combined with bitwise or (| operator)
+ */
+#define PB_DECODE_NOINIT          0x01U
+#define PB_DECODE_DELIMITED       0x02U
+#define PB_DECODE_NULLTERMINATED  0x04U
+bool pb_decode_ex(pb_istream_t *stream, const pb_msgdesc_t *fields, void *dest_struct, unsigned int flags);
+
+/* Defines for backwards compatibility with code written before nanopb-0.4.0 */
+#define pb_decode_noinit(s,f,d) pb_decode_ex(s,f,d, PB_DECODE_NOINIT)
+#define pb_decode_delimited(s,f,d) pb_decode_ex(s,f,d, PB_DECODE_DELIMITED)
+#define pb_decode_delimited_noinit(s,f,d) pb_decode_ex(s,f,d, PB_DECODE_DELIMITED | PB_DECODE_NOINIT)
+#define pb_decode_nullterminated(s,f,d) pb_decode_ex(s,f,d, PB_DECODE_NULLTERMINATED)
+
+#ifdef PB_ENABLE_MALLOC
+/* Release any allocated pointer fields. If you use dynamic allocation, you should
+ * call this for any successfully decoded message when you are done with it. If
+ * pb_decode() returns with an error, the message is already released.
+ */
+void pb_release(const pb_msgdesc_t *fields, void *dest_struct);
+#else
+/* Allocation is not supported, so release is no-op */
+#define pb_release(fields, dest_struct) PB_UNUSED(fields); PB_UNUSED(dest_struct);
+#endif
+
+
+/**************************************
+ * Functions for manipulating streams *
+ **************************************/
+
+/* Create an input stream for reading from a memory buffer.
+ *
+ * msglen should be the actual length of the message, not the full size of
+ * allocated buffer.
+ *
+ * Alternatively, you can use a custom stream that reads directly from e.g.
+ * a file or a network socket.
+ */
+pb_istream_t pb_istream_from_buffer(const pb_byte_t *buf, size_t msglen);
+
+/* Function to read from a pb_istream_t. You can use this if you need to
+ * read some custom header data, or to read data in field callbacks.
+ */
+bool pb_read(pb_istream_t *stream, pb_byte_t *buf, size_t count);
+
+
+/************************************************
+ * Helper functions for writing field callbacks *
+ ************************************************/
+
+/* Decode the tag for the next field in the stream. Gives the wire type and
+ * field tag. At end of the message, returns false and sets eof to true. */
+bool pb_decode_tag(pb_istream_t *stream, pb_wire_type_t *wire_type, uint32_t *tag, bool *eof);
+
+/* Skip the field payload data, given the wire type. */
+bool pb_skip_field(pb_istream_t *stream, pb_wire_type_t wire_type);
+
+/* Decode an integer in the varint format. This works for enum, int32,
+ * int64, uint32 and uint64 field types. */
+#ifndef PB_WITHOUT_64BIT
+bool pb_decode_varint(pb_istream_t *stream, uint64_t *dest);
+#else
+#define pb_decode_varint pb_decode_varint32
+#endif
+
+/* Decode an integer in the varint format. This works for enum, int32,
+ * and uint32 field types. */
+bool pb_decode_varint32(pb_istream_t *stream, uint32_t *dest);
+
+/* Decode a bool value in varint format. */
+bool pb_decode_bool(pb_istream_t *stream, bool *dest);
+
+/* Decode an integer in the zig-zagged svarint format. This works for sint32
+ * and sint64. */
+#ifndef PB_WITHOUT_64BIT
+bool pb_decode_svarint(pb_istream_t *stream, int64_t *dest);
+#else
+bool pb_decode_svarint(pb_istream_t *stream, int32_t *dest);
+#endif
+
+/* Decode a fixed32, sfixed32 or float value. You need to pass a pointer to
+ * a 4-byte wide C variable. */
+bool pb_decode_fixed32(pb_istream_t *stream, void *dest);
+
+#ifndef PB_WITHOUT_64BIT
+/* Decode a fixed64, sfixed64 or double value. You need to pass a pointer to
+ * a 8-byte wide C variable. */
+bool pb_decode_fixed64(pb_istream_t *stream, void *dest);
+#endif
+
+#ifdef PB_CONVERT_DOUBLE_FLOAT
+/* Decode a double value into float variable. */
+bool pb_decode_double_as_float(pb_istream_t *stream, float *dest);
+#endif
+
+/* Make a limited-length substream for reading a PB_WT_STRING field. */
+bool pb_make_string_substream(pb_istream_t *stream, pb_istream_t *substream);
+bool pb_close_string_substream(pb_istream_t *stream, pb_istream_t *substream);
+
+#ifdef __cplusplus
+} /* extern "C" */
+#endif
+
+#endif

diff --git a/security/container/protos/nanopb/pb_encode.c b/security/container/protos/nanopb/pb_encode.c
new file mode 100644
index 0000000..de716f7
--- /dev/null
+++ b/security/container/protos/nanopb/pb_encode.c

@@ -0,0 +1,987 @@
+/* pb_encode.c -- encode a protobuf using minimal resources
+ *
+ * 2011 Petteri Aimonen <jpa@kapsi.fi>
+ */
+
+#include "pb.h"
+#include "pb_encode.h"
+#include "pb_common.h"
+
+/* Use the GCC warn_unused_result attribute to check that all return values
+ * are propagated correctly. On other compilers and gcc before 3.4.0 just
+ * ignore the annotation.
+ */
+#if !defined(__GNUC__) || ( __GNUC__ < 3) || (__GNUC__ == 3 && __GNUC_MINOR__ < 4)
+    #define checkreturn
+#else
+    #define checkreturn __attribute__((warn_unused_result))
+#endif
+
+/**************************************
+ * Declarations internal to this file *
+ **************************************/
+static bool checkreturn buf_write(pb_ostream_t *stream, const pb_byte_t *buf, size_t count);
+static bool checkreturn encode_array(pb_ostream_t *stream, pb_field_iter_t *field);
+static bool checkreturn pb_check_proto3_default_value(const pb_field_iter_t *field);
+static bool checkreturn encode_basic_field(pb_ostream_t *stream, const pb_field_iter_t *field);
+static bool checkreturn encode_callback_field(pb_ostream_t *stream, const pb_field_iter_t *field);
+static bool checkreturn encode_field(pb_ostream_t *stream, pb_field_iter_t *field);
+static bool checkreturn encode_extension_field(pb_ostream_t *stream, const pb_field_iter_t *field);
+static bool checkreturn default_extension_encoder(pb_ostream_t *stream, const pb_extension_t *extension);
+static bool checkreturn pb_encode_varint_32(pb_ostream_t *stream, uint32_t low, uint32_t high);
+static bool checkreturn pb_enc_bool(pb_ostream_t *stream, const pb_field_iter_t *field);
+static bool checkreturn pb_enc_varint(pb_ostream_t *stream, const pb_field_iter_t *field);
+static bool checkreturn pb_enc_fixed(pb_ostream_t *stream, const pb_field_iter_t *field);
+static bool checkreturn pb_enc_bytes(pb_ostream_t *stream, const pb_field_iter_t *field);
+static bool checkreturn pb_enc_string(pb_ostream_t *stream, const pb_field_iter_t *field);
+static bool checkreturn pb_enc_submessage(pb_ostream_t *stream, const pb_field_iter_t *field);
+static bool checkreturn pb_enc_fixed_length_bytes(pb_ostream_t *stream, const pb_field_iter_t *field);
+
+#ifdef PB_WITHOUT_64BIT
+#define pb_int64_t int32_t
+#define pb_uint64_t uint32_t
+#else
+#define pb_int64_t int64_t
+#define pb_uint64_t uint64_t
+#endif
+
+/*******************************
+ * pb_ostream_t implementation *
+ *******************************/
+
+static bool checkreturn buf_write(pb_ostream_t *stream, const pb_byte_t *buf, size_t count)
+{
+    size_t i;
+    pb_byte_t *dest = (pb_byte_t*)stream->state;
+    stream->state = dest + count;
+    
+    for (i = 0; i < count; i++)
+        dest[i] = buf[i];
+    
+    return true;
+}
+
+pb_ostream_t pb_ostream_from_buffer(pb_byte_t *buf, size_t bufsize)
+{
+    pb_ostream_t stream;
+#ifdef PB_BUFFER_ONLY
+    stream.callback = (void*)1; /* Just a marker value */
+#else
+    stream.callback = &buf_write;
+#endif
+    stream.state = buf;
+    stream.max_size = bufsize;
+    stream.bytes_written = 0;
+#ifndef PB_NO_ERRMSG
+    stream.errmsg = NULL;
+#endif
+    return stream;
+}
+
+bool checkreturn pb_write(pb_ostream_t *stream, const pb_byte_t *buf, size_t count)
+{
+    if (count > 0 && stream->callback != NULL)
+    {
+        if (stream->bytes_written + count < stream->bytes_written ||
+            stream->bytes_written + count > stream->max_size)
+        {
+            PB_RETURN_ERROR(stream, "stream full");
+        }
+
+#ifdef PB_BUFFER_ONLY
+        if (!buf_write(stream, buf, count))
+            PB_RETURN_ERROR(stream, "io error");
+#else        
+        if (!stream->callback(stream, buf, count))
+            PB_RETURN_ERROR(stream, "io error");
+#endif
+    }
+    
+    stream->bytes_written += count;
+    return true;
+}
+
+/*************************
+ * Encode a single field *
+ *************************/
+
+/* Read a bool value without causing undefined behavior even if the value
+ * is invalid. See issue #434 and
+ * https://stackoverflow.com/questions/27661768/weird-results-for-conditional
+ */
+static bool safe_read_bool(const void *pSize)
+{
+    const char *p = (const char *)pSize;
+    size_t i;
+    for (i = 0; i < sizeof(bool); i++)
+    {
+        if (p[i] != 0)
+            return true;
+    }
+    return false;
+}
+
+/* Encode a static array. Handles the size calculations and possible packing. */
+static bool checkreturn encode_array(pb_ostream_t *stream, pb_field_iter_t *field)
+{
+    pb_size_t i;
+    pb_size_t count;
+#ifndef PB_ENCODE_ARRAYS_UNPACKED
+    size_t size;
+#endif
+
+    count = *(pb_size_t*)field->pSize;
+
+    if (count == 0)
+        return true;
+
+    if (PB_ATYPE(field->type) != PB_ATYPE_POINTER && count > field->array_size)
+        PB_RETURN_ERROR(stream, "array max size exceeded");
+    
+#ifndef PB_ENCODE_ARRAYS_UNPACKED
+    /* We always pack arrays if the datatype allows it. */
+    if (PB_LTYPE(field->type) <= PB_LTYPE_LAST_PACKABLE)
+    {
+        if (!pb_encode_tag(stream, PB_WT_STRING, field->tag))
+            return false;
+        
+        /* Determine the total size of packed array. */
+        if (PB_LTYPE(field->type) == PB_LTYPE_FIXED32)
+        {
+            size = 4 * (size_t)count;
+        }
+        else if (PB_LTYPE(field->type) == PB_LTYPE_FIXED64)
+        {
+            size = 8 * (size_t)count;
+        }
+        else
+        { 
+            pb_ostream_t sizestream = PB_OSTREAM_SIZING;
+            void *pData_orig = field->pData;
+            for (i = 0; i < count; i++)
+            {
+                if (!pb_enc_varint(&sizestream, field))
+                    PB_RETURN_ERROR(stream, PB_GET_ERROR(&sizestream));
+                field->pData = (char*)field->pData + field->data_size;
+            }
+            field->pData = pData_orig;
+            size = sizestream.bytes_written;
+        }
+        
+        if (!pb_encode_varint(stream, (pb_uint64_t)size))
+            return false;
+        
+        if (stream->callback == NULL)
+            return pb_write(stream, NULL, size); /* Just sizing.. */
+        
+        /* Write the data */
+        for (i = 0; i < count; i++)
+        {
+            if (PB_LTYPE(field->type) == PB_LTYPE_FIXED32 || PB_LTYPE(field->type) == PB_LTYPE_FIXED64)
+            {
+                if (!pb_enc_fixed(stream, field))
+                    return false;
+            }
+            else
+            {
+                if (!pb_enc_varint(stream, field))
+                    return false;
+            }
+
+            field->pData = (char*)field->pData + field->data_size;
+        }
+    }
+    else /* Unpacked fields */
+#endif
+    {
+        for (i = 0; i < count; i++)
+        {
+            /* Normally the data is stored directly in the array entries, but
+             * for pointer-type string and bytes fields, the array entries are
+             * actually pointers themselves also. So we have to dereference once
+             * more to get to the actual data. */
+            if (PB_ATYPE(field->type) == PB_ATYPE_POINTER &&
+                (PB_LTYPE(field->type) == PB_LTYPE_STRING ||
+                 PB_LTYPE(field->type) == PB_LTYPE_BYTES))
+            {
+                bool status;
+                void *pData_orig = field->pData;
+                field->pData = *(void* const*)field->pData;
+
+                if (!field->pData)
+                {
+                    /* Null pointer in array is treated as empty string / bytes */
+                    status = pb_encode_tag_for_field(stream, field) &&
+                             pb_encode_varint(stream, 0);
+                }
+                else
+                {
+                    status = encode_basic_field(stream, field);
+                }
+
+                field->pData = pData_orig;
+
+                if (!status)
+                    return false;
+            }
+            else
+            {
+                if (!encode_basic_field(stream, field))
+                    return false;
+            }
+            field->pData = (char*)field->pData + field->data_size;
+        }
+    }
+    
+    return true;
+}
+
+/* In proto3, all fields are optional and are only encoded if their value is "non-zero".
+ * This function implements the check for the zero value. */
+static bool checkreturn pb_check_proto3_default_value(const pb_field_iter_t *field)
+{
+    pb_type_t type = field->type;
+
+    if (PB_ATYPE(type) == PB_ATYPE_STATIC)
+    {
+        if (PB_HTYPE(type) == PB_HTYPE_REQUIRED)
+        {
+            /* Required proto2 fields inside proto3 submessage, pretty rare case */
+            return false;
+        }
+        else if (PB_HTYPE(type) == PB_HTYPE_REPEATED)
+        {
+            /* Repeated fields inside proto3 submessage: present if count != 0 */
+            return *(const pb_size_t*)field->pSize == 0;
+        }
+        else if (PB_HTYPE(type) == PB_HTYPE_ONEOF)
+        {
+            /* Oneof fields */
+            return *(const pb_size_t*)field->pSize == 0;
+        }
+        else if (PB_HTYPE(type) == PB_HTYPE_OPTIONAL && field->pSize != NULL)
+        {
+            /* Proto2 optional fields inside proto3 message, or proto3
+             * submessage fields. */
+            return safe_read_bool(field->pSize) == false;
+        }
+        else if (field->descriptor->default_value)
+        {
+            /* Proto3 messages do not have default values, but proto2 messages
+             * can contain optional fields without has_fields (generator option 'proto3').
+             * In this case they must always be encoded, to make sure that the
+             * non-zero default value is overwritten.
+             */
+            return false;
+        }
+
+        /* Rest is proto3 singular fields */
+        if (PB_LTYPE(type) <= PB_LTYPE_LAST_PACKABLE)
+        {
+            /* Simple integer / float fields */
+            pb_size_t i;
+            const char *p = (const char*)field->pData;
+            for (i = 0; i < field->data_size; i++)
+            {
+                if (p[i] != 0)
+                {
+                    return false;
+                }
+            }
+
+            return true;
+        }
+        else if (PB_LTYPE(type) == PB_LTYPE_BYTES)
+        {
+            const pb_bytes_array_t *bytes = (const pb_bytes_array_t*)field->pData;
+            return bytes->size == 0;
+        }
+        else if (PB_LTYPE(type) == PB_LTYPE_STRING)
+        {
+            return *(const char*)field->pData == '\0';
+        }
+        else if (PB_LTYPE(type) == PB_LTYPE_FIXED_LENGTH_BYTES)
+        {
+            /* Fixed length bytes is only empty if its length is fixed
+             * as 0. Which would be pretty strange, but we can check
+             * it anyway. */
+            return field->data_size == 0;
+        }
+        else if (PB_LTYPE_IS_SUBMSG(type))
+        {
+            /* Check all fields in the submessage to find if any of them
+             * are non-zero. The comparison cannot be done byte-per-byte
+             * because the C struct may contain padding bytes that must
+             * be skipped. Note that usually proto3 submessages have
+             * a separate has_field that is checked earlier in this if.
+             */
+            pb_field_iter_t iter;
+            if (pb_field_iter_begin(&iter, field->submsg_desc, field->pData))
+            {
+                do
+                {
+                    if (!pb_check_proto3_default_value(&iter))
+                    {
+                        return false;
+                    }
+                } while (pb_field_iter_next(&iter));
+            }
+            return true;
+        }
+    }
+    else if (PB_ATYPE(type) == PB_ATYPE_POINTER)
+    {
+        return field->pData == NULL;
+    }
+    else if (PB_ATYPE(type) == PB_ATYPE_CALLBACK)
+    {
+        if (PB_LTYPE(type) == PB_LTYPE_EXTENSION)
+        {
+            const pb_extension_t *extension = *(const pb_extension_t* const *)field->pData;
+            return extension == NULL;
+        }
+        else if (field->descriptor->field_callback == pb_default_field_callback)
+        {
+            pb_callback_t *pCallback = (pb_callback_t*)field->pData;
+            return pCallback->funcs.encode == NULL;
+        }
+        else
+        {
+            return field->descriptor->field_callback == NULL;
+        }
+    }
+
+    return false; /* Not typically reached, safe default for weird special cases. */
+}
+
+/* Encode a field with static or pointer allocation, i.e. one whose data
+ * is available to the encoder directly. */
+static bool checkreturn encode_basic_field(pb_ostream_t *stream, const pb_field_iter_t *field)
+{
+    if (!field->pData)
+    {
+        /* Missing pointer field */
+        return true;
+    }
+
+    if (!pb_encode_tag_for_field(stream, field))
+        return false;
+
+    switch (PB_LTYPE(field->type))
+    {
+        case PB_LTYPE_BOOL:
+            return pb_enc_bool(stream, field);
+
+        case PB_LTYPE_VARINT:
+        case PB_LTYPE_UVARINT:
+        case PB_LTYPE_SVARINT:
+            return pb_enc_varint(stream, field);
+
+        case PB_LTYPE_FIXED32:
+        case PB_LTYPE_FIXED64:
+            return pb_enc_fixed(stream, field);
+
+        case PB_LTYPE_BYTES:
+            return pb_enc_bytes(stream, field);
+
+        case PB_LTYPE_STRING:
+            return pb_enc_string(stream, field);
+
+        case PB_LTYPE_SUBMESSAGE:
+        case PB_LTYPE_SUBMSG_W_CB:
+            return pb_enc_submessage(stream, field);
+
+        case PB_LTYPE_FIXED_LENGTH_BYTES:
+            return pb_enc_fixed_length_bytes(stream, field);
+
+        default:
+            PB_RETURN_ERROR(stream, "invalid field type");
+    }
+}
+
+/* Encode a field with callback semantics. This means that a user function is
+ * called to provide and encode the actual data. */
+static bool checkreturn encode_callback_field(pb_ostream_t *stream, const pb_field_iter_t *field)
+{
+    if (field->descriptor->field_callback != NULL)
+    {
+        if (!field->descriptor->field_callback(NULL, stream, field))
+            PB_RETURN_ERROR(stream, "callback error");
+    }
+    return true;
+}
+
+/* Encode a single field of any callback, pointer or static type. */
+static bool checkreturn encode_field(pb_ostream_t *stream, pb_field_iter_t *field)
+{
+    /* Check field presence */
+    if (PB_HTYPE(field->type) == PB_HTYPE_ONEOF)
+    {
+        if (*(const pb_size_t*)field->pSize != field->tag)
+        {
+            /* Different type oneof field */
+            return true;
+        }
+    }
+    else if (PB_HTYPE(field->type) == PB_HTYPE_OPTIONAL)
+    {
+        if (field->pSize)
+        {
+            if (safe_read_bool(field->pSize) == false)
+            {
+                /* Missing optional field */
+                return true;
+            }
+        }
+        else if (PB_ATYPE(field->type) == PB_ATYPE_STATIC)
+        {
+            /* Proto3 singular field */
+            if (pb_check_proto3_default_value(field))
+                return true;
+        }
+    }
+
+    if (!field->pData)
+    {
+        if (PB_HTYPE(field->type) == PB_HTYPE_REQUIRED)
+            PB_RETURN_ERROR(stream, "missing required field");
+
+        /* Pointer field set to NULL */
+        return true;
+    }
+
+    /* Then encode field contents */
+    if (PB_ATYPE(field->type) == PB_ATYPE_CALLBACK)
+    {
+        return encode_callback_field(stream, field);
+    }
+    else if (PB_HTYPE(field->type) == PB_HTYPE_REPEATED)
+    {
+        return encode_array(stream, field);
+    }
+    else
+    {
+        return encode_basic_field(stream, field);
+    }
+}
+
+/* Default handler for extension fields. Expects to have a pb_msgdesc_t
+ * pointer in the extension->type->arg field, pointing to a message with
+ * only one field in it.  */
+static bool checkreturn default_extension_encoder(pb_ostream_t *stream, const pb_extension_t *extension)
+{
+    pb_field_iter_t iter;
+
+    if (!pb_field_iter_begin_extension_const(&iter, extension))
+        PB_RETURN_ERROR(stream, "invalid extension");
+
+    return encode_field(stream, &iter);
+}
+
+
+/* Walk through all the registered extensions and give them a chance
+ * to encode themselves. */
+static bool checkreturn encode_extension_field(pb_ostream_t *stream, const pb_field_iter_t *field)
+{
+    const pb_extension_t *extension = *(const pb_extension_t* const *)field->pData;
+
+    while (extension)
+    {
+        bool status;
+        if (extension->type->encode)
+            status = extension->type->encode(stream, extension);
+        else
+            status = default_extension_encoder(stream, extension);
+
+        if (!status)
+            return false;
+        
+        extension = extension->next;
+    }
+    
+    return true;
+}
+
+/*********************
+ * Encode all fields *
+ *********************/
+
+bool checkreturn pb_encode(pb_ostream_t *stream, const pb_msgdesc_t *fields, const void *src_struct)
+{
+    pb_field_iter_t iter;
+    if (!pb_field_iter_begin_const(&iter, fields, src_struct))
+        return true; /* Empty message type */
+    
+    do {
+        if (PB_LTYPE(iter.type) == PB_LTYPE_EXTENSION)
+        {
+            /* Special case for the extension field placeholder */
+            if (!encode_extension_field(stream, &iter))
+                return false;
+        }
+        else
+        {
+            /* Regular field */
+            if (!encode_field(stream, &iter))
+                return false;
+        }
+    } while (pb_field_iter_next(&iter));
+    
+    return true;
+}
+
+bool checkreturn pb_encode_ex(pb_ostream_t *stream, const pb_msgdesc_t *fields, const void *src_struct, unsigned int flags)
+{
+  if ((flags & PB_ENCODE_DELIMITED) != 0)
+  {
+    return pb_encode_submessage(stream, fields, src_struct);
+  }
+  else if ((flags & PB_ENCODE_NULLTERMINATED) != 0)
+  {
+    const pb_byte_t zero = 0;
+
+    if (!pb_encode(stream, fields, src_struct))
+        return false;
+
+    return pb_write(stream, &zero, 1);
+  }
+  else
+  {
+    return pb_encode(stream, fields, src_struct);
+  }
+}
+
+bool pb_get_encoded_size(size_t *size, const pb_msgdesc_t *fields, const void *src_struct)
+{
+    pb_ostream_t stream = PB_OSTREAM_SIZING;
+    
+    if (!pb_encode(&stream, fields, src_struct))
+        return false;
+    
+    *size = stream.bytes_written;
+    return true;
+}
+
+/********************
+ * Helper functions *
+ ********************/
+
+/* This function avoids 64-bit shifts as they are quite slow on many platforms. */
+static bool checkreturn pb_encode_varint_32(pb_ostream_t *stream, uint32_t low, uint32_t high)
+{
+    size_t i = 0;
+    pb_byte_t buffer[10];
+    pb_byte_t byte = (pb_byte_t)(low & 0x7F);
+    low >>= 7;
+
+    while (i < 4 && (low != 0 || high != 0))
+    {
+        byte |= 0x80;
+        buffer[i++] = byte;
+        byte = (pb_byte_t)(low & 0x7F);
+        low >>= 7;
+    }
+
+    if (high)
+    {
+        byte = (pb_byte_t)(byte | ((high & 0x07) << 4));
+        high >>= 3;
+
+        while (high)
+        {
+            byte |= 0x80;
+            buffer[i++] = byte;
+            byte = (pb_byte_t)(high & 0x7F);
+            high >>= 7;
+        }
+    }
+
+    buffer[i++] = byte;
+
+    return pb_write(stream, buffer, i);
+}
+
+bool checkreturn pb_encode_varint(pb_ostream_t *stream, pb_uint64_t value)
+{
+    if (value <= 0x7F)
+    {
+        /* Fast path: single byte */
+        pb_byte_t byte = (pb_byte_t)value;
+        return pb_write(stream, &byte, 1);
+    }
+    else
+    {
+#ifdef PB_WITHOUT_64BIT
+        return pb_encode_varint_32(stream, value, 0);
+#else
+        return pb_encode_varint_32(stream, (uint32_t)value, (uint32_t)(value >> 32));
+#endif
+    }
+}
+
+bool checkreturn pb_encode_svarint(pb_ostream_t *stream, pb_int64_t value)
+{
+    pb_uint64_t zigzagged;
+    if (value < 0)
+        zigzagged = ~((pb_uint64_t)value << 1);
+    else
+        zigzagged = (pb_uint64_t)value << 1;
+    
+    return pb_encode_varint(stream, zigzagged);
+}
+
+bool checkreturn pb_encode_fixed32(pb_ostream_t *stream, const void *value)
+{
+    uint32_t val = *(const uint32_t*)value;
+    pb_byte_t bytes[4];
+    bytes[0] = (pb_byte_t)(val & 0xFF);
+    bytes[1] = (pb_byte_t)((val >> 8) & 0xFF);
+    bytes[2] = (pb_byte_t)((val >> 16) & 0xFF);
+    bytes[3] = (pb_byte_t)((val >> 24) & 0xFF);
+    return pb_write(stream, bytes, 4);
+}
+
+#ifndef PB_WITHOUT_64BIT
+bool checkreturn pb_encode_fixed64(pb_ostream_t *stream, const void *value)
+{
+    uint64_t val = *(const uint64_t*)value;
+    pb_byte_t bytes[8];
+    bytes[0] = (pb_byte_t)(val & 0xFF);
+    bytes[1] = (pb_byte_t)((val >> 8) & 0xFF);
+    bytes[2] = (pb_byte_t)((val >> 16) & 0xFF);
+    bytes[3] = (pb_byte_t)((val >> 24) & 0xFF);
+    bytes[4] = (pb_byte_t)((val >> 32) & 0xFF);
+    bytes[5] = (pb_byte_t)((val >> 40) & 0xFF);
+    bytes[6] = (pb_byte_t)((val >> 48) & 0xFF);
+    bytes[7] = (pb_byte_t)((val >> 56) & 0xFF);
+    return pb_write(stream, bytes, 8);
+}
+#endif
+
+bool checkreturn pb_encode_tag(pb_ostream_t *stream, pb_wire_type_t wiretype, uint32_t field_number)
+{
+    pb_uint64_t tag = ((pb_uint64_t)field_number << 3) | wiretype;
+    return pb_encode_varint(stream, tag);
+}
+
+bool pb_encode_tag_for_field ( pb_ostream_t* stream, const pb_field_iter_t* field )
+{
+    pb_wire_type_t wiretype;
+    switch (PB_LTYPE(field->type))
+    {
+        case PB_LTYPE_BOOL:
+        case PB_LTYPE_VARINT:
+        case PB_LTYPE_UVARINT:
+        case PB_LTYPE_SVARINT:
+            wiretype = PB_WT_VARINT;
+            break;
+        
+        case PB_LTYPE_FIXED32:
+            wiretype = PB_WT_32BIT;
+            break;
+        
+        case PB_LTYPE_FIXED64:
+            wiretype = PB_WT_64BIT;
+            break;
+        
+        case PB_LTYPE_BYTES:
+        case PB_LTYPE_STRING:
+        case PB_LTYPE_SUBMESSAGE:
+        case PB_LTYPE_SUBMSG_W_CB:
+        case PB_LTYPE_FIXED_LENGTH_BYTES:
+            wiretype = PB_WT_STRING;
+            break;
+        
+        default:
+            PB_RETURN_ERROR(stream, "invalid field type");
+    }
+    
+    return pb_encode_tag(stream, wiretype, field->tag);
+}
+
+bool checkreturn pb_encode_string(pb_ostream_t *stream, const pb_byte_t *buffer, size_t size)
+{
+    if (!pb_encode_varint(stream, (pb_uint64_t)size))
+        return false;
+    
+    return pb_write(stream, buffer, size);
+}
+
+bool checkreturn pb_encode_submessage(pb_ostream_t *stream, const pb_msgdesc_t *fields, const void *src_struct)
+{
+    /* First calculate the message size using a non-writing substream. */
+    pb_ostream_t substream = PB_OSTREAM_SIZING;
+    size_t size;
+    bool status;
+    
+    if (!pb_encode(&substream, fields, src_struct))
+    {
+#ifndef PB_NO_ERRMSG
+        stream->errmsg = substream.errmsg;
+#endif
+        return false;
+    }
+    
+    size = substream.bytes_written;
+    
+    if (!pb_encode_varint(stream, (pb_uint64_t)size))
+        return false;
+    
+    if (stream->callback == NULL)
+        return pb_write(stream, NULL, size); /* Just sizing */
+    
+    if (stream->bytes_written + size > stream->max_size)
+        PB_RETURN_ERROR(stream, "stream full");
+        
+    /* Use a substream to verify that a callback doesn't write more than
+     * what it did the first time. */
+    substream.callback = stream->callback;
+    substream.state = stream->state;
+    substream.max_size = size;
+    substream.bytes_written = 0;
+#ifndef PB_NO_ERRMSG
+    substream.errmsg = NULL;
+#endif
+    
+    status = pb_encode(&substream, fields, src_struct);
+    
+    stream->bytes_written += substream.bytes_written;
+    stream->state = substream.state;
+#ifndef PB_NO_ERRMSG
+    stream->errmsg = substream.errmsg;
+#endif
+    
+    if (substream.bytes_written != size)
+        PB_RETURN_ERROR(stream, "submsg size changed");
+    
+    return status;
+}
+
+/* Field encoders */
+
+static bool checkreturn pb_enc_bool(pb_ostream_t *stream, const pb_field_iter_t *field)
+{
+    uint32_t value = safe_read_bool(field->pData) ? 1 : 0;
+    PB_UNUSED(field);
+    return pb_encode_varint(stream, value);
+}
+
+static bool checkreturn pb_enc_varint(pb_ostream_t *stream, const pb_field_iter_t *field)
+{
+    if (PB_LTYPE(field->type) == PB_LTYPE_UVARINT)
+    {
+        /* Perform unsigned integer extension */
+        pb_uint64_t value = 0;
+
+        if (field->data_size == sizeof(uint_least8_t))
+            value = *(const uint_least8_t*)field->pData;
+        else if (field->data_size == sizeof(uint_least16_t))
+            value = *(const uint_least16_t*)field->pData;
+        else if (field->data_size == sizeof(uint32_t))
+            value = *(const uint32_t*)field->pData;
+        else if (field->data_size == sizeof(pb_uint64_t))
+            value = *(const pb_uint64_t*)field->pData;
+        else
+            PB_RETURN_ERROR(stream, "invalid data_size");
+
+        return pb_encode_varint(stream, value);
+    }
+    else
+    {
+        /* Perform signed integer extension */
+        pb_int64_t value = 0;
+
+        if (field->data_size == sizeof(int_least8_t))
+            value = *(const int_least8_t*)field->pData;
+        else if (field->data_size == sizeof(int_least16_t))
+            value = *(const int_least16_t*)field->pData;
+        else if (field->data_size == sizeof(int32_t))
+            value = *(const int32_t*)field->pData;
+        else if (field->data_size == sizeof(pb_int64_t))
+            value = *(const pb_int64_t*)field->pData;
+        else
+            PB_RETURN_ERROR(stream, "invalid data_size");
+
+        if (PB_LTYPE(field->type) == PB_LTYPE_SVARINT)
+            return pb_encode_svarint(stream, value);
+#ifdef PB_WITHOUT_64BIT
+        else if (value < 0)
+            return pb_encode_varint_32(stream, (uint32_t)value, (uint32_t)-1);
+#endif
+        else
+            return pb_encode_varint(stream, (pb_uint64_t)value);
+
+    }
+}
+
+static bool checkreturn pb_enc_fixed(pb_ostream_t *stream, const pb_field_iter_t *field)
+{
+#ifdef PB_CONVERT_DOUBLE_FLOAT
+    if (field->data_size == sizeof(float) && PB_LTYPE(field->type) == PB_LTYPE_FIXED64)
+    {
+        return pb_encode_float_as_double(stream, *(float*)field->pData);
+    }
+#endif
+
+    if (field->data_size == sizeof(uint32_t))
+    {
+        return pb_encode_fixed32(stream, field->pData);
+    }
+#ifndef PB_WITHOUT_64BIT
+    else if (field->data_size == sizeof(uint64_t))
+    {
+        return pb_encode_fixed64(stream, field->pData);
+    }
+#endif
+    else
+    {
+        PB_RETURN_ERROR(stream, "invalid data_size");
+    }
+}
+
+static bool checkreturn pb_enc_bytes(pb_ostream_t *stream, const pb_field_iter_t *field)
+{
+    const pb_bytes_array_t *bytes = NULL;
+
+    bytes = (const pb_bytes_array_t*)field->pData;
+    
+    if (bytes == NULL)
+    {
+        /* Treat null pointer as an empty bytes field */
+        return pb_encode_string(stream, NULL, 0);
+    }
+    
+    if (PB_ATYPE(field->type) == PB_ATYPE_STATIC &&
+        bytes->size > field->data_size - offsetof(pb_bytes_array_t, bytes))
+    {
+        PB_RETURN_ERROR(stream, "bytes size exceeded");
+    }
+    
+    return pb_encode_string(stream, bytes->bytes, (size_t)bytes->size);
+}
+
+static bool checkreturn pb_enc_string(pb_ostream_t *stream, const pb_field_iter_t *field)
+{
+    size_t size = 0;
+    size_t max_size = (size_t)field->data_size;
+    const char *str = (const char*)field->pData;
+    
+    if (PB_ATYPE(field->type) == PB_ATYPE_POINTER)
+    {
+        max_size = (size_t)-1;
+    }
+    else
+    {
+        /* pb_dec_string() assumes string fields end with a null
+         * terminator when the type isn't PB_ATYPE_POINTER, so we
+         * shouldn't allow more than max-1 bytes to be written to
+         * allow space for the null terminator.
+         */
+        if (max_size == 0)
+            PB_RETURN_ERROR(stream, "zero-length string");
+
+        max_size -= 1;
+    }
+
+
+    if (str == NULL)
+    {
+        size = 0; /* Treat null pointer as an empty string */
+    }
+    else
+    {
+        const char *p = str;
+
+        /* strnlen() is not always available, so just use a loop */
+        while (size < max_size && *p != '\0')
+        {
+            size++;
+            p++;
+        }
+
+        if (*p != '\0')
+        {
+            PB_RETURN_ERROR(stream, "unterminated string");
+        }
+    }
+
+#ifdef PB_VALIDATE_UTF8
+    if (!pb_validate_utf8(str))
+        PB_RETURN_ERROR(stream, "invalid utf8");
+#endif
+
+    return pb_encode_string(stream, (const pb_byte_t*)str, size);
+}
+
+static bool checkreturn pb_enc_submessage(pb_ostream_t *stream, const pb_field_iter_t *field)
+{
+    if (field->submsg_desc == NULL)
+        PB_RETURN_ERROR(stream, "invalid field descriptor");
+
+    if (PB_LTYPE(field->type) == PB_LTYPE_SUBMSG_W_CB && field->pSize != NULL)
+    {
+        /* Message callback is stored right before pSize. */
+        pb_callback_t *callback = (pb_callback_t*)field->pSize - 1;
+        if (callback->funcs.encode)
+        {
+            if (!callback->funcs.encode(stream, field, &callback->arg))
+                return false;
+        }
+    }
+    
+    return pb_encode_submessage(stream, field->submsg_desc, field->pData);
+}
+
+static bool checkreturn pb_enc_fixed_length_bytes(pb_ostream_t *stream, const pb_field_iter_t *field)
+{
+    return pb_encode_string(stream, (const pb_byte_t*)field->pData, (size_t)field->data_size);
+}
+
+#ifdef PB_CONVERT_DOUBLE_FLOAT
+bool pb_encode_float_as_double(pb_ostream_t *stream, float value)
+{
+    union { float f; uint32_t i; } in;
+    uint_least8_t sign;
+    int exponent;
+    uint64_t mantissa;
+
+    in.f = value;
+
+    /* Decompose input value */
+    sign = (uint_least8_t)((in.i >> 31) & 1);
+    exponent = (int)((in.i >> 23) & 0xFF) - 127;
+    mantissa = in.i & 0x7FFFFF;
+
+    if (exponent == 128)
+    {
+        /* Special value (NaN etc.) */
+        exponent = 1024;
+    }
+    else if (exponent == -127)
+    {
+        if (!mantissa)
+        {
+            /* Zero */
+            exponent = -1023;
+        }
+        else
+        {
+            /* Denormalized */
+            mantissa <<= 1;
+            while (!(mantissa & 0x800000))
+            {
+                mantissa <<= 1;
+                exponent--;
+            }
+            mantissa &= 0x7FFFFF;
+        }
+    }
+
+    /* Combine fields */
+    mantissa <<= 29;
+    mantissa |= (uint64_t)(exponent + 1023) << 52;
+    mantissa |= (uint64_t)sign << 63;
+
+    return pb_encode_fixed64(stream, &mantissa);
+}
+#endif

diff --git a/security/container/protos/nanopb/pb_encode.h b/security/container/protos/nanopb/pb_encode.h
new file mode 100644
index 0000000..9cff22a
--- /dev/null
+++ b/security/container/protos/nanopb/pb_encode.h

@@ -0,0 +1,185 @@
+/* pb_encode.h: Functions to encode protocol buffers. Depends on pb_encode.c.
+ * The main function is pb_encode. You also need an output stream, and the
+ * field descriptions created by nanopb_generator.py.
+ */
+
+#ifndef PB_ENCODE_H_INCLUDED
+#define PB_ENCODE_H_INCLUDED
+
+#include "pb.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/* Structure for defining custom output streams. You will need to provide
+ * a callback function to write the bytes to your storage, which can be
+ * for example a file or a network socket.
+ *
+ * The callback must conform to these rules:
+ *
+ * 1) Return false on IO errors. This will cause encoding to abort.
+ * 2) You can use state to store your own data (e.g. buffer pointer).
+ * 3) pb_write will update bytes_written after your callback runs.
+ * 4) Substreams will modify max_size and bytes_written. Don't use them
+ *    to calculate any pointers.
+ */
+struct pb_ostream_s
+{
+#ifdef PB_BUFFER_ONLY
+    /* Callback pointer is not used in buffer-only configuration.
+     * Having an int pointer here allows binary compatibility but
+     * gives an error if someone tries to assign callback function.
+     * Also, NULL pointer marks a 'sizing stream' that does not
+     * write anything.
+     */
+    int *callback;
+#else
+    bool (*callback)(pb_ostream_t *stream, const pb_byte_t *buf, size_t count);
+#endif
+    void *state;          /* Free field for use by callback implementation. */
+    size_t max_size;      /* Limit number of output bytes written (or use SIZE_MAX). */
+    size_t bytes_written; /* Number of bytes written so far. */
+    
+#ifndef PB_NO_ERRMSG
+    const char *errmsg;
+#endif
+};
+
+/***************************
+ * Main encoding functions *
+ ***************************/
+
+/* Encode a single protocol buffers message from C structure into a stream.
+ * Returns true on success, false on any failure.
+ * The actual struct pointed to by src_struct must match the description in fields.
+ * All required fields in the struct are assumed to have been filled in.
+ *
+ * Example usage:
+ *    MyMessage msg = {};
+ *    uint8_t buffer[64];
+ *    pb_ostream_t stream;
+ *
+ *    msg.field1 = 42;
+ *    stream = pb_ostream_from_buffer(buffer, sizeof(buffer));
+ *    pb_encode(&stream, MyMessage_fields, &msg);
+ */
+bool pb_encode(pb_ostream_t *stream, const pb_msgdesc_t *fields, const void *src_struct);
+
+/* Extended version of pb_encode, with several options to control the
+ * encoding process:
+ *
+ * PB_ENCODE_DELIMITED:      Prepend the length of message as a varint.
+ *                           Corresponds to writeDelimitedTo() in Google's
+ *                           protobuf API.
+ *
+ * PB_ENCODE_NULLTERMINATED: Append a null byte to the message for termination.
+ *                           NOTE: This behaviour is not supported in most other
+ *                           protobuf implementations, so PB_ENCODE_DELIMITED
+ *                           is a better option for compatibility.
+ */
+#define PB_ENCODE_DELIMITED       0x02U
+#define PB_ENCODE_NULLTERMINATED  0x04U
+bool pb_encode_ex(pb_ostream_t *stream, const pb_msgdesc_t *fields, const void *src_struct, unsigned int flags);
+
+/* Defines for backwards compatibility with code written before nanopb-0.4.0 */
+#define pb_encode_delimited(s,f,d) pb_encode_ex(s,f,d, PB_ENCODE_DELIMITED)
+#define pb_encode_nullterminated(s,f,d) pb_encode_ex(s,f,d, PB_ENCODE_NULLTERMINATED)
+
+/* Encode the message to get the size of the encoded data, but do not store
+ * the data. */
+bool pb_get_encoded_size(size_t *size, const pb_msgdesc_t *fields, const void *src_struct);
+
+/**************************************
+ * Functions for manipulating streams *
+ **************************************/
+
+/* Create an output stream for writing into a memory buffer.
+ * The number of bytes written can be found in stream.bytes_written after
+ * encoding the message.
+ *
+ * Alternatively, you can use a custom stream that writes directly to e.g.
+ * a file or a network socket.
+ */
+pb_ostream_t pb_ostream_from_buffer(pb_byte_t *buf, size_t bufsize);
+
+/* Pseudo-stream for measuring the size of a message without actually storing
+ * the encoded data.
+ * 
+ * Example usage:
+ *    MyMessage msg = {};
+ *    pb_ostream_t stream = PB_OSTREAM_SIZING;
+ *    pb_encode(&stream, MyMessage_fields, &msg);
+ *    printf("Message size is %d\n", stream.bytes_written);
+ */
+#ifndef PB_NO_ERRMSG
+#define PB_OSTREAM_SIZING {0,0,0,0,0}
+#else
+#define PB_OSTREAM_SIZING {0,0,0,0}
+#endif
+
+/* Function to write into a pb_ostream_t stream. You can use this if you need
+ * to append or prepend some custom headers to the message.
+ */
+bool pb_write(pb_ostream_t *stream, const pb_byte_t *buf, size_t count);
+
+
+/************************************************
+ * Helper functions for writing field callbacks *
+ ************************************************/
+
+/* Encode field header based on type and field number defined in the field
+ * structure. Call this from the callback before writing out field contents. */
+bool pb_encode_tag_for_field(pb_ostream_t *stream, const pb_field_iter_t *field);
+
+/* Encode field header by manually specifying wire type. You need to use this
+ * if you want to write out packed arrays from a callback field. */
+bool pb_encode_tag(pb_ostream_t *stream, pb_wire_type_t wiretype, uint32_t field_number);
+
+/* Encode an integer in the varint format.
+ * This works for bool, enum, int32, int64, uint32 and uint64 field types. */
+#ifndef PB_WITHOUT_64BIT
+bool pb_encode_varint(pb_ostream_t *stream, uint64_t value);
+#else
+bool pb_encode_varint(pb_ostream_t *stream, uint32_t value);
+#endif
+
+/* Encode an integer in the zig-zagged svarint format.
+ * This works for sint32 and sint64. */
+#ifndef PB_WITHOUT_64BIT
+bool pb_encode_svarint(pb_ostream_t *stream, int64_t value);
+#else
+bool pb_encode_svarint(pb_ostream_t *stream, int32_t value);
+#endif
+
+/* Encode a string or bytes type field. For strings, pass strlen(s) as size. */
+bool pb_encode_string(pb_ostream_t *stream, const pb_byte_t *buffer, size_t size);
+
+/* Encode a fixed32, sfixed32 or float value.
+ * You need to pass a pointer to a 4-byte wide C variable. */
+bool pb_encode_fixed32(pb_ostream_t *stream, const void *value);
+
+#ifndef PB_WITHOUT_64BIT
+/* Encode a fixed64, sfixed64 or double value.
+ * You need to pass a pointer to a 8-byte wide C variable. */
+bool pb_encode_fixed64(pb_ostream_t *stream, const void *value);
+#endif
+
+#ifdef PB_CONVERT_DOUBLE_FLOAT
+/* Encode a float value so that it appears like a double in the encoded
+ * message. */
+bool pb_encode_float_as_double(pb_ostream_t *stream, float value);
+#endif
+
+/* Encode a submessage field.
+ * You need to pass the pb_field_t array and pointer to struct, just like
+ * with pb_encode(). This internally encodes the submessage twice, first to
+ * calculate message size and then to actually write it out.
+ */
+bool pb_encode_submessage(pb_ostream_t *stream, const pb_msgdesc_t *fields, const void *src_struct);
+
+#ifdef __cplusplus
+} /* extern "C" */
+#endif
+
+#endif

diff --git a/security/container/protos/pbsystem.h b/security/container/protos/pbsystem.h
new file mode 100644
index 0000000..f2308f8
--- /dev/null
+++ b/security/container/protos/pbsystem.h

@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Header and types for nanopb to work with the Linux kernel */
+#include <linux/kernel.h>
+#include <linux/string.h>
+
+/* Small types.  */
+
+/* Signed.  */
+typedef signed char		int_least8_t;
+typedef short int		int_least16_t;
+typedef int			int_least32_t;
+typedef long int		int_least64_t;
+
+/* Unsigned.  */
+typedef unsigned char		uint_least8_t;
+typedef unsigned short int	uint_least16_t;
+typedef unsigned int		uint_least32_t;
+typedef unsigned long int	uint_least64_t;
+
+/* Fast types.  */
+
+/* Signed.  */
+typedef signed char		int_fast8_t;
+typedef long int		int_fast16_t;
+typedef long int		int_fast32_t;
+typedef long int		int_fast64_t;
+
+/* Unsigned.  */
+typedef unsigned char		uint_fast8_t;
+typedef unsigned long int	uint_fast16_t;
+typedef unsigned long int	uint_fast32_t;
+typedef unsigned long int	uint_fast64_t;

diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
index 2d1af88..cb2deaa 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c

@@ -501,37 +501,14 @@
 }
 EXPORT_SYMBOL_GPL(ima_file_check);
 
-/**
- * ima_file_hash - return the stored measurement if a file has been hashed and
- * is in the iint cache.
- * @file: pointer to the file
- * @buf: buffer in which to store the hash
- * @buf_size: length of the buffer
- *
- * On success, return the hash algorithm (as defined in the enum hash_algo).
- * If buf is not NULL, this function also outputs the hash into buf.
- * If the hash is larger than buf_size, then only buf_size bytes will be copied.
- * It generally just makes sense to pass a buffer capable of holding the largest
- * possible hash: IMA_MAX_DIGEST_SIZE.
- * The file hash returned is based on the entire file, including the appended
- * signature.
- *
- * If IMA is disabled or if no measurement is available, return -EOPNOTSUPP.
- * If the parameters are incorrect, return -EINVAL.
- */
-int ima_file_hash(struct file *file, char *buf, size_t buf_size)
+static int __ima_inode_hash(struct inode *inode, char *buf, size_t buf_size)
 {
-	struct inode *inode;
 	struct integrity_iint_cache *iint;
 	int hash_algo;
 
-	if (!file)
-		return -EINVAL;
-
 	if (!ima_policy_flag)
 		return -EOPNOTSUPP;
 
-	inode = file_inode(file);
 	iint = integrity_iint_find(inode);
 	if (!iint)
 		return -EOPNOTSUPP;
@@ -558,9 +535,62 @@
 
 	return hash_algo;
 }
+
+/**
+ * ima_file_hash - return the stored measurement if a file has been hashed and
+ * is in the iint cache.
+ * @file: pointer to the file
+ * @buf: buffer in which to store the hash
+ * @buf_size: length of the buffer
+ *
+ * On success, return the hash algorithm (as defined in the enum hash_algo).
+ * If buf is not NULL, this function also outputs the hash into buf.
+ * If the hash is larger than buf_size, then only buf_size bytes will be copied.
+ * It generally just makes sense to pass a buffer capable of holding the largest
+ * possible hash: IMA_MAX_DIGEST_SIZE.
+ * The file hash returned is based on the entire file, including the appended
+ * signature.
+ *
+ * If IMA is disabled or if no measurement is available, return -EOPNOTSUPP.
+ * If the parameters are incorrect, return -EINVAL.
+ */
+int ima_file_hash(struct file *file, char *buf, size_t buf_size)
+{
+	if (!file)
+		return -EINVAL;
+
+	return __ima_inode_hash(file_inode(file), buf, buf_size);
+}
 EXPORT_SYMBOL_GPL(ima_file_hash);
 
 /**
+ * ima_inode_hash - return the stored measurement if the inode has been hashed
+ * and is in the iint cache.
+ * @inode: pointer to the inode
+ * @buf: buffer in which to store the hash
+ * @buf_size: length of the buffer
+ *
+ * On success, return the hash algorithm (as defined in the enum hash_algo).
+ * If buf is not NULL, this function also outputs the hash into buf.
+ * If the hash is larger than buf_size, then only buf_size bytes will be copied.
+ * It generally just makes sense to pass a buffer capable of holding the largest
+ * possible hash: IMA_MAX_DIGEST_SIZE.
+ * The hash returned is based on the entire contents, including the appended
+ * signature.
+ *
+ * If IMA is disabled or if no measurement is available, return -EOPNOTSUPP.
+ * If the parameters are incorrect, return -EINVAL.
+ */
+int ima_inode_hash(struct inode *inode, char *buf, size_t buf_size)
+{
+	if (!inode)
+		return -EINVAL;
+
+	return __ima_inode_hash(inode, buf, buf_size);
+}
+EXPORT_SYMBOL_GPL(ima_inode_hash);
+
+/**
  * ima_post_create_tmpfile - mark newly created tmpfile as new
  * @file : newly created tmpfile
  *

diff --git a/security/loadpin/loadpin.c b/security/loadpin/loadpin.c
index b12f7d9..7f3b735 100644
--- a/security/loadpin/loadpin.c
+++ b/security/loadpin/loadpin.c

@@ -249,7 +249,7 @@
 };
 
 /* Should not be mutable after boot, so not listed in sysfs (perm == 0). */
-module_param(enforce, int, 0);
-MODULE_PARM_DESC(enforce, "Enforce module/firmware pinning");
+module_param_named(enabled, enforce, int, 0);
+MODULE_PARM_DESC(enabled, "Enforce module/firmware pinning");
 module_param_array_named(exclude, exclude_read_files, charp, NULL, 0);
 MODULE_PARM_DESC(exclude, "Exclude pinning specific read file types");

diff --git a/security/security.c b/security/security.c
index 1c696bc..62f248c 100644
--- a/security/security.c
+++ b/security/security.c

@@ -1477,6 +1477,11 @@
 	}
 }
 
+void security_file_pre_free(struct file *file)
+{
+	call_void_hook(file_pre_free_security, file);
+}
+
 int security_file_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
 	return call_int_hook(file_ioctl, 0, file, cmd, arg);
@@ -1592,6 +1597,11 @@
 	return rc;
 }
 
+void security_task_post_alloc(struct task_struct *task)
+{
+	call_void_hook(task_post_alloc, task);
+}
+
 void security_task_free(struct task_struct *task)
 {
 	call_void_hook(task_free, task);
@@ -1804,6 +1814,11 @@
 	return call_int_hook(task_kill, 0, p, info, sig, cred);
 }
 
+void security_task_exit(struct task_struct *p)
+{
+	call_void_hook(task_exit, p);
+}
+
 int security_task_prctl(int option, unsigned long arg2, unsigned long arg3,
 			 unsigned long arg4, unsigned long arg5)
 {

diff --git a/tools/include/linux/filter.h b/tools/include/linux/filter.h
index ca28b6a..736bdec 100644
--- a/tools/include/linux/filter.h
+++ b/tools/include/linux/filter.h

@@ -169,15 +169,31 @@
 		.off   = OFF,					\
 		.imm   = 0 })
 
-/* Atomic memory add, *(uint *)(dst_reg + off16) += src_reg */
+/*
+ * Atomic operations:
+ *
+ *   BPF_ADD                  *(uint *) (dst_reg + off16) += src_reg
+ *   BPF_AND                  *(uint *) (dst_reg + off16) &= src_reg
+ *   BPF_OR                   *(uint *) (dst_reg + off16) |= src_reg
+ *   BPF_XOR                  *(uint *) (dst_reg + off16) ^= src_reg
+ *   BPF_ADD | BPF_FETCH      src_reg = atomic_fetch_add(dst_reg + off16, src_reg);
+ *   BPF_AND | BPF_FETCH      src_reg = atomic_fetch_and(dst_reg + off16, src_reg);
+ *   BPF_OR | BPF_FETCH       src_reg = atomic_fetch_or(dst_reg + off16, src_reg);
+ *   BPF_XOR | BPF_FETCH      src_reg = atomic_fetch_xor(dst_reg + off16, src_reg);
+ *   BPF_XCHG                 src_reg = atomic_xchg(dst_reg + off16, src_reg)
+ *   BPF_CMPXCHG              r0 = atomic_cmpxchg(dst_reg + off16, r0, src_reg)
+ */
 
-#define BPF_STX_XADD(SIZE, DST, SRC, OFF)			\
+#define BPF_ATOMIC_OP(SIZE, OP, DST, SRC, OFF)			\
 	((struct bpf_insn) {					\
-		.code  = BPF_STX | BPF_SIZE(SIZE) | BPF_XADD,	\
+		.code  = BPF_STX | BPF_SIZE(SIZE) | BPF_ATOMIC,	\
 		.dst_reg = DST,					\
 		.src_reg = SRC,					\
 		.off   = OFF,					\
-		.imm   = 0 })
+		.imm   = OP })
+
+/* Legacy alias */
+#define BPF_STX_XADD(SIZE, DST, SRC, OFF) BPF_ATOMIC_OP(SIZE, BPF_ADD, DST, SRC, OFF)
 
 /* Memory store, *(uint *) (dst_reg + off16) = imm32 */
 

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 762bf87..9880f34 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h

@@ -19,7 +19,8 @@
 
 /* ld/ldx fields */
 #define BPF_DW		0x18	/* double word (64-bit) */
-#define BPF_XADD	0xc0	/* exclusive add */
+#define BPF_ATOMIC	0xc0	/* atomic memory ops - op type in immediate */
+#define BPF_XADD	0xc0	/* exclusive add - legacy name */
 
 /* alu/jmp fields */
 #define BPF_MOV		0xb0	/* mov reg to reg */
@@ -43,6 +44,11 @@
 #define BPF_CALL	0x80	/* function call */
 #define BPF_EXIT	0x90	/* function return */
 
+/* atomic op type fields (stored in immediate) */
+#define BPF_FETCH	0x01	/* not an opcode on its own, used to build others */
+#define BPF_XCHG	(0xe0 | BPF_FETCH)	/* atomic exchange */
+#define BPF_CMPXCHG	(0xf0 | BPF_FETCH)	/* atomic compare-and-write */
+
 /* Register numbers */
 enum {
 	BPF_REG_0 = 0,
@@ -157,6 +163,7 @@
 	BPF_MAP_TYPE_STRUCT_OPS,
 	BPF_MAP_TYPE_RINGBUF,
 	BPF_MAP_TYPE_INODE_STORAGE,
+	BPF_MAP_TYPE_TASK_STORAGE,
 };
 
 /* Note that tracing related programs such as
@@ -1661,6 +1668,14 @@
  * 	Return
  * 		A 8-byte long non-decreasing number.
  *
+ * u64 bpf_get_socket_cookie(struct sock *sk)
+ * 	Description
+ * 		Equivalent to **bpf_get_socket_cookie**\ () helper that accepts
+ * 		*sk*, but gets socket from a BTF **struct sock**. This helper
+ * 		also works for sleepable programs.
+ * 	Return
+ * 		A 8-byte long unique number or 0 if *sk* is NULL.
+ *
  * u32 bpf_get_socket_uid(struct sk_buff *skb)
  * 	Return
  * 		The owner UID of the socket associated to *skb*. If the socket
@@ -2442,7 +2457,7 @@
  *		running simultaneously.
  *
  *		A user should care about the synchronization by himself.
- *		For example, by using the **BPF_STX_XADD** instruction to alter
+ *		For example, by using the **BPF_ATOMIC** instructions to alter
  *		the shared data.
  *	Return
  *		A pointer to the local storage area.
@@ -3742,6 +3757,80 @@
  * 	Return
  * 		The helper returns **TC_ACT_REDIRECT** on success or
  * 		**TC_ACT_SHOT** on error.
+ *
+ * void *bpf_task_storage_get(struct bpf_map *map, struct task_struct *task, void *value, u64 flags)
+ *	Description
+ *		Get a bpf_local_storage from the *task*.
+ *
+ *		Logically, it could be thought of as getting the value from
+ *		a *map* with *task* as the **key**.  From this
+ *		perspective,  the usage is not much different from
+ *		**bpf_map_lookup_elem**\ (*map*, **&**\ *task*) except this
+ *		helper enforces the key must be an task_struct and the map must also
+ *		be a **BPF_MAP_TYPE_TASK_STORAGE**.
+ *
+ *		Underneath, the value is stored locally at *task* instead of
+ *		the *map*.  The *map* is used as the bpf-local-storage
+ *		"type". The bpf-local-storage "type" (i.e. the *map*) is
+ *		searched against all bpf_local_storage residing at *task*.
+ *
+ *		An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be
+ *		used such that a new bpf_local_storage will be
+ *		created if one does not exist.  *value* can be used
+ *		together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify
+ *		the initial value of a bpf_local_storage.  If *value* is
+ *		**NULL**, the new bpf_local_storage will be zero initialized.
+ *	Return
+ *		A bpf_local_storage pointer is returned on success.
+ *
+ *		**NULL** if not found or there was an error in adding
+ *		a new bpf_local_storage.
+ *
+ * long bpf_task_storage_delete(struct bpf_map *map, struct task_struct *task)
+ *	Description
+ *		Delete a bpf_local_storage from a *task*.
+ *	Return
+ *		0 on success.
+ *
+ *		**-ENOENT** if the bpf_local_storage cannot be found.
+ *
+ * struct task_struct *bpf_get_current_task_btf(void)
+ *	Description
+ *		Return a BTF pointer to the "current" task.
+ *		This pointer can also be used in helpers that accept an
+ *		*ARG_PTR_TO_BTF_ID* of type *task_struct*.
+ *	Return
+ *		Pointer to the current task.
+ *
+ * long bpf_bprm_opts_set(struct linux_binprm *bprm, u64 flags)
+ *	Description
+ *		Set or clear certain options on *bprm*:
+ *
+ *		**BPF_F_BPRM_SECUREEXEC** Set the secureexec bit
+ *		which sets the **AT_SECURE** auxv for glibc. The bit
+ *		is cleared if the flag is not specified.
+ *	Return
+ *		**-EINVAL** if invalid *flags* are passed, zero otherwise.
+ *
+ * u64 bpf_ktime_get_coarse_ns(void)
+ * 	Description
+ * 		Return a coarse-grained version of the time elapsed since
+ * 		system boot, in nanoseconds. Does not include time the system
+ * 		was suspended.
+ *
+ * 		See: **clock_gettime**\ (**CLOCK_MONOTONIC_COARSE**)
+ * 	Return
+ * 		Current *ktime*.
+ *
+ * long bpf_ima_inode_hash(struct inode *inode, void *dst, u32 size)
+ *	Description
+ *		Returns the stored IMA hash of the *inode* (if it's avaialable).
+ *		If the hash is larger than *size*, then only *size*
+ *		bytes will be copied to *dst*
+ *	Return
+ *		The **hash_algo** is returned on success,
+ *		**-EOPNOTSUP** if IMA is disabled or **-EINVAL** if
+ *		invalid arguments are passed.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3900,6 +3989,12 @@
 	FN(per_cpu_ptr),		\
 	FN(this_cpu_ptr),		\
 	FN(redirect_peer),		\
+	FN(task_storage_get),		\
+	FN(task_storage_delete),	\
+	FN(get_current_task_btf),	\
+	FN(bprm_opts_set),		\
+	FN(ktime_get_coarse_ns),	\
+	FN(ima_inode_hash),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -4071,6 +4166,11 @@
 	BPF_LWT_ENCAP_IP,
 };
 
+/* Flags for bpf_bprm_opts_set helper */
+enum {
+	BPF_F_BPRM_SECUREEXEC	= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\

diff --git a/tools/testing/selftests/bpf/prog_tests/atomic_bounds.c b/tools/testing/selftests/bpf/prog_tests/atomic_bounds.c
new file mode 100644
index 0000000..addf127
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/atomic_bounds.c

@@ -0,0 +1,15 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <test_progs.h>
+
+#include "atomic_bounds.skel.h"
+
+void test_atomic_bounds(void)
+{
+	struct atomic_bounds *skel;
+	__u32 duration = 0;
+
+	skel = atomic_bounds__open_and_load();
+	if (CHECK(!skel, "skel_load", "couldn't load program\n"))
+		return;
+}

diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_attach_multi.c b/tools/testing/selftests/bpf/prog_tests/cgroup_attach_multi.c
index b549fcf..0a1fc98 100644
--- a/tools/testing/selftests/bpf/prog_tests/cgroup_attach_multi.c
+++ b/tools/testing/selftests/bpf/prog_tests/cgroup_attach_multi.c

@@ -45,13 +45,13 @@
 		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
 		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
 		BPF_MOV64_IMM(BPF_REG_1, val), /* r1 = 1 */
-		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+		BPF_ATOMIC_OP(BPF_DW, BPF_ADD, BPF_REG_0, BPF_REG_1, 0),
 
 		BPF_LD_MAP_FD(BPF_REG_1, cgroup_storage_fd),
 		BPF_MOV64_IMM(BPF_REG_2, 0),
 		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_get_local_storage),
 		BPF_MOV64_IMM(BPF_REG_1, val),
-		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_W, BPF_REG_0, BPF_REG_1, 0, 0),
+		BPF_ATOMIC_OP(BPF_W, BPF_ADD, BPF_REG_0, BPF_REG_1, 0),
 
 		BPF_LD_MAP_FD(BPF_REG_1, percpu_cgroup_storage_fd),
 		BPF_MOV64_IMM(BPF_REG_2, 0),

diff --git a/tools/testing/selftests/bpf/progs/atomic_bounds.c b/tools/testing/selftests/bpf/progs/atomic_bounds.c
new file mode 100644
index 0000000..e5fff7f
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/atomic_bounds.c

@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <stdbool.h>
+
+#ifdef ENABLE_ATOMICS_TESTS
+bool skip_tests __attribute((__section__(".data"))) = false;
+#else
+bool skip_tests = true;
+#endif
+
+SEC("fentry/bpf_fentry_test1")
+int BPF_PROG(sub, int x)
+{
+#ifdef ENABLE_ATOMICS_TESTS
+	int a = 0;
+	int b = __sync_fetch_and_add(&a, 1);
+	/* b is certainly 0 here. Can the verifier tell? */
+	while (b)
+		continue;
+#endif
+	return 0;
+}

diff --git a/tools/testing/selftests/bpf/test_cgroup_storage.c b/tools/testing/selftests/bpf/test_cgroup_storage.c
index d946252..0cda61d 100644
--- a/tools/testing/selftests/bpf/test_cgroup_storage.c
+++ b/tools/testing/selftests/bpf/test_cgroup_storage.c

@@ -29,7 +29,7 @@
 		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
 			     BPF_FUNC_get_local_storage),
 		BPF_MOV64_IMM(BPF_REG_1, 1),
-		BPF_STX_XADD(BPF_DW, BPF_REG_0, BPF_REG_1, 0),
+		BPF_ATOMIC_OP(BPF_DW, BPF_ADD, BPF_REG_0, BPF_REG_1, 0),
 		BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 0),
 		BPF_ALU64_IMM(BPF_AND, BPF_REG_1, 0x1),
 		BPF_MOV64_REG(BPF_REG_0, BPF_REG_1),

diff --git a/tools/testing/selftests/bpf/verifier/atomic_bounds.c b/tools/testing/selftests/bpf/verifier/atomic_bounds.c
new file mode 100644
index 0000000..e82183e
--- /dev/null
+++ b/tools/testing/selftests/bpf/verifier/atomic_bounds.c

@@ -0,0 +1,27 @@
+{
+	"BPF_ATOMIC bounds propagation, mem->reg",
+	.insns = {
+		/* a = 0; */
+		/*
+		 * Note this is implemented with two separate instructions,
+		 * where you might think one would suffice:
+		 *
+		 * BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+		 *
+		 * This is because BPF_ST_MEM doesn't seem to set the stack slot
+		 * type to 0 when storing an immediate.
+		 */
+		BPF_MOV64_IMM(BPF_REG_0, 0),
+		BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_0, -8),
+		/* b = atomic_fetch_add(&a, 1); */
+		BPF_MOV64_IMM(BPF_REG_1, 1),
+		BPF_ATOMIC_OP(BPF_DW, BPF_ADD | BPF_FETCH, BPF_REG_10, BPF_REG_1, -8),
+		/* Verifier should be able to tell that this infinite loop isn't reachable. */
+		/* if (b) while (true) continue; */
+		BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, -1),
+		BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+	.result_unpriv = REJECT,
+	.errstr_unpriv = "back-edge",
+},

diff --git a/tools/testing/selftests/bpf/verifier/ctx.c b/tools/testing/selftests/bpf/verifier/ctx.c
index 93d6b16..2308086 100644
--- a/tools/testing/selftests/bpf/verifier/ctx.c
+++ b/tools/testing/selftests/bpf/verifier/ctx.c

@@ -10,14 +10,13 @@
 	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
 },
 {
-	"context stores via XADD",
+	"context stores via BPF_ATOMIC",
 	.insns = {
 	BPF_MOV64_IMM(BPF_REG_0, 0),
-	BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_W, BPF_REG_1,
-		     BPF_REG_0, offsetof(struct __sk_buff, mark), 0),
+	BPF_ATOMIC_OP(BPF_W, BPF_ADD, BPF_REG_1, BPF_REG_0, offsetof(struct __sk_buff, mark)),
 	BPF_EXIT_INSN(),
 	},
-	.errstr = "BPF_XADD stores into R1 ctx is not allowed",
+	.errstr = "BPF_ATOMIC stores into R1 ctx is not allowed",
 	.result = REJECT,
 	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
 },

diff --git a/tools/testing/selftests/bpf/verifier/direct_packet_access.c b/tools/testing/selftests/bpf/verifier/direct_packet_access.c
index ae72536..ac1e19d 100644
--- a/tools/testing/selftests/bpf/verifier/direct_packet_access.c
+++ b/tools/testing/selftests/bpf/verifier/direct_packet_access.c

@@ -333,7 +333,7 @@
 	BPF_MOV64_REG(BPF_REG_4, BPF_REG_10),
 	BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, -8),
 	BPF_STX_MEM(BPF_DW, BPF_REG_4, BPF_REG_2, 0),
-	BPF_STX_XADD(BPF_DW, BPF_REG_4, BPF_REG_5, 0),
+	BPF_ATOMIC_OP(BPF_DW, BPF_ADD, BPF_REG_4, BPF_REG_5, 0),
 	BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_4, 0),
 	BPF_STX_MEM(BPF_W, BPF_REG_2, BPF_REG_5, 0),
 	BPF_MOV64_IMM(BPF_REG_0, 0),
@@ -488,7 +488,7 @@
 	BPF_JMP_REG(BPF_JGT, BPF_REG_0, BPF_REG_3, 11),
 	BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_10, -8),
 	BPF_MOV64_IMM(BPF_REG_4, 0xffffffff),
-	BPF_STX_XADD(BPF_DW, BPF_REG_10, BPF_REG_4, -8),
+	BPF_ATOMIC_OP(BPF_DW, BPF_ADD, BPF_REG_10, BPF_REG_4, -8),
 	BPF_LDX_MEM(BPF_DW, BPF_REG_4, BPF_REG_10, -8),
 	BPF_ALU64_IMM(BPF_RSH, BPF_REG_4, 49),
 	BPF_ALU64_REG(BPF_ADD, BPF_REG_4, BPF_REG_2),

diff --git a/tools/testing/selftests/bpf/verifier/leak_ptr.c b/tools/testing/selftests/bpf/verifier/leak_ptr.c
index d6eec17..73f0dea 100644
--- a/tools/testing/selftests/bpf/verifier/leak_ptr.c
+++ b/tools/testing/selftests/bpf/verifier/leak_ptr.c

@@ -5,7 +5,7 @@
 	BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0,
 		    offsetof(struct __sk_buff, cb[0])),
 	BPF_LD_MAP_FD(BPF_REG_2, 0),
-	BPF_STX_XADD(BPF_DW, BPF_REG_1, BPF_REG_2,
+	BPF_ATOMIC_OP(BPF_DW, BPF_ADD, BPF_REG_1, BPF_REG_2,
 		      offsetof(struct __sk_buff, cb[0])),
 	BPF_EXIT_INSN(),
 	},
@@ -13,7 +13,7 @@
 	.errstr_unpriv = "R2 leaks addr into mem",
 	.result_unpriv = REJECT,
 	.result = REJECT,
-	.errstr = "BPF_XADD stores into R1 ctx is not allowed",
+	.errstr = "BPF_ATOMIC stores into R1 ctx is not allowed",
 },
 {
 	"leak pointer into ctx 2",
@@ -21,14 +21,14 @@
 	BPF_MOV64_IMM(BPF_REG_0, 0),
 	BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0,
 		    offsetof(struct __sk_buff, cb[0])),
-	BPF_STX_XADD(BPF_DW, BPF_REG_1, BPF_REG_10,
+	BPF_ATOMIC_OP(BPF_DW, BPF_ADD, BPF_REG_1, BPF_REG_10,
 		      offsetof(struct __sk_buff, cb[0])),
 	BPF_EXIT_INSN(),
 	},
 	.errstr_unpriv = "R10 leaks addr into mem",
 	.result_unpriv = REJECT,
 	.result = REJECT,
-	.errstr = "BPF_XADD stores into R1 ctx is not allowed",
+	.errstr = "BPF_ATOMIC stores into R1 ctx is not allowed",
 },
 {
 	"leak pointer into ctx 3",
@@ -56,7 +56,7 @@
 	BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 3),
 	BPF_MOV64_IMM(BPF_REG_3, 0),
 	BPF_STX_MEM(BPF_DW, BPF_REG_0, BPF_REG_3, 0),
-	BPF_STX_XADD(BPF_DW, BPF_REG_0, BPF_REG_6, 0),
+	BPF_ATOMIC_OP(BPF_DW, BPF_ADD, BPF_REG_0, BPF_REG_6, 0),
 	BPF_MOV64_IMM(BPF_REG_0, 0),
 	BPF_EXIT_INSN(),
 	},

diff --git a/tools/testing/selftests/bpf/verifier/meta_access.c b/tools/testing/selftests/bpf/verifier/meta_access.c
index 205292b..b45e8af 100644
--- a/tools/testing/selftests/bpf/verifier/meta_access.c
+++ b/tools/testing/selftests/bpf/verifier/meta_access.c

@@ -171,7 +171,7 @@
 	BPF_MOV64_IMM(BPF_REG_5, 42),
 	BPF_MOV64_IMM(BPF_REG_6, 24),
 	BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_5, -8),
-	BPF_STX_XADD(BPF_DW, BPF_REG_10, BPF_REG_6, -8),
+	BPF_ATOMIC_OP(BPF_DW, BPF_ADD, BPF_REG_10, BPF_REG_6, -8),
 	BPF_LDX_MEM(BPF_DW, BPF_REG_5, BPF_REG_10, -8),
 	BPF_JMP_IMM(BPF_JGT, BPF_REG_5, 100, 6),
 	BPF_ALU64_REG(BPF_ADD, BPF_REG_3, BPF_REG_5),
@@ -196,7 +196,7 @@
 	BPF_MOV64_IMM(BPF_REG_5, 42),
 	BPF_MOV64_IMM(BPF_REG_6, 24),
 	BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_5, -8),
-	BPF_STX_XADD(BPF_DW, BPF_REG_10, BPF_REG_6, -8),
+	BPF_ATOMIC_OP(BPF_DW, BPF_ADD, BPF_REG_10, BPF_REG_6, -8),
 	BPF_LDX_MEM(BPF_DW, BPF_REG_5, BPF_REG_10, -8),
 	BPF_JMP_IMM(BPF_JGT, BPF_REG_5, 100, 6),
 	BPF_ALU64_REG(BPF_ADD, BPF_REG_2, BPF_REG_5),

diff --git a/tools/testing/selftests/bpf/verifier/unpriv.c b/tools/testing/selftests/bpf/verifier/unpriv.c
index 9dfb68c..111801a 100644
--- a/tools/testing/selftests/bpf/verifier/unpriv.c
+++ b/tools/testing/selftests/bpf/verifier/unpriv.c

@@ -207,7 +207,8 @@
 	BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, -8),
 	BPF_STX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 0),
 	BPF_MOV64_IMM(BPF_REG_0, 1),
-	BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_10, BPF_REG_0, -8, 0),
+	BPF_RAW_INSN(BPF_STX | BPF_ATOMIC | BPF_DW,
+		     BPF_REG_10, BPF_REG_0, -8, BPF_ADD),
 	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_6, 0),
 	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_get_hash_recalc),
 	BPF_EXIT_INSN(),

diff --git a/tools/testing/selftests/bpf/verifier/value_illegal_alu.c b/tools/testing/selftests/bpf/verifier/value_illegal_alu.c
index ed1c2ce..4890628 100644
--- a/tools/testing/selftests/bpf/verifier/value_illegal_alu.c
+++ b/tools/testing/selftests/bpf/verifier/value_illegal_alu.c

@@ -82,7 +82,7 @@
 	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
 	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
 	BPF_STX_MEM(BPF_DW, BPF_REG_2, BPF_REG_0, 0),
-	BPF_STX_XADD(BPF_DW, BPF_REG_2, BPF_REG_3, 0),
+	BPF_ATOMIC_OP(BPF_DW, BPF_ADD, BPF_REG_2, BPF_REG_3, 0),
 	BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_2, 0),
 	BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 22),
 	BPF_EXIT_INSN(),

diff --git a/tools/testing/selftests/bpf/verifier/xadd.c b/tools/testing/selftests/bpf/verifier/xadd.c
index c5de2e6..b96ef35 100644
--- a/tools/testing/selftests/bpf/verifier/xadd.c
+++ b/tools/testing/selftests/bpf/verifier/xadd.c

@@ -3,7 +3,7 @@
 	.insns = {
 	BPF_MOV64_IMM(BPF_REG_0, 1),
 	BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_0, -8),
-	BPF_STX_XADD(BPF_W, BPF_REG_10, BPF_REG_0, -7),
+	BPF_ATOMIC_OP(BPF_W, BPF_ADD, BPF_REG_10, BPF_REG_0, -7),
 	BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_10, -8),
 	BPF_EXIT_INSN(),
 	},
@@ -22,7 +22,7 @@
 	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
 	BPF_EXIT_INSN(),
 	BPF_MOV64_IMM(BPF_REG_1, 1),
-	BPF_STX_XADD(BPF_W, BPF_REG_0, BPF_REG_1, 3),
+	BPF_ATOMIC_OP(BPF_W, BPF_ADD, BPF_REG_0, BPF_REG_1, 3),
 	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 3),
 	BPF_EXIT_INSN(),
 	},
@@ -45,13 +45,13 @@
 	BPF_MOV64_IMM(BPF_REG_0, 1),
 	BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
 	BPF_ST_MEM(BPF_W, BPF_REG_2, 3, 0),
-	BPF_STX_XADD(BPF_W, BPF_REG_2, BPF_REG_0, 1),
-	BPF_STX_XADD(BPF_W, BPF_REG_2, BPF_REG_0, 2),
+	BPF_ATOMIC_OP(BPF_W, BPF_ADD, BPF_REG_2, BPF_REG_0, 1),
+	BPF_ATOMIC_OP(BPF_W, BPF_ADD, BPF_REG_2, BPF_REG_0, 2),
 	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_2, 1),
 	BPF_EXIT_INSN(),
 	},
 	.result = REJECT,
-	.errstr = "BPF_XADD stores into R2 pkt is not allowed",
+	.errstr = "BPF_ATOMIC stores into R2 pkt is not allowed",
 	.prog_type = BPF_PROG_TYPE_XDP,
 	.flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
 },
@@ -62,8 +62,8 @@
 	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
 	BPF_MOV64_REG(BPF_REG_7, BPF_REG_10),
 	BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_0, -8),
-	BPF_STX_XADD(BPF_DW, BPF_REG_10, BPF_REG_0, -8),
-	BPF_STX_XADD(BPF_DW, BPF_REG_10, BPF_REG_0, -8),
+	BPF_ATOMIC_OP(BPF_DW, BPF_ADD, BPF_REG_10, BPF_REG_0, -8),
+	BPF_ATOMIC_OP(BPF_DW, BPF_ADD, BPF_REG_10, BPF_REG_0, -8),
 	BPF_JMP_REG(BPF_JNE, BPF_REG_6, BPF_REG_0, 3),
 	BPF_JMP_REG(BPF_JNE, BPF_REG_7, BPF_REG_10, 2),
 	BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_10, -8),
@@ -82,8 +82,8 @@
 	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
 	BPF_MOV64_REG(BPF_REG_7, BPF_REG_10),
 	BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -8),
-	BPF_STX_XADD(BPF_W, BPF_REG_10, BPF_REG_0, -8),
-	BPF_STX_XADD(BPF_W, BPF_REG_10, BPF_REG_0, -8),
+	BPF_ATOMIC_OP(BPF_W, BPF_ADD, BPF_REG_10, BPF_REG_0, -8),
+	BPF_ATOMIC_OP(BPF_W, BPF_ADD, BPF_REG_10, BPF_REG_0, -8),
 	BPF_JMP_REG(BPF_JNE, BPF_REG_6, BPF_REG_0, 3),
 	BPF_JMP_REG(BPF_JNE, BPF_REG_7, BPF_REG_10, 2),
 	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_10, -8),

diff --git a/tools/testing/selftests/futex/functional/.gitignore b/tools/testing/selftests/futex/functional/.gitignore
index 0efcd494..d661ef0 100644
--- a/tools/testing/selftests/futex/functional/.gitignore
+++ b/tools/testing/selftests/futex/functional/.gitignore

@@ -2,6 +2,7 @@
 futex_requeue_pi
 futex_requeue_pi_mismatched_ops
 futex_requeue_pi_signal_restart
+futex_swap
 futex_wait_private_mapped_file
 futex_wait_timeout
 futex_wait_uninitialized_heap

diff --git a/tools/testing/selftests/futex/functional/Makefile b/tools/testing/selftests/futex/functional/Makefile
index 2320782..6992fac 100644
--- a/tools/testing/selftests/futex/functional/Makefile
+++ b/tools/testing/selftests/futex/functional/Makefile

@@ -13,6 +13,7 @@
 	futex_requeue_pi \
 	futex_requeue_pi_signal_restart \
 	futex_requeue_pi_mismatched_ops \
+	futex_swap \
 	futex_wait_uninitialized_heap \
 	futex_wait_private_mapped_file
 

diff --git a/tools/testing/selftests/futex/functional/futex_swap.c b/tools/testing/selftests/futex/functional/futex_swap.c
new file mode 100644
index 0000000..9034d04
--- /dev/null
+++ b/tools/testing/selftests/futex/functional/futex_swap.c

@@ -0,0 +1,209 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include <errno.h>
+#include <getopt.h>
+#include <pthread.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <time.h>
+#include "atomic.h"
+#include "futextest.h"
+
+/* The futex the main thread waits on. */
+futex_t futex_main = FUTEX_INITIALIZER;
+/* The futex the other thread wats on. */
+futex_t futex_other = FUTEX_INITIALIZER;
+
+/* The number of iterations to run (>1 => run benchmarks. */
+static int cfg_iterations = 1;
+
+/* If != 0, print diagnostic messages. */
+static int cfg_verbose;
+
+/* If == 0, do not use validation_counter. Useful for benchmarking. */
+static int cfg_validate = 1;
+
+/* How to swap threads. */
+#define SWAP_WAKE_WAIT 1
+#define SWAP_SWAP 2
+
+/* Futex values. */
+#define FUTEX_WAITING 0
+#define FUTEX_WAKEUP 1
+
+/* An atomic counter used to validate proper swapping. */
+static atomic_t validation_counter;
+
+void futex_swap_op(int mode, futex_t *futex_this, futex_t *futex_that)
+{
+	int ret;
+
+	switch (mode) {
+	case SWAP_WAKE_WAIT:
+		futex_set(futex_this, FUTEX_WAITING);
+		futex_set(futex_that, FUTEX_WAKEUP);
+		futex_wake(futex_that, 1, FUTEX_PRIVATE_FLAG);
+		futex_wait(futex_this, FUTEX_WAITING, NULL, FUTEX_PRIVATE_FLAG);
+		if (*futex_this != FUTEX_WAKEUP) {
+			fprintf(stderr, "unexpected futex_this value on wakeup\n");
+			exit(1);
+		}
+		break;
+
+	case SWAP_SWAP:
+		futex_set(futex_this, FUTEX_WAITING);
+		futex_set(futex_that, FUTEX_WAKEUP);
+		ret = futex_swap(futex_this, FUTEX_WAITING, NULL,
+				 futex_that, FUTEX_PRIVATE_FLAG);
+		if (ret < 0 && errno == ENOSYS) {
+			/* futex_swap not implemented */
+			perror("futex_swap");
+			exit(1);
+		}
+		if (*futex_this != FUTEX_WAKEUP) {
+			fprintf(stderr, "unexpected futex_this value on wakeup\n");
+			exit(1);
+		}
+		break;
+
+	default:
+		fprintf(stderr, "unknown mode in %s\n", __func__);
+		exit(1);
+	}
+}
+
+void *other_thread(void *arg)
+{
+	int mode = *((int *)arg);
+	int counter;
+
+	if (cfg_verbose)
+		printf("%s started\n", __func__);
+
+	futex_wait(&futex_other, 0, NULL, FUTEX_PRIVATE_FLAG);
+
+	for (counter = 0; counter < cfg_iterations; ++counter) {
+		if (cfg_validate) {
+			int prev = 2 * counter + 1;
+
+			if (prev != atomic_cmpxchg(&validation_counter, prev,
+						   prev + 1)) {
+				fprintf(stderr, "swap validation failed\n");
+				exit(1);
+			}
+		}
+		futex_swap_op(mode, &futex_other, &futex_main);
+	}
+
+	if (cfg_verbose)
+		printf("%s finished: %d iteration(s)\n", __func__, counter);
+
+	return NULL;
+}
+
+void run_test(int mode)
+{
+	struct timespec start, stop;
+	int ret, counter;
+	pthread_t thread;
+	uint64_t duration;
+
+	futex_set(&futex_other, FUTEX_WAITING);
+	atomic_set(&validation_counter, 0);
+	ret = pthread_create(&thread, NULL, &other_thread, &mode);
+	if (ret) {
+		perror("pthread_create");
+		exit(1);
+	}
+
+	ret = clock_gettime(CLOCK_MONOTONIC, &start);
+	if (ret) {
+		perror("clock_gettime");
+		exit(1);
+	}
+
+	for (counter = 0; counter < cfg_iterations; ++counter) {
+		if (cfg_validate) {
+			int prev = 2 * counter;
+
+			if (prev != atomic_cmpxchg(&validation_counter, prev,
+						   prev + 1)) {
+				fprintf(stderr, "swap validation failed\n");
+				exit(1);
+			}
+		}
+		futex_swap_op(mode, &futex_main, &futex_other);
+	}
+	if (cfg_validate && validation_counter.val != 2 * cfg_iterations) {
+		fprintf(stderr, "final swap validation failed\n");
+		exit(1);
+	}
+
+	ret = clock_gettime(CLOCK_MONOTONIC, &stop);
+	if (ret) {
+		perror("clock_gettime");
+		exit(1);
+	}
+
+	duration = (stop.tv_sec - start.tv_sec) * 1000000000LL +
+	stop.tv_nsec - start.tv_nsec;
+	if (cfg_verbose || cfg_iterations > 1) {
+		printf("completed %d swap and back iterations in %lu ns: %lu ns per swap\n",
+			cfg_iterations, duration,
+			duration / (cfg_iterations * 2));
+	}
+
+	/* The remote thread is blocked; send it the final wake. */
+	futex_set(&futex_other, FUTEX_WAKEUP);
+	futex_wake(&futex_other, 1, FUTEX_PRIVATE_FLAG);
+	if (pthread_join(thread, NULL)) {
+		perror("pthread_join");
+		exit(1);
+	}
+}
+
+void usage(char *prog)
+{
+	printf("Usage: %s\n", prog);
+	printf("  -h    Display this help message\n");
+	printf("  -i N  Use N iterations to benchmark\n");
+	printf("  -n    Do not validate swapping correctness\n");
+	printf("  -v    Print diagnostic messages\n");
+}
+
+int main(int argc, char *argv[])
+{
+	int c;
+
+	while ((c = getopt(argc, argv, "hi:nv")) != -1) {
+		switch (c) {
+		case 'h':
+			usage(basename(argv[0]));
+			exit(0);
+		case 'i':
+			cfg_iterations = atoi(optarg);
+			break;
+		case 'n':
+			cfg_validate = 0;
+			break;
+		case 'v':
+			cfg_verbose = 1;
+			break;
+		default:
+			usage(basename(argv[0]));
+			exit(1);
+		}
+	}
+
+	printf("\n\n------- running SWAP_WAKE_WAIT -----------\n\n");
+	run_test(SWAP_WAKE_WAIT);
+	printf("PASS\n");
+
+	printf("\n\n------- running SWAP_SWAP -----------\n\n");
+	run_test(SWAP_SWAP);
+	printf("PASS\n");
+
+	return 0;
+}

diff --git a/tools/testing/selftests/futex/include/futextest.h b/tools/testing/selftests/futex/include/futextest.h
index ddbcfc9..d2861fd 100644
--- a/tools/testing/selftests/futex/include/futextest.h
+++ b/tools/testing/selftests/futex/include/futextest.h

@@ -38,6 +38,9 @@
 #ifndef FUTEX_CMP_REQUEUE_PI
 #define FUTEX_CMP_REQUEUE_PI		12
 #endif
+#ifndef GFUTEX_SWAP
+#define GFUTEX_SWAP			60
+#endif
 #ifndef FUTEX_WAIT_REQUEUE_PI_PRIVATE
 #define FUTEX_WAIT_REQUEUE_PI_PRIVATE	(FUTEX_WAIT_REQUEUE_PI | \
 					 FUTEX_PRIVATE_FLAG)
@@ -205,6 +208,19 @@
 }
 
 /**
+ * futex_swap() - block on uaddr and wake one task blocked on uaddr2.
+ * @uaddr:	futex to block the current task on
+ * @timeout:	relative timeout for the current task block
+ * @uaddr2:	futex to wake tasks at (can be the same as uaddr)
+ */
+static inline int
+futex_swap(futex_t *uaddr, futex_t val, struct timespec *timeout,
+	   futex_t *uaddr2, int opflags)
+{
+	return futex(uaddr, GFUTEX_SWAP, val, timeout, uaddr2, 0, opflags);
+}
+
+/**
  * futex_cmpxchg() - atomic compare and exchange
  * @uaddr:	The address of the futex to be modified
  * @oldval:	The expected value of the futex
commit	80b9fac18917cdba3d0f9147f85e2bcea3749927	[log] [tgz]
author	COS Kernel Merge Bot <cloud-image-merge-automation@prod.google.com>	Wed Sep 15 01:57:34 2021 -0700
committer	COS Kernel Merge Bot <cloud-image-merge-automation@prod.google.com>	Wed Sep 15 01:57:34 2021 -0700
tree	f279a8a6a8d5af3c508bec2a4a5a2e611335cd8b
parent	c6be0351b9fe8770372e090885c83f50af4e03ca [diff]
parent	c31c2cca229aa5280d108618bb264c713840a4c2 [diff]