merge-upstream/v4.19.127 from branch/tag: upstream/v4.19.127 into branch: cos-4.19
Changelog:
-------------------------------------------------------------
Aneesh Kumar K.V (1):
libnvdimm: Fix endian conversion issues
Anju T Sudhakar (1):
powerpc/powernv: Avoid re-registration of imc debugfs directory
Atsushi Nemoto (1):
i2c: altera: Fix race between xfer_msg and isr thread
Can Guo (1):
scsi: ufs: Release clock if DMA map fails
Chaitanya Kulkarni (1):
null_blk: return error for invalid zone size
DENG Qingfang (1):
net: dsa: mt7530: set CPU port to fallback mode
Dan Carpenter (1):
airo: Fix read overflows sending packets
Daniel Axtens (1):
kernel/relay.c: handle alloc_percpu returning NULL in relay_open
Dinghao Liu (1):
net: smsc911x: Fix runtime PM imbalance on error
Eugeniy Paltsev (1):
ARC: Fix ICCM & DCCM runtime size checks
Fan Yang (1):
mm: Fix mremap not considering huge pmd devmap
Gerald Schaefer (1):
s390/mm: fix set_huge_pte_at() for empty ptes
Giuseppe Marco Randazzo (1):
p54usb: add AirVasT USB stick device-id
Greg Kroah-Hartman (1):
Linux 4.19.127
Jan Schmidt (1):
drm/edid: Add Oculus Rift S to non-desktop list
Jeremy Kerr (1):
net: bmac: Fix read of MAC address from ROM
Jonathan McDowell (1):
net: ethernet: stmmac: Enable interface clocks on probe for IPQ806x
Julian Sax (1):
HID: i2c-hid: add Schneider SCL142ALM to descriptor override
Jérôme Pouiller (1):
mmc: fix compilation of user API
Lucas De Marchi (1):
drm/i915: fix port checks for MST support on gen >= 11
Madhuparna Bhowmik (1):
evm: Fix RCU list related warnings
Nathan Chancellor (1):
x86/mmiotrace: Use cpumask_available() for cpumask_var_t variables
Scott Shumate (1):
HID: sony: Fix for broken buttons on DS3 USB dongles
Tejun Heo (1):
Revert "cgroup: Add memory barriers to plug cgroup_rstat_updated() race window"
Valentin Longchamp (1):
net/ethernet/freescale: rework quiesce/activate for ucc_geth
Vasily Gorbik (1):
s390/ftrace: save traced function caller
Vineet Gupta (1):
ARC: [plat-eznps]: Restrict to CONFIG_ISA_ARCOMPACT
Xiang Chen (1):
scsi: hisi_sas: Check sas_port before using it
Xinwei Kong (1):
spi: dw: use "smp_mb()" to avoid sending spi data error
BUG=b/158444866
TEST=tryjob, validation and K8s e2e
RELEASE_NOTE=Upgraded the Linux kernel to upstream/v4.19.127
Signed-off-by: Lakitu Kernel Bot <cloud-image-merge-automation@prod.google.com>
Change-Id: I582afd906bcb6aa0b28e4e51c85ec20ac3317485
diff --git a/.gitignore b/.gitignore
index 97ba6b7..98e745c 100644
--- a/.gitignore
+++ b/.gitignore
@@ -94,6 +94,9 @@
include/ksym
arch/*/include/generated
+# kernelconfig build directory
+/build/
+
# stgit generated dirs
patches-*
diff --git a/Documentation/ABI/testing/sysfs-kernel-slab b/Documentation/ABI/testing/sysfs-kernel-slab
index 29601d9..d742c6c 100644
--- a/Documentation/ABI/testing/sysfs-kernel-slab
+++ b/Documentation/ABI/testing/sysfs-kernel-slab
@@ -106,6 +106,15 @@
are from ZONE_DMA.
Available when CONFIG_ZONE_DMA is enabled.
+What: /sys/kernel/slab/cache/cache_dma32
+Date: December 2018
+KernelVersion: 4.21
+Contact: Nicolas Boichat <drinkcat@chromium.org>
+Description:
+ The cache_dma32 file is read-only and specifies whether objects
+ are from ZONE_DMA32.
+ Available when CONFIG_ZONE_DMA32 is enabled.
+
What: /sys/kernel/slab/cache/cpu_slabs
Date: May 2007
KernelVersion: 2.6.22
diff --git a/Documentation/ABI/testing/sysfs-kernel-wakeup_reasons b/Documentation/ABI/testing/sysfs-kernel-wakeup_reasons
new file mode 100644
index 0000000..acb19b9
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-wakeup_reasons
@@ -0,0 +1,16 @@
+What: /sys/kernel/wakeup_reasons/last_resume_reason
+Date: February 2014
+Contact: Ruchi Kandoi <kandoiruchi@google.com>
+Description:
+ The /sys/kernel/wakeup_reasons/last_resume_reason is
+		used to report wakeup reasons after the system exits suspend.
+
+What: /sys/kernel/wakeup_reasons/last_suspend_time
+Date: March 2015
+Contact: jinqian <jinqian@google.com>
+Description:
+ The /sys/kernel/wakeup_reasons/last_suspend_time is
+		used to report the time spent in the last suspend cycle. It contains
+		two numbers (in seconds) separated by a space. The first number is
+		the time spent in the suspend and resume processes. The second
+		number is the time spent in the sleep state.
\ No newline at end of file
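(Editorial sketch, not part of the patch.) As a rough illustration of how
last_suspend_time might be consumed, relying only on the two-number format
documented above:

    #include <stdio.h>

    int main(void)
    {
        double suspend_resume = 0.0, asleep = 0.0;
        FILE *f = fopen("/sys/kernel/wakeup_reasons/last_suspend_time", "r");

        if (!f)
            return 1;
        /* First number: time in the suspend/resume paths; second: time asleep. */
        if (fscanf(f, "%lf %lf", &suspend_resume, &asleep) == 2)
            printf("suspend/resume: %.3fs, asleep: %.3fs\n",
                   suspend_resume, asleep);
        fclose(f);
        return 0;
    }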
diff --git a/Documentation/admin-guide/LSM/LoadPin.rst b/Documentation/admin-guide/LSM/LoadPin.rst
index 3207076..716ad9b 100644
--- a/Documentation/admin-guide/LSM/LoadPin.rst
+++ b/Documentation/admin-guide/LSM/LoadPin.rst
@@ -19,3 +19,13 @@
created to toggle pinning: ``/proc/sys/kernel/loadpin/enabled``. (Having
a mutable filesystem means pinning is mutable too, but having the
sysctl allows for easy testing on systems with a mutable filesystem.)
+
+It's also possible to exclude specific file types from LoadPin using the kernel
+command line option "``loadpin.exclude``". By default, all files are
+included, but they can be excluded using a kernel command line option such
+as "``loadpin.exclude=kernel-module,kexec-image``". This allows mechanisms
+such as ``CONFIG_MODULE_SIG`` and ``CONFIG_KEXEC_VERIFY_SIG`` to be used to
+verify kernel modules and the kernel image while still using LoadPin to
+protect the integrity of other files the kernel loads. The
+full list of valid file types can be found in ``kernel_read_file_str``
+defined in ``include/linux/fs.h``.
diff --git a/Documentation/block/bfq-iosched.txt b/Documentation/block/bfq-iosched.txt
index 8d8d8f0..1a0f2ac0 100644
--- a/Documentation/block/bfq-iosched.txt
+++ b/Documentation/block/bfq-iosched.txt
@@ -20,13 +20,26 @@
details on how to configure BFQ for the desired tradeoff between
latency and throughput, or on how to maximize throughput.
-BFQ has a non-null overhead, which limits the maximum IOPS that a CPU
-can process for a device scheduled with BFQ. To give an idea of the
-limits on slow or average CPUs, here are, first, the limits of BFQ for
-three different CPUs, on, respectively, an average laptop, an old
-desktop, and a cheap embedded system, in case full hierarchical
-support is enabled (i.e., CONFIG_BFQ_GROUP_IOSCHED is set), but
-CONFIG_DEBUG_BLK_CGROUP is not set (Section 4-2):
+As every I/O scheduler, BFQ adds some overhead to per-I/O-request
+processing. To give an idea of this overhead, the total,
+single-lock-protected, per-request processing time of BFQ---i.e., the
+sum of the execution times of the request insertion, dispatch and
+completion hooks---is, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
+(dated CPU for notebooks; time measured with simple code
+instrumentation, and using the throughput-sync.sh script of the S
+suite [3], in performance-profiling mode). To put this result into
+context, the total, single-lock-protected, per-request execution time
+of the lightest I/O scheduler available in blk-mq, mq-deadline, is 0.7
+us (mq-deadline is ~800 LOC, against ~10500 LOC for BFQ).
+
+Scheduling overhead further limits the maximum IOPS that a CPU can
+process (already limited by the execution of the rest of the I/O
+stack). To give an idea of the limits with BFQ, on slow or average
+CPUs, here are, first, the limits of BFQ for three different CPUs, on,
+respectively, an average laptop, an old desktop, and a cheap embedded
+system, in case full hierarchical support is enabled (i.e.,
+CONFIG_BFQ_GROUP_IOSCHED is set), but CONFIG_DEBUG_BLK_CGROUP is not
+set (Section 4-2):
- Intel i7-4850HQ: 400 KIOPS
- AMD A8-3850: 250 KIOPS
- ARM CortexTM-A53 Octa-core: 80 KIOPS
@@ -357,6 +370,13 @@
than maximum throughput. In these cases, consider setting the
strict_guarantees parameter.
+slice_idle_us
+-------------
+
+Controls the same tuning parameter as slice_idle, but in microseconds.
+Either tunable can be used to set idling behavior. Afterwards, the
+other tunable will reflect the newly set value in sysfs.
+
strict_guarantees
-----------------
@@ -559,3 +579,5 @@
Slightly extended version:
http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
results.pdf
+
+[3] https://github.com/Algodev-github/S
diff --git a/Documentation/dev-tools/gcov.rst b/Documentation/dev-tools/gcov.rst
index 69a7d90..46aae52 100644
--- a/Documentation/dev-tools/gcov.rst
+++ b/Documentation/dev-tools/gcov.rst
@@ -34,10 +34,6 @@
CONFIG_DEBUG_FS=y
CONFIG_GCOV_KERNEL=y
-select the gcc's gcov format, default is autodetect based on gcc version::
-
- CONFIG_GCOV_FORMAT_AUTODETECT=y
-
and to get coverage data for the entire kernel::
CONFIG_GCOV_PROFILE_ALL=y
@@ -169,6 +165,20 @@
[user@build] gcov -o /tmp/coverage/tmp/out/init main.c
+Note on compilers
+-----------------
+
+GCC and LLVM gcov tools are not necessarily compatible. Use gcov_ to work with
+GCC-generated .gcno and .gcda files, and use llvm-cov_ for Clang.
+
+.. _gcov: http://gcc.gnu.org/onlinedocs/gcc/Gcov.html
+.. _llvm-cov: https://llvm.org/docs/CommandGuide/llvm-cov.html
+
+Build differences between GCC and Clang gcov are handled by Kconfig. It
+automatically selects the appropriate gcov format depending on the detected
+toolchain.
+
+
Troubleshooting
---------------
diff --git a/Documentation/device-mapper/dm-init.txt b/Documentation/device-mapper/dm-init.txt
new file mode 100644
index 0000000..8464ee7
--- /dev/null
+++ b/Documentation/device-mapper/dm-init.txt
@@ -0,0 +1,114 @@
+Early creation of mapped devices
+====================================
+
+It is possible to configure a device-mapper device to act as the root device for
+your system in two ways.
+
+The first is to build an initial ramdisk which boots to a minimal userspace
+which configures the device, then pivot_root(8) in to it.
+
+The second is to create one or more device-mappers using the module parameter
+"dm-mod.create=" through the kernel boot command line argument.
+
+The format is specified as a string of data separated by commas and optionally
+semi-colons, where:
+ - a comma is used to separate fields like name, uuid, flags and table
+ (specifies one device)
+ - a semi-colon is used to separate devices.
+
+So the format will look like this:
+
+ dm-mod.create=<name>,<uuid>,<minor>,<flags>,<table>[,<table>+][;<name>,<uuid>,<minor>,<flags>,<table>[,<table>+]+]
+
+Where,
+ <name> ::= The device name.
+ <uuid> ::= xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx | ""
+ <minor> ::= The device minor number | ""
+ <flags> ::= "ro" | "rw"
+ <table> ::= <start_sector> <num_sectors> <target_type> <target_args>
+ <target_type> ::= "verity" | "linear" | ... (see list below)
+
+The dm line should be equivalent to the one used by the dmsetup tool with the
+--concise argument.
+
+Target types
+============
+
+Not all target types are available as there are serious risks in allowing
+activation of certain DM targets without first using userspace tools to check
+the validity of associated metadata.
+
+ "cache": constrained, userspace should verify cache device
+ "crypt": allowed
+ "delay": allowed
+ "era": constrained, userspace should verify metadata device
+ "flakey": constrained, meant for test
+ "linear": allowed
+ "log-writes": constrained, userspace should verify metadata device
+ "mirror": constrained, userspace should verify main/mirror device
+ "raid": constrained, userspace should verify metadata device
+ "snapshot": constrained, userspace should verify src/dst device
+ "snapshot-origin": allowed
+ "snapshot-merge": constrained, userspace should verify src/dst device
+ "striped": allowed
+ "switch": constrained, userspace should verify dev path
+ "thin": constrained, requires dm target message from userspace
+ "thin-pool": constrained, requires dm target message from userspace
+ "verity": allowed
+ "writecache": constrained, userspace should verify cache device
+ "zero": constrained, not meant for rootfs
+
+If the target is not listed above, it is constrained by default (not tested).
+
+Examples
+========
+An example of booting to a linear array made up of user-mode linux block
+devices:
+
+ dm-mod.create="lroot,,,rw, 0 4096 linear 98:16 0, 4096 4096 linear 98:32 0" root=/dev/dm-0
+
+This will boot to a rw dm-linear target of 8192 sectors split across two block
+devices identified by their major:minor numbers. After boot, udev will rename
+this target to /dev/mapper/lroot (depending on the rules). No uuid was assigned.
+
+An example of multiple device-mappers, with the dm-mod.create="..." contents shown
+here split across multiple lines for readability:
+
+ vroot,,,ro,
+ 0 1740800 verity 254:0 254:0 1740800 sha1
+ 76e9be054b15884a9fa85973e9cb274c93afadb6
+ 5b3549d54d6c7a3837b9b81ed72e49463a64c03680c47835bef94d768e5646fe;
+ vram,,,rw,
+ 0 32768 linear 1:0 0,
+ 32768 32768 linear 1:1 0
+
+Other examples (per target):
+
+"crypt":
+ dm-crypt,,8,ro,
+ 0 1048576 crypt aes-xts-plain64
+ babebabebabebabebabebabebabebabebabebabebabebabebabebabebabebabe 0
+ /dev/sda 0 1 allow_discards
+
+"delay":
+ dm-delay,,4,ro,0 409600 delay /dev/sda1 0 500
+
+"linear":
+ dm-linear,,,rw,
+ 0 32768 linear /dev/sda1 0,
+ 32768 1024000 linear /dev/sda2 0,
+ 1056768 204800 linear /dev/sda3 0,
+ 1261568 512000 linear /dev/sda4 0
+
+"snapshot-origin":
+ dm-snap-orig,,4,ro,0 409600 snapshot-origin 8:2
+
+"striped":
+ dm-striped,,4,ro,0 1638400 striped 4 4096
+ /dev/sda1 0 /dev/sda2 0 /dev/sda3 0 /dev/sda4 0
+
+"verity":
+ dm-verity,,4,ro,
+ 0 1638400 verity 1 8:1 8:2 4096 4096 204800 1 sha256
+ fb1a5a0f00deb908d8b53cb270858975e76cf64105d412ce764225d53b8f3cfd
+ 51934789604d1b92399c52e7cb149d1b3a1b74bbbcb103b2a0aaacbed5c08584
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 0d0ecc7..94ce383 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -398,6 +398,8 @@
[stack] = the stack of the main process
[vdso] = the "virtual dynamic shared object",
the kernel system call handler
+ [anon:<name>] = an anonymous mapping that has been
+ named by userspace
or if empty, the mapping is anonymous.
@@ -427,6 +429,7 @@
Locked: 0 kB
THPeligible: 0
VmFlags: rd ex mr mw me dw
+Name: name from userspace
the first of these lines shows the same information as is displayed for the
mapping in /proc/PID/maps. The remaining lines show the size of the mapping
@@ -503,6 +506,9 @@
might change in future as well. So each consumer of these flags has to
follow each specific kernel version for the exact semantic.
+The "Name" field will only be present on a mapping that has been named by
+userspace, and will show the name passed in by userspace.
+
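(Editorial sketch, not part of the patch.) The text above documents the
[anon:<name>] output but not the userspace side. A common interface in
Android-derived kernels (assumed here; it is not shown in this diff) is
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ...):

    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>

    /* Assumed constants for the PR_SET_VMA naming interface; they are not
     * present in older libc headers. */
    #ifndef PR_SET_VMA
    #define PR_SET_VMA           0x53564d41
    #define PR_SET_VMA_ANON_NAME 0
    #endif

    int main(void)
    {
        size_t len = 4096;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
            return 1;
        /* If supported, /proc/self/maps now shows [anon:my-buffer] for this VMA. */
        if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                  (unsigned long)p, len, "my-buffer"))
            perror("prctl(PR_SET_VMA)");
        return 0;
    }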
This file is only present if the CONFIG_MMU kernel configuration option is
enabled.
diff --git a/Documentation/lzo.txt b/Documentation/lzo.txt
index 6fa6a93..f799342 100644
--- a/Documentation/lzo.txt
+++ b/Documentation/lzo.txt
@@ -78,16 +78,34 @@
is an implementation design choice independent on the algorithm or
encoding.
+Versions
+
+0: Original version
+1: LZO-RLE
+
+Version 1 of LZO implements an extension to encode runs of zeros using run
+length encoding. This improves speed for data with many zeros, which is a
+common case for zram. This modifies the bitstream in a backwards compatible way
+(v1 can correctly decompress v0 compressed data, but v0 cannot read v1 data).
+
+For maximum compatibility, both versions are available under different names
+(lzo and lzo-rle). Differences in the encoding are noted in this document with
+e.g.: version 1 only.
+
Byte sequences
==============
First byte encoding::
- 0..17 : follow regular instruction encoding, see below. It is worth
- noting that codes 16 and 17 will represent a block copy from
- the dictionary which is empty, and that they will always be
+ 0..16 : follow regular instruction encoding, see below. It is worth
+ noting that code 16 will represent a block copy from the
+ dictionary which is empty, and that it will always be
invalid at this place.
+ 17 : bitstream version. If the first byte is 17, the next byte
+ gives the bitstream version (version 1 only). If the first byte
+ is not 17, the bitstream version is 0.
+
18..21 : copy 0..3 literals
state = (byte - 17) = 0..3 [ copy <state> literals ]
skip byte
@@ -140,6 +158,11 @@
state = S (copy S literals after this block)
End of stream is reached if distance == 16384
+ In version 1 only, this instruction is also used to encode a run of
+ zeros if distance = 0xbfff, i.e. H = 1 and the D bits are all 1.
+ In this case, it is followed by a fourth byte, X.
+ run length = ((X << 3) | (0 0 0 0 0 L L L)) + 4.
+
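(Editorial sketch, not part of the patch.) The run-length arithmetic above can
be written out as a small helper, where L is the low three bits of the first
instruction byte and X is the extra fourth byte:

    #include <stdio.h>

    /* LZO-RLE (version 1) run-of-zeros length: ((X << 3) | L) + 4 */
    static unsigned int lzo_rle_run_length(unsigned char first_byte, unsigned char x)
    {
        unsigned int l = first_byte & 0x7;

        return (((unsigned int)x << 3) | l) + 4;
    }

    int main(void)
    {
        /* e.g. first byte 0x18 (H = 1, L = 0) and X = 1 give a run of 12 zeros */
        printf("%u\n", lzo_rle_run_length(0x18, 1));
        return 0;
    }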
0 0 1 L L L L L (32..63)
Copy of small block within 16kB distance (preferably less than 34B)
length = 2 + (L ?: 31 + (zero_bytes * 255) + non_zero_byte)
@@ -165,7 +188,9 @@
=======
This document was written by Willy Tarreau <w@1wt.eu> on 2014/07/19 during an
- analysis of the decompression code available in Linux 3.16-rc5. The code is
- tricky, it is possible that this document contains mistakes or that a few
- corner cases were overlooked. In any case, please report any doubt, fix, or
- proposed updates to the author(s) so that the document can be updated.
+ analysis of the decompression code available in Linux 3.16-rc5, and updated
+ by Dave Rodgman <dave.rodgman@arm.com> on 2018/10/30 to introduce run-length
+ encoding. The code is tricky, it is possible that this document contains
+ mistakes or that a few corner cases were overlooked. In any case, please
+ report any doubt, fix, or proposed updates to the author(s) so that the
+ document can be updated.
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 7eb9366..7294284 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -639,6 +639,16 @@
0 to disable the blackhole detection.
By default, it is set to 1hr.
+tcp_fwmark_accept - BOOLEAN
+ If set, incoming connections to listening sockets that do not have a
+ socket mark will set the mark of the accepting socket to the fwmark of
+ the incoming SYN packet. This will cause all packets on that connection
+ (starting from the first SYNACK) to be sent with that fwmark. The
+ listening socket's mark is unchanged. Listening sockets that already
+ have a fwmark set via setsockopt(SOL_SOCKET, SO_MARK, ...) are
+ unaffected.
+ Default: 0
+
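(Editorial sketch, not part of the patch.) The setsockopt() call referred to
above looks like this; a listening socket marked this way keeps its own mark,
while an unmarked one inherits the SYN's fwmark when tcp_fwmark_accept is set.
Setting SO_MARK normally requires CAP_NET_ADMIN.

    #include <stdio.h>
    #include <sys/socket.h>

    #ifndef SO_MARK
    #define SO_MARK 36   /* fallback: value from asm-generic/socket.h */
    #endif

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        unsigned int mark = 42;   /* arbitrary example mark */

        if (fd < 0)
            return 1;
        /* Explicitly marked listening sockets are unaffected by tcp_fwmark_accept. */
        if (setsockopt(fd, SOL_SOCKET, SO_MARK, &mark, sizeof(mark)))
            perror("setsockopt(SO_MARK)");
        return 0;
    }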
tcp_syn_retries - INTEGER
Number of times initial SYNs for an active TCP connection attempt
will be retransmitted. Should not be higher than 127. Default value
diff --git a/Documentation/scheduler/sched-tune.txt b/Documentation/scheduler/sched-tune.txt
new file mode 100644
index 0000000..1a10371
--- /dev/null
+++ b/Documentation/scheduler/sched-tune.txt
@@ -0,0 +1,388 @@
+ Central, scheduler-driven, power-performance control
+ (EXPERIMENTAL)
+
+Abstract
+========
+
+The topic of a single simple power-performance tunable, that is wholly
+scheduler centric, and has well defined and predictable properties has come up
+on several occasions in the past [1,2]. With techniques such as a scheduler
+driven DVFS [3], we now have a good framework for implementing such a tunable.
+This document describes the overall ideas behind its design and implementation.
+
+
+Table of Contents
+=================
+
+1. Motivation
+2. Introduction
+3. Signal Boosting Strategy
+4. OPP selection using boosted CPU utilization
+5. Per task group boosting
+6. Per-task wakeup-placement-strategy Selection
+7. Question and Answers
+ - What about "auto" mode?
+ - What about boosting on a congested system?
+   - How are CPUs boosted when we have tasks with multiple boost values?
+8. References
+
+
+1. Motivation
+=============
+
+Schedutil [3] is a utilization-driven cpufreq governor which allows the
+scheduler to select the optimal DVFS operating point (OPP) for running a task
+allocated to a CPU.
+
+However, sometimes it may be desired to intentionally boost the performance of
+a workload even if that could imply a reasonable increase in energy
+consumption. For example, in order to reduce the response time of a task, we
+may want to run the task at a higher OPP than the one that is actually required
+by its CPU bandwidth demand.
+
+This last requirement is especially important if we consider that one of the
+main goals of the utilization-driven governor component is to replace all
+currently available CPUFreq policies. Since schedutil is event-based, as
+opposed to the sampling-driven governors we currently have, it is already
+more responsive at selecting the optimal OPP to run tasks allocated to a CPU.
+However, just tracking the actual task utilization may not be enough from a
+performance standpoint. For example, it is not possible to get behaviors
+similar to those provided by the "performance" and "interactive" CPUFreq
+governors.
+
+This document describes an implementation of a tunable, stacked on top of the
+utilization-driven governor which extends its functionality to support task
+performance boosting.
+
+By "performance boosting" we mean the reduction of the time required to
+complete a task activation, i.e. the time elapsed from a task wakeup to its
+next deactivation (e.g. because it goes back to sleep or it terminates). For
+example, if we consider a simple periodic task which executes the same workload
+for 5[s] every 20[s] while running at a certain OPP, a boosted execution of
+that task must complete each of its activations in less than 5[s].
+
+The rest of this document introduces in more details the proposed solution
+which has been named SchedTune.
+
+
+2. Introduction
+===============
+
+SchedTune exposes a simple user-space interface provided through a new
+CGroup controller 'stune' which provides two power-performance tunables
+per group:
+
+ /<stune cgroup mount point>/schedtune.prefer_idle
+ /<stune cgroup mount point>/schedtune.boost
+
+The CGroup implementation permits arbitrary user-space defined task
+classification to tune the scheduler for different goals depending on the
+specific nature of the task, e.g. background vs interactive vs low-priority.
+
+More details are given in section 5.
+
+2.1 Boosting
+============
+
+The boost value is expressed as an integer in the range [0..100].
+
+A value of 0 (default) configures the CFS scheduler for maximum energy
+efficiency. This means that schedutil runs the tasks at the minimum OPP
+required to satisfy their workload demand.
+
+A value of 100 configures scheduler for maximum performance, which translates
+to the selection of the maximum OPP on that CPU.
+
+Values between 0 and 100 can be used to suit other scenarios, for example to
+improve interactive response or to react to other system events
+(battery level, etc.).
+
+The overall design of the SchedTune module is built on top of "Per-Entity Load
+Tracking" (PELT) signals and schedutil by introducing a bias on the OPP
+selection.
+
+Each time a task is allocated on a CPU, cpufreq is given the opportunity to tune
+the operating frequency of that CPU to better match the workload demand. The
+selection of the actual OPP being activated is influenced by the boost value
+for the task CGroup.
+
+This simple biasing approach leverages existing frameworks, which means minimal
+modifications to the scheduler, and yet it allows a range of different
+behaviours to be achieved, all from a single simple tunable knob.
+
+In EAS schedulers, we use boosted task and CPU utilization for energy
+calculation and energy-aware task placement.
+
+2.2 prefer_idle
+===============
+
+This is a flag which indicates to the scheduler that userspace would like
+the scheduler to focus on energy or to focus on performance.
+
+A value of 0 (default) signals to the CFS scheduler that tasks in this group
+can be placed according to the energy-aware wakeup strategy.
+
+A value of 1 signals to the CFS scheduler that tasks in this group should be
+placed to minimise wakeup latency.
+
+Android platforms typically use this flag for application tasks which the
+user is currently interacting with.
+
+
+3. Signal Boosting Strategy
+===========================
+
+The whole PELT machinery works based on the value of a few load tracking signals
+which basically track the CPU bandwidth requirements for tasks and the capacity
+of CPUs. The basic idea behind the SchedTune knob is to artificially inflate
+some of these load tracking signals to make a task or RQ appear more demanding
+than it actually is.
+
+Which signals have to be inflated depends on the specific "consumer". However,
+independently from the specific (signal, consumer) pair, it is important to
+define a simple and possibly consistent strategy for the concept of boosting a
+signal.
+
+A boosting strategy defines how the "abstract" user-space defined
+sched_cfs_boost value is translated into an internal "margin" value to be added
+to a signal to get its inflated value:
+
+ margin := boosting_strategy(sched_cfs_boost, signal)
+ boosted_signal := signal + margin
+
+The boosting strategy currently implemented in SchedTune is called 'Signal
+Proportional Compensation' (SPC). With SPC, the sched_cfs_boost value is used to
+compute a margin which is proportional to the complement of the original signal.
+When a signal has a maximum possible value, its complement is defined as
+the delta between the actual value and its possible maximum.
+
+Since the tunable implementation uses signals which have SCHED_CAPACITY_SCALE as
+the maximum possible value, the margin becomes:
+
+ margin := sched_cfs_boost * (SCHED_CAPACITY_SCALE - signal)
+
+Using this boosting strategy:
+- a 100% sched_cfs_boost means that the signal is scaled to the maximum value
+- each value in the range of sched_cfs_boost effectively inflates the signal in
+ question by a quantity which is proportional to the maximum value.
+
+For example, by applying the SPC boosting strategy to the selection of the OPP
+to run a task it is possible to achieve these behaviors:
+
+- 0% boosting: run the task at the minimum OPP required by its workload
+- 100% boosting: run the task at the maximum OPP available for the CPU
+- 50% boosting: run at the half-way OPP between minimum and maximum
+
+Which means that, at 50% boosting, a task will be scheduled to run at half of
+the maximum theoretically achievable performance on the specific target
+platform.
+
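(Editorial sketch, not part of the patch.) The SPC arithmetic above, with the
0..100 boost treated as a percentage and SCHED_CAPACITY_SCALE assumed to be
1024 (the scale used by the scheduler):

    #include <stdio.h>

    #define SCHED_CAPACITY_SCALE 1024UL

    static unsigned long spc_boost(unsigned long signal, unsigned long boost_pct)
    {
        unsigned long margin = boost_pct * (SCHED_CAPACITY_SCALE - signal) / 100;

        return signal + margin;
    }

    int main(void)
    {
        /* A 50% boost of a half-utilized signal lands half-way to the maximum. */
        printf("%lu %lu\n", spc_boost(512, 50), spc_boost(512, 100));
        return 0;   /* prints: 768 1024 */
    }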
+A graphical representation of an SPC boosted signal is represented in the
+following figure where:
+ a) "-" represents the original signal
+ b) "b" represents a 50% boosted signal
+ c) "p" represents a 100% boosted signal
+
+
+ ^
+ | SCHED_CAPACITY_SCALE
+ +-----------------------------------------------------------------+
+ |pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
+ |
+ | boosted_signal
+ | bbbbbbbbbbbbbbbbbbbbbbbb
+ |
+ | original signal
+ | bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+
+ | |
+ |bbbbbbbbbbbbbbbbbb |
+ | |
+ | |
+ | |
+ | +-----------------------+
+ | |
+ | |
+ | |
+ |------------------+
+ |
+ |
+ +----------------------------------------------------------------------->
+
+The plot above shows a ramped load signal (titled 'original_signal') and its
+boosted equivalent. For each step of the original signal the boosted signal
+corresponding to a 50% boost is midway between the original signal and the upper
+bound. Boosting by 100% generates a boosted signal which is always saturated to
+the upper bound.
+
+
+4. OPP selection using boosted CPU utilization
+==============================================
+
+It is worth calling out that the implementation does not introduce any new load
+signals. Instead, it provides an API to tune existing signals. This tuning is
+done on demand and only in scheduler code paths where it is sensible to do so.
+The new API calls are defined to return either the default signal or a boosted
+one, depending on the value of sched_cfs_boost. This is a clean and non-invasive
+modification of the existing code paths.
+
+The signal representing a CPU's utilization is boosted according to the
+previously described SPC boosting strategy. To schedutil, this allows a CPU
+(i.e. a CFS run-queue) to appear more used than it actually is.
+
+Thus, with the sched_cfs_boost enabled we have the following main functions to
+get the current utilization of a CPU:
+
+ cpu_util()
+ boosted_cpu_util()
+
+The new boosted_cpu_util() is similar to the first but returns a boosted
+utilization signal which is a function of the sched_cfs_boost value.
+
+This function is used in the CFS scheduler code paths where schedutil needs to
+decide the OPP to run a CPU at. For example, this allows selecting the highest
+OPP for a CPU which has the boost value set to 100%.
+
+
+5. Per task group boosting
+==========================
+
+On battery powered devices there usually are many background services which are
+long running and need energy efficient scheduling. On the other hand, some
+applications are more performance sensitive and require an interactive
+response and/or maximum performance, regardless of the energy cost.
+
+To better service such scenarios, the SchedTune implementation has an extension
+that provides a more fine grained boosting interface.
+
+A new CGroup controller, namely "schedtune", can be enabled which allows
+task groups with different boosting values to be defined and configured.
+Tasks that require special performance can be put into separate CGroups.
+The value of the boost associated with the tasks in this group can be specified
+using a single knob exposed by the CGroup controller:
+
+ schedtune.boost
+
+This knob allows the definition of a boost value that is to be used for
+SPC boosting of all tasks attached to this group.
+
+The current schedtune controller implementation is really simple and has these
+main characteristics:
+
+ 1) It is only possible to create 1 level depth hierarchies
+
+ The root control groups define the system-wide boost value to be applied
+ by default to all tasks. Its direct subgroups are named "boost groups" and
+   they define the boost value for a specific set of tasks.
+ Further nested subgroups are not allowed since they do not have a sensible
+ meaning from a user-space standpoint.
+
+ 2) It is possible to define only a limited number of "boost groups"
+
+ This number is defined at compile time and by default configured to 16.
+ This is a design decision motivated by two main reasons:
+ a) In a real system we do not expect utilization scenarios with more than
+ a few boost groups. For example, a reasonable collection of groups could
+ be just "background", "interactive" and "performance".
+ b) It simplifies the implementation considerably, especially for the code
+ which has to compute the per CPU boosting once there are multiple
+ RUNNABLE tasks with different boost values.
+
+Such a simple design should allow servicing the main utilization scenarios
+identified so far. It provides a simple interface which can be used to manage
+the power-performance of all tasks or only selected tasks.
+Moreover, this interface can be easily integrated by user-space run-times (e.g.
+Android, ChromeOS) to implement a QoS solution for task boosting based on task
+classification, which has been a long-standing requirement.
+
+Setup and usage
+---------------
+
+0. Use a kernel with CONFIG_SCHED_TUNE support enabled
+
+1. Check that the "schedtune" CGroup controller is available:
+
+ root@linaro-nano:~# cat /proc/cgroups
+ #subsys_name hierarchy num_cgroups enabled
+ cpuset 0 1 1
+ cpu 0 1 1
+ schedtune 0 1 1
+
+2. Mount a tmpfs to create the CGroups mount point (Optional)
+
+ root@linaro-nano:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup
+
+3. Mount the "schedtune" controller
+
+ root@linaro-nano:~# mkdir /sys/fs/cgroup/stune
+ root@linaro-nano:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune
+
+4. Create task groups and configure their specific boost value (Optional)
+
+   For example, here we create a "performance" boost group configured to boost
+   all its tasks to 100%:
+
+ root@linaro-nano:~# mkdir /sys/fs/cgroup/stune/performance
+ root@linaro-nano:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost
+
+5. Move tasks into the boost group
+
+ For example, the following moves the tasks with PID $TASKPID (and all its
+ threads) into the "performance" boost group.
+
+   root@linaro-nano:~# echo $TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs
+
+This simple configuration allows only the threads of the $TASKPID task to run,
+when needed, at the highest OPP in the most capable CPU of the system.
+
+
+6. Per-task wakeup-placement-strategy Selection
+===============================================
+
+Many devices have a number of CFS tasks in use which require an absolute
+minimum wakeup latency, and many tasks for which wakeup latency is not
+important.
+
+For touch-driven environments, removing additional wakeup latency can be
+critical.
+
+When you use the SchedTune CGroup controller, you have access to a second
+parameter which allows a group to be marked such that energy_aware task
+placement is bypassed for tasks belonging to that group.
+
+prefer_idle=0 (default - use energy-aware task placement if available)
+prefer_idle=1 (never use energy-aware task placement for these tasks)
+
+Since the regular wakeup task placement algorithm in CFS is biased for
+performance, this has the effect of restoring minimum wakeup latency
+for the desired tasks whilst still allowing energy-aware wakeup placement
+to save energy for other tasks.
+
+
+7. Question and Answers
+=======================
+
+What about "auto" mode?
+-----------------------
+
+The 'auto' mode as described in [5] can be implemented by interfacing SchedTune
+with some suitable user-space element. This element could use the exposed
+system-wide or cgroup based interface.
+
+How are multiple groups of tasks with different boost values managed?
+---------------------------------------------------------------------
+
+The current SchedTune implementation keeps track of the boosted RUNNABLE tasks
+on a CPU. The CPU utilization seen by schedutil (and used to select an
+appropriate OPP) is boosted with a value which is the maximum of the boost
+values of the currently RUNNABLE tasks in its RQ.
+
+This allows cpufreq to boost a CPU only while there are boosted tasks ready
+to run and switch back to the energy efficient mode as soon as the last boosted
+task is dequeued.
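(Editorial sketch, not part of the patch.) The per-CPU aggregation described
above is a simple maximum over the currently RUNNABLE tasks:

    /* Illustrative only: CPU-level boost = max boost among runnable tasks. */
    static unsigned int cpu_boost(const unsigned int *task_boosts, int nr_runnable)
    {
        unsigned int max = 0;
        int i;

        for (i = 0; i < nr_runnable; i++)
            if (task_boosts[i] > max)
                max = task_boosts[i];
        return max;   /* 0 when no boosted task is runnable */
    }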
+
+
+8. References
+=============
+[1] http://lwn.net/Articles/552889
+[2] http://lkml.org/lkml/2012/5/18/91
+[3] https://lkml.org/lkml/2016/3/29/1041
diff --git a/Makefile b/Makefile
index a93e38c..52e7149 100644
--- a/Makefile
+++ b/Makefile
@@ -391,7 +391,7 @@
CHECKFLAGS := -D__linux__ -Dlinux -D__STDC__ -Dunix -D__unix__ \
-Wbitwise -Wno-return-void -Wno-unknown-attribute $(CF)
-NOSTDINC_FLAGS =
+NOSTDINC_FLAGS :=
CFLAGS_MODULE =
AFLAGS_MODULE =
LDFLAGS_MODULE =
@@ -481,7 +481,7 @@
$(srctree) $(objtree) $(VERSION) $(PATCHLEVEL)
endif
-ifeq ($(cc-name),clang)
+ifneq ($(shell $(CC) --version 2>&1 | head -n 1 | grep clang),)
ifneq ($(CROSS_COMPILE),)
CLANG_FLAGS += --target=$(notdir $(CROSS_COMPILE:%-=%))
GCC_TOOLCHAIN_DIR := $(dir $(shell which $(CROSS_COMPILE)elfedit))
@@ -507,9 +507,6 @@
export RETPOLINE_CFLAGS
export RETPOLINE_VDSO_CFLAGS
-KBUILD_CFLAGS += $(call cc-option,-fno-PIE)
-KBUILD_AFLAGS += $(call cc-option,-fno-PIE)
-
# The expansion should be delayed until arch/$(SRCARCH)/Makefile is included.
# Some architectures define CROSS_COMPILE in arch/$(SRCARCH)/Makefile.
# CC_VERSION_TEXT is referenced from Kconfig (so it needs export),
@@ -596,6 +593,8 @@
# Defaults to vmlinux, but the arch makefile usually adds further targets
all: vmlinux
+KBUILD_CFLAGS += $(call cc-option,-fno-PIE)
+KBUILD_AFLAGS += $(call cc-option,-fno-PIE)
CFLAGS_GCOV := -fprofile-arcs -ftest-coverage \
$(call cc-option,-fno-tree-loop-im) \
$(call cc-disable-warning,maybe-uninitialized,)
@@ -666,6 +665,12 @@
KBUILD_CFLAGS += $(call cc-option,--param=allow-store-data-races=0)
KBUILD_CFLAGS += $(call cc-option,-fno-allow-store-data-races)
+# check for 'asm goto'
+ifeq ($(shell $(CONFIG_SHELL) $(srctree)/scripts/gcc-goto.sh $(CC) $(KBUILD_CFLAGS)), y)
+ KBUILD_CFLAGS += -DCC_HAVE_ASM_GOTO
+ KBUILD_AFLAGS += -DCC_HAVE_ASM_GOTO
+endif
+
include scripts/Makefile.kcov
include scripts/Makefile.gcc-plugins
@@ -689,17 +694,19 @@
KBUILD_CFLAGS += $(stackp-flags-y)
-ifeq ($(cc-name),clang)
-KBUILD_CPPFLAGS += $(call cc-option,-Qunused-arguments,)
-KBUILD_CFLAGS += $(call cc-disable-warning, format-invalid-specifier)
-KBUILD_CFLAGS += $(call cc-disable-warning, gnu)
+ifdef CONFIG_CC_IS_CLANG
+KBUILD_CPPFLAGS += -Qunused-arguments
+KBUILD_CFLAGS += -Wno-format-invalid-specifier
+KBUILD_CFLAGS += -Wno-gnu
+KBUILD_CFLAGS += -Wno-address-of-packed-member
+KBUILD_CFLAGS += -Wno-duplicate-decl-specifier
# Quiet clang warning: comparison of unsigned expression < 0 is always false
-KBUILD_CFLAGS += $(call cc-disable-warning, tautological-compare)
+KBUILD_CFLAGS += -Wno-tautological-compare
+KBUILD_CFLAGS += -Wno-constant-conversion
# CLANG uses a _MergedGlobals as optimization, but this breaks modpost, as the
# source of a reference will be _MergedGlobals and not on of the whitelisted names.
# See modpost pattern 2
-KBUILD_CFLAGS += $(call cc-option, -mno-global-merge,)
-KBUILD_CFLAGS += $(call cc-option, -fcatch-undefined-behavior)
+KBUILD_CFLAGS += -mno-global-merge
else
# These warnings generated too much noise in a regular build.
diff --git a/PRESUBMIT.cfg b/PRESUBMIT.cfg
new file mode 100644
index 0000000..4fb5526
--- /dev/null
+++ b/PRESUBMIT.cfg
@@ -0,0 +1,9 @@
+[Hook Overrides]
+checkpatch_check: true
+aosp_license_check: false
+cros_license_check: false
+long_line_check: false
+signoff_check: true
+stray_whitespace_check: false
+tab_check: false
+tabbed_indent_required_check: false
diff --git a/arch/Kconfig b/arch/Kconfig
index a336548..6801123 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -71,7 +71,6 @@
config JUMP_LABEL
bool "Optimize very unlikely/likely branches"
depends on HAVE_ARCH_JUMP_LABEL
- depends on CC_HAS_ASM_GOTO
help
This option enables a transparent branch optimization that
makes certain almost-always-true or almost-always-false branch
diff --git a/arch/arm/kernel/jump_label.c b/arch/arm/kernel/jump_label.c
index 303b3ab..90bce3d 100644
--- a/arch/arm/kernel/jump_label.c
+++ b/arch/arm/kernel/jump_label.c
@@ -4,6 +4,8 @@
#include <asm/patch.h>
#include <asm/insn.h>
+#ifdef HAVE_JUMP_LABEL
+
static void __arch_jump_label_transform(struct jump_entry *entry,
enum jump_label_type type,
bool is_static)
@@ -33,3 +35,5 @@
{
__arch_jump_label_transform(entry, type, true);
}
+
+#endif
diff --git a/arch/arm64/kernel/jump_label.c b/arch/arm64/kernel/jump_label.c
index b90754a..e075641 100644
--- a/arch/arm64/kernel/jump_label.c
+++ b/arch/arm64/kernel/jump_label.c
@@ -20,6 +20,8 @@
#include <linux/jump_label.h>
#include <asm/insn.h>
+#ifdef HAVE_JUMP_LABEL
+
void arch_jump_label_transform(struct jump_entry *entry,
enum jump_label_type type)
{
@@ -47,3 +49,5 @@
* NOP needs to be replaced by a branch.
*/
}
+
+#endif /* HAVE_JUMP_LABEL */
diff --git a/arch/mips/Makefile b/arch/mips/Makefile
index ad0a92f..39cab0e 100644
--- a/arch/mips/Makefile
+++ b/arch/mips/Makefile
@@ -128,7 +128,7 @@
# clang's output will be based upon the build machine. So for clang we simply
# unconditionally specify -EB or -EL as appropriate.
#
-ifeq ($(cc-name),clang)
+ifdef CONFIG_CC_IS_CLANG
cflags-$(CONFIG_CPU_BIG_ENDIAN) += -EB
cflags-$(CONFIG_CPU_LITTLE_ENDIAN) += -EL
else
diff --git a/arch/mips/kernel/jump_label.c b/arch/mips/kernel/jump_label.c
index ab94392..32e3168 100644
--- a/arch/mips/kernel/jump_label.c
+++ b/arch/mips/kernel/jump_label.c
@@ -16,6 +16,8 @@
#include <asm/cacheflush.h>
#include <asm/inst.h>
+#ifdef HAVE_JUMP_LABEL
+
/*
* Define parameters for the standard MIPS and the microMIPS jump
* instruction encoding respectively:
@@ -68,3 +70,5 @@
mutex_unlock(&text_mutex);
}
+
+#endif /* HAVE_JUMP_LABEL */
diff --git a/arch/mips/vdso/Makefile b/arch/mips/vdso/Makefile
index c99fa1c..8b23bb1 100644
--- a/arch/mips/vdso/Makefile
+++ b/arch/mips/vdso/Makefile
@@ -12,7 +12,7 @@
$(filter -mno-loongson-%,$(KBUILD_CFLAGS)) \
-D__VDSO__
-ifeq ($(cc-name),clang)
+ifdef CONFIG_CC_IS_CLANG
ccflags-vdso += $(filter --target=%,$(KBUILD_CFLAGS))
endif
diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 8954108..55ac368 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -98,7 +98,7 @@
endif
endif
-ifneq ($(cc-name),clang)
+ifndef CONFIG_CC_IS_CLANG
cflags-$(CONFIG_CPU_LITTLE_ENDIAN) += -mno-strict-align
endif
@@ -179,7 +179,7 @@
# Work around gcc code-gen bugs with -pg / -fno-omit-frame-pointer in gcc <= 4.8
# https://gcc.gnu.org/bugzilla/show_bug.cgi?id=44199
# https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52828
-ifneq ($(cc-name),clang)
+ifndef CONFIG_CC_IS_CLANG
CC_FLAGS_FTRACE += $(call cc-ifversion, -lt, 0409, -mno-sched-epilog)
endif
endif
diff --git a/arch/powerpc/include/asm/asm-prototypes.h b/arch/powerpc/include/asm/asm-prototypes.h
index d0609c1..95b2df1 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -38,7 +38,7 @@
void __trace_hcall_entry(unsigned long opcode, unsigned long *args);
void __trace_hcall_exit(long opcode, long retval, unsigned long *retbuf);
/* OPAL tracing */
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
extern struct static_key opal_tracepoint_key;
#endif
diff --git a/arch/powerpc/kernel/jump_label.c b/arch/powerpc/kernel/jump_label.c
index 0080c5f..6472472 100644
--- a/arch/powerpc/kernel/jump_label.c
+++ b/arch/powerpc/kernel/jump_label.c
@@ -11,6 +11,7 @@
#include <linux/jump_label.h>
#include <asm/code-patching.h>
+#ifdef HAVE_JUMP_LABEL
void arch_jump_label_transform(struct jump_entry *entry,
enum jump_label_type type)
{
@@ -21,3 +22,4 @@
else
patch_instruction(addr, PPC_INST_NOP);
}
+#endif
diff --git a/arch/powerpc/platforms/powernv/opal-tracepoints.c b/arch/powerpc/platforms/powernv/opal-tracepoints.c
index f16a435..1ab7d26 100644
--- a/arch/powerpc/platforms/powernv/opal-tracepoints.c
+++ b/arch/powerpc/platforms/powernv/opal-tracepoints.c
@@ -4,7 +4,7 @@
#include <asm/trace.h>
#include <asm/asm-prototypes.h>
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
struct static_key opal_tracepoint_key = STATIC_KEY_INIT;
int opal_tracepoint_regfunc(void)
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index 74215eb..3f98158 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -20,7 +20,7 @@
.section ".text"
#ifdef CONFIG_TRACEPOINTS
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
#define OPAL_BRANCH(LABEL) \
ARCH_STATIC_BRANCH(LABEL, opal_tracepoint_key)
#else
diff --git a/arch/powerpc/platforms/pseries/hvCall.S b/arch/powerpc/platforms/pseries/hvCall.S
index 50dc942..d91412c 100644
--- a/arch/powerpc/platforms/pseries/hvCall.S
+++ b/arch/powerpc/platforms/pseries/hvCall.S
@@ -19,7 +19,7 @@
#ifdef CONFIG_TRACEPOINTS
-#ifndef CONFIG_JUMP_LABEL
+#ifndef HAVE_JUMP_LABEL
.section ".toc","aw"
.globl hcall_tracepoint_refcount
@@ -79,7 +79,7 @@
mr r5,BUFREG; \
__HCALL_INST_POSTCALL
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
#define HCALL_BRANCH(LABEL) \
ARCH_STATIC_BRANCH(LABEL, hcall_tracepoint_key)
#else
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index d660a90..47942de 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -833,7 +833,7 @@
#endif /* CONFIG_PPC_BOOK3S_64 */
#ifdef CONFIG_TRACEPOINTS
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
struct static_key hcall_tracepoint_key = STATIC_KEY_INIT;
int hcall_tracepoint_regfunc(void)
diff --git a/arch/s390/appldata/appldata_mem.c b/arch/s390/appldata/appldata_mem.c
index e68136c..0fc6522 100644
--- a/arch/s390/appldata/appldata_mem.c
+++ b/arch/s390/appldata/appldata_mem.c
@@ -63,6 +63,9 @@
u64 pgalloc; /* page allocations */
u64 pgfault; /* page faults (major+minor) */
u64 pgmajfault; /* page faults (major only) */
+ u64 pgmajfault_s; /* shmem page faults (major only) */
+ u64 pgmajfault_a; /* anonymous page faults (major only) */
+ u64 pgmajfault_f; /* file page faults (major only) */
// <-- New in 2.6
} __packed;
@@ -94,7 +97,11 @@
mem_data->pgalloc = ev[PGALLOC_NORMAL];
mem_data->pgalloc += ev[PGALLOC_DMA];
mem_data->pgfault = ev[PGFAULT];
- mem_data->pgmajfault = ev[PGMAJFAULT];
+ mem_data->pgmajfault =
+ ev[PGMAJFAULT_S] + ev[PGMAJFAULT_A] + ev[PGMAJFAULT_F];
+ mem_data->pgmajfault_s = ev[PGMAJFAULT_S];
+ mem_data->pgmajfault_a = ev[PGMAJFAULT_A];
+ mem_data->pgmajfault_f = ev[PGMAJFAULT_F];
si_meminfo(&val);
mem_data->sharedram = val.sharedram;
diff --git a/arch/s390/kernel/Makefile b/arch/s390/kernel/Makefile
index 762fc45..b524c15 100644
--- a/arch/s390/kernel/Makefile
+++ b/arch/s390/kernel/Makefile
@@ -46,7 +46,7 @@
obj-y := traps.o time.o process.o base.o early.o setup.o idle.o vtime.o
obj-y += processor.o sys_s390.o ptrace.o signal.o cpcmd.o ebcdic.o nmi.o
obj-y += debug.o irq.o ipl.o dis.o diag.o vdso.o early_nobss.o
-obj-y += sysinfo.o lgr.o os_info.o machine_kexec.o pgm_check.o
+obj-y += sysinfo.o jump_label.o lgr.o os_info.o machine_kexec.o pgm_check.o
obj-y += runtime_instr.o cache.o fpu.o dumpstack.o guarded_storage.o sthyi.o
obj-y += entry.o reipl.o relocate_kernel.o kdebugfs.o alternative.o
obj-y += nospec-branch.o
@@ -70,7 +70,6 @@
obj-$(CONFIG_FUNCTION_TRACER) += mcount.o ftrace.o
obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
obj-$(CONFIG_UPROBES) += uprobes.o
-obj-$(CONFIG_JUMP_LABEL) += jump_label.o
obj-$(CONFIG_KEXEC_FILE) += machine_kexec_file.o kexec_image.o
obj-$(CONFIG_KEXEC_FILE) += kexec_elf.o
diff --git a/arch/s390/kernel/jump_label.c b/arch/s390/kernel/jump_label.c
index 68f415e..43f8430 100644
--- a/arch/s390/kernel/jump_label.c
+++ b/arch/s390/kernel/jump_label.c
@@ -10,6 +10,8 @@
#include <linux/jump_label.h>
#include <asm/ipl.h>
+#ifdef HAVE_JUMP_LABEL
+
struct insn {
u16 opcode;
s32 offset;
@@ -100,3 +102,5 @@
{
__jump_label_transform(entry, type, 1);
}
+
+#endif
diff --git a/arch/sparc/kernel/Makefile b/arch/sparc/kernel/Makefile
index 97c0e19..cf86408 100644
--- a/arch/sparc/kernel/Makefile
+++ b/arch/sparc/kernel/Makefile
@@ -118,4 +118,4 @@
obj-$(CONFIG_SPARC64) += $(pc--y)
obj-$(CONFIG_UPROBES) += uprobes.o
-obj-$(CONFIG_JUMP_LABEL) += jump_label.o
+obj-$(CONFIG_SPARC64) += jump_label.o
diff --git a/arch/sparc/kernel/jump_label.c b/arch/sparc/kernel/jump_label.c
index a4cfaee..7f8eac5 100644
--- a/arch/sparc/kernel/jump_label.c
+++ b/arch/sparc/kernel/jump_label.c
@@ -9,6 +9,8 @@
#include <asm/cacheflush.h>
+#ifdef HAVE_JUMP_LABEL
+
void arch_jump_label_transform(struct jump_entry *entry,
enum jump_label_type type)
{
@@ -45,3 +47,5 @@
flushi(insn);
mutex_unlock(&text_mutex);
}
+
+#endif
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index af35f5c..fbbe59a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -204,6 +204,7 @@
select USER_STACKTRACE_SUPPORT
select VIRT_TO_BUS
select X86_FEATURE_NAMES if PROC_FS
+ select ARCH_HAS_ALT_SYSCALL if X86_64
config INSTRUCTION_DECODER
def_bool y
@@ -408,6 +409,17 @@
If in doubt, say Y.
+config X86_FAST_FEATURE_TESTS
+ bool "Fast CPU feature tests" if EMBEDDED
+ default y
+ ---help---
+ Some fast-paths in the kernel depend on the capabilities of the CPU.
+ Say Y here for the kernel to patch in the appropriate code at runtime
+ based on the capabilities of the CPU. The infrastructure for patching
+ code at runtime takes up some additional space; space-constrained
+ embedded systems may wish to say N here to produce smaller, slightly
+ slower code.
+
config X86_X2APIC
bool "Support x2apic"
depends on X86_LOCAL_APIC && X86_64 && (IRQ_REMAP || HYPERVISOR_GUEST)
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 4833dd7..5a38de8 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -306,10 +306,6 @@
archprepare: checkbin
checkbin:
-ifndef CONFIG_CC_HAS_ASM_GOTO
- @echo Compiler lacks asm-goto support.
- @exit 1
-endif
ifdef CONFIG_RETPOLINE
ifeq ($(RETPOLINE_CFLAGS),)
@echo "You are building kernel with non-retpoline compiler." >&2
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 466f66c..de1e6e6 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -28,6 +28,7 @@
KBUILD_CFLAGS := -m$(BITS) -O2
KBUILD_CFLAGS += -fno-strict-aliasing $(call cc-option, -fPIE, -fPIC)
+KBUILD_CFLAGS += -fomit-frame-pointer
KBUILD_CFLAGS += -DDISABLE_BRANCH_PROFILING
cflags-$(CONFIG_X86_32) := -march=i386
cflags-$(CONFIG_X86_64) := -mcmodel=small
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 993dd06..7c56a2a 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -341,7 +341,7 @@
*/
.macro CALL_enter_from_user_mode
#ifdef CONFIG_CONTEXT_TRACKING
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
STATIC_JUMP_IF_FALSE .Lafter_call_\@, context_tracking_enabled, def=0
#endif
call enter_from_user_mode
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 8353348..ad35f51 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -288,10 +288,17 @@
* regs->orig_ax, which changes the behavior of some syscalls.
*/
nr &= __SYSCALL_MASK;
+#ifdef CONFIG_ALT_SYSCALL
+ if (likely(nr < ti->nr_syscalls)) {
+ nr = array_index_nospec(nr, ti->nr_syscalls);
+ regs->ax = ti->sys_call_table[nr](regs);
+ }
+#else
if (likely(nr < NR_syscalls)) {
nr = array_index_nospec(nr, NR_syscalls);
regs->ax = sys_call_table[nr](regs);
}
+#endif
syscall_return_slowpath(regs);
}
@@ -323,6 +330,12 @@
nr = syscall_trace_enter(regs);
}
+#ifdef CONFIG_ALT_SYSCALL
+ if (likely(nr < ti->ia32_nr_syscalls)) {
+ nr = array_index_nospec(nr, ti->ia32_nr_syscalls);
+ regs->ax = ti->ia32_sys_call_table[nr](regs);
+ }
+#else
if (likely(nr < IA32_NR_syscalls)) {
nr = array_index_nospec(nr, IA32_NR_syscalls);
#ifdef CONFIG_IA32_EMULATION
@@ -340,6 +353,7 @@
(unsigned int)regs->di, (unsigned int)regs->bp);
#endif /* CONFIG_IA32_EMULATION */
}
+#endif
syscall_return_slowpath(regs);
}
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 68889ac..0ecc9ba 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -140,20 +140,7 @@
#define setup_force_cpu_bug(bit) setup_force_cpu_cap(bit)
-#if defined(__clang__) && !defined(CONFIG_CC_HAS_ASM_GOTO)
-
-/*
- * Workaround for the sake of BPF compilation which utilizes kernel
- * headers, but clang does not support ASM GOTO and fails the build.
- */
-#ifndef __BPF_TRACING__
-#warning "Compiler lacks ASM_GOTO support. Add -D __BPF_TRACING__ to your compiler arguments"
-#endif
-
-#define static_cpu_has(bit) boot_cpu_has(bit)
-
-#else
-
+#if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_X86_FAST_FEATURE_TESTS)
/*
* Static testing of CPU features. Used the same as boot_cpu_has().
* These will statically patch the target code for additional
@@ -209,6 +196,12 @@
boot_cpu_has(bit) : \
_static_cpu_has(bit) \
)
+#else
+/*
+ * Fall back to dynamic for gcc versions which don't support asm goto. Should be
+ * a minority now anyway.
+ */
+#define static_cpu_has(bit) boot_cpu_has(bit)
#endif
#define cpu_has_bug(c, bit) cpu_has(c, (bit))
diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
index 7010e1c..8c0de42 100644
--- a/arch/x86/include/asm/jump_label.h
+++ b/arch/x86/include/asm/jump_label.h
@@ -2,6 +2,19 @@
#ifndef _ASM_X86_JUMP_LABEL_H
#define _ASM_X86_JUMP_LABEL_H
+#ifndef HAVE_JUMP_LABEL
+/*
+ * For better or for worse, if jump labels (the gcc extension) are missing,
+ * then the entire static branch patching infrastructure is compiled out.
+ * If that happens, the code in here will malfunction. Raise a compiler
+ * error instead.
+ *
+ * In theory, jump labels and the static branch patching infrastructure
+ * could be decoupled to fix this.
+ */
+#error asm/jump_label.h included on a non-jump-label kernel
+#endif
+
#define JUMP_LABEL_NOP_SIZE 5
#ifdef CONFIG_X86_64
diff --git a/arch/x86/include/asm/rmwcc.h b/arch/x86/include/asm/rmwcc.h
index 033dc7c..4914a3e 100644
--- a/arch/x86/include/asm/rmwcc.h
+++ b/arch/x86/include/asm/rmwcc.h
@@ -4,7 +4,7 @@
#define __CLOBBERS_MEM(clb...) "memory", ## clb
-#if !defined(__GCC_ASM_FLAG_OUTPUTS__) && defined(CONFIG_CC_HAS_ASM_GOTO)
+#if !defined(__GCC_ASM_FLAG_OUTPUTS__) && defined(CC_HAVE_ASM_GOTO)
/* Use asm goto */
@@ -21,7 +21,7 @@
#define __BINARY_RMWcc_ARG " %1, "
-#else /* defined(__GCC_ASM_FLAG_OUTPUTS__) || !defined(CONFIG_CC_HAS_ASM_GOTO) */
+#else /* defined(__GCC_ASM_FLAG_OUTPUTS__) || !defined(CC_HAVE_ASM_GOTO) */
/* Use flags output or a set instruction */
@@ -36,7 +36,7 @@
#define __BINARY_RMWcc_ARG " %2, "
-#endif /* defined(__GCC_ASM_FLAG_OUTPUTS__) || !defined(CONFIG_CC_HAS_ASM_GOTO) */
+#endif /* defined(__GCC_ASM_FLAG_OUTPUTS__) || !defined(CC_HAVE_ASM_GOTO) */
#define GEN_UNARY_RMWcc(op, var, arg0, cc) \
__GEN_RMWcc(op " " arg0, var, cc, __CLOBBERS_MEM())
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index d653139..dc4418f 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -33,6 +33,7 @@
#define ia32_sys_call_table sys_call_table
#define __NR_syscall_compat_max __NR_syscall_max
#define IA32_NR_syscalls NR_syscalls
+#define ia32_nr_syscalls nr_syscalls
#endif
#if defined(CONFIG_IA32_EMULATION)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 82b73b7..7ca78ae 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -50,17 +50,52 @@
*/
#ifndef __ASSEMBLY__
struct task_struct;
+
+/* same as sys_call_ptr_t from asm/syscall.h */
+typedef asmlinkage long (*ti_sys_call_ptr_t)(const struct pt_regs *);
+
#include <asm/cpufeature.h>
#include <linux/atomic.h>
struct thread_info {
unsigned long flags; /* low level flags */
u32 status; /* thread synchronous flags */
+#ifdef CONFIG_ALT_SYSCALL
+ /*
+ * This uses nr_syscalls instead of nr_syscall_max because we want
+ * to be able to entirely disable a syscall table (e.g. compat) by
+ * setting nr_syscalls to 0. This requires some careful work in
+ * the syscall entry assembly code, most variations use ..._max.
+ */
+ unsigned int nr_syscalls; /* size of below */
+ const ti_sys_call_ptr_t *sys_call_table;
+# ifdef CONFIG_IA32_EMULATION
+ unsigned int ia32_nr_syscalls; /* size of below */
+ const ti_sys_call_ptr_t *ia32_sys_call_table;
+# endif
+#endif
};
+#ifdef CONFIG_ALT_SYSCALL
+# ifdef CONFIG_IA32_EMULATION
+# define INIT_THREAD_INFO_SYSCALL_COMPAT \
+ .ia32_nr_syscalls = IA32_NR_syscalls, \
+ .ia32_sys_call_table = ia32_sys_call_table,
+# else
+# define INIT_THREAD_INFO_SYSCALL_COMPAT /* */
+# endif
+# define INIT_THREAD_INFO_SYSCALL \
+ .nr_syscalls = NR_syscalls, \
+ .sys_call_table = sys_call_table, \
+ INIT_THREAD_INFO_SYSCALL_COMPAT
+#else
+# define INIT_THREAD_INFO_SYSCALL /* */
+#endif
+
#define INIT_THREAD_INFO(tsk) \
{ \
.flags = 0, \
+ INIT_THREAD_INFO_SYSCALL \
}
#else /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index da0b6bc..b7661a3 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -49,8 +49,7 @@
obj-y += traps.o idt.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
obj-y += time.o ioport.o dumpstack.o nmi.o
obj-$(CONFIG_MODIFY_LDT_SYSCALL) += ldt.o
-obj-y += setup.o x86_init.o i8259.o irqinit.o
-obj-$(CONFIG_JUMP_LABEL) += jump_label.o
+obj-y += setup.o x86_init.o i8259.o irqinit.o jump_label.o
obj-$(CONFIG_IRQ_WORK) += irq_work.o
obj-y += probe_roms.o
obj-$(CONFIG_X86_64) += sys_x86_64.o
@@ -140,6 +139,8 @@
obj-$(CONFIG_UNWINDER_FRAME_POINTER) += unwind_frame.o
obj-$(CONFIG_UNWINDER_GUESS) += unwind_guess.o
+obj-$(CONFIG_ALT_SYSCALL) += alt-syscall.o
+
###
# 64 bit specific files
ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/alt-syscall.c b/arch/x86/kernel/alt-syscall.c
new file mode 100644
index 0000000..09e7ed7
--- /dev/null
+++ b/arch/x86/kernel/alt-syscall.c
@@ -0,0 +1,70 @@
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/unistd.h>
+#include <linux/slab.h>
+#include <linux/stddef.h>
+#include <linux/syscalls.h>
+#include <linux/alt-syscall.h>
+
+#include <asm/syscall.h>
+#include <asm/syscalls.h>
+
+int arch_dup_sys_call_table(struct alt_sys_call_table *entry)
+{
+ if (!entry)
+ return -EINVAL;
+ /* Table already allocated. */
+ if (entry->table)
+ return -EINVAL;
+#ifdef CONFIG_IA32_EMULATION
+ if (entry->compat_table)
+ return -EINVAL;
+#endif
+ entry->size = NR_syscalls;
+ entry->table = kcalloc(entry->size, sizeof(sys_call_ptr_t),
+ GFP_KERNEL);
+ if (!entry->table)
+ goto failed;
+
+ memcpy(entry->table, sys_call_table,
+ entry->size * sizeof(sys_call_ptr_t));
+
+#ifdef CONFIG_IA32_EMULATION
+ entry->compat_size = IA32_NR_syscalls;
+ entry->compat_table = kcalloc(entry->compat_size,
+ sizeof(sys_call_ptr_t), GFP_KERNEL);
+ if (!entry->compat_table)
+ goto failed;
+ memcpy(entry->compat_table, ia32_sys_call_table,
+ entry->compat_size * sizeof(sys_call_ptr_t));
+#endif
+
+ return 0;
+
+failed:
+ entry->size = 0;
+ kfree(entry->table);
+ entry->table = NULL;
+#ifdef CONFIG_IA32_EMULATION
+ entry->compat_size = 0;
+#endif
+ return -ENOMEM;
+}
+
+/* Operates on "current", which isn't racy, since it's _in_ a syscall. */
+int arch_set_sys_call_table(struct alt_sys_call_table *entry)
+{
+ if (!entry)
+ return -EINVAL;
+
+ current_thread_info()->nr_syscalls = entry->size;
+ current_thread_info()->sys_call_table = entry->table;
+#ifdef CONFIG_IA32_EMULATION
+ current_thread_info()->ia32_nr_syscalls = entry->compat_size;
+ current_thread_info()->ia32_sys_call_table = entry->compat_table;
+#endif
+
+ return 0;
+}
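
The two hooks above are meant to be driven by the generic alt-syscall code
behind linux/alt-syscall.h, which is not part of this hunk. A purely
illustrative sketch of a caller follows; the blocked_reboot() stub, the choice
of __NR_reboot, and any detail of struct alt_sys_call_table beyond the fields
used in this file are assumptions made only for the example.

	static struct alt_sys_call_table example_table;	/* zero-initialized */

	static asmlinkage long blocked_reboot(const struct pt_regs *regs)
	{
		return -EPERM;
	}

	static int example_install_table(void)
	{
		int ret;

		/* Copy the stock native (and compat) tables into kmalloc'd memory. */
		ret = arch_dup_sys_call_table(&example_table);
		if (ret)
			return ret;

		/* Override a single entry in the private copy. */
		example_table.table[__NR_reboot] = blocked_reboot;

		/* Switch the calling task over to the modified table. */
		return arch_set_sys_call_table(&example_table);
	}
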
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index 4c3d9a3..eeea935 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -16,6 +16,8 @@
#include <asm/alternative.h>
#include <asm/text-patching.h>
+#ifdef HAVE_JUMP_LABEL
+
union jump_code_union {
char code[JUMP_LABEL_NOP_SIZE];
struct {
@@ -140,3 +142,5 @@
if (jlstate == JL_STATE_UPDATE)
__jump_label_transform(entry, type, text_poke_early, 1);
}
+
+#endif
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 03b7529..42053b2 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -185,8 +185,7 @@
/*
* Secondary CPUs do not run through tsc_init(), so set up
* all the scale factors for all CPUs, assuming the same
- * speed as the bootup CPU. (cpufreq notifiers will fix this
- * up if their speed diverges)
+ * speed as the bootup CPU.
*/
static void __init cyc2ns_init_secondary_cpus(void)
{
@@ -936,12 +935,12 @@
}
#ifdef CONFIG_CPU_FREQ
-/* Frequency scaling support. Adjust the TSC based timer when the cpu frequency
+/*
+ * Frequency scaling support. Adjust the TSC based timer when the CPU frequency
* changes.
*
- * RED-PEN: On SMP we assume all CPUs run with the same frequency. It's
- * not that important because current Opteron setups do not support
- * scaling on SMP anyroads.
+ * NOTE: On SMP the situation is not fixable in general, so simply mark the TSC
+ * as unstable and give up in those cases.
*
* Should fix up last_tsc too. Currently gettimeofday in the
* first tick after the change will be slightly wrong.
@@ -955,22 +954,22 @@
void *data)
{
struct cpufreq_freqs *freq = data;
- unsigned long *lpj;
- lpj = &boot_cpu_data.loops_per_jiffy;
-#ifdef CONFIG_SMP
- if (!(freq->flags & CPUFREQ_CONST_LOOPS))
- lpj = &cpu_data(freq->cpu).loops_per_jiffy;
-#endif
+ if (num_online_cpus() > 1) {
+ mark_tsc_unstable("cpufreq changes on SMP");
+ return 0;
+ }
if (!ref_freq) {
ref_freq = freq->old;
- loops_per_jiffy_ref = *lpj;
+ loops_per_jiffy_ref = boot_cpu_data.loops_per_jiffy;
tsc_khz_ref = tsc_khz;
}
+
if ((val == CPUFREQ_PRECHANGE && freq->old < freq->new) ||
- (val == CPUFREQ_POSTCHANGE && freq->old > freq->new)) {
- *lpj = cpufreq_scale(loops_per_jiffy_ref, ref_freq, freq->new);
+ (val == CPUFREQ_POSTCHANGE && freq->old > freq->new)) {
+ boot_cpu_data.loops_per_jiffy =
+ cpufreq_scale(loops_per_jiffy_ref, ref_freq, freq->new);
tsc_khz = cpufreq_scale(tsc_khz_ref, ref_freq, freq->new);
if (!(freq->flags & CPUFREQ_CONST_LOOPS))
@@ -1377,6 +1376,8 @@
static bool __init determine_cpu_tsc_frequencies(bool early)
{
+ u64 initial_tsc;
+
/* Make sure that cpu and tsc are not already calibrated */
WARN_ON(cpu_khz || tsc_khz);
@@ -1389,6 +1390,8 @@
cpu_khz = pit_hpet_ptimer_calibrate_cpu();
}
+ initial_tsc = rdtsc();
+
/*
* Trust non-zero tsc_khz as authoritative,
* and use it to sanity check cpu_khz,
@@ -1402,6 +1405,10 @@
if (tsc_khz == 0)
return false;
+ do_div(initial_tsc, cpu_khz / 1000);
+ pr_info("Initial usec timer %llu\n",
+ (unsigned long long)initial_tsc);
+
pr_info("Detected %lu.%03lu MHz processor\n",
(unsigned long)cpu_khz / KHZ,
(unsigned long)cpu_khz % KHZ);
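
The added pr_info() above reports how long the TSC had already been counting by
the time calibration finished: do_div() divides the raw TSC snapshot by
cpu_khz / 1000, i.e. by cycles per microsecond. A worked example with assumed
numbers:

	/* Assume cpu_khz = 2400000 (2.4 GHz) and initial_tsc = 12000000000 cycles. */
	u64 usec = 12000000000ULL / (2400000 / 1000);	/* 2400 cycles/us -> 5000000 us, ~5 s since reset */
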
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 210eabd..93d07cf 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -456,7 +456,7 @@
/*
* XXX: inoutclob user must know where the argument is being expanded.
- * Relying on CONFIG_CC_HAS_ASM_GOTO would allow us to remove _fault.
+ * Relying on CC_HAVE_ASM_GOTO would allow us to remove _fault.
*/
#define asm_safe(insn, inoutclob...) \
({ \
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index e7f19de..b129915 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -86,6 +86,8 @@
pgd_t *save_pgd;
save_pgd = efi_call_phys_prolog();
+ if (!save_pgd)
+ return EFI_ABORTED;
/* Disable interrupts around EFI calls: */
local_irq_save(flags);
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 52dd59a..c54b5a58 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -84,13 +84,15 @@
if (!efi_enabled(EFI_OLD_MEMMAP)) {
efi_switch_mm(&efi_mm);
- return NULL;
+ return efi_mm.pgd;
}
early_code_mapping_set_exec(1);
n_pgds = DIV_ROUND_UP((max_pfn << PAGE_SHIFT), PGDIR_SIZE);
save_pgd = kmalloc_array(n_pgds, sizeof(*save_pgd), GFP_KERNEL);
+ if (!save_pgd)
+ return NULL;
/*
* Build 1:1 identity mapping for efi=old_map usage. Note that
@@ -138,10 +140,11 @@
pgd_offset_k(pgd * PGDIR_SIZE)->pgd &= ~_PAGE_NX;
}
-out:
__flush_tlb_all();
-
return save_pgd;
+out:
+ efi_call_phys_epilog(save_pgd);
+ return NULL;
}
void __init efi_call_phys_epilog(pgd_t *save_pgd)
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index ecd3d0e..860ad04 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -579,7 +579,8 @@
bfqg_and_blkg_get(bfqg);
if (bfq_bfqq_busy(bfqq)) {
- bfq_pos_tree_add_move(bfqd, bfqq);
+ if (unlikely(!bfqd->nonrot_with_queueing))
+ bfq_pos_tree_add_move(bfqd, bfqq);
bfq_activate_bfqq(bfqd, bfqq);
}
@@ -1103,7 +1104,7 @@
},
#endif /* CONFIG_DEBUG_BLK_CGROUP */
- /* the same statictics which cover the bfqg and its descendants */
+ /* the same statistics which cover the bfqg and its descendants */
{
.name = "bfq.io_service_bytes_recursive",
.private = (unsigned long)&blkcg_policy_bfq,
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 5198ed1..fb80791 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -189,7 +189,7 @@
/*
* When a sync request is dispatched, the queue that contains that
* request, and all the ancestor entities of that queue, are charged
- * with the number of sectors of the request. In constrast, if the
+ * with the number of sectors of the request. In contrast, if the
* request is async, then the queue and its ancestor entities are
* charged with the number of sectors of the request, multiplied by
* the factor below. This throttles the bandwidth for async I/O,
@@ -217,7 +217,7 @@
* queue merging.
*
* As can be deduced from the low time limit below, queue merging, if
- * successful, happens at the very beggining of the I/O of the involved
+ * successful, happens at the very beginning of the I/O of the involved
* cooperating processes, as a consequence of the arrival of the very
* first requests from each cooperator. After that, there is very
* little chance to find cooperators.
@@ -230,13 +230,26 @@
#define BFQ_MIN_TT (2 * NSEC_PER_MSEC)
/* hw_tag detection: parallel requests threshold and min samples needed. */
-#define BFQ_HW_QUEUE_THRESHOLD 4
+#define BFQ_HW_QUEUE_THRESHOLD 3
#define BFQ_HW_QUEUE_SAMPLES 32
#define BFQQ_SEEK_THR (sector_t)(8 * 100)
#define BFQQ_SECT_THR_NONROT (sector_t)(2 * 32)
+#define BFQ_RQ_SEEKY(bfqd, last_pos, rq) \
+ (get_sdist(last_pos, rq) > \
+ BFQQ_SEEK_THR && \
+ (!blk_queue_nonrot(bfqd->queue) || \
+ blk_rq_sectors(rq) < BFQQ_SECT_THR_NONROT))
#define BFQQ_CLOSE_THR (sector_t)(8 * 1024)
#define BFQQ_SEEKY(bfqq) (hweight32(bfqq->seek_history) > 19)
+/*
+ * Sync random I/O is likely to be confused with soft real-time I/O,
+ * because it is characterized by limited throughput and apparently
+ * isochronous arrival pattern. To avoid false positives, queues
+ * containing only random (seeky) I/O are prevented from being tagged
+ * as soft real-time.
+ */
+#define BFQQ_TOTALLY_SEEKY(bfqq) (bfqq->seek_history == -1)
/* Min number of samples required to perform peak-rate update */
#define BFQ_RATE_MIN_SAMPLES 32
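
For context on the two macros above: seek_history is a 32-bit shift register
with one bit recorded per request. BFQQ_SEEKY() fires when more than 19 of the
last 32 requests were seeky, while BFQQ_TOTALLY_SEEKY() requires all 32 bits
set (the all-ones value compares equal to -1). Roughly, the history is
maintained elsewhere in BFQ (bfq_update_io_seektime(), not shown in this hunk)
along these lines; last_pos stands for the position of the previous request:

	bfqq->seek_history <<= 1;
	bfqq->seek_history |= BFQ_RQ_SEEKY(bfqd, last_pos, rq);
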
@@ -428,7 +441,7 @@
/*
* Lifted from AS - choose which of rq1 and rq2 that is best served now.
- * We choose the request that is closesr to the head right now. Distance
+ * We choose the request that is closer to the head right now. Distance
* behind the head is penalized and only allowed to a certain extent.
*/
static struct request *bfq_choose_req(struct bfq_data *bfqd,
@@ -590,7 +603,16 @@
bfq_merge_time_limit);
}
-void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+/*
+ * The following function is not marked as __cold because it is
+ * actually cold, but for the same performance goal described in the
+ * comments on the likely() at the beginning of
+ * bfq_setup_cooperator(). Unexpectedly, to reach an even lower
+ * execution time for the case where this function is not invoked, we
+ * had to add an unlikely() in each involved if().
+ */
+void __cold
+bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
{
struct rb_node **p, *parent;
struct bfq_queue *__bfqq;
@@ -624,59 +646,73 @@
}
/*
- * Tell whether there are active queues or groups with differentiated weights.
- */
-static bool bfq_differentiated_weights(struct bfq_data *bfqd)
-{
- /*
- * For weights to differ, at least one of the trees must contain
- * at least two nodes.
- */
- return (!RB_EMPTY_ROOT(&bfqd->queue_weights_tree) &&
- (bfqd->queue_weights_tree.rb_node->rb_left ||
- bfqd->queue_weights_tree.rb_node->rb_right)
-#ifdef CONFIG_BFQ_GROUP_IOSCHED
- ) ||
- (!RB_EMPTY_ROOT(&bfqd->group_weights_tree) &&
- (bfqd->group_weights_tree.rb_node->rb_left ||
- bfqd->group_weights_tree.rb_node->rb_right)
-#endif
- );
-}
-
-/*
- * The following function returns true if every queue must receive the
- * same share of the throughput (this condition is used when deciding
- * whether idling may be disabled, see the comments in the function
- * bfq_better_to_idle()).
+ * The following function returns false either if every active queue
+ * must receive the same share of the throughput (symmetric scenario),
+ * or, as a special case, if bfqq must receive a share of the
+ * throughput lower than or equal to the share that every other active
+ * queue must receive. If bfqq does sync I/O, then these are the only
+ * two cases where bfqq happens to be guaranteed its share of the
+ * throughput even if I/O dispatching is not plugged when bfqq remains
+ * temporarily empty (for more details, see the comments in the
+ * function bfq_better_to_idle()). For this reason, the return value
+ * of this function is used to check whether I/O-dispatch plugging can
+ * be avoided.
*
- * Such a scenario occurs when:
+ * The above first case (symmetric scenario) occurs when:
* 1) all active queues have the same weight,
- * 2) all active groups at the same level in the groups tree have the same
- * weight,
+ * 2) all active queues belong to the same I/O-priority class,
* 3) all active groups at the same level in the groups tree have the same
+ * weight,
+ * 4) all active groups at the same level in the groups tree have the same
* number of children.
*
- * Unfortunately, keeping the necessary state for evaluating exactly the
- * above symmetry conditions would be quite complex and time-consuming.
- * Therefore this function evaluates, instead, the following stronger
- * sub-conditions, for which it is much easier to maintain the needed
- * state:
+ * Unfortunately, keeping the necessary state for evaluating exactly
+ * the last two symmetry sub-conditions above would be quite complex
+ * and time consuming. Therefore this function evaluates, instead,
+ * only the following stronger three sub-conditions, for which it is
+ * much easier to maintain the needed state:
* 1) all active queues have the same weight,
- * 2) all active groups have the same weight,
- * 3) all active groups have at most one active child each.
- * In particular, the last two conditions are always true if hierarchical
- * support and the cgroups interface are not enabled, thus no state needs
- * to be maintained in this case.
+ * 2) all active queues belong to the same I/O-priority class,
+ * 3) there are no active groups.
+ * In particular, the last condition is always true if hierarchical
+ * support or the cgroups interface are not enabled, thus no state
+ * needs to be maintained in this case.
*/
-static bool bfq_symmetric_scenario(struct bfq_data *bfqd)
+static bool bfq_asymmetric_scenario(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq)
{
- return !bfq_differentiated_weights(bfqd);
+ bool smallest_weight = bfqq &&
+ bfqq->weight_counter &&
+ bfqq->weight_counter ==
+ container_of(
+ rb_first_cached(&bfqd->queue_weights_tree),
+ struct bfq_weight_counter,
+ weights_node);
+
+ /*
+ * For queue weights to differ, queue_weights_tree must contain
+ * at least two nodes.
+ */
+ bool varied_queue_weights = !smallest_weight &&
+ !RB_EMPTY_ROOT(&bfqd->queue_weights_tree.rb_root) &&
+ (bfqd->queue_weights_tree.rb_root.rb_node->rb_left ||
+ bfqd->queue_weights_tree.rb_root.rb_node->rb_right);
+
+ bool multiple_classes_busy =
+ (bfqd->busy_queues[0] && bfqd->busy_queues[1]) ||
+ (bfqd->busy_queues[0] && bfqd->busy_queues[2]) ||
+ (bfqd->busy_queues[1] && bfqd->busy_queues[2]);
+
+ return varied_queue_weights || multiple_classes_busy
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+ || bfqd->num_groups_with_pending_reqs > 0
+#endif
+ ;
}
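
A note that may help when reading this function: bfqd->busy_queues is no longer
a single counter but an array with one entry per I/O-priority class (RT, BE,
IDLE), which is why multiple_classes_busy checks, pairwise, whether two classes
have busy queues at the same time. The helper bfq_tot_busy_queues() used later
in this patch lives in bfq-iosched.h (not shown here) and, roughly, just sums
the per-class counters:

	static inline unsigned int bfq_tot_busy_queues(struct bfq_data *bfqd)
	{
		return bfqd->busy_queues[0] + bfqd->busy_queues[1] +
		       bfqd->busy_queues[2];
	}
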
/*
* If the weight-counter tree passed as input contains no counter for
- * the weight of the input entity, then add that counter; otherwise just
+ * the weight of the input queue, then add that counter; otherwise just
* increment the existing counter.
*
* Note that weight-counter trees contain few nodes in mostly symmetric
@@ -687,25 +723,26 @@
* In most scenarios, the rate at which nodes are created/destroyed
* should be low too.
*/
-void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_entity *entity,
- struct rb_root *root)
+void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ struct rb_root_cached *root)
{
- struct rb_node **new = &(root->rb_node), *parent = NULL;
+ struct bfq_entity *entity = &bfqq->entity;
+ struct rb_node **new = &(root->rb_root.rb_node), *parent = NULL;
+ bool leftmost = true;
/*
- * Do not insert if the entity is already associated with a
+ * Do not insert if the queue is already associated with a
* counter, which happens if:
- * 1) the entity is associated with a queue,
- * 2) a request arrival has caused the queue to become both
+ * 1) a request arrival has caused the queue to become both
* non-weight-raised, and hence change its weight, and
* backlogged; in this respect, each of the two events
* causes an invocation of this function,
- * 3) this is the invocation of this function caused by the
+ * 2) this is the invocation of this function caused by the
* second event. This second invocation is actually useless,
* and we handle this fact by exiting immediately. More
* efficient or clearer solutions might possibly be adopted.
*/
- if (entity->weight_counter)
+ if (bfqq->weight_counter)
return;
while (*new) {
@@ -715,77 +752,79 @@
parent = *new;
if (entity->weight == __counter->weight) {
- entity->weight_counter = __counter;
+ bfqq->weight_counter = __counter;
goto inc_counter;
}
if (entity->weight < __counter->weight)
new = &((*new)->rb_left);
- else
+ else {
new = &((*new)->rb_right);
+ leftmost = false;
+ }
}
- entity->weight_counter = kzalloc(sizeof(struct bfq_weight_counter),
- GFP_ATOMIC);
+ bfqq->weight_counter = kzalloc(sizeof(struct bfq_weight_counter),
+ GFP_ATOMIC);
/*
* In the unlucky event of an allocation failure, we just
- * exit. This will cause the weight of entity to not be
- * considered in bfq_differentiated_weights, which, in its
- * turn, causes the scenario to be deemed wrongly symmetric in
- * case entity's weight would have been the only weight making
- * the scenario asymmetric. On the bright side, no unbalance
- * will however occur when entity becomes inactive again (the
+ * exit. This will cause the weight of queue to not be
+ * considered in bfq_asymmetric_scenario, which, in its turn,
+ * causes the scenario to be deemed wrongly symmetric in case
+ * bfqq's weight would have been the only weight making the
+ * scenario asymmetric. On the bright side, no unbalance will
+ * however occur when bfqq becomes inactive again (the
* invocation of this function is triggered by an activation
- * of entity). In fact, bfq_weights_tree_remove does nothing
- * if !entity->weight_counter.
+ * of queue). In fact, bfq_weights_tree_remove does nothing
+ * if !bfqq->weight_counter.
*/
- if (unlikely(!entity->weight_counter))
+ if (unlikely(!bfqq->weight_counter))
return;
- entity->weight_counter->weight = entity->weight;
- rb_link_node(&entity->weight_counter->weights_node, parent, new);
- rb_insert_color(&entity->weight_counter->weights_node, root);
+ bfqq->weight_counter->weight = entity->weight;
+ rb_link_node(&bfqq->weight_counter->weights_node, parent, new);
+ rb_insert_color_cached(&bfqq->weight_counter->weights_node, root,
+ leftmost);
inc_counter:
- entity->weight_counter->num_active++;
+ bfqq->weight_counter->num_active++;
+ bfqq->ref++;
}
/*
- * Decrement the weight counter associated with the entity, and, if the
+ * Decrement the weight counter associated with the queue, and, if the
* counter reaches 0, remove the counter from the tree.
* See the comments to the function bfq_weights_tree_add() for considerations
* about overhead.
*/
void __bfq_weights_tree_remove(struct bfq_data *bfqd,
- struct bfq_entity *entity,
- struct rb_root *root)
+ struct bfq_queue *bfqq,
+ struct rb_root_cached *root)
{
- if (!entity->weight_counter)
+ if (!bfqq->weight_counter)
return;
- entity->weight_counter->num_active--;
- if (entity->weight_counter->num_active > 0)
+ bfqq->weight_counter->num_active--;
+ if (bfqq->weight_counter->num_active > 0)
goto reset_entity_pointer;
- rb_erase(&entity->weight_counter->weights_node, root);
- kfree(entity->weight_counter);
+ rb_erase_cached(&bfqq->weight_counter->weights_node, root);
+ kfree(bfqq->weight_counter);
reset_entity_pointer:
- entity->weight_counter = NULL;
+ bfqq->weight_counter = NULL;
+ bfq_put_queue(bfqq);
}
/*
- * Invoke __bfq_weights_tree_remove on bfqq and all its inactive
- * parent entities.
+ * Invoke __bfq_weights_tree_remove on bfqq and decrement the number
+ * of active groups for each queue's inactive parent entity.
*/
void bfq_weights_tree_remove(struct bfq_data *bfqd,
struct bfq_queue *bfqq)
{
struct bfq_entity *entity = bfqq->entity.parent;
- __bfq_weights_tree_remove(bfqd, &bfqq->entity,
- &bfqd->queue_weights_tree);
-
for_each_entity(entity) {
struct bfq_sched_data *sd = entity->my_sched_data;
@@ -797,18 +836,37 @@
* next_in_service for details on why
* in_service_entity must be checked too).
*
- * As a consequence, the weight of entity is
- * not to be removed. In addition, if entity
- * is active, then its parent entities are
- * active as well, and thus their weights are
- * not to be removed either. In the end, this
- * loop must stop here.
+ * As a consequence, its parent entities are
+ * active as well, and thus this loop must
+ * stop here.
*/
break;
}
- __bfq_weights_tree_remove(bfqd, entity,
- &bfqd->group_weights_tree);
+
+ /*
+ * The decrement of num_groups_with_pending_reqs is
+ * not performed immediately upon the deactivation of
+ * entity, but it is delayed to when it also happens
+ * that the first leaf descendant bfqq of entity gets
+ * all its pending requests completed. The following
+ * instructions perform this delayed decrement, if
+ * needed. See the comments on
+ * num_groups_with_pending_reqs for details.
+ */
+ if (entity->in_groups_with_pending_reqs) {
+ entity->in_groups_with_pending_reqs = false;
+ bfqd->num_groups_with_pending_reqs--;
+ }
}
+
+ /*
+ * Next function is invoked last, because it causes bfqq to be
+ * freed if the following holds: bfqq is not in service and
+ * has no dispatched request. DO NOT use bfqq after the next
+ * function invocation.
+ */
+ __bfq_weights_tree_remove(bfqd, bfqq,
+ &bfqd->queue_weights_tree);
}
/*
@@ -864,7 +922,8 @@
static unsigned long bfq_serv_to_charge(struct request *rq,
struct bfq_queue *bfqq)
{
- if (bfq_bfqq_sync(bfqq) || bfqq->wr_coeff > 1)
+ if (bfq_bfqq_sync(bfqq) || bfqq->wr_coeff > 1 ||
+ bfq_asymmetric_scenario(bfqq->bfqd, bfqq))
return blk_rq_sectors(rq);
return blk_rq_sectors(rq) * bfq_async_charge_factor;
@@ -898,8 +957,10 @@
*/
return;
- new_budget = max_t(unsigned long, bfqq->max_budget,
- bfq_serv_to_charge(next_rq, bfqq));
+ new_budget = max_t(unsigned long,
+ max_t(unsigned long, bfqq->max_budget,
+ bfq_serv_to_charge(next_rq, bfqq)),
+ entity->service);
if (entity->budget != new_budget) {
entity->budget = new_budget;
bfq_log_bfqq(bfqd, bfqq, "updated next rq: new budget %lu",
@@ -928,7 +989,7 @@
* of several files
* mplayer took 23 seconds to start, if constantly weight-raised.
*
- * As for higher values than that accomodating the above bad
+ * As for higher values than that accommodating the above bad
* scenario, tests show that higher values would often yield
* the opposite of the desired result, i.e., would worsen
* responsiveness by allowing non-interactive applications to
@@ -967,6 +1028,7 @@
else
bfq_clear_bfqq_IO_bound(bfqq);
+ bfqq->entity.new_weight = bic->saved_weight;
bfqq->ttime = bic->saved_ttime;
bfqq->wr_coeff = bic->saved_wr_coeff;
bfqq->wr_start_at_switch_to_srt = bic->saved_wr_start_at_switch_to_srt;
@@ -1002,7 +1064,8 @@
static int bfqq_process_refs(struct bfq_queue *bfqq)
{
- return bfqq->ref - bfqq->allocated - bfqq->entity.on_st;
+ return bfqq->ref - bfqq->allocated - bfqq->entity.on_st -
+ (bfqq->weight_counter != NULL);
}
/* Empty burst list and add just bfqq (see comments on bfq_handle_burst) */
@@ -1013,8 +1076,18 @@
hlist_for_each_entry_safe(item, n, &bfqd->burst_list, burst_list_node)
hlist_del_init(&item->burst_list_node);
- hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list);
- bfqd->burst_size = 1;
+
+ /*
+ * Start the creation of a new burst list only if there is no
+ * active queue. See comments on the conditional invocation of
+ * bfq_handle_burst().
+ */
+ if (bfq_tot_busy_queues(bfqd) == 0) {
+ hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list);
+ bfqd->burst_size = 1;
+ } else
+ bfqd->burst_size = 0;
+
bfqd->burst_parent_entity = bfqq->entity.parent;
}
@@ -1070,7 +1143,8 @@
* many parallel threads/processes. Examples are systemd during boot,
* or git grep. To help these processes get their job done as soon as
* possible, it is usually better to not grant either weight-raising
- * or device idling to their queues.
+ * or device idling to their queues, unless these queues must be
+ * protected from the I/O flowing through other active queues.
*
* In this comment we describe, firstly, the reasons why this fact
* holds, and, secondly, the next function, which implements the main
@@ -1082,7 +1156,10 @@
* cumulatively served, the sooner the target job of these queues gets
* completed. As a consequence, weight-raising any of these queues,
* which also implies idling the device for it, is almost always
- * counterproductive. In most cases it just lowers throughput.
+ * counterproductive, unless there are other active queues to isolate
+ * these new queues from. If there no other active queues, then
+ * weight-raising these new queues just lowers throughput in most
+ * cases.
*
* On the other hand, a burst of queue creations may be caused also by
* the start of an application that does not consist of a lot of
@@ -1116,14 +1193,16 @@
* are very rare. They typically occur if some service happens to
* start doing I/O exactly when the interactive task starts.
*
- * Turning back to the next function, it implements all the steps
- * needed to detect the occurrence of a large burst and to properly
- * mark all the queues belonging to it (so that they can then be
- * treated in a different way). This goal is achieved by maintaining a
- * "burst list" that holds, temporarily, the queues that belong to the
- * burst in progress. The list is then used to mark these queues as
- * belonging to a large burst if the burst does become large. The main
- * steps are the following.
+ * Turning back to the next function, it is invoked only if there are
+ * no active queues (apart from active queues that would belong to the
+ * same, possible burst bfqq would belong to), and it implements all
+ * the steps needed to detect the occurrence of a large burst and to
+ * properly mark all the queues belonging to it (so that they can then
+ * be treated in a different way). This goal is achieved by
+ * maintaining a "burst list" that holds, temporarily, the queues that
+ * belong to the burst in progress. The list is then used to mark
+ * these queues as belonging to a large burst if the burst does become
+ * large. The main steps are the following.
*
* . when the very first queue is created, the queue is inserted into the
* list (as it could be the first queue in a possible burst)
@@ -1371,7 +1450,15 @@
{
struct bfq_entity *entity = &bfqq->entity;
- if (bfq_bfqq_non_blocking_wait_rq(bfqq) && arrived_in_time) {
+ /*
+ * In the next compound condition, we check also whether there
+ * is some budget left, because otherwise there is no point in
+ * trying to go on serving bfqq with this same budget: bfqq
+ * would be expired immediately after being selected for
+ * service. This would only cause useless overhead.
+ */
+ if (bfq_bfqq_non_blocking_wait_rq(bfqq) && arrived_in_time &&
+ bfq_bfqq_budget_left(bfqq) > 0) {
/*
* We do not clear the flag non_blocking_wait_rq here, as
* the latter is used in bfq_activate_bfqq to signal
@@ -1560,6 +1647,7 @@
*/
in_burst = bfq_bfqq_in_large_burst(bfqq);
soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
+ !BFQQ_TOTALLY_SEEKY(bfqq) &&
!in_burst &&
time_is_before_jiffies(bfqq->soft_rt_next_start) &&
bfqq->dispatched == 0;
@@ -1656,6 +1744,72 @@
false, BFQQE_PREEMPTED);
}
+static void bfq_reset_inject_limit(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq)
+{
+ /* invalidate baseline total service time */
+ bfqq->last_serv_time_ns = 0;
+
+ /*
+ * Reset pointer in case we are waiting for
+ * some request completion.
+ */
+ bfqd->waited_rq = NULL;
+
+ /*
+ * If bfqq has a short think time, then start by setting the
+ * inject limit to 0 prudentially, because the service time of
+ * an injected I/O request may be higher than the think time
+ * of bfqq, and therefore, if one request was injected when
+ * bfqq remains empty, this injected request might delay the
+ * service of the next I/O request for bfqq significantly. In
+ * case bfqq can actually tolerate some injection, then the
+ * adaptive update will however raise the limit soon. This
+ * lucky circumstance holds exactly because bfqq has a short
+ * think time, and thus, after remaining empty, is likely to
+ * get new I/O enqueued---and then completed---before being
+ * expired. This is the very pattern that gives the
+ * limit-update algorithm the chance to measure the effect of
+ * injection on request service times, and then to update the
+ * limit accordingly.
+ *
+ * However, in the following special case, the inject limit is
+ * left to 1 even if the think time is short: bfqq's I/O is
+ * synchronized with that of some other queue, i.e., bfqq may
+ * receive new I/O only after the I/O of the other queue is
+ * completed. Keeping the inject limit to 1 allows the
+ * blocking I/O to be served while bfqq is in service. And
+ * this is very convenient both for bfqq and for overall
+ * throughput, as explained in detail in the comments in
+ * bfq_update_has_short_ttime().
+ *
+ * On the opposite end, if bfqq has a long think time, then
+ * start directly with 1, because:
+ * a) on the bright side, keeping at most one request in
+ * service in the drive is unlikely to cause any harm to the
+ * latency of bfqq's requests, as the service time of a single
+ * request is likely to be lower than the think time of bfqq;
+ * b) on the downside, after becoming empty, bfqq is likely to
+ * expire before getting its next request. With this request
+ * arrival pattern, it is very hard to sample total service
+ * times and update the inject limit accordingly (see comments
+ * on bfq_update_inject_limit()). So the limit is likely to be
+ * never, or at least seldom, updated. As a consequence, by
+ * setting the limit to 1, we avoid ending up in a state where no
+ * injection ever occurs for bfqq. On the downside, this proactive step
+ * further reduces chances to actually compute the baseline
+ * total service time. Thus it reduces chances to execute the
+ * limit-update algorithm and possibly raise the limit to more
+ * than 1.
+ */
+ if (bfq_bfqq_has_short_ttime(bfqq))
+ bfqq->inject_limit = 0;
+ else
+ bfqq->inject_limit = 1;
+
+ bfqq->decrease_time_jif = jiffies;
+}
+
static void bfq_add_request(struct request *rq)
{
struct bfq_queue *bfqq = RQ_BFQQ(rq);
@@ -1668,6 +1822,60 @@
bfqq->queued[rq_is_sync(rq)]++;
bfqd->queued++;
+ if (RB_EMPTY_ROOT(&bfqq->sort_list) && bfq_bfqq_sync(bfqq)) {
+ /*
+ * Periodically reset inject limit, to make sure that
+ * the latter eventually drops in case workload
+ * changes, see step (3) in the comments on
+ * bfq_update_inject_limit().
+ */
+ if (time_is_before_eq_jiffies(bfqq->decrease_time_jif +
+ msecs_to_jiffies(1000)))
+ bfq_reset_inject_limit(bfqd, bfqq);
+
+ /*
+ * The following conditions must hold to setup a new
+ * sampling of total service time, and then a new
+ * update of the inject limit:
+ * - bfqq is in service, because the total service
+ * time is evaluated only for the I/O requests of
+ * the queues in service;
+ * - this is the right occasion to compute or to
+ * lower the baseline total service time, because
+ * there are actually no requests in the drive,
+ * or
+ * the baseline total service time is available, and
+ * this is the right occasion to compute the other
+ * quantity needed to update the inject limit, i.e.,
+ * the total service time caused by the amount of
+ * injection allowed by the current value of the
+ * limit. It is the right occasion because injection
+ * has actually been performed during the service
+ * hole, and there are still in-flight requests,
+ * which are very likely to be exactly the injected
+ * requests, or part of them;
+ * - the minimum interval for sampling the total
+ * service time and updating the inject limit has
+ * elapsed.
+ */
+ if (bfqq == bfqd->in_service_queue &&
+ (bfqd->rq_in_driver == 0 ||
+ (bfqq->last_serv_time_ns > 0 &&
+ bfqd->rqs_injected && bfqd->rq_in_driver > 0)) &&
+ time_is_before_eq_jiffies(bfqq->decrease_time_jif +
+ msecs_to_jiffies(100))) {
+ bfqd->last_empty_occupied_ns = ktime_get_ns();
+ /*
+ * Start the state machine for measuring the
+ * total service time of rq: setting
+ * wait_dispatch will cause bfqd->waited_rq to
+ * be set when rq will be dispatched.
+ */
+ bfqd->wait_dispatch = true;
+ bfqd->rqs_injected = false;
+ }
+ }
+
elv_rb_add(&bfqq->sort_list, rq);
/*
@@ -1679,8 +1887,9 @@
/*
* Adjust priority tree position, if next_rq changes.
+ * See comments on bfq_pos_tree_add_move() for the unlikely().
*/
- if (prev != bfqq->next_rq)
+ if (unlikely(!bfqd->nonrot_with_queueing && prev != bfqq->next_rq))
bfq_pos_tree_add_move(bfqd, bfqq);
if (!bfq_bfqq_busy(bfqq)) /* switching to busy ... */
@@ -1820,7 +2029,9 @@
bfqq->pos_root = NULL;
}
} else {
- bfq_pos_tree_add_move(bfqd, bfqq);
+ /* see comments on bfq_pos_tree_add_move() for the unlikely() */
+ if (unlikely(!bfqd->nonrot_with_queueing))
+ bfq_pos_tree_add_move(bfqd, bfqq);
}
if (rq->cmd_flags & REQ_META)
@@ -1910,7 +2121,12 @@
*/
if (prev != bfqq->next_rq) {
bfq_updated_next_req(bfqd, bfqq);
- bfq_pos_tree_add_move(bfqd, bfqq);
+ /*
+ * See comments on bfq_pos_tree_add_move() for
+ * the unlikely().
+ */
+ if (unlikely(!bfqd->nonrot_with_queueing))
+ bfq_pos_tree_add_move(bfqd, bfqq);
}
}
}
@@ -2196,6 +2412,46 @@
struct bfq_queue *in_service_bfqq, *new_bfqq;
/*
+ * Do not perform queue merging if the device is non
+ * rotational and performs internal queueing. In fact, such a
+ * device reaches a high speed through internal parallelism
+ * and pipelining. This means that, to reach a high
+ * throughput, it must have many requests enqueued at the same
+ * time. But, in this configuration, the internal scheduling
+ * algorithm of the device does exactly the job of queue
+ * merging: it reorders requests so as to obtain as much as
+ * possible a sequential I/O pattern. As a consequence, with
+ * the workload generated by processes doing interleaved I/O,
+ * the throughput reached by the device is likely to be the
+ * same, with and without queue merging.
+ *
+ * Disabling merging also provides a remarkable benefit in
+ * terms of throughput. Merging tends to make many workloads
+ * artificially more uneven, because of shared queues
+ * remaining non empty for incomparably more time than
+ * non-merged queues. This may accentuate workload
+ * asymmetries. For example, if one of the queues in a set of
+ * merged queues has a higher weight than a normal queue, then
+ * the shared queue may inherit such a high weight and, by
+ * staying almost always active, may force BFQ to perform I/O
+ * plugging most of the time. This evidently makes it harder
+ * for BFQ to let the device reach a high throughput.
+ *
+ * Finally, the likely() macro below is not used because one
+ * of the two branches is more likely than the other, but to
+ * have the code path after the following if() executed as
+ * fast as possible for the case of a non rotational device
+ * with queueing. We want it because this is the fastest kind
+ * of device. On the opposite end, the likely() may lengthen
+ * the execution time of BFQ for the case of slower devices
+ * (rotational or at least without queueing). But in this case
+ * the execution time of BFQ matters very little, if not at
+ * all.
+ */
+ if (likely(bfqd->nonrot_with_queueing))
+ return NULL;
+
+ /*
* Prevent bfqq from being merged if it has been created too
* long ago. The idea is that true cooperating processes, and
* thus their associated bfq_queues, are supposed to be
@@ -2216,7 +2472,7 @@
return NULL;
/* If there is only one backlogged queue, don't search. */
- if (bfqd->busy_queues == 1)
+ if (bfq_tot_busy_queues(bfqd) == 1)
return NULL;
in_service_bfqq = bfqd->in_service_queue;
@@ -2258,6 +2514,7 @@
if (!bic)
return;
+ bic->saved_weight = bfqq->entity.orig_weight;
bic->saved_ttime = bfqq->ttime;
bic->saved_has_short_ttime = bfq_bfqq_has_short_ttime(bfqq);
bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq);
@@ -2276,6 +2533,7 @@
* to enjoy weight raising if split soon.
*/
bic->saved_wr_coeff = bfqq->bfqd->bfq_wr_coeff;
+ bic->saved_wr_start_at_switch_to_srt = bfq_smallest_from_now();
bic->saved_wr_cur_max_time = bfq_wr_duration(bfqq->bfqd);
bic->saved_last_wr_start_finish = jiffies;
} else {
@@ -2346,6 +2604,16 @@
* assignment causes no harm).
*/
new_bfqq->bic = NULL;
+ /*
+ * If the queue is shared, the pid is the pid of one of the associated
+ * processes. Which pid depends on the exact sequence of merge events
+ * the queue underwent. So printing such a pid is useless and confusing
+ * because it reports a random pid between those of the associated
+ * processes.
+ * We mark such a queue with a pid -1, and then print SHARED instead of
+ * a pid in logging messages.
+ */
+ new_bfqq->pid = -1;
bfqq->bic = NULL;
/* release process reference to bfqq */
bfq_put_queue(bfqq);
@@ -2380,8 +2648,8 @@
/*
* bic still points to bfqq, then it has not yet been
* redirected to some other bfq_queue, and a queue
- * merge beween bfqq and new_bfqq can be safely
- * fulfillled, i.e., bic can be redirected to new_bfqq
+ * merge between bfqq and new_bfqq can be safely
+ * fulfilled, i.e., bic can be redirected to new_bfqq
* and bfqq can be put.
*/
bfq_merge_bfqqs(bfqd, bfqd->bio_bic, bfqq,
@@ -2515,12 +2783,14 @@
* queue).
*/
if (BFQQ_SEEKY(bfqq) && bfqq->wr_coeff == 1 &&
- bfq_symmetric_scenario(bfqd))
+ !bfq_asymmetric_scenario(bfqd, bfqq))
sl = min_t(u64, sl, BFQ_MIN_TT);
else if (bfqq->wr_coeff > 1)
sl = max_t(u32, sl, 20ULL * NSEC_PER_MSEC);
bfqd->last_idling_start = ktime_get();
+ bfqd->last_idling_start_jiffies = jiffies;
+
hrtimer_start(&bfqd->idle_slice_timer, ns_to_ktime(sl),
HRTIMER_MODE_REL);
bfqg_stats_set_start_idle_time(bfqq_group(bfqq));
@@ -2744,7 +3014,7 @@
if ((bfqd->rq_in_driver > 0 ||
now_ns - bfqd->last_completion < BFQ_MIN_TT)
- && get_sdist(bfqd->last_position, rq) < BFQQ_SEEK_THR)
+ && !BFQ_RQ_SEEKY(bfqd, bfqd->last_position, rq))
bfqd->sequential_samples++;
bfqd->tot_sectors_dispatched += blk_rq_sectors(rq);
@@ -2796,7 +3066,7 @@
bfq_remove_request(q, rq);
}
-static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+static bool __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
{
/*
* If this bfqq is shared between multiple processes, check
@@ -2822,16 +3092,20 @@
bfq_requeue_bfqq(bfqd, bfqq, true);
/*
* Resort priority tree of potential close cooperators.
+ * See comments on bfq_pos_tree_add_move() for the unlikely().
*/
- bfq_pos_tree_add_move(bfqd, bfqq);
+ if (unlikely(!bfqd->nonrot_with_queueing))
+ bfq_pos_tree_add_move(bfqd, bfqq);
}
/*
* All in-service entities must have been properly deactivated
* or requeued before executing the next function, which
- * resets all in-service entites as no more in service.
+ * resets all in-service entities as no more in service. This
+ * may cause bfqq to be freed. If this happens, the next
+ * function returns true.
*/
- __bfq_bfqd_reset_in_service(bfqd);
+ return __bfq_bfqd_reset_in_service(bfqd);
}
/**
@@ -3195,13 +3469,6 @@
jiffies + nsecs_to_jiffies(bfqq->bfqd->bfq_slice_idle) + 4);
}
-static bool bfq_bfqq_injectable(struct bfq_queue *bfqq)
-{
- return BFQQ_SEEKY(bfqq) && bfqq->wr_coeff == 1 &&
- blk_queue_nonrot(bfqq->bfqd->queue) &&
- bfqq->bfqd->hw_tag;
-}
-
/**
* bfq_bfqq_expire - expire a queue.
* @bfqd: device owning the queue.
@@ -3236,7 +3503,6 @@
bool slow;
unsigned long delta = 0;
struct bfq_entity *entity = &bfqq->entity;
- int ref;
/*
* Check whether the process is slow (see bfq_bfqq_is_slow).
@@ -3278,16 +3544,32 @@
* requests, then the request pattern is isochronous
* (see the comments on the function
* bfq_bfqq_softrt_next_start()). Thus we can compute
- * soft_rt_next_start. If, instead, the queue still
- * has outstanding requests, then we have to wait for
- * the completion of all the outstanding requests to
- * discover whether the request pattern is actually
- * isochronous.
+ * soft_rt_next_start. And we do it, unless bfqq is in
+ * interactive weight raising. We do not do it in the
+ * latter subcase, for the following reason. bfqq may
+ * be conveying the I/O needed to load a soft
+ * real-time application. Such an application will
+ * actually exhibit a soft real-time I/O pattern after
+ * it finally starts doing its job. But, if
+ * soft_rt_next_start is computed here for an
+ * interactive bfqq, and bfqq had received a lot of
+ * service before remaining with no outstanding
+ * request (likely to happen on a fast device), then
+ * soft_rt_next_start would be assigned such a high
+ * value that, for a very long time, bfqq would be
+ * prevented from being possibly considered as soft
+ * real time.
+ *
+ * If, instead, the queue still has outstanding
+ * requests, then we have to wait for the completion
+ * of all the outstanding requests to discover whether
+ * the request pattern is actually isochronous.
*/
- if (bfqq->dispatched == 0)
+ if (bfqq->dispatched == 0 &&
+ bfqq->wr_coeff != bfqd->bfq_wr_coeff)
bfqq->soft_rt_next_start =
bfq_bfqq_softrt_next_start(bfqd, bfqq);
- else {
+ else if (bfqq->dispatched > 0) {
/*
* Schedule an update of soft_rt_next_start to when
* the task may be discovered to be isochronous.
@@ -3301,18 +3583,22 @@
slow, bfqq->dispatched, bfq_bfqq_has_short_ttime(bfqq));
/*
+ * bfqq expired, so no total service time needs to be computed
+ * any longer: reset state machine for measuring total service
+ * times.
+ */
+ bfqd->rqs_injected = bfqd->wait_dispatch = false;
+ bfqd->waited_rq = NULL;
+
+ /*
* Increase, decrease or leave budget unchanged according to
* reason.
*/
__bfq_bfqq_recalc_budget(bfqd, bfqq, reason);
- ref = bfqq->ref;
- __bfq_bfqq_expire(bfqd, bfqq);
-
- if (ref == 1) /* bfqq is gone, no more actions on it */
+ if (__bfq_bfqq_expire(bfqd, bfqq))
+ /* bfqq is gone, no more actions on it */
return;
- bfqq->injected_service = 0;
-
/* mark bfqq as waiting a request only if a bic still points to it */
if (!bfq_bfqq_busy(bfqq) &&
reason != BFQQE_BUDGET_TIMEOUT &&
@@ -3380,53 +3666,13 @@
bfq_bfqq_budget_timeout(bfqq);
}
-/*
- * For a queue that becomes empty, device idling is allowed only if
- * this function returns true for the queue. As a consequence, since
- * device idling plays a critical role in both throughput boosting and
- * service guarantees, the return value of this function plays a
- * critical role in both these aspects as well.
- *
- * In a nutshell, this function returns true only if idling is
- * beneficial for throughput or, even if detrimental for throughput,
- * idling is however necessary to preserve service guarantees (low
- * latency, desired throughput distribution, ...). In particular, on
- * NCQ-capable devices, this function tries to return false, so as to
- * help keep the drives' internal queues full, whenever this helps the
- * device boost the throughput without causing any service-guarantee
- * issue.
- *
- * In more detail, the return value of this function is obtained by,
- * first, computing a number of boolean variables that take into
- * account throughput and service-guarantee issues, and, then,
- * combining these variables in a logical expression. Most of the
- * issues taken into account are not trivial. We discuss these issues
- * individually while introducing the variables.
- */
-static bool bfq_better_to_idle(struct bfq_queue *bfqq)
+static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq)
{
- struct bfq_data *bfqd = bfqq->bfqd;
bool rot_without_queueing =
!blk_queue_nonrot(bfqd->queue) && !bfqd->hw_tag,
bfqq_sequential_and_IO_bound,
- idling_boosts_thr, idling_boosts_thr_without_issues,
- idling_needed_for_service_guarantees,
- asymmetric_scenario;
-
- if (bfqd->strict_guarantees)
- return true;
-
- /*
- * Idling is performed only if slice_idle > 0. In addition, we
- * do not idle if
- * (a) bfqq is async
- * (b) bfqq is in the idle io prio class: in this case we do
- * not idle because we want to minimize the bandwidth that
- * queues in this class can steal to higher-priority queues
- */
- if (bfqd->bfq_slice_idle == 0 || !bfq_bfqq_sync(bfqq) ||
- bfq_class_idle(bfqq))
- return false;
+ idling_boosts_thr;
bfqq_sequential_and_IO_bound = !BFQQ_SEEKY(bfqq) &&
bfq_bfqq_IO_bound(bfqq) && bfq_bfqq_has_short_ttime(bfqq);
@@ -3458,8 +3704,7 @@
bfqq_sequential_and_IO_bound);
/*
- * The value of the next variable,
- * idling_boosts_thr_without_issues, is equal to that of
+ * The return value of this function is equal to that of
* idling_boosts_thr, unless a special case holds. In this
* special case, described below, idling may cause problems to
* weight-raised queues.
@@ -3476,169 +3721,259 @@
* which enqueue several requests in advance, and further
* reorder internally-queued requests.
*
- * For this reason, we force to false the value of
- * idling_boosts_thr_without_issues if there are weight-raised
- * busy queues. In this case, and if bfqq is not weight-raised,
- * this guarantees that the device is not idled for bfqq (if,
- * instead, bfqq is weight-raised, then idling will be
- * guaranteed by another variable, see below). Combined with
- * the timestamping rules of BFQ (see [1] for details), this
- * behavior causes bfqq, and hence any sync non-weight-raised
- * queue, to get a lower number of requests served, and thus
- * to ask for a lower number of requests from the request
- * pool, before the busy weight-raised queues get served
- * again. This often mitigates starvation problems in the
- * presence of heavy write workloads and NCQ, thereby
- * guaranteeing a higher application and system responsiveness
- * in these hostile scenarios.
+ * For this reason, we force to false the return value if
+ * there are weight-raised busy queues. In this case, and if
+ * bfqq is not weight-raised, this guarantees that the device
+ * is not idled for bfqq (if, instead, bfqq is weight-raised,
+ * then idling will be guaranteed by another variable, see
+ * below). Combined with the timestamping rules of BFQ (see
+ * [1] for details), this behavior causes bfqq, and hence any
+ * sync non-weight-raised queue, to get a lower number of
+ * requests served, and thus to ask for a lower number of
+ * requests from the request pool, before the busy
+ * weight-raised queues get served again. This often mitigates
+ * starvation problems in the presence of heavy write
+ * workloads and NCQ, thereby guaranteeing a higher
+ * application and system responsiveness in these hostile
+ * scenarios.
*/
- idling_boosts_thr_without_issues = idling_boosts_thr &&
+ return idling_boosts_thr &&
bfqd->wr_busy_queues == 0;
+}
+
+/*
+ * There is a case where idling does not have to be performed for
+ * throughput concerns, but to preserve the throughput share of
+ * the process associated with bfqq.
+ *
+ * To introduce this case, we can note that allowing the drive
+ * to enqueue more than one request at a time, and hence
+ * delegating de facto final scheduling decisions to the
+ * drive's internal scheduler, entails loss of control on the
+ * actual request service order. In particular, the critical
+ * situation is when requests from different processes happen
+ * to be present, at the same time, in the internal queue(s)
+ * of the drive. In such a situation, the drive, by deciding
+ * the service order of the internally-queued requests, does
+ * determine also the actual throughput distribution among
+ * these processes. But the drive typically has no notion or
+ * concern about per-process throughput distribution, and
+ * makes its decisions only on a per-request basis. Therefore,
+ * the service distribution enforced by the drive's internal
+ * scheduler is likely to coincide with the desired throughput
+ * distribution only in a completely symmetric, or favorably
+ * skewed scenario where:
+ * (i-a) each of these processes must get the same throughput as
+ * the others,
+ * (i-b) in case (i-a) does not hold, it holds that the process
+ * associated with bfqq must receive a throughput lower than
+ * or equal to that of any of the other processes;
+ * (ii) the I/O of each process has the same properties, in
+ * terms of locality (sequential or random), direction
+ * (reads or writes), request sizes, greediness
+ * (from I/O-bound to sporadic), and so on;
+ *
+ * In fact, in such a scenario, the drive tends to treat the requests
+ * of each process in about the same way as the requests of the
+ * others, and thus to provide each of these processes with about the
+ * same throughput. This is exactly the desired throughput
+ * distribution if (i-a) holds, or, if (i-b) holds instead, this is an
+ * even more convenient distribution for (the process associated with)
+ * bfqq.
+ *
+ * In contrast, in any asymmetric or unfavorable scenario, device
+ * idling (I/O-dispatch plugging) is certainly needed to guarantee
+ * that bfqq receives its assigned fraction of the device throughput
+ * (see [1] for details).
+ *
+ * The problem is that idling may significantly reduce throughput with
+ * certain combinations of types of I/O and devices. An important
+ * example is sync random I/O on flash storage with command
+ * queueing. So, unless bfqq falls in cases where idling also boosts
+ * throughput, it is important to check conditions (i-a), (i-b) and
+ * (ii) accurately, so as to avoid idling when not strictly needed for
+ * service guarantees.
+ *
+ * Unfortunately, it is extremely difficult to thoroughly check
+ * condition (ii). And, in case there are active groups, it becomes
+ * very difficult to check conditions (i-a) and (i-b) too. In fact,
+ * if there are active groups, then, for conditions (i-a) or (i-b) to
+ * become false 'indirectly', it is enough that an active group
+ * contains more active processes or sub-groups than some other active
+ * group. More precisely, for conditions (i-a) or (i-b) to become
+ * false because of such a group, it is not even necessary that the
+ * group is (still) active: it is sufficient that, even if the group
+ * has become inactive, some of its descendant processes still have
+ * some request already dispatched but still waiting for
+ * completion. In fact, requests have still to be guaranteed their
+ * share of the throughput even after being dispatched. In this
+ * respect, it is easy to show that, if a group frequently becomes
+ * inactive while still having in-flight requests, and if, when this
+ * happens, the group is not considered in the calculation of whether
+ * the scenario is asymmetric, then the group may fail to be
+ * guaranteed its fair share of the throughput (basically because
+ * idling may not be performed for the descendant processes of the
+ * group, but it had to be). We address this issue with the following
+ * bi-modal behavior, implemented in the function
+ * bfq_asymmetric_scenario().
+ *
+ * If there are groups with requests waiting for completion
+ * (as commented above, some of these groups may even be
+ * already inactive), then the scenario is tagged as
+ * asymmetric, conservatively, without checking any of the
+ * conditions (i-a), (i-b) or (ii). So the device is idled for bfqq.
+ * This behavior matches also the fact that groups are created
+ * exactly if controlling I/O is a primary concern (to
+ * preserve bandwidth and latency guarantees).
+ *
+ * On the opposite end, if there are no groups with requests waiting
+ * for completion, then only conditions (i-a) and (i-b) are actually
+ * controlled, i.e., provided that conditions (i-a) or (i-b) holds,
+ * idling is not performed, regardless of whether condition (ii)
+ * holds. In other words, only if conditions (i-a) and (i-b) do not
+ * hold, then idling is allowed, and the device tends to be prevented
+ * from queueing many requests, possibly of several processes. Since
+ * there are no groups with requests waiting for completion, then, to
+ * control conditions (i-a) and (i-b) it is enough to check just
+ * whether all the queues with requests waiting for completion also
+ * have the same weight.
+ *
+ * Not checking condition (ii) evidently exposes bfqq to the
+ * risk of getting less throughput than its fair share.
+ * However, for queues with the same weight, a further
+ * mechanism, preemption, mitigates or even eliminates this
+ * problem. And it does so without consequences on overall
+ * throughput. This mechanism and its benefits are explained
+ * in the next three paragraphs.
+ *
+ * Even if a queue, say Q, is expired when it remains idle, Q
+ * can still preempt the new in-service queue if the next
+ * request of Q arrives soon (see the comments on
+ * bfq_bfqq_update_budg_for_activation). If all queues and
+ * groups have the same weight, this form of preemption,
+ * combined with the hole-recovery heuristic described in the
+ * comments on function bfq_bfqq_update_budg_for_activation,
+ * are enough to preserve a correct bandwidth distribution in
+ * the mid term, even without idling. In fact, even if not
+ * idling allows the internal queues of the device to contain
+ * many requests, and thus to reorder requests, we can rather
+ * safely assume that the internal scheduler still preserves a
+ * minimum of mid-term fairness.
+ *
+ * More precisely, this preemption-based, idleless approach
+ * provides fairness in terms of IOPS, and not sectors per
+ * second. This can be seen with a simple example. Suppose
+ * that there are two queues with the same weight, but that
+ * the first queue receives requests of 8 sectors, while the
+ * second queue receives requests of 1024 sectors. In
+ * addition, suppose that each of the two queues contains at
+ * most one request at a time, which implies that each queue
+ * always remains idle after it is served. Finally, after
+ * remaining idle, each queue receives very quickly a new
+ * request. It follows that the two queues are served
+ * alternatively, preempting each other if needed. This
+ * implies that, although both queues have the same weight,
+ * the queue with large requests receives a service that is
+ * 1024/8 times as high as the service received by the other
+ * queue.
+ *
+ * The motivation for using preemption instead of idling (for
+ * queues with the same weight) is that, by not idling,
+ * service guarantees are preserved (completely or at least in
+ * part) without minimally sacrificing throughput. And, if
+ * there is no active group, then the primary expectation for
+ * this device is probably a high throughput.
+ *
+ * We are now left only with explaining the additional
+ * compound condition that is checked below for deciding
+ * whether the scenario is asymmetric. To explain this
+ * compound condition, we need to add that the function
+ * bfq_asymmetric_scenario checks the weights of only
+ * non-weight-raised queues, for efficiency reasons (see
+ * comments on bfq_weights_tree_add()). Then the fact that
+ * bfqq is weight-raised is checked explicitly here. More
+ * precisely, the compound condition below takes into account
+ * also the fact that, even if bfqq is being weight-raised,
+ * the scenario is still symmetric if all queues with requests
+ * waiting for completion happen to be
+ * weight-raised. Actually, we should be even more precise
+ * here, and differentiate between interactive weight raising
+ * and soft real-time weight raising.
+ *
+ * As a side note, it is worth considering that the above
+ * device-idling countermeasures may however fail in the
+ * following unlucky scenario: if idling is (correctly)
+ * disabled in a time period during which all symmetry
+ * sub-conditions hold, and hence the device is allowed to
+ * enqueue many requests, but at some later point in time some
+ * sub-condition ceases to hold, then it may become impossible
+ * to let requests be served in the desired order until all
+ * the requests already queued in the device have been served.
+ */
+static bool idling_needed_for_service_guarantees(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq)
+{
+ return (bfqq->wr_coeff > 1 &&
+ bfqd->wr_busy_queues <
+ bfq_tot_busy_queues(bfqd)) ||
+ bfq_asymmetric_scenario(bfqd, bfqq);
+}
+
+/*
+ * For a queue that becomes empty, device idling is allowed only if
+ * this function returns true for that queue. As a consequence, since
+ * device idling plays a critical role for both throughput boosting
+ * and service guarantees, the return value of this function plays a
+ * critical role as well.
+ *
+ * In a nutshell, this function returns true only if idling is
+ * beneficial for throughput or, even if detrimental for throughput,
+ * idling is however necessary to preserve service guarantees (low
+ * latency, desired throughput distribution, ...). In particular, on
+ * NCQ-capable devices, this function tries to return false, so as to
+ * help keep the drives' internal queues full, whenever this helps the
+ * device boost the throughput without causing any service-guarantee
+ * issue.
+ *
+ * Most of the issues taken into account to get the return value of
+ * this function are not trivial. We discuss these issues in the two
+ * functions providing the main pieces of information needed by this
+ * function.
+ */
+static bool bfq_better_to_idle(struct bfq_queue *bfqq)
+{
+ struct bfq_data *bfqd = bfqq->bfqd;
+ bool idling_boosts_thr_with_no_issue, idling_needed_for_service_guar;
+
+ if (unlikely(bfqd->strict_guarantees))
+ return true;
/*
- * There is then a case where idling must be performed not
- * for throughput concerns, but to preserve service
- * guarantees.
- *
- * To introduce this case, we can note that allowing the drive
- * to enqueue more than one request at a time, and hence
- * delegating de facto final scheduling decisions to the
- * drive's internal scheduler, entails loss of control on the
- * actual request service order. In particular, the critical
- * situation is when requests from different processes happen
- * to be present, at the same time, in the internal queue(s)
- * of the drive. In such a situation, the drive, by deciding
- * the service order of the internally-queued requests, does
- * determine also the actual throughput distribution among
- * these processes. But the drive typically has no notion or
- * concern about per-process throughput distribution, and
- * makes its decisions only on a per-request basis. Therefore,
- * the service distribution enforced by the drive's internal
- * scheduler is likely to coincide with the desired
- * device-throughput distribution only in a completely
- * symmetric scenario where:
- * (i) each of these processes must get the same throughput as
- * the others;
- * (ii) all these processes have the same I/O pattern
- (either sequential or random).
- * In fact, in such a scenario, the drive will tend to treat
- * the requests of each of these processes in about the same
- * way as the requests of the others, and thus to provide
- * each of these processes with about the same throughput
- * (which is exactly the desired throughput distribution). In
- * contrast, in any asymmetric scenario, device idling is
- * certainly needed to guarantee that bfqq receives its
- * assigned fraction of the device throughput (see [1] for
- * details).
- *
- * We address this issue by controlling, actually, only the
- * symmetry sub-condition (i), i.e., provided that
- * sub-condition (i) holds, idling is not performed,
- * regardless of whether sub-condition (ii) holds. In other
- * words, only if sub-condition (i) holds, then idling is
- * allowed, and the device tends to be prevented from queueing
- * many requests, possibly of several processes. The reason
- * for not controlling also sub-condition (ii) is that we
- * exploit preemption to preserve guarantees in case of
- * symmetric scenarios, even if (ii) does not hold, as
- * explained in the next two paragraphs.
- *
- * Even if a queue, say Q, is expired when it remains idle, Q
- * can still preempt the new in-service queue if the next
- * request of Q arrives soon (see the comments on
- * bfq_bfqq_update_budg_for_activation). If all queues and
- * groups have the same weight, this form of preemption,
- * combined with the hole-recovery heuristic described in the
- * comments on function bfq_bfqq_update_budg_for_activation,
- * are enough to preserve a correct bandwidth distribution in
- * the mid term, even without idling. In fact, even if not
- * idling allows the internal queues of the device to contain
- * many requests, and thus to reorder requests, we can rather
- * safely assume that the internal scheduler still preserves a
- * minimum of mid-term fairness. The motivation for using
- * preemption instead of idling is that, by not idling,
- * service guarantees are preserved without minimally
- * sacrificing throughput. In other words, both a high
- * throughput and its desired distribution are obtained.
- *
- * More precisely, this preemption-based, idleless approach
- * provides fairness in terms of IOPS, and not sectors per
- * second. This can be seen with a simple example. Suppose
- * that there are two queues with the same weight, but that
- * the first queue receives requests of 8 sectors, while the
- * second queue receives requests of 1024 sectors. In
- * addition, suppose that each of the two queues contains at
- * most one request at a time, which implies that each queue
- * always remains idle after it is served. Finally, after
- * remaining idle, each queue receives very quickly a new
- * request. It follows that the two queues are served
- * alternatively, preempting each other if needed. This
- * implies that, although both queues have the same weight,
- * the queue with large requests receives a service that is
- * 1024/8 times as high as the service received by the other
- * queue.
- *
- * On the other hand, device idling is performed, and thus
- * pure sector-domain guarantees are provided, for the
- * following queues, which are likely to need stronger
- * throughput guarantees: weight-raised queues, and queues
- * with a higher weight than other queues. When such queues
- * are active, sub-condition (i) is false, which triggers
- * device idling.
- *
- * According to the above considerations, the next variable is
- * true (only) if sub-condition (i) holds. To compute the
- * value of this variable, we not only use the return value of
- * the function bfq_symmetric_scenario(), but also check
- * whether bfqq is being weight-raised, because
- * bfq_symmetric_scenario() does not take into account also
- * weight-raised queues (see comments on
- * bfq_weights_tree_add()). In particular, if bfqq is being
- * weight-raised, it is important to idle only if there are
- * other, non-weight-raised queues that may steal throughput
- * to bfqq. Actually, we should be even more precise, and
- * differentiate between interactive weight raising and
- * soft real-time weight raising.
- *
- * As a side note, it is worth considering that the above
- * device-idling countermeasures may however fail in the
- * following unlucky scenario: if idling is (correctly)
- * disabled in a time period during which all symmetry
- * sub-conditions hold, and hence the device is allowed to
- * enqueue many requests, but at some later point in time some
- * sub-condition stops to hold, then it may become impossible
- * to let requests be served in the desired order until all
- * the requests already queued in the device have been served.
+ * Idling is performed only if slice_idle > 0. In addition, we
+ * do not idle if
+ * (a) bfqq is async
+ * (b) bfqq is in the idle io prio class: in this case we do
+ * not idle because we want to minimize the bandwidth that
+ * queues in this class can steal from higher-priority queues
*/
- asymmetric_scenario = (bfqq->wr_coeff > 1 &&
- bfqd->wr_busy_queues < bfqd->busy_queues) ||
- !bfq_symmetric_scenario(bfqd);
+ if (bfqd->bfq_slice_idle == 0 || !bfq_bfqq_sync(bfqq) ||
+ bfq_class_idle(bfqq))
+ return false;
+
+ idling_boosts_thr_with_no_issue =
+ idling_boosts_thr_without_issues(bfqd, bfqq);
+
+ idling_needed_for_service_guar =
+ idling_needed_for_service_guarantees(bfqd, bfqq);
/*
- * Finally, there is a case where maximizing throughput is the
- * best choice even if it may cause unfairness toward
- * bfqq. Such a case is when bfqq became active in a burst of
- * queue activations. Queues that became active during a large
- * burst benefit only from throughput, as discussed in the
- * comments on bfq_handle_burst. Thus, if bfqq became active
- * in a burst and not idling the device maximizes throughput,
- * then the device must no be idled, because not idling the
- * device provides bfqq and all other queues in the burst with
- * maximum benefit. Combining this and the above case, we can
- * now establish when idling is actually needed to preserve
- * service guarantees.
- */
- idling_needed_for_service_guarantees =
- asymmetric_scenario && !bfq_bfqq_in_large_burst(bfqq);
-
- /*
- * We have now all the components we need to compute the
+ * We now have the two components we need to compute the
* return value of the function, which is true only if idling
* either boosts the throughput (without issues), or is
* necessary to preserve service guarantees.
*/
- return idling_boosts_thr_without_issues ||
- idling_needed_for_service_guarantees;
+ return idling_boosts_thr_with_no_issue ||
+ idling_needed_for_service_guar;
}
/*
@@ -3657,26 +3992,98 @@
return RB_EMPTY_ROOT(&bfqq->sort_list) && bfq_better_to_idle(bfqq);
}
-static struct bfq_queue *bfq_choose_bfqq_for_injection(struct bfq_data *bfqd)
+/*
+ * This function chooses the queue from which to pick the next extra
+ * I/O request to inject, if it finds a compatible queue. See the
+ * comments on bfq_update_inject_limit() for details on the injection
+ * mechanism, and for the definitions of the quantities mentioned
+ * below.
+ */
+static struct bfq_queue *
+bfq_choose_bfqq_for_injection(struct bfq_data *bfqd)
{
- struct bfq_queue *bfqq;
+ struct bfq_queue *bfqq, *in_serv_bfqq = bfqd->in_service_queue;
+ unsigned int limit = in_serv_bfqq->inject_limit;
+ /*
+ * If
+ * - bfqq is not weight-raised and therefore does not carry
+ * time-critical I/O,
+ * or
+ * - regardless of whether bfqq is weight-raised, bfqq has
+ * however a long think time, during which it can absorb the
+ * effect of an appropriate number of extra I/O requests
+ * from other queues (see bfq_update_inject_limit for
+ * details on the computation of this number);
+ * then injection can be performed without restrictions.
+ */
+ bool in_serv_always_inject = in_serv_bfqq->wr_coeff == 1 ||
+ !bfq_bfqq_has_short_ttime(in_serv_bfqq);
/*
- * A linear search; but, with a high probability, very few
- * steps are needed to find a candidate queue, i.e., a queue
- * with enough budget left for its next request. In fact:
+ * If
+ * - the baseline total service time could not be sampled yet,
+ * so the inject limit happens to be still 0, and
+ * - a lot of time has elapsed since the plugging of I/O
+ * dispatching started, so drive speed is being wasted
+ * significantly;
+ * then temporarily raise inject limit to one request.
+ */
+ if (limit == 0 && in_serv_bfqq->last_serv_time_ns == 0 &&
+ bfq_bfqq_wait_request(in_serv_bfqq) &&
+ time_is_before_eq_jiffies(bfqd->last_idling_start_jiffies +
+ bfqd->bfq_slice_idle)
+ )
+ limit = 1;
+
+ if (bfqd->rq_in_driver >= limit)
+ return NULL;
+
+ /*
+ * Linear search of the source queue for injection; but, with
+ * a high probability, very few steps are needed to find a
+ * candidate queue, i.e., a queue with enough budget left for
+ * its next request. In fact:
* - BFQ dynamically updates the budget of every queue so as
* to accommodate the expected backlog of the queue;
* - if a queue gets all its requests dispatched as injected
* service, then the queue is removed from the active list
- * (and re-added only if it gets new requests, but with
- * enough budget for its new backlog).
+ * (and re-added only if it gets new requests, but then it
+ * is assigned again enough budget for its new backlog).
*/
list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list)
if (!RB_EMPTY_ROOT(&bfqq->sort_list) &&
+ (in_serv_always_inject || bfqq->wr_coeff > 1) &&
bfq_serv_to_charge(bfqq->next_rq, bfqq) <=
- bfq_bfqq_budget_left(bfqq))
- return bfqq;
+ bfq_bfqq_budget_left(bfqq)) {
+ /*
+ * Allow for only one large in-flight request
+ * on non-rotational devices, for the
+ * following reason. On non-rotational drives,
+ * large requests take much longer than
+ * smaller requests to be served. In addition,
+ * the drive prefers to serve large requests
+ * over small ones, if it can choose. So,
+ * having more than one large request queued
+ * in the drive may easily make the next first
+ * request of the in-service queue wait so
+ * long as to break bfqq's service guarantees. On
+ * the bright side, large requests let the
+ * drive reach a very high throughput, even if
+ * there is only one in-flight large request
+ * at a time.
+ */
+ if (blk_queue_nonrot(bfqd->queue) &&
+ blk_rq_sectors(bfqq->next_rq) >=
+ BFQQ_SECT_THR_NONROT)
+ limit = min_t(unsigned int, 1, limit);
+ else
+ limit = in_serv_bfqq->inject_limit;
+
+ if (bfqd->rq_in_driver < limit) {
+ bfqd->rqs_injected = true;
+ return bfqq;
+ }
+ }
return NULL;
}
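
Editorial aside, not part of the patch: the cap applied inside the loop above amounts to "at most one large request in flight on a non-rotational drive, otherwise the in-service queue's inject_limit". A minimal sketch of that rule, using the BFQQ_SECT_THR_NONROT threshold referenced above and a hypothetical helper name:

/* Illustration only: effective injection cap for a candidate request. */
static unsigned int effective_inject_limit(bool nonrot, unsigned int sectors,
					   unsigned int inject_limit)
{
	if (nonrot && sectors >= BFQQ_SECT_THR_NONROT)
		return inject_limit < 1 ? inject_limit : 1; /* min(1, limit) */
	return inject_limit;
}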
@@ -3763,14 +4170,32 @@
* for a new request, or has requests waiting for a completion and
* may idle after their completion, then keep it anyway.
*
- * Yet, to boost throughput, inject service from other queues if
- * possible.
+ * Yet, inject service from other queues if it boosts
+ * throughput and is possible.
*/
if (bfq_bfqq_wait_request(bfqq) ||
(bfqq->dispatched != 0 && bfq_better_to_idle(bfqq))) {
- if (bfq_bfqq_injectable(bfqq) &&
- bfqq->injected_service * bfqq->inject_coeff <
- bfqq->entity.service * 10)
+ struct bfq_queue *async_bfqq =
+ bfqq->bic && bfqq->bic->bfqq[0] &&
+ bfq_bfqq_busy(bfqq->bic->bfqq[0]) ?
+ bfqq->bic->bfqq[0] : NULL;
+
+ /*
+ * If the process associated with bfqq also has async
+ * I/O pending, then inject it
+ * unconditionally. Injecting I/O from the same
+ * process can cause no harm to the process. On the
+ * contrary, it can only increase bandwidth and reduce
+ * latency for the process.
+ */
+ if (async_bfqq &&
+ icq_to_bic(async_bfqq->next_rq->elv.icq) == bfqq->bic &&
+ bfq_serv_to_charge(async_bfqq->next_rq, async_bfqq) <=
+ bfq_bfqq_budget_left(async_bfqq))
+ bfqq = bfqq->bic->bfqq[0];
+ else if (!idling_boosts_thr_without_issues(bfqd, bfqq) &&
+ (bfqq->wr_coeff == 1 || bfqd->wr_busy_queues > 1 ||
+ !bfq_bfqq_has_short_ttime(bfqq)))
bfqq = bfq_choose_bfqq_for_injection(bfqd);
else
bfqq = NULL;
@@ -3862,15 +4287,15 @@
bfq_bfqq_served(bfqq, service_to_charge);
+ if (bfqq == bfqd->in_service_queue && bfqd->wait_dispatch) {
+ bfqd->wait_dispatch = false;
+ bfqd->waited_rq = rq;
+ }
+
bfq_dispatch_remove(bfqd->queue, rq);
- if (bfqq != bfqd->in_service_queue) {
- if (likely(bfqd->in_service_queue))
- bfqd->in_service_queue->injected_service +=
- bfq_serv_to_charge(rq, bfqq);
-
+ if (bfqq != bfqd->in_service_queue)
goto return_rq;
- }
/*
* If weight raising has to terminate for bfqq, then next
@@ -3890,7 +4315,7 @@
* belongs to CLASS_IDLE and other queues are waiting for
* service.
*/
- if (!(bfqd->busy_queues > 1 && bfq_class_idle(bfqq)))
+ if (!(bfq_tot_busy_queues(bfqd) > 1 && bfq_class_idle(bfqq)))
goto return_rq;
bfq_bfqq_expire(bfqd, bfqq, false, BFQQE_BUDGET_EXHAUSTED);
@@ -3908,7 +4333,7 @@
* most a call to dispatch for nothing
*/
return !list_empty_careful(&bfqd->dispatch) ||
- bfqd->busy_queues > 0;
+ bfq_tot_busy_queues(bfqd) > 0;
}
static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
@@ -3962,9 +4387,10 @@
goto start_rq;
}
- bfq_log(bfqd, "dispatch requests: %d busy queues", bfqd->busy_queues);
+ bfq_log(bfqd, "dispatch requests: %d busy queues",
+ bfq_tot_busy_queues(bfqd));
- if (bfqd->busy_queues == 0)
+ if (bfq_tot_busy_queues(bfqd) == 0)
goto exit;
/*
@@ -4301,13 +4727,6 @@
bfq_mark_bfqq_has_short_ttime(bfqq);
bfq_mark_bfqq_sync(bfqq);
bfq_mark_bfqq_just_created(bfqq);
- /*
- * Aggressively inject a lot of service: up to 90%.
- * This coefficient remains constant during bfqq life,
- * but this behavior might be changed, after enough
- * testing and tuning.
- */
- bfqq->inject_coeff = 1;
} else
bfq_clear_bfqq_sync(bfqq);
@@ -4445,17 +4864,19 @@
struct request *rq)
{
bfqq->seek_history <<= 1;
- bfqq->seek_history |=
- get_sdist(bfqq->last_request_pos, rq) > BFQQ_SEEK_THR &&
- (!blk_queue_nonrot(bfqd->queue) ||
- blk_rq_sectors(rq) < BFQQ_SECT_THR_NONROT);
+ bfqq->seek_history |= BFQ_RQ_SEEKY(bfqd, bfqq->last_request_pos, rq);
+
+ if (bfqq->wr_coeff > 1 &&
+ bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time &&
+ BFQQ_TOTALLY_SEEKY(bfqq))
+ bfq_bfqq_end_wr(bfqq);
}
static void bfq_update_has_short_ttime(struct bfq_data *bfqd,
struct bfq_queue *bfqq,
struct bfq_io_cq *bic)
{
- bool has_short_ttime = true;
+ bool has_short_ttime = true, state_changed;
/*
* No need to update has_short_ttime if bfqq is async or in
@@ -4480,13 +4901,93 @@
bfqq->ttime.ttime_mean > bfqd->bfq_slice_idle))
has_short_ttime = false;
- bfq_log_bfqq(bfqd, bfqq, "update_has_short_ttime: has_short_ttime %d",
- has_short_ttime);
+ state_changed = has_short_ttime != bfq_bfqq_has_short_ttime(bfqq);
if (has_short_ttime)
bfq_mark_bfqq_has_short_ttime(bfqq);
else
bfq_clear_bfqq_has_short_ttime(bfqq);
+
+ /*
+ * Until the base value for the total service time gets
+ * finally computed for bfqq, the inject limit does depend on
+ * the think-time state (short|long). In particular, the limit
+ * is 0 or 1 if the think time is deemed, respectively, as
+ * short or long (details in the comments in
+ * bfq_update_inject_limit()). Accordingly, the next
+ * instructions reset the inject limit if the think-time state
+ * has changed and the above base value is still to be
+ * computed.
+ *
+ * However, the reset is performed only if more than 100 ms
+ * have elapsed since the last update of the inject limit, or
+ * (inclusive) if the change is from short to long think
+ * time. The reason for this waiting is as follows.
+ *
+ * bfqq may have a long think time because of a
+ * synchronization with some other queue, i.e., because the
+ * I/O of some other queue may need to be completed for bfqq
+ * to receive new I/O. This happens, e.g., if bfqq is
+ * associated with a process that does some sync. A sync
+ * generates extra blocking I/O, which must be completed
+ * before the process associated with bfqq can go on with its
+ * I/O.
+ *
+ * If such a synchronization is actually in place, then,
+ * without injection on bfqq, the blocking I/O cannot happen
+ * to be served while bfqq is in service. As a consequence, if
+ * bfqq is granted I/O-dispatch-plugging, then bfqq remains
+ * empty, and no I/O is dispatched, until the idle timeout
+ * fires. This is likely to result in lower bandwidth and
+ * higher latencies for bfqq, and in a severe loss of total
+ * throughput.
+ *
+ * On the opposite end, a non-zero inject limit may allow the
+ * I/O that blocks bfqq to be executed soon, and therefore
+ * bfqq to receive new I/O soon. But, if this actually
+ * happens, then the next think-time sample for bfqq may be
+ * very low. This in turn may cause bfqq's think time to be
+ * deemed short. Without the 100 ms barrier, this new state
+ * change would cause the body of the next if to be executed
+ * immediately. But this would set to 0 the inject
+ * limit. Without injection, the blocking I/O would cause the
+ * think time of bfqq to become long again, and therefore the
+ * inject limit to be raised again, and so on. The only effect
+ * of such a steady oscillation between the two think-time
+ * states would be to prevent effective injection on bfqq.
+ *
+ * In contrast, if the inject limit is not reset during such a
+ * long time interval as 100 ms, then the number of short
+ * think time samples can grow significantly before the reset
+ * is allowed. As a consequence, the think time state can
+ * become stable before the reset. There will be no state
+ * change when the 100 ms elapse, and therefore no reset of
+ * the inject limit. The inject limit remains steadily equal
+ * to 1 both during and after the 100 ms. So injection can be
+ * performed at all times, and throughput gets boosted.
+ *
+ * An inject limit equal to 1 is however in conflict, in
+ * general, with the fact that the think time of bfqq is
+ * short, because injection may be likely to delay bfqq's I/O
+ * (as explained in the comments in
+ * bfq_update_inject_limit()). But this does not happen in
+ * this special case, because bfqq's low think time is due to
+ * an effective handling of a synchronization, through
+ * injection. In this special case, bfqq's I/O does not get
+ * delayed by injection; on the contrary, bfqq's I/O is
+ * brought forward, because it is not blocked for
+ * milliseconds.
+ *
+ * In addition, during the 100 ms, the base value for the
+ * total service time is likely to get finally computed,
+ * freeing the inject limit from its relation with the think
+ * time.
+ */
+ if (state_changed && bfqq->last_serv_time_ns == 0 &&
+ (time_is_before_eq_jiffies(bfqq->decrease_time_jif +
+ msecs_to_jiffies(100)) ||
+ !has_short_ttime))
+ bfq_reset_inject_limit(bfqd, bfqq);
}
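
Editorial sketch, not part of the patch: the guard just added can be read as a pure predicate, with time expressed in plain milliseconds instead of jiffies. The limit is reset only while the base service time is still unknown, and only if at least 100 ms have passed since the last decrease or the think time has just turned long.

/* Illustration only: reset condition for the inject limit. */
static int should_reset_limit(int base_known, int state_changed,
			      int has_short_ttime,
			      long now_ms, long last_decrease_ms)
{
	return state_changed && !base_known &&
		(now_ms - last_decrease_ms >= 100 || !has_short_ttime);
}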
/*
@@ -4496,19 +4997,9 @@
static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
struct request *rq)
{
- struct bfq_io_cq *bic = RQ_BIC(rq);
-
if (rq->cmd_flags & REQ_META)
bfqq->meta_pending++;
- bfq_update_io_thinktime(bfqd, bfqq);
- bfq_update_has_short_ttime(bfqd, bfqq, bic);
- bfq_update_io_seektime(bfqd, bfqq, rq);
-
- bfq_log_bfqq(bfqd, bfqq,
- "rq_enqueued: has_short_ttime=%d (seeky %d)",
- bfq_bfqq_has_short_ttime(bfqq), BFQQ_SEEKY(bfqq));
-
bfqq->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);
if (bfqq == bfqd->in_service_queue && bfq_bfqq_wait_request(bfqq)) {
@@ -4517,28 +5008,31 @@
bool budget_timeout = bfq_bfqq_budget_timeout(bfqq);
/*
- * There is just this request queued: if the request
- * is small and the queue is not to be expired, then
- * just exit.
+ * There is just this request queued: if
+ * - the request is small, and
+ * - we are idling to boost throughput, and
+ * - the queue is not to be expired,
+ * then just exit.
*
* In this way, if the device is being idled to wait
* for a new request from the in-service queue, we
* avoid unplugging the device and committing the
- * device to serve just a small request. On the
- * contrary, we wait for the block layer to decide
- * when to unplug the device: hopefully, new requests
- * will be merged to this one quickly, then the device
- * will be unplugged and larger requests will be
- * dispatched.
+ * device to serve just a small request. In contrast
+ * we wait for the block layer to decide when to
+ * unplug the device: hopefully, new requests will be
+ * merged to this one quickly, then the device will be
+ * unplugged and larger requests will be dispatched.
*/
- if (small_req && !budget_timeout)
+ if (small_req && idling_boosts_thr_without_issues(bfqd, bfqq) &&
+ !budget_timeout)
return;
/*
- * A large enough request arrived, or the queue is to
- * be expired: in both cases disk idling is to be
- * stopped, so clear wait_request flag and reset
- * timer.
+ * A large enough request arrived, or idling is being
+ * performed to preserve service guarantees, or
+ * finally the queue is to be expired: in all these
+ * cases disk idling is to be stopped, so clear
+ * wait_request flag and reset timer.
*/
bfq_clear_bfqq_wait_request(bfqq);
hrtimer_try_to_cancel(&bfqd->idle_slice_timer);
@@ -4564,8 +5058,6 @@
bool waiting, idle_timer_disabled = false;
if (new_bfqq) {
- if (bic_to_bfqq(RQ_BIC(rq), 1) != bfqq)
- new_bfqq = bic_to_bfqq(RQ_BIC(rq), 1);
/*
* Release the request's reference to the old bfqq
* and make sure one is taken to the shared queue.
@@ -4595,6 +5087,10 @@
bfqq = new_bfqq;
}
+ bfq_update_io_thinktime(bfqd, bfqq);
+ bfq_update_has_short_ttime(bfqd, bfqq, RQ_BIC(rq));
+ bfq_update_io_seektime(bfqd, bfqq, rq);
+
waiting = bfqq && bfq_bfqq_wait_request(bfqq);
bfq_add_request(rq);
idle_timer_disabled = waiting && !bfq_bfqq_wait_request(bfqq);
@@ -4708,6 +5204,8 @@
static void bfq_update_hw_tag(struct bfq_data *bfqd)
{
+ struct bfq_queue *bfqq = bfqd->in_service_queue;
+
bfqd->max_rq_in_driver = max_t(int, bfqd->max_rq_in_driver,
bfqd->rq_in_driver);
@@ -4720,7 +5218,18 @@
* sum is not exact, as it's not taking into account deactivated
* requests.
*/
- if (bfqd->rq_in_driver + bfqd->queued < BFQ_HW_QUEUE_THRESHOLD)
+ if (bfqd->rq_in_driver + bfqd->queued <= BFQ_HW_QUEUE_THRESHOLD)
+ return;
+
+ /*
+	 * If the active queue does not have enough requests and can idle, bfq
+	 * might not dispatch sufficient requests to the hardware. Don't zero
+	 * hw_tag in this case.
+ */
+ if (bfqq && bfq_bfqq_has_short_ttime(bfqq) &&
+ bfqq->dispatched + bfqq->queued[0] + bfqq->queued[1] <
+ BFQ_HW_QUEUE_THRESHOLD &&
+ bfqd->rq_in_driver < BFQ_HW_QUEUE_THRESHOLD)
return;
if (bfqd->hw_tag_samples++ < BFQ_HW_QUEUE_SAMPLES)
@@ -4729,6 +5238,9 @@
bfqd->hw_tag = bfqd->max_rq_in_driver > BFQ_HW_QUEUE_THRESHOLD;
bfqd->max_rq_in_driver = 0;
bfqd->hw_tag_samples = 0;
+
+ bfqd->nonrot_with_queueing =
+ blk_queue_nonrot(bfqd->queue) && bfqd->hw_tag;
}
static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd)
@@ -4791,11 +5303,14 @@
* isochronous, and both requisites for this condition to hold
* are now satisfied, then compute soft_rt_next_start (see the
* comments on the function bfq_bfqq_softrt_next_start()). We
- * schedule this delayed check when bfqq expires, if it still
- * has in-flight requests.
+ * do not compute soft_rt_next_start if bfqq is in interactive
+ * weight raising (see the comments in bfq_bfqq_expire() for
+ * an explanation). We schedule this delayed update when bfqq
+ * expires, if it still has in-flight requests.
*/
if (bfq_bfqq_softrt_update(bfqq) && bfqq->dispatched == 0 &&
- RB_EMPTY_ROOT(&bfqq->sort_list))
+ RB_EMPTY_ROOT(&bfqq->sort_list) &&
+ bfqq->wr_coeff != bfqd->bfq_wr_coeff)
bfqq->soft_rt_next_start =
bfq_bfqq_softrt_next_start(bfqd, bfqq);
@@ -4853,6 +5368,164 @@
}
/*
+ * The processes associated with bfqq may happen to generate their
+ * cumulative I/O at a lower rate than the rate at which the device
+ * could serve the same I/O. This is rather probable, e.g., if only
+ * one process is associated with bfqq and the device is an SSD. It
+ * results in bfqq becoming often empty while in service. In this
+ * respect, if BFQ is allowed to switch to another queue when bfqq
+ * remains empty, then the device goes on being fed with I/O requests,
+ * and the throughput is not affected. In contrast, if BFQ is not
+ * allowed to switch to another queue---because bfqq is sync and
+ * I/O-dispatch needs to be plugged while bfqq is temporarily
+ * empty---then, during the service of bfqq, there will be frequent
+ * "service holes", i.e., time intervals during which bfqq gets empty
+ * and the device can only consume the I/O already queued in its
+ * hardware queues. During service holes, the device may even end up
+ * idle. In the end, during the service of bfqq, the device
+ * is driven at a lower speed than the one it can reach with the kind
+ * of I/O flowing through bfqq.
+ *
+ * To counter this loss of throughput, BFQ implements a "request
+ * injection mechanism", which tries to fill the above service holes
+ * with I/O requests taken from other queues. The hard part in this
+ * mechanism is finding the right amount of I/O to inject, so as to
+ * both boost throughput and not break bfqq's bandwidth and latency
+ * guarantees. In this respect, the mechanism maintains a per-queue
+ * inject limit, computed as below. While bfqq is empty, the injection
+ * mechanism dispatches extra I/O requests only until the total number
+ * of I/O requests in flight---i.e., already dispatched but not yet
+ * completed---remains lower than this limit.
+ *
+ * A first definition comes in handy to introduce the algorithm by
+ * which the inject limit is computed. We define as first request for
+ * bfqq, an I/O request for bfqq that arrives while bfqq is in
+ * service, and causes bfqq to switch from empty to non-empty. The
+ * algorithm updates the limit as a function of the effect of
+ * injection on the service times of only the first requests of
+ * bfqq. The reason for this restriction is that these are the
+ * requests whose service time is affected most, because they are the
+ * first to arrive after injection possibly occurred.
+ *
+ * To evaluate the effect of injection, the algorithm measures the
+ * "total service time" of first requests. We define as total service
+ * time of an I/O request, the time that elapses from when the
+ * request is enqueued into bfqq to when it is completed. This
+ * quantity allows the whole effect of injection to be measured. It is
+ * easy to see why. Suppose that some requests of other queues are
+ * actually injected while bfqq is empty, and that a new request R
+ * then arrives for bfqq. If the device does start to serve all or
+ * part of the injected requests during the service hole, then,
+ * because of this extra service, it may delay the next invocation of
+ * the dispatch hook of BFQ. Then, even after R gets eventually
+ * dispatched, the device may delay the actual service of R if it is
+ * still busy serving the extra requests, or if it decides to serve,
+ * before R, some extra request still present in its queues. As a
+ * conclusion, the cumulative extra delay caused by injection can be
+ * easily evaluated by just comparing the total service time of first
+ * requests with and without injection.
+ *
+ * The limit-update algorithm works as follows. On the arrival of a
+ * first request of bfqq, the algorithm measures the total time of the
+ * request only if one of the three cases below holds, and, for each
+ * case, it updates the limit as described below:
+ *
+ * (1) If there is no in-flight request. This gives a baseline for the
+ * total service time of the requests of bfqq. If the baseline has
+ * not been computed yet, then, after computing it, the limit is
+ * set to 1, to start boosting throughput, and to prepare the
+ * ground for the next case. If the baseline has already been
+ * computed, then it is updated, in case it turns out to be lower
+ * than the previous value.
+ *
+ * (2) If the limit is higher than 0 and there are in-flight
+ * requests. By comparing the total service time in this case with
+ * the above baseline, it is possible to know at which extent the
+ * current value of the limit is inflating the total service
+ * time. If the inflation is below a certain threshold, then bfqq
+ * is assumed to be suffering from no perceivable loss of its
+ * service guarantees, and the limit is even tentatively
+ * increased. If the inflation is above the threshold, then the
+ * limit is decreased. Due to the lack of any hysteresis, this
+ * logic makes the limit oscillate even in steady workload
+ * conditions. Yet we opted for it, because it is fast in reaching
+ * the best value for the limit, as a function of the current I/O
+ * workload. To reduce oscillations, this step is disabled for a
+ * short time interval after the limit happens to be decreased.
+ *
+ * (3) Periodically, after resetting the limit, to make sure that the
+ * limit eventually drops in case the workload changes. This is
+ * needed because, after the limit has gone safely up for a
+ * certain workload, it is impossible to guess whether the
+ * baseline total service time may have changed, without measuring
+ * it again without injection. A more effective version of this
+ * step might be to just sample the baseline, by interrupting
+ * injection only once, and then to reset/lower the limit only if
+ * the total service time with the current limit does happen to be
+ * too large.
+ *
+ * More details on each step are provided in the comments on the
+ * pieces of code that implement these steps: the branch handling the
+ * transition from empty to non-empty in bfq_add_request(), the branch
+ * handling injection in bfq_select_queue(), and the function
+ * bfq_choose_bfqq_for_injection(). These comments also explain some
+ * exceptions, made by the injection mechanism in some special cases.
+ */
+static void bfq_update_inject_limit(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq)
+{
+ u64 tot_time_ns = ktime_get_ns() - bfqd->last_empty_occupied_ns;
+ unsigned int old_limit = bfqq->inject_limit;
+
+ if (bfqq->last_serv_time_ns > 0) {
+ u64 threshold = (bfqq->last_serv_time_ns * 3)>>1;
+
+ if (tot_time_ns >= threshold && old_limit > 0) {
+ bfqq->inject_limit--;
+ bfqq->decrease_time_jif = jiffies;
+ } else if (tot_time_ns < threshold &&
+ old_limit < bfqd->max_rq_in_driver<<1)
+ bfqq->inject_limit++;
+ }
+
+ /*
+ * Either we still have to compute the base value for the
+ * total service time, and there seem to be the right
+ * conditions to do it, or we can lower the last base value
+ * computed.
+ *
+ * NOTE: (bfqd->rq_in_driver == 1) means that there is no I/O
+ * request in flight, because this function is in the code
+ * path that handles the completion of a request of bfqq, and,
+ * in particular, this function is executed before
+ * bfqd->rq_in_driver is decremented in such a code path.
+ */
+ if ((bfqq->last_serv_time_ns == 0 && bfqd->rq_in_driver == 1) ||
+ tot_time_ns < bfqq->last_serv_time_ns) {
+ bfqq->last_serv_time_ns = tot_time_ns;
+ /*
+ * Now we certainly have a base value: make sure we
+ * start trying injection.
+ */
+ bfqq->inject_limit = max_t(unsigned int, 1, old_limit);
+ } else if (!bfqd->rqs_injected && bfqd->rq_in_driver == 1)
+ /*
+ * No I/O injected and no request still in service in
+ * the drive: these are the exact conditions for
+ * computing the base value of the total service time
+ * for bfqq. So let's update this value, because it is
+ * rather variable. For example, it varies if the size
+ * or the spatial locality of the I/O requests in bfqq
+ * change.
+ */
+ bfqq->last_serv_time_ns = tot_time_ns;
+
+
+ /* update complete, not waiting for any request completion any longer */
+ bfqd->waited_rq = NULL;
+}
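
Worked illustration, not part of the patch: with a baseline last_serv_time_ns of 2 ms the threshold is 3 ms, so a measured total service time of 3.5 ms decrements the limit, while 2.5 ms increments it as long as the limit stays below twice max_rq_in_driver. The standalone sketch below reproduces just that arithmetic.

/* Illustration only: the limit-update arithmetic in user space. */
#include <stdint.h>

static unsigned int update_limit(uint64_t base_ns, uint64_t tot_ns,
				 unsigned int limit, unsigned int max_rq)
{
	uint64_t threshold = (base_ns * 3) >> 1;	/* 1.5 * baseline */

	if (tot_ns >= threshold && limit > 0)
		limit--;		/* injection inflates service time */
	else if (tot_ns < threshold && limit < (max_rq << 1))
		limit++;		/* room to inject more */
	return limit;
}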
+
+/*
* Handle either a requeue or a finish for rq. The things to do are
* the same in both cases: all references to rq are to be dropped. In
* particular, rq is considered completed from the point of view of
@@ -4896,6 +5569,9 @@
spin_lock_irqsave(&bfqd->lock, flags);
+ if (rq == bfqd->waited_rq)
+ bfq_update_inject_limit(bfqd, bfqq);
+
bfq_completed_request(bfqq, bfqd);
bfq_finish_requeue_request_body(bfqq);
@@ -5059,7 +5735,7 @@
* preparation is that, after the prepare_request hook is invoked for
* rq, rq may still be transformed into a request with no icq, i.e., a
* request not associated with any queue. No bfq hook is invoked to
- * signal this tranformation. As a consequence, should these
+ * signal this transformation. As a consequence, should these
* preparation operations be performed when the prepare_request hook
* is invoked, and should rq be transformed one moment later, bfq
* would end up in an inconsistent state, because it would have
@@ -5150,7 +5826,29 @@
}
}
- if (unlikely(bfq_bfqq_just_created(bfqq)))
+ /*
+ * Consider bfqq as possibly belonging to a burst of newly
+ * created queues only if:
+ * 1) A burst is actually happening (bfqd->burst_size > 0)
+ * or
+ * 2) There is no other active queue. In fact, if, in
+ * contrast, there are active queues not belonging to the
+ * possible burst bfqq may belong to, then there is no gain
+ * in considering bfqq as belonging to a burst, and
+ * therefore in not weight-raising bfqq. See comments on
+ * bfq_handle_burst().
+ *
+ * This filtering also helps eliminate false positives,
+ * occurring when bfqq does not belong to an actual large
+ * burst, but some background task (e.g., a service) happens
+ * to trigger the creation of new queues very close to when
+ * bfqq and its possible companion queues are created. See
+ * comments on bfq_handle_burst() for further details also on
+ * this issue.
+ */
+ if (unlikely(bfq_bfqq_just_created(bfqq) &&
+ (bfqd->burst_size > 0 ||
+ bfq_tot_busy_queues(bfqd) == 0)))
bfq_handle_burst(bfqd, bfqq);
return bfqq;
@@ -5418,14 +6116,15 @@
HRTIMER_MODE_REL);
bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
- bfqd->queue_weights_tree = RB_ROOT;
- bfqd->group_weights_tree = RB_ROOT;
+ bfqd->queue_weights_tree = RB_ROOT_CACHED;
+ bfqd->num_groups_with_pending_reqs = 0;
INIT_LIST_HEAD(&bfqd->active_list);
INIT_LIST_HEAD(&bfqd->idle_list);
INIT_HLIST_HEAD(&bfqd->burst_list);
bfqd->hw_tag = -1;
+ bfqd->nonrot_with_queueing = blk_queue_nonrot(bfqd->queue);
bfqd->bfq_max_budget = bfq_default_max_budget;
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index a41e988..eba7cd4 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -32,6 +32,8 @@
#define BFQ_DEFAULT_GRP_IOPRIO 0
#define BFQ_DEFAULT_GRP_CLASS IOPRIO_CLASS_BE
+#define MAX_PID_STR_LENGTH 12
+
/*
* Soft real-time applications are extremely more latency sensitive
* than interactive ones. Over-raise the weight of the former to
@@ -89,7 +91,7 @@
* expiration. This peculiar definition allows for the following
* optimization, not yet exploited: while a given entity is still in
* service, we already know which is the best candidate for next
- * service among the other active entitities in the same parent
+ * service among the other active entities in the same parent
* entity. We can then quickly compare the timestamps of the
* in-service entity with those of such best candidate.
*
@@ -108,15 +110,14 @@
};
/**
- * struct bfq_weight_counter - counter of the number of all active entities
+ * struct bfq_weight_counter - counter of the number of all active queues
* with a given weight.
*/
struct bfq_weight_counter {
- unsigned int weight; /* weight of the entities this counter refers to */
- unsigned int num_active; /* nr of active entities with this weight */
+ unsigned int weight; /* weight of the queues this counter refers to */
+ unsigned int num_active; /* nr of active queues with this weight */
/*
- * Weights tree member (see bfq_data's @queue_weights_tree and
- * @group_weights_tree)
+ * Weights tree member (see bfq_data's @queue_weights_tree)
*/
struct rb_node weights_node;
};
@@ -141,7 +142,7 @@
*
* Unless cgroups are used, the weight value is calculated from the
* ioprio to export the same interface as CFQ. When dealing with
- * ``well-behaved'' queues (i.e., queues that do not spend too much
+ * "well-behaved" queues (i.e., queues that do not spend too much
* time to consume their budget and have true sequential behavior, and
* when there are no external factors breaking anticipation) the
* relative weights at each level of the cgroups hierarchy should be
@@ -151,8 +152,6 @@
struct bfq_entity {
/* service_tree member */
struct rb_node rb_node;
- /* pointer to the weight counter associated with this entity */
- struct bfq_weight_counter *weight_counter;
/*
* Flag, true if the entity is on a tree (either the active or
@@ -199,6 +198,9 @@
/* flag, set to request a weight, ioprio or ioprio_class change */
int prio_changed;
+
+ /* flag, set if the entity is counted in groups_with_pending_reqs */
+ bool in_groups_with_pending_reqs;
};
struct bfq_group;
@@ -240,6 +242,13 @@
/* next ioprio and ioprio class if a change is in progress */
unsigned short new_ioprio, new_ioprio_class;
+ /* last total-service-time sample, see bfq_update_inject_limit() */
+ u64 last_serv_time_ns;
+ /* limit for request injection */
+ unsigned int inject_limit;
+ /* last time the inject limit has been decreased, in jiffies */
+ unsigned long decrease_time_jif;
+
/*
* Shared bfq_queue if queue is cooperating with one or more
* other queues.
@@ -266,6 +275,9 @@
/* entity representing this queue in the scheduler */
struct bfq_entity entity;
+ /* pointer to the weight counter associated with this entity */
+ struct bfq_weight_counter *weight_counter;
+
/* maximum budget allowed from the feedback mechanism */
int max_budget;
/* budget expiration (in jiffies) */
@@ -354,29 +366,6 @@
/* max service rate measured so far */
u32 max_service_rate;
- /*
- * Ratio between the service received by bfqq while it is in
- * service, and the cumulative service (of requests of other
- * queues) that may be injected while bfqq is empty but still
- * in service. To increase precision, the coefficient is
- * measured in tenths of unit. Here are some example of (1)
- * ratios, (2) resulting percentages of service injected
- * w.r.t. to the total service dispatched while bfqq is in
- * service, and (3) corresponding values of the coefficient:
- * 1 (50%) -> 10
- * 2 (33%) -> 20
- * 10 (9%) -> 100
- * 9.9 (9%) -> 99
- * 1.5 (40%) -> 15
- * 0.5 (66%) -> 5
- * 0.1 (90%) -> 1
- *
- * So, if the coefficient is lower than 10, then
- * injected service is more than bfqq service.
- */
- unsigned int inject_coeff;
- /* amount of service injected in current service slot */
- unsigned int injected_service;
};
/**
@@ -416,6 +405,15 @@
bool was_in_burst_list;
/*
+ * Save the weight when a merge occurs, to be able
+ * to restore it in case of split. If the weight is not
+ * correctly restored when the queue is recycled,
+ * then the weight of the recycled queue could differ
+ * from the weight of the original queue.
+ */
+ unsigned int saved_weight;
+
+ /*
* Similar to previous fields: save wr information.
*/
unsigned long saved_wr_coeff;
@@ -447,22 +445,62 @@
* weight-raised @bfq_queue (see the comments to the functions
* bfq_weights_tree_[add|remove] for further details).
*/
- struct rb_root queue_weights_tree;
- /*
- * rbtree of non-queue @bfq_entity weight counters, sorted by
- * weight. Used to keep track of whether all @bfq_groups have
- * the same weight. The tree contains one counter for each
- * distinct weight associated to some active @bfq_group (see
- * the comments to the functions bfq_weights_tree_[add|remove]
- * for further details).
- */
- struct rb_root group_weights_tree;
+ struct rb_root_cached queue_weights_tree;
/*
- * Number of bfq_queues containing requests (including the
- * queue in service, even if it is idling).
+ * Number of groups with at least one descendant process that
+ * has at least one request waiting for completion. Note that
+ * this also accounts for requests already dispatched, but not
+ * yet completed. Therefore this number of groups may differ
+ * (be larger) than the number of active groups, as a group is
+ * considered active only if its corresponding entity has
+ * descendant queues with at least one request queued. This
+ * number is used to decide whether a scenario is symmetric.
+ * For a detailed explanation see comments on the computation
+ * of the variable asymmetric_scenario in the function
+ * bfq_better_to_idle().
+ *
+ * However, it is hard to compute this number exactly, for
+ * groups with multiple descendant processes. Consider a group
+ * that is inactive, i.e., that has no descendant process with
+ * pending I/O inside BFQ queues. Then suppose that
+ * num_groups_with_pending_reqs is still accounting for this
+ * group, because the group has descendant processes with some
+ * I/O request still in flight. num_groups_with_pending_reqs
+ * should be decremented when the in-flight request of the
+ * last descendant process is finally completed (assuming that
+ * nothing else has changed for the group in the meantime, in
+ * terms of composition of the group and active/inactive state of child
+ * groups and processes). To accomplish this, an additional
+ * pending-request counter must be added to entities, and must
+ * be updated correctly. To avoid this additional field and operations,
+ * we resort to the following tradeoff between simplicity and
+ * accuracy: for an inactive group that is still counted in
+ * num_groups_with_pending_reqs, we decrement
+ * num_groups_with_pending_reqs when the first descendant
+ * process of the group remains with no request waiting for
+ * completion.
+ *
+ * Even this simpler decrement strategy requires some
+ * care: to avoid multiple decrements, we flag a group,
+ * more precisely an entity representing a group, as still
+ * counted in num_groups_with_pending_reqs when it becomes
+ * inactive. Then, when the first descendant queue of the
+ * entity remains with no request waiting for completion,
+ * num_groups_with_pending_reqs is decremented, and this flag
+ * is reset. After this flag is reset for the entity,
+ * num_groups_with_pending_reqs won't be decremented any
+ * longer in case a new descendant queue of the entity remains
+ * with no request waiting for completion.
*/
- int busy_queues;
+ unsigned int num_groups_with_pending_reqs;
+
+ /*
+ * Per-class (RT, BE, IDLE) number of bfq_queues containing
+ * requests (including the queue in service, even if it is
+ * idling).
+ */
+ unsigned int busy_queues[3];
/* number of weight-raised busy @bfq_queues */
int wr_busy_queues;
/* number of queued requests */
@@ -470,6 +508,9 @@
/* number of requests dispatched and waiting for completion */
int rq_in_driver;
+	/* true if the device is non-rotational and performs queueing */
+ bool nonrot_with_queueing;
+
/*
* Maximum number of requests in driver in the last
* @hw_tag_samples completed requests.
@@ -501,6 +542,26 @@
/* time of last request completion (ns) */
u64 last_completion;
+ /* time of last transition from empty to non-empty (ns) */
+ u64 last_empty_occupied_ns;
+
+ /*
+ * Flag set to activate the sampling of the total service time
+ * of a just-arrived first I/O request (see
+ * bfq_update_inject_limit()). This will cause the setting of
+ * waited_rq when the request is finally dispatched.
+ */
+ bool wait_dispatch;
+ /*
+ * If set, then bfq_update_inject_limit() is invoked when
+ * waited_rq is eventually completed.
+ */
+ struct request *waited_rq;
+ /*
+ * True if some request has been injected during the last service hole.
+ */
+ bool rqs_injected;
+
/* time of first rq dispatch in current observation interval (ns) */
u64 first_dispatch;
/* time of last rq dispatch in current observation interval (ns) */
@@ -510,6 +571,7 @@
ktime_t last_budget_start;
/* beginning of the last idle slice */
ktime_t last_idling_start;
+ unsigned long last_idling_start_jiffies;
/* number of samples in current observation interval */
int peak_rate_samples;
@@ -854,11 +916,11 @@
void bic_set_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq, bool is_sync);
struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic);
void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq);
-void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_entity *entity,
- struct rb_root *root);
+void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ struct rb_root_cached *root);
void __bfq_weights_tree_remove(struct bfq_data *bfqd,
- struct bfq_entity *entity,
- struct rb_root *root);
+ struct bfq_queue *bfqq,
+ struct rb_root_cached *root);
void bfq_weights_tree_remove(struct bfq_data *bfqd,
struct bfq_queue *bfqq);
void bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq,
@@ -935,6 +997,7 @@
struct bfq_group *bfq_bfqq_to_bfqg(struct bfq_queue *bfqq);
struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity);
+unsigned int bfq_tot_busy_queues(struct bfq_data *bfqd);
struct bfq_service_tree *bfq_entity_service_tree(struct bfq_entity *entity);
struct bfq_entity *bfq_entity_of(struct rb_node *node);
unsigned short bfq_ioprio_to_weight(int ioprio);
@@ -951,7 +1014,7 @@
bool ins_into_idle_tree);
bool next_queue_may_preempt(struct bfq_data *bfqd);
struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd);
-void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd);
+bool __bfq_bfqd_reset_in_service(struct bfq_data *bfqd);
void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
bool ins_into_idle_tree, bool expiration);
void bfq_activate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
@@ -964,13 +1027,23 @@
/* --------------- end of interface of B-WF2Q+ ---------------- */
/* Logging facilities. */
+static inline void bfq_pid_to_str(int pid, char *str, int len)
+{
+ if (pid != -1)
+ snprintf(str, len, "%d", pid);
+ else
+ snprintf(str, len, "SHARED-");
+}
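
Usage illustration, not part of the patch: a shared (merged) queue carries pid == -1 and is rendered as "SHARED-", while an ordinary queue is rendered as its pid, so the trace prefix produced by the macros below becomes e.g. "bfq1234S" or "bfqSHARED-S".

/* Illustration only: the two renderings, as inside a macro expansion. */
{
	char pid_str[MAX_PID_STR_LENGTH];

	bfq_pid_to_str(1234, pid_str, MAX_PID_STR_LENGTH); /* -> "1234" */
	bfq_pid_to_str(-1, pid_str, MAX_PID_STR_LENGTH);   /* -> "SHARED-" */
}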
+
#ifdef CONFIG_BFQ_GROUP_IOSCHED
struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) do { \
+ char pid_str[MAX_PID_STR_LENGTH]; \
+ bfq_pid_to_str((bfqq)->pid, pid_str, MAX_PID_STR_LENGTH); \
blk_add_cgroup_trace_msg((bfqd)->queue, \
bfqg_to_blkg(bfqq_group(bfqq))->blkcg, \
- "bfq%d%c " fmt, (bfqq)->pid, \
+ "bfq%s%c " fmt, pid_str, \
bfq_bfqq_sync((bfqq)) ? 'S' : 'A', ##args); \
} while (0)
@@ -981,10 +1054,13 @@
#else /* CONFIG_BFQ_GROUP_IOSCHED */
-#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) \
- blk_add_trace_msg((bfqd)->queue, "bfq%d%c " fmt, (bfqq)->pid, \
+#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) do { \
+ char pid_str[MAX_PID_STR_LENGTH]; \
+ bfq_pid_to_str((bfqq)->pid, pid_str, MAX_PID_STR_LENGTH); \
+ blk_add_trace_msg((bfqd)->queue, "bfq%s%c " fmt, pid_str, \
bfq_bfqq_sync((bfqq)) ? 'S' : 'A', \
- ##args)
+ ##args); \
+} while (0)
#define bfq_log_bfqg(bfqd, bfqg, fmt, args...) do {} while (0)
#endif /* CONFIG_BFQ_GROUP_IOSCHED */
diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index ff7c2d4..48d899c 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -44,6 +44,12 @@
BFQ_DEFAULT_GRP_CLASS - 1;
}
+unsigned int bfq_tot_busy_queues(struct bfq_data *bfqd)
+{
+ return bfqd->busy_queues[0] + bfqd->busy_queues[1] +
+ bfqd->busy_queues[2];
+}
+
static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
bool expiration);
@@ -53,7 +59,7 @@
* bfq_update_next_in_service - update sd->next_in_service
* @sd: sched_data for which to perform the update.
* @new_entity: if not NULL, pointer to the entity whose activation,
- * requeueing or repositionig triggered the invocation of
+ * requeueing or repositioning triggered the invocation of
* this function.
* @expiration: id true, this function is being invoked after the
* expiration of the in-service entity
@@ -84,7 +90,7 @@
/*
* If this update is triggered by the activation, requeueing
- * or repositiong of an entity that does not coincide with
+ * or repositioning of an entity that does not coincide with
* sd->next_in_service, then a full lookup in the active tree
* can be avoided. In fact, it is enough to check whether the
* just-modified entity has the same priority as
@@ -731,7 +737,7 @@
struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
unsigned int prev_weight, new_weight;
struct bfq_data *bfqd = NULL;
- struct rb_root *root;
+ struct rb_root_cached *root;
#ifdef CONFIG_BFQ_GROUP_IOSCHED
struct bfq_sched_data *sd;
struct bfq_group *bfqg;
@@ -788,25 +794,23 @@
new_weight = entity->orig_weight *
(bfqq ? bfqq->wr_coeff : 1);
/*
- * If the weight of the entity changes, remove the entity
- * from its old weight counter (if there is a counter
- * associated with the entity), and add it to the counter
- * associated with its new weight.
+ * If the weight of the entity changes, and the entity is a
+ * queue, remove the entity from its old weight counter (if
+ * there is a counter associated with the entity).
*/
- if (prev_weight != new_weight) {
- root = bfqq ? &bfqd->queue_weights_tree :
- &bfqd->group_weights_tree;
- __bfq_weights_tree_remove(bfqd, entity, root);
+ if (prev_weight != new_weight && bfqq) {
+ root = &bfqd->queue_weights_tree;
+ __bfq_weights_tree_remove(bfqd, bfqq, root);
}
entity->weight = new_weight;
/*
- * Add the entity to its weights tree only if it is
- * not associated with a weight-raised queue.
+ * Add the entity, if it is not a weight-raised queue,
+ * to the counter associated with its new weight.
*/
- if (prev_weight != new_weight &&
- (bfqq ? bfqq->wr_coeff == 1 : 1))
+ if (prev_weight != new_weight && bfqq && bfqq->wr_coeff == 1) {
/* If we get here, root has been initialized. */
- bfq_weights_tree_add(bfqd, entity, root);
+ bfq_weights_tree_add(bfqd, bfqq, root);
+ }
new_st->wsum += entity->weight;
@@ -1008,13 +1012,16 @@
entity->on_st = true;
}
-#ifdef BFQ_GROUP_IOSCHED_ENABLED
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
if (!bfq_entity_to_bfqq(entity)) { /* bfq_group */
struct bfq_group *bfqg =
container_of(entity, struct bfq_group, entity);
+ struct bfq_data *bfqd = bfqg->bfqd;
- bfq_weights_tree_add(bfqg->bfqd, entity,
- &bfqd->group_weights_tree);
+ if (!entity->in_groups_with_pending_reqs) {
+ entity->in_groups_with_pending_reqs = true;
+ bfqd->num_groups_with_pending_reqs++;
+ }
}
#endif
@@ -1153,15 +1160,14 @@
}
/**
- * __bfq_deactivate_entity - deactivate an entity from its service tree.
- * @entity: the entity to deactivate.
+ * __bfq_deactivate_entity - update sched_data and service trees for
+ * entity, so as to represent entity as inactive
+ * @entity: the entity being deactivated.
* @ins_into_idle_tree: if false, the entity will not be put into the
* idle tree.
*
- * Deactivates an entity, independently of its previous state. Must
- * be invoked only if entity is on a service tree. Extracts the entity
- * from that tree, and if necessary and allowed, puts it into the idle
- * tree.
+ * If necessary and allowed, puts entity into the idle tree. NOTE:
+ * entity may be on no tree if in service.
*/
bool __bfq_deactivate_entity(struct bfq_entity *entity, bool ins_into_idle_tree)
{
@@ -1390,7 +1396,7 @@
* In this first case, update the virtual time in @st too (see the
* comments on this update inside the function).
*
- * In constrast, if there is an in-service entity, then return the
+ * In contrast, if there is an in-service entity, then return the
* entity that would be set in service if not only the above
* conditions, but also the next one held true: the currently
* in-service entity, on expiration,
@@ -1473,12 +1479,12 @@
* is being invoked as a part of the expiration path
* of the in-service queue. In this case, even if
* sd->in_service_entity is not NULL,
- * sd->in_service_entiy at this point is actually not
+ * sd->in_service_entity at this point is actually not
* in service any more, and, if needed, has already
* been properly queued or requeued into the right
* tree. The reason why sd->in_service_entity is still
* not NULL here, even if expiration is true, is that
- * sd->in_service_entiy is reset as a last step in the
+ * sd->in_service_entity is reset as a last step in the
* expiration path. So, if expiration is true, tell
* __bfq_lookup_next_entity that there is no
* sd->in_service_entity.
@@ -1513,7 +1519,7 @@
struct bfq_sched_data *sd;
struct bfq_queue *bfqq;
- if (bfqd->busy_queues == 0)
+ if (bfq_tot_busy_queues(bfqd) == 0)
return NULL;
/*
@@ -1599,7 +1605,8 @@
return bfqq;
}
-void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
+/* returns true if the in-service queue gets freed */
+bool __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
{
struct bfq_queue *in_serv_bfqq = bfqd->in_service_queue;
struct bfq_entity *in_serv_entity = &in_serv_bfqq->entity;
@@ -1623,8 +1630,20 @@
* service tree either, then release the service reference to
* the queue it represents (taken with bfq_get_entity).
*/
- if (!in_serv_entity->on_st)
+ if (!in_serv_entity->on_st) {
+ /*
+ * If no process is referencing in_serv_bfqq any
+ * longer, then the service reference may be the only
+ * reference to the queue. If this is the case, then
+ * bfqq gets freed here.
+ */
+ int ref = in_serv_bfqq->ref;
bfq_put_queue(in_serv_bfqq);
+ if (ref == 1)
+ return true;
+ }
+
+ return false;
}
void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
@@ -1665,10 +1684,7 @@
bfq_clear_bfqq_busy(bfqq);
- bfqd->busy_queues--;
-
- if (!bfqq->dispatched)
- bfq_weights_tree_remove(bfqd, bfqq);
+ bfqd->busy_queues[bfqq->ioprio_class - 1]--;
if (bfqq->wr_coeff > 1)
bfqd->wr_busy_queues--;
@@ -1676,6 +1692,9 @@
bfqg_stats_update_dequeue(bfqq_group(bfqq));
bfq_deactivate_bfqq(bfqd, bfqq, true, expiration);
+
+ if (!bfqq->dispatched)
+ bfq_weights_tree_remove(bfqd, bfqq);
}
/*
@@ -1688,11 +1707,11 @@
bfq_activate_bfqq(bfqd, bfqq);
bfq_mark_bfqq_busy(bfqq);
- bfqd->busy_queues++;
+ bfqd->busy_queues[bfqq->ioprio_class - 1]++;
if (!bfqq->dispatched)
if (bfqq->wr_coeff == 1)
- bfq_weights_tree_add(bfqd, &bfqq->entity,
+ bfq_weights_tree_add(bfqd, bfqq,
&bfqd->queue_weights_tree);
if (bfqq->wr_coeff > 1)
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 5006a0d..4ca4f0b 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -16,6 +16,18 @@
static void blk_mq_sysfs_release(struct kobject *kobj)
{
+ struct blk_mq_ctxs *ctxs = container_of(kobj, struct blk_mq_ctxs, kobj);
+
+ free_percpu(ctxs->queue_ctx);
+ kfree(ctxs);
+}
+
+static void blk_mq_ctx_sysfs_release(struct kobject *kobj)
+{
+ struct blk_mq_ctx *ctx = container_of(kobj, struct blk_mq_ctx, kobj);
+
+ /* ctx->ctxs won't be released until all ctx are freed */
+ kobject_put(&ctx->ctxs->kobj);
}
static void blk_mq_hw_sysfs_release(struct kobject *kobj)
@@ -214,7 +226,7 @@
static struct kobj_type blk_mq_ctx_ktype = {
.sysfs_ops = &blk_mq_sysfs_ops,
.default_attrs = default_ctx_attrs,
- .release = blk_mq_sysfs_release,
+ .release = blk_mq_ctx_sysfs_release,
};
static struct kobj_type blk_mq_hw_ktype = {
@@ -246,7 +258,7 @@
if (!hctx->nr_ctx)
return 0;
- ret = kobject_add(&hctx->kobj, &q->mq_kobj, "%u", hctx->queue_num);
+ ret = kobject_add(&hctx->kobj, q->mq_kobj, "%u", hctx->queue_num);
if (ret)
return ret;
@@ -269,8 +281,8 @@
queue_for_each_hw_ctx(q, hctx, i)
blk_mq_unregister_hctx(hctx);
- kobject_uevent(&q->mq_kobj, KOBJ_REMOVE);
- kobject_del(&q->mq_kobj);
+ kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
+ kobject_del(q->mq_kobj);
kobject_put(&dev->kobj);
q->mq_sysfs_init_done = false;
@@ -290,7 +302,7 @@
ctx = per_cpu_ptr(q->queue_ctx, cpu);
kobject_put(&ctx->kobj);
}
- kobject_put(&q->mq_kobj);
+ kobject_put(q->mq_kobj);
}
void blk_mq_sysfs_init(struct request_queue *q)
@@ -298,10 +310,12 @@
struct blk_mq_ctx *ctx;
int cpu;
- kobject_init(&q->mq_kobj, &blk_mq_ktype);
+ kobject_init(q->mq_kobj, &blk_mq_ktype);
for_each_possible_cpu(cpu) {
ctx = per_cpu_ptr(q->queue_ctx, cpu);
+
+ kobject_get(q->mq_kobj);
kobject_init(&ctx->kobj, &blk_mq_ctx_ktype);
}
}
@@ -314,11 +328,11 @@
WARN_ON_ONCE(!q->kobj.parent);
lockdep_assert_held(&q->sysfs_lock);
- ret = kobject_add(&q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq");
+ ret = kobject_add(q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq");
if (ret < 0)
goto out;
- kobject_uevent(&q->mq_kobj, KOBJ_ADD);
+ kobject_uevent(q->mq_kobj, KOBJ_ADD);
queue_for_each_hw_ctx(q, hctx, i) {
ret = blk_mq_register_hctx(hctx);
@@ -335,8 +349,8 @@
while (--i >= 0)
blk_mq_unregister_hctx(q->queue_hw_ctx[i]);
- kobject_uevent(&q->mq_kobj, KOBJ_REMOVE);
- kobject_del(&q->mq_kobj);
+ kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
+ kobject_del(q->mq_kobj);
kobject_put(&dev->kobj);
return ret;
}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 684acaa..dba55f3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2453,6 +2453,34 @@
mutex_unlock(&set->tag_list_lock);
}
+/* All allocations will be freed in the release handler of q->mq_kobj */
+static int blk_mq_alloc_ctxs(struct request_queue *q)
+{
+ struct blk_mq_ctxs *ctxs;
+ int cpu;
+
+ ctxs = kzalloc(sizeof(*ctxs), GFP_KERNEL);
+ if (!ctxs)
+ return -ENOMEM;
+
+ ctxs->queue_ctx = alloc_percpu(struct blk_mq_ctx);
+ if (!ctxs->queue_ctx)
+ goto fail;
+
+ for_each_possible_cpu(cpu) {
+ struct blk_mq_ctx *ctx = per_cpu_ptr(ctxs->queue_ctx, cpu);
+ ctx->ctxs = ctxs;
+ }
+
+ q->mq_kobj = &ctxs->kobj;
+ q->queue_ctx = ctxs->queue_ctx;
+
+ return 0;
+ fail:
+ kfree(ctxs);
+ return -ENOMEM;
+}
+
/*
* It is the actual release handler for mq, but we do it from
* request queue's release handler for avoiding use-after-free
@@ -2480,8 +2508,6 @@
* both share lifetime with request queue.
*/
blk_mq_sysfs_deinit(q);
-
- free_percpu(q->queue_ctx);
}
struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
@@ -2586,8 +2612,7 @@
if (!q->poll_cb)
goto err_exit;
- q->queue_ctx = alloc_percpu(struct blk_mq_ctx);
- if (!q->queue_ctx)
+ if (blk_mq_alloc_ctxs(q))
goto err_exit;
/* init q->mq_kobj and sw queues' kobjects */
@@ -2596,7 +2621,7 @@
q->queue_hw_ctx = kcalloc_node(nr_cpu_ids, sizeof(*(q->queue_hw_ctx)),
GFP_KERNEL, set->numa_node);
if (!q->queue_hw_ctx)
- goto err_percpu;
+ goto err_sys_init;
q->mq_map = set->mq_map;
@@ -2653,8 +2678,8 @@
err_hctxs:
kfree(q->queue_hw_ctx);
-err_percpu:
- free_percpu(q->queue_ctx);
+err_sys_init:
+ blk_mq_sysfs_deinit(q);
err_exit:
q->mq_ops = NULL;
return ERR_PTR(-ENOMEM);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 5ad9251..a6094c2 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -7,6 +7,11 @@
struct blk_mq_tag_set;
+struct blk_mq_ctxs {
+ struct kobject kobj;
+ struct blk_mq_ctx __percpu *queue_ctx;
+};
+
/**
* struct blk_mq_ctx - State for a software queue facing the submitting CPUs
*/
@@ -27,6 +32,7 @@
unsigned long ____cacheline_aligned_in_smp rq_completed[2];
struct request_queue *queue;
+ struct blk_mq_ctxs *ctxs;
struct kobject kobj;
} ____cacheline_aligned_in_smp;
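The blk-mq changes above tie the lifetime of the shared percpu queue_ctx area to the per-CPU ctx kobjects: blk_mq_sysfs_init() takes one q->mq_kobj reference per CPU, each ctx release drops one, and only the final release of the blk_mq_ctxs kobject frees the percpu allocation. A minimal sketch of that ownership pattern, with hypothetical names that are not part of the patch:

#include <linux/kobject.h>
#include <linux/percpu.h>
#include <linux/slab.h>

struct parent_obj {
	struct kobject kobj;
	void __percpu *data;
};

struct child_obj {
	struct kobject kobj;
	struct parent_obj *parent;
};

/* Runs only after every child has dropped its reference on the parent. */
static void parent_release(struct kobject *kobj)
{
	struct parent_obj *p = container_of(kobj, struct parent_obj, kobj);

	free_percpu(p->data);
	kfree(p);
}

/* Releasing a child drops its hold on the parent. */
static void child_release(struct kobject *kobj)
{
	struct child_obj *c = container_of(kobj, struct child_obj, kobj);

	kobject_put(&c->parent->kobj);
}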
diff --git a/crypto/Makefile b/crypto/Makefile
index f6a234d..e7397bd 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -124,7 +124,7 @@
obj-$(CONFIG_CRYPTO_CRC32) += crc32_generic.o
obj-$(CONFIG_CRYPTO_CRCT10DIF) += crct10dif_common.o crct10dif_generic.o
obj-$(CONFIG_CRYPTO_AUTHENC) += authenc.o authencesn.o
-obj-$(CONFIG_CRYPTO_LZO) += lzo.o
+obj-$(CONFIG_CRYPTO_LZO) += lzo.o lzo-rle.o
obj-$(CONFIG_CRYPTO_LZ4) += lz4.o
obj-$(CONFIG_CRYPTO_LZ4HC) += lz4hc.o
obj-$(CONFIG_CRYPTO_842) += 842.o
diff --git a/crypto/lzo-rle.c b/crypto/lzo-rle.c
new file mode 100644
index 0000000..ea9c75b
--- /dev/null
+++ b/crypto/lzo-rle.c
@@ -0,0 +1,175 @@
+/*
+ * Cryptographic API.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 51
+ * Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/crypto.h>
+#include <linux/vmalloc.h>
+#include <linux/mm.h>
+#include <linux/lzo.h>
+#include <crypto/internal/scompress.h>
+
+struct lzorle_ctx {
+ void *lzorle_comp_mem;
+};
+
+static void *lzorle_alloc_ctx(struct crypto_scomp *tfm)
+{
+ void *ctx;
+
+ ctx = kvmalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL);
+ if (!ctx)
+ return ERR_PTR(-ENOMEM);
+
+ return ctx;
+}
+
+static int lzorle_init(struct crypto_tfm *tfm)
+{
+ struct lzorle_ctx *ctx = crypto_tfm_ctx(tfm);
+
+ ctx->lzorle_comp_mem = lzorle_alloc_ctx(NULL);
+ if (IS_ERR(ctx->lzorle_comp_mem))
+ return -ENOMEM;
+
+ return 0;
+}
+
+static void lzorle_free_ctx(struct crypto_scomp *tfm, void *ctx)
+{
+ kvfree(ctx);
+}
+
+static void lzorle_exit(struct crypto_tfm *tfm)
+{
+ struct lzorle_ctx *ctx = crypto_tfm_ctx(tfm);
+
+ lzorle_free_ctx(NULL, ctx->lzorle_comp_mem);
+}
+
+static int __lzorle_compress(const u8 *src, unsigned int slen,
+ u8 *dst, unsigned int *dlen, void *ctx)
+{
+ size_t tmp_len = *dlen; /* size_t(ulong) <-> uint on 64 bit */
+ int err;
+
+ err = lzorle1x_1_compress(src, slen, dst, &tmp_len, ctx);
+
+ if (err != LZO_E_OK)
+ return -EINVAL;
+
+ *dlen = tmp_len;
+ return 0;
+}
+
+static int lzorle_compress(struct crypto_tfm *tfm, const u8 *src,
+ unsigned int slen, u8 *dst, unsigned int *dlen)
+{
+ struct lzorle_ctx *ctx = crypto_tfm_ctx(tfm);
+
+ return __lzorle_compress(src, slen, dst, dlen, ctx->lzorle_comp_mem);
+}
+
+static int lzorle_scompress(struct crypto_scomp *tfm, const u8 *src,
+ unsigned int slen, u8 *dst, unsigned int *dlen,
+ void *ctx)
+{
+ return __lzorle_compress(src, slen, dst, dlen, ctx);
+}
+
+static int __lzorle_decompress(const u8 *src, unsigned int slen,
+ u8 *dst, unsigned int *dlen)
+{
+ int err;
+ size_t tmp_len = *dlen; /* size_t(ulong) <-> uint on 64 bit */
+
+ err = lzo1x_decompress_safe(src, slen, dst, &tmp_len);
+
+ if (err != LZO_E_OK)
+ return -EINVAL;
+
+ *dlen = tmp_len;
+ return 0;
+}
+
+static int lzorle_decompress(struct crypto_tfm *tfm, const u8 *src,
+ unsigned int slen, u8 *dst, unsigned int *dlen)
+{
+ return __lzorle_decompress(src, slen, dst, dlen);
+}
+
+static int lzorle_sdecompress(struct crypto_scomp *tfm, const u8 *src,
+ unsigned int slen, u8 *dst, unsigned int *dlen,
+ void *ctx)
+{
+ return __lzorle_decompress(src, slen, dst, dlen);
+}
+
+static struct crypto_alg alg = {
+ .cra_name = "lzo-rle",
+ .cra_flags = CRYPTO_ALG_TYPE_COMPRESS,
+ .cra_ctxsize = sizeof(struct lzorle_ctx),
+ .cra_module = THIS_MODULE,
+ .cra_init = lzorle_init,
+ .cra_exit = lzorle_exit,
+ .cra_u = { .compress = {
+ .coa_compress = lzorle_compress,
+ .coa_decompress = lzorle_decompress } }
+};
+
+static struct scomp_alg scomp = {
+ .alloc_ctx = lzorle_alloc_ctx,
+ .free_ctx = lzorle_free_ctx,
+ .compress = lzorle_scompress,
+ .decompress = lzorle_sdecompress,
+ .base = {
+ .cra_name = "lzo-rle",
+ .cra_driver_name = "lzo-rle-scomp",
+ .cra_module = THIS_MODULE,
+ }
+};
+
+static int __init lzorle_mod_init(void)
+{
+ int ret;
+
+ ret = crypto_register_alg(&alg);
+ if (ret)
+ return ret;
+
+ ret = crypto_register_scomp(&scomp);
+ if (ret) {
+ crypto_unregister_alg(&alg);
+ return ret;
+ }
+
+ return ret;
+}
+
+static void __exit lzorle_mod_fini(void)
+{
+ crypto_unregister_alg(&alg);
+ crypto_unregister_scomp(&scomp);
+}
+
+module_init(lzorle_mod_init);
+module_exit(lzorle_mod_fini);
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("LZO-RLE Compression Algorithm");
+MODULE_ALIAS_CRYPTO("lzo-rle");
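A minimal sketch of how an in-kernel user (for example zcomp, updated below) could exercise the new algorithm through the synchronous compression API; the helper name and worst-case output sizing are assumptions for illustration:

#include <linux/crypto.h>
#include <linux/err.h>
#include <linux/slab.h>

static int lzorle_compress_buf(const u8 *src, unsigned int slen)
{
	struct crypto_comp *tfm;
	unsigned int dlen = slen + slen / 16 + 64 + 3;	/* LZO worst case */
	u8 *dst;
	int ret;

	tfm = crypto_alloc_comp("lzo-rle", 0, 0);
	if (IS_ERR(tfm))
		return PTR_ERR(tfm);

	dst = kmalloc(dlen, GFP_KERNEL);
	if (!dst) {
		crypto_free_comp(tfm);
		return -ENOMEM;
	}

	ret = crypto_comp_compress(tfm, src, slen, dst, &dlen);

	kfree(dst);
	crypto_free_comp(tfm);
	return ret;
}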
diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index d332988..18948f8 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -76,8 +76,8 @@
"cast6", "arc4", "michael_mic", "deflate", "crc32c", "tea", "xtea",
"khazad", "wp512", "wp384", "wp256", "tnepres", "xeta", "fcrypt",
"camellia", "seed", "salsa20", "rmd128", "rmd160", "rmd256", "rmd320",
- "lzo", "cts", "zlib", "sha3-224", "sha3-256", "sha3-384", "sha3-512",
- NULL
+ "lzo", "lzo-rle", "cts", "zlib", "sha3-224", "sha3-256", "sha3-384",
+ "sha3-512", NULL
};
static u32 block_sizes[] = { 16, 64, 256, 1024, 8192, 0 };
diff --git a/drivers/acpi/acpica/evgpe.c b/drivers/acpi/acpica/evgpe.c
index 4b5d3b4..4da586f 100644
--- a/drivers/acpi/acpica/evgpe.c
+++ b/drivers/acpi/acpica/evgpe.c
@@ -81,8 +81,12 @@
ACPI_FUNCTION_TRACE(ev_enable_gpe);
- /* Enable the requested GPE */
+ /* Clear the GPE (of stale events) */
+ status = acpi_hw_clear_gpe(gpe_event_info);
+ if (ACPI_FAILURE(status))
+ return_ACPI_STATUS(status);
+ /* Enable the requested GPE */
status = acpi_hw_low_set_gpe(gpe_event_info, ACPI_GPE_ENABLE);
return_ACPI_STATUS(status);
}
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index dd4c728..bce135d 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2292,7 +2292,7 @@
offset = to_interleave_offset(offset, mmio);
writeq(cmd, mmio->addr.base + offset);
- nvdimm_flush(nfit_blk->nd_region);
+ nvdimm_flush(nfit_blk->nd_region, NULL);
if (nfit_blk->dimm_flags & NFIT_BLK_DCR_LATCH)
readq(mmio->addr.base + offset);
@@ -2341,7 +2341,7 @@
}
if (rw)
- nvdimm_flush(nfit_blk->nd_region);
+ nvdimm_flush(nfit_blk->nd_region, NULL);
rc = read_blk_stat(nfit_blk, lane) ? -EIO : 0;
return rc;
diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c
index 847db3e..6bd58d7 100644
--- a/drivers/acpi/sleep.c
+++ b/drivers/acpi/sleep.c
@@ -584,6 +584,7 @@
acpi_status status = AE_OK;
u32 acpi_state = acpi_target_sleep_state;
int error;
+ u64 tsc;
ACPI_FLUSH_CPU_CACHE();
@@ -600,6 +601,9 @@
error = acpi_suspend_lowlevel();
if (error)
return error;
+ tsc = rdtsc_ordered();
+ printk(KERN_INFO "TSC at resume: %llu\n",
+ (unsigned long long)tsc);
pr_info(PREFIX "Low-level resume complete\n");
pm_set_resume_via_firmware();
break;
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 3e63a90..3360746 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -60,6 +60,15 @@
rescue mode with init=/bin/sh, even when the /dev directory
on the rootfs is completely empty.
+config DEVTMPFS_SAFE
+ bool "Automount devtmpfs with nosuid/noexec"
+ depends on DEVTMPFS_MOUNT
+ default y
+ help
+ This instructs the kernel to automount devtmpfs with the
+ MS_NOEXEC and MS_NOSUID mount flags, which can prevent
+ certain kinds of code-execution attacks on embedded platforms.
+
config STANDALONE
bool "Select only drivers that don't need compile-time external firmware"
default y
diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index caaeb79..c11f14e 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -614,6 +614,7 @@
return -EBUSY;
return 0;
}
+EXPORT_SYMBOL(driver_probe_done);
/**
* wait_for_device_probe
diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index f776807..5b6b1b7e 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -349,6 +349,7 @@
int devtmpfs_mount(const char *mntdir)
{
int err;
+ int mflags = MS_SILENT;
if (!mount_dev)
return 0;
@@ -356,8 +357,10 @@
if (!thread)
return 0;
- err = ksys_mount("devtmpfs", (char *)mntdir, "devtmpfs", MS_SILENT,
- NULL);
+#ifdef CONFIG_DEVTMPFS_SAFE
+ mflags |= MS_NOEXEC | MS_NOSUID;
+#endif
+ err = ksys_mount("devtmpfs", (char *)mntdir, "devtmpfs", mflags, NULL);
if (err)
printk(KERN_INFO "devtmpfs: error mounting %i\n", err);
else
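For reference, the automount performed with CONFIG_DEVTMPFS_SAFE=y is equivalent to this userspace call (illustration only, not part of the patch):

#include <sys/mount.h>

int mount_dev_safe(void)
{
	return mount("devtmpfs", "/dev", "devtmpfs",
		     MS_SILENT | MS_NOEXEC | MS_NOSUID, NULL);
}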
diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
index 3b382a7..09678fa 100644
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -1773,7 +1773,12 @@
dev->power.direct_complete = false;
if (dev->power.direct_complete) {
- if (pm_runtime_status_suspended(dev)) {
+ /*
+ * Check if we're runtime suspended. If not, try to runtime
+ * suspend for autosuspend cases.
+ */
+ if (pm_runtime_status_suspended(dev) ||
+ !pm_runtime_suspend(dev)) {
pm_runtime_disable(dev);
if (pm_runtime_status_suspended(dev))
goto Complete;
diff --git a/drivers/base/power/wakeup.c b/drivers/base/power/wakeup.c
index 2dfa2e0..33773c5 100644
--- a/drivers/base/power/wakeup.c
+++ b/drivers/base/power/wakeup.c
@@ -818,7 +818,7 @@
srcuidx = srcu_read_lock(&wakeup_srcu);
list_for_each_entry_rcu(ws, &wakeup_sources, entry) {
if (ws->active) {
- pr_debug("active wakeup source: %s\n", ws->name);
+ pm_pr_dbg("active wakeup source: %s\n", ws->name);
active = 1;
} else if (!active &&
(!last_activity_ws ||
@@ -829,7 +829,7 @@
}
if (!active && last_activity_ws)
- pr_debug("last active wakeup source: %s\n",
+ pm_pr_dbg("last active wakeup source: %s\n",
last_activity_ws->name);
srcu_read_unlock(&wakeup_srcu, srcuidx);
}
@@ -859,7 +859,7 @@
raw_spin_unlock_irqrestore(&events_lock, flags);
if (ret) {
- pr_debug("PM: Wakeup pending, aborting suspend\n");
+ pm_pr_dbg("Wakeup pending, aborting suspend\n");
pm_print_active_wakeup_sources();
}
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c1341c8..4164d3a 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -460,7 +460,9 @@
if (!cmd->use_aio || cmd->ret < 0 || cmd->ret == blk_rq_bytes(rq) ||
req_op(rq) != REQ_OP_READ) {
- if (cmd->ret < 0)
+ if (cmd->ret == -EOPNOTSUPP)
+ ret = BLK_STS_NOTSUPP;
+ else if (cmd->ret < 0)
ret = BLK_STS_IOERR;
goto end_io;
}
@@ -931,6 +933,24 @@
return 0;
}
+static void loop_update_rotational(struct loop_device *lo)
+{
+ struct file *file = lo->lo_backing_file;
+ struct inode *file_inode = file->f_mapping->host;
+ struct block_device *file_bdev = file_inode->i_sb->s_bdev;
+ struct request_queue *q = lo->lo_queue;
+ bool nonrot = true;
+
+ /* not all filesystems (e.g. tmpfs) have a sb->s_bdev */
+ if (file_bdev)
+ nonrot = blk_queue_nonrot(bdev_get_queue(file_bdev));
+
+ if (nonrot)
+ blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
+ else
+ blk_queue_flag_clear(QUEUE_FLAG_NONROT, q);
+}
+
static int loop_set_fd(struct loop_device *lo, fmode_t mode,
struct block_device *bdev, unsigned int arg)
{
@@ -994,6 +1014,7 @@
if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
blk_queue_write_cache(lo->lo_queue, true, false);
+ loop_update_rotational(lo);
loop_update_dio(lo);
set_capacity(lo->lo_disk, size);
bd_set_size(bdev, size << 9);
@@ -1924,7 +1945,10 @@
failed:
/* complete non-aio request */
if (!cmd->use_aio || ret) {
- cmd->ret = ret ? -EIO : 0;
+ if (ret == -EOPNOTSUPP)
+ cmd->ret = ret;
+ else
+ cmd->ret = ret ? -EIO : 0;
blk_mq_complete_request(rq);
}
}
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 9be54e5..83a09a7 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -18,6 +18,7 @@
#define PART_BITS 4
#define VQ_NAME_LEN 16
+#define DISCARD_MAX_SEGMENTS 256
static int major;
static DEFINE_IDA(vd_index_ida);
@@ -188,10 +189,50 @@
return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
}
+
+static inline int virtblk_setup_discard_write_zeroes(struct request *req,
+ bool unmap)
+{
+ unsigned short segments = blk_rq_nr_discard_segments(req);
+ unsigned short n = 0;
+ struct virtio_blk_discard_write_zeroes *range;
+ struct bio *bio;
+ u32 flags = 0;
+
+ if (unmap)
+ flags |= VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP;
+
+ range = kmalloc_array(segments, sizeof(*range), GFP_ATOMIC);
+ if (!range)
+ return -ENOMEM;
+
+ __rq_for_each_bio(bio, req) {
+ u64 sector = bio->bi_iter.bi_sector;
+ u32 num_sectors = bio->bi_iter.bi_size >> 9;
+
+ range[n].flags = cpu_to_le32(flags);
+ range[n].num_sectors = cpu_to_le32(num_sectors);
+ range[n].sector = cpu_to_le64(sector);
+ n++;
+ }
+
+ req->special_vec.bv_page = virt_to_page(range);
+ req->special_vec.bv_offset = offset_in_page(range);
+ req->special_vec.bv_len = sizeof(*range) * segments;
+ req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+ return 0;
+}
+
static inline void virtblk_request_done(struct request *req)
{
struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
+ if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
+ kfree(page_address(req->special_vec.bv_page) +
+ req->special_vec.bv_offset);
+ }
+
switch (req_op(req)) {
case REQ_OP_SCSI_IN:
case REQ_OP_SCSI_OUT:
@@ -241,6 +282,7 @@
int qid = hctx->queue_num;
int err;
bool notify = false;
+ bool unmap = false;
u32 type;
BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
@@ -253,6 +295,13 @@
case REQ_OP_FLUSH:
type = VIRTIO_BLK_T_FLUSH;
break;
+ case REQ_OP_DISCARD:
+ type = VIRTIO_BLK_T_DISCARD;
+ break;
+ case REQ_OP_WRITE_ZEROES:
+ type = VIRTIO_BLK_T_WRITE_ZEROES;
+ unmap = !(req->cmd_flags & REQ_NOUNMAP);
+ break;
case REQ_OP_SCSI_IN:
case REQ_OP_SCSI_OUT:
type = VIRTIO_BLK_T_SCSI_CMD;
@@ -272,6 +321,12 @@
blk_mq_start_request(req);
+ if (type == VIRTIO_BLK_T_DISCARD || type == VIRTIO_BLK_T_WRITE_ZEROES) {
+ err = virtblk_setup_discard_write_zeroes(req, unmap);
+ if (err)
+ return BLK_STS_RESOURCE;
+ }
+
num = blk_rq_map_sg(hctx->queue, req, vbr->sg);
if (num) {
if (rq_data_dir(req) == WRITE)
@@ -855,6 +910,42 @@
if (!err && opt_io_size)
blk_queue_io_opt(q, blk_size * opt_io_size);
+ if (virtio_has_feature(vdev, VIRTIO_BLK_F_DISCARD)) {
+ q->limits.discard_granularity = blk_size;
+
+ virtio_cread(vdev, struct virtio_blk_config,
+ discard_sector_alignment, &v);
+ if (v)
+ q->limits.discard_alignment = v << 9;
+ else
+ q->limits.discard_alignment = 0;
+
+ virtio_cread(vdev, struct virtio_blk_config,
+ max_discard_sectors, &v);
+ if (v)
+ blk_queue_max_discard_sectors(q, v);
+ else
+ blk_queue_max_discard_sectors(q, UINT_MAX);
+
+ virtio_cread(vdev, struct virtio_blk_config, max_discard_seg,
+ &v);
+ if (v && v <= DISCARD_MAX_SEGMENTS)
+ blk_queue_max_discard_segments(q, v);
+ else
+ blk_queue_max_discard_segments(q, DISCARD_MAX_SEGMENTS);
+
+ blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
+ }
+
+ if (virtio_has_feature(vdev, VIRTIO_BLK_F_WRITE_ZEROES)) {
+ virtio_cread(vdev, struct virtio_blk_config,
+ max_write_zeroes_sectors, &v);
+ if (v)
+ blk_queue_max_write_zeroes_sectors(q, v);
+ else
+ blk_queue_max_write_zeroes_sectors(q, UINT_MAX);
+ }
+
virtblk_update_capacity(vblk, false);
virtio_device_ready(vdev);
@@ -964,14 +1055,14 @@
VIRTIO_BLK_F_SCSI,
#endif
VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
- VIRTIO_BLK_F_MQ,
+ VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
}
;
static unsigned int features[] = {
VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_GEOMETRY,
VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
- VIRTIO_BLK_F_MQ,
+ VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
};
static struct virtio_driver virtio_blk = {
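For reference, each element of the range array built in virtblk_setup_discard_write_zeroes() above follows the per-segment descriptor layout this feature defines in include/uapi/linux/virtio_blk.h:

struct virtio_blk_discard_write_zeroes {
	__le64 sector;		/* starting sector of the segment */
	__le32 num_sectors;	/* number of sectors to discard or zero */
	__le32 flags;		/* e.g. VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP */
};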
diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
index 4ed0a78..4d9a388 100644
--- a/drivers/block/zram/zcomp.c
+++ b/drivers/block/zram/zcomp.c
@@ -20,6 +20,7 @@
static const char * const backends[] = {
"lzo",
+ "lzo-rle",
#if IS_ENABLED(CONFIG_CRYPTO_LZ4)
"lz4",
#endif
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 76abe40..e2c9e76 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -41,7 +41,7 @@
static DEFINE_MUTEX(zram_index_mutex);
static int zram_major;
-static const char *default_compressor = "lzo";
+static const char *default_compressor = "lzo-rle";
/* Module params (documentation at end) */
static unsigned int num_devices = 1;
@@ -1588,23 +1588,55 @@
return len;
}
-static int zram_open(struct block_device *bdev, fmode_t mode)
+int zram_open(struct block_device *bdev, fmode_t mode)
{
- int ret = 0;
struct zram *zram;
+ int open_count;
WARN_ON(!mutex_is_locked(&bdev->bd_mutex));
zram = bdev->bd_disk->private_data;
/* zram was claimed to reset so open request fails */
if (zram->claim)
- ret = -EBUSY;
+ goto out_busy;
- return ret;
+ /*
+ * Chromium OS specific behavior:
+ * sys_swapon opens the device once to populate its swapinfo->swap_file
+ * and once when it claims the block device (blkdev_get). By limiting
+ * the maximum number of opens to 2, we ensure there are no prior open
+ * references before swap is enabled.
+ * (Note, kzalloc ensures nr_opens starts at 0.)
+ */
+ open_count = atomic_inc_return(&zram->nr_opens);
+ if (open_count > 2)
+ goto out_busy_dec_nr_opens;
+ /*
+ * swapon(2) claims the block device after setup. If a zram is claimed
+ * then open attempts are rejected. This is belt-and-suspenders as
+ * the block device and swap_file will both hold open nr_opens until
+ * swapoff(2) is called.
+ */
+ if (bdev->bd_holder != NULL)
+ goto out_busy_dec_nr_opens;
+
+ return 0;
+
+out_busy_dec_nr_opens:
+ atomic_dec(&zram->nr_opens);
+out_busy:
+ return -EBUSY;
+}
+
+void zram_release(struct gendisk *disk, fmode_t mode)
+{
+ struct zram *zram = disk->private_data;
+ atomic_dec(&zram->nr_opens);
}
static const struct block_device_operations zram_devops = {
.open = zram_open,
+ .release = zram_release,
.swap_slot_free_notify = zram_slot_free_notify,
.rw_page = zram_rw_page,
.owner = THIS_MODULE
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index d1095df..e934528 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -16,6 +16,8 @@
#define _ZRAM_DRV_H_
#include <linux/rwsem.h>
+#include <linux/spinlock.h>
+#include <linux/atomic.h>
#include <linux/zsmalloc.h>
#include <linux/crypto.h>
@@ -93,6 +95,8 @@
* the number of pages zram can consume for storing compressed data
*/
unsigned long limit_pages;
+ int max_comp_streams;
+ atomic_t nr_opens; /* number of active file handles */
struct zram_stats stats;
/*
diff --git a/drivers/clk/clk-bulk.c b/drivers/clk/clk-bulk.c
index 6904ed6..6a7118d 100644
--- a/drivers/clk/clk-bulk.c
+++ b/drivers/clk/clk-bulk.c
@@ -17,8 +17,65 @@
*/
#include <linux/clk.h>
+#include <linux/clk-provider.h>
#include <linux/device.h>
#include <linux/export.h>
+#include <linux/of.h>
+#include <linux/slab.h>
+
+static int __must_check of_clk_bulk_get(struct device_node *np, int num_clks,
+ struct clk_bulk_data *clks)
+{
+ int ret;
+ int i;
+
+ for (i = 0; i < num_clks; i++)
+ clks[i].clk = NULL;
+
+ for (i = 0; i < num_clks; i++) {
+ clks[i].clk = of_clk_get(np, i);
+ if (IS_ERR(clks[i].clk)) {
+ ret = PTR_ERR(clks[i].clk);
+ pr_err("%pOF: Failed to get clk index: %d ret: %d\n",
+ np, i, ret);
+ clks[i].clk = NULL;
+ goto err;
+ }
+ }
+
+ return 0;
+
+err:
+ clk_bulk_put(i, clks);
+
+ return ret;
+}
+
+static int __must_check of_clk_bulk_get_all(struct device_node *np,
+ struct clk_bulk_data **clks)
+{
+ struct clk_bulk_data *clk_bulk;
+ int num_clks;
+ int ret;
+
+ num_clks = of_clk_get_parent_count(np);
+ if (!num_clks)
+ return 0;
+
+ clk_bulk = kmalloc_array(num_clks, sizeof(*clk_bulk), GFP_KERNEL);
+ if (!clk_bulk)
+ return -ENOMEM;
+
+ ret = of_clk_bulk_get(np, num_clks, clk_bulk);
+ if (ret) {
+ kfree(clk_bulk);
+ return ret;
+ }
+
+ *clks = clk_bulk;
+
+ return num_clks;
+}
void clk_bulk_put(int num_clks, struct clk_bulk_data *clks)
{
@@ -59,6 +116,29 @@
}
EXPORT_SYMBOL(clk_bulk_get);
+void clk_bulk_put_all(int num_clks, struct clk_bulk_data *clks)
+{
+ if (IS_ERR_OR_NULL(clks))
+ return;
+
+ clk_bulk_put(num_clks, clks);
+
+ kfree(clks);
+}
+EXPORT_SYMBOL(clk_bulk_put_all);
+
+int __must_check clk_bulk_get_all(struct device *dev,
+ struct clk_bulk_data **clks)
+{
+ struct device_node *np = dev_of_node(dev);
+
+ if (!np)
+ return 0;
+
+ return of_clk_bulk_get_all(np, clks);
+}
+EXPORT_SYMBOL(clk_bulk_get_all);
+
#ifdef CONFIG_HAVE_CLK_PREPARE
/**
diff --git a/drivers/clk/clk-devres.c b/drivers/clk/clk-devres.c
index d854e26..12c8745 100644
--- a/drivers/clk/clk-devres.c
+++ b/drivers/clk/clk-devres.c
@@ -70,6 +70,30 @@
}
EXPORT_SYMBOL_GPL(devm_clk_bulk_get);
+int __must_check devm_clk_bulk_get_all(struct device *dev,
+ struct clk_bulk_data **clks)
+{
+ struct clk_bulk_devres *devres;
+ int ret;
+
+ devres = devres_alloc(devm_clk_bulk_release,
+ sizeof(*devres), GFP_KERNEL);
+ if (!devres)
+ return -ENOMEM;
+
+ ret = clk_bulk_get_all(dev, &devres->clks);
+ if (ret > 0) {
+ *clks = devres->clks;
+ devres->num_clks = ret;
+ devres_add(dev, devres);
+ } else {
+ devres_free(devres);
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(devm_clk_bulk_get_all);
+
static int devm_clk_match(struct device *dev, void *res, void *data)
{
struct clk **c = res;
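A minimal consumer sketch for the new *_bulk_get_all helpers (hypothetical driver; assumes its clocks are listed in the device's DT node):

#include <linux/clk.h>
#include <linux/platform_device.h>

static int foo_probe(struct platform_device *pdev)
{
	struct clk_bulk_data *clks = NULL;
	int num_clks;
	int ret;

	/* Grab every clock referenced by the DT node, however many there are. */
	num_clks = devm_clk_bulk_get_all(&pdev->dev, &clks);
	if (num_clks < 0)
		return num_clks;

	ret = clk_bulk_prepare_enable(num_clks, clks);
	if (ret)
		return ret;

	/* ... device setup; undo with clk_bulk_disable_unprepare() ... */
	return 0;
}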
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index e35c397..8cfd8f4 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -2280,6 +2280,7 @@
ret = cpufreq_start_governor(policy);
if (!ret) {
pr_debug("cpufreq: governor change\n");
+ sched_cpufreq_governor_change(policy, old_gov);
return 0;
}
cpufreq_exit_governor(policy);
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 6df894d..96a3a9b 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -221,7 +221,7 @@
}
/* Take note of the planned idle state. */
- sched_idle_set_state(target_state);
+ sched_idle_set_state(target_state, index);
trace_cpu_idle_rcuidle(index, dev->cpu);
time_start = ns_to_ktime(local_clock());
@@ -235,7 +235,7 @@
trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
/* The cpu is no longer idle or about to enter idle. */
- sched_idle_set_state(NULL);
+ sched_idle_set_state(NULL, -1);
if (broadcast) {
if (WARN_ON_ONCE(!irqs_disabled()))
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index a89ebd9..41aaac6 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -658,7 +658,7 @@
* No 'host' or dax_operations since there is no access to this
* device outside of mmap of the resulting character device.
*/
- dax_dev = alloc_dax(dev_dax, NULL, NULL);
+ dax_dev = alloc_dax(dev_dax, NULL, NULL, DAXDEV_F_SYNC);
if (!dax_dev) {
rc = -ENOMEM;
goto err_dax;
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 6e928f3..e3234fc 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -72,26 +72,18 @@
EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
#endif
-/**
- * __bdev_dax_supported() - Check if the device supports dax for filesystem
- * @bdev: block device to check
- * @blocksize: The block size of the device
- *
- * This is a library function for filesystems to check if the block device
- * can be mounted with dax option.
- *
- * Return: true if supported, false if unsupported
- */
-bool __bdev_dax_supported(struct block_device *bdev, int blocksize)
+bool __generic_fsdax_supported(struct dax_device *dax_dev,
+ struct block_device *bdev, int blocksize, sector_t start,
+ sector_t sectors)
{
- struct dax_device *dax_dev;
bool dax_enabled = false;
- struct request_queue *q;
- pgoff_t pgoff;
- int err, id;
- pfn_t pfn;
- long len;
+ pgoff_t pgoff, pgoff_end;
char buf[BDEVNAME_SIZE];
+ void *kaddr, *end_kaddr;
+ pfn_t pfn, end_pfn;
+ sector_t last_page;
+ long len, len2;
+ int err, id;
if (blocksize != PAGE_SIZE) {
pr_debug("%s: error: unsupported blocksize for dax\n",
@@ -99,36 +91,29 @@
return false;
}
- q = bdev_get_queue(bdev);
- if (!q || !blk_queue_dax(q)) {
- pr_debug("%s: error: request queue doesn't support dax\n",
- bdevname(bdev, buf));
- return false;
- }
-
- err = bdev_dax_pgoff(bdev, 0, PAGE_SIZE, &pgoff);
+ err = bdev_dax_pgoff(bdev, start, PAGE_SIZE, &pgoff);
if (err) {
pr_debug("%s: error: unaligned partition for dax\n",
bdevname(bdev, buf));
return false;
}
- dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
- if (!dax_dev) {
- pr_debug("%s: error: device does not support dax\n",
+ last_page = PFN_DOWN((start + sectors - 1) * 512) * PAGE_SIZE / 512;
+ err = bdev_dax_pgoff(bdev, last_page, PAGE_SIZE, &pgoff_end);
+ if (err) {
+ pr_debug("%s: error: unaligned partition for dax\n",
bdevname(bdev, buf));
return false;
}
id = dax_read_lock();
- len = dax_direct_access(dax_dev, pgoff, 1, NULL, &pfn);
+ len = dax_direct_access(dax_dev, pgoff, 1, &kaddr, &pfn);
+ len2 = dax_direct_access(dax_dev, pgoff_end, 1, &end_kaddr, &end_pfn);
dax_read_unlock(id);
- put_dax(dax_dev);
-
- if (len < 1) {
+ if (len < 1 || len2 < 1) {
pr_debug("%s: error: dax access failed (%ld)\n",
- bdevname(bdev, buf), len);
+ bdevname(bdev, buf), len < 1 ? len : len2);
return false;
}
@@ -143,13 +128,20 @@
*/
WARN_ON(IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API));
dax_enabled = true;
- } else if (pfn_t_devmap(pfn)) {
- struct dev_pagemap *pgmap;
+ } else if (pfn_t_devmap(pfn) && pfn_t_devmap(end_pfn)) {
+ struct dev_pagemap *pgmap, *end_pgmap;
pgmap = get_dev_pagemap(pfn_t_to_pfn(pfn), NULL);
- if (pgmap && pgmap->type == MEMORY_DEVICE_FS_DAX)
+ end_pgmap = get_dev_pagemap(pfn_t_to_pfn(end_pfn), NULL);
+ if (pgmap && pgmap == end_pgmap && pgmap->type == MEMORY_DEVICE_FS_DAX
+ && pfn_t_to_page(pfn)->pgmap == pgmap
+ && pfn_t_to_page(end_pfn)->pgmap == pgmap
+ && pfn_t_to_pfn(pfn) == PHYS_PFN(__pa(kaddr))
+ && pfn_t_to_pfn(end_pfn) == PHYS_PFN(__pa(end_kaddr)))
dax_enabled = true;
put_dev_pagemap(pgmap);
+ put_dev_pagemap(end_pgmap);
+
}
if (!dax_enabled) {
@@ -159,6 +151,49 @@
}
return true;
}
+EXPORT_SYMBOL_GPL(__generic_fsdax_supported);
+
+/**
+ * __bdev_dax_supported() - Check if the device supports dax for filesystem
+ * @bdev: block device to check
+ * @blocksize: The block size of the device
+ *
+ * This is a library function for filesystems to check if the block device
+ * can be mounted with dax option.
+ *
+ * Return: true if supported, false if unsupported
+ */
+bool __bdev_dax_supported(struct block_device *bdev, int blocksize)
+{
+ struct dax_device *dax_dev;
+ struct request_queue *q;
+ char buf[BDEVNAME_SIZE];
+ bool ret;
+ int id;
+
+ q = bdev_get_queue(bdev);
+ if (!q || !blk_queue_dax(q)) {
+ pr_debug("%s: error: request queue doesn't support dax\n",
+ bdevname(bdev, buf));
+ return false;
+ }
+
+ dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
+ if (!dax_dev) {
+ pr_debug("%s: error: device does not support dax\n",
+ bdevname(bdev, buf));
+ return false;
+ }
+
+ id = dax_read_lock();
+ ret = dax_supported(dax_dev, bdev, blocksize, 0,
+ i_size_read(bdev->bd_inode) / 512);
+ dax_read_unlock(id);
+
+ put_dax(dax_dev);
+
+ return ret;
+}
EXPORT_SYMBOL_GPL(__bdev_dax_supported);
#endif
@@ -167,6 +202,8 @@
DAXDEV_ALIVE,
/* gate whether dax_flush() calls the low level flush routine */
DAXDEV_WRITE_CACHE,
+ /* flag to check if device supports synchronous flush */
+ DAXDEV_SYNC,
};
/**
@@ -284,6 +321,15 @@
}
EXPORT_SYMBOL_GPL(dax_direct_access);
+bool dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
+ int blocksize, sector_t start, sector_t len)
+{
+ if (!dax_alive(dax_dev))
+ return false;
+
+ return dax_dev->ops->dax_supported(dax_dev, bdev, blocksize, start, len);
+}
+
size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
size_t bytes, struct iov_iter *i)
{
@@ -335,6 +381,18 @@
}
EXPORT_SYMBOL_GPL(dax_write_cache_enabled);
+bool __dax_synchronous(struct dax_device *dax_dev)
+{
+ return test_bit(DAXDEV_SYNC, &dax_dev->flags);
+}
+EXPORT_SYMBOL_GPL(__dax_synchronous);
+
+void __set_dax_synchronous(struct dax_device *dax_dev)
+{
+ set_bit(DAXDEV_SYNC, &dax_dev->flags);
+}
+EXPORT_SYMBOL_GPL(__set_dax_synchronous);
+
bool dax_alive(struct dax_device *dax_dev)
{
lockdep_assert_held(&dax_srcu);
@@ -488,7 +546,7 @@
}
struct dax_device *alloc_dax(void *private, const char *__host,
- const struct dax_operations *ops)
+ const struct dax_operations *ops, unsigned long flags)
{
struct dax_device *dax_dev;
const char *host;
@@ -511,6 +569,9 @@
dax_add_host(dax_dev, host);
dax_dev->ops = ops;
dax_dev->private = private;
+ if (flags & DAXDEV_F_SYNC)
+ set_dax_synchronous(dax_dev);
+
return dax_dev;
err_dev:
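A sketch of how a storage provider would use the new flags argument to alloc_dax(); the wrapper below is hypothetical, but DAXDEV_F_SYNC simply marks the device so that dax_synchronous() reports true:

#include <linux/dax.h>

static struct dax_device *foo_alloc_dax(void *priv, const char *host,
					const struct dax_operations *ops,
					bool synchronous)
{
	return alloc_dax(priv, host, ops, synchronous ? DAXDEV_F_SYNC : 0);
}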
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 8b8c123..9949fd9 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -447,6 +447,18 @@
If unsure, say N.
+config DM_INIT
+ bool "DM \"dm-mod.create=\" parameter support"
+ depends on BLK_DEV_DM=y
+ ---help---
+ Enable "dm-mod.create=" parameter to create mapped devices at init time.
+ This option is useful to allow mounting rootfs without requiring an
+ initramfs.
+ See Documentation/device-mapper/dm-init.txt for dm-mod.create="..."
+ format.
+
+ If unsure, say N.
+
config DM_UEVENT
bool "DM uevents"
depends on BLK_DEV_DM
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 822f4e8..a52b703 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -69,6 +69,10 @@
obj-$(CONFIG_DM_ZONED) += dm-zoned.o
obj-$(CONFIG_DM_WRITECACHE) += dm-writecache.o
+ifeq ($(CONFIG_DM_INIT),y)
+dm-mod-objs += dm-init.o
+endif
+
ifeq ($(CONFIG_DM_UEVENT),y)
dm-mod-objs += dm-uevent.o
endif
diff --git a/drivers/md/dm-init.c b/drivers/md/dm-init.c
new file mode 100644
index 0000000..6f06e6b
--- /dev/null
+++ b/drivers/md/dm-init.c
@@ -0,0 +1,554 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * dm-init.c
+ * Copyright (C) 2017 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ *
+ * This file is released under the GPLv2.
+ */
+
+#include <linux/ctype.h>
+#include <linux/device.h>
+#include <linux/device-mapper.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/moduleparam.h>
+
+#define DM_MSG_PREFIX "init"
+#define DM_MAX_DEVICES 256
+#define DM_MAX_TARGETS 256
+#define DM_MAX_STR_SIZE 4096
+
+static char *create;
+
+/*
+ * Format: dm-mod.create=<name>,<uuid>,<minor>,<flags>,<table>[,<table>+][;<name>,<uuid>,<minor>,<flags>,<table>[,<table>+]+]
+ * Table format: <start_sector> <num_sectors> <target_type> <target_args>
+ *
+ * See Documentation/device-mapper/dm-init.txt for dm-mod.create="..." format
+ * details.
+ */
+
+struct dm_device {
+ struct dm_ioctl dmi;
+ struct dm_target_spec *table[DM_MAX_TARGETS];
+ char *target_args_array[DM_MAX_TARGETS];
+ struct list_head list;
+};
+
+const char * const dm_allowed_targets[] __initconst = {
+ "crypt",
+ "delay",
+ "linear",
+ "snapshot-origin",
+ "striped",
+ "verity",
+};
+
+static int __init dm_verify_target_type(const char *target)
+{
+ unsigned int i;
+
+ for (i = 0; i < ARRAY_SIZE(dm_allowed_targets); i++) {
+ if (!strcmp(dm_allowed_targets[i], target))
+ return 0;
+ }
+ return -EINVAL;
+}
+
+static void __init dm_setup_cleanup(struct list_head *devices)
+{
+ struct dm_device *dev, *tmp;
+ unsigned int i;
+
+ list_for_each_entry_safe(dev, tmp, devices, list) {
+ list_del(&dev->list);
+ for (i = 0; i < dev->dmi.target_count; i++) {
+ kfree(dev->table[i]);
+ kfree(dev->target_args_array[i]);
+ }
+ kfree(dev);
+ }
+}
+
+/**
+ * str_field_delimit - delimit a string based on a separator char.
+ * @str: the pointer to the string to delimit.
+ * @separator: char that delimits the field
+ *
+ * Find a @separator and replace it by '\0'.
+ * Remove leading and trailing spaces.
+ * Return the remainder string after the @separator.
+ */
+static char __init *str_field_delimit(char **str, char separator)
+{
+ char *s;
+
+ /* TODO: add support for escaped characters */
+ *str = skip_spaces(*str);
+ s = strchr(*str, separator);
+ /* Delimit the field and remove trailing spaces */
+ if (s)
+ *s = '\0';
+ *str = strim(*str);
+ return s ? ++s : NULL;
+}
+
+/**
+ * dm_parse_table_entry - parse a table entry
+ * @dev: device to store the parsed information.
+ * @str: the pointer to a string with the format:
+ * <start_sector> <num_sectors> <target_type> <target_args>[, ...]
+ *
+ * Return the remainder string after the table entry, i.e., after the comma
+ * that delimits the entry, or NULL if the end of the string was reached.
+ */
+static char __init *dm_parse_table_entry(struct dm_device *dev, char *str)
+{
+ const unsigned int n = dev->dmi.target_count - 1;
+ struct dm_target_spec *sp;
+ unsigned int i;
+ /* fields: */
+ char *field[4];
+ char *next;
+
+ field[0] = str;
+ /* Delimit first 3 fields that are separated by space */
+ for (i = 0; i < ARRAY_SIZE(field) - 1; i++) {
+ field[i + 1] = str_field_delimit(&field[i], ' ');
+ if (!field[i + 1])
+ return ERR_PTR(-EINVAL);
+ }
+ /* Delimit last field that can be terminated by comma */
+ next = str_field_delimit(&field[i], ',');
+
+ sp = kzalloc(sizeof(*sp), GFP_KERNEL);
+ if (!sp)
+ return ERR_PTR(-ENOMEM);
+ dev->table[n] = sp;
+
+ /* start_sector */
+ if (kstrtoull(field[0], 0, &sp->sector_start))
+ return ERR_PTR(-EINVAL);
+ /* num_sector */
+ if (kstrtoull(field[1], 0, &sp->length))
+ return ERR_PTR(-EINVAL);
+ /* target_type */
+ strscpy(sp->target_type, field[2], sizeof(sp->target_type));
+ if (dm_verify_target_type(sp->target_type)) {
+ DMERR("invalid type \"%s\"", sp->target_type);
+ return ERR_PTR(-EINVAL);
+ }
+ /* target_args */
+ dev->target_args_array[n] = kstrndup(field[3], DM_MAX_STR_SIZE,
+ GFP_KERNEL);
+ if (!dev->target_args_array[n])
+ return ERR_PTR(-ENOMEM);
+
+ return next;
+}
+
+/**
+ * dm_parse_table - parse "dm-mod.create=" table field
+ * @dev: device to store the parsed information.
+ * @str: the pointer to a string with the format:
+ * <table>[,<table>+]
+ */
+static int __init dm_parse_table(struct dm_device *dev, char *str)
+{
+ char *table_entry = str;
+
+ while (table_entry) {
+ DMDEBUG("parsing table \"%s\"", str);
+ if (++dev->dmi.target_count > DM_MAX_TARGETS) {
+ DMERR("too many targets %u > %d",
+ dev->dmi.target_count, DM_MAX_TARGETS);
+ return -EINVAL;
+ }
+ table_entry = dm_parse_table_entry(dev, table_entry);
+ if (IS_ERR(table_entry)) {
+ DMERR("couldn't parse table");
+ return PTR_ERR(table_entry);
+ }
+ }
+
+ return 0;
+}
+
+/**
+ * dm_parse_device_entry - parse a device entry
+ * @dev: device to store the parsed information.
+ * @str: the pointer to a string with the format:
+ * name,uuid,minor,flags,table[; ...]
+ *
+ * Return the remainder string after the device entry, i.e., after the
+ * semi-colon that delimits the entry, or NULL if the end of the string was
+ * reached.
+ */
+static char __init *dm_parse_device_entry(struct dm_device *dev, char *str)
+{
+ /* There are 5 fields: name,uuid,minor,flags,table; */
+ char *field[5];
+ unsigned int i;
+ char *next;
+
+ field[0] = str;
+ /* Delimit first 4 fields that are separated by comma */
+ for (i = 0; i < ARRAY_SIZE(field) - 1; i++) {
+ field[i+1] = str_field_delimit(&field[i], ',');
+ if (!field[i+1])
+ return ERR_PTR(-EINVAL);
+ }
+ /* Delimit last field that can be delimited by semi-colon */
+ next = str_field_delimit(&field[i], ';');
+
+ /* name */
+ strscpy(dev->dmi.name, field[0], sizeof(dev->dmi.name));
+ /* uuid */
+ strscpy(dev->dmi.uuid, field[1], sizeof(dev->dmi.uuid));
+ /* minor */
+ if (strlen(field[2])) {
+ if (kstrtoull(field[2], 0, &dev->dmi.dev))
+ return ERR_PTR(-EINVAL);
+ dev->dmi.flags |= DM_PERSISTENT_DEV_FLAG;
+ }
+ /* flags */
+ if (!strcmp(field[3], "ro"))
+ dev->dmi.flags |= DM_READONLY_FLAG;
+ else if (strcmp(field[3], "rw"))
+ return ERR_PTR(-EINVAL);
+ /* table */
+ if (dm_parse_table(dev, field[4]))
+ return ERR_PTR(-EINVAL);
+
+ return next;
+}
+
+/**
+ * dm_parse_devices - parse "dm-mod.create=" argument
+ * @devices: list of struct dm_device to store the parsed information.
+ * @str: the pointer to a string with the format:
+ * <device>[;<device>+]
+ */
+static int __init dm_parse_devices(struct list_head *devices, char *str)
+{
+ unsigned long ndev = 0;
+ struct dm_device *dev;
+ char *device = str;
+
+ DMDEBUG("parsing \"%s\"", str);
+ while (device) {
+ dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+ if (!dev)
+ return -ENOMEM;
+ list_add_tail(&dev->list, devices);
+
+ if (++ndev > DM_MAX_DEVICES) {
+ DMERR("too many devices %lu > %d",
+ ndev, DM_MAX_DEVICES);
+ return -EINVAL;
+ }
+
+ device = dm_parse_device_entry(dev, device);
+ if (IS_ERR(device)) {
+ DMERR("couldn't parse device");
+ return PTR_ERR(device);
+ }
+ }
+
+ return 0;
+}
+
+/**
+ * dm_init_init - parse "dm-mod.create=" argument and configure drivers
+ */
+static int __init dm_init_init(void)
+{
+ struct dm_device *dev;
+ LIST_HEAD(devices);
+ char *str;
+ int r;
+
+ if (!create)
+ return 0;
+
+ if (strlen(create) >= DM_MAX_STR_SIZE) {
+ DMERR("Argument is too big. Limit is %d\n", DM_MAX_STR_SIZE);
+ return -EINVAL;
+ }
+ str = kstrndup(create, DM_MAX_STR_SIZE, GFP_KERNEL);
+ if (!str)
+ return -ENOMEM;
+
+ r = dm_parse_devices(&devices, str);
+ if (r)
+ goto out;
+
+ DMINFO("waiting for all devices to be available before creating mapped devices\n");
+ wait_for_device_probe();
+
+ list_for_each_entry(dev, &devices, list) {
+ if (dm_early_create(&dev->dmi, dev->table,
+ dev->target_args_array))
+ break;
+ }
+out:
+ kfree(str);
+ dm_setup_cleanup(&devices);
+ return r;
+}
+
+late_initcall(dm_init_init);
+
+module_param(create, charp, 0);
+MODULE_PARM_DESC(create, "Create a mapped device in early boot");
+
+/* ---------------------------------------------------------------
+ * ChromeOS shim - convert dm= format to dm-mod.create= format
+ * ---------------------------------------------------------------
+ */
+
+struct dm_chrome_target {
+ char *field[4];
+};
+
+struct dm_chrome_dev {
+ char *name, *uuid, *mode;
+ unsigned int num_targets;
+ struct dm_chrome_target targets[DM_MAX_TARGETS];
+};
+
+static char __init *dm_chrome_parse_target(char *str, struct dm_chrome_target *tgt)
+{
+ unsigned int i;
+
+ tgt->field[0] = str;
+ /* Delimit first 3 fields that are separated by space */
+ for (i = 0; i < ARRAY_SIZE(tgt->field) - 1; i++) {
+ tgt->field[i + 1] = str_field_delimit(&tgt->field[i], ' ');
+ if (!tgt->field[i + 1])
+ return NULL;
+ }
+ /* Delimit last field that can be terminated by comma */
+ return str_field_delimit(&tgt->field[i], ',');
+}
+
+static char __init *dm_chrome_parse_dev(char *str, struct dm_chrome_dev *dev)
+{
+ char *target, *num;
+ unsigned int i;
+
+ if (!str)
+ return ERR_PTR(-EINVAL);
+
+ target = str_field_delimit(&str, ',');
+ if (!target)
+ return ERR_PTR(-EINVAL);
+
+ /* Delimit first 3 fields that are separated by space */
+ dev->name = str;
+ dev->uuid = str_field_delimit(&dev->name, ' ');
+ if (!dev->uuid)
+ return ERR_PTR(-EINVAL);
+
+ dev->mode = str_field_delimit(&dev->uuid, ' ');
+ if (!dev->mode)
+ return ERR_PTR(-EINVAL);
+
+ /* num is optional */
+ num = str_field_delimit(&dev->mode, ' ');
+ if (!num)
+ dev->num_targets = 1;
+ else {
+ /* Delimit num and check if it is the last field */
+ if (str_field_delimit(&num, ' '))
+ return ERR_PTR(-EINVAL);
+ if (kstrtouint(num, 0, &dev->num_targets))
+ return ERR_PTR(-EINVAL);
+ }
+
+ if (dev->num_targets > DM_MAX_TARGETS) {
+ DMERR("too many targets %u > %d",
+ dev->num_targets, DM_MAX_TARGETS);
+ return ERR_PTR(-EINVAL);
+ }
+
+ for (i = 0; i < dev->num_targets - 1; i++) {
+ target = dm_chrome_parse_target(target, &dev->targets[i]);
+ if (!target)
+ return ERR_PTR(-EINVAL);
+ }
+ /* The last one can return NULL if it reaches the end of str */
+ return dm_chrome_parse_target(target, &dev->targets[i]);
+}
+
+static char __init *dm_chrome_convert(struct dm_chrome_dev *devs, unsigned int num_devs)
+{
+ char *str = kmalloc(DM_MAX_STR_SIZE, GFP_KERNEL);
+ char *p = str;
+ unsigned int i, j;
+ int ret;
+
+ if (!str)
+ return ERR_PTR(-ENOMEM);
+
+ for (i = 0; i < num_devs; i++) {
+ if (!strcmp(devs[i].uuid, "none"))
+ devs[i].uuid = "";
+ ret = snprintf(p, DM_MAX_STR_SIZE - (p - str),
+ "%s,%s,,%s",
+ devs[i].name,
+ devs[i].uuid,
+ devs[i].mode);
+ if (ret < 0)
+ goto out;
+ p += ret;
+
+ for (j = 0; j < devs[i].num_targets; j++) {
+ ret = snprintf(p, DM_MAX_STR_SIZE - (p - str),
+ ",%s %s %s %s",
+ devs[i].targets[j].field[0],
+ devs[i].targets[j].field[1],
+ devs[i].targets[j].field[2],
+ devs[i].targets[j].field[3]);
+ if (ret < 0)
+ goto out;
+ p += ret;
+ }
+ if (i < num_devs - 1) {
+ ret = snprintf(p, DM_MAX_STR_SIZE - (p - str), ";");
+ if (ret < 0)
+ goto out;
+ p += ret;
+ }
+ }
+
+ return str;
+
+out:
+ kfree(str);
+ return ERR_PTR(ret);
+}
+
+/**
+ * dm_chrome_shim - convert old dm= format used in chromeos to the new
+ * upstream format.
+ *
+ * ChromeOS old format
+ * -------------------
+ * <device> ::= [<num>] <device-mapper>+
+ * <device-mapper> ::= <head> "," <target>+
+ * <head> ::= <name> <uuid> <mode> [<num>]
+ * <target> ::= <start> <length> <type> <options> ","
+ * <mode> ::= "ro" | "rw"
+ * <uuid> ::= xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx | "none"
+ * <type> ::= "verity" | "bootcache" | ...
+ *
+ * Example:
+ * 2 vboot none ro 1,
+ * 0 1768000 bootcache
+ * device=aa55b119-2a47-8c45-946a-5ac57765011f+1
+ * signature=76e9be054b15884a9fa85973e9cb274c93afadb6
+ * cache_start=1768000 max_blocks=100000 size_limit=23 max_trace=20000,
+ * vroot none ro 1,
+ * 0 1740800 verity payload=254:0 hashtree=254:0 hashstart=1740800 alg=sha1
+ * root_hexdigest=76e9be054b15884a9fa85973e9cb274c93afadb6
+ * salt=5b3549d54d6c7a3837b9b81ed72e49463a64c03680c47835bef94d768e5646fe
+ *
+ * Notes:
+ * 1. uuid is a label for the device and we set it to "none".
+ * 2. The <num> field will be optional initially and assumed to be 1.
+ * Once all the scripts that set these fields have been updated, it will
+ * be made mandatory.
+ */
+
+static char *chrome_create;
+
+static int __init dm_chrome_shim(char *arg)
+{
+ if (!arg || create)
+ return -EINVAL;
+ chrome_create = arg;
+ return 0;
+}
+
+static int __init dm_chrome_parse_devices(void)
+{
+ struct dm_chrome_dev *devs;
+ unsigned int num_devs, i;
+ char *next, *base_str;
+ int ret = 0;
+
+ /* Verify if dm-mod.create was not used */
+ if (!chrome_create || create)
+ return -EINVAL;
+
+ if (strlen(chrome_create) >= DM_MAX_STR_SIZE) {
+ DMERR("Argument is too big. Limit is %d\n", DM_MAX_STR_SIZE);
+ return -EINVAL;
+ }
+
+ base_str = kstrdup(chrome_create, GFP_KERNEL);
+ if (!base_str)
+ return -ENOMEM;
+
+ next = str_field_delimit(&base_str, ' ');
+ if (!next) {
+ ret = -EINVAL;
+ goto out_str;
+ }
+
+ /* if first field is not the optional <num> field */
+ if (kstrtouint(base_str, 0, &num_devs)) {
+ num_devs = 1;
+ /* rewind next pointer */
+ next = base_str;
+ }
+
+ if (num_devs > DM_MAX_DEVICES) {
+ DMERR("too many devices %u > %d", num_devs, DM_MAX_DEVICES);
+ ret = -EINVAL;
+ goto out_str;
+ }
+
+ devs = kcalloc(num_devs, sizeof(*devs), GFP_KERNEL);
+ if (!devs) {
+ ret = -ENOMEM;
+ goto out_str;
+ }
+
+ /* restore string */
+ strcpy(base_str, chrome_create);
+
+ /* parse devices */
+ for (i = 0; i < num_devs; i++) {
+ next = dm_chrome_parse_dev(next, &devs[i]);
+ if (IS_ERR(next)) {
+ DMERR("couldn't parse device");
+ ret = PTR_ERR(next);
+ goto out_devs;
+ }
+ }
+
+ create = dm_chrome_convert(devs, num_devs);
+ if (IS_ERR(create)) {
+ ret = PTR_ERR(create);
+ goto out_devs;
+ }
+
+ DMDEBUG("Converting:\n\tdm=\"%s\"\n\tdm-mod.create=\"%s\"\n",
+ chrome_create, create);
+
+ /* Call upstream code */
+ dm_init_init();
+
+ kfree(create);
+
+out_devs:
+ create = NULL;
+ kfree(devs);
+out_str:
+ kfree(base_str);
+
+ return ret;
+}
+
+late_initcall(dm_chrome_parse_devices);
+
+__setup("dm=", dm_chrome_shim);
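An illustrative boot command line for the two formats handled above; the device names, sizes and major:minor numbers are made up:

    dm-mod.create="lroot,,,rw, 0 4096 linear 98:16 0, 4096 4096 linear 98:32 0" root=/dev/dm-0
    dm="1 vroot none ro 1, 0 4096 linear 98:16 0" root=/dev/dm-0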
diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index f666778..1e03bc8 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -2018,3 +2018,110 @@
return r;
}
+
+
+/**
+ * dm_early_create - create a mapped device in early boot.
+ *
+ * @dmi: Contains main information of the device mapping to be created.
+ * @spec_array: array of pointers to struct dm_target_spec. Describes the
+ * mapping table of the device.
+ * @target_params_array: array of strings with the parameters to a specific
+ * target.
+ *
+ * Instead of having the struct dm_target_spec and the parameters for every
+ * target embedded at the end of struct dm_ioctl (as performed in a normal
+ * ioctl), pass them as arguments, so the caller doesn't need to serialize them.
+ * The size of the spec_array and target_params_array is given by
+ * @dmi->target_count.
+ * This function is supposed to be called in early boot, so locking mechanisms
+ * to protect against concurrent loads are not required.
+ */
+int __init dm_early_create(struct dm_ioctl *dmi,
+ struct dm_target_spec **spec_array,
+ char **target_params_array)
+{
+ int r, m = DM_ANY_MINOR;
+ struct dm_table *t, *old_map;
+ struct mapped_device *md;
+ unsigned int i;
+
+ if (!dmi->target_count)
+ return -EINVAL;
+
+ r = check_name(dmi->name);
+ if (r)
+ return r;
+
+ if (dmi->flags & DM_PERSISTENT_DEV_FLAG)
+ m = MINOR(huge_decode_dev(dmi->dev));
+
+ /* alloc dm device */
+ r = dm_create(m, &md);
+ if (r)
+ return r;
+
+ /* hash insert */
+ r = dm_hash_insert(dmi->name, *dmi->uuid ? dmi->uuid : NULL, md);
+ if (r)
+ goto err_destroy_dm;
+
+ /* alloc table */
+ r = dm_table_create(&t, get_mode(dmi), dmi->target_count, md);
+ if (r)
+ goto err_hash_remove;
+
+ /* add targets */
+ for (i = 0; i < dmi->target_count; i++) {
+ r = dm_table_add_target(t, spec_array[i]->target_type,
+ (sector_t) spec_array[i]->sector_start,
+ (sector_t) spec_array[i]->length,
+ target_params_array[i]);
+ if (r) {
+ DMWARN("error adding target to table");
+ goto err_destroy_table;
+ }
+ }
+
+ /* finish table */
+ r = dm_table_complete(t);
+ if (r)
+ goto err_destroy_table;
+
+ md->type = dm_table_get_type(t);
+ /* setup md->queue to reflect md's type (may block) */
+ r = dm_setup_md_queue(md, t);
+ if (r) {
+ DMWARN("unable to set up device queue for new table.");
+ goto err_destroy_table;
+ }
+
+ /* Set new map */
+ dm_suspend(md, 0);
+ old_map = dm_swap_table(md, t);
+ if (IS_ERR(old_map)) {
+ r = PTR_ERR(old_map);
+ goto err_destroy_table;
+ }
+ set_disk_ro(dm_disk(md), !!(dmi->flags & DM_READONLY_FLAG));
+
+ /* resume device */
+ r = dm_resume(md);
+ if (r)
+ goto err_destroy_table;
+
+ DMINFO("%s (%s) is ready", md->disk->disk_name, dmi->name);
+ dm_put(md);
+ return 0;
+
+err_destroy_table:
+ dm_table_destroy(t);
+err_hash_remove:
+ (void) __hash_remove(__get_name_cell(dmi->name));
+ /* release reference from __get_name_cell */
+ dm_put(md);
+err_destroy_dm:
+ dm_put(md);
+ dm_destroy(md);
+ return r;
+}
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 36275c5..eed37b6 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -882,13 +882,25 @@
}
EXPORT_SYMBOL_GPL(dm_table_set_type);
-static int device_supports_dax(struct dm_target *ti, struct dm_dev *dev,
- sector_t start, sector_t len, void *data)
+/* validate the dax capability of the target device span */
+int device_supports_dax(struct dm_target *ti, struct dm_dev *dev,
+ sector_t start, sector_t len, void *data)
{
- return bdev_dax_supported(dev->bdev, PAGE_SIZE);
+ int blocksize = *(int *) data;
+
+ return generic_fsdax_supported(dev->dax_dev, dev->bdev, blocksize,
+ start, len);
}
-static bool dm_table_supports_dax(struct dm_table *t)
+/* Check devices support synchronous DAX */
+static int device_synchronous(struct dm_target *ti, struct dm_dev *dev,
+ sector_t start, sector_t len, void *data)
+{
+ return dev->dax_dev && dax_synchronous(dev->dax_dev);
+}
+
+bool dm_table_supports_dax(struct dm_table *t,
+ iterate_devices_callout_fn iterate_fn, int *blocksize)
{
struct dm_target *ti;
unsigned i;
@@ -901,7 +913,7 @@
return false;
if (!ti->type->iterate_devices ||
- !ti->type->iterate_devices(ti, device_supports_dax, NULL))
+ !ti->type->iterate_devices(ti, iterate_fn, blocksize))
return false;
}
@@ -937,6 +949,7 @@
struct dm_target *tgt;
struct list_head *devices = dm_table_get_devices(t);
enum dm_queue_mode live_md_type = dm_get_md_type(t->md);
+ int page_size = PAGE_SIZE;
if (t->type != DM_TYPE_NONE) {
/* target already set the table's type */
@@ -981,7 +994,7 @@
verify_bio_based:
/* We must use this table as bio-based */
t->type = DM_TYPE_BIO_BASED;
- if (dm_table_supports_dax(t) ||
+ if (dm_table_supports_dax(t, device_supports_dax, &page_size) ||
(list_empty(devices) && live_md_type == DM_TYPE_DAX_BIO_BASED)) {
t->type = DM_TYPE_DAX_BIO_BASED;
} else {
@@ -1909,6 +1922,7 @@
struct queue_limits *limits)
{
bool wc = false, fua = false;
+ int page_size = PAGE_SIZE;
/*
* Copy table's limits to the DM device's request_queue
@@ -1936,8 +1950,11 @@
}
blk_queue_write_cache(q, wc, fua);
- if (dm_table_supports_dax(t))
+ if (dm_table_supports_dax(t, device_supports_dax, &page_size)) {
blk_queue_flag_set(QUEUE_FLAG_DAX, q);
+ if (dm_table_supports_dax(t, device_synchronous, NULL))
+ set_dax_synchronous(t->md->dax_dev);
+ }
else
blk_queue_flag_clear(QUEUE_FLAG_DAX, q);
diff --git a/drivers/md/dm-verity-fec.c b/drivers/md/dm-verity-fec.c
index bb83279..6c6493c 100644
--- a/drivers/md/dm-verity-fec.c
+++ b/drivers/md/dm-verity-fec.c
@@ -11,6 +11,7 @@
#include "dm-verity-fec.h"
#include <linux/math64.h>
+#include <linux/sysfs.h>
#define DM_MSG_PREFIX "verity-fec"
@@ -175,9 +176,11 @@
if (r < 0 && neras)
DMERR_LIMIT("%s: FEC %llu: failed to correct: %d",
v->data_dev->name, (unsigned long long)rsb, r);
- else if (r > 0)
+ else if (r > 0) {
DMWARN_LIMIT("%s: FEC %llu: corrected %d errors",
v->data_dev->name, (unsigned long long)rsb, r);
+ atomic_add_unless(&v->fec->corrected, 1, INT_MAX);
+ }
return r;
}
@@ -545,6 +548,7 @@
void verity_fec_dtr(struct dm_verity *v)
{
struct dm_verity_fec *f = v->fec;
+ struct kobject *kobj = &f->kobj_holder.kobj;
if (!verity_fec_is_enabled(v))
goto out;
@@ -562,6 +566,12 @@
if (f->dev)
dm_put_device(v->ti, f->dev);
+
+ if (kobj->state_initialized) {
+ kobject_put(kobj);
+ wait_for_completion(dm_get_completion_from_kobject(kobj));
+ }
+
out:
kfree(f);
v->fec = NULL;
@@ -650,6 +660,28 @@
return 0;
}
+static ssize_t corrected_show(struct kobject *kobj, struct kobj_attribute *attr,
+ char *buf)
+{
+ struct dm_verity_fec *f = container_of(kobj, struct dm_verity_fec,
+ kobj_holder.kobj);
+
+ return sprintf(buf, "%d\n", atomic_read(&f->corrected));
+}
+
+static struct kobj_attribute attr_corrected = __ATTR_RO(corrected);
+
+static struct attribute *fec_attrs[] = {
+ &attr_corrected.attr,
+ NULL
+};
+
+static struct kobj_type fec_ktype = {
+ .sysfs_ops = &kobj_sysfs_ops,
+ .default_attrs = fec_attrs,
+ .release = dm_kobject_release
+};
+
/*
* Allocate dm_verity_fec for v->fec. Must be called before verity_fec_ctr.
*/
@@ -673,8 +705,10 @@
*/
int verity_fec_ctr(struct dm_verity *v)
{
+ int r;
struct dm_verity_fec *f = v->fec;
struct dm_target *ti = v->ti;
+ struct mapped_device *md = dm_table_get_md(ti->table);
u64 hash_blocks;
int ret;
@@ -683,6 +717,16 @@
return 0;
}
+ /* Create a kobject and sysfs attributes */
+ init_completion(&f->kobj_holder.completion);
+
+ r = kobject_init_and_add(&f->kobj_holder.kobj, &fec_ktype,
+ &disk_to_dev(dm_disk(md))->kobj, "%s", "fec");
+ if (r) {
+ ti->error = "Cannot create kobject";
+ return r;
+ }
+
/*
* FEC is computed over data blocks, possible metadata, and
* hash blocks. In other words, FEC covers total of fec_blocks
diff --git a/drivers/md/dm-verity-fec.h b/drivers/md/dm-verity-fec.h
index 6ad803b..93af417 100644
--- a/drivers/md/dm-verity-fec.h
+++ b/drivers/md/dm-verity-fec.h
@@ -12,6 +12,8 @@
#ifndef DM_VERITY_FEC_H
#define DM_VERITY_FEC_H
+#include "dm.h"
+#include "dm-core.h"
#include "dm-verity.h"
#include <linux/rslib.h>
@@ -51,6 +53,8 @@
mempool_t extra_pool; /* mempool for extra buffers */
mempool_t output_pool; /* mempool for output */
struct kmem_cache *cache; /* cache for buffers */
+ atomic_t corrected; /* corrected errors */
+ struct dm_kobject_holder kobj_holder; /* for sysfs attributes */
};
/* per-bio data */
diff --git a/drivers/md/dm-verity-target.c b/drivers/md/dm-verity-target.c
index e3599b4..6332a83 100644
--- a/drivers/md/dm-verity-target.c
+++ b/drivers/md/dm-verity-target.c
@@ -17,8 +17,12 @@
#include "dm-verity.h"
#include "dm-verity-fec.h"
+#include <linux/async.h>
+#include <linux/delay.h>
+#include <linux/device-mapper.h>
#include <linux/module.h>
#include <linux/reboot.h>
+#include <crypto/hash.h>
#define DM_MSG_PREFIX "verity"
@@ -28,6 +32,7 @@
#define DM_VERITY_DEFAULT_PREFETCH_SIZE 262144
#define DM_VERITY_MAX_CORRUPTED_ERRS 100
+#define DM_VERITY_NUM_POSITIONAL_ARGS 10
#define DM_VERITY_OPT_LOGGING "ignore_corruption"
#define DM_VERITY_OPT_RESTART "restart_on_corruption"
@@ -47,6 +52,118 @@
unsigned n_blocks;
};
+/*
+ * Provide a lightweight means of specifying the global default for
+ * error behavior: eio, panic, none, or notify.
+ * Legacy support for 0 = eio, 1 = reboot/panic, 2 = none, 3 = notify.
+ * This is matched to the enum in dm-verity.h.
+ */
+static const char *allowed_error_behaviors[] = { "eio", "panic", "none",
+ "notify", NULL };
+static char *error_behavior = "eio";
+module_param(error_behavior, charp, 0644);
+MODULE_PARM_DESC(error_behavior, "Behavior on error "
+ "(eio, panic, none, notify)");
+
+/* Controls whether verity_get_device will wait forever for a device. */
+static int dev_wait;
+module_param(dev_wait, int, 0444);
+MODULE_PARM_DESC(dev_wait, "Wait forever for a backing device");
+
+static BLOCKING_NOTIFIER_HEAD(verity_error_notifier);
+
+int dm_verity_register_error_notifier(struct notifier_block *nb)
+{
+ return blocking_notifier_chain_register(&verity_error_notifier, nb);
+}
+EXPORT_SYMBOL_GPL(dm_verity_register_error_notifier);
+
+int dm_verity_unregister_error_notifier(struct notifier_block *nb)
+{
+ return blocking_notifier_chain_unregister(&verity_error_notifier, nb);
+}
+EXPORT_SYMBOL_GPL(dm_verity_unregister_error_notifier);
+
+/* If the request is not successful, this handler takes action.
+ * TODO make this call a registered handler.
+ */
+static void verity_error(struct dm_verity *v, struct dm_verity_io *io,
+ blk_status_t status)
+{
+ const char *message = v->hash_failed ? "integrity" : "block";
+ int error_behavior = DM_VERITY_ERROR_BEHAVIOR_PANIC;
+ dev_t devt = 0;
+ u64 block = ~0;
+ struct dm_verity_error_state error_state;
+ /* If the hash did not fail, then this is likely transient. */
+ int transient = !v->hash_failed;
+
+ devt = v->data_dev->bdev->bd_dev;
+ error_behavior = v->error_behavior;
+
+ DMERR_LIMIT("verification failure occurred: %s failure", message);
+
+ if (error_behavior == DM_VERITY_ERROR_BEHAVIOR_NOTIFY) {
+ error_state.code = status;
+ error_state.transient = transient;
+ error_state.block = block;
+ error_state.message = message;
+ error_state.dev_start = v->data_start;
+ error_state.dev_len = v->data_blocks;
+ error_state.dev = v->data_dev->bdev;
+ error_state.hash_dev_start = v->hash_start;
+ error_state.hash_dev_len = v->hash_blocks;
+ error_state.hash_dev = v->hash_dev->bdev;
+
+ /* Set default fallthrough behavior. */
+ error_state.behavior = DM_VERITY_ERROR_BEHAVIOR_PANIC;
+ error_behavior = DM_VERITY_ERROR_BEHAVIOR_PANIC;
+
+ if (!blocking_notifier_call_chain(
+ &verity_error_notifier, transient, &error_state)) {
+ error_behavior = error_state.behavior;
+ }
+ }
+
+ switch (error_behavior) {
+ case DM_VERITY_ERROR_BEHAVIOR_EIO:
+ break;
+ case DM_VERITY_ERROR_BEHAVIOR_NONE:
+ break;
+ default:
+ if (!transient)
+ goto do_panic;
+ }
+ return;
+
+do_panic:
+ panic("dm-verity failure: "
+ "device:%u:%u status:%d block:%llu message:%s",
+ MAJOR(devt), MINOR(devt), status, (u64)block, message);
+}
+
+/**
+ * verity_parse_error_behavior - parse a behavior charp to the enum
+ * @behavior: NUL-terminated char array
+ *
+ * Checks if the behavior is valid either as text or as an index digit
+ * and returns the proper enum value or -1 on error.
+ */
+static int verity_parse_error_behavior(const char *behavior)
+{
+ const char **allowed = allowed_error_behaviors;
+ char index = '0';
+
+ for (; *allowed; allowed++, index++)
+ if (!strcmp(*allowed, behavior) || behavior[0] == index)
+ break;
+
+ if (!*allowed)
+ return -1;
+
+ /* Convert to the integer index matching the enum. */
+ return allowed - allowed_error_behaviors;
+}
+
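Annotation (not part of the patch): a quick illustration of the dual text/index form accepted by verity_parse_error_behavior(), using the enum values from dm-verity.h.

	verity_parse_error_behavior("panic");   /* -> 1 (DM_VERITY_ERROR_BEHAVIOR_PANIC) */
	verity_parse_error_behavior("1");       /* -> 1, legacy numeric form */
	verity_parse_error_behavior("notify");  /* -> 3 (DM_VERITY_ERROR_BEHAVIOR_NOTIFY) */
	verity_parse_error_behavior("bogus");   /* -> -1, rejected by verity_ctr() */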
/*
* Auxiliary structure appended to each dm-bufio buffer. If the value
* hash_verified is nonzero, hash of the block has been verified.
@@ -541,6 +658,8 @@
struct dm_verity *v = io->v;
struct bio *bio = dm_bio_from_per_bio_data(io, v->ti->per_io_data_size);
+ if (status && !verity_fec_is_enabled(io->v))
+ verity_error(v, io, status);
bio->bi_end_io = io->orig_bi_end_io;
bio->bi_status = status;
@@ -564,7 +683,6 @@
verity_finish_io(io, bio->bi_status);
return;
}
-
INIT_WORK(&io->work, verity_work);
queue_work(io->v->verify_wq, &io->work);
}
@@ -913,6 +1031,187 @@
return r;
}
+static int verity_get_device(struct dm_target *ti, const char *devname,
+ struct dm_dev **dm_dev)
+{
+ do {
+ /* Try the normal path first since if everything is ready, it
+ * will be the fastest.
+ */
+ if (!dm_get_device(ti, devname, /*FMODE_READ*/
+ dm_table_get_mode(ti->table), dm_dev))
+ return 0;
+
+ /* No need to be too aggressive since this is a slow path. */
+ msleep(500);
+ } while (dev_wait && (driver_probe_done() != 0 || *dm_dev == NULL));
+ async_synchronize_full();
+ return -1;
+}
+
+struct verity_args {
+ int version;
+ char *data_device;
+ char *hash_device;
+ int data_block_size_bits;
+ int hash_block_size_bits;
+ u64 num_data_blocks;
+ u64 hash_start_block;
+ char *algorithm;
+ char *digest;
+ char *salt;
+ char *error_behavior;
+};
+
+static void pr_args(struct verity_args *args)
+{
+ printk(KERN_INFO "VERITY args: version=%d data_device=%s hash_device=%s"
+ " data_block_size_bits=%d hash_block_size_bits=%d"
+ " num_data_blocks=%lld hash_start_block=%lld"
+ " algorithm=%s digest=%s salt=%s error_behavior=%s\n",
+ args->version,
+ args->data_device,
+ args->hash_device,
+ args->data_block_size_bits,
+ args->hash_block_size_bits,
+ args->num_data_blocks,
+ args->hash_start_block,
+ args->algorithm,
+ args->digest,
+ args->salt,
+ args->error_behavior);
+}
+
+/*
+ * positional_args - collects the arguments using the positional
+ * parameters.
+ * arg# - parameter
+ * 0 - version
+ * 1 - data device
+ * 2 - hash device - may be same as data device
+ * 3 - data block size log2
+ * 4 - hash block size log2
+ * 5 - number of data blocks
+ * 6 - hash start block
+ * 7 - algorithm
+ * 8 - digest
+ * 9 - salt
+ */
+static char *positional_args(unsigned argc, char **argv,
+ struct verity_args *args)
+{
+ unsigned num;
+ unsigned long long num_ll;
+ char dummy;
+
+ if (argc < DM_VERITY_NUM_POSITIONAL_ARGS)
+ return "Invalid argument count: at least 10 arguments required";
+
+ if (sscanf(argv[0], "%u%c", &num, &dummy) != 1 ||
+ num > 1)
+ return "Invalid version";
+ args->version = num;
+
+ args->data_device = argv[1];
+ args->hash_device = argv[2];
+
+
+ if (sscanf(argv[3], "%u%c", &num, &dummy) != 1 ||
+ !num || (num & (num - 1)) ||
+ num > PAGE_SIZE)
+ return "Invalid data device block size";
+ args->data_block_size_bits = ffs(num) - 1;
+
+ if (sscanf(argv[4], "%u%c", &num, &dummy) != 1 ||
+ !num || (num & (num - 1)) ||
+ num > INT_MAX)
+ return "Invalid hash device block size";
+ args->hash_block_size_bits = ffs(num) - 1;
+
+ if (sscanf(argv[5], "%llu%c", &num_ll, &dummy) != 1 ||
+ (sector_t)(num_ll << (args->data_block_size_bits - SECTOR_SHIFT))
+ >> (args->data_block_size_bits - SECTOR_SHIFT) != num_ll)
+ return "Invalid data blocks";
+ args->num_data_blocks = num_ll;
+
+
+ if (sscanf(argv[6], "%llu%c", &num_ll, &dummy) != 1 ||
+ (sector_t)(num_ll << (args->hash_block_size_bits - SECTOR_SHIFT))
+ >> (args->hash_block_size_bits - SECTOR_SHIFT) != num_ll)
+ return "Invalid hash start";
+ args->hash_start_block = num_ll;
+
+
+ args->algorithm = argv[7];
+ args->digest = argv[8];
+ args->salt = argv[9];
+
+ return NULL;
+}
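Annotation (not part of the patch): to make the positional layout above concrete, a hedged example of the ten arguments as they would arrive in argv[]; the device paths, sizes and digest are placeholders.

	static const char *example_argv[] = {
		"1",			/* version */
		"/dev/sda1",		/* data device */
		"/dev/sda2",		/* hash device */
		"4096",			/* data block size (bytes, power of two) */
		"4096",			/* hash block size (bytes, power of two) */
		"262144",		/* number of data blocks (1 GiB of 4 KiB blocks) */
		"262144",		/* hash start block on the hash device */
		"sha256",		/* hash algorithm */
		"<64 hex chars>",	/* root digest */
		"-",			/* no salt */
	};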
+
+static void splitarg(char *arg, char **key, char **val)
+{
+ *key = strsep(&arg, "=");
+ *val = strsep(&arg, "");
+}
+
+static char *chromeos_args(unsigned argc, char **argv, struct verity_args *args)
+{
+ char *key, *val;
+ unsigned long num;
+ int i;
+
+ args->version = 0;
+ args->data_block_size_bits = 12;
+ args->hash_block_size_bits = 12;
+ for (i = 0; i < argc; ++i) {
+ DMDEBUG("Argument %d: '%s'", i, argv[i]);
+ splitarg(argv[i], &key, &val);
+ if (!key) {
+ DMWARN("Bad argument %d: missing key?", i);
+ return "Bad argument: missing key";
+ }
+ if (!val) {
+ DMWARN("Bad argument %d='%s': missing value", i, key);
+ return "Bad argument: missing value";
+ }
+ if (!strcmp(key, "alg")) {
+ args->algorithm = val;
+ } else if (!strcmp(key, "payload")) {
+ args->data_device = val;
+ } else if (!strcmp(key, "hashtree")) {
+ args->hash_device = val;
+ } else if (!strcmp(key, "root_hexdigest")) {
+ args->digest = val;
+ } else if (!strcmp(key, "hashstart")) {
+ if (kstrtoul(val, 10, &num))
+ return "Invalid hashstart";
+ args->hash_start_block =
+ num >> (args->hash_block_size_bits - SECTOR_SHIFT);
+ args->num_data_blocks = args->hash_start_block;
+ } else if (!strcmp(key, "error_behavior")) {
+ args->error_behavior = val;
+ } else if (!strcmp(key, "salt")) {
+ args->salt = val;
+ }
+ }
+ if (!args->salt)
+ args->salt = "";
+
+#define NEEDARG(n) \
+ if (!(n)) { \
+ return "Missing argument: " #n; \
+ }
+
+ NEEDARG(args->algorithm);
+ NEEDARG(args->data_device);
+ NEEDARG(args->hash_device);
+ NEEDARG(args->digest);
+
+#undef NEEDARG
+ return NULL;
+}
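Annotation (not part of the patch): for comparison, a hedged example of the Chrome OS key=value form parsed by chromeos_args(), again with placeholder values.

	alg=sha256 payload=/dev/sda3 hashtree=/dev/sda3 hashstart=1740800
	root_hexdigest=<64 hex chars> salt=<hex> error_behavior=eio

With the default 4 KiB block size, hashstart is given in 512-byte sectors and shifted down by (12 - SECTOR_SHIFT) = 3 bits, so 1740800 sectors become 217600 blocks, which is also used as num_data_blocks.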
+
/*
* Target parameters:
* <version> The current format is version 1.
@@ -929,14 +1228,22 @@
*/
static int verity_ctr(struct dm_target *ti, unsigned argc, char **argv)
{
+ struct verity_args args = { 0 };
struct dm_verity *v;
struct dm_arg_set as;
- unsigned int num;
- unsigned long long num_ll;
int r;
int i;
sector_t hash_position;
- char dummy;
+
+ args.error_behavior = error_behavior;
+ if (argc >= DM_VERITY_NUM_POSITIONAL_ARGS)
+ ti->error = positional_args(argc, argv, &args);
+ else
+ ti->error = chromeos_args(argc, argv, &args);
+ if (ti->error)
+ return -EINVAL;
+ if (0)
+ pr_args(&args);
v = kzalloc(sizeof(struct dm_verity), GFP_KERNEL);
if (!v) {
@@ -949,84 +1256,46 @@
r = verity_fec_ctr_alloc(v);
if (r)
goto bad;
+ v->version = args.version;
- if ((dm_table_get_mode(ti->table) & ~FMODE_READ)) {
- ti->error = "Device must be readonly";
- r = -EINVAL;
- goto bad;
- }
-
- if (argc < 10) {
- ti->error = "Not enough arguments";
- r = -EINVAL;
- goto bad;
- }
-
- if (sscanf(argv[0], "%u%c", &num, &dummy) != 1 ||
- num > 1) {
- ti->error = "Invalid version";
- r = -EINVAL;
- goto bad;
- }
- v->version = num;
-
- r = dm_get_device(ti, argv[1], FMODE_READ, &v->data_dev);
+ r = verity_get_device(ti, args.data_device, &v->data_dev);
if (r) {
ti->error = "Data device lookup failed";
goto bad;
}
- r = dm_get_device(ti, argv[2], FMODE_READ, &v->hash_dev);
+ r = verity_get_device(ti, args.hash_device, &v->hash_dev);
if (r) {
ti->error = "Hash device lookup failed";
goto bad;
}
- if (sscanf(argv[3], "%u%c", &num, &dummy) != 1 ||
- !num || (num & (num - 1)) ||
- num < bdev_logical_block_size(v->data_dev->bdev) ||
- num > PAGE_SIZE) {
+ v->data_dev_block_bits = args.data_block_size_bits;
+ if ((1 << v->data_dev_block_bits) <
+ bdev_logical_block_size(v->data_dev->bdev)) {
ti->error = "Invalid data device block size";
r = -EINVAL;
goto bad;
}
- v->data_dev_block_bits = __ffs(num);
- if (sscanf(argv[4], "%u%c", &num, &dummy) != 1 ||
- !num || (num & (num - 1)) ||
- num < bdev_logical_block_size(v->hash_dev->bdev) ||
- num > INT_MAX) {
+ v->hash_dev_block_bits = args.hash_block_size_bits;
+ if ((1 << v->hash_dev_block_bits) <
+ bdev_logical_block_size(v->hash_dev->bdev)) {
ti->error = "Invalid hash device block size";
r = -EINVAL;
goto bad;
}
- v->hash_dev_block_bits = __ffs(num);
- if (sscanf(argv[5], "%llu%c", &num_ll, &dummy) != 1 ||
- (sector_t)(num_ll << (v->data_dev_block_bits - SECTOR_SHIFT))
- >> (v->data_dev_block_bits - SECTOR_SHIFT) != num_ll) {
- ti->error = "Invalid data blocks";
- r = -EINVAL;
- goto bad;
- }
- v->data_blocks = num_ll;
-
+ v->data_blocks = args.num_data_blocks;
if (ti->len > (v->data_blocks << (v->data_dev_block_bits - SECTOR_SHIFT))) {
ti->error = "Data device is too small";
r = -EINVAL;
goto bad;
}
- if (sscanf(argv[6], "%llu%c", &num_ll, &dummy) != 1 ||
- (sector_t)(num_ll << (v->hash_dev_block_bits - SECTOR_SHIFT))
- >> (v->hash_dev_block_bits - SECTOR_SHIFT) != num_ll) {
- ti->error = "Invalid hash start";
- r = -EINVAL;
- goto bad;
- }
- v->hash_start = num_ll;
+ v->hash_start = args.hash_start_block;
- v->alg_name = kstrdup(argv[7], GFP_KERNEL);
+ v->alg_name = kstrdup(args.algorithm, GFP_KERNEL);
if (!v->alg_name) {
ti->error = "Cannot allocate algorithm name";
r = -ENOMEM;
@@ -1055,36 +1324,33 @@
r = -ENOMEM;
goto bad;
}
- if (strlen(argv[8]) != v->digest_size * 2 ||
- hex2bin(v->root_digest, argv[8], v->digest_size)) {
+ if (strlen(args.digest) != v->digest_size * 2 ||
+ hex2bin(v->root_digest, args.digest, v->digest_size)) {
ti->error = "Invalid root digest";
r = -EINVAL;
goto bad;
}
- if (strcmp(argv[9], "-")) {
- v->salt_size = strlen(argv[9]) / 2;
+ if (strcmp(args.salt, "-")) {
+ v->salt_size = strlen(args.salt) / 2;
v->salt = kmalloc(v->salt_size, GFP_KERNEL);
if (!v->salt) {
ti->error = "Cannot allocate salt";
r = -ENOMEM;
goto bad;
}
- if (strlen(argv[9]) != v->salt_size * 2 ||
- hex2bin(v->salt, argv[9], v->salt_size)) {
+ if (strlen(args.salt) != v->salt_size * 2 ||
+ hex2bin(v->salt, args.salt, v->salt_size)) {
ti->error = "Invalid salt";
r = -EINVAL;
goto bad;
}
}
- argv += 10;
- argc -= 10;
-
/* Optional parameters */
- if (argc) {
- as.argc = argc;
- as.argv = argv;
+ if (argc > DM_VERITY_NUM_POSITIONAL_ARGS) {
+ as.argc = argc - DM_VERITY_NUM_POSITIONAL_ARGS;
+ as.argv = argv + DM_VERITY_NUM_POSITIONAL_ARGS;
r = verity_parse_opt_args(&as, v);
if (r < 0)
@@ -1156,6 +1422,16 @@
ti->per_io_data_size = roundup(ti->per_io_data_size,
__alignof__(struct dm_verity_io));
+ /* chromeos allows setting error_behavior from both the module
+ * parameters and the device args.
+ */
+ v->error_behavior = verity_parse_error_behavior(args.error_behavior);
+ if (v->error_behavior == -1) {
+ ti->error = "Bad error_behavior supplied";
+ r = -EINVAL;
+ goto bad;
+ }
+
return 0;
bad:
diff --git a/drivers/md/dm-verity.h b/drivers/md/dm-verity.h
index 3441c10..b1a8442 100644
--- a/drivers/md/dm-verity.h
+++ b/drivers/md/dm-verity.h
@@ -15,6 +15,7 @@
#include <linux/dm-bufio.h>
#include <linux/device-mapper.h>
#include <crypto/hash.h>
+#include <linux/notifier.h>
#define DM_VERITY_MAX_LEVELS 63
@@ -56,6 +57,7 @@
int hash_failed; /* set to 1 if hash of any block failed */
enum verity_mode mode; /* mode for handling verification errors */
unsigned corrupted_errs;/* Number of errors for corrupted blocks */
+ int error_behavior; /* selects error behavior on io errors */
struct workqueue_struct *verify_wq;
@@ -91,6 +93,40 @@
*/
};
+struct verity_result {
+ struct completion completion;
+ int err;
+};
+
+struct dm_verity_error_state {
+ int code;
+ int transient; /* Likely to not happen after a reboot */
+ u64 block;
+ const char *message;
+
+ sector_t dev_start;
+ sector_t dev_len;
+ struct block_device *dev;
+
+ sector_t hash_dev_start;
+ sector_t hash_dev_len;
+ struct block_device *hash_dev;
+
+ /* Final behavior after all notifications are completed. */
+ int behavior;
+};
+
+/* This enum must be matched to allowed_error_behaviors in dm-verity.c */
+enum dm_verity_error_behavior {
+ DM_VERITY_ERROR_BEHAVIOR_EIO = 0,
+ DM_VERITY_ERROR_BEHAVIOR_PANIC,
+ DM_VERITY_ERROR_BEHAVIOR_NONE,
+ DM_VERITY_ERROR_BEHAVIOR_NOTIFY
+};
+
+int dm_verity_register_error_notifier(struct notifier_block *nb);
+int dm_verity_unregister_error_notifier(struct notifier_block *nb);
+
static inline struct ahash_request *verity_io_hash_req(struct dm_verity *v,
struct dm_verity_io *io)
{
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 4364315..1ca7c5c 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1070,6 +1070,25 @@
return ret;
}
+static bool dm_dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
+ int blocksize, sector_t start, sector_t len)
+{
+ struct mapped_device *md = dax_get_private(dax_dev);
+ struct dm_table *map;
+ int srcu_idx;
+ bool ret;
+
+ map = dm_get_live_table(md, &srcu_idx);
+ if (!map)
+ return false;
+
+ ret = dm_table_supports_dax(map, device_supports_dax, &blocksize);
+
+ dm_put_live_table(md, srcu_idx);
+
+ return ret;
+}
+
static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
void *addr, size_t bytes, struct iov_iter *i)
{
@@ -1941,7 +1960,8 @@
sprintf(md->disk->disk_name, "dm-%d", minor);
if (IS_ENABLED(CONFIG_DAX_DRIVER)) {
- dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops);
+ dax_dev = alloc_dax(md, md->disk->disk_name,
+ &dm_dax_ops, 0);
if (!dax_dev)
goto bad;
}
@@ -3185,6 +3205,7 @@
static const struct dax_operations dm_dax_ops = {
.direct_access = dm_dax_direct_access,
+ .dax_supported = dm_dax_supported,
.copy_from_iter = dm_dax_copy_from_iter,
.copy_to_iter = dm_dax_copy_to_iter,
};
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 114a81b..5022e83 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -73,6 +73,10 @@
bool dm_table_all_blk_mq_devices(struct dm_table *t);
void dm_table_free_md_mempools(struct dm_table *t);
struct dm_md_mempools *dm_table_get_md_mempools(struct dm_table *t);
+bool dm_table_supports_dax(struct dm_table *t, iterate_devices_callout_fn fn,
+ int *blocksize);
+int device_supports_dax(struct dm_target *ti, struct dm_dev *dev,
+ sector_t start, sector_t len, void *data);
void dm_lock_md_type(struct mapped_device *md);
void dm_unlock_md_type(struct mapped_device *md);
diff --git a/drivers/net/ethernet/Kconfig b/drivers/net/ethernet/Kconfig
index 6fde68a..c1ffbf1 100644
--- a/drivers/net/ethernet/Kconfig
+++ b/drivers/net/ethernet/Kconfig
@@ -75,6 +75,7 @@
source "drivers/net/ethernet/faraday/Kconfig"
source "drivers/net/ethernet/freescale/Kconfig"
source "drivers/net/ethernet/fujitsu/Kconfig"
+source "drivers/net/ethernet/google/Kconfig"
source "drivers/net/ethernet/hisilicon/Kconfig"
source "drivers/net/ethernet/hp/Kconfig"
source "drivers/net/ethernet/huawei/Kconfig"
diff --git a/drivers/net/ethernet/Makefile b/drivers/net/ethernet/Makefile
index b45d5f6..60caee1 100644
--- a/drivers/net/ethernet/Makefile
+++ b/drivers/net/ethernet/Makefile
@@ -39,6 +39,7 @@
obj-$(CONFIG_NET_VENDOR_FARADAY) += faraday/
obj-$(CONFIG_NET_VENDOR_FREESCALE) += freescale/
obj-$(CONFIG_NET_VENDOR_FUJITSU) += fujitsu/
+obj-$(CONFIG_NET_VENDOR_GOOGLE) += google/
obj-$(CONFIG_NET_VENDOR_HISILICON) += hisilicon/
obj-$(CONFIG_NET_VENDOR_HP) += hp/
obj-$(CONFIG_NET_VENDOR_HUAWEI) += huawei/
diff --git a/drivers/net/ethernet/google/Kconfig b/drivers/net/ethernet/google/Kconfig
new file mode 100644
index 0000000..888f08f
--- /dev/null
+++ b/drivers/net/ethernet/google/Kconfig
@@ -0,0 +1,27 @@
+#
+# Google network device configuration
+#
+
+config NET_VENDOR_GOOGLE
+ bool "Google Devices"
+ default y
+ help
+ If you have a network (Ethernet) device belonging to this class, say Y.
+
+ Note that the answer to this question doesn't directly affect the
+ kernel: saying N will just cause the configurator to skip all
+ the questions about Google devices. If you say Y, you will be asked
+ for your specific device in the following questions.
+
+if NET_VENDOR_GOOGLE
+
+config GVE
+ tristate "Google Virtual NIC (gVNIC) support"
+ depends on (PCI_MSI && X86)
+ help
+ This driver supports Google Virtual NIC (gVNIC).
+
+ To compile this driver as a module, choose M here.
+ The module will be called gve.
+
+endif #NET_VENDOR_GOOGLE
diff --git a/drivers/net/ethernet/google/Makefile b/drivers/net/ethernet/google/Makefile
new file mode 100644
index 0000000..402cc3b
--- /dev/null
+++ b/drivers/net/ethernet/google/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for the Google network device drivers.
+#
+
+obj-$(CONFIG_GVE) += gve/
diff --git a/drivers/net/ethernet/google/gve/Makefile b/drivers/net/ethernet/google/gve/Makefile
new file mode 100644
index 0000000..3354ce4
--- /dev/null
+++ b/drivers/net/ethernet/google/gve/Makefile
@@ -0,0 +1,4 @@
+# Makefile for the Google virtual Ethernet (gve) driver
+
+obj-$(CONFIG_GVE) += gve.o
+gve-objs := gve_main.o gve_tx.o gve_rx.o gve_ethtool.o gve_adminq.o
diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
new file mode 100644
index 0000000..e0f6142
--- /dev/null
+++ b/drivers/net/ethernet/google/gve/gve.h
@@ -0,0 +1,575 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR MIT)
+ * Google virtual Ethernet (gve) driver
+ *
+ * Copyright (C) 2015-2019 Google, Inc.
+ */
+
+#ifndef _GVE_H_
+#define _GVE_H_
+
+#include <linux/dma-mapping.h>
+#include <linux/netdevice.h>
+#include <linux/pci.h>
+#include <linux/u64_stats_sync.h>
+#include "gve_desc.h"
+
+#ifndef PCI_VENDOR_ID_GOOGLE
+#define PCI_VENDOR_ID_GOOGLE 0x1ae0
+#endif
+
+#define PCI_DEV_ID_GVNIC 0x0042
+
+#define GVE_REGISTER_BAR 0
+#define GVE_DOORBELL_BAR 2
+
+/* Driver can alloc up to 2 segments for the header and 2 for the payload. */
+#define GVE_TX_MAX_IOVEC 4
+#ifndef ETH_MIN_MTU
+#define ETH_MIN_MTU 68 /* Min IPv4 MTU per RFC791 */
+#endif
+
+/* 1 for management, 1 for rx, 1 for tx */
+#define GVE_MIN_MSIX 3
+
+/* Numbers of gve tx/rx stats in stats report. */
+#define GVE_TX_STATS_REPORT_NUM 5
+#define GVE_RX_STATS_REPORT_NUM 2
+
+/* Numbers of NIC tx/rx stats in stats report. */
+#define NIC_TX_STATS_REPORT_NUM 0
+#define NIC_RX_STATS_REPORT_NUM 4
+
+/* Interval to schedule a service task, 20000ms. */
+#define GVE_SERVICE_TIMER_PERIOD 20000
+
+/* Each slot in the desc ring has a 1:1 mapping to a slot in the data ring */
+struct gve_rx_desc_queue {
+ struct gve_rx_desc *desc_ring; /* the descriptor ring */
+ dma_addr_t bus; /* the bus for the desc_ring */
+ u8 seqno; /* the next expected seqno for this desc*/
+};
+
+/* The page info for a single slot in the RX data queue */
+struct gve_rx_slot_page_info {
+ struct page *page;
+ void *page_address;
+ u32 page_offset; /* offset to write to in page */
+ int pagecnt_bias; /* expected pagecnt if only the driver has a ref */
+ bool can_flip; /* page can be flipped and reused */
+};
+
+/* A list of pages registered with the device during setup and used by a queue
+ * as buffers
+ */
+struct gve_queue_page_list {
+ u32 id; /* unique id */
+ u32 num_entries;
+ struct page **pages; /* list of num_entries pages */
+ dma_addr_t *page_buses; /* the dma addrs of the pages */
+};
+
+/* Each slot in the data ring has a 1:1 mapping to a slot in the desc ring */
+struct gve_rx_data_queue {
+ struct gve_rx_data_slot *data_ring; /* read by NIC */
+ dma_addr_t data_bus; /* dma mapping of the slots */
+ struct gve_rx_slot_page_info *page_info; /* page info of the buffers */
+ struct gve_queue_page_list *qpl; /* qpl assigned to this queue */
+ bool raw_addressing; /* use raw_addressing? */
+};
+
+struct gve_priv;
+
+/* An RX ring that contains a power-of-two sized desc and data ring. */
+struct gve_rx_ring {
+ struct gve_priv *gve;
+ struct gve_rx_desc_queue desc;
+ struct gve_rx_data_queue data;
+ u64 rbytes; /* free-running bytes received */
+ u64 rpackets; /* free-running packets received */
+ u32 cnt; /* free-running total number of completed packets */
+ u32 fill_cnt; /* free-running total number of descs and buffs posted */
+ u32 mask; /* masks the cnt and fill_cnt to the size of the ring */
+ u32 db_threshold; /* threshold for posting new buffs and descs */
+ u64 rx_copybreak_pkt; /* free-running count of copybreak packets */
+ u64 rx_copied_pkt; /* free-running total number of copied packets */
+ u64 rx_skb_alloc_fail; /* free-running count of skb alloc fails */
+ u64 rx_buf_alloc_fail; /* free-running count of buffer alloc fails */
+ u64 rx_desc_err_dropped_pkt; /* free-running count of packets dropped by descriptor error */
+ u64 rx_no_refill_dropped_pkt; /* free-running count of packets dropped because of lack of buffer refill */
+ u32 q_num; /* queue index */
+ u32 ntfy_id; /* notification block index */
+ struct gve_queue_resources *q_resources; /* head and tail pointer idx */
+ dma_addr_t q_resources_bus; /* dma address for the queue resources */
+ struct u64_stats_sync statss; /* sync stats for 32bit archs */
+};
+
+/* A TX desc ring entry */
+union gve_tx_desc {
+ struct gve_tx_pkt_desc pkt; /* first desc for a packet */
+ struct gve_tx_seg_desc seg; /* subsequent descs for a packet */
+};
+
+/* Tracks the memory in the fifo occupied by a segment of a packet */
+struct gve_tx_iovec {
+ u32 iov_offset; /* offset into this segment */
+ u32 iov_len; /* length */
+ u32 iov_padding; /* padding associated with this segment */
+};
+
+struct gve_tx_dma_buf {
+ DEFINE_DMA_UNMAP_ADDR(dma);
+ DEFINE_DMA_UNMAP_LEN(len);
+};
+
+/* Tracks the memory in the fifo occupied by the skb. Mapped 1:1 to a desc
+ * ring entry but only used for a pkt_desc not a seg_desc
+ */
+struct gve_tx_buffer_state {
+ struct sk_buff *skb; /* skb for this pkt */
+ union {
+ struct gve_tx_iovec iov[GVE_TX_MAX_IOVEC]; /* segments of this pkt */
+ struct gve_tx_dma_buf buf;
+ };
+};
+
+/* A TX buffer - each queue has one */
+struct gve_tx_fifo {
+ void *base; /* address of base of FIFO */
+ u32 size; /* total size */
+ atomic_t available; /* how much space is still available */
+ u32 head; /* offset to write at */
+ struct gve_queue_page_list *qpl; /* QPL mapped into this FIFO */
+};
+
+/* A TX ring that contains a power-of-two sized desc ring and a FIFO buffer */
+struct gve_tx_ring {
+ /* Cacheline 0 -- Accessed & dirtied during transmit */
+ struct gve_tx_fifo tx_fifo;
+ u32 req; /* driver tracked head pointer */
+ u32 done; /* driver tracked tail pointer */
+
+ /* Cacheline 1 -- Accessed & dirtied during gve_clean_tx_done */
+ __be32 last_nic_done ____cacheline_aligned; /* NIC tail pointer */
+ u64 pkt_done; /* free-running - total packets completed */
+ u64 bytes_done; /* free-running - total bytes completed */
+ u32 dropped_pkt; /* free-running - total packets dropped */
+
+ /* Cacheline 2 -- Read-mostly fields */
+ union gve_tx_desc *desc ____cacheline_aligned;
+ struct gve_tx_buffer_state *info; /* Maps 1:1 to a desc */
+ struct netdev_queue *netdev_txq;
+ struct gve_queue_resources *q_resources; /* head and tail pointer idx */
+ struct device *dev;
+ u32 mask; /* masks req and done down to queue size */
+ bool raw_addressing; /* use raw_addressing? */
+
+ /* Slow-path fields */
+ u32 q_num ____cacheline_aligned; /* queue idx */
+ u32 stop_queue; /* count of queue stops */
+ u32 wake_queue; /* count of queue wakes */
+ u32 ntfy_id; /* notification block index */
+ dma_addr_t bus; /* dma address of the descr ring */
+ dma_addr_t q_resources_bus; /* dma address of the queue resources */
+ struct u64_stats_sync statss; /* sync stats for 32bit archs */
+} ____cacheline_aligned;
+
+/* Wraps the info for one irq including the napi struct and the queues
+ * associated with that irq.
+ */
+struct gve_notify_block {
+ __be32 *irq_db_index; /* pointer to idx into Bar2 */
+ char name[IFNAMSIZ + 16]; /* name registered with the kernel */
+ struct napi_struct napi; /* kernel napi struct for this block */
+ struct gve_priv *priv;
+ struct gve_tx_ring *tx; /* tx rings on this block */
+ struct gve_rx_ring *rx; /* rx rings on this block */
+};
+
+/* Tracks allowed and current queue settings */
+struct gve_queue_config {
+ u16 max_queues;
+ u16 num_queues; /* current */
+};
+
+/* Tracks the available and used qpl IDs */
+struct gve_qpl_config {
+ u32 qpl_map_size; /* map memory size */
+ unsigned long *qpl_id_map; /* bitmap of used qpl ids */
+};
+
+struct gve_irq_db {
+ __be32 index;
+} ____cacheline_aligned;
+
+struct gve_priv {
+ struct net_device *dev;
+ struct gve_tx_ring *tx; /* array of tx_cfg.num_queues */
+ struct gve_rx_ring *rx; /* array of rx_cfg.num_queues */
+ struct gve_queue_page_list *qpls; /* array of num qpls */
+ struct gve_notify_block *ntfy_blocks; /* array of num_ntfy_blks */
+ struct gve_irq_db *irq_db_indices; /* array of num_ntfy_blks */
+ dma_addr_t irq_db_indices_bus;
+ struct msix_entry *msix_vectors; /* array of num_ntfy_blks + 1 */
+ char mgmt_msix_name[IFNAMSIZ + 16];
+ u32 mgmt_msix_idx;
+ __be32 *counter_array; /* array of num_event_counters */
+ dma_addr_t counter_array_bus;
+
+ u16 num_event_counters;
+ u16 tx_desc_cnt; /* num desc per ring */
+ u16 rx_desc_cnt; /* num desc per ring */
+ u16 tx_pages_per_qpl; /* tx buffer length */
+ u16 rx_data_slot_cnt; /* rx buffer length */
+ u64 max_registered_pages;
+ u64 num_registered_pages; /* num pages registered with NIC */
+ u32 rx_copybreak; /* copy packets smaller than this */
+ u16 default_num_queues; /* default num queues to set up */
+ bool raw_addressing; /* true if this dev supports raw addressing */
+
+ struct gve_queue_config tx_cfg;
+ struct gve_queue_config rx_cfg;
+ struct gve_qpl_config qpl_cfg; /* map used QPL ids */
+ u32 num_ntfy_blks; /* split between TX and RX so must be even */
+
+ struct gve_registers __iomem *reg_bar0; /* see gve_register.h */
+ __be32 __iomem *db_bar2; /* "array" of doorbells */
+ u32 msg_enable; /* level for netif* netdev print macros */
+ struct pci_dev *pdev;
+
+ /* metrics */
+ u32 tx_timeo_cnt;
+
+ /* Admin queue - see gve_adminq.h*/
+ union gve_adminq_command *adminq;
+ dma_addr_t adminq_bus_addr;
+ u32 adminq_mask; /* masks prod_cnt to adminq size */
+ u32 adminq_prod_cnt; /* free-running count of AQ cmds executed */
+ u32 adminq_cmd_fail; /* free-running count of AQ cmds failed */
+ u32 adminq_timeouts; /* free-running count of AQ cmds timeouts */
+ /* free-running count of per AQ cmd executed */
+ u32 adminq_describe_device_cnt;
+ u32 adminq_cfg_device_resources_cnt;
+ u32 adminq_register_page_list_cnt;
+ u32 adminq_unregister_page_list_cnt;
+ u32 adminq_create_tx_queue_cnt;
+ u32 adminq_create_rx_queue_cnt;
+ u32 adminq_destroy_tx_queue_cnt;
+ u32 adminq_destroy_rx_queue_cnt;
+ u32 adminq_dcfg_device_resources_cnt;
+ u32 adminq_set_driver_parameter_cnt;
+ u32 adminq_report_stats_cnt;
+
+ /* Global stats */
+ u32 interface_up_cnt; /* count of times interface turned up */
+ u32 interface_down_cnt; /* count of times interface turned down */
+ u32 reset_cnt; /* count of reset */
+ u32 page_alloc_fail; /* count of page alloc fails */
+ u32 dma_mapping_error; /* count of dma mapping errors */
+
+ struct workqueue_struct *gve_wq;
+ struct work_struct service_task;
+ unsigned long service_task_flags;
+ unsigned long state_flags;
+
+ struct gve_stats_report *stats_report;
+ u64 stats_report_len;
+ dma_addr_t stats_report_bus; /* dma address for the stats report */
+ unsigned long ethtool_flags;
+
+ unsigned long service_timer_period;
+ struct timer_list service_timer;
+
+ /* Gvnic device's dma mask, set during probe. */
+ u8 dma_mask;
+
+ /* Gvnic device link speed from hypervisor. */
+ u64 link_speed;
+#if (LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0))
+ int max_mtu;
+#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0) */
+};
+
+enum gve_service_task_flags_bit {
+ GVE_PRIV_FLAGS_DO_RESET = 1,
+ GVE_PRIV_FLAGS_RESET_IN_PROGRESS = 2,
+ GVE_PRIV_FLAGS_PROBE_IN_PROGRESS = 3,
+ GVE_PRIV_FLAGS_DO_REPORT_STATS = 4,
+};
+
+enum gve_state_flags_bit {
+ GVE_PRIV_FLAGS_ADMIN_QUEUE_OK = 1,
+ GVE_PRIV_FLAGS_DEVICE_RESOURCES_OK = 2,
+ GVE_PRIV_FLAGS_DEVICE_RINGS_OK = 3,
+ GVE_PRIV_FLAGS_NAPI_ENABLED = 4,
+};
+
+enum gve_ethtool_flags_bit {
+ GVE_PRIV_FLAGS_REPORT_STATS = 0,
+};
+
+static inline bool gve_get_do_reset(struct gve_priv *priv)
+{
+ return test_bit(GVE_PRIV_FLAGS_DO_RESET, &priv->service_task_flags);
+}
+
+static inline void gve_set_do_reset(struct gve_priv *priv)
+{
+ set_bit(GVE_PRIV_FLAGS_DO_RESET, &priv->service_task_flags);
+}
+
+static inline void gve_clear_do_reset(struct gve_priv *priv)
+{
+ clear_bit(GVE_PRIV_FLAGS_DO_RESET, &priv->service_task_flags);
+}
+
+static inline bool gve_get_reset_in_progress(struct gve_priv *priv)
+{
+ return test_bit(GVE_PRIV_FLAGS_RESET_IN_PROGRESS,
+ &priv->service_task_flags);
+}
+
+static inline void gve_set_reset_in_progress(struct gve_priv *priv)
+{
+ set_bit(GVE_PRIV_FLAGS_RESET_IN_PROGRESS, &priv->service_task_flags);
+}
+
+static inline void gve_clear_reset_in_progress(struct gve_priv *priv)
+{
+ clear_bit(GVE_PRIV_FLAGS_RESET_IN_PROGRESS, &priv->service_task_flags);
+}
+
+static inline bool gve_get_probe_in_progress(struct gve_priv *priv)
+{
+ return test_bit(GVE_PRIV_FLAGS_PROBE_IN_PROGRESS,
+ &priv->service_task_flags);
+}
+
+static inline void gve_set_probe_in_progress(struct gve_priv *priv)
+{
+ set_bit(GVE_PRIV_FLAGS_PROBE_IN_PROGRESS, &priv->service_task_flags);
+}
+
+static inline void gve_clear_probe_in_progress(struct gve_priv *priv)
+{
+ clear_bit(GVE_PRIV_FLAGS_PROBE_IN_PROGRESS, &priv->service_task_flags);
+}
+
+static inline bool gve_get_do_report_stats(struct gve_priv *priv)
+{
+ return test_bit(GVE_PRIV_FLAGS_DO_REPORT_STATS,
+ &priv->service_task_flags);
+}
+
+static inline void gve_set_do_report_stats(struct gve_priv *priv)
+{
+ set_bit(GVE_PRIV_FLAGS_DO_REPORT_STATS, &priv->service_task_flags);
+}
+
+static inline void gve_clear_do_report_stats(struct gve_priv *priv)
+{
+ clear_bit(GVE_PRIV_FLAGS_DO_REPORT_STATS, &priv->service_task_flags);
+}
+
+static inline bool gve_get_admin_queue_ok(struct gve_priv *priv)
+{
+ return test_bit(GVE_PRIV_FLAGS_ADMIN_QUEUE_OK, &priv->state_flags);
+}
+
+static inline void gve_set_admin_queue_ok(struct gve_priv *priv)
+{
+ set_bit(GVE_PRIV_FLAGS_ADMIN_QUEUE_OK, &priv->state_flags);
+}
+
+static inline void gve_clear_admin_queue_ok(struct gve_priv *priv)
+{
+ clear_bit(GVE_PRIV_FLAGS_ADMIN_QUEUE_OK, &priv->state_flags);
+}
+
+static inline bool gve_get_device_resources_ok(struct gve_priv *priv)
+{
+ return test_bit(GVE_PRIV_FLAGS_DEVICE_RESOURCES_OK, &priv->state_flags);
+}
+
+static inline void gve_set_device_resources_ok(struct gve_priv *priv)
+{
+ set_bit(GVE_PRIV_FLAGS_DEVICE_RESOURCES_OK, &priv->state_flags);
+}
+
+static inline void gve_clear_device_resources_ok(struct gve_priv *priv)
+{
+ clear_bit(GVE_PRIV_FLAGS_DEVICE_RESOURCES_OK, &priv->state_flags);
+}
+
+static inline bool gve_get_device_rings_ok(struct gve_priv *priv)
+{
+ return test_bit(GVE_PRIV_FLAGS_DEVICE_RINGS_OK, &priv->state_flags);
+}
+
+static inline void gve_set_device_rings_ok(struct gve_priv *priv)
+{
+ set_bit(GVE_PRIV_FLAGS_DEVICE_RINGS_OK, &priv->state_flags);
+}
+
+static inline void gve_clear_device_rings_ok(struct gve_priv *priv)
+{
+ clear_bit(GVE_PRIV_FLAGS_DEVICE_RINGS_OK, &priv->state_flags);
+}
+
+static inline bool gve_get_napi_enabled(struct gve_priv *priv)
+{
+ return test_bit(GVE_PRIV_FLAGS_NAPI_ENABLED, &priv->state_flags);
+}
+
+static inline void gve_set_napi_enabled(struct gve_priv *priv)
+{
+ set_bit(GVE_PRIV_FLAGS_NAPI_ENABLED, &priv->state_flags);
+}
+
+static inline void gve_clear_napi_enabled(struct gve_priv *priv)
+{
+ clear_bit(GVE_PRIV_FLAGS_NAPI_ENABLED, &priv->state_flags);
+}
+
+static inline bool gve_get_report_stats(struct gve_priv *priv)
+{
+ return test_bit(GVE_PRIV_FLAGS_REPORT_STATS, &priv->ethtool_flags);
+}
+
+static inline void gve_set_report_stats(struct gve_priv *priv)
+{
+ set_bit(GVE_PRIV_FLAGS_REPORT_STATS, &priv->ethtool_flags);
+}
+
+static inline void gve_clear_report_stats(struct gve_priv *priv)
+{
+ clear_bit(GVE_PRIV_FLAGS_REPORT_STATS, &priv->ethtool_flags);
+}
+
+/* Returns the address of the given notify block's irq doorbell
+ */
+static inline __be32 __iomem *gve_irq_doorbell(struct gve_priv *priv,
+ struct gve_notify_block *block)
+{
+ return &priv->db_bar2[be32_to_cpu(*block->irq_db_index)];
+}
+
+/* Returns the index into ntfy_blocks of the given tx ring's block
+ */
+static inline u32 gve_tx_idx_to_ntfy(struct gve_priv *priv, u32 queue_idx)
+{
+ return queue_idx;
+}
+
+/* Returns the index into ntfy_blocks of the given rx ring's block
+ */
+static inline u32 gve_rx_idx_to_ntfy(struct gve_priv *priv, u32 queue_idx)
+{
+ return (priv->num_ntfy_blks / 2) + queue_idx;
+}
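Annotation (not part of the patch): a worked example of the split encoded by the two helpers above, with a hypothetical block count. With num_ntfy_blks = 16, tx queue 3 maps to notify block 3 while rx queue 3 maps to block 16 / 2 + 3 = 11, so tx queues occupy the first half of the notify blocks and rx queues the second half.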
+
+/* Returns the number of tx queue page lists
+ */
+static inline u32 gve_num_tx_qpls(struct gve_priv *priv)
+{
+ if (priv->raw_addressing) {
+ return 0;
+ } else {
+ return priv->tx_cfg.num_queues;
+ }
+}
+
+/* Returns the number of rx queue page lists
+ */
+static inline u32 gve_num_rx_qpls(struct gve_priv *priv)
+{
+ if (priv->raw_addressing) {
+ return 0;
+ } else {
+ return priv->rx_cfg.num_queues;
+ }
+}
+
+/* Returns a pointer to the next available tx qpl in the list of qpls
+ */
+static inline
+struct gve_queue_page_list *gve_assign_tx_qpl(struct gve_priv *priv)
+{
+ int id = find_first_zero_bit(priv->qpl_cfg.qpl_id_map,
+ priv->qpl_cfg.qpl_map_size);
+
+ /* we are out of tx qpls */
+ if (id >= gve_num_tx_qpls(priv))
+ return NULL;
+
+ set_bit(id, priv->qpl_cfg.qpl_id_map);
+ return &priv->qpls[id];
+}
+
+/* Returns a pointer to the next available rx qpl in the list of qpls
+ */
+static inline
+struct gve_queue_page_list *gve_assign_rx_qpl(struct gve_priv *priv)
+{
+ int id = find_next_zero_bit(priv->qpl_cfg.qpl_id_map,
+ priv->qpl_cfg.qpl_map_size,
+ gve_num_tx_qpls(priv));
+
+ /* we are out of rx qpls */
+ if (id == priv->qpl_cfg.qpl_map_size)
+ return NULL;
+
+ set_bit(id, priv->qpl_cfg.qpl_id_map);
+ return &priv->qpls[id];
+}
+
+/* Unassigns the qpl with the given id
+ */
+static inline void gve_unassign_qpl(struct gve_priv *priv, int id)
+{
+ clear_bit(id, priv->qpl_cfg.qpl_id_map);
+}
+
+/* Returns the correct dma direction for tx and rx qpls
+ */
+static inline enum dma_data_direction gve_qpl_dma_dir(struct gve_priv *priv,
+ int id)
+{
+ if (id < gve_num_tx_qpls(priv))
+ return DMA_TO_DEVICE;
+ else
+ return DMA_FROM_DEVICE;
+}
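Annotation (not part of the patch): a short sketch of the qpl id layout these helpers assume; the queue counts are hypothetical.

	/* With tx_cfg.num_queues = 8, rx_cfg.num_queues = 8 and raw
	 * addressing disabled:
	 *
	 *   ids 0..7   tx qpls, mapped DMA_TO_DEVICE
	 *   ids 8..15  rx qpls, mapped DMA_FROM_DEVICE
	 *
	 * gve_assign_tx_qpl() scans from bit 0, gve_assign_rx_qpl() scans
	 * from bit gve_num_tx_qpls(priv) == 8, and gve_qpl_dma_dir() picks
	 * the direction by comparing the id against the same boundary.
	 */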
+
+/* buffers */
+int gve_alloc_page(struct gve_priv *priv, struct device *dev,
+ struct page **page, dma_addr_t *dma,
+ enum dma_data_direction, gfp_t gfp_flags);
+void gve_free_page(struct device *dev, struct page *page, dma_addr_t dma,
+ enum dma_data_direction);
+/* tx handling */
+netdev_tx_t gve_tx(struct sk_buff *skb, struct net_device *dev);
+bool gve_tx_poll(struct gve_notify_block *block, int budget);
+int gve_tx_alloc_rings(struct gve_priv *priv);
+void gve_tx_free_rings(struct gve_priv *priv);
+__be32 gve_tx_load_event_counter(struct gve_priv *priv,
+ struct gve_tx_ring *tx);
+/* rx handling */
+void gve_rx_write_doorbell(struct gve_priv *priv, struct gve_rx_ring *rx);
+bool gve_rx_poll(struct gve_notify_block *block, int budget);
+int gve_rx_alloc_rings(struct gve_priv *priv);
+void gve_rx_free_rings(struct gve_priv *priv);
+bool gve_clean_rx_done(struct gve_rx_ring *rx, int budget,
+ netdev_features_t feat);
+/* Reset */
+void gve_schedule_reset(struct gve_priv *priv);
+int gve_reset(struct gve_priv *priv, bool attempt_teardown);
+int gve_adjust_queues(struct gve_priv *priv,
+ struct gve_queue_config new_rx_config,
+ struct gve_queue_config new_tx_config);
+/* report stats handling */
+void gve_handle_report_stats(struct gve_priv *priv);
+/* exported by ethtool.c */
+extern const struct ethtool_ops gve_ethtool_ops;
+/* needed by ethtool */
+extern const char gve_version_str[];
+#endif /* _GVE_H_ */
diff --git a/drivers/net/ethernet/google/gve/gve_adminq.c b/drivers/net/ethernet/google/gve/gve_adminq.c
new file mode 100644
index 0000000..052b6b8
--- /dev/null
+++ b/drivers/net/ethernet/google/gve/gve_adminq.c
@@ -0,0 +1,647 @@
+// SPDX-License-Identifier: (GPL-2.0 OR MIT)
+/* Google virtual Ethernet (gve) driver
+ *
+ * Copyright (C) 2015-2019 Google, Inc.
+ */
+
+#include "gve_linux_version.h"
+#include <linux/etherdevice.h>
+#include <linux/pci.h>
+#include "gve.h"
+#include "gve_adminq.h"
+#include "gve_register.h"
+
+#define GVE_MAX_ADMINQ_RELEASE_CHECK 500
+#define GVE_ADMINQ_SLEEP_LEN 20
+#define GVE_MAX_ADMINQ_EVENT_COUNTER_CHECK 100
+
+int gve_adminq_alloc(struct device *dev, struct gve_priv *priv)
+{
+ priv->adminq = dma_alloc_coherent(dev, PAGE_SIZE,
+ &priv->adminq_bus_addr, GFP_KERNEL);
+ if (unlikely(!priv->adminq))
+ return -ENOMEM;
+
+ priv->adminq_mask = (PAGE_SIZE / sizeof(union gve_adminq_command)) - 1;
+ priv->adminq_prod_cnt = 0;
+ priv->adminq_cmd_fail = 0;
+ priv->adminq_timeouts = 0;
+ priv->adminq_describe_device_cnt = 0;
+ priv->adminq_cfg_device_resources_cnt = 0;
+ priv->adminq_register_page_list_cnt = 0;
+ priv->adminq_unregister_page_list_cnt = 0;
+ priv->adminq_create_tx_queue_cnt = 0;
+ priv->adminq_create_rx_queue_cnt = 0;
+ priv->adminq_destroy_tx_queue_cnt = 0;
+ priv->adminq_destroy_rx_queue_cnt = 0;
+ priv->adminq_dcfg_device_resources_cnt = 0;
+ priv->adminq_set_driver_parameter_cnt = 0;
+ priv->adminq_report_stats_cnt = 0;
+
+ /* Setup Admin queue with the device */
+ iowrite32be(priv->adminq_bus_addr / PAGE_SIZE,
+ &priv->reg_bar0->adminq_pfn);
+
+ gve_set_admin_queue_ok(priv);
+ return 0;
+}
+
+void gve_adminq_release(struct gve_priv *priv)
+{
+ int i = 0;
+
+ /* Tell the device the adminq is leaving */
+ iowrite32be(0x0, &priv->reg_bar0->adminq_pfn);
+ while (ioread32be(&priv->reg_bar0->adminq_pfn)) {
+ /* If this is reached the device is unrecoverable and still
+ * holding memory. Continue looping to avoid memory corruption,
+ * but WARN so it is visible what is going on.
+ */
+ if (i == GVE_MAX_ADMINQ_RELEASE_CHECK)
+ WARN(1, "Unrecoverable platform error!");
+ i++;
+ msleep(GVE_ADMINQ_SLEEP_LEN);
+ }
+ gve_clear_device_rings_ok(priv);
+ gve_clear_device_resources_ok(priv);
+ gve_clear_admin_queue_ok(priv);
+}
+
+void gve_adminq_free(struct device *dev, struct gve_priv *priv)
+{
+ if (!gve_get_admin_queue_ok(priv))
+ return;
+ gve_adminq_release(priv);
+ dma_free_coherent(dev, PAGE_SIZE, priv->adminq, priv->adminq_bus_addr);
+ gve_clear_admin_queue_ok(priv);
+}
+
+static void gve_adminq_kick_cmd(struct gve_priv *priv, u32 prod_cnt)
+{
+ iowrite32be(prod_cnt, &priv->reg_bar0->adminq_doorbell);
+}
+
+static bool gve_adminq_wait_for_cmd(struct gve_priv *priv, u32 prod_cnt)
+{
+ int i;
+
+ for (i = 0; i < GVE_MAX_ADMINQ_EVENT_COUNTER_CHECK; i++) {
+ if (ioread32be(&priv->reg_bar0->adminq_event_counter)
+ == prod_cnt)
+ return true;
+ msleep(GVE_ADMINQ_SLEEP_LEN);
+ }
+
+ return false;
+}
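Annotation (not part of the patch): for scale, gve_adminq_wait_for_cmd() polls the event counter up to GVE_MAX_ADMINQ_EVENT_COUNTER_CHECK (100) times with GVE_ADMINQ_SLEEP_LEN (20) ms sleeps in between, so an admin command is given roughly 100 * 20 ms = 2 s to complete before gve_adminq_kick_and_wait() declares a timeout.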
+
+static int gve_adminq_parse_err(struct gve_priv *priv, u32 status)
+{
+ if (status != GVE_ADMINQ_COMMAND_PASSED &&
+ status != GVE_ADMINQ_COMMAND_UNSET) {
+ dev_err(&priv->pdev->dev, "AQ command failed with status %d\n", status);
+ priv->adminq_cmd_fail++;
+ }
+ switch (status) {
+ case GVE_ADMINQ_COMMAND_PASSED:
+ return 0;
+ case GVE_ADMINQ_COMMAND_UNSET:
+ dev_err(&priv->pdev->dev, "parse_aq_err: err and status both unset, this should not be possible.\n");
+ return -EINVAL;
+ case GVE_ADMINQ_COMMAND_ERROR_ABORTED:
+ case GVE_ADMINQ_COMMAND_ERROR_CANCELLED:
+ case GVE_ADMINQ_COMMAND_ERROR_DATALOSS:
+ case GVE_ADMINQ_COMMAND_ERROR_FAILED_PRECONDITION:
+ case GVE_ADMINQ_COMMAND_ERROR_UNAVAILABLE:
+ return -EAGAIN;
+ case GVE_ADMINQ_COMMAND_ERROR_ALREADY_EXISTS:
+ case GVE_ADMINQ_COMMAND_ERROR_INTERNAL_ERROR:
+ case GVE_ADMINQ_COMMAND_ERROR_INVALID_ARGUMENT:
+ case GVE_ADMINQ_COMMAND_ERROR_NOT_FOUND:
+ case GVE_ADMINQ_COMMAND_ERROR_OUT_OF_RANGE:
+ case GVE_ADMINQ_COMMAND_ERROR_UNKNOWN_ERROR:
+ return -EINVAL;
+ case GVE_ADMINQ_COMMAND_ERROR_DEADLINE_EXCEEDED:
+ return -ETIME;
+ case GVE_ADMINQ_COMMAND_ERROR_PERMISSION_DENIED:
+ case GVE_ADMINQ_COMMAND_ERROR_UNAUTHENTICATED:
+ return -EACCES;
+ case GVE_ADMINQ_COMMAND_ERROR_RESOURCE_EXHAUSTED:
+ return -ENOMEM;
+ case GVE_ADMINQ_COMMAND_ERROR_UNIMPLEMENTED:
+ return -ENOTSUPP;
+ default:
+ dev_err(&priv->pdev->dev, "parse_aq_err: unknown status code %d\n", status);
+ return -EINVAL;
+ }
+}
+
+/* Flushes all AQ commands currently queued and waits for them to complete.
+ * If there are failures, it will return the first error.
+ */
+static int gve_adminq_kick_and_wait(struct gve_priv *priv)
+{
+ u32 tail, head;
+ int i;
+
+ tail = ioread32be(&priv->reg_bar0->adminq_event_counter);
+ head = priv->adminq_prod_cnt;
+
+ gve_adminq_kick_cmd(priv, head);
+ if (!gve_adminq_wait_for_cmd(priv, head)) {
+ dev_err(&priv->pdev->dev, "AQ commands timed out, need to reset AQ\n");
+ priv->adminq_timeouts++;
+ return -ENOTRECOVERABLE;
+ }
+
+ for (i = tail; i < head; i++) {
+ union gve_adminq_command *cmd;
+ u32 status, err;
+
+ cmd = &priv->adminq[i & priv->adminq_mask];
+ status = be32_to_cpu(READ_ONCE(cmd->status));
+ err = gve_adminq_parse_err(priv, status);
+ if (err)
+ // Return the first error if we failed.
+ return err;
+ }
+
+ return 0;
+}
+
+/* This function is not threadsafe - the caller is responsible for any
+ * necessary locks.
+ */
+static int gve_adminq_issue_cmd(struct gve_priv *priv,
+ union gve_adminq_command *cmd_orig)
+{
+ union gve_adminq_command *cmd;
+ u32 tail;
+ u32 opcode;
+
+ tail = ioread32be(&priv->reg_bar0->adminq_event_counter);
+
+ // Check if next command will overflow the buffer.
+ if (((priv->adminq_prod_cnt + 1) & priv->adminq_mask) == tail) {
+ int err;
+
+ // Flush existing commands to make room.
+ err = gve_adminq_kick_and_wait(priv);
+ if (err)
+ return err;
+
+ // Retry.
+ tail = ioread32be(&priv->reg_bar0->adminq_event_counter);
+ if (((priv->adminq_prod_cnt + 1) & priv->adminq_mask) == tail) {
+ // This should never happen. We just flushed the
+ // command queue so there should be enough space.
+ return -ENOMEM;
+ }
+ }
+
+ cmd = &priv->adminq[priv->adminq_prod_cnt & priv->adminq_mask];
+ priv->adminq_prod_cnt++;
+
+ memcpy(cmd, cmd_orig, sizeof(*cmd_orig));
+ opcode = be32_to_cpu(READ_ONCE(cmd->opcode));
+
+ switch (opcode) {
+ case GVE_ADMINQ_DESCRIBE_DEVICE:
+ priv->adminq_describe_device_cnt++;
+ break;
+ case GVE_ADMINQ_CONFIGURE_DEVICE_RESOURCES:
+ priv->adminq_cfg_device_resources_cnt++;
+ break;
+ case GVE_ADMINQ_REGISTER_PAGE_LIST:
+ priv->adminq_register_page_list_cnt++;
+ break;
+ case GVE_ADMINQ_UNREGISTER_PAGE_LIST:
+ priv->adminq_unregister_page_list_cnt++;
+ break;
+ case GVE_ADMINQ_CREATE_TX_QUEUE:
+ priv->adminq_create_tx_queue_cnt++;
+ break;
+ case GVE_ADMINQ_CREATE_RX_QUEUE:
+ priv->adminq_create_rx_queue_cnt++;
+ break;
+ case GVE_ADMINQ_DESTROY_TX_QUEUE:
+ priv->adminq_destroy_tx_queue_cnt++;
+ break;
+ case GVE_ADMINQ_DESTROY_RX_QUEUE:
+ priv->adminq_destroy_rx_queue_cnt++;
+ break;
+ case GVE_ADMINQ_DECONFIGURE_DEVICE_RESOURCES:
+ priv->adminq_dcfg_device_resources_cnt++;
+ break;
+ case GVE_ADMINQ_SET_DRIVER_PARAMETER:
+ priv->adminq_set_driver_parameter_cnt++;
+ break;
+ case GVE_ADMINQ_REPORT_STATS:
+ priv->adminq_report_stats_cnt++;
+ break;
+ default:
+ dev_err(&priv->pdev->dev, "unknown AQ command opcode %d\n", opcode);
+ }
+
+ return 0;
+}
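Annotation (not part of the patch): a worked example of the ring arithmetic used above, assuming PAGE_SIZE == 4096 and the usual 64-byte union gve_adminq_command; the counter values are illustrative.

	/* adminq_mask = 4096 / 64 - 1 = 63, i.e. a ring of 64 slots.
	 *
	 * adminq_prod_cnt and the device event counter are free-running,
	 * so a slot index is always (count & adminq_mask).  With
	 * tail == 130 and adminq_prod_cnt == 193:
	 *
	 *   ((193 + 1) & 63) == 2 == (130 & 63)
	 *
	 * so issuing one more command would wrap onto the consumer, and
	 * gve_adminq_issue_cmd() first flushes the queue via
	 * gve_adminq_kick_and_wait() before retrying.
	 */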
+
+/* This function is not threadsafe - the caller is responsible for any
+ * necessary locks.
+ * The caller is also responsible for making sure there are no commands
+ * waiting to be executed.
+ */
+static int gve_adminq_execute_cmd(struct gve_priv *priv,
+ union gve_adminq_command *cmd_orig)
+{
+ u32 tail, head;
+ int err;
+
+ tail = ioread32be(&priv->reg_bar0->adminq_event_counter);
+ head = priv->adminq_prod_cnt;
+ if (tail != head)
+ // This is not a valid path
+ return -EINVAL;
+
+ err = gve_adminq_issue_cmd(priv, cmd_orig);
+ if (err)
+ return err;
+
+ return gve_adminq_kick_and_wait(priv);
+}
+/* The device specifies that the management vector can either be the first irq
+ * or the last irq. ntfy_blk_msix_base_idx indicates the first irq assigned to
+ * the ntfy blks. If it is 0 then the management vector is last; if it is 1 then
+ * the management vector is first.
+ *
+ * gve arranges the msix vectors so that the management vector is last.
+ */
+#define GVE_NTFY_BLK_BASE_MSIX_IDX 0
+int gve_adminq_configure_device_resources(struct gve_priv *priv,
+ dma_addr_t counter_array_bus_addr,
+ u32 num_counters,
+ dma_addr_t db_array_bus_addr,
+ u32 num_ntfy_blks)
+{
+ union gve_adminq_command cmd;
+
+ memset(&cmd, 0, sizeof(cmd));
+ cmd.opcode = cpu_to_be32(GVE_ADMINQ_CONFIGURE_DEVICE_RESOURCES);
+ cmd.configure_device_resources =
+ (struct gve_adminq_configure_device_resources) {
+ .counter_array = cpu_to_be64(counter_array_bus_addr),
+ .num_counters = cpu_to_be32(num_counters),
+ .irq_db_addr = cpu_to_be64(db_array_bus_addr),
+ .num_irq_dbs = cpu_to_be32(num_ntfy_blks),
+ .irq_db_stride = cpu_to_be32(sizeof(*priv->irq_db_indices)),
+ .ntfy_blk_msix_base_idx =
+ cpu_to_be32(GVE_NTFY_BLK_BASE_MSIX_IDX),
+ };
+
+ return gve_adminq_execute_cmd(priv, &cmd);
+}
+
+int gve_adminq_deconfigure_device_resources(struct gve_priv *priv)
+{
+ union gve_adminq_command cmd;
+
+ memset(&cmd, 0, sizeof(cmd));
+ cmd.opcode = cpu_to_be32(GVE_ADMINQ_DECONFIGURE_DEVICE_RESOURCES);
+
+ return gve_adminq_execute_cmd(priv, &cmd);
+}
+
+int gve_adminq_create_tx_queues(struct gve_priv *priv, u32 num_queues)
+{
+ union gve_adminq_command cmd;
+ struct gve_tx_ring *tx;
+ u32 qpl_id;
+ int err;
+ int i;
+
+ for (i = 0; i < num_queues; i++) {
+ tx = &priv->tx[i];
+ qpl_id = priv->raw_addressing ? GVE_RAW_ADDRESSING_QPL_ID :
+ tx->tx_fifo.qpl->id;
+ memset(&cmd, 0, sizeof(cmd));
+ cmd.opcode = cpu_to_be32(GVE_ADMINQ_CREATE_TX_QUEUE);
+ cmd.create_tx_queue = (struct gve_adminq_create_tx_queue) {
+ .queue_id = cpu_to_be32(i),
+ .reserved = 0,
+ .queue_resources_addr =
+ cpu_to_be64(tx->q_resources_bus),
+ .tx_ring_addr = cpu_to_be64(tx->bus),
+ .queue_page_list_id = cpu_to_be32(qpl_id),
+ .ntfy_id = cpu_to_be32(tx->ntfy_id),
+ };
+ err = gve_adminq_issue_cmd(priv, &cmd);
+ if (err)
+ return err;
+ }
+
+ return gve_adminq_kick_and_wait(priv);
+}
+
+int gve_adminq_create_rx_queues(struct gve_priv *priv, u32 num_queues)
+{
+ union gve_adminq_command cmd;
+ struct gve_rx_ring *rx;
+ u32 qpl_id;
+ int err;
+ int i;
+
+ for (i = 0; i < num_queues; i++) {
+ rx = &priv->rx[i];
+ qpl_id = priv->raw_addressing ? GVE_RAW_ADDRESSING_QPL_ID :
+ rx->data.qpl->id;
+ memset(&cmd, 0, sizeof(cmd));
+ cmd.opcode = cpu_to_be32(GVE_ADMINQ_CREATE_RX_QUEUE);
+ cmd.create_rx_queue = (struct gve_adminq_create_rx_queue) {
+ .queue_id = cpu_to_be32(i),
+ .index = cpu_to_be32(i),
+ .reserved = 0,
+ .ntfy_id = cpu_to_be32(rx->ntfy_id),
+ .queue_resources_addr = cpu_to_be64(rx->q_resources_bus),
+ .rx_desc_ring_addr = cpu_to_be64(rx->desc.bus),
+ .rx_data_ring_addr = cpu_to_be64(rx->data.data_bus),
+ .queue_page_list_id = cpu_to_be32(qpl_id),
+ };
+ err = gve_adminq_issue_cmd(priv, &cmd);
+ if (err)
+ return err;
+ }
+
+ return gve_adminq_kick_and_wait(priv);
+}
+
+int gve_adminq_destroy_tx_queues(struct gve_priv *priv, u32 num_queues)
+{
+ union gve_adminq_command cmd;
+ int err;
+ int i;
+
+ for (i = 0; i < num_queues; i++) {
+ memset(&cmd, 0, sizeof(cmd));
+ cmd.opcode = cpu_to_be32(GVE_ADMINQ_DESTROY_TX_QUEUE);
+ cmd.destroy_tx_queue = (struct gve_adminq_destroy_tx_queue) {
+ .queue_id = cpu_to_be32(i),
+ };
+ err = gve_adminq_issue_cmd(priv, &cmd);
+ if (err)
+ return err;
+ }
+
+ return gve_adminq_kick_and_wait(priv);
+}
+
+int gve_adminq_destroy_rx_queues(struct gve_priv *priv, u32 num_queues)
+{
+ union gve_adminq_command cmd;
+ int err;
+ int i;
+
+ for (i = 0; i < num_queues; i++) {
+ memset(&cmd, 0, sizeof(cmd));
+ cmd.opcode = cpu_to_be32(GVE_ADMINQ_DESTROY_RX_QUEUE);
+ cmd.destroy_rx_queue = (struct gve_adminq_destroy_rx_queue) {
+ .queue_id = cpu_to_be32(i),
+ };
+ err = gve_adminq_issue_cmd(priv, &cmd);
+ if (err)
+ return err;
+ }
+
+ return gve_adminq_kick_and_wait(priv);
+}
+
+int gve_adminq_describe_device(struct gve_priv *priv)
+{
+ struct gve_device_descriptor *descriptor;
+ struct gve_device_option *dev_opt;
+ union gve_adminq_command cmd;
+ dma_addr_t descriptor_bus;
+ u16 num_options;
+ int err = 0;
+ u8 *mac;
+ u16 mtu;
+ int i;
+
+ memset(&cmd, 0, sizeof(cmd));
+ descriptor = dma_alloc_coherent(&priv->pdev->dev, PAGE_SIZE,
+ &descriptor_bus, GFP_KERNEL);
+ if (!descriptor)
+ return -ENOMEM;
+ cmd.opcode = cpu_to_be32(GVE_ADMINQ_DESCRIBE_DEVICE);
+ cmd.describe_device.device_descriptor_addr =
+ cpu_to_be64(descriptor_bus);
+ cmd.describe_device.device_descriptor_version =
+ cpu_to_be32(GVE_ADMINQ_DEVICE_DESCRIPTOR_VERSION);
+ cmd.describe_device.available_length = cpu_to_be32(PAGE_SIZE);
+
+ err = gve_adminq_execute_cmd(priv, &cmd);
+ if (err)
+ goto free_device_descriptor;
+
+ priv->tx_desc_cnt = be16_to_cpu(descriptor->tx_queue_entries);
+ if (priv->tx_desc_cnt * sizeof(priv->tx->desc[0]) < PAGE_SIZE) {
+ dev_err(&priv->pdev->dev, "Tx desc count %d too low\n",
+ priv->tx_desc_cnt);
+ err = -EINVAL;
+ goto free_device_descriptor;
+ }
+ priv->rx_desc_cnt = be16_to_cpu(descriptor->rx_queue_entries);
+ if (priv->rx_desc_cnt * sizeof(priv->rx->desc.desc_ring[0])
+ < PAGE_SIZE ||
+ priv->rx_desc_cnt * sizeof(priv->rx->data.data_ring[0])
+ < PAGE_SIZE) {
+ dev_err(&priv->pdev->dev, "Rx desc count %d too low\n",
+ priv->rx_desc_cnt);
+ err = -EINVAL;
+ goto free_device_descriptor;
+ }
+ priv->max_registered_pages =
+ be64_to_cpu(descriptor->max_registered_pages);
+ mtu = be16_to_cpu(descriptor->mtu);
+ if (mtu < ETH_MIN_MTU) {
+ dev_err(&priv->pdev->dev, "MTU %d below minimum MTU\n", mtu);
+ err = -EINVAL;
+ goto free_device_descriptor;
+ }
+#if (LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0))
+ priv->max_mtu = mtu;
+#else /* LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0) */
+ priv->dev->max_mtu = mtu;
+#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0) */
+ priv->num_event_counters = be16_to_cpu(descriptor->counters);
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0)
+ ether_addr_copy(priv->dev->dev_addr, descriptor->mac);
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0) */
+ memcpy(priv->dev->dev_addr, descriptor->mac, ETH_ALEN);
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0) */
+ mac = descriptor->mac;
+ dev_info(&priv->pdev->dev, "MAC addr: %pM\n", mac);
+ priv->tx_pages_per_qpl = be16_to_cpu(descriptor->tx_pages_per_qpl);
+ priv->rx_data_slot_cnt = be16_to_cpu(descriptor->rx_pages_per_qpl);
+ if (priv->rx_data_slot_cnt < priv->rx_desc_cnt) {
+ dev_err(&priv->pdev->dev, "rx_data_slot_cnt cannot be smaller than rx_desc_cnt, setting rx_desc_cnt down to %d.\n",
+ priv->rx_data_slot_cnt);
+ priv->rx_desc_cnt = priv->rx_data_slot_cnt;
+ }
+ priv->default_num_queues = be16_to_cpu(descriptor->default_num_queues);
+ dev_opt = (struct gve_device_option *)((void *)descriptor +
+ sizeof(*descriptor));
+
+ num_options = be16_to_cpu(descriptor->num_device_options);
+ for (i = 0; i < num_options; i++) {
+ u16 option_id;
+ u16 option_length;
+
+ if ((void *)dev_opt + sizeof(*dev_opt) > (void *)descriptor +
+ be16_to_cpu(descriptor->total_length)) {
+ dev_err(&priv->dev->dev,
+ "num_options in device_descriptor does not match total length.\n");
+ err = -EINVAL;
+ goto free_device_descriptor;
+ }
+
+ option_id = be16_to_cpu(dev_opt->option_id);
+ option_length = be16_to_cpu(dev_opt->option_length);
+ switch (option_id) {
+ case GVE_DEV_OPT_ID_RAW_ADDRESSING:
+ /* If the length or feature mask doesn't match,
+ * continue without enabling the feature.
+ */
+ if (option_length != GVE_DEV_OPT_LEN_RAW_ADDRESSING ||
+ be32_to_cpu(dev_opt->feat_mask) !=
+ GVE_DEV_OPT_FEAT_MASK_RAW_ADDRESSING) {
+ dev_info(&priv->pdev->dev,
+ "Raw addressing device option not enabled, length or features mask did not match expected.\n");
+ priv->raw_addressing = false;
+ } else {
+ dev_info(&priv->pdev->dev,
+ "Raw addressing device option enabled.\n");
+ priv->raw_addressing = true;
+ }
+ break;
+ default:
+ /* If we don't recognize the option just continue
+ * without doing anything.
+ */
+ dev_info(&priv->pdev->dev,
+ "Unrecognized device option 0x%hx not enabled.\n",
+ option_id);
+ break;
+ }
+ dev_opt = (void *)dev_opt + sizeof(*dev_opt) + option_length;
+ }
+
+free_device_descriptor:
+ dma_free_coherent(&priv->pdev->dev, PAGE_SIZE, descriptor,
+ descriptor_bus);
+ return err;
+}
+
+int gve_adminq_register_page_list(struct gve_priv *priv,
+ struct gve_queue_page_list *qpl)
+{
+ struct device *hdev = &priv->pdev->dev;
+ u32 num_entries = qpl->num_entries;
+ u32 size = num_entries * sizeof(qpl->page_buses[0]);
+ union gve_adminq_command cmd;
+ dma_addr_t page_list_bus;
+ __be64 *page_list;
+ int err;
+ int i;
+
+ memset(&cmd, 0, sizeof(cmd));
+ page_list = dma_alloc_coherent(hdev, size, &page_list_bus, GFP_KERNEL);
+ if (!page_list)
+ return -ENOMEM;
+
+ for (i = 0; i < num_entries; i++)
+ page_list[i] = cpu_to_be64(qpl->page_buses[i]);
+
+ cmd.opcode = cpu_to_be32(GVE_ADMINQ_REGISTER_PAGE_LIST);
+ cmd.reg_page_list = (struct gve_adminq_register_page_list) {
+ .page_list_id = cpu_to_be32(qpl->id),
+ .num_pages = cpu_to_be32(num_entries),
+ .page_address_list_addr = cpu_to_be64(page_list_bus),
+ };
+
+ err = gve_adminq_execute_cmd(priv, &cmd);
+ dma_free_coherent(hdev, size, page_list, page_list_bus);
+ return err;
+}
+
+int gve_adminq_unregister_page_list(struct gve_priv *priv, u32 page_list_id)
+{
+ union gve_adminq_command cmd;
+
+ memset(&cmd, 0, sizeof(cmd));
+ cmd.opcode = cpu_to_be32(GVE_ADMINQ_UNREGISTER_PAGE_LIST);
+ cmd.unreg_page_list = (struct gve_adminq_unregister_page_list) {
+ .page_list_id = cpu_to_be32(page_list_id),
+ };
+
+ return gve_adminq_execute_cmd(priv, &cmd);
+}
+
+int gve_adminq_set_mtu(struct gve_priv *priv, u64 mtu)
+{
+ union gve_adminq_command cmd;
+
+ memset(&cmd, 0, sizeof(cmd));
+ cmd.opcode = cpu_to_be32(GVE_ADMINQ_SET_DRIVER_PARAMETER);
+ cmd.set_driver_param = (struct gve_adminq_set_driver_parameter) {
+ .parameter_type = cpu_to_be32(GVE_SET_PARAM_MTU),
+ .parameter_value = cpu_to_be64(mtu),
+ };
+
+ return gve_adminq_execute_cmd(priv, &cmd);
+}
+
+int gve_adminq_report_stats(struct gve_priv *priv, u64 stats_report_len,
+ dma_addr_t stats_report_addr, u64 interval)
+{
+ union gve_adminq_command cmd;
+
+ memset(&cmd, 0, sizeof(cmd));
+ cmd.opcode = cpu_to_be32(GVE_ADMINQ_REPORT_STATS);
+ cmd.report_stats = (struct gve_adminq_report_stats) {
+ .stats_report_len = cpu_to_be64(stats_report_len),
+ .stats_report_addr = cpu_to_be64(stats_report_addr),
+ .interval = cpu_to_be64(interval),
+ };
+
+ return gve_adminq_execute_cmd(priv, &cmd);
+}
+
+int gve_adminq_report_link_speed(struct gve_priv *priv)
+{
+ union gve_adminq_command gvnic_cmd;
+ dma_addr_t link_speed_region_bus;
+ u64 *link_speed_region;
+ int err;
+
+ link_speed_region = dma_alloc_coherent(&priv->pdev->dev,
+ sizeof(*link_speed_region), &link_speed_region_bus, GFP_KERNEL);
+
+ if (!link_speed_region)
+ return -ENOMEM;
+
+ memset(&gvnic_cmd, 0, sizeof(gvnic_cmd));
+ gvnic_cmd.opcode = cpu_to_be32(GVE_ADMINQ_REPORT_LINK_SPEED);
+ gvnic_cmd.report_link_speed.link_speed_address =
+ cpu_to_be64(link_speed_region_bus);
+
+ err = gve_adminq_execute_cmd(priv, &gvnic_cmd);
+
+ priv->link_speed = be64_to_cpu(*link_speed_region);
+ dma_free_coherent(&priv->pdev->dev, sizeof(*link_speed_region),
+ link_speed_region, link_speed_region_bus);
+ return err;
+}
diff --git a/drivers/net/ethernet/google/gve/gve_adminq.h b/drivers/net/ethernet/google/gve/gve_adminq.h
new file mode 100644
index 0000000..0b12486
--- /dev/null
+++ b/drivers/net/ethernet/google/gve/gve_adminq.h
@@ -0,0 +1,278 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR MIT)
+ * Google virtual Ethernet (gve) driver
+ *
+ * Copyright (C) 2015-2019 Google, Inc.
+ */
+
+#ifndef _GVE_ADMINQ_H
+#define _GVE_ADMINQ_H
+
+#if LINUX_VERSION_CODE < KERNEL_VERSION(5,1,0)
+#include "gve_size_assert.h"
+#else /* LINUX_VERSION_CODE < KERNEL_VERSION(5,1,0) */
+#include <linux/build_bug.h>
+#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(5,1,0) */
+
+/* Admin queue opcodes */
+enum gve_adminq_opcodes {
+ GVE_ADMINQ_DESCRIBE_DEVICE = 0x1,
+ GVE_ADMINQ_CONFIGURE_DEVICE_RESOURCES = 0x2,
+ GVE_ADMINQ_REGISTER_PAGE_LIST = 0x3,
+ GVE_ADMINQ_UNREGISTER_PAGE_LIST = 0x4,
+ GVE_ADMINQ_CREATE_TX_QUEUE = 0x5,
+ GVE_ADMINQ_CREATE_RX_QUEUE = 0x6,
+ GVE_ADMINQ_DESTROY_TX_QUEUE = 0x7,
+ GVE_ADMINQ_DESTROY_RX_QUEUE = 0x8,
+ GVE_ADMINQ_DECONFIGURE_DEVICE_RESOURCES = 0x9,
+ GVE_ADMINQ_SET_DRIVER_PARAMETER = 0xB,
+ GVE_ADMINQ_REPORT_STATS = 0xC,
+ GVE_ADMINQ_REPORT_LINK_SPEED = 0xD
+};
+
+/* Admin queue status codes */
+enum gve_adminq_statuses {
+ GVE_ADMINQ_COMMAND_UNSET = 0x0,
+ GVE_ADMINQ_COMMAND_PASSED = 0x1,
+ GVE_ADMINQ_COMMAND_ERROR_ABORTED = 0xFFFFFFF0,
+ GVE_ADMINQ_COMMAND_ERROR_ALREADY_EXISTS = 0xFFFFFFF1,
+ GVE_ADMINQ_COMMAND_ERROR_CANCELLED = 0xFFFFFFF2,
+ GVE_ADMINQ_COMMAND_ERROR_DATALOSS = 0xFFFFFFF3,
+ GVE_ADMINQ_COMMAND_ERROR_DEADLINE_EXCEEDED = 0xFFFFFFF4,
+ GVE_ADMINQ_COMMAND_ERROR_FAILED_PRECONDITION = 0xFFFFFFF5,
+ GVE_ADMINQ_COMMAND_ERROR_INTERNAL_ERROR = 0xFFFFFFF6,
+ GVE_ADMINQ_COMMAND_ERROR_INVALID_ARGUMENT = 0xFFFFFFF7,
+ GVE_ADMINQ_COMMAND_ERROR_NOT_FOUND = 0xFFFFFFF8,
+ GVE_ADMINQ_COMMAND_ERROR_OUT_OF_RANGE = 0xFFFFFFF9,
+ GVE_ADMINQ_COMMAND_ERROR_PERMISSION_DENIED = 0xFFFFFFFA,
+ GVE_ADMINQ_COMMAND_ERROR_UNAUTHENTICATED = 0xFFFFFFFB,
+ GVE_ADMINQ_COMMAND_ERROR_RESOURCE_EXHAUSTED = 0xFFFFFFFC,
+ GVE_ADMINQ_COMMAND_ERROR_UNAVAILABLE = 0xFFFFFFFD,
+ GVE_ADMINQ_COMMAND_ERROR_UNIMPLEMENTED = 0xFFFFFFFE,
+ GVE_ADMINQ_COMMAND_ERROR_UNKNOWN_ERROR = 0xFFFFFFFF,
+};
+
+#define GVE_ADMINQ_DEVICE_DESCRIPTOR_VERSION 1
+
+/* All AdminQ command structs should be naturally packed. The static_assert
+ * calls make sure this is the case at compile time.
+ */
+
+struct gve_adminq_describe_device {
+ __be64 device_descriptor_addr;
+ __be32 device_descriptor_version;
+ __be32 available_length;
+};
+
+static_assert(sizeof(struct gve_adminq_describe_device) == 16);
+
+struct gve_device_descriptor {
+ __be64 max_registered_pages;
+ __be16 reserved1;
+ __be16 tx_queue_entries;
+ __be16 rx_queue_entries;
+ __be16 default_num_queues;
+ __be16 mtu;
+ __be16 counters;
+ __be16 tx_pages_per_qpl;
+ __be16 rx_pages_per_qpl;
+ u8 mac[ETH_ALEN];
+ __be16 num_device_options;
+ __be16 total_length;
+ u8 reserved2[6];
+};
+
+static_assert(sizeof(struct gve_device_descriptor) == 40);
+
+struct gve_device_option {
+ __be16 option_id;
+ __be16 option_length;
+ __be32 feat_mask;
+};
+
+static_assert(sizeof(struct gve_device_option) == 8);
+
+#define GVE_DEV_OPT_ID_RAW_ADDRESSING 0x1
+#define GVE_DEV_OPT_LEN_RAW_ADDRESSING 0x0
+#define GVE_DEV_OPT_FEAT_MASK_RAW_ADDRESSING 0x0
+
+struct gve_adminq_configure_device_resources {
+ __be64 counter_array;
+ __be64 irq_db_addr;
+ __be32 num_counters;
+ __be32 num_irq_dbs;
+ __be32 irq_db_stride;
+ __be32 ntfy_blk_msix_base_idx;
+};
+
+static_assert(sizeof(struct gve_adminq_configure_device_resources) == 32);
+
+struct gve_adminq_register_page_list {
+ __be32 page_list_id;
+ __be32 num_pages;
+ __be64 page_address_list_addr;
+};
+
+static_assert(sizeof(struct gve_adminq_register_page_list) == 16);
+
+struct gve_adminq_unregister_page_list {
+ __be32 page_list_id;
+};
+
+static_assert(sizeof(struct gve_adminq_unregister_page_list) == 4);
+
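+/* Sentinel queue page list id passed when raw DMA addressing is in use and no
+ * queue page list is registered.
+ */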
+#define GVE_RAW_ADDRESSING_QPL_ID 0xFFFFFFFF
+
+struct gve_adminq_create_tx_queue {
+ __be32 queue_id;
+ __be32 reserved;
+ __be64 queue_resources_addr;
+ __be64 tx_ring_addr;
+ __be32 queue_page_list_id;
+ __be32 ntfy_id;
+};
+
+static_assert(sizeof(struct gve_adminq_create_tx_queue) == 32);
+
+struct gve_adminq_create_rx_queue {
+ __be32 queue_id;
+ __be32 index;
+ __be32 reserved;
+ __be32 ntfy_id;
+ __be64 queue_resources_addr;
+ __be64 rx_desc_ring_addr;
+ __be64 rx_data_ring_addr;
+ __be32 queue_page_list_id;
+ u8 padding[4];
+};
+
+static_assert(sizeof(struct gve_adminq_create_rx_queue) == 48);
+
+/* Queue resources that are shared with the device */
+struct gve_queue_resources {
+ union {
+ struct {
+ __be32 db_index; /* Device -> Guest */
+ __be32 counter_index; /* Device -> Guest */
+ };
+ u8 reserved[64];
+ };
+};
+
+static_assert(sizeof(struct gve_queue_resources) == 64);
+
+struct gve_adminq_destroy_tx_queue {
+ __be32 queue_id;
+};
+
+static_assert(sizeof(struct gve_adminq_destroy_tx_queue) == 4);
+
+struct gve_adminq_destroy_rx_queue {
+ __be32 queue_id;
+};
+
+static_assert(sizeof(struct gve_adminq_destroy_rx_queue) == 4);
+
+/* GVE Set Driver Parameter Types */
+enum gve_set_driver_param_types {
+ GVE_SET_PARAM_MTU = 0x1,
+};
+
+struct gve_adminq_set_driver_parameter {
+ __be32 parameter_type;
+ u8 reserved[4];
+ __be64 parameter_value;
+};
+
+static_assert(sizeof(struct gve_adminq_set_driver_parameter) == 16);
+
+struct gve_adminq_report_stats {
+ __be64 stats_report_len;
+ __be64 stats_report_addr;
+ __be64 interval;
+};
+
+static_assert(sizeof(struct gve_adminq_report_stats) == 24);
+
+struct gve_adminq_report_link_speed {
+ __be64 link_speed_address;
+};
+
+static_assert(sizeof(struct gve_adminq_report_link_speed) == 8);
+
+struct stats {
+ __be32 stat_name;
+ __be32 queue_id;
+ __be64 value;
+};
+
+static_assert(sizeof(struct stats) == 16);
+
+struct gve_stats_report {
+ __be64 written_count;
+ struct stats stats[0];
+};
+
+static_assert(sizeof(struct gve_stats_report) == 8);
+
+enum gve_stat_names {
+ // stats from gve
+ TX_WAKE_CNT = 1,
+ TX_STOP_CNT = 2,
+ TX_FRAMES_SENT = 3,
+ TX_BYTES_SENT = 4,
+ TX_LAST_COMPLETION_PROCESSED = 5,
+ RX_NEXT_EXPECTED_SEQUENCE = 6,
+ RX_BUFFERS_POSTED = 7,
+ // stats from NIC
+ RX_QUEUE_DROP_CNT = 65,
+ RX_NO_BUFFERS_POSTED = 66,
+ RX_DROPS_PACKET_OVER_MRU = 67,
+ RX_DROPS_INVALID_CHECKSUM = 68,
+};
+
+union gve_adminq_command {
+ struct {
+ __be32 opcode;
+ __be32 status;
+ union {
+ struct gve_adminq_configure_device_resources
+ configure_device_resources;
+ struct gve_adminq_create_tx_queue create_tx_queue;
+ struct gve_adminq_create_rx_queue create_rx_queue;
+ struct gve_adminq_destroy_tx_queue destroy_tx_queue;
+ struct gve_adminq_destroy_rx_queue destroy_rx_queue;
+ struct gve_adminq_describe_device describe_device;
+ struct gve_adminq_register_page_list reg_page_list;
+ struct gve_adminq_unregister_page_list unreg_page_list;
+ struct gve_adminq_set_driver_parameter set_driver_param;
+ struct gve_adminq_report_stats report_stats;
+ struct gve_adminq_report_link_speed report_link_speed;
+ };
+ };
+ u8 reserved[64];
+};
+
+static_assert(sizeof(union gve_adminq_command) == 64);
+
+int gve_adminq_alloc(struct device *dev, struct gve_priv *priv);
+void gve_adminq_free(struct device *dev, struct gve_priv *priv);
+void gve_adminq_release(struct gve_priv *priv);
+int gve_adminq_describe_device(struct gve_priv *priv);
+int gve_adminq_configure_device_resources(struct gve_priv *priv,
+ dma_addr_t counter_array_bus_addr,
+ u32 num_counters,
+ dma_addr_t db_array_bus_addr,
+ u32 num_ntfy_blks);
+int gve_adminq_deconfigure_device_resources(struct gve_priv *priv);
+int gve_adminq_create_tx_queues(struct gve_priv *priv, u32 num_queues);
+int gve_adminq_destroy_tx_queues(struct gve_priv *priv, u32 queue_id);
+int gve_adminq_create_rx_queues(struct gve_priv *priv, u32 num_queues);
+int gve_adminq_destroy_rx_queues(struct gve_priv *priv, u32 queue_id);
+int gve_adminq_register_page_list(struct gve_priv *priv,
+ struct gve_queue_page_list *qpl);
+int gve_adminq_unregister_page_list(struct gve_priv *priv, u32 page_list_id);
+int gve_adminq_set_mtu(struct gve_priv *priv, u64 mtu);
+int gve_adminq_report_stats(struct gve_priv *priv, u64 stats_report_len,
+ dma_addr_t stats_report_addr, u64 interval);
+int gve_adminq_report_link_speed(struct gve_priv *priv);
+#endif /* _GVE_ADMINQ_H */
diff --git a/drivers/net/ethernet/google/gve/gve_desc.h b/drivers/net/ethernet/google/gve/gve_desc.h
new file mode 100644
index 0000000..d4553fb
--- /dev/null
+++ b/drivers/net/ethernet/google/gve/gve_desc.h
@@ -0,0 +1,121 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR MIT)
+ * Google virtual Ethernet (gve) driver
+ *
+ * Copyright (C) 2015-2019 Google, Inc.
+ */
+
+/* GVE Transmit Descriptor formats */
+
+#ifndef _GVE_DESC_H_
+#define _GVE_DESC_H_
+
+#if LINUX_VERSION_CODE < KERNEL_VERSION(5,1,0)
+#include "gve_size_assert.h"
+#else /* LINUX_VERSION_CODE < KERNEL_VERSION(5,1,0) */
+#include <linux/build_bug.h>
+#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(5,1,0) */
+
+/* A note on seg_addrs
+ *
+ * Base addresses encoded in seg_addr are not assumed to be physical
+ * addresses. The ring format assumes these come from some linear address
+ * space. This could be physical memory, kernel virtual memory, or user
+ * virtual memory.
+ * If raw dma addressing is not supported then gVNIC uses lists of registered
+ * pages. Each queue is assumed to be associated with a single such linear
+ * address space to ensure a consistent meaning for seg_addrs posted to its
+ * rings.
+ */
+
+struct gve_tx_pkt_desc {
+ u8 type_flags; /* desc type is lower 4 bits, flags upper */
+ u8 l4_csum_offset; /* relative offset of L4 csum word */
+ u8 l4_hdr_offset; /* Offset of start of L4 headers in packet */
+ u8 desc_cnt; /* Total descriptors for this packet */
+ __be16 len; /* Total length of this packet (in bytes) */
+ __be16 seg_len; /* Length of this descriptor's segment */
+ __be64 seg_addr; /* Base address (see note) of this segment */
+} __packed;
+
+struct gve_tx_seg_desc {
+ u8 type_flags; /* type is lower 4 bits, flags upper */
+ u8 l3_offset; /* TSO: 2 byte units to start of IPH */
+ __be16 reserved;
+ __be16 mss; /* TSO MSS */
+ __be16 seg_len;
+ __be64 seg_addr;
+} __packed;
+
+/* GVE Transmit Descriptor Types */
+#define GVE_TXD_STD (0x0 << 4) /* Std with Host Address */
+#define GVE_TXD_TSO (0x1 << 4) /* TSO with Host Address */
+#define GVE_TXD_SEG (0x2 << 4) /* Seg with Host Address */
+
+/* GVE Transmit Descriptor Flags for Std Pkts */
+#define GVE_TXF_L4CSUM BIT(0) /* Need csum offload */
+#define GVE_TXF_TSTAMP BIT(2) /* Timestamp required */
+
+/* GVE Transmit Descriptor Flags for TSO Segs */
+#define GVE_TXSF_IPV6 BIT(1) /* IPv6 TSO */
+
+/* GVE Receive Packet Descriptor */
+/* The start of an ethernet packet comes 2 bytes into the rx buffer.
+ * gVNIC adds this padding so that both the DMA and the L3/4 protocol header
+ * accesses are aligned.
+ */
+#define GVE_RX_PAD 2
+
+struct gve_rx_desc {
+ u8 padding[48];
+ __be32 rss_hash; /* Receive-side scaling hash (Toeplitz for gVNIC) */
+ __be16 mss;
+ __be16 reserved; /* Reserved to zero */
+ u8 hdr_len; /* Header length (L2-L4) including padding */
+ u8 hdr_off; /* 64-byte-scaled offset into RX_DATA entry */
+ __sum16 csum; /* 1's-complement partial checksum of L3+ bytes */
+ __be16 len; /* Length of the received packet */
+ __be16 flags_seq; /* Flags [15:3] and sequence number [2:0] (1-7) */
+} __packed;
+static_assert(sizeof(struct gve_rx_desc) == 64);
+
+/* If the device supports raw dma addressing then the addr in data slot is
+ * the dma address of the buffer.
+ * If the device only supports registered segments then the addr is a byte
+ * offset into the registered segment (an ordered list of pages) where the
+ * buffer is.
+ */
+struct gve_rx_data_slot {
+ __be64 addr;
+};
+
+/* GVE Receive Packet Descriptor Seq No */
+#define GVE_SEQNO(x) (be16_to_cpu(x) & 0x7)
+
+/* GVE Receive Packet Descriptor Flags */
+#define GVE_RXFLG(x) cpu_to_be16(1 << (3 + (x)))
+#define GVE_RXF_FRAG GVE_RXFLG(3) /* IP Fragment */
+#define GVE_RXF_IPV4 GVE_RXFLG(4) /* IPv4 */
+#define GVE_RXF_IPV6 GVE_RXFLG(5) /* IPv6 */
+#define GVE_RXF_TCP GVE_RXFLG(6) /* TCP Packet */
+#define GVE_RXF_UDP GVE_RXFLG(7) /* UDP Packet */
+#define GVE_RXF_ERR GVE_RXFLG(8) /* Packet Error Detected */
+
+/* GVE IRQ */
+#define GVE_IRQ_ACK BIT(31)
+#define GVE_IRQ_MASK BIT(30)
+#define GVE_IRQ_EVENT BIT(29)
+
+static inline bool gve_needs_rss(__be16 flag)
+{
+ if (flag & GVE_RXF_FRAG)
+ return false;
+ if (flag & (GVE_RXF_IPV4 | GVE_RXF_IPV6))
+ return true;
+ return false;
+}
+
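+/* RX descriptor sequence numbers are 3 bits wide with valid values 1-7;
+ * they wrap from 7 back to 1.
+ */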
+static inline u8 gve_next_seqno(u8 seq)
+{
+ return (seq + 1) == 8 ? 1 : seq + 1;
+}
+#endif /* _GVE_DESC_H_ */
diff --git a/drivers/net/ethernet/google/gve/gve_ethtool.c b/drivers/net/ethernet/google/gve/gve_ethtool.c
new file mode 100644
index 0000000..0ad957f
--- /dev/null
+++ b/drivers/net/ethernet/google/gve/gve_ethtool.c
@@ -0,0 +1,541 @@
+// SPDX-License-Identifier: (GPL-2.0 OR MIT)
+/* Google virtual Ethernet (gve) driver
+ *
+ * Copyright (C) 2015-2019 Google, Inc.
+ */
+
+#include "gve_linux_version.h"
+#include <linux/rtnetlink.h>
+#include "gve.h"
+#include "gve_adminq.h"
+
+static void gve_get_drvinfo(struct net_device *netdev,
+ struct ethtool_drvinfo *info)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+
+ strlcpy(info->driver, "gve", sizeof(info->driver));
+ strlcpy(info->version, gve_version_str, sizeof(info->version));
+ strlcpy(info->bus_info, pci_name(priv->pdev), sizeof(info->bus_info));
+}
+
+static void gve_set_msglevel(struct net_device *netdev, u32 value)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+
+ priv->msg_enable = value;
+}
+
+static u32 gve_get_msglevel(struct net_device *netdev)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+
+ return priv->msg_enable;
+}
+
+static const char gve_gstrings_main_stats[][ETH_GSTRING_LEN] = {
+ "rx_packets", "rx_total_bytes", "rx_total_dropped_pkt",
+ "rx_skb_alloc_fail", "rx_buf_alloc_fail", "rx_desc_err_dropped_pkt",
+ "tx_packets", "tx_total_bytes", "tx_total_dropped_pkt", "tx_timeouts",
+ "interface_up_cnt", "interface_down_cnt", "reset_cnt",
+ "page_alloc_fail", "dma_mapping_error",
+};
+
+static const char gve_gstrings_rx_stats[][ETH_GSTRING_LEN] = {
+ "rx_posted_desc[%u]", "rx_completed_desc[%u]", "rx_bytes[%u]",
+ "rx_dropped_pkt[%u]", "rx_copybreak_pkt[%u]", "rx_copied_pkt[%u]",
+ "rx_queue_drop_cnt[%u]", "rx_no_buffers_posted[%u]",
+ "rx_drops_packet_over_mru[%u]", "rx_drops_invalid_checksum[%u]",
+};
+
+static const char gve_gstrings_tx_stats[][ETH_GSTRING_LEN] = {
+ "tx_posted_desc[%u]", "tx_completed_desc[%u]", "tx_bytes[%u]",
+ "tx_wake[%u]", "tx_stop[%u]", "tx_event_counter[%u]",
+};
+
+static const char gve_gstrings_adminq_stats[][ETH_GSTRING_LEN] = {
+ "adminq_prod_cnt", "adminq_cmd_fail", "adminq_timeouts",
+ "adminq_describe_device_cnt", "adminq_cfg_device_resources_cnt",
+ "adminq_register_page_list_cnt", "adminq_unregister_page_list_cnt",
+ "adminq_create_tx_queue_cnt", "adminq_create_rx_queue_cnt",
+ "adminq_destroy_tx_queue_cnt", "adminq_destroy_rx_queue_cnt",
+ "adminq_dcfg_device_resources_cnt", "adminq_set_driver_parameter_cnt",
+ "adminq_report_stats_cnt",
+};
+
+static const char gve_gstrings_priv_flags[][ETH_GSTRING_LEN] = {
+ "report-stats",
+};
+
+#define GVE_MAIN_STATS_LEN ARRAY_SIZE(gve_gstrings_main_stats)
+#define GVE_ADMINQ_STATS_LEN ARRAY_SIZE(gve_gstrings_adminq_stats)
+#define NUM_GVE_TX_CNTS ARRAY_SIZE(gve_gstrings_tx_stats)
+#define NUM_GVE_RX_CNTS ARRAY_SIZE(gve_gstrings_rx_stats)
+#define GVE_PRIV_FLAGS_STR_LEN ARRAY_SIZE(gve_gstrings_priv_flags)
+
+static void gve_get_strings(struct net_device *netdev, u32 stringset, u8 *data)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+ char *s = (char *)data;
+ int i, j;
+
+ switch (stringset) {
+ case ETH_SS_STATS:
+ memcpy(s, *gve_gstrings_main_stats,
+ sizeof(gve_gstrings_main_stats));
+ s += sizeof(gve_gstrings_main_stats);
+
+ for (i = 0; i < priv->rx_cfg.num_queues; i++) {
+ for (j = 0; j < NUM_GVE_RX_CNTS; j++) {
+ snprintf(s, ETH_GSTRING_LEN,
+ gve_gstrings_rx_stats[j], i);
+ s += ETH_GSTRING_LEN;
+ }
+ }
+
+ for (i = 0; i < priv->tx_cfg.num_queues; i++) {
+ for (j = 0; j < NUM_GVE_TX_CNTS; j++) {
+ snprintf(s, ETH_GSTRING_LEN,
+ gve_gstrings_tx_stats[j], i);
+ s += ETH_GSTRING_LEN;
+ }
+ }
+
+ memcpy(s, *gve_gstrings_adminq_stats,
+ sizeof(gve_gstrings_adminq_stats));
+ s += sizeof(gve_gstrings_adminq_stats);
+ break;
+
+ case ETH_SS_PRIV_FLAGS:
+ memcpy(s, *gve_gstrings_priv_flags,
+ sizeof(gve_gstrings_priv_flags));
+ s += sizeof(gve_gstrings_priv_flags);
+ break;
+
+ default:
+ break;
+ }
+}
+
+static int gve_get_sset_count(struct net_device *netdev, int sset)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+
+ switch (sset) {
+ case ETH_SS_STATS:
+ return GVE_MAIN_STATS_LEN + GVE_ADMINQ_STATS_LEN +
+ (priv->rx_cfg.num_queues * NUM_GVE_RX_CNTS) +
+ (priv->tx_cfg.num_queues * NUM_GVE_TX_CNTS);
+ case ETH_SS_PRIV_FLAGS:
+ return GVE_PRIV_FLAGS_STR_LEN;
+ default:
+ return -EOPNOTSUPP;
+ }
+}
+
+static void
+gve_get_ethtool_stats(struct net_device *netdev,
+ struct ethtool_stats *stats, u64 *data)
+{
+ u64 tmp_rx_pkts, tmp_rx_bytes, tmp_rx_skb_alloc_fail,
+ tmp_rx_buf_alloc_fail, tmp_rx_desc_err_dropped_pkt,
+ tmp_tx_pkts, tmp_tx_bytes;
+ u64 rx_pkts, rx_bytes, rx_skb_alloc_fail, rx_buf_alloc_fail,
+ rx_desc_err_dropped_pkt, tx_pkts, tx_bytes;
+ struct gve_priv *priv = netdev_priv(netdev);
+ int *rx_qid_to_stats_idx;
+ int *tx_qid_to_stats_idx;
+ struct stats *report_stats = priv->stats_report->stats;
+ int stats_idx, base_stats_idx, max_stats_idx;
+ bool skip_nic_stats;
+
+ unsigned int start;
+ int ring;
+ int i, j;
+
+ ASSERT_RTNL();
+#if (LINUX_VERSION_CODE < KERNEL_VERSION(4,11,0))
+ memset(data, 0, stats->n_stats * sizeof(*data));
+#endif /* (LINUX_VERSION_CODE < KERNEL_VERSION(4,11,0)) */
+
+ rx_qid_to_stats_idx = kmalloc_array(priv->rx_cfg.num_queues,
+ sizeof(int), GFP_KERNEL);
+ if (!rx_qid_to_stats_idx) {
+ return;
+ }
+ tx_qid_to_stats_idx = kmalloc_array(priv->tx_cfg.num_queues,
+ sizeof(int), GFP_KERNEL);
+ if (!tx_qid_to_stats_idx) {
+ kfree(rx_qid_to_stats_idx);
+ return;
+ }
+
+ for (rx_pkts = 0, rx_bytes = 0, rx_skb_alloc_fail = 0,
+ rx_buf_alloc_fail = 0, rx_desc_err_dropped_pkt = 0, ring = 0;
+ ring < priv->rx_cfg.num_queues; ring++) {
+ if (priv->rx) {
+ do {
+ struct gve_rx_ring *rx = &priv->rx[ring];
+ start =
+ u64_stats_fetch_begin(&priv->rx[ring].statss);
+ tmp_rx_pkts = rx->rpackets;
+ tmp_rx_bytes = rx->rbytes;
+ tmp_rx_skb_alloc_fail = rx->rx_skb_alloc_fail;
+ tmp_rx_buf_alloc_fail = rx->rx_buf_alloc_fail;
+ tmp_rx_desc_err_dropped_pkt =
+ rx->rx_desc_err_dropped_pkt;
+
+ } while (u64_stats_fetch_retry(&priv->rx[ring].statss,
+ start));
+ rx_pkts += tmp_rx_pkts;
+ rx_bytes += tmp_rx_bytes;
+ rx_skb_alloc_fail += tmp_rx_skb_alloc_fail;
+ rx_buf_alloc_fail += tmp_rx_buf_alloc_fail;
+ rx_desc_err_dropped_pkt += tmp_rx_desc_err_dropped_pkt;
+
+ }
+ }
+ for (tx_pkts = 0, tx_bytes = 0, ring = 0;
+ ring < priv->tx_cfg.num_queues; ring++) {
+ if (priv->tx) {
+ do {
+ start =
+ u64_stats_fetch_begin(&priv->tx[ring].statss);
+ tmp_tx_pkts = priv->tx[ring].pkt_done;
+ tmp_tx_bytes = priv->tx[ring].bytes_done;
+ } while (u64_stats_fetch_retry(&priv->tx[ring].statss,
+ start));
+ tx_pkts += tmp_tx_pkts;
+ tx_bytes += tmp_tx_bytes;
+ }
+ }
+
+ i = 0;
+ data[i++] = rx_pkts;
+ data[i++] = rx_bytes;
+ /* total rx dropped packets */
+ data[i++] = rx_skb_alloc_fail + rx_buf_alloc_fail +
+ rx_desc_err_dropped_pkt;
+ data[i++] = rx_skb_alloc_fail;
+ data[i++] = rx_buf_alloc_fail;
+ data[i++] = rx_desc_err_dropped_pkt;
+ data[i++] = tx_pkts;
+ data[i++] = tx_bytes;
+ /* Skip tx_dropped */
+ i++;
+ data[i++] = priv->tx_timeo_cnt;
+ data[i++] = priv->interface_up_cnt;
+ data[i++] = priv->interface_down_cnt;
+ data[i++] = priv->reset_cnt;
+ data[i++] = priv->page_alloc_fail;
+ data[i++] = priv->dma_mapping_error;
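+ /* per-queue stats start immediately after the main stats block */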
+ i = GVE_MAIN_STATS_LEN;
+
+ /* For rx cross-reporting stats, start from nic rx stats in report */
+ base_stats_idx = GVE_TX_STATS_REPORT_NUM * priv->tx_cfg.num_queues +
+ GVE_RX_STATS_REPORT_NUM * priv->rx_cfg.num_queues;
+ max_stats_idx = NIC_RX_STATS_REPORT_NUM * priv->rx_cfg.num_queues +
+ base_stats_idx;
+ /* Preprocess the stats report for rx, map queue id to start index */
+ skip_nic_stats = false;
+ for (stats_idx = base_stats_idx; stats_idx < max_stats_idx;
+ stats_idx += NIC_RX_STATS_REPORT_NUM) {
+ u32 stat_name = be32_to_cpu(report_stats[stats_idx].stat_name);
+ u32 queue_id = be32_to_cpu(report_stats[stats_idx].queue_id);
+ if (stat_name == 0) {
+ /* no stats written by NIC yet */
+ skip_nic_stats = true;
+ break;
+ }
+ rx_qid_to_stats_idx[queue_id] = stats_idx;
+ }
+ /* walk RX rings */
+ if (priv->rx) {
+ for (ring = 0; ring < priv->rx_cfg.num_queues; ring++) {
+ struct gve_rx_ring *rx = &priv->rx[ring];
+
+ data[i++] = rx->fill_cnt;
+ data[i++] = rx->cnt;
+ do {
+ start =
+ u64_stats_fetch_begin(&priv->rx[ring].statss);
+ tmp_rx_bytes = rx->rbytes;
+ tmp_rx_skb_alloc_fail = rx->rx_skb_alloc_fail;
+ tmp_rx_buf_alloc_fail = rx->rx_buf_alloc_fail;
+ tmp_rx_desc_err_dropped_pkt =
+ rx->rx_desc_err_dropped_pkt;
+ } while (u64_stats_fetch_retry(&priv->rx[ring].statss,
+ start));
+ data[i++] = tmp_rx_bytes;
+ /* rx dropped packets */
+ data[i++] = tmp_rx_skb_alloc_fail +
+ tmp_rx_buf_alloc_fail +
+ tmp_rx_desc_err_dropped_pkt;
+ data[i++] = rx->rx_copybreak_pkt;
+ data[i++] = rx->rx_copied_pkt;
+ /* stats from NIC */
+ if (skip_nic_stats) {
+ /* skip NIC rx stats */
+ i += NIC_RX_STATS_REPORT_NUM;
+ continue;
+ }
+ for (j = 0; j < NIC_RX_STATS_REPORT_NUM; j++) {
+ u64 value = be64_to_cpu(report_stats[
+ rx_qid_to_stats_idx[ring] + j].value);
+ data[i++] = value;
+ }
+ }
+ } else {
+ i += priv->rx_cfg.num_queues * NUM_GVE_RX_CNTS;
+ }
+ /* For tx cross-reporting stats, start from nic tx stats in report */
+ base_stats_idx = max_stats_idx;
+ max_stats_idx = NIC_TX_STATS_REPORT_NUM * priv->tx_cfg.num_queues +
+ max_stats_idx;
+ /* Preprocess the stats report for tx, map queue id to start index */
+ skip_nic_stats = false;
+ for (stats_idx = base_stats_idx; stats_idx < max_stats_idx;
+ stats_idx += NIC_TX_STATS_REPORT_NUM) {
+ u32 stat_name = be32_to_cpu(report_stats[stats_idx].stat_name);
+ u32 queue_id = be32_to_cpu(report_stats[stats_idx].queue_id);
+ if (stat_name == 0) {
+ /* no stats written by NIC yet */
+ skip_nic_stats = true;
+ break;
+ }
+ tx_qid_to_stats_idx[queue_id] = stats_idx;
+ }
+ /* walk TX rings */
+ if (priv->tx) {
+ for (ring = 0; ring < priv->tx_cfg.num_queues; ring++) {
+ struct gve_tx_ring *tx = &priv->tx[ring];
+
+ data[i++] = tx->req;
+ data[i++] = tx->done;
+ do {
+ start =
+ u64_stats_fetch_begin(&priv->tx[ring].statss);
+ tmp_tx_bytes = tx->bytes_done;
+ } while (u64_stats_fetch_retry(&priv->tx[ring].statss,
+ start));
+ data[i++] = tmp_tx_bytes;
+ data[i++] = tx->wake_queue;
+ data[i++] = tx->stop_queue;
+ data[i++] = be32_to_cpu(gve_tx_load_event_counter(priv,
+ tx));
+ /* stats from NIC */
+ if (skip_nic_stats) {
+ /* skip NIC tx stats */
+ i += NIC_TX_STATS_REPORT_NUM;
+ continue;
+ }
+ for (j = 0; j < NIC_TX_STATS_REPORT_NUM; j++) {
+ u64 value = be64_to_cpu(report_stats[
+ tx_qid_to_stats_idx[ring] + j].value);
+ data[i++] = value;
+ }
+ }
+ } else {
+ i += priv->tx_cfg.num_queues * NUM_GVE_TX_CNTS;
+ }
+
+ kfree(rx_qid_to_stats_idx);
+ kfree(tx_qid_to_stats_idx);
+
+ /* AQ Stats */
+ data[i++] = priv->adminq_prod_cnt;
+ data[i++] = priv->adminq_cmd_fail;
+ data[i++] = priv->adminq_timeouts;
+ data[i++] = priv->adminq_describe_device_cnt;
+ data[i++] = priv->adminq_cfg_device_resources_cnt;
+ data[i++] = priv->adminq_register_page_list_cnt;
+ data[i++] = priv->adminq_unregister_page_list_cnt;
+ data[i++] = priv->adminq_create_tx_queue_cnt;
+ data[i++] = priv->adminq_create_rx_queue_cnt;
+ data[i++] = priv->adminq_destroy_tx_queue_cnt;
+ data[i++] = priv->adminq_destroy_rx_queue_cnt;
+ data[i++] = priv->adminq_dcfg_device_resources_cnt;
+ data[i++] = priv->adminq_set_driver_parameter_cnt;
+ data[i++] = priv->adminq_report_stats_cnt;
+}
+
+static void gve_get_channels(struct net_device *netdev,
+ struct ethtool_channels *cmd)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+
+ cmd->max_rx = priv->rx_cfg.max_queues;
+ cmd->max_tx = priv->tx_cfg.max_queues;
+ cmd->max_other = 0;
+ cmd->max_combined = 0;
+ cmd->rx_count = priv->rx_cfg.num_queues;
+ cmd->tx_count = priv->tx_cfg.num_queues;
+ cmd->other_count = 0;
+ cmd->combined_count = 0;
+}
+
+static int gve_set_channels(struct net_device *netdev,
+ struct ethtool_channels *cmd)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+ struct gve_queue_config new_tx_cfg = priv->tx_cfg;
+ struct gve_queue_config new_rx_cfg = priv->rx_cfg;
+ struct ethtool_channels old_settings;
+ int new_tx = cmd->tx_count;
+ int new_rx = cmd->rx_count;
+
+ gve_get_channels(netdev, &old_settings);
+
+ /* Changing combined is not allowed */
+ if (cmd->combined_count != old_settings.combined_count)
+ return -EINVAL;
+
+ if (!new_rx || !new_tx)
+ return -EINVAL;
+
+ if (!netif_carrier_ok(netdev)) {
+ priv->tx_cfg.num_queues = new_tx;
+ priv->rx_cfg.num_queues = new_rx;
+ return 0;
+ }
+
+ new_tx_cfg.num_queues = new_tx;
+ new_rx_cfg.num_queues = new_rx;
+
+ return gve_adjust_queues(priv, new_rx_cfg, new_tx_cfg);
+}
+
+static void gve_get_ringparam(struct net_device *netdev,
+ struct ethtool_ringparam *cmd)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+
+ cmd->rx_max_pending = priv->rx_desc_cnt;
+ cmd->tx_max_pending = priv->tx_desc_cnt;
+ cmd->rx_pending = priv->rx_desc_cnt;
+ cmd->tx_pending = priv->tx_desc_cnt;
+}
+
+static int gve_user_reset(struct net_device *netdev, u32 *flags)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+
+ if (*flags == ETH_RESET_ALL) {
+ *flags = 0;
+ return gve_reset(priv, true);
+ }
+
+ return -EOPNOTSUPP;
+}
+
+static int gve_get_tunable(struct net_device *netdev,
+ const struct ethtool_tunable *etuna, void *value)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+
+ switch (etuna->id) {
+ case ETHTOOL_RX_COPYBREAK:
+ *(u32 *)value = priv->rx_copybreak;
+ return 0;
+ default:
+ return -EINVAL;
+ }
+}
+
+static int gve_set_tunable(struct net_device *netdev,
+ const struct ethtool_tunable *etuna,
+ const void *value)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+ u32 len;
+
+ switch (etuna->id) {
+ case ETHTOOL_RX_COPYBREAK:
+ len = *(u32 *)value;
+ if (len > priv->dev->mtu) {
+ return -EINVAL;
+ }
+ priv->rx_copybreak = len;
+ return 0;
+ default:
+ return -EINVAL;
+ }
+}
+
+static u32 gve_get_priv_flags(struct net_device *netdev)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+ u32 i, ret_flags = 0;
+
+ for (i = 0; i < GVE_PRIV_FLAGS_STR_LEN; i++) {
+ if (priv->ethtool_flags & BIT(i)) {
+ ret_flags |= BIT(i);
+ }
+ }
+ return ret_flags;
+}
+
+static int gve_set_priv_flags(struct net_device *netdev, u32 flags)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+ u64 ori_flags, new_flags;
+ u32 i;
+
+ ori_flags = READ_ONCE(priv->ethtool_flags);
+ new_flags = ori_flags;
+
+ for (i = 0; i < GVE_PRIV_FLAGS_STR_LEN; i++) {
+ if (flags & BIT(i))
+ new_flags |= BIT(i);
+ else
+ new_flags &= ~(BIT(i));
+ priv->ethtool_flags = new_flags;
+ /* set report-stats */
+ if (strcmp(gve_gstrings_priv_flags[i], "report-stats") == 0) {
+ /* update the stats when user turns report-stats on */
+ if (flags & BIT(i))
+ gve_handle_report_stats(priv);
+ /* zero off gve stats when report-stats turned off */
+ if (!(flags & BIT(i)) && (ori_flags & BIT(i))) {
+ int tx_stats_num = GVE_TX_STATS_REPORT_NUM *
+ priv->tx_cfg.num_queues;
+ int rx_stats_num = GVE_RX_STATS_REPORT_NUM *
+ priv->rx_cfg.num_queues;
+ memset(priv->stats_report->stats, 0,
+ (tx_stats_num + rx_stats_num) *
+ sizeof(struct stats));
+ }
+ }
+ }
+
+ return 0;
+}
+
+static int gve_get_link_ksettings(struct net_device *netdev,
+ struct ethtool_link_ksettings *cmd)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+ int err = gve_adminq_report_link_speed(priv);
+
+ cmd->base.speed = priv->link_speed;
+ return err;
+}
+
+const struct ethtool_ops gve_ethtool_ops = {
+ .get_drvinfo = gve_get_drvinfo,
+ .get_strings = gve_get_strings,
+ .get_sset_count = gve_get_sset_count,
+ .get_ethtool_stats = gve_get_ethtool_stats,
+ .set_msglevel = gve_set_msglevel,
+ .get_msglevel = gve_get_msglevel,
+ .set_channels = gve_set_channels,
+ .get_channels = gve_get_channels,
+ .get_link = ethtool_op_get_link,
+ .get_ringparam = gve_get_ringparam,
+ .reset = gve_user_reset,
+ .get_tunable = gve_get_tunable,
+ .set_tunable = gve_set_tunable,
+ .get_priv_flags = gve_get_priv_flags,
+ .set_priv_flags = gve_set_priv_flags,
+ .get_link_ksettings = gve_get_link_ksettings
+};
diff --git a/drivers/net/ethernet/google/gve/gve_linux_version.h b/drivers/net/ethernet/google/gve/gve_linux_version.h
new file mode 100644
index 0000000..08b6bea
--- /dev/null
+++ b/drivers/net/ethernet/google/gve/gve_linux_version.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR MIT)
+ * Google virtual Ethernet (gve) driver
+ *
+ * Copyright (C) 2015-2018 Google, Inc.
+ */
+
+#ifndef _GVE_LINUX_VERSION_H
+#define _GVE_LINUX_VERSION_H
+
+#ifndef LINUX_VERSION_CODE
+#include <linux/version.h>
+#else
+#define KERNEL_VERSION(a,b,c) (((a) << 16) + ((b) << 8) + (c))
+#endif
+#ifndef UTS_RELEASE
+#include <generated/utsrelease.h>
+#endif /* UTS_RELEASE */
+
+#ifndef RHEL_RELEASE_CODE
+#define RHEL_RELEASE_CODE 0
+#endif /* RHEL_RELEASE_CODE */
+
+#ifndef RHEL_RELEASE_VERSION
+#define RHEL_RELEASE_VERSION(a,b) (((a) << 8) + (b))
+#endif /* RHEL_RELEASE_VERSION */
+
+#ifndef UTS_UBUNTU_RELEASE_ABI
+#define UTS_UBUNTU_RELEASE_ABI 0
+#define UBUNTU_VERSION_CODE 0
+#else
+#define UBUNTU_VERSION_CODE (((LINUX_VERSION_CODE & ~0xFF) << 8) + (UTS_UBUNTU_RELEASE_ABI))
+#endif /* UTS_UBUNTU_RELEASE_ABI */
+
+#define UBUNTU_VERSION(a,b,c,d) ((KERNEL_VERSION(a,b,0) << 8) + (d))
+
+#endif /* _GVE_LINUX_VERSION_H */
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
new file mode 100644
index 0000000..baad1b6
--- /dev/null
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -0,0 +1,1565 @@
+// SPDX-License-Identifier: (GPL-2.0 OR MIT)
+/* Google virtual Ethernet (gve) driver
+ *
+ * Copyright (C) 2015-2019 Google, Inc.
+ */
+
+#include "gve_linux_version.h"
+#include <linux/cpumask.h>
+#include <linux/etherdevice.h>
+#include <linux/interrupt.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/sched.h>
+#include <linux/timer.h>
+#include <linux/workqueue.h>
+#include <net/sch_generic.h>
+#include "gve.h"
+#include "gve_adminq.h"
+#include "gve_register.h"
+
+#define GVE_DEFAULT_RX_COPYBREAK (256)
+
+#define DEFAULT_MSG_LEVEL (NETIF_MSG_DRV | NETIF_MSG_LINK)
+#define GVE_VERSION "1.1.0"
+#define GVE_VERSION_PREFIX "GVE-"
+
+const char gve_version_str[] = GVE_VERSION;
+static const char gve_version_prefix[] = GVE_VERSION_PREFIX;
+
+static void gve_get_stats(struct net_device *dev, struct rtnl_link_stats64 *s)
+{
+ struct gve_priv *priv = netdev_priv(dev);
+ unsigned int start;
+ int ring;
+
+ if (priv->rx) {
+ for (ring = 0; ring < priv->rx_cfg.num_queues; ring++) {
+ do {
+ start =
+ u64_stats_fetch_begin(&priv->rx[ring].statss);
+ s->rx_packets += priv->rx[ring].rpackets;
+ s->rx_bytes += priv->rx[ring].rbytes;
+ } while (u64_stats_fetch_retry(&priv->rx[ring].statss,
+ start));
+ }
+ }
+ if (priv->tx) {
+ for (ring = 0; ring < priv->tx_cfg.num_queues; ring++) {
+ do {
+ start =
+ u64_stats_fetch_begin(&priv->tx[ring].statss);
+ s->tx_packets += priv->tx[ring].pkt_done;
+ s->tx_bytes += priv->tx[ring].bytes_done;
+ } while (u64_stats_fetch_retry(&priv->tx[ring].statss,
+ start));
+ }
+ }
+}
+#if (LINUX_VERSION_CODE < KERNEL_VERSION(4,11,0))
+static struct rtnl_link_stats64 *
+backport_gve_get_stats(struct net_device *dev, struct rtnl_link_stats64 *s)
+{
+ gve_get_stats(dev, s);
+ return s;
+}
+#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(4,11,0) */
+
+static int gve_alloc_counter_array(struct gve_priv *priv)
+{
+ priv->counter_array =
+ dma_alloc_coherent(&priv->pdev->dev,
+ priv->num_event_counters *
+ sizeof(*priv->counter_array),
+ &priv->counter_array_bus, GFP_KERNEL);
+ if (!priv->counter_array)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static void gve_free_counter_array(struct gve_priv *priv)
+{
+ dma_free_coherent(&priv->pdev->dev,
+ priv->num_event_counters *
+ sizeof(*priv->counter_array),
+ priv->counter_array, priv->counter_array_bus);
+ priv->counter_array = NULL;
+}
+
+void gve_service_task_schedule(struct gve_priv *priv)
+{
+ if (!gve_get_probe_in_progress(priv) &&
+ !gve_get_reset_in_progress(priv)) {
+ gve_set_do_report_stats(priv);
+ queue_work(priv->gve_wq, &priv->service_task);
+ }
+}
+
+static void gve_service_timer(struct timer_list *t)
+{
+ struct gve_priv *priv = from_timer(priv, t, service_timer);
+
+ mod_timer(&priv->service_timer,
+ round_jiffies(jiffies +
+ msecs_to_jiffies(priv->service_timer_period)));
+ gve_service_task_schedule(priv);
+}
+
+static int gve_alloc_stats_report(struct gve_priv *priv)
+{
+ int tx_stats_num, rx_stats_num;
+
+ tx_stats_num = (GVE_TX_STATS_REPORT_NUM + NIC_TX_STATS_REPORT_NUM) *
+ priv->tx_cfg.num_queues;
+ rx_stats_num = (GVE_RX_STATS_REPORT_NUM + NIC_RX_STATS_REPORT_NUM) *
+ priv->rx_cfg.num_queues;
+ priv->stats_report_len = sizeof(struct gve_stats_report) +
+ (tx_stats_num + rx_stats_num) *
+ sizeof(struct stats);
+ priv->stats_report =
+ dma_alloc_coherent(&priv->pdev->dev, priv->stats_report_len,
+ &priv->stats_report_bus, GFP_KERNEL);
+ if (!priv->stats_report)
+ return -ENOMEM;
+ /* Set up timer for periodic task */
+ timer_setup(&priv->service_timer, gve_service_timer, 0);
+ priv->service_timer_period = GVE_SERVICE_TIMER_PERIOD;
+ /* Start the service task timer */
+ mod_timer(&priv->service_timer,
+ round_jiffies(jiffies +
+ msecs_to_jiffies(priv->service_timer_period)));
+ return 0;
+}
+
+static void gve_free_stats_report(struct gve_priv *priv)
+{
+
+ del_timer_sync(&priv->service_timer);
+ dma_free_coherent(&priv->pdev->dev, priv->stats_report_len,
+ priv->stats_report, priv->stats_report_bus);
+ priv->stats_report = NULL;
+}
+
+static irqreturn_t gve_mgmnt_intr(int irq, void *arg)
+{
+ struct gve_priv *priv = arg;
+
+ queue_work(priv->gve_wq, &priv->service_task);
+ return IRQ_HANDLED;
+}
+
+static irqreturn_t gve_intr(int irq, void *arg)
+{
+ struct gve_notify_block *block = arg;
+ struct gve_priv *priv = block->priv;
+
+ iowrite32be(GVE_IRQ_MASK, gve_irq_doorbell(priv, block));
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0)
+ napi_schedule_irqoff(&block->napi);
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0) */
+ napi_schedule(&block->napi);
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0) */
+ return IRQ_HANDLED;
+}
+
+static int gve_napi_poll(struct napi_struct *napi, int budget)
+{
+ struct gve_notify_block *block;
+ __be32 __iomem *irq_doorbell;
+ bool reschedule = false;
+ struct gve_priv *priv;
+
+ block = container_of(napi, struct gve_notify_block, napi);
+ priv = block->priv;
+
+ if (block->tx)
+ reschedule |= gve_tx_poll(block, budget);
+ if (block->rx)
+ reschedule |= gve_rx_poll(block, budget);
+
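+ /* Returning the full budget tells the NAPI core there is still work,
+ * so it will poll this block again.
+ */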
+ if (reschedule)
+ return budget;
+
+ napi_complete(napi);
+ irq_doorbell = gve_irq_doorbell(priv, block);
+ iowrite32be(GVE_IRQ_ACK | GVE_IRQ_EVENT, irq_doorbell);
+
+ /* Double check we have no extra work.
+ * Ensure unmask synchronizes with checking for work.
+ */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0)
+ dma_rmb();
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0) */
+ rmb();
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0) */
+ if (block->tx)
+ reschedule |= gve_tx_poll(block, -1);
+ if (block->rx)
+ reschedule |= gve_rx_poll(block, -1);
+ if (reschedule && napi_reschedule(napi))
+ iowrite32be(GVE_IRQ_MASK, irq_doorbell);
+
+ return 0;
+}
+
+static int gve_alloc_notify_blocks(struct gve_priv *priv)
+{
+ int num_vecs_requested = priv->num_ntfy_blks + 1;
+ char *name = priv->dev->name;
+ unsigned int active_cpus;
+ int vecs_enabled;
+ int i, j;
+ int err;
+
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0)
+ priv->msix_vectors = kvzalloc(num_vecs_requested *
+ sizeof(*priv->msix_vectors), GFP_KERNEL);
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ priv->msix_vectors = kcalloc(num_vecs_requested,
+ sizeof(*priv->msix_vectors), GFP_KERNEL);
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ if (!priv->msix_vectors)
+ return -ENOMEM;
+ for (i = 0; i < num_vecs_requested; i++)
+ priv->msix_vectors[i].entry = i;
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0)
+ vecs_enabled = pci_enable_msix_range(priv->pdev, priv->msix_vectors,
+ GVE_MIN_MSIX, num_vecs_requested);
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0) */
+ vecs_enabled = pci_enable_msix(priv->pdev, priv->msix_vectors,
+ num_vecs_requested);
+ if (!vecs_enabled) {
+ vecs_enabled = num_vecs_requested;
+ }
+ else
+ if (vecs_enabled > 0) {
+ if (vecs_enabled >= GVE_MIN_MSIX) {
+ vecs_enabled = pci_enable_msix(priv->pdev,
+ priv->msix_vectors,
+ GVE_MIN_MSIX);
+ if (vecs_enabled) {
+ dev_err(&priv->pdev->dev,
+ "Could not enable min msix %d error %d\n",
+ GVE_MIN_MSIX, vecs_enabled);
+ err = vecs_enabled;
+ goto abort_with_msix_vectors;
+ }
+ else {
+ vecs_enabled = GVE_MIN_MSIX;
+ }
+ }
+ else {
+ dev_err(&priv->pdev->dev,
+ "Could not enable msix error %d\n",
+ vecs_enabled);
+ err = vecs_enabled;
+ goto abort_with_msix_vectors;
+ }
+ }
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0) */
+
+ if (vecs_enabled < 0) {
+ dev_err(&priv->pdev->dev, "Could not enable min msix %d/%d\n",
+ GVE_MIN_MSIX, vecs_enabled);
+ err = vecs_enabled;
+ goto abort_with_msix_vectors;
+ }
+ if (vecs_enabled != num_vecs_requested) {
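+ /* Fewer vectors were granted than requested: keep one for management
+ * and split the remaining notification vectors evenly between TX and RX.
+ */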
+ int new_num_ntfy_blks = (vecs_enabled - 1) & ~0x1;
+ int vecs_per_type = new_num_ntfy_blks / 2;
+ int vecs_left = new_num_ntfy_blks % 2;
+
+ priv->num_ntfy_blks = new_num_ntfy_blks;
+ priv->tx_cfg.max_queues = min_t(int, priv->tx_cfg.max_queues,
+ vecs_per_type);
+ priv->rx_cfg.max_queues = min_t(int, priv->rx_cfg.max_queues,
+ vecs_per_type + vecs_left);
+ dev_err(&priv->pdev->dev,
+ "Could not enable desired msix, only enabled %d, adjusting tx max queues to %d, and rx max queues to %d\n",
+ vecs_enabled, priv->tx_cfg.max_queues,
+ priv->rx_cfg.max_queues);
+ if (priv->tx_cfg.num_queues > priv->tx_cfg.max_queues)
+ priv->tx_cfg.num_queues = priv->tx_cfg.max_queues;
+ if (priv->rx_cfg.num_queues > priv->rx_cfg.max_queues)
+ priv->rx_cfg.num_queues = priv->rx_cfg.max_queues;
+ }
+ /* Half the notification blocks go to TX and half to RX */
+ active_cpus = min_t(int, priv->num_ntfy_blks / 2, num_online_cpus());
+
+ /* Setup Management Vector - the last vector */
+ snprintf(priv->mgmt_msix_name, sizeof(priv->mgmt_msix_name), "%s-mgmnt",
+ name);
+ err = request_irq(priv->msix_vectors[priv->mgmt_msix_idx].vector,
+ gve_mgmnt_intr, 0, priv->mgmt_msix_name, priv);
+ if (err) {
+ dev_err(&priv->pdev->dev, "Did not receive management vector.\n");
+ goto abort_with_msix_enabled;
+ }
+
+ priv->irq_db_indices =
+ dma_alloc_coherent(&priv->pdev->dev,
+ priv->num_ntfy_blks *
+ sizeof(*priv->irq_db_indices),
+ &priv->irq_db_indices_bus, GFP_KERNEL);
+ if (!priv->irq_db_indices) {
+ err = -ENOMEM;
+ goto abort_with_mgmt_vector;
+ }
+
+ priv->ntfy_blocks = kvzalloc(priv->num_ntfy_blks *
+ sizeof(*priv->ntfy_blocks), GFP_KERNEL);
+ if (!priv->ntfy_blocks) {
+ err = -ENOMEM;
+ goto abort_with_irq_db_indices;
+ }
+
+ /* Setup the other blocks - the first n-1 vectors */
+ for (i = 0; i < priv->num_ntfy_blks; i++) {
+ struct gve_notify_block *block = &priv->ntfy_blocks[i];
+ int msix_idx = i;
+
+ snprintf(block->name, sizeof(block->name), "%s-ntfy-block.%d",
+ name, i);
+ block->priv = priv;
+ err = request_irq(priv->msix_vectors[msix_idx].vector,
+ gve_intr, 0, block->name, block);
+ if (err) {
+ dev_err(&priv->pdev->dev,
+ "Failed to receive msix vector %d\n", i);
+ goto abort_with_some_ntfy_blocks;
+ }
+ irq_set_affinity_hint(priv->msix_vectors[msix_idx].vector,
+ get_cpu_mask(i % active_cpus));
+ block->irq_db_index = &priv->irq_db_indices[i].index;
+ }
+ return 0;
+abort_with_some_ntfy_blocks:
+ for (j = 0; j < i; j++) {
+ struct gve_notify_block *block = &priv->ntfy_blocks[j];
+ int msix_idx = j;
+
+ irq_set_affinity_hint(priv->msix_vectors[msix_idx].vector,
+ NULL);
+ free_irq(priv->msix_vectors[msix_idx].vector, block);
+ }
+ kvfree(priv->ntfy_blocks);
+ priv->ntfy_blocks = NULL;
+abort_with_irq_db_indices:
+ dma_free_coherent(&priv->pdev->dev, priv->num_ntfy_blks *
+ sizeof(*priv->irq_db_indices),
+ priv->irq_db_indices, priv->irq_db_indices_bus);
+ priv->irq_db_indices = NULL;
+abort_with_mgmt_vector:
+ free_irq(priv->msix_vectors[priv->mgmt_msix_idx].vector, priv);
+abort_with_msix_enabled:
+ pci_disable_msix(priv->pdev);
+abort_with_msix_vectors:
+ kfree(priv->msix_vectors);
+ priv->msix_vectors = NULL;
+ return err;
+}
+
+static void gve_free_notify_blocks(struct gve_priv *priv)
+{
+ int i;
+
+ /* Free the irqs */
+ for (i = 0; i < priv->num_ntfy_blks; i++) {
+ struct gve_notify_block *block = &priv->ntfy_blocks[i];
+ int msix_idx = i;
+
+ irq_set_affinity_hint(priv->msix_vectors[msix_idx].vector,
+ NULL);
+ free_irq(priv->msix_vectors[msix_idx].vector, block);
+ }
+ kvfree(priv->ntfy_blocks);
+ priv->ntfy_blocks = NULL;
+ dma_free_coherent(&priv->pdev->dev, priv->num_ntfy_blks *
+ sizeof(*priv->irq_db_indices),
+ priv->irq_db_indices, priv->irq_db_indices_bus);
+ priv->irq_db_indices = NULL;
+ free_irq(priv->msix_vectors[priv->mgmt_msix_idx].vector, priv);
+ pci_disable_msix(priv->pdev);
+ kfree(priv->msix_vectors);
+ priv->msix_vectors = NULL;
+}
+
+static int gve_setup_device_resources(struct gve_priv *priv)
+{
+ int err;
+
+ err = gve_alloc_counter_array(priv);
+ if (err)
+ return err;
+ err = gve_alloc_notify_blocks(priv);
+ if (err)
+ goto abort_with_counter;
+ err = gve_alloc_stats_report(priv);
+ if (err)
+ goto abort_with_ntfy_blocks;
+ err = gve_adminq_configure_device_resources(priv,
+ priv->counter_array_bus,
+ priv->num_event_counters,
+ priv->irq_db_indices_bus,
+ priv->num_ntfy_blks);
+ if (unlikely(err)) {
+ dev_err(&priv->pdev->dev,
+ "could not setup device_resources: err=%d\n", err);
+ err = -ENXIO;
+ goto abort_with_stats_report;
+ }
+ err = gve_adminq_report_stats(priv, priv->stats_report_len,
+ priv->stats_report_bus,
+ GVE_SERVICE_TIMER_PERIOD);
+ if (err)
+ dev_err(&priv->pdev->dev,
+ "Failed to report stats: err=%d\n", err);
+ gve_set_device_resources_ok(priv);
+ return 0;
+abort_with_stats_report:
+ gve_free_stats_report(priv);
+abort_with_ntfy_blocks:
+ gve_free_notify_blocks(priv);
+abort_with_counter:
+ gve_free_counter_array(priv);
+ return err;
+}
+
+static void gve_trigger_reset(struct gve_priv *priv);
+
+static void gve_teardown_device_resources(struct gve_priv *priv)
+{
+ int err;
+
+ /* Tell device its resources are being freed */
+ if (gve_get_device_resources_ok(priv)) {
+ /* detach the stats report */
+ err = gve_adminq_report_stats(priv, 0, 0x0,
+ GVE_SERVICE_TIMER_PERIOD);
+ if (err) {
+ dev_err(&priv->pdev->dev,
+ "Failed to detach stats report: err=%d\n", err);
+ gve_trigger_reset(priv);
+ }
+ err = gve_adminq_deconfigure_device_resources(priv);
+ if (err) {
+ dev_err(&priv->pdev->dev,
+ "Could not deconfigure device resources: err=%d\n",
+ err);
+ gve_trigger_reset(priv);
+ }
+ }
+ gve_free_counter_array(priv);
+ gve_free_notify_blocks(priv);
+ gve_free_stats_report(priv);
+ gve_clear_device_resources_ok(priv);
+}
+
+static void gve_add_napi(struct gve_priv *priv, int ntfy_idx)
+{
+ struct gve_notify_block *block = &priv->ntfy_blocks[ntfy_idx];
+
+ netif_napi_add(priv->dev, &block->napi, gve_napi_poll,
+ NAPI_POLL_WEIGHT);
+#if LINUX_VERSION_CODE < KERNEL_VERSION(4,5,0) && LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0)
+ napi_hash_add(&block->napi);
+#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(4,5,0) && LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0) */
+}
+
+static void gve_remove_napi(struct gve_priv *priv, int ntfy_idx)
+{
+ struct gve_notify_block *block = &priv->ntfy_blocks[ntfy_idx];
+
+#if LINUX_VERSION_CODE < KERNEL_VERSION(4,5,0) && LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0)
+ napi_hash_del(&block->napi);
+ synchronize_net();
+#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(4,5,0) && LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0) */
+ netif_napi_del(&block->napi);
+}
+
+static int gve_register_qpls(struct gve_priv *priv)
+{
+ int num_qpls = gve_num_tx_qpls(priv) + gve_num_rx_qpls(priv);
+ int err;
+ int i;
+
+ for (i = 0; i < num_qpls; i++) {
+ err = gve_adminq_register_page_list(priv, &priv->qpls[i]);
+ if (err) {
+ netif_err(priv, drv, priv->dev,
+ "failed to register queue page list %d\n",
+ priv->qpls[i].id);
+ /* This failure will trigger a reset - no need to clean
+ * up
+ */
+ return err;
+ }
+ }
+ return 0;
+}
+
+static int gve_unregister_qpls(struct gve_priv *priv)
+{
+ int num_qpls = gve_num_tx_qpls(priv) + gve_num_rx_qpls(priv);
+ int err;
+ int i;
+
+ for (i = 0; i < num_qpls; i++) {
+ err = gve_adminq_unregister_page_list(priv, priv->qpls[i].id);
+ /* This failure will trigger a reset - no need to clean up */
+ if (err) {
+ netif_err(priv, drv, priv->dev,
+ "Failed to unregister queue page list %d\n",
+ priv->qpls[i].id);
+ return err;
+ }
+ }
+ return 0;
+}
+
+static int gve_create_rings(struct gve_priv *priv)
+{
+ int err;
+ int i;
+
+ err = gve_adminq_create_tx_queues(priv, priv->tx_cfg.num_queues);
+ if (err) {
+ netif_err(priv, drv, priv->dev, "failed to create %d tx queues\n",
+ priv->tx_cfg.num_queues);
+ /* This failure will trigger a reset - no need to clean
+ * up
+ */
+ return err;
+ }
+ netif_dbg(priv, drv, priv->dev, "created %d tx queues \n",
+ priv->tx_cfg.num_queues);
+
+ err = gve_adminq_create_rx_queues(priv, priv->rx_cfg.num_queues);
+ if (err) {
+ netif_err(priv, drv, priv->dev, "failed to create %d rx queues\n",
+ priv->rx_cfg.num_queues);
+ /* This failure will trigger a reset - no need to clean
+ * up
+ */
+ return err;
+ }
+ netif_dbg(priv, drv, priv->dev, "created %d rx queues \n",
+ priv->rx_cfg.num_queues);
+
+ /* Rx data ring has been prefilled with packet buffers at queue
+ * allocation time.
+ * Write the doorbell to provide descriptor slots and packet buffers
+ * to the NIC.
+ */
+ for (i = 0; i < priv->rx_cfg.num_queues; i++) {
+ gve_rx_write_doorbell(priv, &priv->rx[i]);
+ }
+
+ return 0;
+}
+
+static int gve_alloc_rings(struct gve_priv *priv)
+{
+ int ntfy_idx;
+ int err;
+ int i;
+
+ /* Setup tx rings */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0)
+ priv->tx = kvzalloc(priv->tx_cfg.num_queues * sizeof(*priv->tx),
+ GFP_KERNEL);
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ priv->tx = kcalloc(priv->tx_cfg.num_queues, sizeof(*priv->tx),
+ GFP_KERNEL);
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ if (!priv->tx)
+ return -ENOMEM;
+ err = gve_tx_alloc_rings(priv);
+ if (err)
+ goto free_tx;
+ /* Setup rx rings */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0)
+ priv->rx = kvzalloc(priv->rx_cfg.num_queues * sizeof(*priv->rx),
+ GFP_KERNEL);
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ priv->rx = kcalloc(priv->rx_cfg.num_queues, sizeof(*priv->rx),
+ GFP_KERNEL);
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ if (!priv->rx) {
+ err = -ENOMEM;
+ goto free_tx_queue;
+ }
+ err = gve_rx_alloc_rings(priv);
+ if (err)
+ goto free_rx;
+ /* Add tx napi & init sync stats*/
+ for (i = 0; i < priv->tx_cfg.num_queues; i++) {
+ u64_stats_init(&priv->tx[i].statss);
+ ntfy_idx = gve_tx_idx_to_ntfy(priv, i);
+ gve_add_napi(priv, ntfy_idx);
+ }
+ /* Add rx napi & init sync stats*/
+ for (i = 0; i < priv->rx_cfg.num_queues; i++) {
+ u64_stats_init(&priv->rx[i].statss);
+ ntfy_idx = gve_rx_idx_to_ntfy(priv, i);
+ gve_add_napi(priv, ntfy_idx);
+ }
+
+ return 0;
+
+free_rx:
+ kfree(priv->rx);
+ priv->rx = NULL;
+free_tx_queue:
+ gve_tx_free_rings(priv);
+free_tx:
+ kfree(priv->tx);
+ priv->tx = NULL;
+ return err;
+}
+
+static int gve_destroy_rings(struct gve_priv *priv)
+{
+ int err;
+
+ err = gve_adminq_destroy_tx_queues(priv, priv->tx_cfg.num_queues);
+ if (err) {
+ netif_err(priv, drv, priv->dev,
+ "failed to destroy tx queues\n");
+ /* This failure will trigger a reset - no need to clean up */
+ return err;
+ }
+ netif_dbg(priv, drv, priv->dev, "destroyed tx queues\n");
+ err = gve_adminq_destroy_rx_queues(priv, priv->rx_cfg.num_queues);
+ if (err) {
+ netif_err(priv, drv, priv->dev,
+ "failed to destroy rx queues\n");
+ /* This failure will trigger a reset - no need to clean up */
+ return err;
+ }
+ netif_dbg(priv, drv, priv->dev, "destroyed rx queues\n");
+ return 0;
+}
+
+static void gve_free_rings(struct gve_priv *priv)
+{
+ int ntfy_idx;
+ int i;
+
+ if (priv->tx) {
+ for (i = 0; i < priv->tx_cfg.num_queues; i++) {
+ ntfy_idx = gve_tx_idx_to_ntfy(priv, i);
+ gve_remove_napi(priv, ntfy_idx);
+ }
+ gve_tx_free_rings(priv);
+ kfree(priv->tx);
+ priv->tx = NULL;
+ }
+ if (priv->rx) {
+ for (i = 0; i < priv->rx_cfg.num_queues; i++) {
+ ntfy_idx = gve_rx_idx_to_ntfy(priv, i);
+ gve_remove_napi(priv, ntfy_idx);
+ }
+ gve_rx_free_rings(priv);
+ kfree(priv->rx);
+ priv->rx = NULL;
+ }
+}
+
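+/* Allocate a page and DMA-map it, constraining the allocation zone to the
+ * device's DMA addressing limit.
+ */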
+int gve_alloc_page(struct gve_priv *priv, struct device *dev,
+ struct page **page, dma_addr_t *dma,
+ enum dma_data_direction dir, gfp_t gfp_flags)
+{
+ if (priv->dma_mask == 24)
+ gfp_flags |= GFP_DMA;
+ else if (priv->dma_mask == 32)
+ gfp_flags |= GFP_DMA32;
+
+ *page = alloc_page(gfp_flags);
+ if (!*page) {
+ priv->page_alloc_fail++;
+ return -ENOMEM;
+ }
+ *dma = dma_map_page(dev, *page, 0, PAGE_SIZE, dir);
+ if (dma_mapping_error(dev, *dma)) {
+ priv->dma_mapping_error++;
+ put_page(*page);
+ *page = NULL;
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+static int gve_alloc_queue_page_list(struct gve_priv *priv, u32 id,
+ int pages)
+{
+ struct gve_queue_page_list *qpl = &priv->qpls[id];
+ int err;
+ int i;
+
+ if (pages + priv->num_registered_pages > priv->max_registered_pages) {
+ netif_err(priv, drv, priv->dev,
+ "Reached max number of registered pages %llu > %llu\n",
+ pages + priv->num_registered_pages,
+ priv->max_registered_pages);
+ return -EINVAL;
+ }
+
+ qpl->id = id;
+ qpl->num_entries = 0;
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0)
+ qpl->pages = kvzalloc(pages * sizeof(*qpl->pages), GFP_KERNEL);
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ qpl->pages = kcalloc(pages, sizeof(*qpl->pages), GFP_KERNEL);
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ /* caller handles clean up */
+ if (!qpl->pages)
+ return -ENOMEM;
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0)
+ qpl->page_buses = kvzalloc(pages * sizeof(*qpl->page_buses),
+ GFP_KERNEL);
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ qpl->page_buses = kcalloc(pages, sizeof(*qpl->page_buses), GFP_KERNEL);
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ /* caller handles clean up */
+ if (!qpl->page_buses)
+ return -ENOMEM;
+
+ for (i = 0; i < pages; i++) {
+ err = gve_alloc_page(priv, &priv->pdev->dev, &qpl->pages[i],
+ &qpl->page_buses[i],
+ gve_qpl_dma_dir(priv, id), GFP_KERNEL);
+ /* caller handles clean up */
+ if (err)
+ return -ENOMEM;
+ qpl->num_entries++;
+ }
+ priv->num_registered_pages += pages;
+
+ return 0;
+}
+
+void gve_free_page(struct device *dev, struct page *page, dma_addr_t dma,
+ enum dma_data_direction dir)
+{
+ if (!dma_mapping_error(dev, dma))
+ dma_unmap_page(dev, dma, PAGE_SIZE, dir);
+ if (page)
+ put_page(page);
+}
+
+static void gve_free_queue_page_list(struct gve_priv *priv,
+ int id)
+{
+ struct gve_queue_page_list *qpl = &priv->qpls[id];
+ int i;
+
+ if (!qpl->pages)
+ return;
+ if (!qpl->page_buses)
+ goto free_pages;
+
+ for (i = 0; i < qpl->num_entries; i++)
+ gve_free_page(&priv->pdev->dev, qpl->pages[i],
+ qpl->page_buses[i], gve_qpl_dma_dir(priv, id));
+
+ kfree(qpl->page_buses);
+free_pages:
+ kfree(qpl->pages);
+ priv->num_registered_pages -= qpl->num_entries;
+}
+
+static int gve_alloc_qpls(struct gve_priv *priv)
+{
+ int num_qpls = gve_num_tx_qpls(priv) + gve_num_rx_qpls(priv);
+ int i, j;
+ int err;
+
+ /* Raw addressing means no QPLs */
+ if (priv->raw_addressing)
+ return 0;
+
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0)
+ priv->qpls = kvzalloc(num_qpls * sizeof(*priv->qpls), GFP_KERNEL);
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ priv->qpls = kcalloc(num_qpls, sizeof(*priv->qpls), GFP_KERNEL);
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ if (!priv->qpls)
+ return -ENOMEM;
+
+ for (i = 0; i < gve_num_tx_qpls(priv); i++) {
+ err = gve_alloc_queue_page_list(priv, i,
+ priv->tx_pages_per_qpl);
+ if (err)
+ goto free_qpls;
+ }
+ for (; i < num_qpls; i++) {
+ err = gve_alloc_queue_page_list(priv, i,
+ priv->rx_data_slot_cnt);
+ if (err)
+ goto free_qpls;
+ }
+
+ priv->qpl_cfg.qpl_map_size = BITS_TO_LONGS(num_qpls) *
+ sizeof(unsigned long) * BITS_PER_BYTE;
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0)
+ priv->qpl_cfg.qpl_id_map = kvzalloc(BITS_TO_LONGS(num_qpls) *
+ sizeof(unsigned long), GFP_KERNEL);
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ priv->qpl_cfg.qpl_id_map = kcalloc(BITS_TO_LONGS(num_qpls),
+ sizeof(unsigned long), GFP_KERNEL);
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ if (!priv->qpl_cfg.qpl_id_map)
+ goto free_qpls;
+
+ return 0;
+
+free_qpls:
+ for (j = 0; j <= i; j++)
+ gve_free_queue_page_list(priv, j);
+ kfree(priv->qpls);
+ return err;
+}
+
+static void gve_free_qpls(struct gve_priv *priv)
+{
+ int num_qpls = gve_num_tx_qpls(priv) + gve_num_rx_qpls(priv);
+ int i;
+
+ /* Raw addressing means no QPLs */
+ if (priv->raw_addressing)
+ return;
+
+ kfree(priv->qpl_cfg.qpl_id_map);
+
+ for (i = 0; i < num_qpls; i++)
+ gve_free_queue_page_list(priv, i);
+
+ kfree(priv->qpls);
+}
+
+/* Use this to schedule a reset when the device is capable of continuing
+ * to handle other requests in its current state. If it is not, do a reset
+ * in thread instead.
+ */
+void gve_schedule_reset(struct gve_priv *priv)
+{
+ gve_set_do_reset(priv);
+ queue_work(priv->gve_wq, &priv->service_task);
+}
+
+static void gve_reset_and_teardown(struct gve_priv *priv, bool was_up);
+static int gve_reset_recovery(struct gve_priv *priv, bool was_up);
+static void gve_turndown(struct gve_priv *priv);
+static void gve_turnup(struct gve_priv *priv);
+
+static int gve_open(struct net_device *dev)
+{
+ struct gve_priv *priv = netdev_priv(dev);
+ int err;
+
+ err = gve_alloc_qpls(priv);
+ if (err)
+ return err;
+ err = gve_alloc_rings(priv);
+ if (err)
+ goto free_qpls;
+
+ err = netif_set_real_num_tx_queues(dev, priv->tx_cfg.num_queues);
+ if (err)
+ goto free_rings;
+ err = netif_set_real_num_rx_queues(dev, priv->rx_cfg.num_queues);
+ if (err)
+ goto free_rings;
+
+ err = gve_register_qpls(priv);
+ if (err)
+ goto reset;
+ err = gve_create_rings(priv);
+ if (err)
+ goto reset;
+ gve_set_device_rings_ok(priv);
+
+ gve_turnup(priv);
+ queue_work(priv->gve_wq, &priv->service_task);
+ priv->interface_up_cnt++;
+ return 0;
+
+free_rings:
+ gve_free_rings(priv);
+free_qpls:
+ gve_free_qpls(priv);
+ return err;
+
+reset:
+ /* This must have been called from a reset due to the rtnl lock
+ * so just return at this point.
+ */
+ if (gve_get_reset_in_progress(priv))
+ return err;
+ /* Otherwise reset before returning */
+ gve_reset_and_teardown(priv, true);
+ /* if this fails there is nothing we can do so just ignore the return */
+ gve_reset_recovery(priv, false);
+ /* return the original error */
+ return err;
+}
+
+static int gve_close(struct net_device *dev)
+{
+ struct gve_priv *priv = netdev_priv(dev);
+ int err;
+
+ netif_carrier_off(dev);
+ if (gve_get_device_rings_ok(priv)) {
+ gve_turndown(priv);
+ err = gve_destroy_rings(priv);
+ if (err)
+ goto err;
+ err = gve_unregister_qpls(priv);
+ if (err)
+ goto err;
+ gve_clear_device_rings_ok(priv);
+ }
+
+ gve_free_rings(priv);
+ gve_free_qpls(priv);
+ priv->interface_down_cnt++;
+ return 0;
+
+err:
+ /* This must have been called from a reset due to the rtnl lock
+ * so just return at this point.
+ */
+ if (gve_get_reset_in_progress(priv))
+ return err;
+ /* Otherwise reset before returning */
+ gve_reset_and_teardown(priv, true);
+ return gve_reset_recovery(priv, false);
+}
+
+int gve_adjust_queues(struct gve_priv *priv,
+ struct gve_queue_config new_rx_config,
+ struct gve_queue_config new_tx_config)
+{
+ int err;
+
+ if (netif_carrier_ok(priv->dev)) {
+ /* To make this process as simple as possible we teardown the
+ * device, set the new configuration, and then bring the device
+ * up again.
+ */
+ err = gve_close(priv->dev);
+ /* we have already tried to reset in close,
+ * just fail at this point
+ */
+ if (err)
+ return err;
+ priv->tx_cfg = new_tx_config;
+ priv->rx_cfg = new_rx_config;
+
+ err = gve_open(priv->dev);
+ if (err)
+ goto err;
+
+ return 0;
+ }
+ /* Set the config for the next up. */
+ priv->tx_cfg = new_tx_config;
+ priv->rx_cfg = new_rx_config;
+
+ return 0;
+err:
+ netif_err(priv, drv, priv->dev,
+ "Adjust queues failed! !!! DISABLING ALL QUEUES !!!\n");
+ gve_turndown(priv);
+ return err;
+}
+
+static void gve_turndown(struct gve_priv *priv)
+{
+ int idx;
+
+ if (netif_carrier_ok(priv->dev))
+ netif_carrier_off(priv->dev);
+
+ if (!gve_get_napi_enabled(priv))
+ return;
+
+ /* Disable napi to prevent more work from coming in */
+ for (idx = 0; idx < priv->tx_cfg.num_queues; idx++) {
+ int ntfy_idx = gve_tx_idx_to_ntfy(priv, idx);
+ struct gve_notify_block *block = &priv->ntfy_blocks[ntfy_idx];
+
+ napi_disable(&block->napi);
+ }
+ for (idx = 0; idx < priv->rx_cfg.num_queues; idx++) {
+ int ntfy_idx = gve_rx_idx_to_ntfy(priv, idx);
+ struct gve_notify_block *block = &priv->ntfy_blocks[ntfy_idx];
+
+ napi_disable(&block->napi);
+ }
+
+ /* Stop tx queues */
+ netif_tx_disable(priv->dev);
+
+ gve_clear_napi_enabled(priv);
+ gve_clear_report_stats(priv);
+}
+
+static void gve_turnup(struct gve_priv *priv)
+{
+ int idx;
+
+ /* Start the tx queues */
+ netif_tx_start_all_queues(priv->dev);
+
+ /* Enable napi and unmask interrupts for all queues */
+ for (idx = 0; idx < priv->tx_cfg.num_queues; idx++) {
+ int ntfy_idx = gve_tx_idx_to_ntfy(priv, idx);
+ struct gve_notify_block *block = &priv->ntfy_blocks[ntfy_idx];
+
+ napi_enable(&block->napi);
+ iowrite32be(0, gve_irq_doorbell(priv, block));
+ }
+ for (idx = 0; idx < priv->rx_cfg.num_queues; idx++) {
+ int ntfy_idx = gve_rx_idx_to_ntfy(priv, idx);
+ struct gve_notify_block *block = &priv->ntfy_blocks[ntfy_idx];
+
+ napi_enable(&block->napi);
+ iowrite32be(0, gve_irq_doorbell(priv, block));
+ }
+
+ gve_set_napi_enabled(priv);
+}
+
+static void gve_tx_timeout(struct net_device *dev)
+{
+ struct gve_priv *priv = netdev_priv(dev);
+
+ gve_schedule_reset(priv);
+ priv->tx_timeo_cnt++;
+}
+
+#if (LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0))
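+/* Kernels before 4.10 have no net_device min_mtu/max_mtu fields, so the MTU
+ * range check has to live in the ndo_change_mtu callback itself.
+ */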
+int gve_change_mtu(struct net_device *dev, int new_mtu)
+{
+ struct gve_priv *priv = netdev_priv(dev);
+
+ if (new_mtu < ETH_MIN_MTU || new_mtu > priv->max_mtu)
+ return -EINVAL;
+ dev->mtu = new_mtu;
+ return 0;
+}
+#endif /* (LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0)) */
+
+static const struct net_device_ops gve_netdev_ops = {
+#if (LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0))
+#if RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(7, 5) && RHEL_RELEASE_CODE < RHEL_RELEASE_VERSION(8, 0)
+ .ndo_change_mtu_rh74 = gve_change_mtu,
+#else /* RHEL_RELEASE_CODE < RHEL_RELEASE_VERSION(7, 5) || RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(8, 0) */
+
+ .ndo_change_mtu = gve_change_mtu,
+#endif /* RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(7, 5) && RHEL_RELEASE_CODE < RHEL_RELEASE_VERSION(8, 0) */
+#endif /* (LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0)) */
+
+ .ndo_start_xmit = gve_tx,
+ .ndo_open = gve_open,
+ .ndo_stop = gve_close,
+#if (LINUX_VERSION_CODE < KERNEL_VERSION(4,11,0))
+ .ndo_get_stats64 = backport_gve_get_stats,
+#else /* LINUX_VERSION_CODE < KERNEL_VERSION(4,11,0) */
+
+ .ndo_get_stats64 = gve_get_stats,
+#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(4,11,0) */
+ .ndo_tx_timeout = gve_tx_timeout,
+};
+
+static void gve_handle_status(struct gve_priv *priv, u32 status)
+{
+ if (GVE_DEVICE_STATUS_RESET_MASK & status) {
+ dev_info(&priv->pdev->dev, "Device requested reset.\n");
+ gve_set_do_reset(priv);
+ }
+ if (GVE_DEVICE_STATUS_REPORT_STATS_MASK & status) {
+ dev_info(&priv->pdev->dev, "Device report stats on.\n");
+ gve_set_do_report_stats(priv);
+ }
+}
+
+static void gve_handle_reset(struct gve_priv *priv)
+{
+ /* A service task will be scheduled at the end of probe to catch any
+ * resets that need to happen, and we don't want to reset until
+ * probe is done.
+ */
+ if (gve_get_probe_in_progress(priv))
+ return;
+
+ if (gve_get_do_reset(priv)) {
+ rtnl_lock();
+ gve_reset(priv, false);
+ rtnl_unlock();
+ }
+}
+
+void gve_handle_report_stats(struct gve_priv *priv)
+{
+ int idx, stats_idx = 0, tx_bytes;
+ unsigned int start = 0;
+ struct stats *stats = priv->stats_report->stats;
+
+ if (!gve_get_report_stats(priv))
+ return;
+
+ be64_add_cpu(&priv->stats_report->written_count, 1);
+ /* tx stats */
+ if (priv->tx) {
+ for (idx = 0; idx < priv->tx_cfg.num_queues; idx++) {
+ do {
+ start = u64_stats_fetch_begin(&priv->tx[idx].statss);
+ tx_bytes = priv->tx[idx].bytes_done;
+ } while (u64_stats_fetch_retry(&priv->tx[idx].statss, start));
+ stats[stats_idx++] = (struct stats) {
+ .stat_name = cpu_to_be32(TX_WAKE_CNT),
+ .value = cpu_to_be64(priv->tx[idx].wake_queue),
+ .queue_id = cpu_to_be32(idx),
+ };
+ stats[stats_idx++] = (struct stats) {
+ .stat_name = cpu_to_be32(TX_STOP_CNT),
+ .value = cpu_to_be64(priv->tx[idx].stop_queue),
+ .queue_id = cpu_to_be32(idx),
+ };
+ stats[stats_idx++] = (struct stats) {
+ .stat_name = cpu_to_be32(TX_FRAMES_SENT),
+ .value = cpu_to_be64(priv->tx[idx].req),
+ .queue_id = cpu_to_be32(idx),
+ };
+ stats[stats_idx++] = (struct stats) {
+ .stat_name = cpu_to_be32(TX_BYTES_SENT),
+ .value = cpu_to_be64(tx_bytes),
+ .queue_id = cpu_to_be32(idx),
+ };
+ stats[stats_idx++] = (struct stats) {
+ .stat_name = cpu_to_be32(
+ TX_LAST_COMPLETION_PROCESSED),
+ .value = cpu_to_be64(priv->tx[idx].done),
+ .queue_id = cpu_to_be32(idx),
+ };
+ }
+ }
+ /* rx stats */
+ if (priv->rx) {
+ for (idx = 0; idx < priv->rx_cfg.num_queues; idx++) {
+ stats[stats_idx++] = (struct stats) {
+ .stat_name = cpu_to_be32(
+ RX_NEXT_EXPECTED_SEQUENCE),
+ .value = cpu_to_be64(priv->rx[idx].desc.seqno),
+ .queue_id = cpu_to_be32(idx),
+ };
+ stats[stats_idx++] = (struct stats) {
+ .stat_name = cpu_to_be32(RX_BUFFERS_POSTED),
+ .value = cpu_to_be64(priv->rx[idx].fill_cnt),
+ .queue_id = cpu_to_be32(idx),
+ };
+ }
+ }
+}
+
+void gve_handle_link_status(struct gve_priv *priv, bool link_status)
+{
+ if (!gve_get_napi_enabled(priv))
+ return;
+
+ if (link_status == netif_carrier_ok(priv->dev))
+ return;
+
+ if (link_status) {
+ netif_carrier_on(priv->dev);
+ } else {
+ dev_info(&priv->pdev->dev, "Device link is down.\n");
+ netif_carrier_off(priv->dev);
+ }
+}
+
+/* Handle NIC status register changes, reset requests and report stats */
+static void gve_service_task(struct work_struct *work)
+{
+ struct gve_priv *priv = container_of(work, struct gve_priv,
+ service_task);
+ u32 status = ioread32be(&priv->reg_bar0->device_status);
+
+ gve_handle_status(priv, status);
+
+ gve_handle_reset(priv);
+ gve_handle_link_status(priv, GVE_DEVICE_STATUS_LINK_STATUS_MASK & status);
+ if (gve_get_do_report_stats(priv)) {
+ gve_handle_report_stats(priv);
+ gve_clear_do_report_stats(priv);
+ }
+}
+
+static int gve_init_priv(struct gve_priv *priv, bool skip_describe_device)
+{
+ int num_ntfy;
+ int err;
+
+ /* Set up the adminq */
+ err = gve_adminq_alloc(&priv->pdev->dev, priv);
+ if (err) {
+ dev_err(&priv->pdev->dev,
+ "Failed to alloc admin queue: err=%d\n", err);
+ return err;
+ }
+
+ if (skip_describe_device)
+ goto setup_device;
+
+ priv->raw_addressing = false;
+ /* Get the initial information we need from the device */
+ err = gve_adminq_describe_device(priv);
+ if (err) {
+ dev_err(&priv->pdev->dev,
+ "Could not get device information: err=%d\n", err);
+ goto err;
+ }
+#if (LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0))
+ if (priv->max_mtu > PAGE_SIZE)
+#else /* LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0) */
+ if (priv->dev->max_mtu > PAGE_SIZE)
+#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0) */
+ {
+#if (LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0))
+ priv->max_mtu = PAGE_SIZE;
+#else /* LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0) */
+ priv->dev->max_mtu = PAGE_SIZE;
+#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0) */
+ err = gve_adminq_set_mtu(priv, priv->dev->mtu);
+ if (err) {
+ dev_err(&priv->pdev->dev, "Could not set mtu");
+ goto err;
+ }
+ }
+#if (LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0))
+ priv->dev->mtu = priv->max_mtu;
+#else /* LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0) */
+ priv->dev->mtu = priv->dev->max_mtu;
+#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0) */
+ num_ntfy = pci_msix_vec_count(priv->pdev);
+ if (num_ntfy <= 0) {
+ dev_err(&priv->pdev->dev,
+ "could not count MSI-x vectors: err=%d\n", num_ntfy);
+ err = num_ntfy;
+ goto err;
+ } else if (num_ntfy < GVE_MIN_MSIX) {
+ dev_err(&priv->pdev->dev, "gve needs at least %d MSI-x vectors, but only has %d\n",
+ GVE_MIN_MSIX, num_ntfy);
+ err = -EINVAL;
+ goto err;
+ }
+
+ priv->num_registered_pages = 0;
+ priv->rx_copybreak = GVE_DEFAULT_RX_COPYBREAK;
+ /* gvnic has one Notification Block per MSI-x vector, except for the
+ * management vector
+ */
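+ /* Round the block count down to an even number so TX and RX can each
+ * claim half of the notification blocks.
+ */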
+ priv->num_ntfy_blks = (num_ntfy - 1) & ~0x1;
+ priv->mgmt_msix_idx = priv->num_ntfy_blks;
+
+ priv->tx_cfg.max_queues =
+ min_t(int, priv->tx_cfg.max_queues, priv->num_ntfy_blks / 2);
+ priv->rx_cfg.max_queues =
+ min_t(int, priv->rx_cfg.max_queues, priv->num_ntfy_blks / 2);
+
+ priv->tx_cfg.num_queues = priv->tx_cfg.max_queues;
+ priv->rx_cfg.num_queues = priv->rx_cfg.max_queues;
+ if (priv->default_num_queues > 0) {
+ priv->tx_cfg.num_queues = min_t(int, priv->default_num_queues,
+ priv->tx_cfg.num_queues);
+ priv->rx_cfg.num_queues = min_t(int, priv->default_num_queues,
+ priv->rx_cfg.num_queues);
+ }
+
+ dev_info(&priv->pdev->dev, "TX queues %d, RX queues %d\n",
+ priv->tx_cfg.num_queues, priv->rx_cfg.num_queues);
+ dev_info(&priv->pdev->dev, "Max TX queues %d, Max RX queues %d\n",
+ priv->tx_cfg.max_queues, priv->rx_cfg.max_queues);
+
+setup_device:
+ err = gve_setup_device_resources(priv);
+ if (!err)
+ return 0;
+err:
+ gve_adminq_free(&priv->pdev->dev, priv);
+ return err;
+}
+
+static void gve_teardown_priv_resources(struct gve_priv *priv)
+{
+ gve_teardown_device_resources(priv);
+ gve_adminq_free(&priv->pdev->dev, priv);
+}
+
+static void gve_trigger_reset(struct gve_priv *priv)
+{
+ /* Reset the device by releasing the AQ */
+ gve_adminq_release(priv);
+}
+
+static void gve_reset_and_teardown(struct gve_priv *priv, bool was_up)
+{
+ gve_trigger_reset(priv);
+ /* With the reset having already happened, close cannot fail */
+ if (was_up)
+ gve_close(priv->dev);
+ gve_teardown_priv_resources(priv);
+}
+
+static int gve_reset_recovery(struct gve_priv *priv, bool was_up)
+{
+ int err;
+
+ err = gve_init_priv(priv, true);
+ if (err)
+ goto err;
+ if (was_up) {
+ err = gve_open(priv->dev);
+ if (err)
+ goto err;
+ }
+ return 0;
+err:
+ dev_err(&priv->pdev->dev, "Reset failed! !!! DISABLING ALL QUEUES !!!\n");
+ gve_turndown(priv);
+ return err;
+}
+
+int gve_reset(struct gve_priv *priv, bool attempt_teardown)
+{
+ bool was_up = netif_carrier_ok(priv->dev);
+ int err;
+
+ dev_info(&priv->pdev->dev, "Performing reset\n");
+ gve_clear_do_reset(priv);
+ gve_set_reset_in_progress(priv);
+ /* If we aren't attempting to teardown normally, just go turndown and
+ * reset right away.
+ */
+ if (!attempt_teardown) {
+ gve_turndown(priv);
+ gve_reset_and_teardown(priv, was_up);
+ } else {
+ /* Otherwise attempt to close normally */
+ if (was_up) {
+ err = gve_close(priv->dev);
+ /* If that fails reset as we did above */
+ if (err)
+ gve_reset_and_teardown(priv, was_up);
+ }
+ /* Clean up any remaining resources */
+ gve_teardown_priv_resources(priv);
+ }
+
+ /* Set it all back up */
+ err = gve_reset_recovery(priv, was_up);
+ gve_clear_reset_in_progress(priv);
+ priv->reset_cnt++;
+ priv->interface_up_cnt = 0;
+ priv->interface_down_cnt = 0;
+ return err;
+}
+
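+/* The driver version is reported by streaming it into the single
+ * driver_version register one byte at a time: prefix first, then the
+ * version string, then a terminating newline.
+ */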
+static void gve_write_version(u8 __iomem *driver_version_register)
+{
+ const char *c = gve_version_prefix;
+
+ while (*c) {
+ writeb(*c, driver_version_register);
+ c++;
+ }
+
+ c = gve_version_str;
+ while (*c) {
+ writeb(*c, driver_version_register);
+ c++;
+ }
+ writeb('\n', driver_version_register);
+}
+
+static int gve_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
+{
+ int max_tx_queues, max_rx_queues;
+ struct net_device *dev;
+ __be32 __iomem *db_bar;
+ struct gve_registers __iomem *reg_bar;
+ struct gve_priv *priv;
+ u8 dma_mask;
+ int err;
+
+ err = pci_enable_device(pdev);
+ if (err)
+ return -ENXIO;
+
+ err = pci_request_regions(pdev, "gvnic-cfg");
+ if (err)
+ goto abort_with_enabled;
+
+ pci_set_master(pdev);
+
+ reg_bar = pci_iomap(pdev, GVE_REGISTER_BAR, 0);
+ if (!reg_bar) {
+ dev_err(&pdev->dev, "Failed to map pci bar!\n");
+ err = -ENOMEM;
+ goto abort_with_pci_region;
+ }
+
+ db_bar = pci_iomap(pdev, GVE_DOORBELL_BAR, 0);
+ if (!db_bar) {
+ dev_err(&pdev->dev, "Failed to map doorbell bar!\n");
+ err = -ENOMEM;
+ goto abort_with_reg_bar;
+ }
+
+ dma_mask = readb(&reg_bar->dma_mask);
+ /* Default to 64 if the register isn't set */
+ if (!dma_mask)
+ dma_mask = 64;
+ gve_write_version(&reg_bar->driver_version);
+ /* Get max queues to alloc etherdev */
+ max_tx_queues = ioread32be(&reg_bar->max_tx_queues);
+ max_rx_queues = ioread32be(&reg_bar->max_rx_queues);
+
+ err = pci_set_dma_mask(pdev, DMA_BIT_MASK(dma_mask));
+ if (err) {
+ dev_err(&pdev->dev, "Failed to set dma mask: err=%d\n", err);
+ goto abort_with_reg_bar;
+ }
+
+ err = pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(dma_mask));
+ if (err) {
+ dev_err(&pdev->dev,
+ "Failed to set consistent dma mask: err=%d\n", err);
+ goto abort_with_reg_bar;
+ }
+
+ /* Alloc and setup the netdev and priv */
+ dev = alloc_etherdev_mqs(sizeof(*priv), max_tx_queues, max_rx_queues);
+ if (!dev) {
+ dev_err(&pdev->dev, "could not allocate netdev\n");
+ goto abort_with_db_bar;
+ }
+ SET_NETDEV_DEV(dev, &pdev->dev);
+
+ pci_set_drvdata(pdev, dev);
+
+ dev->ethtool_ops = &gve_ethtool_ops;
+ dev->netdev_ops = &gve_netdev_ops;
+ /* advertise features */
+ dev->hw_features = NETIF_F_HIGHDMA;
+ dev->hw_features |= NETIF_F_SG;
+ dev->hw_features |= NETIF_F_HW_CSUM;
+ dev->hw_features |= NETIF_F_TSO;
+ dev->hw_features |= NETIF_F_TSO6;
+ dev->hw_features |= NETIF_F_TSO_ECN;
+ dev->hw_features |= NETIF_F_RXCSUM;
+ dev->hw_features |= NETIF_F_RXHASH;
+ dev->features = dev->hw_features;
+ dev->watchdog_timeo = 5 * HZ;
+#if (LINUX_VERSION_CODE >= KERNEL_VERSION(4,10,0))
+ dev->min_mtu = ETH_MIN_MTU;
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,10,0) */
+ netif_carrier_off(dev);
+
+ priv = netdev_priv(dev);
+ priv->dev = dev;
+ priv->pdev = pdev;
+ priv->msg_enable = DEFAULT_MSG_LEVEL;
+ priv->reg_bar0 = reg_bar;
+ priv->db_bar2 = db_bar;
+ priv->service_task_flags = 0x0;
+ priv->state_flags = 0x0;
+ priv->ethtool_flags = 0x0;
+ priv->dma_mask = dma_mask;
+
+ gve_set_probe_in_progress(priv);
+
+ priv->gve_wq = alloc_ordered_workqueue("gve", 0);
+ if (!priv->gve_wq) {
+ dev_err(&pdev->dev, "Could not allocate workqueue");
+ err = -ENOMEM;
+ goto abort_with_netdev;
+ }
+ INIT_WORK(&priv->service_task, gve_service_task);
+ priv->tx_cfg.max_queues = max_tx_queues;
+ priv->rx_cfg.max_queues = max_rx_queues;
+
+ err = gve_init_priv(priv, false);
+ if (err)
+ goto abort_with_wq;
+
+ err = register_netdev(dev);
+ if (err)
+ goto abort_with_wq;
+
+ dev_info(&pdev->dev, "GVE version %s\n", gve_version_str);
+ gve_clear_probe_in_progress(priv);
+ queue_work(priv->gve_wq, &priv->service_task);
+
+ return 0;
+
+abort_with_wq:
+ destroy_workqueue(priv->gve_wq);
+
+abort_with_netdev:
+ free_netdev(dev);
+
+abort_with_db_bar:
+ pci_iounmap(pdev, db_bar);
+
+abort_with_reg_bar:
+ pci_iounmap(pdev, reg_bar);
+
+abort_with_pci_region:
+ pci_release_regions(pdev);
+
+abort_with_enabled:
+ pci_disable_device(pdev);
+ return -ENXIO;
+}
+EXPORT_SYMBOL(gve_probe);
+
+static void gve_remove(struct pci_dev *pdev)
+{
+ struct net_device *netdev = pci_get_drvdata(pdev);
+ struct gve_priv *priv = netdev_priv(netdev);
+ __be32 __iomem *db_bar = priv->db_bar2;
+ void __iomem *reg_bar = priv->reg_bar0;
+
+ unregister_netdev(netdev);
+ gve_teardown_priv_resources(priv);
+ destroy_workqueue(priv->gve_wq);
+ free_netdev(netdev);
+ pci_iounmap(pdev, db_bar);
+ pci_iounmap(pdev, reg_bar);
+ pci_release_regions(pdev);
+ pci_disable_device(pdev);
+}
+
+static const struct pci_device_id gve_id_table[] = {
+ { PCI_DEVICE(PCI_VENDOR_ID_GOOGLE, PCI_DEV_ID_GVNIC) },
+ { }
+};
+
+static struct pci_driver gvnic_driver = {
+ .name = "gvnic",
+ .id_table = gve_id_table,
+ .probe = gve_probe,
+ .remove = gve_remove,
+};
+
+module_pci_driver(gvnic_driver);
+
+MODULE_DEVICE_TABLE(pci, gve_id_table);
+MODULE_AUTHOR("Google, Inc.");
+MODULE_DESCRIPTION("gVNIC Driver");
+MODULE_LICENSE("Dual MIT/GPL");
+MODULE_VERSION(GVE_VERSION);
diff --git a/drivers/net/ethernet/google/gve/gve_register.h b/drivers/net/ethernet/google/gve/gve_register.h
new file mode 100644
index 0000000..776c291
--- /dev/null
+++ b/drivers/net/ethernet/google/gve/gve_register.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR MIT)
+ * Google virtual Ethernet (gve) driver
+ *
+ * Copyright (C) 2015-2019 Google, Inc.
+ */
+
+#ifndef _GVE_REGISTER_H_
+#define _GVE_REGISTER_H_
+
+/* Fixed Configuration Registers */
+struct gve_registers {
+ __be32 device_status;
+ __be32 driver_status;
+ __be32 max_tx_queues;
+ __be32 max_rx_queues;
+ __be32 adminq_pfn;
+ __be32 adminq_doorbell;
+ __be32 adminq_event_counter;
+ u8 reserved[2];
+ u8 dma_mask;
+ u8 driver_version;
+};
+
+enum gve_device_status_flags {
+ GVE_DEVICE_STATUS_RESET_MASK = BIT(1),
+ GVE_DEVICE_STATUS_LINK_STATUS_MASK = BIT(2),
+ GVE_DEVICE_STATUS_REPORT_STATS_MASK = BIT(3),
+};
+#endif /* _GVE_REGISTER_H_ */
diff --git a/drivers/net/ethernet/google/gve/gve_rx.c b/drivers/net/ethernet/google/gve/gve_rx.c
new file mode 100644
index 0000000..302f443
--- /dev/null
+++ b/drivers/net/ethernet/google/gve/gve_rx.c
@@ -0,0 +1,690 @@
+// SPDX-License-Identifier: (GPL-2.0 OR MIT)
+/* Google virtual Ethernet (gve) driver
+ *
+ * Copyright (C) 2015-2019 Google, Inc.
+ */
+
+#include "gve_linux_version.h"
+#include "gve.h"
+#include "gve_adminq.h"
+#include <linux/etherdevice.h>
+
+static void gve_rx_remove_from_block(struct gve_priv *priv, int queue_idx)
+{
+ struct gve_notify_block *block =
+ &priv->ntfy_blocks[gve_rx_idx_to_ntfy(priv, queue_idx)];
+
+ block->rx = NULL;
+}
+
+static void gve_rx_free_buffer(struct device *dev,
+ struct gve_rx_slot_page_info *page_info,
+ struct gve_rx_data_slot *data_slot)
+{
+ dma_addr_t dma = (dma_addr_t)(be64_to_cpu(data_slot->addr) -
+ page_info->page_offset);
+
+ page_ref_sub(page_info->page, page_info->pagecnt_bias - 1);
+ gve_free_page(dev, page_info->page, dma, DMA_FROM_DEVICE);
+}
+
+static void gve_rx_free_ring(struct gve_priv *priv, int idx)
+{
+ struct gve_rx_ring *rx = &priv->rx[idx];
+ struct device *dev = &priv->pdev->dev;
+ size_t bytes;
+ u32 slots = rx->mask + 1;
+
+ gve_rx_remove_from_block(priv, idx);
+
+ bytes = sizeof(struct gve_rx_desc) * priv->rx_desc_cnt;
+ dma_free_coherent(dev, bytes, rx->desc.desc_ring, rx->desc.bus);
+ rx->desc.desc_ring = NULL;
+
+ dma_free_coherent(dev, sizeof(*rx->q_resources),
+ rx->q_resources, rx->q_resources_bus);
+ rx->q_resources = NULL;
+
+ if (rx->data.raw_addressing) {
+ int i;
+
+ for (i = 0; i < slots; i++)
+ gve_rx_free_buffer(dev, &rx->data.page_info[i],
+ &rx->data.data_ring[i]);
+ } else {
+ gve_unassign_qpl(priv, rx->data.qpl->id);
+ rx->data.qpl = NULL;
+ }
+ kfree(rx->data.page_info);
+
+ bytes = sizeof(*rx->data.data_ring) * slots;
+ dma_free_coherent(dev, bytes, rx->data.data_ring,
+ rx->data.data_bus);
+ rx->data.data_ring = NULL;
+ netif_dbg(priv, drv, priv->dev, "freed rx ring %d\n", idx);
+}
+
+static void gve_setup_rx_buffer(struct gve_rx_slot_page_info *page_info,
+ struct gve_rx_data_slot *slot,
+ dma_addr_t addr, struct page *page)
+{
+ page_info->page = page;
+ page_info->page_offset = 0;
+ page_info->page_address = page_address(page);
+ slot->addr = cpu_to_be64(addr);
+ /* The page already holds one reference; take the rest up to INT_MAX and
+ * track the driver's share in pagecnt_bias so buffer reuse only needs a
+ * local decrement instead of an atomic page refcount update.
+ */
+ page_ref_add(page, INT_MAX - 1);
+ page_info->pagecnt_bias = INT_MAX;
+}
+
+static int gve_prefill_rx_pages(struct gve_rx_ring *rx)
+{
+ struct gve_priv *priv = rx->gve;
+ u32 slots;
+ int err;
+ int i;
+
+ /* Allocate one page per Rx queue slot. Each page is split into two
+ * packet buffers, when possible we "page flip" between the two.
+ */
+ slots = rx->mask + 1;
+
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0)
+ rx->data.page_info = kvzalloc(slots *
+ sizeof(*rx->data.page_info), GFP_KERNEL);
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ rx->data.page_info = kcalloc(slots, sizeof(*rx->data.page_info),
+ GFP_KERNEL);
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ if (!rx->data.page_info)
+ return -ENOMEM;
+
+ if (!rx->data.raw_addressing)
+ rx->data.qpl = gve_assign_rx_qpl(priv);
+ for (i = 0; i < slots; i++) {
+ struct page *page;
+ dma_addr_t addr;
+
+ if (rx->data.raw_addressing) {
+ err = gve_alloc_page(priv, &priv->pdev->dev, &page,
+ &addr, DMA_FROM_DEVICE,
+ GFP_KERNEL);
+ if (err) {
+ int j;
+
+ u64_stats_update_begin(&rx->statss);
+ rx->rx_buf_alloc_fail++;
+ u64_stats_update_end(&rx->statss);
+ /* Release the buffers that were already set up */
+ for (j = 0; j < i; j++)
+ gve_rx_free_buffer(&priv->pdev->dev,
+ &rx->data.page_info[j],
+ &rx->data.data_ring[j]);
+ return err;
+ }
+ } else {
+ page = rx->data.qpl->pages[i];
+ addr = i * PAGE_SIZE;
+ }
+ gve_setup_rx_buffer(&rx->data.page_info[i],
+ &rx->data.data_ring[i], addr, page);
+ }
+
+ return slots;
+}
+
+static void gve_rx_add_to_block(struct gve_priv *priv, int queue_idx)
+{
+ u32 ntfy_idx = gve_rx_idx_to_ntfy(priv, queue_idx);
+ struct gve_notify_block *block = &priv->ntfy_blocks[ntfy_idx];
+ struct gve_rx_ring *rx = &priv->rx[queue_idx];
+
+ block->rx = rx;
+ rx->ntfy_id = ntfy_idx;
+}
+
+static int gve_rx_alloc_ring(struct gve_priv *priv, int idx)
+{
+ struct gve_rx_ring *rx = &priv->rx[idx];
+ struct device *hdev = &priv->pdev->dev;
+ u32 slots, npages;
+ int filled_pages;
+ size_t bytes;
+ int err;
+
+ netif_dbg(priv, drv, priv->dev, "allocating rx ring\n");
+ /* Make sure everything is zeroed to start with */
+ memset(rx, 0, sizeof(*rx));
+
+ rx->gve = priv;
+ rx->q_num = idx;
+
+ slots = priv->rx_data_slot_cnt;
+ rx->mask = slots - 1;
+ rx->data.raw_addressing = priv->raw_addressing;
+
+ /* alloc rx data ring */
+ bytes = sizeof(*rx->data.data_ring) * slots;
+ rx->data.data_ring = dma_alloc_coherent(hdev, bytes,
+ &rx->data.data_bus,
+ GFP_KERNEL);
+ if (!rx->data.data_ring)
+ return -ENOMEM;
+ filled_pages = gve_prefill_rx_pages(rx);
+ if (filled_pages < 0) {
+ err = -ENOMEM;
+ goto abort_with_slots;
+ }
+ rx->fill_cnt = filled_pages;
+ /* Ensure data ring slots (packet buffers) are visible. */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0)
+ dma_wmb();
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0) */
+ wmb();
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0) */
+
+ /* Alloc gve_queue_resources */
+ rx->q_resources =
+ dma_alloc_coherent(hdev,
+ sizeof(*rx->q_resources),
+ &rx->q_resources_bus,
+ GFP_KERNEL);
+ if (!rx->q_resources) {
+ err = -ENOMEM;
+ goto abort_filled;
+ }
+ netif_dbg(priv, drv, priv->dev, "rx[%d]->data.data_bus=%lx\n", idx,
+ (unsigned long)rx->data.data_bus);
+
+ /* alloc rx desc ring */
+ bytes = sizeof(struct gve_rx_desc) * priv->rx_desc_cnt;
+ npages = bytes / PAGE_SIZE;
+ if (npages * PAGE_SIZE != bytes) {
+ err = -EIO;
+ goto abort_with_q_resources;
+ }
+
+ rx->desc.desc_ring = dma_alloc_coherent(hdev, bytes, &rx->desc.bus,
+ GFP_KERNEL);
+ if (!rx->desc.desc_ring) {
+ err = -ENOMEM;
+ goto abort_with_q_resources;
+ }
+ rx->cnt = 0;
+ rx->db_threshold = priv->rx_desc_cnt / 2;
+ rx->desc.seqno = 1;
+ gve_rx_add_to_block(priv, idx);
+
+ return 0;
+
+abort_with_q_resources:
+ dma_free_coherent(hdev, sizeof(*rx->q_resources),
+ rx->q_resources, rx->q_resources_bus);
+ rx->q_resources = NULL;
+abort_filled:
+ kfree(rx->data.page_info);
+abort_with_slots:
+ bytes = sizeof(*rx->data.data_ring) * slots;
+ dma_free_coherent(hdev, bytes, rx->data.data_ring, rx->data.data_bus);
+ rx->data.data_ring = NULL;
+
+ return err;
+}
+
+int gve_rx_alloc_rings(struct gve_priv *priv)
+{
+ int err = 0;
+ int i;
+
+ for (i = 0; i < priv->rx_cfg.num_queues; i++) {
+ err = gve_rx_alloc_ring(priv, i);
+ if (err) {
+ netif_err(priv, drv, priv->dev,
+ "Failed to alloc rx ring=%d: err=%d\n",
+ i, err);
+ break;
+ }
+ }
+ /* Unallocate if there was an error */
+ if (err) {
+ int j;
+
+ for (j = 0; j < i; j++)
+ gve_rx_free_ring(priv, j);
+ }
+ return err;
+}
+
+void gve_rx_free_rings(struct gve_priv *priv)
+{
+ int i;
+
+ for (i = 0; i < priv->rx_cfg.num_queues; i++)
+ gve_rx_free_ring(priv, i);
+}
+
+void gve_rx_write_doorbell(struct gve_priv *priv, struct gve_rx_ring *rx)
+{
+ u32 db_idx = be32_to_cpu(rx->q_resources->db_index);
+
+ iowrite32be(rx->fill_cnt, &priv->db_bar2[db_idx]);
+}
+
+#if RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(7, 0) || LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0)
+static enum pkt_hash_types gve_rss_type(__be16 pkt_flags)
+{
+ if (likely(pkt_flags & (GVE_RXF_TCP | GVE_RXF_UDP)))
+ return PKT_HASH_TYPE_L4;
+ if (pkt_flags & (GVE_RXF_IPV4 | GVE_RXF_IPV6))
+ return PKT_HASH_TYPE_L3;
+ return PKT_HASH_TYPE_L2;
+}
+#endif /* RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(7, 0) || LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0) */
+
+static struct sk_buff *gve_rx_copy(struct net_device *dev,
+ struct napi_struct *napi,
+ struct gve_rx_slot_page_info *page_info,
+ u16 len)
+{
+ struct sk_buff *skb = napi_alloc_skb(napi, len);
+ void *va = page_info->page_address + GVE_RX_PAD +
+ page_info->page_offset;
+
+ if (unlikely(!skb))
+ return NULL;
+
+ __skb_put(skb, len);
+
+ skb_copy_to_linear_data(skb, va, len);
+
+ skb->protocol = eth_type_trans(skb, dev);
+
+ return skb;
+}
+
+static struct sk_buff *gve_rx_add_frags(struct napi_struct *napi,
+ struct gve_rx_slot_page_info *page_info,
+ u16 len)
+{
+ struct sk_buff *skb = napi_get_frags(napi);
+
+ if (unlikely(!skb))
+ return NULL;
+
+ skb_add_rx_frag(skb, 0, page_info->page,
+ page_info->page_offset +
+ GVE_RX_PAD, len, PAGE_SIZE / 2);
+
+ return skb;
+}
+
+static int gve_rx_alloc_buffer(struct gve_priv *priv, struct device *dev,
+ struct gve_rx_slot_page_info *page_info,
+ struct gve_rx_data_slot *data_slot,
+ struct gve_rx_ring *rx)
+{
+ struct page *page;
+ dma_addr_t dma;
+ int err;
+
+ err = gve_alloc_page(priv, dev, &page, &dma, DMA_FROM_DEVICE,
+ GFP_ATOMIC);
+ if (err) {
+ u64_stats_update_begin(&rx->statss);
+ rx->rx_buf_alloc_fail++;
+ u64_stats_update_end(&rx->statss);
+ return err;
+ }
+
+ gve_setup_rx_buffer(page_info, data_slot, dma, page);
+ return 0;
+}
+
+static void gve_rx_flip_buffer(struct gve_rx_slot_page_info *page_info,
+ struct gve_rx_data_slot *data_slot)
+{
+ u64 addr = be64_to_cpu(data_slot->addr);
+
+ /* "flip" to other packet buffer on this page */
+ page_info->page_offset ^= PAGE_SIZE / 2;
+ addr ^= PAGE_SIZE / 2;
+ data_slot->addr = cpu_to_be64(addr);
+}
+
+static bool gve_rx_can_flip_buffers(struct net_device *netdev)
+{
+#if PAGE_SIZE == 4096
+ /* We can't flip a buffer if we can't fit a packet
+ * into half a page.
+ */
+ if (netdev->max_mtu + GVE_RX_PAD + ETH_HLEN > PAGE_SIZE / 2)
+ return false;
+ return true;
+#else
+ /* PAGE_SIZE != 4096 - don't try to reuse */
+ return false;
+#endif
+}
+
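+/* Returns 1 if the page is only referenced by the driver and can be reused,
+ * 0 if an SKB still holds a reference, or -1 if the refcount has dropped
+ * below the driver's bias (which should never happen).
+ */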
+static int gve_rx_can_recycle_buffer(struct gve_rx_slot_page_info *page_info)
+{
+ int pagecount = page_count(page_info->page);
+
+ /* This page is not being used by any SKBs - reuse */
+ if (pagecount == page_info->pagecnt_bias) {
+ return 1;
+ /* This page is still being used by an SKB - we can't reuse */
+ } else if (pagecount > page_info->pagecnt_bias) {
+ return 0;
+ } else {
+ WARN(pagecount < page_info->pagecnt_bias,
+ "Pagecount should never be less than the bias.");
+ return -1;
+ }
+}
+
+static void gve_rx_update_pagecnt_bias(struct gve_rx_slot_page_info *page_info)
+{
+ page_info->pagecnt_bias--;
+ if (page_info->pagecnt_bias == 0) {
+ int pagecount = page_count(page_info->page);
+
+ /* If we have run out of bias - set it back up to INT_MAX
+ * minus the existing refs.
+ */
+ page_info->pagecnt_bias = INT_MAX - (pagecount);
+ /* Set pagecount back up to max */
+ page_ref_add(page_info->page, INT_MAX - pagecount);
+ }
+}
+
+static struct sk_buff *
+gve_rx_raw_addressing(struct device *dev, struct net_device *netdev,
+ struct gve_rx_slot_page_info *page_info, u16 len,
+ struct napi_struct *napi,
+ struct gve_rx_data_slot *data_slot, bool can_flip)
+{
+ struct sk_buff *skb = gve_rx_add_frags(napi, page_info, len);
+
+ if (!skb)
+ return NULL;
+
+ /* Optimistically stop the kernel from freeing the page.
+ * We will check again in refill to determine if we need to alloc a
+ * new page.
+ */
+ gve_rx_update_pagecnt_bias(page_info);
+ page_info->can_flip = can_flip;
+
+ return skb;
+}
+
+static struct sk_buff *
+gve_rx_qpl(struct device *dev, struct net_device *netdev,
+ struct gve_rx_ring *rx, struct gve_rx_slot_page_info *page_info,
+ u16 len, struct napi_struct *napi,
+ struct gve_rx_data_slot *data_slot, bool recycle)
+{
+ struct sk_buff *skb;
+ /* if raw_addressing mode is not enabled gvnic can only receive into
+ * registered segments. If the buffer can't be recycled, our only
+ * choice is to copy the data out of it so that we can return it to the
+ * device.
+ */
+ if (recycle) {
+ skb = gve_rx_add_frags(napi, page_info, len);
+ /* No point in recycling if we didn't get the skb */
+ if (skb) {
+ /* Make sure the networking stack can't free the page */
+ gve_rx_update_pagecnt_bias(page_info);
+ gve_rx_flip_buffer(page_info, data_slot);
+ }
+ } else {
+ skb = gve_rx_copy(netdev, napi, page_info, len);
+ if (skb) {
+ u64_stats_update_begin(&rx->statss);
+ rx->rx_copied_pkt++;
+ u64_stats_update_end(&rx->statss);
+ }
+ }
+ return skb;
+}
+
+static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
+ netdev_features_t feat, u32 idx)
+{
+ struct gve_rx_slot_page_info *page_info;
+ struct gve_priv *priv = rx->gve;
+ struct napi_struct *napi = &priv->ntfy_blocks[rx->ntfy_id].napi;
+ struct net_device *netdev = priv->dev;
+ struct gve_rx_data_slot *data_slot;
+ struct sk_buff *skb = NULL;
+ dma_addr_t page_bus;
+ u16 len;
+
+ /* drop this packet */
+ if (unlikely(rx_desc->flags_seq & GVE_RXF_ERR)) {
+ u64_stats_update_begin(&rx->statss);
+ rx->rx_desc_err_dropped_pkt++;
+ u64_stats_update_end(&rx->statss);
+ return false;
+ }
+
+ len = be16_to_cpu(rx_desc->len) - GVE_RX_PAD;
+ page_info = &rx->data.page_info[idx];
+ data_slot = &rx->data.data_ring[idx];
+ page_bus = (rx->data.raw_addressing) ?
+ be64_to_cpu(data_slot->addr) - page_info->page_offset:
+ rx->data.qpl->page_buses[idx];
+ dma_sync_single_for_cpu(&priv->pdev->dev, page_bus,
+ PAGE_SIZE, DMA_FROM_DEVICE);
+
+ if (len <= priv->rx_copybreak) {
+ /* Just copy small packets */
+ skb = gve_rx_copy(netdev, napi, page_info, len);
+ if (skb) {
+ u64_stats_update_begin(&rx->statss);
+ rx->rx_copied_pkt++;
+ rx->rx_copybreak_pkt++;
+ u64_stats_update_end(&rx->statss);
+ }
+ } else {
+ bool can_flip = gve_rx_can_flip_buffers(netdev);
+ int recycle = 0;
+
+ if (can_flip) {
+ recycle = gve_rx_can_recycle_buffer(page_info);
+ if (recycle < 0) {
+ gve_schedule_reset(priv);
+ return false;
+ }
+ }
+ if (rx->data.raw_addressing) {
+ skb = gve_rx_raw_addressing(&priv->pdev->dev, netdev,
+ page_info, len, napi,
+ data_slot,
+ can_flip && recycle);
+ } else {
+ skb = gve_rx_qpl(&priv->pdev->dev, netdev, rx,
+ page_info, len, napi, data_slot,
+ can_flip && recycle);
+ }
+ }
+
+ if (!skb) {
+ u64_stats_update_begin(&rx->statss);
+ rx->rx_skb_alloc_fail++;
+ u64_stats_update_end(&rx->statss);
+ return false;
+ }
+
+ if (likely(feat & NETIF_F_RXCSUM)) {
+ /* NIC passes up the partial sum */
+ if (rx_desc->csum)
+ skb->ip_summed = CHECKSUM_COMPLETE;
+ else
+ skb->ip_summed = CHECKSUM_NONE;
+ skb->csum = csum_unfold(rx_desc->csum);
+ }
+
+ /* parse flags & pass relevant info up */
+ if (likely(feat & NETIF_F_RXHASH) &&
+ gve_needs_rss(rx_desc->flags_seq)) {
+#if RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(7, 0) || LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0)
+ skb_set_hash(skb, be32_to_cpu(rx_desc->rss_hash),
+ gve_rss_type(rx_desc->flags_seq));
+#else /* RHEL_RELEASE_CODE < RHEL_RELEASE_VERSION(7, 0) && LINUX_VERSION_CODE < KERNEL_VERSION(3,14,0) */
+ skb->rxhash = be32_to_cpu(rx_desc->rss_hash);
+ skb->l4_rxhash = !!(rx_desc->flags_seq & (GVE_RXF_TCP | GVE_RXF_UDP));
+#endif /* RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(7, 0) || LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0) */
+ }
+
+ if (skb_is_nonlinear(skb))
+ napi_gro_frags(napi);
+ else
+ napi_gro_receive(napi, skb);
+ return true;
+}
+
+static bool gve_rx_work_pending(struct gve_rx_ring *rx)
+{
+ struct gve_rx_desc *desc;
+ __be16 flags_seq;
+ u32 next_idx;
+
+ next_idx = rx->cnt & rx->mask;
+ desc = rx->desc.desc_ring + next_idx;
+
+ /* make sure we have synchronized the seq no with the device */
+ smp_mb();
+ flags_seq = desc->flags_seq;
+ return (GVE_SEQNO(flags_seq) == rx->desc.seqno);
+}
+
+static bool gve_rx_refill_buffers(struct gve_priv *priv, struct gve_rx_ring *rx)
+{
+ u32 fill_cnt = rx->fill_cnt;
+
+ while ((fill_cnt & rx->mask) != (rx->cnt & rx->mask)) {
+ u32 idx = fill_cnt & rx->mask;
+ struct gve_rx_slot_page_info *page_info =
+ &rx->data.page_info[idx];
+
+ if (page_info->can_flip) {
+ /* The other half of the page is free because it was
+ * free when we processed the descriptor. Flip to it.
+ */
+ struct gve_rx_data_slot *data_slot =
+ &rx->data.data_ring[idx];
+
+ gve_rx_flip_buffer(page_info, data_slot);
+ } else {
+ /* The networking stack may already have finished with every
+ * outstanding packet in this buffer, in which case it can be
+ * reused as-is. Flipping is unnecessary here - if the stack
+ * still owns half the page there is no way to tell which
+ * half, so either the whole page is free or it must be
+ * replaced.
+ */
+ int recycle = gve_rx_can_recycle_buffer(page_info);
+
+ if (recycle < 0) {
+ gve_schedule_reset(priv);
+ return false;
+ }
+ if (!recycle) {
+ /* We can't reuse the buffer - alloc a new one*/
+ struct gve_rx_data_slot *data_slot =
+ &rx->data.data_ring[idx];
+ struct device *dev = &priv->pdev->dev;
+
+ gve_rx_free_buffer(dev, page_info, data_slot);
+ page_info->page = NULL;
+ if (gve_rx_alloc_buffer(priv, dev, page_info,
+ data_slot, rx)) {
+ break;
+ }
+ }
+ }
+ fill_cnt++;
+ }
+ rx->fill_cnt = fill_cnt;
+ return true;
+}
+
+bool gve_clean_rx_done(struct gve_rx_ring *rx, int budget,
+ netdev_features_t feat)
+{
+ struct gve_priv *priv = rx->gve;
+ u32 work_done = 0, packets = 0;
+ struct gve_rx_desc *desc;
+ u32 cnt = rx->cnt;
+ u32 idx = cnt & rx->mask;
+ u64 bytes = 0;
+
+ desc = rx->desc.desc_ring + idx;
+ while ((GVE_SEQNO(desc->flags_seq) == rx->desc.seqno) &&
+ work_done < budget) {
+ bool dropped;
+ netif_info(priv, rx_status, priv->dev,
+ "[%d] idx=%d desc=%p desc->flags_seq=0x%x\n",
+ rx->q_num, idx, desc, desc->flags_seq);
+ netif_info(priv, rx_status, priv->dev,
+ "[%d] seqno=%d rx->desc.seqno=%d\n",
+ rx->q_num, GVE_SEQNO(desc->flags_seq),
+ rx->desc.seqno);
+ dropped = !gve_rx(rx, desc, feat, idx);
+ if (!dropped) {
+ bytes += be16_to_cpu(desc->len) - GVE_RX_PAD;
+ packets++;
+ }
+ cnt++;
+ idx = cnt & rx->mask;
+ desc = rx->desc.desc_ring + idx;
+ rx->desc.seqno = gve_next_seqno(rx->desc.seqno);
+ work_done++;
+ }
+
+ if (!work_done)
+ return false;
+
+ u64_stats_update_begin(&rx->statss);
+ rx->rpackets += packets;
+ rx->rbytes += bytes;
+ u64_stats_update_end(&rx->statss);
+ rx->cnt = cnt;
+
+ /* restock ring slots */
+ if (!rx->data.raw_addressing) {
+ /* In QPL mode buffs are refilled as the desc are processed */
+ rx->fill_cnt += work_done;
+ dma_wmb();/* Ensure descs are visible before ringing doorbell */
+ gve_rx_write_doorbell(priv, rx);
+ } else if (rx->fill_cnt - cnt <= rx->db_threshold) {
+ /* In raw addressing mode buffs are only refilled if the avail
+ * falls below a threshold.
+ */
+ if (!gve_rx_refill_buffers(priv, rx))
+ return false;
+ /* restock desc ring slots */
+ dma_wmb();/* Ensure descs are visible before ringing doorbell */
+ gve_rx_write_doorbell(priv, rx);
+ }
+
+ return gve_rx_work_pending(rx);
+}
+
+bool gve_rx_poll(struct gve_notify_block *block, int budget)
+{
+ struct gve_rx_ring *rx = block->rx;
+ netdev_features_t feat;
+ bool repoll = false;
+
+ feat = block->napi.dev->features;
+
+ /* If budget is 0, do all the work */
+ if (budget == 0)
+ budget = INT_MAX;
+
+ if (budget > 0)
+ repoll |= gve_clean_rx_done(rx, budget, feat);
+ else
+ repoll |= gve_rx_work_pending(rx);
+ return repoll;
+}
diff --git a/drivers/net/ethernet/google/gve/gve_size_assert.h b/drivers/net/ethernet/google/gve/gve_size_assert.h
new file mode 100644
index 0000000..a6a238e
--- /dev/null
+++ b/drivers/net/ethernet/google/gve/gve_size_assert.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR MIT)
+ * Google virtual Ethernet (gve) driver
+ *
+ * Copyright (C) 2015-2019 Google, Inc.
+ */
+
+#ifndef _GVE_ASSERT_H_
+#define _GVE_ASSERT_H_
+#define static_assert(expr, ...) _Static_assert(expr, #expr)
+#endif /* _GVE_ASSERT_H_ */
diff --git a/drivers/net/ethernet/google/gve/gve_tx.c b/drivers/net/ethernet/google/gve/gve_tx.c
new file mode 100644
index 0000000..cf66eb4
--- /dev/null
+++ b/drivers/net/ethernet/google/gve/gve_tx.c
@@ -0,0 +1,772 @@
+// SPDX-License-Identifier: (GPL-2.0 OR MIT)
+/* Google virtual Ethernet (gve) driver
+ *
+ * Copyright (C) 2015-2019 Google, Inc.
+ */
+
+#include "gve_linux_version.h"
+#include "gve.h"
+#include "gve_adminq.h"
+#include <linux/ip.h>
+#include <linux/tcp.h>
+#include <linux/vmalloc.h>
+#include <linux/skbuff.h>
+
+static inline void gve_tx_put_doorbell(struct gve_priv *priv,
+ struct gve_queue_resources *q_resources,
+ u32 val)
+{
+ iowrite32be(val, &priv->db_bar2[be32_to_cpu(q_resources->db_index)]);
+}
+
+/* gvnic can only transmit from a Registered Segment.
+ * We copy skb payloads into the registered segment before writing Tx
+ * descriptors and ringing the Tx doorbell.
+ *
+ * gve_tx_fifo_* manages the Registered Segment as a FIFO - clients must
+ * free allocations in the order they were allocated.
+ */
+
+static int gve_tx_fifo_init(struct gve_priv *priv, struct gve_tx_fifo *fifo)
+{
+ fifo->base = vmap(fifo->qpl->pages, fifo->qpl->num_entries, VM_MAP,
+ PAGE_KERNEL);
+ if (unlikely(!fifo->base)) {
+ netif_err(priv, drv, priv->dev, "Failed to vmap fifo, qpl_id = %d\n",
+ fifo->qpl->id);
+ return -ENOMEM;
+ }
+
+ fifo->size = fifo->qpl->num_entries * PAGE_SIZE;
+ atomic_set(&fifo->available, fifo->size);
+ fifo->head = 0;
+ return 0;
+}
+
+static void gve_tx_fifo_release(struct gve_priv *priv, struct gve_tx_fifo *fifo)
+{
+ WARN(atomic_read(&fifo->available) != fifo->size,
+ "Releasing non-empty fifo");
+
+ vunmap(fifo->base);
+}
+
+static int gve_tx_fifo_pad_alloc_one_frag(struct gve_tx_fifo *fifo,
+ size_t bytes)
+{
+ return (fifo->head + bytes < fifo->size) ? 0 : fifo->size - fifo->head;
+}
+
+static bool gve_tx_fifo_can_alloc(struct gve_tx_fifo *fifo, size_t bytes)
+{
+ return (atomic_read(&fifo->available) <= bytes) ? false : true;
+}
+
+/* gve_tx_alloc_fifo - Allocate fragment(s) from Tx FIFO
+ * @fifo: FIFO to allocate from
+ * @bytes: Allocation size
+ * @iov: Scatter-gather elements to fill with allocation fragment base/len
+ *
+ * Returns number of valid elements in iov[] or negative on error.
+ *
+ * Allocations from a given FIFO must be externally synchronized but concurrent
+ * allocation and frees are allowed.
+ */
+static int gve_tx_alloc_fifo(struct gve_tx_fifo *fifo, size_t bytes,
+ struct gve_tx_iovec iov[2])
+{
+ size_t overflow, padding;
+ u32 aligned_head;
+ int nfrags = 0;
+
+ if (!bytes)
+ return 0;
+
+ /* This check happens before we know how much padding is needed to
+ * align to a cacheline boundary for the payload, but that is fine,
+ * because the FIFO head always starts aligned, and the FIFO's boundaries
+ * are aligned, so if there is space for the data, there is space for
+ * the padding to the next alignment.
+ */
+ WARN(!gve_tx_fifo_can_alloc(fifo, bytes),
+ "Reached %s when there's not enough space in the fifo", __func__);
+
+ nfrags++;
+
+ iov[0].iov_offset = fifo->head;
+ iov[0].iov_len = bytes;
+ fifo->head += bytes;
+
+ if (fifo->head > fifo->size) {
+ /* If the allocation did not fit in the tail fragment of the
+ * FIFO, also use the head fragment.
+ */
+ nfrags++;
+ overflow = fifo->head - fifo->size;
+ iov[0].iov_len -= overflow;
+ iov[1].iov_offset = 0; /* Start of fifo */
+ iov[1].iov_len = overflow;
+
+ fifo->head = overflow;
+ }
+
+ /* Re-align to a cacheline boundary */
+ aligned_head = L1_CACHE_ALIGN(fifo->head);
+ padding = aligned_head - fifo->head;
+ iov[nfrags - 1].iov_padding = padding;
+ atomic_sub(bytes + padding, &fifo->available);
+ fifo->head = aligned_head;
+
+ if (fifo->head == fifo->size)
+ fifo->head = 0;
+
+ return nfrags;
+}
+
+/* gve_tx_free_fifo - Return space to Tx FIFO
+ * @fifo: FIFO to return fragments to
+ * @bytes: Bytes to free
+ */
+static void gve_tx_free_fifo(struct gve_tx_fifo *fifo, size_t bytes)
+{
+ atomic_add(bytes, &fifo->available);
+}
+
+static void gve_tx_remove_from_block(struct gve_priv *priv, int queue_idx)
+{
+ struct gve_notify_block *block =
+ &priv->ntfy_blocks[gve_tx_idx_to_ntfy(priv, queue_idx)];
+
+ block->tx = NULL;
+}
+
+static int gve_clean_tx_done(struct gve_priv *priv, struct gve_tx_ring *tx,
+ u32 to_do, bool try_to_wake);
+
+static void gve_tx_free_ring(struct gve_priv *priv, int idx)
+{
+ struct gve_tx_ring *tx = &priv->tx[idx];
+ struct device *hdev = &priv->pdev->dev;
+ size_t bytes;
+ u32 slots;
+
+ gve_tx_remove_from_block(priv, idx);
+ slots = tx->mask + 1;
+ gve_clean_tx_done(priv, tx, tx->req, false);
+ netdev_tx_reset_queue(tx->netdev_txq);
+
+ dma_free_coherent(hdev, sizeof(*tx->q_resources),
+ tx->q_resources, tx->q_resources_bus);
+ tx->q_resources = NULL;
+
+ if (!tx->raw_addressing) {
+ gve_tx_fifo_release(priv, &tx->tx_fifo);
+ gve_unassign_qpl(priv, tx->tx_fifo.qpl->id);
+ tx->tx_fifo.qpl = NULL;
+ }
+
+ bytes = sizeof(*tx->desc) * slots;
+ dma_free_coherent(hdev, bytes, tx->desc, tx->bus);
+ tx->desc = NULL;
+
+ vfree(tx->info);
+ tx->info = NULL;
+
+ netif_dbg(priv, drv, priv->dev, "freed tx queue %d\n", idx);
+}
+
+static void gve_tx_add_to_block(struct gve_priv *priv, int queue_idx)
+{
+ unsigned int active_cpus = min_t(int, priv->num_ntfy_blks / 2,
+ num_online_cpus());
+ int ntfy_idx = gve_tx_idx_to_ntfy(priv, queue_idx);
+ struct gve_notify_block *block = &priv->ntfy_blocks[ntfy_idx];
+ struct gve_tx_ring *tx = &priv->tx[queue_idx];
+
+ block->tx = tx;
+ tx->ntfy_id = ntfy_idx;
+ netif_set_xps_queue(priv->dev, get_cpu_mask(ntfy_idx % active_cpus),
+ queue_idx);
+}
+
+static int gve_tx_alloc_ring(struct gve_priv *priv, int idx)
+{
+ struct gve_tx_ring *tx = &priv->tx[idx];
+ struct device *hdev = &priv->pdev->dev;
+ u32 slots = priv->tx_desc_cnt;
+ size_t bytes;
+
+ /* Make sure everything is zeroed to start */
+ memset(tx, 0, sizeof(*tx));
+ tx->q_num = idx;
+
+ tx->mask = slots - 1;
+
+ /* alloc metadata */
+ tx->info = vzalloc(sizeof(*tx->info) * slots);
+ if (!tx->info)
+ return -ENOMEM;
+
+ /* alloc tx queue */
+ bytes = sizeof(*tx->desc) * slots;
+ tx->desc = dma_alloc_coherent(hdev, bytes, &tx->bus, GFP_KERNEL);
+ if (!tx->desc)
+ goto abort_with_info;
+
+ tx->raw_addressing = priv->raw_addressing;
+ tx->dev = &priv->pdev->dev;
+ if (!tx->raw_addressing) {
+ tx->tx_fifo.qpl = gve_assign_tx_qpl(priv);
+
+ /* map Tx FIFO */
+ if (gve_tx_fifo_init(priv, &tx->tx_fifo))
+ goto abort_with_desc;
+ }
+
+ tx->q_resources =
+ dma_alloc_coherent(hdev,
+ sizeof(*tx->q_resources),
+ &tx->q_resources_bus,
+ GFP_KERNEL);
+ if (!tx->q_resources)
+ goto abort_with_fifo;
+
+ netif_dbg(priv, drv, priv->dev, "tx[%d]->bus=%lx\n", idx,
+ (unsigned long)tx->bus);
+ tx->netdev_txq = netdev_get_tx_queue(priv->dev, idx);
+ gve_tx_add_to_block(priv, idx);
+
+ return 0;
+
+abort_with_fifo:
+ if (!tx->raw_addressing)
+ gve_tx_fifo_release(priv, &tx->tx_fifo);
+abort_with_desc:
+ dma_free_coherent(hdev, bytes, tx->desc, tx->bus);
+ tx->desc = NULL;
+abort_with_info:
+ vfree(tx->info);
+ tx->info = NULL;
+ return -ENOMEM;
+}
+
+int gve_tx_alloc_rings(struct gve_priv *priv)
+{
+ int err = 0;
+ int i;
+
+ for (i = 0; i < priv->tx_cfg.num_queues; i++) {
+ err = gve_tx_alloc_ring(priv, i);
+ if (err) {
+ netif_err(priv, drv, priv->dev,
+ "Failed to alloc tx ring=%d: err=%d\n",
+ i, err);
+ break;
+ }
+ }
+ /* Unallocate if there was an error */
+ if (err) {
+ int j;
+
+ for (j = 0; j < i; j++)
+ gve_tx_free_ring(priv, j);
+ }
+ return err;
+}
+
+void gve_tx_free_rings(struct gve_priv *priv)
+{
+ int i;
+
+ for (i = 0; i < priv->tx_cfg.num_queues; i++)
+ gve_tx_free_ring(priv, i);
+}
+
+/* gve_tx_avail - Calculates the number of slots available in the ring
+ * @tx: tx ring to check
+ *
+ * Returns the number of slots available
+ *
+ * The capacity of the queue is mask + 1. We don't need to reserve an entry.
+ **/
+static inline u32 gve_tx_avail(struct gve_tx_ring *tx)
+{
+ return tx->mask + 1 - (tx->req - tx->done);
+}
+
+static inline int gve_skb_fifo_bytes_required(struct gve_tx_ring *tx,
+ struct sk_buff *skb)
+{
+ int pad_bytes, align_hdr_pad;
+ int bytes;
+ int hlen;
+
+ hlen = skb_is_gso(skb) ? skb_checksum_start_offset(skb) +
+ tcp_hdrlen(skb) : skb_headlen(skb);
+
+ pad_bytes = gve_tx_fifo_pad_alloc_one_frag(&tx->tx_fifo,
+ hlen);
+ /* We need to take into account the header alignment padding. */
+ align_hdr_pad = L1_CACHE_ALIGN(hlen) - hlen;
+ bytes = align_hdr_pad + pad_bytes + skb->len;
+
+ return bytes;
+}
+
+/* The most descriptors we could need are 3 - 1 for the headers, 1 for
+ * the beginning of the payload at the end of the FIFO, and 1 if the
+ * payload wraps to the beginning of the FIFO.
+ */
+#define MAX_TX_DESC_NEEDED 3
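+/* Mapped buffer lengths are stored signed: linear mappings (dma_map_single)
+ * record a positive length, page fragments (skb_frag_dma_map) a negative one,
+ * so the unmap path knows which dma_unmap_* variant to call.
+ */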
+static void gve_tx_unmap_buf(struct device *dev,
+ struct gve_tx_dma_buf *buf)
+{
+ const int buf_len = (int)dma_unmap_len(buf, len);
+
+ if (buf_len > 0) {
+ dma_unmap_single(dev, dma_unmap_addr(buf, dma),
+ dma_unmap_len(buf, len),
+ DMA_TO_DEVICE);
+ dma_unmap_len_set(buf, len, 0);
+ } else if (buf_len < 0) {
+ dma_unmap_page(dev, dma_unmap_addr(buf, dma),
+ -dma_unmap_len(buf, len),
+ DMA_TO_DEVICE);
+ dma_unmap_len_set(buf, len, 0);
+ }
+}
+
+/* Check if sufficient resources (descriptor ring space, FIFO space) are
+ * available to transmit the given number of bytes.
+ */
+static inline bool gve_can_tx(struct gve_tx_ring *tx, int bytes_required)
+{
+ bool can_alloc = true;
+
+ if (!tx->raw_addressing)
+ can_alloc = gve_tx_fifo_can_alloc(&tx->tx_fifo, bytes_required);
+
+ return (gve_tx_avail(tx) >= MAX_TX_DESC_NEEDED && can_alloc);
+}
+
+/* Stops the queue if the skb cannot be transmitted. */
+static int gve_maybe_stop_tx(struct gve_tx_ring *tx, struct sk_buff *skb)
+{
+ int bytes_required = 0;
+
+ if (!tx->raw_addressing)
+ bytes_required = gve_skb_fifo_bytes_required(tx, skb);
+
+ if (likely(gve_can_tx(tx, bytes_required)))
+ return 0;
+
+ /* No space, so stop the queue */
+ tx->stop_queue++;
+ netif_tx_stop_queue(tx->netdev_txq);
+ smp_mb(); /* sync with restarting queue in gve_clean_tx_done() */
+
+ /* Now check for resources again, in case gve_clean_tx_done() freed
+ * resources after we checked and we stopped the queue after
+ * gve_clean_tx_done() checked.
+ *
+ * gve_maybe_stop_tx() gve_clean_tx_done()
+ * nsegs/can_alloc test failed
+ * gve_tx_free_fifo()
+ * if (tx queue stopped)
+ * netif_tx_queue_wake()
+ * netif_tx_stop_queue()
+ * Need to check again for space here!
+ */
+ if (likely(!gve_can_tx(tx, bytes_required)))
+ return -EBUSY;
+
+ netif_tx_start_queue(tx->netdev_txq);
+ tx->wake_queue++;
+ return 0;
+}
+
+static void gve_tx_fill_pkt_desc(union gve_tx_desc *pkt_desc,
+ struct sk_buff *skb, bool is_gso,
+ int l4_hdr_offset, u32 desc_cnt,
+ u16 hlen, u64 addr)
+{
+ /* l4_hdr_offset and csum_offset are in units of 16-bit words */
+ if (is_gso) {
+ pkt_desc->pkt.type_flags = GVE_TXD_TSO | GVE_TXF_L4CSUM;
+ pkt_desc->pkt.l4_csum_offset = skb->csum_offset >> 1;
+ pkt_desc->pkt.l4_hdr_offset = l4_hdr_offset >> 1;
+ } else if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
+ pkt_desc->pkt.type_flags = GVE_TXD_STD | GVE_TXF_L4CSUM;
+ pkt_desc->pkt.l4_csum_offset = skb->csum_offset >> 1;
+ pkt_desc->pkt.l4_hdr_offset = l4_hdr_offset >> 1;
+ } else {
+ pkt_desc->pkt.type_flags = GVE_TXD_STD;
+ pkt_desc->pkt.l4_csum_offset = 0;
+ pkt_desc->pkt.l4_hdr_offset = 0;
+ }
+ pkt_desc->pkt.desc_cnt = desc_cnt;
+ pkt_desc->pkt.len = cpu_to_be16(skb->len);
+ pkt_desc->pkt.seg_len = cpu_to_be16(hlen);
+ pkt_desc->pkt.seg_addr = cpu_to_be64(addr);
+}
+
+static void gve_tx_fill_seg_desc(union gve_tx_desc *seg_desc,
+ struct sk_buff *skb, bool is_gso,
+ u16 len, u64 addr)
+{
+ seg_desc->seg.type_flags = GVE_TXD_SEG;
+ if (is_gso) {
+ if (skb_is_gso_v6(skb))
+ seg_desc->seg.type_flags |= GVE_TXSF_IPV6;
+ seg_desc->seg.l3_offset = skb_network_offset(skb) >> 1;
+ seg_desc->seg.mss = cpu_to_be16(skb_shinfo(skb)->gso_size);
+ }
+ seg_desc->seg.seg_len = cpu_to_be16(len);
+ seg_desc->seg.seg_addr = cpu_to_be64(addr);
+}
+
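+/* Sync every QPL page touched by a FIFO fragment so the device sees the
+ * bytes copied into the registered segment before the descriptors are
+ * handed over.
+ */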
+static inline void gve_dma_sync_for_device(struct gve_priv *priv,
+ dma_addr_t *page_buses,
+ u64 iov_offset, u64 iov_len)
+{
+ u64 last_page = (iov_offset + iov_len - 1) / PAGE_SIZE;
+ u64 first_page = iov_offset / PAGE_SIZE;
+ u64 page;
+
+ for (page = first_page; page <= last_page; page++) {
+ dma_addr_t dma = page_buses[page];
+ dma_sync_single_for_device(&priv->pdev->dev, dma, PAGE_SIZE,
+ DMA_TO_DEVICE);
+ }
+}
+
+static int gve_tx_add_skb_copy(struct gve_priv *priv, struct gve_tx_ring *tx,
+ struct sk_buff *skb)
+{
+ int pad_bytes, hlen, hdr_nfrags, payload_nfrags, l4_hdr_offset;
+ union gve_tx_desc *pkt_desc, *seg_desc;
+ struct gve_tx_buffer_state *info;
+ bool is_gso = skb_is_gso(skb);
+ u32 idx = tx->req & tx->mask;
+ int payload_iov = 2;
+ int copy_offset;
+ u32 next_idx;
+ int i;
+
+ info = &tx->info[idx];
+ pkt_desc = &tx->desc[idx];
+
+ l4_hdr_offset = skb_checksum_start_offset(skb);
+ /* If the skb is gso, then we want the tcp header in the first segment
+ * otherwise we want the linear portion of the skb (which will contain
+ * the checksum because skb->csum_start and skb->csum_offset are given
+ * relative to skb->head) in the first segment.
+ */
+ hlen = is_gso ? l4_hdr_offset + tcp_hdrlen(skb) :
+ skb_headlen(skb);
+
+ info->skb = skb;
+ /* We don't want to split the header, so if necessary, pad to the end
+ * of the fifo and then put the header at the beginning of the fifo.
+ */
+ pad_bytes = gve_tx_fifo_pad_alloc_one_frag(&tx->tx_fifo, hlen);
+ hdr_nfrags = gve_tx_alloc_fifo(&tx->tx_fifo, hlen + pad_bytes,
+ &info->iov[0]);
+ WARN(!hdr_nfrags, "hdr_nfrags should never be 0!");
+ payload_nfrags = gve_tx_alloc_fifo(&tx->tx_fifo, skb->len - hlen,
+ &info->iov[payload_iov]);
+
+ gve_tx_fill_pkt_desc(pkt_desc, skb, is_gso, l4_hdr_offset,
+ 1 + payload_nfrags, hlen,
+ info->iov[hdr_nfrags - 1].iov_offset);
+
+ skb_copy_bits(skb, 0,
+ tx->tx_fifo.base + info->iov[hdr_nfrags - 1].iov_offset,
+ hlen);
+ gve_dma_sync_for_device(priv, tx->tx_fifo.qpl->page_buses,
+ info->iov[hdr_nfrags - 1].iov_offset,
+ info->iov[hdr_nfrags - 1].iov_len);
+ copy_offset = hlen;
+
+ for (i = payload_iov; i < payload_nfrags + payload_iov; i++) {
+ next_idx = (tx->req + 1 + i - payload_iov) & tx->mask;
+ seg_desc = &tx->desc[next_idx];
+
+ gve_tx_fill_seg_desc(seg_desc, skb, is_gso,
+ info->iov[i].iov_len,
+ info->iov[i].iov_offset);
+
+ skb_copy_bits(skb, copy_offset,
+ tx->tx_fifo.base + info->iov[i].iov_offset,
+ info->iov[i].iov_len);
+ gve_dma_sync_for_device(priv, tx->tx_fifo.qpl->page_buses,
+ info->iov[i].iov_offset,
+ info->iov[i].iov_len);
+ copy_offset += info->iov[i].iov_len;
+ }
+
+ return 1 + payload_nfrags;
+}
+
+static int gve_tx_add_skb_no_copy(struct gve_priv *priv, struct gve_tx_ring *tx,
+ struct sk_buff *skb)
+{
+ const struct skb_shared_info *shinfo = skb_shinfo(skb);
+ int hlen, payload_nfrags, l4_hdr_offset, seg_idx_bias;
+ union gve_tx_desc *pkt_desc, *seg_desc;
+ struct gve_tx_buffer_state *info;
+ bool is_gso = skb_is_gso(skb);
+ u32 idx = tx->req & tx->mask;
+ struct gve_tx_dma_buf *buf;
+ int last_mapped = 0;
+ u64 addr;
+ u32 len;
+ int i;
+
+ info = &tx->info[idx];
+ pkt_desc = &tx->desc[idx];
+
+ l4_hdr_offset = skb_checksum_start_offset(skb);
+ /* If the skb is gso, then we want the tcp header in the first segment
+ * otherwise we want the linear portion of the skb (which will contain
+ * the checksum because skb->csum_start and skb->csum_offset are given
+ * relative to skb->head) in the first segment.
+ */
+ hlen = is_gso ? l4_hdr_offset + tcp_hdrlen(skb) :
+ skb_headlen(skb);
+ len = skb_headlen(skb);
+
+ info->skb = skb;
+
+ addr = dma_map_single(tx->dev, skb->data, len, DMA_TO_DEVICE);
+ if (unlikely(dma_mapping_error(tx->dev, addr))) {
+ priv->dma_mapping_error++;
+ goto drop;
+ }
+ buf = &info->buf;
+ dma_unmap_len_set(buf, len, len);
+ dma_unmap_addr_set(buf, dma, addr);
+
+ payload_nfrags = shinfo->nr_frags;
+ if (hlen < len) {
+ /* For gso the rest of the linear portion of the skb needs to
+ * be in its own descriptor.
+ */
+ payload_nfrags++;
+ gve_tx_fill_pkt_desc(pkt_desc, skb, is_gso, l4_hdr_offset,
+ 1 + payload_nfrags, hlen, addr);
+
+ len -= hlen;
+ addr += hlen;
+ seg_desc = &tx->desc[(tx->req + 1) & tx->mask];
+ seg_idx_bias = 2;
+ gve_tx_fill_seg_desc(seg_desc, skb, is_gso, len, addr);
+ } else {
+ seg_idx_bias = 1;
+ gve_tx_fill_pkt_desc(pkt_desc, skb, is_gso, l4_hdr_offset,
+ 1 + payload_nfrags, hlen, addr);
+ }
+
+ for (i = 0; i < payload_nfrags - (seg_idx_bias - 1); i++) {
+ struct skb_frag_struct frag = shinfo->frags[i];
+
+ idx = (tx->req + i + seg_idx_bias) & tx->mask;
+ seg_desc = &tx->desc[idx];
+ len = skb_frag_size(&frag);
+ addr = skb_frag_dma_map(tx->dev, &frag, 0, len, DMA_TO_DEVICE);
+ if (unlikely(dma_mapping_error(tx->dev, addr))) {
+ priv->dma_mapping_error++;
+ goto unmap_drop;
+ }
+ buf = &tx->info[idx].buf;
+ dma_unmap_len_set(buf, len, -len);
+ dma_unmap_addr_set(buf, dma, addr);
+
+ gve_tx_fill_seg_desc(seg_desc, skb, is_gso, len, addr);
+ }
+
+ return 1 + payload_nfrags;
+
+unmap_drop:
+ i--;
+ for (last_mapped = i + seg_idx_bias; last_mapped >= 0; last_mapped--) {
+ idx = (tx->req + last_mapped) & tx->mask;
+ gve_tx_unmap_buf(tx->dev, &tx->info[idx].buf);
+ }
+drop:
+ tx->dropped_pkt++;
+ return 0;
+}
+
+netdev_tx_t gve_tx(struct sk_buff *skb, struct net_device *dev)
+{
+ struct gve_priv *priv = netdev_priv(dev);
+ struct gve_tx_ring *tx;
+ int nsegs;
+
+ WARN(skb_get_queue_mapping(skb) >= priv->tx_cfg.num_queues,
+ "skb queue index out of range");
+ tx = &priv->tx[skb_get_queue_mapping(skb)];
+ if (unlikely(gve_maybe_stop_tx(tx, skb))) {
+ /* We need to ring the txq doorbell -- we have stopped the Tx
+ * queue for want of resources, but prior calls to gve_tx()
+ * may have added descriptors without ringing the doorbell.
+ */
+
+ /* Ensure tx descs from a prior gve_tx are visible before
+ * ringing doorbell.
+ */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0)
+ dma_wmb();
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0) */
+ wmb();
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0) */
+ gve_tx_put_doorbell(priv, tx->q_resources, tx->req);
+ return NETDEV_TX_BUSY;
+ }
+ if (tx->raw_addressing)
+ nsegs = gve_tx_add_skb_no_copy(priv, tx, skb);
+ else
+ nsegs = gve_tx_add_skb_copy(priv, tx, skb);
+
+ /* If the packet is getting sent, we need to update the skb */
+ if (nsegs) {
+ netdev_tx_sent_queue(tx->netdev_txq, skb->len);
+ skb_tx_timestamp(skb);
+ }
+
+ /* Give packets to NIC. Even if this packet failed to send the doorbell
+ * might need to be rung because of xmit_more.
+ */
+ tx->req += nsegs;
+
+ /* If we have xmit_more - don't ring the doorbell unless we are stopped */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,18,0)
+ if (!netif_xmit_stopped(tx->netdev_txq)
+#if LINUX_VERSION_CODE > KERNEL_VERSION(5,2,0)
+ && netdev_xmit_more()
+#else /* LINUX_VERSION_CODE > KERNEL_VERSION(5,2,0) */
+ && skb->xmit_more
+#endif /* LINUX_VERSION_CODE > KERNEL_VERSION(5,2,0) */
+)
+ return NETDEV_TX_OK;
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,18,0) */
+
+ /* Ensure tx descs are visible before ringing doorbell */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0)
+ dma_wmb();
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0) */
+ wmb();
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0) */
+ gve_tx_put_doorbell(priv, tx->q_resources, tx->req);
+ return NETDEV_TX_OK;
+}
+
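+/* Free FIFO bytes (checked together with descriptor availability in
+ * gve_can_tx()) required before a stopped TX queue is woken back up in
+ * gve_clean_tx_done().
+ */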
+#define GVE_TX_START_THRESH PAGE_SIZE
+
+static int gve_clean_tx_done(struct gve_priv *priv, struct gve_tx_ring *tx,
+ u32 to_do, bool try_to_wake)
+{
+ struct gve_tx_buffer_state *info;
+ u64 pkts = 0, bytes = 0;
+ size_t space_freed = 0;
+ struct sk_buff *skb;
+ int i, j;
+ u32 idx;
+
+ for (j = 0; j < to_do; j++) {
+ idx = tx->done & tx->mask;
+ netif_info(priv, tx_done, priv->dev,
+ "[%d] %s: idx=%d (req=%u done=%u)\n",
+ tx->q_num, __func__, idx, tx->req, tx->done);
+ info = &tx->info[idx];
+ skb = info->skb;
+
+ /* Unmap the buffer */
+ if (tx->raw_addressing)
+ gve_tx_unmap_buf(tx->dev, &tx->info[idx].buf);
+
+ /* Mark as free */
+ if (skb) {
+ info->skb = NULL;
+ bytes += skb->len;
+ pkts++;
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0)
+ dev_consume_skb_any(skb);
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0) */
+ dev_kfree_skb_any(skb);
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0) */
+ if (!tx->raw_addressing) {
+ /* FIFO free */
+ for (i = 0; i < ARRAY_SIZE(info->iov); i++) {
+ space_freed += info->iov[i].iov_len +
+ info->iov[i].iov_padding;
+ info->iov[i].iov_len = 0;
+ info->iov[i].iov_padding = 0;
+ }
+ }
+ }
+ tx->done++;
+ }
+
+ if (!tx->raw_addressing)
+ gve_tx_free_fifo(&tx->tx_fifo, space_freed);
+ u64_stats_update_begin(&tx->statss);
+ tx->bytes_done += bytes;
+ tx->pkt_done += pkts;
+ u64_stats_update_end(&tx->statss);
+ netdev_tx_completed_queue(tx->netdev_txq, pkts, bytes);
+
+ /* start the queue if we've stopped it */
+#ifndef CONFIG_BQL
+ /* Make sure that the doorbells are synced */
+ smp_mb();
+#endif
+ if (try_to_wake && netif_tx_queue_stopped(tx->netdev_txq) &&
+ likely(gve_can_tx(tx, GVE_TX_START_THRESH))) {
+ tx->wake_queue++;
+ netif_tx_wake_queue(tx->netdev_txq);
+ }
+
+ return pkts;
+}
+
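+/* Read this queue's event counter from the shared counter array; the device
+ * keeps it updated with the number of completed TX descriptors, which
+ * gve_tx_poll() compares against tx->done.
+ */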
+__be32 gve_tx_load_event_counter(struct gve_priv *priv,
+ struct gve_tx_ring *tx)
+{
+ u32 counter_index = be32_to_cpu((tx->q_resources->counter_index));
+
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,20,0)
+ return READ_ONCE(priv->counter_array[counter_index]);
+#else /* LINUX_VERSION_CODE < KERNEL_VERSION(3,20,0) */
+ return ACCESS_ONCE(priv->counter_array[counter_index]);
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,20,0) */
+}
+
+bool gve_tx_poll(struct gve_notify_block *block, int budget)
+{
+ struct gve_priv *priv = block->priv;
+ struct gve_tx_ring *tx = block->tx;
+ bool repoll = false;
+ u32 nic_done;
+ u32 to_do;
+
+ /* If budget is 0, do all the work */
+ if (budget == 0)
+ budget = INT_MAX;
+
+ /* Find out how much work there is to be done */
+ tx->last_nic_done = gve_tx_load_event_counter(priv, tx);
+ nic_done = be32_to_cpu(tx->last_nic_done);
+ if (budget > 0) {
+ /* Do as much work as we have that the budget will
+ * allow
+ */
+ to_do = min_t(u32, (nic_done - tx->done), budget);
+ gve_clean_tx_done(priv, tx, to_do, true);
+ }
+ /* If we still have work we want to repoll */
+ repoll |= (nic_done != tx->done);
+ return repoll;
+}
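
The gve transmit path above batches doorbell writes: descriptors are posted for every packet, but the MMIO doorbell is only written once the stack signals that no further packets are pending (netdev_xmit_more()/skb->xmit_more is false) or the queue had to be stopped. The following is a minimal user-space sketch of that decision logic only; the fake_ring type and ring_doorbell() helper are hypothetical stand-ins, not the driver's API.

```c
/* Minimal sketch of xmit_more-style doorbell batching. The fake_ring type
 * and ring_doorbell() helper are hypothetical, not the gve driver's API. */
#include <stdbool.h>
#include <stdio.h>

struct fake_ring {
	unsigned int req;	/* descriptors posted so far */
	unsigned int db;	/* value last written to the doorbell */
};

static void ring_doorbell(struct fake_ring *r)
{
	r->db = r->req;		/* stands in for gve_tx_put_doorbell() */
	printf("doorbell <- %u\n", r->db);
}

/* Post one packet's descriptors; only ring when nothing more is pending
 * or the queue had to be stopped. */
static void xmit(struct fake_ring *r, unsigned int nsegs, bool more_pending,
		 bool queue_stopped)
{
	r->req += nsegs;
	if (more_pending && !queue_stopped)
		return;		/* defer: a later packet rings for us */
	ring_doorbell(r);
}

int main(void)
{
	struct fake_ring r = { 0, 0 };

	xmit(&r, 2, true, false);	/* batched, no doorbell write */
	xmit(&r, 3, true, false);	/* still batched */
	xmit(&r, 1, false, false);	/* end of burst: one write covers 6 descs */
	return 0;
}
```

One MMIO write then covers every descriptor posted since the previous doorbell, which is the point of the xmit_more hint.
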
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b21223b..208ec45 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2234,6 +2234,53 @@
return 0;
}
+static int virtnet_set_coalesce(struct net_device *dev,
+ struct ethtool_coalesce *ec)
+{
+ struct ethtool_coalesce ec_default = {
+ .cmd = ETHTOOL_SCOALESCE,
+ .rx_max_coalesced_frames = 1,
+ };
+ struct virtnet_info *vi = netdev_priv(dev);
+ int i, napi_weight;
+
+ if (ec->tx_max_coalesced_frames > 1)
+ return -EINVAL;
+
+ ec_default.tx_max_coalesced_frames = ec->tx_max_coalesced_frames;
+ napi_weight = ec->tx_max_coalesced_frames ? NAPI_POLL_WEIGHT : 0;
+
+ /* disallow changes to fields not explicitly tested above */
+ if (memcmp(ec, &ec_default, sizeof(ec_default)))
+ return -EINVAL;
+
+ if (napi_weight ^ vi->sq[0].napi.weight) {
+ if (dev->flags & IFF_UP)
+ return -EBUSY;
+ for (i = 0; i < vi->max_queue_pairs; i++)
+ vi->sq[i].napi.weight = napi_weight;
+ }
+
+ return 0;
+}
+
+static int virtnet_get_coalesce(struct net_device *dev,
+ struct ethtool_coalesce *ec)
+{
+ struct ethtool_coalesce ec_default = {
+ .cmd = ETHTOOL_GCOALESCE,
+ .rx_max_coalesced_frames = 1,
+ };
+ struct virtnet_info *vi = netdev_priv(dev);
+
+ memcpy(ec, &ec_default, sizeof(ec_default));
+
+ if (vi->sq[0].napi.weight)
+ ec->tx_max_coalesced_frames = 1;
+
+ return 0;
+}
+
static void virtnet_init_settings(struct net_device *dev)
{
struct virtnet_info *vi = netdev_priv(dev);
@@ -2272,6 +2319,8 @@
.get_ts_info = ethtool_op_get_ts_info,
.get_link_ksettings = virtnet_get_link_ksettings,
.set_link_ksettings = virtnet_set_link_ksettings,
+ .set_coalesce = virtnet_set_coalesce,
+ .get_coalesce = virtnet_get_coalesce,
};
static void virtnet_freeze_down(struct virtio_device *vdev)
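
The new virtnet_get_coalesce()/virtnet_set_coalesce() hooks only accept tx-frames values of 0 or 1 (every other field of struct ethtool_coalesce must stay at its default), which is how user space toggles TX NAPI for this driver. Below is a hedged sketch of driving that through the standard SIOCETHTOOL ioctl; the interface name "eth0" is an assumption, and per the code above the set fails with -EBUSY if the interface is up while the NAPI weight would change.

```c
/* Sketch: read and set tx_max_coalesced_frames via SIOCETHTOOL.
 * "eth0" is an assumed virtio-net interface; only 0 or 1 is expected to
 * be accepted by the driver change above. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
	struct ethtool_coalesce ec = { .cmd = ETHTOOL_GCOALESCE };
	struct ifreq ifr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return 1;
	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
	ifr.ifr_data = (void *)&ec;

	if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
		printf("tx-frames: %u\n", ec.tx_max_coalesced_frames);

	/* Enable TX NAPI: tx-frames 1, everything else at the defaults. */
	ec.cmd = ETHTOOL_SCOALESCE;
	ec.rx_max_coalesced_frames = 1;
	ec.tx_max_coalesced_frames = 1;
	if (ioctl(fd, SIOCETHTOOL, &ifr) != 0)
		perror("ETHTOOL_SCOALESCE");

	close(fd);
	return 0;
}
```
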
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index fb667bf..13510ba 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -263,7 +263,7 @@
struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
unsigned int sz_align = ALIGN(size + (offset & (512 - 1)), 512);
sector_t sector = offset >> 9;
- int rc = 0;
+ int rc = 0, ret = 0;
if (unlikely(!size))
return 0;
@@ -301,7 +301,9 @@
}
memcpy_flushcache(nsio->addr + offset, buf, size);
- nvdimm_flush(to_nd_region(ndns->dev.parent));
+ ret = nvdimm_flush(to_nd_region(ndns->dev.parent), NULL);
+ if (ret)
+ rc = ret;
return rc;
}
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 01e194a..fbb01a7 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -163,6 +163,7 @@
struct badblocks bb;
struct nd_interleave_set *nd_set;
struct nd_percpu_lane __percpu *lane;
+ int (*flush)(struct nd_region *nd_region, struct bio *bio);
struct nd_mapping mapping[0];
};
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index a7ce2f1..68b4a90 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -192,6 +192,7 @@
static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
{
+ int ret = 0;
blk_status_t rc = 0;
bool do_acct;
unsigned long start;
@@ -201,7 +202,7 @@
struct nd_region *nd_region = to_region(pmem);
if (bio->bi_opf & REQ_PREFLUSH)
- nvdimm_flush(nd_region);
+ ret = nvdimm_flush(nd_region, bio);
do_acct = nd_iostat_start(bio, &start);
bio_for_each_segment(bvec, bio, iter) {
@@ -216,7 +217,10 @@
nd_iostat_end(bio, start);
if (bio->bi_opf & REQ_FUA)
- nvdimm_flush(nd_region);
+ ret = nvdimm_flush(nd_region, bio);
+
+ if (ret)
+ bio->bi_status = errno_to_blk_status(ret);
bio_endio(bio);
return BLK_QC_T_NONE;
@@ -301,6 +305,7 @@
static const struct dax_operations pmem_dax_ops = {
.direct_access = pmem_dax_direct_access,
+ .dax_supported = generic_fsdax_supported,
.copy_from_iter = pmem_copy_from_iter,
.copy_to_iter = pmem_copy_to_iter,
};
@@ -371,6 +376,7 @@
struct gendisk *disk;
void *addr;
int rc;
+ unsigned long flags = 0UL;
pmem = devm_kzalloc(dev, sizeof(*pmem), GFP_KERNEL);
if (!pmem)
@@ -468,14 +474,15 @@
nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_res);
disk->bb = &pmem->bb;
- dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
+ if (is_nvdimm_sync(nd_region))
+ flags = DAXDEV_F_SYNC;
+ dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops, flags);
if (!dax_dev) {
put_disk(disk);
return -ENOMEM;
}
dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
pmem->dax_dev = dax_dev;
-
gendev = disk_to_dev(disk);
gendev->groups = pmem_attribute_groups;
@@ -533,14 +540,14 @@
sysfs_put(pmem->bb_state);
pmem->bb_state = NULL;
}
- nvdimm_flush(to_nd_region(dev->parent));
+ nvdimm_flush(to_nd_region(dev->parent), NULL);
return 0;
}
static void nd_pmem_shutdown(struct device *dev)
{
- nvdimm_flush(to_nd_region(dev->parent));
+ nvdimm_flush(to_nd_region(dev->parent), NULL);
}
static void nd_pmem_notify(struct device *dev, enum nvdimm_event event)
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 609fc45..aa0f6f5 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -290,7 +290,9 @@
return rc;
if (!flush)
return -EINVAL;
- nvdimm_flush(nd_region);
+ rc = nvdimm_flush(nd_region, NULL);
+ if (rc)
+ return rc;
return len;
}
@@ -1076,6 +1078,11 @@
dev->of_node = ndr_desc->of_node;
nd_region->ndr_size = resource_size(ndr_desc->res);
nd_region->ndr_start = ndr_desc->res->start;
+ if (ndr_desc->flush)
+ nd_region->flush = ndr_desc->flush;
+ else
+ nd_region->flush = NULL;
+
nd_device_register(dev);
return nd_region;
@@ -1116,11 +1123,24 @@
}
EXPORT_SYMBOL_GPL(nvdimm_volatile_region_create);
+int nvdimm_flush(struct nd_region *nd_region, struct bio *bio)
+{
+ int rc = 0;
+
+ if (!nd_region->flush)
+ rc = generic_nvdimm_flush(nd_region);
+ else {
+ if (nd_region->flush(nd_region, bio))
+ rc = -EIO;
+ }
+
+ return rc;
+}
/**
* nvdimm_flush - flush any posted write queues between the cpu and pmem media
* @nd_region: blk or interleaved pmem region
*/
-void nvdimm_flush(struct nd_region *nd_region)
+int generic_nvdimm_flush(struct nd_region *nd_region)
{
struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev);
int i, idx;
@@ -1144,6 +1164,8 @@
if (ndrd_get_flush_wpq(ndrd, i, 0))
writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
wmb();
+
+ return 0;
}
EXPORT_SYMBOL_GPL(nvdimm_flush);
@@ -1188,6 +1210,13 @@
}
EXPORT_SYMBOL_GPL(nvdimm_has_cache);
+bool is_nvdimm_sync(struct nd_region *nd_region)
+{
+ return is_nd_pmem(&nd_region->dev) &&
+ !test_bit(ND_REGION_ASYNC, &nd_region->flags);
+}
+EXPORT_SYMBOL_GPL(is_nvdimm_sync);
+
struct conflict_context {
struct nd_region *nd_region;
resource_size_t start, size;
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d5359c7..e206c39 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1055,15 +1055,15 @@
return id;
}
-static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
- void *buffer, size_t buflen, u32 *result)
+static int nvme_features(struct nvme_ctrl *dev, u8 op, unsigned int fid,
+ unsigned int dword11, void *buffer, size_t buflen, u32 *result)
{
union nvme_result res = { 0 };
struct nvme_command c;
int ret;
memset(&c, 0, sizeof(c));
- c.features.opcode = nvme_admin_set_features;
+ c.features.opcode = op;
c.features.fid = cpu_to_le32(fid);
c.features.dword11 = cpu_to_le32(dword11);
@@ -1074,6 +1074,24 @@
return ret;
}
+int nvme_set_features(struct nvme_ctrl *dev, unsigned int fid,
+ unsigned int dword11, void *buffer, size_t buflen,
+ u32 *result)
+{
+ return nvme_features(dev, nvme_admin_set_features, fid, dword11, buffer,
+ buflen, result);
+}
+EXPORT_SYMBOL_GPL(nvme_set_features);
+
+int nvme_get_features(struct nvme_ctrl *dev, unsigned int fid,
+ unsigned int dword11, void *buffer, size_t buflen,
+ u32 *result)
+{
+ return nvme_features(dev, nvme_admin_get_features, fid, dword11, buffer,
+ buflen, result);
+}
+EXPORT_SYMBOL_GPL(nvme_get_features);
+
int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
{
u32 q_count = (*count - 1) | ((*count - 1) << 16);
@@ -3772,6 +3790,17 @@
}
EXPORT_SYMBOL_GPL(nvme_start_queues);
+void nvme_sync_queues(struct nvme_ctrl *ctrl)
+{
+ struct nvme_ns *ns;
+
+ down_read(&ctrl->namespaces_rwsem);
+ list_for_each_entry(ns, &ctrl->namespaces, list)
+ blk_sync_queue(ns->queue);
+ up_read(&ctrl->namespaces_rwsem);
+}
+EXPORT_SYMBOL_GPL(nvme_sync_queues);
+
int __init nvme_core_init(void)
{
int result = -ENOMEM;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index cc4273f..40192b6 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -436,6 +436,7 @@
void nvme_stop_queues(struct nvme_ctrl *ctrl);
void nvme_start_queues(struct nvme_ctrl *ctrl);
void nvme_kill_queues(struct nvme_ctrl *ctrl);
+void nvme_sync_queues(struct nvme_ctrl *ctrl);
void nvme_unfreeze(struct nvme_ctrl *ctrl);
void nvme_wait_freeze(struct nvme_ctrl *ctrl);
void nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout);
@@ -453,6 +454,12 @@
union nvme_result *result, void *buffer, unsigned bufflen,
unsigned timeout, int qid, int at_head,
blk_mq_req_flags_t flags);
+int nvme_set_features(struct nvme_ctrl *dev, unsigned int fid,
+ unsigned int dword11, void *buffer, size_t buflen,
+ u32 *result);
+int nvme_get_features(struct nvme_ctrl *dev, unsigned int fid,
+ unsigned int dword11, void *buffer, size_t buflen,
+ u32 *result);
int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count);
void nvme_stop_keep_alive(struct nvme_ctrl *ctrl);
int nvme_reset_ctrl(struct nvme_ctrl *ctrl);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3c68a5b..7c00c85 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -26,6 +26,7 @@
#include <linux/mutex.h>
#include <linux/once.h>
#include <linux/pci.h>
+#include <linux/suspend.h>
#include <linux/t10-pi.h>
#include <linux/types.h>
#include <linux/io-64-nonatomic-lo-hi.h>
@@ -106,6 +107,7 @@
u32 cmbloc;
struct nvme_ctrl ctrl;
struct completion ioq_wait;
+ u32 last_ps;
mempool_t *iod_mempool;
@@ -1132,7 +1134,6 @@
struct nvme_dev *dev = nvmeq->dev;
struct request *abort_req;
struct nvme_command cmd;
- bool shutdown = false;
u32 csts = readl(dev->bar + NVME_REG_CSTS);
/* If PCI error recovery process is happening, we cannot reset or
@@ -1169,16 +1170,18 @@
* shutdown, so we return BLK_EH_DONE.
*/
switch (dev->ctrl.state) {
- case NVME_CTRL_DELETING:
- shutdown = true;
case NVME_CTRL_CONNECTING:
- case NVME_CTRL_RESETTING:
+ nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
+ /* fall through */
+ case NVME_CTRL_DELETING:
dev_warn_ratelimited(dev->ctrl.device,
"I/O %d QID %d timeout, disable controller\n",
req->tag, nvmeq->qid);
- nvme_dev_disable(dev, shutdown);
+ nvme_dev_disable(dev, true);
nvme_req(req)->flags |= NVME_REQ_CANCELLED;
return BLK_EH_DONE;
+ case NVME_CTRL_RESETTING:
+ return BLK_EH_RESET_TIMER;
default:
break;
}
@@ -2150,7 +2153,7 @@
static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
{
int i;
- bool dead = true;
+ bool dead = true, freeze = false;
struct pci_dev *pdev = to_pci_dev(dev->dev);
mutex_lock(&dev->shutdown_lock);
@@ -2158,8 +2161,10 @@
u32 csts = readl(dev->bar + NVME_REG_CSTS);
if (dev->ctrl.state == NVME_CTRL_LIVE ||
- dev->ctrl.state == NVME_CTRL_RESETTING)
+ dev->ctrl.state == NVME_CTRL_RESETTING) {
+ freeze = true;
nvme_start_freeze(&dev->ctrl);
+ }
dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
pdev->error_state != pci_channel_io_normal);
}
@@ -2168,10 +2173,8 @@
* Give the controller a chance to complete all entered requests if
* doing a safe shutdown.
*/
- if (!dead) {
- if (shutdown)
- nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
- }
+ if (!dead && shutdown && freeze)
+ nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
nvme_stop_queues(&dev->ctrl);
@@ -2269,6 +2272,7 @@
*/
if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
nvme_dev_disable(dev, false);
+ nvme_sync_queues(&dev->ctrl);
mutex_lock(&dev->shutdown_lock);
result = nvme_pci_enable(dev);
@@ -2608,12 +2612,68 @@
}
#ifdef CONFIG_PM_SLEEP
-static int nvme_suspend(struct device *dev)
+static int nvme_deep_state(struct nvme_dev *dev)
+{
+ struct pci_dev *pdev = to_pci_dev(dev->dev);
+ struct nvme_ctrl *ctrl = &dev->ctrl;
+	int ret = -EBUSY;
+
+ nvme_start_freeze(ctrl);
+ nvme_wait_freeze(ctrl);
+ nvme_sync_queues(ctrl);
+
+ if (ctrl->state != NVME_CTRL_LIVE &&
+ ctrl->state != NVME_CTRL_ADMIN_ONLY)
+ goto unfreeze;
+
+ dev->last_ps = 0;
+ ret = nvme_get_features(ctrl, NVME_FEAT_POWER_MGMT, 0, NULL, 0,
+ &dev->last_ps);
+ if (ret < 0)
+ goto unfreeze;
+
+ ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, dev->ctrl.npss,
+ NULL, 0, NULL);
+ if (ret < 0)
+ goto unfreeze;
+ if (ret) {
+ /*
+ * Clearing npss forces a controller reset on resume. The
+		 * correct value will be rediscovered then.
+ */
+ ctrl->npss = 0;
+ nvme_dev_disable(dev, true);
+ ret = 0;
+ } else {
+ /*
+ * A saved state prevents pci pm from generically controlling
+ * the device's power. If we're using protocol specific
+ * settings, we don't want pci interfering.
+ */
+ pci_save_state(pdev);
+ }
+unfreeze:
+ nvme_unfreeze(ctrl);
+ return ret;
+}
+
+static int nvme_make_operational(struct nvme_dev *dev)
+{
+ struct nvme_ctrl *ctrl = &dev->ctrl;
+
+ if (nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, dev->last_ps,
+ NULL, 0, NULL) == 0)
+ return 0;
+ nvme_reset_ctrl(ctrl);
+ return 0;
+}
+
+static int nvme_simple_resume(struct device *dev)
{
struct pci_dev *pdev = to_pci_dev(dev);
struct nvme_dev *ndev = pci_get_drvdata(pdev);
- nvme_dev_disable(ndev, true);
+ nvme_reset_ctrl(&ndev->ctrl);
return 0;
}
@@ -2622,12 +2682,45 @@
struct pci_dev *pdev = to_pci_dev(dev);
struct nvme_dev *ndev = pci_get_drvdata(pdev);
- nvme_reset_ctrl(&ndev->ctrl);
+ return pm_resume_via_firmware() || !ndev->ctrl.npss ?
+ nvme_simple_resume(dev) : nvme_make_operational(ndev);
+}
+
+static int nvme_simple_suspend(struct device *dev)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+ struct nvme_dev *ndev = pci_get_drvdata(pdev);
+
+ nvme_dev_disable(ndev, true);
return 0;
}
-#endif
-static SIMPLE_DEV_PM_OPS(nvme_dev_pm_ops, nvme_suspend, nvme_resume);
+static int nvme_suspend(struct device *dev)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+ struct nvme_dev *ndev = pci_get_drvdata(pdev);
+
+ /*
+	 * The platform does not remove power for a kernel managed suspend, so
+	 * use host managed nvme power settings for lowest idle power. This
+ * should have quicker resume latency than a full device shutdown.
+ */
+ return pm_suspend_via_firmware() || !ndev->ctrl.npss ?
+ nvme_simple_suspend(dev) : nvme_deep_state(ndev);
+}
+
+const struct dev_pm_ops nvme_dev_pm_ops = {
+ .suspend = nvme_suspend,
+ .resume = nvme_resume,
+ .freeze = nvme_simple_suspend,
+ .thaw = nvme_simple_resume,
+ .poweroff = nvme_simple_suspend,
+ .restore = nvme_simple_resume,
+};
+
+#else
+const struct dev_pm_ops nvme_dev_pm_ops = {};
+#endif
static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
pci_channel_state_t state)
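
The suspend path above saves the current power state with Get Features (FID 0x02, Power Management, per the NVMe spec) and then programs the deepest state (ctrl->npss) with Set Features, rather than fully shutting the controller down. A hedged user-space sketch of the same Get Features call through the admin passthrough ioctl is shown below; the /dev/nvme0 node and the minimal error handling are illustrative only.

```c
/* Sketch: read the current NVMe power state (Get Features, FID 0x02)
 * through the admin passthrough interface. /dev/nvme0 is an assumption. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
	struct nvme_admin_cmd cmd;
	int fd = open("/dev/nvme0", O_RDONLY);

	if (fd < 0)
		return 1;

	memset(&cmd, 0, sizeof(cmd));
	cmd.opcode = 0x0a;	/* Get Features */
	cmd.cdw10 = 0x02;	/* FID: Power Management */

	if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) == 0)
		printf("current power state: %u\n", cmd.result & 0x1f);
	else
		perror("NVME_IOCTL_ADMIN_CMD");

	close(fd);
	return 0;
}
```
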
diff --git a/drivers/rtc/interface.c b/drivers/rtc/interface.c
index ce051f9..c8242d4 100644
--- a/drivers/rtc/interface.c
+++ b/drivers/rtc/interface.c
@@ -579,7 +579,9 @@
struct rtc_time tm;
ktime_t now, onesec;
- __rtc_read_time(rtc, &tm);
+ err = __rtc_read_time(rtc, &tm);
+ if (err)
+ goto out;
onesec = ktime_set(1, 0);
now = rtc_tm_to_ktime(tm);
rtc->uie_rtctimer.node.expires = ktime_add(now, onesec);
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 23e526c..737dc0e 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -59,6 +59,7 @@
static const struct dax_operations dcssblk_dax_ops = {
.direct_access = dcssblk_dax_direct_access,
+ .dax_supported = generic_fsdax_supported,
.copy_from_iter = dcssblk_dax_copy_from_iter,
.copy_to_iter = dcssblk_dax_copy_to_iter,
};
@@ -678,7 +679,7 @@
goto put_dev;
dev_info->dax_dev = alloc_dax(dev_info, dev_info->gd->disk_name,
- &dcssblk_dax_ops);
+ &dcssblk_dax_ops, DAXDEV_F_SYNC);
if (!dev_info->dax_dev) {
rc = -ENOMEM;
goto put_dev;
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 1afcbef..2df7b1c 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -266,7 +266,10 @@
pages_to_bytes(events[PSWPIN]));
update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_OUT,
pages_to_bytes(events[PSWPOUT]));
- update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
+ update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT,
+ events[PGMAJFAULT_S] +
+ events[PGMAJFAULT_A] +
+ events[PGMAJFAULT_F]);
update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
#ifdef CONFIG_HUGETLB_PAGE
update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 58f48ea..8b4ded9 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -34,6 +34,7 @@
#include <linux/mutex.h>
#include <linux/anon_inodes.h>
#include <linux/device.h>
+#include <linux/freezer.h>
#include <linux/uaccess.h>
#include <asm/io.h>
#include <asm/mman.h>
@@ -1816,7 +1817,8 @@
}
spin_unlock_irq(&ep->wq.lock);
- if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
+ if (!freezable_schedule_hrtimeout_range(to, slack,
+ HRTIMER_MODE_ABS))
timed_out = 1;
spin_lock_irq(&ep->wq.lock);
diff --git a/fs/exec.c b/fs/exec.c
index cece8c1..eeac87c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -67,6 +67,7 @@
#include <asm/mmu_context.h>
#include <asm/tlb.h>
+#include <trace/events/fs.h>
#include <trace/events/task.h>
#include "internal.h"
@@ -865,9 +866,12 @@
if (err)
goto exit;
- if (name->name[0] != '\0')
+ if (name->name[0] != '\0') {
fsnotify_open(file);
+ trace_open_exec(name->name);
+ }
+
out:
return file;
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0a4461a..6c73dd8 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -207,6 +207,12 @@
*/
#define EXT4_IO_END_UNWRITTEN 0x0001
+struct ext4_io_end_vec {
+ struct list_head list; /* list of io_end_vec */
+ loff_t offset; /* offset in the file */
+ ssize_t size; /* size of the extent */
+};
+
/*
* For converting unwritten extents on a work queue. 'handle' is used for
* buffered writeback.
@@ -220,8 +226,7 @@
* bios covering the extent */
unsigned int flag; /* unwritten or not */
atomic_t count; /* reference counter */
- loff_t offset; /* offset in the file */
- ssize_t size; /* size of the extent */
+ struct list_head list_vec; /* list of ext4_io_end_vec */
} ext4_io_end_t;
struct ext4_io_submit {
@@ -3177,6 +3182,8 @@
loff_t len);
extern int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
loff_t offset, ssize_t len);
+extern int ext4_convert_unwritten_io_end_vec(handle_t *handle,
+ ext4_io_end_t *io_end);
extern int ext4_map_blocks(handle_t *handle, struct inode *inode,
struct ext4_map_blocks *map, int flags);
extern int ext4_ext_calc_metadata_amount(struct inode *inode,
@@ -3235,6 +3242,8 @@
int len,
struct writeback_control *wbc,
bool keep_towrite);
+extern struct ext4_io_end_vec *ext4_alloc_io_end_vec(ext4_io_end_t *io_end);
+extern struct ext4_io_end_vec *ext4_last_io_end_vec(ext4_io_end_t *io_end);
/* mmp.c */
extern int ext4_multi_mount_protect(struct super_block *, ext4_fsblk_t);
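
struct ext4_io_end_vec above replaces the single offset/size pair in ext4_io_end_t with a list, so one I/O completion can describe several discontiguous extents (needed once dioread_nolock works with blocksize < pagesize). The miniature below illustrates only the accumulation idea, extending the last vector when a range is contiguous and starting a new one otherwise; the fixed-size array stands in for the kernel's list_head plumbing and is not ext4's implementation.

```c
/* Miniature of the io_end_vec idea: one completion describes several
 * discontiguous (offset, size) ranges instead of a single pair. */
#include <stdio.h>

#define NR_VECS 8

struct io_end_vec {
	long long offset;	/* byte offset in the file */
	long long size;		/* length of this extent */
};

struct io_end {
	struct io_end_vec vec[NR_VECS];
	int nr;
};

/* Extend the last extent when contiguous, otherwise start a new one. */
static void account_range(struct io_end *io, long long offset, long long len)
{
	struct io_end_vec *last = io->nr ? &io->vec[io->nr - 1] : NULL;

	if (last && last->offset + last->size == offset) {
		last->size += len;
		return;
	}
	if (io->nr == NR_VECS)
		return;		/* miniature only: drop instead of growing */
	io->vec[io->nr].offset = offset;
	io->vec[io->nr].size = len;
	io->nr++;
}

int main(void)
{
	struct io_end io = { .nr = 0 };
	int i;

	account_range(&io, 0, 4096);
	account_range(&io, 4096, 4096);		/* contiguous: merged */
	account_range(&io, 65536, 4096);	/* gap: new vector */

	for (i = 0; i < io.nr; i++)
		printf("vec %d: offset=%lld size=%lld\n",
		       i, io.vec[i].offset, io.vec[i].size);
	return 0;
}
```
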
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 6e80490..6d5cee1 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -5031,23 +5031,13 @@
int ret = 0;
int ret2 = 0;
struct ext4_map_blocks map;
- unsigned int credits, blkbits = inode->i_blkbits;
+ unsigned int blkbits = inode->i_blkbits;
+ unsigned int credits = 0;
map.m_lblk = offset >> blkbits;
max_blocks = EXT4_MAX_BLOCKS(len, offset, blkbits);
- /*
- * This is somewhat ugly but the idea is clear: When transaction is
- * reserved, everything goes into it. Otherwise we rather start several
- * smaller transactions for conversion of each extent separately.
- */
- if (handle) {
- handle = ext4_journal_start_reserved(handle,
- EXT4_HT_EXT_CONVERT);
- if (IS_ERR(handle))
- return PTR_ERR(handle);
- credits = 0;
- } else {
+ if (!handle) {
/*
* credits to insert 1 extent into extent tree
*/
@@ -5078,11 +5068,40 @@
if (ret <= 0 || ret2)
break;
}
- if (!credits)
- ret2 = ext4_journal_stop(handle);
return ret > 0 ? ret2 : ret;
}
+int ext4_convert_unwritten_io_end_vec(handle_t *handle, ext4_io_end_t *io_end)
+{
+ int ret, err = 0;
+ struct ext4_io_end_vec *io_end_vec;
+
+ /*
+ * This is somewhat ugly but the idea is clear: When transaction is
+	 * reserved, everything goes into it. Otherwise we start several
+	 * smaller transactions, one for the conversion of each extent.
+ */
+ if (handle) {
+ handle = ext4_journal_start_reserved(handle,
+ EXT4_HT_EXT_CONVERT);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+ }
+
+ list_for_each_entry(io_end_vec, &io_end->list_vec, list) {
+ ret = ext4_convert_unwritten_extents(handle, io_end->inode,
+ io_end_vec->offset,
+ io_end_vec->size);
+ if (ret)
+ break;
+ }
+
+ if (handle)
+ err = ext4_journal_stop(handle);
+
+ return ret < 0 ? ret : err;
+}
+
/*
* If newes is not existing extent (newes->ec_pblk equals zero) find
* delayed extent at start of newes and update newes accordingly and
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 52d155b..adcd424 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -373,15 +373,17 @@
static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
{
struct inode *inode = file->f_mapping->host;
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ struct dax_device *dax_dev = sbi->s_daxdev;
- if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
+ if (unlikely(ext4_forced_shutdown(sbi)))
return -EIO;
/*
- * We don't support synchronous mappings for non-DAX files. At least
- * until someone comes with a sensible use case.
+ * We don't support synchronous mappings for non-DAX files and
+	 * for DAX files if the underlying dax_device is not synchronous.
*/
- if (!IS_DAX(file_inode(file)) && (vma->vm_flags & VM_SYNC))
+ if (!daxdev_mapping_supported(vma, dax_dev))
return -EOPNOTSUPP;
file_accessed(file);
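
ext4_file_mmap() now defers the VM_SYNC check to daxdev_mapping_supported(), so a MAP_SYNC mapping is only granted when the backing dax_device was registered with DAXDEV_F_SYNC (as pmem and dcssblk do elsewhere in this series). A hedged user-space sketch of requesting such a mapping follows; the /mnt/pmem/file path is an assumption, and on a non-synchronous device the mmap is expected to fail with EOPNOTSUPP.

```c
/* Sketch: request a synchronous DAX mapping with MAP_SYNC.
 * /mnt/pmem/file is an assumed path on a DAX-mounted ext4 filesystem. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE	0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC		0x080000
#endif

int main(void)
{
	int fd = open("/mnt/pmem/file", O_RDWR);
	void *p;

	if (fd < 0)
		return 1;

	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		 MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	if (p == MAP_FAILED)
		perror("mmap(MAP_SYNC)");	/* EOPNOTSUPP if not sync DAX */
	else
		munmap(p, 4096);

	close(fd);
	return 0;
}
```
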
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3b1a759..8c01c71 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2342,6 +2342,79 @@
}
/*
+ * mpage_process_page - update page buffers corresponding to changed extent and
+ * may submit fully mapped page for IO
+ *
+ * @mpd - description of extent to map, on return next extent to map
+ * @m_lblk - logical block mapping.
+ * @m_pblk - corresponding physical mapping.
+ * @map_bh - determines on return whether this page requires any further
+ * mapping or not.
+ * Scan given page buffers corresponding to changed extent and update buffer
+ * state according to new extent state.
+ * We map delalloc buffers to their physical location, clear unwritten bits.
+ * If the given page is not fully mapped, we update @map to the next extent in
+ * the given page that needs mapping & return @map_bh as true.
+ */
+static int mpage_process_page(struct mpage_da_data *mpd, struct page *page,
+ ext4_lblk_t *m_lblk, ext4_fsblk_t *m_pblk,
+ bool *map_bh)
+{
+ struct buffer_head *head, *bh;
+ ext4_io_end_t *io_end = mpd->io_submit.io_end;
+ ext4_lblk_t lblk = *m_lblk;
+ ext4_fsblk_t pblock = *m_pblk;
+ int err = 0;
+ int blkbits = mpd->inode->i_blkbits;
+ ssize_t io_end_size = 0;
+ struct ext4_io_end_vec *io_end_vec = ext4_last_io_end_vec(io_end);
+
+ bh = head = page_buffers(page);
+ do {
+ if (lblk < mpd->map.m_lblk)
+ continue;
+ if (lblk >= mpd->map.m_lblk + mpd->map.m_len) {
+ /*
+ * Buffer after end of mapped extent.
+ * Find next buffer in the page to map.
+ */
+ mpd->map.m_len = 0;
+ mpd->map.m_flags = 0;
+ io_end_vec->size += io_end_size;
+ io_end_size = 0;
+
+ err = mpage_process_page_bufs(mpd, head, bh, lblk);
+ if (err > 0)
+ err = 0;
+ if (!err && mpd->map.m_len && mpd->map.m_lblk > lblk) {
+ io_end_vec = ext4_alloc_io_end_vec(io_end);
+ if (IS_ERR(io_end_vec)) {
+ err = PTR_ERR(io_end_vec);
+ goto out;
+ }
+ io_end_vec->offset = mpd->map.m_lblk << blkbits;
+ }
+ *map_bh = true;
+ goto out;
+ }
+ if (buffer_delay(bh)) {
+ clear_buffer_delay(bh);
+ bh->b_blocknr = pblock++;
+ }
+ clear_buffer_unwritten(bh);
+ io_end_size += (1 << blkbits);
+ } while (lblk++, (bh = bh->b_this_page) != head);
+
+ io_end_vec->size += io_end_size;
+ io_end_size = 0;
+ *map_bh = false;
+out:
+ *m_lblk = lblk;
+ *m_pblk = pblock;
+ return err;
+}
+
+/*
* mpage_map_buffers - update buffers corresponding to changed extent and
* submit fully mapped pages for IO
*
@@ -2360,12 +2433,12 @@
struct pagevec pvec;
int nr_pages, i;
struct inode *inode = mpd->inode;
- struct buffer_head *head, *bh;
int bpp_bits = PAGE_SHIFT - inode->i_blkbits;
pgoff_t start, end;
ext4_lblk_t lblk;
- sector_t pblock;
+ ext4_fsblk_t pblock;
int err;
+ bool map_bh = false;
start = mpd->map.m_lblk >> bpp_bits;
end = (mpd->map.m_lblk + mpd->map.m_len - 1) >> bpp_bits;
@@ -2381,50 +2454,19 @@
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
- bh = head = page_buffers(page);
- do {
- if (lblk < mpd->map.m_lblk)
- continue;
- if (lblk >= mpd->map.m_lblk + mpd->map.m_len) {
- /*
- * Buffer after end of mapped extent.
- * Find next buffer in the page to map.
- */
- mpd->map.m_len = 0;
- mpd->map.m_flags = 0;
- /*
- * FIXME: If dioread_nolock supports
- * blocksize < pagesize, we need to make
- * sure we add size mapped so far to
- * io_end->size as the following call
- * can submit the page for IO.
- */
- err = mpage_process_page_bufs(mpd, head,
- bh, lblk);
- pagevec_release(&pvec);
- if (err > 0)
- err = 0;
- return err;
- }
- if (buffer_delay(bh)) {
- clear_buffer_delay(bh);
- bh->b_blocknr = pblock++;
- }
- clear_buffer_unwritten(bh);
- } while (lblk++, (bh = bh->b_this_page) != head);
-
+ err = mpage_process_page(mpd, page, &lblk, &pblock,
+ &map_bh);
/*
- * FIXME: This is going to break if dioread_nolock
- * supports blocksize < pagesize as we will try to
- * convert potentially unmapped parts of inode.
+		 * If map_bh is true, the page may require further bh mapping,
+		 * or it may already have been submitted for IO. Either way,
+		 * we return to map the next extent.
*/
- mpd->io_submit.io_end->size += PAGE_SIZE;
+ if (err < 0 || map_bh == true)
+ goto out;
/* Page fully mapped - let IO run! */
err = mpage_submit_page(mpd, page);
- if (err < 0) {
- pagevec_release(&pvec);
- return err;
- }
+ if (err < 0)
+ goto out;
}
pagevec_release(&pvec);
}
@@ -2432,6 +2474,9 @@
mpd->map.m_len = 0;
mpd->map.m_flags = 0;
return 0;
+out:
+ pagevec_release(&pvec);
+ return err;
}
static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
@@ -2515,9 +2560,13 @@
int err;
loff_t disksize;
int progress = 0;
+ ext4_io_end_t *io_end = mpd->io_submit.io_end;
+ struct ext4_io_end_vec *io_end_vec;
- mpd->io_submit.io_end->offset =
- ((loff_t)map->m_lblk) << inode->i_blkbits;
+ io_end_vec = ext4_alloc_io_end_vec(io_end);
+ if (IS_ERR(io_end_vec))
+ return PTR_ERR(io_end_vec);
+ io_end_vec->offset = ((loff_t)map->m_lblk) << inode->i_blkbits;
do {
err = mpage_map_one_extent(handle, mpd);
if (err < 0) {
@@ -3640,6 +3689,7 @@
ssize_t size, void *private)
{
ext4_io_end_t *io_end = private;
+ struct ext4_io_end_vec *io_end_vec;
/* if not async direct IO just return */
if (!io_end)
@@ -3657,8 +3707,9 @@
ext4_clear_io_unwritten_flag(io_end);
size = 0;
}
- io_end->offset = offset;
- io_end->size = size;
+ io_end_vec = ext4_alloc_io_end_vec(io_end);
+ io_end_vec->offset = offset;
+ io_end_vec->size = size;
ext4_put_io_end(io_end);
return 0;
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 9cc79b7..92860e6 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -31,18 +31,56 @@
#include "acl.h"
static struct kmem_cache *io_end_cachep;
+static struct kmem_cache *io_end_vec_cachep;
int __init ext4_init_pageio(void)
{
io_end_cachep = KMEM_CACHE(ext4_io_end, SLAB_RECLAIM_ACCOUNT);
if (io_end_cachep == NULL)
return -ENOMEM;
+
+ io_end_vec_cachep = KMEM_CACHE(ext4_io_end_vec, 0);
+ if (io_end_vec_cachep == NULL) {
+ kmem_cache_destroy(io_end_cachep);
+ return -ENOMEM;
+ }
return 0;
}
void ext4_exit_pageio(void)
{
kmem_cache_destroy(io_end_cachep);
+ kmem_cache_destroy(io_end_vec_cachep);
+}
+
+struct ext4_io_end_vec *ext4_alloc_io_end_vec(ext4_io_end_t *io_end)
+{
+ struct ext4_io_end_vec *io_end_vec;
+
+ io_end_vec = kmem_cache_zalloc(io_end_vec_cachep, GFP_NOFS);
+ if (!io_end_vec)
+ return ERR_PTR(-ENOMEM);
+ INIT_LIST_HEAD(&io_end_vec->list);
+ list_add_tail(&io_end_vec->list, &io_end->list_vec);
+ return io_end_vec;
+}
+
+static void ext4_free_io_end_vec(ext4_io_end_t *io_end)
+{
+ struct ext4_io_end_vec *io_end_vec, *tmp;
+
+ if (list_empty(&io_end->list_vec))
+ return;
+ list_for_each_entry_safe(io_end_vec, tmp, &io_end->list_vec, list) {
+ list_del(&io_end_vec->list);
+ kmem_cache_free(io_end_vec_cachep, io_end_vec);
+ }
+}
+
+struct ext4_io_end_vec *ext4_last_io_end_vec(ext4_io_end_t *io_end)
+{
+ BUG_ON(list_empty(&io_end->list_vec));
+ return list_last_entry(&io_end->list_vec, struct ext4_io_end_vec, list);
}
/*
@@ -133,6 +171,7 @@
ext4_finish_bio(bio);
bio_put(bio);
}
+ ext4_free_io_end_vec(io_end);
kmem_cache_free(io_end_cachep, io_end);
}
@@ -144,29 +183,26 @@
* cannot get to ext4_ext_truncate() before all IOs overlapping that range are
* completed (happens from ext4_free_ioend()).
*/
-static int ext4_end_io(ext4_io_end_t *io)
+static int ext4_end_io_end(ext4_io_end_t *io_end)
{
- struct inode *inode = io->inode;
- loff_t offset = io->offset;
- ssize_t size = io->size;
- handle_t *handle = io->handle;
+ struct inode *inode = io_end->inode;
+ handle_t *handle = io_end->handle;
int ret = 0;
- ext4_debug("ext4_end_io_nolock: io 0x%p from inode %lu,list->next 0x%p,"
+ ext4_debug("ext4_end_io_nolock: io_end 0x%p from inode %lu,list->next 0x%p,"
"list->prev 0x%p\n",
- io, inode->i_ino, io->list.next, io->list.prev);
+ io_end, inode->i_ino, io_end->list.next, io_end->list.prev);
- io->handle = NULL; /* Following call will use up the handle */
- ret = ext4_convert_unwritten_extents(handle, inode, offset, size);
+ io_end->handle = NULL; /* Following call will use up the handle */
+ ret = ext4_convert_unwritten_io_end_vec(handle, io_end);
if (ret < 0 && !ext4_forced_shutdown(EXT4_SB(inode->i_sb))) {
ext4_msg(inode->i_sb, KERN_EMERG,
"failed to convert unwritten extents to written "
"extents -- potential data loss! "
- "(inode %lu, offset %llu, size %zd, error %d)",
- inode->i_ino, offset, size, ret);
+ "(inode %lu, error %d)", inode->i_ino, ret);
}
- ext4_clear_io_unwritten_flag(io);
- ext4_release_io_end(io);
+ ext4_clear_io_unwritten_flag(io_end);
+ ext4_release_io_end(io_end);
return ret;
}
@@ -174,21 +210,21 @@
{
#ifdef EXT4FS_DEBUG
struct list_head *cur, *before, *after;
- ext4_io_end_t *io, *io0, *io1;
+ ext4_io_end_t *io_end, *io_end0, *io_end1;
if (list_empty(head))
return;
ext4_debug("Dump inode %lu completed io list\n", inode->i_ino);
- list_for_each_entry(io, head, list) {
- cur = &io->list;
+ list_for_each_entry(io_end, head, list) {
+ cur = &io_end->list;
before = cur->prev;
- io0 = container_of(before, ext4_io_end_t, list);
+ io_end0 = container_of(before, ext4_io_end_t, list);
after = cur->next;
- io1 = container_of(after, ext4_io_end_t, list);
+ io_end1 = container_of(after, ext4_io_end_t, list);
ext4_debug("io 0x%p from inode %lu,prev 0x%p,next 0x%p\n",
- io, inode->i_ino, io0, io1);
+ io_end, inode->i_ino, io_end0, io_end1);
}
#endif
}
@@ -215,7 +251,7 @@
static int ext4_do_flush_completed_IO(struct inode *inode,
struct list_head *head)
{
- ext4_io_end_t *io;
+ ext4_io_end_t *io_end;
struct list_head unwritten;
unsigned long flags;
struct ext4_inode_info *ei = EXT4_I(inode);
@@ -227,11 +263,11 @@
spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
while (!list_empty(&unwritten)) {
- io = list_entry(unwritten.next, ext4_io_end_t, list);
- BUG_ON(!(io->flag & EXT4_IO_END_UNWRITTEN));
- list_del_init(&io->list);
+ io_end = list_entry(unwritten.next, ext4_io_end_t, list);
+ BUG_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
+ list_del_init(&io_end->list);
- err = ext4_end_io(io);
+ err = ext4_end_io_end(io_end);
if (unlikely(!ret && err))
ret = err;
}
@@ -250,19 +286,22 @@
ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
{
- ext4_io_end_t *io = kmem_cache_zalloc(io_end_cachep, flags);
- if (io) {
- io->inode = inode;
- INIT_LIST_HEAD(&io->list);
- atomic_set(&io->count, 1);
+ ext4_io_end_t *io_end = kmem_cache_zalloc(io_end_cachep, flags);
+
+ if (io_end) {
+ io_end->inode = inode;
+ INIT_LIST_HEAD(&io_end->list);
+ INIT_LIST_HEAD(&io_end->list_vec);
+ atomic_set(&io_end->count, 1);
}
- return io;
+ return io_end;
}
void ext4_put_io_end_defer(ext4_io_end_t *io_end)
{
if (atomic_dec_and_test(&io_end->count)) {
- if (!(io_end->flag & EXT4_IO_END_UNWRITTEN) || !io_end->size) {
+ if (!(io_end->flag & EXT4_IO_END_UNWRITTEN) ||
+ list_empty(&io_end->list_vec)) {
ext4_release_io_end(io_end);
return;
}
@@ -276,9 +315,8 @@
if (atomic_dec_and_test(&io_end->count)) {
if (io_end->flag & EXT4_IO_END_UNWRITTEN) {
- err = ext4_convert_unwritten_extents(io_end->handle,
- io_end->inode, io_end->offset,
- io_end->size);
+ err = ext4_convert_unwritten_io_end_vec(io_end->handle,
+ io_end);
io_end->handle = NULL;
ext4_clear_io_unwritten_flag(io_end);
}
@@ -315,10 +353,8 @@
struct inode *inode = io_end->inode;
ext4_warning(inode->i_sb, "I/O error %d writing to inode %lu "
- "(offset %llu size %ld starting block %llu)",
+ "starting block %llu)",
bio->bi_status, inode->i_ino,
- (unsigned long long) io_end->offset,
- (long) io_end->size,
(unsigned long long)
bi_sector >> (inode->i_blkbits - 9));
mapping_set_error(inode->i_mapping,
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 93c14ec..e04d9ba 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1538,6 +1538,7 @@
{Opt_auto_da_alloc, "auto_da_alloc"},
{Opt_noauto_da_alloc, "noauto_da_alloc"},
{Opt_dioread_nolock, "dioread_nolock"},
+ {Opt_dioread_lock, "nodioread_nolock"},
{Opt_dioread_lock, "dioread_lock"},
{Opt_discard, "discard"},
{Opt_nodiscard, "nodiscard"},
@@ -2024,7 +2025,7 @@
unsigned int *journal_ioprio,
int is_remount)
{
- struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_sb_info __maybe_unused *sbi = EXT4_SB(sb);
char *p, __maybe_unused *usr_qf_name, __maybe_unused *grp_qf_name;
substring_t args[MAX_OPT_ARGS];
int token;
@@ -2078,16 +2079,6 @@
}
}
#endif
- if (test_opt(sb, DIOREAD_NOLOCK)) {
- int blocksize =
- BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
-
- if (blocksize < PAGE_SIZE) {
- ext4_msg(sb, KERN_ERR, "can't mount with "
- "dioread_nolock if block size != PAGE_SIZE");
- return 0;
- }
- }
return 1;
}
@@ -3701,6 +3692,7 @@
set_opt(sb, NO_UID32);
/* xattr user namespace & acls are now defaulted on */
set_opt(sb, XATTR_USER);
+ set_opt(sb, DIOREAD_NOLOCK);
#ifdef CONFIG_EXT4_FS_POSIX_ACL
set_opt(sb, POSIX_ACL);
#endif
@@ -3838,9 +3830,8 @@
goto failed_mount;
if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA) {
- printk_once(KERN_WARNING "EXT4-fs: Warning: mounting "
- "with data=journal disables delayed "
- "allocation and O_DIRECT support!\n");
+ printk_once(KERN_WARNING "EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, and O_DIRECT support!\n");
+ clear_opt(sb, DIOREAD_NOLOCK);
if (test_opt2(sb, EXPLICIT_DELALLOC)) {
ext4_msg(sb, KERN_ERR, "can't mount with "
"both data=journal and delalloc");
diff --git a/fs/file_table.c b/fs/file_table.c
index e49af4c..023fd1e 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -276,6 +276,9 @@
}
if (file->f_op->release)
file->f_op->release(inode, file);
+
+ security_file_pre_free(file);
+
if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
!(file->f_mode & FMODE_PATH))) {
cdev_put(inode->i_cdev);
diff --git a/fs/namei.c b/fs/namei.c
index 327844f..5742d73 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -885,12 +885,65 @@
path_put(&last->link);
}
-int sysctl_protected_symlinks __read_mostly = 0;
-int sysctl_protected_hardlinks __read_mostly = 0;
+int sysctl_protected_symlinks __read_mostly = 1;
+int sysctl_protected_hardlinks __read_mostly = 1;
int sysctl_protected_fifos __read_mostly;
int sysctl_protected_regular __read_mostly;
/**
+ * nameidata_set_temporary - Used by Chromium OS LSM to check
+ * whether a mount point includes traversing symlinks.
+ */
+int nameidata_set_temporary(const char __user *dir_name)
+{
+ struct nameidata *tmp;
+ struct filename *name;
+
+ tmp = kmalloc(sizeof(*tmp), GFP_KERNEL);
+ if (unlikely(!tmp))
+ return -ENOMEM;
+ name = getname_flags(dir_name, LOOKUP_FOLLOW, NULL);
+ if (IS_ERR(name)) {
+ kfree(tmp);
+ return PTR_ERR(name);
+ }
+ set_nameidata(tmp, AT_FDCWD, name);
+ return 0;
+}
+
+/**
+ * nameidata_restore_temporary - Used by Chromium OS LSM to check
+ * whether a mount point includes traversing symlinks.
+ */
+void nameidata_restore_temporary(void)
+{
+ struct nameidata *tmp = current->nameidata;
+
+ restore_nameidata();
+ putname(tmp->name);
+ kfree(tmp);
+}
+
+/**
+ * nameidata_get_total_link_count - Used by security/chromiumos/lsm.c to check
+ * whether a mount point includes traversing symlinks.
+ */
+int nameidata_get_total_link_count(void)
+{
+ struct nameidata *tmp = current->nameidata;
+
+ if (unlikely(!tmp)) {
+ WARN(1, "Unexpectedly got here with current->nameidata == NULL");
+ /* Pretend we did traverse symlinks, that is the safe/sane
+ * result here from a security point of view...
+ */
+ return MAXSYMLINKS;
+ }
+ return tmp->total_link_count;
+}
+EXPORT_SYMBOL(nameidata_get_total_link_count);
+
+/**
* may_follow_link - Check symlink following for unsafe situations
* @nd: nameidata pathwalk data
*
diff --git a/fs/namespace.c b/fs/namespace.c
index 741f40c..dee73c0 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -719,8 +719,14 @@
goto done;
}
- if (!new)
- new = kmalloc(sizeof(struct mountpoint), GFP_KERNEL);
+ if (!new) {
+ /*
+ * We are allocating as GFP_NOFS to appease lockdep:
+ * since we are holding i_mutex we should not try to
+ * recurse into filesystem code.
+ */
+ new = kmalloc(sizeof(struct mountpoint), GFP_NOFS);
+ }
if (!new)
return ERR_PTR(-ENOMEM);
@@ -2736,12 +2742,19 @@
return -EINVAL;
/* ... and get the mountpoint */
- retval = user_path(dir_name, &path);
+ retval = nameidata_set_temporary(dir_name);
if (retval)
return retval;
+ retval = user_path(dir_name, &path);
+ if (retval) {
+ nameidata_restore_temporary();
+ return retval;
+ }
+
retval = security_sb_mount(dev_name, &path,
type_page, flags, data_page);
+ nameidata_restore_temporary();
if (!retval && !may_mount())
retval = -EPERM;
if (!retval && (flags & SB_MANDLOCK) && !may_mandlock())
diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index 97a5169..61a440b 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -702,6 +702,8 @@
struct fsnotify_group *group;
struct inode *inode;
struct path path;
+ struct path alteredpath;
+ struct path *canonical_path = &path;
struct fd f;
int ret;
unsigned flags = 0;
@@ -747,13 +749,22 @@
if (ret)
goto fput_and_out;
+ /* support stacked filesystems */
+	if (path.dentry && path.dentry->d_op) {
+ if (path.dentry->d_op->d_canonical_path) {
+ path.dentry->d_op->d_canonical_path(&path, &alteredpath);
+ canonical_path = &alteredpath;
+ path_put(&path);
+ }
+ }
+
/* inode held in place by reference to path; group by fget on fd */
- inode = path.dentry->d_inode;
+ inode = canonical_path->dentry->d_inode;
group = f.file->private_data;
/* create/update an inode mark */
ret = inotify_update_watch(group, inode, mask);
- path_put(&path);
+ path_put(canonical_path);
fput_and_out:
fdput(f);
return ret;
diff --git a/fs/nsfs.c b/fs/nsfs.c
index 30d150a4..ceab3a5 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -246,6 +246,7 @@
fput(file);
return ERR_PTR(-EINVAL);
}
+EXPORT_SYMBOL(proc_ns_fget);
static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry)
{
diff --git a/fs/open.c b/fs/open.c
index 76996f9..0d2bd0a 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -34,8 +34,11 @@
#include "internal.h"
-int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
- struct file *filp)
+#define CREATE_TRACE_POINTS
+#include <trace/events/fs.h>
+
+int do_truncate2(struct vfsmount *mnt, struct dentry *dentry, loff_t length,
+ unsigned int time_attrs, struct file *filp)
{
int ret;
struct iattr newattrs;
@@ -65,6 +68,12 @@
return ret;
}
+int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
+ struct file *filp)
+{
+ return do_truncate2(NULL, dentry, length, time_attrs, filp);
+}
+
long vfs_truncate(const struct path *path, loff_t length)
{
struct inode *inode;
@@ -1089,6 +1098,7 @@
} else {
fsnotify_open(f);
fd_install(fd, f);
+ trace_do_sys_open(tmp->name, flags, mode);
}
}
putname(tmp);
diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 817c02b..4d96a7c 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -97,3 +97,10 @@
Say Y if you are running any user-space software which takes benefit from
this interface. For example, rkt is such a piece of software.
+
+config PROC_UID
+ bool "Include /proc/uid/ files"
+ default y
+ depends on PROC_FS && RT_MUTEXES
+ help
+ Provides aggregated per-uid information under /proc/uid.
diff --git a/fs/proc/Makefile b/fs/proc/Makefile
index ead487e..3f849ca 100644
--- a/fs/proc/Makefile
+++ b/fs/proc/Makefile
@@ -27,6 +27,7 @@
proc-y += namespaces.o
proc-y += self.o
proc-y += thread_self.o
+proc-$(CONFIG_PROC_UID) += uid.o
proc-$(CONFIG_PROC_SYSCTL) += proc_sysctl.o
proc-$(CONFIG_NET) += proc_net.o
proc-$(CONFIG_PROC_KCORE) += kcore.o
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 3b9b726..40089b8 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -144,6 +144,12 @@
NULL, &proc_single_file_operations, \
{ .proc_show = show } )
+#ifdef CONFIG_SECURITY_CHROMIUMOS_READONLY_PROC_SELF_MEM
+# define PROC_PID_MEM_MODE S_IRUSR
+#else
+# define PROC_PID_MEM_MODE (S_IRUSR|S_IWUSR)
+#endif
+
/*
* Count the number of hardlinks for the pid_entry table, excluding the .
* and .. links.
@@ -876,7 +882,11 @@
static ssize_t mem_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
+#ifdef CONFIG_SECURITY_CHROMIUMOS_READONLY_PROC_SELF_MEM
+ return -EACCES;
+#else
return mem_rw(file, (char __user*)buf, count, ppos, 1);
+#endif
}
loff_t mem_lseek(struct file *file, loff_t offset, int orig)
@@ -2386,10 +2396,13 @@
return -ESRCH;
if (p != current) {
- if (!capable(CAP_SYS_NICE)) {
+ rcu_read_lock();
+ if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {
+ rcu_read_unlock();
count = -EPERM;
goto out;
}
+ rcu_read_unlock();
err = security_task_setscheduler(p);
if (err) {
@@ -2422,11 +2435,14 @@
return -ESRCH;
if (p != current) {
-
- if (!capable(CAP_SYS_NICE)) {
+ rcu_read_lock();
+ if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {
+ rcu_read_unlock();
err = -EPERM;
goto out;
}
+ rcu_read_unlock();
+
err = security_task_getscheduler(p);
if (err)
goto out;
@@ -2977,7 +2993,7 @@
#ifdef CONFIG_NUMA
REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations),
#endif
- REG("mem", S_IRUSR|S_IWUSR, proc_mem_operations),
+ REG("mem", PROC_PID_MEM_MODE, proc_mem_operations),
LNK("cwd", proc_cwd_link),
LNK("root", proc_root_link),
LNK("exe", proc_exe_link),
@@ -2989,6 +3005,7 @@
REG("smaps", S_IRUGO, proc_pid_smaps_operations),
REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
REG("pagemap", S_IRUSR, proc_pagemap_operations),
+ REG("totmaps", S_IRUGO, proc_totmaps_operations),
#endif
#ifdef CONFIG_SECURITY
DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
@@ -3363,7 +3380,7 @@
#ifdef CONFIG_NUMA
REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations),
#endif
- REG("mem", S_IRUSR|S_IWUSR, proc_mem_operations),
+ REG("mem", PROC_PID_MEM_MODE, proc_mem_operations),
LNK("cwd", proc_cwd_link),
LNK("root", proc_root_link),
LNK("exe", proc_exe_link),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 95b1419..a4cd4f5 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -84,6 +84,9 @@
struct task_struct *task);
};
+
+extern const struct file_operations proc_totmaps_operations;
+
struct proc_inode {
struct pid *pid;
unsigned int fd;
@@ -258,6 +261,15 @@
#endif
/*
+ * uid.c
+ */
+#ifdef CONFIG_PROC_UID
+extern int proc_uid_init(void);
+#else
+static inline void proc_uid_init(void) { }
+#endif
+
+/*
* proc_tty.c
*/
#ifdef CONFIG_TTY
@@ -285,6 +297,7 @@
struct mm_struct *mm;
#ifdef CONFIG_MMU
struct vm_area_struct *tail_vma;
+ struct mem_size_stats *mss;
#endif
#ifdef CONFIG_NUMA
struct mempolicy *task_mempolicy;
diff --git a/fs/proc/root.c b/fs/proc/root.c
index f4b1a9d..efc63a6 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -130,6 +130,7 @@
proc_symlink("mounts", NULL, "self/mounts");
proc_net_init();
+ proc_uid_init();
proc_mkdir("fs", NULL);
proc_mkdir("driver", NULL);
proc_create_mount_point("fs/nfsd"); /* somewhere for the nfsd filesystem to be mounted */
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index efa6273..006ae59 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -123,6 +123,56 @@
}
#endif
+static void seq_print_vma_name(struct seq_file *m, struct vm_area_struct *vma)
+{
+ const char __user *name = vma_get_anon_name(vma);
+ struct mm_struct *mm = vma->vm_mm;
+
+ unsigned long page_start_vaddr;
+ unsigned long page_offset;
+ unsigned long num_pages;
+ unsigned long max_len = NAME_MAX;
+ int i;
+
+ page_start_vaddr = (unsigned long)name & PAGE_MASK;
+ page_offset = (unsigned long)name - page_start_vaddr;
+ num_pages = DIV_ROUND_UP(page_offset + max_len, PAGE_SIZE);
+
+ seq_puts(m, "[anon:");
+
+ for (i = 0; i < num_pages; i++) {
+ int len;
+ int write_len;
+ const char *kaddr;
+ long pages_pinned;
+ struct page *page;
+
+ pages_pinned = get_user_pages_remote(current, mm,
+ page_start_vaddr, 1, 0, &page, NULL, NULL);
+ if (pages_pinned < 1) {
+ seq_puts(m, "<fault>]");
+ return;
+ }
+
+ kaddr = (const char *)kmap(page);
+ len = min(max_len, PAGE_SIZE - page_offset);
+ write_len = strnlen(kaddr + page_offset, len);
+ seq_write(m, kaddr + page_offset, write_len);
+ kunmap(page);
+ put_page(page);
+
+ /* if strnlen hit a null terminator then we're done */
+ if (write_len != len)
+ break;
+
+ max_len -= len;
+ page_offset = 0;
+ page_start_vaddr += PAGE_SIZE;
+ }
+
+ seq_putc(m, ']');
+}
+
static void vma_stop(struct proc_maps_private *priv)
{
struct mm_struct *mm = priv->mm;
@@ -348,8 +398,15 @@
goto done;
}
- if (is_stack(vma))
+ if (is_stack(vma)) {
name = "[stack]";
+ goto done;
+ }
+
+ if (vma_get_anon_name(vma)) {
+ seq_pad(m, ' ');
+ seq_print_vma_name(m, vma);
+ }
}
done:
@@ -421,17 +478,53 @@
unsigned long shared_hugetlb;
unsigned long private_hugetlb;
u64 pss;
+ u64 pss_anon;
+ u64 pss_file;
+ u64 pss_shmem;
u64 pss_locked;
u64 swap_pss;
bool check_shmem_swap;
};
+static void smaps_page_accumulate(struct mem_size_stats *mss,
+ struct page *page, unsigned long size, unsigned long pss,
+ bool dirty, bool locked, bool private)
+{
+ mss->pss += pss;
+
+ if (PageAnon(page))
+ mss->pss_anon += pss;
+ else if (PageSwapBacked(page))
+ mss->pss_shmem += pss;
+ else
+ mss->pss_file += pss;
+
+ if (locked)
+ mss->pss_locked += pss;
+
+ if (dirty || PageDirty(page)) {
+ if (private)
+ mss->private_dirty += size;
+ else
+ mss->shared_dirty += size;
+ } else {
+ if (private)
+ mss->private_clean += size;
+ else
+ mss->shared_clean += size;
+ }
+}
+
static void smaps_account(struct mem_size_stats *mss, struct page *page,
bool compound, bool young, bool dirty, bool locked)
{
int i, nr = compound ? 1 << compound_order(page) : 1;
unsigned long size = nr * PAGE_SIZE;
+ /*
+ * First accumulate quantities that depend only on |size| and the type
+ * of the compound page.
+ */
if (PageAnon(page)) {
mss->anonymous += size;
if (!PageSwapBacked(page) && !dirty && !PageDirty(page))
@@ -444,42 +537,26 @@
mss->referenced += size;
/*
+ * Then accumulate quantities that may depend on sharing, or that may
+ * differ page-by-page.
+ *
* page_count(page) == 1 guarantees the page is mapped exactly once.
* If any subpage of the compound page mapped with PTE it would elevate
* page_count().
*/
if (page_count(page) == 1) {
- if (dirty || PageDirty(page))
- mss->private_dirty += size;
- else
- mss->private_clean += size;
- mss->pss += (u64)size << PSS_SHIFT;
- if (locked)
- mss->pss_locked += (u64)size << PSS_SHIFT;
+ smaps_page_accumulate(mss, page, size, size << PSS_SHIFT, dirty,
+ locked, true);
return;
}
-
for (i = 0; i < nr; i++, page++) {
int mapcount = page_mapcount(page);
- unsigned long pss = (PAGE_SIZE << PSS_SHIFT);
+ bool private = mapcount < 2;
+ unsigned long pss = private ? PAGE_SIZE << PSS_SHIFT :
+ (PAGE_SIZE << PSS_SHIFT) / mapcount;
- if (mapcount >= 2) {
- if (dirty || PageDirty(page))
- mss->shared_dirty += PAGE_SIZE;
- else
- mss->shared_clean += PAGE_SIZE;
- mss->pss += pss / mapcount;
- if (locked)
- mss->pss_locked += pss / mapcount;
- } else {
- if (dirty || PageDirty(page))
- mss->private_dirty += PAGE_SIZE;
- else
- mss->private_clean += PAGE_SIZE;
- mss->pss += pss;
- if (locked)
- mss->pss_locked += pss;
- }
+ smaps_page_accumulate(mss, page, PAGE_SIZE, pss,
+ dirty, locked, private);
}
}
@@ -758,10 +835,21 @@
seq_put_decimal_ull_width(m, str, (val) >> 10, 8)
/* Show the contents common for smaps and smaps_rollup */
-static void __show_smap(struct seq_file *m, const struct mem_size_stats *mss)
+static void __show_smap(struct seq_file *m, const struct mem_size_stats *mss,
+ bool rollup_mode)
{
SEQ_PUT_DEC("Rss: ", mss->resident);
SEQ_PUT_DEC(" kB\nPss: ", mss->pss >> PSS_SHIFT);
+ if (rollup_mode) {
+		/* These are meaningful only for smaps_rollup; otherwise two of
+		 * them are zero, and the other one is the same as Pss. */
+ SEQ_PUT_DEC(" kB\nPss_Anon: ",
+ mss->pss_anon >> PSS_SHIFT);
+ SEQ_PUT_DEC(" kB\nPss_File: ",
+ mss->pss_file >> PSS_SHIFT);
+ SEQ_PUT_DEC(" kB\nPss_Shmem: ",
+ mss->pss_shmem >> PSS_SHIFT);
+ }
SEQ_PUT_DEC(" kB\nShared_Clean: ", mss->shared_clean);
SEQ_PUT_DEC(" kB\nShared_Dirty: ", mss->shared_dirty);
SEQ_PUT_DEC(" kB\nPrivate_Clean: ", mss->private_clean);
@@ -792,13 +880,18 @@
smap_gather_stats(vma, &mss);
show_map_vma(m, vma);
+ if (vma_get_anon_name(vma)) {
+ seq_puts(m, "Name: ");
+ seq_print_vma_name(m, vma);
+ seq_putc(m, '\n');
+ }
SEQ_PUT_DEC("Size: ", vma->vm_end - vma->vm_start);
SEQ_PUT_DEC(" kB\nKernelPageSize: ", vma_kernel_pagesize(vma));
SEQ_PUT_DEC(" kB\nMMUPageSize: ", vma_mmu_pagesize(vma));
seq_puts(m, " kB\n");
- __show_smap(m, &mss);
+ __show_smap(m, &mss, false);
seq_printf(m, "THPeligible: %d\n", transparent_hugepage_enabled(vma));
@@ -811,6 +904,84 @@
return 0;
}
+static void add_smaps_sum(struct mem_size_stats *mss,
+ struct mem_size_stats *mss_sum)
+{
+ mss_sum->resident += mss->resident;
+ mss_sum->pss += mss->pss;
+ mss_sum->pss_anon += mss->pss_anon;
+ mss_sum->pss_file += mss->pss_file;
+ mss_sum->pss_shmem += mss->pss_shmem;
+ mss_sum->shared_clean += mss->shared_clean;
+ mss_sum->shared_dirty += mss->shared_dirty;
+ mss_sum->private_clean += mss->private_clean;
+ mss_sum->private_dirty += mss->private_dirty;
+ mss_sum->referenced += mss->referenced;
+ mss_sum->anonymous += mss->anonymous;
+ mss_sum->anonymous_thp += mss->anonymous_thp;
+ mss_sum->swap += mss->swap;
+}
+
+static int totmaps_proc_show(struct seq_file *m, void *data)
+{
+ struct proc_maps_private *priv = m->private;
+ struct mm_struct *mm;
+ struct vm_area_struct *vma;
+ struct mem_size_stats *mss_sum = priv->mss;
+
+	/* A reference to priv->task was already taken at open time, but we
+	 * still need to get the mm here because the task could be in the
+	 * process of exiting. */
+ mm = get_task_mm(priv->task);
+ if (!mm || IS_ERR(mm))
+ return -EINVAL;
+
+ down_read(&mm->mmap_sem);
+ hold_task_mempolicy(priv);
+
+ for (vma = mm->mmap; vma != priv->tail_vma; vma = vma->vm_next) {
+ struct mem_size_stats mss;
+ struct mm_walk smaps_walk = {
+ .pmd_entry = smaps_pte_range,
+ .mm = vma->vm_mm,
+ .private = &mss,
+ };
+
+ if (vma->vm_mm && !is_vm_hugetlb_page(vma)) {
+ memset(&mss, 0, sizeof(mss));
+ walk_page_vma(vma, &smaps_walk);
+ add_smaps_sum(&mss, mss_sum);
+ }
+ }
+ seq_printf(m,
+ "Rss: %8lu kB\n"
+ "Pss: %8lu kB\n"
+ "Shared_Clean: %8lu kB\n"
+ "Shared_Dirty: %8lu kB\n"
+ "Private_Clean: %8lu kB\n"
+ "Private_Dirty: %8lu kB\n"
+ "Referenced: %8lu kB\n"
+ "Anonymous: %8lu kB\n"
+ "AnonHugePages: %8lu kB\n"
+ "Swap: %8lu kB\n",
+ mss_sum->resident >> 10,
+ (unsigned long)(mss_sum->pss >> (10 + PSS_SHIFT)),
+ mss_sum->shared_clean >> 10,
+ mss_sum->shared_dirty >> 10,
+ mss_sum->private_clean >> 10,
+ mss_sum->private_dirty >> 10,
+ mss_sum->referenced >> 10,
+ mss_sum->anonymous >> 10,
+ mss_sum->anonymous_thp >> 10,
+ mss_sum->swap >> 10);
+
+ release_task_mempolicy(priv);
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+
+ return 0;
+}
+
static int show_smaps_rollup(struct seq_file *m, void *v)
{
struct proc_maps_private *priv = m->private;
@@ -848,7 +1019,7 @@
seq_pad(m, ' ');
seq_puts(m, "[rollup]\n");
- __show_smap(m, &mss);
+ __show_smap(m, &mss, true);
release_task_mempolicy(priv);
up_read(&mm->mmap_sem);
@@ -916,6 +1087,50 @@
return single_release(inode, file);
}
+static int totmaps_open(struct inode *inode, struct file *file)
+{
+ struct proc_maps_private *priv;
+ int ret = -ENOMEM;
+ priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+ if (priv) {
+ priv->mss = kzalloc(sizeof(*priv->mss), GFP_KERNEL);
+		if (!priv->mss) {
+			kfree(priv);
+			return -ENOMEM;
+		}
+
+		/* We need to grab a reference to the task_struct at open
+		 * time, because there is a potential information leak if the
+		 * totmaps file is opened and held open while the underlying
+		 * pid-to-task mapping changes underneath it.
+		 */
+ priv->task = get_pid_task(proc_pid(inode), PIDTYPE_PID);
+ if (!priv->task) {
+ kfree(priv->mss);
+ kfree(priv);
+ return -ESRCH;
+ }
+
+ ret = single_open(file, totmaps_proc_show, priv);
+ if (ret) {
+ put_task_struct(priv->task);
+ kfree(priv->mss);
+ kfree(priv);
+ }
+ }
+ return ret;
+}
+
+static int totmaps_release(struct inode *inode, struct file *file)
+{
+ struct seq_file *m = file->private_data;
+ struct proc_maps_private *priv = m->private;
+
+ put_task_struct(priv->task);
+ kfree(priv->mss);
+ kfree(priv);
+ m->private = NULL;
+ return single_release(inode, file);
+}
+
const struct file_operations proc_pid_smaps_operations = {
.open = pid_smaps_open,
.read = seq_read,
@@ -930,6 +1145,13 @@
.release = smaps_rollup_release,
};
+const struct file_operations proc_totmaps_operations = {
+ .open = totmaps_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = totmaps_release,
+};
+
enum clear_refs_types {
CLEAR_REFS_ALL = 1,
CLEAR_REFS_ANON,
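
The new /proc/<pid>/totmaps file prints the aggregated per-process counters assembled by totmaps_proc_show(), and /proc/<pid>/smaps_rollup now also reports the Pss_Anon/Pss_File/Pss_Shmem split. A small sketch that dumps both for the current process is below; /proc/self/totmaps only exists on kernels carrying this patch.

```c
/* Sketch: dump the aggregated memory stats touched by this patch.
 * /proc/self/totmaps exists only with this change applied;
 * /proc/self/smaps_rollup is the upstream equivalent. */
#include <stdio.h>

static void dump(const char *path)
{
	char line[256];
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return;
	}
	printf("== %s ==\n", path);
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
}

int main(void)
{
	dump("/proc/self/totmaps");
	dump("/proc/self/smaps_rollup");
	return 0;
}
```
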
diff --git a/fs/proc/uid.c b/fs/proc/uid.c
new file mode 100644
index 0000000..ae720b9
--- /dev/null
+++ b/fs/proc/uid.c
@@ -0,0 +1,290 @@
+/*
+ * /proc/uid support
+ */
+
+#include <linux/fs.h>
+#include <linux/hashtable.h>
+#include <linux/init.h>
+#include <linux/proc_fs.h>
+#include <linux/rtmutex.h>
+#include <linux/sched.h>
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+#include "internal.h"
+
+static struct proc_dir_entry *proc_uid;
+
+#define UID_HASH_BITS 10
+
+static DECLARE_HASHTABLE(proc_uid_hash_table, UID_HASH_BITS);
+
+/*
+ * use rt_mutex here to avoid priority inversion between high-priority readers
+ * of these files and tasks calling proc_register_uid().
+ */
+static DEFINE_RT_MUTEX(proc_uid_lock); /* proc_uid_hash_table */
+
+struct uid_hash_entry {
+ uid_t uid;
+ struct hlist_node hash;
+};
+
+/* Caller must hold proc_uid_lock */
+static bool uid_hash_entry_exists_locked(uid_t uid)
+{
+ struct uid_hash_entry *entry;
+
+ hash_for_each_possible(proc_uid_hash_table, entry, hash, uid) {
+ if (entry->uid == uid)
+ return true;
+ }
+ return false;
+}
+
+void proc_register_uid(kuid_t kuid)
+{
+ struct uid_hash_entry *entry;
+ bool exists;
+ uid_t uid = from_kuid_munged(current_user_ns(), kuid);
+
+ rt_mutex_lock(&proc_uid_lock);
+ exists = uid_hash_entry_exists_locked(uid);
+ rt_mutex_unlock(&proc_uid_lock);
+ if (exists)
+ return;
+
+ entry = kzalloc(sizeof(struct uid_hash_entry), GFP_KERNEL);
+ if (!entry)
+ return;
+ entry->uid = uid;
+
+ rt_mutex_lock(&proc_uid_lock);
+ if (uid_hash_entry_exists_locked(uid))
+ kfree(entry);
+ else
+ hash_add(proc_uid_hash_table, &entry->hash, uid);
+ rt_mutex_unlock(&proc_uid_lock);
+}
+
+struct uid_entry {
+ const char *name;
+ int len;
+ umode_t mode;
+ const struct inode_operations *iop;
+ const struct file_operations *fop;
+};
+
+#define NOD(NAME, MODE, IOP, FOP) { \
+ .name = (NAME), \
+ .len = sizeof(NAME) - 1, \
+ .mode = MODE, \
+ .iop = IOP, \
+ .fop = FOP, \
+}
+
+static const struct uid_entry uid_base_stuff[] = {};
+
+static const struct inode_operations proc_uid_def_inode_operations = {
+ .setattr = proc_setattr,
+};
+
+static struct inode *proc_uid_make_inode(struct super_block *sb, kuid_t kuid)
+{
+ struct inode *inode;
+
+ inode = new_inode(sb);
+ if (!inode)
+ return NULL;
+
+ inode->i_ino = get_next_ino();
+ inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
+ inode->i_op = &proc_uid_def_inode_operations;
+ inode->i_uid = kuid;
+
+ return inode;
+}
+
+static struct dentry *proc_uident_instantiate(struct dentry *dentry,
+ struct task_struct *unused, const void *ptr)
+{
+ const struct uid_entry *u = ptr;
+ struct inode *inode;
+
+ uid_t uid = name_to_int(&dentry->d_name);
+ kuid_t kuid;
+ bool uid_exists;
+ rt_mutex_lock(&proc_uid_lock);
+ uid_exists = uid_hash_entry_exists_locked(uid);
+ rt_mutex_unlock(&proc_uid_lock);
+ if (uid_exists) {
+ kuid = make_kuid(current_user_ns(), uid);
+ inode = proc_uid_make_inode(dentry->d_sb, kuid);
+ if (!inode)
+ return ERR_PTR(-ENOENT);
+ } else {
+ return ERR_PTR(-ENOENT);
+ }
+
+ inode->i_mode = u->mode;
+ if (S_ISDIR(inode->i_mode))
+ set_nlink(inode, 2);
+ if (u->iop)
+ inode->i_op = u->iop;
+ if (u->fop)
+ inode->i_fop = u->fop;
+
+ return d_splice_alias(inode, dentry);
+}
+
+static struct dentry *proc_uid_base_lookup(struct inode *dir,
+ struct dentry *dentry,
+ unsigned int flags)
+{
+ const struct uid_entry *u, *last;
+ unsigned int nents = ARRAY_SIZE(uid_base_stuff);
+
+ if (nents == 0)
+ return ERR_PTR(-ENOENT);
+
+ last = &uid_base_stuff[nents - 1];
+ for (u = uid_base_stuff; u <= last; u++) {
+ if (u->len != dentry->d_name.len)
+ continue;
+ if (!memcmp(dentry->d_name.name, u->name, u->len))
+ break;
+ }
+ if (u > last)
+ return ERR_PTR(-ENOENT);
+
+ return proc_uident_instantiate(dentry, NULL, u);
+}
+
+static int proc_uid_base_readdir(struct file *file, struct dir_context *ctx)
+{
+ unsigned int nents = ARRAY_SIZE(uid_base_stuff);
+ const struct uid_entry *u;
+
+ if (!dir_emit_dots(file, ctx))
+ return 0;
+
+ if (ctx->pos >= nents + 2)
+ return 0;
+
+ for (u = uid_base_stuff + (ctx->pos - 2);
+ u < uid_base_stuff + nents; u++) {
+ if (!proc_fill_cache(file, ctx, u->name, u->len,
+ proc_uident_instantiate, NULL, u))
+ break;
+ ctx->pos++;
+ }
+
+ return 0;
+}
+
+static const struct inode_operations proc_uid_base_inode_operations = {
+ .lookup = proc_uid_base_lookup,
+ .setattr = proc_setattr,
+};
+
+static const struct file_operations proc_uid_base_operations = {
+ .read = generic_read_dir,
+ .iterate = proc_uid_base_readdir,
+ .llseek = default_llseek,
+};
+
+static struct dentry *proc_uid_instantiate(struct dentry *dentry,
+ struct task_struct *unused, const void *ptr)
+{
+ unsigned int i, len;
+ nlink_t nlinks;
+ kuid_t *kuid = (kuid_t *)ptr;
+ struct inode *inode = proc_uid_make_inode(dentry->d_sb, *kuid);
+
+ if (!inode)
+ return ERR_PTR(-ENOENT);
+
+ inode->i_mode = S_IFDIR | 0555;
+ inode->i_op = &proc_uid_base_inode_operations;
+ inode->i_fop = &proc_uid_base_operations;
+ inode->i_flags |= S_IMMUTABLE;
+
+ nlinks = 2;
+ len = ARRAY_SIZE(uid_base_stuff);
+ for (i = 0; i < len; ++i) {
+ if (S_ISDIR(uid_base_stuff[i].mode))
+ ++nlinks;
+ }
+ set_nlink(inode, nlinks);
+
+ return d_splice_alias(inode, dentry);
+}
+
+static int proc_uid_readdir(struct file *file, struct dir_context *ctx)
+{
+ int last_shown, i;
+ unsigned long bkt;
+ struct uid_hash_entry *entry;
+
+ if (!dir_emit_dots(file, ctx))
+ return 0;
+
+ i = 0;
+ last_shown = ctx->pos - 2;
+ rt_mutex_lock(&proc_uid_lock);
+ hash_for_each(proc_uid_hash_table, bkt, entry, hash) {
+ int len;
+ char buf[PROC_NUMBUF];
+
+ if (i < last_shown)
+ continue;
+ len = snprintf(buf, sizeof(buf), "%u", entry->uid);
+ if (!proc_fill_cache(file, ctx, buf, len,
+ proc_uid_instantiate, NULL, &entry->uid))
+ break;
+ i++;
+ ctx->pos++;
+ }
+ rt_mutex_unlock(&proc_uid_lock);
+ return 0;
+}
+
+static struct dentry *proc_uid_lookup(struct inode *dir, struct dentry *dentry,
+ unsigned int flags)
+{
+ int result = -ENOENT;
+
+ uid_t uid = name_to_int(&dentry->d_name);
+ bool uid_exists;
+
+ rt_mutex_lock(&proc_uid_lock);
+ uid_exists = uid_hash_entry_exists_locked(uid);
+ rt_mutex_unlock(&proc_uid_lock);
+ if (uid_exists) {
+ kuid_t kuid = make_kuid(current_user_ns(), uid);
+
+ return proc_uid_instantiate(dentry, NULL, &kuid);
+ }
+ return ERR_PTR(result);
+}
+
+static const struct file_operations proc_uid_operations = {
+ .read = generic_read_dir,
+ .iterate = proc_uid_readdir,
+ .llseek = default_llseek,
+};
+
+static const struct inode_operations proc_uid_inode_operations = {
+ .lookup = proc_uid_lookup,
+ .setattr = proc_setattr,
+};
+
+int __init proc_uid_init(void)
+{
+ proc_uid = proc_mkdir("uid", NULL);
+ if (!proc_uid)
+ return -ENOMEM;
+ proc_uid->proc_iops = &proc_uid_inode_operations;
+ proc_uid->proc_fops = &proc_uid_operations;
+
+ return 0;
+}
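The /proc/uid directory added above is populated lazily: an entry only appears once proc_register_uid() has been called for that kuid, and with uid_base_stuff still empty each per-UID directory holds just "." and "..". As a hedged illustration, a user-space program on a kernel built with CONFIG_PROC_UID from this patch could enumerate the registered UIDs like this:

#include <dirent.h>
#include <stdio.h>

int main(void)
{
        struct dirent *de;
        DIR *d = opendir("/proc/uid");

        if (!d) {
                perror("opendir /proc/uid");
                return 1;
        }
        while ((de = readdir(d)) != NULL)
                printf("%s\n", de->d_name); /* ".", "..", then one entry per registered UID */
        closedir(d);
        return 0;
}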
diff --git a/fs/sync.c b/fs/sync.c
index b54e054..055daab 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -9,7 +9,7 @@
#include <linux/slab.h>
#include <linux/export.h>
#include <linux/namei.h>
-#include <linux/sched.h>
+#include <linux/sched/xacct.h>
#include <linux/writeback.h>
#include <linux/syscalls.h>
#include <linux/linkage.h>
@@ -220,6 +220,7 @@
if (f.file) {
ret = vfs_fsync(f.file, datasync);
fdput(f);
+ inc_syscfs(current);
}
return ret;
}
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index d269d11..0e89a6d 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -913,7 +913,9 @@
new_flags, vma->anon_vma,
vma->vm_file, vma->vm_pgoff,
vma_policy(vma),
- NULL_VM_UFFD_CTX);
+ NULL_VM_UFFD_CTX,
+ vma_get_anon_name(vma));
+
if (prev)
vma = prev;
else
@@ -1463,7 +1465,8 @@
prev = vma_merge(mm, prev, start, vma_end, new_flags,
vma->anon_vma, vma->vm_file, vma->vm_pgoff,
vma_policy(vma),
- ((struct vm_userfaultfd_ctx){ ctx }));
+ ((struct vm_userfaultfd_ctx){ ctx }),
+ vma_get_anon_name(vma));
if (prev) {
vma = prev;
goto next;
@@ -1625,7 +1628,8 @@
prev = vma_merge(mm, prev, start, vma_end, new_flags,
vma->anon_vma, vma->vm_file, vma->vm_pgoff,
vma_policy(vma),
- NULL_VM_UFFD_CTX);
+ NULL_VM_UFFD_CTX,
+ vma_get_anon_name(vma));
if (prev) {
vma = prev;
goto next;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 2595496..3a56299 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1151,11 +1151,14 @@
struct file *filp,
struct vm_area_struct *vma)
{
+ struct dax_device *dax_dev;
+
+ dax_dev = xfs_find_daxdev_for_inode(file_inode(filp));
/*
- * We don't support synchronous mappings for non-DAX files. At least
- * until someone comes with a sensible use case.
+ * We don't support synchronous mappings for non-DAX files and
+ * for DAX files if the underlying dax_device is not synchronous.
*/
- if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
+ if (!daxdev_mapping_supported(vma, dax_dev))
return -EOPNOTSUPP;
file_accessed(filp);
diff --git a/include/linux/alt-syscall.h b/include/linux/alt-syscall.h
new file mode 100644
index 0000000..00f37c0
--- /dev/null
+++ b/include/linux/alt-syscall.h
@@ -0,0 +1,59 @@
+#ifndef _ALT_SYSCALL_H
+#define _ALT_SYSCALL_H
+
+#include <linux/errno.h>
+
+#ifdef CONFIG_ALT_SYSCALL
+
+#include <linux/list.h>
+#include <asm/syscall.h>
+
+#define ALT_SYS_CALL_NAME_MAX 32
+
+struct alt_sys_call_table {
+ char name[ALT_SYS_CALL_NAME_MAX + 1];
+ sys_call_ptr_t *table;
+ int size;
+#ifdef CONFIG_IA32_EMULATION
+ sys_call_ptr_t *compat_table;
+ int compat_size;
+#endif
+ struct list_head node;
+};
+
+/*
+ * arch_dup_sys_call_table should return the default syscall table, not
+ * the current syscall table, since we want to explicitly not allow
+ * syscall table composition. A selected syscall table should be treated
+ * as a single execution personality.
+ */
+
+int arch_dup_sys_call_table(struct alt_sys_call_table *table);
+int arch_set_sys_call_table(struct alt_sys_call_table *table);
+
+int register_alt_sys_call_table(struct alt_sys_call_table *table);
+int set_alt_sys_call_table(char __user *name);
+
+#else
+
+struct alt_sys_call_table;
+
+static inline int arch_dup_sys_call_table(struct alt_sys_call_table *table)
+{
+ return -ENOSYS;
+}
+static inline int arch_set_sys_call_table(struct alt_sys_call_table *table)
+{
+ return -ENOSYS;
+}
+static inline int register_alt_sys_call_table(struct alt_sys_call_table *table)
+{
+ return -ENOSYS;
+}
+static inline int set_alt_sys_call_table(char __user *name)
+{
+ return -ENOSYS;
+}
+#endif
+
+#endif /* _ALT_SYSCALL_H */
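Only the prototypes above are defined by this header, so the following is a speculative sketch of how a CONFIG_ALT_SYSCALL user might register a named table; the example_* names are placeholders, and overriding individual sys_call_ptr_t entries is omitted because those handlers are architecture-specific.

#include <linux/alt-syscall.h>
#include <linux/init.h>
#include <linux/string.h>

static struct alt_sys_call_table example_table;

static int __init example_alt_table_init(void)
{
        int ret;

        /* Start from a copy of the default syscall table ... */
        ret = arch_dup_sys_call_table(&example_table);
        if (ret)
                return ret;

        strscpy(example_table.name, "example", sizeof(example_table.name));

        /* ... and make it selectable via set_alt_sys_call_table("example"). */
        return register_alt_sys_call_table(&example_table);
}
late_initcall(example_alt_table_init);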
diff --git a/include/linux/audit.h b/include/linux/audit.h
index 9334fbe..bea2b15 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -85,6 +85,17 @@
u32 op;
};
+struct audit_task_info {
+ kuid_t loginuid;
+ unsigned int sessionid;
+ u64 contid;
+#ifdef CONFIG_AUDITSYSCALL
+ struct audit_context *ctx;
+#endif
+};
+
+extern struct audit_task_info init_struct_audit;
+
extern int is_audit_feature_set(int which);
extern int __init audit_register_class(int class, unsigned *list);
@@ -123,6 +134,9 @@
#ifdef CONFIG_AUDIT
/* These are defined in audit.c */
/* Public API */
+extern int audit_alloc(struct task_struct *task);
+extern void audit_free(struct task_struct *task);
+extern void __init audit_task_init(void);
extern __printf(4, 5)
void audit_log(struct audit_context *ctx, gfp_t gfp_mask, int type,
const char *fmt, ...);
@@ -162,8 +176,39 @@
extern int audit_rule_change(int type, int seq, void *data, size_t datasz);
extern int audit_list_rules_send(struct sk_buff *request_skb, int seq);
+static inline kuid_t audit_get_loginuid(struct task_struct *tsk)
+{
+ if (!tsk->audit)
+ return INVALID_UID;
+ return tsk->audit->loginuid;
+}
+
+static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
+{
+ if (!tsk->audit)
+ return AUDIT_SID_UNSET;
+ return tsk->audit->sessionid;
+}
+
+extern int audit_set_contid(struct task_struct *tsk, u64 contid);
+
+static inline u64 audit_get_contid(struct task_struct *tsk)
+{
+ if (!tsk->audit)
+ return AUDIT_CID_UNSET;
+ return tsk->audit->contid;
+}
+
extern u32 audit_enabled;
#else /* CONFIG_AUDIT */
+static inline int audit_alloc(struct task_struct *task)
+{
+ return 0;
+}
+static inline void audit_free(struct task_struct *task)
+{ }
+static inline void __init audit_task_init(void)
+{ }
static inline __printf(4, 5)
void audit_log(struct audit_context *ctx, gfp_t gfp_mask, int type,
const char *fmt, ...)
@@ -205,6 +250,12 @@
static inline void audit_log_task_info(struct audit_buffer *ab,
struct task_struct *tsk)
{ }
+
+static inline u64 audit_get_contid(struct task_struct *tsk)
+{
+ return AUDIT_CID_UNSET;
+}
+
#define audit_enabled AUDIT_OFF
#endif /* CONFIG_AUDIT */
@@ -219,8 +270,6 @@
/* These are defined in auditsc.c */
/* Public API */
-extern int audit_alloc(struct task_struct *task);
-extern void __audit_free(struct task_struct *task);
extern void __audit_syscall_entry(int major, unsigned long a0, unsigned long a1,
unsigned long a2, unsigned long a3);
extern void __audit_syscall_exit(int ret_success, long ret_value);
@@ -242,12 +291,14 @@
static inline void audit_set_context(struct task_struct *task, struct audit_context *ctx)
{
- task->audit_context = ctx;
+ task->audit->ctx = ctx;
}
static inline struct audit_context *audit_context(void)
{
- return current->audit_context;
+ if (!current->audit)
+ return NULL;
+ return current->audit->ctx;
}
static inline bool audit_dummy_context(void)
@@ -255,11 +306,7 @@
void *p = audit_context();
return !p || *(int *)p;
}
-static inline void audit_free(struct task_struct *task)
-{
- if (unlikely(task->audit_context))
- __audit_free(task);
-}
+
static inline void audit_syscall_entry(int major, unsigned long a0,
unsigned long a1, unsigned long a2,
unsigned long a3)
@@ -329,16 +376,6 @@
struct timespec64 *t, unsigned int *serial);
extern int audit_set_loginuid(kuid_t loginuid);
-static inline kuid_t audit_get_loginuid(struct task_struct *tsk)
-{
- return tsk->loginuid;
-}
-
-static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
-{
- return tsk->sessionid;
-}
-
extern void __audit_ipc_obj(struct kern_ipc_perm *ipcp);
extern void __audit_ipc_set_perm(unsigned long qbytes, uid_t uid, gid_t gid, umode_t mode);
extern void __audit_bprm(struct linux_binprm *bprm);
@@ -461,12 +498,6 @@
extern int audit_n_rules;
extern int audit_signals;
#else /* CONFIG_AUDITSYSCALL */
-static inline int audit_alloc(struct task_struct *task)
-{
- return 0;
-}
-static inline void audit_free(struct task_struct *task)
-{ }
static inline void audit_syscall_entry(int major, unsigned long a0,
unsigned long a1, unsigned long a2,
unsigned long a3)
@@ -595,6 +626,16 @@
return uid_valid(audit_get_loginuid(tsk));
}
+static inline bool audit_contid_valid(u64 contid)
+{
+ return contid != AUDIT_CID_UNSET;
+}
+
+static inline bool audit_contid_set(struct task_struct *tsk)
+{
+ return audit_contid_valid(audit_get_contid(tsk));
+}
+
static inline void audit_log_string(struct audit_buffer *ab, const char *buf)
{
audit_log_n_string(ab, buf, strlen(buf));
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 745b2d0..ec08bba 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -538,7 +538,7 @@
/*
* mq queue kobject
*/
- struct kobject mq_kobj;
+ struct kobject *mq_kobj;
#ifdef CONFIG_BLK_DEV_INTEGRITY
struct blk_integrity integrity;
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index acb77dcf..8996c09 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -21,6 +21,10 @@
SUBSYS(cpuacct)
#endif
+#if IS_ENABLED(CONFIG_SCHED_TUNE)
+SUBSYS(schedtune)
+#endif
+
#if IS_ENABLED(CONFIG_BLK_CGROUP)
SUBSYS(io)
#endif
diff --git a/include/linux/clk.h b/include/linux/clk.h
index 4f750c4..c705271 100644
--- a/include/linux/clk.h
+++ b/include/linux/clk.h
@@ -312,7 +312,26 @@
*/
int __must_check clk_bulk_get(struct device *dev, int num_clks,
struct clk_bulk_data *clks);
-
+/**
+ * clk_bulk_get_all - lookup and obtain all available references to clock
+ * producers.
+ * @dev: device for clock "consumer"
+ * @clks: pointer to the clk_bulk_data table of consumer
+ *
+ * This helper function allows drivers to get all clk consumers in one
+ * operation. If any of the clks cannot be acquired, then any clks that
+ * were obtained will be freed before returning to the caller.
+ *
+ * Returns a positive value for the number of clocks obtained while the
+ * clock references are stored in the clk_bulk_data table pointed to by
+ * @clks. Returns 0 if there are none and a negative value if something
+ * failed.
+ *
+ * Drivers must assume that the clock source is not enabled.
+ *
+ * clk_bulk_get_all should not be called from within interrupt context.
+ */
+int __must_check clk_bulk_get_all(struct device *dev,
+ struct clk_bulk_data **clks);
/**
* devm_clk_bulk_get - managed get multiple clk consumers
* @dev: device for clock "consumer"
@@ -327,6 +346,22 @@
*/
int __must_check devm_clk_bulk_get(struct device *dev, int num_clks,
struct clk_bulk_data *clks);
+/**
+ * devm_clk_bulk_get_all - managed get multiple clk consumers
+ * @dev: device for clock "consumer"
+ * @clks: pointer to the clk_bulk_data table of consumer
+ *
+ * Returns a positive value for the number of clocks obtained while the
+ * clock references are stored in the clk_bulk_data table pointed to by
+ * @clks. Returns 0 if there are none and a negative value if something
+ * failed.
+ *
+ * This helper function allows drivers to get several clk consumers in one
+ * operation with management; the clks will automatically be freed when
+ * the device is unbound.
+ */
+
+int __must_check devm_clk_bulk_get_all(struct device *dev,
+ struct clk_bulk_data **clks);
/**
* devm_clk_get - lookup and obtain a managed reference to a clock producer.
@@ -488,6 +523,19 @@
void clk_bulk_put(int num_clks, struct clk_bulk_data *clks);
/**
+ * clk_bulk_put_all - "free" all the clock sources
+ * @num_clks: the number of clk_bulk_data
+ * @clks: the clk_bulk_data table of consumer
+ *
+ * Note: drivers must ensure that all clk_bulk_enable calls made on this
+ * clock source are balanced by clk_bulk_disable calls prior to calling
+ * this function.
+ *
+ * clk_bulk_put_all should not be called from within interrupt context.
+ */
+void clk_bulk_put_all(int num_clks, struct clk_bulk_data *clks);
+
+/**
* devm_clk_put - "free" a managed clock source
* @dev: device used to acquire the clock
* @clk: clock source acquired with devm_clk_get()
@@ -642,6 +690,12 @@
return 0;
}
+static inline int __must_check clk_bulk_get_all(struct device *dev,
+ struct clk_bulk_data **clks)
+{
+ return 0;
+}
+
static inline struct clk *devm_clk_get(struct device *dev, const char *id)
{
return NULL;
@@ -653,6 +707,13 @@
return 0;
}
+static inline int __must_check devm_clk_bulk_get_all(struct device *dev,
+ struct clk_bulk_data **clks)
+{
+
+ return 0;
+}
+
static inline struct clk *devm_get_clk_from_child(struct device *dev,
struct device_node *np, const char *con_id)
{
@@ -663,6 +724,8 @@
static inline void clk_bulk_put(int num_clks, struct clk_bulk_data *clks) {}
+static inline void clk_bulk_put_all(int num_clks, struct clk_bulk_data *clks) {}
+
static inline void devm_clk_put(struct device *dev, struct clk *clk) {}
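As a minimal sketch of the new bulk-get-all helpers (the foo_* driver names are made up), a probe routine could fetch every clock listed for its device in one call and let devres drop the references on unbind:

#include <linux/clk.h>
#include <linux/platform_device.h>

struct foo_priv {
        struct clk_bulk_data *clks;
        int num_clks;
};

static int foo_probe(struct platform_device *pdev)
{
        struct foo_priv *priv;
        int ret;

        priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
        if (!priv)
                return -ENOMEM;

        /* Positive return value is the number of clocks obtained. */
        ret = devm_clk_bulk_get_all(&pdev->dev, &priv->clks);
        if (ret < 0)
                return ret;
        priv->num_clks = ret;

        return clk_bulk_prepare_enable(priv->num_clks, priv->clks);
}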
diff --git a/include/linux/compat.h b/include/linux/compat.h
index de0c13b..eb67e078 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -993,6 +993,13 @@
#endif /* CONFIG_ARCH_HAS_SYSCALL_WRAPPER */
+#ifdef CONFIG_ALT_SYSCALL
+
+int compat_ksys_clock_adjtime(clockid_t which_clock,
+ struct compat_timex __user *tp);
+int compat_ksys_adjtimex(struct compat_timex __user *utp);
+
+#endif
/*
* For most but not all architectures, "am I in a compat syscall?" and
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 3361663..a885e9c 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -338,14 +338,15 @@
};
/* flags */
-#define CPUFREQ_STICKY (1 << 0) /* driver isn't removed even if
- all ->init() calls failed */
-#define CPUFREQ_CONST_LOOPS (1 << 1) /* loops_per_jiffy or other
- kernel "constants" aren't
- affected by frequency
- transitions */
-#define CPUFREQ_PM_NO_WARN (1 << 2) /* don't warn on suspend/resume
- speed mismatches */
+
+/* driver isn't removed even if all ->init() calls failed */
+#define CPUFREQ_STICKY BIT(0)
+
+/* loops_per_jiffy or other kernel "constants" aren't affected by frequency transitions */
+#define CPUFREQ_CONST_LOOPS BIT(1)
+
+/* don't warn on suspend/resume speed mismatches */
+#define CPUFREQ_PM_NO_WARN BIT(2)
/*
* This should be set by platforms having multiple clock-domains, i.e.
@@ -353,14 +354,14 @@
* be created in cpu/cpu<num>/cpufreq/ directory and so they can use the same
* governor with different tunables for different clusters.
*/
-#define CPUFREQ_HAVE_GOVERNOR_PER_POLICY (1 << 3)
+#define CPUFREQ_HAVE_GOVERNOR_PER_POLICY BIT(3)
/*
* Driver will do POSTCHANGE notifications from outside of their ->target()
* routine and so must set cpufreq_driver->flags with this flag, so that core
* can handle them specially.
*/
-#define CPUFREQ_ASYNC_NOTIFICATION (1 << 4)
+#define CPUFREQ_ASYNC_NOTIFICATION BIT(4)
/*
* Set by drivers which want cpufreq core to check if CPU is running at a
@@ -369,13 +370,13 @@
* from the table. And if that fails, we will stop further boot process by
* issuing a BUG_ON().
*/
-#define CPUFREQ_NEED_INITIAL_FREQ_CHECK (1 << 5)
+#define CPUFREQ_NEED_INITIAL_FREQ_CHECK BIT(5)
/*
* Set by drivers to disallow use of governors with "dynamic_switching" flag
* set.
*/
-#define CPUFREQ_NO_AUTO_DYNAMIC_SWITCHING (1 << 6)
+#define CPUFREQ_NO_AUTO_DYNAMIC_SWITCHING BIT(6)
int cpufreq_register_driver(struct cpufreq_driver *driver_data);
int cpufreq_unregister_driver(struct cpufreq_driver *driver_data);
@@ -931,6 +932,14 @@
}
#endif
+#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
+void sched_cpufreq_governor_change(struct cpufreq_policy *policy,
+ struct cpufreq_governor *old_gov);
+#else
+static inline void sched_cpufreq_governor_change(struct cpufreq_policy *policy,
+ struct cpufreq_governor *old_gov) { }
+#endif
+
extern void arch_freq_prepare_all(void);
extern unsigned int arch_freq_get_on_cpu(int cpu);
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 317aeca..8ccffba8 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -220,7 +220,7 @@
#endif
/* kernel/sched/idle.c */
-extern void sched_idle_set_state(struct cpuidle_state *idle_state);
+extern void sched_idle_set_state(struct cpuidle_state *idle_state, int index);
extern void default_idle_call(void);
#ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 450b28d..836bd3f 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -7,6 +7,9 @@
#include <linux/radix-tree.h>
#include <asm/pgtable.h>
+/* Flag for synchronous flush */
+#define DAXDEV_F_SYNC (1UL << 0)
+
struct iomap_ops;
struct dax_device;
struct dax_operations {
@@ -17,6 +20,12 @@
*/
long (*direct_access)(struct dax_device *, pgoff_t, long,
void **, pfn_t *);
+ /*
+ * Validate whether this device is usable as an fsdax backing
+ * device.
+ */
+ bool (*dax_supported)(struct dax_device *, struct block_device *, int,
+ sector_t, sector_t);
/* copy_from_iter: required operation for fs-dax direct-i/o */
size_t (*copy_from_iter)(struct dax_device *, pgoff_t, void *, size_t,
struct iov_iter *);
@@ -30,18 +39,40 @@
#if IS_ENABLED(CONFIG_DAX)
struct dax_device *dax_get_by_host(const char *host);
struct dax_device *alloc_dax(void *private, const char *host,
- const struct dax_operations *ops);
+ const struct dax_operations *ops, unsigned long flags);
void put_dax(struct dax_device *dax_dev);
void kill_dax(struct dax_device *dax_dev);
void dax_write_cache(struct dax_device *dax_dev, bool wc);
bool dax_write_cache_enabled(struct dax_device *dax_dev);
+bool __dax_synchronous(struct dax_device *dax_dev);
+static inline bool dax_synchronous(struct dax_device *dax_dev)
+{
+ return __dax_synchronous(dax_dev);
+}
+void __set_dax_synchronous(struct dax_device *dax_dev);
+static inline void set_dax_synchronous(struct dax_device *dax_dev)
+{
+ __set_dax_synchronous(dax_dev);
+}
+/*
+ * Check if given mapping is supported by the file / underlying device.
+ */
+static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
+ struct dax_device *dax_dev)
+{
+ if (!(vma->vm_flags & VM_SYNC))
+ return true;
+ if (!IS_DAX(file_inode(vma->vm_file)))
+ return false;
+ return dax_synchronous(dax_dev);
+}
#else
static inline struct dax_device *dax_get_by_host(const char *host)
{
return NULL;
}
static inline struct dax_device *alloc_dax(void *private, const char *host,
- const struct dax_operations *ops)
+ const struct dax_operations *ops, unsigned long flags)
{
/*
* Callers should check IS_ENABLED(CONFIG_DAX) to know if this
@@ -62,6 +93,18 @@
{
return false;
}
+static inline bool dax_synchronous(struct dax_device *dax_dev)
+{
+ return true;
+}
+static inline void set_dax_synchronous(struct dax_device *dax_dev)
+{
+}
+static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
+ struct dax_device *dax_dev)
+{
+ return !(vma->vm_flags & VM_SYNC);
+}
#endif
struct writeback_control;
@@ -73,6 +116,17 @@
return __bdev_dax_supported(bdev, blocksize);
}
+bool __generic_fsdax_supported(struct dax_device *dax_dev,
+ struct block_device *bdev, int blocksize, sector_t start,
+ sector_t sectors);
+static inline bool generic_fsdax_supported(struct dax_device *dax_dev,
+ struct block_device *bdev, int blocksize, sector_t start,
+ sector_t sectors)
+{
+ return __generic_fsdax_supported(dax_dev, bdev, blocksize, start,
+ sectors);
+}
+
static inline struct dax_device *fs_dax_get_by_host(const char *host)
{
return dax_get_by_host(host);
@@ -97,6 +151,13 @@
return false;
}
+static inline bool generic_fsdax_supported(struct dax_device *dax_dev,
+ struct block_device *bdev, int blocksize, sector_t start,
+ sector_t sectors)
+{
+ return false;
+}
+
static inline struct dax_device *fs_dax_get_by_host(const char *host)
{
return NULL;
@@ -140,6 +201,8 @@
void *dax_get_private(struct dax_device *dax_dev);
long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
void **kaddr, pfn_t *pfn);
+bool dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
+ int blocksize, sector_t start, sector_t len);
size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
size_t bytes, struct iov_iter *i);
size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
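To show where DAXDEV_F_SYNC plugs in, here is a hypothetical driver-side sketch (foo_* names are placeholders): only a dax_device allocated with this flag will pass the daxdev_mapping_supported() check for MAP_SYNC mappings, which is what the xfs_file_mmap change elsewhere in this patch relies on.

#include <linux/dax.h>

static struct dax_device *foo_alloc_dax(void *drv_data, const char *host,
                                        const struct dax_operations *ops,
                                        bool supports_sync_flush)
{
        unsigned long flags = supports_sync_flush ? DAXDEV_F_SYNC : 0;

        /*
         * With DAXDEV_F_SYNC set, dax_synchronous() returns true and
         * MAP_SYNC mappings of files on this device are allowed.
         */
        return alloc_dax(drv_data, host, ops, flags);
}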
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 0880bae..bd19969 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -146,6 +146,7 @@
struct vfsmount *(*d_automount)(struct path *);
int (*d_manage)(const struct path *, bool);
struct dentry *(*d_real)(struct dentry *, const struct inode *);
+ void (*d_canonical_path)(const struct path *, struct path *);
} ____cacheline_aligned;
/*
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 91f9f95..9f93f3ee 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -10,6 +10,7 @@
#include <linux/bio.h>
#include <linux/blkdev.h>
+#include <linux/dm-ioctl.h>
#include <linux/math64.h>
#include <linux/ratelimit.h>
@@ -425,6 +426,14 @@
sector_t start);
union map_info *dm_get_rq_mapinfo(struct request *rq);
+/*
+ * Device mapper functions to parse and create devices specified by the
+ * parameter "dm-mod.create="
+ */
+int __init dm_early_create(struct dm_ioctl *dmi,
+ struct dm_target_spec **spec_array,
+ char **target_params_array);
+
struct queue_limits *dm_get_queue_limits(struct mapped_device *md);
/*
diff --git a/include/linux/dynamic_debug.h b/include/linux/dynamic_debug.h
index b3419da..2fd8006 100644
--- a/include/linux/dynamic_debug.h
+++ b/include/linux/dynamic_debug.h
@@ -2,7 +2,7 @@
#ifndef _DYNAMIC_DEBUG_H
#define _DYNAMIC_DEBUG_H
-#if defined(CONFIG_JUMP_LABEL)
+#if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
#include <linux/jump_label.h>
#endif
@@ -38,7 +38,7 @@
#define _DPRINTK_FLAGS_DEFAULT 0
#endif
unsigned int flags:8;
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
union {
struct static_key_true dd_key_true;
struct static_key_false dd_key_false;
@@ -83,7 +83,7 @@
dd_key_init(key, init) \
}
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
#define dd_key_init(key, init) key = (init)
diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
new file mode 100644
index 0000000..aa027f7
--- /dev/null
+++ b/include/linux/energy_model.h
@@ -0,0 +1,187 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_ENERGY_MODEL_H
+#define _LINUX_ENERGY_MODEL_H
+#include <linux/cpumask.h>
+#include <linux/jump_label.h>
+#include <linux/kobject.h>
+#include <linux/rcupdate.h>
+#include <linux/sched/cpufreq.h>
+#include <linux/sched/topology.h>
+#include <linux/types.h>
+
+#ifdef CONFIG_ENERGY_MODEL
+/**
+ * em_cap_state - Capacity state of a performance domain
+ * @frequency: The CPU frequency in kHz, for consistency with CPUFreq
+ * @power: The power consumed by 1 CPU at this level, in milli-watts
+ * @cost: The cost coefficient associated with this level, used during
+ * energy calculation. Equal to: power * max_frequency / frequency
+ */
+struct em_cap_state {
+ unsigned long frequency;
+ unsigned long power;
+ unsigned long cost;
+};
+
+/**
+ * em_perf_domain - Performance domain
+ * @table: List of capacity states, in ascending order
+ * @nr_cap_states: Number of capacity states
+ * @cpus: Cpumask covering the CPUs of the domain
+ *
+ * A "performance domain" represents a group of CPUs whose performance is
+ * scaled together. All CPUs of a performance domain must have the same
+ * micro-architecture. Performance domains often have a 1-to-1 mapping with
+ * CPUFreq policies.
+ */
+struct em_perf_domain {
+ struct em_cap_state *table;
+ int nr_cap_states;
+ unsigned long cpus[0];
+};
+
+#define EM_CPU_MAX_POWER 0xFFFF
+
+struct em_data_callback {
+ /**
+ * active_power() - Provide power at the next capacity state of a CPU
+ * @power : Active power at the capacity state in mW (modified)
+ * @freq : Frequency at the capacity state in kHz (modified)
+ * @cpu : CPU for which we do this operation
+ *
+ * active_power() must find the lowest capacity state of 'cpu' above
+ * 'freq' and update 'power' and 'freq' to the matching active power
+ * and frequency.
+ *
+ * The power is the one of a single CPU in the domain, expressed in
+ * milli-watts. It is expected to fit in the [0, EM_CPU_MAX_POWER]
+ * range.
+ *
+ * Return 0 on success.
+ */
+ int (*active_power)(unsigned long *power, unsigned long *freq, int cpu);
+};
+#define EM_DATA_CB(_active_power_cb) { .active_power = &_active_power_cb }
+
+struct em_perf_domain *em_cpu_get(int cpu);
+int em_register_perf_domain(cpumask_t *span, unsigned int nr_states,
+ struct em_data_callback *cb);
+
+/**
+ * em_pd_energy() - Estimates the energy consumed by the CPUs of a perf. domain
+ * @pd : performance domain for which energy has to be estimated
+ * @max_util : highest utilization among CPUs of the domain
+ * @sum_util : sum of the utilization of all CPUs in the domain
+ *
+ * Return: the sum of the energy consumed by the CPUs of the domain assuming
+ * a capacity state satisfying the max utilization of the domain.
+ */
+static inline unsigned long em_pd_energy(struct em_perf_domain *pd,
+ unsigned long max_util, unsigned long sum_util)
+{
+ unsigned long freq, scale_cpu;
+ struct em_cap_state *cs;
+ int i, cpu;
+
+ /*
+ * In order to predict the capacity state, map the utilization of the
+ * most utilized CPU of the performance domain to a requested frequency,
+ * like schedutil.
+ */
+ cpu = cpumask_first(to_cpumask(pd->cpus));
+ scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+ cs = &pd->table[pd->nr_cap_states - 1];
+ freq = map_util_freq(max_util, cs->frequency, scale_cpu);
+
+ /*
+ * Find the lowest capacity state of the Energy Model above the
+ * requested frequency.
+ */
+ for (i = 0; i < pd->nr_cap_states; i++) {
+ cs = &pd->table[i];
+ if (cs->frequency >= freq)
+ break;
+ }
+
+ /*
+ * The capacity of a CPU in the domain at that capacity state (cs)
+ * can be computed as:
+ *
+ * cs->freq * scale_cpu
+ * cs->cap = -------------------- (1)
+ * cpu_max_freq
+ *
+ * So, ignoring the costs of idle states (which are not available in
+ * the EM), the energy consumed by this CPU at that capacity state is
+ * estimated as:
+ *
+ * cs->power * cpu_util
+ * cpu_nrg = -------------------- (2)
+ * cs->cap
+ *
+ * since 'cpu_util / cs->cap' represents its percentage of busy time.
+ *
+ * NOTE: Although the result of this computation actually is in
+ * units of power, it can be manipulated as an energy value
+ * over a scheduling period, since it is assumed to be
+ * constant during that interval.
+ *
+ * By injecting (1) in (2), 'cpu_nrg' can be re-expressed as a product
+ * of two terms:
+ *
+ * cs->power * cpu_max_freq cpu_util
+ * cpu_nrg = ------------------------ * --------- (3)
+ * cs->freq scale_cpu
+ *
+ * The first term is static, and is stored in the em_cap_state struct
+ * as 'cs->cost'.
+ *
+ * Since all CPUs of the domain have the same micro-architecture, they
+ * share the same 'cs->cost', and the same CPU capacity. Hence, the
+ * total energy of the domain (which is the simple sum of the energy of
+ * all of its CPUs) can be factorized as:
+ *
+ * cs->cost * \Sum cpu_util
+ * pd_nrg = ------------------------ (4)
+ * scale_cpu
+ */
+ return cs->cost * sum_util / scale_cpu;
+}
+
+/**
+ * em_pd_nr_cap_states() - Get the number of capacity states of a perf. domain
+ * @pd : performance domain for which this must be done
+ *
+ * Return: the number of capacity states in the performance domain table
+ */
+static inline int em_pd_nr_cap_states(struct em_perf_domain *pd)
+{
+ return pd->nr_cap_states;
+}
+
+#else
+struct em_perf_domain {};
+struct em_data_callback {};
+#define EM_DATA_CB(_active_power_cb) { }
+
+static inline int em_register_perf_domain(cpumask_t *span,
+ unsigned int nr_states, struct em_data_callback *cb)
+{
+ return -EINVAL;
+}
+static inline struct em_perf_domain *em_cpu_get(int cpu)
+{
+ return NULL;
+}
+static inline unsigned long em_pd_energy(struct em_perf_domain *pd,
+ unsigned long max_util, unsigned long sum_util)
+{
+ return 0;
+}
+static inline int em_pd_nr_cap_states(struct em_perf_domain *pd)
+{
+ return 0;
+}
+#endif
+
+#endif
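As a rough numeric illustration of em_pd_energy() above (all values made up): a capacity state with frequency = 1500000 kHz and power = 200 mW in a domain whose highest state runs at 2000000 kHz gets cost = power * max_frequency / frequency = 200 * 2000000 / 1500000 ≈ 266. With scale_cpu = 1024 and sum_util = 512, equation (4) gives pd_nrg = 266 * 512 / 1024 = 133, in the same milli-watt units as @power. As a sanity check, one CPU fully busy at that state has cpu_util = cs->cap = 1500000 * 1024 / 2000000 = 768, and 266 * 768 / 1024 ≈ 200 recovers cs->power.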
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index fd1ce10..61b7251 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -210,12 +210,19 @@
static inline void fsnotify_open(struct file *file)
{
const struct path *path = &file->f_path;
+ struct path lower_path;
struct inode *inode = file_inode(file);
__u32 mask = FS_OPEN;
if (S_ISDIR(inode->i_mode))
mask |= FS_ISDIR;
+ if (path->dentry->d_op && path->dentry->d_op->d_canonical_path) {
+ path->dentry->d_op->d_canonical_path(path, &lower_path);
+ fsnotify_parent(&lower_path, NULL, mask);
+ fsnotify(lower_path.dentry->d_inode, mask, &lower_path, FSNOTIFY_EVENT_PATH, NULL, 0);
+ path_put(&lower_path);
+ }
fsnotify_parent(path, NULL, mask);
fsnotify(inode, mask, path, FSNOTIFY_EVENT_PATH, NULL, 0);
}
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 4c3e776..1a0b6f1 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -71,6 +71,10 @@
* Additional babbling in: Documentation/static-keys.txt
*/
+#if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
+# define HAVE_JUMP_LABEL
+#endif
+
#ifndef __ASSEMBLY__
#include <linux/types.h>
@@ -82,7 +86,7 @@
"%s(): static key '%pS' used before call to jump_label_init()", \
__func__, (key))
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
struct static_key {
atomic_t enabled;
@@ -110,10 +114,10 @@
struct static_key {
atomic_t enabled;
};
-#endif /* CONFIG_JUMP_LABEL */
+#endif /* HAVE_JUMP_LABEL */
#endif /* __ASSEMBLY__ */
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
#include <asm/jump_label.h>
#endif
@@ -126,7 +130,7 @@
struct module;
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
#define JUMP_TYPE_FALSE 0UL
#define JUMP_TYPE_TRUE 1UL
@@ -180,7 +184,7 @@
{ .enabled = { 0 }, \
{ .entries = (void *)JUMP_TYPE_FALSE } }
-#else /* !CONFIG_JUMP_LABEL */
+#else /* !HAVE_JUMP_LABEL */
#include <linux/atomic.h>
#include <linux/bug.h>
@@ -267,7 +271,7 @@
#define STATIC_KEY_INIT_TRUE { .enabled = ATOMIC_INIT(1) }
#define STATIC_KEY_INIT_FALSE { .enabled = ATOMIC_INIT(0) }
-#endif /* CONFIG_JUMP_LABEL */
+#endif /* HAVE_JUMP_LABEL */
#define STATIC_KEY_INIT STATIC_KEY_INIT_FALSE
#define jump_label_enabled static_key_enabled
@@ -331,7 +335,7 @@
static_key_count((struct static_key *)x) > 0; \
})
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
/*
* Combine the right initial value (type) with the right branch order
@@ -413,12 +417,12 @@
unlikely(branch); \
})
-#else /* !CONFIG_JUMP_LABEL */
+#else /* !HAVE_JUMP_LABEL */
#define static_branch_likely(x) likely(static_key_enabled(&(x)->key))
#define static_branch_unlikely(x) unlikely(static_key_enabled(&(x)->key))
-#endif /* CONFIG_JUMP_LABEL */
+#endif /* HAVE_JUMP_LABEL */
/*
* Advanced usage; refcount, branch is enabled when: count != 0
diff --git a/include/linux/jump_label_ratelimit.h b/include/linux/jump_label_ratelimit.h
index a49f2b4..baa8eab 100644
--- a/include/linux/jump_label_ratelimit.h
+++ b/include/linux/jump_label_ratelimit.h
@@ -5,19 +5,21 @@
#include <linux/jump_label.h>
#include <linux/workqueue.h>
-#if defined(CONFIG_JUMP_LABEL)
+#if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
struct static_key_deferred {
struct static_key key;
unsigned long timeout;
struct delayed_work work;
};
+#endif
+#ifdef HAVE_JUMP_LABEL
extern void static_key_slow_dec_deferred(struct static_key_deferred *key);
extern void static_key_deferred_flush(struct static_key_deferred *key);
extern void
jump_label_rate_limit(struct static_key_deferred *key, unsigned long rl);
-#else /* !CONFIG_JUMP_LABEL */
+#else /* !HAVE_JUMP_LABEL */
struct static_key_deferred {
struct static_key key;
};
@@ -36,5 +38,5 @@
{
STATIC_KEY_CHECK_USE(key);
}
-#endif /* CONFIG_JUMP_LABEL */
+#endif /* HAVE_JUMP_LABEL */
#endif /* _LINUX_JUMP_LABEL_RATELIMIT_H */
diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index c196176..1577a2d 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -56,6 +56,7 @@
int kthread_stop(struct task_struct *k);
bool kthread_should_stop(void);
bool kthread_should_park(void);
+bool __kthread_should_park(struct task_struct *k);
bool kthread_freezable_should_stop(bool *was_frozen);
void *kthread_data(struct task_struct *k);
void *kthread_probe_data(struct task_struct *k);
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 097072c..a59cb67 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -19,6 +19,7 @@
#include <linux/types.h>
#include <linux/uuid.h>
#include <linux/spinlock.h>
+#include <linux/bio.h>
struct badrange_entry {
u64 start;
@@ -59,6 +60,9 @@
*/
ND_REGION_PERSIST_MEMCTRL = 2,
+ /* Platform provides asynchronous flush mechanism */
+ ND_REGION_ASYNC = 3,
+
/* mark newly adjusted resources as requiring a label update */
DPA_RESOURCE_ADJUSTED = 1 << 0,
};
@@ -115,6 +119,7 @@
int position;
};
+struct nd_region;
struct nd_region_desc {
struct resource *res;
struct nd_mapping_desc *mapping;
@@ -126,6 +131,7 @@
int numa_node;
unsigned long flags;
struct device_node *of_node;
+ int (*flush)(struct nd_region *nd_region, struct bio *bio);
};
struct device;
@@ -201,9 +207,11 @@
unsigned int nd_region_acquire_lane(struct nd_region *nd_region);
void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane);
u64 nd_fletcher64(void *addr, size_t len, bool le);
-void nvdimm_flush(struct nd_region *nd_region);
+int nvdimm_flush(struct nd_region *nd_region, struct bio *bio);
+int generic_nvdimm_flush(struct nd_region *nd_region);
int nvdimm_has_flush(struct nd_region *nd_region);
int nvdimm_has_cache(struct nd_region *nd_region);
+bool is_nvdimm_sync(struct nd_region *nd_region);
#ifdef CONFIG_ARCH_HAS_PMEM_API
#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
diff --git a/include/linux/list_sort.h b/include/linux/list_sort.h
index ba79956..20f178c 100644
--- a/include/linux/list_sort.h
+++ b/include/linux/list_sort.h
@@ -6,6 +6,7 @@
struct list_head;
+__attribute__((nonnull(2,3)))
void list_sort(void *priv, struct list_head *head,
int (*cmp)(void *priv, struct list_head *a,
struct list_head *b));
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 3833c87..66489b4 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -457,6 +457,10 @@
* @file_free_security:
* Deallocate and free any security structures stored in file->f_security.
* @file contains the file structure being modified.
+ * @file_pre_free_security:
+ * Perform any logging or LSM state updates for a file being deleted
+ * using fields of the file before they have been cleared.
+ * @file contains the file structure being freed.
* @file_ioctl:
* @file contains the file structure.
* @cmd contains the operation to perform.
@@ -533,6 +537,10 @@
* @clone_flags contains the flags indicating what should be shared.
* Handle allocation of task-related resources.
* Returns a zero on success, negative values on failure.
+ * @task_post_alloc:
+ * @task task being allocated.
+ * Handle allocation of task-related resources after all task fields are
+ * filled in.
* @task_free:
* @task task about to be freed.
* Handle release of task-related resources. (Note that this can be called
@@ -683,6 +691,9 @@
* @cred contains the cred of the process where the signal originated, or
* NULL if the current task is the originator.
* Return 0 if permission is granted.
+ * @task_exit:
+ * Called early when a task is exiting before all state is lost.
+ * @p contains the task_struct for process.
* @task_prctl:
* Check permission before performing a process control operation on the
* current process.
@@ -1561,6 +1572,7 @@
int (*file_permission)(struct file *file, int mask);
int (*file_alloc_security)(struct file *file);
void (*file_free_security)(struct file *file);
+ void (*file_pre_free_security)(struct file *file);
int (*file_ioctl)(struct file *file, unsigned int cmd,
unsigned long arg);
int (*mmap_addr)(unsigned long addr);
@@ -1578,6 +1590,7 @@
int (*file_open)(struct file *file);
int (*task_alloc)(struct task_struct *task, unsigned long clone_flags);
+ void (*task_post_alloc)(struct task_struct *task); /* Do not upstream. */
void (*task_free)(struct task_struct *task);
int (*cred_alloc_blank)(struct cred *cred, gfp_t gfp);
void (*cred_free)(struct cred *cred);
@@ -1610,6 +1623,7 @@
int (*task_movememory)(struct task_struct *p);
int (*task_kill)(struct task_struct *p, struct siginfo *info,
int sig, const struct cred *cred);
+ void (*task_exit)(struct task_struct *p);
int (*task_prctl)(int option, unsigned long arg2, unsigned long arg3,
unsigned long arg4, unsigned long arg5);
void (*task_to_inode)(struct task_struct *p, struct inode *inode);
@@ -1860,6 +1874,7 @@
struct hlist_head file_permission;
struct hlist_head file_alloc_security;
struct hlist_head file_free_security;
+ struct hlist_head file_pre_free_security;
struct hlist_head file_ioctl;
struct hlist_head mmap_addr;
struct hlist_head mmap_file;
@@ -1871,6 +1886,7 @@
struct hlist_head file_receive;
struct hlist_head file_open;
struct hlist_head task_alloc;
+ struct hlist_head task_post_alloc;
struct hlist_head task_free;
struct hlist_head cred_alloc_blank;
struct hlist_head cred_free;
@@ -1897,6 +1913,7 @@
struct hlist_head task_getscheduler;
struct hlist_head task_movememory;
struct hlist_head task_kill;
+ struct hlist_head task_exit;
struct hlist_head task_prctl;
struct hlist_head task_to_inode;
struct hlist_head ipc_permission;
diff --git a/include/linux/lzo.h b/include/linux/lzo.h
index 2ae27cb..e95c7d1 100644
--- a/include/linux/lzo.h
+++ b/include/linux/lzo.h
@@ -18,12 +18,16 @@
#define LZO1X_1_MEM_COMPRESS (8192 * sizeof(unsigned short))
#define LZO1X_MEM_COMPRESS LZO1X_1_MEM_COMPRESS
-#define lzo1x_worst_compress(x) ((x) + ((x) / 16) + 64 + 3)
+#define lzo1x_worst_compress(x) ((x) + ((x) / 16) + 64 + 3 + 2)
/* This requires 'wrkmem' of size LZO1X_1_MEM_COMPRESS */
int lzo1x_1_compress(const unsigned char *src, size_t src_len,
unsigned char *dst, size_t *dst_len, void *wrkmem);
+/* This requires 'wrkmem' of size LZO1X_1_MEM_COMPRESS */
+int lzorle1x_1_compress(const unsigned char *src, size_t src_len,
+ unsigned char *dst, size_t *dst_len, void *wrkmem);
+
/* safe decompression with overrun testing */
int lzo1x_decompress_safe(const unsigned char *src, size_t src_len,
unsigned char *dst, size_t *dst_len);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b109204..9fb6c6e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -58,6 +58,8 @@
#define sysctl_legacy_va_layout 0
#endif
+extern int min_filelist_kbytes;
+
#ifdef CONFIG_HAVE_ARCH_MMAP_RND_BITS
extern const int mmap_rnd_bits_min;
extern const int mmap_rnd_bits_max;
@@ -125,6 +127,7 @@
#define DEFAULT_MAX_MAP_COUNT (USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)
extern int sysctl_max_map_count;
+extern int sysctl_mmap_noexec_taint;
extern unsigned long sysctl_user_reserve_kbytes;
extern unsigned long sysctl_admin_reserve_kbytes;
@@ -2254,7 +2257,7 @@
extern struct vm_area_struct *vma_merge(struct mm_struct *,
struct vm_area_struct *prev, unsigned long addr, unsigned long end,
unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
- struct mempolicy *, struct vm_userfaultfd_ctx);
+ struct mempolicy *, struct vm_userfaultfd_ctx, const char __user *);
extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
unsigned long addr, int new_below);
@@ -2823,5 +2826,9 @@
static inline void setup_nr_node_ids(void) {}
#endif
+#ifdef CONFIG_DISK_BASED_SWAP
+extern int sysctl_disk_based_swap;
+#endif
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3a9a996..3f6ab00 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -295,11 +295,18 @@
/*
* For areas with an address space and backing store,
* linkage into the address_space->i_mmap interval tree.
+ *
+ * For private anonymous mappings, a pointer to a null-terminated string
+ * in the user process containing the name given to the vma, or NULL
+ * if unnamed.
*/
- struct {
- struct rb_node rb;
- unsigned long rb_subtree_last;
- } shared;
+ union {
+ struct {
+ struct rb_node rb;
+ unsigned long rb_subtree_last;
+ } shared;
+ const char __user *anon_name;
+ };
/*
* A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
@@ -654,4 +661,13 @@
unsigned long val;
} swp_entry_t;
+/* Return the name for an anonymous mapping or NULL for a file-backed mapping */
+static inline const char __user *vma_get_anon_name(struct vm_area_struct *vma)
+{
+ if (vma->vm_file)
+ return NULL;
+
+ return vma->anon_name;
+}
+
#endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/module.h b/include/linux/module.h
index 9915397..ddefdcd 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -433,7 +433,7 @@
unsigned int num_tracepoints;
tracepoint_ptr_t *tracepoints_ptrs;
#endif
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
struct jump_entry *jump_entries;
unsigned int num_jump_entries;
#endif
diff --git a/include/linux/namei.h b/include/linux/namei.h
index a78606e..0d08bad 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -94,6 +94,10 @@
extern void nd_jump_link(struct path *path);
+extern int nameidata_set_temporary(const char __user *dir_name);
+extern void nameidata_restore_temporary(void);
+extern int nameidata_get_total_link_count(void);
+
static inline void nd_terminate_link(void *name, size_t len, size_t maxlen)
{
((char *) name)[min(len, maxlen)] = '\0';
diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 72cb19c..bbe99d2 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -176,7 +176,7 @@
int nf_register_sockopt(struct nf_sockopt_ops *reg);
void nf_unregister_sockopt(struct nf_sockopt_ops *reg);
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
extern struct static_key nf_hooks_needed[NFPROTO_NUMPROTO][NF_MAX_HOOKS];
#endif
@@ -198,7 +198,7 @@
struct nf_hook_entries *hook_head = NULL;
int ret = 1;
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
if (__builtin_constant_p(pf) &&
__builtin_constant_p(hook) &&
!static_key_false(&nf_hooks_needed[pf][hook]))
diff --git a/include/linux/netfilter/xt_quota2.h b/include/linux/netfilter/xt_quota2.h
new file mode 100644
index 0000000..eadc69033
--- /dev/null
+++ b/include/linux/netfilter/xt_quota2.h
@@ -0,0 +1,25 @@
+#ifndef _XT_QUOTA_H
+#define _XT_QUOTA_H
+
+enum xt_quota_flags {
+ XT_QUOTA_INVERT = 1 << 0,
+ XT_QUOTA_GROW = 1 << 1,
+ XT_QUOTA_PACKET = 1 << 2,
+ XT_QUOTA_NO_CHANGE = 1 << 3,
+ XT_QUOTA_MASK = 0x0F,
+};
+
+struct xt_quota_counter;
+
+struct xt_quota_mtinfo2 {
+ char name[15];
+ u_int8_t flags;
+
+ /* Comparison-invariant */
+ aligned_u64 quota;
+
+ /* Used internally by the kernel */
+ struct xt_quota_counter *master __attribute__((aligned(8)));
+};
+
+#endif /* _XT_QUOTA_H */
diff --git a/include/linux/netfilter_ingress.h b/include/linux/netfilter_ingress.h
index a13774b..554c920 100644
--- a/include/linux/netfilter_ingress.h
+++ b/include/linux/netfilter_ingress.h
@@ -8,7 +8,7 @@
#ifdef CONFIG_NETFILTER_INGRESS
static inline bool nf_hook_ingress_active(const struct sk_buff *skb)
{
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
if (!static_key_false(&nf_hooks_needed[NFPROTO_NETDEV][NF_NETDEV_INGRESS]))
return false;
#endif
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 49538b1..dded498 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -45,6 +45,9 @@
int hide_pid;
int reboot; /* group exit code if this pidns was rebooted */
struct ns_common ns;
+#ifdef CONFIG_SECURITY_CONTAINER_MONITOR
+ u64 cid; /* Main container identifier, zero if not assigned. */
+#endif
} __randomize_layout;
extern struct pid_namespace init_pid_ns;
diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
index 776c546..3b5d728 100644
--- a/include/linux/pm_domain.h
+++ b/include/linux/pm_domain.h
@@ -17,11 +17,36 @@
#include <linux/notifier.h>
#include <linux/spinlock.h>
-/* Defines used for the flags field in the struct generic_pm_domain */
-#define GENPD_FLAG_PM_CLK (1U << 0) /* PM domain uses PM clk */
-#define GENPD_FLAG_IRQ_SAFE (1U << 1) /* PM domain operates in atomic */
-#define GENPD_FLAG_ALWAYS_ON (1U << 2) /* PM domain is always powered on */
-#define GENPD_FLAG_ACTIVE_WAKEUP (1U << 3) /* Keep devices active if wakeup */
+/*
+ * Flags to control the behaviour of a genpd.
+ *
+ * These flags may be set in the struct generic_pm_domain's flags field by a
+ * genpd backend driver. The flags must be set before it calls pm_genpd_init(),
+ * which initializes a genpd.
+ *
+ * GENPD_FLAG_PM_CLK: Instructs genpd to use the PM clk framework,
+ * while powering on/off attached devices.
+ *
+ * GENPD_FLAG_IRQ_SAFE: This informs genpd that its backend callbacks,
+ * ->power_on|off(), don't sleep. Hence, they
+ * can be invoked from within atomic context, which
+ * enables genpd to power on/off the PM domain,
+ * even when pm_runtime_is_irq_safe() returns true
+ * for any of its attached devices. Note that a
+ * genpd having this flag set requires its
+ * master domains to also have it set.
+ *
+ * GENPD_FLAG_ALWAYS_ON: Instructs genpd to always keep the PM domain
+ * powered on.
+ *
+ * GENPD_FLAG_ACTIVE_WAKEUP: Instructs genpd to keep the PM domain powered
+ * on, in case any of its attached devices is used
+ * in the wakeup path to serve system wakeups.
+ */
+#define GENPD_FLAG_PM_CLK (1U << 0)
+#define GENPD_FLAG_IRQ_SAFE (1U << 1)
+#define GENPD_FLAG_ALWAYS_ON (1U << 2)
+#define GENPD_FLAG_ACTIVE_WAKEUP (1U << 3)
enum gpd_status {
GPD_STATE_ACTIVE = 0, /* PM domain is active */
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index d0e1f15..a50b1c9 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -116,6 +116,12 @@
#endif /* CONFIG_PROC_FS */
+#ifdef CONFIG_PROC_UID
+extern void proc_register_uid(kuid_t uid);
+#else
+static inline void proc_register_uid(kuid_t uid) {}
+#endif
+
struct net;
static inline struct proc_dir_entry *proc_net_mkdir(
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c69f308..5e2bb88 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -30,7 +30,6 @@
#include <linux/rseq.h>
/* task_struct member predeclarations (sorted alphabetically): */
-struct audit_context;
struct backing_dev_info;
struct bio_list;
struct blk_plug;
@@ -355,12 +354,6 @@
* For cfs_rq, it is the aggregated load_avg of all runnable and
* blocked sched_entities.
*
- * load_avg may also take frequency scaling into account:
- *
- * load_avg = runnable% * scale_load_down(load) * freq%
- *
- * where freq% is the CPU frequency normalized to the highest frequency.
- *
* [util_avg definition]
*
* util_avg = running% * SCHED_CAPACITY_SCALE
@@ -369,17 +362,14 @@
* a CPU. For cfs_rq, it is the aggregated util_avg of all runnable
* and blocked sched_entities.
*
- * util_avg may also factor frequency scaling and CPU capacity scaling:
+ * load_avg and util_avg don't directly factor frequency scaling and CPU
+ * capacity scaling. The scaling is done through the rq_clock_pelt that
+ * is used for computing those signals (see update_rq_clock_pelt()).
*
- * util_avg = running% * SCHED_CAPACITY_SCALE * freq% * capacity%
- *
- * where freq% is the same as above, and capacity% is the CPU capacity
- * normalized to the greatest capacity (due to uarch differences, etc).
- *
- * N.B., the above ratios (runnable%, running%, freq%, and capacity%)
- * themselves are in the range of [0, 1]. To do fixed point arithmetics,
- * we therefore scale them to as large a range as necessary. This is for
- * example reflected by util_avg's SCHED_CAPACITY_SCALE.
+ * N.B., the above ratios (runnable% and running%) themselves are in the
+ * range of [0, 1]. To do fixed point arithmetics, we therefore scale them
+ * to as large a range as necessary. This is for example reflected by
+ * util_avg's SCHED_CAPACITY_SCALE.
*
* [Overflow issue]
*
@@ -879,10 +869,8 @@
struct callback_head *task_works;
- struct audit_context *audit_context;
-#ifdef CONFIG_AUDITSYSCALL
- kuid_t loginuid;
- unsigned int sessionid;
+#ifdef CONFIG_AUDIT
+ struct audit_task_info *audit;
#endif
struct seccomp seccomp;
diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index a4530d7..cc6bcc1 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -23,6 +23,12 @@
unsigned int flags));
void cpufreq_remove_update_util_hook(int cpu);
bool cpufreq_this_cpu_can_update(struct cpufreq_policy *policy);
+
+static inline unsigned long map_util_freq(unsigned long util,
+ unsigned long freq, unsigned long cap)
+{
+ return (freq + (freq >> 2)) * util / cap;
+}
#endif /* CONFIG_CPU_FREQ */
#endif /* _LINUX_SCHED_CPUFREQ_H */
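map_util_freq() above encodes schedutil's 1.25x headroom: the requested frequency is (freq + freq/4) * util / cap. With made-up numbers, util = 512, cap = 1024 and freq = 2000000 kHz gives 2500000 * 512 / 1024 = 1250000 kHz, i.e. 62.5% of maximum for a 50% utilized CPU, leaving room for the load to grow before the next update.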
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index a9c32da..77e9fa3 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -22,6 +22,8 @@
extern unsigned int sysctl_sched_latency;
extern unsigned int sysctl_sched_min_granularity;
+extern unsigned int sysctl_sched_sync_hint_enable;
+extern unsigned int sysctl_sched_cstate_aware;
extern unsigned int sysctl_sched_wakeup_granularity;
extern unsigned int sysctl_sched_child_runs_first;
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 15f3f61..beffd85 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -23,10 +23,10 @@
#define SD_BALANCE_FORK 0x0008 /* Balance on fork, clone */
#define SD_BALANCE_WAKE 0x0010 /* Balance on wakeup */
#define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */
-#define SD_ASYM_CPUCAPACITY 0x0040 /* Groups have different max cpu capacities */
-#define SD_SHARE_CPUCAPACITY 0x0080 /* Domain members share cpu capacity */
+#define SD_ASYM_CPUCAPACITY 0x0040 /* Domain members have different CPU capacities */
+#define SD_SHARE_CPUCAPACITY 0x0080 /* Domain members share CPU capacity */
#define SD_SHARE_POWERDOMAIN 0x0100 /* Domain members share power domain */
-#define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */
+#define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share CPU pkg resources */
#define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */
#define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */
#define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */
@@ -202,6 +202,17 @@
# define SD_INIT_NAME(type)
#endif
+#ifndef arch_scale_cpu_capacity
+static __always_inline
+unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+{
+ if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
+ return sd->smt_gain / sd->span_weight;
+
+ return SCHED_CAPACITY_SCALE;
+}
+#endif
+
#else /* CONFIG_SMP */
struct sched_domain_attr;
@@ -217,6 +228,14 @@
return true;
}
+#ifndef arch_scale_cpu_capacity
+static __always_inline
+unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu)
+{
+ return SCHED_CAPACITY_SCALE;
+}
+#endif
+
#endif /* !CONFIG_SMP */
static inline int task_node(const struct task_struct *p)
diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
index 10b19a1..a3661e9 100644
--- a/include/linux/sched/wake_q.h
+++ b/include/linux/sched/wake_q.h
@@ -34,6 +34,7 @@
struct wake_q_head {
struct wake_q_node *first;
struct wake_q_node **lastp;
+ int count;
};
#define WAKE_Q_TAIL ((struct wake_q_node *) 0x01)
@@ -45,6 +46,7 @@
{
head->first = WAKE_Q_TAIL;
head->lastp = &head->first;
+ head->count = 0;
}
extern void wake_q_add(struct wake_q_head *head,
diff --git a/include/linux/sched/xacct.h b/include/linux/sched/xacct.h
index c078f0a..9544c9d 100644
--- a/include/linux/sched/xacct.h
+++ b/include/linux/sched/xacct.h
@@ -28,6 +28,11 @@
{
tsk->ioac.syscw++;
}
+
+static inline void inc_syscfs(struct task_struct *tsk)
+{
+ tsk->ioac.syscfs++;
+}
#else
static inline void add_rchar(struct task_struct *tsk, ssize_t amt)
{
@@ -44,6 +49,10 @@
static inline void inc_syscw(struct task_struct *tsk)
{
}
+
+static inline void inc_syscfs(struct task_struct *tsk)
+{
+}
#endif
#endif /* _LINUX_SCHED_XACCT_H */
diff --git a/include/linux/security.h b/include/linux/security.h
index d2240605..ded314a 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -321,6 +321,7 @@
int security_file_permission(struct file *file, int mask);
int security_file_alloc(struct file *file);
void security_file_free(struct file *file);
+void security_file_pre_free(struct file *file);
int security_file_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
int security_mmap_file(struct file *file, unsigned long prot,
unsigned long flags);
@@ -335,6 +336,7 @@
int security_file_receive(struct file *file);
int security_file_open(struct file *file);
int security_task_alloc(struct task_struct *task, unsigned long clone_flags);
+void security_task_post_alloc(struct task_struct *task);
void security_task_free(struct task_struct *task);
int security_cred_alloc_blank(struct cred *cred, gfp_t gfp);
void security_cred_free(struct cred *cred);
@@ -366,6 +368,7 @@
int security_task_movememory(struct task_struct *p);
int security_task_kill(struct task_struct *p, struct siginfo *info,
int sig, const struct cred *cred);
+void security_task_exit(struct task_struct *p);
int security_task_prctl(int option, unsigned long arg2, unsigned long arg3,
unsigned long arg4, unsigned long arg5);
void security_task_to_inode(struct task_struct *p, struct inode *inode);
@@ -828,6 +831,9 @@
static inline void security_file_free(struct file *file)
{ }
+static inline void security_file_pre_free(struct file *file)
+{ }
+
static inline int security_file_ioctl(struct file *file, unsigned int cmd,
unsigned long arg)
{
@@ -891,6 +897,9 @@
return 0;
}
+static inline void security_task_post_alloc(struct task_struct *task)
+{ }
+
static inline void security_task_free(struct task_struct *task)
{ }
@@ -1026,6 +1035,9 @@
return 0;
}
+static inline void security_task_exit(struct task_struct *p)
+{ }
+
static inline int security_task_prctl(int option, unsigned long arg2,
unsigned long arg3,
unsigned long arg4,
diff --git a/include/linux/serial_core.h b/include/linux/serial_core.h
index 3460b15..ff9d0ee 100644
--- a/include/linux/serial_core.h
+++ b/include/linux/serial_core.h
@@ -22,6 +22,7 @@
#include <linux/bitops.h>
#include <linux/compiler.h>
+#include <linux/console.h>
#include <linux/interrupt.h>
#include <linux/circ_buf.h>
#include <linux/spinlock.h>
diff --git a/include/linux/sort.h b/include/linux/sort.h
index 2b99a5d..61b96d0 100644
--- a/include/linux/sort.h
+++ b/include/linux/sort.h
@@ -4,6 +4,11 @@
#include <linux/types.h>
+void sort_r(void *base, size_t num, size_t size,
+ int (*cmp)(const void *, const void *, const void *),
+ void (*swap)(void *, void *, int),
+ const void *priv);
+
void sort(void *base, size_t num, size_t size,
int (*cmp)(const void *, const void *),
void (*swap)(void *, void *, int));
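sort_r() is the reentrant variant of sort(): the extra priv cookie is handed back to the comparison callback, so callers can avoid globals. A minimal sketch with made-up names:

        #include <linux/sort.h>

        /* Sketch: sort an index array by the values it refers to, passing
         * the key table through the priv cookie instead of a global. */
        static int cmp_by_key(const void *a, const void *b, const void *priv)
        {
                const int *keys = priv;

                return keys[*(const int *)a] - keys[*(const int *)b];
        }

        static void sort_indices(int *idx, size_t n, const int *keys)
        {
                sort_r(idx, n, sizeof(*idx), cmp_by_key, NULL, keys);
        }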
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2ff814c..635a660 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1159,6 +1159,10 @@
return -EINVAL;
}
#endif
+
+int ksys_prctl(int option, unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5);
+
unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
unsigned long prot, unsigned long flags,
unsigned long fd, unsigned long pgoff);
@@ -1293,4 +1297,22 @@
return old;
}
+#ifdef CONFIG_ALT_SYSCALL
+
+/* Only used with ALT_SYSCALL enabled */
+
+int ksys_prctl(int option, unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5);
+int ksys_setpriority(int which, int who, int niceval);
+int ksys_getpriority(int which, int who);
+int ksys_perf_event_open(
+ struct perf_event_attr __user *attr_uptr,
+ pid_t pid, int cpu, int group_fd, unsigned long flags);
+int ksys_adjtimex(struct timex __user *txc_p);
+int ksys_clock_adjtime(clockid_t which_clock,
+ struct timex __user *tx);
+int ksys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache);
+
+#endif /* CONFIG_ALT_SYSCALL */
+
#endif
diff --git a/include/linux/task_io_accounting.h b/include/linux/task_io_accounting.h
index 6f6acce..bb26108 100644
--- a/include/linux/task_io_accounting.h
+++ b/include/linux/task_io_accounting.h
@@ -19,6 +19,8 @@
u64 syscr;
/* # of write syscalls */
u64 syscw;
+ /* # of fsync syscalls */
+ u64 syscfs;
#endif /* CONFIG_TASK_XACCT */
#ifdef CONFIG_TASK_IO_ACCOUNTING
diff --git a/include/linux/task_io_accounting_ops.h b/include/linux/task_io_accounting_ops.h
index bb5498b..733ab62 100644
--- a/include/linux/task_io_accounting_ops.h
+++ b/include/linux/task_io_accounting_ops.h
@@ -97,6 +97,7 @@
dst->wchar += src->wchar;
dst->syscr += src->syscr;
dst->syscw += src->syscw;
+ dst->syscfs += src->syscfs;
}
#else
static inline void task_chr_io_accounting_add(struct task_io_accounting *dst,
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 0643c08..3aa0559 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -577,7 +577,8 @@
bool perf_type_tracepoint);
#endif
#ifdef CONFIG_UPROBE_EVENTS
-extern int perf_uprobe_init(struct perf_event *event, bool is_retprobe);
+extern int perf_uprobe_init(struct perf_event *event,
+ unsigned long ref_ctr_offset, bool is_retprobe);
extern void perf_uprobe_destroy(struct perf_event *event);
extern int bpf_get_uprobe_info(const struct perf_event *event,
u32 *fd_type, const char **filename,
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index bb9d208..103a48a 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -123,6 +123,7 @@
extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
+extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool);
extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
extern int uprobe_mmap(struct vm_area_struct *vma);
@@ -160,6 +161,10 @@
{
return -ENOSYS;
}
+static inline int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc)
+{
+ return -ENOSYS;
+}
static inline int
uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool add)
{
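uprobe_register_refctr() behaves like uprobe_register() but additionally takes the file offset of a USDT-style reference counter (semaphore) that the kernel maintains while the probe is installed. A sketch with placeholder offsets:

        #include <linux/uprobes.h>

        /* Sketch: install a consumer at a hypothetical instruction offset and
         * let the kernel maintain the semaphore stored at a hypothetical
         * ref_ctr offset. Neither offset comes from this patch. */
        static int install_usdt_probe(struct inode *inode, struct uprobe_consumer *uc)
        {
                loff_t probe_off = 0x1234;      /* placeholder */
                loff_t ref_ctr_off = 0x5678;    /* placeholder */

                return uprobe_register_refctr(inode, probe_off, ref_ctr_off, uc);
        }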
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 47a3441..870d36e 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -28,6 +28,7 @@
FOR_ALL_ZONES(PGSCAN_SKIP),
PGFREE, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE,
PGFAULT, PGMAJFAULT,
+ PGMAJFAULT_S, PGMAJFAULT_A, PGMAJFAULT_F,
PGLAZYFREED,
PGREFILL,
PGSTEAL_KSWAPD,
diff --git a/include/linux/wakeup_reason.h b/include/linux/wakeup_reason.h
new file mode 100644
index 0000000..7ce50f0d
--- /dev/null
+++ b/include/linux/wakeup_reason.h
@@ -0,0 +1,23 @@
+/*
+ * include/linux/wakeup_reason.h
+ *
+ * Logs the reason that caused the kernel to resume
+ * from suspend.
+ *
+ * Copyright (C) 2014 Google, Inc.
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef _LINUX_WAKEUP_REASON_H
+#define _LINUX_WAKEUP_REASON_H
+
+void log_wakeup_reason(int irq);
+
+#endif /* _LINUX_WAKEUP_REASON_H */
diff --git a/include/net/netfilter/br_netfilter.h b/include/net/netfilter/br_netfilter.h
index a4ba601..da7af8e 100644
--- a/include/net/netfilter/br_netfilter.h
+++ b/include/net/netfilter/br_netfilter.h
@@ -48,7 +48,8 @@
return port ? &port->br->fake_rtable : NULL;
}
-struct net_device *setup_pre_routing(struct sk_buff *skb);
+struct net_device *setup_pre_routing(struct sk_buff *skb,
+ const struct net *net);
#if IS_ENABLED(CONFIG_IPV6)
int br_validate_ipv6(struct net *net, struct sk_buff *skb);
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 366e2a6..a255a62 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -161,6 +161,7 @@
int sysctl_tcp_invalid_ratelimit;
int sysctl_tcp_pacing_ss_ratio;
int sysctl_tcp_pacing_ca_ratio;
+ int sysctl_tcp_default_init_rwnd;
int sysctl_tcp_wmem[3];
int sysctl_tcp_rmem[3];
int sysctl_tcp_comp_sack_nr;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0d4501f..4008d76 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1329,7 +1329,7 @@
rx_opt->num_sacks = 0;
}
-u32 tcp_default_init_rwnd(u32 mss);
+u32 tcp_default_init_rwnd(const struct sock *sk, u32 mss);
void tcp_cwnd_restart(struct sock *sk, s32 delta);
static inline void tcp_slow_start_after_idle_check(struct sock *sk)
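tcp_default_init_rwnd() now takes the socket so per-namespace state is reachable; together with the sysctl_tcp_default_init_rwnd member added above, the implementation in net/ipv4 (outside this excerpt) presumably looks roughly like this sketch:

        /* Sketch only -- not the actual net/ipv4 hunk. The per-netns knob
         * added above replaces the old hard-coded starting point, still
         * scaled down for large MSS values. */
        u32 tcp_default_init_rwnd(const struct sock *sk, u32 mss)
        {
                u32 init_rwnd = sock_net(sk)->ipv4.sysctl_tcp_default_init_rwnd;

                if (mss > 1460)
                        init_rwnd = max((1460 * init_rwnd) / mss, 2U);
                return init_rwnd;
        }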
diff --git a/include/trace/events/fs.h b/include/trace/events/fs.h
new file mode 100644
index 0000000..fb634b7
--- /dev/null
+++ b/include/trace/events/fs.h
@@ -0,0 +1,53 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM fs
+
+#if !defined(_TRACE_FS_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_FS_H
+
+#include <linux/fs.h>
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(do_sys_open,
+
+ TP_PROTO(const char *filename, int flags, int mode),
+
+ TP_ARGS(filename, flags, mode),
+
+ TP_STRUCT__entry(
+ __string( filename, filename )
+ __field( int, flags )
+ __field( int, mode )
+ ),
+
+ TP_fast_assign(
+ __assign_str(filename, filename);
+ __entry->flags = flags;
+ __entry->mode = mode;
+ ),
+
+ TP_printk("\"%s\" %x %o",
+ __get_str(filename), __entry->flags, __entry->mode)
+);
+
+TRACE_EVENT(open_exec,
+
+ TP_PROTO(const char *filename),
+
+ TP_ARGS(filename),
+
+ TP_STRUCT__entry(
+ __string( filename, filename )
+ ),
+
+ TP_fast_assign(
+ __assign_str(filename, filename);
+ ),
+
+ TP_printk("\"%s\"",
+ __get_str(filename))
+);
+
+#endif /* _TRACE_FS_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
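The TRACE_EVENT() definitions above generate trace_do_sys_open() and trace_open_exec() helpers for whichever compilation unit defines CREATE_TRACE_POINTS; the actual fs/ call sites are outside this excerpt. A minimal sketch of hypothetical emitters:

        #define CREATE_TRACE_POINTS
        #include <trace/events/fs.h>

        /* Sketch: emit the two events defined above from made-up helpers. */
        static void note_open(const char *name, int flags, int mode)
        {
                trace_do_sys_open(name, flags, mode);
        }

        static void note_exec(const char *name)
        {
                trace_open_exec(name);
        }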
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 9a4bdfa..8067679 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -241,7 +241,7 @@
DEFINE_EVENT(sched_process_template, sched_process_free,
TP_PROTO(struct task_struct *p),
TP_ARGS(p));
-
+
/*
* Tracepoint for a task exiting:
@@ -396,6 +396,30 @@
TP_ARGS(tsk, delay));
/*
+ * Tracepoint for recording the cause of uninterruptible sleep.
+ */
+TRACE_EVENT(sched_blocked_reason,
+
+ TP_PROTO(struct task_struct *tsk),
+
+ TP_ARGS(tsk),
+
+ TP_STRUCT__entry(
+ __field( pid_t, pid )
+ __field( void*, caller )
+ __field( bool, io_wait )
+ ),
+
+ TP_fast_assign(
+ __entry->pid = tsk->pid;
+ __entry->caller = (void*)get_wchan(tsk);
+ __entry->io_wait = tsk->in_iowait;
+ ),
+
+ TP_printk("pid=%d iowait=%d caller=%pS", __entry->pid, __entry->io_wait, __entry->caller)
+);
+
+/*
* Tracepoint for accounting runtime (time the task is executing
* on a CPU).
*/
@@ -587,6 +611,424 @@
TP_printk("cpu=%d", __entry->cpu)
);
+
+#ifdef CONFIG_SMP
+#ifdef CREATE_TRACE_POINTS
+static inline
+int __trace_sched_cpu(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ struct rq *rq = cfs_rq ? cfs_rq->rq : NULL;
+#else
+ struct rq *rq = cfs_rq ? container_of(cfs_rq, struct rq, cfs) : NULL;
+#endif
+ return rq ? cpu_of(rq)
+ : task_cpu((container_of(se, struct task_struct, se)));
+}
+
+static inline
+int __trace_sched_path(struct cfs_rq *cfs_rq, char *path, int len)
+{
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ int l = path ? len : 0;
+
+ if (cfs_rq && task_group_is_autogroup(cfs_rq->tg))
+ return autogroup_path(cfs_rq->tg, path, l) + 1;
+ else if (cfs_rq && cfs_rq->tg->css.cgroup)
+ return cgroup_path(cfs_rq->tg->css.cgroup, path, l) + 1;
+#endif
+ if (path)
+ strcpy(path, "(null)");
+
+ return strlen("(null)");
+}
+
+static inline
+struct cfs_rq *__trace_sched_group_cfs_rq(struct sched_entity *se)
+{
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ return se->my_q;
+#else
+ return NULL;
+#endif
+}
+#endif /* CREATE_TRACE_POINTS */
+
+/*
+ * Tracepoint for cfs_rq load tracking:
+ */
+TRACE_EVENT(sched_load_cfs_rq,
+
+ TP_PROTO(struct cfs_rq *cfs_rq),
+
+ TP_ARGS(cfs_rq),
+
+ TP_STRUCT__entry(
+ __field( int, cpu )
+ __dynamic_array(char, path,
+ __trace_sched_path(cfs_rq, NULL, 0) )
+ __field( unsigned long, load )
+ __field( unsigned long, rbl_load )
+ __field( unsigned long, util )
+ ),
+
+ TP_fast_assign(
+ __entry->cpu = __trace_sched_cpu(cfs_rq, NULL);
+ __trace_sched_path(cfs_rq, __get_dynamic_array(path),
+ __get_dynamic_array_len(path));
+ __entry->load = cfs_rq->avg.load_avg;
+ __entry->rbl_load = cfs_rq->avg.runnable_load_avg;
+ __entry->util = cfs_rq->avg.util_avg;
+ ),
+
+ TP_printk("cpu=%d path=%s load=%lu rbl_load=%lu util=%lu",
+ __entry->cpu, __get_str(path), __entry->load,
+ __entry->rbl_load, __entry->util)
+);
+
+/*
+ * Tracepoint for rt_rq load tracking:
+ */
+struct rq;
+TRACE_EVENT(sched_load_rt_rq,
+
+ TP_PROTO(struct rq *rq),
+
+ TP_ARGS(rq),
+
+ TP_STRUCT__entry(
+ __field( int, cpu )
+ __field( unsigned long, util )
+ ),
+
+ TP_fast_assign(
+ __entry->cpu = rq->cpu;
+ __entry->util = rq->avg_rt.util_avg;
+ ),
+
+ TP_printk("cpu=%d util=%lu", __entry->cpu,
+ __entry->util)
+);
+
+/*
+ * Tracepoint for sched_entity load tracking:
+ */
+TRACE_EVENT(sched_load_se,
+
+ TP_PROTO(struct sched_entity *se),
+
+ TP_ARGS(se),
+
+ TP_STRUCT__entry(
+ __field( int, cpu )
+ __dynamic_array(char, path,
+ __trace_sched_path(__trace_sched_group_cfs_rq(se), NULL, 0) )
+ __array( char, comm, TASK_COMM_LEN )
+ __field( pid_t, pid )
+ __field( unsigned long, load )
+ __field( unsigned long, rbl_load )
+ __field( unsigned long, util )
+ ),
+
+ TP_fast_assign(
+ struct cfs_rq *gcfs_rq = __trace_sched_group_cfs_rq(se);
+ struct task_struct *p = gcfs_rq ? NULL
+ : container_of(se, struct task_struct, se);
+
+ __entry->cpu = __trace_sched_cpu(gcfs_rq, se);
+ __trace_sched_path(gcfs_rq, __get_dynamic_array(path),
+ __get_dynamic_array_len(path));
+ memcpy(__entry->comm, p ? p->comm : "(null)",
+ p ? TASK_COMM_LEN : sizeof("(null)"));
+ __entry->pid = p ? p->pid : -1;
+ __entry->load = se->avg.load_avg;
+ __entry->rbl_load = se->avg.runnable_load_avg;
+ __entry->util = se->avg.util_avg;
+ ),
+
+ TP_printk("cpu=%d path=%s comm=%s pid=%d load=%lu rbl_load=%lu util=%lu",
+ __entry->cpu, __get_str(path), __entry->comm, __entry->pid,
+ __entry->load, __entry->rbl_load, __entry->util)
+);
+
+/*
+ * Tracepoint for task_group load tracking:
+ */
+#ifdef CONFIG_FAIR_GROUP_SCHED
+TRACE_EVENT(sched_load_tg,
+
+ TP_PROTO(struct cfs_rq *cfs_rq),
+
+ TP_ARGS(cfs_rq),
+
+ TP_STRUCT__entry(
+ __field( int, cpu )
+ __dynamic_array(char, path,
+ __trace_sched_path(cfs_rq, NULL, 0) )
+ __field( long, load )
+ ),
+
+ TP_fast_assign(
+ __entry->cpu = cfs_rq->rq->cpu;
+ __trace_sched_path(cfs_rq, __get_dynamic_array(path),
+ __get_dynamic_array_len(path));
+ __entry->load = atomic_long_read(&cfs_rq->tg->load_avg);
+ ),
+
+ TP_printk("cpu=%d path=%s load=%ld", __entry->cpu, __get_str(path),
+ __entry->load)
+);
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
+/*
+ * Tracepoint for tasks' estimated utilization.
+ */
+TRACE_EVENT(sched_util_est_task,
+
+ TP_PROTO(struct task_struct *tsk, struct sched_avg *avg),
+
+ TP_ARGS(tsk, avg),
+
+ TP_STRUCT__entry(
+ __array( char, comm, TASK_COMM_LEN )
+ __field( pid_t, pid )
+ __field( int, cpu )
+ __field( unsigned int, util_avg )
+ __field( unsigned int, est_enqueued )
+ __field( unsigned int, est_ewma )
+
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+ __entry->pid = tsk->pid;
+ __entry->cpu = task_cpu(tsk);
+ __entry->util_avg = avg->util_avg;
+ __entry->est_enqueued = avg->util_est.enqueued;
+ __entry->est_ewma = avg->util_est.ewma;
+ ),
+
+ TP_printk("comm=%s pid=%d cpu=%d util_avg=%u util_est_ewma=%u util_est_enqueued=%u",
+ __entry->comm,
+ __entry->pid,
+ __entry->cpu,
+ __entry->util_avg,
+ __entry->est_ewma,
+ __entry->est_enqueued)
+);
+
+/*
+ * Tracepoint for root cfs_rq's estimated utilization.
+ */
+TRACE_EVENT(sched_util_est_cpu,
+
+ TP_PROTO(int cpu, struct cfs_rq *cfs_rq),
+
+ TP_ARGS(cpu, cfs_rq),
+
+ TP_STRUCT__entry(
+ __field( int, cpu )
+ __field( unsigned int, util_avg )
+ __field( unsigned int, util_est_enqueued )
+ ),
+
+ TP_fast_assign(
+ __entry->cpu = cpu;
+ __entry->util_avg = cfs_rq->avg.util_avg;
+ __entry->util_est_enqueued = cfs_rq->avg.util_est.enqueued;
+ ),
+
+ TP_printk("cpu=%d util_avg=%u util_est_enqueued=%u",
+ __entry->cpu,
+ __entry->util_avg,
+ __entry->util_est_enqueued)
+);
+
+/*
+ * Tracepoint for find_best_target
+ */
+TRACE_EVENT(sched_find_best_target,
+
+ TP_PROTO(struct task_struct *tsk, bool prefer_idle,
+ unsigned long min_util, int best_idle, int best_active,
+ int target, int backup),
+
+ TP_ARGS(tsk, prefer_idle, min_util, best_idle,
+ best_active, target, backup),
+
+ TP_STRUCT__entry(
+ __array( char, comm, TASK_COMM_LEN )
+ __field( pid_t, pid )
+ __field( unsigned long, min_util )
+ __field( bool, prefer_idle )
+ __field( int, best_idle )
+ __field( int, best_active )
+ __field( int, target )
+ __field( int, backup )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+ __entry->pid = tsk->pid;
+ __entry->min_util = min_util;
+ __entry->prefer_idle = prefer_idle;
+ __entry->best_idle = best_idle;
+ __entry->best_active = best_active;
+ __entry->target = target;
+ __entry->backup = backup;
+ ),
+
+ TP_printk("pid=%d comm=%s prefer_idle=%d "
+ "best_idle=%d best_active=%d target=%d backup=%d",
+ __entry->pid, __entry->comm, __entry->prefer_idle,
+ __entry->best_idle, __entry->best_active,
+ __entry->target, __entry->backup)
+);
+
+/*
+ * Tracepoint for accounting CPU boosted utilization
+ */
+TRACE_EVENT(sched_boost_cpu,
+
+ TP_PROTO(int cpu, unsigned long util, long margin),
+
+ TP_ARGS(cpu, util, margin),
+
+ TP_STRUCT__entry(
+ __field( int, cpu )
+ __field( unsigned long, util )
+ __field(long, margin )
+ ),
+
+ TP_fast_assign(
+ __entry->cpu = cpu;
+ __entry->util = util;
+ __entry->margin = margin;
+ ),
+
+ TP_printk("cpu=%d util=%lu margin=%ld",
+ __entry->cpu,
+ __entry->util,
+ __entry->margin)
+);
+
+/*
+ * Tracepoint for schedtune_tasks_update
+ */
+TRACE_EVENT(sched_tune_tasks_update,
+
+ TP_PROTO(struct task_struct *tsk, int cpu, int tasks, int idx,
+ int boost, int max_boost, u64 group_ts),
+
+ TP_ARGS(tsk, cpu, tasks, idx, boost, max_boost, group_ts),
+
+ TP_STRUCT__entry(
+ __array( char, comm, TASK_COMM_LEN )
+ __field( pid_t, pid )
+ __field( int, cpu )
+ __field( int, tasks )
+ __field( int, idx )
+ __field( int, boost )
+ __field( int, max_boost )
+ __field( u64, group_ts )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+ __entry->pid = tsk->pid;
+ __entry->cpu = cpu;
+ __entry->tasks = tasks;
+ __entry->idx = idx;
+ __entry->boost = boost;
+ __entry->max_boost = max_boost;
+ __entry->group_ts = group_ts;
+ ),
+
+ TP_printk("pid=%d comm=%s "
+ "cpu=%d tasks=%d idx=%d boost=%d max_boost=%d timeout=%llu",
+ __entry->pid, __entry->comm,
+ __entry->cpu, __entry->tasks, __entry->idx,
+ __entry->boost, __entry->max_boost,
+ __entry->group_ts)
+);
+
+/*
+ * Tracepoint for schedtune_boostgroup_update
+ */
+TRACE_EVENT(sched_tune_boostgroup_update,
+
+ TP_PROTO(int cpu, int variation, int max_boost),
+
+ TP_ARGS(cpu, variation, max_boost),
+
+ TP_STRUCT__entry(
+ __field( int, cpu )
+ __field( int, variation )
+ __field( int, max_boost )
+ ),
+
+ TP_fast_assign(
+ __entry->cpu = cpu;
+ __entry->variation = variation;
+ __entry->max_boost = max_boost;
+ ),
+
+ TP_printk("cpu=%d variation=%d max_boost=%d",
+ __entry->cpu, __entry->variation, __entry->max_boost)
+);
+
+/*
+ * Tracepoint for accounting task boosted utilization
+ */
+TRACE_EVENT(sched_boost_task,
+
+ TP_PROTO(struct task_struct *tsk, unsigned long util, long margin),
+
+ TP_ARGS(tsk, util, margin),
+
+ TP_STRUCT__entry(
+ __array( char, comm, TASK_COMM_LEN )
+ __field( pid_t, pid )
+ __field( unsigned long, util )
+ __field( long, margin )
+
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->comm, tsk->comm, TASK_COMM_LEN);
+ __entry->pid = tsk->pid;
+ __entry->util = util;
+ __entry->margin = margin;
+ ),
+
+ TP_printk("comm=%s pid=%d util=%lu margin=%ld",
+ __entry->comm, __entry->pid,
+ __entry->util,
+ __entry->margin)
+);
+
+/*
+ * Tracepoint for system overutilized flag
+ */
+TRACE_EVENT(sched_overutilized,
+
+ TP_PROTO(int overutilized),
+
+ TP_ARGS(overutilized),
+
+ TP_STRUCT__entry(
+ __field( int, overutilized )
+ ),
+
+ TP_fast_assign(
+ __entry->overutilized = overutilized;
+ ),
+
+ TP_printk("overutilized=%d",
+ __entry->overutilized)
+);
+
+#endif /* CONFIG_SMP */
#endif /* _TRACE_SCHED_H */
/* This part must be outside protection */
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index 818ae69..64b327dd 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -71,6 +71,7 @@
#define AUDIT_TTY_SET 1017 /* Set TTY auditing status */
#define AUDIT_SET_FEATURE 1018 /* Turn an audit feature on or off */
#define AUDIT_GET_FEATURE 1019 /* Get which features are enabled */
+#define AUDIT_CONTAINER_OP 1020 /* Define the container id and info */
#define AUDIT_FIRST_USER_MSG 1100 /* Userspace messages mostly uninteresting to kernel */
#define AUDIT_USER_AVC 1107 /* We filter this differently */
@@ -469,6 +470,7 @@
#define AUDIT_UID_UNSET (unsigned int)-1
#define AUDIT_SID_UNSET ((unsigned int)-1)
+#define AUDIT_CID_UNSET ((u64)-1)
/* audit_rule_data supports filter rules with both integer and string
* fields. It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
diff --git a/include/uapi/linux/netfilter/xt_IDLETIMER.h b/include/uapi/linux/netfilter/xt_IDLETIMER.h
index 3c586a1..c82a1c1 100644
--- a/include/uapi/linux/netfilter/xt_IDLETIMER.h
+++ b/include/uapi/linux/netfilter/xt_IDLETIMER.h
@@ -5,6 +5,7 @@
* Header file for Xtables timer target module.
*
* Copyright (C) 2004, 2010 Nokia Corporation
+ *
* Written by Timo Teras <ext-timo.teras@nokia.com>
*
* Converted to x_tables and forward-ported to 2.6.34
@@ -33,12 +34,19 @@
#include <linux/types.h>
#define MAX_IDLETIMER_LABEL_SIZE 28
+#define NLMSG_MAX_SIZE 64
+
+#define NL_EVENT_TYPE_INACTIVE 0
+#define NL_EVENT_TYPE_ACTIVE 1
struct idletimer_tg_info {
__u32 timeout;
char label[MAX_IDLETIMER_LABEL_SIZE];
+ /* Use netlink messages for notification in addition to sysfs */
+ __u8 send_nl_msg;
+
/* for kernel module internal use only */
struct idletimer_tg *timer __attribute__((aligned(8)));
};
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index b17201e..e7aa960 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -155,6 +155,9 @@
#define PR_SET_PTRACER 0x59616d61
# define PR_SET_PTRACER_ANY ((unsigned long)-1)
+#define PR_ALT_SYSCALL 0x43724f53
+# define PR_ALT_SYSCALL_SET_SYSCALL_TABLE 1
+
#define PR_SET_CHILD_SUBREAPER 36
#define PR_GET_CHILD_SUBREAPER 37
@@ -220,4 +223,7 @@
# define PR_SPEC_DISABLE (1UL << 2)
# define PR_SPEC_FORCE_DISABLE (1UL << 3)
+#define PR_SET_VMA 0x53564d41
+# define PR_SET_VMA_ANON_NAME 0
+
#endif /* _LINUX_PRCTL_H */
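Both new commands are ordinary prctl() calls from userspace: PR_ALT_SYSCALL selects a registered alternate syscall table by name, and PR_SET_VMA / PR_SET_VMA_ANON_NAME attaches a label to an anonymous mapping. A hedged userspace sketch (the table name and mapping label are made up):

        #include <stdio.h>
        #include <sys/mman.h>
        #include <sys/prctl.h>

        /* Values mirror the definitions added above; userspace headers may
         * not carry them yet, so provide fallbacks. */
        #ifndef PR_SET_VMA
        #define PR_SET_VMA                      0x53564d41
        #define PR_SET_VMA_ANON_NAME            0
        #endif
        #ifndef PR_ALT_SYSCALL
        #define PR_ALT_SYSCALL                  0x43724f53
        #define PR_ALT_SYSCALL_SET_SYSCALL_TABLE 1
        #endif

        int main(void)
        {
                void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (p == MAP_FAILED)
                        return 1;

                /* Label an anonymous mapping; the name is illustrative only. */
                if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                          (unsigned long)p, 4096, "example-buffer"))
                        perror("PR_SET_VMA");

                /* Switch to a hypothetical registered alternate syscall table. */
                if (prctl(PR_ALT_SYSCALL, PR_ALT_SYSCALL_SET_SYSCALL_TABLE, "example"))
                        perror("PR_ALT_SYSCALL");

                return 0;
        }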
diff --git a/include/uapi/linux/virtio_blk.h b/include/uapi/linux/virtio_blk.h
index 9ebe4d9..682afbf 100644
--- a/include/uapi/linux/virtio_blk.h
+++ b/include/uapi/linux/virtio_blk.h
@@ -38,6 +38,8 @@
#define VIRTIO_BLK_F_BLK_SIZE 6 /* Block size of disk is available*/
#define VIRTIO_BLK_F_TOPOLOGY 10 /* Topology information is available */
#define VIRTIO_BLK_F_MQ 12 /* support more than one vq */
+#define VIRTIO_BLK_F_DISCARD 13 /* DISCARD is supported */
+#define VIRTIO_BLK_F_WRITE_ZEROES 14 /* WRITE ZEROES is supported */
/* Legacy feature bits */
#ifndef VIRTIO_BLK_NO_LEGACY
@@ -86,6 +88,39 @@
/* number of vqs, only available when VIRTIO_BLK_F_MQ is set */
__u16 num_queues;
+
+ /* the next 3 entries are guarded by VIRTIO_BLK_F_DISCARD */
+ /*
+ * The maximum discard sectors (in 512-byte sectors) for
+ * one segment.
+ */
+ __u32 max_discard_sectors;
+ /*
+ * The maximum number of discard segments in a
+ * discard command.
+ */
+ __u32 max_discard_seg;
+ /* Discard commands must be aligned to this number of sectors. */
+ __u32 discard_sector_alignment;
+
+ /* the next 3 entries are guarded by VIRTIO_BLK_F_WRITE_ZEROES */
+ /*
+ * The maximum number of write zeroes sectors (in 512-byte sectors) in
+ * one segment.
+ */
+ __u32 max_write_zeroes_sectors;
+ /*
+ * The maximum number of segments in a write zeroes
+ * command.
+ */
+ __u32 max_write_zeroes_seg;
+ /*
+ * Set if a VIRTIO_BLK_T_WRITE_ZEROES request may result in the
+ * deallocation of one or more of the sectors.
+ */
+ __u8 write_zeroes_may_unmap;
+
+ __u8 unused1[3];
} __attribute__((packed));
/*
@@ -114,6 +149,12 @@
/* Get device ID command */
#define VIRTIO_BLK_T_GET_ID 8
+/* Discard command */
+#define VIRTIO_BLK_T_DISCARD 11
+
+/* Write zeroes command */
+#define VIRTIO_BLK_T_WRITE_ZEROES 13
+
#ifndef VIRTIO_BLK_NO_LEGACY
/* Barrier before this op. */
#define VIRTIO_BLK_T_BARRIER 0x80000000
@@ -133,6 +174,19 @@
__virtio64 sector;
};
+/* Unmap this range (only valid for write zeroes command) */
+#define VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP 0x00000001
+
+/* Discard/write zeroes range for each request. */
+struct virtio_blk_discard_write_zeroes {
+ /* discard/write zeroes start sector */
+ __virtio64 sector;
+ /* number of discard/write zeroes sectors */
+ __virtio32 num_sectors;
+ /* flags for this range */
+ __virtio32 flags;
+};
+
#ifndef VIRTIO_BLK_NO_LEGACY
struct virtio_scsi_inhdr {
__virtio32 errors;
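A discard or write-zeroes request carries one or more of these ranges as its payload, with the request type set to VIRTIO_BLK_T_DISCARD or VIRTIO_BLK_T_WRITE_ZEROES. A sketch of filling a single range, assuming a modern little-endian device so the __virtio fields reduce to plain little-endian values:

        #include <endian.h>
        #include <linux/virtio_blk.h>

        /* Sketch: describe "discard 8 sectors starting at sector 2048". */
        static void fill_discard_range(struct virtio_blk_discard_write_zeroes *range)
        {
                range->sector = htole64(2048);          /* start sector */
                range->num_sectors = htole32(8);        /* 8 * 512-byte sectors */
                range->flags = htole32(0);              /* UNMAP flag applies only
                                                         * to write-zeroes */
        }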
diff --git a/init/Kconfig b/init/Kconfig
index 47035b5..fe39bfd 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -23,9 +23,6 @@
int
default $(shell,$(srctree)/scripts/clang-version.sh $(CC))
-config CC_HAS_ASM_GOTO
- def_bool $(success,$(srctree)/scripts/gcc-goto.sh $(CC))
-
config CONSTRUCTORS
bool
depends on !UML
@@ -260,6 +257,15 @@
used to provide more virtual memory than the actual RAM present
in your computer. If unsure say Y.
+config DISK_BASED_SWAP
+ bool "Allow disk-based swap files in Chromium OS kernels"
+ depends on SWAP
+ default n
+ help
+ By default, the Chromium OS kernel allows swapping only to
+ zram devices. This option allows you to use disk-based files
+ as swap devices too. If unsure say N.
+
config SYSVIPC
bool "System V IPC"
---help---
@@ -999,6 +1005,29 @@
desktop applications. Task group autogeneration is currently based
upon task session.
+config SCHED_TUNE
+ bool "Boosting for CFS tasks (EXPERIMENTAL)"
+ depends on SMP
+ help
+ This option enables support for task classification using a new
+ cgroup controller, schedtune. Schedtune allows tasks to be given
+ a boost value and marked as latency-sensitive or not. This option
+ provides the "schedtune" controller.
+
+ This new controller:
+ 1. allows only a two-layer hierarchy, where the root defines the
+ system-wide boost value and each of its direct children defines a
+ different "class of tasks" to be boosted with a different value
+ 2. supports up to 16 different task classes, each of which can be
+ configured with a different boost value
+
+ Latency-sensitive tasks are not subject to energy-aware wakeup
+ task placement. The boost value assigned to tasks is used to
+ influence task placement and CPU frequency selection (if
+ utilization-driven frequency selection is in use).
+
+ If unsure, say N.
+
config SYSFS_DEPRECATED
bool "Enable deprecated sysfs features to support old userspace tools"
depends on SYSFS
diff --git a/init/init_task.c b/init/init_task.c
index 5aebe3b..67fe17b 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -11,6 +11,8 @@
#include <linux/mm.h>
#include <linux/audit.h>
+#include <linux/alt-syscall.h>
+
#include <asm/pgtable.h>
#include <linux/uaccess.h>
@@ -121,9 +123,8 @@
.thread_pid = &init_struct_pid,
.thread_group = LIST_HEAD_INIT(init_task.thread_group),
.thread_node = LIST_HEAD_INIT(init_signals.thread_head),
-#ifdef CONFIG_AUDITSYSCALL
- .loginuid = INVALID_UID,
- .sessionid = AUDIT_SID_UNSET,
+#ifdef CONFIG_AUDIT
+ .audit = &init_struct_audit,
#endif
#ifdef CONFIG_PERF_EVENTS
.perf_event_mutex = __MUTEX_INITIALIZER(init_task.perf_event_mutex),
diff --git a/init/main.c b/init/main.c
index ec78f23..57d4cf2 100644
--- a/init/main.c
+++ b/init/main.c
@@ -92,6 +92,7 @@
#include <linux/rodata_test.h>
#include <linux/jump_label.h>
#include <linux/mem_encrypt.h>
+#include <linux/audit.h>
#include <asm/io.h>
#include <asm/bugs.h>
@@ -720,6 +721,7 @@
nsfs_init();
cpuset_init();
cgroup_init();
+ audit_task_init();
taskstats_init_early();
delayacct_init();
diff --git a/kernel/Makefile b/kernel/Makefile
index ad4b324..8bfac76 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -44,6 +44,8 @@
obj-y += livepatch/
obj-y += dma/
+obj-$(CONFIG_ALT_SYSCALL) += alt-syscall.o
+
obj-$(CONFIG_CHECKPOINT_RESTORE) += kcmp.o
obj-$(CONFIG_FREEZER) += freezer.o
obj-$(CONFIG_PROFILING) += profile.o
diff --git a/kernel/alt-syscall.c b/kernel/alt-syscall.c
new file mode 100644
index 0000000..99599e1
--- /dev/null
+++ b/kernel/alt-syscall.c
@@ -0,0 +1,66 @@
+/*
+ * Alternate Syscall Table Infrastructure
+ *
+ * Copyright 2014 Google Inc. All Rights Reserved
+ *
+ * Authors:
+ * Kees Cook <keescook@chromium.org>
+ * Will Drewry <wad@chromium.org>
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/alt-syscall.h>
+
+static LIST_HEAD(alt_sys_call_tables);
+static DEFINE_SPINLOCK(alt_sys_call_tables_lock);
+
+/* XXX: there is no "unregister" yet. */
+int register_alt_sys_call_table(struct alt_sys_call_table *entry)
+{
+ if (!entry)
+ return -EINVAL;
+
+ spin_lock(&alt_sys_call_tables_lock);
+ list_add(&entry->node, &alt_sys_call_tables);
+ spin_unlock(&alt_sys_call_tables_lock);
+
+ pr_info("table '%s' available.\n", entry->name);
+
+ return 0;
+}
+
+int set_alt_sys_call_table(char * __user uname)
+{
+ char name[ALT_SYS_CALL_NAME_MAX + 1] = { };
+ struct alt_sys_call_table *entry;
+
+ if (copy_from_user(name, uname, ALT_SYS_CALL_NAME_MAX))
+ return -EFAULT;
+
+ spin_lock(&alt_sys_call_tables_lock);
+ list_for_each_entry(entry, &alt_sys_call_tables, node) {
+ if (!strcmp(entry->name, name)) {
+ if (arch_set_sys_call_table(entry))
+ continue;
+ spin_unlock(&alt_sys_call_tables_lock);
+ return 0;
+ }
+ }
+ spin_unlock(&alt_sys_call_tables_lock);
+
+ return -ENOENT;
+}
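Registration is just handing a named alt_sys_call_table to register_alt_sys_call_table(); the arch-specific syscall-pointer members of that struct are not visible in this excerpt, so the sketch below only fills in the name:

        /* Sketch: register a hypothetical table called "example"; arch code
         * would populate the actual syscall-pointer members. */
        static struct alt_sys_call_table example_table = {
                .name = "example",
        };

        static int __init example_table_init(void)
        {
                return register_alt_sys_call_table(&example_table);
        }
        late_initcall(example_table_init);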
diff --git a/kernel/audit.c b/kernel/audit.c
index 7afec5f..e1b7c35 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -216,6 +216,75 @@
struct sk_buff *skb;
};
+static struct kmem_cache *audit_task_cache;
+
+void __init audit_task_init(void)
+{
+ audit_task_cache = kmem_cache_create("audit_task",
+ sizeof(struct audit_task_info),
+ 0, SLAB_PANIC, NULL);
+}
+
+/**
+ * audit_alloc - allocate an audit info block for a task
+ * @tsk: task
+ *
+ * Call audit_alloc_syscall to filter on the task information and
+ * allocate a per-task audit context if necessary. This is called from
+ * copy_process, so no lock is needed.
+ */
+int audit_alloc(struct task_struct *tsk)
+{
+ int ret = 0;
+ struct audit_task_info *info;
+
+ info = kmem_cache_alloc(audit_task_cache, GFP_KERNEL);
+ if (!info) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ info->loginuid = audit_get_loginuid(current);
+ info->sessionid = audit_get_sessionid(current);
+ info->contid = audit_get_contid(current);
+ tsk->audit = info;
+
+ ret = audit_alloc_syscall(tsk);
+ if (ret) {
+ tsk->audit = NULL;
+ kmem_cache_free(audit_task_cache, info);
+ }
+out:
+ return ret;
+}
+
+struct audit_task_info init_struct_audit = {
+ .loginuid = INVALID_UID,
+ .sessionid = AUDIT_SID_UNSET,
+ .contid = AUDIT_CID_UNSET,
+#ifdef CONFIG_AUDITSYSCALL
+ .ctx = NULL,
+#endif
+};
+
+/**
+ * audit_free - free per-task audit info
+ * @tsk: task whose audit info block to free
+ *
+ * Called from copy_process and do_exit
+ */
+void audit_free(struct task_struct *tsk)
+{
+ struct audit_task_info *info = tsk->audit;
+
+ audit_free_syscall(tsk);
+ /* Freeing the audit_task_info struct must be performed after
+ * audit_log_exit(), which still needs the loginuid and sessionid.
+ */
+ info = tsk->audit;
+ tsk->audit = NULL;
+ kmem_cache_free(audit_task_cache, info);
+}
+
/**
* auditd_test_task - Check to see if a given task is an audit daemon
* @task: the task to check
@@ -2027,6 +2096,12 @@
if (prefix)
audit_log_format(ab, "%s", prefix);
+ /* The process may be exiting. */
+ if (!current->fs) {
+ audit_log_string(ab, "<unknown>");
+ return;
+ }
+
/* We will allow 11 spaces for ' (deleted)' to be appended */
pathname = kmalloc(PATH_MAX+11, ab->gfp_mask);
if (!pathname) {
@@ -2328,6 +2403,73 @@
}
/**
+ * audit_set_contid - set current task's audit contid
+ * @contid: contid value
+ *
+ * Returns 0 on success, -EPERM on permission failure.
+ *
+ * Called (set) from fs/proc/base.c::proc_contid_write().
+ */
+int audit_set_contid(struct task_struct *task, u64 contid)
+{
+ u64 oldcontid;
+ int rc = 0;
+ struct audit_buffer *ab;
+ uid_t uid;
+ struct tty_struct *tty;
+ char comm[sizeof(current->comm)];
+
+ task_lock(task);
+ /* Can't set if audit disabled */
+ if (!task->audit) {
+ task_unlock(task);
+ return -ENOPROTOOPT;
+ }
+ oldcontid = audit_get_contid(task);
+ read_lock(&tasklist_lock);
+ /* Don't allow the audit containerid to be unset */
+ if (!audit_contid_valid(contid))
+ rc = -EINVAL;
+ /* if we don't have caps, reject */
+ else if (!capable(CAP_AUDIT_CONTROL))
+ rc = -EPERM;
+ /* if task has children or is not single-threaded, deny */
+ else if (!list_empty(&task->children))
+ rc = -EBUSY;
+ else if (!(thread_group_leader(task) && thread_group_empty(task)))
+ rc = -EALREADY;
+ read_unlock(&tasklist_lock);
+ if (!rc)
+ task->audit->contid = contid;
+ task_unlock(task);
+
+ if (!audit_enabled)
+ return rc;
+
+ ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_CONTAINER_OP);
+ if (!ab)
+ return rc;
+
+ uid = from_kuid(&init_user_ns, task_uid(current));
+ tty = audit_get_tty(current);
+ audit_log_format(ab,
+ "op=set opid=%d contid=%llu old-contid=%llu pid=%d uid=%u auid=%u tty=%s ses=%u",
+ task_tgid_nr(task), contid, oldcontid,
+ task_tgid_nr(current), uid,
+ from_kuid(&init_user_ns, audit_get_loginuid(current)),
+ tty ? tty_name(tty) : "(none)",
+ audit_get_sessionid(current));
+ audit_put_tty(tty);
+ audit_log_task_context(ab);
+ audit_log_format(ab, " comm=");
+ audit_log_untrustedstring(ab, get_task_comm(comm, current));
+ audit_log_d_path_exe(ab, current->mm);
+ audit_log_format(ab, " res=%d", !rc);
+ audit_log_end(ab);
+ return rc;
+}
+
+/**
* audit_log_end - end one audit record
* @ab: the audit_buffer
*
diff --git a/kernel/audit.h b/kernel/audit.h
index 214e149..6bf56f3 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -147,6 +147,7 @@
kuid_t target_uid;
unsigned int target_sessionid;
u32 target_sid;
+ u64 target_cid;
char target_comm[TASK_COMM_LEN];
struct audit_tree_refs *trees, *first_trees;
@@ -267,6 +268,8 @@
/* audit watch functions */
#ifdef CONFIG_AUDIT_WATCH
+extern int audit_alloc_syscall(struct task_struct *tsk);
+extern void audit_free_syscall(struct task_struct *tsk);
extern void audit_put_watch(struct audit_watch *watch);
extern void audit_get_watch(struct audit_watch *watch);
extern int audit_to_watch(struct audit_krule *krule, char *path, int len, u32 op);
@@ -284,6 +287,8 @@
extern int audit_exe_compare(struct task_struct *tsk, struct audit_fsnotify_mark *mark);
#else
+#define audit_alloc_syscall(t) 0
+#define audit_free_syscall(t) {}
#define audit_put_watch(w) {}
#define audit_get_watch(w) {}
#define audit_to_watch(k, p, l, o) (-EINVAL)
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 1513873..9c32b70 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -113,6 +113,7 @@
kuid_t target_uid[AUDIT_AUX_PIDS];
unsigned int target_sessionid[AUDIT_AUX_PIDS];
u32 target_sid[AUDIT_AUX_PIDS];
+ u64 target_cid[AUDIT_AUX_PIDS];
char target_comm[AUDIT_AUX_PIDS][TASK_COMM_LEN];
int pid_count;
};
@@ -841,7 +842,7 @@
int return_valid,
long return_code)
{
- struct audit_context *context = tsk->audit_context;
+ struct audit_context *context = tsk->audit->ctx;
if (!context)
return NULL;
@@ -926,23 +927,25 @@
return context;
}
-/**
- * audit_alloc - allocate an audit context block for a task
+/*
+ * audit_alloc_syscall - allocate an audit context block for a task
* @tsk: task
*
* Filter on the task information and allocate a per-task audit context
* if necessary. Doing so turns on system call auditing for the
- * specified task. This is called from copy_process, so no lock is
- * needed.
+ * specified task. This is called from copy_process via audit_alloc, so
+ * no lock is needed.
*/
-int audit_alloc(struct task_struct *tsk)
+int audit_alloc_syscall(struct task_struct *tsk)
{
struct audit_context *context;
enum audit_state state;
char *key = NULL;
- if (likely(!audit_ever_enabled))
+ if (likely(!audit_ever_enabled)) {
+ audit_set_context(tsk, NULL);
return 0; /* Return if not auditing. */
+ }
state = audit_filter_task(tsk, &key);
if (state == AUDIT_DISABLED) {
@@ -952,7 +955,7 @@
if (!(context = audit_alloc_context(state))) {
kfree(key);
- audit_log_lost("out of memory in audit_alloc");
+ audit_log_lost("out of memory in audit_alloc_syscall");
return -ENOMEM;
}
context->filterkey = key;
@@ -1473,16 +1476,16 @@
}
/**
- * __audit_free - free a per-task audit context
+ * audit_free_syscall - free per-task audit context info
* @tsk: task whose audit context block to free
*
- * Called from copy_process and do_exit
+ * Called from audit_free
*/
-void __audit_free(struct task_struct *tsk)
+void audit_free_syscall(struct task_struct *tsk)
{
- struct audit_context *context;
+ struct audit_task_info *info = tsk->audit;
+ struct audit_context *context = info->ctx;
- context = audit_take_context(tsk, 0, 0);
if (!context)
return;
@@ -2075,8 +2078,8 @@
sessionid = (unsigned int)atomic_inc_return(&session_id);
}
- task->sessionid = sessionid;
- task->loginuid = loginuid;
+ task->audit->sessionid = sessionid;
+ task->audit->loginuid = loginuid;
out:
audit_log_set_loginuid(oldloginuid, loginuid, oldsessionid, sessionid, rc);
return rc;
@@ -2272,6 +2275,7 @@
context->target_uid = task_uid(t);
context->target_sessionid = audit_get_sessionid(t);
security_task_getsecid(t, &context->target_sid);
+ context->target_cid = audit_get_contid(t);
memcpy(context->target_comm, t->comm, TASK_COMM_LEN);
}
@@ -2312,6 +2316,7 @@
ctx->target_uid = t_uid;
ctx->target_sessionid = audit_get_sessionid(t);
security_task_getsecid(t, &ctx->target_sid);
+ ctx->target_cid = audit_get_contid(t);
memcpy(ctx->target_comm, t->comm, TASK_COMM_LEN);
return 0;
}
@@ -2333,6 +2338,7 @@
axp->target_uid[axp->pid_count] = t_uid;
axp->target_sessionid[axp->pid_count] = audit_get_sessionid(t);
security_task_getsecid(t, &axp->target_sid[axp->pid_count]);
+ axp->target_cid[axp->pid_count] = audit_get_contid(t);
memcpy(axp->target_comm[axp->pid_count], t->comm, TASK_COMM_LEN);
axp->pid_count++;
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 6d6c106..ee688fa 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1226,6 +1226,7 @@
void enable_nonboot_cpus(void)
{
int cpu, error;
+ struct device *cpu_device;
/* Allow everyone to use the CPU hotplug again */
cpu_maps_update_begin();
@@ -1243,6 +1244,12 @@
trace_suspend_resume(TPS("CPU_ON"), cpu, false);
if (!error) {
pr_info("CPU%d is up\n", cpu);
+ cpu_device = get_cpu_device(cpu);
+ if (!cpu_device)
+ pr_err("%s: failed to get cpu%d device\n",
+ __func__, cpu);
+ else
+ kobject_uevent(&cpu_device->kobj, KOBJ_ONLINE);
continue;
}
pr_warn("Error taking CPU%d up: %d\n", cpu, error);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 21e3c65..0e83d55 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8484,30 +8484,39 @@
*
* PERF_PROBE_CONFIG_IS_RETPROBE if set, create kretprobe/uretprobe
* if not set, create kprobe/uprobe
+ *
+ * The following values specify a reference counter (or semaphore in the
+ * terminology of tools like dtrace, systemtap, etc.) for Userspace
+ * Statically Defined Tracepoints (USDT). Currently, we use 32 bits for
+ * the offset.
+ *
+ * PERF_UPROBE_REF_CTR_OFFSET_BITS # of bits in config for the offset
+ * PERF_UPROBE_REF_CTR_OFFSET_SHIFT # of bits to shift left
*/
enum perf_probe_config {
PERF_PROBE_CONFIG_IS_RETPROBE = 1U << 0, /* [k,u]retprobe */
+ PERF_UPROBE_REF_CTR_OFFSET_BITS = 32,
+ PERF_UPROBE_REF_CTR_OFFSET_SHIFT = 64 - PERF_UPROBE_REF_CTR_OFFSET_BITS,
};
PMU_FORMAT_ATTR(retprobe, "config:0");
+#endif
-static struct attribute *probe_attrs[] = {
+#ifdef CONFIG_KPROBE_EVENTS
+static struct attribute *kprobe_attrs[] = {
&format_attr_retprobe.attr,
NULL,
};
-static struct attribute_group probe_format_group = {
+static struct attribute_group kprobe_format_group = {
.name = "format",
- .attrs = probe_attrs,
+ .attrs = kprobe_attrs,
};
-static const struct attribute_group *probe_attr_groups[] = {
- &probe_format_group,
+static const struct attribute_group *kprobe_attr_groups[] = {
+ &kprobe_format_group,
NULL,
};
-#endif
-#ifdef CONFIG_KPROBE_EVENTS
static int perf_kprobe_event_init(struct perf_event *event);
static struct pmu perf_kprobe = {
.task_ctx_nr = perf_sw_context,
@@ -8517,7 +8526,7 @@
.start = perf_swevent_start,
.stop = perf_swevent_stop,
.read = perf_swevent_read,
- .attr_groups = probe_attr_groups,
+ .attr_groups = kprobe_attr_groups,
};
static int perf_kprobe_event_init(struct perf_event *event)
@@ -8549,6 +8558,24 @@
#endif /* CONFIG_KPROBE_EVENTS */
#ifdef CONFIG_UPROBE_EVENTS
+PMU_FORMAT_ATTR(ref_ctr_offset, "config:32-63");
+
+static struct attribute *uprobe_attrs[] = {
+ &format_attr_retprobe.attr,
+ &format_attr_ref_ctr_offset.attr,
+ NULL,
+};
+
+static struct attribute_group uprobe_format_group = {
+ .name = "format",
+ .attrs = uprobe_attrs,
+};
+
+static const struct attribute_group *uprobe_attr_groups[] = {
+ &uprobe_format_group,
+ NULL,
+};
+
static int perf_uprobe_event_init(struct perf_event *event);
static struct pmu perf_uprobe = {
.task_ctx_nr = perf_sw_context,
@@ -8558,12 +8585,13 @@
.start = perf_swevent_start,
.stop = perf_swevent_stop,
.read = perf_swevent_read,
- .attr_groups = probe_attr_groups,
+ .attr_groups = uprobe_attr_groups,
};
static int perf_uprobe_event_init(struct perf_event *event)
{
int err;
+ unsigned long ref_ctr_offset;
bool is_retprobe;
if (event->attr.type != perf_uprobe.type)
@@ -8579,7 +8607,8 @@
return -EOPNOTSUPP;
is_retprobe = event->attr.config & PERF_PROBE_CONFIG_IS_RETPROBE;
- err = perf_uprobe_init(event, is_retprobe);
+ ref_ctr_offset = event->attr.config >> PERF_UPROBE_REF_CTR_OFFSET_SHIFT;
+ err = perf_uprobe_init(event, ref_ctr_offset, is_retprobe);
if (err)
return err;
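With PERF_UPROBE_REF_CTR_OFFSET_SHIFT = 32, the uprobe PMU expects the reference-counter offset in the upper half of attr.config (format "config:32-63" above), with bit 0 reserved for the retprobe flag. A userspace sketch of building such a config value:

        /* Sketch: pack a hypothetical USDT semaphore offset into bits 32..63
         * and keep bit 0 for the retprobe flag, matching the layout above. */
        static unsigned long long make_uprobe_config(unsigned long long ref_ctr_offset,
                                                     int is_retprobe)
        {
                return (ref_ctr_offset << 32) | (is_retprobe ? 1ULL : 0ULL);
        }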
@@ -10517,9 +10546,8 @@
* @cpu: target cpu
* @group_fd: group leader event fd
*/
-SYSCALL_DEFINE5(perf_event_open,
- struct perf_event_attr __user *, attr_uptr,
- pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)
+int ksys_perf_event_open(struct perf_event_attr __user * attr_uptr, pid_t pid,
+ int cpu, int group_fd, unsigned long flags)
{
struct perf_event *group_leader = NULL, *output_event = NULL;
struct perf_event *event, *sibling;
@@ -10947,6 +10975,13 @@
return err;
}
+SYSCALL_DEFINE5(perf_event_open,
+ struct perf_event_attr __user *, attr_uptr,
+ pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)
+{
+ return ksys_perf_event_open(attr_uptr, pid, cpu, group_fd, flags);
+}
+
/**
* perf_event_create_kernel_counter
*
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index c173e41..223a7a5 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -73,6 +73,7 @@
struct uprobe_consumer *consumers;
struct inode *inode; /* Also hold a ref to inode */
loff_t offset;
+ loff_t ref_ctr_offset;
unsigned long flags;
/*
@@ -88,6 +89,15 @@
struct arch_uprobe arch;
};
+struct delayed_uprobe {
+ struct list_head list;
+ struct uprobe *uprobe;
+ struct mm_struct *mm;
+};
+
+static DEFINE_MUTEX(delayed_uprobe_lock);
+static LIST_HEAD(delayed_uprobe_list);
+
/*
* Execute out of line area: anonymous executable mapping installed
* by the probed task to execute the copy of the original instruction
@@ -282,6 +292,166 @@
return 1;
}
+static struct delayed_uprobe *
+delayed_uprobe_check(struct uprobe *uprobe, struct mm_struct *mm)
+{
+ struct delayed_uprobe *du;
+
+ list_for_each_entry(du, &delayed_uprobe_list, list)
+ if (du->uprobe == uprobe && du->mm == mm)
+ return du;
+ return NULL;
+}
+
+static int delayed_uprobe_add(struct uprobe *uprobe, struct mm_struct *mm)
+{
+ struct delayed_uprobe *du;
+
+ if (delayed_uprobe_check(uprobe, mm))
+ return 0;
+
+ du = kzalloc(sizeof(*du), GFP_KERNEL);
+ if (!du)
+ return -ENOMEM;
+
+ du->uprobe = uprobe;
+ du->mm = mm;
+ list_add(&du->list, &delayed_uprobe_list);
+ return 0;
+}
+
+static void delayed_uprobe_delete(struct delayed_uprobe *du)
+{
+ if (WARN_ON(!du))
+ return;
+ list_del(&du->list);
+ kfree(du);
+}
+
+static void delayed_uprobe_remove(struct uprobe *uprobe, struct mm_struct *mm)
+{
+ struct list_head *pos, *q;
+ struct delayed_uprobe *du;
+
+ if (!uprobe && !mm)
+ return;
+
+ list_for_each_safe(pos, q, &delayed_uprobe_list) {
+ du = list_entry(pos, struct delayed_uprobe, list);
+
+ if (uprobe && du->uprobe != uprobe)
+ continue;
+ if (mm && du->mm != mm)
+ continue;
+
+ delayed_uprobe_delete(du);
+ }
+}
+
+static bool valid_ref_ctr_vma(struct uprobe *uprobe,
+ struct vm_area_struct *vma)
+{
+ unsigned long vaddr = offset_to_vaddr(vma, uprobe->ref_ctr_offset);
+
+ return uprobe->ref_ctr_offset &&
+ vma->vm_file &&
+ file_inode(vma->vm_file) == uprobe->inode &&
+ (vma->vm_flags & (VM_WRITE|VM_SHARED)) == VM_WRITE &&
+ vma->vm_start <= vaddr &&
+ vma->vm_end > vaddr;
+}
+
+static struct vm_area_struct *
+find_ref_ctr_vma(struct uprobe *uprobe, struct mm_struct *mm)
+{
+ struct vm_area_struct *tmp;
+
+ for (tmp = mm->mmap; tmp; tmp = tmp->vm_next)
+ if (valid_ref_ctr_vma(uprobe, tmp))
+ return tmp;
+
+ return NULL;
+}
+
+static int
+__update_ref_ctr(struct mm_struct *mm, unsigned long vaddr, short d)
+{
+ void *kaddr;
+ struct page *page;
+ struct vm_area_struct *vma;
+ int ret;
+ short *ptr;
+
+ if (!vaddr || !d)
+ return -EINVAL;
+
+ ret = get_user_pages_remote(NULL, mm, vaddr, 1,
+ FOLL_WRITE, &page, &vma, NULL);
+ if (unlikely(ret <= 0)) {
+ /*
+ * We are asking for 1 page. If get_user_pages_remote() fails,
+ * it may return 0, in that case we have to return error.
+ */
+ return ret == 0 ? -EBUSY : ret;
+ }
+
+ kaddr = kmap_atomic(page);
+ ptr = kaddr + (vaddr & ~PAGE_MASK);
+
+ if (unlikely(*ptr + d < 0)) {
+ pr_warn("ref_ctr going negative. vaddr: 0x%lx, "
+ "curr val: %d, delta: %d\n", vaddr, *ptr, d);
+ ret = -EINVAL;
+ goto out;
+ }
+
+ *ptr += d;
+ ret = 0;
+out:
+ kunmap_atomic(kaddr);
+ put_page(page);
+ return ret;
+}
+
+static void update_ref_ctr_warn(struct uprobe *uprobe,
+ struct mm_struct *mm, short d)
+{
+ pr_warn("ref_ctr %s failed for inode: 0x%lx offset: "
+ "0x%llx ref_ctr_offset: 0x%llx of mm: 0x%pK\n",
+ d > 0 ? "increment" : "decrement", uprobe->inode->i_ino,
+ (unsigned long long) uprobe->offset,
+ (unsigned long long) uprobe->ref_ctr_offset, mm);
+}
+
+static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
+ short d)
+{
+ struct vm_area_struct *rc_vma;
+ unsigned long rc_vaddr;
+ int ret = 0;
+
+ rc_vma = find_ref_ctr_vma(uprobe, mm);
+
+ if (rc_vma) {
+ rc_vaddr = offset_to_vaddr(rc_vma, uprobe->ref_ctr_offset);
+ ret = __update_ref_ctr(mm, rc_vaddr, d);
+ if (ret)
+ update_ref_ctr_warn(uprobe, mm, d);
+
+ if (d > 0)
+ return ret;
+ }
+
+ mutex_lock(&delayed_uprobe_lock);
+ if (d > 0)
+ ret = delayed_uprobe_add(uprobe, mm);
+ else
+ delayed_uprobe_remove(uprobe, mm);
+ mutex_unlock(&delayed_uprobe_lock);
+
+ return ret;
+}
+
/*
* NOTE:
* Expect the breakpoint instruction to be the smallest size instruction for
@@ -302,9 +472,13 @@
int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
unsigned long vaddr, uprobe_opcode_t opcode)
{
+ struct uprobe *uprobe;
struct page *old_page, *new_page;
struct vm_area_struct *vma;
- int ret;
+ int ret, is_register, ref_ctr_updated = 0;
+
+ is_register = is_swbp_insn(&opcode);
+ uprobe = container_of(auprobe, struct uprobe, arch);
retry:
/* Read the page with vaddr into memory */
@@ -317,6 +491,15 @@
if (ret <= 0)
goto put_old;
+ /* We are going to replace instruction, update ref_ctr. */
+ if (!ref_ctr_updated && uprobe->ref_ctr_offset) {
+ ret = update_ref_ctr(uprobe, mm, is_register ? 1 : -1);
+ if (ret)
+ goto put_old;
+
+ ref_ctr_updated = 1;
+ }
+
ret = anon_vma_prepare(vma);
if (ret)
goto put_old;
@@ -337,6 +520,11 @@
if (unlikely(ret == -EAGAIN))
goto retry;
+
+ /* Revert back reference counter if instruction update failed. */
+ if (ret && is_register && ref_ctr_updated)
+ update_ref_ctr(uprobe, mm, -1);
+
return ret;
}
@@ -378,8 +566,15 @@
static void put_uprobe(struct uprobe *uprobe)
{
- if (atomic_dec_and_test(&uprobe->ref))
+ if (atomic_dec_and_test(&uprobe->ref)) {
+ /*
+ * If application munmap(exec_vma) before uprobe_unregister()
+ * gets called, we don't get a chance to remove uprobe from
+ * delayed_uprobe_list from remove_breakpoint(). Do it here.
+ */
+ delayed_uprobe_remove(uprobe, NULL);
kfree(uprobe);
+ }
}
static int match_uprobe(struct uprobe *l, struct uprobe *r)
@@ -484,7 +679,8 @@
return u;
}
-static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset)
+static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
+ loff_t ref_ctr_offset)
{
struct uprobe *uprobe, *cur_uprobe;
@@ -494,6 +690,7 @@
uprobe->inode = inode;
uprobe->offset = offset;
+ uprobe->ref_ctr_offset = ref_ctr_offset;
init_rwsem(&uprobe->register_rwsem);
init_rwsem(&uprobe->consumer_rwsem);
@@ -895,7 +1092,7 @@
* else return 0 (success)
*/
static int __uprobe_register(struct inode *inode, loff_t offset,
- struct uprobe_consumer *uc)
+ loff_t ref_ctr_offset, struct uprobe_consumer *uc)
{
struct uprobe *uprobe;
int ret;
@@ -912,7 +1109,7 @@
return -EINVAL;
retry:
- uprobe = alloc_uprobe(inode, offset);
+ uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
if (!uprobe)
return -ENOMEM;
/*
@@ -938,10 +1135,17 @@
int uprobe_register(struct inode *inode, loff_t offset,
struct uprobe_consumer *uc)
{
- return __uprobe_register(inode, offset, uc);
+ return __uprobe_register(inode, offset, 0, uc);
}
EXPORT_SYMBOL_GPL(uprobe_register);
+int uprobe_register_refctr(struct inode *inode, loff_t offset,
+ loff_t ref_ctr_offset, struct uprobe_consumer *uc)
+{
+ return __uprobe_register(inode, offset, ref_ctr_offset, uc);
+}
+EXPORT_SYMBOL_GPL(uprobe_register_refctr);
+
/*
* uprobe_apply - unregister an already registered probe.
* @inode: the file in which the probe has to be removed.
@@ -1060,6 +1264,35 @@
spin_unlock(&uprobes_treelock);
}
+/* @vma contains reference counter, not the probed instruction. */
+static int delayed_ref_ctr_inc(struct vm_area_struct *vma)
+{
+ struct list_head *pos, *q;
+ struct delayed_uprobe *du;
+ unsigned long vaddr;
+ int ret = 0, err = 0;
+
+ mutex_lock(&delayed_uprobe_lock);
+ list_for_each_safe(pos, q, &delayed_uprobe_list) {
+ du = list_entry(pos, struct delayed_uprobe, list);
+
+ if (du->mm != vma->vm_mm ||
+ !valid_ref_ctr_vma(du->uprobe, vma))
+ continue;
+
+ vaddr = offset_to_vaddr(vma, du->uprobe->ref_ctr_offset);
+ ret = __update_ref_ctr(vma->vm_mm, vaddr, 1);
+ if (ret) {
+ update_ref_ctr_warn(du->uprobe, vma->vm_mm, 1);
+ if (!err)
+ err = ret;
+ }
+ delayed_uprobe_delete(du);
+ }
+ mutex_unlock(&delayed_uprobe_lock);
+ return err;
+}
+
/*
* Called from mmap_region/vma_adjust with mm->mmap_sem acquired.
*
@@ -1072,7 +1305,15 @@
struct uprobe *uprobe, *u;
struct inode *inode;
- if (no_uprobe_events() || !valid_vma(vma, true))
+ if (no_uprobe_events())
+ return 0;
+
+ if (vma->vm_file &&
+ (vma->vm_flags & (VM_WRITE|VM_SHARED)) == VM_WRITE &&
+ test_bit(MMF_HAS_UPROBES, &vma->vm_mm->flags))
+ delayed_ref_ctr_inc(vma);
+
+ if (!valid_vma(vma, true))
return 0;
inode = file_inode(vma->vm_file);
@@ -1246,6 +1487,10 @@
{
struct xol_area *area = mm->uprobes_state.xol_area;
+ mutex_lock(&delayed_uprobe_lock);
+ delayed_uprobe_remove(NULL, mm);
+ mutex_unlock(&delayed_uprobe_lock);
+
if (!area)
return;
diff --git a/kernel/exit.c b/kernel/exit.c
index 54c3269..34bb2b0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -62,6 +62,7 @@
#include <linux/random.h>
#include <linux/rcuwait.h>
#include <linux/compat.h>
+#include <linux/security.h>
#include <linux/uaccess.h>
#include <asm/unistd.h>
@@ -855,6 +856,8 @@
#endif
if (tsk->mm)
setmax_mm_hiwater_rss(&tsk->signal->maxrss, tsk->mm);
+
+ security_task_exit(tsk);
}
acct_collect(code, group_dead);
if (group_dead)
diff --git a/kernel/fork.c b/kernel/fork.c
index 1a2d18e..01b49a5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1815,7 +1815,6 @@
posix_cpu_timers_init(p);
p->io_context = NULL;
- audit_set_context(p, NULL);
cgroup_fork(p);
#ifdef CONFIG_NUMA
p->mempolicy = mpol_dup(p->mempolicy);
@@ -2084,6 +2083,8 @@
trace_task_newtask(p, clone_flags);
uprobe_copy_process(p, clone_flags);
+ security_task_post_alloc(p);
+
return p;
bad_fork_cancel_cgroup:
diff --git a/kernel/gcov/Kconfig b/kernel/gcov/Kconfig
index 1e3823f..f71c1ad 100644
--- a/kernel/gcov/Kconfig
+++ b/kernel/gcov/Kconfig
@@ -53,6 +53,7 @@
choice
prompt "Specify GCOV format"
depends on GCOV_KERNEL
+ depends on CC_IS_GCC
---help---
The gcov format is usually determined by the GCC version, and the
default is chosen according to your GCC version. However, there are
@@ -62,7 +63,7 @@
config GCOV_FORMAT_3_4
bool "GCC 3.4 format"
- depends on CC_IS_GCC && GCC_VERSION < 40700
+ depends on GCC_VERSION < 40700
---help---
Select this option to use the format defined by GCC 3.4.
diff --git a/kernel/gcov/Makefile b/kernel/gcov/Makefile
index ff06d64..d66a74b 100644
--- a/kernel/gcov/Makefile
+++ b/kernel/gcov/Makefile
@@ -2,5 +2,6 @@
ccflags-y := -DSRCTREE='"$(srctree)"' -DOBJTREE='"$(objtree)"'
obj-y := base.o fs.o
-obj-$(CONFIG_GCOV_FORMAT_3_4) += gcc_3_4.o
-obj-$(CONFIG_GCOV_FORMAT_4_7) += gcc_4_7.o
+obj-$(CONFIG_GCOV_FORMAT_3_4) += gcc_base.o gcc_3_4.o
+obj-$(CONFIG_GCOV_FORMAT_4_7) += gcc_base.o gcc_4_7.o
+obj-$(CONFIG_CC_IS_CLANG) += clang.o
diff --git a/kernel/gcov/base.c b/kernel/gcov/base.c
index 9c7c8d5..0ffe9f1 100644
--- a/kernel/gcov/base.c
+++ b/kernel/gcov/base.c
@@ -22,88 +22,8 @@
#include <linux/sched.h>
#include "gcov.h"
-static int gcov_events_enabled;
-static DEFINE_MUTEX(gcov_lock);
-
-/*
- * __gcov_init is called by gcc-generated constructor code for each object
- * file compiled with -fprofile-arcs.
- */
-void __gcov_init(struct gcov_info *info)
-{
- static unsigned int gcov_version;
-
- mutex_lock(&gcov_lock);
- if (gcov_version == 0) {
- gcov_version = gcov_info_version(info);
- /*
- * Printing gcc's version magic may prove useful for debugging
- * incompatibility reports.
- */
- pr_info("version magic: 0x%x\n", gcov_version);
- }
- /*
- * Add new profiling data structure to list and inform event
- * listener.
- */
- gcov_info_link(info);
- if (gcov_events_enabled)
- gcov_event(GCOV_ADD, info);
- mutex_unlock(&gcov_lock);
-}
-EXPORT_SYMBOL(__gcov_init);
-
-/*
- * These functions may be referenced by gcc-generated profiling code but serve
- * no function for kernel profiling.
- */
-void __gcov_flush(void)
-{
- /* Unused. */
-}
-EXPORT_SYMBOL(__gcov_flush);
-
-void __gcov_merge_add(gcov_type *counters, unsigned int n_counters)
-{
- /* Unused. */
-}
-EXPORT_SYMBOL(__gcov_merge_add);
-
-void __gcov_merge_single(gcov_type *counters, unsigned int n_counters)
-{
- /* Unused. */
-}
-EXPORT_SYMBOL(__gcov_merge_single);
-
-void __gcov_merge_delta(gcov_type *counters, unsigned int n_counters)
-{
- /* Unused. */
-}
-EXPORT_SYMBOL(__gcov_merge_delta);
-
-void __gcov_merge_ior(gcov_type *counters, unsigned int n_counters)
-{
- /* Unused. */
-}
-EXPORT_SYMBOL(__gcov_merge_ior);
-
-void __gcov_merge_time_profile(gcov_type *counters, unsigned int n_counters)
-{
- /* Unused. */
-}
-EXPORT_SYMBOL(__gcov_merge_time_profile);
-
-void __gcov_merge_icall_topn(gcov_type *counters, unsigned int n_counters)
-{
- /* Unused. */
-}
-EXPORT_SYMBOL(__gcov_merge_icall_topn);
-
-void __gcov_exit(void)
-{
- /* Unused. */
-}
-EXPORT_SYMBOL(__gcov_exit);
+int gcov_events_enabled;
+DEFINE_MUTEX(gcov_lock);
/**
* gcov_enable_events - enable event reporting through gcov_event()
@@ -144,7 +64,7 @@
/* Remove entries located in module from linked list. */
while ((info = gcov_info_next(info))) {
- if (within_module((unsigned long)info, mod)) {
+ if (gcov_info_within_module(info, mod)) {
gcov_info_unlink(prev, info);
if (gcov_events_enabled)
gcov_event(GCOV_REMOVE, info);
diff --git a/kernel/gcov/clang.c b/kernel/gcov/clang.c
new file mode 100644
index 0000000..c94b820
--- /dev/null
+++ b/kernel/gcov/clang.c
@@ -0,0 +1,581 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Google, Inc.
+ * modified from kernel/gcov/gcc_4_7.c
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ *
+ * LLVM uses profiling data that's deliberately similar to GCC, but has a
+ * very different way of exporting that data. LLVM calls llvm_gcov_init() once
+ * per module, and provides a couple of callbacks that we can use to ask for
+ * more data.
+ *
+ * We care about the "writeout" callback, which in turn calls back into
+ * compiler-rt/this module to dump all the gathered coverage data to disk:
+ *
+ * llvm_gcda_start_file()
+ * llvm_gcda_emit_function()
+ * llvm_gcda_emit_arcs()
+ * llvm_gcda_emit_function()
+ * llvm_gcda_emit_arcs()
+ * [... repeats for each function ...]
+ * llvm_gcda_summary_info()
+ * llvm_gcda_end_file()
+ *
+ * This design is much more stateless and unstructured than gcc's, and is
+ * intended to run at process exit. This forces us to keep some local state
+ * about which module we're dealing with at the moment. On the other hand, it
+ * also means we don't depend as much on how LLVM represents profiling data
+ * internally.
+ *
+ * See LLVM's lib/Transforms/Instrumentation/GCOVProfiling.cpp for more
+ * details on how this works, particularly GCOVProfiler::emitProfileArcs(),
+ * GCOVProfiler::insertCounterWriteout(), and
+ * GCOVProfiler::insertFlush().
+ */
+
+#define pr_fmt(fmt) "gcov: " fmt
+
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/printk.h>
+#include <linux/ratelimit.h>
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include "gcov.h"
+
+typedef void (*llvm_gcov_callback)(void);
+
+struct gcov_info {
+ struct list_head head;
+
+ const char *filename;
+ unsigned int version;
+ u32 checksum;
+
+ struct list_head functions;
+};
+
+struct gcov_fn_info {
+ struct list_head head;
+
+ u32 ident;
+ u32 checksum;
+ u8 use_extra_checksum;
+ u32 cfg_checksum;
+
+ u32 num_counters;
+ u64 *counters;
+ const char *function_name;
+};
+
+static struct gcov_info *current_info;
+
+static LIST_HEAD(clang_gcov_list);
+
+void llvm_gcov_init(llvm_gcov_callback writeout, llvm_gcov_callback flush)
+{
+ struct gcov_info *info = kzalloc(sizeof(*info), GFP_KERNEL);
+
+ if (!info)
+ return;
+
+ INIT_LIST_HEAD(&info->head);
+ INIT_LIST_HEAD(&info->functions);
+
+ mutex_lock(&gcov_lock);
+
+ list_add_tail(&info->head, &clang_gcov_list);
+ current_info = info;
+ writeout();
+ current_info = NULL;
+ if (gcov_events_enabled)
+ gcov_event(GCOV_ADD, info);
+
+ mutex_unlock(&gcov_lock);
+}
+EXPORT_SYMBOL(llvm_gcov_init);
+
+void llvm_gcda_start_file(const char *orig_filename, const char version[4],
+ u32 checksum)
+{
+ current_info->filename = orig_filename;
+ memcpy(&current_info->version, version, sizeof(current_info->version));
+ current_info->checksum = checksum;
+}
+EXPORT_SYMBOL(llvm_gcda_start_file);
+
+void llvm_gcda_emit_function(u32 ident, const char *function_name,
+ u32 func_checksum, u8 use_extra_checksum, u32 cfg_checksum)
+{
+ struct gcov_fn_info *info = kzalloc(sizeof(*info), GFP_KERNEL);
+
+ if (!info)
+ return;
+
+ INIT_LIST_HEAD(&info->head);
+ info->ident = ident;
+ info->checksum = func_checksum;
+ info->use_extra_checksum = use_extra_checksum;
+ info->cfg_checksum = cfg_checksum;
+ if (function_name)
+ info->function_name = kstrdup(function_name, GFP_KERNEL);
+
+ list_add_tail(&info->head, &current_info->functions);
+}
+EXPORT_SYMBOL(llvm_gcda_emit_function);
+
+void llvm_gcda_emit_arcs(u32 num_counters, u64 *counters)
+{
+ struct gcov_fn_info *info = list_last_entry(&current_info->functions,
+ struct gcov_fn_info, head);
+
+ info->num_counters = num_counters;
+ info->counters = counters;
+}
+EXPORT_SYMBOL(llvm_gcda_emit_arcs);
+
+void llvm_gcda_summary_info(void)
+{
+}
+EXPORT_SYMBOL(llvm_gcda_summary_info);
+
+void llvm_gcda_end_file(void)
+{
+}
+EXPORT_SYMBOL(llvm_gcda_end_file);
+
+/**
+ * gcov_info_filename - return info filename
+ * @info: profiling data set
+ */
+const char *gcov_info_filename(struct gcov_info *info)
+{
+ return info->filename;
+}
+
+/**
+ * gcov_info_version - return info version
+ * @info: profiling data set
+ */
+unsigned int gcov_info_version(struct gcov_info *info)
+{
+ return info->version;
+}
+
+/**
+ * gcov_info_next - return next profiling data set
+ * @info: profiling data set
+ *
+ * Returns next gcov_info following @info or first gcov_info in the chain if
+ * @info is %NULL.
+ */
+struct gcov_info *gcov_info_next(struct gcov_info *info)
+{
+ if (!info)
+ return list_first_entry_or_null(&clang_gcov_list,
+ struct gcov_info, head);
+ if (list_is_last(&info->head, &clang_gcov_list))
+ return NULL;
+ return list_next_entry(info, head);
+}
+
+/**
+ * gcov_info_link - link/add profiling data set to the list
+ * @info: profiling data set
+ */
+void gcov_info_link(struct gcov_info *info)
+{
+ list_add_tail(&info->head, &clang_gcov_list);
+}
+
+/**
+ * gcov_info_unlink - unlink/remove profiling data set from the list
+ * @prev: previous profiling data set
+ * @info: profiling data set
+ */
+void gcov_info_unlink(struct gcov_info *prev, struct gcov_info *info)
+{
+ /* Generic code unlinks while iterating. */
+ __list_del_entry(&info->head);
+}
+
+/**
+ * gcov_info_within_module - check if a profiling data set belongs to a module
+ * @info: profiling data set
+ * @mod: module
+ *
+ * Returns true if profiling data belongs module, false otherwise.
+ */
+bool gcov_info_within_module(struct gcov_info *info, struct module *mod)
+{
+ return within_module((unsigned long)info->filename, mod);
+}
+
+/* Symbolic links to be created for each profiling data file. */
+const struct gcov_link gcov_link[] = {
+ { OBJ_TREE, "gcno" }, /* Link to .gcno file in $(objtree). */
+ { 0, NULL},
+};
+
+/**
+ * gcov_info_reset - reset profiling data to zero
+ * @info: profiling data set
+ */
+void gcov_info_reset(struct gcov_info *info)
+{
+ struct gcov_fn_info *fn;
+
+ list_for_each_entry(fn, &info->functions, head)
+ memset(fn->counters, 0,
+ sizeof(fn->counters[0]) * fn->num_counters);
+}
+
+/**
+ * gcov_info_is_compatible - check if profiling data can be added
+ * @info1: first profiling data set
+ * @info2: second profiling data set
+ *
+ * Returns non-zero if profiling data can be added, zero otherwise.
+ */
+int gcov_info_is_compatible(struct gcov_info *info1, struct gcov_info *info2)
+{
+ struct gcov_fn_info *fn_ptr1 = list_first_entry_or_null(
+ &info1->functions, struct gcov_fn_info, head);
+ struct gcov_fn_info *fn_ptr2 = list_first_entry_or_null(
+ &info2->functions, struct gcov_fn_info, head);
+
+ if (info1->checksum != info2->checksum)
+ return false;
+ if (!fn_ptr1)
+ return fn_ptr1 == fn_ptr2;
+ while (!list_is_last(&fn_ptr1->head, &info1->functions) &&
+ !list_is_last(&fn_ptr2->head, &info2->functions)) {
+ if (fn_ptr1->checksum != fn_ptr2->checksum)
+ return false;
+ if (fn_ptr1->use_extra_checksum != fn_ptr2->use_extra_checksum)
+ return false;
+ if (fn_ptr1->use_extra_checksum &&
+ fn_ptr1->cfg_checksum != fn_ptr2->cfg_checksum)
+ return false;
+ fn_ptr1 = list_next_entry(fn_ptr1, head);
+ fn_ptr2 = list_next_entry(fn_ptr2, head);
+ }
+ return list_is_last(&fn_ptr1->head, &info1->functions) &&
+ list_is_last(&fn_ptr2->head, &info2->functions);
+}
+
+/**
+ * gcov_info_add - add up profiling data
+ * @dest: profiling data set to which data is added
+ * @source: profiling data set which is added
+ *
+ * Adds profiling counts of @source to @dest.
+ */
+void gcov_info_add(struct gcov_info *dst, struct gcov_info *src)
+{
+ struct gcov_fn_info *dfn_ptr;
+ struct gcov_fn_info *sfn_ptr = list_first_entry_or_null(&src->functions,
+ struct gcov_fn_info, head);
+
+ list_for_each_entry(dfn_ptr, &dst->functions, head) {
+ u32 i;
+
+ for (i = 0; i < sfn_ptr->num_counters; i++)
+ dfn_ptr->counters[i] += sfn_ptr->counters[i];
+ }
+}
+
+static struct gcov_fn_info *gcov_fn_info_dup(struct gcov_fn_info *fn)
+{
+ size_t cv_size; /* counter values size */
+ struct gcov_fn_info *fn_dup = kmemdup(fn, sizeof(*fn),
+ GFP_KERNEL);
+ if (!fn_dup)
+ return NULL;
+ INIT_LIST_HEAD(&fn_dup->head);
+
+ fn_dup->function_name = kstrdup(fn->function_name, GFP_KERNEL);
+ if (!fn_dup->function_name)
+ goto err_name;
+
+ cv_size = fn->num_counters * sizeof(fn->counters[0]);
+ fn_dup->counters = vmalloc(cv_size);
+ if (!fn_dup->counters)
+ goto err_counters;
+ memcpy(fn_dup->counters, fn->counters, cv_size);
+
+ return fn_dup;
+
+err_counters:
+ kfree(fn_dup->function_name);
+err_name:
+ kfree(fn_dup);
+ return NULL;
+}
+
+/**
+ * gcov_info_dup - duplicate profiling data set
+ * @info: profiling data set to duplicate
+ *
+ * Return newly allocated duplicate on success, %NULL on error.
+ */
+struct gcov_info *gcov_info_dup(struct gcov_info *info)
+{
+ struct gcov_info *dup;
+ struct gcov_fn_info *fn;
+
+ dup = kmemdup(info, sizeof(*dup), GFP_KERNEL);
+ if (!dup)
+ return NULL;
+ INIT_LIST_HEAD(&dup->head);
+ INIT_LIST_HEAD(&dup->functions);
+ dup->filename = kstrdup(info->filename, GFP_KERNEL);
+ if (!dup->filename)
+ goto err;
+
+ list_for_each_entry(fn, &info->functions, head) {
+ struct gcov_fn_info *fn_dup = gcov_fn_info_dup(fn);
+
+ if (!fn_dup)
+ goto err;
+ list_add_tail(&fn_dup->head, &dup->functions);
+ }
+
+ return dup;
+
+err:
+ gcov_info_free(dup);
+ return NULL;
+}
+
+/**
+ * gcov_info_free - release memory for profiling data set duplicate
+ * @info: profiling data set duplicate to free
+ */
+void gcov_info_free(struct gcov_info *info)
+{
+ struct gcov_fn_info *fn, *tmp;
+
+ list_for_each_entry_safe(fn, tmp, &info->functions, head) {
+ kfree(fn->function_name);
+ vfree(fn->counters);
+ list_del(&fn->head);
+ kfree(fn);
+ }
+ kfree(info->filename);
+ kfree(info);
+}
+
+#define ITER_STRIDE PAGE_SIZE
+
+/**
+ * struct gcov_iterator - specifies current file position in logical records
+ * @info: associated profiling data
+ * @buffer: buffer containing file data
+ * @size: size of buffer
+ * @pos: current position in file
+ */
+struct gcov_iterator {
+ struct gcov_info *info;
+ void *buffer;
+ size_t size;
+ loff_t pos;
+};
+
+/**
+ * store_gcov_u32 - store 32 bit number in gcov format to buffer
+ * @buffer: target buffer or NULL
+ * @off: offset into the buffer
+ * @v: value to be stored
+ *
+ * Number format defined by gcc: numbers are recorded in the 32 bit
+ * unsigned binary form of the endianness of the machine generating the
+ * file. Returns the number of bytes stored. If @buffer is %NULL, doesn't
+ * store anything.
+ */
+static size_t store_gcov_u32(void *buffer, size_t off, u32 v)
+{
+ u32 *data;
+
+ if (buffer) {
+ data = buffer + off;
+ *data = v;
+ }
+
+ return sizeof(*data);
+}
+
+/**
+ * store_gcov_u64 - store 64 bit number in gcov format to buffer
+ * @buffer: target buffer or NULL
+ * @off: offset into the buffer
+ * @v: value to be stored
+ *
+ * Number format defined by gcc: numbers are recorded in the 32 bit
+ * unsigned binary form of the endianness of the machine generating the
+ * file. 64 bit numbers are stored as two 32 bit numbers, the low part
+ * first. Returns the number of bytes stored. If @buffer is %NULL, doesn't store
+ * anything.
+ */
+static size_t store_gcov_u64(void *buffer, size_t off, u64 v)
+{
+ u32 *data;
+
+ if (buffer) {
+ data = buffer + off;
+
+ data[0] = (v & 0xffffffffUL);
+ data[1] = (v >> 32);
+ }
+
+ return sizeof(*data) * 2;
+}
+
+/**
+ * convert_to_gcda - convert profiling data set to gcda file format
+ * @buffer: the buffer to store file data or %NULL if no data should be stored
+ * @info: profiling data set to be converted
+ *
+ * Returns the number of bytes that were/would have been stored into the buffer.
+ */
+static size_t convert_to_gcda(char *buffer, struct gcov_info *info)
+{
+ struct gcov_fn_info *fi_ptr;
+ size_t pos = 0;
+
+ /* File header. */
+ pos += store_gcov_u32(buffer, pos, GCOV_DATA_MAGIC);
+ pos += store_gcov_u32(buffer, pos, info->version);
+ pos += store_gcov_u32(buffer, pos, info->checksum);
+
+ list_for_each_entry(fi_ptr, &info->functions, head) {
+ u32 i;
+ u32 len = 2;
+
+ if (fi_ptr->use_extra_checksum)
+ len++;
+
+ pos += store_gcov_u32(buffer, pos, GCOV_TAG_FUNCTION);
+ pos += store_gcov_u32(buffer, pos, len);
+ pos += store_gcov_u32(buffer, pos, fi_ptr->ident);
+ pos += store_gcov_u32(buffer, pos, fi_ptr->checksum);
+ if (fi_ptr->use_extra_checksum)
+ pos += store_gcov_u32(buffer, pos, fi_ptr->cfg_checksum);
+
+ pos += store_gcov_u32(buffer, pos, GCOV_TAG_COUNTER_BASE);
+ pos += store_gcov_u32(buffer, pos, fi_ptr->num_counters * 2);
+ for (i = 0; i < fi_ptr->num_counters; i++)
+ pos += store_gcov_u64(buffer, pos, fi_ptr->counters[i]);
+ }
+
+ return pos;
+}
+
+/**
+ * gcov_iter_new - allocate and initialize profiling data iterator
+ * @info: profiling data set to be iterated
+ *
+ * Return file iterator on success, %NULL otherwise.
+ */
+struct gcov_iterator *gcov_iter_new(struct gcov_info *info)
+{
+ struct gcov_iterator *iter;
+
+ iter = kzalloc(sizeof(struct gcov_iterator), GFP_KERNEL);
+ if (!iter)
+ goto err_free;
+
+ iter->info = info;
+ /* Dry-run to get the actual buffer size. */
+ iter->size = convert_to_gcda(NULL, info);
+ iter->buffer = vmalloc(iter->size);
+ if (!iter->buffer)
+ goto err_free;
+
+ convert_to_gcda(iter->buffer, info);
+
+ return iter;
+
+err_free:
+ kfree(iter);
+ return NULL;
+}
+
+
+/**
+ * gcov_iter_free - free iterator data
+ * @iter: file iterator data to be freed
+ */
+void gcov_iter_free(struct gcov_iterator *iter)
+{
+ vfree(iter->buffer);
+ kfree(iter);
+}
+
+/**
+ * gcov_iter_get_info - return profiling data set for given file iterator
+ * @iter: file iterator
+ */
+struct gcov_info *gcov_iter_get_info(struct gcov_iterator *iter)
+{
+ return iter->info;
+}
+
+/**
+ * gcov_iter_start - reset file iterator to starting position
+ * @iter: file iterator
+ */
+void gcov_iter_start(struct gcov_iterator *iter)
+{
+ iter->pos = 0;
+}
+
+/**
+ * gcov_iter_next - advance file iterator to next logical record
+ * @iter: file iterator
+ *
+ * Return zero if new position is valid, non-zero if iterator has reached end.
+ */
+int gcov_iter_next(struct gcov_iterator *iter)
+{
+ if (iter->pos < iter->size)
+ iter->pos += ITER_STRIDE;
+
+ if (iter->pos >= iter->size)
+ return -EINVAL;
+
+ return 0;
+}
+
+/**
+ * gcov_iter_write - write data for current pos to seq_file
+ * @iter: file iterator
+ * @seq: seq_file handle
+ *
+ * Return zero on success, non-zero otherwise.
+ */
+int gcov_iter_write(struct gcov_iterator *iter, struct seq_file *seq)
+{
+ size_t len;
+
+ if (iter->pos >= iter->size)
+ return -EINVAL;
+
+ len = ITER_STRIDE;
+ if (iter->pos + len > iter->size)
+ len = iter->size - iter->pos;
+
+ seq_write(seq, iter->buffer + iter->pos, len);
+
+ return 0;
+}
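
For reference, a minimal user-space sketch of the .gcda record layout that convert_to_gcda() above produces: a file header, then one GCOV_TAG_FUNCTION record and one GCOV_TAG_COUNTER_BASE record per function, all written as native-endian 32-bit words, with 64-bit counters stored low word first. This mirrors store_gcov_u32()/store_gcov_u64() only; the magic, version and tag constants below are assumed placeholder values for the example (the kernel takes the real ones from gcov.h), and the program simply reports the byte count it would emit.

/* Illustrative sketch only: mirrors store_gcov_u32/u64 and convert_to_gcda
 * from the patch above in plain user-space C. Magic/tag/version values are
 * assumed for the example, not taken from kernel headers. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define EX_DATA_MAGIC   0x67636461u   /* assumed file magic */
#define EX_TAG_FUNCTION 0x01000000u   /* assumed function record tag */
#define EX_TAG_COUNTERS 0x01a10000u   /* assumed counter record tag */

static size_t put_u32(unsigned char *buf, size_t off, uint32_t v)
{
	if (buf)
		memcpy(buf + off, &v, sizeof(v));   /* native endianness */
	return sizeof(v);
}

static size_t put_u64(unsigned char *buf, size_t off, uint64_t v)
{
	put_u32(buf, off, (uint32_t)(v & 0xffffffffu));   /* low word first */
	put_u32(buf, off + 4, (uint32_t)(v >> 32));
	return 2 * sizeof(uint32_t);
}

/* File header, then one function record and its counters, as in the patch. */
static size_t emit(unsigned char *buf, uint32_t ident, uint32_t checksum,
		   const uint64_t *counters, uint32_t n)
{
	size_t pos = 0;
	uint32_t i;

	pos += put_u32(buf, pos, EX_DATA_MAGIC);   /* file header */
	pos += put_u32(buf, pos, 0x41393831u);     /* version, assumed */
	pos += put_u32(buf, pos, 0);               /* file checksum */

	pos += put_u32(buf, pos, EX_TAG_FUNCTION);
	pos += put_u32(buf, pos, 2);               /* record length in u32s */
	pos += put_u32(buf, pos, ident);
	pos += put_u32(buf, pos, checksum);

	pos += put_u32(buf, pos, EX_TAG_COUNTERS);
	pos += put_u32(buf, pos, n * 2);           /* length in u32s */
	for (i = 0; i < n; i++)
		pos += put_u64(buf, pos, counters[i]);
	return pos;
}

int main(void)
{
	uint64_t counters[3] = { 1, 42, 7 };
	unsigned char buf[256];
	/* Dry run with NULL to size the buffer, then the real pass,
	 * just like gcov_iter_new() does with convert_to_gcda(). */
	size_t need = emit(NULL, 0xdeadbeef, 0x1234, counters, 3);
	size_t used = emit(buf, 0xdeadbeef, 0x1234, counters, 3);

	printf("gcda image: %zu bytes (dry run said %zu)\n", used, need);
	return 0;
}
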
diff --git a/kernel/gcov/gcc_3_4.c b/kernel/gcov/gcc_3_4.c
index 1e32e66..64d2dd9 100644
--- a/kernel/gcov/gcc_3_4.c
+++ b/kernel/gcov/gcc_3_4.c
@@ -137,6 +137,18 @@
gcov_info_head = info->next;
}
+/**
+ * gcov_info_within_module - check if a profiling data set belongs to a module
+ * @info: profiling data set
+ * @mod: module
+ *
+ * Returns true if profiling data belongs module, false otherwise.
+ */
+bool gcov_info_within_module(struct gcov_info *info, struct module *mod)
+{
+ return within_module((unsigned long)info, mod);
+}
+
/* Symbolic links to be created for each profiling data file. */
const struct gcov_link gcov_link[] = {
{ OBJ_TREE, "gcno" }, /* Link to .gcno file in $(objtree). */
diff --git a/kernel/gcov/gcc_4_7.c b/kernel/gcov/gcc_4_7.c
index ca5e5c0..ec37563 100644
--- a/kernel/gcov/gcc_4_7.c
+++ b/kernel/gcov/gcc_4_7.c
@@ -150,6 +150,18 @@
gcov_info_head = info->next;
}
+/**
+ * gcov_info_within_module - check if a profiling data set belongs to a module
+ * @info: profiling data set
+ * @mod: module
+ *
+ * Returns true if profiling data belongs module, false otherwise.
+ */
+bool gcov_info_within_module(struct gcov_info *info, struct module *mod)
+{
+ return within_module((unsigned long)info, mod);
+}
+
/* Symbolic links to be created for each profiling data file. */
const struct gcov_link gcov_link[] = {
{ OBJ_TREE, "gcno" }, /* Link to .gcno file in $(objtree). */
diff --git a/kernel/gcov/gcc_base.c b/kernel/gcov/gcc_base.c
new file mode 100644
index 0000000..3cf736b
--- /dev/null
+++ b/kernel/gcov/gcc_base.c
@@ -0,0 +1,86 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/export.h>
+#include <linux/kernel.h>
+#include <linux/mutex.h>
+#include "gcov.h"
+
+/*
+ * __gcov_init is called by gcc-generated constructor code for each object
+ * file compiled with -fprofile-arcs.
+ */
+void __gcov_init(struct gcov_info *info)
+{
+ static unsigned int gcov_version;
+
+ mutex_lock(&gcov_lock);
+ if (gcov_version == 0) {
+ gcov_version = gcov_info_version(info);
+ /*
+ * Printing gcc's version magic may prove useful for debugging
+ * incompatibility reports.
+ */
+ pr_info("version magic: 0x%x\n", gcov_version);
+ }
+ /*
+ * Add new profiling data structure to list and inform event
+ * listener.
+ */
+ gcov_info_link(info);
+ if (gcov_events_enabled)
+ gcov_event(GCOV_ADD, info);
+ mutex_unlock(&gcov_lock);
+}
+EXPORT_SYMBOL(__gcov_init);
+
+/*
+ * These functions may be referenced by gcc-generated profiling code but serve
+ * no function for kernel profiling.
+ */
+void __gcov_flush(void)
+{
+ /* Unused. */
+}
+EXPORT_SYMBOL(__gcov_flush);
+
+void __gcov_merge_add(gcov_type *counters, unsigned int n_counters)
+{
+ /* Unused. */
+}
+EXPORT_SYMBOL(__gcov_merge_add);
+
+void __gcov_merge_single(gcov_type *counters, unsigned int n_counters)
+{
+ /* Unused. */
+}
+EXPORT_SYMBOL(__gcov_merge_single);
+
+void __gcov_merge_delta(gcov_type *counters, unsigned int n_counters)
+{
+ /* Unused. */
+}
+EXPORT_SYMBOL(__gcov_merge_delta);
+
+void __gcov_merge_ior(gcov_type *counters, unsigned int n_counters)
+{
+ /* Unused. */
+}
+EXPORT_SYMBOL(__gcov_merge_ior);
+
+void __gcov_merge_time_profile(gcov_type *counters, unsigned int n_counters)
+{
+ /* Unused. */
+}
+EXPORT_SYMBOL(__gcov_merge_time_profile);
+
+void __gcov_merge_icall_topn(gcov_type *counters, unsigned int n_counters)
+{
+ /* Unused. */
+}
+EXPORT_SYMBOL(__gcov_merge_icall_topn);
+
+void __gcov_exit(void)
+{
+ /* Unused. */
+}
+EXPORT_SYMBOL(__gcov_exit);
diff --git a/kernel/gcov/gcov.h b/kernel/gcov/gcov.h
index de118ad..6ab2c18 100644
--- a/kernel/gcov/gcov.h
+++ b/kernel/gcov/gcov.h
@@ -15,6 +15,7 @@
#ifndef GCOV_H
#define GCOV_H GCOV_H
+#include <linux/module.h>
#include <linux/types.h>
/*
@@ -46,6 +47,7 @@
struct gcov_info *gcov_info_next(struct gcov_info *info);
void gcov_info_link(struct gcov_info *info);
void gcov_info_unlink(struct gcov_info *prev, struct gcov_info *info);
+bool gcov_info_within_module(struct gcov_info *info, struct module *mod);
/* Base interface. */
enum gcov_action {
@@ -83,4 +85,7 @@
};
extern const struct gcov_link gcov_link[];
+extern int gcov_events_enabled;
+extern struct mutex gcov_lock;
+
#endif /* GCOV_H */
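
The reason gcov_info_within_module() differs between the backends: with gcc the struct gcov_info is emitted into the instrumented object itself, so its own address identifies the owning module, whereas the clang backend kzalloc()s the struct and only the filename string still points into the module image. A minimal sketch of the underlying address-range test follows; module_region is a hypothetical stand-in for struct module and the addresses are fabricated for illustration.

/* Sketch only: models within_module() as a [base, base + size) range check.
 * module_region is a hypothetical stand-in for struct module. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct module_region {
	uintptr_t base;
	size_t size;
};

static bool within_region(uintptr_t addr, const struct module_region *mod)
{
	return addr >= mod->base && addr < mod->base + mod->size;
}

int main(void)
{
	static const char filename[] = "drivers/foo/foo.gcda";
	struct module_region mod = {
		.base = (uintptr_t)filename, .size = sizeof(filename)
	};

	/* clang backend: test the filename pointer, not the info struct. */
	printf("filename in module: %d\n",
	       within_region((uintptr_t)filename, &mod));
	return 0;
}
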
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 4a91916..3486f57 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -200,6 +200,8 @@
if (hung_task_show_lock)
debug_show_all_locks();
if (hung_task_call_panic) {
+ /* Dump all tasks. */
+ show_state_filter(TASK_UNINTERRUPTIBLE);
trigger_all_cpu_backtrace();
panic("hung_task: blocked tasks");
}
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 7c82626..2e62503 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -18,6 +18,8 @@
#include <linux/cpu.h>
#include <asm/sections.h>
+#ifdef HAVE_JUMP_LABEL
+
/* mutex to protect coming/going of the the jump_label table */
static DEFINE_MUTEX(jump_label_mutex);
@@ -58,13 +60,13 @@
static void jump_label_update(struct static_key *key);
/*
- * There are similar definitions for the !CONFIG_JUMP_LABEL case in jump_label.h.
+ * There are similar definitions for the !HAVE_JUMP_LABEL case in jump_label.h.
* The use of 'atomic_read()' requires atomic.h and its problematic for some
* kernel headers such as kernel.h and others. Since static_key_count() is not
- * used in the branch statements as it is for the !CONFIG_JUMP_LABEL case its ok
+ * used in the branch statements as it is for the !HAVE_JUMP_LABEL case its ok
* to have it be a function here. Similarly, for 'static_key_enable()' and
* 'static_key_disable()', which require bug.h. This should allow jump_label.h
- * to be included from most/all places for CONFIG_JUMP_LABEL.
+ * to be included from most/all places for HAVE_JUMP_LABEL.
*/
int static_key_count(struct static_key *key)
{
@@ -794,3 +796,5 @@
}
early_initcall(jump_label_test);
#endif /* STATIC_KEYS_SELFTEST */
+
+#endif /* HAVE_JUMP_LABEL */
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 087d18d..65234c8 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -101,6 +101,12 @@
}
EXPORT_SYMBOL(kthread_should_stop);
+bool __kthread_should_park(struct task_struct *k)
+{
+ return test_bit(KTHREAD_SHOULD_PARK, &to_kthread(k)->flags);
+}
+EXPORT_SYMBOL_GPL(__kthread_should_park);
+
/**
* kthread_should_park - should this kthread park now?
*
@@ -114,7 +120,7 @@
*/
bool kthread_should_park(void)
{
- return test_bit(KTHREAD_SHOULD_PARK, &to_kthread(current)->flags);
+ return __kthread_should_park(current);
}
EXPORT_SYMBOL_GPL(kthread_should_park);
diff --git a/kernel/module.c b/kernel/module.c
index 20fc0ef..7746d59 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3119,7 +3119,7 @@
sizeof(*mod->tracepoints_ptrs),
&mod->num_tracepoints);
#endif
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
mod->jump_entries = section_objs(info, "__jump_table",
sizeof(*mod->jump_entries),
&mod->num_jump_entries);
diff --git a/kernel/module_signing.c b/kernel/module_signing.c
index f2075ce..6b9a926 100644
--- a/kernel/module_signing.c
+++ b/kernel/module_signing.c
@@ -83,6 +83,7 @@
}
return verify_pkcs7_signature(mod, modlen, mod + modlen, sig_len,
- NULL, VERIFYING_MODULE_SIGNATURE,
+ VERIFY_USE_SECONDARY_KEYRING,
+ VERIFYING_MODULE_SIGNATURE,
NULL, NULL);
}
diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig
index 3a6c2f8..f8fe57d 100644
--- a/kernel/power/Kconfig
+++ b/kernel/power/Kconfig
@@ -298,3 +298,18 @@
config CPU_PM
bool
+
+config ENERGY_MODEL
+ bool "Energy Model for CPUs"
+ depends on SMP
+ depends on CPU_FREQ
+ default n
+ help
+ Several subsystems (thermal and/or the task scheduler for example)
+ can leverage information about the energy consumed by CPUs to make
+ smarter decisions. This config option enables the framework from
+ which subsystems can access the energy models.
+
+ The exact usage of the energy model is subsystem-dependent.
+
+ If in doubt, say N.
diff --git a/kernel/power/Makefile b/kernel/power/Makefile
index a3f79f0e..3f8db83 100644
--- a/kernel/power/Makefile
+++ b/kernel/power/Makefile
@@ -15,3 +15,6 @@
obj-$(CONFIG_PM_WAKELOCKS) += wakelock.o
obj-$(CONFIG_MAGIC_SYSRQ) += poweroff.o
+
+obj-$(CONFIG_SUSPEND) += wakeup_reason.o
+obj-$(CONFIG_ENERGY_MODEL) += energy_model.o
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
new file mode 100644
index 0000000..7d66ee6
--- /dev/null
+++ b/kernel/power/energy_model.c
@@ -0,0 +1,258 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Energy Model of CPUs
+ *
+ * Copyright (c) 2018, Arm ltd.
+ * Written by: Quentin Perret, Arm ltd.
+ */
+
+#define pr_fmt(fmt) "energy_model: " fmt
+
+#include <linux/cpu.h>
+#include <linux/cpumask.h>
+#include <linux/debugfs.h>
+#include <linux/energy_model.h>
+#include <linux/sched/topology.h>
+#include <linux/slab.h>
+
+/* Mapping of each CPU to the performance domain to which it belongs. */
+static DEFINE_PER_CPU(struct em_perf_domain *, em_data);
+
+/*
+ * Mutex serializing the registrations of performance domains and letting
+ * callbacks defined by drivers sleep.
+ */
+static DEFINE_MUTEX(em_pd_mutex);
+
+#ifdef CONFIG_DEBUG_FS
+static struct dentry *rootdir;
+
+static void em_debug_create_cs(struct em_cap_state *cs, struct dentry *pd)
+{
+ struct dentry *d;
+ char name[24];
+
+ snprintf(name, sizeof(name), "cs:%lu", cs->frequency);
+
+ /* Create per-cs directory */
+ d = debugfs_create_dir(name, pd);
+ debugfs_create_ulong("frequency", 0444, d, &cs->frequency);
+ debugfs_create_ulong("power", 0444, d, &cs->power);
+ debugfs_create_ulong("cost", 0444, d, &cs->cost);
+}
+
+static int em_debug_cpus_show(struct seq_file *s, void *unused)
+{
+ seq_printf(s, "%*pbl\n", cpumask_pr_args(to_cpumask(s->private)));
+
+ return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(em_debug_cpus);
+
+static void em_debug_create_pd(struct em_perf_domain *pd, int cpu)
+{
+ struct dentry *d;
+ char name[8];
+ int i;
+
+ snprintf(name, sizeof(name), "pd%d", cpu);
+
+ /* Create the directory of the performance domain */
+ d = debugfs_create_dir(name, rootdir);
+
+ debugfs_create_file("cpus", 0444, d, pd->cpus, &em_debug_cpus_fops);
+
+ /* Create a sub-directory for each capacity state */
+ for (i = 0; i < pd->nr_cap_states; i++)
+ em_debug_create_cs(&pd->table[i], d);
+}
+
+static int __init em_debug_init(void)
+{
+ /* Create /sys/kernel/debug/energy_model directory */
+ rootdir = debugfs_create_dir("energy_model", NULL);
+
+ return 0;
+}
+core_initcall(em_debug_init);
+#else /* CONFIG_DEBUG_FS */
+static void em_debug_create_pd(struct em_perf_domain *pd, int cpu) {}
+#endif
+static struct em_perf_domain *em_create_pd(cpumask_t *span, int nr_states,
+ struct em_data_callback *cb)
+{
+ unsigned long opp_eff, prev_opp_eff = ULONG_MAX;
+ unsigned long power, freq, prev_freq = 0;
+ int i, ret, cpu = cpumask_first(span);
+ struct em_cap_state *table;
+ struct em_perf_domain *pd;
+ u64 fmax;
+
+ if (!cb->active_power)
+ return NULL;
+
+ pd = kzalloc(sizeof(*pd) + cpumask_size(), GFP_KERNEL);
+ if (!pd)
+ return NULL;
+
+ table = kcalloc(nr_states, sizeof(*table), GFP_KERNEL);
+ if (!table)
+ goto free_pd;
+
+ /* Build the list of capacity states for this performance domain */
+ for (i = 0, freq = 0; i < nr_states; i++, freq++) {
+ /*
+ * active_power() is a driver callback which ceils 'freq' to
+ * lowest capacity state of 'cpu' above 'freq' and updates
+ * 'power' and 'freq' accordingly.
+ */
+ ret = cb->active_power(&power, &freq, cpu);
+ if (ret) {
+ pr_err("pd%d: invalid cap. state: %d\n", cpu, ret);
+ goto free_cs_table;
+ }
+
+ /*
+ * We expect the driver callback to increase the frequency for
+ * higher capacity states.
+ */
+ if (freq <= prev_freq) {
+ pr_err("pd%d: non-increasing freq: %lu\n", cpu, freq);
+ goto free_cs_table;
+ }
+
+ /*
+ * The power returned by active_state() is expected to be
+ * positive, in milli-watts and to fit into 16 bits.
+ */
+ if (!power || power > EM_CPU_MAX_POWER) {
+ pr_err("pd%d: invalid power: %lu\n", cpu, power);
+ goto free_cs_table;
+ }
+
+ table[i].power = power;
+ table[i].frequency = prev_freq = freq;
+
+ /*
+ * The hertz/watts efficiency ratio should decrease as the
+ * frequency grows on sane platforms. But this isn't always
+ * true in practice so warn the user if a higher OPP is more
+ * power efficient than a lower one.
+ */
+ opp_eff = freq / power;
+ if (opp_eff >= prev_opp_eff)
+ pr_warn("pd%d: hertz/watts ratio non-monotonically decreasing: em_cap_state %d >= em_cap_state%d\n",
+ cpu, i, i - 1);
+ prev_opp_eff = opp_eff;
+ }
+
+ /* Compute the cost of each capacity_state. */
+ fmax = (u64) table[nr_states - 1].frequency;
+ for (i = 0; i < nr_states; i++) {
+ table[i].cost = div64_u64(fmax * table[i].power,
+ table[i].frequency);
+ }
+
+ pd->table = table;
+ pd->nr_cap_states = nr_states;
+ cpumask_copy(to_cpumask(pd->cpus), span);
+
+ em_debug_create_pd(pd, cpu);
+
+ return pd;
+
+free_cs_table:
+ kfree(table);
+free_pd:
+ kfree(pd);
+
+ return NULL;
+}
+
+/**
+ * em_cpu_get() - Return the performance domain for a CPU
+ * @cpu : CPU to find the performance domain for
+ *
+ * Return: the performance domain to which 'cpu' belongs, or NULL if it doesn't
+ * exist.
+ */
+struct em_perf_domain *em_cpu_get(int cpu)
+{
+ return READ_ONCE(per_cpu(em_data, cpu));
+}
+EXPORT_SYMBOL_GPL(em_cpu_get);
+
+/**
+ * em_register_perf_domain() - Register the Energy Model of a performance domain
+ * @span : Mask of CPUs in the performance domain
+ * @nr_states : Number of capacity states to register
+ * @cb : Callback functions providing the data of the Energy Model
+ *
+ * Create Energy Model tables for a performance domain using the callbacks
+ * defined in cb.
+ *
+ * If multiple clients register the same performance domain, all but the first
+ * registration will be ignored.
+ *
+ * Return 0 on success
+ */
+int em_register_perf_domain(cpumask_t *span, unsigned int nr_states,
+ struct em_data_callback *cb)
+{
+ unsigned long cap, prev_cap = 0;
+ struct em_perf_domain *pd;
+ int cpu, ret = 0;
+
+ if (!span || !nr_states || !cb)
+ return -EINVAL;
+
+ /*
+ * Use a mutex to serialize the registration of performance domains and
+ * let the driver-defined callback functions sleep.
+ */
+ mutex_lock(&em_pd_mutex);
+
+ for_each_cpu(cpu, span) {
+ /* Make sure we don't register again an existing domain. */
+ if (READ_ONCE(per_cpu(em_data, cpu))) {
+ ret = -EEXIST;
+ goto unlock;
+ }
+
+ /*
+ * All CPUs of a domain must have the same micro-architecture
+ * since they all share the same table.
+ */
+ cap = arch_scale_cpu_capacity(NULL, cpu);
+ if (prev_cap && prev_cap != cap) {
+ pr_err("CPUs of %*pbl must have the same capacity\n",
+ cpumask_pr_args(span));
+ ret = -EINVAL;
+ goto unlock;
+ }
+ prev_cap = cap;
+ }
+
+ /* Create the performance domain and add it to the Energy Model. */
+ pd = em_create_pd(span, nr_states, cb);
+ if (!pd) {
+ ret = -EINVAL;
+ goto unlock;
+ }
+
+ for_each_cpu(cpu, span) {
+ /*
+ * The per-cpu array can be read concurrently from em_cpu_get().
+ * The barrier enforces the ordering needed to make sure readers
+ * can only access well formed em_perf_domain structs.
+ */
+ smp_store_release(per_cpu_ptr(&em_data, cpu), pd);
+ }
+
+ pr_debug("Created perf domain %*pbl\n", cpumask_pr_args(span));
+unlock:
+ mutex_unlock(&em_pd_mutex);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(em_register_perf_domain);
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 7381d49..d76e616 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -85,26 +85,27 @@
elapsed = ktime_sub(end, start);
elapsed_msecs = ktime_to_ms(elapsed);
- if (todo) {
+ if (wakeup) {
pr_cont("\n");
- pr_err("Freezing of tasks %s after %d.%03d seconds "
- "(%d tasks refusing to freeze, wq_busy=%d):\n",
- wakeup ? "aborted" : "failed",
+ pr_err("Freezing of tasks aborted after %d.%03d seconds",
+ elapsed_msecs / 1000, elapsed_msecs % 1000);
+ } else if (todo) {
+ pr_cont("\n");
+ pr_err("Freezing of tasks failed after %d.%03d seconds"
+ " (%d tasks refusing to freeze, wq_busy=%d):\n",
elapsed_msecs / 1000, elapsed_msecs % 1000,
todo - wq_busy, wq_busy);
if (wq_busy)
show_workqueue_state();
- if (!wakeup) {
- read_lock(&tasklist_lock);
- for_each_process_thread(g, p) {
- if (p != current && !freezer_should_skip(p)
- && freezing(p) && !frozen(p))
- sched_show_task(p);
- }
- read_unlock(&tasklist_lock);
+ read_lock(&tasklist_lock);
+ for_each_process_thread(g, p) {
+ if (p != current && !freezer_should_skip(p)
+ && freezing(p) && !frozen(p))
+ sched_show_task(p);
}
+ read_unlock(&tasklist_lock);
} else {
pr_cont("(elapsed %d.%03d seconds) ", elapsed_msecs / 1000,
elapsed_msecs % 1000);
diff --git a/kernel/power/wakeup_reason.c b/kernel/power/wakeup_reason.c
new file mode 100644
index 0000000..904b65f
--- /dev/null
+++ b/kernel/power/wakeup_reason.c
@@ -0,0 +1,184 @@
+/*
+ * kernel/power/wakeup_reason.c
+ *
+ * Logs the reasons which caused the kernel to resume from
+ * the suspend mode.
+ *
+ * Copyright (C) 2014 Google, Inc.
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/wakeup_reason.h>
+#include <linux/kernel.h>
+#include <linux/irq.h>
+#include <linux/interrupt.h>
+#include <linux/io.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/init.h>
+#include <linux/spinlock.h>
+#include <linux/notifier.h>
+#include <linux/suspend.h>
+
+
+#define MAX_WAKEUP_REASON_IRQS 32
+static int irq_list[MAX_WAKEUP_REASON_IRQS];
+static int irqcount;
+static struct kobject *wakeup_reason;
+static spinlock_t resume_reason_lock;
+
+static ktime_t last_monotime; /* monotonic time before last suspend */
+static ktime_t curr_monotime; /* monotonic time after last suspend */
+static ktime_t last_stime; /* monotonic boottime offset before last suspend */
+static ktime_t curr_stime; /* monotonic boottime offset after last suspend */
+
+static ssize_t last_resume_reason_show(struct kobject *kobj, struct kobj_attribute *attr,
+ char *buf)
+{
+ int irq_no, buf_offset = 0;
+ struct irq_desc *desc;
+ spin_lock(&resume_reason_lock);
+ for (irq_no = 0; irq_no < irqcount; irq_no++) {
+ desc = irq_to_desc(irq_list[irq_no]);
+ if (desc && desc->action && desc->action->name)
+ buf_offset += sprintf(buf + buf_offset, "%d %s\n",
+ irq_list[irq_no], desc->action->name);
+ else
+ buf_offset += sprintf(buf + buf_offset, "%d\n",
+ irq_list[irq_no]);
+ }
+ spin_unlock(&resume_reason_lock);
+ return buf_offset;
+}
+
+static ssize_t last_suspend_time_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct timespec sleep_time;
+ struct timespec total_time;
+ struct timespec suspend_resume_time;
+
+ /*
+ * total_time is calculated from monotonic bootoffsets because
+ * unlike CLOCK_MONOTONIC it includes the time spent in suspend state.
+ */
+ total_time = ktime_to_timespec(ktime_sub(curr_stime, last_stime));
+
+ /*
+ * suspend_resume_time is calculated as monotonic (CLOCK_MONOTONIC)
+ * time interval before entering suspend and post suspend.
+ */
+ suspend_resume_time = ktime_to_timespec(ktime_sub(curr_monotime, last_monotime));
+
+ /* sleep_time = total_time - suspend_resume_time */
+ sleep_time = timespec_sub(total_time, suspend_resume_time);
+
+ /* Export suspend_resume_time and sleep_time in pair here. */
+ return sprintf(buf, "%lu.%09lu %lu.%09lu\n",
+ suspend_resume_time.tv_sec, suspend_resume_time.tv_nsec,
+ sleep_time.tv_sec, sleep_time.tv_nsec);
+}
+
+static struct kobj_attribute resume_reason = __ATTR_RO(last_resume_reason);
+static struct kobj_attribute suspend_time = __ATTR_RO(last_suspend_time);
+
+static struct attribute *attrs[] = {
+ &resume_reason.attr,
+ &suspend_time.attr,
+ NULL,
+};
+static struct attribute_group attr_group = {
+ .attrs = attrs,
+};
+
+/*
+ * logs all the wake up reasons to the kernel
+ * stores the irqs to expose them to the userspace via sysfs
+ */
+void log_wakeup_reason(int irq)
+{
+ struct irq_desc *desc;
+ desc = irq_to_desc(irq);
+ if (desc && desc->action && desc->action->name)
+ printk(KERN_INFO "Resume caused by IRQ %d, %s\n", irq,
+ desc->action->name);
+ else
+ printk(KERN_INFO "Resume caused by IRQ %d\n", irq);
+
+ spin_lock(&resume_reason_lock);
+ if (irqcount == MAX_WAKEUP_REASON_IRQS) {
+ spin_unlock(&resume_reason_lock);
+ printk(KERN_WARNING "Resume caused by more than %d IRQs\n",
+ MAX_WAKEUP_REASON_IRQS);
+ return;
+ }
+
+ irq_list[irqcount++] = irq;
+ spin_unlock(&resume_reason_lock);
+}
+
+/* Detects a suspend and clears all the previous wake up reasons*/
+static int wakeup_reason_pm_event(struct notifier_block *notifier,
+ unsigned long pm_event, void *unused)
+{
+ switch (pm_event) {
+ case PM_SUSPEND_PREPARE:
+ spin_lock(&resume_reason_lock);
+ irqcount = 0;
+ spin_unlock(&resume_reason_lock);
+ /* monotonic time since boot */
+ last_monotime = ktime_get();
+ /* monotonic time since boot including the time spent in suspend */
+ last_stime = ktime_get_boottime();
+ break;
+ case PM_POST_SUSPEND:
+ /* monotonic time since boot */
+ curr_monotime = ktime_get();
+ /* monotonic time since boot including the time spent in suspend */
+ curr_stime = ktime_get_boottime();
+ break;
+ default:
+ break;
+ }
+ return NOTIFY_DONE;
+}
+
+static struct notifier_block wakeup_reason_pm_notifier_block = {
+ .notifier_call = wakeup_reason_pm_event,
+};
+
+/* Initializes the sysfs parameter
+ * registers the pm_event notifier
+ */
+int __init wakeup_reason_init(void)
+{
+ int retval;
+ spin_lock_init(&resume_reason_lock);
+ retval = register_pm_notifier(&wakeup_reason_pm_notifier_block);
+ if (retval)
+ printk(KERN_WARNING "[%s] failed to register PM notifier %d\n",
+ __func__, retval);
+
+ wakeup_reason = kobject_create_and_add("wakeup_reasons", kernel_kobj);
+ if (!wakeup_reason) {
+ printk(KERN_WARNING "[%s] failed to create a sysfs kobject\n",
+ __func__);
+ return 1;
+ }
+ retval = sysfs_create_group(wakeup_reason, &attr_group);
+ if (retval) {
+ kobject_put(wakeup_reason);
+ printk(KERN_WARNING "[%s] failed to create a sysfs group %d\n",
+ __func__, retval);
+ }
+ return 0;
+}
+
+late_initcall(wakeup_reason_init);
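
The two attributes registered above appear under /sys/kernel/wakeup_reasons/ (the kobject name comes from the code above). A small user-space sketch of consuming the last_suspend_time pair: the first value is the time spent in the suspend/resume path, the second the time actually spent asleep, both in seconds. It assumes CONFIG_SUSPEND is enabled and at least one suspend cycle has happened; error handling is kept minimal.

/* Sketch: read the last_suspend_time attribute exported by wakeup_reason.c. */
#include <stdio.h>

int main(void)
{
	double suspend_resume = 0.0, asleep = 0.0;
	FILE *f = fopen("/sys/kernel/wakeup_reasons/last_suspend_time", "r");

	if (!f) {
		perror("last_suspend_time");
		return 1;
	}
	if (fscanf(f, "%lf %lf", &suspend_resume, &asleep) != 2) {
		fprintf(stderr, "unexpected format\n");
		fclose(f);
		return 1;
	}
	fclose(f);

	printf("suspend/resume path: %.3f s, slept: %.3f s\n",
	       suspend_resume, asleep);
	return 0;
}
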
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 7fe1834..2389350 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -24,6 +24,7 @@
obj-$(CONFIG_SCHED_AUTOGROUP) += autogroup.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
+obj-$(CONFIG_SCHED_TUNE) += tune.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
obj-$(CONFIG_CPU_FREQ) += cpufreq.o
obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c
index 2d4ff53..2067080 100644
--- a/kernel/sched/autogroup.c
+++ b/kernel/sched/autogroup.c
@@ -259,7 +259,6 @@
}
#endif /* CONFIG_PROC_FS */
-#ifdef CONFIG_SCHED_DEBUG
int autogroup_path(struct task_group *tg, char *buf, int buflen)
{
if (!task_group_is_autogroup(tg))
@@ -267,4 +266,3 @@
return snprintf(buf, buflen, "%s-%ld", "/autogroup", tg->autogroup->id);
}
-#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2befd2c..2ed0779 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -24,7 +24,7 @@
DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
-#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL)
+#if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL)
/*
* Debugging: various feature bits
*
@@ -181,6 +181,7 @@
if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
update_irq_load_avg(rq, irq_delta + steal);
#endif
+ update_rq_clock_pelt(rq, delta);
}
void update_rq_clock(struct rq *rq)
@@ -413,6 +414,8 @@
if (cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL))
return;
+ head->count++;
+
get_task_struct(task);
/*
@@ -422,6 +425,10 @@
head->lastp = &node->next;
}
+static int
+try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
+ int sibling_count_hint);
+
void wake_up_q(struct wake_q_head *head)
{
struct wake_q_node *node = head->first;
@@ -436,10 +443,10 @@
task->wake_q.next = NULL;
/*
- * wake_up_process() executes a full barrier, which pairs with
+ * try_to_wake_up() executes a full barrier, which pairs with
* the queueing in wake_q_add() so as not to miss wakeups.
*/
- wake_up_process(task);
+ try_to_wake_up(task, TASK_NORMAL, 0, head->count);
put_task_struct(task);
}
}
@@ -702,6 +709,7 @@
if (idle_policy(p->policy)) {
load->weight = scale_load(WEIGHT_IDLEPRIO);
load->inv_weight = WMULT_IDLEPRIO;
+ p->se.runnable_weight = load->weight;
return;
}
@@ -714,6 +722,7 @@
} else {
load->weight = scale_load(sched_prio_to_weight[prio]);
load->inv_weight = sched_prio_to_wmult[prio];
+ p->se.runnable_weight = load->weight;
}
}
@@ -1524,12 +1533,14 @@
* The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
*/
static inline
-int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
+int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags,
+ int sibling_count_hint)
{
lockdep_assert_held(&p->pi_lock);
if (p->nr_cpus_allowed > 1)
- cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
+ cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags,
+ sibling_count_hint);
else
cpu = cpumask_any(&p->cpus_allowed);
@@ -1932,6 +1943,8 @@
* @p: the thread to be awakened
* @state: the mask of task states that can be woken
* @wake_flags: wake modifier flags (WF_*)
+ * @sibling_count_hint: A hint at the number of threads that are being woken up
+ * in this event.
*
* If (@state & @p->state) @p->state = TASK_RUNNING.
*
@@ -1947,7 +1960,8 @@
* %false otherwise.
*/
static int
-try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
+ int sibling_count_hint)
{
unsigned long flags;
int cpu, success = 0;
@@ -2034,7 +2048,8 @@
atomic_dec(&task_rq(p)->nr_iowait);
}
- cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
+ cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags,
+ sibling_count_hint);
if (task_cpu(p) != cpu) {
wake_flags |= WF_MIGRATED;
set_task_cpu(p, cpu);
@@ -2121,13 +2136,13 @@
*/
int wake_up_process(struct task_struct *p)
{
- return try_to_wake_up(p, TASK_NORMAL, 0);
+ return try_to_wake_up(p, TASK_NORMAL, 0, 1);
}
EXPORT_SYMBOL(wake_up_process);
int wake_up_state(struct task_struct *p, unsigned int state)
{
- return try_to_wake_up(p, state, 0);
+ return try_to_wake_up(p, state, 0, 1);
}
/*
@@ -2409,7 +2424,7 @@
* as we're not fully set-up yet.
*/
p->recent_used_cpu = task_cpu(p);
- __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
+ __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0, 1));
#endif
rq = __task_rq_lock(p, &rf);
update_rq_clock(rq);
@@ -2948,7 +2963,7 @@
int dest_cpu;
raw_spin_lock_irqsave(&p->pi_lock, flags);
- dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);
+ dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0, 1);
if (dest_cpu == smp_processor_id())
goto unlock;
@@ -3311,11 +3326,7 @@
print_ip_sym(preempt_disable_ip);
pr_cont("\n");
}
- if (panic_on_warn)
- panic("scheduling while atomic\n");
-
- dump_stack();
- add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
+ BUG();
}
/*
@@ -3750,7 +3761,7 @@
int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
void *key)
{
- return try_to_wake_up(curr->private, mode, wake_flags);
+ return try_to_wake_up(curr->private, mode, wake_flags, 1);
}
EXPORT_SYMBOL(default_wake_function);
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 1b7ec82..4a00ea9 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -13,11 +13,13 @@
#include "sched.h"
+#include <linux/sched/cpufreq.h>
#include <trace/events/power.h>
struct sugov_tunables {
struct gov_attr_set attr_set;
- unsigned int rate_limit_us;
+ unsigned int up_rate_limit_us;
+ unsigned int down_rate_limit_us;
};
struct sugov_policy {
@@ -28,7 +30,9 @@
raw_spinlock_t update_lock; /* For shared policies */
u64 last_freq_update_time;
- s64 freq_update_delay_ns;
+ s64 min_rate_limit_ns;
+ s64 up_rate_delay_ns;
+ s64 down_rate_delay_ns;
unsigned int next_freq;
unsigned int cached_raw_freq;
@@ -95,9 +99,32 @@
return true;
}
+ /* No need to recalculate next freq for min_rate_limit_us
+ * at least. However we might still decide to further rate
+ * limit once frequency change direction is decided, according
+ * to the separate rate limits.
+ */
+
+ delta_ns = time - sg_policy->last_freq_update_time;
+ return delta_ns >= sg_policy->min_rate_limit_ns;
+}
+
+static bool sugov_up_down_rate_limit(struct sugov_policy *sg_policy, u64 time,
+ unsigned int next_freq)
+{
+ s64 delta_ns;
+
delta_ns = time - sg_policy->last_freq_update_time;
- return delta_ns >= sg_policy->freq_update_delay_ns;
+ if (next_freq > sg_policy->next_freq &&
+ delta_ns < sg_policy->up_rate_delay_ns)
+ return true;
+
+ if (next_freq < sg_policy->next_freq &&
+ delta_ns < sg_policy->down_rate_delay_ns)
+ return true;
+
+ return false;
}
static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
@@ -106,6 +133,9 @@
if (sg_policy->next_freq == next_freq)
return false;
+ if (sugov_up_down_rate_limit(sg_policy, time, next_freq))
+ return false;
+
sg_policy->next_freq = next_freq;
sg_policy->last_freq_update_time = time;
@@ -174,7 +204,7 @@
unsigned int freq = arch_scale_freq_invariant() ?
policy->cpuinfo.max_freq : policy->cur;
- freq = (freq + (freq >> 2)) * util / max;
+ freq = map_util_freq(util, freq, max);
if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
return sg_policy->next_freq;
@@ -196,6 +226,9 @@
* Where the cfs,rt and dl util numbers are tracked with the same metric and
* synchronized windows and are thus directly comparable.
*
+ * The @util parameter passed to this function is assumed to be the aggregation
+ * of RT and CFS util numbers. The cases of DL and IRQ are managed here.
+ *
* The cfs,rt,dl utilization are the running times measured with rq->clock_task
* which excludes things like IRQ and steal-time. These latter are then accrued
* in the irq utilization.
@@ -204,15 +237,14 @@
* based on the task model parameters and gives the minimal utilization
* required to meet deadlines.
*/
-static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
+unsigned long schedutil_freq_util(int cpu, unsigned long util,
+ unsigned long max, enum schedutil_type type)
{
- struct rq *rq = cpu_rq(sg_cpu->cpu);
- unsigned long util, irq, max;
+ unsigned long dl_util, irq;
+ struct rq *rq = cpu_rq(cpu);
- sg_cpu->max = max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
- sg_cpu->bw_dl = cpu_bw_dl(rq);
-
- if (rt_rq_is_runnable(&rq->rt))
+ if (sched_feat(SUGOV_RT_MAX_FREQ) && type == FREQUENCY_UTIL &&
+ rt_rq_is_runnable(&rq->rt))
return max;
/*
@@ -225,27 +257,33 @@
return max;
/*
- * Because the time spend on RT/DL tasks is visible as 'lost' time to
- * CFS tasks and we use the same metric to track the effective
- * utilization (PELT windows are synchronized) we can directly add them
- * to obtain the CPU's actual utilization.
+ * The function is called with @util defined as the aggregation (the
+ * sum) of RT and CFS signals, hence leaving the special case of DL
+ * to be dealt with. The exact way of doing things depends on the calling
+ * context.
*/
- util = cpu_util_cfs(rq);
- util += cpu_util_rt(rq);
+ dl_util = cpu_util_dl(rq);
/*
- * We do not make cpu_util_dl() a permanent part of this sum because we
- * want to use cpu_bw_dl() later on, but we need to check if the
- * CFS+RT+DL sum is saturated (ie. no idle time) such that we select
- * f_max when there is no idle time.
+ * For frequency selection we do not make cpu_util_dl() a permanent part
+ * of this sum because we want to use cpu_bw_dl() later on, but we need
+ * to check if the CFS+RT+DL sum is saturated (ie. no idle time) such
+ * that we select f_max when there is no idle time.
*
* NOTE: numerical errors or stop class might cause us to not quite hit
* saturation when we should -- something for later.
*/
- if ((util + cpu_util_dl(rq)) >= max)
+ if (util + dl_util >= max)
return max;
/*
+ * OTOH, for energy computation we need the estimated running time, so
+ * include util_dl and ignore dl_bw.
+ */
+ if (type == ENERGY_UTIL)
+ util += dl_util;
+
+ /*
* There is still idle time; further improve the number by using the
* irq metric. Because IRQ/steal time is hidden from the task clock we
* need to scale the task numbers:
@@ -267,7 +305,22 @@
* bw_dl as requested freq. However, cpufreq is not yet ready for such
* an interface. So, we only do the latter for now.
*/
- return min(max, util + sg_cpu->bw_dl);
+ if (type == FREQUENCY_UTIL)
+ util += cpu_bw_dl(rq);
+
+ return min(max, util);
+}
+
+static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
+{
+ struct rq *rq = cpu_rq(sg_cpu->cpu);
+ unsigned long util = boosted_cpu_util(sg_cpu->cpu, cpu_util_rt(rq));
+ unsigned long max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
+
+ sg_cpu->max = max;
+ sg_cpu->bw_dl = cpu_bw_dl(rq);
+
+ return schedutil_freq_util(sg_cpu->cpu, util, max, FREQUENCY_UTIL);
}
/**
@@ -559,15 +612,32 @@
return container_of(attr_set, struct sugov_tunables, attr_set);
}
-static ssize_t rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
+static DEFINE_MUTEX(min_rate_lock);
+
+static void update_min_rate_limit_ns(struct sugov_policy *sg_policy)
+{
+ mutex_lock(&min_rate_lock);
+ sg_policy->min_rate_limit_ns = min(sg_policy->up_rate_delay_ns,
+ sg_policy->down_rate_delay_ns);
+ mutex_unlock(&min_rate_lock);
+}
+
+static ssize_t up_rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
{
struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
- return sprintf(buf, "%u\n", tunables->rate_limit_us);
+ return sprintf(buf, "%u\n", tunables->up_rate_limit_us);
}
-static ssize_t
-rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, size_t count)
+static ssize_t down_rate_limit_us_show(struct gov_attr_set *attr_set, char *buf)
+{
+ struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+
+ return sprintf(buf, "%u\n", tunables->down_rate_limit_us);
+}
+
+static ssize_t up_rate_limit_us_store(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
{
struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
struct sugov_policy *sg_policy;
@@ -576,18 +646,42 @@
if (kstrtouint(buf, 10, &rate_limit_us))
return -EINVAL;
- tunables->rate_limit_us = rate_limit_us;
+ tunables->up_rate_limit_us = rate_limit_us;
- list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
- sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
+ list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) {
+ sg_policy->up_rate_delay_ns = rate_limit_us * NSEC_PER_USEC;
+ update_min_rate_limit_ns(sg_policy);
+ }
return count;
}
-static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
+static ssize_t down_rate_limit_us_store(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
+{
+ struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+ struct sugov_policy *sg_policy;
+ unsigned int rate_limit_us;
+
+ if (kstrtouint(buf, 10, &rate_limit_us))
+ return -EINVAL;
+
+ tunables->down_rate_limit_us = rate_limit_us;
+
+ list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) {
+ sg_policy->down_rate_delay_ns = rate_limit_us * NSEC_PER_USEC;
+ update_min_rate_limit_ns(sg_policy);
+ }
+
+ return count;
+}
+
+static struct governor_attr up_rate_limit_us = __ATTR_RW(up_rate_limit_us);
+static struct governor_attr down_rate_limit_us = __ATTR_RW(down_rate_limit_us);
static struct attribute *sugov_attributes[] = {
- &rate_limit_us.attr,
+ &up_rate_limit_us.attr,
+ &down_rate_limit_us.attr,
NULL
};
@@ -598,7 +692,7 @@
/********************** cpufreq governor interface *********************/
-static struct cpufreq_governor schedutil_gov;
+struct cpufreq_governor schedutil_gov;
static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
{
@@ -743,7 +837,8 @@
goto stop_kthread;
}
- tunables->rate_limit_us = cpufreq_policy_transition_delay_us(policy);
+ tunables->up_rate_limit_us = cpufreq_policy_transition_delay_us(policy);
+ tunables->down_rate_limit_us = cpufreq_policy_transition_delay_us(policy);
policy->governor_data = sg_policy;
sg_policy->tunables = tunables;
@@ -802,7 +897,11 @@
struct sugov_policy *sg_policy = policy->governor_data;
unsigned int cpu;
- sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
+ sg_policy->up_rate_delay_ns =
+ sg_policy->tunables->up_rate_limit_us * NSEC_PER_USEC;
+ sg_policy->down_rate_delay_ns =
+ sg_policy->tunables->down_rate_limit_us * NSEC_PER_USEC;
+ update_min_rate_limit_ns(sg_policy);
sg_policy->last_freq_update_time = 0;
sg_policy->next_freq = 0;
sg_policy->work_in_progress = false;
@@ -861,7 +960,7 @@
sg_policy->limits_changed = true;
}
-static struct cpufreq_governor schedutil_gov = {
+struct cpufreq_governor schedutil_gov = {
.name = "schedutil",
.owner = THIS_MODULE,
.dynamic_switching = true,
@@ -884,3 +983,36 @@
return cpufreq_register_governor(&schedutil_gov);
}
fs_initcall(sugov_register);
+
+#ifdef CONFIG_ENERGY_MODEL
+extern bool sched_energy_update;
+extern struct mutex sched_energy_mutex;
+
+static void rebuild_sd_workfn(struct work_struct *work)
+{
+ mutex_lock(&sched_energy_mutex);
+ sched_energy_update = true;
+ rebuild_sched_domains();
+ sched_energy_update = false;
+ mutex_unlock(&sched_energy_mutex);
+}
+static DECLARE_WORK(rebuild_sd_work, rebuild_sd_workfn);
+
+/*
+ * EAS shouldn't be attempted without sugov, so rebuild the sched_domains
+ * on governor changes to make sure the scheduler knows about it.
+ */
+void sched_cpufreq_governor_change(struct cpufreq_policy *policy,
+ struct cpufreq_governor *old_gov)
+{
+ if (old_gov == &schedutil_gov || policy->governor == &schedutil_gov) {
+ /*
+ * When called from the cpufreq_register_driver() path, the
+ * cpu_hotplug_lock is already held, so use a work item to
+ * avoid nested locking in rebuild_sched_domains().
+ */
+ schedule_work(&rebuild_sd_work);
+ }
+
+}
+#endif
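
A self-contained sketch of the rate-limiting decision the schedutil changes above introduce: a frequency increase is dropped while up_rate_limit_us has not yet elapsed, a decrease while down_rate_limit_us has not, and sugov_should_update_freq() additionally gates on the minimum of the two. The struct and sample numbers below are arbitrary illustration, not the kernel's data structures.

/* Sketch of the up/down rate-limit logic added to schedutil above.
 * Times are in nanoseconds; all values are arbitrary sample data. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct policy {
	uint64_t last_update;
	int64_t up_delay_ns;
	int64_t down_delay_ns;
	unsigned int cur_freq;
};

/* Mirrors sugov_up_down_rate_limit(): true means "skip this update". */
static bool rate_limited(const struct policy *p, uint64_t now,
			 unsigned int next_freq)
{
	int64_t delta = (int64_t)(now - p->last_update);

	if (next_freq > p->cur_freq && delta < p->up_delay_ns)
		return true;
	if (next_freq < p->cur_freq && delta < p->down_delay_ns)
		return true;
	return false;
}

int main(void)
{
	struct policy p = {
		.last_update = 1000000,
		.up_delay_ns = 500000,     /* 500 us before raising freq */
		.down_delay_ns = 20000000, /* 20 ms before lowering freq */
		.cur_freq = 1000000,
	};

	/* 1 ms later: raising is allowed, lowering is still rate limited. */
	printf("raise at +1ms limited: %d\n",
	       rate_limited(&p, p.last_update + 1000000, 1500000));
	printf("lower at +1ms limited: %d\n",
	       rate_limited(&p, p.last_update + 1000000, 500000));
	return 0;
}
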
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index ebec37c..84cfedc 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1599,7 +1599,8 @@
static int find_later_rq(struct task_struct *task);
static int
-select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
+select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags,
+ int sibling_count_hint)
{
struct task_struct *curr;
struct rq *rq;
@@ -1793,7 +1794,7 @@
deadline_queue_push_tasks(rq);
if (rq->curr->sched_class != &dl_sched_class)
- update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
+ update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
return p;
}
@@ -1802,7 +1803,7 @@
{
update_curr_dl(rq);
- update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
+ update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1);
if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
enqueue_pushable_dl_task(rq, p);
}
@@ -1819,7 +1820,7 @@
{
update_curr_dl(rq);
- update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
+ update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1);
/*
* Even when we have runtime, update_curr_dl() might have resulted in us
* not being the leftmost task anymore. In that case NEED_RESCHED will
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 78fadf0..141ea9f 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -73,7 +73,7 @@
return 0;
}
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
#define jump_label_key__true STATIC_KEY_INIT_TRUE
#define jump_label_key__false STATIC_KEY_INIT_FALSE
@@ -99,7 +99,7 @@
#else
static void sched_feat_disable(int i) { };
static void sched_feat_enable(int i) { };
-#endif /* CONFIG_JUMP_LABEL */
+#endif /* HAVE_JUMP_LABEL */
static int sched_feat_set(char *cmp)
{
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 86ccaaf..28c14e8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -41,6 +41,16 @@
unsigned int normalized_sysctl_sched_latency = 6000000ULL;
/*
+ * Enable/disable honoring sync flag in energy-aware wakeups.
+ */
+unsigned int sysctl_sched_sync_hint_enable = 1;
+
+/*
+ * Enable/disable using cstate knowledge in idle sibling selection
+ */
+unsigned int sysctl_sched_cstate_aware = 0;
+
+/*
* The initial- and re-scaling of tunables is configurable
*
* Options are:
@@ -248,13 +258,6 @@
*/
#ifdef CONFIG_FAIR_GROUP_SCHED
-
-/* cpu runqueue to which this cfs_rq is attached */
-static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
-{
- return cfs_rq->rq;
-}
-
static inline struct task_struct *task_of(struct sched_entity *se)
{
SCHED_WARN_ON(!entity_is_task(se));
@@ -434,12 +437,6 @@
return container_of(se, struct task_struct, se);
}
-static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
-{
- return container_of(cfs_rq, struct rq, cfs);
-}
-
-
#define for_each_sched_entity(se) \
for (; se; se = NULL)
@@ -715,12 +712,12 @@
return calc_delta_fair(sched_slice(cfs_rq, se), se);
}
-#ifdef CONFIG_SMP
#include "pelt.h"
-#include "sched-pelt.h"
+#ifdef CONFIG_SMP
static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
static unsigned long task_h_load(struct task_struct *p);
+static unsigned long capacity_of(int cpu);
/* Give new sched_entity start runnable values to heavy its load in infant time */
void init_entity_runnable_average(struct sched_entity *se)
@@ -804,7 +801,7 @@
* such that the next switched_to_fair() has the
* expected state.
*/
- se->avg.last_update_time = cfs_rq_clock_task(cfs_rq);
+ se->avg.last_update_time = cfs_rq_clock_pelt(cfs_rq);
return;
}
}
@@ -969,6 +966,7 @@
}
trace_sched_stat_blocked(tsk, delta);
+ trace_sched_blocked_reason(tsk);
/*
* Blocking time is in units of nanosecs, so shift by
@@ -1515,7 +1513,6 @@
static unsigned long weighted_cpuload(struct rq *rq);
static unsigned long source_load(int cpu, int type);
static unsigned long target_load(int cpu, int type);
-static unsigned long capacity_of(int cpu);
/* Cached statistics for all CPUs within a node */
struct numa_stats {
@@ -3172,6 +3169,8 @@
if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
atomic_long_add(delta, &cfs_rq->tg->load_avg);
cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+
+ trace_sched_load_tg(cfs_rq);
}
}
@@ -3220,7 +3219,7 @@
p_last_update_time = prev->avg.last_update_time;
n_last_update_time = next->avg.last_update_time;
#endif
- __update_load_avg_blocked_se(p_last_update_time, cpu_of(rq_of(prev)), se);
+ __update_load_avg_blocked_se(p_last_update_time, se);
se->avg.last_update_time = n_last_update_time;
}
@@ -3355,11 +3354,11 @@
/*
* runnable_sum can't be lower than running_sum
- * As running sum is scale with CPU capacity wehreas the runnable sum
- * is not we rescale running_sum 1st
+ * Rescale running sum to be in the same range as runnable sum
+ * running_sum is in [0 : LOAD_AVG_MAX << SCHED_CAPACITY_SHIFT]
+ * runnable_sum is in [0 : LOAD_AVG_MAX]
*/
- running_sum = se->avg.util_sum /
- arch_scale_cpu_capacity(NULL, cpu_of(rq_of(cfs_rq)));
+ running_sum = se->avg.util_sum >> SCHED_CAPACITY_SHIFT;
runnable_sum = max(runnable_sum, running_sum);
load_sum = (s64)se_weight(se) * runnable_sum;
@@ -3414,6 +3413,9 @@
update_tg_cfs_util(cfs_rq, se, gcfs_rq);
update_tg_cfs_runnable(cfs_rq, se, gcfs_rq);
+ trace_sched_load_cfs_rq(cfs_rq);
+ trace_sched_load_se(se);
+
return 1;
}
@@ -3462,7 +3464,7 @@
/**
* update_cfs_rq_load_avg - update the cfs_rq's load/util averages
- * @now: current time, as per cfs_rq_clock_task()
+ * @now: current time, as per cfs_rq_clock_pelt()
* @cfs_rq: cfs_rq to update
*
* The cfs_rq avg is the direct sum of all its entities (blocked and runnable)
@@ -3507,7 +3509,7 @@
decayed = 1;
}
- decayed |= __update_load_avg_cfs_rq(now, cpu_of(rq_of(cfs_rq)), cfs_rq);
+ decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
#ifndef CONFIG_64BIT
smp_wmb();
@@ -3566,6 +3568,8 @@
add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
cfs_rq_util_change(cfs_rq, flags);
+
+ trace_sched_load_cfs_rq(cfs_rq);
}
/**
@@ -3585,6 +3589,8 @@
add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
cfs_rq_util_change(cfs_rq, 0);
+
+ trace_sched_load_cfs_rq(cfs_rq);
}
/*
@@ -3597,9 +3603,7 @@
/* Update task and its cfs_rq load average */
static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
- u64 now = cfs_rq_clock_task(cfs_rq);
- struct rq *rq = rq_of(cfs_rq);
- int cpu = cpu_of(rq);
+ u64 now = cfs_rq_clock_pelt(cfs_rq);
int decayed;
/*
@@ -3607,7 +3611,7 @@
* track group sched_entity load average for task_h_load calc in migration
*/
if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
- __update_load_avg_se(now, cpu, cfs_rq, se);
+ __update_load_avg_se(now, cfs_rq, se);
decayed = update_cfs_rq_load_avg(now, cfs_rq);
decayed |= propagate_entity_load_avg(se);
@@ -3659,7 +3663,7 @@
u64 last_update_time;
last_update_time = cfs_rq_last_update_time(cfs_rq);
- __update_load_avg_blocked_se(last_update_time, cpu_of(rq_of(cfs_rq)), se);
+ __update_load_avg_blocked_se(last_update_time, se);
}
/*
@@ -3732,6 +3736,10 @@
enqueued = cfs_rq->avg.util_est.enqueued;
enqueued += (_task_util_est(p) | UTIL_AVG_UNCHANGED);
WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
+
+ /* Update plots for Task and CPU estimated utilization */
+ trace_sched_util_est_task(p, &p->se.avg);
+ trace_sched_util_est_cpu(cpu_of(rq_of(cfs_rq)), cfs_rq);
}
/*
@@ -3752,6 +3760,7 @@
{
long last_ewma_diff;
struct util_est ue;
+ int cpu;
if (!sched_feat(UTIL_EST))
return;
@@ -3762,6 +3771,9 @@
(_task_util_est(p) | UTIL_AVG_UNCHANGED));
WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);
+ /* Update plots for CPU's estimated utilization */
+ trace_sched_util_est_cpu(cpu_of(rq_of(cfs_rq)), cfs_rq);
+
/*
* Skip update of task's estimated utilization when the task has not
* yet completed an activation, e.g. being migrated.
@@ -3787,6 +3799,14 @@
return;
/*
+ * To avoid overestimation of actual task utilization, skip updates if
+ * we cannot guarantee there is idle time on this CPU.
+ */
+ cpu = cpu_of(rq_of(cfs_rq));
+ if (task_util(p) > capacity_orig_of(cpu))
+ return;
+
+ /*
* Update Task's estimated utilization
*
* When *p completes an activation we can consolidate another sample
@@ -3807,6 +3827,32 @@
ue.ewma += last_ewma_diff;
ue.ewma >>= UTIL_EST_WEIGHT_SHIFT;
WRITE_ONCE(p->se.avg.util_est, ue);
+
+ /* Update plots for Task's estimated utilization */
+ trace_sched_util_est_task(p, &p->se.avg);
+}
+
+static inline int task_fits_capacity(struct task_struct *p, long capacity)
+{
+ return capacity * 1024 > task_util_est(p) * capacity_margin;
+}
+
+static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
+{
+ if (!static_branch_unlikely(&sched_asym_cpucapacity))
+ return;
+
+ if (!p) {
+ rq->misfit_task_load = 0;
+ return;
+ }
+
+ if (task_fits_capacity(p, capacity_of(cpu_of(rq)))) {
+ rq->misfit_task_load = 0;
+ return;
+ }
+
+ rq->misfit_task_load = task_h_load(p);
}
#else /* CONFIG_SMP */
@@ -3838,6 +3884,7 @@
static inline void
util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p,
bool task_sleep) {}
+static inline void update_misfit_status(struct task_struct *p, struct rq *rq) {}
#endif /* CONFIG_SMP */
@@ -4292,7 +4339,7 @@
#ifdef CONFIG_CFS_BANDWIDTH
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
static struct static_key __cfs_bandwidth_used;
static inline bool cfs_bandwidth_used(void)
@@ -4309,7 +4356,7 @@
{
static_key_slow_dec_cpuslocked(&__cfs_bandwidth_used);
}
-#else /* CONFIG_JUMP_LABEL */
+#else /* HAVE_JUMP_LABEL */
static bool cfs_bandwidth_used(void)
{
return true;
@@ -4317,7 +4364,7 @@
void cfs_bandwidth_usage_inc(void) {}
void cfs_bandwidth_usage_dec(void) {}
-#endif /* CONFIG_JUMP_LABEL */
+#endif /* HAVE_JUMP_LABEL */
/*
* default period for cfs group bandwidth.
@@ -5143,6 +5190,26 @@
}
#endif
+#ifdef CONFIG_SMP
+static inline unsigned long cpu_util(int cpu);
+static unsigned long capacity_of(int cpu);
+
+static inline bool cpu_overutilized(int cpu)
+{
+ return (capacity_of(cpu) * 1024) < (cpu_util(cpu) * capacity_margin);
+}
+
+static inline void update_overutilized_status(struct rq *rq)
+{
+ if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
+ WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
+ trace_sched_overutilized(1);
+ }
+}
+#else
+static inline void update_overutilized_status(struct rq *rq) { }
+#endif
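cpu_overutilized() uses the same capacity_margin (again assumed to be 1280); for a full-size CPU of capacity 1024 the tipping point sits at a utilization of 819, roughly 80%. A throwaway helper (example_overutilized(), not part of the patch) mirrors the comparison:

	/* Sketch only: same comparison as cpu_overutilized(), capacity_margin == 1280 assumed */
	static inline bool example_overutilized(unsigned long cap, unsigned long util)
	{
		return (cap * 1024) < (util * 1280);
	}
	/*
	 * example_overutilized(1024, 819) -> false (1048576 is not < 1048320)
	 * example_overutilized(1024, 820) -> true  (1048576 <  1049600)
	 */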
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -5163,6 +5230,24 @@
util_est_enqueue(&rq->cfs, p);
/*
+ * The code below (indirectly) updates schedutil which looks at
+ * the cfs_rq utilization to select a frequency.
+ * Let's update schedtune here to ensure the boost value of the
+ * current task is accounted for in the selection of the OPP.
+ *
+ * We do it also in the case where we enqueue a throttled task;
+ * we could argue that a throttled task should not boost a CPU,
+ * however:
+ * a) properly implementing CPU boosting considering throttled
+ * tasks will greatly increase the complexity of the solution
+ * b) it's not easy to quantify the benefits introduced by
+ * such a more complex solution.
+ * Thus, for the time being we go for the simple solution and boost
+ * also for throttled RQs.
+ */
+ schedtune_enqueue_task(p, cpu_of(rq));
+
+ /*
* If in_iowait is set, the code below may not trigger any cpufreq
* utilization updates, so do it here explicitly with the IOWAIT flag
* passed.
@@ -5200,8 +5285,26 @@
update_cfs_group(se);
}
- if (!se)
+ if (!se) {
add_nr_running(rq, 1);
+ /*
+ * Since new tasks are assigned an initial util_avg equal to
+ * half of the spare capacity of their CPU, tiny tasks have the
+ * ability to cross the overutilized threshold, which will
+ * result in the load balancer ruining all the task placement
+ * done by EAS. As a way to mitigate that effect, do not account
+ * for the first enqueue operation of new tasks during the
+ * overutilized flag detection.
+ *
+ * A better way of solving this problem would be to wait for
+ * the PELT signals of tasks to converge before taking them
+ * into account, but that is not straightforward to implement,
+ * and the following generally works well enough in practice.
+ */
+ if (flags & ENQUEUE_WAKEUP)
+ update_overutilized_status(rq);
+
+ }
if (cfs_bandwidth_used()) {
/*
@@ -5236,6 +5339,14 @@
struct sched_entity *se = &p->se;
int task_sleep = flags & DEQUEUE_SLEEP;
+ /*
+ * The code below (indirectly) updates schedutil which looks at
+ * the cfs_rq utilization to select a frequency.
+ * Let's update schedtune here to ensure the boost value of the
+ * current task is no longer accounted for in the selection of the OPP.
+ */
+ schedtune_dequeue_task(p, cpu_of(rq));
+
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, flags);
@@ -5599,11 +5710,6 @@
return cpu_rq(cpu)->cpu_capacity;
}
-static unsigned long capacity_orig_of(int cpu)
-{
- return cpu_rq(cpu)->cpu_capacity_orig;
-}
-
static unsigned long cpu_avg_load_per_task(int cpu)
{
struct rq *rq = cpu_rq(cpu);
@@ -5650,15 +5756,18 @@
* whatever is irrelevant, spread criteria is apparent partner count exceeds
* socket size.
*/
-static int wake_wide(struct task_struct *p)
+static int wake_wide(struct task_struct *p, int sibling_count_hint)
{
unsigned int master = current->wakee_flips;
unsigned int slave = p->wakee_flips;
- int factor = this_cpu_read(sd_llc_size);
+ int llc_size = this_cpu_read(sd_llc_size);
+
+ if (sibling_count_hint >= llc_size)
+ return 1;
if (master < slave)
swap(master, slave);
- if (slave < factor || master < slave * factor)
+ if (slave < llc_size || master < slave * llc_size)
return 0;
return 1;
}
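The sibling_count_hint short-circuit above can be read with a concrete, made-up case: on a system whose LLC spans 4 CPUs, wake_wide(p, 1) still defers to the wakee_flips bookkeeping, while wake_wide(p, 4), i.e. a waker releasing at least an LLC's worth of tasks in one go, returns 1 straight away and the wake-up goes wide without waiting for the flip counters to build up.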
@@ -5762,6 +5871,100 @@
return target;
}
+#ifdef CONFIG_SCHED_TUNE
+struct reciprocal_value schedtune_spc_rdiv;
+
+static long
+schedtune_margin(unsigned long signal, long boost)
+{
+ long long margin = 0;
+
+ /*
+ * Signal proportional compensation (SPC)
+ *
+ * The Boost (B) value is used to compute a Margin (M) which is
+ * proportional to the complement of the original Signal (S):
+ * M = B * (SCHED_CAPACITY_SCALE - S)
+ * The obtained M could be used by the caller to "boost" S.
+ */
+ if (boost >= 0) {
+ margin = SCHED_CAPACITY_SCALE - signal;
+ margin *= boost;
+ } else
+ margin = -signal * boost;
+
+ margin = reciprocal_divide(margin, schedtune_spc_rdiv);
+
+ if (boost < 0)
+ margin *= -1;
+ return margin;
+}
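To make the SPC formula concrete (assuming schedtune_spc_rdiv encodes a divide by 100, i.e. boost is expressed as a percentage, as the rest of the schedtune series sets it up): with signal = 256 and boost = 20, margin = 20 * (1024 - 256) / 100 = 153, so the boosted value becomes 256 + 153 = 409. With boost = -20 on the same signal, margin = -(256 * 20) / 100 = -51 and the caller ends up with 256 - 51 = 205.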
+
+static inline int
+schedtune_cpu_margin(unsigned long util, int cpu)
+{
+ int boost = schedtune_cpu_boost(cpu);
+
+ if (boost == 0)
+ return 0;
+
+ return schedtune_margin(util, boost);
+}
+
+static inline long
+schedtune_task_margin(struct task_struct *task)
+{
+ int boost = schedtune_task_boost(task);
+ unsigned long util;
+ long margin;
+
+ if (boost == 0)
+ return 0;
+
+ util = task_util_est(task);
+ margin = schedtune_margin(util, boost);
+
+ return margin;
+}
+
+unsigned long
+boosted_cpu_util(int cpu, unsigned long other_util)
+{
+ unsigned long util = cpu_util_cfs(cpu_rq(cpu)) + other_util;
+ long margin = schedtune_cpu_margin(util, cpu);
+
+ trace_sched_boost_cpu(cpu, util, margin);
+
+ return util + margin;
+}
+
+#else /* CONFIG_SCHED_TUNE */
+
+static inline int
+schedtune_cpu_margin(unsigned long util, int cpu)
+{
+ return 0;
+}
+
+static inline int
+schedtune_task_margin(struct task_struct *task)
+{
+ return 0;
+}
+
+#endif /* CONFIG_SCHED_TUNE */
+
+static inline unsigned long
+boosted_task_util(struct task_struct *task)
+{
+ unsigned long util = task_util_est(task);
+ long margin = schedtune_task_margin(task);
+
+ trace_sched_boost_task(task, util, margin);
+
+ return util + margin;
+}
+
static unsigned long cpu_util_without(int cpu, struct task_struct *p);
static unsigned long capacity_spare_without(int cpu, struct task_struct *p)
@@ -6395,6 +6598,321 @@
}
/*
+ * Returns the current capacity of cpu after applying both
+ * cpu and freq scaling.
+ */
+unsigned long capacity_curr_of(int cpu)
+{
+ unsigned long max_cap = cpu_rq(cpu)->cpu_capacity_orig;
+ unsigned long scale_freq = arch_scale_freq_capacity(cpu);
+
+ return cap_scale(max_cap, scale_freq);
+}
+
+static void find_best_target(struct sched_domain *sd, cpumask_t *cpus,
+ struct task_struct *p)
+{
+ unsigned long min_util = boosted_task_util(p);
+ unsigned long target_capacity = ULONG_MAX;
+ unsigned long min_wake_util = ULONG_MAX;
+ unsigned long target_max_spare_cap = 0;
+ unsigned long target_util = ULONG_MAX;
+ bool prefer_idle = schedtune_prefer_idle(p);
+ bool boosted = schedtune_task_boost(p) > 0;
+ /* Initialise with deepest possible cstate (INT_MAX) */
+ int shallowest_idle_cstate = INT_MAX;
+ struct sched_group *sg;
+ int best_active_cpu = -1;
+ int best_idle_cpu = -1;
+ int target_cpu = -1;
+ int backup_cpu = -1;
+ int i;
+
+ /*
+ * In most cases, target_capacity tracks capacity_orig of the most
+ * energy efficient CPU candidate, so target_capacity needs to be
+ * minimised. For these cases target_capacity is already
+ * initialized to ULONG_MAX.
+ * However, for prefer_idle and boosted tasks we look for a high
+ * performance CPU, so target_capacity needs to be maximised. In this
+ * case we initialise target_capacity to 0.
+ */
+ if (prefer_idle && boosted)
+ target_capacity = 0;
+
+ /* Scan CPUs in all SDs */
+ sg = sd->groups;
+ do {
+ for_each_cpu_and(i, &p->cpus_allowed, sched_group_span(sg)) {
+ unsigned long capacity_curr = capacity_curr_of(i);
+ unsigned long capacity_orig = capacity_orig_of(i);
+ unsigned long wake_util, new_util;
+ long spare_cap;
+ int idle_idx = INT_MAX;
+
+ if (!cpu_online(i))
+ continue;
+
+ /*
+ * p's blocked utilization is still accounted for on prev_cpu
+ * so prev_cpu will receive a negative bias due to the double
+ * accounting. However, the blocked utilization may be zero.
+ */
+ wake_util = cpu_util_without(i, p);
+ new_util = wake_util + task_util_est(p);
+
+ /*
+ * Ensure minimum capacity to grant the required boost.
+ * The target CPU can be already at a capacity level higher
+ * than the one required to boost the task.
+ */
+ new_util = max(min_util, new_util);
+ if (new_util > capacity_orig)
+ continue;
+
+ /*
+ * Pre-compute the maximum possible capacity we expect
+ * to have available on this CPU once the task is
+ * enqueued here.
+ */
+ spare_cap = capacity_orig - new_util;
+
+ if (idle_cpu(i))
+ idle_idx = idle_get_state_idx(cpu_rq(i));
+
+ /*
+ * Case A) Latency sensitive tasks
+ *
+ * Unconditionally favoring tasks that prefer idle CPU to
+ * improve latency.
+ *
+ * Looking for:
+ * - an idle CPU, whatever its idle_state is, since
+ * the first CPUs we explore are more likely to be
+ * reserved for latency sensitive tasks.
+ * - a non idle CPU where the task fits in its current
+ * capacity and has the maximum spare capacity.
+ * - a non idle CPU with lower contention from other
+ * tasks and running at the lowest possible OPP.
+ *
+ * The last two goals try to favor a non idle CPU
+ * where the task can run as if it is "almost alone".
+ * A maximum spare capacity CPU is favoured since
+ * the task already fits into that CPU's capacity
+ * without waiting for an OPP chance.
+ *
+ * The following code path is the only one in the CPUs
+ * exploration loop which is always used by
+ * prefer_idle tasks. It exits the loop with either a
+ * best_active_cpu or a target_cpu which should
+ * represent an optimal choice for latency sensitive
+ * tasks.
+ */
+ if (prefer_idle) {
+
+ /*
+ * Case A.1: IDLE CPU
+ * Return the best IDLE CPU we find:
+ * - for boosted tasks: the CPU with the highest
+ * performance (i.e. biggest capacity_orig)
+ * - for !boosted tasks: the most energy
+ * efficient CPU (i.e. smallest capacity_orig)
+ */
+ if (idle_cpu(i)) {
+ if (boosted &&
+ capacity_orig < target_capacity)
+ continue;
+ if (!boosted &&
+ capacity_orig > target_capacity)
+ continue;
+ /*
+ * Minimise value of idle state: skip
+ * deeper idle states and pick the
+ * shallowest.
+ */
+ if (capacity_orig == target_capacity &&
+ sysctl_sched_cstate_aware &&
+ idle_idx >= shallowest_idle_cstate)
+ continue;
+
+ target_capacity = capacity_orig;
+ shallowest_idle_cstate = idle_idx;
+ best_idle_cpu = i;
+ continue;
+ }
+ if (best_idle_cpu != -1)
+ continue;
+
+ /*
+ * Case A.2: Target ACTIVE CPU
+ * Favor CPUs with max spare capacity.
+ */
+ if (capacity_curr > new_util &&
+ spare_cap > target_max_spare_cap) {
+ target_max_spare_cap = spare_cap;
+ target_cpu = i;
+ continue;
+ }
+ if (target_cpu != -1)
+ continue;
+
+ /*
+ * Case A.3: Backup ACTIVE CPU
+ * Favor CPUs with:
+ * - lower utilization due to other tasks
+ * - lower utilization with the task in
+ */
+ if (wake_util > min_wake_util)
+ continue;
+ min_wake_util = wake_util;
+ best_active_cpu = i;
+ continue;
+ }
+
+ /*
+ * Enforce EAS mode
+ *
+ * For non latency sensitive tasks, skip CPUs that
+ * will be overutilized by moving the task there.
+ *
+ * The goal here is to remain in EAS mode as long as
+ * possible at least for !prefer_idle tasks.
+ */
+ if ((new_util * capacity_margin) >
+ (capacity_orig * SCHED_CAPACITY_SCALE))
+ continue;
+
+ /*
+ * Favor CPUs with smaller capacity for non latency
+ * sensitive tasks.
+ */
+ if (capacity_orig > target_capacity)
+ continue;
+
+ /*
+ * Case B) Non latency sensitive tasks on IDLE CPUs.
+ *
+ * Find an optimal backup IDLE CPU for non latency
+ * sensitive tasks.
+ *
+ * Looking for:
+ * - minimizing the capacity_orig,
+ * i.e. preferring LITTLE CPUs
+ * - favoring shallowest idle states
+ * i.e. avoid to wakeup deep-idle CPUs
+ *
+ * The following code path is used by non latency
+ * sensitive tasks if IDLE CPUs are available. If at
+ * least one such CPU is available, it sets the
+ * best_idle_cpu to the most suitable idle CPU to be
+ * selected.
+ *
+ * If idle CPUs are available, favour these CPUs to
+ * improve performance by spreading tasks.
+ * Indeed, the energy_diff() computed by the caller
+ * will take care to ensure the minimization of energy
+ * consumption without affecting performance.
+ */
+ if (idle_cpu(i)) {
+ /*
+ * Skip CPUs in deeper idle state, but only
+ * if they are also less energy efficient.
+ * IOW, prefer a deep IDLE LITTLE CPU vs a
+ * shallow idle big CPU.
+ */
+ if (capacity_orig == target_capacity &&
+ sysctl_sched_cstate_aware &&
+ idle_idx >= shallowest_idle_cstate)
+ continue;
+
+ target_capacity = capacity_orig;
+ shallowest_idle_cstate = idle_idx;
+ best_idle_cpu = i;
+ continue;
+ }
+
+ /*
+ * Case C) Non latency sensitive tasks on ACTIVE CPUs.
+ *
+ * Pack tasks in the most energy efficient capacities.
+ *
+ * This task packing strategy prefers more energy
+ * efficient CPUs (i.e. pack on smaller maximum
+ * capacity CPUs) while also trying to spread tasks to
+ * run them all at the lower OPP.
+ *
+ * This assumes for example that it's more energy
+ * efficient to run two tasks on two CPUs at a lower
+ * OPP than packing both on a single CPU but running
+ * that CPU at a higher OPP.
+ *
+ * Thus, this case keeps track of the CPU with the
+ * smallest maximum capacity and highest spare maximum
+ * capacity.
+ */
+
+ /* Favor CPUs with maximum spare capacity */
+ if (capacity_orig == target_capacity &&
+ spare_cap < target_max_spare_cap)
+ continue;
+
+ target_max_spare_cap = spare_cap;
+ target_capacity = capacity_orig;
+ target_util = new_util;
+ target_cpu = i;
+ }
+
+ } while (sg = sg->next, sg != sd->groups);
+
+ /*
+ * For non latency sensitive tasks, cases B and C in the previous loop,
+ * we pick the best IDLE CPU only if we were not able to find a target
+ * ACTIVE CPU.
+ *
+ * Policies priorities:
+ *
+ * - prefer_idle tasks:
+ *
+ * a) IDLE CPU available: best_idle_cpu
+ * b) ACTIVE CPU where the task fits and has the biggest maximum spare
+ * capacity (i.e. target_cpu)
+ * c) ACTIVE CPU with less contention due to other tasks
+ * (i.e. best_active_cpu)
+ *
+ * - NON prefer_idle tasks:
+ *
+ * a) ACTIVE CPU: target_cpu
+ * b) IDLE CPU: best_idle_cpu
+ */
+
+ if (prefer_idle && (best_idle_cpu != -1)) {
+ target_cpu = best_idle_cpu;
+ goto target;
+ }
+
+ if (target_cpu == -1)
+ target_cpu = prefer_idle
+ ? best_active_cpu
+ : best_idle_cpu;
+ else
+ backup_cpu = prefer_idle
+ ? best_active_cpu
+ : best_idle_cpu;
+
+ if (backup_cpu >= 0)
+ cpumask_set_cpu(backup_cpu, cpus);
+ if (target_cpu >= 0) {
+target:
+ cpumask_set_cpu(target_cpu, cpus);
+ }
+
+ trace_sched_find_best_target(p, prefer_idle, min_util, best_idle_cpu,
+ best_active_cpu, target_cpu, backup_cpu);
+}
+
+/*
* Disable WAKE_AFFINE in the case where task @p doesn't fit in the
* capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
*
@@ -6405,8 +6923,11 @@
{
long min_cap, max_cap;
+ if (!static_branch_unlikely(&sched_asym_cpucapacity))
+ return 0;
+
min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
- max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
+ max_cap = cpu_rq(cpu)->rd->max_cpu_capacity.val;
/* Minimum capacity is close to max, no need to abort wake_affine */
if (max_cap - min_cap < max_cap >> 3)
@@ -6415,7 +6936,255 @@
/* Bring task utilization in sync with prev_cpu */
sync_entity_load_avg(&p->se);
- return min_cap * 1024 < task_util(p) * capacity_margin;
+ return !task_fits_capacity(p, min_cap);
+}
+
+/*
+ * Predicts what cpu_util(@cpu) would return if @p was migrated (and enqueued)
+ * to @dst_cpu.
+ */
+static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
+{
+ struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
+ unsigned long util_est, util = READ_ONCE(cfs_rq->avg.util_avg);
+
+ /*
+ * If @p migrates from @cpu to another, remove its contribution. Or,
+ * if @p migrates from another CPU to @cpu, add its contribution. In
+ * the other cases, @cpu is not impacted by the migration, so the
+ * util_avg should already be correct.
+ */
+ if (task_cpu(p) == cpu && dst_cpu != cpu)
+ sub_positive(&util, task_util(p));
+ else if (task_cpu(p) != cpu && dst_cpu == cpu)
+ util += task_util(p);
+
+ if (sched_feat(UTIL_EST)) {
+ util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
+
+ /*
+ * During wake-up, the task isn't enqueued yet and doesn't
+ * appear in the cfs_rq->avg.util_est.enqueued of any rq,
+ * so just add it (if needed) to "simulate" what will be
+ * cpu_util() after the task has been enqueued.
+ */
+ if (dst_cpu == cpu)
+ util_est += _task_util_est(p);
+
+ util = max(util, util_est);
+ }
+
+ return min(util, capacity_orig_of(cpu));
+}
+
+/*
+ * compute_energy(): Estimates the energy that would be consumed if @p was
+ * migrated to @dst_cpu. compute_energy() predicts what will be the utilization
+ * landscape of the CPUs after the task migration, and uses the Energy Model
+ * to compute what would be the energy if we decided to actually migrate that
+ * task.
+ */
+static long
+compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
+{
+ long util, max_util, sum_util, energy = 0;
+ int cpu;
+
+ for (; pd; pd = pd->next) {
+ max_util = sum_util = 0;
+ /*
+ * The capacity state of CPUs of the current rd can be driven by
+ * CPUs of another rd if they belong to the same performance
+ * domain. So, account for the utilization of these CPUs too
+ * by masking pd with cpu_online_mask instead of the rd span.
+ *
+ * If an entire performance domain is outside of the current rd,
+ * it will not appear in its pd list and will not be accounted
+ * by compute_energy().
+ */
+ for_each_cpu_and(cpu, perf_domain_span(pd), cpu_online_mask) {
+ util = cpu_util_next(cpu, p, dst_cpu);
+ util += cpu_util_rt(cpu_rq(cpu));
+ util = schedutil_energy_util(cpu, util);
+ max_util = max(util, max_util);
+ sum_util += util;
+ }
+
+ energy += em_pd_energy(pd->em_pd, max_util, sum_util);
+ }
+
+ return energy;
+}
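To give the aggregation a concrete shape (numbers invented, not from any real Energy Model): for a two-CPU performance domain where cpu_util_next() + cpu_util_rt(), after schedutil_energy_util(), comes out as 300 and 500, the domain is costed as em_pd_energy(pd->em_pd, 500, 800); max_util = 500 selects the OPP the whole domain would run at, while sum_util = 800 scales the busy time at that OPP. The per-domain terms are then summed over every performance domain visible to the root domain, and find_energy_efficient_cpu() compares those sums across candidate dst_cpu values.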
+
+static void select_max_spare_cap_cpus(struct sched_domain *sd, cpumask_t *cpus,
+ struct perf_domain *pd, struct task_struct *p)
+{
+ unsigned long spare_cap, max_spare_cap, util, cpu_cap;
+ int cpu, max_spare_cap_cpu;
+
+ for (; pd; pd = pd->next) {
+ max_spare_cap_cpu = -1;
+ max_spare_cap = 0;
+
+ for_each_cpu_and(cpu, perf_domain_span(pd), sched_domain_span(sd)) {
+ if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
+ continue;
+
+ /* Skip CPUs that will be overutilized. */
+ util = cpu_util_next(cpu, p, cpu);
+ cpu_cap = capacity_of(cpu);
+ if (cpu_cap * 1024 < util * capacity_margin)
+ continue;
+
+ /*
+ * Find the CPU with the maximum spare capacity in
+ * the performance domain
+ */
+ spare_cap = cpu_cap - util;
+ if (spare_cap > max_spare_cap) {
+ max_spare_cap = spare_cap;
+ max_spare_cap_cpu = cpu;
+ }
+ }
+
+ if (max_spare_cap_cpu >= 0)
+ cpumask_set_cpu(max_spare_cap_cpu, cpus);
+ }
+}
+
+static DEFINE_PER_CPU(cpumask_t, energy_cpus);
+
+/*
+ * find_energy_efficient_cpu(): Find most energy-efficient target CPU for the
+ * waking task. find_energy_efficient_cpu() looks for the CPU with maximum
+ * spare capacity in each performance domain and uses it as a potential
+ * candidate to execute the task. Then, it uses the Energy Model to figure
+ * out which of the CPU candidates is the most energy-efficient.
+ *
+ * The rationale for this heuristic is as follows. In a performance domain,
+ * all the most energy efficient CPU candidates (according to the Energy
+ * Model) are those for which we'll request a low frequency. When there are
+ * several CPUs for which the frequency request will be the same, we don't
+ * have enough data to break the tie between them, because the Energy Model
+ * only includes active power costs. With this model, if we assume that
+ * frequency requests follow utilization (e.g. using schedutil), the CPU with
+ * the maximum spare capacity in a performance domain is guaranteed to be among
+ * the best candidates of the performance domain.
+ *
+ * In practice, it could be preferable from an energy standpoint to pack
+ * small tasks on a CPU in order to let other CPUs go in deeper idle states,
+ * but that could also hurt our chances to go cluster idle, and we have no
+ * way to tell with the current Energy Model if this is actually a good
+ * idea or not. So, find_energy_efficient_cpu() basically favors
+ * cluster-packing, and spreading inside a cluster. That should at least be
+ * a good thing for latency, and this is consistent with the idea that most
+ * of the energy savings of EAS come from the asymmetry of the system, and
+ * not so much from breaking the tie between identical CPUs. That's also the
+ * reason why EAS is enabled in the topology code only for systems where
+ * SD_ASYM_CPUCAPACITY is set.
+ *
+ * NOTE: Forkees are not accepted in the energy-aware wake-up path because
+ * they don't have any useful utilization data yet and it's not possible to
+ * forecast their impact on energy consumption. Consequently, they will be
+ * placed by find_idlest_cpu() on the least loaded CPU, which might turn out
+ * to be energy-inefficient in some use-cases. The alternative would be to
+ * bias new tasks towards specific types of CPUs first, or to try to infer
+ * their util_avg from the parent task, but those heuristics could hurt
+ * other use-cases too. So, until someone finds a better way to solve this,
+ * let's keep things simple by re-using the existing slow path.
+ */
+
+static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu, int sync)
+{
+ unsigned long prev_energy = ULONG_MAX, best_energy = ULONG_MAX;
+ struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
+ int weight, cpu, best_energy_cpu = prev_cpu;
+ unsigned long cur_energy;
+ struct perf_domain *pd;
+ struct sched_domain *sd;
+ cpumask_t *candidates;
+
+ if (sysctl_sched_sync_hint_enable && sync) {
+ cpu = smp_processor_id();
+ if (cpumask_test_cpu(cpu, &p->cpus_allowed))
+ return cpu;
+ }
+
+ rcu_read_lock();
+ pd = rcu_dereference(rd->pd);
+ if (!pd || READ_ONCE(rd->overutilized))
+ goto fail;
+
+ /*
+ * Energy-aware wake-up happens on the lowest sched_domain starting
+ * from sd_asym_cpucapacity spanning over this_cpu and prev_cpu.
+ */
+ sd = rcu_dereference(*this_cpu_ptr(&sd_asym_cpucapacity));
+ while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
+ sd = sd->parent;
+ if (!sd)
+ goto fail;
+
+ sync_entity_load_avg(&p->se);
+ if (!task_util_est(p))
+ goto unlock;
+
+ /* Pre-select a set of candidate CPUs. */
+ candidates = this_cpu_ptr(&energy_cpus);
+ cpumask_clear(candidates);
+
+ if (sched_feat(FIND_BEST_TARGET))
+ find_best_target(sd, candidates, p);
+ else
+ select_max_spare_cap_cpus(sd, candidates, pd, p);
+
+ /* Bail out if no candidate was found. */
+ weight = cpumask_weight(candidates);
+ if (!weight)
+ goto unlock;
+
+ /* If there is only one sensible candidate, select it now. */
+ cpu = cpumask_first(candidates);
+ if (weight == 1 && ((schedtune_prefer_idle(p) && idle_cpu(cpu)) ||
+ (cpu == prev_cpu))) {
+ best_energy_cpu = cpu;
+ goto unlock;
+ }
+
+ if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
+ prev_energy = best_energy = compute_energy(p, prev_cpu, pd);
+ else
+ prev_energy = best_energy = ULONG_MAX;
+
+ /* Select the best candidate energy-wise. */
+ for_each_cpu(cpu, candidates) {
+ if (cpu == prev_cpu)
+ continue;
+ cur_energy = compute_energy(p, cpu, pd);
+ if (cur_energy < best_energy) {
+ best_energy = cur_energy;
+ best_energy_cpu = cpu;
+ }
+ }
+unlock:
+ rcu_read_unlock();
+
+ /*
+ * Pick the best CPU if prev_cpu cannot be used, or if it saves at
+ * least 6% of the energy used by prev_cpu.
+ */
+ if (prev_energy == ULONG_MAX)
+ return best_energy_cpu;
+
+ if ((prev_energy - best_energy) > (prev_energy >> 4))
+ return best_energy_cpu;
+
+ return prev_cpu;
+
+fail:
+ rcu_read_unlock();
+
+ return -1;
}
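The final hysteresis can be put in numbers (invented for illustration): with prev_energy = 1000 units, prev_energy >> 4 = 62 (roughly 6%), so a candidate whose compute_energy() estimate is 930 (a saving of 70) wins and the task migrates, while a candidate at 950 (a saving of only 50) is rejected and the task stays on prev_cpu. This keeps tasks from bouncing between CPUs for marginal energy gains.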
/*
@@ -6431,7 +7200,8 @@
* preempt must be disabled.
*/
static int
-select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags,
+ int sibling_count_hint)
{
struct sched_domain *tmp, *sd = NULL;
int cpu = smp_processor_id();
@@ -6441,10 +7211,23 @@
if (sd_flag & SD_BALANCE_WAKE) {
record_wakee(p);
- want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
- && cpumask_test_cpu(cpu, &p->cpus_allowed);
+
+ if (static_branch_unlikely(&sched_energy_present)) {
+ if (schedtune_prefer_idle(p) && !sched_feat(EAS_PREFER_IDLE) && !sync)
+ goto sd_loop;
+
+ new_cpu = find_energy_efficient_cpu(p, prev_cpu, sync);
+ if (new_cpu >= 0)
+ return new_cpu;
+ new_cpu = prev_cpu;
+ }
+
+ want_affine = !wake_wide(p, sibling_count_hint) &&
+ !wake_cap(p, cpu, prev_cpu) &&
+ cpumask_test_cpu(cpu, &p->cpus_allowed);
}
+sd_loop:
rcu_read_lock();
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
@@ -6834,9 +7617,12 @@
if (hrtick_enabled(rq))
hrtick_start_fair(rq, p);
+ update_misfit_status(p, rq);
+
return p;
idle:
+ update_misfit_status(NULL, rq);
new_tasks = idle_balance(rq, rf);
/*
@@ -6850,6 +7636,12 @@
if (new_tasks > 0)
goto again;
+ /*
+ * rq is about to be idle, check if we need to update the
+ * lost_idle_time of clock_pelt
+ */
+ update_idle_rq_clock_pelt(rq);
+
return NULL;
}
@@ -7042,6 +7834,13 @@
enum fbq_type { regular, remote, all };
+enum group_type {
+ group_other = 0,
+ group_misfit_task,
+ group_imbalanced,
+ group_overloaded,
+};
+
#define LBF_ALL_PINNED 0x01
#define LBF_NEED_BREAK 0x02
#define LBF_DST_PINNED 0x04
@@ -7062,6 +7861,7 @@
int new_dst_cpu;
enum cpu_idle_type idle;
long imbalance;
+ unsigned int src_grp_nr_running;
/* The set of CPUs under consideration for load-balancing */
struct cpumask *cpus;
@@ -7072,6 +7872,7 @@
unsigned int loop_max;
enum fbq_type fbq_type;
+ enum group_type src_grp_type;
struct list_head tasks;
};
@@ -7497,7 +8298,7 @@
for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
struct sched_entity *se;
- if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq))
+ if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
update_tg_load_avg(cfs_rq, 0);
/* Propagate pending load changes to the parent, if any: */
@@ -7518,8 +8319,8 @@
}
curr_class = rq->curr->sched_class;
- update_rt_rq_load_avg(rq_clock_task(rq), rq, curr_class == &rt_sched_class);
- update_dl_rq_load_avg(rq_clock_task(rq), rq, curr_class == &dl_sched_class);
+ update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
+ update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
update_irq_load_avg(rq, 0);
/* Don't need periodic decay once load/util_avg are null */
if (others_have_blocked(rq))
@@ -7589,11 +8390,11 @@
rq_lock_irqsave(rq, &rf);
update_rq_clock(rq);
- update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
+ update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);
curr_class = rq->curr->sched_class;
- update_rt_rq_load_avg(rq_clock_task(rq), rq, curr_class == &rt_sched_class);
- update_dl_rq_load_avg(rq_clock_task(rq), rq, curr_class == &dl_sched_class);
+ update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
+ update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
update_irq_load_avg(rq, 0);
#ifdef CONFIG_NO_HZ_COMMON
rq->last_blocked_load_update_tick = jiffies;
@@ -7611,12 +8412,6 @@
/********** Helpers for find_busiest_group ************************/
-enum group_type {
- group_other = 0,
- group_imbalanced,
- group_overloaded,
-};
-
/*
* sg_lb_stats - stats of a sched_group required for load_balancing
*/
@@ -7632,6 +8427,7 @@
unsigned int group_weight;
enum group_type group_type;
int group_no_capacity;
+ unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
@@ -7704,10 +8500,9 @@
return load_idx;
}
-static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
+static unsigned long scale_rt_capacity(int cpu, unsigned long max)
{
struct rq *rq = cpu_rq(cpu);
- unsigned long max = arch_scale_cpu_capacity(sd, cpu);
unsigned long used, free;
unsigned long irq;
@@ -7727,12 +8522,46 @@
return scale_irq_capacity(free, irq, max);
}
+void init_max_cpu_capacity(struct max_cpu_capacity *mcc) {
+ raw_spin_lock_init(&mcc->lock);
+ mcc->val = 0;
+ mcc->cpu = -1;
+}
+
static void update_cpu_capacity(struct sched_domain *sd, int cpu)
{
- unsigned long capacity = scale_rt_capacity(sd, cpu);
+ unsigned long capacity = arch_scale_cpu_capacity(sd, cpu);
struct sched_group *sdg = sd->groups;
+ struct max_cpu_capacity *mcc;
+ unsigned long max_capacity;
+ int max_cap_cpu;
+ unsigned long flags;
- cpu_rq(cpu)->cpu_capacity_orig = arch_scale_cpu_capacity(sd, cpu);
+ cpu_rq(cpu)->cpu_capacity_orig = capacity;
+
+ capacity *= arch_scale_max_freq_capacity(sd, cpu);
+ capacity >>= SCHED_CAPACITY_SHIFT;
+
+ mcc = &cpu_rq(cpu)->rd->max_cpu_capacity;
+
+ raw_spin_lock_irqsave(&mcc->lock, flags);
+ max_capacity = mcc->val;
+ max_cap_cpu = mcc->cpu;
+
+ if ((max_capacity > capacity && max_cap_cpu == cpu) ||
+ (max_capacity < capacity)) {
+ mcc->val = capacity;
+ mcc->cpu = cpu;
+#ifdef CONFIG_SCHED_DEBUG
+ raw_spin_unlock_irqrestore(&mcc->lock, flags);
+ pr_info("CPU%d: update max cpu_capacity %lu\n", cpu, capacity);
+ goto skip_unlock;
+#endif
+ }
+ raw_spin_unlock_irqrestore(&mcc->lock, flags);
+
+skip_unlock: __attribute__ ((unused));
+ capacity = scale_rt_capacity(cpu, capacity);
if (!capacity)
capacity = 1;
@@ -7740,13 +8569,14 @@
cpu_rq(cpu)->cpu_capacity = capacity;
sdg->sgc->capacity = capacity;
sdg->sgc->min_capacity = capacity;
+ sdg->sgc->max_capacity = capacity;
}
void update_group_capacity(struct sched_domain *sd, int cpu)
{
struct sched_domain *child = sd->child;
struct sched_group *group, *sdg = sd->groups;
- unsigned long capacity, min_capacity;
+ unsigned long capacity, min_capacity, max_capacity;
unsigned long interval;
interval = msecs_to_jiffies(sd->balance_interval);
@@ -7760,6 +8590,7 @@
capacity = 0;
min_capacity = ULONG_MAX;
+ max_capacity = 0;
if (child->flags & SD_OVERLAP) {
/*
@@ -7790,6 +8621,7 @@
}
min_capacity = min(capacity, min_capacity);
+ max_capacity = max(capacity, max_capacity);
}
} else {
/*
@@ -7803,12 +8635,14 @@
capacity += sgc->capacity;
min_capacity = min(sgc->min_capacity, min_capacity);
+ max_capacity = max(sgc->max_capacity, max_capacity);
group = group->next;
} while (group != child->groups);
}
sdg->sgc->capacity = capacity;
sdg->sgc->min_capacity = min_capacity;
+ sdg->sgc->max_capacity = max_capacity;
}
/*
@@ -7904,16 +8738,27 @@
}
/*
- * group_smaller_cpu_capacity: Returns true if sched_group sg has smaller
+ * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
* per-CPU capacity than sched_group ref.
*/
static inline bool
-group_smaller_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
{
return sg->sgc->min_capacity * capacity_margin <
ref->sgc->min_capacity * 1024;
}
+/*
+ * group_smaller_max_cpu_capacity: Returns true if sched_group sg has smaller
+ * per-CPU capacity_orig than sched_group ref.
+ */
+static inline bool
+group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+{
+ return sg->sgc->max_capacity * capacity_margin <
+ ref->sgc->max_capacity * 1024;
+}
+
static inline enum
group_type group_classify(struct sched_group *group,
struct sg_lb_stats *sgs)
@@ -7924,6 +8769,9 @@
if (sg_imbalanced(group))
return group_imbalanced;
+ if (sgs->group_misfit_task_load)
+ return group_misfit_task;
+
return group_other;
}
@@ -7953,16 +8801,16 @@
* update_sg_lb_stats - Update sched_group's statistics for load balancing.
* @env: The load balancing environment.
* @group: sched_group whose statistics are to be updated.
- * @load_idx: Load index of sched_domain of this_cpu for load calc.
- * @local_group: Does group contain this_cpu.
* @sgs: variable to hold the statistics for this group.
- * @overload: Indicate more than one runnable task for any CPU.
+ * @sg_status: Holds flag indicating the status of the sched_group
*/
static inline void update_sg_lb_stats(struct lb_env *env,
- struct sched_group *group, int load_idx,
- int local_group, struct sg_lb_stats *sgs,
- bool *overload)
+ struct sched_group *group,
+ struct sg_lb_stats *sgs,
+ int *sg_status)
{
+ int local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(group));
+ int load_idx = get_sd_load_idx(env->sd, env->idle);
unsigned long load;
int i, nr_running;
@@ -7986,7 +8834,10 @@
nr_running = rq->nr_running;
if (nr_running > 1)
- *overload = true;
+ *sg_status |= SG_OVERLOAD;
+
+ if (cpu_overutilized(i))
+ *sg_status |= SG_OVERUTILIZED;
#ifdef CONFIG_NUMA_BALANCING
sgs->nr_numa_running += rq->nr_numa_running;
@@ -7998,6 +8849,12 @@
*/
if (!nr_running && idle_cpu(i))
sgs->idle_cpus++;
+
+ if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+ sgs->group_misfit_task_load < rq->misfit_task_load) {
+ sgs->group_misfit_task_load = rq->misfit_task_load;
+ *sg_status |= SG_OVERLOAD;
+ }
}
/* Adjust by relative CPU capacity of the group */
@@ -8033,6 +8890,17 @@
{
struct sg_lb_stats *busiest = &sds->busiest_stat;
+ /*
+ * Don't try to pull misfit tasks we can't help.
+ * We can use max_capacity here as reduction in capacity on some
+ * CPUs in the group should either be possible to resolve
+ * internally or be covered by avg_load imbalance (eventually).
+ */
+ if (sgs->group_type == group_misfit_task &&
+ (!group_smaller_max_cpu_capacity(sg, sds->local) ||
+ !group_has_capacity(env, &sds->local_stat)))
+ return false;
+
if (sgs->group_type > busiest->group_type)
return true;
@@ -8052,7 +8920,14 @@
* power/energy consequences are not considered.
*/
if (sgs->sum_nr_running <= sgs->group_weight &&
- group_smaller_cpu_capacity(sds->local, sg))
+ group_smaller_min_cpu_capacity(sds->local, sg))
+ return false;
+
+ /*
+ * If we have more than one misfit sg go with the biggest misfit.
+ */
+ if (sgs->group_type == group_misfit_task &&
+ sgs->group_misfit_task_load < busiest->group_misfit_task_load)
return false;
asym_packing:
@@ -8123,19 +8998,14 @@
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats *local = &sds->local_stat;
struct sg_lb_stats tmp_sgs;
- int load_idx, prefer_sibling = 0;
- bool overload = false;
-
- if (child && child->flags & SD_PREFER_SIBLING)
- prefer_sibling = 1;
+ bool prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
+ int sg_status = 0;
#ifdef CONFIG_NO_HZ_COMMON
if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
env->flags |= LBF_NOHZ_STATS;
#endif
- load_idx = get_sd_load_idx(env->sd, env->idle);
-
do {
struct sg_lb_stats *sgs = &tmp_sgs;
int local_group;
@@ -8150,8 +9020,7 @@
update_group_capacity(env->sd, env->dst_cpu);
}
- update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
- &overload);
+ update_sg_lb_stats(env, sg, sgs, &sg_status);
if (local_group)
goto next_group;
@@ -8199,11 +9068,22 @@
if (env->sd->flags & SD_NUMA)
env->fbq_type = fbq_classify_group(&sds->busiest_stat);
+ env->src_grp_nr_running = sds->busiest_stat.sum_nr_running;
+
if (!env->sd->parent) {
+ struct root_domain *rd = env->dst_rq->rd;
+
/* update overload indicator if we are at root domain */
- if (env->dst_rq->rd->overload != overload)
- env->dst_rq->rd->overload = overload;
+ WRITE_ONCE(rd->overload, sg_status & SG_OVERLOAD);
+
+ /* Update over-utilization (tipping point, U >= 0) indicator */
+ WRITE_ONCE(rd->overutilized, sg_status & SG_OVERUTILIZED);
+ trace_sched_overutilized(!!(sg_status & SG_OVERUTILIZED));
+ } else if (sg_status & SG_OVERUTILIZED) {
+ WRITE_ONCE(env->dst_rq->rd->overutilized, SG_OVERUTILIZED);
+ trace_sched_overutilized(1);
}
+
}
/**
@@ -8319,7 +9199,22 @@
capa_move /= SCHED_CAPACITY_SCALE;
/* Move if we gain throughput */
- if (capa_move > capa_now)
+ if (capa_move > capa_now) {
+ env->imbalance = busiest->load_per_task;
+ return;
+ }
+
+ /* We can't see throughput improvement with the load-based
+ * method, but it is possible depending upon group size and
+ * capacity range that there might still be an underutilized
+ * cpu available in an asymmetric capacity system. Do one last
+ * check just in case.
+ */
+ if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+ busiest->group_type == group_overloaded &&
+ busiest->sum_nr_running > busiest->group_weight &&
+ local->sum_nr_running < local->group_weight &&
+ local->group_capacity < busiest->group_capacity)
env->imbalance = busiest->load_per_task;
}
@@ -8352,8 +9247,9 @@
* factors in sg capacity and sgs with smaller group_type are
* skipped when updating the busiest sg:
*/
- if (busiest->avg_load <= sds->avg_load ||
- local->avg_load >= sds->avg_load) {
+ if (busiest->group_type != group_misfit_task &&
+ (busiest->avg_load <= sds->avg_load ||
+ local->avg_load >= sds->avg_load)) {
env->imbalance = 0;
return fix_small_imbalance(env, sds);
}
@@ -8387,6 +9283,22 @@
(sds->avg_load - local->avg_load) * local->group_capacity
) / SCHED_CAPACITY_SCALE;
+ /* Boost imbalance to allow misfit task to be balanced.
+ * Always do this if we are doing a NEWLY_IDLE balance
+ * on the assumption that any tasks we have must not be
+ * long-running (and hence we cannot rely upon load).
+ * However if we are not idle, we should assume the tasks
+ * we have are longer running and not override load-based
+ * calculations above unless we are sure that the local
+ * group is underutilized.
+ */
+ if (busiest->group_type == group_misfit_task &&
+ (env->idle == CPU_NEWLY_IDLE ||
+ local->sum_nr_running < local->group_weight)) {
+ env->imbalance = max_t(long, env->imbalance,
+ busiest->group_misfit_task_load);
+ }
+
/*
* if *imbalance is less than the average load per runnable task
* there is no guarantee that any tasks will be moved so we'll have
@@ -8422,6 +9334,14 @@
* this level.
*/
update_sd_lb_stats(env, &sds);
+
+ if (static_branch_unlikely(&sched_energy_present)) {
+ struct root_domain *rd = env->dst_rq->rd;
+
+ if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
+ goto out_balanced;
+ }
+
local = &sds.local_stat;
busiest = &sds.busiest_stat;
@@ -8453,6 +9373,10 @@
busiest->group_no_capacity)
goto force_balance;
+ /* Misfit tasks should be dealt with regardless of the avg load */
+ if (busiest->group_type == group_misfit_task)
+ goto force_balance;
+
/*
* If the local group is busier than the selected busiest group
* don't try and pull any tasks.
@@ -8490,6 +9414,7 @@
force_balance:
/* Looks like there is an imbalance. Compute it */
+ env->src_grp_type = busiest->group_type;
calculate_imbalance(env, &sds);
return env->imbalance ? sds.busiest : NULL;
@@ -8537,8 +9462,32 @@
if (rt > env->fbq_type)
continue;
+ /*
+ * For ASYM_CPUCAPACITY domains with misfit tasks we simply
+ * seek the "biggest" misfit task.
+ */
+ if (env->src_grp_type == group_misfit_task) {
+ if (rq->misfit_task_load > busiest_load) {
+ busiest_load = rq->misfit_task_load;
+ busiest = rq;
+ }
+
+ continue;
+ }
+
capacity = capacity_of(i);
+ /*
+ * For ASYM_CPUCAPACITY domains, don't pick a CPU that could
+ * eventually lead to active_balancing high->low capacity.
+ * Higher per-CPU capacity is considered better than balancing
+ * average load.
+ */
+ if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+ capacity_of(env->dst_cpu) < capacity &&
+ rq->nr_running == 1)
+ continue;
+
wl = weighted_cpuload(rq);
/*
@@ -8606,6 +9555,20 @@
return 1;
}
+ if (env->src_grp_type == group_misfit_task)
+ return 1;
+
+ if ((capacity_of(env->src_cpu) < capacity_of(env->dst_cpu)) &&
+ env->src_rq->cfs.h_nr_running == 1 &&
+ cpu_overutilized(env->src_cpu) &&
+ !cpu_overutilized(env->dst_cpu)) {
+ return 1;
+ }
+
+ if (env->src_grp_type == group_overloaded && env->src_rq->misfit_task_load)
+ return 1;
+
return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
}
@@ -8824,7 +9787,8 @@
* excessive cache_hot migrations and active balances.
*/
if (idle != CPU_NEWLY_IDLE)
- sd->nr_balance_failed++;
+ if (env.src_grp_nr_running > 1)
+ sd->nr_balance_failed++;
if (need_active_balance(&env)) {
unsigned long flags;
@@ -9262,7 +10226,7 @@
if (time_before(now, nohz.next_balance))
goto out;
- if (rq->nr_running >= 2) {
+ if (rq->nr_running >= 2 || rq->misfit_task_load) {
flags = NOHZ_KICK_MASK;
goto out;
}
@@ -9291,7 +10255,7 @@
}
}
- sd = rcu_dereference(per_cpu(sd_asym, cpu));
+ sd = rcu_dereference(per_cpu(sd_asym_packing, cpu));
if (sd) {
for_each_cpu(i, sched_domain_span(sd)) {
if (i == cpu ||
@@ -9631,7 +10595,7 @@
rq_unpin_lock(this_rq, rf);
if (this_rq->avg_idle < sysctl_sched_migration_cost ||
- !this_rq->rd->overload) {
+ !READ_ONCE(this_rq->rd->overload)) {
rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);
@@ -9793,6 +10757,9 @@
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
+
+ update_misfit_status(curr, rq);
+ update_overutilized_status(task_rq(curr));
}
/*
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 85ae848..50bdfd7 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -90,3 +90,33 @@
* UtilEstimation. Use estimated CPU utilization.
*/
SCHED_FEAT(UTIL_EST, true)
+
+/*
+ * Fast pre-selection of CPU candidates for EAS.
+ */
+SCHED_FEAT(FIND_BEST_TARGET, true)
+
+/*
+ * Energy aware scheduling algorithm choices:
+ * EAS_PREFER_IDLE
+ * Direct tasks in a schedtune.prefer_idle=1 group through
+ * the EAS path for wakeup task placement. Otherwise, put
+ * those tasks through the mainline slow path.
+ */
+SCHED_FEAT(EAS_PREFER_IDLE, true)
+
+/*
+ * Request max frequency from schedutil whenever a RT task is running.
+ */
+SCHED_FEAT(SUGOV_RT_MAX_FREQ, false)
+
+/*
+ * Apply schedtune boost hold to tasks of all sched classes.
+ * If enabled, schedtune will hold the boost applied to a CPU
+ * for 50ms regardless of task activation - if the task is
+ * still running 50ms later, the boost hold expires and the
+ * schedtune boost is removed as soon as the task stops running.
+ * If disabled, this behaviour will only apply to tasks of the
+ * RT class.
+ */
+SCHED_FEAT(SCHEDTUNE_BOOST_HOLD_ALL, false)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 44a1736..8a88061 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -16,9 +16,10 @@
* sched_idle_set_state - Record idle state for the current CPU.
* @idle_state: State to record.
*/
-void sched_idle_set_state(struct cpuidle_state *idle_state)
+void sched_idle_set_state(struct cpuidle_state *idle_state, int index)
{
idle_set_state(this_rq(), idle_state);
+ idle_set_state_idx(this_rq(), index);
}
static int __read_mostly cpu_idle_force_poll;
@@ -375,7 +376,8 @@
#ifdef CONFIG_SMP
static int
-select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags)
+select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags,
+ int sibling_count_hint)
{
return task_cpu(p); /* IDLE tasks as never migrated */
}
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 48a1264..19ba404 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -26,9 +26,10 @@
#include <linux/sched.h>
#include "sched.h"
-#include "sched-pelt.h"
#include "pelt.h"
+#include <trace/events/sched.h>
+
/*
* Approximate:
* val * y^n, where y^32 ~= 0.5 (~1 scheduling period)
@@ -106,16 +107,12 @@
* n=1
*/
static __always_inline u32
-accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
+accumulate_sum(u64 delta, struct sched_avg *sa,
unsigned long load, unsigned long runnable, int running)
{
- unsigned long scale_freq, scale_cpu;
u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
u64 periods;
- scale_freq = arch_scale_freq_capacity(cpu);
- scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
-
delta += sa->period_contrib;
periods = delta / 1024; /* A period is 1024us (~1ms) */
@@ -137,13 +134,12 @@
}
sa->period_contrib = delta;
- contrib = cap_scale(contrib, scale_freq);
if (load)
sa->load_sum += load * contrib;
if (runnable)
sa->runnable_load_sum += runnable * contrib;
if (running)
- sa->util_sum += contrib * scale_cpu;
+ sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
return periods;
}
@@ -177,7 +173,7 @@
* = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
*/
static __always_inline int
-___update_load_sum(u64 now, int cpu, struct sched_avg *sa,
+___update_load_sum(u64 now, struct sched_avg *sa,
unsigned long load, unsigned long runnable, int running)
{
u64 delta;
@@ -221,7 +217,7 @@
* Step 1: accumulate *_sum since last_update_time. If we haven't
* crossed period boundaries, finish.
*/
- if (!accumulate_sum(delta, cpu, sa, load, runnable, running))
+ if (!accumulate_sum(delta, sa, load, runnable, running))
return 0;
return 1;
@@ -267,43 +263,46 @@
* runnable_load_avg = \Sum se->avg.runable_load_avg
*/
-int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)
+int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
{
- if (entity_is_task(se))
- se->runnable_weight = se->load.weight;
-
- if (___update_load_sum(now, cpu, &se->avg, 0, 0, 0)) {
+ if (___update_load_sum(now, &se->avg, 0, 0, 0)) {
___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+
+ trace_sched_load_se(se);
+
return 1;
}
return 0;
}
-int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se)
+int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- if (entity_is_task(se))
- se->runnable_weight = se->load.weight;
-
- if (___update_load_sum(now, cpu, &se->avg, !!se->on_rq, !!se->on_rq,
+ if (___update_load_sum(now, &se->avg, !!se->on_rq, !!se->on_rq,
cfs_rq->curr == se)) {
___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
cfs_se_util_change(&se->avg);
+
+ trace_sched_load_se(se);
+
return 1;
}
return 0;
}
-int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
+int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
{
- if (___update_load_sum(now, cpu, &cfs_rq->avg,
+ if (___update_load_sum(now, &cfs_rq->avg,
scale_load_down(cfs_rq->load.weight),
scale_load_down(cfs_rq->runnable_weight),
cfs_rq->curr != NULL)) {
___update_load_avg(&cfs_rq->avg, 1, 1);
+
+ trace_sched_load_cfs_rq(cfs_rq);
+
return 1;
}
@@ -323,12 +322,15 @@
int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
{
- if (___update_load_sum(now, rq->cpu, &rq->avg_rt,
+ if (___update_load_sum(now, &rq->avg_rt,
running,
running,
running)) {
___update_load_avg(&rq->avg_rt, 1, 1);
+
+ trace_sched_load_rt_rq(rq);
+
return 1;
}
@@ -346,7 +348,7 @@
int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
{
- if (___update_load_sum(now, rq->cpu, &rq->avg_dl,
+ if (___update_load_sum(now, &rq->avg_dl,
running,
running,
running)) {
@@ -371,22 +373,31 @@
int update_irq_load_avg(struct rq *rq, u64 running)
{
int ret = 0;
+
+ /*
+ * We can't use clock_pelt because irq time is not accounted in
+ * clock_task. Instead we directly scale the running time to
+ * reflect the real amount of computation
+ */
+ running = cap_scale(running, arch_scale_freq_capacity(cpu_of(rq)));
+ running = cap_scale(running, arch_scale_cpu_capacity(NULL, cpu_of(rq)));
+
/*
* We know the time that has been used by interrupt since last update
* but we don't when. Let be pessimistic and assume that interrupt has
* happened just before the update. This is not so far from reality
* because interrupt will most probably wake up task and trig an update
- * of rq clock during which the metric si updated.
+ * of rq clock during which the metric is updated.
* We start to decay with normal context time and then we add the
* interrupt context time.
* We can safely remove running from rq->clock because
* rq->clock += delta with delta >= running
*/
- ret = ___update_load_sum(rq->clock - running, rq->cpu, &rq->avg_irq,
+ ret = ___update_load_sum(rq->clock - running, &rq->avg_irq,
0,
0,
0);
- ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,
+ ret += ___update_load_sum(rq->clock, &rq->avg_irq,
1,
1,
1);
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 7e56b48..7489d5f 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -1,8 +1,9 @@
#ifdef CONFIG_SMP
+#include "sched-pelt.h"
-int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se);
-int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se);
-int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
+int __update_load_avg_blocked_se(u64 now, struct sched_entity *se);
+int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se);
+int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
@@ -42,6 +43,101 @@
WRITE_ONCE(avg->util_est.enqueued, enqueued);
}
+/*
+ * The clock_pelt scales the time to reflect the effective amount of
+ * computation done during the running delta time, but then syncs back to
+ * clock_task when the rq is idle.
+ *
+ *
+ * absolute time | 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16
+ * @ max capacity ------******---------------******---------------
+ * @ half capacity ------************---------************---------
+ * clock pelt | 1| 2| 3| 4| 7| 8| 9| 10| 11|14|15|16
+ *
+ */
+static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
+{
+ if (unlikely(is_idle_task(rq->curr))) {
+ /* The rq is idle, we can sync to clock_task */
+ rq->clock_pelt = rq_clock_task(rq);
+ return;
+ }
+
+ /*
+ * When a rq runs at a lower compute capacity, it will need
+ * more time to do the same amount of work than at max
+ * capacity. In order to be invariant, we scale the delta to
+ * reflect how much work has been really done.
+ * Running longer results in stealing idle time that will
+ * disturb the load signal compared to max capacity. This
+ * stolen idle time will be automatically reflected when the
+ * rq will be idle and the clock will be synced with
+ * rq_clock_task.
+ */
+
+ /*
+ * Scale the elapsed time to reflect the real amount of
+ * computation
+ */
+ delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu_of(rq)));
+ delta = cap_scale(delta, arch_scale_freq_capacity(cpu_of(rq)));
+
+ rq->clock_pelt += delta;
+}
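A worked example of the double cap_scale() above (with SCHED_CAPACITY_SCALE == 1024): on a CPU whose arch_scale_cpu_capacity() is 512 and which is currently running at half of its maximum frequency (arch_scale_freq_capacity() == 512), a 4ms wall-clock delta contributes 4ms * 512/1024 * 512/1024 = 1ms to clock_pelt. The 3ms that were not accumulated correspond to the idle time a full-speed CPU would have had; they are reconciled the next time the rq goes idle and clock_pelt is synced back to rq_clock_task(), or tracked as lost idle time by update_idle_rq_clock_pelt() when the rq never becomes idle.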
+
+/*
+ * When rq becomes idle, we have to check if it has lost idle time
+ * because it was fully busy. A rq is fully used when the /Sum util_sum
+ * is greater or equal to:
+ * (LOAD_AVG_MAX - 1024 + rq->cfs.avg.period_contrib) << SCHED_CAPACITY_SHIFT;
+ * For optimization and rounding purposes, we don't take into account
+ * the position in the current window (period_contrib) and we use the higher
+ * bound of util_sum to decide.
+ */
+static inline void update_idle_rq_clock_pelt(struct rq *rq)
+{
+ u32 divider = ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX;
+ u32 util_sum = rq->cfs.avg.util_sum;
+ util_sum += rq->avg_rt.util_sum;
+ util_sum += rq->avg_dl.util_sum;
+
+ /*
+ * Reflecting stolen time makes sense only if the idle
+ * phase would be present at max capacity. As soon as the
+ * utilization of a rq has reached the maximum value, it is
+ * considered as an always running rq without idle time to
+ * steal. This potential idle time is considered as lost in
+ * this case. We keep track of this lost idle time compared to
+ * rq's clock_task.
+ */
+ if (util_sum >= divider)
+ rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
+}
+
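A back-of-the-envelope check of the "fully busy" condition, as a standalone sketch rather than kernel code; it assumes the LOAD_AVG_MAX value of 47742 that the generated sched-pelt.h normally carries, and the per-class util_sum figures are hypothetical:

    #include <stdio.h>
    #include <stdint.h>

    #define SCHED_CAPACITY_SHIFT 10
    #define LOAD_AVG_MAX 47742 /* assumed: value normally emitted into sched-pelt.h */

    int main(void)
    {
        uint32_t divider = ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX;
        /* hypothetical per-class util_sum contributions of a nearly saturated rq */
        uint32_t util_sum = 30000000u /* cfs */ + 12000000u /* rt */ + 6000000u /* dl */;

        printf("divider  = %u\n", divider);   /* about 47.8 million */
        printf("util_sum = %u -> %s\n", util_sum,
               util_sum >= divider ? "rq counted as fully busy, idle time is 'lost'"
                                   : "rq had genuine idle time");
        return 0;
    }

With these numbers util_sum (48,000,000) crosses the divider, so the difference between rq_clock_task() and clock_pelt is accumulated into lost_idle_time instead of being given back.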
+static inline u64 rq_clock_pelt(struct rq *rq)
+{
+ lockdep_assert_held(&rq->lock);
+ assert_clock_updated(rq);
+
+ return rq->clock_pelt - rq->lost_idle_time;
+}
+
+#ifdef CONFIG_CFS_BANDWIDTH
+/* rq->task_clock normalized against any time this cfs_rq has spent throttled */
+static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
+{
+ if (unlikely(cfs_rq->throttle_count))
+ return cfs_rq->throttled_clock_task - cfs_rq->throttled_clock_task_time;
+
+ return rq_clock_pelt(rq_of(cfs_rq)) - cfs_rq->throttled_clock_task_time;
+}
+#else
+static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
+{
+ return rq_clock_pelt(rq_of(cfs_rq));
+}
+#endif
+
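As a rough illustration of the CFS bandwidth case above, the sketch below (plain C with hypothetical numbers, not the kernel path) shows how throttled time is kept out of the cfs_rq PELT clock, and how the clock is effectively frozen while the cfs_rq is throttled:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* hypothetical values, all in ns */
        uint64_t rq_pelt_clock             = 1000000000ULL; /* rq_clock_pelt() */
        uint64_t throttled_clock_task_time = 200000000ULL;  /* total time spent throttled */
        uint64_t throttled_clock_task      = 950000000ULL;  /* clock when throttling began */
        int throttle_count = 0;                             /* 1 while currently throttled */

        uint64_t clock = throttle_count
            ? throttled_clock_task - throttled_clock_task_time /* frozen during throttling */
            : rq_pelt_clock - throttled_clock_task_time;       /* throttled time never ages PELT */

        printf("cfs_rq pelt clock = %llu ns\n", (unsigned long long)clock);
        return 0;
    }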
#else
static inline int
@@ -67,6 +163,18 @@
{
return 0;
}
+
+static inline u64 rq_clock_pelt(struct rq *rq)
+{
+ return rq_clock_task(rq);
+}
+
+static inline void
+update_rq_clock_pelt(struct rq *rq, s64 delta) { }
+
+static inline void
+update_idle_rq_clock_pelt(struct rq *rq) { }
+
#endif
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index b980cc9..150cde3 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1329,6 +1329,8 @@
{
struct sched_rt_entity *rt_se = &p->rt;
+ schedtune_enqueue_task(p, cpu_of(rq));
+
if (flags & ENQUEUE_WAKEUP)
rt_se->timeout = 0;
@@ -1342,6 +1344,8 @@
{
struct sched_rt_entity *rt_se = &p->rt;
+ schedtune_dequeue_task(p, cpu_of(rq));
+
update_curr_rt(rq);
dequeue_rt_entity(rt_se, flags);
@@ -1386,7 +1390,8 @@
static int find_lowest_rq(struct task_struct *task);
static int
-select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
+select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags,
+ int sibling_count_hint)
{
struct task_struct *curr;
struct rq *rq;
@@ -1584,7 +1589,7 @@
* rt task
*/
if (rq->curr->sched_class != &rt_sched_class)
- update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
+ update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
return p;
}
@@ -1593,7 +1598,7 @@
{
update_curr_rt(rq);
- update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
+ update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1);
/*
* The previous task needs to be made eligible for pushing
@@ -2324,7 +2329,7 @@
struct sched_rt_entity *rt_se = &p->rt;
update_curr_rt(rq);
- update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
+ update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1);
watchdog(rq, p);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5f0eb45..b49b595 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -45,6 +45,7 @@
#include <linux/ctype.h>
#include <linux/debugfs.h>
#include <linux/delayacct.h>
+#include <linux/energy_model.h>
#include <linux/init_task.h>
#include <linux/kprobes.h>
#include <linux/kthread.h>
@@ -80,6 +81,8 @@
# define SCHED_WARN_ON(x) ({ (void)(x), 0; })
#endif
+#include "tune.h"
+
struct rq;
struct cpuidle_state;
@@ -705,6 +708,22 @@
return arch_asym_cpu_priority(a) > arch_asym_cpu_priority(b);
}
+struct perf_domain {
+ struct em_perf_domain *em_pd;
+ struct perf_domain *next;
+ struct rcu_head rcu;
+};
+
+struct max_cpu_capacity {
+ raw_spinlock_t lock;
+ unsigned long val;
+ int cpu;
+};
+
+/* Scheduling group status flags */
+#define SG_OVERLOAD 0x1 /* More than one runnable task on a CPU. */
+#define SG_OVERUTILIZED 0x2 /* One or more CPUs are over-utilized. */
+
/*
* We add the notion of a root-domain which will be used to define per-domain
* variables. Each exclusive cpuset essentially defines an island domain by
@@ -720,8 +739,15 @@
cpumask_var_t span;
cpumask_var_t online;
- /* Indicate more than one runnable task for any CPU */
- bool overload;
+ /*
+ * Indicate pullable load on at least one CPU, e.g:
+ * - More than one runnable task
+ * - Running task is misfit
+ */
+ int overload;
+
+ /* Indicate one or more cpus over-utilized (tipping point) */
+ int overutilized;
/*
* The bit corresponding to a CPU gets set here if such CPU has more
@@ -752,13 +778,21 @@
cpumask_var_t rto_mask;
struct cpupri cpupri;
- unsigned long max_cpu_capacity;
+ /* Maximum cpu capacity in the system. */
+ struct max_cpu_capacity max_cpu_capacity;
+
+ /*
+ * NULL-terminated list of performance domains intersecting with the
+ * CPUs of the rd. Protected by RCU.
+ */
+ struct perf_domain *pd;
};
extern struct root_domain def_root_domain;
extern struct mutex sched_domains_mutex;
extern void init_defrootdomain(void);
+extern void init_max_cpu_capacity(struct max_cpu_capacity *mcc);
extern int sched_init_domains(const struct cpumask *cpu_map);
extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
extern void sched_get_rd(struct root_domain *rd);
@@ -833,7 +867,10 @@
unsigned int clock_update_flags;
u64 clock;
- u64 clock_task;
+ /* Ensure that all clocks are in the same cache line */
+ u64 clock_task ____cacheline_aligned;
+ u64 clock_pelt;
+ unsigned long lost_idle_time;
atomic_t nr_iowait;
@@ -848,6 +885,8 @@
unsigned char idle_balance;
+ unsigned long misfit_task_load;
+
/* For active balancing */
int active_balance;
int push_cpu;
@@ -918,9 +957,26 @@
#ifdef CONFIG_CPU_IDLE
/* Must be inspected within a rcu lock section */
struct cpuidle_state *idle_state;
+ int idle_state_idx;
#endif
};
+#ifdef CONFIG_FAIR_GROUP_SCHED
+
+/* CPU runqueue to which this cfs_rq is attached */
+static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
+{
+ return cfs_rq->rq;
+}
+
+#else
+
+static inline struct rq *rq_of(struct cfs_rq *cfs_rq)
+{
+ return container_of(cfs_rq, struct rq, cfs);
+}
+#endif
+
static inline int cpu_of(struct rq *rq)
{
#ifdef CONFIG_SMP
@@ -1186,7 +1242,9 @@
DECLARE_PER_CPU(int, sd_llc_id);
DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
DECLARE_PER_CPU(struct sched_domain *, sd_numa);
-DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+DECLARE_PER_CPU(struct sched_domain *, sd_asym_packing);
+DECLARE_PER_CPU(struct sched_domain *, sd_asym_cpucapacity);
+extern struct static_key_false sched_asym_cpucapacity;
struct sched_group_capacity {
atomic_t ref;
@@ -1196,6 +1254,7 @@
*/
unsigned long capacity;
unsigned long min_capacity; /* Min per-CPU capacity in group */
+ unsigned long max_capacity; /* Max per-CPU capacity in group */
unsigned long next_update;
int imbalance; /* XXX unrelated to capacity but shared group state */
@@ -1361,7 +1420,7 @@
#undef SCHED_FEAT
-#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL)
+#if defined(CONFIG_SCHED_DEBUG) && defined(HAVE_JUMP_LABEL)
/*
* To support run-time toggling of sched features, all the translation units
@@ -1381,7 +1440,7 @@
extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
#define sched_feat(x) (static_branch_##x(&sched_feat_keys[__SCHED_FEAT_##x]))
-#else /* !(SCHED_DEBUG && CONFIG_JUMP_LABEL) */
+#else /* !(SCHED_DEBUG && HAVE_JUMP_LABEL) */
/*
* Each translation unit has its own copy of sysctl_sched_features to allow
@@ -1397,7 +1456,7 @@
#define sched_feat(x) !!(sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
-#endif /* SCHED_DEBUG && CONFIG_JUMP_LABEL */
+#endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */
extern struct static_key_false sched_numa_balancing;
extern struct static_key_false sched_schedstats;
@@ -1524,7 +1583,8 @@
void (*put_prev_task)(struct rq *rq, struct task_struct *p);
#ifdef CONFIG_SMP
- int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
+ int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags,
+ int sibling_count_hint);
void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
void (*task_woken)(struct rq *this_rq, struct task_struct *task);
@@ -1612,6 +1672,17 @@
return rq->idle_state;
}
+
+static inline void idle_set_state_idx(struct rq *rq, int idle_state_idx)
+{
+ rq->idle_state_idx = idle_state_idx;
+}
+
+static inline int idle_get_state_idx(struct rq *rq)
+{
+ WARN_ON(!rcu_read_lock_held());
+ return rq->idle_state_idx;
+}
#else
static inline void idle_set_state(struct rq *rq,
struct cpuidle_state *idle_state)
@@ -1622,6 +1693,15 @@
{
return NULL;
}
+
+static inline void idle_set_state_idx(struct rq *rq, int idle_state_idx)
+{
+}
+
+static inline int idle_get_state_idx(struct rq *rq)
+{
+ return -1;
+}
#endif
extern void schedule_idle(void);
@@ -1695,8 +1775,8 @@
if (prev_nr < 2 && rq->nr_running >= 2) {
#ifdef CONFIG_SMP
- if (!rq->rd->overload)
- rq->rd->overload = true;
+ if (!READ_ONCE(rq->rd->overload))
+ WRITE_ONCE(rq->rd->overload, 1);
#endif
}
@@ -1755,26 +1835,14 @@
}
#endif
-#ifdef CONFIG_SMP
-#ifndef arch_scale_cpu_capacity
+#ifndef arch_scale_max_freq_capacity
+struct sched_domain;
static __always_inline
-unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
-{
- if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
- return sd->smt_gain / sd->span_weight;
-
- return SCHED_CAPACITY_SCALE;
-}
-#endif
-#else
-#ifndef arch_scale_cpu_capacity
-static __always_inline
-unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu)
+unsigned long arch_scale_max_freq_capacity(struct sched_domain *sd, int cpu)
{
return SCHED_CAPACITY_SCALE;
}
#endif
-#endif
struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
__acquires(rq->lock);
@@ -2187,7 +2255,46 @@
# define arch_scale_freq_invariant() false
#endif
+#ifdef CONFIG_SMP
+static inline unsigned long capacity_orig_of(int cpu)
+{
+ return cpu_rq(cpu)->cpu_capacity_orig;
+}
+#endif
+
#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
+/**
+ * enum schedutil_type - CPU utilization type
+ * @FREQUENCY_UTIL: Utilization used to select frequency
+ * @ENERGY_UTIL: Utilization used during energy calculation
+ *
+ * The utilization signals of all scheduling classes (CFS/RT/DL) and IRQ time
+ * need to be aggregated differently depending on the usage made of them. This
+ * enum is used within schedutil_freq_util() to differentiate the types of
+ * utilization expected by the callers, and adjust the aggregation accordingly.
+ */
+enum schedutil_type {
+ FREQUENCY_UTIL,
+ ENERGY_UTIL,
+};
+
+unsigned long schedutil_freq_util(int cpu, unsigned long util,
+ unsigned long max, enum schedutil_type type);
+
+static inline unsigned long schedutil_energy_util(int cpu, unsigned long util)
+{
+ unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
+
+ return schedutil_freq_util(cpu, util, max, ENERGY_UTIL);
+}
+#else /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
+static inline unsigned long schedutil_energy_util(int cpu, unsigned long util)
+{
+ return util;
+}
+#endif
+
+#ifdef CONFIG_SMP
static inline unsigned long cpu_bw_dl(struct rq *rq)
{
return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
@@ -2243,3 +2350,13 @@
return util;
}
#endif
+
+#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
+#define perf_domain_span(pd) (to_cpumask(((pd)->em_pd->cpus)))
+#else
+#define perf_domain_span(pd) NULL
+#endif
+
+#ifdef CONFIG_SMP
+extern struct static_key_false sched_energy_present;
+#endif
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index c183b79..6446d61 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -11,7 +11,8 @@
#ifdef CONFIG_SMP
static int
-select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags)
+select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags,
+ int sibling_count_hint)
{
return task_cpu(p); /* stop tasks as never migrate */
}
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 74b6943..7bc2cdd 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -201,6 +201,199 @@
return 1;
}
+DEFINE_STATIC_KEY_FALSE(sched_energy_present);
+#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
+DEFINE_MUTEX(sched_energy_mutex);
+bool sched_energy_update;
+
+static void free_pd(struct perf_domain *pd)
+{
+ struct perf_domain *tmp;
+
+ while (pd) {
+ tmp = pd->next;
+ kfree(pd);
+ pd = tmp;
+ }
+}
+
+static struct perf_domain *find_pd(struct perf_domain *pd, int cpu)
+{
+ while (pd) {
+ if (cpumask_test_cpu(cpu, perf_domain_span(pd)))
+ return pd;
+ pd = pd->next;
+ }
+
+ return NULL;
+}
+
+static struct perf_domain *pd_init(int cpu)
+{
+ struct em_perf_domain *obj = em_cpu_get(cpu);
+ struct perf_domain *pd;
+
+ if (!obj) {
+ if (sched_debug())
+ pr_info("%s: no EM found for CPU%d\n", __func__, cpu);
+ return NULL;
+ }
+
+ pd = kzalloc(sizeof(*pd), GFP_KERNEL);
+ if (!pd)
+ return NULL;
+ pd->em_pd = obj;
+
+ return pd;
+}
+
+static void perf_domain_debug(const struct cpumask *cpu_map,
+ struct perf_domain *pd)
+{
+ if (!sched_debug() || !pd)
+ return;
+
+ printk(KERN_DEBUG "root_domain %*pbl:", cpumask_pr_args(cpu_map));
+
+ while (pd) {
+ printk(KERN_CONT " pd%d:{ cpus=%*pbl nr_cstate=%d }",
+ cpumask_first(perf_domain_span(pd)),
+ cpumask_pr_args(perf_domain_span(pd)),
+ em_pd_nr_cap_states(pd->em_pd));
+ pd = pd->next;
+ }
+
+ printk(KERN_CONT "\n");
+}
+
+static void destroy_perf_domain_rcu(struct rcu_head *rp)
+{
+ struct perf_domain *pd;
+
+ pd = container_of(rp, struct perf_domain, rcu);
+ free_pd(pd);
+}
+
+static void sched_energy_set(bool has_eas)
+{
+ if (!has_eas && static_branch_unlikely(&sched_energy_present)) {
+ if (sched_debug())
+ pr_info("%s: stopping EAS\n", __func__);
+ static_branch_disable_cpuslocked(&sched_energy_present);
+ } else if (has_eas && !static_branch_unlikely(&sched_energy_present)) {
+ if (sched_debug())
+ pr_info("%s: starting EAS\n", __func__);
+ static_branch_enable_cpuslocked(&sched_energy_present);
+ }
+}
+
+/*
+ * EAS can be used on a root domain if it meets all the following conditions:
+ * 1. an Energy Model (EM) is available;
+ * 2. the SD_ASYM_CPUCAPACITY flag is set in the sched_domain hierarchy.
+ * 3. the EM complexity is low enough to keep scheduling overheads low;
+ * 4. schedutil is driving the frequency of all CPUs of the rd;
+ *
+ * The complexity of the Energy Model is defined as:
+ *
+ * C = nr_pd * (nr_cpus + nr_cs)
+ *
+ * with parameters defined as:
+ * - nr_pd: the number of performance domains
+ * - nr_cpus: the number of CPUs
+ * - nr_cs: the sum of the number of capacity states of all performance
+ * domains (for example, on a system with 2 performance domains,
+ * with 10 capacity states each, nr_cs = 2 * 10 = 20).
+ *
+ * It is generally not a good idea to use such a model in the wake-up path on
+ * very complex platforms because of the associated scheduling overheads. The
+ * arbitrary constraint below prevents that. It makes EAS usable up to 16 CPUs
+ * with per-CPU DVFS and less than 8 capacity states each, for example.
+ */
+#define EM_MAX_COMPLEXITY 2048
+
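For instance, under the assumptions of a hypothetical two-cluster system, the complexity check performed in build_perf_domains() works out as follows (standalone sketch, not kernel code):

    #include <stdio.h>

    #define EM_MAX_COMPLEXITY 2048

    int main(void)
    {
        /* hypothetical big.LITTLE system: two performance domains of 4 CPUs each */
        int nr_pd = 2;
        int nr_cpus = 8;
        int nr_cs = 2 * 10;     /* 10 capacity states per domain */
        int c = nr_pd * (nr_cpus + nr_cs);

        printf("C = %d -> EAS %s\n", c,
               c > EM_MAX_COMPLEXITY ? "rejected (too complex)" : "allowed");
        return 0;
    }

Here C = 2 * (8 + 20) = 56, comfortably below the 2048 cut-off, so the perf domains would be attached to the root domain.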
+extern struct cpufreq_governor schedutil_gov;
+static bool build_perf_domains(const struct cpumask *cpu_map)
+{
+ int i, nr_pd = 0, nr_cs = 0, nr_cpus = cpumask_weight(cpu_map);
+ struct perf_domain *pd = NULL, *tmp;
+ int cpu = cpumask_first(cpu_map);
+ struct root_domain *rd = cpu_rq(cpu)->rd;
+ struct cpufreq_policy *policy;
+ struct cpufreq_governor *gov;
+
+ /* EAS is enabled for asymmetric CPU capacity topologies. */
+ if (!per_cpu(sd_asym_cpucapacity, cpu)) {
+ if (sched_debug()) {
+ pr_info("rd %*pbl: CPUs do not have asymmetric capacities\n",
+ cpumask_pr_args(cpu_map));
+ }
+ goto free;
+ }
+
+ for_each_cpu(i, cpu_map) {
+ /* Skip already covered CPUs. */
+ if (find_pd(pd, i))
+ continue;
+
+ /* Do not attempt EAS if schedutil is not being used. */
+ policy = cpufreq_cpu_get(i);
+ if (!policy)
+ goto free;
+ gov = policy->governor;
+ cpufreq_cpu_put(policy);
+ if (gov != &schedutil_gov) {
+ if (rd->pd)
+ pr_warn("rd %*pbl: Disabling EAS, schedutil is mandatory\n",
+ cpumask_pr_args(cpu_map));
+ goto free;
+ }
+
+ /* Create the new pd and add it to the local list. */
+ tmp = pd_init(i);
+ if (!tmp)
+ goto free;
+ tmp->next = pd;
+ pd = tmp;
+
+ /*
+ * Count performance domains and capacity states for the
+ * complexity check.
+ */
+ nr_pd++;
+ nr_cs += em_pd_nr_cap_states(pd->em_pd);
+ }
+
+ /* Bail out if the Energy Model complexity is too high. */
+ if (nr_pd * (nr_cs + nr_cpus) > EM_MAX_COMPLEXITY) {
+ WARN(1, "rd %*pbl: Failed to start EAS, EM complexity is too high\n",
+ cpumask_pr_args(cpu_map));
+ goto free;
+ }
+
+ perf_domain_debug(cpu_map, pd);
+
+ /* Attach the new list of performance domains to the root domain. */
+ tmp = rd->pd;
+ rcu_assign_pointer(rd->pd, pd);
+ if (tmp)
+ call_rcu(&tmp->rcu, destroy_perf_domain_rcu);
+
+ return !!pd;
+
+free:
+ free_pd(pd);
+ tmp = rd->pd;
+ rcu_assign_pointer(rd->pd, NULL);
+ if (tmp)
+ call_rcu(&tmp->rcu, destroy_perf_domain_rcu);
+
+ return false;
+}
+#else
+static void free_pd(struct perf_domain *pd) { }
+#endif /* CONFIG_ENERGY_MODEL && CONFIG_CPU_FREQ_GOV_SCHEDUTIL*/
+
static void free_rootdomain(struct rcu_head *rcu)
{
struct root_domain *rd = container_of(rcu, struct root_domain, rcu);
@@ -211,6 +404,7 @@
free_cpumask_var(rd->rto_mask);
free_cpumask_var(rd->online);
free_cpumask_var(rd->span);
+ free_pd(rd->pd);
kfree(rd);
}
@@ -287,6 +481,9 @@
if (cpupri_init(&rd->cpupri) != 0)
goto free_cpudl;
+
+ init_max_cpu_capacity(&rd->max_cpu_capacity);
+
return 0;
free_cpudl:
@@ -397,7 +594,9 @@
DEFINE_PER_CPU(int, sd_llc_id);
DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
DEFINE_PER_CPU(struct sched_domain *, sd_numa);
-DEFINE_PER_CPU(struct sched_domain *, sd_asym);
+DEFINE_PER_CPU(struct sched_domain *, sd_asym_packing);
+DEFINE_PER_CPU(struct sched_domain *, sd_asym_cpucapacity);
+DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
static void update_top_cache_domain(int cpu)
{
@@ -422,7 +621,10 @@
rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
- rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
+ rcu_assign_pointer(per_cpu(sd_asym_packing, cpu), sd);
+
+ sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY);
+ rcu_assign_pointer(per_cpu(sd_asym_cpucapacity, cpu), sd);
}
/*
@@ -692,6 +894,7 @@
sg_span = sched_group_span(sg);
sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+ sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
}
static int
@@ -851,6 +1054,7 @@
sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sched_group_span(sg));
sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+ sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
return sg;
}
@@ -1061,7 +1265,6 @@
* SD_SHARE_PKG_RESOURCES - describes shared caches
* SD_NUMA - describes NUMA topologies
* SD_SHARE_POWERDOMAIN - describes shared power domain
- * SD_ASYM_CPUCAPACITY - describes mixed capacity topologies
*
* Odd one out, which beside describing the topology has a quirk also
* prescribes the desired behaviour that goes along with it:
@@ -1073,13 +1276,12 @@
SD_SHARE_PKG_RESOURCES | \
SD_NUMA | \
SD_ASYM_PACKING | \
- SD_ASYM_CPUCAPACITY | \
SD_SHARE_POWERDOMAIN)
static struct sched_domain *
sd_init(struct sched_domain_topology_level *tl,
const struct cpumask *cpu_map,
- struct sched_domain *child, int cpu)
+ struct sched_domain *child, int dflags, int cpu)
{
struct sd_data *sdd = &tl->data;
struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
@@ -1100,6 +1302,9 @@
"wrong sd_flags in topology description\n"))
sd_flags &= ~TOPOLOGY_SD_FLAGS;
+ /* Apply detected topology flags */
+ sd_flags |= dflags;
+
*sd = (struct sched_domain){
.min_interval = sd_weight,
.max_interval = 2*sd_weight,
@@ -1122,7 +1327,7 @@
| 0*SD_SHARE_CPUCAPACITY
| 0*SD_SHARE_PKG_RESOURCES
| 0*SD_SERIALIZE
- | 0*SD_PREFER_SIBLING
+ | 1*SD_PREFER_SIBLING
| 0*SD_NUMA
| sd_flags
,
@@ -1148,17 +1353,21 @@
if (sd->flags & SD_ASYM_CPUCAPACITY) {
struct sched_domain *t = sd;
+ /*
+ * Don't attempt to spread across CPUs of different capacities.
+ */
+ if (sd->child)
+ sd->child->flags &= ~SD_PREFER_SIBLING;
+
for_each_lower_domain(t)
t->flags |= SD_BALANCE_WAKE;
}
if (sd->flags & SD_SHARE_CPUCAPACITY) {
- sd->flags |= SD_PREFER_SIBLING;
sd->imbalance_pct = 110;
sd->smt_gain = 1178; /* ~15% */
} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
- sd->flags |= SD_PREFER_SIBLING;
sd->imbalance_pct = 117;
sd->cache_nice_tries = 1;
sd->busy_idx = 2;
@@ -1169,6 +1378,7 @@
sd->busy_idx = 3;
sd->idle_idx = 2;
+ sd->flags &= ~SD_PREFER_SIBLING;
sd->flags |= SD_SERIALIZE;
if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
sd->flags &= ~(SD_BALANCE_EXEC |
@@ -1178,7 +1388,6 @@
#endif
} else {
- sd->flags |= SD_PREFER_SIBLING;
sd->cache_nice_tries = 1;
sd->busy_idx = 2;
sd->idle_idx = 1;
@@ -1604,9 +1813,9 @@
static struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
const struct cpumask *cpu_map, struct sched_domain_attr *attr,
- struct sched_domain *child, int cpu)
+ struct sched_domain *child, int dflags, int cpu)
{
- struct sched_domain *sd = sd_init(tl, cpu_map, child, cpu);
+ struct sched_domain *sd = sd_init(tl, cpu_map, child, dflags, cpu);
if (child) {
sd->level = child->level + 1;
@@ -1633,6 +1842,65 @@
}
/*
+ * Find the sched_domain_topology_level where all CPU capacities are visible
+ * for all CPUs.
+ */
+static struct sched_domain_topology_level
+*asym_cpu_capacity_level(const struct cpumask *cpu_map)
+{
+ int i, j, asym_level = 0;
+ bool asym = false;
+ struct sched_domain_topology_level *tl, *asym_tl = NULL;
+ unsigned long cap;
+
+ /* Is there any asymmetry? */
+ cap = arch_scale_cpu_capacity(NULL, cpumask_first(cpu_map));
+
+ for_each_cpu(i, cpu_map) {
+ if (arch_scale_cpu_capacity(NULL, i) != cap) {
+ asym = true;
+ break;
+ }
+ }
+
+ if (!asym)
+ return NULL;
+
+ /*
+ * Examine topology from all CPU's point of views to detect the lowest
+ * sched_domain_topology_level where a highest capacity CPU is visible
+ * to everyone.
+ */
+ for_each_cpu(i, cpu_map) {
+ unsigned long max_capacity = arch_scale_cpu_capacity(NULL, i);
+ int tl_id = 0;
+
+ for_each_sd_topology(tl) {
+ if (tl_id < asym_level)
+ goto next_level;
+
+ for_each_cpu_and(j, tl->mask(i), cpu_map) {
+ unsigned long capacity;
+
+ capacity = arch_scale_cpu_capacity(NULL, j);
+
+ if (capacity <= max_capacity)
+ continue;
+
+ max_capacity = capacity;
+ asym_level = tl_id;
+ asym_tl = tl;
+ }
+next_level:
+ tl_id++;
+ }
+ }
+
+ return asym_tl;
+}
+
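The first loop of asym_cpu_capacity_level() only checks whether any two CPUs report different capacities; the nested scan then looks for the lowest topology level at which a highest-capacity CPU is visible to every CPU, and that level later gets SD_ASYM_CPUCAPACITY via dflags. A minimal sketch of the asymmetry test, with hypothetical capacity values:

    #include <stdio.h>
    #include <stdbool.h>

    int main(void)
    {
        /* hypothetical capacities: 4 little CPUs and 4 big CPUs */
        unsigned long cap[8] = { 446, 446, 446, 446, 1024, 1024, 1024, 1024 };
        bool asym = false;

        for (int i = 1; i < 8; i++) {
            if (cap[i] != cap[0]) {
                asym = true;
                break;
            }
        }

        printf("asymmetric capacities: %s\n",
               asym ? "yes, one level will carry SD_ASYM_CPUCAPACITY"
                    : "no, dflags stays 0");
        return 0;
    }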
+
+/*
* Build sched domains for a given set of CPUs and attach the sched domains
* to the individual CPUs
*/
@@ -1642,20 +1910,31 @@
enum s_alloc alloc_state;
struct sched_domain *sd;
struct s_data d;
- struct rq *rq = NULL;
int i, ret = -ENOMEM;
+ struct sched_domain_topology_level *tl_asym;
+ bool has_asym = false;
alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
if (alloc_state != sa_rootdomain)
goto error;
+ tl_asym = asym_cpu_capacity_level(cpu_map);
+
/* Set up domains for CPUs specified by the cpu_map: */
for_each_cpu(i, cpu_map) {
struct sched_domain_topology_level *tl;
sd = NULL;
for_each_sd_topology(tl) {
- sd = build_sched_domain(tl, cpu_map, attr, sd, i);
+ int dflags = 0;
+
+ if (tl == tl_asym) {
+ dflags |= SD_ASYM_CPUCAPACITY;
+ has_asym = true;
+ }
+
+ sd = build_sched_domain(tl, cpu_map, attr, sd, dflags, i);
+
if (tl == sched_domain_topology)
*per_cpu_ptr(d.sd, i) = sd;
if (tl->flags & SDTL_OVERLAP)
@@ -1693,21 +1972,13 @@
/* Attach the domains */
rcu_read_lock();
for_each_cpu(i, cpu_map) {
- rq = cpu_rq(i);
sd = *per_cpu_ptr(d.sd, i);
-
- /* Use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing: */
- if (rq->cpu_capacity_orig > READ_ONCE(d.rd->max_cpu_capacity))
- WRITE_ONCE(d.rd->max_cpu_capacity, rq->cpu_capacity_orig);
-
cpu_attach_domain(sd, d.rd, i);
}
rcu_read_unlock();
- if (rq && sched_debug_enabled) {
- pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
- cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
- }
+ if (has_asym)
+ static_branch_enable_cpuslocked(&sched_asym_cpucapacity);
ret = 0;
error:
@@ -1852,6 +2123,7 @@
void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
struct sched_domain_attr *dattr_new)
{
+ bool __maybe_unused has_eas = false;
int i, j, n;
int new_topology;
@@ -1879,8 +2151,8 @@
/* Destroy deleted domains: */
for (i = 0; i < ndoms_cur; i++) {
for (j = 0; j < n && !new_topology; j++) {
- if (cpumask_equal(doms_cur[i], doms_new[j])
- && dattrs_equal(dattr_cur, i, dattr_new, j))
+ if (cpumask_equal(doms_cur[i], doms_new[j]) &&
+ dattrs_equal(dattr_cur, i, dattr_new, j))
goto match1;
}
/* No match - a current sched domain not in new doms_new[] */
@@ -1900,8 +2172,8 @@
/* Build new domains: */
for (i = 0; i < ndoms_new; i++) {
for (j = 0; j < n && !new_topology; j++) {
- if (cpumask_equal(doms_new[i], doms_cur[j])
- && dattrs_equal(dattr_new, i, dattr_cur, j))
+ if (cpumask_equal(doms_new[i], doms_cur[j]) &&
+ dattrs_equal(dattr_new, i, dattr_cur, j))
goto match2;
}
/* No match - add a new doms_new */
@@ -1910,6 +2182,24 @@
;
}
+#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
+ /* Build perf. domains: */
+ for (i = 0; i < ndoms_new; i++) {
+ for (j = 0; j < n && !sched_energy_update; j++) {
+ if (cpumask_equal(doms_new[i], doms_cur[j]) &&
+ cpu_rq(cpumask_first(doms_cur[j]))->rd->pd) {
+ has_eas = true;
+ goto match3;
+ }
+ }
+ /* No match - add perf. domains for a new rd */
+ has_eas |= build_perf_domains(doms_new[i]);
+match3:
+ ;
+ }
+ sched_energy_set(has_eas);
+#endif
+
/* Remember the new sched domains: */
if (doms_cur != &fallback_doms)
free_sched_domains(doms_cur, ndoms_cur);
diff --git a/kernel/sched/tune.c b/kernel/sched/tune.c
new file mode 100644
index 0000000..6e228f7
--- /dev/null
+++ b/kernel/sched/tune.c
@@ -0,0 +1,686 @@
+#include <linux/cgroup.h>
+#include <linux/err.h>
+#include <linux/kernel.h>
+#include <linux/percpu.h>
+#include <linux/printk.h>
+#include <linux/rcupdate.h>
+#include <linux/slab.h>
+
+#include <trace/events/sched.h>
+
+#include "sched.h"
+
+bool schedtune_initialized = false;
+extern struct reciprocal_value schedtune_spc_rdiv;
+
+/* We hold schedtune boost in effect for at least this long */
+#define SCHEDTUNE_BOOST_HOLD_NS 50000000ULL
+
+/*
+ * EAS scheduler tunables for task groups.
+ *
+ * When CGroup support is enabled, we have to synchronize two different
+ * paths:
+ * - slow path: where CGroups are created/updated/removed
+ * - fast path: where tasks in a CGroup are accounted
+ *
+ * The slow path tracks (a limited number of) CGroups and maps each to a
+ * "boost_group" index. The fast path accounts the tasks currently RUNNABLE
+ * on each "boost_group".
+ *
+ * Once a new CGroup is created, a boost group idx is assigned and the
+ * corresponding "boost_group" is marked as valid on each CPU.
+ * Once a CGroup is released, the corresponding "boost_group" is marked as
+ * invalid on each CPU. The CPU boost value (boost_max) is aggregated by
+ * considering only valid boost_groups with a non-null tasks counter.
+ *
+ * .:: Locking strategy
+ *
+ * The fast path uses a spin lock for each CPU boost_group which protects the
+ * tasks counter.
+ *
+ * The "valid" and "boost" values of each CPU boost_group is instead
+ * protected by the RCU lock provided by the CGroups callbacks. Thus, only the
+ * slow path can access and modify the boost_group attribtues of each CPU.
+ * The fast path will catch up the most updated values at the next scheduling
+ * event (i.e. enqueue/dequeue).
+ *
+ * |
+ * SLOW PATH | FAST PATH
+ * CGroup add/update/remove | Scheduler enqueue/dequeue events
+ * |
+ * |
+ * | DEFINE_PER_CPU(struct boost_groups)
+ * | +--------------+----+---+----+----+
+ * | | idle | | | | |
+ * | | boost_max | | | | |
+ * | +---->lock | | | | |
+ * struct schedtune allocated_groups | | | group[ ] | | | | |
+ * +------------------------------+ +-------+ | | +--+---------+-+----+---+----+----+
+ * | idx | | | | | | valid |
 * | boost / prefer_idle | | | | | | boost |
+ * | perf_{boost/constraints}_idx | <---------+(*) | | | | tasks | <------------+
+ * | css | +-------+ | | +---------+ |
+ * +-+----------------------------+ | | | | | | |
+ * ^ | | | | | | |
+ * | +-------+ | | +---------+ |
+ * | | | | | | | |
+ * | | | | | | | |
+ * | +-------+ | | +---------+ |
+ * | zmalloc | | | | | | |
+ * | | | | | | | |
+ * | +-------+ | | +---------+ |
+ * + BOOSTGROUPS_COUNT | | BOOSTGROUPS_COUNT |
+ * schedtune_boostgroup_init() | + |
+ * | schedtune_{en,de}queue_task() |
+ * | +
+ * | schedtune_tasks_update()
+ * |
+ */
+
+/* SchedTune tunables for a group of tasks */
+struct schedtune {
+ /* SchedTune CGroup subsystem */
+ struct cgroup_subsys_state css;
+
+ /* Boost group allocated ID */
+ int idx;
+
+ /* Boost value for tasks on that SchedTune CGroup */
+ int boost;
+
+ /* Hint to bias scheduling of tasks on that SchedTune CGroup
+ * towards idle CPUs */
+ int prefer_idle;
+};
+
+static inline struct schedtune *css_st(struct cgroup_subsys_state *css)
+{
+ return css ? container_of(css, struct schedtune, css) : NULL;
+}
+
+static inline struct schedtune *task_schedtune(struct task_struct *tsk)
+{
+ return css_st(task_css(tsk, schedtune_cgrp_id));
+}
+
+static inline struct schedtune *parent_st(struct schedtune *st)
+{
+ return css_st(st->css.parent);
+}
+
+/*
+ * SchedTune root control group
+ * The root control group is used to define a system-wide boosting tuning,
+ * which is applied to all tasks in the system.
+ * Task specific boost tuning could be specified by creating and
+ * configuring a child control group under the root one.
+ * By default, system-wide boosting is disabled, i.e. no boosting is applied
+ * to tasks which are not in a child control group.
+ */
+static struct schedtune
+root_schedtune = {
+ .boost = 0,
+ .prefer_idle = 0,
+};
+
+/*
+ * Maximum number of boost groups to support
+ * When per-task boosting is used we still allow only a limited number of
+ * boost groups for two main reasons:
+ * 1. on a real system we usually have only a few classes of workloads which
+ * it makes sense to boost with different values (e.g. background vs foreground
+ * tasks, interactive vs low-priority tasks)
+ * 2. a limited number allows for a simpler and more memory/time efficient
+ * implementation especially for the computation of the per-CPU boost
+ * value
+ */
+#define BOOSTGROUPS_COUNT 16
+
+/* Array of configured boostgroups */
+static struct schedtune *allocated_group[BOOSTGROUPS_COUNT] = {
+ &root_schedtune,
+ NULL,
+};
+
+/* SchedTune boost groups
+ * Keep track of all the boost groups which impact a CPU, for example when a
+ * CPU has two RUNNABLE tasks belonging to two different boost groups and thus
+ * likely with different boost values.
+ * Since on each system we expect only a limited number of boost groups, here
+ * we use a simple array to keep track of the metrics required to compute the
+ * maximum per-CPU boosting value.
+ */
+struct boost_groups {
+ /* Maximum boost value for all RUNNABLE tasks on a CPU */
+ int boost_max;
+ u64 boost_ts;
+ struct {
+ /* True when this boost group maps an actual cgroup */
+ bool valid;
+ /* The boost for tasks on that boost group */
+ int boost;
+ /* Count of RUNNABLE tasks on that boost group */
+ unsigned tasks;
+ /* Timestamp of boost activation */
+ u64 ts;
+ } group[BOOSTGROUPS_COUNT];
+ /* CPU's boost group locking */
+ raw_spinlock_t lock;
+};
+
+/* Boost groups affecting each CPU in the system */
+DEFINE_PER_CPU(struct boost_groups, cpu_boost_groups);
+
+static inline bool schedtune_boost_timeout(u64 now, u64 ts)
+{
+ return ((now - ts) > SCHEDTUNE_BOOST_HOLD_NS);
+}
+
+static inline bool
+schedtune_boost_group_active(int idx, struct boost_groups* bg, u64 now)
+{
+ if (bg->group[idx].tasks)
+ return true;
+
+ return !schedtune_boost_timeout(now, bg->group[idx].ts);
+}
+
+static void
+schedtune_cpu_update(int cpu, u64 now)
+{
+ struct boost_groups *bg = &per_cpu(cpu_boost_groups, cpu);
+ int boost_max;
+ u64 boost_ts;
+ int idx;
+
+ /* The root boost group is always active */
+ boost_max = bg->group[0].boost;
+ boost_ts = now;
+ for (idx = 1; idx < BOOSTGROUPS_COUNT; ++idx) {
+
+ /* Ignore boost groups which do not map an actual cgroup */
+ if (!bg->group[idx].valid)
+ continue;
+
+ /*
+ * A boost group affects a CPU only if it has
+ * RUNNABLE tasks on that CPU or it has a hold
+ * in effect from a previous task.
+ */
+ if (!schedtune_boost_group_active(idx, bg, now))
+ continue;
+
+ /* This boost group is active */
+ if (boost_max > bg->group[idx].boost)
+ continue;
+
+ boost_max = bg->group[idx].boost;
+ boost_ts = bg->group[idx].ts;
+ }
+
+ /* Ensure boost_max is non-negative when all cgroup boost values
+ * are negative. Avoids under-accounting of CPU capacity, which may
+ * cause task stacking and frequency spikes. */
+ boost_max = max(boost_max, 0);
+ bg->boost_max = boost_max;
+ bg->boost_ts = boost_ts;
+}
+
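The aggregation above can be illustrated with a small standalone sketch (hypothetical boost groups and timestamps, not the kernel code path): the per-CPU boost is the maximum boost among groups that either have RUNNABLE tasks or are still within the 50 ms hold window, clamped to be non-negative.

    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define SCHEDTUNE_BOOST_HOLD_NS 50000000ULL

    struct group { bool valid; int boost; unsigned tasks; uint64_t ts; };

    /* a group keeps boosting while it has RUNNABLE tasks or during the hold window */
    static bool group_active(const struct group *g, uint64_t now)
    {
        return g->tasks || (now - g->ts) <= SCHEDTUNE_BOOST_HOLD_NS;
    }

    int main(void)
    {
        uint64_t now = 100000000ULL;   /* hypothetical sched_clock, 100 ms */
        struct group groups[3] = {
            { true, 0,  0, 0 },              /* root group, always counted */
            { true, 30, 1, now },            /* foreground: runnable task, boost 30 */
            { true, 50, 0, now - 80000000 }, /* old burst: hold expired 30 ms ago */
        };
        int boost_max = groups[0].boost;

        for (int i = 1; i < 3; i++) {
            if (!groups[i].valid || !group_active(&groups[i], now))
                continue;
            if (groups[i].boost > boost_max)
                boost_max = groups[i].boost;
        }
        if (boost_max < 0)
            boost_max = 0;   /* never under-account capacity */

        printf("cpu boost_max = %d\n", boost_max); /* 30: the boost-50 group is stale */
        return 0;
    }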
+static int
+schedtune_boostgroup_update(int idx, int boost)
+{
+ struct boost_groups *bg;
+ int cur_boost_max;
+ int old_boost;
+ int cpu;
+ u64 now;
+
+ /* Update per CPU boost groups */
+ for_each_possible_cpu(cpu) {
+ bg = &per_cpu(cpu_boost_groups, cpu);
+
+ /* A boost value can only be written for a valid (allocated) boost group */
+ BUG_ON(!bg->group[idx].valid);
+
+ /*
+ * Keep track of current boost values to compute the per CPU
+ * maximum only when it has been affected by the new value of
+ * the updated boost group
+ */
+ cur_boost_max = bg->boost_max;
+ old_boost = bg->group[idx].boost;
+
+ /* Update the boost value of this boost group */
+ bg->group[idx].boost = boost;
+
+ /* Check if this update increases the current max */
+ now = sched_clock_cpu(cpu);
+ if (boost > cur_boost_max &&
+ schedtune_boost_group_active(idx, bg, now)) {
+ bg->boost_max = boost;
+ bg->boost_ts = bg->group[idx].ts;
+
+ trace_sched_tune_boostgroup_update(cpu, 1, bg->boost_max);
+ continue;
+ }
+
+ /* Check if this update has decreased current max */
+ if (cur_boost_max == old_boost && old_boost > boost) {
+ schedtune_cpu_update(cpu, now);
+ trace_sched_tune_boostgroup_update(cpu, -1, bg->boost_max);
+ continue;
+ }
+
+ trace_sched_tune_boostgroup_update(cpu, 0, bg->boost_max);
+ }
+
+ return 0;
+}
+
+#define ENQUEUE_TASK 1
+#define DEQUEUE_TASK -1
+
+static inline bool
+schedtune_update_timestamp(struct task_struct *p)
+{
+ if (sched_feat(SCHEDTUNE_BOOST_HOLD_ALL))
+ return true;
+
+ return task_has_rt_policy(p);
+}
+
+static inline void
+schedtune_tasks_update(struct task_struct *p, int cpu, int idx, int task_count)
+{
+ struct boost_groups *bg = &per_cpu(cpu_boost_groups, cpu);
+ int tasks = bg->group[idx].tasks + task_count;
+
+ /* Update the boosted tasks count without letting it go negative */
+ bg->group[idx].tasks = max(0, tasks);
+
+ /* Update timeout on enqueue */
+ if (task_count > 0) {
+ u64 now = sched_clock_cpu(cpu);
+
+ if (schedtune_update_timestamp(p))
+ bg->group[idx].ts = now;
+
+ /* Boost group activation or deactivation on that RQ */
+ if (bg->group[idx].tasks == 1)
+ schedtune_cpu_update(cpu, now);
+ }
+
+ trace_sched_tune_tasks_update(p, cpu, tasks, idx,
+ bg->group[idx].boost, bg->boost_max,
+ bg->group[idx].ts);
+}
+
+/*
+ * NOTE: This function must be called while holding the lock on the CPU RQ
+ */
+void schedtune_enqueue_task(struct task_struct *p, int cpu)
+{
+ struct boost_groups *bg = &per_cpu(cpu_boost_groups, cpu);
+ unsigned long irq_flags;
+ struct schedtune *st;
+ int idx;
+
+ if (unlikely(!schedtune_initialized))
+ return;
+
+ /*
+ * Boost group accounting is protected by a per-cpu lock and requires
+ * interrupts to be disabled to avoid race conditions, for example with
+ * do_exit()::cgroup_exit() and task migration.
+ */
+ raw_spin_lock_irqsave(&bg->lock, irq_flags);
+ rcu_read_lock();
+
+ st = task_schedtune(p);
+ idx = st->idx;
+
+ schedtune_tasks_update(p, cpu, idx, ENQUEUE_TASK);
+
+ rcu_read_unlock();
+ raw_spin_unlock_irqrestore(&bg->lock, irq_flags);
+}
+
+int schedtune_can_attach(struct cgroup_taskset *tset)
+{
+ struct task_struct *task;
+ struct cgroup_subsys_state *css;
+ struct boost_groups *bg;
+ struct rq_flags rq_flags;
+ unsigned int cpu;
+ struct rq *rq;
+ int src_bg; /* Source boost group index */
+ int dst_bg; /* Destination boost group index */
+ int tasks;
+ u64 now;
+
+ if (unlikely(!schedtune_initialized))
+ return 0;
+
+
+ cgroup_taskset_for_each(task, css, tset) {
+
+ /*
+ * Lock the CPU's RQ the task is enqueued to avoid race
+ * conditions with migration code while the task is being
+ * accounted
+ */
+ rq = task_rq_lock(task, &rq_flags);
+
+ if (!task->on_rq) {
+ task_rq_unlock(rq, task, &rq_flags);
+ continue;
+ }
+
+ /*
+ * Boost group accounting is protected by a per-cpu lock and requires
+ * interrupts to be disabled to avoid race conditions on...
+ */
+ cpu = cpu_of(rq);
+ bg = &per_cpu(cpu_boost_groups, cpu);
+ raw_spin_lock(&bg->lock);
+
+ dst_bg = css_st(css)->idx;
+ src_bg = task_schedtune(task)->idx;
+
+ /*
+ * Current task is not changing boostgroup, which can
+ * happen when the new hierarchy is in use.
+ */
+ if (unlikely(dst_bg == src_bg)) {
+ raw_spin_unlock(&bg->lock);
+ task_rq_unlock(rq, task, &rq_flags);
+ continue;
+ }
+
+ /*
+ * This is the case of a RUNNABLE task which is switching its
+ * current boost group.
+ */
+
+ /* Move task from src to dst boost group */
+ tasks = bg->group[src_bg].tasks - 1;
+ bg->group[src_bg].tasks = max(0, tasks);
+ bg->group[dst_bg].tasks += 1;
+
+ /* Update boost hold start for this group */
+ now = sched_clock_cpu(cpu);
+ bg->group[dst_bg].ts = now;
+
+ /* Force boost group re-evaluation at next boost check */
+ bg->boost_ts = now - SCHEDTUNE_BOOST_HOLD_NS;
+
+ raw_spin_unlock(&bg->lock);
+ task_rq_unlock(rq, task, &rq_flags);
+ }
+
+ return 0;
+}
+
+void schedtune_cancel_attach(struct cgroup_taskset *tset)
+{
+ /* This can happen only if the SchedTune controller is mounted with
+ * other hierarchies and one of them fails. Since SchedTune is usually
+ * mounted on its own hierarchy, for the time being we do not implement
+ * a proper rollback mechanism */
+ WARN(1, "SchedTune cancel attach not implemented");
+}
+
+/*
+ * NOTE: This function must be called while holding the lock on the CPU RQ
+ */
+void schedtune_dequeue_task(struct task_struct *p, int cpu)
+{
+ struct boost_groups *bg = &per_cpu(cpu_boost_groups, cpu);
+ unsigned long irq_flags;
+ struct schedtune *st;
+ int idx;
+
+ if (unlikely(!schedtune_initialized))
+ return;
+
+ /*
+ * Boost group accounting is protected by a per-cpu lock and requires
+ * interrupts to be disabled to avoid race conditions on...
+ */
+ raw_spin_lock_irqsave(&bg->lock, irq_flags);
+ rcu_read_lock();
+
+ st = task_schedtune(p);
+ idx = st->idx;
+
+ schedtune_tasks_update(p, cpu, idx, DEQUEUE_TASK);
+
+ rcu_read_unlock();
+ raw_spin_unlock_irqrestore(&bg->lock, irq_flags);
+}
+
+int schedtune_cpu_boost(int cpu)
+{
+ struct boost_groups *bg;
+ u64 now;
+
+ bg = &per_cpu(cpu_boost_groups, cpu);
+ now = sched_clock_cpu(cpu);
+
+ /* Check to see if we have a hold in effect */
+ if (schedtune_boost_timeout(now, bg->boost_ts))
+ schedtune_cpu_update(cpu, now);
+
+ return bg->boost_max;
+}
+
+int schedtune_task_boost(struct task_struct *p)
+{
+ struct schedtune *st;
+ int task_boost;
+
+ if (unlikely(!schedtune_initialized))
+ return 0;
+
+ /* Get task boost value */
+ rcu_read_lock();
+ st = task_schedtune(p);
+ task_boost = st->boost;
+ rcu_read_unlock();
+
+ return task_boost;
+}
+
+int schedtune_prefer_idle(struct task_struct *p)
+{
+ struct schedtune *st;
+ int prefer_idle;
+
+ if (unlikely(!schedtune_initialized))
+ return 0;
+
+ /* Get prefer_idle value */
+ rcu_read_lock();
+ st = task_schedtune(p);
+ prefer_idle = st->prefer_idle;
+ rcu_read_unlock();
+
+ return prefer_idle;
+}
+
+static u64
+prefer_idle_read(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+ struct schedtune *st = css_st(css);
+
+ return st->prefer_idle;
+}
+
+static int
+prefer_idle_write(struct cgroup_subsys_state *css, struct cftype *cft,
+ u64 prefer_idle)
+{
+ struct schedtune *st = css_st(css);
+ st->prefer_idle = !!prefer_idle;
+
+ return 0;
+}
+
+static s64
+boost_read(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+ struct schedtune *st = css_st(css);
+
+ return st->boost;
+}
+
+static int
+boost_write(struct cgroup_subsys_state *css, struct cftype *cft,
+ s64 boost)
+{
+ struct schedtune *st = css_st(css);
+
+ if (boost < 0 || boost > 100)
+ return -EINVAL;
+
+ st->boost = boost;
+
+ /* Update CPU boost */
+ schedtune_boostgroup_update(st->idx, st->boost);
+
+ return 0;
+}
+
+static struct cftype files[] = {
+ {
+ .name = "boost",
+ .read_s64 = boost_read,
+ .write_s64 = boost_write,
+ },
+ {
+ .name = "prefer_idle",
+ .read_u64 = prefer_idle_read,
+ .write_u64 = prefer_idle_write,
+ },
+ { } /* terminate */
+};
+
+static void
+schedtune_boostgroup_init(struct schedtune *st, int idx)
+{
+ struct boost_groups *bg;
+ int cpu;
+
+ /* Initialize per CPUs boost group support */
+ for_each_possible_cpu(cpu) {
+ bg = &per_cpu(cpu_boost_groups, cpu);
+ bg->group[idx].boost = 0;
+ bg->group[idx].valid = true;
+ bg->group[idx].ts = 0;
+ }
+
+ /* Keep track of allocated boost groups */
+ allocated_group[idx] = st;
+ st->idx = idx;
+}
+
+static struct cgroup_subsys_state *
+schedtune_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+ struct schedtune *st;
+ int idx;
+
+ if (!parent_css)
+ return &root_schedtune.css;
+
+ /* Allow only a limited number of boosting groups */
+ for (idx = 1; idx < BOOSTGROUPS_COUNT; ++idx)
+ if (!allocated_group[idx])
+ break;
+ if (idx == BOOSTGROUPS_COUNT) {
+ pr_err("Trying to create more than %d SchedTune boosting groups\n",
+ BOOSTGROUPS_COUNT);
+ return ERR_PTR(-ENOSPC);
+ }
+
+ st = kzalloc(sizeof(*st), GFP_KERNEL);
+ if (!st)
+ goto out;
+
+ /* Initialize per CPUs boost group support */
+ schedtune_boostgroup_init(st, idx);
+
+ return &st->css;
+
+out:
+ return ERR_PTR(-ENOMEM);
+}
+
+static void
+schedtune_boostgroup_release(struct schedtune *st)
+{
+ struct boost_groups *bg;
+ int cpu;
+
+ /* Reset per CPUs boost group support */
+ for_each_possible_cpu(cpu) {
+ bg = &per_cpu(cpu_boost_groups, cpu);
+ bg->group[st->idx].valid = false;
+ bg->group[st->idx].boost = 0;
+ }
+
+ /* Keep track of allocated boost groups */
+ allocated_group[st->idx] = NULL;
+}
+
+static void
+schedtune_css_free(struct cgroup_subsys_state *css)
+{
+ struct schedtune *st = css_st(css);
+
+ /* Release per CPUs boost group support */
+ schedtune_boostgroup_release(st);
+ kfree(st);
+}
+
+struct cgroup_subsys schedtune_cgrp_subsys = {
+ .css_alloc = schedtune_css_alloc,
+ .css_free = schedtune_css_free,
+ .can_attach = schedtune_can_attach,
+ .cancel_attach = schedtune_cancel_attach,
+ .legacy_cftypes = files,
+ .early_init = 1,
+};
+
+static inline void
+schedtune_init_cgroups(void)
+{
+ struct boost_groups *bg;
+ int cpu;
+
+ /* Initialize the per CPU boost groups */
+ for_each_possible_cpu(cpu) {
+ bg = &per_cpu(cpu_boost_groups, cpu);
+ memset(bg, 0, sizeof(struct boost_groups));
+ bg->group[0].valid = true;
+ raw_spin_lock_init(&bg->lock);
+ }
+
+ pr_info("schedtune: configured to support %d boost groups\n",
+ BOOSTGROUPS_COUNT);
+
+ schedtune_initialized = true;
+}
+
+/*
+ * Initialize the cgroup structures
+ */
+static int
+schedtune_init(void)
+{
+ schedtune_spc_rdiv = reciprocal_value(100);
+ schedtune_init_cgroups();
+ return 0;
+}
+postcore_initcall(schedtune_init);
diff --git a/kernel/sched/tune.h b/kernel/sched/tune.h
new file mode 100644
index 0000000..821f026
--- /dev/null
+++ b/kernel/sched/tune.h
@@ -0,0 +1,37 @@
+
+#ifdef CONFIG_SCHED_TUNE
+
+#include <linux/reciprocal_div.h>
+
+/*
+ * System energy normalization constants
+ */
+struct target_nrg {
+ unsigned long min_power;
+ unsigned long max_power;
+ struct reciprocal_value rdiv;
+};
+
+int schedtune_cpu_boost(int cpu);
+int schedtune_task_boost(struct task_struct *tsk);
+
+int schedtune_prefer_idle(struct task_struct *tsk);
+
+void schedtune_enqueue_task(struct task_struct *p, int cpu);
+void schedtune_dequeue_task(struct task_struct *p, int cpu);
+
+unsigned long boosted_cpu_util(int cpu, unsigned long other_util);
+
+#else /* CONFIG_SCHED_TUNE */
+
+#define schedtune_cpu_boost(cpu) 0
+#define schedtune_task_boost(tsk) 0
+
+#define schedtune_prefer_idle(tsk) 0
+
+#define schedtune_enqueue_task(task, cpu) do { } while (0)
+#define schedtune_dequeue_task(task, cpu) do { } while (0)
+
+#define boosted_cpu_util(cpu, other_util) cpu_util_cfs(cpu_rq(cpu))
+
+#endif /* CONFIG_SCHED_TUNE */
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 6f58486..1946ac6 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -89,7 +89,8 @@
if (pending & SOFTIRQ_NOW_MASK)
return false;
- return tsk && (tsk->state == TASK_RUNNING);
+ return tsk && (tsk->state == TASK_RUNNING) &&
+ !__kthread_should_park(tsk);
}
/*
diff --git a/kernel/sys.c b/kernel/sys.c
index 096932a..1d1e673 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -42,9 +42,12 @@
#include <linux/syscore_ops.h>
#include <linux/version.h>
#include <linux/ctype.h>
+#include <linux/mm.h>
+#include <linux/mempolicy.h>
#include <linux/compat.h>
#include <linux/syscalls.h>
+#include <linux/alt-syscall.h>
#include <linux/kprobes.h>
#include <linux/user_namespace.h>
#include <linux/binfmts.h>
@@ -190,7 +193,7 @@
return error;
}
-SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval)
+int ksys_setpriority(int which, int who, int niceval)
{
struct task_struct *g, *p;
struct user_struct *user;
@@ -254,13 +257,18 @@
return error;
}
+SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval)
+{
+ return ksys_setpriority(which, who, niceval);
+}
+
/*
* Ugh. To avoid negative return values, "getpriority()" will
* not return the normal nice-value, but a negated value that
* has been offset by 20 (ie it returns 40..1 instead of -20..19)
* to stay compatible.
*/
-SYSCALL_DEFINE2(getpriority, int, which, int, who)
+int ksys_getpriority(int which, int who)
{
struct task_struct *g, *p;
struct user_struct *user;
@@ -325,6 +333,11 @@
return retval;
}
+SYSCALL_DEFINE2(getpriority, int, which, int, who)
+{
+ return ksys_getpriority(which, who);
+}
+
/*
* Unprivileged users may change the real gid to the effective gid
* or vice versa. (BSD-style)
@@ -2258,8 +2271,155 @@
return -EINVAL;
}
-SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
- unsigned long, arg4, unsigned long, arg5)
+#ifdef CONFIG_MMU
+static int prctl_update_vma_anon_name(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start, unsigned long end,
+ const char __user *name_addr)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ int error = 0;
+ pgoff_t pgoff;
+
+ if (name_addr == vma_get_anon_name(vma)) {
+ *prev = vma;
+ goto out;
+ }
+
+ pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
+ *prev = vma_merge(mm, *prev, start, end, vma->vm_flags, vma->anon_vma,
+ vma->vm_file, pgoff, vma_policy(vma),
+ vma->vm_userfaultfd_ctx, name_addr);
+ if (*prev) {
+ vma = *prev;
+ goto success;
+ }
+
+ *prev = vma;
+
+ if (start != vma->vm_start) {
+ error = split_vma(mm, vma, start, 1);
+ if (error)
+ goto out;
+ }
+
+ if (end != vma->vm_end) {
+ error = split_vma(mm, vma, end, 0);
+ if (error)
+ goto out;
+ }
+
+success:
+ if (!vma->vm_file)
+ vma->anon_name = name_addr;
+
+out:
+ if (error == -ENOMEM)
+ error = -EAGAIN;
+ return error;
+}
+
+static int prctl_set_vma_anon_name(unsigned long start, unsigned long end,
+ unsigned long arg)
+{
+ unsigned long tmp;
+ struct vm_area_struct *vma, *prev;
+ int unmapped_error = 0;
+ int error = -EINVAL;
+
+ /*
+ * If the interval [start,end) covers some unmapped address
+ * ranges, just ignore them, but return -ENOMEM at the end.
+ * - this matches the handling in madvise.
+ */
+ vma = find_vma_prev(current->mm, start, &prev);
+ if (vma && start > vma->vm_start)
+ prev = vma;
+
+ for (;;) {
+ /* Still start < end. */
+ error = -ENOMEM;
+ if (!vma)
+ return error;
+
+ /* Here start < (end|vma->vm_end). */
+ if (start < vma->vm_start) {
+ unmapped_error = -ENOMEM;
+ start = vma->vm_start;
+ if (start >= end)
+ return error;
+ }
+
+ /* Here vma->vm_start <= start < (end|vma->vm_end) */
+ tmp = vma->vm_end;
+ if (end < tmp)
+ tmp = end;
+
+ /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
+ error = prctl_update_vma_anon_name(vma, &prev, start, tmp,
+ (const char __user *)arg);
+ if (error)
+ return error;
+ start = tmp;
+ if (prev && start < prev->vm_end)
+ start = prev->vm_end;
+ error = unmapped_error;
+ if (start >= end)
+ return error;
+ if (prev)
+ vma = prev->vm_next;
+ else /* madvise_remove dropped mmap_sem */
+ vma = find_vma(current->mm, start);
+ }
+}
+
+static int prctl_set_vma(unsigned long opt, unsigned long start,
+ unsigned long len_in, unsigned long arg)
+{
+ struct mm_struct *mm = current->mm;
+ int error;
+ unsigned long len;
+ unsigned long end;
+
+ if (start & ~PAGE_MASK)
+ return -EINVAL;
+ len = (len_in + ~PAGE_MASK) & PAGE_MASK;
+
+ /* Check to see whether len was rounded up from small -ve to zero */
+ if (len_in && !len)
+ return -EINVAL;
+
+ end = start + len;
+ if (end < start)
+ return -EINVAL;
+
+ if (end == start)
+ return 0;
+
+ down_write(&mm->mmap_sem);
+
+ switch (opt) {
+ case PR_SET_VMA_ANON_NAME:
+ error = prctl_set_vma_anon_name(start, end, arg);
+ break;
+ default:
+ error = -EINVAL;
+ }
+
+ up_write(&mm->mmap_sem);
+
+ return error;
+}
+#else /* CONFIG_MMU */
+static int prctl_set_vma(unsigned long opt, unsigned long start,
+ unsigned long len_in, unsigned long arg)
+{
+ return -EINVAL;
+}
+#endif
+
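The length sanitization in prctl_set_vma() rounds len_in up to a page boundary and rejects lengths that wrap to zero; a small sketch of that arithmetic (userspace C, assuming a 4 KiB page size):

    #include <stdio.h>

    #define PAGE_SIZE 4096UL
    #define PAGE_MASK (~(PAGE_SIZE - 1))

    int main(void)
    {
        unsigned long len_in = 5000;                       /* spans two pages */
        unsigned long len = (len_in + ~PAGE_MASK) & PAGE_MASK;

        printf("len_in=%lu rounds up to len=%lu\n", len_in, len);   /* 8192 */

        /* a huge length wraps to 0 after rounding; the len_in && !len test rejects it */
        len_in = ~0UL - 100;
        len = (len_in + ~PAGE_MASK) & PAGE_MASK;
        printf("len_in=%lu rounds to len=%lu -> -EINVAL\n", len_in, len);
        return 0;
    }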
+int ksys_prctl(int option, unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5)
{
struct task_struct *me = current;
unsigned char comm[sizeof(me->comm)];
@@ -2342,6 +2502,12 @@
case PR_SET_SECCOMP:
error = prctl_set_seccomp(arg2, (char __user *)arg3);
break;
+ case PR_ALT_SYSCALL:
+ if (arg2 == PR_ALT_SYSCALL_SET_SYSCALL_TABLE)
+ error = set_alt_sys_call_table((char __user *)arg3);
+ else
+ error = -EINVAL;
+ break;
case PR_GET_TSC:
error = GET_TSC_CTL(arg2);
break;
@@ -2476,6 +2642,9 @@
return -EINVAL;
error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
break;
+ case PR_SET_VMA:
+ error = prctl_set_vma(arg2, arg3, arg4, arg5);
+ break;
default:
error = -EINVAL;
break;
@@ -2483,8 +2652,14 @@
return error;
}
-SYSCALL_DEFINE3(getcpu, unsigned __user *, cpup, unsigned __user *, nodep,
- struct getcpu_cache __user *, unused)
+SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
+ unsigned long, arg4, unsigned long, arg5)
+{
+ return ksys_prctl(option, arg2, arg3, arg4, arg5);
+}
+
+int ksys_getcpu(unsigned __user *cpup, unsigned __user *nodep,
+ struct getcpu_cache __user *unused)
{
int err = 0;
int cpu = raw_smp_processor_id();
@@ -2496,6 +2671,12 @@
return err ? -EFAULT : 0;
}
+SYSCALL_DEFINE3(getcpu, unsigned __user *, cpup, unsigned __user *, nodep,
+ struct getcpu_cache __user *, unused)
+{
+ return ksys_getcpu(cpup, nodep, unused);
+}
+
/**
* do_sysinfo - fill in sysinfo struct
* @info: pointer to buffer to fill
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 4c4fd43..7874979 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -323,6 +323,13 @@
},
#ifdef CONFIG_SCHED_DEBUG
{
+ .procname = "sched_cstate_aware",
+ .data = &sysctl_sched_cstate_aware,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
.procname = "sched_min_granularity_ns",
.data = &sysctl_sched_min_granularity,
.maxlen = sizeof(unsigned int),
@@ -341,6 +348,13 @@
.extra2 = &max_sched_granularity_ns,
},
{
+ .procname = "sched_sync_hint_enable",
+ .data = &sysctl_sched_sync_hint_enable,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
.procname = "sched_wakeup_granularity_ns",
.data = &sysctl_sched_wakeup_granularity,
.maxlen = sizeof(unsigned int),
@@ -1573,6 +1587,15 @@
.mode = 0644,
.proc_handler = mmap_min_addr_handler,
},
+ {
+ .procname = "mmap_noexec_taint",
+ .data = &sysctl_mmap_noexec_taint,
+ .maxlen = sizeof(sysctl_mmap_noexec_taint),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
#endif
#ifdef CONFIG_NUMA
{
@@ -1644,6 +1667,13 @@
.mode = 0644,
.proc_handler = proc_doulongvec_minmax,
},
+ {
+ .procname = "min_filelist_kbytes",
+ .data = &min_filelist_kbytes,
+ .maxlen = sizeof(min_filelist_kbytes),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
#ifdef CONFIG_HAVE_ARCH_MMAP_RND_BITS
{
.procname = "mmap_rnd_bits",
@@ -1666,6 +1696,17 @@
.extra2 = (void *)&mmap_rnd_compat_bits_max,
},
#endif
+#ifdef CONFIG_DISK_BASED_SWAP
+ {
+ .procname = "disk_based_swap",
+ .data = &sysctl_disk_based_swap,
+ .maxlen = sizeof(sysctl_disk_based_swap),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+#endif
{ }
};
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 5a01c4f..0cae15e 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -1068,8 +1068,7 @@
return error;
}
-SYSCALL_DEFINE2(clock_adjtime, const clockid_t, which_clock,
- struct timex __user *, utx)
+int ksys_clock_adjtime(const clockid_t which_clock, struct timex __user * utx)
{
const struct k_clock *kc = clockid_to_kclock(which_clock);
struct timex ktx;
@@ -1091,6 +1090,12 @@
return err;
}
+SYSCALL_DEFINE2(clock_adjtime, const clockid_t, which_clock,
+ struct timex __user *, utx)
+{
+ return ksys_clock_adjtime(which_clock, utx);
+}
+
SYSCALL_DEFINE2(clock_getres, const clockid_t, which_clock,
struct __kernel_timespec __user *, tp)
{
@@ -1148,8 +1153,7 @@
#ifdef CONFIG_COMPAT
-COMPAT_SYSCALL_DEFINE2(clock_adjtime, clockid_t, which_clock,
- struct compat_timex __user *, utp)
+int compat_ksys_clock_adjtime(clockid_t which_clock, struct compat_timex __user * utp)
{
const struct k_clock *kc = clockid_to_kclock(which_clock);
struct timex ktx;
@@ -1172,6 +1176,12 @@
return err;
}
+COMPAT_SYSCALL_DEFINE2(clock_adjtime, clockid_t, which_clock,
+ struct compat_timex __user *, utp)
+{
+ return compat_ksys_clock_adjtime(which_clock, utp);
+}
+
#endif
#ifdef CONFIG_COMPAT_32BIT_TIME
diff --git a/kernel/time/time.c b/kernel/time/time.c
index f7d4fa5..e97e3ff 100644
--- a/kernel/time/time.c
+++ b/kernel/time/time.c
@@ -266,7 +266,7 @@
}
#endif
-SYSCALL_DEFINE1(adjtimex, struct timex __user *, txc_p)
+int ksys_adjtimex(struct timex __user * txc_p)
{
struct timex txc; /* Local copy of parameter */
int ret;
@@ -281,9 +281,14 @@
return copy_to_user(txc_p, &txc, sizeof(struct timex)) ? -EFAULT : ret;
}
+SYSCALL_DEFINE1(adjtimex, struct timex __user *, txc_p)
+{
+ return ksys_adjtimex(txc_p);
+}
+
#ifdef CONFIG_COMPAT
-COMPAT_SYSCALL_DEFINE1(adjtimex, struct compat_timex __user *, utp)
+int compat_ksys_adjtimex(struct compat_timex __user * utp)
{
struct timex txc;
int err, ret;
@@ -300,6 +305,12 @@
return ret;
}
+
+COMPAT_SYSCALL_DEFINE1(adjtimex, struct compat_timex __user *, utp)
+{
+ return compat_ksys_adjtimex(utp);
+}
+
#endif
/*
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 4966410..ed9a1ba 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3320,33 +3320,68 @@
}
static void
+get_total_entries_cpu(struct trace_buffer *buf, unsigned long *total,
+ unsigned long *entries, int cpu)
+{
+ unsigned long count;
+
+ count = ring_buffer_entries_cpu(buf->buffer, cpu);
+ /*
+ * If this buffer has skipped entries, then we hold all
+ * entries for the trace and we need to ignore the
+ * ones before the time stamp.
+ */
+ if (per_cpu_ptr(buf->data, cpu)->skipped_entries) {
+ count -= per_cpu_ptr(buf->data, cpu)->skipped_entries;
+ /* total is the same as the entries */
+ *total = count;
+ } else
+ *total = count +
+ ring_buffer_overrun_cpu(buf->buffer, cpu);
+ *entries = count;
+}
+
+static void
get_total_entries(struct trace_buffer *buf,
unsigned long *total, unsigned long *entries)
{
- unsigned long count;
+ unsigned long t, e;
int cpu;
*total = 0;
*entries = 0;
for_each_tracing_cpu(cpu) {
- count = ring_buffer_entries_cpu(buf->buffer, cpu);
- /*
- * If this buffer has skipped entries, then we hold all
- * entries for the trace and we need to ignore the
- * ones before the time stamp.
- */
- if (per_cpu_ptr(buf->data, cpu)->skipped_entries) {
- count -= per_cpu_ptr(buf->data, cpu)->skipped_entries;
- /* total is the same as the entries */
- *total += count;
- } else
- *total += count +
- ring_buffer_overrun_cpu(buf->buffer, cpu);
- *entries += count;
+ get_total_entries_cpu(buf, &t, &e, cpu);
+ *total += t;
+ *entries += e;
}
}
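The per-CPU helper factored out above treats a buffer that has skipped entries specially: overruns that predate the timestamp reset are ignored, so total equals entries. A small sketch of that accounting with hypothetical ring-buffer counters (not the tracing code itself):

    #include <stdio.h>

    int main(void)
    {
        /* hypothetical per-CPU ring buffer counters */
        unsigned long entries_in_buffer = 1200;
        unsigned long overrun = 300;       /* entries lost to buffer wrap */
        unsigned long skipped_entries = 0; /* set after a timestamp-based reset */

        unsigned long total, entries;

        if (skipped_entries) {
            entries = entries_in_buffer - skipped_entries;
            total = entries;               /* overruns predate the reset, ignore them */
        } else {
            entries = entries_in_buffer;
            total = entries + overrun;     /* 1500: everything the buffer ever saw */
        }

        printf("entries=%lu total=%lu\n", entries, total);
        return 0;
    }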
+unsigned long trace_total_entries_cpu(struct trace_array *tr, int cpu)
+{
+ unsigned long total, entries;
+
+ if (!tr)
+ tr = &global_trace;
+
+ get_total_entries_cpu(&tr->trace_buffer, &total, &entries, cpu);
+
+ return entries;
+}
+
+unsigned long trace_total_entries(struct trace_array *tr)
+{
+ unsigned long total, entries;
+
+ if (!tr)
+ tr = &global_trace;
+
+ get_total_entries(&tr->trace_buffer, &total, &entries);
+
+ return entries;
+}
+
static void print_lat_help_header(struct seq_file *m)
{
seq_puts(m, "# _------=> CPU# \n"
@@ -4649,7 +4684,7 @@
"place (kretprobe): [<module>:]<symbol>[+<offset>]|<memaddr>\n"
#endif
#ifdef CONFIG_UPROBE_EVENTS
- "\t place: <path>:<offset>\n"
+ " place (uprobe): <path>:<offset>[(ref_ctr_offset)]\n"
#endif
"\t args: <name>=fetcharg[:type]\n"
"\t fetcharg: %<register>, @<address>, @<symbol>[+|-<offset>],\n"
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index ee0c6a3..7cdda42 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -663,6 +663,9 @@
void tracing_iter_reset(struct trace_iterator *iter, int cpu);
+unsigned long trace_total_entries_cpu(struct trace_array *tr, int cpu);
+unsigned long trace_total_entries(struct trace_array *tr);
+
void trace_function(struct trace_array *tr,
unsigned long ip,
unsigned long parent_ip,
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index f5b3bf0..48ee92c 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -294,7 +294,8 @@
#endif /* CONFIG_KPROBE_EVENTS */
#ifdef CONFIG_UPROBE_EVENTS
-int perf_uprobe_init(struct perf_event *p_event, bool is_retprobe)
+int perf_uprobe_init(struct perf_event *p_event,
+ unsigned long ref_ctr_offset, bool is_retprobe)
{
int ret;
char *path = NULL;
@@ -314,8 +315,8 @@
goto out;
}
- tp_event = create_local_trace_uprobe(
- path, p_event->attr.probe_offset, is_retprobe);
+ tp_event = create_local_trace_uprobe(path, p_event->attr.probe_offset,
+ ref_ctr_offset, is_retprobe);
if (IS_ERR(tp_event)) {
ret = PTR_ERR(tp_event);
goto out;
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index b949c39..9be3d1d 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -451,8 +451,10 @@
switch (*next) {
case '(': /* #2 */
- if (top - op_stack > nr_parens)
- return ERR_PTR(-EINVAL);
+ if (top - op_stack > nr_parens) {
+ ret = -EINVAL;
+ goto out_free;
+ }
*(++top) = invert;
continue;
case '!': /* #3 */
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 5f52668..03b10f3 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -412,6 +412,7 @@
extern void destroy_local_trace_kprobe(struct trace_event_call *event_call);
extern struct trace_event_call *
-create_local_trace_uprobe(char *name, unsigned long offs, bool is_return);
+create_local_trace_uprobe(char *name, unsigned long offs,
+ unsigned long ref_ctr_offset, bool is_return);
extern void destroy_local_trace_uprobe(struct trace_event_call *event_call);
#endif
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 0da379b..251865d 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -47,6 +47,7 @@
struct inode *inode;
char *filename;
unsigned long offset;
+ unsigned long ref_ctr_offset;
unsigned long nhit;
struct trace_probe tp;
};
@@ -359,10 +360,10 @@
static int create_trace_uprobe(int argc, char **argv)
{
struct trace_uprobe *tu;
- char *arg, *event, *group, *filename;
+ char *arg, *event, *group, *filename, *rctr, *rctr_end;
char buf[MAX_EVENT_NAME_LEN];
struct path path;
- unsigned long offset;
+ unsigned long offset, ref_ctr_offset;
bool is_delete, is_return;
int i, ret;
@@ -371,6 +372,7 @@
is_return = false;
event = NULL;
group = NULL;
+ ref_ctr_offset = 0;
/* argc must be >= 1 */
if (argv[0][0] == '-')
@@ -445,6 +447,26 @@
goto fail_address_parse;
}
+ /* Parse reference counter offset if specified. */
+ rctr = strchr(arg, '(');
+ if (rctr) {
+ rctr_end = strchr(rctr, ')');
+ if (rctr > rctr_end || *(rctr_end + 1) != 0) {
+ ret = -EINVAL;
+ pr_info("Invalid reference counter offset.\n");
+ goto fail_address_parse;
+ }
+
+ *rctr++ = '\0';
+ *rctr_end = '\0';
+ ret = kstrtoul(rctr, 0, &ref_ctr_offset);
+ if (ret) {
+ pr_info("Invalid reference counter offset.\n");
+ goto fail_address_parse;
+ }
+ }
+
+ /* Parse uprobe offset. */
ret = kstrtoul(arg, 0, &offset);
if (ret)
goto fail_address_parse;
@@ -479,6 +501,7 @@
goto fail_address_parse;
}
tu->offset = offset;
+ tu->ref_ctr_offset = ref_ctr_offset;
tu->path = path;
tu->filename = kstrdup(filename, GFP_KERNEL);
@@ -597,6 +620,9 @@
trace_event_name(&tu->tp.call), tu->filename,
(int)(sizeof(void *) * 2), tu->offset);
+ if (tu->ref_ctr_offset)
+ seq_printf(m, "(0x%lx)", tu->ref_ctr_offset);
+
for (i = 0; i < tu->tp.nr_args; i++)
seq_printf(m, " %s=%s", tu->tp.args[i].name, tu->tp.args[i].comm);
@@ -912,7 +938,13 @@
tu->consumer.filter = filter;
tu->inode = d_real_inode(tu->path.dentry);
- ret = uprobe_register(tu->inode, tu->offset, &tu->consumer);
+ if (tu->ref_ctr_offset) {
+ ret = uprobe_register_refctr(tu->inode, tu->offset,
+ tu->ref_ctr_offset, &tu->consumer);
+ } else {
+ ret = uprobe_register(tu->inode, tu->offset, &tu->consumer);
+ }
+
if (ret)
goto err_buffer;
@@ -1347,7 +1379,8 @@
#ifdef CONFIG_PERF_EVENTS
struct trace_event_call *
-create_local_trace_uprobe(char *name, unsigned long offs, bool is_return)
+create_local_trace_uprobe(char *name, unsigned long offs,
+ unsigned long ref_ctr_offset, bool is_return)
{
struct trace_uprobe *tu;
struct path path;
@@ -1379,6 +1412,7 @@
tu->offset = offs;
tu->path = path;
+ tu->ref_ctr_offset = ref_ctr_offset;
tu->filename = kstrdup(name, GFP_KERNEL);
init_trace_event_call(tu, &tu->tp.call);
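With the parsing added above, a uprobe event definition may append an optional reference-counter (SDT semaphore) offset in parentheses after the probe offset. An illustrative definition (library path and offsets are made up) written to /sys/kernel/debug/tracing/uprobe_events:

    p:sdt_test /usr/lib/libfoo.so:0x4710(0x10036)

The value in parentheses is stored in tu->ref_ctr_offset and, when the event is enabled, the probe is registered through uprobe_register_refctr() instead of uprobe_register().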
diff --git a/kernel/user.c b/kernel/user.c
index 0df9b16..7f74a8a 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -17,6 +17,7 @@
#include <linux/interrupt.h>
#include <linux/export.h>
#include <linux/user_namespace.h>
+#include <linux/proc_fs.h>
#include <linux/proc_ns.h>
/*
@@ -208,6 +209,7 @@
}
spin_unlock_irq(&uidhash_lock);
}
+ proc_register_uid(uid);
return up;
@@ -229,6 +231,7 @@
spin_lock_irq(&uidhash_lock);
uid_hash_insert(&root_user, uidhashentry(GLOBAL_ROOT_UID));
spin_unlock_irq(&uidhash_lock);
+ proc_register_uid(GLOBAL_ROOT_UID);
return 0;
}
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 923414a..4f5b7f3 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -1322,6 +1322,7 @@
.owner = userns_owner,
.get_parent = ns_get_owner,
};
+EXPORT_SYMBOL(userns_operations);
static __init int user_namespaces_init(void)
{
diff --git a/lib/dynamic_debug.c b/lib/dynamic_debug.c
index dbf2b45..c7c96bc 100644
--- a/lib/dynamic_debug.c
+++ b/lib/dynamic_debug.c
@@ -188,7 +188,7 @@
newflags = (dp->flags & mask) | flags;
if (newflags == dp->flags)
continue;
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
if (dp->flags & _DPRINTK_FLAGS_PRINT) {
if (!(flags & _DPRINTK_FLAGS_PRINT))
static_branch_disable(&dp->key.dd_key_true);
diff --git a/lib/list_sort.c b/lib/list_sort.c
index 8575992..52f0c25 100644
--- a/lib/list_sort.c
+++ b/lib/list_sort.c
@@ -7,33 +7,41 @@
#include <linux/list_sort.h>
#include <linux/list.h>
-#define MAX_LIST_LENGTH_BITS 20
+typedef int __attribute__((nonnull(2,3))) (*cmp_func)(void *,
+ struct list_head const *, struct list_head const *);
/*
* Returns a list organized in an intermediate format suited
* to chaining of merge() calls: null-terminated, no reserved or
* sentinel head node, "prev" links not maintained.
*/
-static struct list_head *merge(void *priv,
- int (*cmp)(void *priv, struct list_head *a,
- struct list_head *b),
+__attribute__((nonnull(2,3,4)))
+static struct list_head *merge(void *priv, cmp_func cmp,
struct list_head *a, struct list_head *b)
{
- struct list_head head, *tail = &head;
+ struct list_head *head, **tail = &head;
- while (a && b) {
+ for (;;) {
/* if equal, take 'a' -- important for sort stability */
- if ((*cmp)(priv, a, b) <= 0) {
- tail->next = a;
+ if (cmp(priv, a, b) <= 0) {
+ *tail = a;
+ tail = &a->next;
a = a->next;
+ if (!a) {
+ *tail = b;
+ break;
+ }
} else {
- tail->next = b;
+ *tail = b;
+ tail = &b->next;
b = b->next;
+ if (!b) {
+ *tail = a;
+ break;
+ }
}
- tail = tail->next;
}
- tail->next = a?:b;
- return head.next;
+ return head;
}
/*
@@ -43,44 +51,52 @@
* prev-link restoration pass, or maintaining the prev links
* throughout.
*/
-static void merge_and_restore_back_links(void *priv,
- int (*cmp)(void *priv, struct list_head *a,
- struct list_head *b),
- struct list_head *head,
- struct list_head *a, struct list_head *b)
+__attribute__((nonnull(2,3,4,5)))
+static void merge_final(void *priv, cmp_func cmp, struct list_head *head,
+ struct list_head *a, struct list_head *b)
{
struct list_head *tail = head;
u8 count = 0;
- while (a && b) {
+ for (;;) {
/* if equal, take 'a' -- important for sort stability */
- if ((*cmp)(priv, a, b) <= 0) {
+ if (cmp(priv, a, b) <= 0) {
tail->next = a;
a->prev = tail;
+ tail = a;
a = a->next;
+ if (!a)
+ break;
} else {
tail->next = b;
b->prev = tail;
+ tail = b;
b = b->next;
+ if (!b) {
+ b = a;
+ break;
+ }
}
- tail = tail->next;
}
- tail->next = a ? : b;
+ /* Finish linking remainder of list b on to tail */
+ tail->next = b;
do {
/*
- * In worst cases this loop may run many iterations.
+ * If the merge is highly unbalanced (e.g. the input is
+ * already sorted), this loop may run many iterations.
* Continue callbacks to the client even though no
* element comparison is needed, so the client's cmp()
* routine can invoke cond_resched() periodically.
*/
- if (unlikely(!(++count)))
- (*cmp)(priv, tail->next, tail->next);
+ if (unlikely(!++count))
+ cmp(priv, b, b);
+ b->prev = tail;
+ tail = b;
+ b = b->next;
+ } while (b);
- tail->next->prev = tail;
- tail = tail->next;
- } while (tail->next);
-
+ /* And the final links to make a circular doubly-linked list */
tail->next = head;
head->prev = tail;
}
@@ -91,55 +107,152 @@
* @head: the list to sort
* @cmp: the elements comparison function
*
- * This function implements "merge sort", which has O(nlog(n))
- * complexity.
+ * The comparison function @cmp must return > 0 if @a should sort after
+ * @b ("@a > @b" if you want an ascending sort), and <= 0 if @a should
+ * sort before @b *or* their original order should be preserved. It is
+ * always called with the element that came first in the input in @a,
+ * and list_sort is a stable sort, so it is not necessary to distinguish
+ * the @a < @b and @a == @b cases.
*
- * The comparison function @cmp must return a negative value if @a
- * should sort before @b, and a positive value if @a should sort after
- * @b. If @a and @b are equivalent, and their original relative
- * ordering is to be preserved, @cmp must return 0.
+ * This is compatible with two styles of @cmp function:
+ * - The traditional style which returns <0 / =0 / >0, or
+ * - Returning a boolean 0/1.
+ * The latter offers a chance to save a few cycles in the comparison
+ * (which is used by e.g. plug_ctx_cmp() in block/blk-mq.c).
+ *
+ * A good way to write a multi-word comparison is::
+ *
+ * if (a->high != b->high)
+ * return a->high > b->high;
+ * if (a->middle != b->middle)
+ * return a->middle > b->middle;
+ * return a->low > b->low;
+ *
+ *
+ * This mergesort is as eager as possible while always performing at least
+ * 2:1 balanced merges. Given two pending sublists of size 2^k, they are
+ * merged to a size-2^(k+1) list as soon as we have 2^k following elements.
+ *
+ * Thus, it will avoid cache thrashing as long as 3*2^k elements can
+ * fit into the cache. Not quite as good as a fully-eager bottom-up
+ * mergesort, but it does use 0.2*n fewer comparisons, so is faster in
+ * the common case that everything fits into L1.
+ *
+ *
+ * The merging is controlled by "count", the number of elements in the
+ * pending lists. This is beautifully simple code, but rather subtle.
+ *
+ * Each time we increment "count", we set one bit (bit k) and clear
+ * bits k-1 .. 0. Each time this happens (except the very first time
+ * for each bit, when count increments to 2^k), we merge two lists of
+ * size 2^k into one list of size 2^(k+1).
+ *
+ * This merge happens exactly when the count reaches an odd multiple of
+ * 2^k, which is when we have 2^k elements pending in smaller lists,
+ * so it's safe to merge away two lists of size 2^k.
+ *
+ * After this happens twice, we have created two lists of size 2^(k+1),
+ * which will be merged into a list of size 2^(k+2) before we create
+ * a third list of size 2^(k+1), so there are never more than two pending.
+ *
+ * The number of pending lists of size 2^k is determined by the
+ * state of bit k of "count" plus two extra pieces of information:
+ *
+ * - The state of bit k-1 (when k == 0, consider bit -1 always set), and
+ * - Whether the higher-order bits are zero or non-zero (i.e.
+ * is count >= 2^(k+1)).
+ *
+ * There are six states we distinguish. "x" represents some arbitrary
+ * bits, and "y" represents some arbitrary non-zero bits:
+ * 0: 00x: 0 pending of size 2^k; x pending of sizes < 2^k
+ * 1: 01x: 0 pending of size 2^k; 2^(k-1) + x pending of sizes < 2^k
+ * 2: x10x: 0 pending of size 2^k; 2^k + x pending of sizes < 2^k
+ * 3: x11x: 1 pending of size 2^k; 2^(k-1) + x pending of sizes < 2^k
+ * 4: y00x: 1 pending of size 2^k; 2^k + x pending of sizes < 2^k
+ * 5: y01x: 2 pending of size 2^k; 2^(k-1) + x pending of sizes < 2^k
+ * (merge and loop back to state 2)
+ *
+ * We gain lists of size 2^k in the 2->3 and 4->5 transitions (because
+ * bit k-1 is set while the more significant bits are non-zero) and
+ * merge them away in the 5->2 transition. Note in particular that just
+ * before the 5->2 transition, all lower-order bits are 11 (state 3),
+ * so there is one list of each smaller size.
+ *
+ * When we reach the end of the input, we merge all the pending
+ * lists, from smallest to largest. If you work through cases 2 to
+ * 5 above, you can see that the number of elements we merge with a list
+ * of size 2^k varies from 2^(k-1) (cases 3 and 5 when x == 0) to
+ * 2^(k+1) - 1 (second merge of case 5 when x == 2^(k-1) - 1).
*/
+__attribute__((nonnull(2,3)))
void list_sort(void *priv, struct list_head *head,
int (*cmp)(void *priv, struct list_head *a,
struct list_head *b))
{
- struct list_head *part[MAX_LIST_LENGTH_BITS+1]; /* sorted partial lists
- -- last slot is a sentinel */
- int lev; /* index into part[] */
- int max_lev = 0;
- struct list_head *list;
+ struct list_head *list = head->next, *pending = NULL;
+ size_t count = 0; /* Count of pending */
- if (list_empty(head))
+ if (list == head->prev) /* Zero or one elements */
return;
- memset(part, 0, sizeof(part));
-
+ /* Convert to a null-terminated singly-linked list. */
head->prev->next = NULL;
- list = head->next;
- while (list) {
- struct list_head *cur = list;
+ /*
+ * Data structure invariants:
+ * - All lists are singly linked and null-terminated; prev
+ * pointers are not maintained.
+ * - pending is a prev-linked "list of lists" of sorted
+ * sublists awaiting further merging.
+ * - Each of the sorted sublists is power-of-two in size.
+ * - Sublists are sorted by size and age, smallest & newest at front.
+ * - There are zero to two sublists of each size.
+ * - A pair of pending sublists are merged as soon as the number
+ * of following pending elements equals their size (i.e.
+ * each time count reaches an odd multiple of that size).
+ * That ensures each later final merge will be at worst 2:1.
+ * - Each round consists of:
+ * - Merging the two sublists selected by the highest bit
+ * which flips when count is incremented, and
+ * - Adding an element from the input as a size-1 sublist.
+ */
+ do {
+ size_t bits;
+ struct list_head **tail = &pending;
+
+ /* Find the least-significant clear bit in count */
+ for (bits = count; bits & 1; bits >>= 1)
+ tail = &(*tail)->prev;
+ /* Do the indicated merge */
+ if (likely(bits)) {
+ struct list_head *a = *tail, *b = a->prev;
+
+ a = merge(priv, (cmp_func)cmp, b, a);
+ /* Install the merged result in place of the inputs */
+ a->prev = b->prev;
+ *tail = a;
+ }
+
+ /* Move one element from input list to pending */
+ list->prev = pending;
+ pending = list;
list = list->next;
- cur->next = NULL;
+ pending->next = NULL;
+ count++;
+ } while (list);
- for (lev = 0; part[lev]; lev++) {
- cur = merge(priv, cmp, part[lev], cur);
- part[lev] = NULL;
- }
- if (lev > max_lev) {
- if (unlikely(lev >= ARRAY_SIZE(part)-1)) {
- printk_once(KERN_DEBUG "list too long for efficiency\n");
- lev--;
- }
- max_lev = lev;
- }
- part[lev] = cur;
+ /* End of input; merge together all the pending lists. */
+ list = pending;
+ pending = pending->prev;
+ for (;;) {
+ struct list_head *next = pending->prev;
+
+ if (!next)
+ break;
+ list = merge(priv, (cmp_func)cmp, pending, list);
+ pending = next;
}
-
- for (lev = 0; lev < max_lev; lev++)
- if (part[lev])
- list = merge(priv, cmp, part[lev], list);
-
- merge_and_restore_back_links(priv, cmp, head, part[max_lev], list);
+ /* The final merge, rebuilding prev links */
+ merge_final(priv, (cmp_func)cmp, head, pending, list);
}
EXPORT_SYMBOL(list_sort);
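As the rewritten kernel-doc above notes, @cmp only has to report whether @a sorts after @b, so a boolean comparison is enough. A minimal sketch of a caller against the 4.19 prototype (struct item and its list head are illustrative):

    #include <linux/list.h>
    #include <linux/list_sort.h>

    struct item {
            struct list_head node;
            int key;
    };

    /* Return 1 when a should sort after b; ties keep their input order. */
    static int item_cmp(void *priv, struct list_head *a, struct list_head *b)
    {
            return list_entry(a, struct item, node)->key >
                   list_entry(b, struct item, node)->key;
    }

    /* ... list_sort(NULL, &item_list, item_cmp); ... */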
diff --git a/lib/lzo/lzo1x_compress.c b/lib/lzo/lzo1x_compress.c
index 236eb21..4525fb0 100644
--- a/lib/lzo/lzo1x_compress.c
+++ b/lib/lzo/lzo1x_compress.c
@@ -20,7 +20,8 @@
static noinline size_t
lzo1x_1_do_compress(const unsigned char *in, size_t in_len,
unsigned char *out, size_t *out_len,
- size_t ti, void *wrkmem)
+ size_t ti, void *wrkmem, signed char *state_offset,
+ const unsigned char bitstream_version)
{
const unsigned char *ip;
unsigned char *op;
@@ -35,27 +36,85 @@
ip += ti < 4 ? 4 - ti : 0;
for (;;) {
- const unsigned char *m_pos;
+ const unsigned char *m_pos = NULL;
size_t t, m_len, m_off;
u32 dv;
+ u32 run_length = 0;
literal:
ip += 1 + ((ip - ii) >> 5);
next:
if (unlikely(ip >= ip_end))
break;
dv = get_unaligned_le32(ip);
- t = ((dv * 0x1824429d) >> (32 - D_BITS)) & D_MASK;
- m_pos = in + dict[t];
- dict[t] = (lzo_dict_t) (ip - in);
- if (unlikely(dv != get_unaligned_le32(m_pos)))
- goto literal;
+
+ if (dv == 0 && bitstream_version) {
+ const unsigned char *ir = ip + 4;
+ const unsigned char *limit = ip_end
+ < (ip + MAX_ZERO_RUN_LENGTH + 1)
+ ? ip_end : ip + MAX_ZERO_RUN_LENGTH + 1;
+#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && \
+ defined(LZO_FAST_64BIT_MEMORY_ACCESS)
+ u64 dv64;
+
+ for (; (ir + 32) <= limit; ir += 32) {
+ dv64 = get_unaligned((u64 *)ir);
+ dv64 |= get_unaligned((u64 *)ir + 1);
+ dv64 |= get_unaligned((u64 *)ir + 2);
+ dv64 |= get_unaligned((u64 *)ir + 3);
+ if (dv64)
+ break;
+ }
+ for (; (ir + 8) <= limit; ir += 8) {
+ dv64 = get_unaligned((u64 *)ir);
+ if (dv64) {
+# if defined(__LITTLE_ENDIAN)
+ ir += __builtin_ctzll(dv64) >> 3;
+# elif defined(__BIG_ENDIAN)
+ ir += __builtin_clzll(dv64) >> 3;
+# else
+# error "missing endian definition"
+# endif
+ break;
+ }
+ }
+#else
+ while ((ir < (const unsigned char *)
+ ALIGN((uintptr_t)ir, 4)) &&
+ (ir < limit) && (*ir == 0))
+ ir++;
+ for (; (ir + 4) <= limit; ir += 4) {
+ dv = *((u32 *)ir);
+ if (dv) {
+# if defined(__LITTLE_ENDIAN)
+ ir += __builtin_ctz(dv) >> 3;
+# elif defined(__BIG_ENDIAN)
+ ir += __builtin_clz(dv) >> 3;
+# else
+# error "missing endian definition"
+# endif
+ break;
+ }
+ }
+#endif
+ while (likely(ir < limit) && unlikely(*ir == 0))
+ ir++;
+ run_length = ir - ip;
+ if (run_length > MAX_ZERO_RUN_LENGTH)
+ run_length = MAX_ZERO_RUN_LENGTH;
+ } else {
+ t = ((dv * 0x1824429d) >> (32 - D_BITS)) & D_MASK;
+ m_pos = in + dict[t];
+ dict[t] = (lzo_dict_t) (ip - in);
+ if (unlikely(dv != get_unaligned_le32(m_pos)))
+ goto literal;
+ }
ii -= ti;
ti = 0;
t = ip - ii;
if (t != 0) {
if (t <= 3) {
- op[-2] |= t;
+ op[*state_offset] |= t;
COPY4(op, ii);
op += t;
} else if (t <= 16) {
@@ -88,6 +147,17 @@
}
}
+ if (unlikely(run_length)) {
+ ip += run_length;
+ run_length -= MIN_ZERO_RUN_LENGTH;
+ put_unaligned_le32((run_length << 21) | 0xfffc18
+ | (run_length & 0x7), op);
+ op += 4;
+ run_length = 0;
+ *state_offset = -3;
+ goto finished_writing_instruction;
+ }
+
m_len = 4;
{
#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && defined(LZO_USE_CTZ64)
@@ -170,7 +240,6 @@
m_off = ip - m_pos;
ip += m_len;
- ii = ip;
if (m_len <= M2_MAX_LEN && m_off <= M2_MAX_OFFSET) {
m_off -= 1;
*op++ = (((m_len - 1) << 5) | ((m_off & 7) << 2));
@@ -207,29 +276,45 @@
*op++ = (m_off << 2);
*op++ = (m_off >> 6);
}
+ *state_offset = -2;
+finished_writing_instruction:
+ ii = ip;
goto next;
}
*out_len = op - out;
return in_end - (ii - ti);
}
-int lzo1x_1_compress(const unsigned char *in, size_t in_len,
+int lzogeneric1x_1_compress(const unsigned char *in, size_t in_len,
unsigned char *out, size_t *out_len,
- void *wrkmem)
+ void *wrkmem, const unsigned char bitstream_version)
{
const unsigned char *ip = in;
unsigned char *op = out;
size_t l = in_len;
size_t t = 0;
+ signed char state_offset = -2;
+ unsigned int m4_max_offset;
+
+ // LZO v0 will never write 17 as first byte,
+ // so this is used to version the bitstream
+ if (bitstream_version > 0) {
+ *op++ = 17;
+ *op++ = bitstream_version;
+ m4_max_offset = M4_MAX_OFFSET_V1;
+ } else {
+ m4_max_offset = M4_MAX_OFFSET_V0;
+ }
while (l > 20) {
- size_t ll = l <= (M4_MAX_OFFSET + 1) ? l : (M4_MAX_OFFSET + 1);
+ size_t ll = l <= (m4_max_offset + 1) ? l : (m4_max_offset + 1);
uintptr_t ll_end = (uintptr_t) ip + ll;
if ((ll_end + ((t + ll) >> 5)) <= ll_end)
break;
BUILD_BUG_ON(D_SIZE * sizeof(lzo_dict_t) > LZO1X_1_MEM_COMPRESS);
memset(wrkmem, 0, D_SIZE * sizeof(lzo_dict_t));
- t = lzo1x_1_do_compress(ip, ll, op, out_len, t, wrkmem);
+ t = lzo1x_1_do_compress(ip, ll, op, out_len, t, wrkmem,
+ &state_offset, bitstream_version);
ip += ll;
op += *out_len;
l -= ll;
@@ -242,7 +327,7 @@
if (op == out && t <= 238) {
*op++ = (17 + t);
} else if (t <= 3) {
- op[-2] |= t;
+ op[state_offset] |= t;
} else if (t <= 18) {
*op++ = (t - 3);
} else {
@@ -273,7 +358,24 @@
*out_len = op - out;
return LZO_E_OK;
}
+
+int lzo1x_1_compress(const unsigned char *in, size_t in_len,
+ unsigned char *out, size_t *out_len,
+ void *wrkmem)
+{
+ return lzogeneric1x_1_compress(in, in_len, out, out_len, wrkmem, 0);
+}
+
+int lzorle1x_1_compress(const unsigned char *in, size_t in_len,
+ unsigned char *out, size_t *out_len,
+ void *wrkmem)
+{
+ return lzogeneric1x_1_compress(in, in_len, out, out_len,
+ wrkmem, LZO_VERSION);
+}
+
EXPORT_SYMBOL_GPL(lzo1x_1_compress);
+EXPORT_SYMBOL_GPL(lzorle1x_1_compress);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("LZO1X-1 Compressor");
diff --git a/lib/lzo/lzo1x_decompress_safe.c b/lib/lzo/lzo1x_decompress_safe.c
index a1c387f..6d2600e 100644
--- a/lib/lzo/lzo1x_decompress_safe.c
+++ b/lib/lzo/lzo1x_decompress_safe.c
@@ -46,11 +46,23 @@
const unsigned char * const ip_end = in + in_len;
unsigned char * const op_end = out + *out_len;
+ unsigned char bitstream_version;
+
op = out;
ip = in;
if (unlikely(in_len < 3))
goto input_overrun;
+
+ if (likely(*ip == 17)) {
+ bitstream_version = ip[1];
+ ip += 2;
+ if (unlikely(in_len < 5))
+ goto input_overrun;
+ } else {
+ bitstream_version = 0;
+ }
+
if (*ip > 17) {
t = *ip++ - 17;
if (t < 4) {
@@ -154,32 +166,49 @@
m_pos -= next >> 2;
next &= 3;
} else {
- m_pos = op;
- m_pos -= (t & 8) << 11;
- t = (t & 7) + (3 - 1);
- if (unlikely(t == 2)) {
- size_t offset;
- const unsigned char *ip_last = ip;
-
- while (unlikely(*ip == 0)) {
- ip++;
- NEED_IP(1);
- }
- offset = ip - ip_last;
- if (unlikely(offset > MAX_255_COUNT))
- return LZO_E_ERROR;
-
- offset = (offset << 8) - offset;
- t += offset + 7 + *ip++;
- NEED_IP(2);
- }
+ NEED_IP(2);
next = get_unaligned_le16(ip);
- ip += 2;
- m_pos -= next >> 2;
- next &= 3;
- if (m_pos == op)
- goto eof_found;
- m_pos -= 0x4000;
+ if (((next & 0xfffc) == 0xfffc) &&
+ ((t & 0xf8) == 0x18) &&
+ likely(bitstream_version)) {
+ NEED_IP(3);
+ t &= 7;
+ t |= ip[2] << 3;
+ t += MIN_ZERO_RUN_LENGTH;
+ NEED_OP(t);
+ memset(op, 0, t);
+ op += t;
+ next &= 3;
+ ip += 3;
+ goto match_next;
+ } else {
+ m_pos = op;
+ m_pos -= (t & 8) << 11;
+ t = (t & 7) + (3 - 1);
+ if (unlikely(t == 2)) {
+ size_t offset;
+ const unsigned char *ip_last = ip;
+
+ while (unlikely(*ip == 0)) {
+ ip++;
+ NEED_IP(1);
+ }
+ offset = ip - ip_last;
+ if (unlikely(offset > MAX_255_COUNT))
+ return LZO_E_ERROR;
+
+ offset = (offset << 8) - offset;
+ t += offset + 7 + *ip++;
+ NEED_IP(2);
+ next = get_unaligned_le16(ip);
+ }
+ ip += 2;
+ m_pos -= next >> 2;
+ next &= 3;
+ if (m_pos == op)
+ goto eof_found;
+ m_pos -= 0x4000;
+ }
}
TEST_LB(m_pos);
#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)
diff --git a/lib/lzo/lzodefs.h b/lib/lzo/lzodefs.h
index 4edefd2..3b46f5f4 100644
--- a/lib/lzo/lzodefs.h
+++ b/lib/lzo/lzodefs.h
@@ -13,9 +13,15 @@
*/
+/* Version
+ * 0: original lzo version
+ * 1: lzo with support for RLE
+ */
+#define LZO_VERSION 1
+
#define COPY4(dst, src) \
put_unaligned(get_unaligned((const u32 *)(src)), (u32 *)(dst))
-#if defined(__x86_64__)
+#if defined(CONFIG_X86_64)
#define COPY8(dst, src) \
put_unaligned(get_unaligned((const u64 *)(src)), (u64 *)(dst))
#else
@@ -25,19 +31,21 @@
#if defined(__BIG_ENDIAN) && defined(__LITTLE_ENDIAN)
#error "conflicting endian definitions"
-#elif defined(__x86_64__)
+#elif defined(CONFIG_X86_64)
#define LZO_USE_CTZ64 1
#define LZO_USE_CTZ32 1
-#elif defined(__i386__) || defined(__powerpc__)
+#define LZO_FAST_64BIT_MEMORY_ACCESS
+#elif defined(CONFIG_X86) || defined(CONFIG_PPC)
#define LZO_USE_CTZ32 1
-#elif defined(__arm__) && (__LINUX_ARM_ARCH__ >= 5)
+#elif defined(CONFIG_ARM) && (__LINUX_ARM_ARCH__ >= 5)
#define LZO_USE_CTZ32 1
#endif
#define M1_MAX_OFFSET 0x0400
#define M2_MAX_OFFSET 0x0800
#define M3_MAX_OFFSET 0x4000
-#define M4_MAX_OFFSET 0xbfff
+#define M4_MAX_OFFSET_V0 0xbfff
+#define M4_MAX_OFFSET_V1 0xbffe
#define M1_MIN_LEN 2
#define M1_MAX_LEN 2
@@ -53,6 +61,9 @@
#define M3_MARKER 32
#define M4_MARKER 16
+#define MIN_ZERO_RUN_LENGTH 4
+#define MAX_ZERO_RUN_LENGTH (2047 + MIN_ZERO_RUN_LENGTH)
+
#define lzo_dict_t unsigned short
#define D_BITS 13
#define D_SIZE (1u << D_BITS)
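Tying the compressor and decompressor changes together: a version-1 stream begins with the bytes 17 and 1, and a run of L zero bytes (MIN_ZERO_RUN_LENGTH <= L <= MAX_ZERO_RUN_LENGTH) is emitted as the little-endian word ((L - 4) << 21) | 0xfffc18 | ((L - 4) & 0x7). As a worked example, a run of 20 zeros stores R = 16:

    (16 << 21) | 0xfffc18 | (16 & 7) = 0x02fffc18  ->  bytes 18 fc ff 02 on the wire

The decoder recognises the pattern because the instruction byte is 0x18 | (R & 7) and the following 16-bit word is 0xfffc; it then rebuilds R from the low three bits of the instruction byte plus ip[2] << 3 and adds MIN_ZERO_RUN_LENGTH back to recover L = 20.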
diff --git a/lib/sort.c b/lib/sort.c
index d6b7a20..d54cf97 100644
--- a/lib/sort.c
+++ b/lib/sort.c
@@ -1,8 +1,13 @@
// SPDX-License-Identifier: GPL-2.0
/*
- * A fast, small, non-recursive O(nlog n) sort for the Linux kernel
+ * A fast, small, non-recursive O(n log n) sort for the Linux kernel
*
- * Jan 23 2005 Matt Mackall <mpm@selenic.com>
+ * This performs n*log2(n) + 0.37*n + o(n) comparisons on average,
+ * and 1.5*n*log2(n) + O(n) in the (very contrived) worst case.
+ *
+ * Glibc qsort() manages n*log2(n) - 1.26*n for random inputs (1.63*n
+ * better) at the expense of stack usage and much larger code to avoid
+ * quicksort's O(n^2) worst case.
*/
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
@@ -11,96 +16,262 @@
#include <linux/export.h>
#include <linux/sort.h>
-static int alignment_ok(const void *base, int align)
+/**
+ * is_aligned - is this pointer & size okay for word-wide copying?
+ * @base: pointer to data
+ * @size: size of each element
+ * @align: required alignment (typically 4 or 8)
+ *
+ * Returns true if elements can be copied using word loads and stores.
+ * The size must be a multiple of the alignment, and the base address must
+ * be if we do not have CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.
+ *
+ * For some reason, gcc doesn't know to optimize "if (a & mask || b & mask)"
+ * to "if ((a | b) & mask)", so we do that by hand.
+ */
+__attribute_const__ __always_inline
+static bool is_aligned(const void *base, size_t size, unsigned char align)
{
- return IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) ||
- ((unsigned long)base & (align - 1)) == 0;
-}
+ unsigned char lsbits = (unsigned char)size;
-static void u32_swap(void *a, void *b, int size)
-{
- u32 t = *(u32 *)a;
- *(u32 *)a = *(u32 *)b;
- *(u32 *)b = t;
-}
-
-static void u64_swap(void *a, void *b, int size)
-{
- u64 t = *(u64 *)a;
- *(u64 *)a = *(u64 *)b;
- *(u64 *)b = t;
-}
-
-static void generic_swap(void *a, void *b, int size)
-{
- char t;
-
- do {
- t = *(char *)a;
- *(char *)a++ = *(char *)b;
- *(char *)b++ = t;
- } while (--size > 0);
+ (void)base;
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+ lsbits |= (unsigned char)(uintptr_t)base;
+#endif
+ return (lsbits & (align - 1)) == 0;
}
/**
- * sort - sort an array of elements
+ * swap_words_32 - swap two elements in 32-bit chunks
+ * @a: pointer to the first element to swap
+ * @b: pointer to the second element to swap
+ * @n: element size (must be a multiple of 4)
+ *
+ * Exchange the two objects in memory. This exploits base+index addressing,
+ * which basically all CPUs have, to minimize loop overhead computations.
+ *
+ * For some reason, on x86 gcc 7.3.0 adds a redundant test of n at the
+ * bottom of the loop, even though the zero flag is still valid from the
+ * subtract (since the intervening mov instructions don't alter the flags).
+ * Gcc 8.1.0 doesn't have that problem.
+ */
+static void swap_words_32(void *a, void *b, size_t n)
+{
+ do {
+ u32 t = *(u32 *)(a + (n -= 4));
+ *(u32 *)(a + n) = *(u32 *)(b + n);
+ *(u32 *)(b + n) = t;
+ } while (n);
+}
+
+/**
+ * swap_words_64 - swap two elements in 64-bit chunks
+ * @a: pointer to the first element to swap
+ * @b: pointer to the second element to swap
+ * @n: element size (must be a multiple of 8)
+ *
+ * Exchange the two objects in memory. This exploits base+index
+ * addressing, which basically all CPUs have, to minimize loop overhead
+ * computations.
+ *
+ * We'd like to use 64-bit loads if possible. If they're not, emulating
+ * one requires base+index+4 addressing which x86 has but most other
+ * processors do not. If CONFIG_64BIT, we definitely have 64-bit loads,
+ * but it's possible to have 64-bit loads without 64-bit pointers (e.g.
+ * x32 ABI). Are there any cases the kernel needs to worry about?
+ */
+static void swap_words_64(void *a, void *b, size_t n)
+{
+ do {
+#ifdef CONFIG_64BIT
+ u64 t = *(u64 *)(a + (n -= 8));
+ *(u64 *)(a + n) = *(u64 *)(b + n);
+ *(u64 *)(b + n) = t;
+#else
+ /* Use two 32-bit transfers to avoid base+index+4 addressing */
+ u32 t = *(u32 *)(a + (n -= 4));
+ *(u32 *)(a + n) = *(u32 *)(b + n);
+ *(u32 *)(b + n) = t;
+
+ t = *(u32 *)(a + (n -= 4));
+ *(u32 *)(a + n) = *(u32 *)(b + n);
+ *(u32 *)(b + n) = t;
+#endif
+ } while (n);
+}
+
+/**
+ * swap_bytes - swap two elements a byte at a time
+ * @a: pointer to the first element to swap
+ * @b: pointer to the second element to swap
+ * @n: element size
+ *
+ * This is the fallback if alignment doesn't allow using larger chunks.
+ */
+static void swap_bytes(void *a, void *b, size_t n)
+{
+ do {
+ char t = ((char *)a)[--n];
+ ((char *)a)[n] = ((char *)b)[n];
+ ((char *)b)[n] = t;
+ } while (n);
+}
+
+typedef void (*swap_func_t)(void *a, void *b, int size);
+
+/*
+ * The values are arbitrary as long as they can't be confused with
+ * a pointer, but small integers make for the smallest compare
+ * instructions.
+ */
+#define SWAP_WORDS_64 (swap_func_t)0
+#define SWAP_WORDS_32 (swap_func_t)1
+#define SWAP_BYTES (swap_func_t)2
+
+/*
+ * The function pointer is last to make tail calls most efficient if the
+ * compiler decides not to inline this function.
+ */
+static void do_swap(void *a, void *b, size_t size, swap_func_t swap_func)
+{
+ if (swap_func == SWAP_WORDS_64)
+ swap_words_64(a, b, size);
+ else if (swap_func == SWAP_WORDS_32)
+ swap_words_32(a, b, size);
+ else if (swap_func == SWAP_BYTES)
+ swap_bytes(a, b, size);
+ else
+ swap_func(a, b, (int)size);
+}
+
+typedef int (*cmp_func_t)(const void *, const void *);
+typedef int (*cmp_r_func_t)(const void *, const void *, const void *);
+#define _CMP_WRAPPER ((cmp_r_func_t)0L)
+
+static int do_cmp(const void *a, const void *b,
+ cmp_r_func_t cmp, const void *priv)
+{
+ if (cmp == _CMP_WRAPPER)
+ return ((cmp_func_t)(priv))(a, b);
+ return cmp(a, b, priv);
+}
+
+/**
+ * parent - given the offset of the child, find the offset of the parent.
+ * @i: the offset of the heap element whose parent is sought. Non-zero.
+ * @lsbit: a precomputed 1-bit mask, equal to "size & -size"
+ * @size: size of each element
+ *
+ * In terms of array indexes, the parent of element j = @i/@size is simply
+ * (j-1)/2. But when working in byte offsets, we can't use implicit
+ * truncation of integer divides.
+ *
+ * Fortunately, we only need one bit of the quotient, not the full divide.
+ * @size has a least significant bit. That bit will be clear if @i is
+ * an even multiple of @size, and set if it's an odd multiple.
+ *
+ * Logically, we're doing "if (i & lsbit) i -= size;", but since the
+ * branch is unpredictable, it's done with a bit of clever branch-free
+ * code instead.
+ */
+__attribute_const__ __always_inline
+static size_t parent(size_t i, unsigned int lsbit, size_t size)
+{
+ i -= size;
+ i -= size & -(i & lsbit);
+ return i / 2;
+}
+
+/**
+ * sort_r - sort an array of elements
* @base: pointer to data to sort
* @num: number of elements
* @size: size of each element
* @cmp_func: pointer to comparison function
* @swap_func: pointer to swap function or NULL
+ * @priv: third argument passed to comparison function
*
- * This function does a heapsort on the given array. You may provide a
- * swap_func function optimized to your element type.
+ * This function does a heapsort on the given array. You may provide
+ * a swap_func function if you need to do something more than a memory
+ * copy (e.g. fix up pointers or auxiliary data), but the built-in swap
+ * avoids a slow retpoline and so is significantly faster.
*
* Sorting time is O(n log n) both on average and worst-case. While
- * qsort is about 20% faster on average, it suffers from exploitable
+ * quicksort is slightly faster on average, it suffers from exploitable
* O(n*n) worst-case behavior and extra memory requirements that make
* it less suitable for kernel use.
*/
+void sort_r(void *base, size_t num, size_t size,
+ int (*cmp_func)(const void *, const void *, const void *),
+ void (*swap_func)(void *, void *, int size),
+ const void *priv)
+{
+ /* pre-scale counters for performance */
+ size_t n = num * size, a = (num/2) * size;
+ const unsigned int lsbit = size & -size; /* Used to find parent */
+
+ if (!a) /* num < 2 || size == 0 */
+ return;
+
+ if (!swap_func) {
+ if (is_aligned(base, size, 8))
+ swap_func = SWAP_WORDS_64;
+ else if (is_aligned(base, size, 4))
+ swap_func = SWAP_WORDS_32;
+ else
+ swap_func = SWAP_BYTES;
+ }
+
+ /*
+ * Loop invariants:
+ * 1. elements [a,n) satisfy the heap property (compare greater than
+ * all of their children),
+ * 2. elements [n,num*size) are sorted, and
+ * 3. a <= b <= c <= d <= n (whenever they are valid).
+ */
+ for (;;) {
+ size_t b, c, d;
+
+ if (a) /* Building heap: sift down --a */
+ a -= size;
+ else if (n -= size) /* Sorting: Extract root to --n */
+ do_swap(base, base + n, size, swap_func);
+ else /* Sort complete */
+ break;
+
+ /*
+ * Sift element at "a" down into heap. This is the
+ * "bottom-up" variant, which significantly reduces
+ * calls to cmp_func(): we find the sift-down path all
+ * the way to the leaves (one compare per level), then
+ * backtrack to find where to insert the target element.
+ *
+ * Because elements tend to sift down close to the leaves,
+ * this uses fewer compares than doing two per level
+ * on the way down. (A bit more than half as many on
+ * average, 3/4 worst-case.)
+ */
+ for (b = a; c = 2*b + size, (d = c + size) < n;)
+ b = do_cmp(base + c, base + d, cmp_func, priv) >= 0 ? c : d;
+ if (d == n) /* Special case last leaf with no sibling */
+ b = c;
+
+ /* Now backtrack from "b" to the correct location for "a" */
+ while (b != a && do_cmp(base + a, base + b, cmp_func, priv) >= 0)
+ b = parent(b, lsbit, size);
+ c = b; /* Where "a" belongs */
+ while (b != a) { /* Shift it into place */
+ b = parent(b, lsbit, size);
+ do_swap(base + b, base + c, size, swap_func);
+ }
+ }
+}
+EXPORT_SYMBOL(sort_r);
void sort(void *base, size_t num, size_t size,
int (*cmp_func)(const void *, const void *),
void (*swap_func)(void *, void *, int size))
{
- /* pre-scale counters for performance */
- int i = (num/2 - 1) * size, n = num * size, c, r;
-
- if (!swap_func) {
- if (size == 4 && alignment_ok(base, 4))
- swap_func = u32_swap;
- else if (size == 8 && alignment_ok(base, 8))
- swap_func = u64_swap;
- else
- swap_func = generic_swap;
- }
-
- /* heapify */
- for ( ; i >= 0; i -= size) {
- for (r = i; r * 2 + size < n; r = c) {
- c = r * 2 + size;
- if (c < n - size &&
- cmp_func(base + c, base + c + size) < 0)
- c += size;
- if (cmp_func(base + r, base + c) >= 0)
- break;
- swap_func(base + r, base + c, size);
- }
- }
-
- /* sort */
- for (i = n - size; i > 0; i -= size) {
- swap_func(base, base + i, size);
- for (r = 0; r * 2 + size < i; r = c) {
- c = r * 2 + size;
- if (c < i - size &&
- cmp_func(base + c, base + c + size) < 0)
- c += size;
- if (cmp_func(base + r, base + c) >= 0)
- break;
- swap_func(base + r, base + c, size);
- }
- }
+ return sort_r(base, num, size, _CMP_WRAPPER, swap_func, cmp_func);
}
-
EXPORT_SYMBOL(sort);
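A NULL swap_func now selects one of the built-in swap routines above via is_aligned(), so a typical caller stays as simple as before. A minimal sketch (the array is illustrative):

    static int cmp_int(const void *a, const void *b)
    {
            int x = *(const int *)a, y = *(const int *)b;

            return (x > y) - (x < y);
    }

    int vals[] = { 3, 1, 2 };

    sort(vals, ARRAY_SIZE(vals), sizeof(vals[0]), cmp_int, NULL);

As a worked example of parent(): with size == 12 (so lsbit == 4), the child at byte offset 48 (index 4) gives 48 - 12 = 36, bit 2 of 36 is set so another 12 is subtracted, and 24 / 2 = 12, i.e. index 1 = (4 - 1) / 2 as expected.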
diff --git a/mm/Kconfig b/mm/Kconfig
index b457e94..8020040 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -327,6 +327,23 @@
This value can be changed after boot using the
/proc/sys/vm/mmap_min_addr tunable.
+config MMAP_NOEXEC_TAINT
+ int "Turns on tainting of mmap()d files from noexec mountpoints"
+ default 1 if MMU
+ default 0 if !MMU
+ help
+ By default, the ability to change the protections of a virtual
+ memory area to allow execution depend on if the vma has the
+ VM_MAYEXEC flag. When mapping regions from files, VM_MAYEXEC
+ will be unset if the containing mountpoint is mounted MNT_NOEXEC.
+ By setting the value to 0, any mmap()d region may be later
+ mprotect()d with PROT_EXEC.
+
+ If unsure, keep the value set to 1.
+
+ This value can be changed after boot using the
+ /proc/sys/vm/mmap_noexec_taint tunable.
+
config ARCH_SUPPORTS_MEMORY_FAILURE
bool
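The runtime knob backing this option is added in mm/util.c below (sysctl_mmap_noexec_taint, defaulting to the Kconfig value); as the help text says, it can be relaxed after boot, e.g.:

    echo 0 > /proc/sys/vm/mmap_noexec_taint

after which mappings from noexec mounts keep VM_MAYEXEC and may later be mprotect()ed with PROT_EXEC.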
diff --git a/mm/filemap.c b/mm/filemap.c
index 45f1c6d..b19706a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2541,8 +2541,8 @@
} else if (!page) {
/* No page in the page cache at all */
do_sync_mmap_readahead(vmf->vma, ra, file, offset);
- count_vm_event(PGMAJFAULT);
- count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
+ count_vm_event(PGMAJFAULT_F);
+ count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT_F);
ret = VM_FAULT_MAJOR;
retry_find:
page = find_get_page(mapping, offset);
diff --git a/mm/madvise.c b/mm/madvise.c
index 71d21df..899b19e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -138,7 +138,7 @@
pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
vma->vm_file, pgoff, vma_policy(vma),
- vma->vm_userfaultfd_ctx);
+ vma->vm_userfaultfd_ctx, vma_get_anon_name(vma));
if (*prev) {
vma = *prev;
goto success;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3b78b6a..5dd8eca 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3441,6 +3441,9 @@
PGPGOUT,
PGFAULT,
PGMAJFAULT,
+ PGMAJFAULT_S,
+ PGMAJFAULT_A,
+ PGMAJFAULT_F,
};
static const char *const memcg1_event_names[] = {
@@ -3448,6 +3451,9 @@
"pgpgout",
"pgfault",
"pgmajfault",
+ "pgmajfault_s",
+ "pgmajfault_a",
+ "pgmajfault_f",
};
static int memcg_stat_show(struct seq_file *m, void *v)
@@ -5681,6 +5687,9 @@
seq_printf(m, "pgfault %lu\n", acc.events[PGFAULT]);
seq_printf(m, "pgmajfault %lu\n", acc.events[PGMAJFAULT]);
+ seq_printf(m, "pgmajfault_s %lu\n", acc.events[PGMAJFAULT_S]);
+ seq_printf(m, "pgmajfault_a %lu\n", acc.events[PGMAJFAULT_A]);
+ seq_printf(m, "pgmajfault_f %lu\n", acc.events[PGMAJFAULT_F]);
seq_printf(m, "pgrefill %lu\n", acc.events[PGREFILL]);
seq_printf(m, "pgscan %lu\n", acc.events[PGSCAN_KSWAPD] +
diff --git a/mm/memory.c b/mm/memory.c
index bbf0cc4..3d3c947 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2980,8 +2980,8 @@
/* Had to read the page from swap area: Major fault */
ret = VM_FAULT_MAJOR;
- count_vm_event(PGMAJFAULT);
- count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
+ count_vm_event(PGMAJFAULT_A);
+ count_memcg_event_mm(vma->vm_mm, PGMAJFAULT_A);
} else if (PageHWPoison(page)) {
/*
* hwpoisoned dirty swapcache pages are kept for killing
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 68c46da..08abc8b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -760,7 +760,8 @@
((vmstart - vma->vm_start) >> PAGE_SHIFT);
prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
vma->anon_vma, vma->vm_file, pgoff,
- new_pol, vma->vm_userfaultfd_ctx);
+ new_pol, vma->vm_userfaultfd_ctx,
+ vma_get_anon_name(vma));
if (prev) {
vma = prev;
next = vma->vm_next;
diff --git a/mm/mlock.c b/mm/mlock.c
index 0ab8250..02d8a89 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -535,7 +535,7 @@
pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
vma->vm_file, pgoff, vma_policy(vma),
- vma->vm_userfaultfd_ctx);
+ vma->vm_userfaultfd_ctx, vma_get_anon_name(vma));
if (*prev) {
vma = *prev;
goto success;
diff --git a/mm/mmap.c b/mm/mmap.c
index a98f09b..0636423 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -977,7 +977,8 @@
*/
static inline int is_mergeable_vma(struct vm_area_struct *vma,
struct file *file, unsigned long vm_flags,
- struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+ const char __user *anon_name)
{
/*
* VM_SOFTDIRTY should not prevent from VMA merging, if we
@@ -995,6 +996,8 @@
return 0;
if (!is_mergeable_vm_userfaultfd_ctx(vma, vm_userfaultfd_ctx))
return 0;
+ if (vma_get_anon_name(vma) != anon_name)
+ return 0;
return 1;
}
@@ -1027,9 +1030,10 @@
can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
struct anon_vma *anon_vma, struct file *file,
pgoff_t vm_pgoff,
- struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+ const char __user *anon_name)
{
- if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
+ if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, anon_name) &&
is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
if (vma->vm_pgoff == vm_pgoff)
return 1;
@@ -1048,9 +1052,10 @@
can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
struct anon_vma *anon_vma, struct file *file,
pgoff_t vm_pgoff,
- struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+ const char __user *anon_name)
{
- if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
+ if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, anon_name) &&
is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
pgoff_t vm_pglen;
vm_pglen = vma_pages(vma);
@@ -1061,9 +1066,9 @@
}
/*
- * Given a mapping request (addr,end,vm_flags,file,pgoff), figure out
- * whether that can be merged with its predecessor or its successor.
- * Or both (it neatly fills a hole).
+ * Given a mapping request (addr,end,vm_flags,file,pgoff,anon_name),
+ * figure out whether that can be merged with its predecessor or its
+ * successor. Or both (it neatly fills a hole).
*
* In most cases - when called for mmap, brk or mremap - [addr,end) is
* certain not to be mapped by the time vma_merge is called; but when
@@ -1105,7 +1110,8 @@
unsigned long end, unsigned long vm_flags,
struct anon_vma *anon_vma, struct file *file,
pgoff_t pgoff, struct mempolicy *policy,
- struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+ struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+ const char __user *anon_name)
{
pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
struct vm_area_struct *area, *next;
@@ -1138,7 +1144,8 @@
mpol_equal(vma_policy(prev), policy) &&
can_vma_merge_after(prev, vm_flags,
anon_vma, file, pgoff,
- vm_userfaultfd_ctx)) {
+ vm_userfaultfd_ctx,
+ anon_name)) {
/*
* OK, it can. Can we now merge in the successor as well?
*/
@@ -1147,7 +1154,8 @@
can_vma_merge_before(next, vm_flags,
anon_vma, file,
pgoff+pglen,
- vm_userfaultfd_ctx) &&
+ vm_userfaultfd_ctx,
+ anon_name) &&
is_mergeable_anon_vma(prev->anon_vma,
next->anon_vma, NULL)) {
/* cases 1, 6 */
@@ -1170,7 +1178,8 @@
mpol_equal(policy, vma_policy(next)) &&
can_vma_merge_before(next, vm_flags,
anon_vma, file, pgoff+pglen,
- vm_userfaultfd_ctx)) {
+ vm_userfaultfd_ctx,
+ anon_name)) {
if (prev && addr < prev->vm_end) /* case 4 */
err = __vma_adjust(prev, prev->vm_start,
addr, prev->vm_pgoff, NULL, next);
@@ -1479,7 +1488,8 @@
if (path_noexec(&file->f_path)) {
if (vm_flags & VM_EXEC)
return -EPERM;
- vm_flags &= ~VM_MAYEXEC;
+ if (sysctl_mmap_noexec_taint)
+ vm_flags &= ~VM_MAYEXEC;
}
if (!file->f_op->mmap)
@@ -1715,7 +1725,7 @@
* Can we just expand an old mapping?
*/
vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
- NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX);
+ NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX, NULL);
if (vma)
goto out;
@@ -2785,6 +2795,7 @@
return 0;
}
+EXPORT_SYMBOL(do_munmap);
int vm_munmap(unsigned long start, size_t len)
{
@@ -2971,7 +2982,7 @@
/* Can we just expand an old private anonymous mapping? */
vma = vma_merge(mm, prev, addr, addr + len, flags,
- NULL, NULL, pgoff, NULL, NULL_VM_UFFD_CTX);
+ NULL, NULL, pgoff, NULL, NULL_VM_UFFD_CTX, NULL);
if (vma)
goto out;
@@ -3169,7 +3180,7 @@
return NULL; /* should never get here */
new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
- vma->vm_userfaultfd_ctx);
+ vma->vm_userfaultfd_ctx, vma_get_anon_name(vma));
if (new_vma) {
/*
* Source vma may have been merged into new_vma
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 86837f2..3f374a6 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -432,7 +432,7 @@
pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
*pprev = vma_merge(mm, *pprev, start, end, newflags,
vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
- vma->vm_userfaultfd_ctx);
+ vma->vm_userfaultfd_ctx, vma_get_anon_name(vma));
if (*pprev) {
vma = *pprev;
VM_WARN_ON((vma->vm_flags ^ newflags) & ~VM_SOFTDIRTY);
diff --git a/mm/shmem.c b/mm/shmem.c
index dea5120..4666d9f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1690,8 +1690,8 @@
/* Or update major stats only when swapin succeeds?? */
if (fault_type) {
*fault_type |= VM_FAULT_MAJOR;
- count_vm_event(PGMAJFAULT);
- count_memcg_event_mm(charge_mm, PGMAJFAULT);
+ count_vm_event(PGMAJFAULT_S);
+ count_memcg_event_mm(charge_mm, PGMAJFAULT_S);
}
/* Here we actually start the io */
page = shmem_swapin(swap, gfp, info, index);
diff --git a/mm/slub.c b/mm/slub.c
index d8116a4..d0f4b03 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5094,6 +5094,14 @@
SLAB_ATTR_RO(cache_dma);
#endif
+#ifdef CONFIG_ZONE_DMA32
+static ssize_t cache_dma32_show(struct kmem_cache *s, char *buf)
+{
+ return sprintf(buf, "%d\n", !!(s->flags & SLAB_CACHE_DMA32));
+}
+SLAB_ATTR_RO(cache_dma32);
+#endif
+
static ssize_t usersize_show(struct kmem_cache *s, char *buf)
{
return sprintf(buf, "%u\n", s->usersize);
@@ -5434,6 +5442,9 @@
#ifdef CONFIG_ZONE_DMA
&cache_dma_attr.attr,
#endif
+#ifdef CONFIG_ZONE_DMA32
+ &cache_dma32_attr.attr,
+#endif
#ifdef CONFIG_NUMA
&remote_node_defrag_ratio_attr.attr,
#endif
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0047dca..0563c7b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2532,6 +2532,8 @@
struct swap_cluster_info *cluster_info;
unsigned long *frontswap_map;
struct file *swap_file, *victim;
+ struct path path_holder;
+ struct path *victim_path = NULL;
struct address_space *mapping;
struct inode *inode;
struct filename *pathname;
@@ -2549,10 +2551,16 @@
victim = file_open_name(pathname, O_RDWR|O_LARGEFILE, 0);
err = PTR_ERR(victim);
- if (IS_ERR(victim))
- goto out;
-
- mapping = victim->f_mapping;
+ if (IS_ERR(victim)) {
+ /* Fallback to just the inode mapping if possible. */
+ if (kern_path(pathname->name, LOOKUP_FOLLOW, &path_holder))
+ goto out; /* Propagate the original err. */
+ victim_path = &path_holder;
+ mapping = victim_path->dentry->d_inode->i_mapping;
+ victim = NULL;
+ } else {
+ mapping = victim->f_mapping;
+ }
spin_lock(&swap_lock);
plist_for_each_entry(p, &swap_active_head, list) {
if (p->flags & SWP_WRITEOK) {
@@ -2685,7 +2693,10 @@
wake_up_interruptible(&proc_poll_wait);
out_dput:
- filp_close(victim, NULL);
+ if (victim)
+ filp_close(victim, NULL);
+ if (victim_path)
+ path_put(victim_path);
out:
putname(pathname);
return err;
@@ -2873,12 +2884,23 @@
return p;
}
-static int claim_swapfile(struct swap_info_struct *p, struct inode *inode)
+/* This sysctl is only exposed when CONFIG_DISK_BASED_SWAP is enabled. */
+int sysctl_disk_based_swap;
+
+static int claim_swapfile(struct swap_info_struct *p, struct inode *inode,
+ bool allow_disk_based_swap)
{
int error;
-
+ /* On Chromium OS, we only support zram swap devices. */
if (S_ISBLK(inode->i_mode)) {
+ char name[BDEVNAME_SIZE];
p->bdev = bdgrab(I_BDEV(inode));
+ bdevname(p->bdev, name);
+ if (strncmp(name, "zram", strlen("zram"))) {
+ bdput(p->bdev);
+ p->bdev = NULL;
+ return -EINVAL;
+ }
error = blkdev_get(p->bdev,
FMODE_READ | FMODE_WRITE | FMODE_EXCL, p);
if (error < 0) {
@@ -2890,7 +2912,7 @@
if (error < 0)
return error;
p->flags |= SWP_BLKDEV;
- } else if (S_ISREG(inode->i_mode)) {
+ } else if (S_ISREG(inode->i_mode) && allow_disk_based_swap) {
p->bdev = inode->i_sb->s_bdev;
inode_lock(inode);
if (IS_SWAPFILE(inode))
@@ -3119,6 +3141,7 @@
struct page *page = NULL;
struct inode *inode = NULL;
bool inced_nr_rotate_swap = false;
+ bool allow_disk_based_swap = sysctl_disk_based_swap;
if (swap_flags & ~SWAP_FLAGS_VALID)
return -EINVAL;
@@ -3153,7 +3176,7 @@
inode = mapping->host;
/* If S_ISREG(inode->i_mode) will do inode_lock(inode); */
- error = claim_swapfile(p, inode);
+ error = claim_swapfile(p, inode, allow_disk_based_swap);
if (unlikely(error))
goto bad_swap;
@@ -3321,7 +3344,8 @@
atomic_dec(&nr_rotate_swap);
if (swap_file) {
if (inode && S_ISREG(inode->i_mode)) {
- inode_unlock(inode);
+ if (allow_disk_based_swap)
+ inode_unlock(inode);
inode = NULL;
}
filp_close(swap_file, NULL);
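With the claim_swapfile() change above, enabling swap on a block device only succeeds when its bdevname() begins with "zram"; anything else, and any regular file while the disk_based_swap sysctl is 0, is rejected with -EINVAL. From userspace (illustrative):

    swapon("/dev/zram0", 0);   /* passes the name check, proceeds to blkdev_get() */
    swapon("/dev/sda2", 0);    /* claim_swapfile() returns -EINVAL */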
diff --git a/mm/util.c b/mm/util.c
index 6a24a10..d7628b1 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -563,6 +563,7 @@
int sysctl_overcommit_ratio __read_mostly = 50;
unsigned long sysctl_overcommit_kbytes __read_mostly;
int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
+int sysctl_mmap_noexec_taint __read_mostly = CONFIG_MMAP_NOEXEC_TAINT;
unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */
unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bc2ecd4..d8e561b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -166,6 +166,11 @@
*/
unsigned long vm_total_pages;
+/*
+ * Low watermark used to prevent fscache thrashing during low memory.
+ */
+int min_filelist_kbytes;
+
static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);
@@ -2242,9 +2247,33 @@
return inactive * inactive_ratio < active;
}
+/*
+ * Check low watermark used to prevent fscache thrashing during low memory.
+ */
+static int file_is_low(struct lruvec *lruvec, struct scan_control *sc)
+{
+ unsigned long pages_min, active, inactive;
+ enum lru_list inactive_lru = LRU_FILE;
+ enum lru_list active_lru = LRU_FILE + LRU_ACTIVE;
+
+ if (!mem_cgroup_disabled())
+ return false;
+
+ pages_min = min_filelist_kbytes >> (PAGE_SHIFT - 10);
+ inactive = lruvec_lru_size(lruvec, inactive_lru, sc->reclaim_idx);
+ active = lruvec_lru_size(lruvec, active_lru, sc->reclaim_idx);
+
+ return ((active + inactive) < pages_min);
+}
+
static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
struct lruvec *lruvec, struct scan_control *sc)
{
+ int file = is_file_lru(lru);
+
+ if (file && file_is_low(lruvec, sc))
+ return 0;
+
if (is_active_lru(lru)) {
if (inactive_list_is_low(lruvec, is_file_lru(lru), sc, true))
shrink_active_list(nr_to_scan, lruvec, sc, lru);
@@ -2285,6 +2314,15 @@
unsigned long ap, fp;
enum lru_list lru;
+ /*
+ * Do not scan file pages when swap is allowed by __GFP_IO and
+ * file page count is low.
+ */
+ if ((sc->gfp_mask & __GFP_IO) && file_is_low(lruvec, sc)) {
+ scan_balance = SCAN_ANON;
+ goto out;
+ }
+
/* If we have no swap space, do not bother scanning anon pages. */
if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
scan_balance = SCAN_FILE;
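min_filelist_kbytes is in kilobytes, so file_is_low() converts it with >> (PAGE_SHIFT - 10); with 4 KiB pages that is a shift of 2, e.g. a setting of 65536 kB translates to a floor of 16384 file-backed LRU pages, below which file-LRU scanning is skipped (the check only applies when the memory controller is disabled).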
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ce81b0a..6bdb077 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1185,6 +1185,9 @@
"pgfault",
"pgmajfault",
+ "pgmajfault_s",
+ "pgmajfault_a",
+ "pgmajfault_f",
"pglazyfreed",
"pgrefill",
@@ -1688,6 +1691,8 @@
all_vm_events(v);
v[PGPGIN] /= 2; /* sectors -> kbytes */
v[PGPGOUT] /= 2;
+ /* Add up page faults */
+ v[PGMAJFAULT] = v[PGMAJFAULT_S] + v[PGMAJFAULT_A] + v[PGMAJFAULT_F];
#endif
return (unsigned long *)m->private + *pos;
}
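After this change /proc/vmstat reports the three components separately and reconstitutes pgmajfault as their sum, e.g. (values illustrative):

    pgmajfault 123
    pgmajfault_s 3
    pgmajfault_a 20
    pgmajfault_f 100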
diff --git a/net/bridge/br.c b/net/bridge/br.c
index b0a0b82..e411e40 100644
--- a/net/bridge/br.c
+++ b/net/bridge/br.c
@@ -175,6 +175,22 @@
.notifier_call = br_switchdev_event,
};
+void br_opt_toggle(struct net_bridge *br, enum net_bridge_opts opt, bool on)
+{
+ bool cur = !!br_opt_get(br, opt);
+
+ br_debug(br, "toggle option: %d state: %d -> %d\n",
+ opt, cur, on);
+
+ if (cur == on)
+ return;
+
+ if (on)
+ set_bit(opt, &br->options);
+ else
+ clear_bit(opt, &br->options);
+}
+
static void __net_exit br_net_exit(struct net *net)
{
struct net_device *dev;
diff --git a/net/bridge/br_arp_nd_proxy.c b/net/bridge/br_arp_nd_proxy.c
index d42e390..6b78e63 100644
--- a/net/bridge/br_arp_nd_proxy.c
+++ b/net/bridge/br_arp_nd_proxy.c
@@ -39,7 +39,7 @@
}
}
- br->neigh_suppress_enabled = neigh_suppress;
+ br_opt_toggle(br, BROPT_NEIGH_SUPPRESS_ENABLED, neigh_suppress);
}
#if IS_ENABLED(CONFIG_INET)
@@ -155,7 +155,7 @@
ipv4_is_multicast(tip))
return;
- if (br->neigh_suppress_enabled) {
+ if (br_opt_get(br, BROPT_NEIGH_SUPPRESS_ENABLED)) {
if (p && (p->flags & BR_NEIGH_SUPPRESS))
return;
if (ipv4_is_zeronet(sip) || sip == tip) {
@@ -175,7 +175,8 @@
return;
}
- if (br->neigh_suppress_enabled && br_is_local_ip(vlandev, tip)) {
+ if (br_opt_get(br, BROPT_NEIGH_SUPPRESS_ENABLED) &&
+ br_is_local_ip(vlandev, tip)) {
/* its our local ip, so don't proxy reply
* and don't forward to neigh suppress ports
*/
@@ -213,7 +214,8 @@
/* If we have replied or as long as we know the
* mac, indicate to arp replied
*/
- if (replied || br->neigh_suppress_enabled)
+ if (replied ||
+ br_opt_get(br, BROPT_NEIGH_SUPPRESS_ENABLED))
BR_INPUT_SKB_CB(skb)->proxyarp_replied = true;
}
@@ -460,7 +462,8 @@
* mac, indicate to NEIGH_SUPPRESS ports that we
* have replied
*/
- if (replied || br->neigh_suppress_enabled)
+ if (replied ||
+ br_opt_get(br, BROPT_NEIGH_SUPPRESS_ENABLED))
BR_INPUT_SKB_CB(skb)->proxyarp_replied = true;
}
neigh_release(n);
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index 9ce661e..e645983 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -67,11 +67,11 @@
if (IS_ENABLED(CONFIG_INET) &&
(eth->h_proto == htons(ETH_P_ARP) ||
eth->h_proto == htons(ETH_P_RARP)) &&
- br->neigh_suppress_enabled) {
+ br_opt_get(br, BROPT_NEIGH_SUPPRESS_ENABLED)) {
br_do_proxy_suppress_arp(skb, br, vid, NULL);
} else if (IS_ENABLED(CONFIG_IPV6) &&
skb->protocol == htons(ETH_P_IPV6) &&
- br->neigh_suppress_enabled &&
+ br_opt_get(br, BROPT_NEIGH_SUPPRESS_ENABLED) &&
pskb_may_pull(skb, sizeof(struct ipv6hdr) +
sizeof(struct nd_msg)) &&
ipv6_hdr(skb)->nexthdr == IPPROTO_ICMPV6) {
@@ -228,7 +228,7 @@
dev->mtu = new_mtu;
/* this flag will be cleared if the MTU was automatically adjusted */
- br->mtu_set_by_user = true;
+ br_opt_toggle(br, BROPT_MTU_SET_BY_USER, true);
#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
/* remember the MTU in the rtable for PMTU */
dst_metric_set(&br->fake_rtable.dst, RTAX_MTU, new_mtu);
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index ed2b600..9c34cf6 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -509,14 +509,14 @@
ASSERT_RTNL();
/* if the bridge MTU was manually configured don't mess with it */
- if (br->mtu_set_by_user)
+ if (br_opt_get(br, BROPT_MTU_SET_BY_USER))
return;
/* change to the minimum MTU and clear the flag which was set by
* the bridge ndo_change_mtu callback
*/
dev_set_mtu(br->dev, br_mtu_min(br));
- br->mtu_set_by_user = false;
+ br_opt_toggle(br, BROPT_MTU_SET_BY_USER, false);
}
static void br_set_gso_limits(struct net_bridge *br)
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 2532c1a..e96035f 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -120,7 +120,7 @@
br_do_proxy_suppress_arp(skb, br, vid, p);
} else if (IS_ENABLED(CONFIG_IPV6) &&
skb->protocol == htons(ETH_P_IPV6) &&
- br->neigh_suppress_enabled &&
+ br_opt_get(br, BROPT_NEIGH_SUPPRESS_ENABLED) &&
pskb_may_pull(skb, sizeof(struct ipv6hdr) +
sizeof(struct nd_msg)) &&
ipv6_hdr(skb)->nexthdr == IPPROTO_ICMPV6) {
diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index 5519881..fb5026a 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -84,7 +84,7 @@
int i, err = 0;
int idx = 0, s_idx = cb->args[1];
- if (br->multicast_disabled)
+ if (!br_opt_get(br, BROPT_MULTICAST_ENABLED))
return 0;
mdb = rcu_dereference(br->mdb);
@@ -598,7 +598,7 @@
struct net_bridge_port *p;
int ret;
- if (!netif_running(br->dev) || br->multicast_disabled)
+ if (!netif_running(br->dev) || !br_opt_get(br, BROPT_MULTICAST_ENABLED))
return -EINVAL;
dev = __dev_get_by_index(net, entry->ifindex);
@@ -673,7 +673,7 @@
struct br_ip ip;
int err = -EINVAL;
- if (!netif_running(br->dev) || br->multicast_disabled)
+ if (!netif_running(br->dev) || !br_opt_get(br, BROPT_MULTICAST_ENABLED))
return -EINVAL;
__mdb_entry_to_br_ip(entry, &ip);
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 6a362da..526a83d 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -158,7 +158,7 @@
struct net_bridge_mdb_htable *mdb = rcu_dereference(br->mdb);
struct br_ip ip;
- if (br->multicast_disabled)
+ if (!br_opt_get(br, BROPT_MULTICAST_ENABLED))
return NULL;
if (BR_INPUT_SKB_CB(skb)->igmp)
@@ -411,7 +411,7 @@
iph->frag_off = htons(IP_DF);
iph->ttl = 1;
iph->protocol = IPPROTO_IGMP;
- iph->saddr = br->multicast_query_use_ifaddr ?
+ iph->saddr = br_opt_get(br, BROPT_MULTICAST_QUERY_USE_IFADDR) ?
inet_select_addr(br->dev, 0, RT_SCOPE_LINK) : 0;
iph->daddr = htonl(INADDR_ALLHOSTS_GROUP);
((u8 *)&iph[1])[0] = IPOPT_RA;
@@ -503,11 +503,11 @@
if (ipv6_dev_get_saddr(dev_net(br->dev), br->dev, &ip6h->daddr, 0,
&ip6h->saddr)) {
kfree_skb(skb);
- br->has_ipv6_addr = 0;
+ br_opt_toggle(br, BROPT_HAS_IPV6_ADDR, false);
return NULL;
}
- br->has_ipv6_addr = 1;
+ br_opt_toggle(br, BROPT_HAS_IPV6_ADDR, true);
ipv6_eth_mc_map(&ip6h->daddr, eth->h_dest);
hopopt = (u8 *)(ip6h + 1);
@@ -628,7 +628,7 @@
port ? port->dev->name : br->dev->name);
err = -E2BIG;
disable:
- br->multicast_disabled = 1;
+ br_opt_toggle(br, BROPT_MULTICAST_ENABLED, false);
goto err;
}
}
@@ -894,7 +894,7 @@
struct bridge_mcast_own_query *query)
{
spin_lock(&br->multicast_lock);
- if (!netif_running(br->dev) || br->multicast_disabled)
+ if (!netif_running(br->dev) || !br_opt_get(br, BROPT_MULTICAST_ENABLED))
goto out;
br_multicast_start_querier(br, query);
@@ -965,8 +965,9 @@
struct br_ip br_group;
unsigned long time;
- if (!netif_running(br->dev) || br->multicast_disabled ||
- !br->multicast_querier)
+ if (!netif_running(br->dev) ||
+ !br_opt_get(br, BROPT_MULTICAST_ENABLED) ||
+ !br_opt_get(br, BROPT_MULTICAST_QUERIER))
return;
memset(&br_group.u, 0, sizeof(br_group.u));
@@ -1036,7 +1037,7 @@
.orig_dev = dev,
.id = SWITCHDEV_ATTR_ID_BRIDGE_MC_DISABLED,
.flags = SWITCHDEV_F_DEFER,
- .u.mc_disabled = value,
+ .u.mc_disabled = !value,
};
switchdev_port_attr_set(dev, &attr);
@@ -1054,7 +1055,8 @@
timer_setup(&port->ip6_own_query.timer,
br_ip6_multicast_port_query_expired, 0);
#endif
- br_mc_disabled_update(port->dev, port->br->multicast_disabled);
+ br_mc_disabled_update(port->dev,
+ br_opt_get(port->br, BROPT_MULTICAST_ENABLED));
port->mcast_stats = netdev_alloc_pcpu_stats(struct bridge_mcast_stats);
if (!port->mcast_stats)
@@ -1091,7 +1093,7 @@
{
struct net_bridge *br = port->br;
- if (br->multicast_disabled || !netif_running(br->dev))
+ if (!br_opt_get(br, BROPT_MULTICAST_ENABLED) || !netif_running(br->dev))
return;
br_multicast_enable(&port->ip4_own_query);
@@ -1641,7 +1643,7 @@
if (timer_pending(&other_query->timer))
goto out;
- if (br->multicast_querier) {
+ if (br_opt_get(br, BROPT_MULTICAST_QUERIER)) {
__br_multicast_send_query(br, port, &mp->addr);
time = jiffies + br->multicast_last_member_count *
@@ -1753,7 +1755,7 @@
struct bridge_mcast_stats __percpu *stats;
struct bridge_mcast_stats *pstats;
- if (!br->multicast_stats_enabled)
+ if (!br_opt_get(br, BROPT_MULTICAST_STATS_ENABLED))
return;
if (p)
@@ -1911,7 +1913,7 @@
BR_INPUT_SKB_CB(skb)->igmp = 0;
BR_INPUT_SKB_CB(skb)->mrouters_only = 0;
- if (br->multicast_disabled)
+ if (!br_opt_get(br, BROPT_MULTICAST_ENABLED))
return 0;
switch (skb->protocol) {
@@ -1963,8 +1965,6 @@
br->hash_max = 512;
br->multicast_router = MDB_RTR_TYPE_TEMP_QUERY;
- br->multicast_querier = 0;
- br->multicast_query_use_ifaddr = 0;
br->multicast_last_member_count = 2;
br->multicast_startup_query_count = 2;
@@ -1983,7 +1983,7 @@
br->ip6_other_query.delay_time = 0;
br->ip6_querier.port = NULL;
#endif
- br->has_ipv6_addr = 1;
+ br_opt_toggle(br, BROPT_HAS_IPV6_ADDR, true);
spin_lock_init(&br->multicast_lock);
timer_setup(&br->multicast_router_timer,
@@ -2005,7 +2005,7 @@
{
query->startup_sent = 0;
- if (br->multicast_disabled)
+ if (!br_opt_get(br, BROPT_MULTICAST_ENABLED))
return;
mod_timer(&query->timer, jiffies);
@@ -2182,12 +2182,12 @@
int err = 0;
spin_lock_bh(&br->multicast_lock);
- if (br->multicast_disabled == !val)
+ if (!!br_opt_get(br, BROPT_MULTICAST_ENABLED) == !!val)
goto unlock;
- br_mc_disabled_update(br->dev, !val);
- br->multicast_disabled = !val;
- if (br->multicast_disabled)
+ br_mc_disabled_update(br->dev, val);
+ br_opt_toggle(br, BROPT_MULTICAST_ENABLED, !!val);
+ if (!br_opt_get(br, BROPT_MULTICAST_ENABLED))
goto unlock;
if (!netif_running(br->dev))
@@ -2198,7 +2198,7 @@
if (mdb->old) {
err = -EEXIST;
rollback:
- br->multicast_disabled = !!val;
+ br_opt_toggle(br, BROPT_MULTICAST_ENABLED, false);
goto unlock;
}
@@ -2222,7 +2222,7 @@
{
struct net_bridge *br = netdev_priv(dev);
- return !br->multicast_disabled;
+ return !!br_opt_get(br, BROPT_MULTICAST_ENABLED);
}
EXPORT_SYMBOL_GPL(br_multicast_enabled);
@@ -2245,10 +2245,10 @@
val = !!val;
spin_lock_bh(&br->multicast_lock);
- if (br->multicast_querier == val)
+ if (br_opt_get(br, BROPT_MULTICAST_QUERIER) == val)
goto unlock;
- br->multicast_querier = val;
+ br_opt_toggle(br, BROPT_MULTICAST_QUERIER, !!val);
if (!val)
goto unlock;
@@ -2569,7 +2569,7 @@
struct bridge_mcast_stats __percpu *stats;
/* if multicast_disabled is true then igmp type can't be set */
- if (!type || !br->multicast_stats_enabled)
+ if (!type || !br_opt_get(br, BROPT_MULTICAST_STATS_ENABLED))
return;
if (p)
diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index ccab290..ad207463 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -51,25 +51,22 @@
struct brnf_net {
bool enabled;
-};
#ifdef CONFIG_SYSCTL
-static struct ctl_table_header *brnf_sysctl_header;
-static int brnf_call_iptables __read_mostly = 1;
-static int brnf_call_ip6tables __read_mostly = 1;
-static int brnf_call_arptables __read_mostly = 1;
-static int brnf_filter_vlan_tagged __read_mostly;
-static int brnf_filter_pppoe_tagged __read_mostly;
-static int brnf_pass_vlan_indev __read_mostly;
-#else
-#define brnf_call_iptables 1
-#define brnf_call_ip6tables 1
-#define brnf_call_arptables 1
-#define brnf_filter_vlan_tagged 0
-#define brnf_filter_pppoe_tagged 0
-#define brnf_pass_vlan_indev 0
+ struct ctl_table_header *ctl_hdr;
#endif
+ /* default value is 1 */
+ int call_iptables;
+ int call_ip6tables;
+ int call_arptables;
+
+ /* default value is 0 */
+ int filter_vlan_tagged;
+ int filter_pppoe_tagged;
+ int pass_vlan_indev;
+};
+
#define IS_IP(skb) \
(!skb_vlan_tag_present(skb) && skb->protocol == htons(ETH_P_IP))
@@ -89,17 +86,28 @@
return 0;
}
-#define IS_VLAN_IP(skb) \
- (vlan_proto(skb) == htons(ETH_P_IP) && \
- brnf_filter_vlan_tagged)
+static inline bool is_vlan_ip(const struct sk_buff *skb, const struct net *net)
+{
+ struct brnf_net *brnet = net_generic(net, brnf_net_id);
-#define IS_VLAN_IPV6(skb) \
- (vlan_proto(skb) == htons(ETH_P_IPV6) && \
- brnf_filter_vlan_tagged)
+ return vlan_proto(skb) == htons(ETH_P_IP) && brnet->filter_vlan_tagged;
+}
-#define IS_VLAN_ARP(skb) \
- (vlan_proto(skb) == htons(ETH_P_ARP) && \
- brnf_filter_vlan_tagged)
+static inline bool is_vlan_ipv6(const struct sk_buff *skb,
+ const struct net *net)
+{
+ struct brnf_net *brnet = net_generic(net, brnf_net_id);
+
+ return vlan_proto(skb) == htons(ETH_P_IPV6) &&
+ brnet->filter_vlan_tagged;
+}
+
+static inline bool is_vlan_arp(const struct sk_buff *skb, const struct net *net)
+{
+ struct brnf_net *brnet = net_generic(net, brnf_net_id);
+
+ return vlan_proto(skb) == htons(ETH_P_ARP) && brnet->filter_vlan_tagged;
+}
static inline __be16 pppoe_proto(const struct sk_buff *skb)
{
@@ -107,15 +115,23 @@
sizeof(struct pppoe_hdr)));
}
-#define IS_PPPOE_IP(skb) \
- (skb->protocol == htons(ETH_P_PPP_SES) && \
- pppoe_proto(skb) == htons(PPP_IP) && \
- brnf_filter_pppoe_tagged)
+static inline bool is_pppoe_ip(const struct sk_buff *skb, const struct net *net)
+{
+ struct brnf_net *brnet = net_generic(net, brnf_net_id);
-#define IS_PPPOE_IPV6(skb) \
- (skb->protocol == htons(ETH_P_PPP_SES) && \
- pppoe_proto(skb) == htons(PPP_IPV6) && \
- brnf_filter_pppoe_tagged)
+ return skb->protocol == htons(ETH_P_PPP_SES) &&
+ pppoe_proto(skb) == htons(PPP_IP) && brnet->filter_pppoe_tagged;
+}
+
+static inline bool is_pppoe_ipv6(const struct sk_buff *skb,
+ const struct net *net)
+{
+ struct brnf_net *brnet = net_generic(net, brnf_net_id);
+
+ return skb->protocol == htons(ETH_P_PPP_SES) &&
+ pppoe_proto(skb) == htons(PPP_IPV6) &&
+ brnet->filter_pppoe_tagged;
+}
/* largest possible L2 header, see br_nf_dev_queue_xmit() */
#define NF_BRIDGE_MAX_MAC_HEADER_LENGTH (PPPOE_SES_HLEN + ETH_HLEN)
@@ -425,12 +441,16 @@
return 0;
}
-static struct net_device *brnf_get_logical_dev(struct sk_buff *skb, const struct net_device *dev)
+static struct net_device *brnf_get_logical_dev(struct sk_buff *skb,
+ const struct net_device *dev,
+ const struct net *net)
{
struct net_device *vlan, *br;
+ struct brnf_net *brnet = net_generic(net, brnf_net_id);
br = bridge_parent(dev);
- if (brnf_pass_vlan_indev == 0 || !skb_vlan_tag_present(skb))
+
+ if (brnet->pass_vlan_indev == 0 || !skb_vlan_tag_present(skb))
return br;
vlan = __vlan_find_dev_deep_rcu(br, skb->vlan_proto,
@@ -440,7 +460,7 @@
}
/* Some common code for IPv4/IPv6 */
-struct net_device *setup_pre_routing(struct sk_buff *skb)
+struct net_device *setup_pre_routing(struct sk_buff *skb, const struct net *net)
{
struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
@@ -451,7 +471,7 @@
nf_bridge->in_prerouting = 1;
nf_bridge->physindev = skb->dev;
- skb->dev = brnf_get_logical_dev(skb, skb->dev);
+ skb->dev = brnf_get_logical_dev(skb, skb->dev, net);
if (skb->protocol == htons(ETH_P_8021Q))
nf_bridge->orig_proto = BRNF_PROTO_8021Q;
@@ -477,6 +497,7 @@
struct net_bridge_port *p;
struct net_bridge *br;
__u32 len = nf_bridge_encap_header_len(skb);
+ struct brnf_net *brnet;
if (unlikely(!pskb_may_pull(skb, len)))
return NF_DROP;
@@ -486,18 +507,22 @@
return NF_DROP;
br = p->br;
- if (IS_IPV6(skb) || IS_VLAN_IPV6(skb) || IS_PPPOE_IPV6(skb)) {
- if (!brnf_call_ip6tables && !br->nf_call_ip6tables)
+ brnet = net_generic(state->net, brnf_net_id);
+ if (IS_IPV6(skb) || is_vlan_ipv6(skb, state->net) ||
+ is_pppoe_ipv6(skb, state->net)) {
+ if (!brnet->call_ip6tables &&
+ !br_opt_get(br, BROPT_NF_CALL_IP6TABLES))
return NF_ACCEPT;
nf_bridge_pull_encap_header_rcsum(skb);
return br_nf_pre_routing_ipv6(priv, skb, state);
}
- if (!brnf_call_iptables && !br->nf_call_iptables)
+ if (!brnet->call_iptables && !br_opt_get(br, BROPT_NF_CALL_IPTABLES))
return NF_ACCEPT;
- if (!IS_IP(skb) && !IS_VLAN_IP(skb) && !IS_PPPOE_IP(skb))
+ if (!IS_IP(skb) && !is_vlan_ip(skb, state->net) &&
+ !is_pppoe_ip(skb, state->net))
return NF_ACCEPT;
nf_bridge_pull_encap_header_rcsum(skb);
@@ -508,7 +533,7 @@
nf_bridge_put(skb->nf_bridge);
if (!nf_bridge_alloc(skb))
return NF_DROP;
- if (!setup_pre_routing(skb))
+ if (!setup_pre_routing(skb, state->net))
return NF_DROP;
nf_bridge = nf_bridge_info_get(skb);
@@ -531,7 +556,7 @@
struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
struct net_device *in;
- if (!IS_ARP(skb) && !IS_VLAN_ARP(skb)) {
+ if (!IS_ARP(skb) && !is_vlan_arp(skb, net)) {
if (skb->protocol == htons(ETH_P_IP))
nf_bridge->frag_max_size = IPCB(skb)->frag_max_size;
@@ -585,9 +610,11 @@
if (!parent)
return NF_DROP;
- if (IS_IP(skb) || IS_VLAN_IP(skb) || IS_PPPOE_IP(skb))
+ if (IS_IP(skb) || is_vlan_ip(skb, state->net) ||
+ is_pppoe_ip(skb, state->net))
pf = NFPROTO_IPV4;
- else if (IS_IPV6(skb) || IS_VLAN_IPV6(skb) || IS_PPPOE_IPV6(skb))
+ else if (IS_IPV6(skb) || is_vlan_ipv6(skb, state->net) ||
+ is_pppoe_ipv6(skb, state->net))
pf = NFPROTO_IPV6;
else
return NF_ACCEPT;
@@ -618,7 +645,7 @@
skb->protocol = htons(ETH_P_IPV6);
NF_HOOK(pf, NF_INET_FORWARD, state->net, NULL, skb,
- brnf_get_logical_dev(skb, state->in),
+ brnf_get_logical_dev(skb, state->in, state->net),
parent, br_nf_forward_finish);
return NF_STOLEN;
@@ -631,17 +658,19 @@
struct net_bridge_port *p;
struct net_bridge *br;
struct net_device **d = (struct net_device **)(skb->cb);
+ struct brnf_net *brnet;
p = br_port_get_rcu(state->out);
if (p == NULL)
return NF_ACCEPT;
br = p->br;
- if (!brnf_call_arptables && !br->nf_call_arptables)
+ brnet = net_generic(state->net, brnf_net_id);
+ if (!brnet->call_arptables && !br_opt_get(br, BROPT_NF_CALL_ARPTABLES))
return NF_ACCEPT;
if (!IS_ARP(skb)) {
- if (!IS_VLAN_ARP(skb))
+ if (!is_vlan_arp(skb, state->net))
return NF_ACCEPT;
nf_bridge_pull_encap_header(skb);
}
@@ -650,7 +679,7 @@
return NF_DROP;
if (arp_hdr(skb)->ar_pln != 4) {
- if (IS_VLAN_ARP(skb))
+ if (is_vlan_arp(skb, state->net))
nf_bridge_push_encap_header(skb);
return NF_ACCEPT;
}
@@ -805,9 +834,11 @@
if (!realoutdev)
return NF_DROP;
- if (IS_IP(skb) || IS_VLAN_IP(skb) || IS_PPPOE_IP(skb))
+ if (IS_IP(skb) || is_vlan_ip(skb, state->net) ||
+ is_pppoe_ip(skb, state->net))
pf = NFPROTO_IPV4;
- else if (IS_IPV6(skb) || IS_VLAN_IPV6(skb) || IS_PPPOE_IPV6(skb))
+ else if (IS_IPV6(skb) || is_vlan_ipv6(skb, state->net) ||
+ is_pppoe_ipv6(skb, state->net))
pf = NFPROTO_IPV6;
else
return NF_ACCEPT;
@@ -955,23 +986,6 @@
return NOTIFY_OK;
}
-static void __net_exit brnf_exit_net(struct net *net)
-{
- struct brnf_net *brnet = net_generic(net, brnf_net_id);
-
- if (!brnet->enabled)
- return;
-
- nf_unregister_net_hooks(net, br_nf_ops, ARRAY_SIZE(br_nf_ops));
- brnet->enabled = false;
-}
-
-static struct pernet_operations brnf_net_ops __read_mostly = {
- .exit = brnf_exit_net,
- .id = &brnf_net_id,
- .size = sizeof(struct brnf_net),
-};
-
static struct notifier_block brnf_notifier __read_mostly = {
.notifier_call = brnf_device_event,
};
@@ -1030,50 +1044,125 @@
static struct ctl_table brnf_table[] = {
{
.procname = "bridge-nf-call-arptables",
- .data = &brnf_call_arptables,
.maxlen = sizeof(int),
.mode = 0644,
.proc_handler = brnf_sysctl_call_tables,
},
{
.procname = "bridge-nf-call-iptables",
- .data = &brnf_call_iptables,
.maxlen = sizeof(int),
.mode = 0644,
.proc_handler = brnf_sysctl_call_tables,
},
{
.procname = "bridge-nf-call-ip6tables",
- .data = &brnf_call_ip6tables,
.maxlen = sizeof(int),
.mode = 0644,
.proc_handler = brnf_sysctl_call_tables,
},
{
.procname = "bridge-nf-filter-vlan-tagged",
- .data = &brnf_filter_vlan_tagged,
.maxlen = sizeof(int),
.mode = 0644,
.proc_handler = brnf_sysctl_call_tables,
},
{
.procname = "bridge-nf-filter-pppoe-tagged",
- .data = &brnf_filter_pppoe_tagged,
.maxlen = sizeof(int),
.mode = 0644,
.proc_handler = brnf_sysctl_call_tables,
},
{
.procname = "bridge-nf-pass-vlan-input-dev",
- .data = &brnf_pass_vlan_indev,
.maxlen = sizeof(int),
.mode = 0644,
.proc_handler = brnf_sysctl_call_tables,
},
{ }
};
+
+static inline void br_netfilter_sysctl_default(struct brnf_net *brnf)
+{
+ brnf->call_iptables = 1;
+ brnf->call_ip6tables = 1;
+ brnf->call_arptables = 1;
+ brnf->filter_vlan_tagged = 0;
+ brnf->filter_pppoe_tagged = 0;
+ brnf->pass_vlan_indev = 0;
+}
+
+static int br_netfilter_sysctl_init_net(struct net *net)
+{
+ struct ctl_table *table = brnf_table;
+ struct brnf_net *brnet;
+
+ if (!net_eq(net, &init_net)) {
+ table = kmemdup(table, sizeof(brnf_table), GFP_KERNEL);
+ if (!table)
+ return -ENOMEM;
+ }
+
+ brnet = net_generic(net, brnf_net_id);
+ table[0].data = &brnet->call_arptables;
+ table[1].data = &brnet->call_iptables;
+ table[2].data = &brnet->call_ip6tables;
+ table[3].data = &brnet->filter_vlan_tagged;
+ table[4].data = &brnet->filter_pppoe_tagged;
+ table[5].data = &brnet->pass_vlan_indev;
+
+ br_netfilter_sysctl_default(brnet);
+
+ brnet->ctl_hdr = register_net_sysctl(net, "net/bridge", table);
+ if (!brnet->ctl_hdr) {
+ if (!net_eq(net, &init_net))
+ kfree(table);
+
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+static void br_netfilter_sysctl_exit_net(struct net *net,
+ struct brnf_net *brnet)
+{
+ struct ctl_table *table = brnet->ctl_hdr->ctl_table_arg;
+
+ unregister_net_sysctl_table(brnet->ctl_hdr);
+ if (!net_eq(net, &init_net))
+ kfree(table);
+}
+
+static int __net_init brnf_init_net(struct net *net)
+{
+ return br_netfilter_sysctl_init_net(net);
+}
#endif
+static void __net_exit brnf_exit_net(struct net *net)
+{
+ struct brnf_net *brnet;
+
+ brnet = net_generic(net, brnf_net_id);
+ if (brnet->enabled) {
+ nf_unregister_net_hooks(net, br_nf_ops, ARRAY_SIZE(br_nf_ops));
+ brnet->enabled = false;
+ }
+
+#ifdef CONFIG_SYSCTL
+ br_netfilter_sysctl_exit_net(net, brnet);
+#endif
+}
+
+static struct pernet_operations brnf_net_ops __read_mostly = {
+#ifdef CONFIG_SYSCTL
+ .init = brnf_init_net,
+#endif
+ .exit = brnf_exit_net,
+ .id = &brnf_net_id,
+ .size = sizeof(struct brnf_net),
+};
+
static int __init br_netfilter_init(void)
{
int ret;
@@ -1088,16 +1177,6 @@
return ret;
}
-#ifdef CONFIG_SYSCTL
- brnf_sysctl_header = register_net_sysctl(&init_net, "net/bridge", brnf_table);
- if (brnf_sysctl_header == NULL) {
- printk(KERN_WARNING
- "br_netfilter: can't register to sysctl.\n");
- unregister_netdevice_notifier(&brnf_notifier);
- unregister_pernet_subsys(&brnf_net_ops);
- return -ENOMEM;
- }
-#endif
RCU_INIT_POINTER(nf_br_ops, &br_ops);
printk(KERN_NOTICE "Bridge firewalling registered\n");
return 0;
@@ -1108,9 +1187,6 @@
RCU_INIT_POINTER(nf_br_ops, NULL);
unregister_netdevice_notifier(&brnf_notifier);
unregister_pernet_subsys(&brnf_net_ops);
-#ifdef CONFIG_SYSCTL
- unregister_net_sysctl_table(brnf_sysctl_header);
-#endif
}
module_init(br_netfilter_init);
diff --git a/net/bridge/br_netfilter_ipv6.c b/net/bridge/br_netfilter_ipv6.c
index 09d5e0c..1e0fe6b 100644
--- a/net/bridge/br_netfilter_ipv6.c
+++ b/net/bridge/br_netfilter_ipv6.c
@@ -228,7 +228,7 @@
nf_bridge_put(skb->nf_bridge);
if (!nf_bridge_alloc(skb))
return NF_DROP;
- if (!setup_pre_routing(skb))
+ if (!setup_pre_routing(skb, state->net))
return NF_DROP;
nf_bridge = nf_bridge_info_get(skb);
diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
index ec2b58a..e5a5bc5 100644
--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
@@ -1139,7 +1139,7 @@
spin_lock_bh(&br->lock);
memcpy(br->group_addr, new_addr, sizeof(br->group_addr));
spin_unlock_bh(&br->lock);
- br->group_addr_set = true;
+ br_opt_toggle(br, BROPT_GROUP_ADDR_SET, true);
br_recalculate_fwd_mask(br);
}
@@ -1167,7 +1167,7 @@
u8 val;
val = nla_get_u8(data[IFLA_BR_MCAST_QUERY_USE_IFADDR]);
- br->multicast_query_use_ifaddr = !!val;
+ br_opt_toggle(br, BROPT_MULTICAST_QUERY_USE_IFADDR, !!val);
}
if (data[IFLA_BR_MCAST_QUERIER]) {
@@ -1244,7 +1244,7 @@
__u8 mcast_stats;
mcast_stats = nla_get_u8(data[IFLA_BR_MCAST_STATS_ENABLED]);
- br->multicast_stats_enabled = !!mcast_stats;
+ br_opt_toggle(br, BROPT_MULTICAST_STATS_ENABLED, !!mcast_stats);
}
if (data[IFLA_BR_MCAST_IGMP_VERSION]) {
@@ -1271,19 +1271,19 @@
if (data[IFLA_BR_NF_CALL_IPTABLES]) {
u8 val = nla_get_u8(data[IFLA_BR_NF_CALL_IPTABLES]);
- br->nf_call_iptables = val ? true : false;
+ br_opt_toggle(br, BROPT_NF_CALL_IPTABLES, !!val);
}
if (data[IFLA_BR_NF_CALL_IP6TABLES]) {
u8 val = nla_get_u8(data[IFLA_BR_NF_CALL_IP6TABLES]);
- br->nf_call_ip6tables = val ? true : false;
+ br_opt_toggle(br, BROPT_NF_CALL_IP6TABLES, !!val);
}
if (data[IFLA_BR_NF_CALL_ARPTABLES]) {
u8 val = nla_get_u8(data[IFLA_BR_NF_CALL_ARPTABLES]);
- br->nf_call_arptables = val ? true : false;
+ br_opt_toggle(br, BROPT_NF_CALL_ARPTABLES, !!val);
}
#endif
@@ -1416,17 +1416,20 @@
#ifdef CONFIG_BRIDGE_VLAN_FILTERING
if (nla_put_be16(skb, IFLA_BR_VLAN_PROTOCOL, br->vlan_proto) ||
nla_put_u16(skb, IFLA_BR_VLAN_DEFAULT_PVID, br->default_pvid) ||
- nla_put_u8(skb, IFLA_BR_VLAN_STATS_ENABLED, br->vlan_stats_enabled))
+ nla_put_u8(skb, IFLA_BR_VLAN_STATS_ENABLED,
+ br_opt_get(br, BROPT_VLAN_STATS_ENABLED)))
return -EMSGSIZE;
#endif
#ifdef CONFIG_BRIDGE_IGMP_SNOOPING
if (nla_put_u8(skb, IFLA_BR_MCAST_ROUTER, br->multicast_router) ||
- nla_put_u8(skb, IFLA_BR_MCAST_SNOOPING, !br->multicast_disabled) ||
+ nla_put_u8(skb, IFLA_BR_MCAST_SNOOPING,
+ br_opt_get(br, BROPT_MULTICAST_ENABLED)) ||
nla_put_u8(skb, IFLA_BR_MCAST_QUERY_USE_IFADDR,
- br->multicast_query_use_ifaddr) ||
- nla_put_u8(skb, IFLA_BR_MCAST_QUERIER, br->multicast_querier) ||
+ br_opt_get(br, BROPT_MULTICAST_QUERY_USE_IFADDR)) ||
+ nla_put_u8(skb, IFLA_BR_MCAST_QUERIER,
+ br_opt_get(br, BROPT_MULTICAST_QUERIER)) ||
nla_put_u8(skb, IFLA_BR_MCAST_STATS_ENABLED,
- br->multicast_stats_enabled) ||
+ br_opt_get(br, BROPT_MULTICAST_STATS_ENABLED)) ||
nla_put_u32(skb, IFLA_BR_MCAST_HASH_ELASTICITY,
br->hash_elasticity) ||
nla_put_u32(skb, IFLA_BR_MCAST_HASH_MAX, br->hash_max) ||
@@ -1469,11 +1472,11 @@
#endif
#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
if (nla_put_u8(skb, IFLA_BR_NF_CALL_IPTABLES,
- br->nf_call_iptables ? 1 : 0) ||
+ br_opt_get(br, BROPT_NF_CALL_IPTABLES) ? 1 : 0) ||
nla_put_u8(skb, IFLA_BR_NF_CALL_IP6TABLES,
- br->nf_call_ip6tables ? 1 : 0) ||
+ br_opt_get(br, BROPT_NF_CALL_IP6TABLES) ? 1 : 0) ||
nla_put_u8(skb, IFLA_BR_NF_CALL_ARPTABLES,
- br->nf_call_arptables ? 1 : 0))
+ br_opt_get(br, BROPT_NF_CALL_ARPTABLES) ? 1 : 0))
return -EMSGSIZE;
#endif
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 11ed202..c18bf39 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -54,14 +54,12 @@
typedef struct mac_addr mac_addr;
typedef __u16 port_id;
-struct bridge_id
-{
+struct bridge_id {
unsigned char prio[2];
unsigned char addr[ETH_ALEN];
};
-struct mac_addr
-{
+struct mac_addr {
unsigned char addr[ETH_ALEN];
};
@@ -206,8 +204,7 @@
unsigned char eth_addr[ETH_ALEN];
};
-struct net_bridge_mdb_entry
-{
+struct net_bridge_mdb_entry {
struct hlist_node hlist[2];
struct net_bridge *br;
struct net_bridge_port_group __rcu *ports;
@@ -217,8 +214,7 @@
bool host_joined;
};
-struct net_bridge_mdb_htable
-{
+struct net_bridge_mdb_htable {
struct hlist_head *mhash;
struct rcu_head rcu;
struct net_bridge_mdb_htable *old;
@@ -309,16 +305,31 @@
rcu_dereference_rtnl(dev->rx_handler_data) : NULL;
}
+enum net_bridge_opts {
+ BROPT_VLAN_ENABLED,
+ BROPT_VLAN_STATS_ENABLED,
+ BROPT_NF_CALL_IPTABLES,
+ BROPT_NF_CALL_IP6TABLES,
+ BROPT_NF_CALL_ARPTABLES,
+ BROPT_GROUP_ADDR_SET,
+ BROPT_MULTICAST_ENABLED,
+ BROPT_MULTICAST_QUERIER,
+ BROPT_MULTICAST_QUERY_USE_IFADDR,
+ BROPT_MULTICAST_STATS_ENABLED,
+ BROPT_HAS_IPV6_ADDR,
+ BROPT_NEIGH_SUPPRESS_ENABLED,
+ BROPT_MTU_SET_BY_USER,
+};
+
struct net_bridge {
spinlock_t lock;
spinlock_t hash_lock;
struct list_head port_list;
struct net_device *dev;
struct pcpu_sw_netstats __percpu *stats;
+ unsigned long options;
/* These fields are accessed on each packet */
#ifdef CONFIG_BRIDGE_VLAN_FILTERING
- u8 vlan_enabled;
- u8 vlan_stats_enabled;
__be16 vlan_proto;
u16 default_pvid;
struct net_bridge_vlan_group __rcu *vlgrp;
@@ -330,9 +341,6 @@
struct rtable fake_rtable;
struct rt6_info fake_rt6_info;
};
- bool nf_call_iptables;
- bool nf_call_ip6tables;
- bool nf_call_arptables;
#endif
u16 group_fwd_mask;
u16 group_fwd_mask_required;
@@ -340,7 +348,6 @@
/* STP */
bridge_id designated_root;
bridge_id bridge_id;
- u32 root_path_cost;
unsigned char topology_change;
unsigned char topology_change_detected;
u16 root_port;
@@ -352,9 +359,9 @@
unsigned long bridge_hello_time;
unsigned long bridge_forward_delay;
unsigned long bridge_ageing_time;
+ u32 root_path_cost;
u8 group_addr[ETH_ALEN];
- bool group_addr_set;
enum {
BR_NO_STP, /* no spanning tree */
@@ -363,13 +370,6 @@
} stp_enabled;
#ifdef CONFIG_BRIDGE_IGMP_SNOOPING
- unsigned char multicast_router;
-
- u8 multicast_disabled:1;
- u8 multicast_querier:1;
- u8 multicast_query_use_ifaddr:1;
- u8 has_ipv6_addr:1;
- u8 multicast_stats_enabled:1;
u32 hash_elasticity;
u32 hash_max;
@@ -378,7 +378,11 @@
u32 multicast_startup_query_count;
u8 multicast_igmp_version;
-
+ u8 multicast_router;
+#if IS_ENABLED(CONFIG_IPV6)
+ u8 multicast_mld_version;
+#endif
+ spinlock_t multicast_lock;
unsigned long multicast_last_member_interval;
unsigned long multicast_membership_interval;
unsigned long multicast_querier_interval;
@@ -386,7 +390,6 @@
unsigned long multicast_query_response_interval;
unsigned long multicast_startup_query_interval;
- spinlock_t multicast_lock;
struct net_bridge_mdb_htable __rcu *mdb;
struct hlist_head router_list;
@@ -399,7 +402,6 @@
struct bridge_mcast_other_query ip6_other_query;
struct bridge_mcast_own_query ip6_own_query;
struct bridge_mcast_querier ip6_querier;
- u8 multicast_mld_version;
#endif /* IS_ENABLED(CONFIG_IPV6) */
#endif
@@ -413,8 +415,6 @@
#ifdef CONFIG_NET_SWITCHDEV
int offload_fwd_mark;
#endif
- bool neigh_suppress_enabled;
- bool mtu_set_by_user;
struct hlist_head fdb_list;
};
@@ -492,6 +492,14 @@
return true;
}
+static inline int br_opt_get(const struct net_bridge *br,
+ enum net_bridge_opts opt)
+{
+ return test_bit(opt, &br->options);
+}
+
+void br_opt_toggle(struct net_bridge *br, enum net_bridge_opts opt, bool on);
+
/* br_device.c */
void br_dev_setup(struct net_device *dev);
void br_dev_delete(struct net_device *dev, struct list_head *list);
@@ -698,8 +706,8 @@
{
bool own_querier_enabled;
- if (br->multicast_querier) {
- if (is_ipv6 && !br->has_ipv6_addr)
+ if (br_opt_get(br, BROPT_MULTICAST_QUERIER)) {
+ if (is_ipv6 && !br_opt_get(br, BROPT_HAS_IPV6_ADDR))
own_querier_enabled = false;
else
own_querier_enabled = true;
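
For context on the new helpers: br_opt_get() above is the complete read side, while br_opt_toggle() is only declared here; its definition lives in br.c, outside this excerpt. A minimal sketch of what the toggle likely amounts to, assuming plain atomic bitops on br->options:

    /* sketch only, the real definition is in net/bridge/br.c */
    void br_opt_toggle(struct net_bridge *br, enum net_bridge_opts opt, bool on)
    {
        bool cur = !!br_opt_get(br, opt);

        if (cur == on)
            return;

        if (on)
            set_bit(opt, &br->options);    /* atomic, no extra locking */
        else
            clear_bit(opt, &br->options);
    }

Because options is a single unsigned long manipulated with atomic bitops, the scattered bool and bitfield members removed above collapse into one word without introducing any new locking.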
diff --git a/net/bridge/br_sysfs_br.c b/net/bridge/br_sysfs_br.c
index 0318a69..c93c572 100644
--- a/net/bridge/br_sysfs_br.c
+++ b/net/bridge/br_sysfs_br.c
@@ -303,7 +303,7 @@
ether_addr_copy(br->group_addr, new_addr);
spin_unlock_bh(&br->lock);
- br->group_addr_set = true;
+ br_opt_toggle(br, BROPT_GROUP_ADDR_SET, true);
br_recalculate_fwd_mask(br);
netdev_state_change(br->dev);
@@ -349,7 +349,7 @@
char *buf)
{
struct net_bridge *br = to_bridge(d);
- return sprintf(buf, "%d\n", !br->multicast_disabled);
+ return sprintf(buf, "%d\n", br_opt_get(br, BROPT_MULTICAST_ENABLED));
}
static ssize_t multicast_snooping_store(struct device *d,
@@ -365,12 +365,13 @@
char *buf)
{
struct net_bridge *br = to_bridge(d);
- return sprintf(buf, "%d\n", br->multicast_query_use_ifaddr);
+ return sprintf(buf, "%d\n",
+ br_opt_get(br, BROPT_MULTICAST_QUERY_USE_IFADDR));
}
static int set_query_use_ifaddr(struct net_bridge *br, unsigned long val)
{
- br->multicast_query_use_ifaddr = !!val;
+ br_opt_toggle(br, BROPT_MULTICAST_QUERY_USE_IFADDR, !!val);
return 0;
}
@@ -388,7 +389,7 @@
char *buf)
{
struct net_bridge *br = to_bridge(d);
- return sprintf(buf, "%d\n", br->multicast_querier);
+ return sprintf(buf, "%d\n", br_opt_get(br, BROPT_MULTICAST_QUERIER));
}
static ssize_t multicast_querier_store(struct device *d,
@@ -636,12 +637,13 @@
{
struct net_bridge *br = to_bridge(d);
- return sprintf(buf, "%u\n", br->multicast_stats_enabled);
+ return sprintf(buf, "%d\n",
+ br_opt_get(br, BROPT_MULTICAST_STATS_ENABLED));
}
static int set_stats_enabled(struct net_bridge *br, unsigned long val)
{
- br->multicast_stats_enabled = !!val;
+ br_opt_toggle(br, BROPT_MULTICAST_STATS_ENABLED, !!val);
return 0;
}
@@ -678,12 +680,12 @@
struct device *d, struct device_attribute *attr, char *buf)
{
struct net_bridge *br = to_bridge(d);
- return sprintf(buf, "%u\n", br->nf_call_iptables);
+ return sprintf(buf, "%u\n", br_opt_get(br, BROPT_NF_CALL_IPTABLES));
}
static int set_nf_call_iptables(struct net_bridge *br, unsigned long val)
{
- br->nf_call_iptables = val ? true : false;
+ br_opt_toggle(br, BROPT_NF_CALL_IPTABLES, !!val);
return 0;
}
@@ -699,12 +701,12 @@
struct device *d, struct device_attribute *attr, char *buf)
{
struct net_bridge *br = to_bridge(d);
- return sprintf(buf, "%u\n", br->nf_call_ip6tables);
+ return sprintf(buf, "%u\n", br_opt_get(br, BROPT_NF_CALL_IP6TABLES));
}
static int set_nf_call_ip6tables(struct net_bridge *br, unsigned long val)
{
- br->nf_call_ip6tables = val ? true : false;
+ br_opt_toggle(br, BROPT_NF_CALL_IP6TABLES, !!val);
return 0;
}
@@ -720,12 +722,12 @@
struct device *d, struct device_attribute *attr, char *buf)
{
struct net_bridge *br = to_bridge(d);
- return sprintf(buf, "%u\n", br->nf_call_arptables);
+ return sprintf(buf, "%u\n", br_opt_get(br, BROPT_NF_CALL_ARPTABLES));
}
static int set_nf_call_arptables(struct net_bridge *br, unsigned long val)
{
- br->nf_call_arptables = val ? true : false;
+ br_opt_toggle(br, BROPT_NF_CALL_ARPTABLES, !!val);
return 0;
}
@@ -743,7 +745,7 @@
char *buf)
{
struct net_bridge *br = to_bridge(d);
- return sprintf(buf, "%d\n", br->vlan_enabled);
+ return sprintf(buf, "%d\n", br_opt_get(br, BROPT_VLAN_ENABLED));
}
static ssize_t vlan_filtering_store(struct device *d,
@@ -791,7 +793,7 @@
char *buf)
{
struct net_bridge *br = to_bridge(d);
- return sprintf(buf, "%u\n", br->vlan_stats_enabled);
+ return sprintf(buf, "%u\n", br_opt_get(br, BROPT_VLAN_STATS_ENABLED));
}
static ssize_t vlan_stats_enabled_store(struct device *d,
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index 5f3950f..f1744b6 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -386,7 +386,7 @@
return NULL;
}
}
- if (br->vlan_stats_enabled) {
+ if (br_opt_get(br, BROPT_VLAN_STATS_ENABLED)) {
stats = this_cpu_ptr(v->stats);
u64_stats_update_begin(&stats->syncp);
stats->tx_bytes += skb->len;
@@ -475,14 +475,14 @@
skb->vlan_tci |= pvid;
/* if stats are disabled we can avoid the lookup */
- if (!br->vlan_stats_enabled)
+ if (!br_opt_get(br, BROPT_VLAN_STATS_ENABLED))
return true;
}
v = br_vlan_find(vg, *vid);
if (!v || !br_vlan_should_use(v))
goto drop;
- if (br->vlan_stats_enabled) {
+ if (br_opt_get(br, BROPT_VLAN_STATS_ENABLED)) {
stats = this_cpu_ptr(v->stats);
u64_stats_update_begin(&stats->syncp);
stats->rx_bytes += skb->len;
@@ -504,7 +504,7 @@
/* If VLAN filtering is disabled on the bridge, all packets are
* permitted.
*/
- if (!br->vlan_enabled) {
+ if (!br_opt_get(br, BROPT_VLAN_ENABLED)) {
BR_INPUT_SKB_CB(skb)->vlan_filtered = false;
return true;
}
@@ -538,7 +538,7 @@
struct net_bridge *br = p->br;
/* If filtering was disabled at input, let it pass. */
- if (!br->vlan_enabled)
+ if (!br_opt_get(br, BROPT_VLAN_ENABLED))
return true;
vg = nbp_vlan_group_rcu(p);
@@ -700,11 +700,12 @@
/* Must be protected by RTNL. */
static void recalculate_group_addr(struct net_bridge *br)
{
- if (br->group_addr_set)
+ if (br_opt_get(br, BROPT_GROUP_ADDR_SET))
return;
spin_lock_bh(&br->lock);
- if (!br->vlan_enabled || br->vlan_proto == htons(ETH_P_8021Q)) {
+ if (!br_opt_get(br, BROPT_VLAN_ENABLED) ||
+ br->vlan_proto == htons(ETH_P_8021Q)) {
/* Bridge Group Address */
br->group_addr[5] = 0x00;
} else { /* vlan_enabled && ETH_P_8021AD */
@@ -717,7 +718,8 @@
/* Must be protected by RTNL. */
void br_recalculate_fwd_mask(struct net_bridge *br)
{
- if (!br->vlan_enabled || br->vlan_proto == htons(ETH_P_8021Q))
+ if (!br_opt_get(br, BROPT_VLAN_ENABLED) ||
+ br->vlan_proto == htons(ETH_P_8021Q))
br->group_fwd_mask_required = BR_GROUPFWD_DEFAULT;
else /* vlan_enabled && ETH_P_8021AD */
br->group_fwd_mask_required = BR_GROUPFWD_8021AD &
@@ -734,14 +736,14 @@
};
int err;
- if (br->vlan_enabled == val)
+ if (br_opt_get(br, BROPT_VLAN_ENABLED) == !!val)
return 0;
err = switchdev_port_attr_set(br->dev, &attr);
if (err && err != -EOPNOTSUPP)
return err;
- br->vlan_enabled = val;
+ br_opt_toggle(br, BROPT_VLAN_ENABLED, !!val);
br_manage_promisc(br);
recalculate_group_addr(br);
br_recalculate_fwd_mask(br);
@@ -758,7 +760,7 @@
{
struct net_bridge *br = netdev_priv(dev);
- return !!br->vlan_enabled;
+ return br_opt_get(br, BROPT_VLAN_ENABLED);
}
EXPORT_SYMBOL_GPL(br_vlan_enabled);
@@ -824,7 +826,7 @@
switch (val) {
case 0:
case 1:
- br->vlan_stats_enabled = val;
+ br_opt_toggle(br, BROPT_VLAN_STATS_ENABLED, !!val);
break;
default:
return -EINVAL;
@@ -970,7 +972,7 @@
goto out;
/* Only allow default pvid change when filtering is disabled */
- if (br->vlan_enabled) {
+ if (br_opt_get(br, BROPT_VLAN_ENABLED)) {
pr_info_once("Please disable vlan filtering to change default_pvid\n");
err = -EPERM;
goto out;
@@ -1024,7 +1026,7 @@
.orig_dev = p->br->dev,
.id = SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING,
.flags = SWITCHDEV_F_SKIP_EOPNOTSUPP,
- .u.vlan_filtering = p->br->vlan_enabled,
+ .u.vlan_filtering = br_opt_get(p->br, BROPT_VLAN_ENABLED),
};
struct net_bridge_vlan_group *vg;
int ret = -ENOMEM;
diff --git a/net/core/dev.c b/net/core/dev.c
index 50498a7..dc6cbb4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1822,7 +1822,7 @@
#endif
static DEFINE_STATIC_KEY_FALSE(netstamp_needed_key);
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
static atomic_t netstamp_needed_deferred;
static atomic_t netstamp_wanted;
static void netstamp_clear(struct work_struct *work)
@@ -1841,7 +1841,7 @@
void net_enable_timestamp(void)
{
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
int wanted;
while (1) {
@@ -1861,7 +1861,7 @@
void net_disable_timestamp(void)
{
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
int wanted;
while (1) {
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index 7446b98..433e35a 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -20,6 +20,7 @@
obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
+obj-$(CONFIG_SYSFS) += sysfs_net_ipv4.o
obj-$(CONFIG_PROC_FS) += proc.o
obj-$(CONFIG_IP_MULTIPLE_TABLES) += fib_rules.o
obj-$(CONFIG_IP_MROUTE) += ipmr.o
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 3047fc47..81637bb 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -348,12 +348,18 @@
static inline void empty_child_inc(struct key_vector *n)
{
- ++tn_info(n)->empty_children ? : ++tn_info(n)->full_children;
+ tn_info(n)->empty_children++;
+
+ if (!tn_info(n)->empty_children)
+ tn_info(n)->full_children++;
}
static inline void empty_child_dec(struct key_vector *n)
{
- tn_info(n)->empty_children-- ? : tn_info(n)->full_children--;
+ if (!tn_info(n)->empty_children)
+ tn_info(n)->full_children--;
+
+ tn_info(n)->empty_children--;
}
static struct key_vector *leaf_new(t_key key, struct fib_alias *fa)
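
The fib_trie hunk above replaces GCC's "x ? : y" shorthand, which evaluates y only when x is zero, so full_children silently absorbed the carry whenever empty_children wrapped around. The rewrite keeps exactly that behaviour, just spelled out. A standalone illustration (the 8-bit counter is chosen only so the wrap is observable; this is not kernel code):

    #include <assert.h>

    static unsigned char empty_children, full_children;

    static void empty_child_inc(void)
    {
        /* equivalent to: ++empty_children ? : ++full_children; */
        if (!++empty_children)
            ++full_children;
    }

    int main(void)
    {
        int i;

        for (i = 0; i < 256; i++)
            empty_child_inc();
        /* 256 increments wrap the 8-bit counter exactly once */
        assert(empty_children == 0 && full_children == 1);
        return 0;
    }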
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 82f341e..aa3fd61 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -343,6 +343,8 @@
return -EINVAL;
new_ra = on ? kmalloc(sizeof(*new_ra), GFP_KERNEL) : NULL;
+ if (on && !new_ra)
+ return -ENOMEM;
mutex_lock(&net->ipv4.ra_mutex);
for (rap = &net->ipv4.ra_chain;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index ad132b6..6357b03 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -222,6 +222,21 @@
return ret;
}
+/* Validate changes from /proc interface. */
+static int proc_tcp_default_init_rwnd(struct ctl_table *ctl, int write,
+ void __user *buffer,
+ size_t *lenp, loff_t *ppos)
+{
+ int old_value = *(int *)ctl->data;
+ int ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
+ int new_value = *(int *)ctl->data;
+
+ if (write && ret == 0 && (new_value < 3 || new_value > 100))
+ *(int *)ctl->data = old_value;
+
+ return ret;
+}
+
static int proc_tcp_congestion_control(struct ctl_table *ctl, int write,
void __user *buffer, size_t *lenp, loff_t *ppos)
{
@@ -1191,6 +1206,13 @@
.extra2 = &thousand,
},
{
+ .procname = "tcp_default_init_rwnd",
+ .data = &init_net.ipv4.sysctl_tcp_default_init_rwnd,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_tcp_default_init_rwnd
+ },
+ {
.procname = "tcp_wmem",
.data = &init_net.ipv4.sysctl_tcp_wmem,
.maxlen = sizeof(init_net.ipv4.sysctl_tcp_wmem),
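
The proc_tcp_default_init_rwnd() handler above accepts the write unconditionally and then silently restores the previous value when the new one falls outside [3, 100]; the caller still sees success. A standalone model of that behaviour (plain userspace C, not kernel code):

    #include <assert.h>

    static int rwnd = 20;   /* stands in for sysctl_tcp_default_init_rwnd */

    static int set_init_rwnd(int new_value)
    {
        int old_value = rwnd;

        rwnd = new_value;               /* what proc_dointvec() already did */
        if (new_value < 3 || new_value > 100)
            rwnd = old_value;           /* silently rolled back */
        return 0;                       /* caller still sees success */
    }

    int main(void)
    {
        set_init_rwnd(40);
        assert(rwnd == 40);
        set_init_rwnd(1000);            /* out of range, reverted */
        assert(rwnd == 40);
        return 0;
    }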
diff --git a/net/ipv4/sysfs_net_ipv4.c b/net/ipv4/sysfs_net_ipv4.c
new file mode 100644
index 0000000..35a651aa
--- /dev/null
+++ b/net/ipv4/sysfs_net_ipv4.c
@@ -0,0 +1,88 @@
+/*
+ * net/ipv4/sysfs_net_ipv4.c
+ *
+ * sysfs-based networking knobs (so we can, unlike with sysctl, control perms)
+ *
+ * Copyright (C) 2008 Google, Inc.
+ *
+ * Robert Love <rlove@google.com>
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kobject.h>
+#include <linux/string.h>
+#include <linux/sysfs.h>
+#include <linux/init.h>
+#include <net/tcp.h>
+
+#define CREATE_IPV4_FILE(_name, _var) \
+static ssize_t _name##_show(struct kobject *kobj, \
+ struct kobj_attribute *attr, char *buf) \
+{ \
+ return sprintf(buf, "%d\n", _var); \
+} \
+static ssize_t _name##_store(struct kobject *kobj, \
+ struct kobj_attribute *attr, \
+ const char *buf, size_t count) \
+{ \
+ int val, ret; \
+ ret = sscanf(buf, "%d", &val); \
+ if (ret != 1) \
+ return -EINVAL; \
+ if (val < 0) \
+ return -EINVAL; \
+ _var = val; \
+ return count; \
+} \
+static struct kobj_attribute _name##_attr = \
+ __ATTR(_name, 0644, _name##_show, _name##_store)
+
+CREATE_IPV4_FILE(tcp_wmem_min, init_net.ipv4.sysctl_tcp_wmem[0]);
+CREATE_IPV4_FILE(tcp_wmem_def, init_net.ipv4.sysctl_tcp_wmem[1]);
+CREATE_IPV4_FILE(tcp_wmem_max, init_net.ipv4.sysctl_tcp_wmem[2]);
+
+CREATE_IPV4_FILE(tcp_rmem_min, init_net.ipv4.sysctl_tcp_rmem[0]);
+CREATE_IPV4_FILE(tcp_rmem_def, init_net.ipv4.sysctl_tcp_rmem[1]);
+CREATE_IPV4_FILE(tcp_rmem_max, init_net.ipv4.sysctl_tcp_rmem[2]);
+
+static struct attribute *ipv4_attrs[] = {
+ &tcp_wmem_min_attr.attr,
+ &tcp_wmem_def_attr.attr,
+ &tcp_wmem_max_attr.attr,
+ &tcp_rmem_min_attr.attr,
+ &tcp_rmem_def_attr.attr,
+ &tcp_rmem_max_attr.attr,
+ NULL
+};
+
+static struct attribute_group ipv4_attr_group = {
+ .attrs = ipv4_attrs,
+};
+
+static __init int sysfs_ipv4_init(void)
+{
+ struct kobject *ipv4_kobject;
+ int ret;
+
+ ipv4_kobject = kobject_create_and_add("ipv4", kernel_kobj);
+ if (!ipv4_kobject)
+ return -ENOMEM;
+
+ ret = sysfs_create_group(ipv4_kobject, &ipv4_attr_group);
+ if (ret) {
+ kobject_put(ipv4_kobject);
+ return ret;
+ }
+
+ return 0;
+}
+
+subsys_initcall(sysfs_ipv4_init);
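
CREATE_IPV4_FILE() above stamps out one show/store pair per tunable and sysfs_ipv4_init() attaches the group to kernel_kobj, so the files should appear as /sys/kernel/ipv4/tcp_{w,r}mem_{min,def,max}. A small userspace sketch of reading and adjusting one of them (the files are 0644, so writes need root; 262144 is only an example value):

    #include <stdio.h>

    int main(void)
    {
        char buf[32];
        FILE *f = fopen("/sys/kernel/ipv4/tcp_rmem_def", "r");

        if (!f) {
            perror("tcp_rmem_def");
            return 1;
        }
        if (fgets(buf, sizeof(buf), f))
            printf("default TCP receive buffer: %s", buf);
        fclose(f);

        /* the store handler rejects anything that does not parse
         * as a non-negative integer with -EINVAL */
        f = fopen("/sys/kernel/ipv4/tcp_rmem_def", "w");
        if (f) {
            fprintf(f, "262144\n");
            fclose(f);
        }
        return 0;
    }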
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 6da3930..5e7d66b 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2581,6 +2581,7 @@
net->ipv4.sysctl_tcp_invalid_ratelimit = HZ/2;
net->ipv4.sysctl_tcp_pacing_ss_ratio = 200;
net->ipv4.sysctl_tcp_pacing_ca_ratio = 120;
+ net->ipv4.sysctl_tcp_default_init_rwnd = TCP_INIT_CWND * 2;
if (net != &init_net) {
memcpy(net->ipv4.sysctl_tcp_rmem,
init_net.ipv4.sysctl_tcp_rmem,
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index cc4ba42..fc4ed9e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -232,6 +232,7 @@
(*rcv_wscale)++;
}
}
+
/* Set the clamp no higher than max representable value */
(*window_clamp) = min_t(__u32, U16_MAX << (*rcv_wscale), *window_clamp);
}
diff --git a/net/ipv6/exthdrs_core.c b/net/ipv6/exthdrs_core.c
index ae365df..1af240dc 100644
--- a/net/ipv6/exthdrs_core.c
+++ b/net/ipv6/exthdrs_core.c
@@ -166,15 +166,15 @@
* to explore inner IPv6 header, eg. ICMPv6 error messages.
*
* If target header is found, its offset is set in *offset and return protocol
- * number. Otherwise, return -1.
+ * number. Otherwise, return -ENOENT or -EBADMSG.
*
* If the first fragment doesn't contain the final protocol header or
* NEXTHDR_NONE it is considered invalid.
*
* Note that non-1st fragment is special case that "the protocol number
* of last header" is "next header" field in Fragment header. In this case,
- * *offset is meaningless and fragment offset is stored in *fragoff if fragoff
- * isn't NULL.
+ * *offset is meaningless. If fragoff is not NULL, the fragment offset is
+ * stored in *fragoff; if it is NULL, return -EINVAL.
*
* if flags is not NULL and it's a fragment, then the frag flag
* IP6_FH_F_FRAG will be set. If it's an AH header, the
@@ -251,9 +251,12 @@
if (target < 0 &&
((!ipv6_ext_hdr(hp->nexthdr)) ||
hp->nexthdr == NEXTHDR_NONE)) {
- if (fragoff)
+ if (fragoff) {
*fragoff = _frag_off;
- return hp->nexthdr;
+ return hp->nexthdr;
+ } else {
+ return -EINVAL;
+ }
}
if (!found)
return -ENOENT;
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 231c489..fe4ac4f 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -68,6 +68,8 @@
return -ENOPROTOOPT;
new_ra = (sel >= 0) ? kmalloc(sizeof(*new_ra), GFP_KERNEL) : NULL;
+ if (sel >= 0 && !new_ra)
+ return -ENOMEM;
write_lock_bh(&ip6_ra_lock);
for (rap = &ip6_ra_chain; (ra = *rap) != NULL; rap = &ra->next) {
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index e0fb56d..bc6983f 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -1461,6 +1461,29 @@
If you want to compile it as a module, say M here and read
<file:Documentation/kbuild/modules.txt>. If unsure, say `N'.
+config NETFILTER_XT_MATCH_QUOTA2
+ tristate '"quota2" match support'
+ depends on NETFILTER_ADVANCED
+ help
+ This option adds a `quota2' match, which allows matching on a
+ byte counter that is accounted correctly rather than per CPU.
+ It also allows naming the quotas.
+ This is based on http://xtables-addons.git.sourceforge.net
+
+ If you want to compile it as a module, say M here and read
+ <file:Documentation/kbuild/modules.txt>. If unsure, say `N'.
+
+config NETFILTER_XT_MATCH_QUOTA2_LOG
+ bool '"quota2" Netfilter LOG support'
+ depends on NETFILTER_XT_MATCH_QUOTA2
+ default n
+ help
+ This option allows `quota2' to log ONCE when a quota limit
+ is exceeded. It logs via NETLINK using the NETLINK_NFLOG family.
+ The log format is similar to ipt_ULOG's, but without packet payload.
+
+ If unsure, say `N'.
+
config NETFILTER_XT_MATCH_RATEEST
tristate '"rateest" match support'
depends on NETFILTER_ADVANCED
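
The help text above is the only description of quota2's semantics in this hunk; the match function itself appears further down in xt_quota2.c, beyond this excerpt. Purely to illustrate what "a byte counter ... not per CPU" means (a single shared, named counter charged under one lock, instead of per-CPU counters that can drift), a rough userspace model with placeholder names, not the module's API:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* one shared counter for the whole system, unlike a per-CPU counter */
    static uint64_t quota = 1000000;            /* bytes left */
    static pthread_mutex_t quota_lock = PTHREAD_MUTEX_INITIALIZER;

    /* returns true while the quota is not yet exhausted */
    static bool quota_charge(unsigned int len)
    {
        bool ok = false;

        pthread_mutex_lock(&quota_lock);
        if (quota >= len) {
            quota -= len;
            ok = true;
        } else {
            quota = 0;                          /* limit crossed, log once */
        }
        pthread_mutex_unlock(&quota_lock);
        return ok;
    }

    int main(void)
    {
        printf("first 1500-byte packet matches: %d\n", quota_charge(1500));
        return 0;
    }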
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 16895e0..9c87ed5 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -191,6 +191,7 @@
obj-$(CONFIG_NETFILTER_XT_MATCH_PKTTYPE) += xt_pkttype.o
obj-$(CONFIG_NETFILTER_XT_MATCH_POLICY) += xt_policy.o
obj-$(CONFIG_NETFILTER_XT_MATCH_QUOTA) += xt_quota.o
+obj-$(CONFIG_NETFILTER_XT_MATCH_QUOTA2) += xt_quota2.o
obj-$(CONFIG_NETFILTER_XT_MATCH_RATEEST) += xt_rateest.o
obj-$(CONFIG_NETFILTER_XT_MATCH_REALM) += xt_realm.o
obj-$(CONFIG_NETFILTER_XT_MATCH_RECENT) += xt_recent.o
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 93aaec3..dc240cb 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -33,7 +33,7 @@
DEFINE_PER_CPU(bool, nf_skb_duplicated);
EXPORT_SYMBOL_GPL(nf_skb_duplicated);
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
struct static_key nf_hooks_needed[NFPROTO_NUMPROTO][NF_MAX_HOOKS];
EXPORT_SYMBOL(nf_hooks_needed);
#endif
@@ -347,7 +347,7 @@
if (pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_INGRESS)
net_inc_ingress_queue();
#endif
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
static_key_slow_inc(&nf_hooks_needed[pf][reg->hooknum]);
#endif
BUG_ON(p == new_hooks);
@@ -405,7 +405,7 @@
if (pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_INGRESS)
net_dec_ingress_queue();
#endif
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
static_key_slow_dec(&nf_hooks_needed[pf][reg->hooknum]);
#endif
} else {
diff --git a/net/netfilter/xt_IDLETIMER.c b/net/netfilter/xt_IDLETIMER.c
index 25453a1..559acfa 100644
--- a/net/netfilter/xt_IDLETIMER.c
+++ b/net/netfilter/xt_IDLETIMER.c
@@ -5,6 +5,7 @@
* After timer expires a kevent will be sent.
*
* Copyright (C) 2004, 2010 Nokia Corporation
+ *
* Written by Timo Teras <ext-timo.teras@nokia.com>
*
* Converted to x_tables and reworked for upstream inclusion
@@ -38,8 +39,17 @@
#include <linux/netfilter/xt_IDLETIMER.h>
#include <linux/kdev_t.h>
#include <linux/kobject.h>
+#include <linux/skbuff.h>
#include <linux/workqueue.h>
#include <linux/sysfs.h>
+#include <linux/rtc.h>
+#include <linux/time.h>
+#include <linux/math64.h>
+#include <linux/suspend.h>
+#include <linux/notifier.h>
+#include <net/net_namespace.h>
+#include <net/sock.h>
+#include <net/inet_sock.h>
struct idletimer_tg_attr {
struct attribute attr;
@@ -55,14 +65,110 @@
struct kobject *kobj;
struct idletimer_tg_attr attr;
+ struct timespec delayed_timer_trigger;
+ struct timespec last_modified_timer;
+ struct timespec last_suspend_time;
+ struct notifier_block pm_nb;
+
+ int timeout;
unsigned int refcnt;
+ bool work_pending;
+ bool send_nl_msg;
+ bool active;
+ uid_t uid;
};
static LIST_HEAD(idletimer_tg_list);
static DEFINE_MUTEX(list_mutex);
+static DEFINE_SPINLOCK(timestamp_lock);
static struct kobject *idletimer_tg_kobj;
+static bool check_for_delayed_trigger(struct idletimer_tg *timer,
+ struct timespec *ts)
+{
+ bool state;
+ struct timespec temp;
+ spin_lock_bh(&timestamp_lock);
+ timer->work_pending = false;
+ if ((ts->tv_sec - timer->last_modified_timer.tv_sec) > timer->timeout ||
+ timer->delayed_timer_trigger.tv_sec != 0) {
+ state = false;
+ temp.tv_sec = timer->timeout;
+ temp.tv_nsec = 0;
+ if (timer->delayed_timer_trigger.tv_sec != 0) {
+ temp = timespec_add(timer->delayed_timer_trigger, temp);
+ ts->tv_sec = temp.tv_sec;
+ ts->tv_nsec = temp.tv_nsec;
+ timer->delayed_timer_trigger.tv_sec = 0;
+ timer->work_pending = true;
+ schedule_work(&timer->work);
+ } else {
+ temp = timespec_add(timer->last_modified_timer, temp);
+ ts->tv_sec = temp.tv_sec;
+ ts->tv_nsec = temp.tv_nsec;
+ }
+ } else {
+ state = timer->active;
+ }
+ spin_unlock_bh(&timestamp_lock);
+ return state;
+}
+
+static void notify_netlink_uevent(const char *iface, struct idletimer_tg *timer)
+{
+ char iface_msg[NLMSG_MAX_SIZE];
+ char state_msg[NLMSG_MAX_SIZE];
+ char timestamp_msg[NLMSG_MAX_SIZE];
+ char uid_msg[NLMSG_MAX_SIZE];
+ char *envp[] = { iface_msg, state_msg, timestamp_msg, uid_msg, NULL };
+ int res;
+ struct timespec ts;
+ uint64_t time_ns;
+ bool state;
+
+ res = snprintf(iface_msg, NLMSG_MAX_SIZE, "INTERFACE=%s",
+ iface);
+ if (NLMSG_MAX_SIZE <= res) {
+ pr_err("message too long (%d)", res);
+ return;
+ }
+
+ get_monotonic_boottime(&ts);
+ state = check_for_delayed_trigger(timer, &ts);
+ res = snprintf(state_msg, NLMSG_MAX_SIZE, "STATE=%s",
+ state ? "active" : "inactive");
+
+ if (NLMSG_MAX_SIZE <= res) {
+ pr_err("message too long (%d)", res);
+ return;
+ }
+
+ if (state) {
+ res = snprintf(uid_msg, NLMSG_MAX_SIZE, "UID=%u", timer->uid);
+ if (NLMSG_MAX_SIZE <= res)
+ pr_err("message too long (%d)", res);
+ } else {
+ res = snprintf(uid_msg, NLMSG_MAX_SIZE, "UID=");
+ if (NLMSG_MAX_SIZE <= res)
+ pr_err("message too long (%d)", res);
+ }
+
+ time_ns = timespec_to_ns(&ts);
+ res = snprintf(timestamp_msg, NLMSG_MAX_SIZE, "TIME_NS=%llu", time_ns);
+ if (NLMSG_MAX_SIZE <= res) {
+ timestamp_msg[0] = '\0';
+ pr_err("message too long (%d)", res);
+ }
+
+ pr_debug("putting nlmsg: <%s> <%s> <%s> <%s>\n", iface_msg, state_msg,
+ timestamp_msg, uid_msg);
+ kobject_uevent_env(idletimer_tg_kobj, KOBJ_CHANGE, envp);
+ return;
+
+
+}
+
static
struct idletimer_tg *__idletimer_tg_find_by_label(const char *label)
{
@@ -83,6 +189,7 @@
{
struct idletimer_tg *timer;
unsigned long expires = 0;
+ unsigned long now = jiffies;
mutex_lock(&list_mutex);
@@ -92,11 +199,15 @@
mutex_unlock(&list_mutex);
- if (time_after(expires, jiffies))
+ if (time_after(expires, now))
return sprintf(buf, "%u\n",
- jiffies_to_msecs(expires - jiffies) / 1000);
+ jiffies_to_msecs(expires - now) / 1000);
- return sprintf(buf, "0\n");
+ if (timer->send_nl_msg)
+ return sprintf(buf, "0 %d\n",
+ jiffies_to_msecs(now - expires) / 1000);
+ else
+ return sprintf(buf, "0\n");
}
static void idletimer_tg_work(struct work_struct *work)
@@ -105,6 +216,9 @@
work);
sysfs_notify(idletimer_tg_kobj, NULL, timer->attr.attr.name);
+
+ if (timer->send_nl_msg)
+ notify_netlink_uevent(timer->attr.attr.name, timer);
}
static void idletimer_tg_expired(struct timer_list *t)
@@ -112,8 +226,55 @@
struct idletimer_tg *timer = from_timer(timer, t, timer);
pr_debug("timer %s expired\n", timer->attr.attr.name);
-
+ spin_lock_bh(&timestamp_lock);
+ timer->active = false;
+ timer->work_pending = true;
schedule_work(&timer->work);
+ spin_unlock_bh(&timestamp_lock);
+}
+
+static int idletimer_resume(struct notifier_block *notifier,
+ unsigned long pm_event, void *unused)
+{
+ struct timespec ts;
+ unsigned long time_diff, now = jiffies;
+ struct idletimer_tg *timer = container_of(notifier,
+ struct idletimer_tg, pm_nb);
+ if (!timer)
+ return NOTIFY_DONE;
+ switch (pm_event) {
+ case PM_SUSPEND_PREPARE:
+ get_monotonic_boottime(&timer->last_suspend_time);
+ break;
+ case PM_POST_SUSPEND:
+ spin_lock_bh(&timestamp_lock);
+ if (!timer->active) {
+ spin_unlock_bh(&timestamp_lock);
+ break;
+ }
+ /* Since jiffies are not updated while suspended, 'now' represents
+ * the time at which the system went to sleep. */
+ if (time_after(timer->timer.expires, now)) {
+ get_monotonic_boottime(&ts);
+ ts = timespec_sub(ts, timer->last_suspend_time);
+ time_diff = timespec_to_jiffies(&ts);
+ if (timer->timer.expires > (time_diff + now)) {
+ mod_timer_pending(&timer->timer,
+ (timer->timer.expires - time_diff));
+ } else {
+ del_timer(&timer->timer);
+ timer->timer.expires = 0;
+ timer->active = false;
+ timer->work_pending = true;
+ schedule_work(&timer->work);
+ }
+ }
+ spin_unlock_bh(&timestamp_lock);
+ break;
+ default:
+ break;
+ }
+ return NOTIFY_DONE;
}
static int idletimer_check_sysfs_name(const char *name, unsigned int size)
@@ -165,6 +326,21 @@
timer_setup(&info->timer->timer, idletimer_tg_expired, 0);
info->timer->refcnt = 1;
+ info->timer->send_nl_msg = (info->send_nl_msg == 0) ? false : true;
+ info->timer->active = true;
+ info->timer->timeout = info->timeout;
+
+ info->timer->delayed_timer_trigger.tv_sec = 0;
+ info->timer->delayed_timer_trigger.tv_nsec = 0;
+ info->timer->work_pending = false;
+ info->timer->uid = 0;
+ get_monotonic_boottime(&info->timer->last_modified_timer);
+
+ info->timer->pm_nb.notifier_call = idletimer_resume;
+ ret = register_pm_notifier(&info->timer->pm_nb);
+ if (ret)
+ printk(KERN_WARNING "[%s] Failed to register pm notifier %d\n",
+ __func__, ret);
INIT_WORK(&info->timer->work, idletimer_tg_work);
@@ -181,6 +357,42 @@
return ret;
}
+static void reset_timer(const struct idletimer_tg_info *info,
+ struct sk_buff *skb)
+{
+ unsigned long now = jiffies;
+ struct idletimer_tg *timer = info->timer;
+ bool timer_prev;
+
+ spin_lock_bh(&timestamp_lock);
+ timer_prev = timer->active;
+ timer->active = true;
+ /* timer_prev guards against jiffies wraparound in time_before() */
+ if (!timer_prev || time_before(timer->timer.expires, now)) {
+ pr_debug("Starting Checkentry timer (Expired, Jiffies): %lu, %lu\n",
+ timer->timer.expires, now);
+
+ /* Store the uid responsible for waking up the radio */
+ if (skb && (skb->sk)) {
+ timer->uid = from_kuid_munged(current_user_ns(),
+ sock_i_uid(skb_to_full_sk(skb)));
+ }
+
+ /* check if there is a pending inactive notification */
+ if (timer->work_pending)
+ timer->delayed_timer_trigger = timer->last_modified_timer;
+ else {
+ timer->work_pending = true;
+ schedule_work(&timer->work);
+ }
+ }
+
+ get_monotonic_boottime(&timer->last_modified_timer);
+ mod_timer(&timer->timer,
+ msecs_to_jiffies(info->timeout * 1000) + now);
+ spin_unlock_bh(&timestamp_lock);
+}
+
/*
* The actual xt_tables plugin.
*/
@@ -188,15 +400,23 @@
const struct xt_action_param *par)
{
const struct idletimer_tg_info *info = par->targinfo;
+ unsigned long now = jiffies;
pr_debug("resetting timer %s, timeout period %u\n",
info->label, info->timeout);
BUG_ON(!info->timer);
- mod_timer(&info->timer->timer,
- msecs_to_jiffies(info->timeout * 1000) + jiffies);
+ info->timer->active = true;
+ if (time_before(info->timer->timer.expires, now)) {
+ schedule_work(&info->timer->work);
+ pr_debug("Starting timer %s (Expired, Jiffies): %lu, %lu\n",
+ info->label, info->timer->timer.expires, now);
+ }
+
+ /* TODO: Avoid modifying timers on each packet */
+ reset_timer(info, skb);
return XT_CONTINUE;
}
@@ -205,7 +425,7 @@
struct idletimer_tg_info *info = par->targinfo;
int ret;
- pr_debug("checkentry targinfo%s\n", info->label);
+ pr_debug("checkentry targinfo %s\n", info->label);
if (info->timeout == 0) {
pr_debug("timeout value is zero\n");
@@ -227,9 +447,7 @@
info->timer = __idletimer_tg_find_by_label(info->label);
if (info->timer) {
info->timer->refcnt++;
- mod_timer(&info->timer->timer,
- msecs_to_jiffies(info->timeout * 1000) + jiffies);
-
+ reset_timer(info, NULL);
pr_debug("increased refcnt of timer %s to %u\n",
info->label, info->timer->refcnt);
} else {
@@ -242,6 +460,7 @@
}
mutex_unlock(&list_mutex);
+
return 0;
}
@@ -258,13 +477,14 @@
list_del(&info->timer->entry);
del_timer_sync(&info->timer->timer);
- cancel_work_sync(&info->timer->work);
sysfs_remove_file(idletimer_tg_kobj, &info->timer->attr.attr);
+ unregister_pm_notifier(&info->timer->pm_nb);
+ cancel_work_sync(&info->timer->work);
kfree(info->timer->attr.attr.name);
kfree(info->timer);
} else {
pr_debug("decreased refcnt of timer %s to %u\n",
- info->label, info->timer->refcnt);
+ info->label, info->timer->refcnt);
}
mutex_unlock(&list_mutex);
@@ -272,6 +492,7 @@
static struct xt_target idletimer_tg __read_mostly = {
.name = "IDLETIMER",
+ .revision = 1,
.family = NFPROTO_UNSPEC,
.target = idletimer_tg_target,
.targetsize = sizeof(struct idletimer_tg_info),
@@ -338,3 +559,4 @@
MODULE_LICENSE("GPL v2");
MODULE_ALIAS("ipt_IDLETIMER");
MODULE_ALIAS("ip6t_IDLETIMER");
+MODULE_ALIAS("arpt_IDLETIMER");
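
The IDLETIMER changes above report state through two channels: sysfs_notify() on the per-label attribute, and a uevent carrying INTERFACE=/STATE=/TIME_NS=/UID= strings when send_nl_msg is set. A userspace sketch of consuming the sysfs side; the path and the "wlan0_timer" label are assumptions, since the kobject setup is outside this excerpt (the timers are commonly exposed under /sys/class/xt_idletimer/timers/):

    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[32];
        ssize_t n;
        int fd = open("/sys/class/xt_idletimer/timers/wlan0_timer", O_RDONLY);
        struct pollfd pfd = { .fd = fd, .events = POLLPRI | POLLERR };

        if (fd < 0) {
            perror("open");
            return 1;
        }
        pread(fd, buf, sizeof(buf), 0);   /* initial read arms the sysfs poll */
        poll(&pfd, 1, -1);                /* wakes up on sysfs_notify() */
        n = pread(fd, buf, sizeof(buf) - 1, 0);
        if (n > 0) {
            buf[n] = '\0';
            printf("timer attribute now reads: %s", buf);
        }
        close(fd);
        return 0;
    }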
diff --git a/net/netfilter/xt_quota2.c b/net/netfilter/xt_quota2.c
new file mode 100644
index 0000000..24b7742
--- /dev/null
+++ b/net/netfilter/xt_quota2.c
@@ -0,0 +1,401 @@
+/*
+ * xt_quota2 - enhanced xt_quota that can count upwards and in packets
+ * as a minimal accounting match.
+ * by Jan Engelhardt <jengelh@medozas.de>, 2008
+ *
+ * Originally based on xt_quota.c:
+ * netfilter module to enforce network quotas
+ * Sam Johnston <samj@samj.net>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License; either
+ * version 2 of the License, as published by the Free Software Foundation.
+ */
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/proc_fs.h>
+#include <linux/skbuff.h>
+#include <linux/spinlock.h>
+#include <asm/atomic.h>
+#include <net/netlink.h>
+
+#include <linux/netfilter/x_tables.h>
+#include <linux/netfilter/xt_quota2.h>
+
+#ifdef CONFIG_NETFILTER_XT_MATCH_QUOTA2_LOG
+/* For compatibility, these definitions are copied from the
+ * deprecated header file <linux/netfilter_ipv4/ipt_ULOG.h> */
+#define ULOG_MAC_LEN 80
+#define ULOG_PREFIX_LEN 32
+
+/* Format of the ULOG packets passed through netlink */
+typedef struct ulog_packet_msg {
+ unsigned long mark;
+ long timestamp_sec;
+ long timestamp_usec;
+ unsigned int hook;
+ char indev_name[IFNAMSIZ];
+ char outdev_name[IFNAMSIZ];
+ size_t data_len;
+ char prefix[ULOG_PREFIX_LEN];
+ unsigned char mac_len;
+ unsigned char mac[ULOG_MAC_LEN];
+ unsigned char payload[0];
+} ulog_packet_msg_t;
+#endif
+
+/**
+ * @lock: lock to protect quota writers from each other
+ */
+struct xt_quota_counter {
+ u_int64_t quota;
+ spinlock_t lock;
+ struct list_head list;
+ atomic_t ref;
+ char name[sizeof(((struct xt_quota_mtinfo2 *)NULL)->name)];
+ struct proc_dir_entry *procfs_entry;
+};
+
+#ifdef CONFIG_NETFILTER_XT_MATCH_QUOTA2_LOG
+/* Harald's favorite number +1 :D From ipt_ULOG.C */
+static int qlog_nl_event = 112;
+module_param_named(event_num, qlog_nl_event, uint, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(event_num,
+ "Event number for NETLINK_NFLOG message. 0 disables log."
+ "111 is what ipt_ULOG uses.");
+static struct sock *nflognl;
+#endif
+
+static LIST_HEAD(counter_list);
+static DEFINE_SPINLOCK(counter_list_lock);
+
+static struct proc_dir_entry *proc_xt_quota;
+static unsigned int quota_list_perms = S_IRUGO | S_IWUSR;
+static kuid_t quota_list_uid = KUIDT_INIT(0);
+static kgid_t quota_list_gid = KGIDT_INIT(0);
+module_param_named(perms, quota_list_perms, uint, S_IRUGO | S_IWUSR);
+
+#ifdef CONFIG_NETFILTER_XT_MATCH_QUOTA2_LOG
+static void quota2_log(unsigned int hooknum,
+ const struct sk_buff *skb,
+ const struct net_device *in,
+ const struct net_device *out,
+ const char *prefix)
+{
+ ulog_packet_msg_t *pm;
+ struct sk_buff *log_skb;
+ size_t size;
+ struct nlmsghdr *nlh;
+
+ if (!qlog_nl_event)
+ return;
+
+ size = NLMSG_SPACE(sizeof(*pm));
+ size = max(size, (size_t)NLMSG_GOODSIZE);
+ log_skb = alloc_skb(size, GFP_ATOMIC);
+ if (!log_skb) {
+ pr_err("xt_quota2: cannot alloc skb for logging\n");
+ return;
+ }
+
+ nlh = nlmsg_put(log_skb, /*pid*/0, /*seq*/0, qlog_nl_event,
+ sizeof(*pm), 0);
+ if (!nlh) {
+ pr_err("xt_quota2: nlmsg_put failed\n");
+ kfree_skb(log_skb);
+ return;
+ }
+ pm = nlmsg_data(nlh);
+ if (skb->tstamp == 0)
+ __net_timestamp((struct sk_buff *)skb);
+ pm->data_len = 0;
+ pm->hook = hooknum;
+ if (prefix != NULL)
+ strlcpy(pm->prefix, prefix, sizeof(pm->prefix));
+ else
+ *(pm->prefix) = '\0';
+ if (in)
+ strlcpy(pm->indev_name, in->name, sizeof(pm->indev_name));
+ else
+ pm->indev_name[0] = '\0';
+
+ if (out)
+ strlcpy(pm->outdev_name, out->name, sizeof(pm->outdev_name));
+ else
+ pm->outdev_name[0] = '\0';
+
+ NETLINK_CB(log_skb).dst_group = 1;
+ pr_debug("throwing 1 packets to netlink group 1\n");
+ netlink_broadcast(nflognl, log_skb, 0, 1, GFP_ATOMIC);
+}
+#else
+static void quota2_log(unsigned int hooknum,
+ const struct sk_buff *skb,
+ const struct net_device *in,
+ const struct net_device *out,
+ const char *prefix)
+{
+}
+#endif /* if+else CONFIG_NETFILTER_XT_MATCH_QUOTA2_LOG */
+
+static ssize_t quota_proc_read(struct file *file, char __user *buf,
+ size_t size, loff_t *ppos)
+{
+ struct xt_quota_counter *e = PDE_DATA(file_inode(file));
+ char tmp[24];
+ size_t tmp_size;
+
+ spin_lock_bh(&e->lock);
+ tmp_size = scnprintf(tmp, sizeof(tmp), "%llu\n", e->quota);
+ spin_unlock_bh(&e->lock);
+ return simple_read_from_buffer(buf, size, ppos, tmp, tmp_size);
+}
+
+static ssize_t quota_proc_write(struct file *file, const char __user *input,
+ size_t size, loff_t *ppos)
+{
+ struct xt_quota_counter *e = PDE_DATA(file_inode(file));
+ char buf[sizeof("18446744073709551616")];
+
+ if (size > sizeof(buf))
+ size = sizeof(buf);
+ if (copy_from_user(buf, input, size) != 0)
+ return -EFAULT;
+ buf[sizeof(buf)-1] = '\0';
+
+ spin_lock_bh(&e->lock);
+ e->quota = simple_strtoull(buf, NULL, 0);
+ spin_unlock_bh(&e->lock);
+ return size;
+}
+
+static const struct file_operations q2_counter_fops = {
+ .read = quota_proc_read,
+ .write = quota_proc_write,
+ .llseek = default_llseek,
+};
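+
+/*
+ * Illustrative sketch (not part of the original module): each named counter
+ * is exposed by q2_get_counter() below as /proc/net/xt_quota/<name>, so the
+ * remaining quota can be inspected and rewritten from userspace, e.g.
+ *
+ *   cat /proc/net/xt_quota/mycounter        (prints the value as "%llu\n")
+ *   echo 5000000 > /proc/net/xt_quota/mycounter
+ *
+ * The counter name "mycounter" is only a placeholder.
+ */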
+
+static struct xt_quota_counter *
+q2_new_counter(const struct xt_quota_mtinfo2 *q, bool anon)
+{
+ struct xt_quota_counter *e;
+ unsigned int size;
+
+ /* Do not need all the procfs things for anonymous counters. */
+ size = anon ? offsetof(typeof(*e), list) : sizeof(*e);
+ e = kmalloc(size, GFP_KERNEL);
+ if (e == NULL)
+ return NULL;
+
+ e->quota = q->quota;
+ spin_lock_init(&e->lock);
+ if (!anon) {
+ INIT_LIST_HEAD(&e->list);
+ atomic_set(&e->ref, 1);
+ strlcpy(e->name, q->name, sizeof(e->name));
+ }
+ return e;
+}
+
+/**
+ * q2_get_counter - get ref to counter or create new
+ * @name: name of counter
+ */
+static struct xt_quota_counter *
+q2_get_counter(const struct xt_quota_mtinfo2 *q)
+{
+ struct proc_dir_entry *p;
+ struct xt_quota_counter *e = NULL;
+ struct xt_quota_counter *new_e;
+
+ if (*q->name == '\0')
+ return q2_new_counter(q, true);
+
+ /* No need to hold a lock while getting a new counter */
+ new_e = q2_new_counter(q, false);
+ if (new_e == NULL)
+ goto out;
+
+ spin_lock_bh(&counter_list_lock);
+ list_for_each_entry(e, &counter_list, list)
+ if (strcmp(e->name, q->name) == 0) {
+ atomic_inc(&e->ref);
+ spin_unlock_bh(&counter_list_lock);
+ kfree(new_e);
+ pr_debug("xt_quota2: old counter name=%s", e->name);
+ return e;
+ }
+ e = new_e;
+ pr_debug("xt_quota2: new_counter name=%s", e->name);
+ list_add_tail(&e->list, &counter_list);
+ /* An entry with a refcount of 1 cannot be destroyed directly:
+ * this function has not yet returned the new entry, so iptables
+ * holds no reference it could use to destroy it. Before another
+ * rule could try to destroy it, this function would first have to
+ * be re-invoked and take a new reference to the same named quota.
+ * Nobody will access e->procfs_entry either, so it is safe to
+ * release the lock here. */
+ spin_unlock_bh(&counter_list_lock);
+
+ /* proc_create_data() must not be called while holding a spinlock */
+ p = e->procfs_entry = proc_create_data(e->name, quota_list_perms,
+ proc_xt_quota, &q2_counter_fops, e);
+
+ if (IS_ERR_OR_NULL(p)) {
+ spin_lock_bh(&counter_list_lock);
+ list_del(&e->list);
+ spin_unlock_bh(&counter_list_lock);
+ goto out;
+ }
+ proc_set_user(p, quota_list_uid, quota_list_gid);
+ return e;
+
+ out:
+ kfree(e);
+ return NULL;
+}
+
+static int quota_mt2_check(const struct xt_mtchk_param *par)
+{
+ struct xt_quota_mtinfo2 *q = par->matchinfo;
+
+ pr_debug("xt_quota2: check() flags=0x%04x", q->flags);
+
+ if (q->flags & ~XT_QUOTA_MASK)
+ return -EINVAL;
+
+ q->name[sizeof(q->name)-1] = '\0';
+ if (*q->name == '.' || strchr(q->name, '/') != NULL) {
+ printk(KERN_ERR "xt_quota.3: illegal name\n");
+ return -EINVAL;
+ }
+
+ q->master = q2_get_counter(q);
+ if (q->master == NULL) {
+ printk(KERN_ERR "xt_quota.3: memory alloc failure\n");
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+static void quota_mt2_destroy(const struct xt_mtdtor_param *par)
+{
+ struct xt_quota_mtinfo2 *q = par->matchinfo;
+ struct xt_quota_counter *e = q->master;
+
+ if (*q->name == '\0') {
+ kfree(e);
+ return;
+ }
+
+ spin_lock_bh(&counter_list_lock);
+ if (!atomic_dec_and_test(&e->ref)) {
+ spin_unlock_bh(&counter_list_lock);
+ return;
+ }
+
+ list_del(&e->list);
+ remove_proc_entry(e->name, proc_xt_quota);
+ spin_unlock_bh(&counter_list_lock);
+ kfree(e);
+}
+
+static bool
+quota_mt2(const struct sk_buff *skb, struct xt_action_param *par)
+{
+ struct xt_quota_mtinfo2 *q = (void *)par->matchinfo;
+ struct xt_quota_counter *e = q->master;
+ bool ret = q->flags & XT_QUOTA_INVERT;
+
+ spin_lock_bh(&e->lock);
+ if (q->flags & XT_QUOTA_GROW) {
+ /*
+ * While no_change is pointless in "grow" mode, we will
+ * implement it here simply to have a consistent behavior.
+ */
+ if (!(q->flags & XT_QUOTA_NO_CHANGE)) {
+ e->quota += (q->flags & XT_QUOTA_PACKET) ? 1 : skb->len;
+ }
+ ret = true;
+ } else {
+ if (e->quota >= skb->len) {
+ if (!(q->flags & XT_QUOTA_NO_CHANGE))
+ e->quota -= (q->flags & XT_QUOTA_PACKET) ? 1 : skb->len;
+ ret = !ret;
+ } else {
+ /* We are transitioning, log that fact. */
+ if (e->quota) {
+ quota2_log(xt_hooknum(par),
+ skb,
+ xt_in(par),
+ xt_out(par),
+ q->name);
+ }
+ /* we do not allow even small packets from now on */
+ e->quota = 0;
+ }
+ }
+ spin_unlock_bh(&e->lock);
+ return ret;
+}
+
+static struct xt_match quota_mt2_reg[] __read_mostly = {
+ {
+ .name = "quota2",
+ .revision = 3,
+ .family = NFPROTO_IPV4,
+ .checkentry = quota_mt2_check,
+ .match = quota_mt2,
+ .destroy = quota_mt2_destroy,
+ .matchsize = sizeof(struct xt_quota_mtinfo2),
+ .me = THIS_MODULE,
+ },
+ {
+ .name = "quota2",
+ .revision = 3,
+ .family = NFPROTO_IPV6,
+ .checkentry = quota_mt2_check,
+ .match = quota_mt2,
+ .destroy = quota_mt2_destroy,
+ .matchsize = sizeof(struct xt_quota_mtinfo2),
+ .me = THIS_MODULE,
+ },
+};
+
+static int __init quota_mt2_init(void)
+{
+ int ret;
+ pr_debug("xt_quota2: init()");
+
+#ifdef CONFIG_NETFILTER_XT_MATCH_QUOTA2_LOG
+ nflognl = netlink_kernel_create(&init_net, NETLINK_NFLOG, NULL);
+ if (!nflognl)
+ return -ENOMEM;
+#endif
+
+ proc_xt_quota = proc_mkdir("xt_quota", init_net.proc_net);
+ if (proc_xt_quota == NULL)
+ return -EACCES;
+
+ ret = xt_register_matches(quota_mt2_reg, ARRAY_SIZE(quota_mt2_reg));
+ if (ret < 0)
+ remove_proc_entry("xt_quota", init_net.proc_net);
+ pr_debug("xt_quota2: init() %d", ret);
+ return ret;
+}
+
+static void __exit quota_mt2_exit(void)
+{
+ xt_unregister_matches(quota_mt2_reg, ARRAY_SIZE(quota_mt2_reg));
+ remove_proc_entry("xt_quota", init_net.proc_net);
+}
+
+module_init(quota_mt2_init);
+module_exit(quota_mt2_exit);
+MODULE_DESCRIPTION("Xtables: countdown quota match; up counter");
+MODULE_AUTHOR("Sam Johnston <samj@samj.net>");
+MODULE_AUTHOR("Jan Engelhardt <jengelh@medozas.de>");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS("ipt_quota2");
+MODULE_ALIAS("ip6t_quota2");
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 85a6e8f..23c4653 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -268,7 +268,7 @@
@echo " CLANG-bpf " $@
$(Q)$(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) -I$(obj) \
-I$(srctree)/tools/testing/selftests/bpf/ \
- -D__KERNEL__ -D__BPF_TRACING__ -Wno-unused-value -Wno-pointer-sign \
+ -D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
-D__TARGET_ARCH_$(ARCH) -Wno-compare-distinct-pointer-types \
-Wno-gnu-variable-sized-type-not-at-end \
-Wno-address-of-packed-member -Wno-tautological-compare \
diff --git a/scripts/Kbuild.include b/scripts/Kbuild.include
index ce53639..df0f0f4 100644
--- a/scripts/Kbuild.include
+++ b/scripts/Kbuild.include
@@ -138,10 +138,6 @@
cc-disable-warning = $(call try-run,\
$(CC) -Werror $(KBUILD_CPPFLAGS) $(CC_OPTION_CFLAGS) -W$(strip $(1)) -c -x c /dev/null -o "$$TMP",-Wno-$(strip $(1)))
-# cc-name
-# Expands to either gcc or clang
-cc-name = $(shell $(CC) -v 2>&1 | grep -q "clang version" && echo clang || echo gcc)
-
# cc-version
cc-version = $(shell $(CONFIG_SHELL) $(srctree)/scripts/gcc-version.sh $(CC))
diff --git a/scripts/Makefile.extrawarn b/scripts/Makefile.extrawarn
index 486e135..1a1b881 100644
--- a/scripts/Makefile.extrawarn
+++ b/scripts/Makefile.extrawarn
@@ -23,14 +23,16 @@
warning-1 := -Wextra -Wunused -Wno-unused-parameter
warning-1 += -Wmissing-declarations
warning-1 += -Wmissing-format-attribute
-warning-1 += $(call cc-option, -Wmissing-prototypes)
+warning-1 += -Wmissing-prototypes
warning-1 += -Wold-style-definition
-warning-1 += $(call cc-option, -Wmissing-include-dirs)
+warning-1 += -Wmissing-include-dirs
warning-1 += $(call cc-option, -Wunused-but-set-variable)
warning-1 += $(call cc-option, -Wunused-const-variable)
warning-1 += $(call cc-option, -Wpacked-not-aligned)
-warning-1 += $(call cc-disable-warning, missing-field-initializers)
-warning-1 += $(call cc-disable-warning, sign-compare)
+warning-1 += $(call cc-option, -Wstringop-truncation)
+# The following turn off the warnings enabled by -Wextra
+warning-1 += -Wno-missing-field-initializers
+warning-1 += -Wno-sign-compare
warning-2 := -Waggregate-return
warning-2 += -Wcast-align
@@ -38,8 +40,8 @@
warning-2 += -Wnested-externs
warning-2 += -Wshadow
warning-2 += $(call cc-option, -Wlogical-op)
-warning-2 += $(call cc-option, -Wmissing-field-initializers)
-warning-2 += $(call cc-option, -Wsign-compare)
+warning-2 += -Wmissing-field-initializers
+warning-2 += -Wsign-compare
warning-2 += $(call cc-option, -Wmaybe-uninitialized)
warning-2 += $(call cc-option, -Wunused-macros)
@@ -65,13 +67,12 @@
KBUILD_CFLAGS += $(warning)
else
-ifeq ($(cc-name),clang)
-KBUILD_CFLAGS += $(call cc-disable-warning, initializer-overrides)
-KBUILD_CFLAGS += $(call cc-disable-warning, unused-value)
-KBUILD_CFLAGS += $(call cc-disable-warning, format)
-KBUILD_CFLAGS += $(call cc-disable-warning, sign-compare)
-KBUILD_CFLAGS += $(call cc-disable-warning, format-zero-length)
-KBUILD_CFLAGS += $(call cc-disable-warning, uninitialized)
+ifdef CONFIG_CC_IS_CLANG
+KBUILD_CFLAGS += -Wno-initializer-overrides
+KBUILD_CFLAGS += -Wno-format
+KBUILD_CFLAGS += -Wno-sign-compare
+KBUILD_CFLAGS += -Wno-format-zero-length
+KBUILD_CFLAGS += -Wno-uninitialized
KBUILD_CFLAGS += $(call cc-disable-warning, pointer-to-enum-cast)
endif
endif
diff --git a/scripts/Makefile.modinst b/scripts/Makefile.modinst
index ff5ca98..8ff0669 100644
--- a/scripts/Makefile.modinst
+++ b/scripts/Makefile.modinst
@@ -30,7 +30,7 @@
INSTALL_MOD_DIR ?= extra
ext-mod-dir = $(INSTALL_MOD_DIR)$(subst $(patsubst %/,%,$(KBUILD_EXTMOD)),,$(@D))
-modinst_dir = $(if $(KBUILD_EXTMOD),$(ext-mod-dir),kernel/$(@D))
+modinst_dir ?= $(if $(KBUILD_EXTMOD),$(ext-mod-dir),kernel/$(@D))
$(modules):
$(call cmd,modules_install,$(MODLIB)/$(modinst_dir))
diff --git a/scripts/decode_stacktrace.sh b/scripts/decode_stacktrace.sh
index 5aa75a0a..092ead2 100755
--- a/scripts/decode_stacktrace.sh
+++ b/scripts/decode_stacktrace.sh
@@ -28,7 +28,7 @@
local objfile=${modcache[$module]}
else
[[ $modpath == "" ]] && return
- local objfile=$(find "$modpath" -name $module.ko -print -quit)
+ local objfile=$(find "$modpath" -name "${module//_/[-_]}.ko*" -print -quit)
[[ $objfile == "" ]] && return
modcache[$module]=$objfile
fi
diff --git a/scripts/gcc-goto.sh b/scripts/gcc-goto.sh
index 8b980fb..083c526 100755
--- a/scripts/gcc-goto.sh
+++ b/scripts/gcc-goto.sh
@@ -3,7 +3,7 @@
# Test for gcc 'asm goto' support
# Copyright (C) 2010, Jason Baron <jbaron@redhat.com>
-cat << "END" | $@ -x c - -fno-PIE -c -o /dev/null
+cat << "END" | $@ -x c - -c -o /dev/null >/dev/null 2>&1 && echo "y"
int main(void)
{
#if defined(__arm__) || defined(__aarch64__)
diff --git a/scripts/gen_compile_commands.py b/scripts/gen_compile_commands.py
new file mode 100755
index 0000000..7915823
--- /dev/null
+++ b/scripts/gen_compile_commands.py
@@ -0,0 +1,151 @@
+#!/usr/bin/env python
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (C) Google LLC, 2018
+#
+# Author: Tom Roeder <tmroeder@google.com>
+#
+"""A tool for generating compile_commands.json in the Linux kernel."""
+
+import argparse
+import json
+import logging
+import os
+import re
+
+_DEFAULT_OUTPUT = 'compile_commands.json'
+_DEFAULT_LOG_LEVEL = 'WARNING'
+
+_FILENAME_PATTERN = r'^\..*\.cmd$'
+_LINE_PATTERN = r'^cmd_[^ ]*\.o := (.* )([^ ]*\.c)$'
+_VALID_LOG_LEVELS = ['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL']
+
+# A kernel build generally has over 2000 entries in its compile_commands.json
+# database. If this code finds 500 or fewer, then warn the user that they might
+# not have all the .cmd files, and they might need to compile the kernel.
+_LOW_COUNT_THRESHOLD = 500
+
+
+def parse_arguments():
+ """Sets up and parses command-line arguments.
+
+ Returns:
+ log_level: A logging level to filter log output.
+ directory: The directory to search for .cmd files.
+ output: Where to write the compile-commands JSON file.
+ """
+ usage = 'Creates a compile_commands.json database from kernel .cmd files'
+ parser = argparse.ArgumentParser(description=usage)
+
+ directory_help = ('Path to the kernel source directory to search '
+ '(defaults to the working directory)')
+ parser.add_argument('-d', '--directory', type=str, help=directory_help)
+
+ output_help = ('The location to write compile_commands.json (defaults to '
+ 'compile_commands.json in the search directory)')
+ parser.add_argument('-o', '--output', type=str, help=output_help)
+
+ log_level_help = ('The level of log messages to produce (one of ' +
+ ', '.join(_VALID_LOG_LEVELS) + '; defaults to ' +
+ _DEFAULT_LOG_LEVEL + ')')
+ parser.add_argument(
+ '--log_level', type=str, default=_DEFAULT_LOG_LEVEL,
+ help=log_level_help)
+
+ args = parser.parse_args()
+
+ log_level = args.log_level
+ if log_level not in _VALID_LOG_LEVELS:
+ raise ValueError('%s is not a valid log level' % log_level)
+
+ directory = args.directory or os.getcwd()
+ output = args.output or os.path.join(directory, _DEFAULT_OUTPUT)
+ directory = os.path.abspath(directory)
+
+ return log_level, directory, output
+
+
+def process_line(root_directory, file_directory, command_prefix, relative_path):
+ """Extracts information from a .cmd line and creates an entry from it.
+
+ Args:
+ root_directory: The directory that was searched for .cmd files. Usually
+ used directly in the "directory" entry in compile_commands.json.
+ file_directory: The path to the directory the .cmd file was found in.
+ command_prefix: The extracted command line, up to the last element.
+ relative_path: The .c file from the end of the extracted command.
+ Usually relative to root_directory, but sometimes relative to
+ file_directory and sometimes neither.
+
+ Returns:
+ An entry to append to compile_commands.
+
+ Raises:
+ ValueError: Could not find the extracted file based on relative_path and
+ root_directory or file_directory.
+ """
+ # The .cmd files are intended to be included directly by Make, so they
+ # escape the pound sign '#', either as '\#' or '$(pound)' (depending on the
+ # kernel version). The compile_commands.json file is not interpreted
+ # by Make, so this code replaces the escaped version with '#'.
+ prefix = command_prefix.replace('\#', '#').replace('$(pound)', '#')
+
+ cur_dir = root_directory
+ expected_path = os.path.join(cur_dir, relative_path)
+ if not os.path.exists(expected_path):
+ # Try using file_directory instead. Some of the tools have a different
+ # style of .cmd file than the kernel.
+ cur_dir = file_directory
+ expected_path = os.path.join(cur_dir, relative_path)
+ if not os.path.exists(expected_path):
+ raise ValueError('File %s not in %s or %s' %
+ (relative_path, root_directory, file_directory))
+ return {
+ 'directory': cur_dir,
+ 'file': relative_path,
+ 'command': prefix + relative_path,
+ }
+
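+# Worked example (illustrative): a .cmd line such as
+#   cmd_kernel/fork.o := gcc -Wp,-MD,kernel/.fork.o.d -c -o kernel/fork.o kernel/fork.c
+# matches _LINE_PATTERN with group(1) holding the command up to and including
+# the trailing space and group(2) holding "kernel/fork.c", so process_line()
+# yields an entry like
+#   {"directory": <root>, "file": "kernel/fork.c", "command": "gcc ... kernel/fork.c"}
+# The compiler flags shown here are placeholders, not real kernel flags.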
+
+def main():
+ """Walks through the directory and finds and parses .cmd files."""
+ log_level, directory, output = parse_arguments()
+
+ level = getattr(logging, log_level)
+ logging.basicConfig(format='%(levelname)s: %(message)s', level=level)
+
+ filename_matcher = re.compile(_FILENAME_PATTERN)
+ line_matcher = re.compile(_LINE_PATTERN)
+
+ compile_commands = []
+ for dirpath, _, filenames in os.walk(directory):
+ for filename in filenames:
+ if not filename_matcher.match(filename):
+ continue
+ filepath = os.path.join(dirpath, filename)
+
+ with open(filepath, 'rt') as f:
+ for line in f:
+ result = line_matcher.match(line)
+ if not result:
+ continue
+
+ try:
+ entry = process_line(directory, dirpath,
+ result.group(1), result.group(2))
+ compile_commands.append(entry)
+ except ValueError as err:
+ logging.info('Could not add line from %s: %s',
+ filepath, err)
+
+ with open(output, 'wt') as f:
+ json.dump(compile_commands, f, indent=2, sort_keys=True)
+
+ count = len(compile_commands)
+ if count < _LOW_COUNT_THRESHOLD:
+ logging.warning(
+ 'Found %s entries. Have you compiled the kernel?', count)
+
+
+if __name__ == '__main__':
+ main()
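+
+# Example invocation (paths are placeholders): run from the top of a compiled
+# kernel tree, or point -d at the directory that holds the .cmd files, e.g.
+#   scripts/gen_compile_commands.py -d /path/to/kernel -o compile_commands.json
+# Both options are optional; parse_arguments() defaults them to the working
+# directory and to compile_commands.json inside it.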
diff --git a/security/Kconfig b/security/Kconfig
index d9aa521..07e9d4c 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -236,6 +236,8 @@
source security/apparmor/Kconfig
source security/loadpin/Kconfig
source security/yama/Kconfig
+source security/chromiumos/Kconfig
+source security/container/Kconfig
source security/integrity/Kconfig
@@ -276,5 +278,13 @@
default "apparmor" if DEFAULT_SECURITY_APPARMOR
default "" if DEFAULT_SECURITY_DAC
-endmenu
+config ARCH_HAS_ALT_SYSCALL
+ def_bool n
+config ALT_SYSCALL
+ bool "Alternate syscall table support"
+ depends on ARCH_HAS_ALT_SYSCALL
+ help
+ Allow the syscall table of a running process to be swapped.
+
+endmenu
diff --git a/security/Makefile b/security/Makefile
index 4d2d378..e08af6f 100644
--- a/security/Makefile
+++ b/security/Makefile
@@ -10,6 +10,8 @@
subdir-$(CONFIG_SECURITY_APPARMOR) += apparmor
subdir-$(CONFIG_SECURITY_YAMA) += yama
subdir-$(CONFIG_SECURITY_LOADPIN) += loadpin
+subdir-$(CONFIG_SECURITY_CHROMIUMOS) += chromiumos
+subdir-$(CONFIG_SECURITY_CONTAINER_MONITOR) += container
# always enable default capabilities
obj-y += commoncap.o
@@ -18,6 +20,7 @@
# Object file lists
obj-$(CONFIG_SECURITY) += security.o
obj-$(CONFIG_SECURITYFS) += inode.o
+obj-$(CONFIG_SECURITY_CHROMIUMOS) += chromiumos/
obj-$(CONFIG_SECURITY_SELINUX) += selinux/
obj-$(CONFIG_SECURITY_SMACK) += smack/
obj-$(CONFIG_AUDIT) += lsm_audit.o
@@ -26,6 +29,7 @@
obj-$(CONFIG_SECURITY_YAMA) += yama/
obj-$(CONFIG_SECURITY_LOADPIN) += loadpin/
obj-$(CONFIG_CGROUP_DEVICE) += device_cgroup.o
+obj-$(CONFIG_SECURITY_CONTAINER_MONITOR) += container/
# Object integrity file lists
subdir-$(CONFIG_INTEGRITY) += integrity
diff --git a/security/chromiumos/Kconfig b/security/chromiumos/Kconfig
new file mode 100644
index 0000000..aaaa295
--- /dev/null
+++ b/security/chromiumos/Kconfig
@@ -0,0 +1,50 @@
+config SECURITY_CHROMIUMOS
+ bool "Chromium OS Security Module"
+ depends on SECURITY
+ depends on X86_64 || ARM64
+ help
+ The purpose of the Chromium OS security module is to reduce the
+ attack surface by preventing access to general-purpose access modes
+ not required by Chromium OS. Currently, the mount operation is
+ restricted by requiring a mount-point path without symbolic links,
+ and module loading is limited to the root filesystem. This LSM is
+ stacked ahead of any primary "full" LSM.
+
+config SECURITY_CHROMIUMOS_NO_SYMLINK_MOUNT
+ bool "Chromium OS Security: prohibit mount to symlinked target"
+ depends on SECURITY_CHROMIUMOS
+ default y
+ help
+ When enabled, the mount() syscall will return ELOOP whenever the
+ target path contains any symlinks.
+
+config SECURITY_CHROMIUMOS_NO_UNPRIVILEGED_UNSAFE_MOUNTS
+ bool "Chromium OS Security: prohibit unsafe mounts in unprivileged user namespaces"
+ depends on SECURITY_CHROMIUMOS
+ default y
+ help
+ When enabled, the mount() syscall will return EPERM whenever a new
+ mount is attempted that would give the filesystem the exec, suid,
+ or dev flags while the caller does not have the CAP_SYS_ADMIN
+ capability in the init namespace.
+
+config ALT_SYSCALL_CHROMIUMOS
+ bool "Chromium OS Alt-Syscall Tables"
+ depends on ALT_SYSCALL
+ help
+ Register restricted, alternate syscall tables used by Chromium OS
+ using the alt-syscall infrastructure. Alternate syscall tables
+ can be selected with prctl(PR_ALT_SYSCALL).
+
+config ALT_SYSCALL_CHROMIUMOS_LEGACY_API
+ bool
+ default y
+ depends on ALT_SYSCALL_CHROMIUMOS && ARM
+
+config SECURITY_CHROMIUMOS_READONLY_PROC_SELF_MEM
+ bool "Force /proc/<pid>/mem paths to be read-only"
+ default y
+ help
+ When enabled, attempts to open /proc/self/mem for write access
+ will always fail. Write access to this file allows bypassing
+ of memory map permissions (such as modifying read-only code).
diff --git a/security/chromiumos/Makefile b/security/chromiumos/Makefile
new file mode 100644
index 0000000..a59b4ec
--- /dev/null
+++ b/security/chromiumos/Makefile
@@ -0,0 +1,5 @@
+obj-$(CONFIG_SECURITY_CHROMIUMOS) := chromiumos_lsm.o
+
+chromiumos_lsm-y := inode_mark.o lsm.o securityfs.o utils.o
+
+obj-$(CONFIG_ALT_SYSCALL_CHROMIUMOS) += alt-syscall.o
diff --git a/security/chromiumos/alt-syscall.c b/security/chromiumos/alt-syscall.c
new file mode 100644
index 0000000..c4adfcd
--- /dev/null
+++ b/security/chromiumos/alt-syscall.c
@@ -0,0 +1,492 @@
+/*
+ * Chromium OS alt-syscall tables
+ *
+ * Copyright (C) 2015 Google, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/alt-syscall.h>
+#include <linux/compat.h>
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/prctl.h>
+#include <linux/sched/types.h>
+#include <linux/slab.h>
+#include <linux/socket.h>
+#include <linux/syscalls.h>
+#include <linux/timex.h>
+
+#include <asm/unistd.h>
+
+#include "alt-syscall.h"
+#include "android_whitelists.h"
+#include "complete_whitelists.h"
+#include "read_write_test_whitelists.h"
+#include "third_party_whitelists.h"
+
+#ifdef CONFIG_ALT_SYSCALL_CHROMIUMOS_LEGACY_API
+
+/* Intercept and log blocked syscalls. */
+static asmlinkage long block_syscall(void)
+{
+ struct task_struct *task = current;
+ struct pt_regs *regs = task_pt_regs(task);
+
+ pr_warn_ratelimited("[%d] %s: blocked syscall %d\n", task_pid_nr(task),
+ task->comm, syscall_get_nr(task, regs));
+
+ return -ENOSYS;
+}
+
+/*
+ * In permissive mode, warn that the syscall was blocked, but still allow
+ * it to go through. Note that since we don't have an easy way to map from
+ * syscall to number of arguments, we pass the maximum (6).
+ */
+static asmlinkage long warn_syscall(void)
+{
+ struct task_struct *task = current;
+ struct pt_regs *regs = task_pt_regs(task);
+ int nr = syscall_get_nr(task, regs);
+ sys_call_ptr_t fn = default_table.table[nr];
+ unsigned long args[6];
+
+ pr_warn_ratelimited("[%d] %s: syscall %d not whitelisted\n",
+ task_pid_nr(task), task->comm, nr);
+
+ syscall_get_arguments(task, regs, 0, ARRAY_SIZE(args), args);
+
+ return fn(args[0], args[1], args[2], args[3], args[4], args[5]);
+}
+
+#else
+
+/* Intercept and log blocked syscalls. */
+static asmlinkage long block_syscall(struct pt_regs *regs)
+{
+ struct task_struct *task = current;
+
+ pr_warn_ratelimited("[%d] %s: blocked syscall %d\n", task_pid_nr(task),
+ task->comm, syscall_get_nr(task, regs));
+
+ return -ENOSYS;
+}
+
+/*
+ * In permissive mode, warn that the syscall was blocked, but still allow
+ * it to go through by invoking the default table entry with the original
+ * register state.
+ */
+static asmlinkage long warn_syscall(struct pt_regs *regs)
+{
+ struct task_struct *task = current;
+ int nr = syscall_get_nr(task, regs);
+ sys_call_ptr_t fn = (sys_call_ptr_t)default_table.table[nr];
+
+ pr_warn_ratelimited("[%d] %s: syscall %d not whitelisted\n",
+ task_pid_nr(task), task->comm, nr);
+
+ return fn(regs);
+}
+
+#ifdef CONFIG_COMPAT
+static asmlinkage long warn_compat_syscall(struct pt_regs *regs)
+{
+ struct task_struct *task = current;
+ int nr = syscall_get_nr(task, regs);
+ sys_call_ptr_t fn = (sys_call_ptr_t)default_table.compat_table[nr];
+
+ pr_warn_ratelimited("[%d] %s: compat syscall %d not whitelisted\n",
+ task_pid_nr(task), task->comm, nr);
+
+ return fn(regs);
+}
+#endif /* CONFIG_COMPAT */
+#endif /* CONFIG_ALT_SYSCALL_CHROMIUMOS_LEGACY_API */
+
+static inline long do_alt_sys_prctl(int option, unsigned long arg2,
+ unsigned long arg3, unsigned long arg4,
+ unsigned long arg5)
+
+{
+ if (option == PR_ALT_SYSCALL &&
+ arg2 == PR_ALT_SYSCALL_SET_SYSCALL_TABLE)
+ return -EPERM;
+
+ return ksys_prctl(option, arg2, arg3, arg4, arg5);
+}
+DEF_ALT_SYS(alt_sys_prctl, 5, int, unsigned long, unsigned long,
+ unsigned long, unsigned long);
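+
+/*
+ * For illustration (not part of the original code): a process selects one of
+ * the tables registered below by name through prctl(), after which the
+ * alt_sys_prctl() override above keeps it from switching tables again, e.g.
+ *
+ *   prctl(PR_ALT_SYSCALL, PR_ALT_SYSCALL_SET_SYSCALL_TABLE, "android");
+ *
+ * The names come from the whitelists array later in this file ("android",
+ * "android_permissive", "third_party", "complete", ...). Passing the table
+ * name as the third argument is an assumption based on the constants used
+ * here, not something defined in this file.
+ */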
+
+/* Thread priority used by Android. */
+#define ANDROID_PRIORITY_FOREGROUND -2
+#define ANDROID_PRIORITY_DISPLAY -4
+#define ANDROID_PRIORITY_URGENT_DISPLAY -8
+#define ANDROID_PRIORITY_AUDIO -16
+#define ANDROID_PRIORITY_URGENT_AUDIO -19
+#define ANDROID_PRIORITY_HIGHEST -20
+
+/* Reduced priority when running inside container. */
+#define CONTAINER_PRIORITY_FOREGROUND -1
+#define CONTAINER_PRIORITY_DISPLAY -2
+#define CONTAINER_PRIORITY_URGENT_DISPLAY -4
+#define CONTAINER_PRIORITY_AUDIO -8
+#define CONTAINER_PRIORITY_URGENT_AUDIO -9
+#define CONTAINER_PRIORITY_HIGHEST -10
+
+/*
+ * TODO(mortonm): Move the implementation of these Android-specific
+ * alt-syscalls (starting with android_*) to their own .c file.
+ */
+static long do_android_getpriority(int which, int who)
+{
+ int prio, nice;
+
+ prio = ksys_getpriority(which, who);
+ if (prio <= 20)
+ return prio;
+
+ nice = -(prio - 20);
+ switch (nice) {
+ case CONTAINER_PRIORITY_FOREGROUND:
+ nice = ANDROID_PRIORITY_FOREGROUND;
+ break;
+ case CONTAINER_PRIORITY_DISPLAY:
+ nice = ANDROID_PRIORITY_DISPLAY;
+ break;
+ case CONTAINER_PRIORITY_URGENT_DISPLAY:
+ nice = ANDROID_PRIORITY_URGENT_DISPLAY;
+ break;
+ case CONTAINER_PRIORITY_AUDIO:
+ nice = ANDROID_PRIORITY_AUDIO;
+ break;
+ case CONTAINER_PRIORITY_URGENT_AUDIO:
+ nice = ANDROID_PRIORITY_URGENT_AUDIO;
+ break;
+ case CONTAINER_PRIORITY_HIGHEST:
+ nice = ANDROID_PRIORITY_HIGHEST;
+ break;
+ }
+
+ return -nice + 20;
+}
+DEF_ALT_SYS(android_getpriority, 2, int, int);
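+
+/*
+ * Worked example of the mapping above: for a container task at
+ * CONTAINER_PRIORITY_AUDIO (nice -8), ksys_getpriority() returns
+ * 20 - (-8) = 28; do_android_getpriority() converts that back to nice -8,
+ * remaps it to ANDROID_PRIORITY_AUDIO (-16) and returns -(-16) + 20 = 36,
+ * i.e. the value Android expects to see for nice -16.
+ */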
+
+static inline long do_android_keyctl(int cmd, unsigned long arg2,
+ unsigned long arg3, unsigned long arg4,
+ unsigned long arg5)
+{
+ return -EACCES;
+}
+DEF_ALT_SYS(android_keyctl, 5, int, unsigned long, unsigned long,
+ unsigned long, unsigned long);
+
+static inline int do_android_setpriority(int which, int who, int niceval)
+{
+ if (niceval < 0) {
+ if (niceval < -20)
+ niceval = -20;
+
+ niceval = niceval / 2;
+ }
+
+ return ksys_setpriority(which, who, niceval);
+}
+DEF_ALT_SYS(android_setpriority, 3, int, int, int);
+
+static inline long
+do_android_sched_setscheduler(pid_t pid, int policy,
+ struct sched_param __user *param)
+{
+ struct sched_param lparam;
+ struct task_struct *p;
+ long retval;
+
+ if (policy < 0)
+ return -EINVAL;
+
+ if (!param || pid < 0)
+ return -EINVAL;
+ if (copy_from_user(&lparam, param, sizeof(struct sched_param)))
+ return -EFAULT;
+
+ rcu_read_lock();
+ retval = -ESRCH;
+ p = pid ? find_task_by_vpid(pid) : current;
+ if (p != NULL) {
+ const struct cred *cred = current_cred();
+ kuid_t android_root_uid, android_system_uid;
+
+ /*
+ * Allow root(0) and system(1000) processes to set the RT scheduler.
+ *
+ * The system_server process runs under system and provides
+ * SchedulingPolicyService, which is used by audioflinger and
+ * other services to boost their threads, so allow it to set the RT
+ * scheduler for other threads.
+ */
+ android_root_uid = make_kuid(cred->user_ns, 0);
+ android_system_uid = make_kuid(cred->user_ns, 1000);
+ if ((uid_eq(cred->euid, android_root_uid) ||
+ uid_eq(cred->euid, android_system_uid)) &&
+ ns_capable(cred->user_ns, CAP_SYS_NICE))
+ retval = sched_setscheduler_nocheck(p, policy, &lparam);
+ else
+ retval = sched_setscheduler(p, policy, &lparam);
+ }
+ rcu_read_unlock();
+
+ return retval;
+}
+DEF_ALT_SYS(android_sched_setscheduler, 3, pid_t, int,
+ struct sched_param __user *);
+
+/*
+ * sched_setparam() passes in -1 for its policy, to let the functions
+ * it calls know not to change it.
+ */
+#define SETPARAM_POLICY -1
+
+static inline long do_android_sched_setparam(pid_t pid,
+ struct sched_param __user *param)
+{
+ return do_android_sched_setscheduler(pid, SETPARAM_POLICY, param);
+}
+DEF_ALT_SYS(android_sched_setparam, 2, pid_t, struct sched_param __user *);
+
+static inline long do_android_socket(int domain, int type, int protocol)
+{
+ if (domain == AF_VSOCK)
+ return -EACCES;
+
+ return __sys_socket(domain, type, protocol);
+}
+DEF_ALT_SYS(android_socket, 3, int, int, int);
+
+static inline long do_android_perf_event_open(
+ struct perf_event_attr __user *attr_uptr, pid_t pid, int cpu,
+ int group_fd, unsigned long flags)
+{
+ if (!allow_devmode_syscalls)
+ return -EACCES;
+
+ return ksys_perf_event_open(attr_uptr, pid, cpu, group_fd, flags);
+}
+DEF_ALT_SYS(android_perf_event_open, 5, struct perf_event_attr __user *,
+ pid_t, int, int, unsigned long);
+
+static inline long do_android_adjtimex(struct timex __user *buf)
+{
+ struct timex kbuf;
+
+ /* adjtimex() is allowed only for read. */
+ if (copy_from_user(&kbuf, buf, sizeof(struct timex)))
+ return -EFAULT;
+ if (kbuf.modes != 0)
+ return -EPERM;
+
+ return ksys_adjtimex(buf);
+}
+DEF_ALT_SYS(android_adjtimex, 1, struct timex __user *);
+
+static inline long do_android_clock_adjtime(clockid_t which_clock,
+ struct timex __user *buf)
+{
+ struct timex kbuf;
+
+ /* clock_adjtime() is allowed only for read. */
+ if (copy_from_user(&kbuf, buf, sizeof(struct timex)))
+ return -EFAULT;
+
+ if (kbuf.modes != 0)
+ return -EPERM;
+
+ return ksys_clock_adjtime(which_clock, buf);
+}
+DEF_ALT_SYS(android_clock_adjtime, 2, clockid_t, struct timex __user *);
+
+static inline long do_android_getcpu(unsigned __user *cpu,
+ unsigned __user *node,
+ struct getcpu_cache __user *cache)
+{
+ if (node || cache)
+ return -EPERM;
+
+ return ksys_getcpu(cpu, node, cache);
+}
+DEF_ALT_SYS(android_getcpu, 3, unsigned __user *, unsigned __user *,
+ struct getcpu_cache __user *);
+
+#ifdef CONFIG_COMPAT
+static inline long do_android_compat_adjtimex(struct compat_timex __user *buf)
+{
+ struct compat_timex kbuf;
+
+ /* adjtimex() is allowed only for read. */
+ if (copy_from_user(&kbuf, buf, sizeof(struct compat_timex)))
+ return -EFAULT;
+
+ if (kbuf.modes != 0)
+ return -EPERM;
+
+ return compat_ksys_adjtimex(buf);
+}
+DEF_ALT_SYS(android_compat_adjtimex, 1, struct compat_timex __user *);
+
+static inline long
+do_android_compat_clock_adjtime(clockid_t which_clock,
+ struct compat_timex __user *buf)
+{
+ struct compat_timex kbuf;
+
+ /* clock_adjtime() is allowed only for read. */
+ if (copy_from_user(&kbuf, buf, sizeof(struct compat_timex)))
+ return -EFAULT;
+
+ if (kbuf.modes != 0)
+ return -EPERM;
+
+ return compat_ksys_clock_adjtime(which_clock, buf);
+}
+DEF_ALT_SYS(android_compat_clock_adjtime, 2, clockid_t,
+ struct compat_timex __user *);
+
+#endif /* CONFIG_COMPAT */
+
+static struct syscall_whitelist whitelists[] = {
+ SYSCALL_WHITELIST(read_write_test),
+ SYSCALL_WHITELIST(android),
+ PERMISSIVE_SYSCALL_WHITELIST(android),
+ SYSCALL_WHITELIST(third_party),
+ PERMISSIVE_SYSCALL_WHITELIST(third_party),
+ SYSCALL_WHITELIST(complete),
+ PERMISSIVE_SYSCALL_WHITELIST(complete)
+};
+
+static int alt_syscall_apply_whitelist(const struct syscall_whitelist *wl,
+ struct alt_sys_call_table *t)
+{
+ unsigned int i;
+ DECLARE_BITMAP(whitelist, t->size);
+
+ bitmap_zero(whitelist, t->size);
+ for (i = 0; i < wl->nr_whitelist; i++) {
+ unsigned int nr = wl->whitelist[i].nr;
+
+ if (nr >= t->size)
+ return -EINVAL;
+ bitmap_set(whitelist, nr, 1);
+ if (wl->whitelist[i].alt)
+ t->table[nr] = wl->whitelist[i].alt;
+ }
+
+ for (i = 0; i < t->size; i++) {
+ if (!test_bit(i, whitelist)) {
+ t->table[i] = wl->permissive ?
+ (sys_call_ptr_t)warn_syscall :
+ (sys_call_ptr_t)block_syscall;
+ }
+ }
+
+ return 0;
+}
+
+#ifdef CONFIG_COMPAT
+static int
+alt_syscall_apply_compat_whitelist(const struct syscall_whitelist *wl,
+ struct alt_sys_call_table *t)
+{
+ unsigned int i;
+ DECLARE_BITMAP(whitelist, t->compat_size);
+
+ bitmap_zero(whitelist, t->compat_size);
+ for (i = 0; i < wl->nr_compat_whitelist; i++) {
+ unsigned int nr = wl->compat_whitelist[i].nr;
+
+ if (nr >= t->compat_size)
+ return -EINVAL;
+ bitmap_set(whitelist, nr, 1);
+ if (wl->compat_whitelist[i].alt)
+ t->compat_table[nr] = wl->compat_whitelist[i].alt;
+ }
+
+ for (i = 0; i < t->compat_size; i++) {
+ if (!test_bit(i, whitelist)) {
+ t->compat_table[i] = wl->permissive ?
+ (sys_call_ptr_t)warn_compat_syscall :
+ (sys_call_ptr_t)block_syscall;
+ }
+ }
+
+ return 0;
+}
+#else
+static inline int
+alt_syscall_apply_compat_whitelist(const struct syscall_whitelist *wl,
+ struct alt_sys_call_table *t)
+{
+ return 0;
+}
+#endif /* CONFIG_COMPAT */
+
+static int alt_syscall_init_one(const struct syscall_whitelist *wl)
+{
+ struct alt_sys_call_table *t;
+ int err;
+
+ t = kzalloc(sizeof(*t), GFP_KERNEL);
+ if (!t)
+ return -ENOMEM;
+ strncpy(t->name, wl->name, sizeof(t->name));
+
+ err = arch_dup_sys_call_table(t);
+ if (err)
+ return err;
+
+ err = alt_syscall_apply_whitelist(wl, t);
+ if (err)
+ return err;
+ err = alt_syscall_apply_compat_whitelist(wl, t);
+ if (err)
+ return err;
+
+ return register_alt_sys_call_table(t);
+}
+
+/*
+ * Register an alternate syscall table for each whitelist. Note that the
+ * lack of a module_exit() is intentional - once a syscall table is registered
+ * it cannot be unregistered.
+ *
+ * TODO(abrestic) Support unregistering syscall tables?
+ */
+static int chromiumos_alt_syscall_init(void)
+{
+ unsigned int i;
+ int err;
+
+#ifdef CONFIG_SYSCTL
+ if (!register_sysctl_paths(chromiumos_sysctl_path,
+ chromiumos_sysctl_table))
+ pr_warn("Failed to register sysctl\n");
+#endif
+
+ err = arch_dup_sys_call_table(&default_table);
+ if (err)
+ return err;
+
+ for (i = 0; i < ARRAY_SIZE(whitelists); i++) {
+ err = alt_syscall_init_one(&whitelists[i]);
+ if (err)
+ pr_warn("Failed to register syscall table %s: %d\n",
+ whitelists[i].name, err);
+ }
+
+ return 0;
+}
+module_init(chromiumos_alt_syscall_init);
diff --git a/security/chromiumos/alt-syscall.h b/security/chromiumos/alt-syscall.h
new file mode 100644
index 0000000..16473a2
--- /dev/null
+++ b/security/chromiumos/alt-syscall.h
@@ -0,0 +1,440 @@
+/*
+ * Linux Security Module for Chromium OS
+ *
+ * Copyright 2018 Google LLC. All Rights Reserved
+ *
+ * Authors:
+ * Micah Morton <mortonm@chromium.org>
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef ALT_SYSCALL_H
+#define ALT_SYSCALL_H
+
+/*
+ * NOTE: this file uses the 'static' keyword for variable and function
+ * definitions because alt-syscall.c is the only .c file that is expected to
+ * include this header. Definitions were pulled out from alt-syscall.c into
+ * this header and the *_whitelists.h headers for the sake of readability.
+ */
+
+static int allow_devmode_syscalls;
+
+#ifdef CONFIG_SYSCTL
+static int zero;
+static int one = 1;
+
+static struct ctl_path chromiumos_sysctl_path[] = {
+ { .procname = "kernel", },
+ { .procname = "chromiumos", },
+ { .procname = "alt_syscall", },
+ { }
+};
+
+static struct ctl_table chromiumos_sysctl_table[] = {
+ {
+ .procname = "allow_devmode_syscalls",
+ .data = &allow_devmode_syscalls,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+ { }
+};
+#endif
+
+struct syscall_whitelist_entry {
+ unsigned int nr;
+ sys_call_ptr_t alt;
+};
+
+struct syscall_whitelist {
+ const char *name;
+ const struct syscall_whitelist_entry *whitelist;
+ unsigned int nr_whitelist;
+#ifdef CONFIG_COMPAT
+ const struct syscall_whitelist_entry *compat_whitelist;
+ unsigned int nr_compat_whitelist;
+#endif
+ bool permissive;
+};
+
+static struct alt_sys_call_table default_table;
+
+#define SYSCALL_ENTRY_ALT(name, func) \
+ { \
+ .nr = __NR_ ## name, \
+ .alt = (sys_call_ptr_t)func, \
+ }
+#define SYSCALL_ENTRY(name) SYSCALL_ENTRY_ALT(name, NULL)
+#define COMPAT_SYSCALL_ENTRY_ALT(name, func) \
+ { \
+ .nr = __NR_compat_ ## name, \
+ .alt = (sys_call_ptr_t)func, \
+ }
+#define COMPAT_SYSCALL_ENTRY(name) COMPAT_SYSCALL_ENTRY_ALT(name, NULL)
+
+
+#ifdef CONFIG_ALT_SYSCALL_CHROMIUMOS_LEGACY_API
+#define ALT_SYS_ARG(n) arg ## n
+#else
+#define ALT_SYS_ARG(n) args[n - 1]
+#endif
+
+#define ALT_SYS_ARGS1_CALL(dt1) \
+ (dt1)ALT_SYS_ARG(1)
+#define ALT_SYS_ARGS2_CALL(dt1, dt2) \
+ ALT_SYS_ARGS1_CALL(dt1), (dt2)ALT_SYS_ARG(2)
+#define ALT_SYS_ARGS3_CALL(dt1, dt2, dt3) \
+ ALT_SYS_ARGS2_CALL(dt1, dt2), (dt3)ALT_SYS_ARG(3)
+#define ALT_SYS_ARGS4_CALL(dt1, dt2, dt3, dt4) \
+ ALT_SYS_ARGS3_CALL(dt1, dt2, dt3), (dt4)ALT_SYS_ARG(4)
+#define ALT_SYS_ARGS5_CALL(dt1, dt2, dt3, dt4, dt5) \
+ ALT_SYS_ARGS4_CALL(dt1, dt2, dt3, dt4), (dt5)ALT_SYS_ARG(5)
+#define ALT_SYS_ARGS6_CALL(dt1, dt2, dt3, dt4, dt5, dt6) \
+ ALT_SYS_ARGS5_CALL(dt1, dt2, dt3, dt4, dt5), (dt6)ALT_SYS_ARG(6)
+
+#ifdef CONFIG_ALT_SYSCALL_CHROMIUMOS_LEGACY_API
+
+#define ALT_SYS_ARGS1_HDR unsigned long arg1
+#define ALT_SYS_ARGS2_HDR ALT_SYS_ARGS1_HDR, unsigned long arg2
+#define ALT_SYS_ARGS3_HDR ALT_SYS_ARGS2_HDR, unsigned long arg3
+#define ALT_SYS_ARGS4_HDR ALT_SYS_ARGS3_HDR, unsigned long arg4
+#define ALT_SYS_ARGS5_HDR ALT_SYS_ARGS4_HDR, unsigned long arg5
+#define ALT_SYS_ARGS6_HDR ALT_SYS_ARGS5_HDR, unsigned long arg6
+
+#define DEF_ALT_SYS(func_name, nargs, ...) \
+static asmlinkage long func_name(ALT_SYS_ARGS ## nargs ## _HDR) \
+{ \
+ return do_##func_name(ALT_SYS_ARGS ## nargs ## _CALL(__VA_ARGS__)); \
+}
+
+#define DECL_ALT_SYS(func_name, nargs) \
+static asmlinkage long func_name(ALT_SYS_ARGS ## nargs ## _HDR)
+
+#else
+
+#define DEF_ALT_SYS(func_name, nargs, ...) \
+static asmlinkage long func_name(struct pt_regs *regs) \
+{ \
+ struct task_struct *task = current; \
+ unsigned long args[nargs]; \
+ \
+ syscall_get_arguments(task, regs, 0, nargs, args); \
+ \
+ return do_##func_name(ALT_SYS_ARGS ## nargs ## _CALL(__VA_ARGS__)); \
+}
+
+#define DECL_ALT_SYS(func_name, nargs) \
+static asmlinkage long func_name(struct pt_regs *regs)
+
+#endif /* CONFIG_ALT_SYSCALL_CHROMIUMOS_LEGACY_API */
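+
+/*
+ * Hand expansion for reference (illustrative only; the real functions are
+ * generated by DEF_ALT_SYS() above). In the non-legacy case,
+ *
+ *   DEF_ALT_SYS(android_keyctl, 5, int, unsigned long, unsigned long,
+ *               unsigned long, unsigned long);
+ *
+ * becomes roughly
+ *
+ *   static asmlinkage long android_keyctl(struct pt_regs *regs)
+ *   {
+ *           struct task_struct *task = current;
+ *           unsigned long args[5];
+ *
+ *           syscall_get_arguments(task, regs, 0, 5, args);
+ *           return do_android_keyctl((int)args[0], (unsigned long)args[1],
+ *                                    (unsigned long)args[2],
+ *                                    (unsigned long)args[3],
+ *                                    (unsigned long)args[4]);
+ *   }
+ */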
+
+/*
+ * If an alt_syscall table allows prctl(), override it to prevent a process
+ * from changing its syscall table.
+ */
+DECL_ALT_SYS(alt_sys_prctl, 5);
+
+#ifdef CONFIG_COMPAT
+#define SYSCALL_WHITELIST_COMPAT(x) \
+ .compat_whitelist = x ## _compat_whitelist, \
+ .nr_compat_whitelist = ARRAY_SIZE(x ## _compat_whitelist),
+#else
+#define SYSCALL_WHITELIST_COMPAT(x)
+#endif
+
+#define SYSCALL_WHITELIST(x) \
+ { \
+ .name = #x, \
+ .whitelist = x ## _whitelist, \
+ .nr_whitelist = ARRAY_SIZE(x ## _whitelist), \
+ SYSCALL_WHITELIST_COMPAT(x) \
+ }
+
+#define PERMISSIVE_SYSCALL_WHITELIST(x) \
+ { \
+ .name = #x "_permissive", \
+ .permissive = true, \
+ .whitelist = x ## _whitelist, \
+ .nr_whitelist = ARRAY_SIZE(x ## _whitelist), \
+ SYSCALL_WHITELIST_COMPAT(x) \
+ }
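+
+/*
+ * Illustrative expansion (hand-written for readability): with CONFIG_COMPAT
+ * enabled, SYSCALL_WHITELIST(android) produces roughly
+ *
+ *   {
+ *           .name = "android",
+ *           .whitelist = android_whitelist,
+ *           .nr_whitelist = ARRAY_SIZE(android_whitelist),
+ *           .compat_whitelist = android_compat_whitelist,
+ *           .nr_compat_whitelist = ARRAY_SIZE(android_compat_whitelist),
+ *   }
+ *
+ * while PERMISSIVE_SYSCALL_WHITELIST(android) additionally sets
+ * .permissive = true and names the table "android_permissive".
+ */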
+
+#ifdef CONFIG_COMPAT
+#ifdef CONFIG_X86_64
+#define __NR_compat_access __NR_ia32_access
+#define __NR_compat_adjtimex __NR_ia32_adjtimex
+#define __NR_compat_brk __NR_ia32_brk
+#define __NR_compat_capget __NR_ia32_capget
+#define __NR_compat_capset __NR_ia32_capset
+#define __NR_compat_chdir __NR_ia32_chdir
+#define __NR_compat_chmod __NR_ia32_chmod
+#define __NR_compat_clock_adjtime __NR_ia32_clock_adjtime
+#define __NR_compat_clock_getres __NR_ia32_clock_getres
+#define __NR_compat_clock_gettime __NR_ia32_clock_gettime
+#define __NR_compat_clock_nanosleep __NR_ia32_clock_nanosleep
+#define __NR_compat_clock_settime __NR_ia32_clock_settime
+#define __NR_compat_clone __NR_ia32_clone
+#define __NR_compat_close __NR_ia32_close
+#define __NR_compat_creat __NR_ia32_creat
+#define __NR_compat_dup __NR_ia32_dup
+#define __NR_compat_dup2 __NR_ia32_dup2
+#define __NR_compat_dup3 __NR_ia32_dup3
+#define __NR_compat_epoll_create __NR_ia32_epoll_create
+#define __NR_compat_epoll_create1 __NR_ia32_epoll_create1
+#define __NR_compat_epoll_ctl __NR_ia32_epoll_ctl
+#define __NR_compat_epoll_wait __NR_ia32_epoll_wait
+#define __NR_compat_epoll_pwait __NR_ia32_epoll_pwait
+#define __NR_compat_eventfd __NR_ia32_eventfd
+#define __NR_compat_eventfd2 __NR_ia32_eventfd2
+#define __NR_compat_execve __NR_ia32_execve
+#define __NR_compat_exit __NR_ia32_exit
+#define __NR_compat_exit_group __NR_ia32_exit_group
+#define __NR_compat_faccessat __NR_ia32_faccessat
+#define __NR_compat_fallocate __NR_ia32_fallocate
+#define __NR_compat_fchdir __NR_ia32_fchdir
+#define __NR_compat_fchmod __NR_ia32_fchmod
+#define __NR_compat_fchmodat __NR_ia32_fchmodat
+#define __NR_compat_fchown __NR_ia32_fchown
+#define __NR_compat_fchownat __NR_ia32_fchownat
+#define __NR_compat_fcntl __NR_ia32_fcntl
+#define __NR_compat_fdatasync __NR_ia32_fdatasync
+#define __NR_compat_fgetxattr __NR_ia32_fgetxattr
+#define __NR_compat_flistxattr __NR_ia32_flistxattr
+#define __NR_compat_flock __NR_ia32_flock
+#define __NR_compat_fork __NR_ia32_fork
+#define __NR_compat_fremovexattr __NR_ia32_fremovexattr
+#define __NR_compat_fsetxattr __NR_ia32_fsetxattr
+#define __NR_compat_fstat __NR_ia32_fstat
+#define __NR_compat_fstatfs __NR_ia32_fstatfs
+#define __NR_compat_fsync __NR_ia32_fsync
+#define __NR_compat_ftruncate __NR_ia32_ftruncate
+#define __NR_compat_futex __NR_ia32_futex
+#define __NR_compat_futimesat __NR_ia32_futimesat
+#define __NR_compat_getcpu __NR_ia32_getcpu
+#define __NR_compat_getcwd __NR_ia32_getcwd
+#define __NR_compat_getdents __NR_ia32_getdents
+#define __NR_compat_getdents64 __NR_ia32_getdents64
+#define __NR_compat_getegid __NR_ia32_getegid
+#define __NR_compat_geteuid __NR_ia32_geteuid
+#define __NR_compat_getgid __NR_ia32_getgid
+#define __NR_compat_getgroups32 __NR_ia32_getgroups32
+#define __NR_compat_getpgid __NR_ia32_getpgid
+#define __NR_compat_getpgrp __NR_ia32_getpgrp
+#define __NR_compat_getpid __NR_ia32_getpid
+#define __NR_compat_getppid __NR_ia32_getppid
+#define __NR_compat_getpriority __NR_ia32_getpriority
+#define __NR_compat_getrandom __NR_ia32_getrandom
+#define __NR_compat_getresgid __NR_ia32_getresgid
+#define __NR_compat_getresuid __NR_ia32_getresuid
+#define __NR_compat_getrlimit __NR_ia32_getrlimit
+#define __NR_compat_getrusage __NR_ia32_getrusage
+#define __NR_compat_getsid __NR_ia32_getsid
+#define __NR_compat_gettid __NR_ia32_gettid
+#define __NR_compat_gettimeofday __NR_ia32_gettimeofday
+#define __NR_compat_getuid __NR_ia32_getuid
+#define __NR_compat_getxattr __NR_ia32_getxattr
+#define __NR_compat_inotify_add_watch __NR_ia32_inotify_add_watch
+#define __NR_compat_inotify_init __NR_ia32_inotify_init
+#define __NR_compat_inotify_init1 __NR_ia32_inotify_init1
+#define __NR_compat_inotify_rm_watch __NR_ia32_inotify_rm_watch
+#define __NR_compat_ioctl __NR_ia32_ioctl
+#define __NR_compat_io_destroy __NR_ia32_io_destroy
+#define __NR_compat_io_getevents __NR_ia32_io_getevents
+#define __NR_compat_io_setup __NR_ia32_io_setup
+#define __NR_compat_io_submit __NR_ia32_io_submit
+#define __NR_compat_ioprio_set __NR_ia32_ioprio_set
+#define __NR_compat_keyctl __NR_ia32_keyctl
+#define __NR_compat_kill __NR_ia32_kill
+#define __NR_compat_lgetxattr __NR_ia32_lgetxattr
+#define __NR_compat_link __NR_ia32_link
+#define __NR_compat_linkat __NR_ia32_linkat
+#define __NR_compat_listxattr __NR_ia32_listxattr
+#define __NR_compat_llistxattr __NR_ia32_llistxattr
+#define __NR_compat_lremovexattr __NR_ia32_lremovexattr
+#define __NR_compat_lseek __NR_ia32_lseek
+#define __NR_compat_lsetxattr __NR_ia32_lsetxattr
+#define __NR_compat_lstat __NR_ia32_lstat
+#define __NR_compat_madvise __NR_ia32_madvise
+#define __NR_compat_memfd_create __NR_ia32_memfd_create
+#define __NR_compat_mincore __NR_ia32_mincore
+#define __NR_compat_mkdir __NR_ia32_mkdir
+#define __NR_compat_mkdirat __NR_ia32_mkdirat
+#define __NR_compat_mknod __NR_ia32_mknod
+#define __NR_compat_mknodat __NR_ia32_mknodat
+#define __NR_compat_mlock __NR_ia32_mlock
+#define __NR_compat_munlock __NR_ia32_munlock
+#define __NR_compat_mlockall __NR_ia32_mlockall
+#define __NR_compat_munlockall __NR_ia32_munlockall
+#define __NR_compat_modify_ldt __NR_ia32_modify_ldt
+#define __NR_compat_mount __NR_ia32_mount
+#define __NR_compat_mprotect __NR_ia32_mprotect
+#define __NR_compat_mremap __NR_ia32_mremap
+#define __NR_compat_msync __NR_ia32_msync
+#define __NR_compat_munmap __NR_ia32_munmap
+#define __NR_compat_name_to_handle_at __NR_ia32_name_to_handle_at
+#define __NR_compat_nanosleep __NR_ia32_nanosleep
+#define __NR_compat_open __NR_ia32_open
+#define __NR_compat_open_by_handle_at __NR_ia32_open_by_handle_at
+#define __NR_compat_openat __NR_ia32_openat
+#define __NR_compat_perf_event_open __NR_ia32_perf_event_open
+#define __NR_compat_personality __NR_ia32_personality
+#define __NR_compat_pipe __NR_ia32_pipe
+#define __NR_compat_pipe2 __NR_ia32_pipe2
+#define __NR_compat_poll __NR_ia32_poll
+#define __NR_compat_ppoll __NR_ia32_ppoll
+#define __NR_compat_prctl __NR_ia32_prctl
+#define __NR_compat_pread64 __NR_ia32_pread64
+#define __NR_compat_preadv __NR_ia32_preadv
+#define __NR_compat_prlimit64 __NR_ia32_prlimit64
+#define __NR_compat_process_vm_readv __NR_ia32_process_vm_readv
+#define __NR_compat_process_vm_writev __NR_ia32_process_vm_writev
+#define __NR_compat_pselect6 __NR_ia32_pselect6
+#define __NR_compat_ptrace __NR_ia32_ptrace
+#define __NR_compat_pwrite64 __NR_ia32_pwrite64
+#define __NR_compat_pwritev __NR_ia32_pwritev
+#define __NR_compat_read __NR_ia32_read
+#define __NR_compat_readahead __NR_ia32_readahead
+#define __NR_compat_readv __NR_ia32_readv
+#define __NR_compat_readlink __NR_ia32_readlink
+#define __NR_compat_readlinkat __NR_ia32_readlinkat
+#define __NR_compat_recvmmsg __NR_ia32_recvmmsg
+#define __NR_compat_remap_file_pages __NR_ia32_remap_file_pages
+#define __NR_compat_removexattr __NR_ia32_removexattr
+#define __NR_compat_rename __NR_ia32_rename
+#define __NR_compat_renameat __NR_ia32_renameat
+#define __NR_compat_restart_syscall __NR_ia32_restart_syscall
+#define __NR_compat_rmdir __NR_ia32_rmdir
+#define __NR_compat_rt_sigaction __NR_ia32_rt_sigaction
+#define __NR_compat_rt_sigpending __NR_ia32_rt_sigpending
+#define __NR_compat_rt_sigprocmask __NR_ia32_rt_sigprocmask
+#define __NR_compat_rt_sigqueueinfo __NR_ia32_rt_sigqueueinfo
+#define __NR_compat_rt_sigreturn __NR_ia32_rt_sigreturn
+#define __NR_compat_rt_sigsuspend __NR_ia32_rt_sigsuspend
+#define __NR_compat_rt_sigtimedwait __NR_ia32_rt_sigtimedwait
+#define __NR_compat_rt_tgsigqueueinfo __NR_ia32_rt_tgsigqueueinfo
+#define __NR_compat_sched_get_priority_max __NR_ia32_sched_get_priority_max
+#define __NR_compat_sched_get_priority_min __NR_ia32_sched_get_priority_min
+#define __NR_compat_sched_getaffinity __NR_ia32_sched_getaffinity
+#define __NR_compat_sched_getparam __NR_ia32_sched_getparam
+#define __NR_compat_sched_getscheduler __NR_ia32_sched_getscheduler
+#define __NR_compat_sched_setaffinity __NR_ia32_sched_setaffinity
+#define __NR_compat_sched_setparam __NR_ia32_sched_setparam
+#define __NR_compat_sched_setscheduler __NR_ia32_sched_setscheduler
+#define __NR_compat_sched_yield __NR_ia32_sched_yield
+#define __NR_compat_seccomp __NR_ia32_seccomp
+#define __NR_compat_sendfile __NR_ia32_sendfile
+#define __NR_compat_sendfile64 __NR_ia32_sendfile64
+#define __NR_compat_sendmmsg __NR_ia32_sendmmsg
+#define __NR_compat_setdomainname __NR_ia32_setdomainname
+#define __NR_compat_set_robust_list __NR_ia32_set_robust_list
+#define __NR_compat_set_tid_address __NR_ia32_set_tid_address
+#define __NR_compat_set_thread_area __NR_ia32_set_thread_area
+#define __NR_compat_setgid __NR_ia32_setgid
+#define __NR_compat_setgroups __NR_ia32_setgroups
+#define __NR_compat_setitimer __NR_ia32_setitimer
+#define __NR_compat_setns __NR_ia32_setns
+#define __NR_compat_setpgid __NR_ia32_setpgid
+#define __NR_compat_setpriority __NR_ia32_setpriority
+#define __NR_compat_setregid __NR_ia32_setregid
+#define __NR_compat_setresgid __NR_ia32_setresgid
+#define __NR_compat_setresuid __NR_ia32_setresuid
+#define __NR_compat_setrlimit __NR_ia32_setrlimit
+#define __NR_compat_setsid __NR_ia32_setsid
+#define __NR_compat_settimeofday __NR_ia32_settimeofday
+#define __NR_compat_setuid __NR_ia32_setuid
+#define __NR_compat_setxattr __NR_ia32_setxattr
+#define __NR_compat_signalfd4 __NR_ia32_signalfd4
+#define __NR_compat_sigaltstack __NR_ia32_sigaltstack
+#define __NR_compat_socketcall __NR_ia32_socketcall
+#define __NR_compat_splice __NR_ia32_splice
+#define __NR_compat_stat __NR_ia32_stat
+#define __NR_compat_statfs __NR_ia32_statfs
+#define __NR_compat_symlink __NR_ia32_symlink
+#define __NR_compat_symlinkat __NR_ia32_symlinkat
+#define __NR_compat_sync __NR_ia32_sync
+#define __NR_compat_syncfs __NR_ia32_syncfs
+#define __NR_compat_sync_file_range __NR_ia32_sync_file_range
+#define __NR_compat_sysinfo __NR_ia32_sysinfo
+#define __NR_compat_syslog __NR_ia32_syslog
+#define __NR_compat_tee __NR_ia32_tee
+#define __NR_compat_tgkill __NR_ia32_tgkill
+#define __NR_compat_tkill __NR_ia32_tkill
+#define __NR_compat_time __NR_ia32_time
+#define __NR_compat_timer_create __NR_ia32_timer_create
+#define __NR_compat_timer_delete __NR_ia32_timer_delete
+#define __NR_compat_timer_getoverrun __NR_ia32_timer_getoverrun
+#define __NR_compat_timer_gettime __NR_ia32_timer_gettime
+#define __NR_compat_timer_settime __NR_ia32_timer_settime
+#define __NR_compat_timerfd_create __NR_ia32_timerfd_create
+#define __NR_compat_timerfd_gettime __NR_ia32_timerfd_gettime
+#define __NR_compat_timerfd_settime __NR_ia32_timerfd_settime
+#define __NR_compat_times __NR_ia32_times
+#define __NR_compat_truncate __NR_ia32_truncate
+#define __NR_compat_umask __NR_ia32_umask
+#define __NR_compat_umount2 __NR_ia32_umount2
+#define __NR_compat_uname __NR_ia32_uname
+#define __NR_compat_unlink __NR_ia32_unlink
+#define __NR_compat_unlinkat __NR_ia32_unlinkat
+#define __NR_compat_unshare __NR_ia32_unshare
+#define __NR_compat_ustat __NR_ia32_ustat
+#define __NR_compat_utimensat __NR_ia32_utimensat
+#define __NR_compat_utimes __NR_ia32_utimes
+#define __NR_compat_vfork __NR_ia32_vfork
+#define __NR_compat_vmsplice __NR_ia32_vmsplice
+#define __NR_compat_wait4 __NR_ia32_wait4
+#define __NR_compat_waitid __NR_ia32_waitid
+#define __NR_compat_waitpid __NR_ia32_waitpid
+#define __NR_compat_write __NR_ia32_write
+#define __NR_compat_writev __NR_ia32_writev
+#define __NR_compat_chown32 __NR_ia32_chown32
+#define __NR_compat_fadvise64 __NR_ia32_fadvise64
+#define __NR_compat_fadvise64_64 __NR_ia32_fadvise64_64
+#define __NR_compat_fchown32 __NR_ia32_fchown32
+#define __NR_compat_fcntl64 __NR_ia32_fcntl64
+#define __NR_compat_fstat64 __NR_ia32_fstat64
+#define __NR_compat_fstatat64 __NR_ia32_fstatat64
+#define __NR_compat_fstatfs64 __NR_ia32_fstatfs64
+#define __NR_compat_ftruncate64 __NR_ia32_ftruncate64
+#define __NR_compat_getegid32 __NR_ia32_getegid32
+#define __NR_compat_geteuid32 __NR_ia32_geteuid32
+#define __NR_compat_getgid32 __NR_ia32_getgid32
+#define __NR_compat_getresgid32 __NR_ia32_getresgid32
+#define __NR_compat_getresuid32 __NR_ia32_getresuid32
+#define __NR_compat_getuid32 __NR_ia32_getuid32
+#define __NR_compat_lchown32 __NR_ia32_lchown32
+#define __NR_compat_lstat64 __NR_ia32_lstat64
+#define __NR_compat_mmap2 __NR_ia32_mmap2
+#define __NR_compat__newselect __NR_ia32__newselect
+#define __NR_compat__llseek __NR_ia32__llseek
+#define __NR_compat_sigaction __NR_ia32_sigaction
+#define __NR_compat_sigpending __NR_ia32_sigpending
+#define __NR_compat_sigprocmask __NR_ia32_sigprocmask
+#define __NR_compat_sigreturn __NR_ia32_sigreturn
+#define __NR_compat_sigsuspend __NR_ia32_sigsuspend
+#define __NR_compat_setgid32 __NR_ia32_setgid32
+#define __NR_compat_setgroups32 __NR_ia32_setgroups32
+#define __NR_compat_setregid32 __NR_ia32_setregid32
+#define __NR_compat_setresgid32 __NR_ia32_setresgid32
+#define __NR_compat_setresuid32 __NR_ia32_setresuid32
+#define __NR_compat_setreuid32 __NR_ia32_setreuid32
+#define __NR_compat_setuid32 __NR_ia32_setuid32
+#define __NR_compat_stat64 __NR_ia32_stat64
+#define __NR_compat_statfs64 __NR_ia32_statfs64
+#define __NR_compat_truncate64 __NR_ia32_truncate64
+#define __NR_compat_ugetrlimit __NR_ia32_ugetrlimit
+#endif
+#endif
+
+#endif /* ALT_SYSCALL_H */
diff --git a/security/chromiumos/android_whitelists.h b/security/chromiumos/android_whitelists.h
new file mode 100644
index 0000000..56ae74a
--- /dev/null
+++ b/security/chromiumos/android_whitelists.h
@@ -0,0 +1,697 @@
+/*
+ * Linux Security Module for Chromium OS
+ *
+ * Copyright 2018 Google LLC. All Rights Reserved
+ *
+ * Authors:
+ * Micah Morton <mortonm@chromium.org>
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef ANDROID_WHITELISTS_H
+#define ANDROID_WHITELISTS_H
+
+/*
+ * NOTE: the purpose of this header is only to pull out the definition of this
+ * array from alt-syscall.c for the purposes of readability. It should not be
+ * included in other .c files.
+ */
+
+#include "alt-syscall.h"
+
+/*
+ * Syscall overrides for android.
+ */
+
+/*
+ * Reflect the priority adjustment done by android_setpriority.
+ * Note that the prio returned by getpriority has been offset by 20.
+ * (returns 40..1 instead of -20..19)
+ */
+DECL_ALT_SYS(android_getpriority, 2);
+/* Android does not get to call keyctl. */
+DECL_ALT_SYS(android_keyctl, 5);
+/* Make sure nothing sets a nice value more favorable than -10. */
+DECL_ALT_SYS(android_setpriority, 3);
+DECL_ALT_SYS(android_sched_setscheduler, 3);
+DECL_ALT_SYS(android_sched_setparam, 2);
+DECL_ALT_SYS(android_socket, 3);
+DECL_ALT_SYS(android_perf_event_open, 5);
+DECL_ALT_SYS(android_adjtimex, 1);
+DECL_ALT_SYS(android_clock_adjtime, 2);
+DECL_ALT_SYS(android_getcpu, 3);
+
+#ifdef CONFIG_COMPAT
+DECL_ALT_SYS(android_compat_adjtimex, 1);
+DECL_ALT_SYS(android_compat_clock_adjtime, 2);
+#endif /* CONFIG_COMPAT */
+
+static struct syscall_whitelist_entry android_whitelist[] = {
+ SYSCALL_ENTRY(accept),
+ SYSCALL_ENTRY(accept4),
+ SYSCALL_ENTRY_ALT(adjtimex, android_adjtimex),
+ SYSCALL_ENTRY(bind),
+ SYSCALL_ENTRY(brk),
+ SYSCALL_ENTRY(capget),
+ SYSCALL_ENTRY(capset),
+ SYSCALL_ENTRY(chdir),
+ SYSCALL_ENTRY_ALT(clock_adjtime, android_clock_adjtime),
+ SYSCALL_ENTRY(clock_getres),
+ SYSCALL_ENTRY(clock_gettime),
+ SYSCALL_ENTRY(clock_nanosleep),
+ SYSCALL_ENTRY(clock_settime),
+ SYSCALL_ENTRY(clone),
+ SYSCALL_ENTRY(close),
+ SYSCALL_ENTRY(connect),
+ SYSCALL_ENTRY(dup),
+ SYSCALL_ENTRY(dup3),
+ SYSCALL_ENTRY(epoll_create1),
+ SYSCALL_ENTRY(epoll_ctl),
+ SYSCALL_ENTRY(epoll_pwait),
+ SYSCALL_ENTRY(eventfd2),
+ SYSCALL_ENTRY(execve),
+ SYSCALL_ENTRY(exit),
+ SYSCALL_ENTRY(exit_group),
+ SYSCALL_ENTRY(faccessat),
+ SYSCALL_ENTRY(fallocate),
+ SYSCALL_ENTRY(fchdir),
+ SYSCALL_ENTRY(fchmod),
+ SYSCALL_ENTRY(fchmodat),
+ SYSCALL_ENTRY(fchownat),
+ SYSCALL_ENTRY(fcntl),
+ SYSCALL_ENTRY(fdatasync),
+ SYSCALL_ENTRY(fgetxattr),
+ SYSCALL_ENTRY(flistxattr),
+ SYSCALL_ENTRY(flock),
+ SYSCALL_ENTRY(fremovexattr),
+ SYSCALL_ENTRY(fsetxattr),
+ SYSCALL_ENTRY(fstat),
+ SYSCALL_ENTRY(fstatfs),
+ SYSCALL_ENTRY(fsync),
+ SYSCALL_ENTRY(ftruncate),
+ SYSCALL_ENTRY(futex),
+ SYSCALL_ENTRY_ALT(getcpu, android_getcpu),
+ SYSCALL_ENTRY(getcwd),
+ SYSCALL_ENTRY(getdents64),
+ SYSCALL_ENTRY(getpeername),
+ SYSCALL_ENTRY(getpgid),
+ SYSCALL_ENTRY(getpid),
+ SYSCALL_ENTRY(getppid),
+ SYSCALL_ENTRY_ALT(getpriority, android_getpriority),
+ SYSCALL_ENTRY(getrandom),
+ SYSCALL_ENTRY(getrusage),
+ SYSCALL_ENTRY(getsid),
+ SYSCALL_ENTRY(getsockname),
+ SYSCALL_ENTRY(getsockopt),
+ SYSCALL_ENTRY(gettid),
+ SYSCALL_ENTRY(gettimeofday),
+ SYSCALL_ENTRY(getxattr),
+ SYSCALL_ENTRY(inotify_add_watch),
+ SYSCALL_ENTRY(inotify_init1),
+ SYSCALL_ENTRY(inotify_rm_watch),
+ SYSCALL_ENTRY(ioctl),
+ SYSCALL_ENTRY(io_destroy),
+ SYSCALL_ENTRY(io_getevents),
+ SYSCALL_ENTRY(io_setup),
+ SYSCALL_ENTRY(io_submit),
+ SYSCALL_ENTRY(ioprio_set),
+ SYSCALL_ENTRY_ALT(keyctl, android_keyctl),
+ SYSCALL_ENTRY(kill),
+ SYSCALL_ENTRY(lgetxattr),
+ SYSCALL_ENTRY(linkat),
+ SYSCALL_ENTRY(listxattr),
+ SYSCALL_ENTRY(listen),
+ SYSCALL_ENTRY(llistxattr),
+ SYSCALL_ENTRY(lremovexattr),
+ SYSCALL_ENTRY(lseek),
+ SYSCALL_ENTRY(lsetxattr),
+ SYSCALL_ENTRY(madvise),
+ SYSCALL_ENTRY(memfd_create),
+ SYSCALL_ENTRY(mincore),
+ SYSCALL_ENTRY(mkdirat),
+ SYSCALL_ENTRY(mknodat),
+ SYSCALL_ENTRY(mlock),
+ SYSCALL_ENTRY(mlockall),
+ SYSCALL_ENTRY(munlock),
+ SYSCALL_ENTRY(munlockall),
+ SYSCALL_ENTRY(mount),
+ SYSCALL_ENTRY(mprotect),
+ SYSCALL_ENTRY(mremap),
+ SYSCALL_ENTRY(msync),
+ SYSCALL_ENTRY(munmap),
+ SYSCALL_ENTRY(name_to_handle_at),
+ SYSCALL_ENTRY(nanosleep),
+ SYSCALL_ENTRY(open_by_handle_at),
+ SYSCALL_ENTRY(openat),
+ SYSCALL_ENTRY_ALT(perf_event_open, android_perf_event_open),
+ SYSCALL_ENTRY(personality),
+ SYSCALL_ENTRY(pipe2),
+ SYSCALL_ENTRY(ppoll),
+ SYSCALL_ENTRY_ALT(prctl, alt_sys_prctl),
+ SYSCALL_ENTRY(pread64),
+ SYSCALL_ENTRY(preadv),
+ SYSCALL_ENTRY(prlimit64),
+ SYSCALL_ENTRY(process_vm_readv),
+ SYSCALL_ENTRY(process_vm_writev),
+ SYSCALL_ENTRY(pselect6),
+ SYSCALL_ENTRY(ptrace),
+ SYSCALL_ENTRY(pwrite64),
+ SYSCALL_ENTRY(pwritev),
+ SYSCALL_ENTRY(read),
+ SYSCALL_ENTRY(readahead),
+ SYSCALL_ENTRY(readv),
+ SYSCALL_ENTRY(readlinkat),
+ SYSCALL_ENTRY(recvfrom),
+ SYSCALL_ENTRY(recvmmsg),
+ SYSCALL_ENTRY(recvmsg),
+ SYSCALL_ENTRY(remap_file_pages),
+ SYSCALL_ENTRY(removexattr),
+ SYSCALL_ENTRY(renameat),
+ SYSCALL_ENTRY(restart_syscall),
+ SYSCALL_ENTRY(rt_sigaction),
+ SYSCALL_ENTRY(rt_sigpending),
+ SYSCALL_ENTRY(rt_sigprocmask),
+ SYSCALL_ENTRY(rt_sigqueueinfo),
+ SYSCALL_ENTRY(rt_sigreturn),
+ SYSCALL_ENTRY(rt_sigsuspend),
+ SYSCALL_ENTRY(rt_sigtimedwait),
+ SYSCALL_ENTRY(rt_tgsigqueueinfo),
+ SYSCALL_ENTRY(sched_get_priority_max),
+ SYSCALL_ENTRY(sched_get_priority_min),
+ SYSCALL_ENTRY(sched_getaffinity),
+ SYSCALL_ENTRY(sched_getparam),
+ SYSCALL_ENTRY(sched_getscheduler),
+ SYSCALL_ENTRY(sched_setaffinity),
+ SYSCALL_ENTRY_ALT(sched_setparam, android_sched_setparam),
+ SYSCALL_ENTRY_ALT(sched_setscheduler, android_sched_setscheduler),
+ SYSCALL_ENTRY(sched_yield),
+ SYSCALL_ENTRY(seccomp),
+ SYSCALL_ENTRY(sendfile),
+ SYSCALL_ENTRY(sendmmsg),
+ SYSCALL_ENTRY(sendmsg),
+ SYSCALL_ENTRY(sendto),
+ SYSCALL_ENTRY(setdomainname),
+ SYSCALL_ENTRY(set_robust_list),
+ SYSCALL_ENTRY(set_tid_address),
+ SYSCALL_ENTRY(setitimer),
+ SYSCALL_ENTRY(setns),
+ SYSCALL_ENTRY(setpgid),
+ SYSCALL_ENTRY_ALT(setpriority, android_setpriority),
+ SYSCALL_ENTRY(setrlimit),
+ SYSCALL_ENTRY(setsid),
+ SYSCALL_ENTRY(setsockopt),
+ SYSCALL_ENTRY(settimeofday),
+ SYSCALL_ENTRY(setxattr),
+ SYSCALL_ENTRY(shutdown),
+ SYSCALL_ENTRY(signalfd4),
+ SYSCALL_ENTRY(sigaltstack),
+ SYSCALL_ENTRY_ALT(socket, android_socket),
+ SYSCALL_ENTRY(socketpair),
+ SYSCALL_ENTRY(splice),
+ SYSCALL_ENTRY(statfs),
+ SYSCALL_ENTRY(symlinkat),
+ SYSCALL_ENTRY(sync),
+ SYSCALL_ENTRY(syncfs),
+ SYSCALL_ENTRY(sysinfo),
+ SYSCALL_ENTRY(syslog),
+ SYSCALL_ENTRY(tee),
+ SYSCALL_ENTRY(tgkill),
+ SYSCALL_ENTRY(tkill),
+ SYSCALL_ENTRY(timer_create),
+ SYSCALL_ENTRY(timer_delete),
+ SYSCALL_ENTRY(timer_gettime),
+ SYSCALL_ENTRY(timer_getoverrun),
+ SYSCALL_ENTRY(timer_settime),
+ SYSCALL_ENTRY(timerfd_create),
+ SYSCALL_ENTRY(timerfd_gettime),
+ SYSCALL_ENTRY(timerfd_settime),
+ SYSCALL_ENTRY(times),
+ SYSCALL_ENTRY(truncate),
+ SYSCALL_ENTRY(umask),
+ SYSCALL_ENTRY(umount2),
+ SYSCALL_ENTRY(uname),
+ SYSCALL_ENTRY(unlinkat),
+ SYSCALL_ENTRY(unshare),
+ SYSCALL_ENTRY(utimensat),
+ SYSCALL_ENTRY(vmsplice),
+ SYSCALL_ENTRY(wait4),
+ SYSCALL_ENTRY(waitid),
+ SYSCALL_ENTRY(write),
+ SYSCALL_ENTRY(writev),
+
+ /*
+ * Deprecated syscalls which are not wired up on new architectures
+ * such as ARM64.
+ */
+#ifndef CONFIG_ARM64
+ SYSCALL_ENTRY(access),
+ SYSCALL_ENTRY(chmod),
+ SYSCALL_ENTRY(open),
+ SYSCALL_ENTRY(creat),
+ SYSCALL_ENTRY(dup2),
+ SYSCALL_ENTRY(epoll_create),
+ SYSCALL_ENTRY(epoll_wait),
+ SYSCALL_ENTRY(eventfd),
+ SYSCALL_ENTRY(fork),
+ SYSCALL_ENTRY(futimesat),
+ SYSCALL_ENTRY(getdents),
+ SYSCALL_ENTRY(getpgrp),
+ SYSCALL_ENTRY(inotify_init),
+ SYSCALL_ENTRY(link),
+ SYSCALL_ENTRY(lstat),
+ SYSCALL_ENTRY(mkdir),
+ SYSCALL_ENTRY(mknod),
+ SYSCALL_ENTRY(pipe),
+ SYSCALL_ENTRY(poll),
+ SYSCALL_ENTRY(readlink),
+ SYSCALL_ENTRY(rename),
+ SYSCALL_ENTRY(rmdir),
+ SYSCALL_ENTRY(stat),
+ SYSCALL_ENTRY(symlink),
+ SYSCALL_ENTRY(unlink),
+ SYSCALL_ENTRY(ustat),
+ SYSCALL_ENTRY(utimes),
+ SYSCALL_ENTRY(vfork),
+#endif
+
+ /*
+ * recv(2)/send(2) are officially deprecated, but their entry-points
+ * still exist on ARM.
+ */
+#ifdef CONFIG_ARM
+ SYSCALL_ENTRY(recv),
+ SYSCALL_ENTRY(send),
+#endif
+
+ /*
+ * posix_fadvise(2) and sync_file_range(2) have ARM-specific wrappers
+ * to deal with register alignment.
+ */
+#ifdef CONFIG_ARM
+ SYSCALL_ENTRY(arm_fadvise64_64),
+ SYSCALL_ENTRY(sync_file_range2),
+#else
+ SYSCALL_ENTRY(fadvise64),
+ SYSCALL_ENTRY(sync_file_range),
+#endif
+
+ /* 64-bit only syscalls. */
+#if defined(CONFIG_X86_64) || defined(CONFIG_ARM64)
+ SYSCALL_ENTRY(fchown),
+ SYSCALL_ENTRY(getegid),
+ SYSCALL_ENTRY(geteuid),
+ SYSCALL_ENTRY(getgid),
+ SYSCALL_ENTRY(getgroups),
+ SYSCALL_ENTRY(getresgid),
+ SYSCALL_ENTRY(getresuid),
+ SYSCALL_ENTRY(getrlimit),
+ SYSCALL_ENTRY(getuid),
+ SYSCALL_ENTRY(newfstatat),
+ SYSCALL_ENTRY(mmap),
+ SYSCALL_ENTRY(setgid),
+ SYSCALL_ENTRY(setgroups),
+ SYSCALL_ENTRY(setregid),
+ SYSCALL_ENTRY(setresgid),
+ SYSCALL_ENTRY(setresuid),
+ SYSCALL_ENTRY(setreuid),
+ SYSCALL_ENTRY(setuid),
+ /*
+ * chown(2) and lchown(2) are deprecated and not wired up
+ * on ARM64.
+ */
+#ifndef CONFIG_ARM64
+ SYSCALL_ENTRY(chown),
+ SYSCALL_ENTRY(lchown),
+#endif
+#endif
+
+ /* ARM32 only syscalls. */
+#if defined(CONFIG_ARM)
+ SYSCALL_ENTRY(chown32),
+ SYSCALL_ENTRY(fchown32),
+ SYSCALL_ENTRY(fcntl64),
+ SYSCALL_ENTRY(fstat64),
+ SYSCALL_ENTRY(fstatat64),
+ SYSCALL_ENTRY(fstatfs64),
+ SYSCALL_ENTRY(ftruncate64),
+ SYSCALL_ENTRY(getegid32),
+ SYSCALL_ENTRY(geteuid32),
+ SYSCALL_ENTRY(getgid32),
+ SYSCALL_ENTRY(getgroups32),
+ SYSCALL_ENTRY(getresgid32),
+ SYSCALL_ENTRY(getresuid32),
+ SYSCALL_ENTRY(getuid32),
+ SYSCALL_ENTRY(lchown32),
+ SYSCALL_ENTRY(lstat64),
+ SYSCALL_ENTRY(mmap2),
+ SYSCALL_ENTRY(_newselect),
+ SYSCALL_ENTRY(_llseek),
+ SYSCALL_ENTRY(sigaction),
+ SYSCALL_ENTRY(sigpending),
+ SYSCALL_ENTRY(sigprocmask),
+ SYSCALL_ENTRY(sigreturn),
+ SYSCALL_ENTRY(sigsuspend),
+ SYSCALL_ENTRY(sendfile64),
+ SYSCALL_ENTRY(setgid32),
+ SYSCALL_ENTRY(setgroups32),
+ SYSCALL_ENTRY(setregid32),
+ SYSCALL_ENTRY(setresgid32),
+ SYSCALL_ENTRY(setresuid32),
+ SYSCALL_ENTRY(setreuid32),
+ SYSCALL_ENTRY(setuid32),
+ SYSCALL_ENTRY(stat64),
+ SYSCALL_ENTRY(statfs64),
+ SYSCALL_ENTRY(truncate64),
+ SYSCALL_ENTRY(ugetrlimit),
+#endif
+
+ /* X86_64-specific syscalls. */
+#ifdef CONFIG_X86_64
+ SYSCALL_ENTRY(arch_prctl),
+ SYSCALL_ENTRY(modify_ldt),
+ SYSCALL_ENTRY(select),
+ SYSCALL_ENTRY(set_thread_area),
+ SYSCALL_ENTRY(time),
+#endif
+
+}; /* end android_whitelist */
+
+#ifdef CONFIG_COMPAT
+static struct syscall_whitelist_entry android_compat_whitelist[] = {
+ COMPAT_SYSCALL_ENTRY(access),
+ COMPAT_SYSCALL_ENTRY_ALT(adjtimex, android_compat_adjtimex),
+ COMPAT_SYSCALL_ENTRY(brk),
+ COMPAT_SYSCALL_ENTRY(capget),
+ COMPAT_SYSCALL_ENTRY(capset),
+ COMPAT_SYSCALL_ENTRY(chdir),
+ COMPAT_SYSCALL_ENTRY(chmod),
+ COMPAT_SYSCALL_ENTRY_ALT(clock_adjtime, android_compat_clock_adjtime),
+ COMPAT_SYSCALL_ENTRY(clock_getres),
+ COMPAT_SYSCALL_ENTRY(clock_gettime),
+ COMPAT_SYSCALL_ENTRY(clock_nanosleep),
+ COMPAT_SYSCALL_ENTRY(clock_settime),
+ COMPAT_SYSCALL_ENTRY(clone),
+ COMPAT_SYSCALL_ENTRY(close),
+ COMPAT_SYSCALL_ENTRY(creat),
+ COMPAT_SYSCALL_ENTRY(dup),
+ COMPAT_SYSCALL_ENTRY(dup2),
+ COMPAT_SYSCALL_ENTRY(dup3),
+ COMPAT_SYSCALL_ENTRY(epoll_create),
+ COMPAT_SYSCALL_ENTRY(epoll_create1),
+ COMPAT_SYSCALL_ENTRY(epoll_ctl),
+ COMPAT_SYSCALL_ENTRY(epoll_wait),
+ COMPAT_SYSCALL_ENTRY(epoll_pwait),
+ COMPAT_SYSCALL_ENTRY(eventfd),
+ COMPAT_SYSCALL_ENTRY(eventfd2),
+ COMPAT_SYSCALL_ENTRY(execve),
+ COMPAT_SYSCALL_ENTRY(exit),
+ COMPAT_SYSCALL_ENTRY(exit_group),
+ COMPAT_SYSCALL_ENTRY(faccessat),
+ COMPAT_SYSCALL_ENTRY(fallocate),
+ COMPAT_SYSCALL_ENTRY(fchdir),
+ COMPAT_SYSCALL_ENTRY(fchmod),
+ COMPAT_SYSCALL_ENTRY(fchmodat),
+ COMPAT_SYSCALL_ENTRY(fchownat),
+ COMPAT_SYSCALL_ENTRY(fcntl),
+ COMPAT_SYSCALL_ENTRY(fdatasync),
+ COMPAT_SYSCALL_ENTRY(fgetxattr),
+ COMPAT_SYSCALL_ENTRY(flistxattr),
+ COMPAT_SYSCALL_ENTRY(flock),
+ COMPAT_SYSCALL_ENTRY(fork),
+ COMPAT_SYSCALL_ENTRY(fremovexattr),
+ COMPAT_SYSCALL_ENTRY(fsetxattr),
+ COMPAT_SYSCALL_ENTRY(fstat),
+ COMPAT_SYSCALL_ENTRY(fstatfs),
+ COMPAT_SYSCALL_ENTRY(fsync),
+ COMPAT_SYSCALL_ENTRY(ftruncate),
+ COMPAT_SYSCALL_ENTRY(futex),
+ COMPAT_SYSCALL_ENTRY(futimesat),
+ COMPAT_SYSCALL_ENTRY_ALT(getcpu, android_getcpu),
+ COMPAT_SYSCALL_ENTRY(getcwd),
+ COMPAT_SYSCALL_ENTRY(getdents),
+ COMPAT_SYSCALL_ENTRY(getdents64),
+ COMPAT_SYSCALL_ENTRY(getpgid),
+ COMPAT_SYSCALL_ENTRY(getpgrp),
+ COMPAT_SYSCALL_ENTRY(getpid),
+ COMPAT_SYSCALL_ENTRY(getppid),
+ COMPAT_SYSCALL_ENTRY_ALT(getpriority, android_getpriority),
+ COMPAT_SYSCALL_ENTRY(getrandom),
+ COMPAT_SYSCALL_ENTRY(getrusage),
+ COMPAT_SYSCALL_ENTRY(getsid),
+ COMPAT_SYSCALL_ENTRY(gettid),
+ COMPAT_SYSCALL_ENTRY(gettimeofday),
+ COMPAT_SYSCALL_ENTRY(getxattr),
+ COMPAT_SYSCALL_ENTRY(inotify_add_watch),
+ COMPAT_SYSCALL_ENTRY(inotify_init),
+ COMPAT_SYSCALL_ENTRY(inotify_init1),
+ COMPAT_SYSCALL_ENTRY(inotify_rm_watch),
+ COMPAT_SYSCALL_ENTRY(ioctl),
+ COMPAT_SYSCALL_ENTRY(io_destroy),
+ COMPAT_SYSCALL_ENTRY(io_getevents),
+ COMPAT_SYSCALL_ENTRY(io_setup),
+ COMPAT_SYSCALL_ENTRY(io_submit),
+ COMPAT_SYSCALL_ENTRY(ioprio_set),
+ COMPAT_SYSCALL_ENTRY_ALT(keyctl, android_keyctl),
+ COMPAT_SYSCALL_ENTRY(kill),
+ COMPAT_SYSCALL_ENTRY(lgetxattr),
+ COMPAT_SYSCALL_ENTRY(link),
+ COMPAT_SYSCALL_ENTRY(linkat),
+ COMPAT_SYSCALL_ENTRY(listxattr),
+ COMPAT_SYSCALL_ENTRY(llistxattr),
+ COMPAT_SYSCALL_ENTRY(lremovexattr),
+ COMPAT_SYSCALL_ENTRY(lseek),
+ COMPAT_SYSCALL_ENTRY(lsetxattr),
+ COMPAT_SYSCALL_ENTRY(lstat),
+ COMPAT_SYSCALL_ENTRY(madvise),
+ COMPAT_SYSCALL_ENTRY(memfd_create),
+ COMPAT_SYSCALL_ENTRY(mincore),
+ COMPAT_SYSCALL_ENTRY(mkdir),
+ COMPAT_SYSCALL_ENTRY(mkdirat),
+ COMPAT_SYSCALL_ENTRY(mknod),
+ COMPAT_SYSCALL_ENTRY(mknodat),
+ COMPAT_SYSCALL_ENTRY(mlock),
+ COMPAT_SYSCALL_ENTRY(mlockall),
+ COMPAT_SYSCALL_ENTRY(munlock),
+ COMPAT_SYSCALL_ENTRY(munlockall),
+ COMPAT_SYSCALL_ENTRY(mount),
+ COMPAT_SYSCALL_ENTRY(mprotect),
+ COMPAT_SYSCALL_ENTRY(mremap),
+ COMPAT_SYSCALL_ENTRY(msync),
+ COMPAT_SYSCALL_ENTRY(munmap),
+ COMPAT_SYSCALL_ENTRY(name_to_handle_at),
+ COMPAT_SYSCALL_ENTRY(nanosleep),
+ COMPAT_SYSCALL_ENTRY(open),
+ COMPAT_SYSCALL_ENTRY(open_by_handle_at),
+ COMPAT_SYSCALL_ENTRY(openat),
+ COMPAT_SYSCALL_ENTRY_ALT(perf_event_open, android_perf_event_open),
+ COMPAT_SYSCALL_ENTRY(personality),
+ COMPAT_SYSCALL_ENTRY(pipe),
+ COMPAT_SYSCALL_ENTRY(pipe2),
+ COMPAT_SYSCALL_ENTRY(poll),
+ COMPAT_SYSCALL_ENTRY(ppoll),
+ COMPAT_SYSCALL_ENTRY_ALT(prctl, alt_sys_prctl),
+ COMPAT_SYSCALL_ENTRY(pread64),
+ COMPAT_SYSCALL_ENTRY(preadv),
+ COMPAT_SYSCALL_ENTRY(prlimit64),
+ COMPAT_SYSCALL_ENTRY(process_vm_readv),
+ COMPAT_SYSCALL_ENTRY(process_vm_writev),
+ COMPAT_SYSCALL_ENTRY(pselect6),
+ COMPAT_SYSCALL_ENTRY(ptrace),
+ COMPAT_SYSCALL_ENTRY(pwrite64),
+ COMPAT_SYSCALL_ENTRY(pwritev),
+ COMPAT_SYSCALL_ENTRY(read),
+ COMPAT_SYSCALL_ENTRY(readahead),
+ COMPAT_SYSCALL_ENTRY(readv),
+ COMPAT_SYSCALL_ENTRY(readlink),
+ COMPAT_SYSCALL_ENTRY(readlinkat),
+ COMPAT_SYSCALL_ENTRY(recvmmsg),
+ COMPAT_SYSCALL_ENTRY(remap_file_pages),
+ COMPAT_SYSCALL_ENTRY(removexattr),
+ COMPAT_SYSCALL_ENTRY(rename),
+ COMPAT_SYSCALL_ENTRY(renameat),
+ COMPAT_SYSCALL_ENTRY(restart_syscall),
+ COMPAT_SYSCALL_ENTRY(rmdir),
+ COMPAT_SYSCALL_ENTRY(rt_sigaction),
+ COMPAT_SYSCALL_ENTRY(rt_sigpending),
+ COMPAT_SYSCALL_ENTRY(rt_sigprocmask),
+ COMPAT_SYSCALL_ENTRY(rt_sigqueueinfo),
+ COMPAT_SYSCALL_ENTRY(rt_sigreturn),
+ COMPAT_SYSCALL_ENTRY(rt_sigsuspend),
+ COMPAT_SYSCALL_ENTRY(rt_sigtimedwait),
+ COMPAT_SYSCALL_ENTRY(rt_tgsigqueueinfo),
+ COMPAT_SYSCALL_ENTRY(sched_get_priority_max),
+ COMPAT_SYSCALL_ENTRY(sched_get_priority_min),
+ COMPAT_SYSCALL_ENTRY(sched_getaffinity),
+ COMPAT_SYSCALL_ENTRY(sched_getparam),
+ COMPAT_SYSCALL_ENTRY(sched_getscheduler),
+ COMPAT_SYSCALL_ENTRY(sched_setaffinity),
+ COMPAT_SYSCALL_ENTRY_ALT(sched_setparam,
+ android_sched_setparam),
+ COMPAT_SYSCALL_ENTRY_ALT(sched_setscheduler,
+ android_sched_setscheduler),
+ COMPAT_SYSCALL_ENTRY(sched_yield),
+ COMPAT_SYSCALL_ENTRY(seccomp),
+ COMPAT_SYSCALL_ENTRY(sendfile),
+ COMPAT_SYSCALL_ENTRY(sendfile64),
+ COMPAT_SYSCALL_ENTRY(sendmmsg),
+ COMPAT_SYSCALL_ENTRY(setdomainname),
+ COMPAT_SYSCALL_ENTRY(set_robust_list),
+ COMPAT_SYSCALL_ENTRY(set_tid_address),
+ COMPAT_SYSCALL_ENTRY(setitimer),
+ COMPAT_SYSCALL_ENTRY(setns),
+ COMPAT_SYSCALL_ENTRY(setpgid),
+ COMPAT_SYSCALL_ENTRY_ALT(setpriority, android_setpriority),
+ COMPAT_SYSCALL_ENTRY(setrlimit),
+ COMPAT_SYSCALL_ENTRY(setsid),
+ COMPAT_SYSCALL_ENTRY(settimeofday),
+ COMPAT_SYSCALL_ENTRY(setxattr),
+ COMPAT_SYSCALL_ENTRY(signalfd4),
+ COMPAT_SYSCALL_ENTRY(sigaltstack),
+ COMPAT_SYSCALL_ENTRY(splice),
+ COMPAT_SYSCALL_ENTRY(stat),
+ COMPAT_SYSCALL_ENTRY(statfs),
+ COMPAT_SYSCALL_ENTRY(symlink),
+ COMPAT_SYSCALL_ENTRY(symlinkat),
+ COMPAT_SYSCALL_ENTRY(sync),
+ COMPAT_SYSCALL_ENTRY(syncfs),
+ COMPAT_SYSCALL_ENTRY(sysinfo),
+ COMPAT_SYSCALL_ENTRY(syslog),
+ COMPAT_SYSCALL_ENTRY(tgkill),
+ COMPAT_SYSCALL_ENTRY(tee),
+ COMPAT_SYSCALL_ENTRY(tkill),
+ COMPAT_SYSCALL_ENTRY(timer_create),
+ COMPAT_SYSCALL_ENTRY(timer_delete),
+ COMPAT_SYSCALL_ENTRY(timer_gettime),
+ COMPAT_SYSCALL_ENTRY(timer_getoverrun),
+ COMPAT_SYSCALL_ENTRY(timer_settime),
+ COMPAT_SYSCALL_ENTRY(timerfd_create),
+ COMPAT_SYSCALL_ENTRY(timerfd_gettime),
+ COMPAT_SYSCALL_ENTRY(timerfd_settime),
+ COMPAT_SYSCALL_ENTRY(times),
+ COMPAT_SYSCALL_ENTRY(truncate),
+ COMPAT_SYSCALL_ENTRY(umask),
+ COMPAT_SYSCALL_ENTRY(umount2),
+ COMPAT_SYSCALL_ENTRY(uname),
+ COMPAT_SYSCALL_ENTRY(unlink),
+ COMPAT_SYSCALL_ENTRY(unlinkat),
+ COMPAT_SYSCALL_ENTRY(unshare),
+ COMPAT_SYSCALL_ENTRY(ustat),
+ COMPAT_SYSCALL_ENTRY(utimensat),
+ COMPAT_SYSCALL_ENTRY(utimes),
+ COMPAT_SYSCALL_ENTRY(vfork),
+ COMPAT_SYSCALL_ENTRY(vmsplice),
+ COMPAT_SYSCALL_ENTRY(wait4),
+ COMPAT_SYSCALL_ENTRY(waitid),
+ COMPAT_SYSCALL_ENTRY(write),
+ COMPAT_SYSCALL_ENTRY(writev),
+ COMPAT_SYSCALL_ENTRY(chown32),
+ COMPAT_SYSCALL_ENTRY(fchown32),
+ COMPAT_SYSCALL_ENTRY(fcntl64),
+ COMPAT_SYSCALL_ENTRY(fstat64),
+ COMPAT_SYSCALL_ENTRY(fstatat64),
+ COMPAT_SYSCALL_ENTRY(fstatfs64),
+ COMPAT_SYSCALL_ENTRY(ftruncate64),
+ COMPAT_SYSCALL_ENTRY(getegid),
+ COMPAT_SYSCALL_ENTRY(getegid32),
+ COMPAT_SYSCALL_ENTRY(geteuid),
+ COMPAT_SYSCALL_ENTRY(geteuid32),
+ COMPAT_SYSCALL_ENTRY(getgid),
+ COMPAT_SYSCALL_ENTRY(getgid32),
+ COMPAT_SYSCALL_ENTRY(getgroups32),
+ COMPAT_SYSCALL_ENTRY(getresgid32),
+ COMPAT_SYSCALL_ENTRY(getresuid32),
+ COMPAT_SYSCALL_ENTRY(getuid),
+ COMPAT_SYSCALL_ENTRY(getuid32),
+ COMPAT_SYSCALL_ENTRY(lchown32),
+ COMPAT_SYSCALL_ENTRY(lstat64),
+ COMPAT_SYSCALL_ENTRY(mmap2),
+ COMPAT_SYSCALL_ENTRY(_newselect),
+ COMPAT_SYSCALL_ENTRY(_llseek),
+ COMPAT_SYSCALL_ENTRY(sigaction),
+ COMPAT_SYSCALL_ENTRY(sigpending),
+ COMPAT_SYSCALL_ENTRY(sigprocmask),
+ COMPAT_SYSCALL_ENTRY(sigreturn),
+ COMPAT_SYSCALL_ENTRY(sigsuspend),
+ COMPAT_SYSCALL_ENTRY(setgid32),
+ COMPAT_SYSCALL_ENTRY(setgroups32),
+ COMPAT_SYSCALL_ENTRY(setregid32),
+ COMPAT_SYSCALL_ENTRY(setresgid32),
+ COMPAT_SYSCALL_ENTRY(setresuid32),
+ COMPAT_SYSCALL_ENTRY(setreuid32),
+ COMPAT_SYSCALL_ENTRY(setuid32),
+ COMPAT_SYSCALL_ENTRY(stat64),
+ COMPAT_SYSCALL_ENTRY(statfs64),
+ COMPAT_SYSCALL_ENTRY(truncate64),
+ COMPAT_SYSCALL_ENTRY(ugetrlimit),
+
+#ifdef CONFIG_X86_64
+ /*
+ * waitpid(2) is deprecated on most architectures, but still exists
+ * on IA32.
+ */
+ COMPAT_SYSCALL_ENTRY(waitpid),
+
+ /* IA32 uses the common socketcall(2) entrypoint for socket calls. */
+ COMPAT_SYSCALL_ENTRY(socketcall),
+#endif
+
+#ifdef CONFIG_ARM64
+ COMPAT_SYSCALL_ENTRY(accept),
+ COMPAT_SYSCALL_ENTRY(accept4),
+ COMPAT_SYSCALL_ENTRY(bind),
+ COMPAT_SYSCALL_ENTRY(connect),
+ COMPAT_SYSCALL_ENTRY(getpeername),
+ COMPAT_SYSCALL_ENTRY(getsockname),
+ COMPAT_SYSCALL_ENTRY(getsockopt),
+ COMPAT_SYSCALL_ENTRY(listen),
+ COMPAT_SYSCALL_ENTRY(recvfrom),
+ COMPAT_SYSCALL_ENTRY(recvmsg),
+ COMPAT_SYSCALL_ENTRY(sendmsg),
+ COMPAT_SYSCALL_ENTRY(sendto),
+ COMPAT_SYSCALL_ENTRY(setsockopt),
+ COMPAT_SYSCALL_ENTRY(shutdown),
+ COMPAT_SYSCALL_ENTRY(socket),
+ COMPAT_SYSCALL_ENTRY(socketpair),
+ COMPAT_SYSCALL_ENTRY(recv),
+ COMPAT_SYSCALL_ENTRY(send),
+#endif
+
+ /*
+ * posix_fadvise(2) and sync_file_range(2) have ARM-specific wrappers
+ * to deal with register alignment.
+ */
+#ifdef CONFIG_ARM64
+ COMPAT_SYSCALL_ENTRY(arm_fadvise64_64),
+ COMPAT_SYSCALL_ENTRY(sync_file_range2),
+#else
+ COMPAT_SYSCALL_ENTRY(fadvise64_64),
+ COMPAT_SYSCALL_ENTRY(fadvise64),
+ COMPAT_SYSCALL_ENTRY(sync_file_range),
+#endif
+
+ /*
+	 * getrlimit(2) and time(2) are deprecated and not wired up in the ARM
+ * compat table on ARM64.
+ */
+#ifndef CONFIG_ARM64
+ COMPAT_SYSCALL_ENTRY(getrlimit),
+ COMPAT_SYSCALL_ENTRY(time),
+#endif
+
+ /* x86-specific syscalls. */
+#ifdef CONFIG_X86_64
+ COMPAT_SYSCALL_ENTRY(modify_ldt),
+ COMPAT_SYSCALL_ENTRY(set_thread_area),
+#endif
+}; /* end android_compat_whitelist */
+#endif /* CONFIG_COMPAT */
+
+#endif /* ANDROID_WHITELISTS_H */
diff --git a/security/chromiumos/complete_whitelists.h b/security/chromiumos/complete_whitelists.h
new file mode 100644
index 0000000..cb3a8f3
--- /dev/null
+++ b/security/chromiumos/complete_whitelists.h
@@ -0,0 +1,401 @@
+/*
+ * Linux Security Module for Chromium OS
+ *
+ * Copyright 2018 Google LLC. All Rights Reserved
+ *
+ * Authors:
+ * Micah Morton <mortonm@chromium.org>
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef COMPLETE_WHITELISTS_H
+#define COMPLETE_WHITELISTS_H
+
+/*
+ * NOTE: the purpose of this header is only to pull out the definition of this
+ * array from alt-syscall.c for the purposes of readability. It should not be
+ * included in other .c files.
+ */
+
+#include "alt-syscall.h"
+
+static struct syscall_whitelist_entry complete_whitelist[] = {
+ /* Syscalls wired up on ARM32/ARM64 and x86_64. */
+ SYSCALL_ENTRY(accept),
+ SYSCALL_ENTRY(accept4),
+ SYSCALL_ENTRY(acct),
+ SYSCALL_ENTRY(add_key),
+ SYSCALL_ENTRY(adjtimex),
+ SYSCALL_ENTRY(bind),
+ SYSCALL_ENTRY(brk),
+ SYSCALL_ENTRY(capget),
+ SYSCALL_ENTRY(capset),
+ SYSCALL_ENTRY(chdir),
+ SYSCALL_ENTRY(chroot),
+ SYSCALL_ENTRY(clock_adjtime),
+ SYSCALL_ENTRY(clock_getres),
+ SYSCALL_ENTRY(clock_gettime),
+ SYSCALL_ENTRY(clock_nanosleep),
+ SYSCALL_ENTRY(clock_settime),
+ SYSCALL_ENTRY(clone),
+ SYSCALL_ENTRY(close),
+ SYSCALL_ENTRY(connect),
+ SYSCALL_ENTRY(copy_file_range),
+ SYSCALL_ENTRY(delete_module),
+ SYSCALL_ENTRY(dup),
+ SYSCALL_ENTRY(dup3),
+ SYSCALL_ENTRY(epoll_create1),
+ SYSCALL_ENTRY(epoll_ctl),
+ SYSCALL_ENTRY(epoll_pwait),
+ SYSCALL_ENTRY(eventfd2),
+ SYSCALL_ENTRY(execve),
+ SYSCALL_ENTRY(exit),
+ SYSCALL_ENTRY(exit_group),
+ SYSCALL_ENTRY(faccessat),
+ SYSCALL_ENTRY(fallocate),
+ SYSCALL_ENTRY(fanotify_init),
+ SYSCALL_ENTRY(fanotify_mark),
+ SYSCALL_ENTRY(fchdir),
+ SYSCALL_ENTRY(fchmod),
+ SYSCALL_ENTRY(fchmodat),
+ SYSCALL_ENTRY(fchown),
+ SYSCALL_ENTRY(fchownat),
+ SYSCALL_ENTRY(fcntl),
+ SYSCALL_ENTRY(fdatasync),
+ SYSCALL_ENTRY(fgetxattr),
+ SYSCALL_ENTRY(finit_module),
+ SYSCALL_ENTRY(flistxattr),
+ SYSCALL_ENTRY(flock),
+ SYSCALL_ENTRY(fremovexattr),
+ SYSCALL_ENTRY(fsetxattr),
+ SYSCALL_ENTRY(fstatfs),
+ SYSCALL_ENTRY(fsync),
+ SYSCALL_ENTRY(ftruncate),
+ SYSCALL_ENTRY(futex),
+ SYSCALL_ENTRY(getcpu),
+ SYSCALL_ENTRY(getcwd),
+ SYSCALL_ENTRY(getdents64),
+ SYSCALL_ENTRY(getegid),
+ SYSCALL_ENTRY(geteuid),
+ SYSCALL_ENTRY(getgid),
+ SYSCALL_ENTRY(getgroups),
+ SYSCALL_ENTRY(getitimer),
+ SYSCALL_ENTRY(get_mempolicy),
+ SYSCALL_ENTRY(getpeername),
+ SYSCALL_ENTRY(getpgid),
+ SYSCALL_ENTRY(getpid),
+ SYSCALL_ENTRY(getppid),
+ SYSCALL_ENTRY(getpriority),
+ SYSCALL_ENTRY(getrandom),
+ SYSCALL_ENTRY(getresgid),
+ SYSCALL_ENTRY(getresuid),
+ SYSCALL_ENTRY(get_robust_list),
+ SYSCALL_ENTRY(getrusage),
+ SYSCALL_ENTRY(getsid),
+ SYSCALL_ENTRY(getsockname),
+ SYSCALL_ENTRY(getsockopt),
+ SYSCALL_ENTRY(gettid),
+ SYSCALL_ENTRY(gettimeofday),
+ SYSCALL_ENTRY(getuid),
+ SYSCALL_ENTRY(getxattr),
+ SYSCALL_ENTRY(init_module),
+ SYSCALL_ENTRY(inotify_add_watch),
+ SYSCALL_ENTRY(inotify_init1),
+ SYSCALL_ENTRY(inotify_rm_watch),
+ SYSCALL_ENTRY(io_cancel),
+ SYSCALL_ENTRY(ioctl),
+ SYSCALL_ENTRY(io_destroy),
+ SYSCALL_ENTRY(io_getevents),
+ SYSCALL_ENTRY(ioprio_get),
+ SYSCALL_ENTRY(ioprio_set),
+ SYSCALL_ENTRY(io_setup),
+ SYSCALL_ENTRY(io_submit),
+ SYSCALL_ENTRY(kcmp),
+ SYSCALL_ENTRY(kexec_load),
+ SYSCALL_ENTRY(keyctl),
+ SYSCALL_ENTRY(kill),
+ SYSCALL_ENTRY(lgetxattr),
+ SYSCALL_ENTRY(linkat),
+ SYSCALL_ENTRY(listen),
+ SYSCALL_ENTRY(listxattr),
+ SYSCALL_ENTRY(llistxattr),
+ SYSCALL_ENTRY(lookup_dcookie),
+ SYSCALL_ENTRY(lremovexattr),
+ SYSCALL_ENTRY(lseek),
+ SYSCALL_ENTRY(lsetxattr),
+ SYSCALL_ENTRY(madvise),
+ SYSCALL_ENTRY(mbind),
+ SYSCALL_ENTRY(memfd_create),
+ SYSCALL_ENTRY(mincore),
+ SYSCALL_ENTRY(mkdirat),
+ SYSCALL_ENTRY(mknodat),
+ SYSCALL_ENTRY(mlock),
+ SYSCALL_ENTRY(mlockall),
+ SYSCALL_ENTRY(mount),
+ SYSCALL_ENTRY(move_pages),
+ SYSCALL_ENTRY(mprotect),
+ SYSCALL_ENTRY(mq_getsetattr),
+ SYSCALL_ENTRY(mq_notify),
+ SYSCALL_ENTRY(mq_open),
+ SYSCALL_ENTRY(mq_timedreceive),
+ SYSCALL_ENTRY(mq_timedsend),
+ SYSCALL_ENTRY(mq_unlink),
+ SYSCALL_ENTRY(mremap),
+ SYSCALL_ENTRY(msgctl),
+ SYSCALL_ENTRY(msgget),
+ SYSCALL_ENTRY(msgrcv),
+ SYSCALL_ENTRY(msgsnd),
+ SYSCALL_ENTRY(msync),
+ SYSCALL_ENTRY(munlock),
+ SYSCALL_ENTRY(munlockall),
+ SYSCALL_ENTRY(munmap),
+ SYSCALL_ENTRY(name_to_handle_at),
+ SYSCALL_ENTRY(nanosleep),
+ SYSCALL_ENTRY(openat),
+ SYSCALL_ENTRY(open_by_handle_at),
+ SYSCALL_ENTRY(perf_event_open),
+ SYSCALL_ENTRY(personality),
+ SYSCALL_ENTRY(pipe2),
+ SYSCALL_ENTRY(pivot_root),
+ SYSCALL_ENTRY(pkey_alloc),
+ SYSCALL_ENTRY(pkey_free),
+ SYSCALL_ENTRY(pkey_mprotect),
+ SYSCALL_ENTRY(ppoll),
+ SYSCALL_ENTRY_ALT(prctl, alt_sys_prctl),
+ SYSCALL_ENTRY(pread64),
+ SYSCALL_ENTRY(preadv),
+ SYSCALL_ENTRY(preadv2),
+ SYSCALL_ENTRY(pwritev2),
+ SYSCALL_ENTRY(prlimit64),
+ SYSCALL_ENTRY(process_vm_readv),
+ SYSCALL_ENTRY(process_vm_writev),
+ SYSCALL_ENTRY(pselect6),
+ SYSCALL_ENTRY(ptrace),
+ SYSCALL_ENTRY(pwrite64),
+ SYSCALL_ENTRY(pwritev),
+ SYSCALL_ENTRY(quotactl),
+ SYSCALL_ENTRY(read),
+ SYSCALL_ENTRY(readahead),
+ SYSCALL_ENTRY(readlinkat),
+ SYSCALL_ENTRY(readv),
+ SYSCALL_ENTRY(reboot),
+ SYSCALL_ENTRY(recvfrom),
+ SYSCALL_ENTRY(recvmmsg),
+ SYSCALL_ENTRY(recvmsg),
+ SYSCALL_ENTRY(remap_file_pages),
+ SYSCALL_ENTRY(removexattr),
+ SYSCALL_ENTRY(renameat),
+ SYSCALL_ENTRY(request_key),
+ SYSCALL_ENTRY(restart_syscall),
+ SYSCALL_ENTRY(rt_sigaction),
+ SYSCALL_ENTRY(rt_sigpending),
+ SYSCALL_ENTRY(rt_sigprocmask),
+ SYSCALL_ENTRY(rt_sigqueueinfo),
+ SYSCALL_ENTRY(rt_sigsuspend),
+ SYSCALL_ENTRY(rt_sigtimedwait),
+ SYSCALL_ENTRY(rt_tgsigqueueinfo),
+ SYSCALL_ENTRY(sched_getaffinity),
+ SYSCALL_ENTRY(sched_getattr),
+ SYSCALL_ENTRY(sched_getparam),
+ SYSCALL_ENTRY(sched_get_priority_max),
+ SYSCALL_ENTRY(sched_get_priority_min),
+ SYSCALL_ENTRY(sched_getscheduler),
+ SYSCALL_ENTRY(sched_rr_get_interval),
+ SYSCALL_ENTRY(sched_setaffinity),
+ SYSCALL_ENTRY(sched_setattr),
+ SYSCALL_ENTRY(sched_setparam),
+ SYSCALL_ENTRY(sched_setscheduler),
+ SYSCALL_ENTRY(sched_yield),
+ SYSCALL_ENTRY(seccomp),
+ SYSCALL_ENTRY(semctl),
+ SYSCALL_ENTRY(semget),
+ SYSCALL_ENTRY(semop),
+ SYSCALL_ENTRY(semtimedop),
+ SYSCALL_ENTRY(sendfile),
+ SYSCALL_ENTRY(sendmmsg),
+ SYSCALL_ENTRY(sendmsg),
+ SYSCALL_ENTRY(sendto),
+ SYSCALL_ENTRY(setdomainname),
+ SYSCALL_ENTRY(setfsgid),
+ SYSCALL_ENTRY(setfsuid),
+ SYSCALL_ENTRY(setgid),
+ SYSCALL_ENTRY(setgroups),
+ SYSCALL_ENTRY(sethostname),
+ SYSCALL_ENTRY(setitimer),
+ SYSCALL_ENTRY(set_mempolicy),
+ SYSCALL_ENTRY(setns),
+ SYSCALL_ENTRY(setpgid),
+ SYSCALL_ENTRY(setpriority),
+ SYSCALL_ENTRY(setregid),
+ SYSCALL_ENTRY(setresgid),
+ SYSCALL_ENTRY(setresuid),
+ SYSCALL_ENTRY(setreuid),
+ SYSCALL_ENTRY(setrlimit),
+ SYSCALL_ENTRY(set_robust_list),
+ SYSCALL_ENTRY(setsid),
+ SYSCALL_ENTRY(setsockopt),
+ SYSCALL_ENTRY(set_tid_address),
+ SYSCALL_ENTRY(settimeofday),
+ SYSCALL_ENTRY(setuid),
+ SYSCALL_ENTRY(setxattr),
+ SYSCALL_ENTRY(shmat),
+ SYSCALL_ENTRY(shmctl),
+ SYSCALL_ENTRY(shmdt),
+ SYSCALL_ENTRY(shmget),
+ SYSCALL_ENTRY(shutdown),
+ SYSCALL_ENTRY(sigaltstack),
+ SYSCALL_ENTRY(signalfd4),
+ SYSCALL_ENTRY(socket),
+ SYSCALL_ENTRY(socketpair),
+ SYSCALL_ENTRY(splice),
+ SYSCALL_ENTRY(statfs),
+ SYSCALL_ENTRY(statx),
+ SYSCALL_ENTRY(swapoff),
+ SYSCALL_ENTRY(swapon),
+ SYSCALL_ENTRY(symlinkat),
+ SYSCALL_ENTRY(sync),
+ SYSCALL_ENTRY(syncfs),
+ SYSCALL_ENTRY(sysinfo),
+ SYSCALL_ENTRY(syslog),
+ SYSCALL_ENTRY(tee),
+ SYSCALL_ENTRY(tgkill),
+ SYSCALL_ENTRY(timer_create),
+ SYSCALL_ENTRY(timer_delete),
+ SYSCALL_ENTRY(timerfd_create),
+ SYSCALL_ENTRY(timerfd_gettime),
+ SYSCALL_ENTRY(timerfd_settime),
+ SYSCALL_ENTRY(timer_getoverrun),
+ SYSCALL_ENTRY(timer_gettime),
+ SYSCALL_ENTRY(timer_settime),
+ SYSCALL_ENTRY(times),
+ SYSCALL_ENTRY(tkill),
+ SYSCALL_ENTRY(truncate),
+ SYSCALL_ENTRY(umask),
+ SYSCALL_ENTRY(unlinkat),
+ SYSCALL_ENTRY(unshare),
+ SYSCALL_ENTRY(utimensat),
+ SYSCALL_ENTRY(vhangup),
+ SYSCALL_ENTRY(vmsplice),
+ SYSCALL_ENTRY(wait4),
+ SYSCALL_ENTRY(waitid),
+ SYSCALL_ENTRY(write),
+ SYSCALL_ENTRY(writev),
+
+ /* Exist for x86_64 and ARM32 but not ARM64. */
+#ifndef CONFIG_ARM64
+ SYSCALL_ENTRY(access),
+ SYSCALL_ENTRY(chmod),
+ SYSCALL_ENTRY(chown),
+ SYSCALL_ENTRY(creat),
+ SYSCALL_ENTRY(dup2),
+ SYSCALL_ENTRY(epoll_create),
+ SYSCALL_ENTRY(epoll_wait),
+ SYSCALL_ENTRY(eventfd),
+ SYSCALL_ENTRY(fork),
+ SYSCALL_ENTRY(futimesat),
+ SYSCALL_ENTRY(getdents),
+ SYSCALL_ENTRY(getpgrp),
+ SYSCALL_ENTRY(inotify_init),
+ SYSCALL_ENTRY(lchown),
+ SYSCALL_ENTRY(link),
+ SYSCALL_ENTRY(mkdir),
+ SYSCALL_ENTRY(mknod),
+ SYSCALL_ENTRY(open),
+ SYSCALL_ENTRY(pause),
+ SYSCALL_ENTRY(pipe),
+ SYSCALL_ENTRY(poll),
+ SYSCALL_ENTRY(readlink),
+ SYSCALL_ENTRY(rename),
+ SYSCALL_ENTRY(rmdir),
+ SYSCALL_ENTRY(signalfd),
+ SYSCALL_ENTRY(symlink),
+ SYSCALL_ENTRY(sysfs),
+ SYSCALL_ENTRY(unlink),
+ SYSCALL_ENTRY(ustat),
+ SYSCALL_ENTRY(utimes),
+ SYSCALL_ENTRY(vfork),
+#endif
+
+	/* Exist for x86_64 and ARM64 but not ARM32. */
+#if defined(CONFIG_ARM64) || defined(CONFIG_X86_64)
+ SYSCALL_ENTRY(fadvise64),
+ SYSCALL_ENTRY(fstat),
+ SYSCALL_ENTRY(getrlimit),
+ SYSCALL_ENTRY(migrate_pages),
+ SYSCALL_ENTRY(mmap),
+ SYSCALL_ENTRY(rt_sigreturn),
+ SYSCALL_ENTRY(sync_file_range),
+ SYSCALL_ENTRY(umount2),
+ SYSCALL_ENTRY(uname),
+#endif
+
+ /* Unique to ARM32. */
+#ifdef CONFIG_ARM
+ SYSCALL_ENTRY(arm_fadvise64_64),
+ SYSCALL_ENTRY(bdflush),
+ SYSCALL_ENTRY(fcntl64),
+ SYSCALL_ENTRY(fstat64),
+ SYSCALL_ENTRY(fstatat64),
+ SYSCALL_ENTRY(ftruncate64),
+ SYSCALL_ENTRY(lstat64),
+ SYSCALL_ENTRY(mmap2),
+ SYSCALL_ENTRY(nice),
+ SYSCALL_ENTRY(pciconfig_iobase),
+ SYSCALL_ENTRY(pciconfig_read),
+ SYSCALL_ENTRY(pciconfig_write),
+ SYSCALL_ENTRY(recv),
+ SYSCALL_ENTRY(send),
+ SYSCALL_ENTRY(sendfile64),
+ SYSCALL_ENTRY(sigaction),
+ SYSCALL_ENTRY(sigpending),
+ SYSCALL_ENTRY(sigprocmask),
+ SYSCALL_ENTRY(sigsuspend),
+ SYSCALL_ENTRY(stat64),
+ SYSCALL_ENTRY(truncate64),
+ SYSCALL_ENTRY(uselib),
+#endif
+
+ /* Unique to x86_64. */
+#ifdef CONFIG_X86_64
+ SYSCALL_ENTRY(alarm),
+ SYSCALL_ENTRY(arch_prctl),
+ SYSCALL_ENTRY(ioperm),
+ SYSCALL_ENTRY(iopl),
+ SYSCALL_ENTRY(kexec_file_load),
+ SYSCALL_ENTRY(lstat),
+ SYSCALL_ENTRY(modify_ldt),
+ SYSCALL_ENTRY(newfstatat),
+ SYSCALL_ENTRY(select),
+ SYSCALL_ENTRY(stat),
+ SYSCALL_ENTRY(time),
+ SYSCALL_ENTRY(_sysctl),
+ SYSCALL_ENTRY(utime),
+#endif
+
+ /* Unique to ARM64. */
+#ifdef CONFIG_ARM64
+ SYSCALL_ENTRY(nfsservctl),
+ SYSCALL_ENTRY(renameat2),
+#endif
+}; /* end complete_whitelist */
+
+#ifdef CONFIG_COMPAT
+/*
+ * For now we do not provide a 32-bit-compatible version of the complete
+ * whitelist. Since no compat syscalls are whitelisted here, a call into the
+ * compat section of this "complete" alt-syscall table will be redirected to
+ * block_syscall() (unless permissive mode is used, in which case the call
+ * will be redirected to warn_compat_syscall()).
+ */
+static struct syscall_whitelist_entry complete_compat_whitelist[] = {};
+#endif /* CONFIG_COMPAT */
+
+#endif /* COMPLETE_WHITELISTS_H */
diff --git a/security/chromiumos/inode_mark.c b/security/chromiumos/inode_mark.c
new file mode 100644
index 0000000..009debb
--- /dev/null
+++ b/security/chromiumos/inode_mark.c
@@ -0,0 +1,354 @@
+/*
+ * Linux Security Module for Chromium OS
+ *
+ * Copyright 2016 Google Inc. All Rights Reserved
+ *
+ * Authors:
+ * Mattias Nissler <mnissler@chromium.org>
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/atomic.h>
+#include <linux/compiler.h>
+#include <linux/dcache.h>
+#include <linux/fs.h>
+#include <linux/fsnotify_backend.h>
+#include <linux/hash.h>
+#include <linux/mutex.h>
+#include <linux/rculist.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+
+#include "inode_mark.h"
+
+/*
+ * This file implements facilities to pin inodes in core and attach some
+ * metadata to them. We use fsnotify inode marks as a vehicle to attach the
+ * metadata.
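+ * Each mark holds an igrab() reference on its inode, keeping the inode in
+ * core until the mark itself is torn down and iput() releases it.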
+ */
+struct chromiumos_inode_mark {
+ struct fsnotify_mark mark;
+ struct inode *inode;
+ enum chromiumos_inode_security_policy
+ policies[CHROMIUMOS_NUMBER_OF_POLICIES];
+};
+
+static inline struct chromiumos_inode_mark *
+chromiumos_to_inode_mark(struct fsnotify_mark *mark)
+{
+ return container_of(mark, struct chromiumos_inode_mark, mark);
+}
+
+/*
+ * Hashtable entry that contains tracking information specific to the file
+ * system identified by the corresponding super_block. This contains the
+ * fsnotify group that holds all the marks for inodes belonging to the
+ * super_block.
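+ * The refcnt holds one reference per inode mark on the super_block plus one
+ * for each in-flight lookup; once it drops to zero the entry is unhashed and
+ * its fsnotify group is destroyed.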
+ */
+struct chromiumos_super_block_mark {
+ atomic_t refcnt;
+ struct hlist_node node;
+ struct super_block *sb;
+ struct fsnotify_group *fsn_group;
+};
+
+#define CHROMIUMOS_SUPER_BLOCK_HASH_BITS 8
+#define CHROMIUMOS_SUPER_BLOCK_HASH_SIZE (1 << CHROMIUMOS_SUPER_BLOCK_HASH_BITS)
+
+static struct hlist_head chromiumos_super_block_hash_table
+ [CHROMIUMOS_SUPER_BLOCK_HASH_SIZE] __read_mostly;
+static DEFINE_MUTEX(chromiumos_super_block_hash_lock);
+
+static struct hlist_head *chromiumos_super_block_hlist(struct super_block *sb)
+{
+ return &chromiumos_super_block_hash_table[hash_ptr(
+ sb, CHROMIUMOS_SUPER_BLOCK_HASH_BITS)];
+}
+
+static void chromiumos_super_block_put(struct chromiumos_super_block_mark *sbm)
+{
+ if (atomic_dec_and_test(&sbm->refcnt)) {
+ mutex_lock(&chromiumos_super_block_hash_lock);
+ hlist_del_rcu(&sbm->node);
+ mutex_unlock(&chromiumos_super_block_hash_lock);
+
+ synchronize_rcu();
+
+ fsnotify_destroy_group(sbm->fsn_group);
+ kfree(sbm);
+ }
+}
+
+static struct chromiumos_super_block_mark *
+chromiumos_super_block_lookup(struct super_block *sb)
+{
+ struct hlist_head *hlist = chromiumos_super_block_hlist(sb);
+ struct chromiumos_super_block_mark *sbm;
+ struct chromiumos_super_block_mark *matching_sbm = NULL;
+
+ rcu_read_lock();
+ hlist_for_each_entry_rcu(sbm, hlist, node) {
+ if (sbm->sb == sb && atomic_inc_not_zero(&sbm->refcnt)) {
+ matching_sbm = sbm;
+ break;
+ }
+ }
+ rcu_read_unlock();
+
+ return matching_sbm;
+}
+
+static int chromiumos_handle_fsnotify_event(struct fsnotify_group *group,
+ struct inode *inode,
+ u32 mask, const void *data,
+ int data_type,
+ const unsigned char *file_name,
+ u32 cookie,
+ struct fsnotify_iter_info *iter_info)
+{
+ /*
+ * This should never get called because a zero mask is set on the inode
+ * marks. All cases of marks going away (inode deletion, unmount,
+ * explicit removal) are handled in chromiumos_freeing_mark.
+ */
+ WARN_ON_ONCE(1);
+ return 0;
+}
+
+static void chromiumos_freeing_mark(struct fsnotify_mark *mark,
+ struct fsnotify_group *group)
+{
+ struct chromiumos_inode_mark *inode_mark =
+ chromiumos_to_inode_mark(mark);
+
+ iput(inode_mark->inode);
+ inode_mark->inode = NULL;
+ chromiumos_super_block_put(group->private);
+}
+
+static void chromiumos_free_mark(struct fsnotify_mark *mark)
+{
+ iput(chromiumos_to_inode_mark(mark)->inode);
+ kfree(mark);
+}
+
+static const struct fsnotify_ops chromiumos_fsn_ops = {
+ .handle_event = chromiumos_handle_fsnotify_event,
+ .freeing_mark = chromiumos_freeing_mark,
+ .free_mark = chromiumos_free_mark,
+};
+
+static struct chromiumos_super_block_mark *
+chromiumos_super_block_create(struct super_block *sb)
+{
+ struct hlist_head *hlist = chromiumos_super_block_hlist(sb);
+ struct chromiumos_super_block_mark *sbm = NULL;
+
+ WARN_ON(!mutex_is_locked(&chromiumos_super_block_hash_lock));
+
+ /* No match found, create a new entry. */
+ sbm = kzalloc(sizeof(*sbm), GFP_KERNEL);
+ if (!sbm)
+ return ERR_PTR(-ENOMEM);
+
+ atomic_set(&sbm->refcnt, 1);
+ sbm->sb = sb;
+ sbm->fsn_group = fsnotify_alloc_group(&chromiumos_fsn_ops);
+ if (IS_ERR(sbm->fsn_group)) {
+ int ret = PTR_ERR(sbm->fsn_group);
+
+ kfree(sbm);
+ return ERR_PTR(ret);
+ }
+ sbm->fsn_group->private = sbm;
+ hlist_add_head_rcu(&sbm->node, hlist);
+
+ return sbm;
+}
+
+static struct chromiumos_super_block_mark *
+chromiumos_super_block_get(struct super_block *sb)
+{
+ struct chromiumos_super_block_mark *sbm;
+
+ mutex_lock(&chromiumos_super_block_hash_lock);
+ sbm = chromiumos_super_block_lookup(sb);
+ if (!sbm)
+ sbm = chromiumos_super_block_create(sb);
+
+ mutex_unlock(&chromiumos_super_block_hash_lock);
+ return sbm;
+}
+
+/*
+ * This will only ever get called if the metadata does not already exist for
+ * an inode, so no need to worry about freeing an existing mark.
+ */
+static int
+chromiumos_inode_mark_create(
+ struct chromiumos_super_block_mark *sbm,
+ struct inode *inode,
+ enum chromiumos_inode_security_policy_type type,
+ enum chromiumos_inode_security_policy policy)
+{
+ struct chromiumos_inode_mark *inode_mark;
+ int ret;
+ size_t i;
+
+ WARN_ON(!mutex_is_locked(&sbm->fsn_group->mark_mutex));
+
+ inode_mark = kzalloc(sizeof(*inode_mark), GFP_KERNEL);
+ if (!inode_mark)
+ return -ENOMEM;
+
+ fsnotify_init_mark(&inode_mark->mark, sbm->fsn_group);
+ inode_mark->inode = igrab(inode);
+ if (!inode_mark->inode) {
+ ret = -ENOENT;
+ goto out;
+ }
+
+ /* Initialize all policies to inherit. */
+ for (i = 0; i < CHROMIUMOS_NUMBER_OF_POLICIES; i++)
+ inode_mark->policies[i] = CHROMIUMOS_INODE_POLICY_INHERIT;
+
+ inode_mark->policies[type] = policy;
+ ret = fsnotify_add_mark_locked(&inode_mark->mark, &inode->i_fsnotify_marks,
+ type, false);
+ if (ret)
+ goto out;
+
+ /* Take an sbm reference so the created mark is accounted for. */
+ atomic_inc(&sbm->refcnt);
+
+out:
+ fsnotify_put_mark(&inode_mark->mark);
+ return ret;
+}
+
+int chromiumos_update_inode_security_policy(
+ struct inode *inode,
+ enum chromiumos_inode_security_policy_type type,
+ enum chromiumos_inode_security_policy policy)
+{
+ struct chromiumos_super_block_mark *sbm;
+ struct fsnotify_mark *mark;
+ bool free_mark = false;
+ int ret;
+ size_t i;
+
+ sbm = chromiumos_super_block_get(inode->i_sb);
+ if (IS_ERR(sbm))
+ return PTR_ERR(sbm);
+
+ mutex_lock(&sbm->fsn_group->mark_mutex);
+
+ mark = fsnotify_find_mark(&inode->i_fsnotify_marks, sbm->fsn_group);
+ if (mark) {
+ WRITE_ONCE(chromiumos_to_inode_mark(mark)->policies[type],
+ policy);
+ /*
+		 * Free the mark if all policies are
+		 * CHROMIUMOS_INODE_POLICY_INHERIT.
+ */
+ free_mark = true;
+ for (i = 0; i < CHROMIUMOS_NUMBER_OF_POLICIES; i++) {
+ if (chromiumos_to_inode_mark(mark)->policies[i]
+ != CHROMIUMOS_INODE_POLICY_INHERIT) {
+ free_mark = false;
+ break;
+ }
+ }
+ if (free_mark)
+ fsnotify_detach_mark(mark);
+ ret = 0;
+ } else {
+ ret = chromiumos_inode_mark_create(sbm, inode, type, policy);
+ }
+
+ mutex_unlock(&sbm->fsn_group->mark_mutex);
+ chromiumos_super_block_put(sbm);
+
+ /* This must happen after dropping the mark mutex. */
+ if (free_mark)
+ fsnotify_free_mark(mark);
+ if (mark)
+ fsnotify_put_mark(mark);
+
+ return ret;
+}
+
+/* Flushes all inode security policies. */
+int chromiumos_flush_inode_security_policies(struct super_block *sb)
+{
+ struct chromiumos_super_block_mark *sbm;
+
+ sbm = chromiumos_super_block_lookup(sb);
+ if (sbm) {
+ fsnotify_clear_marks_by_group(sbm->fsn_group,
+ FSNOTIFY_OBJ_ALL_TYPES_MASK);
+ chromiumos_super_block_put(sbm);
+ }
+
+ return 0;
+}
+
+enum chromiumos_inode_security_policy chromiumos_get_inode_security_policy(
+ struct dentry *dentry, struct inode *inode,
+ enum chromiumos_inode_security_policy_type type)
+{
+ struct chromiumos_super_block_mark *sbm;
+ /*
+	 * Initializes policy to CHROMIUMOS_INODE_POLICY_INHERIT, which is
+	 * the value that will be returned if neither |dentry| nor any
+	 * directory in its path has been assigned an inode security policy
+ * value for the given type.
+ */
+ enum chromiumos_inode_security_policy policy =
+ CHROMIUMOS_INODE_POLICY_INHERIT;
+
+ if (!dentry || !inode || type >= CHROMIUMOS_NUMBER_OF_POLICIES)
+ return policy;
+
+ sbm = chromiumos_super_block_lookup(inode->i_sb);
+ if (!sbm)
+ return policy;
+
+ /* Walk the dentry path and look for a traversal policy. */
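+	/*
+	 * A mark whose policy is CHROMIUMOS_INODE_POLICY_INHERIT (or a missing
+	 * mark) defers to the parent directory; the first explicit ALLOW or
+	 * BLOCK found on the way up to the filesystem root wins.
+	 */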
+ rcu_read_lock();
+ while (1) {
+ struct fsnotify_mark *mark = fsnotify_find_mark(
+ &inode->i_fsnotify_marks, sbm->fsn_group);
+ if (mark) {
+ struct chromiumos_inode_mark *inode_mark =
+ chromiumos_to_inode_mark(mark);
+ policy = READ_ONCE(inode_mark->policies[type]);
+ fsnotify_put_mark(mark);
+
+ if (policy != CHROMIUMOS_INODE_POLICY_INHERIT)
+ break;
+ }
+
+ if (IS_ROOT(dentry))
+ break;
+ dentry = READ_ONCE(dentry->d_parent);
+ if (!dentry)
+ break;
+ inode = d_inode_rcu(dentry);
+ if (!inode)
+ break;
+ }
+ rcu_read_unlock();
+
+ chromiumos_super_block_put(sbm);
+
+ return policy;
+}
diff --git a/security/chromiumos/inode_mark.h b/security/chromiumos/inode_mark.h
new file mode 100644
index 0000000..ec00bb4
--- /dev/null
+++ b/security/chromiumos/inode_mark.h
@@ -0,0 +1,47 @@
+/*
+ * Linux Security Module for Chromium OS
+ *
+ * Copyright 2016 Google Inc. All Rights Reserved
+ *
+ * Authors:
+ * Mattias Nissler <mnissler@chromium.org>
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+/* FS feature availability policy for inode. */
+enum chromiumos_inode_security_policy {
+ CHROMIUMOS_INODE_POLICY_INHERIT, /* Inherit policy from parent dir */
+ CHROMIUMOS_INODE_POLICY_ALLOW,
+ CHROMIUMOS_INODE_POLICY_BLOCK,
+};
+
+/*
+ * Inode security policy types available for use. To add an additional
+ * security policy, simply add a new member here, add the corresponding policy
+ * files in securityfs.c, and associate the files being added with the new enum
+ * member.
+ */
+enum chromiumos_inode_security_policy_type {
+ CHROMIUMOS_SYMLINK_TRAVERSAL = 0,
+ CHROMIUMOS_FIFO_ACCESS,
+ CHROMIUMOS_NUMBER_OF_POLICIES, /* Do not add entries after this line. */
+};
+
+extern int chromiumos_update_inode_security_policy(
+ struct inode *inode,
+ enum chromiumos_inode_security_policy_type type,
+ enum chromiumos_inode_security_policy policy);
+int chromiumos_flush_inode_security_policies(struct super_block *sb);
+
+extern enum chromiumos_inode_security_policy
+chromiumos_get_inode_security_policy(
+ struct dentry *dentry, struct inode *inode,
+ enum chromiumos_inode_security_policy_type type);
diff --git a/security/chromiumos/lsm.c b/security/chromiumos/lsm.c
new file mode 100644
index 0000000..c2128de
--- /dev/null
+++ b/security/chromiumos/lsm.c
@@ -0,0 +1,442 @@
+/*
+ * Linux Security Module for Chromium OS
+ *
+ * Copyright 2011 Google Inc. All Rights Reserved
+ *
+ * Authors:
+ * Stephan Uphoff <ups@google.com>
+ * Kees Cook <keescook@chromium.org>
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#define pr_fmt(fmt) "Chromium OS LSM: " fmt
+
+#include <asm/syscall.h>
+#include <linux/cred.h>
+#include <linux/fs.h>
+#include <linux/fs_struct.h>
+#include <linux/hashtable.h>
+#include <linux/lsm_hooks.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/namei.h> /* for nameidata_get_total_link_count */
+#include <linux/path.h>
+#include <linux/ptrace.h>
+#include <linux/sched/task_stack.h>
+#include <linux/sched.h> /* current and other task related stuff */
+#include <linux/security.h>
+
+#include "inode_mark.h"
+#include "utils.h"
+
+#define NUM_BITS 8 // 256 buckets in hash table
+
+static DEFINE_HASHTABLE(sb_nosymfollow_hashtable, NUM_BITS);
+
+struct sb_entry {
+ struct hlist_node next;
+ struct hlist_node dlist; /* for deletion cleanup */
+ uintptr_t sb;
+};
+
+#if defined(CONFIG_SECURITY_CHROMIUMOS_NO_UNPRIVILEGED_UNSAFE_MOUNTS) || \
+ defined(CONFIG_SECURITY_CHROMIUMOS_NO_SYMLINK_MOUNT)
+static void report(const char *origin, const struct path *path, char *operation)
+{
+ char *alloced = NULL, *cmdline;
+ char *pathname; /* Pointer to either static string or "alloced". */
+
+ if (!path)
+ pathname = "<unknown>";
+ else {
+ /* We will allow 11 spaces for ' (deleted)' to be appended */
+ alloced = pathname = kmalloc(PATH_MAX+11, GFP_KERNEL);
+ if (!pathname)
+ pathname = "<no_memory>";
+ else {
+ pathname = d_path(path, pathname, PATH_MAX+11);
+ if (IS_ERR(pathname))
+ pathname = "<too_long>";
+ else {
+ pathname = printable(pathname, PATH_MAX+11);
+ kfree(alloced);
+ alloced = pathname;
+ }
+ }
+ }
+
+ cmdline = printable_cmdline(current);
+
+ pr_notice("%s %s obj=%s pid=%d cmdline=%s\n", origin,
+ operation, pathname, task_pid_nr(current), cmdline);
+
+ kfree(cmdline);
+ kfree(alloced);
+}
+#endif
+
+static int chromiumos_security_sb_mount(const char *dev_name,
+ const struct path *path,
+ const char *type, unsigned long flags,
+ void *data)
+{
+#ifdef CONFIG_SECURITY_CHROMIUMOS_NO_SYMLINK_MOUNT
+ if (nameidata_get_total_link_count()) {
+ report("sb_mount", path, "Mount path with symlinks prohibited");
+ pr_notice("sb_mount dev=%s type=%s flags=%#lx\n",
+ dev_name, type, flags);
+ return -ELOOP;
+ }
+#endif
+
+#ifdef CONFIG_SECURITY_CHROMIUMOS_NO_UNPRIVILEGED_UNSAFE_MOUNTS
+ if ((!(flags & (MS_BIND | MS_MOVE | MS_SHARED | MS_PRIVATE | MS_SLAVE |
+ MS_UNBINDABLE)) ||
+ ((flags & MS_REMOUNT) && (flags & MS_BIND))) &&
+ !capable(CAP_SYS_ADMIN)) {
+ int required_mnt_flags = MNT_NOEXEC | MNT_NOSUID | MNT_NODEV;
+
+ if (flags & MS_REMOUNT) {
+ /*
+ * If this is a remount, we only require that the
+ * requested flags are a superset of the original mount
+ * flags.
+ */
+ required_mnt_flags &= path->mnt->mnt_flags;
+ }
+ /*
+ * The three flags we are interested in disallowing in
+ * unprivileged user namespaces (MS_NOEXEC, MS_NOSUID, MS_NODEV)
+ * cannot be modified when doing a bind-mount. The kernel
+ * attempts to dispatch calls to do_mount() within
+ * fs/namespace.c in the following order:
+ *
+ * * If the MS_REMOUNT flag is present, it calls do_remount().
+		 *    When MS_BIND is also present, it only allows modifying the
+		 *    per-mount flags, which are copied into
+		 *    |required_mnt_flags|. Otherwise it bails in the absence of
+		 *    CAP_SYS_ADMIN in the init ns.
+ * * If the MS_BIND flag is present, the only other flag checked
+ * is MS_REC.
+ * * If any of the mount propagation flags are present
+ * (MS_SHARED, MS_PRIVATE, MS_SLAVE, MS_UNBINDABLE),
+ * flags_to_propagation_type() filters out any additional
+ * flags.
+		 *  * If the MS_MOVE flag is present, all other flags are ignored.
+ */
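+		/*
+		 * Example: a fresh "mount -t tmpfs none /mnt" issued without
+		 * nodev/noexec/nosuid by a task lacking CAP_SYS_ADMIN in the
+		 * init ns fails the checks below with -EPERM.
+		 */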
+ if ((required_mnt_flags & MNT_NOEXEC) && !(flags & MS_NOEXEC)) {
+ report("sb_mount", path,
+ "Mounting a filesystem with 'exec' flag requires CAP_SYS_ADMIN in init ns");
+ pr_notice("sb_mount dev=%s type=%s flags=%#lx\n",
+ dev_name, type, flags);
+ return -EPERM;
+ }
+ if ((required_mnt_flags & MNT_NOSUID) && !(flags & MS_NOSUID)) {
+ report("sb_mount", path,
+ "Mounting a filesystem with 'suid' flag requires CAP_SYS_ADMIN in init ns");
+ pr_notice("sb_mount dev=%s type=%s flags=%#lx\n",
+ dev_name, type, flags);
+ return -EPERM;
+ }
+ if ((required_mnt_flags & MNT_NODEV) && !(flags & MS_NODEV) &&
+ strcmp(type, "devpts")) {
+ report("sb_mount", path,
+ "Mounting a filesystem with 'dev' flag requires CAP_SYS_ADMIN in init ns");
+ pr_notice("sb_mount dev=%s type=%s flags=%#lx\n",
+ dev_name, type, flags);
+ return -EPERM;
+ }
+ }
+#endif
+
+ return 0;
+}
+
+static DEFINE_SPINLOCK(sb_nosymfollow_hashtable_spinlock);
+
+/* Check for entry in hash table. */
+static bool chromiumos_check_sb_nosymfollow_hashtable(struct super_block *sb)
+{
+ struct sb_entry *entry;
+ uintptr_t sb_pointer = (uintptr_t)sb;
+ bool found = false;
+
+ rcu_read_lock();
+ hash_for_each_possible_rcu(sb_nosymfollow_hashtable,
+ entry, next, sb_pointer) {
+ if (entry->sb == sb_pointer) {
+ found = true;
+ break;
+ }
+ }
+ rcu_read_unlock();
+
+ /*
+	 * It's possible that a policy gets added in between the time we check
+	 * above and when we return false here. Such a race condition should
+	 * not affect this check, however, since it would only be relevant if
+ * userspace tried to traverse a symlink on a filesystem before that
+ * filesystem was done being mounted (or potentially while it was being
+ * remounted with new mount flags).
+ */
+ return found;
+}
+
+/* Add entry to hash table. */
+static int chromiumos_add_sb_nosymfollow_hashtable(struct super_block *sb)
+{
+ struct sb_entry *new;
+ uintptr_t sb_pointer = (uintptr_t)sb;
+
+ /* Return if entry already exists */
+ if (chromiumos_check_sb_nosymfollow_hashtable(sb))
+ return 0;
+
+ new = kzalloc(sizeof(struct sb_entry), GFP_KERNEL);
+ if (!new)
+ return -ENOMEM;
+ new->sb = sb_pointer;
+ spin_lock(&sb_nosymfollow_hashtable_spinlock);
+ hash_add_rcu(sb_nosymfollow_hashtable, &new->next, sb_pointer);
+ spin_unlock(&sb_nosymfollow_hashtable_spinlock);
+ return 0;
+}
+
+/* Flush all entries from hash table. */
+void chromiumos_flush_sb_nosymfollow_hashtable(void)
+{
+ struct sb_entry *entry;
+ struct hlist_node *hlist_node;
+ unsigned int bkt_loop_cursor;
+ HLIST_HEAD(free_list);
+
+ /*
+ * Could probably use hash_for_each_rcu here instead, but this should
+ * be fine as well.
+ */
+ spin_lock(&sb_nosymfollow_hashtable_spinlock);
+ hash_for_each_safe(sb_nosymfollow_hashtable, bkt_loop_cursor,
+ hlist_node, entry, next) {
+ hash_del_rcu(&entry->next);
+ hlist_add_head(&entry->dlist, &free_list);
+ }
+ spin_unlock(&sb_nosymfollow_hashtable_spinlock);
+ synchronize_rcu();
+ hlist_for_each_entry_safe(entry, hlist_node, &free_list, dlist)
+ kfree(entry);
+}
+
+/* Remove entry from hash table. */
+static void chromiumos_remove_sb_nosymfollow_hashtable(struct super_block *sb)
+{
+ struct sb_entry *entry;
+ struct hlist_node *hlist_node;
+ uintptr_t sb_pointer = (uintptr_t)sb;
+ bool free_entry = false;
+
+ /*
+ * Could probably use hash_for_each_rcu here instead, but this should
+ * be fine as well.
+ */
+ spin_lock(&sb_nosymfollow_hashtable_spinlock);
+ hash_for_each_possible_safe(sb_nosymfollow_hashtable, entry,
+ hlist_node, next, sb_pointer) {
+ if (entry->sb == sb_pointer) {
+ hash_del_rcu(&entry->next);
+ free_entry = true;
+ break;
+ }
+ }
+ spin_unlock(&sb_nosymfollow_hashtable_spinlock);
+ if (free_entry) {
+ synchronize_rcu();
+ kfree(entry);
+ }
+}
+
+int chromiumos_security_sb_umount(struct vfsmount *mnt, int flags)
+{
+ /* If mnt->mnt_sb is in nosymfollow hashtable, remove it. */
+ chromiumos_remove_sb_nosymfollow_hashtable(mnt->mnt_sb);
+
+ return 0;
+}
+
+/*
+ * NOTE: The WARN() calls will emit a warning in cases of blocked symlink
+ * traversal attempts. These will show up in kernel warning reports
+ * collected by the crash reporter, so we have some insight into spurious
+ * failures that need addressing.
+ */
+static int chromiumos_security_inode_follow_link(struct dentry *dentry,
+ struct inode *inode, bool rcu)
+{
+ static char accessed_path[PATH_MAX];
+ enum chromiumos_inode_security_policy policy;
+
+ /* Deny if symlinks have been disabled on this superblock. */
+ if (chromiumos_check_sb_nosymfollow_hashtable(dentry->d_sb)) {
+ WARN(1,
+ "Blocked symlink traversal for path %x:%x:%s (symlinks were disabled on this FS through the 'nosymfollow' mount option)\n",
+ MAJOR(dentry->d_sb->s_dev),
+ MINOR(dentry->d_sb->s_dev),
+ dentry_path(dentry, accessed_path, PATH_MAX));
+ return -EACCES;
+ }
+
+ policy = chromiumos_get_inode_security_policy(
+ dentry, inode,
+ CHROMIUMOS_SYMLINK_TRAVERSAL);
+
+ WARN(policy == CHROMIUMOS_INODE_POLICY_BLOCK,
+ "Blocked symlink traversal for path %x:%x:%s (see https://goo.gl/8xICW6 for context and rationale)\n",
+ MAJOR(dentry->d_sb->s_dev), MINOR(dentry->d_sb->s_dev),
+ dentry_path(dentry, accessed_path, PATH_MAX));
+
+ return policy == CHROMIUMOS_INODE_POLICY_BLOCK ? -EACCES : 0;
+}
+
+static int chromiumos_security_file_open(struct file *file)
+{
+ static char accessed_path[PATH_MAX];
+ enum chromiumos_inode_security_policy policy;
+ struct dentry *dentry = file->f_path.dentry;
+
+ /* Returns 0 if file is not a FIFO */
+ if (!S_ISFIFO(file->f_inode->i_mode))
+ return 0;
+
+ policy = chromiumos_get_inode_security_policy(
+ dentry, dentry->d_inode,
+ CHROMIUMOS_FIFO_ACCESS);
+
+ /*
+ * Emit a warning in cases of blocked fifo access attempts. These will
+ * show up in kernel warning reports collected by the crash reporter,
+	 * so we have some insight into spurious failures that need addressing.
+ */
+ WARN(policy == CHROMIUMOS_INODE_POLICY_BLOCK,
+	     "Blocked fifo access for path %x:%x:%s (see https://goo.gl/8xICW6 for context and rationale)\n",
+ MAJOR(dentry->d_sb->s_dev), MINOR(dentry->d_sb->s_dev),
+ dentry_path(dentry, accessed_path, PATH_MAX));
+
+ return policy == CHROMIUMOS_INODE_POLICY_BLOCK ? -EACCES : 0;
+}
+
+/*
+ * This hook inspects the string pointed to by the first parameter, looking for
+ * the "nosymfollow" mount option. The second parameter points to an empty
+ * page-sized buffer that is used for holding LSM-specific mount options that
+ * are grabbed (after this function executes, in security_sb_copy_data) from
+ * the mount string in the first parameter. Since the chromiumos LSM is stacked
+ * ahead of SELinux for ChromeOS, the page-sized buffer is empty when this
+ * function is called. If the "nosymfollow" mount option is encountered in this
+ * function, we write "nosymflw" to the empty page-sized buffer, which lets us
+ * pass information that chromiumos_sb_kern_mount will later see, signifying
+ * that symlinks should be disabled for the sb. We store this token
+ * at a spot in the buffer that is at a greater offset than the bytes needed to
+ * record the rest of the LSM-specific mount options (e.g. those for SELinux).
+ * The "nosymfollow" option will be stripped from the mount string if it is
+ * encountered.
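+ * For example, "rw,nosymfollow,noatime" is rewritten in place to
+ * "rw,noatime" while "nosymflw" is stored in the page-sized buffer for
+ * chromiumos_sb_kern_mount to find.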
+ */
+int chromiumos_sb_copy_data(char *orig, char *copy)
+{
+ char *orig_copy;
+ char *orig_copy_cur;
+ char *option;
+ size_t offset = 0;
+ bool found = false;
+
+ if (!orig || *orig == 0)
+ return 0;
+
+ orig_copy = alloc_secdata();
+ if (!orig_copy)
+ return -ENOMEM;
+ strncpy(orig_copy, orig, PAGE_SIZE);
+
+ memset(orig, 0, strlen(orig));
+
+ orig_copy_cur = orig_copy;
+ while (orig_copy_cur) {
+ option = strsep(&orig_copy_cur, ",");
+		if (strcmp(option, "nosymfollow") == 0) {
+			if (found) { /* Found multiple times. */
+				free_secdata(orig_copy);
+				return -EINVAL;
+			}
+			found = true;
+ } else {
+ if (offset > 0) {
+ orig[offset] = ',';
+ offset++;
+ }
+ strcpy(orig + offset, option);
+ offset += strlen(option);
+ }
+ }
+
+ if (found)
+ strcpy(copy + offset + 1, "nosymflw");
+
+ free_secdata(orig_copy);
+ return 0;
+}
+
+/* Unfortunately the kernel doesn't implement memmem function. */
+static void *search_buffer(void *haystack, size_t haystacklen,
+ const void *needle, size_t needlelen)
+{
+ if (!needlelen)
+ return (void *)haystack;
+ while (haystacklen >= needlelen) {
+ haystacklen--;
+ if (!memcmp(haystack, needle, needlelen))
+ return (void *)haystack;
+ haystack++;
+ }
+ return NULL;
+}
+
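+/*
+ * Scan the LSM mount data page for the "nosymflw" token written by
+ * chromiumos_sb_copy_data() above. The token sits past the regular option
+ * bytes, so it is matched together with the NUL bytes that bracket it.
+ */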
+int chromiumos_sb_kern_mount(struct super_block *sb, int flags, void *data)
+{
+ int ret;
+ char search_str[10] = "\0nosymflw";
+
+ if (!data)
+ return 0;
+
+ if (search_buffer(data, PAGE_SIZE, search_str, 10)) {
+ ret = chromiumos_add_sb_nosymfollow_hashtable(sb);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+static struct security_hook_list chromiumos_security_hooks[] = {
+ LSM_HOOK_INIT(sb_mount, chromiumos_security_sb_mount),
+ LSM_HOOK_INIT(inode_follow_link, chromiumos_security_inode_follow_link),
+ LSM_HOOK_INIT(file_open, chromiumos_security_file_open),
+ LSM_HOOK_INIT(sb_copy_data, chromiumos_sb_copy_data),
+ LSM_HOOK_INIT(sb_kern_mount, chromiumos_sb_kern_mount),
+ LSM_HOOK_INIT(sb_umount, chromiumos_security_sb_umount)
+};
+
+static int __init chromiumos_security_init(void)
+{
+ security_add_hooks(chromiumos_security_hooks,
+ ARRAY_SIZE(chromiumos_security_hooks), "chromiumos");
+
+	pr_info("enabled\n");
+
+ return 0;
+}
+security_initcall(chromiumos_security_init);
diff --git a/security/chromiumos/read_write_test_whitelists.h b/security/chromiumos/read_write_test_whitelists.h
new file mode 100644
index 0000000..5aa7370
--- /dev/null
+++ b/security/chromiumos/read_write_test_whitelists.h
@@ -0,0 +1,56 @@
+/*
+ * Linux Security Module for Chromium OS
+ *
+ * Copyright 2018 Google LLC. All Rights Reserved
+ *
+ * Authors:
+ * Micah Morton <mortonm@chromium.org>
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef READ_WRITE_TESTS_WHITELISTS_H
+#define READ_WRITE_TESTS_WHITELISTS_H
+
+/*
+ * NOTE: the purpose of this header is only to pull out the definition of this
+ * array from alt-syscall.c for the purposes of readability. It should not be
+ * included in other .c files.
+ */
+
+#include "alt-syscall.h"
+
+static struct syscall_whitelist_entry read_write_test_whitelist[] = {
+ SYSCALL_ENTRY(exit),
+ SYSCALL_ENTRY(openat),
+ SYSCALL_ENTRY(close),
+ SYSCALL_ENTRY(read),
+ SYSCALL_ENTRY(write),
+ SYSCALL_ENTRY_ALT(prctl, alt_sys_prctl),
+
+ /* open(2) is deprecated and not wired up on ARM64. */
+#ifndef CONFIG_ARM64
+ SYSCALL_ENTRY(open),
+#endif
+}; /* end read_write_test_whitelist */
+
+#ifdef CONFIG_COMPAT
+static struct syscall_whitelist_entry read_write_test_compat_whitelist[] = {
+ COMPAT_SYSCALL_ENTRY(exit),
+ COMPAT_SYSCALL_ENTRY(open),
+ COMPAT_SYSCALL_ENTRY(openat),
+ COMPAT_SYSCALL_ENTRY(close),
+ COMPAT_SYSCALL_ENTRY(read),
+ COMPAT_SYSCALL_ENTRY(write),
+ COMPAT_SYSCALL_ENTRY_ALT(prctl, alt_sys_prctl),
+}; /* end read_write_test_compat_whitelist */
+#endif /* CONFIG_COMPAT */
+
+#endif /* READ_WRITE_TESTS_WHITELISTS_H */
diff --git a/security/chromiumos/securityfs.c b/security/chromiumos/securityfs.c
new file mode 100644
index 0000000..ae2d76a
--- /dev/null
+++ b/security/chromiumos/securityfs.c
@@ -0,0 +1,241 @@
+/*
+ * Linux Security Module for Chromium OS
+ *
+ * Copyright 2016 Google Inc. All Rights Reserved
+ *
+ * Authors:
+ * Mattias Nissler <mnissler@chromium.org>
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/capability.h>
+#include <linux/cred.h>
+#include <linux/dcache.h>
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/sched.h>
+#include <linux/security.h>
+#include <linux/string.h>
+#include <linux/uaccess.h>
+
+#include "inode_mark.h"
+
+static struct dentry *chromiumos_dir;
+static struct dentry *chromiumos_inode_policy_dir;
+
+struct chromiumos_inode_policy_file_entry {
+ const char *name;
+ int (*handle_write)(struct chromiumos_inode_policy_file_entry *,
+ struct dentry *);
+ enum chromiumos_inode_security_policy_type type;
+ enum chromiumos_inode_security_policy policy;
+ struct dentry *dentry;
+};
+
+static int chromiumos_inode_policy_file_write(
+ struct chromiumos_inode_policy_file_entry *file_entry,
+ struct dentry *dentry)
+{
+ return chromiumos_update_inode_security_policy(dentry->d_inode,
+ file_entry->type, file_entry->policy);
+}
+
+/*
+ * Causes all marks to be removed from inodes thus removing all inode security
+ * policies.
+ */
+static int chromiumos_inode_policy_file_flush_write(
+ struct chromiumos_inode_policy_file_entry *file_entry,
+ struct dentry *dentry)
+{
+ return chromiumos_flush_inode_security_policies(dentry->d_sb);
+}
+
+static struct chromiumos_inode_policy_file_entry
+ chromiumos_inode_policy_files[] = {
+ {.name = "block_symlink",
+ .handle_write = chromiumos_inode_policy_file_write,
+ .type = CHROMIUMOS_SYMLINK_TRAVERSAL,
+ .policy = CHROMIUMOS_INODE_POLICY_BLOCK},
+ {.name = "allow_symlink",
+ .handle_write = chromiumos_inode_policy_file_write,
+ .type = CHROMIUMOS_SYMLINK_TRAVERSAL,
+ .policy = CHROMIUMOS_INODE_POLICY_ALLOW},
+ {.name = "reset_symlink",
+ .handle_write = chromiumos_inode_policy_file_write,
+ .type = CHROMIUMOS_SYMLINK_TRAVERSAL,
+ .policy = CHROMIUMOS_INODE_POLICY_INHERIT},
+ {.name = "block_fifo",
+ .handle_write = chromiumos_inode_policy_file_write,
+ .type = CHROMIUMOS_FIFO_ACCESS,
+ .policy = CHROMIUMOS_INODE_POLICY_BLOCK},
+ {.name = "allow_fifo",
+ .handle_write = chromiumos_inode_policy_file_write,
+ .type = CHROMIUMOS_FIFO_ACCESS,
+ .policy = CHROMIUMOS_INODE_POLICY_ALLOW},
+ {.name = "reset_fifo",
+ .handle_write = chromiumos_inode_policy_file_write,
+ .type = CHROMIUMOS_FIFO_ACCESS,
+ .policy = CHROMIUMOS_INODE_POLICY_INHERIT},
+ {.name = "flush_policies",
+ .handle_write = &chromiumos_inode_policy_file_flush_write},
+};
+
+static int chromiumos_resolve_path(const char __user *buf, size_t len,
+ struct path *path)
+{
+ char *filename = NULL;
+ char *canonical_buf = NULL;
+ char *canonical;
+ int ret;
+
+ if (len + 1 > PATH_MAX)
+ return -EINVAL;
+
+ /*
+ * Copy the path to a kernel buffer. We can't use user_path_at()
+ * since it expects a zero-terminated path, which we generally don't
+ * have here.
+ */
+ filename = kzalloc(len + 1, GFP_KERNEL);
+ if (!filename)
+ return -ENOMEM;
+
+ if (copy_from_user(filename, buf, len)) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ ret = kern_path(filename, 0, path);
+ if (ret)
+ goto out;
+
+ /*
+ * Make sure the path is canonical, i.e. it didn't contain symlinks. To
+ * check this we convert |path| back to an absolute path (within the
+ * global root) and compare the resulting path name with the passed-in
+ * |filename|. This is stricter than needed (i.e. consecutive slashes
+ * don't get ignored), but that's fine for our purposes.
+ */
+ canonical_buf = kzalloc(len + 1, GFP_KERNEL);
+ if (!canonical_buf) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ canonical = d_absolute_path(path, canonical_buf, len + 1);
+ if (IS_ERR(canonical)) {
+ ret = PTR_ERR(canonical);
+
+ /* Buffer too short implies |filename| wasn't canonical. */
+ if (ret == -ENAMETOOLONG)
+ ret = -EMLINK;
+
+ goto out;
+ }
+
+ ret = strcmp(filename, canonical) ? -EMLINK : 0;
+
+out:
+ kfree(canonical_buf);
+ if (ret < 0)
+ path_put(path);
+ kfree(filename);
+ return ret;
+}
+
+static ssize_t chromiumos_inode_file_write(
+ struct file *file,
+ const char __user *buf,
+ size_t len,
+ loff_t *ppos)
+{
+ struct chromiumos_inode_policy_file_entry *file_entry =
+ file->f_inode->i_private;
+ struct path path = {};
+ int ret;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ if (*ppos != 0)
+ return -EINVAL;
+
+ ret = chromiumos_resolve_path(buf, len, &path);
+ if (ret)
+ return ret;
+
+ ret = file_entry->handle_write(file_entry, path.dentry);
+ path_put(&path);
+ return ret < 0 ? ret : len;
+}
+
+static const struct file_operations chromiumos_inode_policy_file_fops = {
+ .write = chromiumos_inode_file_write,
+};
+
+static void chromiumos_shutdown_securityfs(void)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(chromiumos_inode_policy_files); ++i) {
+ struct chromiumos_inode_policy_file_entry *entry =
+ &chromiumos_inode_policy_files[i];
+ securityfs_remove(entry->dentry);
+ entry->dentry = NULL;
+ }
+
+ securityfs_remove(chromiumos_inode_policy_dir);
+ chromiumos_inode_policy_dir = NULL;
+
+ securityfs_remove(chromiumos_dir);
+ chromiumos_dir = NULL;
+}
+
+static int chromiumos_init_securityfs(void)
+{
+ int i;
+ int ret;
+
+ chromiumos_dir = securityfs_create_dir("chromiumos", NULL);
+	if (IS_ERR(chromiumos_dir)) {
+ ret = PTR_ERR(chromiumos_dir);
+ goto error;
+ }
+
+ chromiumos_inode_policy_dir =
+ securityfs_create_dir(
+ "inode_security_policies",
+ chromiumos_dir);
+	if (IS_ERR(chromiumos_inode_policy_dir)) {
+ ret = PTR_ERR(chromiumos_inode_policy_dir);
+ goto error;
+ }
+
+ for (i = 0; i < ARRAY_SIZE(chromiumos_inode_policy_files); ++i) {
+ struct chromiumos_inode_policy_file_entry *entry =
+ &chromiumos_inode_policy_files[i];
+ entry->dentry = securityfs_create_file(
+ entry->name, 0200, chromiumos_inode_policy_dir,
+ entry, &chromiumos_inode_policy_file_fops);
+ if (IS_ERR(entry->dentry)) {
+ ret = PTR_ERR(entry->dentry);
+ goto error;
+ }
+ }
+
+ return 0;
+
+error:
+ chromiumos_shutdown_securityfs();
+ return ret;
+}
+fs_initcall(chromiumos_init_securityfs);
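
Assuming securityfs is mounted at the conventional /sys/kernel/security, a privileged process can apply one of these policies by writing an absolute, symlink-free path into the matching control file. A hedged sketch (the target path is illustrative):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            const char *ctl = "/sys/kernel/security/chromiumos/"
                              "inode_security_policies/block_symlink";
            const char *target = "/mnt/stateful_partition";  /* hypothetical */
            int fd = open(ctl, O_WRONLY);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            /* The handler requires CAP_SYS_ADMIN and a write at offset 0. */
            if (write(fd, target, strlen(target)) < 0)
                    perror("write");
            close(fd);
            return 0;
    }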
diff --git a/security/chromiumos/third_party_whitelists.h b/security/chromiumos/third_party_whitelists.h
new file mode 100644
index 0000000..dfc4e01
--- /dev/null
+++ b/security/chromiumos/third_party_whitelists.h
@@ -0,0 +1,256 @@
+/*
+ * Linux Security Module for Chromium OS
+ *
+ * Copyright 2018 Google LLC. All Rights Reserved
+ *
+ * Authors:
+ * Micah Morton <mortonm@chromium.org>
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef THIRD_PARTY_WHITELISTS_H
+#define THIRD_PARTY_WHITELISTS_H
+
+/*
+ * NOTE: the purpose of this header is only to pull out the definition of this
+ * array from alt-syscall.c for the purposes of readability. It should not be
+ * included in other .c files.
+ */
+
+#include "alt-syscall.h"
+
+static struct syscall_whitelist_entry third_party_whitelist[] = {
+ SYSCALL_ENTRY(accept),
+ SYSCALL_ENTRY(bind),
+ SYSCALL_ENTRY(brk),
+ SYSCALL_ENTRY(chdir),
+ SYSCALL_ENTRY(clock_gettime),
+ SYSCALL_ENTRY(clone),
+ SYSCALL_ENTRY(close),
+ SYSCALL_ENTRY(connect),
+ SYSCALL_ENTRY(dup),
+ SYSCALL_ENTRY(execve),
+ SYSCALL_ENTRY(exit),
+ SYSCALL_ENTRY(exit_group),
+ SYSCALL_ENTRY(fcntl),
+ SYSCALL_ENTRY(fstat),
+ SYSCALL_ENTRY(futex),
+ SYSCALL_ENTRY(getcwd),
+ SYSCALL_ENTRY(getdents64),
+ SYSCALL_ENTRY(getpid),
+ SYSCALL_ENTRY(getpgid),
+ SYSCALL_ENTRY(getppid),
+ SYSCALL_ENTRY(getpriority),
+ SYSCALL_ENTRY(getsid),
+ SYSCALL_ENTRY(gettimeofday),
+ SYSCALL_ENTRY(ioctl),
+ SYSCALL_ENTRY(listen),
+ SYSCALL_ENTRY(lseek),
+ SYSCALL_ENTRY(madvise),
+ SYSCALL_ENTRY(memfd_create),
+ SYSCALL_ENTRY(mprotect),
+ SYSCALL_ENTRY(munmap),
+ SYSCALL_ENTRY(nanosleep),
+ SYSCALL_ENTRY(openat),
+ SYSCALL_ENTRY(prlimit64),
+ SYSCALL_ENTRY(read),
+ SYSCALL_ENTRY(recvfrom),
+ SYSCALL_ENTRY(recvmsg),
+ SYSCALL_ENTRY(rt_sigaction),
+ SYSCALL_ENTRY(rt_sigprocmask),
+ SYSCALL_ENTRY(rt_sigreturn),
+ SYSCALL_ENTRY(sendfile),
+ SYSCALL_ENTRY(sendmsg),
+ SYSCALL_ENTRY(sendto),
+ SYSCALL_ENTRY(set_robust_list),
+ SYSCALL_ENTRY(set_tid_address),
+ SYSCALL_ENTRY(setpgid),
+ SYSCALL_ENTRY(setpriority),
+ SYSCALL_ENTRY(setsid),
+ SYSCALL_ENTRY(setsockopt),
+ SYSCALL_ENTRY(socket),
+ SYSCALL_ENTRY(socketpair),
+ SYSCALL_ENTRY(syslog),
+ SYSCALL_ENTRY(statfs),
+ SYSCALL_ENTRY(umask),
+ SYSCALL_ENTRY(uname),
+ SYSCALL_ENTRY(wait4),
+ SYSCALL_ENTRY(write),
+ SYSCALL_ENTRY(writev),
+
+ /*
+ * Deprecated syscalls which are not wired up on new architectures
+ * such as ARM64.
+ */
+#ifndef CONFIG_ARM64
+ SYSCALL_ENTRY(access),
+ SYSCALL_ENTRY(creat),
+ SYSCALL_ENTRY(dup2),
+ SYSCALL_ENTRY(getdents),
+ SYSCALL_ENTRY(getpgrp),
+ SYSCALL_ENTRY(lstat),
+ SYSCALL_ENTRY(mkdir),
+ SYSCALL_ENTRY(open),
+ SYSCALL_ENTRY(pipe),
+ SYSCALL_ENTRY(poll),
+ SYSCALL_ENTRY(readlink),
+ SYSCALL_ENTRY(stat),
+ SYSCALL_ENTRY(unlink),
+#endif
+
+ /* ARM32 only syscalls. */
+#if defined(CONFIG_ARM)
+ SYSCALL_ENTRY(fcntl64),
+ SYSCALL_ENTRY(fstat64),
+ SYSCALL_ENTRY(geteuid32),
+ SYSCALL_ENTRY(getuid32),
+ SYSCALL_ENTRY(_llseek),
+ SYSCALL_ENTRY(lstat64),
+ SYSCALL_ENTRY(_newselect),
+ SYSCALL_ENTRY(mmap2),
+ SYSCALL_ENTRY(stat64),
+ SYSCALL_ENTRY(ugetrlimit),
+#endif
+
+ /* 64-bit only syscalls. */
+#if defined(CONFIG_X86_64) || defined(CONFIG_ARM64)
+ SYSCALL_ENTRY(getegid),
+ SYSCALL_ENTRY(geteuid),
+ SYSCALL_ENTRY(getgid),
+ SYSCALL_ENTRY(getrlimit),
+ SYSCALL_ENTRY(getuid),
+ SYSCALL_ENTRY(mmap),
+ SYSCALL_ENTRY(setgid),
+ SYSCALL_ENTRY(setuid),
+ /*
+ * chown(2), lchown(2), and select(2) are deprecated and not wired up
+ * on ARM64.
+ */
+#ifndef CONFIG_ARM64
+ SYSCALL_ENTRY(select),
+#endif
+#endif
+
+ /* X86_64-specific syscalls. */
+#ifdef CONFIG_X86_64
+ SYSCALL_ENTRY(arch_prctl),
+#endif
+}; /* end third_party_whitelist */
+
+#ifdef CONFIG_COMPAT
+static struct syscall_whitelist_entry third_party_compat_whitelist[] = {
+ COMPAT_SYSCALL_ENTRY(access),
+ COMPAT_SYSCALL_ENTRY(brk),
+ COMPAT_SYSCALL_ENTRY(chdir),
+ COMPAT_SYSCALL_ENTRY(clock_gettime),
+ COMPAT_SYSCALL_ENTRY(clone),
+ COMPAT_SYSCALL_ENTRY(close),
+ COMPAT_SYSCALL_ENTRY(creat),
+ COMPAT_SYSCALL_ENTRY(dup),
+ COMPAT_SYSCALL_ENTRY(dup2),
+ COMPAT_SYSCALL_ENTRY(execve),
+ COMPAT_SYSCALL_ENTRY(exit),
+ COMPAT_SYSCALL_ENTRY(exit_group),
+ COMPAT_SYSCALL_ENTRY(fcntl),
+ COMPAT_SYSCALL_ENTRY(fcntl64),
+ COMPAT_SYSCALL_ENTRY(fstat),
+ COMPAT_SYSCALL_ENTRY(fstat64),
+ COMPAT_SYSCALL_ENTRY(futex),
+ COMPAT_SYSCALL_ENTRY(getcwd),
+ COMPAT_SYSCALL_ENTRY(getdents),
+ COMPAT_SYSCALL_ENTRY(getdents64),
+ COMPAT_SYSCALL_ENTRY(getegid),
+ COMPAT_SYSCALL_ENTRY(geteuid),
+ COMPAT_SYSCALL_ENTRY(geteuid32),
+ COMPAT_SYSCALL_ENTRY(getgid),
+ COMPAT_SYSCALL_ENTRY(getpgid),
+ COMPAT_SYSCALL_ENTRY(getpgrp),
+ COMPAT_SYSCALL_ENTRY(getpid),
+ COMPAT_SYSCALL_ENTRY(getpriority),
+ COMPAT_SYSCALL_ENTRY(getppid),
+ COMPAT_SYSCALL_ENTRY(getsid),
+ COMPAT_SYSCALL_ENTRY(gettimeofday),
+ COMPAT_SYSCALL_ENTRY(getuid),
+ COMPAT_SYSCALL_ENTRY(getuid32),
+ COMPAT_SYSCALL_ENTRY(ioctl),
+ COMPAT_SYSCALL_ENTRY(_llseek),
+ COMPAT_SYSCALL_ENTRY(lseek),
+ COMPAT_SYSCALL_ENTRY(lstat),
+ COMPAT_SYSCALL_ENTRY(lstat64),
+ COMPAT_SYSCALL_ENTRY(madvise),
+ COMPAT_SYSCALL_ENTRY(memfd_create),
+ COMPAT_SYSCALL_ENTRY(mkdir),
+ COMPAT_SYSCALL_ENTRY(mmap2),
+ COMPAT_SYSCALL_ENTRY(mprotect),
+ COMPAT_SYSCALL_ENTRY(munmap),
+ COMPAT_SYSCALL_ENTRY(nanosleep),
+ COMPAT_SYSCALL_ENTRY(_newselect),
+ COMPAT_SYSCALL_ENTRY(open),
+ COMPAT_SYSCALL_ENTRY(openat),
+ COMPAT_SYSCALL_ENTRY(pipe),
+ COMPAT_SYSCALL_ENTRY(poll),
+ COMPAT_SYSCALL_ENTRY(prlimit64),
+ COMPAT_SYSCALL_ENTRY(read),
+ COMPAT_SYSCALL_ENTRY(readlink),
+ COMPAT_SYSCALL_ENTRY(rt_sigaction),
+ COMPAT_SYSCALL_ENTRY(rt_sigprocmask),
+ COMPAT_SYSCALL_ENTRY(rt_sigreturn),
+ COMPAT_SYSCALL_ENTRY(sendfile),
+ COMPAT_SYSCALL_ENTRY(set_robust_list),
+ COMPAT_SYSCALL_ENTRY(set_tid_address),
+ COMPAT_SYSCALL_ENTRY(setgid32),
+ COMPAT_SYSCALL_ENTRY(setuid32),
+ COMPAT_SYSCALL_ENTRY(setpgid),
+ COMPAT_SYSCALL_ENTRY(setpriority),
+ COMPAT_SYSCALL_ENTRY(setsid),
+ COMPAT_SYSCALL_ENTRY(stat),
+ COMPAT_SYSCALL_ENTRY(stat64),
+ COMPAT_SYSCALL_ENTRY(statfs),
+ COMPAT_SYSCALL_ENTRY(syslog),
+ COMPAT_SYSCALL_ENTRY(ugetrlimit),
+ COMPAT_SYSCALL_ENTRY(umask),
+ COMPAT_SYSCALL_ENTRY(uname),
+ COMPAT_SYSCALL_ENTRY(unlink),
+ COMPAT_SYSCALL_ENTRY(wait4),
+ COMPAT_SYSCALL_ENTRY(write),
+ COMPAT_SYSCALL_ENTRY(writev),
+
+ /* IA32 uses the common socketcall(2) entrypoint for socket calls. */
+#ifdef CONFIG_X86_64
+ COMPAT_SYSCALL_ENTRY(socketcall),
+#endif
+
+#ifdef CONFIG_ARM64
+ COMPAT_SYSCALL_ENTRY(accept),
+ COMPAT_SYSCALL_ENTRY(bind),
+ COMPAT_SYSCALL_ENTRY(connect),
+ COMPAT_SYSCALL_ENTRY(listen),
+ COMPAT_SYSCALL_ENTRY(recvfrom),
+ COMPAT_SYSCALL_ENTRY(recvmsg),
+ COMPAT_SYSCALL_ENTRY(sendmsg),
+ COMPAT_SYSCALL_ENTRY(sendto),
+ COMPAT_SYSCALL_ENTRY(setsockopt),
+ COMPAT_SYSCALL_ENTRY(socket),
+ COMPAT_SYSCALL_ENTRY(socketpair),
+#endif
+
+ /*
+ * getrlimit(2) is deprecated and not wired in the ARM compat table
+ * on ARM64.
+ */
+#ifndef CONFIG_ARM64
+ COMPAT_SYSCALL_ENTRY(getrlimit),
+#endif
+
+}; /* end third_party_compat_whitelist */
+#endif /* CONFIG_COMPAT */
+
+#endif /* THIRD_PARTY_WHITELISTS_H */
diff --git a/security/chromiumos/utils.c b/security/chromiumos/utils.c
new file mode 100644
index 0000000..d0d82d7
--- /dev/null
+++ b/security/chromiumos/utils.c
@@ -0,0 +1,157 @@
+/*
+ * Utilities for the Linux Security Module for Chromium OS
+ * (Since CONFIG_AUDIT is disabled for Chrome OS, we must repurpose
+ * a bunch of the audit string handling logic here instead.)
+ *
+ * Copyright 2012 Google Inc. All Rights Reserved
+ *
+ * Author:
+ * Kees Cook <keescook@chromium.org>
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/sched/mm.h>
+#include <linux/security.h>
+
+#include "utils.h"
+
+/* Disallow double-quote and anything outside printable ASCII. */
+static int contains_unprintable(const char *source, size_t len)
+{
+ const unsigned char *p;
+ for (p = source; p < (const unsigned char *)source + len; p++) {
+ if (*p == '"' || *p < 0x20 || *p > 0x7e)
+ return 1;
+ }
+ return 0;
+}
+
+static char *hex_printable(const char *source, size_t len)
+{
+ size_t i;
+ char *dest, *ptr;
+ const char *hex = "0123456789ABCDEF";
+
+ /* Need to double the length of the string, plus a NULL. */
+ if (len > (INT_MAX - 1) / 2)
+ return NULL;
+ dest = kmalloc((len * 2) + 1, GFP_KERNEL);
+ if (!dest)
+ return NULL;
+
+ for (ptr = dest, i = 0; i < len; i++) {
+ *ptr++ = hex[(source[i] & 0xF0) >> 4];
+ *ptr++ = hex[source[i] & 0x0F];
+ }
+ *ptr = '\0';
+
+ return dest;
+}
+
+static char *quoted_printable(const char *source, size_t len)
+{
+ char *dest;
+
+ /* Need to add 2 double quotes and a NULL. */
+ if (len > INT_MAX - 3)
+ return NULL;
+ dest = kmalloc(len + 3, GFP_KERNEL);
+ if (!dest)
+ return NULL;
+
+ dest[0] = '"';
+ strncpy(dest + 1, source, len);
+ dest[len + 1] = '"';
+ dest[len + 2] = '\0';
+ return dest;
+}
+
+/* Return a string that has been sanitized and is safe to log. It is either
+ * in double-quotes, or is a series of hex digits.
+ */
+char *printable(char *source, size_t max_len)
+{
+ size_t len;
+
+ if (!source)
+ return NULL;
+
+ len = strnlen(source, max_len);
+ if (contains_unprintable(source, len))
+ return hex_printable(source, len);
+ else
+ return quoted_printable(source, len);
+}
+
+/* Repurposed from fs/proc/base.c, with NULL-replacement for saner printing.
+ * Allocates the buffer itself.
+ */
+char *printable_cmdline(struct task_struct *task)
+{
+ char *buffer = NULL, *sanitized;
+ int res, i;
+ unsigned int len;
+ struct mm_struct *mm;
+
+ mm = get_task_mm(task);
+ if (!mm)
+ goto out;
+
+ if (!mm->arg_end)
+ goto out_mm; /* Shh! No looking before we're done */
+
+ buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!buffer)
+ goto out_mm;
+
+ len = mm->arg_end - mm->arg_start;
+
+ if (len > PAGE_SIZE)
+ len = PAGE_SIZE;
+
+ res = access_process_vm(task, mm->arg_start, buffer, len, 0);
+
+ /* Space-fill NULLs. */
+ if (res > 1)
+ for (i = 0; i < res - 2; ++i)
+ if (buffer[i] == '\0')
+ buffer[i] = ' ';
+
+ /* If the NULL at the end of args has been overwritten, then
+	 * assume the application is using setproctitle(3).
+ */
+ if (res > 0 && buffer[res-1] != '\0' && len < PAGE_SIZE) {
+ len = strnlen(buffer, res);
+ if (len < res) {
+ res = len;
+ } else {
+ len = mm->env_end - mm->env_start;
+ if (len > PAGE_SIZE - res)
+ len = PAGE_SIZE - res;
+ res += access_process_vm(task, mm->env_start,
+ buffer+res, len, 0);
+ }
+ }
+
+ /* Make sure the buffer is always NULL-terminated. */
+ buffer[PAGE_SIZE-1] = 0;
+
+ /* Make sure result is printable. */
+ sanitized = printable(buffer, res);
+ kfree(buffer);
+ buffer = sanitized;
+
+out_mm:
+ mmput(mm);
+out:
+ return buffer;
+}
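
A hypothetical call site, for illustration only: printable_cmdline() hands back a kmalloc'd, sanitized string (double-quoted, or hex-encoded when it contains unprintable bytes), and the caller owns the allocation.

    /* Sketch of a caller in a file that already includes utils.h. */
    static void log_current_cmdline(void)
    {
            char *cmdline = printable_cmdline(current);

            if (cmdline) {
                    pr_info("cmdline: %s\n", cmdline);
                    kfree(cmdline);
            }
    }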
diff --git a/security/chromiumos/utils.h b/security/chromiumos/utils.h
new file mode 100644
index 0000000..7151bba
--- /dev/null
+++ b/security/chromiumos/utils.h
@@ -0,0 +1,30 @@
+/*
+ * Utilities for the Linux Security Module for Chromium OS
+ * (Since CONFIG_AUDIT is disabled for Chrome OS, we must repurpose
+ * a bunch of the audit string handling logic here instead.)
+ *
+ * Copyright 2012 Google Inc. All Rights Reserved
+ *
+ * Author:
+ * Kees Cook <keescook@chromium.org>
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef _SECURITY_CHROMIUMOS_UTILS_H
+#define _SECURITY_CHROMIUMOS_UTILS_H
+
+#include <linux/sched.h>
+#include <linux/mm.h>
+
+char *printable(char *source, size_t max_len);
+char *printable_cmdline(struct task_struct *task);
+
+#endif /* _SECURITY_CHROMIUMOS_UTILS_H */
diff --git a/security/container/Kconfig b/security/container/Kconfig
new file mode 100644
index 0000000..827fe0a
--- /dev/null
+++ b/security/container/Kconfig
@@ -0,0 +1,18 @@
+config SECURITY_CONTAINER_MONITOR
+ bool "Monitor containerized processes"
+ depends on SECURITY
+ depends on MMU
+ depends on VSOCKETS=y
+ depends on X86_64
+ select SECURITYFS
+ help
+ Instrument the Linux kernel to collect more information about containers
+ and identify security threats.
+
+config SECURITY_CONTAINER_MONITOR_DEBUG
+ bool "Enable debug pr_devel logs"
+ depends on SECURITY_CONTAINER_MONITOR
+ help
+ Define DEBUG for CSM files to compile verbose debugging messages.
+
+	  Only for debugging/testing; do not enable for production.
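
A plausible .config fragment for an x86_64 build that satisfies the dependencies above (exact surrounding options are an assumption):

    CONFIG_SECURITY=y
    CONFIG_VSOCKETS=y
    CONFIG_SECURITYFS=y
    CONFIG_SECURITY_CONTAINER_MONITOR=y
    # CONFIG_SECURITY_CONTAINER_MONITOR_DEBUG is not set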
diff --git a/security/container/Makefile b/security/container/Makefile
new file mode 100644
index 0000000..678d0f7
--- /dev/null
+++ b/security/container/Makefile
@@ -0,0 +1,16 @@
+PB_CCFLAGS := -DPB_SYSTEM_HEADER="<pbsystem.h>" \
+ -DPB_NO_ERRMSG \
+ -DPB_FIELD_16BIT \
+ -DPB_BUFFER_ONLY
+export PB_CCFLAGS
+
+subdir-$(CONFIG_SECURITY_CONTAINER_MONITOR) += protos
+
+obj-$(CONFIG_SECURITY_CONTAINER_MONITOR) += protos/
+obj-$(CONFIG_SECURITY_CONTAINER_MONITOR) += monitor.o pb.o process.o vsock.o
+
+ccflags-y := -I$(srctree)/security/container/protos \
+ -I$(srctree)/security/container/protos/nanopb \
+ -I$(srctree)/fs \
+ $(PB_CCFLAGS)
+ccflags-$(CONFIG_SECURITY_CONTAINER_MONITOR_DEBUG) += -DDEBUG
diff --git a/security/container/monitor.c b/security/container/monitor.c
new file mode 100644
index 0000000..05e54e5
--- /dev/null
+++ b/security/container/monitor.c
@@ -0,0 +1,782 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Container Security Monitor module
+ *
+ * Copyright (c) 2018 Google, Inc
+ */
+
+#include "monitor.h"
+#include "process.h"
+
+#include <linux/audit.h>
+#include <linux/lsm_hooks.h>
+#include <linux/module.h>
+#include <linux/pipe_fs_i.h>
+#include <linux/rwsem.h>
+#include <linux/string.h>
+#include <linux/sysctl.h>
+#include <linux/socket.h>
+#include <net/sock.h>
+#include <linux/vm_sockets.h>
+#include <linux/file.h>
+
+/* protects csm_*_enabled and configurations. */
+DECLARE_RWSEM(csm_rwsem_config);
+
+/* protects csm_host_port and csm_vsocket. */
+DECLARE_RWSEM(csm_rwsem_vsocket);
+
+/* queue used for poll wait on config changes. */
+static DECLARE_WAIT_QUEUE_HEAD(config_wait);
+
+/* Incremented each time a new configuration is applied. */
+static unsigned long config_version;
+
+/* Stats gathered from the LSM. */
+struct container_stats csm_stats;
+
+struct container_stats_mapping {
+ const char *key;
+ size_t *value;
+};
+
+/* Key value pair mapping for the sysfs entry. */
+struct container_stats_mapping csm_stats_mapping[] = {
+ { "ProtoEncodingFailed", &csm_stats.proto_encoding_failed },
+ { "WorkQueueFailed", &csm_stats.workqueue_failed },
+ { "EventWritingFailed", &csm_stats.event_writing_failed },
+ { "SizePickingFailed", &csm_stats.size_picking_failed },
+ { "PipeAlreadyOpened", &csm_stats.pipe_already_opened },
+};
+
+/*
+ * Is monitoring enabled? Defaults to disabled.
+ * These variables might be used without locking csm_rwsem_config to check if an
+ * LSM hook can bail quickly. The semaphore is taken later to ensure CSM is
+ * still enabled.
+ *
+ * csm_enabled is true if any collector is enabled.
+ */
+bool csm_enabled;
+static bool csm_container_enabled;
+bool csm_execute_enabled;
+bool csm_memexec_enabled;
+
+/* securityfs control files */
+static struct dentry *csm_dir;
+static struct dentry *csm_enabled_file;
+static struct dentry *csm_container_file;
+static struct dentry *csm_config_file;
+static struct dentry *csm_config_vers_file;
+static struct dentry *csm_pipe_file;
+static struct dentry *csm_stats_file;
+
+/* Pipes to forward data to user-mode. */
+DECLARE_RWSEM(csm_rwsem_pipe);
+static struct file *csm_user_read_pipe;
+struct file *csm_user_write_pipe;
+
+/* Option to disable the CSM features at boot. */
+static bool cmdline_boot_disabled;
+bool cmdline_boot_vsock_enabled;
+
+/* Options disabled by default. */
+static bool cmdline_boot_pipe_enabled;
+static bool cmdline_boot_config_enabled;
+
+/* Option to fully enable the LSM at boot for automated testing. */
+static bool cmdline_default_enabled;
+
+static int csm_boot_disabled_setup(char *str)
+{
+ return kstrtobool(str, &cmdline_boot_disabled);
+}
+early_param("csm.disabled", csm_boot_disabled_setup);
+
+static int csm_default_enabled_setup(char *str)
+{
+ return kstrtobool(str, &cmdline_default_enabled);
+}
+early_param("csm.default.enabled", csm_default_enabled_setup);
+
+static int csm_boot_vsock_enabled_setup(char *str)
+{
+ return kstrtobool(str, &cmdline_boot_vsock_enabled);
+}
+early_param("csm.vsock.enabled", csm_boot_vsock_enabled_setup);
+
+static int csm_boot_pipe_enabled_setup(char *str)
+{
+ return kstrtobool(str, &cmdline_boot_pipe_enabled);
+}
+early_param("csm.pipe.enabled", csm_boot_pipe_enabled_setup);
+
+static int csm_boot_config_enabled_setup(char *str)
+{
+ return kstrtobool(str, &cmdline_boot_config_enabled);
+}
+early_param("csm.config.enabled", csm_boot_config_enabled_setup);
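
Taken together, the early_param() handlers above expose five boot-time switches, each parsed with kstrtobool() (so 0/1/y/n spellings are accepted). An illustrative kernel command line for a test image might be:

    csm.disabled=0 csm.vsock.enabled=1 csm.pipe.enabled=1 csm.config.enabled=1 csm.default.enabled=1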
+
+static bool pipe_in_use(void)
+{
+ struct pipe_inode_info *pipe;
+
+ lockdep_assert_held_exclusive(&csm_rwsem_config);
+ if (csm_user_read_pipe) {
+ pipe = get_pipe_info(csm_user_read_pipe);
+ if (pipe)
+ return READ_ONCE(pipe->readers) > 1;
+ }
+ return false;
+}
+
+/* Close pipe, force has to be true to close pipe if it is still being used. */
+int close_pipe_files(bool force)
+{
+ if (csm_user_read_pipe) {
+ /* Pipe is still used. */
+ if (pipe_in_use()) {
+ if (!force)
+ return -EBUSY;
+ pr_warn("pipe is closed while it is still being used.\n");
+ }
+
+ fput(csm_user_read_pipe);
+ fput(csm_user_write_pipe);
+ csm_user_read_pipe = NULL;
+ csm_user_write_pipe = NULL;
+ }
+ return 0;
+}
+
+static void csm_update_config(schema_ConfigurationRequest *req)
+{
+ schema_ExecuteCollectorConfig *econf;
+ size_t i;
+ bool enumerate_processes = false;
+
+ /* Expect the lock to be held for write before this call. */
+ lockdep_assert_held_exclusive(&csm_rwsem_config);
+
+ /* This covers the scenario where a client is connected and the config
+ * transitions the execute collector from disabled to enabled. In that
+	 * case execute events may have been missed, so the existing processes
+	 * are enumerated.
+ */
+ if (!csm_execute_enabled && req->execute_config.enabled &&
+ pipe_in_use())
+ enumerate_processes = true;
+
+ csm_container_enabled = req->container_config.enabled;
+ csm_execute_enabled = req->execute_config.enabled;
+ csm_memexec_enabled = req->memexec_config.enabled;
+
+ /* csm_enabled is true if any collector is enabled. */
+ csm_enabled = csm_container_enabled || csm_execute_enabled ||
+ csm_memexec_enabled;
+
+ /* Clean-up existing configurations. */
+ kfree(csm_execute_config.envp_allowlist);
+ memset(&csm_execute_config, 0, sizeof(csm_execute_config));
+
+ if (csm_execute_enabled) {
+ econf = &req->execute_config;
+ csm_execute_config.argv_limit = econf->argv_limit;
+ csm_execute_config.envp_limit = econf->envp_limit;
+
+ /* Swap the allowlist so it is not freed on return. */
+ csm_execute_config.envp_allowlist = econf->envp_allowlist.arg;
+ econf->envp_allowlist.arg = NULL;
+ }
+
+ /* Reset all stats and close pipe if disabled. */
+ if (!csm_enabled) {
+ for (i = 0; i < ARRAY_SIZE(csm_stats_mapping); i++)
+ *csm_stats_mapping[i].value = 0;
+
+ close_pipe_files(true);
+ }
+
+ config_version++;
+ if (enumerate_processes)
+ csm_enumerate_processes();
+ wake_up(&config_wait);
+}
+
+int csm_update_config_from_buffer(void *data, size_t size)
+{
+ schema_ConfigurationRequest c = schema_ConfigurationRequest_init_zero;
+ pb_istream_t istream;
+
+ c.execute_config.envp_allowlist.funcs.decode = pb_decode_string_array;
+
+ istream = pb_istream_from_buffer(data, size);
+ if (!pb_decode(&istream, schema_ConfigurationRequest_fields, &c)) {
+ kfree(c.execute_config.envp_allowlist.arg);
+ return -EINVAL;
+ }
+
+ down_write(&csm_rwsem_config);
+ csm_update_config(&c);
+ up_write(&csm_rwsem_config);
+
+ return 0;
+}
+
+static ssize_t csm_config_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ ssize_t err = 0;
+ void *mem;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ /* No partial writes. */
+ if (*ppos != 0)
+ return -EINVAL;
+
+ /* Duplicate user memory to safely parse protobuf. */
+ mem = memdup_user(buf, count);
+ if (IS_ERR(mem))
+ return PTR_ERR(mem);
+
+ err = csm_update_config_from_buffer(mem, count);
+ if (!err)
+ err = count;
+
+ kfree(mem);
+ return err;
+}
+
+static const struct file_operations csm_config_fops = {
+ .write = csm_config_write,
+};
+
+static void csm_enable(void)
+{
+ schema_ConfigurationRequest req = schema_ConfigurationRequest_init_zero;
+
+ /* Expect the lock to be held for write before this call. */
+ lockdep_assert_held_exclusive(&csm_rwsem_config);
+
+ /* Default configuration */
+ req.container_config.enabled = true;
+ req.execute_config.enabled = true;
+ req.execute_config.argv_limit = UINT_MAX;
+ req.execute_config.envp_limit = UINT_MAX;
+ req.memexec_config.enabled = true;
+ csm_update_config(&req);
+}
+
+static void csm_disable(void)
+{
+ schema_ConfigurationRequest req = schema_ConfigurationRequest_init_zero;
+
+ /* Expect the lock to be held for write before this call. */
+ lockdep_assert_held_exclusive(&csm_rwsem_config);
+
+	/* A zeroed configuration disables all collectors. */
+ csm_update_config(&req);
+ pr_info("disabled\n");
+}
+
+static ssize_t csm_enabled_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ const char *str = csm_enabled ? "1\n" : "0\n";
+
+ return simple_read_from_buffer(buf, count, ppos, str, 2);
+}
+
+static ssize_t csm_enabled_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ bool enabled;
+ int err;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ if (count <= 0 || count > PAGE_SIZE || *ppos)
+ return -EINVAL;
+
+ err = kstrtobool_from_user(buf, count, &enabled);
+ if (err)
+ return err;
+
+ down_write(&csm_rwsem_config);
+
+ if (enabled)
+ csm_enable();
+ else
+ csm_disable();
+
+ up_write(&csm_rwsem_config);
+
+ return count;
+}
+
+static const struct file_operations csm_enabled_fops = {
+ .read = csm_enabled_read,
+ .write = csm_enabled_write,
+};
+
+static int csm_config_version_open(struct inode *inode, struct file *file)
+{
+ /* private_data is used to keep the latest config version read. */
+ file->private_data = (void*)-1;
+ return 0;
+}
+
+static ssize_t csm_config_version_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ unsigned long version = config_version;
+ file->private_data = (void*)version;
+ return simple_read_from_buffer(buf, count, ppos, &version,
+ sizeof(version));
+}
+
+static __poll_t csm_config_version_poll(struct file *file,
+ struct poll_table_struct *poll_tab)
+{
+ if ((unsigned long)file->private_data != config_version)
+ return EPOLLIN;
+ poll_wait(file, &config_wait, poll_tab);
+ if ((unsigned long)file->private_data != config_version)
+ return EPOLLIN;
+ return 0;
+}
+
+static const struct file_operations csm_config_version_fops = {
+ .open = csm_config_version_open,
+ .read = csm_config_version_read,
+ .poll = csm_config_version_poll,
+};
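
Reads of config_version return the raw unsigned long value, and poll() reports EPOLLIN once the version no longer matches the last value read on that open file. A hedged userspace watcher, assuming securityfs at /sys/kernel/security:

    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            unsigned long version;
            struct pollfd pfd;
            int fd = open("/sys/kernel/security/container_monitor/config_version",
                          O_RDONLY);

            if (fd < 0)
                    return 1;
            pfd.fd = fd;
            pfd.events = POLLIN;

            for (;;) {
                    /* Each read records the version seen in file->private_data. */
                    if (pread(fd, &version, sizeof(version), 0) != sizeof(version))
                            break;
                    printf("config version: %lu\n", version);
                    /* Returns once config_version differs from the value read above. */
                    if (poll(&pfd, 1, -1) < 0)
                            break;
            }
            close(fd);
            return 0;
    }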
+
+static int csm_pipe_open(struct inode *inode, struct file *file)
+{
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+ if (!csm_enabled)
+ return -EAGAIN;
+ return 0;
+}
+
+/* Similar to file_clone_open that is available only in 4.19 and up. */
+static inline struct file *pipe_clone_open(struct file *file)
+{
+ return dentry_open(&file->f_path, file->f_flags, file->f_cred);
+}
+
+/* Check if the pipe is still used, else recreate and dup it. */
+static struct file *csm_dup_pipe(void)
+{
+ long pipe_size = 1024 * PAGE_SIZE;
+ long actual_size;
+ struct file *pipes[2] = {NULL, NULL};
+ struct file *ret;
+ int err;
+
+ down_write(&csm_rwsem_pipe);
+
+ err = close_pipe_files(false);
+ if (err) {
+ ret = ERR_PTR(err);
+ csm_stats.pipe_already_opened++;
+ goto out;
+ }
+
+ err = create_pipe_files(pipes, O_NONBLOCK);
+ if (err) {
+ ret = ERR_PTR(err);
+ goto out;
+ }
+
+ /*
+	 * Try to increase the pipe size to 1024 pages; if there is not
+	 * enough memory, the pipe stays at its default size.
+ */
+ actual_size = pipe_fcntl(pipes[0], F_SETPIPE_SZ, pipe_size);
+ if (actual_size != pipe_size)
+ pr_err("failed to resize pipe to 1024 pages, error: %ld, fallback to the default value\n",
+ actual_size);
+
+ csm_user_read_pipe = pipes[0];
+ csm_user_write_pipe = pipes[1];
+
+ /* Clone the file so we can track if the reader is still used. */
+ ret = pipe_clone_open(csm_user_read_pipe);
+
+out:
+ up_write(&csm_rwsem_pipe);
+ return ret;
+}
+
+static ssize_t csm_pipe_read(struct file *file, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ int fd;
+ ssize_t err;
+ struct file *local_pipe;
+
+ /* No partial reads. */
+ if (*ppos != 0)
+ return -EINVAL;
+
+ fd = get_unused_fd_flags(0);
+ if (fd < 0)
+ return fd;
+
+ local_pipe = csm_dup_pipe();
+ if (IS_ERR(local_pipe)) {
+ err = PTR_ERR(local_pipe);
+ local_pipe = NULL;
+ goto error;
+ }
+
+ err = simple_read_from_buffer(buf, count, ppos, &fd, sizeof(fd));
+ if (err < 0)
+ goto error;
+
+ if (err < sizeof(fd)) {
+ err = -EINVAL;
+ goto error;
+ }
+
+ /* Install the file descriptor when we know everything succeeded. */
+ fd_install(fd, local_pipe);
+
+ csm_enumerate_processes();
+
+ return err;
+
+error:
+ if (local_pipe)
+ fput(local_pipe);
+ put_unused_fd(fd);
+ return err;
+}
+
+
+static const struct file_operations csm_pipe_fops = {
+ .open = csm_pipe_open,
+ .read = csm_pipe_read,
+};
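
Reading the "pipe" control file is unusual: it creates (or reuses) the internal pipe, installs the read end into the caller's file table, and returns that descriptor number in the read buffer. A hedged consumer sketch; the event framing on the pipe is defined elsewhere in this series, so the loop below just drains raw bytes:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[4096];
            int event_fd;
            ssize_t n;
            int ctl = open("/sys/kernel/security/container_monitor/pipe",
                           O_RDONLY);

            if (ctl < 0)
                    return 1;
            /* The control read returns the number of an fd installed for us. */
            if (read(ctl, &event_fd, sizeof(event_fd)) != sizeof(event_fd))
                    return 1;
            /* The pipe is created O_NONBLOCK; a real consumer would poll() it. */
            while ((n = read(event_fd, buf, sizeof(buf))) > 0)
                    fprintf(stderr, "got %zd bytes of event data\n", n);
            return 0;
    }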
+
+static void set_container_decode_callbacks(schema_Container *container)
+{
+ container->pod_namespace.funcs.decode = pb_decode_string_field;
+ container->pod_name.funcs.decode = pb_decode_string_field;
+ container->container_name.funcs.decode = pb_decode_string_field;
+ container->container_image_uri.funcs.decode = pb_decode_string_field;
+ container->labels.funcs.decode = pb_decode_string_array;
+}
+
+static void set_container_encode_callbacks(schema_Container *container)
+{
+ container->pod_namespace.funcs.encode = pb_encode_string_field;
+ container->pod_name.funcs.encode = pb_encode_string_field;
+ container->container_name.funcs.encode = pb_encode_string_field;
+ container->container_image_uri.funcs.encode = pb_encode_string_field;
+ container->labels.funcs.encode = pb_encode_string_array;
+}
+
+static void free_container_callbacks_args(schema_Container *container)
+{
+ kfree(container->pod_namespace.arg);
+ kfree(container->pod_name.arg);
+ kfree(container->container_name.arg);
+ kfree(container->container_image_uri.arg);
+ kfree(container->labels.arg);
+}
+
+static ssize_t csm_container_write(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ ssize_t err = 0;
+ void *mem;
+ u64 cid;
+ pb_istream_t istream;
+ struct task_struct *task;
+ schema_ContainerReport report = schema_ContainerReport_init_zero;
+ schema_Event event = schema_Event_init_zero;
+ schema_Container *container;
+ char *uuid = NULL;
+
+ /* Notify that this collector is not yet enabled. */
+ if (!csm_container_enabled)
+ return -EAGAIN;
+
+ /* No partial writes. */
+ if (*ppos != 0)
+ return -EINVAL;
+
+ /* Duplicate user memory to safely parse protobuf. */
+ mem = memdup_user(buf, count);
+ if (IS_ERR(mem))
+ return PTR_ERR(mem);
+
+ /* Callback to decode string in protobuf. */
+ set_container_decode_callbacks(&report.container);
+
+ istream = pb_istream_from_buffer(mem, count);
+ if (!pb_decode(&istream, schema_ContainerReport_fields, &report)) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ /* Check protobuf is as expected */
+ if (report.pid == 0 ||
+ report.container.container_id != 0) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ /* Find if the process id is linked to an existing container-id. */
+ rcu_read_lock();
+ task = find_task_by_pid_ns(report.pid, &init_pid_ns);
+ if (task) {
+ cid = audit_get_contid(task);
+ if (cid == AUDIT_CID_UNSET)
+ err = -ENOENT;
+ } else {
+ err = -ENOENT;
+ }
+ rcu_read_unlock();
+
+ if (err)
+ goto out;
+
+ uuid = kzalloc(PROCESS_UUID_SIZE, GFP_KERNEL);
+	if (!uuid) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+ /* Provide the uuid for the top process of the container. */
+ err = get_process_uuid_by_pid(report.pid, uuid, PROCESS_UUID_SIZE);
+ if (err)
+ goto out;
+
+ /* Correct the container-id and feed the event to vsock */
+ report.container.container_id = cid;
+ report.container.init_uuid.funcs.encode = pb_encode_uuid_field;
+ report.container.init_uuid.arg = uuid;
+ container = &event.event.container.container;
+ *container = report.container;
+
+ /* Use encode callback to generate the final proto. */
+ set_container_encode_callbacks(container);
+
+ event.which_event = schema_Event_container_tag;
+
+ err = csm_sendeventproto(schema_Event_fields, &event);
+ if (!err)
+ err = count;
+
+out:
+ /* Free any allocated nanopb callback arguments. */
+ free_container_callbacks_args(&report.container);
+ kfree(uuid);
+ kfree(mem);
+ return err;
+}
+
+static const struct file_operations csm_container_fops = {
+ .write = csm_container_write,
+};
+
+static int csm_show_stats(struct seq_file *p, void *v)
+{
+ size_t i;
+
+ for (i = 0; i < ARRAY_SIZE(csm_stats_mapping); i++) {
+ seq_printf(p, "%s:\t%zu\n",
+ csm_stats_mapping[i].key,
+ *csm_stats_mapping[i].value);
+ }
+
+ return 0;
+}
+
+static int csm_stats_open(struct inode *inode, struct file *file)
+{
+ size_t i, size = 1; /* Start at one for the null byte. */
+
+ for (i = 0; i < ARRAY_SIZE(csm_stats_mapping); i++) {
+ /*
+ * Calculate the maximum length:
+ * - Length of the key
+ * - 3 additional chars :\t\n
+ * - longest unsigned 64-bit integer.
+ */
+ size += strlen(csm_stats_mapping[i].key)
+ + 3 + sizeof("18446744073709551615");
+ }
+
+ return single_open_size(file, csm_show_stats, NULL, size);
+}
+
+static const struct file_operations csm_stats_fops = {
+ .open = csm_stats_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
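
The resulting stats file renders one counter per line as "Key:<tab>value"; illustrative output (values are made up):

    ProtoEncodingFailed:	0
    WorkQueueFailed:	0
    EventWritingFailed:	2
    SizePickingFailed:	0
    PipeAlreadyOpened:	1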
+
+/* Prevent user-mode from using vsock on our port. */
+static int csm_socket_connect(struct socket *sock, struct sockaddr *address,
+ int addrlen)
+{
+ struct sockaddr_vm *vaddr = (struct sockaddr_vm *)address;
+
+ /* Filter only vsock sockets */
+ if (!sock->sk || sock->sk->sk_family != AF_VSOCK)
+ return 0;
+
+ /* Allow kernel sockets. */
+ if (sock->sk->sk_kern_sock)
+ return 0;
+
+ if (addrlen < sizeof(*vaddr))
+ return -EINVAL;
+
+ /* Forbid access to the CSM VMM backend port. */
+ if (vaddr->svm_port == CSM_HOST_PORT)
+ return -EPERM;
+
+ return 0;
+}
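
The effect of this hook is that any user-mode socket connecting to CSM_HOST_PORT (4444) over AF_VSOCK gets -EPERM; only kernel sockets are exempt. An illustrative probe from userspace, expected to fail while the hook is registered:

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <linux/vm_sockets.h>

    int main(void)
    {
            struct sockaddr_vm addr;
            int s = socket(AF_VSOCK, SOCK_STREAM, 0);

            if (s < 0)
                    return 1;
            memset(&addr, 0, sizeof(addr));
            addr.svm_family = AF_VSOCK;
            addr.svm_cid = VMADDR_CID_HOST;
            addr.svm_port = 4444;   /* CSM_HOST_PORT */

            /* Expected: connect() fails with EPERM for any user-mode socket. */
            if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0)
                    perror("connect");
            close(s);
            return 0;
    }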
+
+static int csm_setxattr(struct dentry *dentry, const char *name,
+ const void *value, size_t size, int flags)
+{
+ if (csm_enabled && !strcmp(name, XATTR_SECURITY_CSM))
+ return -EPERM;
+ return 0;
+}
+
+static struct security_hook_list csm_hooks[] __lsm_ro_after_init = {
+ /* Track process execution. */
+ LSM_HOOK_INIT(bprm_check_security, csm_bprm_check_security),
+ LSM_HOOK_INIT(task_post_alloc, csm_task_post_alloc),
+ LSM_HOOK_INIT(task_exit, csm_task_exit),
+
+ /* Block vsock access when relevant. */
+ LSM_HOOK_INIT(socket_connect, csm_socket_connect),
+
+ /* Track memory execution */
+ LSM_HOOK_INIT(file_mprotect, csm_mprotect),
+ LSM_HOOK_INIT(mmap_file, csm_mmap_file),
+
+ /* Track file modification provenance. */
+ LSM_HOOK_INIT(file_pre_free_security, csm_file_pre_free),
+
+	/* Block modifying the csm xattr. */
+ LSM_HOOK_INIT(inode_setxattr, csm_setxattr),
+};
+
+static int __init csm_init(void)
+{
+ int err;
+
+ if (cmdline_boot_disabled)
+ return 0;
+
+ /*
+ * If cmdline_boot_vsock_enabled is false, only the event pool will be
+	 * allocated. The destroy function will clean up only what was reserved.
+ */
+ err = vsock_initialize();
+ if (err)
+ return err;
+
+ csm_dir = securityfs_create_dir("container_monitor", NULL);
+ if (IS_ERR(csm_dir)) {
+ err = PTR_ERR(csm_dir);
+ goto error;
+ }
+
+ csm_enabled_file = securityfs_create_file("enabled", 0644, csm_dir,
+ NULL, &csm_enabled_fops);
+ if (IS_ERR(csm_enabled_file)) {
+ err = PTR_ERR(csm_enabled_file);
+ goto error_rmdir;
+ }
+
+ csm_container_file = securityfs_create_file("container", 0200, csm_dir,
+ NULL, &csm_container_fops);
+ if (IS_ERR(csm_container_file)) {
+ err = PTR_ERR(csm_container_file);
+ goto error_rm_enabled;
+ }
+
+ csm_config_vers_file = securityfs_create_file("config_version", 0400,
+ csm_dir, NULL,
+ &csm_config_version_fops);
+ if (IS_ERR(csm_config_vers_file)) {
+ err = PTR_ERR(csm_config_vers_file);
+ goto error_rm_container;
+ }
+
+ if (cmdline_boot_config_enabled) {
+ csm_config_file = securityfs_create_file("config", 0200,
+ csm_dir, NULL,
+ &csm_config_fops);
+ if (IS_ERR(csm_config_file)) {
+ err = PTR_ERR(csm_config_file);
+ goto error_rm_config_vers;
+ }
+ }
+
+ if (cmdline_boot_pipe_enabled) {
+ csm_pipe_file = securityfs_create_file("pipe", 0400, csm_dir,
+ NULL, &csm_pipe_fops);
+ if (IS_ERR(csm_pipe_file)) {
+ err = PTR_ERR(csm_pipe_file);
+ goto error_rm_config;
+ }
+ }
+
+ csm_stats_file = securityfs_create_file("stats", 0400, csm_dir,
+ NULL, &csm_stats_fops);
+ if (IS_ERR(csm_stats_file)) {
+ err = PTR_ERR(csm_stats_file);
+ goto error_rm_pipe;
+ }
+
+ pr_debug("created securityfs control files\n");
+
+ security_add_hooks(csm_hooks, ARRAY_SIZE(csm_hooks), "csm");
+ pr_debug("registered hooks\n");
+
+ /* Off-by-default, only used for testing images. */
+ if (cmdline_default_enabled) {
+ down_write(&csm_rwsem_config);
+ csm_enable();
+ up_write(&csm_rwsem_config);
+ }
+
+ return 0;
+
+error_rm_pipe:
+ if (cmdline_boot_pipe_enabled)
+ securityfs_remove(csm_pipe_file);
+error_rm_config:
+ if (cmdline_boot_config_enabled)
+ securityfs_remove(csm_config_file);
+error_rm_config_vers:
+ securityfs_remove(csm_config_vers_file);
+error_rm_container:
+ securityfs_remove(csm_container_file);
+error_rm_enabled:
+ securityfs_remove(csm_enabled_file);
+error_rmdir:
+ securityfs_remove(csm_dir);
+error:
+ vsock_destroy();
+ pr_warn("fs initialization error: %d", err);
+ return err;
+}
+
+late_initcall(csm_init);
diff --git a/security/container/monitor.h b/security/container/monitor.h
new file mode 100644
index 0000000..cab3d5b
--- /dev/null
+++ b/security/container/monitor.h
@@ -0,0 +1,123 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Container Security Monitor module
+ *
+ * Copyright (c) 2018 Google, Inc
+ */
+
+#define pr_fmt(fmt) "container-security-monitor: " fmt
+
+#include <linux/kernel.h>
+#include <linux/security.h>
+#include <linux/fs.h>
+#include <linux/rwsem.h>
+#include <linux/binfmts.h>
+#include <linux/xattr.h>
+#include <config.pb.h>
+#include <event.pb.h>
+#include <pb_encode.h>
+#include <pb_decode.h>
+
+#include "monitoring_protocol.h"
+
+/* Part of the CSM configuration response. */
+#define CSM_VERSION 1
+
+/* protects csm_*_enabled and configurations. */
+extern struct rw_semaphore csm_rwsem_config;
+
+/* protects csm_host_port and csm_vsocket. */
+extern struct rw_semaphore csm_rwsem_vsocket;
+
+/* Port to connect to the host on, over virtio-vsock */
+#define CSM_HOST_PORT 4444
+
+/*
+ * Is monitoring enabled? Defaults to disabled.
+ * These variables may be read as gates without locking (the processor
+ * guarantees consistent access to naturally aligned scalar values), so hooks
+ * can bail quickly.
+ */
+extern bool csm_enabled;
+extern bool csm_execute_enabled;
+extern bool csm_memexec_enabled;
+
+/* Configuration options for execute collector. */
+struct execute_config {
+ size_t argv_limit;
+ size_t envp_limit;
+ char *envp_allowlist;
+};
+
+extern struct execute_config csm_execute_config;
+
+/* pipe to forward vsock packets to user-mode. */
+extern struct rw_semaphore csm_rwsem_pipe;
+extern struct file *csm_user_write_pipe;
+
+/* Was vsock enabled at boot time? */
+extern bool cmdline_boot_vsock_enabled;
+
+/* Stats on LSM events. */
+struct container_stats {
+ size_t proto_encoding_failed;
+ size_t event_writing_failed;
+ size_t workqueue_failed;
+ size_t size_picking_failed;
+ size_t pipe_already_opened;
+};
+
+extern struct container_stats csm_stats;
+
+/* Standard stream file descriptor numbers, not defined in kernel headers. */
+#define STDIN_FILENO 0
+#define STDOUT_FILENO 1
+#define STDERR_FILENO 2
+
+/* security attribute for file provenance. */
+#define XATTR_SECURITY_CSM XATTR_SECURITY_PREFIX "csm"
+
+/* monitor functions */
+int csm_update_config_from_buffer(void *data, size_t size);
+
+/* vsock functions */
+int vsock_initialize(void);
+void vsock_destroy(void);
+int vsock_late_initialize(void);
+int csm_sendeventproto(const pb_field_t fields[], schema_Event *event);
+int csm_sendconfigrespproto(const pb_field_t fields[],
+ schema_ConfigurationResponse *resp);
+
+/* process events functions */
+int csm_bprm_check_security(struct linux_binprm *bprm);
+void csm_task_exit(struct task_struct *task);
+void csm_task_post_alloc(struct task_struct *task);
+int get_process_uuid_by_pid(pid_t pid_nr, char *buffer, size_t size);
+
+/* memory execution events functions */
+int csm_mprotect(struct vm_area_struct *vma, unsigned long reqprot,
+ unsigned long prot);
+int csm_mmap_file(struct file *file, unsigned long reqprot,
+ unsigned long prot, unsigned long flags);
+
+/* Tracking of file modification provenance. */
+void csm_file_pre_free(struct file *file);
+
+/* nano functions */
+bool pb_encode_string_field(pb_ostream_t *stream, const pb_field_t *field,
+ void * const *arg);
+bool pb_decode_string_field(pb_istream_t *stream, const pb_field_t *field,
+ void **arg);
+ssize_t pb_encode_string_field_limit(pb_ostream_t *stream,
+ const pb_field_t *field,
+ void * const *arg, size_t limit);
+bool pb_encode_string_array(pb_ostream_t *stream, const pb_field_t *field,
+ void * const *arg);
+bool pb_decode_string_array(pb_istream_t *stream, const pb_field_t *field,
+ void **arg);
+bool pb_encode_uuid_field(pb_ostream_t *stream, const pb_field_t *field,
+ void * const *arg);
+bool pb_encode_ip4(pb_ostream_t *stream, const pb_field_t *field,
+ void * const *arg);
+bool pb_encode_ip6(pb_ostream_t *stream, const pb_field_t *field,
+ void * const *arg);
+
diff --git a/security/container/monitoring_protocol.h b/security/container/monitoring_protocol.h
new file mode 100644
index 0000000..dbdfc9c
--- /dev/null
+++ b/security/container/monitoring_protocol.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+
+/* Container security monitoring protocol definitions */
+
+#include <linux/types.h>
+
+enum csm_msgtype {
+ CSM_MSG_TYPE_HEARTBEAT = 1,
+ CSM_MSG_EVENT_PROTO = 2,
+ CSM_MSG_CONFIG_REQUEST_PROTO = 3,
+ CSM_MSG_CONFIG_RESPONSE_PROTO = 4,
+};
+
+struct csm_msg_hdr {
+ __le32 msg_type;
+ __le32 msg_length;
+};
+
+/* The process uuid is a 128-bits identifier */
+#define PROCESS_UUID_SIZE 16
+
+/* The entire structure forms the collision domain. */
+union process_uuid {
+ struct {
+ __u32 machineid;
+ __u64 start_time;
+ __u32 tgid;
+ } __attribute__((packed));
+ __u8 data[PROCESS_UUID_SIZE];
+};
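
Because the uuid is just this packed struct viewed as 16 bytes, a host-side consumer can recover the fields directly. A minimal sketch mirroring the union (byte-order handling is omitted, which only works if producer and consumer share endianness):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* User-space mirror of union process_uuid. */
    struct process_uuid_view {
            uint32_t machineid;
            uint64_t start_time;
            uint32_t tgid;
    } __attribute__((packed));

    static void dump_uuid(const uint8_t data[16])
    {
            struct process_uuid_view v;

            memcpy(&v, data, sizeof(v));
            printf("machine=%u start_time=%llu tgid=%u\n",
                   v.machineid, (unsigned long long)v.start_time, v.tgid);
    }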
diff --git a/security/container/pb.c b/security/container/pb.c
new file mode 100644
index 0000000..f24e58f
--- /dev/null
+++ b/security/container/pb.c
@@ -0,0 +1,175 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Container Security Monitor module
+ *
+ * Copyright (c) 2018 Google, Inc
+ */
+
+#include "monitor.h"
+
+#include <linux/string.h>
+#include <net/sock.h>
+#include <net/tcp.h>
+#include <net/ipv6.h>
+
+bool pb_encode_string_field(pb_ostream_t *stream, const pb_field_t *field,
+ void * const *arg)
+{
+ const uint8_t *str = (const uint8_t *)*arg;
+
+ /* If the string is not set, skip this string. */
+ if (!str)
+ return true;
+
+ if (!pb_encode_tag_for_field(stream, field))
+ return false;
+
+ return pb_encode_string(stream, str, strlen(str));
+}
+
+bool pb_decode_string_field(pb_istream_t *stream, const pb_field_t *field,
+ void **arg)
+{
+ size_t size;
+ void *data;
+
+ *arg = NULL;
+
+ size = stream->bytes_left;
+
+ /* Ensure a null-byte at the end */
+ if (size + 1 < size)
+ return false;
+
+ data = kzalloc(size + 1, GFP_KERNEL);
+ if (!data)
+ return false;
+
+ if (!pb_read(stream, data, size)) {
+ kfree(data);
+ return false;
+ }
+
+ *arg = data;
+
+ return true;
+}
+
+bool pb_encode_string_array(pb_ostream_t *stream, const pb_field_t *field,
+ void * const *arg)
+{
+ char *strs = (char *)*arg;
+
+ /* If the string array is not set, skip this string array. */
+ if (!strs)
+ return true;
+
+ do {
+ if (!pb_encode_string_field(stream, field,
+ (void * const *) &strs))
+ return false;
+
+ strs += strlen(strs) + 1;
+ } while (*strs != 0);
+
+ return true;
+}
+
+/* Limit the encoded string size and return how many characters were added. */
+ssize_t pb_encode_string_field_limit(pb_ostream_t *stream,
+ const pb_field_t *field,
+ void * const *arg, size_t limit)
+{
+ char *str = (char *)*arg;
+ size_t length;
+
+ /* If the string is not set, skip this string. */
+ if (!str)
+ return 0;
+
+ if (!pb_encode_tag_for_field(stream, field))
+ return -EINVAL;
+
+ length = strlen(str);
+ if (length > limit)
+ length = limit;
+
+ if (!pb_encode_string(stream, (uint8_t *)str, length))
+ return -EINVAL;
+
+ return length;
+}
+
+bool pb_decode_string_array(pb_istream_t *stream, const pb_field_t *field,
+ void **arg)
+{
+ size_t needed, used = 0;
+ char *data, *strs;
+
+ /* String length, and two null-bytes for the end of the list. */
+ needed = stream->bytes_left + 2;
+ if (needed < stream->bytes_left)
+ return false;
+
+ if (*arg) {
+ /* Calculate used space from the current list. */
+ strs = (char *)*arg;
+ do {
+ used += strlen(strs + used) + 1;
+ } while (strs[used] != 0);
+
+ if (used + needed < needed)
+ return false;
+ }
+
+ data = krealloc(*arg, used + needed, GFP_KERNEL);
+ if (!data)
+ return false;
+
+ /* Will always be freed by the caller */
+ *arg = data;
+
+ /* Reset the new part of the buffer. */
+ memset(data + used, 0, needed);
+
+ /* Read what's in the stream buffer only. */
+ if (!pb_read(stream, data + used, stream->bytes_left))
+ return false;
+
+ return true;
+}
+
+bool pb_encode_fixed_string(pb_ostream_t *stream, const pb_field_t *field,
+ const uint8_t *data, size_t length)
+{
+ /* If the data is not set, skip this string. */
+ if (!data)
+ return true;
+
+ if (!pb_encode_tag_for_field(stream, field))
+ return false;
+
+ return pb_encode_string(stream, data, length);
+}
+
+
+bool pb_encode_uuid_field(pb_ostream_t *stream, const pb_field_t *field,
+ void * const *arg)
+{
+ return pb_encode_fixed_string(stream, field, (const uint8_t *)*arg,
+ PROCESS_UUID_SIZE);
+}
+
+bool pb_encode_ip4(pb_ostream_t *stream, const pb_field_t *field,
+ void * const *arg)
+{
+ return pb_encode_fixed_string(stream, field, (const uint8_t *)*arg,
+ sizeof(struct in_addr));
+}
+
+bool pb_encode_ip6(pb_ostream_t *stream, const pb_field_t *field,
+ void * const *arg)
+{
+ return pb_encode_fixed_string(stream, field, (const uint8_t *)*arg,
+ sizeof(struct in6_addr));
+}
diff --git a/security/container/process.c b/security/container/process.c
new file mode 100644
index 0000000..a0c33e7
--- /dev/null
+++ b/security/container/process.c
@@ -0,0 +1,1149 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Container Security Monitor module
+ *
+ * Copyright (c) 2018 Google, Inc
+ */
+
+#include "monitor.h"
+
+#include <linux/atomic.h>
+#include <linux/audit.h>
+#include <linux/file.h>
+#include <linux/highmem.h>
+#include <linux/mempool.h>
+#include <linux/mm.h>
+#include <linux/mount.h>
+#include <linux/notifier.h>
+#include <linux/net.h>
+#include <linux/path.h>
+#include <linux/pid.h>
+#include <linux/pid_namespace.h>
+#include <linux/random.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/task.h>
+#include <linux/slab.h>
+#include <linux/socket.h>
+#include <linux/timekeeping.h>
+#include <linux/vmalloc.h>
+#include <linux/workqueue.h>
+#include <linux/xattr.h>
+#include <net/ipv6.h>
+#include <net/sock.h>
+#include <net/tcp.h>
+#include <overlayfs/overlayfs.h>
+#include <uapi/linux/magic.h>
+#include <uapi/asm/mman.h>
+
+/* Configuration options for execute collector. */
+struct execute_config csm_execute_config;
+
+/* unique atomic value for the machine boot instance */
+static atomic_t machine_rand = ATOMIC_INIT(0);
+
+/* sequential container identifier */
+static atomic_t contid = ATOMIC_INIT(0);
+
+/* Generation id for each enumeration invocation. */
+static atomic_t enumeration_count = ATOMIC_INIT(0);
+
+struct file_provenance {
+ /* pid of the process doing the first write. */
+ pid_t tgid;
+ /* start_time of the process to uniquely identify it. */
+ u64 start_time;
+};
+
+struct csm_enumerate_processes_work_data {
+ struct work_struct work;
+ int enumeration_count;
+};
+
+static void *kmap_argument_stack(struct linux_binprm *bprm, void **ctx)
+{
+ char *argv;
+ int err;
+ unsigned long i, pos, count;
+ void *map;
+ struct page *page;
+
+ /* vma_pages() returns the number of pages reserved for the stack */
+ count = vma_pages(bprm->vma);
+
+ if (likely(count == 1)) {
+ err = get_user_pages_remote(current, bprm->mm, bprm->p, 1,
+ FOLL_FORCE, &page, NULL, NULL);
+ if (err != 1)
+ return NULL;
+
+ argv = kmap(page);
+ *ctx = page;
+ } else {
+ /*
+		 * If more than one page is needed, copy all of them into one
+		 * contiguous buffer. Parsing the arguments across separately
+		 * kmap'd pages at different addresses would be impractical.
+ */
+ argv = vmalloc(count * PAGE_SIZE);
+ if (!argv)
+ return NULL;
+
+ for (i = 0; i < count; i++) {
+ pos = ALIGN_DOWN(bprm->p, PAGE_SIZE) + i * PAGE_SIZE;
+ err = get_user_pages_remote(current, bprm->mm, pos, 1,
+ FOLL_FORCE, &page, NULL,
+ NULL);
+ if (err <= 0) {
+ vfree(argv);
+ return NULL;
+ }
+
+ map = kmap(page);
+ memcpy(argv + i * PAGE_SIZE, map, PAGE_SIZE);
+ kunmap(page);
+ put_page(page);
+ }
+ *ctx = bprm;
+ }
+
+ return argv;
+}
+
+static void kunmap_argument_stack(struct linux_binprm *bprm, void *addr,
+ void *ctx)
+{
+ struct page *page;
+
+ if (!addr)
+ return;
+
+ if (likely(vma_pages(bprm->vma) == 1)) {
+ page = (struct page *)ctx;
+ kunmap(page);
+ put_page(ctx);
+ } else {
+ vfree(addr);
+ }
+}
+
+static char *find_array_next_entry(char *array, unsigned long *offset,
+ unsigned long end)
+{
+ char *entry;
+ unsigned long off = *offset;
+
+ if (off >= end)
+ return NULL;
+
+	/* Check the entry is null-terminated and in bounds */
+ entry = array + off;
+ while (array[off]) {
+ if (++off >= end)
+ return NULL;
+ }
+
+ /* Pass the null byte for the next iteration */
+ *offset = off + 1;
+
+ return entry;
+}
+
+struct string_arr_ctx {
+ struct linux_binprm *bprm;
+ void *stack;
+};
+
+static size_t get_config_limit(size_t *config_ptr)
+{
+ lockdep_assert_held_read(&csm_rwsem_config);
+
+ /*
+ * If execute is not enabled, do not capture arguments.
+ * The vsock packet won't be sent anyway.
+ */
+ if (!csm_execute_enabled)
+ return 0;
+
+ return *config_ptr;
+}
+
+static bool encode_current_argv(pb_ostream_t *stream, const pb_field_t *field,
+ void * const *arg)
+{
+ struct string_arr_ctx *ctx = (struct string_arr_ctx *)*arg;
+ int i;
+ struct linux_binprm *bprm = ctx->bprm;
+ unsigned long offset = bprm->p % PAGE_SIZE;
+ unsigned long end = vma_pages(bprm->vma) * PAGE_SIZE;
+ char *argv = ctx->stack;
+ char *entry;
+ size_t limit, used = 0;
+ ssize_t ret;
+
+ limit = get_config_limit(&csm_execute_config.argv_limit);
+ if (!limit)
+ return true;
+
+ for (i = 0; i < bprm->argc; i++) {
+ entry = find_array_next_entry(argv, &offset, end);
+ if (!entry)
+ return false;
+
+ ret = pb_encode_string_field_limit(stream, field,
+ (void * const *)&entry,
+ limit - used);
+ if (ret < 0)
+ return false;
+
+ used += ret;
+
+ if (used >= limit)
+ break;
+ }
+
+ return true;
+}
+
+static bool check_envp_allowlist(char *envp)
+{
+ bool ret = false;
+ char *strs, *equal;
+ size_t str_size, equal_pos;
+
+ /* If execute is not enabled, skip all. */
+ if (!csm_execute_enabled)
+ goto out;
+
+ /* No filter, allow all. */
+ strs = csm_execute_config.envp_allowlist;
+ if (!strs) {
+ ret = true;
+ goto out;
+ }
+
+ /*
+ * Identify the key=value separation.
+ * If none exists use the whole string as a key.
+ */
+ equal = strchr(envp, '=');
+ equal_pos = equal ? (equal - envp) : strlen(envp);
+
+ /* Default to skip if no match found. */
+ ret = false;
+
+ do {
+ str_size = strlen(strs);
+
+ /*
+		 * If the filter length aligns with the position of the key's
+		 * equal sign, it might be a match; compare the keys.
+ */
+ if (str_size == equal_pos &&
+ !strncmp(strs, envp, str_size)) {
+ ret = true;
+ goto out;
+ }
+
+ strs += str_size + 1;
+ } while (*strs != 0);
+
+out:
+ return ret;
+}
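
The allowlist walked above is the NUL-separated, double-NUL-terminated buffer produced by pb_decode_string_array() in pb.c; each entry names an environment key to keep. An illustrative layout (not a shipped default):

    /*
     * The string literal's implicit terminator supplies the second trailing
     * NUL that ends the list.
     */
    static const char example_allowlist[] = "PATH\0LANG\0";

    /*
     * With csm_execute_config.envp_allowlist pointing at the buffer above:
     *   "PATH=/usr/bin"  -> kept    (key length and bytes match "PATH")
     *   "HOME=/root"     -> dropped (no matching key)
     */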
+
+static bool encode_current_envp(pb_ostream_t *stream, const pb_field_t *field,
+ void * const *arg)
+{
+ struct string_arr_ctx *ctx = (struct string_arr_ctx *)*arg;
+ int i;
+ struct linux_binprm *bprm = ctx->bprm;
+ unsigned long offset = bprm->p % PAGE_SIZE;
+ unsigned long end = vma_pages(bprm->vma) * PAGE_SIZE;
+ char *argv = ctx->stack;
+ char *entry;
+ size_t limit, used = 0;
+ ssize_t ret;
+
+ limit = get_config_limit(&csm_execute_config.envp_limit);
+ if (!limit)
+ return true;
+
+ /* Skip arguments */
+ for (i = 0; i < bprm->argc; i++) {
+ if (!find_array_next_entry(argv, &offset, end))
+ return false;
+ }
+
+ for (i = 0; i < bprm->envc; i++) {
+ entry = find_array_next_entry(argv, &offset, end);
+ if (!entry)
+ return false;
+
+ if (!check_envp_allowlist(entry))
+ continue;
+
+ ret = pb_encode_string_field_limit(stream, field,
+ (void * const *)&entry,
+ limit - used);
+ if (ret < 0)
+ return false;
+
+ used += ret;
+
+ if (used >= limit)
+ break;
+ }
+
+ return true;
+}
+
+static bool is_overlayfs_mounted(struct file *file)
+{
+ struct vfsmount *mnt;
+ struct super_block *mnt_sb;
+
+ mnt = file->f_path.mnt;
+ if (mnt == NULL)
+ return false;
+
+ mnt_sb = mnt->mnt_sb;
+ if (mnt_sb == NULL || mnt_sb->s_magic != OVERLAYFS_SUPER_MAGIC)
+ return false;
+
+ return true;
+}
+
+/*
+ * Before the process starts, identify a possible container by checking that the
+ * task is in a non-init pid namespace and the target file sits on an overlayfs
+ * mount point. This check is valid for COS and GKE but not all existing containers.
+ */
+static bool is_possible_container(struct task_struct *task,
+ struct file *file)
+{
+ if (task_active_pid_ns(task) == &init_pid_ns)
+ return false;
+
+ return is_overlayfs_mounted(file);
+}
+
+/*
+ * Generates a random identifier for this boot instance.
+ * The identifier is generated lazily, on first use, so that more entropy is
+ * available than there would be at early boot.
+ */
+static u32 get_machine_id(void)
+{
+ int machineid, old;
+
+ machineid = atomic_read(&machine_rand);
+
+ if (unlikely(machineid == 0)) {
+ machineid = (int)get_random_int();
+ if (machineid == 0)
+ machineid = 1;
+ old = atomic_cmpxchg(&machine_rand, 0, machineid);
+
+ /* If someone beat us, use their value. */
+ if (old != 0)
+ machineid = old;
+ }
+
+ return (u32)machineid;
+}
+
+/*
+ * Generate a 128-bit unique identifier for the process by appending:
+ * - A machine identifier unique per boot.
+ * - The start time of the process in nanoseconds.
+ * - The tgid for the set of threads in a process.
+ */
+static int get_process_uuid(struct task_struct *task, char *buffer, size_t size)
+{
+ union process_uuid *id = (union process_uuid *)buffer;
+
+ memset(buffer, 0, size);
+
+ if (WARN_ON(size < PROCESS_UUID_SIZE))
+ return -EINVAL;
+
+ id->machineid = get_machine_id();
+ id->start_time = ktime_mono_to_real(task->group_leader->start_time);
+ id->tgid = task_tgid_nr(task);
+
+ return 0;
+}
+
+int get_process_uuid_by_pid(pid_t pid_nr, char *buffer, size_t size)
+{
+ int err;
+ struct task_struct *task = NULL;
+
+ rcu_read_lock();
+ task = find_task_by_pid_ns(pid_nr, &init_pid_ns);
+ if (!task) {
+ err = -ENOENT;
+ goto out;
+ }
+ err = get_process_uuid(task, buffer, size);
+out:
+ rcu_read_unlock();
+ return err;
+}
+
+static int get_process_uuid_from_xattr(struct file *file, char *buffer,
+ size_t size)
+{
+ struct dentry *dentry;
+ int err;
+ struct file_provenance prov;
+ union process_uuid *id = (union process_uuid *)buffer;
+
+ memset(buffer, 0, size);
+
+ if (WARN_ON(size < PROCESS_UUID_SIZE))
+ return -EINVAL;
+
+ /* The file is part of overlayfs on the upper layer. */
+ if (!is_overlayfs_mounted(file))
+ return -ENODATA;
+
+ dentry = ovl_dentry_upper(file->f_path.dentry);
+ if (!dentry)
+ return -ENODATA;
+
+ err = __vfs_getxattr(dentry, dentry->d_inode,
+ XATTR_SECURITY_CSM, &prov, sizeof(prov));
+ /* returns -ENODATA if the xattr does not exist. */
+ if (err < 0)
+ return err;
+ if (err != sizeof(prov)) {
+ pr_err("unexpected size for xattr: %zu -> %d\n",
+ size, err);
+ return -ENODATA;
+ }
+
+ id->machineid = get_machine_id();
+ id->start_time = prov.start_time;
+ id->tgid = prov.tgid;
+ return 0;
+}
+
+u64 csm_set_contid(struct task_struct *task)
+{
+ u64 cid;
+ struct pid_namespace *ns;
+
+ ns = task_active_pid_ns(task);
+ if (WARN_ON(!task->audit) || WARN_ON(!ns))
+ return AUDIT_CID_UNSET;
+
+ cid = atomic_inc_return(&contid);
+ task->audit->contid = cid;
+
+ /*
+ * If the namespace container-id is not set, use the one assigned
+ * to the first process created.
+ */
+ cmpxchg(&ns->cid, 0, cid);
+ return cid;
+}
+
+u64 csm_get_ns_contid(struct pid_namespace *ns)
+{
+ if (!ns || !ns->cid)
+ return AUDIT_CID_UNSET;
+
+ return ns->cid;
+}
+
+union ip_data {
+ struct in_addr ip4;
+ struct in6_addr ip6;
+};
+
+struct file_data {
+ void *allocated;
+ union ip_data local;
+ union ip_data remote;
+ char modified_uuid[PROCESS_UUID_SIZE];
+};
+
+static void free_file_data(struct file_data *fdata)
+{
+ free_page((unsigned long)fdata->allocated);
+ fdata->allocated = NULL;
+}
+
+static void fill_socket_description(struct sockaddr_storage *saddr,
+ union ip_data *idata,
+ schema_SocketIp *schema_socketip)
+{
+ struct sockaddr_in *sin4 = (struct sockaddr_in *)saddr;
+ struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)saddr;
+
+ schema_socketip->family = saddr->ss_family;
+
+ switch (saddr->ss_family) {
+ case AF_INET:
+ schema_socketip->port = ntohs(sin4->sin_port);
+ idata->ip4 = sin4->sin_addr;
+ schema_socketip->ip.funcs.encode = pb_encode_ip4;
+ schema_socketip->ip.arg = &idata->ip4;
+ break;
+ case AF_INET6:
+ schema_socketip->port = ntohs(sin6->sin6_port);
+ idata->ip6 = sin6->sin6_addr;
+ schema_socketip->ip.funcs.encode = pb_encode_ip6;
+ schema_socketip->ip.arg = &idata->ip6;
+ break;
+ }
+}
+
+static int fill_file_overlayfs(struct file *file, schema_File *schema_file,
+ struct file_data *fdata)
+{
+ struct dentry *dentry;
+ int err;
+ schema_Overlay *overlayfs;
+
+ /* If not an overlayfs superblock, done. */
+ if (!is_overlayfs_mounted(file))
+ return 0;
+
+ dentry = file->f_path.dentry;
+ schema_file->which_filesystem = schema_File_overlayfs_tag;
+ overlayfs = &schema_file->filesystem.overlayfs;
+ overlayfs->lower_layer = ovl_dentry_lower(dentry);
+ overlayfs->upper_layer = ovl_dentry_upper(dentry);
+
+ err = get_process_uuid_from_xattr(file, fdata->modified_uuid,
+ sizeof(fdata->modified_uuid));
+ /* If there is no xattr, just skip the modified_uuid field. */
+ if (err == -ENODATA)
+ return 0;
+ if (err < 0)
+ return err;
+
+ overlayfs->modified_uuid.funcs.encode = pb_encode_uuid_field;
+ overlayfs->modified_uuid.arg = fdata->modified_uuid;
+ return 0;
+}
+
+static int fill_file_description(struct file *file, schema_File *schema_file,
+ struct file_data *fdata)
+{
+ char *buf;
+ int err;
+ u32 mode;
+ char *path;
+ struct socket *socket;
+ schema_Socket *socketfs;
+ struct sockaddr_storage saddr;
+
+ memset(fdata, 0, sizeof(*fdata));
+
+ if (file == NULL)
+ return 0;
+
+ schema_file->ino = file_inode(file)->i_ino;
+ mode = file_inode(file)->i_mode;
+
+ /* For pipes, no need to resolve the path. */
+ if (S_ISFIFO(mode))
+ return 0;
+
+ if (S_ISSOCK(mode)) {
+ socket = (struct socket *)file->private_data;
+ socketfs = &schema_file->filesystem.socket;
+
+ /* Local socket */
+ err = kernel_getsockname(socket, (struct sockaddr *)&saddr);
+ if (err >= 0) {
+ fill_socket_description(&saddr, &fdata->local,
+ &socketfs->local);
+ }
+
+ /* Remote socket, might not be connected. */
+ err = kernel_getpeername(socket, (struct sockaddr *)&saddr);
+ if (err >= 0) {
+ fill_socket_description(&saddr, &fdata->remote,
+ &socketfs->remote);
+ }
+
+ schema_file->which_filesystem = schema_File_socket_tag;
+ return 0;
+ }
+
+ /*
+ * From this point on, we care about all other file types, since their
+ * path provides useful insight.
+ */
+ buf = (char *)__get_free_page(GFP_KERNEL);
+ if (buf == NULL)
+ return -ENOMEM;
+
+ fdata->allocated = buf;
+
+ path = d_path(&file->f_path, buf, PAGE_SIZE);
+ if (IS_ERR(path)) {
+ free_file_data(fdata);
+ return PTR_ERR(path);
+ }
+
+ schema_file->fullpath.funcs.encode = pb_encode_string_field;
+ schema_file->fullpath.arg = path; /* buf is freed in free_file_data. */
+
+ err = fill_file_overlayfs(file, schema_file, fdata);
+ if (err) {
+ free_file_data(fdata);
+ return err;
+ }
+
+ return 0;
+}
+
+static int fill_stream_description(schema_Descriptor *desc, int fd,
+ struct file_data *fdata)
+{
+ struct fd sfd;
+ struct file *file;
+ int err = 0;
+
+ sfd = fdget(fd);
+ file = sfd.file;
+
+ if (file == NULL) {
+ memset(fdata, 0, sizeof(*fdata));
+ goto end;
+ }
+
+ desc->mode = file_inode(file)->i_mode;
+ err = fill_file_description(file, &desc->file, fdata);
+
+end:
+ fdput(sfd);
+ return err;
+}
+
+static int populate_proc_uuid_common(schema_Process *proc, char *uuid,
+ size_t uuid_size, char *parent_uuid,
+ size_t parent_uuid_size,
+ struct task_struct *task)
+{
+ int err;
+ struct task_struct *parent;
+ /* Generate unique identifiers for the process and its parent */
+ err = get_process_uuid(task, uuid, uuid_size);
+ if (err)
+ return err;
+
+ proc->uuid.funcs.encode = pb_encode_uuid_field;
+ proc->uuid.arg = uuid;
+
+ rcu_read_lock();
+
+ if (!pid_alive(task))
+ goto out;
+ /*
+ * This likely does not need task_rcu_dereference() because real_parent
+ * is only supposed to be accessed under RCU.
+ */
+ parent = rcu_dereference(task->real_parent);
+
+ if (parent) {
+ err = get_process_uuid(parent, parent_uuid, parent_uuid_size);
+ if (!err) {
+ proc->parent_uuid.funcs.encode = pb_encode_uuid_field;
+ proc->parent_uuid.arg = parent_uuid;
+ }
+ }
+
+out:
+ rcu_read_unlock();
+
+ return err;
+}
+
+/* Populate the fields that we always want to set in Process messages. */
+static int populate_proc_common(schema_Process *proc, char *uuid,
+ size_t uuid_size, char *parent_uuid,
+ size_t parent_uuid_size,
+ struct task_struct *task)
+{
+ u64 cid;
+ struct pid_namespace *ns = task_active_pid_ns(task);
+
+ /* Container identifier for the current namespace. */
+ proc->container_id = csm_get_ns_contid(ns);
+
+ /*
+ * If the process container-id is different, the process tree is part of
+ * a different session within the namespace (kubectl/docker exec,
+ * liveness probe or others).
+ */
+ cid = audit_get_contid(task);
+ if (proc->container_id != cid)
+ proc->exec_session_id = cid;
+
+ /* Add information about pid in different namespaces */
+ proc->pid = task_pid_nr(task);
+ proc->parent_pid = task_ppid_nr(task);
+ proc->container_pid = task_pid_nr_ns(task, ns);
+ proc->container_parent_pid = task_ppid_nr_ns(task, ns);
+
+ return populate_proc_uuid_common(proc, uuid, uuid_size, parent_uuid,
+ parent_uuid_size, task);
+}
+
+int csm_bprm_check_security(struct linux_binprm *bprm)
+{
+ char uuid[PROCESS_UUID_SIZE];
+ char parent_uuid[PROCESS_UUID_SIZE];
+ int err;
+ schema_Event event = schema_Event_init_zero;
+ schema_Process *proc;
+ struct string_arr_ctx argv_ctx;
+ void *stack = NULL, *ctx = NULL;
+ u64 cid;
+ struct file_data path_data = {};
+ struct file_data stdin_data = {};
+ struct file_data stdout_data = {};
+ struct file_data stderr_data = {};
+
+ /*
+ * Always create a container-id for containerized processes.
+ * If the LSM is enabled later, we can track existing containers.
+ */
+ cid = audit_get_contid(current);
+
+ if (cid == AUDIT_CID_UNSET) {
+ if (!is_possible_container(current, bprm->file))
+ return 0;
+
+ cid = csm_set_contid(current);
+
+ if (cid == AUDIT_CID_UNSET)
+ return 0;
+ }
+
+ if (!csm_execute_enabled)
+ return 0;
+
+ /* The interpreter will call us again with more context. */
+ if (bprm->buf[0] == '#' && bprm->buf[1] == '!')
+ return 0;
+
+ proc = &event.event.execute.proc;
+ err = populate_proc_common(proc, uuid, sizeof(uuid), parent_uuid,
+ sizeof(parent_uuid), current);
+ if (err)
+ goto out_free_buf;
+
+ proc->creation_timestamp = ktime_get_real_ns();
+
+ /* Provide information about the launched binary. */
+ err = fill_file_description(bprm->file, &proc->binary, &path_data);
+ if (err)
+ goto out_free_buf;
+
+ /* Information about streams */
+ err = fill_stream_description(&proc->streams.stdin, STDIN_FILENO,
+ &stdin_data);
+ if (err)
+ goto out_free_buf;
+
+ err = fill_stream_description(&proc->streams.stdout, STDOUT_FILENO,
+ &stdout_data);
+ if (err)
+ goto out_free_buf;
+
+ err = fill_stream_description(&proc->streams.stderr, STDERR_FILENO,
+ &stderr_data);
+ if (err)
+ goto out_free_buf;
+
+ stack = kmap_argument_stack(bprm, &ctx);
+ if (!stack) {
+ err = -EFAULT;
+ goto out_free_buf;
+ }
+
+ /* Capture process argument */
+ argv_ctx.bprm = bprm;
+ argv_ctx.stack = stack;
+ proc->args.argv.funcs.encode = encode_current_argv;
+ proc->args.argv.arg = &argv_ctx;
+
+ /* Capture process environment variables */
+ proc->args.envp.funcs.encode = encode_current_envp;
+ proc->args.envp.arg = &argv_ctx;
+
+ event.which_event = schema_Event_execute_tag;
+
+ /*
+ * Configuration options are checked when computing the serialized
+ * protobufs.
+ */
+ down_read(&csm_rwsem_config);
+ err = csm_sendeventproto(schema_Event_fields, &event);
+ up_read(&csm_rwsem_config);
+
+ if (err)
+ pr_err("csm_sendeventproto returned %d on execve\n", err);
+ err = 0;
+
+out_free_buf:
+ kunmap_argument_stack(bprm, stack, ctx);
+ free_file_data(&path_data);
+ free_file_data(&stdin_data);
+ free_file_data(&stdout_data);
+ free_file_data(&stderr_data);
+
+ /*
+ * On failure, enforce the error only if the execute collector is enabled.
+ * If the collector was disabled, prefer to succeed so as not to impact
+ * the system.
+ */
+ if (unlikely(err < 0 && !csm_execute_enabled))
+ err = 0;
+
+ return err;
+}
+
+/* Create a clone event when a new task leader is created. */
+void csm_task_post_alloc(struct task_struct *task)
+{
+ int err;
+ char uuid[PROCESS_UUID_SIZE];
+ char parent_uuid[PROCESS_UUID_SIZE];
+ schema_Event event = schema_Event_init_zero;
+ schema_Process *proc;
+
+ if (!csm_execute_enabled ||
+ audit_get_contid(task) == AUDIT_CID_UNSET ||
+ !thread_group_leader(task))
+ return;
+
+ proc = &event.event.clone.proc;
+
+ err = populate_proc_uuid_common(proc, uuid, sizeof(uuid), parent_uuid,
+ sizeof(parent_uuid), task);
+
+ event.which_event = schema_Event_clone_tag;
+ err = csm_sendeventproto(schema_Event_fields, &event);
+ if (err)
+ pr_err("csm_sendeventproto returned %d on exit\n", err);
+}
+
+/*
+ * This LSM hook callback doesn't exist upstream and is called only when the
+ * last thread of a thread group exits.
+ */
+void csm_task_exit(struct task_struct *task)
+{
+ int err;
+ schema_Event event = schema_Event_init_zero;
+ schema_ExitEvent *exit;
+ char uuid[PROCESS_UUID_SIZE];
+
+ if (!csm_execute_enabled ||
+ audit_get_contid(task) == AUDIT_CID_UNSET)
+ return;
+
+ exit = &event.event.exit;
+
+ /* Fetch the unique identifier for this process */
+ err = get_process_uuid(task, uuid, sizeof(uuid));
+ if (err) {
+ pr_err("failed to get process uuid on exit\n");
+ return;
+ }
+
+ exit->process_uuid.funcs.encode = pb_encode_uuid_field;
+ exit->process_uuid.arg = uuid;
+
+ event.which_event = schema_Event_exit_tag;
+
+ err = csm_sendeventproto(schema_Event_fields, &event);
+ if (err)
+ pr_err("csm_sendeventproto returned %d on exit\n", err);
+}
+
+int csm_mprotect(struct vm_area_struct *vma, unsigned long reqprot,
+ unsigned long prot)
+{
+ char uuid[PROCESS_UUID_SIZE];
+ char parent_uuid[PROCESS_UUID_SIZE];
+ int err;
+ schema_Event event = schema_Event_init_zero;
+ schema_MemoryExecEvent *memexec;
+ u64 cid;
+ struct file_data path_data = {};
+
+ cid = audit_get_contid(current);
+
+ if (!csm_memexec_enabled ||
+ !(prot & PROT_EXEC) ||
+ vma->vm_file == NULL ||
+ cid == AUDIT_CID_UNSET)
+ return 0;
+
+ memexec = &event.event.memexec;
+
+ err = fill_file_description(vma->vm_file, &memexec->mapped_file,
+ &path_data);
+ if (err)
+ return err;
+
+ err = populate_proc_common(&memexec->proc, uuid, sizeof(uuid),
+ parent_uuid, sizeof(parent_uuid), current);
+ if (err)
+ goto out;
+
+ memexec->prot_exec_timestamp = ktime_get_real_ns();
+ memexec->new_flags = prot;
+ memexec->req_flags = reqprot;
+ memexec->old_vm_flags = vma->vm_flags;
+
+ memexec->action = schema_MemoryExecEvent_Action_MPROTECT;
+ memexec->start_addr = vma->vm_start;
+ memexec->end_addr = vma->vm_end;
+
+ event.which_event = schema_Event_memexec_tag;
+
+ err = csm_sendeventproto(schema_Event_fields, &event);
+ if (err)
+ pr_err("csm_sendeventproto returned %d on mprotect\n", err);
+ err = 0;
+
+ if (unlikely(err < 0 && !csm_memexec_enabled))
+ err = 0;
+
+out:
+ free_file_data(&path_data);
+ return err;
+}
+
+int csm_mmap_file(struct file *file, unsigned long reqprot,
+ unsigned long prot, unsigned long flags)
+{
+ char uuid[PROCESS_UUID_SIZE];
+ char parent_uuid[PROCESS_UUID_SIZE];
+ int err;
+ schema_Event event = schema_Event_init_zero;
+ schema_MemoryExecEvent *memexec;
+ struct file *exe_file;
+ u64 cid;
+ struct file_data path_data = {};
+
+ cid = audit_get_contid(current);
+
+ if (!csm_memexec_enabled ||
+ !(prot & PROT_EXEC) ||
+ file == NULL ||
+ cid == AUDIT_CID_UNSET)
+ return 0;
+
+ memexec = &event.event.memexec;
+ err = fill_file_description(file, &memexec->mapped_file,
+ &path_data);
+ if (err)
+ return err;
+
+ err = populate_proc_common(&memexec->proc, uuid, sizeof(uuid),
+ parent_uuid, sizeof(parent_uuid), current);
+ if (err)
+ goto out;
+
+ /* get_mm_exe_file does its own locking on mm_sem. */
+ exe_file = get_mm_exe_file(current->mm);
+ if (exe_file) {
+ if (path_equal(&file->f_path, &exe_file->f_path))
+ memexec->is_initial_mmap = 1;
+ fput(exe_file);
+ }
+
+ memexec->prot_exec_timestamp = ktime_get_real_ns();
+ memexec->new_flags = prot;
+ memexec->req_flags = reqprot;
+ memexec->mmap_flags = flags;
+ memexec->action = schema_MemoryExecEvent_Action_MMAP_FILE;
+ event.which_event = schema_Event_memexec_tag;
+
+ err = csm_sendeventproto(schema_Event_fields, &event);
+ if (err)
+ pr_err("csm_sendeventproto returned %d on mmap_file\n", err);
+ err = 0;
+
+ if (unlikely(err < 0 && !csm_memexec_enabled))
+ err = 0;
+
+out:
+ free_file_data(&path_data);
+ return err;
+}
+
+void csm_file_pre_free(struct file *file)
+{
+ struct dentry *dentry;
+ int err;
+ struct file_provenance prov;
+
+ /* The file was opened to be modified and the LSM is enabled */
+ if (!(file->f_mode & FMODE_WRITE) ||
+ !csm_enabled)
+ return;
+
+ /* The current process is containerized. */
+ if (audit_get_contid(current) == AUDIT_CID_UNSET)
+ return;
+
+ /* The file is part of overlayfs on the upper layer. */
+ if (!is_overlayfs_mounted(file))
+ return;
+
+ dentry = ovl_dentry_upper(file->f_path.dentry);
+ if (!dentry)
+ return;
+
+ err = __vfs_getxattr(dentry, dentry->d_inode, XATTR_SECURITY_CSM,
+ NULL, 0);
+ if (err != -ENODATA) {
+ if (err < 0)
+ pr_err("failed to get security attribute: %d\n", err);
+ return;
+ }
+
+ prov.tgid = task_tgid_nr(current);
+ prov.start_time = ktime_mono_to_real(current->group_leader->start_time);
+
+ err = __vfs_setxattr(dentry, dentry->d_inode, XATTR_SECURITY_CSM, &prov,
+ sizeof(prov), 0);
+ if (err < 0)
+ pr_err("failed to set security attribute: %d\n", err);
+}
+
+/*
+ * Based on fs/proc/base.c:next_tgid
+ *
+ * next_thread_group_leader returns the task_struct of the next task with a pid
+ * greater than or equal to tgid. The reference count is increased so that
+ * rcu_read_unlock may be called and preemption re-enabled.
+ */
+static struct task_struct *next_thread_group_leader(pid_t *tgid)
+{
+ struct pid *pid;
+ struct task_struct *task;
+
+ cond_resched();
+ rcu_read_lock();
+retry:
+ task = NULL;
+ pid = find_ge_pid(*tgid, &init_pid_ns);
+ if (pid) {
+ *tgid = pid_nr_ns(pid, &init_pid_ns);
+ task = pid_task(pid, PIDTYPE_PID);
+ if (!task || !has_group_leader_pid(task) ||
+ audit_get_contid(task) == AUDIT_CID_UNSET) {
+ (*tgid) += 1;
+ goto retry;
+ }
+
+ /*
+ * Increment the reference count on the task before leaving
+ * the RCU read-side critical section.
+ */
+ get_task_struct(task);
+ (*tgid) += 1;
+ }
+
+ rcu_read_unlock();
+ return task;
+}
+
+void delayed_enumerate_processes(struct work_struct *work)
+{
+ pid_t tgid = 0;
+ struct task_struct *task;
+ struct csm_enumerate_processes_work_data *wd = container_of(
+ work, struct csm_enumerate_processes_work_data, work);
+ int wd_enumeration_count = wd->enumeration_count;
+
+ kfree(wd);
+ wd = NULL;
+ work = NULL;
+
+ /*
+ * Allow only a single enumeration run at a time, and only while the
+ * execute collector remains enabled.
+ */
+ while ((wd_enumeration_count == atomic_read(&enumeration_count)) &&
+ READ_ONCE(csm_execute_enabled) &&
+ (task = next_thread_group_leader(&tgid))) {
+ int err;
+ char uuid[PROCESS_UUID_SIZE];
+ char parent_uuid[PROCESS_UUID_SIZE];
+ struct file *exe_file = NULL;
+ struct file_data path_data = {};
+ schema_Event event = schema_Event_init_zero;
+ schema_Process *proc = &event.event.enumproc.proc;
+
+ exe_file = get_task_exe_file(task);
+ if (!exe_file) {
+ pr_err("failed to get enumerated process executable, pid: %u\n",
+ task_pid_nr(task));
+ goto next;
+ }
+
+ err = fill_file_description(exe_file, &proc->binary,
+ &path_data);
+ if (err) {
+ pr_err("failed to fill enumerated process %u executable description: %d\n",
+ task_pid_nr(task), err);
+ goto next;
+ }
+
+ err = populate_proc_common(proc, uuid, sizeof(uuid),
+ parent_uuid, sizeof(parent_uuid),
+ task);
+ if (err) {
+ pr_err("failed to set pid %u common fields: %d\n",
+ task_pid_nr(task), err);
+ goto next;
+ }
+
+ if (task->flags & PF_EXITING)
+ goto next;
+
+ event.which_event = schema_Event_enumproc_tag;
+ err = csm_sendeventproto(schema_Event_fields,
+ &event);
+ if (err) {
+ pr_err("failed to send pid %u enumerated process: %d\n",
+ task_pid_nr(task), err);
+ goto next;
+ }
+next:
+ free_file_data(&path_data);
+ if (exe_file)
+ fput(exe_file);
+
+ put_task_struct(task);
+ }
+}
+
+void csm_enumerate_processes(unsigned long const config_version)
+{
+ struct csm_enumerate_processes_work_data *wd;
+
+ wd = kmalloc(sizeof(*wd), GFP_KERNEL);
+ if (!wd)
+ return;
+
+ INIT_WORK(&wd->work, delayed_enumerate_processes);
+ wd->enumeration_count = atomic_add_return(1, &enumeration_count);
+ schedule_work(&wd->work);
+}
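Note: the pb_callback_t pattern used throughout this file (setting funcs.encode and arg, e.g. for proc->args.argv or exit->process_uuid) is how nanopb streams variable-length fields without pre-allocating them. Below is a minimal, self-contained sketch of that pattern, assuming only the public nanopb 0.3.x API (pb_encode_tag_for_field, pb_encode_string, pb_ostream_from_buffer, pb_encode) and the generated schema_ExitEvent message from event.pb.h later in this patch; encode_uuid, encode_exit_event and the 16-byte sample buffer are hypothetical illustrations, not the module's own helpers.

/*
 * Minimal sketch (illustrative only): encoding a schema_ExitEvent whose
 * process_uuid field is produced through a nanopb callback, mirroring the
 * funcs.encode/arg pattern used by process.c above. encode_uuid,
 * encode_exit_event and the 16-byte sample identifier are hypothetical.
 */
#include <pb_encode.h>
#include "event.pb.h"

static bool encode_uuid(pb_ostream_t *stream, const pb_field_t *field,
                        void * const *arg)
{
        const pb_byte_t *uuid = *arg;

        /* A callback writes the field tag itself before the payload. */
        if (!pb_encode_tag_for_field(stream, field))
                return false;

        return pb_encode_string(stream, uuid, 16);
}

static bool encode_exit_event(pb_byte_t *buf, size_t len, size_t *written,
                              pb_byte_t uuid[16])
{
        schema_ExitEvent event = schema_ExitEvent_init_zero;
        pb_ostream_t stream = pb_ostream_from_buffer(buf, len);

        /* Hook the callback and its context, as csm_task_exit() does. */
        event.process_uuid.funcs.encode = encode_uuid;
        event.process_uuid.arg = uuid;

        if (!pb_encode(&stream, schema_ExitEvent_fields, &event))
                return false;

        *written = stream.bytes_written;
        return true;
}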
diff --git a/security/container/process.h b/security/container/process.h
new file mode 100644
index 0000000..1c98134
--- /dev/null
+++ b/security/container/process.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Container Security Monitor module
+ *
+ * Copyright (c) 2019 Google, Inc
+ */
+
+void csm_enumerate_processes(void);
diff --git a/security/container/protos/Makefile b/security/container/protos/Makefile
new file mode 100644
index 0000000..a88068b
--- /dev/null
+++ b/security/container/protos/Makefile
@@ -0,0 +1,10 @@
+subdir-$(CONFIG_SECURITY_CONTAINER_MONITOR) += nanopb
+
+obj-$(CONFIG_SECURITY_CONTAINER_MONITOR) += nanopb/
+obj-$(CONFIG_SECURITY_CONTAINER_MONITOR) += protos.o
+
+protos-y := config.pb.o event.pb.o
+
+ccflags-y := -I$(srctree)/security/container/protos \
+ -I$(srctree)/security/container/protos/nanopb \
+ $(PB_CCFLAGS)
diff --git a/security/container/protos/README b/security/container/protos/README
new file mode 100644
index 0000000..1b0628a
--- /dev/null
+++ b/security/container/protos/README
@@ -0,0 +1,18 @@
+This document provides guidance on how to change the protos used in this directory.
+
+Any change made to a proto file requires reformatting it and regenerating the
+nanopb sources. The proto files must also remain compatible with previously released versions.
+
+To reformat any proto file run: "clang-format -style=Google -i <file.proto>"
+
+To regenerate nanopb files:
+ - Install protoc
+ - apt-get install protobuf-compiler
+ - Clone/setup nanopb for version 0.3.9.1 (or clone the internal depot)
+ - git clone --depth=1 https://github.com/nanopb/nanopb.git
+ - cd nanopb
+ - git fetch --tags
+ - git checkout tags/0.3.9.1
+ - make -C generator/proto
+ - Run protoc with the nanopb definition
+ - protoc --plugin=<path_to_nanopb>/generator/protoc-gen-nanopb --nanopb_out=<path_to_linux>/security/container/protos/ <path_to_linux>/security/container/protos/<file.proto> --proto_path=<path_to_linux>/security/container/protos
diff --git a/security/container/protos/config.pb.c b/security/container/protos/config.pb.c
new file mode 100644
index 0000000..211bab0
--- /dev/null
+++ b/security/container/protos/config.pb.c
@@ -0,0 +1,72 @@
+/* Automatically generated nanopb constant definitions */
+/* Generated by nanopb-0.3.9.3 at Wed Jun 5 11:00:24 2019. */
+
+#include "config.pb.h"
+
+/* @@protoc_insertion_point(includes) */
+#if PB_PROTO_HEADER_VERSION != 30
+#error Regenerate this file with the current version of nanopb generator.
+#endif
+
+
+
+const pb_field_t schema_ContainerCollectorConfig_fields[2] = {
+ PB_FIELD( 1, BOOL , SINGULAR, STATIC , FIRST, schema_ContainerCollectorConfig, enabled, enabled, 0),
+ PB_LAST_FIELD
+};
+
+const pb_field_t schema_ExecuteCollectorConfig_fields[5] = {
+ PB_FIELD( 1, BOOL , SINGULAR, STATIC , FIRST, schema_ExecuteCollectorConfig, enabled, enabled, 0),
+ PB_FIELD( 2, UINT32 , SINGULAR, STATIC , OTHER, schema_ExecuteCollectorConfig, argv_limit, enabled, 0),
+ PB_FIELD( 3, UINT32 , SINGULAR, STATIC , OTHER, schema_ExecuteCollectorConfig, envp_limit, argv_limit, 0),
+ PB_FIELD( 4, STRING , REPEATED, CALLBACK, OTHER, schema_ExecuteCollectorConfig, envp_allowlist, envp_limit, 0),
+ PB_LAST_FIELD
+};
+
+const pb_field_t schema_MemExecCollectorConfig_fields[2] = {
+ PB_FIELD( 1, BOOL , SINGULAR, STATIC , FIRST, schema_MemExecCollectorConfig, enabled, enabled, 0),
+ PB_LAST_FIELD
+};
+
+const pb_field_t schema_ConfigurationRequest_fields[4] = {
+ PB_FIELD( 1, MESSAGE , SINGULAR, STATIC , FIRST, schema_ConfigurationRequest, container_config, container_config, &schema_ContainerCollectorConfig_fields),
+ PB_FIELD( 2, MESSAGE , SINGULAR, STATIC , OTHER, schema_ConfigurationRequest, execute_config, container_config, &schema_ExecuteCollectorConfig_fields),
+ PB_FIELD( 3, MESSAGE , SINGULAR, STATIC , OTHER, schema_ConfigurationRequest, memexec_config, execute_config, &schema_MemExecCollectorConfig_fields),
+ PB_LAST_FIELD
+};
+
+const pb_field_t schema_ConfigurationResponse_fields[5] = {
+ PB_FIELD( 1, UENUM , SINGULAR, STATIC , FIRST, schema_ConfigurationResponse, error, error, 0),
+ PB_FIELD( 2, STRING , SINGULAR, CALLBACK, OTHER, schema_ConfigurationResponse, msg, error, 0),
+ PB_FIELD( 3, UINT64 , SINGULAR, STATIC , OTHER, schema_ConfigurationResponse, version, msg, 0),
+ PB_FIELD( 4, UINT32 , SINGULAR, STATIC , OTHER, schema_ConfigurationResponse, kernel_version, version, 0),
+ PB_LAST_FIELD
+};
+
+
+
+/* Check that field information fits in pb_field_t */
+#if !defined(PB_FIELD_32BIT)
+/* If you get an error here, it means that you need to define PB_FIELD_32BIT
+ * compile-time option. You can do that in pb.h or on compiler command line.
+ *
+ * The reason you need to do this is that some of your messages contain tag
+ * numbers or field sizes that are larger than what can fit in 8 or 16 bit
+ * field descriptors.
+ */
+PB_STATIC_ASSERT((pb_membersize(schema_ConfigurationRequest, container_config) < 65536 && pb_membersize(schema_ConfigurationRequest, execute_config) < 65536 && pb_membersize(schema_ConfigurationRequest, memexec_config) < 65536), YOU_MUST_DEFINE_PB_FIELD_32BIT_FOR_MESSAGES_schema_ContainerCollectorConfig_schema_ExecuteCollectorConfig_schema_MemExecCollectorConfig_schema_ConfigurationRequest_schema_ConfigurationResponse)
+#endif
+
+#if !defined(PB_FIELD_16BIT) && !defined(PB_FIELD_32BIT)
+/* If you get an error here, it means that you need to define PB_FIELD_16BIT
+ * compile-time option. You can do that in pb.h or on compiler command line.
+ *
+ * The reason you need to do this is that some of your messages contain tag
+ * numbers or field sizes that are larger than what can fit in the default
+ * 8 bit descriptors.
+ */
+PB_STATIC_ASSERT((pb_membersize(schema_ConfigurationRequest, container_config) < 256 && pb_membersize(schema_ConfigurationRequest, execute_config) < 256 && pb_membersize(schema_ConfigurationRequest, memexec_config) < 256), YOU_MUST_DEFINE_PB_FIELD_16BIT_FOR_MESSAGES_schema_ContainerCollectorConfig_schema_ExecuteCollectorConfig_schema_MemExecCollectorConfig_schema_ConfigurationRequest_schema_ConfigurationResponse)
+#endif
+
+
+/* @@protoc_insertion_point(eof) */
diff --git a/security/container/protos/config.pb.h b/security/container/protos/config.pb.h
new file mode 100644
index 0000000..2685d2b
--- /dev/null
+++ b/security/container/protos/config.pb.h
@@ -0,0 +1,116 @@
+/* Automatically generated nanopb header */
+/* Generated by nanopb-0.3.9.3 at Wed Jun 5 11:00:24 2019. */
+
+#ifndef PB_SCHEMA_CONFIG_PB_H_INCLUDED
+#define PB_SCHEMA_CONFIG_PB_H_INCLUDED
+#include <pb.h>
+
+/* @@protoc_insertion_point(includes) */
+#if PB_PROTO_HEADER_VERSION != 30
+#error Regenerate this file with the current version of nanopb generator.
+#endif
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/* Enum definitions */
+typedef enum _schema_ConfigurationResponse_ErrorCode {
+ schema_ConfigurationResponse_ErrorCode_NO_ERROR = 0,
+ schema_ConfigurationResponse_ErrorCode_UNKNOWN = 2
+} schema_ConfigurationResponse_ErrorCode;
+#define _schema_ConfigurationResponse_ErrorCode_MIN schema_ConfigurationResponse_ErrorCode_NO_ERROR
+#define _schema_ConfigurationResponse_ErrorCode_MAX schema_ConfigurationResponse_ErrorCode_UNKNOWN
+#define _schema_ConfigurationResponse_ErrorCode_ARRAYSIZE ((schema_ConfigurationResponse_ErrorCode)(schema_ConfigurationResponse_ErrorCode_UNKNOWN+1))
+
+/* Struct definitions */
+typedef struct _schema_ConfigurationResponse {
+ schema_ConfigurationResponse_ErrorCode error;
+ pb_callback_t msg;
+ uint64_t version;
+ uint32_t kernel_version;
+/* @@protoc_insertion_point(struct:schema_ConfigurationResponse) */
+} schema_ConfigurationResponse;
+
+typedef struct _schema_ContainerCollectorConfig {
+ bool enabled;
+/* @@protoc_insertion_point(struct:schema_ContainerCollectorConfig) */
+} schema_ContainerCollectorConfig;
+
+typedef struct _schema_ExecuteCollectorConfig {
+ bool enabled;
+ uint32_t argv_limit;
+ uint32_t envp_limit;
+ pb_callback_t envp_allowlist;
+/* @@protoc_insertion_point(struct:schema_ExecuteCollectorConfig) */
+} schema_ExecuteCollectorConfig;
+
+typedef struct _schema_MemExecCollectorConfig {
+ bool enabled;
+/* @@protoc_insertion_point(struct:schema_MemExecCollectorConfig) */
+} schema_MemExecCollectorConfig;
+
+typedef struct _schema_ConfigurationRequest {
+ schema_ContainerCollectorConfig container_config;
+ schema_ExecuteCollectorConfig execute_config;
+ schema_MemExecCollectorConfig memexec_config;
+/* @@protoc_insertion_point(struct:schema_ConfigurationRequest) */
+} schema_ConfigurationRequest;
+
+/* Default values for struct fields */
+
+/* Initializer values for message structs */
+#define schema_ContainerCollectorConfig_init_default {0}
+#define schema_ExecuteCollectorConfig_init_default {0, 0, 0, {{NULL}, NULL}}
+#define schema_MemExecCollectorConfig_init_default {0}
+#define schema_ConfigurationRequest_init_default {schema_ContainerCollectorConfig_init_default, schema_ExecuteCollectorConfig_init_default, schema_MemExecCollectorConfig_init_default}
+#define schema_ConfigurationResponse_init_default {_schema_ConfigurationResponse_ErrorCode_MIN, {{NULL}, NULL}, 0, 0}
+#define schema_ContainerCollectorConfig_init_zero {0}
+#define schema_ExecuteCollectorConfig_init_zero {0, 0, 0, {{NULL}, NULL}}
+#define schema_MemExecCollectorConfig_init_zero {0}
+#define schema_ConfigurationRequest_init_zero {schema_ContainerCollectorConfig_init_zero, schema_ExecuteCollectorConfig_init_zero, schema_MemExecCollectorConfig_init_zero}
+#define schema_ConfigurationResponse_init_zero {_schema_ConfigurationResponse_ErrorCode_MIN, {{NULL}, NULL}, 0, 0}
+
+/* Field tags (for use in manual encoding/decoding) */
+#define schema_ConfigurationResponse_error_tag 1
+#define schema_ConfigurationResponse_msg_tag 2
+#define schema_ConfigurationResponse_version_tag 3
+#define schema_ConfigurationResponse_kernel_version_tag 4
+#define schema_ContainerCollectorConfig_enabled_tag 1
+#define schema_ExecuteCollectorConfig_enabled_tag 1
+#define schema_ExecuteCollectorConfig_argv_limit_tag 2
+#define schema_ExecuteCollectorConfig_envp_limit_tag 3
+#define schema_ExecuteCollectorConfig_envp_allowlist_tag 4
+#define schema_MemExecCollectorConfig_enabled_tag 1
+#define schema_ConfigurationRequest_container_config_tag 1
+#define schema_ConfigurationRequest_execute_config_tag 2
+#define schema_ConfigurationRequest_memexec_config_tag 3
+
+/* Struct field encoding specification for nanopb */
+extern const pb_field_t schema_ContainerCollectorConfig_fields[2];
+extern const pb_field_t schema_ExecuteCollectorConfig_fields[5];
+extern const pb_field_t schema_MemExecCollectorConfig_fields[2];
+extern const pb_field_t schema_ConfigurationRequest_fields[4];
+extern const pb_field_t schema_ConfigurationResponse_fields[5];
+
+/* Maximum encoded size of messages (where known) */
+#define schema_ContainerCollectorConfig_size 2
+/* schema_ExecuteCollectorConfig_size depends on runtime parameters */
+#define schema_MemExecCollectorConfig_size 2
+/* schema_ConfigurationRequest_size depends on runtime parameters */
+/* schema_ConfigurationResponse_size depends on runtime parameters */
+
+/* Message IDs (where set with "msgid" option) */
+#ifdef PB_MSGID
+
+#define CONFIG_MESSAGES \
+
+
+#endif
+
+#ifdef __cplusplus
+} /* extern "C" */
+#endif
+/* @@protoc_insertion_point(eof) */
+
+#endif
diff --git a/security/container/protos/config.proto b/security/container/protos/config.proto
new file mode 100644
index 0000000..e32a517
--- /dev/null
+++ b/security/container/protos/config.proto
@@ -0,0 +1,51 @@
+syntax = "proto3";
+
+package schema;
+
+// Collect information about running containers
+message ContainerCollectorConfig {
+ bool enabled = 1;
+}
+
+message ExecuteCollectorConfig {
+ bool enabled = 1;
+
+ // truncate argv/envp if cumulative length exceeds limit
+ uint32 argv_limit = 2;
+ uint32 envp_limit = 3;
+
+ // If specified, only report the named environment variables. An
+ // empty envp_allowlist indicates that all environment variables
+ // should be reported up to a cumulative total of envp_limit bytes.
+ repeated string envp_allowlist = 4;
+}
+
+// Collect information about executable memory mappings.
+message MemExecCollectorConfig {
+ bool enabled = 1;
+}
+
+// Convey configuration information to Guest LSM
+message ConfigurationRequest {
+ ContainerCollectorConfig container_config = 1;
+ ExecuteCollectorConfig execute_config = 2;
+ MemExecCollectorConfig memexec_config = 3;
+
+ // Additional configuration messages will be added as new collectors
+ // are implemented
+}
+
+// Report success or failure of previous ConfigurationRequest
+message ConfigurationResponse {
+ enum ErrorCode {
+ // Keep values in sync with
+ // https://github.com/googleapis/googleapis/blob/master/google/rpc/code.proto
+ NO_ERROR = 0;
+ UNKNOWN = 2;
+ }
+
+ ErrorCode error = 1;
+ string msg = 2;
+ uint64 version = 3; // Version of the LSM
+ uint32 kernel_version = 4; // LINUX_VERSION_CODE
+}
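For orientation, here is a minimal sketch of how a userspace agent might populate and serialize a ConfigurationRequest with nanopb, using only the generated symbols from config.pb.h above (schema_ConfigurationRequest_init_zero, schema_ConfigurationRequest_fields) and the standard pb_encode()/pb_ostream_from_buffer() API. encode_config_request and the limit values are hypothetical example choices, not part of this patch.

/*
 * Illustrative only: a userspace-style helper that fills in a
 * ConfigurationRequest enabling all three collectors and serializes it
 * with nanopb. The limit values are arbitrary examples.
 */
#include <pb_encode.h>
#include "config.pb.h"

static bool encode_config_request(pb_byte_t *buf, size_t len, size_t *written)
{
        schema_ConfigurationRequest req = schema_ConfigurationRequest_init_zero;
        pb_ostream_t stream = pb_ostream_from_buffer(buf, len);

        req.container_config.enabled = true;
        req.execute_config.enabled = true;
        req.execute_config.argv_limit = 4096;  /* truncate argv past 4 KiB */
        req.execute_config.envp_limit = 512;   /* truncate envp past 512 bytes */
        /* envp_allowlist left unset: report all environment variables. */
        req.memexec_config.enabled = true;

        if (!pb_encode(&stream, schema_ConfigurationRequest_fields, &req))
                return false;

        *written = stream.bytes_written;
        return true;
}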
diff --git a/security/container/protos/event.pb.c b/security/container/protos/event.pb.c
new file mode 100644
index 0000000..1018ce2
--- /dev/null
+++ b/security/container/protos/event.pb.c
@@ -0,0 +1,174 @@
+/* Automatically generated nanopb constant definitions */
+/* Generated by nanopb-0.3.9.1 at Mon Nov 11 16:11:19 2019. */
+
+#include "event.pb.h"
+
+/* @@protoc_insertion_point(includes) */
+#if PB_PROTO_HEADER_VERSION != 30
+#error Regenerate this file with the current version of nanopb generator.
+#endif
+
+
+
+const pb_field_t schema_SocketIp_fields[4] = {
+ PB_FIELD( 1, UINT32 , SINGULAR, STATIC , FIRST, schema_SocketIp, family, family, 0),
+ PB_FIELD( 2, BYTES , SINGULAR, CALLBACK, OTHER, schema_SocketIp, ip, family, 0),
+ PB_FIELD( 3, UINT32 , SINGULAR, STATIC , OTHER, schema_SocketIp, port, ip, 0),
+ PB_LAST_FIELD
+};
+
+const pb_field_t schema_Socket_fields[3] = {
+ PB_FIELD( 1, MESSAGE , SINGULAR, STATIC , FIRST, schema_Socket, local, local, &schema_SocketIp_fields),
+ PB_FIELD( 2, MESSAGE , SINGULAR, STATIC , OTHER, schema_Socket, remote, local, &schema_SocketIp_fields),
+ PB_LAST_FIELD
+};
+
+const pb_field_t schema_Overlay_fields[4] = {
+ PB_FIELD( 1, BOOL , SINGULAR, STATIC , FIRST, schema_Overlay, lower_layer, lower_layer, 0),
+ PB_FIELD( 2, BOOL , SINGULAR, STATIC , OTHER, schema_Overlay, upper_layer, lower_layer, 0),
+ PB_FIELD( 3, BYTES , SINGULAR, CALLBACK, OTHER, schema_Overlay, modified_uuid, upper_layer, 0),
+ PB_LAST_FIELD
+};
+
+const pb_field_t schema_File_fields[5] = {
+ PB_FIELD( 1, BYTES , SINGULAR, CALLBACK, FIRST, schema_File, fullpath, fullpath, 0),
+ PB_ONEOF_FIELD(filesystem, 2, MESSAGE , ONEOF, STATIC , OTHER, schema_File, overlayfs, fullpath, &schema_Overlay_fields),
+ PB_ONEOF_FIELD(filesystem, 4, MESSAGE , ONEOF, STATIC , UNION, schema_File, socket, fullpath, &schema_Socket_fields),
+ PB_FIELD( 3, UINT32 , SINGULAR, STATIC , OTHER, schema_File, ino, filesystem.socket, 0),
+ PB_LAST_FIELD
+};
+
+const pb_field_t schema_ProcessArguments_fields[5] = {
+ PB_FIELD( 1, BYTES , REPEATED, CALLBACK, FIRST, schema_ProcessArguments, argv, argv, 0),
+ PB_FIELD( 2, UINT32 , SINGULAR, STATIC , OTHER, schema_ProcessArguments, argv_truncated, argv, 0),
+ PB_FIELD( 3, BYTES , REPEATED, CALLBACK, OTHER, schema_ProcessArguments, envp, argv_truncated, 0),
+ PB_FIELD( 4, UINT32 , SINGULAR, STATIC , OTHER, schema_ProcessArguments, envp_truncated, envp, 0),
+ PB_LAST_FIELD
+};
+
+const pb_field_t schema_Descriptor_fields[3] = {
+ PB_FIELD( 1, UINT32 , SINGULAR, STATIC , FIRST, schema_Descriptor, mode, mode, 0),
+ PB_FIELD( 2, MESSAGE , SINGULAR, STATIC , OTHER, schema_Descriptor, file, mode, &schema_File_fields),
+ PB_LAST_FIELD
+};
+
+const pb_field_t schema_Streams_fields[4] = {
+ PB_FIELD( 1, MESSAGE , SINGULAR, STATIC , FIRST, schema_Streams, stdin, stdin, &schema_Descriptor_fields),
+ PB_FIELD( 2, MESSAGE , SINGULAR, STATIC , OTHER, schema_Streams, stdout, stdin, &schema_Descriptor_fields),
+ PB_FIELD( 3, MESSAGE , SINGULAR, STATIC , OTHER, schema_Streams, stderr, stdout, &schema_Descriptor_fields),
+ PB_LAST_FIELD
+};
+
+const pb_field_t schema_Process_fields[13] = {
+ PB_FIELD( 1, UINT64 , SINGULAR, STATIC , FIRST, schema_Process, creation_timestamp, creation_timestamp, 0),
+ PB_FIELD( 2, BYTES , SINGULAR, CALLBACK, OTHER, schema_Process, uuid, creation_timestamp, 0),
+ PB_FIELD( 3, UINT32 , SINGULAR, STATIC , OTHER, schema_Process, pid, uuid, 0),
+ PB_FIELD( 4, MESSAGE , SINGULAR, STATIC , OTHER, schema_Process, binary, pid, &schema_File_fields),
+ PB_FIELD( 5, UINT32 , SINGULAR, STATIC , OTHER, schema_Process, parent_pid, binary, 0),
+ PB_FIELD( 6, BYTES , SINGULAR, CALLBACK, OTHER, schema_Process, parent_uuid, parent_pid, 0),
+ PB_FIELD( 7, UINT64 , SINGULAR, STATIC , OTHER, schema_Process, container_id, parent_uuid, 0),
+ PB_FIELD( 8, UINT32 , SINGULAR, STATIC , OTHER, schema_Process, container_pid, container_id, 0),
+ PB_FIELD( 9, UINT32 , SINGULAR, STATIC , OTHER, schema_Process, container_parent_pid, container_pid, 0),
+ PB_FIELD( 10, MESSAGE , SINGULAR, STATIC , OTHER, schema_Process, args, container_parent_pid, &schema_ProcessArguments_fields),
+ PB_FIELD( 11, MESSAGE , SINGULAR, STATIC , OTHER, schema_Process, streams, args, &schema_Streams_fields),
+ PB_FIELD( 12, UINT64 , SINGULAR, STATIC , OTHER, schema_Process, exec_session_id, streams, 0),
+ PB_LAST_FIELD
+};
+
+const pb_field_t schema_Container_fields[10] = {
+ PB_FIELD( 1, UINT64 , SINGULAR, STATIC , FIRST, schema_Container, creation_timestamp, creation_timestamp, 0),
+ PB_FIELD( 2, BYTES , SINGULAR, CALLBACK, OTHER, schema_Container, pod_namespace, creation_timestamp, 0),
+ PB_FIELD( 3, BYTES , SINGULAR, CALLBACK, OTHER, schema_Container, pod_name, pod_namespace, 0),
+ PB_FIELD( 4, UINT64 , SINGULAR, STATIC , OTHER, schema_Container, container_id, pod_name, 0),
+ PB_FIELD( 5, BYTES , SINGULAR, CALLBACK, OTHER, schema_Container, container_name, container_id, 0),
+ PB_FIELD( 6, BYTES , SINGULAR, CALLBACK, OTHER, schema_Container, container_image_uri, container_name, 0),
+ PB_FIELD( 7, BYTES , REPEATED, CALLBACK, OTHER, schema_Container, labels, container_image_uri, 0),
+ PB_FIELD( 8, BYTES , SINGULAR, CALLBACK, OTHER, schema_Container, init_uuid, labels, 0),
+ PB_FIELD( 9, BYTES , SINGULAR, CALLBACK, OTHER, schema_Container, container_image_id, init_uuid, 0),
+ PB_LAST_FIELD
+};
+
+const pb_field_t schema_ExecuteEvent_fields[2] = {
+ PB_FIELD( 1, MESSAGE , SINGULAR, STATIC , FIRST, schema_ExecuteEvent, proc, proc, &schema_Process_fields),
+ PB_LAST_FIELD
+};
+
+const pb_field_t schema_CloneEvent_fields[2] = {
+ PB_FIELD( 1, MESSAGE , SINGULAR, STATIC , FIRST, schema_CloneEvent, proc, proc, &schema_Process_fields),
+ PB_LAST_FIELD
+};
+
+const pb_field_t schema_EnumerateProcessEvent_fields[2] = {
+ PB_FIELD( 1, MESSAGE , SINGULAR, STATIC , FIRST, schema_EnumerateProcessEvent, proc, proc, &schema_Process_fields),
+ PB_LAST_FIELD
+};
+
+const pb_field_t schema_MemoryExecEvent_fields[12] = {
+ PB_FIELD( 1, MESSAGE , SINGULAR, STATIC , FIRST, schema_MemoryExecEvent, proc, proc, &schema_Process_fields),
+ PB_FIELD( 2, UINT64 , SINGULAR, STATIC , OTHER, schema_MemoryExecEvent, prot_exec_timestamp, proc, 0),
+ PB_FIELD( 3, UINT64 , SINGULAR, STATIC , OTHER, schema_MemoryExecEvent, new_flags, prot_exec_timestamp, 0),
+ PB_FIELD( 4, UINT64 , SINGULAR, STATIC , OTHER, schema_MemoryExecEvent, req_flags, new_flags, 0),
+ PB_FIELD( 5, UINT64 , SINGULAR, STATIC , OTHER, schema_MemoryExecEvent, old_vm_flags, req_flags, 0),
+ PB_FIELD( 6, UINT64 , SINGULAR, STATIC , OTHER, schema_MemoryExecEvent, mmap_flags, old_vm_flags, 0),
+ PB_FIELD( 7, MESSAGE , SINGULAR, STATIC , OTHER, schema_MemoryExecEvent, mapped_file, mmap_flags, &schema_File_fields),
+ PB_FIELD( 8, UENUM , SINGULAR, STATIC , OTHER, schema_MemoryExecEvent, action, mapped_file, 0),
+ PB_FIELD( 9, UINT64 , SINGULAR, STATIC , OTHER, schema_MemoryExecEvent, start_addr, action, 0),
+ PB_FIELD( 10, UINT64 , SINGULAR, STATIC , OTHER, schema_MemoryExecEvent, end_addr, start_addr, 0),
+ PB_FIELD( 11, BOOL , SINGULAR, STATIC , OTHER, schema_MemoryExecEvent, is_initial_mmap, end_addr, 0),
+ PB_LAST_FIELD
+};
+
+const pb_field_t schema_ContainerInfoEvent_fields[2] = {
+ PB_FIELD( 1, MESSAGE , SINGULAR, STATIC , FIRST, schema_ContainerInfoEvent, container, container, &schema_Container_fields),
+ PB_LAST_FIELD
+};
+
+const pb_field_t schema_ExitEvent_fields[2] = {
+ PB_FIELD( 1, BYTES , SINGULAR, CALLBACK, FIRST, schema_ExitEvent, process_uuid, process_uuid, 0),
+ PB_LAST_FIELD
+};
+
+const pb_field_t schema_Event_fields[8] = {
+ PB_ONEOF_FIELD(event, 1, MESSAGE , ONEOF, STATIC , FIRST, schema_Event, execute, execute, &schema_ExecuteEvent_fields),
+ PB_ONEOF_FIELD(event, 2, MESSAGE , ONEOF, STATIC , UNION, schema_Event, container, container, &schema_ContainerInfoEvent_fields),
+ PB_ONEOF_FIELD(event, 3, MESSAGE , ONEOF, STATIC , UNION, schema_Event, exit, exit, &schema_ExitEvent_fields),
+ PB_ONEOF_FIELD(event, 4, MESSAGE , ONEOF, STATIC , UNION, schema_Event, memexec, memexec, &schema_MemoryExecEvent_fields),
+ PB_ONEOF_FIELD(event, 5, MESSAGE , ONEOF, STATIC , UNION, schema_Event, clone, clone, &schema_CloneEvent_fields),
+ PB_ONEOF_FIELD(event, 7, MESSAGE , ONEOF, STATIC , UNION, schema_Event, enumproc, enumproc, &schema_EnumerateProcessEvent_fields),
+ PB_FIELD( 6, UINT64 , SINGULAR, STATIC , OTHER, schema_Event, timestamp, event.enumproc, 0),
+ PB_LAST_FIELD
+};
+
+const pb_field_t schema_ContainerReport_fields[3] = {
+ PB_FIELD( 1, UINT32 , SINGULAR, STATIC , FIRST, schema_ContainerReport, pid, pid, 0),
+ PB_FIELD( 2, MESSAGE , SINGULAR, STATIC , OTHER, schema_ContainerReport, container, pid, &schema_Container_fields),
+ PB_LAST_FIELD
+};
+
+
+
+/* Check that field information fits in pb_field_t */
+#if !defined(PB_FIELD_32BIT)
+/* If you get an error here, it means that you need to define PB_FIELD_32BIT
+ * compile-time option. You can do that in pb.h or on compiler command line.
+ *
+ * The reason you need to do this is that some of your messages contain tag
+ * numbers or field sizes that are larger than what can fit in 8 or 16 bit
+ * field descriptors.
+ */
+PB_STATIC_ASSERT((pb_membersize(schema_Socket, local) < 65536 && pb_membersize(schema_Socket, remote) < 65536 && pb_membersize(schema_File, filesystem.overlayfs) < 65536 && pb_membersize(schema_File, filesystem.socket) < 65536 && pb_membersize(schema_Descriptor, file) < 65536 && pb_membersize(schema_Streams, stdin) < 65536 && pb_membersize(schema_Streams, stdout) < 65536 && pb_membersize(schema_Streams, stderr) < 65536 && pb_membersize(schema_Process, binary) < 65536 && pb_membersize(schema_Process, args) < 65536 && pb_membersize(schema_Process, streams) < 65536 && pb_membersize(schema_ExecuteEvent, proc) < 65536 && pb_membersize(schema_CloneEvent, proc) < 65536 && pb_membersize(schema_EnumerateProcessEvent, proc) < 65536 && pb_membersize(schema_MemoryExecEvent, proc) < 65536 && pb_membersize(schema_MemoryExecEvent, mapped_file) < 65536 && pb_membersize(schema_ContainerInfoEvent, container) < 65536 && pb_membersize(schema_Event, event.execute) < 65536 && pb_membersize(schema_Event, event.container) < 65536 && pb_membersize(schema_Event, event.exit) < 65536 && pb_membersize(schema_Event, event.memexec) < 65536 && pb_membersize(schema_Event, event.clone) < 65536 && pb_membersize(schema_Event, event.enumproc) < 65536 && pb_membersize(schema_ContainerReport, container) < 65536), YOU_MUST_DEFINE_PB_FIELD_32BIT_FOR_MESSAGES_schema_SocketIp_schema_Socket_schema_Overlay_schema_File_schema_ProcessArguments_schema_Descriptor_schema_Streams_schema_Process_schema_Container_schema_ExecuteEvent_schema_CloneEvent_schema_EnumerateProcessEvent_schema_MemoryExecEvent_schema_ContainerInfoEvent_schema_ExitEvent_schema_Event_schema_ContainerReport)
+#endif
+
+#if !defined(PB_FIELD_16BIT) && !defined(PB_FIELD_32BIT)
+/* If you get an error here, it means that you need to define PB_FIELD_16BIT
+ * compile-time option. You can do that in pb.h or on compiler command line.
+ *
+ * The reason you need to do this is that some of your messages contain tag
+ * numbers or field sizes that are larger than what can fit in the default
+ * 8 bit descriptors.
+ */
+PB_STATIC_ASSERT((pb_membersize(schema_Socket, local) < 256 && pb_membersize(schema_Socket, remote) < 256 && pb_membersize(schema_File, filesystem.overlayfs) < 256 && pb_membersize(schema_File, filesystem.socket) < 256 && pb_membersize(schema_Descriptor, file) < 256 && pb_membersize(schema_Streams, stdin) < 256 && pb_membersize(schema_Streams, stdout) < 256 && pb_membersize(schema_Streams, stderr) < 256 && pb_membersize(schema_Process, binary) < 256 && pb_membersize(schema_Process, args) < 256 && pb_membersize(schema_Process, streams) < 256 && pb_membersize(schema_ExecuteEvent, proc) < 256 && pb_membersize(schema_CloneEvent, proc) < 256 && pb_membersize(schema_EnumerateProcessEvent, proc) < 256 && pb_membersize(schema_MemoryExecEvent, proc) < 256 && pb_membersize(schema_MemoryExecEvent, mapped_file) < 256 && pb_membersize(schema_ContainerInfoEvent, container) < 256 && pb_membersize(schema_Event, event.execute) < 256 && pb_membersize(schema_Event, event.container) < 256 && pb_membersize(schema_Event, event.exit) < 256 && pb_membersize(schema_Event, event.memexec) < 256 && pb_membersize(schema_Event, event.clone) < 256 && pb_membersize(schema_Event, event.enumproc) < 256 && pb_membersize(schema_ContainerReport, container) < 256), YOU_MUST_DEFINE_PB_FIELD_16BIT_FOR_MESSAGES_schema_SocketIp_schema_Socket_schema_Overlay_schema_File_schema_ProcessArguments_schema_Descriptor_schema_Streams_schema_Process_schema_Container_schema_ExecuteEvent_schema_CloneEvent_schema_EnumerateProcessEvent_schema_MemoryExecEvent_schema_ContainerInfoEvent_schema_ExitEvent_schema_Event_schema_ContainerReport)
+#endif
+
+
+/* @@protoc_insertion_point(eof) */
diff --git a/security/container/protos/event.pb.h b/security/container/protos/event.pb.h
new file mode 100644
index 0000000..0634e8e
--- /dev/null
+++ b/security/container/protos/event.pb.h
@@ -0,0 +1,327 @@
+/* Automatically generated nanopb header */
+/* Generated by nanopb-0.3.9.1 at Mon Nov 11 16:11:19 2019. */
+
+#ifndef PB_SCHEMA_EVENT_PB_H_INCLUDED
+#define PB_SCHEMA_EVENT_PB_H_INCLUDED
+#include <pb.h>
+
+/* @@protoc_insertion_point(includes) */
+#if PB_PROTO_HEADER_VERSION != 30
+#error Regenerate this file with the current version of nanopb generator.
+#endif
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/* Enum definitions */
+typedef enum _schema_MemoryExecEvent_Action {
+ schema_MemoryExecEvent_Action_UNDEFINED = 0,
+ schema_MemoryExecEvent_Action_MPROTECT = 1,
+ schema_MemoryExecEvent_Action_MMAP_FILE = 2
+} schema_MemoryExecEvent_Action;
+#define _schema_MemoryExecEvent_Action_MIN schema_MemoryExecEvent_Action_UNDEFINED
+#define _schema_MemoryExecEvent_Action_MAX schema_MemoryExecEvent_Action_MMAP_FILE
+#define _schema_MemoryExecEvent_Action_ARRAYSIZE ((schema_MemoryExecEvent_Action)(schema_MemoryExecEvent_Action_MMAP_FILE+1))
+
+/* Struct definitions */
+typedef struct _schema_ExitEvent {
+ pb_callback_t process_uuid;
+/* @@protoc_insertion_point(struct:schema_ExitEvent) */
+} schema_ExitEvent;
+
+typedef struct _schema_Container {
+ uint64_t creation_timestamp;
+ pb_callback_t pod_namespace;
+ pb_callback_t pod_name;
+ uint64_t container_id;
+ pb_callback_t container_name;
+ pb_callback_t container_image_uri;
+ pb_callback_t labels;
+ pb_callback_t init_uuid;
+ pb_callback_t container_image_id;
+/* @@protoc_insertion_point(struct:schema_Container) */
+} schema_Container;
+
+typedef struct _schema_Overlay {
+ bool lower_layer;
+ bool upper_layer;
+ pb_callback_t modified_uuid;
+/* @@protoc_insertion_point(struct:schema_Overlay) */
+} schema_Overlay;
+
+typedef struct _schema_ProcessArguments {
+ pb_callback_t argv;
+ uint32_t argv_truncated;
+ pb_callback_t envp;
+ uint32_t envp_truncated;
+/* @@protoc_insertion_point(struct:schema_ProcessArguments) */
+} schema_ProcessArguments;
+
+typedef struct _schema_SocketIp {
+ uint32_t family;
+ pb_callback_t ip;
+ uint32_t port;
+/* @@protoc_insertion_point(struct:schema_SocketIp) */
+} schema_SocketIp;
+
+typedef struct _schema_ContainerInfoEvent {
+ schema_Container container;
+/* @@protoc_insertion_point(struct:schema_ContainerInfoEvent) */
+} schema_ContainerInfoEvent;
+
+typedef struct _schema_ContainerReport {
+ uint32_t pid;
+ schema_Container container;
+/* @@protoc_insertion_point(struct:schema_ContainerReport) */
+} schema_ContainerReport;
+
+typedef struct _schema_Socket {
+ schema_SocketIp local;
+ schema_SocketIp remote;
+/* @@protoc_insertion_point(struct:schema_Socket) */
+} schema_Socket;
+
+typedef struct _schema_File {
+ pb_callback_t fullpath;
+ pb_size_t which_filesystem;
+ union {
+ schema_Overlay overlayfs;
+ schema_Socket socket;
+ } filesystem;
+ uint32_t ino;
+/* @@protoc_insertion_point(struct:schema_File) */
+} schema_File;
+
+typedef struct _schema_Descriptor {
+ uint32_t mode;
+ schema_File file;
+/* @@protoc_insertion_point(struct:schema_Descriptor) */
+} schema_Descriptor;
+
+typedef struct _schema_Streams {
+ schema_Descriptor stdin;
+ schema_Descriptor stdout;
+ schema_Descriptor stderr;
+/* @@protoc_insertion_point(struct:schema_Streams) */
+} schema_Streams;
+
+typedef struct _schema_Process {
+ uint64_t creation_timestamp;
+ pb_callback_t uuid;
+ uint32_t pid;
+ schema_File binary;
+ uint32_t parent_pid;
+ pb_callback_t parent_uuid;
+ uint64_t container_id;
+ uint32_t container_pid;
+ uint32_t container_parent_pid;
+ schema_ProcessArguments args;
+ schema_Streams streams;
+ uint64_t exec_session_id;
+/* @@protoc_insertion_point(struct:schema_Process) */
+} schema_Process;
+
+typedef struct _schema_CloneEvent {
+ schema_Process proc;
+/* @@protoc_insertion_point(struct:schema_CloneEvent) */
+} schema_CloneEvent;
+
+typedef struct _schema_EnumerateProcessEvent {
+ schema_Process proc;
+/* @@protoc_insertion_point(struct:schema_EnumerateProcessEvent) */
+} schema_EnumerateProcessEvent;
+
+typedef struct _schema_ExecuteEvent {
+ schema_Process proc;
+/* @@protoc_insertion_point(struct:schema_ExecuteEvent) */
+} schema_ExecuteEvent;
+
+typedef struct _schema_MemoryExecEvent {
+ schema_Process proc;
+ uint64_t prot_exec_timestamp;
+ uint64_t new_flags;
+ uint64_t req_flags;
+ uint64_t old_vm_flags;
+ uint64_t mmap_flags;
+ schema_File mapped_file;
+ schema_MemoryExecEvent_Action action;
+ uint64_t start_addr;
+ uint64_t end_addr;
+ bool is_initial_mmap;
+/* @@protoc_insertion_point(struct:schema_MemoryExecEvent) */
+} schema_MemoryExecEvent;
+
+typedef struct _schema_Event {
+ pb_size_t which_event;
+ union {
+ schema_ExecuteEvent execute;
+ schema_ContainerInfoEvent container;
+ schema_ExitEvent exit;
+ schema_MemoryExecEvent memexec;
+ schema_CloneEvent clone;
+ schema_EnumerateProcessEvent enumproc;
+ } event;
+ uint64_t timestamp;
+/* @@protoc_insertion_point(struct:schema_Event) */
+} schema_Event;
+
+/* Default values for struct fields */
+
+/* Initializer values for message structs */
+#define schema_SocketIp_init_default {0, {{NULL}, NULL}, 0}
+#define schema_Socket_init_default {schema_SocketIp_init_default, schema_SocketIp_init_default}
+#define schema_Overlay_init_default {0, 0, {{NULL}, NULL}}
+#define schema_File_init_default {{{NULL}, NULL}, 0, {schema_Overlay_init_default}, 0}
+#define schema_ProcessArguments_init_default {{{NULL}, NULL}, 0, {{NULL}, NULL}, 0}
+#define schema_Descriptor_init_default {0, schema_File_init_default}
+#define schema_Streams_init_default {schema_Descriptor_init_default, schema_Descriptor_init_default, schema_Descriptor_init_default}
+#define schema_Process_init_default {0, {{NULL}, NULL}, 0, schema_File_init_default, 0, {{NULL}, NULL}, 0, 0, 0, schema_ProcessArguments_init_default, schema_Streams_init_default, 0}
+#define schema_Container_init_default {0, {{NULL}, NULL}, {{NULL}, NULL}, 0, {{NULL}, NULL}, {{NULL}, NULL}, {{NULL}, NULL}, {{NULL}, NULL}, {{NULL}, NULL}}
+#define schema_ExecuteEvent_init_default {schema_Process_init_default}
+#define schema_CloneEvent_init_default {schema_Process_init_default}
+#define schema_EnumerateProcessEvent_init_default {schema_Process_init_default}
+#define schema_MemoryExecEvent_init_default {schema_Process_init_default, 0, 0, 0, 0, 0, schema_File_init_default, _schema_MemoryExecEvent_Action_MIN, 0, 0, 0}
+#define schema_ContainerInfoEvent_init_default {schema_Container_init_default}
+#define schema_ExitEvent_init_default {{{NULL}, NULL}}
+#define schema_Event_init_default {0, {schema_ExecuteEvent_init_default}, 0}
+#define schema_ContainerReport_init_default {0, schema_Container_init_default}
+#define schema_SocketIp_init_zero {0, {{NULL}, NULL}, 0}
+#define schema_Socket_init_zero {schema_SocketIp_init_zero, schema_SocketIp_init_zero}
+#define schema_Overlay_init_zero {0, 0, {{NULL}, NULL}}
+#define schema_File_init_zero {{{NULL}, NULL}, 0, {schema_Overlay_init_zero}, 0}
+#define schema_ProcessArguments_init_zero {{{NULL}, NULL}, 0, {{NULL}, NULL}, 0}
+#define schema_Descriptor_init_zero {0, schema_File_init_zero}
+#define schema_Streams_init_zero {schema_Descriptor_init_zero, schema_Descriptor_init_zero, schema_Descriptor_init_zero}
+#define schema_Process_init_zero {0, {{NULL}, NULL}, 0, schema_File_init_zero, 0, {{NULL}, NULL}, 0, 0, 0, schema_ProcessArguments_init_zero, schema_Streams_init_zero, 0}
+#define schema_Container_init_zero {0, {{NULL}, NULL}, {{NULL}, NULL}, 0, {{NULL}, NULL}, {{NULL}, NULL}, {{NULL}, NULL}, {{NULL}, NULL}, {{NULL}, NULL}}
+#define schema_ExecuteEvent_init_zero {schema_Process_init_zero}
+#define schema_CloneEvent_init_zero {schema_Process_init_zero}
+#define schema_EnumerateProcessEvent_init_zero {schema_Process_init_zero}
+#define schema_MemoryExecEvent_init_zero {schema_Process_init_zero, 0, 0, 0, 0, 0, schema_File_init_zero, _schema_MemoryExecEvent_Action_MIN, 0, 0, 0}
+#define schema_ContainerInfoEvent_init_zero {schema_Container_init_zero}
+#define schema_ExitEvent_init_zero {{{NULL}, NULL}}
+#define schema_Event_init_zero {0, {schema_ExecuteEvent_init_zero}, 0}
+#define schema_ContainerReport_init_zero {0, schema_Container_init_zero}
+
+/* Field tags (for use in manual encoding/decoding) */
+#define schema_ExitEvent_process_uuid_tag 1
+#define schema_Container_creation_timestamp_tag 1
+#define schema_Container_pod_namespace_tag 2
+#define schema_Container_pod_name_tag 3
+#define schema_Container_container_id_tag 4
+#define schema_Container_container_name_tag 5
+#define schema_Container_container_image_uri_tag 6
+#define schema_Container_labels_tag 7
+#define schema_Container_init_uuid_tag 8
+#define schema_Container_container_image_id_tag 9
+#define schema_Overlay_lower_layer_tag 1
+#define schema_Overlay_upper_layer_tag 2
+#define schema_Overlay_modified_uuid_tag 3
+#define schema_ProcessArguments_argv_tag 1
+#define schema_ProcessArguments_argv_truncated_tag 2
+#define schema_ProcessArguments_envp_tag 3
+#define schema_ProcessArguments_envp_truncated_tag 4
+#define schema_SocketIp_family_tag 1
+#define schema_SocketIp_ip_tag 2
+#define schema_SocketIp_port_tag 3
+#define schema_ContainerInfoEvent_container_tag 1
+#define schema_ContainerReport_pid_tag 1
+#define schema_ContainerReport_container_tag 2
+#define schema_Socket_local_tag 1
+#define schema_Socket_remote_tag 2
+#define schema_File_overlayfs_tag 2
+#define schema_File_socket_tag 4
+#define schema_File_fullpath_tag 1
+#define schema_File_ino_tag 3
+#define schema_Descriptor_mode_tag 1
+#define schema_Descriptor_file_tag 2
+#define schema_Streams_stdin_tag 1
+#define schema_Streams_stdout_tag 2
+#define schema_Streams_stderr_tag 3
+#define schema_Process_creation_timestamp_tag 1
+#define schema_Process_uuid_tag 2
+#define schema_Process_pid_tag 3
+#define schema_Process_binary_tag 4
+#define schema_Process_parent_pid_tag 5
+#define schema_Process_parent_uuid_tag 6
+#define schema_Process_container_id_tag 7
+#define schema_Process_container_pid_tag 8
+#define schema_Process_container_parent_pid_tag 9
+#define schema_Process_args_tag 10
+#define schema_Process_streams_tag 11
+#define schema_Process_exec_session_id_tag 12
+#define schema_CloneEvent_proc_tag 1
+#define schema_EnumerateProcessEvent_proc_tag 1
+#define schema_ExecuteEvent_proc_tag 1
+#define schema_MemoryExecEvent_proc_tag 1
+#define schema_MemoryExecEvent_prot_exec_timestamp_tag 2
+#define schema_MemoryExecEvent_new_flags_tag 3
+#define schema_MemoryExecEvent_req_flags_tag 4
+#define schema_MemoryExecEvent_old_vm_flags_tag 5
+#define schema_MemoryExecEvent_mmap_flags_tag 6
+#define schema_MemoryExecEvent_mapped_file_tag 7
+#define schema_MemoryExecEvent_action_tag 8
+#define schema_MemoryExecEvent_start_addr_tag 9
+#define schema_MemoryExecEvent_end_addr_tag 10
+#define schema_MemoryExecEvent_is_initial_mmap_tag 11
+#define schema_Event_execute_tag 1
+#define schema_Event_container_tag 2
+#define schema_Event_exit_tag 3
+#define schema_Event_memexec_tag 4
+#define schema_Event_clone_tag 5
+#define schema_Event_enumproc_tag 7
+#define schema_Event_timestamp_tag 6
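+
+/* Illustrative sketch (an assumption, not generated output): when encoding a
+ * field manually, these tag constants are passed to the pb_encode.h helpers,
+ * e.g.
+ *
+ *   pb_encode_tag(&stream, PB_WT_VARINT, schema_Event_timestamp_tag);
+ *   pb_encode_varint(&stream, timestamp_ns);   // timestamp_ns: hypothetical
+ */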
+
+/* Struct field encoding specification for nanopb */
+extern const pb_field_t schema_SocketIp_fields[4];
+extern const pb_field_t schema_Socket_fields[3];
+extern const pb_field_t schema_Overlay_fields[4];
+extern const pb_field_t schema_File_fields[5];
+extern const pb_field_t schema_ProcessArguments_fields[5];
+extern const pb_field_t schema_Descriptor_fields[3];
+extern const pb_field_t schema_Streams_fields[4];
+extern const pb_field_t schema_Process_fields[13];
+extern const pb_field_t schema_Container_fields[10];
+extern const pb_field_t schema_ExecuteEvent_fields[2];
+extern const pb_field_t schema_CloneEvent_fields[2];
+extern const pb_field_t schema_EnumerateProcessEvent_fields[2];
+extern const pb_field_t schema_MemoryExecEvent_fields[12];
+extern const pb_field_t schema_ContainerInfoEvent_fields[2];
+extern const pb_field_t schema_ExitEvent_fields[2];
+extern const pb_field_t schema_Event_fields[8];
+extern const pb_field_t schema_ContainerReport_fields[3];
+
+/* Maximum encoded size of messages (where known) */
+/* schema_SocketIp_size depends on runtime parameters */
+#define schema_Socket_size (12 + schema_SocketIp_size + schema_SocketIp_size)
+/* schema_Overlay_size depends on runtime parameters */
+/* schema_File_size depends on runtime parameters */
+/* schema_ProcessArguments_size depends on runtime parameters */
+#define schema_Descriptor_size (12 + schema_File_size)
+#define schema_Streams_size (54 + schema_File_size + schema_File_size + schema_File_size)
+/* schema_Process_size depends on runtime parameters */
+/* schema_Container_size depends on runtime parameters */
+#define schema_ExecuteEvent_size (6 + schema_Process_size)
+#define schema_CloneEvent_size (6 + schema_Process_size)
+#define schema_EnumerateProcessEvent_size (6 + schema_Process_size)
+#define schema_MemoryExecEvent_size (93 + schema_Process_size + schema_File_size)
+#define schema_ContainerInfoEvent_size (6 + schema_Container_size)
+/* schema_ExitEvent_size depends on runtime parameters */
+#define schema_Event_size (11 + ((((((schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) > schema_ExecuteEvent_size ? (schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) : schema_ExecuteEvent_size) > schema_ExitEvent_size ? ((schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) > schema_ExecuteEvent_size ? (schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) : schema_ExecuteEvent_size) : schema_ExitEvent_size) > schema_ContainerInfoEvent_size ? (((schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) > schema_ExecuteEvent_size ? (schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) : schema_ExecuteEvent_size) > schema_ExitEvent_size ? ((schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) > schema_ExecuteEvent_size ? (schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) : schema_ExecuteEvent_size) : schema_ExitEvent_size) : schema_ContainerInfoEvent_size) > schema_MemoryExecEvent_size ? ((((schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) > schema_ExecuteEvent_size ? (schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) : schema_ExecuteEvent_size) > schema_ExitEvent_size ? ((schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) > schema_ExecuteEvent_size ? (schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) : schema_ExecuteEvent_size) : schema_ExitEvent_size) > schema_ContainerInfoEvent_size ? (((schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) > schema_ExecuteEvent_size ? (schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) : schema_ExecuteEvent_size) > schema_ExitEvent_size ? ((schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) > schema_ExecuteEvent_size ? (schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) : schema_ExecuteEvent_size) : schema_ExitEvent_size) : schema_ContainerInfoEvent_size) : schema_MemoryExecEvent_size) > 0 ? (((((schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) > schema_ExecuteEvent_size ? (schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) : schema_ExecuteEvent_size) > schema_ExitEvent_size ? ((schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) > schema_ExecuteEvent_size ? (schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? 
schema_CloneEvent_size : schema_EnumerateProcessEvent_size) : schema_ExecuteEvent_size) : schema_ExitEvent_size) > schema_ContainerInfoEvent_size ? (((schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) > schema_ExecuteEvent_size ? (schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) : schema_ExecuteEvent_size) > schema_ExitEvent_size ? ((schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) > schema_ExecuteEvent_size ? (schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) : schema_ExecuteEvent_size) : schema_ExitEvent_size) : schema_ContainerInfoEvent_size) > schema_MemoryExecEvent_size ? ((((schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) > schema_ExecuteEvent_size ? (schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) : schema_ExecuteEvent_size) > schema_ExitEvent_size ? ((schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) > schema_ExecuteEvent_size ? (schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) : schema_ExecuteEvent_size) : schema_ExitEvent_size) > schema_ContainerInfoEvent_size ? (((schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) > schema_ExecuteEvent_size ? (schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) : schema_ExecuteEvent_size) > schema_ExitEvent_size ? ((schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) > schema_ExecuteEvent_size ? (schema_CloneEvent_size > schema_EnumerateProcessEvent_size ? schema_CloneEvent_size : schema_EnumerateProcessEvent_size) : schema_ExecuteEvent_size) : schema_ExitEvent_size) : schema_ContainerInfoEvent_size) : schema_MemoryExecEvent_size) : 0))
+#define schema_ContainerReport_size (12 + schema_Container_size)
+
+/* Message IDs (where set with "msgid" option) */
+#ifdef PB_MSGID
+
+#define EVENT_MESSAGES \
+
+
+#endif
+
+#ifdef __cplusplus
+} /* extern "C" */
+#endif
+/* @@protoc_insertion_point(eof) */
+
+#endif
diff --git a/security/container/protos/event.proto b/security/container/protos/event.proto
new file mode 100644
index 0000000..dfe483f
--- /dev/null
+++ b/security/container/protos/event.proto
@@ -0,0 +1,151 @@
+syntax = "proto3";
+
+package schema;
+
+message SocketIp {
+ uint32 family = 1; // AF_* for socket type.
+ bytes ip = 2; // ip4 or ip6 address.
+ uint32 port = 3; // port bound or connected.
+}
+
+message Socket {
+ SocketIp local = 1;
+ SocketIp remote = 2; // unset if not connected.
+}
+
+message Overlay {
+ bool lower_layer = 1;
+ bool upper_layer = 2;
+ bytes modified_uuid = 3; // The process that first modified the file.
+}
+
+message File {
+ bytes fullpath = 1;
+ uint32 ino = 3; // inode number.
+ oneof filesystem {
+ Overlay overlayfs = 2;
+ Socket socket = 4;
+ }
+}
+
+message ProcessArguments {
+ repeated bytes argv = 1; // process arguments
+ uint32 argv_truncated = 2; // number of characters truncated from argv
+ repeated bytes envp = 3; // process environment variables
+ uint32 envp_truncated = 4; // number of characters truncated from envp
+}
+
+message Descriptor {
+ uint32 mode = 1; // file mode (stat st_mode)
+ File file = 2;
+}
+
+message Streams {
+ Descriptor stdin = 1;
+ Descriptor stdout = 2;
+ Descriptor stderr = 3;
+}
+
+message Process {
+ uint64 creation_timestamp = 1; // Only populated in ExecuteEvent, in ns.
+ bytes uuid = 2;
+ uint32 pid = 3;
+ File binary = 4; // Only populated in ExecuteEvent.
+ uint32 parent_pid = 5;
+ bytes parent_uuid = 6;
+ uint64 container_id = 7; // unique id of process's container
+ uint32 container_pid = 8; // pid inside the container's pid namespace
+ uint32 container_parent_pid = 9; // optional
+ ProcessArguments args = 10; // Only populated in ExecuteEvent.
+ Streams streams = 11; // Only populated in ExecuteEvent.
+ uint64 exec_session_id = 12; // identifier set for kubectl exec sessions.
+}
+
+message Container {
+ uint64 creation_timestamp = 1; // container create time in ns
+ bytes pod_namespace = 2;
+ bytes pod_name = 3;
+ uint64 container_id = 4; // unique across lifetime of Node
+ bytes container_name = 5;
+ bytes container_image_uri = 6;
+ repeated bytes labels = 7;
+ bytes init_uuid = 8;
+ bytes container_image_id = 9;
+}
+
+// A binary being executed.
+// e.g., execve()
+message ExecuteEvent {
+ Process proc = 1;
+}
+
+// A process clone is being created. This message means that a cloning operation
+// is being attempted. It may be sent even if fork fails.
+message CloneEvent {
+ Process proc = 1;
+}
+
+// Processes that are enumerated at startup will be sent with this event. There
+// is no distinction from events we would have seen from fork or exec.
+message EnumerateProcessEvent {
+ Process proc = 1;
+}
+
+// Collect information about mmap/mprotect calls with the PROT_EXEC flag set.
+message MemoryExecEvent {
+ Process proc = 1; // The origin process
+ // The timestamp in ns when the memory was set executable
+ uint64 prot_exec_timestamp = 2;
+ // The prot flags granted by the kernel for the operation
+ uint64 new_flags = 3;
+ // The prot flags requested for the mprotect/mmap operation
+ uint64 req_flags = 4;
+ // The vm_flags prior to the mprotect operation, if relevant
+ uint64 old_vm_flags = 5;
+ // The operational flags for the mmap operation, if relevant
+ uint64 mmap_flags = 6;
+ // Derived from the file struct describing the fd being mapped
+ File mapped_file = 7;
+ enum Action {
+ UNDEFINED = 0;
+ MPROTECT = 1;
+ MMAP_FILE = 2;
+ }
+ Action action = 8;
+
+ uint64 start_addr = 9; // The executable memory region start addr
+ uint64 end_addr = 10; // The executable memory region end addr
+ // True if this event is a mmap of the process' binary
+ bool is_initial_mmap = 11;
+}
+
+// Associate the following container information with all processes
+// that have the indicated container_id.
+message ContainerInfoEvent {
+ Container container = 1;
+}
+
+// The process with the indicated pid has exited.
+message ExitEvent {
+ bytes process_uuid = 1;
+}
+
+// Next ID: 8
+message Event {
+ oneof event {
+ ExecuteEvent execute = 1;
+ ContainerInfoEvent container = 2;
+ ExitEvent exit = 3;
+ MemoryExecEvent memexec = 4;
+ CloneEvent clone = 5;
+ EnumerateProcessEvent enumproc = 7;
+ }
+
+ uint64 timestamp = 6; // In nanoseconds
+}
+
+// Message sent by the daemonset to the LSM for container enlightenment.
+message ContainerReport {
+ uint32 pid = 1; // Top pid of the running container.
+ Container container = 2; // Information collected about the container.
+}
diff --git a/security/container/protos/nanopb/LICENSE b/security/container/protos/nanopb/LICENSE
new file mode 100644
index 0000000..a83630a
--- /dev/null
+++ b/security/container/protos/nanopb/LICENSE
@@ -0,0 +1,20 @@
+Copyright (c) 2011 Petteri Aimonen <jpa at nanopb.mail.kapsi.fi>
+
+This software is provided 'as-is', without any express or
+implied warranty. In no event will the authors be held liable
+for any damages arising from the use of this software.
+
+Permission is granted to anyone to use this software for any
+purpose, including commercial applications, and to alter it and
+redistribute it freely, subject to the following restrictions:
+
+1. The origin of this software must not be misrepresented; you
+ must not claim that you wrote the original software. If you use
+ this software in a product, an acknowledgment in the product
+ documentation would be appreciated but is not required.
+
+2. Altered source versions must be plainly marked as such, and
+ must not be misrepresented as being the original software.
+
+3. This notice may not be removed or altered from any source
+ distribution.
diff --git a/security/container/protos/nanopb/Makefile b/security/container/protos/nanopb/Makefile
new file mode 100644
index 0000000..b7e15f8
--- /dev/null
+++ b/security/container/protos/nanopb/Makefile
@@ -0,0 +1,7 @@
+obj-$(CONFIG_SECURITY_CONTAINER_MONITOR) += nanopb.o
+
+nanopb-y := pb_encode.o pb_decode.o pb_common.o
+
+ccflags-y := -I$(srctree)/security/container/protos \
+ -I$(srctree)/security/container/protos/nanopb \
+ $(PB_CCFLAGS)
diff --git a/security/container/protos/nanopb/pb.h b/security/container/protos/nanopb/pb.h
new file mode 100644
index 0000000..174a84b
--- /dev/null
+++ b/security/container/protos/nanopb/pb.h
@@ -0,0 +1,593 @@
+/* Common parts of the nanopb library. Most of these are quite low-level
+ * stuff. For the high-level interface, see pb_encode.h and pb_decode.h.
+ */
+
+#ifndef PB_H_INCLUDED
+#define PB_H_INCLUDED
+
+/*****************************************************************
+ * Nanopb compilation time options. You can change these here by *
+ * uncommenting the lines, or on the compiler command line. *
+ *****************************************************************/
+
+/* Enable support for dynamically allocated fields */
+/* #define PB_ENABLE_MALLOC 1 */
+
+/* Define this if your CPU / compiler combination does not support
+ * unaligned memory access to packed structures. */
+/* #define PB_NO_PACKED_STRUCTS 1 */
+
+/* Increase the number of required fields that are tracked.
+ * A compiler warning will tell if you need this. */
+/* #define PB_MAX_REQUIRED_FIELDS 256 */
+
+/* Add support for tag numbers > 255 and fields larger than 255 bytes. */
+/* #define PB_FIELD_16BIT 1 */
+
+/* Add support for tag numbers > 65536 and fields larger than 65536 bytes. */
+/* #define PB_FIELD_32BIT 1 */
+
+/* Disable support for error messages in order to save some code space. */
+/* #define PB_NO_ERRMSG 1 */
+
+/* Disable support for custom streams (support only memory buffers). */
+/* #define PB_BUFFER_ONLY 1 */
+
+/* Switch back to the old-style callback function signature.
+ * This was the default until nanopb-0.2.1. */
+/* #define PB_OLD_CALLBACK_STYLE */
+
+
+/******************************************************************
+ * You usually don't need to change anything below this line. *
+ * Feel free to look around and use the defined macros, though. *
+ ******************************************************************/
+
+
+/* Version of the nanopb library. Just in case you want to check it in
+ * your own program. */
+#define NANOPB_VERSION nanopb-0.3.9.1
+
+/* Include all the system headers needed by nanopb. You will need the
+ * definitions of the following:
+ * - strlen, memcpy, memset functions
+ * - [u]int_least8_t, uint_fast8_t, [u]int_least16_t, [u]int32_t, [u]int64_t
+ * - size_t
+ * - bool
+ *
+ * If you don't have the standard header files, you can instead provide
+ * a custom header that defines or includes all this. In that case,
+ * define PB_SYSTEM_HEADER to the path of this file.
+ */
+#ifdef PB_SYSTEM_HEADER
+#include PB_SYSTEM_HEADER
+#else
+#include <stdint.h>
+#include <stddef.h>
+#include <stdbool.h>
+#include <string.h>
+
+#ifdef PB_ENABLE_MALLOC
+#include <stdlib.h>
+#endif
+#endif
+
+/* Macro for defining packed structures (compiler dependent).
+ * This just reduces memory requirements, but is not required.
+ */
+#if defined(PB_NO_PACKED_STRUCTS)
+ /* Disable struct packing */
+# define PB_PACKED_STRUCT_START
+# define PB_PACKED_STRUCT_END
+# define pb_packed
+#elif defined(__GNUC__) || defined(__clang__)
+ /* For GCC and clang */
+# define PB_PACKED_STRUCT_START
+# define PB_PACKED_STRUCT_END
+# define pb_packed __attribute__((packed))
+#elif defined(__ICCARM__) || defined(__CC_ARM)
+ /* For IAR ARM and Keil MDK-ARM compilers */
+# define PB_PACKED_STRUCT_START _Pragma("pack(push, 1)")
+# define PB_PACKED_STRUCT_END _Pragma("pack(pop)")
+# define pb_packed
+#elif defined(_MSC_VER) && (_MSC_VER >= 1500)
+ /* For Microsoft Visual C++ */
+# define PB_PACKED_STRUCT_START __pragma(pack(push, 1))
+# define PB_PACKED_STRUCT_END __pragma(pack(pop))
+# define pb_packed
+#else
+ /* Unknown compiler */
+# define PB_PACKED_STRUCT_START
+# define PB_PACKED_STRUCT_END
+# define pb_packed
+#endif
+
+/* Handy macro for suppressing unreferenced-parameter compiler warnings. */
+#ifndef PB_UNUSED
+#define PB_UNUSED(x) (void)(x)
+#endif
+
+/* Compile-time assertion, used for checking compatible compilation options.
+ * If this does not work properly on your compiler, use
+ * #define PB_NO_STATIC_ASSERT to disable it.
+ *
+ * But before doing that, check carefully the error message / place where it
+ * comes from to see if the error has a real cause. Unfortunately the error
+ * message is not always very clear to read, but you can see the reason better
+ * in the place where the PB_STATIC_ASSERT macro was called.
+ */
+#ifndef PB_NO_STATIC_ASSERT
+#ifndef PB_STATIC_ASSERT
+#define PB_STATIC_ASSERT(COND,MSG) typedef char PB_STATIC_ASSERT_MSG(MSG, __LINE__, __COUNTER__)[(COND)?1:-1];
+#define PB_STATIC_ASSERT_MSG(MSG, LINE, COUNTER) PB_STATIC_ASSERT_MSG_(MSG, LINE, COUNTER)
+#define PB_STATIC_ASSERT_MSG_(MSG, LINE, COUNTER) pb_static_assertion_##MSG##LINE##COUNTER
+#endif
+#else
+#define PB_STATIC_ASSERT(COND,MSG)
+#endif
+
+/* Number of required fields to keep track of. */
+#ifndef PB_MAX_REQUIRED_FIELDS
+#define PB_MAX_REQUIRED_FIELDS 64
+#endif
+
+#if PB_MAX_REQUIRED_FIELDS < 64
+#error You should not lower PB_MAX_REQUIRED_FIELDS from the default value (64).
+#endif
+
+/* List of possible field types. These are used in the autogenerated code.
+ * Least-significant 4 bits tell the scalar type
+ * Most-significant 4 bits specify repeated/required/packed etc.
+ */
+
+typedef uint_least8_t pb_type_t;
+
+/**** Field data types ****/
+
+/* Numeric types */
+#define PB_LTYPE_VARINT 0x00 /* int32, int64, enum, bool */
+#define PB_LTYPE_UVARINT 0x01 /* uint32, uint64 */
+#define PB_LTYPE_SVARINT 0x02 /* sint32, sint64 */
+#define PB_LTYPE_FIXED32 0x03 /* fixed32, sfixed32, float */
+#define PB_LTYPE_FIXED64 0x04 /* fixed64, sfixed64, double */
+
+/* Marker for last packable field type. */
+#define PB_LTYPE_LAST_PACKABLE 0x04
+
+/* Byte array with pre-allocated buffer.
+ * data_size is the length of the allocated PB_BYTES_ARRAY structure. */
+#define PB_LTYPE_BYTES 0x05
+
+/* String with pre-allocated buffer.
+ * data_size is the maximum length. */
+#define PB_LTYPE_STRING 0x06
+
+/* Submessage
+ * submsg_fields is pointer to field descriptions */
+#define PB_LTYPE_SUBMESSAGE 0x07
+
+/* Extension pseudo-field
+ * The field contains a pointer to pb_extension_t */
+#define PB_LTYPE_EXTENSION 0x08
+
+/* Byte array with inline, pre-allocated buffer.
+ * data_size is the length of the inline, allocated buffer.
+ * This differs from PB_LTYPE_BYTES by defining the element as
+ * pb_byte_t[data_size] rather than pb_bytes_array_t. */
+#define PB_LTYPE_FIXED_LENGTH_BYTES 0x09
+
+/* Number of declared LTYPES */
+#define PB_LTYPES_COUNT 0x0A
+#define PB_LTYPE_MASK 0x0F
+
+/**** Field repetition rules ****/
+
+#define PB_HTYPE_REQUIRED 0x00
+#define PB_HTYPE_OPTIONAL 0x10
+#define PB_HTYPE_REPEATED 0x20
+#define PB_HTYPE_ONEOF 0x30
+#define PB_HTYPE_MASK 0x30
+
+/**** Field allocation types ****/
+
+#define PB_ATYPE_STATIC 0x00
+#define PB_ATYPE_POINTER 0x80
+#define PB_ATYPE_CALLBACK 0x40
+#define PB_ATYPE_MASK 0xC0
+
+#define PB_ATYPE(x) ((x) & PB_ATYPE_MASK)
+#define PB_HTYPE(x) ((x) & PB_HTYPE_MASK)
+#define PB_LTYPE(x) ((x) & PB_LTYPE_MASK)
+
+/* Data type used for storing sizes of struct fields
+ * and array counts.
+ */
+#if defined(PB_FIELD_32BIT)
+ typedef uint32_t pb_size_t;
+ typedef int32_t pb_ssize_t;
+#elif defined(PB_FIELD_16BIT)
+ typedef uint_least16_t pb_size_t;
+ typedef int_least16_t pb_ssize_t;
+#else
+ typedef uint_least8_t pb_size_t;
+ typedef int_least8_t pb_ssize_t;
+#endif
+#define PB_SIZE_MAX ((pb_size_t)-1)
+
+/* Data type for storing encoded data and other byte streams.
+ * This typedef exists to support platforms where uint8_t does not exist.
+ * You can regard it as equivalent to uint8_t on other platforms.
+ */
+typedef uint_least8_t pb_byte_t;
+
+/* This structure is used in auto-generated constants
+ * to specify struct fields.
+ * You can change field sizes if you need structures
+ * larger than 256 bytes or field tags larger than 256.
+ * The compiler should complain if your .proto has such
+ * structures. Fix that by defining PB_FIELD_16BIT or
+ * PB_FIELD_32BIT.
+ */
+PB_PACKED_STRUCT_START
+typedef struct pb_field_s pb_field_t;
+struct pb_field_s {
+ pb_size_t tag;
+ pb_type_t type;
+ pb_size_t data_offset; /* Offset of field data, relative to previous field. */
+ pb_ssize_t size_offset; /* Offset of array size or has-boolean, relative to data */
+ pb_size_t data_size; /* Data size in bytes for a single item */
+ pb_size_t array_size; /* Maximum number of entries in array */
+
+ /* Field definitions for submessage
+ * OR default value for all other non-array, non-callback types
+ * If null, then the field will be zeroed. */
+ const void *ptr;
+} pb_packed;
+PB_PACKED_STRUCT_END
+
+/* Make sure that the standard integer types are of the expected sizes.
+ * Otherwise fixed32/fixed64 fields can break.
+ *
+ * If you get errors here, it probably means that your stdint.h is not
+ * correct for your platform.
+ */
+#ifndef PB_WITHOUT_64BIT
+PB_STATIC_ASSERT(sizeof(int64_t) == 2 * sizeof(int32_t), INT64_T_WRONG_SIZE)
+PB_STATIC_ASSERT(sizeof(uint64_t) == 2 * sizeof(uint32_t), UINT64_T_WRONG_SIZE)
+#endif
+
+/* This structure is used for 'bytes' arrays.
+ * It has the number of bytes in the beginning, and after that an array.
+ * Note that the actual structs used will have a different length for the bytes array.
+ */
+#define PB_BYTES_ARRAY_T(n) struct { pb_size_t size; pb_byte_t bytes[n]; }
+#define PB_BYTES_ARRAY_T_ALLOCSIZE(n) ((size_t)n + offsetof(pb_bytes_array_t, bytes))
+
+struct pb_bytes_array_s {
+ pb_size_t size;
+ pb_byte_t bytes[1];
+};
+typedef struct pb_bytes_array_s pb_bytes_array_t;
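+
+/* Illustrative sketch (an assumption, not generated output): a statically
+ * allocated bytes field with a 16-byte maximum is typically emitted by the
+ * generator as
+ *
+ *   typedef PB_BYTES_ARRAY_T(16) MyMessage_data_t;   // hypothetical name
+ *
+ * i.e. a struct holding a size member followed by bytes[16], layout
+ * compatible with pb_bytes_array_t above.
+ */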
+
+/* This structure is used for giving the callback function.
+ * It is stored in the message structure and filled in by the method that
+ * calls pb_decode.
+ *
+ * The decoding callback will be given a limited-length stream.
+ * If the wire type was string, the length is the length of the string.
+ * If the wire type was a varint/fixed32/fixed64, the length is the length
+ * of the actual value.
+ * The function may be called multiple times (especially for repeated types,
+ * but also otherwise if the message happens to contain the field multiple
+ * times.)
+ *
+ * The encoding callback will receive the actual output stream.
+ * It should write all the data in one call, including the field tag and
+ * wire type. It can write multiple fields.
+ *
+ * The callback can be null if you want to skip a field.
+ */
+typedef struct pb_istream_s pb_istream_t;
+typedef struct pb_ostream_s pb_ostream_t;
+typedef struct pb_callback_s pb_callback_t;
+struct pb_callback_s {
+#ifdef PB_OLD_CALLBACK_STYLE
+ /* Deprecated since nanopb-0.2.1 */
+ union {
+ bool (*decode)(pb_istream_t *stream, const pb_field_t *field, void *arg);
+ bool (*encode)(pb_ostream_t *stream, const pb_field_t *field, const void *arg);
+ } funcs;
+#else
+ /* New function signature, which allows modifying arg contents in callback. */
+ union {
+ bool (*decode)(pb_istream_t *stream, const pb_field_t *field, void **arg);
+ bool (*encode)(pb_ostream_t *stream, const pb_field_t *field, void * const *arg);
+ } funcs;
+#endif
+
+ /* Free arg for use by callback */
+ void *arg;
+};
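+
+/* Illustrative sketch (an assumption, not part of nanopb): a decode callback
+ * matching the new-style signature above might look roughly like:
+ *
+ *   static bool log_bytes(pb_istream_t *stream, const pb_field_t *field,
+ *                         void **arg)
+ *   {
+ *       pb_byte_t buf[64];
+ *       size_t len = stream->bytes_left;
+ *
+ *       if (len > sizeof(buf) || !pb_read(stream, buf, len))
+ *           return false;
+ *       consume_bytes(buf, len);   // consume_bytes(): hypothetical helper
+ *       return true;
+ *   }
+ */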
+
+/* Wire types. Library user needs these only in encoder callbacks. */
+typedef enum {
+ PB_WT_VARINT = 0,
+ PB_WT_64BIT = 1,
+ PB_WT_STRING = 2,
+ PB_WT_32BIT = 5
+} pb_wire_type_t;
+
+/* Structure for defining the handling of unknown/extension fields.
+ * Usually the pb_extension_type_t structure is automatically generated,
+ * while the pb_extension_t structure is created by the user. However,
+ * if you want to catch all unknown fields, you can also create a custom
+ * pb_extension_type_t with your own callback.
+ */
+typedef struct pb_extension_type_s pb_extension_type_t;
+typedef struct pb_extension_s pb_extension_t;
+struct pb_extension_type_s {
+ /* Called for each unknown field in the message.
+ * If you handle the field, read off all of its data and return true.
+ * If you do not handle the field, do not read anything and return true.
+ * If you run into an error, return false.
+ * Set to NULL for default handler.
+ */
+ bool (*decode)(pb_istream_t *stream, pb_extension_t *extension,
+ uint32_t tag, pb_wire_type_t wire_type);
+
+ /* Called once after all regular fields have been encoded.
+ * If you have something to write, do so and return true.
+ * If you do not have anything to write, just return true.
+ * If you run into an error, return false.
+ * Set to NULL for default handler.
+ */
+ bool (*encode)(pb_ostream_t *stream, const pb_extension_t *extension);
+
+ /* Free field for use by the callback. */
+ const void *arg;
+};
+
+struct pb_extension_s {
+ /* Type describing the extension field. Usually you'll initialize
+ * this to a pointer to the automatically generated structure. */
+ const pb_extension_type_t *type;
+
+ /* Destination for the decoded data. This must match the datatype
+ * of the extension field. */
+ void *dest;
+
+ /* Pointer to the next extension handler, or NULL.
+ * If this extension does not match a field, the next handler is
+ * automatically called. */
+ pb_extension_t *next;
+
+ /* The decoder sets this to true if the extension was found.
+ * Ignored for encoding. */
+ bool found;
+};
+
+/* Memory allocation functions to use. You can define pb_realloc and
+ * pb_free to custom functions if you want. */
+#ifdef PB_ENABLE_MALLOC
+# ifndef pb_realloc
+# define pb_realloc(ptr, size) realloc(ptr, size)
+# endif
+# ifndef pb_free
+# define pb_free(ptr) free(ptr)
+# endif
+#endif
+
+/* This is used to inform about the need to regenerate .pb.h/.pb.c files. */
+#define PB_PROTO_HEADER_VERSION 30
+
+/* These macros are used to declare pb_field_t's in the constant array. */
+/* Size of a structure member, in bytes. */
+#define pb_membersize(st, m) (sizeof ((st*)0)->m)
+/* Number of entries in an array. */
+#define pb_arraysize(st, m) (pb_membersize(st, m) / pb_membersize(st, m[0]))
+/* Delta from start of one member to the start of another member. */
+#define pb_delta(st, m1, m2) ((int)offsetof(st, m1) - (int)offsetof(st, m2))
+/* Marks the end of the field list */
+#define PB_LAST_FIELD {0,(pb_type_t) 0,0,0,0,0,0}
+
+/* Macros for filling in the data_offset field */
+/* data_offset for first field in a message */
+#define PB_DATAOFFSET_FIRST(st, m1, m2) (offsetof(st, m1))
+/* data_offset for subsequent fields */
+#define PB_DATAOFFSET_OTHER(st, m1, m2) (offsetof(st, m1) - offsetof(st, m2) - pb_membersize(st, m2))
+/* data offset for subsequent fields inside a union (oneof) */
+#define PB_DATAOFFSET_UNION(st, m1, m2) (PB_SIZE_MAX)
+/* Choose first/other based on m1 == m2 (deprecated, remains for backwards compatibility) */
+#define PB_DATAOFFSET_CHOOSE(st, m1, m2) (int)(offsetof(st, m1) == offsetof(st, m2) \
+ ? PB_DATAOFFSET_FIRST(st, m1, m2) \
+ : PB_DATAOFFSET_OTHER(st, m1, m2))
+
+/* Required fields are the simplest. They just have delta (padding) from
+ * previous field end, and the size of the field. Pointer is used for
+ * submessages and default values.
+ */
+#define PB_REQUIRED_STATIC(tag, st, m, fd, ltype, ptr) \
+ {tag, PB_ATYPE_STATIC | PB_HTYPE_REQUIRED | ltype, \
+ fd, 0, pb_membersize(st, m), 0, ptr}
+
+/* Optional fields add the delta to the has_ variable. */
+#define PB_OPTIONAL_STATIC(tag, st, m, fd, ltype, ptr) \
+ {tag, PB_ATYPE_STATIC | PB_HTYPE_OPTIONAL | ltype, \
+ fd, \
+ pb_delta(st, has_ ## m, m), \
+ pb_membersize(st, m), 0, ptr}
+
+#define PB_SINGULAR_STATIC(tag, st, m, fd, ltype, ptr) \
+ {tag, PB_ATYPE_STATIC | PB_HTYPE_OPTIONAL | ltype, \
+ fd, 0, pb_membersize(st, m), 0, ptr}
+
+/* Repeated fields have a _count field and also the maximum number of entries. */
+#define PB_REPEATED_STATIC(tag, st, m, fd, ltype, ptr) \
+ {tag, PB_ATYPE_STATIC | PB_HTYPE_REPEATED | ltype, \
+ fd, \
+ pb_delta(st, m ## _count, m), \
+ pb_membersize(st, m[0]), \
+ pb_arraysize(st, m), ptr}
+
+/* Allocated fields carry the size of the actual data, not the pointer */
+#define PB_REQUIRED_POINTER(tag, st, m, fd, ltype, ptr) \
+ {tag, PB_ATYPE_POINTER | PB_HTYPE_REQUIRED | ltype, \
+ fd, 0, pb_membersize(st, m[0]), 0, ptr}
+
+/* Optional fields don't need a has_ variable, as information would be redundant */
+#define PB_OPTIONAL_POINTER(tag, st, m, fd, ltype, ptr) \
+ {tag, PB_ATYPE_POINTER | PB_HTYPE_OPTIONAL | ltype, \
+ fd, 0, pb_membersize(st, m[0]), 0, ptr}
+
+/* Same as optional fields */
+#define PB_SINGULAR_POINTER(tag, st, m, fd, ltype, ptr) \
+ {tag, PB_ATYPE_POINTER | PB_HTYPE_OPTIONAL | ltype, \
+ fd, 0, pb_membersize(st, m[0]), 0, ptr}
+
+/* Repeated fields have a _count field and a pointer to array of pointers */
+#define PB_REPEATED_POINTER(tag, st, m, fd, ltype, ptr) \
+ {tag, PB_ATYPE_POINTER | PB_HTYPE_REPEATED | ltype, \
+ fd, pb_delta(st, m ## _count, m), \
+ pb_membersize(st, m[0]), 0, ptr}
+
+/* Callbacks are much like required fields except with special datatype. */
+#define PB_REQUIRED_CALLBACK(tag, st, m, fd, ltype, ptr) \
+ {tag, PB_ATYPE_CALLBACK | PB_HTYPE_REQUIRED | ltype, \
+ fd, 0, pb_membersize(st, m), 0, ptr}
+
+#define PB_OPTIONAL_CALLBACK(tag, st, m, fd, ltype, ptr) \
+ {tag, PB_ATYPE_CALLBACK | PB_HTYPE_OPTIONAL | ltype, \
+ fd, 0, pb_membersize(st, m), 0, ptr}
+
+#define PB_SINGULAR_CALLBACK(tag, st, m, fd, ltype, ptr) \
+ {tag, PB_ATYPE_CALLBACK | PB_HTYPE_OPTIONAL | ltype, \
+ fd, 0, pb_membersize(st, m), 0, ptr}
+
+#define PB_REPEATED_CALLBACK(tag, st, m, fd, ltype, ptr) \
+ {tag, PB_ATYPE_CALLBACK | PB_HTYPE_REPEATED | ltype, \
+ fd, 0, pb_membersize(st, m), 0, ptr}
+
+/* Optional extensions don't have the has_ field, as that would be redundant.
+ * Furthermore, the combination of OPTIONAL without has_ field is used
+ * for indicating proto3 style fields. Extensions exist in proto2 mode only,
+ * so they should be encoded according to proto2 rules. To avoid the conflict,
+ * extensions are marked as REQUIRED instead.
+ */
+#define PB_OPTEXT_STATIC(tag, st, m, fd, ltype, ptr) \
+ {tag, PB_ATYPE_STATIC | PB_HTYPE_REQUIRED | ltype, \
+ 0, \
+ 0, \
+ pb_membersize(st, m), 0, ptr}
+
+#define PB_OPTEXT_POINTER(tag, st, m, fd, ltype, ptr) \
+ PB_OPTIONAL_POINTER(tag, st, m, fd, ltype, ptr)
+
+#define PB_OPTEXT_CALLBACK(tag, st, m, fd, ltype, ptr) \
+ PB_OPTIONAL_CALLBACK(tag, st, m, fd, ltype, ptr)
+
+/* The mapping from protobuf types to LTYPEs is done using these macros. */
+#define PB_LTYPE_MAP_BOOL PB_LTYPE_VARINT
+#define PB_LTYPE_MAP_BYTES PB_LTYPE_BYTES
+#define PB_LTYPE_MAP_DOUBLE PB_LTYPE_FIXED64
+#define PB_LTYPE_MAP_ENUM PB_LTYPE_VARINT
+#define PB_LTYPE_MAP_UENUM PB_LTYPE_UVARINT
+#define PB_LTYPE_MAP_FIXED32 PB_LTYPE_FIXED32
+#define PB_LTYPE_MAP_FIXED64 PB_LTYPE_FIXED64
+#define PB_LTYPE_MAP_FLOAT PB_LTYPE_FIXED32
+#define PB_LTYPE_MAP_INT32 PB_LTYPE_VARINT
+#define PB_LTYPE_MAP_INT64 PB_LTYPE_VARINT
+#define PB_LTYPE_MAP_MESSAGE PB_LTYPE_SUBMESSAGE
+#define PB_LTYPE_MAP_SFIXED32 PB_LTYPE_FIXED32
+#define PB_LTYPE_MAP_SFIXED64 PB_LTYPE_FIXED64
+#define PB_LTYPE_MAP_SINT32 PB_LTYPE_SVARINT
+#define PB_LTYPE_MAP_SINT64 PB_LTYPE_SVARINT
+#define PB_LTYPE_MAP_STRING PB_LTYPE_STRING
+#define PB_LTYPE_MAP_UINT32 PB_LTYPE_UVARINT
+#define PB_LTYPE_MAP_UINT64 PB_LTYPE_UVARINT
+#define PB_LTYPE_MAP_EXTENSION PB_LTYPE_EXTENSION
+#define PB_LTYPE_MAP_FIXED_LENGTH_BYTES PB_LTYPE_FIXED_LENGTH_BYTES
+
+/* This is the actual macro used in field descriptions.
+ * It takes these arguments:
+ * - Field tag number
+ * - Field type: BOOL, BYTES, DOUBLE, ENUM, UENUM, FIXED32, FIXED64,
+ * FLOAT, INT32, INT64, MESSAGE, SFIXED32, SFIXED64,
+ * SINT32, SINT64, STRING, UINT32, UINT64 or EXTENSION
+ * - Field rules: REQUIRED, OPTIONAL or REPEATED
+ * - Allocation: STATIC, CALLBACK or POINTER
+ * - Placement: FIRST or OTHER, depending on whether this is the first field in the structure.
+ * - Message name
+ * - Field name
+ * - Previous field name (or field name again for first field)
+ * - Pointer to default value or submsg fields.
+ */
+
+#define PB_FIELD(tag, type, rules, allocation, placement, message, field, prevfield, ptr) \
+ PB_ ## rules ## _ ## allocation(tag, message, field, \
+ PB_DATAOFFSET_ ## placement(message, field, prevfield), \
+ PB_LTYPE_MAP_ ## type, ptr)
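+
+/* Illustrative sketch (an assumption, not generated output): for a
+ * hypothetical message "Example { uint32 id = 1; }" with a matching C struct
+ * "typedef struct { uint32_t id; } Example;", the generated table would be
+ * roughly:
+ *
+ *   const pb_field_t Example_fields[2] = {
+ *       PB_FIELD(1, UINT32, SINGULAR, STATIC, FIRST, Example, id, id, 0),
+ *       PB_LAST_FIELD
+ *   };
+ */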
+
+/* Field description for repeated static fixed count fields. */
+#define PB_REPEATED_FIXED_COUNT(tag, type, placement, message, field, prevfield, ptr) \
+ {tag, PB_ATYPE_STATIC | PB_HTYPE_REPEATED | PB_LTYPE_MAP_ ## type, \
+ PB_DATAOFFSET_ ## placement(message, field, prevfield), \
+ 0, \
+ pb_membersize(message, field[0]), \
+ pb_arraysize(message, field), ptr}
+
+/* Field description for oneof fields. This requires taking into account the
+ * union name also, that's why a separate set of macros is needed.
+ */
+#define PB_ONEOF_STATIC(u, tag, st, m, fd, ltype, ptr) \
+ {tag, PB_ATYPE_STATIC | PB_HTYPE_ONEOF | ltype, \
+ fd, pb_delta(st, which_ ## u, u.m), \
+ pb_membersize(st, u.m), 0, ptr}
+
+#define PB_ONEOF_POINTER(u, tag, st, m, fd, ltype, ptr) \
+ {tag, PB_ATYPE_POINTER | PB_HTYPE_ONEOF | ltype, \
+ fd, pb_delta(st, which_ ## u, u.m), \
+ pb_membersize(st, u.m[0]), 0, ptr}
+
+#define PB_ONEOF_FIELD(union_name, tag, type, rules, allocation, placement, message, field, prevfield, ptr) \
+ PB_ONEOF_ ## allocation(union_name, tag, message, field, \
+ PB_DATAOFFSET_ ## placement(message, union_name.field, prevfield), \
+ PB_LTYPE_MAP_ ## type, ptr)
+
+#define PB_ANONYMOUS_ONEOF_STATIC(u, tag, st, m, fd, ltype, ptr) \
+ {tag, PB_ATYPE_STATIC | PB_HTYPE_ONEOF | ltype, \
+ fd, pb_delta(st, which_ ## u, m), \
+ pb_membersize(st, m), 0, ptr}
+
+#define PB_ANONYMOUS_ONEOF_POINTER(u, tag, st, m, fd, ltype, ptr) \
+ {tag, PB_ATYPE_POINTER | PB_HTYPE_ONEOF | ltype, \
+ fd, pb_delta(st, which_ ## u, m), \
+ pb_membersize(st, m[0]), 0, ptr}
+
+#define PB_ANONYMOUS_ONEOF_FIELD(union_name, tag, type, rules, allocation, placement, message, field, prevfield, ptr) \
+ PB_ANONYMOUS_ONEOF_ ## allocation(union_name, tag, message, field, \
+ PB_DATAOFFSET_ ## placement(message, field, prevfield), \
+ PB_LTYPE_MAP_ ## type, ptr)
+
+/* These macros are used for giving out error messages.
+ * They are mostly a debugging aid; the main error information
+ * is the true/false return value from functions.
+ * Some code space can be saved by disabling the error
+ * messages if not used.
+ *
+ * PB_SET_ERROR() sets the error message if none has been set yet.
+ * msg must be a constant string literal.
+ * PB_GET_ERROR() always returns a pointer to a string.
+ * PB_RETURN_ERROR() sets the error and returns false from current
+ * function.
+ */
+#ifdef PB_NO_ERRMSG
+#define PB_SET_ERROR(stream, msg) PB_UNUSED(stream)
+#define PB_GET_ERROR(stream) "(errmsg disabled)"
+#else
+#define PB_SET_ERROR(stream, msg) (stream->errmsg = (stream)->errmsg ? (stream)->errmsg : (msg))
+#define PB_GET_ERROR(stream) ((stream)->errmsg ? (stream)->errmsg : "(none)")
+#endif
+
+#define PB_RETURN_ERROR(stream, msg) return PB_SET_ERROR(stream, msg), false
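+
+/* Illustrative sketch (an assumption, not part of nanopb): a callback
+ * typically reports failures through these macros, e.g.
+ *
+ *   if (count > MAX_ENTRIES)             // MAX_ENTRIES: hypothetical bound
+ *       PB_RETURN_ERROR(stream, "too many entries");
+ */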
+
+#endif
diff --git a/security/container/protos/nanopb/pb_common.c b/security/container/protos/nanopb/pb_common.c
new file mode 100644
index 0000000..4fb7186
--- /dev/null
+++ b/security/container/protos/nanopb/pb_common.c
@@ -0,0 +1,97 @@
+/* pb_common.c: Common support functions for pb_encode.c and pb_decode.c.
+ *
+ * 2014 Petteri Aimonen <jpa@kapsi.fi>
+ */
+
+#include "pb_common.h"
+
+bool pb_field_iter_begin(pb_field_iter_t *iter, const pb_field_t *fields, void *dest_struct)
+{
+ iter->start = fields;
+ iter->pos = fields;
+ iter->required_field_index = 0;
+ iter->dest_struct = dest_struct;
+ iter->pData = (char*)dest_struct + iter->pos->data_offset;
+ iter->pSize = (char*)iter->pData + iter->pos->size_offset;
+
+ return (iter->pos->tag != 0);
+}
+
+bool pb_field_iter_next(pb_field_iter_t *iter)
+{
+ const pb_field_t *prev_field = iter->pos;
+
+ if (prev_field->tag == 0)
+ {
+ /* Handle empty message types, where the first field is already the terminator.
+ * In other cases, the iter->pos never points to the terminator. */
+ return false;
+ }
+
+ iter->pos++;
+
+ if (iter->pos->tag == 0)
+ {
+ /* Wrapped back to beginning, reinitialize */
+ (void)pb_field_iter_begin(iter, iter->start, iter->dest_struct);
+ return false;
+ }
+ else
+ {
+ /* Increment the pointers based on previous field size */
+ size_t prev_size = prev_field->data_size;
+
+ if (PB_HTYPE(prev_field->type) == PB_HTYPE_ONEOF &&
+ PB_HTYPE(iter->pos->type) == PB_HTYPE_ONEOF &&
+ iter->pos->data_offset == PB_SIZE_MAX)
+ {
+ /* Don't advance pointers inside unions */
+ return true;
+ }
+ else if (PB_ATYPE(prev_field->type) == PB_ATYPE_STATIC &&
+ PB_HTYPE(prev_field->type) == PB_HTYPE_REPEATED)
+ {
+ /* In static arrays, the data_size tells the size of a single entry and
+ * array_size is the number of entries */
+ prev_size *= prev_field->array_size;
+ }
+ else if (PB_ATYPE(prev_field->type) == PB_ATYPE_POINTER)
+ {
+ /* Pointer fields always have a constant size in the main structure.
+ * The data_size only applies to the dynamically allocated area. */
+ prev_size = sizeof(void*);
+ }
+
+ if (PB_HTYPE(prev_field->type) == PB_HTYPE_REQUIRED)
+ {
+ /* Count the required fields, in order to check their presence in the
+ * decoder. */
+ iter->required_field_index++;
+ }
+
+ iter->pData = (char*)iter->pData + prev_size + iter->pos->data_offset;
+ iter->pSize = (char*)iter->pData + iter->pos->size_offset;
+ return true;
+ }
+}
+
+bool pb_field_iter_find(pb_field_iter_t *iter, uint32_t tag)
+{
+ const pb_field_t *start = iter->pos;
+
+ do {
+ if (iter->pos->tag == tag &&
+ PB_LTYPE(iter->pos->type) != PB_LTYPE_EXTENSION)
+ {
+ /* Found the wanted field */
+ return true;
+ }
+
+ (void)pb_field_iter_next(iter);
+ } while (iter->pos != start);
+
+ /* Searched all the way back to start, and found nothing. */
+ return false;
+}
+
+
diff --git a/security/container/protos/nanopb/pb_common.h b/security/container/protos/nanopb/pb_common.h
new file mode 100644
index 0000000..60b3d37
--- /dev/null
+++ b/security/container/protos/nanopb/pb_common.h
@@ -0,0 +1,42 @@
+/* pb_common.h: Common support functions for pb_encode.c and pb_decode.c.
+ * These functions are rarely needed by applications directly.
+ */
+
+#ifndef PB_COMMON_H_INCLUDED
+#define PB_COMMON_H_INCLUDED
+
+#include "pb.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/* Iterator for pb_field_t list */
+struct pb_field_iter_s {
+ const pb_field_t *start; /* Start of the pb_field_t array */
+ const pb_field_t *pos; /* Current position of the iterator */
+ unsigned required_field_index; /* Zero-based index that counts only the required fields */
+ void *dest_struct; /* Pointer to start of the structure */
+ void *pData; /* Pointer to current field value */
+ void *pSize; /* Pointer to count/has field */
+};
+typedef struct pb_field_iter_s pb_field_iter_t;
+
+/* Initialize the field iterator structure to beginning.
+ * Returns false if the message type is empty. */
+bool pb_field_iter_begin(pb_field_iter_t *iter, const pb_field_t *fields, void *dest_struct);
+
+/* Advance the iterator to the next field.
+ * Returns false when the iterator wraps back to the first field. */
+bool pb_field_iter_next(pb_field_iter_t *iter);
+
+/* Advance the iterator until it points at a field with the given tag.
+ * Returns false if no such field exists. */
+bool pb_field_iter_find(pb_field_iter_t *iter, uint32_t tag);
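+
+/* Illustrative sketch (an assumption, not part of nanopb): walking every
+ * field of a message with this iterator looks roughly like the following,
+ * where event is a decoded schema_Event from the generated types in this
+ * patch and inspect() is a hypothetical helper:
+ *
+ *   pb_field_iter_t iter;
+ *
+ *   if (pb_field_iter_begin(&iter, schema_Event_fields, &event))
+ *   {
+ *       do {
+ *           inspect(iter.pos->tag, iter.pData);
+ *       } while (pb_field_iter_next(&iter));
+ *   }
+ */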
+
+#ifdef __cplusplus
+} /* extern "C" */
+#endif
+
+#endif
+
diff --git a/security/container/protos/nanopb/pb_decode.c b/security/container/protos/nanopb/pb_decode.c
new file mode 100644
index 0000000..4b80e81
--- /dev/null
+++ b/security/container/protos/nanopb/pb_decode.c
@@ -0,0 +1,1508 @@
+/* pb_decode.c -- decode a protobuf using minimal resources
+ *
+ * 2011 Petteri Aimonen <jpa@kapsi.fi>
+ */
+
+/* Use the GCC warn_unused_result attribute to check that all return values
+ * are propagated correctly. On other compilers and gcc before 3.4.0 just
+ * ignore the annotation.
+ */
+#if !defined(__GNUC__) || ( __GNUC__ < 3) || (__GNUC__ == 3 && __GNUC_MINOR__ < 4)
+ #define checkreturn
+#else
+ #define checkreturn __attribute__((warn_unused_result))
+#endif
+
+#include "pb.h"
+#include "pb_decode.h"
+#include "pb_common.h"
+
+/**************************************
+ * Declarations internal to this file *
+ **************************************/
+
+typedef bool (*pb_decoder_t)(pb_istream_t *stream, const pb_field_t *field, void *dest) checkreturn;
+
+static bool checkreturn buf_read(pb_istream_t *stream, pb_byte_t *buf, size_t count);
+static bool checkreturn read_raw_value(pb_istream_t *stream, pb_wire_type_t wire_type, pb_byte_t *buf, size_t *size);
+static bool checkreturn decode_static_field(pb_istream_t *stream, pb_wire_type_t wire_type, pb_field_iter_t *iter);
+static bool checkreturn decode_callback_field(pb_istream_t *stream, pb_wire_type_t wire_type, pb_field_iter_t *iter);
+static bool checkreturn decode_field(pb_istream_t *stream, pb_wire_type_t wire_type, pb_field_iter_t *iter);
+static void iter_from_extension(pb_field_iter_t *iter, pb_extension_t *extension);
+static bool checkreturn default_extension_decoder(pb_istream_t *stream, pb_extension_t *extension, uint32_t tag, pb_wire_type_t wire_type);
+static bool checkreturn decode_extension(pb_istream_t *stream, uint32_t tag, pb_wire_type_t wire_type, pb_field_iter_t *iter);
+static bool checkreturn find_extension_field(pb_field_iter_t *iter);
+static void pb_field_set_to_default(pb_field_iter_t *iter);
+static void pb_message_set_to_defaults(const pb_field_t fields[], void *dest_struct);
+static bool checkreturn pb_dec_varint(pb_istream_t *stream, const pb_field_t *field, void *dest);
+static bool checkreturn pb_decode_varint32_eof(pb_istream_t *stream, uint32_t *dest, bool *eof);
+static bool checkreturn pb_dec_uvarint(pb_istream_t *stream, const pb_field_t *field, void *dest);
+static bool checkreturn pb_dec_svarint(pb_istream_t *stream, const pb_field_t *field, void *dest);
+static bool checkreturn pb_dec_fixed32(pb_istream_t *stream, const pb_field_t *field, void *dest);
+static bool checkreturn pb_dec_fixed64(pb_istream_t *stream, const pb_field_t *field, void *dest);
+static bool checkreturn pb_dec_bytes(pb_istream_t *stream, const pb_field_t *field, void *dest);
+static bool checkreturn pb_dec_string(pb_istream_t *stream, const pb_field_t *field, void *dest);
+static bool checkreturn pb_dec_submessage(pb_istream_t *stream, const pb_field_t *field, void *dest);
+static bool checkreturn pb_dec_fixed_length_bytes(pb_istream_t *stream, const pb_field_t *field, void *dest);
+static bool checkreturn pb_skip_varint(pb_istream_t *stream);
+static bool checkreturn pb_skip_string(pb_istream_t *stream);
+
+#ifdef PB_ENABLE_MALLOC
+static bool checkreturn allocate_field(pb_istream_t *stream, void *pData, size_t data_size, size_t array_size);
+static bool checkreturn pb_release_union_field(pb_istream_t *stream, pb_field_iter_t *iter);
+static void pb_release_single_field(const pb_field_iter_t *iter);
+#endif
+
+#ifdef PB_WITHOUT_64BIT
+#define pb_int64_t int32_t
+#define pb_uint64_t uint32_t
+#else
+#define pb_int64_t int64_t
+#define pb_uint64_t uint64_t
+#endif
+
+/* --- Function pointers to field decoders ---
+ * Order in the array must match the LTYPE numbering.
+ */
+static const pb_decoder_t PB_DECODERS[PB_LTYPES_COUNT] = {
+ &pb_dec_varint,
+ &pb_dec_uvarint,
+ &pb_dec_svarint,
+ &pb_dec_fixed32,
+ &pb_dec_fixed64,
+
+ &pb_dec_bytes,
+ &pb_dec_string,
+ &pb_dec_submessage,
+ NULL, /* extensions */
+ &pb_dec_fixed_length_bytes
+};
+
+/*******************************
+ * pb_istream_t implementation *
+ *******************************/
+
+static bool checkreturn buf_read(pb_istream_t *stream, pb_byte_t *buf, size_t count)
+{
+ size_t i;
+ const pb_byte_t *source = (const pb_byte_t*)stream->state;
+ stream->state = (pb_byte_t*)stream->state + count;
+
+ if (buf != NULL)
+ {
+ for (i = 0; i < count; i++)
+ buf[i] = source[i];
+ }
+
+ return true;
+}
+
+bool checkreturn pb_read(pb_istream_t *stream, pb_byte_t *buf, size_t count)
+{
+#ifndef PB_BUFFER_ONLY
+ if (buf == NULL && stream->callback != buf_read)
+ {
+ /* Skip input bytes */
+ pb_byte_t tmp[16];
+ while (count > 16)
+ {
+ if (!pb_read(stream, tmp, 16))
+ return false;
+
+ count -= 16;
+ }
+
+ return pb_read(stream, tmp, count);
+ }
+#endif
+
+ if (stream->bytes_left < count)
+ PB_RETURN_ERROR(stream, "end-of-stream");
+
+#ifndef PB_BUFFER_ONLY
+ if (!stream->callback(stream, buf, count))
+ PB_RETURN_ERROR(stream, "io error");
+#else
+ if (!buf_read(stream, buf, count))
+ return false;
+#endif
+
+ stream->bytes_left -= count;
+ return true;
+}
+
+/* Read a single byte from input stream. buf may not be NULL.
+ * This is an optimization for the varint decoding. */
+static bool checkreturn pb_readbyte(pb_istream_t *stream, pb_byte_t *buf)
+{
+ if (stream->bytes_left == 0)
+ PB_RETURN_ERROR(stream, "end-of-stream");
+
+#ifndef PB_BUFFER_ONLY
+ if (!stream->callback(stream, buf, 1))
+ PB_RETURN_ERROR(stream, "io error");
+#else
+ *buf = *(const pb_byte_t*)stream->state;
+ stream->state = (pb_byte_t*)stream->state + 1;
+#endif
+
+ stream->bytes_left--;
+
+ return true;
+}
+
+pb_istream_t pb_istream_from_buffer(const pb_byte_t *buf, size_t bufsize)
+{
+ pb_istream_t stream;
+ /* Cast away the const from buf without a compiler error. We are
+ * careful to use it only in a const manner in the callbacks.
+ */
+ union {
+ void *state;
+ const void *c_state;
+ } state;
+#ifdef PB_BUFFER_ONLY
+ stream.callback = NULL;
+#else
+ stream.callback = &buf_read;
+#endif
+ state.c_state = buf;
+ stream.state = state.state;
+ stream.bytes_left = bufsize;
+#ifndef PB_NO_ERRMSG
+ stream.errmsg = NULL;
+#endif
+ return stream;
+}
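+
+/* Illustrative sketch (an assumption, not part of nanopb): buffer-based
+ * decoding with the generated types from this patch would look roughly like
+ * the following, where buf and len stand for an arbitrary input buffer:
+ *
+ *   schema_Event event = schema_Event_init_zero;
+ *   pb_istream_t stream = pb_istream_from_buffer(buf, len);
+ *
+ *   if (!pb_decode(&stream, schema_Event_fields, &event))
+ *       pr_err("decode failed: %s\n", PB_GET_ERROR(&stream));
+ */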
+
+/********************
+ * Helper functions *
+ ********************/
+
+static bool checkreturn pb_decode_varint32_eof(pb_istream_t *stream, uint32_t *dest, bool *eof)
+{
+ pb_byte_t byte;
+ uint32_t result;
+
+ if (!pb_readbyte(stream, &byte))
+ {
+ if (stream->bytes_left == 0)
+ {
+ if (eof)
+ {
+ *eof = true;
+ }
+ }
+
+ return false;
+ }
+
+ if ((byte & 0x80) == 0)
+ {
+ /* Quick case, 1 byte value */
+ result = byte;
+ }
+ else
+ {
+ /* Multibyte case */
+ uint_fast8_t bitpos = 7;
+ result = byte & 0x7F;
+
+ do
+ {
+ if (!pb_readbyte(stream, &byte))
+ return false;
+
+ if (bitpos >= 32)
+ {
+ /* Note: The varint could have trailing 0x80 bytes, or 0xFF for negative. */
+ uint8_t sign_extension = (bitpos < 63) ? 0xFF : 0x01;
+
+ if ((byte & 0x7F) != 0x00 && ((result >> 31) == 0 || byte != sign_extension))
+ {
+ PB_RETURN_ERROR(stream, "varint overflow");
+ }
+ }
+ else
+ {
+ result |= (uint32_t)(byte & 0x7F) << bitpos;
+ }
+ bitpos = (uint_fast8_t)(bitpos + 7);
+ } while (byte & 0x80);
+
+ if (bitpos == 35 && (byte & 0x70) != 0)
+ {
+ /* The last byte was at bitpos=28, so only bottom 4 bits fit. */
+ PB_RETURN_ERROR(stream, "varint overflow");
+ }
+ }
+
+ *dest = result;
+ return true;
+}
+
+bool checkreturn pb_decode_varint32(pb_istream_t *stream, uint32_t *dest)
+{
+ return pb_decode_varint32_eof(stream, dest, NULL);
+}
+
+#ifndef PB_WITHOUT_64BIT
+bool checkreturn pb_decode_varint(pb_istream_t *stream, uint64_t *dest)
+{
+ pb_byte_t byte;
+ uint_fast8_t bitpos = 0;
+ uint64_t result = 0;
+
+ do
+ {
+ if (bitpos >= 64)
+ PB_RETURN_ERROR(stream, "varint overflow");
+
+ if (!pb_readbyte(stream, &byte))
+ return false;
+
+ result |= (uint64_t)(byte & 0x7F) << bitpos;
+ bitpos = (uint_fast8_t)(bitpos + 7);
+ } while (byte & 0x80);
+
+ *dest = result;
+ return true;
+}
+#endif
+
+bool checkreturn pb_skip_varint(pb_istream_t *stream)
+{
+ pb_byte_t byte;
+ do
+ {
+ if (!pb_read(stream, &byte, 1))
+ return false;
+ } while (byte & 0x80);
+ return true;
+}
+
+bool checkreturn pb_skip_string(pb_istream_t *stream)
+{
+ uint32_t length;
+ if (!pb_decode_varint32(stream, &length))
+ return false;
+
+ return pb_read(stream, NULL, length);
+}
+
+bool checkreturn pb_decode_tag(pb_istream_t *stream, pb_wire_type_t *wire_type, uint32_t *tag, bool *eof)
+{
+ uint32_t temp;
+ *eof = false;
+ *wire_type = (pb_wire_type_t) 0;
+ *tag = 0;
+
+ if (!pb_decode_varint32_eof(stream, &temp, eof))
+ {
+ return false;
+ }
+
+ if (temp == 0)
+ {
+ *eof = true; /* Special feature: allow 0-terminated messages. */
+ return false;
+ }
+
+ *tag = temp >> 3;
+ *wire_type = (pb_wire_type_t)(temp & 7);
+ return true;
+}
+
+bool checkreturn pb_skip_field(pb_istream_t *stream, pb_wire_type_t wire_type)
+{
+ switch (wire_type)
+ {
+ case PB_WT_VARINT: return pb_skip_varint(stream);
+ case PB_WT_64BIT: return pb_read(stream, NULL, 8);
+ case PB_WT_STRING: return pb_skip_string(stream);
+ case PB_WT_32BIT: return pb_read(stream, NULL, 4);
+ default: PB_RETURN_ERROR(stream, "invalid wire_type");
+ }
+}
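+
+/* Illustrative sketch (an assumption, not part of nanopb): a caller scanning
+ * a message for one particular field could combine the two helpers above
+ * roughly as follows (wanted_tag is a hypothetical constant):
+ *
+ *   uint32_t tag;
+ *   pb_wire_type_t wire_type;
+ *   bool eof;
+ *
+ *   while (pb_decode_tag(&stream, &wire_type, &tag, &eof))
+ *   {
+ *       if (tag == wanted_tag)
+ *           break;
+ *       if (!pb_skip_field(&stream, wire_type))
+ *           return false;
+ *   }
+ */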
+
+/* Read a raw value to buffer, for the purpose of passing it to callback as
+ * a substream. Size is maximum size on call, and actual size on return.
+ */
+static bool checkreturn read_raw_value(pb_istream_t *stream, pb_wire_type_t wire_type, pb_byte_t *buf, size_t *size)
+{
+ size_t max_size = *size;
+ switch (wire_type)
+ {
+ case PB_WT_VARINT:
+ *size = 0;
+ do
+ {
+ (*size)++;
+ if (*size > max_size) return false;
+ if (!pb_read(stream, buf, 1)) return false;
+ } while (*buf++ & 0x80);
+ return true;
+
+ case PB_WT_64BIT:
+ *size = 8;
+ return pb_read(stream, buf, 8);
+
+ case PB_WT_32BIT:
+ *size = 4;
+ return pb_read(stream, buf, 4);
+
+ case PB_WT_STRING:
+ /* Calling read_raw_value with a PB_WT_STRING is an error.
+ * Explicitly handle this case and fallthrough to default to avoid
+ * compiler warnings.
+ */
+
+ default: PB_RETURN_ERROR(stream, "invalid wire_type");
+ }
+}
+
+/* Decode string length from stream and return a substream with limited length.
+ * Remember to close the substream using pb_close_string_substream().
+ */
+bool checkreturn pb_make_string_substream(pb_istream_t *stream, pb_istream_t *substream)
+{
+ uint32_t size;
+ if (!pb_decode_varint32(stream, &size))
+ return false;
+
+ *substream = *stream;
+ if (substream->bytes_left < size)
+ PB_RETURN_ERROR(stream, "parent stream too short");
+
+ substream->bytes_left = size;
+ stream->bytes_left -= size;
+ return true;
+}
+
+bool checkreturn pb_close_string_substream(pb_istream_t *stream, pb_istream_t *substream)
+{
+ if (substream->bytes_left) {
+ if (!pb_read(substream, NULL, substream->bytes_left))
+ return false;
+ }
+
+ stream->state = substream->state;
+
+#ifndef PB_NO_ERRMSG
+ stream->errmsg = substream->errmsg;
+#endif
+ return true;
+}
+
+/*************************
+ * Decode a single field *
+ *************************/
+
+static bool checkreturn decode_static_field(pb_istream_t *stream, pb_wire_type_t wire_type, pb_field_iter_t *iter)
+{
+ pb_type_t type;
+ pb_decoder_t func;
+
+ type = iter->pos->type;
+ func = PB_DECODERS[PB_LTYPE(type)];
+
+ switch (PB_HTYPE(type))
+ {
+ case PB_HTYPE_REQUIRED:
+ return func(stream, iter->pos, iter->pData);
+
+ case PB_HTYPE_OPTIONAL:
+ if (iter->pSize != iter->pData)
+ *(bool*)iter->pSize = true;
+ return func(stream, iter->pos, iter->pData);
+
+ case PB_HTYPE_REPEATED:
+ if (wire_type == PB_WT_STRING
+ && PB_LTYPE(type) <= PB_LTYPE_LAST_PACKABLE)
+ {
+ /* Packed array */
+ bool status = true;
+ pb_size_t *size = (pb_size_t*)iter->pSize;
+
+ pb_istream_t substream;
+ if (!pb_make_string_substream(stream, &substream))
+ return false;
+
+ while (substream.bytes_left > 0 && *size < iter->pos->array_size)
+ {
+ void *pItem = (char*)iter->pData + iter->pos->data_size * (*size);
+ if (!func(&substream, iter->pos, pItem))
+ {
+ status = false;
+ break;
+ }
+ (*size)++;
+ }
+
+ if (substream.bytes_left != 0)
+ PB_RETURN_ERROR(stream, "array overflow");
+ if (!pb_close_string_substream(stream, &substream))
+ return false;
+
+ return status;
+ }
+ else
+ {
+ /* Repeated field */
+ pb_size_t *size = (pb_size_t*)iter->pSize;
+ char *pItem = (char*)iter->pData + iter->pos->data_size * (*size);
+
+ if ((*size)++ >= iter->pos->array_size)
+ PB_RETURN_ERROR(stream, "array overflow");
+
+ return func(stream, iter->pos, pItem);
+ }
+
+ case PB_HTYPE_ONEOF:
+ *(pb_size_t*)iter->pSize = iter->pos->tag;
+ if (PB_LTYPE(type) == PB_LTYPE_SUBMESSAGE)
+ {
+ /* We memset to zero so that any callbacks are set to NULL.
+ * Then set any default values. */
+ memset(iter->pData, 0, iter->pos->data_size);
+ pb_message_set_to_defaults((const pb_field_t*)iter->pos->ptr, iter->pData);
+ }
+ return func(stream, iter->pos, iter->pData);
+
+ default:
+ PB_RETURN_ERROR(stream, "invalid field type");
+ }
+}
+
+#ifdef PB_ENABLE_MALLOC
+/* Allocate storage for the field and store the pointer at iter->pData.
+ * array_size is the number of entries to reserve in an array.
+ * Zero size is not allowed, use pb_free() for releasing.
+ */
+static bool checkreturn allocate_field(pb_istream_t *stream, void *pData, size_t data_size, size_t array_size)
+{
+ void *ptr = *(void**)pData;
+
+ if (data_size == 0 || array_size == 0)
+ PB_RETURN_ERROR(stream, "invalid size");
+
+ /* Check for multiplication overflows.
+ * This code avoids the costly division if the sizes are small enough.
+ * Multiplication is safe as long as only half of bits are set
+ * in either multiplicand.
+ */
+ {
+ const size_t check_limit = (size_t)1 << (sizeof(size_t) * 4);
+ if (data_size >= check_limit || array_size >= check_limit)
+ {
+ const size_t size_max = (size_t)-1;
+ if (size_max / array_size < data_size)
+ {
+ PB_RETURN_ERROR(stream, "size too large");
+ }
+ }
+ }
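+
+ /* Worked example of the check above (illustrative): with a 32-bit size_t,
+ * check_limit is 1 << 16 = 65536. When both data_size and array_size are
+ * below 65536 their product is below 2^32 and cannot overflow, so the
+ * division is skipped. Otherwise size_max / array_size < data_size holds
+ * exactly when array_size * data_size would exceed size_max. */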
+
+ /* Allocate new or expand previous allocation */
+ /* Note: on failure the old pointer will remain in the structure,
+ * the message must be freed by caller also on error return. */
+ ptr = pb_realloc(ptr, array_size * data_size);
+ if (ptr == NULL)
+ PB_RETURN_ERROR(stream, "realloc failed");
+
+ *(void**)pData = ptr;
+ return true;
+}
+
+/* Clear a newly allocated item in case it contains a pointer, or is a submessage. */
+static void initialize_pointer_field(void *pItem, pb_field_iter_t *iter)
+{
+ if (PB_LTYPE(iter->pos->type) == PB_LTYPE_STRING ||
+ PB_LTYPE(iter->pos->type) == PB_LTYPE_BYTES)
+ {
+ *(void**)pItem = NULL;
+ }
+ else if (PB_LTYPE(iter->pos->type) == PB_LTYPE_SUBMESSAGE)
+ {
+ /* We memset to zero so that any callbacks are set to NULL.
+ * Then set any default values. */
+ memset(pItem, 0, iter->pos->data_size);
+ pb_message_set_to_defaults((const pb_field_t *) iter->pos->ptr, pItem);
+ }
+}
+#endif
+
+static bool checkreturn decode_pointer_field(pb_istream_t *stream, pb_wire_type_t wire_type, pb_field_iter_t *iter)
+{
+#ifndef PB_ENABLE_MALLOC
+ PB_UNUSED(wire_type);
+ PB_UNUSED(iter);
+ PB_RETURN_ERROR(stream, "no malloc support");
+#else
+ pb_type_t type;
+ pb_decoder_t func;
+
+ type = iter->pos->type;
+ func = PB_DECODERS[PB_LTYPE(type)];
+
+ switch (PB_HTYPE(type))
+ {
+ case PB_HTYPE_REQUIRED:
+ case PB_HTYPE_OPTIONAL:
+ case PB_HTYPE_ONEOF:
+ if (PB_LTYPE(type) == PB_LTYPE_SUBMESSAGE &&
+ *(void**)iter->pData != NULL)
+ {
+ /* Duplicate field, have to release the old allocation first. */
+ pb_release_single_field(iter);
+ }
+
+ if (PB_HTYPE(type) == PB_HTYPE_ONEOF)
+ {
+ *(pb_size_t*)iter->pSize = iter->pos->tag;
+ }
+
+ if (PB_LTYPE(type) == PB_LTYPE_STRING ||
+ PB_LTYPE(type) == PB_LTYPE_BYTES)
+ {
+ return func(stream, iter->pos, iter->pData);
+ }
+ else
+ {
+ if (!allocate_field(stream, iter->pData, iter->pos->data_size, 1))
+ return false;
+
+ initialize_pointer_field(*(void**)iter->pData, iter);
+ return func(stream, iter->pos, *(void**)iter->pData);
+ }
+
+ case PB_HTYPE_REPEATED:
+ if (wire_type == PB_WT_STRING
+ && PB_LTYPE(type) <= PB_LTYPE_LAST_PACKABLE)
+ {
+ /* Packed array, multiple items come in at once. */
+ bool status = true;
+ pb_size_t *size = (pb_size_t*)iter->pSize;
+ size_t allocated_size = *size;
+ void *pItem;
+ pb_istream_t substream;
+
+ if (!pb_make_string_substream(stream, &substream))
+ return false;
+
+ while (substream.bytes_left)
+ {
+ if ((size_t)*size + 1 > allocated_size)
+ {
+ /* Allocate more storage. This tries to guess the
+ * number of remaining entries. Round the division
+ * upwards. */
+ allocated_size += (substream.bytes_left - 1) / iter->pos->data_size + 1;
+
+ if (!allocate_field(&substream, iter->pData, iter->pos->data_size, allocated_size))
+ {
+ status = false;
+ break;
+ }
+ }
+
+ /* Decode the array entry */
+ pItem = *(char**)iter->pData + iter->pos->data_size * (*size);
+ initialize_pointer_field(pItem, iter);
+ if (!func(&substream, iter->pos, pItem))
+ {
+ status = false;
+ break;
+ }
+
+ if (*size == PB_SIZE_MAX)
+ {
+#ifndef PB_NO_ERRMSG
+ stream->errmsg = "too many array entries";
+#endif
+ status = false;
+ break;
+ }
+
+ (*size)++;
+ }
+ if (!pb_close_string_substream(stream, &substream))
+ return false;
+
+ return status;
+ }
+ else
+ {
+ /* Normal repeated field, i.e. only one item at a time. */
+ pb_size_t *size = (pb_size_t*)iter->pSize;
+ void *pItem;
+
+ if (*size == PB_SIZE_MAX)
+ PB_RETURN_ERROR(stream, "too many array entries");
+
+ (*size)++;
+ if (!allocate_field(stream, iter->pData, iter->pos->data_size, *size))
+ return false;
+
+ pItem = *(char**)iter->pData + iter->pos->data_size * (*size - 1);
+ initialize_pointer_field(pItem, iter);
+ return func(stream, iter->pos, pItem);
+ }
+
+ default:
+ PB_RETURN_ERROR(stream, "invalid field type");
+ }
+#endif
+}
+
+static bool checkreturn decode_callback_field(pb_istream_t *stream, pb_wire_type_t wire_type, pb_field_iter_t *iter)
+{
+ pb_callback_t *pCallback = (pb_callback_t*)iter->pData;
+
+#ifdef PB_OLD_CALLBACK_STYLE
+ void *arg = pCallback->arg;
+#else
+ void **arg = &(pCallback->arg);
+#endif
+
+ if (pCallback == NULL || pCallback->funcs.decode == NULL)
+ return pb_skip_field(stream, wire_type);
+
+ if (wire_type == PB_WT_STRING)
+ {
+ pb_istream_t substream;
+
+ if (!pb_make_string_substream(stream, &substream))
+ return false;
+
+ do
+ {
+ if (!pCallback->funcs.decode(&substream, iter->pos, arg))
+ PB_RETURN_ERROR(stream, "callback failed");
+ } while (substream.bytes_left);
+
+ if (!pb_close_string_substream(stream, &substream))
+ return false;
+
+ return true;
+ }
+ else
+ {
+ /* Copy the single scalar value to the stack.
+ * This is required so that we can limit the stream length,
+ * which in turn allows using the same callback for packed and
+ * non-packed fields. */
+ pb_istream_t substream;
+ pb_byte_t buffer[10];
+ size_t size = sizeof(buffer);
+
+ if (!read_raw_value(stream, wire_type, buffer, &size))
+ return false;
+ substream = pb_istream_from_buffer(buffer, size);
+
+ return pCallback->funcs.decode(&substream, iter->pos, arg);
+ }
+}
+
+static bool checkreturn decode_field(pb_istream_t *stream, pb_wire_type_t wire_type, pb_field_iter_t *iter)
+{
+#ifdef PB_ENABLE_MALLOC
+ /* When decoding a oneof field, check if there is old data that must be
+ * released first. */
+ if (PB_HTYPE(iter->pos->type) == PB_HTYPE_ONEOF)
+ {
+ if (!pb_release_union_field(stream, iter))
+ return false;
+ }
+#endif
+
+ switch (PB_ATYPE(iter->pos->type))
+ {
+ case PB_ATYPE_STATIC:
+ return decode_static_field(stream, wire_type, iter);
+
+ case PB_ATYPE_POINTER:
+ return decode_pointer_field(stream, wire_type, iter);
+
+ case PB_ATYPE_CALLBACK:
+ return decode_callback_field(stream, wire_type, iter);
+
+ default:
+ PB_RETURN_ERROR(stream, "invalid field type");
+ }
+}
+
+static void iter_from_extension(pb_field_iter_t *iter, pb_extension_t *extension)
+{
+ /* Fake a field iterator for the extension field.
+ * It is not actually safe to advance this iterator, but decode_field
+ * will not even try to. */
+ const pb_field_t *field = (const pb_field_t*)extension->type->arg;
+ (void)pb_field_iter_begin(iter, field, extension->dest);
+ iter->pData = extension->dest;
+ iter->pSize = &extension->found;
+
+ if (PB_ATYPE(field->type) == PB_ATYPE_POINTER)
+ {
+ /* For pointer extensions, the pointer is stored directly
+ * in the extension structure. This avoids having an extra
+ * indirection. */
+ iter->pData = &extension->dest;
+ }
+}
+
+/* Default handler for extension fields. Expects a pb_field_t structure
+ * in extension->type->arg. */
+static bool checkreturn default_extension_decoder(pb_istream_t *stream,
+ pb_extension_t *extension, uint32_t tag, pb_wire_type_t wire_type)
+{
+ const pb_field_t *field = (const pb_field_t*)extension->type->arg;
+ pb_field_iter_t iter;
+
+ if (field->tag != tag)
+ return true;
+
+ iter_from_extension(&iter, extension);
+ extension->found = true;
+ return decode_field(stream, wire_type, &iter);
+}
+
+/* Try to decode an unknown field as an extension field. Tries each extension
+ * decoder in turn, until one of them handles the field or the list ends. */
+static bool checkreturn decode_extension(pb_istream_t *stream,
+ uint32_t tag, pb_wire_type_t wire_type, pb_field_iter_t *iter)
+{
+ pb_extension_t *extension = *(pb_extension_t* const *)iter->pData;
+ size_t pos = stream->bytes_left;
+
+ while (extension != NULL && pos == stream->bytes_left)
+ {
+ bool status;
+ if (extension->type->decode)
+ status = extension->type->decode(stream, extension, tag, wire_type);
+ else
+ status = default_extension_decoder(stream, extension, tag, wire_type);
+
+ if (!status)
+ return false;
+
+ extension = extension->next;
+ }
+
+ return true;
+}
+
+/* Step through the iterator until an extension field is found or until all
+ * entries have been checked. There can be only one extension field per
+ * message. Returns false if no extension field is found. */
+static bool checkreturn find_extension_field(pb_field_iter_t *iter)
+{
+ const pb_field_t *start = iter->pos;
+
+ do {
+ if (PB_LTYPE(iter->pos->type) == PB_LTYPE_EXTENSION)
+ return true;
+ (void)pb_field_iter_next(iter);
+ } while (iter->pos != start);
+
+ return false;
+}
+
+/* Initialize message fields to default values, recursively */
+static void pb_field_set_to_default(pb_field_iter_t *iter)
+{
+ pb_type_t type;
+ type = iter->pos->type;
+
+ if (PB_LTYPE(type) == PB_LTYPE_EXTENSION)
+ {
+ pb_extension_t *ext = *(pb_extension_t* const *)iter->pData;
+ while (ext != NULL)
+ {
+ pb_field_iter_t ext_iter;
+ ext->found = false;
+ iter_from_extension(&ext_iter, ext);
+ pb_field_set_to_default(&ext_iter);
+ ext = ext->next;
+ }
+ }
+ else if (PB_ATYPE(type) == PB_ATYPE_STATIC)
+ {
+ bool init_data = true;
+ if (PB_HTYPE(type) == PB_HTYPE_OPTIONAL && iter->pSize != iter->pData)
+ {
+ /* Set has_field to false. Still initialize the optional field
+ * itself as well. */
+ *(bool*)iter->pSize = false;
+ }
+ else if (PB_HTYPE(type) == PB_HTYPE_REPEATED ||
+ PB_HTYPE(type) == PB_HTYPE_ONEOF)
+ {
+ /* REPEATED: Set array count to 0, no need to initialize contents.
+ ONEOF: Set which_field to 0. */
+ *(pb_size_t*)iter->pSize = 0;
+ init_data = false;
+ }
+
+ if (init_data)
+ {
+ if (PB_LTYPE(iter->pos->type) == PB_LTYPE_SUBMESSAGE)
+ {
+ /* Initialize submessage to defaults */
+ pb_message_set_to_defaults((const pb_field_t *) iter->pos->ptr, iter->pData);
+ }
+ else if (iter->pos->ptr != NULL)
+ {
+ /* Initialize to default value */
+ memcpy(iter->pData, iter->pos->ptr, iter->pos->data_size);
+ }
+ else
+ {
+ /* Initialize to zeros */
+ memset(iter->pData, 0, iter->pos->data_size);
+ }
+ }
+ }
+ else if (PB_ATYPE(type) == PB_ATYPE_POINTER)
+ {
+ /* Initialize the pointer to NULL. */
+ *(void**)iter->pData = NULL;
+
+ /* Initialize array count to 0. */
+ if (PB_HTYPE(type) == PB_HTYPE_REPEATED ||
+ PB_HTYPE(type) == PB_HTYPE_ONEOF)
+ {
+ *(pb_size_t*)iter->pSize = 0;
+ }
+ }
+ else if (PB_ATYPE(type) == PB_ATYPE_CALLBACK)
+ {
+ /* Don't overwrite callback */
+ }
+}
+
+static void pb_message_set_to_defaults(const pb_field_t fields[], void *dest_struct)
+{
+ pb_field_iter_t iter;
+
+ if (!pb_field_iter_begin(&iter, fields, dest_struct))
+ return; /* Empty message type */
+
+ do
+ {
+ pb_field_set_to_default(&iter);
+ } while (pb_field_iter_next(&iter));
+}
+
+/*********************
+ * Decode all fields *
+ *********************/
+
+bool checkreturn pb_decode_noinit(pb_istream_t *stream, const pb_field_t fields[], void *dest_struct)
+{
+ uint32_t fields_seen[(PB_MAX_REQUIRED_FIELDS + 31) / 32] = {0, 0};
+ const uint32_t allbits = ~(uint32_t)0;
+ uint32_t extension_range_start = 0;
+ pb_field_iter_t iter;
+
+ /* 'fixed_count_field' and 'fixed_count_size' track position of a repeated fixed
+ * count field. This can only handle _one_ repeated fixed count field that
+ * is unpacked and unordered among other (non repeated fixed count) fields.
+ */
+ const pb_field_t *fixed_count_field = NULL;
+ pb_size_t fixed_count_size = 0;
+
+ /* Return value ignored, as empty message types will be correctly handled by
+ * pb_field_iter_find() anyway. */
+ (void)pb_field_iter_begin(&iter, fields, dest_struct);
+
+ while (stream->bytes_left)
+ {
+ uint32_t tag;
+ pb_wire_type_t wire_type;
+ bool eof;
+
+ if (!pb_decode_tag(stream, &wire_type, &tag, &eof))
+ {
+ if (eof)
+ break;
+ else
+ return false;
+ }
+
+ if (!pb_field_iter_find(&iter, tag))
+ {
+ /* No match found, check if it matches an extension. */
+ if (tag >= extension_range_start)
+ {
+ if (!find_extension_field(&iter))
+ extension_range_start = (uint32_t)-1;
+ else
+ extension_range_start = iter.pos->tag;
+
+ if (tag >= extension_range_start)
+ {
+ size_t pos = stream->bytes_left;
+
+ if (!decode_extension(stream, tag, wire_type, &iter))
+ return false;
+
+ if (pos != stream->bytes_left)
+ {
+ /* The field was handled */
+ continue;
+ }
+ }
+ }
+
+ /* No match found, skip data */
+ if (!pb_skip_field(stream, wire_type))
+ return false;
+ continue;
+ }
+
+ /* If a repeated fixed count field was found, get size from
+ * 'fixed_count_field' as there is no counter contained in the struct.
+ */
+ if (PB_HTYPE(iter.pos->type) == PB_HTYPE_REPEATED
+ && iter.pSize == iter.pData)
+ {
+ if (fixed_count_field != iter.pos) {
+ /* If the new fixed count field does not match the previous one,
+ * check that the previous one is NULL or that it finished
+ * receiving all the expected data.
+ */
+ if (fixed_count_field != NULL &&
+ fixed_count_size != fixed_count_field->array_size)
+ {
+ PB_RETURN_ERROR(stream, "wrong size for fixed count field");
+ }
+
+ fixed_count_field = iter.pos;
+ fixed_count_size = 0;
+ }
+
+ iter.pSize = &fixed_count_size;
+ }
+
+ if (PB_HTYPE(iter.pos->type) == PB_HTYPE_REQUIRED
+ && iter.required_field_index < PB_MAX_REQUIRED_FIELDS)
+ {
+ uint32_t tmp = ((uint32_t)1 << (iter.required_field_index & 31));
+ fields_seen[iter.required_field_index >> 5] |= tmp;
+ }
+
+ if (!decode_field(stream, wire_type, &iter))
+ return false;
+ }
+
+ /* Check that all elements of the last decoded fixed count field were present. */
+ if (fixed_count_field != NULL &&
+ fixed_count_size != fixed_count_field->array_size)
+ {
+ PB_RETURN_ERROR(stream, "wrong size for fixed count field");
+ }
+
+ /* Check that all required fields were present. */
+ {
+ /* First figure out the number of required fields by
+ * seeking to the end of the field array. Usually we
+ * are already close to the end after decoding.
+ */
+ unsigned req_field_count;
+ pb_type_t last_type;
+ unsigned i;
+ do {
+ req_field_count = iter.required_field_index;
+ last_type = iter.pos->type;
+ } while (pb_field_iter_next(&iter));
+
+ /* Fixup if last field was also required. */
+ if (PB_HTYPE(last_type) == PB_HTYPE_REQUIRED && iter.pos->tag != 0)
+ req_field_count++;
+
+ if (req_field_count > PB_MAX_REQUIRED_FIELDS)
+ req_field_count = PB_MAX_REQUIRED_FIELDS;
+
+ if (req_field_count > 0)
+ {
+ /* Check the whole words */
+ for (i = 0; i < (req_field_count >> 5); i++)
+ {
+ if (fields_seen[i] != allbits)
+ PB_RETURN_ERROR(stream, "missing required field");
+ }
+
+ /* Check the remaining bits (if any) */
+ if ((req_field_count & 31) != 0)
+ {
+ if (fields_seen[req_field_count >> 5] !=
+ (allbits >> (32 - (req_field_count & 31))))
+ {
+ PB_RETURN_ERROR(stream, "missing required field");
+ }
+ }
+ }
+ }
+
+ return true;
+}
+
+bool checkreturn pb_decode(pb_istream_t *stream, const pb_field_t fields[], void *dest_struct)
+{
+ bool status;
+ pb_message_set_to_defaults(fields, dest_struct);
+ status = pb_decode_noinit(stream, fields, dest_struct);
+
+#ifdef PB_ENABLE_MALLOC
+ if (!status)
+ pb_release(fields, dest_struct);
+#endif
+
+ return status;
+}
+
+bool pb_decode_delimited_noinit(pb_istream_t *stream, const pb_field_t fields[], void *dest_struct)
+{
+ pb_istream_t substream;
+ bool status;
+
+ if (!pb_make_string_substream(stream, &substream))
+ return false;
+
+ status = pb_decode_noinit(&substream, fields, dest_struct);
+
+ if (!pb_close_string_substream(stream, &substream))
+ return false;
+ return status;
+}
+
+bool pb_decode_delimited(pb_istream_t *stream, const pb_field_t fields[], void *dest_struct)
+{
+ pb_istream_t substream;
+ bool status;
+
+ if (!pb_make_string_substream(stream, &substream))
+ return false;
+
+ status = pb_decode(&substream, fields, dest_struct);
+
+ if (!pb_close_string_substream(stream, &substream))
+ return false;
+ return status;
+}
+
+bool pb_decode_nullterminated(pb_istream_t *stream, const pb_field_t fields[], void *dest_struct)
+{
+ /* This behaviour will be separated in nanopb-0.4.0, see issue #278. */
+ return pb_decode(stream, fields, dest_struct);
+}
+
+#ifdef PB_ENABLE_MALLOC
+/* Given a oneof field, if a field inside this oneof has already been decoded,
+ * release it before overwriting it with a different one. */
+static bool pb_release_union_field(pb_istream_t *stream, pb_field_iter_t *iter)
+{
+ pb_size_t old_tag = *(pb_size_t*)iter->pSize; /* Previous which_ value */
+ pb_size_t new_tag = iter->pos->tag; /* New which_ value */
+
+ if (old_tag == 0)
+ return true; /* Ok, no old data in union */
+
+ if (old_tag == new_tag)
+ return true; /* Ok, old data is of same type => merge */
+
+ /* Release old data. The find can fail if the message struct contains
+ * invalid data. */
+ if (!pb_field_iter_find(iter, old_tag))
+ PB_RETURN_ERROR(stream, "invalid union tag");
+
+ pb_release_single_field(iter);
+
+ /* Restore iterator to where it should be.
+ * This shouldn't fail unless the pb_field_t structure is corrupted. */
+ if (!pb_field_iter_find(iter, new_tag))
+ PB_RETURN_ERROR(stream, "iterator error");
+
+ return true;
+}
+
+static void pb_release_single_field(const pb_field_iter_t *iter)
+{
+ pb_type_t type;
+ type = iter->pos->type;
+
+ if (PB_HTYPE(type) == PB_HTYPE_ONEOF)
+ {
+ if (*(pb_size_t*)iter->pSize != iter->pos->tag)
+ return; /* This is not the current field in the union */
+ }
+
+ /* Release anything contained inside an extension or submsg.
+ * This has to be done even if the submsg itself is statically
+ * allocated. */
+ if (PB_LTYPE(type) == PB_LTYPE_EXTENSION)
+ {
+ /* Release fields from all extensions in the linked list */
+ pb_extension_t *ext = *(pb_extension_t**)iter->pData;
+ while (ext != NULL)
+ {
+ pb_field_iter_t ext_iter;
+ iter_from_extension(&ext_iter, ext);
+ pb_release_single_field(&ext_iter);
+ ext = ext->next;
+ }
+ }
+ else if (PB_LTYPE(type) == PB_LTYPE_SUBMESSAGE)
+ {
+ /* Release fields in submessage or submsg array */
+ void *pItem = iter->pData;
+ pb_size_t count = 1;
+
+ if (PB_ATYPE(type) == PB_ATYPE_POINTER)
+ {
+ pItem = *(void**)iter->pData;
+ }
+
+ if (PB_HTYPE(type) == PB_HTYPE_REPEATED)
+ {
+ if (PB_ATYPE(type) == PB_ATYPE_STATIC && iter->pSize == iter->pData) {
+ /* No _count field so use size of the array */
+ count = iter->pos->array_size;
+ } else {
+ count = *(pb_size_t*)iter->pSize;
+ }
+
+ if (PB_ATYPE(type) == PB_ATYPE_STATIC && count > iter->pos->array_size)
+ {
+ /* Protect against corrupted _count fields */
+ count = iter->pos->array_size;
+ }
+ }
+
+ if (pItem)
+ {
+ while (count--)
+ {
+ pb_release((const pb_field_t*)iter->pos->ptr, pItem);
+ pItem = (char*)pItem + iter->pos->data_size;
+ }
+ }
+ }
+
+ if (PB_ATYPE(type) == PB_ATYPE_POINTER)
+ {
+ if (PB_HTYPE(type) == PB_HTYPE_REPEATED &&
+ (PB_LTYPE(type) == PB_LTYPE_STRING ||
+ PB_LTYPE(type) == PB_LTYPE_BYTES))
+ {
+ /* Release entries in repeated string or bytes array */
+ void **pItem = *(void***)iter->pData;
+ pb_size_t count = *(pb_size_t*)iter->pSize;
+ while (count--)
+ {
+ pb_free(*pItem);
+ *pItem++ = NULL;
+ }
+ }
+
+ if (PB_HTYPE(type) == PB_HTYPE_REPEATED)
+ {
+ /* We are going to release the array, so set the size to 0 */
+ *(pb_size_t*)iter->pSize = 0;
+ }
+
+ /* Release main item */
+ pb_free(*(void**)iter->pData);
+ *(void**)iter->pData = NULL;
+ }
+}
+
+void pb_release(const pb_field_t fields[], void *dest_struct)
+{
+ pb_field_iter_t iter;
+
+ if (!dest_struct)
+ return; /* Ignore NULL pointers, similar to free() */
+
+ if (!pb_field_iter_begin(&iter, fields, dest_struct))
+ return; /* Empty message type */
+
+ do
+ {
+ pb_release_single_field(&iter);
+ } while (pb_field_iter_next(&iter));
+}
+#endif
+
+/* Field decoders */
+
+bool pb_decode_svarint(pb_istream_t *stream, pb_int64_t *dest)
+{
+ pb_uint64_t value;
+ if (!pb_decode_varint(stream, &value))
+ return false;
+
+ if (value & 1)
+ *dest = (pb_int64_t)(~(value >> 1));
+ else
+ *dest = (pb_int64_t)(value >> 1);
+
+ return true;
+}
+
+bool pb_decode_fixed32(pb_istream_t *stream, void *dest)
+{
+ pb_byte_t bytes[4];
+
+ if (!pb_read(stream, bytes, 4))
+ return false;
+
+ *(uint32_t*)dest = ((uint32_t)bytes[0] << 0) |
+ ((uint32_t)bytes[1] << 8) |
+ ((uint32_t)bytes[2] << 16) |
+ ((uint32_t)bytes[3] << 24);
+ return true;
+}
+
+#ifndef PB_WITHOUT_64BIT
+bool pb_decode_fixed64(pb_istream_t *stream, void *dest)
+{
+ pb_byte_t bytes[8];
+
+ if (!pb_read(stream, bytes, 8))
+ return false;
+
+ *(uint64_t*)dest = ((uint64_t)bytes[0] << 0) |
+ ((uint64_t)bytes[1] << 8) |
+ ((uint64_t)bytes[2] << 16) |
+ ((uint64_t)bytes[3] << 24) |
+ ((uint64_t)bytes[4] << 32) |
+ ((uint64_t)bytes[5] << 40) |
+ ((uint64_t)bytes[6] << 48) |
+ ((uint64_t)bytes[7] << 56);
+
+ return true;
+}
+#endif
+
+static bool checkreturn pb_dec_varint(pb_istream_t *stream, const pb_field_t *field, void *dest)
+{
+ pb_uint64_t value;
+ pb_int64_t svalue;
+ pb_int64_t clamped;
+ if (!pb_decode_varint(stream, &value))
+ return false;
+
+ /* See issue 97: Google's C++ protobuf allows negative varint values to
+ * be cast as int32_t, instead of the int64_t that should be used when
+ * encoding. Previous nanopb versions had a bug in encoding. In order to
+ * not break decoding of such messages, we cast <=32 bit fields to
+ * int32_t first to get the sign correct.
+ */
+ if (field->data_size == sizeof(pb_int64_t))
+ svalue = (pb_int64_t)value;
+ else
+ svalue = (int32_t)value;
+
+ /* Cast to the proper field size, while checking for overflows */
+ if (field->data_size == sizeof(pb_int64_t))
+ clamped = *(pb_int64_t*)dest = svalue;
+ else if (field->data_size == sizeof(int32_t))
+ clamped = *(int32_t*)dest = (int32_t)svalue;
+ else if (field->data_size == sizeof(int_least16_t))
+ clamped = *(int_least16_t*)dest = (int_least16_t)svalue;
+ else if (field->data_size == sizeof(int_least8_t))
+ clamped = *(int_least8_t*)dest = (int_least8_t)svalue;
+ else
+ PB_RETURN_ERROR(stream, "invalid data_size");
+
+ if (clamped != svalue)
+ PB_RETURN_ERROR(stream, "integer too large");
+
+ return true;
+}
+
+static bool checkreturn pb_dec_uvarint(pb_istream_t *stream, const pb_field_t *field, void *dest)
+{
+ pb_uint64_t value, clamped;
+ if (!pb_decode_varint(stream, &value))
+ return false;
+
+ /* Cast to the proper field size, while checking for overflows */
+ if (field->data_size == sizeof(pb_uint64_t))
+ clamped = *(pb_uint64_t*)dest = value;
+ else if (field->data_size == sizeof(uint32_t))
+ clamped = *(uint32_t*)dest = (uint32_t)value;
+ else if (field->data_size == sizeof(uint_least16_t))
+ clamped = *(uint_least16_t*)dest = (uint_least16_t)value;
+ else if (field->data_size == sizeof(uint_least8_t))
+ clamped = *(uint_least8_t*)dest = (uint_least8_t)value;
+ else
+ PB_RETURN_ERROR(stream, "invalid data_size");
+
+ if (clamped != value)
+ PB_RETURN_ERROR(stream, "integer too large");
+
+ return true;
+}
+
+static bool checkreturn pb_dec_svarint(pb_istream_t *stream, const pb_field_t *field, void *dest)
+{
+ pb_int64_t value, clamped;
+ if (!pb_decode_svarint(stream, &value))
+ return false;
+
+ /* Cast to the proper field size, while checking for overflows */
+ if (field->data_size == sizeof(pb_int64_t))
+ clamped = *(pb_int64_t*)dest = value;
+ else if (field->data_size == sizeof(int32_t))
+ clamped = *(int32_t*)dest = (int32_t)value;
+ else if (field->data_size == sizeof(int_least16_t))
+ clamped = *(int_least16_t*)dest = (int_least16_t)value;
+ else if (field->data_size == sizeof(int_least8_t))
+ clamped = *(int_least8_t*)dest = (int_least8_t)value;
+ else
+ PB_RETURN_ERROR(stream, "invalid data_size");
+
+ if (clamped != value)
+ PB_RETURN_ERROR(stream, "integer too large");
+
+ return true;
+}
+
+static bool checkreturn pb_dec_fixed32(pb_istream_t *stream, const pb_field_t *field, void *dest)
+{
+ PB_UNUSED(field);
+ return pb_decode_fixed32(stream, dest);
+}
+
+static bool checkreturn pb_dec_fixed64(pb_istream_t *stream, const pb_field_t *field, void *dest)
+{
+ PB_UNUSED(field);
+#ifndef PB_WITHOUT_64BIT
+ return pb_decode_fixed64(stream, dest);
+#else
+ PB_UNUSED(dest);
+ PB_RETURN_ERROR(stream, "no 64bit support");
+#endif
+}
+
+static bool checkreturn pb_dec_bytes(pb_istream_t *stream, const pb_field_t *field, void *dest)
+{
+ uint32_t size;
+ size_t alloc_size;
+ pb_bytes_array_t *bdest;
+
+ if (!pb_decode_varint32(stream, &size))
+ return false;
+
+ if (size > PB_SIZE_MAX)
+ PB_RETURN_ERROR(stream, "bytes overflow");
+
+ alloc_size = PB_BYTES_ARRAY_T_ALLOCSIZE(size);
+ if (size > alloc_size)
+ PB_RETURN_ERROR(stream, "size too large");
+
+ if (PB_ATYPE(field->type) == PB_ATYPE_POINTER)
+ {
+#ifndef PB_ENABLE_MALLOC
+ PB_RETURN_ERROR(stream, "no malloc support");
+#else
+ if (!allocate_field(stream, dest, alloc_size, 1))
+ return false;
+ bdest = *(pb_bytes_array_t**)dest;
+#endif
+ }
+ else
+ {
+ if (alloc_size > field->data_size)
+ PB_RETURN_ERROR(stream, "bytes overflow");
+ bdest = (pb_bytes_array_t*)dest;
+ }
+
+ bdest->size = (pb_size_t)size;
+ return pb_read(stream, bdest->bytes, size);
+}
+
+static bool checkreturn pb_dec_string(pb_istream_t *stream, const pb_field_t *field, void *dest)
+{
+ uint32_t size;
+ size_t alloc_size;
+ bool status;
+ if (!pb_decode_varint32(stream, &size))
+ return false;
+
+ /* Space for null terminator */
+ alloc_size = size + 1;
+
+ if (alloc_size < size)
+ PB_RETURN_ERROR(stream, "size too large");
+
+ if (PB_ATYPE(field->type) == PB_ATYPE_POINTER)
+ {
+#ifndef PB_ENABLE_MALLOC
+ PB_RETURN_ERROR(stream, "no malloc support");
+#else
+ if (!allocate_field(stream, dest, alloc_size, 1))
+ return false;
+ dest = *(void**)dest;
+#endif
+ }
+ else
+ {
+ if (alloc_size > field->data_size)
+ PB_RETURN_ERROR(stream, "string overflow");
+ }
+
+ status = pb_read(stream, (pb_byte_t*)dest, size);
+ *((pb_byte_t*)dest + size) = 0;
+ return status;
+}
+
+static bool checkreturn pb_dec_submessage(pb_istream_t *stream, const pb_field_t *field, void *dest)
+{
+ bool status;
+ pb_istream_t substream;
+ const pb_field_t* submsg_fields = (const pb_field_t*)field->ptr;
+
+ if (!pb_make_string_substream(stream, &substream))
+ return false;
+
+ if (field->ptr == NULL)
+ PB_RETURN_ERROR(stream, "invalid field descriptor");
+
+ /* New array entries need to be initialized, while required and optional
+ * submessages have already been initialized in the top-level pb_decode. */
+ if (PB_HTYPE(field->type) == PB_HTYPE_REPEATED)
+ status = pb_decode(&substream, submsg_fields, dest);
+ else
+ status = pb_decode_noinit(&substream, submsg_fields, dest);
+
+ if (!pb_close_string_substream(stream, &substream))
+ return false;
+ return status;
+}
+
+static bool checkreturn pb_dec_fixed_length_bytes(pb_istream_t *stream, const pb_field_t *field, void *dest)
+{
+ uint32_t size;
+
+ if (!pb_decode_varint32(stream, &size))
+ return false;
+
+ if (size > PB_SIZE_MAX)
+ PB_RETURN_ERROR(stream, "bytes overflow");
+
+ if (size == 0)
+ {
+ /* As a special case, treat an empty bytes string as all zeros for fixed_length_bytes. */
+ memset(dest, 0, field->data_size);
+ return true;
+ }
+
+ if (size != field->data_size)
+ PB_RETURN_ERROR(stream, "incorrect fixed length bytes size");
+
+ return pb_read(stream, (pb_byte_t*)dest, field->data_size);
+}
diff --git a/security/container/protos/nanopb/pb_decode.h b/security/container/protos/nanopb/pb_decode.h
new file mode 100644
index 0000000..398b24a
--- /dev/null
+++ b/security/container/protos/nanopb/pb_decode.h
@@ -0,0 +1,175 @@
+/* pb_decode.h: Functions to decode protocol buffers. Depends on pb_decode.c.
+ * The main function is pb_decode. You also need an input stream, and the
+ * field descriptions created by nanopb_generator.py.
+ */
+
+#ifndef PB_DECODE_H_INCLUDED
+#define PB_DECODE_H_INCLUDED
+
+#include "pb.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/* Structure for defining custom input streams. You will need to provide
+ * a callback function to read the bytes from your storage, which can be
+ * for example a file or a network socket.
+ *
+ * The callback must conform to these rules:
+ *
+ * 1) Return false on IO errors. This will cause decoding to abort.
+ * 2) You can use state to store your own data (e.g. buffer pointer),
+ * and rely on pb_read to verify that nobody reads past bytes_left.
+ * 3) Your callback may be used with substreams, in which case bytes_left
+ * is different from that of the main stream. Don't use bytes_left to compute
+ * any pointers.
+ */
+struct pb_istream_s
+{
+#ifdef PB_BUFFER_ONLY
+ /* Callback pointer is not used in buffer-only configuration.
+ * Having an int pointer here allows binary compatibility but
+ * gives an error if someone tries to assign a callback function.
+ */
+ int *callback;
+#else
+ bool (*callback)(pb_istream_t *stream, pb_byte_t *buf, size_t count);
+#endif
+
+ void *state; /* Free field for use by callback implementation */
+ size_t bytes_left;
+
+#ifndef PB_NO_ERRMSG
+ const char *errmsg;
+#endif
+};
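+
+/* Example custom stream (illustrative sketch only; my_read_cb and fd are
+ * hypothetical): a callback reading from a file descriptor kept in
+ * stream->state. A real callback should follow the rules listed above.
+ *
+ * static bool my_read_cb(pb_istream_t *stream, pb_byte_t *buf, size_t count)
+ * {
+ * int fd = (int)(intptr_t)stream->state;
+ * return read(fd, buf, count) == (ssize_t)count;
+ * }
+ *
+ * pb_istream_t stream = {&my_read_cb, (void*)(intptr_t)fd, SIZE_MAX};
+ */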
+
+/***************************
+ * Main decoding functions *
+ ***************************/
+
+/* Decode a single protocol buffers message from input stream into a C structure.
+ * Returns true on success, false on any failure.
+ * The actual struct pointed to by dest must match the description in fields.
+ * Callback fields of the destination structure must be initialized by caller.
+ * All other fields will be initialized by this function.
+ *
+ * Example usage:
+ * MyMessage msg = {};
+ * uint8_t buffer[64];
+ * pb_istream_t stream;
+ *
+ * // ... read some data into buffer ...
+ *
+ * stream = pb_istream_from_buffer(buffer, count);
+ * pb_decode(&stream, MyMessage_fields, &msg);
+ */
+bool pb_decode(pb_istream_t *stream, const pb_field_t fields[], void *dest_struct);
+
+/* Same as pb_decode, except does not initialize the destination structure
+ * to default values. This is slightly faster if you need no default values
+ * and just do memset(struct, 0, sizeof(struct)) yourself.
+ *
+ * This can also be used for 'merging' two messages, i.e. update only the
+ * fields that exist in the new message.
+ *
+ * Note: If this function returns with an error, it will not release any
+ * dynamically allocated fields. You will need to call pb_release() yourself.
+ */
+bool pb_decode_noinit(pb_istream_t *stream, const pb_field_t fields[], void *dest_struct);
+
+/* Same as pb_decode, except expects the stream to start with the message size
+ * encoded as varint. Corresponds to parseDelimitedFrom() in Google's
+ * protobuf API.
+ */
+bool pb_decode_delimited(pb_istream_t *stream, const pb_field_t fields[], void *dest_struct);
+
+/* Same as pb_decode_delimited, except that it does not initialize the destination structure.
+ * See pb_decode_noinit
+ */
+bool pb_decode_delimited_noinit(pb_istream_t *stream, const pb_field_t fields[], void *dest_struct);
+
+/* Same as pb_decode, except allows the message to be terminated with a null byte.
+ * NOTE: Until nanopb-0.4.0, pb_decode() also allows null-termination. This behaviour
+ * is not supported in most other protobuf implementations, so pb_decode_delimited()
+ * is a better option for compatibility.
+ */
+bool pb_decode_nullterminated(pb_istream_t *stream, const pb_field_t fields[], void *dest_struct);
+
+#ifdef PB_ENABLE_MALLOC
+/* Release any allocated pointer fields. If you use dynamic allocation, you should
+ * call this for any successfully decoded message when you are done with it. If
+ * pb_decode() returns with an error, the message is already released.
+ */
+void pb_release(const pb_field_t fields[], void *dest_struct);
+#endif
+
+
+/**************************************
+ * Functions for manipulating streams *
+ **************************************/
+
+/* Create an input stream for reading from a memory buffer.
+ *
+ * Alternatively, you can use a custom stream that reads directly from e.g.
+ * a file or a network socket.
+ */
+pb_istream_t pb_istream_from_buffer(const pb_byte_t *buf, size_t bufsize);
+
+/* Function to read from a pb_istream_t. You can use this if you need to
+ * read some custom header data, or to read data in field callbacks.
+ */
+bool pb_read(pb_istream_t *stream, pb_byte_t *buf, size_t count);
+
+
+/************************************************
+ * Helper functions for writing field callbacks *
+ ************************************************/
+
+/* Decode the tag for the next field in the stream. Gives the wire type and
+ * field tag. At end of the message, returns false and sets eof to true. */
+bool pb_decode_tag(pb_istream_t *stream, pb_wire_type_t *wire_type, uint32_t *tag, bool *eof);
+
+/* Skip the field payload data, given the wire type. */
+bool pb_skip_field(pb_istream_t *stream, pb_wire_type_t wire_type);
+
+/* Decode an integer in the varint format. This works for bool, enum, int32,
+ * int64, uint32 and uint64 field types. */
+#ifndef PB_WITHOUT_64BIT
+bool pb_decode_varint(pb_istream_t *stream, uint64_t *dest);
+#else
+#define pb_decode_varint pb_decode_varint32
+#endif
+
+/* Decode an integer in the varint format. This works for bool, enum, int32,
+ * and uint32 field types. */
+bool pb_decode_varint32(pb_istream_t *stream, uint32_t *dest);
+
+/* Decode an integer in the zig-zagged svarint format. This works for sint32
+ * and sint64. */
+#ifndef PB_WITHOUT_64BIT
+bool pb_decode_svarint(pb_istream_t *stream, int64_t *dest);
+#else
+bool pb_decode_svarint(pb_istream_t *stream, int32_t *dest);
+#endif
+
+/* Decode a fixed32, sfixed32 or float value. You need to pass a pointer to
+ * a 4-byte wide C variable. */
+bool pb_decode_fixed32(pb_istream_t *stream, void *dest);
+
+#ifndef PB_WITHOUT_64BIT
+/* Decode a fixed64, sfixed64 or double value. You need to pass a pointer to
+ * an 8-byte wide C variable. */
+bool pb_decode_fixed64(pb_istream_t *stream, void *dest);
+#endif
+
+/* Make a limited-length substream for reading a PB_WT_STRING field. */
+bool pb_make_string_substream(pb_istream_t *stream, pb_istream_t *substream);
+bool pb_close_string_substream(pb_istream_t *stream, pb_istream_t *substream);
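+
+/* Callback sketch (illustrative; print_person_name is a hypothetical name):
+ * a decode callback for a string or bytes field. For PB_WT_STRING fields the
+ * callback receives a substream already limited to the field length, so it
+ * can read exactly stream->bytes_left bytes.
+ *
+ * static bool print_person_name(pb_istream_t *stream, const pb_field_t *field, void **arg)
+ * {
+ * pb_byte_t buffer[64] = {0};
+ *
+ * if (stream->bytes_left >= sizeof(buffer))
+ * return false;
+ * return pb_read(stream, buffer, stream->bytes_left);
+ * }
+ */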
+
+#ifdef __cplusplus
+} /* extern "C" */
+#endif
+
+#endif
diff --git a/security/container/protos/nanopb/pb_encode.c b/security/container/protos/nanopb/pb_encode.c
new file mode 100644
index 0000000..089172c
--- /dev/null
+++ b/security/container/protos/nanopb/pb_encode.c
@@ -0,0 +1,869 @@
+/* pb_encode.c -- encode a protobuf using minimal resources
+ *
+ * 2011 Petteri Aimonen <jpa@kapsi.fi>
+ */
+
+#include "pb.h"
+#include "pb_encode.h"
+#include "pb_common.h"
+
+/* Use the GCC warn_unused_result attribute to check that all return values
+ * are propagated correctly. On other compilers, and on gcc before 3.4.0,
+ * the annotation is simply ignored.
+ */
+#if !defined(__GNUC__) || ( __GNUC__ < 3) || (__GNUC__ == 3 && __GNUC_MINOR__ < 4)
+ #define checkreturn
+#else
+ #define checkreturn __attribute__((warn_unused_result))
+#endif
+
+/**************************************
+ * Declarations internal to this file *
+ **************************************/
+typedef bool (*pb_encoder_t)(pb_ostream_t *stream, const pb_field_t *field, const void *src) checkreturn;
+
+static bool checkreturn buf_write(pb_ostream_t *stream, const pb_byte_t *buf, size_t count);
+static bool checkreturn encode_array(pb_ostream_t *stream, const pb_field_t *field, const void *pData, size_t count, pb_encoder_t func);
+static bool checkreturn encode_field(pb_ostream_t *stream, const pb_field_t *field, const void *pData);
+static bool checkreturn default_extension_encoder(pb_ostream_t *stream, const pb_extension_t *extension);
+static bool checkreturn encode_extension_field(pb_ostream_t *stream, const pb_field_t *field, const void *pData);
+static void *pb_const_cast(const void *p);
+static bool checkreturn pb_enc_varint(pb_ostream_t *stream, const pb_field_t *field, const void *src);
+static bool checkreturn pb_enc_uvarint(pb_ostream_t *stream, const pb_field_t *field, const void *src);
+static bool checkreturn pb_enc_svarint(pb_ostream_t *stream, const pb_field_t *field, const void *src);
+static bool checkreturn pb_enc_fixed32(pb_ostream_t *stream, const pb_field_t *field, const void *src);
+static bool checkreturn pb_enc_fixed64(pb_ostream_t *stream, const pb_field_t *field, const void *src);
+static bool checkreturn pb_enc_bytes(pb_ostream_t *stream, const pb_field_t *field, const void *src);
+static bool checkreturn pb_enc_string(pb_ostream_t *stream, const pb_field_t *field, const void *src);
+static bool checkreturn pb_enc_submessage(pb_ostream_t *stream, const pb_field_t *field, const void *src);
+static bool checkreturn pb_enc_fixed_length_bytes(pb_ostream_t *stream, const pb_field_t *field, const void *src);
+
+#ifdef PB_WITHOUT_64BIT
+#define pb_int64_t int32_t
+#define pb_uint64_t uint32_t
+
+static bool checkreturn pb_encode_negative_varint(pb_ostream_t *stream, pb_uint64_t value);
+#else
+#define pb_int64_t int64_t
+#define pb_uint64_t uint64_t
+#endif
+
+/* --- Function pointers to field encoders ---
+ * Order in the array must match pb_action_t LTYPE numbering.
+ */
+static const pb_encoder_t PB_ENCODERS[PB_LTYPES_COUNT] = {
+ &pb_enc_varint,
+ &pb_enc_uvarint,
+ &pb_enc_svarint,
+ &pb_enc_fixed32,
+ &pb_enc_fixed64,
+
+ &pb_enc_bytes,
+ &pb_enc_string,
+ &pb_enc_submessage,
+ NULL, /* extensions */
+ &pb_enc_fixed_length_bytes
+};
+
+/*******************************
+ * pb_ostream_t implementation *
+ *******************************/
+
+static bool checkreturn buf_write(pb_ostream_t *stream, const pb_byte_t *buf, size_t count)
+{
+ size_t i;
+ pb_byte_t *dest = (pb_byte_t*)stream->state;
+ stream->state = dest + count;
+
+ for (i = 0; i < count; i++)
+ dest[i] = buf[i];
+
+ return true;
+}
+
+pb_ostream_t pb_ostream_from_buffer(pb_byte_t *buf, size_t bufsize)
+{
+ pb_ostream_t stream;
+#ifdef PB_BUFFER_ONLY
+ stream.callback = (void*)1; /* Just a marker value */
+#else
+ stream.callback = &buf_write;
+#endif
+ stream.state = buf;
+ stream.max_size = bufsize;
+ stream.bytes_written = 0;
+#ifndef PB_NO_ERRMSG
+ stream.errmsg = NULL;
+#endif
+ return stream;
+}
+
+bool checkreturn pb_write(pb_ostream_t *stream, const pb_byte_t *buf, size_t count)
+{
+ if (stream->callback != NULL)
+ {
+ if (stream->bytes_written + count > stream->max_size)
+ PB_RETURN_ERROR(stream, "stream full");
+
+#ifdef PB_BUFFER_ONLY
+ if (!buf_write(stream, buf, count))
+ PB_RETURN_ERROR(stream, "io error");
+#else
+ if (!stream->callback(stream, buf, count))
+ PB_RETURN_ERROR(stream, "io error");
+#endif
+ }
+
+ stream->bytes_written += count;
+ return true;
+}
+
+/*************************
+ * Encode a single field *
+ *************************/
+
+/* Encode a static array. Handles the size calculations and possible packing. */
+static bool checkreturn encode_array(pb_ostream_t *stream, const pb_field_t *field,
+ const void *pData, size_t count, pb_encoder_t func)
+{
+ size_t i;
+ const void *p;
+ size_t size;
+
+ if (count == 0)
+ return true;
+
+ if (PB_ATYPE(field->type) != PB_ATYPE_POINTER && count > field->array_size)
+ PB_RETURN_ERROR(stream, "array max size exceeded");
+
+ /* We always pack arrays if the datatype allows it. */
+ if (PB_LTYPE(field->type) <= PB_LTYPE_LAST_PACKABLE)
+ {
+ if (!pb_encode_tag(stream, PB_WT_STRING, field->tag))
+ return false;
+
+ /* Determine the total size of packed array. */
+ if (PB_LTYPE(field->type) == PB_LTYPE_FIXED32)
+ {
+ size = 4 * count;
+ }
+ else if (PB_LTYPE(field->type) == PB_LTYPE_FIXED64)
+ {
+ size = 8 * count;
+ }
+ else
+ {
+ pb_ostream_t sizestream = PB_OSTREAM_SIZING;
+ p = pData;
+ for (i = 0; i < count; i++)
+ {
+ if (!func(&sizestream, field, p))
+ return false;
+ p = (const char*)p + field->data_size;
+ }
+ size = sizestream.bytes_written;
+ }
+
+ if (!pb_encode_varint(stream, (pb_uint64_t)size))
+ return false;
+
+ if (stream->callback == NULL)
+ return pb_write(stream, NULL, size); /* Just sizing */
+
+ /* Write the data */
+ p = pData;
+ for (i = 0; i < count; i++)
+ {
+ if (!func(stream, field, p))
+ return false;
+ p = (const char*)p + field->data_size;
+ }
+ }
+ else
+ {
+ p = pData;
+ for (i = 0; i < count; i++)
+ {
+ if (!pb_encode_tag_for_field(stream, field))
+ return false;
+
+ /* Normally the data is stored directly in the array entries, but
+ * for pointer-type string and bytes fields, the array entries are
+ * actually pointers themselves as well. So we have to dereference once
+ * more to get to the actual data. */
+ if (PB_ATYPE(field->type) == PB_ATYPE_POINTER &&
+ (PB_LTYPE(field->type) == PB_LTYPE_STRING ||
+ PB_LTYPE(field->type) == PB_LTYPE_BYTES))
+ {
+ if (!func(stream, field, *(const void* const*)p))
+ return false;
+ }
+ else
+ {
+ if (!func(stream, field, p))
+ return false;
+ }
+ p = (const char*)p + field->data_size;
+ }
+ }
+
+ return true;
+}
+
+/* In proto3, all fields are optional and are only encoded if their value is "non-zero".
+ * This function implements the check for the zero value. */
+static bool pb_check_proto3_default_value(const pb_field_t *field, const void *pData)
+{
+ pb_type_t type = field->type;
+ const void *pSize = (const char*)pData + field->size_offset;
+
+ if (PB_HTYPE(type) == PB_HTYPE_REQUIRED)
+ {
+ /* Required proto2 fields inside proto3 submessage, pretty rare case */
+ return false;
+ }
+ else if (PB_HTYPE(type) == PB_HTYPE_REPEATED)
+ {
+ /* Repeated fields inside proto3 submessage: present if count != 0 */
+ return *(const pb_size_t*)pSize == 0;
+ }
+ else if (PB_HTYPE(type) == PB_HTYPE_ONEOF)
+ {
+ /* Oneof fields */
+ return *(const pb_size_t*)pSize == 0;
+ }
+ else if (PB_HTYPE(type) == PB_HTYPE_OPTIONAL && field->size_offset)
+ {
+ /* Proto2 optional fields inside proto3 submessage */
+ return *(const bool*)pSize == false;
+ }
+
+ /* Rest is proto3 singular fields */
+
+ if (PB_ATYPE(type) == PB_ATYPE_STATIC)
+ {
+ if (PB_LTYPE(type) == PB_LTYPE_BYTES)
+ {
+ const pb_bytes_array_t *bytes = (const pb_bytes_array_t*)pData;
+ return bytes->size == 0;
+ }
+ else if (PB_LTYPE(type) == PB_LTYPE_STRING)
+ {
+ return *(const char*)pData == '\0';
+ }
+ else if (PB_LTYPE(type) == PB_LTYPE_FIXED_LENGTH_BYTES)
+ {
+ /* A fixed-length bytes field is only empty if its length is fixed
+ * at 0, which would be pretty strange, but we can check
+ * it anyway. */
+ return field->data_size == 0;
+ }
+ else if (PB_LTYPE(type) == PB_LTYPE_SUBMESSAGE)
+ {
+ /* Check all fields in the submessage to find if any of them
+ * are non-zero. The comparison cannot be done byte-by-byte
+ * because the C struct may contain padding bytes that must
+ * be skipped.
+ */
+ pb_field_iter_t iter;
+ if (pb_field_iter_begin(&iter, (const pb_field_t*)field->ptr, pb_const_cast(pData)))
+ {
+ do
+ {
+ if (!pb_check_proto3_default_value(iter.pos, iter.pData))
+ {
+ return false;
+ }
+ } while (pb_field_iter_next(&iter));
+ }
+ return true;
+ }
+ }
+
+ {
+ /* Catch-all branch that does a byte-by-byte comparison for zero value.
+ *
+ * This is for all pointer fields, and for static PB_LTYPE_VARINT,
+ * UVARINT, SVARINT, FIXED32, FIXED64, EXTENSION fields, and also
+ * callback fields. These all have integer or pointer value which
+ * can be compared with 0.
+ */
+ pb_size_t i;
+ const char *p = (const char*)pData;
+ for (i = 0; i < field->data_size; i++)
+ {
+ if (p[i] != 0)
+ {
+ return false;
+ }
+ }
+
+ return true;
+ }
+}
+
+/* Encode a field with static or pointer allocation, i.e. one whose data
+ * is available to the encoder directly. */
+static bool checkreturn encode_basic_field(pb_ostream_t *stream,
+ const pb_field_t *field, const void *pData)
+{
+ pb_encoder_t func;
+ bool implicit_has;
+ const void *pSize = &implicit_has;
+
+ func = PB_ENCODERS[PB_LTYPE(field->type)];
+
+ if (field->size_offset)
+ {
+ /* Static optional, repeated or oneof field */
+ pSize = (const char*)pData + field->size_offset;
+ }
+ else if (PB_HTYPE(field->type) == PB_HTYPE_OPTIONAL)
+ {
+ /* Proto3 style field, optional but without explicit has_ field. */
+ implicit_has = !pb_check_proto3_default_value(field, pData);
+ }
+ else
+ {
+ /* Required field, always present */
+ implicit_has = true;
+ }
+
+ if (PB_ATYPE(field->type) == PB_ATYPE_POINTER)
+ {
+ /* pData is a pointer to the field, which contains a pointer to
+ * the data. If the second pointer is NULL, it is interpreted as if
+ * has_field were false.
+ */
+ pData = *(const void* const*)pData;
+ implicit_has = (pData != NULL);
+ }
+
+ switch (PB_HTYPE(field->type))
+ {
+ case PB_HTYPE_REQUIRED:
+ if (!pData)
+ PB_RETURN_ERROR(stream, "missing required field");
+ if (!pb_encode_tag_for_field(stream, field))
+ return false;
+ if (!func(stream, field, pData))
+ return false;
+ break;
+
+ case PB_HTYPE_OPTIONAL:
+ if (*(const bool*)pSize)
+ {
+ if (!pb_encode_tag_for_field(stream, field))
+ return false;
+
+ if (!func(stream, field, pData))
+ return false;
+ }
+ break;
+
+ case PB_HTYPE_REPEATED: {
+ pb_size_t count;
+ if (field->size_offset != 0) {
+ count = *(const pb_size_t*)pSize;
+ } else {
+ count = field->array_size;
+ }
+ if (!encode_array(stream, field, pData, count, func))
+ return false;
+ break;
+ }
+
+ case PB_HTYPE_ONEOF:
+ if (*(const pb_size_t*)pSize == field->tag)
+ {
+ if (!pb_encode_tag_for_field(stream, field))
+ return false;
+
+ if (!func(stream, field, pData))
+ return false;
+ }
+ break;
+
+ default:
+ PB_RETURN_ERROR(stream, "invalid field type");
+ }
+
+ return true;
+}
+
+/* Encode a field with callback semantics. This means that a user function is
+ * called to provide and encode the actual data. */
+static bool checkreturn encode_callback_field(pb_ostream_t *stream,
+ const pb_field_t *field, const void *pData)
+{
+ const pb_callback_t *callback = (const pb_callback_t*)pData;
+
+#ifdef PB_OLD_CALLBACK_STYLE
+ const void *arg = callback->arg;
+#else
+ void * const *arg = &(callback->arg);
+#endif
+
+ if (callback->funcs.encode != NULL)
+ {
+ if (!callback->funcs.encode(stream, field, arg))
+ PB_RETURN_ERROR(stream, "callback error");
+ }
+ return true;
+}
+
+/* Encode a single field of any callback or static type. */
+static bool checkreturn encode_field(pb_ostream_t *stream,
+ const pb_field_t *field, const void *pData)
+{
+ switch (PB_ATYPE(field->type))
+ {
+ case PB_ATYPE_STATIC:
+ case PB_ATYPE_POINTER:
+ return encode_basic_field(stream, field, pData);
+
+ case PB_ATYPE_CALLBACK:
+ return encode_callback_field(stream, field, pData);
+
+ default:
+ PB_RETURN_ERROR(stream, "invalid field type");
+ }
+}
+
+/* Default handler for extension fields. Expects to have a pb_field_t
+ * pointer in the extension->type->arg field. */
+static bool checkreturn default_extension_encoder(pb_ostream_t *stream,
+ const pb_extension_t *extension)
+{
+ const pb_field_t *field = (const pb_field_t*)extension->type->arg;
+
+ if (PB_ATYPE(field->type) == PB_ATYPE_POINTER)
+ {
+ /* For pointer extensions, the pointer is stored directly
+ * in the extension structure. This avoids having an extra
+ * indirection. */
+ return encode_field(stream, field, &extension->dest);
+ }
+ else
+ {
+ return encode_field(stream, field, extension->dest);
+ }
+}
+
+/* Walk through all the registered extensions and give them a chance
+ * to encode themselves. */
+static bool checkreturn encode_extension_field(pb_ostream_t *stream,
+ const pb_field_t *field, const void *pData)
+{
+ const pb_extension_t *extension = *(const pb_extension_t* const *)pData;
+ PB_UNUSED(field);
+
+ while (extension)
+ {
+ bool status;
+ if (extension->type->encode)
+ status = extension->type->encode(stream, extension);
+ else
+ status = default_extension_encoder(stream, extension);
+
+ if (!status)
+ return false;
+
+ extension = extension->next;
+ }
+
+ return true;
+}
+
+/*********************
+ * Encode all fields *
+ *********************/
+
+static void *pb_const_cast(const void *p)
+{
+ /* Note: this casts away const, in order to use the common field iterator
+ * logic for both encoding and decoding. */
+ union {
+ void *p1;
+ const void *p2;
+ } t;
+ t.p2 = p;
+ return t.p1;
+}
+
+bool checkreturn pb_encode(pb_ostream_t *stream, const pb_field_t fields[], const void *src_struct)
+{
+ pb_field_iter_t iter;
+ if (!pb_field_iter_begin(&iter, fields, pb_const_cast(src_struct)))
+ return true; /* Empty message type */
+
+ do {
+ if (PB_LTYPE(iter.pos->type) == PB_LTYPE_EXTENSION)
+ {
+ /* Special case for the extension field placeholder */
+ if (!encode_extension_field(stream, iter.pos, iter.pData))
+ return false;
+ }
+ else
+ {
+ /* Regular field */
+ if (!encode_field(stream, iter.pos, iter.pData))
+ return false;
+ }
+ } while (pb_field_iter_next(&iter));
+
+ return true;
+}
+
+bool pb_encode_delimited(pb_ostream_t *stream, const pb_field_t fields[], const void *src_struct)
+{
+ return pb_encode_submessage(stream, fields, src_struct);
+}
+
+bool pb_encode_nullterminated(pb_ostream_t *stream, const pb_field_t fields[], const void *src_struct)
+{
+ const pb_byte_t zero = 0;
+
+ if (!pb_encode(stream, fields, src_struct))
+ return false;
+
+ return pb_write(stream, &zero, 1);
+}
+
+bool pb_get_encoded_size(size_t *size, const pb_field_t fields[], const void *src_struct)
+{
+ pb_ostream_t stream = PB_OSTREAM_SIZING;
+
+ if (!pb_encode(&stream, fields, src_struct))
+ return false;
+
+ *size = stream.bytes_written;
+ return true;
+}
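+
+/* Sizing sketch (illustrative; MyMessage, msg and buffer are hypothetical):
+ * determine the encoded size first, then encode into a buffer of exactly
+ * that size.
+ *
+ * size_t size;
+ * if (!pb_get_encoded_size(&size, MyMessage_fields, &msg))
+ * return false;
+ * stream = pb_ostream_from_buffer(buffer, size);
+ * if (!pb_encode(&stream, MyMessage_fields, &msg))
+ * return false;
+ */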
+
+/********************
+ * Helper functions *
+ ********************/
+
+#ifdef PB_WITHOUT_64BIT
+bool checkreturn pb_encode_negative_varint(pb_ostream_t *stream, pb_uint64_t value)
+{
+ pb_byte_t buffer[10];
+ size_t i = 0;
+ size_t compensation = 32; /* we need to compensate for 32 bits all set to 1 */
+
+ while (value)
+ {
+ buffer[i] = (pb_byte_t)((value & 0x7F) | 0x80);
+ value >>= 7;
+ if (compensation)
+ {
+ /* re-set all the compensation bits we can or need */
+ size_t bits = compensation > 7 ? 7 : compensation;
+ value ^= (pb_uint64_t)((0xFFu >> (8 - bits)) << 25); /* set the number of bits needed on the lowest of the most significant 7 bits */
+ compensation -= bits;
+ }
+ i++;
+ }
+ buffer[i - 1] &= 0x7F; /* Unset top bit on last byte */
+
+ return pb_write(stream, buffer, i);
+}
+#endif
+
+bool checkreturn pb_encode_varint(pb_ostream_t *stream, pb_uint64_t value)
+{
+ pb_byte_t buffer[10];
+ size_t i = 0;
+
+ if (value <= 0x7F)
+ {
+ pb_byte_t v = (pb_byte_t)value;
+ return pb_write(stream, &v, 1);
+ }
+
+ while (value)
+ {
+ buffer[i] = (pb_byte_t)((value & 0x7F) | 0x80);
+ value >>= 7;
+ i++;
+ }
+ buffer[i-1] &= 0x7F; /* Unset top bit on last byte */
+
+ return pb_write(stream, buffer, i);
+}
+
+bool checkreturn pb_encode_svarint(pb_ostream_t *stream, pb_int64_t value)
+{
+ pb_uint64_t zigzagged;
+ if (value < 0)
+ zigzagged = ~((pb_uint64_t)value << 1);
+ else
+ zigzagged = (pb_uint64_t)value << 1;
+
+ return pb_encode_varint(stream, zigzagged);
+}
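+
+/* The zig-zag mapping implemented above, for reference:
+ * 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
+ * so that small negative values also encode into few varint bytes. */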
+
+bool checkreturn pb_encode_fixed32(pb_ostream_t *stream, const void *value)
+{
+ uint32_t val = *(const uint32_t*)value;
+ pb_byte_t bytes[4];
+ bytes[0] = (pb_byte_t)(val & 0xFF);
+ bytes[1] = (pb_byte_t)((val >> 8) & 0xFF);
+ bytes[2] = (pb_byte_t)((val >> 16) & 0xFF);
+ bytes[3] = (pb_byte_t)((val >> 24) & 0xFF);
+ return pb_write(stream, bytes, 4);
+}
+
+#ifndef PB_WITHOUT_64BIT
+bool checkreturn pb_encode_fixed64(pb_ostream_t *stream, const void *value)
+{
+ uint64_t val = *(const uint64_t*)value;
+ pb_byte_t bytes[8];
+ bytes[0] = (pb_byte_t)(val & 0xFF);
+ bytes[1] = (pb_byte_t)((val >> 8) & 0xFF);
+ bytes[2] = (pb_byte_t)((val >> 16) & 0xFF);
+ bytes[3] = (pb_byte_t)((val >> 24) & 0xFF);
+ bytes[4] = (pb_byte_t)((val >> 32) & 0xFF);
+ bytes[5] = (pb_byte_t)((val >> 40) & 0xFF);
+ bytes[6] = (pb_byte_t)((val >> 48) & 0xFF);
+ bytes[7] = (pb_byte_t)((val >> 56) & 0xFF);
+ return pb_write(stream, bytes, 8);
+}
+#endif
+
+bool checkreturn pb_encode_tag(pb_ostream_t *stream, pb_wire_type_t wiretype, uint32_t field_number)
+{
+ pb_uint64_t tag = ((pb_uint64_t)field_number << 3) | wiretype;
+ return pb_encode_varint(stream, tag);
+}
+
+bool checkreturn pb_encode_tag_for_field(pb_ostream_t *stream, const pb_field_t *field)
+{
+ pb_wire_type_t wiretype;
+ switch (PB_LTYPE(field->type))
+ {
+ case PB_LTYPE_VARINT:
+ case PB_LTYPE_UVARINT:
+ case PB_LTYPE_SVARINT:
+ wiretype = PB_WT_VARINT;
+ break;
+
+ case PB_LTYPE_FIXED32:
+ wiretype = PB_WT_32BIT;
+ break;
+
+ case PB_LTYPE_FIXED64:
+ wiretype = PB_WT_64BIT;
+ break;
+
+ case PB_LTYPE_BYTES:
+ case PB_LTYPE_STRING:
+ case PB_LTYPE_SUBMESSAGE:
+ case PB_LTYPE_FIXED_LENGTH_BYTES:
+ wiretype = PB_WT_STRING;
+ break;
+
+ default:
+ PB_RETURN_ERROR(stream, "invalid field type");
+ }
+
+ return pb_encode_tag(stream, wiretype, field->tag);
+}
+
+bool checkreturn pb_encode_string(pb_ostream_t *stream, const pb_byte_t *buffer, size_t size)
+{
+ if (!pb_encode_varint(stream, (pb_uint64_t)size))
+ return false;
+
+ return pb_write(stream, buffer, size);
+}
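+
+/* Callback sketch (illustrative; write_greeting is a hypothetical name and
+ * strlen() from <string.h> is assumed available): a typical encode callback
+ * emits the field tag itself and then the payload.
+ *
+ * static bool write_greeting(pb_ostream_t *stream, const pb_field_t *field, void * const *arg)
+ * {
+ * const char *str = (const char*)*arg;
+ *
+ * if (!pb_encode_tag_for_field(stream, field))
+ * return false;
+ * return pb_encode_string(stream, (const pb_byte_t*)str, strlen(str));
+ * }
+ */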
+
+bool checkreturn pb_encode_submessage(pb_ostream_t *stream, const pb_field_t fields[], const void *src_struct)
+{
+ /* First calculate the message size using a non-writing substream. */
+ pb_ostream_t substream = PB_OSTREAM_SIZING;
+ size_t size;
+ bool status;
+
+ if (!pb_encode(&substream, fields, src_struct))
+ {
+#ifndef PB_NO_ERRMSG
+ stream->errmsg = substream.errmsg;
+#endif
+ return false;
+ }
+
+ size = substream.bytes_written;
+
+ if (!pb_encode_varint(stream, (pb_uint64_t)size))
+ return false;
+
+ if (stream->callback == NULL)
+ return pb_write(stream, NULL, size); /* Just sizing */
+
+ if (stream->bytes_written + size > stream->max_size)
+ PB_RETURN_ERROR(stream, "stream full");
+
+ /* Use a substream to verify that a callback doesn't write more than
+ * what it did the first time. */
+ substream.callback = stream->callback;
+ substream.state = stream->state;
+ substream.max_size = size;
+ substream.bytes_written = 0;
+#ifndef PB_NO_ERRMSG
+ substream.errmsg = NULL;
+#endif
+
+ status = pb_encode(&substream, fields, src_struct);
+
+ stream->bytes_written += substream.bytes_written;
+ stream->state = substream.state;
+#ifndef PB_NO_ERRMSG
+ stream->errmsg = substream.errmsg;
+#endif
+
+ if (substream.bytes_written != size)
+ PB_RETURN_ERROR(stream, "submsg size changed");
+
+ return status;
+}
+
+/* Field encoders */
+
+static bool checkreturn pb_enc_varint(pb_ostream_t *stream, const pb_field_t *field, const void *src)
+{
+ pb_int64_t value = 0;
+
+ if (field->data_size == sizeof(int_least8_t))
+ value = *(const int_least8_t*)src;
+ else if (field->data_size == sizeof(int_least16_t))
+ value = *(const int_least16_t*)src;
+ else if (field->data_size == sizeof(int32_t))
+ value = *(const int32_t*)src;
+ else if (field->data_size == sizeof(pb_int64_t))
+ value = *(const pb_int64_t*)src;
+ else
+ PB_RETURN_ERROR(stream, "invalid data_size");
+
+#ifdef PB_WITHOUT_64BIT
+ if (value < 0)
+ return pb_encode_negative_varint(stream, (pb_uint64_t)value);
+ else
+#endif
+ return pb_encode_varint(stream, (pb_uint64_t)value);
+}
+
+static bool checkreturn pb_enc_uvarint(pb_ostream_t *stream, const pb_field_t *field, const void *src)
+{
+ pb_uint64_t value = 0;
+
+ if (field->data_size == sizeof(uint_least8_t))
+ value = *(const uint_least8_t*)src;
+ else if (field->data_size == sizeof(uint_least16_t))
+ value = *(const uint_least16_t*)src;
+ else if (field->data_size == sizeof(uint32_t))
+ value = *(const uint32_t*)src;
+ else if (field->data_size == sizeof(pb_uint64_t))
+ value = *(const pb_uint64_t*)src;
+ else
+ PB_RETURN_ERROR(stream, "invalid data_size");
+
+ return pb_encode_varint(stream, value);
+}
+
+static bool checkreturn pb_enc_svarint(pb_ostream_t *stream, const pb_field_t *field, const void *src)
+{
+ pb_int64_t value = 0;
+
+ if (field->data_size == sizeof(int_least8_t))
+ value = *(const int_least8_t*)src;
+ else if (field->data_size == sizeof(int_least16_t))
+ value = *(const int_least16_t*)src;
+ else if (field->data_size == sizeof(int32_t))
+ value = *(const int32_t*)src;
+ else if (field->data_size == sizeof(pb_int64_t))
+ value = *(const pb_int64_t*)src;
+ else
+ PB_RETURN_ERROR(stream, "invalid data_size");
+
+ return pb_encode_svarint(stream, value);
+}
+
+static bool checkreturn pb_enc_fixed64(pb_ostream_t *stream, const pb_field_t *field, const void *src)
+{
+ PB_UNUSED(field);
+#ifndef PB_WITHOUT_64BIT
+ return pb_encode_fixed64(stream, src);
+#else
+ PB_UNUSED(src);
+ PB_RETURN_ERROR(stream, "no 64bit support");
+#endif
+}
+
+static bool checkreturn pb_enc_fixed32(pb_ostream_t *stream, const pb_field_t *field, const void *src)
+{
+ PB_UNUSED(field);
+ return pb_encode_fixed32(stream, src);
+}
+
+static bool checkreturn pb_enc_bytes(pb_ostream_t *stream, const pb_field_t *field, const void *src)
+{
+ const pb_bytes_array_t *bytes = NULL;
+
+ bytes = (const pb_bytes_array_t*)src;
+
+ if (src == NULL)
+ {
+ /* Treat null pointer as an empty bytes field */
+ return pb_encode_string(stream, NULL, 0);
+ }
+
+ if (PB_ATYPE(field->type) == PB_ATYPE_STATIC &&
+ PB_BYTES_ARRAY_T_ALLOCSIZE(bytes->size) > field->data_size)
+ {
+ PB_RETURN_ERROR(stream, "bytes size exceeded");
+ }
+
+ return pb_encode_string(stream, bytes->bytes, bytes->size);
+}
+
+static bool checkreturn pb_enc_string(pb_ostream_t *stream, const pb_field_t *field, const void *src)
+{
+ size_t size = 0;
+ size_t max_size = field->data_size;
+ const char *p = (const char*)src;
+
+ if (PB_ATYPE(field->type) == PB_ATYPE_POINTER)
+ max_size = (size_t)-1;
+
+ if (src == NULL)
+ {
+ size = 0; /* Treat null pointer as an empty string */
+ }
+ else
+ {
+ /* strnlen() is not always available, so just use a loop */
+ while (size < max_size && *p != '\0')
+ {
+ size++;
+ p++;
+ }
+ }
+
+ return pb_encode_string(stream, (const pb_byte_t*)src, size);
+}
+
+static bool checkreturn pb_enc_submessage(pb_ostream_t *stream, const pb_field_t *field, const void *src)
+{
+ if (field->ptr == NULL)
+ PB_RETURN_ERROR(stream, "invalid field descriptor");
+
+ return pb_encode_submessage(stream, (const pb_field_t*)field->ptr, src);
+}
+
+static bool checkreturn pb_enc_fixed_length_bytes(pb_ostream_t *stream, const pb_field_t *field, const void *src)
+{
+ return pb_encode_string(stream, (const pb_byte_t*)src, field->data_size);
+}
+
diff --git a/security/container/protos/nanopb/pb_encode.h b/security/container/protos/nanopb/pb_encode.h
new file mode 100644
index 0000000..8bf78dd
--- /dev/null
+++ b/security/container/protos/nanopb/pb_encode.h
@@ -0,0 +1,170 @@
+/* pb_encode.h: Functions to encode protocol buffers. Depends on pb_encode.c.
+ * The main function is pb_encode. You also need an output stream, and the
+ * field descriptions created by nanopb_generator.py.
+ */
+
+#ifndef PB_ENCODE_H_INCLUDED
+#define PB_ENCODE_H_INCLUDED
+
+#include "pb.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/* Structure for defining custom output streams. You will need to provide
+ * a callback function to write the bytes to your storage, which can be
+ * for example a file or a network socket.
+ *
+ * The callback must conform to these rules:
+ *
+ * 1) Return false on IO errors. This will cause encoding to abort.
+ * 2) You can use state to store your own data (e.g. buffer pointer).
+ * 3) pb_write will update bytes_written after your callback runs.
+ * 4) Substreams will modify max_size and bytes_written. Don't use them
+ * to calculate any pointers.
+ */
+struct pb_ostream_s
+{
+#ifdef PB_BUFFER_ONLY
+ /* Callback pointer is not used in buffer-only configuration.
+ * Having an int pointer here allows binary compatibility but
+ * gives an error if someone tries to assign callback function.
+ * Also, NULL pointer marks a 'sizing stream' that does not
+ * write anything.
+ */
+ int *callback;
+#else
+ bool (*callback)(pb_ostream_t *stream, const pb_byte_t *buf, size_t count);
+#endif
+ void *state; /* Free field for use by callback implementation. */
+ size_t max_size; /* Limit number of output bytes written (or use SIZE_MAX). */
+ size_t bytes_written; /* Number of bytes written so far. */
+
+#ifndef PB_NO_ERRMSG
+ const char *errmsg;
+#endif
+};
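+
+/* Illustrative sketch (not part of this header, mirroring the examples used
+ * elsewhere in this file): a write callback that appends bytes to a
+ * hypothetical caller-owned sink passed through stream->state. Returning
+ * false signals an IO error and aborts encoding.
+ *
+ *     struct my_sink { pb_byte_t *data; size_t used, capacity; };
+ *
+ *     static bool my_write(pb_ostream_t *stream, const pb_byte_t *buf,
+ *                          size_t count)
+ *     {
+ *         struct my_sink *sink = (struct my_sink *)stream->state;
+ *
+ *         if (sink->used + count > sink->capacity)
+ *             return false;
+ *         memcpy(sink->data + sink->used, buf, count);
+ *         sink->used += count;
+ *         return true;
+ *     }
+ *
+ *     pb_ostream_t stream = { &my_write, &sink, SIZE_MAX, 0 };
+ */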
+
+/***************************
+ * Main encoding functions *
+ ***************************/
+
+/* Encode a single protocol buffers message from C structure into a stream.
+ * Returns true on success, false on any failure.
+ * The actual struct pointed to by src_struct must match the description in fields.
+ * All required fields in the struct are assumed to have been filled in.
+ *
+ * Example usage:
+ * MyMessage msg = {};
+ * uint8_t buffer[64];
+ * pb_ostream_t stream;
+ *
+ * msg.field1 = 42;
+ * stream = pb_ostream_from_buffer(buffer, sizeof(buffer));
+ * pb_encode(&stream, MyMessage_fields, &msg);
+ */
+bool pb_encode(pb_ostream_t *stream, const pb_field_t fields[], const void *src_struct);
+
+/* Same as pb_encode, but prepends the length of the message as a varint.
+ * Corresponds to writeDelimitedTo() in Google's protobuf API.
+ */
+bool pb_encode_delimited(pb_ostream_t *stream, const pb_field_t fields[], const void *src_struct);
+
+/* Same as pb_encode, but appends a null byte to the message for termination.
+ * NOTE: This behaviour is not supported in most other protobuf implementations, so pb_encode_delimited()
+ * is a better option for compatibility.
+ */
+bool pb_encode_nullterminated(pb_ostream_t *stream, const pb_field_t fields[], const void *src_struct);
+
+/* Encode the message to get the size of the encoded data, but do not store
+ * the data. */
+bool pb_get_encoded_size(size_t *size, const pb_field_t fields[], const void *src_struct);
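+/* Illustrative use (a sketch, reusing the hypothetical MyMessage from the
+ * examples above):
+ *     size_t len;
+ *     if (pb_get_encoded_size(&len, MyMessage_fields, &msg))
+ *             ... allocate len bytes, then call pb_encode() ...
+ */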
+
+/**************************************
+ * Functions for manipulating streams *
+ **************************************/
+
+/* Create an output stream for writing into a memory buffer.
+ * The number of bytes written can be found in stream.bytes_written after
+ * encoding the message.
+ *
+ * Alternatively, you can use a custom stream that writes directly to e.g.
+ * a file or a network socket.
+ */
+pb_ostream_t pb_ostream_from_buffer(pb_byte_t *buf, size_t bufsize);
+
+/* Pseudo-stream for measuring the size of a message without actually storing
+ * the encoded data.
+ *
+ * Example usage:
+ * MyMessage msg = {};
+ * pb_ostream_t stream = PB_OSTREAM_SIZING;
+ * pb_encode(&stream, MyMessage_fields, &msg);
+ * printf("Message size is %d\n", stream.bytes_written);
+ */
+#ifndef PB_NO_ERRMSG
+#define PB_OSTREAM_SIZING {0,0,0,0,0}
+#else
+#define PB_OSTREAM_SIZING {0,0,0,0}
+#endif
+
+/* Function to write into a pb_ostream_t stream. You can use this if you need
+ * to append or prepend some custom headers to the message.
+ */
+bool pb_write(pb_ostream_t *stream, const pb_byte_t *buf, size_t count);
+
+
+/************************************************
+ * Helper functions for writing field callbacks *
+ ************************************************/
+
+/* Encode field header based on type and field number defined in the field
+ * structure. Call this from the callback before writing out field contents. */
+bool pb_encode_tag_for_field(pb_ostream_t *stream, const pb_field_t *field);
+
+/* Encode field header by manually specifying wire type. You need to use this
+ * if you want to write out packed arrays from a callback field. */
+bool pb_encode_tag(pb_ostream_t *stream, pb_wire_type_t wiretype, uint32_t field_number);
+
+/* Encode an integer in the varint format.
+ * This works for bool, enum, int32, int64, uint32 and uint64 field types. */
+#ifndef PB_WITHOUT_64BIT
+bool pb_encode_varint(pb_ostream_t *stream, uint64_t value);
+#else
+bool pb_encode_varint(pb_ostream_t *stream, uint32_t value);
+#endif
+
+/* Encode an integer in the zig-zagged svarint format.
+ * This works for sint32 and sint64. */
+#ifndef PB_WITHOUT_64BIT
+bool pb_encode_svarint(pb_ostream_t *stream, int64_t value);
+#else
+bool pb_encode_svarint(pb_ostream_t *stream, int32_t value);
+#endif
+
+/* Encode a string or bytes type field. For strings, pass strlen(s) as size. */
+bool pb_encode_string(pb_ostream_t *stream, const pb_byte_t *buffer, size_t size);
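+
+/* Illustrative sketch (an assumption, not a declaration in this header): a
+ * typical callback field encoder writes the tag first, then the payload.
+ * Here *arg is assumed to point to a NUL-terminated string set by the caller.
+ *
+ *     static bool encode_my_string(pb_ostream_t *stream,
+ *                                  const pb_field_t *field, void * const *arg)
+ *     {
+ *         const char *str = (const char *)*arg;
+ *
+ *         if (!pb_encode_tag_for_field(stream, field))
+ *             return false;
+ *         return pb_encode_string(stream, (const pb_byte_t *)str, strlen(str));
+ *     }
+ */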
+
+/* Encode a fixed32, sfixed32 or float value.
+ * You need to pass a pointer to a 4-byte wide C variable. */
+bool pb_encode_fixed32(pb_ostream_t *stream, const void *value);
+
+#ifndef PB_WITHOUT_64BIT
+/* Encode a fixed64, sfixed64 or double value.
+ * You need to pass a pointer to an 8-byte wide C variable. */
+bool pb_encode_fixed64(pb_ostream_t *stream, const void *value);
+#endif
+
+/* Encode a submessage field.
+ * You need to pass the pb_field_t array and pointer to struct, just like
+ * with pb_encode(). This internally encodes the submessage twice, first to
+ * calculate message size and then to actually write it out.
+ */
+bool pb_encode_submessage(pb_ostream_t *stream, const pb_field_t fields[], const void *src_struct);
+
+#ifdef __cplusplus
+} /* extern "C" */
+#endif
+
+#endif
diff --git a/security/container/protos/pbsystem.h b/security/container/protos/pbsystem.h
new file mode 100644
index 0000000..f2308f8
--- /dev/null
+++ b/security/container/protos/pbsystem.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Header and types for nanopb to work with the Linux kernel */
+#include <linux/kernel.h>
+#include <linux/string.h>
+
+/* Small types. */
+
+/* Signed. */
+typedef signed char int_least8_t;
+typedef short int int_least16_t;
+typedef int int_least32_t;
+typedef long int int_least64_t;
+
+/* Unsigned. */
+typedef unsigned char uint_least8_t;
+typedef unsigned short int uint_least16_t;
+typedef unsigned int uint_least32_t;
+typedef unsigned long int uint_least64_t;
+
+/* Fast types. */
+
+/* Signed. */
+typedef signed char int_fast8_t;
+typedef long int int_fast16_t;
+typedef long int int_fast32_t;
+typedef long int int_fast64_t;
+
+/* Unsigned. */
+typedef unsigned char uint_fast8_t;
+typedef unsigned long int uint_fast16_t;
+typedef unsigned long int uint_fast32_t;
+typedef unsigned long int uint_fast64_t;
diff --git a/security/container/vsock.c b/security/container/vsock.c
new file mode 100644
index 0000000..8d1710239
--- /dev/null
+++ b/security/container/vsock.c
@@ -0,0 +1,559 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Container Security Monitor module
+ *
+ * Copyright (c) 2018 Google, Inc
+ */
+
+#include "monitor.h"
+
+#include <net/net_namespace.h>
+#include <net/vsock_addr.h>
+#include <net/sock.h>
+#include <linux/socket.h>
+#include <linux/workqueue.h>
+#include <linux/jiffies.h>
+#include <linux/mutex.h>
+#include <linux/version.h>
+#include <linux/kthread.h>
+#include <linux/printk.h>
+#include <linux/delay.h>
+#include <linux/timekeeping.h>
+
+/*
+ * virtio vsocket over which to send events to the host.
+ * NULL if monitoring is disabled, or if the socket was disconnected and we're
+ * trying to reconnect to the host.
+ */
+static struct socket *csm_vsocket;
+
+/* reconnect delay */
+#define CSM_RECONNECT_FREQ_MSEC 5000
+
+/* config pull delay */
+#define CSM_CONFIG_FREQ_MSEC 1000
+
+/* vsock receive attempts and delay until giving up */
+#define CSM_RECV_ATTEMPTS 2
+#define CSM_RECV_DELAY_MSEC 100
+
+/* heartbeat work */
+#define CSM_HEARTBEAT_FREQ msecs_to_jiffies(5000)
+static void csm_heartbeat(struct work_struct *work);
+static DECLARE_DELAYED_WORK(csm_heartbeat_work, csm_heartbeat);
+
+/* csm protobuf work */
+static void csm_sendmsg_pipe_handler(struct work_struct *work);
+
+/* csm message work container */
+struct msg_work_data {
+ struct work_struct msg_work;
+ size_t pos_bytes_written;
+ char msg[];
+};
+
+/* size used for the config error message. */
+#define CSM_ERROR_BUF_SIZE 40
+
+/* Running thread to manage vsock connections. */
+static struct task_struct *socket_thread;
+
+/* Mutex to ensure sequential dumping of protos */
+static DEFINE_MUTEX(protodump);
+
+static struct socket *csm_create_socket(void)
+{
+ int err;
+ struct sockaddr_vm host_addr;
+ struct socket *sock;
+
+ err = sock_create_kern(&init_net, AF_VSOCK, SOCK_STREAM, 0,
+ &sock);
+ if (err) {
+ pr_debug("error creating AF_VSOCK socket: %d\n", err);
+ return ERR_PTR(err);
+ }
+
+ vsock_addr_init(&host_addr, VMADDR_CID_HYPERVISOR, CSM_HOST_PORT);
+
+ err = kernel_connect(sock, (struct sockaddr *)&host_addr,
+ sizeof(host_addr), 0);
+ if (err) {
+ if (err != -ECONNRESET) {
+ pr_debug("error connecting AF_VSOCK socket to host port %u: %d\n",
+ CSM_HOST_PORT, err);
+ }
+ goto error_release;
+ }
+
+ return sock;
+
+error_release:
+ sock_release(sock);
+ return ERR_PTR(err);
+}
+
+static void csm_destroy_socket(void)
+{
+ down_write(&csm_rwsem_vsocket);
+ if (csm_vsocket) {
+ sock_release(csm_vsocket);
+ csm_vsocket = NULL;
+ }
+ up_write(&csm_rwsem_vsocket);
+}
+
+static int csm_vsock_sendmsg(struct kvec *vecs, size_t vecs_size,
+ size_t total_length)
+{
+ struct msghdr msg = { };
+ int res = -EPIPE;
+
+ if (!cmdline_boot_vsock_enabled)
+ return 0;
+
+ down_read(&csm_rwsem_vsocket);
+ if (csm_vsocket) {
+ res = kernel_sendmsg(csm_vsocket, &msg, vecs, vecs_size,
+ total_length);
+ if (res > 0)
+ res = 0;
+ }
+ up_read(&csm_rwsem_vsocket);
+
+ return res;
+}
+
+static ssize_t csm_user_pipe_write(struct kvec *vecs, size_t vecs_size,
+ size_t total_length)
+{
+ ssize_t perr = 0;
+ struct iov_iter io = { };
+ loff_t pos = 0;
+ struct pipe_inode_info *pipe;
+ unsigned int readers;
+
+ if (!csm_user_write_pipe)
+ return 0;
+
+ down_read(&csm_rwsem_pipe);
+
+ if (csm_user_write_pipe == NULL)
+ goto end;
+
+ /* The pipe info is the same for the reader and writer files. */
+ pipe = get_pipe_info(csm_user_write_pipe);
+
+ /* If nobody is listening, don't write events. */
+ readers = READ_ONCE(pipe->readers);
+ if (readers <= 1) {
+ WARN_ON(readers == 0);
+ goto end;
+ }
+
+ iov_iter_kvec(&io, ITER_KVEC|WRITE, vecs, vecs_size,
+ total_length);
+
+ file_start_write(csm_user_write_pipe);
+ perr = vfs_iter_write(csm_user_write_pipe, &io, &pos, 0);
+ file_end_write(csm_user_write_pipe);
+
+end:
+ up_read(&csm_rwsem_pipe);
+ return perr;
+}
+
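+/*
+ * Framing note (derived from the code below): every message starts with a
+ * struct csm_msg_hdr carrying msg_type and msg_length, both stored
+ * little-endian, where msg_length counts the header plus the payload. The
+ * receive path in listen_configuration() expects the same layout.
+ */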
+static int csm_sendmsg(int type, const void *buf, size_t len)
+{
+ struct csm_msg_hdr hdr = {
+ .msg_type = cpu_to_le32(type),
+ .msg_length = cpu_to_le32(sizeof(hdr) + len),
+ };
+ struct kvec vecs[] = {
+ {
+ .iov_base = &hdr,
+ .iov_len = sizeof(hdr),
+ }, {
+ .iov_base = (void *)buf,
+ .iov_len = len,
+ }
+ };
+ int res;
+ ssize_t perr;
+
+ res = csm_vsock_sendmsg(vecs, ARRAY_SIZE(vecs),
+ le32_to_cpu(hdr.msg_length));
+ if (res < 0) {
+ pr_warn_ratelimited("sendmsg error (msg_type=%d, msg_length=%u): %d\n",
+ type, le32_to_cpu(hdr.msg_length), res);
+ }
+
+ perr = csm_user_pipe_write(vecs, ARRAY_SIZE(vecs),
+ le32_to_cpu(hdr.msg_length));
+ if (perr < 0) {
+ pr_warn_ratelimited("vfs_iter_write error (msg_type=%d, msg_length=%u): %zd\n",
+ type, le32_to_cpu(hdr.msg_length), perr);
+ }
+
+ /* If one of them failed, increase the stats once. */
+ if (res < 0 || perr < 0)
+ csm_stats.event_writing_failed++;
+
+ return res;
+}
+
+static bool csm_get_expected_size(size_t *size, const pb_field_t fields[],
+ const void *src_struct)
+{
+ schema_Event *event;
+
+ if (fields != schema_Event_fields)
+ goto other;
+
+ /* Sizes at or above the 99th percentile observed across the 100 containers tested running k8s. */
+ event = (schema_Event *)src_struct;
+ switch (event->which_event) {
+ case schema_Event_execute_tag:
+ *size = 3344;
+ return true;
+ case schema_Event_memexec_tag:
+ *size = 176;
+ return true;
+ case schema_Event_clone_tag:
+ *size = 50;
+ return true;
+ case schema_Event_exit_tag:
+ *size = 30;
+ return true;
+ }
+
+other:
+ /* If unknown, do the pre-computation. */
+ return pb_get_encoded_size(size, fields, src_struct);
+}
+
+static struct msg_work_data *csm_encodeproto(size_t size,
+ const pb_field_t fields[],
+ const void *src_struct)
+{
+ pb_ostream_t pos;
+ struct msg_work_data *wd;
+ size_t total;
+
+ total = size + sizeof(*wd);
+ if (total < size)
+ return ERR_PTR(-EINVAL);
+
+ wd = kmalloc(total, GFP_KERNEL);
+ if (!wd)
+ return ERR_PTR(-ENOMEM);
+
+ pos = pb_ostream_from_buffer(wd->msg, size);
+ if (!pb_encode(&pos, fields, src_struct)) {
+ kfree(wd);
+ return ERR_PTR(-EINVAL);
+ }
+
+ INIT_WORK(&wd->msg_work, csm_sendmsg_pipe_handler);
+ wd->pos_bytes_written = pos.bytes_written;
+ return wd;
+}
+
+static int csm_sendproto(int type, const pb_field_t fields[],
+ const void *src_struct)
+{
+ int err = 0;
+ size_t size, previous_size;
+ struct msg_work_data *wd;
+
+ /* Use the expected size first. */
+ if (!csm_get_expected_size(&size, fields, src_struct))
+ return -EINVAL;
+
+ wd = csm_encodeproto(size, fields, src_struct);
+ if (unlikely(IS_ERR(wd))) {
+ /* If it failed, retry with the exact size. */
+ csm_stats.size_picking_failed++;
+ previous_size = size;
+
+ if (!pb_get_encoded_size(&size, fields, src_struct))
+ return -EINVAL;
+
+ wd = csm_encodeproto(size, fields, src_struct);
+ if (IS_ERR(wd)) {
+ csm_stats.proto_encoding_failed++;
+ return PTR_ERR(wd);
+ }
+
+ pr_debug("size picking failed %lu vs %lu\n", previous_size,
+ size);
+ }
+
+ /* The work handler takes care of cleanup, if successfully scheduled. */
+ if (likely(schedule_work(&wd->msg_work)))
+ return 0;
+
+ csm_stats.workqueue_failed++;
+ pr_err_ratelimited("Sent msg to workqueue unsuccessfully (assume dropped).\n");
+
+ kfree(wd);
+ return err;
+}
+
+static void csm_sendmsg_pipe_handler(struct work_struct *work)
+{
+ int err;
+ int type = CSM_MSG_EVENT_PROTO;
+ struct msg_work_data *wd = container_of(work, struct msg_work_data,
+ msg_work);
+
+ err = csm_sendmsg(type, wd->msg, wd->pos_bytes_written);
+ if (err)
+ pr_err_ratelimited("csm_sendmsg failed in work handler %s\n",
+ __func__);
+
+ kfree(wd);
+}
+
+int csm_sendeventproto(const pb_field_t fields[], schema_Event *event)
+{
+ /* Last check before generating and sending an event. */
+ if (!csm_enabled)
+ return -ENOTSUPP;
+
+ event->timestamp = ktime_get_real_ns();
+
+ return csm_sendproto(CSM_MSG_EVENT_PROTO, fields, event);
+}
+
+int csm_sendconfigrespproto(const pb_field_t fields[],
+ schema_ConfigurationResponse *resp)
+{
+ return csm_sendproto(CSM_MSG_CONFIG_RESPONSE_PROTO, fields, resp);
+}
+
+static void csm_heartbeat(struct work_struct *work)
+{
+ csm_sendmsg(CSM_MSG_TYPE_HEARTBEAT, NULL, 0);
+ schedule_delayed_work(&csm_heartbeat_work, CSM_HEARTBEAT_FREQ);
+}
+
+static int config_send_response(int err)
+{
+ char buf[CSM_ERROR_BUF_SIZE] = {};
+ schema_ConfigurationResponse resp =
+ schema_ConfigurationResponse_init_zero;
+
+ resp.error = schema_ConfigurationResponse_ErrorCode_NO_ERROR;
+ resp.version = CSM_VERSION;
+ resp.kernel_version = LINUX_VERSION_CODE;
+
+ if (err) {
+ resp.error = schema_ConfigurationResponse_ErrorCode_UNKNOWN;
+ snprintf(buf, sizeof(buf) - 1, "error code: %d", err);
+ resp.msg.funcs.encode = pb_encode_string_field;
+ resp.msg.arg = buf;
+ }
+
+ return csm_sendconfigrespproto(schema_ConfigurationResponse_fields,
+ &resp);
+}
+
+static int csm_recvmsg(void *buf, size_t len, bool expected)
+{
+ int err = 0;
+ struct msghdr msg = {};
+ struct kvec vecs;
+ size_t pos = 0;
+ size_t attempts = 0;
+
+ while (pos < len) {
+ vecs.iov_base = (char *)buf + pos;
+ vecs.iov_len = len - pos;
+
+ down_read(&csm_rwsem_vsocket);
+ if (csm_vsocket) {
+ err = kernel_recvmsg(csm_vsocket, &msg, &vecs, 1, len,
+ MSG_DONTWAIT);
+ } else {
+ pr_err("csm_vsocket was unset while the config thread was running\n");
+ err = -ENOENT;
+ }
+ up_read(&csm_rwsem_vsocket);
+
+ if (err == 0) {
+ err = -ENOTCONN;
+ pr_warn_ratelimited("vsock connection was reset\n");
+ break;
+ }
+
+ if (err == -EAGAIN) {
+ /*
+ * If nothing is received and nothing was expected
+ * just bail.
+ */
+ if (!expected && pos == 0) {
+ err = -EAGAIN;
+ break;
+ }
+
+ /*
+ * If we are still missing data after multiple attempts,
+ * reset the connection.
+ */
+ if (++attempts >= CSM_RECV_ATTEMPTS) {
+ err = -EPIPE;
+ break;
+ }
+
+ msleep(CSM_RECV_DELAY_MSEC);
+ continue;
+ }
+
+ if (err < 0) {
+ pr_err_ratelimited("kernel_recvmsg failed with %d\n",
+ err);
+ break;
+ }
+
+ pos += err;
+ }
+
+ return err;
+}
+
+/*
+ * Listen for configuration messages until the connection is closed or
+ * desynchronizes. If something goes wrong while parsing the packet buffer
+ * in a way that may desynchronize the thread from the backend, the
+ * connection is reset.
+ */
+
+static void listen_configuration(void *buf)
+{
+ int err;
+ struct csm_msg_hdr hdr = {};
+ uint32_t msg_type, msg_length;
+
+ pr_debug("listening for configuration messages\n");
+
+ while (true) {
+ err = csm_recvmsg(&hdr, sizeof(hdr), false);
+
+ /* Nothing available, wait and try again. */
+ if (err == -EAGAIN) {
+ msleep(CSM_CONFIG_FREQ_MSEC);
+ continue;
+ }
+
+ if (err < 0)
+ break;
+
+ msg_type = le32_to_cpu(hdr.msg_type);
+
+ if (msg_type != CSM_MSG_CONFIG_REQUEST_PROTO) {
+ pr_warn_ratelimited("unexpected message type: %d\n",
+ msg_type);
+ break;
+ }
+
+ msg_length = le32_to_cpu(hdr.msg_length);
+
+ if (msg_length <= sizeof(hdr) || msg_length > PAGE_SIZE) {
+ pr_warn_ratelimited("unexpected message length: %d\n",
+ msg_length);
+ break;
+ }
+
+ /* The message length includes the size of the header. */
+ msg_length -= sizeof(hdr);
+
+ err = csm_recvmsg(buf, msg_length, true);
+ if (err < 0) {
+ pr_warn_ratelimited("failed to gather configuration: %d\n",
+ err);
+ break;
+ }
+
+ err = csm_update_config_from_buffer(buf, msg_length);
+ if (err < 0) {
+ /*
+ * Warn of the error but continue listening for
+ * configuration changes.
+ */
+ pr_warn_ratelimited("config update failed: %d\n", err);
+ } else {
+ pr_debug("config received and applied\n");
+ }
+
+ err = config_send_response(err);
+ if (err < 0) {
+ pr_err_ratelimited("config response failed: %d\n", err);
+ break;
+ }
+
+ pr_debug("config response sent\n");
+ }
+}
+
+/* Thread managing connection and listening for new configurations. */
+static int socket_thread_fn(void *unused)
+{
+ void *buf;
+ struct socket *sock;
+
+ /* One page should be enough for current configurations. */
+ buf = (void *)__get_free_page(GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ while (true) {
+ sock = csm_create_socket();
+ if (IS_ERR(sock)) {
+ pr_debug("unable to connect to host (port %u), will retry in %u ms\n",
+ CSM_HOST_PORT, CSM_RECONNECT_FREQ_MSEC);
+ msleep(CSM_RECONNECT_FREQ_MSEC);
+ continue;
+ }
+
+ down_write(&csm_rwsem_vsocket);
+ csm_vsocket = sock;
+ up_write(&csm_rwsem_vsocket);
+
+ schedule_delayed_work(&csm_heartbeat_work, 0);
+
+ listen_configuration(buf);
+
+ pr_warn("vsock state incorrect, disconnecting. Messages will be lost.\n");
+
+ cancel_delayed_work_sync(&csm_heartbeat_work);
+ csm_destroy_socket();
+ }
+
+ return 0;
+}
+
+void __init vsock_destroy(void)
+{
+ if (socket_thread) {
+ kthread_stop(socket_thread);
+ socket_thread = NULL;
+ }
+}
+
+int __init vsock_initialize(void)
+{
+ struct task_struct *task;
+
+ if (cmdline_boot_vsock_enabled) {
+ task = kthread_run(socket_thread_fn, NULL, "csm-vsock-thread");
+ if (IS_ERR(task)) {
+ pr_err("failed to create socket thread: %ld\n", PTR_ERR(task));
+ vsock_destroy();
+ return PTR_ERR(task);
+ }
+
+ socket_thread = task;
+ }
+ return 0;
+}
diff --git a/security/loadpin/loadpin.c b/security/loadpin/loadpin.c
index 0716af2..379fcb8 100644
--- a/security/loadpin/loadpin.c
+++ b/security/loadpin/loadpin.c
@@ -45,6 +45,8 @@
}
static int enabled = IS_ENABLED(CONFIG_SECURITY_LOADPIN_ENABLED);
+static char *exclude_read_files[READING_MAX_ID];
+static int ignore_read_file_id[READING_MAX_ID] __ro_after_init;
static struct super_block *pinned_root;
static DEFINE_SPINLOCK(pinned_root_spinlock);
@@ -126,6 +128,13 @@
struct super_block *load_root;
const char *origin = kernel_read_file_id_str(id);
+ /* If the file id is excluded, ignore the pinning. */
+ if ((unsigned int)id < ARRAY_SIZE(ignore_read_file_id) &&
+ ignore_read_file_id[id]) {
+ report_load(origin, file, "pinning-excluded");
+ return 0;
+ }
+
/* This handles the older init_module API that has a NULL file. */
if (!file) {
if (!enabled) {
@@ -184,12 +193,51 @@
LSM_HOOK_INIT(kernel_load_data, loadpin_load_data),
};
+static void __init parse_exclude(void)
+{
+ int i, j;
+ char *cur;
+
+ /*
+ * Make sure all the arrays stay within expected sizes. This
+ * is slightly weird because kernel_read_file_str[] includes
+ * READING_MAX_ID, which isn't actually meaningful here.
+ */
+ BUILD_BUG_ON(ARRAY_SIZE(exclude_read_files) !=
+ ARRAY_SIZE(ignore_read_file_id));
+ BUILD_BUG_ON(ARRAY_SIZE(kernel_read_file_str) <
+ ARRAY_SIZE(ignore_read_file_id));
+
+ for (i = 0; i < ARRAY_SIZE(exclude_read_files); i++) {
+ cur = exclude_read_files[i];
+ if (!cur)
+ break;
+ if (*cur == '\0')
+ continue;
+
+ for (j = 0; j < ARRAY_SIZE(ignore_read_file_id); j++) {
+ if (strcmp(cur, kernel_read_file_str[j]) == 0) {
+ pr_info("excluding: %s\n",
+ kernel_read_file_str[j]);
+ ignore_read_file_id[j] = 1;
+ /*
+ * Cannot break, because one read_file_str
+ * may map to more than one read_file_id.
+ */
+ }
+ }
+ }
+}
+
void __init loadpin_add_hooks(void)
{
pr_info("ready to pin (currently %sabled)", enabled ? "en" : "dis");
+ parse_exclude();
security_add_hooks(loadpin_hooks, ARRAY_SIZE(loadpin_hooks), "loadpin");
}
/* Should not be mutable after boot, so not listed in sysfs (perm == 0). */
module_param(enabled, int, 0);
MODULE_PARM_DESC(enabled, "Pin module/firmware loading (default: true)");
+module_param_array_named(exclude, exclude_read_files, charp, NULL, 0);
+MODULE_PARM_DESC(exclude, "Exclude pinning specific read file types");
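+
+/*
+ * Example (hypothetical boot parameter, names taken from
+ * kernel_read_file_str[]): loadpin.exclude=kernel-module,firmware
+ * disables pinning for module and firmware loads.
+ */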
diff --git a/security/security.c b/security/security.c
index 9478444..82dcca2 100644
--- a/security/security.c
+++ b/security/security.c
@@ -885,6 +885,11 @@
call_void_hook(file_free_security, file);
}
+void security_file_pre_free(struct file *file)
+{
+ call_void_hook(file_pre_free_security, file);
+}
+
int security_file_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
return call_int_hook(file_ioctl, 0, file, cmd, arg);
@@ -987,6 +992,11 @@
return call_int_hook(task_alloc, 0, task, clone_flags);
}
+void security_task_post_alloc(struct task_struct *task)
+{
+ call_void_hook(task_post_alloc, task);
+}
+
void security_task_free(struct task_struct *task)
{
call_void_hook(task_free, task);
@@ -1156,6 +1166,11 @@
return call_int_hook(task_kill, 0, p, info, sig, cred);
}
+void security_task_exit(struct task_struct *p)
+{
+ call_void_hook(task_exit, p);
+}
+
int security_task_prctl(int option, unsigned long arg2, unsigned long arg3,
unsigned long arg4, unsigned long arg5)
{
diff --git a/tools/arch/x86/include/asm/rmwcc.h b/tools/arch/x86/include/asm/rmwcc.h
index fee7983..dc90c0c 100644
--- a/tools/arch/x86/include/asm/rmwcc.h
+++ b/tools/arch/x86/include/asm/rmwcc.h
@@ -2,7 +2,7 @@
#ifndef _TOOLS_LINUX_ASM_X86_RMWcc
#define _TOOLS_LINUX_ASM_X86_RMWcc
-#ifdef CONFIG_CC_HAS_ASM_GOTO
+#ifdef CC_HAVE_ASM_GOTO
#define __GEN_RMWcc(fullop, var, cc, ...) \
do { \
@@ -20,7 +20,7 @@
#define GEN_BINARY_RMWcc(op, var, vcon, val, arg0, cc) \
__GEN_RMWcc(op " %1, " arg0, var, cc, vcon (val))
-#else /* !CONFIG_CC_HAS_ASM_GOTO */
+#else /* !CC_HAVE_ASM_GOTO */
#define __GEN_RMWcc(fullop, var, cc, ...) \
do { \
@@ -37,6 +37,6 @@
#define GEN_BINARY_RMWcc(op, var, vcon, val, arg0, cc) \
__GEN_RMWcc(op " %2, " arg0, var, cc, vcon (val))
-#endif /* CONFIG_CC_HAS_ASM_GOTO */
+#endif /* CC_HAVE_ASM_GOTO */
#endif /* _TOOLS_LINUX_ASM_X86_RMWcc */
diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c
index 334b16d..4ee1bf6 100644
--- a/tools/vm/slabinfo.c
+++ b/tools/vm/slabinfo.c
@@ -29,7 +29,7 @@
char *name;
int alias;
int refs;
- int aliases, align, cache_dma, cpu_slabs, destroy_by_rcu;
+ int aliases, align, cache_dma, cache_dma32, cpu_slabs, destroy_by_rcu;
unsigned int hwcache_align, object_size, objs_per_slab;
unsigned int sanity_checks, slab_size, store_user, trace;
int order, poison, reclaim_account, red_zone;
@@ -531,6 +531,8 @@
printf("** Hardware cacheline aligned\n");
if (s->cache_dma)
printf("** Memory is allocated in a special DMA zone\n");
+ if (s->cache_dma32)
+ printf("** Memory is allocated in a special DMA32 zone\n");
if (s->destroy_by_rcu)
printf("** Slabs are destroyed via RCU\n");
if (s->reclaim_account)
@@ -599,6 +601,8 @@
*p++ = '*';
if (s->cache_dma)
*p++ = 'd';
+ if (s->cache_dma32)
+ *p++ = 'D';
if (s->hwcache_align)
*p++ = 'A';
if (s->poison)
@@ -1205,6 +1209,7 @@
slab->aliases = get_obj("aliases");
slab->align = get_obj("align");
slab->cache_dma = get_obj("cache_dma");
+ slab->cache_dma32 = get_obj("cache_dma32");
slab->cpu_slabs = get_obj("cpu_slabs");
slab->destroy_by_rcu = get_obj("destroy_by_rcu");
slab->hwcache_align = get_obj("hwcache_align");