merge-upstream/v4.19.127 from branch/tag: upstream/v4.19.127 into branch: cos-4.19
Changelog:
-------------------------------------------------------------
Aneesh Kumar K.V (1):
libnvdimm: Fix endian conversion issues
Anju T Sudhakar (1):
powerpc/powernv: Avoid re-registration of imc debugfs directory
Atsushi Nemoto (1):
i2c: altera: Fix race between xfer_msg and isr thread
Can Guo (1):
scsi: ufs: Release clock if DMA map fails
Chaitanya Kulkarni (1):
null_blk: return error for invalid zone size
DENG Qingfang (1):
net: dsa: mt7530: set CPU port to fallback mode
Dan Carpenter (1):
airo: Fix read overflows sending packets
Daniel Axtens (1):
kernel/relay.c: handle alloc_percpu returning NULL in relay_open
Dinghao Liu (1):
net: smsc911x: Fix runtime PM imbalance on error
Eugeniy Paltsev (1):
ARC: Fix ICCM & DCCM runtime size checks
Fan Yang (1):
mm: Fix mremap not considering huge pmd devmap
Gerald Schaefer (1):
s390/mm: fix set_huge_pte_at() for empty ptes
Giuseppe Marco Randazzo (1):
p54usb: add AirVasT USB stick device-id
Greg Kroah-Hartman (1):
Linux 4.19.127
Jan Schmidt (1):
drm/edid: Add Oculus Rift S to non-desktop list
Jeremy Kerr (1):
net: bmac: Fix read of MAC address from ROM
Jonathan McDowell (1):
net: ethernet: stmmac: Enable interface clocks on probe for IPQ806x
Julian Sax (1):
HID: i2c-hid: add Schneider SCL142ALM to descriptor override
Jérôme Pouiller (1):
mmc: fix compilation of user API
Lucas De Marchi (1):
drm/i915: fix port checks for MST support on gen >= 11
Madhuparna Bhowmik (1):
evm: Fix RCU list related warnings
Nathan Chancellor (1):
x86/mmiotrace: Use cpumask_available() for cpumask_var_t variables
Scott Shumate (1):
HID: sony: Fix for broken buttons on DS3 USB dongles
Tejun Heo (1):
Revert "cgroup: Add memory barriers to plug cgroup_rstat_updated() race window"
Valentin Longchamp (1):
net/ethernet/freescale: rework quiesce/activate for ucc_geth
Vasily Gorbik (1):
s390/ftrace: save traced function caller
Vineet Gupta (1):
ARC: [plat-eznps]: Restrict to CONFIG_ISA_ARCOMPACT
Xiang Chen (1):
scsi: hisi_sas: Check sas_port before using it
Xinwei Kong (1):
spi: dw: use "smp_mb()" to avoid sending spi data error
BUG=b/158444866
TEST=tryjob, validation and K8s e2e
RELEASE_NOTE=Upgraded the Linux kernel to upstream/v4.19.127
Signed-off-by: Lakitu Kernel Bot <cloud-image-merge-automation@prod.google.com>
Change-Id: I582afd906bcb6aa0b28e4e51c85ec20ac3317485
diff --git a/.gitignore b/.gitignore
index 97ba6b7..98e745c 100644
--- a/.gitignore
+++ b/.gitignore
@@ -94,6 +94,9 @@
include/ksym
arch/*/include/generated
+# kernelconfig build directory
+/build/
+
# stgit generated dirs
patches-*
diff --git a/Documentation/ABI/testing/sysfs-kernel-slab b/Documentation/ABI/testing/sysfs-kernel-slab
index 29601d9..d742c6c 100644
--- a/Documentation/ABI/testing/sysfs-kernel-slab
+++ b/Documentation/ABI/testing/sysfs-kernel-slab
@@ -106,6 +106,15 @@
are from ZONE_DMA.
Available when CONFIG_ZONE_DMA is enabled.
+What: /sys/kernel/slab/cache/cache_dma32
+Date: December 2018
+KernelVersion: 4.21
+Contact: Nicolas Boichat <drinkcat@chromium.org>
+Description:
+ The cache_dma32 file is read-only and specifies whether objects
+ are from ZONE_DMA32.
+ Available when CONFIG_ZONE_DMA32 is enabled.
+
What: /sys/kernel/slab/cache/cpu_slabs
Date: May 2007
KernelVersion: 2.6.22
diff --git a/Documentation/ABI/testing/sysfs-kernel-wakeup_reasons b/Documentation/ABI/testing/sysfs-kernel-wakeup_reasons
new file mode 100644
index 0000000..acb19b9
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-wakeup_reasons
@@ -0,0 +1,16 @@
+What: /sys/kernel/wakeup_reasons/last_resume_reason
+Date: February 2014
+Contact: Ruchi Kandoi <kandoiruchi@google.com>
+Description:
+ The /sys/kernel/wakeup_reasons/last_resume_reason is
+		used to report wakeup reasons after the system has exited suspend.
+
+What: /sys/kernel/wakeup_reasons/last_suspend_time
+Date: March 2015
+Contact: jinqian <jinqian@google.com>
+Description:
+ The /sys/kernel/wakeup_reasons/last_suspend_time is
+		used to report the time spent in the last suspend cycle. It
+		contains two numbers (in seconds) separated by a space. The
+		first number is the time spent in the suspend and resume
+		processes. The second number is the time spent in the sleep state.
\ No newline at end of file
diff --git a/Documentation/admin-guide/LSM/LoadPin.rst b/Documentation/admin-guide/LSM/LoadPin.rst
index 3207076..716ad9b 100644
--- a/Documentation/admin-guide/LSM/LoadPin.rst
+++ b/Documentation/admin-guide/LSM/LoadPin.rst
@@ -19,3 +19,13 @@
created to toggle pinning: ``/proc/sys/kernel/loadpin/enabled``. (Having
a mutable filesystem means pinning is mutable too, but having the
sysctl allows for easy testing on systems with a mutable filesystem.)
+
+It's also possible to exclude specific file types from LoadPin using the
+kernel command line option "``loadpin.exclude``". By default, all file types
+are included, but specific types can be excluded with a kernel command line
+option such as "``loadpin.exclude=kernel-module,kexec-image``". This allows
+mechanisms such as ``CONFIG_MODULE_SIG`` and ``CONFIG_KEXEC_VERIFY_SIG`` to
+verify kernel modules and kernel images while LoadPin still protects the
+integrity of the other files the kernel loads. The full list of valid file
+types can be found in ``kernel_read_file_str`` defined in ``include/linux/fs.h``.
diff --git a/Documentation/block/bfq-iosched.txt b/Documentation/block/bfq-iosched.txt
index 8d8d8f0..1a0f2ac0 100644
--- a/Documentation/block/bfq-iosched.txt
+++ b/Documentation/block/bfq-iosched.txt
@@ -20,13 +20,26 @@
details on how to configure BFQ for the desired tradeoff between
latency and throughput, or on how to maximize throughput.
-BFQ has a non-null overhead, which limits the maximum IOPS that a CPU
-can process for a device scheduled with BFQ. To give an idea of the
-limits on slow or average CPUs, here are, first, the limits of BFQ for
-three different CPUs, on, respectively, an average laptop, an old
-desktop, and a cheap embedded system, in case full hierarchical
-support is enabled (i.e., CONFIG_BFQ_GROUP_IOSCHED is set), but
-CONFIG_DEBUG_BLK_CGROUP is not set (Section 4-2):
+Like every I/O scheduler, BFQ adds some overhead to per-I/O-request
+processing. To give an idea of this overhead, the total,
+single-lock-protected, per-request processing time of BFQ---i.e., the
+sum of the execution times of the request insertion, dispatch and
+completion hooks---is, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
+(dated CPU for notebooks; time measured with simple code
+instrumentation, and using the throughput-sync.sh script of the S
+suite [1], in performance-profiling mode). To put this result into
+context, the total, single-lock-protected, per-request execution time
+of the lightest I/O scheduler available in blk-mq, mq-deadline, is 0.7
+us (mq-deadline is ~800 LOC, against ~10500 LOC for BFQ).
+
+Scheduling overhead further limits the maximum IOPS that a CPU can
+process (already limited by the execution of the rest of the I/O
+stack). To give an idea of the limits with BFQ, on slow or average
+CPUs, here are, first, the limits of BFQ for three different CPUs, on,
+respectively, an average laptop, an old desktop, and a cheap embedded
+system, in case full hierarchical support is enabled (i.e.,
+CONFIG_BFQ_GROUP_IOSCHED is set), but CONFIG_DEBUG_BLK_CGROUP is not
+set (Section 4-2):
- Intel i7-4850HQ: 400 KIOPS
- AMD A8-3850: 250 KIOPS
- ARM CortexTM-A53 Octa-core: 80 KIOPS
@@ -357,6 +370,13 @@
than maximum throughput. In these cases, consider setting the
strict_guarantees parameter.
+slice_idle_us
+-------------
+
+Controls the same tuning parameter as slice_idle, but in microseconds.
+Either tunable can be used to set idling behavior. Afterwards, the
+other tunable will reflect the newly set value in sysfs.
+
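+As a minimal usage sketch (assuming bfq is the active scheduler for /dev/sda),
+idling can be disabled entirely with:
+
+	echo 0 > /sys/block/sda/queue/iosched/slice_idle_us
+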
strict_guarantees
-----------------
@@ -559,3 +579,5 @@
Slightly extended version:
http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
results.pdf
+
+[3] https://github.com/Algodev-github/S
diff --git a/Documentation/dev-tools/gcov.rst b/Documentation/dev-tools/gcov.rst
index 69a7d90..46aae52 100644
--- a/Documentation/dev-tools/gcov.rst
+++ b/Documentation/dev-tools/gcov.rst
@@ -34,10 +34,6 @@
CONFIG_DEBUG_FS=y
CONFIG_GCOV_KERNEL=y
-select the gcc's gcov format, default is autodetect based on gcc version::
-
- CONFIG_GCOV_FORMAT_AUTODETECT=y
-
and to get coverage data for the entire kernel::
CONFIG_GCOV_PROFILE_ALL=y
@@ -169,6 +165,20 @@
[user@build] gcov -o /tmp/coverage/tmp/out/init main.c
+Note on compilers
+-----------------
+
+GCC and LLVM gcov tools are not necessarily compatible. Use gcov_ to work with
+GCC-generated .gcno and .gcda files, and use llvm-cov_ for Clang.
+
+.. _gcov: http://gcc.gnu.org/onlinedocs/gcc/Gcov.html
+.. _llvm-cov: https://llvm.org/docs/CommandGuide/llvm-cov.html
+
+Build differences between GCC and Clang gcov are handled by Kconfig. It
+automatically selects the appropriate gcov format depending on the detected
+toolchain.
+
+
Troubleshooting
---------------
diff --git a/Documentation/device-mapper/dm-init.txt b/Documentation/device-mapper/dm-init.txt
new file mode 100644
index 0000000..8464ee7
--- /dev/null
+++ b/Documentation/device-mapper/dm-init.txt
@@ -0,0 +1,114 @@
+Early creation of mapped devices
+====================================
+
+It is possible to configure a device-mapper device to act as the root device for
+your system in two ways.
+
+The first is to build an initial ramdisk which boots to a minimal userspace
+which configures the device, then pivot_root(8) in to it.
+
+The second is to create one or more device-mapper devices using the module
+parameter "dm-mod.create=" on the kernel boot command line.
+
+The format is specified as a string of data separated by commas and optionally
+semi-colons, where:
+ - a comma is used to separate fields like name, uuid, flags and table
+ (specifies one device)
+ - a semi-colon is used to separate devices.
+
+So the format will look like this:
+
+ dm-mod.create=<name>,<uuid>,<minor>,<flags>,<table>[,<table>+][;<name>,<uuid>,<minor>,<flags>,<table>[,<table>+]+]
+
+Where,
+ <name> ::= The device name.
+ <uuid> ::= xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx | ""
+ <minor> ::= The device minor number | ""
+ <flags> ::= "ro" | "rw"
+ <table> ::= <start_sector> <num_sectors> <target_type> <target_args>
+ <target_type> ::= "verity" | "linear" | ... (see list below)
+
+The dm line should be equivalent to the one used by the dmsetup tool with the
+--concise argument.
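+
+For instance, the first example further below could equivalently be created at
+runtime with (assuming an lvm2 dmsetup recent enough to support --concise):
+
+  dmsetup create --concise "lroot,,,rw, 0 4096 linear 98:16 0, 4096 4096 linear 98:32 0"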
+
+Target types
+============
+
+Not all target types are available as there are serious risks in allowing
+activation of certain DM targets without first using userspace tools to check
+the validity of associated metadata.
+
+ "cache": constrained, userspace should verify cache device
+ "crypt": allowed
+ "delay": allowed
+ "era": constrained, userspace should verify metadata device
+ "flakey": constrained, meant for test
+ "linear": allowed
+ "log-writes": constrained, userspace should verify metadata device
+ "mirror": constrained, userspace should verify main/mirror device
+ "raid": constrained, userspace should verify metadata device
+ "snapshot": constrained, userspace should verify src/dst device
+ "snapshot-origin": allowed
+ "snapshot-merge": constrained, userspace should verify src/dst device
+ "striped": allowed
+ "switch": constrained, userspace should verify dev path
+ "thin": constrained, requires dm target message from userspace
+ "thin-pool": constrained, requires dm target message from userspace
+ "verity": allowed
+ "writecache": constrained, userspace should verify cache device
+ "zero": constrained, not meant for rootfs
+
+If the target is not listed above, it is constrained by default (not tested).
+
+Examples
+========
+An example of booting to a linear array made up of user-mode linux block
+devices:
+
+ dm-mod.create="lroot,,,rw, 0 4096 linear 98:16 0, 4096 4096 linear 98:32 0" root=/dev/dm-0
+
+This will boot to a rw dm-linear target of 8192 sectors split across two block
+devices identified by their major:minor numbers. After boot, udev will rename
+this target to /dev/mapper/lroot (depending on the rules). No uuid was assigned.
+
+An example of multiple device-mapper devices, with the dm-mod.create="..."
+contents shown here split across multiple lines for readability:
+
+ vroot,,,ro,
+ 0 1740800 verity 254:0 254:0 1740800 sha1
+ 76e9be054b15884a9fa85973e9cb274c93afadb6
+ 5b3549d54d6c7a3837b9b81ed72e49463a64c03680c47835bef94d768e5646fe;
+ vram,,,rw,
+ 0 32768 linear 1:0 0,
+ 32768 32768 linear 1:1 0
+
+Other examples (per target):
+
+"crypt":
+ dm-crypt,,8,ro,
+ 0 1048576 crypt aes-xts-plain64
+ babebabebabebabebabebabebabebabebabebabebabebabebabebabebabebabe 0
+ /dev/sda 0 1 allow_discards
+
+"delay":
+ dm-delay,,4,ro,0 409600 delay /dev/sda1 0 500
+
+"linear":
+ dm-linear,,,rw,
+ 0 32768 linear /dev/sda1 0,
+ 32768 1024000 linear /dev/sda2 0,
+ 1056768 204800 linear /dev/sda3 0,
+ 1261568 512000 linear /dev/sda4 0
+
+"snapshot-origin":
+ dm-snap-orig,,4,ro,0 409600 snapshot-origin 8:2
+
+"striped":
+ dm-striped,,4,ro,0 1638400 striped 4 4096
+ /dev/sda1 0 /dev/sda2 0 /dev/sda3 0 /dev/sda4 0
+
+"verity":
+ dm-verity,,4,ro,
+ 0 1638400 verity 1 8:1 8:2 4096 4096 204800 1 sha256
+ fb1a5a0f00deb908d8b53cb270858975e76cf64105d412ce764225d53b8f3cfd
+ 51934789604d1b92399c52e7cb149d1b3a1b74bbbcb103b2a0aaacbed5c08584
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 0d0ecc7..94ce383 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -398,6 +398,8 @@
[stack] = the stack of the main process
[vdso] = the "virtual dynamic shared object",
the kernel system call handler
+ [anon:<name>] = an anonymous mapping that has been
+ named by userspace
or if empty, the mapping is anonymous.
@@ -427,6 +429,7 @@
Locked: 0 kB
THPeligible: 0
VmFlags: rd ex mr mw me dw
+Name: name from userspace
the first of these lines shows the same information as is displayed for the
mapping in /proc/PID/maps. The remaining lines show the size of the mapping
@@ -503,6 +506,9 @@
might change in future as well. So each consumer of these flags has to
follow each specific kernel version for the exact semantic.
+The "Name" field will only be present on a mapping that has been named by
+userspace, and will show the name passed in by userspace.
+
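+In the Android patchset that introduces this field, the name is set from
+userspace via prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ...). A minimal sketch
+(the fallback constant values below are assumptions taken from that patchset,
+used only if <sys/prctl.h> does not provide them):
+
+  #include <sys/mman.h>
+  #include <sys/prctl.h>
+
+  #ifndef PR_SET_VMA
+  #define PR_SET_VMA           0x53564d41
+  #define PR_SET_VMA_ANON_NAME 0
+  #endif
+
+  int main(void)
+  {
+          size_t len = 1 << 20;
+          void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
+                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+
+          if (p == MAP_FAILED)
+                  return 1;
+          /* the region now appears as [anon:myheap] in /proc/self/maps */
+          prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, (unsigned long)p, len, "myheap");
+          return 0;
+  }
+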
This file is only present if the CONFIG_MMU kernel configuration option is
enabled.
diff --git a/Documentation/lzo.txt b/Documentation/lzo.txt
index 6fa6a93..f799342 100644
--- a/Documentation/lzo.txt
+++ b/Documentation/lzo.txt
@@ -78,16 +78,34 @@
is an implementation design choice independent on the algorithm or
encoding.
+Versions
+========
+
+0: Original version
+1: LZO-RLE
+
+Version 1 of LZO implements an extension to encode runs of zeros using run
+length encoding. This improves speed for data with many zeros, which is a
+common case for zram. This modifies the bitstream in a backwards compatible way
+(v1 can correctly decompress v0 compressed data, but v0 cannot read v1 data).
+
+For maximum compatibility, both versions are available under different names
+(lzo and lzo-rle). Differences in the encoding are noted in this document with
+e.g.: version 1 only.
+
Byte sequences
==============
First byte encoding::
- 0..17 : follow regular instruction encoding, see below. It is worth
- noting that codes 16 and 17 will represent a block copy from
- the dictionary which is empty, and that they will always be
+ 0..16 : follow regular instruction encoding, see below. It is worth
+ noting that code 16 will represent a block copy from the
+ dictionary which is empty, and that it will always be
invalid at this place.
+ 17 : bitstream version. If the first byte is 17, the next byte
+ gives the bitstream version (version 1 only). If the first byte
+ is not 17, the bitstream version is 0.
+
18..21 : copy 0..3 literals
state = (byte - 17) = 0..3 [ copy <state> literals ]
skip byte
@@ -140,6 +158,11 @@
state = S (copy S literals after this block)
End of stream is reached if distance == 16384
+ In version 1 only, this instruction is also used to encode a run of
+ zeros if distance = 0xbfff, i.e. H = 1 and the D bits are all 1.
+ In this case, it is followed by a fourth byte, X.
+ run length = ((X << 3) | (0 0 0 0 0 L L L)) + 4.
+
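+  Expressed in C (with X being that fourth byte and L the low three
+  bits of the first instruction byte), this reads as:
+
+        run_length = ((X << 3) | (L & 0x07)) + 4;
+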
0 0 1 L L L L L (32..63)
Copy of small block within 16kB distance (preferably less than 34B)
length = 2 + (L ?: 31 + (zero_bytes * 255) + non_zero_byte)
@@ -165,7 +188,9 @@
=======
This document was written by Willy Tarreau <w@1wt.eu> on 2014/07/19 during an
- analysis of the decompression code available in Linux 3.16-rc5. The code is
- tricky, it is possible that this document contains mistakes or that a few
- corner cases were overlooked. In any case, please report any doubt, fix, or
- proposed updates to the author(s) so that the document can be updated.
+ analysis of the decompression code available in Linux 3.16-rc5, and updated
+ by Dave Rodgman <dave.rodgman@arm.com> on 2018/10/30 to introduce run-length
+ encoding. The code is tricky, it is possible that this document contains
+ mistakes or that a few corner cases were overlooked. In any case, please
+ report any doubt, fix, or proposed updates to the author(s) so that the
+ document can be updated.
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 7eb9366..7294284 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -639,6 +639,16 @@
0 to disable the blackhole detection.
By default, it is set to 1hr.
+tcp_fwmark_accept - BOOLEAN
+ If set, incoming connections to listening sockets that do not have a
+ socket mark will set the mark of the accepting socket to the fwmark of
+ the incoming SYN packet. This will cause all packets on that connection
+ (starting from the first SYNACK) to be sent with that fwmark. The
+ listening socket's mark is unchanged. Listening sockets that already
+ have a fwmark set via setsockopt(SOL_SOCKET, SO_MARK, ...) are
+ unaffected.
+ Default: 0
+
tcp_syn_retries - INTEGER
Number of times initial SYNs for an active TCP connection attempt
will be retransmitted. Should not be higher than 127. Default value
diff --git a/Documentation/scheduler/sched-tune.txt b/Documentation/scheduler/sched-tune.txt
new file mode 100644
index 0000000..1a10371
--- /dev/null
+++ b/Documentation/scheduler/sched-tune.txt
@@ -0,0 +1,388 @@
+ Central, scheduler-driven, power-performance control
+ (EXPERIMENTAL)
+
+Abstract
+========
+
+The topic of a single simple power-performance tunable, that is wholly
+scheduler centric, and has well defined and predictable properties has come up
+on several occasions in the past [1,2]. With techniques such as a scheduler
+driven DVFS [3], we now have a good framework for implementing such a tunable.
+This document describes the overall ideas behind its design and implementation.
+
+
+Table of Contents
+=================
+
+1. Motivation
+2. Introduction
+3. Signal Boosting Strategy
+4. OPP selection using boosted CPU utilization
+5. Per task group boosting
+6. Per-task wakeup-placement-strategy Selection
+7. Question and Answers
+ - What about "auto" mode?
+ - What about boosting on a congested system?
+ - How are multiple groups of tasks with different boost values managed?
+8. References
+
+
+1. Motivation
+=============
+
+Schedutil [3] is a utilization-driven cpufreq governor which allows the
+scheduler to select the optimal DVFS operating point (OPP) for running a task
+allocated to a CPU.
+
+However, sometimes it may be desired to intentionally boost the performance of
+a workload even if that could imply a reasonable increase in energy
+consumption. For example, in order to reduce the response time of a task, we
+may want to run the task at a higher OPP than the one that is actually required
+by its CPU bandwidth demand.
+
+This last requirement is especially important if we consider that one of the
+main goals of the utilization-driven governor component is to replace all
+currently available CPUFreq policies. Since schedutil is event-based, as
+opposed to the sampling-driven governors we currently have, it is already
+more responsive at selecting the optimal OPP to run tasks allocated to a CPU.
+However, just tracking the actual task utilization may not be enough from a
+performance standpoint. For example, it is not possible to get behaviors
+similar to those provided by the "performance" and "interactive" CPUFreq
+governors.
+
+This document describes an implementation of a tunable, stacked on top of the
+utilization-driven governor which extends its functionality to support task
+performance boosting.
+
+By "performance boosting" we mean the reduction of the time required to
+complete a task activation, i.e. the time elapsed from a task wakeup to its
+next deactivation (e.g. because it goes back to sleep or it terminates). For
+example, if we consider a simple periodic task which executes the same workload
+for 5[s] every 20[s] while running at a certain OPP, a boosted execution of
+that task must complete each of its activations in less than 5[s].
+
+The rest of this document introduces in more detail the proposed solution
+which has been named SchedTune.
+
+
+2. Introduction
+===============
+
+SchedTune exposes a simple user-space interface provided through a new
+CGroup controller 'stune' which provides two power-performance tunables
+per group:
+
+ /<stune cgroup mount point>/schedtune.prefer_idle
+ /<stune cgroup mount point>/schedtune.boost
+
+The CGroup implementation permits arbitrary user-space defined task
+classification to tune the scheduler for different goals depending on the
+specific nature of the task, e.g. background vs interactive vs low-priority.
+
+More details are given in section 5.
+
+2.1 Boosting
+============
+
+The boost value is expressed as an integer in the range [0..100].
+
+A value of 0 (default) configures the CFS scheduler for maximum energy
+efficiency. This means that schedutil runs the tasks at the minimum OPP
+required to satisfy their workload demand.
+
+A value of 100 configures scheduler for maximum performance, which translates
+to the selection of the maximum OPP on that CPU.
+
+Values between 0 and 100 can be used to suit other scenarios, for example to
+favour interactive response, or to react to other system events (battery
+level, etc).
+
+The overall design of the SchedTune module is built on top of "Per-Entity Load
+Tracking" (PELT) signals and schedutil by introducing a bias on the OPP
+selection.
+
+Each time a task is allocated on a CPU, cpufreq is given the opportunity to tune
+the operating frequency of that CPU to better match the workload demand. The
+selection of the actual OPP being activated is influenced by the boost value
+for the task CGroup.
+
+This simple biasing approach leverages existing frameworks, which means minimal
+modifications to the scheduler, and yet it makes it possible to achieve a
+range of different behaviours, all from a single simple tunable knob.
+
+In EAS schedulers, we use boosted task and CPU utilization for energy
+calculation and energy-aware task placement.
+
+2.2 prefer_idle
+===============
+
+This is a flag which indicates to the scheduler that userspace would like
+the scheduler to focus on energy or to focus on performance.
+
+A value of 0 (default) signals to the CFS scheduler that tasks in this group
+can be placed according to the energy-aware wakeup strategy.
+
+A value of 1 signals to the CFS scheduler that tasks in this group should be
+placed to minimise wakeup latency.
+
+Android platforms typically use this flag for application tasks which the
+user is currently interacting with.
+
+
+3. Signal Boosting Strategy
+===========================
+
+The whole PELT machinery works based on the value of a few load tracking signals
+which basically track the CPU bandwidth requirements for tasks and the capacity
+of CPUs. The basic idea behind the SchedTune knob is to artificially inflate
+some of these load tracking signals to make a task or RQ appear more demanding
+than it actually is.
+
+Which signals have to be inflated depends on the specific "consumer". However,
+independently from the specific (signal, consumer) pair, it is important to
+define a simple and possibly consistent strategy for the concept of boosting a
+signal.
+
+A boosting strategy defines how the "abstract" user-space defined
+sched_cfs_boost value is translated into an internal "margin" value to be added
+to a signal to get its inflated value:
+
+ margin := boosting_strategy(sched_cfs_boost, signal)
+ boosted_signal := signal + margin
+
+The boosting strategy currently implemented in SchedTune is called 'Signal
+Proportional Compensation' (SPC). With SPC, the sched_cfs_boost value is used to
+compute a margin which is proportional to the complement of the original signal.
+When a signal has a maximum possible value, its complement is defined as
+the delta between the actual value and that maximum.
+
+Since the tunable implementation uses signals which have SCHED_CAPACITY_SCALE as
+the maximum possible value, the margin becomes:
+
+ margin := sched_cfs_boost * (SCHED_CAPACITY_SCALE - signal)
+
+Using this boosting strategy:
+- a 100% sched_cfs_boost means that the signal is scaled to the maximum value
+- each value in the range of sched_cfs_boost effectively inflates the signal in
+ question by a quantity which is proportional to the maximum value.
+
+For example, by applying the SPC boosting strategy to the selection of the OPP
+to run a task it is possible to achieve these behaviors:
+
+- 0% boosting: run the task at the minimum OPP required by its workload
+- 100% boosting: run the task at the maximum OPP available for the CPU
+- 50% boosting: run at the half-way OPP between minimum and maximum
+
+Which means that, at 50% boosting, a task will be scheduled to run at half of
+the maximum theoretically achievable performance on the specific target
+platform.
+
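+In integer arithmetic (with the boost expressed as a percentage, shown purely
+as an illustrative sketch rather than the exact kernel code), this amounts to:
+
+   margin  = boost_pct * (SCHED_CAPACITY_SCALE - signal) / 100;
+   boosted = signal + margin;
+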
+A graphical representation of an SPC boosted signal is represented in the
+following figure where:
+ a) "-" represents the original signal
+ b) "b" represents a 50% boosted signal
+ c) "p" represents a 100% boosted signal
+
+
+ ^
+ | SCHED_CAPACITY_SCALE
+ +-----------------------------------------------------------------+
+ |pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
+ |
+ | boosted_signal
+ | bbbbbbbbbbbbbbbbbbbbbbbb
+ |
+ | original signal
+ | bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+
+ | |
+ |bbbbbbbbbbbbbbbbbb |
+ | |
+ | |
+ | |
+ | +-----------------------+
+ | |
+ | |
+ | |
+ |------------------+
+ |
+ |
+ +----------------------------------------------------------------------->
+
+The plot above shows a ramped load signal (titled 'original_signal') and its
+boosted equivalent. For each step of the original signal the boosted signal
+corresponding to a 50% boost is midway from the original signal and the upper
+bound. Boosting by 100% generates a boosted signal which is always saturated to
+the upper bound.
+
+
+4. OPP selection using boosted CPU utilization
+==============================================
+
+It is worth calling out that the implementation does not introduce any new load
+signals. Instead, it provides an API to tune existing signals. This tuning is
+done on demand and only in scheduler code paths where it is sensible to do so.
+The new API calls are defined to return either the default signal or a boosted
+one, depending on the value of sched_cfs_boost. This is a clean and
+non-invasive modification of the existing code paths.
+
+The signal representing a CPU's utilization is boosted according to the
+previously described SPC boosting strategy. To schedutil, this allows a CPU
+(i.e. a CFS run-queue) to appear more used than it actually is.
+
+Thus, with the sched_cfs_boost enabled we have the following main functions to
+get the current utilization of a CPU:
+
+ cpu_util()
+ boosted_cpu_util()
+
+The new boosted_cpu_util() is similar to the first but returns a boosted
+utilization signal which is a function of the sched_cfs_boost value.
+
+This function is used in the CFS scheduler code paths where schedutil needs to
+decide the OPP to run a CPU at. For example, this allows selecting the highest
+OPP for a CPU which has the boost value set to 100%.
+
+
+5. Per task group boosting
+==========================
+
+On battery powered devices there usually are many background services which are
+long running and need energy efficient scheduling. On the other hand, some
+applications are more performance sensitive and require an interactive
+response and/or maximum performance, regardless of the energy cost.
+
+To better service such scenarios, the SchedTune implementation has an extension
+that provides a more fine grained boosting interface.
+
+A new CGroup controller, namely "schedtune", can be enabled which allows
+defining and configuring task groups with different boosting values.
+Tasks that require special performance can be put into separate CGroups.
+The value of the boost associated with the tasks in this group can be specified
+using a single knob exposed by the CGroup controller:
+
+ schedtune.boost
+
+This knob allows the definition of a boost value that is to be used for
+SPC boosting of all tasks attached to this group.
+
+The current schedtune controller implementation is really simple and has these
+main characteristics:
+
+ 1) It is only possible to create 1 level depth hierarchies
+
+ The root control group defines the system-wide boost value to be applied
+ by default to all tasks. Its direct subgroups are named "boost groups" and
+ they define the boost value for a specific set of tasks.
+ Further nested subgroups are not allowed since they do not have a sensible
+ meaning from a user-space standpoint.
+
+ 2) It is possible to define only a limited number of "boost groups"
+
+ This number is defined at compile time and by default configured to 16.
+ This is a design decision motivated by two main reasons:
+ a) In a real system we do not expect utilization scenarios with more than
+ a few boost groups. For example, a reasonable collection of groups could
+ be just "background", "interactive" and "performance".
+ b) It simplifies the implementation considerably, especially for the code
+ which has to compute the per CPU boosting once there are multiple
+ RUNNABLE tasks with different boost values.
+
+Such a simple design should allow servicing the main utilization scenarios
+identified so far. It provides a simple interface which can be used to manage
+the power-performance of all tasks or only selected tasks.
+Moreover, this interface can be easily integrated by user-space run-times (e.g.
+Android, ChromeOS) to implement a QoS solution for task boosting based on tasks
+classification, which has been a long standing requirement.
+
+Setup and usage
+---------------
+
+0. Use a kernel with CONFIG_SCHED_TUNE support enabled
+
+1. Check that the "schedtune" CGroup controller is available:
+
+ root@linaro-nano:~# cat /proc/cgroups
+ #subsys_name hierarchy num_cgroups enabled
+ cpuset 0 1 1
+ cpu 0 1 1
+ schedtune 0 1 1
+
+2. Mount a tmpfs to create the CGroups mount point (Optional)
+
+ root@linaro-nano:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup
+
+3. Mount the "schedtune" controller
+
+ root@linaro-nano:~# mkdir /sys/fs/cgroup/stune
+ root@linaro-nano:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune
+
+4. Create task groups and configure their specific boost value (Optional)
+
+ For example, here we create a "performance" boost group configured to boost
+ all its tasks to 100%:
+
+ root@linaro-nano:~# mkdir /sys/fs/cgroup/stune/performance
+ root@linaro-nano:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost
+
+5. Move tasks into the boost group
+
+ For example, the following moves the tasks with PID $TASKPID (and all its
+ threads) into the "performance" boost group.
+
+ root@linaro-nano:~# echo $TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs
+
+This simple configuration allows only the threads of the $TASKPID task to run,
+when needed, at the highest OPP in the most capable CPU of the system.
+
+
+6. Per-task wakeup-placement-strategy Selection
+===============================================
+
+Many devices have a number of CFS tasks in use which require an absolute
+minimum wakeup latency, and many tasks for which wakeup latency is not
+important.
+
+For touch-driven environments, removing additional wakeup latency can be
+critical.
+
+When you use the SchedTune CGroup controller, you have access to a second
+parameter which allows a group to be marked such that energy_aware task
+placement is bypassed for tasks belonging to that group.
+
+prefer_idle=0 (default - use energy-aware task placement if available)
+prefer_idle=1 (never use energy-aware task placement for these tasks)
+
+Since the regular wakeup task placement algorithm in CFS is biased for
+performance, this has the effect of restoring minimum wakeup latency
+for the desired tasks whilst still allowing energy-aware wakeup placement
+to save energy for other tasks.
+
+
+7. Question and Answers
+=======================
+
+What about "auto" mode?
+-----------------------
+
+The 'auto' mode as described in [5] can be implemented by interfacing SchedTune
+with some suitable user-space element. This element could use the exposed
+system-wide or cgroup based interface.
+
+How are multiple groups of tasks with different boost values managed?
+---------------------------------------------------------------------
+
+The current SchedTune implementation keeps track of the boosted RUNNABLE tasks
+on a CPU. The CPU utilization seen by schedutil (and used to select an
+appropriate OPP) is boosted with a value which is the maximum of the boost
+values of the currently RUNNABLE tasks in its RQ.
+
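+For example, if a CPU has two RUNNABLE tasks attached to boost groups with
+boost values 20 and 60, the CPU utilization seen by schedutil is inflated
+using the 60% margin until the more-boosted task is dequeued.
+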
+This allows cpufreq to boost a CPU only while there are boosted tasks ready
+to run and switch back to the energy efficient mode as soon as the last boosted
+task is dequeued.
+
+
+8. References
+=============
+[1] http://lwn.net/Articles/552889
+[2] http://lkml.org/lkml/2012/5/18/91
+[3] https://lkml.org/lkml/2016/3/29/1041
diff --git a/Makefile b/Makefile
index a93e38c..52e7149 100644
--- a/Makefile
+++ b/Makefile
@@ -391,7 +391,7 @@
CHECKFLAGS := -D__linux__ -Dlinux -D__STDC__ -Dunix -D__unix__ \
-Wbitwise -Wno-return-void -Wno-unknown-attribute $(CF)
-NOSTDINC_FLAGS =
+NOSTDINC_FLAGS :=
CFLAGS_MODULE =
AFLAGS_MODULE =
LDFLAGS_MODULE =
@@ -481,7 +481,7 @@
$(srctree) $(objtree) $(VERSION) $(PATCHLEVEL)
endif
-ifeq ($(cc-name),clang)
+ifneq ($(shell $(CC) --version 2>&1 | head -n 1 | grep clang),)
ifneq ($(CROSS_COMPILE),)
CLANG_FLAGS += --target=$(notdir $(CROSS_COMPILE:%-=%))
GCC_TOOLCHAIN_DIR := $(dir $(shell which $(CROSS_COMPILE)elfedit))
@@ -507,9 +507,6 @@
export RETPOLINE_CFLAGS
export RETPOLINE_VDSO_CFLAGS
-KBUILD_CFLAGS += $(call cc-option,-fno-PIE)
-KBUILD_AFLAGS += $(call cc-option,-fno-PIE)
-
# The expansion should be delayed until arch/$(SRCARCH)/Makefile is included.
# Some architectures define CROSS_COMPILE in arch/$(SRCARCH)/Makefile.
# CC_VERSION_TEXT is referenced from Kconfig (so it needs export),
@@ -596,6 +593,8 @@
# Defaults to vmlinux, but the arch makefile usually adds further targets
all: vmlinux
+KBUILD_CFLAGS += $(call cc-option,-fno-PIE)
+KBUILD_AFLAGS += $(call cc-option,-fno-PIE)
CFLAGS_GCOV := -fprofile-arcs -ftest-coverage \
$(call cc-option,-fno-tree-loop-im) \
$(call cc-disable-warning,maybe-uninitialized,)
@@ -666,6 +665,12 @@
KBUILD_CFLAGS += $(call cc-option,--param=allow-store-data-races=0)
KBUILD_CFLAGS += $(call cc-option,-fno-allow-store-data-races)
+# check for 'asm goto'
+ifeq ($(shell $(CONFIG_SHELL) $(srctree)/scripts/gcc-goto.sh $(CC) $(KBUILD_CFLAGS)), y)
+ KBUILD_CFLAGS += -DCC_HAVE_ASM_GOTO
+ KBUILD_AFLAGS += -DCC_HAVE_ASM_GOTO
+endif
+
include scripts/Makefile.kcov
include scripts/Makefile.gcc-plugins
@@ -689,17 +694,19 @@
KBUILD_CFLAGS += $(stackp-flags-y)
-ifeq ($(cc-name),clang)
-KBUILD_CPPFLAGS += $(call cc-option,-Qunused-arguments,)
-KBUILD_CFLAGS += $(call cc-disable-warning, format-invalid-specifier)
-KBUILD_CFLAGS += $(call cc-disable-warning, gnu)
+ifdef CONFIG_CC_IS_CLANG
+KBUILD_CPPFLAGS += -Qunused-arguments
+KBUILD_CFLAGS += -Wno-format-invalid-specifier
+KBUILD_CFLAGS += -Wno-gnu
+KBUILD_CFLAGS += -Wno-address-of-packed-member
+KBUILD_CFLAGS += -Wno-duplicate-decl-specifier
# Quiet clang warning: comparison of unsigned expression < 0 is always false
-KBUILD_CFLAGS += $(call cc-disable-warning, tautological-compare)
+KBUILD_CFLAGS += -Wno-tautological-compare
+KBUILD_CFLAGS += -Wno-constant-conversion
# CLANG uses a _MergedGlobals as optimization, but this breaks modpost, as the
# source of a reference will be _MergedGlobals and not on of the whitelisted names.
# See modpost pattern 2
-KBUILD_CFLAGS += $(call cc-option, -mno-global-merge,)
-KBUILD_CFLAGS += $(call cc-option, -fcatch-undefined-behavior)
+KBUILD_CFLAGS += -mno-global-merge
else
# These warnings generated too much noise in a regular build.
diff --git a/PRESUBMIT.cfg b/PRESUBMIT.cfg
new file mode 100644
index 0000000..4fb5526
--- /dev/null
+++ b/PRESUBMIT.cfg
@@ -0,0 +1,9 @@
+[Hook Overrides]
+checkpatch_check: true
+aosp_license_check: false
+cros_license_check: false
+long_line_check: false
+signoff_check: true
+stray_whitespace_check: false
+tab_check: false
+tabbed_indent_required_check: false
diff --git a/arch/Kconfig b/arch/Kconfig
index a336548..6801123 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -71,7 +71,6 @@
config JUMP_LABEL
bool "Optimize very unlikely/likely branches"
depends on HAVE_ARCH_JUMP_LABEL
- depends on CC_HAS_ASM_GOTO
help
This option enables a transparent branch optimization that
makes certain almost-always-true or almost-always-false branch
diff --git a/arch/arm/kernel/jump_label.c b/arch/arm/kernel/jump_label.c
index 303b3ab..90bce3d 100644
--- a/arch/arm/kernel/jump_label.c
+++ b/arch/arm/kernel/jump_label.c
@@ -4,6 +4,8 @@
#include <asm/patch.h>
#include <asm/insn.h>
+#ifdef HAVE_JUMP_LABEL
+
static void __arch_jump_label_transform(struct jump_entry *entry,
enum jump_label_type type,
bool is_static)
@@ -33,3 +35,5 @@
{
__arch_jump_label_transform(entry, type, true);
}
+
+#endif
diff --git a/arch/arm64/kernel/jump_label.c b/arch/arm64/kernel/jump_label.c
index b90754a..e075641 100644
--- a/arch/arm64/kernel/jump_label.c
+++ b/arch/arm64/kernel/jump_label.c
@@ -20,6 +20,8 @@
#include <linux/jump_label.h>
#include <asm/insn.h>
+#ifdef HAVE_JUMP_LABEL
+
void arch_jump_label_transform(struct jump_entry *entry,
enum jump_label_type type)
{
@@ -47,3 +49,5 @@
* NOP needs to be replaced by a branch.
*/
}
+
+#endif /* HAVE_JUMP_LABEL */
diff --git a/arch/mips/Makefile b/arch/mips/Makefile
index ad0a92f..39cab0e 100644
--- a/arch/mips/Makefile
+++ b/arch/mips/Makefile
@@ -128,7 +128,7 @@
# clang's output will be based upon the build machine. So for clang we simply
# unconditionally specify -EB or -EL as appropriate.
#
-ifeq ($(cc-name),clang)
+ifdef CONFIG_CC_IS_CLANG
cflags-$(CONFIG_CPU_BIG_ENDIAN) += -EB
cflags-$(CONFIG_CPU_LITTLE_ENDIAN) += -EL
else
diff --git a/arch/mips/kernel/jump_label.c b/arch/mips/kernel/jump_label.c
index ab94392..32e3168 100644
--- a/arch/mips/kernel/jump_label.c
+++ b/arch/mips/kernel/jump_label.c
@@ -16,6 +16,8 @@
#include <asm/cacheflush.h>
#include <asm/inst.h>
+#ifdef HAVE_JUMP_LABEL
+
/*
* Define parameters for the standard MIPS and the microMIPS jump
* instruction encoding respectively:
@@ -68,3 +70,5 @@
mutex_unlock(&text_mutex);
}
+
+#endif /* HAVE_JUMP_LABEL */
diff --git a/arch/mips/vdso/Makefile b/arch/mips/vdso/Makefile
index c99fa1c..8b23bb1 100644
--- a/arch/mips/vdso/Makefile
+++ b/arch/mips/vdso/Makefile
@@ -12,7 +12,7 @@
$(filter -mno-loongson-%,$(KBUILD_CFLAGS)) \
-D__VDSO__
-ifeq ($(cc-name),clang)
+ifdef CONFIG_CC_IS_CLANG
ccflags-vdso += $(filter --target=%,$(KBUILD_CFLAGS))
endif
diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 8954108..55ac368 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -98,7 +98,7 @@
endif
endif
-ifneq ($(cc-name),clang)
+ifndef CONFIG_CC_IS_CLANG
cflags-$(CONFIG_CPU_LITTLE_ENDIAN) += -mno-strict-align
endif
@@ -179,7 +179,7 @@
# Work around gcc code-gen bugs with -pg / -fno-omit-frame-pointer in gcc <= 4.8
# https://gcc.gnu.org/bugzilla/show_bug.cgi?id=44199
# https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52828
-ifneq ($(cc-name),clang)
+ifndef CONFIG_CC_IS_CLANG
CC_FLAGS_FTRACE += $(call cc-ifversion, -lt, 0409, -mno-sched-epilog)
endif
endif
diff --git a/arch/powerpc/include/asm/asm-prototypes.h b/arch/powerpc/include/asm/asm-prototypes.h
index d0609c1..95b2df1 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -38,7 +38,7 @@
void __trace_hcall_entry(unsigned long opcode, unsigned long *args);
void __trace_hcall_exit(long opcode, long retval, unsigned long *retbuf);
/* OPAL tracing */
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
extern struct static_key opal_tracepoint_key;
#endif
diff --git a/arch/powerpc/kernel/jump_label.c b/arch/powerpc/kernel/jump_label.c
index 0080c5f..6472472 100644
--- a/arch/powerpc/kernel/jump_label.c
+++ b/arch/powerpc/kernel/jump_label.c
@@ -11,6 +11,7 @@
#include <linux/jump_label.h>
#include <asm/code-patching.h>
+#ifdef HAVE_JUMP_LABEL
void arch_jump_label_transform(struct jump_entry *entry,
enum jump_label_type type)
{
@@ -21,3 +22,4 @@
else
patch_instruction(addr, PPC_INST_NOP);
}
+#endif
diff --git a/arch/powerpc/platforms/powernv/opal-tracepoints.c b/arch/powerpc/platforms/powernv/opal-tracepoints.c
index f16a435..1ab7d26 100644
--- a/arch/powerpc/platforms/powernv/opal-tracepoints.c
+++ b/arch/powerpc/platforms/powernv/opal-tracepoints.c
@@ -4,7 +4,7 @@
#include <asm/trace.h>
#include <asm/asm-prototypes.h>
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
struct static_key opal_tracepoint_key = STATIC_KEY_INIT;
int opal_tracepoint_regfunc(void)
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index 74215eb..3f98158 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -20,7 +20,7 @@
.section ".text"
#ifdef CONFIG_TRACEPOINTS
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
#define OPAL_BRANCH(LABEL) \
ARCH_STATIC_BRANCH(LABEL, opal_tracepoint_key)
#else
diff --git a/arch/powerpc/platforms/pseries/hvCall.S b/arch/powerpc/platforms/pseries/hvCall.S
index 50dc942..d91412c 100644
--- a/arch/powerpc/platforms/pseries/hvCall.S
+++ b/arch/powerpc/platforms/pseries/hvCall.S
@@ -19,7 +19,7 @@
#ifdef CONFIG_TRACEPOINTS
-#ifndef CONFIG_JUMP_LABEL
+#ifndef HAVE_JUMP_LABEL
.section ".toc","aw"
.globl hcall_tracepoint_refcount
@@ -79,7 +79,7 @@
mr r5,BUFREG; \
__HCALL_INST_POSTCALL
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
#define HCALL_BRANCH(LABEL) \
ARCH_STATIC_BRANCH(LABEL, hcall_tracepoint_key)
#else
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index d660a90..47942de 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -833,7 +833,7 @@
#endif /* CONFIG_PPC_BOOK3S_64 */
#ifdef CONFIG_TRACEPOINTS
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
struct static_key hcall_tracepoint_key = STATIC_KEY_INIT;
int hcall_tracepoint_regfunc(void)
diff --git a/arch/s390/appldata/appldata_mem.c b/arch/s390/appldata/appldata_mem.c
index e68136c..0fc6522 100644
--- a/arch/s390/appldata/appldata_mem.c
+++ b/arch/s390/appldata/appldata_mem.c
@@ -63,6 +63,9 @@
u64 pgalloc; /* page allocations */
u64 pgfault; /* page faults (major+minor) */
u64 pgmajfault; /* page faults (major only) */
+ u64 pgmajfault_s; /* shmem page faults (major only) */
+ u64 pgmajfault_a; /* anonymous page faults (major only) */
+ u64 pgmajfault_f; /* file page faults (major only) */
// <-- New in 2.6
} __packed;
@@ -94,7 +97,11 @@
mem_data->pgalloc = ev[PGALLOC_NORMAL];
mem_data->pgalloc += ev[PGALLOC_DMA];
mem_data->pgfault = ev[PGFAULT];
- mem_data->pgmajfault = ev[PGMAJFAULT];
+ mem_data->pgmajfault =
+ ev[PGMAJFAULT_S] + ev[PGMAJFAULT_A] + ev[PGMAJFAULT_F];
+ mem_data->pgmajfault_s = ev[PGMAJFAULT_S];
+ mem_data->pgmajfault_a = ev[PGMAJFAULT_A];
+ mem_data->pgmajfault_f = ev[PGMAJFAULT_F];
si_meminfo(&val);
mem_data->sharedram = val.sharedram;
diff --git a/arch/s390/kernel/Makefile b/arch/s390/kernel/Makefile
index 762fc453..b524c15 100644
--- a/arch/s390/kernel/Makefile
+++ b/arch/s390/kernel/Makefile
@@ -46,7 +46,7 @@
obj-y := traps.o time.o process.o base.o early.o setup.o idle.o vtime.o
obj-y += processor.o sys_s390.o ptrace.o signal.o cpcmd.o ebcdic.o nmi.o
obj-y += debug.o irq.o ipl.o dis.o diag.o vdso.o early_nobss.o
-obj-y += sysinfo.o lgr.o os_info.o machine_kexec.o pgm_check.o
+obj-y += sysinfo.o jump_label.o lgr.o os_info.o machine_kexec.o pgm_check.o
obj-y += runtime_instr.o cache.o fpu.o dumpstack.o guarded_storage.o sthyi.o
obj-y += entry.o reipl.o relocate_kernel.o kdebugfs.o alternative.o
obj-y += nospec-branch.o
@@ -70,7 +70,6 @@
obj-$(CONFIG_FUNCTION_TRACER) += mcount.o ftrace.o
obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
obj-$(CONFIG_UPROBES) += uprobes.o
-obj-$(CONFIG_JUMP_LABEL) += jump_label.o
obj-$(CONFIG_KEXEC_FILE) += machine_kexec_file.o kexec_image.o
obj-$(CONFIG_KEXEC_FILE) += kexec_elf.o
diff --git a/arch/s390/kernel/jump_label.c b/arch/s390/kernel/jump_label.c
index 68f415e..43f8430 100644
--- a/arch/s390/kernel/jump_label.c
+++ b/arch/s390/kernel/jump_label.c
@@ -10,6 +10,8 @@
#include <linux/jump_label.h>
#include <asm/ipl.h>
+#ifdef HAVE_JUMP_LABEL
+
struct insn {
u16 opcode;
s32 offset;
@@ -100,3 +102,5 @@
{
__jump_label_transform(entry, type, 1);
}
+
+#endif
diff --git a/arch/sparc/kernel/Makefile b/arch/sparc/kernel/Makefile
index 97c0e19..cf86408 100644
--- a/arch/sparc/kernel/Makefile
+++ b/arch/sparc/kernel/Makefile
@@ -118,4 +118,4 @@
obj-$(CONFIG_SPARC64) += $(pc--y)
obj-$(CONFIG_UPROBES) += uprobes.o
-obj-$(CONFIG_JUMP_LABEL) += jump_label.o
+obj-$(CONFIG_SPARC64) += jump_label.o
diff --git a/arch/sparc/kernel/jump_label.c b/arch/sparc/kernel/jump_label.c
index a4cfaee..7f8eac5 100644
--- a/arch/sparc/kernel/jump_label.c
+++ b/arch/sparc/kernel/jump_label.c
@@ -9,6 +9,8 @@
#include <asm/cacheflush.h>
+#ifdef HAVE_JUMP_LABEL
+
void arch_jump_label_transform(struct jump_entry *entry,
enum jump_label_type type)
{
@@ -45,3 +47,5 @@
flushi(insn);
mutex_unlock(&text_mutex);
}
+
+#endif
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index af35f5c..fbbe59a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -204,6 +204,7 @@
select USER_STACKTRACE_SUPPORT
select VIRT_TO_BUS
select X86_FEATURE_NAMES if PROC_FS
+ select ARCH_HAS_ALT_SYSCALL if X86_64
config INSTRUCTION_DECODER
def_bool y
@@ -408,6 +409,17 @@
If in doubt, say Y.
+config X86_FAST_FEATURE_TESTS
+ bool "Fast CPU feature tests" if EMBEDDED
+ default y
+ ---help---
+ Some fast-paths in the kernel depend on the capabilities of the CPU.
+ Say Y here for the kernel to patch in the appropriate code at runtime
+ based on the capabilities of the CPU. The infrastructure for patching
+ code at runtime takes up some additional space; space-constrained
+ embedded systems may wish to say N here to produce smaller, slightly
+ slower code.
+
config X86_X2APIC
bool "Support x2apic"
depends on X86_LOCAL_APIC && X86_64 && (IRQ_REMAP || HYPERVISOR_GUEST)
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 4833dd7..5a38de8 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -306,10 +306,6 @@
archprepare: checkbin
checkbin:
-ifndef CONFIG_CC_HAS_ASM_GOTO
- @echo Compiler lacks asm-goto support.
- @exit 1
-endif
ifdef CONFIG_RETPOLINE
ifeq ($(RETPOLINE_CFLAGS),)
@echo "You are building kernel with non-retpoline compiler." >&2
diff --git a/arch/x86/boot/compressed/Makefile b/arch/x86/boot/compressed/Makefile
index 466f66c..de1e6e6 100644
--- a/arch/x86/boot/compressed/Makefile
+++ b/arch/x86/boot/compressed/Makefile
@@ -28,6 +28,7 @@
KBUILD_CFLAGS := -m$(BITS) -O2
KBUILD_CFLAGS += -fno-strict-aliasing $(call cc-option, -fPIE, -fPIC)
+KBUILD_CFLAGS += -fomit-frame-pointer
KBUILD_CFLAGS += -DDISABLE_BRANCH_PROFILING
cflags-$(CONFIG_X86_32) := -march=i386
cflags-$(CONFIG_X86_64) := -mcmodel=small
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 993dd06..7c56a2a 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -341,7 +341,7 @@
*/
.macro CALL_enter_from_user_mode
#ifdef CONFIG_CONTEXT_TRACKING
-#ifdef CONFIG_JUMP_LABEL
+#ifdef HAVE_JUMP_LABEL
STATIC_JUMP_IF_FALSE .Lafter_call_\@, context_tracking_enabled, def=0
#endif
call enter_from_user_mode
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 8353348..ad35f51 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -288,10 +288,17 @@
* regs->orig_ax, which changes the behavior of some syscalls.
*/
nr &= __SYSCALL_MASK;
+#ifdef CONFIG_ALT_SYSCALL
+ if (likely(nr < ti->nr_syscalls)) {
+ nr = array_index_nospec(nr, ti->nr_syscalls);
+ regs->ax = ti->sys_call_table[nr](regs);
+ }
+#else
if (likely(nr < NR_syscalls)) {
nr = array_index_nospec(nr, NR_syscalls);
regs->ax = sys_call_table[nr](regs);
}
+#endif
syscall_return_slowpath(regs);
}
@@ -323,6 +330,12 @@
nr = syscall_trace_enter(regs);
}
+#ifdef CONFIG_ALT_SYSCALL
+ if (likely(nr < ti->ia32_nr_syscalls)) {
+ nr = array_index_nospec(nr, ti->ia32_nr_syscalls);
+ regs->ax = ti->ia32_sys_call_table[nr](regs);
+ }
+#else
if (likely(nr < IA32_NR_syscalls)) {
nr = array_index_nospec(nr, IA32_NR_syscalls);
#ifdef CONFIG_IA32_EMULATION
@@ -340,6 +353,7 @@
(unsigned int)regs->di, (unsigned int)regs->bp);
#endif /* CONFIG_IA32_EMULATION */
}
+#endif
syscall_return_slowpath(regs);
}
diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 68889ac..0ecc9ba 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -140,20 +140,7 @@
#define setup_force_cpu_bug(bit) setup_force_cpu_cap(bit)
-#if defined(__clang__) && !defined(CONFIG_CC_HAS_ASM_GOTO)
-
-/*
- * Workaround for the sake of BPF compilation which utilizes kernel
- * headers, but clang does not support ASM GOTO and fails the build.
- */
-#ifndef __BPF_TRACING__
-#warning "Compiler lacks ASM_GOTO support. Add -D __BPF_TRACING__ to your compiler arguments"
-#endif
-
-#define static_cpu_has(bit) boot_cpu_has(bit)
-
-#else
-
+#if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_X86_FAST_FEATURE_TESTS)
/*
* Static testing of CPU features. Used the same as boot_cpu_has().
* These will statically patch the target code for additional
@@ -209,6 +196,12 @@
boot_cpu_has(bit) : \
_static_cpu_has(bit) \
)
+#else
+/*
+ * Fall back to dynamic for gcc versions which don't support asm goto. Should be
+ * a minority now anyway.
+ */
+#define static_cpu_has(bit) boot_cpu_has(bit)
#endif
#define cpu_has_bug(c, bit) cpu_has(c, (bit))
diff --git a/arch/x86/include/asm/jump_label.h b/arch/x86/include/asm/jump_label.h
index 7010e1c..8c0de42 100644
--- a/arch/x86/include/asm/jump_label.h
+++ b/arch/x86/include/asm/jump_label.h
@@ -2,6 +2,19 @@
#ifndef _ASM_X86_JUMP_LABEL_H
#define _ASM_X86_JUMP_LABEL_H
+#ifndef HAVE_JUMP_LABEL
+/*
+ * For better or for worse, if jump labels (the gcc extension) are missing,
+ * then the entire static branch patching infrastructure is compiled out.
+ * If that happens, the code in here will malfunction. Raise a compiler
+ * error instead.
+ *
+ * In theory, jump labels and the static branch patching infrastructure
+ * could be decoupled to fix this.
+ */
+#error asm/jump_label.h included on a non-jump-label kernel
+#endif
+
#define JUMP_LABEL_NOP_SIZE 5
#ifdef CONFIG_X86_64
diff --git a/arch/x86/include/asm/rmwcc.h b/arch/x86/include/asm/rmwcc.h
index 033dc7c..4914a3e 100644
--- a/arch/x86/include/asm/rmwcc.h
+++ b/arch/x86/include/asm/rmwcc.h
@@ -4,7 +4,7 @@
#define __CLOBBERS_MEM(clb...) "memory", ## clb
-#if !defined(__GCC_ASM_FLAG_OUTPUTS__) && defined(CONFIG_CC_HAS_ASM_GOTO)
+#if !defined(__GCC_ASM_FLAG_OUTPUTS__) && defined(CC_HAVE_ASM_GOTO)
/* Use asm goto */
@@ -21,7 +21,7 @@
#define __BINARY_RMWcc_ARG " %1, "
-#else /* defined(__GCC_ASM_FLAG_OUTPUTS__) || !defined(CONFIG_CC_HAS_ASM_GOTO) */
+#else /* defined(__GCC_ASM_FLAG_OUTPUTS__) || !defined(CC_HAVE_ASM_GOTO) */
/* Use flags output or a set instruction */
@@ -36,7 +36,7 @@
#define __BINARY_RMWcc_ARG " %2, "
-#endif /* defined(__GCC_ASM_FLAG_OUTPUTS__) || !defined(CONFIG_CC_HAS_ASM_GOTO) */
+#endif /* defined(__GCC_ASM_FLAG_OUTPUTS__) || !defined(CC_HAVE_ASM_GOTO) */
#define GEN_UNARY_RMWcc(op, var, arg0, cc) \
__GEN_RMWcc(op " " arg0, var, cc, __CLOBBERS_MEM())
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index d653139..dc4418f 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -33,6 +33,7 @@
#define ia32_sys_call_table sys_call_table
#define __NR_syscall_compat_max __NR_syscall_max
#define IA32_NR_syscalls NR_syscalls
+#define ia32_nr_syscalls nr_syscalls
#endif
#if defined(CONFIG_IA32_EMULATION)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 82b73b7..7ca78ae 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -50,17 +50,52 @@
*/
#ifndef __ASSEMBLY__
struct task_struct;
+
+/* same as sys_call_ptr_t from asm/syscall.h */
+typedef asmlinkage long (*ti_sys_call_ptr_t)(const struct pt_regs *);
+
#include <asm/cpufeature.h>
#include <linux/atomic.h>
struct thread_info {
unsigned long flags; /* low level flags */
u32 status; /* thread synchronous flags */
+#ifdef CONFIG_ALT_SYSCALL
+ /*
+ * This uses nr_syscalls instead of nr_syscall_max because we want
+ * to be able to entirely disable a syscall table (e.g. compat) by
+ * setting nr_syscalls to 0. This requires some careful work in
+ * the syscall entry assembly code, most variations use ..._max.
+ */
+ unsigned int nr_syscalls; /* size of below */
+ const ti_sys_call_ptr_t *sys_call_table;
+# ifdef CONFIG_IA32_EMULATION
+ unsigned int ia32_nr_syscalls; /* size of below */
+ const ti_sys_call_ptr_t *ia32_sys_call_table;
+# endif
+#endif
};
+#ifdef CONFIG_ALT_SYSCALL
+# ifdef CONFIG_IA32_EMULATION
+# define INIT_THREAD_INFO_SYSCALL_COMPAT \
+ .ia32_nr_syscalls = IA32_NR_syscalls, \
+ .ia32_sys_call_table = ia32_sys_call_table,
+# else
+# define INIT_THREAD_INFO_SYSCALL_COMPAT /* */
+# endif
+# define INIT_THREAD_INFO_SYSCALL \
+ .nr_syscalls = NR_syscalls, \
+ .sys_call_table = sys_call_table, \
+ INIT_THREAD_INFO_SYSCALL_COMPAT
+#else
+# define INIT_THREAD_INFO_SYSCALL /* */
+#endif
+
#define INIT_THREAD_INFO(tsk) \
{ \
.flags = 0, \
+ INIT_THREAD_INFO_SYSCALL \
}
#else /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index da0b6bc..b7661a3 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -49,8 +49,7 @@
obj-y += traps.o idt.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
obj-y += time.o ioport.o dumpstack.o nmi.o
obj-$(CONFIG_MODIFY_LDT_SYSCALL) += ldt.o
-obj-y += setup.o x86_init.o i8259.o irqinit.o
-obj-$(CONFIG_JUMP_LABEL) += jump_label.o
+obj-y += setup.o x86_init.o i8259.o irqinit.o jump_label.o
obj-$(CONFIG_IRQ_WORK) += irq_work.o
obj-y += probe_roms.o
obj-$(CONFIG_X86_64) += sys_x86_64.o
@@ -140,6 +139,8 @@
obj-$(CONFIG_UNWINDER_FRAME_POINTER) += unwind_frame.o
obj-$(CONFIG_UNWINDER_GUESS) += unwind_guess.o
+obj-$(CONFIG_ALT_SYSCALL) += alt-syscall.o
+
###
# 64 bit specific files
ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/alt-syscall.c b/arch/x86/kernel/alt-syscall.c
new file mode 100644
index 0000000..09e7ed7
--- /dev/null
+++ b/arch/x86/kernel/alt-syscall.c
@@ -0,0 +1,70 @@
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/unistd.h>
+#include <linux/slab.h>
+#include <linux/stddef.h>
+#include <linux/syscalls.h>
+#include <linux/alt-syscall.h>
+
+#include <asm/syscall.h>
+#include <asm/syscalls.h>
+
+int arch_dup_sys_call_table(struct alt_sys_call_table *entry)
+{
+ if (!entry)
+ return -EINVAL;
+ /* Table already allocated. */
+ if (entry->table)
+ return -EINVAL;
+#ifdef CONFIG_IA32_EMULATION
+ if (entry->compat_table)
+ return -EINVAL;
+#endif
+ entry->size = NR_syscalls;
+ entry->table = kcalloc(entry->size, sizeof(sys_call_ptr_t),
+ GFP_KERNEL);
+ if (!entry->table)
+ goto failed;
+
+ memcpy(entry->table, sys_call_table,
+ entry->size * sizeof(sys_call_ptr_t));
+
+#ifdef CONFIG_IA32_EMULATION
+ entry->compat_size = IA32_NR_syscalls;
+ entry->compat_table = kcalloc(entry->compat_size,
+ sizeof(sys_call_ptr_t), GFP_KERNEL);
+ if (!entry->compat_table)
+ goto failed;
+ memcpy(entry->compat_table, ia32_sys_call_table,
+ entry->compat_size * sizeof(sys_call_ptr_t));
+#endif
+
+ return 0;
+
+failed:
+ entry->size = 0;
+ kfree(entry->table);
+ entry->table = NULL;
+#ifdef CONFIG_IA32_EMULATION
+ entry->compat_size = 0;
+#endif
+ return -ENOMEM;
+}
+
+/* Operates on "current", which isn't racey, since it's _in_ a syscall. */
+int arch_set_sys_call_table(struct alt_sys_call_table *entry)
+{
+ if (!entry)
+ return -EINVAL;
+
+ current_thread_info()->nr_syscalls = entry->size;
+ current_thread_info()->sys_call_table = entry->table;
+#ifdef CONFIG_IA32_EMULATION
+ current_thread_info()->ia32_nr_syscalls = entry->compat_size;
+ current_thread_info()->ia32_sys_call_table = entry->compat_table;
+#endif
+
+ return 0;
+}
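
A hedged sketch of how the two helpers above are intended to be called: duplicate the default table, patch one slot in the private copy, and install it for the current task only. The deny_syscall() stub and the choice of __NR_reboot are purely illustrative and not part of this patch.

#include <linux/alt-syscall.h>
#include <linux/errno.h>
#include <linux/linkage.h>
#include <linux/ptrace.h>
#include <asm/syscall.h>
#include <asm/unistd.h>

/* Illustrative replacement handler: always refuse the call. */
static asmlinkage long deny_syscall(const struct pt_regs *regs)
{
	return -EPERM;
}

static int install_filtered_table(struct alt_sys_call_table *entry)
{
	int ret;

	ret = arch_dup_sys_call_table(entry);	/* allocate + copy defaults */
	if (ret)
		return ret;

	/* Override a single slot in the private copy (example only). */
	entry->table[__NR_reboot] = deny_syscall;

	/* Switches only the calling task, via its thread_info fields. */
	return arch_set_sys_call_table(entry);
}
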
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index 4c3d9a3..eeea935 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -16,6 +16,8 @@
#include <asm/alternative.h>
#include <asm/text-patching.h>
+#ifdef HAVE_JUMP_LABEL
+
union jump_code_union {
char code[JUMP_LABEL_NOP_SIZE];
struct {
@@ -140,3 +142,5 @@
if (jlstate == JL_STATE_UPDATE)
__jump_label_transform(entry, type, text_poke_early, 1);
}
+
+#endif
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 03b7529..42053b2 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -185,8 +185,7 @@
/*
* Secondary CPUs do not run through tsc_init(), so set up
* all the scale factors for all CPUs, assuming the same
- * speed as the bootup CPU. (cpufreq notifiers will fix this
- * up if their speed diverges)
+ * speed as the bootup CPU.
*/
static void __init cyc2ns_init_secondary_cpus(void)
{
@@ -936,12 +935,12 @@
}
#ifdef CONFIG_CPU_FREQ
-/* Frequency scaling support. Adjust the TSC based timer when the cpu frequency
+/*
+ * Frequency scaling support. Adjust the TSC based timer when the CPU frequency
* changes.
*
- * RED-PEN: On SMP we assume all CPUs run with the same frequency. It's
- * not that important because current Opteron setups do not support
- * scaling on SMP anyroads.
+ * NOTE: On SMP the situation is not fixable in general, so simply mark the TSC
+ * as unstable and give up in those cases.
*
* Should fix up last_tsc too. Currently gettimeofday in the
* first tick after the change will be slightly wrong.
@@ -955,22 +954,22 @@
void *data)
{
struct cpufreq_freqs *freq = data;
- unsigned long *lpj;
- lpj = &boot_cpu_data.loops_per_jiffy;
-#ifdef CONFIG_SMP
- if (!(freq->flags & CPUFREQ_CONST_LOOPS))
- lpj = &cpu_data(freq->cpu).loops_per_jiffy;
-#endif
+ if (num_online_cpus() > 1) {
+ mark_tsc_unstable("cpufreq changes on SMP");
+ return 0;
+ }
if (!ref_freq) {
ref_freq = freq->old;
- loops_per_jiffy_ref = *lpj;
+ loops_per_jiffy_ref = boot_cpu_data.loops_per_jiffy;
tsc_khz_ref = tsc_khz;
}
+
if ((val == CPUFREQ_PRECHANGE && freq->old < freq->new) ||
- (val == CPUFREQ_POSTCHANGE && freq->old > freq->new)) {
- *lpj = cpufreq_scale(loops_per_jiffy_ref, ref_freq, freq->new);
+ (val == CPUFREQ_POSTCHANGE && freq->old > freq->new)) {
+ boot_cpu_data.loops_per_jiffy =
+ cpufreq_scale(loops_per_jiffy_ref, ref_freq, freq->new);
tsc_khz = cpufreq_scale(tsc_khz_ref, ref_freq, freq->new);
if (!(freq->flags & CPUFREQ_CONST_LOOPS))
@@ -1377,6 +1376,8 @@
static bool __init determine_cpu_tsc_frequencies(bool early)
{
+ u64 initial_tsc;
+
/* Make sure that cpu and tsc are not already calibrated */
WARN_ON(cpu_khz || tsc_khz);
@@ -1389,6 +1390,8 @@
cpu_khz = pit_hpet_ptimer_calibrate_cpu();
}
+ initial_tsc = rdtsc();
+
/*
* Trust non-zero tsc_khz as authoritative,
* and use it to sanity check cpu_khz,
@@ -1402,6 +1405,10 @@
if (tsc_khz == 0)
return false;
+ do_div(initial_tsc, cpu_khz / 1000);
+ pr_info("Initial usec timer %llu\n",
+ (unsigned long long)initial_tsc);
+
pr_info("Detected %lu.%03lu MHz processor\n",
(unsigned long)cpu_khz / KHZ,
(unsigned long)cpu_khz % KHZ);
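
For the tsc.c hunks above, two worked numbers may help, treating cpufreq_scale(old, div, mult) as old * mult / div (which is what the generic helper computes) and recalling that cpufreq reports frequencies in kHz; the figures below are illustrative only.

/*
 * Rescaling on a frequency change (single-CPU case only, since the SMP
 * case now just marks the TSC unstable):
 *   tsc_khz_ref = 2400000, ref_freq = 2400000, freq->new = 1200000
 *   tsc_khz     = 2400000 * 1200000 / 2400000 = 1200000 kHz
 * and loops_per_jiffy is scaled with the same ratio.
 *
 * The new "Initial usec timer" message prints
 *   rdtsc() / (cpu_khz / 1000)
 * i.e. elapsed TSC cycles divided by cycles per microsecond, which is an
 * approximate count of microseconds since the TSC began counting.
 */
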
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 210eabd..93d07cf 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -456,7 +456,7 @@
/*
* XXX: inoutclob user must know where the argument is being expanded.
- * Relying on CONFIG_CC_HAS_ASM_GOTO would allow us to remove _fault.
+ * Relying on CC_HAVE_ASM_GOTO would allow us to remove _fault.
*/
#define asm_safe(insn, inoutclob...) \
({ \
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index e7f19de..b129915 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -86,6 +86,8 @@
pgd_t *save_pgd;
save_pgd = efi_call_phys_prolog();
+ if (!save_pgd)
+ return EFI_ABORTED;
/* Disable interrupts around EFI calls: */
local_irq_save(flags);
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 52dd59a..c54b5a58 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -84,13 +84,15 @@
if (!efi_enabled(EFI_OLD_MEMMAP)) {
efi_switch_mm(&efi_mm);
- return NULL;
+ return efi_mm.pgd;
}
early_code_mapping_set_exec(1);
n_pgds = DIV_ROUND_UP((max_pfn << PAGE_SHIFT), PGDIR_SIZE);
save_pgd = kmalloc_array(n_pgds, sizeof(*save_pgd), GFP_KERNEL);
+ if (!save_pgd)
+ return NULL;
/*
* Build 1:1 identity mapping for efi=old_map usage. Note that
@@ -138,10 +140,11 @@
pgd_offset_k(pgd * PGDIR_SIZE)->pgd &= ~_PAGE_NX;
}
-out:
__flush_tlb_all();
-
return save_pgd;
+out:
+ efi_call_phys_epilog(save_pgd);
+ return NULL;
}
void __init efi_call_phys_epilog(pgd_t *save_pgd)
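
Taken together, the efi.c and efi_64.c hunks above change the prolog's contract: a non-NULL return (efi_mm.pgd, or the saved pgd array in the efi=old_map case) means mappings were switched and must later be undone, while NULL now means setup failed and was already rolled back. A minimal sketch of the resulting caller pattern follows; the surrounding function is condensed, and only the error handling shown is taken from the patch.

#include <linux/efi.h>
#include <asm/efi.h>
#include <asm/pgtable.h>

/* Condensed caller-side sketch of the new prolog/epilog convention. */
static efi_status_t phys_efi_call_sketch(void)
{
	pgd_t *save_pgd = efi_call_phys_prolog();

	if (!save_pgd)		/* setup failed, nothing to restore */
		return EFI_ABORTED;

	/* ...issue the physical-mode EFI service call here... */

	efi_call_phys_epilog(save_pgd);	/* restore the original mappings */
	return EFI_SUCCESS;
}
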
diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c
index ecd3d0e..860ad04 100644
--- a/block/bfq-cgroup.c
+++ b/block/bfq-cgroup.c
@@ -579,7 +579,8 @@
bfqg_and_blkg_get(bfqg);
if (bfq_bfqq_busy(bfqq)) {
- bfq_pos_tree_add_move(bfqd, bfqq);
+ if (unlikely(!bfqd->nonrot_with_queueing))
+ bfq_pos_tree_add_move(bfqd, bfqq);
bfq_activate_bfqq(bfqd, bfqq);
}
@@ -1103,7 +1104,7 @@
},
#endif /* CONFIG_DEBUG_BLK_CGROUP */
- /* the same statictics which cover the bfqg and its descendants */
+ /* the same statistics which cover the bfqg and its descendants */
{
.name = "bfq.io_service_bytes_recursive",
.private = (unsigned long)&blkcg_policy_bfq,
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 5198ed1..fb80791 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -189,7 +189,7 @@
/*
* When a sync request is dispatched, the queue that contains that
* request, and all the ancestor entities of that queue, are charged
- * with the number of sectors of the request. In constrast, if the
+ * with the number of sectors of the request. In contrast, if the
* request is async, then the queue and its ancestor entities are
* charged with the number of sectors of the request, multiplied by
* the factor below. This throttles the bandwidth for async I/O,
@@ -217,7 +217,7 @@
* queue merging.
*
* As can be deduced from the low time limit below, queue merging, if
- * successful, happens at the very beggining of the I/O of the involved
+ * successful, happens at the very beginning of the I/O of the involved
* cooperating processes, as a consequence of the arrival of the very
* first requests from each cooperator. After that, there is very
* little chance to find cooperators.
@@ -230,13 +230,26 @@
#define BFQ_MIN_TT (2 * NSEC_PER_MSEC)
/* hw_tag detection: parallel requests threshold and min samples needed. */
-#define BFQ_HW_QUEUE_THRESHOLD 4
+#define BFQ_HW_QUEUE_THRESHOLD 3
#define BFQ_HW_QUEUE_SAMPLES 32
#define BFQQ_SEEK_THR (sector_t)(8 * 100)
#define BFQQ_SECT_THR_NONROT (sector_t)(2 * 32)
+#define BFQ_RQ_SEEKY(bfqd, last_pos, rq) \
+ (get_sdist(last_pos, rq) > \
+ BFQQ_SEEK_THR && \
+ (!blk_queue_nonrot(bfqd->queue) || \
+ blk_rq_sectors(rq) < BFQQ_SECT_THR_NONROT))
#define BFQQ_CLOSE_THR (sector_t)(8 * 1024)
#define BFQQ_SEEKY(bfqq) (hweight32(bfqq->seek_history) > 19)
+/*
+ * Sync random I/O is likely to be confused with soft real-time I/O,
+ * because it is characterized by limited throughput and apparently
+ * isochronous arrival pattern. To avoid false positives, queues
+ * containing only random (seeky) I/O are prevented from being tagged
+ * as soft real-time.
+ */
+#define BFQQ_TOTALLY_SEEKY(bfqq) (bfqq->seek_history == -1)
/* Min number of samples required to perform peak-rate update */
#define BFQ_RATE_MIN_SAMPLES 32
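
The three seekiness macros above all read a per-queue 32-bit history of recent requests. As a rough sketch of that bookkeeping (the real update is done in bfq_update_io_seektime(); the helper name below is invented for illustration):

#include <linux/types.h>

/* One history bit per observed request, newest bit shifted in at the bottom. */
static void record_seekiness(u32 *seek_history, bool seeky)
{
	*seek_history = (*seek_history << 1) | seeky;
}

/*
 * With this encoding:
 *  - BFQQ_SEEKY() fires once more than 19 of the last 32 requests were
 *    seeky (hweight32() counts the set bits);
 *  - BFQQ_TOTALLY_SEEKY() fires only when every one of the 32 recorded
 *    requests was seeky (seek_history == -1, all bits set), which is the
 *    condition used below to keep purely random I/O out of the soft
 *    real-time class.
 */
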
@@ -428,7 +441,7 @@
/*
* Lifted from AS - choose which of rq1 and rq2 that is best served now.
- * We choose the request that is closesr to the head right now. Distance
+ * We choose the request that is closer to the head right now. Distance
* behind the head is penalized and only allowed to a certain extent.
*/
static struct request *bfq_choose_req(struct bfq_data *bfqd,
@@ -590,7 +603,16 @@
bfq_merge_time_limit);
}
-void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+/*
+ * The following function is not marked as __cold because it is
+ * actually cold, but for the same performance goal described in the
+ * comments on the likely() at the beginning of
+ * bfq_setup_cooperator(). Unexpectedly, to reach an even lower
+ * execution time for the case where this function is not invoked, we
+ * had to add an unlikely() in each involved if().
+ */
+void __cold
+bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq)
{
struct rb_node **p, *parent;
struct bfq_queue *__bfqq;
@@ -624,59 +646,73 @@
}
/*
- * Tell whether there are active queues or groups with differentiated weights.
- */
-static bool bfq_differentiated_weights(struct bfq_data *bfqd)
-{
- /*
- * For weights to differ, at least one of the trees must contain
- * at least two nodes.
- */
- return (!RB_EMPTY_ROOT(&bfqd->queue_weights_tree) &&
- (bfqd->queue_weights_tree.rb_node->rb_left ||
- bfqd->queue_weights_tree.rb_node->rb_right)
-#ifdef CONFIG_BFQ_GROUP_IOSCHED
- ) ||
- (!RB_EMPTY_ROOT(&bfqd->group_weights_tree) &&
- (bfqd->group_weights_tree.rb_node->rb_left ||
- bfqd->group_weights_tree.rb_node->rb_right)
-#endif
- );
-}
-
-/*
- * The following function returns true if every queue must receive the
- * same share of the throughput (this condition is used when deciding
- * whether idling may be disabled, see the comments in the function
- * bfq_better_to_idle()).
+ * The following function returns false either if every active queue
+ * must receive the same share of the throughput (symmetric scenario),
+ * or, as a special case, if bfqq must receive a share of the
+ * throughput lower than or equal to the share that every other active
+ * queue must receive. If bfqq does sync I/O, then these are the only
+ * two cases where bfqq happens to be guaranteed its share of the
+ * throughput even if I/O dispatching is not plugged when bfqq remains
+ * temporarily empty (for more details, see the comments in the
+ * function bfq_better_to_idle()). For this reason, the return value
+ * of this function is used to check whether I/O-dispatch plugging can
+ * be avoided.
*
- * Such a scenario occurs when:
+ * The above first case (symmetric scenario) occurs when:
* 1) all active queues have the same weight,
- * 2) all active groups at the same level in the groups tree have the same
- * weight,
+ * 2) all active queues belong to the same I/O-priority class,
* 3) all active groups at the same level in the groups tree have the same
+ * weight,
+ * 4) all active groups at the same level in the groups tree have the same
* number of children.
*
- * Unfortunately, keeping the necessary state for evaluating exactly the
- * above symmetry conditions would be quite complex and time-consuming.
- * Therefore this function evaluates, instead, the following stronger
- * sub-conditions, for which it is much easier to maintain the needed
- * state:
+ * Unfortunately, keeping the necessary state for evaluating exactly
+ * the last two symmetry sub-conditions above would be quite complex
+ * and time consuming. Therefore this function evaluates, instead,
+ * only the following stronger three sub-conditions, for which it is
+ * much easier to maintain the needed state:
* 1) all active queues have the same weight,
- * 2) all active groups have the same weight,
- * 3) all active groups have at most one active child each.
- * In particular, the last two conditions are always true if hierarchical
- * support and the cgroups interface are not enabled, thus no state needs
- * to be maintained in this case.
+ * 2) all active queues belong to the same I/O-priority class,
+ * 3) there are no active groups.
+ * In particular, the last condition is always true if hierarchical
+ * support or the cgroups interface are not enabled, thus no state
+ * needs to be maintained in this case.
*/
-static bool bfq_symmetric_scenario(struct bfq_data *bfqd)
+static bool bfq_asymmetric_scenario(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq)
{
- return !bfq_differentiated_weights(bfqd);
+ bool smallest_weight = bfqq &&
+ bfqq->weight_counter &&
+ bfqq->weight_counter ==
+ container_of(
+ rb_first_cached(&bfqd->queue_weights_tree),
+ struct bfq_weight_counter,
+ weights_node);
+
+ /*
+ * For queue weights to differ, queue_weights_tree must contain
+ * at least two nodes.
+ */
+ bool varied_queue_weights = !smallest_weight &&
+ !RB_EMPTY_ROOT(&bfqd->queue_weights_tree.rb_root) &&
+ (bfqd->queue_weights_tree.rb_root.rb_node->rb_left ||
+ bfqd->queue_weights_tree.rb_root.rb_node->rb_right);
+
+ bool multiple_classes_busy =
+ (bfqd->busy_queues[0] && bfqd->busy_queues[1]) ||
+ (bfqd->busy_queues[0] && bfqd->busy_queues[2]) ||
+ (bfqd->busy_queues[1] && bfqd->busy_queues[2]);
+
+ return varied_queue_weights || multiple_classes_busy
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
+ || bfqd->num_groups_with_pending_reqs > 0
+#endif
+ ;
}
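
bfq_asymmetric_scenario() above depends on busy_queues[] now being split per I/O-priority class, and several hunks below call bfq_tot_busy_queues() instead of reading a single counter. That helper is not visible in this excerpt; presumably (in bfq-iosched.h, added by the same series) it is just the per-class sum, roughly:

/* Presumed shape of the helper used below; shown here for readability. */
static inline unsigned int bfq_tot_busy_queues_sketch(struct bfq_data *bfqd)
{
	/* busy_queues[] is indexed by ioprio class (RT, BE, IDLE). */
	return bfqd->busy_queues[0] + bfqd->busy_queues[1] +
	       bfqd->busy_queues[2];
}
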
/*
* If the weight-counter tree passed as input contains no counter for
- * the weight of the input entity, then add that counter; otherwise just
+ * the weight of the input queue, then add that counter; otherwise just
* increment the existing counter.
*
* Note that weight-counter trees contain few nodes in mostly symmetric
@@ -687,25 +723,26 @@
* In most scenarios, the rate at which nodes are created/destroyed
* should be low too.
*/
-void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_entity *entity,
- struct rb_root *root)
+void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ struct rb_root_cached *root)
{
- struct rb_node **new = &(root->rb_node), *parent = NULL;
+ struct bfq_entity *entity = &bfqq->entity;
+ struct rb_node **new = &(root->rb_root.rb_node), *parent = NULL;
+ bool leftmost = true;
/*
- * Do not insert if the entity is already associated with a
+ * Do not insert if the queue is already associated with a
* counter, which happens if:
- * 1) the entity is associated with a queue,
- * 2) a request arrival has caused the queue to become both
+ * 1) a request arrival has caused the queue to become both
* non-weight-raised, and hence change its weight, and
* backlogged; in this respect, each of the two events
* causes an invocation of this function,
- * 3) this is the invocation of this function caused by the
+ * 2) this is the invocation of this function caused by the
* second event. This second invocation is actually useless,
* and we handle this fact by exiting immediately. More
* efficient or clearer solutions might possibly be adopted.
*/
- if (entity->weight_counter)
+ if (bfqq->weight_counter)
return;
while (*new) {
@@ -715,77 +752,79 @@
parent = *new;
if (entity->weight == __counter->weight) {
- entity->weight_counter = __counter;
+ bfqq->weight_counter = __counter;
goto inc_counter;
}
if (entity->weight < __counter->weight)
new = &((*new)->rb_left);
- else
+ else {
new = &((*new)->rb_right);
+ leftmost = false;
+ }
}
- entity->weight_counter = kzalloc(sizeof(struct bfq_weight_counter),
- GFP_ATOMIC);
+ bfqq->weight_counter = kzalloc(sizeof(struct bfq_weight_counter),
+ GFP_ATOMIC);
/*
* In the unlucky event of an allocation failure, we just
- * exit. This will cause the weight of entity to not be
- * considered in bfq_differentiated_weights, which, in its
- * turn, causes the scenario to be deemed wrongly symmetric in
- * case entity's weight would have been the only weight making
- * the scenario asymmetric. On the bright side, no unbalance
- * will however occur when entity becomes inactive again (the
+ * exit. This will cause the weight of queue to not be
+ * considered in bfq_asymmetric_scenario, which, in its turn,
+ * causes the scenario to be deemed wrongly symmetric in case
+ * bfqq's weight would have been the only weight making the
+ * scenario asymmetric. On the bright side, no unbalance will
+ * however occur when bfqq becomes inactive again (the
* invocation of this function is triggered by an activation
- * of entity). In fact, bfq_weights_tree_remove does nothing
- * if !entity->weight_counter.
+ * of queue). In fact, bfq_weights_tree_remove does nothing
+ * if !bfqq->weight_counter.
*/
- if (unlikely(!entity->weight_counter))
+ if (unlikely(!bfqq->weight_counter))
return;
- entity->weight_counter->weight = entity->weight;
- rb_link_node(&entity->weight_counter->weights_node, parent, new);
- rb_insert_color(&entity->weight_counter->weights_node, root);
+ bfqq->weight_counter->weight = entity->weight;
+ rb_link_node(&bfqq->weight_counter->weights_node, parent, new);
+ rb_insert_color_cached(&bfqq->weight_counter->weights_node, root,
+ leftmost);
inc_counter:
- entity->weight_counter->num_active++;
+ bfqq->weight_counter->num_active++;
+ bfqq->ref++;
}
/*
- * Decrement the weight counter associated with the entity, and, if the
+ * Decrement the weight counter associated with the queue, and, if the
* counter reaches 0, remove the counter from the tree.
* See the comments to the function bfq_weights_tree_add() for considerations
* about overhead.
*/
void __bfq_weights_tree_remove(struct bfq_data *bfqd,
- struct bfq_entity *entity,
- struct rb_root *root)
+ struct bfq_queue *bfqq,
+ struct rb_root_cached *root)
{
- if (!entity->weight_counter)
+ if (!bfqq->weight_counter)
return;
- entity->weight_counter->num_active--;
- if (entity->weight_counter->num_active > 0)
+ bfqq->weight_counter->num_active--;
+ if (bfqq->weight_counter->num_active > 0)
goto reset_entity_pointer;
- rb_erase(&entity->weight_counter->weights_node, root);
- kfree(entity->weight_counter);
+ rb_erase_cached(&bfqq->weight_counter->weights_node, root);
+ kfree(bfqq->weight_counter);
reset_entity_pointer:
- entity->weight_counter = NULL;
+ bfqq->weight_counter = NULL;
+ bfq_put_queue(bfqq);
}
/*
- * Invoke __bfq_weights_tree_remove on bfqq and all its inactive
- * parent entities.
+ * Invoke __bfq_weights_tree_remove on bfqq and decrement the number
+ * of active groups for each queue's inactive parent entity.
*/
void bfq_weights_tree_remove(struct bfq_data *bfqd,
struct bfq_queue *bfqq)
{
struct bfq_entity *entity = bfqq->entity.parent;
- __bfq_weights_tree_remove(bfqd, &bfqq->entity,
- &bfqd->queue_weights_tree);
-
for_each_entity(entity) {
struct bfq_sched_data *sd = entity->my_sched_data;
@@ -797,18 +836,37 @@
* next_in_service for details on why
* in_service_entity must be checked too).
*
- * As a consequence, the weight of entity is
- * not to be removed. In addition, if entity
- * is active, then its parent entities are
- * active as well, and thus their weights are
- * not to be removed either. In the end, this
- * loop must stop here.
+ * As a consequence, its parent entities are
+ * active as well, and thus this loop must
+ * stop here.
*/
break;
}
- __bfq_weights_tree_remove(bfqd, entity,
- &bfqd->group_weights_tree);
+
+ /*
+ * The decrement of num_groups_with_pending_reqs is
+ * not performed immediately upon the deactivation of
+ * entity, but it is delayed to when it also happens
+ * that the first leaf descendant bfqq of entity gets
+ * all its pending requests completed. The following
+ * instructions perform this delayed decrement, if
+ * needed. See the comments on
+ * num_groups_with_pending_reqs for details.
+ */
+ if (entity->in_groups_with_pending_reqs) {
+ entity->in_groups_with_pending_reqs = false;
+ bfqd->num_groups_with_pending_reqs--;
+ }
}
+
+ /*
+ * Next function is invoked last, because it causes bfqq to be
+ * freed if the following holds: bfqq is not in service and
+ * has no dispatched request. DO NOT use bfqq after the next
+ * function invocation.
+ */
+ __bfq_weights_tree_remove(bfqd, bfqq,
+ &bfqd->queue_weights_tree);
}
/*
@@ -864,7 +922,8 @@
static unsigned long bfq_serv_to_charge(struct request *rq,
struct bfq_queue *bfqq)
{
- if (bfq_bfqq_sync(bfqq) || bfqq->wr_coeff > 1)
+ if (bfq_bfqq_sync(bfqq) || bfqq->wr_coeff > 1 ||
+ bfq_asymmetric_scenario(bfqq->bfqd, bfqq))
return blk_rq_sectors(rq);
return blk_rq_sectors(rq) * bfq_async_charge_factor;
@@ -898,8 +957,10 @@
*/
return;
- new_budget = max_t(unsigned long, bfqq->max_budget,
- bfq_serv_to_charge(next_rq, bfqq));
+ new_budget = max_t(unsigned long,
+ max_t(unsigned long, bfqq->max_budget,
+ bfq_serv_to_charge(next_rq, bfqq)),
+ entity->service);
if (entity->budget != new_budget) {
entity->budget = new_budget;
bfq_log_bfqq(bfqd, bfqq, "updated next rq: new budget %lu",
@@ -928,7 +989,7 @@
* of several files
* mplayer took 23 seconds to start, if constantly weight-raised.
*
- * As for higher values than that accomodating the above bad
+ * As for higher values than that accommodating the above bad
* scenario, tests show that higher values would often yield
* the opposite of the desired result, i.e., would worsen
* responsiveness by allowing non-interactive applications to
@@ -967,6 +1028,7 @@
else
bfq_clear_bfqq_IO_bound(bfqq);
+ bfqq->entity.new_weight = bic->saved_weight;
bfqq->ttime = bic->saved_ttime;
bfqq->wr_coeff = bic->saved_wr_coeff;
bfqq->wr_start_at_switch_to_srt = bic->saved_wr_start_at_switch_to_srt;
@@ -1002,7 +1064,8 @@
static int bfqq_process_refs(struct bfq_queue *bfqq)
{
- return bfqq->ref - bfqq->allocated - bfqq->entity.on_st;
+ return bfqq->ref - bfqq->allocated - bfqq->entity.on_st -
+ (bfqq->weight_counter != NULL);
}
/* Empty burst list and add just bfqq (see comments on bfq_handle_burst) */
@@ -1013,8 +1076,18 @@
hlist_for_each_entry_safe(item, n, &bfqd->burst_list, burst_list_node)
hlist_del_init(&item->burst_list_node);
- hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list);
- bfqd->burst_size = 1;
+
+ /*
+ * Start the creation of a new burst list only if there is no
+ * active queue. See comments on the conditional invocation of
+ * bfq_handle_burst().
+ */
+ if (bfq_tot_busy_queues(bfqd) == 0) {
+ hlist_add_head(&bfqq->burst_list_node, &bfqd->burst_list);
+ bfqd->burst_size = 1;
+ } else
+ bfqd->burst_size = 0;
+
bfqd->burst_parent_entity = bfqq->entity.parent;
}
@@ -1070,7 +1143,8 @@
* many parallel threads/processes. Examples are systemd during boot,
* or git grep. To help these processes get their job done as soon as
* possible, it is usually better to not grant either weight-raising
- * or device idling to their queues.
+ * or device idling to their queues, unless these queues must be
+ * protected from the I/O flowing through other active queues.
*
* In this comment we describe, firstly, the reasons why this fact
* holds, and, secondly, the next function, which implements the main
@@ -1082,7 +1156,10 @@
* cumulatively served, the sooner the target job of these queues gets
* completed. As a consequence, weight-raising any of these queues,
* which also implies idling the device for it, is almost always
- * counterproductive. In most cases it just lowers throughput.
+ * counterproductive, unless there are other active queues to isolate
+ * these new queues from. If there are no other active queues, then
+ * weight-raising these new queues just lowers throughput in most
+ * cases.
*
* On the other hand, a burst of queue creations may be caused also by
* the start of an application that does not consist of a lot of
@@ -1116,14 +1193,16 @@
* are very rare. They typically occur if some service happens to
* start doing I/O exactly when the interactive task starts.
*
- * Turning back to the next function, it implements all the steps
- * needed to detect the occurrence of a large burst and to properly
- * mark all the queues belonging to it (so that they can then be
- * treated in a different way). This goal is achieved by maintaining a
- * "burst list" that holds, temporarily, the queues that belong to the
- * burst in progress. The list is then used to mark these queues as
- * belonging to a large burst if the burst does become large. The main
- * steps are the following.
+ * Turning back to the next function, it is invoked only if there are
+ * no active queues (apart from active queues that would belong to the
+ * same, possible burst bfqq would belong to), and it implements all
+ * the steps needed to detect the occurrence of a large burst and to
+ * properly mark all the queues belonging to it (so that they can then
+ * be treated in a different way). This goal is achieved by
+ * maintaining a "burst list" that holds, temporarily, the queues that
+ * belong to the burst in progress. The list is then used to mark
+ * these queues as belonging to a large burst if the burst does become
+ * large. The main steps are the following.
*
* . when the very first queue is created, the queue is inserted into the
* list (as it could be the first queue in a possible burst)
@@ -1371,7 +1450,15 @@
{
struct bfq_entity *entity = &bfqq->entity;
- if (bfq_bfqq_non_blocking_wait_rq(bfqq) && arrived_in_time) {
+ /*
+ * In the next compound condition, we check also whether there
+ * is some budget left, because otherwise there is no point in
+ * trying to go on serving bfqq with this same budget: bfqq
+ * would be expired immediately after being selected for
+ * service. This would only cause useless overhead.
+ */
+ if (bfq_bfqq_non_blocking_wait_rq(bfqq) && arrived_in_time &&
+ bfq_bfqq_budget_left(bfqq) > 0) {
/*
* We do not clear the flag non_blocking_wait_rq here, as
* the latter is used in bfq_activate_bfqq to signal
@@ -1560,6 +1647,7 @@
*/
in_burst = bfq_bfqq_in_large_burst(bfqq);
soft_rt = bfqd->bfq_wr_max_softrt_rate > 0 &&
+ !BFQQ_TOTALLY_SEEKY(bfqq) &&
!in_burst &&
time_is_before_jiffies(bfqq->soft_rt_next_start) &&
bfqq->dispatched == 0;
@@ -1656,6 +1744,72 @@
false, BFQQE_PREEMPTED);
}
+static void bfq_reset_inject_limit(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq)
+{
+ /* invalidate baseline total service time */
+ bfqq->last_serv_time_ns = 0;
+
+ /*
+ * Reset pointer in case we are waiting for
+ * some request completion.
+ */
+ bfqd->waited_rq = NULL;
+
+ /*
+ * If bfqq has a short think time, then start by setting the
+ * inject limit to 0 prudentially, because the service time of
+ * an injected I/O request may be higher than the think time
+ * of bfqq, and therefore, if one request was injected when
+ * bfqq remains empty, this injected request might delay the
+ * service of the next I/O request for bfqq significantly. In
+ * case bfqq can actually tolerate some injection, then the
+ * adaptive update will however raise the limit soon. This
+ * lucky circumstance holds exactly because bfqq has a short
+ * think time, and thus, after remaining empty, is likely to
+ * get new I/O enqueued---and then completed---before being
+ * expired. This is the very pattern that gives the
+ * limit-update algorithm the chance to measure the effect of
+ * injection on request service times, and then to update the
+ * limit accordingly.
+ *
+ * However, in the following special case, the inject limit is
+ * left to 1 even if the think time is short: bfqq's I/O is
+ * synchronized with that of some other queue, i.e., bfqq may
+ * receive new I/O only after the I/O of the other queue is
+ * completed. Keeping the inject limit to 1 allows the
+ * blocking I/O to be served while bfqq is in service. And
+ * this is very convenient both for bfqq and for overall
+ * throughput, as explained in detail in the comments in
+ * bfq_update_has_short_ttime().
+ *
+ * On the opposite end, if bfqq has a long think time, then
+ * start directly with 1, because:
+ * a) on the bright side, keeping at most one request in
+ * service in the drive is unlikely to cause any harm to the
+ * latency of bfqq's requests, as the service time of a single
+ * request is likely to be lower than the think time of bfqq;
+ * b) on the downside, after becoming empty, bfqq is likely to
+ * expire before getting its next request. With this request
+ * arrival pattern, it is very hard to sample total service
+ * times and update the inject limit accordingly (see comments
+ * on bfq_update_inject_limit()). So the limit is likely to be
+ * never, or at least seldom, updated. As a consequence, by
+ * setting the limit to 1, we avoid a situation in which no
+ * injection ever occurs for bfqq. On the downside, this proactive step
+ * further reduces chances to actually compute the baseline
+ * total service time. Thus it reduces chances to execute the
+ * limit-update algorithm and possibly raise the limit to more
+ * than 1.
+ */
+ if (bfq_bfqq_has_short_ttime(bfqq))
+ bfqq->inject_limit = 0;
+ else
+ bfqq->inject_limit = 1;
+
+ bfqq->decrease_time_jif = jiffies;
+}
+
static void bfq_add_request(struct request *rq)
{
struct bfq_queue *bfqq = RQ_BFQQ(rq);
@@ -1668,6 +1822,60 @@
bfqq->queued[rq_is_sync(rq)]++;
bfqd->queued++;
+ if (RB_EMPTY_ROOT(&bfqq->sort_list) && bfq_bfqq_sync(bfqq)) {
+ /*
+ * Periodically reset inject limit, to make sure that
+ * the latter eventually drops in case workload
+ * changes, see step (3) in the comments on
+ * bfq_update_inject_limit().
+ */
+ if (time_is_before_eq_jiffies(bfqq->decrease_time_jif +
+ msecs_to_jiffies(1000)))
+ bfq_reset_inject_limit(bfqd, bfqq);
+
+ /*
+ * The following conditions must hold to setup a new
+ * sampling of total service time, and then a new
+ * update of the inject limit:
+ * - bfqq is in service, because the total service
+ * time is evaluated only for the I/O requests of
+ * the queues in service;
+ * - this is the right occasion to compute or to
+ * lower the baseline total service time, because
+ * there are actually no requests in the drive,
+ * or
+ * the baseline total service time is available, and
+ * this is the right occasion to compute the other
+ * quantity needed to update the inject limit, i.e.,
+ * the total service time caused by the amount of
+ * injection allowed by the current value of the
+ * limit. It is the right occasion because injection
+ * has actually been performed during the service
+ * hole, and there are still in-flight requests,
+ * which are very likely to be exactly the injected
+ * requests, or part of them;
+ * - the minimum interval for sampling the total
+ * service time and updating the inject limit has
+ * elapsed.
+ */
+ if (bfqq == bfqd->in_service_queue &&
+ (bfqd->rq_in_driver == 0 ||
+ (bfqq->last_serv_time_ns > 0 &&
+ bfqd->rqs_injected && bfqd->rq_in_driver > 0)) &&
+ time_is_before_eq_jiffies(bfqq->decrease_time_jif +
+ msecs_to_jiffies(100))) {
+ bfqd->last_empty_occupied_ns = ktime_get_ns();
+ /*
+ * Start the state machine for measuring the
+ * total service time of rq: setting
+ * wait_dispatch will cause bfqd->waited_rq to
+ * be set when rq will be dispatched.
+ */
+ bfqd->wait_dispatch = true;
+ bfqd->rqs_injected = false;
+ }
+ }
+
elv_rb_add(&bfqq->sort_list, rq);
/*
@@ -1679,8 +1887,9 @@
/*
* Adjust priority tree position, if next_rq changes.
+ * See comments on bfq_pos_tree_add_move() for the unlikely().
*/
- if (prev != bfqq->next_rq)
+ if (unlikely(!bfqd->nonrot_with_queueing && prev != bfqq->next_rq))
bfq_pos_tree_add_move(bfqd, bfqq);
if (!bfq_bfqq_busy(bfqq)) /* switching to busy ... */
@@ -1820,7 +2029,9 @@
bfqq->pos_root = NULL;
}
} else {
- bfq_pos_tree_add_move(bfqd, bfqq);
+ /* see comments on bfq_pos_tree_add_move() for the unlikely() */
+ if (unlikely(!bfqd->nonrot_with_queueing))
+ bfq_pos_tree_add_move(bfqd, bfqq);
}
if (rq->cmd_flags & REQ_META)
@@ -1910,7 +2121,12 @@
*/
if (prev != bfqq->next_rq) {
bfq_updated_next_req(bfqd, bfqq);
- bfq_pos_tree_add_move(bfqd, bfqq);
+ /*
+ * See comments on bfq_pos_tree_add_move() for
+ * the unlikely().
+ */
+ if (unlikely(!bfqd->nonrot_with_queueing))
+ bfq_pos_tree_add_move(bfqd, bfqq);
}
}
}
@@ -2196,6 +2412,46 @@
struct bfq_queue *in_service_bfqq, *new_bfqq;
/*
+ * Do not perform queue merging if the device is non
+ * rotational and performs internal queueing. In fact, such a
+ * device reaches a high speed through internal parallelism
+ * and pipelining. This means that, to reach a high
+ * throughput, it must have many requests enqueued at the same
+ * time. But, in this configuration, the internal scheduling
+ * algorithm of the device does exactly the job of queue
+ * merging: it reorders requests so as to obtain as much as
+ * possible a sequential I/O pattern. As a consequence, with
+ * the workload generated by processes doing interleaved I/O,
+ * the throughput reached by the device is likely to be the
+ * same, with and without queue merging.
+ *
+ * Disabling merging also provides a remarkable benefit in
+ * terms of throughput. Merging tends to make many workloads
+ * artificially more uneven, because of shared queues
+ * remaining non empty for incomparably more time than
+ * non-merged queues. This may accentuate workload
+ * asymmetries. For example, if one of the queues in a set of
+ * merged queues has a higher weight than a normal queue, then
+ * the shared queue may inherit such a high weight and, by
+ * staying almost always active, may force BFQ to perform I/O
+ * plugging most of the time. This evidently makes it harder
+ * for BFQ to let the device reach a high throughput.
+ *
+ * Finally, the likely() macro below is not used because one
+ * of the two branches is more likely than the other, but to
+ * have the code path after the following if() executed as
+ * fast as possible for the case of a non rotational device
+ * with queueing. We want it because this is the fastest kind
+ * of device. On the opposite end, the likely() may lengthen
+ * the execution time of BFQ for the case of slower devices
+ * (rotational or at least without queueing). But in this case
+ * the execution time of BFQ matters very little, if not at
+ * all.
+ */
+ if (likely(bfqd->nonrot_with_queueing))
+ return NULL;
+
+ /*
* Prevent bfqq from being merged if it has been created too
* long ago. The idea is that true cooperating processes, and
* thus their associated bfq_queues, are supposed to be
@@ -2216,7 +2472,7 @@
return NULL;
/* If there is only one backlogged queue, don't search. */
- if (bfqd->busy_queues == 1)
+ if (bfq_tot_busy_queues(bfqd) == 1)
return NULL;
in_service_bfqq = bfqd->in_service_queue;
@@ -2258,6 +2514,7 @@
if (!bic)
return;
+ bic->saved_weight = bfqq->entity.orig_weight;
bic->saved_ttime = bfqq->ttime;
bic->saved_has_short_ttime = bfq_bfqq_has_short_ttime(bfqq);
bic->saved_IO_bound = bfq_bfqq_IO_bound(bfqq);
@@ -2276,6 +2533,7 @@
* to enjoy weight raising if split soon.
*/
bic->saved_wr_coeff = bfqq->bfqd->bfq_wr_coeff;
+ bic->saved_wr_start_at_switch_to_srt = bfq_smallest_from_now();
bic->saved_wr_cur_max_time = bfq_wr_duration(bfqq->bfqd);
bic->saved_last_wr_start_finish = jiffies;
} else {
@@ -2346,6 +2604,16 @@
* assignment causes no harm).
*/
new_bfqq->bic = NULL;
+ /*
+ * If the queue is shared, the pid is the pid of one of the associated
+ * processes. Which pid depends on the exact sequence of merge events
+ * the queue underwent. So printing such a pid is useless and confusing
+ * because it reports a random pid between those of the associated
+ * processes.
+ * We mark such a queue with a pid -1, and then print SHARED instead of
+ * a pid in logging messages.
+ */
+ new_bfqq->pid = -1;
bfqq->bic = NULL;
/* release process reference to bfqq */
bfq_put_queue(bfqq);
@@ -2380,8 +2648,8 @@
/*
* bic still points to bfqq, then it has not yet been
* redirected to some other bfq_queue, and a queue
- * merge beween bfqq and new_bfqq can be safely
- * fulfillled, i.e., bic can be redirected to new_bfqq
+ * merge between bfqq and new_bfqq can be safely
+ * fulfilled, i.e., bic can be redirected to new_bfqq
* and bfqq can be put.
*/
bfq_merge_bfqqs(bfqd, bfqd->bio_bic, bfqq,
@@ -2515,12 +2783,14 @@
* queue).
*/
if (BFQQ_SEEKY(bfqq) && bfqq->wr_coeff == 1 &&
- bfq_symmetric_scenario(bfqd))
+ !bfq_asymmetric_scenario(bfqd, bfqq))
sl = min_t(u64, sl, BFQ_MIN_TT);
else if (bfqq->wr_coeff > 1)
sl = max_t(u32, sl, 20ULL * NSEC_PER_MSEC);
bfqd->last_idling_start = ktime_get();
+ bfqd->last_idling_start_jiffies = jiffies;
+
hrtimer_start(&bfqd->idle_slice_timer, ns_to_ktime(sl),
HRTIMER_MODE_REL);
bfqg_stats_set_start_idle_time(bfqq_group(bfqq));
@@ -2744,7 +3014,7 @@
if ((bfqd->rq_in_driver > 0 ||
now_ns - bfqd->last_completion < BFQ_MIN_TT)
- && get_sdist(bfqd->last_position, rq) < BFQQ_SEEK_THR)
+ && !BFQ_RQ_SEEKY(bfqd, bfqd->last_position, rq))
bfqd->sequential_samples++;
bfqd->tot_sectors_dispatched += blk_rq_sectors(rq);
@@ -2796,7 +3066,7 @@
bfq_remove_request(q, rq);
}
-static void __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
+static bool __bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq)
{
/*
* If this bfqq is shared between multiple processes, check
@@ -2822,16 +3092,20 @@
bfq_requeue_bfqq(bfqd, bfqq, true);
/*
* Resort priority tree of potential close cooperators.
+ * See comments on bfq_pos_tree_add_move() for the unlikely().
*/
- bfq_pos_tree_add_move(bfqd, bfqq);
+ if (unlikely(!bfqd->nonrot_with_queueing))
+ bfq_pos_tree_add_move(bfqd, bfqq);
}
/*
* All in-service entities must have been properly deactivated
* or requeued before executing the next function, which
- * resets all in-service entites as no more in service.
+ * resets all in-service entities as no more in service. This
+ * may cause bfqq to be freed. If this happens, the next
+ * function returns true.
*/
- __bfq_bfqd_reset_in_service(bfqd);
+ return __bfq_bfqd_reset_in_service(bfqd);
}
/**
@@ -3195,13 +3469,6 @@
jiffies + nsecs_to_jiffies(bfqq->bfqd->bfq_slice_idle) + 4);
}
-static bool bfq_bfqq_injectable(struct bfq_queue *bfqq)
-{
- return BFQQ_SEEKY(bfqq) && bfqq->wr_coeff == 1 &&
- blk_queue_nonrot(bfqq->bfqd->queue) &&
- bfqq->bfqd->hw_tag;
-}
-
/**
* bfq_bfqq_expire - expire a queue.
* @bfqd: device owning the queue.
@@ -3236,7 +3503,6 @@
bool slow;
unsigned long delta = 0;
struct bfq_entity *entity = &bfqq->entity;
- int ref;
/*
* Check whether the process is slow (see bfq_bfqq_is_slow).
@@ -3278,16 +3544,32 @@
* requests, then the request pattern is isochronous
* (see the comments on the function
* bfq_bfqq_softrt_next_start()). Thus we can compute
- * soft_rt_next_start. If, instead, the queue still
- * has outstanding requests, then we have to wait for
- * the completion of all the outstanding requests to
- * discover whether the request pattern is actually
- * isochronous.
+ * soft_rt_next_start. And we do it, unless bfqq is in
+ * interactive weight raising. We do not do it in the
+ * latter subcase, for the following reason. bfqq may
+ * be conveying the I/O needed to load a soft
+ * real-time application. Such an application will
+ * actually exhibit a soft real-time I/O pattern after
+ * it finally starts doing its job. But, if
+ * soft_rt_next_start is computed here for an
+ * interactive bfqq, and bfqq had received a lot of
+ * service before remaining with no outstanding
+ * request (likely to happen on a fast device), then
+ * soft_rt_next_start would be assigned such a high
+ * value that, for a very long time, bfqq would be
+ * prevented from being possibly considered as soft
+ * real time.
+ *
+ * If, instead, the queue still has outstanding
+ * requests, then we have to wait for the completion
+ * of all the outstanding requests to discover whether
+ * the request pattern is actually isochronous.
*/
- if (bfqq->dispatched == 0)
+ if (bfqq->dispatched == 0 &&
+ bfqq->wr_coeff != bfqd->bfq_wr_coeff)
bfqq->soft_rt_next_start =
bfq_bfqq_softrt_next_start(bfqd, bfqq);
- else {
+ else if (bfqq->dispatched > 0) {
/*
* Schedule an update of soft_rt_next_start to when
* the task may be discovered to be isochronous.
@@ -3301,18 +3583,22 @@
slow, bfqq->dispatched, bfq_bfqq_has_short_ttime(bfqq));
/*
+ * bfqq expired, so no total service time needs to be computed
+ * any longer: reset state machine for measuring total service
+ * times.
+ */
+ bfqd->rqs_injected = bfqd->wait_dispatch = false;
+ bfqd->waited_rq = NULL;
+
+ /*
* Increase, decrease or leave budget unchanged according to
* reason.
*/
__bfq_bfqq_recalc_budget(bfqd, bfqq, reason);
- ref = bfqq->ref;
- __bfq_bfqq_expire(bfqd, bfqq);
-
- if (ref == 1) /* bfqq is gone, no more actions on it */
+ if (__bfq_bfqq_expire(bfqd, bfqq))
+ /* bfqq is gone, no more actions on it */
return;
- bfqq->injected_service = 0;
-
/* mark bfqq as waiting a request only if a bic still points to it */
if (!bfq_bfqq_busy(bfqq) &&
reason != BFQQE_BUDGET_TIMEOUT &&
@@ -3380,53 +3666,13 @@
bfq_bfqq_budget_timeout(bfqq);
}
-/*
- * For a queue that becomes empty, device idling is allowed only if
- * this function returns true for the queue. As a consequence, since
- * device idling plays a critical role in both throughput boosting and
- * service guarantees, the return value of this function plays a
- * critical role in both these aspects as well.
- *
- * In a nutshell, this function returns true only if idling is
- * beneficial for throughput or, even if detrimental for throughput,
- * idling is however necessary to preserve service guarantees (low
- * latency, desired throughput distribution, ...). In particular, on
- * NCQ-capable devices, this function tries to return false, so as to
- * help keep the drives' internal queues full, whenever this helps the
- * device boost the throughput without causing any service-guarantee
- * issue.
- *
- * In more detail, the return value of this function is obtained by,
- * first, computing a number of boolean variables that take into
- * account throughput and service-guarantee issues, and, then,
- * combining these variables in a logical expression. Most of the
- * issues taken into account are not trivial. We discuss these issues
- * individually while introducing the variables.
- */
-static bool bfq_better_to_idle(struct bfq_queue *bfqq)
+static bool idling_boosts_thr_without_issues(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq)
{
- struct bfq_data *bfqd = bfqq->bfqd;
bool rot_without_queueing =
!blk_queue_nonrot(bfqd->queue) && !bfqd->hw_tag,
bfqq_sequential_and_IO_bound,
- idling_boosts_thr, idling_boosts_thr_without_issues,
- idling_needed_for_service_guarantees,
- asymmetric_scenario;
-
- if (bfqd->strict_guarantees)
- return true;
-
- /*
- * Idling is performed only if slice_idle > 0. In addition, we
- * do not idle if
- * (a) bfqq is async
- * (b) bfqq is in the idle io prio class: in this case we do
- * not idle because we want to minimize the bandwidth that
- * queues in this class can steal to higher-priority queues
- */
- if (bfqd->bfq_slice_idle == 0 || !bfq_bfqq_sync(bfqq) ||
- bfq_class_idle(bfqq))
- return false;
+ idling_boosts_thr;
bfqq_sequential_and_IO_bound = !BFQQ_SEEKY(bfqq) &&
bfq_bfqq_IO_bound(bfqq) && bfq_bfqq_has_short_ttime(bfqq);
@@ -3458,8 +3704,7 @@
bfqq_sequential_and_IO_bound);
/*
- * The value of the next variable,
- * idling_boosts_thr_without_issues, is equal to that of
+ * The return value of this function is equal to that of
* idling_boosts_thr, unless a special case holds. In this
* special case, described below, idling may cause problems to
* weight-raised queues.
@@ -3476,169 +3721,259 @@
* which enqueue several requests in advance, and further
* reorder internally-queued requests.
*
- * For this reason, we force to false the value of
- * idling_boosts_thr_without_issues if there are weight-raised
- * busy queues. In this case, and if bfqq is not weight-raised,
- * this guarantees that the device is not idled for bfqq (if,
- * instead, bfqq is weight-raised, then idling will be
- * guaranteed by another variable, see below). Combined with
- * the timestamping rules of BFQ (see [1] for details), this
- * behavior causes bfqq, and hence any sync non-weight-raised
- * queue, to get a lower number of requests served, and thus
- * to ask for a lower number of requests from the request
- * pool, before the busy weight-raised queues get served
- * again. This often mitigates starvation problems in the
- * presence of heavy write workloads and NCQ, thereby
- * guaranteeing a higher application and system responsiveness
- * in these hostile scenarios.
+ * For this reason, we force to false the return value if
+ * there are weight-raised busy queues. In this case, and if
+ * bfqq is not weight-raised, this guarantees that the device
+ * is not idled for bfqq (if, instead, bfqq is weight-raised,
+ * then idling will be guaranteed by another variable, see
+ * below). Combined with the timestamping rules of BFQ (see
+ * [1] for details), this behavior causes bfqq, and hence any
+ * sync non-weight-raised queue, to get a lower number of
+ * requests served, and thus to ask for a lower number of
+ * requests from the request pool, before the busy
+ * weight-raised queues get served again. This often mitigates
+ * starvation problems in the presence of heavy write
+ * workloads and NCQ, thereby guaranteeing a higher
+ * application and system responsiveness in these hostile
+ * scenarios.
*/
- idling_boosts_thr_without_issues = idling_boosts_thr &&
+ return idling_boosts_thr &&
bfqd->wr_busy_queues == 0;
+}
+
+/*
+ * There is a case where idling does not have to be performed for
+ * throughput concerns, but to preserve the throughput share of
+ * the process associated with bfqq.
+ *
+ * To introduce this case, we can note that allowing the drive
+ * to enqueue more than one request at a time, and hence
+ * delegating de facto final scheduling decisions to the
+ * drive's internal scheduler, entails loss of control on the
+ * actual request service order. In particular, the critical
+ * situation is when requests from different processes happen
+ * to be present, at the same time, in the internal queue(s)
+ * of the drive. In such a situation, the drive, by deciding
+ * the service order of the internally-queued requests, does
+ * determine also the actual throughput distribution among
+ * these processes. But the drive typically has no notion or
+ * concern about per-process throughput distribution, and
+ * makes its decisions only on a per-request basis. Therefore,
+ * the service distribution enforced by the drive's internal
+ * scheduler is likely to coincide with the desired throughput
+ * distribution only in a completely symmetric, or favorably
+ * skewed scenario where:
+ * (i-a) each of these processes must get the same throughput as
+ * the others,
+ * (i-b) in case (i-a) does not hold, it holds that the process
+ * associated with bfqq must receive a lower or equal
+ * throughput than any of the other processes;
+ * (ii) the I/O of each process has the same properties, in
+ * terms of locality (sequential or random), direction
+ * (reads or writes), request sizes, greediness
+ * (from I/O-bound to sporadic), and so on.
+ *
+ * In fact, in such a scenario, the drive tends to treat the requests
+ * of each process in about the same way as the requests of the
+ * others, and thus to provide each of these processes with about the
+ * same throughput. This is exactly the desired throughput
+ * distribution if (i-a) holds, or, if (i-b) holds instead, this is an
+ * even more convenient distribution for (the process associated with)
+ * bfqq.
+ *
+ * In contrast, in any asymmetric or unfavorable scenario, device
+ * idling (I/O-dispatch plugging) is certainly needed to guarantee
+ * that bfqq receives its assigned fraction of the device throughput
+ * (see [1] for details).
+ *
+ * The problem is that idling may significantly reduce throughput with
+ * certain combinations of types of I/O and devices. An important
+ * example is sync random I/O on flash storage with command
+ * queueing. So, unless bfqq falls in cases where idling also boosts
+ * throughput, it is important to check conditions (i-a), i(-b) and
+ * (ii) accurately, so as to avoid idling when not strictly needed for
+ * service guarantees.
+ *
+ * Unfortunately, it is extremely difficult to thoroughly check
+ * condition (ii). And, in case there are active groups, it becomes
+ * very difficult to check conditions (i-a) and (i-b) too. In fact,
+ * if there are active groups, then, for conditions (i-a) or (i-b) to
+ * become false 'indirectly', it is enough that an active group
+ * contains more active processes or sub-groups than some other active
+ * group. More precisely, for conditions (i-a) or (i-b) to become
+ * false because of such a group, it is not even necessary that the
+ * group is (still) active: it is sufficient that, even if the group
+ * has become inactive, some of its descendant processes still have
+ * some request already dispatched but still waiting for
+ * completion. In fact, requests have still to be guaranteed their
+ * share of the throughput even after being dispatched. In this
+ * respect, it is easy to show that, if a group frequently becomes
+ * inactive while still having in-flight requests, and if, when this
+ * happens, the group is not considered in the calculation of whether
+ * the scenario is asymmetric, then the group may fail to be
+ * guaranteed its fair share of the throughput (basically because
+ * idling may not be performed for the descendant processes of the
+ * group, but it had to be). We address this issue with the following
+ * bi-modal behavior, implemented in the function
+ * bfq_asymmetric_scenario().
+ *
+ * If there are groups with requests waiting for completion
+ * (as commented above, some of these groups may even be
+ * already inactive), then the scenario is tagged as
+ * asymmetric, conservatively, without checking any of the
+ * conditions (i-a), (i-b) or (ii). So the device is idled for bfqq.
+ * This behavior matches also the fact that groups are created
+ * exactly if controlling I/O is a primary concern (to
+ * preserve bandwidth and latency guarantees).
+ *
+ * On the opposite end, if there are no groups with requests waiting
+ * for completion, then only conditions (i-a) and (i-b) are actually
+ * controlled, i.e., provided that conditions (i-a) or (i-b) holds,
+ * idling is not performed, regardless of whether condition (ii)
+ * holds. In other words, only if conditions (i-a) and (i-b) do not
+ * hold, then idling is allowed, and the device tends to be prevented
+ * from queueing many requests, possibly of several processes. Since
+ * there are no groups with requests waiting for completion, then, to
+ * control conditions (i-a) and (i-b) it is enough to check just
+ * whether all the queues with requests waiting for completion also
+ * have the same weight.
+ *
+ * Not checking condition (ii) evidently exposes bfqq to the
+ * risk of getting less throughput than its fair share.
+ * However, for queues with the same weight, a further
+ * mechanism, preemption, mitigates or even eliminates this
+ * problem. And it does so without consequences on overall
+ * throughput. This mechanism and its benefits are explained
+ * in the next three paragraphs.
+ *
+ * Even if a queue, say Q, is expired when it remains idle, Q
+ * can still preempt the new in-service queue if the next
+ * request of Q arrives soon (see the comments on
+ * bfq_bfqq_update_budg_for_activation). If all queues and
+ * groups have the same weight, this form of preemption,
+ * combined with the hole-recovery heuristic described in the
+ * comments on function bfq_bfqq_update_budg_for_activation,
+ * are enough to preserve a correct bandwidth distribution in
+ * the mid term, even without idling. In fact, even if not
+ * idling allows the internal queues of the device to contain
+ * many requests, and thus to reorder requests, we can rather
+ * safely assume that the internal scheduler still preserves a
+ * minimum of mid-term fairness.
+ *
+ * More precisely, this preemption-based, idleless approach
+ * provides fairness in terms of IOPS, and not sectors per
+ * second. This can be seen with a simple example. Suppose
+ * that there are two queues with the same weight, but that
+ * the first queue receives requests of 8 sectors, while the
+ * second queue receives requests of 1024 sectors. In
+ * addition, suppose that each of the two queues contains at
+ * most one request at a time, which implies that each queue
+ * always remains idle after it is served. Finally, after
+ * remaining idle, each queue receives very quickly a new
+ * request. It follows that the two queues are served
+ * alternately, preempting each other if needed. This
+ * implies that, although both queues have the same weight,
+ * the queue with large requests receives a service that is
+ * 1024/8 times as high as the service received by the other
+ * queue.
+ *
+ * The motivation for using preemption instead of idling (for
+ * queues with the same weight) is that, by not idling,
+ * service guarantees are preserved (completely or at least in
+ * part) without minimally sacrificing throughput. And, if
+ * there is no active group, then the primary expectation for
+ * this device is probably a high throughput.
+ *
+ * We are now left only with explaining the additional
+ * compound condition that is checked below for deciding
+ * whether the scenario is asymmetric. To explain this
+ * compound condition, we need to add that the function
+ * bfq_asymmetric_scenario checks the weights of only
+ * non-weight-raised queues, for efficiency reasons (see
+ * comments on bfq_weights_tree_add()). Then the fact that
+ * bfqq is weight-raised is checked explicitly here. More
+ * precisely, the compound condition below takes into account
+ * also the fact that, even if bfqq is being weight-raised,
+ * the scenario is still symmetric if all queues with requests
+ * waiting for completion happen to be
+ * weight-raised. Actually, we should be even more precise
+ * here, and differentiate between interactive weight raising
+ * and soft real-time weight raising.
+ *
+ * As a side note, it is worth considering that the above
+ * device-idling countermeasures may however fail in the
+ * following unlucky scenario: if idling is (correctly)
+ * disabled in a time period during which all symmetry
+ * sub-conditions hold, and hence the device is allowed to
+ * enqueue many requests, but at some later point in time some
+ * sub-condition ceases to hold, then it may become impossible
+ * to let requests be served in the desired order until all
+ * the requests already queued in the device have been served.
+ */
+static bool idling_needed_for_service_guarantees(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq)
+{
+ return (bfqq->wr_coeff > 1 &&
+ bfqd->wr_busy_queues <
+ bfq_tot_busy_queues(bfqd)) ||
+ bfq_asymmetric_scenario(bfqd, bfqq);
+}
+
+/*
+ * For a queue that becomes empty, device idling is allowed only if
+ * this function returns true for that queue. As a consequence, since
+ * device idling plays a critical role for both throughput boosting
+ * and service guarantees, the return value of this function plays a
+ * critical role as well.
+ *
+ * In a nutshell, this function returns true only if idling is
+ * beneficial for throughput or, even if detrimental for throughput,
+ * idling is however necessary to preserve service guarantees (low
+ * latency, desired throughput distribution, ...). In particular, on
+ * NCQ-capable devices, this function tries to return false, so as to
+ * help keep the drives' internal queues full, whenever this helps the
+ * device boost the throughput without causing any service-guarantee
+ * issue.
+ *
+ * Most of the issues taken into account to get the return value of
+ * this function are not trivial. We discuss these issues in the two
+ * functions providing the main pieces of information needed by this
+ * function.
+ */
+static bool bfq_better_to_idle(struct bfq_queue *bfqq)
+{
+ struct bfq_data *bfqd = bfqq->bfqd;
+ bool idling_boosts_thr_with_no_issue, idling_needed_for_service_guar;
+
+ if (unlikely(bfqd->strict_guarantees))
+ return true;
/*
- * There is then a case where idling must be performed not
- * for throughput concerns, but to preserve service
- * guarantees.
- *
- * To introduce this case, we can note that allowing the drive
- * to enqueue more than one request at a time, and hence
- * delegating de facto final scheduling decisions to the
- * drive's internal scheduler, entails loss of control on the
- * actual request service order. In particular, the critical
- * situation is when requests from different processes happen
- * to be present, at the same time, in the internal queue(s)
- * of the drive. In such a situation, the drive, by deciding
- * the service order of the internally-queued requests, does
- * determine also the actual throughput distribution among
- * these processes. But the drive typically has no notion or
- * concern about per-process throughput distribution, and
- * makes its decisions only on a per-request basis. Therefore,
- * the service distribution enforced by the drive's internal
- * scheduler is likely to coincide with the desired
- * device-throughput distribution only in a completely
- * symmetric scenario where:
- * (i) each of these processes must get the same throughput as
- * the others;
- * (ii) all these processes have the same I/O pattern
- (either sequential or random).
- * In fact, in such a scenario, the drive will tend to treat
- * the requests of each of these processes in about the same
- * way as the requests of the others, and thus to provide
- * each of these processes with about the same throughput
- * (which is exactly the desired throughput distribution). In
- * contrast, in any asymmetric scenario, device idling is
- * certainly needed to guarantee that bfqq receives its
- * assigned fraction of the device throughput (see [1] for
- * details).
- *
- * We address this issue by controlling, actually, only the
- * symmetry sub-condition (i), i.e., provided that
- * sub-condition (i) holds, idling is not performed,
- * regardless of whether sub-condition (ii) holds. In other
- * words, only if sub-condition (i) holds, then idling is
- * allowed, and the device tends to be prevented from queueing
- * many requests, possibly of several processes. The reason
- * for not controlling also sub-condition (ii) is that we
- * exploit preemption to preserve guarantees in case of
- * symmetric scenarios, even if (ii) does not hold, as
- * explained in the next two paragraphs.
- *
- * Even if a queue, say Q, is expired when it remains idle, Q
- * can still preempt the new in-service queue if the next
- * request of Q arrives soon (see the comments on
- * bfq_bfqq_update_budg_for_activation). If all queues and
- * groups have the same weight, this form of preemption,
- * combined with the hole-recovery heuristic described in the
- * comments on function bfq_bfqq_update_budg_for_activation,
- * are enough to preserve a correct bandwidth distribution in
- * the mid term, even without idling. In fact, even if not
- * idling allows the internal queues of the device to contain
- * many requests, and thus to reorder requests, we can rather
- * safely assume that the internal scheduler still preserves a
- * minimum of mid-term fairness. The motivation for using
- * preemption instead of idling is that, by not idling,
- * service guarantees are preserved without minimally
- * sacrificing throughput. In other words, both a high
- * throughput and its desired distribution are obtained.
- *
- * More precisely, this preemption-based, idleless approach
- * provides fairness in terms of IOPS, and not sectors per
- * second. This can be seen with a simple example. Suppose
- * that there are two queues with the same weight, but that
- * the first queue receives requests of 8 sectors, while the
- * second queue receives requests of 1024 sectors. In
- * addition, suppose that each of the two queues contains at
- * most one request at a time, which implies that each queue
- * always remains idle after it is served. Finally, after
- * remaining idle, each queue receives very quickly a new
- * request. It follows that the two queues are served
- * alternatively, preempting each other if needed. This
- * implies that, although both queues have the same weight,
- * the queue with large requests receives a service that is
- * 1024/8 times as high as the service received by the other
- * queue.
- *
- * On the other hand, device idling is performed, and thus
- * pure sector-domain guarantees are provided, for the
- * following queues, which are likely to need stronger
- * throughput guarantees: weight-raised queues, and queues
- * with a higher weight than other queues. When such queues
- * are active, sub-condition (i) is false, which triggers
- * device idling.
- *
- * According to the above considerations, the next variable is
- * true (only) if sub-condition (i) holds. To compute the
- * value of this variable, we not only use the return value of
- * the function bfq_symmetric_scenario(), but also check
- * whether bfqq is being weight-raised, because
- * bfq_symmetric_scenario() does not take into account also
- * weight-raised queues (see comments on
- * bfq_weights_tree_add()). In particular, if bfqq is being
- * weight-raised, it is important to idle only if there are
- * other, non-weight-raised queues that may steal throughput
- * to bfqq. Actually, we should be even more precise, and
- * differentiate between interactive weight raising and
- * soft real-time weight raising.
- *
- * As a side note, it is worth considering that the above
- * device-idling countermeasures may however fail in the
- * following unlucky scenario: if idling is (correctly)
- * disabled in a time period during which all symmetry
- * sub-conditions hold, and hence the device is allowed to
- * enqueue many requests, but at some later point in time some
- * sub-condition stops to hold, then it may become impossible
- * to let requests be served in the desired order until all
- * the requests already queued in the device have been served.
+ * Idling is performed only if slice_idle > 0. In addition, we
+ * do not idle if
+ * (a) bfqq is async
+ * (b) bfqq is in the idle io prio class: in this case we do
+ * not idle because we want to minimize the bandwidth that
+ * queues in this class can steal to higher-priority queues
*/
- asymmetric_scenario = (bfqq->wr_coeff > 1 &&
- bfqd->wr_busy_queues < bfqd->busy_queues) ||
- !bfq_symmetric_scenario(bfqd);
+ if (bfqd->bfq_slice_idle == 0 || !bfq_bfqq_sync(bfqq) ||
+ bfq_class_idle(bfqq))
+ return false;
+
+ idling_boosts_thr_with_no_issue =
+ idling_boosts_thr_without_issues(bfqd, bfqq);
+
+ idling_needed_for_service_guar =
+ idling_needed_for_service_guarantees(bfqd, bfqq);
/*
- * Finally, there is a case where maximizing throughput is the
- * best choice even if it may cause unfairness toward
- * bfqq. Such a case is when bfqq became active in a burst of
- * queue activations. Queues that became active during a large
- * burst benefit only from throughput, as discussed in the
- * comments on bfq_handle_burst. Thus, if bfqq became active
- * in a burst and not idling the device maximizes throughput,
- * then the device must no be idled, because not idling the
- * device provides bfqq and all other queues in the burst with
- * maximum benefit. Combining this and the above case, we can
- * now establish when idling is actually needed to preserve
- * service guarantees.
- */
- idling_needed_for_service_guarantees =
- asymmetric_scenario && !bfq_bfqq_in_large_burst(bfqq);
-
- /*
- * We have now all the components we need to compute the
+ * We have now the two components we need to compute the
* return value of the function, which is true only if idling
* either boosts the throughput (without issues), or is
* necessary to preserve service guarantees.
*/
- return idling_boosts_thr_without_issues ||
- idling_needed_for_service_guarantees;
+ return idling_boosts_thr_with_no_issue ||
+ idling_needed_for_service_guar;
}
/*
@@ -3657,26 +3992,98 @@
return RB_EMPTY_ROOT(&bfqq->sort_list) && bfq_better_to_idle(bfqq);
}
-static struct bfq_queue *bfq_choose_bfqq_for_injection(struct bfq_data *bfqd)
+/*
+ * This function chooses the queue from which to pick the next extra
+ * I/O request to inject, if it finds a compatible queue. See the
+ * comments on bfq_update_inject_limit() for details on the injection
+ * mechanism, and for the definitions of the quantities mentioned
+ * below.
+ */
+static struct bfq_queue *
+bfq_choose_bfqq_for_injection(struct bfq_data *bfqd)
{
- struct bfq_queue *bfqq;
+ struct bfq_queue *bfqq, *in_serv_bfqq = bfqd->in_service_queue;
+ unsigned int limit = in_serv_bfqq->inject_limit;
+ /*
+ * If
+ * - bfqq is not weight-raised and therefore does not carry
+ * time-critical I/O,
+ * or
+ * - regardless of whether bfqq is weight-raised, bfqq has
+ * however a long think time, during which it can absorb the
+ * effect of an appropriate number of extra I/O requests
+ * from other queues (see bfq_update_inject_limit for
+ * details on the computation of this number);
+ * then injection can be performed without restrictions.
+ */
+ bool in_serv_always_inject = in_serv_bfqq->wr_coeff == 1 ||
+ !bfq_bfqq_has_short_ttime(in_serv_bfqq);
/*
- * A linear search; but, with a high probability, very few
- * steps are needed to find a candidate queue, i.e., a queue
- * with enough budget left for its next request. In fact:
+ * If
+ * - the baseline total service time could not be sampled yet,
+ * so the inject limit happens to be still 0, and
+ * - a lot of time has elapsed since the plugging of I/O
+ * dispatching started, so drive speed is being wasted
+ * significantly;
+ * then temporarily raise inject limit to one request.
+ */
+ if (limit == 0 && in_serv_bfqq->last_serv_time_ns == 0 &&
+ bfq_bfqq_wait_request(in_serv_bfqq) &&
+ time_is_before_eq_jiffies(bfqd->last_idling_start_jiffies +
+ bfqd->bfq_slice_idle)
+ )
+ limit = 1;
+
+ if (bfqd->rq_in_driver >= limit)
+ return NULL;
+
+ /*
+ * Linear search of the source queue for injection; but, with
+ * a high probability, very few steps are needed to find a
+ * candidate queue, i.e., a queue with enough budget left for
+ * its next request. In fact:
* - BFQ dynamically updates the budget of every queue so as
* to accommodate the expected backlog of the queue;
* - if a queue gets all its requests dispatched as injected
* service, then the queue is removed from the active list
- * (and re-added only if it gets new requests, but with
- * enough budget for its new backlog).
+ * (and re-added only if it gets new requests, but then it
+ * is assigned again enough budget for its new backlog).
*/
list_for_each_entry(bfqq, &bfqd->active_list, bfqq_list)
if (!RB_EMPTY_ROOT(&bfqq->sort_list) &&
+ (in_serv_always_inject || bfqq->wr_coeff > 1) &&
bfq_serv_to_charge(bfqq->next_rq, bfqq) <=
- bfq_bfqq_budget_left(bfqq))
- return bfqq;
+ bfq_bfqq_budget_left(bfqq)) {
+ /*
+ * Allow for only one large in-flight request
+ * on non-rotational devices, for the
+ * following reason. On non-rotational drives,
+ * large requests take much longer than
+ * smaller requests to be served. In addition,
+ * the drive prefers to serve large requests
+ * w.r.t. small ones, if it can choose. So,
+ * having more than one large request queued
+ * in the drive may easily make the next first
+ * request of the in-service queue wait so
+ * long as to break bfqq's service guarantees. On
+ * the bright side, large requests let the
+ * drive reach a very high throughput, even if
+ * there is only one in-flight large request
+ * at a time.
+ */
+ if (blk_queue_nonrot(bfqd->queue) &&
+ blk_rq_sectors(bfqq->next_rq) >=
+ BFQQ_SECT_THR_NONROT)
+ limit = min_t(unsigned int, 1, limit);
+ else
+ limit = in_serv_bfqq->inject_limit;
+
+ if (bfqd->rq_in_driver < limit) {
+ bfqd->rqs_injected = true;
+ return bfqq;
+ }
+ }
return NULL;
}
@@ -3763,14 +4170,32 @@
* for a new request, or has requests waiting for a completion and
* may idle after their completion, then keep it anyway.
*
- * Yet, to boost throughput, inject service from other queues if
- * possible.
+ * Yet, inject service from other queues if it boosts
+ * throughput and is possible.
*/
if (bfq_bfqq_wait_request(bfqq) ||
(bfqq->dispatched != 0 && bfq_better_to_idle(bfqq))) {
- if (bfq_bfqq_injectable(bfqq) &&
- bfqq->injected_service * bfqq->inject_coeff <
- bfqq->entity.service * 10)
+ struct bfq_queue *async_bfqq =
+ bfqq->bic && bfqq->bic->bfqq[0] &&
+ bfq_bfqq_busy(bfqq->bic->bfqq[0]) ?
+ bfqq->bic->bfqq[0] : NULL;
+
+ /*
+ * If the process associated with bfqq has also async
+ * I/O pending, then inject it
+ * unconditionally. Injecting I/O from the same
+ * process can cause no harm to the process. On the
+ * contrary, it can only increase bandwidth and reduce
+ * latency for the process.
+ */
+ if (async_bfqq &&
+ icq_to_bic(async_bfqq->next_rq->elv.icq) == bfqq->bic &&
+ bfq_serv_to_charge(async_bfqq->next_rq, async_bfqq) <=
+ bfq_bfqq_budget_left(async_bfqq))
+ bfqq = bfqq->bic->bfqq[0];
+ else if (!idling_boosts_thr_without_issues(bfqd, bfqq) &&
+ (bfqq->wr_coeff == 1 || bfqd->wr_busy_queues > 1 ||
+ !bfq_bfqq_has_short_ttime(bfqq)))
bfqq = bfq_choose_bfqq_for_injection(bfqd);
else
bfqq = NULL;
@@ -3862,15 +4287,15 @@
bfq_bfqq_served(bfqq, service_to_charge);
+ if (bfqq == bfqd->in_service_queue && bfqd->wait_dispatch) {
+ bfqd->wait_dispatch = false;
+ bfqd->waited_rq = rq;
+ }
+
bfq_dispatch_remove(bfqd->queue, rq);
- if (bfqq != bfqd->in_service_queue) {
- if (likely(bfqd->in_service_queue))
- bfqd->in_service_queue->injected_service +=
- bfq_serv_to_charge(rq, bfqq);
-
+ if (bfqq != bfqd->in_service_queue)
goto return_rq;
- }
/*
* If weight raising has to terminate for bfqq, then next
@@ -3890,7 +4315,7 @@
* belongs to CLASS_IDLE and other queues are waiting for
* service.
*/
- if (!(bfqd->busy_queues > 1 && bfq_class_idle(bfqq)))
+ if (!(bfq_tot_busy_queues(bfqd) > 1 && bfq_class_idle(bfqq)))
goto return_rq;
bfq_bfqq_expire(bfqd, bfqq, false, BFQQE_BUDGET_EXHAUSTED);
@@ -3908,7 +4333,7 @@
* most a call to dispatch for nothing
*/
return !list_empty_careful(&bfqd->dispatch) ||
- bfqd->busy_queues > 0;
+ bfq_tot_busy_queues(bfqd) > 0;
}
static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx)
@@ -3962,9 +4387,10 @@
goto start_rq;
}
- bfq_log(bfqd, "dispatch requests: %d busy queues", bfqd->busy_queues);
+ bfq_log(bfqd, "dispatch requests: %d busy queues",
+ bfq_tot_busy_queues(bfqd));
- if (bfqd->busy_queues == 0)
+ if (bfq_tot_busy_queues(bfqd) == 0)
goto exit;
/*
@@ -4301,13 +4727,6 @@
bfq_mark_bfqq_has_short_ttime(bfqq);
bfq_mark_bfqq_sync(bfqq);
bfq_mark_bfqq_just_created(bfqq);
- /*
- * Aggressively inject a lot of service: up to 90%.
- * This coefficient remains constant during bfqq life,
- * but this behavior might be changed, after enough
- * testing and tuning.
- */
- bfqq->inject_coeff = 1;
} else
bfq_clear_bfqq_sync(bfqq);
@@ -4445,17 +4864,19 @@
struct request *rq)
{
bfqq->seek_history <<= 1;
- bfqq->seek_history |=
- get_sdist(bfqq->last_request_pos, rq) > BFQQ_SEEK_THR &&
- (!blk_queue_nonrot(bfqd->queue) ||
- blk_rq_sectors(rq) < BFQQ_SECT_THR_NONROT);
+ bfqq->seek_history |= BFQ_RQ_SEEKY(bfqd, bfqq->last_request_pos, rq);
+
+ if (bfqq->wr_coeff > 1 &&
+ bfqq->wr_cur_max_time == bfqd->bfq_wr_rt_max_time &&
+ BFQQ_TOTALLY_SEEKY(bfqq))
+ bfq_bfqq_end_wr(bfqq);
}
static void bfq_update_has_short_ttime(struct bfq_data *bfqd,
struct bfq_queue *bfqq,
struct bfq_io_cq *bic)
{
- bool has_short_ttime = true;
+ bool has_short_ttime = true, state_changed;
/*
* No need to update has_short_ttime if bfqq is async or in
@@ -4480,13 +4901,93 @@
bfqq->ttime.ttime_mean > bfqd->bfq_slice_idle))
has_short_ttime = false;
- bfq_log_bfqq(bfqd, bfqq, "update_has_short_ttime: has_short_ttime %d",
- has_short_ttime);
+ state_changed = has_short_ttime != bfq_bfqq_has_short_ttime(bfqq);
if (has_short_ttime)
bfq_mark_bfqq_has_short_ttime(bfqq);
else
bfq_clear_bfqq_has_short_ttime(bfqq);
+
+ /*
+ * Until the base value for the total service time gets
+ * finally computed for bfqq, the inject limit does depend on
+ * the think-time state (short|long). In particular, the limit
+ * is 0 or 1 if the think time is deemed, respectively, as
+ * short or long (details in the comments in
+ * bfq_update_inject_limit()). Accordingly, the next
+ * instructions reset the inject limit if the think-time state
+ * has changed and the above base value is still to be
+ * computed.
+ *
+ * However, the reset is performed only if more than 100 ms
+ * have elapsed since the last update of the inject limit, or
+ * (inclusive) if the change is from short to long think
+ * time. The reason for this waiting is as follows.
+ *
+ * bfqq may have a long think time because of a
+ * synchronization with some other queue, i.e., because the
+ * I/O of some other queue may need to be completed for bfqq
+ * to receive new I/O. This happens, e.g., if bfqq is
+ * associated with a process that does some sync. A sync
+ * generates extra blocking I/O, which must be completed
+ * before the process associated with bfqq can go on with its
+ * I/O.
+ *
+ * If such a synchronization is actually in place, then,
+ * without injection on bfqq, the blocking I/O cannot happen
+ * to be served while bfqq is in service. As a consequence, if
+ * bfqq is granted I/O-dispatch-plugging, then bfqq remains
+ * empty, and no I/O is dispatched, until the idle timeout
+ * fires. This is likely to result in lower bandwidth and
+ * higher latencies for bfqq, and in a severe loss of total
+ * throughput.
+ *
+ * On the opposite end, a non-zero inject limit may allow the
+ * I/O that blocks bfqq to be executed soon, and therefore
+ * bfqq to receive new I/O soon. But, if this actually
+ * happens, then the next think-time sample for bfqq may be
+ * very low. This in turn may cause bfqq's think time to be
+ * deemed short. Without the 100 ms barrier, this new state
+ * change would cause the body of the next if to be executed
+ * immediately. But this would set to 0 the inject
+ * limit. Without injection, the blocking I/O would cause the
+ * think time of bfqq to become long again, and therefore the
+ * inject limit to be raised again, and so on. The only effect
+ * of such a steady oscillation between the two think-time
+ * states would be to prevent effective injection on bfqq.
+ *
+ * In contrast, if the inject limit is not reset during such a
+ * long time interval as 100 ms, then the number of short
+ * think time samples can grow significantly before the reset
+ * is allowed. As a consequence, the think time state can
+ * become stable before the reset. There will be no state
+ * change when the 100 ms elapse, and therefore no reset of
+ * the inject limit. The inject limit remains steadily equal
+ * to 1 both during and after the 100 ms. So injection can be
+ * performed at all times, and throughput gets boosted.
+ *
+ * An inject limit equal to 1 is however in conflict, in
+ * general, with the fact that the think time of bfqq is
+ * short, because injection may be likely to delay bfqq's I/O
+ * (as explained in the comments in
+ * bfq_update_inject_limit()). But this does not happen in
+ * this special case, because bfqq's low think time is due to
+ * an effective handling of a synchronization, through
+ * injection. In this special case, bfqq's I/O does not get
+ * delayed by injection; on the contrary, bfqq's I/O is
+ * brought forward, because it is not blocked for
+ * milliseconds.
+ *
+ * In addition, during the 100 ms, the base value for the
+ * total service time is likely to get finally computed,
+ * freeing the inject limit from its relation with the think
+ * time.
+ */
+ if (state_changed && bfqq->last_serv_time_ns == 0 &&
+ (time_is_before_eq_jiffies(bfqq->decrease_time_jif +
+ msecs_to_jiffies(100)) ||
+ !has_short_ttime))
+ bfq_reset_inject_limit(bfqd, bfqq);
}
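/*
 * Editorial note: bfq_reset_inject_limit() is not shown in this hunk. The
 * sketch below is a hypothetical reconstruction based only on the comment
 * above (limit 0 for a short think time, 1 for a long one, baseline still
 * to be computed); the helper name sketch_reset_inject_limit is made up
 * and the exact in-tree body may differ.
 */
static void sketch_reset_inject_limit(struct bfq_data *bfqd,
                                      struct bfq_queue *bfqq)
{
        bfqq->last_serv_time_ns = 0;    /* baseline must be re-sampled */
        /* provisional limit derived from the current think-time state */
        bfqq->inject_limit = bfq_bfqq_has_short_ttime(bfqq) ? 0 : 1;
        bfqq->decrease_time_jif = jiffies;      /* restart the 100 ms window */
}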
/*
@@ -4496,19 +4997,9 @@
static void bfq_rq_enqueued(struct bfq_data *bfqd, struct bfq_queue *bfqq,
struct request *rq)
{
- struct bfq_io_cq *bic = RQ_BIC(rq);
-
if (rq->cmd_flags & REQ_META)
bfqq->meta_pending++;
- bfq_update_io_thinktime(bfqd, bfqq);
- bfq_update_has_short_ttime(bfqd, bfqq, bic);
- bfq_update_io_seektime(bfqd, bfqq, rq);
-
- bfq_log_bfqq(bfqd, bfqq,
- "rq_enqueued: has_short_ttime=%d (seeky %d)",
- bfq_bfqq_has_short_ttime(bfqq), BFQQ_SEEKY(bfqq));
-
bfqq->last_request_pos = blk_rq_pos(rq) + blk_rq_sectors(rq);
if (bfqq == bfqd->in_service_queue && bfq_bfqq_wait_request(bfqq)) {
@@ -4517,28 +5008,31 @@
bool budget_timeout = bfq_bfqq_budget_timeout(bfqq);
/*
- * There is just this request queued: if the request
- * is small and the queue is not to be expired, then
- * just exit.
+ * There is just this request queued: if
+ * - the request is small, and
+ * - we are idling to boost throughput, and
+ * - the queue is not to be expired,
+ * then just exit.
*
* In this way, if the device is being idled to wait
* for a new request from the in-service queue, we
* avoid unplugging the device and committing the
- * device to serve just a small request. On the
- * contrary, we wait for the block layer to decide
- * when to unplug the device: hopefully, new requests
- * will be merged to this one quickly, then the device
- * will be unplugged and larger requests will be
- * dispatched.
+ * device to serve just a small request. In contrast
+ * we wait for the block layer to decide when to
+ * unplug the device: hopefully, new requests will be
+ * merged to this one quickly, then the device will be
+ * unplugged and larger requests will be dispatched.
*/
- if (small_req && !budget_timeout)
+ if (small_req && idling_boosts_thr_without_issues(bfqd, bfqq) &&
+ !budget_timeout)
return;
/*
- * A large enough request arrived, or the queue is to
- * be expired: in both cases disk idling is to be
- * stopped, so clear wait_request flag and reset
- * timer.
+ * A large enough request arrived, or idling is being
+ * performed to preserve service guarantees, or
+ * finally the queue is to be expired: in all these
+ * cases disk idling is to be stopped, so clear
+ * wait_request flag and reset timer.
*/
bfq_clear_bfqq_wait_request(bfqq);
hrtimer_try_to_cancel(&bfqd->idle_slice_timer);
@@ -4564,8 +5058,6 @@
bool waiting, idle_timer_disabled = false;
if (new_bfqq) {
- if (bic_to_bfqq(RQ_BIC(rq), 1) != bfqq)
- new_bfqq = bic_to_bfqq(RQ_BIC(rq), 1);
/*
* Release the request's reference to the old bfqq
* and make sure one is taken to the shared queue.
@@ -4595,6 +5087,10 @@
bfqq = new_bfqq;
}
+ bfq_update_io_thinktime(bfqd, bfqq);
+ bfq_update_has_short_ttime(bfqd, bfqq, RQ_BIC(rq));
+ bfq_update_io_seektime(bfqd, bfqq, rq);
+
waiting = bfqq && bfq_bfqq_wait_request(bfqq);
bfq_add_request(rq);
idle_timer_disabled = waiting && !bfq_bfqq_wait_request(bfqq);
@@ -4708,6 +5204,8 @@
static void bfq_update_hw_tag(struct bfq_data *bfqd)
{
+ struct bfq_queue *bfqq = bfqd->in_service_queue;
+
bfqd->max_rq_in_driver = max_t(int, bfqd->max_rq_in_driver,
bfqd->rq_in_driver);
@@ -4720,7 +5218,18 @@
* sum is not exact, as it's not taking into account deactivated
* requests.
*/
- if (bfqd->rq_in_driver + bfqd->queued < BFQ_HW_QUEUE_THRESHOLD)
+ if (bfqd->rq_in_driver + bfqd->queued <= BFQ_HW_QUEUE_THRESHOLD)
+ return;
+
+ /*
+ * If the active queue hasn't enough requests and can idle, bfq might not
+ * dispatch sufficient requests to hardware. Don't zero hw_tag in this
+ * case
+ */
+ if (bfqq && bfq_bfqq_has_short_ttime(bfqq) &&
+ bfqq->dispatched + bfqq->queued[0] + bfqq->queued[1] <
+ BFQ_HW_QUEUE_THRESHOLD &&
+ bfqd->rq_in_driver < BFQ_HW_QUEUE_THRESHOLD)
return;
if (bfqd->hw_tag_samples++ < BFQ_HW_QUEUE_SAMPLES)
@@ -4729,6 +5238,9 @@
bfqd->hw_tag = bfqd->max_rq_in_driver > BFQ_HW_QUEUE_THRESHOLD;
bfqd->max_rq_in_driver = 0;
bfqd->hw_tag_samples = 0;
+
+ bfqd->nonrot_with_queueing =
+ blk_queue_nonrot(bfqd->queue) && bfqd->hw_tag;
}
static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd)
@@ -4791,11 +5303,14 @@
* isochronous, and both requisites for this condition to hold
* are now satisfied, then compute soft_rt_next_start (see the
* comments on the function bfq_bfqq_softrt_next_start()). We
- * schedule this delayed check when bfqq expires, if it still
- * has in-flight requests.
+ * do not compute soft_rt_next_start if bfqq is in interactive
+ * weight raising (see the comments in bfq_bfqq_expire() for
+ * an explanation). We schedule this delayed update when bfqq
+ * expires, if it still has in-flight requests.
*/
if (bfq_bfqq_softrt_update(bfqq) && bfqq->dispatched == 0 &&
- RB_EMPTY_ROOT(&bfqq->sort_list))
+ RB_EMPTY_ROOT(&bfqq->sort_list) &&
+ bfqq->wr_coeff != bfqd->bfq_wr_coeff)
bfqq->soft_rt_next_start =
bfq_bfqq_softrt_next_start(bfqd, bfqq);
@@ -4853,6 +5368,164 @@
}
/*
+ * The processes associated with bfqq may happen to generate their
+ * cumulative I/O at a lower rate than the rate at which the device
+ * could serve the same I/O. This is rather probable, e.g., if only
+ * one process is associated with bfqq and the device is an SSD. It
+ * results in bfqq becoming often empty while in service. In this
+ * respect, if BFQ is allowed to switch to another queue when bfqq
+ * remains empty, then the device goes on being fed with I/O requests,
+ * and the throughput is not affected. In contrast, if BFQ is not
+ * allowed to switch to another queue---because bfqq is sync and
+ * I/O-dispatch needs to be plugged while bfqq is temporarily
+ * empty---then, during the service of bfqq, there will be frequent
+ * "service holes", i.e., time intervals during which bfqq gets empty
+ * and the device can only consume the I/O already queued in its
+ * hardware queues. During service holes, the device may even end up
+ * remaining idle. In the end, during the service of bfqq, the device
+ * is driven at a lower speed than the one it can reach with the kind
+ * of I/O flowing through bfqq.
+ *
+ * To counter this loss of throughput, BFQ implements a "request
+ * injection mechanism", which tries to fill the above service holes
+ * with I/O requests taken from other queues. The hard part in this
+ * mechanism is finding the right amount of I/O to inject, so as to
+ * both boost throughput and not break bfqq's bandwidth and latency
+ * guarantees. In this respect, the mechanism maintains a per-queue
+ * inject limit, computed as below. While bfqq is empty, the injection
+ * mechanism dispatches extra I/O requests only until the total number
+ * of I/O requests in flight---i.e., already dispatched but not yet
+ * completed---remains lower than this limit.
+ *
+ * A first definition comes in handy to introduce the algorithm by
+ * which the inject limit is computed. We define as first request for
+ * bfqq, an I/O request for bfqq that arrives while bfqq is in
+ * service, and causes bfqq to switch from empty to non-empty. The
+ * algorithm updates the limit as a function of the effect of
+ * injection on the service times of only the first requests of
+ * bfqq. The reason for this restriction is that these are the
+ * requests whose service time is affected most, because they are the
+ * first to arrive after injection possibly occurred.
+ *
+ * To evaluate the effect of injection, the algorithm measures the
+ * "total service time" of first requests. We define as total service
+ * time of an I/O request, the time that elapses since when the
+ * request is enqueued into bfqq, to when it is completed. This
+ * quantity allows the whole effect of injection to be measured. It is
+ * easy to see why. Suppose that some requests of other queues are
+ * actually injected while bfqq is empty, and that a new request R
+ * then arrives for bfqq. If the device does start to serve all or
+ * part of the injected requests during the service hole, then,
+ * because of this extra service, it may delay the next invocation of
+ * the dispatch hook of BFQ. Then, even after R gets eventually
+ * dispatched, the device may delay the actual service of R if it is
+ * still busy serving the extra requests, or if it decides to serve,
+ * before R, some extra request still present in its queues. As a
+ * conclusion, the cumulative extra delay caused by injection can be
+ * easily evaluated by just comparing the total service time of first
+ * requests with and without injection.
+ *
+ * The limit-update algorithm works as follows. On the arrival of a
+ * first request of bfqq, the algorithm measures the total time of the
+ * request only if one of the three cases below holds, and, for each
+ * case, it updates the limit as described below:
+ *
+ * (1) If there is no in-flight request. This gives a baseline for the
+ * total service time of the requests of bfqq. If the baseline has
+ * not been computed yet, then, after computing it, the limit is
+ * set to 1, to start boosting throughput, and to prepare the
+ * ground for the next case. If the baseline has already been
+ * computed, then it is updated, in case it turns out to be lower
+ * than the previous value.
+ *
+ * (2) If the limit is higher than 0 and there are in-flight
+ * requests. By comparing the total service time in this case with
+ * the above baseline, it is possible to know at which extent the
+ * current value of the limit is inflating the total service
+ * time. If the inflation is below a certain threshold, then bfqq
+ * is assumed to be suffering from no perceivable loss of its
+ * service guarantees, and the limit is even tentatively
+ * increased. If the inflation is above the threshold, then the
+ * limit is decreased. Due to the lack of any hysteresis, this
+ * logic makes the limit oscillate even in steady workload
+ * conditions. Yet we opted for it, because it is fast in reaching
+ * the best value for the limit, as a function of the current I/O
+ * workload. To reduce oscillations, this step is disabled for a
+ * short time interval after the limit happens to be decreased.
+ *
+ * (3) Periodically, after resetting the limit, to make sure that the
+ * limit eventually drops in case the workload changes. This is
+ * needed because, after the limit has gone safely up for a
+ * certain workload, it is impossible to guess whether the
+ * baseline total service time may have changed, without measuring
+ * it again without injection. A more effective version of this
+ * step might be to just sample the baseline, by interrupting
+ * injection only once, and then to reset/lower the limit only if
+ * the total service time with the current limit does happen to be
+ * too large.
+ *
+ * More details on each step are provided in the comments on the
+ * pieces of code that implement these steps: the branch handling the
+ * transition from empty to non empty in bfq_add_request(), the branch
+ * handling injection in bfq_select_queue(), and the function
+ * bfq_choose_bfqq_for_injection(). These comments also explain some
+ * exceptions, made by the injection mechanism in some special cases.
+ */
+static void bfq_update_inject_limit(struct bfq_data *bfqd,
+ struct bfq_queue *bfqq)
+{
+ u64 tot_time_ns = ktime_get_ns() - bfqd->last_empty_occupied_ns;
+ unsigned int old_limit = bfqq->inject_limit;
+
+ if (bfqq->last_serv_time_ns > 0) {
+ u64 threshold = (bfqq->last_serv_time_ns * 3)>>1;
+
+ if (tot_time_ns >= threshold && old_limit > 0) {
+ bfqq->inject_limit--;
+ bfqq->decrease_time_jif = jiffies;
+ } else if (tot_time_ns < threshold &&
+ old_limit < bfqd->max_rq_in_driver<<1)
+ bfqq->inject_limit++;
+ }
+
+ /*
+ * Either we still have to compute the base value for the
+ * total service time, and there seem to be the right
+ * conditions to do it, or we can lower the last base value
+ * computed.
+ *
+ * NOTE: (bfqd->rq_in_driver == 1) means that there is no I/O
+ * request in flight, because this function is in the code
+ * path that handles the completion of a request of bfqq, and,
+ * in particular, this function is executed before
+ * bfqd->rq_in_driver is decremented in such a code path.
+ */
+ if ((bfqq->last_serv_time_ns == 0 && bfqd->rq_in_driver == 1) ||
+ tot_time_ns < bfqq->last_serv_time_ns) {
+ bfqq->last_serv_time_ns = tot_time_ns;
+ /*
+ * Now we certainly have a base value: make sure we
+ * start trying injection.
+ */
+ bfqq->inject_limit = max_t(unsigned int, 1, old_limit);
+ } else if (!bfqd->rqs_injected && bfqd->rq_in_driver == 1)
+ /*
+ * No I/O injected and no request still in service in
+ * the drive: these are the exact conditions for
+ * computing the base value of the total service time
+ * for bfqq. So let's update this value, because it is
+ * rather variable. For example, it varies if the size
+ * or the spatial locality of the I/O requests in bfqq
+ * change.
+ */
+ bfqq->last_serv_time_ns = tot_time_ns;
+
+
+ /* update complete, not waiting for any request completion any longer */
+ bfqd->waited_rq = NULL;
+}
+
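/*
 * Editorial illustration (not in-tree code) of the update rule implemented
 * above, with concrete numbers: a baseline of 2 ms gives a threshold of
 * 3 ms (1.5x); a 3.5 ms sample then decrements a non-zero limit, while a
 * 2.5 ms sample increments it as long as it stays below twice
 * max_rq_in_driver. The helper name is hypothetical.
 */
static unsigned int sketch_next_inject_limit(u64 base_ns, u64 sample_ns,
                                             unsigned int limit,
                                             unsigned int max_rq_in_driver)
{
        u64 threshold = (base_ns * 3) >> 1;     /* tolerate up to 50% inflation */

        if (sample_ns >= threshold && limit > 0)
                limit--;                        /* injection is hurting bfqq */
        else if (sample_ns < threshold && limit < (max_rq_in_driver << 1))
                limit++;                        /* no perceivable harm: be bolder */
        return limit;
}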
+/*
* Handle either a requeue or a finish for rq. The things to do are
* the same in both cases: all references to rq are to be dropped. In
* particular, rq is considered completed from the point of view of
@@ -4896,6 +5569,9 @@
spin_lock_irqsave(&bfqd->lock, flags);
+ if (rq == bfqd->waited_rq)
+ bfq_update_inject_limit(bfqd, bfqq);
+
bfq_completed_request(bfqq, bfqd);
bfq_finish_requeue_request_body(bfqq);
@@ -5059,7 +5735,7 @@
* preparation is that, after the prepare_request hook is invoked for
* rq, rq may still be transformed into a request with no icq, i.e., a
* request not associated with any queue. No bfq hook is invoked to
- * signal this tranformation. As a consequence, should these
+ * signal this transformation. As a consequence, should these
* preparation operations be performed when the prepare_request hook
* is invoked, and should rq be transformed one moment later, bfq
* would end up in an inconsistent state, because it would have
@@ -5150,7 +5826,29 @@
}
}
- if (unlikely(bfq_bfqq_just_created(bfqq)))
+ /*
+ * Consider bfqq as possibly belonging to a burst of newly
+ * created queues only if:
+ * 1) A burst is actually happening (bfqd->burst_size > 0)
+ * or
+ * 2) There is no other active queue. In fact, if, in
+ * contrast, there are active queues not belonging to the
+ * possible burst bfqq may belong to, then there is no gain
+ * in considering bfqq as belonging to a burst, and
+ * therefore in not weight-raising bfqq. See comments on
+ * bfq_handle_burst().
+ *
+ * This filtering also helps eliminating false positives,
+ * occurring when bfqq does not belong to an actual large
+ * burst, but some background task (e.g., a service) happens
+ * to trigger the creation of new queues very close to when
+ * bfqq and its possible companion queues are created. See
+ * comments on bfq_handle_burst() for further details also on
+ * this issue.
+ */
+ if (unlikely(bfq_bfqq_just_created(bfqq) &&
+ (bfqd->burst_size > 0 ||
+ bfq_tot_busy_queues(bfqd) == 0)))
bfq_handle_burst(bfqd, bfqq);
return bfqq;
@@ -5418,14 +6116,15 @@
HRTIMER_MODE_REL);
bfqd->idle_slice_timer.function = bfq_idle_slice_timer;
- bfqd->queue_weights_tree = RB_ROOT;
- bfqd->group_weights_tree = RB_ROOT;
+ bfqd->queue_weights_tree = RB_ROOT_CACHED;
+ bfqd->num_groups_with_pending_reqs = 0;
INIT_LIST_HEAD(&bfqd->active_list);
INIT_LIST_HEAD(&bfqd->idle_list);
INIT_HLIST_HEAD(&bfqd->burst_list);
bfqd->hw_tag = -1;
+ bfqd->nonrot_with_queueing = blk_queue_nonrot(bfqd->queue);
bfqd->bfq_max_budget = bfq_default_max_budget;
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index a41e988..eba7cd4 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -32,6 +32,8 @@
#define BFQ_DEFAULT_GRP_IOPRIO 0
#define BFQ_DEFAULT_GRP_CLASS IOPRIO_CLASS_BE
+#define MAX_PID_STR_LENGTH 12
+
/*
* Soft real-time applications are extremely more latency sensitive
* than interactive ones. Over-raise the weight of the former to
@@ -89,7 +91,7 @@
* expiration. This peculiar definition allows for the following
* optimization, not yet exploited: while a given entity is still in
* service, we already know which is the best candidate for next
- * service among the other active entitities in the same parent
+ * service among the other active entities in the same parent
* entity. We can then quickly compare the timestamps of the
* in-service entity with those of such best candidate.
*
@@ -108,15 +110,14 @@
};
/**
- * struct bfq_weight_counter - counter of the number of all active entities
+ * struct bfq_weight_counter - counter of the number of all active queues
* with a given weight.
*/
struct bfq_weight_counter {
- unsigned int weight; /* weight of the entities this counter refers to */
- unsigned int num_active; /* nr of active entities with this weight */
+ unsigned int weight; /* weight of the queues this counter refers to */
+ unsigned int num_active; /* nr of active queues with this weight */
/*
- * Weights tree member (see bfq_data's @queue_weights_tree and
- * @group_weights_tree)
+ * Weights tree member (see bfq_data's @queue_weights_tree)
*/
struct rb_node weights_node;
};
@@ -141,7 +142,7 @@
*
* Unless cgroups are used, the weight value is calculated from the
* ioprio to export the same interface as CFQ. When dealing with
- * ``well-behaved'' queues (i.e., queues that do not spend too much
+ * "well-behaved" queues (i.e., queues that do not spend too much
* time to consume their budget and have true sequential behavior, and
* when there are no external factors breaking anticipation) the
* relative weights at each level of the cgroups hierarchy should be
@@ -151,8 +152,6 @@
struct bfq_entity {
/* service_tree member */
struct rb_node rb_node;
- /* pointer to the weight counter associated with this entity */
- struct bfq_weight_counter *weight_counter;
/*
* Flag, true if the entity is on a tree (either the active or
@@ -199,6 +198,9 @@
/* flag, set to request a weight, ioprio or ioprio_class change */
int prio_changed;
+
+ /* flag, set if the entity is counted in groups_with_pending_reqs */
+ bool in_groups_with_pending_reqs;
};
struct bfq_group;
@@ -240,6 +242,13 @@
/* next ioprio and ioprio class if a change is in progress */
unsigned short new_ioprio, new_ioprio_class;
+ /* last total-service-time sample, see bfq_update_inject_limit() */
+ u64 last_serv_time_ns;
+ /* limit for request injection */
+ unsigned int inject_limit;
+ /* last time the inject limit has been decreased, in jiffies */
+ unsigned long decrease_time_jif;
+
/*
* Shared bfq_queue if queue is cooperating with one or more
* other queues.
@@ -266,6 +275,9 @@
/* entity representing this queue in the scheduler */
struct bfq_entity entity;
+ /* pointer to the weight counter associated with this entity */
+ struct bfq_weight_counter *weight_counter;
+
/* maximum budget allowed from the feedback mechanism */
int max_budget;
/* budget expiration (in jiffies) */
@@ -354,29 +366,6 @@
/* max service rate measured so far */
u32 max_service_rate;
- /*
- * Ratio between the service received by bfqq while it is in
- * service, and the cumulative service (of requests of other
- * queues) that may be injected while bfqq is empty but still
- * in service. To increase precision, the coefficient is
- * measured in tenths of unit. Here are some example of (1)
- * ratios, (2) resulting percentages of service injected
- * w.r.t. to the total service dispatched while bfqq is in
- * service, and (3) corresponding values of the coefficient:
- * 1 (50%) -> 10
- * 2 (33%) -> 20
- * 10 (9%) -> 100
- * 9.9 (9%) -> 99
- * 1.5 (40%) -> 15
- * 0.5 (66%) -> 5
- * 0.1 (90%) -> 1
- *
- * So, if the coefficient is lower than 10, then
- * injected service is more than bfqq service.
- */
- unsigned int inject_coeff;
- /* amount of service injected in current service slot */
- unsigned int injected_service;
};
/**
@@ -416,6 +405,15 @@
bool was_in_burst_list;
/*
+ * Save the weight when a merge occurs, to be able
+ * to restore it in case of split. If the weight is not
+ * correctly resumed when the queue is recycled,
+ * then the weight of the recycled queue could differ
+ * from the weight of the original queue.
+ */
+ unsigned int saved_weight;
+
+ /*
* Similar to previous fields: save wr information.
*/
unsigned long saved_wr_coeff;
@@ -447,22 +445,62 @@
* weight-raised @bfq_queue (see the comments to the functions
* bfq_weights_tree_[add|remove] for further details).
*/
- struct rb_root queue_weights_tree;
- /*
- * rbtree of non-queue @bfq_entity weight counters, sorted by
- * weight. Used to keep track of whether all @bfq_groups have
- * the same weight. The tree contains one counter for each
- * distinct weight associated to some active @bfq_group (see
- * the comments to the functions bfq_weights_tree_[add|remove]
- * for further details).
- */
- struct rb_root group_weights_tree;
+ struct rb_root_cached queue_weights_tree;
/*
- * Number of bfq_queues containing requests (including the
- * queue in service, even if it is idling).
+ * Number of groups with at least one descendant process that
+ * has at least one request waiting for completion. Note that
+ * this also accounts for requests already dispatched, but not
+ * yet completed. Therefore this number of groups may differ
+ * (be larger) than the number of active groups, as a group is
+ * considered active only if its corresponding entity has
+ * descendant queues with at least one request queued. This
+ * number is used to decide whether a scenario is symmetric.
+ * For a detailed explanation see comments on the computation
+ * of the variable asymmetric_scenario in the function
+ * bfq_better_to_idle().
+ *
+ * However, it is hard to compute this number exactly, for
+ * groups with multiple descendant processes. Consider a group
+ * that is inactive, i.e., that has no descendant process with
+ * pending I/O inside BFQ queues. Then suppose that
+ * num_groups_with_pending_reqs is still accounting for this
+ * group, because the group has descendant processes with some
+ * I/O request still in flight. num_groups_with_pending_reqs
+ * should be decremented when the in-flight request of the
+ * last descendant process is finally completed (assuming that
+ * nothing else has changed for the group in the meantime, in
+ * terms of composition of the group and active/inactive state of child
+ * groups and processes). To accomplish this, an additional
+ * pending-request counter must be added to entities, and must
+ * be updated correctly. To avoid this additional field and operations,
+ * we resort to the following tradeoff between simplicity and
+ * accuracy: for an inactive group that is still counted in
+ * num_groups_with_pending_reqs, we decrement
+ * num_groups_with_pending_reqs when the first descendant
+ * process of the group remains with no request waiting for
+ * completion.
+ *
+ * Even this simpler decrement strategy requires a little
+ * carefulness: to avoid multiple decrements, we flag a group,
+ * more precisely an entity representing a group, as still
+ * counted in num_groups_with_pending_reqs when it becomes
+ * inactive. Then, when the first descendant queue of the
+ * entity remains with no request waiting for completion,
+ * num_groups_with_pending_reqs is decremented, and this flag
+ * is reset. After this flag is reset for the entity,
+ * num_groups_with_pending_reqs won't be decremented any
+ * longer in case a new descendant queue of the entity remains
+ * with no request waiting for completion.
*/
- int busy_queues;
+ unsigned int num_groups_with_pending_reqs;
+
+ /*
+ * Per-class (RT, BE, IDLE) number of bfq_queues containing
+ * requests (including the queue in service, even if it is
+ * idling).
+ */
+ unsigned int busy_queues[3];
/* number of weight-raised busy @bfq_queues */
int wr_busy_queues;
/* number of queued requests */
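/*
 * Editorial sketch (not the in-tree code) of the decrement strategy
 * described in the comment above: the first descendant queue of a group
 * that remains with no request waiting for completion clears the group
 * entity's flag and decrements the counter, so the decrement happens
 * exactly once per inactive group. The helper name is hypothetical.
 */
static void sketch_dec_groups_with_pending_reqs(struct bfq_data *bfqd,
                                                struct bfq_entity *group_entity)
{
        if (group_entity->in_groups_with_pending_reqs) {
                group_entity->in_groups_with_pending_reqs = false;
                bfqd->num_groups_with_pending_reqs--;
        }
}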
@@ -470,6 +508,9 @@
/* number of requests dispatched and waiting for completion */
int rq_in_driver;
+ /* true if the device is non rotational and performs queueing */
+ bool nonrot_with_queueing;
+
/*
* Maximum number of requests in driver in the last
* @hw_tag_samples completed requests.
@@ -501,6 +542,26 @@
/* time of last request completion (ns) */
u64 last_completion;
+ /* time of last transition from empty to non-empty (ns) */
+ u64 last_empty_occupied_ns;
+
+ /*
+ * Flag set to activate the sampling of the total service time
+ * of a just-arrived first I/O request (see
+ * bfq_update_inject_limit()). This will cause the setting of
+ * waited_rq when the request is finally dispatched.
+ */
+ bool wait_dispatch;
+ /*
+ * If set, then bfq_update_inject_limit() is invoked when
+ * waited_rq is eventually completed.
+ */
+ struct request *waited_rq;
+ /*
+ * True if some request has been injected during the last service hole.
+ */
+ bool rqs_injected;
+
/* time of first rq dispatch in current observation interval (ns) */
u64 first_dispatch;
/* time of last rq dispatch in current observation interval (ns) */
@@ -510,6 +571,7 @@
ktime_t last_budget_start;
/* beginning of the last idle slice */
ktime_t last_idling_start;
+ unsigned long last_idling_start_jiffies;
/* number of samples in current observation interval */
int peak_rate_samples;
@@ -854,11 +916,11 @@
void bic_set_bfqq(struct bfq_io_cq *bic, struct bfq_queue *bfqq, bool is_sync);
struct bfq_data *bic_to_bfqd(struct bfq_io_cq *bic);
void bfq_pos_tree_add_move(struct bfq_data *bfqd, struct bfq_queue *bfqq);
-void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_entity *entity,
- struct rb_root *root);
+void bfq_weights_tree_add(struct bfq_data *bfqd, struct bfq_queue *bfqq,
+ struct rb_root_cached *root);
void __bfq_weights_tree_remove(struct bfq_data *bfqd,
- struct bfq_entity *entity,
- struct rb_root *root);
+ struct bfq_queue *bfqq,
+ struct rb_root_cached *root);
void bfq_weights_tree_remove(struct bfq_data *bfqd,
struct bfq_queue *bfqq);
void bfq_bfqq_expire(struct bfq_data *bfqd, struct bfq_queue *bfqq,
@@ -935,6 +997,7 @@
struct bfq_group *bfq_bfqq_to_bfqg(struct bfq_queue *bfqq);
struct bfq_queue *bfq_entity_to_bfqq(struct bfq_entity *entity);
+unsigned int bfq_tot_busy_queues(struct bfq_data *bfqd);
struct bfq_service_tree *bfq_entity_service_tree(struct bfq_entity *entity);
struct bfq_entity *bfq_entity_of(struct rb_node *node);
unsigned short bfq_ioprio_to_weight(int ioprio);
@@ -951,7 +1014,7 @@
bool ins_into_idle_tree);
bool next_queue_may_preempt(struct bfq_data *bfqd);
struct bfq_queue *bfq_get_next_queue(struct bfq_data *bfqd);
-void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd);
+bool __bfq_bfqd_reset_in_service(struct bfq_data *bfqd);
void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
bool ins_into_idle_tree, bool expiration);
void bfq_activate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq);
@@ -964,13 +1027,23 @@
/* --------------- end of interface of B-WF2Q+ ---------------- */
/* Logging facilities. */
+static inline void bfq_pid_to_str(int pid, char *str, int len)
+{
+ if (pid != -1)
+ snprintf(str, len, "%d", pid);
+ else
+ snprintf(str, len, "SHARED-");
+}
+
#ifdef CONFIG_BFQ_GROUP_IOSCHED
struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) do { \
+ char pid_str[MAX_PID_STR_LENGTH]; \
+ bfq_pid_to_str((bfqq)->pid, pid_str, MAX_PID_STR_LENGTH); \
blk_add_cgroup_trace_msg((bfqd)->queue, \
bfqg_to_blkg(bfqq_group(bfqq))->blkcg, \
- "bfq%d%c " fmt, (bfqq)->pid, \
+ "bfq%s%c " fmt, pid_str, \
bfq_bfqq_sync((bfqq)) ? 'S' : 'A', ##args); \
} while (0)
@@ -981,10 +1054,13 @@
#else /* CONFIG_BFQ_GROUP_IOSCHED */
-#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) \
- blk_add_trace_msg((bfqd)->queue, "bfq%d%c " fmt, (bfqq)->pid, \
+#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) do { \
+ char pid_str[MAX_PID_STR_LENGTH]; \
+ bfq_pid_to_str((bfqq)->pid, pid_str, MAX_PID_STR_LENGTH); \
+ blk_add_trace_msg((bfqd)->queue, "bfq%s%c " fmt, pid_str, \
bfq_bfqq_sync((bfqq)) ? 'S' : 'A', \
- ##args)
+ ##args); \
+} while (0)
#define bfq_log_bfqg(bfqd, bfqg, fmt, args...) do {} while (0)
#endif /* CONFIG_BFQ_GROUP_IOSCHED */
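/*
 * Editorial usage sketch of bfq_pid_to_str() (not in-tree code).
 * MAX_PID_STR_LENGTH (12) presumably accommodates a worst-case signed
 * 32-bit value ("-2147483648" is 11 characters plus the terminator),
 * while the -1 sentinel of a shared (merged) queue prints as "SHARED-".
 * The helper name and log message below are hypothetical.
 */
static inline void sketch_log_queue_pid(struct bfq_queue *bfqq)
{
        char pid_str[MAX_PID_STR_LENGTH];

        bfq_pid_to_str(bfqq->pid, pid_str, MAX_PID_STR_LENGTH);
        pr_debug("bfq%s: queue state dump\n", pid_str);
}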
diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c
index ff7c2d4..48d899c 100644
--- a/block/bfq-wf2q.c
+++ b/block/bfq-wf2q.c
@@ -44,6 +44,12 @@
BFQ_DEFAULT_GRP_CLASS - 1;
}
+unsigned int bfq_tot_busy_queues(struct bfq_data *bfqd)
+{
+ return bfqd->busy_queues[0] + bfqd->busy_queues[1] +
+ bfqd->busy_queues[2];
+}
+
static struct bfq_entity *bfq_lookup_next_entity(struct bfq_sched_data *sd,
bool expiration);
@@ -53,7 +59,7 @@
* bfq_update_next_in_service - update sd->next_in_service
* @sd: sched_data for which to perform the update.
* @new_entity: if not NULL, pointer to the entity whose activation,
- * requeueing or repositionig triggered the invocation of
+ * requeueing or repositioning triggered the invocation of
* this function.
* @expiration: id true, this function is being invoked after the
* expiration of the in-service entity
@@ -84,7 +90,7 @@
/*
* If this update is triggered by the activation, requeueing
- * or repositiong of an entity that does not coincide with
+ * or repositioning of an entity that does not coincide with
* sd->next_in_service, then a full lookup in the active tree
* can be avoided. In fact, it is enough to check whether the
* just-modified entity has the same priority as
@@ -731,7 +737,7 @@
struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);
unsigned int prev_weight, new_weight;
struct bfq_data *bfqd = NULL;
- struct rb_root *root;
+ struct rb_root_cached *root;
#ifdef CONFIG_BFQ_GROUP_IOSCHED
struct bfq_sched_data *sd;
struct bfq_group *bfqg;
@@ -788,25 +794,23 @@
new_weight = entity->orig_weight *
(bfqq ? bfqq->wr_coeff : 1);
/*
- * If the weight of the entity changes, remove the entity
- * from its old weight counter (if there is a counter
- * associated with the entity), and add it to the counter
- * associated with its new weight.
+ * If the weight of the entity changes, and the entity is a
+ * queue, remove the entity from its old weight counter (if
+ * there is a counter associated with the entity).
*/
- if (prev_weight != new_weight) {
- root = bfqq ? &bfqd->queue_weights_tree :
- &bfqd->group_weights_tree;
- __bfq_weights_tree_remove(bfqd, entity, root);
+ if (prev_weight != new_weight && bfqq) {
+ root = &bfqd->queue_weights_tree;
+ __bfq_weights_tree_remove(bfqd, bfqq, root);
}
entity->weight = new_weight;
/*
- * Add the entity to its weights tree only if it is
- * not associated with a weight-raised queue.
+ * Add the entity, if it is not a weight-raised queue,
+ * to the counter associated with its new weight.
*/
- if (prev_weight != new_weight &&
- (bfqq ? bfqq->wr_coeff == 1 : 1))
+ if (prev_weight != new_weight && bfqq && bfqq->wr_coeff == 1) {
/* If we get here, root has been initialized. */
- bfq_weights_tree_add(bfqd, entity, root);
+ bfq_weights_tree_add(bfqd, bfqq, root);
+ }
new_st->wsum += entity->weight;
@@ -1008,13 +1012,16 @@
entity->on_st = true;
}
-#ifdef BFQ_GROUP_IOSCHED_ENABLED
+#ifdef CONFIG_BFQ_GROUP_IOSCHED
if (!bfq_entity_to_bfqq(entity)) { /* bfq_group */
struct bfq_group *bfqg =
container_of(entity, struct bfq_group, entity);
+ struct bfq_data *bfqd = bfqg->bfqd;
- bfq_weights_tree_add(bfqg->bfqd, entity,
- &bfqd->group_weights_tree);
+ if (!entity->in_groups_with_pending_reqs) {
+ entity->in_groups_with_pending_reqs = true;
+ bfqd->num_groups_with_pending_reqs++;
+ }
}
#endif
@@ -1153,15 +1160,14 @@
}
/**
- * __bfq_deactivate_entity - deactivate an entity from its service tree.
- * @entity: the entity to deactivate.
+ * __bfq_deactivate_entity - update sched_data and service trees for
+ * entity, so as to represent entity as inactive
+ * @entity: the entity being deactivated.
* @ins_into_idle_tree: if false, the entity will not be put into the
* idle tree.
*
- * Deactivates an entity, independently of its previous state. Must
- * be invoked only if entity is on a service tree. Extracts the entity
- * from that tree, and if necessary and allowed, puts it into the idle
- * tree.
+ * If necessary and allowed, puts entity into the idle tree. NOTE:
+ * entity may be on no tree if in service.
*/
bool __bfq_deactivate_entity(struct bfq_entity *entity, bool ins_into_idle_tree)
{
@@ -1390,7 +1396,7 @@
* In this first case, update the virtual time in @st too (see the
* comments on this update inside the function).
*
- * In constrast, if there is an in-service entity, then return the
+ * In contrast, if there is an in-service entity, then return the
* entity that would be set in service if not only the above
* conditions, but also the next one held true: the currently
* in-service entity, on expiration,
@@ -1473,12 +1479,12 @@
* is being invoked as a part of the expiration path
* of the in-service queue. In this case, even if
* sd->in_service_entity is not NULL,
- * sd->in_service_entiy at this point is actually not
+ * sd->in_service_entity at this point is actually not
* in service any more, and, if needed, has already
* been properly queued or requeued into the right
* tree. The reason why sd->in_service_entity is still
* not NULL here, even if expiration is true, is that
- * sd->in_service_entiy is reset as a last step in the
+ * sd->in_service_entity is reset as a last step in the
* expiration path. So, if expiration is true, tell
* __bfq_lookup_next_entity that there is no
* sd->in_service_entity.
@@ -1513,7 +1519,7 @@
struct bfq_sched_data *sd;
struct bfq_queue *bfqq;
- if (bfqd->busy_queues == 0)
+ if (bfq_tot_busy_queues(bfqd) == 0)
return NULL;
/*
@@ -1599,7 +1605,8 @@
return bfqq;
}
-void __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
+/* returns true if the in-service queue gets freed */
+bool __bfq_bfqd_reset_in_service(struct bfq_data *bfqd)
{
struct bfq_queue *in_serv_bfqq = bfqd->in_service_queue;
struct bfq_entity *in_serv_entity = &in_serv_bfqq->entity;
@@ -1623,8 +1630,20 @@
* service tree either, then release the service reference to
* the queue it represents (taken with bfq_get_entity).
*/
- if (!in_serv_entity->on_st)
+ if (!in_serv_entity->on_st) {
+ /*
+ * If no process is referencing in_serv_bfqq any
+ * longer, then the service reference may be the only
+ * reference to the queue. If this is the case, then
+ * bfqq gets freed here.
+ */
+ int ref = in_serv_bfqq->ref;
bfq_put_queue(in_serv_bfqq);
+ if (ref == 1)
+ return true;
+ }
+
+ return false;
}
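/*
 * Editorial caller-side sketch (not the in-tree expiration path): the
 * boolean returned by __bfq_bfqd_reset_in_service() tells the caller
 * whether the former in-service queue has just been freed, so that it is
 * never dereferenced afterwards. The helper name is hypothetical.
 */
static void sketch_finish_expiration(struct bfq_data *bfqd)
{
        if (__bfq_bfqd_reset_in_service(bfqd))
                return; /* the queue was freed: do not touch it again */
        /* otherwise the former in-service queue is still safe to inspect */
}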
void bfq_deactivate_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,
@@ -1665,10 +1684,7 @@
bfq_clear_bfqq_busy(bfqq);
- bfqd->busy_queues--;
-
- if (!bfqq->dispatched)
- bfq_weights_tree_remove(bfqd, bfqq);
+ bfqd->busy_queues[bfqq->ioprio_class - 1]--;
if (bfqq->wr_coeff > 1)
bfqd->wr_busy_queues--;
@@ -1676,6 +1692,9 @@
bfqg_stats_update_dequeue(bfqq_group(bfqq));
bfq_deactivate_bfqq(bfqd, bfqq, true, expiration);
+
+ if (!bfqq->dispatched)
+ bfq_weights_tree_remove(bfqd, bfqq);
}
/*
@@ -1688,11 +1707,11 @@
bfq_activate_bfqq(bfqd, bfqq);
bfq_mark_bfqq_busy(bfqq);
- bfqd->busy_queues++;
+ bfqd->busy_queues[bfqq->ioprio_class - 1]++;
if (!bfqq->dispatched)
if (bfqq->wr_coeff == 1)
- bfq_weights_tree_add(bfqd, &bfqq->entity,
+ bfq_weights_tree_add(bfqd, bfqq,
&bfqd->queue_weights_tree);
if (bfqq->wr_coeff > 1)
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 5006a0d..4ca4f0b 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -16,6 +16,18 @@
static void blk_mq_sysfs_release(struct kobject *kobj)
{
+ struct blk_mq_ctxs *ctxs = container_of(kobj, struct blk_mq_ctxs, kobj);
+
+ free_percpu(ctxs->queue_ctx);
+ kfree(ctxs);
+}
+
+static void blk_mq_ctx_sysfs_release(struct kobject *kobj)
+{
+ struct blk_mq_ctx *ctx = container_of(kobj, struct blk_mq_ctx, kobj);
+
+ /* ctx->ctxs won't be released until all ctx are freed */
+ kobject_put(&ctx->ctxs->kobj);
}
static void blk_mq_hw_sysfs_release(struct kobject *kobj)
@@ -214,7 +226,7 @@
static struct kobj_type blk_mq_ctx_ktype = {
.sysfs_ops = &blk_mq_sysfs_ops,
.default_attrs = default_ctx_attrs,
- .release = blk_mq_sysfs_release,
+ .release = blk_mq_ctx_sysfs_release,
};
static struct kobj_type blk_mq_hw_ktype = {
@@ -246,7 +258,7 @@
if (!hctx->nr_ctx)
return 0;
- ret = kobject_add(&hctx->kobj, &q->mq_kobj, "%u", hctx->queue_num);
+ ret = kobject_add(&hctx->kobj, q->mq_kobj, "%u", hctx->queue_num);
if (ret)
return ret;
@@ -269,8 +281,8 @@
queue_for_each_hw_ctx(q, hctx, i)
blk_mq_unregister_hctx(hctx);
- kobject_uevent(&q->mq_kobj, KOBJ_REMOVE);
- kobject_del(&q->mq_kobj);
+ kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
+ kobject_del(q->mq_kobj);
kobject_put(&dev->kobj);
q->mq_sysfs_init_done = false;
@@ -290,7 +302,7 @@
ctx = per_cpu_ptr(q->queue_ctx, cpu);
kobject_put(&ctx->kobj);
}
- kobject_put(&q->mq_kobj);
+ kobject_put(q->mq_kobj);
}
void blk_mq_sysfs_init(struct request_queue *q)
@@ -298,10 +310,12 @@
struct blk_mq_ctx *ctx;
int cpu;
- kobject_init(&q->mq_kobj, &blk_mq_ktype);
+ kobject_init(q->mq_kobj, &blk_mq_ktype);
for_each_possible_cpu(cpu) {
ctx = per_cpu_ptr(q->queue_ctx, cpu);
+
+ kobject_get(q->mq_kobj);
kobject_init(&ctx->kobj, &blk_mq_ctx_ktype);
}
}
@@ -314,11 +328,11 @@
WARN_ON_ONCE(!q->kobj.parent);
lockdep_assert_held(&q->sysfs_lock);
- ret = kobject_add(&q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq");
+ ret = kobject_add(q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq");
if (ret < 0)
goto out;
- kobject_uevent(&q->mq_kobj, KOBJ_ADD);
+ kobject_uevent(q->mq_kobj, KOBJ_ADD);
queue_for_each_hw_ctx(q, hctx, i) {
ret = blk_mq_register_hctx(hctx);
@@ -335,8 +349,8 @@
while (--i >= 0)
blk_mq_unregister_hctx(q->queue_hw_ctx[i]);
- kobject_uevent(&q->mq_kobj, KOBJ_REMOVE);
- kobject_del(&q->mq_kobj);
+ kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
+ kobject_del(q->mq_kobj);
kobject_put(&dev->kobj);
return ret;
}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 684acaa..dba55f3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2453,6 +2453,34 @@
mutex_unlock(&set->tag_list_lock);
}
+/* All allocations will be freed in release handler of q->mq_kobj */
+static int blk_mq_alloc_ctxs(struct request_queue *q)
+{
+ struct blk_mq_ctxs *ctxs;
+ int cpu;
+
+ ctxs = kzalloc(sizeof(*ctxs), GFP_KERNEL);
+ if (!ctxs)
+ return -ENOMEM;
+
+ ctxs->queue_ctx = alloc_percpu(struct blk_mq_ctx);
+ if (!ctxs->queue_ctx)
+ goto fail;
+
+ for_each_possible_cpu(cpu) {
+ struct blk_mq_ctx *ctx = per_cpu_ptr(ctxs->queue_ctx, cpu);
+ ctx->ctxs = ctxs;
+ }
+
+ q->mq_kobj = &ctxs->kobj;
+ q->queue_ctx = ctxs->queue_ctx;
+
+ return 0;
+ fail:
+ kfree(ctxs);
+ return -ENOMEM;
+}
+
/*
* It is the actual release handler for mq, but we do it from
* request queue's release handler for avoiding use-after-free
@@ -2480,8 +2508,6 @@
* both share lifetime with request queue.
*/
blk_mq_sysfs_deinit(q);
-
- free_percpu(q->queue_ctx);
}
struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
@@ -2586,8 +2612,7 @@
if (!q->poll_cb)
goto err_exit;
- q->queue_ctx = alloc_percpu(struct blk_mq_ctx);
- if (!q->queue_ctx)
+ if (blk_mq_alloc_ctxs(q))
goto err_exit;
/* init q->mq_kobj and sw queues' kobjects */
@@ -2596,7 +2621,7 @@
q->queue_hw_ctx = kcalloc_node(nr_cpu_ids, sizeof(*(q->queue_hw_ctx)),
GFP_KERNEL, set->numa_node);
if (!q->queue_hw_ctx)
- goto err_percpu;
+ goto err_sys_init;
q->mq_map = set->mq_map;
@@ -2653,8 +2678,8 @@
err_hctxs:
kfree(q->queue_hw_ctx);
-err_percpu:
- free_percpu(q->queue_ctx);
+err_sys_init:
+ blk_mq_sysfs_deinit(q);
err_exit:
q->mq_ops = NULL;
return ERR_PTR(-ENOMEM);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 5ad9251..a6094c2 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -7,6 +7,11 @@
struct blk_mq_tag_set;
+struct blk_mq_ctxs {
+ struct kobject kobj;
+ struct blk_mq_ctx __percpu *queue_ctx;
+};
+
/**
* struct blk_mq_ctx - State for a software queue facing the submitting CPUs
*/
@@ -27,6 +32,7 @@
unsigned long ____cacheline_aligned_in_smp rq_completed[2];
struct request_queue *queue;
+ struct blk_mq_ctxs *ctxs;
struct kobject kobj;
} ____cacheline_aligned_in_smp;
diff --git a/crypto/Makefile b/crypto/Makefile
index f6a234d..e7397bd 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -124,7 +124,7 @@
obj-$(CONFIG_CRYPTO_CRC32) += crc32_generic.o
obj-$(CONFIG_CRYPTO_CRCT10DIF) += crct10dif_common.o crct10dif_generic.o
obj-$(CONFIG_CRYPTO_AUTHENC) += authenc.o authencesn.o
-obj-$(CONFIG_CRYPTO_LZO) += lzo.o
+obj-$(CONFIG_CRYPTO_LZO) += lzo.o lzo-rle.o
obj-$(CONFIG_CRYPTO_LZ4) += lz4.o
obj-$(CONFIG_CRYPTO_LZ4HC) += lz4hc.o
obj-$(CONFIG_CRYPTO_842) += 842.o
diff --git a/crypto/lzo-rle.c b/crypto/lzo-rle.c
new file mode 100644
index 0000000..ea9c75b
--- /dev/null
+++ b/crypto/lzo-rle.c
@@ -0,0 +1,175 @@
+/*
+ * Cryptographic API.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 51
+ * Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/crypto.h>
+#include <linux/vmalloc.h>
+#include <linux/mm.h>
+#include <linux/lzo.h>
+#include <crypto/internal/scompress.h>
+
+struct lzorle_ctx {
+ void *lzorle_comp_mem;
+};
+
+static void *lzorle_alloc_ctx(struct crypto_scomp *tfm)
+{
+ void *ctx;
+
+ ctx = kvmalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL);
+ if (!ctx)
+ return ERR_PTR(-ENOMEM);
+
+ return ctx;
+}
+
+static int lzorle_init(struct crypto_tfm *tfm)
+{
+ struct lzorle_ctx *ctx = crypto_tfm_ctx(tfm);
+
+ ctx->lzorle_comp_mem = lzorle_alloc_ctx(NULL);
+ if (IS_ERR(ctx->lzorle_comp_mem))
+ return -ENOMEM;
+
+ return 0;
+}
+
+static void lzorle_free_ctx(struct crypto_scomp *tfm, void *ctx)
+{
+ kvfree(ctx);
+}
+
+static void lzorle_exit(struct crypto_tfm *tfm)
+{
+ struct lzorle_ctx *ctx = crypto_tfm_ctx(tfm);
+
+ lzorle_free_ctx(NULL, ctx->lzorle_comp_mem);
+}
+
+static int __lzorle_compress(const u8 *src, unsigned int slen,
+ u8 *dst, unsigned int *dlen, void *ctx)
+{
+ size_t tmp_len = *dlen; /* size_t(ulong) <-> uint on 64 bit */
+ int err;
+
+ err = lzorle1x_1_compress(src, slen, dst, &tmp_len, ctx);
+
+ if (err != LZO_E_OK)
+ return -EINVAL;
+
+ *dlen = tmp_len;
+ return 0;
+}
+
+static int lzorle_compress(struct crypto_tfm *tfm, const u8 *src,
+ unsigned int slen, u8 *dst, unsigned int *dlen)
+{
+ struct lzorle_ctx *ctx = crypto_tfm_ctx(tfm);
+
+ return __lzorle_compress(src, slen, dst, dlen, ctx->lzorle_comp_mem);
+}
+
+static int lzorle_scompress(struct crypto_scomp *tfm, const u8 *src,
+ unsigned int slen, u8 *dst, unsigned int *dlen,
+ void *ctx)
+{
+ return __lzorle_compress(src, slen, dst, dlen, ctx);
+}
+
+static int __lzorle_decompress(const u8 *src, unsigned int slen,
+ u8 *dst, unsigned int *dlen)
+{
+ int err;
+ size_t tmp_len = *dlen; /* size_t(ulong) <-> uint on 64 bit */
+
+ err = lzo1x_decompress_safe(src, slen, dst, &tmp_len);
+
+ if (err != LZO_E_OK)
+ return -EINVAL;
+
+ *dlen = tmp_len;
+ return 0;
+}
+
+static int lzorle_decompress(struct crypto_tfm *tfm, const u8 *src,
+ unsigned int slen, u8 *dst, unsigned int *dlen)
+{
+ return __lzorle_decompress(src, slen, dst, dlen);
+}
+
+static int lzorle_sdecompress(struct crypto_scomp *tfm, const u8 *src,
+ unsigned int slen, u8 *dst, unsigned int *dlen,
+ void *ctx)
+{
+ return __lzorle_decompress(src, slen, dst, dlen);
+}
+
+static struct crypto_alg alg = {
+ .cra_name = "lzo-rle",
+ .cra_flags = CRYPTO_ALG_TYPE_COMPRESS,
+ .cra_ctxsize = sizeof(struct lzorle_ctx),
+ .cra_module = THIS_MODULE,
+ .cra_init = lzorle_init,
+ .cra_exit = lzorle_exit,
+ .cra_u = { .compress = {
+ .coa_compress = lzorle_compress,
+ .coa_decompress = lzorle_decompress } }
+};
+
+static struct scomp_alg scomp = {
+ .alloc_ctx = lzorle_alloc_ctx,
+ .free_ctx = lzorle_free_ctx,
+ .compress = lzorle_scompress,
+ .decompress = lzorle_sdecompress,
+ .base = {
+ .cra_name = "lzo-rle",
+ .cra_driver_name = "lzo-rle-scomp",
+ .cra_module = THIS_MODULE,
+ }
+};
+
+static int __init lzorle_mod_init(void)
+{
+ int ret;
+
+ ret = crypto_register_alg(&alg);
+ if (ret)
+ return ret;
+
+ ret = crypto_register_scomp(&scomp);
+ if (ret) {
+ crypto_unregister_alg(&alg);
+ return ret;
+ }
+
+ return ret;
+}
+
+static void __exit lzorle_mod_fini(void)
+{
+ crypto_unregister_alg(&alg);
+ crypto_unregister_scomp(&scomp);
+}
+
+module_init(lzorle_mod_init);
+module_exit(lzorle_mod_fini);
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("LZO-RLE Compression Algorithm");
+MODULE_ALIAS_CRYPTO("lzo-rle");
diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
index d332988..18948f8 100644
--- a/crypto/tcrypt.c
+++ b/crypto/tcrypt.c
@@ -76,8 +76,8 @@
"cast6", "arc4", "michael_mic", "deflate", "crc32c", "tea", "xtea",
"khazad", "wp512", "wp384", "wp256", "tnepres", "xeta", "fcrypt",
"camellia", "seed", "salsa20", "rmd128", "rmd160", "rmd256", "rmd320",
- "lzo", "cts", "zlib", "sha3-224", "sha3-256", "sha3-384", "sha3-512",
- NULL
+ "lzo", "lzo-rle", "cts", "zlib", "sha3-224", "sha3-256", "sha3-384",
+ "sha3-512", NULL
};
static u32 block_sizes[] = { 16, 64, 256, 1024, 8192, 0 };
diff --git a/drivers/acpi/acpica/evgpe.c b/drivers/acpi/acpica/evgpe.c
index 4b5d3b4..4da586f 100644
--- a/drivers/acpi/acpica/evgpe.c
+++ b/drivers/acpi/acpica/evgpe.c
@@ -81,8 +81,12 @@
ACPI_FUNCTION_TRACE(ev_enable_gpe);
- /* Enable the requested GPE */
+ /* Clear the GPE (of stale events) */
+ status = acpi_hw_clear_gpe(gpe_event_info);
+ if (ACPI_FAILURE(status))
+ return_ACPI_STATUS(status);
+ /* Enable the requested GPE */
status = acpi_hw_low_set_gpe(gpe_event_info, ACPI_GPE_ENABLE);
return_ACPI_STATUS(status);
}
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index dd4c728..bce135d 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2292,7 +2292,7 @@
offset = to_interleave_offset(offset, mmio);
writeq(cmd, mmio->addr.base + offset);
- nvdimm_flush(nfit_blk->nd_region);
+ nvdimm_flush(nfit_blk->nd_region, NULL);
if (nfit_blk->dimm_flags & NFIT_BLK_DCR_LATCH)
readq(mmio->addr.base + offset);
@@ -2341,7 +2341,7 @@
}
if (rw)
- nvdimm_flush(nfit_blk->nd_region);
+ nvdimm_flush(nfit_blk->nd_region, NULL);
rc = read_blk_stat(nfit_blk, lane) ? -EIO : 0;
return rc;
diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c
index 847db3e..6bd58d7 100644
--- a/drivers/acpi/sleep.c
+++ b/drivers/acpi/sleep.c
@@ -584,6 +584,7 @@
acpi_status status = AE_OK;
u32 acpi_state = acpi_target_sleep_state;
int error;
+ u64 tsc;
ACPI_FLUSH_CPU_CACHE();
@@ -600,6 +601,9 @@
error = acpi_suspend_lowlevel();
if (error)
return error;
+ tsc = rdtsc_ordered();
+ printk(KERN_INFO "TSC at resume: %llu\n",
+ (unsigned long long)tsc);
pr_info(PREFIX "Low-level resume complete\n");
pm_set_resume_via_firmware();
break;
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 3e63a90..3360746 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -60,6 +60,15 @@
rescue mode with init=/bin/sh, even when the /dev directory
on the rootfs is completely empty.
+config DEVTMPFS_SAFE
+ bool "Automount devtmpfs with nosuid/noexec"
+ depends on DEVTMPFS_MOUNT
+ default y
+ help
+ This instructs the kernel to automount devtmpfs with the
+ MS_NOEXEC and MS_NOSUID mount flags, which can prevent
+ certain kinds of code-execution attack on embedded platforms.
+
config STANDALONE
bool "Select only drivers that don't need compile-time external firmware"
default y
diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index caaeb79..c11f14e 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -614,6 +614,7 @@
return -EBUSY;
return 0;
}
+EXPORT_SYMBOL(driver_probe_done);
/**
* wait_for_device_probe
diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index f776807..5b6b1b7e 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -349,6 +349,7 @@
int devtmpfs_mount(const char *mntdir)
{
int err;
+ int mflags = MS_SILENT;
if (!mount_dev)
return 0;
@@ -356,8 +357,10 @@
if (!thread)
return 0;
- err = ksys_mount("devtmpfs", (char *)mntdir, "devtmpfs", MS_SILENT,
- NULL);
+#ifdef CONFIG_DEVTMPFS_SAFE
+ mflags |= MS_NOEXEC | MS_NOSUID;
+#endif
+ err = ksys_mount("devtmpfs", (char *)mntdir, "devtmpfs", mflags, NULL);
if (err)
printk(KERN_INFO "devtmpfs: error mounting %i\n", err);
else
diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
index 3b382a7..09678fa 100644
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -1773,7 +1773,12 @@
dev->power.direct_complete = false;
if (dev->power.direct_complete) {
- if (pm_runtime_status_suspended(dev)) {
+ /*
+ * Check if we're runtime suspended. If not, try to runtime
+ * suspend for autosuspend cases.
+ */
+ if (pm_runtime_status_suspended(dev) ||
+ !pm_runtime_suspend(dev)) {
pm_runtime_disable(dev);
if (pm_runtime_status_suspended(dev))
goto Complete;
diff --git a/drivers/base/power/wakeup.c b/drivers/base/power/wakeup.c
index 2dfa2e0..33773c5 100644
--- a/drivers/base/power/wakeup.c
+++ b/drivers/base/power/wakeup.c
@@ -818,7 +818,7 @@
srcuidx = srcu_read_lock(&wakeup_srcu);
list_for_each_entry_rcu(ws, &wakeup_sources, entry) {
if (ws->active) {
- pr_debug("active wakeup source: %s\n", ws->name);
+ pm_pr_dbg("active wakeup source: %s\n", ws->name);
active = 1;
} else if (!active &&
(!last_activity_ws ||
@@ -829,7 +829,7 @@
}
if (!active && last_activity_ws)
- pr_debug("last active wakeup source: %s\n",
+ pm_pr_dbg("last active wakeup source: %s\n",
last_activity_ws->name);
srcu_read_unlock(&wakeup_srcu, srcuidx);
}
@@ -859,7 +859,7 @@
raw_spin_unlock_irqrestore(&events_lock, flags);
if (ret) {
- pr_debug("PM: Wakeup pending, aborting suspend\n");
+ pm_pr_dbg("Wakeup pending, aborting suspend\n");
pm_print_active_wakeup_sources();
}
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c1341c8..4164d3a 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -460,7 +460,9 @@
if (!cmd->use_aio || cmd->ret < 0 || cmd->ret == blk_rq_bytes(rq) ||
req_op(rq) != REQ_OP_READ) {
- if (cmd->ret < 0)
+ if (cmd->ret == -EOPNOTSUPP)
+ ret = BLK_STS_NOTSUPP;
+ else if (cmd->ret < 0)
ret = BLK_STS_IOERR;
goto end_io;
}
@@ -931,6 +933,24 @@
return 0;
}
+static void loop_update_rotational(struct loop_device *lo)
+{
+ struct file *file = lo->lo_backing_file;
+ struct inode *file_inode = file->f_mapping->host;
+ struct block_device *file_bdev = file_inode->i_sb->s_bdev;
+ struct request_queue *q = lo->lo_queue;
+ bool nonrot = true;
+
+ /* not all filesystems (e.g. tmpfs) have a sb->s_bdev */
+ if (file_bdev)
+ nonrot = blk_queue_nonrot(bdev_get_queue(file_bdev));
+
+ if (nonrot)
+ blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
+ else
+ blk_queue_flag_clear(QUEUE_FLAG_NONROT, q);
+}
+
static int loop_set_fd(struct loop_device *lo, fmode_t mode,
struct block_device *bdev, unsigned int arg)
{
@@ -994,6 +1014,7 @@
if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
blk_queue_write_cache(lo->lo_queue, true, false);
+ loop_update_rotational(lo);
loop_update_dio(lo);
set_capacity(lo->lo_disk, size);
bd_set_size(bdev, size << 9);
@@ -1924,7 +1945,10 @@
failed:
/* complete non-aio request */
if (!cmd->use_aio || ret) {
- cmd->ret = ret ? -EIO : 0;
+ if (ret == -EOPNOTSUPP)
+ cmd->ret = ret;
+ else
+ cmd->ret = ret ? -EIO : 0;
blk_mq_complete_request(rq);
}
}
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 9be54e5..83a09a7 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -18,6 +18,7 @@
#define PART_BITS 4
#define VQ_NAME_LEN 16
+#define DISCARD_MAX_SEGMENTS 256
static int major;
static DEFINE_IDA(vd_index_ida);
@@ -188,10 +189,50 @@
return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
}
+
+static inline int virtblk_setup_discard_write_zeroes(struct request *req,
+ bool unmap)
+{
+ unsigned short segments = blk_rq_nr_discard_segments(req);
+ unsigned short n = 0;
+ struct virtio_blk_discard_write_zeroes *range;
+ struct bio *bio;
+ u32 flags = 0;
+
+ if (unmap)
+ flags |= VIRTIO_BLK_WRITE_ZEROES_FLAG_UNMAP;
+
+ range = kmalloc_array(segments, sizeof(*range), GFP_ATOMIC);
+ if (!range)
+ return -ENOMEM;
+
+ __rq_for_each_bio(bio, req) {
+ u64 sector = bio->bi_iter.bi_sector;
+ u32 num_sectors = bio->bi_iter.bi_size >> 9;
+
+ range[n].flags = cpu_to_le32(flags);
+ range[n].num_sectors = cpu_to_le32(num_sectors);
+ range[n].sector = cpu_to_le64(sector);
+ n++;
+ }
+
+ req->special_vec.bv_page = virt_to_page(range);
+ req->special_vec.bv_offset = offset_in_page(range);
+ req->special_vec.bv_len = sizeof(*range) * segments;
+ req->rq_flags |= RQF_SPECIAL_PAYLOAD;
+
+ return 0;
+}
+
static inline void virtblk_request_done(struct request *req)
{
struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
+ if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
+ kfree(page_address(req->special_vec.bv_page) +
+ req->special_vec.bv_offset);
+ }
+
switch (req_op(req)) {
case REQ_OP_SCSI_IN:
case REQ_OP_SCSI_OUT:
@@ -241,6 +282,7 @@
int qid = hctx->queue_num;
int err;
bool notify = false;
+ bool unmap = false;
u32 type;
BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
@@ -253,6 +295,13 @@
case REQ_OP_FLUSH:
type = VIRTIO_BLK_T_FLUSH;
break;
+ case REQ_OP_DISCARD:
+ type = VIRTIO_BLK_T_DISCARD;
+ break;
+ case REQ_OP_WRITE_ZEROES:
+ type = VIRTIO_BLK_T_WRITE_ZEROES;
+ unmap = !(req->cmd_flags & REQ_NOUNMAP);
+ break;
case REQ_OP_SCSI_IN:
case REQ_OP_SCSI_OUT:
type = VIRTIO_BLK_T_SCSI_CMD;
@@ -272,6 +321,12 @@
blk_mq_start_request(req);
+ if (type == VIRTIO_BLK_T_DISCARD || type == VIRTIO_BLK_T_WRITE_ZEROES) {
+ err = virtblk_setup_discard_write_zeroes(req, unmap);
+ if (err)
+ return BLK_STS_RESOURCE;
+ }
+
num = blk_rq_map_sg(hctx->queue, req, vbr->sg);
if (num) {
if (rq_data_dir(req) == WRITE)
@@ -855,6 +910,42 @@
if (!err && opt_io_size)
blk_queue_io_opt(q, blk_size * opt_io_size);
+ if (virtio_has_feature(vdev, VIRTIO_BLK_F_DISCARD)) {
+ q->limits.discard_granularity = blk_size;
+
+ virtio_cread(vdev, struct virtio_blk_config,
+ discard_sector_alignment, &v);
+ if (v)
+ q->limits.discard_alignment = v << 9;
+ else
+ q->limits.discard_alignment = 0;
+
+ virtio_cread(vdev, struct virtio_blk_config,
+ max_discard_sectors, &v);
+ if (v)
+ blk_queue_max_discard_sectors(q, v);
+ else
+ blk_queue_max_discard_sectors(q, UINT_MAX);
+
+ virtio_cread(vdev, struct virtio_blk_config, max_discard_seg,
+ &v);
+ if (v && v <= DISCARD_MAX_SEGMENTS)
+ blk_queue_max_discard_segments(q, v);
+ else
+ blk_queue_max_discard_segments(q, DISCARD_MAX_SEGMENTS);
+
+ blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
+ }
+
+ if (virtio_has_feature(vdev, VIRTIO_BLK_F_WRITE_ZEROES)) {
+ virtio_cread(vdev, struct virtio_blk_config,
+ max_write_zeroes_sectors, &v);
+ if (v)
+ blk_queue_max_write_zeroes_sectors(q, v);
+ else
+ blk_queue_max_write_zeroes_sectors(q, UINT_MAX);
+ }
+
virtblk_update_capacity(vblk, false);
virtio_device_ready(vdev);
@@ -964,14 +1055,14 @@
VIRTIO_BLK_F_SCSI,
#endif
VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
- VIRTIO_BLK_F_MQ,
+ VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
}
;
static unsigned int features[] = {
VIRTIO_BLK_F_SEG_MAX, VIRTIO_BLK_F_SIZE_MAX, VIRTIO_BLK_F_GEOMETRY,
VIRTIO_BLK_F_RO, VIRTIO_BLK_F_BLK_SIZE,
VIRTIO_BLK_F_FLUSH, VIRTIO_BLK_F_TOPOLOGY, VIRTIO_BLK_F_CONFIG_WCE,
- VIRTIO_BLK_F_MQ,
+ VIRTIO_BLK_F_MQ, VIRTIO_BLK_F_DISCARD, VIRTIO_BLK_F_WRITE_ZEROES,
};
static struct virtio_driver virtio_blk = {
diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c
index 4ed0a78..4d9a388 100644
--- a/drivers/block/zram/zcomp.c
+++ b/drivers/block/zram/zcomp.c
@@ -20,6 +20,7 @@
static const char * const backends[] = {
"lzo",
+ "lzo-rle",
#if IS_ENABLED(CONFIG_CRYPTO_LZ4)
"lz4",
#endif
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 76abe40..e2c9e76 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -41,7 +41,7 @@
static DEFINE_MUTEX(zram_index_mutex);
static int zram_major;
-static const char *default_compressor = "lzo";
+static const char *default_compressor = "lzo-rle";
/* Module params (documentation at end) */
static unsigned int num_devices = 1;
@@ -1588,23 +1588,55 @@
return len;
}
-static int zram_open(struct block_device *bdev, fmode_t mode)
+int zram_open(struct block_device *bdev, fmode_t mode)
{
- int ret = 0;
struct zram *zram;
+ int open_count;
WARN_ON(!mutex_is_locked(&bdev->bd_mutex));
zram = bdev->bd_disk->private_data;
/* zram was claimed to reset so open request fails */
if (zram->claim)
- ret = -EBUSY;
+ goto out_busy;
- return ret;
+ /*
+ * Chromium OS specific behavior:
+ * sys_swapon opens the device once to populate its swapinfo->swap_file
+ * and once when it claims the block device (blkdev_get). By limiting
+ * the maximum number of opens to 2, we ensure there are no prior open
+ * references before swap is enabled.
+ * (Note, kzalloc ensures nr_opens starts at 0.)
+ */
+ open_count = atomic_inc_return(&zram->nr_opens);
+ if (open_count > 2)
+ goto out_busy_dec_nr_opens;
+ /*
+ * swapon(2) claims the block device after setup. If a zram is claimed
+ * then open attempts are rejected. This is belt-and-suspenders as the
+ * the block device and swap_file will both hold open nr_opens until
+ * swapoff(2) is called.
+ */
+ if (bdev->bd_holder != NULL)
+ goto out_busy_dec_nr_opens;
+
+ return 0;
+
+out_busy_dec_nr_opens:
+ atomic_dec(&zram->nr_opens);
+out_busy:
+ return -EBUSY;
+}
+
+void zram_release(struct gendisk *disk, fmode_t mode)
+{
+ struct zram *zram = disk->private_data;
+ atomic_dec(&zram->nr_opens);
}
static const struct block_device_operations zram_devops = {
.open = zram_open,
+ .release = zram_release,
.swap_slot_free_notify = zram_slot_free_notify,
.rw_page = zram_rw_page,
.owner = THIS_MODULE
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index d1095df..e934528 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -16,6 +16,8 @@
#define _ZRAM_DRV_H_
#include <linux/rwsem.h>
+#include <linux/spinlock.h>
+#include <linux/atomic.h>
#include <linux/zsmalloc.h>
#include <linux/crypto.h>
@@ -93,6 +95,8 @@
* the number of pages zram can consume for storing compressed data
*/
unsigned long limit_pages;
+ int max_comp_streams;
+ atomic_t nr_opens; /* number of active file handles */
struct zram_stats stats;
/*
diff --git a/drivers/clk/clk-bulk.c b/drivers/clk/clk-bulk.c
index 6904ed6..6a7118d 100644
--- a/drivers/clk/clk-bulk.c
+++ b/drivers/clk/clk-bulk.c
@@ -17,8 +17,65 @@
*/
#include <linux/clk.h>
+#include <linux/clk-provider.h>
#include <linux/device.h>
#include <linux/export.h>
+#include <linux/of.h>
+#include <linux/slab.h>
+
+static int __must_check of_clk_bulk_get(struct device_node *np, int num_clks,
+ struct clk_bulk_data *clks)
+{
+ int ret;
+ int i;
+
+ for (i = 0; i < num_clks; i++)
+ clks[i].clk = NULL;
+
+ for (i = 0; i < num_clks; i++) {
+ clks[i].clk = of_clk_get(np, i);
+ if (IS_ERR(clks[i].clk)) {
+ ret = PTR_ERR(clks[i].clk);
+ pr_err("%pOF: Failed to get clk index: %d ret: %d\n",
+ np, i, ret);
+ clks[i].clk = NULL;
+ goto err;
+ }
+ }
+
+ return 0;
+
+err:
+ clk_bulk_put(i, clks);
+
+ return ret;
+}
+
+static int __must_check of_clk_bulk_get_all(struct device_node *np,
+ struct clk_bulk_data **clks)
+{
+ struct clk_bulk_data *clk_bulk;
+ int num_clks;
+ int ret;
+
+ num_clks = of_clk_get_parent_count(np);
+ if (!num_clks)
+ return 0;
+
+ clk_bulk = kmalloc_array(num_clks, sizeof(*clk_bulk), GFP_KERNEL);
+ if (!clk_bulk)
+ return -ENOMEM;
+
+ ret = of_clk_bulk_get(np, num_clks, clk_bulk);
+ if (ret) {
+ kfree(clk_bulk);
+ return ret;
+ }
+
+ *clks = clk_bulk;
+
+ return num_clks;
+}
void clk_bulk_put(int num_clks, struct clk_bulk_data *clks)
{
@@ -59,6 +116,29 @@
}
EXPORT_SYMBOL(clk_bulk_get);
+void clk_bulk_put_all(int num_clks, struct clk_bulk_data *clks)
+{
+ if (IS_ERR_OR_NULL(clks))
+ return;
+
+ clk_bulk_put(num_clks, clks);
+
+ kfree(clks);
+}
+EXPORT_SYMBOL(clk_bulk_put_all);
+
+int __must_check clk_bulk_get_all(struct device *dev,
+ struct clk_bulk_data **clks)
+{
+ struct device_node *np = dev_of_node(dev);
+
+ if (!np)
+ return 0;
+
+ return of_clk_bulk_get_all(np, clks);
+}
+EXPORT_SYMBOL(clk_bulk_get_all);
+
#ifdef CONFIG_HAVE_CLK_PREPARE
/**
diff --git a/drivers/clk/clk-devres.c b/drivers/clk/clk-devres.c
index d854e26..12c8745 100644
--- a/drivers/clk/clk-devres.c
+++ b/drivers/clk/clk-devres.c
@@ -70,6 +70,30 @@
}
EXPORT_SYMBOL_GPL(devm_clk_bulk_get);
+int __must_check devm_clk_bulk_get_all(struct device *dev,
+ struct clk_bulk_data **clks)
+{
+ struct clk_bulk_devres *devres;
+ int ret;
+
+ devres = devres_alloc(devm_clk_bulk_release,
+ sizeof(*devres), GFP_KERNEL);
+ if (!devres)
+ return -ENOMEM;
+
+ ret = clk_bulk_get_all(dev, &devres->clks);
+ if (ret > 0) {
+ *clks = devres->clks;
+ devres->num_clks = ret;
+ devres_add(dev, devres);
+ } else {
+ devres_free(devres);
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(devm_clk_bulk_get_all);
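+
+/*
+ * Illustrative usage sketch (hypothetical platform driver, not part of this
+ * change): fetch every clock listed in the device's DT node and enable them.
+ *
+ *	struct clk_bulk_data *clks;
+ *	int num;
+ *
+ *	num = devm_clk_bulk_get_all(&pdev->dev, &clks);
+ *	if (num < 0)
+ *		return num;
+ *	if (num > 0)
+ *		ret = clk_bulk_prepare_enable(num, clks);
+ */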
+
static int devm_clk_match(struct device *dev, void *res, void *data)
{
struct clk **c = res;
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index e35c397..8cfd8f4 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -2280,6 +2280,7 @@
ret = cpufreq_start_governor(policy);
if (!ret) {
pr_debug("cpufreq: governor change\n");
+ sched_cpufreq_governor_change(policy, old_gov);
return 0;
}
cpufreq_exit_governor(policy);
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 6df894d..96a3a9b 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -221,7 +221,7 @@
}
/* Take note of the planned idle state. */
- sched_idle_set_state(target_state);
+ sched_idle_set_state(target_state, index);
trace_cpu_idle_rcuidle(index, dev->cpu);
time_start = ns_to_ktime(local_clock());
@@ -235,7 +235,7 @@
trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
/* The cpu is no longer idle or about to enter idle. */
- sched_idle_set_state(NULL);
+ sched_idle_set_state(NULL, -1);
if (broadcast) {
if (WARN_ON_ONCE(!irqs_disabled()))
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index a89ebd9..41aaac6 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -658,7 +658,7 @@
* No 'host' or dax_operations since there is no access to this
* device outside of mmap of the resulting character device.
*/
- dax_dev = alloc_dax(dev_dax, NULL, NULL);
+ dax_dev = alloc_dax(dev_dax, NULL, NULL, DAXDEV_F_SYNC);
if (!dax_dev) {
rc = -ENOMEM;
goto err_dax;
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 6e928f3..e3234fc 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -72,26 +72,18 @@
EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
#endif
-/**
- * __bdev_dax_supported() - Check if the device supports dax for filesystem
- * @bdev: block device to check
- * @blocksize: The block size of the device
- *
- * This is a library function for filesystems to check if the block device
- * can be mounted with dax option.
- *
- * Return: true if supported, false if unsupported
- */
-bool __bdev_dax_supported(struct block_device *bdev, int blocksize)
+bool __generic_fsdax_supported(struct dax_device *dax_dev,
+ struct block_device *bdev, int blocksize, sector_t start,
+ sector_t sectors)
{
- struct dax_device *dax_dev;
bool dax_enabled = false;
- struct request_queue *q;
- pgoff_t pgoff;
- int err, id;
- pfn_t pfn;
- long len;
+ pgoff_t pgoff, pgoff_end;
char buf[BDEVNAME_SIZE];
+ void *kaddr, *end_kaddr;
+ pfn_t pfn, end_pfn;
+ sector_t last_page;
+ long len, len2;
+ int err, id;
if (blocksize != PAGE_SIZE) {
pr_debug("%s: error: unsupported blocksize for dax\n",
@@ -99,36 +91,29 @@
return false;
}
- q = bdev_get_queue(bdev);
- if (!q || !blk_queue_dax(q)) {
- pr_debug("%s: error: request queue doesn't support dax\n",
- bdevname(bdev, buf));
- return false;
- }
-
- err = bdev_dax_pgoff(bdev, 0, PAGE_SIZE, &pgoff);
+ err = bdev_dax_pgoff(bdev, start, PAGE_SIZE, &pgoff);
if (err) {
pr_debug("%s: error: unaligned partition for dax\n",
bdevname(bdev, buf));
return false;
}
- dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
- if (!dax_dev) {
- pr_debug("%s: error: device does not support dax\n",
+ last_page = PFN_DOWN((start + sectors - 1) * 512) * PAGE_SIZE / 512;
+ err = bdev_dax_pgoff(bdev, last_page, PAGE_SIZE, &pgoff_end);
+ if (err) {
+ pr_debug("%s: error: unaligned partition for dax\n",
bdevname(bdev, buf));
return false;
}
id = dax_read_lock();
- len = dax_direct_access(dax_dev, pgoff, 1, NULL, &pfn);
+ len = dax_direct_access(dax_dev, pgoff, 1, &kaddr, &pfn);
+ len2 = dax_direct_access(dax_dev, pgoff_end, 1, &end_kaddr, &end_pfn);
dax_read_unlock(id);
- put_dax(dax_dev);
-
- if (len < 1) {
+ if (len < 1 || len2 < 1) {
pr_debug("%s: error: dax access failed (%ld)\n",
- bdevname(bdev, buf), len);
+ bdevname(bdev, buf), len < 1 ? len : len2);
return false;
}
@@ -143,13 +128,20 @@
*/
WARN_ON(IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API));
dax_enabled = true;
- } else if (pfn_t_devmap(pfn)) {
- struct dev_pagemap *pgmap;
+ } else if (pfn_t_devmap(pfn) && pfn_t_devmap(end_pfn)) {
+ struct dev_pagemap *pgmap, *end_pgmap;
pgmap = get_dev_pagemap(pfn_t_to_pfn(pfn), NULL);
- if (pgmap && pgmap->type == MEMORY_DEVICE_FS_DAX)
+ end_pgmap = get_dev_pagemap(pfn_t_to_pfn(end_pfn), NULL);
+ if (pgmap && pgmap == end_pgmap && pgmap->type == MEMORY_DEVICE_FS_DAX
+ && pfn_t_to_page(pfn)->pgmap == pgmap
+ && pfn_t_to_page(end_pfn)->pgmap == pgmap
+ && pfn_t_to_pfn(pfn) == PHYS_PFN(__pa(kaddr))
+ && pfn_t_to_pfn(end_pfn) == PHYS_PFN(__pa(end_kaddr)))
dax_enabled = true;
put_dev_pagemap(pgmap);
+ put_dev_pagemap(end_pgmap);
+
}
if (!dax_enabled) {
@@ -159,6 +151,49 @@
}
return true;
}
+EXPORT_SYMBOL_GPL(__generic_fsdax_supported);
+
+/**
+ * __bdev_dax_supported() - Check if the device supports dax for filesystem
+ * @bdev: block device to check
+ * @blocksize: The block size of the device
+ *
+ * This is a library function for filesystems to check if the block device
+ * can be mounted with dax option.
+ *
+ * Return: true if supported, false if unsupported
+ */
+bool __bdev_dax_supported(struct block_device *bdev, int blocksize)
+{
+ struct dax_device *dax_dev;
+ struct request_queue *q;
+ char buf[BDEVNAME_SIZE];
+ bool ret;
+ int id;
+
+ q = bdev_get_queue(bdev);
+ if (!q || !blk_queue_dax(q)) {
+ pr_debug("%s: error: request queue doesn't support dax\n",
+ bdevname(bdev, buf));
+ return false;
+ }
+
+ dax_dev = dax_get_by_host(bdev->bd_disk->disk_name);
+ if (!dax_dev) {
+ pr_debug("%s: error: device does not support dax\n",
+ bdevname(bdev, buf));
+ return false;
+ }
+
+ id = dax_read_lock();
+ ret = dax_supported(dax_dev, bdev, blocksize, 0,
+ i_size_read(bdev->bd_inode) / 512);
+ dax_read_unlock(id);
+
+ put_dax(dax_dev);
+
+ return ret;
+}
EXPORT_SYMBOL_GPL(__bdev_dax_supported);
#endif
@@ -167,6 +202,8 @@
DAXDEV_ALIVE,
/* gate whether dax_flush() calls the low level flush routine */
DAXDEV_WRITE_CACHE,
+ /* flag to check if device supports synchronous flush */
+ DAXDEV_SYNC,
};
/**
@@ -284,6 +321,15 @@
}
EXPORT_SYMBOL_GPL(dax_direct_access);
+bool dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
+ int blocksize, sector_t start, sector_t len)
+{
+ if (!dax_alive(dax_dev))
+ return false;
+
+ return dax_dev->ops->dax_supported(dax_dev, bdev, blocksize, start, len);
+}
+
size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
size_t bytes, struct iov_iter *i)
{
@@ -335,6 +381,18 @@
}
EXPORT_SYMBOL_GPL(dax_write_cache_enabled);
+bool __dax_synchronous(struct dax_device *dax_dev)
+{
+ return test_bit(DAXDEV_SYNC, &dax_dev->flags);
+}
+EXPORT_SYMBOL_GPL(__dax_synchronous);
+
+void __set_dax_synchronous(struct dax_device *dax_dev)
+{
+ set_bit(DAXDEV_SYNC, &dax_dev->flags);
+}
+EXPORT_SYMBOL_GPL(__set_dax_synchronous);
+
bool dax_alive(struct dax_device *dax_dev)
{
lockdep_assert_held(&dax_srcu);
@@ -488,7 +546,7 @@
}
struct dax_device *alloc_dax(void *private, const char *__host,
- const struct dax_operations *ops)
+ const struct dax_operations *ops, unsigned long flags)
{
struct dax_device *dax_dev;
const char *host;
@@ -511,6 +569,9 @@
dax_add_host(dax_dev, host);
dax_dev->ops = ops;
dax_dev->private = private;
+ if (flags & DAXDEV_F_SYNC)
+ set_dax_synchronous(dax_dev);
+
return dax_dev;
err_dev:
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 8b8c123..9949fd9 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -447,6 +447,18 @@
If unsure, say N.
+config DM_INIT
+ bool "DM \"dm-mod.create=\" parameter support"
+ depends on BLK_DEV_DM=y
+ ---help---
+ Enable "dm-mod.create=" parameter to create mapped devices at init time.
+ This option is useful to allow mounting rootfs without requiring an
+ initramfs.
+ See Documentation/device-mapper/dm-init.txt for dm-mod.create="..."
+ format.
+
+ If unsure, say N.
+
config DM_UEVENT
bool "DM uevents"
depends on BLK_DEV_DM
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 822f4e8..a52b703 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -69,6 +69,10 @@
obj-$(CONFIG_DM_ZONED) += dm-zoned.o
obj-$(CONFIG_DM_WRITECACHE) += dm-writecache.o
+ifeq ($(CONFIG_DM_INIT),y)
+dm-mod-objs += dm-init.o
+endif
+
ifeq ($(CONFIG_DM_UEVENT),y)
dm-mod-objs += dm-uevent.o
endif
diff --git a/drivers/md/dm-init.c b/drivers/md/dm-init.c
new file mode 100644
index 0000000..6f06e6b
--- /dev/null
+++ b/drivers/md/dm-init.c
@@ -0,0 +1,554 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * dm-init.c
+ * Copyright (C) 2017 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ *
+ * This file is released under the GPLv2.
+ */
+
+#include <linux/ctype.h>
+#include <linux/device.h>
+#include <linux/device-mapper.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/moduleparam.h>
+
+#define DM_MSG_PREFIX "init"
+#define DM_MAX_DEVICES 256
+#define DM_MAX_TARGETS 256
+#define DM_MAX_STR_SIZE 4096
+
+static char *create;
+
+/*
+ * Format: dm-mod.create=<name>,<uuid>,<minor>,<flags>,<table>[,<table>+][;<name>,<uuid>,<minor>,<flags>,<table>[,<table>+]+]
+ * Table format: <start_sector> <num_sectors> <target_type> <target_args>
+ *
+ * See Documentation/device-mapper/dm-init.txt for dm-mod.create="..." format
+ * details.
+ */
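+
+/*
+ * Illustrative example (hypothetical device numbers): build a single "lroot"
+ * device from two linear targets and use it as the root device:
+ *
+ *   dm-mod.create="lroot,,,rw, 0 4096 linear 98:16 0, 4096 4096 linear 98:32 0"
+ *   root=/dev/dm-0
+ */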
+
+struct dm_device {
+ struct dm_ioctl dmi;
+ struct dm_target_spec *table[DM_MAX_TARGETS];
+ char *target_args_array[DM_MAX_TARGETS];
+ struct list_head list;
+};
+
+const char * const dm_allowed_targets[] __initconst = {
+ "crypt",
+ "delay",
+ "linear",
+ "snapshot-origin",
+ "striped",
+ "verity",
+};
+
+static int __init dm_verify_target_type(const char *target)
+{
+ unsigned int i;
+
+ for (i = 0; i < ARRAY_SIZE(dm_allowed_targets); i++) {
+ if (!strcmp(dm_allowed_targets[i], target))
+ return 0;
+ }
+ return -EINVAL;
+}
+
+static void __init dm_setup_cleanup(struct list_head *devices)
+{
+ struct dm_device *dev, *tmp;
+ unsigned int i;
+
+ list_for_each_entry_safe(dev, tmp, devices, list) {
+ list_del(&dev->list);
+ for (i = 0; i < dev->dmi.target_count; i++) {
+ kfree(dev->table[i]);
+ kfree(dev->target_args_array[i]);
+ }
+ kfree(dev);
+ }
+}
+
+/**
+ * str_field_delimit - delimit a string based on a separator char.
+ * @str: the pointer to the string to delimit.
+ * @separator: char that delimits the field
+ *
+ * Find a @separator and replace it by '\0'.
+ * Remove leading and trailing spaces.
+ * Return the remainder string after the @separator.
+ */
+static char __init *str_field_delimit(char **str, char separator)
+{
+ char *s;
+
+ /* TODO: add support for escaped characters */
+ *str = skip_spaces(*str);
+ s = strchr(*str, separator);
+ /* Delimit the field and remove trailing spaces */
+ if (s)
+ *s = '\0';
+ *str = strim(*str);
+ return s ? ++s : NULL;
+}
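+
+/*
+ * Worked example (follows directly from the code above): with str pointing at
+ * "  lroot , 0 4096 linear 98:16 0", str_field_delimit(&str, ',') leaves str
+ * pointing at the trimmed field "lroot" and returns " 0 4096 linear 98:16 0";
+ * the remainder's leading spaces are skipped on the next call.
+ */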
+
+/**
+ * dm_parse_table_entry - parse a table entry
+ * @dev: device to store the parsed information.
+ * @str: the pointer to a string with the format:
+ * <start_sector> <num_sectors> <target_type> <target_args>[, ...]
+ *
+ * Return the remainder string after the table entry, i.e., after the comma
+ * which delimits the entry, or NULL if the end of the string was reached.
+ */
+static char __init *dm_parse_table_entry(struct dm_device *dev, char *str)
+{
+ const unsigned int n = dev->dmi.target_count - 1;
+ struct dm_target_spec *sp;
+ unsigned int i;
+ /* fields: */
+ char *field[4];
+ char *next;
+
+ field[0] = str;
+ /* Delimit first 3 fields that are separated by space */
+ for (i = 0; i < ARRAY_SIZE(field) - 1; i++) {
+ field[i + 1] = str_field_delimit(&field[i], ' ');
+ if (!field[i + 1])
+ return ERR_PTR(-EINVAL);
+ }
+ /* Delimit last field that can be terminated by comma */
+ next = str_field_delimit(&field[i], ',');
+
+ sp = kzalloc(sizeof(*sp), GFP_KERNEL);
+ if (!sp)
+ return ERR_PTR(-ENOMEM);
+ dev->table[n] = sp;
+
+ /* start_sector */
+ if (kstrtoull(field[0], 0, &sp->sector_start))
+ return ERR_PTR(-EINVAL);
+ /* num_sector */
+ if (kstrtoull(field[1], 0, &sp->length))
+ return ERR_PTR(-EINVAL);
+ /* target_type */
+ strscpy(sp->target_type, field[2], sizeof(sp->target_type));
+ if (dm_verify_target_type(sp->target_type)) {
+ DMERR("invalid type \"%s\"", sp->target_type);
+ return ERR_PTR(-EINVAL);
+ }
+ /* target_args */
+ dev->target_args_array[n] = kstrndup(field[3], DM_MAX_STR_SIZE,
+ GFP_KERNEL);
+ if (!dev->target_args_array[n])
+ return ERR_PTR(-ENOMEM);
+
+ return next;
+}
+
+/**
+ * dm_parse_table - parse "dm-mod.create=" table field
+ * @dev: device to store the parsed information.
+ * @str: the pointer to a string with the format:
+ * <table>[,<table>+]
+ */
+static int __init dm_parse_table(struct dm_device *dev, char *str)
+{
+ char *table_entry = str;
+
+ while (table_entry) {
+ DMDEBUG("parsing table \"%s\"", str);
+ if (++dev->dmi.target_count > DM_MAX_TARGETS) {
+ DMERR("too many targets %u > %d",
+ dev->dmi.target_count, DM_MAX_TARGETS);
+ return -EINVAL;
+ }
+ table_entry = dm_parse_table_entry(dev, table_entry);
+ if (IS_ERR(table_entry)) {
+ DMERR("couldn't parse table");
+ return PTR_ERR(table_entry);
+ }
+ }
+
+ return 0;
+}
+
+/**
+ * dm_parse_device_entry - parse a device entry
+ * @dev: device to store the parsed information.
+ * @str: the pointer to a string with the format:
+ * name,uuid,minor,flags,table[; ...]
+ *
+ * Return the remainder string after the device entry, i.e., after the
+ * semi-colon which delimits the entry, or NULL if the end of the string was
+ * reached.
+ */
+static char __init *dm_parse_device_entry(struct dm_device *dev, char *str)
+{
+ /* There are 5 fields: name,uuid,minor,flags,table; */
+ char *field[5];
+ unsigned int i;
+ char *next;
+
+ field[0] = str;
+ /* Delimit first 4 fields that are separated by comma */
+ for (i = 0; i < ARRAY_SIZE(field) - 1; i++) {
+ field[i+1] = str_field_delimit(&field[i], ',');
+ if (!field[i+1])
+ return ERR_PTR(-EINVAL);
+ }
+ /* Delimit last field that can be delimited by semi-colon */
+ next = str_field_delimit(&field[i], ';');
+
+ /* name */
+ strscpy(dev->dmi.name, field[0], sizeof(dev->dmi.name));
+ /* uuid */
+ strscpy(dev->dmi.uuid, field[1], sizeof(dev->dmi.uuid));
+ /* minor */
+ if (strlen(field[2])) {
+ if (kstrtoull(field[2], 0, &dev->dmi.dev))
+ return ERR_PTR(-EINVAL);
+ dev->dmi.flags |= DM_PERSISTENT_DEV_FLAG;
+ }
+ /* flags */
+ if (!strcmp(field[3], "ro"))
+ dev->dmi.flags |= DM_READONLY_FLAG;
+ else if (strcmp(field[3], "rw"))
+ return ERR_PTR(-EINVAL);
+ /* table */
+ if (dm_parse_table(dev, field[4]))
+ return ERR_PTR(-EINVAL);
+
+ return next;
+}
+
+/**
+ * dm_parse_devices - parse "dm-mod.create=" argument
+ * @devices: list of struct dm_device to store the parsed information.
+ * @str: the pointer to a string with the format:
+ * <device>[;<device>+]
+ */
+static int __init dm_parse_devices(struct list_head *devices, char *str)
+{
+ unsigned long ndev = 0;
+ struct dm_device *dev;
+ char *device = str;
+
+ DMDEBUG("parsing \"%s\"", str);
+ while (device) {
+ dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+ if (!dev)
+ return -ENOMEM;
+ list_add_tail(&dev->list, devices);
+
+ if (++ndev > DM_MAX_DEVICES) {
+ DMERR("too many devices %lu > %d",
+ ndev, DM_MAX_DEVICES);
+ return -EINVAL;
+ }
+
+ device = dm_parse_device_entry(dev, device);
+ if (IS_ERR(device)) {
+ DMERR("couldn't parse device");
+ return PTR_ERR(device);
+ }
+ }
+
+ return 0;
+}
+
+/**
+ * dm_init_init - parse "dm-mod.create=" argument and configure drivers
+ */
+static int __init dm_init_init(void)
+{
+ struct dm_device *dev;
+ LIST_HEAD(devices);
+ char *str;
+ int r;
+
+ if (!create)
+ return 0;
+
+ if (strlen(create) >= DM_MAX_STR_SIZE) {
+ DMERR("Argument is too big. Limit is %d\n", DM_MAX_STR_SIZE);
+ return -EINVAL;
+ }
+ str = kstrndup(create, DM_MAX_STR_SIZE, GFP_KERNEL);
+ if (!str)
+ return -ENOMEM;
+
+ r = dm_parse_devices(&devices, str);
+ if (r)
+ goto out;
+
+ DMINFO("waiting for all devices to be available before creating mapped devices\n");
+ wait_for_device_probe();
+
+ list_for_each_entry(dev, &devices, list) {
+ if (dm_early_create(&dev->dmi, dev->table,
+ dev->target_args_array))
+ break;
+ }
+out:
+ kfree(str);
+ dm_setup_cleanup(&devices);
+ return r;
+}
+
+late_initcall(dm_init_init);
+
+module_param(create, charp, 0);
+MODULE_PARM_DESC(create, "Create a mapped device in early boot");
+
+/* ---------------------------------------------------------------
+ * ChromeOS shim - convert dm= format to dm-mod.create= format
+ * ---------------------------------------------------------------
+ */
+
+struct dm_chrome_target {
+ char *field[4];
+};
+
+struct dm_chrome_dev {
+ char *name, *uuid, *mode;
+ unsigned int num_targets;
+ struct dm_chrome_target targets[DM_MAX_TARGETS];
+};
+
+static char __init *dm_chrome_parse_target(char *str, struct dm_chrome_target *tgt)
+{
+ unsigned int i;
+
+ tgt->field[0] = str;
+ /* Delimit first 3 fields that are separated by space */
+ for (i = 0; i < ARRAY_SIZE(tgt->field) - 1; i++) {
+ tgt->field[i + 1] = str_field_delimit(&tgt->field[i], ' ');
+ if (!tgt->field[i + 1])
+ return NULL;
+ }
+ /* Delimit last field that can be terminated by comma */
+ return str_field_delimit(&tgt->field[i], ',');
+}
+
+static char __init *dm_chrome_parse_dev(char *str, struct dm_chrome_dev *dev)
+{
+ char *target, *num;
+ unsigned int i;
+
+ if (!str)
+ return ERR_PTR(-EINVAL);
+
+ target = str_field_delimit(&str, ',');
+ if (!target)
+ return ERR_PTR(-EINVAL);
+
+ /* Delimit first 3 fields that are separated by space */
+ dev->name = str;
+ dev->uuid = str_field_delimit(&dev->name, ' ');
+ if (!dev->uuid)
+ return ERR_PTR(-EINVAL);
+
+ dev->mode = str_field_delimit(&dev->uuid, ' ');
+ if (!dev->mode)
+ return ERR_PTR(-EINVAL);
+
+ /* num is optional */
+ num = str_field_delimit(&dev->mode, ' ');
+ if (!num)
+ dev->num_targets = 1;
+ else {
+ /* Delimit num and check that it is the last field */
+ if (str_field_delimit(&num, ' '))
+ return ERR_PTR(-EINVAL);
+ if (kstrtouint(num, 0, &dev->num_targets))
+ return ERR_PTR(-EINVAL);
+ }
+
+ if (dev->num_targets > DM_MAX_TARGETS) {
+ DMERR("too many targets %u > %d",
+ dev->num_targets, DM_MAX_TARGETS);
+ return ERR_PTR(-EINVAL);
+ }
+
+ for (i = 0; i < dev->num_targets - 1; i++) {
+ target = dm_chrome_parse_target(target, &dev->targets[i]);
+ if (!target)
+ return ERR_PTR(-EINVAL);
+ }
+ /* The last one can return NULL if it reaches the end of str */
+ return dm_chrome_parse_target(target, &dev->targets[i]);
+}
+
+static char __init *dm_chrome_convert(struct dm_chrome_dev *devs, unsigned int num_devs)
+{
+ char *str = kmalloc(DM_MAX_STR_SIZE, GFP_KERNEL);
+ char *p = str;
+ unsigned int i, j;
+ int ret;
+
+ if (!str)
+ return ERR_PTR(-ENOMEM);
+
+ for (i = 0; i < num_devs; i++) {
+ if (!strcmp(devs[i].uuid, "none"))
+ devs[i].uuid = "";
+ ret = snprintf(p, DM_MAX_STR_SIZE - (p - str),
+ "%s,%s,,%s",
+ devs[i].name,
+ devs[i].uuid,
+ devs[i].mode);
+ if (ret < 0)
+ goto out;
+ p += ret;
+
+ for (j = 0; j < devs[i].num_targets; j++) {
+ ret = snprintf(p, DM_MAX_STR_SIZE - (p - str),
+ ",%s %s %s %s",
+ devs[i].targets[j].field[0],
+ devs[i].targets[j].field[1],
+ devs[i].targets[j].field[2],
+ devs[i].targets[j].field[3]);
+ if (ret < 0)
+ goto out;
+ p += ret;
+ }
+ if (i < num_devs - 1) {
+ ret = snprintf(p, DM_MAX_STR_SIZE - (p - str), ";");
+ if (ret < 0)
+ goto out;
+ p += ret;
+ }
+ }
+
+ return str;
+
+out:
+ kfree(str);
+ return ERR_PTR(ret);
+}
+
+/**
+ * dm_chrome_shim - convert old dm= format used in chromeos to the new
+ * upstream format.
+ *
+ * ChromeOS old format
+ * -------------------
+ * <device> ::= [<num>] <device-mapper>+
+ * <device-mapper> ::= <head> "," <target>+
+ * <head> ::= <name> <uuid> <mode> [<num>]
+ * <target> ::= <start> <length> <type> <options> ","
+ * <mode> ::= "ro" | "rw"
+ * <uuid> ::= xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx | "none"
+ * <type> ::= "verity" | "bootcache" | ...
+ *
+ * Example:
+ * 2 vboot none ro 1,
+ * 0 1768000 bootcache
+ * device=aa55b119-2a47-8c45-946a-5ac57765011f+1
+ * signature=76e9be054b15884a9fa85973e9cb274c93afadb6
+ * cache_start=1768000 max_blocks=100000 size_limit=23 max_trace=20000,
+ * vroot none ro 1,
+ * 0 1740800 verity payload=254:0 hashtree=254:0 hashstart=1740800 alg=sha1
+ * root_hexdigest=76e9be054b15884a9fa85973e9cb274c93afadb6
+ * salt=5b3549d54d6c7a3837b9b81ed72e49463a64c03680c47835bef94d768e5646fe
+ *
+ * Notes:
+ * 1. uuid is a label for the device and we set it to "none".
+ * 2. The <num> field will be optional initially and assumed to be 1.
+ * Once all the scripts that set these fields have been updated, it
+ * will be made mandatory.
+ */
+
+static char *chrome_create;
+
+static int __init dm_chrome_shim(char *arg)
+{
+ if (!arg || create)
+ return -EINVAL;
+ chrome_create = arg;
+ return 0;
+}
+
+static int __init dm_chrome_parse_devices(void)
+{
+ struct dm_chrome_dev *devs;
+ unsigned int num_devs, i;
+ char *next, *base_str;
+ int ret = 0;
+
+ /* Verify if dm-mod.create was not used */
+ if (!chrome_create || create)
+ return -EINVAL;
+
+ if (strlen(chrome_create) >= DM_MAX_STR_SIZE) {
+ DMERR("Argument is too big. Limit is %d\n", DM_MAX_STR_SIZE);
+ return -EINVAL;
+ }
+
+ base_str = kstrdup(chrome_create, GFP_KERNEL);
+ if (!base_str)
+ return -ENOMEM;
+
+ next = str_field_delimit(&base_str, ' ');
+ if (!next) {
+ ret = -EINVAL;
+ goto out_str;
+ }
+
+ /* if first field is not the optional <num> field */
+ if (kstrtouint(base_str, 0, &num_devs)) {
+ num_devs = 1;
+ /* rewind next pointer */
+ next = base_str;
+ }
+
+ if (num_devs > DM_MAX_DEVICES) {
+ DMERR("too many devices %u > %d", num_devs, DM_MAX_DEVICES);
+ ret = -EINVAL;
+ goto out_str;
+ }
+
+ devs = kcalloc(num_devs, sizeof(*devs), GFP_KERNEL);
+ if (!devs) {
+ ret = -ENOMEM;
+ goto out_str;
+ }
+
+ /* restore string */
+ strcpy(base_str, chrome_create);
+
+ /* parse devices */
+ for (i = 0; i < num_devs; i++) {
+ next = dm_chrome_parse_dev(next, &devs[i]);
+ if (IS_ERR(next)) {
+ DMERR("couldn't parse device");
+ ret = PTR_ERR(next);
+ goto out_devs;
+ }
+ }
+
+ create = dm_chrome_convert(devs, num_devs);
+ if (IS_ERR(create)) {
+ ret = PTR_ERR(create);
+ goto out_devs;
+ }
+
+ DMDEBUG("Converting:\n\tdm=\"%s\"\n\tdm-mod.create=\"%s\"\n",
+ chrome_create, create);
+
+ /* Call upstream code */
+ dm_init_init();
+
+ kfree(create);
+
+out_devs:
+ create = NULL;
+ kfree(devs);
+out_str:
+ kfree(base_str);
+
+ return ret;
+}
+
+late_initcall(dm_chrome_parse_devices);
+
+__setup("dm=", dm_chrome_shim);
diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index f666778..1e03bc8 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -2018,3 +2018,110 @@
return r;
}
+
+
+/**
+ * dm_early_create - create a mapped device in early boot.
+ *
+ * @dmi: Contains main information of the device mapping to be created.
+ * @spec_array: array of pointers to struct dm_target_spec. Describes the
+ * mapping table of the device.
+ * @target_params_array: array of strings with the parameters to a specific
+ * target.
+ *
+ * Instead of having the struct dm_target_spec and the parameters for every
+ * target embedded at the end of struct dm_ioctl (as performed in a normal
+ * ioctl), pass them as arguments, so the caller doesn't need to serialize them.
+ * The size of the spec_array and target_params_array is given by
+ * @dmi->target_count.
+ * This function is supposed to be called in early boot, so locking mechanisms
+ * to protect against concurrent loads are not required.
+ */
+int __init dm_early_create(struct dm_ioctl *dmi,
+ struct dm_target_spec **spec_array,
+ char **target_params_array)
+{
+ int r, m = DM_ANY_MINOR;
+ struct dm_table *t, *old_map;
+ struct mapped_device *md;
+ unsigned int i;
+
+ if (!dmi->target_count)
+ return -EINVAL;
+
+ r = check_name(dmi->name);
+ if (r)
+ return r;
+
+ if (dmi->flags & DM_PERSISTENT_DEV_FLAG)
+ m = MINOR(huge_decode_dev(dmi->dev));
+
+ /* alloc dm device */
+ r = dm_create(m, &md);
+ if (r)
+ return r;
+
+ /* hash insert */
+ r = dm_hash_insert(dmi->name, *dmi->uuid ? dmi->uuid : NULL, md);
+ if (r)
+ goto err_destroy_dm;
+
+ /* alloc table */
+ r = dm_table_create(&t, get_mode(dmi), dmi->target_count, md);
+ if (r)
+ goto err_hash_remove;
+
+ /* add targets */
+ for (i = 0; i < dmi->target_count; i++) {
+ r = dm_table_add_target(t, spec_array[i]->target_type,
+ (sector_t) spec_array[i]->sector_start,
+ (sector_t) spec_array[i]->length,
+ target_params_array[i]);
+ if (r) {
+ DMWARN("error adding target to table");
+ goto err_destroy_table;
+ }
+ }
+
+ /* finish table */
+ r = dm_table_complete(t);
+ if (r)
+ goto err_destroy_table;
+
+ md->type = dm_table_get_type(t);
+ /* setup md->queue to reflect md's type (may block) */
+ r = dm_setup_md_queue(md, t);
+ if (r) {
+ DMWARN("unable to set up device queue for new table.");
+ goto err_destroy_table;
+ }
+
+ /* Set new map */
+ dm_suspend(md, 0);
+ old_map = dm_swap_table(md, t);
+ if (IS_ERR(old_map)) {
+ r = PTR_ERR(old_map);
+ goto err_destroy_table;
+ }
+ set_disk_ro(dm_disk(md), !!(dmi->flags & DM_READONLY_FLAG));
+
+ /* resume device */
+ r = dm_resume(md);
+ if (r)
+ goto err_destroy_table;
+
+ DMINFO("%s (%s) is ready", md->disk->disk_name, dmi->name);
+ dm_put(md);
+ return 0;
+
+err_destroy_table:
+ dm_table_destroy(t);
+err_hash_remove:
+ (void) __hash_remove(__get_name_cell(dmi->name));
+ /* release reference from __get_name_cell */
+ dm_put(md);
+err_destroy_dm:
+ dm_put(md);
+ dm_destroy(md);
+ return r;
+}
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 36275c5..eed37b6 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -882,13 +882,25 @@
}
EXPORT_SYMBOL_GPL(dm_table_set_type);
-static int device_supports_dax(struct dm_target *ti, struct dm_dev *dev,
- sector_t start, sector_t len, void *data)
+/* validate the dax capability of the target device span */
+int device_supports_dax(struct dm_target *ti, struct dm_dev *dev,
+ sector_t start, sector_t len, void *data)
{
- return bdev_dax_supported(dev->bdev, PAGE_SIZE);
+ int blocksize = *(int *) data;
+
+ return generic_fsdax_supported(dev->dax_dev, dev->bdev, blocksize,
+ start, len);
}
-static bool dm_table_supports_dax(struct dm_table *t)
+/* Check devices support synchronous DAX */
+static int device_synchronous(struct dm_target *ti, struct dm_dev *dev,
+ sector_t start, sector_t len, void *data)
+{
+ return dev->dax_dev && dax_synchronous(dev->dax_dev);
+}
+
+bool dm_table_supports_dax(struct dm_table *t,
+ iterate_devices_callout_fn iterate_fn, int *blocksize)
{
struct dm_target *ti;
unsigned i;
@@ -901,7 +913,7 @@
return false;
if (!ti->type->iterate_devices ||
- !ti->type->iterate_devices(ti, device_supports_dax, NULL))
+ !ti->type->iterate_devices(ti, iterate_fn, blocksize))
return false;
}
@@ -937,6 +949,7 @@
struct dm_target *tgt;
struct list_head *devices = dm_table_get_devices(t);
enum dm_queue_mode live_md_type = dm_get_md_type(t->md);
+ int page_size = PAGE_SIZE;
if (t->type != DM_TYPE_NONE) {
/* target already set the table's type */
@@ -981,7 +994,7 @@
verify_bio_based:
/* We must use this table as bio-based */
t->type = DM_TYPE_BIO_BASED;
- if (dm_table_supports_dax(t) ||
+ if (dm_table_supports_dax(t, device_supports_dax, &page_size) ||
(list_empty(devices) && live_md_type == DM_TYPE_DAX_BIO_BASED)) {
t->type = DM_TYPE_DAX_BIO_BASED;
} else {
@@ -1909,6 +1922,7 @@
struct queue_limits *limits)
{
bool wc = false, fua = false;
+ int page_size = PAGE_SIZE;
/*
* Copy table's limits to the DM device's request_queue
@@ -1936,8 +1950,11 @@
}
blk_queue_write_cache(q, wc, fua);
- if (dm_table_supports_dax(t))
+ if (dm_table_supports_dax(t, device_supports_dax, &page_size)) {
blk_queue_flag_set(QUEUE_FLAG_DAX, q);
+ if (dm_table_supports_dax(t, device_synchronous, NULL))
+ set_dax_synchronous(t->md->dax_dev);
+ }
else
blk_queue_flag_clear(QUEUE_FLAG_DAX, q);
diff --git a/drivers/md/dm-verity-fec.c b/drivers/md/dm-verity-fec.c
index bb83279..6c6493c 100644
--- a/drivers/md/dm-verity-fec.c
+++ b/drivers/md/dm-verity-fec.c
@@ -11,6 +11,7 @@
#include "dm-verity-fec.h"
#include <linux/math64.h>
+#include <linux/sysfs.h>
#define DM_MSG_PREFIX "verity-fec"
@@ -175,9 +176,11 @@
if (r < 0 && neras)
DMERR_LIMIT("%s: FEC %llu: failed to correct: %d",
v->data_dev->name, (unsigned long long)rsb, r);
- else if (r > 0)
+ else if (r > 0) {
DMWARN_LIMIT("%s: FEC %llu: corrected %d errors",
v->data_dev->name, (unsigned long long)rsb, r);
+ atomic_add_unless(&v->fec->corrected, 1, INT_MAX);
+ }
return r;
}
@@ -545,6 +548,7 @@
void verity_fec_dtr(struct dm_verity *v)
{
struct dm_verity_fec *f = v->fec;
+ struct kobject *kobj = &f->kobj_holder.kobj;
if (!verity_fec_is_enabled(v))
goto out;
@@ -562,6 +566,12 @@
if (f->dev)
dm_put_device(v->ti, f->dev);
+
+ if (kobj->state_initialized) {
+ kobject_put(kobj);
+ wait_for_completion(dm_get_completion_from_kobject(kobj));
+ }
+
out:
kfree(f);
v->fec = NULL;
@@ -650,6 +660,28 @@
return 0;
}
+static ssize_t corrected_show(struct kobject *kobj, struct kobj_attribute *attr,
+ char *buf)
+{
+ struct dm_verity_fec *f = container_of(kobj, struct dm_verity_fec,
+ kobj_holder.kobj);
+
+ return sprintf(buf, "%d\n", atomic_read(&f->corrected));
+}
+
+static struct kobj_attribute attr_corrected = __ATTR_RO(corrected);
+
+static struct attribute *fec_attrs[] = {
+ &attr_corrected.attr,
+ NULL
+};
+
+static struct kobj_type fec_ktype = {
+ .sysfs_ops = &kobj_sysfs_ops,
+ .default_attrs = fec_attrs,
+ .release = dm_kobject_release
+};
+
/*
* Allocate dm_verity_fec for v->fec. Must be called before verity_fec_ctr.
*/
@@ -673,8 +705,10 @@
*/
int verity_fec_ctr(struct dm_verity *v)
{
+ int r;
struct dm_verity_fec *f = v->fec;
struct dm_target *ti = v->ti;
+ struct mapped_device *md = dm_table_get_md(ti->table);
u64 hash_blocks;
int ret;
@@ -683,6 +717,16 @@
return 0;
}
+ /* Create a kobject and sysfs attributes */
+ init_completion(&f->kobj_holder.completion);
+
+ r = kobject_init_and_add(&f->kobj_holder.kobj, &fec_ktype,
+ &disk_to_dev(dm_disk(md))->kobj, "%s", "fec");
+ if (r) {
+ ti->error = "Cannot create kobject";
+ return r;
+ }
+
/*
* FEC is computed over data blocks, possible metadata, and
* hash blocks. In other words, FEC covers total of fec_blocks
diff --git a/drivers/md/dm-verity-fec.h b/drivers/md/dm-verity-fec.h
index 6ad803b..93af417 100644
--- a/drivers/md/dm-verity-fec.h
+++ b/drivers/md/dm-verity-fec.h
@@ -12,6 +12,8 @@
#ifndef DM_VERITY_FEC_H
#define DM_VERITY_FEC_H
+#include "dm.h"
+#include "dm-core.h"
#include "dm-verity.h"
#include <linux/rslib.h>
@@ -51,6 +53,8 @@
mempool_t extra_pool; /* mempool for extra buffers */
mempool_t output_pool; /* mempool for output */
struct kmem_cache *cache; /* cache for buffers */
+ atomic_t corrected; /* corrected errors */
+ struct dm_kobject_holder kobj_holder; /* for sysfs attributes */
};
/* per-bio data */
diff --git a/drivers/md/dm-verity-target.c b/drivers/md/dm-verity-target.c
index e3599b4..6332a83 100644
--- a/drivers/md/dm-verity-target.c
+++ b/drivers/md/dm-verity-target.c
@@ -17,8 +17,12 @@
#include "dm-verity.h"
#include "dm-verity-fec.h"
+#include <linux/async.h>
+#include <linux/delay.h>
+#include <linux/device-mapper.h>
#include <linux/module.h>
#include <linux/reboot.h>
+#include <crypto/hash.h>
#define DM_MSG_PREFIX "verity"
@@ -28,6 +32,7 @@
#define DM_VERITY_DEFAULT_PREFETCH_SIZE 262144
#define DM_VERITY_MAX_CORRUPTED_ERRS 100
+#define DM_VERITY_NUM_POSITIONAL_ARGS 10
#define DM_VERITY_OPT_LOGGING "ignore_corruption"
#define DM_VERITY_OPT_RESTART "restart_on_corruption"
@@ -47,6 +52,118 @@
unsigned n_blocks;
};
+/* Provide a lightweight means of specifying the global default for
+ * error behavior: eio, panic, none, or notify.
+ * Legacy support for 0 = eio, 1 = reboot/panic, 2 = none, 3 = notify.
+ * This is matched to the enum in dm-verity.h.
+ */
+static const char *allowed_error_behaviors[] = { "eio", "panic", "none",
+ "notify", NULL };
+static char *error_behavior = "eio";
+module_param(error_behavior, charp, 0644);
+MODULE_PARM_DESC(error_behavior, "Behavior on error "
+ "(eio, panic, none, notify)");
+
+/* Controls whether verity_get_device will wait forever for a device. */
+static int dev_wait;
+module_param(dev_wait, int, 0444);
+MODULE_PARM_DESC(dev_wait, "Wait forever for a backing device");
+
+static BLOCKING_NOTIFIER_HEAD(verity_error_notifier);
+
+int dm_verity_register_error_notifier(struct notifier_block *nb)
+{
+ return blocking_notifier_chain_register(&verity_error_notifier, nb);
+}
+EXPORT_SYMBOL_GPL(dm_verity_register_error_notifier);
+
+int dm_verity_unregister_error_notifier(struct notifier_block *nb)
+{
+ return blocking_notifier_chain_unregister(&verity_error_notifier, nb);
+}
+EXPORT_SYMBOL_GPL(dm_verity_unregister_error_notifier);
+
+/* If the request is not successful, this handler takes action.
+ * TODO make this call a registered handler.
+ */
+static void verity_error(struct dm_verity *v, struct dm_verity_io *io,
+ blk_status_t status)
+{
+ const char *message = v->hash_failed ? "integrity" : "block";
+ int error_behavior = DM_VERITY_ERROR_BEHAVIOR_PANIC;
+ dev_t devt = 0;
+ u64 block = ~0;
+ struct dm_verity_error_state error_state;
+ /* If the hash did not fail, then this is likely transient. */
+ int transient = !v->hash_failed;
+
+ devt = v->data_dev->bdev->bd_dev;
+ error_behavior = v->error_behavior;
+
+ DMERR_LIMIT("verification failure occurred: %s failure", message);
+
+ if (error_behavior == DM_VERITY_ERROR_BEHAVIOR_NOTIFY) {
+ error_state.code = status;
+ error_state.transient = transient;
+ error_state.block = block;
+ error_state.message = message;
+ error_state.dev_start = v->data_start;
+ error_state.dev_len = v->data_blocks;
+ error_state.dev = v->data_dev->bdev;
+ error_state.hash_dev_start = v->hash_start;
+ error_state.hash_dev_len = v->hash_blocks;
+ error_state.hash_dev = v->hash_dev->bdev;
+
+ /* Set default fallthrough behavior. */
+ error_state.behavior = DM_VERITY_ERROR_BEHAVIOR_PANIC;
+ error_behavior = DM_VERITY_ERROR_BEHAVIOR_PANIC;
+
+ if (!blocking_notifier_call_chain(
+ &verity_error_notifier, transient, &error_state)) {
+ error_behavior = error_state.behavior;
+ }
+ }
+
+ switch (error_behavior) {
+ case DM_VERITY_ERROR_BEHAVIOR_EIO:
+ break;
+ case DM_VERITY_ERROR_BEHAVIOR_NONE:
+ break;
+ default:
+ if (!transient)
+ goto do_panic;
+ }
+ return;
+
+do_panic:
+ panic("dm-verity failure: "
+ "device:%u:%u status:%d block:%llu message:%s",
+ MAJOR(devt), MINOR(devt), status, (u64)block, message);
+}
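A minimal sketch of a client for the notifier hook above (the callback and variable names are illustrative only, not part of this patch): verity_error() adopts error_state.behavior only when the call chain returns 0, so a subscriber that wants its choice honored returns NOTIFY_DONE.

    #include <linux/notifier.h>
    #include "dm-verity.h"

    /* Downgrade the default panic to EIO when verification fails. */
    static int example_verity_error_cb(struct notifier_block *nb,
                                       unsigned long transient, void *data)
    {
        struct dm_verity_error_state *state = data;

        state->behavior = DM_VERITY_ERROR_BEHAVIOR_EIO;
        return NOTIFY_DONE;    /* chain result 0 => behavior is honored */
    }

    static struct notifier_block example_verity_nb = {
        .notifier_call = example_verity_error_cb,
    };

    /* Paired with dm_verity_register_error_notifier(&example_verity_nb) at
     * module init and dm_verity_unregister_error_notifier() on exit.
     */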
+
+/**
+ * verity_parse_error_behavior - parse a behavior charp to the enum
+ * @behavior: NUL-terminated char array
+ *
+ * Checks if the behavior is valid either as text or as an index digit
+ * and returns the proper enum value or -1 on error.
+ */
+static int verity_parse_error_behavior(const char *behavior)
+{
+ const char **allowed = allowed_error_behaviors;
+ char index = '0';
+
+ for (; *allowed; allowed++, index++)
+ if (!strcmp(*allowed, behavior) || behavior[0] == index)
+ break;
+
+ if (!*allowed)
+ return -1;
+
+ /* Convert to the integer index matching the enum. */
+ return allowed - allowed_error_behaviors;
+}
+
/*
* Auxiliary structure appended to each dm-bufio buffer. If the value
* hash_verified is nonzero, hash of the block has been verified.
@@ -541,6 +658,8 @@
struct dm_verity *v = io->v;
struct bio *bio = dm_bio_from_per_bio_data(io, v->ti->per_io_data_size);
+ if (status && !verity_fec_is_enabled(io->v))
+ verity_error(v, io, status);
bio->bi_end_io = io->orig_bi_end_io;
bio->bi_status = status;
@@ -564,7 +683,6 @@
verity_finish_io(io, bio->bi_status);
return;
}
-
INIT_WORK(&io->work, verity_work);
queue_work(io->v->verify_wq, &io->work);
}
@@ -913,6 +1031,187 @@
return r;
}
+static int verity_get_device(struct dm_target *ti, const char *devname,
+ struct dm_dev **dm_dev)
+{
+ do {
+ /* Try the normal path first since if everything is ready, it
+ * will be the fastest.
+ */
+ if (!dm_get_device(ti, devname, /*FMODE_READ*/
+ dm_table_get_mode(ti->table), dm_dev))
+ return 0;
+
+ /* No need to be too aggressive since this is a slow path. */
+ msleep(500);
+ } while (dev_wait && (driver_probe_done() != 0 || *dm_dev == NULL));
+ async_synchronize_full();
+ return -1;
+}
+
+struct verity_args {
+ int version;
+ char *data_device;
+ char *hash_device;
+ int data_block_size_bits;
+ int hash_block_size_bits;
+ u64 num_data_blocks;
+ u64 hash_start_block;
+ char *algorithm;
+ char *digest;
+ char *salt;
+ char *error_behavior;
+};
+
+static void pr_args(struct verity_args *args)
+{
+ printk(KERN_INFO "VERITY args: version=%d data_device=%s hash_device=%s"
+ " data_block_size_bits=%d hash_block_size_bits=%d"
+ " num_data_blocks=%lld hash_start_block=%lld"
+ " algorithm=%s digest=%s salt=%s error_behavior=%s\n",
+ args->version,
+ args->data_device,
+ args->hash_device,
+ args->data_block_size_bits,
+ args->hash_block_size_bits,
+ args->num_data_blocks,
+ args->hash_start_block,
+ args->algorithm,
+ args->digest,
+ args->salt,
+ args->error_behavior);
+}
+
+/*
+ * positional_args - collects the arguments using the positional
+ * parameters.
+ * arg# - parameter
+ * 0 - version
+ * 1 - data device
+ * 2 - hash device - may be same as data device
+ * 3 - data block size log2
+ * 4 - hash block size log2
+ * 5 - number of data blocks
+ * 6 - hash start block
+ * 7 - algorithm
+ * 8 - digest
+ * 9 - salt
+ */
+static char *positional_args(unsigned argc, char **argv,
+ struct verity_args *args)
+{
+ unsigned num;
+ unsigned long long num_ll;
+ char dummy;
+
+ if (argc < DM_VERITY_NUM_POSITIONAL_ARGS)
+ return "Invalid argument count: at least 10 arguments required";
+
+ if (sscanf(argv[0], "%u%c", &num, &dummy) != 1 ||
+ num > 1)
+ return "Invalid version";
+ args->version = num;
+
+ args->data_device = argv[1];
+ args->hash_device = argv[2];
+
+ if (sscanf(argv[3], "%u%c", &num, &dummy) != 1 ||
+ !num || (num & (num - 1)) ||
+ num > PAGE_SIZE)
+ return "Invalid data device block size";
+ args->data_block_size_bits = ffs(num) - 1;
+
+ if (sscanf(argv[4], "%u%c", &num, &dummy) != 1 ||
+ !num || (num & (num - 1)) ||
+ num > INT_MAX)
+ return "Invalid hash device block size";
+ args->hash_block_size_bits = ffs(num) - 1;
+
+ if (sscanf(argv[5], "%llu%c", &num_ll, &dummy) != 1 ||
+ (sector_t)(num_ll << (args->data_block_size_bits - SECTOR_SHIFT))
+ >> (args->data_block_size_bits - SECTOR_SHIFT) != num_ll)
+ return "Invalid data blocks";
+ args->num_data_blocks = num_ll;
+
+ if (sscanf(argv[6], "%llu%c", &num_ll, &dummy) != 1 ||
+ (sector_t)(num_ll << (args->hash_block_size_bits - SECTOR_SHIFT))
+ >> (args->hash_block_size_bits - SECTOR_SHIFT) != num_ll)
+ return "Invalid hash start";
+ args->hash_start_block = num_ll;
+
+ args->algorithm = argv[7];
+ args->digest = argv[8];
+ args->salt = argv[9];
+
+ return NULL;
+}
+
+static void splitarg(char *arg, char **key, char **val)
+{
+ *key = strsep(&arg, "=");
+ *val = strsep(&arg, "");
+}
+
+static char *chromeos_args(unsigned argc, char **argv, struct verity_args *args)
+{
+ char *key, *val;
+ unsigned long num;
+ int i;
+
+ args->version = 0;
+ args->data_block_size_bits = 12;
+ args->hash_block_size_bits = 12;
+ for (i = 0; i < argc; ++i) {
+ DMDEBUG("Argument %d: '%s'", i, argv[i]);
+ splitarg(argv[i], &key, &val);
+ if (!key) {
+ DMWARN("Bad argument %d: missing key?", i);
+ return "Bad argument: missing key";
+ }
+ if (!val) {
+ DMWARN("Bad argument %d='%s': missing value", i, key);
+ return "Bad argument: missing value";
+ }
+ if (!strcmp(key, "alg")) {
+ args->algorithm = val;
+ } else if (!strcmp(key, "payload")) {
+ args->data_device = val;
+ } else if (!strcmp(key, "hashtree")) {
+ args->hash_device = val;
+ } else if (!strcmp(key, "root_hexdigest")) {
+ args->digest = val;
+ } else if (!strcmp(key, "hashstart")) {
+ if (kstrtoul(val, 10, &num))
+ return "Invalid hashstart";
+ args->hash_start_block =
+ num >> (args->hash_block_size_bits - SECTOR_SHIFT);
+ args->num_data_blocks = args->hash_start_block;
+ } else if (!strcmp(key, "error_behavior")) {
+ args->error_behavior = val;
+ } else if (!strcmp(key, "salt")) {
+ args->salt = val;
+ }
+ }
+ if (!args->salt)
+ args->salt = "";
+
+#define NEEDARG(n) \
+ if (!(n)) { \
+ return "Missing argument: " #n; \
+ }
+
+ NEEDARG(args->algorithm);
+ NEEDARG(args->data_device);
+ NEEDARG(args->hash_device);
+ NEEDARG(args->digest);
+
+#undef NEEDARG
+ return NULL;
+}
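In practice the key=value form looks like this (device paths, sizes and digests below are illustrative, not taken from this patch):

    alg=sha256 payload=/dev/vda3 hashtree=/dev/vda3 hashstart=2097152 root_hexdigest=<64 hex chars> salt=<hex> error_behavior=eio

splitarg() performs the key/value cut with two strsep() calls; a standalone user-space sketch of the same splitting logic:

    #define _DEFAULT_SOURCE        /* for strsep() on glibc */
    #include <stdio.h>
    #include <string.h>

    static void splitarg(char *arg, char **key, char **val)
    {
        *key = strsep(&arg, "=");  /* text before the first '=' */
        *val = strsep(&arg, "");   /* remainder, or NULL if there was no '=' */
    }

    int main(void)
    {
        char arg[] = "error_behavior=eio";
        char *key, *val;

        splitarg(arg, &key, &val);
        printf("key=%s val=%s\n", key, val);   /* key=error_behavior val=eio */
        return 0;
    }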
+
/*
* Target parameters:
* <version> The current format is version 1.
@@ -929,14 +1228,22 @@
*/
static int verity_ctr(struct dm_target *ti, unsigned argc, char **argv)
{
+ struct verity_args args = { 0 };
struct dm_verity *v;
struct dm_arg_set as;
- unsigned int num;
- unsigned long long num_ll;
int r;
int i;
sector_t hash_position;
- char dummy;
+
+ args.error_behavior = error_behavior;
+ if (argc >= DM_VERITY_NUM_POSITIONAL_ARGS)
+ ti->error = positional_args(argc, argv, &args);
+ else
+ ti->error = chromeos_args(argc, argv, &args);
+ if (ti->error)
+ return -EINVAL;
+ if (0)
+ pr_args(&args);
v = kzalloc(sizeof(struct dm_verity), GFP_KERNEL);
if (!v) {
@@ -949,84 +1256,46 @@
r = verity_fec_ctr_alloc(v);
if (r)
goto bad;
+ v->version = args.version;
- if ((dm_table_get_mode(ti->table) & ~FMODE_READ)) {
- ti->error = "Device must be readonly";
- r = -EINVAL;
- goto bad;
- }
-
- if (argc < 10) {
- ti->error = "Not enough arguments";
- r = -EINVAL;
- goto bad;
- }
-
- if (sscanf(argv[0], "%u%c", &num, &dummy) != 1 ||
- num > 1) {
- ti->error = "Invalid version";
- r = -EINVAL;
- goto bad;
- }
- v->version = num;
-
- r = dm_get_device(ti, argv[1], FMODE_READ, &v->data_dev);
+ r = verity_get_device(ti, args.data_device, &v->data_dev);
if (r) {
ti->error = "Data device lookup failed";
goto bad;
}
- r = dm_get_device(ti, argv[2], FMODE_READ, &v->hash_dev);
+ r = verity_get_device(ti, args.hash_device, &v->hash_dev);
if (r) {
ti->error = "Hash device lookup failed";
goto bad;
}
- if (sscanf(argv[3], "%u%c", &num, &dummy) != 1 ||
- !num || (num & (num - 1)) ||
- num < bdev_logical_block_size(v->data_dev->bdev) ||
- num > PAGE_SIZE) {
+ v->data_dev_block_bits = args.data_block_size_bits;
+ if ((1 << v->data_dev_block_bits) <
+ bdev_logical_block_size(v->data_dev->bdev)) {
ti->error = "Invalid data device block size";
r = -EINVAL;
goto bad;
}
- v->data_dev_block_bits = __ffs(num);
- if (sscanf(argv[4], "%u%c", &num, &dummy) != 1 ||
- !num || (num & (num - 1)) ||
- num < bdev_logical_block_size(v->hash_dev->bdev) ||
- num > INT_MAX) {
+ v->hash_dev_block_bits = args.hash_block_size_bits;
+ if ((1 << v->hash_dev_block_bits) <
+ bdev_logical_block_size(v->hash_dev->bdev)) {
ti->error = "Invalid hash device block size";
r = -EINVAL;
goto bad;
}
- v->hash_dev_block_bits = __ffs(num);
- if (sscanf(argv[5], "%llu%c", &num_ll, &dummy) != 1 ||
- (sector_t)(num_ll << (v->data_dev_block_bits - SECTOR_SHIFT))
- >> (v->data_dev_block_bits - SECTOR_SHIFT) != num_ll) {
- ti->error = "Invalid data blocks";
- r = -EINVAL;
- goto bad;
- }
- v->data_blocks = num_ll;
-
+ v->data_blocks = args.num_data_blocks;
if (ti->len > (v->data_blocks << (v->data_dev_block_bits - SECTOR_SHIFT))) {
ti->error = "Data device is too small";
r = -EINVAL;
goto bad;
}
- if (sscanf(argv[6], "%llu%c", &num_ll, &dummy) != 1 ||
- (sector_t)(num_ll << (v->hash_dev_block_bits - SECTOR_SHIFT))
- >> (v->hash_dev_block_bits - SECTOR_SHIFT) != num_ll) {
- ti->error = "Invalid hash start";
- r = -EINVAL;
- goto bad;
- }
- v->hash_start = num_ll;
+ v->hash_start = args.hash_start_block;
- v->alg_name = kstrdup(argv[7], GFP_KERNEL);
+ v->alg_name = kstrdup(args.algorithm, GFP_KERNEL);
if (!v->alg_name) {
ti->error = "Cannot allocate algorithm name";
r = -ENOMEM;
@@ -1055,36 +1324,33 @@
r = -ENOMEM;
goto bad;
}
- if (strlen(argv[8]) != v->digest_size * 2 ||
- hex2bin(v->root_digest, argv[8], v->digest_size)) {
+ if (strlen(args.digest) != v->digest_size * 2 ||
+ hex2bin(v->root_digest, args.digest, v->digest_size)) {
ti->error = "Invalid root digest";
r = -EINVAL;
goto bad;
}
- if (strcmp(argv[9], "-")) {
- v->salt_size = strlen(argv[9]) / 2;
+ if (strcmp(args.salt, "-")) {
+ v->salt_size = strlen(args.salt) / 2;
v->salt = kmalloc(v->salt_size, GFP_KERNEL);
if (!v->salt) {
ti->error = "Cannot allocate salt";
r = -ENOMEM;
goto bad;
}
- if (strlen(argv[9]) != v->salt_size * 2 ||
- hex2bin(v->salt, argv[9], v->salt_size)) {
+ if (strlen(args.salt) != v->salt_size * 2 ||
+ hex2bin(v->salt, args.salt, v->salt_size)) {
ti->error = "Invalid salt";
r = -EINVAL;
goto bad;
}
}
- argv += 10;
- argc -= 10;
-
/* Optional parameters */
- if (argc) {
- as.argc = argc;
- as.argv = argv;
+ if (argc > DM_VERITY_NUM_POSITIONAL_ARGS) {
+ as.argc = argc - DM_VERITY_NUM_POSITIONAL_ARGS;
+ as.argv = argv + DM_VERITY_NUM_POSITIONAL_ARGS;
r = verity_parse_opt_args(&as, v);
if (r < 0)
@@ -1156,6 +1422,16 @@
ti->per_io_data_size = roundup(ti->per_io_data_size,
__alignof__(struct dm_verity_io));
+ /* chromeos allows setting error_behavior from both the module
+ * parameters and the device args.
+ */
+ v->error_behavior = verity_parse_error_behavior(args.error_behavior);
+ if (v->error_behavior == -1) {
+ ti->error = "Bad error_behavior supplied";
+ r = -EINVAL;
+ goto bad;
+ }
+
return 0;
bad:
diff --git a/drivers/md/dm-verity.h b/drivers/md/dm-verity.h
index 3441c10..b1a8442 100644
--- a/drivers/md/dm-verity.h
+++ b/drivers/md/dm-verity.h
@@ -15,6 +15,7 @@
#include <linux/dm-bufio.h>
#include <linux/device-mapper.h>
#include <crypto/hash.h>
+#include <linux/notifier.h>
#define DM_VERITY_MAX_LEVELS 63
@@ -56,6 +57,7 @@
int hash_failed; /* set to 1 if hash of any block failed */
enum verity_mode mode; /* mode for handling verification errors */
unsigned corrupted_errs;/* Number of errors for corrupted blocks */
+ int error_behavior; /* selects error behavior on io errors */
struct workqueue_struct *verify_wq;
@@ -91,6 +93,40 @@
*/
};
+struct verity_result {
+ struct completion completion;
+ int err;
+};
+
+struct dm_verity_error_state {
+ int code;
+ int transient; /* Likely to not happen after a reboot */
+ u64 block;
+ const char *message;
+
+ sector_t dev_start;
+ sector_t dev_len;
+ struct block_device *dev;
+
+ sector_t hash_dev_start;
+ sector_t hash_dev_len;
+ struct block_device *hash_dev;
+
+ /* Final behavior after all notifications are completed. */
+ int behavior;
+};
+
+/* This enum must be matched to allowed_error_behaviors in dm-verity.c */
+enum dm_verity_error_behavior {
+ DM_VERITY_ERROR_BEHAVIOR_EIO = 0,
+ DM_VERITY_ERROR_BEHAVIOR_PANIC,
+ DM_VERITY_ERROR_BEHAVIOR_NONE,
+ DM_VERITY_ERROR_BEHAVIOR_NOTIFY
+};
+
+int dm_verity_register_error_notifier(struct notifier_block *nb);
+int dm_verity_unregister_error_notifier(struct notifier_block *nb);
+
static inline struct ahash_request *verity_io_hash_req(struct dm_verity *v,
struct dm_verity_io *io)
{
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 4364315..1ca7c5c 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1070,6 +1070,25 @@
return ret;
}
+static bool dm_dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
+ int blocksize, sector_t start, sector_t len)
+{
+ struct mapped_device *md = dax_get_private(dax_dev);
+ struct dm_table *map;
+ int srcu_idx;
+ bool ret;
+
+ map = dm_get_live_table(md, &srcu_idx);
+ if (!map)
+ return false;
+
+ ret = dm_table_supports_dax(map, device_supports_dax, &blocksize);
+
+ dm_put_live_table(md, srcu_idx);
+
+ return ret;
+}
+
static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
void *addr, size_t bytes, struct iov_iter *i)
{
@@ -1941,7 +1960,8 @@
sprintf(md->disk->disk_name, "dm-%d", minor);
if (IS_ENABLED(CONFIG_DAX_DRIVER)) {
- dax_dev = alloc_dax(md, md->disk->disk_name, &dm_dax_ops);
+ dax_dev = alloc_dax(md, md->disk->disk_name,
+ &dm_dax_ops, 0);
if (!dax_dev)
goto bad;
}
@@ -3185,6 +3205,7 @@
static const struct dax_operations dm_dax_ops = {
.direct_access = dm_dax_direct_access,
+ .dax_supported = dm_dax_supported,
.copy_from_iter = dm_dax_copy_from_iter,
.copy_to_iter = dm_dax_copy_to_iter,
};
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 114a81b..5022e83 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -73,6 +73,10 @@
bool dm_table_all_blk_mq_devices(struct dm_table *t);
void dm_table_free_md_mempools(struct dm_table *t);
struct dm_md_mempools *dm_table_get_md_mempools(struct dm_table *t);
+bool dm_table_supports_dax(struct dm_table *t, iterate_devices_callout_fn fn,
+ int *blocksize);
+int device_supports_dax(struct dm_target *ti, struct dm_dev *dev,
+ sector_t start, sector_t len, void *data);
void dm_lock_md_type(struct mapped_device *md);
void dm_unlock_md_type(struct mapped_device *md);
diff --git a/drivers/net/ethernet/Kconfig b/drivers/net/ethernet/Kconfig
index 6fde68a..c1ffbf1 100644
--- a/drivers/net/ethernet/Kconfig
+++ b/drivers/net/ethernet/Kconfig
@@ -75,6 +75,7 @@
source "drivers/net/ethernet/faraday/Kconfig"
source "drivers/net/ethernet/freescale/Kconfig"
source "drivers/net/ethernet/fujitsu/Kconfig"
+source "drivers/net/ethernet/google/Kconfig"
source "drivers/net/ethernet/hisilicon/Kconfig"
source "drivers/net/ethernet/hp/Kconfig"
source "drivers/net/ethernet/huawei/Kconfig"
diff --git a/drivers/net/ethernet/Makefile b/drivers/net/ethernet/Makefile
index b45d5f6..60caee1 100644
--- a/drivers/net/ethernet/Makefile
+++ b/drivers/net/ethernet/Makefile
@@ -39,6 +39,7 @@
obj-$(CONFIG_NET_VENDOR_FARADAY) += faraday/
obj-$(CONFIG_NET_VENDOR_FREESCALE) += freescale/
obj-$(CONFIG_NET_VENDOR_FUJITSU) += fujitsu/
+obj-$(CONFIG_NET_VENDOR_GOOGLE) += google/
obj-$(CONFIG_NET_VENDOR_HISILICON) += hisilicon/
obj-$(CONFIG_NET_VENDOR_HP) += hp/
obj-$(CONFIG_NET_VENDOR_HUAWEI) += huawei/
diff --git a/drivers/net/ethernet/google/Kconfig b/drivers/net/ethernet/google/Kconfig
new file mode 100644
index 0000000..888f08f
--- /dev/null
+++ b/drivers/net/ethernet/google/Kconfig
@@ -0,0 +1,27 @@
+#
+# Google network device configuration
+#
+
+config NET_VENDOR_GOOGLE
+ bool "Google Devices"
+ default y
+ help
+ If you have a network (Ethernet) device belonging to this class, say Y.
+
+ Note that the answer to this question doesn't directly affect the
+ kernel: saying N will just cause the configurator to skip all
+ the questions about Google devices. If you say Y, you will be asked
+ for your specific device in the following questions.
+
+if NET_VENDOR_GOOGLE
+
+config GVE
+ tristate "Google Virtual NIC (gVNIC) support"
+ depends on (PCI_MSI && X86)
+ help
+ This driver supports Google Virtual NIC (gVNIC).
+
+ To compile this driver as a module, choose M here.
+ The module will be called gve.
+
+endif #NET_VENDOR_GOOGLE
diff --git a/drivers/net/ethernet/google/Makefile b/drivers/net/ethernet/google/Makefile
new file mode 100644
index 0000000..402cc3b
--- /dev/null
+++ b/drivers/net/ethernet/google/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for the Google network device drivers.
+#
+
+obj-$(CONFIG_GVE) += gve/
diff --git a/drivers/net/ethernet/google/gve/Makefile b/drivers/net/ethernet/google/gve/Makefile
new file mode 100644
index 0000000..3354ce4
--- /dev/null
+++ b/drivers/net/ethernet/google/gve/Makefile
@@ -0,0 +1,4 @@
+# Makefile for the Google virtual Ethernet (gve) driver
+
+obj-$(CONFIG_GVE) += gve.o
+gve-objs := gve_main.o gve_tx.o gve_rx.o gve_ethtool.o gve_adminq.o
diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
new file mode 100644
index 0000000..e0f6142
--- /dev/null
+++ b/drivers/net/ethernet/google/gve/gve.h
@@ -0,0 +1,575 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR MIT)
+ * Google virtual Ethernet (gve) driver
+ *
+ * Copyright (C) 2015-2019 Google, Inc.
+ */
+
+#ifndef _GVE_H_
+#define _GVE_H_
+
+#include <linux/dma-mapping.h>
+#include <linux/netdevice.h>
+#include <linux/pci.h>
+#include <linux/u64_stats_sync.h>
+#include "gve_desc.h"
+
+#ifndef PCI_VENDOR_ID_GOOGLE
+#define PCI_VENDOR_ID_GOOGLE 0x1ae0
+#endif
+
+#define PCI_DEV_ID_GVNIC 0x0042
+
+#define GVE_REGISTER_BAR 0
+#define GVE_DOORBELL_BAR 2
+
+/* Driver can alloc up to 2 segments for the header and 2 for the payload. */
+#define GVE_TX_MAX_IOVEC 4
+#ifndef ETH_MIN_MTU
+#define ETH_MIN_MTU 68 /* Min IPv4 MTU per RFC791 */
+#endif
+
+/* 1 for management, 1 for rx, 1 for tx */
+#define GVE_MIN_MSIX 3
+
+/* Numbers of gve tx/rx stats in stats report. */
+#define GVE_TX_STATS_REPORT_NUM 5
+#define GVE_RX_STATS_REPORT_NUM 2
+
+/* Numbers of NIC tx/rx stats in stats report. */
+#define NIC_TX_STATS_REPORT_NUM 0
+#define NIC_RX_STATS_REPORT_NUM 4
+
+/* Interval to schedule a service task, 20000ms. */
+#define GVE_SERVICE_TIMER_PERIOD 20000
+
+/* Each slot in the desc ring has a 1:1 mapping to a slot in the data ring */
+struct gve_rx_desc_queue {
+ struct gve_rx_desc *desc_ring; /* the descriptor ring */
+ dma_addr_t bus; /* the bus for the desc_ring */
+ u8 seqno; /* the next expected seqno for this desc */
+};
+
+/* The page info for a single slot in the RX data queue */
+struct gve_rx_slot_page_info {
+ struct page *page;
+ void *page_address;
+ u32 page_offset; /* offset to write to in page */
+ int pagecnt_bias; /* expected pagecnt if only the driver has a ref */
+ bool can_flip; /* page can be flipped and reused */
+};
+
+/* A list of pages registered with the device during setup and used by a queue
+ * as buffers
+ */
+struct gve_queue_page_list {
+ u32 id; /* unique id */
+ u32 num_entries;
+ struct page **pages; /* list of num_entries pages */
+ dma_addr_t *page_buses; /* the dma addrs of the pages */
+};
+
+/* Each slot in the data ring has a 1:1 mapping to a slot in the desc ring */
+struct gve_rx_data_queue {
+ struct gve_rx_data_slot *data_ring; /* read by NIC */
+ dma_addr_t data_bus; /* dma mapping of the slots */
+ struct gve_rx_slot_page_info *page_info; /* page info of the buffers */
+ struct gve_queue_page_list *qpl; /* qpl assigned to this queue */
+ bool raw_addressing; /* use raw_addressing? */
+};
+
+struct gve_priv;
+
+/* An RX ring that contains a power-of-two sized desc and data ring. */
+struct gve_rx_ring {
+ struct gve_priv *gve;
+ struct gve_rx_desc_queue desc;
+ struct gve_rx_data_queue data;
+ u64 rbytes; /* free-running bytes received */
+ u64 rpackets; /* free-running packets received */
+ u32 cnt; /* free-running total number of completed packets */
+ u32 fill_cnt; /* free-running total number of descs and buffs posted */
+ u32 mask; /* masks the cnt and fill_cnt to the size of the ring */
+ u32 db_threshold; /* threshold for posting new buffs and descs */
+ u64 rx_copybreak_pkt; /* free-running count of copybreak packets */
+ u64 rx_copied_pkt; /* free-running total number of copied packets */
+ u64 rx_skb_alloc_fail; /* free-running count of skb alloc fails */
+ u64 rx_buf_alloc_fail; /* free-running count of buffer alloc fails */
+ u64 rx_desc_err_dropped_pkt; /* free-running count of packets dropped by descriptor error */
+ u64 rx_no_refill_dropped_pkt; /* free-running count of packets dropped because of lack of buffer refill */
+ u32 q_num; /* queue index */
+ u32 ntfy_id; /* notification block index */
+ struct gve_queue_resources *q_resources; /* head and tail pointer idx */
+ dma_addr_t q_resources_bus; /* dma address for the queue resources */
+ struct u64_stats_sync statss; /* sync stats for 32bit archs */
+};
+
+/* A TX desc ring entry */
+union gve_tx_desc {
+ struct gve_tx_pkt_desc pkt; /* first desc for a packet */
+ struct gve_tx_seg_desc seg; /* subsequent descs for a packet */
+};
+
+/* Tracks the memory in the fifo occupied by a segment of a packet */
+struct gve_tx_iovec {
+ u32 iov_offset; /* offset into this segment */
+ u32 iov_len; /* length */
+ u32 iov_padding; /* padding associated with this segment */
+};
+
+struct gve_tx_dma_buf {
+ DEFINE_DMA_UNMAP_ADDR(dma);
+ DEFINE_DMA_UNMAP_LEN(len);
+};
+
+/* Tracks the memory in the fifo occupied by the skb. Mapped 1:1 to a desc
+ * ring entry but only used for a pkt_desc not a seg_desc
+ */
+struct gve_tx_buffer_state {
+ struct sk_buff *skb; /* skb for this pkt */
+ union {
+ struct gve_tx_iovec iov[GVE_TX_MAX_IOVEC]; /* segments of this pkt */
+ struct gve_tx_dma_buf buf;
+ };
+};
+
+/* A TX buffer - each queue has one */
+struct gve_tx_fifo {
+ void *base; /* address of base of FIFO */
+ u32 size; /* total size */
+ atomic_t available; /* how much space is still available */
+ u32 head; /* offset to write at */
+ struct gve_queue_page_list *qpl; /* QPL mapped into this FIFO */
+};
+
+/* A TX ring that contains a power-of-two sized desc ring and a FIFO buffer */
+struct gve_tx_ring {
+ /* Cacheline 0 -- Accessed & dirtied during transmit */
+ struct gve_tx_fifo tx_fifo;
+ u32 req; /* driver tracked head pointer */
+ u32 done; /* driver tracked tail pointer */
+
+ /* Cacheline 1 -- Accessed & dirtied during gve_clean_tx_done */
+ __be32 last_nic_done ____cacheline_aligned; /* NIC tail pointer */
+ u64 pkt_done; /* free-running - total packets completed */
+ u64 bytes_done; /* free-running - total bytes completed */
+ u32 dropped_pkt; /* free-running - total packets dropped */
+
+ /* Cacheline 2 -- Read-mostly fields */
+ union gve_tx_desc *desc ____cacheline_aligned;
+ struct gve_tx_buffer_state *info; /* Maps 1:1 to a desc */
+ struct netdev_queue *netdev_txq;
+ struct gve_queue_resources *q_resources; /* head and tail pointer idx */
+ struct device *dev;
+ u32 mask; /* masks req and done down to queue size */
+ bool raw_addressing; /* use raw_addressing? */
+
+ /* Slow-path fields */
+ u32 q_num ____cacheline_aligned; /* queue idx */
+ u32 stop_queue; /* count of queue stops */
+ u32 wake_queue; /* count of queue wakes */
+ u32 ntfy_id; /* notification block index */
+ dma_addr_t bus; /* dma address of the descr ring */
+ dma_addr_t q_resources_bus; /* dma address of the queue resources */
+ struct u64_stats_sync statss; /* sync stats for 32bit archs */
+} ____cacheline_aligned;
+
+/* Wraps the info for one irq including the napi struct and the queues
+ * associated with that irq.
+ */
+struct gve_notify_block {
+ __be32 *irq_db_index; /* pointer to idx into Bar2 */
+ char name[IFNAMSIZ + 16]; /* name registered with the kernel */
+ struct napi_struct napi; /* kernel napi struct for this block */
+ struct gve_priv *priv;
+ struct gve_tx_ring *tx; /* tx rings on this block */
+ struct gve_rx_ring *rx; /* rx rings on this block */
+};
+
+/* Tracks allowed and current queue settings */
+struct gve_queue_config {
+ u16 max_queues;
+ u16 num_queues; /* current */
+};
+
+/* Tracks the available and used qpl IDs */
+struct gve_qpl_config {
+ u32 qpl_map_size; /* map memory size */
+ unsigned long *qpl_id_map; /* bitmap of used qpl ids */
+};
+
+struct gve_irq_db {
+ __be32 index;
+} ____cacheline_aligned;
+
+struct gve_priv {
+ struct net_device *dev;
+ struct gve_tx_ring *tx; /* array of tx_cfg.num_queues */
+ struct gve_rx_ring *rx; /* array of rx_cfg.num_queues */
+ struct gve_queue_page_list *qpls; /* array of num qpls */
+ struct gve_notify_block *ntfy_blocks; /* array of num_ntfy_blks */
+ struct gve_irq_db *irq_db_indices; /* array of num_ntfy_blks */
+ dma_addr_t irq_db_indices_bus;
+ struct msix_entry *msix_vectors; /* array of num_ntfy_blks + 1 */
+ char mgmt_msix_name[IFNAMSIZ + 16];
+ u32 mgmt_msix_idx;
+ __be32 *counter_array; /* array of num_event_counters */
+ dma_addr_t counter_array_bus;
+
+ u16 num_event_counters;
+ u16 tx_desc_cnt; /* num desc per ring */
+ u16 rx_desc_cnt; /* num desc per ring */
+ u16 tx_pages_per_qpl; /* tx buffer length */
+ u16 rx_data_slot_cnt; /* rx buffer length */
+ u64 max_registered_pages;
+ u64 num_registered_pages; /* num pages registered with NIC */
+ u32 rx_copybreak; /* copy packets smaller than this */
+ u16 default_num_queues; /* default num queues to set up */
+ bool raw_addressing; /* true if this dev supports raw addressing */
+
+ struct gve_queue_config tx_cfg;
+ struct gve_queue_config rx_cfg;
+ struct gve_qpl_config qpl_cfg; /* map used QPL ids */
+ u32 num_ntfy_blks; /* split between TX and RX so must be even */
+
+ struct gve_registers __iomem *reg_bar0; /* see gve_register.h */
+ __be32 __iomem *db_bar2; /* "array" of doorbells */
+ u32 msg_enable; /* level for netif* netdev print macros */
+ struct pci_dev *pdev;
+
+ /* metrics */
+ u32 tx_timeo_cnt;
+
+ /* Admin queue - see gve_adminq.h */
+ union gve_adminq_command *adminq;
+ dma_addr_t adminq_bus_addr;
+ u32 adminq_mask; /* masks prod_cnt to adminq size */
+ u32 adminq_prod_cnt; /* free-running count of AQ cmds executed */
+ u32 adminq_cmd_fail; /* free-running count of AQ cmds failed */
+ u32 adminq_timeouts; /* free-running count of AQ cmds timeouts */
+ /* free-running count of per AQ cmd executed */
+ u32 adminq_describe_device_cnt;
+ u32 adminq_cfg_device_resources_cnt;
+ u32 adminq_register_page_list_cnt;
+ u32 adminq_unregister_page_list_cnt;
+ u32 adminq_create_tx_queue_cnt;
+ u32 adminq_create_rx_queue_cnt;
+ u32 adminq_destroy_tx_queue_cnt;
+ u32 adminq_destroy_rx_queue_cnt;
+ u32 adminq_dcfg_device_resources_cnt;
+ u32 adminq_set_driver_parameter_cnt;
+ u32 adminq_report_stats_cnt;
+
+ /* Global stats */
+ u32 interface_up_cnt; /* count of times interface turned up */
+ u32 interface_down_cnt; /* count of times interface turned down */
+ u32 reset_cnt; /* count of reset */
+ u32 page_alloc_fail; /* count of page alloc fails */
+ u32 dma_mapping_error; /* count of dma mapping errors */
+
+ struct workqueue_struct *gve_wq;
+ struct work_struct service_task;
+ unsigned long service_task_flags;
+ unsigned long state_flags;
+
+ struct gve_stats_report *stats_report;
+ u64 stats_report_len;
+ dma_addr_t stats_report_bus; /* dma address for the stats report */
+ unsigned long ethtool_flags;
+
+ unsigned long service_timer_period;
+ struct timer_list service_timer;
+
+ /* Gvnic device's dma mask, set during probe. */
+ u8 dma_mask;
+
+ /* Gvnic device link speed from hypervisor. */
+ u64 link_speed;
+#if (LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0))
+ int max_mtu;
+#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0) */
+};
+
+enum gve_service_task_flags_bit {
+ GVE_PRIV_FLAGS_DO_RESET = 1,
+ GVE_PRIV_FLAGS_RESET_IN_PROGRESS = 2,
+ GVE_PRIV_FLAGS_PROBE_IN_PROGRESS = 3,
+ GVE_PRIV_FLAGS_DO_REPORT_STATS = 4,
+};
+
+enum gve_state_flags_bit {
+ GVE_PRIV_FLAGS_ADMIN_QUEUE_OK = 1,
+ GVE_PRIV_FLAGS_DEVICE_RESOURCES_OK = 2,
+ GVE_PRIV_FLAGS_DEVICE_RINGS_OK = 3,
+ GVE_PRIV_FLAGS_NAPI_ENABLED = 4,
+};
+
+enum gve_ethtool_flags_bit {
+ GVE_PRIV_FLAGS_REPORT_STATS = 0,
+};
+
+static inline bool gve_get_do_reset(struct gve_priv *priv)
+{
+ return test_bit(GVE_PRIV_FLAGS_DO_RESET, &priv->service_task_flags);
+}
+
+static inline void gve_set_do_reset(struct gve_priv *priv)
+{
+ set_bit(GVE_PRIV_FLAGS_DO_RESET, &priv->service_task_flags);
+}
+
+static inline void gve_clear_do_reset(struct gve_priv *priv)
+{
+ clear_bit(GVE_PRIV_FLAGS_DO_RESET, &priv->service_task_flags);
+}
+
+static inline bool gve_get_reset_in_progress(struct gve_priv *priv)
+{
+ return test_bit(GVE_PRIV_FLAGS_RESET_IN_PROGRESS,
+ &priv->service_task_flags);
+}
+
+static inline void gve_set_reset_in_progress(struct gve_priv *priv)
+{
+ set_bit(GVE_PRIV_FLAGS_RESET_IN_PROGRESS, &priv->service_task_flags);
+}
+
+static inline void gve_clear_reset_in_progress(struct gve_priv *priv)
+{
+ clear_bit(GVE_PRIV_FLAGS_RESET_IN_PROGRESS, &priv->service_task_flags);
+}
+
+static inline bool gve_get_probe_in_progress(struct gve_priv *priv)
+{
+ return test_bit(GVE_PRIV_FLAGS_PROBE_IN_PROGRESS,
+ &priv->service_task_flags);
+}
+
+static inline void gve_set_probe_in_progress(struct gve_priv *priv)
+{
+ set_bit(GVE_PRIV_FLAGS_PROBE_IN_PROGRESS, &priv->service_task_flags);
+}
+
+static inline void gve_clear_probe_in_progress(struct gve_priv *priv)
+{
+ clear_bit(GVE_PRIV_FLAGS_PROBE_IN_PROGRESS, &priv->service_task_flags);
+}
+
+static inline bool gve_get_do_report_stats(struct gve_priv *priv)
+{
+ return test_bit(GVE_PRIV_FLAGS_DO_REPORT_STATS,
+ &priv->service_task_flags);
+}
+
+static inline void gve_set_do_report_stats(struct gve_priv *priv)
+{
+ set_bit(GVE_PRIV_FLAGS_DO_REPORT_STATS, &priv->service_task_flags);
+}
+
+static inline void gve_clear_do_report_stats(struct gve_priv *priv)
+{
+ clear_bit(GVE_PRIV_FLAGS_DO_REPORT_STATS, &priv->service_task_flags);
+}
+
+static inline bool gve_get_admin_queue_ok(struct gve_priv *priv)
+{
+ return test_bit(GVE_PRIV_FLAGS_ADMIN_QUEUE_OK, &priv->state_flags);
+}
+
+static inline void gve_set_admin_queue_ok(struct gve_priv *priv)
+{
+ set_bit(GVE_PRIV_FLAGS_ADMIN_QUEUE_OK, &priv->state_flags);
+}
+
+static inline void gve_clear_admin_queue_ok(struct gve_priv *priv)
+{
+ clear_bit(GVE_PRIV_FLAGS_ADMIN_QUEUE_OK, &priv->state_flags);
+}
+
+static inline bool gve_get_device_resources_ok(struct gve_priv *priv)
+{
+ return test_bit(GVE_PRIV_FLAGS_DEVICE_RESOURCES_OK, &priv->state_flags);
+}
+
+static inline void gve_set_device_resources_ok(struct gve_priv *priv)
+{
+ set_bit(GVE_PRIV_FLAGS_DEVICE_RESOURCES_OK, &priv->state_flags);
+}
+
+static inline void gve_clear_device_resources_ok(struct gve_priv *priv)
+{
+ clear_bit(GVE_PRIV_FLAGS_DEVICE_RESOURCES_OK, &priv->state_flags);
+}
+
+static inline bool gve_get_device_rings_ok(struct gve_priv *priv)
+{
+ return test_bit(GVE_PRIV_FLAGS_DEVICE_RINGS_OK, &priv->state_flags);
+}
+
+static inline void gve_set_device_rings_ok(struct gve_priv *priv)
+{
+ set_bit(GVE_PRIV_FLAGS_DEVICE_RINGS_OK, &priv->state_flags);
+}
+
+static inline void gve_clear_device_rings_ok(struct gve_priv *priv)
+{
+ clear_bit(GVE_PRIV_FLAGS_DEVICE_RINGS_OK, &priv->state_flags);
+}
+
+static inline bool gve_get_napi_enabled(struct gve_priv *priv)
+{
+ return test_bit(GVE_PRIV_FLAGS_NAPI_ENABLED, &priv->state_flags);
+}
+
+static inline void gve_set_napi_enabled(struct gve_priv *priv)
+{
+ set_bit(GVE_PRIV_FLAGS_NAPI_ENABLED, &priv->state_flags);
+}
+
+static inline void gve_clear_napi_enabled(struct gve_priv *priv)
+{
+ clear_bit(GVE_PRIV_FLAGS_NAPI_ENABLED, &priv->state_flags);
+}
+
+static inline bool gve_get_report_stats(struct gve_priv *priv)
+{
+ return test_bit(GVE_PRIV_FLAGS_REPORT_STATS, &priv->ethtool_flags);
+}
+
+static inline void gve_set_report_stats(struct gve_priv *priv)
+{
+ set_bit(GVE_PRIV_FLAGS_REPORT_STATS, &priv->ethtool_flags);
+}
+
+static inline void gve_clear_report_stats(struct gve_priv *priv)
+{
+ clear_bit(GVE_PRIV_FLAGS_REPORT_STATS, &priv->ethtool_flags);
+}
+
+/* Returns the address of the ntfy_block's irq doorbell
+ */
+static inline __be32 __iomem *gve_irq_doorbell(struct gve_priv *priv,
+ struct gve_notify_block *block)
+{
+ return &priv->db_bar2[be32_to_cpu(*block->irq_db_index)];
+}
+
+/* Returns the index into ntfy_blocks of the given tx ring's block
+ */
+static inline u32 gve_tx_idx_to_ntfy(struct gve_priv *priv, u32 queue_idx)
+{
+ return queue_idx;
+}
+
+/* Returns the index into ntfy_blocks of the given rx ring's block
+ */
+static inline u32 gve_rx_idx_to_ntfy(struct gve_priv *priv, u32 queue_idx)
+{
+ return (priv->num_ntfy_blks / 2) + queue_idx;
+}
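A concrete example of the split encoded by these two helpers: with num_ntfy_blks = 8 (4 blocks for TX, 4 for RX), TX queue 2 uses notify block 2 while RX queue 2 uses block 8/2 + 2 = 6; this is also why the num_ntfy_blks field in gve_priv must be even.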
+
+/* Returns the number of tx queue page lists
+ */
+static inline u32 gve_num_tx_qpls(struct gve_priv *priv)
+{
+ if (priv->raw_addressing)
+ return 0;
+
+ return priv->tx_cfg.num_queues;
+}
+
+/* Returns the number of rx queue page lists
+ */
+static inline u32 gve_num_rx_qpls(struct gve_priv *priv)
+{
+ if (priv->raw_addressing)
+ return 0;
+
+ return priv->rx_cfg.num_queues;
+}
+
+/* Returns a pointer to the next available tx qpl in the list of qpls
+ */
+static inline
+struct gve_queue_page_list *gve_assign_tx_qpl(struct gve_priv *priv)
+{
+ int id = find_first_zero_bit(priv->qpl_cfg.qpl_id_map,
+ priv->qpl_cfg.qpl_map_size);
+
+ /* we are out of tx qpls */
+ if (id >= gve_num_tx_qpls(priv))
+ return NULL;
+
+ set_bit(id, priv->qpl_cfg.qpl_id_map);
+ return &priv->qpls[id];
+}
+
+/* Returns a pointer to the next available rx qpl in the list of qpls
+ */
+static inline
+struct gve_queue_page_list *gve_assign_rx_qpl(struct gve_priv *priv)
+{
+ int id = find_next_zero_bit(priv->qpl_cfg.qpl_id_map,
+ priv->qpl_cfg.qpl_map_size,
+ gve_num_tx_qpls(priv));
+
+ /* we are out of rx qpls */
+ if (id == priv->qpl_cfg.qpl_map_size)
+ return NULL;
+
+ set_bit(id, priv->qpl_cfg.qpl_id_map);
+ return &priv->qpls[id];
+}
+
+/* Unassigns the qpl with the given id
+ */
+static inline void gve_unassign_qpl(struct gve_priv *priv, int id)
+{
+ clear_bit(id, priv->qpl_cfg.qpl_id_map);
+}
+
+/* Returns the correct dma direction for tx and rx qpls
+ */
+static inline enum dma_data_direction gve_qpl_dma_dir(struct gve_priv *priv,
+ int id)
+{
+ if (id < gve_num_tx_qpls(priv))
+ return DMA_TO_DEVICE;
+ else
+ return DMA_FROM_DEVICE;
+}
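Taken together, the QPL helpers imply a fixed id layout when raw addressing is off: ids 0 .. tx_cfg.num_queues - 1 belong to TX rings (and map DMA_TO_DEVICE), the RX ids follow immediately after (DMA_FROM_DEVICE), and qpl_id_map only records which ids are currently handed out; gve_assign_rx_qpl() starts its bitmap search at gve_num_tx_qpls() so an RX ring can never be given a TX id.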
+
+/* buffers */
+int gve_alloc_page(struct gve_priv *priv, struct device *dev,
+ struct page **page, dma_addr_t *dma,
+ enum dma_data_direction, gfp_t gfp_flags);
+void gve_free_page(struct device *dev, struct page *page, dma_addr_t dma,
+ enum dma_data_direction);
+/* tx handling */
+netdev_tx_t gve_tx(struct sk_buff *skb, struct net_device *dev);
+bool gve_tx_poll(struct gve_notify_block *block, int budget);
+int gve_tx_alloc_rings(struct gve_priv *priv);
+void gve_tx_free_rings(struct gve_priv *priv);
+__be32 gve_tx_load_event_counter(struct gve_priv *priv,
+ struct gve_tx_ring *tx);
+/* rx handling */
+void gve_rx_write_doorbell(struct gve_priv *priv, struct gve_rx_ring *rx);
+bool gve_rx_poll(struct gve_notify_block *block, int budget);
+int gve_rx_alloc_rings(struct gve_priv *priv);
+void gve_rx_free_rings(struct gve_priv *priv);
+bool gve_clean_rx_done(struct gve_rx_ring *rx, int budget,
+ netdev_features_t feat);
+/* Reset */
+void gve_schedule_reset(struct gve_priv *priv);
+int gve_reset(struct gve_priv *priv, bool attempt_teardown);
+int gve_adjust_queues(struct gve_priv *priv,
+ struct gve_queue_config new_rx_config,
+ struct gve_queue_config new_tx_config);
+/* report stats handling */
+void gve_handle_report_stats(struct gve_priv *priv);
+/* exported by ethtool.c */
+extern const struct ethtool_ops gve_ethtool_ops;
+/* needed by ethtool */
+extern const char gve_version_str[];
+#endif /* _GVE_H_ */
diff --git a/drivers/net/ethernet/google/gve/gve_adminq.c b/drivers/net/ethernet/google/gve/gve_adminq.c
new file mode 100644
index 0000000..052b6b8
--- /dev/null
+++ b/drivers/net/ethernet/google/gve/gve_adminq.c
@@ -0,0 +1,647 @@
+// SPDX-License-Identifier: (GPL-2.0 OR MIT)
+/* Google virtual Ethernet (gve) driver
+ *
+ * Copyright (C) 2015-2019 Google, Inc.
+ */
+
+#include "gve_linux_version.h"
+#include <linux/etherdevice.h>
+#include <linux/pci.h>
+#include "gve.h"
+#include "gve_adminq.h"
+#include "gve_register.h"
+
+#define GVE_MAX_ADMINQ_RELEASE_CHECK 500
+#define GVE_ADMINQ_SLEEP_LEN 20
+#define GVE_MAX_ADMINQ_EVENT_COUNTER_CHECK 100
+
+int gve_adminq_alloc(struct device *dev, struct gve_priv *priv)
+{
+ priv->adminq = dma_alloc_coherent(dev, PAGE_SIZE,
+ &priv->adminq_bus_addr, GFP_KERNEL);
+ if (unlikely(!priv->adminq))
+ return -ENOMEM;
+
+ priv->adminq_mask = (PAGE_SIZE / sizeof(union gve_adminq_command)) - 1;
+ priv->adminq_prod_cnt = 0;
+ priv->adminq_cmd_fail = 0;
+ priv->adminq_timeouts = 0;
+ priv->adminq_describe_device_cnt = 0;
+ priv->adminq_cfg_device_resources_cnt = 0;
+ priv->adminq_register_page_list_cnt = 0;
+ priv->adminq_unregister_page_list_cnt = 0;
+ priv->adminq_create_tx_queue_cnt = 0;
+ priv->adminq_create_rx_queue_cnt = 0;
+ priv->adminq_destroy_tx_queue_cnt = 0;
+ priv->adminq_destroy_rx_queue_cnt = 0;
+ priv->adminq_dcfg_device_resources_cnt = 0;
+ priv->adminq_set_driver_parameter_cnt = 0;
+ priv->adminq_report_stats_cnt = 0;
+
+ /* Setup Admin queue with the device */
+ iowrite32be(priv->adminq_bus_addr / PAGE_SIZE,
+ &priv->reg_bar0->adminq_pfn);
+
+ gve_set_admin_queue_ok(priv);
+ return 0;
+}
+
+void gve_adminq_release(struct gve_priv *priv)
+{
+ int i = 0;
+
+ /* Tell the device the adminq is leaving */
+ iowrite32be(0x0, &priv->reg_bar0->adminq_pfn);
+ while (ioread32be(&priv->reg_bar0->adminq_pfn)) {
+ /* If this is reached the device is unrecoverable and still
+ * holding memory. Continue looping to avoid memory corruption,
+ * but WARN so it is visible what is going on.
+ */
+ if (i == GVE_MAX_ADMINQ_RELEASE_CHECK)
+ WARN(1, "Unrecoverable platform error!");
+ i++;
+ msleep(GVE_ADMINQ_SLEEP_LEN);
+ }
+ gve_clear_device_rings_ok(priv);
+ gve_clear_device_resources_ok(priv);
+ gve_clear_admin_queue_ok(priv);
+}
+
+void gve_adminq_free(struct device *dev, struct gve_priv *priv)
+{
+ if (!gve_get_admin_queue_ok(priv))
+ return;
+ gve_adminq_release(priv);
+ dma_free_coherent(dev, PAGE_SIZE, priv->adminq, priv->adminq_bus_addr);
+ gve_clear_admin_queue_ok(priv);
+}
+
+static void gve_adminq_kick_cmd(struct gve_priv *priv, u32 prod_cnt)
+{
+ iowrite32be(prod_cnt, &priv->reg_bar0->adminq_doorbell);
+}
+
+static bool gve_adminq_wait_for_cmd(struct gve_priv *priv, u32 prod_cnt)
+{
+ int i;
+
+ for (i = 0; i < GVE_MAX_ADMINQ_EVENT_COUNTER_CHECK; i++) {
+ if (ioread32be(&priv->reg_bar0->adminq_event_counter)
+ == prod_cnt)
+ return true;
+ msleep(GVE_ADMINQ_SLEEP_LEN);
+ }
+
+ return false;
+}
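With the constants defined at the top of this file, this wait loop polls the event counter up to 100 times with a 20 ms sleep between reads, so a batch of commands gets roughly 2 seconds to complete before the caller gives up and treats the admin queue as unrecoverable.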
+
+static int gve_adminq_parse_err(struct gve_priv *priv, u32 status)
+{
+ if (status != GVE_ADMINQ_COMMAND_PASSED &&
+ status != GVE_ADMINQ_COMMAND_UNSET) {
+ dev_err(&priv->pdev->dev, "AQ command failed with status %d\n", status);
+ priv->adminq_cmd_fail++;
+ }
+ switch (status) {
+ case GVE_ADMINQ_COMMAND_PASSED:
+ return 0;
+ case GVE_ADMINQ_COMMAND_UNSET:
+ dev_err(&priv->pdev->dev, "parse_aq_err: err and status both unset, this should not be possible.\n");
+ return -EINVAL;
+ case GVE_ADMINQ_COMMAND_ERROR_ABORTED:
+ case GVE_ADMINQ_COMMAND_ERROR_CANCELLED:
+ case GVE_ADMINQ_COMMAND_ERROR_DATALOSS:
+ case GVE_ADMINQ_COMMAND_ERROR_FAILED_PRECONDITION:
+ case GVE_ADMINQ_COMMAND_ERROR_UNAVAILABLE:
+ return -EAGAIN;
+ case GVE_ADMINQ_COMMAND_ERROR_ALREADY_EXISTS:
+ case GVE_ADMINQ_COMMAND_ERROR_INTERNAL_ERROR:
+ case GVE_ADMINQ_COMMAND_ERROR_INVALID_ARGUMENT:
+ case GVE_ADMINQ_COMMAND_ERROR_NOT_FOUND:
+ case GVE_ADMINQ_COMMAND_ERROR_OUT_OF_RANGE:
+ case GVE_ADMINQ_COMMAND_ERROR_UNKNOWN_ERROR:
+ return -EINVAL;
+ case GVE_ADMINQ_COMMAND_ERROR_DEADLINE_EXCEEDED:
+ return -ETIME;
+ case GVE_ADMINQ_COMMAND_ERROR_PERMISSION_DENIED:
+ case GVE_ADMINQ_COMMAND_ERROR_UNAUTHENTICATED:
+ return -EACCES;
+ case GVE_ADMINQ_COMMAND_ERROR_RESOURCE_EXHAUSTED:
+ return -ENOMEM;
+ case GVE_ADMINQ_COMMAND_ERROR_UNIMPLEMENTED:
+ return -ENOTSUPP;
+ default:
+ dev_err(&priv->pdev->dev, "parse_aq_err: unknown status code %d\n", status);
+ return -EINVAL;
+ }
+}
+
+/* Flushes all AQ commands currently queued and waits for them to complete.
+ * If there are failures, it will return the first error.
+ */
+static int gve_adminq_kick_and_wait(struct gve_priv *priv)
+{
+ u32 tail, head;
+ int i;
+
+ tail = ioread32be(&priv->reg_bar0->adminq_event_counter);
+ head = priv->adminq_prod_cnt;
+
+ gve_adminq_kick_cmd(priv, head);
+ if (!gve_adminq_wait_for_cmd(priv, head)) {
+ dev_err(&priv->pdev->dev, "AQ commands timed out, need to reset AQ\n");
+ priv->adminq_timeouts++;
+ return -ENOTRECOVERABLE;
+ }
+
+ for (i = tail; i < head; i++) {
+ union gve_adminq_command *cmd;
+ u32 status, err;
+
+ cmd = &priv->adminq[i & priv->adminq_mask];
+ status = be32_to_cpu(READ_ONCE(cmd->status));
+ err = gve_adminq_parse_err(priv, status);
+ if (err)
+ // Return the first error if we failed.
+ return err;
+ }
+
+ return 0;
+}
+
+/* This function is not threadsafe - the caller is responsible for any
+ * necessary locks.
+ */
+static int gve_adminq_issue_cmd(struct gve_priv *priv,
+ union gve_adminq_command *cmd_orig)
+{
+ union gve_adminq_command *cmd;
+ u32 tail;
+ u32 opcode;
+
+ tail = ioread32be(&priv->reg_bar0->adminq_event_counter);
+
+ // Check if next command will overflow the buffer.
+ if (((priv->adminq_prod_cnt + 1) & priv->adminq_mask) == tail) {
+ int err;
+
+ // Flush existing commands to make room.
+ err = gve_adminq_kick_and_wait(priv);
+ if (err)
+ return err;
+
+ // Retry.
+ tail = ioread32be(&priv->reg_bar0->adminq_event_counter);
+ if (((priv->adminq_prod_cnt + 1) & priv->adminq_mask) == tail) {
+ // This should never happen. We just flushed the
+ // command queue so there should be enough space.
+ return -ENOMEM;
+ }
+ }
+
+ cmd = &priv->adminq[priv->adminq_prod_cnt & priv->adminq_mask];
+ priv->adminq_prod_cnt++;
+
+ memcpy(cmd, cmd_orig, sizeof(*cmd_orig));
+ opcode = be32_to_cpu(READ_ONCE(cmd->opcode));
+
+ switch (opcode) {
+ case GVE_ADMINQ_DESCRIBE_DEVICE:
+ priv->adminq_describe_device_cnt++;
+ break;
+ case GVE_ADMINQ_CONFIGURE_DEVICE_RESOURCES:
+ priv->adminq_cfg_device_resources_cnt++;
+ break;
+ case GVE_ADMINQ_REGISTER_PAGE_LIST:
+ priv->adminq_register_page_list_cnt++;
+ break;
+ case GVE_ADMINQ_UNREGISTER_PAGE_LIST:
+ priv->adminq_unregister_page_list_cnt++;
+ break;
+ case GVE_ADMINQ_CREATE_TX_QUEUE:
+ priv->adminq_create_tx_queue_cnt++;
+ break;
+ case GVE_ADMINQ_CREATE_RX_QUEUE:
+ priv->adminq_create_rx_queue_cnt++;
+ break;
+ case GVE_ADMINQ_DESTROY_TX_QUEUE:
+ priv->adminq_destroy_tx_queue_cnt++;
+ break;
+ case GVE_ADMINQ_DESTROY_RX_QUEUE:
+ priv->adminq_destroy_rx_queue_cnt++;
+ break;
+ case GVE_ADMINQ_DECONFIGURE_DEVICE_RESOURCES:
+ priv->adminq_dcfg_device_resources_cnt++;
+ break;
+ case GVE_ADMINQ_SET_DRIVER_PARAMETER:
+ priv->adminq_set_driver_parameter_cnt++;
+ break;
+ case GVE_ADMINQ_REPORT_STATS:
+ priv->adminq_report_stats_cnt++;
+ break;
+ default:
+ dev_err(&priv->pdev->dev, "unknown AQ command opcode %d\n", opcode);
+ }
+
+ return 0;
+}
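For sizing intuition, assuming 4 KiB pages and the 64-byte union gve_adminq_command the driver pads its commands to: gve_adminq_alloc() yields adminq_mask = 4096 / 64 - 1 = 63, the free-running adminq_prod_cnt selects a slot via prod_cnt & adminq_mask, and the (prod_cnt + 1) & mask == tail test above is the usual one-slot-left fullness check, which is why the queue is flushed before a new command could overwrite one the device has not yet consumed.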
+
+/* This function is not threadsafe - the caller is responsible for any
+ * necessary locks.
+ * The caller is also responsible for making sure there are no commands
+ * waiting to be executed.
+ */
+static int gve_adminq_execute_cmd(struct gve_priv *priv,
+ union gve_adminq_command *cmd_orig)
+{
+ u32 tail, head;
+ int err;
+
+ tail = ioread32be(&priv->reg_bar0->adminq_event_counter);
+ head = priv->adminq_prod_cnt;
+ if (tail != head)
+ // This is not a valid path
+ return -EINVAL;
+
+ err = gve_adminq_issue_cmd(priv, cmd_orig);
+ if (err)
+ return err;
+
+ return gve_adminq_kick_and_wait(priv);
+}
+/* The device specifies that the management vector can either be the first irq
+ * or the last irq. ntfy_blk_msix_base_idx indicates the first irq assigned to
+ * the ntfy blks. If it is 0 then the management vector is last, if it is 1 then
+ * the management vector is first.
+ *
+ * gve arranges the msix vectors so that the management vector is last.
+ */
+#define GVE_NTFY_BLK_BASE_MSIX_IDX 0
+int gve_adminq_configure_device_resources(struct gve_priv *priv,
+ dma_addr_t counter_array_bus_addr,
+ u32 num_counters,
+ dma_addr_t db_array_bus_addr,
+ u32 num_ntfy_blks)
+{
+ union gve_adminq_command cmd;
+
+ memset(&cmd, 0, sizeof(cmd));
+ cmd.opcode = cpu_to_be32(GVE_ADMINQ_CONFIGURE_DEVICE_RESOURCES);
+ cmd.configure_device_resources =
+ (struct gve_adminq_configure_device_resources) {
+ .counter_array = cpu_to_be64(counter_array_bus_addr),
+ .num_counters = cpu_to_be32(num_counters),
+ .irq_db_addr = cpu_to_be64(db_array_bus_addr),
+ .num_irq_dbs = cpu_to_be32(num_ntfy_blks),
+ .irq_db_stride = cpu_to_be32(sizeof(*priv->irq_db_indices)),
+ .ntfy_blk_msix_base_idx =
+ cpu_to_be32(GVE_NTFY_BLK_BASE_MSIX_IDX),
+ };
+
+ return gve_adminq_execute_cmd(priv, &cmd);
+}
+
+int gve_adminq_deconfigure_device_resources(struct gve_priv *priv)
+{
+ union gve_adminq_command cmd;
+
+ memset(&cmd, 0, sizeof(cmd));
+ cmd.opcode = cpu_to_be32(GVE_ADMINQ_DECONFIGURE_DEVICE_RESOURCES);
+
+ return gve_adminq_execute_cmd(priv, &cmd);
+}
+
+int gve_adminq_create_tx_queues(struct gve_priv *priv, u32 num_queues)
+{
+ union gve_adminq_command cmd;
+ struct gve_tx_ring *tx;
+ u32 qpl_id;
+ int err;
+ int i;
+
+ for (i = 0; i < num_queues; i++) {
+ tx = &priv->tx[i];
+ qpl_id = priv->raw_addressing ? GVE_RAW_ADDRESSING_QPL_ID :
+ tx->tx_fifo.qpl->id;
+ memset(&cmd, 0, sizeof(cmd));
+ cmd.opcode = cpu_to_be32(GVE_ADMINQ_CREATE_TX_QUEUE);
+ cmd.create_tx_queue = (struct gve_adminq_create_tx_queue) {
+ .queue_id = cpu_to_be32(i),
+ .reserved = 0,
+ .queue_resources_addr =
+ cpu_to_be64(tx->q_resources_bus),
+ .tx_ring_addr = cpu_to_be64(tx->bus),
+ .queue_page_list_id = cpu_to_be32(qpl_id),
+ .ntfy_id = cpu_to_be32(tx->ntfy_id),
+ };
+ err = gve_adminq_issue_cmd(priv, &cmd);
+ if (err)
+ return err;
+ }
+
+ return gve_adminq_kick_and_wait(priv);
+}
+
+int gve_adminq_create_rx_queues(struct gve_priv *priv, u32 num_queues)
+{
+ union gve_adminq_command cmd;
+ struct gve_rx_ring *rx;
+ u32 qpl_id;
+ int err;
+ int i;
+
+ for (i = 0; i < num_queues; i++) {
+ rx = &priv->rx[i];
+ qpl_id = priv->raw_addressing ? GVE_RAW_ADDRESSING_QPL_ID :
+ rx->data.qpl->id;
+ memset(&cmd, 0, sizeof(cmd));
+ cmd.opcode = cpu_to_be32(GVE_ADMINQ_CREATE_RX_QUEUE);
+ cmd.create_rx_queue = (struct gve_adminq_create_rx_queue) {
+ .queue_id = cpu_to_be32(i),
+ .index = cpu_to_be32(i),
+ .reserved = 0,
+ .ntfy_id = cpu_to_be32(rx->ntfy_id),
+ .queue_resources_addr = cpu_to_be64(rx->q_resources_bus),
+ .rx_desc_ring_addr = cpu_to_be64(rx->desc.bus),
+ .rx_data_ring_addr = cpu_to_be64(rx->data.data_bus),
+ .queue_page_list_id = cpu_to_be32(qpl_id),
+ };
+ err = gve_adminq_issue_cmd(priv, &cmd);
+ if (err)
+ return err;
+ }
+
+ return gve_adminq_kick_and_wait(priv);
+}
+
+int gve_adminq_destroy_tx_queues(struct gve_priv *priv, u32 num_queues)
+{
+ union gve_adminq_command cmd;
+ int err;
+ int i;
+
+ for (i = 0; i < num_queues; i++) {
+ memset(&cmd, 0, sizeof(cmd));
+ cmd.opcode = cpu_to_be32(GVE_ADMINQ_DESTROY_TX_QUEUE);
+ cmd.destroy_tx_queue = (struct gve_adminq_destroy_tx_queue) {
+ .queue_id = cpu_to_be32(i),
+ };
+ err = gve_adminq_issue_cmd(priv, &cmd);
+ if (err)
+ return err;
+ }
+
+ return gve_adminq_kick_and_wait(priv);
+}
+
+int gve_adminq_destroy_rx_queues(struct gve_priv *priv, u32 num_queues)
+{
+ union gve_adminq_command cmd;
+ int err;
+ int i;
+
+ for (i = 0; i < num_queues; i++) {
+ memset(&cmd, 0, sizeof(cmd));
+ cmd.opcode = cpu_to_be32(GVE_ADMINQ_DESTROY_RX_QUEUE);
+ cmd.destroy_rx_queue = (struct gve_adminq_destroy_rx_queue) {
+ .queue_id = cpu_to_be32(i),
+ };
+ err = gve_adminq_issue_cmd(priv, &cmd);
+ if (err)
+ return err;
+ }
+
+ return gve_adminq_kick_and_wait(priv);
+}
+
+int gve_adminq_describe_device(struct gve_priv *priv)
+{
+ struct gve_device_descriptor *descriptor;
+ struct gve_device_option *dev_opt;
+ union gve_adminq_command cmd;
+ dma_addr_t descriptor_bus;
+ u16 num_options;
+ int err = 0;
+ u8 *mac;
+ u16 mtu;
+ int i;
+
+ memset(&cmd, 0, sizeof(cmd));
+ descriptor = dma_alloc_coherent(&priv->pdev->dev, PAGE_SIZE,
+ &descriptor_bus, GFP_KERNEL);
+ if (!descriptor)
+ return -ENOMEM;
+ cmd.opcode = cpu_to_be32(GVE_ADMINQ_DESCRIBE_DEVICE);
+ cmd.describe_device.device_descriptor_addr =
+ cpu_to_be64(descriptor_bus);
+ cmd.describe_device.device_descriptor_version =
+ cpu_to_be32(GVE_ADMINQ_DEVICE_DESCRIPTOR_VERSION);
+ cmd.describe_device.available_length = cpu_to_be32(PAGE_SIZE);
+
+ err = gve_adminq_execute_cmd(priv, &cmd);
+ if (err)
+ goto free_device_descriptor;
+
+ priv->tx_desc_cnt = be16_to_cpu(descriptor->tx_queue_entries);
+ if (priv->tx_desc_cnt * sizeof(priv->tx->desc[0]) < PAGE_SIZE) {
+ dev_err(&priv->pdev->dev, "Tx desc count %d too low\n",
+ priv->tx_desc_cnt);
+ err = -EINVAL;
+ goto free_device_descriptor;
+ }
+ priv->rx_desc_cnt = be16_to_cpu(descriptor->rx_queue_entries);
+ if (priv->rx_desc_cnt * sizeof(priv->rx->desc.desc_ring[0])
+ < PAGE_SIZE ||
+ priv->rx_desc_cnt * sizeof(priv->rx->data.data_ring[0])
+ < PAGE_SIZE) {
+ dev_err(&priv->pdev->dev, "Rx desc count %d too low\n",
+ priv->rx_desc_cnt);
+ err = -EINVAL;
+ goto free_device_descriptor;
+ }
+ priv->max_registered_pages =
+ be64_to_cpu(descriptor->max_registered_pages);
+ mtu = be16_to_cpu(descriptor->mtu);
+ if (mtu < ETH_MIN_MTU) {
+ dev_err(&priv->pdev->dev, "MTU %d below minimum MTU\n", mtu);
+ err = -EINVAL;
+ goto free_device_descriptor;
+ }
+#if (LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0))
+ priv->max_mtu = mtu;
+#else /* LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0) */
+ priv->dev->max_mtu = mtu;
+#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0) */
+ priv->num_event_counters = be16_to_cpu(descriptor->counters);
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0)
+ ether_addr_copy(priv->dev->dev_addr, descriptor->mac);
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0) */
+ memcpy(priv->dev->dev_addr, descriptor->mac, ETH_ALEN);
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0) */
+ mac = descriptor->mac;
+ dev_info(&priv->pdev->dev, "MAC addr: %pM\n", mac);
+ priv->tx_pages_per_qpl = be16_to_cpu(descriptor->tx_pages_per_qpl);
+ priv->rx_data_slot_cnt = be16_to_cpu(descriptor->rx_pages_per_qpl);
+ if (priv->rx_data_slot_cnt < priv->rx_desc_cnt) {
+ dev_err(&priv->pdev->dev, "rx_data_slot_cnt cannot be smaller than rx_desc_cnt, setting rx_desc_cnt down to %d.\n",
+ priv->rx_data_slot_cnt);
+ priv->rx_desc_cnt = priv->rx_data_slot_cnt;
+ }
+ priv->default_num_queues = be16_to_cpu(descriptor->default_num_queues);
+ dev_opt = (struct gve_device_option *)((void *)descriptor +
+ sizeof(*descriptor));
+
+ num_options = be16_to_cpu(descriptor->num_device_options);
+ for (i = 0; i < num_options; i++) {
+ u16 option_id;
+ u16 option_length;
+
+ if ((void *)dev_opt + sizeof(*dev_opt) > (void *)descriptor +
+ be16_to_cpu(descriptor->total_length)) {
+ dev_err(&priv->dev->dev,
+ "num_options in device_descriptor does not match total length.\n");
+ err = -EINVAL;
+ goto free_device_descriptor;
+ }
+
+ option_id = be16_to_cpu(dev_opt->option_id);
+ option_length = be16_to_cpu(dev_opt->option_length);
+ switch (option_id) {
+ case GVE_DEV_OPT_ID_RAW_ADDRESSING:
+ /* If the length or feature mask doesn't match,
+ * continue without enabling the feature.
+ */
+ if (option_length != GVE_DEV_OPT_LEN_RAW_ADDRESSING ||
+ be32_to_cpu(dev_opt->feat_mask) !=
+ GVE_DEV_OPT_FEAT_MASK_RAW_ADDRESSING) {
+ dev_info(&priv->pdev->dev,
+ "Raw addressing device option not enabled, length or features mask did not match expected.\n");
+ priv->raw_addressing = false;
+ } else {
+ dev_info(&priv->pdev->dev,
+ "Raw addressing device option enabled.\n");
+ priv->raw_addressing = true;
+ }
+ break;
+ default:
+ /* If we don't recognize the option just continue
+ * without doing anything.
+ */
+ dev_info(&priv->pdev->dev,
+ "Unrecognized device option 0x%hx not enabled.\n",
+ option_id);
+ break;
+ }
+ dev_opt = (void *)dev_opt + sizeof(*dev_opt) + option_length;
+ }
+
+free_device_descriptor:
+ dma_free_coherent(&priv->pdev->dev, PAGE_SIZE, descriptor,
+ descriptor_bus);
+ return err;
+}
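The option walk at the end of gve_adminq_describe_device() treats the descriptor as a fixed header followed by num_device_options TLV-style entries: each gve_device_option header is 8 bytes and the cursor advances by sizeof(*dev_opt) + option_length, so for the raw-addressing option (whose GVE_DEV_OPT_LEN_RAW_ADDRESSING is 0) consecutive entries sit 8 bytes apart; the comparison against total_length keeps the walk from running past the buffer when a descriptor advertises more options than it actually carries.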
+
+int gve_adminq_register_page_list(struct gve_priv *priv,
+ struct gve_queue_page_list *qpl)
+{
+ struct device *hdev = &priv->pdev->dev;
+ u32 num_entries = qpl->num_entries;
+ u32 size = num_entries * sizeof(qpl->page_buses[0]);
+ union gve_adminq_command cmd;
+ dma_addr_t page_list_bus;
+ __be64 *page_list;
+ int err;
+ int i;
+
+ memset(&cmd, 0, sizeof(cmd));
+ page_list = dma_alloc_coherent(hdev, size, &page_list_bus, GFP_KERNEL);
+ if (!page_list)
+ return -ENOMEM;
+
+ for (i = 0; i < num_entries; i++)
+ page_list[i] = cpu_to_be64(qpl->page_buses[i]);
+
+ cmd.opcode = cpu_to_be32(GVE_ADMINQ_REGISTER_PAGE_LIST);
+ cmd.reg_page_list = (struct gve_adminq_register_page_list) {
+ .page_list_id = cpu_to_be32(qpl->id),
+ .num_pages = cpu_to_be32(num_entries),
+ .page_address_list_addr = cpu_to_be64(page_list_bus),
+ };
+
+ err = gve_adminq_execute_cmd(priv, &cmd);
+ dma_free_coherent(hdev, size, page_list, page_list_bus);
+ return err;
+}
+
+int gve_adminq_unregister_page_list(struct gve_priv *priv, u32 page_list_id)
+{
+ union gve_adminq_command cmd;
+
+ memset(&cmd, 0, sizeof(cmd));
+ cmd.opcode = cpu_to_be32(GVE_ADMINQ_UNREGISTER_PAGE_LIST);
+ cmd.unreg_page_list = (struct gve_adminq_unregister_page_list) {
+ .page_list_id = cpu_to_be32(page_list_id),
+ };
+
+ return gve_adminq_execute_cmd(priv, &cmd);
+}
+
+int gve_adminq_set_mtu(struct gve_priv *priv, u64 mtu)
+{
+ union gve_adminq_command cmd;
+
+ memset(&cmd, 0, sizeof(cmd));
+ cmd.opcode = cpu_to_be32(GVE_ADMINQ_SET_DRIVER_PARAMETER);
+ cmd.set_driver_param = (struct gve_adminq_set_driver_parameter) {
+ .parameter_type = cpu_to_be32(GVE_SET_PARAM_MTU),
+ .parameter_value = cpu_to_be64(mtu),
+ };
+
+ return gve_adminq_execute_cmd(priv, &cmd);
+}
+
+int gve_adminq_report_stats(struct gve_priv *priv, u64 stats_report_len,
+ dma_addr_t stats_report_addr, u64 interval)
+{
+ union gve_adminq_command cmd;
+
+ memset(&cmd, 0, sizeof(cmd));
+ cmd.opcode = cpu_to_be32(GVE_ADMINQ_REPORT_STATS);
+ cmd.report_stats = (struct gve_adminq_report_stats) {
+ .stats_report_len = cpu_to_be64(stats_report_len),
+ .stats_report_addr = cpu_to_be64(stats_report_addr),
+ .interval = cpu_to_be64(interval),
+ };
+
+ return gve_adminq_execute_cmd(priv, &cmd);
+}
+
+int gve_adminq_report_link_speed(struct gve_priv *priv)
+{
+ union gve_adminq_command gvnic_cmd;
+ dma_addr_t link_speed_region_bus;
+ u64 *link_speed_region;
+ int err;
+
+ link_speed_region = dma_alloc_coherent(&priv->pdev->dev,
+ sizeof(*link_speed_region), &link_speed_region_bus, GFP_KERNEL);
+
+ if (!link_speed_region)
+ return -ENOMEM;
+
+ memset(&gvnic_cmd, 0, sizeof(gvnic_cmd));
+ gvnic_cmd.opcode = cpu_to_be32(GVE_ADMINQ_REPORT_LINK_SPEED);
+ gvnic_cmd.report_link_speed.link_speed_address =
+ cpu_to_be64(link_speed_region_bus);
+
+ err = gve_adminq_execute_cmd(priv, &gvnic_cmd);
+
+ priv->link_speed = be64_to_cpu(*link_speed_region);
+ dma_free_coherent(&priv->pdev->dev, sizeof(*link_speed_region),
+ link_speed_region, link_speed_region_bus);
+ return err;
+}
diff --git a/drivers/net/ethernet/google/gve/gve_adminq.h b/drivers/net/ethernet/google/gve/gve_adminq.h
new file mode 100644
index 0000000..0b12486
--- /dev/null
+++ b/drivers/net/ethernet/google/gve/gve_adminq.h
@@ -0,0 +1,278 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR MIT)
+ * Google virtual Ethernet (gve) driver
+ *
+ * Copyright (C) 2015-2019 Google, Inc.
+ */
+
+#ifndef _GVE_ADMINQ_H
+#define _GVE_ADMINQ_H
+
+#if LINUX_VERSION_CODE < KERNEL_VERSION(5,1,0)
+#include "gve_size_assert.h"
+#else /* LINUX_VERSION_CODE < KERNEL_VERSION(5,1,0) */
+#include <linux/build_bug.h>
+#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(5,1,0) */
+
+/* Admin queue opcodes */
+enum gve_adminq_opcodes {
+ GVE_ADMINQ_DESCRIBE_DEVICE = 0x1,
+ GVE_ADMINQ_CONFIGURE_DEVICE_RESOURCES = 0x2,
+ GVE_ADMINQ_REGISTER_PAGE_LIST = 0x3,
+ GVE_ADMINQ_UNREGISTER_PAGE_LIST = 0x4,
+ GVE_ADMINQ_CREATE_TX_QUEUE = 0x5,
+ GVE_ADMINQ_CREATE_RX_QUEUE = 0x6,
+ GVE_ADMINQ_DESTROY_TX_QUEUE = 0x7,
+ GVE_ADMINQ_DESTROY_RX_QUEUE = 0x8,
+ GVE_ADMINQ_DECONFIGURE_DEVICE_RESOURCES = 0x9,
+ GVE_ADMINQ_SET_DRIVER_PARAMETER = 0xB,
+ GVE_ADMINQ_REPORT_STATS = 0xC,
+ GVE_ADMINQ_REPORT_LINK_SPEED = 0xD
+};
+
+/* Admin queue status codes */
+enum gve_adminq_statuses {
+ GVE_ADMINQ_COMMAND_UNSET = 0x0,
+ GVE_ADMINQ_COMMAND_PASSED = 0x1,
+ GVE_ADMINQ_COMMAND_ERROR_ABORTED = 0xFFFFFFF0,
+ GVE_ADMINQ_COMMAND_ERROR_ALREADY_EXISTS = 0xFFFFFFF1,
+ GVE_ADMINQ_COMMAND_ERROR_CANCELLED = 0xFFFFFFF2,
+ GVE_ADMINQ_COMMAND_ERROR_DATALOSS = 0xFFFFFFF3,
+ GVE_ADMINQ_COMMAND_ERROR_DEADLINE_EXCEEDED = 0xFFFFFFF4,
+ GVE_ADMINQ_COMMAND_ERROR_FAILED_PRECONDITION = 0xFFFFFFF5,
+ GVE_ADMINQ_COMMAND_ERROR_INTERNAL_ERROR = 0xFFFFFFF6,
+ GVE_ADMINQ_COMMAND_ERROR_INVALID_ARGUMENT = 0xFFFFFFF7,
+ GVE_ADMINQ_COMMAND_ERROR_NOT_FOUND = 0xFFFFFFF8,
+ GVE_ADMINQ_COMMAND_ERROR_OUT_OF_RANGE = 0xFFFFFFF9,
+ GVE_ADMINQ_COMMAND_ERROR_PERMISSION_DENIED = 0xFFFFFFFA,
+ GVE_ADMINQ_COMMAND_ERROR_UNAUTHENTICATED = 0xFFFFFFFB,
+ GVE_ADMINQ_COMMAND_ERROR_RESOURCE_EXHAUSTED = 0xFFFFFFFC,
+ GVE_ADMINQ_COMMAND_ERROR_UNAVAILABLE = 0xFFFFFFFD,
+ GVE_ADMINQ_COMMAND_ERROR_UNIMPLEMENTED = 0xFFFFFFFE,
+ GVE_ADMINQ_COMMAND_ERROR_UNKNOWN_ERROR = 0xFFFFFFFF,
+};
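+
+/* The device writes one of the codes above into the status field of a
+ * command once it has executed it; the driver treats anything other than
+ * GVE_ADMINQ_COMMAND_PASSED as a command failure.
+ */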
+
+#define GVE_ADMINQ_DEVICE_DESCRIPTOR_VERSION 1
+
+/* All AdminQ command structs should be naturally packed. The static_assert
+ * calls make sure this is the case at compile time.
+ */
+
+struct gve_adminq_describe_device {
+ __be64 device_descriptor_addr;
+ __be32 device_descriptor_version;
+ __be32 available_length;
+};
+
+static_assert(sizeof(struct gve_adminq_describe_device) == 16);
+
+struct gve_device_descriptor {
+ __be64 max_registered_pages;
+ __be16 reserved1;
+ __be16 tx_queue_entries;
+ __be16 rx_queue_entries;
+ __be16 default_num_queues;
+ __be16 mtu;
+ __be16 counters;
+ __be16 tx_pages_per_qpl;
+ __be16 rx_pages_per_qpl;
+ u8 mac[ETH_ALEN];
+ __be16 num_device_options;
+ __be16 total_length;
+ u8 reserved2[6];
+};
+
+static_assert(sizeof(struct gve_device_descriptor) == 40);
+
+struct gve_device_option {
+ __be16 option_id;
+ __be16 option_length;
+ __be32 feat_mask;
+};
+
+static_assert(sizeof(struct gve_device_option) == 8);
+
+#define GVE_DEV_OPT_ID_RAW_ADDRESSING 0x1
+#define GVE_DEV_OPT_LEN_RAW_ADDRESSING 0x0
+#define GVE_DEV_OPT_FEAT_MASK_RAW_ADDRESSING 0x0
+
+struct gve_adminq_configure_device_resources {
+ __be64 counter_array;
+ __be64 irq_db_addr;
+ __be32 num_counters;
+ __be32 num_irq_dbs;
+ __be32 irq_db_stride;
+ __be32 ntfy_blk_msix_base_idx;
+};
+
+static_assert(sizeof(struct gve_adminq_configure_device_resources) == 32);
+
+struct gve_adminq_register_page_list {
+ __be32 page_list_id;
+ __be32 num_pages;
+ __be64 page_address_list_addr;
+};
+
+static_assert(sizeof(struct gve_adminq_register_page_list) == 16);
+
+struct gve_adminq_unregister_page_list {
+ __be32 page_list_id;
+};
+
+static_assert(sizeof(struct gve_adminq_unregister_page_list) == 4);
+
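+/* Used in the create_tx/rx_queue commands in place of the id of a registered
+ * queue page list when the device operates with raw DMA addressing, in which
+ * case ring descriptors carry DMA addresses directly.
+ */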
+#define GVE_RAW_ADDRESSING_QPL_ID 0xFFFFFFFF
+
+struct gve_adminq_create_tx_queue {
+ __be32 queue_id;
+ __be32 reserved;
+ __be64 queue_resources_addr;
+ __be64 tx_ring_addr;
+ __be32 queue_page_list_id;
+ __be32 ntfy_id;
+};
+
+static_assert(sizeof(struct gve_adminq_create_tx_queue) == 32);
+
+struct gve_adminq_create_rx_queue {
+ __be32 queue_id;
+ __be32 index;
+ __be32 reserved;
+ __be32 ntfy_id;
+ __be64 queue_resources_addr;
+ __be64 rx_desc_ring_addr;
+ __be64 rx_data_ring_addr;
+ __be32 queue_page_list_id;
+ u8 padding[4];
+};
+
+static_assert(sizeof(struct gve_adminq_create_rx_queue) == 48);
+
+/* Queue resources that are shared with the device */
+struct gve_queue_resources {
+ union {
+ struct {
+ __be32 db_index; /* Device -> Guest */
+ __be32 counter_index; /* Device -> Guest */
+ };
+ u8 reserved[64];
+ };
+};
+
+static_assert(sizeof(struct gve_queue_resources) == 64);
+
+struct gve_adminq_destroy_tx_queue {
+ __be32 queue_id;
+};
+
+static_assert(sizeof(struct gve_adminq_destroy_tx_queue) == 4);
+
+struct gve_adminq_destroy_rx_queue {
+ __be32 queue_id;
+};
+
+static_assert(sizeof(struct gve_adminq_destroy_rx_queue) == 4);
+
+/* GVE Set Driver Parameter Types */
+enum gve_set_driver_param_types {
+ GVE_SET_PARAM_MTU = 0x1,
+};
+
+struct gve_adminq_set_driver_parameter {
+ __be32 parameter_type;
+ u8 reserved[4];
+ __be64 parameter_value;
+};
+
+static_assert(sizeof(struct gve_adminq_set_driver_parameter) == 16);
+
+struct gve_adminq_report_stats {
+ __be64 stats_report_len;
+ __be64 stats_report_addr;
+ __be64 interval;
+};
+
+static_assert(sizeof(struct gve_adminq_report_stats) == 24);
+
+struct gve_adminq_report_link_speed {
+ __be64 link_speed_address;
+};
+
+static_assert(sizeof(struct gve_adminq_report_link_speed) == 8);
+
+struct stats {
+ __be32 stat_name;
+ __be32 queue_id;
+ __be64 value;
+};
+
+static_assert(sizeof(struct stats) == 16);
+
+struct gve_stats_report {
+ __be64 written_count;
+ struct stats stats[0];
+};
+
+static_assert(sizeof(struct gve_stats_report) == 8);
+
+enum gve_stat_names {
+ // stats from gve
+ TX_WAKE_CNT = 1,
+ TX_STOP_CNT = 2,
+ TX_FRAMES_SENT = 3,
+ TX_BYTES_SENT = 4,
+ TX_LAST_COMPLETION_PROCESSED = 5,
+ RX_NEXT_EXPECTED_SEQUENCE = 6,
+ RX_BUFFERS_POSTED = 7,
+ // stats from NIC
+ RX_QUEUE_DROP_CNT = 65,
+ RX_NO_BUFFERS_POSTED = 66,
+ RX_DROPS_PACKET_OVER_MRU = 67,
+ RX_DROPS_INVALID_CHECKSUM = 68,
+};
+
+union gve_adminq_command {
+ struct {
+ __be32 opcode;
+ __be32 status;
+ union {
+ struct gve_adminq_configure_device_resources
+ configure_device_resources;
+ struct gve_adminq_create_tx_queue create_tx_queue;
+ struct gve_adminq_create_rx_queue create_rx_queue;
+ struct gve_adminq_destroy_tx_queue destroy_tx_queue;
+ struct gve_adminq_destroy_rx_queue destroy_rx_queue;
+ struct gve_adminq_describe_device describe_device;
+ struct gve_adminq_register_page_list reg_page_list;
+ struct gve_adminq_unregister_page_list unreg_page_list;
+ struct gve_adminq_set_driver_parameter set_driver_param;
+ struct gve_adminq_report_stats report_stats;
+ struct gve_adminq_report_link_speed report_link_speed;
+ };
+ };
+ u8 reserved[64];
+};
+
+static_assert(sizeof(union gve_adminq_command) == 64);
+
+int gve_adminq_alloc(struct device *dev, struct gve_priv *priv);
+void gve_adminq_free(struct device *dev, struct gve_priv *priv);
+void gve_adminq_release(struct gve_priv *priv);
+int gve_adminq_describe_device(struct gve_priv *priv);
+int gve_adminq_configure_device_resources(struct gve_priv *priv,
+ dma_addr_t counter_array_bus_addr,
+ u32 num_counters,
+ dma_addr_t db_array_bus_addr,
+ u32 num_ntfy_blks);
+int gve_adminq_deconfigure_device_resources(struct gve_priv *priv);
+int gve_adminq_create_tx_queues(struct gve_priv *priv, u32 num_queues);
+int gve_adminq_destroy_tx_queues(struct gve_priv *priv, u32 queue_id);
+int gve_adminq_create_rx_queues(struct gve_priv *priv, u32 num_queues);
+int gve_adminq_destroy_rx_queues(struct gve_priv *priv, u32 queue_id);
+int gve_adminq_register_page_list(struct gve_priv *priv,
+ struct gve_queue_page_list *qpl);
+int gve_adminq_unregister_page_list(struct gve_priv *priv, u32 page_list_id);
+int gve_adminq_set_mtu(struct gve_priv *priv, u64 mtu);
+int gve_adminq_report_stats(struct gve_priv *priv, u64 stats_report_len,
+ dma_addr_t stats_report_addr, u64 interval);
+int gve_adminq_report_link_speed(struct gve_priv *priv);
+#endif /* _GVE_ADMINQ_H */
diff --git a/drivers/net/ethernet/google/gve/gve_desc.h b/drivers/net/ethernet/google/gve/gve_desc.h
new file mode 100644
index 0000000..d4553fb
--- /dev/null
+++ b/drivers/net/ethernet/google/gve/gve_desc.h
@@ -0,0 +1,121 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR MIT)
+ * Google virtual Ethernet (gve) driver
+ *
+ * Copyright (C) 2015-2019 Google, Inc.
+ */
+
+/* GVE Transmit Descriptor formats */
+
+#ifndef _GVE_DESC_H_
+#define _GVE_DESC_H_
+
+#if LINUX_VERSION_CODE < KERNEL_VERSION(5,1,0)
+#include "gve_size_assert.h"
+#else /* LINUX_VERSION_CODE < KERNEL_VERSION(5,1,0) */
+#include <linux/build_bug.h>
+#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(5,1,0) */
+
+/* A note on seg_addrs
+ *
+ * Base addresses encoded in seg_addr are not assumed to be physical
+ * addresses. The ring format assumes these come from some linear address
+ * space. This could be physical memory, kernel virtual memory, user virtual
+ * memory.
+ * If raw dma addressing is not supported then gVNIC uses lists of registered
+ * pages. Each queue is assumed to be associated with a single such linear
+ * address space to ensure a consistent meaning for seg_addrs posted to its
+ * rings.
+ */
+
+struct gve_tx_pkt_desc {
+ u8 type_flags; /* desc type is lower 4 bits, flags upper */
+ u8 l4_csum_offset; /* relative offset of L4 csum word */
+ u8 l4_hdr_offset; /* Offset of start of L4 headers in packet */
+ u8 desc_cnt; /* Total descriptors for this packet */
+ __be16 len; /* Total length of this packet (in bytes) */
+ __be16 seg_len; /* Length of this descriptor's segment */
+ __be64 seg_addr; /* Base address (see note) of this segment */
+} __packed;
+
+struct gve_tx_seg_desc {
+ u8 type_flags; /* type is lower 4 bits, flags upper */
+ u8 l3_offset; /* TSO: 2 byte units to start of IPH */
+ __be16 reserved;
+ __be16 mss; /* TSO MSS */
+ __be16 seg_len;
+ __be64 seg_addr;
+} __packed;
+
+/* GVE Transmit Descriptor Types */
+#define GVE_TXD_STD (0x0 << 4) /* Std with Host Address */
+#define GVE_TXD_TSO (0x1 << 4) /* TSO with Host Address */
+#define GVE_TXD_SEG (0x2 << 4) /* Seg with Host Address */
+
+/* GVE Transmit Descriptor Flags for Std Pkts */
+#define GVE_TXF_L4CSUM BIT(0) /* Need csum offload */
+#define GVE_TXF_TSTAMP BIT(2) /* Timestamp required */
+
+/* GVE Transmit Descriptor Flags for TSO Segs */
+#define GVE_TXSF_IPV6 BIT(1) /* IPv6 TSO */
+
+/* GVE Receive Packet Descriptor */
+/* The start of an ethernet packet comes 2 bytes into the rx buffer.
+ * gVNIC adds this padding so that both the DMA and the L3/4 protocol header
+ * access is aligned.
+ */
+#define GVE_RX_PAD 2
+
+struct gve_rx_desc {
+ u8 padding[48];
+ __be32 rss_hash; /* Receive-side scaling hash (Toeplitz for gVNIC) */
+ __be16 mss;
+ __be16 reserved; /* Reserved to zero */
+ u8 hdr_len; /* Header length (L2-L4) including padding */
+ u8 hdr_off; /* 64-byte-scaled offset into RX_DATA entry */
+ __sum16 csum; /* 1's-complement partial checksum of L3+ bytes */
+ __be16 len; /* Length of the received packet */
+ __be16 flags_seq; /* Flags [15:3] and sequence number [2:0] (1-7) */
+} __packed;
+static_assert(sizeof(struct gve_rx_desc) == 64);
+
+/* If the device supports raw dma addressing then the addr in data slot is
+ * the dma address of the buffer.
+ * If the device only supports registered segments then the addr is a byte
+ * offset into the registered segment (an ordered list of pages) where the
+ * buffer is.
+ */
+struct gve_rx_data_slot {
+ __be64 addr;
+};
+
+/* GVE Receive Packet Descriptor Seq No */
+#define GVE_SEQNO(x) (be16_to_cpu(x) & 0x7)
+
+/* GVE Receive Packet Descriptor Flags */
+#define GVE_RXFLG(x) cpu_to_be16(1 << (3 + (x)))
+#define GVE_RXF_FRAG GVE_RXFLG(3) /* IP Fragment */
+#define GVE_RXF_IPV4 GVE_RXFLG(4) /* IPv4 */
+#define GVE_RXF_IPV6 GVE_RXFLG(5) /* IPv6 */
+#define GVE_RXF_TCP GVE_RXFLG(6) /* TCP Packet */
+#define GVE_RXF_UDP GVE_RXFLG(7) /* UDP Packet */
+#define GVE_RXF_ERR GVE_RXFLG(8) /* Packet Error Detected */
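+/* Flags occupy bits 15:3 of the big-endian flags_seq word (the low three bits
+ * carry the sequence number); GVE_RXF_TCP, for example, tests bit 9.
+ */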
+
+/* GVE IRQ */
+#define GVE_IRQ_ACK BIT(31)
+#define GVE_IRQ_MASK BIT(30)
+#define GVE_IRQ_EVENT BIT(29)
+
+static inline bool gve_needs_rss(__be16 flag)
+{
+ if (flag & GVE_RXF_FRAG)
+ return false;
+ if (flag & (GVE_RXF_IPV4 | GVE_RXF_IPV6))
+ return true;
+ return false;
+}
+
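+/* The 3-bit descriptor sequence number cycles 1..7 and never takes the value
+ * 0 (gve_next_seqno(7) == 1), so a zero-initialized descriptor can always be
+ * told apart from one the device has written.
+ */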
+static inline u8 gve_next_seqno(u8 seq)
+{
+ return (seq + 1) == 8 ? 1 : seq + 1;
+}
+#endif /* _GVE_DESC_H_ */
diff --git a/drivers/net/ethernet/google/gve/gve_ethtool.c b/drivers/net/ethernet/google/gve/gve_ethtool.c
new file mode 100644
index 0000000..0ad957f
--- /dev/null
+++ b/drivers/net/ethernet/google/gve/gve_ethtool.c
@@ -0,0 +1,541 @@
+// SPDX-License-Identifier: (GPL-2.0 OR MIT)
+/* Google virtual Ethernet (gve) driver
+ *
+ * Copyright (C) 2015-2019 Google, Inc.
+ */
+
+#include "gve_linux_version.h"
+#include <linux/rtnetlink.h>
+#include "gve.h"
+#include "gve_adminq.h"
+
+static void gve_get_drvinfo(struct net_device *netdev,
+ struct ethtool_drvinfo *info)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+
+ strlcpy(info->driver, "gve", sizeof(info->driver));
+ strlcpy(info->version, gve_version_str, sizeof(info->version));
+ strlcpy(info->bus_info, pci_name(priv->pdev), sizeof(info->bus_info));
+}
+
+static void gve_set_msglevel(struct net_device *netdev, u32 value)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+
+ priv->msg_enable = value;
+}
+
+static u32 gve_get_msglevel(struct net_device *netdev)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+
+ return priv->msg_enable;
+}
+
+static const char gve_gstrings_main_stats[][ETH_GSTRING_LEN] = {
+ "rx_packets", "rx_total_bytes", "rx_total_dropped_pkt",
+ "rx_skb_alloc_fail", "rx_buf_alloc_fail", "rx_desc_err_dropped_pkt",
+ "tx_packets", "tx_total_bytes", "tx_total_dropped_pkt", "tx_timeouts",
+ "interface_up_cnt", "interface_down_cnt", "reset_cnt",
+ "page_alloc_fail", "dma_mapping_error",
+};
+
+static const char gve_gstrings_rx_stats[][ETH_GSTRING_LEN] = {
+ "rx_posted_desc[%u]", "rx_completed_desc[%u]", "rx_bytes[%u]",
+ "rx_dropped_pkt[%u]", "rx_copybreak_pkt[%u]", "rx_copied_pkt[%u]",
+ "rx_queue_drop_cnt[%u]", "rx_no_buffers_posted[%u]",
+ "rx_drops_packet_over_mru[%u]", "rx_drops_invalid_checksum[%u]",
+};
+
+static const char gve_gstrings_tx_stats[][ETH_GSTRING_LEN] = {
+ "tx_posted_desc[%u]", "tx_completed_desc[%u]", "tx_bytes[%u]",
+ "tx_wake[%u]", "tx_stop[%u]", "tx_event_counter[%u]",
+};
+
+static const char gve_gstrings_adminq_stats[][ETH_GSTRING_LEN] = {
+ "adminq_prod_cnt", "adminq_cmd_fail", "adminq_timeouts",
+ "adminq_describe_device_cnt", "adminq_cfg_device_resources_cnt",
+ "adminq_register_page_list_cnt", "adminq_unregister_page_list_cnt",
+ "adminq_create_tx_queue_cnt", "adminq_create_rx_queue_cnt",
+ "adminq_destroy_tx_queue_cnt", "adminq_destroy_rx_queue_cnt",
+ "adminq_dcfg_device_resources_cnt", "adminq_set_driver_parameter_cnt",
+ "adminq_report_stats_cnt",
+};
+
+static const char gve_gstrings_priv_flags[][ETH_GSTRING_LEN] = {
+ "report-stats",
+};
+
+#define GVE_MAIN_STATS_LEN ARRAY_SIZE(gve_gstrings_main_stats)
+#define GVE_ADMINQ_STATS_LEN ARRAY_SIZE(gve_gstrings_adminq_stats)
+#define NUM_GVE_TX_CNTS ARRAY_SIZE(gve_gstrings_tx_stats)
+#define NUM_GVE_RX_CNTS ARRAY_SIZE(gve_gstrings_rx_stats)
+#define GVE_PRIV_FLAGS_STR_LEN ARRAY_SIZE(gve_gstrings_priv_flags)
+
+static void gve_get_strings(struct net_device *netdev, u32 stringset, u8 *data)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+ char *s = (char *)data;
+ int i, j;
+
+ switch (stringset) {
+ case ETH_SS_STATS:
+ memcpy(s, *gve_gstrings_main_stats,
+ sizeof(gve_gstrings_main_stats));
+ s += sizeof(gve_gstrings_main_stats);
+
+ for (i = 0; i < priv->rx_cfg.num_queues; i++) {
+ for (j = 0; j < NUM_GVE_RX_CNTS; j++) {
+ snprintf(s, ETH_GSTRING_LEN,
+ gve_gstrings_rx_stats[j], i);
+ s += ETH_GSTRING_LEN;
+ }
+ }
+
+ for (i = 0; i < priv->tx_cfg.num_queues; i++) {
+ for (j = 0; j < NUM_GVE_TX_CNTS; j++) {
+ snprintf(s, ETH_GSTRING_LEN,
+ gve_gstrings_tx_stats[j], i);
+ s += ETH_GSTRING_LEN;
+ }
+ }
+
+ memcpy(s, *gve_gstrings_adminq_stats,
+ sizeof(gve_gstrings_adminq_stats));
+ s += sizeof(gve_gstrings_adminq_stats);
+ break;
+
+ case ETH_SS_PRIV_FLAGS:
+ memcpy(s, *gve_gstrings_priv_flags,
+ sizeof(gve_gstrings_priv_flags));
+ s += sizeof(gve_gstrings_priv_flags);
+ break;
+
+ default:
+ break;
+ }
+}
+
+static int gve_get_sset_count(struct net_device *netdev, int sset)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+
+ switch (sset) {
+ case ETH_SS_STATS:
+ return GVE_MAIN_STATS_LEN + GVE_ADMINQ_STATS_LEN +
+ (priv->rx_cfg.num_queues * NUM_GVE_RX_CNTS) +
+ (priv->tx_cfg.num_queues * NUM_GVE_TX_CNTS);
+ case ETH_SS_PRIV_FLAGS:
+ return GVE_PRIV_FLAGS_STR_LEN;
+ default:
+ return -EOPNOTSUPP;
+ }
+}
+
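+/* Fills data[] in the same order as the string sets above: the main stats,
+ * NUM_GVE_RX_CNTS entries per RX queue, NUM_GVE_TX_CNTS entries per TX queue,
+ * then the AdminQ counters.  Per-queue NIC-side stats are read from the DMA
+ * stats report shared with the device and are skipped if the device has not
+ * written them yet.
+ */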
+static void
+gve_get_ethtool_stats(struct net_device *netdev,
+ struct ethtool_stats *stats, u64 *data)
+{
+ u64 tmp_rx_pkts, tmp_rx_bytes, tmp_rx_skb_alloc_fail,
+ tmp_rx_buf_alloc_fail, tmp_rx_desc_err_dropped_pkt,
+ tmp_tx_pkts, tmp_tx_bytes;
+ u64 rx_pkts, rx_bytes, rx_skb_alloc_fail, rx_buf_alloc_fail,
+ rx_desc_err_dropped_pkt, tx_pkts, tx_bytes;
+ struct gve_priv *priv = netdev_priv(netdev);
+ int *rx_qid_to_stats_idx;
+ int *tx_qid_to_stats_idx;
+ struct stats *report_stats = priv->stats_report->stats;
+ int stats_idx, base_stats_idx, max_stats_idx;
+ bool skip_nic_stats;
+
+ unsigned int start;
+ int ring;
+ int i, j;
+
+ ASSERT_RTNL();
+#if (LINUX_VERSION_CODE < KERNEL_VERSION(4,11,0))
+ memset(data, 0, stats->n_stats * sizeof(*data));
+#endif /* (LINUX_VERSION_CODE < KERNEL_VERSION(4,11,0)) */
+
+ rx_qid_to_stats_idx = kmalloc_array(priv->rx_cfg.num_queues,
+ sizeof(int), GFP_KERNEL);
+ if (!rx_qid_to_stats_idx) {
+ return;
+ }
+ tx_qid_to_stats_idx = kmalloc_array(priv->tx_cfg.num_queues,
+ sizeof(int), GFP_KERNEL);
+ if (!tx_qid_to_stats_idx) {
+ kfree(rx_qid_to_stats_idx);
+ return;
+ }
+
+ for (rx_pkts = 0, rx_bytes = 0, rx_skb_alloc_fail = 0,
+ rx_buf_alloc_fail = 0, rx_desc_err_dropped_pkt = 0, ring = 0;
+ ring < priv->rx_cfg.num_queues; ring++) {
+ if (priv->rx) {
+ do {
+ struct gve_rx_ring *rx = &priv->rx[ring];
+ start =
+ u64_stats_fetch_begin(&priv->rx[ring].statss);
+ tmp_rx_pkts = rx->rpackets;
+ tmp_rx_bytes = rx->rbytes;
+ tmp_rx_skb_alloc_fail = rx->rx_skb_alloc_fail;
+ tmp_rx_buf_alloc_fail = rx->rx_buf_alloc_fail;
+ tmp_rx_desc_err_dropped_pkt =
+ rx->rx_desc_err_dropped_pkt;
+
+ } while (u64_stats_fetch_retry(&priv->rx[ring].statss,
+ start));
+ rx_pkts += tmp_rx_pkts;
+ rx_bytes += tmp_rx_bytes;
+ rx_skb_alloc_fail += tmp_rx_skb_alloc_fail;
+ rx_buf_alloc_fail += tmp_rx_buf_alloc_fail;
+ rx_desc_err_dropped_pkt += tmp_rx_desc_err_dropped_pkt;
+
+ }
+ }
+ for (tx_pkts = 0, tx_bytes = 0, ring = 0;
+ ring < priv->tx_cfg.num_queues; ring++) {
+ if (priv->tx) {
+ do {
+ start =
+ u64_stats_fetch_begin(&priv->tx[ring].statss);
+ tmp_tx_pkts = priv->tx[ring].pkt_done;
+ tmp_tx_bytes = priv->tx[ring].bytes_done;
+ } while (u64_stats_fetch_retry(&priv->tx[ring].statss,
+ start));
+ tx_pkts += tmp_tx_pkts;
+ tx_bytes += tmp_tx_bytes;
+ }
+ }
+
+ i = 0;
+ data[i++] = rx_pkts;
+ data[i++] = rx_bytes;
+ /* total rx dropped packets */
+ data[i++] = rx_skb_alloc_fail + rx_buf_alloc_fail +
+ rx_desc_err_dropped_pkt;
+ data[i++] = rx_skb_alloc_fail;
+ data[i++] = rx_buf_alloc_fail;
+ data[i++] = rx_desc_err_dropped_pkt;
+ data[i++] = tx_pkts;
+ data[i++] = tx_bytes;
+ /* Skip tx_dropped */
+ i++;
+ data[i++] = priv->tx_timeo_cnt;
+ data[i++] = priv->interface_up_cnt;
+ data[i++] = priv->interface_down_cnt;
+ data[i++] = priv->reset_cnt;
+ data[i++] = priv->page_alloc_fail;
+ data[i++] = priv->dma_mapping_error;
+ i = GVE_MAIN_STATS_LEN;
+
+ /* For rx cross-reporting stats, start from nic rx stats in report */
+ base_stats_idx = GVE_TX_STATS_REPORT_NUM * priv->tx_cfg.num_queues +
+ GVE_RX_STATS_REPORT_NUM * priv->rx_cfg.num_queues;
+ max_stats_idx = NIC_RX_STATS_REPORT_NUM * priv->rx_cfg.num_queues +
+ base_stats_idx;
+ /* Preprocess the stats report for rx, map queue id to start index */
+ skip_nic_stats = false;
+ for (stats_idx = base_stats_idx; stats_idx < max_stats_idx;
+ stats_idx += NIC_RX_STATS_REPORT_NUM) {
+ u32 stat_name = be32_to_cpu(report_stats[stats_idx].stat_name);
+ u32 queue_id = be32_to_cpu(report_stats[stats_idx].queue_id);
+ if (stat_name == 0) {
+ /* no stats written by NIC yet */
+ skip_nic_stats = true;
+ break;
+ }
+ rx_qid_to_stats_idx[queue_id] = stats_idx;
+ }
+ /* walk RX rings */
+ if (priv->rx) {
+ for (ring = 0; ring < priv->rx_cfg.num_queues; ring++) {
+ struct gve_rx_ring *rx = &priv->rx[ring];
+
+ data[i++] = rx->fill_cnt;
+ data[i++] = rx->cnt;
+ do {
+ start =
+ u64_stats_fetch_begin(&priv->rx[ring].statss);
+ tmp_rx_bytes = rx->rbytes;
+ tmp_rx_skb_alloc_fail = rx->rx_skb_alloc_fail;
+ tmp_rx_buf_alloc_fail = rx->rx_buf_alloc_fail;
+ tmp_rx_desc_err_dropped_pkt =
+ rx->rx_desc_err_dropped_pkt;
+ } while (u64_stats_fetch_retry(&priv->rx[ring].statss,
+ start));
+ data[i++] = tmp_rx_bytes;
+ /* rx dropped packets */
+ data[i++] = tmp_rx_skb_alloc_fail +
+ tmp_rx_buf_alloc_fail +
+ tmp_rx_desc_err_dropped_pkt;
+ data[i++] = rx->rx_copybreak_pkt;
+ data[i++] = rx->rx_copied_pkt;
+ /* stats from NIC */
+ if (skip_nic_stats) {
+ /* skip NIC rx stats */
+ i += NIC_RX_STATS_REPORT_NUM;
+ continue;
+ }
+ for (j = 0; j < NIC_RX_STATS_REPORT_NUM; j++) {
+ u64 value = be64_to_cpu(report_stats[
+ rx_qid_to_stats_idx[ring] + j].value);
+ data[i++] = value;
+ }
+ }
+ } else {
+ i += priv->rx_cfg.num_queues * NUM_GVE_RX_CNTS;
+ }
+ /* For tx cross-reporting stats, start from nic tx stats in report */
+ base_stats_idx = max_stats_idx;
+ max_stats_idx = NIC_TX_STATS_REPORT_NUM * priv->tx_cfg.num_queues +
+ max_stats_idx;
+ /* Preprocess the stats report for tx, map queue id to start index */
+ skip_nic_stats = false;
+ for (stats_idx = base_stats_idx; stats_idx < max_stats_idx;
+ stats_idx += NIC_TX_STATS_REPORT_NUM) {
+ u32 stat_name = be32_to_cpu(report_stats[stats_idx].stat_name);
+ u32 queue_id = be32_to_cpu(report_stats[stats_idx].queue_id);
+ if (stat_name == 0) {
+ /* no stats written by NIC yet */
+ skip_nic_stats = true;
+ break;
+ }
+ tx_qid_to_stats_idx[queue_id] = stats_idx;
+ }
+ /* walk TX rings */
+ if (priv->tx) {
+ for (ring = 0; ring < priv->tx_cfg.num_queues; ring++) {
+ struct gve_tx_ring *tx = &priv->tx[ring];
+
+ data[i++] = tx->req;
+ data[i++] = tx->done;
+ do {
+ start =
+ u64_stats_fetch_begin(&priv->tx[ring].statss);
+ tmp_tx_bytes = tx->bytes_done;
+ } while (u64_stats_fetch_retry(&priv->tx[ring].statss,
+ start));
+ data[i++] = tmp_tx_bytes;
+ data[i++] = tx->wake_queue;
+ data[i++] = tx->stop_queue;
+ data[i++] = be32_to_cpu(gve_tx_load_event_counter(priv,
+ tx));
+ /* stats from NIC */
+ if (skip_nic_stats) {
+ /* skip NIC tx stats */
+ i += NIC_TX_STATS_REPORT_NUM;
+ continue;
+ }
+ for (j = 0; j < NIC_TX_STATS_REPORT_NUM; j++) {
+ u64 value = be64_to_cpu(report_stats[
+ tx_qid_to_stats_idx[ring] + j].value);
+ data[i++] = value;
+ }
+ }
+ } else {
+ i += priv->tx_cfg.num_queues * NUM_GVE_TX_CNTS;
+ }
+
+ kfree(rx_qid_to_stats_idx);
+ kfree(tx_qid_to_stats_idx);
+
+ /* AQ Stats */
+ data[i++] = priv->adminq_prod_cnt;
+ data[i++] = priv->adminq_cmd_fail;
+ data[i++] = priv->adminq_timeouts;
+ data[i++] = priv->adminq_describe_device_cnt;
+ data[i++] = priv->adminq_cfg_device_resources_cnt;
+ data[i++] = priv->adminq_register_page_list_cnt;
+ data[i++] = priv->adminq_unregister_page_list_cnt;
+ data[i++] = priv->adminq_create_tx_queue_cnt;
+ data[i++] = priv->adminq_create_rx_queue_cnt;
+ data[i++] = priv->adminq_destroy_tx_queue_cnt;
+ data[i++] = priv->adminq_destroy_rx_queue_cnt;
+ data[i++] = priv->adminq_dcfg_device_resources_cnt;
+ data[i++] = priv->adminq_set_driver_parameter_cnt;
+ data[i++] = priv->adminq_report_stats_cnt;
+}
+
+static void gve_get_channels(struct net_device *netdev,
+ struct ethtool_channels *cmd)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+
+ cmd->max_rx = priv->rx_cfg.max_queues;
+ cmd->max_tx = priv->tx_cfg.max_queues;
+ cmd->max_other = 0;
+ cmd->max_combined = 0;
+ cmd->rx_count = priv->rx_cfg.num_queues;
+ cmd->tx_count = priv->tx_cfg.num_queues;
+ cmd->other_count = 0;
+ cmd->combined_count = 0;
+}
+
+static int gve_set_channels(struct net_device *netdev,
+ struct ethtool_channels *cmd)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+ struct gve_queue_config new_tx_cfg = priv->tx_cfg;
+ struct gve_queue_config new_rx_cfg = priv->rx_cfg;
+ struct ethtool_channels old_settings;
+ int new_tx = cmd->tx_count;
+ int new_rx = cmd->rx_count;
+
+ gve_get_channels(netdev, &old_settings);
+
+ /* Changing combined is not allowed */
+ if (cmd->combined_count != old_settings.combined_count)
+ return -EINVAL;
+
+ if (!new_rx || !new_tx)
+ return -EINVAL;
+
+ if (!netif_carrier_ok(netdev)) {
+ priv->tx_cfg.num_queues = new_tx;
+ priv->rx_cfg.num_queues = new_rx;
+ return 0;
+ }
+
+ new_tx_cfg.num_queues = new_tx;
+ new_rx_cfg.num_queues = new_rx;
+
+ return gve_adjust_queues(priv, new_rx_cfg, new_tx_cfg);
+}
+
+static void gve_get_ringparam(struct net_device *netdev,
+ struct ethtool_ringparam *cmd)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+
+ cmd->rx_max_pending = priv->rx_desc_cnt;
+ cmd->tx_max_pending = priv->tx_desc_cnt;
+ cmd->rx_pending = priv->rx_desc_cnt;
+ cmd->tx_pending = priv->tx_desc_cnt;
+}
+
+static int gve_user_reset(struct net_device *netdev, u32 *flags)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+
+ if (*flags == ETH_RESET_ALL) {
+ *flags = 0;
+ return gve_reset(priv, true);
+ }
+
+ return -EOPNOTSUPP;
+}
+
+static int gve_get_tunable(struct net_device *netdev,
+ const struct ethtool_tunable *etuna, void *value)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+
+ switch (etuna->id) {
+ case ETHTOOL_RX_COPYBREAK:
+ *(u32 *)value = priv->rx_copybreak;
+ return 0;
+ default:
+ return -EINVAL;
+ }
+}
+
+static int gve_set_tunable(struct net_device *netdev,
+ const struct ethtool_tunable *etuna,
+ const void *value)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+ u32 len;
+
+ switch (etuna->id) {
+ case ETHTOOL_RX_COPYBREAK:
+ len = *(u32 *)value;
+ if (len > priv->dev->mtu) {
+ return -EINVAL;
+ }
+ priv->rx_copybreak = len;
+ return 0;
+ default:
+ return -EINVAL;
+ }
+}
+
+static u32 gve_get_priv_flags(struct net_device *netdev)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+ u32 i, ret_flags = 0;
+
+ for (i = 0; i < GVE_PRIV_FLAGS_STR_LEN; i++) {
+ if (priv->ethtool_flags & BIT(i)) {
+ ret_flags |= BIT(i);
+ }
+ }
+ return ret_flags;
+}
+
+static int gve_set_priv_flags(struct net_device *netdev, u32 flags)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+ u64 ori_flags, new_flags;
+ u32 i;
+
+ ori_flags = READ_ONCE(priv->ethtool_flags);
+ new_flags = ori_flags;
+
+ for (i = 0; i < GVE_PRIV_FLAGS_STR_LEN; i++) {
+ if (flags & BIT(i))
+ new_flags |= BIT(i);
+ else
+ new_flags &= ~(BIT(i));
+ priv->ethtool_flags = new_flags;
+ /* set report-stats */
+ if (strcmp(gve_gstrings_priv_flags[i], "report-stats") == 0) {
+ /* update the stats when user turns report-stats on */
+ if (flags & BIT(i))
+ gve_handle_report_stats(priv);
+ /* zero off gve stats when report-stats turned off */
+ if (!(flags & BIT(i)) && (ori_flags & BIT(i))) {
+ int tx_stats_num = GVE_TX_STATS_REPORT_NUM *
+ priv->tx_cfg.num_queues;
+ int rx_stats_num = GVE_RX_STATS_REPORT_NUM *
+ priv->rx_cfg.num_queues;
+ memset(priv->stats_report->stats, 0,
+ (tx_stats_num + rx_stats_num) *
+ sizeof(struct stats));
+ }
+ }
+ }
+
+ return 0;
+}
+
+static int gve_get_link_ksettings(struct net_device *netdev,
+ struct ethtool_link_ksettings *cmd)
+{
+ struct gve_priv *priv = netdev_priv(netdev);
+ int err = gve_adminq_report_link_speed(priv);
+
+ cmd->base.speed = priv->link_speed;
+ return err;
+}
+
+const struct ethtool_ops gve_ethtool_ops = {
+ .get_drvinfo = gve_get_drvinfo,
+ .get_strings = gve_get_strings,
+ .get_sset_count = gve_get_sset_count,
+ .get_ethtool_stats = gve_get_ethtool_stats,
+ .set_msglevel = gve_set_msglevel,
+ .get_msglevel = gve_get_msglevel,
+ .set_channels = gve_set_channels,
+ .get_channels = gve_get_channels,
+ .get_link = ethtool_op_get_link,
+ .get_ringparam = gve_get_ringparam,
+ .reset = gve_user_reset,
+ .get_tunable = gve_get_tunable,
+ .set_tunable = gve_set_tunable,
+ .get_priv_flags = gve_get_priv_flags,
+ .set_priv_flags = gve_set_priv_flags,
+ .get_link_ksettings = gve_get_link_ksettings
+};
diff --git a/drivers/net/ethernet/google/gve/gve_linux_version.h b/drivers/net/ethernet/google/gve/gve_linux_version.h
new file mode 100644
index 0000000..08b6bea
--- /dev/null
+++ b/drivers/net/ethernet/google/gve/gve_linux_version.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR MIT)
+ * Google virtual Ethernet (gve) driver
+ *
+ * Copyright (C) 2015-2018 Google, Inc.
+ */
+
+#ifndef _GVE_LINUX_VERSION_H
+#define _GVE_LINUX_VERSION_H
+
+#ifndef LINUX_VERSION_CODE
+#include <linux/version.h>
+#else
+#define KERNEL_VERSION(a,b,c) (((a) << 16) + ((b) << 8) + (c))
+#endif
+#ifndef UTS_RELEASE
+#include <generated/utsrelease.h>
+#endif /* UTS_RELEASE */
+
+#ifndef RHEL_RELEASE_CODE
+#define RHEL_RELEASE_CODE 0
+#endif /* RHEL_RELEASE_CODE */
+
+#ifndef RHEL_RELEASE_VERSION
+#define RHEL_RELEASE_VERSION(a,b) (((a) << 8) + (b))
+#endif /* RHEL_RELEASE_VERSION */
+
+#ifndef UTS_UBUNTU_RELEASE_ABI
+#define UTS_UBUNTU_RELEASE_ABI 0
+#define UBUNTU_VERSION_CODE 0
+#else
+#define UBUNTU_VERSION_CODE (((LINUX_VERSION_CODE & ~0xFF) << 8) + (UTS_UBUNTU_RELEASE_ABI))
+#endif /* UTS_UBUNTU_RELEASE_ABI */
+
+#define UBUNTU_VERSION(a,b,c,d) ((KERNEL_VERSION(a,b,0) << 8) + (d))
+
+#endif /* _GVE_LINUX_VERSION_H */
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
new file mode 100644
index 0000000..baad1b6
--- /dev/null
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -0,0 +1,1565 @@
+// SPDX-License-Identifier: (GPL-2.0 OR MIT)
+/* Google virtual Ethernet (gve) driver
+ *
+ * Copyright (C) 2015-2019 Google, Inc.
+ */
+
+#include "gve_linux_version.h"
+#include <linux/cpumask.h>
+#include <linux/etherdevice.h>
+#include <linux/interrupt.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/sched.h>
+#include <linux/timer.h>
+#include <linux/workqueue.h>
+#include <net/sch_generic.h>
+#include "gve.h"
+#include "gve_adminq.h"
+#include "gve_register.h"
+
+#define GVE_DEFAULT_RX_COPYBREAK (256)
+
+#define DEFAULT_MSG_LEVEL (NETIF_MSG_DRV | NETIF_MSG_LINK)
+#define GVE_VERSION "1.1.0"
+#define GVE_VERSION_PREFIX "GVE-"
+
+const char gve_version_str[] = GVE_VERSION;
+static const char gve_version_prefix[] = GVE_VERSION_PREFIX;
+
+static void gve_get_stats(struct net_device *dev, struct rtnl_link_stats64 *s)
+{
+ struct gve_priv *priv = netdev_priv(dev);
+ unsigned int start;
+ int ring;
+
+ if (priv->rx) {
+ for (ring = 0; ring < priv->rx_cfg.num_queues; ring++) {
+ do {
+ start =
+ u64_stats_fetch_begin(&priv->rx[ring].statss);
+ s->rx_packets += priv->rx[ring].rpackets;
+ s->rx_bytes += priv->rx[ring].rbytes;
+ } while (u64_stats_fetch_retry(&priv->rx[ring].statss,
+ start));
+ }
+ }
+ if (priv->tx) {
+ for (ring = 0; ring < priv->tx_cfg.num_queues; ring++) {
+ do {
+ start =
+ u64_stats_fetch_begin(&priv->tx[ring].statss);
+ s->tx_packets += priv->tx[ring].pkt_done;
+ s->tx_bytes += priv->tx[ring].bytes_done;
+ } while (u64_stats_fetch_retry(&priv->tx[ring].statss,
+ start));
+ }
+ }
+}
+#if (LINUX_VERSION_CODE < KERNEL_VERSION(4,11,0))
+static struct rtnl_link_stats64 *
+backport_gve_get_stats(struct net_device *dev, struct rtnl_link_stats64 *s)
+{
+ gve_get_stats(dev, s);
+ return s;
+}
+#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(4,11,0) */
+
+static int gve_alloc_counter_array(struct gve_priv *priv)
+{
+ priv->counter_array =
+ dma_alloc_coherent(&priv->pdev->dev,
+ priv->num_event_counters *
+ sizeof(*priv->counter_array),
+ &priv->counter_array_bus, GFP_KERNEL);
+ if (!priv->counter_array)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static void gve_free_counter_array(struct gve_priv *priv)
+{
+ dma_free_coherent(&priv->pdev->dev,
+ priv->num_event_counters *
+ sizeof(*priv->counter_array),
+ priv->counter_array, priv->counter_array_bus);
+ priv->counter_array = NULL;
+}
+
+void gve_service_task_schedule(struct gve_priv *priv)
+{
+ if (!gve_get_probe_in_progress(priv) &&
+ !gve_get_reset_in_progress(priv)) {
+ gve_set_do_report_stats(priv);
+ queue_work(priv->gve_wq, &priv->service_task);
+ }
+}
+
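+/* Periodic service timer: re-arms itself every service_timer_period
+ * milliseconds and kicks the service task so that stats reporting runs in
+ * process context rather than in the timer callback.
+ */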
+static void gve_service_timer(struct timer_list *t)
+{
+ struct gve_priv *priv = from_timer(priv, t, service_timer);
+
+ mod_timer(&priv->service_timer,
+ round_jiffies(jiffies +
+ msecs_to_jiffies(priv->service_timer_period)));
+ gve_service_task_schedule(priv);
+}
+
+static int gve_alloc_stats_report(struct gve_priv *priv)
+{
+ int tx_stats_num, rx_stats_num;
+
+ tx_stats_num = (GVE_TX_STATS_REPORT_NUM + NIC_TX_STATS_REPORT_NUM) *
+ priv->tx_cfg.num_queues;
+ rx_stats_num = (GVE_RX_STATS_REPORT_NUM + NIC_RX_STATS_REPORT_NUM) *
+ priv->rx_cfg.num_queues;
+ priv->stats_report_len = sizeof(struct gve_stats_report) +
+ (tx_stats_num + rx_stats_num) *
+ sizeof(struct stats);
+ priv->stats_report =
+ dma_alloc_coherent(&priv->pdev->dev, priv->stats_report_len,
+ &priv->stats_report_bus, GFP_KERNEL);
+ if (!priv->stats_report)
+ return -ENOMEM;
+ /* Set up timer for periodic task */
+ timer_setup(&priv->service_timer, gve_service_timer, 0);
+ priv->service_timer_period = GVE_SERVICE_TIMER_PERIOD;
+ /* Start the service task timer */
+ mod_timer(&priv->service_timer,
+ round_jiffies(jiffies +
+ msecs_to_jiffies(priv->service_timer_period)));
+ return 0;
+}
+
+static void gve_free_stats_report(struct gve_priv *priv)
+{
+
+ del_timer_sync(&priv->service_timer);
+ dma_free_coherent(&priv->pdev->dev, priv->stats_report_len,
+ priv->stats_report, priv->stats_report_bus);
+ priv->stats_report = NULL;
+}
+
+static irqreturn_t gve_mgmnt_intr(int irq, void *arg)
+{
+ struct gve_priv *priv = arg;
+
+ queue_work(priv->gve_wq, &priv->service_task);
+ return IRQ_HANDLED;
+}
+
+static irqreturn_t gve_intr(int irq, void *arg)
+{
+ struct gve_notify_block *block = arg;
+ struct gve_priv *priv = block->priv;
+
+ iowrite32be(GVE_IRQ_MASK, gve_irq_doorbell(priv, block));
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0)
+ napi_schedule_irqoff(&block->napi);
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0) */
+ napi_schedule(&block->napi);
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0) */
+ return IRQ_HANDLED;
+}
+
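+/* NAPI poll: service the TX and/or RX ring attached to this notification
+ * block.  Returning the full budget keeps the poll scheduled; once the rings
+ * are drained the interrupt is re-armed by writing ACK|EVENT to the block's
+ * doorbell and the rings are checked one more time to close the race with an
+ * event that arrived just before the unmask.
+ */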
+static int gve_napi_poll(struct napi_struct *napi, int budget)
+{
+ struct gve_notify_block *block;
+ __be32 __iomem *irq_doorbell;
+ bool reschedule = false;
+ struct gve_priv *priv;
+
+ block = container_of(napi, struct gve_notify_block, napi);
+ priv = block->priv;
+
+ if (block->tx)
+ reschedule |= gve_tx_poll(block, budget);
+ if (block->rx)
+ reschedule |= gve_rx_poll(block, budget);
+
+ if (reschedule)
+ return budget;
+
+ napi_complete(napi);
+ irq_doorbell = gve_irq_doorbell(priv, block);
+ iowrite32be(GVE_IRQ_ACK | GVE_IRQ_EVENT, irq_doorbell);
+
+ /* Double check we have no extra work.
+ * Ensure unmask synchronizes with checking for work.
+ */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0)
+ dma_rmb();
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0) */
+ rmb();
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0) */
+ if (block->tx)
+ reschedule |= gve_tx_poll(block, -1);
+ if (block->rx)
+ reschedule |= gve_rx_poll(block, -1);
+ if (reschedule && napi_reschedule(napi))
+ iowrite32be(GVE_IRQ_MASK, irq_doorbell);
+
+ return 0;
+}
+
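+/* Allocate one MSI-X vector per notification block plus one management
+ * vector.  If the platform grants fewer vectors than requested, the number
+ * of notification blocks (and with it the TX/RX queue maxima) is scaled down
+ * to match what was granted.
+ */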
+static int gve_alloc_notify_blocks(struct gve_priv *priv)
+{
+ int num_vecs_requested = priv->num_ntfy_blks + 1;
+ char *name = priv->dev->name;
+ unsigned int active_cpus;
+ int vecs_enabled;
+ int i, j;
+ int err;
+
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0)
+ priv->msix_vectors = kvzalloc(num_vecs_requested *
+ sizeof(*priv->msix_vectors), GFP_KERNEL);
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ priv->msix_vectors = kcalloc(num_vecs_requested,
+ sizeof(*priv->msix_vectors), GFP_KERNEL);
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ if (!priv->msix_vectors)
+ return -ENOMEM;
+ for (i = 0; i < num_vecs_requested; i++)
+ priv->msix_vectors[i].entry = i;
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0)
+ vecs_enabled = pci_enable_msix_range(priv->pdev, priv->msix_vectors,
+ GVE_MIN_MSIX, num_vecs_requested);
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0) */
+ vecs_enabled = pci_enable_msix(priv->pdev, priv->msix_vectors,
+ num_vecs_requested);
+ if (!vecs_enabled) {
+ vecs_enabled = num_vecs_requested;
+ }
+ else
+ if (vecs_enabled > 0) {
+ if (vecs_enabled >= GVE_MIN_MSIX) {
+ vecs_enabled = pci_enable_msix(priv->pdev,
+ priv->msix_vectors,
+ GVE_MIN_MSIX);
+ if (vecs_enabled) {
+ dev_err(&priv->pdev->dev,
+ "Could not enable min msix %d error %d\n",
+ GVE_MIN_MSIX, vecs_enabled);
+ err = vecs_enabled;
+ goto abort_with_msix_vectors;
+ }
+ else {
+ vecs_enabled = GVE_MIN_MSIX;
+ }
+ }
+ else {
+ dev_err(&priv->pdev->dev,
+ "Could not enable msix error %d\n",
+ vecs_enabled);
+ err = vecs_enabled;
+ goto abort_with_msix_vectors;
+ }
+ }
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0) */
+
+ if (vecs_enabled < 0) {
+ dev_err(&priv->pdev->dev, "Could not enable min msix %d/%d\n",
+ GVE_MIN_MSIX, vecs_enabled);
+ err = vecs_enabled;
+ goto abort_with_msix_vectors;
+ }
+ if (vecs_enabled != num_vecs_requested) {
+ int new_num_ntfy_blks = (vecs_enabled - 1) & ~0x1;
+ int vecs_per_type = new_num_ntfy_blks / 2;
+ int vecs_left = new_num_ntfy_blks % 2;
+
+ priv->num_ntfy_blks = new_num_ntfy_blks;
+ priv->tx_cfg.max_queues = min_t(int, priv->tx_cfg.max_queues,
+ vecs_per_type);
+ priv->rx_cfg.max_queues = min_t(int, priv->rx_cfg.max_queues,
+ vecs_per_type + vecs_left);
+ dev_err(&priv->pdev->dev,
+ "Could not enable desired msix, only enabled %d, adjusting tx max queues to %d, and rx max queues to %d\n",
+ vecs_enabled, priv->tx_cfg.max_queues,
+ priv->rx_cfg.max_queues);
+ if (priv->tx_cfg.num_queues > priv->tx_cfg.max_queues)
+ priv->tx_cfg.num_queues = priv->tx_cfg.max_queues;
+ if (priv->rx_cfg.num_queues > priv->rx_cfg.max_queues)
+ priv->rx_cfg.num_queues = priv->rx_cfg.max_queues;
+ }
+ /* Half the notification blocks go to TX and half to RX */
+ active_cpus = min_t(int, priv->num_ntfy_blks / 2, num_online_cpus());
+
+ /* Setup Management Vector - the last vector */
+ snprintf(priv->mgmt_msix_name, sizeof(priv->mgmt_msix_name), "%s-mgmnt",
+ name);
+ err = request_irq(priv->msix_vectors[priv->mgmt_msix_idx].vector,
+ gve_mgmnt_intr, 0, priv->mgmt_msix_name, priv);
+ if (err) {
+ dev_err(&priv->pdev->dev, "Did not receive management vector.\n");
+ goto abort_with_msix_enabled;
+ }
+
+ priv->irq_db_indices =
+ dma_alloc_coherent(&priv->pdev->dev,
+ priv->num_ntfy_blks *
+ sizeof(*priv->irq_db_indices),
+ &priv->irq_db_indices_bus, GFP_KERNEL);
+ if (!priv->irq_db_indices) {
+ err = -ENOMEM;
+ goto abort_with_mgmt_vector;
+ }
+
+ priv->ntfy_blocks = kvzalloc(priv->num_ntfy_blks *
+ sizeof(*priv->ntfy_blocks), GFP_KERNEL);
+ if (!priv->ntfy_blocks) {
+ err = -ENOMEM;
+ goto abort_with_irq_db_indices;
+ }
+
+ /* Setup the other blocks - the first n-1 vectors */
+ for (i = 0; i < priv->num_ntfy_blks; i++) {
+ struct gve_notify_block *block = &priv->ntfy_blocks[i];
+ int msix_idx = i;
+
+ snprintf(block->name, sizeof(block->name), "%s-ntfy-block.%d",
+ name, i);
+ block->priv = priv;
+ err = request_irq(priv->msix_vectors[msix_idx].vector,
+ gve_intr, 0, block->name, block);
+ if (err) {
+ dev_err(&priv->pdev->dev,
+ "Failed to receive msix vector %d\n", i);
+ goto abort_with_some_ntfy_blocks;
+ }
+ irq_set_affinity_hint(priv->msix_vectors[msix_idx].vector,
+ get_cpu_mask(i % active_cpus));
+ block->irq_db_index = &priv->irq_db_indices[i].index;
+ }
+ return 0;
+abort_with_some_ntfy_blocks:
+ for (j = 0; j < i; j++) {
+ struct gve_notify_block *block = &priv->ntfy_blocks[j];
+ int msix_idx = j;
+
+ irq_set_affinity_hint(priv->msix_vectors[msix_idx].vector,
+ NULL);
+ free_irq(priv->msix_vectors[msix_idx].vector, block);
+ }
+ kvfree(priv->ntfy_blocks);
+ priv->ntfy_blocks = NULL;
+abort_with_irq_db_indices:
+ dma_free_coherent(&priv->pdev->dev, priv->num_ntfy_blks *
+ sizeof(*priv->irq_db_indices),
+ priv->irq_db_indices, priv->irq_db_indices_bus);
+ priv->irq_db_indices = NULL;
+abort_with_mgmt_vector:
+ free_irq(priv->msix_vectors[priv->mgmt_msix_idx].vector, priv);
+abort_with_msix_enabled:
+ pci_disable_msix(priv->pdev);
+abort_with_msix_vectors:
+ kfree(priv->msix_vectors);
+ priv->msix_vectors = NULL;
+ return err;
+}
+
+static void gve_free_notify_blocks(struct gve_priv *priv)
+{
+ int i;
+
+ /* Free the irqs */
+ for (i = 0; i < priv->num_ntfy_blks; i++) {
+ struct gve_notify_block *block = &priv->ntfy_blocks[i];
+ int msix_idx = i;
+
+ irq_set_affinity_hint(priv->msix_vectors[msix_idx].vector,
+ NULL);
+ free_irq(priv->msix_vectors[msix_idx].vector, block);
+ }
+ kvfree(priv->ntfy_blocks);
+ priv->ntfy_blocks = NULL;
+ dma_free_coherent(&priv->pdev->dev, priv->num_ntfy_blks *
+ sizeof(*priv->irq_db_indices),
+ priv->irq_db_indices, priv->irq_db_indices_bus);
+ priv->irq_db_indices = NULL;
+ free_irq(priv->msix_vectors[priv->mgmt_msix_idx].vector, priv);
+ pci_disable_msix(priv->pdev);
+ kfree(priv->msix_vectors);
+ priv->msix_vectors = NULL;
+}
+
+static int gve_setup_device_resources(struct gve_priv *priv)
+{
+ int err;
+
+ err = gve_alloc_counter_array(priv);
+ if (err)
+ return err;
+ err = gve_alloc_notify_blocks(priv);
+ if (err)
+ goto abort_with_counter;
+ err = gve_alloc_stats_report(priv);
+ if (err)
+ goto abort_with_ntfy_blocks;
+ err = gve_adminq_configure_device_resources(priv,
+ priv->counter_array_bus,
+ priv->num_event_counters,
+ priv->irq_db_indices_bus,
+ priv->num_ntfy_blks);
+ if (unlikely(err)) {
+ dev_err(&priv->pdev->dev,
+ "could not setup device_resources: err=%d\n", err);
+ err = -ENXIO;
+ goto abort_with_stats_report;
+ }
+ err = gve_adminq_report_stats(priv, priv->stats_report_len,
+ priv->stats_report_bus,
+ GVE_SERVICE_TIMER_PERIOD);
+ if (err)
+ dev_err(&priv->pdev->dev,
+ "Failed to report stats: err=%d\n", err);
+ gve_set_device_resources_ok(priv);
+ return 0;
+abort_with_stats_report:
+ gve_free_stats_report(priv);
+abort_with_ntfy_blocks:
+ gve_free_notify_blocks(priv);
+abort_with_counter:
+ gve_free_counter_array(priv);
+ return err;
+}
+
+static void gve_trigger_reset(struct gve_priv *priv);
+
+static void gve_teardown_device_resources(struct gve_priv *priv)
+{
+ int err;
+
+ /* Tell device its resources are being freed */
+ if (gve_get_device_resources_ok(priv)) {
+ /* detach the stats report */
+ err = gve_adminq_report_stats(priv, 0, 0x0,
+ GVE_SERVICE_TIMER_PERIOD);
+ if (err) {
+ dev_err(&priv->pdev->dev,
+ "Failed to detach stats report: err=%d\n", err);
+ gve_trigger_reset(priv);
+ }
+ err = gve_adminq_deconfigure_device_resources(priv);
+ if (err) {
+ dev_err(&priv->pdev->dev,
+ "Could not deconfigure device resources: err=%d\n",
+ err);
+ gve_trigger_reset(priv);
+ }
+ }
+ gve_free_counter_array(priv);
+ gve_free_notify_blocks(priv);
+ gve_free_stats_report(priv);
+ gve_clear_device_resources_ok(priv);
+}
+
+static void gve_add_napi(struct gve_priv *priv, int ntfy_idx)
+{
+ struct gve_notify_block *block = &priv->ntfy_blocks[ntfy_idx];
+
+ netif_napi_add(priv->dev, &block->napi, gve_napi_poll,
+ NAPI_POLL_WEIGHT);
+#if LINUX_VERSION_CODE < KERNEL_VERSION(4,5,0) && LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0)
+ napi_hash_add(&block->napi);
+#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(4,5,0) && LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0) */
+}
+
+static void gve_remove_napi(struct gve_priv *priv, int ntfy_idx)
+{
+ struct gve_notify_block *block = &priv->ntfy_blocks[ntfy_idx];
+
+#if LINUX_VERSION_CODE < KERNEL_VERSION(4,5,0) && LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0)
+ napi_hash_del(&block->napi);
+ synchronize_net();
+#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(4,5,0) && LINUX_VERSION_CODE >= KERNEL_VERSION(3,11,0) */
+ netif_napi_del(&block->napi);
+}
+
+static int gve_register_qpls(struct gve_priv *priv)
+{
+ int num_qpls = gve_num_tx_qpls(priv) + gve_num_rx_qpls(priv);
+ int err;
+ int i;
+
+ for (i = 0; i < num_qpls; i++) {
+ err = gve_adminq_register_page_list(priv, &priv->qpls[i]);
+ if (err) {
+ netif_err(priv, drv, priv->dev,
+ "failed to register queue page list %d\n",
+ priv->qpls[i].id);
+ /* This failure will trigger a reset - no need to clean
+ * up
+ */
+ return err;
+ }
+ }
+ return 0;
+}
+
+static int gve_unregister_qpls(struct gve_priv *priv)
+{
+ int num_qpls = gve_num_tx_qpls(priv) + gve_num_rx_qpls(priv);
+ int err;
+ int i;
+
+ for (i = 0; i < num_qpls; i++) {
+ err = gve_adminq_unregister_page_list(priv, priv->qpls[i].id);
+ /* This failure will trigger a reset - no need to clean up */
+ if (err) {
+ netif_err(priv, drv, priv->dev,
+ "Failed to unregister queue page list %d\n",
+ priv->qpls[i].id);
+ return err;
+ }
+ }
+ return 0;
+}
+
+static int gve_create_rings(struct gve_priv *priv)
+{
+ int err;
+ int i;
+
+ err = gve_adminq_create_tx_queues(priv, priv->tx_cfg.num_queues);
+ if (err) {
+ netif_err(priv, drv, priv->dev, "failed to create %d tx queues\n",
+ priv->tx_cfg.num_queues);
+ /* This failure will trigger a reset - no need to clean
+ * up
+ */
+ return err;
+ }
+ netif_dbg(priv, drv, priv->dev, "created %d tx queues \n",
+ priv->tx_cfg.num_queues);
+
+ err = gve_adminq_create_rx_queues(priv, priv->rx_cfg.num_queues);
+ if (err) {
+ netif_err(priv, drv, priv->dev, "failed to create %d rx queues\n",
+ priv->rx_cfg.num_queues);
+ /* This failure will trigger a reset - no need to clean
+ * up
+ */
+ return err;
+ }
+ netif_dbg(priv, drv, priv->dev, "created %d rx queues \n",
+ priv->rx_cfg.num_queues);
+
+ /* Rx data ring has been prefilled with packet buffers at queue
+ * allocation time.
+ * Write the doorbell to provide descriptor slots and packet buffers
+ * to the NIC.
+ */
+ for (i = 0; i < priv->rx_cfg.num_queues; i++) {
+ gve_rx_write_doorbell(priv, &priv->rx[i]);
+ }
+
+ return 0;
+}
+
+static int gve_alloc_rings(struct gve_priv *priv)
+{
+ int ntfy_idx;
+ int err;
+ int i;
+
+ /* Setup tx rings */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0)
+ priv->tx = kvzalloc(priv->tx_cfg.num_queues * sizeof(*priv->tx),
+ GFP_KERNEL);
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ priv->tx = kcalloc(priv->tx_cfg.num_queues, sizeof(*priv->tx),
+ GFP_KERNEL);
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ if (!priv->tx)
+ return -ENOMEM;
+ err = gve_tx_alloc_rings(priv);
+ if (err)
+ goto free_tx;
+ /* Setup rx rings */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0)
+ priv->rx = kvzalloc(priv->rx_cfg.num_queues * sizeof(*priv->rx),
+ GFP_KERNEL);
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ priv->rx = kcalloc(priv->rx_cfg.num_queues, sizeof(*priv->rx),
+ GFP_KERNEL);
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ if (!priv->rx) {
+ err = -ENOMEM;
+ goto free_tx_queue;
+ }
+ err = gve_rx_alloc_rings(priv);
+ if (err)
+ goto free_rx;
+ /* Add tx napi & init sync stats*/
+ for (i = 0; i < priv->tx_cfg.num_queues; i++) {
+ u64_stats_init(&priv->tx[i].statss);
+ ntfy_idx = gve_tx_idx_to_ntfy(priv, i);
+ gve_add_napi(priv, ntfy_idx);
+ }
+ /* Add rx napi & init sync stats*/
+ for (i = 0; i < priv->rx_cfg.num_queues; i++) {
+ u64_stats_init(&priv->rx[i].statss);
+ ntfy_idx = gve_rx_idx_to_ntfy(priv, i);
+ gve_add_napi(priv, ntfy_idx);
+ }
+
+ return 0;
+
+free_rx:
+ kfree(priv->rx);
+ priv->rx = NULL;
+free_tx_queue:
+ gve_tx_free_rings(priv);
+free_tx:
+ kfree(priv->tx);
+ priv->tx = NULL;
+ return err;
+}
+
+static int gve_destroy_rings(struct gve_priv *priv)
+{
+ int err;
+
+ err = gve_adminq_destroy_tx_queues(priv, priv->tx_cfg.num_queues);
+ if (err) {
+ netif_err(priv, drv, priv->dev,
+ "failed to destroy tx queues\n");
+ /* This failure will trigger a reset - no need to clean up */
+ return err;
+ }
+ netif_dbg(priv, drv, priv->dev, "destroyed tx queues\n");
+ err = gve_adminq_destroy_rx_queues(priv, priv->rx_cfg.num_queues);
+ if (err) {
+ netif_err(priv, drv, priv->dev,
+ "failed to destroy rx queues\n");
+ /* This failure will trigger a reset - no need to clean up */
+ return err;
+ }
+ netif_dbg(priv, drv, priv->dev, "destroyed rx queues\n");
+ return 0;
+}
+
+static void gve_free_rings(struct gve_priv *priv)
+{
+ int ntfy_idx;
+ int i;
+
+ if (priv->tx) {
+ for (i = 0; i < priv->tx_cfg.num_queues; i++) {
+ ntfy_idx = gve_tx_idx_to_ntfy(priv, i);
+ gve_remove_napi(priv, ntfy_idx);
+ }
+ gve_tx_free_rings(priv);
+ kfree(priv->tx);
+ priv->tx = NULL;
+ }
+ if (priv->rx) {
+ for (i = 0; i < priv->rx_cfg.num_queues; i++) {
+ ntfy_idx = gve_rx_idx_to_ntfy(priv, i);
+ gve_remove_napi(priv, ntfy_idx);
+ }
+ gve_rx_free_rings(priv);
+ kfree(priv->rx);
+ priv->rx = NULL;
+ }
+}
+
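+/* Allocate and DMA-map a single page for a queue, steering the allocation to
+ * ZONE_DMA or ZONE_DMA32 when the device is limited to a 24- or 32-bit DMA
+ * mask.  Failures are counted in page_alloc_fail / dma_mapping_error, which
+ * are exported through ethtool.
+ */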
+int gve_alloc_page(struct gve_priv *priv, struct device *dev,
+ struct page **page, dma_addr_t *dma,
+ enum dma_data_direction dir, gfp_t gfp_flags)
+{
+ if (priv->dma_mask == 24)
+ gfp_flags |= GFP_DMA;
+ else if (priv->dma_mask == 32)
+ gfp_flags |= GFP_DMA32;
+
+ *page = alloc_page(gfp_flags);
+ if (!*page) {
+ priv->page_alloc_fail++;
+ return -ENOMEM;
+ }
+ *dma = dma_map_page(dev, *page, 0, PAGE_SIZE, dir);
+ if (dma_mapping_error(dev, *dma)) {
+ priv->dma_mapping_error++;
+ put_page(*page);
+ *page = NULL;
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+static int gve_alloc_queue_page_list(struct gve_priv *priv, u32 id,
+ int pages)
+{
+ struct gve_queue_page_list *qpl = &priv->qpls[id];
+ int err;
+ int i;
+
+ if (pages + priv->num_registered_pages > priv->max_registered_pages) {
+ netif_err(priv, drv, priv->dev,
+ "Reached max number of registered pages %llu > %llu\n",
+ pages + priv->num_registered_pages,
+ priv->max_registered_pages);
+ return -EINVAL;
+ }
+
+ qpl->id = id;
+ qpl->num_entries = 0;
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0)
+ qpl->pages = kvzalloc(pages * sizeof(*qpl->pages), GFP_KERNEL);
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ qpl->pages = kcalloc(pages, sizeof(*qpl->pages), GFP_KERNEL);
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ /* caller handles clean up */
+ if (!qpl->pages)
+ return -ENOMEM;
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0)
+ qpl->page_buses = kvzalloc(pages * sizeof(*qpl->page_buses),
+ GFP_KERNEL);
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ qpl->page_buses = kcalloc(pages, sizeof(*qpl->page_buses), GFP_KERNEL);
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ /* caller handles clean up */
+ if (!qpl->page_buses)
+ return -ENOMEM;
+
+ for (i = 0; i < pages; i++) {
+ err = gve_alloc_page(priv, &priv->pdev->dev, &qpl->pages[i],
+ &qpl->page_buses[i],
+ gve_qpl_dma_dir(priv, id), GFP_KERNEL);
+ /* caller handles clean up */
+ if (err)
+ return -ENOMEM;
+ qpl->num_entries++;
+ }
+ priv->num_registered_pages += pages;
+
+ return 0;
+}
+
+void gve_free_page(struct device *dev, struct page *page, dma_addr_t dma,
+ enum dma_data_direction dir)
+{
+ if (!dma_mapping_error(dev, dma))
+ dma_unmap_page(dev, dma, PAGE_SIZE, dir);
+ if (page)
+ put_page(page);
+}
+
+static void gve_free_queue_page_list(struct gve_priv *priv,
+ int id)
+{
+ struct gve_queue_page_list *qpl = &priv->qpls[id];
+ int i;
+
+ if (!qpl->pages)
+ return;
+ if (!qpl->page_buses)
+ goto free_pages;
+
+ for (i = 0; i < qpl->num_entries; i++)
+ gve_free_page(&priv->pdev->dev, qpl->pages[i],
+ qpl->page_buses[i], gve_qpl_dma_dir(priv, id));
+
+ kfree(qpl->page_buses);
+free_pages:
+ kfree(qpl->pages);
+ priv->num_registered_pages -= qpl->num_entries;
+}
+
+static int gve_alloc_qpls(struct gve_priv *priv)
+{
+ int num_qpls = gve_num_tx_qpls(priv) + gve_num_rx_qpls(priv);
+ int i, j;
+ int err;
+
+ /* Raw addressing means no QPLs */
+ if (priv->raw_addressing)
+ return 0;
+
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0)
+ priv->qpls = kvzalloc(num_qpls * sizeof(*priv->qpls), GFP_KERNEL);
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ priv->qpls = kcalloc(num_qpls, sizeof(*priv->qpls), GFP_KERNEL);
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ if (!priv->qpls)
+ return -ENOMEM;
+
+ for (i = 0; i < gve_num_tx_qpls(priv); i++) {
+ err = gve_alloc_queue_page_list(priv, i,
+ priv->tx_pages_per_qpl);
+ if (err)
+ goto free_qpls;
+ }
+ for (; i < num_qpls; i++) {
+ err = gve_alloc_queue_page_list(priv, i,
+ priv->rx_data_slot_cnt);
+ if (err)
+ goto free_qpls;
+ }
+
+ priv->qpl_cfg.qpl_map_size = BITS_TO_LONGS(num_qpls) *
+ sizeof(unsigned long) * BITS_PER_BYTE;
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0)
+ priv->qpl_cfg.qpl_id_map = kvzalloc(BITS_TO_LONGS(num_qpls) *
+ sizeof(unsigned long), GFP_KERNEL);
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ priv->qpl_cfg.qpl_id_map = kcalloc(BITS_TO_LONGS(num_qpls),
+ sizeof(unsigned long), GFP_KERNEL);
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ if (!priv->qpl_cfg.qpl_id_map) {
+ err = -ENOMEM;
+ goto free_qpls;
+ }
+
+ return 0;
+
+free_qpls:
+ for (j = 0; j <= i && j < num_qpls; j++)
+ gve_free_queue_page_list(priv, j);
+ kfree(priv->qpls);
+ return err;
+}
+
+static void gve_free_qpls(struct gve_priv *priv)
+{
+ int num_qpls = gve_num_tx_qpls(priv) + gve_num_rx_qpls(priv);
+ int i;
+
+ /* Raw addressing means no QPLs */
+ if (priv->raw_addressing)
+ return;
+
+ kfree(priv->qpl_cfg.qpl_id_map);
+
+ for (i = 0; i < num_qpls; i++)
+ gve_free_queue_page_list(priv, i);
+
+ kfree(priv->qpls);
+}
+
+/* Use this to schedule a reset when the device is capable of continuing
+ * to handle other requests in its current state. If it is not, do a reset
+ * in thread instead.
+ */
+void gve_schedule_reset(struct gve_priv *priv)
+{
+ gve_set_do_reset(priv);
+ queue_work(priv->gve_wq, &priv->service_task);
+}
+
+static void gve_reset_and_teardown(struct gve_priv *priv, bool was_up);
+static int gve_reset_recovery(struct gve_priv *priv, bool was_up);
+static void gve_turndown(struct gve_priv *priv);
+static void gve_turnup(struct gve_priv *priv);
+
+static int gve_open(struct net_device *dev)
+{
+ struct gve_priv *priv = netdev_priv(dev);
+ int err;
+
+ err = gve_alloc_qpls(priv);
+ if (err)
+ return err;
+ err = gve_alloc_rings(priv);
+ if (err)
+ goto free_qpls;
+
+ err = netif_set_real_num_tx_queues(dev, priv->tx_cfg.num_queues);
+ if (err)
+ goto free_rings;
+ err = netif_set_real_num_rx_queues(dev, priv->rx_cfg.num_queues);
+ if (err)
+ goto free_rings;
+
+ err = gve_register_qpls(priv);
+ if (err)
+ goto reset;
+ err = gve_create_rings(priv);
+ if (err)
+ goto reset;
+ gve_set_device_rings_ok(priv);
+
+ gve_turnup(priv);
+ queue_work(priv->gve_wq, &priv->service_task);
+ priv->interface_up_cnt++;
+ return 0;
+
+free_rings:
+ gve_free_rings(priv);
+free_qpls:
+ gve_free_qpls(priv);
+ return err;
+
+reset:
+ /* This must have been called from a reset due to the rtnl lock
+ * so just return at this point.
+ */
+ if (gve_get_reset_in_progress(priv))
+ return err;
+ /* Otherwise reset before returning */
+ gve_reset_and_teardown(priv, true);
+ /* if this fails there is nothing we can do so just ignore the return */
+ gve_reset_recovery(priv, false);
+ /* return the original error */
+ return err;
+}
+
+static int gve_close(struct net_device *dev)
+{
+ struct gve_priv *priv = netdev_priv(dev);
+ int err;
+
+ netif_carrier_off(dev);
+ if (gve_get_device_rings_ok(priv)) {
+ gve_turndown(priv);
+ err = gve_destroy_rings(priv);
+ if (err)
+ goto err;
+ err = gve_unregister_qpls(priv);
+ if (err)
+ goto err;
+ gve_clear_device_rings_ok(priv);
+ }
+
+ gve_free_rings(priv);
+ gve_free_qpls(priv);
+ priv->interface_down_cnt++;
+ return 0;
+
+err:
+ /* This must have been called from a reset due to the rtnl lock
+ * so just return at this point.
+ */
+ if (gve_get_reset_in_progress(priv))
+ return err;
+ /* Otherwise reset before returning */
+ gve_reset_and_teardown(priv, true);
+ return gve_reset_recovery(priv, false);
+}
+
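+/* Apply a new TX/RX queue configuration.  Called under rtnl (via ethtool
+ * ->set_channels); if the interface is up it is closed, reconfigured and
+ * re-opened, otherwise the new configuration simply takes effect on the next
+ * open.
+ */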
+int gve_adjust_queues(struct gve_priv *priv,
+ struct gve_queue_config new_rx_config,
+ struct gve_queue_config new_tx_config)
+{
+ int err;
+
+ if (netif_carrier_ok(priv->dev)) {
+ /* To make this process as simple as possible we teardown the
+ * device, set the new configuration, and then bring the device
+ * up again.
+ */
+ err = gve_close(priv->dev);
+ /* we have already tried to reset in close,
+ * just fail at this point
+ */
+ if (err)
+ return err;
+ priv->tx_cfg = new_tx_config;
+ priv->rx_cfg = new_rx_config;
+
+ err = gve_open(priv->dev);
+ if (err)
+ goto err;
+
+ return 0;
+ }
+ /* Set the config for the next up. */
+ priv->tx_cfg = new_tx_config;
+ priv->rx_cfg = new_rx_config;
+
+ return 0;
+err:
+ netif_err(priv, drv, priv->dev,
+ "Adjust queues failed! !!! DISABLING ALL QUEUES !!!\n");
+ gve_turndown(priv);
+ return err;
+}
+
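+/* Quiesce the data path: drop the carrier, disable NAPI on every TX and RX
+ * notification block, stop the TX queues and clear the napi-enabled and
+ * report-stats state flags.  The admin queue is left usable so the device can
+ * still be reconfigured.
+ */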
+static void gve_turndown(struct gve_priv *priv)
+{
+ int idx;
+
+ if (netif_carrier_ok(priv->dev))
+ netif_carrier_off(priv->dev);
+
+ if (!gve_get_napi_enabled(priv))
+ return;
+
+ /* Disable napi to prevent more work from coming in */
+ for (idx = 0; idx < priv->tx_cfg.num_queues; idx++) {
+ int ntfy_idx = gve_tx_idx_to_ntfy(priv, idx);
+ struct gve_notify_block *block = &priv->ntfy_blocks[ntfy_idx];
+
+ napi_disable(&block->napi);
+ }
+ for (idx = 0; idx < priv->rx_cfg.num_queues; idx++) {
+ int ntfy_idx = gve_rx_idx_to_ntfy(priv, idx);
+ struct gve_notify_block *block = &priv->ntfy_blocks[ntfy_idx];
+
+ napi_disable(&block->napi);
+ }
+
+ /* Stop tx queues */
+ netif_tx_disable(priv->dev);
+
+ gve_clear_napi_enabled(priv);
+ gve_clear_report_stats(priv);
+}
+
+static void gve_turnup(struct gve_priv *priv)
+{
+ int idx;
+
+ /* Start the tx queues */
+ netif_tx_start_all_queues(priv->dev);
+
+ /* Enable napi and unmask interrupts for all queues */
+ for (idx = 0; idx < priv->tx_cfg.num_queues; idx++) {
+ int ntfy_idx = gve_tx_idx_to_ntfy(priv, idx);
+ struct gve_notify_block *block = &priv->ntfy_blocks[ntfy_idx];
+
+ napi_enable(&block->napi);
+ iowrite32be(0, gve_irq_doorbell(priv, block));
+ }
+ for (idx = 0; idx < priv->rx_cfg.num_queues; idx++) {
+ int ntfy_idx = gve_rx_idx_to_ntfy(priv, idx);
+ struct gve_notify_block *block = &priv->ntfy_blocks[ntfy_idx];
+
+ napi_enable(&block->napi);
+ iowrite32be(0, gve_irq_doorbell(priv, block));
+ }
+
+ gve_set_napi_enabled(priv);
+}
+
+static void gve_tx_timeout(struct net_device *dev)
+{
+ struct gve_priv *priv = netdev_priv(dev);
+
+ gve_schedule_reset(priv);
+ priv->tx_timeo_cnt++;
+}
+
+#if (LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0))
+int gve_change_mtu(struct net_device *dev, int new_mtu)
+{
+ struct gve_priv *priv = netdev_priv(dev);
+
+ if (new_mtu < ETH_MIN_MTU || new_mtu > priv->max_mtu)
+ return -EINVAL;
+ dev->mtu = new_mtu;
+ return 0;
+}
+#endif /* (LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0)) */
+
+static const struct net_device_ops gve_netdev_ops = {
+#if (LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0))
+#if RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(7, 5) && RHEL_RELEASE_CODE < RHEL_RELEASE_VERSION(8, 0)
+ .ndo_change_mtu_rh74 = gve_change_mtu,
+#else /* RHEL_RELEASE_CODE < RHEL_RELEASE_VERSION(7, 5) || RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(8, 0) */
+
+ .ndo_change_mtu = gve_change_mtu,
+#endif /* RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(7, 5) && RHEL_RELEASE_CODE < RHEL_RELEASE_VERSION(8, 0) */
+#endif /* (LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0)) */
+
+ .ndo_start_xmit = gve_tx,
+ .ndo_open = gve_open,
+ .ndo_stop = gve_close,
+#if (LINUX_VERSION_CODE < KERNEL_VERSION(4,11,0))
+ .ndo_get_stats64 = backport_gve_get_stats,
+#else /* LINUX_VERSION_CODE < KERNEL_VERSION(4,11,0) */
+
+ .ndo_get_stats64 = gve_get_stats,
+#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(4,11,0) */
+ .ndo_tx_timeout = gve_tx_timeout,
+};
+
+static void gve_handle_status(struct gve_priv *priv, u32 status)
+{
+ if (GVE_DEVICE_STATUS_RESET_MASK & status) {
+ dev_info(&priv->pdev->dev, "Device requested reset.\n");
+ gve_set_do_reset(priv);
+ }
+ if (GVE_DEVICE_STATUS_REPORT_STATS_MASK & status) {
+ dev_info(&priv->pdev->dev, "Device report stats on.\n");
+ gve_set_do_report_stats(priv);
+ }
+}
+
+static void gve_handle_reset(struct gve_priv *priv)
+{
+ /* A service task will be scheduled at the end of probe to catch any
+ * resets that need to happen, and we don't want to reset until
+ * probe is done.
+ */
+ if (gve_get_probe_in_progress(priv))
+ return;
+
+ if (gve_get_do_reset(priv)) {
+ rtnl_lock();
+ gve_reset(priv, false);
+ rtnl_unlock();
+ }
+}
+
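+/* Fill the shared stats report with per-queue tx/rx counters after bumping
+ * its written_count; tx byte counts are read under the u64_stats seqcount.
+ */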
+void gve_handle_report_stats(struct gve_priv *priv)
+{
+ int idx, stats_idx = 0, tx_bytes;
+ unsigned int start = 0;
+ struct stats *stats = priv->stats_report->stats;
+
+ if (!gve_get_report_stats(priv))
+ return;
+
+ be64_add_cpu(&priv->stats_report->written_count, 1);
+ /* tx stats */
+ if (priv->tx) {
+ for (idx = 0; idx < priv->tx_cfg.num_queues; idx++) {
+ do {
+ start = u64_stats_fetch_begin(&priv->tx[idx].statss);
+ tx_bytes = priv->tx[idx].bytes_done;
+ } while (u64_stats_fetch_retry(&priv->tx[idx].statss, start));
+ stats[stats_idx++] = (struct stats) {
+ .stat_name = cpu_to_be32(TX_WAKE_CNT),
+ .value = cpu_to_be64(priv->tx[idx].wake_queue),
+ .queue_id = cpu_to_be32(idx),
+ };
+ stats[stats_idx++] = (struct stats) {
+ .stat_name = cpu_to_be32(TX_STOP_CNT),
+ .value = cpu_to_be64(priv->tx[idx].stop_queue),
+ .queue_id = cpu_to_be32(idx),
+ };
+ stats[stats_idx++] = (struct stats) {
+ .stat_name = cpu_to_be32(TX_FRAMES_SENT),
+ .value = cpu_to_be64(priv->tx[idx].req),
+ .queue_id = cpu_to_be32(idx),
+ };
+ stats[stats_idx++] = (struct stats) {
+ .stat_name = cpu_to_be32(TX_BYTES_SENT),
+ .value = cpu_to_be64(tx_bytes),
+ .queue_id = cpu_to_be32(idx),
+ };
+ stats[stats_idx++] = (struct stats) {
+ .stat_name = cpu_to_be32(
+ TX_LAST_COMPLETION_PROCESSED),
+ .value = cpu_to_be64(priv->tx[idx].done),
+ .queue_id = cpu_to_be32(idx),
+ };
+ }
+ }
+ /* rx stats */
+ if (priv->rx) {
+ for (idx = 0; idx < priv->rx_cfg.num_queues; idx++) {
+ stats[stats_idx++] = (struct stats) {
+ .stat_name = cpu_to_be32(
+ RX_NEXT_EXPECTED_SEQUENCE),
+ .value = cpu_to_be64(priv->rx[idx].desc.seqno),
+ .queue_id = cpu_to_be32(idx),
+ };
+ stats[stats_idx++] = (struct stats) {
+ .stat_name = cpu_to_be32(RX_BUFFERS_POSTED),
+ .value = cpu_to_be64(priv->rx[idx].fill_cnt),
+ .queue_id = cpu_to_be32(idx),
+ };
+ }
+ }
+}
+
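+/* Mirror the device-reported link state onto the netdev carrier once NAPI
+ * is running, logging when the link goes down.
+ */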
+void gve_handle_link_status(struct gve_priv *priv, bool link_status)
+{
+ if (!gve_get_napi_enabled(priv))
+ return;
+
+ if (link_status == netif_carrier_ok(priv->dev))
+ return;
+
+ if (link_status) {
+ netif_carrier_on(priv->dev);
+ } else {
+ dev_info(&priv->pdev->dev, "Device link is down.\n");
+ netif_carrier_off(priv->dev);
+ }
+}
+
+/* Handle NIC status register changes, reset requests and report stats */
+static void gve_service_task(struct work_struct *work)
+{
+ struct gve_priv *priv = container_of(work, struct gve_priv,
+ service_task);
+ u32 status = ioread32be(&priv->reg_bar0->device_status);
+
+ gve_handle_status(priv, status);
+
+ gve_handle_reset(priv);
+ gve_handle_link_status(priv, GVE_DEVICE_STATUS_LINK_STATUS_MASK & status);
+ if (gve_get_do_report_stats(priv)) {
+ gve_handle_report_stats(priv);
+ gve_clear_do_report_stats(priv);
+ }
+}
+
+static int gve_init_priv(struct gve_priv *priv, bool skip_describe_device)
+{
+ int num_ntfy;
+ int err;
+
+ /* Set up the adminq */
+ err = gve_adminq_alloc(&priv->pdev->dev, priv);
+ if (err) {
+ dev_err(&priv->pdev->dev,
+ "Failed to alloc admin queue: err=%d\n", err);
+ return err;
+ }
+
+ if (skip_describe_device)
+ goto setup_device;
+
+ priv->raw_addressing = false;
+ /* Get the initial information we need from the device */
+ err = gve_adminq_describe_device(priv);
+ if (err) {
+ dev_err(&priv->pdev->dev,
+ "Could not get device information: err=%d\n", err);
+ goto err;
+ }
+#if (LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0))
+ if (priv->max_mtu > PAGE_SIZE)
+#else /* LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0) */
+ if (priv->dev->max_mtu > PAGE_SIZE)
+#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0) */
+ {
+#if (LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0))
+ priv->max_mtu = PAGE_SIZE;
+#else /* LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0) */
+ priv->dev->max_mtu = PAGE_SIZE;
+#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0) */
+ err = gve_adminq_set_mtu(priv, priv->dev->mtu);
+ if (err) {
+ dev_err(&priv->pdev->dev, "Could not set mtu");
+ goto err;
+ }
+ }
+#if (LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0))
+ priv->dev->mtu = priv->max_mtu;
+#else /* LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0) */
+ priv->dev->mtu = priv->dev->max_mtu;
+#endif /* LINUX_VERSION_CODE < KERNEL_VERSION(4,10,0) */
+ num_ntfy = pci_msix_vec_count(priv->pdev);
+ if (num_ntfy <= 0) {
+ dev_err(&priv->pdev->dev,
+ "could not count MSI-x vectors: err=%d\n", num_ntfy);
+ err = num_ntfy;
+ goto err;
+ } else if (num_ntfy < GVE_MIN_MSIX) {
+ dev_err(&priv->pdev->dev, "gve needs at least %d MSI-x vectors, but only has %d\n",
+ GVE_MIN_MSIX, num_ntfy);
+ err = -EINVAL;
+ goto err;
+ }
+
+ priv->num_registered_pages = 0;
+ priv->rx_copybreak = GVE_DEFAULT_RX_COPYBREAK;
+ /* gvnic has one Notification Block per MSI-x vector, except for the
+ * management vector
+ */
+ priv->num_ntfy_blks = (num_ntfy - 1) & ~0x1;
+ priv->mgmt_msix_idx = priv->num_ntfy_blks;
+
+ priv->tx_cfg.max_queues =
+ min_t(int, priv->tx_cfg.max_queues, priv->num_ntfy_blks / 2);
+ priv->rx_cfg.max_queues =
+ min_t(int, priv->rx_cfg.max_queues, priv->num_ntfy_blks / 2);
+
+ priv->tx_cfg.num_queues = priv->tx_cfg.max_queues;
+ priv->rx_cfg.num_queues = priv->rx_cfg.max_queues;
+ if (priv->default_num_queues > 0) {
+ priv->tx_cfg.num_queues = min_t(int, priv->default_num_queues,
+ priv->tx_cfg.num_queues);
+ priv->rx_cfg.num_queues = min_t(int, priv->default_num_queues,
+ priv->rx_cfg.num_queues);
+ }
+
+ dev_info(&priv->pdev->dev, "TX queues %d, RX queues %d\n",
+ priv->tx_cfg.num_queues, priv->rx_cfg.num_queues);
+ dev_info(&priv->pdev->dev, "Max TX queues %d, Max RX queues %d\n",
+ priv->tx_cfg.max_queues, priv->rx_cfg.max_queues);
+
+setup_device:
+ err = gve_setup_device_resources(priv);
+ if (!err)
+ return 0;
+err:
+ gve_adminq_free(&priv->pdev->dev, priv);
+ return err;
+}
+
+static void gve_teardown_priv_resources(struct gve_priv *priv)
+{
+ gve_teardown_device_resources(priv);
+ gve_adminq_free(&priv->pdev->dev, priv);
+}
+
+static void gve_trigger_reset(struct gve_priv *priv)
+{
+ /* Reset the device by releasing the AQ */
+ gve_adminq_release(priv);
+}
+
+static void gve_reset_and_teardown(struct gve_priv *priv, bool was_up)
+{
+ gve_trigger_reset(priv);
+ /* With the reset having already happened, close cannot fail */
+ if (was_up)
+ gve_close(priv->dev);
+ gve_teardown_priv_resources(priv);
+}
+
+static int gve_reset_recovery(struct gve_priv *priv, bool was_up)
+{
+ int err;
+
+ err = gve_init_priv(priv, true);
+ if (err)
+ goto err;
+ if (was_up) {
+ err = gve_open(priv->dev);
+ if (err)
+ goto err;
+ }
+ return 0;
+err:
+ dev_err(&priv->pdev->dev, "Reset failed! !!! DISABLING ALL QUEUES !!!\n");
+ gve_turndown(priv);
+ return err;
+}
+
+int gve_reset(struct gve_priv *priv, bool attempt_teardown)
+{
+ bool was_up = netif_carrier_ok(priv->dev);
+ int err;
+
+ dev_info(&priv->pdev->dev, "Performing reset\n");
+ gve_clear_do_reset(priv);
+ gve_set_reset_in_progress(priv);
+ /* If we aren't attempting to teardown normally, just go turndown and
+ * reset right away.
+ */
+ if (!attempt_teardown) {
+ gve_turndown(priv);
+ gve_reset_and_teardown(priv, was_up);
+ } else {
+ /* Otherwise attempt to close normally */
+ if (was_up) {
+ err = gve_close(priv->dev);
+ /* If that fails reset as we did above */
+ if (err)
+ gve_reset_and_teardown(priv, was_up);
+ }
+ /* Clean up any remaining resources */
+ gve_teardown_priv_resources(priv);
+ }
+
+ /* Set it all back up */
+ err = gve_reset_recovery(priv, was_up);
+ gve_clear_reset_in_progress(priv);
+ priv->reset_cnt++;
+ priv->interface_up_cnt = 0;
+ priv->interface_down_cnt = 0;
+ return err;
+}
+
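+/* Write the driver version prefix and version string to the device one byte
+ * at a time through the version register, terminated by a newline.
+ */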
+static void gve_write_version(u8 __iomem *driver_version_register)
+{
+ const char *c = gve_version_prefix;
+
+ while (*c) {
+ writeb(*c, driver_version_register);
+ c++;
+ }
+
+ c = gve_version_str;
+ while (*c) {
+ writeb(*c, driver_version_register);
+ c++;
+ }
+ writeb('\n', driver_version_register);
+}
+
+static int gve_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
+{
+ int max_tx_queues, max_rx_queues;
+ struct net_device *dev;
+ __be32 __iomem *db_bar;
+ struct gve_registers __iomem *reg_bar;
+ struct gve_priv *priv;
+ u8 dma_mask;
+ int err;
+
+ err = pci_enable_device(pdev);
+ if (err)
+ return -ENXIO;
+
+ err = pci_request_regions(pdev, "gvnic-cfg");
+ if (err)
+ goto abort_with_enabled;
+
+ pci_set_master(pdev);
+
+ reg_bar = pci_iomap(pdev, GVE_REGISTER_BAR, 0);
+ if (!reg_bar) {
+ dev_err(&pdev->dev, "Failed to map pci bar!\n");
+ err = -ENOMEM;
+ goto abort_with_pci_region;
+ }
+
+ db_bar = pci_iomap(pdev, GVE_DOORBELL_BAR, 0);
+ if (!db_bar) {
+ dev_err(&pdev->dev, "Failed to map doorbell bar!\n");
+ err = -ENOMEM;
+ goto abort_with_reg_bar;
+ }
+
+ dma_mask = readb(&reg_bar->dma_mask);
+ /* Default to 64 if the register isn't set */
+ if (!dma_mask)
+ dma_mask = 64;
+ gve_write_version(&reg_bar->driver_version);
+ /* Get max queues to alloc etherdev */
+ max_tx_queues = ioread32be(&reg_bar->max_tx_queues);
+ max_rx_queues = ioread32be(&reg_bar->max_rx_queues);
+
+ err = pci_set_dma_mask(pdev, DMA_BIT_MASK(dma_mask));
+ if (err) {
+ dev_err(&pdev->dev, "Failed to set dma mask: err=%d\n", err);
+ goto abort_with_reg_bar;
+ }
+
+ err = pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(dma_mask));
+ if (err) {
+ dev_err(&pdev->dev,
+ "Failed to set consistent dma mask: err=%d\n", err);
+ goto abort_with_reg_bar;
+ }
+
+ /* Alloc and setup the netdev and priv */
+ dev = alloc_etherdev_mqs(sizeof(*priv), max_tx_queues, max_rx_queues);
+ if (!dev) {
+ dev_err(&pdev->dev, "could not allocate netdev\n");
+ goto abort_with_db_bar;
+ }
+ SET_NETDEV_DEV(dev, &pdev->dev);
+
+ pci_set_drvdata(pdev, dev);
+
+ dev->ethtool_ops = &gve_ethtool_ops;
+ dev->netdev_ops = &gve_netdev_ops;
+ /* advertise features */
+ dev->hw_features = NETIF_F_HIGHDMA;
+ dev->hw_features |= NETIF_F_SG;
+ dev->hw_features |= NETIF_F_HW_CSUM;
+ dev->hw_features |= NETIF_F_TSO;
+ dev->hw_features |= NETIF_F_TSO6;
+ dev->hw_features |= NETIF_F_TSO_ECN;
+ dev->hw_features |= NETIF_F_RXCSUM;
+ dev->hw_features |= NETIF_F_RXHASH;
+ dev->features = dev->hw_features;
+ dev->watchdog_timeo = 5 * HZ;
+#if (LINUX_VERSION_CODE >= KERNEL_VERSION(4,10,0))
+ dev->min_mtu = ETH_MIN_MTU;
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,10,0) */
+ netif_carrier_off(dev);
+
+ priv = netdev_priv(dev);
+ priv->dev = dev;
+ priv->pdev = pdev;
+ priv->msg_enable = DEFAULT_MSG_LEVEL;
+ priv->reg_bar0 = reg_bar;
+ priv->db_bar2 = db_bar;
+ priv->service_task_flags = 0x0;
+ priv->state_flags = 0x0;
+ priv->ethtool_flags = 0x0;
+ priv->dma_mask = dma_mask;
+
+ gve_set_probe_in_progress(priv);
+
+ priv->gve_wq = alloc_ordered_workqueue("gve", 0);
+ if (!priv->gve_wq) {
+ dev_err(&pdev->dev, "Could not allocate workqueue");
+ err = -ENOMEM;
+ goto abort_with_netdev;
+ }
+ INIT_WORK(&priv->service_task, gve_service_task);
+ priv->tx_cfg.max_queues = max_tx_queues;
+ priv->rx_cfg.max_queues = max_rx_queues;
+
+ err = gve_init_priv(priv, false);
+ if (err)
+ goto abort_with_wq;
+
+ err = register_netdev(dev);
+ if (err)
+ goto abort_with_wq;
+
+ dev_info(&pdev->dev, "GVE version %s\n", gve_version_str);
+ gve_clear_probe_in_progress(priv);
+ queue_work(priv->gve_wq, &priv->service_task);
+
+ return 0;
+
+abort_with_wq:
+ destroy_workqueue(priv->gve_wq);
+
+abort_with_netdev:
+ free_netdev(dev);
+
+abort_with_db_bar:
+ pci_iounmap(pdev, db_bar);
+
+abort_with_reg_bar:
+ pci_iounmap(pdev, reg_bar);
+
+abort_with_pci_region:
+ pci_release_regions(pdev);
+
+abort_with_enabled:
+ pci_disable_device(pdev);
+ return -ENXIO;
+}
+EXPORT_SYMBOL(gve_probe);
+
+static void gve_remove(struct pci_dev *pdev)
+{
+ struct net_device *netdev = pci_get_drvdata(pdev);
+ struct gve_priv *priv = netdev_priv(netdev);
+ __be32 __iomem *db_bar = priv->db_bar2;
+ void __iomem *reg_bar = priv->reg_bar0;
+
+ unregister_netdev(netdev);
+ gve_teardown_priv_resources(priv);
+ destroy_workqueue(priv->gve_wq);
+ free_netdev(netdev);
+ pci_iounmap(pdev, db_bar);
+ pci_iounmap(pdev, reg_bar);
+ pci_release_regions(pdev);
+ pci_disable_device(pdev);
+}
+
+static const struct pci_device_id gve_id_table[] = {
+ { PCI_DEVICE(PCI_VENDOR_ID_GOOGLE, PCI_DEV_ID_GVNIC) },
+ { }
+};
+
+static struct pci_driver gvnic_driver = {
+ .name = "gvnic",
+ .id_table = gve_id_table,
+ .probe = gve_probe,
+ .remove = gve_remove,
+};
+
+module_pci_driver(gvnic_driver);
+
+MODULE_DEVICE_TABLE(pci, gve_id_table);
+MODULE_AUTHOR("Google, Inc.");
+MODULE_DESCRIPTION("gVNIC Driver");
+MODULE_LICENSE("Dual MIT/GPL");
+MODULE_VERSION(GVE_VERSION);
diff --git a/drivers/net/ethernet/google/gve/gve_register.h b/drivers/net/ethernet/google/gve/gve_register.h
new file mode 100644
index 0000000..776c291
--- /dev/null
+++ b/drivers/net/ethernet/google/gve/gve_register.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR MIT)
+ * Google virtual Ethernet (gve) driver
+ *
+ * Copyright (C) 2015-2019 Google, Inc.
+ */
+
+#ifndef _GVE_REGISTER_H_
+#define _GVE_REGISTER_H_
+
+/* Fixed Configuration Registers */
+struct gve_registers {
+ __be32 device_status;
+ __be32 driver_status;
+ __be32 max_tx_queues;
+ __be32 max_rx_queues;
+ __be32 adminq_pfn;
+ __be32 adminq_doorbell;
+ __be32 adminq_event_counter;
+ u8 reserved[2];
+ u8 dma_mask;
+ u8 driver_version;
+};
+
+enum gve_device_status_flags {
+ GVE_DEVICE_STATUS_RESET_MASK = BIT(1),
+ GVE_DEVICE_STATUS_LINK_STATUS_MASK = BIT(2),
+ GVE_DEVICE_STATUS_REPORT_STATS_MASK = BIT(3),
+};
+#endif /* _GVE_REGISTER_H_ */
diff --git a/drivers/net/ethernet/google/gve/gve_rx.c b/drivers/net/ethernet/google/gve/gve_rx.c
new file mode 100644
index 0000000..302f443
--- /dev/null
+++ b/drivers/net/ethernet/google/gve/gve_rx.c
@@ -0,0 +1,690 @@
+// SPDX-License-Identifier: (GPL-2.0 OR MIT)
+/* Google virtual Ethernet (gve) driver
+ *
+ * Copyright (C) 2015-2019 Google, Inc.
+ */
+
+#include "gve_linux_version.h"
+#include "gve.h"
+#include "gve_adminq.h"
+#include <linux/etherdevice.h>
+
+static void gve_rx_remove_from_block(struct gve_priv *priv, int queue_idx)
+{
+ struct gve_notify_block *block =
+ &priv->ntfy_blocks[gve_rx_idx_to_ntfy(priv, queue_idx)];
+
+ block->rx = NULL;
+}
+
+static void gve_rx_free_buffer(struct device *dev,
+ struct gve_rx_slot_page_info *page_info,
+ struct gve_rx_data_slot *data_slot)
+{
+ dma_addr_t dma = (dma_addr_t)(be64_to_cpu(data_slot->addr) -
+ page_info->page_offset);
+
+ page_ref_sub(page_info->page, page_info->pagecnt_bias - 1);
+ gve_free_page(dev, page_info->page, dma, DMA_FROM_DEVICE);
+}
+
+static void gve_rx_free_ring(struct gve_priv *priv, int idx)
+{
+ struct gve_rx_ring *rx = &priv->rx[idx];
+ struct device *dev = &priv->pdev->dev;
+ size_t bytes;
+ u32 slots = rx->mask + 1;
+
+ gve_rx_remove_from_block(priv, idx);
+
+ bytes = sizeof(struct gve_rx_desc) * priv->rx_desc_cnt;
+ dma_free_coherent(dev, bytes, rx->desc.desc_ring, rx->desc.bus);
+ rx->desc.desc_ring = NULL;
+
+ dma_free_coherent(dev, sizeof(*rx->q_resources),
+ rx->q_resources, rx->q_resources_bus);
+ rx->q_resources = NULL;
+
+ if (rx->data.raw_addressing) {
+ int i;
+
+ for (i = 0; i < slots; i++)
+ gve_rx_free_buffer(dev, &rx->data.page_info[i],
+ &rx->data.data_ring[i]);
+ } else {
+ gve_unassign_qpl(priv, rx->data.qpl->id);
+ rx->data.qpl = NULL;
+ }
+ kfree(rx->data.page_info);
+
+ bytes = sizeof(*rx->data.data_ring) * slots;
+ dma_free_coherent(dev, bytes, rx->data.data_ring,
+ rx->data.data_bus);
+ rx->data.data_ring = NULL;
+ netif_dbg(priv, drv, priv->dev, "freed rx ring %d\n", idx);
+}
+
+static void gve_setup_rx_buffer(struct gve_rx_slot_page_info *page_info,
+ struct gve_rx_data_slot *slot,
+ dma_addr_t addr, struct page *page)
+{
+ page_info->page = page;
+ page_info->page_offset = 0;
+ page_info->page_address = page_address(page);
+ slot->addr = cpu_to_be64(addr);
+ /* The page already has 1 ref */
+ page_ref_add(page, INT_MAX - 1);
+ page_info->pagecnt_bias = INT_MAX;
+}
+
+static int gve_prefill_rx_pages(struct gve_rx_ring *rx)
+{
+ struct gve_priv *priv = rx->gve;
+ u32 slots;
+ int err;
+ int i;
+
+ /* Allocate one page per Rx queue slot. Each page is split into two
+ * packet buffers, when possible we "page flip" between the two.
+ */
+ slots = rx->mask + 1;
+
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0)
+ rx->data.page_info = kvzalloc(slots *
+ sizeof(*rx->data.page_info), GFP_KERNEL);
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ rx->data.page_info = kcalloc(slots, sizeof(*rx->data.page_info),
+ GFP_KERNEL);
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(4,12,0) */
+ if (!rx->data.page_info)
+ return -ENOMEM;
+
+ if (!rx->data.raw_addressing)
+ rx->data.qpl = gve_assign_rx_qpl(priv);
+ for (i = 0; i < slots; i++) {
+ struct page *page;
+ dma_addr_t addr;
+
+ if (rx->data.raw_addressing) {
+ err = gve_alloc_page(priv, &priv->pdev->dev, &page,
+ &addr, DMA_FROM_DEVICE,
+ GFP_KERNEL);
+ if (err) {
+ int j;
+
+ u64_stats_update_begin(&rx->statss);
+ rx->rx_buf_alloc_fail++;
+ u64_stats_update_end(&rx->statss);
+ for (j = 0; j < i; j++)
+ gve_free_page(&priv->pdev->dev, page,
+ addr, DMA_FROM_DEVICE);
+ return err;
+ }
+ } else {
+ page = rx->data.qpl->pages[i];
+ addr = i * PAGE_SIZE;
+ }
+ gve_setup_rx_buffer(&rx->data.page_info[i],
+ &rx->data.data_ring[i], addr, page);
+ }
+
+ return slots;
+}
+
+static void gve_rx_add_to_block(struct gve_priv *priv, int queue_idx)
+{
+ u32 ntfy_idx = gve_rx_idx_to_ntfy(priv, queue_idx);
+ struct gve_notify_block *block = &priv->ntfy_blocks[ntfy_idx];
+ struct gve_rx_ring *rx = &priv->rx[queue_idx];
+
+ block->rx = rx;
+ rx->ntfy_id = ntfy_idx;
+}
+
+static int gve_rx_alloc_ring(struct gve_priv *priv, int idx)
+{
+ struct gve_rx_ring *rx = &priv->rx[idx];
+ struct device *hdev = &priv->pdev->dev;
+ u32 slots, npages;
+ int filled_pages;
+ size_t bytes;
+ int err;
+
+ netif_dbg(priv, drv, priv->dev, "allocating rx ring\n");
+ /* Make sure everything is zeroed to start with */
+ memset(rx, 0, sizeof(*rx));
+
+ rx->gve = priv;
+ rx->q_num = idx;
+
+ slots = priv->rx_data_slot_cnt;
+ rx->mask = slots - 1;
+ rx->data.raw_addressing = priv->raw_addressing;
+
+ /* alloc rx data ring */
+ bytes = sizeof(*rx->data.data_ring) * slots;
+ rx->data.data_ring = dma_alloc_coherent(hdev, bytes,
+ &rx->data.data_bus,
+ GFP_KERNEL);
+ if (!rx->data.data_ring)
+ return -ENOMEM;
+ filled_pages = gve_prefill_rx_pages(rx);
+ if (filled_pages < 0) {
+ err = -ENOMEM;
+ goto abort_with_slots;
+ }
+ rx->fill_cnt = filled_pages;
+ /* Ensure data ring slots (packet buffers) are visible. */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0)
+ dma_wmb();
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0) */
+ wmb();
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0) */
+
+ /* Alloc gve_queue_resources */
+ rx->q_resources =
+ dma_alloc_coherent(hdev,
+ sizeof(*rx->q_resources),
+ &rx->q_resources_bus,
+ GFP_KERNEL);
+ if (!rx->q_resources) {
+ err = -ENOMEM;
+ goto abort_filled;
+ }
+ netif_dbg(priv, drv, priv->dev, "rx[%d]->data.data_bus=%lx\n", idx,
+ (unsigned long)rx->data.data_bus);
+
+ /* alloc rx desc ring */
+ bytes = sizeof(struct gve_rx_desc) * priv->rx_desc_cnt;
+ npages = bytes / PAGE_SIZE;
+ if (npages * PAGE_SIZE != bytes) {
+ err = -EIO;
+ goto abort_with_q_resources;
+ }
+
+ rx->desc.desc_ring = dma_alloc_coherent(hdev, bytes, &rx->desc.bus,
+ GFP_KERNEL);
+ if (!rx->desc.desc_ring) {
+ err = -ENOMEM;
+ goto abort_with_q_resources;
+ }
+ rx->cnt = 0;
+ rx->db_threshold = priv->rx_desc_cnt / 2;
+ rx->desc.seqno = 1;
+ gve_rx_add_to_block(priv, idx);
+
+ return 0;
+
+abort_with_q_resources:
+ dma_free_coherent(hdev, sizeof(*rx->q_resources),
+ rx->q_resources, rx->q_resources_bus);
+ rx->q_resources = NULL;
+abort_filled:
+ kfree(rx->data.page_info);
+abort_with_slots:
+ bytes = sizeof(*rx->data.data_ring) * slots;
+ dma_free_coherent(hdev, bytes, rx->data.data_ring, rx->data.data_bus);
+ rx->data.data_ring = NULL;
+
+ return err;
+}
+
+int gve_rx_alloc_rings(struct gve_priv *priv)
+{
+ int err = 0;
+ int i;
+
+ for (i = 0; i < priv->rx_cfg.num_queues; i++) {
+ err = gve_rx_alloc_ring(priv, i);
+ if (err) {
+ netif_err(priv, drv, priv->dev,
+ "Failed to alloc rx ring=%d: err=%d\n",
+ i, err);
+ break;
+ }
+ }
+ /* Unallocate if there was an error */
+ if (err) {
+ int j;
+
+ for (j = 0; j < i; j++)
+ gve_rx_free_ring(priv, j);
+ }
+ return err;
+}
+
+void gve_rx_free_rings(struct gve_priv *priv)
+{
+ int i;
+
+ for (i = 0; i < priv->rx_cfg.num_queues; i++)
+ gve_rx_free_ring(priv, i);
+}
+
+void gve_rx_write_doorbell(struct gve_priv *priv, struct gve_rx_ring *rx)
+{
+ u32 db_idx = be32_to_cpu(rx->q_resources->db_index);
+
+ iowrite32be(rx->fill_cnt, &priv->db_bar2[db_idx]);
+}
+
+#if RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(7, 0) || LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0)
+static enum pkt_hash_types gve_rss_type(__be16 pkt_flags)
+{
+ if (likely(pkt_flags & (GVE_RXF_TCP | GVE_RXF_UDP)))
+ return PKT_HASH_TYPE_L4;
+ if (pkt_flags & (GVE_RXF_IPV4 | GVE_RXF_IPV6))
+ return PKT_HASH_TYPE_L3;
+ return PKT_HASH_TYPE_L2;
+}
+#endif /* RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(7, 0) || LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0) */
+
+static struct sk_buff *gve_rx_copy(struct net_device *dev,
+ struct napi_struct *napi,
+ struct gve_rx_slot_page_info *page_info,
+ u16 len)
+{
+ struct sk_buff *skb = napi_alloc_skb(napi, len);
+ void *va = page_info->page_address + GVE_RX_PAD +
+ page_info->page_offset;
+
+ if (unlikely(!skb))
+ return NULL;
+
+ __skb_put(skb, len);
+
+ skb_copy_to_linear_data(skb, va, len);
+
+ skb->protocol = eth_type_trans(skb, dev);
+
+ return skb;
+}
+
+static struct sk_buff *gve_rx_add_frags(struct napi_struct *napi,
+ struct gve_rx_slot_page_info *page_info,
+ u16 len)
+{
+ struct sk_buff *skb = napi_get_frags(napi);
+
+ if (unlikely(!skb))
+ return NULL;
+
+ skb_add_rx_frag(skb, 0, page_info->page,
+ page_info->page_offset +
+ GVE_RX_PAD, len, PAGE_SIZE / 2);
+
+ return skb;
+}
+
+static int gve_rx_alloc_buffer(struct gve_priv *priv, struct device *dev,
+ struct gve_rx_slot_page_info *page_info,
+ struct gve_rx_data_slot *data_slot,
+ struct gve_rx_ring *rx)
+{
+ struct page *page;
+ dma_addr_t dma;
+ int err;
+
+ err = gve_alloc_page(priv, dev, &page, &dma, DMA_FROM_DEVICE,
+ GFP_ATOMIC);
+ if (err) {
+ u64_stats_update_begin(&rx->statss);
+ rx->rx_buf_alloc_fail++;
+ u64_stats_update_end(&rx->statss);
+ return err;
+ }
+
+ gve_setup_rx_buffer(page_info, data_slot, dma, page);
+ return 0;
+}
+
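+/* Each page backs two packet buffers; toggle page_offset and the slot's DMA
+ * address between the two halves of the page.
+ */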
+static void gve_rx_flip_buffer(struct gve_rx_slot_page_info *page_info,
+ struct gve_rx_data_slot *data_slot)
+{
+ u64 addr = be64_to_cpu(data_slot->addr);
+
+ /* "flip" to other packet buffer on this page */
+ page_info->page_offset ^= PAGE_SIZE / 2;
+ addr ^= PAGE_SIZE / 2;
+ data_slot->addr = cpu_to_be64(addr);
+}
+
+static bool gve_rx_can_flip_buffers(struct net_device *netdev)
+{
+#if PAGE_SIZE == 4096
+ /* We can't flip a buffer if we can't fit a packet
+ * into half a page.
+ */
+ if (netdev->max_mtu + GVE_RX_PAD + ETH_HLEN > PAGE_SIZE / 2)
+ return false;
+ return true;
+#else
+ /* PAGE_SIZE != 4096 - don't try to reuse */
+ return false;
+#endif
+}
+
+static int gve_rx_can_recycle_buffer(struct gve_rx_slot_page_info *page_info)
+{
+ int pagecount = page_count(page_info->page);
+
+ /* This page is not being used by any SKBs - reuse */
+ if (pagecount == page_info->pagecnt_bias) {
+ return 1;
+ /* This page is still being used by an SKB - we can't reuse */
+ } else if (pagecount > page_info->pagecnt_bias) {
+ return 0;
+ } else {
+ WARN(pagecount < page_info->pagecnt_bias,
+ "Pagecount should never be less than the bias.");
+ return -1;
+ }
+}
+
+static void gve_rx_update_pagecnt_bias(struct gve_rx_slot_page_info *page_info)
+{
+ page_info->pagecnt_bias--;
+ if (page_info->pagecnt_bias == 0) {
+ int pagecount = page_count(page_info->page);
+
+ /* If we have run out of bias - set it back up to INT_MAX
+ * minus the existing refs.
+ */
+ page_info->pagecnt_bias = INT_MAX - (pagecount);
+ /* Set pagecount back up to max */
+ page_ref_add(page_info->page, INT_MAX - pagecount);
+ }
+}
+
+static struct sk_buff *
+gve_rx_raw_addressing(struct device *dev, struct net_device *netdev,
+ struct gve_rx_slot_page_info *page_info, u16 len,
+ struct napi_struct *napi,
+ struct gve_rx_data_slot *data_slot, bool can_flip)
+{
+ struct sk_buff *skb = gve_rx_add_frags(napi, page_info, len);
+
+ if (!skb)
+ return NULL;
+
+ /* Optimistically stop the kernel from freeing the page.
+ * We will check again in refill to determine if we need to alloc a
+ * new page.
+ */
+ gve_rx_update_pagecnt_bias(page_info);
+ page_info->can_flip = can_flip;
+
+ return skb;
+}
+
+static struct sk_buff *
+gve_rx_qpl(struct device *dev, struct net_device *netdev,
+ struct gve_rx_ring *rx, struct gve_rx_slot_page_info *page_info,
+ u16 len, struct napi_struct *napi,
+ struct gve_rx_data_slot *data_slot, bool recycle)
+{
+ struct sk_buff *skb;
+ /* if raw_addressing mode is not enabled gvnic can only receive into
+ * registered segments. If the buffer can't be recycled, our only
+ * choice is to copy the data out of it so that we can return it to the
+ * device.
+ */
+ if (recycle) {
+ skb = gve_rx_add_frags(napi, page_info, len);
+ /* No point in recycling if we didn't get the skb */
+ if (skb) {
+ /* Make sure the networking stack can't free the page */
+ gve_rx_update_pagecnt_bias(page_info);
+ gve_rx_flip_buffer(page_info, data_slot);
+ }
+ } else {
+ skb = gve_rx_copy(netdev, napi, page_info, len);
+ if (skb) {
+ u64_stats_update_begin(&rx->statss);
+ rx->rx_copied_pkt++;
+ u64_stats_update_end(&rx->statss);
+ }
+ }
+ return skb;
+}
+
+static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
+ netdev_features_t feat, u32 idx)
+{
+ struct gve_rx_slot_page_info *page_info;
+ struct gve_priv *priv = rx->gve;
+ struct napi_struct *napi = &priv->ntfy_blocks[rx->ntfy_id].napi;
+ struct net_device *netdev = priv->dev;
+ struct gve_rx_data_slot *data_slot;
+ struct sk_buff *skb = NULL;
+ dma_addr_t page_bus;
+ u16 len;
+
+ /* drop this packet */
+ if (unlikely(rx_desc->flags_seq & GVE_RXF_ERR)) {
+ u64_stats_update_begin(&rx->statss);
+ rx->rx_desc_err_dropped_pkt++;
+ u64_stats_update_end(&rx->statss);
+ return false;
+ }
+
+ len = be16_to_cpu(rx_desc->len) - GVE_RX_PAD;
+ page_info = &rx->data.page_info[idx];
+ data_slot = &rx->data.data_ring[idx];
+ page_bus = (rx->data.raw_addressing) ?
+ be64_to_cpu(data_slot->addr) - page_info->page_offset :
+ rx->data.qpl->page_buses[idx];
+ dma_sync_single_for_cpu(&priv->pdev->dev, page_bus,
+ PAGE_SIZE, DMA_FROM_DEVICE);
+
+ if (len <= priv->rx_copybreak) {
+ /* Just copy small packets */
+ skb = gve_rx_copy(netdev, napi, page_info, len);
+ if (skb) {
+ u64_stats_update_begin(&rx->statss);
+ rx->rx_copied_pkt++;
+ rx->rx_copybreak_pkt++;
+ u64_stats_update_end(&rx->statss);
+ }
+ } else {
+ bool can_flip = gve_rx_can_flip_buffers(netdev);
+ int recycle = 0;
+
+ if (can_flip) {
+ recycle = gve_rx_can_recycle_buffer(page_info);
+ if (recycle < 0) {
+ gve_schedule_reset(priv);
+ return false;
+ }
+ }
+ if (rx->data.raw_addressing) {
+ skb = gve_rx_raw_addressing(&priv->pdev->dev, netdev,
+ page_info, len, napi,
+ data_slot,
+ can_flip && recycle);
+ } else {
+ skb = gve_rx_qpl(&priv->pdev->dev, netdev, rx,
+ page_info, len, napi, data_slot,
+ can_flip && recycle);
+ }
+ }
+
+ if (!skb) {
+ u64_stats_update_begin(&rx->statss);
+ rx->rx_skb_alloc_fail++;
+ u64_stats_update_end(&rx->statss);
+ return false;
+ }
+
+ if (likely(feat & NETIF_F_RXCSUM)) {
+ /* NIC passes up the partial sum */
+ if (rx_desc->csum)
+ skb->ip_summed = CHECKSUM_COMPLETE;
+ else
+ skb->ip_summed = CHECKSUM_NONE;
+ skb->csum = csum_unfold(rx_desc->csum);
+ }
+
+ /* parse flags & pass relevant info up */
+ if (likely(feat & NETIF_F_RXHASH) &&
+ gve_needs_rss(rx_desc->flags_seq)) {
+#if RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(7, 0) || LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0)
+ skb_set_hash(skb, be32_to_cpu(rx_desc->rss_hash),
+ gve_rss_type(rx_desc->flags_seq));
+#else /* RHEL_RELEASE_CODE < RHEL_RELEASE_VERSION(7, 0) && LINUX_VERSION_CODE < KERNEL_VERSION(3,14,0) */
+ skb->rxhash = be32_to_cpu(rx_desc->rss_hash);
+ skb->l4_rxhash = !!(rx_desc->flags_seq & (GVE_RXF_TCP | GVE_RXF_UDP));
+#endif /* RHEL_RELEASE_CODE >= RHEL_RELEASE_VERSION(7, 0) || LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0) */
+ }
+
+ if (skb_is_nonlinear(skb))
+ napi_gro_frags(napi);
+ else
+ napi_gro_receive(napi, skb);
+ return true;
+}
+
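+/* Peek at the next descriptor's sequence number to see whether the device
+ * has completed more work since the last pass.
+ */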
+static bool gve_rx_work_pending(struct gve_rx_ring *rx)
+{
+ struct gve_rx_desc *desc;
+ __be16 flags_seq;
+ u32 next_idx;
+
+ next_idx = rx->cnt & rx->mask;
+ desc = rx->desc.desc_ring + next_idx;
+
+ /* make sure we have synchronized the seq no with the device */
+ smp_mb();
+ flags_seq = desc->flags_seq;
+ return (GVE_SEQNO(flags_seq) == rx->desc.seqno);
+}
+
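+/* Repost a buffer for each slot between fill_cnt and cnt: flip to the free
+ * half of the page when possible, reuse the page if the stack has released
+ * it, otherwise free it and allocate a fresh page.
+ */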
+static bool gve_rx_refill_buffers(struct gve_priv *priv, struct gve_rx_ring *rx)
+{
+ u32 fill_cnt = rx->fill_cnt;
+
+ while ((fill_cnt & rx->mask) != (rx->cnt & rx->mask)) {
+ u32 idx = fill_cnt & rx->mask;
+ struct gve_rx_slot_page_info *page_info =
+ &rx->data.page_info[idx];
+
+ if (page_info->can_flip) {
+ /* The other half of the page is free because it was
+ * free when we processed the descriptor. Flip to it.
+ */
+ struct gve_rx_data_slot *data_slot =
+ &rx->data.data_ring[idx];
+
+ gve_rx_flip_buffer(page_info, data_slot);
+ } else {
+ /* It is possible that the networking stack has already
+ * finished processing all outstanding packets in the buffer
+ * and it can be reused.
+ * Flipping is unnecessary here - if the networking stack still
+ * owns half the page it is impossible to tell which half. Either
+ * the whole page is free or it needs to be replaced.
+ */
+ int recycle = gve_rx_can_recycle_buffer(page_info);
+
+ if (recycle < 0) {
+ gve_schedule_reset(priv);
+ return false;
+ }
+ if (!recycle) {
+ /* We can't reuse the buffer - alloc a new one */
+ struct gve_rx_data_slot *data_slot =
+ &rx->data.data_ring[idx];
+ struct device *dev = &priv->pdev->dev;
+
+ gve_rx_free_buffer(dev, page_info, data_slot);
+ page_info->page = NULL;
+ if (gve_rx_alloc_buffer(priv, dev, page_info,
+ data_slot, rx)) {
+ break;
+ }
+ }
+ }
+ fill_cnt++;
+ }
+ rx->fill_cnt = fill_cnt;
+ return true;
+}
+
+bool gve_clean_rx_done(struct gve_rx_ring *rx, int budget,
+ netdev_features_t feat)
+{
+ struct gve_priv *priv = rx->gve;
+ u32 work_done = 0, packets = 0;
+ struct gve_rx_desc *desc;
+ u32 cnt = rx->cnt;
+ u32 idx = cnt & rx->mask;
+ u64 bytes = 0;
+
+ desc = rx->desc.desc_ring + idx;
+ while ((GVE_SEQNO(desc->flags_seq) == rx->desc.seqno) &&
+ work_done < budget) {
+ bool dropped;
+ netif_info(priv, rx_status, priv->dev,
+ "[%d] idx=%d desc=%p desc->flags_seq=0x%x\n",
+ rx->q_num, idx, desc, desc->flags_seq);
+ netif_info(priv, rx_status, priv->dev,
+ "[%d] seqno=%d rx->desc.seqno=%d\n",
+ rx->q_num, GVE_SEQNO(desc->flags_seq),
+ rx->desc.seqno);
+ dropped = !gve_rx(rx, desc, feat, idx);
+ if (!dropped) {
+ bytes += be16_to_cpu(desc->len) - GVE_RX_PAD;
+ packets++;
+ }
+ cnt++;
+ idx = cnt & rx->mask;
+ desc = rx->desc.desc_ring + idx;
+ rx->desc.seqno = gve_next_seqno(rx->desc.seqno);
+ work_done++;
+ }
+
+ if (!work_done)
+ return false;
+
+ u64_stats_update_begin(&rx->statss);
+ rx->rpackets += packets;
+ rx->rbytes += bytes;
+ u64_stats_update_end(&rx->statss);
+ rx->cnt = cnt;
+
+ /* restock ring slots */
+ if (!rx->data.raw_addressing) {
+ /* In QPL mode buffers are refilled as the descriptors are processed */
+ rx->fill_cnt += work_done;
+ dma_wmb(); /* Ensure descs are visible before ringing doorbell */
+ gve_rx_write_doorbell(priv, rx);
+ } else if (rx->fill_cnt - cnt <= rx->db_threshold) {
+ /* In raw addressing mode buffers are only refilled if the
+ * available count falls below a threshold.
+ */
+ if (!gve_rx_refill_buffers(priv, rx))
+ return false;
+ /* restock desc ring slots */
+ dma_wmb(); /* Ensure descs are visible before ringing doorbell */
+ gve_rx_write_doorbell(priv, rx);
+ }
+
+ return gve_rx_work_pending(rx);
+}
+
+bool gve_rx_poll(struct gve_notify_block *block, int budget)
+{
+ struct gve_rx_ring *rx = block->rx;
+ netdev_features_t feat;
+ bool repoll = false;
+
+ feat = block->napi.dev->features;
+
+ /* If budget is 0, do all the work */
+ if (budget == 0)
+ budget = INT_MAX;
+
+ if (budget > 0)
+ repoll |= gve_clean_rx_done(rx, budget, feat);
+ else
+ repoll |= gve_rx_work_pending(rx);
+ return repoll;
+}
diff --git a/drivers/net/ethernet/google/gve/gve_size_assert.h b/drivers/net/ethernet/google/gve/gve_size_assert.h
new file mode 100644
index 0000000..a6a238e
--- /dev/null
+++ b/drivers/net/ethernet/google/gve/gve_size_assert.h
@@ -0,0 +1,10 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR MIT)
+ * Google virtual Ethernet (gve) driver
+ *
+ * Copyright (C) 2015-2019 Google, Inc.
+ */
+
+#ifndef _GVE_ASSERT_H_
+#define _GVE_ASSERT_H_
+#define static_assert(expr, ...) _Static_assert(expr, #expr)
+#endif /* _GVE_ASSERT_H_ */
diff --git a/drivers/net/ethernet/google/gve/gve_tx.c b/drivers/net/ethernet/google/gve/gve_tx.c
new file mode 100644
index 0000000..cf66eb4
--- /dev/null
+++ b/drivers/net/ethernet/google/gve/gve_tx.c
@@ -0,0 +1,772 @@
+// SPDX-License-Identifier: (GPL-2.0 OR MIT)
+/* Google virtual Ethernet (gve) driver
+ *
+ * Copyright (C) 2015-2019 Google, Inc.
+ */
+
+#include "gve_linux_version.h"
+#include "gve.h"
+#include "gve_adminq.h"
+#include <linux/ip.h>
+#include <linux/tcp.h>
+#include <linux/vmalloc.h>
+#include <linux/skbuff.h>
+
+static inline void gve_tx_put_doorbell(struct gve_priv *priv,
+ struct gve_queue_resources *q_resources,
+ u32 val)
+{
+ iowrite32be(val, &priv->db_bar2[be32_to_cpu(q_resources->db_index)]);
+}
+
+/* gvnic can only transmit from a Registered Segment.
+ * We copy skb payloads into the registered segment before writing Tx
+ * descriptors and ringing the Tx doorbell.
+ *
+ * gve_tx_fifo_* manages the Registered Segment as a FIFO - clients must
+ * free allocations in the order they were allocated.
+ */
+
+static int gve_tx_fifo_init(struct gve_priv *priv, struct gve_tx_fifo *fifo)
+{
+ fifo->base = vmap(fifo->qpl->pages, fifo->qpl->num_entries, VM_MAP,
+ PAGE_KERNEL);
+ if (unlikely(!fifo->base)) {
+ netif_err(priv, drv, priv->dev, "Failed to vmap fifo, qpl_id = %d\n",
+ fifo->qpl->id);
+ return -ENOMEM;
+ }
+
+ fifo->size = fifo->qpl->num_entries * PAGE_SIZE;
+ atomic_set(&fifo->available, fifo->size);
+ fifo->head = 0;
+ return 0;
+}
+
+static void gve_tx_fifo_release(struct gve_priv *priv, struct gve_tx_fifo *fifo)
+{
+ WARN(atomic_read(&fifo->available) != fifo->size,
+ "Releasing non-empty fifo");
+
+ vunmap(fifo->base);
+}
+
+static int gve_tx_fifo_pad_alloc_one_frag(struct gve_tx_fifo *fifo,
+ size_t bytes)
+{
+ return (fifo->head + bytes < fifo->size) ? 0 : fifo->size - fifo->head;
+}
+
+static bool gve_tx_fifo_can_alloc(struct gve_tx_fifo *fifo, size_t bytes)
+{
+ return (atomic_read(&fifo->available) <= bytes) ? false : true;
+}
+
+/* gve_tx_alloc_fifo - Allocate fragment(s) from Tx FIFO
+ * @fifo: FIFO to allocate from
+ * @bytes: Allocation size
+ * @iov: Scatter-gather elements to fill with allocation fragment base/len
+ *
+ * Returns number of valid elements in iov[] or negative on error.
+ *
+ * Allocations from a given FIFO must be externally synchronized but concurrent
+ * allocation and frees are allowed.
+ */
+static int gve_tx_alloc_fifo(struct gve_tx_fifo *fifo, size_t bytes,
+ struct gve_tx_iovec iov[2])
+{
+ size_t overflow, padding;
+ u32 aligned_head;
+ int nfrags = 0;
+
+ if (!bytes)
+ return 0;
+
+ /* This check happens before we know how much padding is needed to
+ * align to a cacheline boundary for the payload, but that is fine,
+ * because the FIFO head always starts aligned, and the FIFO's boundaries
+ * are aligned, so if there is space for the data, there is space for
+ * the padding to the next alignment.
+ */
+ WARN(!gve_tx_fifo_can_alloc(fifo, bytes),
+ "Reached %s when there's not enough space in the fifo", __func__);
+
+ nfrags++;
+
+ iov[0].iov_offset = fifo->head;
+ iov[0].iov_len = bytes;
+ fifo->head += bytes;
+
+ if (fifo->head > fifo->size) {
+ /* If the allocation did not fit in the tail fragment of the
+ * FIFO, also use the head fragment.
+ */
+ nfrags++;
+ overflow = fifo->head - fifo->size;
+ iov[0].iov_len -= overflow;
+ iov[1].iov_offset = 0; /* Start of fifo */
+ iov[1].iov_len = overflow;
+
+ fifo->head = overflow;
+ }
+
+ /* Re-align to a cacheline boundary */
+ aligned_head = L1_CACHE_ALIGN(fifo->head);
+ padding = aligned_head - fifo->head;
+ iov[nfrags - 1].iov_padding = padding;
+ atomic_sub(bytes + padding, &fifo->available);
+ fifo->head = aligned_head;
+
+ if (fifo->head == fifo->size)
+ fifo->head = 0;
+
+ return nfrags;
+}
+
+/* gve_tx_free_fifo - Return space to Tx FIFO
+ * @fifo: FIFO to return fragments to
+ * @bytes: Bytes to free
+ */
+static void gve_tx_free_fifo(struct gve_tx_fifo *fifo, size_t bytes)
+{
+ atomic_add(bytes, &fifo->available);
+}
+
+static void gve_tx_remove_from_block(struct gve_priv *priv, int queue_idx)
+{
+ struct gve_notify_block *block =
+ &priv->ntfy_blocks[gve_tx_idx_to_ntfy(priv, queue_idx)];
+
+ block->tx = NULL;
+}
+
+static int gve_clean_tx_done(struct gve_priv *priv, struct gve_tx_ring *tx,
+ u32 to_do, bool try_to_wake);
+
+static void gve_tx_free_ring(struct gve_priv *priv, int idx)
+{
+ struct gve_tx_ring *tx = &priv->tx[idx];
+ struct device *hdev = &priv->pdev->dev;
+ size_t bytes;
+ u32 slots;
+
+ gve_tx_remove_from_block(priv, idx);
+ slots = tx->mask + 1;
+ gve_clean_tx_done(priv, tx, tx->req, false);
+ netdev_tx_reset_queue(tx->netdev_txq);
+
+ dma_free_coherent(hdev, sizeof(*tx->q_resources),
+ tx->q_resources, tx->q_resources_bus);
+ tx->q_resources = NULL;
+
+ if (!tx->raw_addressing) {
+ gve_tx_fifo_release(priv, &tx->tx_fifo);
+ gve_unassign_qpl(priv, tx->tx_fifo.qpl->id);
+ tx->tx_fifo.qpl = NULL;
+ }
+
+ bytes = sizeof(*tx->desc) * slots;
+ dma_free_coherent(hdev, bytes, tx->desc, tx->bus);
+ tx->desc = NULL;
+
+ vfree(tx->info);
+ tx->info = NULL;
+
+ netif_dbg(priv, drv, priv->dev, "freed tx queue %d\n", idx);
+}
+
+static void gve_tx_add_to_block(struct gve_priv *priv, int queue_idx)
+{
+ unsigned int active_cpus = min_t(int, priv->num_ntfy_blks / 2,
+ num_online_cpus());
+ int ntfy_idx = gve_tx_idx_to_ntfy(priv, queue_idx);
+ struct gve_notify_block *block = &priv->ntfy_blocks[ntfy_idx];
+ struct gve_tx_ring *tx = &priv->tx[queue_idx];
+
+ block->tx = tx;
+ tx->ntfy_id = ntfy_idx;
+ netif_set_xps_queue(priv->dev, get_cpu_mask(ntfy_idx % active_cpus),
+ queue_idx);
+}
+
+static int gve_tx_alloc_ring(struct gve_priv *priv, int idx)
+{
+ struct gve_tx_ring *tx = &priv->tx[idx];
+ struct device *hdev = &priv->pdev->dev;
+ u32 slots = priv->tx_desc_cnt;
+ size_t bytes;
+
+ /* Make sure everything is zeroed to start */
+ memset(tx, 0, sizeof(*tx));
+ tx->q_num = idx;
+
+ tx->mask = slots - 1;
+
+ /* alloc metadata */
+ tx->info = vzalloc(sizeof(*tx->info) * slots);
+ if (!tx->info)
+ return -ENOMEM;
+
+ /* alloc tx queue */
+ bytes = sizeof(*tx->desc) * slots;
+ tx->desc = dma_alloc_coherent(hdev, bytes, &tx->bus, GFP_KERNEL);
+ if (!tx->desc)
+ goto abort_with_info;
+
+ tx->raw_addressing = priv->raw_addressing;
+ tx->dev = &priv->pdev->dev;
+ if (!tx->raw_addressing) {
+ tx->tx_fifo.qpl = gve_assign_tx_qpl(priv);
+
+ /* map Tx FIFO */
+ if (gve_tx_fifo_init(priv, &tx->tx_fifo))
+ goto abort_with_desc;
+ }
+
+ tx->q_resources =
+ dma_alloc_coherent(hdev,
+ sizeof(*tx->q_resources),
+ &tx->q_resources_bus,
+ GFP_KERNEL);
+ if (!tx->q_resources)
+ goto abort_with_fifo;
+
+ netif_dbg(priv, drv, priv->dev, "tx[%d]->bus=%lx\n", idx,
+ (unsigned long)tx->bus);
+ tx->netdev_txq = netdev_get_tx_queue(priv->dev, idx);
+ gve_tx_add_to_block(priv, idx);
+
+ return 0;
+
+abort_with_fifo:
+ if (!tx->raw_addressing)
+ gve_tx_fifo_release(priv, &tx->tx_fifo);
+abort_with_desc:
+ dma_free_coherent(hdev, bytes, tx->desc, tx->bus);
+ tx->desc = NULL;
+abort_with_info:
+ vfree(tx->info);
+ tx->info = NULL;
+ return -ENOMEM;
+}
+
+int gve_tx_alloc_rings(struct gve_priv *priv)
+{
+ int err = 0;
+ int i;
+
+ for (i = 0; i < priv->tx_cfg.num_queues; i++) {
+ err = gve_tx_alloc_ring(priv, i);
+ if (err) {
+ netif_err(priv, drv, priv->dev,
+ "Failed to alloc tx ring=%d: err=%d\n",
+ i, err);
+ break;
+ }
+ }
+ /* Unallocate if there was an error */
+ if (err) {
+ int j;
+
+ for (j = 0; j < i; j++)
+ gve_tx_free_ring(priv, j);
+ }
+ return err;
+}
+
+void gve_tx_free_rings(struct gve_priv *priv)
+{
+ int i;
+
+ for (i = 0; i < priv->tx_cfg.num_queues; i++)
+ gve_tx_free_ring(priv, i);
+}
+
+/* gve_tx_avail - Calculates the number of slots available in the ring
+ * @tx: tx ring to check
+ *
+ * Returns the number of slots available
+ *
+ * The capacity of the queue is mask + 1. We don't need to reserve an entry.
+ **/
+static inline u32 gve_tx_avail(struct gve_tx_ring *tx)
+{
+ return tx->mask + 1 - (tx->req - tx->done);
+}
+
+static inline int gve_skb_fifo_bytes_required(struct gve_tx_ring *tx,
+ struct sk_buff *skb)
+{
+ int pad_bytes, align_hdr_pad;
+ int bytes;
+ int hlen;
+
+ hlen = skb_is_gso(skb) ? skb_checksum_start_offset(skb) +
+ tcp_hdrlen(skb) : skb_headlen(skb);
+
+ pad_bytes = gve_tx_fifo_pad_alloc_one_frag(&tx->tx_fifo,
+ hlen);
+ /* We need to take into account the header alignment padding. */
+ align_hdr_pad = L1_CACHE_ALIGN(hlen) - hlen;
+ bytes = align_hdr_pad + pad_bytes + skb->len;
+
+ return bytes;
+}
+
+/* The most descriptors we could need are 3 - 1 for the headers, 1 for
+ * the beginning of the payload at the end of the FIFO, and 1 if the
+ * payload wraps to the beginning of the FIFO.
+ */
+#define MAX_TX_DESC_NEEDED 3
+static void gve_tx_unmap_buf(struct device *dev,
+ struct gve_tx_dma_buf *buf)
+{
+ const int buf_len = (int)dma_unmap_len(buf, len);
+
+ if (buf_len > 0) {
+ dma_unmap_single(dev, dma_unmap_addr(buf, dma),
+ dma_unmap_len(buf, len),
+ DMA_TO_DEVICE);
+ dma_unmap_len_set(buf, len, 0);
+ } else if (buf_len < 0) {
+ dma_unmap_page(dev, dma_unmap_addr(buf, dma),
+ -dma_unmap_len(buf, len),
+ DMA_TO_DEVICE);
+ dma_unmap_len_set(buf, len, 0);
+ }
+}
+
+/* Check if sufficient resources (descriptor ring space, FIFO space) are
+ * available to transmit the given number of bytes.
+ */
+static inline bool gve_can_tx(struct gve_tx_ring *tx, int bytes_required)
+{
+ bool can_alloc = true;
+
+ if (!tx->raw_addressing)
+ can_alloc = gve_tx_fifo_can_alloc(&tx->tx_fifo, bytes_required);
+
+ return (gve_tx_avail(tx) >= MAX_TX_DESC_NEEDED && can_alloc);
+}
+
+/* Stops the queue if the skb cannot be transmitted. */
+static int gve_maybe_stop_tx(struct gve_tx_ring *tx, struct sk_buff *skb)
+{
+ int bytes_required = 0;
+
+ if (!tx->raw_addressing)
+ bytes_required = gve_skb_fifo_bytes_required(tx, skb);
+
+ if (likely(gve_can_tx(tx, bytes_required)))
+ return 0;
+
+ /* No space, so stop the queue */
+ tx->stop_queue++;
+ netif_tx_stop_queue(tx->netdev_txq);
+ smp_mb(); /* sync with restarting queue in gve_clean_tx_done() */
+
+ /* Now check for resources again, in case gve_clean_tx_done() freed
+ * resources after we checked and we stopped the queue after
+ * gve_clean_tx_done() checked.
+ *
+ * gve_maybe_stop_tx() gve_clean_tx_done()
+ * nsegs/can_alloc test failed
+ * gve_tx_free_fifo()
+ * if (tx queue stopped)
+ * netif_tx_queue_wake()
+ * netif_tx_stop_queue()
+ * Need to check again for space here!
+ */
+ if (likely(!gve_can_tx(tx, bytes_required)))
+ return -EBUSY;
+
+ netif_tx_start_queue(tx->netdev_txq);
+ tx->wake_queue++;
+ return 0;
+}
+
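+/* Fill the first (packet) descriptor: checksum/TSO flags, the number of
+ * descriptors used by this packet, its total length and the first segment.
+ */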
+static void gve_tx_fill_pkt_desc(union gve_tx_desc *pkt_desc,
+ struct sk_buff *skb, bool is_gso,
+ int l4_hdr_offset, u32 desc_cnt,
+ u16 hlen, u64 addr)
+{
+ /* l4_hdr_offset and csum_offset are in units of 16-bit words */
+ if (is_gso) {
+ pkt_desc->pkt.type_flags = GVE_TXD_TSO | GVE_TXF_L4CSUM;
+ pkt_desc->pkt.l4_csum_offset = skb->csum_offset >> 1;
+ pkt_desc->pkt.l4_hdr_offset = l4_hdr_offset >> 1;
+ } else if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
+ pkt_desc->pkt.type_flags = GVE_TXD_STD | GVE_TXF_L4CSUM;
+ pkt_desc->pkt.l4_csum_offset = skb->csum_offset >> 1;
+ pkt_desc->pkt.l4_hdr_offset = l4_hdr_offset >> 1;
+ } else {
+ pkt_desc->pkt.type_flags = GVE_TXD_STD;
+ pkt_desc->pkt.l4_csum_offset = 0;
+ pkt_desc->pkt.l4_hdr_offset = 0;
+ }
+ pkt_desc->pkt.desc_cnt = desc_cnt;
+ pkt_desc->pkt.len = cpu_to_be16(skb->len);
+ pkt_desc->pkt.seg_len = cpu_to_be16(hlen);
+ pkt_desc->pkt.seg_addr = cpu_to_be64(addr);
+}
+
+static void gve_tx_fill_seg_desc(union gve_tx_desc *seg_desc,
+ struct sk_buff *skb, bool is_gso,
+ u16 len, u64 addr)
+{
+ seg_desc->seg.type_flags = GVE_TXD_SEG;
+ if (is_gso) {
+ if (skb_is_gso_v6(skb))
+ seg_desc->seg.type_flags |= GVE_TXSF_IPV6;
+ seg_desc->seg.l3_offset = skb_network_offset(skb) >> 1;
+ seg_desc->seg.mss = cpu_to_be16(skb_shinfo(skb)->gso_size);
+ }
+ seg_desc->seg.seg_len = cpu_to_be16(len);
+ seg_desc->seg.seg_addr = cpu_to_be64(addr);
+}
+
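+/* Sync every QPL page spanned by [iov_offset, iov_offset + iov_len) to the
+ * device before the doorbell is rung.
+ */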
+static inline void gve_dma_sync_for_device(struct gve_priv *priv,
+ dma_addr_t *page_buses,
+ u64 iov_offset, u64 iov_len)
+{
+ u64 last_page = (iov_offset + iov_len - 1) / PAGE_SIZE;
+ u64 first_page = iov_offset / PAGE_SIZE;
+ u64 page;
+
+ for (page = first_page; page <= last_page; page++) {
+ dma_addr_t dma = page_buses[page];
+ dma_sync_single_for_device(&priv->pdev->dev, dma, PAGE_SIZE,
+ DMA_TO_DEVICE);
+ }
+}
+
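+/* QPL path: copy the header (and payload) into the registered FIFO, writing
+ * one packet descriptor plus one segment descriptor per payload fragment.
+ */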
+static int gve_tx_add_skb_copy(struct gve_priv *priv, struct gve_tx_ring *tx,
+ struct sk_buff *skb)
+{
+ int pad_bytes, hlen, hdr_nfrags, payload_nfrags, l4_hdr_offset;
+ union gve_tx_desc *pkt_desc, *seg_desc;
+ struct gve_tx_buffer_state *info;
+ bool is_gso = skb_is_gso(skb);
+ u32 idx = tx->req & tx->mask;
+ int payload_iov = 2;
+ int copy_offset;
+ u32 next_idx;
+ int i;
+
+ info = &tx->info[idx];
+ pkt_desc = &tx->desc[idx];
+
+ l4_hdr_offset = skb_checksum_start_offset(skb);
+ /* If the skb is gso, then we want the tcp header in the first segment
+ * otherwise we want the linear portion of the skb (which will contain
+ * the checksum because skb->csum_start and skb->csum_offset are given
+ * relative to skb->head) in the first segment.
+ */
+ hlen = is_gso ? l4_hdr_offset + tcp_hdrlen(skb) :
+ skb_headlen(skb);
+
+ info->skb = skb;
+ /* We don't want to split the header, so if necessary, pad to the end
+ * of the fifo and then put the header at the beginning of the fifo.
+ */
+ pad_bytes = gve_tx_fifo_pad_alloc_one_frag(&tx->tx_fifo, hlen);
+ hdr_nfrags = gve_tx_alloc_fifo(&tx->tx_fifo, hlen + pad_bytes,
+ &info->iov[0]);
+ WARN(!hdr_nfrags, "hdr_nfrags should never be 0!");
+ payload_nfrags = gve_tx_alloc_fifo(&tx->tx_fifo, skb->len - hlen,
+ &info->iov[payload_iov]);
+
+ gve_tx_fill_pkt_desc(pkt_desc, skb, is_gso, l4_hdr_offset,
+ 1 + payload_nfrags, hlen,
+ info->iov[hdr_nfrags - 1].iov_offset);
+
+ skb_copy_bits(skb, 0,
+ tx->tx_fifo.base + info->iov[hdr_nfrags - 1].iov_offset,
+ hlen);
+ gve_dma_sync_for_device(priv, tx->tx_fifo.qpl->page_buses,
+ info->iov[hdr_nfrags - 1].iov_offset,
+ info->iov[hdr_nfrags - 1].iov_len);
+ copy_offset = hlen;
+
+ for (i = payload_iov; i < payload_nfrags + payload_iov; i++) {
+ next_idx = (tx->req + 1 + i - payload_iov) & tx->mask;
+ seg_desc = &tx->desc[next_idx];
+
+ gve_tx_fill_seg_desc(seg_desc, skb, is_gso,
+ info->iov[i].iov_len,
+ info->iov[i].iov_offset);
+
+ skb_copy_bits(skb, copy_offset,
+ tx->tx_fifo.base + info->iov[i].iov_offset,
+ info->iov[i].iov_len);
+ gve_dma_sync_for_device(priv, tx->tx_fifo.qpl->page_buses,
+ info->iov[i].iov_offset,
+ info->iov[i].iov_len);
+ copy_offset += info->iov[i].iov_len;
+ }
+
+ return 1 + payload_nfrags;
+}
+
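+/* Raw addressing path: DMA map the linear data and each frag directly and
+ * write one segment descriptor per mapping; mappings are unwound on error.
+ */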
+static int gve_tx_add_skb_no_copy(struct gve_priv *priv, struct gve_tx_ring *tx,
+ struct sk_buff *skb)
+{
+ const struct skb_shared_info *shinfo = skb_shinfo(skb);
+ int hlen, payload_nfrags, l4_hdr_offset, seg_idx_bias;
+ union gve_tx_desc *pkt_desc, *seg_desc;
+ struct gve_tx_buffer_state *info;
+ bool is_gso = skb_is_gso(skb);
+ u32 idx = tx->req & tx->mask;
+ struct gve_tx_dma_buf *buf;
+ int last_mapped = 0;
+ u64 addr;
+ u32 len;
+ int i;
+
+ info = &tx->info[idx];
+ pkt_desc = &tx->desc[idx];
+
+ l4_hdr_offset = skb_checksum_start_offset(skb);
+ /* If the skb is gso, then we want the tcp header in the first segment
+ * otherwise we want the linear portion of the skb (which will contain
+ * the checksum because skb->csum_start and skb->csum_offset are given
+ * relative to skb->head) in the first segment.
+ */
+ hlen = is_gso ? l4_hdr_offset + tcp_hdrlen(skb) :
+ skb_headlen(skb);
+ len = skb_headlen(skb);
+
+ info->skb = skb;
+
+ addr = dma_map_single(tx->dev, skb->data, len, DMA_TO_DEVICE);
+ if (unlikely(dma_mapping_error(tx->dev, addr))) {
+ priv->dma_mapping_error++;
+ goto drop;
+ }
+ buf = &info->buf;
+ dma_unmap_len_set(buf, len, len);
+ dma_unmap_addr_set(buf, dma, addr);
+
+ payload_nfrags = shinfo->nr_frags;
+ if (hlen < len) {
+ /* For gso the rest of the linear portion of the skb needs to
+ * be in its own descriptor.
+ */
+ payload_nfrags++;
+ gve_tx_fill_pkt_desc(pkt_desc, skb, is_gso, l4_hdr_offset,
+ 1 + payload_nfrags, hlen, addr);
+
+ len -= hlen;
+ addr += hlen;
+ seg_desc = &tx->desc[(tx->req + 1) & tx->mask];
+ seg_idx_bias = 2;
+ gve_tx_fill_seg_desc(seg_desc, skb, is_gso, len, addr);
+ } else {
+ seg_idx_bias = 1;
+ gve_tx_fill_pkt_desc(pkt_desc, skb, is_gso, l4_hdr_offset,
+ 1 + payload_nfrags, hlen, addr);
+ }
+
+ for (i = 0; i < payload_nfrags - (seg_idx_bias - 1); i++) {
+ struct skb_frag_struct frag = shinfo->frags[i];
+
+ idx = (tx->req + i + seg_idx_bias) & tx->mask;
+ seg_desc = &tx->desc[idx];
+ len = skb_frag_size(&frag);
+ addr = skb_frag_dma_map(tx->dev, &frag, 0, len, DMA_TO_DEVICE);
+ if (unlikely(dma_mapping_error(tx->dev, addr))) {
+ priv->dma_mapping_error++;
+ goto unmap_drop;
+ }
+ buf = &tx->info[idx].buf;
+ dma_unmap_len_set(buf, len, -len);
+ dma_unmap_addr_set(buf, dma, addr);
+
+ gve_tx_fill_seg_desc(seg_desc, skb, is_gso, len, addr);
+ }
+
+ return 1 + payload_nfrags;
+
+unmap_drop:
+ i--;
+ for (last_mapped = i + seg_idx_bias; last_mapped >= 0; last_mapped--) {
+ idx = (tx->req + last_mapped) & tx->mask;
+ gve_tx_unmap_buf(tx->dev, &tx->info[idx].buf);
+ }
+drop:
+ tx->dropped_pkt++;
+ return 0;
+}
+
+netdev_tx_t gve_tx(struct sk_buff *skb, struct net_device *dev)
+{
+ struct gve_priv *priv = netdev_priv(dev);
+ struct gve_tx_ring *tx;
+ int nsegs;
+
+ WARN(skb_get_queue_mapping(skb) > priv->tx_cfg.num_queues,
+ "skb queue index out of range");
+ tx = &priv->tx[skb_get_queue_mapping(skb)];
+ if (unlikely(gve_maybe_stop_tx(tx, skb))) {
+ /* We need to ring the txq doorbell -- we have stopped the Tx
+ * queue for want of resources, but prior calls to gve_tx()
+ * may have added descriptors without ringing the doorbell.
+ */
+
+ /* Ensure tx descs from a prior gve_tx are visible before
+ * ringing doorbell.
+ */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0)
+ dma_wmb();
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0) */
+ wmb();
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0) */
+ gve_tx_put_doorbell(priv, tx->q_resources, tx->req);
+ return NETDEV_TX_BUSY;
+ }
+ if (tx->raw_addressing)
+ nsegs = gve_tx_add_skb_no_copy(priv, tx, skb);
+ else
+ nsegs = gve_tx_add_skb_copy(priv, tx, skb);
+
+ /* If the packet is getting sent, we need to update the skb */
+ if (nsegs) {
+ netdev_tx_sent_queue(tx->netdev_txq, skb->len);
+ skb_tx_timestamp(skb);
+ }
+
+ /* Give packets to NIC. Even if this packet failed to send the doorbell
+ * might need to be rung because of xmit_more.
+ */
+ tx->req += nsegs;
+
+ /* If we have xmit_more - don't ring the doorbell unless we are stopped */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,18,0)
+ if (!netif_xmit_stopped(tx->netdev_txq)
+#if LINUX_VERSION_CODE > KERNEL_VERSION(5,2,0)
+ && netdev_xmit_more()
+#else /* LINUX_VERSION_CODE > KERNEL_VERSION(5,2,0) */
+ && skb->xmit_more
+#endif /* LINUX_VERSION_CODE > KERNEL_VERSION(5,2,0) */
+)
+ return NETDEV_TX_OK;
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,18,0) */
+
+ /* Ensure tx descs are visible before ringing doorbell */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0)
+ dma_wmb();
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0) */
+ wmb();
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,19,0) */
+ gve_tx_put_doorbell(priv, tx->q_resources, tx->req);
+ return NETDEV_TX_OK;
+}
+
+#define GVE_TX_START_THRESH PAGE_SIZE
+
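+/* Reclaim up to @to_do completed descriptors: unmap raw-addressing buffers,
+ * free skbs, return FIFO space and, if requested, wake a stopped tx queue.
+ */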
+static int gve_clean_tx_done(struct gve_priv *priv, struct gve_tx_ring *tx,
+ u32 to_do, bool try_to_wake)
+{
+ struct gve_tx_buffer_state *info;
+ u64 pkts = 0, bytes = 0;
+ size_t space_freed = 0;
+ struct sk_buff *skb;
+ int i, j;
+ u32 idx;
+
+ for (j = 0; j < to_do; j++) {
+ idx = tx->done & tx->mask;
+ netif_info(priv, tx_done, priv->dev,
+ "[%d] %s: idx=%d (req=%u done=%u)\n",
+ tx->q_num, __func__, idx, tx->req, tx->done);
+ info = &tx->info[idx];
+ skb = info->skb;
+
+ /* Unmap the buffer */
+ if (tx->raw_addressing)
+ gve_tx_unmap_buf(tx->dev, &tx->info[idx].buf);
+
+ /* Mark as free */
+ if (skb) {
+ info->skb = NULL;
+ bytes += skb->len;
+ pkts++;
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0)
+ dev_consume_skb_any(skb);
+#else /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0) */
+ dev_kfree_skb_any(skb);
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,14,0) */
+ if (!tx->raw_addressing) {
+ /* FIFO free */
+ for (i = 0; i < ARRAY_SIZE(info->iov); i++) {
+ space_freed += info->iov[i].iov_len +
+ info->iov[i].iov_padding;
+ info->iov[i].iov_len = 0;
+ info->iov[i].iov_padding = 0;
+ }
+ }
+ }
+ tx->done++;
+ }
+
+ if (!tx->raw_addressing) {
+ gve_tx_free_fifo(&tx->tx_fifo, space_freed);
+ }
+ u64_stats_update_begin(&tx->statss);
+ tx->bytes_done += bytes;
+ tx->pkt_done += pkts;
+ u64_stats_update_end(&tx->statss);
+ netdev_tx_completed_queue(tx->netdev_txq, pkts, bytes);
+
+ /* start the queue if we've stopped it */
+#ifndef CONFIG_BQL
+ /* Make sure that the doorbells are synced */
+ smp_mb();
+#endif
+ if (try_to_wake && netif_tx_queue_stopped(tx->netdev_txq) &&
+ likely(gve_can_tx(tx, GVE_TX_START_THRESH))) {
+ tx->wake_queue++;
+ netif_tx_wake_queue(tx->netdev_txq);
+ }
+
+ return pkts;
+}
+
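+/* Read the big-endian completion counter the device writes for this tx
+ * queue from the counter array.
+ */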
+__be32 gve_tx_load_event_counter(struct gve_priv *priv,
+ struct gve_tx_ring *tx)
+{
+ u32 counter_index = be32_to_cpu((tx->q_resources->counter_index));
+
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,20,0)
+ return READ_ONCE(priv->counter_array[counter_index]);
+#else /* LINUX_VERSION_CODE < KERNEL_VERSION(3,20,0) */
+ return ACCESS_ONCE(priv->counter_array[counter_index]);
+#endif /* LINUX_VERSION_CODE >= KERNEL_VERSION(3,20,0) */
+}
+
+bool gve_tx_poll(struct gve_notify_block *block, int budget)
+{
+ struct gve_priv *priv = block->priv;
+ struct gve_tx_ring *tx = block->tx;
+ bool repoll = false;
+ u32 nic_done;
+ u32 to_do;
+
+ /* If budget is 0, do all the work */
+ if (budget == 0)
+ budget = INT_MAX;
+
+ /* Find out how much work there is to be done */
+ tx->last_nic_done = gve_tx_load_event_counter(priv, tx);
+ nic_done = be32_to_cpu(tx->last_nic_done);
+ if (budget > 0) {
+ /* Do as much work as we have that the budget will
+ * allow
+ */
+ to_do = min_t(u32, (nic_done - tx->done), budget);
+ gve_clean_tx_done(priv, tx, to_do, true);
+ }
+ /* If we still have work we want to repoll */
+ repoll |= (nic_done != tx->done);
+ return repoll;
+}
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b21223b..208ec45 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2234,6 +2234,53 @@
return 0;
}
+static int virtnet_set_coalesce(struct net_device *dev,
+ struct ethtool_coalesce *ec)
+{
+ struct ethtool_coalesce ec_default = {
+ .cmd = ETHTOOL_SCOALESCE,
+ .rx_max_coalesced_frames = 1,
+ };
+ struct virtnet_info *vi = netdev_priv(dev);
+ int i, napi_weight;
+
+ if (ec->tx_max_coalesced_frames > 1)
+ return -EINVAL;
+
+ ec_default.tx_max_coalesced_frames = ec->tx_max_coalesced_frames;
+ napi_weight = ec->tx_max_coalesced_frames ? NAPI_POLL_WEIGHT : 0;
+
+ /* disallow changes to fields not explicitly tested above */
+ if (memcmp(ec, &ec_default, sizeof(ec_default)))
+ return -EINVAL;
+
+ if (napi_weight ^ vi->sq[0].napi.weight) {
+ if (dev->flags & IFF_UP)
+ return -EBUSY;
+ for (i = 0; i < vi->max_queue_pairs; i++)
+ vi->sq[i].napi.weight = napi_weight;
+ }
+
+ return 0;
+}
+
+static int virtnet_get_coalesce(struct net_device *dev,
+ struct ethtool_coalesce *ec)
+{
+ struct ethtool_coalesce ec_default = {
+ .cmd = ETHTOOL_GCOALESCE,
+ .rx_max_coalesced_frames = 1,
+ };
+ struct virtnet_info *vi = netdev_priv(dev);
+
+ memcpy(ec, &ec_default, sizeof(ec_default));
+
+ if (vi->sq[0].napi.weight)
+ ec->tx_max_coalesced_frames = 1;
+
+ return 0;
+}
+
static void virtnet_init_settings(struct net_device *dev)
{
struct virtnet_info *vi = netdev_priv(dev);
@@ -2272,6 +2319,8 @@
.get_ts_info = ethtool_op_get_ts_info,
.get_link_ksettings = virtnet_get_link_ksettings,
.set_link_ksettings = virtnet_set_link_ksettings,
+ .set_coalesce = virtnet_set_coalesce,
+ .get_coalesce = virtnet_get_coalesce,
};
static void virtnet_freeze_down(struct virtio_device *vdev)
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index fb667bf..13510ba 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -263,7 +263,7 @@
struct nd_namespace_io *nsio = to_nd_namespace_io(&ndns->dev);
unsigned int sz_align = ALIGN(size + (offset & (512 - 1)), 512);
sector_t sector = offset >> 9;
- int rc = 0;
+ int rc = 0, ret = 0;
if (unlikely(!size))
return 0;
@@ -301,7 +301,9 @@
}
memcpy_flushcache(nsio->addr + offset, buf, size);
- nvdimm_flush(to_nd_region(ndns->dev.parent));
+ ret = nvdimm_flush(to_nd_region(ndns->dev.parent), NULL);
+ if (ret)
+ rc = ret;
return rc;
}
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 01e194a..fbb01a7 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -163,6 +163,7 @@
struct badblocks bb;
struct nd_interleave_set *nd_set;
struct nd_percpu_lane __percpu *lane;
+ int (*flush)(struct nd_region *nd_region, struct bio *bio);
struct nd_mapping mapping[0];
};
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index a7ce2f1..68b4a90 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -192,6 +192,7 @@
static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
{
+ int ret = 0;
blk_status_t rc = 0;
bool do_acct;
unsigned long start;
@@ -201,7 +202,7 @@
struct nd_region *nd_region = to_region(pmem);
if (bio->bi_opf & REQ_PREFLUSH)
- nvdimm_flush(nd_region);
+ ret = nvdimm_flush(nd_region, bio);
do_acct = nd_iostat_start(bio, &start);
bio_for_each_segment(bvec, bio, iter) {
@@ -216,7 +217,10 @@
nd_iostat_end(bio, start);
if (bio->bi_opf & REQ_FUA)
- nvdimm_flush(nd_region);
+ ret = nvdimm_flush(nd_region, bio);
+
+ if (ret)
+ bio->bi_status = errno_to_blk_status(ret);
bio_endio(bio);
return BLK_QC_T_NONE;
@@ -301,6 +305,7 @@
static const struct dax_operations pmem_dax_ops = {
.direct_access = pmem_dax_direct_access,
+ .dax_supported = generic_fsdax_supported,
.copy_from_iter = pmem_copy_from_iter,
.copy_to_iter = pmem_copy_to_iter,
};
@@ -371,6 +376,7 @@
struct gendisk *disk;
void *addr;
int rc;
+ unsigned long flags = 0UL;
pmem = devm_kzalloc(dev, sizeof(*pmem), GFP_KERNEL);
if (!pmem)
@@ -468,14 +474,15 @@
nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_res);
disk->bb = &pmem->bb;
- dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops);
+ if (is_nvdimm_sync(nd_region))
+ flags = DAXDEV_F_SYNC;
+ dax_dev = alloc_dax(pmem, disk->disk_name, &pmem_dax_ops, flags);
if (!dax_dev) {
put_disk(disk);
return -ENOMEM;
}
dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
pmem->dax_dev = dax_dev;
-
gendev = disk_to_dev(disk);
gendev->groups = pmem_attribute_groups;
@@ -533,14 +540,14 @@
sysfs_put(pmem->bb_state);
pmem->bb_state = NULL;
}
- nvdimm_flush(to_nd_region(dev->parent));
+ nvdimm_flush(to_nd_region(dev->parent), NULL);
return 0;
}
static void nd_pmem_shutdown(struct device *dev)
{
- nvdimm_flush(to_nd_region(dev->parent));
+ nvdimm_flush(to_nd_region(dev->parent), NULL);
}
static void nd_pmem_notify(struct device *dev, enum nvdimm_event event)
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 609fc45..aa0f6f5 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -290,7 +290,9 @@
return rc;
if (!flush)
return -EINVAL;
- nvdimm_flush(nd_region);
+ rc = nvdimm_flush(nd_region, NULL);
+ if (rc)
+ return rc;
return len;
}
@@ -1076,6 +1078,11 @@
dev->of_node = ndr_desc->of_node;
nd_region->ndr_size = resource_size(ndr_desc->res);
nd_region->ndr_start = ndr_desc->res->start;
+ if (ndr_desc->flush)
+ nd_region->flush = ndr_desc->flush;
+ else
+ nd_region->flush = NULL;
+
nd_device_register(dev);
return nd_region;
@@ -1116,11 +1123,24 @@
}
EXPORT_SYMBOL_GPL(nvdimm_volatile_region_create);
+int nvdimm_flush(struct nd_region *nd_region, struct bio *bio)
+{
+ int rc = 0;
+
+ if (!nd_region->flush)
+ rc = generic_nvdimm_flush(nd_region);
+ else {
+ if (nd_region->flush(nd_region, bio))
+ rc = -EIO;
+ }
+
+ return rc;
+}
/**
* nvdimm_flush - flush any posted write queues between the cpu and pmem media
* @nd_region: blk or interleaved pmem region
*/
-void nvdimm_flush(struct nd_region *nd_region)
+int generic_nvdimm_flush(struct nd_region *nd_region)
{
struct nd_region_data *ndrd = dev_get_drvdata(&nd_region->dev);
int i, idx;
@@ -1144,6 +1164,8 @@
if (ndrd_get_flush_wpq(ndrd, i, 0))
writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
wmb();
+
+ return 0;
}
EXPORT_SYMBOL_GPL(nvdimm_flush);
@@ -1188,6 +1210,13 @@
}
EXPORT_SYMBOL_GPL(nvdimm_has_cache);
+bool is_nvdimm_sync(struct nd_region *nd_region)
+{
+ return is_nd_pmem(&nd_region->dev) &&
+ !test_bit(ND_REGION_ASYNC, &nd_region->flags);
+}
+EXPORT_SYMBOL_GPL(is_nvdimm_sync);
+
struct conflict_context {
struct nd_region *nd_region;
resource_size_t start, size;
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d5359c7..e206c39 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1055,15 +1055,15 @@
return id;
}
-static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
- void *buffer, size_t buflen, u32 *result)
+static int nvme_features(struct nvme_ctrl *dev, u8 op, unsigned int fid,
+ unsigned int dword11, void *buffer, size_t buflen, u32 *result)
{
union nvme_result res = { 0 };
struct nvme_command c;
int ret;
memset(&c, 0, sizeof(c));
- c.features.opcode = nvme_admin_set_features;
+ c.features.opcode = op;
c.features.fid = cpu_to_le32(fid);
c.features.dword11 = cpu_to_le32(dword11);
@@ -1074,6 +1074,24 @@
return ret;
}
+int nvme_set_features(struct nvme_ctrl *dev, unsigned int fid,
+ unsigned int dword11, void *buffer, size_t buflen,
+ u32 *result)
+{
+ return nvme_features(dev, nvme_admin_set_features, fid, dword11, buffer,
+ buflen, result);
+}
+EXPORT_SYMBOL_GPL(nvme_set_features);
+
+int nvme_get_features(struct nvme_ctrl *dev, unsigned int fid,
+ unsigned int dword11, void *buffer, size_t buflen,
+ u32 *result)
+{
+ return nvme_features(dev, nvme_admin_get_features, fid, dword11, buffer,
+ buflen, result);
+}
+EXPORT_SYMBOL_GPL(nvme_get_features);
+
int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
{
u32 q_count = (*count - 1) | ((*count - 1) << 16);
@@ -3772,6 +3790,17 @@
}
EXPORT_SYMBOL_GPL(nvme_start_queues);
+void nvme_sync_queues(struct nvme_ctrl *ctrl)
+{
+ struct nvme_ns *ns;
+
+ down_read(&ctrl->namespaces_rwsem);
+ list_for_each_entry(ns, &ctrl->namespaces, list)
+ blk_sync_queue(ns->queue);
+ up_read(&ctrl->namespaces_rwsem);
+}
+EXPORT_SYMBOL_GPL(nvme_sync_queues);
+
int __init nvme_core_init(void)
{
int result = -ENOMEM;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index cc4273f..40192b6 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -436,6 +436,7 @@
void nvme_stop_queues(struct nvme_ctrl *ctrl);
void nvme_start_queues(struct nvme_ctrl *ctrl);
void nvme_kill_queues(struct nvme_ctrl *ctrl);
+void nvme_sync_queues(struct nvme_ctrl *ctrl);
void nvme_unfreeze(struct nvme_ctrl *ctrl);
void nvme_wait_freeze(struct nvme_ctrl *ctrl);
void nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout);
@@ -453,6 +454,12 @@
union nvme_result *result, void *buffer, unsigned bufflen,
unsigned timeout, int qid, int at_head,
blk_mq_req_flags_t flags);
+int nvme_set_features(struct nvme_ctrl *dev, unsigned int fid,
+ unsigned int dword11, void *buffer, size_t buflen,
+ u32 *result);
+int nvme_get_features(struct nvme_ctrl *dev, unsigned int fid,
+ unsigned int dword11, void *buffer, size_t buflen,
+ u32 *result);
int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count);
void nvme_stop_keep_alive(struct nvme_ctrl *ctrl);
int nvme_reset_ctrl(struct nvme_ctrl *ctrl);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3c68a5b..7c00c85 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -26,6 +26,7 @@
#include <linux/mutex.h>
#include <linux/once.h>
#include <linux/pci.h>
+#include <linux/suspend.h>
#include <linux/t10-pi.h>
#include <linux/types.h>
#include <linux/io-64-nonatomic-lo-hi.h>
@@ -106,6 +107,7 @@
u32 cmbloc;
struct nvme_ctrl ctrl;
struct completion ioq_wait;
+ u32 last_ps;
mempool_t *iod_mempool;
@@ -1132,7 +1134,6 @@
struct nvme_dev *dev = nvmeq->dev;
struct request *abort_req;
struct nvme_command cmd;
- bool shutdown = false;
u32 csts = readl(dev->bar + NVME_REG_CSTS);
/* If PCI error recovery process is happening, we cannot reset or
@@ -1169,16 +1170,18 @@
* shutdown, so we return BLK_EH_DONE.
*/
switch (dev->ctrl.state) {
- case NVME_CTRL_DELETING:
- shutdown = true;
case NVME_CTRL_CONNECTING:
- case NVME_CTRL_RESETTING:
+ nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
+ /* fall through */
+ case NVME_CTRL_DELETING:
dev_warn_ratelimited(dev->ctrl.device,
"I/O %d QID %d timeout, disable controller\n",
req->tag, nvmeq->qid);
- nvme_dev_disable(dev, shutdown);
+ nvme_dev_disable(dev, true);
nvme_req(req)->flags |= NVME_REQ_CANCELLED;
return BLK_EH_DONE;
+ case NVME_CTRL_RESETTING:
+ return BLK_EH_RESET_TIMER;
default:
break;
}
@@ -2150,7 +2153,7 @@
static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
{
int i;
- bool dead = true;
+ bool dead = true, freeze = false;
struct pci_dev *pdev = to_pci_dev(dev->dev);
mutex_lock(&dev->shutdown_lock);
@@ -2158,8 +2161,10 @@
u32 csts = readl(dev->bar + NVME_REG_CSTS);
if (dev->ctrl.state == NVME_CTRL_LIVE ||
- dev->ctrl.state == NVME_CTRL_RESETTING)
+ dev->ctrl.state == NVME_CTRL_RESETTING) {
+ freeze = true;
nvme_start_freeze(&dev->ctrl);
+ }
dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
pdev->error_state != pci_channel_io_normal);
}
@@ -2168,10 +2173,8 @@
* Give the controller a chance to complete all entered requests if
* doing a safe shutdown.
*/
- if (!dead) {
- if (shutdown)
- nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
- }
+ if (!dead && shutdown && freeze)
+ nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
nvme_stop_queues(&dev->ctrl);
@@ -2269,6 +2272,7 @@
*/
if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
nvme_dev_disable(dev, false);
+ nvme_sync_queues(&dev->ctrl);
mutex_lock(&dev->shutdown_lock);
result = nvme_pci_enable(dev);
@@ -2608,12 +2612,68 @@
}
#ifdef CONFIG_PM_SLEEP
-static int nvme_suspend(struct device *dev)
+static int nvme_deep_state(struct nvme_dev *dev)
+{
+ struct pci_dev *pdev = to_pci_dev(dev->dev);
+ struct nvme_ctrl *ctrl = &dev->ctrl;
+ int ret = -EBUSY;
+
+ nvme_start_freeze(ctrl);
+ nvme_wait_freeze(ctrl);
+ nvme_sync_queues(ctrl);
+
+ if (ctrl->state != NVME_CTRL_LIVE &&
+ ctrl->state != NVME_CTRL_ADMIN_ONLY)
+ goto unfreeze;
+
+ dev->last_ps = 0;
+ ret = nvme_get_features(ctrl, NVME_FEAT_POWER_MGMT, 0, NULL, 0,
+ &dev->last_ps);
+ if (ret < 0)
+ goto unfreeze;
+
+ ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, dev->ctrl.npss,
+ NULL, 0, NULL);
+ if (ret < 0)
+ goto unfreeze;
+ if (ret) {
+ /*
+ * Clearing npss forces a controller reset on resume. The
+ * correct value will be rediscovered then.
+ */
+ ctrl->npss = 0;
+ nvme_dev_disable(dev, true);
+ ret = 0;
+ } else {
+ /*
+ * A saved state prevents pci pm from generically controlling
+ * the device's power. If we're using protocol specific
+ * settings, we don't want pci interfering.
+ */
+ pci_save_state(pdev);
+ }
+unfreeze:
+ nvme_unfreeze(ctrl);
+ return ret;
+}
+
+static int nvme_make_operational(struct nvme_dev *dev)
+{
+ struct nvme_ctrl *ctrl = &dev->ctrl;
+
+ if (nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, dev->last_ps,
+ NULL, 0, NULL) == 0)
+ return 0;
+ nvme_reset_ctrl(ctrl);
+ return 0;
+}
+
+static int nvme_simple_resume(struct device *dev)
{
struct pci_dev *pdev = to_pci_dev(dev);
struct nvme_dev *ndev = pci_get_drvdata(pdev);
- nvme_dev_disable(ndev, true);
+ nvme_reset_ctrl(&ndev->ctrl);
return 0;
}
@@ -2622,12 +2682,45 @@
struct pci_dev *pdev = to_pci_dev(dev);
struct nvme_dev *ndev = pci_get_drvdata(pdev);
- nvme_reset_ctrl(&ndev->ctrl);
+ return pm_resume_via_firmware() || !ndev->ctrl.npss ?
+ nvme_simple_resume(dev) : nvme_make_operational(ndev);
+}
+
+static int nvme_simple_suspend(struct device *dev)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+ struct nvme_dev *ndev = pci_get_drvdata(pdev);
+
+ nvme_dev_disable(ndev, true);
return 0;
}
-#endif
-static SIMPLE_DEV_PM_OPS(nvme_dev_pm_ops, nvme_suspend, nvme_resume);
+static int nvme_suspend(struct device *dev)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+ struct nvme_dev *ndev = pci_get_drvdata(pdev);
+
+ /*
+ * The platform does not remove power for a kernel managed suspend so
+ * use host managed nvme power settings for lowest idle power. This
+ * should have quicker resume latency than a full device shutdown.
+ */
+ return pm_suspend_via_firmware() || !ndev->ctrl.npss ?
+ nvme_simple_suspend(dev) : nvme_deep_state(ndev);
+}
+
+const struct dev_pm_ops nvme_dev_pm_ops = {
+ .suspend = nvme_suspend,
+ .resume = nvme_resume,
+ .freeze = nvme_simple_suspend,
+ .thaw = nvme_simple_resume,
+ .poweroff = nvme_simple_suspend,
+ .restore = nvme_simple_resume,
+};
+
+#else
+const struct dev_pm_ops nvme_dev_pm_ops = {};
+#endif
static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
pci_channel_state_t state)
diff --git a/drivers/rtc/interface.c b/drivers/rtc/interface.c
index ce051f9..c8242d4 100644
--- a/drivers/rtc/interface.c
+++ b/drivers/rtc/interface.c
@@ -579,7 +579,9 @@
struct rtc_time tm;
ktime_t now, onesec;
- __rtc_read_time(rtc, &tm);
+ err = __rtc_read_time(rtc, &tm);
+ if (err)
+ goto out;
onesec = ktime_set(1, 0);
now = rtc_tm_to_ktime(tm);
rtc->uie_rtctimer.node.expires = ktime_add(now, onesec);
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 23e526c..737dc0e 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -59,6 +59,7 @@
static const struct dax_operations dcssblk_dax_ops = {
.direct_access = dcssblk_dax_direct_access,
+ .dax_supported = generic_fsdax_supported,
.copy_from_iter = dcssblk_dax_copy_from_iter,
.copy_to_iter = dcssblk_dax_copy_to_iter,
};
@@ -678,7 +679,7 @@
goto put_dev;
dev_info->dax_dev = alloc_dax(dev_info, dev_info->gd->disk_name,
- &dcssblk_dax_ops);
+ &dcssblk_dax_ops, DAXDEV_F_SYNC);
if (!dev_info->dax_dev) {
rc = -ENOMEM;
goto put_dev;
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 1afcbef..2df7b1c 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -266,7 +266,10 @@
pages_to_bytes(events[PSWPIN]));
update_stat(vb, idx++, VIRTIO_BALLOON_S_SWAP_OUT,
pages_to_bytes(events[PSWPOUT]));
- update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, events[PGMAJFAULT]);
+ update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT,
+ events[PGMAJFAULT_S] +
+ events[PGMAJFAULT_A] +
+ events[PGMAJFAULT_F]);
update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, events[PGFAULT]);
#ifdef CONFIG_HUGETLB_PAGE
update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 58f48ea..8b4ded9 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -34,6 +34,7 @@
#include <linux/mutex.h>
#include <linux/anon_inodes.h>
#include <linux/device.h>
+#include <linux/freezer.h>
#include <linux/uaccess.h>
#include <asm/io.h>
#include <asm/mman.h>
@@ -1816,7 +1817,8 @@
}
spin_unlock_irq(&ep->wq.lock);
- if (!schedule_hrtimeout_range(to, slack, HRTIMER_MODE_ABS))
+ if (!freezable_schedule_hrtimeout_range(to, slack,
+ HRTIMER_MODE_ABS))
timed_out = 1;
spin_lock_irq(&ep->wq.lock);
diff --git a/fs/exec.c b/fs/exec.c
index cece8c1..eeac87c5 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -67,6 +67,7 @@
#include <asm/mmu_context.h>
#include <asm/tlb.h>
+#include <trace/events/fs.h>
#include <trace/events/task.h>
#include "internal.h"
@@ -865,9 +866,12 @@
if (err)
goto exit;
- if (name->name[0] != '\0')
+ if (name->name[0] != '\0') {
fsnotify_open(file);
+ trace_open_exec(name->name);
+ }
+
out:
return file;
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0a4461a..6c73dd8 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -207,6 +207,12 @@
*/
#define EXT4_IO_END_UNWRITTEN 0x0001
+struct ext4_io_end_vec {
+ struct list_head list; /* list of io_end_vec */
+ loff_t offset; /* offset in the file */
+ ssize_t size; /* size of the extent */
+};
+
/*
* For converting unwritten extents on a work queue. 'handle' is used for
* buffered writeback.
@@ -220,8 +226,7 @@
* bios covering the extent */
unsigned int flag; /* unwritten or not */
atomic_t count; /* reference counter */
- loff_t offset; /* offset in the file */
- ssize_t size; /* size of the extent */
+ struct list_head list_vec; /* list of ext4_io_end_vec */
} ext4_io_end_t;
struct ext4_io_submit {
@@ -3177,6 +3182,8 @@
loff_t len);
extern int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
loff_t offset, ssize_t len);
+extern int ext4_convert_unwritten_io_end_vec(handle_t *handle,
+ ext4_io_end_t *io_end);
extern int ext4_map_blocks(handle_t *handle, struct inode *inode,
struct ext4_map_blocks *map, int flags);
extern int ext4_ext_calc_metadata_amount(struct inode *inode,
@@ -3235,6 +3242,8 @@
int len,
struct writeback_control *wbc,
bool keep_towrite);
+extern struct ext4_io_end_vec *ext4_alloc_io_end_vec(ext4_io_end_t *io_end);
+extern struct ext4_io_end_vec *ext4_last_io_end_vec(ext4_io_end_t *io_end);
/* mmp.c */
extern int ext4_multi_mount_protect(struct super_block *, ext4_fsblk_t);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 6e80490..6d5cee1 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -5031,23 +5031,13 @@
int ret = 0;
int ret2 = 0;
struct ext4_map_blocks map;
- unsigned int credits, blkbits = inode->i_blkbits;
+ unsigned int blkbits = inode->i_blkbits;
+ unsigned int credits = 0;
map.m_lblk = offset >> blkbits;
max_blocks = EXT4_MAX_BLOCKS(len, offset, blkbits);
- /*
- * This is somewhat ugly but the idea is clear: When transaction is
- * reserved, everything goes into it. Otherwise we rather start several
- * smaller transactions for conversion of each extent separately.
- */
- if (handle) {
- handle = ext4_journal_start_reserved(handle,
- EXT4_HT_EXT_CONVERT);
- if (IS_ERR(handle))
- return PTR_ERR(handle);
- credits = 0;
- } else {
+ if (!handle) {
/*
* credits to insert 1 extent into extent tree
*/
@@ -5078,11 +5068,40 @@
if (ret <= 0 || ret2)
break;
}
- if (!credits)
- ret2 = ext4_journal_stop(handle);
return ret > 0 ? ret2 : ret;
}
+int ext4_convert_unwritten_io_end_vec(handle_t *handle, ext4_io_end_t *io_end)
+{
+ int ret, err = 0;
+ struct ext4_io_end_vec *io_end_vec;
+
+ /*
+ * This is somewhat ugly but the idea is clear: When transaction is
+ * reserved, everything goes into it. Otherwise we rather start several
+ * smaller transactions for conversion of each extent separately.
+ */
+ if (handle) {
+ handle = ext4_journal_start_reserved(handle,
+ EXT4_HT_EXT_CONVERT);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+ }
+
+ list_for_each_entry(io_end_vec, &io_end->list_vec, list) {
+ ret = ext4_convert_unwritten_extents(handle, io_end->inode,
+ io_end_vec->offset,
+ io_end_vec->size);
+ if (ret)
+ break;
+ }
+
+ if (handle)
+ err = ext4_journal_stop(handle);
+
+ return ret < 0 ? ret : err;
+}
+
/*
* If newes is not existing extent (newes->ec_pblk equals zero) find
* delayed extent at start of newes and update newes accordingly and
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 52d155b..adcd424 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -373,15 +373,17 @@
static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
{
struct inode *inode = file->f_mapping->host;
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ struct dax_device *dax_dev = sbi->s_daxdev;
- if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
+ if (unlikely(ext4_forced_shutdown(sbi)))
return -EIO;
/*
- * We don't support synchronous mappings for non-DAX files. At least
- * until someone comes with a sensible use case.
+ * We don't support synchronous mappings for non-DAX files and
+ * for DAX files if underneath dax_device is not synchronous.
*/
- if (!IS_DAX(file_inode(file)) && (vma->vm_flags & VM_SYNC))
+ if (!daxdev_mapping_supported(vma, dax_dev))
return -EOPNOTSUPP;
file_accessed(file);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3b1a759..8c01c71 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2342,6 +2342,79 @@
}
/*
+ * mpage_process_page - update page buffers corresponding to changed extent and
+ * may submit fully mapped page for IO
+ *
+ * @mpd - description of extent to map, on return next extent to map
+ * @m_lblk - logical block mapping.
+ * @m_pblk - corresponding physical mapping.
+ * @map_bh - determines on return whether this page requires any further
+ * mapping or not.
+ * Scan given page buffers corresponding to changed extent and update buffer
+ * state according to new extent state.
+ * We map delalloc buffers to their physical location, clear unwritten bits.
+ * If the given page is not fully mapped, we update @map to the next extent in
+ * the given page that needs mapping & return @map_bh as true.
+ */
+static int mpage_process_page(struct mpage_da_data *mpd, struct page *page,
+ ext4_lblk_t *m_lblk, ext4_fsblk_t *m_pblk,
+ bool *map_bh)
+{
+ struct buffer_head *head, *bh;
+ ext4_io_end_t *io_end = mpd->io_submit.io_end;
+ ext4_lblk_t lblk = *m_lblk;
+ ext4_fsblk_t pblock = *m_pblk;
+ int err = 0;
+ int blkbits = mpd->inode->i_blkbits;
+ ssize_t io_end_size = 0;
+ struct ext4_io_end_vec *io_end_vec = ext4_last_io_end_vec(io_end);
+
+ bh = head = page_buffers(page);
+ do {
+ if (lblk < mpd->map.m_lblk)
+ continue;
+ if (lblk >= mpd->map.m_lblk + mpd->map.m_len) {
+ /*
+ * Buffer after end of mapped extent.
+ * Find next buffer in the page to map.
+ */
+ mpd->map.m_len = 0;
+ mpd->map.m_flags = 0;
+ io_end_vec->size += io_end_size;
+ io_end_size = 0;
+
+ err = mpage_process_page_bufs(mpd, head, bh, lblk);
+ if (err > 0)
+ err = 0;
+ if (!err && mpd->map.m_len && mpd->map.m_lblk > lblk) {
+ io_end_vec = ext4_alloc_io_end_vec(io_end);
+ if (IS_ERR(io_end_vec)) {
+ err = PTR_ERR(io_end_vec);
+ goto out;
+ }
+ io_end_vec->offset = mpd->map.m_lblk << blkbits;
+ }
+ *map_bh = true;
+ goto out;
+ }
+ if (buffer_delay(bh)) {
+ clear_buffer_delay(bh);
+ bh->b_blocknr = pblock++;
+ }
+ clear_buffer_unwritten(bh);
+ io_end_size += (1 << blkbits);
+ } while (lblk++, (bh = bh->b_this_page) != head);
+
+ io_end_vec->size += io_end_size;
+ io_end_size = 0;
+ *map_bh = false;
+out:
+ *m_lblk = lblk;
+ *m_pblk = pblock;
+ return err;
+}
+
+/*
* mpage_map_buffers - update buffers corresponding to changed extent and
* submit fully mapped pages for IO
*
@@ -2360,12 +2433,12 @@
struct pagevec pvec;
int nr_pages, i;
struct inode *inode = mpd->inode;
- struct buffer_head *head, *bh;
int bpp_bits = PAGE_SHIFT - inode->i_blkbits;
pgoff_t start, end;
ext4_lblk_t lblk;
- sector_t pblock;
+ ext4_fsblk_t pblock;
int err;
+ bool map_bh = false;
start = mpd->map.m_lblk >> bpp_bits;
end = (mpd->map.m_lblk + mpd->map.m_len - 1) >> bpp_bits;
@@ -2381,50 +2454,19 @@
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
- bh = head = page_buffers(page);
- do {
- if (lblk < mpd->map.m_lblk)
- continue;
- if (lblk >= mpd->map.m_lblk + mpd->map.m_len) {
- /*
- * Buffer after end of mapped extent.
- * Find next buffer in the page to map.
- */
- mpd->map.m_len = 0;
- mpd->map.m_flags = 0;
- /*
- * FIXME: If dioread_nolock supports
- * blocksize < pagesize, we need to make
- * sure we add size mapped so far to
- * io_end->size as the following call
- * can submit the page for IO.
- */
- err = mpage_process_page_bufs(mpd, head,
- bh, lblk);
- pagevec_release(&pvec);
- if (err > 0)
- err = 0;
- return err;
- }
- if (buffer_delay(bh)) {
- clear_buffer_delay(bh);
- bh->b_blocknr = pblock++;
- }
- clear_buffer_unwritten(bh);
- } while (lblk++, (bh = bh->b_this_page) != head);
-
+ err = mpage_process_page(mpd, page, &lblk, &pblock,
+ &map_bh);
/*
- * FIXME: This is going to break if dioread_nolock
- * supports blocksize < pagesize as we will try to
- * convert potentially unmapped parts of inode.
+ * If map_bh is true, the page may require further bh
+ * mapping, or the page may have been submitted for IO,
+ * so we return to do further extent mapping.
*/
- mpd->io_submit.io_end->size += PAGE_SIZE;
+ if (err < 0 || map_bh == true)
+ goto out;
/* Page fully mapped - let IO run! */
err = mpage_submit_page(mpd, page);
- if (err < 0) {
- pagevec_release(&pvec);
- return err;
- }
+ if (err < 0)
+ goto out;
}
pagevec_release(&pvec);
}
@@ -2432,6 +2474,9 @@
mpd->map.m_len = 0;
mpd->map.m_flags = 0;
return 0;
+out:
+ pagevec_release(&pvec);
+ return err;
}
static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
@@ -2515,9 +2560,13 @@
int err;
loff_t disksize;
int progress = 0;
+ ext4_io_end_t *io_end = mpd->io_submit.io_end;
+ struct ext4_io_end_vec *io_end_vec;
- mpd->io_submit.io_end->offset =
- ((loff_t)map->m_lblk) << inode->i_blkbits;
+ io_end_vec = ext4_alloc_io_end_vec(io_end);
+ if (IS_ERR(io_end_vec))
+ return PTR_ERR(io_end_vec);
+ io_end_vec->offset = ((loff_t)map->m_lblk) << inode->i_blkbits;
do {
err = mpage_map_one_extent(handle, mpd);
if (err < 0) {
@@ -3640,6 +3689,7 @@
ssize_t size, void *private)
{
ext4_io_end_t *io_end = private;
+ struct ext4_io_end_vec *io_end_vec;
/* if not async direct IO just return */
if (!io_end)
@@ -3657,8 +3707,9 @@
ext4_clear_io_unwritten_flag(io_end);
size = 0;
}
- io_end->offset = offset;
- io_end->size = size;
+ io_end_vec = ext4_alloc_io_end_vec(io_end);
+ io_end_vec->offset = offset;
+ io_end_vec->size = size;
ext4_put_io_end(io_end);
return 0;
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 9cc79b7..92860e6 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -31,18 +31,56 @@
#include "acl.h"
static struct kmem_cache *io_end_cachep;
+static struct kmem_cache *io_end_vec_cachep;
int __init ext4_init_pageio(void)
{
io_end_cachep = KMEM_CACHE(ext4_io_end, SLAB_RECLAIM_ACCOUNT);
if (io_end_cachep == NULL)
return -ENOMEM;
+
+ io_end_vec_cachep = KMEM_CACHE(ext4_io_end_vec, 0);
+ if (io_end_vec_cachep == NULL) {
+ kmem_cache_destroy(io_end_cachep);
+ return -ENOMEM;
+ }
return 0;
}
void ext4_exit_pageio(void)
{
kmem_cache_destroy(io_end_cachep);
+ kmem_cache_destroy(io_end_vec_cachep);
+}
+
+struct ext4_io_end_vec *ext4_alloc_io_end_vec(ext4_io_end_t *io_end)
+{
+ struct ext4_io_end_vec *io_end_vec;
+
+ io_end_vec = kmem_cache_zalloc(io_end_vec_cachep, GFP_NOFS);
+ if (!io_end_vec)
+ return ERR_PTR(-ENOMEM);
+ INIT_LIST_HEAD(&io_end_vec->list);
+ list_add_tail(&io_end_vec->list, &io_end->list_vec);
+ return io_end_vec;
+}
+
+static void ext4_free_io_end_vec(ext4_io_end_t *io_end)
+{
+ struct ext4_io_end_vec *io_end_vec, *tmp;
+
+ if (list_empty(&io_end->list_vec))
+ return;
+ list_for_each_entry_safe(io_end_vec, tmp, &io_end->list_vec, list) {
+ list_del(&io_end_vec->list);
+ kmem_cache_free(io_end_vec_cachep, io_end_vec);
+ }
+}
+
+struct ext4_io_end_vec *ext4_last_io_end_vec(ext4_io_end_t *io_end)
+{
+ BUG_ON(list_empty(&io_end->list_vec));
+ return list_last_entry(&io_end->list_vec, struct ext4_io_end_vec, list);
}
/*
@@ -133,6 +171,7 @@
ext4_finish_bio(bio);
bio_put(bio);
}
+ ext4_free_io_end_vec(io_end);
kmem_cache_free(io_end_cachep, io_end);
}
@@ -144,29 +183,26 @@
* cannot get to ext4_ext_truncate() before all IOs overlapping that range are
* completed (happens from ext4_free_ioend()).
*/
-static int ext4_end_io(ext4_io_end_t *io)
+static int ext4_end_io_end(ext4_io_end_t *io_end)
{
- struct inode *inode = io->inode;
- loff_t offset = io->offset;
- ssize_t size = io->size;
- handle_t *handle = io->handle;
+ struct inode *inode = io_end->inode;
+ handle_t *handle = io_end->handle;
int ret = 0;
- ext4_debug("ext4_end_io_nolock: io 0x%p from inode %lu,list->next 0x%p,"
+ ext4_debug("ext4_end_io_nolock: io_end 0x%p from inode %lu,list->next 0x%p,"
"list->prev 0x%p\n",
- io, inode->i_ino, io->list.next, io->list.prev);
+ io_end, inode->i_ino, io_end->list.next, io_end->list.prev);
- io->handle = NULL; /* Following call will use up the handle */
- ret = ext4_convert_unwritten_extents(handle, inode, offset, size);
+ io_end->handle = NULL; /* Following call will use up the handle */
+ ret = ext4_convert_unwritten_io_end_vec(handle, io_end);
if (ret < 0 && !ext4_forced_shutdown(EXT4_SB(inode->i_sb))) {
ext4_msg(inode->i_sb, KERN_EMERG,
"failed to convert unwritten extents to written "
"extents -- potential data loss! "
- "(inode %lu, offset %llu, size %zd, error %d)",
- inode->i_ino, offset, size, ret);
+ "(inode %lu, error %d)", inode->i_ino, ret);
}
- ext4_clear_io_unwritten_flag(io);
- ext4_release_io_end(io);
+ ext4_clear_io_unwritten_flag(io_end);
+ ext4_release_io_end(io_end);
return ret;
}
@@ -174,21 +210,21 @@
{
#ifdef EXT4FS_DEBUG
struct list_head *cur, *before, *after;
- ext4_io_end_t *io, *io0, *io1;
+ ext4_io_end_t *io_end, *io_end0, *io_end1;
if (list_empty(head))
return;
ext4_debug("Dump inode %lu completed io list\n", inode->i_ino);
- list_for_each_entry(io, head, list) {
- cur = &io->list;
+ list_for_each_entry(io_end, head, list) {
+ cur = &io_end->list;
before = cur->prev;
- io0 = container_of(before, ext4_io_end_t, list);
+ io_end0 = container_of(before, ext4_io_end_t, list);
after = cur->next;
- io1 = container_of(after, ext4_io_end_t, list);
+ io_end1 = container_of(after, ext4_io_end_t, list);
ext4_debug("io 0x%p from inode %lu,prev 0x%p,next 0x%p\n",
- io, inode->i_ino, io0, io1);
+ io_end, inode->i_ino, io_end0, io_end1);
}
#endif
}
@@ -215,7 +251,7 @@
static int ext4_do_flush_completed_IO(struct inode *inode,
struct list_head *head)
{
- ext4_io_end_t *io;
+ ext4_io_end_t *io_end;
struct list_head unwritten;
unsigned long flags;
struct ext4_inode_info *ei = EXT4_I(inode);
@@ -227,11 +263,11 @@
spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
while (!list_empty(&unwritten)) {
- io = list_entry(unwritten.next, ext4_io_end_t, list);
- BUG_ON(!(io->flag & EXT4_IO_END_UNWRITTEN));
- list_del_init(&io->list);
+ io_end = list_entry(unwritten.next, ext4_io_end_t, list);
+ BUG_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
+ list_del_init(&io_end->list);
- err = ext4_end_io(io);
+ err = ext4_end_io_end(io_end);
if (unlikely(!ret && err))
ret = err;
}
@@ -250,19 +286,22 @@
ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
{
- ext4_io_end_t *io = kmem_cache_zalloc(io_end_cachep, flags);
- if (io) {
- io->inode = inode;
- INIT_LIST_HEAD(&io->list);
- atomic_set(&io->count, 1);
+ ext4_io_end_t *io_end = kmem_cache_zalloc(io_end_cachep, flags);
+
+ if (io_end) {
+ io_end->inode = inode;
+ INIT_LIST_HEAD(&io_end->list);
+ INIT_LIST_HEAD(&io_end->list_vec);
+ atomic_set(&io_end->count, 1);
}
- return io;
+ return io_end;
}
void ext4_put_io_end_defer(ext4_io_end_t *io_end)
{
if (atomic_dec_and_test(&io_end->count)) {
- if (!(io_end->flag & EXT4_IO_END_UNWRITTEN) || !io_end->size) {
+ if (!(io_end->flag & EXT4_IO_END_UNWRITTEN) ||
+ list_empty(&io_end->list_vec)) {
ext4_release_io_end(io_end);
return;
}
@@ -276,9 +315,8 @@
if (atomic_dec_and_test(&io_end->count)) {
if (io_end->flag & EXT4_IO_END_UNWRITTEN) {
- err = ext4_convert_unwritten_extents(io_end->handle,
- io_end->inode, io_end->offset,
- io_end->size);
+ err = ext4_convert_unwritten_io_end_vec(io_end->handle,
+ io_end);
io_end->handle = NULL;
ext4_clear_io_unwritten_flag(io_end);
}
@@ -315,10 +353,8 @@
struct inode *inode = io_end->inode;
ext4_warning(inode->i_sb, "I/O error %d writing to inode %lu "
- "(offset %llu size %ld starting block %llu)",
+ "starting block %llu)",
bio->bi_status, inode->i_ino,
- (unsigned long long) io_end->offset,
- (long) io_end->size,
(unsigned long long)
bi_sector >> (inode->i_blkbits - 9));
mapping_set_error(inode->i_mapping,
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 93c14ec..e04d9ba 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1538,6 +1538,7 @@
{Opt_auto_da_alloc, "auto_da_alloc"},
{Opt_noauto_da_alloc, "noauto_da_alloc"},
{Opt_dioread_nolock, "dioread_nolock"},
+ {Opt_dioread_lock, "nodioread_nolock"},
{Opt_dioread_lock, "dioread_lock"},
{Opt_discard, "discard"},
{Opt_nodiscard, "nodiscard"},
@@ -2024,7 +2025,7 @@
unsigned int *journal_ioprio,
int is_remount)
{
- struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_sb_info __maybe_unused *sbi = EXT4_SB(sb);
char *p, __maybe_unused *usr_qf_name, __maybe_unused *grp_qf_name;
substring_t args[MAX_OPT_ARGS];
int token;
@@ -2078,16 +2079,6 @@
}
}
#endif
- if (test_opt(sb, DIOREAD_NOLOCK)) {
- int blocksize =
- BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);
-
- if (blocksize < PAGE_SIZE) {
- ext4_msg(sb, KERN_ERR, "can't mount with "
- "dioread_nolock if block size != PAGE_SIZE");
- return 0;
- }
- }
return 1;
}
@@ -3701,6 +3692,7 @@
set_opt(sb, NO_UID32);
/* xattr user namespace & acls are now defaulted on */
set_opt(sb, XATTR_USER);
+ set_opt(sb, DIOREAD_NOLOCK);
#ifdef CONFIG_EXT4_FS_POSIX_ACL
set_opt(sb, POSIX_ACL);
#endif
@@ -3838,9 +3830,8 @@
goto failed_mount;
if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA) {
- printk_once(KERN_WARNING "EXT4-fs: Warning: mounting "
- "with data=journal disables delayed "
- "allocation and O_DIRECT support!\n");
+ printk_once(KERN_WARNING "EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, and O_DIRECT support!\n");
+ clear_opt(sb, DIOREAD_NOLOCK);
if (test_opt2(sb, EXPLICIT_DELALLOC)) {
ext4_msg(sb, KERN_ERR, "can't mount with "
"both data=journal and delalloc");
diff --git a/fs/file_table.c b/fs/file_table.c
index e49af4c..023fd1e 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -276,6 +276,9 @@
}
if (file->f_op->release)
file->f_op->release(inode, file);
+
+ security_file_pre_free(file);
+
if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
!(file->f_mode & FMODE_PATH))) {
cdev_put(inode->i_cdev);
diff --git a/fs/namei.c b/fs/namei.c
index 327844f..5742d73 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -885,12 +885,65 @@
path_put(&last->link);
}
-int sysctl_protected_symlinks __read_mostly = 0;
-int sysctl_protected_hardlinks __read_mostly = 0;
+int sysctl_protected_symlinks __read_mostly = 1;
+int sysctl_protected_hardlinks __read_mostly = 1;
int sysctl_protected_fifos __read_mostly;
int sysctl_protected_regular __read_mostly;
/**
+ * nameidata_set_temporary - Used by Chromium OS LSM to check
+ * whether a mount point includes traversing symlinks.
+ */
+int nameidata_set_temporary(const char __user *dir_name)
+{
+ struct nameidata *tmp;
+ struct filename *name;
+
+ tmp = kmalloc(sizeof(*tmp), GFP_KERNEL);
+ if (unlikely(!tmp))
+ return -ENOMEM;
+ name = getname_flags(dir_name, LOOKUP_FOLLOW, NULL);
+ if (IS_ERR(name)) {
+ kfree(tmp);
+ return PTR_ERR(name);
+ }
+ set_nameidata(tmp, AT_FDCWD, name);
+ return 0;
+}
+
+/**
+ * nameidata_restore_temporary - Used by Chromium OS LSM to check
+ * whether a mount point includes traversing symlinks.
+ */
+void nameidata_restore_temporary(void)
+{
+ struct nameidata *tmp = current->nameidata;
+
+ restore_nameidata();
+ putname(tmp->name);
+ kfree(tmp);
+}
+
+/**
+ * nameidata_get_total_link_count - Used by security/chromiumos/lsm.c to check
+ * whether a mount point includes traversing symlinks.
+ */
+int nameidata_get_total_link_count(void)
+{
+ struct nameidata *tmp = current->nameidata;
+
+ if (unlikely(!tmp)) {
+ WARN(1, "Unexpectedly got here with current->nameidata == NULL");
+ /* Pretend we did traverse symlinks; that is the safe/sane
+ * result here from a security point of view...
+ */
+ return MAXSYMLINKS;
+ }
+ return tmp->total_link_count;
+}
+EXPORT_SYMBOL(nameidata_get_total_link_count);
+
+/**
* may_follow_link - Check symlink following for unsafe situations
* @nd: nameidata pathwalk data
*
diff --git a/fs/namespace.c b/fs/namespace.c
index 741f40c..dee73c0 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -719,8 +719,14 @@
goto done;
}
- if (!new)
- new = kmalloc(sizeof(struct mountpoint), GFP_KERNEL);
+ if (!new) {
+ /*
+ * We are allocating as GFP_NOFS to appease lockdep:
+ * since we are holding i_mutex we should not try to
+ * recurse into filesystem code.
+ */
+ new = kmalloc(sizeof(struct mountpoint), GFP_NOFS);
+ }
if (!new)
return ERR_PTR(-ENOMEM);
@@ -2736,12 +2742,19 @@
return -EINVAL;
/* ... and get the mountpoint */
- retval = user_path(dir_name, &path);
+ retval = nameidata_set_temporary(dir_name);
if (retval)
return retval;
+ retval = user_path(dir_name, &path);
+ if (retval) {
+ nameidata_restore_temporary();
+ return retval;
+ }
+
retval = security_sb_mount(dev_name, &path,
type_page, flags, data_page);
+ nameidata_restore_temporary();
if (!retval && !may_mount())
retval = -EPERM;
if (!retval && (flags & SB_MANDLOCK) && !may_mandlock())
diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index 97a5169..61a440b 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -702,6 +702,8 @@
struct fsnotify_group *group;
struct inode *inode;
struct path path;
+ struct path alteredpath;
+ struct path *canonical_path = &path;
struct fd f;
int ret;
unsigned flags = 0;
@@ -747,13 +749,22 @@
if (ret)
goto fput_and_out;
+ /* support stacked filesystems */
+ if (path.dentry && path.dentry->d_op) {
+ if (path.dentry->d_op->d_canonical_path) {
+ path.dentry->d_op->d_canonical_path(&path, &alteredpath);
+ canonical_path = &alteredpath;
+ path_put(&path);
+ }
+ }
+
/* inode held in place by reference to path; group by fget on fd */
- inode = path.dentry->d_inode;
+ inode = canonical_path->dentry->d_inode;
group = f.file->private_data;
/* create/update an inode mark */
ret = inotify_update_watch(group, inode, mask);
- path_put(&path);
+ path_put(canonical_path);
fput_and_out:
fdput(f);
return ret;
diff --git a/fs/nsfs.c b/fs/nsfs.c
index 30d150a4..ceab3a5f 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -246,6 +246,7 @@
fput(file);
return ERR_PTR(-EINVAL);
}
+EXPORT_SYMBOL(proc_ns_fget);
static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry)
{
diff --git a/fs/open.c b/fs/open.c
index 76996f9..0d2bd0a 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -34,8 +34,11 @@
#include "internal.h"
-int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
- struct file *filp)
+#define CREATE_TRACE_POINTS
+#include <trace/events/fs.h>
+
+int do_truncate2(struct vfsmount *mnt, struct dentry *dentry, loff_t length,
+ unsigned int time_attrs, struct file *filp)
{
int ret;
struct iattr newattrs;
@@ -65,6 +68,12 @@
return ret;
}
+int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
+ struct file *filp)
+{
+ return do_truncate2(NULL, dentry, length, time_attrs, filp);
+}
+
long vfs_truncate(const struct path *path, loff_t length)
{
struct inode *inode;
@@ -1089,6 +1098,7 @@
} else {
fsnotify_open(f);
fd_install(fd, f);
+ trace_do_sys_open(tmp->name, flags, mode);
}
}
putname(tmp);
diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 817c02b1..4d96a7c 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -97,3 +97,10 @@
Say Y if you are running any user-space software which takes benefit from
this interface. For example, rkt is such a piece of software.
+
+config PROC_UID
+ bool "Include /proc/uid/ files"
+ default y
+ depends on PROC_FS && RT_MUTEXES
+ help
+ Provides aggregated per-uid information under /proc/uid.
diff --git a/fs/proc/Makefile b/fs/proc/Makefile
index ead487e..3f849ca 100644
--- a/fs/proc/Makefile
+++ b/fs/proc/Makefile
@@ -27,6 +27,7 @@
proc-y += namespaces.o
proc-y += self.o
proc-y += thread_self.o
+proc-$(CONFIG_PROC_UID) += uid.o
proc-$(CONFIG_PROC_SYSCTL) += proc_sysctl.o
proc-$(CONFIG_NET) += proc_net.o
proc-$(CONFIG_PROC_KCORE) += kcore.o
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 3b9b726..40089b8 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -144,6 +144,12 @@
NULL, &proc_single_file_operations, \
{ .proc_show = show } )
+#ifdef CONFIG_SECURITY_CHROMIUMOS_READONLY_PROC_SELF_MEM
+# define PROC_PID_MEM_MODE S_IRUSR
+#else
+# define PROC_PID_MEM_MODE S_IRUSR|S_IWUSR
+#endif
+
/*
* Count the number of hardlinks for the pid_entry table, excluding the .
* and .. links.
@@ -876,7 +882,11 @@
static ssize_t mem_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
+#ifdef CONFIG_SECURITY_CHROMIUMOS_READONLY_PROC_SELF_MEM
+ return -EACCES;
+#else
return mem_rw(file, (char __user*)buf, count, ppos, 1);
+#endif
}
loff_t mem_lseek(struct file *file, loff_t offset, int orig)
@@ -2386,10 +2396,13 @@
return -ESRCH;
if (p != current) {
- if (!capable(CAP_SYS_NICE)) {
+ rcu_read_lock();
+ if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {
+ rcu_read_unlock();
count = -EPERM;
goto out;
}
+ rcu_read_unlock();
err = security_task_setscheduler(p);
if (err) {
@@ -2422,11 +2435,14 @@
return -ESRCH;
if (p != current) {
-
- if (!capable(CAP_SYS_NICE)) {
+ rcu_read_lock();
+ if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {
+ rcu_read_unlock();
err = -EPERM;
goto out;
}
+ rcu_read_unlock();
+
err = security_task_getscheduler(p);
if (err)
goto out;
@@ -2977,7 +2993,7 @@
#ifdef CONFIG_NUMA
REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations),
#endif
- REG("mem", S_IRUSR|S_IWUSR, proc_mem_operations),
+ REG("mem", PROC_PID_MEM_MODE, proc_mem_operations),
LNK("cwd", proc_cwd_link),
LNK("root", proc_root_link),
LNK("exe", proc_exe_link),
@@ -2989,6 +3005,7 @@
REG("smaps", S_IRUGO, proc_pid_smaps_operations),
REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
REG("pagemap", S_IRUSR, proc_pagemap_operations),
+ REG("totmaps", S_IRUGO, proc_totmaps_operations),
#endif
#ifdef CONFIG_SECURITY
DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
@@ -3363,7 +3380,7 @@
#ifdef CONFIG_NUMA
REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations),
#endif
- REG("mem", S_IRUSR|S_IWUSR, proc_mem_operations),
+ REG("mem", PROC_PID_MEM_MODE, proc_mem_operations),
LNK("cwd", proc_cwd_link),
LNK("root", proc_root_link),
LNK("exe", proc_exe_link),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 95b1419..a4cd4f5 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -84,6 +84,9 @@
struct task_struct *task);
};
+
+extern const struct file_operations proc_totmaps_operations;
+
struct proc_inode {
struct pid *pid;
unsigned int fd;
@@ -258,6 +261,15 @@
#endif
/*
+ * uid.c
+ */
+#ifdef CONFIG_PROC_UID
+extern int proc_uid_init(void);
+#else
+static inline void proc_uid_init(void) { }
+#endif
+
+/*
* proc_tty.c
*/
#ifdef CONFIG_TTY
@@ -285,6 +297,7 @@
struct mm_struct *mm;
#ifdef CONFIG_MMU
struct vm_area_struct *tail_vma;
+ struct mem_size_stats *mss;
#endif
#ifdef CONFIG_NUMA
struct mempolicy *task_mempolicy;
diff --git a/fs/proc/root.c b/fs/proc/root.c
index f4b1a9d..efc63a6 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -130,6 +130,7 @@
proc_symlink("mounts", NULL, "self/mounts");
proc_net_init();
+ proc_uid_init();
proc_mkdir("fs", NULL);
proc_mkdir("driver", NULL);
proc_create_mount_point("fs/nfsd"); /* somewhere for the nfsd filesystem to be mounted */
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index efa6273..006ae59 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -123,6 +123,56 @@
}
#endif
+static void seq_print_vma_name(struct seq_file *m, struct vm_area_struct *vma)
+{
+ const char __user *name = vma_get_anon_name(vma);
+ struct mm_struct *mm = vma->vm_mm;
+
+ unsigned long page_start_vaddr;
+ unsigned long page_offset;
+ unsigned long num_pages;
+ unsigned long max_len = NAME_MAX;
+ int i;
+
+ page_start_vaddr = (unsigned long)name & PAGE_MASK;
+ page_offset = (unsigned long)name - page_start_vaddr;
+ num_pages = DIV_ROUND_UP(page_offset + max_len, PAGE_SIZE);
+
+ seq_puts(m, "[anon:");
+
+ for (i = 0; i < num_pages; i++) {
+ int len;
+ int write_len;
+ const char *kaddr;
+ long pages_pinned;
+ struct page *page;
+
+ pages_pinned = get_user_pages_remote(current, mm,
+ page_start_vaddr, 1, 0, &page, NULL, NULL);
+ if (pages_pinned < 1) {
+ seq_puts(m, "<fault>]");
+ return;
+ }
+
+ kaddr = (const char *)kmap(page);
+ len = min(max_len, PAGE_SIZE - page_offset);
+ write_len = strnlen(kaddr + page_offset, len);
+ seq_write(m, kaddr + page_offset, write_len);
+ kunmap(page);
+ put_page(page);
+
+ /* if strnlen hit a null terminator then we're done */
+ if (write_len != len)
+ break;
+
+ max_len -= len;
+ page_offset = 0;
+ page_start_vaddr += PAGE_SIZE;
+ }
+
+ seq_putc(m, ']');
+}
+
static void vma_stop(struct proc_maps_private *priv)
{
struct mm_struct *mm = priv->mm;
@@ -348,8 +398,15 @@
goto done;
}
- if (is_stack(vma))
+ if (is_stack(vma)) {
name = "[stack]";
+ goto done;
+ }
+
+ if (vma_get_anon_name(vma)) {
+ seq_pad(m, ' ');
+ seq_print_vma_name(m, vma);
+ }
}
done:
@@ -421,17 +478,53 @@
unsigned long shared_hugetlb;
unsigned long private_hugetlb;
u64 pss;
+ u64 pss_anon;
+ u64 pss_file;
+ u64 pss_shmem;
u64 pss_locked;
u64 swap_pss;
bool check_shmem_swap;
};
+static void smaps_page_accumulate(struct mem_size_stats *mss,
+ struct page *page, unsigned long size, unsigned long pss,
+ bool dirty, bool locked, bool private)
+{
+ mss->pss += pss;
+
+ if (PageAnon(page))
+ mss->pss_anon += pss;
+ else if (PageSwapBacked(page))
+ mss->pss_shmem += pss;
+ else
+ mss->pss_file += pss;
+
+ if (locked)
+ mss->pss_locked += pss;
+
+ if (dirty || PageDirty(page)) {
+ if (private)
+ mss->private_dirty += size;
+ else
+ mss->shared_dirty += size;
+ } else {
+ if (private)
+ mss->private_clean += size;
+ else
+ mss->shared_clean += size;
+ }
+}
+
static void smaps_account(struct mem_size_stats *mss, struct page *page,
bool compound, bool young, bool dirty, bool locked)
{
int i, nr = compound ? 1 << compound_order(page) : 1;
unsigned long size = nr * PAGE_SIZE;
+ /*
+ * First accumulate quantities that depend only on |size| and the type
+ * of the compound page.
+ */
if (PageAnon(page)) {
mss->anonymous += size;
if (!PageSwapBacked(page) && !dirty && !PageDirty(page))
@@ -444,42 +537,26 @@
mss->referenced += size;
/*
+ * Then accumulate quantities that may depend on sharing, or that may
+ * differ page-by-page.
+ *
* page_count(page) == 1 guarantees the page is mapped exactly once.
* If any subpage of the compound page mapped with PTE it would elevate
* page_count().
*/
if (page_count(page) == 1) {
- if (dirty || PageDirty(page))
- mss->private_dirty += size;
- else
- mss->private_clean += size;
- mss->pss += (u64)size << PSS_SHIFT;
- if (locked)
- mss->pss_locked += (u64)size << PSS_SHIFT;
+ smaps_page_accumulate(mss, page, size, size << PSS_SHIFT, dirty,
+ locked, true);
return;
}
-
for (i = 0; i < nr; i++, page++) {
int mapcount = page_mapcount(page);
- unsigned long pss = (PAGE_SIZE << PSS_SHIFT);
+ bool private = mapcount < 2;
+ unsigned long pss = private ? PAGE_SIZE << PSS_SHIFT :
+ (PAGE_SIZE << PSS_SHIFT) / mapcount;
- if (mapcount >= 2) {
- if (dirty || PageDirty(page))
- mss->shared_dirty += PAGE_SIZE;
- else
- mss->shared_clean += PAGE_SIZE;
- mss->pss += pss / mapcount;
- if (locked)
- mss->pss_locked += pss / mapcount;
- } else {
- if (dirty || PageDirty(page))
- mss->private_dirty += PAGE_SIZE;
- else
- mss->private_clean += PAGE_SIZE;
- mss->pss += pss;
- if (locked)
- mss->pss_locked += pss;
- }
+ smaps_page_accumulate(mss, page, PAGE_SIZE, pss,
+ dirty, locked, private);
}
}
@@ -758,10 +835,21 @@
seq_put_decimal_ull_width(m, str, (val) >> 10, 8)
/* Show the contents common for smaps and smaps_rollup */
-static void __show_smap(struct seq_file *m, const struct mem_size_stats *mss)
+static void __show_smap(struct seq_file *m, const struct mem_size_stats *mss,
+ bool rollup_mode)
{
SEQ_PUT_DEC("Rss: ", mss->resident);
SEQ_PUT_DEC(" kB\nPss: ", mss->pss >> PSS_SHIFT);
+ if (rollup_mode) {
+ /*
+ * These are meaningful only for smaps_rollup, otherwise two of
+ * them are zero, and the other is the same as Pss.
+ */
+ SEQ_PUT_DEC(" kB\nPss_Anon: ",
+ mss->pss_anon >> PSS_SHIFT);
+ SEQ_PUT_DEC(" kB\nPss_File: ",
+ mss->pss_file >> PSS_SHIFT);
+ SEQ_PUT_DEC(" kB\nPss_Shmem: ",
+ mss->pss_shmem >> PSS_SHIFT);
+ }
SEQ_PUT_DEC(" kB\nShared_Clean: ", mss->shared_clean);
SEQ_PUT_DEC(" kB\nShared_Dirty: ", mss->shared_dirty);
SEQ_PUT_DEC(" kB\nPrivate_Clean: ", mss->private_clean);
@@ -792,13 +880,18 @@
smap_gather_stats(vma, &mss);
show_map_vma(m, vma);
+ if (vma_get_anon_name(vma)) {
+ seq_puts(m, "Name: ");
+ seq_print_vma_name(m, vma);
+ seq_putc(m, '\n');
+ }
SEQ_PUT_DEC("Size: ", vma->vm_end - vma->vm_start);
SEQ_PUT_DEC(" kB\nKernelPageSize: ", vma_kernel_pagesize(vma));
SEQ_PUT_DEC(" kB\nMMUPageSize: ", vma_mmu_pagesize(vma));
seq_puts(m, " kB\n");
- __show_smap(m, &mss);
+ __show_smap(m, &mss, false);
seq_printf(m, "THPeligible: %d\n", transparent_hugepage_enabled(vma));
@@ -811,6 +904,84 @@
return 0;
}
+static void add_smaps_sum(struct mem_size_stats *mss,
+ struct mem_size_stats *mss_sum)
+{
+ mss_sum->resident += mss->resident;
+ mss_sum->pss += mss->pss;
+ mss_sum->pss_anon += mss->pss_anon;
+ mss_sum->pss_file += mss->pss_file;
+ mss_sum->pss_shmem += mss->pss_shmem;
+ mss_sum->shared_clean += mss->shared_clean;
+ mss_sum->shared_dirty += mss->shared_dirty;
+ mss_sum->private_clean += mss->private_clean;
+ mss_sum->private_dirty += mss->private_dirty;
+ mss_sum->referenced += mss->referenced;
+ mss_sum->anonymous += mss->anonymous;
+ mss_sum->anonymous_thp += mss->anonymous_thp;
+ mss_sum->swap += mss->swap;
+}
+
+static int totmaps_proc_show(struct seq_file *m, void *data)
+{
+ struct proc_maps_private *priv = m->private;
+ struct mm_struct *mm;
+ struct vm_area_struct *vma;
+ struct mem_size_stats *mss_sum = priv->mss;
+
+ /*
+ * The reference to priv->task was already taken, but we need to get
+ * the mm here because the task could be in the process of exiting.
+ */
+ mm = get_task_mm(priv->task);
+ if (!mm || IS_ERR(mm))
+ return -EINVAL;
+
+ down_read(&mm->mmap_sem);
+ hold_task_mempolicy(priv);
+
+ for (vma = mm->mmap; vma != priv->tail_vma; vma = vma->vm_next) {
+ struct mem_size_stats mss;
+ struct mm_walk smaps_walk = {
+ .pmd_entry = smaps_pte_range,
+ .mm = vma->vm_mm,
+ .private = &mss,
+ };
+
+ if (vma->vm_mm && !is_vm_hugetlb_page(vma)) {
+ memset(&mss, 0, sizeof(mss));
+ walk_page_vma(vma, &smaps_walk);
+ add_smaps_sum(&mss, mss_sum);
+ }
+ }
+ seq_printf(m,
+ "Rss: %8lu kB\n"
+ "Pss: %8lu kB\n"
+ "Shared_Clean: %8lu kB\n"
+ "Shared_Dirty: %8lu kB\n"
+ "Private_Clean: %8lu kB\n"
+ "Private_Dirty: %8lu kB\n"
+ "Referenced: %8lu kB\n"
+ "Anonymous: %8lu kB\n"
+ "AnonHugePages: %8lu kB\n"
+ "Swap: %8lu kB\n",
+ mss_sum->resident >> 10,
+ (unsigned long)(mss_sum->pss >> (10 + PSS_SHIFT)),
+ mss_sum->shared_clean >> 10,
+ mss_sum->shared_dirty >> 10,
+ mss_sum->private_clean >> 10,
+ mss_sum->private_dirty >> 10,
+ mss_sum->referenced >> 10,
+ mss_sum->anonymous >> 10,
+ mss_sum->anonymous_thp >> 10,
+ mss_sum->swap >> 10);
+
+ release_task_mempolicy(priv);
+ up_read(&mm->mmap_sem);
+ mmput(mm);
+
+ return 0;
+}
+
static int show_smaps_rollup(struct seq_file *m, void *v)
{
struct proc_maps_private *priv = m->private;
@@ -848,7 +1019,7 @@
seq_pad(m, ' ');
seq_puts(m, "[rollup]\n");
- __show_smap(m, &mss);
+ __show_smap(m, &mss, true);
release_task_mempolicy(priv);
up_read(&mm->mmap_sem);
@@ -916,6 +1087,50 @@
return single_release(inode, file);
}
+static int totmaps_open(struct inode *inode, struct file *file)
+{
+ struct proc_maps_private *priv;
+ int ret = -ENOMEM;
+ priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+ if (priv) {
+ priv->mss = kzalloc(sizeof(*priv->mss), GFP_KERNEL);
+ if (!priv->mss)
+ return -ENOMEM;
+
+ /*
+ * We need to grab a reference to the task_struct at open time,
+ * because there's a potential information leak where the totmaps
+ * file is opened and held open while the underlying pid-to-task
+ * mapping changes underneath it.
+ */
+ priv->task = get_pid_task(proc_pid(inode), PIDTYPE_PID);
+ if (!priv->task) {
+ kfree(priv->mss);
+ kfree(priv);
+ return -ESRCH;
+ }
+
+ ret = single_open(file, totmaps_proc_show, priv);
+ if (ret) {
+ put_task_struct(priv->task);
+ kfree(priv->mss);
+ kfree(priv);
+ }
+ }
+ return ret;
+}
+
+static int totmaps_release(struct inode *inode, struct file *file)
+{
+ struct seq_file *m = file->private_data;
+ struct proc_maps_private *priv = m->private;
+
+ put_task_struct(priv->task);
+ kfree(priv->mss);
+ kfree(priv);
+ m->private = NULL;
+ return single_release(inode, file);
+}
+
const struct file_operations proc_pid_smaps_operations = {
.open = pid_smaps_open,
.read = seq_read,
@@ -930,6 +1145,13 @@
.release = smaps_rollup_release,
};
+const struct file_operations proc_totmaps_operations = {
+ .open = totmaps_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = totmaps_release,
+};
+
enum clear_refs_types {
CLEAR_REFS_ALL = 1,
CLEAR_REFS_ANON,
diff --git a/fs/proc/uid.c b/fs/proc/uid.c
new file mode 100644
index 0000000..ae720b9
--- /dev/null
+++ b/fs/proc/uid.c
@@ -0,0 +1,290 @@
+/*
+ * /proc/uid support
+ */
+
+#include <linux/fs.h>
+#include <linux/hashtable.h>
+#include <linux/init.h>
+#include <linux/proc_fs.h>
+#include <linux/rtmutex.h>
+#include <linux/sched.h>
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+#include "internal.h"
+
+static struct proc_dir_entry *proc_uid;
+
+#define UID_HASH_BITS 10
+
+static DECLARE_HASHTABLE(proc_uid_hash_table, UID_HASH_BITS);
+
+/*
+ * use rt_mutex here to avoid priority inversion between high-priority readers
+ * of these files and tasks calling proc_register_uid().
+ */
+static DEFINE_RT_MUTEX(proc_uid_lock); /* proc_uid_hash_table */
+
+struct uid_hash_entry {
+ uid_t uid;
+ struct hlist_node hash;
+};
+
+/* Caller must hold proc_uid_lock */
+static bool uid_hash_entry_exists_locked(uid_t uid)
+{
+ struct uid_hash_entry *entry;
+
+ hash_for_each_possible(proc_uid_hash_table, entry, hash, uid) {
+ if (entry->uid == uid)
+ return true;
+ }
+ return false;
+}
+
+void proc_register_uid(kuid_t kuid)
+{
+ struct uid_hash_entry *entry;
+ bool exists;
+ uid_t uid = from_kuid_munged(current_user_ns(), kuid);
+
+ rt_mutex_lock(&proc_uid_lock);
+ exists = uid_hash_entry_exists_locked(uid);
+ rt_mutex_unlock(&proc_uid_lock);
+ if (exists)
+ return;
+
+ entry = kzalloc(sizeof(struct uid_hash_entry), GFP_KERNEL);
+ if (!entry)
+ return;
+ entry->uid = uid;
+
+ rt_mutex_lock(&proc_uid_lock);
+ if (uid_hash_entry_exists_locked(uid))
+ kfree(entry);
+ else
+ hash_add(proc_uid_hash_table, &entry->hash, uid);
+ rt_mutex_unlock(&proc_uid_lock);
+}
+
+struct uid_entry {
+ const char *name;
+ int len;
+ umode_t mode;
+ const struct inode_operations *iop;
+ const struct file_operations *fop;
+};
+
+#define NOD(NAME, MODE, IOP, FOP) { \
+ .name = (NAME), \
+ .len = sizeof(NAME) - 1, \
+ .mode = MODE, \
+ .iop = IOP, \
+ .fop = FOP, \
+}
+
+static const struct uid_entry uid_base_stuff[] = {};
+
+static const struct inode_operations proc_uid_def_inode_operations = {
+ .setattr = proc_setattr,
+};
+
+static struct inode *proc_uid_make_inode(struct super_block *sb, kuid_t kuid)
+{
+ struct inode *inode;
+
+ inode = new_inode(sb);
+ if (!inode)
+ return NULL;
+
+ inode->i_ino = get_next_ino();
+ inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
+ inode->i_op = &proc_uid_def_inode_operations;
+ inode->i_uid = kuid;
+
+ return inode;
+}
+
+static struct dentry *proc_uident_instantiate(struct dentry *dentry,
+ struct task_struct *unused, const void *ptr)
+{
+ const struct uid_entry *u = ptr;
+ struct inode *inode;
+
+ uid_t uid = name_to_int(&dentry->d_name);
+ kuid_t kuid;
+ bool uid_exists;
+ rt_mutex_lock(&proc_uid_lock);
+ uid_exists = uid_hash_entry_exists_locked(uid);
+ rt_mutex_unlock(&proc_uid_lock);
+ if (uid_exists) {
+ kuid = make_kuid(current_user_ns(), uid);
+ inode = proc_uid_make_inode(dentry->d_sb, kuid);
+ if (!inode)
+ return ERR_PTR(-ENOENT);
+ } else {
+ return ERR_PTR(-ENOENT);
+ }
+
+ inode->i_mode = u->mode;
+ if (S_ISDIR(inode->i_mode))
+ set_nlink(inode, 2);
+ if (u->iop)
+ inode->i_op = u->iop;
+ if (u->fop)
+ inode->i_fop = u->fop;
+
+ return d_splice_alias(inode, dentry);
+}
+
+static struct dentry *proc_uid_base_lookup(struct inode *dir,
+ struct dentry *dentry,
+ unsigned int flags)
+{
+ const struct uid_entry *u, *last;
+ unsigned int nents = ARRAY_SIZE(uid_base_stuff);
+
+ if (nents == 0)
+ return ERR_PTR(-ENOENT);
+
+ last = &uid_base_stuff[nents - 1];
+ for (u = uid_base_stuff; u <= last; u++) {
+ if (u->len != dentry->d_name.len)
+ continue;
+ if (!memcmp(dentry->d_name.name, u->name, u->len))
+ break;
+ }
+ if (u > last)
+ return ERR_PTR(-ENOENT);
+
+ return proc_uident_instantiate(dentry, NULL, u);
+}
+
+static int proc_uid_base_readdir(struct file *file, struct dir_context *ctx)
+{
+ unsigned int nents = ARRAY_SIZE(uid_base_stuff);
+ const struct uid_entry *u;
+
+ if (!dir_emit_dots(file, ctx))
+ return 0;
+
+ if (ctx->pos >= nents + 2)
+ return 0;
+
+ for (u = uid_base_stuff + (ctx->pos - 2);
+ u < uid_base_stuff + nents; u++) {
+ if (!proc_fill_cache(file, ctx, u->name, u->len,
+ proc_uident_instantiate, NULL, u))
+ break;
+ ctx->pos++;
+ }
+
+ return 0;
+}
+
+static const struct inode_operations proc_uid_base_inode_operations = {
+ .lookup = proc_uid_base_lookup,
+ .setattr = proc_setattr,
+};
+
+static const struct file_operations proc_uid_base_operations = {
+ .read = generic_read_dir,
+ .iterate = proc_uid_base_readdir,
+ .llseek = default_llseek,
+};
+
+static struct dentry *proc_uid_instantiate(struct dentry *dentry,
+ struct task_struct *unused, const void *ptr)
+{
+ unsigned int i, len;
+ nlink_t nlinks;
+ kuid_t *kuid = (kuid_t *)ptr;
+ struct inode *inode = proc_uid_make_inode(dentry->d_sb, *kuid);
+
+ if (!inode)
+ return ERR_PTR(-ENOENT);
+
+ inode->i_mode = S_IFDIR | 0555;
+ inode->i_op = &proc_uid_base_inode_operations;
+ inode->i_fop = &proc_uid_base_operations;
+ inode->i_flags |= S_IMMUTABLE;
+
+ nlinks = 2;
+ len = ARRAY_SIZE(uid_base_stuff);
+ for (i = 0; i < len; ++i) {
+ if (S_ISDIR(uid_base_stuff[i].mode))
+ ++nlinks;
+ }
+ set_nlink(inode, nlinks);
+
+ return d_splice_alias(inode, dentry);
+}
+
+static int proc_uid_readdir(struct file *file, struct dir_context *ctx)
+{
+ int last_shown, i;
+ unsigned long bkt;
+ struct uid_hash_entry *entry;
+
+ if (!dir_emit_dots(file, ctx))
+ return 0;
+
+ i = 0;
+ last_shown = ctx->pos - 2;
+ rt_mutex_lock(&proc_uid_lock);
+ hash_for_each(proc_uid_hash_table, bkt, entry, hash) {
+ int len;
+ char buf[PROC_NUMBUF];
+
+ if (i < last_shown) {
+ i++;
+ continue;
+ }
+ len = snprintf(buf, sizeof(buf), "%u", entry->uid);
+ if (!proc_fill_cache(file, ctx, buf, len,
+ proc_uid_instantiate, NULL, &entry->uid))
+ break;
+ i++;
+ ctx->pos++;
+ }
+ rt_mutex_unlock(&proc_uid_lock);
+ return 0;
+}
+
+static struct dentry *proc_uid_lookup(struct inode *dir, struct dentry *dentry,
+ unsigned int flags)
+{
+ int result = -ENOENT;
+
+ uid_t uid = name_to_int(&dentry->d_name);
+ bool uid_exists;
+
+ rt_mutex_lock(&proc_uid_lock);
+ uid_exists = uid_hash_entry_exists_locked(uid);
+ rt_mutex_unlock(&proc_uid_lock);
+ if (uid_exists) {
+ kuid_t kuid = make_kuid(current_user_ns(), uid);
+
+ return proc_uid_instantiate(dentry, NULL, &kuid);
+ }
+ return ERR_PTR(result);
+}
+
+static const struct file_operations proc_uid_operations = {
+ .read = generic_read_dir,
+ .iterate = proc_uid_readdir,
+ .llseek = default_llseek,
+};
+
+static const struct inode_operations proc_uid_inode_operations = {
+ .lookup = proc_uid_lookup,
+ .setattr = proc_setattr,
+};
+
+int __init proc_uid_init(void)
+{
+ proc_uid = proc_mkdir("uid", NULL);
+ if (!proc_uid)
+ return -ENOMEM;
+ proc_uid->proc_iops = &proc_uid_inode_operations;
+ proc_uid->proc_fops = &proc_uid_operations;
+
+ return 0;
+}
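
The new fs/proc/uid.c above creates a /proc/uid directory with one numeric subdirectory per UID that has been passed to proc_register_uid(); entries are added on demand and never removed. As a rough illustration (not part of the patch), a userspace program could enumerate the registered UIDs as below; it assumes only what the readdir/lookup code above exposes:

/* Illustrative only: enumerate the per-UID directories under /proc/uid. */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	DIR *dir = opendir("/proc/uid");
	struct dirent *de;

	if (!dir) {
		perror("opendir /proc/uid");
		return EXIT_FAILURE;
	}
	while ((de = readdir(dir)) != NULL) {
		/* Skip "." and ".."; every other entry is a UID directory. */
		if (de->d_name[0] != '.')
			printf("registered uid: %s\n", de->d_name);
	}
	closedir(dir);
	return EXIT_SUCCESS;
}
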
diff --git a/fs/sync.c b/fs/sync.c
index b54e054..055daab 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -9,7 +9,7 @@
#include <linux/slab.h>
#include <linux/export.h>
#include <linux/namei.h>
-#include <linux/sched.h>
+#include <linux/sched/xacct.h>
#include <linux/writeback.h>
#include <linux/syscalls.h>
#include <linux/linkage.h>
@@ -220,6 +220,7 @@
if (f.file) {
ret = vfs_fsync(f.file, datasync);
fdput(f);
+ inc_syscfs(current);
}
return ret;
}
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index d269d11..0e89a6d 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -913,7 +913,9 @@
new_flags, vma->anon_vma,
vma->vm_file, vma->vm_pgoff,
vma_policy(vma),
- NULL_VM_UFFD_CTX);
+ NULL_VM_UFFD_CTX,
+ vma_get_anon_name(vma));
+
if (prev)
vma = prev;
else
@@ -1463,7 +1465,8 @@
prev = vma_merge(mm, prev, start, vma_end, new_flags,
vma->anon_vma, vma->vm_file, vma->vm_pgoff,
vma_policy(vma),
- ((struct vm_userfaultfd_ctx){ ctx }));
+ ((struct vm_userfaultfd_ctx){ ctx }),
+ vma_get_anon_name(vma));
if (prev) {
vma = prev;
goto next;
@@ -1625,7 +1628,8 @@
prev = vma_merge(mm, prev, start, vma_end, new_flags,
vma->anon_vma, vma->vm_file, vma->vm_pgoff,
vma_policy(vma),
- NULL_VM_UFFD_CTX);
+ NULL_VM_UFFD_CTX,
+ vma_get_anon_name(vma));
if (prev) {
vma = prev;
goto next;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 2595496..3a56299 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1151,11 +1151,14 @@
struct file *filp,
struct vm_area_struct *vma)
{
+ struct dax_device *dax_dev;
+
+ dax_dev = xfs_find_daxdev_for_inode(file_inode(filp));
/*
- * We don't support synchronous mappings for non-DAX files. At least
- * until someone comes with a sensible use case.
+ * We don't support synchronous mappings for non-DAX files and
+ * for DAX files if underneath dax_device is not synchronous.
*/
- if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
+ if (!daxdev_mapping_supported(vma, dax_dev))
return -EOPNOTSUPP;
file_accessed(filp);
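
For reference, the daxdev_mapping_supported() helper that replaces the open-coded VM_SYNC check is a small predicate in include/linux/dax.h. The sketch below paraphrases its logic from the upstream header; treat it as an approximation rather than a verbatim quotation:

/* Approximate logic of daxdev_mapping_supported() (include/linux/dax.h). */
static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
					    struct dax_device *dax_dev)
{
	/* No MAP_SYNC requested: nothing special to support. */
	if (!(vma->vm_flags & VM_SYNC))
		return true;
	/* MAP_SYNC on a non-DAX file is never supported. */
	if (!IS_DAX(file_inode(vma->vm_file)))
		return false;
	/* Otherwise the backing dax_device itself must be synchronous. */
	return dax_synchronous(dax_dev);
}
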
diff --git a/include/linux/alt-syscall.h b/include/linux/alt-syscall.h
new file mode 100644
index 0000000..00f37c0
--- /dev/null
+++ b/include/linux/alt-syscall.h
@@ -0,0 +1,59 @@
+#ifndef _ALT_SYSCALL_H
+#define _ALT_SYSCALL_H
+
+#include <linux/errno.h>
+
+#ifdef CONFIG_ALT_SYSCALL
+
+#include <linux/list.h>
+#include <asm/syscall.h>
+
+#define ALT_SYS_CALL_NAME_MAX 32
+
+struct alt_sys_call_table {
+ char name[ALT_SYS_CALL_NAME_MAX + 1];
+ sys_call_ptr_t *table;
+ int size;
+#ifdef CONFIG_IA32_EMULATION
+ sys_call_ptr_t *compat_table;
+ int compat_size;
+#endif
+ struct list_head node;
+};
+
+/*
+ * arch_dup_sys_call_table should return the default syscall table, not
+ * the current syscall table, since we want to explicitly not allow
+ * syscall table composition. A selected syscall table should be treated
+ * as a single execution personality.
+ */
+
+int arch_dup_sys_call_table(struct alt_sys_call_table *table);
+int arch_set_sys_call_table(struct alt_sys_call_table *table);
+
+int register_alt_sys_call_table(struct alt_sys_call_table *table);
+int set_alt_sys_call_table(char __user *name);
+
+#else
+
+struct alt_sys_call_table;
+
+static inline int arch_dup_sys_call_table(struct alt_sys_call_table *table)
+{
+ return -ENOSYS;
+}
+static inline int arch_set_sys_call_table(struct alt_sys_call_table *table)
+{
+ return -ENOSYS;
+}
+static inline int register_alt_sys_call_table(struct alt_sys_call_table *table)
+{
+ return -ENOSYS;
+}
+static inline int set_alt_sys_call_table(char __user *name)
+{
+ return -ENOSYS;
+}
+#endif
+
+#endif /* _ALT_SYSCALL_H */
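
The header above only declares the alt-syscall registration API; arch_dup_sys_call_table() hands back a copy of the default table so that a policy can be built from it and registered as a single execution personality. The fragment below is a hedged sketch of that flow using only the declarations shown; the table name, the initcall, and the comment about replacing entries are illustrative assumptions, not part of this patch:

/* Hedged sketch of a kernel-side consumer of <linux/alt-syscall.h>. */
#include <linux/alt-syscall.h>
#include <linux/init.h>
#include <linux/string.h>

static struct alt_sys_call_table example_table;

static int __init example_alt_syscall_init(void)
{
	int err;

	/* Start from the default table; composing tables is deliberately
	 * not supported, so this copy is a complete personality. */
	err = arch_dup_sys_call_table(&example_table);
	if (err)
		return err;

	strscpy(example_table.name, "example", sizeof(example_table.name));

	/* A real policy would now point selected entries of
	 * example_table.table (and compat_table under IA32 emulation)
	 * at restricted handlers before registering the table. */
	return register_alt_sys_call_table(&example_table);
}
late_initcall(example_alt_syscall_init);
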
diff --git a/include/linux/audit.h b/include/linux/audit.h
index 9334fbe..bea2b15 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -85,6 +85,17 @@
u32 op;
};
+struct audit_task_info {
+ kuid_t loginuid;
+ unsigned int sessionid;
+ u64 contid;
+#ifdef CONFIG_AUDITSYSCALL
+ struct audit_context *ctx;
+#endif
+};
+
+extern struct audit_task_info init_struct_audit;
+
extern int is_audit_feature_set(int which);
extern int __init audit_register_class(int class, unsigned *list);
@@ -123,6 +134,9 @@
#ifdef CONFIG_AUDIT
/* These are defined in audit.c */
/* Public API */
+extern int audit_alloc(struct task_struct *task);
+extern void audit_free(struct task_struct *task);
+extern void __init audit_task_init(void);
extern __printf(4, 5)
void audit_log(struct audit_context *ctx, gfp_t gfp_mask, int type,
const char *fmt, ...);
@@ -162,8 +176,39 @@
extern int audit_rule_change(int type, int seq, void *data, size_t datasz);
extern int audit_list_rules_send(struct sk_buff *request_skb, int seq);
+static inline kuid_t audit_get_loginuid(struct task_struct *tsk)
+{
+ if (!tsk->audit)
+ return INVALID_UID;
+ return tsk->audit->loginuid;
+}
+
+static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
+{
+ if (!tsk->audit)
+ return AUDIT_SID_UNSET;
+ return tsk->audit->sessionid;
+}
+
+extern int audit_set_contid(struct task_struct *tsk, u64 contid);
+
+static inline u64 audit_get_contid(struct task_struct *tsk)
+{
+ if (!tsk->audit)
+ return AUDIT_CID_UNSET;
+ return tsk->audit->contid;
+}
+
extern u32 audit_enabled;
#else /* CONFIG_AUDIT */
+static inline int audit_alloc(struct task_struct *task)
+{
+ return 0;
+}
+static inline void audit_free(struct task_struct *task)
+{ }
+static inline void __init audit_task_init(void)
+{ }
static inline __printf(4, 5)
void audit_log(struct audit_context *ctx, gfp_t gfp_mask, int type,
const char *fmt, ...)
@@ -205,6 +250,12 @@
static inline void audit_log_task_info(struct audit_buffer *ab,
struct task_struct *tsk)
{ }
+
+static inline u64 audit_get_contid(struct task_struct *tsk)
+{
+ return AUDIT_CID_UNSET;
+}
+
#define audit_enabled AUDIT_OFF
#endif /* CONFIG_AUDIT */
@@ -219,8 +270,6 @@
/* These are defined in auditsc.c */
/* Public API */
-extern int audit_alloc(struct task_struct *task);
-extern void __audit_free(struct task_struct *task);
extern void __audit_syscall_entry(int major, unsigned long a0, unsigned long a1,
unsigned long a2, unsigned long a3);
extern void __audit_syscall_exit(int ret_success, long ret_value);
@@ -242,12 +291,14 @@
static inline void audit_set_context(struct task_struct *task, struct audit_context *ctx)
{
- task->audit_context = ctx;
+ task->audit->ctx = ctx;
}
static inline struct audit_context *audit_context(void)
{
- return current->audit_context;
+ if (!current->audit)
+ return NULL;
+ return current->audit->ctx;
}
static inline bool audit_dummy_context(void)
@@ -255,11 +306,7 @@
void *p = audit_context();
return !p || *(int *)p;
}
-static inline void audit_free(struct task_struct *task)
-{
- if (unlikely(task->audit_context))
- __audit_free(task);
-}
+
static inline void audit_syscall_entry(int major, unsigned long a0,
unsigned long a1, unsigned long a2,
unsigned long a3)
@@ -329,16 +376,6 @@
struct timespec64 *t, unsigned int *serial);
extern int audit_set_loginuid(kuid_t loginuid);
-static inline kuid_t audit_get_loginuid(struct task_struct *tsk)
-{
- return tsk->loginuid;
-}
-
-static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
-{
- return tsk->sessionid;
-}
-
extern void __audit_ipc_obj(struct kern_ipc_perm *ipcp);
extern void __audit_ipc_set_perm(unsigned long qbytes, uid_t uid, gid_t gid, umode_t mode);
extern void __audit_bprm(struct linux_binprm *bprm);
@@ -461,12 +498,6 @@
extern int audit_n_rules;
extern int audit_signals;
#else /* CONFIG_AUDITSYSCALL */
-static inline int audit_alloc(struct task_struct *task)
-{
- return 0;
-}
-static inline void audit_free(struct task_struct *task)
-{ }
static inline void audit_syscall_entry(int major, unsigned long a0,
unsigned long a1, unsigned long a2,
unsigned long a3)
@@ -595,6 +626,16 @@
return uid_valid(audit_get_loginuid(tsk));
}
+static inline bool audit_contid_valid(u64 contid)
+{
+ return contid != AUDIT_CID_UNSET;
+}
+
+static inline bool audit_contid_set(struct task_struct *tsk)
+{
+ return audit_contid_valid(audit_get_contid(tsk));
+}
+
static inline void audit_log_string(struct audit_buffer *ab, const char *buf)
{
audit_log_n_string(ab, buf, strlen(buf));
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 745b2d0..ec08bba 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -538,7 +538,7 @@
/*
* mq queue kobject
*/
- struct kobject mq_kobj;
+ struct kobject *mq_kobj;
#ifdef CONFIG_BLK_DEV_INTEGRITY
struct blk_integrity integrity;
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index acb77dcf..8996c09 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -21,6 +21,10 @@
SUBSYS(cpuacct)
#endif
+#if IS_ENABLED(CONFIG_SCHED_TUNE)
+SUBSYS(schedtune)
+#endif
+
#if IS_ENABLED(CONFIG_BLK_CGROUP)
SUBSYS(io)
#endif
diff --git a/include/linux/clk.h b/include/linux/clk.h
index 4f750c4..c705271d 100644
--- a/include/linux/clk.h
+++ b/include/linux/clk.h
@@ -312,7 +312,26 @@
*/
int __must_check clk_bulk_get(struct device *dev, int num_clks,
struct clk_bulk_data *clks);
-
+/**
+ * clk_bulk_get_all - lookup and obtain all available references to clock
+ * producer.
+ * @dev: device for clock "consumer"
+ * @clks: pointer to the clk_bulk_data table of consumer
+ *
+ * This helper function allows drivers to get all clk consumers in one
+ * operation. If any of the clks cannot be acquired, then any clks
+ * that were obtained will be freed before returning to the caller.
+ *
+ * Returns a positive value for the number of clocks obtained while the
+ * clock references are stored in the clk_bulk_data table in @clks field.
+ * Returns 0 if there are none and a negative value if something failed.
+ *
+ * Drivers must assume that the clock source is not enabled.
+ *
+ * clk_bulk_get_all should not be called from within interrupt context.
+ */
+int __must_check clk_bulk_get_all(struct device *dev,
+ struct clk_bulk_data **clks);
/**
* devm_clk_bulk_get - managed get multiple clk consumers
* @dev: device for clock "consumer"
@@ -327,6 +346,22 @@
*/
int __must_check devm_clk_bulk_get(struct device *dev, int num_clks,
struct clk_bulk_data *clks);
+/**
+ * devm_clk_bulk_get_all - managed get multiple clk consumers
+ * @dev: device for clock "consumer"
+ * @clks: pointer to the clk_bulk_data table of consumer
+ *
+ * Returns a positive value for the number of clocks obtained while the
+ * clock references are stored in the clk_bulk_data table in @clks field.
+ * Returns 0 if there are none and a negative value if something failed.
+ *
+ * This helper function allows drivers to get several clk
+ * consumers in one operation with management; the clks will
+ * automatically be freed when the device is unbound.
+ */
+
+int __must_check devm_clk_bulk_get_all(struct device *dev,
+ struct clk_bulk_data **clks);
/**
* devm_clk_get - lookup and obtain a managed reference to a clock producer.
@@ -488,6 +523,19 @@
void clk_bulk_put(int num_clks, struct clk_bulk_data *clks);
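
The *_bulk_get_all() helpers documented above pair naturally with the existing clk_bulk_prepare_enable()/clk_bulk_disable_unprepare() helpers. The probe/remove fragment below is a rough usage sketch; the driver name, private structure, and platform_driver glue are illustrative and not part of this patch:

/* Illustrative driver fragment using devm_clk_bulk_get_all(). */
#include <linux/clk.h>
#include <linux/platform_device.h>
#include <linux/slab.h>

struct example_priv {
	int num_clks;
	struct clk_bulk_data *clks;
};

static int example_probe(struct platform_device *pdev)
{
	struct example_priv *priv;
	int ret;

	priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
	if (!priv)
		return -ENOMEM;

	/* Grab every clock listed for the device; returns the count, 0 if none. */
	ret = devm_clk_bulk_get_all(&pdev->dev, &priv->clks);
	if (ret < 0)
		return ret;
	priv->num_clks = ret;

	ret = clk_bulk_prepare_enable(priv->num_clks, priv->clks);
	if (ret)
		return ret;

	platform_set_drvdata(pdev, priv);
	return 0;
}

static int example_remove(struct platform_device *pdev)
{
	struct example_priv *priv = platform_get_drvdata(pdev);

	clk_bulk_disable_unprepare(priv->num_clks, priv->clks);
	return 0;
}
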