L1TF - L1 Terminal Fault
========================

L1 Terminal Fault is a hardware vulnerability which allows unprivileged
speculative access to data which is available in the Level 1 Data Cache
when the page table entry controlling the virtual address used for the
access has the Present bit cleared or other reserved bits set.

Affected processors
-------------------

This vulnerability affects a wide range of Intel processors. The
vulnerability is not present on:

   - Processors from AMD, Centaur and other non-Intel vendors

   - Older processor models, where the CPU family is < 6

   - A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft,
     Penwell, Pineview, Silvermont, Airmont, Merrifield)

   - The Intel XEON PHI family

   - Intel processors which have the ARCH_CAP_RDCL_NO bit set in the
     IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is not affected
     by the Meltdown vulnerability either. These CPUs should become
     available by end of 2018.

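
As an illustration, the MSR bit can be queried from user space with a small
script. The sketch below is not part of the kernel; it assumes root
privileges and that the 'msr' kernel module is loaded, and the MSR number
(0x10a) and the RDCL_NO bit position are taken from the Intel SDM::

   #!/usr/bin/env python3
   # Minimal sketch: check the RDCL_NO bit in IA32_ARCH_CAPABILITIES.
   import struct

   IA32_ARCH_CAPABILITIES = 0x10A   # MSR number per the Intel SDM
   RDCL_NO = 1 << 0                 # bit 0: CPU reports itself as not affected

   def read_msr(msr, cpu=0):
       # /dev/cpu/<n>/msr is provided by the 'msr' kernel module (needs root).
       with open(f"/dev/cpu/{cpu}/msr", "rb") as f:
           f.seek(msr)
           return struct.unpack("<Q", f.read(8))[0]

   try:
       caps = read_msr(IA32_ARCH_CAPABILITIES)
       if caps & RDCL_NO:
           print("RDCL_NO set: CPU reports itself as not affected")
       else:
           print("RDCL_NO clear: CPU may be affected")
   except OSError as exc:
       # The read fails with EIO if the MSR does not exist on this CPU.
       print(f"could not read IA32_ARCH_CAPABILITIES: {exc}")
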
Whether a processor is affected or not can be read out from the L1TF
vulnerability file in sysfs. See :ref:`l1tf_sys_info`.

Related CVEs
------------

The following CVE entries are related to the L1TF vulnerability:

   =============  =================  ==============================
   CVE-2018-3615  L1 Terminal Fault  SGX related aspects
   CVE-2018-3620  L1 Terminal Fault  OS, SMM related aspects
   CVE-2018-3646  L1 Terminal Fault  Virtualization related aspects
   =============  =================  ==============================

Problem
-------

If an instruction accesses a virtual address for which the relevant page
table entry (PTE) has the Present bit cleared or other reserved bits set,
then speculative execution ignores the invalid PTE and loads the referenced
data if it is present in the Level 1 Data Cache, as if the page referenced
by the address bits in the PTE was still present and accessible.

While this is a purely speculative mechanism and the instruction will raise
a page fault when it is retired eventually, the mere act of loading the
data and making it available to other speculative instructions opens up the
opportunity for unprivileged malicious code to mount side channel attacks,
similar to the Meltdown attack.

While Meltdown breaks the user space to kernel space protection, L1TF
allows attacking any physical memory address in the system and the attack
works across all protection domains. It allows an attack on SGX and also
works from inside virtual machines because the speculation bypasses the
extended page table (EPT) protection mechanism.


Attack scenarios
----------------

1. Malicious user space
^^^^^^^^^^^^^^^^^^^^^^^

   Operating Systems store arbitrary information in the address bits of a
   PTE which is marked not present. This allows a malicious user space
   application to attack the physical memory to which these PTEs resolve.
   In some cases user-space can maliciously influence the information
   encoded in the address bits of the PTE, thus making attacks more
   deterministic and more practical.

   The Linux kernel contains a mitigation for this attack vector, PTE
   inversion, which is permanently enabled and has no performance
   impact. The kernel ensures that the address bits of PTEs, which are not
   marked present, never point to cacheable physical memory space.

   A system with an up-to-date kernel is protected against attacks from
   malicious user space applications.

2. Malicious guest in a virtual machine
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

   The fact that L1TF breaks all domain protections allows malicious guest
   OSes, which can control the PTEs directly, and malicious guest user
   space applications, which run on an unprotected guest kernel lacking the
   PTE inversion mitigation for L1TF, to attack physical host memory.

   A special aspect of L1TF in the context of virtualization is
   simultaneous multi-threading (SMT). The Intel implementation of SMT is
   called HyperThreading. The fact that Hyperthreads on the affected
   processors share the L1 Data Cache (L1D) is important for this. As the
   flaw only allows attacking data which is present in the L1D, a malicious
   guest running on one Hyperthread can attack the data which is brought
   into the L1D by the context which runs on the sibling Hyperthread of the
   same physical core. This context can be host OS, host user space or a
   different guest.

   If the processor does not support Extended Page Tables, the attack is
   only possible when the hypervisor does not sanitize the content of the
   effective (shadow) page tables.

   While solutions exist to mitigate these attack vectors fully, these
   mitigations are not enabled by default in the Linux kernel because they
   can affect performance significantly. The kernel provides several
   mechanisms which can be utilized to address the problem depending on the
   deployment scenario. The mitigations, their protection scope and impact
   are described in the next sections.

   The default mitigations and the rationale for choosing them are explained
   at the end of this document. See :ref:`default_mitigations`.

.. _l1tf_sys_info:

L1TF system information
-----------------------

The Linux kernel provides a sysfs interface to enumerate the current L1TF
status of the system: whether the system is vulnerable, and which
mitigations are active. The relevant sysfs file is:

/sys/devices/system/cpu/vulnerabilities/l1tf

The possible values in this file are:

  ===========================   ===============================
  'Not affected'                The processor is not vulnerable
  'Mitigation: PTE Inversion'   The host protection is active
  ===========================   ===============================

If KVM/VMX is enabled and the processor is vulnerable then the following
information is appended to the 'Mitigation: PTE Inversion' part:

  - SMT status:

    =====================  ================
    'VMX: SMT vulnerable'  SMT is enabled
    'VMX: SMT disabled'    SMT is disabled
    =====================  ================

  - L1D Flush mode:

    ================================  ====================================
    'L1D vulnerable'                  L1D flushing is disabled

    'L1D conditional cache flushes'   L1D flush is conditionally enabled

    'L1D cache flushes'               L1D flush is unconditionally enabled
    ================================  ====================================

The resulting grade of protection is discussed in the following sections.
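
For scripting, the status string can simply be read from sysfs and matched
against the values listed above. A minimal sketch, which only looks for the
substrings shown in the tables (the exact wording may vary between kernel
versions)::

   #!/usr/bin/env python3
   # Minimal sketch: read and loosely interpret the L1TF status string.
   PATH = "/sys/devices/system/cpu/vulnerabilities/l1tf"

   with open(PATH) as f:
       status = f.read().strip()

   print(f"l1tf status: {status}")

   if status == "Not affected":
       print("The processor is not vulnerable.")
   elif "Mitigation: PTE Inversion" in status:
       print("Host protection (PTE inversion) is active.")
       if "SMT vulnerable" in status:
           print("SMT is enabled; sibling threads remain attackable.")
       if "conditional cache flushes" in status:
           print("KVM flushes the L1D conditionally on VMENTER.")
   else:
       print("Unrecognized status string; see the tables above.")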


Host mitigation mechanism
-------------------------

The kernel is unconditionally protected against L1TF attacks from malicious
user space running on the host.


Guest mitigation mechanisms
---------------------------

.. _l1d_flush:

1. L1D flush on VMENTER
^^^^^^^^^^^^^^^^^^^^^^^

   To make sure that a guest cannot attack data which is present in the L1D
   the hypervisor flushes the L1D before entering the guest.

   Flushing the L1D evicts not only the data which should not be accessed
   by a potentially malicious guest, but also the guest's own data.
   Flushing the L1D has a performance impact as the processor has to bring
   the flushed guest data back into the L1D. Depending on the frequency of
   VMEXIT/VMENTER and the type of computations in the guest, performance
   degradation in the range of 1% to 50% has been observed. For scenarios
   where guest VMEXIT/VMENTER are rare the performance impact is minimal.
   Virtio and mechanisms like posted interrupts are designed to confine the
   VMEXITs to a bare minimum, but specific configurations and application
   scenarios might still suffer from a high VMEXIT rate.

   The kernel provides two L1D flush modes:
    - conditional ('cond')
    - unconditional ('always')

   The conditional mode avoids L1D flushing after VMEXITs which execute
   only audited code paths before the corresponding VMENTER. These code
   paths have been verified not to expose secrets or other interesting
   data to an attacker, but they can leak information about the address
   space layout of the hypervisor.

   Unconditional mode flushes L1D on all VMENTER invocations and provides
   maximum protection. It has a higher overhead than the conditional
   mode. The overhead cannot be quantified precisely as it depends on the
   workload scenario and the resulting number of VMEXITs.

   The general recommendation is to enable L1D flush on VMENTER. The kernel
   defaults to conditional mode on affected processors.

   **Note** that L1D flush does not prevent the SMT problem because the
   sibling thread will also bring its data back into the L1D, which makes
   it attackable again.

   L1D flush can be controlled by the administrator via the kernel command
   line and sysfs control files. See :ref:`mitigation_control_command_line`
   and :ref:`mitigation_control_kvm`.

.. _guest_confinement:

2. Guest VCPU confinement to dedicated physical cores
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

   To address the SMT problem, it is possible to make a guest or a group of
   guests affine to one or more physical cores. The proper mechanism for
   that is to utilize exclusive cpusets to ensure that no other guest or
   host tasks can run on these cores.

   If only a single guest or related guests run on sibling SMT threads on
   the same physical core then they can only attack their own memory and
   restricted parts of the host memory.

   Host memory is attackable when one of the sibling SMT threads runs in
   host OS (hypervisor) context and the other in guest context. The amount
   of valuable information from the host OS context depends on the context
   which the host OS executes, i.e. interrupts, soft interrupts and kernel
   threads. The amount of valuable data from these contexts cannot be
   declared as non-interesting for an attacker without deep inspection of
   the code.

   **Note** that assigning guests to a fixed set of physical cores affects
   the ability of the scheduler to do load balancing and might have
   negative effects on CPU utilization depending on the hosting
   scenario. Disabling SMT might be a viable alternative for particular
   scenarios.

   For further information about confining guests to a single or to a group
   of cores consult the cpusets documentation:

   https://www.kernel.org/doc/Documentation/admin-guide/cgroup-v1/cpusets.rst
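
   As an illustration, an exclusive cpuset can be set up through the cgroup
   v1 cpuset filesystem. The sketch below is not a complete recipe: the
   mount point /sys/fs/cgroup/cpuset, the group name 'vm-guests', the CPU
   range and the PID are assumptions which have to be adapted to the real
   core topology and VM setup::

      #!/usr/bin/env python3
      # Minimal sketch: confine guest VCPU tasks to an exclusive cpuset.
      import os

      CPUSET_ROOT = "/sys/fs/cgroup/cpuset"          # assumes cgroup v1 cpuset
      GROUP = os.path.join(CPUSET_ROOT, "vm-guests") # hypothetical group name

      def write(path, value):
          with open(path, "w") as f:
              f.write(value)

      os.makedirs(GROUP, exist_ok=True)
      write(os.path.join(GROUP, "cpuset.cpus"), "4-7")        # dedicated cores
      write(os.path.join(GROUP, "cpuset.mems"), "0")          # memory node(s)
      write(os.path.join(GROUP, "cpuset.cpu_exclusive"), "1") # no other tasks

      # Move a QEMU/VCPU task into the cpuset; 4711 is an example PID.
      write(os.path.join(GROUP, "tasks"), "4711")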

.. _interrupt_isolation:

3. Interrupt affinity
^^^^^^^^^^^^^^^^^^^^^

   Interrupts can be made affine to logical CPUs. This is not universally
   true because there are types of interrupts which are truly per-CPU
   interrupts, e.g. the local timer interrupt. Aside from that, multi-queue
   devices affine their interrupts to single CPUs or groups of CPUs per
   queue without allowing the administrator to control the affinities.

   Moving the interrupts which can be affinity controlled away from CPUs
   which run untrusted guests reduces the attack vector space.

   Whether the interrupts which are affine to CPUs running untrusted
   guests provide interesting data for an attacker depends on the system
   configuration and the scenarios which run on the system. While for some
   of the interrupts it can be assumed that they won't expose interesting
   information beyond exposing hints about the host OS memory layout, there
   is no way to make general assumptions.

   Interrupt affinity can be controlled by the administrator via the
   /proc/irq/$NR/smp_affinity[_list] files. Limited documentation is
   available at:

   https://www.kernel.org/doc/Documentation/core-api/irq/irq-affinity.rst
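
   As an illustration, the interrupts which can be re-affined can be moved
   to a set of host-reserved CPUs with a few writes to procfs. The sketch
   below assumes that CPUs 0-3 are reserved for the host and must be run as
   root; truly per-CPU or otherwise unmovable interrupts simply reject the
   write and are skipped::

      #!/usr/bin/env python3
      # Minimal sketch: move maskable interrupts away from guest CPUs.
      import glob

      HOST_CPUS = "0-3"   # example value: CPUs which do not run guests

      for path in glob.glob("/proc/irq/*/smp_affinity_list"):
          try:
              with open(path, "w") as f:
                  f.write(HOST_CPUS)
          except OSError as exc:
              # Per-CPU and managed interrupts cannot be re-affined.
              print(f"skipping {path}: {exc}")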

.. _smt_control:

4. SMT control
^^^^^^^^^^^^^^

   To prevent the SMT issues of L1TF it might be necessary to disable SMT
   completely. Disabling SMT can have a significant performance impact, but
   the impact depends on the hosting scenario and the type of workloads.
   The impact of disabling SMT also needs to be weighed against the impact
   of other mitigation solutions like confining guests to dedicated cores.

   The kernel provides a sysfs interface to retrieve the status of SMT and
   to control it. It also provides a kernel command line interface to
   control SMT.

   The kernel command line interface consists of the following options:

     =========== ==========================================================
     nosmt       Affects the bring-up of the secondary CPUs during boot. The
                 kernel tries to bring all present CPUs online during the
                 boot process. "nosmt" makes sure that from each physical
                 core only one - the so-called primary (hyper) thread - is
                 activated. Due to a design flaw of Intel processors related
                 to Machine Check Exceptions the non-primary siblings have
                 to be brought up at least partially and are then shut down
                 again.  "nosmt" can be undone via the sysfs interface.

     nosmt=force Has the same effect as "nosmt" but it does not allow the
                 SMT disable to be undone via the sysfs interface.
     =========== ==========================================================

   The sysfs interface provides two files:

   - /sys/devices/system/cpu/smt/control
   - /sys/devices/system/cpu/smt/active

   /sys/devices/system/cpu/smt/control:

     This file shows the SMT control state and provides the ability to
     disable or (re)enable SMT. The possible states are:

        ==============  ===================================================
        on              SMT is supported by the CPU and enabled. All
                        logical CPUs can be onlined and offlined without
                        restrictions.

        off             SMT is supported by the CPU and disabled. Only
                        the so-called primary SMT threads can be onlined
                        and offlined without restrictions. An attempt to
                        online a non-primary sibling is rejected.

        forceoff        Same as 'off' but the state cannot be controlled.
                        Attempts to write to the control file are rejected.

        notsupported    The processor does not support SMT. It's therefore
                        not affected by the SMT implications of L1TF.
                        Attempts to write to the control file are rejected.
        ==============  ===================================================

     The possible states which can be written into this file to control SMT
     state are:

     - on
     - off
     - forceoff

   /sys/devices/system/cpu/smt/active:

     This file reports whether SMT is enabled and active, i.e. if on any
     physical core two or more sibling threads are online.

   SMT control is also possible at boot time via the l1tf kernel command
   line parameter in combination with L1D flush control. See
   :ref:`mitigation_control_command_line`.
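
   The sysfs files can be used directly from scripts as well. A minimal
   sketch which reads the current state and, if runtime control is
   permitted, disables SMT (root privileges required)::

      #!/usr/bin/env python3
      # Minimal sketch: query the SMT state and disable SMT at runtime.
      CONTROL = "/sys/devices/system/cpu/smt/control"
      ACTIVE = "/sys/devices/system/cpu/smt/active"

      with open(CONTROL) as f:
          state = f.read().strip()
      with open(ACTIVE) as f:
          active = f.read().strip() == "1"

      print(f"SMT control state: {state}, active: {active}")

      if state == "on":
          try:
              with open(CONTROL, "w") as f:
                  f.write("off")
              print("SMT disabled; only primary threads remain online.")
          except OSError as exc:
              # Writes are rejected for 'forceoff' and 'notsupported'.
              print(f"could not disable SMT: {exc}")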

5. Disabling EPT
^^^^^^^^^^^^^^^^

  Disabling EPT for virtual machines provides full mitigation for L1TF even
  with SMT enabled, because the effective page tables for guests are
  managed and sanitized by the hypervisor. Disabling EPT has a significant
  performance impact, however, especially when the Meltdown mitigation KPTI
  is enabled.

  EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.
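
  Whether EPT is currently in use can be read from the corresponding module
  parameter file, assuming the kvm_intel module is loaded. A minimal
  sketch::

     #!/usr/bin/env python3
     # Minimal sketch: check whether KVM uses EPT or shadow page tables.
     try:
         with open("/sys/module/kvm_intel/parameters/ept") as f:
             ept = f.read().strip()
         if ept in ("Y", "1"):
             print("EPT enabled: L1D flushing and/or SMT control needed")
         else:
             print("EPT disabled: shadow page tables, guests fully mitigated")
     except FileNotFoundError:
         print("kvm_intel module not loaded")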

There is ongoing research and development for new mitigation mechanisms to
address the performance impact of disabling SMT or EPT.

.. _mitigation_control_command_line:

Mitigation control on the kernel command line
---------------------------------------------

The kernel command line allows the L1TF mitigations to be controlled at
boot time with the option "l1tf=". The valid arguments for this option are:

  ============  =============================================================
  full          Provides all available mitigations for the L1TF
                vulnerability. Disables SMT and enables all mitigations in
                the hypervisors, i.e. unconditional L1D flushing

                SMT control and L1D flush control via the sysfs interface
                is still possible after boot.  Hypervisors will issue a
                warning when the first VM is started in a potentially
                insecure configuration, i.e. SMT enabled or L1D flush
                disabled.

  full,force    Same as 'full', but disables SMT and L1D flush runtime
                control. Implies the 'nosmt=force' command line option.
                (i.e. sysfs control of SMT is disabled.)

  flush         Leaves SMT enabled and enables the default hypervisor
                mitigation, i.e. conditional L1D flushing

                SMT control and L1D flush control via the sysfs interface
                is still possible after boot.  Hypervisors will issue a
                warning when the first VM is started in a potentially
                insecure configuration, i.e. SMT enabled or L1D flush
                disabled.

  flush,nosmt   Disables SMT and enables the default hypervisor mitigation,
                i.e. conditional L1D flushing.

                SMT control and L1D flush control via the sysfs interface
                is still possible after boot.  Hypervisors will issue a
                warning when the first VM is started in a potentially
                insecure configuration, i.e. SMT enabled or L1D flush
                disabled.

  flush,nowarn  Same as 'flush', but hypervisors will not warn when a VM is
                started in a potentially insecure configuration.

  off           Disables hypervisor mitigations and doesn't emit any
                warnings.
                It also drops the swap size and available RAM limit
                restrictions on both hypervisor and bare metal.

  ============  =============================================================

The default is 'flush'. For details about L1D flushing see :ref:`l1d_flush`.
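
Which 'l1tf=' option was requested at boot can be checked by parsing
/proc/cmdline; if the option is absent, the default 'flush' is in effect.
A minimal sketch (the authoritative runtime state remains the sysfs file
described in :ref:`l1tf_sys_info`)::

   #!/usr/bin/env python3
   # Minimal sketch: show the l1tf= boot option, if any.
   with open("/proc/cmdline") as f:
       options = f.read().split()

   l1tf = next((o.split("=", 1)[1] for o in options if o.startswith("l1tf=")),
               None)
   print(f"l1tf boot option: {l1tf or 'not set (default: flush)'}")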


.. _mitigation_control_kvm:

Mitigation control for KVM - module parameter
---------------------------------------------

The KVM hypervisor mitigation mechanism, flushing the L1D cache when
entering a guest, can be controlled with a module parameter.

The option/parameter is "kvm-intel.vmentry_l1d_flush=". It takes the
following arguments:

  ============  ==============================================================
  always        L1D cache flush on every VMENTER.

  cond          Flush L1D on VMENTER only when the code between VMEXIT and
                VMENTER can leak host memory which is considered
                interesting for an attacker. This still can leak host memory
                which allows e.g. the host's address space layout to be
                determined.

  never         Disables the mitigation
  ============  ==============================================================

The parameter can be provided on the kernel command line, as a module
parameter when loading the modules and at runtime modified via the sysfs
file:

/sys/module/kvm_intel/parameters/vmentry_l1d_flush

The default is 'cond'. If 'l1tf=full,force' is given on the kernel command
line, then 'always' is enforced and the kvm-intel.vmentry_l1d_flush
module parameter is ignored and writes to the sysfs file are rejected.
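
A minimal sketch for inspecting and switching the flush mode at runtime,
assuming the kvm_intel module is loaded and runtime control has not been
disabled by 'l1tf=full,force'::

   #!/usr/bin/env python3
   # Minimal sketch: read and change kvm-intel's vmentry_l1d_flush mode.
   PARAM = "/sys/module/kvm_intel/parameters/vmentry_l1d_flush"

   with open(PARAM) as f:
       print(f"current mode: {f.read().strip()}")   # always, cond or never

   try:
       with open(PARAM, "w") as f:
           f.write("always")                        # unconditional flushing
       print("switched to unconditional L1D flushing")
   except OSError as exc:
       print(f"runtime control not permitted: {exc}")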

.. _mitigation_selection:

Mitigation selection guide
--------------------------

1. No virtualization in use
^^^^^^^^^^^^^^^^^^^^^^^^^^^

   The system is protected by the kernel unconditionally and no further
   action is required.

2. Virtualization with trusted guests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

   If the guest comes from a trusted source and the guest OS kernel is
   guaranteed to have the L1TF mitigations in place, the system is fully
   protected against L1TF and no further action is required.

   To avoid the overhead of the default L1D flushing on VMENTER the
   administrator can disable the flushing via the kernel command line and
   sysfs control files. See :ref:`mitigation_control_command_line` and
   :ref:`mitigation_control_kvm`.


3. Virtualization with untrusted guests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

3.1. SMT not supported or disabled
""""""""""""""""""""""""""""""""""

  If SMT is not supported by the processor or disabled in the BIOS or by
  the kernel, the only requirement is to enforce L1D flushing on VMENTER.

  Conditional L1D flushing is the default behaviour and can be tuned. See
  :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.

3.2. EPT not supported or disabled
""""""""""""""""""""""""""""""""""

  If EPT is not supported by the processor or disabled in the hypervisor,
  the system is fully protected. SMT can stay enabled and L1D flushing on
  VMENTER is not required.

  EPT can be disabled in the hypervisor via the 'kvm-intel.ept' parameter.

3.3. SMT and EPT supported and active
"""""""""""""""""""""""""""""""""""""

  If SMT and EPT are supported and active then various degrees of
  mitigations can be employed:

  - L1D flushing on VMENTER:

    L1D flushing on VMENTER is the minimal protection requirement, but it
    is only potent in combination with other mitigation methods.

    Conditional L1D flushing is the default behaviour and can be tuned. See
    :ref:`mitigation_control_command_line` and :ref:`mitigation_control_kvm`.

  - Guest confinement:

    Confinement of guests to a single or a group of physical cores which
    are not running any other processes can reduce the attack surface
    significantly, but interrupts, soft interrupts and kernel threads can
    still expose valuable data to a potential attacker. See
    :ref:`guest_confinement`.

  - Interrupt isolation:

    Isolating the guest CPUs from interrupts can reduce the attack surface
    further, but still allows a malicious guest to explore a limited amount
    of host physical memory. This can at least be used to gain knowledge
    about the host address space layout. The interrupts which have a fixed
    affinity to the CPUs which run the untrusted guests can, depending on
    the scenario, still trigger soft interrupts and schedule kernel threads
    which might expose valuable information. See
    :ref:`interrupt_isolation`.

The above three mitigation methods combined can provide protection to a
certain degree, but the risk of the remaining attack surface has to be
carefully analyzed. For full protection the following methods are
available:

  - Disabling SMT:

    Disabling SMT and enforcing the L1D flushing provides the maximum
    amount of protection. This mitigation does not depend on any of the
    above mitigation methods.

    SMT control and L1D flushing can be tuned by the command line
    parameters 'nosmt', 'l1tf', 'kvm-intel.vmentry_l1d_flush' and at run
    time with the matching sysfs control files. See :ref:`smt_control`,
    :ref:`mitigation_control_command_line` and
    :ref:`mitigation_control_kvm`.

  - Disabling EPT:

    Disabling EPT provides the maximum amount of protection as well. It
    does not depend on any of the above mitigation methods. SMT can stay
    enabled and L1D flushing is not required, but the performance impact is
    significant.

    EPT can be disabled in the hypervisor via the 'kvm-intel.ept'
    parameter.

3.4. Nested virtual machines
""""""""""""""""""""""""""""

When nested virtualization is in use, three operating systems are involved:
the bare metal hypervisor, the nested hypervisor and the nested virtual
machine.  VMENTER operations from the nested hypervisor into the nested
guest will always be processed by the bare metal hypervisor. If KVM is the
bare metal hypervisor it will:

 - Flush the L1D cache on every switch from the nested hypervisor to the
   nested virtual machine, so that the nested hypervisor's secrets are not
   exposed to the nested virtual machine;

 - Flush the L1D cache on every switch from the nested virtual machine to
   the nested hypervisor; this is a complex operation, and flushing the L1D
   cache prevents the bare metal hypervisor's secrets from being exposed to
   the nested virtual machine;

 - Instruct the nested hypervisor to not perform any L1D cache flush. This
   is an optimization to avoid double L1D flushing.


.. _default_mitigations:

Default mitigations
-------------------

  The kernel default mitigations for vulnerable processors are:

  - PTE inversion to protect against malicious user space. This is done
    unconditionally and cannot be controlled. The swap storage is limited
    to ~16TB.

  - L1D conditional flushing on VMENTER when EPT is enabled for
    a guest.

  The kernel does not by default enforce the disabling of SMT, which leaves
  SMT systems vulnerable when running untrusted guests with EPT enabled.

  The rationale for this choice is:

  - Force disabling SMT can break existing setups, especially with
    unattended updates.

  - If regular users run untrusted guests on their machine, then L1TF is
    just an add-on to other malware which might be embedded in an untrusted
    guest, e.g. spam-bots or attacks on the local network.

    There is no technical way to prevent a user from running untrusted code
    on their machines blindly.

  - It's technically extremely unlikely and from today's knowledge even
    impossible that L1TF can be exploited via the most popular attack
    mechanisms like JavaScript because these mechanisms have no way to
    control PTEs. If this were possible and no other mitigation were
    available, then the default might be different.

  - The administrators of cloud and hosting setups have to carefully
    analyze the risk for their scenarios and make the appropriate
    mitigation choices, which might even vary across their deployed
    machines and also result in other changes to their overall setup.
    There is no way for the kernel to provide a sensible default for this
    kind of scenario.