Documentation/x86/intel_mpx.rst - third_party/kernel - Git at Google

 .. SPDX-License-Identifier: GPL-2.0

 ===========================================
 Intel(R) Memory Protection Extensions (MPX)
 ===========================================

 Intel(R) MPX Overview
 =====================

 Intel(R) Memory Protection Extensions (Intel(R) MPX) is a new capability
 introduced into Intel Architecture. Intel MPX provides hardware features
 that can be used in conjunction with compiler changes to check memory
 references, for those references whose compile-time normal intentions are
 usurped at runtime due to buffer overflow or underflow.

 You can tell if your CPU supports MPX by looking in /proc/cpuinfo::

 	cat /proc/cpuinfo  | grep ' mpx '

 For more information, please refer to Intel(R) Architecture Instruction
 Set Extensions Programming Reference, Chapter 9: Intel(R) Memory Protection
 Extensions.

 Note: As of December 2014, no hardware with MPX is available but it is
 possible to use SDE (Intel(R) Software Development Emulator) instead, which
 can be downloaded from
 http://software.intel.com/en-us/articles/intel-software-development-emulator


 How to get the advantage of MPX
 ===============================

 For MPX to work, changes are required in the kernel, binutils and compiler.
 No source changes are required for applications, just a recompile.

 There are a lot of moving parts of this to all work right. The following
 is how we expect the compiler, application and kernel to work together.

 1) Application developer compiles with -fmpx. The compiler will add the
    instrumentation as well as some setup code called early after the app
    starts. New instruction prefixes are noops for old CPUs.
 2) That setup code allocates (virtual) space for the "bounds directory",
    points the "bndcfgu" register to the directory (must also set the valid
    bit) and notifies the kernel (via the new prctl(PR_MPX_ENABLE_MANAGEMENT))
    that the app will be using MPX.  The app must be careful not to access
    the bounds tables between the time when it populates "bndcfgu" and
    when it calls the prctl().  This might be hard to guarantee if the app
    is compiled with MPX.  You can add "__attribute__((bnd_legacy))" to
    the function to disable MPX instrumentation to help guarantee this.
    Also be careful not to call out to any other code which might be
    MPX-instrumented.
 3) The kernel detects that the CPU has MPX, allows the new prctl() to
    succeed, and notes the location of the bounds directory. Userspace is
    expected to keep the bounds directory at that location. We note it
    instead of reading it each time because the 'xsave' operation needed
    to access the bounds directory register is an expensive operation.
 4) If the application needs to spill bounds out of the 4 registers, it
    issues a bndstx instruction. Since the bounds directory is empty at
    this point, a bounds fault (#BR) is raised, the kernel allocates a
    bounds table (in the user address space) and makes the relevant entry
    in the bounds directory point to the new table.
 5) If the application violates the bounds specified in the bounds registers,
    a separate kind of #BR is raised which will deliver a signal with
    information about the violation in the 'struct siginfo'.
 6) Whenever memory is freed, we know that it can no longer contain valid
    pointers, and we attempt to free the associated space in the bounds
    tables. If an entire table becomes unused, we will attempt to free
    the table and remove the entry in the directory.

 To summarize, there are essentially three things interacting here:

 GCC with -fmpx:
  * enables annotation of code with MPX instructions and prefixes
  * inserts code early in the application to call in to the "gcc runtime"
 GCC MPX Runtime:
  * Checks for hardware MPX support in cpuid leaf
  * allocates virtual space for the bounds directory (malloc() essentially)
  * points the hardware BNDCFGU register at the directory
  * calls a new prctl(PR_MPX_ENABLE_MANAGEMENT) to notify the kernel to
    start managing the bounds directories
 Kernel MPX Code:
  * Checks for hardware MPX support in cpuid leaf
  * Handles #BR exceptions and sends SIGSEGV to the app when it violates
    bounds, like during a buffer overflow.
  * When bounds are spilled in to an unallocated bounds table, the kernel
    notices in the #BR exception, allocates the virtual space, then
    updates the bounds directory to point to the new table. It keeps
    special track of the memory with a VM_MPX flag.
  * Frees unused bounds tables at the time that the memory they described
    is unmapped.


 How does MPX kernel code work
 =============================

 Handling #BR faults caused by MPX
 ---------------------------------

 When MPX is enabled, there are 2 new situations that can generate
 #BR faults.

   * new bounds tables (BT) need to be allocated to save bounds.
   * bounds violation caused by MPX instructions.

 We hook #BR handler to handle these two new situations.

 On-demand kernel allocation of bounds tables
 --------------------------------------------

 MPX only has 4 hardware registers for storing bounds information. If
 MPX-enabled code needs more than these 4 registers, it needs to spill
 them somewhere. It has two special instructions for this which allow
 the bounds to be moved between the bounds registers and some new "bounds
 tables".

 #BR exceptions are a new class of exceptions just for MPX. They are
 similar conceptually to a page fault and will be raised by the MPX
 hardware during both bounds violations or when the tables are not
 present. The kernel handles those #BR exceptions for not-present tables
 by carving the space out of the normal processes address space and then
 pointing the bounds-directory over to it.

 The tables need to be accessed and controlled by userspace because
 the instructions for moving bounds in and out of them are extremely
 frequent. They potentially happen every time a register points to
 memory. Any direct kernel involvement (like a syscall) to access the
 tables would obviously destroy performance.

 Why not do this in userspace? MPX does not strictly require anything in
 the kernel. It can theoretically be done completely from userspace. Here
 are a few ways this could be done. We don't think any of them are practical
 in the real-world, but here they are.

 :Q: Can virtual space simply be reserved for the bounds tables so that we
     never have to allocate them?
 :A: MPX-enabled application will possibly create a lot of bounds tables in
     process address space to save bounds information. These tables can take
     up huge swaths of memory (as much as 80% of the memory on the system)
     even if we clean them up aggressively. In the worst-case scenario, the
     tables can be 4x the size of the data structure being tracked. IOW, a
     1-page structure can require 4 bounds-table pages. An X-GB virtual
     area needs 4*X GB of virtual space, plus 2GB for the bounds directory.
     If we were to preallocate them for the 128TB of user virtual address
     space, we would need to reserve 512TB+2GB, which is larger than the
     entire virtual address space today. This means they can not be reserved
     ahead of time. Also, a single process's pre-populated bounds directory
     consumes 2GB of virtual *AND* physical memory. IOW, it's completely
     infeasible to prepopulate bounds directories.

 :Q: Can we preallocate bounds table space at the same time memory is
     allocated which might contain pointers that might eventually need
     bounds tables?
 :A: This would work if we could hook the site of each and every memory
     allocation syscall. This can be done for small, constrained applications.
     But, it isn't practical at a larger scale since a given app has no
     way of controlling how all the parts of the app might allocate memory
     (think libraries). The kernel is really the only place to intercept
     these calls.

 :Q: Could a bounds fault be handed to userspace and the tables allocated
     there in a signal handler instead of in the kernel?
 :A: mmap() is not on the list of safe async handler functions and even
     if mmap() would work it still requires locking or nasty tricks to
     keep track of the allocation state there.

 Having ruled out all of the userspace-only approaches for managing
 bounds tables that we could think of, we create them on demand in
 the kernel.

 Decoding MPX instructions
 -------------------------

 If a #BR is generated due to a bounds violation caused by MPX.
 We need to decode MPX instructions to get violation address and
 set this address into extended struct siginfo.

 The _sigfault field of struct siginfo is extended as follow::

   87		/* SIGILL, SIGFPE, SIGSEGV, SIGBUS */
   88		struct {
   89			void __user *_addr; /* faulting insn/memory ref. */
   90 #ifdef __ARCH_SI_TRAPNO
   91			int _trapno;	/* TRAP # which caused the signal */
   92 #endif
   93			short _addr_lsb; /* LSB of the reported address */
   94			struct {
   95				void __user *_lower;
   96				void __user *_upper;
   97			} _addr_bnd;
   98		} _sigfault;

 The '_addr' field refers to violation address, and new '_addr_and'
 field refers to the upper/lower bounds when a #BR is caused.

 Glibc will be also updated to support this new siginfo. So user
 can get violation address and bounds when bounds violations occur.

 Cleanup unused bounds tables
 ----------------------------

 When a BNDSTX instruction attempts to save bounds to a bounds directory
 entry marked as invalid, a #BR is generated. This is an indication that
 no bounds table exists for this entry. In this case the fault handler
 will allocate a new bounds table on demand.

 Since the kernel allocated those tables on-demand without userspace
 knowledge, it is also responsible for freeing them when the associated
 mappings go away.

 Here, the solution for this issue is to hook do_munmap() to check
 whether one process is MPX enabled. If yes, those bounds tables covered
 in the virtual address region which is being unmapped will be freed also.

 Adding new prctl commands
 -------------------------

 Two new prctl commands are added to enable and disable MPX bounds tables
 management in kernel.
 ::

   155	#define PR_MPX_ENABLE_MANAGEMENT	43
   156	#define PR_MPX_DISABLE_MANAGEMENT	44

 Runtime library in userspace is responsible for allocation of bounds
 directory. So kernel have to use XSAVE instruction to get the base
 of bounds directory from BNDCFG register.

 But XSAVE is expected to be very expensive. In order to do performance
 optimization, we have to get the base of bounds directory and save it
 into struct mm_struct to be used in future during PR_MPX_ENABLE_MANAGEMENT
 command execution.


 Special rules
 =============

 1) If userspace is requesting help from the kernel to do the management
 of bounds tables, it may not create or modify entries in the bounds directory.

 Certainly users can allocate bounds tables and forcibly point the bounds
 directory at them through XSAVE instruction, and then set valid bit
 of bounds entry to have this entry valid.  But, the kernel will decline
 to assist in managing these tables.

 2) Userspace may not take multiple bounds directory entries and point
 them at the same bounds table.

 This is allowed architecturally.  See more information "Intel(R) Architecture
 Instruction Set Extensions Programming Reference" (9.3.4).

 However, if users did this, the kernel might be fooled in to unmapping an
 in-use bounds table since it does not recognize sharing.
	.. SPDX-License-Identifier: GPL-2.0

	===========================================
	Intel(R) Memory Protection Extensions (MPX)
	===========================================

	Intel(R) MPX Overview
	=====================

	Intel(R) Memory Protection Extensions (Intel(R) MPX) is a new capability
	introduced into Intel Architecture. Intel MPX provides hardware features
	that can be used in conjunction with compiler changes to check memory
	references, for those references whose compile-time normal intentions are
	usurped at runtime due to buffer overflow or underflow.

	You can tell if your CPU supports MPX by looking in /proc/cpuinfo::

	cat /proc/cpuinfo \| grep ' mpx '

	For more information, please refer to Intel(R) Architecture Instruction
	Set Extensions Programming Reference, Chapter 9: Intel(R) Memory Protection
	Extensions.

	Note: As of December 2014, no hardware with MPX is available but it is
	possible to use SDE (Intel(R) Software Development Emulator) instead, which
	can be downloaded from
	http://software.intel.com/en-us/articles/intel-software-development-emulator


	How to get the advantage of MPX
	===============================

	For MPX to work, changes are required in the kernel, binutils and compiler.
	No source changes are required for applications, just a recompile.

	There are a lot of moving parts of this to all work right. The following
	is how we expect the compiler, application and kernel to work together.

	1) Application developer compiles with -fmpx. The compiler will add the
	instrumentation as well as some setup code called early after the app
	starts. New instruction prefixes are noops for old CPUs.
	2) That setup code allocates (virtual) space for the "bounds directory",
	points the "bndcfgu" register to the directory (must also set the valid
	bit) and notifies the kernel (via the new prctl(PR_MPX_ENABLE_MANAGEMENT))
	that the app will be using MPX. The app must be careful not to access
	the bounds tables between the time when it populates "bndcfgu" and
	when it calls the prctl(). This might be hard to guarantee if the app
	is compiled with MPX. You can add "__attribute__((bnd_legacy))" to
	the function to disable MPX instrumentation to help guarantee this.
	Also be careful not to call out to any other code which might be
	MPX-instrumented.
	3) The kernel detects that the CPU has MPX, allows the new prctl() to
	succeed, and notes the location of the bounds directory. Userspace is
	expected to keep the bounds directory at that location. We note it
	instead of reading it each time because the 'xsave' operation needed
	to access the bounds directory register is an expensive operation.
	4) If the application needs to spill bounds out of the 4 registers, it
	issues a bndstx instruction. Since the bounds directory is empty at
	this point, a bounds fault (#BR) is raised, the kernel allocates a
	bounds table (in the user address space) and makes the relevant entry
	in the bounds directory point to the new table.
	5) If the application violates the bounds specified in the bounds registers,
	a separate kind of #BR is raised which will deliver a signal with
	information about the violation in the 'struct siginfo'.
	6) Whenever memory is freed, we know that it can no longer contain valid
	pointers, and we attempt to free the associated space in the bounds
	tables. If an entire table becomes unused, we will attempt to free
	the table and remove the entry in the directory.

	To summarize, there are essentially three things interacting here:

	GCC with -fmpx:
	* enables annotation of code with MPX instructions and prefixes
	* inserts code early in the application to call in to the "gcc runtime"
	GCC MPX Runtime:
	* Checks for hardware MPX support in cpuid leaf
	* allocates virtual space for the bounds directory (malloc() essentially)
	* points the hardware BNDCFGU register at the directory
	* calls a new prctl(PR_MPX_ENABLE_MANAGEMENT) to notify the kernel to
	start managing the bounds directories
	Kernel MPX Code:
	* Checks for hardware MPX support in cpuid leaf
	* Handles #BR exceptions and sends SIGSEGV to the app when it violates
	bounds, like during a buffer overflow.
	* When bounds are spilled in to an unallocated bounds table, the kernel
	notices in the #BR exception, allocates the virtual space, then
	updates the bounds directory to point to the new table. It keeps
	special track of the memory with a VM_MPX flag.
	* Frees unused bounds tables at the time that the memory they described
	is unmapped.


	How does MPX kernel code work
	=============================

	Handling #BR faults caused by MPX
	---------------------------------

	When MPX is enabled, there are 2 new situations that can generate
	#BR faults.

	* new bounds tables (BT) need to be allocated to save bounds.
	* bounds violation caused by MPX instructions.

	We hook #BR handler to handle these two new situations.

	On-demand kernel allocation of bounds tables
	--------------------------------------------

	MPX only has 4 hardware registers for storing bounds information. If
	MPX-enabled code needs more than these 4 registers, it needs to spill
	them somewhere. It has two special instructions for this which allow
	the bounds to be moved between the bounds registers and some new "bounds
	tables".

	#BR exceptions are a new class of exceptions just for MPX. They are
	similar conceptually to a page fault and will be raised by the MPX
	hardware during both bounds violations or when the tables are not
	present. The kernel handles those #BR exceptions for not-present tables
	by carving the space out of the normal processes address space and then
	pointing the bounds-directory over to it.

	The tables need to be accessed and controlled by userspace because
	the instructions for moving bounds in and out of them are extremely
	frequent. They potentially happen every time a register points to
	memory. Any direct kernel involvement (like a syscall) to access the
	tables would obviously destroy performance.

	Why not do this in userspace? MPX does not strictly require anything in
	the kernel. It can theoretically be done completely from userspace. Here
	are a few ways this could be done. We don't think any of them are practical
	in the real-world, but here they are.

	:Q: Can virtual space simply be reserved for the bounds tables so that we
	never have to allocate them?
	:A: MPX-enabled application will possibly create a lot of bounds tables in
	process address space to save bounds information. These tables can take
	up huge swaths of memory (as much as 80% of the memory on the system)
	even if we clean them up aggressively. In the worst-case scenario, the
	tables can be 4x the size of the data structure being tracked. IOW, a
	1-page structure can require 4 bounds-table pages. An X-GB virtual
	area needs 4*X GB of virtual space, plus 2GB for the bounds directory.
	If we were to preallocate them for the 128TB of user virtual address
	space, we would need to reserve 512TB+2GB, which is larger than the
	entire virtual address space today. This means they can not be reserved
	ahead of time. Also, a single process's pre-populated bounds directory
	consumes 2GB of virtual AND physical memory. IOW, it's completely
	infeasible to prepopulate bounds directories.

	:Q: Can we preallocate bounds table space at the same time memory is
	allocated which might contain pointers that might eventually need
	bounds tables?
	:A: This would work if we could hook the site of each and every memory
	allocation syscall. This can be done for small, constrained applications.
	But, it isn't practical at a larger scale since a given app has no
	way of controlling how all the parts of the app might allocate memory
	(think libraries). The kernel is really the only place to intercept
	these calls.

	:Q: Could a bounds fault be handed to userspace and the tables allocated
	there in a signal handler instead of in the kernel?
	:A: mmap() is not on the list of safe async handler functions and even
	if mmap() would work it still requires locking or nasty tricks to
	keep track of the allocation state there.

	Having ruled out all of the userspace-only approaches for managing
	bounds tables that we could think of, we create them on demand in
	the kernel.

	Decoding MPX instructions
	-------------------------

	If a #BR is generated due to a bounds violation caused by MPX.
	We need to decode MPX instructions to get violation address and
	set this address into extended struct siginfo.

	The _sigfault field of struct siginfo is extended as follow::

	87 /* SIGILL, SIGFPE, SIGSEGV, SIGBUS */
	88 struct {
	89 void __user _addr; / faulting insn/memory ref. */
	90 #ifdef __ARCH_SI_TRAPNO
	91 int _trapno; /* TRAP # which caused the signal */
	92 #endif
	93 short _addr_lsb; /* LSB of the reported address */
	94 struct {
	95 void __user *_lower;
	96 void __user *_upper;
	97 } _addr_bnd;
	98 } _sigfault;

	The '_addr' field refers to violation address, and new '_addr_and'
	field refers to the upper/lower bounds when a #BR is caused.

	Glibc will be also updated to support this new siginfo. So user
	can get violation address and bounds when bounds violations occur.

	Cleanup unused bounds tables
	----------------------------

	When a BNDSTX instruction attempts to save bounds to a bounds directory
	entry marked as invalid, a #BR is generated. This is an indication that
	no bounds table exists for this entry. In this case the fault handler
	will allocate a new bounds table on demand.

	Since the kernel allocated those tables on-demand without userspace
	knowledge, it is also responsible for freeing them when the associated
	mappings go away.

	Here, the solution for this issue is to hook do_munmap() to check
	whether one process is MPX enabled. If yes, those bounds tables covered
	in the virtual address region which is being unmapped will be freed also.

	Adding new prctl commands
	-------------------------

	Two new prctl commands are added to enable and disable MPX bounds tables
	management in kernel.
	::

	155 #define PR_MPX_ENABLE_MANAGEMENT 43
	156 #define PR_MPX_DISABLE_MANAGEMENT 44

	Runtime library in userspace is responsible for allocation of bounds
	directory. So kernel have to use XSAVE instruction to get the base
	of bounds directory from BNDCFG register.

	But XSAVE is expected to be very expensive. In order to do performance
	optimization, we have to get the base of bounds directory and save it
	into struct mm_struct to be used in future during PR_MPX_ENABLE_MANAGEMENT
	command execution.


	Special rules
	=============

	1) If userspace is requesting help from the kernel to do the management
	of bounds tables, it may not create or modify entries in the bounds directory.

	Certainly users can allocate bounds tables and forcibly point the bounds
	directory at them through XSAVE instruction, and then set valid bit
	of bounds entry to have this entry valid. But, the kernel will decline
	to assist in managing these tables.

	2) Userspace may not take multiple bounds directory entries and point
	them at the same bounds table.

	This is allowed architecturally. See more information "Intel(R) Architecture
	Instruction Set Extensions Programming Reference" (9.3.4).

	However, if users did this, the kernel might be fooled in to unmapping an
	in-use bounds table since it does not recognize sharing.