|  | ====================== | 
|  | ioctl based interfaces | 
|  | ====================== | 
|  |  | 
|  | ioctl() is the most common way for applications to interface | 
|  | with device drivers. It is flexible and easily extended by adding new | 
|  | commands and can be passed through character devices, block devices as | 
|  | well as sockets and other special file descriptors. | 
|  |  | 
|  | However, it is also very easy to get ioctl command definitions wrong, | 
|  | and hard to fix them later without breaking existing applications, | 
|  | so this documentation tries to help developers get it right. | 
|  |  | 
|  | Command number definitions | 
|  | ========================== | 
|  |  | 
|  | The command number, or request number, is the second argument passed to | 
|  | the ioctl system call. While this can be any 32-bit number that uniquely | 
|  | identifies an action for a particular driver, there are a number of | 
|  | conventions around defining them. | 
|  |  | 
|  | ``include/uapi/asm-generic/ioctl.h`` provides four macros for defining | 
|  | ioctl commands that follow modern conventions: ``_IO``, ``_IOR``, | 
|  | ``_IOW``, and ``_IOWR``. These should be used for all new commands, | 
|  | with the correct parameters: | 
|  |  | 
|  | _IO/_IOR/_IOW/_IOWR | 
|  | The macro name specifies how the argument will be used.  It may be a | 
|  | pointer to data to be passed into the kernel (_IOW), out of the kernel | 
|  | (_IOR), or both (_IOWR).  _IO can indicate either commands with no | 
|  | argument or those passing an integer value instead of a pointer. | 
|  | It is recommended to only use _IO for commands without arguments, | 
|  | and use pointers for passing data. | 
|  |  | 
|  | type | 
|  | An 8-bit number, often a character literal, specific to a subsystem | 
|  | or driver, and listed in Documentation/userspace-api/ioctl/ioctl-number.rst | 
|  |  | 
|  | nr | 
|  | An 8-bit number identifying the specific command, unique for a give | 
|  | value of 'type' | 
|  |  | 
|  | data_type | 
|  | The name of the data type pointed to by the argument, the command number | 
|  | encodes the ``sizeof(data_type)`` value in a 13-bit or 14-bit integer, | 
|  | leading to a limit of 8191 bytes for the maximum size of the argument. | 
|  | Note: do not pass sizeof(data_type) type into _IOR/_IOW/IOWR, as that | 
|  | will lead to encoding sizeof(sizeof(data_type)), i.e. sizeof(size_t). | 
|  | _IO does not have a data_type parameter. | 
|  |  | 
|  |  | 
|  | Interface versions | 
|  | ================== | 
|  |  | 
|  | Some subsystems use version numbers in data structures to overload | 
|  | commands with different interpretations of the argument. | 
|  |  | 
|  | This is generally a bad idea, since changes to existing commands tend | 
|  | to break existing applications. | 
|  |  | 
|  | A better approach is to add a new ioctl command with a new number. The | 
|  | old command still needs to be implemented in the kernel for compatibility, | 
|  | but this can be a wrapper around the new implementation. | 
|  |  | 
|  | Return code | 
|  | =========== | 
|  |  | 
|  | ioctl commands can return negative error codes as documented in errno(3); | 
|  | these get turned into errno values in user space. On success, the return | 
|  | code should be zero. It is also possible but not recommended to return | 
|  | a positive 'long' value. | 
|  |  | 
|  | When the ioctl callback is called with an unknown command number, the | 
|  | handler returns either -ENOTTY or -ENOIOCTLCMD, which also results in | 
|  | -ENOTTY being returned from the system call. Some subsystems return | 
|  | -ENOSYS or -EINVAL here for historic reasons, but this is wrong. | 
|  |  | 
|  | Prior to Linux 5.5, compat_ioctl handlers were required to return | 
|  | -ENOIOCTLCMD in order to use the fallback conversion into native | 
|  | commands. As all subsystems are now responsible for handling compat | 
|  | mode themselves, this is no longer needed, but it may be important to | 
|  | consider when backporting bug fixes to older kernels. | 
|  |  | 
|  | Timestamps | 
|  | ========== | 
|  |  | 
|  | Traditionally, timestamps and timeout values are passed as ``struct | 
|  | timespec`` or ``struct timeval``, but these are problematic because of | 
|  | incompatible definitions of these structures in user space after the | 
|  | move to 64-bit time_t. | 
|  |  | 
|  | The ``struct __kernel_timespec`` type can be used instead to be embedded | 
|  | in other data structures when separate second/nanosecond values are | 
|  | desired, or passed to user space directly. This is still not ideal though, | 
|  | as the structure matches neither the kernel's timespec64 nor the user | 
|  | space timespec exactly. The get_timespec64() and put_timespec64() helper | 
|  | functions can be used to ensure that the layout remains compatible with | 
|  | user space and the padding is treated correctly. | 
|  |  | 
|  | As it is cheap to convert seconds to nanoseconds, but the opposite | 
|  | requires an expensive 64-bit division, a simple __u64 nanosecond value | 
|  | can be simpler and more efficient. | 
|  |  | 
|  | Timeout values and timestamps should ideally use CLOCK_MONOTONIC time, | 
|  | as returned by ktime_get_ns() or ktime_get_ts64().  Unlike | 
|  | CLOCK_REALTIME, this makes the timestamps immune from jumping backwards | 
|  | or forwards due to leap second adjustments and clock_settime() calls. | 
|  |  | 
|  | ktime_get_real_ns() can be used for CLOCK_REALTIME timestamps that | 
|  | need to be persistent across a reboot or between multiple machines. | 
|  |  | 
|  | 32-bit compat mode | 
|  | ================== | 
|  |  | 
|  | In order to support 32-bit user space running on a 64-bit machine, each | 
|  | subsystem or driver that implements an ioctl callback handler must also | 
|  | implement the corresponding compat_ioctl handler. | 
|  |  | 
|  | As long as all the rules for data structures are followed, this is as | 
|  | easy as setting the .compat_ioctl pointer to a helper function such as | 
|  | compat_ptr_ioctl() or blkdev_compat_ptr_ioctl(). | 
|  |  | 
|  | compat_ptr() | 
|  | ------------ | 
|  |  | 
|  | On the s390 architecture, 31-bit user space has ambiguous representations | 
|  | for data pointers, with the upper bit being ignored. When running such | 
|  | a process in compat mode, the compat_ptr() helper must be used to | 
|  | clear the upper bit of a compat_uptr_t and turn it into a valid 64-bit | 
|  | pointer.  On other architectures, this macro only performs a cast to a | 
|  | ``void __user *`` pointer. | 
|  |  | 
|  | In an compat_ioctl() callback, the last argument is an unsigned long, | 
|  | which can be interpreted as either a pointer or a scalar depending on | 
|  | the command. If it is a scalar, then compat_ptr() must not be used, to | 
|  | ensure that the 64-bit kernel behaves the same way as a 32-bit kernel | 
|  | for arguments with the upper bit set. | 
|  |  | 
|  | The compat_ptr_ioctl() helper can be used in place of a custom | 
|  | compat_ioctl file operation for drivers that only take arguments that | 
|  | are pointers to compatible data structures. | 
|  |  | 
|  | Structure layout | 
|  | ---------------- | 
|  |  | 
|  | Compatible data structures have the same layout on all architectures, | 
|  | avoiding all problematic members: | 
|  |  | 
|  | * ``long`` and ``unsigned long`` are the size of a register, so | 
|  | they can be either 32-bit or 64-bit wide and cannot be used in portable | 
|  | data structures. Fixed-length replacements are ``__s32``, ``__u32``, | 
|  | ``__s64`` and ``__u64``. | 
|  |  | 
|  | * Pointers have the same problem, in addition to requiring the | 
|  | use of compat_ptr(). The best workaround is to use ``__u64`` | 
|  | in place of pointers, which requires a cast to ``uintptr_t`` in user | 
|  | space, and the use of u64_to_user_ptr() in the kernel to convert | 
|  | it back into a user pointer. | 
|  |  | 
|  | * On the x86-32 (i386) architecture, the alignment of 64-bit variables | 
|  | is only 32-bit, but they are naturally aligned on most other | 
|  | architectures including x86-64. This means a structure like:: | 
|  |  | 
|  | struct foo { | 
|  | __u32 a; | 
|  | __u64 b; | 
|  | __u32 c; | 
|  | }; | 
|  |  | 
|  | has four bytes of padding between a and b on x86-64, plus another four | 
|  | bytes of padding at the end, but no padding on i386, and it needs a | 
|  | compat_ioctl conversion handler to translate between the two formats. | 
|  |  | 
|  | To avoid this problem, all structures should have their members | 
|  | naturally aligned, or explicit reserved fields added in place of the | 
|  | implicit padding. The ``pahole`` tool can be used for checking the | 
|  | alignment. | 
|  |  | 
|  | * On ARM OABI user space, structures are padded to multiples of 32-bit, | 
|  | making some structs incompatible with modern EABI kernels if they | 
|  | do not end on a 32-bit boundary. | 
|  |  | 
|  | * On the m68k architecture, struct members are not guaranteed to have an | 
|  | alignment greater than 16-bit, which is a problem when relying on | 
|  | implicit padding. | 
|  |  | 
|  | * Bitfields and enums generally work as one would expect them to, | 
|  | but some properties of them are implementation-defined, so it is better | 
|  | to avoid them completely in ioctl interfaces. | 
|  |  | 
|  | * ``char`` members can be either signed or unsigned, depending on | 
|  | the architecture, so the __u8 and __s8 types should be used for 8-bit | 
|  | integer values, though char arrays are clearer for fixed-length strings. | 
|  |  | 
|  | Information leaks | 
|  | ================= | 
|  |  | 
|  | Uninitialized data must not be copied back to user space, as this can | 
|  | cause an information leak, which can be used to defeat kernel address | 
|  | space layout randomization (KASLR), helping in an attack. | 
|  |  | 
|  | For this reason (and for compat support) it is best to avoid any | 
|  | implicit padding in data structures.  Where there is implicit padding | 
|  | in an existing structure, kernel drivers must be careful to fully | 
|  | initialize an instance of the structure before copying it to user | 
|  | space.  This is usually done by calling memset() before assigning to | 
|  | individual members. | 
|  |  | 
|  | Subsystem abstractions | 
|  | ====================== | 
|  |  | 
|  | While some device drivers implement their own ioctl function, most | 
|  | subsystems implement the same command for multiple drivers.  Ideally the | 
|  | subsystem has an .ioctl() handler that copies the arguments from and | 
|  | to user space, passing them into subsystem specific callback functions | 
|  | through normal kernel pointers. | 
|  |  | 
|  | This helps in various ways: | 
|  |  | 
|  | * Applications written for one driver are more likely to work for | 
|  | another one in the same subsystem if there are no subtle differences | 
|  | in the user space ABI. | 
|  |  | 
|  | * The complexity of user space access and data structure layout is done | 
|  | in one place, reducing the potential for implementation bugs. | 
|  |  | 
|  | * It is more likely to be reviewed by experienced developers | 
|  | that can spot problems in the interface when the ioctl is shared | 
|  | between multiple drivers than when it is only used in a single driver. | 
|  |  | 
|  | Alternatives to ioctl | 
|  | ===================== | 
|  |  | 
|  | There are many cases in which ioctl is not the best solution for a | 
|  | problem. Alternatives include: | 
|  |  | 
|  | * System calls are a better choice for a system-wide feature that | 
|  | is not tied to a physical device or constrained by the file system | 
|  | permissions of a character device node | 
|  |  | 
|  | * netlink is the preferred way of configuring any network related | 
|  | objects through sockets. | 
|  |  | 
|  | * debugfs is used for ad-hoc interfaces for debugging functionality | 
|  | that does not need to be exposed as a stable interface to applications. | 
|  |  | 
|  | * sysfs is a good way to expose the state of an in-kernel object | 
|  | that is not tied to a file descriptor. | 
|  |  | 
|  | * configfs can be used for more complex configuration than sysfs | 
|  |  | 
|  | * A custom file system can provide extra flexibility with a simple | 
|  | user interface but adds a lot of complexity to the implementation. |