| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" | 
 |         "http://www.w3.org/TR/html4/loose.dtd"> | 
 |         <html> | 
 |         <head><title>A Tour Through TREE_RCU's Data Structures [LWN.net]</title> | 
 |         <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> | 
 |  | 
 |            <p>December 18, 2016</p> | 
 |            <p>This article was contributed by Paul E. McKenney</p> | 
 |  | 
 | <h3>Introduction</h3> | 
 |  | 
 | This document describes RCU's major data structures and their relationship | 
 | to each other. | 
 |  | 
 | <ol> | 
 | <li>	<a href="#Data-Structure Relationships"> | 
 | 	Data-Structure Relationships</a> | 
 | <li>	<a href="#The rcu_state Structure"> | 
 | 	The <tt>rcu_state</tt> Structure</a> | 
 | <li>	<a href="#The rcu_node Structure"> | 
 | 	The <tt>rcu_node</tt> Structure</a> | 
 | <li>	<a href="#The rcu_segcblist Structure"> | 
 | 	The <tt>rcu_segcblist</tt> Structure</a> | 
 | <li>	<a href="#The rcu_data Structure"> | 
 | 	The <tt>rcu_data</tt> Structure</a> | 
 | <li>	<a href="#The rcu_head Structure"> | 
 | 	The <tt>rcu_head</tt> Structure</a> | 
 | <li>	<a href="#RCU-Specific Fields in the task_struct Structure"> | 
 | 	RCU-Specific Fields in the <tt>task_struct</tt> Structure</a> | 
 | <li>	<a href="#Accessor Functions"> | 
 | 	Accessor Functions</a> | 
 | </ol> | 
 |  | 
 | <h3><a name="Data-Structure Relationships">Data-Structure Relationships</a></h3> | 
 |  | 
 | <p>RCU is for all intents and purposes a large state machine, and its | 
 | data structures maintain the state in such a way as to allow RCU readers | 
 | to execute extremely quickly, while also processing the RCU grace periods | 
 | requested by updaters in an efficient and extremely scalable fashion. | 
 | The efficiency and scalability of RCU updaters is provided primarily | 
 | by a combining tree, as shown below: | 
 |  | 
 | </p><p><img src="BigTreeClassicRCU.svg" alt="BigTreeClassicRCU.svg" width="30%"> | 
 |  | 
 | </p><p>This diagram shows an enclosing <tt>rcu_state</tt> structure | 
 | containing a tree of <tt>rcu_node</tt> structures. | 
 | Each leaf node of the <tt>rcu_node</tt> tree has up to 16 | 
 | <tt>rcu_data</tt> structures associated with it, so that there | 
are <tt>NR_CPUS</tt> <tt>rcu_data</tt> structures,
 | one for each possible CPU. | 
 | This structure is adjusted at boot time, if needed, to handle the | 
 | common case where <tt>nr_cpu_ids</tt> is much less than | 
<tt>NR_CPUS</tt>.
For example, a number of Linux distributions set <tt>NR_CPUS=4096</tt>,
 | which results in a three-level <tt>rcu_node</tt> tree. | 
 | If the actual hardware has only 16 CPUs, RCU will adjust itself | 
 | at boot time, resulting in an <tt>rcu_node</tt> tree with only a single node. | 
 |  | 
 | </p><p>The purpose of this combining tree is to allow per-CPU events | 
 | such as quiescent states, dyntick-idle transitions, | 
 | and CPU hotplug operations to be processed efficiently | 
 | and scalably. | 
 | Quiescent states are recorded by the per-CPU <tt>rcu_data</tt> structures, | 
 | and other events are recorded by the leaf-level <tt>rcu_node</tt> | 
 | structures. | 
 | All of these events are combined at each level of the tree until finally | 
 | grace periods are completed at the tree's root <tt>rcu_node</tt> | 
 | structure. | 
 | A grace period can be completed at the root once every CPU | 
 | (or, in the case of <tt>CONFIG_PREEMPT_RCU</tt>, task) | 
 | has passed through a quiescent state. | 
 | Once a grace period has completed, record of that fact is propagated | 
 | back down the tree. | 
 |  | 
 | </p><p>As can be seen from the diagram, on a 64-bit system | 
 | a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout | 
 | of 64 at the root and a fanout of 16 at the leaves. | 
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	Why isn't the fanout at the leaves also 64? | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	Because there are more types of events that affect the leaf-level | 
 | 	<tt>rcu_node</tt> structures than further up the tree. | 
 | 	Therefore, if the leaf <tt>rcu_node</tt> structures have fanout of | 
	64, the contention on these structures' <tt>->lock</tt> fields
 | 	becomes excessive. | 
 | 	Experimentation on a wide variety of systems has shown that a fanout | 
 | 	of 16 works well for the leaves of the <tt>rcu_node</tt> tree. | 
 | 	</font> | 
 |  | 
 | 	<p><font color="ffffff">Of course, further experience with | 
 | 	systems having hundreds or thousands of CPUs may demonstrate | 
 | 	that the fanout for the non-leaf <tt>rcu_node</tt> structures | 
 | 	must also be reduced. | 
 | 	Such reduction can be easily carried out when and if it proves | 
 | 	necessary. | 
 | 	In the meantime, if you are using such a system and running into | 
 | 	contention problems on the non-leaf <tt>rcu_node</tt> structures, | 
 | 	you may use the <tt>CONFIG_RCU_FANOUT</tt> kernel configuration | 
 | 	parameter to reduce the non-leaf fanout as needed. | 
 | 	</font> | 
 |  | 
 | 	<p><font color="ffffff">Kernels built for systems with | 
 | 	strong NUMA characteristics might also need to adjust | 
 | 	<tt>CONFIG_RCU_FANOUT</tt> so that the domains of the | 
 | 	<tt>rcu_node</tt> structures align with hardware boundaries. | 
 | 	However, there has thus far been no need for this. | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
 |  | 
 | <p>If your system has more than 1,024 CPUs (or more than 512 CPUs on | 
 | a 32-bit system), then RCU will automatically add more levels to the | 
 | tree. | 
 | For example, if you are crazy enough to build a 64-bit system with 65,536 | 
 | CPUs, RCU would configure the <tt>rcu_node</tt> tree as follows: | 
 |  | 
 | </p><p><img src="HugeTreeClassicRCU.svg" alt="HugeTreeClassicRCU.svg" width="50%"> | 
 |  | 
 | </p><p>RCU currently permits up to a four-level tree, which on a 64-bit system | 
 | accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for | 
 | 32-bit systems. | 
 | On the other hand, you can set both <tt>CONFIG_RCU_FANOUT</tt> and | 
 | <tt>CONFIG_RCU_FANOUT_LEAF</tt> to be as small as 2, which would result | 
 | in a 16-CPU test using a 4-level tree. | 
 | This can be useful for testing large-system capabilities on small test | 
 | machines. | 
 |  | 
 | </p><p>This multi-level combining tree allows us to get most of the | 
 | performance and scalability | 
 | benefits of partitioning, even though RCU grace-period detection is | 
 | inherently a global operation. | 
 | The trick here is that only the last CPU to report a quiescent state | 
 | into a given <tt>rcu_node</tt> structure need advance to the <tt>rcu_node</tt> | 
 | structure at the next level up the tree. | 
 | This means that at the leaf-level <tt>rcu_node</tt> structure, only | 
 | one access out of sixteen will progress up the tree. | 
 | For the internal <tt>rcu_node</tt> structures, the situation is even | 
 | more extreme:  Only one access out of sixty-four will progress up | 
 | the tree. | 
 | Because the vast majority of the CPUs do not progress up the tree, | 
 | the lock contention remains roughly constant up the tree. | 
 | No matter how many CPUs there are in the system, at most 64 quiescent-state | 
 | reports per grace period will progress all the way to the root | 
 | <tt>rcu_node</tt> structure, thus ensuring that the lock contention | 
 | on that root <tt>rcu_node</tt> structure remains acceptably low. | 
 |  | 
 | </p><p>In effect, the combining tree acts like a big shock absorber, | 
 | keeping lock contention under control at all tree levels regardless | 
 | of the level of loading on the system. | 
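
</p><p>To make this concrete, the following minimal sketch (loosely
modeled on the kernel's <tt>rcu_report_qs_rnp()</tt>, but not the
actual code: locking and grace-period bookkeeping are greatly
simplified) shows how a quiescent-state report might climb the
combining tree using the <tt>->qsmask</tt>, <tt>->grpmask</tt>, and
<tt>->parent</tt> fields described later in this article:

<pre>
 1 static void report_qs_up_tree(struct rcu_node *rnp, unsigned long mask)
 2 {
 3   for (; rnp != NULL; rnp = rnp->parent) {
 4     raw_spin_lock(&rnp->lock);
 5     rnp->qsmask &= ~mask;  /* This child has now reported. */
 6     if (rnp->qsmask) {     /* Other children still pending, */
 7       raw_spin_unlock(&rnp->lock);
 8       return;              /* so stop here. */
 9     }
10     mask = rnp->grpmask;   /* Our bit in the parent's ->qsmask. */
11     raw_spin_unlock(&rnp->lock);
12   }
13   /* The root's ->qsmask is now empty: the grace period may end. */
14 }
</pre>

<p>Only the last child to clear a given node's <tt>->qsmask</tt>
proceeds to that node's parent, which is what keeps lock contention
roughly constant at each level of the tree.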
 |  | 
 | </p><p>RCU updaters wait for normal grace periods by registering | 
 | RCU callbacks, either directly via <tt>call_rcu()</tt> | 
 | or indirectly via <tt>synchronize_rcu()</tt> and friends. | 
 | RCU callbacks are represented by <tt>rcu_head</tt> structures, | 
 | which are queued on <tt>rcu_data</tt> structures while they are | 
 | waiting for a grace period to elapse, as shown in the following figure: | 
 |  | 
 | </p><p><img src="BigTreePreemptRCUBHdyntickCB.svg" alt="BigTreePreemptRCUBHdyntickCB.svg" width="40%"> | 
 |  | 
 | </p><p>This figure shows how <tt>TREE_RCU</tt>'s and | 
 | <tt>PREEMPT_RCU</tt>'s major data structures are related. | 
 | Lesser data structures will be introduced with the algorithms that | 
 | make use of them. | 
 |  | 
 | </p><p>Note that each of the data structures in the above figure has | 
 | its own synchronization: | 
 |  | 
 | <p><ol> | 
<li>	Each <tt>rcu_state</tt> structure has a lock and a mutex,
 | 	and some fields are protected by the corresponding root | 
 | 	<tt>rcu_node</tt> structure's lock. | 
 | <li>	Each <tt>rcu_node</tt> structure has a spinlock. | 
 | <li>	The fields in <tt>rcu_data</tt> are private to the corresponding | 
 | 	CPU, although a few can be read and written by other CPUs. | 
 | </ol> | 
 |  | 
 | <p>It is important to note that different data structures can have | 
 | very different ideas about the state of RCU at any given time. | 
 | For but one example, awareness of the start or end of a given RCU | 
 | grace period propagates slowly through the data structures. | 
 | This slow propagation is absolutely necessary for RCU to have good | 
 | read-side performance. | 
 | If this balkanized implementation seems foreign to you, one useful | 
 | trick is to consider each instance of these data structures to be | 
 | a different person, each having the usual slightly different | 
 | view of reality. | 
 |  | 
 | </p><p>The general role of each of these data structures is as | 
 | follows: | 
 |  | 
 | </p><ol> | 
 | <li>	<tt>rcu_state</tt>: | 
 | 	This structure forms the interconnection between the | 
 | 	<tt>rcu_node</tt> and <tt>rcu_data</tt> structures, | 
 | 	tracks grace periods, serves as short-term repository | 
 | 	for callbacks orphaned by CPU-hotplug events, | 
 | 	maintains <tt>rcu_barrier()</tt> state, | 
 | 	tracks expedited grace-period state, | 
 | 	and maintains state used to force quiescent states when | 
	grace periods extend too long.
 | <li>	<tt>rcu_node</tt>: This structure forms the combining | 
 | 	tree that propagates quiescent-state | 
 | 	information from the leaves to the root, and also propagates | 
 | 	grace-period information from the root to the leaves. | 
 | 	It provides local copies of the grace-period state in order | 
 | 	to allow this information to be accessed in a synchronized | 
 | 	manner without suffering the scalability limitations that | 
 | 	would otherwise be imposed by global locking. | 
 | 	In <tt>CONFIG_PREEMPT_RCU</tt> kernels, it manages the lists | 
 | 	of tasks that have blocked while in their current | 
 | 	RCU read-side critical section. | 
 | 	In <tt>CONFIG_PREEMPT_RCU</tt> with | 
 | 	<tt>CONFIG_RCU_BOOST</tt>, it manages the | 
 | 	per-<tt>rcu_node</tt> priority-boosting | 
 | 	kernel threads (kthreads) and state. | 
 | 	Finally, it records CPU-hotplug state in order to determine | 
 | 	which CPUs should be ignored during a given grace period. | 
 | <li>	<tt>rcu_data</tt>: This per-CPU structure is the | 
 | 	focus of quiescent-state detection and RCU callback queuing. | 
 | 	It also tracks its relationship to the corresponding leaf | 
 | 	<tt>rcu_node</tt> structure to allow more-efficient | 
 | 	propagation of quiescent states up the <tt>rcu_node</tt> | 
 | 	combining tree. | 
 | 	Like the <tt>rcu_node</tt> structure, it provides a local | 
	copy of the grace-period information to allow essentially free
	synchronized access to this information from the corresponding CPU.
 | 	Finally, this structure records past dyntick-idle state | 
 | 	for the corresponding CPU and also tracks statistics. | 
 | <li>	<tt>rcu_head</tt>: | 
 | 	This structure represents RCU callbacks, and is the | 
 | 	only structure allocated and managed by RCU users. | 
 | 	The <tt>rcu_head</tt> structure is normally embedded | 
 | 	within the RCU-protected data structure. | 
 | </ol> | 
 |  | 
 | <p>If all you wanted from this article was a general notion of how | 
 | RCU's data structures are related, you are done. | 
Otherwise, each of the following sections gives more details on
the <tt>rcu_state</tt>, <tt>rcu_node</tt>, <tt>rcu_segcblist</tt>,
<tt>rcu_data</tt>, and <tt>rcu_head</tt> data structures.
 |  | 
 | <h3><a name="The rcu_state Structure"> | 
 | The <tt>rcu_state</tt> Structure</a></h3> | 
 |  | 
 | <p>The <tt>rcu_state</tt> structure is the base structure that | 
 | represents the state of RCU in the system. | 
 | This structure forms the interconnection between the | 
 | <tt>rcu_node</tt> and <tt>rcu_data</tt> structures, | 
 | tracks grace periods, contains the lock used to | 
 | synchronize with CPU-hotplug events, | 
 | and maintains state used to force quiescent states when | 
grace periods extend too long.
 |  | 
 | </p><p>A few of the <tt>rcu_state</tt> structure's fields are discussed, | 
 | singly and in groups, in the following sections. | 
 | The more specialized fields are covered in the discussion of their | 
 | use. | 
 |  | 
 | <h5>Relationship to rcu_node and rcu_data Structures</h5> | 
 |  | 
 | This portion of the <tt>rcu_state</tt> structure is declared | 
 | as follows: | 
 |  | 
 | <pre> | 
 |   1   struct rcu_node node[NUM_RCU_NODES]; | 
 |   2   struct rcu_node *level[NUM_RCU_LVLS + 1]; | 
 |   3   struct rcu_data __percpu *rda; | 
 | </pre> | 
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	Wait a minute! | 
 | 	You said that the <tt>rcu_node</tt> structures formed a tree, | 
 | 	but they are declared as a flat array! | 
 | 	What gives? | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	The tree is laid out in the array. | 
	The first node in the array is the head, the next set of nodes in the
 | 	array are children of the head node, and so on until the last set of | 
 | 	nodes in the array are the leaves. | 
 | 	</font> | 
 |  | 
 | 	<p><font color="ffffff">See the following diagrams to see how | 
 | 	this works. | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
 |  | 
 | <p>The <tt>rcu_node</tt> tree is embedded into the | 
 | <tt>->node[]</tt> array as shown in the following figure: | 
 |  | 
 | </p><p><img src="TreeMapping.svg" alt="TreeMapping.svg" width="40%"> | 
 |  | 
 | </p><p>One interesting consequence of this mapping is that a | 
 | breadth-first traversal of the tree is implemented as a simple | 
 | linear scan of the array, which is in fact what the | 
 | <tt>rcu_for_each_node_breadth_first()</tt> macro does. | 
This macro is used at the beginnings and ends of grace periods.
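
</p><p>As a sketch of what such a macro might look like (the actual
definition lives in kernel/rcu/tree.h and has changed over time), a
breadth-first traversal really is just a linear walk of the array:

<pre>
 1 #define rcu_for_each_node_breadth_first(rsp, rnp) \
 2   for ((rnp) = &(rsp)->node[0]; \
 3        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
</pre>

<p>Each iteration visits the next <tt>rcu_node</tt> structure in the
<tt>->node[]</tt> array, which, given the mapping described above,
visits every parent before any of its children.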
 |  | 
 | </p><p>Each entry of the <tt>->level</tt> array references | 
 | the first <tt>rcu_node</tt> structure on the corresponding level | 
 | of the tree, for example, as shown below: | 
 |  | 
 | </p><p><img src="TreeMappingLevel.svg" alt="TreeMappingLevel.svg" width="40%"> | 
 |  | 
 | </p><p>The zero<sup>th</sup> element of the array references the root | 
 | <tt>rcu_node</tt> structure, the first element references the | 
 | first child of the root <tt>rcu_node</tt>, and finally the second | 
 | element references the first leaf <tt>rcu_node</tt> structure. | 
 |  | 
 | </p><p>For whatever it is worth, if you draw the tree to be tree-shaped | 
 | rather than array-shaped, it is easy to draw a planar representation: | 
 |  | 
 | </p><p><img src="TreeLevel.svg" alt="TreeLevel.svg" width="60%"> | 
 |  | 
 | </p><p>Finally, the <tt>->rda</tt> field references a per-CPU | 
 | pointer to the corresponding CPU's <tt>rcu_data</tt> structure. | 
 |  | 
 | </p><p>All of these fields are constant once initialization is complete, | 
 | and therefore need no protection. | 
 |  | 
 | <h5>Grace-Period Tracking</h5> | 
 |  | 
 | <p>This portion of the <tt>rcu_state</tt> structure is declared | 
 | as follows: | 
 |  | 
 | <pre> | 
 |   1   unsigned long gp_seq; | 
 | </pre> | 
 |  | 
 | <p>RCU grace periods are numbered, and | 
 | the <tt>->gp_seq</tt> field contains the current grace-period | 
 | sequence number. | 
 | The bottom two bits are the state of the current grace period, | 
 | which can be zero for not yet started or one for in progress. | 
 | In other words, if the bottom two bits of <tt>->gp_seq</tt> are | 
 | zero, then RCU is idle. | 
 | Any other value in the bottom two bits indicates that something is broken. | 
 | This field is protected by the root <tt>rcu_node</tt> structure's | 
 | <tt>->lock</tt> field. | 
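
</p><p>This encoding can be captured by a pair of helpers along the
lines of those in kernel/rcu/rcu.h (shown here as a simplified sketch):

<pre>
 1 #define RCU_SEQ_CTR_SHIFT  2
 2 #define RCU_SEQ_STATE_MASK ((1 << RCU_SEQ_CTR_SHIFT) - 1)
 3
 4 static unsigned long rcu_seq_ctr(unsigned long s)
 5 {
 6   return s >> RCU_SEQ_CTR_SHIFT;  /* Grace-period number proper. */
 7 }
 8
 9 static int rcu_seq_state(unsigned long s)
10 {
11   return s & RCU_SEQ_STATE_MASK;  /* 0: idle, 1: in progress. */
12 }
</pre>

<p>For example, <tt>rcu_seq_state(rsp->gp_seq) == 0</tt> is another way
of saying that RCU is idle.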
 |  | 
 | </p><p>There are <tt>->gp_seq</tt> fields | 
 | in the <tt>rcu_node</tt> and <tt>rcu_data</tt> structures | 
 | as well. | 
 | The fields in the <tt>rcu_state</tt> structure represent the | 
 | most current value, and those of the other structures are compared | 
 | in order to detect the beginnings and ends of grace periods in a distributed | 
 | fashion. | 
 | The values flow from <tt>rcu_state</tt> to <tt>rcu_node</tt> | 
 | (down the tree from the root to the leaves) to <tt>rcu_data</tt>. | 
 |  | 
 | <h5>Miscellaneous</h5> | 
 |  | 
 | <p>This portion of the <tt>rcu_state</tt> structure is declared | 
 | as follows: | 
 |  | 
 | <pre> | 
 |   1   unsigned long gp_max; | 
 |   2   char abbr; | 
 |   3   char *name; | 
 | </pre> | 
 |  | 
 | <p>The <tt>->gp_max</tt> field tracks the duration of the longest | 
 | grace period in jiffies. | 
 | It is protected by the root <tt>rcu_node</tt>'s <tt>->lock</tt>. | 
 |  | 
 | <p>The <tt>->name</tt> and <tt>->abbr</tt> fields distinguish | 
 | between preemptible RCU (“rcu_preempt” and “p”) | 
 | and non-preemptible RCU (“rcu_sched” and “s”). | 
 | These fields are used for diagnostic and tracing purposes. | 
 |  | 
 | <h3><a name="The rcu_node Structure"> | 
 | The <tt>rcu_node</tt> Structure</a></h3> | 
 |  | 
 | <p>The <tt>rcu_node</tt> structures form the combining | 
 | tree that propagates quiescent-state | 
 | information from the leaves to the root and also that propagates | 
 | grace-period information from the root down to the leaves. | 
They provide local copies of the grace-period state in order
 | to allow this information to be accessed in a synchronized | 
 | manner without suffering the scalability limitations that | 
 | would otherwise be imposed by global locking. | 
 | In <tt>CONFIG_PREEMPT_RCU</tt> kernels, they manage the lists | 
 | of tasks that have blocked while in their current | 
 | RCU read-side critical section. | 
 | In <tt>CONFIG_PREEMPT_RCU</tt> with | 
 | <tt>CONFIG_RCU_BOOST</tt>, they manage the | 
 | per-<tt>rcu_node</tt> priority-boosting | 
 | kernel threads (kthreads) and state. | 
 | Finally, they record CPU-hotplug state in order to determine | 
 | which CPUs should be ignored during a given grace period. | 
 |  | 
 | </p><p>The <tt>rcu_node</tt> structure's fields are discussed, | 
 | singly and in groups, in the following sections. | 
 |  | 
 | <h5>Connection to Combining Tree</h5> | 
 |  | 
 | <p>This portion of the <tt>rcu_node</tt> structure is declared | 
 | as follows: | 
 |  | 
 | <pre> | 
 |   1   struct rcu_node *parent; | 
 |   2   u8 level; | 
 |   3   u8 grpnum; | 
 |   4   unsigned long grpmask; | 
 |   5   int grplo; | 
 |   6   int grphi; | 
 | </pre> | 
 |  | 
 | <p>The <tt>->parent</tt> pointer references the <tt>rcu_node</tt> | 
 | one level up in the tree, and is <tt>NULL</tt> for the root | 
 | <tt>rcu_node</tt>. | 
 | The RCU implementation makes heavy use of this field to push quiescent | 
 | states up the tree. | 
 | The <tt>->level</tt> field gives the level in the tree, with | 
 | the root being at level zero, its children at level one, and so on. | 
 | The <tt>->grpnum</tt> field gives this node's position within | 
 | the children of its parent, so this number can range between 0 and 31 | 
 | on 32-bit systems and between 0 and 63 on 64-bit systems. | 
 | The <tt>->level</tt> and <tt>->grpnum</tt> fields are | 
 | used only during initialization and for tracing. | 
 | The <tt>->grpmask</tt> field is the bitmask counterpart of | 
 | <tt>->grpnum</tt>, and therefore always has exactly one bit set. | 
 | This mask is used to clear the bit corresponding to this <tt>rcu_node</tt> | 
 | structure in its parent's bitmasks, which are described later. | 
 | Finally, the <tt>->grplo</tt> and <tt>->grphi</tt> fields | 
 | contain the lowest and highest numbered CPU served by this | 
 | <tt>rcu_node</tt> structure, respectively. | 
 |  | 
 | </p><p>All of these fields are constant, and thus do not require any | 
 | synchronization. | 
 |  | 
 | <h5>Synchronization</h5> | 
 |  | 
 | <p>This field of the <tt>rcu_node</tt> structure is declared | 
 | as follows: | 
 |  | 
 | <pre> | 
 |   1   raw_spinlock_t lock; | 
 | </pre> | 
 |  | 
 | <p>This field is used to protect the remaining fields in this structure, | 
 | unless otherwise stated. | 
 | That said, all of the fields in this structure can be accessed without | 
 | locking for tracing purposes. | 
 | Yes, this can result in confusing traces, but better some tracing confusion | 
 | than to be heisenbugged out of existence. | 
 |  | 
 | <h5>Grace-Period Tracking</h5> | 
 |  | 
 | <p>This portion of the <tt>rcu_node</tt> structure is declared | 
 | as follows: | 
 |  | 
 | <pre> | 
 |   1   unsigned long gp_seq; | 
 |   2   unsigned long gp_seq_needed; | 
 | </pre> | 
 |  | 
 | <p>The <tt>rcu_node</tt> structures' <tt>->gp_seq</tt> fields are | 
 | the counterparts of the field of the same name in the <tt>rcu_state</tt> | 
 | structure. | 
 | They each may lag up to one step behind their <tt>rcu_state</tt> | 
 | counterpart. | 
 | If the bottom two bits of a given <tt>rcu_node</tt> structure's | 
 | <tt>->gp_seq</tt> field is zero, then this <tt>rcu_node</tt> | 
 | structure believes that RCU is idle. | 
</p><p>The <tt>->gp_seq</tt> field of each <tt>rcu_node</tt>
 | structure is updated at the beginning and the end | 
 | of each grace period. | 
 |  | 
 | <p>The <tt>->gp_seq_needed</tt> fields record the | 
 | furthest-in-the-future grace period request seen by the corresponding | 
 | <tt>rcu_node</tt> structure.  The request is considered fulfilled when | 
 | the value of the <tt>->gp_seq</tt> field equals or exceeds that of | 
 | the <tt>->gp_seq_needed</tt> field. | 
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	Suppose that this <tt>rcu_node</tt> structure doesn't see | 
 | 	a request for a very long time. | 
 | 	Won't wrapping of the <tt>->gp_seq</tt> field cause | 
 | 	problems? | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	No, because if the <tt>->gp_seq_needed</tt> field lags behind the | 
 | 	<tt>->gp_seq</tt> field, the <tt>->gp_seq_needed</tt> field | 
 | 	will be updated at the end of the grace period. | 
 | 	Modulo-arithmetic comparisons therefore will always get the | 
 | 	correct answer, even with wrapping. | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
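
<p>The wrap-robust comparison alluded to above relies on unsigned
modular arithmetic, as done by the <tt>ULONG_CMP_GE()</tt> macro
in include/linux/rcupdate.h:

<pre>
 1 /* True if a is at or after b, even if the counter has wrapped. */
 2 #define ULONG_CMP_GE(a, b)  (ULONG_MAX / 2 >= (a) - (b))
</pre>

<p>A grace-period request is thus deemed fulfilled when
<tt>ULONG_CMP_GE(rnp->gp_seq, rnp->gp_seq_needed)</tt> holds.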
 |  | 
 | <h5>Quiescent-State Tracking</h5> | 
 |  | 
 | <p>These fields manage the propagation of quiescent states up the | 
 | combining tree. | 
 |  | 
 | </p><p>This portion of the <tt>rcu_node</tt> structure has fields | 
 | as follows: | 
 |  | 
 | <pre> | 
 |   1   unsigned long qsmask; | 
 |   2   unsigned long expmask; | 
 |   3   unsigned long qsmaskinit; | 
 |   4   unsigned long expmaskinit; | 
 | </pre> | 
 |  | 
 | <p>The <tt>->qsmask</tt> field tracks which of this | 
 | <tt>rcu_node</tt> structure's children still need to report | 
 | quiescent states for the current normal grace period. | 
Each such child is represented by a 1 in the corresponding bit.
 | Note that the leaf <tt>rcu_node</tt> structures should be | 
 | thought of as having <tt>rcu_data</tt> structures as their | 
 | children. | 
 | Similarly, the <tt>->expmask</tt> field tracks which | 
 | of this <tt>rcu_node</tt> structure's children still need to report | 
 | quiescent states for the current expedited grace period. | 
 | An expedited grace period has | 
 | the same conceptual properties as a normal grace period, but the | 
 | expedited implementation accepts extreme CPU overhead to obtain | 
 | much lower grace-period latency, for example, consuming a few | 
 | tens of microseconds worth of CPU time to reduce grace-period | 
 | duration from milliseconds to tens of microseconds. | 
 | The <tt>->qsmaskinit</tt> field tracks which of this | 
 | <tt>rcu_node</tt> structure's children cover for at least | 
 | one online CPU. | 
 | This mask is used to initialize <tt>->qsmask</tt>, | 
 | and <tt>->expmaskinit</tt> is used to initialize | 
<tt>->expmask</tt>, at the beginning of the
 | normal and expedited grace periods, respectively. | 
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	Why are these bitmasks protected by locking? | 
 | 	Come on, haven't you heard of atomic instructions??? | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	Lockless grace-period computation!  Such a tantalizing possibility! | 
 | 	</font> | 
 |  | 
 | 	<p><font color="ffffff">But consider the following sequence of events: | 
 | 	</font> | 
 |  | 
 | 	<ol> | 
 | 	<li>	<font color="ffffff">CPU 0 has been in dyntick-idle | 
 | 		mode for quite some time. | 
 | 		When it wakes up, it notices that the current RCU | 
 | 		grace period needs it to report in, so it sets a | 
 | 		flag where the scheduling clock interrupt will find it. | 
 | 		</font><p> | 
 | 	<li>	<font color="ffffff">Meanwhile, CPU 1 is running | 
 | 		<tt>force_quiescent_state()</tt>, | 
 | 		and notices that CPU 0 has been in dyntick idle mode, | 
 | 		which qualifies as an extended quiescent state. | 
 | 		</font><p> | 
 | 	<li>	<font color="ffffff">CPU 0's scheduling clock | 
 | 		interrupt fires in the | 
 | 		middle of an RCU read-side critical section, and notices | 
 | 		that the RCU core needs something, so commences RCU softirq | 
 | 		processing. | 
 | 		</font> | 
 | 		<p> | 
 | 	<li>	<font color="ffffff">CPU 0's softirq handler | 
 | 		executes and is just about ready | 
 | 		to report its quiescent state up the <tt>rcu_node</tt> | 
 | 		tree. | 
 | 		</font><p> | 
 | 	<li>	<font color="ffffff">But CPU 1 beats it to the punch, | 
 | 		completing the current | 
 | 		grace period and starting a new one. | 
 | 		</font><p> | 
 | 	<li>	<font color="ffffff">CPU 0 now reports its quiescent | 
 | 		state for the wrong | 
 | 		grace period. | 
 | 		That grace period might now end before the RCU read-side | 
 | 		critical section. | 
 | 		If that happens, disaster will ensue. | 
 | 		</font> | 
 | 	</ol> | 
 |  | 
 | 	<p><font color="ffffff">So the locking is absolutely required in | 
 | 	order to coordinate clearing of the bits with updating of the | 
 | 	grace-period sequence number in <tt>->gp_seq</tt>. | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
 |  | 
 | <h5>Blocked-Task Management</h5> | 
 |  | 
 | <p><tt>PREEMPT_RCU</tt> allows tasks to be preempted in the | 
 | midst of their RCU read-side critical sections, and these tasks | 
 | must be tracked explicitly. | 
 | The details of exactly why and how they are tracked will be covered | 
 | in a separate article on RCU read-side processing. | 
 | For now, it is enough to know that the <tt>rcu_node</tt> | 
 | structure tracks them. | 
 |  | 
 | <pre> | 
 |   1   struct list_head blkd_tasks; | 
 |   2   struct list_head *gp_tasks; | 
 |   3   struct list_head *exp_tasks; | 
 |   4   bool wait_blkd_tasks; | 
 | </pre> | 
 |  | 
 | <p>The <tt>->blkd_tasks</tt> field is a list header for | 
 | the list of blocked and preempted tasks. | 
 | As tasks undergo context switches within RCU read-side critical | 
 | sections, their <tt>task_struct</tt> structures are enqueued | 
 | (via the <tt>task_struct</tt>'s <tt>->rcu_node_entry</tt> | 
 | field) onto the head of the <tt>->blkd_tasks</tt> list for the | 
 | leaf <tt>rcu_node</tt> structure corresponding to the CPU | 
 | on which the outgoing context switch executed. | 
 | As these tasks later exit their RCU read-side critical sections, | 
 | they remove themselves from the list. | 
 | This list is therefore in reverse time order, so that if one of the tasks | 
 | is blocking the current grace period, all subsequent tasks must | 
 | also be blocking that same grace period. | 
 | Therefore, a single pointer into this list suffices to track | 
 | all tasks blocking a given grace period. | 
 | That pointer is stored in <tt>->gp_tasks</tt> for normal | 
 | grace periods and in <tt>->exp_tasks</tt> for expedited | 
 | grace periods. | 
 | These last two fields are <tt>NULL</tt> if either there is | 
 | no grace period in flight or if there are no blocked tasks | 
 | preventing that grace period from completing. | 
 | If either of these two pointers is referencing a task that | 
 | removes itself from the <tt>->blkd_tasks</tt> list, | 
 | then that task must advance the pointer to the next task on | 
 | the list, or set the pointer to <tt>NULL</tt> if there | 
 | are no subsequent tasks on the list. | 
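
</p><p>A simplified sketch of this pointer-advancing step (loosely
modeled on the list manipulation in kernel/rcu/tree_plugin.h, with
locking and quiescent-state reporting omitted, and with an
illustrative function name) might look like this:

<pre>
 1 static void remove_blocked_task(struct rcu_node *rnp, struct task_struct *t)
 2 {
 3   struct list_head *np = t->rcu_node_entry.next;
 4
 5   if (np == &rnp->blkd_tasks)
 6     np = NULL;                /* No subsequent tasks on the list. */
 7   if (rnp->gp_tasks == &t->rcu_node_entry)
 8     rnp->gp_tasks = np;       /* Advance the normal-GP pointer. */
 9   if (rnp->exp_tasks == &t->rcu_node_entry)
10     rnp->exp_tasks = np;      /* Advance the expedited-GP pointer. */
11   list_del_init(&t->rcu_node_entry);
12 }
</pre>

<p>Tasks that block neither the normal nor the expedited grace period
simply unlink themselves without touching either pointer.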
 |  | 
 | </p><p>For example, suppose that tasks T1, T2, and T3 are | 
 | all hard-affinitied to the largest-numbered CPU in the system. | 
 | Then if task T1 blocked in an RCU read-side | 
 | critical section, then an expedited grace period started, | 
 | then task T2 blocked in an RCU read-side critical section, | 
then a normal grace period started, and finally task T3 blocked
 | in an RCU read-side critical section, then the state of the | 
 | last leaf <tt>rcu_node</tt> structure's blocked-task list | 
 | would be as shown below: | 
 |  | 
 | </p><p><img src="blkd_task.svg" alt="blkd_task.svg" width="60%"> | 
 |  | 
 | </p><p>Task T1 is blocking both grace periods, task T2 is | 
 | blocking only the normal grace period, and task T3 is blocking | 
 | neither grace period. | 
 | Note that these tasks will not remove themselves from this list | 
 | immediately upon resuming execution. | 
 | They will instead remain on the list until they execute the outermost | 
 | <tt>rcu_read_unlock()</tt> that ends their RCU read-side critical | 
 | section. | 
 |  | 
 | <p> | 
 | The <tt>->wait_blkd_tasks</tt> field indicates whether or not | 
 | the current grace period is waiting on a blocked task. | 
 |  | 
 | <h5>Sizing the <tt>rcu_node</tt> Array</h5> | 
 |  | 
 | <p>The <tt>rcu_node</tt> array is sized via a series of | 
 | C-preprocessor expressions as follows: | 
 |  | 
 | <pre> | 
 |  1 #ifdef CONFIG_RCU_FANOUT | 
 |  2 #define RCU_FANOUT CONFIG_RCU_FANOUT | 
 |  3 #else | 
 |  4 # ifdef CONFIG_64BIT | 
 |  5 # define RCU_FANOUT 64 | 
 |  6 # else | 
 |  7 # define RCU_FANOUT 32 | 
 |  8 # endif | 
 |  9 #endif | 
 | 10 | 
 | 11 #ifdef CONFIG_RCU_FANOUT_LEAF | 
 | 12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF | 
 | 13 #else | 
 | 14 # ifdef CONFIG_64BIT | 
 | 15 # define RCU_FANOUT_LEAF 64 | 
 | 16 # else | 
 | 17 # define RCU_FANOUT_LEAF 32 | 
 | 18 # endif | 
 | 19 #endif | 
 | 20 | 
 | 21 #define RCU_FANOUT_1        (RCU_FANOUT_LEAF) | 
 | 22 #define RCU_FANOUT_2        (RCU_FANOUT_1 * RCU_FANOUT) | 
 | 23 #define RCU_FANOUT_3        (RCU_FANOUT_2 * RCU_FANOUT) | 
 | 24 #define RCU_FANOUT_4        (RCU_FANOUT_3 * RCU_FANOUT) | 
 | 25 | 
 | 26 #if NR_CPUS <= RCU_FANOUT_1 | 
 | 27 #  define RCU_NUM_LVLS        1 | 
 | 28 #  define NUM_RCU_LVL_0        1 | 
 | 29 #  define NUM_RCU_NODES        NUM_RCU_LVL_0 | 
 | 30 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0 } | 
 | 31 #  define RCU_NODE_NAME_INIT  { "rcu_node_0" } | 
 | 32 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0" } | 
 | 33 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0" } | 
 | 34 #elif NR_CPUS <= RCU_FANOUT_2 | 
 | 35 #  define RCU_NUM_LVLS        2 | 
 | 36 #  define NUM_RCU_LVL_0        1 | 
 | 37 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) | 
 | 38 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1) | 
 | 39 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1 } | 
 | 40 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1" } | 
 | 41 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1" } | 
 | 42 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1" } | 
 | 43 #elif NR_CPUS <= RCU_FANOUT_3 | 
 | 44 #  define RCU_NUM_LVLS        3 | 
 | 45 #  define NUM_RCU_LVL_0        1 | 
 | 46 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) | 
 | 47 #  define NUM_RCU_LVL_2        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) | 
 | 48 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2) | 
 | 49 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 } | 
 | 50 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1", "rcu_node_2" } | 
 | 51 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" } | 
 | 52 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" } | 
 | 53 #elif NR_CPUS <= RCU_FANOUT_4 | 
 | 54 #  define RCU_NUM_LVLS        4 | 
 | 55 #  define NUM_RCU_LVL_0        1 | 
 | 56 #  define NUM_RCU_LVL_1        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3) | 
 | 57 #  define NUM_RCU_LVL_2        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) | 
 | 58 #  define NUM_RCU_LVL_3        DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) | 
 | 59 #  define NUM_RCU_NODES        (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3) | 
 | 60 #  define NUM_RCU_LVL_INIT    { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 } | 
 | 61 #  define RCU_NODE_NAME_INIT  { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" } | 
 | 62 #  define RCU_FQS_NAME_INIT   { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" } | 
 | 63 #  define RCU_EXP_NAME_INIT   { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" } | 
 | 64 #else | 
 | 65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS" | 
 | 66 #endif | 
 | </pre> | 
 |  | 
<p>The maximum number of levels in the <tt>rcu_node</tt> tree
 | is currently limited to four, as specified by lines 21-24 | 
 | and the structure of the subsequent “if” statement. | 
 | For 32-bit systems, this allows 16*32*32*32=524,288 CPUs, which | 
 | should be sufficient for the next few years at least. | 
 | For 64-bit systems, 16*64*64*64=4,194,304 CPUs is allowed, which | 
 | should see us through the next decade or so. | 
 | This four-level tree also allows kernels built with | 
 | <tt>CONFIG_RCU_FANOUT=8</tt> to support up to 4096 CPUs, | 
 | which might be useful in very large systems having eight CPUs per | 
 | socket (but please note that no one has yet shown any measurable | 
 | performance degradation due to misaligned socket and <tt>rcu_node</tt> | 
 | boundaries). | 
 | In addition, building kernels with a full four levels of <tt>rcu_node</tt> | 
 | tree permits better testing of RCU's combining-tree code. | 
 |  | 
 | </p><p>The <tt>RCU_FANOUT</tt> symbol controls how many children | 
 | are permitted at each non-leaf level of the <tt>rcu_node</tt> tree. | 
 | If the <tt>CONFIG_RCU_FANOUT</tt> Kconfig option is not specified, | 
 | it is set based on the word size of the system, which is also | 
 | the Kconfig default. | 
 |  | 
 | </p><p>The <tt>RCU_FANOUT_LEAF</tt> symbol controls how many CPUs are | 
 | handled by each leaf <tt>rcu_node</tt> structure. | 
 | Experience has shown that allowing a given leaf <tt>rcu_node</tt> | 
 | structure to handle 64 CPUs, as permitted by the number of bits in | 
 | the <tt>->qsmask</tt> field on a 64-bit system, results in | 
 | excessive contention for the leaf <tt>rcu_node</tt> structures' | 
 | <tt>->lock</tt> fields. | 
 | The number of CPUs per leaf <tt>rcu_node</tt> structure is therefore | 
 | limited to 16 given the default value of <tt>CONFIG_RCU_FANOUT_LEAF</tt>. | 
 | If <tt>CONFIG_RCU_FANOUT_LEAF</tt> is unspecified, the value | 
 | selected is based on the word size of the system, just as for | 
 | <tt>CONFIG_RCU_FANOUT</tt>. | 
 | Lines 11-19 perform this computation. | 
 |  | 
 | </p><p>Lines 21-24 compute the maximum number of CPUs supported by | 
 | a single-level (which contains a single <tt>rcu_node</tt> structure), | 
 | two-level, three-level, and four-level <tt>rcu_node</tt> tree, | 
 | respectively, given the fanout specified by <tt>RCU_FANOUT</tt> | 
 | and <tt>RCU_FANOUT_LEAF</tt>. | 
 | These numbers of CPUs are retained in the | 
 | <tt>RCU_FANOUT_1</tt>, | 
 | <tt>RCU_FANOUT_2</tt>, | 
 | <tt>RCU_FANOUT_3</tt>, and | 
 | <tt>RCU_FANOUT_4</tt> | 
 | C-preprocessor variables, respectively. | 
 |  | 
 | </p><p>These variables are used to control the C-preprocessor <tt>#if</tt> | 
 | statement spanning lines 26-66 that computes the number of | 
 | <tt>rcu_node</tt> structures required for each level of the tree, | 
 | as well as the number of levels required. | 
The number of levels is placed in the <tt>RCU_NUM_LVLS</tt>
 | C-preprocessor variable by lines 27, 35, 44, and 54. | 
 | The number of <tt>rcu_node</tt> structures for the topmost level | 
 | of the tree is always exactly one, and this value is unconditionally | 
 | placed into <tt>NUM_RCU_LVL_0</tt> by lines 28, 36, 45, and 55. | 
 | The rest of the levels (if any) of the <tt>rcu_node</tt> tree | 
 | are computed by dividing the maximum number of CPUs by the | 
 | fanout supported by the number of levels from the current level down, | 
 | rounding up.  This computation is performed by lines 37, | 
 | 46-47, and 56-58. | 
Lines 31-33, 40-42, 50-52, and 61-63 create initializers
 | for lockdep lock-class names. | 
 | Finally, lines 64-66 produce an error if the maximum number of | 
 | CPUs is too large for the specified fanout. | 
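
<p>As a worked example, consider the 64-bit defaults of
<tt>RCU_FANOUT=64</tt> and <tt>RCU_FANOUT_LEAF=16</tt> with
<tt>NR_CPUS=1024</tt>.
Then <tt>RCU_FANOUT_1=16</tt> and <tt>RCU_FANOUT_2=1024</tt>, so the
<tt>NR_CPUS <= RCU_FANOUT_2</tt> case applies, giving
<tt>RCU_NUM_LVLS=2</tt>, <tt>NUM_RCU_LVL_0=1</tt>,
<tt>NUM_RCU_LVL_1=DIV_ROUND_UP(1024, 16)=64</tt>, and therefore
<tt>NUM_RCU_NODES=65</tt>: exactly the two-level, 64-leaf tree
shown at the beginning of this article.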
 |  | 
 | <h3><a name="The rcu_segcblist Structure"> | 
 | The <tt>rcu_segcblist</tt> Structure</a></h3> | 
 |  | 
 | The <tt>rcu_segcblist</tt> structure maintains a segmented list of | 
 | callbacks as follows: | 
 |  | 
 | <pre> | 
 |  1 #define RCU_DONE_TAIL        0 | 
 |  2 #define RCU_WAIT_TAIL        1 | 
 |  3 #define RCU_NEXT_READY_TAIL  2 | 
 |  4 #define RCU_NEXT_TAIL        3 | 
 |  5 #define RCU_CBLIST_NSEGS     4 | 
 |  6 | 
 |  7 struct rcu_segcblist { | 
 |  8   struct rcu_head *head; | 
 |  9   struct rcu_head **tails[RCU_CBLIST_NSEGS]; | 
 | 10   unsigned long gp_seq[RCU_CBLIST_NSEGS]; | 
 | 11   long len; | 
 | 12   long len_lazy; | 
 | 13 }; | 
 | </pre> | 
 |  | 
 | <p> | 
 | The segments are as follows: | 
 |  | 
 | <ol> | 
 | <li>	<tt>RCU_DONE_TAIL</tt>: Callbacks whose grace periods have elapsed. | 
 | 	These callbacks are ready to be invoked. | 
 | <li>	<tt>RCU_WAIT_TAIL</tt>: Callbacks that are waiting for the | 
 | 	current grace period. | 
 | 	Note that different CPUs can have different ideas about which | 
 | 	grace period is current, hence the <tt>->gp_seq</tt> field. | 
 | <li>	<tt>RCU_NEXT_READY_TAIL</tt>: Callbacks waiting for the next | 
 | 	grace period to start. | 
 | <li>	<tt>RCU_NEXT_TAIL</tt>: Callbacks that have not yet been | 
 | 	associated with a grace period. | 
 | </ol> | 
 |  | 
 | <p> | 
 | The <tt>->head</tt> pointer references the first callback or | 
 | is <tt>NULL</tt> if the list contains no callbacks (which is | 
 | <i>not</i> the same as being empty). | 
 | Each element of the <tt>->tails[]</tt> array references the | 
 | <tt>->next</tt> pointer of the last callback in the corresponding | 
 | segment of the list, or the list's <tt>->head</tt> pointer if | 
 | that segment and all previous segments are empty. | 
 | If the corresponding segment is empty but some previous segment is | 
 | not empty, then the array element is identical to its predecessor. | 
 | Older callbacks are closer to the head of the list, and new callbacks | 
 | are added at the tail. | 
 | This relationship between the <tt>->head</tt> pointer, the | 
 | <tt>->tails[]</tt> array, and the callbacks is shown in this | 
 | diagram: | 
 |  | 
 | </p><p><img src="nxtlist.svg" alt="nxtlist.svg" width="40%"> | 
 |  | 
 | </p><p>In this figure, the <tt>->head</tt> pointer references the | 
 | first | 
 | RCU callback in the list. | 
 | The <tt>->tails[RCU_DONE_TAIL]</tt> array element references | 
 | the <tt>->head</tt> pointer itself, indicating that none | 
 | of the callbacks is ready to invoke. | 
 | The <tt>->tails[RCU_WAIT_TAIL]</tt> array element references callback | 
 | CB 2's <tt>->next</tt> pointer, which indicates that | 
 | CB 1 and CB 2 are both waiting on the current grace period, | 
 | give or take possible disagreements about exactly which grace period | 
 | is the current one. | 
 | The <tt>->tails[RCU_NEXT_READY_TAIL]</tt> array element | 
 | references the same RCU callback that <tt>->tails[RCU_WAIT_TAIL]</tt> | 
 | does, which indicates that there are no callbacks waiting on the next | 
 | RCU grace period. | 
 | The <tt>->tails[RCU_NEXT_TAIL]</tt> array element references | 
 | CB 4's <tt>->next</tt> pointer, indicating that all the | 
 | remaining RCU callbacks have not yet been assigned to an RCU grace | 
 | period. | 
 | Note that the <tt>->tails[RCU_NEXT_TAIL]</tt> array element | 
 | always references the last RCU callback's <tt>->next</tt> pointer | 
 | unless the callback list is empty, in which case it references | 
 | the <tt>->head</tt> pointer. | 
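
</p><p>New callbacks are always enqueued at the <tt>RCU_NEXT_TAIL</tt>
segment. A minimal sketch of this operation (patterned after
<tt>rcu_segcblist_enqueue()</tt> in kernel/rcu/rcu_segcblist.c, but
omitting lazy-callback accounting and atomicity concerns) is as
follows:

<pre>
 1 static void segcblist_enqueue(struct rcu_segcblist *rsclp,
 2                               struct rcu_head *rhp)
 3 {
 4   rsclp->len++;                              /* One more callback... */
 5   rhp->next = NULL;
 6   *rsclp->tails[RCU_NEXT_TAIL] = rhp;        /* ...linked in at the end... */
 7   rsclp->tails[RCU_NEXT_TAIL] = &rhp->next;  /* ...as the new last element. */
 8 }
</pre>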
 |  | 
 | <p> | 
 | There is one additional important special case for the | 
 | <tt>->tails[RCU_NEXT_TAIL]</tt> array element: It can be <tt>NULL</tt> | 
 | when this list is <i>disabled</i>. | 
 | Lists are disabled when the corresponding CPU is offline or when | 
 | the corresponding CPU's callbacks are offloaded to a kthread, | 
 | both of which are described elsewhere. | 
 |  | 
 | </p><p>CPUs advance their callbacks from the | 
 | <tt>RCU_NEXT_TAIL</tt> to the <tt>RCU_NEXT_READY_TAIL</tt> to the | 
 | <tt>RCU_WAIT_TAIL</tt> to the <tt>RCU_DONE_TAIL</tt> list segments | 
 | as grace periods advance. | 
 |  | 
 | </p><p>The <tt>->gp_seq[]</tt> array records grace-period | 
 | numbers corresponding to the list segments. | 
 | This is what allows different CPUs to have different ideas as to | 
 | which is the current grace period while still avoiding premature | 
 | invocation of their callbacks. | 
 | In particular, this allows CPUs that go idle for extended periods | 
 | to determine which of their callbacks are ready to be invoked after | 
 | reawakening. | 
 |  | 
 | </p><p>The <tt>->len</tt> counter contains the number of | 
 | callbacks in <tt>->head</tt>, and the | 
 | <tt>->len_lazy</tt> contains the number of those callbacks that | 
 | are known to only free memory, and whose invocation can therefore | 
 | be safely deferred. | 
 |  | 
 | <p><b>Important note</b>: It is the <tt>->len</tt> field that | 
 | determines whether or not there are callbacks associated with | 
 | this <tt>rcu_segcblist</tt> structure, <i>not</i> the <tt>->head</tt> | 
 | pointer. | 
 | The reason for this is that all the ready-to-invoke callbacks | 
 | (that is, those in the <tt>RCU_DONE_TAIL</tt> segment) are extracted | 
all at once at callback-invocation time (<tt>rcu_do_batch()</tt>), so
that <tt>->head</tt> may be set to <tt>NULL</tt> if there are no
not-yet-done callbacks remaining in the <tt>rcu_segcblist</tt>.
 | If callback invocation must be postponed, for example, because a | 
 | high-priority process just woke up on this CPU, then the remaining | 
 | callbacks are placed back on the <tt>RCU_DONE_TAIL</tt> segment and | 
 | <tt>->head</tt> once again points to the start of the segment. | 
In short, the <tt>->head</tt> field can briefly be <tt>NULL</tt> even though the
 | CPU has callbacks present the entire time. | 
 | Therefore, it is not appropriate to test the <tt>->head</tt> pointer | 
 | for <tt>NULL</tt>. | 
 |  | 
 | <p>In contrast, the <tt>->len</tt> and <tt>->len_lazy</tt> counts | 
 | are adjusted only after the corresponding callbacks have been invoked. | 
 | This means that the <tt>->len</tt> count is zero only if | 
 | the <tt>rcu_segcblist</tt> structure really is devoid of callbacks. | 
 | Of course, off-CPU sampling of the <tt>->len</tt> count requires | 
 | careful use of appropriate synchronization, for example, memory barriers. | 
 | This synchronization can be a bit subtle, particularly in the case | 
 | of <tt>rcu_barrier()</tt>. | 
 |  | 
 | <h3><a name="The rcu_data Structure"> | 
 | The <tt>rcu_data</tt> Structure</a></h3> | 
 |  | 
<p>The <tt>rcu_data</tt> structure maintains the per-CPU state for the RCU subsystem.
 | The fields in this structure may be accessed only from the corresponding | 
 | CPU (and from tracing) unless otherwise stated. | 
 | This structure is the | 
 | focus of quiescent-state detection and RCU callback queuing. | 
 | It also tracks its relationship to the corresponding leaf | 
 | <tt>rcu_node</tt> structure to allow more-efficient | 
 | propagation of quiescent states up the <tt>rcu_node</tt> | 
 | combining tree. | 
 | Like the <tt>rcu_node</tt> structure, it provides a local | 
copy of the grace-period information to allow essentially free
synchronized access to this information from the corresponding CPU.
 | Finally, this structure records past dyntick-idle state | 
 | for the corresponding CPU and also tracks statistics. | 
 |  | 
 | </p><p>The <tt>rcu_data</tt> structure's fields are discussed, | 
 | singly and in groups, in the following sections. | 
 |  | 
 | <h5>Connection to Other Data Structures</h5> | 
 |  | 
 | <p>This portion of the <tt>rcu_data</tt> structure is declared | 
 | as follows: | 
 |  | 
 | <pre> | 
 |   1   int cpu; | 
 |   2   struct rcu_node *mynode; | 
 |   3   unsigned long grpmask; | 
 |   4   bool beenonline; | 
 | </pre> | 
 |  | 
 | <p>The <tt>->cpu</tt> field contains the number of the | 
 | corresponding CPU and the <tt>->mynode</tt> field references the | 
 | corresponding <tt>rcu_node</tt> structure. | 
The <tt>->mynode</tt> field is used to propagate quiescent states
 | up the combining tree. | 
 | These two fields are constant and therefore do not require synchronization. | 
 |  | 
 | <p>The <tt>->grpmask</tt> field indicates the bit in | 
 | the <tt>->mynode->qsmask</tt> corresponding to this | 
 | <tt>rcu_data</tt> structure, and is also used when propagating | 
 | quiescent states. | 
 | The <tt>->beenonline</tt> flag is set whenever the corresponding | 
 | CPU comes online, which means that the debugfs tracing need not dump | 
 | out any <tt>rcu_data</tt> structure for which this flag is not set. | 
 |  | 
 | <h5>Quiescent-State and Grace-Period Tracking</h5> | 
 |  | 
 | <p>This portion of the <tt>rcu_data</tt> structure is declared | 
 | as follows: | 
 |  | 
 | <pre> | 
 |   1   unsigned long gp_seq; | 
 |   2   unsigned long gp_seq_needed; | 
 |   3   bool cpu_no_qs; | 
 |   4   bool core_needs_qs; | 
 |   5   bool gpwrap; | 
 | </pre> | 
 |  | 
 | <p>The <tt>->gp_seq</tt> field is the counterpart of the field of the same | 
 | name in the <tt>rcu_state</tt> and <tt>rcu_node</tt> structures.  The | 
 | <tt>->gp_seq_needed</tt> field is the counterpart of the field of the same | 
name in the <tt>rcu_node</tt> structure.
They may each lag up to one step behind their <tt>rcu_node</tt>
 | counterparts, but in <tt>CONFIG_NO_HZ_IDLE</tt> and | 
 | <tt>CONFIG_NO_HZ_FULL</tt> kernels can lag | 
 | arbitrarily far behind for CPUs in dyntick-idle mode (but these counters | 
 | will catch up upon exit from dyntick-idle mode). | 
 | If the lower two bits of a given <tt>rcu_data</tt> structure's | 
 | <tt>->gp_seq</tt> are zero, then this <tt>rcu_data</tt> | 
 | structure believes that RCU is idle. | 
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	All this replication of the grace period numbers can only cause | 
 | 	massive confusion. | 
 | 	Why not just keep a global sequence number and be done with it??? | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
	Because if there were only a single global sequence
	number, there would need to be a single global lock to allow
	safely accessing and updating it.
 | 	And if we are not going to have a single global lock, we need | 
 | 	to carefully manage the numbers on a per-node basis. | 
 | 	Recall from the answer to a previous Quick Quiz that the consequences | 
 | 	of applying a previously sampled quiescent state to the wrong | 
 | 	grace period are quite severe. | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
 |  | 
 | <p>The <tt>->cpu_no_qs</tt> flag indicates that the | 
 | CPU has not yet passed through a quiescent state, | 
 | while the <tt>->core_needs_qs</tt> flag indicates that the | 
 | RCU core needs a quiescent state from the corresponding CPU. | 
 | The <tt>->gpwrap</tt> field indicates that the corresponding | 
 | CPU has remained idle for so long that the | 
 | <tt>gp_seq</tt> counter is in danger of overflow, which | 
 | will cause the CPU to disregard the values of its counters on | 
 | its next exit from idle. | 
 |  | 
 | <h5>RCU Callback Handling</h5> | 
 |  | 
 | <p>In the absence of CPU-hotplug events, RCU callbacks are invoked by | 
 | the same CPU that registered them. | 
 | This is strictly a cache-locality optimization: callbacks can and | 
 | do get invoked on CPUs other than the one that registered them. | 
 | After all, if the CPU that registered a given callback has gone | 
 | offline before the callback can be invoked, there really is no other | 
 | choice. | 
 |  | 
 | </p><p>This portion of the <tt>rcu_data</tt> structure is declared | 
 | as follows: | 
 |  | 
 | <pre> | 
 |  1 struct rcu_segcblist cblist; | 
 |  2 long qlen_last_fqs_check; | 
 |  3 unsigned long n_cbs_invoked; | 
 |  4 unsigned long n_nocbs_invoked; | 
 |  5 unsigned long n_cbs_orphaned; | 
 |  6 unsigned long n_cbs_adopted; | 
 |  7 unsigned long n_force_qs_snap; | 
 |  8 long blimit; | 
 | </pre> | 
 |  | 
 | <p>The <tt>->cblist</tt> structure is the segmented callback list | 
 | described earlier. | 
 | The CPU advances the callbacks in its <tt>rcu_data</tt> structure | 
 | whenever it notices that another RCU grace period has completed. | 
 | The CPU detects the completion of an RCU grace period by noticing | 
 | that the value of its <tt>rcu_data</tt> structure's | 
 | <tt>->gp_seq</tt> field differs from that of its leaf | 
 | <tt>rcu_node</tt> structure. | 
 | Recall that each <tt>rcu_node</tt> structure's | 
 | <tt>->gp_seq</tt> field is updated at the beginnings and ends of each | 
 | grace period. | 
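
<p>A sketch of this detection step (loosely modeled on the kernel's
<tt>__note_gp_changes()</tt>, with locking and many details omitted,
and with an illustrative function name) follows;
<tt>rcu_segcblist_advance()</tt> moves the callbacks whose grace
period has now elapsed into the <tt>RCU_DONE_TAIL</tt> segment:

<pre>
 1 static void note_gp_changes_sketch(struct rcu_node *rnp,
 2                                    struct rcu_data *rdp)
 3 {
 4   if (rdp->gp_seq == rnp->gp_seq)
 5     return;                    /* Nothing has changed. */
 6   rcu_segcblist_advance(&rdp->cblist, rnp->gp_seq);
 7   rdp->gp_seq = rnp->gp_seq;   /* Catch up with the leaf rcu_node. */
 8 }
</pre>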
 |  | 
 | <p> | 
 | The <tt>->qlen_last_fqs_check</tt> and | 
<tt>->n_force_qs_snap</tt> fields coordinate the forcing of quiescent
 | states from <tt>call_rcu()</tt> and friends when callback | 
 | lists grow excessively long. | 
 |  | 
 | </p><p>The <tt>->n_cbs_invoked</tt>, | 
 | <tt>->n_cbs_orphaned</tt>, and <tt>->n_cbs_adopted</tt> | 
 | fields count the number of callbacks invoked, | 
 | sent to other CPUs when this CPU goes offline, | 
 | and received from other CPUs when those other CPUs go offline. | 
The <tt>->n_nocbs_invoked</tt> field is used when the CPU's callbacks
 | are offloaded to a kthread. | 
 |  | 
 | <p> | 
 | Finally, the <tt>->blimit</tt> counter is the maximum number of | 
 | RCU callbacks that may be invoked at a given time. | 
 |  | 
 | <h5>Dyntick-Idle Handling</h5> | 
 |  | 
 | <p>This portion of the <tt>rcu_data</tt> structure is declared | 
 | as follows: | 
 |  | 
 | <pre> | 
 |   1   int dynticks_snap; | 
 |   2   unsigned long dynticks_fqs; | 
 | </pre> | 
 |  | 
 | The <tt>->dynticks_snap</tt> field is used to take a snapshot | 
 | of the corresponding CPU's dyntick-idle state when forcing | 
 | quiescent states, and is therefore accessed from other CPUs. | 
 | Finally, the <tt>->dynticks_fqs</tt> field is used to | 
 | count the number of times this CPU is determined to be in | 
 | dyntick-idle state, and is used for tracing and debugging purposes. | 
 |  | 
 | <p> | 
This portion of the <tt>rcu_data</tt> structure is declared as follows:
 |  | 
 | <pre> | 
 |   1   long dynticks_nesting; | 
 |   2   long dynticks_nmi_nesting; | 
 |   3   atomic_t dynticks; | 
 |   4   bool rcu_need_heavy_qs; | 
 |   5   bool rcu_urgent_qs; | 
 | </pre> | 
 |  | 
<p>These fields in the <tt>rcu_data</tt> structure maintain the per-CPU dyntick-idle
 | state for the corresponding CPU. | 
 | The fields may be accessed only from the corresponding CPU (and from tracing) | 
 | unless otherwise stated. | 
 |  | 
 | <p>The <tt>->dynticks_nesting</tt> field counts the | 
 | nesting depth of process execution, so that in normal circumstances | 
 | this counter has value zero or one. | 
 | NMIs, irqs, and tracers are counted by the <tt>->dynticks_nmi_nesting</tt> | 
 | field. | 
 | Because NMIs cannot be masked, changes to this variable have to be | 
 | undertaken carefully using an algorithm provided by Andy Lutomirski. | 
 | The initial transition from idle adds one, and nested transitions | 
 | add two, so that a nesting level of five is represented by a | 
 | <tt>->dynticks_nmi_nesting</tt> value of nine. | 
 | This counter can therefore be thought of as counting the number | 
 | of reasons why this CPU cannot be permitted to enter dyntick-idle | 
 | mode, aside from process-level transitions. | 
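
</p><p>A sketch of the resulting bookkeeping on each NMI or irq entry
(after the algorithm used by <tt>rcu_nmi_enter()</tt>; the helper
names are those used by some kernel versions, the wrapper function is
illustrative, and memory ordering plus the misnesting fixup described
in the next paragraph are omitted):

<pre>
 1 static void nmi_enter_sketch(struct rcu_data *rdp)
 2 {
 3   long incby = 2;                 /* Nested entries add two... */
 4
 5   if (rcu_dynticks_curr_cpu_in_eqs()) {
 6     rcu_dynticks_eqs_exit();      /* ...but the initial entry from */
 7     incby = 1;                    /* idle adds only one. */
 8   }
 9   rdp->dynticks_nmi_nesting += incby;
10 }
</pre>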
 |  | 
 | <p>However, it turns out that when running in non-idle kernel context, | 
 | the Linux kernel is fully capable of entering interrupt handlers that | 
 | never exit and perhaps also vice versa. | 
 | Therefore, whenever the <tt>->dynticks_nesting</tt> field is | 
 | incremented up from zero, the <tt>->dynticks_nmi_nesting</tt> field | 
 | is set to a large positive number, and whenever the | 
 | <tt>->dynticks_nesting</tt> field is decremented down to zero, | 
the <tt>->dynticks_nmi_nesting</tt> field is set to zero.
 | Assuming that the number of misnested interrupts is not sufficient | 
 | to overflow the counter, this approach corrects the | 
 | <tt>->dynticks_nmi_nesting</tt> field every time the corresponding | 
 | CPU enters the idle loop from process context. | 
 |  | 
 | </p><p>The <tt>->dynticks</tt> field counts the corresponding | 
 | CPU's transitions to and from either dyntick-idle or user mode, so | 
 | that this counter has an even value when the CPU is in dyntick-idle | 
 | mode or user mode and an odd value otherwise. The transitions to/from | 
 | user mode need to be counted for user mode adaptive-ticks support | 
 | (see timers/NO_HZ.txt). | 
 |  | 
 | </p><p>The <tt>->rcu_need_heavy_qs</tt> field is used | 
 | to record the fact that the RCU core code would really like to | 
 | see a quiescent state from the corresponding CPU, so much so that | 
 | it is willing to call for heavy-weight dyntick-counter operations. | 
 | This flag is checked by RCU's context-switch and <tt>cond_resched()</tt> | 
 | code, which provide a momentary idle sojourn in response. | 
 |  | 
 | </p><p>Finally, the <tt>->rcu_urgent_qs</tt> field is used to record | 
 | the fact that the RCU core code would really like to see a quiescent state from | 
 | the corresponding CPU, with the various other fields indicating just how badly | 
 | RCU wants this quiescent state. | 
 | This flag is checked by RCU's context-switch path | 
(<tt>rcu_note_context_switch()</tt>) and the <tt>cond_resched()</tt> code.
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	Why not simply combine the <tt>->dynticks_nesting</tt> | 
 | 	and <tt>->dynticks_nmi_nesting</tt> counters into a | 
 | 	single counter that just counts the number of reasons that | 
 | 	the corresponding CPU is non-idle? | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	Because this would fail in the presence of interrupts whose | 
 | 	handlers never return and of handlers that manage to return | 
 | 	from a made-up interrupt. | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
 |  | 
 | <p>Additional fields are present for some special-purpose | 
 | builds, and are discussed separately. | 
 |  | 
 | <h3><a name="The rcu_head Structure"> | 
 | The <tt>rcu_head</tt> Structure</a></h3> | 
 |  | 
 | <p>Each <tt>rcu_head</tt> structure represents an RCU callback. | 
 | These structures are normally embedded within RCU-protected data | 
 | structures whose algorithms use asynchronous grace periods. | 
 | In contrast, when using algorithms that block waiting for RCU grace periods, | 
 | RCU users need not provide <tt>rcu_head</tt> structures. | 
 |  | 
 | </p><p>The <tt>rcu_head</tt> structure has fields as follows: | 
 |  | 
 | <pre> | 
 |   1   struct rcu_head *next; | 
 |   2   void (*func)(struct rcu_head *head); | 
 | </pre> | 
 |  | 
 | <p>The <tt>->next</tt> field is used | 
 | to link the <tt>rcu_head</tt> structures together in the | 
 | lists within the <tt>rcu_data</tt> structures. | 
 | The <tt>->func</tt> field is a pointer to the function | 
 | to be called when the callback is ready to be invoked, and | 
 | this function is passed a pointer to the <tt>rcu_head</tt> | 
 | structure. | 
 | However, <tt>kfree_rcu()</tt> uses the <tt>->func</tt> | 
 | field to record the offset of the <tt>rcu_head</tt> | 
 | structure within the enclosing RCU-protected data structure. | 
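
</p><p>A minimal sketch of this offset trick follows; the macro and
helper names are illustrative, not the kernel's actual
<tt>kfree_rcu()</tt> implementation:

<pre>
/* Queue the callback, but record the offset of the rcu_head within
 * the enclosing structure in ->func instead of a function pointer. */
#define kfree_rcu_sketch(ptr, field) \
	queue_offset_callback(&(ptr)->field, \
			      offsetof(typeof(*(ptr)), field))

/* At callback-invocation time, recover and free the enclosing
 * structure.  Small ->func values are what allow callback invocation
 * to distinguish these callbacks from ordinary function pointers. */
static void invoke_offset_callback(struct rcu_head *head)
{
	unsigned long offset = (unsigned long)head->func;

	kfree((char *)head - offset);
}
</pre>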
 |  | 
 | </p><p>Both of these fields are used internally by RCU. | 
 | From the viewpoint of RCU users, this structure is an | 
 | opaque “cookie”. | 
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	Given that the callback function <tt>->func</tt> | 
 | 	is passed a pointer to the <tt>rcu_head</tt> structure, | 
 | 	how is that function supposed to find the beginning of the | 
 | 	enclosing RCU-protected data structure? | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	In actual practice, there is a separate callback function per | 
 | 	type of RCU-protected data structure. | 
 | 	The callback function can therefore use the <tt>container_of()</tt> | 
 | 	macro in the Linux kernel (or other pointer-manipulation facilities | 
 | 	in other software environments) to find the beginning of the | 
 | 	enclosing structure. | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
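
<p>For example, the following fragment, in the style of the kernel's
Documentation/RCU/whatisRCU.txt, shows a hypothetical RCU-protected
structure with an embedded <tt>rcu_head</tt> and a callback that uses
<tt>container_of()</tt> to recover a pointer to the enclosing
structure:

<pre>
struct foo {
	int a;
	struct rcu_head rh;
};

/* Invoked after a grace period; recover and free the enclosing foo. */
static void foo_reclaim(struct rcu_head *head)
{
	struct foo *fp = container_of(head, struct foo, rh);

	kfree(fp);
}

/* Updater, after removing fp from all reader-visible structures: */
call_rcu(&fp->rh, foo_reclaim);
</pre>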
 |  | 
 | <h3><a name="RCU-Specific Fields in the task_struct Structure"> | 
 | RCU-Specific Fields in the <tt>task_struct</tt> Structure</a></h3> | 
 |  | 
 | <p>The <tt>CONFIG_PREEMPT_RCU</tt> implementation uses some | 
 | additional fields in the <tt>task_struct</tt> structure: | 
 |  | 
 | <pre> | 
 |  1 #ifdef CONFIG_PREEMPT_RCU | 
 |  2   int rcu_read_lock_nesting; | 
 |  3   union rcu_special rcu_read_unlock_special; | 
 |  4   struct list_head rcu_node_entry; | 
 |  5   struct rcu_node *rcu_blocked_node; | 
 |  6 #endif /* #ifdef CONFIG_PREEMPT_RCU */ | 
 |  7 #ifdef CONFIG_TASKS_RCU | 
 |  8   unsigned long rcu_tasks_nvcsw; | 
 |  9   bool rcu_tasks_holdout; | 
 | 10   struct list_head rcu_tasks_holdout_list; | 
 | 11   int rcu_tasks_idle_cpu; | 
 | 12 #endif /* #ifdef CONFIG_TASKS_RCU */ | 
 | </pre> | 
 |  | 
 | <p>The <tt>->rcu_read_lock_nesting</tt> field records the | 
 | nesting level for RCU read-side critical sections, and | 
 | the <tt>->rcu_read_unlock_special</tt> field is a bitmask | 
 | that records special conditions that require <tt>rcu_read_unlock()</tt> | 
 | to do additional work. | 
 | The <tt>->rcu_node_entry</tt> field is used to form lists of | 
 | tasks that have blocked within preemptible-RCU read-side critical | 
 | sections and the <tt>->rcu_blocked_node</tt> field references | 
 | the <tt>rcu_node</tt> structure whose list this task is a member of, | 
 | or <tt>NULL</tt> if it is not blocked within a preemptible-RCU | 
 | read-side critical section. | 
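
<p>A hedged sketch of how the first two fields are used by
preemptible RCU's read-side primitives; this is simplified from the
kernel's <tt>__rcu_read_lock()</tt> and <tt>__rcu_read_unlock()</tt>,
which must also handle nesting and ordering corner cases:

<pre>
/* Enter a read-side critical section: just bump the nesting count. */
void rcu_read_lock_sketch(void)
{
	current->rcu_read_lock_nesting++;
	barrier();	/* critical section stays after the increment. */
}

/* Exit: on the outermost rcu_read_unlock(), check for special work,
 * such as removing this task from its rcu_node blocked-tasks list. */
void rcu_read_unlock_sketch(void)
{
	barrier();	/* critical section stays before the decrement. */
	if (--current->rcu_read_lock_nesting == 0 &&
	    unlikely(READ_ONCE(current->rcu_read_unlock_special.s)))
		rcu_read_unlock_special(current);
}
</pre>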
 |  | 
 | <p>The <tt>->rcu_tasks_nvcsw</tt> field tracks the number of | 
 | voluntary context switches that this task had undergone at the | 
 | beginning of the current tasks-RCU grace period, | 
 | <tt>->rcu_tasks_holdout</tt> is set if the current tasks-RCU | 
 | grace period is waiting on this task, <tt>->rcu_tasks_holdout_list</tt> | 
 | is a list element enqueuing this task on the holdout list, | 
and <tt>->rcu_tasks_idle_cpu</tt> tracks which CPU this
idle task is running on, but only if the task is currently running,
that is, if the CPU is currently idle.
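
<p>A hedged sketch of how these fields might be used when scanning the
holdout list, grounded in the field descriptions above and simplified
from the kernel's <tt>check_holdout_task()</tt> (the idle-CPU case is
omitted):

<pre>
/* A task stops being a tasks-RCU holdout once it has voluntarily
 * context-switched since the grace period began. */
if (!READ_ONCE(t->rcu_tasks_holdout) ||
    t->rcu_tasks_nvcsw != READ_ONCE(t->nvcsw)) {
	WRITE_ONCE(t->rcu_tasks_holdout, false);
	list_del_init(&t->rcu_tasks_holdout_list);
}
</pre>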
 |  | 
 | <h3><a name="Accessor Functions"> | 
 | Accessor Functions</a></h3> | 
 |  | 
<p>The following listing shows the
<tt>rcu_get_root()</tt> function and the
<tt>rcu_for_each_node_breadth_first()</tt> and
<tt>rcu_for_each_leaf_node()</tt> macros:
 |  | 
 | <pre> | 
 |   1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp) | 
 |   2 { | 
 |   3   return &rsp->node[0]; | 
 |   4 } | 
 |   5 | 
 |   6 #define rcu_for_each_node_breadth_first(rsp, rnp) \ | 
 |   7   for ((rnp) = &(rsp)->node[0]; \ | 
 |   8        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) | 
 |   9 | 
 |  10 #define rcu_for_each_leaf_node(rsp, rnp) \ | 
 |  11   for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \ | 
 |  12        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) | 
 | </pre> | 
 |  | 
<p>The <tt>rcu_get_root()</tt> function simply returns a pointer to the
 | first element of the specified <tt>rcu_state</tt> structure's | 
 | <tt>->node[]</tt> array, which is the root <tt>rcu_node</tt> | 
 | structure. | 
 |  | 
 | </p><p>As noted earlier, the <tt>rcu_for_each_node_breadth_first()</tt> | 
 | macro takes advantage of the layout of the <tt>rcu_node</tt> | 
 | structures in the <tt>rcu_state</tt> structure's | 
 | <tt>->node[]</tt> array, performing a breadth-first traversal by | 
 | simply traversing the array in order. | 
 | Similarly, the <tt>rcu_for_each_leaf_node()</tt> macro traverses only | 
 | the last part of the array, thus traversing only the leaf | 
 | <tt>rcu_node</tt> structures. | 
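
</p><p>A typical (hypothetical) use of these macros locks and
processes each <tt>rcu_node</tt> structure in turn, with the
breadth-first order guaranteeing that parents are visited before
their children:

<pre>
struct rcu_node *rnp;
unsigned long flags;

rcu_for_each_node_breadth_first(rsp, rnp) {
	raw_spin_lock_irqsave(&rnp->lock, flags);
	/* ... operate on this rcu_node structure ... */
	raw_spin_unlock_irqrestore(&rnp->lock, flags);
}
</pre>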
 |  | 
 | <table> | 
 | <tr><th> </th></tr> | 
 | <tr><th align="left">Quick Quiz:</th></tr> | 
 | <tr><td> | 
 | 	What does | 
 | 	<tt>rcu_for_each_leaf_node()</tt> do if the <tt>rcu_node</tt> tree | 
 | 	contains only a single node? | 
 | </td></tr> | 
 | <tr><th align="left">Answer:</th></tr> | 
 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | 
 | 	In the single-node case, | 
 | 	<tt>rcu_for_each_leaf_node()</tt> traverses the single node. | 
 | </font></td></tr> | 
 | <tr><td> </td></tr> | 
 | </table> | 
 |  | 
 | <h3><a name="Summary"> | 
 | Summary</a></h3> | 
 |  | 
 | So the state of RCU is represented by an <tt>rcu_state</tt> structure, | 
 | which contains a combining tree of <tt>rcu_node</tt> and | 
 | <tt>rcu_data</tt> structures. | 
 | Finally, in <tt>CONFIG_NO_HZ_IDLE</tt> kernels, each CPU's dyntick-idle | 
 | state is tracked by dynticks-related fields in the <tt>rcu_data</tt> structure. | 
 |  | 
 | If you made it this far, you are well prepared to read the code | 
 | walkthroughs in the other articles in this series. | 
 |  | 
 | <h3><a name="Acknowledgments"> | 
 | Acknowledgments</a></h3> | 
 |  | 
 | I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul | 
 | Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn | 
 | for helping me get this document into a more human-readable state. | 
 |  | 
 | <h3><a name="Legal Statement"> | 
 | Legal Statement</a></h3> | 
 |  | 
 | <p>This work represents the view of the author and does not necessarily | 
 | represent the view of IBM. | 
 |  | 
 | </p><p>Linux is a registered trademark of Linus Torvalds. | 
 |  | 
 | </p><p>Other company, product, and service names may be trademarks or | 
 | service marks of others. | 
 |  | 
 | </body></html> |