Lately I have been working on Snabb Switch as part of the networking team we have at Igalia. Snabb Switch is a kernel-bypass software switch that talks directly to 10-Gbps network cards. This allows Snabb Switch to manipulate packets at rates the Linux kernel is simply not capable of handling. Snabb Switch also provides a framework to develop network applications, and most of the code is Lua (LuaJIT). Snabb Switch rides on the latest innovations in hardware and is very innovative in many ways. All this has allowed me to catch up with many technologies. In this post I review the last 10 years of Intel microprocessor evolution.

One of the first attempts by Intel to parallelize processing was hyper-threading, a technology that debuted in the Xeon in 2002 and later that year in the Pentium 4. A single CPU with hyper-threading appears as two logical CPUs to the operating system. Hyper-threading takes advantage of the superscalar architecture of modern CPUs, in which instruction processing is divided into several independent pipelines. By duplicating the architectural registers of the CPU, hyper-threading can keep more of those pipelines busy at any given time. However, other resources, such as the cache, are not actually duplicated but shared between the two logical CPUs. A CPU with hyper-threading enabled can provide a performance boost of up to 30%, or in the worst case no boost at all.

After this first attempt at bringing real concurrent processing, Intel CPUs started to incorporate multiple cores per socket. For some time hyper-threading was set aside, as Intel modeled its new Core microarchitecture after the P6 microarchitecture (Pentium Pro, Pentium III and Celeron). However, with the release of the Nehalem microarchitecture (Core i7) at the end of 2008, hyper-threading made a comeback. Since then, most Intel processors feature this technology. As hyper-threading adds a logical CPU for each physical core, my dual-core CPU appears as four cores to the OS.

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2

Two cores per socket and two threads per core make a total of four CPUs.
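
Sysfs also reveals which logical CPUs are hyper-thread siblings of the same physical core. A quick check, assuming a standard Linux sysfs layout (the exact pairing depends on how the kernel enumerates CPUs):

$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list  # Logical CPUs sharing a physical core with cpu0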

What really pushed Intel to switch to a multicore architecture was the inability to keep improving performance simply by increasing the number of transistors in a CPU. As Gordon Moore, co-founder of Intel, observed in 1965, the number of transistors per square inch in a CPU doubles every year (Moore's Law), something that still applies today although the period has stretched to about 2.5 years. However, while the number of transistors in the last decade has gone from 37.5 million to 904 million, CPU clock speed has barely doubled, going from 1.3 GHz to 2.8 GHz. In 2004, heat build-up in CPUs caused Intel to abandon this model and start placing multiple cores in a single processing unit.

The first multicore processors accessed memory through a shared bus; in other words, all cores shared a common memory. This design is known as UMA, or Uniform Memory Access. As the number of cores increased, contention issues appeared: access to the bus became a bottleneck to scalability, preventing the addition of more cores. To solve this problem, CPU designers introduced a new type of memory layout known as Non-Uniform Memory Access, or NUMA. In a NUMA topology, cores are grouped into units called NUMA nodes. Each NUMA node has its own local memory, which guarantees that only a limited number of cores will try to access that memory at a given time. Memory that is not local to a NUMA node is known as remote or foreign. Access to remote memory takes longer, because the distance between processor and memory affects access time. That is the main reason why this architecture is called NUMA, and why it is often referred to as a topology. Both UMA and NUMA are types of shared memory architectures.

Architecture of a NUMA system. Source: Wikipedia
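
On Linux, NUMA nodes are exposed through sysfs, one entry per node together with its local memory statistics. For example (assuming the standard sysfs layout):

$ ls -d /sys/devices/system/node/node*        # One directory per NUMA node
$ cat /sys/devices/system/node/node0/meminfo  # Memory local to node0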

Unfortunately, my laptop only has one NUMA node:

$ lscpu | grep NUMA
NUMA node(s):          1
NUMA node0 CPU(s):     0-3

Or using the numactl command:

$ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 
cpubind: 0 
nodebind: 0 
membind: 0 

The output shows that my CPU features only one NUMA node containing 4 logical CPUs; in other words, my laptop implements a UMA layout. The same command on a more powerful machine (an Intel Xeon Processor E5-2697 v2, with 12 cores per socket) provides the following information:

$ lscpu
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
NUMA node0 CPU(s):     0-11,24-35
NUMA node1 CPU(s):     12-23,36-47

This is a much more advanced CPU. There are 2 sockets, each containing 12 cores, for a total of 24 physical cores or 48 logical CPUs with hyper-threading. Cores are grouped into 2 NUMA nodes (node0 and node1). The numactl --hardware command also provides information about node distances:

node distances:
node   0   1
  0:  10  21
  1:  21  10
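
The distances are relative costs: access to local memory is normalized to 10, while reaching the other node costs 21, roughly twice as much. On a NUMA machine, the numastat command (usually shipped with the numactl package) reports how many memory allocations were served from the local node and how many had to go to a remote one:

$ numastat   # Per-node counters: numa_hit, numa_miss, numa_foreign, local_node, other_node, ...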

Besides cores, peripheral devices also compete for access to the data bus, since devices have direct access to memory through DMA (Direct Memory Access). In a NUMA layout, this means that devices, usually connected to a PCI port, also have a NUMA node number assigned. If a process is running on a core and heavily interacts with an I/O device belonging to a different NUMA node, performance degradation issues may appear. NUMA benefits considerably from the data locality principle, so devices and processes operating on the same data should run within the same NUMA node.

$ cat /sys/bus/pci/devices/0000:00:01.0/numa_node
0

In the example above, the device with PCI address 0000:00:01.0 belongs to NUMA node0.
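
For a network card, the same information can be reached through the interface name, since its sysfs entry links back to the PCI device (eth0 below is just a placeholder for whatever name the NIC gets):

$ cat /sys/class/net/eth0/device/numa_node   # NUMA node of the NIC behind eth0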

If data locality is so important in a NUMA architecture, it seems it would be very handy to be able to select which core runs a program. Generally, when the OS executes a program it creates a process and assigns it to a core following a scheduling algorithm. The process runs for some time, until the kernel dispatcher assigns a new process to the core and the former process is put to sleep. When it wakes up, it may be reassigned to a different core. Usually, if the process consumes a lot of CPU time the kernel won't reassign it to a different core, but it can happen.
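
The PSR column of ps shows which core a process is currently assigned to, which makes it easy to watch these migrations happen (snabb is just the example program used throughout this post):

$ ps -o pid,psr,comm -C snabb   # PSR is the core the process is currently running on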

CPU affinity lets us bind a process or thread to a core or group of cores. From the user's perspective, commands such as taskset or numactl help us control the CPU affinity of a process:

$ taskset -c 0 snabb    # Run the program 'snabb' on core 0
$ taskset -c 0-3 snabb  # Run the program 'snabb' on cores 0,1,2,3
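
taskset can also query or change the affinity of a process that is already running, given its PID (1234 below stands for the target process ID):

$ taskset -p 1234        # Show the affinity mask of PID 1234
$ taskset -cp 0-3 1234   # Restrict PID 1234 to cores 0,1,2,3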

Likewise, with the numactl command:

$ numactl --cpunodebind=0 snabb    # Run the program 'snabb' on any of the cores of node0
$ numactl --physcpubind=0-3 snabb  # Run the program 'snabb' on cores 0,1,2,3

Besides setting CPU affinity, numactl allows finer control, making it possible to select different memory allocation policies (local only, local preferred, etc.).
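
For instance (a few common policies; the numactl man page lists the rest):

$ numactl --membind=0 snabb     # Allocate memory only on node0
$ numactl --preferred=0 snabb   # Prefer node0, fall back to another node if needed
$ numactl --localalloc snabb    # Always allocate on the node the process is running on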

Summarizing, most modern CPUs, if not all, implement a multicore architecture with a NUMA topology. Arbitrary assignment of processes to cores may have an impact on performance, especially if a running program interacts with a PCI device in a different NUMA node. Tools such as taskset and numactl allow us to have finer control over CPU affinity, making it possible to select which core or NUMA node will run a program.
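
Putting it all together, a typical invocation would pin the program to the NUMA node its NIC belongs to (reusing the example PCI device from above; the exact address and node depend on the machine):

$ cat /sys/bus/pci/devices/0000:00:01.0/numa_node
0
$ numactl --cpunodebind=0 --membind=0 snabb   # Run snabb and allocate its memory on the NIC's node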

Lastly, some recommended links to articles I read, or partially read, that helped me write this post: