A Digestible High-Level Overview of CPU & GPU Cores

23 Apr 2024

A digestible high-level overview of what happens in The Die

In this article, we'll go through some fundamental low-level details to understand why GPUs are good at Graphics, Neural Networks, and Deep Learning tasks and CPUs are good at a wide number of sequential, complex general purpose computing tasks. There were several topics that I had to research and get a bit more granular understanding for this post, some of which I will be just mentioning in passing. It is done deliberately to focus just on the absolute basics of CPU & GPU processing.

Von Neumann Architecture

Earlier computers were dedicated devices. Hardware circuits and logic gates were programmed to do a specific set of things. If something new had to be done, circuits needed to be rewired. "Something new" could be as simple as doing mathematical calculations for two different equations. During WWII, Alan Turing was working on a programmable machine to beat the Enigma machine and later published the "Turing Machine" paper. Around the same time, John von Neumann and other researchers were also working on an idea which fundamentally proposed:

Instruction and data should be stored in shared memory (Stored program).
Processing and memory units should be separate.
Control unit takes care of reading data & instructions from memory to do calculations using the processing unit.

The Bottleneck

Processing bottleneck - Only one instruction and its operand can be at a time in a processing unit (physical logic gate). Instructions are executed sequentially one after another. Over the years, focus and improvements have been made in making processors smaller, faster clock cycles, and increasing the number of cores.
Memory bottleneck - As processors grew faster and faster, speed and amount of data that could be transferred between memory and processing unit became a bottleneck. Memory is several orders slower than CPU. Over the years, focus and improvements have been in making memory denser and smaller.

CPUs

We know that everything in our computer is binary. String, image, video, audio, OS, application program, etc., are all represented as 1s & 0s. CPU architecture (RISC, CISC, etc.,) specifications have instruction sets (x86, x86-64, ARM, etc.,), which CPU manufacturers must comply with & are available for OS to interface with hardware.

OS & application programs including data are translated into instruction sets and binary data for processing in the CPU. At the chip level, processing is done at transistors and logic gates. If you execute a program to add two numbers, addition (the "processing") is done at a logic gate in the processor.

In CPU as per Von Neumann architecture, when we are adding two numbers, a single add instruction runs on two numbers in the circuit. For a fraction of that millisecond, only add instruction was executed in the (execution) core of the processing unit! This detail always fascinated me.

Core in a modern CPU

Overview of components in a CPU core (Single HW Thread)

The components in the above diagram are self-evident. For more details and detailed explanation refer to this excellent article. In modern CPUs, a single physical core can contain more than one integer ALU, floating-point ALU, etc., Again, these units are physical logic gates.

We need to understand the 'Hardware Thread' in the CPU core for a better appreciation of GPU. A hardware thread is a unit of computing that can be done in execution units of a CPU core, every single CPU clock cycle. It represents the smallest unit of work that can be executed in a core.

Instruction cycle

A process, running on a CPU core, in a single hardware thread, w/o pipelining or Superscaler

The above diagram illustrates the CPU instruction cycle/machine cycle. It is a series of steps that the CPU performs to execute a single instruction (eg: c=a+b).

Fetch: Program counter (special register in CPU core) keeps track of which instruction must be fetched. Instruction is fetched and stored in the instruction register. For simple operations, corresponding data is also fetched.
Decode: Instruction is decoded to see operators and operands.
Execute: Based on the operation specified, the appropriate processing unit is chosen and executed.
Memory Access: If an instruction is complex or additional data is needed (several factors can cause this), memory access is done before execution. (Ignored in the above diagram for simplicity). For a complex instruction, initial data will be available in data register of the compute unit, but for complete execution of instruction, data access from L1 & L2 cache is required. This means could be a small wait time before the compute unit executes and the hardware thread is still holding compute unit during wait time.
Write Back: If execution produces output (eg: c=a+b), the output is written back to register/cache/memory. (Ignored in the above diagram or anyplace later in the post for simplicity)

In the above diagram, only at t2 is compute being done. The rest of the time, the core is just idle (we are not getting any work done).

Modern CPUs have HW components that essentially enable (fetch-decode-execute) steps to occur concurrently per clock cycle.

A single hardware thread can now do computation in every clock cycle. This is called instruction pipelining.

Fetch, Decode, Memory Access, and Write Back are done by other components in a CPU. For lack of a better word, these are called "pipeline threads". The pipeline thread becomes a hardware thread when it is in the execute stage of an instruction cycle.

A process, running on a CPU core in a single hardware thread, w/ pipelining

As you can see, we get compute output every cycle from t2. Previously, we got compute output once every 3 cycles. Pipelining improves compute throughput. This is one of the techniques to manage processing bottlenecks in Von Neumann Architecture. There are also other optimizations like out-of-order execution, branch prediction, speculative execution, etc.,

Hyper-Threading

This is the last concept I want to discuss in CPU before we conclude and move on to GPUs. As the clock speeds increased, the processors also got faster and more efficient. With the increase in application (instruction set) complexity, CPU compute cores were underutilized and it was spending more time waiting on memory access.

So, we see a memory bottleneck. The compute unit is spending time on memory access and not doing any useful work. Memory is several orders slower than CPU and the gap is not going to close anytime soon. The idea was to increase memory bandwidth in some units of a single CPU core and keep data ready to utilize the compute units when it is awaiting memory access.

Hyper-threading was made available in 2002 by Intel in Xeon & Pentium 4 processors. Prior to hyper-threading, there was only one hardware thread per core. With hyper-threading, there will be 2 hardware threads per core. What does it mean? Duplicate processing circuit for some registers, program counter, fetch unit, decode unit, etc.

A CPU core with 2 HW Threads (Hyper-Threading)

The above diagram just shows new circuit elements in a CPU core with hyperthreading. This is how a single physical core is visible as 2 cores to the Operating System. If you had a 4-core processor, with hyper-threading enabled, it is seen by OS as 8 cores. L1 - L3 cache size will increase to accommodate additional registers. Note that the execution units are shared.

Two processes, running on a CPU core with two hardware threads, w/ pipelining

Assume we have processes P1 and P2 doing a=b+c, d=e+f, these can be executed concurrently in a single clock cycle because of HW threads 1 and 2. With a single HW thread, as we saw earlier, this would not be possible. Here we are increasing memory bandwidth within a core by adding Hardware Thread so that, the processing unit can be utilized efficiently. This improves compute concurrency.

Some interesting scenarios:

CPU has only one integer ALU. One HW Thread 1 or HW Thread 2 must wait for one clock cycle and proceed with compute in next cycle.
CPU has one integer ALU and one floating point ALU. HW Thread 1 and HW Thread 2 can do addition concurrently using ALU and FPU respectively.
All available ALUs are being utilized by HW Thread 1. HW Thread 2 must wait until ALU is available. (Not applicable for the addition example above, but can happen with other instructions).

Why CPU is so good at traditional desktop/server computing?

High clock speeds - Higher than GPU clock speeds. Combining this high speed with instruction pipelining, CPUs are extremely good at sequential tasks. Optimized for latency.
Diverse applications & computation needs - Personal computers and servers have a wide range of applications and computation needs. This results in a complex instruction set. CPU has to be good at several things.
Multitasking & Multi-processing - With so many apps in our computers, CPU workload demands context switching. Caching systems and memory access are set up to support this. When a process is scheduled in the CPU hardware thread, it has all the necessary data ready and executes compute instructions quickly one by one.

CPU Drawbacks

Check this article & also try the Colab notebook. It shows how matrix multiplication is a parallelizable task and how parallel compute cores can speed up the calculation.

Extremely good at sequential tasks but not good at parallel tasks.
Complex instruction set and complex memory access pattern.
CPU also spends lots of energy on context switching, and control unit activities in addition to compute

Key Takeaways

Instruction pipelining improves compute throughput.
Increasing memory bandwidth improves compute concurrency.
CPUs are good at sequential tasks (optimized for latency). Not good at massively parallel tasks as it needs a large number of compute units and hardware threads which are not available (not optimized for throughput). These are not available because CPUs are built for general-purpose computing and have complex instruction sets.

GPUs

As computing power increased, so did the demand for graphics processing. Tasks like UI rendering and gaming require parallel operations, driving the need for numerous ALUs and FPUs at the circuit level. CPUs, designed for sequential tasks, couldn't handle these parallel workloads effectively. Thus, GPUs were developed to fulfill the demand for parallel processing in graphics tasks, later paving the way for their adoption in accelerating deep learning algorithms.

I would highly recommend:

Watching this video that explains parallel tasks involved in Video Game rendering.
Reading this blog post to understand parallel tasks involved in a transformer. There are other deep learning architectures like CNNs, and RNNs as well. Since LLMs are taking over the world, high-level understanding of parallelism in matrix multiplications required for transformer tasks would set a good context for the remainder of this post. (At a later time, I plan to fully understand transformers & share a digestible high-level overview of what happens in transformer layers of a small GPT model.)

Sample CPU vs GPU specs

Cores, hardware threads, clock speed, memory bandwidth, and on-chip memory of CPUs & GPUs differ significantly. Example:

Intel Xeon 8280 has:
- 2700 MHz base and 4000 MHz at Turbo
- 28 cores and 56 hardware threads
- Overall pipeline threads: 896 - 56
- L3 Cache: 38.5 MB (shared by all cores) L2 Cache: 28.0 MB (divided among the cores) L1 Cache: 1.375 MB (divided among the cores)
- Register size is not publicly available
- Max Memory: 1TB DDR4, 2933 MHz, 6 channels
- Max Memory Bandwidth: 131 GB/s
- Peak FP64 Performance = 4.0 GHz 2 AVX-512 units 8 operations per AVX-512 unit per clock cycle * 28 cores = ~2.8 TFLOPs [Derived using: Peak FP64 Performance = (Max Turbo Frequency) (Number of AVX-512 units) (Operations per AVX-512 unit per clock cycle) * (Number of cores)]
  - This number is used for comparison with GPU as getting peak performance of general-purpose computing is very subjective. This number is a theoretical max limit which means, FP64 circuits are being used to their fullest.
Nvidia A100 80GB SXM has:
- 1065 MHz base and 1410 MHz at Turbo
- 108 SMs, 64 FP32 CUDA cores (also called as SPs) per SM, 4 FP64 Tensor cores per SM, 68 hardware threads (64 + 4) per SM
  - Overall per GPU: 6912 64 FP32 CUDA cores, 432 FP 64 Tensor cores, 7344 (6912 + 432) hardware threads
- Pipeline threads per SM: 2048 - 68 = 1980 per SM
  - Overall pipeline threads per GPU: (2048 x 108) - (68 x 108) = 21184 - 7344 = 13840
  - Refer: cudaLimitDevRuntimePendingLaunchCount
- L2 Cache: 40 MB (shared among all SMs) L1 Cache: 20.3 MB in total (192 KB per SM)
- Register size: 27.8 MB (256 KB per SM)
- Max GPU Main Memory: 80GB HBM2e, 1512 MHz
- Max GPU Main Memory Bandwidth: 2.39 TB/s
- Peak FP64 Performance = 19.5 TFLOPs [using only all FP64 Tensor cores]. The lower value of 9.7 TFLOPs when only FP64 in CUDA cores are used. This number is a theoretical max limit which means, FP64 circuits are being used to their fullest.

Core in a modern GPU

Terminologies we saw in CPU don't always translate directly to GPUs. Here we'll see components and core NVIDIA A100 GPU. One thing that was surprising to me while researching for this article was that CPU vendors don't publish how many ALUs, FPUs, etc., are available in execution units of a core. NVIDIA is very transparent about the number of cores and the CUDA framework gives complete flexibility & access at the circuit level.

A modern CPU vs NVIDIA GPU (source: NVIDIA)

In the above diagram in GPU, we can see that there is no L3 Cache, smaller L2 cache, smaller but a lot more control unit & L1 cache and a large number of processing units.

NVIDIA A100 GPU (source: NVIDIA)

NVIDIA A100 GPU streaming multiprocessor (equatable to a CPU core) source: NVIDIA

Here are the GPU components in the above diagrams and their CPU equivalent for our initial understanding. I haven't done CUDA programming, so comparing it with CPU equivalents helps with initial understanding. CUDA programmers understand this very well.

Multiple Streaming Multiprocessors <> Multi-Core CPU
Streaming Multiprocessor (SM) <> CPU Core
Streaming processor (SP)/ CUDA Core <> ALU / FPU in execution units of a CPU Core
Tensor Core (capable doing 4x4 FP64 operations on a single instruction) <> SIMD execution units in a modern CPU core (eg: AVX-512)
Hardware Thread (doing compute in CUDA or Tensor Cores in a single clock cycle) <> Hardware Thread (doing compute in execution units [ALUs, FPUs, etc.,] in a single clock cycle)
HBM / VRAM / DRAM / GPU Memory <> RAM
On-chip memory/SRAM (Registers, L1, L2 cache) <> On-chip memory/SRAM (Registers, L1, L2, L3 cache)
- Note: Registers in a SM are significantly larger than Registers in a core. Because of high number of threads. Remember that in hyper-threading in CPU, we saw an increase in the number of registers but not compute units. Same principle here.

Moving data & memory bandwidth

Graphics and deep learning tasks demand SIM(D/T) [Single instruction multi data/thread] type execution. i.e., reading and working on large amounts of data for a single instruction.

We discussed instruction pipelining and hyper-threading in CPU and GPUs also have capability. How it is implemented and working is slightly different but the principles are the same.

GPU scheduling (source: NVIDIA)

Unlike CPUs, GPUs (via CUDA) provide direct access to Pipeline Threads (fetching data from memory and utilizing the memory bandwidth). GPU schedulers work first by trying to fill compute units (including associated shared L1 cache & registers for storing compute operands), then "pipeline threads" which fetch data into registers and HBM. Again, I want to emphasize that CPU app programmers don't think about this, and specs about "pipeline threads" & number of compute units per core is not published. Nvidia not only publishes these but also provides complete control to programmers.

I will go into more detail about this in a dedicated post about the CUDA programming model & "batching" in model serving optimization technique where we can see how beneficial this is.

GPU - High Throughput Processor (source: NVIDIA)

The above diagram depicts hardware thread execution in the CPU & GPU core. Refer "memory access" section we discussed earlier in CPU pipelining. This diagram shows that. CPUs complex memory management makes this wait time small enough (a few clock cycles) to fetch data from the L1 cache to registers. When data needs to be fetched from L3 or main memory, the other thread for which data is already in register (we saw this in the hyper-threading section) get's control of execution units.

In GPUs, because of oversubscription (high number of pipeline threads & registers) & simple instruction set, a large amount of data is already available on registers pending execution. These pipeline threads waiting for execution become hardware threads and do the execution as often as every clock cycle as pipeline threads in GPUs are lightweight.

Bandwidth, Compute Intensity & Latency

What's over goal?

Fully utilize hardware resources (compute units) every clock cycle to get the best out of GPU.
To keep the compute units busy, we need to feed it enough data.

Matrix multiplication on CPU vs GPU

This is the main reason why the latency of matrix multiplication of smaller matrices is the same more or less in CPU & GPU. Try it out.

Tasks needs to be parallel enough, data needs to be huge enough to saturate compute FLOPs & memory bandwidth. If a single task is not big enough, multiple such tasks needs to be packed to saturate memory and compute to fully utilize the hardware.

Compute Intensity = FLOPs / Bandwidth. i.e., Ratio of amount of work that can be done by the compute units per second to amount of data that can be provided by memory per second.

Compute Intensity (source: NVIDIA)

In above diagram, we see that compute intensity increases as we go to higher latency and lower bandwidth memory. We want this number to be as small as possible so that compute is fully utilized. For that, we need to keep as much as data in L1 / Registers so that compute can happen quickly. If we fetch single data from HBM, there are only few operations where we do 100 operations on single data to make it worth it. If we don't do 100 operations, compute units were idle. This is where high number of threads and registers in GPUs come into play. To keep as much as data in L1/Registers to keep the compute intensity low and to keep parallel cores busy.

There is a difference in compute intensity of 4X between CUDA & Tensor cores because CUDA cores can done only one 1x1 FP64 MMA where as Tensor cores can do 4x4 FP64 MMA instruction per clock cycle.

Key Takeaways

High number of compute units (CUDA & Tensor cores), high number of threads and registers (over subscription), reduced instruction set, no L3 cache, HBM (SRAM), simple & high throughput memory access pattern (compared to CPU's - context switching, multi layer caching, memory paging, TLB, etc.,) are the principles that make GPUs so much better than CPUs in parallel computing (graphics rendering, deep learning, etc.,)

Beyond GPUs

GPUs were first created for handling graphics processing tasks. AI researchers started taking advantage of CUDA and its direct access to powerful parallel processing via CUDA cores. NVIDIA GPU has Texture Processing, Ray Tracing, Raster, Polymorph engines, etc., (let's say graphics-specific instruction sets). With the increase in adoption of AI, Tensor cores which are good at 4x4 matrix calculation (MMA instruction) are being added which are dedicated to deep learning.

Since 2017, NVIDIA has been increasing the number of Tensor cores in each architecture. But, these GPUs are also good at graphics processing. Although the instruction set and complexity is much less in GPUs, it's not fully dedicated to deep learning (especially Transformer Architecture).

FlashAttention 2, a software layer optimization (mechanical sympathy for the attention layer's memory access pattern) for transformer architecture provides 2X speedup in tasks.

With our in-depth first principles-based understanding of CPU & GPU, we can understand the need for Transformer Accelerators: A dedicated chip (circuit only for transformer operations), with even large number of compute units for parallelism, reduced instruction set, no L1/L2 caches, massive DRAM (registers) replacing HBM, memory units optimized for memory access pattern of transformer architecture. After all LLMs are new companions for humans (after web and mobile), and they need dedicated chips for efficiency and performance.

Some AI Accelerators:

Transformer Accelerators:

FPGA based Transformer Accelerators:

Achronix

References:

https://en.wikipedia.org/wiki/Von_Neumann_architecture
https://chsasank.com/llm-system-design.html
https://www.redhat.com/sysadmin/cpu-components-functionality
https://docs.wixstatic.com/ugd/56440f_e458602dcb0c4af9aaeb7fdaa34bb2b4.pdf
https://www.nand2tetris.org/course
https://cpu.land/
https://en.wikipedia.org/wiki/Hyper-threading
How do Video Game Graphics Work? - https://youtu.be/C8YtdC8mxTU?si=OdrFXUFMLBhuZF34
CPU vs GPU vs TPU vs DPU vs QPU - https://www.youtube.com/watch?v=r5NQecwZs1A
How GPU Computing Works | GTC 2021 | Stephen Jones - https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31151/
Compute Intensity - https://www.linkedin.com/pulse/threads-tensor-cores-beyond-unveiling-dynamics-gpu-memory-florit-smg2c/
How CUDA Programming Works | GTC Fall 2022 | Stephen Jones - https://www.nvidia.com/en-us/on-demand/session/gtcfall22-a41101/
Why use GPU with Neural Networks? - https://www.youtube.com/watch?v=GRRMi7UfZHg
CUDA Hardware | Tom Nurkkala | Taylor University Lecture - https://www.youtube.com/watch?v=kUqkOAU84bA
https://ashanpriyadarshana.medium.com/cuda-gpu-memory-architecture-8c3ac644bd64
https://colab.research.google.com/drive/1nw34aks9SdMwHXl9Gf5T9GPxRB9BIIyr
https://developer.nvidia.com/blog/cuda-refresher-reviewing-the-origins-of-gpu-computing/