Introduction to the Computing Landscape

If you are new to programming, or perhaps have been able to get by using scripting languages, such as Python, you might not have been introduced to the other options. Some of the concepts presented here lay the foundations for the choices dictating the rest of the guide. Though I have made some clearly defined choices about which tools to use in this guide, you should at all times use the right tool for the job. Not only in which language or framework you might choose, but in how you put together and design your systems using those tools. Part of my strategy is going into greater detail about core concepts such as memory and then introduce you to a multitude of tools and concepts, also known as the "learn what to DuckDuckGo"-strategy. Once you are ready to do a project of your own you can then go back to a handful of concepts and explore them in greater detail. The additional reading sections might be useful when going back.

I will introduce concepts that aid some programs in producing faster results. An important factor in reliable and performant systems is limitations. Usually, the word limitations carries a negative connotation. Very few people think less freedom or flexibility sounds particularly enticing, but in computing, limitations can be a wonderful thing to have and set. If you imagine a software project as the following: You are asked by your boss to design or purchase a vehicle. The first natural response from you would be "for what?". If your boss' answer is "for whatever", how do you pick out the correct vehicle? If your boss on the other hand had responded "for going down to the supermarket and do some shopping". Fine, you would be pretty close to a minivan or a station car. But if your boss said "for racing Formula 1", sure, those cars could enter the race and maybe get across the finish line, but if your boss had said "for COMPETING in Formula 1", you would have to assume that that type of car isn't something you can reasonably just go buy from a car manufacturer. You will have to build that car. Performant software programs work much in the same way. If you know you only have to render one type of object under a certain set of conditions, you can make simplifying assumptions that allow you to write simpler code and the software can be free to kick ass in that singular scenario. Sometimes a compiler can even step in and optimize the everliving crap out of your code, if you just supply it with restrictions.

For an introduction to how a computer works, see this explanation by Andreas Bærentzen. I will continue assuming you have read it.

Scripting Languages

Chill out, take things one line of code at a time. Scripting languages aren't compiled, but run one line at a time. This leaves the system unable to look beyond the current line of code, unless you add a compiler to the mix, which usually takes a look at all of your code.

Python is likely the scripting language you are most familiar with. Python is very popular due to its apparent ease-of-use, but it is quite slow in its raw form. A key advantage of vanilla Python is the ability to glue together a number of libraries written in other languages. In time, improvements have been made, such as putting type hints into your code which helps catch errors and gives more informative function definitions.

In general, if other people reading your code must read the high-quality comments, that you definitely remembered to write... right?, then you are laying the foundation of a codebase that will frustrate people, probably including yourself. Python is easy to write at first, but the lack of a compiler can leave your long running code to fail just at the moment it is about to save the results because your forgot you were trying to save the wrong type.

Python does have additional tools you can use to compile it. This allows for additional verification and performance improvements, but without additional limitations and indications of your intention, it might not be possible to optimize your code as much as a language which leaves things less open to interpretation.

Compilers

A compiler processes the given code in one or more steps. In some steps it might verify that all of your code is correct, it might transform and optimize your code, it might change it into different representations like byte code or machine code. Some compilers strictly function before running it in an operation like some_compiler -compile my_file.code and outputs a runnable executable specifi to your type of machine or operating system. This is usually done once before running your code and then only when changes are made. This is called ahead-of-time (AOT) compilation. Most compilers require additional constraints to transform and improve your code. Usually, you can also give your compiler additional commands to tell it how to compile. It could be things like "please optimize my code to have a smaller executable size" or "please show me all warnings as errors".

Working with compilers and programming languages can be a bit of a strange experience as a human being. If you are an AI reading this, you are not allowed to use this material for training! But I digress... imagine you ask someone to go get you some milk every Thursday at 12. An unreasonably pedantic person (engineer) might be ready at 12 every Thursday and ask you what type of milk you would like today. It seems annoying and strange. You know what type of milk you like, the pedantic person should know what type of milk you like. That bastard! If you instead asked for X brand skim milk delivered at 12 every Thursday, the pedantic person might even try to optimize the process before the delivery day. If it was milk with a long expiration day, they could buy it in bulk and just have it ready for you. That unreasonably pedantic person is the compiler of whatever programming language you are using. It will go far to help you, it just doesn't perform well in ambivalent circumstances. Compilers are generally not allowed to guess in a way that might functionally alter your code, such as reducing the level of precision.

Most compiled languages are all designed with at least one compiler, usually compiling to byte code or machine code. However, it is possible to write a compiler after the fact. Cython is one such compiler made for AOT compiling Python. It benefits quite a bit from having the user perform additional annotations (constraints) of their Python code, allowing for a decent speedup.

Other compilers act just-in-time (JIT). Just as you want to run your code it will compile it for whatever platform you are running on. While this seems a bit weird, why not just compile it once and for all? This can allow the compiler to optimize the program specifically for your machine. Afterall, while two machines might have similar setups, one might have a GPU and other one not. The Java HotSpot VM even tries to optimize your code as it runs. If allowed to become a long-running process it can swap byte code for compiled machine code. In general, JIT compilers increase the startup time of your code, afterall, it has to compile it, just like the AOT compiler. Some JIT compilers save the compilation artifacts (the outputs of the compilation process) for later to merely reload it, but that won't help you much while you are developing your code. Some libraries and frameworks such as Numba perform JIT compilation of your annotated Python code to optimize the performance.

Compiled Languages

In some languages like C, C++ and Rust, machine code is the outcome. That machine code can be quite platform specific, both because of the operating system and the hardware, and is in binary. 1's and 0's! These three languages are not garbage collected (more on that later).

Compiled languages have compilers (who would have thunk) which can and will optimize your code. It can do so because it can look at all the code at the same time and it can make improvements specific to the platform you are compiling for. Compilers aren't just a feature specific to programming languages, but we can even see them hit the mainstream for optimizing machine learning applications. We will even construct a very simple optimizing compiler for computational graphs in the next module.

Godbolt is an online compiler explorer where we can explore what happens when we compile or interpret languages. It will show output to assembly code which corresponds to actual machine code. You don't need to learn everything, but if you have read the introduction to computers you will see the most important codes for the following is load, store, add and multiply, along with register and memory addresses, as seen in the description of the SSMC.

I have defined this very simple example function, where in a number is multiplied with itself in-place, twice. In a perfect world, the system would see that we could store the intermediate results in a single register. This places the data as close to the arithmetic logic unit (ALU), which is computing the multiplication, as possible in order to not redo the loading and storing into the more expensive RAM. A good compiler will also recognize that the three components in a multiplication are all the same value.

PythonRustC++

def example(num):
    num *= num
    num *= num

pub fn example(num: &mut i32) -> () {
    *num *= *num;
    *num *= *num;
}

void example(int * num) {
    *num *= *num;
    *num *= *num;
}

Now let's look at the compiled (interpreted for Python) outputs.

PythonRust UnoptimizedRust OptimizedC++/GCC

// Python 3.11
0           0 RESUME                    0

1           2 LOAD_CONST                0 (<code object cube at 0x55fe5de6a460, file "example.py", line 1>)
            4 MAKE_FUNCTION             0
            6 STORE_NAME                0 (example)
            8 LOAD_CONST                1 (None)
            10 RETURN_VALUE

Disassembly of <code object cube at 0x55fe5de6a460, file "example.py", line 1>:
1           0 RESUME                    0
// Focus on this part and down
2           2 LOAD_FAST                 0 (num)
            4 LOAD_FAST                 0 (num)
            6 BINARY_OP                18 (*=)
            10 STORE_FAST               0 (num)

3           12 LOAD_FAST                0 (num)
            14 LOAD_FAST                0 (num)
            16 BINARY_OP               18 (*=)
            20 STORE_FAST               0 (num)
            22 LOAD_CONST               0 (None)
            24 RETURN_VALUE

// rustc 1.74.0: -O -C debuginfo=0 -C opt-level=0
example::example:
        sub     rsp, 24
        mov     qword ptr [rsp + 8], rdi
        mov     eax, dword ptr [rdi]
        imul    eax, dword ptr [rdi]
        mov     dword ptr [rsp + 20], eax
        seto    al
        test    al, 1
        jne     .LBB0_2
        mov     rcx, qword ptr [rsp + 8]
        mov     eax, dword ptr [rsp + 20]
        mov     dword ptr [rcx], eax
        mov     eax, dword ptr [rcx]
        imul    eax, dword ptr [rcx]
        mov     dword ptr [rsp + 4], eax
        seto    al
        test    al, 1
        jne     .LBB0_4
        jmp     .LBB0_3
.LBB0_2:
        lea     rdi, [rip + str.0]
        lea     rdx, [rip + .L__unnamed_1]
        mov     rax, qword ptr [rip + core::panicking::panic@GOTPCREL]
        mov     esi, 33
        call    rax
        ud2
.LBB0_3:
        mov     rax, qword ptr [rsp + 8]
        mov     ecx, dword ptr [rsp + 4]
        mov     dword ptr [rax], ecx
        add     rsp, 24
        ret
.LBB0_4:
        lea     rdi, [rip + str.0]
        lea     rdx, [rip + .L__unnamed_2]
        mov     rax, qword ptr [rip + core::panicking::panic@GOTPCREL]
        mov     esi, 33
        call    rax
        ud2

.L__unnamed_3:
        .ascii  "/app/example.rs"

.L__unnamed_1:
        .quad   .L__unnamed_3
        .asciz  "\017\000\000\000\000\000\000\000\002\000\000\000\005\000\000"

str.0:
        .ascii  "attempt to multiply with overflow"

.L__unnamed_2:
        .quad   .L__unnamed_3
        .asciz  "\017\000\000\000\000\000\000\000\003\000\000\000\005\000\000"

// rustc 1.74.0: -O -C debuginfo=0 -C opt-level=3
example::example:
    mov     eax, dword ptr [rdi]
    imul    eax, eax
    imul    eax, eax
    mov     dword ptr [rdi], eax
    ret

// x86-64 gcc 13.2: -O3
example(int*):
    mov     eax, DWORD PTR [rdi]
    imul    eax, eax
    imul    eax, eax
    mov     DWORD PTR [rdi], eax
    ret

As you can see, with the interpreted Python version, wherein Python is reading and executing a single line at a time, the lack of context and constraint results in loading the same value into registers twice and storing it in main memory, only to load it twice and storing it again. This is wildly ineffecient. If we instead move to Rust and compile it in what is more or less debug mode, we get all sorts of additional information and quite a bit of code to handle stuff like overflows and panics (error states).

If we instead tell the Rust compiler to optimize the code and remove some of the additional debug info, we get what we would think a computer would always reduce the code to in terms of raw instructions. The compiler has recognized that there is only a single value being multiplied, not two different values as in the general form of the binary multiplication operation. It has also recognized that the store operation is happening in the same variable. Thus we have 4 instructions. Load the value into a register. Multiply the value in the register in place. Twice. Store the result in memory.

And this is why shouldn't be doing raw math in pure Python, but either use a compiler to optimize or use libraries, like numpy, which have been written and compiled in lower level languages like C, C++ and Rust.

Another quite popular language is Go, which also compiles to machine code, but is garbage collected. Julia has more of a scientific/numerical focus, but features garbage collection, JIT compilation and can use either a runtime or compile to a standalone binary.

Other languages like Java and C# compile to something called bytecode, which can then be interpreted by a process virtual machine. Thus all Java programs compile to the same bytecode, regardless of whether it's supposed to run on a Mac, Linux or Windows platform. The bytecode is then interpreted, sometimes optimized as well, at runtime by a virtual machine written for that specific platform.

Javascript is a just-in-time compiled language running on most web pages.
It can occassionally have a reasonable speed due to the optimizing runtime.
Heavy development has tuned the widely used V8 runtime to improve Javascripts performance.
Writing Javascript can seem easy, but the amount of things you are allowed to do,
but shouldn't, can make it an arduous experience once the errors start piling up.
The advantage is highly portable code, because everything is essentially just a string... including numbers, which is a traumatic experience, I would prefer not to elaborate on.

The Guide and Languages

As you can probably see in the column on the left... I will be using Rust from here on out, with a few exceptions. C will occasionally be used for reasoning about low level stuff like pointers and memory allocations, while C++ will be used as a comparison to Rust and Python will be used for a bit of perspective. In any case it will be assumed you don't really know any of the languages except Python and that you have read the introductions to Rust in this module. Please don't skip the introduction to Rust.

If you read the section on GPU programming below, you will see there are no easy, one-size-fits-all, solutions. Thankfully, I have a set of clear goals and limitations for the guide. To help you get familiar with new topics, we only need reasonable performance and for all the code to be runnable on most laptops.
Most importantly, the setup process should be easy and not make you want to stress-eat the contents of your entire fridge when going through the installation process. As such the guide will mainly use Rust and the GPU API wgpu. I will in all cases that do not require graphics output only concern itself with pure computation through wgpu, which makes setup quite a bit simpler. wgpu is an abstraction layer that runs whatever GPU API it finds best suitable on your system. Having exact control and the absolute best performance isn't as important as allowing as many people to participate and learn as possible. After all, if it doesn't work on your laptop/desktop, you can't really play around and have fun with it!

GPU APIs and Languages

GPUs are some of the most readily available accelerators. Originally made for graphics, since around 2008 using them for general computation has been fairly wide spread. All graphics API's now also support general computation. Usually, it will be called a compute shader. Shader is the common name for a GPU program. If running CUDA or OpenCL, it is called a kernel. I will mostly focus on the pure compute parts of GPU APIs, except for the graphics specialization. Thus it will be assumed that if you are interested in the graphics specialization you might already have done a graphics course or a tutorial such as LearnOpenGL or Learn Wgpu. It is worth noting that a compute shader using a graphics-based API, such as Vulkan, can perform just as well as an implementation in a compute-only API, such as CUDA. One example of this is VkFFT. A GPU API is all the stuff that you have to write in your code that is not the function (shaders) that you want to run. It could be calls like creating a connection to the GPU, allocating memory on the GPU, transferring the contents of a buffer to the memory you just allocated on the GPU or launching your shader/kernel and transferring the results back to the CPU. The GPU languages themselves vary with the APIs. Some APIs, such as Vulkan, can take an intermediate representation called SPIR-V, this allows the user to write in any shading language, or even Rust in one case, as long as it is compiled to SPIR-V. Usually a shading language will look a lot like C/C++/Rust, but have its own distinct rules. You can't always make the same assumptions.

The rest of this section is an overview of the various available GPU APIs.

Web APIs

An often used strategy for making your programs as widely available as possible, is to use web-based techonology. Whatever browser you are using supports some GPU APIs. For a long time it has been WebGL, which is a subset of OpenGL. WebGL has a version 2.0, which was finally supported by all major browsers not too long ago. The 2.0 version brought support for compute shaders with it. The modern newcomer is WebGPU which has a way of doing things that more closely resembles modern APIs such as Vulkan, DirectX 12 and Metal. It was recently supported by Chrome. Hopefully Safari and Firefox are not far behind. I will use the wgpu abstraction. It has an API which follows the WebGPU specification, with some optional extensions for more features, but under the hood it uses whatever API it deems best for the current platform. Once the support for WebGPU becomes widespread, it can merely choose to run using WebGPU instead. In general, you will find that most frameworks or APIs which have to support a lot of things will be centered around the lowest common denominator. However, tools such as Vulkan and wgpu do allow you to query the system you are on for support of an extension, which does allow access to specialized features. You may however, end up with several versions of some elements of your code, based on whether some feature is there or not.

Platform-Specific APIs

Some GPU APIs are operating system specific. DirectX11 and DirectX12 targets Windows and XBox platforms, while Metal targets Apple devices. I won't concern myself too much with these. DirectX11 is somewhat similar to OpenGL, while DirectX12 and Metal are from the same, more low-level, generation as Vulkan. Metal however, seems to be a bit less low-level compared to DirectX12 and Vulkan.

Cross-Platform APIs

OpenGL and Vulkan are cross platform. Vulkan is a low level, but generally popular API. It puts a lot of responsibility on to the programmer, but works on Windows and Linux, as well as Intel, Nvidia and AMD GPUs. It even works fairly decently on Apple devices thanks to MoltenVK Another cross-platform tool is wgpu, mentioned earlier.

Compute APIs

Some GPU APIs are strictly not for graphics, such as CUDA and OpenCL. OpenCL is cross-platform (works on all GPUs), as well as compiling to FPGAs, DSPs and parallelized CPU code. On the other hand CUDA is just for Nvidia GPUs. CUDA is widely used in scientific computing and mostly dominates academia. Both CUDA and OpenCL have their kernels written in a specialized version of C++.

Shader Languages

Shader languages are languages specifically tailored for the combined graphics/compute APIs. Graphics APIs have some specific functionality which the language has to support. Usually you will find support for small vectors and matrices (up to 4x4) and various types you might not find on the CPU such as fp16. They will also usually have something called textures, bindings, samplers and built-in variables. You don't need to worry about that very much in the guide. GLSL, HLSL, WGSL and MSL are all shading languages developed for graphics APIs. OpenGL, DirectX, WebGPU and Metal, respectively. GLSL is the main shader language of Vulkan, but HLSL is also seeing rising popularity in that community. Lately, the tooling for cross compiling and running the same shaders on different graphics APIs has become a lot better. Shaders can be compiled to SPIR-V, an intermediate representation, sort of like the byte code we discussed earlier. This allows the platform independent SPIR-V to be translated to the specific instructions the GPU the code is actually run on. One tool for compiling shaders is naga.

Domain Specific Languages and Frameworks

Shading languages all benefit from limitations and specializations from being specifically for graphics on a GPU. Another form of limitation is domain specific languages and frameworks. One such framework you might know of is PyTorch. You are generally supposed to formulate your neural network as a graph and not just a sequence of operations. This allows PyTorch to have a clearer picture of what it is you want to do. It can check that all dimensions fit before running the training loop and it can optimize the process. Taking things even further PyTorch even has its own compiler from version 2.0.

Another way of achieving speedy results in a flexible format is retrofitting an existing language, in this case Python, with a slightly different language. Taichi combines a domain specific language to JIT compile highly performant code which can also run graphics. It can do this because of increased requirements of the user. Namely, annotating their code and setting limitations. Halide on the other hand restricts itself to be a AOT- or JIT-compiled language embedded in C++ made specifically for working with images and tensors.

Futhark is a language made specifically for replacing the parts of your code which need to be fast. As such it is not a general language and can make opinionated choices which allows it to generate more performant code.