The Go language and its concurrency model

presented by Victor Moraes @ BCG X Content Sessions. July 2024.

Going to the origins

Different programming languages exist to solve different problems facing different constraints.

Go originated at Google in 2009 to improve productivity in an era of concurrent programming, networked machines and large codebases. Legend says Go was designed while waiting for C++ programs to compile.

Go was not designed as a systems-language, but as an application language. It had to be fast enough, low-level enough, ergonomic, versatile and simple. C++ had too many footguns and Rust had too steep of a learning curve (besides, it wouldn't be around until 2015).

In terms of typology, Go is a statically typed, compiled high-level programming language. It features memory safety, garbage collection, structural typing, and CSP-style concurrency.

Selected Go Proverbs

  • Don't communicate by sharing memory, share memory by communicating.
  • Concurrency is not parallelism.
  • Channels orchestrate; mutexes serialize.
  • The bigger the interface, the weaker the abstraction.
  • gofmt's style is no one's favorite, yet gofmt is everyone's favorite.
  • A little copying is better than a little dependency.
  • Clear is better than clever.
  • Reflection is never clear.
  • Errors are values.
  • Don't panic.

The Benchmarks Game

Things that are built with Go

  • kubernetes
  • terraform
  • docker
  • podman
  • esbuild
  • hugo
  • fzf
  • caddy
  • ngrok
  • prometheus
  • cockroach db
  • iot
  • widespread usage at faang companies

IO-bound vs. CPU-bound

Before optimizing, find the bottleneck. Log statements can be sufficient, but you may need better tools (e.g. monotonic clock with high resolution):

Just throwing more CPUs is not going to make things faster, if the bottleneck is the hard disk, the database or RAM access. It's going to be a waste of energy and money.

Since performance depends on specific hardware, geographic location of the machines and workload, ideally, benchmarks should be run in a setup as close as possible to what you will have in production. Running statistical tests on repeated observations over an extended period of time may be required to prevent biases.

Parallelism and concurrency

A concurrent program is harder to write and debug than a serial one. Rewriting your program into a faster programming language will also mean a lot of work. Before you go into this, consider whether you are using the right libraries and the best algorithm that are available for your language of choice.

Concurrency is not parallelism, although concurrency enables parallelism.

A concurrent program is one in which multiple tasks can be in progress at the same time. For example, a computer with a single core can handle several applications at once, by means of the operating system's scheduler. Another example is that Python and Node.js are single-threaded, but can have concurrent execution by virtue of asyncio, Promises and callbacks.

On the other hand, parallelism is when at a single moment of time different sequences of instructions are executed in different cores.

Do we really need this?

There was a time when programmers didn't need to think about performance. They would focus on something else, take a vacation, and then six months later new hardware was released which solved all the processing bottlenecks. From 1986 to 2003, the performance of processors increased, on average, more than 50% per year.

At a certain point, we hit physical constraints which did not allow us to fit more transistors in the same integrated circuit. From 2015 to 2017, for instance, processor performance increased by less than 4% per year.

Since we cannot run things faster on a single processor, we must use several processors in parallel to speed up computations.

Welcome to the machine

Concurrency may happen through threads or fibers.

Threads are preemptive. This means that the current execution path may be interrupted (preempted) at any time, so data integrity is a big concern. Execution can be stopped in the middle of an update to a chunk of data, leaving it in an incomplete or corrupt state.

Fibers are cooperative. This means the execution path is only interrupted when the fiber yields. Fibers always start and stop in well-defined places, so data integrity is much less of an issue.

Threads can be implemented either in kernel or in user space, or both can be mixed.

Fibers are basically lightweight threads implemented in user space.

In addition to threads and fibers, in user space concurrency can happen by launching concurrent processes. Processes are more isolated from each other than threads. This means data races are more unlikely, but cooperation is also harder. Also, since threads take fewer resources, they have better performance in general.

Per process resourcesPer thread resources
Address spaceProgram counter
Global variablesRegisters
Open filesStack
Child processesState
Pending alarms
Signals & Signal handlers
Accounting information

Data parallelism vs. Task parallelism

When working with parallel computations, you can either distribute the tasks over the cores, or partition the data, while having each core carry out more or less the same operations on its part of the data.

Let's suppose 4 teachers have 40 papers to correct, each one with 4 questions:

SIMD & MIMD

To do, see SIMD at Wikipedia.

Concurrency and its discontents

Concurrency can introduce a new class of bugs that are hard to catch with tests and hard to debug. Since there are multiple sequences of instructions going on simultaneously, the exact order of execution depends on the OS (in particular, on the scheduler), over which you do not have control.

By itself, concurrency implies undefined behavior and non-determinism. This kind of situation has been affectionately called Heisenbug, not only because it sometimes happens and it sometimes doesn't, but also because observing the system may alter its state.

Synchronization constraints

Two constraints exist:

To illustrate this, consider the following situation. You and your friend live in distinct cities, and you have agreed that you will only start having lunch after your friend starts having lunch.

The first logical solution is to use a clock. So, your friend must start having lunch at 12:00 and you can start having lunch any time later than that.

This works in some situations, but there are two problems:

Let's pretend there are no clocks and try another solution. This time, your friend will give you a phone call right after she starts having lunch. Then, you can start having lunch.

This simple mechanism is on the basis of many synchronization solutions, and is called message passing. In general, there is no way to compare events from different threads or processes, except for message passing.

We arrive at another definition:

Two events are concurrent if we cannot tell by looking at the program which one will happen first.

Here is one example (Node 20+ allows top-level await):

const run = () =>
  new Promise((resolve) => {
    let x = 0;

    setTimeout(() => {
  	  x++;
    }, Math.random() * 10);

    setTimeout(() => {
  	  resolve(x);
    }, Math.random() * 10);
  });

while (true) {
  console.log(await run());
}

It is impossible to determine in advance if run() will return 0 or 1.

Enforcing synchronization

As we saw above, conflicts emerge when there are concurrent reads and writes. If there are only concurrent reads, then there is no problem. An interesting case is an update:

count = count + 1

It may not seem like it, but this code executes a read and a write operation:

temp  = count
count = temp  + 1

(In some cases, your CPU may have an increment or a decrement instruction, which assures that this kind of update occurs atomically at the CPU level. However, other kinds of updates can still bite you).

There is one synchronization primitive, of which all other synchronization methods are a variation: the semaphore.

The semaphore

A semaphore is an object that holds an integer. A semaphore must be initialized with any integer before it can be used. It implements two methods:

There are various patterns for working with semaphores, such as the multiplex, the barrier, the turnstile and the queue.

Locks and mutexes are variations of the semaphore, with the difference that sempahores can hold any integer and those can only hold a binary value.

Actors

To do.

Structured concurrency

To do.

Communicating Sequential Processes (CSP)

To do.

Shared state

Concurrency gets dramatically simpler if you remove mutable state from the equation. This is an idea that has been around for a while: concurrency is not inherently difficult, we are just using the wrong programming model.

Functional programming, or in more general terms, the functional paradigm, emphasizes the use of local immutable variables and pure functions, as well as the restriction of side-effects. Since the outcome of every function is only a consequence of its parameters, concurrent execution works just as well as serial execution.

Languages such as Erlang were pioneers in this paradigm switch, whereas nowadays we have Scala, Clojure and Haskell.

Go does not strongly adhere to the functional paradigm, but using it can certainly help us to write concurrent programs.

Are we ready to Go?

Go introduces two features for working with concurrency. These features are baked into the language and do not depend on any library or extension. These are channels and goroutines.

What we will work on

This example was chosen because the author is studying Chinese and wanted to have an automated way to create flashcards. The repository with a possible solution for this example can be accessed at GitHub.

When learning a language, it is important that no one of the four basic skills (reading, writing, listening, speaking) lags behind the others. In Chinese, this can be especially challenging, due to its morphosyllabic writing system and pronunciation features that are absent in most Indo-European languages.

The author wanted a tool which, given a list of words or sentences in Chinese (encoded with the Chinese writing system), would provide, for each of them:

  1. transliteration to the Latin alphabet (pinyin)
  2. English translation
  3. audio with the pronunciation
  4. audio with the pronunciation, slowed down to 70% of the original speed

There is a third-party library that solves (1); (2) and (3) can be solved by means of cloud services such as AWS Translate and AWS Polly; (4) can be obtained by using a tool like ffmpeg over the output of (3). Tasks (1) and (4) are CPU-bound, whereas tasks (2) and (3) are IO-bound (rely on a network service).

When given a list with sentences to work on, this starts to look a lot like the example of the teachers who had to correct a number of assignments, with the only difference that here, one of the tasks take precedence over the other.

sync.WaitGroup

To do.

sync/atomic

To do.

Panic! at the goroutine

To do.

The -race command line flag

To do.

Advanced Topics

References / Further reading