The Go language and its concurrency model

Going to the origins

Different programming languages exist to solve different problems facing different constraints.¹¹ Some people say you should learn one new programming language per year.

Go originated at Google in 2009 to improve productivity in an era of concurrent programming, networked machines and large codebases. Legend says Go was designed while waiting for C++ programs to compile.

Go was not designed as a systems-language, but as an application language. It had to be fast enough, low-level enough, ergonomic, versatile and simple. C++ had too many footguns and Rust had too steep of a learning curve (besides, it wouldn't be around until 2015).

In terms of typology, Go is a statically typed, compiled high-level programming language. It features memory safety, garbage collection, structural typing, and CSP-style concurrency.

Selected Go Proverbs²² Source.

Don't communicate by sharing memory, share memory by communicating.
Concurrency is not parallelism.
Channels orchestrate; mutexes serialize.
The bigger the interface, the weaker the abstraction.
gofmt's style is no one's favorite, yet gofmt is everyone's favorite.
A little copying is better than a little dependency.
Clear is better than clever.
Reflection is never clear.
Errors are values.
Don't panic.

The Benchmarks Game³³ Source.

...beats Python by orders of magnitude,
...beats Java and Javascript almost every time, by a significant margin,
...is 2-3x slower than C++ or Rust,
...is on par with C# AOT⁴⁴ Ahead of Time compilation skips Intermediate Language code and generates machine code directly. Not all language features can be used..

Things that are built with Go

kubernetes
terraform
docker
podman
esbuild
hugo
fzf
caddy
ngrok
prometheus
cockroach db
iot
widespread usage at faang companies

IO-bound vs. CPU-bound

Before optimizing, find the bottleneck. Log statements can be sufficient, but you may need better tools (e.g. monotonic clock with high resolution):

Python has timeit,
Javascript has Performance.now(),
Go has go test -bench,
PostgreSQL has pgbench,
HTTP servers can be tested with Grafana's k6⁵⁵ also ab, Apache benchmarking.,
Unix's time has information on real, user and system time,
htop can display CPU use, memory use, disk and network IO,
strace can monitor system calls,
your cloud provider offers a series of statistics for VM and function‑as‑a‑service instances.

Just throwing more CPUs is not going to make things faster, if the bottleneck is the hard disk, the database or RAM access. It's going to be a waste of energy and money.

Since performance depends on specific hardware, geographic location of the machines and workload, ideally, benchmarks should be run in a setup as close as possible to what you will have in production. Running statistical tests on repeated observations over an extended period of time may be required to prevent biases⁶⁶ "Performance Matters" by Emery Berger..

Parallelism and concurrency

A concurrent program is harder to write and debug than a serial one. Rewriting your program into a faster programming language will also mean a lot of work. Before you go into this, consider whether you are using the right libraries and the best algorithm that are available for your language of choice.

Concurrency is not parallelism, although concurrency enables parallelism.

A concurrent program is one in which multiple tasks can be in progress at the same time. For example, a computer with a single core can handle several applications at once, by means of the operating system's scheduler. Another example is that Python and Node.js are single-threaded, but can have concurrent execution by virtue of asyncio, Promises and callbacks.⁷⁷ Though nowadays there are several scapehatches for running truly parallel code in Python and Node.js.

On the other hand, parallelism is when at a single moment of time different sequences of instructions are executed in different cores.

Do we really need this?

There was a time when programmers didn't need to think about performance. They would focus on something else, take a vacation, and then six months later new hardware was released which solved all the processing bottlenecks. From 1986 to 2003, the performance of processors increased, on average, more than 50% per year.

At a certain point, we hit physical constraints which did not allow us to fit more transistors in the same integrated circuit. From 2015 to 2017, for instance, processor performance increased by less than 4% per year.

Since we cannot run things faster on a single processor, we must use several processors in parallel to speed up computations.

Welcome to the machine

Concurrency may happen through threads or fibers.

Threads are preemptive. This means that the current execution path may be interrupted (preempted) at any time, so data integrity is a big concern. Execution can be stopped in the middle of an update to a chunk of data, leaving it in an incomplete or corrupt state.

Fibers are cooperative. This means the execution path is only interrupted when the fiber yields. Fibers always start and stop in well-defined places, so data integrity is much less of an issue.

Threads can be implemented either in kernel or in user space, or both can be mixed.

Fibers are basically lightweight threads implemented in user space.

In addition to threads and fibers, in user space concurrency can happen by launching concurrent processes. Processes are more isolated from each other than threads. This means data races are more unlikely, but cooperation is also harder. Also, since threads take fewer resources, they have better performance in general.

Per process resources	Per thread resources
Address space	Program counter
Global variables	Registers
Open files	Stack
Child processes	State
Pending alarms
Signals & Signal handlers
Accounting information

Data parallelism vs. Task parallelism

When working with parallel computations, you can either distribute the tasks over the cores, or partition the data, while having each core carry out more or less the same operations on its part of the data.

Let's suppose 4 teachers have 40 papers to correct, each one with 4 questions:

Teacher 1 corrects Question 1 for all of the papers, Teacher 2 corrects Question 2 for all of the papers, ...
Teacher 1 corrects all of the questions for papers 1-10, Teacher 2 corrects all of the questions for papers 11-20, ...

SIMD & MIMD

To do, see SIMD at Wikipedia.

Concurrency and its discontents

Concurrency can introduce a new class of bugs that are hard to catch with tests and hard to debug. Since there are multiple sequences of instructions going on simultaneously, the exact order of execution depends on the OS (in particular, on the scheduler), over which you do not have control.

By itself, concurrency implies undefined behavior and non-determinism. This kind of situation has been affectionately called Heisenbug, not only because it sometimes happens and it sometimes doesn't, but also because observing the system may alter its state.

Synchronization constraints

Two constraints exist:

Serialization: Event A must happen before Event B.
Mutual exclusion: Events A and B must not happen at the same time.

To illustrate this, consider the following situation. You and your friend live in distinct cities, and you have agreed that you will only start having lunch after your friend starts having lunch.

The first logical solution is to use a clock. So, your friend must start having lunch at 12:00 and you can start having lunch any time later than that.

This works in some situations, but there are two problems:

Clocks have a limited precision.
Computers can't record the time everything happened — it's just too much, too fast.

Let's pretend there are no clocks⁸⁸ or sundials, or any other way to tell the time. and try another solution. This time, your friend will give you a phone call right after she starts having lunch. Then, you can start having lunch.

This simple mechanism is on the basis of many synchronization solutions, and is called message passing. In general, there is no way to compare events from different threads or processes, except for message passing.

We arrive at another definition:

Two events are concurrent if we cannot tell by looking at the program which one will happen first.

Here is one example (Node 20+ allows top-level await):

const run = () =>
  new Promise((resolve) => {
    let x = 0;

    setTimeout(() => {
  	  x++;
    }, Math.random() * 10);

    setTimeout(() => {
  	  resolve(x);
    }, Math.random() * 10);
  });

while (true) {
  console.log(await run());
}

It is impossible to determine in advance if run() will return 0 or 1⁹⁹ For more inside on how the Javascript event loop works, watch this video..

Enforcing synchronization

As we saw above, conflicts emerge when there are concurrent reads and writes. If there are only concurrent reads, then there is no problem. An interesting case is an update:

count = count + 1

It may not seem like it, but this code executes a read and a write operation:

temp  = count
count = temp  + 1

(In some cases, your CPU may have an increment or a decrement instruction, which assures that this kind of update occurs atomically at the CPU level. However, other kinds of updates can still bite you).

There is one synchronization primitive, of which all other synchronization methods are a variation: the semaphore.

The semaphore

A semaphore is an object that holds an integer. A semaphore must be initialized with any integer before it can be used. It implements two methods:

wait decrements the value of the semaphore. If a thread decrements a semaphore and the semaphore's value becomes negative, the thread blocks itself (sleeps).
signal increments the value of the semaphore. If there were other threads waiting, then one of them gets unblocked (awoken).

There are various patterns for working with semaphores, such as the multiplex, the barrier, the turnstile and the queue.

Locks and mutexes are variations of the semaphore, with the difference that sempahores can hold any integer and those can only hold a binary value.

Actors

To do.

Structured concurrency

To do.

Communicating Sequential Processes (CSP)

To do.

Shared state

Concurrency gets dramatically simpler if you remove mutable state from the equation. This is an idea that has been around for a while: concurrency is not inherently difficult, we are just using the wrong programming model.

Functional programming, or in more general terms, the functional paradigm, emphasizes the use of local immutable variables¹⁰¹⁰ Although that is a contradiction in terms... and pure functions, as well as the restriction of side-effects. Since the outcome of every function is only a consequence of its parameters, concurrent execution works just as well as serial execution.

Languages such as Erlang were pioneers in this paradigm switch, whereas nowadays we have Scala, Clojure and Haskell.

Go does not strongly adhere to the functional paradigm, but using it can certainly help us to write concurrent programs.

Are we ready to Go?

Go introduces two features for working with concurrency. These features are baked into the language and do not depend on any library or extension. These are channels and goroutines.

A Go channel is simply a queue.
A goroutine is a green thread, i.e., a lightweight thread managed by a userspace program instead of by the OS. It differs from a fiber in that it uses preemptive scheduling.

What we will work on

This example was chosen because the author is studying Chinese and wanted to have an automated way to create flashcards. The repository with a possible solution for this example can be accessed at GitHub.

When learning a language, it is important that no one of the four basic skills (reading, writing, listening, speaking) lags behind the others. In Chinese, this can be especially challenging, due to its morphosyllabic writing system and pronunciation features that are absent in most Indo-European languages.

The author wanted a tool which, given a list of words or sentences in Chinese (encoded with the Chinese writing system), would provide, for each of them:

transliteration to the Latin alphabet (pinyin)¹¹¹¹ i.e., a best effort transliteration, since this is not an easy task.
English translation
audio with the pronunciation
audio with the pronunciation, slowed down to 70% of the original speed

There is a third-party library that solves (1); (2) and (3) can be solved by means of cloud services such as AWS Translate and AWS Polly; (4) can be obtained by using a tool like ffmpeg over the output of (3). Tasks (1) and (4) are CPU-bound, whereas tasks (2) and (3) are IO-bound (rely on a network service).

When given a list with sentences to work on, this starts to look a lot like the example of the teachers who had to correct a number of assignments, with the only difference that here, one of the tasks take precedence over the other.

sync.WaitGroup

To do.

sync/atomic

To do.

Panic! at the goroutine

To do.

The `-race` command line flag

To do.

Advanced Topics

References / Further reading

Go by Example, by Mark McGranaghan and Eli Bendersky.
Learn Go with Tests, by Chris James.
The Little Book of Semaphores, by Allen B. Downey.
Seven Languages in Seven Weeks, esp. the chapters on Io and Erlang, by Bruce A. Tate.
An Introduction to Parallel Programming, by Peter Pacheco.

Going to the origins

Selected Go Proverbs22 Source.

The Benchmarks Game33 Source.

Things that are built with Go

IO-bound vs. CPU-bound

Parallelism and concurrency

Do we really need this?

Welcome to the machine

Data parallelism vs. Task parallelism

SIMD & MIMD

Concurrency and its discontents

Synchronization constraints

Enforcing synchronization

The semaphore

Actors

Structured concurrency

Communicating Sequential Processes (CSP)

Shared state

Are we ready to Go?

What we will work on

sync.WaitGroup

sync/atomic

Panic! at the goroutine

The -race command line flag

Advanced Topics

References / Further reading

Selected Go Proverbs²² Source.

The Benchmarks Game³³ Source.

The `-race` command line flag