Locks Are Some Shit

I've been reading Operating Systems: Three Easy Pieces. I highly recommend the book if you're cool fiddling with C a bit. Actually, scratch that: I recommend the book if you ever write code that runs on a server or any other linux/osx environment, especially if you feel a little out of your depth with C. The code examples are not that intimidating, even if you don't know from typecasting or a pointer (okay, learning the difference between foo, *foo and &foo is useful, but not knowing it doesn't prevent you from getting the gist of the code samples), and getting a deeper understanding of the environment your code works in will make a lot of known unknowns come into a bit more focus. Honest.

The first of the three parts was virtualization: that is, how computing time and memory get divvied up amongst processes. There was some fascinating stuff in there: the API for forking a new process, for instance, is weirder and cooler than I expected, and learning the topography of the boundary between the OS and application code (such as ls or Google Chrome.app) is rad. And then there are parts that are totally internal, though vital, to the kernel. Nothing against free space management, segmentation, or the five whole chapters on memory paging, but I'm just not as interested in the kernel's implementation as its interface.

I kept wondering if I would be better off just jumping straight to concurrency, because that's what I was really pumped to learn. Would I be missing out on some logically necessary information if I skipped those chapters?? (Nope.) Learn from my mistake and jump to the shit you find interesting, because someday you're going to die. So: locks!

I Don't Need No Stinkin' Lock

Locks are what keep multiple threads, running in parallel, from fucking each other up when dealing with a shared bit of state. There is almost no operation too small for these little bastards to mess up, given the chance. Take this:

static volatile int counter = 0;

void *mythread(void *arg) {
  int i;
  for (i = 0; i < 100000; i++) {
    counter++;
  }
  return NULL;
}

Because I had to look it up: volatile is a keyword that tells the compiler the value might change out from under it at any time, so it can't cache the value in a register or optimize reads and writes of it away. (It does not, as we're about to see, make anything thread-safe on its own.)
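Here's a minimal sketch of the kind of optimization it rules out (my example, not the book's):

int plain_done = 0;
volatile int volatile_done = 0;

void wait_plain(void) {
  // without volatile, the compiler may assume nothing else can touch
  // plain_done, load it once, and spin on that cached value forever
  while (!plain_done)
    ;
}

void wait_volatile(void) {
  // with volatile, every pass through the loop has to actually re-read
  // the value from memory
  while (!volatile_done)
    ;
}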

So, there's a shared counter, and there's a procedure that uses it, suitable for handing to a couple of threads. I extracted all of the code that deals with that shared bit of state:

counter++

That's it! A single line, with a single unary operator. How unsafe can THAT be? Let's add some logging and fire up a couple threads:

#include <stdio.h>
#include <pthread.h>

static volatile int counter = 0;

void *mythread(void *arg) {
  printf("%s: begin\n", (char *) arg);
  int i;
  for (i = 0; i < 100000; i++) {
    counter++;
  }
  printf("%s: end\n", (char *) arg);
  return NULL;
}

int main(int argc, char *argv[]) {
  pthread_t p1, p2;
  printf("main: begin (counter = %d)\n", counter);

  pthread_create(&p1, NULL, mythread, "A");
  pthread_create(&p2, NULL, mythread, "B");

  // wait for them fucks to finish
  pthread_join(p1, NULL);
  pthread_join(p2, NULL);

  printf("main: done with both (counter = %d)", counter);
  return 0;
}

So. counter starts at 0, and then two threads each run counter++ 100,000 times apiece. That makes 200,000, right?

main: begin (counter = 0)
A: begin
B: begin
B: end
A: end
main: done with both (counter = 100745)

INTERESTING.

The trouble is that counter++ is three operations, not one:

  1. Load the value of counter from memory into a register
  2. Increment the value in that register by one
  3. Store the register's new value back to counter's spot in memory

So p2 loads the value of counter from memory, say 17; then it increments its copy to 18; at the same time as p2 is doing that incrementing, one core over, p1 loads that same memory location, which still holds 17. In parallel, each adds one to the value it read and stores its new value back to memory, and lo: 17 + 1 + 1 = 18.
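Spelled out in C (a rough sketch of what the compiler emits for counter++, not its literal output), those three steps look like this:

void increment_counter(void) {
  int reg = counter; // 1. load counter from memory into a register
  reg = reg + 1;     // 2. increment the register
  counter = reg;     // 3. store the register back out to memory
}

A thread can get yanked off the CPU between any two of those lines, and another thread's loads and stores can land in the gap. That's the whole problem.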

Lock That Shit Down

So let's suppose you give a shit about the integrity of basic arithmetic in your code. The above nonsense won't do at all. I ran it eight times (you can, too! Just stick the code above in a file (say, bad_math.c), compile it with something like gcc -pthread -o bad_math bad_math.c, and go hog wild), with the following results:

ABAB 127499
ABAB 144926
ABAB 116942
ABBA 102988
ABBA 100745
ABAB 114188
AABB 200000
ABBA 104161

avg. 126431

I'm honestly pretty shocked that one run actually got 200,000. (As a sidenote, it looks like the ABAB pattern of thread starts/finishes loses fewer increments than ABBA, with respective averages of 125888 and 102631. AABB, of course, will always get 200,000, since A finishes before B even begins (as would BBAA, but A always gets kicked off first because main creates it first).)

So let's lock this shit down. main doesn't need to change at all: we just need to set up a lock that both threads can see, and use it in the mythread procedure to ensure that only one thread at a time can enter the critical section (that's the term, coined by Dijkstra, for a section of code dealing with shared memory; here, counter++). Here's the most vanilla implementation for a POSIX system:

static volatile int counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *mythread(void *arg) {
  printf("%s: begin\n", (char *) arg);
  int i;
  for (i = 0; i < 100000; i++) {
    pthread_mutex_lock(&lock);
    counter += 1;
    pthread_mutex_unlock(&lock);
  }
  printf("%s: end\n", (char *) arg);
  return NULL;
}

No need for extra headers; as you probably gathered from the naming, that locking mechanism is part of pthread.h. When any thread calls pthread_mutex_lock(&foo) for a lock foo, one of two things happens: if no other thread holds the lock, the caller grabs it and carries on into the critical section; if another thread does hold it, the caller waits for that thread to call pthread_mutex_unlock(&foo) and THEN grabs the lock and does its thing.
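One aside I glossed over: PTHREAD_MUTEX_INITIALIZER is the lazy, static way to set up a lock. If you need to initialize one at runtime, or want to actually check return codes (these calls return 0 on success), it looks something like this; the setup/teardown names are mine, not pthread's:

#include <assert.h>
#include <pthread.h>

pthread_mutex_t runtime_lock;

void setup_lock(void) {
  // NULL means "default mutex attributes"
  int rc = pthread_mutex_init(&runtime_lock, NULL);
  assert(rc == 0);
}

void teardown_lock(void) {
  int rc = pthread_mutex_destroy(&runtime_lock);
  assert(rc == 0);
}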

As you might expect, this version gets 200,000 every time (but don't believe me...). So what's going on under the hood?

Under The Hood

A huge disclaimer before we start playing around with implementing our own locks: this shit does not work. Checking some value to determine if a lock is in use and then updating that value to claim the lock takes multiple operations, and at the application level there's no way to ensure that those happen atomically. A real lock leans on the hardware, which provides atomic instructions (test-and-set, compare-and-swap, and friends) that bundle the check and the update into a single uninterruptible operation, and which C code can reach through library calls or compiler builtins.

Depending on your machine, one of those instructions, compare-and-swap, behaves, if you squint, a little something like this plain-C imitation (which, remember, throws away the one property that matters: the atomicity):

typedef struct __lock_t {
  int flag;
} lock_t;

void init(lock_t *lock) {
  // 0 => lock is available, 1 => lock is held
  lock->flag = 0;
}

int compare_and_swap(int *old_ptr, int expected, int new) {
  int actual = *old_ptr;
  if (actual == expected)
    *old_ptr = new;
  return actual;
}

void lock(lock_t *lock) {
  while (compare_and_swap(&lock->flag, 0, 1) == 1)
    ; // do butt-ass nothing until that lock gets released
}

void unlock(lock_t *lock) {
  lock->flag = 0;
}

This is a shitty lock for a few reasons:

  1. If the lock is taken, the waiting thread just spin-waits, burning CPU cycles until the scheduler decides to let the lock-holding thread finish its work
  2. It's possible to have a thread that "starves": i.e. never, ever gets the lock
  3. It does not work. (There's a sketch of exactly how it falls apart right after this list.)
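To make that third point concrete, here's one possible interleaving of two threads calling lock() while flag is 0, walked through by hand (my trace, same compare_and_swap as above):

A: actual = *old_ptr;        reads 0
B: actual = *old_ptr;        also reads 0
A: if (actual == expected)   true, so...
A:   *old_ptr = new;         flag = 1; A's while loop exits; A enters
B: if (actual == expected)   B checks its stale 0: also true!
B:   *old_ptr = new;         flag = 1 again; B enters too

Both threads are now in the critical section at the same time, which is the one thing a lock exists to prevent. The hardware versions make that read-compare-write sequence a single instruction, so there's no gap for the other thread to sneak into.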

But how badly does it not work?

ABAB 156000
ABAB 103114
ABAB 129168
ABBA 101519
ABBA 114152
ABAB 103095
ABBA 100576
ABAB 101809

avg. 113679

So, I mean, that's a worse average than no lock at all, but there was that fluky 200,000 in there. Without the outlier, it would look a lot wors-

avg. 115921

Oh. From an analytical perspective, there are more steps in compare_and_swap than in counter++, so there are more places for a malicious CPU scheduler to fuck with things; from a statistical perspective, we're nowhere near solid ground for declaring a winner in the contest of "no locks vs useless locks"; from an engineering perspective, please just use pthread_mutex_t locks.

There is a glimmer of hope in the book before we totally close the case on software locking:

[a figure from the book: a lock written from x86 assembly instructions]

Seems easy enough. Take the busted lock, swap in the new CompareAndSwap implementation, however the fuck that works (looks like it's embedding literal strings of assembly right in the C, but I'm in way over my head here), and give that a test run:

ABAB 141034
ABAB 128868
ABAB 149149
ABAB 133336
ABAB 165496
ABAB 130632
ABAB 163608
ABAB 163309

avg. 146929

Not half bad*!

* Almost exactly half bad
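(If inline assembly gives you hives, there's a middle path: GCC and Clang expose that same hardware compare-and-swap as a compiler builtin, so you can build an honest-to-god atomic lock without writing any assembly yourself. A sketch, assuming one of those compilers; this is my code, not the book's:)

typedef struct __cas_lock_t {
  int flag; // 0 => lock is available, 1 => lock is held
} cas_lock_t;

void cas_lock(cas_lock_t *lock) {
  // __sync_val_compare_and_swap atomically reads flag, sets it to 1 if
  // (and only if) it was 0, and returns whatever it saw: no gap to sneak into
  while (__sync_val_compare_and_swap(&lock->flag, 0, 1) == 1)
    ; // spin until the lock gets released
}

void cas_unlock(cas_lock_t *lock) {
  // a plain store back to 0; stricter code would use an atomic release here
  lock->flag = 0;
}

Same shape as the busted lock above, except the read-compare-write is one atomic hardware operation, which is the entire ballgame.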

That's all I got right now on locks.

A Brief Aside About C

Working in C feels like working with a database to me: the fundamental way to define the shape of your data is the struct, a behaviorless mapping of typed data fields to names, just like a table in a relational database.

Now, it's more complicated than that, of course, and C is a good bit more expressive than SQL. For instance, you can make certain fields function pointers, i.e. fields that hold the memory address of a function, and call through them (I believe that's roughly how C++ classes work under the hood, but don't quote me on that as an authority).
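A quick sketch of what that convention looks like (my toy example, nothing from the book):

#include <stdio.h>

// a struct with data fields plus a function pointer: about as close as
// plain C gets to "a row that knows how to do something"
typedef struct counter_t {
  int value;
  void (*increment)(struct counter_t *self);
} counter_t;

void plain_increment(counter_t *self) {
  self->value++;
}

int main(void) {
  counter_t c = { .value = 0, .increment = plain_increment };
  c.increment(&c); // call through the field, method-style
  printf("%d\n", c.value); // prints 1
  return 0;
}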

Stuff In This Chapter That I Left Out

  • Various tradeoffs in balancing fairness and performance while maintaining mutual exclusion (thus, incidentally, the term "mutex")
  • Some interesting historical locking mechanisms
  • Some background on what the hardware does and doesn't do, and what that means for the OS
  • A cool-ass locking implementation from the linux kernel that tracks both the status of the lock and the size of its queue with a single integer