Pacuna's Blog

Implementing a concurrent Set in Go – Part I

The purpose of this series is not to present a production-ready data structure, but to study different locking strategies when building one.

We will review three ways to implement a concurrent sorted Set using a Linked list. Sets are well-known data structures, so we can focus on the concurrency aspects and briefly describe the implementation.

Most of the material presented here is based on content from the book The Art of Multiprocessor Programming by Maurice Herlily & Nir Shavit. I translated the Java code to Golang the best I could. So if you find errors or details that can improve the implementation, please let me know.

The interface for the Set contains the following methods:

package set

type Set interface {
    Add(item []byte) bool
    Remove(item []byte) bool
    Contains(item []byte) bool
}

All methods return booleans that indicate if the operation was successful.

The choice for the item’s type is to emphasize the fact that it is not relevant for this implementation. The key used to maintain the order of the elements is its hashcode. The item could be an integer, string, or even a struct type, as long you have a function to hash these types into integer values. In this case, I’m using the hash structure library to generate these hashes.

Non-concurrent implementation

Let’s start with the non-concurrent approach. The basic algorithm won’t change a lot once we jump into the concurrent implementations. Most of the changes have to do with what sections of the code need to be protected with locks.

We start by creating two types: node and list. The fields for node are: item, key, and next, which is a pointer to the next node.

type node struct {
    item []byte
    key  uint64
    next *node
}

func newNode(item []byte) *node {
    key, _ := hashstructure.Hash(item, nil)
    return &node{
    	item: item,
    	key:  key,
    }
}

type list struct {
    head *node
    size int
}

func New() *list {
    head := &node{
    	key: 0,
    }
    tail := &node{
    	key: math.MaxUint64,
    }
    head.next = tail
    return &list{
    	head: head,
    	size: 0,
    }
}

The list has two fields: head and size. We initialize the list with a tail node as the initial successor for head. As you can notice, the type for key is uint64, and since head and tail are the first and last node, they have the minimum and maximum keys possible.

As I mentioned before, we only need the item to create a node. The hashing library generates a hash that we use as the key for that item.

Adding an item

To add an item, we traverse the list using two pointers: current and pred (predecessor). Once the key of the item we want to add falls in between the values of current and pred, we insert it and return true. If the key already exists, we return false.

func (l *list) Add(item []byte) bool {
    var pred, curr *node
    key, _ := hashstructure.Hash(item, nil)

    pred = l.head
    curr = pred.next

    for curr.key < key {
    	pred = curr
    	curr = curr.next
    }

    if key == curr.key {
    	return false
    }
    n := newNode(item)
    n.next = curr
    pred.next = n
    l.size++
    return true
}

Removing an item

To remove an item, we similarly traverse the list until we find the key of the item we want to delete. We then rearrange the pointer for its predecessor to point to its successor. We are not removing the node, but just making it unreachable starting from head.

func (l *list) Remove(item []byte) bool {
    var pred, curr *node
    key, _ := hashstructure.Hash(item, nil)

    pred = l.head
    curr = pred.next

    for curr.key < key {
    	pred = curr
    	curr = curr.next
    }

    if key == curr.key {
    	pred.next = curr.next
                l.size--
    	return true
    }
    return false
}

Checking the existence of an item

For the Contains() method, we similarly traverse the list. Once the current pointer reaches or passes the key of the item we are testing, we return true if it’s present or false otherwise.

func (l *list) Contains(item []byte) bool {
    var pred, curr *node
    key, _ := hashstructure.Hash(item, nil)

    pred = l.head
    curr = pred.next

    for curr.key < key {
    	pred = curr
    	curr = curr.next
    }
    return key == curr.key
}

Testing the implementation

Here’s a test for the basic functionality of the Set:

package simplelist

import "testing"

func Test_List(t *testing.T) {
    l := New()
    l.Add([]byte("item1"))
    l.Add([]byte("item2"))

    if l.Size() != 2 {
    	t.Errorf("Size()=%d, want 2", l.Size())
    }

    if !l.Contains([]byte("item1")) {
    	t.Error("item1 should be present")
    }

    if l.Contains([]byte("nonPresentItem")) {
    	t.Error("item doesn't exist in the set")
    }

    l.Remove([]byte("item2"))

    if l.Contains([]byte("item2")) {
    	t.Error("item2 doesn't exist in the set")
    }

    if l.Size() != 1 {
    	t.Errorf("Size()=%d, want 1", l.Size())
    }
}

Creating data races

We can generate incorrect results by launching multiple goroutines that add items to the Set. If there were also goroutines deleting items, the results would be even worse:

package main

import (
    "fmt"
    "sortedlist/simplelist"
    "sync"
)

func main() {
    l := simplelist.New()
    var wg sync.WaitGroup
    for i := 0; i < 200; i++ {
    	wg.Add(1)
    	go func(it int) {
    		item := []byte(string(it))
    		l.Add(item)
    		wg.Done()
    	}(i)
    }
    wg.Wait()

    fmt.Println(l.Size()) // 1st run: 199, 2nd run: 198, 3rd run: 200...
}

This code launches 200 concurrent goroutines that add an item. We manage the synchronization by using a Wait Group. Once all the goroutines are done, we print the final size of the Set.

If the algorithm were concurrent-safe, we should expect 200 as the final result for every run. But in this case, we get different results for different runs.

Some errors that can happen in this scenario:

  • Since the increment operator for the size of the list is not atomic and is not protected with a lock, different goroutines can overlap, and its final value can be incorrect.
  • When a goroutine is adding a value, it will iterate over the list until it finds the correct spot. Once it finds the right place, it will insert the item. Meanwhile, a different goroutine could have added an item in the same interval, creating a potentially invalid list.

One way of finding data races is to run the program using the -race flag. This command can be handy to debug concurrency errors. It can even display the section of the code that’s generating the data race. But keep in mind for some specific situations, it won’t detect all data races.

Summary

In this first part, we defined the interface for the sorted Set data structure. We identified the types and the fields needed to implement the Add(), Remove(), and Contains() methods. The locking strategies presented in the following posts don’t alter the fundamentals of this algorithm.

In the next part, we will review the most basic locking approach to prevent data races but at the cost of creating heavy contention when the concurrency levels are high.

View original

#go #concurrency

- 1 toasts