Data structures as jigs for programmers (Go edition)

Daniel Lemire's blog

A data structure in programming is a specific way of organizing and storing data in a computer so that it can be accessed and used efficiently.

In woodworking or metalworking, a jig holds a piece of work and guides the tools operating on it. It helps to produce consistent results. The simplest jig is probably a straight edge. For example, a straight metal bar can help you cut wood following a straight line. Another simple jig is the mitre box which helps us cut wood at fixed angles.

Data structures can be thought of as jigs for programmers. Much like how a jig guides the tool to make precise cuts or shapes, data structures in programming provide a framework for organizing and accessing data. Just as a jig guides the tool, data structures guide how data should be stored, accessed, and manipulated. Data structures help us ensure that operations on the data are performed in a predictable and efficient manner.

Conventional computer science courses often introduce a wide range of data structures: linked lists, red-black trees, and so forth. Yet, in practice, the overwhelming majority of programming problems can be solved using little more than arrays, the standard composite type (struct), and the occasional hash table. The analogy with the woodworking jig still holds: the most popular woodworking jigs are so simple that we often even forget that they are an actual tool.

Arrays

Some data structures are particularly useful due to their simplicity and efficiency. The simplest data structure is the array. Arrays are fixed-size collections of elements. E.g., in Go you can declare an array of 5 integers like so:

var myArray [5]int

Typically arrays are stored in a contiguous manner in memory. Thus arrays are ideal if you need to traverse all of the elements as quickly as possible: you access the memory in a predictable manner with little overhead. Accessing any given element in an array typically just involves taking the index and converting it to a memory address. It may require only a few inexpensive operations.

Generally speaking, arrays generate fast code in part because compilers find it easy to optimize the functions. For example, in Go, array accesses are typically subject to a bound checker: Go would prevent access to an array at index 10 if the array size is 5. However, the compiler can often optimize away such checks. For example, consider the following function where we sum the five elements in an array. It should compile to efficient code with hardly any overhead (e.g., bound checking).

func Sum(x [5]int) int {
    sum := 0
    for i := 0; i < 5; i++ {
        sum += x[i]
    }
    return sum
}

An array can contain different values (including arrays!). It is even possible to store Boolean values (true/false) in an array although in such cases, at least a byte per element is required. You may replace an array with a bitset in such cases. You may implement a bitset over an array as follows (using 128 bits as an example).

// BitSet represents a fixed-size bitset
type BitSet struct {
    bytes [16]byte
}


// Set sets the bit at 'index' to 1
func (bs *BitSet) Set(index int) {
    bs.bytes[index/8] |= 1 << uint(7-(index%8))
}

// Clear sets the bit at 'index' to 0
func (bs *BitSet) Clear(index int) {
    bs.bytes[index/8] &^= 1 << uint(7-(index%8))
}

// Get returns the value of the bit at 'index'
func (bs *BitSet) Get(index int) bool {
    return bs.bytes[index/8]&(1<&LTuint(7-(index%8))) != 0
}

A natural extension of the array is the multidimensional array. Even though the Go language supports natively multidimensional arrays, we can still implement them from scratch using a standard array. Typically, we use a row-major implementation. For example, we can implement a simple 3×3 matrix using a 9-element array. The first three elements of the array represent the first row, the next three elements represent the second row, and so forth.

// Matrix represents a 3x3 matrix
type Matrix struct {
    data [9]float64
}


// Set sets the value
func (m *Matrix) Set(i, j int, val float64) {
    if i < 0 || i > 2 || j < 0 || j > 2 {
        panic("Index out of bounds for a 3x3 matrix")
    }
    m.data[i*3+j] = val
}

// Get returns the value
func (m *Matrix) Get(i, j int) float64 {
    if i < 0 || i > 2 || j < 0 || j > 2 {
        panic("Index out of bounds for a 3x3 matrix")
    }
    return m.data[i*3+j]
}

// String provides a string representation of the Matrix
func (m *Matrix) String() string {
    var str string
    for i := 0; i < 3; i++ {
        for j := 0; j < 3; j++ {
            str += fmt.Sprintf("%6.2f ", m.Get(i, j))
        }
        str += "\n"
    }
    return str
}

Go, like many programming languages, allows you to create efficient composite data structures (struct). In particular, you can put arrays inside these data structures. Suppose, for example, that we want to represent points in space, we might do it as follows:

type Point struct {
    x float64
    y float64
}

We can put these points inside an array of ten elements:

var list [10]Point

This pattern is often called an array of structs. We can also represent the same data using a struct of arrays:

type Points struct {
    x [10]float64
    y [10]float64
}

While both the array of structs and the struct of arrays can contain the same information, they have different performance characteristics. For example, if you needed to sum the x values, the struct of arrays might lead to faster code.

Dynamic arrays and slices

A slightly more sophisticated data structure is the dynamic array. A dynamic array is an array that can grow or shrink in size. In Go, they are implemented as slices. They are one of the most fundamental data structures in Go.

var mySlice []int
mySlice = append(mySlice, 1, 2, 3)

You access arrays by integer indexes. E.g., you access an array of size 3 with the indexes 0, 1, 2. Each index gives you access to a value. Thus the array constitutes a key-value data structure where the keys are the indexes.

Because the size of the slice is dynamic, Go provides a function to query it: len (length). E.g., len(mySlice) might be 3.

In Go, you can have several slices share the same underlying array. We use the operator [begin:end] to create a new slice from an existing slice. Importantly, this operation does not copy the underlying data.

In the following example, the second slice will be of length 1 and contain the second element of the first slice.

var mySlice []int
mySlice = append(mySlice, 1, 2, 3)
secondSlice := mySlice[1:2]

Go provides some shorthand notations. Instead of writing mySlice[2:len(mySlice)], you may write mySlice[2:] and instead of writing mySlice[0:1], you may write mySlice[:1]. To create a copy of the slice, you may simply write mySlice[:].

Slices share many of the same characteristics as arrays. For example, we can write functions that take slices as parameters and compile into similarly efficient code, such as this example where we sum the first 5 elements of the slice:

func Sum(x []int) int {
    sum := 0
  if len(x) < 5 {
    return -1
  }
    for i := 0; i < 5; i++ {
        sum += x[i]
    }
    return sum
}

In fact, dynamic arrays, that is, slices, are typically implemented as a thin wrapper over an array. The slice points at a region inside an array. When we shrink or extend the region pointed at by the slice, the underlying array may remain the same.

Typically, the underlying array has more storage capacity than the length of the slice. You may query the size of the underlying array with the cap (capacity) function. It may seem wasteful to have more capacity than needed. However, consider the case where we regularly add elements to a given slice, and the size of the underlying array is set to match exactly the length of the slice. In such a case, each addition (append) to the slice might require allocating a new array and copying the elements. Counting only the copies of elements, we get that we need 1 + 2 + 3 + 4 + … + n − 1 = (n−1)n/2 copies to populate a slice with n values. Instead, we typically grow the size of the underlying array using exponential steps. For example, whenever the capacity becomes insufficient, we could allocate memory to the next power of two: we allocate 16 elements of capacity if we need 7 elements, we allocate 1024 elements if we need 600 elements, and so forth. We can verify that with such a strategy, if we need to repeatedly add an element to a slice, we will never need more than 2n copies to create a slice with n values: 2n is much smaller than (n−1)n/2 when n is large. By default, Go does not recover excess capacity when making the slice smaller (e.g., s = s[:newlength]). However, you can use the extended syntax s[low:high:max] to tell Go to reduce the size of the underlying array to max-low. E.g., s = s[::len(s)] would create a copy of the slice while adjusting the underlying capacity to len(s).

The following program will add elements to a slice and prints the capacity (size of the underlying array) and length of the slice at each increment. At the end of the program, we show that we can shrink the slice with or without adjusting the capacity.

package main

import "fmt"

func main() {
    var mySlice []int
    for i := 0; i < 100; i++ {
        mySlice = append(mySlice, 1)
        fmt.Println(cap(mySlice), " : ", len(mySlice))
    }
    mySlice = mySlice[:10]
    fmt.Println(cap(mySlice), " : ", len(mySlice))
    mySlice = mySlice[:10:10]
    fmt.Println(cap(mySlice), " : ", len(mySlice))
}

The following is a possible output:

1  :  1
2  :  2
4  :  3
4  :  4
8  :  5
8  :  6
8  :  7
8  :  8
16  :  9
...
128  :  99
128  :  100
128  :  10
10  :  10
9  :  2

Sometimes it is convenient to store the data in sorted order within an array. We can then use the array as an efficient set data structure even if the array is large. When searching for a value, we may use a binary search. The algorithm is relatively straight-forward: given the value that we are looking for, we first compare it against the value in the middle, that is, the median value. If the value we are searching for is greater than the median value, we search in the later half of the array, otherwise we search in the early half of the array. We repeat the division recursively until we either find the value that we are looking for, or we determine that the value cannot be found. We need about log2(N) iterations to complete the search over an array of size N, and the logarithmic function grows very slowly (e.g., log2(106) ≈ 20).

// Search searches for a number in a sorted array
// If the number is found, it returns the index of the number and true
// If the number is not found, it returns -1 and false
func Search(n int, array []int) (int, bool) {
    low, high := 0, len(array)-1
    for low <= high {
        mid := (low + high) / 2
        if array[mid] == n {
            return mid, true
        } else if array[mid] < n {
            low = mid + 1
        } else {
            high = mid - 1
        }
    }
    return -1, false
}

Sometimes we only want to be able to access the smallest or largest value. Suppose for example that you have a stream of requests that you need to process, and you always want to pick next the request that has been waiting the longest. You could repeatedly scan the array, or repeatedly sort it. However, if you constantly append and remove data from the array, it could become expensive. Instead, we can order the values in the array according to a binary heap. The root of the heap is stored at index 0, and it is either the smallest or the largest value depending on the type of heap you need. So we either have a max-heap or a min-heap. Within the array, the values are then organized as a tree. For any node at index i, its left child is at 2 * i + 1 and its right child is at 2 * i + 2. At the end of the array, some node may only have a left child, while others may have no child. If we have a max-heap, each node must be no smaller than its children. If we have a min-heap, each node must be no greater than its children. Operations like insertion and deletion are performed by manipulating elements in the array while maintaining the heap property. To insert a new value, you place the new element at the array’s end. It then becomes the child of an existing node if the array was not empty: if you put it at index i, then you can compute the index of the parent as the integer (i−1)/2. You then permute it with its parent to preserve the heap property (e.g., if we have a max-heap, each node must be no smaller than its children). You may then need to check against the parent once more and so forth until you get to the top of the heap. The following function modifies a slice according to this algorithm using the max-heap property.

func Insert(heap *[]int, value int) {
    // Append the new value to the end of the heap
    *heap = append(*heap, value)
    heapSize := len(*heap)

    // Start with the last element's index
    i := heapSize - 1

    // Heapify-up (sift up) to maintain the max-heap property
    for i > 0 {
        parent := (i - 1) / 2

        // If the value inserted is greater than its parent, swap them
        if (*heap)[i] > (*heap)[parent] {
            // Swap the elements
            (*heap)[i], (*heap)[parent] = (*heap)[parent], (*heap)[i]

            // Move up to the parent's index
            i = parent
        } else {
            // If we didn't swap, we've reached the correct position
            break
        }
    }
}

When removing the top element, you replace it with the last element and then propagate this element downward by swapping it with the larger (or smaller) child if it violates the heap property. We may implement it as follows for a max-heap:

// RemoveTop removes the top element of the max-heap
func RemoveTop(heap *[]int) int {
    if len(*heap) == 0 {
        return 0 // or an appropriate zero value/error handling
    }

    // Store the root value to return later
    top := (*heap)[0]

    // Replace the root with the last element
    heapSize := len(*heap)
    (*heap)[0] = (*heap)[heapSize-1]
    *heap = (*heap)[:heapSize-1]

    // Start heapifying down from the root
    i := 0
    for {
        // Calculate indices of children
        leftChild := 2*i + 1
        rightChild := 2*i + 2
        largest := i

        // Check left child
        if leftChild < len(*heap) && (*heap)[leftChild] >
            (*heap)[largest] {
            largest = leftChild
        }

        // Check right child
        if rightChild < len(*heap) && (*heap)[rightChild] >
            (*heap)[largest] {
            largest = rightChild
        }

        if largest != i {
            (*heap)[i],
                (*heap)[largest] = (*heap)[largest],
                (*heap)[i]
            i = largest
        } else {
            // If largest is the root, we are done
            break
        }
    }

    return top
}

Thus with only two relatively simple functions, we can maintain a binary heap. These functions require at most ⌈log2N⌉ comparisons where N is the number of elements in the array. Interestingly, a binary heap provides a sensible algorithm to sort an array sometimes called a heapsort: insert all elements in a binary heap, and then repeatedly remove the top value. The result is a O(NlogN) algorithm.

Just like we can implement a bitset over an array, we can implement a dynamic bitset over a dynamic array. The approach is much the same, but the array is replaced by a slice. Importantly, we need to check for the need to extend the bitset. When the need arises, we grow the bitset by allocating a larger array and copying our existing data. Observe that we choose to grow the slice by twice the amount needed: this prevents degenerate cases where the user might access the bits one by one (0, 1, …), leading to repeated copies and allocations.

// BitSet represents a dynamically growing bitset
type BitSet struct {
    bytes []byte
}

// ensureCapacity makes sure we have enough bytes to access the specified index
func (bs *BitSet) ensureCapacity(index int) {
    requiredBytes := (index + 1 + 7) / 8
    if len(bs.bytes) < requiredBytes {
        // Grow the slice if necessary
        newBytes := make([]byte, requiredBytes*2)
        copy(newBytes, bs.bytes)
        bs.bytes = newBytes
    }
}

// Set sets the bit at 'index' to 1
func (bs *BitSet) Set(index int) {
    bs.ensureCapacity(index)
    bs.bytes[index/8] |= 1 << uint(7-(index%8))
}

// Clear sets the bit at 'index' to 0
func (bs *BitSet) Clear(index int) {
    bs.ensureCapacity(index)
    bs.bytes[index/8] &^= 1 << uint(7-(index%8))
}

// Get returns the value of the bit at 'index'
func (bs *BitSet) Get(index int) bool {
    bs.ensureCapacity(index)
    return bs.bytes[index/8]&(1<&LTuint(7-(index%8))) != 0
}

Hash tables and maps

It is common that you need a data structure where you can access values using either non sequential integers (e.g., 10, 1000, 100000), or other types of values such as a string. E.g., maybe you want a map from names to phone numbers.

One of the most useful data structures is the hash table. Hash tables build on top of arrays, but instead of using consecutive integer values as indexes, they can take nearly any type of values (e.g., string) as keys.

The essential trick of the hash table is to use a hash function which maps arbitrary keys to integer values. The integer values are used as indexes inside an array. The array is typically picked to be of a size comparable to the number of distinct keys. Thus a hash table is a generalization of the array, but one where you replace the simple accesses with a hash function. A hash function takes a key and converts it into an index of the hash table array.

We typically hope that the hash function distributes keys evenly across the array to minimize collisions. When two keys are mapped to the same index inside the array, we must resolve the issue. One possibility is to create a larger array. But this would cause the arrays to grow too quickly in practice. Thus we must handle collisions efficiently. There are two broad strategies. One possibility is to have the ability to store several key-value pairs in each element of the array. It is often called a bucket approach. Another possibility is to store colliding key-value pairs elsewhere: e.g., you can store it to the next available slot in the array.

Many hash table implementations exist with varying properties. Go provides a built-in map type that implements a hash table. E.g., to create a map from strings to integers, we might proceed as follows:

myMap := make(map[string]int)
myMap["key"] = 10

Go uses a bucket approach: each element in the array can store several values. Each bucket in Go’s map implementation contains multiple key-value pairs. The exact number of pairs per bucket can vary, but it’s designed to handle a small number of entries efficiently. E.g., for example, we could use this type of data structure with slices:

type bucket struct {
    keys    []string // Keys
    values  []int    // Values
}

When two keys hash to the same index, they are placed in the same bucket.

When looking up a key, we compute the hash of the key, use the hash to find the correct bucket, check each entry in the bucket for a matching key. As long as the buckets are small enough, the performance is going to be acceptable.

We often compute the number of keys stored in the hash table and divide it by the size of the array. The result is called the load factor. When there are too many keys compared to the size of the underlying array, the array is grown, and the hash table is reconstructed. This will typically reduce the average size of the buckets. Similarly, we can reduce the size of the array if there are too few keys left.

Let us consider a complete example:

package main

import (
    "errors"
    "fmt"
)

type Bucket struct {
    keys   []string
    values []int
}

func (b *Bucket) Add(key string, value int) {
    b.keys = append(b.keys, key)
    b.values = append(b.values, value)
}

func (b *Bucket) Find(key string) (int, error) {
    for i := 0; i < len(b.keys); i++ {
        if key == b.keys[i] {
            return b.values[i], nil
        }
    }
    return 0, errors.New("Not found")
}

type HashTable struct {
    array []Bucket
}

func NewHashTable(size int) *HashTable {
    return &HashTable{
        make([]Bucket, size),
    }
}

func (ht *HashTable) hash(key string) int {
    // A very simple hash function
    hash := 0
    for i := 0; i < len(key); i++ {
        hash += 31 * int(key[i])
    }
    return hash % len(ht.array)
}

func (ht *HashTable) Get(key string) (int, error) {
    return ht.array[ht.hash(key)].Find(key)
}

func (ht *HashTable) Set(key string, value int) error {
    _, e := ht.Get(key)
    if e == nil {
        return errors.New("Key already present")
    }
    ht.array[ht.hash(key)].Add(key, value)
    return nil
}

func main() {
    ht := NewHashTable(10)
    ht.Set("apple", 1)
    ht.Set("banana", 3)
    fmt.Println(ht.Get("apple")) // Should print 1
}

Our example is simplified. However, it illustrates the basic principles of how hash tables work.

The approach we described, with buckets, is relatively simple to implement, but it may not always offer the best performance because of the overhead of maintaining buckets as separate dynamic arrays. Instead, we may consider a variation where we have one key-value per slot in the array. This alternative model is sometimes called open addressing. When two keys hash to the same slot, we can move one of the key-value pairs to another available slot, or increase the size of the underlying array. At query time, we begin by searching for the key-value according to the hash value of the key. If another key-value is found, we visit the next slot, until either we find the value we are looking for, or an empty slot is found. As long as we keep the underlying array large enough so that a sizeable fraction of slots are empty, the performance will be acceptable.

The following code illustrates the main idea behind open addressing, with the exception that the underlying array does not grow as more values are added. To add this functionality, we should track the number of keys added, and grow the array as needed. As a sentinel to indicate an available slot, we use the empty key. In practice, we would need to add additional checks to make sure that the user does not try to add a key with an empty string, and to handle the scenario appropriately.

type Item struct {
    key   string
    value int
}

type HashTable struct {
    items      []Item
    emptyValue Item
}

func NewHashTable(size int) *HashTable {
    ht := &HashTable{
        items:      make([]Item, size),
        emptyValue: Item{key: "", value: -1},
    }
    // Initialize all slots with emptyValue
    for i := 0; i < size; i++ {
        ht.items[i] = ht.emptyValue
    }
    return ht
}


func hash(key string) int {
    // A very simple hash function
    hash := 0
    for i := 0; i < len(key); i++ {
        hash += 31 * int(key[i])
    }
    return hash
}

// Put adds a key-value pair to the hash table
func (ht *HashTable) Put(key string, value int) {
    hashValue := hash(key) % len(ht.items)
    for {
        if ht.items[hashValue] == ht.emptyValue {
            ht.items[hashValue] = Item{key: key, value: value}
            return
        }
        if ht.items[hashValue].key == key {
            ht.items[hashValue].value = value
            return
        }
        // Linear probing to find next available slot
        hashValue = (hashValue + 1) % len(ht.items)
    }
}

// Get retrieves a value from the hash table by key
func (ht *HashTable) Get(key string) (int, bool) {
    hashValue := hash(key) % len(ht.items)
    for i := 0; i < len(ht.items); i++ {
        if ht.items[hashValue].key == key {
            return ht.items[hashValue].value, true
        }
        if ht.items[hashValue] == ht.emptyValue {
            return -1, false // key not found
        }
        hashValue = (hashValue + 1) % len(ht.items)
    }
    return -1, false // key not found after full cycle
}

There are many alternatives that could be even faster than open addressing in some instances, such as Cuckoo hashing. In practice, programmers rarely implement their own hash tables, but they should be aware that there are different implementations with various advantages.

A hash table that only has keys can be used to implement a set data structure. That is, we do not always need to have values. A straight-forward variation worth considering is to allow keys to be mapped to several different values. This variation is sometimes called a multimap.

Conclusion

Though there are countless data structures in computer science, much can be achieved with arrays, and a few simple functions. When organizing your data, you should first try to use arrays. In some cases, a hash table might be needed if you have a set or a map. It should cover the vast majority of your data structures.

To illustrate the idea, consider JSON. JSON (JavaScript Object Notation) is a standard text-based data interchange format that is commonly used to exchange data online. It has primitive values (strings, numbers, Booleans, the null value) and composite types (arrays and objects). Objects are maps from strings to primitive values or to other composite types (arrays and objects). We represent objects as comma-separated key-value pairs within curly braces {}, using the colon as a separator between the key and the value. We use square brackets [] for arrays, separating values with commas. Consider the following example representing a list of employees.

{
   "employees":[
      {
         "name":"Richard",
         "salary":56000,
         "function":"CEO"
      },
      {
         "name":"Lisa",
         "salary":55000,
         "function":"accountant"
      }
   ]
}

Empirically, JSON is found to be sufficient to represent most data, despite its simplicity. A key insight is that by combining arrays and maps, with a few standard types for numbers and strings, you can solve most problems.

In some cases, specialized data structures can provide superior performance. Yet you should be mindful that it is often easier to optimize code when the underlying data structure is simpler.

Generated by RSStT. The copyright belongs to the original author.

Source

Data structures as jigs for programmers (Go edition)

Report Page