Faster sorting with Go generics - Eli Bendersky's website

Faster sorting with Go generics - Eli Bendersky's website

Eli Bendersky's website

Recently, a new set of sorting functions has landed in Go'sgolang.org/exp/slices package [1]. These functions leverage Go genericsto provide a more ergonomic API for sorting (without requiring users toimplement sort.Interface), and alsodeliver a nice performance improvement, as the CL demonstrates.

In this post, I'll dive deep into why these generic functions are faster thanthe existing ones in the sort package, even though they use preciselythe same algorithm and loop structure. This should hopefully be an interestingpeek into how Go generics are implemented, in comparison to the existingdynamic dispatch mechanism (interfaces).

To decouple our discussion from the messy specifics of the Go standard libraryand experimental repositories, and to be able to focus on a smaller, simplercode sample, I've reimplemented this comparison on a simpler sorting algorithm,where the performance difference also manifests [2].

Bubble sort

Let's implement good old bubble sort!

The full code for this post, along with tests and benchmarks isavailable on GitHub.Here's a simple bubble sort implementation using a Go interface forcustomization:

func bubbleSortInterface(x sort.Interface) {  n := x.Len()  for {    swapped := false    for i := 1; i < n; i++ {      if x.Less(i, i-1) {        x.Swap(i, i-1)        swapped = true      }    }    if !swapped {      return    }  }}

As a reminder, sort.Interface is aninterface defined in Go's sort package, and looks like this:

type Interface interface {  Len() int  Less(i, j int) bool  Swap(i, j int)}

The sort package also provides useful type adapters to imbuesort.Interface onto common slice types. For example, sort.StringSlicemakes a []string implement this interface in a way that sorts elements inascending order, and we can write something like:

var ss []string// ... code that populates ssbubbleSortInterface(sort.StringSlice(ss))
Generic bubble sort

Here's a generic bubble sort, for a slice of types that implementconstraints.Ordered:

func bubbleSortGeneric[T constraints.Ordered](x []T) {  n := len(x)  for {    swapped := false    for i := 1; i < n; i++ {      if x[i] < x[i-1] {        x[i-1], x[i] = x[i], x[i-1]        swapped = true      }    }    if !swapped {      return    }  }}

As you can see, the code is exactly the same except that we can use len,< and swapping with multiple assignment directly instead of deferring tointerface methods. This is possible because the function knows important thingsabout what it's sorting: that it's a slice and also that it implementsOrdered.

Benchmarking

I've benchmarked these two implementations against each other by sorting arandomly-generated slice of 1000 strings; here are the results on my machine:

$ go test -bench=.goos: linuxgoarch: amd64pkg: example.comcpu: Intel(R) Core(TM) i7-4771 CPU @ 3.50GHzBenchmarkSortStringInterface-8             124           9599141 ns/opBenchmarkSortStringGeneric-8               158           7433097 ns/op

The generic version is over 20% faster. This is great news, overall. Gogenerics not only delivers a convenient way to write code that acts on multipletypes, but can also provide performance benefits!

In the rest of this post, we'll discuss why generics is faster here.

Analyzing the interface version

In the code accompanying this post, Ihave a standalone runner program that's useful for profiling. It creates a largeslice and sorts it using one of the methods provided on the command-line; italso enables pprof-based CPU profiling. Let's see how this looks for ourinterface-based bubble sort:

$ ./bubble.out -cpuprofile cpui.out -kind strinterface$ go tool pprof -list bubbleSortInterface ./bubble.out cpui.out<...>ROUTINE ======================== main.bubbleSortInterface     350ms      1.10s (flat, cum)   100% of Total         .          .     26:         .          .     27:func bubbleSortInterface(x sort.Interface) {         .          .     28: n := x.Len()         .          .     29: for {         .          .     30:         swapped := false      70ms       70ms     31:         for i := 1; i < n; i++ {     160ms      830ms     32:                 if x.Less(i, i-1) {      20ms      100ms     33:                         x.Swap(i, i-1)         .          .     34:                         swapped = true         .          .     35:                 }         .          .     36:         }     100ms      100ms     37:         if !swapped {         .          .     38:                 return         .          .     39:         }         .          .     40: }         .          .     41:}         .          .     42:

As expected, the program spends most of its time in the inner loop doingcomparisons and swaps, but mostly just comparisons. Bubble sort does O(N^2)comparisons for a sequence of length N.

If the majority of the time is spent on comparisons, it's interesting to examinewhat that actually entails; what instructions does the CPU execute whilecalling x.Less in this code? Luckily, I've recently written another blogpost on exactly this topic!I recommend that you at least skim it, but the important part for our purposeis that calling x.Less means this sequence of instructions:

MOVQ  0x48(SP), DXMOVQ  0x20(DX), SILEAQ  -0x1(CX), DIMOVQ  DI, 0x30(SP)MOVQ  0x50(SP), AXMOVQ  CX, BXMOVQ  DI, CXNOPLCALL  SI

In our code, we're running bubbleSortInterface(sort.StringSlice(ss)), so wehave to turn to sort.StringSlice for the definition of the actual Lessmethod that's going to be invoked:

func (x StringSlice) Less(i, j int) bool { return x[i] < x[j] }

In assembly, it looks like this:

CMPQ  0x10(R14), SPJBE   0x4664d1SUBQ  $0x28, SPMOVQ  BP, 0x20(SP)LEAQ  0x20(SP), BPMOVQ  AX, 0x30(SP)CMPQ  DI, BXJBE   0x4664c5SHLQ  $0x4, DIMOVQ  0(DI)(AX*1), DXMOVQ  0x8(DI)(AX*1), R8CMPQ  SI, BXJBE   0x4664b8SHLQ  $0x4, SIMOVQ  0(AX)(SI*1), CXMOVQ  0x8(AX)(SI*1), DIMOVQ  DX, AXMOVQ  R8, BXCALL  runtime.cmpstring(SB)TESTQ AX, AXSETL  ALMOVQ  0x20(SP), BPADDQ  $0x28, SPRET

If this looks longer than you've expected, it's because slice indexing in Go isprotected against out-of-bounds access, and the comparisons we see at thebeginning of the function are precisely that - jumping to a portion of the code(which I didn't include, for brevity) that invokes runtime.panicIndex. Atlast, the actual string comparison is performed with a call toruntime.cmpstring; this, on its own, is a pretty interesting function that'simplemented in assembly for the common architectures (for example AMD64),but let's stop the rabbit hole here, since this part will be shared across boththe implementations we're comparing.

Now we should have a fairly comprehensive understanding of where the CPU spendsits time when running bubble sort using the interface dispatch version. Let'sturn our attention to the generic version.

Detour: how generics are implemented in Go 1.18

Russ Cox has a short and excellent blog post titled The Generic Dilemma. Its main claim is that when a languagedecides on whether to have generics and how to implement them, it faces thefollowing decision:

do you want slow programmers, slow compilers and bloated binaries, or slowexecution times?

"Slow compilers and bloated binaries" refers to the C++ approach of implementingtemplates by full monomorphization - each template invocationis treated a bit like a macro expansion with its full code copy-pasted with theright types.

"Slow execution times" refers to the Java approach of boxing, or to dynamiclanguages where code is trivially generic due to transparent dynamic dispatchon every call.

Please note that none of these descriptions are meant to be denigratory; theseare all real tradeoffs language designers face, Go included.

Specifically, Go has seriously considered both monomorphization (called"stenciling" in the Go world) and dynamic dispatch (called "dictionaries" in theGo world):

Neither approach is perfect on its own, due to the reasons stated above.Therefore, another design was proposed:

This "GC shape" approach is a compromise between the two extremes of stencilingand dictionaries. Depending on the instantiated type we either monomorphize oruse dynamic dispatch. There is a more up-to-date documentthat describes how Go 1.18 does this in detail.

Specifically, different underlying types like ints and strings willget their own GC shape, meaning that a different function will be generated foreach, with the types hard-coded (so this is monomorphization). On the otherhand, all pointer types will be grouped in the same GC shape and will usedynamic dispatch.

Note that this is the state of the world in Go 1.18; it may, and most likelywill change in the future, as the Go team is teaming up with the communityto learn what works best for real-life workloads.

Analyzing the generic version

If we use pprof to analyze the generic version, we'll see that it also spendsthe majority of its time in comparisons, but overall less time than theinterface version.

As discussed in the previous section, a string type will get its own GCshape and therefore its own function hard-coded for the string type. Let'ssee how this looks in the assembly.

First, rummaging through the debug information of the binary we'll find thesymbol bubbleSortGeneric[go.shape.string_0] which represents the stenciledversion of bubbleSortGeneric for the GC shape that string is currentlythe only member of. We won't find it as a standalone function to call, though,since it's inlined into its call site. This inlining does not affect performancein any way, so we'll just focus on the instructions for the inner loop that,to remind you, does this:

for i := 1; i < n; i++ {  if x[i] < x[i-1] {    x[i-1], x[i] = x[i], x[i-1]    swapped = true  }}

And it translates to this assembly:

MOVQ  0x80(SP), R8INCQ  R8MOVQ  0x70(SP), CXMOVQ  0x78(SP), BXMOVQ  R8, DXMOVL  AX, SIMOVQ  0xb0(SP), AXCMPQ  DX, BXJLE   0x4aef20MOVQ  DX, 0x80(SP)MOVB  SI, 0x3d(SP)MOVQ  DX, SISHLQ  $0x4, DXMOVQ  DX, 0x90(SP)MOVQ  0(DX)(AX*1), R8MOVQ  0x8(DX)(AX*1), BXLEAQ  -0x1(SI), R9SHLQ  $0x4, R9MOVQ  R9, 0x88(SP)MOVQ  0(R9)(AX*1), CXMOVQ  0x8(R9)(AX*1), DIMOVQ  R8, AXCALL  runtime.cmpstring(SB)MOVQ  0xb0(SP), DXMOVQ  0x90(SP), SILEAQ  0(SI)(DX*1), DIMOVQ  0x88(SP), R8LEAQ  0(R8)(DX*1), R9TESTQ AX, AXJGE   0x4af01a

The first thing to note is that there is no dynamic dispatch to the Lessmethod. Each loop iteration invokes cmpstring directly. Second, the latterpart of the assembly resembles the code of Less shown earlier, with onecrucial difference - there are no bounds checks! Go includes abounds-check elimination (BCE) passwhich can get rid of the bounds checks for the comparison:

// ... earlier we had n := len(x)for i := 1; i < n; i++ {  if x[i] < x[i-1] {

The compiler knows that i is between 1 and len(x) at all times (bylooking at the loop description and the fact that i is not otherwisemodified [3]) and hence that x[i] and x[i-1] are both safely accessingthe slice in-bounds.

In the interface version, the compiler does not eliminatethe bounds checks from Less; the function is defined like this:

func (x StringSlice) Less(i, j int) bool { return x[i] < x[j] }

And who knows what indices are passed in! Moreover, because of the dynamicdispatch this function is not inlined its caller, where the compiler couldperhaps have more insight into what's going on. The Go compiler has somedevurtualization capabilities, but they don't kick in here. This isan additional interesting area of compiler improvements.

Generic sort with a custom comparison function

To verify some of the observations described earlier, I've implemented anotherversion of the generic bubble sort; this time, without relying onconstraints.Ordered but using a comparison function instead:

func bubbleSortFunc[T any](x []T, less func(a, b T) bool) {  n := len(x)  for {    swapped := false    for i := 1; i < n; i++ {      if less(x[i], x[i-1]) {        x[i-1], x[i] = x[i], x[i-1]        swapped = true      }    }    if !swapped {      return    }  }}

We can use this function to sort strings as follows:

bubbleSortFunc(ss, func(a, b string) bool { return a < b })

Before reading what follows, can you guess how well this approach does inthe benchmarks compared to the generic sort that uses < and comparedto the interface-based sort?

$ go test -bench=.goos: linuxgoarch: amd64pkg: example.comcpu: Intel(R) Core(TM) i7-4771 CPU @ 3.50GHzBenchmarkSortStringInterface-8             124           9599141 ns/opBenchmarkSortStringGeneric-8               158           7433097 ns/opBenchmarkSortStringFunc-8                  138           8584866 ns/op

It's about in the middle! 14% slower than the other generic function, but 10%faster than the interface-based version.

When looking at its assembly code, it's easy to see why. While thefunction-based version does not avoid the function call for each comparison (ithas to call a function provided as an argument), it does avoid the boundschecks, because the access to x[i] and x[i-1] happens within the body ofthe function (and not in a function it invokes). So we get a partial win.

This comparison has interesting real-life implications because SortFunc isalso a variant that was added to the golang.org/exp/slices, to provide moregeneral sorting capabilities (for types that are unconstrained). This versionalso provides a speedup against sort.Sort.

Another implication is for sorting pointer types; as mentioned earlier, the Gocompiler in 1.18 will group all pointer types into a single GC shape, meaningthat it will need to pass a dictionary around for dynamic dispatch. This maymake the code slower, though BCE should still kick in - so not much slower. Ihave a benchmark demonstrating this in the repository.

Finally, as a simple exercise - take the bubble sort functions shown here andbenchmark sorting a large slice of integers using the generic function vs. theinterface-based one. You'll likely see a massive speedup with generics, morethan for strings. Can you figure out why?

Parting words

While this post was about 2/3 of the way through its draft process, thisarticle written by Planetscale engineers waspublished. It describes some scenarios in which converting monomorphized code togenerics makes it significantly slower. It's a very good article, and is worth aread (and I love the disassembly widget).

It's worth saying that generics in Go are very new. What we're playing withnow is the initial implementation that was published literally 2 weeks ago! Thefirst release focused on making generics work at all, with a large effortspent on correctness for many tricky corner cases. Making it fast is also planned, butit's not the highest priority right now.

That said, the implementation is in flux and being improved all the time; forexample, this Go changewhich will make it into Go 1.19 already fixes some of the issues discussed inthe Planetscale article. On the other hand, due to the reasons discussed inThe Generic Dilemma, it's unlikely that Go's generics will ever be "zero-cost"in all possible scenarios. Go places a strong priority on fast compile timesand compact binary sizes, so it has to make a certain set of tradeoffs withany design.

Give generics a ride for your use cases! Play, explore, try things out. Peekunder the hood (hopefully my post helps with that), report issues that you findwith correctness or performance. This is just the beginning for Go generics.


[1]This package was created to incubate some new, generics-basedfunctionality for Go slices with the 1.18 release, and will likely jointhe Go standard library in some future release.[2]

The standard library uses a variation of quicksort which is split overseveral functions and is more difficult to follow in a blog post due toits large code size and relative complexity.

It's worth noting that there are proposals to implement moresophisticated sorting algorithms in the standard library; the algorithmitself would be mostly orthogonal to the implementation considerationsdescribed in this post, however.

[3]This is a gross oversimplification of how the compiler works! In realitythe Go backend uses SSA form, which is perfect for optimizations likethis.

本文章由 flowerss 抓取自RSS,版权归源站点所有。

查看原文:Faster sorting with Go generics - Eli Bendersky's website

Report Page