현재 3권의 신간들인 Go Optimizations 101, Go Details & Tips 101Go Generics 101이 출간되어 있습니다. Leanpub 서점에서 번들을 모두 구입하시는 방법이 비용 대비 효율이 가장 좋습니다.

Go에 대한 많은 정보들과 Go 101 책들의 최신 소식을 얻으시려면 Go 101 트위터 계정인 @Go100and1을 팔로잉 해주세요.

Pointers

Avoid unnecessary nil array pointer checks in a loop

There are some flaws in the current official standard Go compiler implementation (v1.20). One of them is some nil array pointer checks are not moved out of loops. Here is an example to show this flaw.

// unnecessary-checks.go
package pointers

import "testing"

const N = 1000
var a [N]int

//go:noinline
func g0(a *[N]int) {
	for i := range a {
		a[i] = i // line 12
	}
}

//go:noinline
func g1(a *[N]int) {
	_ = *a // line 18
	for i := range a {
		a[i] = i // line 20
	}
}

func Benchmark_g0(b *testing.B) {
	for i := 0; i < b.N; i++ { g0(&a) }
}

func Benchmark_g1(b *testing.B) {
	for i := 0; i < b.N; i++ { g1(&a) }
}

Let's run the benchmarks with the -S compiler option, the following outputs are got (uninterested texts are omitted):

$ go test -bench=. -gcflags=-S unnecessary-checks.go
...
0x0004 00004 (unnecessary-checks.go:12)	TESTB	AL, (AX)
0x0006 00006 (unnecessary-checks.go:12)	MOVQ	CX, (AX)(CX*8)
...
0x0000 00000 (unnecessary-checks.go:18)	TESTB	AL, (AX)
0x0002 00002 (unnecessary-checks.go:18)	XORL	CX, CX
0x0004 00004 (unnecessary-checks.go:19)	JMP	13
0x0006 00006 (unnecessary-checks.go:20)	MOVQ	CX, (AX)(CX*8)
...
Benchmark_g0-4  517.6 ns/op
Benchmark_g1-4  398.1 ns/op

From the outputs, we could find that the g1 implementation is more performant than the g0 implementation, even if the g1 implementation contains one more code line (line 18). Why? The question is answered by the outputted assembly instructions.

In the g0 implementation, the TESTB instruction is generated within the loop, whereas in the g1 implementation, the TESTB instruction is generated out of the loop. The TESTB instruction is used to check whether or not the argument a is a nil pointer. For this specified case, checking once is enough. The one more code line avoids the flaw in the compiler implementation.

There is a third implementation which is as performant as the g1 implementation. The third implementation uses a slice derived from the array pointer argument.

//go:noinline
func g2(x *[N]int) {
	a := x[:]
	for i := range a {
		a[i] = i
	}
}

Please note that the flaw might be fixed in future compiler versions.

And please note that, if the three implementation functions are inline-able, the benchmark results will change much. That is the reason why the //go:noinline compiler directives are used here. (Before Go toolchain v1.18, the //go:noinline compiler directives are actually unnecessary here. Because Go toolchain v1.18- never inlines a function containing a for-range loop.)

The case in which an array pointer is a struct field

For the cases in which an array pointer is a struct field, things are a little complex. The _ = *t.a line in the following code is useless to avoid the compiler flaw. For example, in the following code, the performance difference between the f1 function and the f0 function is small. (In fact, the f1 function might be even slower, if a NOP instruction is generated within its loop.)

type T struct {
	a *[N]int
}

//go:noinline
func f0(t *T) {
	for i := range t.a {
		t.a[i] = i
	}
}

//go:noinline
func f1(t *T) {
	_ = *t.a
	for i := range t.a {
		t.a[i] = i
	}
}

To move the nil array pointer checks out of the loop, we should copy the t.a field to a local variable, then adopt the trick introduced above:

//go:noinline
func f3(t *T) {
	a := t.a
	_ = *a
	for i := range a {
		a[i] = i
	}
}

Or simply derive a slice from the array pointer field:

//go:noinline
func f4(t *T) {
	a := t.a[:]
	for i := range a {
		a[i] = i
	}
}

The benchmark results:

Benchmark_f0-4  622.9 ns/op
Benchmark_f1-4  637.4 ns/op
Benchmark_f2-4  511.3 ns/op
Benchmark_f3-4  390.1 ns/op
Benchmark_f4-4  387.6 ns/op

The results verify our previous conclusions.

Note, the f2 function mentioned in the benchmark results is declared as

//go:noinline
func f2(t *T) {
	a := t.a
	for i := range a {
		a[i] = i
	}
}

The f2 implementation is not fast as the f3 and f4 implementations, but it is faster than the f0 and f1 implementations. However, that is another story.

If the elements of an array pointer field are not modified (only read) in the loop, then the f1 way is as performant as the f3 and f4 way.

Personally, for most cases, I think we should try to use the slice way (the f4 way) to get the best performance, because generally slices are optimized better than arrays by the official standard Go compiler.

Avoid unnecessary pointer dereferences in a loop

Sometimes, the current official standard Go compiler (v1.20) is not smart enough to generate assembly instructions in the most optimized way. We have to write the code in another way to get the best performance. For example, in the following code, the f function is much less performant than the g function.

// avoid-indirects_test.go
package pointers

import "testing"

//go:noinline
func f(sum *int, s []int) {
	for _, v := range s { // line 8
		*sum += v // line 9
	}
}

//go:noinline
func g(sum *int, s []int) {
	var n = *sum
	for _, v := range s { // line 16
		n += v // line 17
	}
	*sum = n
}

var s = make([]int, 1024)
var r int

func Benchmark_f(b *testing.B) {
	for i := 0; i < b.N; i++ {
		f(&r, s)
	}
}

func Benchmark_g(b *testing.B) {
	for i := 0; i < b.N; i++ {
		g(&r, s)
	}
}

The benchmark results (uninterested texts are omitted):

$ go test -bench=. -gcflags=-S avoid-indirects_test.go
...
0x0009 00009 (avoid-indirects_test.go:9)	MOVQ	(AX), SI
0x000c 00012 (avoid-indirects_test.go:9)	ADDQ	(BX)(DX*8), SI
0x0010 00016 (avoid-indirects_test.go:9)	MOVQ	SI, (AX)
0x0013 00019 (avoid-indirects_test.go:8)	INCQ	DX
0x0016 00022 (avoid-indirects_test.go:8)	CMPQ	CX, DX
0x0019 00025 (avoid-indirects_test.go:8)	JGT	9
...
0x000b 00011 (avoid-indirects_test.go:16)	MOVQ	(BX)(DX*8), DI
0x000f 00015 (avoid-indirects_test.go:16)	INCQ	DX
0x0012 00018 (avoid-indirects_test.go:17)	ADDQ	DI, SI
0x0015 00021 (avoid-indirects_test.go:16)	CMPQ	CX, DX
0x0018 00024 (avoid-indirects_test.go:16)	JGT	11
...
Benchmark_f-4  3024 ns/op
Benchmark_g-4   566.6 ns/op

The outputted assembly instructions show the pointer sum is dereferenced within the loop in the f function. A dereference operation is a memory operation. For the g function, the dereference operations happen out of the loop, and the instructions generated for the loop only process registers. It is much faster to let CPU instructions process registers than process memory, which is why the g function is much more performant than the f function.

This is not a compiler flaw. In fact, the f and g functions are not equivalent (though for most use cases in practice, their results are the same). For example, if they are called like the following code shows, then they return different results (thanks to skeeto@reddit for making this correction).

{
	var s = []int{1, 1, 1}
	var sum = &s[2]
	f(sum, s)
	println(*sum) // 6
}
{
	var s = []int{1, 1, 1}
	var sum = &s[2]
	g(sum, s)
	println(*sum) // 4
}

Another performant implementation for this specified case is to move the pointer parameter out of the function body (again, it is not totally equivalent to either f or g function):

//go:noinline
func h(s []int) int {
	var n = 0
	for _, v := range s {
		n += v
	}
	return n
}

func use_h(s []int) {
	var sum = new(int)
	*sum += h(s)
	...
}

Index↡

The Go 101 프로젝트는 Github 에서 호스팅됩니다. 오타, 문법 오류, 부정확한 표현, 설명 결함, 코드 버그, 끊어진 링크와 같은 모든 종류의 실수에 대한 수정 사항을 제출하여 Go 101을 개선을 돕는 것은 언제나 환영합니다.

주기적으로 Go에 대한 깊이 있는 정보를 얻고 싶다면 Go 101의 공식 트위터 계정인 @go100and1을 팔로우하거나 Go 101 슬랙 채널에j가입해주세요.

이 책의 디지털 버전은 아래와 같은 곳을 통해서 구매할 수 있습니다.
Go 101의 저자인 Tapir는 2016년 7월부터 Go 101 시리즈 책들을 집필하고 go101.org 웹사이트를 유지 관리하고 있습니다. 새로운 콘텐츠는 책과 웹사이트에 수시로 추가될 예정입니다. Tapir는 인디 게임 개발자이기도 합니다. Tapir의 게임을 플레이하여 Go 101을 지원할 수도 있습니다. (안드로이드와 아이폰/아이패드용):
  • Color Infection (★★★★★), 140개 이상의 단계로 이루어진 물리 기반의 캐주얼 퍼즐 게임
  • Rectangle Pushers (★★★★★), 2가지 모드와 104개 이상의 단계로 이루어진 캐주얼 퍼즐 게임
  • Let's Play With Particles, 세가지 미니 게임이 있는 캐주얼 액션 게임
페이팔을 통한 개인 기부도 환영합니다.

색인: