Memory Allocations

Memory blocks

The basic memory allocation units are called memory blocks. A memory block is a contiguous memory segment. As aforementioned, at run time, a value part is carried on a single memory block.

A single memory block might carry multiple value parts. The size of a memory block must not be smaller than the size of any value part it carries.

When a memory block is carrying a value part, we may say the value part is referencing the memory block.

Memory allocation operations consume some CPU resources to find suitable memory blocks. So the more memory blocks are created (for memory allocations), the more CPU resources are consumed. In programming, we should try to avoid unnecessary memory allocations to achieve better code execution performance.

Memory allocation places

Go runtime might find a memory block on the stack (one kind of memory zone) of a goroutine or on the heap (the other kind of memory zone) of the whole program to carry some value parts. This finding process is called a (memory) allocation.

The memory management manners of stack and heap are quite different. In most cases, finding a memory block on the stack is much cheaper than on the heap.

Collecting stack memory blocks is also much cheaper than collecting heap memory blocks. In fact, stack memory blocks don't need to be collected. The stack of a goroutine can actually be viewed as a single memory block, and it will be collected as a whole when the goroutine exits.

On the other hand, when all the value parts carried on/by a heap memory block are no longer used (in other words, no alive value part is still referencing the memory block), the memory block will be viewed as garbage and automatically collected eventually, during runtime garbage collection cycles, which might consume certain CPU resources (garbage collection will be discussed in detail in a later chapter). Generally, the more memory blocks are allocated on the heap, the greater the pressure on garbage collection.

As heap allocations are much more expensive, only heap memory allocations contribute to the allocation metrics in Go code benchmark results. But please note that allocating on the stack still has a cost, though it is often comparatively much smaller.

The escape analysis module of a Go compiler can detect that some value parts will be used by only one goroutine and try to let those value parts be allocated on the stack at run time, if certain extra conditions are satisfied. Stack memory allocations and escape analysis will be explained in more detail in the next chapter.
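As a quick preview, the following minimal sketch (a constructed illustration, not one of the book's benchmark examples) shows two functions whose escape decisions typically differ. Building it with go build -gcflags=-m asks the official standard Go compiler to print its escape analysis decisions; the exact decisions depend on the compiler version.

package main

// Build with: go build -gcflags=-m
// The -m flag makes the compiler print optimization decisions,
// including escape analysis results.

func sumLocal() int {
	var a [4]int // used only inside this call; typically allocated on stack
	for i := range a {
		a[i] = i
	}
	return a[0] + a[3]
}

func escapeToHeap() *[4]int {
	var a [4]int // its address outlives the call, so it escapes to the heap
	return &a
}

func main() {
	println(sumLocal())
	println(escapeToHeap()[3])
}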

Memory allocation scenarios

Generally, each of the following operations will make at least one allocation:

  • calling the built-in new function.
  • calling the built-in make function to create slices, maps and channels.
  • taking the address of a composite literal.
  • converting byte slices to strings and vice versa, or converting strings to rune slices.
  • concatenating non-constant strings with +.
  • boxing non-interface values into interface values (with the official standard Go compiler, boxing a non-pointer value generally needs an allocation).
  • appending elements to a slice when the free capacity of the slice is not large enough.
  • putting new entries into a map when the underlying storage of the map needs to grow.

However, the official standard Go compiler makes some special code optimizations so that in certain cases, some of the above listed operations will not make allocations. These optimizations will be introduced later in this book.

Memory wasting caused by allocated memory blocks larger than needed

The sizes of different memory blocks might be different. But they are not arbitrary. In the official standard Go runtime implementation, for memory blocks allocated on the heap,

  • a set of memory block size classes is predefined. As of the official standard Go runtime v1.20, the largest small size class is 32768 bytes, and the smallest ones are 8, 16, 24, 32, 48, 64, 80 and 96 bytes.
  • memory blocks larger than 32768 bytes are composed of whole memory pages. The size of a memory page is 8192 bytes.

So,

  • to allocate a (heap) memory block for a value part whose size is 33 bytes, the size of the memory block is 48 (the smallest size class not smaller than 33), so 15 bytes are wasted.
  • to create a byte slice with 32769 elements on the heap, the memory block carrying the slice elements occupies 5 memory pages, whose total size is 40960 (8192 * 5), so 8191 bytes are wasted.

In other words, memory blocks are often larger than needed. These strategies are adopted to manage memory easily and efficiently, but they might cause a bit of memory waste sometimes (yes, a trade-off).

These facts can be verified by the following program:

package main

import "testing"
import "unsafe"

var t *[5]int64
var s []byte

func f(b *testing.B) {
	for i := 0; i < b.N; i++ {
		t = &[5]int64{} // the 40-byte array needs a 48-byte class block
	}
}

func g(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s = make([]byte, 32769) // 32769 bytes need 5 pages (40960 bytes)
	}
}

func main() {
	println(unsafe.Sizeof(*t))      // 40
	rf := testing.Benchmark(f)
	println(rf.AllocedBytesPerOp()) // 48
	rg := testing.Benchmark(g)
	println(rg.AllocedBytesPerOp()) // 40960
}

Another example:

package main

import "testing"

var s = []byte{32: 'b'} // len(s) == 33
var r string

func Concat(b *testing.B) {
	for i := 0; i < b.N; i++ {
		r = string(s) + string(s)
	}
}

func main() {
	br := testing.Benchmark(Concat)
	println(br.AllocsPerOp())       // 3
	println(br.AllocedBytesPerOp()) // 176
}

There are 3 allocations made within the Concat function. Two of them are caused by the byte slice to string conversions string(s), and the sizes of the two memory blocks carrying the underlying bytes of the two result strings are both 48 (the smallest size class not smaller than 33). The third allocation is caused by the string concatenation, and the size of the result memory block is 80 (the smallest size class not smaller than 66). The three allocations allocate 176 (48 + 48 + 80) bytes in total. In the final result memory block, 14 (80 - 66) bytes are wasted; during executing the whole Concat function, 44 (15 + 15 + 14) bytes are wasted.

In the above example, the results of the string(s) conversions are only used temporarily in the string concatenation operation. With the current official standard Go compiler/runtime implementation (v1.20), the string bytes are allocated on the heap (see the sections below for details). After the concatenation is done, the memory blocks carrying the bytes of the temporary strings become memory garbage and will be collected eventually.
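As a side note, one way to avoid the two temporary allocations (a sketch of my own, not from the original example) is to build the result with a strings.Builder and grow its buffer once up front:

package main

import (
	"strings"
	"testing"
)

var s = []byte{32: 'b'} // len(s) == 33
var r string

func ConcatWithBuilder(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var sb strings.Builder
		sb.Grow(2 * len(s)) // reserve room for both writes in one allocation
		sb.Write(s)
		sb.Write(s)
		r = sb.String() // reuses the builder's buffer, no extra allocation
	}
}

func main() {
	br := testing.Benchmark(ConcatWithBuilder)
	println(br.AllocsPerOp())       // expected: 1
	println(br.AllocedBytesPerOp()) // expected: 80 (size class not smaller than 66)
}

Under the size-class rules described above, this should make a single 80-byte allocation instead of three allocations totaling 176 bytes (the exact numbers may vary by compiler/runtime version).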

Reduce memory allocations and save memory

The fewer memory (block) allocations are made, the fewer CPU resources are consumed, and the lower the pressure on garbage collection.

Memory is cheap nowadays, but this is not true for the memory sold by cloud computing providers. So if we run programs on cloud servers, the more memory is saved by the Go programs, the more money is saved.

The following are some suggestions to reduce memory allocations and save memory in programming.

Avoid unnecessary allocations by allocating enough in advance

We often use the built-in append function to push elements into slices. In the statement r = append(s, elements...), if the free capacity of s is not large enough to hold all the appended elements, Go runtime will allocate a new memory block to carry the elements of the result slice r.

If the append function needs to be called multiple times to push elements, it is best to ensure that the base slice has a large enough capacity in advance, to avoid several unnecessary allocations in the whole pushing process.

For example, to merge some slices into one, the MergeWithTwoLoops implementation shown below is more efficient than the MergeWithOneLoop implementation, because the former makes fewer allocations and copies fewer values.

package allocations

import "testing"

func getData() [][]int {
	return [][]int{
		{1, 2},
		{9, 10, 11},
		{6, 2, 3, 7},
		{11, 5, 7, 12, 16},
		{8, 5, 6},
	}
}

func MergeWithOneLoop(data ...[]int) []int {
	var r []int
	for _, s := range data {
		r = append(r, s...)
	}
	return r
}

func MergeWithTwoLoops(data ...[]int) []int {
	n := 0
	for _, s := range data {
		n += len(s)
	}
	r := make([]int, 0, n)
	for _, s := range data {
		r = append(r, s...)
	}
	return r
}

func Benchmark_MergeWithOneLoop(b *testing.B) {
	data := getData()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = MergeWithOneLoop(data...)
	}
}

func Benchmark_MergeWithTwoLoops(b *testing.B) {
	data := getData()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = MergeWithTwoLoops(data...)
	}
}

The benchmark results:

Benchmark_MergeWithOneLoop-4   636.6 ns/op  352 B/op  4 allocs/op
Benchmark_MergeWithTwoLoops-4  268.4 ns/op  144 B/op  1 allocs/op
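
(Benchmark results like the above, with the B/op and allocs/op columns, can be reproduced by running go test -bench=. -benchmem.)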

The benchmark results show that allocations greatly affect code execution performance.

Let's print some logs to see when each of the 4 allocations happens in a MergeWithOneLoop function call.

package main

import "fmt"

func getData() [][]int {
	return [][]int{
		{1, 2},
		{9, 10, 11},
		{6, 2, 3, 7},
		{11, 5, 7, 12, 16},
		{8, 5, 6},
	}
}

const format = "Allocate from %v to %v (when appending slice#%v).\n"

func MergeWithOneLoop(data [][]int) []int {
	var oldCap int
	var r []int
	for i, s := range data {
		r = append(r, s...)
		if oldCap == cap(r) {
			continue
		}
		fmt.Printf(format, oldCap, cap(r), i)
		oldCap = cap(r)
	}
	return r
}

func main() {
	MergeWithOneLoop(getData())
}

The outputs (for the official standard Go compiler v1.20):

Allocate from 0 to 2 (when appending slice#0).
Allocate from 2 to 6 (when appending slice#1).
Allocate from 6 to 12 (when appending slice#2).
Allocate from 12 to 24 (when appending slice#3).

From the outputs, we can see that only the last append call doesn't allocate.

In fact, the MergeWithTwoLoops function could be faster in theory. As of the official standard Go compiler v1.20, the make call in the MergeWithTwoLoops function zeroes all the just-created elements, which is actually unnecessary here. Compiler optimizations in future versions might avoid the zeroing operation.

BTW, the above implementation of the MergeWithTwoLoops function has an imperfection: it doesn't handle integer overflow. The following is a better implementation.

func MergeWithTwoLoops(data ...[]int) []int {
	n := 0
	for _, s := range data {
		if k := n + len(s); k < n {
			panic("slice length overflows")
		} else {
			n = k
		}
	}
	r := make([]int, 0, n)
	...
}

Avoid allocations if possible

Allocating less is better, but allocating none is the best.

The following is another example showing the performance difference between two implementations: one makes no allocations, the other makes one allocation.

package allocations

import "testing"

func buildOriginalData() []int {
	s := make([]int, 1024)
	for i := range s {
		s[i] = i
	}
	return s
}

func check(v int) bool {
	return v%2 == 0
}

func FilterOneAllocation(data []int) []int {
	var r = make([]int, 0, len(data))
	for _, v := range data {
		if check(v) {
			r = append(r, v)
		}
	}
	return r
}

func FilterNoAllocations(data []int) []int {
	var k = 0
	for i, v := range data {
		if check(v) {
			data[i] = data[k]
			data[k] = v
			k++
		}
	}
	return data[:k]
}


func Benchmark_FilterOneAllocation(b *testing.B) {
	data := buildOriginalData()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = FilterOneAllocation(data)
	}
}

func Benchmark_FilterNoAllocations(b *testing.B) {
	data := buildOriginalData()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = FilterNoAllocations(data)
	}
}

The benchmark results:

Benchmark_FilterOneAllocation-4  7263 ns/op   8192 B/op  1 allocs/op
Benchmark_FilterNoAllocations-4   903.3 ns/op    0 B/op  0 allocs/op

From the benchmark results, we can see that the FilterNoAllocations implementation is more performant. (Of course, if the input data is not allowed to be modified, then we have to choose an implementation which makes allocations.)

Save memory and reduce allocations by combining memory blocks

Sometimes, we could allocate one large memory block to carry many value parts instead of allocating a small memory block for each value part. Doing this reduces the number of memory allocations, so fewer CPU resources are consumed and GC pressure is relieved to some extent. Sometimes it could also decrease memory waste, but this is not always true.

Let's view an example:

package allocations

import "testing"

const N = 100

type Book struct {
	Title  string
	Author string
	Pages  int
}

//go:noinline
func CreateBooksOnOneLargeBlock(n int) []*Book {
	books := make([]Book, n)
	pbooks := make([]*Book, n)
	for i := range pbooks {
		pbooks[i] = &books[i]
	}
	return pbooks
}

//go:noinline
func CreateBooksOnManySmallBlocks(n int) []*Book {
	books := make([]*Book, n)
	for i := range books {
		books[i] = new(Book)
	}
	return books
}

func Benchmark_CreateBooksOnOneLargeBlock(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = CreateBooksOnOneLargeBlock(N)
	}
}

func Benchmark_CreateBooksOnManySmallBlocks(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = CreateBooksOnManySmallBlocks(N)
	}
}

Running the benchmarks, we get:

Benchmark_CreateBooksOnOneLargeBlock-4      4372 ns/op  4992 B/op    2 allocs/op
Benchmark_CreateBooksOnManySmallBlocks-4   18017 ns/op  5696 B/op  101 allocs/op

From the results, we can see that allocating many small value parts on one large memory block

  1. spends much less CPU time.
  2. consumes a bit less memory.

The first conclusion is easy to understand. Two allocation operations spend much less time than 101 allocation operations.

The second conclusion has actually already been explained above. As aforementioned, when the size of a small value (part) doesn't exactly match any memory block size class supported by the official standard Go runtime, a slightly larger memory block than needed will be allocated for the small value (part) if it is created individually. The size of the Book type is 40 bytes (on 64-bit architectures), whereas the smallest memory block size class not smaller than 40 is 48. So allocating a Book value individually wastes 8 bytes.

In fact, the second conclusion only holds under certain conditions. Specifically, it does not hold when the value N is in the range from 820 to 852. In particular, when N == 820, the benchmark results show that allocating many small value parts on one large memory block consumes about 3.5% more memory:

Benchmark_CreateBooksOnOneLargeBlock-4      30491 ns/op  47744 B/op    2 allocs/op
Benchmark_CreateBooksOnManySmallBlocks-4   145204 ns/op  46144 B/op  821 allocs/op

Why does the CreateBooksOnOneLargeBlock function consume more memory when N == 820? Because it needs to allocate a memory block with a minimum size of 32800 (820 * 40), which is larger than the largest small memory block class, 32768. So the memory block needs 5 memory pages, whose total size is 40960 (8192 * 5). In other words, 8160 (40960 - 32800) bytes are wasted.

Although it sometimes wastes more memory, generally speaking, allocating many small value parts on one large memory block is better than allocating each of them on a separate memory block. This is especially true when the lifetimes of the small value parts are almost the same, in which case allocating them on one large memory block can often effectively avoid memory fragmentation.

Use value cache pool to avoid some allocations

Sometimes, we need to frequently allocate and discard values of a specified type. It is a good idea to reuse the allocated values to avoid a large quantity of allocations.

For example, there are many non-player characters (NPCs) in RTS games. A large quantity of NPCs will be spawned and destroyed from time to time in a game session. The related code might look like this:

type NPC struct {
	name       [64]byte
	nameLen    uint16
	blood      uint16
	properties uint32
	x, y       float64
}

func SpawnNPC(name string, x, y float64) *NPC {
	var npc = newNPC()
	npc.nameLen = uint16(copy(npc.name[:], name))
	npc.x = x
	npc.y = y
	return npc
}

func newNPC() *NPC {
	return &NPC{}
}

func releaseNPC(npc *NPC) {
}

As Go supports automatic GC, the releaseNPC function may do nothing. However, such an implementation will lead to a large quantity of allocations during game play and cause great pressure on garbage collection, so that it is hard to guarantee a smooth FPS (frames per second).

We could instead use a cache pool to reduce allocations, like the code shown below.

import "container/list"

var npcPool = struct {
	sync.Mutex
	*list.List
}{
	List: list.New(),
}

func newNPC() *NPC {
	npcPool.Lock()
	defer npcPool.Unlock()
	if npcPool.Len() == 0 {
		return &NPC{}
	}
	return npcPool.Remove(npcPool.Front()).(*NPC)
}

func releaseNPC(npc *NPC) {
	npcPool.Lock()
	defer npcPool.Unlock()
	*npc = NPC{} // zero the released NPC
	npcPool.PushBack(npc)
}

By using the pool (also called a free list), allocations of NPC values will be greatly reduced, which is very helpful to keep a smooth FPS in game play.

If the cache pool is used in only one goroutine, then no concurrency synchronization is needed in the implementation.

We could also set a max size for the pool to prevent it from occupying too much memory. Both ideas appear in the following sketch.
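The following is a minimal sketch of such a single-goroutine pool (my own illustration, reusing the NPC type shown above; maxPoolSize is an arbitrary cap chosen for illustration):

const maxPoolSize = 1024 // arbitrary cap, for illustration only

// A simple free list used by a single goroutine,
// so no synchronization is needed.
var npcFreeList []*NPC

func newNPC() *NPC {
	if n := len(npcFreeList); n > 0 {
		npc := npcFreeList[n-1]
		npcFreeList = npcFreeList[:n-1]
		return npc
	}
	return &NPC{}
}

func releaseNPC(npc *NPC) {
	if len(npcFreeList) < maxPoolSize {
		*npc = NPC{} // zero the released NPC
		npcFreeList = append(npcFreeList, npc)
	}
}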

The standard sync package provides a Pool type with similar functionality, but with several design differences:

  • a sync.Pool is safe for concurrent use by multiple goroutines, which incurs some synchronization overhead.
  • the size of a sync.Pool can't be limited, and the values cached in it may be removed (garbage collected) automatically at any time.
  • the element type of a sync.Pool is the interface type any, so boxing and type assertions are needed when putting and getting values.

Personally, I find the design of sync.Pool seldom satisfies the needs in practice. So I often use custom value cache pool implementations in my Go projects.
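
For comparison, here is a sketch of the same NPC pool built on sync.Pool (a constructed example, not from the original text):

import "sync"

var npcSyncPool = sync.Pool{
	// New is called by Get when the pool is empty.
	New: func() any { return &NPC{} },
}

func newNPC() *NPC {
	return npcSyncPool.Get().(*NPC)
}

func releaseNPC(npc *NPC) {
	*npc = NPC{} // zero the released NPC before reuse
	npcSyncPool.Put(npc)
}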

