현재 3권의 신간들인 Go Optimizations 101, Go Details & Tips 101Go Generics 101이 출간되어 있습니다. Leanpub 서점에서 번들을 모두 구입하시는 방법이 비용 대비 효율이 가장 좋습니다.

Go에 대한 많은 정보들과 Go 101 책들의 최신 소식을 얻으시려면 Go 101 트위터 계정인 @Go100and1을 팔로잉 해주세요.

Strings in Go

Like many other programming languages, string is also one important kind of types in Go. This article will list all the facts of strings.

The Internal Structure of String Types

For the standard Go compiler, the internal structure of any string type is declared like:
type _string struct {
	elements *byte // underlying bytes
	len      int   // number of bytes
}

From the declaration, we know that a string is actually a byte sequence wrapper. In fact, we can really view a string as an (element-immutable) byte slice.

Note, in Go, byte is a built-in alias of type uint8.

Some Simple Facts About Strings

We have learned the following facts about strings from previous articles. Example:
package main

import "fmt"

func main() {
	const World = "world"
	var hello = "hello"

	// Concatenate strings.
	var helloWorld = hello + " " + World
	helloWorld += "!"
	fmt.Println(helloWorld) // hello world!

	// Compare strings.
	fmt.Println(hello == "hello")   // true
	fmt.Println(hello > helloWorld) // false
}

More facts about string types and values in Go. Example:
package main

import (
	"fmt"
	"strings"
)

func main() {
	var helloWorld = "hello world!"

	var hello = helloWorld[:5] // substring
	// 104 is the ASCII code (and Unicode) of char 'h'.
	fmt.Println(hello[0])         // 104
	fmt.Printf("%T \n", hello[0]) // uint8

	// hello[0] is unaddressable and immutable,
	// so the following two lines fail to compile.
	/*
	hello[0] = 'H'         // error
	fmt.Println(&hello[0]) // error
	*/

	// The next statement prints: 5 12 true
	fmt.Println(len(hello), len(helloWorld),
			strings.HasPrefix(helloWorld, hello))
}

Note, if aString and the indexes in expressions aString[i] and aString[start:end] are all constants, then out-of-range constant indexes will make compilations fail. And please note that the evaluation results of such expressions are always non-constants (this might or might not change since a later Go version). For example, the following program will print 4 0.
package main

import "fmt"

const s = "Go101.org" // len(s) == 9

// len(s) is a constant expression,
// whereas len(s[:]) is not.
var a byte = 1 << len(s) / 128
var b byte = 1 << len(s[:]) / 128

func main() {
	fmt.Println(a, b) // 4 0
}

About why variables a and b are evaluated to different values, please read the special type deduction rule in bitwise shift operator operation and which function calls are evaluated at compile time

String Encoding and Unicode Code Points

Unicode standard specifies a unique value for each character in all kinds of human languages. But the basic unit in Unicode is not character, it is code point instead. For most code points, each of them corresponds to a character, but for a few characters, each of them consists of several code points.

Code points are represented as rune values in Go. In Go, rune is a built-in alias of type int32.

In applications, there are several encoding methods to represent code points, such as UTF-8 encoding and UTF-16 encoding. Nowadays, the most popularly used encoding method is UTF-8 encoding. In Go, all string constants are viewed as UTF-8 encoded. At compile time, illegal UTF-8 encoded string constants will make compilation fail. However, at run time, Go runtime can't prevent some strings from being illegally UTF-8 encoded.

For UTF-8 encoding, each code point value may be stored as one or more bytes (up to four bytes). For example, each English code point (which corresponds to one English character) is stored as one byte, however each Chinese code point (which corresponds to one Chinese character) is stored as three bytes.

String Related Conversions

In the article constants and variables, we have learned that integers can be explicitly converted to strings (but not vice versa).

Here introduces two more string related conversions rules in Go:
  1. a string value can be explicitly converted to a byte slice, and vice versa. A byte slice is a slice with element type's underlying type as []byte.
  2. a string value can be explicitly converted to a rune slice, and vice versa. A rune slice is a slice whose element type's underlying type as []rune.

In a conversion from a rune slice to string, each slice element (a rune value) will be UTF-8 encoded as from one to four bytes and stored in the result string. If a slice rune element value is outside the range of valid Unicode code points, then it will be viewed as 0xFFFD, the code point for the Unicode replacement character. 0xFFFD will be UTF-8 encoded as three bytes (0xef 0xbf 0xbd).

When a string is converted to a rune slice, the bytes stored in the string will be viewed as successive UTF-8 encoding byte sequence representations of many Unicode code points. Bad UTF-8 encoding representations will be converted to a rune value 0xFFFD.

When a string is converted to a byte slice, the result byte slice is just a deep copy of the underlying byte sequence of the string. When a byte slice is converted to a string, the underlying byte sequence of the result string is also just a deep copy of the byte slice. A memory allocation is needed to store the deep copy in each of such conversions. The reason why a deep copy is essential is slice elements are mutable but the bytes stored in strings are immutable, so a byte slice and a string can't share byte elements.

Please note, for conversions between strings and byte slices, Conversions between byte slices and rune slices are not supported directly in Go, We can use the following ways to achieve this goal: Example:
package main

import (
	"bytes"
	"unicode/utf8"
)

func Runes2Bytes(rs []rune) []byte {
	n := 0
	for _, r := range rs {
		n += utf8.RuneLen(r)
	}
	n, bs := 0, make([]byte, n)
	for _, r := range rs {
		n += utf8.EncodeRune(bs[n:], r)
	}
	return bs
}

func main() {
	s := "Color Infection is a fun game."
	bs := []byte(s) // string -> []byte
	s = string(bs)  // []byte -> string
	rs := []rune(s) // string -> []rune
	s = string(rs)  // []rune -> string
	rs = bytes.Runes(bs) // []byte -> []rune
	bs = Runes2Bytes(rs) // []rune -> []byte
}

Compiler Optimizations for Conversions Between Strings and Byte Slices

Above has mentioned that the underlying bytes in the conversions between strings and byte slices will be copied. The standard Go compiler makes some optimizations, which are proven to still work in Go Toolchain 1.20, for some special scenarios to avoid the duplicate copies. These scenarios include: Example:
package main

import "fmt"

func main() {
	var str = "world"
	// Here, the []byte(str) conversion will
	// not copy the underlying bytes of str.
	for i, b := range []byte(str) {
		fmt.Println(i, ":", b)
	}

	key := []byte{'k', 'e', 'y'}
	m := map[string]string{}
	// The string(key) conversion copys the bytes in key.
	m[string(key)] = "value"
	// Here, this string(key) conversion doesn't copy
	// the bytes in key. The optimization will be still
	// made, even if key is a package-level variable.
	fmt.Println(m[string(key)]) // value (very possible)
}

Note, the last line might not output value if there are data races in evaluating string(key). However, such data races will never cause panics.

Another example:
package main

import "fmt"
import "testing"

var s string
var x = []byte{1023: 'x'}
var y = []byte{1023: 'y'}

func fc() {
	// None of the below 4 conversions will
	// copy the underlying bytes of x and y.
	// Surely, the underlying bytes of x and y will
	// be still copied in the string concatenation.
	if string(x) != string(y) {
		s = (" " + string(x) + string(y))[1:]
	}
}

func fd() {
	// Only the two conversions in the comparison
	// will not copy the underlying bytes of x and y.
	if string(x) != string(y) {
		// Please note the difference between the
		// following concatenation and the one in fc.
		s = string(x) + string(y)
	}
}

func main() {
	fmt.Println(testing.AllocsPerRun(1, fc)) // 1
	fmt.Println(testing.AllocsPerRun(1, fd)) // 3
}

for-range on Strings

The for-range loop control flow applies to strings. But please note, for-range will iterate the Unicode code points (as rune values), instead of bytes, in a string. Bad UTF-8 encoding representations in the string will be interpreted as rune value 0xFFFD.

Example:
package main

import "fmt"

func main() {
	s := "éक्षिaπ囧"
	for i, rn := range s {
		fmt.Printf("%2v: 0x%x %v \n", i, rn, string(rn))
	}
	fmt.Println(len(s))
}
The output of the above program:
 0: 0x65 e
 1: 0x301 ́
 3: 0x915 क
 6: 0x94d ्
 9: 0x937 ष
12: 0x93f ि
15: 0x61 a
16: 0x3c0 π
18: 0x56e7 囧
21
From the output result, we can find that
  1. the iteration index value may be not continuous. The reason is the index is the byte index in the ranged string and one code point may need more than one byte to represent.
  2. the first character, , is composed of two runes (3 bytes total)
  3. the second character, क्षि, is composed of four runes (12 bytes total).
  4. the English character, a, is composed of one rune (1 byte).
  5. the character, π, is composed of one rune (2 bytes).
  6. the Chinese character, , is composed of one rune (3 bytes).
Then how to iterate bytes in a string? Do this:
package main

import "fmt"

func main() {
	s := "éक्षिaπ囧"
	for i := 0; i < len(s); i++ {
		fmt.Printf("The byte at index %v: 0x%x \n", i, s[i])
	}
}

Surely, we can also make use of the compiler optimization mentioned above to iterate bytes in a string. For the standard Go compiler, this way is a little more efficient than the above one.
package main

import "fmt"

func main() {
	s := "éक्षिaπ囧"
	// Here, the underlying bytes of s are not copied.
	for i, b := range []byte(s) {
		fmt.Printf("The byte at index %v: 0x%x \n", i, b)
	}
}

From the above several examples, we know that len(s) will return the number of bytes in string s. The time complexity of len(s) is O(1). How to get the number of runes in a string? Using a for-range loop to iterate and count all runes is a way, and using the RuneCountInString function in the unicode/utf8 standard package is another way. The efficiencies of the two ways are almost the same. The third way is to use len([]rune(s)) to get the count of runes in string s. Since Go Toolchain 1.11, the standard Go compiler makes an optimization for the third way to avoid an unnecessary deep copy so that it is as efficient as the former two ways. Please note that the time complexities of these ways are all O(n).

More String Concatenation Methods

Besides using the + operator to concatenate strings, we can also use following ways to concatenate strings.

The standard Go compiler makes optimizations for string concatenations by using the + operator. So generally, using + operator to concatenate strings is convenient and efficient if the number of the concatenated strings is known at compile time.

Sugar: Use Strings as Byte Slices

From the article arrays, slices and maps, we have learned that we can use the built-in copy and append functions to copy and append slice elements. In fact, as a special case, if the first argument of a call to either of the two functions is a byte slice, then the second argument can be a string (if the call is an append call, then the string argument must be followed by three dots ...). In other words, a string can be used as a byte slice for the special case.

Example:
package main

import "fmt"

func main() {
	hello := []byte("Hello ")
	world := "world!"

	// The normal way:
	// helloWorld := append(hello, []byte(world)...)
	helloWorld := append(hello, world...) // sugar way
	fmt.Println(string(helloWorld))

	helloWorld2 := make([]byte, len(hello) + len(world))
	copy(helloWorld2, hello)
	// The normal way:
	// copy(helloWorld2[len(hello):], []byte(world))
	copy(helloWorld2[len(hello):], world) // sugar way
	fmt.Println(string(helloWorld2))
}

More About String Comparisons

Above has mentioned that comparing two strings is comparing their underlying bytes actually. Generally, Go compilers will make the following optimizations for string comparisons.

So for two equal strings, the time complexity of comparing them depends on whether or not their underlying byte sequence pointers are equal. If the two are equal, then the time complexity is O(1), otherwise, the time complexity is O(n), where n is the length of the two strings.

As above mentioned, for the standard Go compiler, in a string value assignment, the destination string value and the source string value will share the same underlying byte sequence in memory. So the cost of comparing the two strings becomes very small.

Example:
package main

import (
	"fmt"
	"time"
)

func main() {
	bs := make([]byte, 1<<26)
	s0 := string(bs)
	s1 := string(bs)
	s2 := s1

	// s0, s1 and s2 are three equal strings.
	// The underlying bytes of s0 is a copy of bs.
	// The underlying bytes of s1 is also a copy of bs.
	// The underlying bytes of s0 and s1 are two
	// different copies of bs.
	// s2 shares the same underlying bytes with s1.

	startTime := time.Now()
	_ = s0 == s1
	duration := time.Now().Sub(startTime)
	fmt.Println("duration for (s0 == s1):", duration)

	startTime = time.Now()
	_ = s1 == s2
	duration = time.Now().Sub(startTime)
	fmt.Println("duration for (s1 == s2):", duration)
}
Output:
duration for (s0 == s1): 10.462075ms
duration for (s1 == s2): 136ns

1ms is 1000000ns! So please try to avoid comparing two long strings if they don't share the same underlying byte sequence.


Index↡

The Go 101 프로젝트는 Github 에서 호스팅됩니다. 오타, 문법 오류, 부정확한 표현, 설명 결함, 코드 버그, 끊어진 링크와 같은 모든 종류의 실수에 대한 수정 사항을 제출하여 Go 101을 개선을 돕는 것은 언제나 환영합니다.

주기적으로 Go에 대한 깊이 있는 정보를 얻고 싶다면 Go 101의 공식 트위터 계정인 @go100and1을 팔로우하거나 Go 101 슬랙 채널에j가입해주세요.

이 책의 디지털 버전은 아래와 같은 곳을 통해서 구매할 수 있습니다.
Go 101의 저자인 Tapir는 2016년 7월부터 Go 101 시리즈 책들을 집필하고 go101.org 웹사이트를 유지 관리하고 있습니다. 새로운 콘텐츠는 책과 웹사이트에 수시로 추가될 예정입니다. Tapir는 인디 게임 개발자이기도 합니다. Tapir의 게임을 플레이하여 Go 101을 지원할 수도 있습니다. (안드로이드와 아이폰/아이패드용):
  • Color Infection (★★★★★), 140개 이상의 단계로 이루어진 물리 기반의 캐주얼 퍼즐 게임
  • Rectangle Pushers (★★★★★), 2가지 모드와 104개 이상의 단계로 이루어진 캐주얼 퍼즐 게임
  • Let's Play With Particles, 세가지 미니 게임이 있는 캐주얼 액션 게임
페이팔을 통한 개인 기부도 환영합니다.

색인: