Skipping ahead n codepoints while iterating through a unicode string in Go - go

In Go, iterating over a string using
for i := 0; i < len(myString); i++ {
	doSomething(myString[i])
}
only accesses individual bytes in the string, whereas iterating over a string via
for i, c := range myString {
	doSomething(c)
}
iterates over individual Unicode codepoints (called runes in Go), which may span multiple bytes.
My question is: how does one go about jumping ahead while iterating over a string with range myString? continue can jump ahead by one Unicode codepoint, but it's not possible to just do i += 3, for instance, if you want to jump ahead three codepoints. So what would be the most idiomatic way to advance forward by n codepoints?
I asked this question on the golang nuts mailing list, and it was answered, courtesy of some of the helpful folks on the list. Someone messaged me however suggesting I create a self-answered question on Stack Overflow for this, to save the next person with the same issue some trouble. That's what this is.

I'd consider avoiding the conversion to []rune, and code this directly.
skip := 0
for _, c := range myString {
	if skip > 0 {
		skip--
		continue
	}
	skip = doSomething(c)
}
It looks inefficient to skip runes one by one like this, but it's the same amount of work as the conversion to []rune would be. The advantage of this code is that it avoids allocating the rune slice, which takes 4 bytes per code point and so can be up to 4 times larger than the original string (less if the string contains many multi-byte code points). Of course, converting to []rune is a bit simpler, so you may prefer that.
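For anyone who wants to try the skip-counter loop, here is a minimal runnable sketch. The names visitWithSkips and skipAfterAccent are made up for this example, and the skipping rule is arbitrary; only the loop shape comes from the snippet above.

```go
package main

import "fmt"

// visitWithSkips ranges over s rune by rune. After each visited rune it
// skips as many of the following runes as doSomething returns, and it
// returns the runes that were actually visited.
func visitWithSkips(s string, doSomething func(rune) int) []rune {
	var visited []rune
	skip := 0
	for _, c := range s {
		if skip > 0 {
			skip--
			continue
		}
		visited = append(visited, c)
		skip = doSomething(c)
	}
	return visited
}

// skipAfterAccent is a made-up rule for this example only:
// after an 'é', jump over the next two code points.
func skipAfterAccent(c rune) int {
	if c == 'é' {
		return 2
	}
	return 0
}

func main() {
	fmt.Println(string(visitWithSkips("héllo, 世界", skipAfterAccent))) // héo, 世界
}
```

No rune slice is built for the input itself; the visited slice exists only so the example has something to print.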

It turns out this can be done quite easily by converting the string into a slice of runes.
runes := []rune(myString)
for i := 0; i < len(runes); i++ {
	jumpHowFarAhead := doSomething(runes[i])
	i += jumpHowFarAhead
}
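A runnable version of the []rune approach might look like the sketch below. visitRunes and skipAfterComma are hypothetical names chosen for illustration. Note that the jump is added on top of the loop's own i++, so a return value of 1 skips exactly one code point.

```go
package main

import "fmt"

// visitRunes converts s to a rune slice so that the loop index counts code
// points. After visiting runes[i] it advances by the returned jump, in
// addition to the loop's own i++.
func visitRunes(s string, doSomething func(rune) int) []rune {
	runes := []rune(s) // allocates 4 bytes per code point
	var visited []rune
	for i := 0; i < len(runes); i++ {
		visited = append(visited, runes[i])
		i += doSomething(runes[i])
	}
	return visited
}

// skipAfterComma is a made-up rule for this example only:
// after a comma, skip one code point (here, the following space).
func skipAfterComma(c rune) int {
	if c == ',' {
		return 1
	}
	return 0
}

func main() {
	fmt.Println(string(visitRunes("世界, héllo", skipAfterComma))) // 世界,héllo
}
```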

Related

Which string copying method is the faster in Delphi?

I work in Delphi XE2 and I have to write a complicated function that sometimes copies longer parts of strings and sometimes just single characters. It depends on the content of the source string. So the question is: which of the following two methods is faster?
// Method 1: copy character by character
Len := Length(Str);
SetLength(Result, Len);
for I := 1 to Len do Result[I] := Str[I];
// Method 2: copy the whole block with Move()
Len := Length(Str);
SetLength(Result, Len);
Move(Str[1], Result[1], Len * SizeOf(Char));
I would also be curious how big the difference in running time is.
The Move() alternative - even if Move() was written naively as a byte-by-byte loop (which it is not in the RTL, despite much room for optimization, which we might get soon(tm)) - would be faster, because for every indexed write to a string, the compiler inserts a call to System._UniqueStringU().
To copy a part (if contiguous) of a string into a new string, I would probably use either System.Copy() or System.SetString() instead.
However, if performance matters, my intuition tells me that the copy itself is probably not the part worth optimizing. The bigger win is to reduce string usage and avoid copying parts of strings into new strings. In .NET, that was the reason why they implemented Span<T>, which is basically a length-restricted pointer. When dealing with things like string parsing, using such an approach boosts performance far more than optimizing the copying itself.
Bonus: If you write your loop like this, you omit the _UniqueStringU() call, because the SetLength() before already assured that Result is a string with RefCount = 1:
Len := Length(Str);
SetLength(Result, Len);
for I := 1 to Len do PChar(Pointer(Result))[I-1] := Str[I];
I am using a cast to Pointer first to avoid the _UStrToPWChar() call the compiler inserts when doing a string to PChar cast.

Fastest way to allocate a large string in Go?

I need to create a string in Go that is 1048577 characters (1MB + 1 byte). The content of the string is totally unimportant. Is there a way to allocate this directly without concatenating or using buffers?
Also, it's worth noting that the value of string will not change. It's for a unit test to verify that strings that are too long will return an error.
Use strings.Builder to allocate a string without using extra buffers.
var b strings.Builder
b.Grow(1048577)
for i := 0; i < 1048577; i++ {
	b.WriteByte(0)
}
s := b.String()
The call to the Grow method allocates a slice with capacity 1048577. The WriteByte calls fill the slice to capacity. The String() method uses unsafe to convert that slice to a string.
The cost of the loop can be reduced by writing chunks of N bytes at a time and filling single bytes at the end.
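As a rough sketch of the chunked variant described above: the function name makeFiller, the chunk size of 4096 and the filler byte 'x' are all arbitrary choices for this example, not something the answer specifies.

```go
package main

import (
	"fmt"
	"strings"
)

// makeFiller returns a string of exactly n bytes (n must not be negative).
// It writes a fixed chunk repeatedly and then pads the remainder one byte
// at a time.
func makeFiller(n int) string {
	const chunkSize = 4096 // arbitrary choice for this sketch
	chunk := strings.Repeat("x", chunkSize)

	var b strings.Builder
	b.Grow(n)
	for b.Len()+chunkSize <= n {
		b.WriteString(chunk)
	}
	for b.Len() < n {
		b.WriteByte('x')
	}
	return b.String()
}

func main() {
	s := makeFiller(1048577)
	fmt.Println(len(s)) // 1048577
}
```

If the content really does not matter, strings.Repeat("x", 1048577) produces a string of the same length in a single call.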
If you are not opposed to using the unsafe package, then use this:
p := make([]byte, 1048577)
s := *(*string)(unsafe.Pointer(&p))
If you are asking about how to do this with the simplest code, then use the following:
s := string(make([]byte, 1048577))
This approach does not meet the requirements set forth in the question. It uses an extra buffer instead of allocating the string directly.
I ended up using this:
string(make([]byte, 1048577))
https://play.golang.org/p/afPukPc1Esr

Which optimisations does the Go 1.6 compiler apply when converting between []byte and string or vice versa?

I know that converting from a []byte to a string, or vice versa, results in a copy of the underlying array being made. This makes sense to me, from the point of view of strings being immutable.
Then I read here that two optimisations get made by the compiler in specific cases:
"The first optimization avoids extra allocations when []byte keys are used to lookup entries in map[string] collections: m[string(key)]."
This makes sense because the conversion is only scoped to the square brackets, so no risk of mutating the string there.
"The second optimization avoids extra allocations in for range clauses where strings are converted to []byte: for i,v := range []byte(str) {...}."
This makes sense because once again - no way of mutating the string here.
Further optimisations on the todo list are also mentioned (I'm not sure which todo list is being referred to), so my question is:
Do any other such optimisations exist in Go 1.6 and, if so, what are they?
[]byte to string
For []byte to string conversion, the compiler generates a call to the internal runtime.slicebytetostringtmp function (link to source) when it can prove "that the string form will be discarded before the calling goroutine could possibly modify the original slice or synchronize with another goroutine."
runtime.slicebytetostringtmp returns a string referring to the actual []byte bytes, so it does not allocate. The comment in the function says
// First such case is a m[string(k)] lookup where
// m is a string-keyed map and k is a []byte.
// Second such case is "<"+string(b)+">" concatenation where b is []byte.
// Third such case is string(b)=="foo" comparison where b is []byte.
In short, for a b []byte:
map lookup m[string(b)] does not allocate
"<"+string(b)+">" concatenation does not allocate
string(b)=="foo" comparison does not allocate
The second optimization was implemented on 22 Jan 2015, and it is in go1.6
The third optimization was implemented on 27 Jan 2015, and it is in go1.6
So, for example, in the following:
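package main
// These bytes spell "hallo", so the program prints false; the point is
// only that the comparison itself does not allocate.
var bs []byte = []byte{104, 97, 108, 108, 111}
func main() {
	x := string(bs) == "hello"
	println(x)
}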
the comparison does not cause allocations in go1.6.
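One way to check this on your own toolchain is testing.AllocsPerRun, which can be called from an ordinary program. The sketch below is only an assumption about how one might measure it; the helper names and the 64-byte input are mine. The input is deliberately longer than a few bytes, so that a real conversion would need a heap allocation.

```go
package main

import (
	"bytes"
	"fmt"
	"testing"
)

// 64 bytes: long enough that a real conversion to string would normally
// need a heap allocation rather than a small stack buffer.
var (
	big    = bytes.Repeat([]byte("a"), 64)
	want   = string(big)
	lookup = map[string]int{want: 1}

	// Package-level sinks keep the results observable.
	sinkBool bool
	sinkInt  int
	sinkStr  string
)

// compareAllocs measures string(b) == s, where the string is discarded.
func compareAllocs() float64 {
	return testing.AllocsPerRun(100, func() { sinkBool = string(big) == want })
}

// lookupAllocs measures m[string(b)], where the string is discarded.
func lookupAllocs() float64 {
	return testing.AllocsPerRun(100, func() { sinkInt = lookup[string(big)] })
}

// storeAllocs keeps the converted string alive, so the bytes must be copied.
func storeAllocs() float64 {
	return testing.AllocsPerRun(100, func() { sinkStr = string(big) })
}

func main() {
	fmt.Println("comparison:", compareAllocs())
	fmt.Println("map lookup:", lookupAllocs())
	fmt.Println("stored:    ", storeAllocs())
}
```

With the standard gc compiler I would expect the first two to report 0 and the last to report 1, but the exact numbers depend on the compiler version and its escape analysis.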
String to []byte
Similarly, the runtime.stringtoslicebytetmp function (link to source) says:
// Return a slice referring to the actual string bytes.
// This is only for use by internal compiler optimizations
// that know that the slice won't be mutated.
// The only such case today is:
// for i, c := range []byte(str)
so for i, c := range []byte(str) does not allocate, but you already knew that.

How to ignore fields with sscanf (%* is rejected)

I wish to ignore a particular field whilst processing a string with sscanf.
Man page for sscanf says
An optional '*' assignment-suppression character: scanf() reads input as directed by the conversion specification, but discards the input. No corresponding pointer argument is required, and this specification is not included in the count of successful assignments returned by scanf().
Attempting to use this in Golang, to ignore the 3rd field:
if c, err := fmt.Sscanf(str, " %s %d %*d %d ", &iface.Name, &iface.BTx, &iface.BytesRx); err != nil || c != 3 {
compiles OK, but at runtime err is set to:
bad verb %* for integer
Golang doco doesn't specifically mention the %* conversion specification, but it does say,
Package fmt implements formatted I/O with functions analogous to C's printf and scanf.
It doesn't indicate that %* is not implemented, so... Am I doing it wrong? Or has it just been quietly omitted? ...but then, why does it compile?
To the best of my knowledge there is no such verb (as the format specifiers are called in the fmt package) for this task. What you can do, however, is specify some verb and ignore its value. This is not particularly memory friendly, though. Ideally this would work:
fmt.Scan(&a, _, &b)
Sadly, it doesn't. So your next best option would be to declare the variables and ignore the one you don't want:
var a, b, c int
fmt.Scanf("%d %v %d", &a, &b, &c) // b is read but never used
fmt.Println(a, c)
%v would read a space-separated token. Depending on what you're scanning from, you may be able to fast-forward the stream to the position you need to scan at. See this answer for details on seeking in buffers. If you're reading from standard input or you don't know how long your input may be, you seem to be out of luck here.
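Applied to the Sscanf call from the question, the throwaway-variable workaround could look roughly like this. parseCounters and the sample input line are hypothetical; only fmt.Sscanf itself is real API.

```go
package main

import "fmt"

// parseCounters reads "name tx <ignored> rx" from line. The third field is
// scanned into a throwaway variable because fmt has no %*d equivalent.
func parseCounters(line string) (name string, tx, rx int, err error) {
	var ignored int
	_, err = fmt.Sscanf(line, "%s %d %d %d", &name, &tx, &ignored, &rx)
	return name, tx, rx, err
}

func main() {
	name, tx, rx, err := parseCounters("eth0 1500 42 900")
	fmt.Println(name, tx, rx, err) // eth0 1500 900 <nil>
}
```

The ignored field still has to parse as an integer; if it does not, Sscanf returns an error and the remaining fields are left unset.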
"It doesn't indicate that %* is not implemented, so... Am I doing it wrong? Or has it just been quietly omitted? ...but then, why does it compile?"
It compiles because, for the compiler, a format string is just a string like any other. The content of that string is evaluated at run time by functions of the fmt package. Some C compilers may check format strings for correctness, but this is a feature, not the norm. With Go, the go vet command will try to warn you about format string errors with mismatched arguments.
Edit:
For the special case of needing to parse a row of integers and just caring for some of them, you
can use fmt.Scan in combination with a slice of integers. The following example reads 3 integers
from stdin and stores them in the slice named vals:
ints := make([]interface{}, 3)
vals := make([]int, len(ints))
for i := range ints {
	ints[i] = &vals[i]
}
fmt.Scan(ints...)
fmt.Println(vals)
This is probably shorter than the conventional split/trim/strconv chain. It makes a slice of pointers
which each points to a value in vals. fmt.Scan then fills these pointers. With this you can even
ignore most of the values by assigning the same pointer over and over for the values you don't want:
ignored := 0
for i := range ints {
	if i == 0 || i == 2 {
		ints[i] = &vals[i]
	} else {
		ints[i] = &ignored
	}
}
The example above assigns the address of ignored to every entry except the first and the third, so the unwanted values are effectively ignored by being overwritten.
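Putting the pieces together, a self-contained sketch of this trick might look as follows. It uses fmt.Sscan on a string instead of fmt.Scan on stdin so the result is repeatable; scanKeeping and its parameters are names invented for this example, and the keep indices are assumed to be within range.

```go
package main

import "fmt"

// scanKeeping scans n whitespace-separated integers from input and keeps
// only the positions listed in keep. Every other position is written to a
// shared throwaway variable and stays zero in the result.
func scanKeeping(input string, n int, keep ...int) ([]int, error) {
	vals := make([]int, n)
	var ignored int

	targets := make([]interface{}, n)
	for i := range targets {
		targets[i] = &ignored
	}
	for _, k := range keep {
		targets[k] = &vals[k] // assumes 0 <= k < n
	}

	_, err := fmt.Sscan(input, targets...)
	return vals, err
}

func main() {
	vals, err := scanKeeping("10 20 30 40", 4, 0, 2)
	fmt.Println(vals, err) // [10 0 30 0] <nil>
}
```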

Fast fmt.Scanf() of a large UTF-8 string

I have a string of about 8000000 UTF-8 characters. Scanning it via fmt.Scanf() takes about 10 seconds, how can I do it faster? I have a Go wrapper for C scanf() function that was written by my teacher as a workaround for some bugs in Go's fmt.Scanf(), it works in 1-2 seconds, but I don't like using side packages for such simple tasks. Could you suggest some faster way of reading strings in pure Go?
Found the solution. bufio works much faster (as it's buffered, and fmt's functions are not, and it doesn't parse anything):
reader := bufio.NewReader(os.Stdin)
str, _ := reader.ReadString('\n') // Roughly like fmt.Scanf("%s", &str), but faster; keeps the trailing '\n'
var x, y rune
fmt.Fscanf(reader, "%c %c", &x, &y) // I need to read something else
// (see comments for the question)
// It's easy, as I can use fmt.Fscanf
...even faster than that C scanf() wrapper.
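For completeness, here is a small repeatable sketch of the same pattern. It reads from a strings.Reader rather than os.Stdin, and the function names and sample input are assumptions made for the example. Unlike %s, ReadString returns the whole line including spaces and the delimiter, so the sketch trims the line ending.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// readLineThenPair reads one full line with bufio and then lets fmt.Fscanf
// parse two runes from the same buffered reader.
func readLineThenPair(r io.Reader) (line string, x, y rune, err error) {
	reader := bufio.NewReader(r)
	line, err = reader.ReadString('\n')
	if err != nil {
		return "", 0, 0, err
	}
	line = strings.TrimRight(line, "\r\n") // ReadString keeps the delimiter
	_, err = fmt.Fscanf(reader, "%c %c", &x, &y)
	return line, x, y, err
}

// demo runs readLineThenPair on fixed input; strings.NewReader stands in
// for os.Stdin so that the example is repeatable.
func demo() (string, rune, rune, error) {
	return readLineThenPair(strings.NewReader("héllo wörld\nя ж\n"))
}

func main() {
	line, x, y, err := demo()
	fmt.Printf("%q %c %c %v\n", line, x, y, err) // "héllo wörld" я ж <nil>
}
```

Because bufio.Reader implements io.RuneScanner, Fscanf can read from it without consuming more input than it needs.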

Resources