Fast fmt.Scanf() of a large UTF-8 string - performance

I have a string of about 8,000,000 UTF-8 characters. Scanning it via fmt.Scanf() takes about 10 seconds. How can I do it faster? I have a Go wrapper for the C scanf() function, written by my teacher as a workaround for some bugs in Go's fmt.Scanf(). It runs in 1-2 seconds, but I don't like using side packages for such simple tasks. Could you suggest a faster way of reading strings in pure Go?

Found the solution. bufio is much faster, because it is buffered (fmt's scanning functions are not) and it doesn't parse anything:
reader := bufio.NewReader(os.Stdin)
// Reads the whole line, including the trailing '\n'. Unlike
// fmt.Scanf("%s", &str) it does not stop at the first space, and it is faster.
str, _ := reader.ReadString('\n')
var x, y rune
// I need to read something else as well (see comments on the question).
// That's easy, as fmt.Fscanf can read from the same reader.
fmt.Fscanf(reader, "%c %c", &x, &y)
...even faster than that C scanf() wrapper.

Related

Is it a good idea to use a list of *bufio.Scanner for files to be read later in golang?

I have a list of delimited files to be read after I have obtained their paths. Instead of saving each path as a string, I'm wondering whether I can simply store a list of *bufio.Scanner so that the files will be much easier to read later (and the code will be cleaner too). Here is a quick example:
func main() {
	scannerList := read(filenameList)
	dowork(scannerList)
}
func read(filenameList []string) (scannerList []*bufio.Scanner) {
	for _, filename := range filenameList {
		op, _ := os.Open(filename)
		defer op.Close()
		scanner := bufio.NewScanner(op)
		scannerList = append(scannerList, scanner)
	}
	return
}
func dowork(scannerList []*bufio.Scanner) {
	for _, scanner := range scannerList {
		for scanner.Scan() {
			// read stuff
		}
		// do stuff
	}
}
My code, which is similar to the example above, compiles, but I don't know whether this is recommended (or even works). Any comments? Thanks!
A Scanner is a complicated structure, and one that embeds a buffer. The buffer can grow dynamically (depending on what the scan function requests) up to 64kB (MaxScanTokenSize).
So in general it is not a good idea to keep redundant Scanners around, as the buffers cannot be released until the Scanners are discarded. But perhaps a few extra kilobytes of memory don't matter much in your case.

Golang high cpu usage on simple webserver unable to understand why?

So I have a simple net/http webserver. All it does is deliver 100 MB of random bytes, which I intend to use for network speed testing. My handler for the 100 MB endpoint is really simple (pasted below). The code works fine and I get my random byte file. The problem is that when I run this and someone downloads these 100 megabytes, the CPU usage of this program shoots up to 150% and stays there until the handler finishes running. Am I doing something very wrong here? What could I do to improve this handler's performance?
func downloadHandler(w http.ResponseWriter, r *http.Request) {
	str := RandStringBytes(8192) // generates 8192 bytes of randomness
	sz := 1000 * 1000 * 100      // 100 megabytes
	iter := sz/len(str) + 1
	w.Header().Set("Content-Type", "application/octet-stream")
	w.Header().Set("Content-Length", strconv.Itoa(sz))
	for i := 0; i < iter; i++ {
		fmt.Fprintf(w, str)
	}
}
The problem is that fmt.Fprintf() expects a format string:
func Fprintf(w io.Writer, format string, a ...interface{}) (n int, err error)
And you pass it a big, 8 KB format string. The fmt package has to analyze the format string; it is not something that gets to the output as is. Most definitely this is what is eating your CPU.
If the random string contains the special % sign, that makes your case even worse, as fmt.Fprintf() may then expect further arguments which you don't supply, so the fmt package also has to include error messages in the output, such as:
fmt.Fprintf(os.Stdout, "aaa%bbb%d")
Output:
aaa%!b(MISSING)bb%!d(MISSING)
Use fmt.Fprint() instead which does not expect a format string:
fmt.Fprint(w, str)
Or even better, convert your random string to a byte slice once, and just keep writing that:
data := []byte(str)
for i := 0; i < iter; i++ {
	if _, err := w.Write(data); err != nil {
		// Handle error, e.g. return
	}
}
When delivering a large amount of data, you won't get a much faster solution than writing a prepared byte slice in a loop (maybe slightly faster if you vary the size of the slice). If your solution is still "slow", that might be due to your RandStringBytes() function, which we don't know anything about, or your output might be compressed (gzipped) if you use other handlers or some framework, which does use relatively high CPU. Also, if the client that receives the response is on the same computer (e.g. a browser), it, or a firewall or antivirus software, may analyze the response for malicious code, which may also be resource intensive.

Most efficient way to read Zlib compressed file in Golang?

I'm reading in and at the same time parsing (decoding) a file in a custom format, which is compressed with zlib. My question is how can I efficiently uncompress and then parse the uncompressed content without growing the slice? I would like to parse it whilst reading it into a reusable buffer.
This is for a speed-sensitive application and so I'd like to read it in as efficiently as possible. Normally I would just use ioutil.ReadAll and then loop through the data again to parse it. This time I'd like to parse it as it's read, without having to grow the buffer into which it is read, for maximum efficiency.
Basically I'm thinking that if I can find a buffer of the perfect size then I can read into this, parse it, and then write over the buffer again, then parse that, etc. The issue here is that the zlib reader appears to read an arbitrary number of bytes each time Read(b) is called; it does not fill the slice. Because of this I don't know what the perfect buffer size would be. I'm concerned that it might break some of the data that I wrote into two chunks, making it difficult to parse, because a single value, say a uint64, could be split across two reads and therefore not occur in the same buffer read. Or perhaps that can never happen and it's always read out in chunks of the same size as were originally written?
What is the optimal buffer size, or is there a way to calculate this?
If I have written data into the zlib writer with f.Write(b []byte) is it possible that this same data could be split into two reads when reading back the compressed data (meaning I will have to have a history during parsing), or will it always come back in the same read?
You can wrap your zlib reader in a bufio reader, then implement a specialized reader on top that will rebuild your chunks of data by reading from the bufio reader until a full chunk is read. Be aware that bufio.Reader.Read calls Read at most once on the underlying Reader, so to fill a chunk you need to call ReadByte in a loop (or use io.ReadFull, which keeps calling Read until the slice is full). bufio will however take care of the unpredictable size of data returned by the zlib reader for you.
If you do not want to implement a specialized reader, you can just go with a bufio reader and read as many bytes as needed with ReadByte() to fill a given data type. The optimal buffer size is at least the size of your largest data structure, up to whatever you can shove into memory.
If you read directly from the zlib reader, there is no guarantee that your data won't be split between two reads.
Another, maybe cleaner, solution is to implement a writer for your data, then use io.Copy(your_writer, zlib_reader).
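As a rough illustration of the bufio approach, the sketch below assumes a made-up record format (each record is one little-endian uint64) and reads records back with io.ReadFull into a single reusable 8-byte buffer. The function names compress and decode, the 64 KiB buffer size and the record layout are all assumptions, not part of the original question.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/zlib"
	"encoding/binary"
	"fmt"
	"io"
)

// compress writes each value as 8 little-endian bytes into a zlib stream.
func compress(values []uint64) ([]byte, error) {
	var out bytes.Buffer
	zw := zlib.NewWriter(&out)
	var rec [8]byte
	for _, v := range values {
		binary.LittleEndian.PutUint64(rec[:], v)
		if _, err := zw.Write(rec[:]); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return out.Bytes(), nil
}

// decode reads the stream back record by record and passes each value to
// handle. The same 8-byte buffer is reused for every record.
func decode(compressed []byte, handle func(v uint64)) error {
	zr, err := zlib.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return err
	}
	defer zr.Close()

	br := bufio.NewReaderSize(zr, 64*1024) // buffer size is an arbitrary choice
	var rec [8]byte
	for {
		_, err := io.ReadFull(br, rec[:])
		if err == io.EOF {
			return nil // clean end of stream between two records
		}
		if err != nil {
			return err // io.ErrUnexpectedEOF means a truncated record
		}
		handle(binary.LittleEndian.Uint64(rec[:]))
	}
}

func main() {
	data, err := compress([]uint64{1, 2, 3, 1 << 40})
	if err != nil {
		fmt.Println("compress failed:", err)
		return
	}
	var sum uint64
	if err := decode(data, func(v uint64) { sum += v }); err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println("sum:", sum)
}
```

Because io.ReadFull loops until the slice is full, it does not matter how the zlib reader happens to split the data between individual Read calls.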
OK, so I figured this out in the end using my own implementation of a reader.
Basically the struct looks like this:
type reader struct {
	at  int           // offset of the first unread byte in buf
	n   int           // number of unread bytes in buf
	f   io.ReadCloser // the zlib reader
	buf []byte
}
This can be attached to the zlib reader:
// Open file for reading
fi, err := os.Open(filename)
if err != nil {
	return nil, err
}
defer fi.Close()
// Attach zlib reader
r := new(reader)
r.buf = make([]byte, 2048)
r.f, err = zlib.NewReader(fi)
if err != nil {
	return nil, err
}
defer r.f.Close()
Then x number of bytes can be read straight out of the zlib reader using a function like this:
mydata := r.readx(10)
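func (r *reader) readx(x int) []byte {
	if x > len(r.buf) {
		panic("readx: request is larger than the buffer")
	}
	for r.n < x {
		// Move the unread bytes to the front to make room for more data.
		copy(r.buf, r.buf[r.at:r.at+r.n])
		r.at = 0
		m, err := r.f.Read(r.buf[r.n:])
		r.n += m // Read may return data together with an error
		if err != nil && r.n < x {
			panic(err)
		}
	}
	tmp := make([]byte, x)
	copy(tmp, r.buf[r.at:r.at+x]) // must be copied: r.buf is overwritten by later reads
	r.at += x
	r.n -= x
	return tmp
}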
Note that I have no need to check for EOF because my parser should stop itself at the right place.

Skipping ahead n codepoints while iterating through a unicode string in Go

In Go, iterating over a string using
for i := 0; i < len(myString); i++ {
	doSomething(myString[i])
}
only accesses individual bytes in the string, whereas iterating over a string via
for _, c := range myString {
	doSomething(c)
}
iterates over individual Unicode code points (called runes in Go), which may span multiple bytes.
My question is: how does one go about jumping ahead while iterating over a string with range myString? continue can jump ahead by one Unicode code point, but it's not possible to just do i += 3, for instance, if you want to jump ahead three code points. So what would be the most idiomatic way to advance forward by n code points?
I asked this question on the golang nuts mailing list, and it was answered, courtesy of some of the helpful folks on the list. Someone messaged me however suggesting I create a self-answered question on Stack Overflow for this, to save the next person with the same issue some trouble. That's what this is.
I'd consider avoiding the conversion to []rune, and code this directly.
skip := 0
for _, c := range myString {
	if skip > 0 {
		skip--
		continue
	}
	skip = doSomething(c)
}
It looks inefficient to skip runes one by one like this, but it's the same amount of work as the conversion to []rune would be. The advantage of this code is that it avoids allocating the rune slice, which will be up to 4 times larger than the original string (exactly 4 times for pure ASCII, less if the string contains many multi-byte code points). Of course converting to []rune is a bit simpler, so you may prefer that.
It turns out this can be done quite easily by converting the string into a slice of runes.
runes := []rune(myString)
for i := 0; i < len(runes); i++ {
	jumpHowFarAhead := doSomething(runes[i])
	i += jumpHowFarAhead
}
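A third variant, sketched here as an assumption and not taken from either answer, decodes the string by hand with utf8.DecodeRuneInString. It also avoids the []rune allocation and keeps the skipping logic in one place. The names visit and step are hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// visit calls step for a rune and then skips as many of the following
// code points as step returns, without converting s to []rune.
func visit(s string, step func(r rune) int) {
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		i += size
		// Advance past the requested number of code points, one at a time,
		// because each may occupy a different number of bytes.
		for skip := step(r); skip > 0 && i < len(s); skip-- {
			_, size = utf8.DecodeRuneInString(s[i:])
			i += size
		}
	}
}

func main() {
	visit("aébçdé", func(r rune) int {
		fmt.Printf("%c ", r)
		return 1 // skip one code point after each visited rune
	})
	fmt.Println()
}
```

The amount of decoding work is the same as in the range loop with a skip counter; the difference is mostly a matter of taste.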

How to be definite about the number of whitespace fmt.Fscanf consumes?

I am trying to implement a PPM decoder in Go. PPM is an image format that consists of a plaintext header and then some binary image data. The header looks like this (from the spec):
Each PPM image consists of the following:
A "magic number" for identifying the file type. A ppm image's magic number is the two characters "P6".
Whitespace (blanks, TABs, CRs, LFs).
A width, formatted as ASCII characters in decimal.
Whitespace.
A height, again in ASCII decimal.
Whitespace.
The maximum color value (Maxval), again in ASCII decimal. Must be less than 65536 and more than zero.
A single whitespace character (usually a newline).
I try to decode this header with the fmt.Fscanf function. The following call to fmt.Fscanf parses the header (not addressing the caveat explained below):
var magic string
var width, height, maxVal uint
fmt.Fscanf(input,"%2s %d %d %d",&magic,&width,&height,&maxVal)
The documentation of fmt states:
Note: Fscan etc. can read one character (rune) past the input they
return, which means that a loop calling a scan routine may skip some
of the input. This is usually a problem only when there is no space
between input values. If the reader provided to Fscan implements
ReadRune, that method will be used to read characters. If the reader
also implements UnreadRune, that method will be used to save the
character and successive calls will not lose data. To attach ReadRune
and UnreadRune methods to a reader without that capability, use
bufio.NewReader.
As the very next character after the final whitespace is already the beginning of the image data, I have to be certain about how much whitespace fmt.Fscanf consumed after reading MaxVal. My code must work on whatever reader was provided by the caller, and parts of it must not read past the end of the header. Therefore wrapping the input in a buffered reader is not an option; the buffered reader might read more from the input than I actually want to read.
Some testing suggests that parsing a dummy character at the end solves the issues:
var magic string
var width, height, maxVal uint
var dummy byte
fmt.Fscanf(input,"%2s %d %d %d%c",&magic,&width,&height,&maxVal,&dummy)
Is that guaranteed to work according to the specification?
No, I would not consider that safe. While it works now, the documentation states that the function reserves the right to read past the value by one character unless you have an UnreadRune() method.
By wrapping your reader in a bufio.Reader, you can ensure the reader has an UnreadRune() method. You will then need to read the final whitespace yourself.
buf := bufio.NewReader(input)
fmt.Fscanf(buf,"%2s %d %d %d",&magic,&width,&height,&maxVal)
buf.ReadRune() // remove next rune (the whitespace) from the buffer.
Edit:
As we discussed in the chat, you can assume the dummy char method works and then write a test so you know when it stops working. The test can be something like:
func TestFmtBehavior(t *testing.T) {
	// use multireader to prevent r from implementing io.RuneScanner
	// (two trailing spaces: one for %c to consume, one that must be left over)
	r := io.MultiReader(bytes.NewReader([]byte("data  ")))
	n, err := fmt.Fscanf(r, "%s%c", new(string), new(byte))
	if n != 2 || err != nil {
		t.Error("failed scan", n, err)
	}
	// the dummy char read 1 extra char past "data".
	// one byte should still remain
	if n, err := r.Read(make([]byte, 5)); n != 1 {
		t.Error("assertion failed", n, err)
	}
}

Resources