How to be definite about the amount of whitespace fmt.Fscanf consumes?

I am trying to implement a PPM decoder in Go. PPM is an image format that consists of a plaintext header and then some binary image data. The header looks like this (from the spec):
Each PPM image consists of the following:
A "magic number" for identifying the file type. A ppm image's magic number is the two characters "P6".
Whitespace (blanks, TABs, CRs, LFs).
A width, formatted as ASCII characters in decimal.
Whitespace.
A height, again in ASCII decimal.
Whitespace.
The maximum color value (Maxval), again in ASCII decimal. Must be less than 65536 and more than zero.
A single whitespace character (usually a newline).
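For example, a header for a 640x480 image with a Maxval of 255 (values chosen here purely for illustration) could look like this, with the binary pixel data starting immediately after the final newline:
P6
640 480
255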
I try to decode this header with the fmt.Fscanf function. The following call to
fmt.Fscanf parses the header (not addressing the caveat explained below):
var magic string
var width, height, maxVal uint
fmt.Fscanf(input,"%2s %d %d %d",&magic,&width,&height,&maxVal)
The documentation of fmt states:
Note: Fscan etc. can read one character (rune) past the input they
return, which means that a loop calling a scan routine may skip some
of the input. This is usually a problem only when there is no space
between input values. If the reader provided to Fscan implements
ReadRune, that method will be used to read characters. If the reader
also implements UnreadRune, that method will be used to save the
character and successive calls will not lose data. To attach ReadRune
and UnreadRune methods to a reader without that capability, use
bufio.NewReader.
As the very next character after the final whitespace is already the beginning of the image data, I have to be certain about how much whitespace fmt.Fscanf consumed after reading Maxval. My code must work with whatever reader the caller provides, and parts of it must not read past the end of the header; wrapping the input in a buffered reader is therefore not an option, because the buffered reader might read more from the input than I actually want.
Some testing suggests that parsing a dummy character at the end solves the issue:
var magic string
var width, height, maxVal uint
var dummy byte
fmt.Fscanf(input,"%2s %d %d %d%c",&magic,&width,&height,&maxVal,&dummy)
Is that guaranteed to work according to the specification?

No, I would not consider that safe. While it works now, the documentation states that the function reserves the right to read past the value by one character unless you have an UnreadRune() method.
By wrapping your reader in a bufio.Reader, you can ensure the reader has an UnreadRune() method. You will then need to read the final whitespace yourself.
buf := bufio.NewReader(input)
fmt.Fscanf(buf, "%2s %d %d %d", &magic, &width, &height, &maxVal)
buf.ReadRune() // remove the next rune (the whitespace) from the buffer
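A fuller sketch (the error handling and the pix variable are mine, not from the question). Note that once input is wrapped, everything else, including the pixel data, must be read from buf as well, because the bufio.Reader may have buffered bytes beyond the header:
buf := bufio.NewReader(input)
if _, err := fmt.Fscanf(buf, "%2s %d %d %d", &magic, &width, &height, &maxVal); err != nil {
    return err
}
if _, _, err := buf.ReadRune(); err != nil { // the single whitespace after Maxval
    return err
}
// Read the pixel data from buf, not from input: buf may already
// hold bytes it read ahead of the header.
pix := make([]byte, 3*int(width)*int(height))
if _, err := io.ReadFull(buf, pix); err != nil {
    return err
}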
Edit:
As we discussed in the chat, you can assume the dummy char method works and then write a test so you know when it stops working. The test can be something like:
package main

import (
    "bytes"
    "fmt"
    "io"
    "testing"
)

func TestFmtBehavior(t *testing.T) {
    // Use MultiReader to prevent r from implementing io.RuneScanner.
    // Note the two trailing spaces in "data  ": "%s%c" consumes "data"
    // plus one space, so exactly one byte should remain in the reader.
    r := io.MultiReader(bytes.NewReader([]byte("data  ")))
    n, err := fmt.Fscanf(r, "%s%c", new(string), new(byte))
    if n != 2 || err != nil {
        t.Error("failed scan", n, err)
    }
    // The dummy %c read one char past "data"; one byte should remain.
    if n, err := r.Read(make([]byte, 5)); n != 1 {
        t.Error("assertion failed", n, err)
    }
}

Related

Is There a Scanner Function in Go That Separates on Length (or is Newline Agnostic)?

I have two types of files in Go, which can be represented by the following strings:
const nonewline = "hello"       // content but no newline
const newline = "hello\nworld"  // content with newline
My goal is just to read all the content from both files (it's coming in via a stream, so I cannot use something built-in like ReadAll; I'm using stdioPipe) and include newlines where they appear.
I'm using Scanner, but it appears there's no way to tell whether a line terminates with a newline: Scanner.Text() auto-splits, making it impossible to tell whether a line ends in a newline or simply terminated at the end of the file.
I've also looked at writing a custom Split function, but isn't that overkill? I just need to split on some fixed length (I assume the default buffer size, 4096) or on whatever is left in the file, whichever is shorter.
I've also looked at Scanner.Split(bufio.ScanBytes), but is there a speed-up from chunking the read?
Anyhow, this seems like something that should be really straightforward.
Use this loop to read a stream in fixed size chunks:
chunk := make([]byte, size) // size is the chunk size
for {
    n, err := io.ReadFull(stream, chunk)
    if n > 0 {
        // Do something with the chunk of data.
        process(chunk[:n])
    }
    if err != nil {
        break
    }
}
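If the goal is simply to capture everything, newlines included, the chunks can be accumulated in a buffer. A sketch (readAll is a made-up name; assumes the bytes and io imports). io.ReadFull reports a short final chunk as io.ErrUnexpectedEOF, which the loop treats as a clean end of stream:
// readAll drains stream in fixed-size chunks; newlines survive because
// the bytes are copied verbatim, never split or trimmed.
func readAll(stream io.Reader, size int) ([]byte, error) {
    var out bytes.Buffer
    chunk := make([]byte, size)
    for {
        n, err := io.ReadFull(stream, chunk)
        out.Write(chunk[:n])
        if err == io.EOF || err == io.ErrUnexpectedEOF {
            return out.Bytes(), nil // clean end of stream
        }
        if err != nil {
            return out.Bytes(), err
        }
    }
}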

Is there a more efficient way to handle string escaping in this function?

I'm migrating some existing code from another language. In the following function it's more or less a 1-1 migration, but given the newness of the language to me I'd like to know if there's better / more efficient ways to handle how the escaped string gets built:
func influxEscape(str string) string {
    var chars = map[string]bool{
        "\\": true,
        "\"": true,
        ",":  true,
        "=":  true,
        " ":  true,
    }
    var escapeStr = ""
    for i := 0; i < len(str); i++ {
        var char = string(str[i])
        if chars[char] == true {
            escapeStr += "\\" + char
        } else {
            escapeStr += char
        }
    }
    return escapeStr
}
This code performs escaping to make string values compatible with the InfluxDB line protocol.
This should be a comment, but it needs too much room for that.
One more thing to consider—which I mentioned in a comment on Burak Serdar's answer—is what happens when your input string is not valid UTF-8.
Remember that a Go string is a byte sequence. It need not be valid Unicode. It may be intended to represent valid Unicode, or it may not. For instance, it could be ISO-Latin-1 or something else that might not play well with UTF-8.
If it is not valid UTF-8, ranging over it with a for ... range loop will translate each invalid byte sequence to the Unicode replacement character (utf8.RuneError). (See the Go blog's post on strings, bytes, and runes.) If the string is intended to be valid UTF-8, this may be a plus, and of course you can check for the resulting RuneError.
Your original loop leaves characters above ASCII DEL (127 or 0x7f) alone. If the bytes in the string are something like ISO-Latin-1, this may be the correct behavior. If not, you may be passing invalid, un-sanitized input to this other program. If you are deliberately sanitizing input, you must find out what kind of input it expects, and do a complete job of sanitizing input.
(I still have scars from being forced to cope with a really poor XML encoder coupled to an old database from some number of jobs ago, so I tend to be extra-cautious here.)
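If you want to detect such input up front, a cheap check is utf8.ValidString (a sketch, assuming the unicode/utf8 import; how to react to invalid input is your call):
if !utf8.ValidString(str) {
    // Input is not valid UTF-8: reject it, sanitize it, or pass it
    // through unchanged, depending on what the consumer expects.
}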
This should be somewhat equivalent to your code:
out := bytes.Buffer{}
for _, x := range str {
    if strings.IndexRune(`\",= `, x) != -1 {
        out.WriteRune('\\')
    }
    out.WriteRune(x)
}
return out.String()

Read random lines off a text file in go

I am using encoding/csv to read and parse a very large .csv file.
I need to randomly select lines and pass them through some test.
My current solution is to read the whole file like
reader := csv.NewReader(file)
lines, err := reader.ReadAll()
then randomly select lines from lines
The obvious problem is it takes a long time to read the whole thing and I need lots of memory.
Question:
My question is: encoding/csv gives me an io/reader; is there a way to use it to read random lines instead of loading the whole thing at once?
This is more of a curiosity to learn about io/reader than a practical question, since it is very likely more efficient in the end to read the file once and access it in memory than to keep seeking random lines on disk.
Apokalyptik's answer is the closest to what you want. Readers are streamers, so you can't just hop to a random place per se.
Naively choosing a probability against which you keep any given line as you read it in can lead to problems: you may get to the end of the file without holding enough lines of input, or you may be too quick to hold lines and not get a good sample. Either is much more likely than guessing correctly, since you don't know beforehand how many lines are in the file (unless you first iterate it once to count them).
What you really need is reservoir sampling.
Basically, read the file line by line. For each line, you choose whether to hold it like so: the first line you read, you have a 1/1 chance of holding. When you read the second line, you replace the held line with probability 1/2. After the third line, you replace with probability 1/3, so the chance you are still holding any particular earlier line is 1/2 * 2/3 = 1/3. Thus you have a 1/N chance of holding onto any given line, where N is the number of lines you've read in. Here's a more detailed look at the algorithm (don't try to implement it just from what I've told you in this paragraph alone).
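That said, a minimal single-line reservoir sampler in Go might look like the sketch below (illustrative only; randomLine is my name, and it assumes the bufio, io, and math/rand imports):
// randomLine returns one uniformly random line from r using
// single-item reservoir sampling (each line survives with probability 1/N).
func randomLine(r io.Reader, rng *rand.Rand) (string, error) {
    sc := bufio.NewScanner(r)
    var chosen string
    n := 0
    for sc.Scan() {
        n++
        if rng.Intn(n) == 0 { // true with probability 1/n
            chosen = sc.Text()
        }
    }
    return chosen, sc.Err()
}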
The simplest solution would be to make a decision as you read each line whether to test it or throw it away... make your decision random so that you don't have the requirement of keeping the entire thing in RAM... then pass through the file once running your tests... you can also do this same style with non-random distribution tests (e.g. after X bytes, or x lines, etc)
My suggestion would be to randomize the input file in advance, e.g. using shuf
http://en.wikipedia.org/wiki/Shuf
Then you can simply read the first n lines as needed.
This doesn't help you learning more about io/readers, but might solve your problem nevertheless.
I had a similar need: to randomly read (specific) lines from a massive text file. I wrote a package that I call ramcsv to do this.
It first reads through the entire file once and marks the byte offset of each line (it stores this information in memory, but does not store the full line).
When you request a line number, it will transparently seek to the correct offset and give you the csv-parsed line.
(Note that the csv.Reader parameter that is passed as the second argument to ramcsv.New is used only to copy the settings into a new reader.) This could no doubt be made more efficient, but it was sufficient for my needs and spared me from reading a ~20GB text file into memory.
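The offset-index idea itself is compact. A sketch (buildIndex is a made-up name; ramcsv's real API may differ, and this indexes physical lines, so quoted CSV fields containing newlines need more care):
// buildIndex records the byte offset at which each line begins, so that
// a later Seek(offsets[i], io.SeekStart) can jump straight to line i.
func buildIndex(r io.Reader) ([]int64, error) {
    br := bufio.NewReader(r)
    var offsets []int64
    var pos int64
    for {
        line, err := br.ReadString('\n')
        if len(line) > 0 {
            offsets = append(offsets, pos)
            pos += int64(len(line))
        }
        if err == io.EOF {
            return offsets, nil
        }
        if err != nil {
            return offsets, err
        }
    }
}
After seeking to offsets[i], handing the file to csv.NewReader and calling Read() once yields the parsed record for that line.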
encoding/csv does not give you an io.Reader; it gives you a csv.Reader. (Note the lack of a package qualifier on the Reader returned in the definition of csv.NewReader [1], indicating that it belongs to the same package.)
A csv.Reader implements only the methods you see there, so it looks like there is no way to do what you want short of writing your own CSV parser.
[1] http://golang.org/pkg/encoding/csv/#NewReader
Per this SO answer, there's a relatively memory efficient way to read a single random line from a large file.
package main

import (
    "bufio"
    "bytes"
    "fmt"
    "io"
    "math/rand"
    "strconv"
    "time"
)

var words []byte

func main() {
    prepareWordsVar()
    var r = rand.New(rand.NewSource(time.Now().Unix()))
    var line string
    for len(line) == 0 {
        line = getRandomLine(r)
    }
    fmt.Println(line)
}

func prepareWordsVar() {
    base := []string{"some", "really", "file", "with", "many", "manyy", "manyyy", "manyyyy", "manyyyyy", "lines."}
    // Zero length with a rough capacity hint; make([]byte, n) would
    // prepend n NUL bytes ahead of the appended data.
    words = make([]byte, 0, 200*len(base)*10)
    for i := 0; i < 200; i++ {
        for _, s := range base {
            words = append(words, []byte(s+strconv.Itoa(i)+"\n")...)
        }
    }
}

func getRandomLine(r *rand.Rand) string {
    wordsLen := int64(len(words))
    offset := r.Int63n(wordsLen)
    rd := bytes.NewReader(words)
    // Seek before handing the reader to the scanner.
    _, _ = rd.Seek(offset, io.SeekStart)
    scanner := bufio.NewScanner(rd)
    // Discard the first result: it is bound to be a partial line.
    if !scanner.Scan() {
        return ""
    }
    scanner.Scan()
    if err := scanner.Err(); err != nil {
        fmt.Printf("err: %s\n", err)
        return ""
    }
    // Now we have a random line.
    return scanner.Text()
}
Go Playground
Couple of caveats:
You should use crypto/rand if you need it to be cryptographically secure.
Note the bufio.Scanner's default MaxScanTokenSize, and adjust code accordingly.
As per original SO answer, this does introduce bias based on the length of the line.
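On the second caveat, the token limit can be raised with the Scanner.Buffer method before scanning starts; for example, to allow lines up to 1 MiB:
scanner := bufio.NewScanner(rd)
scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024) // 64K initial buffer, 1 MiB max token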

How to ignore fields with sscanf (%* is rejected)

I wish to ignore a particular field whilst processing a string with sscanf.
The man page for sscanf says:
An optional '*' assignment-suppression character: scanf() reads input as directed by the conversion specification, but discards the input. No corresponding pointer argument is required, and this specification is not included in the count of successful assignments returned by scanf().
Attempting to use this in Golang, to ignore the 3rd field:
if c, err := fmt.Sscanf(str, " %s %d %*d %d ", &iface.Name, &iface.BTx, &iface.BytesRx); err != nil || c != 3 {
compiles OK, but at runtime err is set to:
bad verb %* for integer
Golang doco doesn't specifically mention the %* conversion specification, but it does say,
Package fmt implements formatted I/O with functions analogous to C's printf and scanf.
It doesn't indicate that %* is not implemented, so... Am I doing it wrong? Or has it just been quietly omitted? ...but then, why does it compile?
To the best of my knowledge, there is no such verb (as the format specifiers are called in the fmt package) for this task. What you can do, however, is specify some verb and ignore its value. This is not particularly memory-friendly, though. Ideally this would work:
fmt.Scan(&a, _, &b)
Sadly, it doesn't. So your next best option would be to declare the variables and ignore the one
you don't want:
var a, b, c int
fmt.Scanf("%d %v %d", &a, &b, &c)
fmt.Println(a, c)
%v would read a space-separated token. Depending on what you're scanning from, you may be able to fast-forward the stream to the position you need to scan from. See this answer for details on seeking in buffers. If you're using stdin, or you don't know what length your input may have, you seem to be out of luck here.
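Applied to the question's format string, the throwaway-variable version would be something like this sketch (note that the count now includes the ignored field, so it is 4, not 3):
var ignored int
if c, err := fmt.Sscanf(str, " %s %d %d %d ", &iface.Name, &iface.BTx, &ignored, &iface.BytesRx); err != nil || c != 4 {
    // handle the error
}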
It doesn't indicate that %* is not implemented, so... Am I doing it
wrong? Or has it just been quietly omitted? ...but then, why does it
compile?
It compiles because for the compiler a format string is just a string like any other. The content of that string is evaluated at run time by functions of the fmt package. Some C compilers may check format strings
for correctness, but this is a feature, not the norm. With Go, the go vet command will try to warn you about format string errors with mismatched arguments.
Edit:
For the special case of needing to parse a row of integers while caring only about some of them, you can use fmt.Scan in combination with a slice of pointers. The following example reads 3 integers from stdin and stores them in the slice named vals:
ints := make([]interface{}, 3)
vals := make([]int, len(ints))
for i := range ints {
    ints[i] = &vals[i]
}
fmt.Scan(ints...)
fmt.Println(vals)
This is probably shorter than the conventional split/trim/strconv chain. It makes a slice of pointers, each of which points to a value in vals. fmt.Scan then fills these pointers. With this you can even ignore most of the values by assigning the same pointer over and over for the values you don't want:
ignored := 0
for i := range ints {
    if i == 0 || i == 2 {
        ints[i] = &vals[i]
    } else {
        ints[i] = &ignored
    }
}
The example above assigns the address of ignored to every entry except the first and the third, effectively ignoring those values by overwriting the same variable.

Removing NUL characters from bytes

To teach myself Go I'm building a simple server that takes some input, does some processing, and sends output back to the client (that includes the original input).
The input can vary in length from around 5 to 13 characters, plus endlines and whatever other guff the client sends.
The input is read into a byte array and then converted to a string for some processing. Another string is appended to this string and the whole thing is converted back into a byte array to get sent back to the client.
The problem is that the input is padded with a bunch of NUL characters, and I'm not sure how to get rid of them.
So I could loop through the array and when I come to a nul character, note the length (n), create a new byte array of that length, and copy the first n characters over to the new byte array and use that. Is that the best way, or is there something to make this easier for me?
Some stripped down code:
data := make([]byte, 16)
c.Read(data)
s := strings.Replace(string(data), "an", "", -1)
s = strings.Replace(s, "\r", "", -1)
s += "some other string"
response := []byte(s)
c.Write(response)
c.Close()
Also if I'm doing anything else obviously stupid here it would be nice to know.
In package "bytes", func Trim(s []byte, cutset string) []byte is your friend:
Trim returns a subslice of s by slicing off all leading and trailing UTF-8-encoded Unicode code points contained in cutset.
// Remove any NULL characters from 'b'
b = bytes.Trim(b, "\x00")
Your approach sounds basically right. Some remarks:
When you have found the index of the first nul byte in data, you don't need to copy, just truncate the slice: data[:idx].
bytes.Index should be able to find that index for you.
There is also bytes.Replace so you don't need to convert to string.
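Putting those remarks together, a sketch using the "an" substring from your example (bytes.IndexByte is the byte-oriented variant of bytes.Index):
// Truncate at the first NUL instead of copying byte-by-byte.
if i := bytes.IndexByte(data, 0); i != -1 {
    data = data[:i]
}
// Replace directly on the byte slice; no string conversion needed.
data = bytes.Replace(data, []byte("an"), nil, -1)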
The io.Reader documentation says:
Read reads up to len(p) bytes into p. It returns the number of bytes read (0 <= n <= len(p)) and any error encountered.
If the call to Read in the application does not read 16 bytes, then data will have trailing zero bytes. Use the number of bytes read to trim the zero bytes from the buffer.
data := make([]byte, 16)
n, err := c.Read(data)
if err != nil {
// handle error
}
data = data[:n]
There's another issue. There's no guarantee that Read slurps up all of the "message" sent by the peer. The application may need to call Read more than once to get the complete message.
You mention endlines in the question. If the message from the client is terminated by a newline, then use bufio.Scanner to read lines from the connection:
s := bufio.NewScanner(c)
if s.Scan() {
    data = s.Bytes() // data is the next line, not including the end line, etc.
}
if s.Err() != nil {
    // handle error
}
You could utilize the return value of Read:
package main

import "strings"

func main() {
    r, b := strings.NewReader("north east south west"), make([]byte, 16)
    n, e := r.Read(b)
    if e != nil {
        panic(e)
    }
    b = b[:n]
    println(string(b) == "north east south")
}
https://golang.org/pkg/io#Reader
