io.Reader and Line Break issue involving a CSV file - go

I have an application which deals with CSVs being delivered via RabbitMQ from many different upstream applications - typically 5000-15,000 rows per file. Most of the time it works great. However, a couple of these upstream applications are old (12-15 years) and the people who wrote them are long gone.
I'm unable to read CSV files from these older applications due to the line breaks. I'm finding this a bit weird as the line breaks seem to map to UTF-8 Carriage Returns (http://www.fileformat.info/info/unicode/char/000d/index.htm). Typically the app reads in only the headers from those older files and nothing else.
If I open one of these files in a text editor and save it with UTF-8 encoding, overwriting the existing file, then it works with no issues at all.
Things I've tried I expected to work:
-Using a Reader:
ba := make([]byte, 262144000)
if _, err := file.Read(ba); err != nil {
return nil, err
}
ba = bytes.Trim(ba, "\x00")
bb := bytes.NewBuffer(ba)
reader := csv.NewReader(bb)
records, err := reader.ReadAll()
if err != nil {
return nil, err
}
-Using the Scanner to read line by line (get a bufio.Scanner: token too long)
scanner := bufio.NewScanner(file)
var bb bytes.Buffer
for scanner.Scan() {
bb.WriteString(fmt.Sprintf("%s\n", scanner.Text()))
}
// check for errors
if err = scanner.Err(); err != nil {
return nil, err
}
reader := csv.NewReader(&bb)
records, err := reader.ReadAll()
if err != nil {
return nil, err
}
Things I tried I expected not to work (and didn't):
Writing file contents to a new file (.txt) and reading the file back in (including running dos2unix against the created txt file)
Reading file into a standard string (hoping Go's UTF-8 encoding would magically kick in which of course it doesn't)
Reading file to Rune slice, then transforming to a string via byte slice
I'm aware of the https://godoc.org/golang.org/x/text/transform package but not too sure of a viable approach - it looks like the src encoding needs to be known to transform.
Am I stupidly overlooking something? Are there any suggestions how to transform these files into UTF-8 or update the line endings without knowing the file encoding whilst keeping the application working for all the other valid CSV files being delivered? Are there any options that don't involve me going byte to byte and doing a bytes.Replace I've not considered?
I'm hoping there's something really obvious I've overlooked.
Apologies - I can't share the CSV files for obvious reasons.

For anyone who's stumbled on this and wants an answer that doesn't involve strings.Replace, here's a method that wraps an io.Reader to replace solo carriage returns. It could probably be more efficient, but works better with huge files than a strings.Replace-based solution.
https://gist.github.com/b5/78edaae9e6a4248ea06b45d089c277d6
// ReplaceSoloCarriageReturns wraps an io.Reader. On every call of Read it looks
// for instances of lonely \r, replacing them with \r\n before returning to the end customer.
// Lots of files in the wild will come without "proper" line breaks, which irritates Go's
// standard csv package. This'll fix by wrapping the reader passed to csv.NewReader:
// rdr := csv.NewReader(ReplaceSoloCarriageReturns(r))
//
func ReplaceSoloCarriageReturns(data io.Reader) io.Reader {
return crlfReplaceReader{
rdr: bufio.NewReader(data),
}
}
// crlfReplaceReader wraps a reader
type crlfReplaceReader struct {
rdr *bufio.Reader
}
// Read implements io.Reader for crlfReplaceReader
func (c crlfReplaceReader) Read(p []byte) (n int, err error) {
	for n < len(p) {
		p[n], err = c.rdr.ReadByte()
		if err != nil {
			return
		}
		// any time we encounter \r, check to see if \n follows
		if p[n] == '\r' {
			if pk, perr := c.rdr.Peek(1); (perr == nil && pk[0] != '\n') || perr == io.EOF {
				if n+1 < len(p) {
					// lone \r and room left: add the \n manually
					n++
					p[n] = '\n'
				} else {
					// lone \r in the last slot of p: writing p[n+1] would
					// panic, so emit a bare \n, which encoding/csv also accepts
					p[n] = '\n'
				}
			}
		}
		n++
	}
	return
}

Have you tried to replace all line endings from \r\n or \r to \n ?

Related

Multi line buffered read in go

I am trying to read a file in a buffered manner because I have very large files. I want to apply some text replacement to a file. Suppose that for each read I search for the word 'foo' and replace it with some other word, 'bar'. If I read using a buffer of, say, 5 MB, then 'foo' may be split across two reads: one read ends with 'fo' and the next starts with 'o', so I will not be able to find that word. Is there a way to do a buffered read up to the last newline, or to read multiple whole lines into the buffer?
I did the following, but it does not stop at a line boundary:
file, err := os.Open(filename)
if err != nil {
panic(err)
}
defer file.Close()
const bufferSize = 5 * 1024 * 1024 // read 5 MB
byteSlice := make([]byte, bufferSize)
bufioreader := bufio.NewReaderSize(file, bufferSize)
for {
n, err := bufioreader.Read(byteSlice)
if n > 0 {
fmt.Println(byteSlice[:n])
} else if err == io.EOF {
break
} else {
panic(err)
}
}
Since you're using the bufio reader, you shouldn't really work on aligning the input with buffer boundaries yourself. Use one of the high-level read functions, such as bufioreader.ReadString('\n'), which will read a line using the underlying buffer, so you won't have to deal with buffer boundaries yourself.
You don’t need bufio reader if you have your own buffer. With your code you have a useless copy of data from the buffer in bufio to the byteslice.
Regarding the split "foo" problem, the solution is to move the last 2 characters of the buffer to the front before the next read.
More precisely, if the word to replace has length m, then copy the last m-1 bytes of the buffer to the front of the buffer, fill the remainder of the buffer, and search for the word to replace in the buffer.
// assume we want to find word
file, err := os.Open(filename)
if err != nil {
	panic(err)
}
defer file.Close()
trailingLen := len(word) - 1
data := make([]byte, trailingLen+5*1024*1024) // carry-over + 5 MB per read
carried := 0                                  // bytes kept from the previous read
for {
	n, err := file.Read(data[carried:])
	valid := carried + n
	// search and replace word in data[:valid]
	// keep the last trailingLen bytes in case word straddles two reads
	carried = trailingLen
	if valid < carried {
		carried = valid
	}
	copy(data, data[valid-carried:valid])
	if err == io.EOF {
		break
	}
	if err != nil {
		panic(err)
	}
}

Read file and display its contents in Go

I'm new to Go. I want to write a simple program that reads a filename from the user and displays its contents back to the user. This is what I have so far:
fname := "D:\myfolder\file.txt"
f, err := os.Open(fname)
if err != nil {
fmt.Println(err)
}
var buff []byte
defer f.Close()
buff = make([]byte, 1024)
for {
n, err := f.Read(buff)
if n > 0 {
fmt.Println(string(buff[:n]))
}
if err == io.EOF {
break
}
}
but I get error:
The filename, directory name, or volume label syntax is incorrect.
I suspect the backslashes in fname is the reason. Try with double backslash (\\).
Put the filename in backquotes. This makes it a raw string literal. With raw string literals, no escape sequences such as \f will be processed.
fname := `D:\myfolder\file.txt`
You can also use the unix '/' path separators instead.
Does the job.
fname := "D:/myfolder/file.txt"
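To compare the three spellings suggested above, a small program like the following can help. The variable names are only illustrative; the path is the one from the question.

```go
package main

import "fmt"

var (
	escaped = "D:\\myfolder\\file.txt" // interpreted literal, backslashes doubled
	raw     = `D:\myfolder\file.txt`   // raw literal, backslashes taken as-is
	slashes = "D:/myfolder/file.txt"   // forward slashes need no escaping
)

func main() {
	fmt.Println(escaped == raw) // true
	fmt.Println(slashes)
	// In an interpreted literal "\f" is a single form-feed byte, so "\file"
	// is 4 bytes long, not 5.
	fmt.Println(len("\file")) // 4
}
```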
Congrats on learning Go! Though the question was about a specific error in the example, let's break it down line by line and learn a bit about some of the other issues that may be encountered:
fname := "D:\myfolder\file.txt"
Like C and many other languages, Go uses the backslash character for an "escape sequence". That is, certain characters that start with a backslash get translated into other characters that would be hard to see otherwise (eg. \t becomes a tab character, which may otherwise be indistinguishable from a space).
The fix is to use a raw string literal (use backticks instead of quotes) where no escape sequences are processed:
fname := `D:\myfolder\file.txt`
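This fixes the initial error you were seeing: \m is not a valid escape sequence at all, and \f is a valid one (form feed) that silently replaces the backslash and the "f" of "file" with a control character. A full list of escape sequences and more explanation can be found by reading the String Literals section of the Go spec.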
f, err := os.Open(fname)
if err != nil {
fmt.Println(err)
}
The first line of this chunk is good, but it can be improved. If an error occurs, there is no reason for our program to continue executing since we couldn't even open the file, so we should both print it (probably to standard error) and exit, preferably with a non-zero exit status to indicate that something bad happened. Also, as a matter of good habit we probably want to close the file at the end of the function if opening it was successful. Putting it right below the Open call is conventional and makes it easier when someone else is reading your code. I would rewrite this as:
f, err := os.Open(fname)
if err != nil {
fmt.Fprintln(os.Stderr, err)
os.Exit(2)
// It is also common to replace these two lines with a call to log.Fatal
}
defer f.Close()
The last chunk is a bit complicated, and we could rewrite it in multiple ways. Right now it looks like this:
var buff []byte
defer f.Close()
buff = make([]byte, 1024)
for {
n, err := f.Read(buff)
if n > 0 {
fmt.Println(string(buff[:n]))
}
if err == io.EOF {
break
}
}
But we don't need to define our own buffering, because the standard library provides us with the bufio and bytes packages which can do this for us. In this case though, we probably don't need them because we can also replace the iteration with a call to io.Copy which does its own internal buffering. We could also use one of the other copy variants such as io.CopyBuffer if we wanted to use our own buffer. It's also missing some error handling, so we'll add that. Now this entire chunk becomes:
if _, err := io.Copy(os.Stdout, f); err != nil {
	fmt.Fprintf(os.Stderr, "Error reading from file: %v\n", err)
	os.Exit(2)
}
// We're done!

golang - bufio read multiline until (CRLF) \r\n delimiter

I am trying to implement my own beanstalkd client as a way of learning go. https://github.com/kr/beanstalkd/blob/master/doc/protocol.txt
At the moment, I am using bufio to read in a line of data delimited by \n.
res, err := this.reader.ReadString('\n')
This is fine when I send a single command and read a single-line response like INSERTED %d\r\n, but I run into difficulties when I try to reserve a job, because the job body could span multiple lines and so I cannot use the \n delimiter.
Is there a way to read into the buffer until CRLF?
e.g. when I send the reserve command. My expected response is as follows:
RESERVED <id> <bytes>\r\n
<data>\r\n
But data could contain \n, so I need to read until the \r\n.
Alternatively - is there a way of reading a specific number of bytes as specified in <bytes> in example response above?
At the moment, I have (err handling removed):
func (this *Bean) receiveLine() (string, error) {
res, err := this.reader.ReadString('\n')
return res, err
}
func (this *Bean) receiveBody(numBytesToRead int) ([]byte, error) {
res, err := this.reader.ReadString('\r\n') // What to do here to read to CRLF / up to number of expected bytes?
return res, err
}
func (this *Bean) Reserve() (*Job, error) {
this.send("reserve\r\n")
res, err := this.receiveLine()
var jobId uint64
var bodylen int
_, err = fmt.Sscanf(res, "RESERVED %d %d\r\n", &jobId, &bodylen)
body, err := this.receiveBody(bodylen)
job := new(Job)
job.Id = jobId
job.Body = body
return job, nil
}
res, err := this.reader.Read('\n')
Does not make any sense to me. Did you mean ReadBytes/ReadSlice/ReadString?
You need bufio.Scanner.
Define your bufio.SplitFunc (example is a copy of bufio.ScanLines with modifications to look for '\r\n'). Modify it to match your case.
// dropCR drops a terminal \r from the data.
func dropCR(data []byte) []byte {
if len(data) > 0 && data[len(data)-1] == '\r' {
return data[0 : len(data)-1]
}
return data
}
func ScanCRLF(data []byte, atEOF bool) (advance int, token []byte, err error) {
if atEOF && len(data) == 0 {
return 0, nil, nil
}
if i := bytes.Index(data, []byte{'\r','\n'}); i >= 0 {
// We have a full newline-terminated line.
return i + 2, dropCR(data[0:i]), nil
}
// If we're at EOF, we have a final, non-terminated line. Return it.
if atEOF {
return len(data), dropCR(data), nil
}
// Request more data.
return 0, nil, nil
}
Now, wrap your io.Reader with a Scanner that uses your custom split function.
scanner := bufio.NewScanner(this.reader)
// Set the split function for the scanning operation.
scanner.Split(ScanCRLF)
// Validate the input
for scanner.Scan() {
fmt.Printf("%s\n", scanner.Text())
}
if err := scanner.Err(); err != nil {
fmt.Printf("Invalid input: %s", err)
}
Read bufio package's source code about Scanner.
Alternatively - is there a way of reading a specific number of bytes as specified in <bytes> in example response above?
First you need to read the "RESERVED <id> <bytes>\r\n" line somehow.
And then you can use
nr_of_bytes := read_number_of_bytes_somehow(this.reader)
buf := make([]byte, nr_of_bytes)
io.ReadFull(this.reader, buf) // a plain Read may return fewer bytes than len(buf)
or io.LimitedReader.
But I don't like this approach.
Thanks for this - reader.Read('\n') was a typo - I corrected question. I have also attached example code of where I have got so far. As you can see, I can get the number of expected bytes of the body. Could you elaborate on why you don't like the idea of reading a specific number of bytes? This seems most logical?
I'd like to see Bean's definition, especially reader's part.
Imagine this counter is wrong somehow.
1. It's too short: do you need to find the following "\r\n" and discard everything up to that point, or not? Why do you need the counter in the first place then?
2. It's bigger than it should be (or even worse, it's huge!).
2.1 No next message in the reader: fine, the read is shorter than expected, but that's OK.
2.2 There is a next message waiting: bah, you read part of it and there is no easy way to recover.
2.3 It's huge: you can't allocate the memory even if the message is only 1 byte.
These byte counters in general are designed to verify the message.
And it looks like that is the case with the beanstalkd protocol.
Use Scanner, parse message, check length with expected number ... profit
UPD
Be warned: the default bufio.Scanner can't read tokens longer than 64 KB, so set the maximum length with scanner.Buffer first. And that's bad, because you can't change this option on the fly, and some data may already have been "pre"-read by the scanner.
UPD2
Thinking about my last update: take a look at how net/textproto implements dotReader as a simple state machine. You could do something similar, reading the command first and then checking the "expected bytes" against the payload.

How to read a text file line-by-line in Go when some lines are long enough to cause "bufio.Scanner: token too long" errors?

I have a text file where each line represents a JSON object. I am processing this file in Go with a simple for loop like this:
scanner := bufio.NewScanner(file)
for scanner.Scan() {
jsonBytes := scanner.Bytes()
var jsonObject interface{}
err := json.Unmarshal(jsonBytes, &jsonObject)
// do stuff with "jsonObject"...
}
if err := scanner.Err(); err != nil {
log.Fatal(err)
}
When this code reaches a line with a particularly large JSON string (~67kb), I get the error message, "bufio.Scanner: token too long".
Is there an easy way to increase the max line size readable by NewScanner? Or is there another approach you can take altogether, when needing to read lines that are too large for NewScanner but are known to not be of unsafe size generally?
You can also do:
scanner := bufio.NewScanner(file)
buf := make([]byte, 0, 64*1024)
scanner.Buffer(buf, 1024*1024)
for scanner.Scan() {
// do your stuff
}
The second argument to scanner.Buffer() sets the maximum token size. In the above example you will be able to scan the file as long as none of the lines is larger than 1MB.
From the package docs:
Programs that need more control over error handling or large tokens,
or must run sequential scans on a reader, should use bufio.Reader
instead.
It looks like the preferred solution is bufio.Reader.ReadLine.
You surely don't want to be reading line-by-line in the first place. Why don't you just do this:
d := json.NewDecoder(file)
for {
var ob whateverType
err := d.Decode(&ob)
if err == io.EOF {
break
}
if err != nil {
log.Fatalf("Error decoding: %v", err)
}
// do stuff with "jsonObject"...
}

Trying to write input from keyboard into a file in Golang

I am trying to take input from the keyboard and then store it in a text file but I am a bit confused on how to actually do it.
My current code is as follow at the moment:
// reads the file txt.txt
bs, err := ioutil.ReadFile("text.txt")
if err != nil {
panic(err)
}
// Prints out content
textInFile := string(bs)
fmt.Println(textInFile)
// Standard input from keyboard
var userInput string
fmt.Scanln(&userInput)
//Now I want to write input back to file text.txt
//func WriteFile(filename string, data []byte, perm os.FileMode) error
inputData := make([]byte, len(userInput))
err := ioutil.WriteFile("text.txt", inputData, )
There are so many functions in the "os" and "io" packages. I am very confused about which one I actually should use for this purpose.
I am also confused about what the third argument in the WriteFile function should be. In the documentation is says of type " perm os.FileMode" but since I am new to programming and Go I am a bit clueless.
Does anybody have any tips on how to proceed?
Thanks in advance,
Marie
// reads the file text.txt
bs, err := ioutil.ReadFile("text.txt")
if err != nil { // may want logic to create the file if it doesn't exist
	panic(err)
}
fmt.Println(string(bs))
var userInput []string
// read in multiple lines from user input
// until user enters the EOF char
var ln string
for {
	n, err := fmt.Scanln(&ln) // Scanln needs a pointer to store into
	if err != nil {           // EOF, or a line that did not scan as one word
		break
	}
	if n > 0 { // we actually read something into the string
		userInput = append(userInput, ln)
	}
}
// open the file to append to it
// 0666 corresponds to unix perms rw-rw-rw-,
// which means anyone can read or write it
out, err := os.OpenFile("text.txt", os.O_APPEND|os.O_WRONLY, 0666)
if err != nil {
	panic(err)
}
defer out.Close() // we'll close this file as we leave scope, no matter what
// write each of the user input lines followed by a newline
for _, outLn := range userInput {
	io.WriteString(out, outLn+"\n")
}
I've made sure this compiles and runs on play.golang.org, but I'm not at my dev machine, so I can't verify that it's interacting with Stdin and the file entirely correctly. This should get you started though.
For example,
package main
import (
"fmt"
"io/ioutil"
"os"
)
func main() {
fname := "text.txt"
// print text file
textin, err := ioutil.ReadFile(fname)
if err == nil {
fmt.Println(string(textin))
}
// append text to file
f, err := os.OpenFile(fname, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0666)
if err != nil {
panic(err)
}
var textout string
fmt.Scanln(&textout)
_, err = f.Write([]byte(textout))
if err != nil {
panic(err)
}
f.Close()
// print text file
textin, err = ioutil.ReadFile(fname)
if err != nil {
panic(err)
}
fmt.Println(string(textin))
}
If you simply want to append the user's input to a text file, you could just read the
input as you've already done and use ioutil.WriteFile, as you've tried to do.
So you already got the right idea.
To make your way go, the simplified solution would be this:
// Read old text
current, err := ioutil.ReadFile("text.txt")
// Standard input from keyboard
var userInput string
fmt.Scanln(&userInput)
// Append the new input to the old using builtin `append`
newContent := append(current, []byte(userInput)...)
// Now write the input back to file text.txt
err = ioutil.WriteFile("text.txt", newContent, 0666)
The last parameter of WriteFile is a flag which specifies the various options for
files. The higher bits are options like file type (os.ModeDir, for example) and the lower
bits represent the permissions in form of UNIX permissions (0666, in octal format, stands for user rw, group rw, others rw). See the documentation for more details.
Now that your code works, we can improve it. For example by keeping the file open
instead of opening it twice:
// Open the file for read and write (O_RDWR), append to it if it has
// content, create it if it does not exist, use 0666 for permissions
// on creation.
file, err := os.OpenFile("text.txt", os.O_RDWR|os.O_APPEND|os.O_CREATE, 0666)
// Close the file when the surrounding function exits
defer file.Close()
// Read old content
current, err := ioutil.ReadAll(file)
// Do something with that old content, for example, print it
fmt.Println(string(current))
// Standard input from keyboard
var userInput string
fmt.Scanln(&userInput)
// Now write the input back to file text.txt
_, err = file.WriteString(userInput)
The magic here is that you use the flag os.O_APPEND while opening the file,
which makes file.WriteString() append. Note that you need to close the file after
opening it, which we do when the function exits using the defer keyword.
