CSV to date and float - performance

I'm currently writing a small program which converts CSV files into structs to be used for further processing. The CSV lines look like this:
20140102,09:30,38.88,38.88,38.82,38.85,67004
I have 500 files, each about 20-30 MB.
My code works just fine, but I can't help wondering if there isn't a better way to convert these files than what I'm doing now.
First I read the file and convert it to CSV records (pseudo code):
data, err := ioutil.ReadFile(path)
if err != nil {
...
}
r := csv.NewReader(bytes.NewReader(data))
records, err := r.ReadAll()
if err != nil {
...
}
Then I loop over all the records and do:
parsedTime, err := time.Parse("2006010215:04", record[0]+record[1])
if err != nil {
return model.ZorroT6{}, time.Time{}, err
}
t6.Date = ConvertToOle(parsedTime)
if open, err := strconv.ParseFloat(record[2], 32); err == nil {
t6.Open = float32(open)
}
if high, err := strconv.ParseFloat(record[3], 32); err == nil {
t6.High = float32(high)
}
if low, err := strconv.ParseFloat(record[4], 32); err == nil {
t6.Low = float32(low)
}
if close, err := strconv.ParseFloat(record[5], 32); err == nil {
t6.Close = float32(close)
}
if vol, err := strconv.ParseInt(record[6], 10,32); err == nil {
t6.Vol = int32(vol)
}
For example, I have to go through []byte -> string -> float64 -> float32 to get my float values. What could I do to improve this code?
EDIT: Just to be clear, I don't really need to improve the performance. I'm just trying to better understand Go and which performance optimizations could be applied to a problem like this. For example, it seems like a lot of overhead to create loads of strings and float64 values when I have a byte slice and want a float32.

There is only one problem I see that needs fixing:
Do not use ioutil.ReadFile together with bytes.NewReader. It reads the whole file into memory, which is inefficient when the file is large.
Instead, use os.Open(file). It gives you an io.Reader that csv.NewReader can use directly. Do not forget to close the file and handle errors.
If you still want to improve performance:
Since your CSV files have a fixed format, you could work on the raw bytes read through bufio instead of using encoding/csv.
You can copy and adapt the underlying code from strconv and time to avoid general-purpose code paths that you do not need.
But I don't think these are worth the trouble.
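As a rough sketch of the os.Open suggestion: the code below streams records one at a time with csv.Reader.Read instead of ReadAll. The Bar type is a hypothetical stand-in for the question's model.ZorroT6, and parseRecord and readBars are names made up for this example. The OLE date conversion from the question is left out, so the parsed time.Time is kept as-is.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"os"
	"strconv"
	"strings"
	"time"
)

// Bar is a hypothetical stand-in for the question's model.ZorroT6.
type Bar struct {
	Time                   time.Time
	Open, High, Low, Close float32
	Vol                    int32
}

// parseRecord converts one CSV record into a Bar and returns the first
// conversion error instead of silently skipping a field.
func parseRecord(rec []string) (Bar, error) {
	if len(rec) != 7 {
		return Bar{}, fmt.Errorf("want 7 fields, got %d", len(rec))
	}
	t, err := time.Parse("2006010215:04", rec[0]+rec[1])
	if err != nil {
		return Bar{}, err
	}
	b := Bar{Time: t}
	for i, dst := range []*float32{&b.Open, &b.High, &b.Low, &b.Close} {
		f, err := strconv.ParseFloat(rec[2+i], 32)
		if err != nil {
			return Bar{}, err
		}
		*dst = float32(f)
	}
	v, err := strconv.ParseInt(rec[6], 10, 32)
	if err != nil {
		return Bar{}, err
	}
	b.Vol = int32(v)
	return b, nil
}

// readBars reads records from r one by one and passes each Bar to handle.
func readBars(r io.Reader, handle func(Bar)) error {
	cr := csv.NewReader(r)
	cr.ReuseRecord = true // reuse the record slice between Read calls
	for {
		rec, err := cr.Read()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		bar, err := parseRecord(rec)
		if err != nil {
			return err
		}
		handle(bar)
	}
}

func main() {
	// Default to the sample line from the question; use a file if one is given.
	var src io.Reader = strings.NewReader("20140102,09:30,38.88,38.88,38.82,38.85,67004\n")
	if len(os.Args) > 1 {
		f, err := os.Open(os.Args[1])
		if err != nil {
			log.Fatal(err)
		}
		defer f.Close()
		src = f
	}
	count := 0
	if err := readBars(src, func(Bar) { count++ }); err != nil {
		log.Fatal(err)
	}
	fmt.Println("bars:", count)
}
```

This still goes through string and float64 for every field; it only avoids holding the whole file and all records in memory at once.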

Related

How to read arbitrary amounts of data directly from a file in Go?

Without reading the contents of a file into memory, how can I read "x" bytes from the file so that I can specify what x is for every separate read operation?
I see that the Read method of various Readers takes a byte slice of a certain length and I can read from a file into that slice. But in that case the size of the slice is fixed, whereas what I would like to do, ideally, is something like:
func main() {
f, err := os.Open("./file.txt")
if err != nil {
panic(err)
}
someBytes := f.Read(2)
someMoreBytes := f.Read(4)
}
bytes.Buffer has a Next method which behaves very closely to what I would want, but it requires an existing buffer to work, whereas I'm hoping to read an arbitrary amount of bytes from a file without needing to read the whole thing into memory.
What is the best way to accomplish this?
Thank you for your time.
Use this function:
// readN reads and returns n bytes from the reader.
// On error, readN returns the partial bytes read and
// a non-nil error.
func readN(r io.Reader, n int) ([]byte, error) {
// Allocate buffer for result
b := make([]byte, n)
// ReadFull ensures buffer is filled or error is returned.
n, err := io.ReadFull(r, b)
return b[:n], err
}
Call like this:
someBytes, err := readN(f, 2)
if err != nil { /* handle error here */ }
someMoreBytes, err := readN(f, 4)
if err != nil { /* handle error here */ }
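To illustrate what readN does when the input runs short, here is a small self-contained sketch. It uses an in-memory strings.Reader instead of a file, and readChunks is a hypothetical helper added only for the demonstration.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// readN reads and returns n bytes from the reader (same as the function above).
func readN(r io.Reader, n int) ([]byte, error) {
	b := make([]byte, n)
	n, err := io.ReadFull(r, b)
	return b[:n], err
}

// readChunks is a hypothetical helper: it reads consecutive chunks of the
// given sizes from s and stops at the first error.
func readChunks(s string, sizes ...int) ([]string, error) {
	r := strings.NewReader(s)
	var out []string
	for _, n := range sizes {
		b, err := readN(r, n)
		if len(b) > 0 {
			out = append(out, string(b)) // keep partial data too
		}
		if err != nil {
			return out, err
		}
	}
	return out, nil
}

func main() {
	// Five bytes of input, but 2 + 4 = 6 bytes requested.
	chunks, err := readChunks("abcde", 2, 4)
	fmt.Printf("%q %v\n", chunks, err)
}
```

The second read comes back with the three remaining bytes and io.ErrUnexpectedEOF. A read that finds no bytes at all returns io.EOF instead, so callers can tell a clean end from a truncated chunk.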
You can do something like this:
f, err := os.Open("/tmp/dat")
check(err)
b1 := make([]byte, 5)
n1, err := f.Read(b1)
check(err)
fmt.Printf("%d bytes: %s\n", n1, string(b1[:n1]))
for more reading please check site.

Why is my Go app not reading from sysfs like the busybox `cat` command?

Go 1.12 on Linux 4.19.93 armv6l.
Hardware is a Raspberry Pi Zero W (BCM2835) running a Yocto Linux image.
I've got a gpio driven SRF04 proximity sensor driven by the srf04 linux driver.
It works great over sysfs and the busybox shell.
# cat /sys/bus/iio/devices/iio:device0/in_distance_raw
1646
I've used Go before with IIO devices that support triggers and buffered output at high sample rates on this hardware platform. However for this application the srf04 driver doesn't implement those IIO features. Drat. I don't really feel like adding buffer / trigger support to the driver myself (at this time) since I do not have a need for a 'high' sample rate. A handful of pings per second should suffice for my purpose. I figure I'll calculate mean & std. dev. for a rolling window of data points and 'divine' the signal out of the noise.
So with that - I'd be perfectly happy to Read the bytes from the published sysfs file with Go.
Which brings me to the point of this post.
When I open the file for reading, and try to Read() any number of bytes, I always get a generic -EIO error.
func (s *Srf04) Read() (int, error) {
samp := make([]byte, 16)
f, err := os.OpenFile(s.readPath, os.O_RDONLY, os.ModeDevice)
if err != nil {
return 0, err
}
defer f.Close()
n, err := f.Read(samp)
if err != nil {
// This block is always executed.
// The error is never a timeout, and always 'input/output error' (-EIO aka -5)
log.Fatal(err)
}
...
}
This seems like strange behavior to me.
So I decided to mess with using io.ReadFull. This yielded unreliable results.
func (s *Srf04) Read() (int, error) {
samp := make([]byte, 16)
f, err := os.OpenFile(s.readPath, os.O_RDONLY, os.ModeDevice)
if err != nil {
return 0, err
}
defer f.Close()
for {
n, err := io.ReadFull(f, samp)
log.Println("ReadFull ", n, " bytes.")
if err == io.EOF {
break
}
if err != nil {
log.Println(err)
}
}
...
}
I ended up adding it to a loop, as I found behavior changes from 'one-off' reads to multiple read calls subsequent to one another. I have it exiting if it gets an EOF, and repeatedly trying to read otherwise.
The results are straight-up crazy unreliable, seemingly random. Sometimes I get the -5; other times I read between 2 and 5 bytes from the device. Sometimes I get bytes without an EOF. The bytes appear to be character data for numbers (each one is a digit in [0-9]) -- which I'd expect.
Aside: I expect this is related to file polling and the go blocking IO implementation, but I have no way to really tell.
As a temporary workaround, I decided to try using os/exec, and now I get the results I'd expect to see.
func (s *Srf04)Read() (int, error) {
out, err := exec.Command("cat", s.readPath).Output()
if err != nil {
return 0, err
}
return strconv.Atoi(string(out))
}
But Yick. os/exec. Yuck.
I'd try to run that cat incantation under strace and then look at which read(2) calls cat actually makes (including the number of bytes actually read), and then I'd try to re-create that behaviour in Go.
My own sheer guess at the problem's cause is that the driver (or the sysfs layer) is not too well prepared to deal with certain access patterns.
For a start, consider that GNU cat is not a simple-minded byte shoveler but is rather a reasonably tricky piece of software, which, among other things, considers optimal I/O block sizes for both input and output devices (if available), calls fadvise(2) etc. It's not that any of that gets actually used when you run it on your sysfs-exported file, but it may influence how the full stack (starting with the sysfs layer) performs in the case of using cat and with your code, respectively.
Hence my advice: start with strace-ing the cat and then try to re-create its usage pattern in your Go code; then try to come up with a minimal subset of that, which works; then profoundly comment your code ;-)
I'm sure I've been looking at this too long tonight, and this code is probably terrible. That said, here's the snippet of what I came up with that works just as reliably as the busybox cat, but in Go.
The Srf04 struct carries a few things, the important bits are included below:
type Srf04 struct {
readBuf []byte `json:"-"`
readFile *os.File `json:"-"`
samples *ring.Ring `json:"-"`
}
func (s *Srf04) Read() (int, error) {
/** Reliable, but really really slow.
out, err := exec.Command("cat", s.readPath).Output()
if err != nil {
log.Fatal(err)
}
val, err := strconv.Atoi(string(out[:len(out) - 2]))
if err == nil {
s.samples.Value = val
s.samples = s.samples.Next()
}
*/
// Seek should tell us the new offset (0) and no err.
bytesRead := 0
_, err := s.readFile.Seek(0, 0)
// Loop until N > 0 AND err != EOF && err != timeout.
if err == nil {
n := 0
for {
n, err = s.readFile.Read(s.readBuf)
bytesRead += n
if os.IsTimeout(err) {
// bail out.
bytesRead = 0
break
}
if err == io.EOF {
// Success!
break
}
// Any other err means 'keep trying to read.'
}
}
if bytesRead > 0 {
val, err := strconv.Atoi(string(s.readBuf[:bytesRead-1]))
if err == nil {
fmt.Println(val)
s.samples.Value = val
s.samples = s.samples.Next()
}
return val, err
}
return 0, err
}
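One small hardening idea for the parsing step, offered as a sketch rather than a tested fix for this driver: both versions above cut a fixed number of trailing bytes (bytesRead-1, or len(out)-2 in the commented-out part) before calling strconv.Atoi. Trimming whitespace instead does not depend on exactly how the value is terminated. parseSample is a hypothetical helper name.

```go
package main

import (
	"bytes"
	"fmt"
	"strconv"
)

// parseSample converts the raw bytes read from the sysfs attribute into an
// int, ignoring any leading or trailing whitespace such as the final newline.
func parseSample(raw []byte) (int, error) {
	return strconv.Atoi(string(bytes.TrimSpace(raw)))
}

func main() {
	// "1646\n" mirrors the cat output shown in the question.
	v, err := parseSample([]byte("1646\n"))
	fmt.Println(v, err)
}
```

In the Read method above it would be called as parseSample(s.readBuf[:bytesRead]).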

Is there a faster alternative to ioutil.ReadFile?

I am trying to make a program for checking file duplicates based on md5 checksum.
I'm not really sure whether I am missing something, but this function uses 16 GB of RAM when it reads the Xcode installer app (which is about 8 GB):
func search() {
unique := make(map[string]string)
files, err := ioutil.ReadDir(".")
if err != nil {
log.Println(err)
}
for _, file := range files {
fileName := file.Name()
fmt.Println("CHECKING:", fileName)
fi, err := os.Stat(fileName)
if err != nil {
fmt.Println(err)
continue
}
if fi.Mode().IsRegular() {
data, err := ioutil.ReadFile(fileName)
if err != nil {
fmt.Println(err)
continue
}
sum := md5.Sum(data)
hexDigest := hex.EncodeToString(sum[:])
if _, ok := unique[hexDigest]; ok == false {
unique[hexDigest] = fileName
} else {
fmt.Println("DUPLICATE:", fileName)
}
}
}
}
As per my debugging the issue is with the file reading
Is there a better approach to do that?
thanks
There is an example in the Golang documentation, which covers your case.
package main
import (
"crypto/md5"
"fmt"
"io"
"log"
"os"
)
func main() {
f, err := os.Open("file.txt")
if err != nil {
log.Fatal(err)
}
defer f.Close()
h := md5.New()
if _, err := io.Copy(h, f); err != nil {
log.Fatal(err)
}
fmt.Printf("%x", h.Sum(nil))
}
For your case, just make sure to close the files in the loop and not defer them. Or put the logic into a function.
Sounds like the 16GB RAM is your problem, not speed per se.
Don't read the entire file into a variable with ReadFile; io.Copy from the Reader that Open gives you to the Writer that crypto/md5 provides (md5.New returns a hash.Hash, which embeds an io.Writer). That only copies a little bit at a time instead of pulling all of the file into RAM.
This is a trick useful in a lot of places in Go; packages like text/template, compress/gzip, net/http, etc. work in terms of Readers and Writers. With them, you don't usually need to create huge []bytes or strings; you can hook I/O interfaces up to each other and let them pass around pieces of content for you. In a garbage collected language, saving memory tends to save you CPU work as well.
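Putting the two answers together, the duplicate check from the question could look roughly like this. It is a sketch: hashFile is a name made up here, and os.ReadDir assumes Go 1.16 or newer (older versions would keep ioutil.ReadDir). Because the file is opened and closed inside hashFile, the deferred Close runs once per file rather than piling up until the loop ends.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"io"
	"log"
	"os"
)

// hashFile streams the file at path through MD5 and returns the hex digest.
// Only a small copy buffer is held in memory, never the whole file.
func hashFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close() // runs when hashFile returns, i.e. once per file

	h := md5.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	unique := make(map[string]string) // digest -> first file seen with it
	entries, err := os.ReadDir(".")
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		if !e.Type().IsRegular() {
			continue // skip directories, symlinks, devices
		}
		sum, err := hashFile(e.Name())
		if err != nil {
			fmt.Println(err)
			continue
		}
		if first, ok := unique[sum]; ok {
			fmt.Println("DUPLICATE:", e.Name(), "matches", first)
		} else {
			unique[sum] = e.Name()
		}
	}
}
```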

Golang - why is string slice element not included in exec cat unless I sort it

I have a slightly funky issue in golang. Essentially I have a slice of strings which represent file paths. I then run a cat against those filepaths to combine the files before sorting, deduping, etc.
here is the section of code (where 'applicableReductions' is the string slice):
applicableReductions := []string{}
for _, fqFromListName := range fqFromListNames {
filePath := GetFilePath()
//BROKE CODE GOES HERE
applicableReductions = append(applicableReductions, filePath)
}
fileOut, err := os.Create(toListWriteTmpFilePath)
if err != nil {
return err
}
cat := exec.Command("cat", applicableReductions...)
catStdOut, err := cat.StdoutPipe()
if err != nil {
return err
}
go func(cat *exec.Cmd) error {
if err := cat.Start(); err != nil {
return fmt.Errorf("File reduction error (cat) : %s", err)
}
return nil
}(cat)
// Init Writer & write file
writer := bufio.NewWriter(fileOut)
defer writer.Flush()
_, err = io.Copy(writer, catStdOut)
if err != nil {
return err
}
if err = cat.Wait(); err != nil {
return err
}
fDiff.StandardiseData(fileOut, toListUpdateFolderPath, list.Name)
The above works fine. The problem comes when I try to append a new element to the slice. I have a separate function which creates a new file from DB content, and that file is then added to the applicableReductions slice.
func RetrieveDomainsFromDB(collection *Collection, listName, outputPath string) error {
domains, err := domainReviews.GetDomainsForList(listName)
if err != nil {
return err
}
if len(domains) < 1 {
return ErrNoDomainReviewsForList
}
fh, err := os.OpenFile(outputPath, os.O_RDWR, 0774)
if err != nil {
fh, err = os.Create(outputPath)
if err != nil {
return err
}
}
defer fh.Close()
_, err = fh.WriteString(strings.Join(domains, "\n"))
if err != nil {
return err
}
return nil
}
If I call the above function and append the filePath to the applicableReductions slice, the path is in the slice but its file doesn't seem to get picked up by cat.
To clarify, when I put the following where it says BROKE CODE GOES HERE:
if dbSource {
err = r.RetrieveDomainsFromDB(collection, ToListName, filePath)
if err != nil {
return err
continue
}
}
The file path can be seen when doing fmt.Println(applicableReductions), but the contents of that file do not appear in the cat output file.
I thought there was perhaps a delay in the file being written, so I tried adding a sleep; this didn't help. However, the solution I found was to sort the slice, e.g. this code above the call to exec cat solves the problem, but I don't know why:
sort.Strings(applicableReductions)
I have confirmed that all files are present on both successful and unsuccessful runs. The only difference is that without the sort, the content of the final appended file is missing.
An explanation from a Go pro out there would be very much appreciated. Let me know if you need more info or debug output; I'm happy to oblige in order to understand this.
UPDATE
It has been suggested that this is the same issue as here: Golang append an item to a slice. I think I understand the issue there, and I'm not saying this isn't the same, but I cannot see the same thing happening. The slice in question is not touched from outside the main function (e.g. no editing of the slice in the RetrieveDomainsFromDB function). I create the slice before a loop, append to it within the loop and then use it after the loop. I've added an example at the top to show how the slice is built. Could someone please clarify where this slice is being copied, if that is the case?
UPDATE AND CLOSE
Please close the question: the issue was unrelated to the use of a string slice. It turns out that I was reading from the final output file before the bufio.Writer had been flushed (at the end of the function, before the deferred Flush kicked in on function return).
I think the sorting was just rearranging the problem so that I didn't notice it persisted, or possibly giving the buffer some time to flush. Either way, it is sorted now with a manual call to Flush.
Thanks for all help provided
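For anyone hitting the same thing, here is a minimal sketch of the effect described in the update. A bytes.Buffer stands in for the output file, and visibleBytes is a hypothetical helper written only for this demonstration: data written through a bufio.Writer is not visible in the destination until Flush is called.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
)

// visibleBytes writes s through a bufio.Writer into an in-memory destination
// and reports how many bytes have actually reached that destination,
// optionally calling Flush first.
func visibleBytes(s string, flush bool) (int, error) {
	var dst bytes.Buffer // stands in for the output file
	w := bufio.NewWriter(&dst)
	if _, err := w.WriteString(s); err != nil {
		return 0, err
	}
	if flush {
		if err := w.Flush(); err != nil {
			return 0, err
		}
	}
	return dst.Len(), nil
}

func main() {
	before, _ := visibleBytes("example.com\n", false)
	after, _ := visibleBytes("example.com\n", true)
	fmt.Println("without Flush:", before, "with Flush:", after)
}
```

In the question's code, that means calling writer.Flush() (and checking its error) before fDiff.StandardiseData reads the file, rather than relying on the deferred Flush.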

How to read a text file line-by-line in Go when some lines are long enough to cause "bufio.Scanner: token too long" errors?

I have a text file where each line represents a JSON object. I am processing this file in Go with a simple for loop like this:
scanner := bufio.NewScanner(file)
for scanner.Scan() {
jsonBytes := scanner.Bytes()
var jsonObject interface{}
err := json.Unmarshal(jsonBytes, &jsonObject)
// do stuff with "jsonObject"...
}
if err := scanner.Err(); err != nil {
log.Fatal(err)
}
When this code reaches a line with a particularly large JSON string (~67 KB), I get the error message "bufio.Scanner: token too long".
Is there an easy way to increase the max line size readable by NewScanner? Or is there another approach you can take altogether, when needing to read lines that are too large for NewScanner but are known to not be of unsafe size generally?
You can also do:
scanner := bufio.NewScanner(file)
buf := make([]byte, 0, 64*1024)
scanner.Buffer(buf, 1024*1024)
for scanner.Scan() {
// do your stuff
}
The second argument to scanner.Buffer() sets the maximum token size. In the above example you will be able to scan the file as long as none of the lines is larger than 1MB.
From the package docs:
Programs that need more control over error handling or large tokens,
or must run sequential scans on a reader, should use bufio.Reader
instead.
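It looks like the suggested alternative is bufio.Reader, e.g. ReadLine. Note that the ReadLine docs call it a low-level primitive and recommend ReadBytes('\n') or ReadString('\n') for most callers.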
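A minimal sketch of the bufio.Reader route, using ReadBytes('\n') so that there is no fixed maximum line length. forEachLine and lineLengths are hypothetical names for this example, and the input here is an in-memory string rather than a file.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// forEachLine calls fn once per line of r, with the line terminator removed.
// Lines can be arbitrarily long; the slice grows as needed.
func forEachLine(r io.Reader, fn func(line []byte) error) error {
	br := bufio.NewReader(r)
	for {
		line, err := br.ReadBytes('\n')
		if len(line) > 0 {
			// A final line without a trailing newline is still delivered.
			if ferr := fn(bytes.TrimRight(line, "\r\n")); ferr != nil {
				return ferr
			}
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

// lineLengths is a small demo helper that returns the length of every line in s.
func lineLengths(s string) ([]int, error) {
	var lens []int
	err := forEachLine(strings.NewReader(s), func(line []byte) error {
		lens = append(lens, len(line))
		return nil
	})
	return lens, err
}

func main() {
	// One short line and one line well above bufio.Scanner's default 64 KB limit.
	input := "{}\n" + strings.Repeat("x", 100000) + "\n"
	lens, err := lineLengths(input)
	fmt.Println(lens, err)
}
```

Inside the callback you would call json.Unmarshal on line, as in the question's loop.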
You surely don't want to be reading line-by-line in the first place. Why don't you just do this:
d := json.NewDecoder(file)
for {
var ob whateverType
err := d.Decode(&ob)
if err == io.EOF {
break
}
if err != nil {
log.Fatalf("Error decoding: %v", err)
}
// do stuff with "ob"...
}

Resources