I need to regularly load over 300'000 rows x 78 columns of data into my Go program.
Currently I use (import github.com/360EntSecGroup-Skylar/excelize):
xlsx, err := excelize.OpenFile("/media/test snaps.xlsm")
if err != nil {
    fmt.Println(err)
    return
}

// read all rows into df
df := xlsx.GetRows("data")
It takes about 4 minutes on a decent PC with a Samsung 960 EVO M.2 SSD.
Is there a faster way to load this data? Currently it takes me more time to read the data than to process it. I'm also open to other file formats.
As suggested in the comments, instead of using the Excel format, use a custom, fast binary format for reading and writing your table.
In the most basic case, just write the number of rows and columns to a binary file, then write all the data in one go. This is very fast. I have created a little example below which writes 300,000 by 40 float32s to a file and reads them back. On my machine this takes about 400 ms to write and 250 ms to read (note that the file is hot in the OS cache right after writing it, so the initial read may take longer).
package main

import (
    "encoding/binary"
    "os"

    "github.com/gonutz/tic"
)

func main() {
    const (
        rowCount = 300000
        colCount = 40
    )

    values := make([]float32, rowCount*colCount)

    func() {
        defer tic.Toc()("write")
        f, _ := os.Create("file")
        defer f.Close()
        binary.Write(f, binary.LittleEndian, int64(rowCount))
        binary.Write(f, binary.LittleEndian, int64(colCount))
        check(binary.Write(f, binary.LittleEndian, values))
    }()

    func() {
        defer tic.Toc()("read")
        f, _ := os.Open("file")
        defer f.Close()
        var rows, cols int64
        binary.Read(f, binary.LittleEndian, &rows)
        binary.Read(f, binary.LittleEndian, &cols)
        vals := make([]float32, rows*cols)
        check(binary.Read(f, binary.LittleEndian, vals))
    }()
}

func check(err error) {
    if err != nil {
        panic(err)
    }
}
Problem
I have written a TCP echo server in Go and I am trying to write and read as often as I can within 10 seconds to measure how much data gets transferred in that time. Oddly, the value is far too high and does not depend on the length of the byte array I am transferring (but it should!). It is always around 600k iterations in those 10 seconds (the length of the result slice shows how many write/read round trips were made). As soon as I add, say, a print statement to the server so the values get processed, I get more realistic results that do depend on the length of the byte array.
Why doesn't the length of the byte array matter in the first case?
Code
Server
package main

import (
    "fmt"
    "log"
    "net"
)

func main() {
    tcpAddr, err := net.ResolveTCPAddr("tcp", fmt.Sprintf("127.0.0.1:8888"))
    checkError(err)
    ln, err := net.ListenTCP("tcp", tcpAddr)
    checkError(err)
    for {
        conn, err := ln.Accept()
        checkError(err)
        go handleConnection(conn)
    }
}

func checkError(err error) {
    if err != nil {
        log.Fatal(err)
    }
}

func handleConnection(conn net.Conn) {
    var input [1000000]byte
    for {
        n, err := conn.Read(input[0:])
        checkError(err)
        //fmt.Println(input[0:n])
        _, err = conn.Write(input[0:n])
        checkError(err)
    }
}
Client
package main

import (
    "fmt"
    "log"
    "net"
    "time"
)

var (
    result  []int
    elapsed time.Duration
)

func main() {
    input := make([]byte, 1000)
    tcpAddr, err := net.ResolveTCPAddr("tcp", "127.0.0.1:8888")
    checkError(err)
    conn, err := net.DialTCP("tcp", nil, tcpAddr)
    checkError(err)
    for start := time.Now(); time.Since(start) < time.Second*time.Duration(10); {
        startTimer := time.Now()
        _, err = conn.Write(input)
        checkError(err)
        _, err := conn.Read(input[0:])
        checkError(err)
        elapsed = time.Since(startTimer)
        result = append(result, int(elapsed))
    }
    fmt.Println(fmt.Sprintf("result: %v", len(result)))
}

func checkError(err error) {
    if err != nil {
        log.Fatal(err)
    }
}
Read in the client loop is not guaranteed to read all of the data sent in the previous call to Write.
When input is small enough to be transmitted in a single packet on the network, Read in the client returns all of the data in the previous call to Write in the client. In this mode, the application measures the time to execute request/response pairs.
For larger sizes of input, read on the client can fall behind what the client is writing. When this happens, the calls to Read complete faster because the calls return data from an earlier call to Write. The application is pipelining in this mode. The throughput for pipelining is higher than the throughput for request/response pairs. The client will not read all data in this mode, but the timing impact of that is not significant.
Use the following code to time request/response pairs for arbitrary sizes of input.
for start := time.Now(); time.Since(start) < time.Second*time.Duration(10); {
    startTimer := time.Now()
    _, err = conn.Write(input)
    checkError(err)
    _, err := io.ReadFull(conn, input) // <-- read all of the data
    checkError(err)
    elapsed = time.Since(startTimer)
    result = append(result, int(elapsed))
}
To measure full-on pipelining, modify the client to read and write from different goroutines. An example follows.
go func() {
    for start := time.Now(); time.Since(start) < time.Second*time.Duration(10); {
        _, err = conn.Write(input)
        checkError(err)
    }
    conn.CloseWrite() // tell server that we are done sending data
}()

start := time.Now()
output := make([]byte, 4096)
for {
    _, err := conn.Read(output)
    if err != nil {
        if err == io.EOF {
            break
        }
        checkError(err)
    }
}
fmt.Println(time.Since(start))
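If the goal is an overall throughput number rather than per-round-trip latencies, it can be derived from the collected measurements. A minimal sketch, assuming the request/response client above with its result slice and input buffer; the helper name is illustrative:

// reportThroughput prints an approximate throughput given the number of
// completed round trips, the payload size per direction and the length of
// the measurement window. Each round trip writes and reads payloadLen bytes.
func reportThroughput(roundTrips, payloadLen int, window time.Duration) {
    totalBytes := float64(roundTrips * payloadLen * 2)
    fmt.Printf("throughput: %.2f MB/s\n", totalBytes/window.Seconds()/1e6)
}

// e.g. reportThroughput(len(result), len(input), 10*time.Second)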
I am writing a file uploader in Go. I would like to use the MD5 of the file as the file name when I save it to disk.
What is the best way to solve this problem?
I save a file this way:
reader, _ := r.MultipartReader()
p, _ := reader.NextPart()

f, _ := os.Create("./filename") // here I need md5 as a file name
defer f.Close()

lmt := io.LimitReader(p, maxSize+1)
written, _ := io.Copy(f, lmt)
if written > maxSize {
    os.Remove(f.Name())
}
Here is an example using io.TeeReader to perform both the hash computation and the copy at the same time:
https://play.golang.org/p/IJJQiaeTOBh
package main

import (
    "crypto/sha256"
    "fmt"
    "io"
    "os"
    "strings"
)

func main() {
    var s io.Reader = strings.NewReader("some data")

    // maxSize := 4096
    // s = io.LimitReader(s, maxSize + 1)

    h := sha256.New()
    tr := io.TeeReader(s, h)

    io.Copy(os.Stdout, tr)
    fmt.Printf("\n%x", h.Sum(nil))
}

// Output:
// some data
// 1307990e6ba5ca145eb35e99182a9bec46531bc54ddf656a602c780fa0240dee
And a comparison with sha256sum to check correctness:
$ echo -n "some data" | sha256sum -
1307990e6ba5ca145eb35e99182a9bec46531bc54ddf656a602c780fa0240dee -
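Adapting this to the question is straightforward: hash with crypto/md5 instead of sha256, stream the multipart part through the TeeReader into a temporary file, and rename the temporary file once the hash is known. A hedged sketch, not the poster's code; the function name, directory handling and size check are illustrative, and it assumes the crypto/md5, encoding/hex, fmt, io, io/ioutil, os and path/filepath imports:

// saveWithMD5Name streams one uploaded part to disk and names the file
// after its MD5 sum. part and maxSize would come from the surrounding
// handler; dstDir is illustrative.
func saveWithMD5Name(part io.Reader, dstDir string, maxSize int64) (string, error) {
    tmp, err := ioutil.TempFile(dstDir, "upload-*")
    if err != nil {
        return "", err
    }

    h := md5.New()
    // Everything read through tee is also written into the hash.
    tee := io.TeeReader(io.LimitReader(part, maxSize+1), h)

    written, copyErr := io.Copy(tmp, tee)
    tmp.Close()
    if copyErr != nil || written > maxSize {
        os.Remove(tmp.Name())
        if copyErr == nil {
            copyErr = fmt.Errorf("file exceeds %d bytes", maxSize)
        }
        return "", copyErr
    }

    name := filepath.Join(dstDir, hex.EncodeToString(h.Sum(nil)))
    if err := os.Rename(tmp.Name(), name); err != nil {
        os.Remove(tmp.Name())
        return "", err
    }
    return name, nil
}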
Instead of using io.TeeReader I used io.MultiWriter to fill two buffers (the first buffer to calculate the MD5 and the second to write to the file named after the MD5):
lmt := io.LimitReader(p, maxSize+1) // p is the multipart part from the question

hash := md5.New()
var buf1, buf2 bytes.Buffer
w := io.MultiWriter(&buf1, &buf2)
if _, err := io.Copy(w, lmt); err != nil {
    log.Fatal(err)
}
if _, err := io.Copy(hash, &buf1); err != nil {
    log.Fatal(err)
}

fmt.Println("md5 is: ", hex.EncodeToString(hash.Sum(nil)))
// Now we can create a file with os.OpenFile, passing the md5 name as an
// argument, and write &buf2 to that file.
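A minimal sketch of that last step, under the assumption that hash and buf2 are the variables from the snippet above (the file-creation flags and permissions are illustrative):

fileName := hex.EncodeToString(hash.Sum(nil))
out, err := os.OpenFile(fileName, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0644)
if err != nil {
    log.Fatal(err)
}
defer out.Close()
// buf2 still holds the full upload, so write it out under the md5 name.
if _, err := io.Copy(out, &buf2); err != nil {
    log.Fatal(err)
}

Note that this approach buffers the whole upload in memory twice; for large files the TeeReader approach above streams the data instead.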
I liked the solution with TeeReader here, but simplified it like this:
type HashReader struct {
    io.Reader
    hash.Hash
}

func NewHashReader(r io.Reader, h hash.Hash) HashReader {
    return HashReader{io.TeeReader(r, h), h}
}

func NewMD5Reader(r io.Reader) HashReader {
    return NewHashReader(r, md5.New())
}

func main() {
    dataReader := bytes.NewBufferString("Hello, world!")
    hashReader := NewMD5Reader(dataReader)

    resultBytes := make([]byte, dataReader.Len())
    _, err := hashReader.Read(resultBytes)
    if err != nil {
        fmt.Println(err)
    }

    fmt.Println(hex.EncodeToString(hashReader.Sum(nil)))
}
A hex-encoded MD5 string looks more familiar to me, but feel free to encode the resulting byte slice from hashReader.Sum(nil) however you wish.
P.S. One more note on the playground example: it assigns the MD5 result on EOF, but definitely not all consumers read until EOF. Since the Hash object keeps the running hash state, it is enough to call hashReader.Sum once consumption finishes and use the result.
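For example, a small usage sketch with the HashReader above (src, dst and the use of log are illustrative):

// Stream src into dst through the HashReader, then take the accumulated sum.
hr := NewMD5Reader(src)
if _, err := io.Copy(dst, hr); err != nil {
    log.Fatal(err)
}
sum := hex.EncodeToString(hr.Sum(nil)) // hash of everything that was read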
I'm trying to improve the performance of an app.
One part of its code uploads a file to a server in chunks.
The original version simply does this in a sequential loop. However, it's slow and during the sequence it also needs to talk to another server before uploading each chunk.
Each chunk's upload could simply be placed in its own goroutine. That works, but it is not a good solution, because if the source file is extremely large it ends up using a large amount of memory.
So I tried to limit the number of active goroutines with a buffered channel. Here is some code that shows my attempt; I've stripped it down to show the concept, and you can run it to test it yourself.
package main

import (
    "fmt"
    "io"
    "os"
)

const defaultChunkSize = 1 * 1024 * 1024

// Lets have 4 workers
var c = make(chan int, 4)

func UploadFile(f *os.File) error {
    fi, err := f.Stat()
    if err != nil {
        return fmt.Errorf("err: %s", err)
    }
    size := fi.Size()
    total := (int)(size/defaultChunkSize + 1)

    // Upload parts
    buf := make([]byte, defaultChunkSize)
    for partno := 1; partno <= total; partno++ {
        readChunk := func(offset int, buf []byte) (int, error) {
            fmt.Println("readChunk", partno, offset)
            n, err := f.ReadAt(buf, int64(offset))
            if err != nil {
                return n, err
            }
            return n, nil
        }

        // This will block if there are not enough worker slots available
        c <- partno

        // The actual worker.
        go func() {
            offset := (partno - 1) * defaultChunkSize
            n, err := readChunk(offset, buf)
            if err != nil && err != io.EOF {
                return
            }
            err = uploadPart(partno, buf[:n])
            if err != nil {
                fmt.Println("Uploadpart failed:", err)
            }
            <-c
        }()
    }
    return nil
}

func uploadPart(partno int, buf []byte) error {
    fmt.Printf("Uploading partno: %d, buflen=%d\n", partno, len(buf))

    // Actually upload the part. Lets test it by instead writing each
    // buffer to another file. We can then use diff to compare the
    // source and dest files.

    // Open file. Seek to (partno - 1) * defaultChunkSize, write buffer
    f, err := os.OpenFile("/home/matthewh/Downloads/out.tar.gz", os.O_CREATE|os.O_WRONLY, 0755)
    if err != nil {
        fmt.Printf("err: %s\n", err)
    }

    n, err := f.WriteAt(buf, int64((partno-1)*defaultChunkSize))
    if err != nil {
        fmt.Printf("err=%s\n", err)
    }
    fmt.Printf("%d bytes written\n", n)
    defer f.Close()

    return nil
}

func main() {
    filename := "/home/matthewh/Downloads/largefile.tar.gz"

    fmt.Printf("Opening file: %s\n", filename)
    f, err := os.Open(filename)
    if err != nil {
        panic(err)
    }

    UploadFile(f)
}
It almost works. But there are several problems.
1) The final partno, 22, occurs 3 times. The correct length is actually 612545, as the file length isn't a multiple of 1MB.
// Sample output
...
readChunk 21 20971520
readChunk 22 22020096
Uploading partno: 22, buflen=1048576
Uploading partno: 22, buflen=612545
Uploading partno: 22, buflen=1048576
Another problem: an upload could fail, and I'm not familiar enough with Go to know how best to handle failure of a goroutine.
Finally, I ordinarily want to return some data from uploadPart when it succeeds. Specifically, it'll be a string (an HTTP ETag header value). These ETag values need to be collected by the main function.
What is a better way to structure this code? I've not yet found a good Go design pattern that fulfills my needs here.
Skipping for the moment the question of how better to structure this code, I see a bug in your code which may be causing the problem you're seeing. Since the function you're running in the goroutine uses the variable partno, which changes with each iteration of the loop, your goroutine isn't necessarily seeing the value of partno at the time you invoked the goroutine. A common way of fixing this is to create a local copy of that variable inside the loop:
for partno := 1; partno <= total; partno++ {
    partno := partno
    // ...
}
Data race #1
Multiple goroutines are using the same buffer concurrently. Note that one goroutine may be filling it with a new chunk while another is still reading an old chunk from it. Instead, each goroutine should have its own buffer.
Data race #2
As Andy Schweig has pointed out, the value in partno is updated by the loop before the goroutine created in that iteration has a chance to read it. This is why the final partno 22 occurs multiple times. To fix it, pass partno as an argument to the anonymous function. That ensures each goroutine has its own part number.
Also, you can use a channel to pass the results back from the workers, for example a struct type with the part number, the ETag and an error. That way you will be able to observe the progress and retry failed uploads; a sketch follows below.
For an example of a good pattern check out this example from the GOPL book.
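A minimal sketch of that results channel, reusing c and total from the question (the result struct, the etags slice and the uploadPartWithETag helper are illustrative names, not part of the original code):

// result reports the outcome of one part upload.
type result struct {
    partno int
    etag   string
    err    error
}

results := make(chan result, total)

for partno := 1; partno <= total; partno++ {
    c <- partno // acquire a worker slot
    go func(partno int) {
        defer func() { <-c }()                  // release the slot when done
        etag, err := uploadPartWithETag(partno) // hypothetical upload helper
        results <- result{partno: partno, etag: etag, err: err}
    }(partno)
}

// Collect exactly one result per part; retry or abort on r.err as needed.
etags := make([]string, total)
for i := 0; i < total; i++ {
    r := <-results
    if r.err != nil {
        fmt.Println("part", r.partno, "failed:", r.err)
        continue
    }
    etags[r.partno-1] = r.etag
}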
Suggested changes
As noted by dev.bmax, buf is moved into the goroutine; as noted by Andy Schweig, partno is passed as a parameter to the anonymous function. I also added a sync.WaitGroup, since UploadFile was exiting before the uploads were complete, and deferred f.Close() on the file, which is a good habit.
package main

import (
    "fmt"
    "io"
    "os"
    "sync"
    "time"
)

const defaultChunkSize = 1 * 1024 * 1024

// wg for uploads to complete
var wg sync.WaitGroup

// Lets have 4 workers
var c = make(chan int, 4)

func UploadFile(f *os.File) error {
    // wait for all the uploads to complete before function exit
    defer wg.Wait()

    fi, err := f.Stat()
    if err != nil {
        return fmt.Errorf("err: %s", err)
    }
    size := fi.Size()
    fmt.Printf("file size: %v\n", size)
    total := int(size/defaultChunkSize + 1)

    // Upload parts
    for partno := 1; partno <= total; partno++ {
        readChunk := func(offset int, buf []byte, partno int) (int, error) {
            fmt.Println("readChunk", partno, offset)
            n, err := f.ReadAt(buf, int64(offset))
            if err != nil {
                return n, err
            }
            return n, nil
        }

        // This will block if there are not enough worker slots available
        c <- partno

        // Register the worker before starting it so wg.Wait cannot return early.
        wg.Add(1)

        // The actual worker.
        go func(partno int) {
            defer wg.Done()

            buf := make([]byte, defaultChunkSize)
            offset := (partno - 1) * defaultChunkSize
            n, err := readChunk(offset, buf, partno)
            if err != nil && err != io.EOF {
                return
            }
            err = uploadPart(partno, buf[:n])
            if err != nil {
                fmt.Println("Uploadpart failed:", err)
            }
            <-c
        }(partno)
    }
    return nil
}

func uploadPart(partno int, buf []byte) error {
    fmt.Printf("Uploading partno: %d, buflen=%d\n", partno, len(buf))
    // Actually do the upload. Simulate long running task with a sleep
    time.Sleep(time.Second)
    return nil
}

func main() {
    filename := "/home/matthewh/Downloads/largefile.tar.gz"

    fmt.Printf("Opening file: %s\n", filename)
    f, err := os.Open(filename)
    if err != nil {
        panic(err)
    }
    defer f.Close()

    UploadFile(f)
}
I'm sure you can deal a little more cleverly with the buf situation; here I'm just letting the garbage collector deal with it. Since you are limiting your workers to 4, you really only need 4 x defaultChunkSize buffers. Please do share if you come up with something simple and worth sharing; one possible sketch follows below.
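One hedged sketch of that idea: hand out a fixed set of reusable buffers through a buffered channel (bufPool is an illustrative name; taking a buffer also limits concurrency to 4, so it could even replace the c semaphore):

// bufPool hands out exactly 4 reusable chunk-sized buffers.
bufPool := make(chan []byte, 4)
for i := 0; i < 4; i++ {
    bufPool <- make([]byte, defaultChunkSize)
}

// Inside each worker goroutine:
buf := <-bufPool                  // take a free buffer (blocks while all 4 are busy)
defer func() { bufPool <- buf }() // hand it back when the upload is done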
Have fun!
I am trying to make a program for checking file duplicates based on their MD5 checksums.
I'm not really sure whether I am missing something, but this function, when reading the Xcode installer app (which is about 8 GB), uses 16 GB of RAM:
func search() {
    unique := make(map[string]string)

    files, err := ioutil.ReadDir(".")
    if err != nil {
        log.Println(err)
    }

    for _, file := range files {
        fileName := file.Name()
        fmt.Println("CHECKING:", fileName)
        fi, err := os.Stat(fileName)
        if err != nil {
            fmt.Println(err)
            continue
        }
        if fi.Mode().IsRegular() {
            data, err := ioutil.ReadFile(fileName)
            if err != nil {
                fmt.Println(err)
                continue
            }
            sum := md5.Sum(data)
            hexDigest := hex.EncodeToString(sum[:])
            if _, ok := unique[hexDigest]; ok == false {
                unique[hexDigest] = fileName
            } else {
                fmt.Println("DUPLICATE:", fileName)
            }
        }
    }
}
As far as I can tell from debugging, the issue is with the file reading.
Is there a better approach to do that?
Thanks.
There is an example in the Go documentation which covers your case.
package main

import (
    "crypto/md5"
    "fmt"
    "io"
    "log"
    "os"
)

func main() {
    f, err := os.Open("file.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    h := md5.New()
    if _, err := io.Copy(h, f); err != nil {
        log.Fatal(err)
    }

    fmt.Printf("%x", h.Sum(nil))
}
For your case, just make sure to close the files in the loop and not defer them. Or put the logic into a function.
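For example, a small helper along these lines keeps the defer scoped to a single file (a sketch, not the poster's code; it assumes the crypto/md5, encoding/hex, io and os imports):

// hashFile streams one file through MD5 without loading it all into memory.
func hashFile(name string) (string, error) {
    f, err := os.Open(name)
    if err != nil {
        return "", err
    }
    defer f.Close() // deferred inside the helper, so it runs once per file

    h := md5.New()
    if _, err := io.Copy(h, f); err != nil {
        return "", err
    }
    return hex.EncodeToString(h.Sum(nil)), nil
}

The loop in search can then call hashFile(fileName) in place of ioutil.ReadFile plus md5.Sum.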
Sounds like the 16GB RAM is your problem, not speed per se.
Don't read the entire file into a variable with ReadFile; io.Copy from the Reader that Open gives you to the Writer that hash/md5 provides (md5.New returns a hash.Hash, which embeds an io.Writer). That only copies a little bit at a time instead of pulling all of the file into RAM.
This is a trick useful in a lot of places in Go; packages like text/template, compress/gzip, net/http, etc. work in terms of Readers and Writers. With them, you don't usually need to create huge []bytes or strings; you can hook I/O interfaces up to each other and let them pass around pieces of content for you. In a garbage collected language, saving memory tends to save you CPU work as well.
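As a hedged illustration of that composition style with one of the packages mentioned above, a gzip.Writer can sit between a source file and a destination file so the data streams through in small pieces (the file names are illustrative):

// Sketch: stream input.dat into input.dat.gz through compress/gzip without
// buffering the whole file in memory.
src, err := os.Open("input.dat")
if err != nil {
    log.Fatal(err)
}
defer src.Close()

dst, err := os.Create("input.dat.gz")
if err != nil {
    log.Fatal(err)
}
defer dst.Close()

zw := gzip.NewWriter(dst)
if _, err := io.Copy(zw, src); err != nil { // data flows src -> gzip -> dst
    log.Fatal(err)
}
if err := zw.Close(); err != nil { // flush the gzip footer
    log.Fatal(err)
}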
I am trying to parse a file that annoyingly consists of many separately zipped segments. I have parsed these segments, one at a time, into a slice of bytes, and I want to decompress them as I go.
Here is my current code that does the decompressing, which doesn't work. from and to are just set at the top as an example; in reality they are set by the code. data is the byte slice containing the entire file. I don't want to seek the file while it's on disk because it is located on another server, so it's only realistic for me to load the entire file into a []byte first and then parse it.
from, to := 0, 1000
b := bytes.NewReader(data[from : from+to])
z, err := zlib.NewReader(b)
CheckErr(err)
defer z.Close()

p := make([]byte, 0, 1024)
z.Read(p)
fmt.Println(string(p))
So why is it so massively difficult just to unzip a slice of bytes? Anyway...
The problem appears to be with how I am reading it out. Where it says z.Read, that doesn't seem to do anything.
How can I read the entire thing into a slice of bytes in one go?
Here's an outline for you. Note: In Go, CHECK FOR ERRORS!
package main

import (
    "bytes"
    "compress/zlib"
    "fmt"
    "io/ioutil"
)

func readSegment(data []byte, from, to int) ([]byte, error) {
    b := bytes.NewReader(data[from : from+to])
    z, err := zlib.NewReader(b)
    if err != nil {
        return nil, err
    }
    defer z.Close()
    p, err := ioutil.ReadAll(z)
    if err != nil {
        return nil, err
    }
    return p, nil
}

func main() {
    from, to := 0, 1000
    data := make([]byte, from+to)
    // ** parse input segments into data **
    p, err := readSegment(data, from, to)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Println(string(p))
}
Use ReadAll(r io.Reader) ([]byte, error) from the io/ioutil package.
p, err := ioutil.ReadAll(z) // read from the zlib reader, not the raw bytes.Reader
fmt.Println(string(p))
Read only reads up to the length of the given slice, and your slice was created with make([]byte, 0, 1024), so its length is 0 and nothing is read.
To read in chunks of 1024 bytes:
p := make([]byte, 1024)
for {
    numBytes, err := z.Read(p)
    // do what you want with p[:numBytes]; the last chunk may be shorter than len(p)
    if err == io.EOF {
        // you are done
        break
    }
    if err != nil {
        // handle the error
        break
    }
}
If you are getting the data from a webserver, you might even do
import (
    "compress/zlib"
    "io/ioutil"
    "net/http"
)

...

resp, errGet := http.Get("http://example.com/somefile")
// do error handling
defer resp.Body.Close() // close the body after reading, not before

z, errZ := zlib.NewReader(resp.Body)
// do error handling
defer z.Close()

p, err := ioutil.ReadAll(z) // read the decompressed data
// do error handling
since resp.Body happens to be an io.Reader, like most io-related types.