Hash large file using little memory in Go

I need to hash very large files (>10 TB), so I decided to hash only 128 KB per MB.
My idea is to divide the file into 1 MB blocks and hash only the first 128 KB of each block.
The following code works, but it uses insane amounts of memory and I can't tell why...
func partialMD5Hash(filePath string) string {
    var blockSize int64 = 1024 * 1024
    var sampleSize int64 = 1024 * 128

    file, err := os.Open(filePath)
    if err != nil {
        return "ERROR"
    }
    defer file.Close()

    fileInfo, _ := file.Stat()
    fileSize := fileInfo.Size()

    hash := md5.New()

    var i int64
    for i = 0; i < fileSize/blockSize; i++ {
        sample := make([]byte, sampleSize)
        _, err = file.Read(sample)
        if err != nil {
            return "ERROR"
        }
        hash.Write(sample)

        _, err := file.Seek(blockSize-sampleSize, 1)
        if err != nil {
            return "ERROR"
        }
    }

    return hex.EncodeToString(hash.Sum(nil))
}
Any help will be appreciated!

There are several problems with the approach, and with the program.
If you want to hash a large file, you have to hash all of it. Sampling parts of the file will not detect modifications to the parts you didn't sample.
You are allocating a new buffer for every iteration. Instead, allocate one buffer outside the for-loop, and reuse it.
Also, you seem to be ignoring how many bytes were actually read. So:
block := make([]byte, blockSize)
for {
    n, err := file.Read(block)
    if n > 0 {
        hash.Write(block[:n])
    }
    if err == io.EOF {
        break
    }
    if err != nil {
        return "ERROR"
    }
}
However, the following would be much more concise:
io.Copy(hash, file)
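For completeness, a minimal sketch of a whole-file hash built on io.Copy (the name fullMD5Hash and the error-returning signature are illustrative, not from the original post):
func fullMD5Hash(filePath string) (string, error) {
    file, err := os.Open(filePath)
    if err != nil {
        return "", err
    }
    defer file.Close()

    hash := md5.New()
    // io.Copy streams the file through a small internal buffer (32 KB),
    // so memory use stays constant regardless of file size.
    if _, err := io.Copy(hash, file); err != nil {
        return "", err
    }
    return hex.EncodeToString(hash.Sum(nil)), nil
}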

Related

Ignore a line containing a pattern from a long text file in Go

I'm trying to implement a function that ignores any line containing a pattern in a long text file (ASCII guaranteed) in Go.
The functions I have below, withoutIgnore and withIgnore, both take a filename argument and return a *bytes.Buffer, which can subsequently be written to an io.Writer.
The withIgnore function takes an additional argument pattern to exclude any line containing the pattern from the file. The function works, but benchmarking found it to be 5x slower than withoutIgnore. Is there a way it could be improved?
package main

import (
    "bufio"
    "bytes"
    "io"
    "log"
    "os"
)

func withoutIgnore(f string) (*bytes.Buffer, error) {
    rfd, err := os.Open(f)
    if err != nil {
        log.Fatal(err)
    }
    defer func() {
        if err := rfd.Close(); err != nil {
            log.Fatal(err)
        }
    }()

    inputBuffer := make([]byte, 1048576)
    var bytesRead int

    var bs []byte
    opBuffer := bytes.NewBuffer(bs)

    for {
        bytesRead, err = rfd.Read(inputBuffer)
        if err == io.EOF {
            return opBuffer, nil
        }
        if err != nil {
            return nil, nil
        }
        _, err = opBuffer.Write(inputBuffer[:bytesRead])
        if err != nil {
            return nil, err
        }
    }
    return opBuffer, nil
}

func withIgnore(f, pattern string) (*bytes.Buffer, error) {
    rfd, err := os.Open(f)
    if err != nil {
        log.Fatal(err)
    }
    defer func() {
        if err := rfd.Close(); err != nil {
            log.Fatal(err)
        }
    }()

    scanner := bufio.NewScanner(rfd)
    var bs []byte
    buffer := bytes.NewBuffer(bs)
    for scanner.Scan() {
        if !bytes.Contains(scanner.Bytes(), []byte(pattern)) {
            _, err := buffer.WriteString(scanner.Text() + "\n")
            if err != nil {
                return nil, nil
            }
        }
    }
    return buffer, nil
}

func main() {
    // buff, err := withoutIgnore("base64dump.log")
    buff, err := withIgnore("base64dump.log", "AUDIT")
    if err != nil {
        log.Fatal(err)
    }
    _, err = buff.WriteTo(os.Stdout)
    if err != nil {
        log.Fatal(err)
    }
}
Benchmark test
package main

import "testing"

func BenchmarkTestWithoutIgnore(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _, err := withoutIgnore("base64dump.log")
        if err != nil {
            b.Fatal(err)
        }
    }
}

func BenchmarkTestWithIgnore(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _, err := withIgnore("base64dump.log", "AUDIT")
        if err != nil {
            b.Fatal(err)
        }
    }
}
and the "base64dump.log" can be generated in the command line using
base64 /dev/urandom | head -c 10000000 > base64dump.log
Since ASCII is guaranteed, one can work directly at the byte level.
Still, if one checks each byte for line breaks while reading the input and then searches for the pattern again within the line, operations are applied to each byte.
If, on the other hand, one reads chunks of the input and performs an optimized search for the pattern in the text, without even examining each input byte, one minimizes the operations per input byte.
For example, there is the Boyer-Moore string search algorithm. Go's built-in bytes.Index function is also optimized. The achieved speed naturally depends on the input data and the actual pattern. For the input as specified in the question, bytes.Index turned out to be significantly more performant when measured.
Procedure
Read in a chunk; the chunk size should be significantly larger than the maximum line length. A value >= 64 KB is probably good; in the test, 1 MB was used as in the question.
A chunk usually doesn't end at a linefeed, so search backwards from the end of the chunk to the nearest linefeed, limit the search to this slice, and remember the remaining data for the next pass.
The last chunk does not necessarily end in a linefeed.
With the help of the performant Go function bytes.Index, find the places where the pattern occurs in the chunk.
From each found location, search for the preceding and the following linefeed.
Then output the block up to the corresponding beginning of the line,
and continue the search from the end of the line in which the pattern occurred.
If the search finds no further location, output the rest.
Read the next chunk and apply the described steps again until the end of the file is reached.
Noteworthy
A read operation may return less data than the chunk size, so it makes sense to repeat the read operation until a full chunk has been read.
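Incidentally, the standard library's io.ReadFull implements exactly this repeat-until-full loop; here is a sketch of how the readChunk helper shown in the code below could be built on it (a variant for illustration, not the code that was benchmarked):
// readChunkFull is a variant of readChunk built on io.ReadFull, which
// loops internally until the buffer is full or an error occurs. A short
// final read surfaces as io.ErrUnexpectedEOF rather than io.EOF, so it
// is normalized here to match the plain readChunk behavior.
func readChunkFull(reader io.Reader, chunk, remaining []byte) (int, error) {
    copy(chunk, remaining)
    r := len(remaining)
    n, err := io.ReadFull(reader, chunk[r:])
    if err == io.ErrUnexpectedEOF {
        err = io.EOF
    }
    return r + n, err
}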
Benchmark
Optimized code is often significantly more complicated, but the performance is also significantly better, as we will see in a moment.
BenchmarkTestWithoutIgnore-8 270 4137267 ns/op
BenchmarkTestWithIgnore-8 54 22403931 ns/op
BenchmarkTestFilter-8 150 7947454 ns/op
Here, the optimized code BenchmarkTestFilter-8 is only about 1.9x slower than the operation without filtering, while the BenchmarkTestWithIgnore-8 method is 5.4x slower than the comparison value without filtering.
Looked at another way: the optimized code is 2.8 times faster than the unoptimized one.
Code
Of course, here is the code for your own tests:
func filterFile(f, pattern string) (*bytes.Buffer, error) {
    rfd, err := os.Open(f)
    if err != nil {
        log.Fatal(err)
    }
    defer func() {
        if err := rfd.Close(); err != nil {
            log.Fatal(err)
        }
    }()

    reader := bufio.NewReader(rfd)
    return filter(reader, []byte(pattern), 1024*1024)
}

// chunkSize must be larger than the longest line;
// a reasonable size is probably >= 64K
func filter(reader io.Reader, pattern []byte, chunkSize int) (*bytes.Buffer, error) {
    var bs []byte
    buffer := bytes.NewBuffer(bs)
    chunk := make([]byte, chunkSize)
    var remaining []byte
    for lastChunk := false; !lastChunk; {
        n, err := readChunk(reader, chunk, remaining, chunkSize)
        if err != nil {
            if err == io.EOF {
                lastChunk = true
            } else {
                return nil, err
            }
        }
        // carry the bytes after the last linefeed over to the next pass
        remaining = remaining[:0]
        if !lastChunk {
            for i := n - 1; i > 0; i-- {
                if chunk[i] == '\n' {
                    remaining = append(remaining, chunk[i+1:n]...)
                    n = i + 1
                    break
                }
            }
        }
        s := 0
        for s < n {
            hit := bytes.Index(chunk[s:n], pattern)
            if hit < 0 {
                break
            }
            hit += s
            // find the beginning of the line containing the hit
            startOfLine := hit
            for ; startOfLine > 0; startOfLine-- {
                if chunk[startOfLine] == '\n' {
                    startOfLine++
                    break
                }
            }
            // find the end of the line containing the hit
            endOfLine := hit + len(pattern)
            for ; endOfLine < n; endOfLine++ {
                if chunk[endOfLine] == '\n' {
                    break
                }
            }
            endOfLine++
            // output everything before the matching line, then skip it
            _, err = buffer.Write(chunk[s:startOfLine])
            if err != nil {
                return nil, err
            }
            s = endOfLine
        }
        if s < n {
            _, err = buffer.Write(chunk[s:n])
            if err != nil {
                return nil, err
            }
        }
    }
    return buffer, nil
}

func readChunk(reader io.Reader, chunk, remaining []byte, chunkSize int) (int, error) {
    // start the chunk with the leftover bytes from the previous pass,
    // then read until the chunk is full or the reader reports an error
    copy(chunk, remaining)
    r := len(remaining)
    for r < chunkSize {
        n, err := reader.Read(chunk[r:])
        r += n
        if err != nil {
            return r, err
        }
    }
    return r, nil
}
And the benchmark part might look something like this:
func BenchmarkTestFilter(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _, err := filterFile("base64dump.log", "AUDIT")
        if err != nil {
            b.Fatal(err)
        }
    }
}
The filter function was split off from filterFile, and the actual job is done in func filter(reader io.Reader, pattern []byte, chunkSize int) (*bytes.Buffer, error).
Injecting a reader and a chunkSize prepares the ground for unit tests, which are missing here but definitely recommended when dealing with index arithmetic.
However, the main point here was to find a way to improve performance significantly.

Dynamic FlatBuffers delimiter

I'm using FlatBuffers to send binary data over a unix socket. The flatbuffers I send are of dynamic length. The problem I'm facing is how to know how many bytes I have to read for one table.
Is there something like a delimiter that can be appended while sending, which I can use to determine the end of the flatbuffer?
When I tried with a smaller, fixed buffer size:
buf := make([]byte, 512)
nr, err := c.Read(buf)
if err != nil {
    fmt.Println("exit echo")
    return
}
And if a flatbuffer bigger than 512 bytes is read, this results in failure.
When I read by growing my buffer, I'm not able to find the end of the read:
var n, nr int
var err error
buf := make([]byte, 0, 4096) // big buffer
tmp := make([]byte, 512)
for {
    n, err = c.Read(tmp)
    if err != nil {
        break
    }
    nr += n
    if nr >= 4096 {
        err = errOverrun
        break
    }
    buf = append(buf, tmp[:n]...)
}
if err != nil {
    fmt.Println("read error:", err)
    break
}
FlatBuffers does not include a length field by design, since in most contexts the length is an implicit part of the storage or transfer of a buffer.
If you have no way to know the size of a buffer, or you are streaming buffers, the best option is to simply prefix each buffer with a 32-bit length field, which you can then use to read the rest of the data.
In the C++ API this is even built-in (see SizePrefixed functions), but this hasn't been ported to Go yet, so you'd have to do it manually.
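A minimal sketch of such manual framing, assuming a little-endian 32-bit prefix (the function names are illustrative; it uses encoding/binary and io from the standard library):
// writeSizePrefixed writes a 32-bit little-endian length followed by the
// buffer itself; readSizePrefixed reads exactly that many bytes back.
func writeSizePrefixed(w io.Writer, buf []byte) error {
    var size [4]byte
    binary.LittleEndian.PutUint32(size[:], uint32(len(buf)))
    if _, err := w.Write(size[:]); err != nil {
        return err
    }
    _, err := w.Write(buf)
    return err
}

func readSizePrefixed(r io.Reader) ([]byte, error) {
    var size [4]byte
    // io.ReadFull keeps reading until all 4 length bytes have arrived
    if _, err := io.ReadFull(r, size[:]); err != nil {
        return nil, err
    }
    buf := make([]byte, binary.LittleEndian.Uint32(size[:]))
    if _, err := io.ReadFull(r, buf); err != nil {
        return nil, err
    }
    return buf, nil
}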

Add prefix to io.Reader

I've written a little server which receives a blob of data in the form of an io.Reader, adds a header and streams the result back to the caller.
My implementation isn't particularly efficient, as I'm buffering the blob's data in memory so that I can calculate the blob's length, which needs to form part of the header.
I've seen some examples of io.Pipe() with io.TeeReader, but they're more for splitting an io.Reader in two and writing the halves away in parallel.
The blobs I'm dealing with are around 100KB, so not huge, but if my server gets busy, memory's quickly going to become an issue...
Any ideas?
func addHeader(in io.Reader) (out io.Reader, err error) {
    buf := new(bytes.Buffer)
    if _, err = io.Copy(buf, in); err != nil {
        return
    }
    header := bytes.NewReader([]byte(fmt.Sprintf("header:%d", buf.Len())))
    return io.MultiReader(header, buf), nil
}
I appreciate it's not a good idea to return interfaces from functions, but this code isn't destined to become an API, so I'm not too concerned about that.
In general, the only way to determine the length of data in an io.Reader is to read until EOF. There are ways to determine the length of the data for specific types.
func addHeader(in io.Reader) (out io.Reader, err error) {
    n := 0
    switch v := in.(type) {
    case *bytes.Buffer:
        n = v.Len()
    case *bytes.Reader:
        n = v.Len()
    case *strings.Reader:
        n = v.Len()
    case io.Seeker:
        // measure the length by seeking to the end and back
        cur, err := v.Seek(0, 1)
        if err != nil {
            return nil, err
        }
        end, err := v.Seek(0, 2)
        if err != nil {
            return nil, err
        }
        _, err = v.Seek(cur, 0)
        if err != nil {
            return nil, err
        }
        n = int(end - cur)
    default:
        // no cheaper option left: buffer the data to count it
        var buf bytes.Buffer
        if _, err := buf.ReadFrom(in); err != nil {
            return nil, err
        }
        n = buf.Len()
        in = &buf
    }
    header := strings.NewReader(fmt.Sprintf("header:%d", n))
    return io.MultiReader(header, in), nil
}
This is similar to how the net/http package determines the content length of the request body.
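A quick usage sketch (the sample input is illustrative):
func main() {
    out, err := addHeader(strings.NewReader("hello world"))
    if err != nil {
        log.Fatal(err)
    }
    // prints "header:11hello world"; the length came from
    // (*strings.Reader).Len() without consuming any data
    if _, err := io.Copy(os.Stdout, out); err != nil {
        log.Fatal(err)
    }
}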

File reading and encoding performance

I'm writing a web service which is supposed to receive an XML file from the user, read it, and save the data to a database.
This file is gzipped and encoded in UTF-16, so I have to ungzip it and save the XML to a file (for future purposes). Next I have to read the file into a string, decode it to UTF-8, and do something like xml.Unmarshal([]byte(xmlString), &report).
Currently this is without saving anything to a database.
On my local machine I've found that processing one request takes about 30% of my CPU and about 300 ms of time. For one request that looks okay, but I made a script which fires 100 requests simultaneously (via curl), and I saw that CPU usage went up to 100% and the time for one request increased to 2 s.
What I wanted to ask is: should I worry about it, or will things be okay on a real web server? Or maybe I'm doing something wrong.
Here is the code:
func Parse(filename string) Report {
    xmlString := getXml(filename)
    report := Report{}
    xml.Unmarshal([]byte(xmlString), &report)
    return report
}

func getXml(filename string) string {
    b, err := ioutil.ReadFile(filename)
    if err != nil {
        fmt.Println("Error opening file:", err)
    }
    s, err := decodeUTF16(b)
    if err != nil {
        panic(err)
    }
    pattern := `<?xml version="1.0" encoding="UTF-16"?>`
    res := strings.Replace(s, pattern, "", 1)
    return res
}

func decodeUTF16(b []byte) (string, error) {
    if len(b)%2 != 0 {
        return "", fmt.Errorf("Must have even length byte slice")
    }
    u16s := make([]uint16, 1)
    ret := &bytes.Buffer{}
    b8buf := make([]byte, 4)
    lb := len(b)
    for i := 0; i < lb; i += 2 {
        u16s[0] = uint16(b[i]) + (uint16(b[i+1]) << 8)
        r := utf16.Decode(u16s)
        n := utf8.EncodeRune(b8buf, r[0])
        ret.Write(b8buf[:n])
    }
    return ret.String(), nil
}
Please ask if I forgot something important!
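One direction that might help, sketched under the assumption that the input is little-endian UTF-16 with an optional BOM: the golang.org/x/text module provides a streaming UTF-16 decoder, so the file can be decoded and unmarshalled without building the intermediate string (parseUTF16XML is an illustrative name, not from the question):
import (
    "encoding/xml"
    "io"
    "os"

    "golang.org/x/text/encoding/unicode"
)

func parseUTF16XML(filename string) (Report, error) {
    var report Report
    f, err := os.Open(filename)
    if err != nil {
        return report, err
    }
    defer f.Close()

    // stream-decode UTF-16 to UTF-8 instead of converting rune by rune
    dec := unicode.UTF16(unicode.LittleEndian, unicode.UseBOM).NewDecoder()
    xd := xml.NewDecoder(dec.Reader(f))
    // the declaration still says encoding="UTF-16", but the stream is
    // already UTF-8 at this point, so accept it as-is
    xd.CharsetReader = func(charset string, input io.Reader) (io.Reader, error) {
        return input, nil
    }
    err = xd.Decode(&report)
    return report, err
}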

How do I read in a large flat file

I have a flat file that has 339,276 lines of text in it, for a size of 62.1 MB. I am attempting to read in all the lines, parse them based on some conditions I have, and then insert them into a database.
I originally attempted to use a bufio.Scan() loop and bufio.Text() to get each line, but I was running out of buffer space. I switched to using bufio.ReadLine/ReadString/ReadByte (I tried each) and had the same problem with each. I didn't have enough buffer space.
I tried using Read and setting the buffer size, but as the documentation says, it's actually a const that can be made smaller but never bigger than 64*1024 bytes. I then tried to use File.ReadAt, where I set the starting position and moved it along as I brought in each section, to no avail. I have looked at the following examples and explanations (not an exhaustive list):
Read text file into string array (and write)
How to Read last lines from a big file with Go every 10 secs
reading file line by line in go
How do I read in an entire file (either line by line or the whole thing at once) into a slice so I can then go do things to the lines?
Here is some code that I have tried:
file, err := os.Open(feedFolder + value)
handleError(err)
defer file.Close()

// fileInfo, _ := file.Stat()
var linesInFile []string
r := bufio.NewReader(file)
for {
    path, err := r.ReadLine("\n") // 0x0A separator = newline
    linesInFile = append(linesInFile, path)
    if err == io.EOF {
        fmt.Printf("End Of File: %s", err)
        break
    } else if err != nil {
        handleError(err) // if you return error
    }
}
fmt.Println("Last Line: ", linesInFile[len(linesInFile)-1])
Here is something else I tried:
var fileSize int64 = fileInfo.Size()
fmt.Printf("File Size: %d\t", fileSize)
var bufferSize int64 = 1024 * 60
bytes := make([]byte, bufferSize)
var fullFile []byte
var start int64 = 0
var interationCounter int64 = 1
var currentErr error = nil
for currentErr != io.EOF {
    _, currentErr = file.ReadAt(bytes, start)
    fullFile = append(fullFile, bytes...)
    start = (bufferSize * interationCounter) + 1
    interationCounter++
}
fmt.Printf("Err: %s\n", currentErr)
fmt.Printf("fullFile Size: %d\n", len(fullFile))
fmt.Printf("Start: %d", start)

var currentLine []string
for _, value := range fullFile {
    if string(value) != "\n" {
        currentLine = append(currentLine, string(value))
    } else {
        singleLine := strings.Join(currentLine, "")
        linesInFile = append(linesInFile, singleLine)
        currentLine = nil
    }
}
I am at a loss. Either I don't understand exactly how the buffer works or I don't understand something else. Thanks for reading.
bufio.Scan() and bufio.Text() in a loop work perfectly for me on files of much larger size, so I suspect you have lines exceeding the buffer capacity. In that case,
check your line endings,
and check which Go version you use, because path, err := r.ReadLine("\n") // 0x0A separator = newline does not match the current API. func (b *bufio.Reader) ReadLine() (line []byte, isPrefix bool, err error) has the return value isPrefix specifically for your use case:
http://golang.org/pkg/bufio/#Reader.ReadLine
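If long lines really are the culprit, the scanner's token limit can also be raised via Scanner.Buffer (available since Go 1.6); a minimal sketch, assuming file is an already-opened *os.File:
scanner := bufio.NewScanner(file)
// start with a 64 KB buffer, but allow tokens (lines) up to 1 MB
// instead of the default bufio.MaxScanTokenSize (64 KB)
scanner.Buffer(make([]byte, 64*1024), 1024*1024)
for scanner.Scan() {
    line := scanner.Text()
    _ = line // parse the line here
}
if err := scanner.Err(); err != nil {
    log.Fatal(err)
}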
It's not clear that it's necessary to read in all the lines before parsing them and inserting them into a database. Try to avoid that.
You have a small file: "a flat file that has 339276 line of text in it for a size of 62.1 MB." For example,
package main

import (
    "bytes"
    "fmt"
    "io"
    "io/ioutil"
)

func readLines(filename string) ([]string, error) {
    var lines []string
    file, err := ioutil.ReadFile(filename)
    if err != nil {
        return lines, err
    }
    buf := bytes.NewBuffer(file)
    for {
        line, err := buf.ReadString('\n')
        if len(line) == 0 {
            if err != nil {
                if err == io.EOF {
                    break
                }
                return lines, err
            }
        }
        lines = append(lines, line)
        if err != nil && err != io.EOF {
            return lines, err
        }
    }
    return lines, nil
}

func main() {
    // a flat file that has 339276 lines of text in it for a size of 62.1 MB
    filename := "flat.file"
    lines, err := readLines(filename)
    fmt.Println(len(lines))
    if err != nil {
        fmt.Println(err)
        return
    }
}
It seems to me this variant of readLines is shorter and faster than the one suggested by peterSO:
func readLines(filename string) (map[int]string, error) {
    lines := make(map[int]string)
    data, err := ioutil.ReadFile(filename)
    if err != nil {
        return nil, err
    }
    for n, line := range strings.Split(string(data), "\n") {
        lines[n] = line
    }
    return lines, nil
}
package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
)

func main() {
    FileName := "assets/file.txt"
    file, err := os.Open(FileName)
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        fmt.Println(scanner.Text())
    }
}
