Is there a faster alternative to ioutil.ReadFile? - go

I am trying to make a program for checking file duplicates based on md5 checksum.
I'm not sure whether I'm missing something, but this function uses about 16 GB of RAM when it reads the Xcode installer app (about 8 GB).
func search() {
unique := make(map[string]string)
files, err := ioutil.ReadDir(".")
if err != nil {
log.Println(err)
}
for _, file := range files {
fileName := file.Name()
fmt.Println("CHECKING:", fileName)
fi, err := os.Stat(fileName)
if err != nil {
fmt.Println(err)
continue
}
if fi.Mode().IsRegular() {
data, err := ioutil.ReadFile(fileName)
if err != nil {
fmt.Println(err)
continue
}
sum := md5.Sum(data)
hexDigest := hex.EncodeToString(sum[:])
if _, ok := unique[hexDigest]; !ok {
unique[hexDigest] = fileName
} else {
fmt.Println("DUPLICATE:", fileName)
}
}
}
}
From my debugging, the problem is in reading the file.
Is there a better way to do this?
thanks

There is an example in the Golang documentation, which covers your case.
package main
import (
"crypto/md5"
"fmt"
"io"
"log"
"os"
)
func main() {
f, err := os.Open("file.txt")
if err != nil {
log.Fatal(err)
}
defer f.Close()
h := md5.New()
if _, err := io.Copy(h, f); err != nil {
log.Fatal(err)
}
fmt.Printf("%x", h.Sum(nil))
}
For your case, just make sure to close the files in the loop and not defer them. Or put the logic into a function.

Sounds like the 16GB RAM is your problem, not speed per se.
Don't read the entire file into a variable with ReadFile; io.Copy from the Reader that os.Open gives you to the Writer that crypto/md5 provides (md5.New returns a hash.Hash, which embeds an io.Writer). That copies only a small buffer at a time instead of pulling the whole file into RAM.
This is a trick useful in a lot of places in Go; packages like text/template, compress/gzip, net/http, etc. work in terms of Readers and Writers. With them, you don't usually need to create huge []bytes or strings; you can hook I/O interfaces up to each other and let them pass around pieces of content for you. In a garbage collected language, saving memory tends to save you CPU work as well.

Related

Incorrect data re-transmission

I need my program to be in the middle of the connection and transfer data correctly in both directions. I wrote this code, but it does not work properly
package main
import (
"fmt"
"net"
)
func main() {
listener, err := net.Listen("tcp", ":8120")
if err != nil {
fmt.Println(err)
return
}
defer listener.Close()
fmt.Println("Server is listening...")
for {
var conn1, conn2 net.Conn
var err error
conn1, err = listener.Accept()
if err != nil {
fmt.Println(err)
continue // conn1 is nil here, so there is nothing to close
}
conn2, err = net.Dial("tcp", "185.151.245.51:80")
if err != nil {
fmt.Println(err)
conn1.Close() // conn2 is nil; close the accepted connection instead
continue
}
go handleConnection(conn1, conn2)
go handleConnection(conn2, conn1)
}
}
func handleConnection(conn1, conn2 net.Conn) {
defer conn1.Close()
for {
input := make([]byte, 1024)
n, err := conn1.Read(input)
if n == 0 || err != nil {
break
}
conn2.Write([]byte(input))
}
}
The problem is that the data gets corrupted.
For example, comparing the original file (left) with the one I received (right), the beginning is fine, but the end of the received file is unreadable.
I tried changing the size of the input slice. If the size is > 0 and < 8, everything is fine, but slow. If I make the input slice very large, the corruption gets worse.
What am I doing wrong?
In handleConnection, you always write 1024 bytes, no matter what conn1.Read returns.
You want to write the data like this:
conn2.Write(input[:n])
You should also check your top-level for loop. Are you sure you're not accepting multiple connections and smushing them all together? I'd sprinkle in some log statements so you can see when connections are made and closed.
Another (probably inconsequential) mistake is that you treat n==0 as a termination condition. The documentation of io.Reader recommends ignoring n==0 with err==nil. Without checking the code I can't be sure, but I expect that conn.Read never returns n==0 with err==nil, so it's unlikely that this is causing you trouble.
Although it doesn't affect correctness, you could also lift the definition of input out of the loop so that it's reused on each iteration; it's likely to reduce the amount of work the garbage collector has to do.

golang zlib reader output not being copied over to stdout

I've modified the official documentation example for the zlib package to use an opened file rather than a set of hardcoded bytes (code below).
The code reads in the contents of a source text file and compresses it with the zlib package. I then try to read back the compressed file and print its decompressed contents into stdout.
The code doesn't error, but it also doesn't do what I expect it to do; which is to display the decompressed file contents into stdout.
Also: is there another way of displaying this information, rather than using io.Copy?
package main
import (
"compress/zlib"
"io"
"log"
"os"
)
func main() {
var err error
// This defends against an error preventing `defer` from being called
// As log.Fatal otherwise calls `os.Exit`
defer func() {
if err != nil {
log.Fatalln("\nDeferred log: \n", err)
}
}()
src, err := os.Open("source.txt")
if err != nil {
return
}
defer src.Close()
dest, err := os.Create("new.txt")
if err != nil {
return
}
defer dest.Close()
zdest := zlib.NewWriter(dest)
defer zdest.Close()
if _, err := io.Copy(zdest, src); err != nil {
return
}
n, err := os.Open("new.txt")
if err != nil {
return
}
r, err := zlib.NewReader(n)
if err != nil {
return
}
defer r.Close()
io.Copy(os.Stdout, r)
err = os.Remove("new.txt")
if err != nil {
return
}
}
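Your deferred func doesn't see every error: if _, err := io.Copy(zdest, src) declares a new err scoped to that if statement, shadowing the outer variable, so the deferred check never sees it. The error from the final io.Copy(os.Stdout, r) is discarded entirely. If you want deferred calls to run, return an error from a separate function and call log.Fatal in main after that function returns.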
As for why you're not seeing any output, it's because you're deferring all the Close calls. The zlib.Writer isn't flushed until after the function exits, and neither is the destination file. Call Close() explicitly where you need it.
zdest := zlib.NewWriter(dest)
if _, err := io.Copy(zdest, src); err != nil {
log.Fatal(err)
}
zdest.Close()
dest.Close()
I think the defer calls and the "trick" error checking have muddled the logic of your code.
Files are only guaranteed to be written once they are flushed or closed. You copy into new.txt and then open it for reading without closing it first.
Deferring the Close of a file is neat inside a function that has multiple exits: it makes sure the file is closed once the function returns. But your main needs new.txt to be closed after the copy, before it is re-opened. So don't defer the Close here.
BTW: your defense against log.Fatal terminating the program without running your defers is, well, strange at least. The OS puts all the files into a proper state when the process exits; there is no need to complicate things like this.
Check the error from the second Copy:
2015/12/22 19:00:33
Deferred log:
unexpected EOF
exit status 1
The thing is, you need to close zdest immediately after you've done writing. Close it after the first Copy and it works.
I would suggest using io.MultiWriter.
That way you read from src only once, which makes little difference for small files but is faster for bigger ones.
w := io.MultiWriter(dest, os.Stdout)

Go: zlib uncompressing a slice of bytes

I am trying to parse a file that annoyingly consists of many separately zipped segments. I have parsed these segments one at a time into a slice of bytes and I want to uncompress them as I go.
Here is my current code for the decompression, which doesn't work. from and to are just set at the top as an example; in reality they are set by the code. data is the byte slice containing the entire file. I don't want to seek within it on disk because it's located on another server, so it's only realistic for me to load the entire file into a []byte first and then parse it.
from, to := 0, 1000;
b := bytes.NewReader(data[from:from+to])
z, err := zlib.NewReader(b)
CheckErr(err)
defer z.Close()
p := make([]byte,0,1024)
z.Read(p)
fmt.Println(string(p))
So how is it so massively difficult just to unzip a slice of bytes? Anyway...
The problem appears to be with how I am reading it out. The z.Read call doesn't seem to do anything.
How can I read the entire thing in one go into a slice of bytes?
Here's an outline for you. Note: In Go, CHECK FOR ERRORS!
package main
import (
"bytes"
"compress/zlib"
"fmt"
"io/ioutil"
)
func readSegment(data []byte, from, to int) ([]byte, error) {
b := bytes.NewReader(data[from : from+to])
z, err := zlib.NewReader(b)
if err != nil {
return nil, err
}
defer z.Close()
p, err := ioutil.ReadAll(z)
if err != nil {
return nil, err
}
return p, nil
}
func main() {
from, to := 0, 1000
data := make([]byte, from+to)
// ** parse input segments into data **
p, err := readSegment(data, from, to)
if err != nil {
fmt.Println(err)
return
}
fmt.Println(string(p))
}
Use ReadAll(r io.Reader) ([]byte, error) from the io/ioutil package, passing it the zlib reader z rather than the underlying bytes reader b:
p, err := ioutil.ReadAll(z)
CheckErr(err)
fmt.Println(string(p))
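Read fills at most len(p) bytes. Your slice is created with make([]byte,0,1024), which has length 0 (only its capacity is 1024), so Read has no room to store anything.
To read in chunks of 1024 bytes:
p := make([]byte, 1024)
for {
numBytes, err := z.Read(p)
// do what you want with p[:numBytes]; Read may return data together with an error
if err == io.EOF {
// you are done
break
}
if err != nil {
// handle the error
break
}
}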
If you are getting the data from a web server, you can even do
import (
"compress/zlib"
"io/ioutil"
"net/http"
)
...
resp, errGet := http.Get("http://example.com/somefile")
// do error handling
defer resp.Body.Close() // close only after the body has been read
z, errZ := zlib.NewReader(resp.Body)
// do error handling
defer z.Close()
p, err := ioutil.ReadAll(z)
// do error handling
since resp.Body is an io.Reader, like most I/O-related types.

Parallel zip compression in Go

I am trying to build a zip archive from a large number of small to medium-sized files. I want to be able to do this concurrently, since compression is CPU-intensive and I'm running on a multi-core server. I also don't want to hold the whole archive in memory, since it might turn out to be large.
My question is: do I have to compress every file myself and then manually combine everything with the zip headers, checksums, etc.?
Any help would be greatly appreciated.
I don't think you can combine the zip headers.
What you could do is, run the zip.Writer sequentially, in a separate goroutine, and then spawn a new goroutine for each file that you want to read, and pipe those to the goroutine that is zipping them.
This should reduce the IO overhead that you get by reading the files sequentially, although it probably won't leverage multiple cores for the archiving itself.
Here's a working example. Note that, to keep things simple,
it does not handle errors nicely, just panics if something goes wrong,
and it does not use the defer statement too much, to demonstrate the order in which things should happen.
Since defer is LIFO, it can sometimes be confusing when you stack a lot of them together.
package main
import (
"archive/zip"
"io"
"os"
"sync"
)
func ZipWriter(files chan *os.File) *sync.WaitGroup {
f, err := os.Create("out.zip")
if err != nil {
panic(err)
}
var wg sync.WaitGroup
wg.Add(1)
zw := zip.NewWriter(f)
go func() {
// Note the order (LIFO):
defer wg.Done() // 2. signal that we're done
defer f.Close() // 1. close the file
var err error
var fw io.Writer
for f := range files {
// Loop until channel is closed.
if fw, err = zw.Create(f.Name()); err != nil {
panic(err)
}
if _, err = io.Copy(fw, f); err != nil {
panic(err)
}
if err = f.Close(); err != nil {
panic(err)
}
}
// The zip writer must be closed *before* f.Close() is called!
if err = zw.Close(); err != nil {
panic(err)
}
}()
return &wg
}
func main() {
files := make(chan *os.File)
wait := ZipWriter(files)
// Send all files to the zip writer.
var wg sync.WaitGroup
wg.Add(len(os.Args)-1)
for i, name := range os.Args {
if i == 0 {
continue
}
// Read each file in parallel:
go func(name string) {
defer wg.Done()
f, err := os.Open(name)
if err != nil {
panic(err)
}
files <- f
}(name)
}
wg.Wait()
// Once we're done sending the files, we can close the channel.
close(files)
// This will cause ZipWriter to break out of the loop, close the file,
// and call wg.Done, which unblocks the Wait below:
wait.Wait()
}
Usage: go run example.go /path/to/*.log.
This is the order in which things should be happening:
1. Open the output file for writing.
2. Create a zip.Writer with that file.
3. Kick off a goroutine listening for files on a channel.
4. Go through each file; this can be done in one goroutine per file.
5. Send each file to the goroutine created in step 3.
6. After processing each file in said goroutine, close the file to free up resources.
7. Once each file has been sent to said goroutine, close the channel.
8. Wait until the zipping has been done (which is done sequentially).
9. Once zipping is done (channel exhausted), the zip writer should be closed.
10. Only when the zip writer is closed should the output file be closed.
11. Finally everything is closed, so call Done on the sync.WaitGroup to tell the calling function that we're good to go. (A channel could also be used here, but sync.WaitGroup seems more elegant.)
12. When you get the signal from the zip writer that everything is properly closed, you can exit from main and terminate nicely.
This might not answer your question, but I've been using similar code to generate zip archives on-the-fly for a web service some time ago. It performed quite well, even though the actual zipping was done in a single goroutine. Overcoming the IO bottleneck can already be an improvement.
From the look of it, you won't be able to parallelise the compression using the standard library archive/zip package because:
Compression is performed by the io.Writer returned by zip.Writer.Create or CreateHeader.
Calling Create/CreateHeader implicitly closes the writer returned by the previous call.
So passing the writers returned by Create to multiple goroutines and writing to them in parallel will not work.
If you wanted to write your own parallel zip writer, you'd probably want to structure it something like this:
1. Have multiple goroutines compress files using the compress/flate package, keeping track of the CRC32 value and length of the uncompressed data. Direct the output to temporary files, and note the compressed size of the data.
2. Once everything has been compressed, start writing the zip file, starting with the header.
3. For each compressed file, write out the file header followed by the contents of the corresponding temporary file.
4. Write out the central directory record and end record at the end of the file. All the required information should be available at this point.
For added parallelism, step 1 could be performed in parallel with the remaining steps by using a channel to indicate when compression of each file completes.
Due to the file format, you won't be able to perform parallel compression without either storing compressed data in memory or in temporary files.
As of Go 1.17, the archive/zip package makes it possible to compress in parallel and then merge the resulting zip files (zip.Writer.Copy copies an entry from an existing archive without recompressing it).
An example is below. It starts several zip workers, each of which writes its own temporary zip file, and an entry provider that sends the entries to be added over a channel to the zip workers. Real files could be passed to the zip workers instead, but I skipped that part.
package main
import (
"archive/zip"
"context"
"fmt"
"io"
"log"
"os"
"strings"
"golang.org/x/sync/errgroup"
)
const numOfZipWorkers = 10
type entry struct {
name string
rc io.ReadCloser
}
func main() {
log.SetFlags(log.LstdFlags | log.Lshortfile)
entCh := make(chan entry, numOfZipWorkers)
zpathCh := make(chan string, numOfZipWorkers)
group, ctx := errgroup.WithContext(context.Background())
for i := 0; i < numOfZipWorkers; i++ {
group.Go(func() error {
return zipWorker(ctx, entCh, zpathCh)
})
}
group.Go(func() error {
defer close(entCh) // Signal workers to stop.
return entryProvider(ctx, entCh)
})
err := group.Wait()
if err != nil {
log.Fatal(err)
}
f, err := os.OpenFile("output.zip", os.O_CREATE|os.O_TRUNC|os.O_WRONLY, 0644)
if err != nil {
log.Fatal(err)
}
zw := zip.NewWriter(f)
close(zpathCh)
for path := range zpathCh {
zrd, err := zip.OpenReader(path)
if err != nil {
log.Fatal(err)
}
for _, zf := range zrd.File {
err := zw.Copy(zf)
if err != nil {
log.Fatal(err)
}
}
_ = zrd.Close()
_ = os.Remove(path)
}
err = zw.Close()
if err != nil {
log.Fatal(err)
}
err = f.Close()
if err != nil {
log.Fatal(err)
}
}
func entryProvider(ctx context.Context, entCh chan<- entry) error {
for i := 0; i < 2*numOfZipWorkers; i++ {
select {
case <-ctx.Done():
return ctx.Err()
case entCh <- entry{
name: fmt.Sprintf("file_%d", i+1),
rc: io.NopCloser(strings.NewReader(fmt.Sprintf("content %d", i+1))),
}:
}
}
return nil
}
func zipWorker(ctx context.Context, entCh <-chan entry, zpathch chan<- string) error {
f, err := os.CreateTemp(".", "tmp-part-*")
if err != nil {
return err
}
zw := zip.NewWriter(f)
Loop:
for {
var (
ent entry
ok bool
)
select {
case <-ctx.Done():
err = ctx.Err()
break Loop
case ent, ok = <-entCh:
if !ok {
break Loop
}
}
hdr := &zip.FileHeader{
Name: ent.name,
Method: zip.Deflate, // zip.Store can also be used.
}
hdr.SetMode(0644)
w, e := zw.CreateHeader(hdr)
if e != nil {
_ = ent.rc.Close()
err = e
break
}
_, e = io.Copy(w, ent.rc)
_ = ent.rc.Close()
if e != nil {
err = e
break
}
}
if e := zw.Close(); e != nil && err == nil {
err = e
}
if e := f.Close(); e != nil && err == nil {
err = e
}
if err == nil {
select {
case <-ctx.Done():
err = ctx.Err()
case zpathch <- f.Name():
}
}
return err
}

Reading log files as they're updated in Go

I'm trying to parse some log files as they're being written in Go but I'm not sure how I would accomplish this without rereading the file again and again while checking for changes.
I'd like to be able to read to EOF, wait until the next line is written and read to EOF again, etc. It feels a bit like how tail -f looks.
I have written a Go package -- github.com/hpcloud/tail -- to do exactly this.
t, err := tail.TailFile("/var/log/nginx.log", tail.Config{Follow: true})
for line := range t.Lines {
fmt.Println(line.Text)
}
...
Quoting kostix's answer:
in real life files might be truncated, replaced or renamed (because that's what tools like logrotate are supposed to do).
If a file gets truncated, it will automatically be re-opened. To support re-opening renamed files (due to logrotate, etc.), you can set Config.ReOpen, viz.:
t, err := tail.TailFile("/var/log/nginx.log", tail.Config{
Follow: true,
ReOpen: true})
for line := range t.Lines {
fmt.Println(line.Text)
}
Config.ReOpen is analogous to tail -F (capital F):
-F The -F option implies the -f option, but tail will also check to see if the file being followed has been
renamed or rotated. The file is closed and reopened when tail detects that the filename being read from
has a new inode number. The -F option is ignored if reading from standard input rather than a file.
You have to either watch the file for changes (using an OS-specific subsystem to accomplish this) or poll it periodically to see whether its modification time (and size) changed. In either case, after reading another chunk of data you remember the file offset and restore it before reading another chunk after detecting the change.
But note that this seems to be easy only on paper: in real life files might be truncated, replaced or renamed (because that's what tools like logrotate are supposed to do).
See this question for more discussion of this problem.
A simple example:
package main
import (
"bufio"
"fmt"
"io"
"os"
"time"
)
func tail(filename string, out io.Writer) {
f, err := os.Open(filename)
if err != nil {
panic(err)
}
defer f.Close()
r := bufio.NewReader(f)
info, err := f.Stat()
if err != nil {
panic(err)
}
oldSize := info.Size()
for {
for line, prefix, err := r.ReadLine(); err != io.EOF; line, prefix, err = r.ReadLine() {
if prefix {
fmt.Fprint(out, string(line))
} else {
fmt.Fprintln(out, string(line))
}
}
pos, err := f.Seek(0, io.SeekCurrent)
if err != nil {
panic(err)
}
for {
time.Sleep(time.Second)
newinfo, err := f.Stat()
if err != nil {
panic(err)
}
newSize := newinfo.Size()
if newSize != oldSize {
if newSize < oldSize {
f.Seek(0, 0)
} else {
f.Seek(pos, io.SeekStart)
}
r = bufio.NewReader(f)
oldSize = newSize
break
}
}
}
}
func main() {
tail("x.txt", os.Stdout)
}
I'm also interested in doing this, but haven't (yet) had the time to tackle it. One approach that occurred to me is to let "tail" do the heavy lifting. It would likely make your tool platform-specific, but that may be OK. The basic idea would be to use Cmd from the "os/exec" package to follow the file. You could start a process equivalent to "tail --retry --follow=name prog.log", and then read its output through the pipe returned by the Cmd object's StdoutPipe method.
Sorry, I know it's just a sketch, but maybe it's helpful.
There are many ways to do this. On Linux, one can use the inotify interface; other operating systems provide similar file-notification mechanisms.
The fsnotify package wraps these behind one API: https://github.com/fsnotify/fsnotify
Sample code:
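watcher, err := fsnotify.NewWatcher()
if err != nil {
log.Fatal(err)
}
defer watcher.Close()
// fileName is the path of the log file to watch
err = watcher.Add(fileName)
if err != nil {
log.Fatal(err)
}
for {
select {
case event, ok := <-watcher.Events:
if !ok {
return // the watcher was closed
}
if event.Op&fsnotify.Write == fsnotify.Write {
log.Println("modified file:", event.Name)
}
case err, ok := <-watcher.Errors:
if !ok {
return
}
log.Println("error:", err)
}
}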
Hope this helps!
