Reading bytes from a file concurrently - go

I've written a program in Go that reads a single byte from a file and checks to see which bits are set. These files are usually pretty large (around 10 - 100 GB), so I don't want to read the entire file into memory. The program normally has to check millions of separate bytes.
Right now, the way I'm performing these reads is by using os.File.ReadAt(). This ended up being pretty slow, so I tried to use Goroutines to speed it up. For example:
var wg sync.WaitGroup
threadCount := 8
for i := 0; i < threadCount; i++ {
    wg.Add(1)
    go func(id int) {
        defer wg.Done()
        index := int64(id)
        myByte := make([]byte, 1)
        for index < numBytesInFile { // stop when this goroutine would read past the end of the file
            fmt.Println(file.ReadAt(myByte, index)) // ReadAt takes an int64 offset
            index += int64(threadCount)
        }
    }(i)
}
wg.Wait()
However, using goroutines here didn't speed the program up at all (in fact, it made it slightly slower due to overhead). I would have thought that files on disk could be read concurrently as long as they are opened in read-only mode (which I do in my program). Is what I'm asking for impossible, or is there some way I can make concurrent reads to a file in Go?

Your slowness is because of I/O, not CPU. Adding more threads will not speed up your program. Read about Amdahl's law: https://en.wikipedia.org/wiki/Amdahl%27s_law
If you do not want to read the full file into memory, you could either use a buffered reader and read it in parts (https://golang.org/pkg/bufio/#NewReader), or you could even consider the experimental memory-mapped files package: https://godoc.org/golang.org/x/exp/mmap
To learn more about memory-mapped files, see https://en.wikipedia.org/wiki/Memory-mapped_file
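A minimal sketch of the memory-mapped approach, assuming the golang.org/x/exp/mmap package linked above (the file name and offsets below are placeholders):

package main

import (
    "fmt"
    "log"

    "golang.org/x/exp/mmap"
)

func main() {
    // Open the file as a memory-mapped ReaderAt; the OS pages data in on
    // demand, so a 10-100 GB file is never loaded into memory as a whole.
    r, err := mmap.Open("data.bin") // placeholder file name
    if err != nil {
        log.Fatal(err)
    }
    defer r.Close()

    buf := make([]byte, 1)
    for _, off := range []int64{0, 4096, 1 << 20} { // placeholder offsets to check
        if _, err := r.ReadAt(buf, off); err != nil {
            log.Fatal(err)
        }
        fmt.Printf("byte at offset %d: %08b\n", off, buf[0])
    }
}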

Related

Performance issues while reading a file line by line with bufio.NewScanner

I am learning how to read very large files efficiently in Go. I have tried bufio.NewScanner and bufio.NewReader with ReadString('\n'). Of the two, NewScanner seems to be consistently faster (about 2:1).
For NewScanner I found that reading a file line by line takes much more time than running a unix cat command on the same file.
I have measured how long it takes to run this code:
package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    file, _ := os.Open("test")
    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        fmt.Println(scanner.Text())
    }
}
When I compare it against a regular unix cat, I get the following results:
$ time ./parser3 > /dev/null
19.13 real 13.81 user 5.94 sys
$ time cat test > /dev/null
0.83 real 0.08 user 0.74 sys
The time difference is consistent among several executions.
I understand that scanning for '\n' adds overhead compared to just copying data from input to output as cat does.
But seeing the difference between cat and this code snippet, I am asking myself whether this is the most efficient way to read a file line by line in Go.
As per MuffinTop's comment, this is the snippet of code that improves the speed. The performance penalty is not related to the use of Scanner, but to:
Not buffering the output
Using scanner.Text(), which allocates a string, instead of scanner.Bytes()
Performance after adding output buffering:
package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    file, _ := os.Open("test")
    w := bufio.NewWriter(os.Stdout)
    defer w.Flush() // flush whatever remains in the buffer before exiting
    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        fmt.Fprintln(w, scanner.Text())
    }
}
The above solution uses output buffering and takes 6.6 seconds, versus the original 19.1 seconds.
Performance after adding output buffering and using .Bytes() instead of .Text():
package main

import (
    "bufio"
    "os"
)

func main() {
    file, _ := os.Open("test")
    w := bufio.NewWriter(os.Stdout)
    defer w.Flush() // flush whatever remains in the buffer before exiting
    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        w.Write(scanner.Bytes())
        w.WriteByte('\n')
    }
}
The above solution uses output buffering, writes the scanner's bytes directly, and takes 2.2 seconds versus the original 19.1 seconds.
First consider what type of data you will be reading (CSV, JSON, ...), which system will be reading it, and why; all of these details must be taken into consideration. There is no one-size-fits-all answer, and the best developers know how to use the right tool for the job. Without knowing the RAM limits, the data, and so on, it is unclear. Will you be reading large amounts of JSON from a server, or parsing a small text file? Each approach has its purpose. Do not get into the habit of thinking one way is always better or worse than another; that limits your learning and skill set.
Ways to read a file:
Document parsing
reads all of the data from the file and turns it into an object.
Stream parsing
reads a single element at a time and then moves on to the next one.
Some methods may be faster, but they read the whole file into memory; what if your file is too big? Or what if your files are not that large and you are looking for duplicate names? If you scan a file line by line, perhaps while searching for multiple items, is reading line by line really the best way?
You should check out this post. Special thanks to #Schwern who already answered your question Here. Using bytes with the scanner is slightly faster. See: Link
I found that using scanner.Bytes() instead of scanner.Text() improves speed slightly on my machine. bufio's scanner.Bytes() method doesn't allocate any additional memory, whereas Text() creates a string from its buffer.
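A minimal benchmark sketch of that difference, assuming the test file from the question exists in the working directory (put this in a file ending in _test.go and run go test -bench=. -benchmem):

package main

import (
    "bufio"
    "os"
    "testing"
)

// scanFile reads the whole file line by line; the two benchmarks below
// differ only in whether each line is taken as a []byte view of the
// scanner's buffer or copied into a freshly allocated string.
func scanFile(b *testing.B, useText bool) {
    for i := 0; i < b.N; i++ {
        f, err := os.Open("test")
        if err != nil {
            b.Fatal(err)
        }
        scanner := bufio.NewScanner(f)
        total := 0
        for scanner.Scan() {
            if useText {
                total += len(scanner.Text()) // allocates a string per line
            } else {
                total += len(scanner.Bytes()) // reuses the scanner's buffer
            }
        }
        f.Close()
        _ = total
    }
}

func BenchmarkScannerText(b *testing.B)  { scanFile(b, true) }
func BenchmarkScannerBytes(b *testing.B) { scanFile(b, false) }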

How to prevent Go program from crashing after accidental panic?

Think of a large project which deals with tons of concurrent requests, each handled by its own goroutine. It happens that there is a bug in the code, and one of these requests will cause a panic due to a nil reference.
In Java, C# and many other languages, this would end up as an exception which would stop the request without any harm to other, healthy requests. In Go, that would crash the entire program.
AFAIK, I'd have to call recover() in every single new goroutine. Is that the only way to prevent the entire program from crashing?
UPDATE: adding a recover() call for every goroutine creation seems OK. What about third-party libraries? If a third party creates goroutines without a recover() safety net, it seems there is NOTHING to be done.
If you go the defer-recover-all-the-things route, I suggest investing some time to make sure that a clear error message is collected, with enough information to act on it promptly.
Writing the panic message to stderr/stdout is not great, as it will be very hard to find where the problem is. In my experience the best approach is to invest a bit of time getting your Go programs to handle errors in a reasonable way. errors.Wrap from "github.com/pkg/errors", for instance, allows you to wrap all errors and get a stack trace.
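A minimal sketch of that wrapping, assuming github.com/pkg/errors is available (readConfig and the path are placeholders):

package main

import (
    "fmt"
    "io/ioutil"
    "os"

    "github.com/pkg/errors"
)

// readConfig is a hypothetical helper: wrapping the error here records a
// stack trace pointing at this call site.
func readConfig(path string) ([]byte, error) {
    data, err := ioutil.ReadFile(path)
    if err != nil {
        return nil, errors.Wrap(err, "reading config")
    }
    return data, nil
}

func main() {
    if _, err := readConfig("/does/not/exist"); err != nil {
        // %+v prints the message chain plus the recorded stack trace.
        fmt.Fprintf(os.Stderr, "%+v\n", err)
    }
}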
Recovering from panics is often a necessary evil. Like you say, it's not ideal to crash the entire program just because one request caused a panic. In most cases recovering from panics will not backfire, but it is possible for a program to end up in an undefined, non-recoverable state that only a manual restart can fix. That being said, my suggestion in this case is to make sure your Go program exposes a way to create a goroutine dump.
Here's how to write a goroutine dump to stderr when SIGQUIT is sent to the Go program (e.g. kill -QUIT pid):
go func() {
    // Based on answers to this stackoverflow question:
    // https://stackoverflow.com/questions/19094099/how-to-dump-goroutine-stacktraces
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGQUIT)
    for {
        <-sigs
        fmt.Fprintln(os.Stderr, "=== received SIGQUIT ===")
        fmt.Fprintln(os.Stderr, "*** goroutine dump...")
        var buf []byte
        var bufsize int
        var stacklen int
        // Create a stack buffer of 1MB and grow it to at most 100MB if
        // necessary
        for bufsize = 1e6; bufsize < 100e6; bufsize *= 2 {
            buf = make([]byte, bufsize)
            stacklen = runtime.Stack(buf, true)
            if stacklen < bufsize {
                break
            }
        }
        fmt.Fprintln(os.Stderr, string(buf[:stacklen]))
        fmt.Fprintln(os.Stderr, "*** end of dump")
    }
}()
There is no way to handle a panic without the recover function. A good practice is to use a middleware-like wrapper as your "safe" function; check out this snippet:
https://play.golang.org/p/d_fQWzXnlAm
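The linked snippet is not reproduced above, so here is a minimal sketch of that middleware-like idea (the name safeGo is my own):

package main

import (
    "fmt"
    "log"
    "runtime/debug"
)

// safeGo runs fn in its own goroutine and turns any panic into a log entry
// instead of letting it crash the whole program.
func safeGo(fn func()) {
    go func() {
        defer func() {
            if r := recover(); r != nil {
                log.Printf("recovered from panic: %v\n%s", r, debug.Stack())
            }
        }()
        fn()
    }()
}

func main() {
    done := make(chan struct{})
    safeGo(func() {
        defer close(done)
        var p *int
        fmt.Println(*p) // nil dereference: panics, but only this goroutine dies
    })
    <-done
    fmt.Println("main is still running")
}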

How to track memory usage accurately?

I am trying to build a small tool that will allow me to run a program and track its memory usage through Go. I am using r.exec = exec.Command(r.Command, r.CommandArgs...) to run the command, and runtime.MemStats to track memory usage (in a separate goroutine):
func monitorRuntime() {
    m := &runtime.MemStats{}
    f, err := os.Create(fmt.Sprintf("mmem_%s.csv", getFileTimeStamp()))
    if err != nil {
        panic(err)
    }
    f.WriteString("Time;Allocated;Total Allocated; System Memory;Num Gc;Heap Allocated;Heap System;Heap Objects;Heap Released;\n")
    for {
        runtime.ReadMemStats(m)
        f.WriteString(fmt.Sprintf("%s;%d;%d;%d;%d;%d;%d;%d;%d;\n", getTimeStamp(), m.Alloc, m.TotalAlloc, m.Sys, m.NumGC, m.HeapAlloc, m.HeapSys, m.HeapObjects, m.HeapReleased))
        time.Sleep(5 * time.Second)
    }
}
When I tested my code with a simple program that just sits there (for about 12 hours), I noticed that Go is constantly allocating more memory:
(graphs of System Memory and Heap Allocation omitted)
I did a few more tests, such as running the monitorRuntime() function without any other code, or using pprof:
package main

import (
    "net/http"
    _ "net/http/pprof"
)

func main() {
    http.ListenAndServe(":8080", nil)
}
Yet I still noticed that memory allocation keeps going up, just like in the graphs.
How can I accurately track the memory usage of the program I want to run through Go?
I know one way, which I used in the past, is to use /proc/$PID/statm, but that file doesn't exist on every operating system (such as macOS or Windows).
There isn't a way in standard Go to get the memory usage of a program called from exec.Command. runtime.ReadMemStats only reports memory tracked by the Go runtime (which, in this case, is only the file handling and Sprintf).
Your best bet is to execute platform-specific commands to get memory usage.
On Linux (RedHat) the following will show memory usage:
ps -No pid,comm,size,vsize,args:90
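A minimal sketch of doing that from Go, assuming a Unix-like ps that supports -o rss= -p PID (true on Linux and macOS); the sleep command below is just a placeholder for r.Command:

package main

import (
    "fmt"
    "os/exec"
    "strconv"
    "strings"
)

// rssKB asks ps for the resident set size (in kilobytes) of the given process.
func rssKB(pid int) (int, error) {
    out, err := exec.Command("ps", "-o", "rss=", "-p", strconv.Itoa(pid)).Output()
    if err != nil {
        return 0, err
    }
    return strconv.Atoi(strings.TrimSpace(string(out)))
}

func main() {
    cmd := exec.Command("sleep", "30") // placeholder for the monitored command
    if err := cmd.Start(); err != nil {
        panic(err)
    }
    kb, err := rssKB(cmd.Process.Pid)
    if err != nil {
        panic(err)
    }
    fmt.Printf("child pid %d uses %d KB of resident memory\n", cmd.Process.Pid, kb)
    cmd.Wait()
}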

Why do goroutines only run on one core? [duplicate]

I'm testing this Go code on my VirtualBoxed Ubuntu 11.4
package main

import (
    "big"
    "fmt"
    "time"
)

var c chan *big.Int

func sum(start, stop, step int64) {
    bigStop := big.NewInt(stop)
    bigStep := big.NewInt(step)
    bigSum := big.NewInt(0)
    for i := big.NewInt(start); i.Cmp(bigStop) < 0; i.Add(i, bigStep) {
        bigSum.Add(bigSum, i)
    }
    c <- bigSum
}

func main() {
    s := big.NewInt(0)
    n := time.Nanoseconds()
    step := int64(4)
    c = make(chan *big.Int, int(step))
    stop := int64(100000000)
    for j := int64(0); j < step; j++ {
        go sum(j, stop, step)
    }
    for j := int64(0); j < step; j++ {
        s.Add(s, <-c)
    }
    n = time.Nanoseconds() - n
    fmt.Println(s, float64(n)/1000000000.)
}
Ubuntu has access to all 4 of my cores. I checked this by running several executables simultaneously and watching System Monitor.
But when I run this code, it uses only one core and gains nothing from parallel processing.
What am I doing wrong?
You probably need to review the Concurrency section of the Go FAQ, specifically these two questions, and work out which (if not both) apply to your case:
Why doesn't my multi-goroutine program use multiple CPUs?
You must set the GOMAXPROCS shell environment variable or use the similarly-named function of the runtime package to allow the run-time support to utilize more than one OS thread. Programs that perform parallel computation should benefit from an increase in GOMAXPROCS. However, be aware that concurrency is not parallelism.
Why does using GOMAXPROCS > 1 sometimes make my program slower?
It depends on the nature of your program. Programs that contain several goroutines that spend a lot of time communicating on channels will experience performance degradation when using multiple OS threads. This is because of the significant context-switching penalty involved in sending data between threads.
Go's goroutine scheduler is not as good as it needs to be. In future, it should recognize such cases and optimize its use of OS threads. For now, GOMAXPROCS should be set on a per-application basis.
For more detail on this topic see the talk entitled Concurrency is not Parallelism.
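A minimal sketch of the runtime-package route mentioned in the FAQ quote (on modern Go releases GOMAXPROCS already defaults to the number of cores, so this mainly matters for older versions like the one in the question):

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // Allow the scheduler to use every available core.
    runtime.GOMAXPROCS(runtime.NumCPU())
    fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0)) // passing 0 just queries the current value
}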

Go bug in ioutil.ReadFile()

I am running a program in Go which sends data continuously after reading the file /proc/stat.
Using ioutil.ReadFile("/proc/stat")
After running for about 14 hours I got the error: too many open files /proc/stat
Click here for a snippet of the code.
I suspect that the deferred f.Close() is sometimes ignored or skipped by Go.
The snippet of code (in case play.golang.org dies sooner than stackoverflow.com):
package main

import (
    "fmt"
    "io/ioutil"
)

func main() {
    for {
        fmt.Println("Hello, playground")
        fData, err := ioutil.ReadFile("/proc/stat")
        if err != nil {
            fmt.Println("Err is ", err)
        }
        fmt.Println("FileData", string(fData))
    }
}
The reason probably is that somewhere in your program:
you are forgetting to close files, or
you are leaning on the garbage collector to automatically close files on object finalization, but Go's conservative garbage collector fails to do so. In this case you should check your program's memory consumption (whether it is steadily increasing while the program is running).
In either case, try to check the contents of /proc/PID/fd to see whether the number of open files is increasing while the program is running.
If you are sure you do call f.Close() and the problem persists, it may be caused by another kind of connection, for example a connection to MySQL, that you forget to close, especially inside a loop.
Always do:
db.connection....
defer db.Close()
If it is in a loop:
loop
    db.connection....
    defer db.Close()
end
Do not open the db connection before the loop.
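As a note on the loop advice above: a defer inside a plain loop only runs when the surrounding function returns, so one way to make it fire on every iteration is to wrap the iteration body in its own function. A minimal sketch (readOnce and the loop count are placeholders; ioutil.ReadFile itself already closes its file internally, so the pattern matters for resources you open yourself):

package main

import (
    "fmt"
    "io/ioutil"
    "os"
)

// readOnce wraps one iteration in its own function, so the deferred Close
// runs at the end of every iteration instead of piling up until main returns.
func readOnce(path string) ([]byte, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close() // runs when readOnce returns, i.e. once per iteration
    return ioutil.ReadAll(f)
}

func main() {
    for i := 0; i < 3; i++ {
        data, err := readOnce("/proc/stat")
        if err != nil {
            fmt.Println("Err is ", err)
            continue
        }
        fmt.Println(len(data), "bytes read")
    }
}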

Resources