I need someone to help, or at least any tip. I'm trying to read large files (100 MB to 11 GB) line by line and then store some data in a map.
var m map[string]string

// expensive func
func stress(s string, mutex sync.Mutex) {
    // some very costly operation... that's why I want to use goroutines
    mutex.Lock()
    m[s] = s // store result
    mutex.Unlock()
}
func main() {
    var mutex sync.Mutex

    file, err := os.Open("somefile.txt")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer func() {
        if err = file.Close(); err != nil {
            fmt.Println(err)
            return
        }
    }()

    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        go stress(scanner.Text(), mutex)
    }
}
Without goroutines it works fine but slowly. As you can see, the file is large, so within the loop there will be a lot of goroutines, and that creates two problems:
Sometimes the mutex doesn't work properly and the program crashes. (How many goroutines is a mutex supposed to handle?)
Every time, some data simply gets lost (but the program doesn't crash).
I suppose I should use a WaitGroup, but I can't work out how. I also guess there should be some limit on the number of goroutines, maybe a counter. It would be great to run this with 5-20 goroutines.
UPD. Yes, as #user229044 mentioned, I have to pass the mutex by pointer. But the problem of limiting the goroutines within the loop is still open.
UPD2. This is how I worked around the problem. I don't fully understand how the program handles these goroutines, or what they cost in memory and processing time. Also, almost all commenters point at the map structure, but the main problem was managing the runtime of the goroutines: how many goroutines get spawned if the Scan() loop runs 10 billion iterations, and how are those goroutines held in RAM?
func stress(s string, mutex *sync.Mutex) {
    // a lot of costly ops
    // ...
    // ...
    mutex.Lock()
    m[where] = result // store result
    mutex.Unlock()
    wg.Done()
}
// main
for scanner.Scan() {
    wg.Add(1)
    go func(data string) {
        stress(data, &mutex)
    }(scanner.Text())
}
wg.Wait()
Your specific problem is that you're copying the mutex by value. You should be passing a pointer to the mutex, so that a single instance of your mutex is shared by all function invocations. You're also spawning an unbounded number of goroutines, which will eventually exhaust your system's memory.
However, you can spawn as many goroutines as you want and you'll only be wasting resources for no gain; juggling all of those useless goroutines will probably cause a net loss of performance. Increased parallelism can't help you when every parallel process has to wait for serialized access to a data structure, as is the case with your map. sync.WaitGroup and mutexes are the wrong approach here.
Instead, to add and control concurrency, you want a buffered channel and a single goroutine responsible for the map inserts. This way you have one goroutine reading from the file and one goroutine inserting into the map, decoupling the disk IO from the map insertion.
Something like this:
scanner := bufio.NewScanner(file)
ch := make(chan string, 10)
done := make(chan struct{})

go func() {
    for s := range ch {
        m[s] = s
    }
    close(done) // signal that every received line has been inserted
}()

for scanner.Scan() {
    ch <- scanner.Text()
}
close(ch)
<-done // wait for the inserting goroutine to drain the channel before moving on
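If the per-line work really is expensive enough to parallelize, you can combine that single map-writing goroutine with a small, fixed pool of workers, which also answers the 5-20 goroutine limit from the question. The following is only a sketch under assumptions not in the answer above: stress is refactored to return its result instead of writing to the map itself, the worker count of 10 and the channel buffer sizes are arbitrary, and lines, results, workers and done are names I'm introducing.

// Hypothetical sketch: cap the expensive work at a fixed number of workers
// and keep a single goroutine that owns the map.
lines := make(chan string, 100) // buffered so the scanner rarely stalls
results := make(chan string, 100)

var workers sync.WaitGroup
for i := 0; i < 10; i++ { // 10 workers, within the 5-20 range from the question
    workers.Add(1)
    go func() {
        defer workers.Done()
        for s := range lines {
            results <- stress(s) // the costly CPU work happens here, in parallel
        }
    }()
}

done := make(chan struct{})
go func() { // the only goroutine that touches the map, so no mutex is needed
    defer close(done)
    for r := range results {
        m[r] = r
    }
}()

for scanner.Scan() {
    lines <- scanner.Text()
}
close(lines)   // no more input: workers drain lines and exit
workers.Wait() // all workers have finished
close(results) // now the map writer can finish
<-done         // wait for the last insert before returning

Memory stays bounded no matter how many lines the file has: at most ten workers plus the two channel buffers are alive at any moment.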
Related
I have this snippet of code which concurrently runs a function using an input and output channel and associated WaitGroups, but I was clued in to the fact that I've done some things wrong. Here's the code:
package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
    "sync"
    "time"
)

func main() {
    concurrency := 50

    var tasksWG sync.WaitGroup
    tasks := make(chan string)
    output := make(chan string)

    for i := 0; i < concurrency; i++ {
        tasksWG.Add(1)
        // evidently because I'm processing tasks in a goroutine I'm not blocking, and I end up
        // closing the tasks channel almost immediately and stopping tasks from executing
        go func() {
            for t := range tasks {
                output <- process(t)
            }
            tasksWG.Done()
        }()
    }

    var outputWG sync.WaitGroup
    outputWG.Add(1)
    go func() {
        for o := range output {
            fmt.Println(o)
        }
        outputWG.Done()
    }()

    go func() {
        // because of what was mentioned in the previous comment, the tasks wait group finishes
        // almost immediately, which then closes the output channel almost immediately as well,
        // which ends ranging over output early
        tasksWG.Wait()
        close(output)
    }()

    f, err := os.Open(os.Args[1])
    if err != nil {
        log.Panic(err)
    }
    s := bufio.NewScanner(f)
    for s.Scan() {
        tasks <- s.Text()
    }
    close(tasks)

    // and finally the output wait group finishes almost immediately as well, because tasks gets
    // closed right away due to my improper use of goroutines
    outputWG.Wait()
}

func process(t string) string {
    time.Sleep(3 * time.Second)
    return t
}
I've indicated in the comments where I've implemented things wrong, and those comments make sense to me. The funny thing is that this code does indeed seem to run asynchronously and dramatically speeds up execution. I want to understand what I've done wrong, but it's hard to wrap my head around it when the code seems to execute asynchronously anyway. I'd love to understand this better.
Your main goroutine does a couple of things sequentially and others concurrently, so I think your order of execution is off.
f, err := os.Open(os.Args[1])
if err != nil {
    log.Panic(err)
}
s := bufio.NewScanner(f)
for s.Scan() {
    tasks <- s.Text()
}
Shouldn't you move this up top, so that values are sent to tasks first?
Then have the loop that ranges over tasks 50 times in the concurrency for loop (you want to have something in tasks before calling code that ranges over it).
go func() {
    // because of what was mentioned in the previous comment, the tasks wait group finishes
    // almost immediately, which then closes the output channel almost immediately as well,
    // which ends ranging over output early
    tasksWG.Wait()
    close(output)
}()
The logic here confuses me: you're spawning a goroutine to wait on the WaitGroup, so the wait is non-blocking on the main goroutine. Is that what you want? It won't wait for tasksWG to be decremented to zero inside main; that happens inside the goroutine you've created. I don't believe you want to do that.
It might be easier to debug if you could give more details on the expected output?
The following example is taken from the Donovan/Kernighan book:
func makeThumbnails6(filenames <-chan string) int64 {
    sizes := make(chan int64)
    var wg sync.WaitGroup // number of working goroutines
    for f := range filenames {
        wg.Add(1)
        // worker
        go func(f string) {
            defer wg.Done()
            thumb, err := thumbnail.ImageFile(f)
            if err != nil {
                log.Println(err)
                return
            }
            info, _ := os.Stat(thumb) // OK to ignore error
            sizes <- info.Size()
        }(f)
    }
    // closer
    go func() {
        wg.Wait()
        close(sizes)
    }()
    var total int64
    for size := range sizes {
        total += size
    }
    return total
}
And the book states:
"These two operations, wait and close, must be concurrent with the
loop over sizes. Consider the alternatives: if the wait operation were
placed in the main goroutine before the loop, it would never end"
This is what I do not understand: if they are not put in a separate goroutine, then wg.Wait will block the main goroutine, so close(sizes) will happen once all the other goroutines have finished. Closing the sizes channel will still allow the loop to read all the already-sent messages from the channel, right?
Closing the sizes channel will still allow the loop to read all the already-sent messages from the channel, right?
Yes, but that's not the problem. The workers are all blocked writing to sizes (nothing is reading from it yet), so the process will deadlock if sizes is unbuffered. For the workers to complete, something needs to read from sizes; for wg.Wait() to complete, the workers need to complete.
But also, the range over sizes can't complete (i.e., find an empty, closed channel) until close(sizes) happens, which can't happen until the workers are complete (because they're the ones writing to sizes).
So wg.Wait() and close(sizes) both have to run concurrently with the range over sizes, not before it.
An unbuffered sizes channel would block every goroutine trying to write into it. wg.Wait() would never end because all the worker goroutines would be blocked at sizes <- info.Size().
So it is necessary that wait and close run concurrently with the loop over sizes.
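A tiny self-contained illustration of that deadlock (my own sketch, not from the book), with a single worker standing in for the thumbnail goroutines:

// Hypothetical sketch: waiting before draining an unbuffered channel deadlocks.
sizes := make(chan int64)
var wg sync.WaitGroup
wg.Add(1)
go func() {
    defer wg.Done()
    sizes <- 1 // blocks: nothing is receiving yet
}()

wg.Wait()    // blocks forever: the worker cannot finish its send
close(sizes) // never reached
for size := range sizes {
    _ = size
}

Moving wg.Wait() and close(sizes) into their own goroutine, as the book does, lets the range loop start receiving straight away, which is what unblocks the workers.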
In an effort to learn golang, I was looking through the go source for reverseproxy:
https://golang.org/src/net/http/httputil/reverseproxy.go
I found this block of code (truncated):
...
    errc := make(chan error, 1)
    spc := switchProtocolCopier{user: conn, backend: backConn}
    go spc.copyToBackend(errc)
    go spc.copyFromBackend(errc)
    <-errc
    return
}

// switchProtocolCopier exists so goroutines proxying data back and
// forth have nice names in stacks.
type switchProtocolCopier struct {
    user, backend io.ReadWriter
}

func (c switchProtocolCopier) copyFromBackend(errc chan<- error) {
    _, err := io.Copy(c.user, c.backend)
    errc <- err
}

func (c switchProtocolCopier) copyToBackend(errc chan<- error) {
    _, err := io.Copy(c.backend, c.user)
    errc <- err
}
The portion that caught my attention was the creation of the errc buffered channel. I thought (probably naively) that we would use an unbuffered channel and the later receive from errc would need to run twice, like this:
<-errc
<-errc
As written, I understand that reading from the channel will ensure at least one of the copy methods has run. I also understand that the first send to the channel will not block, while the second will block only if the first one has not yet been received.
What I don't understand, is why it is written like this. Is it to ensure that only one of the methods completes? If that is the case, couldn't they technically both run?
Thanks!
The channel of size one helps realize a binary semaphore.
Since at most one value is consumed from the channel (on line 549), changing the size of the channel to anything greater than one will not affect the currently exhibited behavior, which is to wait until at least one of the two goroutines completes its Copy operation.
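To make that concrete, here is a small sketch (copyA and copyB are hypothetical stand-ins for the two Copy calls, not names from the reverseproxy code):

errc := make(chan error, 1) // capacity 1: the send from the "losing" goroutine
                            // can complete even though nobody receives it
go func() { errc <- copyA() }()
go func() { errc <- copyB() }()
err := <-errc // whichever finishes first wins; the other goroutine still exits
_ = err

With an unbuffered channel and only one receive, the slower goroutine would block on its send forever and leak for the lifetime of the process.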
I have two goroutines, go doProcess_A() and go doProcess_B(). Both can call saveData(), an ordinary (non-goroutine) function.
Should I use go saveData() instead of saveData()?
Which one is safe?
var waitGroup sync.WaitGroup

func main() {
    for i := 0; i < 4; i++ {
        waitGroup.Add(2)
        go doProcess_A(i)
        go doProcess_B(i)
    }
    waitGroup.Wait()
}

func doProcess_A(i int) {
    // do process
    // the result will be stored in the data variable
    data := "processed data-A as string"
    uniqueFileName := "file_A_" + strconv.Itoa(i) + ".txt"
    saveData(uniqueFileName, data)
    waitGroup.Done()
}

func doProcess_B(i int) {
    // do some process
    // the result will be stored in the data variable
    data := "processed data-B as string"
    uniqueFileName := "file_B_" + strconv.Itoa(i) + ".txt"
    saveData(uniqueFileName, data)
    waitGroup.Done()
}

// write text file
func saveData(fileName, dataStr string) {
    // file names will be unique;
    // there is no chance of two calls using the same file name
    err := ioutil.WriteFile("out/"+fileName, []byte(dataStr), 0644)
    if err != nil {
        panic(err)
    }
}
Here, does one goroutine wait while the other goroutine's disk file operation is in progress?
Or do the two goroutines make their own copies of saveData()?
Goroutines typically don't wait for anything unless you explicitly tell them to, or unless an operation blocks on a channel or some other blocking call. In your code there is the possibility of a race condition, with unwanted results, if multiple goroutines call saveData() with the same filename. It appears that the two goroutines write to different files, so as long as the filenames are unique, calling saveData from a goroutine is safe. It doesn't make sense to use a separate goroutine to call saveData(); don't unnecessarily complicate your life, just call it directly in the doProcess_X functions.
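For contrast, if two goroutines could ever target the same file, you would have to serialize those writes yourself. A rough sketch of that case (fileMu and saveShared are hypothetical names, not part of the code above):

var fileMu sync.Mutex

// saveShared is only needed when file names are NOT guaranteed to be unique.
func saveShared(fileName, dataStr string) {
    fileMu.Lock() // one writer at a time touches the shared file
    defer fileMu.Unlock()
    if err := ioutil.WriteFile("out/"+fileName, []byte(dataStr), 0644); err != nil {
        panic(err)
    }
}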
Read more about goroutines and make sure you use them only where they are absolutely necessary: https://gobyexample.com/goroutines
Note: Just because you are writing a Go application doesn't mean you should litter it with goroutines. Read and understand what problem they solve so that you know the best time to use them.
I know there is a function called SetReadDeadline that can set a timeout on socket (net.Conn) reads, while io.Reader has no equivalent. One way around this is to start another goroutine as a timer, but that brings a new problem: the reader goroutine (the one blocked in Read) still exists even after the timeout:
func (self *TimeoutReader) Read(buf []byte) (n int, err error) {
    ch := make(chan bool)
    n = 0
    err = nil
    go func() { // this goroutine still exists even after a timeout
        n, err = self.reader.Read(buf)
        ch <- true
    }()
    select {
    case <-ch:
        return
    case <-time.After(self.timeout):
        return 0, errors.New("Timeout")
    }
}
A similar question was asked in this post, but the answer is unclear.
Do you have any good ideas for solving this problem?
Instead of setting a timeout directly on the read, you can close the os.File after a timeout. As documented at https://golang.org/pkg/os/#File.Close:
Close closes the File, rendering it unusable for I/O. On files that support SetDeadline, any pending I/O operations will be canceled and return immediately with an error.
This should cause your read to fail immediately.
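A sketch of how that could look (path and timeout are placeholders I'm assuming, not values from the question):

f, err := os.Open(path)
if err != nil {
    log.Fatal(err)
}
timer := time.AfterFunc(timeout, func() {
    f.Close() // cancels pending I/O on files that support SetDeadline
})
defer timer.Stop() // a read that finishes in time should not close the file later

buf := make([]byte, 4096)
n, err := f.Read(buf) // returns promptly with an error if the timer fired first
_ = n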
Your mistake here is something different:
When you read from the reader you only read once, and that is wrong:
go func() {
    n, err = self.reader.Read(buf) // this Read needs to be in a loop
    ch <- true
}()
Here is a simple example (https://play.golang.org/p/2AnhrbrhLrv)
buf := bytes.NewBufferString("0123456789")
r := make([]byte, 3)
n, err := buf.Read(r)
fmt.Println(string(r), n, err)
// Output: 012 3 <nil>
An io.Reader fills at most the length of the slice you give it, so if you logged the n variable in your code you would see that a single call does not read the whole file. Also, the select statement outside of your goroutine is in the wrong place.
go func() {
    a := make([]byte, 1024)
    for {
        select {
        case <-quit:
            result <- []byte{}
            return
        default:
            _, err = self.reader.Read(buf)
            if err == io.EOF {
                result <- a
                return
            }
        }
    }
}()
But there is something more: you want to implement the io.Reader interface, and Read() is called repeatedly until the file ends, so you should not start a goroutine inside it; each call only reads a chunk of the file.
Also, a timeout inside the Read() method doesn't help, because that timeout applies to each individual call and not to the whole file.
In addition to #apxp's point about looping over Read, you could use a buffer size of 1 byte so that you never block as long as there is data to read.
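A quick sketch of that idea, where r is whatever io.Reader you are wrapping and handle is a placeholder for your per-byte processing:

one := make([]byte, 1)
for {
    n, err := r.Read(one)
    if n > 0 {
        handle(one[0]) // process the single byte that is available
    }
    if err != nil {
        break // io.EOF or a real error ends the loop
    }
}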
When interacting with external resources anything can happen. It is possible for any given io.Reader implementation to simply block forever. Here, I'll write one for you...
type BlockingReader struct{}

func (BlockingReader) Read(b []byte) (int, error) {
    <-make(chan struct{})
    return 0, nil
}
Remember, anyone can implement an interface, so you can't assume it will behave like *os.File or any other standard library io.Reader. In addition to asinine coding like mine above, an io.Reader could legitimately connect to a resource that can block forever.
You cannot kill goroutines, so if an io.Reader truly blocks forever, the blocked goroutine will continue to consume resources until your application terminates. However, this shouldn't be a problem: a blocked goroutine does not consume much in the way of resources, and you should be fine as long as you don't blindly retry blocked Reads by spawning more goroutines.