Say I have a large list of strings that I want to sort. Beyond the usual sort.Sort, sort.Slice, etc., I wanted to use more than one core to speed things up. So while reading the large list I add the strings to 2 different slices: strings that start with a-m and strings that start with n-z (for argument's sake).
Meanwhile I've fired up multiple goroutines that read a channel of string slices and then sort their own sublists. So far, so good: "potentially" parallel processing of the lists, so my sort time is effectively halved. Great. Now my question is: how do I get the results back to the main goroutine?
Originally each goroutine had 2 channels, one for the incoming unsorted list and the other for the sorted list. Yes, it works... but it uses SOOO much memory (hey, given the volume of data I'm tinkering with for this test, that's probably not unreasonable). But then it dawned on me that passing a slice on a channel is really just passing a reference, so I don't actually NEED to pass anything back. Not having to put the resulting sorted lists in a channel for the return journey is obviously far less taxing memory-wise, but it (to me) smells.
This means I could have one of the goroutines sorting away while the main goroutine (in theory) is manipulating the same list. As long as discipline is used this wouldn't be an issue, but it is still obviously a concern. Is there a generally accepted best practice within Go that says references shouldn't be passed as input from one goroutine to another, but that it IS acceptable for a goroutine that generates reference data to return it via a channel (since the goroutine would then stop using the reference)?
Before anyone says it: yes, I know I don't have to pass these in via channels etc., but this is just the case I was tinkering with and it got me thinking.
Long and hand-wavy, I know. Here's a minimal subset of code showing the above.
package main
import (
"bufio"
"fmt"
"os"
"sort"
"strings"
"sync"
)
var wg sync.WaitGroup
func sortWordsList(id int, ch chan []string ) {
l := <- ch
sort.Strings(l)
wg.Done()
}
func main() {
file, err := os.Open("big.txt")
if err != nil {
fmt.Printf("BOOM %s\n", err.Error())
panic(err)
}
// Defer Close only after confirming the file was opened successfully.
defer file.Close()
// Start reading from the file with a reader.
reader := bufio.NewReader(file)
inCh1 := make(chan []string, 1000)
inCh2 := make(chan []string, 1000)
// Register both workers with the WaitGroup before starting them.
wg.Add(2)
go sortWordsList(1, inCh1)
go sortWordsList(2, inCh2)
words1 := []string{}
words2 := []string{}
for {
line, err := reader.ReadString('\n')
if err != nil {
break
}
sp := strings.Split(line, " ")
for _,w := range sp {
word := strings.ToLower(w)
word = strings.TrimSuffix(word, "\n")
if len(word) > 0 {
// figure out where to go.
// arbitrary split.
if word[0] < 'm' {
words1 = append(words1, word)
} else {
words2 = append(words2, word)
}
}
}
}
inCh1 <- words1
inCh2 <- words2
close(inCh1)
close(inCh2)
wg.Wait()
// now have sorted words1 and words2 slices.
}
There is nothing wrong with passing pointers, slices, or maps. As long as you synchronize access to the shared variable, you can pass a pointer and keep on using it in the sending goroutine. For large objects like arrays or large structs, passing a pointer is usually the logical thing to do to avoid expensive copies. Also, avoiding passing pointers would mean avoiding passing slices and maps, or anything that contains slices, maps, or pointers to other structs.
As you already know, you don't really need channels here. Simply start your goroutines after you have constructed your slices, and pass the slices directly:
go sortWordsList(words1)
go sortWordsList(words2)
or:
go sort.Strings(words1)
go sort.Strings(words2)
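With either form, the main goroutine still needs some way to know that both sorts have finished before it touches the slices again. A minimal sketch of that handoff, assuming a helper named sortConcurrently (a hypothetical name, not part of the original code) and a sync.WaitGroup as the synchronization point:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// sortConcurrently is a hypothetical helper: it sorts each slice in its own
// goroutine and returns only after every sort has finished.
func sortConcurrently(lists ...[]string) {
	var wg sync.WaitGroup
	for _, l := range lists {
		wg.Add(1)
		go func(l []string) {
			defer wg.Done()
			// Sorts in place: l shares its backing array with the caller's slice.
			sort.Strings(l)
		}(l)
	}
	// Until Wait returns, the caller must not read or write the slices.
	wg.Wait()
}

func main() {
	words1 := []string{"kiwi", "apple", "fig"}
	words2 := []string{"zebra", "nut", "pear"}
	sortConcurrently(words1, words2)
	fmt.Println(words1, words2)
}
```

The slices are never copied or sent back; ownership is simply lent to the sorting goroutines until wg.Wait() returns.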
I was exploring the possibility of concurrently accessing a map with fixed keys without a lock, to improve performance.
I've explored something similar with a slice before, and it seems to work:
func TestConcurrentSlice(t *testing.T) {
fixed := []int{1, 2, 3}
wg := &sync.WaitGroup{}
for i := 0; i < len(fixed); i++ {
idx := i
wg.Add(1)
go func() {
defer wg.Done()
fixed[idx]++
}()
}
wg.Wait()
fmt.Printf("%v\n", fixed)
}
The above code passes the -race test.
That gave me the confidence to try the same thing with a map of fixed size (a fixed number of keys). I assumed that if the number of keys doesn't change, the underlying array (in the map) does not need to grow, so it would be safe to access different keys (different memory locations) from different goroutines. So I wrote this test:
type simpleStruct struct {
val int
}
func TestConcurrentAccessMap(t *testing.T) {
fixed := map[string]*simpleStruct{
"a": {0},
"b": {0},
}
wg := &sync.WaitGroup{}
// here I use array instead of iterating the map to avoid read access
keys := []string{"a", "b"}
for _, k := range keys {
kcopy := k
wg.Add(1)
go func() {
defer wg.Done()
// this failed the race test
fixed[kcopy] = &simpleStruct{}
// this actually can pass the race test!
//fixed[kcopy].val++
}()
}
wg.Wait()
}
However, the test failed the race test with an error message reporting a concurrent write by the runtime.mapassign_faststr() function.
One more interesting thing I found is that the code I've commented out, "fixed[kcopy].val++", actually passes the race test (I assume that's because the writes go to different memory locations). But since the goroutines are accessing different keys of the map, why does it fail the race test?
Accessing different slice elements without synchronization from multiple goroutines is OK, because each slice element acts as an individual variable. For details, see "Can I concurrently write different slice elements".
However, this is not the case with maps. A value for a specific key does not act as a variable, and it is not addressable (because the actual memory location where the value is stored may change internally, at the sole discretion of the implementation).
So with maps, the general rule applies: if the map is accessed from multiple goroutines where at least one of them is a write (assign a value to a key), explicit synchronization is needed.
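One way to apply that rule to the failing test is to guard every map write with a mutex. The following is only a sketch under assumptions: replaceAll is a hypothetical helper name, and sync.Mutex is one of several possible choices (sync.RWMutex or sync.Map could also fit, depending on the access pattern).

```go
package main

import (
	"fmt"
	"sync"
)

type simpleStruct struct {
	val int
}

// replaceAll is a hypothetical helper: one goroutine per key replaces the
// map entry, and a mutex serializes every write to the map.
func replaceAll(m map[string]*simpleStruct, keys []string) {
	var (
		mu sync.Mutex
		wg sync.WaitGroup
	)
	for _, k := range keys {
		wg.Add(1)
		go func(k string) {
			defer wg.Done()
			mu.Lock()
			defer mu.Unlock()
			// Assigning to a map key mutates the map itself, so it needs the lock.
			m[k] = &simpleStruct{val: 1}
		}(k)
	}
	wg.Wait()
}

func main() {
	fixed := map[string]*simpleStruct{"a": {0}, "b": {0}}
	replaceAll(fixed, []string{"a", "b"})
	fmt.Println(fixed["a"].val, fixed["b"].val)
}
```

The lock makes the goroutines take turns on the map, so this buys safety rather than speed; if only the pointed-to structs change, leaving the map untouched avoids the lock entirely.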
A very simple and common case in Go is shown below, but the result is not what I expected.
package main
import (
"fmt"
"time"
)
func main() {
consumer(generator())
for {
time.Sleep(time.Duration(time.Second))
}
}
// simple generator through channel
func generator() <-chan []byte {
ret := make(chan []byte)
go func() {
// make buf outside of loop, and result is not expected
var ch = byte('A')
count := 0
buf := make([]byte, 1)
for {
if count > 10 {
return
}
// make buf inside loop, and result is expected
// buf := make([]byte, 1)
buf[0] = ch
ret <- buf
ch++
count++
// time.Sleep(time.Duration(time.Second))
}
}()
return ret
}
// simple consumer through channel
func consumer(recv <-chan []byte) {
go func() {
for buf := range recv {
fmt.Println("received:" + string(buf[0]))
}
}()
}
output:
received:A
received:B
received:D
received:D
received:F
received:F
received:H
received:H
received:J
received:J
received:K
In generator, if I create the buf variable inside the for loop, the result is what I expected:
received:A
received:B
received:C
received:D
received:E
received:F
received:G
received:H
received:I
received:J
received:K
I was thinking that even if buf is created outside the for loop and never reallocated, after we write it to the channel the receiver will read it before the next write can happen, so its content should not be overwritten. But it looks like Go does not behave this way. What went wrong here?
Problem: your code contains a data race
Save your program in a file named main.go; then run it with the race detector: go run -race main.go. You should see something like the following:
$ go run -race main.go
received:A
==================
WARNING: DATA RACE
Write at 0x00c000180000 by goroutine 7:
main.generator.func1()
/redacted/main.go:29 +0x8c
Previous read at 0x00c000180000 by goroutine 8:
main.consumer.func1()
/redacted/main.go:43 +0x55
The race detector tells you your program contains a data race because two goroutines are writing to and reading from the same memory without synchronisation:
the anonymous function launched as a goroutine in your generator function updates its local variable named buf at line 29;
the anonymous function launched as a goroutine in your consumer function reads from its local variable named buf at line 43.
The data race stems from the conjunction of two things:
Although local variable buf in consumer is just a copy of the homonymous local variable in generator, those slice variables are coupled because they refer to the same underlying array.
See the relevant section of the language specification (https://golang.org/ref/spec#Slice_types):
A slice, once initialized, is always associated with an underlying array that holds its elements. A slice therefore shares storage with its array and with other slices of the same array [...]
Operations on slices are not concurrency-safe and require proper synchronisation if performed concurrently (i.e. from multiple goroutines at the same time).
What your code displays is a typical case of aliasing. You would do well to familiarise yourself with how slices work.
Solution
You could eliminate the data race by using a one-byte array ([1]byte) instead of a slice, but arrays are quite inflexible in Go. Whether you really need to use a slice of bytes at all here is unclear. Since you're effectively only sending one byte at a time to the channel, why not simply use a chan byte rather than a chan []byte?
Other improvements unrelated to the data race include:
modifying the API of your two functions to make them synchronous (and therefore, easier to reason about);
simplifying the generator logic and closing the channel so that main can actually terminate;
simplifying the consumer logic and not spawning a goroutine for it.
package main
import "fmt"
func main() {
ch := make(chan byte)
go generator(ch)
consumer(ch)
}
func generator(ch chan<- byte) {
var c byte = 'A'
for i := 0; i < 10; i++ {
ch <- c
c++
}
close(ch)
}
func consumer(ch <-chan byte) {
for c := range ch {
fmt.Printf("received: %c\n", c)
}
}
The case is very simple. Both goroutines hold a reference to the same buffer, so sending it on the channel does not by itself protect its contents. While the consumer is still reading the received buffer, the generator is fast enough to modify it, which is why characters get skipped. To fix this you have to introduce another channel (that will send the buffer back) or pass a copy of the buffer.
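The "pass a copy" option could look roughly like the sketch below. It keeps the chan []byte shape of the question; the parameter n and the variable name out are assumptions introduced for illustration.

```go
package main

import "fmt"

// generator sends a fresh copy of buf on every iteration, so the receiver
// never shares a backing array with the goroutine that keeps writing to buf.
func generator(n int) <-chan []byte {
	ret := make(chan []byte)
	go func() {
		// Closing the channel lets the consumer's range loop terminate.
		defer close(ret)
		buf := make([]byte, 1)
		for i := 0; i < n; i++ {
			buf[0] = byte('A' + i)
			// Copy before sending: out has its own backing array.
			out := make([]byte, len(buf))
			copy(out, buf)
			ret <- out
		}
	}()
	return ret
}

func main() {
	for buf := range generator(11) {
		fmt.Println("received:" + string(buf[0]))
	}
}
```

Each send now allocates, which is the price of giving the receiver exclusive ownership; sending the buffer back on a second channel would avoid the allocation at the cost of extra coordination.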
I have a slice that contains work to be done, and a slice that will contain the results when everything is done. The following is a sketch of my general process:
var results = make([]Result, len(jobs))
wg := sync.WaitGroup{}
for i, job := range jobs {
wg.Add(1)
go func(i int, j Job) { // Job: the element type of jobs (the loop variable job cannot be used as a type)
defer wg.Done()
var r Result = doWork(j)
results[i] = r
}(i, job)
}
wg.Wait()
// Use results
It seems to work, but I have not tested it thoroughly and am not sure if it is safe to do. Generally I would not feel good letting multiple goroutines write to anything, but in this case, each goroutine is limited to its own index in the slice, which is pre-allocated.
I suppose the alternative is collecting results via a channel, but since order of results matters, this seemed rather simple. Is it safe to write into slice elements this way?
The rule is simple: if multiple goroutines access a variable concurrently, and at least one of the accesses is a write, then synchronization is required.
Your example does not violate this rule. You don't write the slice value (the slice header), you only read it (implicitly, when you index it).
You don't read the slice elements, you only modify the slice elements. And each goroutine only modifies a single, different, designated slice element. And since each slice element has its own address (own memory space), they are like distinct variables. This is covered in Spec: Variables:
Structured variables of array, slice, and struct types have elements and fields that may be addressed individually. Each such element acts like a variable.
What must be kept in mind is that you can't read the results from the results slice without synchronization. And the waitgroup you used in your example is a sufficient synchronization. You are allowed to read the slice once wg.Wait() returns, because that can only happen after all worker goroutines called wg.Done(), and none of the worker goroutines modify the elements after they called wg.Done().
For example, this is a valid (safe) way to check / process the results:
wg.Wait()
// Safe to read results after the above synchronization point:
fmt.Println(results)
But if you try to access the elements of results before wg.Wait() returns, that's a data race:
// This is a data race! Goroutines might still be running and modifying elements of results!
fmt.Println(results)
wg.Wait()
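Put together, the safe pattern can be sketched as a small self-contained program. The function squareAll and the squaring step are hypothetical stand-ins for the question's doWork, chosen only so the example runs on its own.

```go
package main

import (
	"fmt"
	"sync"
)

// squareAll is a hypothetical stand-in for the doWork loop: each goroutine
// writes only to results[i], its own element of the pre-allocated slice.
func squareAll(jobs []int) []int {
	results := make([]int, len(jobs))
	var wg sync.WaitGroup
	for i, job := range jobs {
		wg.Add(1)
		go func(i, job int) {
			defer wg.Done()
			results[i] = job * job // distinct index per goroutine, no lock needed
		}(i, job)
	}
	// Synchronization point: every write above happens before Wait returns.
	wg.Wait()
	return results
}

func main() {
	fmt.Println(squareAll([]int{1, 2, 3, 4}))
}
```

Running this with go run -race should report no races, because no two goroutines touch the same element and the only read happens after wg.Wait().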
Yes, it's perfectly legal: a slice has an array as its underlying data storage, and, being a compound type, an array is a sequence of "elements" which behave as individual variables with distinct memory locations; modifying them concurrently is fine.
Just be sure to synchronize the shutdown of your worker goroutines with the main one before it reads the updated contents of the slice. Using sync.WaitGroup for this (as you do) is perfectly fine.
Also, as @icza said, you must not modify the slice value itself (which is a struct containing a pointer to the backing storage array, the capacity and the length).
YES, YOU CAN.
tl;dr
The golang.org/x/sync/errgroup package documentation uses the same pattern in its Example (Parallel):
Google := func(ctx context.Context, query string) ([]Result, error) {
g, ctx := errgroup.WithContext(ctx)
searches := []Search{Web, Image, Video}
results := make([]Result, len(searches))
for i, search := range searches {
i, search := i, search
g.Go(func() error {
result, err := search(ctx, query)
if err == nil {
results[i] = result
}
return err
})
}
if err := g.Wait(); err != nil {
return nil, err
}
return results, nil
}
// ...
The reading part isn't concurrent but the processing is. I phrased the title this way because I'm most likely to search for this problem again using that phrase. :)
I'm getting a deadlock after trying to go beyond the examples so this is a learning experience for me. My goals are these:
Read a file line by line (eventually use a buffer to do groups of lines).
Pass off the text to a func() that does some regex work.
Send the results somewhere but avoid mutexes or shared variables. I'm sending ints (always the number 1) to a channel. It's sort of silly but if it's not causing problems I'd like to leave it like this unless you folks have a neater option.
Use a worker pool to do this. I'm not sure how I tell the workers to requeue themselves?
Here is the playground link. I tried to write helpful comments, hopefully this makes sense. My design could be completely wrong so don't hesitate to refactor.
package main
import (
"bufio"
"fmt"
"regexp"
"strings"
"sync"
)
func telephoneNumbersInFile(path string) int {
file := strings.NewReader(path)
var telephone = regexp.MustCompile(`\(\d+\)\s\d+-\d+`)
// do I need buffered channels here?
jobs := make(chan string)
results := make(chan int)
// I think we need a wait group, not sure.
wg := new(sync.WaitGroup)
// start up some workers that will block and wait?
for w := 1; w <= 3; w++ {
wg.Add(1)
go matchTelephoneNumbers(jobs, results, wg, telephone)
}
// go over a file line by line and queue up a ton of work
scanner := bufio.NewScanner(file)
for scanner.Scan() {
// Later I want to create a buffer of lines, not just line-by-line here ...
jobs <- scanner.Text()
}
close(jobs)
wg.Wait()
// Add up the results from the results channel.
// The rest of this isn't even working so ignore for now.
counts := 0
// for v := range results {
// counts += v
// }
return counts
}
func matchTelephoneNumbers(jobs <-chan string, results chan<- int, wg *sync.WaitGroup, telephone *regexp.Regexp) {
// Decreasing internal counter for wait-group as soon as goroutine finishes
defer wg.Done()
// eventually I want to have a []string channel to work on a chunk of lines not just one line of text
for j := range jobs {
if telephone.MatchString(j) {
results <- 1
}
}
}
func main() {
// An artificial input source. Normally this is a file passed on the command line.
const input = "Foo\n(555) 123-3456\nBar\nBaz"
numberOfTelephoneNumbers := telephoneNumbersInFile(input)
fmt.Println(numberOfTelephoneNumbers)
}
You're almost there; you just need a little bit of work on the goroutines' synchronisation. Your problem is that you're trying to feed the parsers and collect the results in the same goroutine, but with unbuffered channels that can't be done: a worker blocks sending on results while main is still feeding jobs or waiting on the WaitGroup, so nothing ever receives.
I propose the following:
Run the scanner in a separate goroutine, and close the input channel once everything is read.
Run a separate goroutine that waits for the parsers to finish their job, then closes the output channel.
Collect all the results in your main goroutine.
The relevant changes could look like this:
// Go over a file line by line and queue up a ton of work
go func() {
scanner := bufio.NewScanner(file)
for scanner.Scan() {
jobs <- scanner.Text()
}
close(jobs)
}()
// Collect all the results...
// First, make sure we close the result channel when everything was processed
go func() {
wg.Wait()
close(results)
}()
// Now, add up the results from the results channel until closed
counts := 0
for v := range results {
counts += v
}
Fully working example on the playground: http://play.golang.org/p/coja1_w-fY
It's worth adding that you don't necessarily need the WaitGroup to achieve the same; all you need to know is when to stop receiving results. This could be achieved, for example, by the scanner advertising (on a channel) how many lines were read and the collector then reading only that number of results (you would need to send zeros as well, though).
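For reference, the three-stage layout (feeder goroutine, worker pool, collector in the calling goroutine) can be assembled into one runnable sketch. This is not the code behind the playground link; countTelephoneNumbers and its parameters are hypothetical names, and the input is passed as a string only to keep the example self-contained.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
	"sync"
)

var telephone = regexp.MustCompile(`\(\d+\)\s\d+-\d+`)

// countTelephoneNumbers is a hypothetical helper that counts matching lines
// using a feeder goroutine, a pool of workers, and a collector loop.
func countTelephoneNumbers(input string, workers int) int {
	jobs := make(chan string)
	results := make(chan int)

	// Feeder: send every line, then close jobs so the workers' loops end.
	go func() {
		scanner := bufio.NewScanner(strings.NewReader(input))
		for scanner.Scan() {
			jobs <- scanner.Text()
		}
		close(jobs)
	}()

	// Workers: each one pulls lines until jobs is closed.
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for line := range jobs {
				if telephone.MatchString(line) {
					results <- 1
				}
			}
		}()
	}

	// Closer: once all workers are done, no more results can arrive.
	go func() {
		wg.Wait()
		close(results)
	}()

	// Collector: runs in the calling goroutine until results is closed.
	count := 0
	for v := range results {
		count += v
	}
	return count
}

func main() {
	const input = "Foo\n(555) 123-3456\nBar\nBaz"
	fmt.Println(countTelephoneNumbers(input, 3))
}
```

Neither channel needs a buffer here, because every send always has a goroutine on the other side that will eventually receive.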
Edit: The answer by @tomasz above is the correct one. Please disregard this answer.
You need to do two things:
use buffered channels so that sending doesn't block
close the results chan so that receiving doesn't block.
The use of buffered channels is essential because unbuffered channels need a receive for each send, which is causing the deadlock you're hitting.
If you fix that, you'll run into a deadlock when you try to receive the results, because results hasn't been closed.
Here's the fixed playground: http://play.golang.org/p/DtS8Matgi5
For an assignment we are using Go, and one of the things we are going to do is parse a UniProt database file line by line to collect UniProt records.
I prefer not to share too much code, but I have a working code snippet that parses such a file (2.5 GB) correctly in 48 s (measured using the time Go package). It parses the file iteratively, adding lines to a record until a record end signal is reached (a full record), and then metadata on the record is created. Then the record string is reset, and a new record is collected line by line. Then I thought that I would try to use goroutines.
I have got some tips before from Stack Overflow, and to the original code I simply added a function to handle everything concerning the metadata creation.
So, the code is doing the following:
1) create an empty record,
2) iterate the file and add lines to the record,
3) if a record stop signal is found (now we have a full record), give it to a goroutine to create the metadata,
4) reset the record string and continue from 2).
I also added a sync.WaitGroup() to make sure that I waited (in the end) for each routine to finish. I thought that this would actually lower the time spent on parsing the databasefile as it continued to parse while the goroutines would act on each record. However, the code seems to run for more than 20 minutes indicating that something is wrong or the overhead went crazy. Any suggestions?
package main
import (
"bufio"
"crypto/sha1"
"fmt"
"io"
"log"
"os"
"strings"
"sync"
"time"
)
type producer struct {
parser uniprot
}
type unit struct {
tag string
}
type uniprot struct {
filenames []string
recordUnits chan unit
recordStrings map[string]string
}
func main() {
p := producer{parser: uniprot{}}
p.parser.recordUnits = make(chan unit, 1000000)
p.parser.recordStrings = make(map[string]string)
p.parser.collectRecords(os.Args[1])
}
func (u *uniprot) collectRecords(name string) {
fmt.Println("file to open ", name)
t0 := time.Now()
wg := new(sync.WaitGroup)
record := []string{}
file, err := os.Open(name)
errorCheck(err)
scanner := bufio.NewScanner(file)
for scanner.Scan() { //Scan the file
retText := scanner.Text()
if strings.HasPrefix(retText, "//") {
wg.Add(1)
go u.handleRecord(record, wg)
record = []string{}
} else {
record = append(record, retText)
}
}
file.Close()
wg.Wait()
t1 := time.Now()
fmt.Println(t1.Sub(t0))
}
func (u *uniprot) handleRecord(record []string, wg *sync.WaitGroup) {
defer wg.Done()
recString := strings.Join(record, "\n")
t := hashfunc(recString)
u.recordUnits <- unit{tag: t}
u.recordStrings[t] = recString
}
func hashfunc(record string) (hashtag string) {
hash := sha1.New()
io.WriteString(hash, record)
hashtag = string(hash.Sum(nil))
return
}
func errorCheck(err error) {
if err != nil {
log.Fatal(err)
}
}
First of all: your code is not thread-safe, mainly because you're accessing a hashmap concurrently. Maps are not safe for concurrent use in Go and need to be locked. Faulty line in your code:
u.recordStrings[t] = recString
As this will blow up when you're running Go with GOMAXPROCS > 1, I'm assuming that you're not doing that. Make sure you're running your application with GOMAXPROCS=2 or higher to achieve parallelism.
In Go versions before 1.5 the default value is 1, so your code runs on one single OS thread which, of course, can't be scheduled on two CPUs or CPU cores simultaneously. (Since Go 1.5 the default is the number of available CPUs.) Example:
$ GOMAXPROCS=2 go run udb.go uniprot_sprot_viruses.dat
Lastly: pull the values from the channel, otherwise your program may not terminate.
You're creating a deadlock if the number of records exceeds the channel's capacity. I tested with a 76MiB file of data; you said your file was about 2.5GB. I have 16347 entries. Assuming linear growth, your file holds roughly half a million records; if the real number exceeds 1e6, there are not enough slots in the channel and your program will deadlock, giving no result while accumulating blocked goroutines until it fails at the end (miserably).
So the solution should be to add a goroutine which pulls the values from the channel and does something with them.
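The two fixes discussed in this answer (locking the map and draining the channel) can be sketched together as follows. This is a minimal sketch, not the original program: processRecords, the deliberately tiny channel buffer of 4, and the in-memory record list are assumptions for illustration.

```go
package main

import (
	"crypto/sha1"
	"fmt"
	"sync"
)

type unit struct {
	tag string
}

// processRecords is a hypothetical helper: it hashes each record in its own
// goroutine, stores it in a mutex-protected map, and drains the unit channel
// concurrently so senders never block for good.
func processRecords(records []string) (tags int, stored map[string]string) {
	recordUnits := make(chan unit, 4) // small on purpose: the drain keeps it flowing
	stored = make(map[string]string)
	var (
		mu sync.Mutex
		wg sync.WaitGroup
	)

	// Consumer: pulls values off the channel until it is closed.
	done := make(chan struct{})
	go func() {
		defer close(done)
		for range recordUnits {
			tags++
		}
	}()

	for _, rec := range records {
		wg.Add(1)
		go func(rec string) {
			defer wg.Done()
			sum := sha1.Sum([]byte(rec))
			t := string(sum[:])
			recordUnits <- unit{tag: t}
			mu.Lock() // map writes must be serialized
			stored[t] = rec
			mu.Unlock()
		}(rec)
	}

	wg.Wait()          // all producers have sent and stored
	close(recordUnits) // lets the consumer's range loop finish
	<-done             // after this, tags is safe to read
	return tags, stored
}

func main() {
	tags, stored := processRecords([]string{"rec one", "rec two", "rec three"})
	fmt.Println(tags, len(stored))
}
```

Starting one goroutine per record is kept here only to mirror the question; for millions of records a fixed pool of workers reading from a jobs channel is likely to have less scheduling overhead.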
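As a side note: if you're worried about performance, avoid building strings where you don't need them, since every conversion or join allocates and copies. Working with []byte instead (for example via scanner.Bytes()) can avoid those copies.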