Creating a parallel word counter in Go

I am trying to create a word counter that returns an array of the number of times each word in a text file appears. Moreover, I have been assigned to parallelize this program.
My initial attempt at this task was as follows:
Implementation 1
func WordCount(words []string, startWord int, endWord int, waitGroup *sync.WaitGroup, freqsChannel chan<- map[string]int) {
freqs := make(map[string]int)
for i := startWord; i < endWord; i++ {
word := words[i]
freqs[word]++
}
freqsChannel <- freqs
waitGroup.Done()
}
func ParallelWordCount(text string) map[string]int {
// Split text into string array of the words in text.
text = strings.ToLower(text)
text = strings.ReplaceAll(text, ",", "")
text = strings.ReplaceAll(text, ".", "")
words := strings.Fields(text)
length := len(words)
threads := 28
freqsChannel := make(chan map[string]int, threads)
var waitGroup sync.WaitGroup
waitGroup.Add(threads)
defer waitGroup.Wait()
wordsPerThread := length / threads // always rounds down
wordsInLastThread := length - (threads-1)*wordsPerThread
startWord := -wordsPerThread
endWord := 0
for i := 1; i <= threads; i++ {
if i < threads {
startWord += wordsPerThread
endWord += wordsPerThread
} else {
startWord += wordsInLastThread
endWord += wordsInLastThread
}
go WordCount(words, startWord, endWord, &waitGroup, freqsChannel)
}
freqs := <-freqsChannel
for i := 1; i < threads; i++ {
subFreqs := <-freqsChannel
for word, count := range subFreqs {
freqs[word] += count
}
}
return freqs
}
According to my teaching assistant, this was not a good solution as the pre-processing of the text file carried out by
text = strings.ToLower(text)
text = strings.ReplaceAll(text, ",", "")
text = strings.ReplaceAll(text, ".", "")
words := strings.Fields(text)
in ParallelWordCount goes against the idea of parallel processing.
Now, to fix this, I have moved the responsibility for turning the text into an array of words into the WordCount function, which is called on separate goroutines for different parts of the text file. Below is the code for my second implementation.
Implementation 2
func WordCount(text string, waitGroup *sync.WaitGroup, freqsChannel chan<- map[string]int) {
freqs := make(map[string]int)
text = strings.ToLower(text)
text = strings.ReplaceAll(text, ",", "")
text = strings.ReplaceAll(text, ".", "")
words := strings.Fields(text)
for _, value := range words {
freqs[value]++
}
freqsChannel <- freqs
waitGroup.Done()
}
func splitCount(str string, subStrings int, waitGroup *sync.WaitGroup, freqsChannel chan<- map[string]int) {
if subStrings != 1 {
length := len(str)
charsPerSubstring := length / subStrings
i := 0
for str[charsPerSubstring+i] != ' ' {
i++
}
subString := str[0 : charsPerSubstring+i+1]
go WordCount(subString, waitGroup, freqsChannel)
splitCount(str[charsPerSubstring+i+1:length], subStrings-1, waitGroup, freqsChannel)
} else {
go WordCount(str, waitGroup, freqsChannel)
}
}
func ParallelWordCount(text string) map[string]int {
threads := 28
freqsChannel := make(chan map[string]int, threads)
var waitGroup sync.WaitGroup
waitGroup.Add(threads)
defer waitGroup.Wait()
splitCount(text, threads, &waitGroup, freqsChannel)
// Collect and return frequencies
freqs := <-freqsChannel
for i := 1; i < threads; i++ {
subFreqs := <-freqsChannel
for word, count := range subFreqs {
freqs[word] += count
}
}
return freqs
}
The average runtime of this implementation is 3 ms compared to the old average of 5 ms. Have I thoroughly addressed the issue raised by my teaching assistant, or does the second implementation also fail to take full advantage of parallel processing to efficiently count the words of a text file?

Two things that I see:
The second example is better, as you have split the text parsing and word counting across several goroutines. One thing you can try is to not count words in the WordCount method, but instead push them to the channel and increment the counts in the main counter. You can check whether that is any faster; I'm not sure. Also, check the fan-in pattern for more details.
Parallel processing might still not be fully utilized, because I don't believe you have 28 CPU cores available :). The number of cores determines how many WordCount goroutines run in parallel; the rest are scheduled concurrently based on available resources (available CPU cores). Here is a great article explaining this.
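As a rough illustration of the fan-in idea mentioned above, the sketch below starts one goroutine per chunk and merges the partial maps in the caller. The names countChunk and countParallel are made up for this example, and the chunks are assumed to be already split on word boundaries; runtime.NumCPU() is only printed as a possible starting point for the number of chunks.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
	"sync"
)

// countChunk normalizes one chunk the same way the question does
// and sends its partial word counts on out.
func countChunk(chunk string, out chan<- map[string]int, wg *sync.WaitGroup) {
	defer wg.Done()
	chunk = strings.ToLower(chunk)
	chunk = strings.ReplaceAll(chunk, ",", "")
	chunk = strings.ReplaceAll(chunk, ".", "")
	freqs := make(map[string]int)
	for _, w := range strings.Fields(chunk) {
		freqs[w]++
	}
	out <- freqs
}

// countParallel (hypothetical name) fans the chunks out to goroutines
// and fans the partial results back in to a single map.
func countParallel(chunks []string) map[string]int {
	// The buffer holds one result per chunk, so no sender ever blocks.
	out := make(chan map[string]int, len(chunks))
	var wg sync.WaitGroup
	wg.Add(len(chunks))
	for _, c := range chunks {
		go countChunk(c, out, &wg)
	}
	wg.Wait()
	close(out) // all senders are done, so range below terminates

	total := make(map[string]int)
	for partial := range out {
		for w, n := range partial {
			total[w] += n
		}
	}
	return total
}

func main() {
	// One chunk per core is an assumption, not a measured optimum.
	fmt.Println("cores available:", runtime.NumCPU())
	fmt.Println(countParallel([]string{"the cat sat.", "The dog sat,"}))
}
```

Because the number of goroutines follows len(chunks), the WaitGroup count and the number of results always agree, whatever the input size.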

Implementation 2 Issues
In splitCount(), consider what happens when the text is shorter than 28 characters or contains fewer than 28 words: it cannot be cut into 28 chunks, and the scan for the next space (str[charsPerSubstring+i]) can run past the end of the string.
It will also fail in that scenario because ParallelWordCount calls waitGroup.Add(28) and receives from the channel 28 times, which assumes that exactly 28 WordCount goroutines report back.
Calling splitCount() recursively is making it slow. We should split and launch the goroutines in a loop.
The number of threads should not always be 28, as we don't know how many words the string contains.
I will try to develop a more optimised approach and will update the answer.
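One possible shape for the loop-based split suggested above is sketched here. The helper splitAtSpaces is hypothetical (it is not from the question); like the original code it only treats ' ' as a word boundary, and it returns fewer than n chunks when the text is too short instead of indexing past the end.

```go
package main

import "fmt"

// splitAtSpaces (hypothetical helper) cuts text into at most n chunks,
// moving each cut forward to the next space so that no word is split.
func splitAtSpaces(text string, n int) []string {
	var chunks []string
	for n > 1 && len(text) > 0 {
		cut := len(text) / n
		// Advance to the next space, but never past the end of the text.
		for cut < len(text) && text[cut] != ' ' {
			cut++
		}
		if cut == len(text) {
			break // no space left: the remainder becomes the last chunk
		}
		if cut > 0 {
			chunks = append(chunks, text[:cut])
		}
		text = text[cut+1:] // skip the space itself
		n--
	}
	if len(text) > 0 {
		chunks = append(chunks, text)
	}
	return chunks
}

func main() {
	for _, c := range splitAtSpaces("one two three four five six", 3) {
		fmt.Printf("%q\n", c)
	}
}
```

The caller can then size its WaitGroup and its result channel from the length of the returned slice rather than from a fixed 28.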

Related

Not able to scrape data when trying to use go routines

I am trying to scrape related words for a given word, for which I am using BFS starting with the given word and searching through each related word on dictionary.com.
I have tried this code without concurrency and it works just fine, but it takes a lot of time, so I tried using goroutines, but my code gets stuck after the first iteration. The first level of BFS works just fine, but in the second level it hangs!
package main
import (
"fmt"
"github.com/gocolly/colly"
"sync"
)
var wg sync.WaitGroup
func buildURL(word string) string {
return "https://www.dictionary.com/browse/" + string(word)
}
func get(url string) []string {
c := colly.NewCollector()
c.IgnoreRobotsTxt = true
var ret []string
c.OnHTML("a.css-cilpq1.e15p0a5t2", func(e *colly.HTMLElement) {
ret = append(ret, string(e.Text))
})
c.Visit(url)
c.Wait()
return ret
}
func threading(c chan []string, word string) {
defer wg.Done()
var words []string
for _, w := range get(buildURL(word)) {
words = append(words, w)
}
c <- words
}
func main() {
fmt.Println("START")
word := "jump"
maxDepth := 2
//bfs
var q map[string]int
nq := map[string]int {
word: 0,
}
vis := make(map[string]bool)
queue := make(chan []string, 5000)
for i := 1; i <= maxDepth; i++ {
fmt.Println(i)
q, nq = nq, make(map[string]int)
for word := range q {
if _, ok := vis[word]; !ok {
wg.Add(1)
vis[word] = true
go threading(queue, word)
for v := range queue {
fmt.Println(v)
for _, w := range v {
nq[w] = i
}
}
}
}
}
wg.Wait()
close(queue)
fmt.Println("END")
}
OUTPUT:
START
1
[plunge dive rise upsurge bounce hurdle fall vault drop advance upturn inflation increment spurt boost plummet skip bound surge take]
hangs just here forever, counter = 2 is not printed!
Can check here https://www.dictionary.com/browse/jump for the related words.
According to Tour of Go
Sends to a buffered channel block only when the buffer is full.
Receives block when the buffer is empty.
So, in this case, you are creating a buffered channel using 5000 as length.
for i := 1; i <= maxDepth; i++ {
fmt.Println(i)
q, nq = nq, make(map[string]int)
for word := range q { // for each word
if _, ok := vis[word]; !ok { // if not visited visit
wg.Add(1) // add a worker
vis[word] = true
go threading(queue, word) // fetch in concurrent manner
for v := range queue { // <<< blocks here when queue is empty
fmt.Println(v)
for _, w := range v {
nq[w] = i
}
}
}
}
}
As the comment in the code shows, after the first iteration the inner for loop blocks whenever the channel is empty, and a range over a channel only ends once the channel is closed. Here, after fetching jump, the goroutine sends the slice of related words, but because the loop keeps blocking, as zerkems explains, you never get to the next iteration (i = 2). You could close the channel to end the blocking for loop, but since several goroutines write to the same channel, closing it more than once panics, and so does sending on it after it has been closed.
To overcome this, we can use a workaround.
We know exactly how many unvisited items we are fetching.
We now know where the block is.
First, we count the unvisited words, and then we receive from the channel exactly that many times.
vis := make(map[string]bool)
queue := make(chan []string, 5000)
for i := 1; i <= maxDepth; i++ {
fmt.Println(i)
q, nq = nq, make(map[string]int)
unvisited := 0
for word := range q {
if _, ok := vis[word]; !ok {
vis[word] = true
unvisited++
wg.Add(1)
go threading(queue, word)
}
}
wg.Wait() // wait until jobs are done
for j := 0; j < unvisited; j++ { // << does not block as we know how much
v := <-queue // we exactly try to get unvisited amount
fmt.Println(v)
for _, w := range v {
nq[w] = i
}
}
}
Here we simply count how many receives are needed to collect the results. I have also moved the receiving loop out of the inner loop; the original loop now only hands words to workers. It starts the fetches for all words, and the following loop then collects their results without blocking indefinitely.
wg.Wait() holds until all workers are done (the buffered channel must be large enough to hold all of their results, as it is here with 5000). After that, the next iteration runs and the next level of BFS can be reached.
Summary
Distribute workload
Wait for results
Don't do both at the same time
Hope this helps.
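The same "distribute, then collect" structure can be sketched without the scraper. The variant below is not the code from the answer: it assumes a fresh channel per BFS level, which can be closed safely once every worker of that level has finished, so a plain range terminates. The related map and the fetch function are stand-ins for get(buildURL(word)) so that the example runs without network access.

```go
package main

import (
	"fmt"
	"sync"
)

// related is a made-up in-memory graph standing in for dictionary.com.
var related = map[string][]string{
	"jump": {"leap", "hop"},
	"leap": {"bound", "jump"},
	"hop":  {"skip"},
}

// fetch simulates the scraping call; it only reads the map, so it is
// safe to call from several goroutines at once.
func fetch(word string) []string { return related[word] }

// bfs returns the depth at which each word was first reached.
func bfs(start string, maxDepth int) map[string]int {
	depth := map[string]int{start: 0}
	level := []string{start}
	for d := 1; d <= maxDepth && len(level) > 0; d++ {
		// Step 1: distribute the workload for this level.
		results := make(chan []string, len(level))
		var wg sync.WaitGroup
		for _, w := range level {
			wg.Add(1)
			go func(w string) {
				defer wg.Done()
				results <- fetch(w)
			}(w)
		}
		// Step 2: wait, then close; every sender has finished by now.
		wg.Wait()
		close(results)

		// Step 3: collect. Only this goroutine touches the depth map.
		var next []string
		for words := range results {
			for _, w := range words {
				if _, seen := depth[w]; !seen {
					depth[w] = d
					next = append(next, w)
				}
			}
		}
		level = next
	}
	return depth
}

func main() {
	fmt.Println(bfs("jump", 2))
}
```

Sizing the buffer to len(level) means no worker blocks on send, which is why waiting before collecting cannot deadlock here.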

for range vs static channel length golang

I have a channel taking events parsed from a log file and another one which is used for synchronization. There were 8 events for the purpose of my test.
When using the for range syntax, I get 4 events. When using the known number (8), I can get all of them.
func TestParserManyOpinit(t *testing.T) {
ch := make(chan event.Event, 1000)
done := make(chan bool)
go parser.Parse("./test_data/many_opinit", ch, done)
count := 0
exp := 8
evtList := []event.Event{}
<-done
close(ch)
//This gets all the events
for i := 0; i < 8; i++ {
evtList = append(evtList, <-ch)
count++
}
//This only gives me four
//for range ch {
// evtList = append(evtList, <-ch)
// count++
//}
if count != exp || count != len(evtList) {
t.Errorf("Not proper length, got %d, exp %d, evtList %d", count, exp, len(evtList))
}
}
func Parse(filePath string, evtChan chan event.Event, done chan bool) {
log.Info(fmt.Sprintf("(thread) Parsing file %s", filePath))
file, err := os.Open(filePath)
defer file.Close()
if err != nil {
log.Error("Cannot read file " + filePath)
}
count := 0
scan := bufio.NewScanner(file)
scan.Split(splitFunc)
scan.Scan() //Skip log file header
for scan.Scan() {
text := scan.Text()
text = strings.Trim(text, "\n")
splitEvt := strings.Split(text, "\n")
// Some parsing ...
count++
evtChan <- evt
}
fmt.Println("Done ", count) // gives 8
done <- true
}
I must be missing something related to for loops on a channel.
I've tried adding a time.Sleep just before the done <- true part. It didn't change the result.
When you use for range, each loop iteration reads from the channel, and you're not using the read value. Hence, half the values are discarded. It should be:
for ev := range ch {
evtList = append(evtList, ev)
count++
}
This way, the values received by the range clause are actually used.
Ranging over channels is demonstrated in the Tour of Go and detailed in the Go spec.
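The effect is easy to reproduce in isolation. The following sketch uses made-up helper names (fill, drainCorrect, drainDoubleRead) and plain ints instead of events; it fills a buffered channel with eight values, closes it, and drains it both ways.

```go
package main

import "fmt"

// fill returns a closed, buffered channel holding the values 0..n-1.
func fill(n int) chan int {
	ch := make(chan int, n)
	for i := 0; i < n; i++ {
		ch <- i
	}
	close(ch)
	return ch
}

// drainCorrect uses the value that range has already received.
func drainCorrect(ch chan int) []int {
	var got []int
	for v := range ch {
		got = append(got, v)
	}
	return got
}

// drainDoubleRead repeats the mistake from the question: range receives
// one value and the loop body receives another, so every other value is
// silently dropped.
func drainDoubleRead(ch chan int) []int {
	var got []int
	for range ch {
		got = append(got, <-ch)
	}
	return got
}

func main() {
	fmt.Println(drainCorrect(fill(8)))    // [0 1 2 3 4 5 6 7]
	fmt.Println(drainDoubleRead(fill(8))) // [1 3 5 7]
}
```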

Saving results from a parallelized goroutine

I am trying to parallelize an operation in golang and save the results in a manner that I can iterate over to sum up afterwards.
I have managed to set up the parameters so that no deadlock occurs, and I have confirmed that the operations are working and being saved correctly within the function. When I iterate over the slice of my struct and try to sum up the results of the operation, they all remain 0. I have tried passing by reference, with pointers, and with channels (causes deadlock).
I have only found this example for help: https://golang.org/doc/effective_go.html#parallel. But this seems outdated now, as Vector has been deprecated? I also have not found any references to the way this function (in the example) was constructed (with the func (u Vector) before the name). I tried replacing this with a slice but got compile-time errors.
Any help would be much appreciated. Here are the key parts of my code:
type Job struct {
a int
b int
result *big.Int
}
func choose(jobs []Job, c chan int) {
temp := new(big.Int)
for _,job := range jobs {
job.result = //perform operation on job.a and job.b
//fmt.Println(job.result)
}
c <- 1
}
func main() {
num := 100 //can be very large (why we need big.Int)
n := num
k := 0
const numCPU = 6 //runtime.NumCPU
count := new(big.Int)
// create a 2d slice of jobs, one for each core
jobs := make([][]Job, numCPU)
for (float64(k) <= math.Ceil(float64(num / 2))) {
// add one job to each core, alternating so that
// job set is similar in difficulty
for i := 0; i < numCPU; i++ {
if !(float64(k) <= math.Ceil(float64(num / 2))) {
break
}
jobs[i] = append(jobs[i], Job{n, k, new(big.Int)})
n -= 1
k += 1
}
}
c := make(chan int, numCPU)
for i := 0; i < numCPU; i++ {
go choose(jobs[i], c)
}
// drain the channel
for i := 0; i < numCPU; i++ {
<-c
}
// computations are done
for i := range jobs {
for _,job := range jobs[i] {
//fmt.Println(job.result)
count.Add(count, job.result)
}
}
fmt.Println(count)
}
Here is the code running on the go playground https://play.golang.org/p/X5IYaG36U-
As long as the []Job slice is only modified by one goroutine at a time, there's no reason you can't modify the job in place.
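for i, job := range jobs {
// Allocate a fresh big.Int per job; writing every result into one shared temp would make all results alias the same value.
jobs[i].result = new(big.Int).Binomial(int64(job.a), int64(job.b))
}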
https://play.golang.org/p/CcEGsa1fLh
You should also use a WaitGroup, rather than rely on counting tokens in a channel yourself.
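A minimal sketch of both suggestions together might look like the following. It assumes each batch of jobs is handed to exactly one goroutine, and the helper name sumBinomials is invented for this example; the Job type mirrors the one in the question.

```go
package main

import (
	"fmt"
	"math/big"
	"sync"
)

type Job struct {
	a, b   int
	result *big.Int
}

// choose writes each result through the slice index, so the caller's
// elements are updated rather than a per-iteration copy.
func choose(jobs []Job, wg *sync.WaitGroup) {
	defer wg.Done()
	for i := range jobs {
		jobs[i].result = new(big.Int).Binomial(int64(jobs[i].a), int64(jobs[i].b))
	}
}

// sumBinomials (hypothetical helper) runs one goroutine per batch,
// waits for all of them, and then adds up the stored results.
func sumBinomials(batches [][]Job) *big.Int {
	var wg sync.WaitGroup
	for _, batch := range batches {
		wg.Add(1)
		go choose(batch, &wg)
	}
	wg.Wait() // replaces counting tokens on a channel

	total := new(big.Int)
	for _, batch := range batches {
		for _, job := range batch {
			total.Add(total, job.result)
		}
	}
	return total
}

func main() {
	batches := [][]Job{
		{{a: 5, b: 0}, {a: 5, b: 2}},
		{{a: 5, b: 1}},
	}
	fmt.Println(sumBinomials(batches)) // 1 + 10 + 5 = 16
}
```

Reading job.result after wg.Wait() is safe because Wait only returns once every Done call has happened.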

"fan in" - one "fan out" behavior

Say, we have three methods to implement "fan in" behavior
func MakeChannel(tries int) chan int {
ch := make(chan int)
go func() {
for i := 0; i < tries; i++ {
ch <- i
}
close(ch)
}()
return ch
}
func MergeByReflection(channels ...chan int) chan int {
length := len(channels)
out := make(chan int)
cases := make([]reflect.SelectCase, length)
for i, ch := range channels {
cases[i] = reflect.SelectCase{Dir: reflect.SelectRecv, Chan: reflect.ValueOf(ch)}
}
go func() {
for length > 0 {
i, line, opened := reflect.Select(cases)
if !opened {
cases[i].Chan = reflect.ValueOf(nil)
length -= 1
} else {
out <- int(line.Int())
}
}
close(out)
}()
return out
}
func MergeByCode(channels ...chan int) chan int {
length := len(channels)
out := make(chan int)
go func() {
var i int
var ok bool
for length > 0 {
select {
case i, ok = <-channels[0]:
out <- i
if !ok {
channels[0] = nil
length -= 1
}
case i, ok = <-channels[1]:
out <- i
if !ok {
channels[1] = nil
length -= 1
}
case i, ok = <-channels[2]:
out <- i
if !ok {
channels[2] = nil
length -= 1
}
case i, ok = <-channels[3]:
out <- i
if !ok {
channels[3] = nil
length -= 1
}
case i, ok = <-channels[4]:
out <- i
if !ok {
channels[4] = nil
length -= 1
}
}
}
close(out)
}()
return out
}
func MergeByGoRoutines(channels ...chan int) chan int {
var group sync.WaitGroup
out := make(chan int)
// Register all workers before starting them, so Done can never run ahead of Add.
group.Add(len(channels))
for _, ch := range channels {
go func(ch chan int) {
for i := range ch {
out <- i
}
group.Done()
}(ch)
}
go func() {
group.Wait()
close(out)
}()
return out
}
type MergeFn func(...chan int) chan int
func main() {
length := 5
tries := 1000000
channels := make([]chan int, length)
fns := []MergeFn{MergeByReflection, MergeByCode, MergeByGoRoutines}
for _, fn := range fns {
sum := 0
t := time.Now()
for i := 0; i < length; i++ {
channels[i] = MakeChannel(tries)
}
for i := range fn(channels...) {
sum += i
}
fmt.Println(time.Since(t))
fmt.Println(sum)
}
}
Results are (at 1 CPU, I have used runtime.GOMAXPROCS(1)):
19.869s (MergeByReflection)
2499997500000
8.483s (MergeByCode)
2499997500000
4.977s (MergeByGoRoutines)
2499997500000
Results are (at 2 CPU, I have used runtime.GOMAXPROCS(2)):
44.94s
2499997500000
10.853s
2499997500000
3.728s
2499997500000
I understand why MergeByReflection is the slowest, but what explains the difference between MergeByCode and MergeByGoRoutines?
And why does the select-based approach (select directly in MergeByCode, reflect.Select in MergeByReflection) become slower when we increase the number of CPUs?
Here is a preliminary remark. The channels in your examples are all unbuffered, meaning they will likely block on send or receive.
In this example, there is almost no processing except channel management. The performance is therefore dominated by synchronization primitives. Actually, very little of this code can be parallelized.
In the MergeByReflection and MergeByCode functions, select is used to listen to multiple input channels, but nothing is done to take the output channel into account (it may therefore block while a value is available on one of the input channels).
In the MergeByGoRoutines function, this situation cannot happen: when the output channel blocks, it does not prevent another input channel from being read by another goroutine. There are therefore better opportunities for the runtime to parallelize the goroutines, and less contention on the input channels.
The MergeByReflection code is the slowest because it has the overhead of reflection, and almost nothing can be parallelized.
The MergeByGoRoutines function is the fastest because it reduces contention (less synchronization is needed), and because output contention has a lesser impact on input performance. It can therefore benefit from a small improvement when running on multiple cores (unlike the two other methods).
There is so much synchronization activity with MergeByReflection and MergeByCode that running on multiple cores negatively impacts performance. You might see different performance with buffered channels, though.
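To try the buffered variant mentioned above, something along these lines could serve as a starting point. The buffer size, the lowercase helper names, and the sumMerged wrapper are assumptions made for this sketch; whether it is actually faster on a given machine has to be measured.

```go
package main

import (
	"fmt"
	"sync"
)

// makeChannel mirrors MakeChannel from the question, but with a buffer
// so the producer can run ahead of the consumer by up to buf values.
func makeChannel(tries, buf int) chan int {
	ch := make(chan int, buf)
	go func() {
		defer close(ch)
		for i := 0; i < tries; i++ {
			ch <- i
		}
	}()
	return ch
}

// mergeBuffered is the goroutine-per-input merge with a buffered output.
func mergeBuffered(buf int, channels ...chan int) chan int {
	out := make(chan int, buf)
	var wg sync.WaitGroup
	wg.Add(len(channels)) // register before the goroutines start
	for _, ch := range channels {
		go func(ch chan int) {
			defer wg.Done()
			for v := range ch {
				out <- v
			}
		}(ch)
	}
	go func() {
		wg.Wait()
		close(out) // closed once, after every forwarder has finished
	}()
	return out
}

// sumMerged builds n producers and sums everything that comes out.
func sumMerged(n, tries, buf int) int {
	channels := make([]chan int, n)
	for i := range channels {
		channels[i] = makeChannel(tries, buf)
	}
	sum := 0
	for v := range mergeBuffered(buf, channels...) {
		sum += v
	}
	return sum
}

func main() {
	fmt.Println(sumMerged(5, 1000, 64)) // 5 * (0 + 1 + ... + 999) = 2497500
}
```

Wrapping the call to sumMerged with time.Now() and time.Since, as in the question, allows a direct comparison against the unbuffered version (buf = 0).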

How to break out of select gracefully in golang

I have a program in golang that computes SHA1 hashes and prints the ones that start with two zeros. I want to use goroutines and channels. My problem is that I don't know how to gracefully exit the select clause when I don't know how many results it will produce.
Many tutorials know the number in advance and exit when a counter reaches it. Others suggest using WaitGroups, but I don't want to do that: I want to print each result in the main thread as soon as it appears in the channel. Some suggest closing the channel when the goroutines are finished, but I want to close it after the asynchronous for loop finishes, and I don't know how.
Please help me to achieve my requirements:
package main
import (
"crypto/sha1"
"fmt"
"time"
"runtime"
"math/rand"
)
type Hash struct {
message string
hash [sha1.Size]byte
}
var counter int = 0
var max int = 100000
var channel = make(chan Hash)
var source = rand.NewSource(time.Now().UnixNano())
var generator = rand.New(source)
func main() {
nCPU := runtime.NumCPU()
runtime.GOMAXPROCS(nCPU)
fmt.Println("Number of CPUs: ", nCPU)
start := time.Now()
for i := 0 ; i < max ; i++ {
go func(j int) {
count(j)
}(i)
}
// close channel here? I can't because asynchronous producers work now
for {
select {
// how to stop receiving if there are no producers left?
case hash := <- channel:
fmt.Printf("Hash is %v\n ", hash)
}
}
fmt.Printf("Count of %v sha1 took %v\n", max, time.Since(start))
}
func count(i int) {
random := fmt.Sprintf("This is a test %v", generator.Int())
hash := sha1.Sum([]byte(random))
if (hash[0] == 0 && hash[1] == 0) {
channel <- Hash{random, hash}
}
}
Firstly: if you don't know when your computation ends, how could you even model it? Make sure you know exactly when and under what circumstances your program terminates. If you're done you know how to write it in code.
You're basically dealing with a producer-consumer problem, a standard case. I would model it this way (on play):
Producer
func producer(max int, out chan<- Hash, wg *sync.WaitGroup) {
defer wg.Done()
for i := 0; i < max; i++ {
random := fmt.Sprintf("This is a test %v", rand.Int())
hash := sha1.Sum([]byte(random))
if hash[0] == 0 && hash[1] == 0 {
out <- Hash{random, hash}
}
}
close(out)
}
Obviously you're brute-forcing hashes, so the end is reached when the loop is finished.
We can close the channel here and signal the other goroutines that there is nothing more to listen for.
Consumer
func consumer(max int, in <-chan Hash, wg *sync.WaitGroup) {
defer wg.Done()
for {
hash, ok := <-in
if !ok {
break
}
fmt.Printf("Hash is %v\n ", hash)
}
}
The consumer takes all the incoming messages from the in channel and checks if it was closed (ok).
If it is closed, we're done. Otherwise print the received hashes.
Main
To start this all up we can write:
wg := &sync.WaitGroup{}
c := make(chan Hash)
wg.Add(1)
go producer(max, c, wg)
wg.Add(1)
go consumer(max, c, wg)
wg.Wait()
The WaitGroup's purpose is to wait until the spawned goroutines have finished, which they signal by calling wg.Done.
Sidenote
Also note that the Rand you're using is not safe for concurrent access. Use the globally initialized source behind the top-level math/rand functions, which is safe for concurrent use. Example:
rand.Seed(time.Now().UnixNano())
rand.Int()
The structure of your program should probably be re-examined.
Here is a working example of what I presume you are looking for.
It can be run on the Go playground
package main
import (
"crypto/sha1"
"fmt"
"math/rand"
"runtime"
"time"
)
type Hash struct {
message string
hash [sha1.Size]byte
}
const Max int = 100000
func main() {
nCPU := runtime.NumCPU()
runtime.GOMAXPROCS(nCPU)
fmt.Println("Number of CPUs: ", nCPU)
hashes := Generate()
start := time.Now()
for hash := range hashes {
fmt.Printf("Hash is %v\n ", hash)
}
fmt.Printf("Count of %v sha1 took %v\n", Max, time.Since(start))
}
func Generate() <-chan Hash {
c := make(chan Hash, 1)
go func() {
defer close(c)
source := rand.NewSource(time.Now().UnixNano())
generator := rand.New(source)
for i := 0; i < Max; i++ {
random := fmt.Sprintf("This is a test %v", generator.Int())
hash := sha1.Sum([]byte(random))
if hash[0] == 0 && hash[1] == 0 {
c <- Hash{random, hash}
}
}
}()
return c
}
Edit: This does not fire up a separate goroutine for each hash computation, but to be honest, I fail to see the value in doing so. Scheduling all those goroutines will likely cost you far more than running the code in a single goroutine.
If need be, you can split the work into chunks across N goroutines, but a 1:1 mapping is not the way to go here.
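Splitting the work across N goroutines could be sketched as follows. This is one possible arrangement, not the code from either answer: the function name generate, the worker count, and the fixed per-worker seeds (chosen only to make runs repeatable) are all assumptions. A small helper goroutine closes the channel once every worker is done, so the main goroutine can print results as they arrive and still leave the loop cleanly.

```go
package main

import (
	"crypto/sha1"
	"fmt"
	"math/rand"
	"sync"
)

type Hash struct {
	message string
	hash    [sha1.Size]byte
}

// generate spreads total attempts over the given number of workers and
// returns a channel that is closed when all of them have finished.
func generate(total, workers int) <-chan Hash {
	out := make(chan Hash)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		// Spread the attempts as evenly as possible.
		n := total / workers
		if w < total%workers {
			n++
		}
		wg.Add(1)
		go func(id, n int) {
			defer wg.Done()
			// Each worker owns its generator, because *rand.Rand is not
			// safe for concurrent use. The seed is an arbitrary choice.
			gen := rand.New(rand.NewSource(int64(id) + 1))
			for i := 0; i < n; i++ {
				msg := fmt.Sprintf("This is a test %v", gen.Int())
				sum := sha1.Sum([]byte(msg))
				if sum[0] == 0 && sum[1] == 0 {
					out <- Hash{msg, sum}
				}
			}
		}(w, n)
	}
	// Close the channel only after every producer has returned.
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

func main() {
	// Results are printed as soon as they arrive; the loop ends on close.
	for h := range generate(100000, 4) {
		fmt.Printf("%x  %s\n", h.hash, h.message)
	}
}
```

The WaitGroup is used only by the closer goroutine, so the main goroutine never blocks on it and keeps consuming while the producers run.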
