How to search a huge slice of map[string]string concurrently - Go

I need to search a huge slice of map[string]string. My thought was that this is a good use case for Go's channels and goroutines.
The plan was to divide the slice into parts and search them in parallel.
But I was kind of shocked that my parallel version timed out, while searching the whole slice in one go did the trick.
I am not sure what I am doing wrong. Below is the code I used to test the concept. The real code would involve more complexity.
// Search searches for a given term.
// It gets passed the data that needs to be searched
// and the search term, and it returns the matched maps.
// The data is pretty simple: each map contains { key: someText }.
func Search(data []map[string]string, term string) []map[string]string {
	set := []map[string]string{}
	for _, v := range data {
		if v["key"] == term {
			set = append(set, v)
		}
	}
	return set
}
So this works pretty well for searching the slice of maps for a given search term.
Now I thought that if my slice had, say, 20K entries, I would like to do the search in parallel.
// All searches all records concurrently.
// It has the same function signature as the Search function,
// but its main task is to fan the slice out into 5 parts and search
// them in parallel.
func All(data []map[string]string, term string) []map[string]string {
	countOfSlices := 5
	part := len(data) / countOfSlices
	fmt.Printf("Size of the data:%v\n", len(data))
	fmt.Printf("Fragment Size:%v\n", part)
	timeout := time.After(60000 * time.Millisecond)
	c := make(chan []map[string]string)
	for i := 0; i < countOfSlices; i++ {
		// Fragments of the array passed on to the search method
		go func() { c <- Search(data[(part*i):(part*(i+1))], term) }()
	}
	result := []map[string]string{}
	for i := 0; i < part-1; i++ {
		select {
		case records := <-c:
			result = append(result, records...)
		case <-timeout:
			fmt.Println("timed out!")
			return result
		}
	}
	return result
}
Here are my tests:
I have a function to generate my test data and 2 tests.
func GenerateTestData(search string) ([]map[string]string, int) {
	rand.Seed(time.Now().UTC().UnixNano())
	strin := []string{"String One", "This", "String Two", "String Three", "String Four", "String Five"}
	var matchCount int
	numOfRecords := 20000
	set := []map[string]string{}
	for i := 0; i < numOfRecords; i++ {
		p := rand.Intn(len(strin))
		s := strin[p]
		if s == search {
			matchCount++
		}
		set = append(set, map[string]string{"key": s})
	}
	return set, matchCount
}
The 2 tests: The first just traverses the slice and the second searches in parallel
func TestSearchItem(t *testing.T) {
	tests := []struct {
		InSearchTerm string
		Fn           func(data []map[string]string, term string) []map[string]string
	}{
		{
			InSearchTerm: "This",
			Fn:           Search,
		},
		{
			InSearchTerm: "This",
			Fn:           All,
		},
	}
	for i, test := range tests {
		startTime := time.Now()
		data, expectedMatchCount := GenerateTestData(test.InSearchTerm)
		result := test.Fn(data, test.InSearchTerm)
		fmt.Printf("Test: [%v]:\nTime: %v \n\n", i+1, time.Since(startTime))
		assert.Equal(t, len(result), expectedMatchCount, "expected: %v to be: %v", len(result), expectedMatchCount)
	}
}
It would be great if someone could explain to me why my parallel code is so slow. What is wrong with the code, what am I missing here, and what is the recommended way to search huge in-memory slices (50K+ entries)?

This looks like just a simple typo. The problem is that you divide your original big slice into 5 pieces (countOfSlices), and you properly launch 5 goroutines to search each part:
for i := 0; i < countOfSlices; i++ {
	// Fragments of the array passed on to the search method
	go func() { c <- Search(data[(part*i):(part*(i+1))], term) }()
}
This means you should expect 5 results, but you don't. You expect part-1 = 3999 results:
for i := 0; i < part-1; i++ {
	select {
	case records := <-c:
		result = append(result, records...)
	case <-timeout:
		fmt.Println("timed out!")
		return result
	}
}
Obviously, if you only launch 5 goroutines, each of which sends a single result, you can only expect as many results (5). And since your loop waits for a lot more (which will never come), it times out as expected.
Change the condition to this:
for i := 0; i < countOfSlices; i++ {
	// ...
}

Concurrency is not parallelism. Go is a massively concurrent language, not a parallel one. Even on a multicore machine you will pay for data exchange between CPUs when accessing your shared slice from the computation threads. You can take advantage of concurrency by, for example, searching for just the first match, or by doing something with the results (say, printing them, writing them to some Writer, or sorting them) while you continue to search.
func Search(data []map[string]string, term string, ch chan map[string]string) {
	for _, v := range data {
		if v["key"] == term {
			ch <- v
		}
	}
}

func main() {
	...
	go Search(datapart1, term, ch)
	go Search(datapart2, term, ch)
	go Search(datapart3, term, ch)
	...
	// Note: ch must be closed (for example via a sync.WaitGroup once all
	// searches are done), otherwise this range will block forever.
	for vv := range ch {
		fmt.Println(vv) // do something with each match concurrently
	}
	...
}
The recommended way to search a huge slice is to keep it sorted, or to build a binary tree (or some other index). There is no magic.
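For illustration, a minimal sketch of the "keep it sorted" idea for the question's data shape (a slice of maps with a "key" field); SortByKey and SearchSorted are made-up helper names, not part of the question's code:
package main

import (
	"fmt"
	"sort"
)

// SortByKey sorts the slice by the "key" field; do this once after loading the data.
func SortByKey(data []map[string]string) {
	sort.Slice(data, func(i, j int) bool { return data[i]["key"] < data[j]["key"] })
}

// SearchSorted returns all entries whose "key" equals term.
// sort.Search finds the first candidate in O(log n); the loop then
// collects adjacent duplicates.
func SearchSorted(data []map[string]string, term string) []map[string]string {
	i := sort.Search(len(data), func(i int) bool { return data[i]["key"] >= term })
	var set []map[string]string
	for ; i < len(data) && data[i]["key"] == term; i++ {
		set = append(set, data[i])
	}
	return set
}

func main() {
	data := []map[string]string{{"key": "b"}, {"key": "a"}, {"key": "b"}}
	SortByKey(data)
	fmt.Println(SearchSorted(data, "b")) // [map[key:b] map[key:b]]
}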

There are two problems: as icza notes, you never finish the select loop, because you need to use countOfSlices as the bound; and your call to Search will not get the data you want, because the goroutine closures all capture the loop variable i, which may already have changed by the time they run. Compute the sub-slice outside and pass it (or i) in as an argument to the go func().
You might find it still isn't faster to do this particular work in parallel with such simple data, though (perhaps with more complex data on a machine with many cores it would be worthwhile).
Be sure when testing that you try swapping the order of your test runs - you might be surprised by the results! Also try the benchmarking tools available in the testing package, which run your code many times for you and average the results; this might help you get a better idea of whether the fan-out speeds things up.
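Putting both fixes together, All could look something like the sketch below (same names as the question's code; as in the original, any remainder when len(data) is not divisible by countOfSlices is ignored):
func All(data []map[string]string, term string) []map[string]string {
	countOfSlices := 5
	part := len(data) / countOfSlices
	timeout := time.After(60 * time.Second)
	c := make(chan []map[string]string)
	for i := 0; i < countOfSlices; i++ {
		// Pass the fragment in as an argument so each goroutine
		// searches the slice it was meant to search.
		go func(fragment []map[string]string) {
			c <- Search(fragment, term)
		}(data[part*i : part*(i+1)])
	}
	result := []map[string]string{}
	// Expect exactly one result per goroutine.
	for i := 0; i < countOfSlices; i++ {
		select {
		case records := <-c:
			result = append(result, records...)
		case <-timeout:
			fmt.Println("timed out!")
			return result
		}
	}
	return result
}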

Related

Conditionally Run Consecutive Go Routines

I have the following piece of code. I'm trying to run 3 goroutines at the same time, never exceeding three. This works as expected, but the code is supposed to be running updates on a table in the DB.
So the first routine processes the first 50 rows, then the second 50, then the third 50, and it repeats. I don't want two routines processing the same rows at the same time, but due to how long the update takes, this happens almost every time.
To solve this, I started flagging the rows with a new column, processing, which is a bool. I set it to true for all rows to be updated when the routine starts and sleep the script for 6 seconds to allow the flag to be updated.
This works for a random amount of time, but every now and then I'll see 2-3 jobs processing the same rows again. I feel like the method I'm using to prevent duplicate updates is a bit janky and was wondering if there is a better way.
stopper := make(chan struct{}, 3)
var counter int
for {
	counter++
	stopper <- struct{}{}
	go func(db *sqlx.DB, c int) {
		fmt.Println("start")
		updateTables(db)
		fmt.Println("stop")
		<-stopper
	}(db, counter)
	time.Sleep(6 * time.Second)
}
In updateTables:
var ids []string
err := sqlx.Select(db, &data, `select * from table_data where processing = false`)
if err != nil {
	panic(err)
}
for _, row := range data {
	ids = append(ids, row.Id)
}
if len(data) == 0 {
	return
}
for _, row := range data {
	_, err = db.Exec(`update table_data set processing = true where id = $1`, row.Id)
	if err != nil {
		panic(err)
	}
}
// Additional row processing
I think there's a misunderstanding of the approach to goroutines in this case.
Goroutines doing this kind of work should be treated like worker threads, using channels as the communication method between the main routine (which does the synchronization) and the worker goroutines (which do the actual job).
package main

import (
	"log"
	"sync"
	"time"
)

type record struct {
	id int
}

func main() {
	const WORKER_COUNT = 10
	recordschan := make(chan record)
	var wg sync.WaitGroup
	for k := 0; k < WORKER_COUNT; k++ {
		wg.Add(1)
		// Create the worker which will be doing the updates
		go func(workerID int) {
			defer wg.Done() // Marking the worker as done
			for record := range recordschan {
				updateRecord(record)
				log.Printf("req %d processed by worker %d", record.id, workerID)
			}
		}(k)
	}
	// Feeding the records channel
	for _, record := range fetchRecords() {
		recordschan <- record
	}
	// Closing our channel as we're not using it anymore
	close(recordschan)
	// Waiting for all the go routines to finish
	wg.Wait()
	log.Println("we're done!")
}

func fetchRecords() []record {
	result := []record{}
	for k := 0; k < 100; k++ {
		result = append(result, record{k})
	}
	return result
}

func updateRecord(req record) {
	time.Sleep(200 * time.Millisecond)
}
You can even buffer the channel in the main goroutine if you need to queue up all 50 records at once.
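For example, giving the channel a capacity (an assumption about what "buffer" means here) lets the feeding loop enqueue a batch of 50 without blocking:
// Buffered: the main goroutine can enqueue up to 50 records
// before any worker has to pick one up.
recordschan := make(chan record, 50)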

Unpredictable results with http.Client and goroutines

I'm new to Golang and am trying to build a system that fetches content from a set of URLs and extracts specific lines with a regex. The problems start when I wrap the code with goroutines: I get a different number of regex results each run, and many of the fetched lines are duplicates.
max_routines := 3
sem := make(chan int, max_routines) // to control the number of working routines
var wg sync.WaitGroup
ch_content := make(chan string)
client := http.Client{}
for i := 2; ; i++ {
	// for testing
	if i > 5 {
		break
	}
	// loop should be broken if feebbacks_checstr is found in content
	if loop_break {
		break
	}
	wg.Add(1)
	go func(i int) {
		defer wg.Done()
		sem <- 1 // will block if > max_routines
		final_url = url + a.tm_id + "/page=" + strconv.Itoa(i)
		resp, _ := client.Get(final_url)
		var bodyString string
		if resp.StatusCode == http.StatusOK {
			bodyBytes, _ := ioutil.ReadAll(resp.Body)
			bodyString = string(bodyBytes)
		}
		// checking for stop word in content
		if false == strings.Contains(bodyString, feebbacks_checstr) {
			res2 = regex.FindAllStringSubmatch(bodyString, -1)
			for _, v := range res2 {
				ch_content <- v[1]
			}
		} else {
			loop_break = true
		}
		resp.Body.Close()
		<-sem
	}(i)
}
for {
	select {
	case r := <-ch_content:
		a.feedbacks = append(a.feedbacks, r) // collecting the data
	case <-time.After(500 * time.Millisecond):
		show(len(a.feedbacks)) // <- always a different result; many entries in a.feedbacks are duplicates
		fmt.Printf(".")
	}
}
As a result, len(a.feedbacks) sometimes gives 130 and sometimes 139, and a.feedbacks contains duplicates. If I remove the duplicates, the number of results is about half of what I'm expecting (109 without duplicates).
You're creating a closure by using an anonymous goroutine function. I notice your final_url assignment uses = rather than :=, which means it's defined outside the closure. All goroutines therefore share the same final_url variable, and there's a race condition going on: some goroutines overwrite final_url before other goroutines have made their requests, and this results in duplicates.
If you define final_url inside the goroutine, they won't be stepping on each other's toes and it should work as you expect.
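Concretely, the fix inside the anonymous function is just to declare the variable there (names taken from the question's code):
// := declares a final_url local to this goroutine, so every
// request builds and uses its own URL
final_url := url + a.tm_id + "/page=" + strconv.Itoa(i)
resp, _ := client.Get(final_url)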
That's the simple fix for what you have. A more idiomatic Go approach would be to create an input channel (containing the URLs to request) and an output channel (eventually containing whatever you're pulling out of the responses), and instead of trying to manage the life and death of dozens of goroutines, keep a constant number of goroutines alive that drain the input channel.
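A minimal sketch of that shape, with made-up names (fetch, extractLine) rather than the question's identifiers; treat it as an illustration of the pattern, not a drop-in fix:
package main

import (
	"io"
	"log"
	"net/http"
	"sync"
)

// fetch drains the urls channel; one extracted line per response goes to out.
func fetch(client *http.Client, urls <-chan string, out chan<- string, wg *sync.WaitGroup) {
	defer wg.Done()
	for u := range urls {
		resp, err := client.Get(u)
		if err != nil {
			log.Println(err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		out <- extractLine(string(body)) // hypothetical stand-in for the regex extraction
	}
}

func extractLine(body string) string { return body } // placeholder

func main() {
	urls := make(chan string)
	out := make(chan string)
	client := &http.Client{}

	// A constant number of workers drains the input channel.
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go fetch(client, urls, out, &wg)
	}

	// Close out once every worker is done, so the range below terminates.
	go func() {
		wg.Wait()
		close(out)
	}()

	// Feed the input channel and close it when there is nothing left to request.
	go func() {
		for _, u := range []string{"https://example.com/page=2", "https://example.com/page=3"} {
			urls <- u
		}
		close(urls)
	}()

	for line := range out {
		log.Println(line)
	}
}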

how to efficiently call a func after X hours?

I know I can do like this:
func randomFunc() {
	// do stuff
	go destroyObjectAfterXHours(4, "idofobject")
	// do other stuff
}

func destroyObjectAfterXHours(hours int, id string) {
	time.Sleep(time.Duration(hours) * time.Hour)
	destroyObject(id)
}
but if we imagine destroyObjectAfterXHours is called a million times within a few minutes, this solution will be very bad.
I was hoping someone could share a more efficient solution to this problem.
I've been thinking about a potential solution where the destruction time and object id are stored somewhere, and one func would then traverse that list every X minutes, destroy the objects that are due, and remove their id and time info from wherever it is stored. Would this be a good solution?
I worry it would also be a bad solution, since you would then have to traverse a list with millions of items all the time, and then have to efficiently remove some of the items, etc.
The time.AfterFunc function is designed for this use case:
func randomFunc() {
	// do stuff
	time.AfterFunc(4*time.Hour, func() { destroyObject("idofobject") })
	// do other stuff
}
time.AfterFunc is efficient and simple to use.
As the documentation states, the function is called in a goroutine after the duration elapses. The goroutine is not created up front as in the question.
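If a scheduled destruction may also need to be cancelled later (as another answer below discusses), note that time.AfterFunc returns a *time.Timer; a small sketch:
// Schedule the destruction and keep the timer around.
timer := time.AfterFunc(4*time.Hour, func() { destroyObject("idofobject") })

// Later, if the object should be kept after all:
if timer.Stop() {
	// The callback had not fired yet and is now cancelled.
}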
So I'd agree with your solution #2 rather than #1.
Traversing a list of a million entries is much cheaper than having a million separate goroutines.
Goroutines are expensive (compared to loop iterations) and take memory and processing time. For example, a million goroutines take about 4 GB of RAM.
Traversing a list, on the other hand, takes very little space and is done in O(n) time.
A good example of this exact approach is go-cache, which deletes its expired elements in a goroutine it runs periodically:
https://github.com/patrickmn/go-cache/blob/master/cache.go#L931
This is a more detailed example of how they did it:
type Item struct {
	Object     interface{}
	Expiration int64
}

func (item Item) Expired() bool {
	if item.Expiration == 0 {
		return false
	}
	return time.Now().UnixNano() > item.Expiration
}

func RemoveItem(s []Item, index int) []Item {
	return append(s[:index], s[index+1:]...)
}

func deleteExpired(items []Item) {
	var deletedItems []int
	for k, v := range items {
		if v.Expired() {
			deletedItems = append(deletedItems, k)
		}
	}
	// Note: each removal shifts the remaining elements down by one,
	// so the recorded indexes have to be adjusted for that shift.
	for _, deletedIndex := range deletedItems {
		items = RemoveItem(items, deletedIndex)
	}
}
The above implementation could definitely be improved with a linked list instead of an array but this is the general idea
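For illustration, a minimal sketch of that linked-list variant using container/list from the standard library (reusing the Item type above); removals no longer shift any elements:
package main

import (
	"container/list"
	"fmt"
	"time"
)

type Item struct {
	Object     interface{}
	Expiration int64
}

func (item Item) Expired() bool {
	if item.Expiration == 0 {
		return false
	}
	return time.Now().UnixNano() > item.Expiration
}

// deleteExpired removes expired items in one pass. With a linked list
// each removal is O(1) and nothing has to be shifted or re-sliced.
func deleteExpired(items *list.List) {
	for e := items.Front(); e != nil; {
		next := e.Next() // grab next before removing e
		if e.Value.(Item).Expired() {
			items.Remove(e)
		}
		e = next
	}
}

func main() {
	items := list.New()
	items.PushBack(Item{Object: "a", Expiration: time.Now().UnixNano() - 1})
	items.PushBack(Item{Object: "b", Expiration: 0}) // 0 means "never expires"
	deleteExpired(items)
	fmt.Println(items.Len()) // 1
}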
This is an interesting question. I came up with a solution that uses a heap to maintain the queue of items to be destroyed and sleeps exactly until the next item is up for destruction. I think it is more efficient, but the gain might be slim in some cases. Nonetheless, you can see the code here:
package main

import (
	"container/heap"
	"fmt"
	"time"
)

type Item struct {
	Expiration time.Time
	Object     interface{} // It would make more sense to be *interface{}, but not as convenient
}

// MININT is the minimal interval for the delete to run. In most cases, it is better set to 0.
const MININT = 1 * time.Second

func deleteExpired(addCh chan Item) (quitCh chan bool) {
	quitCh = make(chan bool)
	go func() {
		h := make(ExpHeap, 0)
		var t *time.Timer
		item := <-addCh
		heap.Push(&h, &item)
		t = time.NewTimer(time.Until(h[0].Expiration))
		for {
			// Check unfinished incoming first
			for incoming := true; incoming; {
				select {
				case item := <-addCh:
					heap.Push(&h, &item)
				default:
					incoming = false
				}
			}
			if delta := time.Until(h[0].Expiration); delta >= MININT {
				t.Reset(delta)
			} else {
				t.Reset(MININT)
			}
			select {
			case <-quitCh:
				return
			// New item incoming, break the timer
			case item := <-addCh:
				heap.Push(&h, &item)
				if item.Expiration.After(h[0].Expiration) {
					continue
				}
				if delta := time.Until(item.Expiration); delta >= MININT {
					t.Reset(delta)
				} else {
					t.Reset(MININT)
				}
			// Wait until the next item is to be deleted
			case <-t.C:
				for !h[0].Expiration.After(time.Now()) {
					item := heap.Pop(&h).(*Item)
					destroy(item.Object)
				}
				if delta := time.Until(h[0].Expiration); delta >= MININT {
					t.Reset(delta)
				} else {
					t.Reset(MININT)
				}
			}
		}
	}()
	return quitCh
}

type ExpHeap []*Item

func (h ExpHeap) Len() int {
	return len(h)
}

func (h ExpHeap) Swap(i, j int) {
	h[i], h[j] = h[j], h[i]
}

func (h ExpHeap) Less(i, j int) bool {
	return h[i].Expiration.Before(h[j].Expiration)
}

func (h *ExpHeap) Push(x interface{}) {
	item := x.(*Item)
	*h = append(*h, item)
}

func (h *ExpHeap) Pop() interface{} {
	old, n := *h, len(*h)
	item := old[n-1]
	*h = old[:n-1]
	return item
}

// Actual destroy code.
func destroy(x interface{}) {
	fmt.Printf("%v # %v\n", x, time.Now())
}

func main() {
	addCh := make(chan Item)
	quitCh := deleteExpired(addCh)
	for i := 30; i > 0; i-- {
		t := time.Now().Add(time.Duration(i) * time.Second / 2)
		addCh <- Item{t, t}
	}
	time.Sleep(7 * time.Second)
	quitCh <- true
}
playground: https://play.golang.org/p/JNV_6VJ_yfK
By the way, there are packages like cron for job management, but I am not familiar with them so I cannot speak for their efficiency.
Edit:
I still don't have enough reputation to comment :(
About performance: this code basically uses less CPU, because it only wakes itself when necessary and only traverses the items that are up for destruction instead of the whole list. Based on personal (ACM contest) experience, a modern CPU can run a loop of roughly 10^9 iterations in about 1.2 seconds, which means that at a scale of 10^6, traversing the whole list takes a bit over 1 millisecond, excluding the actual destruction code AND the data copying (which, averaged over thousands of runs, costs a lot more, on the order of 100 milliseconds). My code's approach is O(lg N), which at a scale of 10^6 is at least a thousand times faster (allowing for constants). Please note again that all of these figures are based on experience rather than benchmarks (there were benchmarks, but I am not able to provide them).
Edit 2:
On second thought, I think the plain solution can use a simple optimization:
func deleteExpired(items []Item) []Item {
	tail := len(items)
	for index, v := range items {
		if v.Expired() {
			tail--
			items[tail], items[index] = v, items[tail]
		}
	}
	// items[tail:] now holds the expired entries, items[:tail] the live ones.
	return items[:tail]
}
With this change, it no longer copies data inefficiently and does not allocate extra space.
Edit 3:
changing code from here
I tested the memory use of AfterFunc. On my laptop it is 250 bytes per call, while on the playground it is 69 (I am curious about the reason). With my code, a pointer plus a time.Time is 28 bytes. At a scale of a million, the difference is slim. Using time.AfterFunc is the much better option.
If it is a one-shot, this can be easily achieved with
// Make the destruction cancelable
cancel := make(chan bool)
go func(t time.Duration, id int) {
	expired := time.NewTimer(t).C
	select {
	// destroy the object when the timer expires
	case <-expired:
		destroyObject(id)
	// or cancel the destruction in case we get a cancel signal
	// before it is destroyed
	case <-cancel:
		fmt.Println("Cancelled destruction of", id)
		return
	}
}(4*time.Hour, id)

if weather == weather.SUNNY {
	cancel <- true
}
If you want to do it every 4 h:
// Same as above, though since id may have already been destroyed
// once, I name the channel differently
done := make(chan bool)
go func(t time.Duration, id int) {
	// Sends to the channel every t
	tick := time.NewTicker(t).C
	// Wrap in a loop, otherwise select will only run on the first tick
	for {
		select {
		// t has passed, so id can be destroyed
		case <-tick:
			destroyObject(id)
		// We are finished destroying stuff
		case <-done:
			fmt.Println("Ok, ok, I quit destroying...")
			return
		}
	}
}(4*time.Hour, id)

if weather == weather.RAINY {
	done <- true
}
The idea behind it is to run a single goroutine per destruction job, which can be cancelled. Say you have a session and the user did something, so you want to keep the session alive. Since goroutines are extremely cheap, you can simply fire off another goroutine.

Golang order output of go routines

I have 16 goroutines which return output, which is typically a struct:
struct output{
index int,
description string,
}
Now all these 16 goroutines run in parallel, and the total number of output structs from all the goroutines is expected to be a million. I have used Go's basic sorting, but it is very expensive. Could someone help me with an approach to sort the output based on the index? I need to write the "description" field to a file in order of index.
For instance ,
if a go routine gives output as {2, "Hello"},{9,"Hey"},{4,"Hola"}, my output file should contain
Hello
Hola
Hey
All these goroutines run in parallel and I have no control over the order of execution, hence I am passing the index so I can finally order the output.
One thing to consider before getting into the answer is your example code will not compile. To define a type of struct in Go, you would need to change your syntax to
type output struct {
	index       int
	description string
}
In terms of a potential solution to your problem: if you already reliably have unique indexes as well as the expected count of the result set, you should not have to do any sorting at all. Instead, synchronize the goroutines over a channel and insert each output into a pre-allocated slice at its respective index. You can then iterate over that slice to write the contents to a file. For example:
ch := make(chan output)   // each go routine will write to this channel
wg := new(sync.WaitGroup) // wait group to sync all go routines

// execute 16 goroutines
for i := 0; i < 16; i++ {
	wg.Add(1)
	go worker(ch, wg) // each worker func is expected to call wg.Done() when completing its portion of work
}

// create a "quit" channel that will be used to signal to the select statement below that your go routines are all done
quit := make(chan bool)
go func() {
	wg.Wait()
	quit <- true
}()

// initialize a slice with length and capacity of 1 million, the expected result size mentioned in your question
sorted := make([]string, 1000000, 1000000)

// use the for-loop/select pattern to sync the results from your 16 go routines and insert them into the sorted slice
for {
	select {
	case output := <-ch:
		// this is not robust - check notes below example
		sorted[output.index] = output.description
	case <-quit:
		// implement a function you could pass the sorted slice to that will write the results
		// Ex: writeToFile(sorted)
		return
	}
}
A couple of notes on this solution: it depends on knowing the size of the expected result set. If you do not know the size of the result set, then in the select statement you will need to check whether the index read from ch exceeds the length of the sorted slice, and allocate additional space before inserting, or your program will crash with an out-of-bounds error.
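One possible shape for that check is a small helper that grows the slice on demand; setAt is a made-up name, not part of the answer's code:
func setAt(sorted []string, i int, desc string) []string {
	// Grow the slice if this index has not been allocated yet.
	if i >= len(sorted) {
		grown := make([]string, i+1)
		copy(grown, sorted)
		sorted = grown
	}
	sorted[i] = desc
	return sorted
}
In the select statement above you would then write sorted = setAt(sorted, output.index, output.description).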
You could use the module Ordered-concurrently to merge your inputs and then print them in order.
https://github.com/tejzpr/ordered-concurrently
Example - https://play.golang.org/p/hkcIuRHj63h
package main

import (
	concurrently "github.com/tejzpr/ordered-concurrently/v2"
	"log"
	"math/rand"
	"time"
)

type loadWorker int

// The work that needs to be performed
// The input type should implement the WorkFunction interface
func (w loadWorker) Run() interface{} {
	time.Sleep(time.Millisecond * time.Duration(rand.Intn(10)))
	return w
}

func main() {
	max := 10
	inputChan := make(chan concurrently.WorkFunction)
	output := concurrently.Process(inputChan, &concurrently.Options{PoolSize: 10, OutChannelBuffer: 10})
	go func() {
		for work := 0; work < max; work++ {
			inputChan <- loadWorker(work)
		}
		close(inputChan)
	}()
	for out := range output {
		log.Println(out.Value)
	}
}
Disclaimer: I'm the module creator

Consuming all elements of a channel into a slice

How can I construct a slice out of all of the elements consumed from a channel (like Python's list does)? I can use this helper function:
func ToSlice(c chan int) []int {
	s := make([]int, 0)
	for i := range c {
		s = append(s, i)
	}
	return s
}
but due to the lack of generics, I'll have to write that for every type, won't I? Is there a builtin function that implements this? If not, how can I avoid copying and pasting the above code for every single type I'm using?
If there's only a few instances in your code where the conversion is needed, then there's absolutely nothing wrong with copying the 7 lines of code a few times (or even inlining it where it's used, which reduces it to 4 lines of code and is probably the most readable solution).
If you've really got conversions between lots and lots of types of channels and slices and want something generic, then you can do this with reflection at the cost of ugliness and lack of static typing at the callsite of ChanToSlice.
Here's complete example code for how you can use reflect to solve this problem with a demonstration of it working for an int channel.
package main

import (
	"fmt"
	"reflect"
)

// ChanToSlice reads all data from ch (which must be a chan), returning a
// slice of the data. If ch is a 'T chan' then the return value is of type
// []T inside the returned interface.
// A typical call would be sl := ChanToSlice(ch).([]int)
func ChanToSlice(ch interface{}) interface{} {
	chv := reflect.ValueOf(ch)
	slv := reflect.MakeSlice(reflect.SliceOf(reflect.TypeOf(ch).Elem()), 0, 0)
	for {
		v, ok := chv.Recv()
		if !ok {
			return slv.Interface()
		}
		slv = reflect.Append(slv, v)
	}
}

func main() {
	ch := make(chan int)
	go func() {
		for i := 0; i < 10; i++ {
			ch <- i
		}
		close(ch)
	}()
	sl := ChanToSlice(ch).([]int)
	fmt.Println(sl)
}
You could make ToSlice() just work on interface{}'s, but the amount of code you save here will likely cost you in complexity elsewhere.
func ToSlice(c chan interface{}) []interface{} {
	s := make([]interface{}, 0)
	for i := range c {
		s = append(s, i)
	}
	return s
}
Full example at http://play.golang.org/p/wxx-Yf5ESN
That being said: as #Volker said in the comments, from the slice (haha) of code you showed, it seems like it'd be saner to either process the results in a streaming fashion or "buffer them up" at the generator and just send the slice down the channel.
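As an aside that postdates these answers: on Go 1.18 or newer, type parameters let you write the helper once for any element type, which removes the copy-and-paste concern entirely:
// ToSlice drains c and returns its elements as a slice, for any element type T.
func ToSlice[T any](c chan T) []T {
	s := make([]T, 0)
	for v := range c {
		s = append(s, v)
	}
	return s
}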
