golang errgroup example without channels - go

I saw the errgroup example in the godoc, and it confuses me that it simply assigns each result to a shared results slice instead of using channels in the search goroutines. Here's the code:
Google := func(ctx context.Context, query string) ([]Result, error) {
g, ctx := errgroup.WithContext(ctx)
searches := []Search{Web, Image, Video}
results := make([]Result, len(searches))
for i, search := range searches {
i, search := i, search // https://golang.org/doc/faq#closures_and_goroutines
g.Go(func() error {
result, err := search(ctx, query)
if err == nil {
results[i] = result
}
return err
})
}
if err := g.Wait(); err != nil {
return nil, err
}
return results, nil
}
I'm not sure whether there is any reason or implied rule that guarantees this is correct. Thanks.

The intent here is to make searches and results congruent. The result for the Web search is always at results[0], the result for the Image search always at results[1], etc. It also makes for a simpler example, because there is no need for an additional goroutine that consumes a channel.
If the goroutines sent their results into a channel, the result order would be unpredictable. If predictable result order is not a property you care about, feel free to use a channel.

There is secret sauce in this code that creates siloing:
results := make([]Result, len(searches))
^^^^ ^^^^^^^^^^^^^
for i, search := ... {
i, search := i, search
^^^^^^^^^^
g.Go {
results[i] = result
^^^^^^^^^^
}
We know how big the result set is going to be, so we pre-allocate all the slots before starting any goroutines. This eliminates any contention over the slice object itself:
make(.., len(searches))
^^^^ ^^^^^^^^^^^^^
We then promote the index number and search property to a closure for each iteration, so there is no contention over the variables being used by the loop/goroutines
i, search := i, search
And finally, each worker operates on a singular slot in the pre-sized slice:
results[i] = result
The workers are guaranteed to only read the "results" slice itself (its header) to locate their element; the only thing each worker writes is its own element, results[i].
This particular pattern has a limitation: you can't use the results until all the workers have completed. So when deciding between this and a channel-based pipeline, ask yourself what you are going to do next.
results := getSearchResults(searches)
statistics := analyzeResults(results)
for stats := range statistics {
our.Write("{%s}\n", stats.String())
}
If the analysis of a given result is independent of any other, this is a good candidate for a channel-based workflow.
But if the analysis depends on order, or has different results depending on each other then you may not have any choice but to serialize the flow.
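As a rough sketch of the channel-based alternative, assuming each result can be analyzed on its own: the workers send into a channel and the consumer handles every value as soon as it arrives. The search function and its string result are hypothetical stand-ins, not part of the errgroup example.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// search is a hypothetical stand-in for a real backend call.
func search(kind, query string) string {
	return kind + " result for " + query
}

// streamResults starts one goroutine per kind and returns a channel that
// yields results in completion order, not in the order of kinds.
func streamResults(kinds []string, query string) <-chan string {
	out := make(chan string)
	var wg sync.WaitGroup
	for _, kind := range kinds {
		wg.Add(1)
		go func(kind string) {
			defer wg.Done()
			out <- search(kind, query)
		}(kind)
	}
	// Close the channel once every worker has sent its result,
	// so the consumer's range loop terminates.
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

func main() {
	for r := range streamResults([]string{"web", "image", "video"}, "golang") {
		// Each result is processed as soon as it arrives.
		fmt.Println(strings.ToUpper(r))
	}
}
```

The price of this design is that the position of a result no longer tells you which search produced it; if that matters, send a small struct carrying the index along with the result.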

Related

Multiple queries to Postgres within the same function

I'm new to Go, so sorry for the silly question in advance!
I'm using Gin framework and want to make multiple queries to the database within the same handler (database/sql + lib/pq)
userIds := []int{}
bookIds := []int{}
var id int
/* Handling first query here */
rows, err := pgClient.Query(getUserIdsQuery)
defer rows.Close()
if err != nil {
return
}
for rows.Next() {
err := rows.Scan(&id)
if err != nil {
return
}
userIds = append(userIds, id)
}
/* Handling second query here */
rows, err = pgClient.Query(getBookIdsQuery)
defer rows.Close()
if err != nil {
return
}
for rows.Next() {
err := rows.Scan(&id)
if err != nil {
return
}
bookIds = append(bookIds, id)
}
I have a couple of questions regarding this code (any improvements and best practices would be appreciated)
Does Go properly handle defer rows.Close() in such a case? I reassign the rows variable further down, so will the compiler track both values and close them properly at the end of the function?
Is it OK to reuse the shared id variable, or should I redeclare it inside each rows.Next() loop?
What's the better approach when there are even more queries within one handler? Should I have some kind of Writer that accepts a query and a slice and populates the slice with the ids retrieved?
Thanks.
I've never worked with go-pg library, and my answer is mostly focused on the other stuff, which are generic, and are not specific to golang or go-pg.
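A defer statement evaluates its receiver at the moment it executes, so the first deferred call closes the rows from the first query and the second closes the rows from the second query; both are closed even though the variable is reassigned. Defining two variables, like userRows and bookRows, is still cleaner. Also move each defer rows.Close() below its err check, because rows is nil when Query returns an error and calling Close on it can panic.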
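To check what defer does with a variable that is reassigned later, here is a small self-contained experiment. The resource type is a hypothetical stand-in for *sql.Rows that only records which value had Close called on it.

```go
package main

import "fmt"

// resource is a hypothetical stand-in for *sql.Rows that records Close calls.
type resource struct {
	name string
	log  *[]string
}

func (r *resource) Close() { *r.log = append(*r.log, r.name) }

// closeOrder reassigns one variable between two defer statements and
// returns the names of the resources that were closed, in call order.
func closeOrder() []string {
	var log []string
	func() {
		r := &resource{name: "first", log: &log}
		defer r.Close() // receiver is evaluated now: binds "first"
		r = &resource{name: "second", log: &log}
		defer r.Close() // binds "second"
	}()
	return log
}

func main() {
	fmt.Println(closeOrder()) // prints [second first]
}
```

Both values are closed, in last-in-first-out order, because each defer statement saves the receiver it sees when it executes.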
Although I have not worked with go-pg, I believe that you won't need to iterate through the rows and scan the id for each row manually. Based on a quick look at the documentation, the library provides an API like this:
userIds := []int{}
_, err := pgClient.Query(&userIds, "select id from users where ...", args...)
Regarding your second question, it depends on what you mean by "ok". Since you're iterating synchronously, I don't think it would result in bugs, but as a matter of coding style I personally wouldn't do it.
I think that the best thing to do in your case is this:
// repo layer
func getUserIds(args whatever) ([]int, error) {...}
// these can be exposed, based on your packaging logic
func getBookIds(args whatever) ([]int, error) {...}
// service layer, or wherever you want to aggregate both queries
func getUserAndBookIds() ([]int, []int, error) {
	userIds, err := getUserIds(...)
	if err != nil {
		return nil, nil, err
	}
	bookIds, err := getBookIds(...)
	if err != nil {
		return nil, nil, err
	}
	return userIds, bookIds, nil
}
I think this code is easier to read and maintain. You won't run into variable reassignment and similar issues.
You can take a look at the go-pg documentation for more details on how to improve your query.
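For the third question (a reusable helper that fills a slice with ids), one possible shape with plain database/sql is a function that drains any row iterator. This is a sketch, not the only way to do it: rowIterator, collectIDs, and fakeRows are names made up here. The interface lists methods that *sql.Rows already has, and the fake exists only so the example runs without a database.

```go
package main

import (
	"errors"
	"fmt"
)

// rowIterator lists the methods of *sql.Rows that the helper needs, so the
// helper can also be exercised with a fake in tests.
type rowIterator interface {
	Next() bool
	Scan(dest ...any) error
	Err() error
	Close() error
}

// collectIDs drains rows into a slice of ids and always closes rows.
func collectIDs(rows rowIterator) ([]int, error) {
	defer rows.Close()
	ids := []int{}
	for rows.Next() {
		var id int // declared per iteration, nothing is shared
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		ids = append(ids, id)
	}
	// Err reports any error hit during iteration.
	return ids, rows.Err()
}

// fakeRows is a hypothetical in-memory rowIterator used only for the demo.
type fakeRows struct {
	ids    []int
	pos    int
	closed bool
}

func (f *fakeRows) Next() bool { f.pos++; return f.pos <= len(f.ids) }

func (f *fakeRows) Scan(dest ...any) error {
	p, ok := dest[0].(*int)
	if !ok {
		return errors.New("fakeRows: dest must be *int")
	}
	*p = f.ids[f.pos-1]
	return nil
}

func (f *fakeRows) Err() error   { return nil }
func (f *fakeRows) Close() error { f.closed = true; return nil }

func main() {
	ids, err := collectIDs(&fakeRows{ids: []int{3, 5, 8}})
	fmt.Println(ids, err) // prints [3 5 8] <nil>
}
```

With a real database the handler would call Query, check err, and then pass the returned rows to collectIDs, once per query.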

Extract Prometheus Metrics in Go

I'm new to Go. I'm trying to query Prometheus and save the query result in an object (such as a map) that holds all timestamps of the metric and their values.
I started from this example code with only a few changes (https://github.com/prometheus/client_golang/blob/master/api/prometheus/v1/example_test.go)
func getFromPromRange(start time.Time, end time.Time, metric string) model.Value {
client, err := api.NewClient(api.Config{
Address: "http://localhost:9090",
})
if err != nil {
fmt.Printf("Error creating client: %v\n", err)
os.Exit(1)
}
v1api := v1.NewAPI(client)
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
r := v1.Range{
Start: start,
End: end,
Step: time.Second,
}
result, warnings, err := v1api.QueryRange(ctx, metric, r)
if err != nil {
fmt.Printf("Error querying Prometheus: %v\n", err)
os.Exit(1)
}
if len(warnings) > 0 {
fmt.Printf("Warnings: %v\n", warnings)
}
fmt.Printf("Result:\n%v\n", result)
return result
}
The result that is printed is for example:
"TEST{instance="localhost:4321", job="realtime"} =>\n21 #[1597758502.337]\n22 #[1597758503.337]...
These are actually the correct values and timestamps that are on Prometheus. How can I insert these timestamps and values into a map object (or another type of object that I can then use in code)?
The result returned by QueryRange has the type model.Matrix, which is a slice of *SampleStream.
Since your example contains only one SampleStream, we can access the first element directly.
Each SampleStream has a Metric and Values of type []SamplePair. The slice of sample pairs is what you are after: iterate over it to build, for instance, a map.
mapData := make(map[model.Time]model.SampleValue)
for _, val := range result.(model.Matrix)[0].Values {
mapData[val.Timestamp] = val.Value
}
fmt.Println(mapData)
You have to know which type of result is returned. model.Value can be a Scalar, Vector, Matrix, or String, and each of these types has its own way of exposing the data and timestamps. For example, a Vector is a slice of *Sample values, which contain the data you're looking for. The godoc and the GitHub repository for the Prometheus Go client are well documented if you want to dive deeper.
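Indexing the matrix with [0] panics when the query returns no series and ignores every series after the first. A rough sketch of converting all series follows. To stay runnable without the Prometheus client it uses local stand-in types (samplePair, sampleStream) that only mirror the shape described above; they are assumptions for illustration, not the library's definitions.

```go
package main

import "fmt"

// samplePair and sampleStream are local stand-ins, not the Prometheus types.
type samplePair struct {
	Timestamp int64 // milliseconds since the epoch (assumption of this sketch)
	Value     float64
}

type sampleStream struct {
	Metric string
	Values []samplePair
}

// toMaps builds one timestamp-to-value map per series, keyed by metric.
// An empty matrix simply yields an empty result.
func toMaps(matrix []*sampleStream) map[string]map[int64]float64 {
	out := make(map[string]map[int64]float64, len(matrix))
	for _, stream := range matrix {
		series := make(map[int64]float64, len(stream.Values))
		for _, p := range stream.Values {
			series[p.Timestamp] = p.Value
		}
		out[stream.Metric] = series
	}
	return out
}

func main() {
	matrix := []*sampleStream{{
		Metric: `TEST{instance="localhost:4321", job="realtime"}`,
		Values: []samplePair{{1597758502337, 21}, {1597758503337, 22}},
	}}
	for metric, series := range toMaps(matrix) {
		fmt.Println(metric, len(series))
	}
}
```

With the real client the same loops would range over result.(model.Matrix) and each stream's Values.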
Maybe you can find your answer in this issue:
https://github.com/prometheus/client_golang/issues/194
switch val.Type() {
case model.ValScalar:
	scalarVal := val.(*model.Scalar)
	// handle scalar stuff
case model.ValVector:
	vectorVal := val.(model.Vector)
	for _, elem := range vectorVal {
		// do something with each element in the vector
	}
// etc
}

Can I concurrently write different slice elements

I have a slice that contains work to be done, and a slice that will contain the results when everything is done. The following is a sketch of my general process:
var results = make([]Result, len(jobs))
wg := sync.WaitGroup{}
for i, job := range jobs {
wg.Add(1)
go func(i int, j job) {
defer wg.Done()
var r Result = doWork(j)
results[i] = r
}(i, job)
}
wg.Wait()
// Use results
It seems to work, but I have not tested it thoroughly and am not sure if it is safe to do. Generally I would not feel good letting multiple goroutines write to anything, but in this case, each goroutine is limited to its own index in the slice, which is pre-allocated.
I suppose the alternative is collecting results via a channel, but since order of results matters, this seemed rather simple. Is it safe to write into slice elements this way?
The rule is simple: if multiple goroutines access a variable concurrently, and at least one of the accesses is a write, then synchronization is required.
Your example does not violate this rule. You don't write the slice value (the slice header), you only read it (implicitly, when you index it).
You don't read the slice elements, you only modify the slice elements. And each goroutine only modifies a single, different, designated slice element. And since each slice element has its own address (own memory space), they are like distinct variables. This is covered in Spec: Variables:
Structured variables of array, slice, and struct types have elements and fields that may be addressed individually. Each such element acts like a variable.
What must be kept in mind is that you can't read the results from the results slice without synchronization. And the waitgroup you used in your example is a sufficient synchronization. You are allowed to read the slice once wg.Wait() returns, because that can only happen after all worker goroutines called wg.Done(), and none of the worker goroutines modify the elements after they called wg.Done().
For example, this is a valid (safe) way to check / process the results:
wg.Wait()
// Safe to read results after the above synchronization point:
fmt.Println(results)
But if you would try to access the elements of results before wg.Wait(), that's a data race:
// This is data race! Goroutines might still run and modify elements of results!
fmt.Println(results)
wg.Wait()
Yes, it's perfectly legal: a slice has an array as its underlying data storage, and, being a compound type, an array is a sequence of "elements" which behave as individual variables with distinct memory locations; modifying them concurrently is fine.
Just be sure to synchronize the shutdown of your worker goroutines with
the main one before it reads the updated contents of the slice.
Using sync.WaitGroup for this—as you do—is perfectly fine.
Also, as @icza said, you must not modify the slice value itself (which is a struct containing a pointer to the backing storage array, the capacity and the length).
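For contrast, here is a sketch of the case that does need a lock: append may replace the slice header (pointer, length, capacity), so goroutines that append to the same slice must serialize those calls. The collectSquares function is a made-up example, not taken from the question.

```go
package main

import (
	"fmt"
	"sync"
)

// collectSquares appends from many goroutines. Because append may change
// the slice header (pointer, length, capacity), every append is guarded.
func collectSquares(n int) []int {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		squares []int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			sq := i * i // the actual work happens outside the lock
			mu.Lock()
			squares = append(squares, sq)
			mu.Unlock()
		}(i)
	}
	wg.Wait()
	return squares // element order depends on scheduling
}

func main() {
	fmt.Println(len(collectSquares(100))) // prints 100
}
```

Pre-allocating the slice and writing to results[i], as in the question, avoids the lock entirely and keeps the order stable.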
YES, YOU CAN.
tldr
In golang.org/x/sync/errgroup example, it has the same example code in Example (Parallel)
Google := func(ctx context.Context, query string) ([]Result, error) {
g, ctx := errgroup.WithContext(ctx)
searches := []Search{Web, Image, Video}
results := make([]Result, len(searches))
for i, search := range searches {
i, search := i, search
g.Go(func() error {
result, err := search(ctx, query)
if err == nil {
results[i] = result
}
return err
})
}
if err := g.Wait(); err != nil {
return nil, err
}
return results, nil
}
// ...

Spawning go routines in a loop with closure

I have a list of strings which can contain number of elements ranging from 1 to 100,000. I want to verify each string and see if they are stored in a database, which requires call to network.
In order to maximize the efficiency, I want to spawn a go routine for each element.
Goal is to return false if one of the verifications inside the go routine function returns err, and return true if there is no err. So if we find at least one err we can stop since we already know that it is going to return false.
This is the basic idea, and the function below is the structure I've been thinking about using so far. I'd like to know if there is a better way (perhaps using channel?).
for _, id := range userIdList {
go func(id string){
user, err := verifyId(id)
if err != nil {
return err
}
// ...
// few more calls to other APIs for verifications
if err != nil {
return err
}
}(id)
}
I have written a small function that might be helpful for you.
Please take a look at limited parallel operations
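Since the linked function is not reproduced here, the following is only a sketch of what limited parallel verification could look like with the standard library. It assumes a hypothetical verifyID standing in for the network call and a caller-chosen limit on concurrent calls, and it stops launching new work once a check has failed.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// verifyID is a hypothetical stand-in for the network call in the question.
func verifyID(ctx context.Context, id string) error {
	if id == "" {
		return errors.New("empty id")
	}
	return nil
}

// verifyAll checks ids with at most limit concurrent calls and reports
// whether every check passed. After the first failure it cancels the
// context and stops starting new goroutines.
func verifyAll(ids []string, limit int) bool {
	if limit < 1 {
		limit = 1
	}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var (
		wg     sync.WaitGroup
		once   sync.Once
		failed bool
	)
	sem := make(chan struct{}, limit) // counting semaphore

loop:
	for _, id := range ids {
		select {
		case <-ctx.Done():
			break loop // a worker failed; skip the remaining ids
		case sem <- struct{}{}:
		}
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			defer func() { <-sem }() // release the semaphore slot
			if err := verifyID(ctx, id); err != nil {
				once.Do(func() {
					failed = true
					cancel()
				})
			}
		}(id)
	}
	wg.Wait() // failed is only read after all workers are done
	return !failed
}

func main() {
	fmt.Println(verifyAll([]string{"a", "b", "c"}, 2)) // prints true
	fmt.Println(verifyAll([]string{"a", "", "c"}, 2))  // prints false
}
```

With up to 100,000 ids, the limit keeps the number of simultaneous network calls bounded instead of starting one goroutine per id at once. The golang.org/x/sync/errgroup package offers a similar pattern if an extra dependency is acceptable.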

How to search a huge slice of maps[string]string concurrently

I need to search a huge slice of map[string]string values. My thought was that this is a good use case for Go's channels and goroutines.
The plan was to divide the slice into parts and search them in parallel.
But I was surprised that my parallel version timed out, while searching the whole slice sequentially worked fine.
I am not sure what I am doing wrong. Down below is my code which I used to test the concept. The real code would involve more complexity
//Search for a giving term
//This function gets the data passed which will need to be search
//and the search term and it will return the matched maps
// the data is pretty simply the map contains { key: andSomeText }
func Search(data []map[string]string, term string) []map[string]string {
set := []map[string]string{}
for _, v := range data {
if v["key"] == term {
set = append(set, v)
}
}
return set
}
So this works pretty well to search the slice of maps for a given SearchTerm.
Now I thought if my slice would have like 20K entries, I would like to do the search in parallel
// All searches all records concurrently
// Has the same function signature as the search function
// but the main task is to fan out the slice in 5 parts and search
// in parallel
func All(data []map[string]string, term string) []map[string]string {
countOfSlices := 5
part := len(data) / countOfSlices
fmt.Printf("Size of the data:%v\n", len(data))
fmt.Printf("Fragemnt Size:%v\n", part)
timeout := time.After(60000 * time.Millisecond)
c := make(chan []map[string]string)
for i := 0; i < countOfSlices; i++ {
// Fragments of the array passed on to the search method
go func() { c <- Search(data[(part*i):(part*(i+1))], term) }()
}
result := []map[string]string{}
for i := 0; i < part-1; i++ {
select {
case records := <-c:
result = append(result, records...)
case <-timeout:
fmt.Println("timed out!")
return result
}
}
return result
}
Here are my tests:
I have a function to generate my test data and 2 tests.
func GenerateTestData(search string) ([]map[string]string, int) {
rand.Seed(time.Now().UTC().UnixNano())
strin := []string{"String One", "This", "String Two", "String Three", "String Four", "String Five"}
var matchCount int
numOfRecords := 20000
set := []map[string]string{}
for i := 0; i < numOfRecords; i++ {
p := rand.Intn(len(strin))
s := strin[p]
if s == search {
matchCount++
}
set = append(set, map[string]string{"key": s})
}
return set, matchCount
}
The 2 tests: The first just traverses the slice and the second searches in parallel
func TestSearchItem(t *testing.T) {
tests := []struct {
InSearchTerm string
Fn func(data []map[string]string, term string) []map[string]string
}{
{
InSearchTerm: "This",
Fn: Search,
},
{InSearchTerm: "This",
Fn: All,
},
}
for i, test := range tests {
startTime := time.Now()
data, expectedMatchCount := GenerateTestData(test.InSearchTerm)
result := test.Fn(data, test.InSearchTerm)
fmt.Printf("Test: [%v]:\nTime: %v \n\n", i+1, time.Since(startTime))
assert.Equal(t, len(result), expectedMatchCount, "expected: %v to be: %v", len(result), expectedMatchCount)
}
}
It would be great if someone could explain why my parallel code is so slow. What is wrong with the code, what am I missing, and what is the recommended way to search huge in-memory slices (50K+ entries)?
This looks like just a simple typo. The problem is that you divide your original big slice into 5 pieces (countOfSlices), and you properly launch 5 goroutines to search each part:
for i := 0; i < countOfSlices; i++ {
// Fragments of the array passed on to the search method
go func() { c <- Search(data[(part*i):(part*(i+1))], term) }()
}
This means you should expect 5 results, but you don't. You expect 4000-1 results:
for i := 0; i < part-1; i++ {
select {
case records := <-c:
result = append(result, records...)
case <-timeout:
fmt.Println("timed out!")
return result
}
}
Obviously if you only launched 5 goroutines, each of which delivers 1 single result, you can only expect as many (5). And since your loop waits a lot more (which will never come), it times out as expected.
Change the condition to this:
for i := 0; i < countOfSlices; i++ {
// ...
}
Concurrency is not parallelism. Go is a massively concurrent language, not a parallel one. Even on a multicore machine you pay for data exchange between CPUs when computation threads access your shared slice. You can take advantage of concurrency when, for example, you only need the first match, or when you do something with the results (print them, write them to some Writer, or sort them) while the search continues.
func Search(data []map[string]string, term string, ch chan map[string]string) {
for _, v := range data {
if v["key"] == term {
ch <- v
}
}
}
func main(){
...
go Search(datapart1, term, ch)
go Search(datapart2, term, ch)
go Search(datapart3, term, ch)
...
for vv := range ch{
fmt.Println(vv) //do something with match concurrently
}
...
}
The recommended way to search a huge slice is to keep it sorted, or to build a binary tree. There is no magic.
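A minimal sketch of the sorted-slice idea, assuming the records can be reordered and are searched many times after a single sort. The sortByKey and searchSorted names are made up for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// sortByKey orders the records by their "key" value, in place.
func sortByKey(data []map[string]string) {
	sort.Slice(data, func(i, j int) bool { return data[i]["key"] < data[j]["key"] })
}

// searchSorted returns all records whose "key" equals term.
// data must already be sorted with sortByKey.
func searchSorted(data []map[string]string, term string) []map[string]string {
	// First index whose key is >= term.
	lo := sort.Search(len(data), func(i int) bool { return data[i]["key"] >= term })
	// First index whose key is > term.
	hi := sort.Search(len(data), func(i int) bool { return data[i]["key"] > term })
	return data[lo:hi] // shares memory with data
}

func main() {
	data := []map[string]string{
		{"key": "String Two"}, {"key": "This"}, {"key": "String One"}, {"key": "This"},
	}
	sortByKey(data)
	fmt.Println(len(searchSorted(data, "This"))) // prints 2
}
```

Each lookup then takes a logarithmic number of comparisons instead of a full scan; the sort itself only pays off when the same data is searched repeatedly.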
There are two problems. As icza notes, the receive loop never finishes because its bound should be countOfSlices. In addition, your call to Search will not get the data you want: the closure reads i when the goroutine runs, not when it is created (before Go 1.22 the loop variable is shared across iterations), so compute the sub-slice outside the go func() and pass it in.
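Putting both fixes together, a corrected fan-out might look like the sketch below. It differs from the question's code in a few ways that are choices of this example: the number of parts is a parameter, the fragment size is rounded up so that no trailing records are dropped, and the timeout is omitted.

```go
package main

import "fmt"

// Search returns the records whose "key" equals term.
func Search(data []map[string]string, term string) []map[string]string {
	set := []map[string]string{}
	for _, v := range data {
		if v["key"] == term {
			set = append(set, v)
		}
	}
	return set
}

// All splits data into up to parts fragments, searches them concurrently,
// and merges the matches. The order of fragments in the result is not fixed.
func All(data []map[string]string, term string, parts int) []map[string]string {
	if parts < 1 {
		parts = 1
	}
	size := (len(data) + parts - 1) / parts // round up so no record is dropped
	c := make(chan []map[string]string)
	launched := 0
	for start := 0; start < len(data); start += size {
		end := start + size
		if end > len(data) {
			end = len(data)
		}
		// The fragment is sliced before the goroutine starts, so the
		// goroutine does not depend on the loop variables.
		go func(fragment []map[string]string) {
			c <- Search(fragment, term)
		}(data[start:end])
		launched++
	}
	result := []map[string]string{}
	for i := 0; i < launched; i++ { // receive exactly one slice per goroutine
		records := <-c
		result = append(result, records...)
	}
	return result
}

func main() {
	data := []map[string]string{{"key": "This"}, {"key": "That"}, {"key": "This"}}
	fmt.Println(len(All(data, "This", 2))) // prints 2
}
```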
You might find it still isn't faster though to do this particular work in parallel with such simple data (perhaps with more complex data on a machine with lots of cores it would be worthwhile)?
When testing, be sure to try swapping the order of your test runs; you might be surprised by the results. Also try the benchmarking tools in the testing package, which run your code many times and average the results. This may give you a better idea of whether the fan-out speeds things up.
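Benchmarks normally live in a _test.go file as func BenchmarkXxx(b *testing.B) and run with go test -bench. As a self-contained sketch, testing.Benchmark can also drive a benchmark from an ordinary program. The countMatches workload and the benchCount helper are hypothetical, standing in for Search on simplified data.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps the result alive so the call is not optimized away.
var sink int

// countMatches is a hypothetical workload standing in for Search.
func countMatches(data []string, term string) int {
	n := 0
	for _, s := range data {
		if s == term {
			n++
		}
	}
	return n
}

// benchCount measures countMatches over size records using the testing
// package's benchmark driver, which picks the iteration count itself.
func benchCount(size int) testing.BenchmarkResult {
	data := make([]string, size)
	for i := range data {
		data[i] = "String One"
	}
	return testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = countMatches(data, "This")
		}
	})
}

func main() {
	res := benchCount(20000)
	// Prints the iteration count and the average time per iteration.
	fmt.Println(res.N, "iterations,", res.NsPerOp(), "ns/op")
}
```

Measuring the sequential and the fan-out version with the same driver, on the same data, gives a fairer comparison than timing two test cases that each generate their own input.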

Resources