How to read MySQL data in bulk using goroutines and channels - Go

I'm new to Go and need to read a large amount of data from MySQL, so I want to use goroutines and channels to fetch it with good performance, but I don't know how to keep each goroutine from reading duplicate data or how to keep the whole process stable. For instance, the table schema is as below; I want to get all records whose create_time is smaller than 1000000000000000000, create 10 goroutines, and read the data concurrently, with each goroutine doing some business logic. How should I design the code? Thank you.
id | content | last_id | create_time

I would suggest creating a single goroutine that publishes the data to a channel. Then you can add listener goroutines to handle the published data.
This can be done as follows:
Main:
const GoroutineCount = 10

type SomeData []int

func main() {
    ch := make(chan SomeData, 1)
    var wg sync.WaitGroup
    go PublishData(ch)
    for i := 0; i < GoroutineCount; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            ProcessData(ch) // each goroutine handles one chunk
        }()
    }
    wg.Wait() // wait for the consumers, otherwise main exits before the work is done
}
For the sake of the example, I have used a simple slice of int as the data; this can be a slice of any struct as required.
Publish data to the channel:
const ChunkSize = 1000

func PublishData(ch chan SomeData) {
    // Assume having 10000 records in result set
    // This has to come from db transaction
    res := make([]int, 10000)
    // split into chunks of 1000
    chunks := GetChunk(res)
    // write chunk data to channel
    for i := range chunks {
        ch <- chunks[i]
    }
}
func GetChunk(input SomeData) []SomeData {
    var result []SomeData
    boundary := len(input)
    index := 0
    for index = 0; boundary >= ChunkSize; index += ChunkSize {
        boundary -= ChunkSize
        lastIndex := index + ChunkSize
        result = append(result, input[index:lastIndex])
    }
    boundary = len(input) % ChunkSize
    if boundary > 0 {
        lastIndex := index + boundary
        result = append(result, input[index:lastIndex])
    }
    return result
}
Process individual chunks as:
func ProcessData(ch chan SomeData) {
    // Read single chunk
    res := <-ch
    // Process chunk data
    fmt.Printf("len %d\n", len(res))
}
Code on go playground: https://play.golang.org/p/X9Q991h6ru_n
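To connect this back to the original question: the publisher is the only place that touches MySQL, and because that single goroutine pages through the table by id, the worker goroutines can never receive overlapping rows. Below is only a rough sketch of such a publisher, assuming the standard database/sql package with a MySQL driver; my_table is a placeholder name (only the id and create_time columns come from the question), and it reuses SomeData and ChunkSize from the code above.
// PublishRows pages through the table with keyset pagination on id, so no row
// is read twice, and sends each chunk of ids to the channel.
// Assumes database/sql with a MySQL driver; my_table is a placeholder name.
func PublishRows(db *sql.DB, ch chan<- SomeData) {
    defer close(ch)
    lastID := 0
    for {
        rows, err := db.Query(
            `SELECT id FROM my_table WHERE create_time < ? AND id > ? ORDER BY id LIMIT ?`,
            int64(1000000000000000000), lastID, ChunkSize)
        if err != nil {
            log.Println("query:", err)
            return
        }
        chunk := make(SomeData, 0, ChunkSize)
        for rows.Next() {
            var id int
            if err := rows.Scan(&id); err != nil {
                log.Println("scan:", err)
                break
            }
            chunk = append(chunk, id)
            lastID = id
        }
        rows.Close()
        if len(chunk) == 0 {
            return // no more rows matching the condition
        }
        ch <- chunk
    }
}
Because lastID only ever moves forward and is owned by this one goroutine, each chunk covers a distinct id range, which is what avoids duplication across the workers.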

Related

In Go, how do we make concurrent calls while preserving the order of the list?

To give you some context:
The variable elementInput is dynamic; I do not know its exact length. It can be 10, 5, etc.
The *Element channel type is a struct.
My example is working, but my problem is that this implementation is still synchronous, because I wait for the channel to return before I can append to my result.
Can you please help me call the GetElements() function concurrently while preserving the order defined in elementInput (based on index)?
elementInput := []string{FB_FRIENDS, BEAUTY_USERS, FITNESS_USERS, COMEDY_USERS}
wg.Add(len(elementInput))
for _, v := range elementInput {
    //create channel
    channel := make(chan *Element)
    //concurrent call
    go GetElements(ctx, page, channel)
    //Preserve the order
    var elementRes = *<-channel
    if len(elementRes.List) > 0 {
        el = append(el, elementRes)
    }
}
wg.Wait()
wg.Wait()
Your implementation is not concurrent.
The reason: after every goroutine call you wait for its result, and that is what makes this serial.
Below is a sample implementation similar to your flow:
main calls the Concurrency method, which starts the calls concurrently;
afterwards we loop and collect the response from every call made above;
the main goroutine then sleeps for 2 seconds.
Go Playground with running code -> Sample Application
func main() {
    Concurrency()
    time.Sleep(2 * time.Second)
}

func response(greeter string, channel chan *string) {
    reply := fmt.Sprintf("hello %s", greeter)
    channel <- &reply
}

func Concurrency() {
    events := []string{"ALICE", "BOB"}
    channels := make([]chan *string, 0)
    // start concurrently
    for _, event := range events {
        channel := make(chan *string)
        go response(event, channel)
        channels = append(channels, channel)
    }
    // collect response
    response := make([]string, len(channels))
    for i := 0; i < len(channels); i++ {
        response[i] = *<-channels[i]
    }
    // print response
    log.Printf("channel response %v", response)
}
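Applied to your own snippet, the same pattern would look roughly like this. It is only a sketch under a few assumptions: GetElements keeps the signature from your code and sends exactly one *Element on the channel it is given, and ctx, page, el, and the elementInput constants are the ones from your example.
elementInput := []string{FB_FRIENDS, BEAUTY_USERS, FITNESS_USERS, COMEDY_USERS}

// start one call per element, remembering each call's channel by index
channels := make([]chan *Element, len(elementInput))
for i := range elementInput {
    channels[i] = make(chan *Element, 1)
    go GetElements(ctx, page, channels[i])
}

// collect in input order, so the results preserve the order of elementInput
for i := range channels {
    elementRes := *<-channels[i]
    if len(elementRes.List) > 0 {
        el = append(el, elementRes)
    }
}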

Conditionally Run Consecutive Go Routines

I have the following piece of code. I'm trying to run 3 goroutines at the same time, never exceeding three. This works as expected, but the code the goroutines run updates a table in the DB.
So the first routine processes the first 50 rows, the second the next 50, the third the next 50, and it repeats. I don't want two routines processing the same rows at the same time, but because of how long the update takes, that happens almost every time.
To solve this, I started flagging the rows with a new bool column, processing. I set it to true for all rows about to be updated when a routine starts, and sleep the script for 6 seconds to allow the flag to be updated.
This works for a random amount of time, but every now and then I'll see 2-3 jobs processing the same rows again. The method I'm using to prevent duplicate updates feels a bit janky, and I was wondering if there is a better way.
stopper := make(chan struct{}, 3)
var counter int
for {
    counter++
    stopper <- struct{}{}
    go func(db *sqlx.DB, c int) {
        fmt.Println("start")
        updateTables(db)
        fmt.Println("stop")
        <-stopper
    }(db, counter)
    time.Sleep(6 * time.Second)
}
In updateTables:
var ids []string
err := sqlx.Select(db, &data, `select * from table_data where processing = false`)
if err != nil {
    panic(err)
}
for _, row := range data {
    ids = append(ids, row.Id)
}
if len(ids) == 0 {
    return
}
for _, row := range data {
    _, err = db.Exec(`update table_data set processing = true where id = $1`, row.Id)
    if err != nil {
        panic(err)
    }
}
// Additional row processing
I think there's a misunderstanding about the approach to goroutines in this case.
Goroutines doing this kind of work should be treated like worker threads, using channels as the communication method between the main goroutine (which does the synchronization) and the worker goroutines (which do the actual job).
package main

import (
    "log"
    "sync"
    "time"
)

type record struct {
    id int
}

func main() {
    const WORKER_COUNT = 10
    recordschan := make(chan record)
    var wg sync.WaitGroup
    for k := 0; k < WORKER_COUNT; k++ {
        wg.Add(1)
        // Create the worker which will be doing the updates
        go func(workerID int) {
            defer wg.Done() // Marking the worker as done
            for record := range recordschan {
                updateRecord(record)
                log.Printf("req %d processed by worker %d", record.id, workerID)
            }
        }(k)
    }
    // Feeding the records channel
    for _, record := range fetchRecords() {
        recordschan <- record
    }
    // Closing our channel as we're not using it anymore
    close(recordschan)
    // Waiting for all the go routines to finish
    wg.Wait()
    log.Println("we're done!")
}

func fetchRecords() []record {
    result := []record{}
    for k := 0; k < 100; k++ {
        result = append(result, record{k})
    }
    return result
}

func updateRecord(req record) {
    time.Sleep(200 * time.Millisecond)
}
You can even buffer things in the main goroutine if you need to update batches of 50 rows at once.
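For instance, here is a minimal sketch of that batching, assuming the same record type, worker-pool setup (wg, WORKER_COUNT), and helper functions as the example above; the batch size of 50 mirrors the chunk size from the question.
const BatchSize = 50

// workers now receive whole batches instead of single records
batchchan := make(chan []record)
for k := 0; k < WORKER_COUNT; k++ {
    wg.Add(1)
    go func(workerID int) {
        defer wg.Done()
        for batch := range batchchan {
            for _, rec := range batch {
                updateRecord(rec)
            }
            log.Printf("batch of %d processed by worker %d", len(batch), workerID)
        }
    }(k)
}

// the main goroutine buffers records into slices of BatchSize before sending
records := fetchRecords()
for start := 0; start < len(records); start += BatchSize {
    end := start + BatchSize
    if end > len(records) {
        end = len(records)
    }
    batchchan <- records[start:end]
}
close(batchchan)
wg.Wait()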

Go code with execution control using channels

I'm extracting all the data from a long Redshift table in chunks, writing each chunk to a CSV file. I want to control how many files are created at the "same" time (concurrently): if the whole process will create 10 files, I want to, say, create 4 files, wait until they are done, then create another 4, and then the remaining 2.
How can I achieve this using channels?
I've tried to change the following slice to a channel, but I couldn't get it to work: as I said, my implementation did not wait for the first 4 files to finish before creating the following ones.
Right now I'm doing the following using a WaitGroup:
package, imports, var, etc...

//Inside func main:
//Create a WaitGroup
var wg = sync.WaitGroup{}

//Opening the connection
db, err := sql.Open("postgres", connStr)
if err != nil {
    panic(err.Error())
}
defer db.Close()

//Define chunks using a slice
chunkSizer := Slicer(totalRowsInTable, numberRowsByChunk) // e.g. []int{100, 100, 100... 100}

//Iterating over the slice
for index, value := range chunkSizer {
    wg.Add(1)
    go ExtractorToCSV(db, queriedSection, expFileName)
    if (index+1)%4 == 0 { // <-- 4 is the maximum number of files created at the "same" time
        wg.Wait()
    }
}
wg.Wait() // <-- waits for the remaining files (last 2 in this case)

//Outside main
func ExtractorToCSV(db *sql.DB, queryToExtract, fileName string) {
    //... do its process
    wg.Done()
}
I've tried using a buffered channel of the size I wanted to allow (4 in this case), but I didn't use it properly, I don't know...
Thanks in advance.
UPDATED - STOP CONDITION
You can use a channel to hold back the next iteration like this. This is the minimum code I wrote for you; tweak it as you like.
var doneCh = make(chan bool)

func main() {
    WRITE_POOL := 4
    MAX := len(RANGE) // RANGE stands for your slice of chunks
    for index, val := range RANGE {
        go extractToFile(val)
        if (index+1)%WRITE_POOL == 0 {
            // wait for a full pool of writes to finish
            // when the iteration count is a multiple of WRITE_POOL
            for i := 0; i < WRITE_POOL; i++ {
                <-doneCh
            }
        } else if index == MAX-1 {
            // wait for whatever writes are left to finish
            // when this is the last value
            LEFT := MAX % WRITE_POOL
            for i := 0; i < LEFT; i++ {
                <-doneCh
            }
        }
    }
}

func extractToFile(val int) {
    os.Create(fmt.Sprintf("test-%d", val))
    doneCh <- true
}
For better performance, try to:
Create a data channel so the main function can send the data to it and ExtractorToCSV can receive it.
Run ExtractorToCSV as goroutines that read from the data channel; after ExtractorToCSV finishes writing a file, it sends true to doneCh.
Send the DB data to the data channel from main, and receive from doneCh as each file completes.
A rough sketch of that flow is below.
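The sketch makes a few assumptions: ExtractorToCSV keeps the signature from the question but no longer calls wg.Done(), and queriedSections stands in for however you build the per-chunk queries; the file names are placeholders.
dataCh := make(chan string) // carries one query/section per file to write
doneCh := make(chan bool)   // signals each finished file

const WritePool = 4

// a fixed pool of workers, so at most WritePool files are written concurrently
for i := 0; i < WritePool; i++ {
    go func() {
        for query := range dataCh {
            ExtractorToCSV(db, query, query+".csv") // placeholder file name
            doneCh <- true
        }
    }()
}

// feed the work, then wait for one done signal per chunk
go func() {
    for _, section := range queriedSections {
        dataCh <- section
    }
    close(dataCh)
}()
for range queriedSections {
    <-doneCh
}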

memory pooling and buffered channel with multiple goroutines

I'm creating a program which creates random bson.M documents and inserts them into a database.
The main goroutine generates the documents and pushes them to a buffered channel. At the same time, two goroutines fetch the documents from the channel and insert them into the database.
This process takes a lot of memory and puts too much pressure on the garbage collector, so I'm trying to implement a memory pool to limit the number of allocations.
Here is what I have so far:
package main

import (
    "fmt"
    "math/rand"
    "sync"
    "time"

    "gopkg.in/mgo.v2/bson"
)

type List struct {
    L []bson.M
}

func main() {
    var rndSrc = rand.NewSource(time.Now().UnixNano())
    pool := sync.Pool{
        New: func() interface{} {
            l := make([]bson.M, 1000)
            for i := range l {
                m := bson.M{}
                l[i] = m
            }
            return &List{L: l}
        },
    }
    // buffered channel to store generated bson.M docs
    var record = make(chan List, 3)
    // start workers to insert docs in database
    for i := 0; i < 2; i++ {
        go func() {
            for r := range record {
                fmt.Printf("first: %v\n", r.L[0])
                // do the insert etc
            }
        }()
    }
    // feed the channel
    for i := 0; i < 100; i++ {
        // get an object from the pool instead of creating a new one
        list := pool.Get().(*List)
        // re-generate the documents
        for j := range list.L {
            list.L[j]["key1"] = rndSrc.Int63()
        }
        // push the docs to the channel, and return them to the pool
        record <- *list
        pool.Put(list)
    }
}
But it looks like one List is used 4 times before being regenerated:
> go run test.go
first: map[key1:943279487605002381 key2:4444061964749643436]
first: map[key1:943279487605002381 key2:4444061964749643436]
first: map[key1:943279487605002381 key2:4444061964749643436]
first: map[key1:943279487605002381 key2:4444061964749643436]
first: map[key1:8767993090152084935 key2:8807650676784718781]
...
Why isn't the list regenerated each time? How can I fix this?
The problem is that you have created a buffered channel with var record = make(chan List, 3). Hence this code:
record <- *list
pool.Put(list)
may return immediately, and the entry will be placed back into the pool before it has been consumed. Hence the underlying slice will likely be modified in another loop iteration before your consumer has had a chance to consume it. Although you are sending List as a value object, remember that the []bson.M slice header points to an allocated backing array and will still point to the same memory when you send a new List value. That is why you are seeing the duplicate output.
To fix it, modify your channel to send the List pointer, make(chan *List, 3), and change your consumer to put the entry back in the pool once finished, e.g.:
for r := range record {
    fmt.Printf("first: %v\n", r.L[0])
    // do the insert etc
    pool.Put(r) // even if an error occurs
}
Your producer should then send the pointer, with the pool.Put removed, i.e.:
record <- list
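Putting the two fragments together, the producer side after the fix would look roughly like this (just a sketch; the pool setup and the rest of the program stay exactly as in the question):
// the channel now carries pointers so the consumer can return them to the pool
var record = make(chan *List, 3)

// feed the channel
for i := 0; i < 100; i++ {
    // get an object from the pool instead of creating a new one
    list := pool.Get().(*List)
    // regenerate the documents in place
    for j := range list.L {
        list.L[j]["key1"] = rndSrc.Int63()
    }
    // send the pointer; the consumer calls pool.Put once it is done with it
    record <- list
}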

conn.Flush() will not flush all records to Redis

This is the code:
func main() {
    ...
    pool := createPool(*redis_server, *redis_pass)
    defer pool.Close()
    c := pool.Get()
    var i int64
    st := tickSec()
    for i = 0; i < *total; i++ {
        r := time.Now().Unix() - rand.Int63n(60*60*24*31*12)
        score, _ := strconv.Atoi(time.Unix(r, 0).Format("2006010215"))
        id := utee.PlainMd5(uuid.NewUUID().String())
        c.Send("ZADD", "app_a_5512", score, id)
        if i%10000 == 0 {
            c.Flush()
            log.Println("current sync to redis", i)
        }
    }
    //c.Flush()
    c.Close()
    ...
}
If I use c.Close() with total set to 100000, the real sorted-set count is 100000.
But if I use c.Flush() with total also set to 100000, the real sorted-set count is less than 100000 (96932); if I add a time.Sleep() at the end of the main func, the count is 100000 too.
When the main func exits, is the flush not complete? And why? Thank you!
The reason the program works when Close() is called after the loop is that the pooled connection's Close() method reads and discards all pending responses.
The application should Receive the responses for all commands instead of letting the responses back up and consume memory on the server. There's no need to flush in the loop.
go func() {
    for i := int64(0); i < *total; i++ {
        r := time.Now().Unix() - rand.Int63n(60*60*24*31*12)
        score, _ := strconv.Atoi(time.Unix(r, 0).Format("2006010215"))
        id := utee.PlainMd5(uuid.NewUUID().String())
        c.Send("ZADD", "app_a_5512", score, id)
    }
    c.Flush()
}()
for i := int64(0); i < *total; i++ {
    c.Receive()
}
c.Close()
Also, the application should check and handle the errors returned from Send, Flush and Receive.
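For example, here is a minimal sketch of the same loop with those errors checked, assuming the redigo connection c and the helper calls from the question; the logging is only illustrative.
go func() {
    for i := int64(0); i < *total; i++ {
        r := time.Now().Unix() - rand.Int63n(60*60*24*31*12)
        score, _ := strconv.Atoi(time.Unix(r, 0).Format("2006010215"))
        id := utee.PlainMd5(uuid.NewUUID().String())
        if err := c.Send("ZADD", "app_a_5512", score, id); err != nil {
            log.Println("send:", err)
            return
        }
    }
    if err := c.Flush(); err != nil {
        log.Println("flush:", err)
    }
}()
for i := int64(0); i < *total; i++ {
    if _, err := c.Receive(); err != nil {
        log.Println("receive:", err)
        break
    }
}
c.Close()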
