How to optimise processing of large amounts of data in Go

The objective of my backend service is to process 90 million records, at least 10 million of them in 1 day.
My system config:
RAM: 2000 MB
CPU: 2 core(s)
What I am doing right now is something like this:
var wg sync.WaitGroup
// length of evs is 4455
for range evs {
    wg.Add(1)
    go migrate(&wg)
}
wg.Wait()

func migrate(wg *sync.WaitGroup) {
    defer wg.Done()
    // processing of one ev happens here
    time.Sleep(time.Second)
}

Without knowing more detail about the type of work you need to do, your approach seems good. Some things to think about:
Re-use variables and/or clients in your processing loop. For example, reuse an HTTP client instead of recreating one.
Depending on how your use case needs to handle failures, it might be efficient to use errgroup. It's a convenience wrapper that stops all the goroutines on the first error, possibly saving you a lot of time.
In the migrate function, be aware of the caveats regarding closures and goroutines.
import (
    "fmt"
    "net/http"

    "golang.org/x/sync/errgroup"
)

func main() {
    g := new(errgroup.Group)
    var urls = []string{
        "http://www.someasdfasdfstupidname.com/",
        "ftp://www.golang.org/",
        "http://www.google.com/",
    }
    for _, url := range urls {
        url := url // https://golang.org/doc/faq#closures_and_goroutines
        g.Go(func() error {
            resp, err := http.Get(url)
            if err == nil {
                resp.Body.Close()
            }
            return err
        })
    }
    fmt.Println("waiting")
    if err := g.Wait(); err == nil {
        fmt.Println("Successfully fetched all URLs.")
    } else {
        fmt.Println(err)
    }
}

I have got the solution. To achieve this much processing, what I have done is
limit the number of goroutines to 50 and increase the number of cores from 2 to 5.
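For illustration, a minimal sketch of that kind of cap, assuming the evs slice and migrate function from the question and using a buffered channel as a semaphore (the limit of 50 matches the number mentioned above):

sem := make(chan struct{}, 50) // at most 50 migrations in flight
var wg sync.WaitGroup
for range evs {
    wg.Add(1)
    sem <- struct{}{} // blocks once 50 goroutines are already running
    go func() {
        defer func() { <-sem }() // free the slot when this migration finishes
        migrate(&wg)             // migrate calls wg.Done itself, as in the question
    }()
}
wg.Wait()

The send on sem blocks whenever 50 goroutines are already working, so memory and scheduler pressure stay bounded no matter how large evs grows.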

Related

Use go worker pool implementation to write files in parallel?

I have a slice clientFiles which I am iterating over sequentially, writing each file to S3 one by one as shown below:
for _, v := range clientFiles {
    err := writeToS3(v.FileContent, s3Connection, v.FileName, bucketName, v.FolderName)
    if err != nil {
        fmt.Println(err)
    }
}
The above code works fine, but I want to write to S3 in parallel so that I can speed things up. Would a worker pool implementation work better here, or is there another, better option?
I found the code below, which uses a wait group, but I am not sure if it is the better option to work with:
wg := sync.WaitGroup{}
for _, v := range clientFiles {
    wg.Add(1)
    go func(v ClientMapFile) {
        err := writeToS3(v.FileContent, s3Connection, v.FileName, bucketName, v.FolderName)
        if err != nil {
            fmt.Println(err)
        }
    }(v)
}
Yes, parallelising should help.
Your code should work well after some changes to how the WaitGroup is used. You need to mark each item of work as Done and wait for all goroutines to finish after the for loop.
var wg sync.WaitGroup
for _, v := range clientFiles {
    wg.Add(1)
    go func(v ClientMapFile) {
        defer wg.Done()
        err := writeToS3(v.FileContent, s3Connection, v.FileName, bucketName, v.FolderName)
        if err != nil {
            fmt.Println(err)
        }
    }(v)
}
wg.Wait()
Be aware that your solution creates N goroutines for N files, which can be suboptimal if the number of files is very big. In that case, use the worker-pool pattern from https://gobyexample.com/worker-pools and try different numbers of workers to find what works best for you in terms of performance.
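As a rough sketch of that worker-pool pattern applied here, assuming the writeToS3 function and ClientMapFile type from the question (numWorkers is an arbitrary starting value to tune):

jobs := make(chan ClientMapFile)
var wg sync.WaitGroup

const numWorkers = 8
for w := 0; w < numWorkers; w++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        for v := range jobs {
            if err := writeToS3(v.FileContent, s3Connection, v.FileName, bucketName, v.FolderName); err != nil {
                fmt.Println(err)
            }
        }
    }()
}

for _, v := range clientFiles {
    jobs <- v
}
close(jobs) // lets each worker's range loop finish
wg.Wait()

Closing jobs after the producer loop lets the workers drain the remaining items and exit, and wg.Wait() returns once every upload has completed.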

Streaming values from websocket, determining if I am lagging behind processing the data

I am connecting to a websocket that streams live stock trades.
I have to read the prices, perform calculations on the fly and based on these calculations make another API call e.g. buy or sell.
I want to ensure my calculations/processing doesn't slow down my ability to stream in all the live data.
What is a good design pattern to follow for this type of problem?
Is there a way to log/warn in my system to know if I am falling behind?
Falling behind means: the websocket is sending price data, and I am not able to process it as fast as it comes in, so I lag behind.
While doing c.ReadJSON and then passing the message to my channel, there might be a delay in deserializing the JSON.
While processing a message from the channel, calculating formulas and sending another API request to buy/sell, this adds further delays.
How can I prevent lags/delays, and also monitor whether there actually is a delay?
func main() {
    c, _, err := websocket.DefaultDialer.Dial("wss://socket.example.com/stocks", nil)
    if err != nil {
        panic(err)
    }
    defer c.Close()
    // Buffered channel to account for bursts or spikes in data:
    chanMessages := make(chan interface{}, 10000)
    // Read messages off the buffered queue:
    go func() {
        for msgBytes := range chanMessages {
            logrus.Info("Message Bytes: ", msgBytes)
        }
    }()
    // As little logic as possible in the reader loop:
    for {
        var msg interface{}
        err := c.ReadJSON(&msg)
        if err != nil {
            panic(err)
        }
        chanMessages <- msg
    }
}
You can read the raw bytes, pass them to the channel, and use other goroutines to do the JSON conversion.
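For the monitoring part of the question, one rough signal is how full the buffered channel is: a backlog that keeps growing means the consumer is falling behind the feed. A minimal sketch, assuming the chanMessages channel from the question and an arbitrary 5-second reporting interval:

// Periodically report how much of the buffer is occupied.
go func() {
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()
    for range ticker.C {
        backlog := len(chanMessages)
        if backlog > cap(chanMessages)/2 {
            logrus.Warnf("consumer lagging: %d of %d buffered messages pending", backlog, cap(chanMessages))
        }
    }
}()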
I worked on a similar crypto market bot. Instead of creating a large buffered channel, I created a buffered channel with a capacity of 1 and used a select statement for sending the socket data to the channel.
Here is an example:
var wg sync.WaitGroup
msg := make(chan []byte, 1)
wg.Add(1)
go func() {
    defer wg.Done()
    for data := range msg {
        // decode and process data
    }
}()
for {
    _, data, err := c.ReadMessage()
    if err != nil {
        log.Println("read error: ", err)
        return
    }
    select {
    case msg <- data: // in case channel is free
    default: // if not, next time will try again with latest data
    }
}
This will ensure that you always get the latest data when you are ready to process it.

How to parallelize a recursive function

I am trying to parallelize a recursive problem in Go, and I am unsure what the best way to do this is.
I have a recursive function, which works like this:
func recFunc(input string) (result []string) {
    for subInput := range getSubInputs(input) {
        subOutput := recFunc(subInput)
        result = append(result, subOutput...)
    }
    result = append(result, getOutput(input)...)
    return result
}

func main() {
    output := recFunc("some_input")
    ...
}
So the function calls itself N times (where N is 0 at some level), generates its own output and returns everything in a list.
Now I want to make this function run in parallel, but I am unsure what the cleanest way to do this is. My idea:
Have a "result" channel, to which all function calls send their result.
Collect the results in the main function.
Have a wait group, which determines when all results are collected.
The Problem: I need to wait for the wait group and collect all results in parallel. I can start a separate go function for this, but how do I ever quit this separate go function?
func recFunc(input string, outputChannel chan []string, waitGroup *sync.WaitGroup) {
    defer waitGroup.Done()
    waitGroup.Add(len(getSubInputs(input)))
    for subInput := range getSubInputs(input) {
        go recFunc(subInput, outputChannel, waitGroup)
    }
    outputChannel <- getOutput(input)
}

func main() {
    outputChannel := make(chan []string)
    waitGroup := sync.WaitGroup{}
    waitGroup.Add(1)
    go recFunc("some_input", outputChannel, &waitGroup)
    result := []string{}
    go func() {
        nextResult := <-outputChannel
        result = append(result, nextResult...)
    }()
    waitGroup.Wait()
}
Maybe there is a better way to do this? Or how can I ensure the anonymous go function that collects the results exits when it is done?
tl;dr;
recursive algorithms should have bounded limits on expensive resources (network connections, goroutines, stack space etc.)
cancelation should be supported - to ensure expensive operations can be cleaned up quickly if a result is no longer needed
branch traversal should support error reporting; this allows errors to bubble up the stack & partial results to be returned without the entire traversal failing.
For asynchronous results - whether using recursion or not - the use of channels is recommended. Also, for long-running jobs with many goroutines, provide a method for cancelation (context.Context) to aid with clean-up.
Since recursion can lead to exponential consumption of resources, it's important to put limits in place (see bounded parallelism).
Below is a design pattern I use a lot for asynchronous tasks:
always support taking a context.Context for cancelation
number of workers needed for the task
return a chan of results & a chan error (will only return one error or nil)
var (
    workers = 10
    ctx     = context.TODO() // use request context here - otherwise context.Background()
    input   = "abc"
)
resultC, errC := recJob(ctx, workers, input) // returns results & `error` channels

// asynchronous results - so read that channel first in the event of partial results ...
for r := range resultC {
    fmt.Println(r)
}
// ... then check for any errors
if err := <-errC; err != nil {
    log.Fatal(err)
}
Recursion:
Since recursion quickly scales horizontally, one needs a consistent way to fill the finite pool of workers with work, but also to ensure that when workers free up, they quickly pick up work from other (over-worked) workers.
Rather than create a manager layer, employ a cooperative peer system of workers:
each worker shares a single inputs channel
before recursing on inputs (subInputs), check if any other workers are idle
if so, delegate to that worker
if not, current worker continues recursing that branch
With this algorithm, the finite pool of workers quickly becomes saturated with work. Any worker which finishes its branch early will quickly be delegated a sub-branch from another worker. Eventually all workers will run out of sub-branches, at which point all workers will be idle (blocked) and the recursion task can finish up.
Some careful coordination is needed to achieve this. Allowing the workers to write to the input channel helps with this peer coordination via delegation. A "recursion depth" WaitGroup is used to track when all branches have been exhausted across all workers.
(To include context support and error chaining - I updated your getSubInputs function to take a ctx and return an optional error):
func recFunc(ctx context.Context, input string, in chan string, out chan<- string, rwg *sync.WaitGroup) error {
    defer rwg.Done() // decrement recursion count when a depth of recursion has completed

    subInputs, err := getSubInputs(ctx, input)
    if err != nil {
        return err
    }
    for subInput := range subInputs {
        rwg.Add(1) // about to recurse (or delegate recursion)
        select {
        case in <- subInput:
            // delegated - to another goroutine
        case <-ctx.Done():
            // context canceled...
            // but first we need to undo the earlier `rwg.Add(1)`
            // as this work item was never delegated or handled by this worker
            rwg.Done()
            return ctx.Err()
        default:
            // no one available to delegate to - so this worker will need to recurse this item itself
            err = recFunc(ctx, subInput, in, out, rwg)
            if err != nil {
                return err
            }
        }

        select {
        case <-ctx.Done():
            // always check context when doing anything potentially blocking (in this case writing to `out`)
            // context canceled
            return ctx.Err()
        case out <- subInput:
        }
    }
    return nil
}
Connecting the Pieces:
recJob creates:
input & output channels - shared by all workers
"recursion" WaitGroup detects when all workers are idle
"output" channel can then safely be closed
error channel for all workers
kicks-off recursion workload by writing initial input to input channel
func recJob(ctx context.Context, workers int, input string) (resultsC <-chan string, errC <-chan error) {
    // RW channels
    out := make(chan string)
    eC := make(chan error, 1)

    // R-only channels returned to caller
    resultsC, errC = out, eC

    // create workers + waitgroup logic
    go func() {
        var err error // error that will be returned to the caller via the error channel
        defer func() {
            close(out)
            eC <- err
            close(eC)
        }()

        var wg sync.WaitGroup
        wg.Add(1)
        in := make(chan string) // input channel: shared by all workers (to read from and also to write to when they need to delegate)
        workerErrC := createWorkers(ctx, workers, in, out, &wg)

        // get the ball rolling, pass input job to one of the workers
        // Note: must be done *after* workers are created - otherwise deadlock
        in <- input

        errCount := 0
        // wait for all worker error codes to return
        for err2 := range workerErrC {
            if err2 != nil {
                log.Println("worker error:", err2)
                errCount++
            }
        }

        // all workers have completed
        if errCount > 0 {
            err = fmt.Errorf("PARTIAL RESULT: %d of %d workers encountered errors", errCount, workers)
            return
        }
        log.Printf("All %d workers have FINISHED\n", workers)
    }()
    return
}
Finally, create the workers:
func createWorkers(ctx context.Context, workers int, in chan string, out chan<- string, rwg *sync.WaitGroup) (errC <-chan error) {
    eC := make(chan error) // RW-version
    errC = eC              // RO-version (returned to caller)

    // track the completeness of the workers - so we know when to wrap up
    var wg sync.WaitGroup
    wg.Add(workers)

    for i := 0; i < workers; i++ {
        i := i
        go func() {
            defer wg.Done()
            var err error
            // ensure the current worker's return code gets returned
            // via the common workers' error-channel
            defer func() {
                if err != nil {
                    log.Printf("worker #%3d ERRORED: %s\n", i+1, err)
                } else {
                    log.Printf("worker #%3d FINISHED.\n", i+1)
                }
                eC <- err
            }()
            log.Printf("worker #%3d STARTED successfully\n", i+1)
            // worker scans for input
            for input := range in {
                err = recFunc(ctx, input, in, out, rwg)
                if err != nil {
                    log.Printf("worker #%3d recurseManagers ERROR: %s\n", i+1, err)
                    return
                }
            }
        }()
    }

    go func() {
        rwg.Wait() // wait for all recursion to finish
        close(in)  // safe to close input channel as all workers are blocked (i.e. no new inputs)
        wg.Wait()  // now wait for all workers to return
        close(eC)  // finally, signal to caller we're truly done by closing workers' error-channel
    }()
    return
}
I can start a separate go function for this, but how do I ever quit this separate go function?
You can range over the output channel in the separate goroutine. The goroutine, in that case, will exit safely when the channel is closed:
go func() {
    for nextResult := range outputChannel {
        result = append(result, nextResult...)
    }
}()
So, now the thing we need to take care of is that the channel is closed after all the goroutines spawned as part of the recursive function calls have successfully exited.
For that, you can use a shared wait group across all the goroutines and wait on that wait group in your main function, as you are already doing. Once the wait is over, close the outputChannel so that the other goroutine also exits safely.
func recFunc(input string, outputChannel chan []string, wg *sync.WaitGroup) {
    defer wg.Done()
    for subInput := range getSubInputs(input) {
        wg.Add(1)
        go recFunc(subInput, outputChannel, wg)
    }
    outputChannel <- getOutput(input)
}

func main() {
    outputChannel := make(chan []string)
    waitGroup := sync.WaitGroup{}
    waitGroup.Add(1)
    go recFunc("some_input", outputChannel, &waitGroup)
    result := []string{}
    go func() {
        for nextResult := range outputChannel {
            result = append(result, nextResult...)
        }
    }()
    waitGroup.Wait()
    close(outputChannel)
}
PS: If you want to have bounded parallelism to limit the exponential growth, check this out
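A minimal sketch of that idea, assuming the getSubInputs and getOutput functions from the question and an arbitrary limit of 8 concurrent branches: when no slot is free, the call recurses synchronously in the current goroutine, so the limit itself can never deadlock the recursion.

var sem = make(chan struct{}, 8) // caps how many branches run in their own goroutine

func recFunc(input string, outputChannel chan []string, wg *sync.WaitGroup) {
    defer wg.Done()
    for subInput := range getSubInputs(input) {
        wg.Add(1)
        select {
        case sem <- struct{}{}: // a slot is free: recurse in a new goroutine
            go func(s string) {
                defer func() { <-sem }()
                recFunc(s, outputChannel, wg)
            }(subInput)
        default: // no slot free: recurse in the current goroutine
            recFunc(subInput, outputChannel, wg)
        }
    }
    outputChannel <- getOutput(input)
}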

Allocated a lot of memory in Go. How to fix?

Several hundred MB of memory are allocated for 50 requests of 5 MB each. The memory is allocated and never released.
How can I free this memory? Why does this happen?
I've tried this on Ubuntu on my home PC and on a VPS.
package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "time"
)

func main() {
    fmt.Println("start")
    for i := 0; i < 50; i++ {
        go func() {
            DoRequest()
        }()
        time.Sleep(10 * time.Millisecond)
    }
    time.Sleep(10 * time.Minute)
}

func DoRequest() error {
    requestUrl := "https://blockchain.info/rawblock/0000000000000000000eebedea046425bd54626e6c56eb032e66e714d0141ea6"
    req, err := http.NewRequest("GET", requestUrl, nil)
    if err != nil {
        return err
    }
    req.Header.Set("user-agent", "free")
    httpClient := &http.Client{
        Timeout: time.Second * 10,
    }
    resp, err := httpClient.Do(req)
    if resp != nil {
        defer resp.Body.Close()
    }
    if err != nil {
        return err
    }
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        return err
    }
    fmt.Println("bodylen", len(body))
    return nil
}
Somewhere around 400 MB is allocated.
You are creating an HTTP client for each goroutine.
An http.Client is designed to be created once and used many times. It is goroutine-safe.
It allows for connection reuse and other efficiency savings.
Create the http.Client once in main (instead of in your goroutine) and then pass this single reference to all 50 of your goroutines.
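A minimal sketch of that change, reusing the URL from the question; draining the body with io.Copy(io.Discard, ...) is an extra tweak here so the 5 MB response is not kept in memory just to measure its length:

package main

import (
    "fmt"
    "io"
    "net/http"
    "time"
)

func main() {
    // One client for the whole program; it is safe for concurrent use
    // and its Transport keeps connections alive for reuse.
    httpClient := &http.Client{Timeout: 10 * time.Second}
    for i := 0; i < 50; i++ {
        go func() {
            if err := DoRequest(httpClient); err != nil {
                fmt.Println("request error:", err)
            }
        }()
        time.Sleep(10 * time.Millisecond)
    }
    time.Sleep(10 * time.Minute)
}

// DoRequest receives the shared client instead of constructing its own.
func DoRequest(httpClient *http.Client) error {
    requestUrl := "https://blockchain.info/rawblock/0000000000000000000eebedea046425bd54626e6c56eb032e66e714d0141ea6"
    req, err := http.NewRequest("GET", requestUrl, nil)
    if err != nil {
        return err
    }
    req.Header.Set("user-agent", "free")
    resp, err := httpClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    n, err := io.Copy(io.Discard, resp.Body) // read and discard; avoids holding the whole body
    if err != nil {
        return err
    }
    fmt.Println("bodylen", n)
    return nil
}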
Edit: Also, while it may not make a practical difference in your case, the order for a request is usually like so:
resp, err := httpClient.Do(req)
if err != nil {
    return err // check error first
}
defer resp.Body.Close() // no error - so resp will *NOT* be nil - so this is safe
Edit 2: As #Adrian has mentioned: Go's garbage collection is not instantaneous - nor should it be - as it is an expensive operation. If you no longer need a block of memory, simply stop referencing it. Let the GC do its job, so you can focus on yours!
If you're curious about the evolution of go's GC:
https://blog.golang.org/ismmkeynote (heavy on the technical side)
What kind of Garbage Collection does Go use?
for i := 0; i < 50; i++ {
    go func() {
        DoRequest()
    }()
    time.Sleep(10 * time.Millisecond)
}
Never create goroutines like this. Always make sure you create goroutines in a way that cannot fill up large amounts of (or all) memory in any case, including the worst case.
A simple solution is to control the number of goroutines that can be spawned (or running) at a time.
You can pre-calculate the memory occupied in the worst case by multiplying the maximum number of goroutines you want to run at a time by the maximum memory one goroutine can use.
You can control the number of goroutine instances by using channels.
Refer to the first answer of this Stack Overflow question:
Always have x number of goroutines running at any time
Always use a balanced solution between performance and required resources.
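A rough sketch of that kind of limit, here using golang.org/x/sync/semaphore and assuming the DoRequest function from the question (maxInFlight = 10 is an arbitrary example):

import (
    "context"

    "golang.org/x/sync/semaphore"
)

const maxInFlight = 10

func run() {
    sem := semaphore.NewWeighted(maxInFlight)
    ctx := context.Background()
    for i := 0; i < 50; i++ {
        if err := sem.Acquire(ctx, 1); err != nil { // blocks while maxInFlight requests are running
            break
        }
        go func() {
            defer sem.Release(1)
            DoRequest()
        }()
    }
    // Acquiring every slot waits for the outstanding requests to finish.
    _ = sem.Acquire(ctx, maxInFlight)
}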
Update June 11, 2019
Here is an example Go program:
https://play.golang.org/p/HovNRgp6FxH

Non-blocking way to read from a Pipe

I want to create a simple app which will continuously read the output of another app, process it, and write the processed output to stdout. The other app can produce a lot of data within a second and then be silent for a few minutes.
The problem is that my data-processing algorithm is quite slow, so the main loop is blocked. While the loop is blocked, I'm losing the data that arrives during that time.
cmd := exec.Command("someapp")
stdoutPipe, _ := cmd.StdoutPipe()
stdoutReader := bufio.NewReader(stdoutPipe)
go func() {
bufioReader := bufio.NewReader(stdoutReader)
for {
output, _, err := bufioReader.ReadLine()
if err != nil || err == io.EOF {
break
}
processedOutput := dataProcessor(output);
fmt.Print(processedOutput)
}
}()
Probably the best way to solve this problem is to buffer all output and process it in another goroutine, but I'm not sure how to implement this in Go. What is the most idiomatic way to solve this problem?
You can have two goroutines: one supplier and one consumer. The supplier executes the command and passes the data to the consumer over a channel.
cmd := exec.Command("someapp")
stdoutPipe, _ := cmd.StdoutPipe()
stdoutReader := bufio.NewReader(stdoutPipe)
Datas := make(chan Data, 100)
go dataProcessor(Datas)
bufioReader := bufio.NewReader(stdoutReader)
for {
output, _, err := bufioReader.ReadLine()
var tempData Data
tempData.Out = output
if err != nil || err == io.EOF {
break
}
Datas <- tempData
}
}
And then you process the data in the dataProcessor func:
func dataProcessor(Datas <-chan Data) {
    // Ranging over the channel blocks until data arrives and exits when the
    // channel is closed, avoiding a busy loop.
    for data := range Datas {
        fmt.Println(data)
    }
}
Obviously this is a very simple example and you should customize it and make it better. Read more about channels and goroutines; reading this tutorial could be helpful.
