I am retrieving payloads from a REST API, which I then want to insert into a Snowflake table.
My current process is to use the Snowflake DB connection and iterate over a slice of structs (which contain my data from the API). However, this doesn't seem efficient or optimal. Everything loads successfully, but I am trying to figure out how to optimize a large number of inserts for potentially thousands of records. Perhaps there needs to be a separate channel for insertions instead of inserting synchronously?
General code flow:
import (
"database/sql"
"fmt"
"sync"
"time"
_ "github.com/snowflakedb/gosnowflake"
)
func ETL() {
var wg sync.WaitGroup
ch := make(chan []*Response)
defer close(ch)
// Create requests to API
for _, req := range requests {
// All of this flows fine without issue
wg.Add(1)
go func(request Request) {
defer wg.Done()
resp, _ := request.Get()
ch <- resp
}(req)
}
// Connect to snowflake
// This is not a problem
connString := fmt.Sprintf(config...)
db, _ := sql.Open("snowflake", connString)
defer db.Close()
// Collect responses from our channel
results := make([][]*Response, len(requests))
for i := range results {
results[i] = <-ch
for _, res := range results[i] {
// transform is just a function that flattens my structs into entries that I would like to insert into Snowflake. This is not a bottleneck.
entries := transform(res)
// Load the data into Snowflake, passing the flattened entries
// as well as the db connection.
if err := load(entries, db); err != nil {
fmt.Println(err)
}
}
}
}
type Entry struct {
field1 string
field2 string
statusCode int
}
func load(entries []*Entry, db *sql.DB) error {
start := time.Now()
for i, entry := range entries {
fmt.Printf("Loading entry %d\n", i)
stmt := `INSERT INTO tbl (field1, field2, updated_date, status_code)
VALUES (?, ?, CURRENT_TIMESTAMP(), ?)`
_, err := db.Exec(stmt, entry.field1, entry.field2, entry.statusCode)
if err != nil {
fmt.Println(err)
return err
}
}
fmt.Println("Load time: ", time.Since(start))
return nil
}
Instead of INSERTing individual rows, collect rows into files; each time you push one of these files to S3/GCS/Azure, it will be loaded immediately.
I wrote a post detailing these steps:
https://medium.com/snowflake/lightweight-batch-streaming-to-snowflake-with-gcp-pub-sub-1790ab76da31
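As a rough illustration (not from the post), here is a minimal sketch of collecting a batch of entries into a gzip-compressed CSV file in Go; the Entry type is borrowed from the question, and the output filename is arbitrary. The resulting file would then be uploaded to the S3/GCS/Azure location your stage points at:
package main

import (
	"compress/gzip"
	"encoding/csv"
	"log"
	"os"
	"strconv"
)

// Entry mirrors the struct from the question.
type Entry struct {
	field1     string
	field2     string
	statusCode int
}

// writeBatchFile writes a batch of entries into a gzip-compressed CSV file,
// which can then be pushed to cloud storage for a pipe to auto-ingest.
func writeBatchFile(entries []*Entry, path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	gz := gzip.NewWriter(f)
	defer gz.Close()

	w := csv.NewWriter(gz)
	for _, e := range entries {
		if err := w.Write([]string{e.field1, e.field2, strconv.Itoa(e.statusCode)}); err != nil {
			return err
		}
	}
	w.Flush()
	return w.Error()
}

func main() {
	entries := []*Entry{{field1: "a", field2: "b", statusCode: 200}}
	if err := writeBatchFile(entries, "entries_batch_0001.csv.gz"); err != nil {
		log.Fatal(err)
	}
}
From there, a single COPY INTO from the stage (or the auto-ingest pipe shown below) loads the whole file in one operation instead of thousands of single-row INSERTs.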
With the appropriate storage integration, this would auto-ingest the files:
create pipe temp.public.meetup202011_pipe
auto_ingest = true
integration = temp_meetup202011_pubsub_int
as
copy into temp.public.meetup202011_rsvps
from @temp_fhoffa_gcs_meetup;
Also check these considerations:
https://www.snowflake.com/blog/best-practices-for-data-ingestion/
Soon: If you want to send individual rows and ingest them in real time into Snowflake - that's in development (https://www.snowflake.com/blog/snowflake-streaming-now-hiring-help-design-and-build-the-future-of-big-data-and-stream-processing/).
I am trying to design an HTTP client in Go that is capable of concurrent API calls to web services and writes some data to a text file.
func getTotalCalls() int {
reader := bufio.NewReader(os.Stdin)
...
return callInt
}
getTotalCalls decides how many calls I want to make; the input comes from the terminal.
func writeToFile(s string, namePrefix string) {
fileStore := fmt.Sprintf("./data/%s_calls.log", namePrefix)
...
defer f.Close()
if _, err := f.WriteString(s); err != nil {
log.Println(err)
}
}
writeToFile writes data to a file synchronously from the buffered channel.
func makeRequest(url string, ch chan<- string, id int) {
var jsonStr = []byte(`{"from": "Saru", "message": "Saru to Discovery. Over!"}`)
req, err := http.NewRequest("POST", url, bytes.NewBuffer(jsonStr))
req.Header.Set("Content-Type", "application/json")
client := &http.Client{}
start := time.Now()
resp, err := client.Do(req)
if err != nil {
panic(err)
}
secs := time.Since(start).Seconds()
defer resp.Body.Close()
body, _ := ioutil.ReadAll(resp.Body)
ch <- fmt.Sprintf("%d, %.2f, %d, %s, %s\n", id, secs, len(body), url, body)
}
This is the function that makes the API call in a goroutine.
Finally, here is the main function, which sends data from the goroutines to a buffered channel; later I range over the buffered channel of strings and write the data to a file.
func main() {
urlPrefix := os.Getenv("STARCOMM_GO")
url := urlPrefix + "discovery"
totalCalls := getTotalCalls()
queue := make(chan string, totalCalls)
for i := 1; i <= totalCalls; i++ {
go makeRequest(url, queue, i)
}
for item := range queue {
fmt.Println(item)
writeToFile(item, fmt.Sprint(totalCalls))
}
}
The problem is that at the end of all the calls the buffered channel somehow blocks and the program waits forever. Does someone have a better way to design such a use case? My final goal is to check, for different numbers of concurrent POST requests, how much time each call takes, in order to benchmark the API endpoint for 5, 10, 50, 100, 500, 1000 ... concurrent calls.
Something has to close(queue). Otherwise range queue will block. If you want to range queue, you have to ensure that this channel is closed once the final client is done.
However, it's not even clear that you need to range over queue, since you know exactly how many results you'll get - it's totalCalls. You just need to loop that many times, receiving from queue, as sketched below.
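For illustration, a minimal sketch of what main could look like, reusing getTotalCalls, makeRequest and writeToFile from the question unchanged:
func main() {
	urlPrefix := os.Getenv("STARCOMM_GO")
	url := urlPrefix + "discovery"
	totalCalls := getTotalCalls()
	queue := make(chan string, totalCalls)

	for i := 1; i <= totalCalls; i++ {
		go makeRequest(url, queue, i)
	}

	// Receive exactly totalCalls results. No close(queue) is needed
	// because we never range over the channel.
	for i := 0; i < totalCalls; i++ {
		item := <-queue
		fmt.Println(item)
		writeToFile(item, fmt.Sprint(totalCalls))
	}
}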
I believe your use case is similar to the Worker Pools example on gobyexample, so you may want to check that one out. Here's the code from that example:
// In this example we'll look at how to implement
// a _worker pool_ using goroutines and channels.
package main
import (
"fmt"
"time"
)
// Here's the worker, of which we'll run several
// concurrent instances. These workers will receive
// work on the `jobs` channel and send the corresponding
// results on `results`. We'll sleep a second per job to
// simulate an expensive task.
func worker(id int, jobs <-chan int, results chan<- int) {
for j := range jobs {
fmt.Println("worker", id, "started job", j)
time.Sleep(time.Second)
fmt.Println("worker", id, "finished job", j)
results <- j * 2
}
}
func main() {
// In order to use our pool of workers we need to send
// them work and collect their results. We make 2
// channels for this.
const numJobs = 5
jobs := make(chan int, numJobs)
results := make(chan int, numJobs)
// This starts up 3 workers, initially blocked
// because there are no jobs yet.
for w := 1; w <= 3; w++ {
go worker(w, jobs, results)
}
// Here we send 5 `jobs` and then `close` that
// channel to indicate that's all the work we have.
for j := 1; j <= numJobs; j++ {
jobs <- j
}
close(jobs)
// Finally we collect all the results of the work.
// This also ensures that the worker goroutines have
// finished. An alternative way to wait for multiple
// goroutines is to use a [WaitGroup](waitgroups).
for a := 1; a <= numJobs; a++ {
<-results
}
}
Your "worker" makes HTTP requests, otherwise it's pretty much the same pattern. Note the for loop at the end which reads from the channel a known number of times.
If you need to limit the number of simultaneous requests, you can use a semaphore implemented with a buffered channel.
func makeRequest(url string, id int) string {
var jsonStr = []byte(`{"from": "Saru", "message": "Saru to Discovery. Over!"}`)
req, err := http.NewRequest("POST", url, bytes.NewBuffer(jsonStr))
req.Header.Set("Content-Type", "application/json")
client := &http.Client{}
start := time.Now()
resp, err := client.Do(req)
if err != nil {
panic(err)
}
secs := time.Since(start).Seconds()
defer resp.Body.Close()
body, _ := ioutil.ReadAll(resp.Body)
return fmt.Sprintf("%d, %.2f, %d, %s, %s\n", id, secs, len(body), url, body)
}
func main() {
urlPrefix := os.Getenv("STARCOMM_GO")
url := urlPrefix + "discovery"
totalCalls := getTotalCalls()
concurrencyLimit := 50 // 5, 10, 50, 100, 500, 1000.
// Declare semaphore as a buffered channel with capacity limited by concurrency level.
semaphore := make(chan struct{}, concurrencyLimit)
for i := 1; i <= totalCalls; i++ {
// Take a slot in semaphore before proceeding.
// Once all slots are taken this call will block until slot is freed.
semaphore <- struct{}{}
go func(i int) { // Pass i as an argument to avoid capturing the loop variable.
// Release slot on job finish.
defer func() { <-semaphore }()
item := makeRequest(url, i)
fmt.Println(item)
// Beware that writeToFile will be called concurrently and may need some synchronization.
writeToFile(item, fmt.Sprint(totalCalls))
}(i)
}
// Wait for jobs to finish by filling semaphore to full capacity.
for i := 0; i < cap(semaphore); i++ {
semaphore <- struct{}{}
}
close(semaphore)
}
I'm using Go routines to send queries to PostgreSQL master and slave nodes in parallel. The first host that returns a valid result wins. Error cases are outside the scope of this question.
The caller is the only one that cares about the contents of a *sql.Rows object, so intentionally my function doesn't do any operations on those. I use buffered channels to retrieve return objects from the Go routines, so there should be no Go routine leak. Garbage collection should take care of the rest.
There is a problem I haven't thought about properly: the Rows objects that remain behind in the channel are never closed. When I call this function from a (read-only) transaction, tx.Rollback() returns an error for every instance of a non-closed Rows object: "unexpected command tag SELECT".
This function is called from higher level objects:
func multiQuery(ctx context.Context, xs []executor, query string, args ...interface{}) (*sql.Rows, error) {
rc := make(chan *sql.Rows, len(xs))
ec := make(chan error, len(xs))
for _, x := range xs {
go func(x executor) {
rows, err := x.QueryContext(ctx, query, args...)
switch { // Make sure only one of them is returned
case err != nil:
ec <- err
case rows != nil:
rc <- rows
}
}(x)
}
var me MultiError
for i := 0; i < len(xs); i++ {
select {
case err := <-ec:
me.append(err)
case rows := <-rc: // Return on the first success
return rows, nil
}
}
return nil, me.check()
}
Executors can be *sql.DB, *sql.Tx or anything that complies with the interface:
type executor interface {
ExecContext(ctx context.Context, query string, args ...interface{}) (sql.Result, error)
QueryContext(ctx context.Context, query string, args ...interface{}) (*sql.Rows, error)
QueryRowContext(ctx context.Context, query string, args ...interface{}) *sql.Row
}
Rollback logic:
func (mtx MultiTx) Rollback() error {
ec := make(chan error, len(mtx))
for _, tx := range mtx {
go func(tx *Tx) {
err := tx.Rollback()
ec <- err
}(tx)
}
var me MultiError
for i := 0; i < len(mtx); i++ {
if err := <-ec; err != nil {
me.append(err)
}
}
return me.check()
}
MultiTx is a collection of open transactions on multiple nodes. It is a higher-level object that calls multiQuery().
What would be the best approach to "clean up" unused rows? Options I'm considering, and why I'm hesitant about each:
Cancel the context: I believe it will work inconsistently, multiple queries might already have returned by the time cancel() is called
Create a deferred Go routine which continues to drain the channels and close the rows objects: If a DB node is slow to respond, Rollback() is still called before rows.Close()
Use a sync.WaitGroup somewhere in the MultiTx type, maybe in combination with (2): This can cause Rollback to hang if one of the nodes is unresponsive. Also, I wouldn't be sure how I would implement that.
Ignore the Rollback errors: Ignoring errors never sounds like a good idea, they are there for a reason.
What would be the recommended way of approaching this?
Edit:
As suggested by @Peter, I've tried canceling the context, but it seems this also invalidates all the Rows returned by the query. On rows.Scan I'm getting a context canceled error at the higher-level caller.
This is what I've done so far:
func multiQuery(ctx context.Context, xs []executor, query string, args ...interface{}) (*sql.Rows, error) {
ctx, cancel := context.WithCancel(ctx)
defer cancel()
rc := make(chan *sql.Rows, len(xs))
ec := make(chan error, len(xs))
for _, x := range xs {
go func(x executor) {
rows, err := x.QueryContext(ctx, query, args...)
switch { // Make sure only one of them is returned
case err != nil:
ec <- err
case rows != nil:
rc <- rows
cancel() // Cancel on success
}
}(x)
}
var (
me MultiError
rows *sql.Rows
)
for i := 0; i < len(xs); i++ {
select {
case err := <-ec:
me.append(err)
case r := <-rc:
if rows == nil { // Only use the first rows
rows = r
} else {
r.Close() // Cleanup remaining rows, if there are any
}
}
}
if rows != nil {
return rows, nil
}
return nil, me.check()
}
Edit 2:
@Adrian mentioned:
we can't see the code that's actually using any of this.
This code is reused by type methods. First there is the transaction type. The issues in this question are appearing on the Rollback() method above.
// MultiTx holds a slice of open transactions to multiple nodes.
// All methods on this type run their sql.Tx variant in one Go routine per Node.
type MultiTx []*Tx
// QueryContext runs sql.Tx.QueryContext on the transactions in separate Go routines.
// The first non-error result is returned immediately
// and errors from the other Nodes will be ignored.
//
// If all nodes respond with the same error, that exact error is returned as-is.
// If there is a variety of errors, they will be embedded in a MultiError return.
//
// Implements boil.ContextExecutor.
func (mtx MultiTx) QueryContext(ctx context.Context, query string, args ...interface{}) (*sql.Rows, error) {
return multiQuery(ctx, mtx2Exec(mtx), query, args...)
}
Then there is:
// MultiNode holds a slice of Nodes.
// All methods on this type run their sql.DB variant in one Go routine per Node.
type MultiNode []*Node
// QueryContext runs sql.DB.QueryContext on the Nodes in separate Go routines.
// The first non-error result is returned immediately
// and errors from the other Nodes will be ignored.
//
// If all nodes respond with the same error, that exact error is returned as-is.
// If there is a variety of errors, they will be embedded in a MultiError return.
//
// Implements boil.ContextExecutor.
func (mn MultiNode) QueryContext(ctx context.Context, query string, args ...interface{}) (*sql.Rows, error) {
return multiQuery(ctx, nodes2Exec(mn), query, args...)
}
These methods are the public wrappers around the multiQuery() function. Now I realize that just sending the *Rows into a buffered channel to die is actually a memory leak. In the transaction cases it becomes clear, as Rollback() starts to complain. But in the non-transaction variant, the *Rows inside the channel will never be garbage collected, as the driver might hold a reference to it until rows.Close() is called.
I've written this package to be used by an ORM, sqlboiler. My higher-level logic passes a MultiTx object to the ORM. From that point, I don't have any explicit control over the returned Rows. A simplistic approach would be for my higher-level code to cancel the context before Rollback(), but I don't like that:
It gives a non-intuitive API. This (idiomatic) approach would break:
ctx, cancel = context.WithCancel(context.Background())
defer cancel()
tx, _ := db.BeginTx(ctx, nil)
defer tx.Rollback()
The ORM's interfaces also specify the regular, non-context aware Query() variants, which in my package's case will run against context.Background().
I'm starting to worry that this is broken by design... Anyway, I will start by implementing a Go routine that will drain the channel and close the *Rows. After that I will see if I can implement some reasonable waiting / cancellation mechanism that won't affect the returned *Rows.
I think that the function below will do what you require, with the one proviso being that the context passed in should be cancelled when you are done with the results (otherwise one context.WithCancel will leak; I cannot see a way around that, as cancelling it within the function would invalidate the returned sql.Rows).
Note that I have not had time to test this (it would need a database setup, implementing your interfaces, etc.), so there may well be a bug hidden in the code (but I believe the basic algorithm is sound).
// queryResult holds the goroutine number and the result from that goroutine (we need both so we can avoid cancelling the relevant context)
type queryResult struct {
no int
rows *sql.Rows
}
// multiQuery - Executes multiple queries and returns either the first to return a result or, if all fail, a MultiError summarising the errors
// Important: This should be used for READ ONLY queries only (it is possible that more than one will complete)
// Note: The ctx passed in must be cancelled to avoid leaking a context (this routine cannot cancel the context used for the winning query)
func multiQuery(ctx context.Context, xs []executor, query string, args ...interface{}) (*sql.Rows, error) {
noOfQueries := len(xs)
rc := make(chan queryResult) // Channel for results; unbuffered because we only want one, and only one, result
ec := make(chan error) // errors get sent here - goroutines must send a result or 1 error
defer close(ec) // Ensure the error consolidation go routine will complete
// We need a way to cancel individual goroutines as we do not know which one will succeed
cancelFns := make([]context.CancelFunc, noOfQueries)
// All goroutines must terminate before we exit (otherwise the transaction maybe rolled back before they are cancelled leading to "unexpected command tag SELECT")
var wg sync.WaitGroup
wg.Add(noOfQueries)
for i, x := range xs {
var queryCtx context.Context
queryCtx, cancelFns[i] = context.WithCancel(ctx)
go func(ctx context.Context, queryNo int, x executor) {
defer wg.Done()
rows, err := x.QueryContext(ctx, query, args...)
if err != nil {
ec <- err // Error collection go routine guaranteed to run until all query goroutines complete
return
}
select {
case rc <- queryResult{queryNo, rows}:
return
case <-ctx.Done(): // If another query has already transmitted its results these should be thrown away
rows.Close() // not strictly required because closed context should tidy up
return
}
}(queryCtx, i, x)
}
// Start go routine that will send a MultiError to a channel if all queries fail
mec := make(chan MultiError)
go func() {
var me MultiError
errCount := 0
for err := range ec {
me.append(err)
errCount += 1
if errCount == noOfQueries {
mec <- me
return
}
}
}()
// Wait for one query to succeed or all queries to fail
select {
case me := <-mec:
for _, cancelFn := range cancelFns { // not strictly required so long as ctx is eventually cancelled
cancelFn()
}
wg.Wait()
return nil, me.check()
case result := <-rc:
for i, cancelFn := range cancelFns { // not strictly required so long as ctx is eventually cancelled
if i != result.no { // do not cancel the query that returned a result
cancelFn()
}
}
wg.Wait()
return result.rows, nil
}
}
Thanks to the comments from #Peter and the answer of #Brits, I got fresh ideas on how to approach this.
Blueprint
3 out of 4 proposals from the question needed to be implemented.
1. Cancel the Context
mtx.QueryContext() creates a descendant context and sets the CancelFunc in the MultiTx object.
The cancelWait() helper cancels an old context and waits for MultiTx.done if it's not nil. It is called on Rollback() and before every new query.
2. Drain the channel
In multiQuery(), upon obtaining the first successful Rows, a Go routine is launched to drain and close the remaining Rows. The rows channel no longer needs to be buffered.
An additional Go routine and a WaitGroup are used to close the error and rows channels.
3. Return a done channel
Instead of the proposed WaitGroup, multiQuery() returns a done channel. The channel is closed once the drain & close routine has finished. mtx.QueryContext() sets the done channel on the MultiTx object.
Errors
Instead of the select block, only drain the error channel if there are no Rows. The error channel needs to remain buffered for this reason.
Code
// MultiTx holds a slice of open transactions to multiple nodes.
// All methods on this type run their sql.Tx variant in one Go routine per Node.
type MultiTx struct {
tx []*Tx
done chan struct{}
cancel context.CancelFunc
}
func (m *MultiTx) cancelWait() {
if m.cancel != nil {
m.cancel()
}
if m.done != nil {
<-m.done
}
// reset
m.done, m.cancel = nil, nil
}
// Context creates a child context and appends CancelFunc in MultiTx
func (m *MultiTx) context(ctx context.Context) context.Context {
m.cancelWait()
ctx, m.cancel = context.WithCancel(ctx)
return ctx
}
// QueryContext runs sql.Tx.QueryContext on the transactions in separate Go routines.
func (m *MultiTx) QueryContext(ctx context.Context, query string, args ...interface{}) (rows *sql.Rows, err error) {
rows, m.done, err = multiQuery(m.context(ctx), mtx2Exec(m.tx), query, args...)
return rows, err
}
func (m *MultiTx) Rollback() error {
m.cancelWait()
ec := make(chan error, len(m.tx))
for _, tx := range m.tx {
go func(tx *Tx) {
err := tx.Rollback()
ec <- err
}(tx)
}
var me MultiError
for i := 0; i < len(m.tx); i++ {
if err := <-ec; err != nil {
me.append(err)
}
}
return me.check()
}
func multiQuery(ctx context.Context, xs []executor, query string, args ...interface{}) (*sql.Rows, chan struct{}, error) {
rc := make(chan *sql.Rows)
ec := make(chan error, len(xs))
var wg sync.WaitGroup
wg.Add(len(xs))
for _, x := range xs {
go func(x executor) {
rows, err := x.QueryContext(ctx, query, args...)
switch { // Make sure only one of them is returned
case err != nil:
ec <- err
case rows != nil:
rc <- rows
}
wg.Done()
}(x)
}
// Close channels when all query routines completed
go func() {
wg.Wait()
close(ec)
close(rc)
}()
rows, ok := <-rc
if ok { // ok will be false if channel closed before any rows
done := make(chan struct{}) // Done signals the caller that all remaining rows are properly closed
go func() {
for rows := range rc { // Drain channel and close unused Rows
rows.Close()
}
close(done)
}()
return rows, done, nil
}
// no rows, build error return
var me MultiError
for err := range ec {
me.append(err)
}
return nil, nil, me.check()
}
Edit: Cancel & wait for old contexts before every query; as *sql.Tx is not goroutine-safe, all previous queries have to be done before the next call.
I have a BigQuery table with this schema:
name STRING NULLABLE
age INTEGER NULLABLE
amount INTEGER NULLABLE
And I can successfully insert into the table with this code:
ctx := context.Background()
client, err := bigquery.NewClient(ctx, projectID)
if err != nil {
log.Fatal(err)
}
u := client.Dataset(dataSetID).Table("test_user").Uploader()
savers := []*bigquery.StructSaver{
{Struct: test{Name: "Sylvanas", Age: 23, Amount: 123}},
}
if err := u.Put(ctx, savers); err != nil {
log.Fatal(err)
}
fmt.Printf("rows inserted!!")
This works fine because the table is already created in BigQuery. What I want to do now is delete the table if it exists and create it again from code:
type test struct {
Name string
Age int
Amount int
}
if err := client.Dataset(dataSetID).Table("test_user").Delete(ctx); err != nil {
log.Fatal(err)
}
fmt.Printf("table deleted")
t := client.Dataset(dataSetID).Table("test_user")
// Infer table schema from a Go type.
schema, err := bigquery.InferSchema(test{})
if err != nil {
log.Fatal(err)
}
if err := t.Create(ctx,
&bigquery.TableMetadata{
Name: "test_user",
Schema: schema,
}); err != nil {
log.Fatal(err)
}
fmt.Printf("table created with the test schema")
This also works nicely, because it deletes the table and creates it with the schema inferred from my test struct.
The problem comes when I try to do the above insert after the delete/create process. No error is thrown, but it does not insert any data (and the insert works fine if I comment out the delete/create part).
What am I doing wrong?
Do I need to commit the create-table operation somehow in order to insert, or maybe do I need to close the DB connection?
According to this old answer, it might take up to 2 min for a BigQuery streaming buffer to be properly attached to a deleted and immediately re-created table.
I have run some tests, and in my case it only took a few seconds until the table was available, instead of the 2~5 min reported in other questions. The resulting code is quite different from yours, but the concepts should apply.
What I tried is, instead of directly inserting the rows, adding them to a buffered channel and waiting until you can verify that the current table is properly saving values before starting to send them.
I've used a much simpler struct to run my tests (so it was easier to write the code):
type row struct {
ByteField []byte
}
I generated my rows the following way:
func generateRows(rows chan<- *row) {
for {
randBytes := make([]byte, 100)
_, _ = rand.Read(randBytes)
rows <- &row{randBytes}
time.Sleep(time.Millisecond * 500) // use whatever frequency you need to insert rows at
}
}
Notice how I'm sending the rows to the channel. Instead of generating them, you just have to get them from your data source.
The next part is finding a way to check if the table is properly saving the rows. What I did was trying to insert one of the buffered rows into the table, recover that row, and verify if everything is OK. If the row is not properly returned, send it back to the buffer.
func unreadyTable(rows chan *row) bool {
client, err := bigquery.NewClient(context.Background(), project)
if err != nil {return true}
r := <-rows // get a row to try to insert
uploader := client.Dataset(dataset).Table(table).Uploader()
if err := uploader.Put(context.Background(), r); err != nil {rows <- r;return true}
i, err := client.Query(fmt.Sprintf("select * from `%s.%s.%s`", project, dataset, table)).Read(context.Background())
if err != nil {rows <- r; return true}
var testRow []bigquery.Value
if err := i.Next(&testRow); err != nil {rows <- r;return true}
if reflect.DeepEqual(&row{testRow[0].([]byte)}, r) {return false} // there's probably a better way to check if it's equal
rows <- r;return true
}
With a function like that, we only need to add for ; unreadyTable(rows); time.Sleep(time.Second) {} to block until it's safe to insert the rows.
Finally, we put everything together:
func main() {
// initialize a channel where the rows will be sent
rows := make(chan *row, 1000) // make it big enough to hold several minutes of rows
// start generating rows to be inserted
go generateRows(rows)
// create the BigQuery client
client, err := bigquery.NewClient(context.Background(), project)
if err != nil {/* handle error */}
// delete the previous table
if err := client.Dataset(dataset).Table(table).Delete(context.Background()); err != nil {/* handle error */}
// create the new table
schema, err := bigquery.InferSchema(row{})
if err != nil {/* handle error */}
if err := client.Dataset(dataset).Table(table).Create(context.Background(), &bigquery.TableMetadata{Schema: schema}); err != nil {/* handle error */}
// wait for the table to be ready
for ; unreadyTable(rows); time.Sleep(time.Second) {}
// once it's ready, upload indefinitely
for {
if len(rows) > 0 { // if there are uninserted rows, create a batch and insert them
uploader := client.Dataset(dataset).Table(table).Uploader()
insert := make([]*row, min(500, len(rows))) // create a batch of all the rows on buffer, up to 500
for i := range insert {insert[i] = <-rows}
go func(insert []*row) { // do the actual insert async
if err := uploader.Put(context.Background(), insert); err != nil {/* handle error */}
}(insert)
} else { // if there are no rows waiting to be inserted, wait and check again
time.Sleep(time.Second)
}
}
}
Note: Since math.Min() does not like ints, I had to include func min(a,b int)int{if a<b{return a};return b}.
Here's my full working example.
I am importing data into neo4j using neoism, and I have some issues importing big data sets: 1000 nodes take 8s. Here is the part of the code that imports 100 nodes.
It's quite basic code that needs improvement; can anyone help me improve this?
var wg sync.WaitGroup
for _, itemProps := range items {
wg.Add(1)
go func(i interface{}) {
s := time.Now()
cypher := neoism.CypherQuery{
Statement: fmt.Sprintf(`
CREATE (%v)
SET i = {Props}
RETURN i
`, ItemLabel),
Parameters: neoism.Props{"Props": i},
}
if err := database.ExecuteCypherQuery(cypher); err != nil {
utils.Error(fmt.Sprintf("error ImportItemsNeo4j! %v", err))
wg.Done()
return
}
utils.Info(fmt.Sprintf("import Item success! took: %v", time.Since(s)))
wg.Done()
}(itemProps)
}
wg.Wait()
AFAIK neoism still uses old APIs; you should use cq instead: https://github.com/go-cq/cq
Also, you should batch your creates: either send multiple statements per request (e.g. 100 statements per request), or even better, send a list of parameters to a single Cypher query, e.g. where {data} is [{id:1},{id:2},...]:
UNWIND {data} as props
CREATE (n:Label) SET n = props
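As a rough sketch of the UNWIND approach in Go, reusing the asker's neoism import, ItemLabel and database.ExecuteCypherQuery wrapper (all assumed from the question, with ItemLabel assumed to hold a plain label name), and with an arbitrary batch size:
// batchImport creates nodes in batches: one Cypher statement per batch,
// binding the whole batch of property maps as a single {data} parameter.
func batchImport(items []map[string]interface{}, batchSize int) error {
	for start := 0; start < len(items); start += batchSize {
		end := start + batchSize
		if end > len(items) {
			end = len(items)
		}
		cypher := neoism.CypherQuery{
			Statement: fmt.Sprintf(`
				UNWIND {data} AS props
				CREATE (n:%v)
				SET n = props
			`, ItemLabel), // assumes ItemLabel is a plain label, e.g. "Item"
			Parameters: neoism.Props{"data": items[start:end]},
		}
		if err := database.ExecuteCypherQuery(cypher); err != nil {
			return err
		}
	}
	return nil
}
Compared to one CREATE per node, this cuts the number of HTTP round trips roughly to len(items)/batchSize.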
First I would like to say I've already viewed Golang download multiple files in parallel using goroutines and Example for sync.WaitGroup correct?, and I've used them as a guide in my code. However, I'm not certain that it's working for me. I'm trying to download files from multiple buckets on AWS. This is what I have (some lines will be blank for security reasons).
package main
import (
"fmt"
"os"
"os/user"
"path/filepath"
"sync"
"time"
"github.com/aws/aws-sdk-go/aws"
"github.com/aws/aws-sdk-go/aws/session"
"github.com/aws/aws-sdk-go/service/s3"
"github.com/aws/aws-sdk-go/service/s3/s3manager"
)
var (
//Bucket = "" // Download from this bucket
Prefix = "" // Using this key prefix
LocalDirectory = "s3logs" // Into this directory
)
// create a single session to be used
var sess = session.New()
// used to control concurrency
var wg sync.WaitGroup
func main() {
start := time.Now()
//map of buckets to region
regBuckets := map[string]string{
}
// download the files for each bucket
for region, bucket := range regBuckets {
fmt.Println(region)
wg.Add(1)
go getLogs(region, bucket, LocalDirectory, &wg)
}
wg.Wait()
elapsed := time.Since(start)
fmt.Printf("\nTime took %s\n", elapsed)
}
// function to get data from buckets
func getLogs(region string, bucket string, directory string, wg *sync.WaitGroup) {
client := s3.New(sess, &aws.Config{Region: aws.String(region)})
params := &s3.ListObjectsInput{Bucket: &bucket, Prefix: &Prefix}
manager := s3manager.NewDownloaderWithClient(client, func(d *s3manager.Downloader) {
d.PartSize = 6 * 1024 * 1024 // 6MB per part
d.Concurrency = 5
})
d := downloader{bucket: bucket, dir: directory, Downloader: manager}
client.ListObjectsPages(params, d.eachPage)
wg.Done()
}
// downloader object and methods
type downloader struct {
*s3manager.Downloader
bucket, dir string
}
func (d *downloader) eachPage(page *s3.ListObjectsOutput, more bool) bool {
for _, obj := range page.Contents {
d.downloadToFile(*obj.Key)
}
return true
}
func (d *downloader) downloadToFile(key string) {
// Create the directories in the path
// desktop path
user, errs := user.Current()
if errs != nil {
panic(errs)
}
homedir := user.HomeDir
desktop := homedir + "/Desktop/" + d.dir
file := filepath.Join(desktop, key)
if err := os.MkdirAll(filepath.Dir(file), 0775); err != nil {
panic(err)
}
// Setup the local file
fd, err := os.Create(file)
if err != nil {
panic(err)
}
defer fd.Close()
// Download the file using the AWS SDK
fmt.Printf("Downloading s3://%s/%s to %s...\n", d.bucket, key, file)
params := &s3.GetObjectInput{Bucket: &d.bucket, Key: &key}
_, e := d.Download(fd, params)
if e != nil {
panic(e)
}
}
In the regBuckets map I place bucket names and their regions. In the for loop below I print the bucket name. So if I have two buckets, I want to download the items from both buckets at the same time. I was testing this with a print statement: I expected to see the name of the first bucket and soon after the name of the second bucket. However, it seems that instead of downloading the files from multiple buckets in parallel, it downloads them in order, e.g. bucket 1, and when bucket 1 is done the for loop continues with bucket 2, etc. So I need help making sure I'm downloading in parallel, because I have roughly 10 buckets and speed is important. I also wonder if it's because I'm using a single session. Any ideas?