Upload data to table without waiting for streaming buffer to flush - go

I have a Go program which downloads data from a table (T1), formats it, and uploads it to a new temporary table (T2). Once the data has been uploaded (30s or so), the data should be copied to a third table (T3).
After uploading the formatted data to T2, querying the table returns results ok. However, when copying the table - the job completes almost instantly and the destination table (T3) is empty.
I'm copying the table as suggested here - but the result is the same when performing the action in the UI.
In the table metadata section it shows as 0B, 0 rows, but there are about 100k rows and 18 MB of data in there - or at least that's what comes back from a query.
Edit: I did not spot that this data was still stuck in the streaming buffer - see my answer.

The comments on my question led me to see that the issue was the streaming buffer. This was taking a long time to flush - it's not possible to flush it manually.
I ended up reading this issue and comment on GitHub here. This suggested using a Load job instead.
After some research, I realised it was possible to read from an io.Reader as well as a Google Cloud Storage Reference by configuring the Loader's ReaderSource.
My original implementation that used the streaming buffer looked like this:
var vss []*bigquery.ValuesSaver
// for each row:
vss = append(vss, &bigquery.ValuesSaver{
    Schema:   schema,
    InsertID: fmt.Sprintf("%d", i), // a unique insert ID per row, e.g. the loop index
    Row: []bigquery.Value{
        "data",
    },
})
err := uploader.Put(ctx, vss)
if err != nil {
    if pmErr, ok := err.(bigquery.PutMultiError); ok {
        for _, rowInsertionError := range pmErr {
            log.Println(rowInsertionError.Errors)
        }
    }
    return fmt.Errorf("failed to insert data: %v", err)
}
I was able to change this to a load job with code that looked like this:
var lines []string
for _, v := range rows {
    data, err := json.Marshal(v)
    if err != nil {
        return fmt.Errorf("failed to generate json %v, %+v", err, v)
    }
    lines = append(lines, string(data))
}
dataString := strings.Join(lines, "\n")
rs := bigquery.NewReaderSource(strings.NewReader(dataString))
rs.FileConfig.SourceFormat = bigquery.JSON
rs.FileConfig.Schema = schema
loader := dataset.Table(t2Name).LoaderFrom(rs)
loader.CreateDisposition = bigquery.CreateIfNeeded
loader.WriteDisposition = bigquery.WriteTruncate
job, err := loader.Run(ctx)
if err != nil {
    return fmt.Errorf("failed to start load job: %v", err)
}
_, err = job.Wait(ctx)
if err != nil {
    return fmt.Errorf("load job failed: %v", err)
}
Now the data is available in the table 'immediately' - I no longer need to wait for the streaming buffer.
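For completeness, the copy from T2 to T3 then behaves as expected. Here is a minimal sketch of the copy step (t3Name is an assumption, alongside the t2Name used above):
copier := dataset.Table(t3Name).CopierFrom(dataset.Table(t2Name))
copier.WriteDisposition = bigquery.WriteTruncate
copyJob, err := copier.Run(ctx)
if err != nil {
    return fmt.Errorf("failed to start copy job: %v", err)
}
status, err := copyJob.Wait(ctx)
if err != nil {
    return fmt.Errorf("copy job failed: %v", err)
}
if err := status.Err(); err != nil {
    return fmt.Errorf("copy job completed with error: %v", err)
}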

Related

How to update a bigquery row in golang

I have a Go program connected to a BigQuery table. This is the table's schema:
name STRING NULLABLE
age INTEGER NULLABLE
amount INTEGER NULLABLE
I have succeeded at querying the data of this table and printing all rows to the console with this code:
ctx := context.Background()
client, err := bigquery.NewClient(ctx, projectID)
q := client.Query("SELECT * FROM test.test_user LIMIT 1000")
it, err := q.Read(ctx)
if err != nil {
    log.Fatal(err)
}
for {
    var values []bigquery.Value
    err := it.Next(&values)
    if err == iterator.Done {
        break
    }
    if err != nil {
        // TODO: Handle error.
    }
    fmt.Println(values)
}
And I have also succeeded in inserting data into the table from a struct using this code:
type test struct {
    Name   string
    Age    int
    Amount int
}
u := client.Dataset("testDS").Table("test_user").Uploader()
savers := []*bigquery.StructSaver{
    {Struct: test{Name: "Jack", Age: 23, Amount: 123}, InsertID: "id1"},
}
if err := u.Put(ctx, savers); err != nil {
    log.Fatal(err)
}
fmt.Printf("rows inserted!!")
Now, what I am failing to do is update rows. What I want to do is select all the rows and update all of them with an operation (for example: amount = amount * 2).
How can I achieve this using golang?
Updating rows is not specific to Go or any other client library. If you want to update data in BigQuery, you need to use DML (Data Manipulation Language) via SQL. So, essentially, you already have the main part working (running a query) - you just need to change that SQL to use DML.
But a word of caution: BigQuery is an OLAP service. Don't use it for OLTP. Also, there are quotas on DML usage. Make sure you familiarise yourself with them.
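For example, a minimal sketch (reusing the ctx, client and log-based error handling from the question; the WHERE true clause is there because BigQuery rejects an UPDATE without a WHERE):
q := client.Query("UPDATE test.test_user SET amount = amount * 2 WHERE true")
// DML needs standard SQL; on older client versions you may have to opt in, e.g. q.UseStandardSQL = true
job, err := q.Run(ctx)
if err != nil {
    log.Fatal(err)
}
status, err := job.Wait(ctx)
if err != nil {
    log.Fatal(err)
}
if err := status.Err(); err != nil {
    log.Fatal(err)
}
fmt.Println("rows updated!!")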

Golang code is not inserting on BigQuery's table after I have created it from code

I have a BigQuery table with this schema:
name STRING NULLABLE
age INTEGER NULLABLE
amount INTEGER NULLABLE
And I can successfully insert into the table with this code:
ctx := context.Background()
client, err := bigquery.NewClient(ctx, projectID)
if err != nil {
    log.Fatal(err)
}
u := client.Dataset(dataSetID).Table("test_user").Uploader()
savers := []*bigquery.StructSaver{
    {Struct: test{Name: "Sylvanas", Age: 23, Amount: 123}},
}
if err := u.Put(ctx, savers); err != nil {
    log.Fatal(err)
}
fmt.Printf("rows inserted!!")
This works fine because the table is already created in BigQuery. What I want to do now is delete the table if it exists and create it again from code:
type test struct {
    Name   string
    Age    int
    Amount int
}
if err := client.Dataset(dataSetID).Table("test_user").Delete(ctx); err != nil {
    log.Fatal(err)
}
fmt.Printf("table deleted")
t := client.Dataset(dataSetID).Table("test_user")
// Infer table schema from a Go type.
schema, err := bigquery.InferSchema(test{})
if err := t.Create(ctx,
    &bigquery.TableMetadata{
        Name:   "test_user",
        Schema: schema,
    }); err != nil {
    log.Fatal(err)
}
fmt.Printf("table created with the test schema")
This also works nicely because it deletes the table and creates it again with the schema inferred from my struct test.
The problem comes when I try to do the above insert after the delete/create process. No error is thrown, but no data is inserted (the insert works fine if I comment out the delete/create part).
What am I doing wrong?
Do I need to commit the create-table transaction somehow in order to insert, or maybe close the database connection?
According to this old answer, it might take up to 2 min for a BigQuery streaming buffer to be properly attached to a deleted and immediately re-created table.
I have run some tests, and in my case it only took a few seconds until the table was available, instead of the 2~5 minutes reported in other questions. The resulting code is quite different from yours, but the concepts should apply.
What I tried is, instead of inserting the rows directly, adding them to a buffered channel and waiting until you can verify that the current table is properly saving values before you start sending them.
I used a much simpler struct to run my tests (so it was easier to write the code):
type row struct {
    ByteField []byte
}
I generated my rows the following way:
func generateRows(rows chan<- *row) {
    for {
        randBytes := make([]byte, 100)
        _, _ = rand.Read(randBytes)
        rows <- &row{randBytes}
        time.Sleep(time.Millisecond * 500) // use whatever frequency you need to insert rows at
    }
}
Notice how I'm sending the rows to the channel. Instead of generating them, you just have to get them from your data source.
The next part is finding a way to check whether the table is properly saving the rows. What I did was try to insert one of the buffered rows into the table, read that row back, and verify that everything is OK. If the row is not returned properly, send it back to the buffer.
func unreadyTable(rows chan *row) bool {
    client, err := bigquery.NewClient(context.Background(), project)
    if err != nil {return true}
    r := <-rows // get a row to try to insert
    uploader := client.Dataset(dataset).Table(table).Uploader()
    if err := uploader.Put(context.Background(), r); err != nil {rows <- r; return true}
    i, err := client.Query(fmt.Sprintf("select * from `%s.%s.%s`", project, dataset, table)).Read(context.Background())
    if err != nil {rows <- r; return true}
    var testRow []bigquery.Value
    if err := i.Next(&testRow); err != nil {rows <- r; return true}
    if reflect.DeepEqual(&row{testRow[0].([]byte)}, r) {return false} // there's probably a better way to check if it's equal
    rows <- r; return true
}
With a function like that, we only need to add for ; unreadyTable(rows); time.Sleep(time.Second) {} to block until it's safe to insert the rows.
Finally, we put everything together:
func main() {
    // initialize a channel where the rows will be sent
    rows := make(chan *row, 1000) // make it big enough to hold several minutes of rows
    // start generating rows to be inserted
    go generateRows(rows)
    // create the BigQuery client
    client, err := bigquery.NewClient(context.Background(), project)
    if err != nil {/* handle error */}
    // delete the previous table
    if err := client.Dataset(dataset).Table(table).Delete(context.Background()); err != nil {/* handle error */}
    // create the new table
    schema, err := bigquery.InferSchema(row{})
    if err != nil {/* handle error */}
    if err := client.Dataset(dataset).Table(table).Create(context.Background(), &bigquery.TableMetadata{Schema: schema}); err != nil {/* handle error */}
    // wait for the table to be ready
    for ; unreadyTable(rows); time.Sleep(time.Second) {}
    // once it's ready, upload indefinitely
    for {
        if len(rows) > 0 { // if there are uninserted rows, create a batch and insert them
            uploader := client.Dataset(dataset).Table(table).Uploader()
            insert := make([]*row, min(500, len(rows))) // create a batch of all the rows in the buffer, up to 500
            for i := range insert {insert[i] = <-rows}
            go func(insert []*row) { // do the actual insert asynchronously
                if err := uploader.Put(context.Background(), insert); err != nil {/* handle error */}
            }(insert)
        } else { // if there are no rows waiting to be inserted, wait and check again
            time.Sleep(time.Second)
        }
    }
}
Note: since math.Min() works on float64 rather than int, I had to include func min(a, b int) int { if a < b { return a }; return b }.
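Written out as a standalone helper, that one-liner is:
func min(a, b int) int {
    if a < b {
        return a
    }
    return b
}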
Here's my full working example.

Duplicate records with .Run(ctx) in bigquery while loading files from google storage

For each day-wise partition, we load files into BigQuery every 3 minutes, and each file is approximately 200 MB (.gz). Sometimes I get duplication, and I am not sure why. I have already verified that the input file contains the data only once, and the logs prove that the file was processed only once. What could be the possible reasons for the duplication? Are there any ways to prevent it before uploading to BigQuery?
client, err := bigquery.NewClient(ctx, loadJob.ProjectID, clientOption)
if err != nil {
    return nil, jobID, err
}
defer client.Close()
ref := bigquery.NewGCSReference(loadJob.URIs...)
if loadJob.Schema == nil {
    ref.AutoDetect = true
} else {
    ref.Schema = loadJob.Schema
}
ref.SourceFormat = bigquery.JSON
dataset := client.DatasetInProject(loadJob.ProjectID, loadJob.DatasetID)
if err := dataset.Create(ctx, nil); err != nil {
    // Create the dataset if it does not exist, otherwise ignore the duplicate error
    if !strings.Contains(err.Error(), ErrorDuplicate) {
        return nil, jobID, err
    }
}
loader := dataset.Table(loadJob.TableID).LoaderFrom(ref)
loader.CreateDisposition = bigquery.CreateIfNeeded
loader.WriteDisposition = bigquery.WriteAppend
loader.JobID = jobID
job, err := loader.Run(ctx)
if err != nil {
    return nil, jobID, err
}
status, err := job.Wait(ctx)
return status, jobID, err
BigQuery load jobs are atomic, so if a job returns success, the data is guaranteed to have been loaded exactly once.
That said, duplication is possible if a retried job succeeds on the backend for both the original and the retried attempts.
From the code snippet, I am not sure whether such a retry happens in the client implementation (some clients retry the same load if the connection drops). The usual method to prevent duplication is to send BigQuery load jobs with the same job_id for the same data. The BigQuery front end will try to dedupe the retries if the original submission is still running.
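As a minimal sketch of that idea (the deterministicJobID helper and the choice of hashing the source URIs are assumptions, not part of the original answer), you can derive the job ID from the data being loaded, so a retry of the same files submits the same ID and BigQuery can dedupe it:
// requires "crypto/sha256", "fmt" and "strings"
func deterministicJobID(prefix string, uris []string) string {
    sum := sha256.Sum256([]byte(strings.Join(uris, ",")))
    return fmt.Sprintf("%s_%x", prefix, sum[:8])
}
// used with the loader from the question, instead of a random jobID:
// loader.JobID = deterministicJobID("daily_load", loadJob.URIs)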

How to read Image from Oracle (long raw format) in GoLang

I am trying to read images (stored in the LONG RAW datatype) from an external Oracle database using Go code.
When sql's rows.Next() is called, the following error is returned:
ORA-01406: fetched column value was truncated
rows.Next() works fine for reading BLOB images from an MSSQL database.
Example code:
db, err := sql.Open("oci8", getDSN()) // function to get connection details
if err != nil {
    fmt.Println(err)
    return
}
defer db.Close()
rows, err := db.Query("SELECT image FROM sysadm.all_images")
if err != nil {
    fmt.Println(err)
    return
}
defer rows.Close()
for rows.Next() {
    var id string
    var data []byte
    rows.Scan(&id, &data)
}
fmt.Println("Total errors", rows.Err())
I hope someone can help me fix this issue or pinpoint the problem area.
I assume you are using go-oci8 as the driver.
Based on this issue, https://github.com/mattn/go-oci8/pull/71, someone got the same error as yours and managed to fix it by modifying some code in the driver.
As per this commit, the problem was solved by increasing the value of oci8cols[i].size in the file $GOPATH/src/github.com/mattn/go-oci8/oci8.go. I think in your case you have larger BLOB data, which is why that revision is still not enough.
case C.SQLT_NUM:
    oci8cols[i].kind = C.SQLT_CHR
    oci8cols[i].size = int(lp * 4) // <==== THIS VALUE
    oci8cols[i].pbuf = C.malloc(C.size_t(oci8cols[i].size) + 1)
So, try to increase the multiplier, like:
oci8cols[i].size = int(lp * 12) // <==== OR GREATER

Why is RethinkDB very slow?

I am getting started with RethinkDB; I have never used it before. I am giving it a try together with Gorethink, following this tutorial.
To sum up this tutorial, there are two programs:
The first one updates entries infinitely.
for {
    var scoreentry ScoreEntry
    pl := rand.Intn(1000)
    sc := rand.Intn(6) - 2
    res, err := r.Table("scores").Get(strconv.Itoa(pl)).Run(session)
    if err != nil {
        log.Fatal(err)
    }
    err = res.One(&scoreentry)
    scoreentry.Score = scoreentry.Score + sc
    _, err = r.Table("scores").Update(scoreentry).RunWrite(session)
}
And the second one receives these changes and logs them.
res, err := r.Table("scores").Changes().Run(session)
if err != nil {
    log.Fatalln(err)
}
var value interface{}
for res.Next(&value) {
    fmt.Println(value)
}
In the statistics that RethinkDB shows, I can see that there are 1.5K reads and writes per second. But in the console of the second program, I see only approximately 1 or 2 changes per second.
Why does this occur? Am I missing something?
This code:
r.Table("scores").Update(scoreentry).RunWrite(session)
Probably doesn't do what you think it does. This attempts to update every document in the table by merging scoreentry into it. This is why the RethinkDB console is showing so many writes per second: every time you run that query it's resulting in thousands of writes.
Usually you want to update documents inside of ReQL, like so:
r.Table("scores").Get(strconv.Itoa(pl)).Update(func(row r.Term) interface{} {
    return map[string]interface{}{"Score": row.Field("Score").Add(sc)}
})
If you need to do the update in Go code, though, you can replace just that one document like so:
r.Table("scores").Get(strconv.Itoa(pl)).Replace(scoreentry)
I'm not sure why it is quite that slow; it could be because, by default, each query blocks until the write has been completely flushed. I would first add some kind of instrumentation to see which operation is being so slow. There are also a couple of ways you can improve the performance:
Set the Durability of the write using UpdateOpts
_, err = r.Table("scores").Update(scoreentry, r.UpdateOpts{
    Durability: "soft",
}).RunWrite(session)
Execute each query in a goroutine to allow your code to run multiple queries in parallel (you may need to use a pool of goroutines instead, but this code is just a simplified example; a sketch of the pooled variant follows the example below)
for {
    go func() {
        var scoreentry ScoreEntry
        pl := rand.Intn(1000)
        sc := rand.Intn(6) - 2
        res, err := r.Table("scores").Get(strconv.Itoa(pl)).Run(session)
        if err != nil {
            log.Fatal(err)
        }
        err = res.One(&scoreentry)
        scoreentry.Score = scoreentry.Score + sc
        _, err = r.Table("scores").Update(scoreentry).RunWrite(session)
    }()
}
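A minimal sketch of that pooled variant (the worker count, the player-ID channel, the ScoreEntry field names and the gorethink import path are assumptions, not taken from the tutorial): a fixed number of workers drain a channel of player IDs, so only a bounded number of queries run against RethinkDB at once.
package main
import (
    "log"
    "math/rand"
    "strconv"
    r "gopkg.in/gorethink/gorethink.v4" // import path may differ depending on your gorethink version
)
// ScoreEntry mirrors the struct from the tutorial (field names assumed)
type ScoreEntry struct {
    ID    string `gorethink:"id,omitempty"`
    Score int    `gorethink:"Score"`
}
// updateWorkers starts a fixed-size pool of goroutines that drain a channel
// of player IDs, so at most `workers` queries run concurrently
func updateWorkers(session *r.Session, workers int, ids <-chan int) {
    for w := 0; w < workers; w++ {
        go func() {
            for pl := range ids {
                var scoreentry ScoreEntry
                sc := rand.Intn(6) - 2
                res, err := r.Table("scores").Get(strconv.Itoa(pl)).Run(session)
                if err != nil {
                    log.Println(err)
                    continue
                }
                if err := res.One(&scoreentry); err != nil {
                    log.Println(err)
                    continue
                }
                scoreentry.Score += sc
                // update only this one document, as recommended above
                if _, err := r.Table("scores").Get(strconv.Itoa(pl)).Update(scoreentry).RunWrite(session); err != nil {
                    log.Println(err)
                }
            }
        }()
    }
}
func main() {
    session, err := r.Connect(r.ConnectOpts{Address: "localhost:28015"})
    if err != nil {
        log.Fatal(err)
    }
    ids := make(chan int, 1000)
    updateWorkers(session, 8, ids) // pool of 8 workers (arbitrary choice)
    for {
        ids <- rand.Intn(1000) // feed random player IDs, as in the tutorial
    }
}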

Resources