When should I use batches and when should I use transactions? Can I embed a transaction in a batch? A batch in a transaction?
A batch is a collection of operations that are sent to the server as a single unit for efficiency. It is equivalent to sending the same operations as individual requests from different threads. Requests in a batch may be executed out of order, and it's possible for some operations in a batch to succeed while others fail.
In Go, batches are created with the batcher object DB.B, and must be passed to DB.Run(). For example:
err := db.Run(db.B.Put("a", "1").Put("b", "2"))
is equivalent to:
_, err1 := db.Put("a", "1")
_, err2 := db.Put("b", "2")
A transaction defines a consistent and atomic sequence of operations. Transactions guarantee consistency with respect to all other operations in the system: the results of a transaction cannot be seen unless and until the transaction is committed. Since transactions may need to be retried, transactions are defined by function objects (typically closures) which may be called multiple times.
In Go, transactions are created with the DB.Tx method. The *client.Tx parameter to the closure implements a similar interface to DB; inside the transaction you must perform all your operations on this object instead of the original DB. If your function returns an error, the transaction will be aborted; otherwise it will commit. Here is a transactional version of the previous example (but see below for a more efficient version):
err := db.Tx(func(tx *client.Tx) error {
    err := tx.Put("a", "1")
    if err != nil {
        return err
    }
    return tx.Put("b", "2")
})
The previous example waits for the "a" write to complete before starting the "b" write, and then waits for the "b" write to complete before committing the transaction. It is possible to make this more efficient by using batches inside the transaction. Tx.B is a batcher object, just like DB.B. In a transaction, you can run batches with either Tx.Run or Tx.Commit. Tx.Commit will commit the transaction if and only if all other operations in the batch succeed, and is more efficient than letting the transaction commit automatically when the closure returns. It is a good practice to always make the last operation in a transaction a batch executed by Tx.Commit:
err := db.Tx(func(tx *client.Tx) error {
    return tx.Commit(tx.B.Put("a", "1").Put("b", "2"))
})
The Problem
Using the Go cloud.google.com/go/datastore package, I create a transaction and perform a series of GetMultis and PutMultis. On committing this transaction, I'm confronted with an entity write limit error.
2021/12/22 09:07:18 err: rpc error: code = InvalidArgument desc = cannot write more than 500 entities in a single call
The Question
My question is how do you create a transaction with more than 500 writes?
While I want my operation to remain atomic, I can't seem to get past this write limit for a transaction, even though the same set of queries runs just fine when I test on an emulator, writing in batches of 500.
What I've Tried
Please excuse the pseudocode, but I'm trying to convey the gist of what I've done.
All in one
transaction, err := datastoreClient.NewTransaction(ctx)
transaction.PutMulti(allKeys, allEntities)
transaction.Commit()
// err: too many entities written in a single call
Batched in an attempt to avoid the write limit
transaction, err := datastoreClient.NewTransaction(ctx)
transaction.PutMulti(first500Keys, first500Entities)
transaction.PutMulti(second500Keys, second500Entities)
transaction.Commit()
// err: too many entities written in a single call
A simple regular PutMulti also fails
datastoreClient.PutMulti(ctx, allKeys, allEntities)
// err: too many entities written in a single call
What Works
Non-atomic write to the datastore
datastoreClient.PutMulti(ctx, first500Keys, first500Entities)
datastoreClient.PutMulti(ctx, second500Keys, second500Entities)
Here's the real code that I used for the write, either as a batched transaction or a regular PutMulti:
// Write the entities in chunks of at most 500 to stay under the limit.
for start := 0; start < len(allKeys); start += 500 {
    end := start + 500
    if end > len(allKeys) {
        end = len(allKeys)
    }
    _, err = svc.dsClient.PutMulti(ctx, allKeys[start:end], allEntities[start:end])
    if err != nil {
        return
    }
}
Where I'm Lost
So, in an effort to keep my work atomic: is there any way to commit a transaction that has more than 500 entities written in it?
There is nothing you can do. This limit is enforced by the platform to ensure scalability and to prevent performance degradation; you cannot write more than 500 entities in a single transaction.
The limit could conceivably be changed on Google's side, but there is nothing you can do on yours.
From Cloud Bigtable's documentation:
// READING OP HERE
timestamp := bigtable.Now()
mut := bigtable.NewMutation()
mut.Set(columnFamilyName, "os_name", timestamp, []byte("android"))
filter := bigtable.ChainFilters(
    bigtable.FamilyFilter(columnFamilyName),
    bigtable.ColumnFilter("os_build"),
    bigtable.ValueFilter("PQ2A\\..*"))
conditionalMutation := bigtable.NewCondMutation(filter, mut, nil)
rowKey := "phone#4c410523#20190501"
if err := tbl.Apply(ctx, rowKey, conditionalMutation); err != nil {
    return fmt.Errorf("Apply: %v", err)
}
fmt.Println("Successfully updated row's os_name")
I wanted to know whether this also provides concurrency control, i.e. if we follow the sequence:
#1 - Read
#2 - Modify on Read
#3 - Write
If two threads are trying to modify the same row at the same time, will tbl.Apply fail?
The check-and-write of a conditional mutation is executed as a single, atomic action. If you send multiple mutations they may be executed in an arbitrary order, but since the conditional mutation is a single action it won't be interleaved with another mutation. Therefore, tbl.Apply should not fail in your case.
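Relatedly, if the read-modify-write you need is something like incrementing a counter, the Go client also exposes Bigtable's atomic ReadModifyWrite, which performs the whole sequence on the server. A minimal sketch, assuming tbl and ctx from the snippet above; the family and column names are placeholders:

// Atomically increment a counter cell on the server; no separate read needed.
// Increment requires the existing cell value to be a 64-bit big-endian integer.
rmw := bigtable.NewReadModifyWrite()
rmw.Increment("stats_summary", "connected_count", 1)
row, err := tbl.ApplyReadModifyWrite(ctx, "phone#4c410523#20190501", rmw)
if err != nil {
    return fmt.Errorf("ApplyReadModifyWrite: %v", err)
}
_ = row // the returned row contains the updated cell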
I have what is essentially a counter that users can increment.
However, I want to avoid the race condition of two users incrementing the counter at once.
Is there a way to atomically increment a counter using Gorm as opposed to fetching the value from the database, incrementing, and finally updating the database?
If you want to use the basic ORM features, you can use FOR UPDATE as a query option when retrieving the record; the database will lock the record for that specific connection until that connection issues an UPDATE query to change that record.
Both the SELECT and UPDATE statements must happen on the same connection, which means you need to wrap them in a transaction (otherwise Go may send the second query over a different connection).
Please note that this will make every other connection that wants to SELECT the same record wait until you've done the UPDATE. That is not an issue for most applications, but if you either have very high concurrency or the time between SELECT ... FOR UPDATE and the UPDATE after that is long, this may not be for you.
In addition to FOR UPDATE, the FOR SHARE option sounds like it can also work for you, with less lock contention (but I don't know it well enough to say this for sure).
Note: This assumes you use an RDBMS that supports SELECT ... FOR UPDATE; if it doesn't, please update the question to tell us which RDBMS you are using.
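For illustration, here is a minimal sketch of that pattern using GORM v2's clause.Locking; the Counter model, its fields, and the increment helper are hypothetical:

import (
    "gorm.io/gorm"
    "gorm.io/gorm/clause"
)

type Counter struct {
    ID    uint
    Value int64
}

func increment(db *gorm.DB, id uint) error {
    // SELECT ... FOR UPDATE and the following UPDATE must run on the same
    // connection, so both are wrapped in one transaction.
    return db.Transaction(func(tx *gorm.DB) error {
        var c Counter
        if err := tx.Clauses(clause.Locking{Strength: "UPDATE"}).
            First(&c, id).Error; err != nil {
            return err
        }
        c.Value++ // modify while the row is locked
        return tx.Save(&c).Error
    })
}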
Another option is to just go around the ORM and do db.Exec("UPDATE counter_table SET counter = counter + 1 WHERE id = ?", 42) (though see https://stackoverflow.com/a/29945125/1073170 for some pitfalls).
A possible solution is to use GORM transactions (https://gorm.io/docs/transactions.html).
err := db.Transaction(func(tx *gorm.DB) error {
    // Get the model if it exists
    var feature models.Feature
    if err := tx.Where("id = ?", c.Param("id")).First(&feature).Error; err != nil {
        return err
    }
    // Increment the counter
    if err := tx.Model(&feature).Update("Counter", feature.Counter+1).Error; err != nil {
        return err
    }
    return nil
})
if err != nil {
    c.Status(http.StatusInternalServerError)
    return
}
c.Status(http.StatusOK)
The official Go documentation for the datastore package (the client library for the GCP Datastore service) has the following code snippet as a demonstration:
package main

import (
    "context"
    "fmt"

    "cloud.google.com/go/datastore"
)

type Entity struct {
    Value string
}

func main() {
    ctx := context.Background()

    // Create a datastore client. In a typical application, you would create
    // a single client which is reused for every datastore operation.
    dsClient, err := datastore.NewClient(ctx, "my-project")
    if err != nil {
        // Handle error.
    }

    k := datastore.NameKey("Entity", "stringID", nil)
    e := new(Entity)
    if err := dsClient.Get(ctx, k, e); err != nil {
        // Handle error.
    }

    old := e.Value
    e.Value = "Hello World!"
    if _, err := dsClient.Put(ctx, k, e); err != nil {
        // Handle error.
    }

    fmt.Printf("Updated value from %q to %q\n", old, e.Value)
}
As one can see, it states that the datastore.Client should ideally be instantiated only once in an application. Now, given that the datastore.NewClient function requires a context.Context object, does that mean the client should be instantiated once per HTTP request, or can it safely be instantiated once globally with a context.Background() object?
Each operation also requires a context.Context object (e.g. dsClient.Get(ctx, k, e)), so is that the point where the HTTP request's context should be used?
I'm new to Go and can't really find any online resources that explain something like this well, with real-world examples and actual best-practice patterns.
You may use any context.Context for the datastore client creation; context.Background() is completely fine. Client creation may be lengthy: it may require connecting to a remote server, authenticating, fetching configuration, etc. If your use case has limited time, you may pass a context with a timeout to abort the creation, and if it takes longer than the time you have, a context with cancel lets you abort at will. These are just options, which you may or may not use, but the "tools" are given to you via context.Context.
Later, when you use the datastore.Client while serving (HTTP) client requests, using the request's context is reasonable: if a request gets cancelled, then so is its context, and so is the datastore operation you issued. Rightfully so, because if the client cannot see the result, there is no point completing the query. By terminating the query early you avoid consuming certain resources (e.g. datastore reads) and you lower the server's load (by aborting jobs whose results will never be sent back to the client).
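To make the split concrete, here is a minimal sketch under those assumptions (the project ID, route, and handler name are placeholders; Entity is the type from the snippet above): the client is created once with context.Background(), and each operation uses the incoming request's context.

package main

import (
    "context"
    "fmt"
    "log"
    "net/http"

    "cloud.google.com/go/datastore"
)

var dsClient *datastore.Client

func main() {
    // Created once at startup; context.Background() is appropriate for a
    // long-lived client that outlives any single request.
    ctx := context.Background()
    var err error
    dsClient, err = datastore.NewClient(ctx, "my-project")
    if err != nil {
        log.Fatal(err)
    }
    http.HandleFunc("/entity", handleEntity)
    log.Fatal(http.ListenAndServe(":8080", nil))
}

func handleEntity(w http.ResponseWriter, r *http.Request) {
    // Per operation: use the request's context so the datastore call is
    // cancelled if the HTTP client goes away.
    ctx := r.Context()
    e := new(Entity)
    k := datastore.NameKey("Entity", "stringID", nil)
    if err := dsClient.Get(ctx, k, e); err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    fmt.Fprintf(w, "value: %q\n", e.Value)
}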
I have a cron task to implement, ideally the best way possible, in Go.
I need to store big data from a JSON web service in sellers.
After saving these sellers in a database, I need to walk another large JSON web service, with a sellerID parameter, to save the results to another table named customers.
Each customer has an initial state; if this state has changed relative to the data from web service no. 2, I need to store the difference in another table, changes, to keep a history of changes.
Finally, if the change matches our conditions, I perform another task.
My current operation
var wg sync.WaitGroup

action.FetchSellers() // fetch large JSON and store in the sellers table, ~2min
sellers := action.ListSellers()

for _, s := range sellers {
    wg.Add(1)
    go action.FetchCustomers(&wg, s) // fetch multiple large JSONs, store in the customers table and store notify... ~20sec
}
wg.Wait()
The first difficulty with this code is that I do not control the number of concurrent calls to the web service.
The second is that the action.FetchCustomers function does a lot of work that I think could be done concurrently.
The third difficulty is that I cannot resume from where an error occurred if something fails.
I need to run this code every hour, so it needs to be well built; currently it works, but not in the best way.
I think worker pools in Go, as in Go by Example: Worker Pools, might be the right fit, but I have trouble conceiving how to apply them here.
Not to be a jerk! But I would use a queue for this kind of thing. I have already created a library for this and I use it: github.com/AnikHasibul/queue
// Limit the maximum number of concurrent jobs
maximumJobLimit := 50
// Open a new queue with the limit
q := queue.New(maximumJobLimit)
defer q.Close()

// Simulate a large number of jobs
for i := 0; i != 1000; i++ {
    // Add a job to the queue
    q.Add()
    // Run your long-running job here in a goroutine
    go func(c int) {
        // Must call Done() after finishing the job
        defer q.Done()
        time.Sleep(time.Second)
        fmt.Println(c)
    }(i)
}

// Wait for all jobs to finish
q.Wait()
// Done!
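For comparison, the same concurrency bound can be achieved with only the standard library, using a buffered channel as a semaphore. A minimal sketch, where the limit of 50 and the Sleep standing in for the real job are placeholders:

var wg sync.WaitGroup
sem := make(chan struct{}, 50) // at most 50 jobs in flight

for i := 0; i < 1000; i++ {
    wg.Add(1)
    sem <- struct{}{} // acquire a slot; blocks while 50 jobs are running
    go func(c int) {
        defer wg.Done()
        defer func() { <-sem }() // release the slot when the job finishes
        time.Sleep(time.Second)  // the long-running job goes here
        fmt.Println(c)
    }(i)
}
wg.Wait()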