Figuring out Page State with Cassandra GOCQL Driver (Golang)

I've been trying to wrap my head around how paging works in Apache Cassandra with the gocql driver in Go.
I have the following code for fetching rows:
/// Assume all other prerequisites.
session, _ := cluster.CreateSession()
session.SetPageSize(100)
var pagestate []byte
query := session.Query(`select * from keyspace.my_table`)
query = query.PageState(pagestate)
if err := query.Exec(); err != nil {
    panic(err)
}
iter := query.Iter()
for {
    row := map[string]interface{}{}
    if !iter.MapScan(row) {
        pagestate = iter.PageState()
        break
    }
    /// Do whatever I need with row.
}
What I'm trying to achieve:
The table I'm referencing is huge (over 18k rows), and I want to fetch all of them for a special operation as efficiently as possible, using the driver's built-in paging so the query won't time out.
The problem:
I'm not sure how to get the query to resume at the previous page state, or whether that means running the query in a loop and managing the page state outside of it. I understand how to get and set the page state; what I can't figure out is how to re-run the query with a new page state each time, with a proper halt condition for when all the paging is done.
My best attempt:
var pagestate []byte
query := session.Query(`select * from keyspace.my_table`)
for {
    query = query.PageState(pagestate)
    if err := query.Exec(); err != nil {
        panic(err)
    }
    iter := query.Iter()
    /// I don't know if I'm using this bool correctly or not.
    /// My assumption is that this would return false when a new page isn't
    /// available, thus meaning that all the pages have been filled and
    /// the loop can exit.
    if !iter.WillSwitchPage() {
        break
    }
    for {
        row := map[string]interface{}{}
        if !iter.MapScan(row) {
            pagestate = iter.PageState()
            break
        }
        /// Do whatever I need with row.
    }
}
Am I doing this right, or is there a better way to achieve this?

So, as it turns out, WillSwitchPage() will never return true at any point in the loop. Not sure why. I guess it's because of how I'm using MapScan() to control the inner loop.
Anyway, I figured out a solution by checking the []byte page state itself at the end of the query loop. If the driver reaches the end and doesn't fill the page, the page state will have 0 elements, so I'm using that as my halt condition. This may or may not be the most elegant or intended way to handle the driver's pagination, but it's working as wanted.
var pagestate []byte
query := session.Query(`select * from keyspace.my_table`)
for {
    query = query.PageState(pagestate)
    if err := query.Exec(); err != nil {
        panic(err)
    }
    iter := query.Iter()
    for {
        row := map[string]interface{}{}
        if !iter.MapScan(row) {
            pagestate = iter.PageState()
            break
        }
        /// Do whatever I need with row.
    }
    if len(pagestate) == 0 {
        break
    }
}
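As a side note: if the page state only has to live inside one process (rather than being handed back to a caller between requests), my understanding is that gocql will also fetch subsequent pages transparently while you drain a single iterator, so a manual page-state loop may not even be needed. A minimal sketch of that, assuming the same cluster/session setup as above:
// Sketch: rely on the driver fetching the next pages as the iterator is drained.
iter := session.Query(`select * from keyspace.my_table`).PageSize(100).Iter()
for {
    row := map[string]interface{}{}
    if !iter.MapScan(row) {
        break
    }
    /// Do whatever I need with row.
}
if err := iter.Close(); err != nil {
    panic(err)
}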

Related

BigQuery Golang maxResults for Queries

I would like to be able to specify maxResults when using the golang BigQuery library. It isn't clear how to do this, though. I don't see it as an option in the documentation, and I have browsed the source trying to find it, but I only see some sporadic usage in functionality that doesn't seem related to queries. Is there a way around this?
I think there is no method implemented in the SDK for that, but after looking around a bit, I found this one: request
You could try to execute an HTTP GET specifying the parameters (you can find an example of using the parameters here: query_parameters)
By default the Google API iterators manage page size for you. The RowIterator returns a single row by default, backed internally by fetched pages that rely on the backend to select an appropriate size.
If, however, you want to specify a fixed maximum page size, you can use the google.golang.org/api/iterator package to iterate by pages while specifying a size. The size, in this case, corresponds to maxResults for BigQuery's query APIs.
See https://github.com/googleapis/google-cloud-go/wiki/Iterator-Guidelines for more general information about advanced iterator usage.
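If you only want to cap the size of the underlying fetches while still ranging over rows one at a time, I believe those guidelines also cover setting the iterator's page info directly. A rough sketch of that, reusing the q and ctx names from the test below (the value 100 is just an assumed maxResults, and error handling is elided):
it, err := q.Read(ctx)
if err != nil {
    // handle the error
}
// MaxSize caps the page size requested from the backend (maxResults here).
it.PageInfo().MaxSize = 100
var row []bigquery.Value
for {
    err := it.Next(&row)
    if err == iterator.Done {
        break
    }
    if err != nil {
        // handle the error
    }
    // use row
}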
Here's a quick test to demonstrate with the RowIterator in bigquery. It executes a query that returns a row for each day in October, and iterates by page:
func TestQueryPager(t *testing.T) {
    ctx := context.Background()
    pageSize := 5
    client, err := bigquery.NewClient(ctx, "your-project-id here")
    if err != nil {
        t.Fatal(err)
    }
    defer client.Close()
    q := client.Query("SELECT * FROM UNNEST(GENERATE_DATE_ARRAY('2022-10-01','2022-10-31', INTERVAL 1 DAY)) as d")
    it, err := q.Read(ctx)
    if err != nil {
        t.Fatalf("query failure: %v", err)
    }
    pager := iterator.NewPager(it, pageSize, "")
    var fetchedPages int
    for {
        var rows [][]bigquery.Value
        nextToken, err := pager.NextPage(&rows)
        if err != nil {
            t.Fatalf("NextPage: %v", err)
        }
        fetchedPages++
        if len(rows) > pageSize {
            t.Errorf("page size exceeded, got %d want %d", len(rows), pageSize)
        }
        t.Logf("(next token %s) page size: %d", nextToken, len(rows))
        if nextToken == "" {
            break
        }
    }
    wantPages := 7
    if fetchedPages != wantPages {
        t.Fatalf("fetched %d pages, wanted %d pages", fetchedPages, wantPages)
    }
}

Multiple queries to Postgres within the same function

I'm new to Go, so sorry for the silly question in advance!
I'm using the Gin framework and want to make multiple queries to the database within the same handler (database/sql + lib/pq):
userIds := []int{}
bookIds := []int{}
var id int

/* Handling first query here */
rows, err := pgClient.Query(getUserIdsQuery)
defer rows.Close()
if err != nil {
    return
}
for rows.Next() {
    err := rows.Scan(&id)
    if err != nil {
        return
    }
    userIds = append(userIds, id)
}

/* Handling second query here */
rows, err = pgClient.Query(getBookIdsQuery)
defer rows.Close()
if err != nil {
    return
}
for rows.Next() {
    err := rows.Scan(&id)
    if err != nil {
        return
    }
    bookIds = append(bookIds, id)
}
I have a couple of questions regarding this code (any improvements and best practices would be appreciated):
Does Go properly handle defer rows.Close() in such a case? I mean, I reassign the rows variable later in the code, so will the compiler track both and properly close them at the end of the function?
Is it OK to reuse the shared id variable, or should I redeclare it inside the rows.Next() loop?
What's a better approach for having even more queries within one handler? Should I have some kind of Writer that accepts a query and a slice and populates it with the retrieved ids?
Thanks.
I've never worked with the go-pg library, and my answer mostly focuses on the other parts, which are generic and not specific to Go or go-pg.
Regardless of the fact that rows here is the same variable shared between the two queries (so one rows.Close() call would suffice, unless the library has some special implementation), defining two variables, like userRows and bookRows, is cleaner.
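For illustration, a rough sketch of that shape with database/sql as used in the question (the query and slice names come from the original snippet); it also checks the error before deferring Close and scopes id to each loop, which is the usual pattern:
userRows, err := pgClient.Query(getUserIdsQuery)
if err != nil {
    return
}
defer userRows.Close()
for userRows.Next() {
    var id int
    if err := userRows.Scan(&id); err != nil {
        return
    }
    userIds = append(userIds, id)
}

bookRows, err := pgClient.Query(getBookIdsQuery)
if err != nil {
    return
}
defer bookRows.Close()
for bookRows.Next() {
    var id int
    if err := bookRows.Scan(&id); err != nil {
        return
    }
    bookIds = append(bookIds, id)
}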
Although I already said I haven't worked with go-pg, I believe you won't need to iterate through the rows and scan the id for every row manually; the library appears to provide an API like this (based on a quick look at the documentation):
userIds := []int{}
err := pgClient.Query(&userIds, "select id from users where ...", args...)
Regarding your second question, it depends on what you mean by "ok". Since you're doing synchronous iteration, I don't think it would lead to bugs, but when it comes to coding style, I personally wouldn't do it.
I think that the best thing to do in your case is this:
// repo layer
func getUserIds(args whatever) ([]int, error) {...}
// these can be exposed, based on your packaging logic
func getBookIds(args whatever) ([]int, error) {...}

// service layer, or wherever you want to aggregate both queries
func getUserAndBookIds() ([]int, []int, error) {
    userIds, err := getUserIds(...)
    // potential error handling
    bookIds, err := getBookIds(...)
    // potential error handling
    return userIds, bookIds, nil // you have done err handling earlier
}
I think this code is easier to read/maintain. You won't face the variable reassignment and other issues.
You can take a look at the go-pg documentation for more details on how to improve your query.

Mutex used right?

I'm a bit confused about locking/unlocking a mutex several times, one after another.
I'm using an RWMutex, and all goroutines share the same mutex, of course.
Is this code still race-protected when using the mutex this often?
func (r *Redis) RedisDb(dbId DatabaseId) *RedisDb {
    r.Mu().RLock()
    size := len(r.redisDbs) // A
    r.Mu().RUnlock()
    if size >= int(dbId) { // B
        r.Mu().RLock()
        db := r.redisDbs[dbId] // C
        r.Mu().RUnlock()
        if db != nil { // D
            return db
        }
    }
    // E create db...
}
An example situation I think could happen:
goroutine1 and goroutine2 are both running this function
both are at point A, so the variable size is 3
condition B is true for both goroutines
both read C at the same time
the variable db is nil for both goroutines, so condition D is false
now both goroutines proceed to E and create the same database twice, and that's bad
Or do I have to lock/unlock only once for the whole check, like this?
func (r *Redis) RedisDb(dbId DatabaseId) *RedisDb {
    r.Mu().Lock()
    defer r.Mu().Unlock()
    size := len(r.redisDbs)
    if size >= int(dbId) {
        db := r.redisDbs[dbId]
        if db != nil {
            return db
        }
    }
    // create db...
}
Solution
func (r *Redis) RedisDb(dbId DatabaseId) *RedisDb {
    getDb := func() *RedisDb { // returns nil if the db does not exist
        if len(r.redisDbs) >= int(dbId) {
            db := r.redisDbs[dbId]
            if db != nil {
                return db
            }
        }
        return nil
    }

    r.Mu().RLock()
    db := getDb()
    r.Mu().RUnlock()
    if db != nil {
        return db
    }

    // create db
    r.Mu().Lock()
    defer r.Mu().Unlock()

    // check again that the db does not exist, since
    // multiple "mutex readers" can reach this point
    db = getDb()
    if db != nil {
        return db
    }
    // now really create it
    // ...
}
Welcome to the world of synchronization. Your assessment is correct: there is opportunity for concurrency issues to arise with the first implementation. With the second, those concurrency issues are removed, but it's fully locked, with no opportunity for even concurrent read access. You don't have to do it that way: you could do an initial check with a read lock, then, if that check determines creation is needed, acquire a write lock, re-check, create if still necessary, and then unlock. This isn't an unusual construct. It's less efficient (the check is performed twice), so it's up to you to decide on the trade-off, based mostly on how expensive the check is to run twice and how often the function can operate in the read-only path.
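Stripped down to a generic form, the check / write-lock / re-check shape described above looks roughly like this (hypothetical cache and DB types, not the asker's Redis code; assumes the sync package is imported and the map is initialized before use):
type DB struct{}

type cache struct {
    mu  sync.RWMutex
    dbs map[int]*DB // must be initialized, e.g. make(map[int]*DB)
}

func (c *cache) get(id int) *DB {
    // Fast path: read lock only.
    c.mu.RLock()
    db := c.dbs[id]
    c.mu.RUnlock()
    if db != nil {
        return db
    }
    // Slow path: write lock, then re-check, because another goroutine
    // may have created the entry between RUnlock and Lock.
    c.mu.Lock()
    defer c.mu.Unlock()
    if db := c.dbs[id]; db != nil {
        return db
    }
    db = &DB{}
    c.dbs[id] = db
    return db
}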

Return all documents from elasticsearch query

My question is specific to the "gopkg.in/olivere/elastic.v2" package I am using.
I am trying to return all documents that match my query:
termQuery := elastic.NewTermQuery("item_id", item_id)
searchResult, err := es.client.Search().
    Index(index).
    Type(SegmentsType). // search segments type
    Query(termQuery).   // specify the query
    Sort("time", true). // sort by "time" field, ascending
    From(0).Size(9).
    Pretty(true).       // pretty print request and response JSON
    Do()                // execute
if err != nil {
    // Handle error
    return timeline, err
}
The problem is I get an internal server error if I increase the size to something large. If I eliminate the line that states:
From(0).Size(9).
then the default is used (10 documents). How may I return all documents?
Using a scroller to retrieve all results is only slightly different, and in the interest of brevity I'm not including all of the error handling you might need.
Basically, you just need to change your code from Search to Scroller, then loop, calling Do on the scroller and handling each page of results.
termQuery := elastic.NewTermQuery("item_id", item_id)
scroller := es.client.Scroller().
    Index(index).
    Type(SegmentsType).
    Query(termQuery).
    Sort("time", true).
    Size(1)

docs := 0
for {
    res, err := scroller.Do(context.TODO())
    if err == io.EOF {
        // No remaining documents matching the search, so break out of the 'forever' loop
        break
    }
    if err != nil {
        // Any other error: handle it however fits your application
        break
    }
    for _, hit := range res.Hits.Hits {
        // JSON parse or do whatever with each document retrieved from your index
        item := make(map[string]interface{})
        if err := json.Unmarshal(*hit.Source, &item); err != nil {
            // skip or handle malformed documents as needed
            continue
        }
        docs++
    }
}

Why is RethinkDB very slow?

I am getting started with RethinkDB; I have never used it before. I'm giving it a try together with Gorethink, following this tutorial.
To sum up this tutorial, there are two programs:
The first one updates entries infinitely.
for {
    var scoreentry ScoreEntry
    pl := rand.Intn(1000)
    sc := rand.Intn(6) - 2
    res, err := r.Table("scores").Get(strconv.Itoa(pl)).Run(session)
    if err != nil {
        log.Fatal(err)
    }
    err = res.One(&scoreentry)
    scoreentry.Score = scoreentry.Score + sc
    _, err = r.Table("scores").Update(scoreentry).RunWrite(session)
}
And the second one receives these changes and logs them.
res, err := r.Table("scores").Changes().Run(session)
var value interface{}
if err != nil {
    log.Fatalln(err)
}
for res.Next(&value) {
    fmt.Println(value)
}
In the statistics that RethinkDB shows, I can see that there are 1.5K reads and writes per second. But in the console of the second program, I see 1 or 2 changes per second approximately.
Why does this occur? Am I missing something?
This code:
r.Table("scores").Update(scoreentry).RunWrite(session)
Probably doesn't do what you think it does. This attempts to update every document in the table by merging scoreentry into it. This is why the RethinkDB console is showing so many writes per second: every time you run that query it's resulting in thousands of writes.
Usually you want to update documents inside of ReQL, like so:
r.Table("scores").Get(strconv.Itoa(pl)).Update(func(row r.Term) interface{} {
    return map[string]interface{}{"Score": row.GetField("Score").Add(sc)}
})
If you need to do the update in Go code, though, you can replace just that one document like so:
r.Table("scores").Get(strconv.Itoa(pl)).Replace(scoreentry)
I'm not sure why it is quite that slow; it could be because, by default, each query blocks until the write has been completely flushed. I would first add some kind of instrumentation to see which operation is being so slow. There are also a couple of ways you can improve the performance:
Set the Durability of the write using UpdateOpts
_, err = r.Table("scores").Update(scoreentry, r.UpdateOpts{
    Durability: "soft",
}).RunWrite(session)
Execute each query in a goroutine to allow your code to run multiple queries in parallel (you may need to use a pool of goroutines instead, but this code is just a simplified example; a bounded version is sketched after the loop below)
for {
    go func() {
        var scoreentry ScoreEntry
        pl := rand.Intn(1000)
        sc := rand.Intn(6) - 2
        res, err := r.Table("scores").Get(strconv.Itoa(pl)).Run(session)
        if err != nil {
            log.Fatal(err)
        }
        err = res.One(&scoreentry)
        scoreentry.Score = scoreentry.Score + sc
        _, err = r.Table("scores").Update(scoreentry).RunWrite(session)
    }()
}
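A hedged sketch of what bounding that concurrency might look like, using a buffered channel as a simple semaphore (the limit of 16 is an arbitrary choice; the body mirrors the loop above):
sem := make(chan struct{}, 16) // at most 16 in-flight updates
for {
    sem <- struct{}{} // acquire a slot; blocks while 16 updates are already running
    go func() {
        defer func() { <-sem }() // release the slot when done
        var scoreentry ScoreEntry
        pl := rand.Intn(1000)
        sc := rand.Intn(6) - 2
        res, err := r.Table("scores").Get(strconv.Itoa(pl)).Run(session)
        if err != nil {
            log.Println(err)
            return
        }
        if err := res.One(&scoreentry); err != nil {
            log.Println(err)
            return
        }
        scoreentry.Score += sc
        if _, err := r.Table("scores").Update(scoreentry).RunWrite(session); err != nil {
            log.Println(err)
        }
    }()
}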

Resources