Return all documents from elasticsearch query - elasticsearch

My question is specific to the "gopkg.in/olivere/elastic.v2" package I am using.
I am trying to return all documents that match my query:
termQuery := elastic.NewTermQuery("item_id", item_id)
searchResult, err := es.client.Search().
Index(index).
Type(SegmentsType). // search segments type
Query(termQuery). // specify the query
Sort("time", true). // sort by "user" field, ascending
From(0).Size(9).
Pretty(true). // pretty print request and response JSON
Do() // execute
if err != nil {
// Handle error
return timeline, err
}
The problem is I get an internal server error if I increase the size to something large. If I eliminate the line that states:
From(0).Size(9).
then the default is used (10 documents). How may I return all documents?

Using a scroller to retrieve all results is just a bit different and in the interest of brevity I'm not including much error handling that you might need.
Basically you just need to slightly change your code from Search to Scroller and then loop with the Scroller calling Do and handling pages of results.
termQuery := elastic.NewTermQuery("item_id", item_id)
scroller := es.client.Scroller().
Index(index).
Type(SegmentsType).
Query(termQuery).
Sort("time", true).
Size(1)
docs := 0
for {
res, err := scroller.Do(context.TODO())
if err == io.EOF {
// No remaining documents matching the search so break out of the 'forever' loop
break
}
for _, hit := range res.Hits.Hits {
// JSON parse or do whatever with each document retrieved from your index
item := make(map[string]interface{})
err := json.Unmarshal(*hit.Source, &item)
docs++
}
}

Related

GORM return list of list of results or map of results with group by id

Essentially, using GORMDB, my current code looks something like this:
res = []*modelExample
DB.Model(&modelExample{}).
Order("task_id ").
Find(res)
And what I do with res is that I will manually loop through and append the models with the same task_id into one list, and then append this list to be worked on. The reason why I need to do this is because there are some specific operations i need to do on specific columns that I need to extract which I can't do in GORM.
However, is there a way to do this more efficiently where I return like a list of list, which I can then for loop and do my operation on each list element?
You should be able to achieve your needs with the following code snippet:
package main
import (
"fmt"
"gorm.io/driver/postgres"
"gorm.io/gorm"
)
type modelExample struct {
TaskId int
Name string
}
func main() {
dsn := "host=localhost user=postgres password=postgres dbname=postgres port=5432 sslmode=disable"
db, err := gorm.Open(postgres.Open(dsn), &gorm.Config{})
if err != nil {
panic(err)
}
db.AutoMigrate(&modelExample{})
// here you should populate the database with some data
// querying
res := make(map[int][]modelExample, 0)
rows, err := db.Table("model_examples").Select("task_id, name").Rows()
if err != nil {
panic(err)
}
defer rows.Close()
// scanning
for rows.Next() {
var taskId int
var name string
rows.Scan(&taskId, &name)
if _, isFound := res[taskId]; !isFound {
res[taskId] = []modelExample{{taskId, name}}
continue
}
res[taskId] = append(res[taskId], modelExample{taskId, name})
}
// always good idea to check for errors when scanning
if err = rows.Err(); err != nil {
panic(err)
}
for _, v := range res {
fmt.Println(v)
}
}
After the initial setup, let's take a closer look at the querying section.
First, you're going to get all the records from the table. The records you get are stored in the rows variable.
In the for loop, you scan all of the records. Each record will be either added as a new map entry or appended to an existing one (if the taskId is already present in the map).
This is the easiest way to create different lists based on a specific column (e.g. the TaskId). Actually, from what I understood, you need to split the records rather than grouping them with an aggregation function (e.g. COUNT, SUM, and so on).
The other code I added was just put in for clarity.
Let me know if this solves your issue or if you need something else, thanks!

BigQuery Golang maxResults for Queries

I would like to be able to specify maxResults when using the golang BigQuery library. It isn't clear how to do this, though. I don't see it as an option in the documentation, and I have browsed the source to try to find it but I only see some sporadic usage in seemingly functionality not related to queries. Is there a way to circumvent this issue?
I think there is no implemented method in the SDK for that but after looking a bit, I found this one: request
You could try to execute an HTTP GET specifying the parameters (you can find an example of the use of parameters here: query_parameters)
By default the google API iterators manage page size for you. The RowIterator returns a single row by default, backed internally by fetched pages that rely on the backend to select an appropriate size.
If however you want to specify a fixed max page size, you can use the google.golang.org/api/iterator package to iterate by pages while specifying a specific size. The size, in this case, corresponds to maxResults for BigQuery's query APIs.
See https://github.com/googleapis/google-cloud-go/wiki/Iterator-Guidelines for more general information about advanced iterator usage.
Here's a quick test to demonstrate with the RowIterator in bigquery. It executes a query that returns a row for each day in October, and iterates by page:
func TestQueryPager(t *testing.T) {
ctx := context.Background()
pageSize := 5
client, err := bigquery.NewClient(ctx, "your-project-id here")
if err != nil {
t.Fatal(err)
}
defer client.Close()
q := client.Query("SELECT * FROM UNNEST(GENERATE_DATE_ARRAY('2022-10-01','2022-10-31', INTERVAL 1 DAY)) as d")
it, err := q.Read(ctx)
if err != nil {
t.Fatalf("query failure: %v", err)
}
pager := iterator.NewPager(it, pageSize, "")
var fetchedPages int
for {
var rows [][]bigquery.Value
nextToken, err := pager.NextPage(&rows)
if err != nil {
t.Fatalf("NextPage: %v", err)
}
fetchedPages = fetchedPages + 1
if len(rows) > pageSize {
t.Errorf("page size exceeded, got %d want %d", len(rows), pageSize)
}
t.Logf("(next token %s) page size: %d", nextToken, len(rows))
if nextToken == "" {
break
}
}
wantPages := 7
if fetchedPages != wantPages {
t.Fatalf("fetched %d pages, wanted %d pages", fetchedPages, wantPages)
}
}

Figuring out Page State with Cassandra GOCQL Driver (Golang)

I've been trying to wrap my head around how paging in Apache Cassandra with the driver functions in GOlang.
I have the following code for fetching rows
/// Assume all other prerequisites.
session, _ := cluster.CreateSession()
session.SetPageSize(100)
var pagestate []byte
query := session.Query(`select * from keyspace.my_table`)
query = query.PageState(pagestate)
if err := query.Exec(); != nil {
panic(err)
}
iter := query.Iter()
for {
row := map[string]interface{}{}
if !iter.MapScan(row) {
pagestate = iter.PageState()
break
}
/// Do whatever I need with row.
}
What I'm trying to achieve:
The table I'm referencing is huge, over 18k rows huge, and I want to fetch all of them for a special operation in the most efficient manner using the driver's built in paging so the query won't time out.
The problem:
I'm not sure how to get the query to resume at the previous page state. I'm not sure if this involves running the query in a loop and managing page state outside of it or not. I understand how to get and set page state, I can't figure out how to iterate the query with a new page sate each time without a proper halt condition when all the paging is done.
My best attempt:
var pagestate []byte
query := session.Query(`select * from keyspace.my_table`)
for {
query = query.PageState(pagestate)
if err := query.Exec(); != nil {
panic(err)
}
iter := query.Iter()
/// I don't know if I'm using this bool correct or not.
/// My assumption is that this would return false when a new page isn't
/// avaliable, thus meaning that all the pages have been filled and
/// the loop can exit.
if !iter.WillSwitchPage() {
break
}
for {
row := map[string]interface{}{}
if !iter.MapScan(row) {
pagestate = iter.PageState()
break
}
/// Do whatever I need with row.
}
}
Am I doing this right, or is there a better way to achieve this?
So, as it turns out, WillSwitchPage() will never return true at any point in the loop. Not sure why. I guess it's because of how I'm using MapScan() to control the inner loop.
Anyway, I figured out a solution by checking the []byte for the page state itself at the en dof the query loop. If the driver reaches the end and doesn't fill the page, page state will have 0 elements, so I'm using that as my halt condition. This may or may not be the most elegant or intended way to handle the driver's pagination, but it's functioning as wanted.
var pagestate []byte
query := session.Query(`select * from keyspace.my_table`)
for {
query = query.PageState(pagestate)
if err := query.Exec(); != nil {
panic(err)
}
iter := query.Iter()
for {
row := map[string]interface{}{}
if !iter.MapScan(row) {
pagestate = iter.PageState()
break
}
/// Do whatever I need with row.
}
if len(pagestate) == 0 {
break
}
}

Why is RethinkDB very slow?

I am getting started with RethinkDB, I have never used it before. I give it a try together with Gorethink following this tutorial.
To sum up this tutorial, there are two programs:
The first one updates entries infinitely.
for {
var scoreentry ScoreEntry
pl := rand.Intn(1000)
sc := rand.Intn(6) - 2
res, err := r.Table("scores").Get(strconv.Itoa(pl)).Run(session)
if err != nil {
log.Fatal(err)
}
err = res.One(&scoreentry)
scoreentry.Score = scoreentry.Score + sc
_, err = r.Table("scores").Update(scoreentry).RunWrite(session)
}
And the second one, receives this changes and logs them.
res, err := r.Table("scores").Changes().Run(session)
var value interface{}
if err != nil {
log.Fatalln(err)
}
for res.Next(&value) {
fmt.Println(value)
}
In the statistics that RethinkDB shows, I can see that there are 1.5K reads and writes per second. But in the console of the second program, I see 1 or 2 changes per second approximately.
Why does this occur? Am I missing something?
This code:
r.Table("scores").Update(scoreentry).RunWrite(session)
Probably doesn't do what you think it does. This attempts to update every document in the table by merging scoreentry into it. This is why the RethinkDB console is showing so many writes per second: every time you run that query it's resulting in thousands of writes.
Usually you want to update documents inside of ReQL, like so:
r.Table('scores').Get(strconv.Itoa(pl)).Update(func (row Term) interface{} {
return map[string]interface{}{"Score": row.GetField('Score').Add(sc)};
})
If you need to do the update in Go code, though, you can replace just that one document like so:
r.Table('scores').Get(strconv.Itoa(pl)).Replace(scoreentry)
Im not sure why it is quite that slow, it could be because by default each query blocks until the write has been completely flushed. I would first add some kind of instrumentation to see which operation is being so slow. There are also a couple of ways that you can improve the performance:
Set the Durability of the write using UpdateOpts
_, err = r.Table("scores").Update(scoreentry, r.UpdateOpts{
Durability: "soft",
}).RunWrite(session)
Execute each query in a goroutine to allow your code to execute multiple queries in parallel (you may need to use a pool of goroutines instead but this code is just a simplified example)
for {
go func() {
var scoreentry ScoreEntry
pl := rand.Intn(1000)
sc := rand.Intn(6) - 2
res, err := r.Table("scores").Get(strconv.Itoa(pl)).Run(session)
if err != nil {
log.Fatal(err)
}
err = res.One(&scoreentry)
scoreentry.Score = scoreentry.Score + sc
_, err = r.Table("scores").Update(scoreentry).RunWrite(session)
}()
}

golang couchbase gocb.RemoveOp - doesnt removes all

I think I did a silly mistake somewhere, but could not figure where for long time already :( The code is rough, I just testing things.
It deletes, but by some reasons not all documents, I have rewritten to delete it all one by one, and that went OK.
I use official package for Couchbase http://github.com/couchbase/gocb
Here is code:
var items []gocb.BulkOp
myQuery := gocb.NewN1qlQuery([Selecting ~ 283k documents from 1.5mln])
rows, err := myBucket.ExecuteN1qlQuery(myQuery, nil)
checkErr(err)
var idToDelete map[string]interface{}
for rows.Next(&idToDelete) {
items = append(items, &gocb.RemoveOp{Key: idToDelete["id"].(string)})
}
if err := rows.Close(); err != nil {
fmt.Println(err.Error())
}
if err := myBucket.Do(items);err != nil {
fmt.Println(err.Error())
}
This way it deleted ~70k documents, I run it again it got deleted 43k more..
Then I just let it delete one by one, and it worked fine:
//var items []gocb.BulkOp
myQuery := gocb.NewN1qlQuery([Selecting ~ 180k documents from ~1.3mln])
rows, err := myBucket.ExecuteN1qlQuery(myQuery, nil)
checkErr(err)
var idToDelete map[string]interface{}
for rows.Next(&idToDelete) {
//items = append(items, &gocb.RemoveOp{Key: idToDelete["id"].(string)})
_, err := myBucket.Remove(idToDelete["id"].(string), 0)
checkErr(err)
}
if err := rows.Close(); err != nil {
fmt.Println(err.Error())
}
//err = myBucket.Do(items)
By default, queries against N1QL use a consistency level called 'request plus'. Thus, your second time running the program to query will use whatever index update is valid at the time of the query, rather than considering all of your previous mutations by waiting until the index is up to date. You can read more about this in Couchbase's Developer Guide and it looks like the you'll want to add the RequestPlus parameter to your myquery through the consistency method on the query.
This kind of eventually consistent secondary indexing and the flexibility is pretty powerful because it gives you as a developer the ability to decide what level of consistency you want to pay for since index recalculations have a cost.

Resources