Filter ES query based on aggregation results - elasticsearch

We have an index with the following document structure:
{
  email: "test@test.com",
  stuff...
},
{
  email: "test@test.com",
  stuff...
},
{
  email: "anotherEmail@test.com",
  stuff...
}
We need to get all records where the count of records per distinct email is > 2, for example. I know I can use an aggregation with a minimum doc count to find the counts for all emails that have at least 2 records.
But what we need to do is actually get all the records where the count per distinct email is > X. So we need our query to constrain our results to only those records that match an aggregation.
I know that we can have a nested top_hits aggregation, but that is not good enough for us, because we need to be able to page through these results... there could be 10k records for a single email, for example. We need to be able to get these results in the hits collection so that we can page them.
How would we go about doing something like that?
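One approach that fits both requirements is a two-step query: first collect the qualifying emails with a terms aggregation (min_doc_count: 3 corresponds to "more than 2 records"), then run a normal paged search filtered to those emails. A minimal sketch, assuming email is mapped as a keyword field and using an illustrative index name mails:

GET mails/_search
{
  "size": 0,
  "aggs": {
    "frequent_emails": {
      "terms": {
        "field": "email",
        "min_doc_count": 3,
        "size": 10000
      }
    }
  }
}

The bucket keys from the response then feed a second, fully pageable query:

GET mails/_search
{
  "from": 0,
  "size": 100,
  "query": {
    "terms": {
      "email": ["test@test.com"]
    }
  }
}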

Related

ElasticSearch - backward pagination with search_after when sorting value is null

I have an application which has a dashboard, basically a table with hundreds of thousands of records.
This table has up to 50 different columns. These columns have different types in mapping: keyword, text, boolean, integer.
As records in the table might have the same values, I use sorting as an array of 2 attributes:
The first attribute is whatever the client wants to sort by. It can be a simple sorting object or some sort query with a nested filter.
The second attribute is basically a default sort by id, needed for ordering documents which have identical values for the column the customer wants to sort by.
I checked multiple topics/issues on GitHub and here on the Elastic forum to understand how to implement the search_after mechanism for backward pagination, but it's not working for all the cases I need.
Please have a look at the image:
Imagine there is a limit = 3; the customer is currently on the 3rd page of the table, and all the data is sorted by name asc, _id asc.
The names on the image are: A, B, C, D, E.
The ids are the numeric parts of the Doc labels.
When the customer wants to go back to the previous page, which is page #2 in my picture, I pass the following to Elastic:
sort: [
  {
    name: 'desc'
  },
  {
    _id: 'desc'
  }
],
search_after: [null, Doc7._id]
As a result, I get only one document, which is Doc6: null on my image. It seems logical, because I'm asking Elastic to search descending after null and id 7, and only one doc corresponds to this: Doc6. But it's not what I need.
I can't come up with a solution to get the data that I need.
Could anyone help, please?
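One possible workaround (a sketch, not a verified fix for every case): avoid null sort keys entirely by giving the sort a concrete sentinel through the missing option, so search_after never has to compare against null. Here '~' is an arbitrary placeholder that sorts after the real names; pick a sentinel that suits your data:

sort: [
  {
    name: {
      order: 'desc',
      missing: '~'
    }
  },
  {
    _id: 'desc'
  }
],
search_after: ['~', Doc7._id]

With every document carrying a real sort value, paging backward and forward uses the same well-defined comparisons.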

FaunaDB search document and get its ranking based on a score

I have the following Collection of documents with structure:
type Streak struct {
    UserID    string    `fauna:"user_id"`
    Username  string    `fauna:"username"`
    Count     int       `fauna:"count"`
    UpdatedAt time.Time `fauna:"updated_at"`
    CreatedAt time.Time `fauna:"created_at"`
}
This looks like the following in FaunaDB Collections:
{
  "ref": Ref(Collection("streaks"), "288597420809388544"),
  "ts": 1611486798180000,
  "data": {
    "count": 1,
    "updated_at": Time("2021-01-24T11:13:17.859483176Z"),
    "user_id": "276989300",
    "username": "yodanparry"
  }
}
Basically I need a lambda or a function that takes in a user_id and spits out its rank within the collection. Rank is simply determined by sorting on the count field. For example, let's say I have the following documents (I ignored other fields for simplicity):
user_id | count
------- | -----
abc     | 12
xyz     | 10
fgh     | 999
If I throw in fgh as an input for this lambda function, I want it to spit out 1 (or 0 if you start counting from 0).
I already have an index for user_id, so I can query and match a document reference from this index. I also have an index sorted_count that sorts documents based on the count field in ascending order.
My current solution is to query all documents via the sorted_count index, then get the rank by iterating through the array. I think there should be a better solution for this; I'm just not seeing it.
Please help. Thank you!
Counting things in Fauna isn't as easy as one might expect. But you might still be able to do something more efficient than you describe.
Assuming you have:
CreateIndex({
  name: "sorted_count",
  source: Collection("streaks"),
  values: [
    { field: ["data", "count"] }
  ]
})
Then you can query this index like so:
Count(
  Paginate(
    Match(Index("sorted_count")),
    { after: 10, size: 100000 }
  )
)
Which will return an object like this one:
{
  before: [10],
  data: [123]
}
Which tells you that there are 123 documents with count >= 10, which I think is what you want.
This means that, in order to get a user's rank based on their user_id, you'll need to implement this two-step process:
Determine the count of the user in question using your index on user_id.
Query sorted_count using the user's count as described above.
Note that, in case your collection has more than 100,000 documents, you'll need your Go code to iterate through all the pages based on the returned object's after field. 100,000 is Fauna's maximum allowed page size. See the Fauna docs on pagination for details.
Also note that this might not reflect whatever your desired logic is for resolving ties.
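Putting the two steps together in a single FQL expression, roughly (streaks_by_user_id is a hypothetical name standing in for your existing user_id index):

Let(
  {
    userCount: Select(
      ["data", "count"],
      Get(Match(Index("streaks_by_user_id"), "276989300"))
    )
  },
  Count(
    Paginate(
      Match(Index("sorted_count")),
      { after: Var("userCount"), size: 100000 }
    )
  )
)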

Navigating terms aggregation in Elastic with very large number of buckets

Hope everyone is staying safe!
I am trying to explore the proper way to tackle the following use case in Elasticsearch.
Let's say that I have about 700000 docs which I would like to bucket on the basis of a field (let's call it primary_id). This primary_id can be the same for more than one doc (usually up to 2-3 docs will have the same primary_id). In all other cases the primary_id is not repeated in any other doc.
So on average, out of every 10 docs, I will have 8 unique primary_ids and 1 primary_id shared between 2 docs.
To ensure uniqueness I tried using the terms aggregation, and I ended up getting buckets in response to my search request but not for the subsequent scroll requests. Upon googling, I found that scroll queries do not support aggregations.
As a result, I tried finding alternate solutions, and tried the solution in this link as well: https://lukasmestan.com/learn-how-to-use-scroll-elasticsearch-aggregation/
It suggests the use of multiple search requests, each specifying the partition number to fetch (depending on how many partitions you divide your results into). But I receive client timeouts even with high timeout settings on the client side.
Ideally, I want to know the best way to approach data where the cardinality of the bucketing field is almost equal to the number of docs. The SQL equivalent would be SELECT DISTINCT(primary_id) FROM .....
But in Elasticsearch, distinct things can only be processed via bucketing (terms aggregation).
I also use top_hits as a sub-aggregation under the terms aggregation to fetch the _source fields.
Any help would be extremely appreciated!
Thanks!
There are 3 ways to paginate an aggregation.
Composite aggregation
Partition
Bucket sort
Partition you have already tried.
Composite Aggregation: combines multiple sources into a single set of buckets and allows pagination and sorting on them. It can only paginate linearly using after_key, i.e. you cannot jump from page 1 to page 3. You can fetch "n" records, then pass the returned after_key to fetch the next "n" records.
GET index22/_search
{
  "size": 0,
  "aggs": {
    "ValueCount": {
      "value_count": {
        "field": "id.keyword"
      }
    },
    "pagination": {
      "composite": {
        "size": 2,
        "sources": [
          {
            "TradeRef": {
              "terms": {
                "field": "id.keyword"
              }
            }
          }
        ]
      }
    }
  }
}
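To fetch the next page, pass the after_key from the response back in via after (the value below is a placeholder):

GET index22/_search
{
  "size": 0,
  "aggs": {
    "pagination": {
      "composite": {
        "size": 2,
        "sources": [
          {
            "TradeRef": {
              "terms": {
                "field": "id.keyword"
              }
            }
          }
        ],
        "after": { "TradeRef": "<after_key.TradeRef from previous response>" }
      }
    }
  }
}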
Bucket sort
The bucket_sort aggregation, like all pipeline aggregations, is executed after all other non-pipeline aggregations. This means the sorting only applies to whatever buckets are already returned from the parent aggregation. For example, if the parent aggregation is terms and its size is set to 10, the bucket_sort will only sort over those 10 returned term buckets.
So this isn't suitable for your case
You can increase the result size to a value greater than 10K by updating the index.max_result_window setting. Setting too big a size can cause out-of-memory issues, so you need to test it to see how much your hardware can support.
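For example, raising the window on an index looks like this (50000 is just an illustrative value; 10,000 is the default):

PUT index22/_settings
{
  "index": {
    "max_result_window": 50000
  }
}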
A better option is to use the scroll API and perform the distinct on the client side.
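A minimal sketch of that scroll loop (index name and page size are illustrative; dedupe the primary_id values client side as pages arrive):

GET index22/_search?scroll=1m
{
  "size": 1000,
  "_source": ["primary_id"],
  "sort": ["_doc"],
  "query": { "match_all": {} }
}

GET _search/scroll
{
  "scroll": "1m",
  "scroll_id": "<_scroll_id from the previous response>"
}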

elasticsearch aggregation bucket

In Kibana (or potentially even elasticsearch), is there a way to sort documents into buckets based on a field, and then compute statistics on the generated buckets themselves? Here is a simplified example of my problem:
I have logs with structure:
{
  user_id: [string],
  post_id: [string]
}
that signal a user with ID user_id has viewed post with ID post_id. I would like to:
Bucket the logs by matching user_id
Count the number of logs per bucket
Compute the 75th percentile of these bucket-specific counts
Is this possible in Kibana?
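This maps naturally onto a terms aggregation with a percentiles_bucket pipeline aggregation over the bucket doc counts. A sketch, assuming user_id is mapped as a keyword and using an illustrative index name logs:

GET logs/_search
{
  "size": 0,
  "aggs": {
    "per_user": {
      "terms": {
        "field": "user_id",
        "size": 10000
      }
    },
    "p75_views": {
      "percentiles_bucket": {
        "buckets_path": "per_user>_count",
        "percents": [75]
      }
    }
  }
}

The same request can be run from Kibana's Dev Tools console.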

Apollo GraphQL Query get data from two documents

I have a query and I want to go through two different documents. Is it possible to do it?
Document A { _id, senders_ID[sender_id] }
Document B { sender_id, firstname }
// I want to loop through Document A, get each sender_id, and go to Document B to get the firstname.
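One common Apollo pattern is a field resolver on Document A's type that looks up each sender_id in Document B's collection. A rough JavaScript sketch; Sender, senders, and getSendersByIds are hypothetical names standing in for your actual schema and data source:

const typeDefs = `
  type Sender {
    sender_id: ID!
    firstname: String
  }
  type DocumentA {
    _id: ID!
    senders: [Sender]
  }
`;

const resolvers = {
  DocumentA: {
    // parent.senders_ID holds the ids from Document A;
    // look each one up in Document B's collection.
    senders: (parent, _args, context) =>
      context.dataSources.getSendersByIds(parent.senders_ID),
  },
};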
