Aggregating by count then getting middle buckets from an Elasticsearch query

So I know how to get the top N buckets by aggregation in Elasticsearch, which is this:
query = {
    "size": 0,
    "aggs": {
        "group_by_cat": {
            "terms": {
                "field": "cat.keyword",
                "size": 100
            }
        }
    }
}
How do I get, for example, the buckets ranked 400th-500th? I was pointed to range aggregations in the Elasticsearch reference, but I couldn't figure out how to apply them to this problem.

In an aggregation you cannot use the "from" keyword.
There are two ways to perform pagination:
Partitions.
"size": 15,
"include": {
    "partition": 0,        <-- trying to retrieve the first partition
    "num_partitions": 7    <-- expecting 7 partitions (7 * 15 > 101 total records)
}
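Put together, a minimal sketch of a complete request, reusing the group_by_cat aggregation from the question (the size and partition values are illustrative):
{
    "size": 0,
    "aggs": {
        "group_by_cat": {
            "terms": {
                "field": "cat.keyword",
                "size": 15,
                "include": {
                    "partition": 0,
                    "num_partitions": 7
                }
            }
        }
    }
}
Each request returns one partition; iterate partition from 0 to num_partitions - 1 to walk all buckets. Note that terms are assigned to partitions by hashing, so a partition is not the same as "buckets ranked 400th-500th by count"; if the exact rank range matters, you still need to collect the partitions and rank them client-side.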
Composite aggregation
It combines several aggregations into a single stream. With it you can only paginate linearly, i.e. you cannot jump from page 1 to page 3; a sketch follows.
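A minimal sketch of composite pagination over the same field (the source name cat is illustrative):
{
    "size": 0,
    "aggs": {
        "group_by_cat": {
            "composite": {
                "size": 100,
                "sources": [
                    { "cat": { "terms": { "field": "cat.keyword" } } }
                ]
            }
        }
    }
}
Each response carries an after_key; pass it back inside the composite body as "after": { "cat": "<last key>" } to fetch the next page, so reaching buckets 400-500 means walking the first four pages. Also note that composite buckets are ordered by key, not by document count.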

Related

elasticsearch: count appearance of terms aggregation on other fields

I want to count how many times unique values (the result of a terms aggregation) have appeared in other fields in the same query. Let's say:
{
    "size": 0,
    "query": {
        "match_all": {}
    },
    "aggs": {
        "unique_products": {
            "terms": {
                "field": "products.name.keyword",
                "min_doc_count": 10
            }
        }
    }
}
What I want is to count how many times each of the keys returned in the buckets appeared in another field.
My ideal output is:
"aggregations": {
"product_stat": {
"key": "<product_name>"
"sold": "<#>" #I want to know how many times the key is appeared in another field like sold
"bought": "<#>"
}
}
Elasticsearch cannot do terms aggregations over multiple fields; in short, if it could, aggregations would not be blazing fast.
As the documentation suggests, there are two options:
- use a script terms aggregation (with a performance penalty),
- change how the documents are indexed so that a normal terms aggregation can be used.
Depending on the structure of your data and your use cases, you might also get by with a complex aggregation plus some processing on the client side; a sketch follows.
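A minimal sketch of the aggregation-plus-client-side idea, assuming sold and bought are keyword fields in the same index (both names come from the desired output above and may need adjusting). Three sibling terms aggregations produce key counts per field, and the client then joins the keys:
{
    "size": 0,
    "aggs": {
        "unique_products": {
            "terms": { "field": "products.name.keyword", "min_doc_count": 10 }
        },
        "sold_counts": {
            "terms": { "field": "sold.keyword" }
        },
        "bought_counts": {
            "terms": { "field": "bought.keyword" }
        }
    }
}
For each key in unique_products, look up its doc_count in sold_counts and bought_counts on the client side to assemble the product_stat output, raising the terms size values as needed so the keys actually overlap.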
Hope that helps!

How to find all duplicate documents in ElasticSearch

We have a need to walk over all of the documents in our AWS ElasticSearch cluster, version 6.0, and gather a count of all the duplicate user ids.
I have tried using a Data Visualization to aggregate counts on the user ids and export them, but the numbers don't match another source of our data that is searchable via traditional SQL.
What we would like to see is like this:
USER ID     COUNT
userid1     4
userid22    3
...
I am not an advanced Lucene query person and have yet to find an answer to this question. If anyone can provide some insight into how to do this, I would be appreciative.
The following query will count each id, filtering out the ids with fewer than 2 occurrences, so you'll get something like:
id:2, count:2
id:4, count:15
GET /index/_search
{
    "query": {
        "match_all": {}
    },
    "aggs": {
        "user_id": {
            "terms": {
                "field": "user_id",
                "size": 100000,
                "min_doc_count": 2
            }
        }
    }
}
More here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html
If you want to get all duplicate userids with counts:
First, find out the maximum size your aggregation needs, i.e. the number of distinct userids, via a cardinality aggregation.
GET index/type/_search
{
    "size": 0,
    "aggs": {
        "maximum_match_counts": {
            "cardinality": {
                "field": "userid",
                "precision_threshold": 100
            }
        }
    }
}
Take the value of the maximum_match_counts aggregation from the response. Now you can get all duplicate userids:
GET index/type/_search
{
    "size": 0,
    "aggs": {
        "userIds": {
            "terms": {
                "field": "userid",
                "size": maximum_match_counts,
                "min_doc_count": 2
            }
        }
    }
}
When you go with a terms aggregation (Bharat's suggestion) and set the aggregation size above 10K, you will get a warning that this approach will throw an error in future releases.
Instead of a terms aggregation, you should go with a composite aggregation and scan all of your buckets via pagination with after_key, as sketched below.
The composite aggregation can be used to paginate all buckets from a multi-level aggregation efficiently. It provides a way to stream all buckets of a specific aggregation, similar to what scroll does for documents.
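A minimal sketch of that approach (the page size is illustrative). Since composite aggregations have no min_doc_count, buckets with doc_count < 2 are dropped on the client side while paging:
GET index/type/_search
{
    "size": 0,
    "aggs": {
        "userIds": {
            "composite": {
                "size": 1000,
                "sources": [
                    { "userid": { "terms": { "field": "userid" } } }
                ]
            }
        }
    }
}
Each response includes an after_key object; repeat the request with "after": <after_key> added to the composite body until no buckets come back.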

Elasticsearch get aggregation bucket size (number of elements in the bucket) without retrieving all data

I am trying to get information about an aggregation in Elasticsearch.
I have an index in which I store mail metadata (sender IP, subject, etc.). What I'm trying to do is get the number of IPs which sent over 1000 mails. (So, for example, let's say we have 3 IP addresses: 2000 mails were sent from the first IP, 1500 from the second, and 200 from the third. Then I want to see 2 as the aggregation result.) I wrote the following query:
GET /my_index/_search
{
    "size": 0,
    "aggs": {
        "ipAddresses": {
            "terms": {
                "field": "senderIpAddress",
                "min_doc_count": 1000,
                "size": 0
            }
        }
    }
}
I can get the buckets and compute their number in my back-end implementation, but that requires retrieving all the data in the buckets, which is slow. I want the bucket count without fetching all the data.
TL;DR: how can I get the total number of aggregation buckets without retrieving the whole data set?
This is the purpose of the cardinality aggregation:
{
    "size": 0,
    "aggs": {
        "ipAddressesCount": {
            "cardinality": {
                "field": "senderIpAddress"
            }
        }
    }
}
Keep in mind that it is an approximation; the precision can be configured using precision_threshold, as documented for the cardinality aggregation.
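Note that a plain cardinality over senderIpAddress counts all distinct IPs, not just those with over 1000 mails. If the filtered count is needed without shipping buckets to the back end, one possible sketch (an addition here, not part of the original answer) keeps the min_doc_count terms aggregation from the question and adds a stats_bucket sibling pipeline, whose count field is the number of buckets the terms aggregation produced:
GET /my_index/_search?filter_path=aggregations.ip_bucket_stats
{
    "size": 0,
    "aggs": {
        "ipAddresses": {
            "terms": {
                "field": "senderIpAddress",
                "min_doc_count": 1000,
                "size": 10000
            }
        },
        "ip_bucket_stats": {
            "stats_bucket": {
                "buckets_path": "ipAddresses>_count"
            }
        }
    }
}
The filter_path parameter only trims the response; the terms buckets are still built on the cluster, so this saves transfer and client-side work rather than aggregation cost.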

Compare documents in Elasticsearch

I am new to Elasticsearch and I am trying to get all documents which have the same mobile type. I couldn't find a relevant question and am currently stuck.
curl -XPUT 'http://localhost:9200/sessions/session/1' \
-d '{"useragent": "1121212","mobile": "android", "browser": "mozilla", "device": "computer", "service-code": "1112"}'
EDIT -
I need the Elasticsearch equivalent of the following:
SELECT * FROM session s1, session s2
WHERE s1.device = s2.device
What you are trying to achieve is simply grouping docs on a field, which SQL expresses via a self-join.
A similar notion of grouping can be achieved with a terms aggregation in Elasticsearch. However, this aggregation returns only group-level metrics like count, sum, etc.; it does not return the individual records.
There is, however, another aggregation which can be applied as a sub-aggregation to the terms aggregation: the top_hits aggregation.
The top_hits aggregator can effectively be used to group result sets
by certain fields via a bucket aggregator. One or more bucket
aggregators determines by which properties a result set get sliced
into.
Options
from - The offset from the first result you want to fetch.
size - The maximum number of top matching hits to return per bucket. By default the top three matching hits are returned.
sort - How the top matching hits should be sorted. By default the hits are sorted by the score of the main query.
Here is a sample query
{
    "query": {
        "match_all": {}
    },
    "aggs": {
        "top-mobiles": {
            "terms": {
                "field": "device"
            },
            "aggs": {
                "top_device_hits": {
                    "top_hits": {}
                }
            }
        }
    }
}
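An empty top_hits returns the default top three matching hits per bucket. A variant that exercises the options listed above (the sort on service-code.keyword is illustrative and assumes that keyword sub-field exists in the mapping):
{
    "query": {
        "match_all": {}
    },
    "aggs": {
        "top-mobiles": {
            "terms": {
                "field": "device"
            },
            "aggs": {
                "top_device_hits": {
                    "top_hits": {
                        "from": 0,
                        "size": 10,
                        "sort": [
                            { "service-code.keyword": { "order": "asc" } }
                        ]
                    }
                }
            }
        }
    }
}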

Limit and Offset in Term Aggregation ElasticSearch

There is a way to get the top n terms. For example:
{
    "aggs": {
        "apiSalesRepUser": {
            "terms": {
                "field": "userName",
                "size": 5
            }
        }
    }
}
Is there any way to set the offset for the terms result?
If you mean something like "ignore the first m results and return the next n results", then no, it is not possible. A workaround would be to set size to m + n and ignore the first m results with client-side processing.
A little late, but (at least) since Elasticsearch 5.2.0 you can use partitioning in the terms aggregation to paginate results; a sketch follows.
https://www.elastic.co/guide/en/elasticsearch/reference/5.2/search-aggregations-bucket-terms-aggregation.html#_filtering_values_with_partitions
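A minimal sketch with the userName field from the question (the partition and size values are illustrative; note that terms are assigned to partitions by hash, so this pages through disjoint slices rather than through a rank order):
{
    "aggs": {
        "apiSalesRepUser": {
            "terms": {
                "field": "userName",
                "size": 5,
                "include": {
                    "partition": 0,
                    "num_partitions": 20
                }
            }
        }
    }
}
Increment partition on each request to walk through all the terms.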
Maybe this helps a bit. Note that bucket_sort must be a sub-aggregation of the terms aggregation whose buckets it paginates:
"aggregations": {
    "apiSalesRepUser": {
        "terms": {
            "field": "userName",
            "size": 9999    ---> add here a bigger size
        },
        "aggregations": {
            "limitBucket": {
                "bucket_sort": {
                    "sort": [],
                    "from": 10,
                    "size": 20,
                    "gap_policy": "skip"
                }
            }
        }
    }
}
I am not sure what value to put in the terms size; I would suggest a reasonable one. It limits the initial aggregation, and the limitBucket aggregation then trims the terms buckets again. This will probably still load into memory all the buckets you allowed in the terms aggregation, so whether that is acceptable depends on your scenario: it is reasonable when you don't need all results (e.g. if you have tens of thousands), as in a Google-like search where nobody jumps to page 1000.
Compared to the alternative of trimming the data on the client side, this might save you some data transfer from ES, but as I said, weigh it carefully: it loads a lot of data into ES memory and you might run into memory issues in Elasticsearch.
