Is it possible to sort buckets in Terms aggregation response on a non-term field? - elasticsearch

I need to sort the buckets in the result of an Elasticsearch terms aggregation.
Below is one of the records indexed in Elasticsearch:
{"personId":"10","Salary":10000, "Age":20, "personName":"xyz"}
I am using a terms aggregation on the Salary field. Below is the Elasticsearch query with the terms aggregation:
{
"aggs" : {
"genders" : {
"terms" : {
"field" : "Salary"
}
}
}
}
This query returns the buckets on the basis of Salary values. These buckets can be sorted by the Salary value using the order parameter, as in the query below:
{
"aggs" : {
"genders" : {
"terms" : {
"field" : "Salary",
"order" : { "_term" : "asc" }
}
}
}
}
But I need to sort the buckets by the Age field (a field other than the one used for the terms aggregation). Is there any way to do it?

The whole point of aggregations is to "dispatch" the documents into buckets, each of which is defined by the declared field of the terms aggregation, in your case Salary.
The buckets you get in the response are not documents anymore. For instance, in the bucket 10000, you'll get the count of documents which have Salary: 10000, and you'll have as many buckets as different Salary values there are in all your documents (by default only 10 buckets, though).
So, since buckets are not documents, and since a bucket can aggregate documents with different Age values, it's not clear how you'd like the Salary buckets to be sorted by Age.
Maybe, one way out of this could be to add a terms sub-aggregation on the Age field, so you get top Salary buckets and below that you get Age buckets. Then you can sort your Salary/Age bucket pairs any way you want.
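If a single number per bucket is enough (for instance the average Age of the people in each Salary bucket), a terms aggregation can be ordered by a single-value metric sub-aggregation. Below is a minimal sketch of such a request body, built as a Python dict. The aggregation names by_salary and avg_age and the helper function are made up for illustration; whether an average is the right summary of Age depends on what you want the ordering to mean.

```python
import json


def salary_buckets_ordered_by_age(direction="asc"):
    """Build a search body whose Salary buckets are ordered by average Age.

    The names "by_salary" and "avg_age" are illustrative, not required by
    Elasticsearch.
    """
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "aggs": {
            "by_salary": {
                "terms": {
                    "field": "Salary",
                    # Order buckets by the metric sub-aggregation below.
                    "order": {"avg_age": direction},
                },
                "aggs": {
                    "avg_age": {"avg": {"field": "Age"}},
                },
            }
        },
    }


print(json.dumps(salary_buckets_ordered_by_age(), indent=2))
```

The printed JSON can be sent as the body of a _search request.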

Related

Elasticsearch "size" value not working in terms aggregation with partitions

I am trying to paginate over a specific field using the terms aggregation with partitions.
The problem is that the number of returned terms for each partition is not equal to the size parameter that I set.
These are the steps that I am doing:
Retrieve the number of different unique values for the field with "cardinality" aggregation.
In my data, the result is 21.
From the web page, the user wants to display a table with 10 items per page.
if unique_values % page_size != 0:
    partitions_number = (unique_values // page_size) + 1
else:
    partitions_number = (unique_values // page_size)
Then I run this simple query:
POST my_index/_search?pretty
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"match": {
"field_to_paginate": "foo"
}
}
]
}
},
"aggs": {
"by_pchostname": {
"terms": {
"size": 10,
"field": "field_to_paginate",
"include": {
"partition": 0,
"num_partitions": 3
}
}
}
}
}
I am expecting to retrieve 10 results. But if I run the query I have only 7 results.
What am I missing here? Do I need to use a different solution here?
As a side note, I can't use composite aggregation because I need to sort results by doc_count over the whole dataset.
Partitions in a terms aggregation divide the unique values into roughly equal chunks.
In your case num_partitions is 3, so each partition holds about 21 / 3 = 7 values; size only caps the number of buckets returned per partition.
Partitions are meant for working through fields with a large number of unique values, on the order of thousands.
You may be able to leverage the shard_size parameter. My suggestion is to read the relevant part of the manual and experiment with shard_size.
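To see why size does not fill a partition up to 10, here is a small simulation. It assumes that each unique term is assigned to exactly one partition by hashing its value; zlib.crc32 is only a stand-in for whatever hash Elasticsearch uses internally, and the host-NN terms are invented.

```python
import zlib


def partition_of(term, num_partitions):
    # Stand-in hash: the real hash function is internal to Elasticsearch.
    return zlib.crc32(term.encode("utf-8")) % num_partitions


def split_into_partitions(terms, num_partitions):
    """Assign every term to exactly one partition."""
    partitions = [[] for _ in range(num_partitions)]
    for term in terms:
        partitions[partition_of(term, num_partitions)].append(term)
    return partitions


# 21 hypothetical unique values, as in the question.
terms = [f"host-{i:02d}" for i in range(21)]
partitions = split_into_partitions(terms, 3)

for number, members in enumerate(partitions):
    print(number, len(members))
```

Whatever the exact split, the three partitions always add up to 21 terms, so they cannot each contain 10. Requesting partition 0 with size 10 returns at most the terms that landed in partition 0.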
Terms aggregation does not allow pagination. Use composite aggregation instead (requires ES >= 6.1.0). Below is the quote from reference docs:
If you want to retrieve all terms or all combinations of terms in a
nested terms aggregation you should use the Composite aggregation
which allows to paginate over all possible terms rather than setting a
size greater than the cardinality of the field in the terms
aggregation. The terms aggregation is meant to return the top terms
and does not allow pagination.
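A rough sketch of paging with the composite aggregation follows. The search argument is a hypothetical callable that takes a request body and returns the parsed JSON response (for example, a thin wrapper around your client); the names pages and value are also made up. Note that composite buckets come back ordered by key, not by doc_count, which is the limitation mentioned in the question.

```python
def iter_composite_buckets(search, field, page_size=10):
    """Yield every bucket of a composite aggregation, one page at a time.

    `search` is a hypothetical callable: request body in, response dict out.
    """
    after = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{"value": {"terms": {"field": field}}}],
        }
        if after is not None:
            # Resume after the last key of the previous page.
            composite["after"] = after
        body = {"size": 0, "aggs": {"pages": {"composite": composite}}}

        result = search(body)["aggregations"]["pages"]
        buckets = result["buckets"]
        if not buckets:
            return  # an empty page means everything has been read
        yield from buckets

        # Prefer the returned after_key; fall back to the last bucket's key.
        after = result.get("after_key", buckets[-1]["key"])
```

Each yielded bucket has the usual key and doc_count entries, so a page of the table corresponds to one pass through the loop.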

How to perform a distinct count query in Elasticsearch

I have an index with a host field. I am trying to retrieve the count of documents by distinct host name.
IE:
Host1:
Count: 72
Host2:
Count: 33
Host3:
Count: 153
Each document has a host field and it is a string. I assume I need to do something involving terms and cardinality, but I can't quite nail the syntax.
How to get all possible values for field host?
curl -XGET 'http://localhost:9200/articles/_search?pretty' -H 'Content-Type: application/json' -d '
{
"aggs" : {
"whatever_you_like_here" : {
"terms" : { "field" : "host", "size":10000 }
}
},
"size" : 0
}'
Notes:
The result will contain a doc_count for each unique value.
"size":10000 returns at most 10000 unique values. The default is 10.
"size":0 suppresses the search hits. By default, "hits" contains 10 documents, and we don't need them.
By default, the buckets are ordered by doc_count in decreasing order.
Reference: bucket terms aggregation
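On the client side, the buckets can be turned into a host-to-count mapping. The sketch below assumes the response has already been parsed into a Python dict; sample_response is hand-written for illustration, using the counts from the question, and is not real server output.

```python
def counts_by_host(response, agg_name="whatever_you_like_here"):
    """Map each bucket key (host name) to its doc_count."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Hand-written sample in the shape of a terms aggregation response.
sample_response = {
    "aggregations": {
        "whatever_you_like_here": {
            "buckets": [
                {"key": "Host3", "doc_count": 153},
                {"key": "Host1", "doc_count": 72},
                {"key": "Host2", "doc_count": 33},
            ]
        }
    }
}

print(counts_by_host(sample_response))
# {'Host3': 153, 'Host1': 72, 'Host2': 33}
```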

How to use multiple query strings with aggregation in elasticsearch

How to use multiple query strings with aggregate functions in elasticsearch?
For example:
if a>0 AND a<1, then {"low":count(aggregate count of records within 0 to 1)}
else if a > 1 AND a < 100, then {"normal":count(aggregate count of records within 1 to 100)}
else {"high":count(aggregate count of records after 100)}
How to achieve this using Request Body Query string?
Thank you in advance.
Assuming that a is a field that you search on, I think the easiest way for you to do that is using the range aggregation with buckets for each of your use-cases (low, normal, high).
You cannot bind aggregations to conditions of your query. That you would have to do in code yourself. But if you use the range aggregation, you could define your buckets like this:
POST /_search
{
"aggs" : {
"a_ranges" : {
"range" : {
"field" : "a",
"ranges" : [
{ "to" : 1 },
{ "from" : 1, "to" : 100 },
{ "from" : 100 }
]
}
}
}
}
Depending on your query, some of these buckets may remain empty, but this should give you the result you want.
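The range aggregation treats each range as including its from value and excluding its to value. The following sketch mimics that rule in plain Python, using the thresholds from the question (1 and 100) and its labels low, normal and high as range keys; the function name bucket_for is invented.

```python
def bucket_for(value, ranges):
    """Return the key of the first range containing value, or None.

    A range includes its "from" bound and excludes its "to" bound.
    """
    for r in ranges:
        lower_ok = "from" not in r or value >= r["from"]
        upper_ok = "to" not in r or value < r["to"]
        if lower_ok and upper_ok:
            return r["key"]
    return None


ranges = [
    {"key": "low", "to": 1},
    {"key": "normal", "from": 1, "to": 100},
    {"key": "high", "from": 100},
]

for a in (0.5, 1, 99.9, 100):
    print(a, bucket_for(a, ranges))
```

A value of exactly 1 falls into normal and a value of exactly 100 into high, which is worth checking against the boundaries you actually want.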

elasticsearch: get random distinct field values?

We have Elasticsearch documents with a "dealerId" field. Multiple documents can have the same "dealerId". We want to pick "N" random dealers from them.
What I have done so far: The following query would return max 1000 "dealerId" and their count in descending order. We will then randomly pick "N" records client side.
{
"from":0,
"size":0,
"aggs":{
"CityIdCount":{
"terms":{
"field":"dealerId",
"order" : { "_term" : "desc" },
"size":1000
}
}
}
}
The downside with this approach is that:
If in the future we have more than 1K unique dealers, this approach would fail, as it would pick only the top 1K dealerId occurrences. What should we put as "size" for this?
We are fetching all the data from the Elasticsearch server to the client although we only need a random "N", i.e. 3 or 4 random "dealerId" values. Can we somehow do this randomization in the Elasticsearch query itself, e.g. order: "random"?
I have read something similar here but trying to check if we have some solution for this now.
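For reference, the client-side step described above can be sketched as follows. It assumes the response is already parsed into a Python dict; the function name and the sample data are invented, and this does nothing about the two downsides listed in the question.

```python
import random


def pick_random_dealers(response, n, agg_name="CityIdCount", rng=random):
    """Pick up to n distinct dealerId values from the terms buckets."""
    buckets = response["aggregations"][agg_name]["buckets"]
    keys = [bucket["key"] for bucket in buckets]
    # sample() never returns the same element twice.
    return rng.sample(keys, min(n, len(keys)))


# Hand-written sample response with five hypothetical dealers.
sample_response = {
    "aggregations": {
        "CityIdCount": {
            "buckets": [{"key": dealer_id, "doc_count": 1} for dealer_id in range(5)]
        }
    }
}

print(pick_random_dealers(sample_response, 3))
```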

How to filter results based on frequency of repeating terms in an array in elasticsearch

I have an array field with a lot of keywords, and I need to filter the documents on the basis of how many times a particular keyword is repeated in those arrays.
For example, if my field name is "nationality" and for document 1 it consists of the following
doc1
nationality :
["US","UK","Australia","India","US","US"]
and for doc2
nationality:
["US","UK","US","US","US","China"]
I want only those documents to be shown where the term "US" occurs more than 3 times. That would leave only doc2 to be shown. How to do this?
You can implement this with scripting.
{
"query": {
"filtered": {
"filter": {
"script": {
"script": "_index['nationality']['US'].tf() > 3"
}
}
}
}
}
In this script, the "nationality" field is checked for the term "US", and the count is taken with tf (term frequency). Only the documents with a term frequency greater than three are shown in the results. You can learn more about the filter operations here
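The condition the script expresses can be checked against the two example documents in plain Python. This only illustrates the "more than 3 occurrences" rule; it is not the Elasticsearch scripting API, and the helper name is invented.

```python
def has_term_more_than(values, term, threshold):
    """True when term occurs more than threshold times in values."""
    return values.count(term) > threshold


docs = {
    "doc1": ["US", "UK", "Australia", "India", "US", "US"],
    "doc2": ["US", "UK", "US", "US", "US", "China"],
}

matching = [
    name
    for name, nationality in docs.items()
    if has_term_more_than(nationality, "US", 3)
]
print(matching)  # ['doc2']
```

doc1 contains "US" three times, which is not more than 3, so only doc2 passes.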
