Elastic Search SUM of aggregated values - elasticsearch

We are using Elasticsearch to get some statistics. I need to:
1. Get the average value for each group.
2. Sum all these values.
So far, step no. 1 was pretty straightforward. However, I really don't know how to sum all the values at the end. Is this possible? If yes, how?
Thanks for any suggestions.
Here is my aggs query:
{
"query":{
"filtered":{
"query":{
"query_string":{
"analyze_wildcard":true,
"query":"*"
}
}
}
},
"aggs":{
"2":{
"terms":{
"field":"person",
"size":5000,
"order":{
"1":"desc"
}
},
"aggs":{
"1":{
"avg":{
"field":"company"
}
}
}
}
}
}

Aggregating over aggregation results is not yet supported in Elasticsearch. Apparently a concept called reducers is being developed for 2.0. In the meantime, I would suggest having a look at scripted metric aggregations. Basically, you can create your own aggregation by controlling the collection and computation steps yourself using scripts.
Alternatively, if possible, you can precompute and store the average at index time and then use the sum aggregation when querying.
Have a look at the following question for an example of this aggregation: Elasticsearch: Possible to process aggregation results?
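If doing the last step on the client is acceptable, the sum can also be computed from the terms buckets after the response comes back. A minimal Python sketch, assuming the response has already been parsed into a dict and uses the aggregation names "2" and "1" from the question's query. The function name sum_bucket_averages and the sample values are made up for illustration:

```python
def sum_bucket_averages(response, terms_agg="2", avg_agg="1"):
    """Sum the per-bucket averages of a terms aggregation on the client side."""
    buckets = response["aggregations"][terms_agg]["buckets"]
    values = [bucket[avg_agg]["value"] for bucket in buckets]
    # An avg over a bucket with no values for the field comes back as null
    # (None after JSON parsing), so those buckets are skipped.
    return sum(value for value in values if value is not None)


# Hypothetical response fragment, shaped like a terms + avg aggregation result.
sample_response = {
    "aggregations": {
        "2": {
            "buckets": [
                {"key": "alice", "doc_count": 3, "1": {"value": 2.5}},
                {"key": "bob", "doc_count": 1, "1": {"value": 4.0}},
                {"key": "carol", "doc_count": 2, "1": {"value": None}},
            ]
        }
    }
}

total = sum_bucket_averages(sample_response)
```

On Elasticsearch 2.0 and later, the reducers mentioned above shipped as pipeline aggregations, so a sum_bucket aggregation should be able to do the same thing on the server.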

Related

elastic search query to aggregate field by biggest timestamp instead of max aggregation

I have the following elastic search query:
{
"aggs":{
"trends":{
"terms":{
"field":"notificationType",
"size":100
},
"aggs":{
"granularity":{
"date_histogram":{
"field":"timestamp",
"interval":"1d"
},
"aggs":{
"critical":{
"max":{
"field":"push_count"
}
}
}
}
}
}
},
"size":0
}
This aggregates all notifications by type and then by day, and adds the maximum push_count for each bucket to the response. Instead of the max aggregation, I need the push_count of the most recent document (the one with the biggest timestamp).
I have tried using a histogram aggregation instead of the max one, but that groups the results into yet another level of buckets, and I just need the value of the most recent push_count (in SQL, this sub-piece would translate to SELECT push_count WHERE timestamp = most recent timestamp).
Can you give me some advice on how to achieve this? Is a pipeline aggregation the way to solve it?
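One option that is often used for "the value from the most recent document in each bucket" is a top_hits sub-aggregation sorted by timestamp descending with size 1, rather than a pipeline aggregation. The Python sketch below only builds the request body and has not been checked against your mapping or Elasticsearch version. The aggregation name latest and the helper name latest_push_count_query are hypothetical:

```python
import json


def latest_push_count_query(interval="1d"):
    """Build the question's query with a top_hits sub-aggregation in place of max."""
    return {
        "size": 0,
        "aggs": {
            "trends": {
                "terms": {"field": "notificationType", "size": 100},
                "aggs": {
                    "granularity": {
                        "date_histogram": {"field": "timestamp", "interval": interval},
                        "aggs": {
                            # Hypothetical name; returns the single newest document
                            # per day bucket, limited to its push_count field.
                            "latest": {
                                "top_hits": {
                                    "size": 1,
                                    "sort": [{"timestamp": {"order": "desc"}}],
                                    "_source": {"includes": ["push_count"]},
                                }
                            }
                        },
                    }
                },
            }
        },
    }


# JSON text that could be sent as the body of a _search request.
query_body = json.dumps(latest_push_count_query(), indent=2)
```

With this shape, the push_count would be read from the _source of the single hit inside each latest bucket, instead of from a value field.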

Aggregation after sorting and limit in Elastic Search 5.6

I have to do an aggregation on Elasticsearch documents after sorting the results and picking the top n from them.
I tried to do this:
{
"size":1,
"query":{
"bool":{
"must":[
{
"terms":{
"name.keyword":[
"some_name"
]
}
},
{
"exists":{
"field":"3g_duration_count"
}
}
]
}
},
"sort":[
{
"tmst":{
"order":"desc"
}
}
],
"aggs":{
"fieldNameAgg":{
"avg":{
"field":"3g_duration_count"
}
}
}
}
Here the top n results are fetched after the aggregation has run (which makes no sense for my use case). I want to pick the top n records based on the sort criteria and then apply the aggregation. How do I achieve this?
I am using Elasticsearch 5.6.
Is there a way to pass the results of the inner query, along with the sort and limit clauses, to a child query and then apply the avg aggregation on top of that? That way I can ensure the limit is applied before the aggregation happens.
An equivalent SQL query might look like this:
select avg(field_value) from (select field_value from t1 order by tmst desc fetch first n rows only) t2
Is this possible to accomplish in Elasticsearch 5.6?
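Aggregations run over every document that matches the query, regardless of size and sort, which is why the limit appears to be applied "after" the average. One workaround, assuming n is small enough to fetch in a single request, is to run the sorted search with size n and no aggs, then average the field on the client. A rough Python sketch over an already parsed response. The helper name and the sample values are invented for illustration:

```python
def average_of_top_hits(response, field):
    """Average `field` over the hits of an already sorted and limited search."""
    values = [
        hit["_source"][field]
        for hit in response["hits"]["hits"]
        if field in hit["_source"]
    ]
    if not values:
        return None  # nothing to average
    return sum(values) / len(values)


# Hypothetical response for "size": 3 sorted by tmst descending.
sample_response = {
    "hits": {
        "hits": [
            {"_source": {"tmst": 30, "3g_duration_count": 10}},
            {"_source": {"tmst": 20, "3g_duration_count": 20}},
            {"_source": {"tmst": 10, "3g_duration_count": 60}},
        ]
    }
}

average = average_of_top_hits(sample_response, "3g_duration_count")
```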

How to find all duplicate documents in ElasticSearch

We have a need to walk over all of the documents in our AWS ElasticSearch cluster, version 6.0, and gather a count of all the duplicate user ids.
I have tried using a Data Visualization to aggregate counts on the user ids and export them, but the numbers don't match another source of our data that is searchable via traditional SQL.
What we would like to see is something like this:
USER ID COUNT
userid1 4
userid22 3
...
I am not an advanced Lucene query person and have yet to find an answer to this question. If anyone can provide some insight into how to do this, I would be appreciative.
The following query counts the documents for each id and filters out the ids that occur fewer than 2 times, so you'll get something along the lines of:
id:2, count:2
id:4, count:15
GET /index/_search
{
"query":{
"match_all":{}
},
"aggs":{
"user_id":{
"terms":{
"field":"user_id",
"size":100000,
"min_doc_count":2
}
}
}
}
More here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html
If you want to get all duplicate userids with their counts:
First, you need to know how large the terms aggregation has to be.
Find the number of distinct userid values with a cardinality aggregation:
GET index/type/_search
{
"size": 0,
"aggs": {
"maximum_match_counts": {
"cardinality": {
"field": "userid",
"precision_threshold": 100
}
}
}
}
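Read the value of the maximum_match_counts aggregation from the response (keep in mind that cardinality is an approximate count).
Now you can get all duplicate userids, substituting that value for the maximum_match_counts placeholder in the size field: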
GET index/type/_search
{
"size": 0,
"aggs": {
"userIds": {
"terms": {
"field": "userid",
"size": maximum_match_counts,
"min_doc_count": 2
}
}
}
}
When you go with the terms aggregation (Bharat's suggestion) and set the aggregation size to more than 10K, you will get a warning that this approach will throw an error in future releases.
Instead of the terms aggregation, you should use the composite aggregation to scan all of your documents with the pagination/after_key method.
The composite aggregation can be used to paginate all buckets from a multi-level aggregation efficiently. It provides a way to stream all buckets of a specific aggregation, similar to what scroll does for documents.
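The after_key pagination described above can be sketched in Python as follows, assuming a cluster version that supports the composite aggregation. The search argument is a hypothetical callable that posts a body to the index's _search endpoint and returns the parsed JSON. The names by_user and uid are made up. As far as I know, composite has no min_doc_count option, so the duplicate filter is applied on the client:

```python
def iter_duplicate_user_ids(search, field="user_id", page_size=1000):
    """Yield (user_id, doc_count) for every value that occurs at least twice.

    `search` is a hypothetical callable: it takes a request body (dict) and
    returns the parsed _search response (dict).
    """
    after = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{"uid": {"terms": {"field": field}}}],
        }
        if after is not None:
            composite["after"] = after  # resume after the last bucket seen
        body = {"size": 0, "aggs": {"by_user": {"composite": composite}}}
        agg = search(body)["aggregations"]["by_user"]
        if not agg["buckets"]:
            break
        for bucket in agg["buckets"]:
            # Client-side filter: keep values seen in 2 or more documents.
            if bucket["doc_count"] >= 2:
                yield bucket["key"]["uid"], bucket["doc_count"]
        after = agg.get("after_key")
        if after is None:
            break


# Stand-in for a real cluster: three canned pages keyed by the "after" value.
_pages = {
    None: {
        "buckets": [
            {"key": {"uid": "a"}, "doc_count": 4},
            {"key": {"uid": "b"}, "doc_count": 1},
        ],
        "after_key": {"uid": "b"},
    },
    "b": {"buckets": [{"key": {"uid": "c"}, "doc_count": 3}], "after_key": {"uid": "c"}},
    "c": {"buckets": []},
}


def fake_search(body):
    after = body["aggs"]["by_user"]["composite"].get("after")
    return {"aggregations": {"by_user": _pages[after["uid"] if after else None]}}


duplicates = list(iter_duplicate_user_ids(fake_search, page_size=2))
```

In real use, fake_search would be replaced by whatever client call you already use to run a search. The generator keeps only one page of buckets in memory at a time.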

Max Clause Count in ElasticSearch / Rewrite queries

I know that Elasticsearch has an internal limit on how many clauses you can use in a bool query. This is controlled by the max_clause_count setting in the elasticsearch.yml file.
But I thought that this limit did not apply to the values passed in a search, so that a query like the following would work with more than 1024 values in the terms query:
{
"query":{
"bool":{
"should":[
{ "terms": {"id": ["cafe-babe-0000","cafe-babe-0001",... ]}}
]
}
}
}
But this query throws a TooManyClauses exception. So, in this case, the number of values in the query also counts towards this limit. Is that correct?
Also, I know that it's not the best way to perform this kind of query, but is it possible to rewrite the previous query so that the limit is not exceeded?
You can use the ids query (this matches on the documents' _id, so it assumes the values are document ids).
"query": {
"ids": {
"values": [ "cafe-babe-0000","cafe-babe-0001",... ]
}
}
As far as I know, there is no such limitation on this query.
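If the values are not document ids, another way to stay under the limit (not taken from the answer above) is to split the values into batches and run one terms query per batch, merging the results on the client. A small Python sketch, assuming the default limit of 1024 mentioned in the question. The helper names chunked and terms_queries are hypothetical:

```python
def chunked(values, limit=1024):
    """Split `values` into consecutive lists of at most `limit` items."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return [values[i:i + limit] for i in range(0, len(values), limit)]


def terms_queries(field, values, limit=1024):
    """Build one terms query body per batch of values."""
    return [{"query": {"terms": {field: batch}}} for batch in chunked(values, limit)]


# 2500 illustrative ids in the same format as the question's examples.
ids = ["cafe-babe-%04d" % i for i in range(2500)]
queries = terms_queries("id", ids)
batch_sizes = [len(q["query"]["terms"]["id"]) for q in queries]
```

Each body in queries would be sent as a separate search request. The trade-off is more round trips and client-side merging in exchange for never exceeding the clause limit.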

Understanding boosting in ElasticSearch

I've been using ElasticSearch for a little bit with the goal of building a search engine and I'm interested in manually changing the IDFs (Inverse Document Frequencies) of each term to match the ones one can measure from the Google Books unigrams.
In order to do that I plan on doing the following:
1) Use only 1 shard (so IDFs are not computed for every shard and they are "global")
2) Get the ttf (total term frequency, which is used to compute the IDFs) for every term by running this query for every document in my index
curl -XGET 'http://localhost:9200/index/document/id_doc/_termvectors?pretty=true' -d '{
"fields" : ["content"],
"offsets" : true,
"term_statistics" : true
}'
3) Use the Google Books unigram model to "rescale" the ttf for every term.
The problem is that, once I've found the "boost" factors I have to use for every term, how can I use this in a query?
For instance, let's consider this example
"query":
{
"bool":{
"should":[
{
"match":{
"title":{
"query":"cat",
"boost":2
}
}
},
{
"match":{
"content":{
"query":"cat",
"boost":2
}
}
}
]
}
}
Does that mean that the IDF of the term "cat" is going to be boosted / multiplied by a factor of 2?
Also, what happens if, instead of searching for one word, I search for a sentence? Would that mean that the IDF of each word is going to be boosted by 2?
I tried to understand the role of the boost parameter (https://www.elastic.co/guide/en/elasticsearch/guide/current/query-time-boosting.html) and t.getBoost(), but it seems a little confusing.
Boost is used when a query has multiple query clauses, for example:
{
"bool":{
"should":[
{
"match":{
"clause1":{
"query":"query1",
"boost":3
}
}
},
{
"match":{
"clause2":{
"query":"query2",
"boost":2
}
}
},
{
"match":{
"clause3":{
"query":"query1",
"boost":1
}
}
}
]
}
}
In the above query, clause1 is three times as important as clause3, and clause2 is twice as important as clause3. The scores are not simply multiplied by 3 and 2, because scores are normalized when they are calculated.
Also, if your query has only a single clause, adding a boost to it is not useful.
A usage scenario for boost:
Suppose you have a set of page documents with title and content fields.
You want to search title and content for some terms, and you consider title more important than content when searching these documents, so you give the title clause a higher boost than the content clause. For example, if your query matches one document in its title field and another document in its content field, and you want the title match ranked ahead of the content match, boost can help you do that.
