I studied Elasticsearch aggregation queries but couldn't find out whether it supports multiple aggregate functions. In other words, I want to know if Elasticsearch can generate the equivalent of this SQL aggregation query:
SELECT account_no, transaction_type, count(account_no), sum(amount), max(amount) FROM index_name GROUP BY account_no, transaction_type Having count(account_no) > 10
If yes, how?
Thank you.
There are two possible ways to do what you are looking for in ES and I've mentioned them both below.
I've also added a sample mapping and sample documents for your reference.
Mapping:
PUT index_name
{
"mappings": {
"mydocs":{
"properties":{
"account_no":{
"type": "keyword"
},
"transaction_type":{
"type": "keyword"
},
"amount":{
"type":"double"
}
}
}
}
}
Sample Documents:
Notice that I'm only creating 4 transactions for 1 customer.
POST index_name/mydocs/1
{
"account_no": "1011",
"transaction_type":"credit",
"amount": 200
}
POST index_name/mydocs/2
{
"account_no": "1011",
"transaction_type":"credit",
"amount": 400
}
POST index_name/mydocs/3
{
"account_no": "1011",
"transaction_type":"cheque",
"amount": 100
}
POST index_name/mydocs/4
{
"account_no": "1011",
"transaction_type":"cheque",
"amount": 100
}
There are two ways to get what you are looking for:
Solution 1: Using Elasticsearch Query DSL
Aggregation Query:
For Aggregation Query DSL, I've made use of the below aggregation queries to solve what you are looking for.
Terms Aggregation
Sum Aggregation Query (Metric Aggregation)
Max Aggregation Query (Metric Aggregation)
Below is a summarised version of the query so that it's clear which aggregations are siblings and which are parents:
- Terms Aggregation (for every account_no)
  - Terms Aggregation (for every transaction_type)
    - Sum Aggregation (sum of amount)
    - Max Aggregation (max of amount)
Below is the actual query:
POST index_name/_search
{
"size": 0,
"aggs": {
"account_no_agg": {
"terms": {
"field": "account_no"
},
"aggs": {
"transaction_type_agg": {
"terms": {
"field": "transaction_type",
"min_doc_count": 2
},
"aggs": {
"sum_amount": {
"sum": {
"field": "amount"
}
},
"max_amount":{
"max": {
"field": "amount"
}
}
}
}
}
}
}
}
An important thing to mention is min_doc_count, which is the equivalent of HAVING count(account_no) > 10. Since my sample data only has 4 documents, my query keeps only those buckets having count(account_no) >= 2 (min_doc_count: 2).
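For reference, the exact equivalent of the original HAVING count(account_no) > 10 would just change that inner terms aggregation (min_doc_count keeps buckets whose doc_count is at least the given value; everything else stays the same):
"transaction_type_agg": {
  "terms": {
    "field": "transaction_type",
    "min_doc_count": 11
  }
}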
Query Response
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 4,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"account_no_agg" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1011", <---- account_no
"doc_count" : 4, <---- count(account_no)
"transaction_type_agg" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "cheque", <---- transaction_type
"doc_count" : 2,
"sum_amount" : { <---- sum(amount)
"value" : 200.0
},
"max_amount" : { <---- max(amount)
"value" : 100.0
}
},
{
"key" : "credit", <---- another transaction_type
"doc_count" : 2,
"sum_amount" : { <---- sum(amount)
"value" : 600.0
},
"max_amount" : { <---- max(amount)
"value" : 400.0
}
}
]
}
}
]
}
}
}
Look at the above result carefully: I've added comments wherever needed so you can see which part of the SQL query each value corresponds to.
Solution 2: Using Elasticsearch SQL (X-Pack solution)
If you are making use of the X-Pack SQL access feature of Elasticsearch, you can simply copy-paste the SELECT query below, against the mapping and documents shown above:
Elasticsearch SQL:
POST /_xpack/sql?format=txt
{
"query": "SELECT account_no, transaction_type, sum(amount), max(amount), count(account_no) FROM index_name GROUP BY account_no, transaction_type HAVING count(account_no) > 1"
}
Elasticsearch SQL Result:
account_no |transaction_type| SUM(amount) | MAX(amount) |COUNT(account_no)
---------------+----------------+---------------+---------------+-----------------
1011 |cheque |200.0 |100.0 |2
1011 |credit |600.0 |400.0 |2
Note that I've tested the query in ES 6.5.4.
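Incidentally, if you want to see which Query DSL the SQL statement gets rewritten into (and compare it with Solution 1), the X-Pack SQL translate endpoint available in 6.x can show you. A quick sketch against the same index:
POST /_xpack/sql/translate
{
  "query": "SELECT account_no, transaction_type, sum(amount), max(amount), count(account_no) FROM index_name GROUP BY account_no, transaction_type HAVING count(account_no) > 1"
}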
Hope this helps!
I am looking for an Elasticsearch aggregation + mapping that will return the most common list for a certain field.
For example for docs:
{"ToneCurvePV2012": [1,2,3]}
{"ToneCurvePV2012": [1,5,6]}
{"ToneCurvePV2012": [1,7,8]}
{"ToneCurvePV2012": [1,2,3]}
I wish for the aggregation result:
[1,2,3] (since it appears twice).
So far, any aggregation that I've made returns: 1
This is not possible with the default terms aggregation. You need to use a terms aggregation with a script. Please note that this might impact your cluster's performance.
Here, I have used a script which builds a string from the array and uses it as the aggregation key. So if you have an array value like [1,2,3], it will create a string representation like '[1,2,3]', and that key will be used for the aggregation.
Below is a sample query you can use to generate the aggregation you expect:
POST index1/_search
{
"size": 0,
"aggs": {
"tone_s": {
"terms": {
"script": {
"source": "def value='['; for(int i=0;i<doc['ToneCurvePV2012'].length;i++){value= value + doc['ToneCurvePV2012'][i] + ',';} value+= ']'; value = value.replace(',]', ']'); return value;"
}
}
}
}
}
Output:
{
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"tone_s" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "[1,2,3]",
"doc_count" : 2
},
{
"key" : "[1,5,6]",
"doc_count" : 1
},
{
"key" : "[1,7,8]",
"doc_count" : 1
}
]
}
}
}
PS: The key will come back as a string and not as an array in the aggregation response.
I'm having trouble writing the equivalent of this in ES:
SELECT COUNT(*)
FROM
(
SELECT current_place
FROM `request`
WHERE user_id = '3'
ORDER BY asked_at DESC
LIMIT 10
) sr1
WHERE current_place = '4'
The goal is to take the 10 most recent records for a user (asked_at is a timestamp field) and count how many of them have current_place = '4'.
In Elasticsearch I tried this, without ordering, because I didn't even manage to restrict the query to 10 elements:
GET /index/type/_search
{
"size": 10,
"query": {
"bool": {
"filter": [
{
"term": {
"user_id": 3
}
},
{
"term": {
"current_place": 4
}
}
]
}
}
}
Which gives me:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 54,
"max_score" : 0.0,
"hits" : [
... truncated, 10 records ...
]
}
}
How can I perform a count on the ordered and filtered data?
EDIT:
Here is a sample of the data (current_place | asked_at):
1 | 2019-03-13 18:28:17
1 | 2019-01-15 16:48:30
1 | 2019-01-15 16:25:32
1 | 2019-01-15 16:19:36
1 | 2019-01-15 15:43:33
1 | 2019-01-15 15:42:05
4 | 2018-11-22 14:14:03
1 | 2018-09-11 11:36:05
4 | 2018-09-11 11:00:49
1 | 2018-08-31 11:19:17 -> 10th line
1 | 2018-08-31 11:19:17
1 | 2018-08-31 11:09:32
1 | 2018-08-27 10:19:04
4 | 2018-08-23 11:56:27
The SQL query returns 2.
This is not possible with Elasticsearch if you have more than one shard for that particular index.
There is a feature called terminate_after, available with the request body search, which takes into account only the first n documents from each shard. Yes, it works at the shard level.
Using that, say my index has 5 shards: I thought I could use the value 2 in the updated query below to check whether only 10 documents (5 shards * 2 documents) are retrieved, but it doesn't work that way, as one shard might return only 1 document while the rest return 2, so I ultimately ended up applying the aggregation to 9 documents.
Also, with that few documents from each shard, your sorted result itself may not contain the correct top 10 documents.
Aggregation Query
POST <your_index_name>/_search
{
"size":0,
"terminate_after":2,
"query":{
"bool":{
"filter":[
{
"term":{
"user_id":101
}
}
]
}
},
"sort":[
{
"asked_at":{
"order":"desc"
}
}
],
"aggs":{
"filter_current_place":{
"filter":{
"term":{
"current_place":4
}
},
"aggs":{
"requiredCount":{
"value_count":{
"field":"current_place"
}
}
}
}
}
}
Below is how my response appeared:
Response
{
"took" : 2,
"timed_out" : false,
"terminated_early" : true,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 9,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"filter_current_place" : {
"doc_count" : 2,
"requiredCount" : {
"value" : 2
}
}
}
}
Notice that there are only 9 hits despite asking for 2 documents to be considered from each shard. The count happens to be correct because, as mentioned in the question, the 9th document has current_place: 4. But what if it had been in 10th position?
This is probably not reliable, and it is pretty clearly something that would need to be done at the client side or in the service layer.
If that is the case, then you would only need the query below and would handle the logic of aggregating over the top 10 documents on the client side/service layer.
Sorted Query
POST <your_index_name>/_search
{
"size":10,
"query":{
"bool":{
"filter":[
{
"term":{
"user_id":101
}
}
]
}
},
"sort":[
{
"asked_at":{
"order":"desc"
}
}
]
}
Note: The only possible way to achieve this via Elasticsearch using the first query I've mentioned above is if your index has only a single shard and you make use of "terminate_after": 10.
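For reference, a minimal sketch of creating such a single-shard index (the index name is just a placeholder); you would then run the first query above against it with "terminate_after": 10:
PUT single_shard_index
{
  "settings": {
    "number_of_shards": 1
  }
}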
Although technically this doesn't fully solve it, I hope this helps!
Suppose I have the following data:
{"field": [{"type": "A"}, {"type": "B"}]},
{"field": [{"type": "B"}]}
How do you construct a query in Elasticsearch to get the count of all records with a specific field type value, given field is an array?
You can use the Count API with the following query.
Query:
GET /index/index_type/_count
{
"query" : {
"term" : { "field.type" : "A" }
}
}
Response:
{
"count" : <number of docs>,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
}
}
In my ES I have a schema type like this:
{
"index_v1":{
"mappings":{
"fuas":{
"properties":{
"comment":{
"type":"string"
},
"matter":{
"type":"string"
},
"metainfos":{
"properties":{
"department":{
"type":"string"
},
"processos":{
"type":"string"
}
}
}
}
}
}
}
}
In short, the fuas type has two properties, comment and matter, and an inner (not nested) object metainfos with several properties, department and processos.
I'd like to know how many of the metainfos fields are populated, together with their number of occurrences.
Imagine a document doc1 with metainfos: {department: "d1"} and a doc2 with metainfos: {department: "d2", processos: "p1"}.
Then I'd like to get: {department: 2, processos: 1}.
EDIT
Since metainfos is an inner object and ES is schemaless, each document's metainfos object may or may not have any given field populated.
So, doc1's metainfos {field1: 1, field3: 3} and doc2's metainfos {field2: 1, field4: 5} and doc3's metainfos {field1:2, field4: 2, field5: 1}.
I'd like to get: {field1: 2, field2: 1, field3: 1, field4: 2, field5: 1}. I think the main issue is how to ask for fields I don't know exist.
I've tested with two documents:
{
"hits":{
"total":2,
"max_score":1.0,
"hits":[
{
"_source":{
"matter":"FUA2",
"comment":null,
"metainfos":[
{
"department":"d1"
}
]
}
},
{
"_source":{
"matter":"FUA1",
"comment":"vcvcvc",
"metainfos":[
{
"department":"d1"
},
{
"processos":"p1"
}
]
}
}
]
}
}
I've tried this with the following command:
curl -XGET 'http://localhost:9201/living_team/fuas/_search?pretty' -d '
{
"size": 0,
"aggregations" : {
"followUpActivity.metainfo.department" : {
"terms" : {
"field" : "metainfos.*"
}
}
}
}
'
The results have been:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"followUpActivity.metainfo.department" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
}
}
You can use the value_count aggregation for this:
{
"size": 0,
"aggs" : {
"dept" : {
"value_count" : { "field" : "metainfos.department" }
},
"proc" : {
"value_count" : { "field" : "metainfos.processos" }
}
}
}
You need to use nested fields as otherwise your inner fields are not seen "together" in the metainfos object.
See here: ElasticSearch aggregation over inner object
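For illustration only (field names are taken from the question, and "string" matches the pre-5.x mapping shown there; a mapping change like this requires reindexing), a nested mapping for metainfos might look like:
PUT index_v1
{
  "mappings": {
    "fuas": {
      "properties": {
        "comment": { "type": "string" },
        "matter": { "type": "string" },
        "metainfos": {
          "type": "nested",
          "properties": {
            "department": { "type": "string" },
            "processos": { "type": "string" }
          }
        }
      }
    }
  }
}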
I'm trying to list all buckets on an aggregation, but it seems to be showing only the first 10.
My search:
curl -XPOST "http://localhost:9200/imoveis/_search?pretty=1" -d'
{
"size": 0,
"aggregations": {
"bairro_count": {
"terms": {
"field": "bairro.raw"
}
}
}
}'
Returns:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 16920,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"bairro_count" : {
"buckets" : [ {
"key" : "Barra da Tijuca",
"doc_count" : 5812
}, {
"key" : "Centro",
"doc_count" : 1757
}, {
"key" : "Recreio dos Bandeirantes",
"doc_count" : 1027
}, {
"key" : "Ipanema",
"doc_count" : 927
}, {
"key" : "Copacabana",
"doc_count" : 842
}, {
"key" : "Leblon",
"doc_count" : 833
}, {
"key" : "Botafogo",
"doc_count" : 594
}, {
"key" : "Campo Grande",
"doc_count" : 456
}, {
"key" : "Tijuca",
"doc_count" : 361
}, {
"key" : "Flamengo",
"doc_count" : 328
} ]
}
}
}
I have many more than 10 keys for this aggregation. In this example I'd have 145 keys, and I want the count for each of them. Is there some pagination on buckets? Can I get all of them?
I'm using Elasticsearch 1.1.0
The size param should be a parameter of the terms aggregation, for example:
curl -XPOST "http://localhost:9200/imoveis/_search?pretty=1" -d'
{
"size": 0,
"aggregations": {
"bairro_count": {
"terms": {
"field": "bairro.raw",
"size": 10000
}
}
}
}'
For ES version 2 and prior, you could use "size": 0 in the terms aggregation to return all buckets.
Setting size: 0 was deprecated from 2.x onwards, due to the memory pressure it puts on your cluster for high-cardinality fields. You can read more about it in the GitHub issue here.
It is recommended to explicitly set a reasonable value for size, a number between 1 and 2147483647.
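If you are unsure what a reasonable value is, one option (an illustrative sketch using the index and field from the question; note that cardinality is an approximate count) is to first check how many distinct values the field has and size the terms aggregation accordingly:
curl -XPOST "http://localhost:9200/imoveis/_search?pretty=1" -d'
{
  "size": 0,
  "aggregations": {
    "bairro_cardinality": {
      "cardinality": {
        "field": "bairro.raw"
      }
    }
  }
}'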
How to show all buckets?
{
"size": 0,
"aggs": {
"aggregation_name": {
"terms": {
"field": "your_field",
"size": 10000
}
}
}
}
Note
"size":10000 Get at most 10000 buckets. Default is 10.
"size":0 In result, "hits" contains 10 documents by default. We don't need them.
By default, the buckets are ordered by the doc_count in decreasing order.
Why do I get a "Fielddata is disabled on text fields by default" error?
Because fielddata is disabled on text fields by default. If you have not explicitly chosen a field type in the mapping, the field gets the default dynamic mapping for string fields, which is a text field with a .keyword sub-field.
So, instead of writing "field": "your_field" you need to have "field": "your_field.keyword".
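For example, the aggregation from above would then be written as (your_field remains a placeholder):
{
  "size": 0,
  "aggs": {
    "aggregation_name": {
      "terms": {
        "field": "your_field.keyword",
        "size": 10000
      }
    }
  }
}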
If you want to get all unique values without setting a magic number (size: 10000), then use COMPOSITE AGGREGATION (ES 6.5+).
From official documentation:
"If you want to retrieve all terms or all combinations of terms in a nested terms aggregation you should use the COMPOSITE AGGREGATION which allows to paginate over all possible terms rather than setting a size greater than the cardinality of the field in the terms aggregation. The terms aggregation is meant to return the top terms and does not allow pagination."
Implementation example in JavaScript:
const ITEMS_PER_PAGE = 1000;
const body = {
"size": 0, // Returning only aggregation results: https://www.elastic.co/guide/en/elasticsearch/reference/current/returning-only-agg-results.html
"aggs" : {
"langs": {
"composite" : {
"size": ITEMS_PER_PAGE,
"sources" : [
{ "language": { "terms" : { "field": "language" } } }
]
}
}
}
};
const uniqueLanguages = [];
while (true) {
const result = await es.search(body); // depending on the client version this may need to be es.search({ body }), with the payload under result.body
const currentUniqueLangs = result.aggregations.langs.buckets.map(bucket => bucket.key.language); // each composite bucket key is an object, e.g. { language: "en" }
uniqueLanguages.push(...currentUniqueLangs);
const after = result.aggregations.langs.after_key;
if (after) {
// continue paginating unique items
body.aggs.langs.composite.after = after;
} else {
break;
}
}
console.log(uniqueLanguages);
Increase the size (the 2nd size) to 10000 in your terms aggregation and you will get up to 10000 buckets. By default it is set to 10.
Also, if you want to see the search results, just set the 1st size to 1; you will see 1 document, since ES supports both searching and aggregating in the same request.
curl -XPOST "http://localhost:9200/imoveis/_search?pretty=1" -d'
{
"size": 1,
"aggregations": {
"bairro_count": {
"terms": {
"field": "bairro.raw",
"size": 10000
}
}
}
}'