Elasticsearch - group by day of week and hour

I need to get some data grouped by day of week and hour. For example:
curl -XGET http://localhost:9200/testing/hello/_search?pretty=true -d '
{
"size": 0,
"aggs": {
"articles_over_time" : {
"date_histogram" : {
"field" : "date",
"interval" : "hour",
"format": "E - k"
}
}
}
}
'
Gives me this:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2857,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"articles_over_time" : {
"buckets" : [ {
"key_as_string" : "Fri - 17",
"key" : 1391792400000,
"doc_count" : 6
},
...
{
"key_as_string" : "Wed - 22",
"key" : 1411596000000,
"doc_count" : 1
}, {
"key_as_string" : "Wed - 22",
"key" : 1411632000000,
"doc_count" : 1
} ]
}
}
}
Now I need to sum the doc counts for each value such as "Wed - 22". How can I do this?
Or is there another approach?

The same kind of problem has been solved in this thread.
Adapting the solution to your problem, we need to make a script to convert the date into the hour of day and day of week:
Date date = new Date(doc['date'].value) ;
java.text.SimpleDateFormat format = new java.text.SimpleDateFormat('EEE, HH');
format.format(date)
And use it in a query:
{
"aggs": {
"perWeekDay": {
"terms": {
"script": "Date date = new Date(doc['date'].value) ;java.text.SimpleDateFormat format = new java.text.SimpleDateFormat('EEE, HH');format.format(date)"
}
}
}
}
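To see what keys this script produces, the same conversion can be reproduced outside Elasticsearch. The following is a minimal Python sketch, not part of the original answer: the function name day_hour_key is hypothetical, UTC is assumed (the script itself uses the JVM default time zone), and the first two timestamps are bucket keys from the question, with a third one added exactly one week after the second.

```python
from collections import Counter
from datetime import datetime, timezone

# Fixed English abbreviations, so the result does not depend on the locale.
DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def day_hour_key(epoch_millis):
    """Return an 'EEE, HH'-style key (e.g. 'Wed, 22') for epoch milliseconds.

    Hypothetical helper; UTC is an assumption made for this sketch.
    """
    dt = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    return f"{DAY_NAMES[dt.weekday()]}, {dt.hour:02d}"


ONE_WEEK_MS = 7 * 24 * 3600 * 1000

timestamps = [
    1391792400000,                # bucket key from the question
    1411596000000,                # bucket key from the question
    1411596000000 + ONE_WEEK_MS,  # assumed extra document, one week later
]

# Documents from different weeks land under the same key.
counts = Counter(day_hour_key(ts) for ts in timestamps)
```

Because the key no longer contains the calendar date, documents from different weeks fall into the same group, which is what the terms aggregation relies on.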

You can try doing a terms aggregation on the "key_as_string" field from the aggregation results using a sub-aggregation.
Hope that helps.

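This happens because a date_histogram with an interval of 'hour' creates one bucket per calendar hour, while the format (E - k) only controls how each bucket key is rendered (day of week and hour of day). Buckets from different weeks can therefore share the same label, such as 'Wed - 22'.
Changing the interval to 'day' gives one bucket per calendar day instead, but the same weekday still repeats from week to week.
If you want a single bucket per day of week and hour, the grouping has to be done on a value derived from the date, not on the histogram's date format.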
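If changing the query is not an option, the buckets that come back can also be merged on the client by key_as_string. This is a minimal Python sketch under the assumption that the response has already been parsed into a dict; merge_buckets is a hypothetical helper, and only the three buckets shown in the question are included (the elided ones are left out).

```python
from collections import defaultdict


def merge_buckets(buckets):
    """Sum doc_count over all buckets that share the same key_as_string."""
    totals = defaultdict(int)
    for bucket in buckets:
        totals[bucket["key_as_string"]] += bucket["doc_count"]
    return dict(totals)


# Only the buckets visible in the question; the "..." part is not reproduced.
response = {
    "aggregations": {
        "articles_over_time": {
            "buckets": [
                {"key_as_string": "Fri - 17", "key": 1391792400000, "doc_count": 6},
                {"key_as_string": "Wed - 22", "key": 1411596000000, "doc_count": 1},
                {"key_as_string": "Wed - 22", "key": 1411632000000, "doc_count": 1},
            ]
        }
    }
}

merged = merge_buckets(response["aggregations"]["articles_over_time"]["buckets"])
```

This keeps the query unchanged but moves the summation to the client, so it only makes sense when the number of hourly buckets is small enough to transfer.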

Related

Elasticsearch Aggregation most common list of integers

I am looking for an Elasticsearch aggregation and mapping
that will return the most common list for a certain field.
For example, for these docs:
{"ToneCurvePV2012": [1,2,3]}
{"ToneCurvePV2012": [1,5,6]}
{"ToneCurvePV2012": [1,7,8]}
{"ToneCurvePV2012": [1,2,3]}
I want the aggregation result to be:
[1,2,3] (since it appears twice).
So far, every aggregation I have tried returns: 1
This is not possible with the default terms aggregation. You need to use a terms aggregation with a script. Please note that this might impact your cluster performance.
Here, I have used a script that builds a string from the array and uses it as the aggregation key. So if you have an array value like [1,2,3], it creates the string representation '[1,2,3]', and that key is used for the aggregation.
Below is a sample query you can use to produce the aggregation you expect:
POST index1/_search
{
"size": 0,
"aggs": {
"tone_s": {
"terms": {
"script": {
"source": "def value='['; for(int i=0;i<doc['ToneCurvePV2012'].length;i++){value= value + doc['ToneCurvePV2012'][i] + ',';} value+= ']'; value = value.replace(',]', ']'); return value;"
}
}
}
}
}
Output:
{
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"tone_s" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "[1,2,3]",
"doc_count" : 2
},
{
"key" : "[1,5,6]",
"doc_count" : 1
},
{
"key" : "[1,7,8]",
"doc_count" : 1
}
]
}
}
}
PS: the key is returned as a string, not as an array, in the aggregation response.
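The key-building logic of the script can be checked outside Elasticsearch. Below is a minimal Python sketch of the same idea, using the four documents from the question; list_key is a hypothetical helper that mirrors the string the script builds, and nothing here calls Elasticsearch.

```python
from collections import Counter


def list_key(values):
    """Build the same '[1,2,3]'-style string key that the script produces."""
    return "[" + ",".join(str(v) for v in values) + "]"


# The four documents from the question.
docs = [
    {"ToneCurvePV2012": [1, 2, 3]},
    {"ToneCurvePV2012": [1, 5, 6]},
    {"ToneCurvePV2012": [1, 7, 8]},
    {"ToneCurvePV2012": [1, 2, 3]},
]

# Count whole lists instead of individual elements.
counts = Counter(list_key(doc["ToneCurvePV2012"]) for doc in docs)
most_common_key, most_common_count = counts.most_common(1)[0]
```

Counting the joined string is what makes the whole list act as one term; counting the raw values would instead rank the single element 1 highest, as in the question.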

Elasticsearch max of field combined with a unique field

I have an index with two fields:
name: uuid
version: long
I only want to count the documents (on a very large index with 1 million+ entries) where the version is the highest for its name. For example, a query on an index with the following documents:
{name="a", version=1}
{name="a", version=2}
{name="a", version=3}
{name="b", version=1}
... would return:
count=2
Is this somehow possible? I can not find a solution for this particular problem.
You are effectively describing a count of distinct names, which you can do with a cardinality aggregation.
Request:
GET test1/_search
{
"aggs" : {
"distinct_count" : {
"cardinality" : {
"field" : "name.keyword"
}
}
},
"size": 0
}
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 4,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"distinct_count" : {
"value" : 2
}
}
}
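Why the two counts agree can be checked on the sample documents. The following Python sketch is only an illustration and assumes that each (name, version) pair occurs at most once; with that assumption, exactly one document per name carries the highest version.

```python
# The four documents from the question.
docs = [
    {"name": "a", "version": 1},
    {"name": "a", "version": 2},
    {"name": "a", "version": 3},
    {"name": "b", "version": 1},
]

# Highest version seen for each name.
latest = {}
for doc in docs:
    name = doc["name"]
    if name not in latest or doc["version"] > latest[name]:
        latest[name] = doc["version"]

# Documents whose version is the highest for their name.
count_latest = sum(1 for doc in docs if doc["version"] == latest[doc["name"]])

# Number of distinct names, which is what the cardinality aggregation estimates.
distinct_names = len({doc["name"] for doc in docs})
```

Keep in mind that the cardinality aggregation is approximate (it is based on HyperLogLog++), so on a large index the returned value can deviate slightly from the exact number of distinct names.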

SQL aggregation query corresponding in elasticsearch

I studied Elasticsearch aggregation queries but couldn't find out whether they support multiple aggregate functions. In other words, I want to know if Elasticsearch can generate the equivalent of this SQL aggregation query:
SELECT account_no, transaction_type, count(account_no), sum(amount), max(amount) FROM index_name GROUP BY account_no, transaction_type Having count(account_no) > 10
If yes, how?
Thank you.
There are two possible ways to do what you are looking for in ES and I've mentioned them both below.
I've also added sample mapping and sample documents for your reference.
Mapping:
PUT index_name
{
"mappings": {
"mydocs":{
"properties":{
"account_no":{
"type": "keyword"
},
"transaction_type":{
"type": "keyword"
},
"amount":{
"type":"double"
}
}
}
}
}
Sample Documents:
Note that I'm only creating a list of 4 transactions for 1 customer.
POST index_name/mydocs/1
{
"account_no": "1011",
"transaction_type":"credit",
"amount": 200
}
POST index_name/mydocs/2
{
"account_no": "1011",
"transaction_type":"credit",
"amount": 400
}
POST index_name/mydocs/3
{
"account_no": "1011",
"transaction_type":"cheque",
"amount": 100
}
POST index_name/mydocs/4
{
"account_no": "1011",
"transaction_type":"cheque",
"amount": 100
}
There are two ways to get what you are looking for:
Solution 1: Using Elasticsearch Query DSL
Aggregation Query:
For the aggregation query, I've used the following aggregations:
Terms Aggregation
Sum Aggregation (Metric Aggregation)
Max Aggregation (Metric Aggregation)
Below is a summarised outline of the query, so that it is clear which aggregations are siblings and which are parents:
- Terms Aggregation (For Every Account)
  - Terms Aggregation (For Every Transaction_type)
    - Sum Amount
    - Max Amount
Below is the actual query:
POST index_name/_search
{
"size": 0,
"aggs": {
"account_no_agg": {
"terms": {
"field": "account_no"
},
"aggs": {
"transaction_type_agg": {
"terms": {
"field": "transaction_type",
"min_doc_count": 2
},
"aggs": {
"sum_amount": {
"sum": {
"field": "amount"
}
},
"max_amount":{
"max": {
"field": "amount"
}
}
}
}
}
}
}
}
The important part is min_doc_count, which plays the role of the HAVING clause. The original SQL uses HAVING count(account_no) > 10; since the sample data has only 4 documents, my query uses min_doc_count: 2, which keeps only the groups with count(account_no) >= 2.
Query Response
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 4,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"account_no_agg" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1011", <---- account_no
"doc_count" : 4, <---- count(account_no)
"transaction_type_agg" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "cheque", <---- transaction_type
"doc_count" : 2,
"sum_amount" : { <---- sum(amount)
"value" : 200.0
},
"max_amount" : { <---- max(amount)
"value" : 100.0
}
},
{
"key" : "credit", <---- another transaction_type
"doc_count" : 2,
"sum_amount" : { <---- sum(amount)
"value" : 600.0
},
"max_amount" : { <---- max(amount)
"value" : 400.0
}
}
]
}
}
]
}
}
}
Look at the above result carefully; I've added comments where needed to show which part of the SQL query each value corresponds to.
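The grouping performed by the nested terms aggregations can be mirrored in a few lines of Python, which may help when checking the numbers. This is a sketch over the four sample documents only; group_with_having is a hypothetical helper, and its min_doc_count parameter is meant to behave like the aggregation option of the same name (a group is kept when its count is at least that value).

```python
from collections import defaultdict

# The four sample documents indexed above.
docs = [
    {"account_no": "1011", "transaction_type": "credit", "amount": 200},
    {"account_no": "1011", "transaction_type": "credit", "amount": 400},
    {"account_no": "1011", "transaction_type": "cheque", "amount": 100},
    {"account_no": "1011", "transaction_type": "cheque", "amount": 100},
]


def group_with_having(docs, min_doc_count):
    """GROUP BY (account_no, transaction_type) with HAVING count >= min_doc_count."""
    groups = defaultdict(list)
    for doc in docs:
        groups[(doc["account_no"], doc["transaction_type"])].append(doc["amount"])

    rows = []
    for (account_no, transaction_type), amounts in sorted(groups.items()):
        if len(amounts) >= min_doc_count:
            rows.append({
                "account_no": account_no,
                "transaction_type": transaction_type,
                "count": len(amounts),
                "sum": sum(amounts),
                "max": max(amounts),
            })
    return rows


rows = group_with_having(docs, min_doc_count=2)
```

With min_doc_count=2 both groups survive, matching the two inner buckets in the response; raising it to 3 would return no rows for this data.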
Solution 2: Using Elasticsearch SQL (X-Pack solution)
If you are using the X-Pack SQL access feature of Elasticsearch, you can run the SELECT query almost as-is against the mapping and documents above (here with HAVING count(account_no) > 1 to fit the sample data):
Elasticsearch SQL:
POST /_xpack/sql?format=txt
{
"query": "SELECT account_no, transaction_type, sum(amount), max(amount), count(account_no) FROM index_name GROUP BY account_no, transaction_type HAVING count(account_no) > 1"
}
Elasticsearch SQL Result:
account_no |transaction_type| SUM(amount) | MAX(amount) |COUNT(account_no)
---------------+----------------+---------------+---------------+-----------------
1011 |cheque |200.0 |100.0 |2
1011 |credit |600.0 |400.0 |2
Note that I've tested the query in ES 6.5.4.
Hope this helps!

Elasticsearch range with aggs

I want the average rating of every user document, but my query is not working as I expect. Please check the code given below.
curl -XGET 'localhost:9200/mentorz/users/_search?pretty' -H 'Content-Type: application/json' -d'
{"aggs" : {"avg_rating" : {"range" : {"field" : "rating","ranges" : [{ "from" : 3, "to" : 19 }]}}}}';
{ "_index" : "mentorz", "_type" : "users", "_id" : "555", "_source" : { "name" : "neeru", "user_id" : 555,"email_id" : "abc#gmail.com","followers" : 0,
"following" : 0, "mentors" : 0, "mentees" : 0, "basic_info" : "api test info",
"birth_date" : 1448451985397,"charge_price" : 0,"org" : "cz","located_in" : "noida", "position" : "sw developer", "exp" : 7, "video_bio_lres" : "test bio lres url normal signup","video_bio_hres" : "test bio hres url normal signup", "rating" : [ 5 ,4], "expertises" : [ 1, 4, 61, 62, 63 ] } }
This is my user document. I want to filter only those users whose average rating is in the range from 3 to 5.
Updated answer
I've written a query that uses a script; hopefully the query below works for you.
GET mentorz/users/_search
{
"size": 0,
"aggs": {
"term": {
"terms": {
"field": "user.keyword",
"size": 100
},
"aggs": {
"NAME": {
"terms": {
"field": "rating",
"size": 10,
"script": {
"inline": "float var=0;float count=0;for(int i = 0; i < params['_source']['rating'].size();i++){var=var+params['_source']['rating'][i];count++;} float avg = var/count; if(avg>=4 && avg<=5) {avg}else{null}"
}
}
}
}
}
}
}
You can change the desired rating range by editing the if condition "if(avg>=4 && avg<=5)"; for the range of 3 to 5 asked about in the question, use "avg>=3 && avg<=5".
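The averaging and range check that the script performs per document can be sketched in plain Python. This is only an illustration of the logic, not a replacement for the query: the helper names are hypothetical, the first user reuses the rating values from the question, and the other two users are made up to show the excluded cases.

```python
def average_rating(ratings):
    """Return the mean of a list of ratings, or None when the list is empty."""
    if not ratings:
        return None
    return sum(ratings) / len(ratings)


def users_in_range(users, low, high):
    """Keep (user_id, average) pairs whose average lies in [low, high]."""
    selected = []
    for user in users:
        avg = average_rating(user.get("rating", []))
        if avg is not None and low <= avg <= high:
            selected.append((user["user_id"], avg))
    return selected


users = [
    {"user_id": 555, "rating": [5, 4]},  # rating values from the question
    {"user_id": 556, "rating": [1, 2]},  # hypothetical user below the range
    {"user_id": 557, "rating": []},      # hypothetical user without ratings
]

selected = users_in_range(users, low=3, high=5)
```

Users without any rating are skipped explicitly here, which avoids a division by zero; the script above would need a similar guard if the rating array can be empty.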

Count the number of duplicates in elasticsearch

I have an application inserting a numbered sequence of logs into elasticsearch.
Under certain conditions, after stopping my application, I find that in elasticsearch there are more logs than I have actually generated.
This simple aggregation helped me find out that a few duplicates are present:
curl /logstash-*/_search?pretty -d '{
size: 0,
aggs: {
msgnum_terms: {
terms: {
field: "msgnum.raw",
min_doc_count: 2,
size: 0
}
}
}
}'
msgnum is the field containing the numeric sequence. Normally it should be unique and the resulting doc_counts never exceed 1. Instead I get something like:
{
"took" : 33,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 100683,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"msgnum_terms" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "4097",
"doc_count" : 2
}, {
"key" : "4099",
"doc_count" : 2
...
...
...
}, {
"key" : "5704",
"doc_count" : 2
} ]
}
}
}
How can I count the exact number of duplicates in order to make sure that they are the only cause of mismatch between number of generated log lines and number of hits in elasticsearch?
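One possible way to reason about this, offered as a sketch and not taken from the thread: every bucket returned with min_doc_count: 2 holds n copies of the same msgnum, so it contributes n - 1 surplus documents. Summing that over all buckets gives the number of extra hits. The Python below assumes the full bucket list is available on the client; count_surplus is a hypothetical helper, and only the three buckets visible in the output above are used, so the result is illustrative.

```python
def count_surplus(buckets):
    """Return how many documents are extra copies of an existing msgnum."""
    return sum(
        bucket["doc_count"] - 1
        for bucket in buckets
        if bucket["doc_count"] > 1
    )


# Only the buckets visible in the question; the elided ones are not reproduced.
buckets = [
    {"key": "4097", "doc_count": 2},
    {"key": "4099", "doc_count": 2},
    {"key": "5704", "doc_count": 2},
]

surplus = count_surplus(buckets)
```

If the total number of hits minus the surplus computed over all buckets equals the number of generated log lines, duplicates alone explain the mismatch.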
