ElasticSearch: Avg aggregation for datetime format - elasticsearch

I am stuck on an Elasticsearch query that I need to run from Python.
I have data such as:
{
"_index": "user_log",
"_type": "logs",
"_id": "gdUJpXIBAoADuwvHTK29",
"_score": 1,
"_source": {
"user_name": "prathameshsalap#gmail.com",
"working_hours": "2019-10-21 09:00:01"
}
}
{
"_index": "user_log",
"_type": "logs",
"_id": "gtUJpXIBAoADuwvHTK29",
"_version": 1,
"_score": 0,
"_source": {
"user_name": "vaishusawant143#gmail.com",
"working_hours": "2019-10-21 09:15:01"
}
}
{
"_index": "user_log",
"_type": "logs",
"_id": "g9UJpXIBAoADuwvHTK29",
"_version": 1,
"_score": 0,
"_source": {
"user_name": "prathameshsalap#gmail.com",
"working_hours": "2019-10-22 07:50:00"
}
}
{
"_index": "user_log",
"_type": "logs",
"_id": "g8UJpXIBAoADuwvHTK29",
"_version": 1,
"_score": 0,
"_source": {
"user_name": "vaishusawant143#gmail.com",
"working_hours": "2019-10-22 04:15:01"
}
}
Here, each user has a working-hours entry for two different dates (the 21st and the 22nd). I want to compute the average of each user's working hours.
{
"size": 0,
"query" : {"match_all": {}},
"aggs": {
"users": {
"terms": {
"field": "user_name"
},
"aggs": {
"avg_hours": {
"avg": {
"field": "working_hours"
}
}
}
}
}
}
This query is not working. How can I find the average working hours for each user across all dates? I also want to run this query with the Python Elasticsearch client.
Update
When I use the ingest pipeline as @Val suggested, I get an error:
{
"error" : {
"root_cause" : [
{
"type" : "script_exception",
"reason" : "compile error",
"processor_type" : "script",
"script_stack" : [
"\n def workDate = /\\s+/.split(ctx.working_h ...",
" ^---- HERE"
],
"script" : "\n def workDate = /\\s+/.split(ctx.working_hours);\n def workHours = /:/.split(workDate[1]);\n ctx.working_minutes = (Integer.parseInt(workHours[0]) * 60) + Integer.parseInt(workHours[1]);\n ",
"lang" : "painless",
"position" : {
"offset" : 24,
"start" : 0,
"end" : 49
}
}
.....
How can I solve it?

The problem is that your working_hours field is a point in time and does not denote a duration.
For this use case, it's best to store the working day and working hours in two separate fields and store the working hours in minutes.
So instead of having documents like this:
{
"user_name": "prathameshsalap#gmail.com",
"working_hours": "2019-10-21 09:00:01"
}
Create documents like this:
{
"user_name": "prathameshsalap#gmail.com",
"working_day": "2019-10-21",
"working_hours": "09:00:01",
"working_minutes": 540
}
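If the documents are built in Python before indexing, the split described above could look roughly like the sketch below. The function name `split_working_hours` is hypothetical, and the code assumes every value follows the `YYYY-MM-DD HH:MM:SS` layout shown in the question.

```python
from datetime import datetime


def split_working_hours(doc):
    """Return a copy of doc with working_day, working_hours and working_minutes."""
    # Assumption: the original value always looks like "2019-10-21 09:00:01".
    stamp = datetime.strptime(doc["working_hours"], "%Y-%m-%d %H:%M:%S")
    return {
        **doc,
        "working_day": stamp.strftime("%Y-%m-%d"),
        "working_hours": stamp.strftime("%H:%M:%S"),
        # Seconds are dropped, matching the example document above (09:00:01 -> 540).
        "working_minutes": stamp.hour * 60 + stamp.minute,
    }


doc = {
    "user_name": "prathameshsalap#gmail.com",
    "working_hours": "2019-10-21 09:00:01",
}
print(split_working_hours(doc))
```

Storing minutes as a plain integer is what makes the avg aggregation meaningful, since it then averages durations rather than timestamps.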
Then you can use your query on the working_minutes field:
{
"size": 0,
"query" : {"match_all": {}},
"aggs": {
"users": {
"terms": {
"field": "user_name.keyword",
"order": {
"avg_hours": "desc"
}
},
"aggs": {
"avg_hours": {
"avg": {
"field": "working_minutes"
}
}
}
}
}
}
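Since the question also asks about running this from Python, here is a minimal sketch of building the same request body and reading the per-user averages out of the response. The helper names `build_avg_query` and `averages_by_user` are hypothetical, the cluster URL in the comment is an assumption, and the response fragment is hand-written from the four documents in the question (540 and 470 minutes for one user, 555 and 255 for the other) rather than captured from a real cluster.

```python
def build_avg_query(field="working_minutes"):
    """Build the terms + avg aggregation body shown above."""
    return {
        "size": 0,
        "query": {"match_all": {}},
        "aggs": {
            "users": {
                "terms": {
                    "field": "user_name.keyword",
                    "order": {"avg_hours": "desc"},
                },
                "aggs": {"avg_hours": {"avg": {"field": field}}},
            }
        },
    }


def averages_by_user(response):
    """Map each user_name bucket key to its average value."""
    buckets = response["aggregations"]["users"]["buckets"]
    return {bucket["key"]: bucket["avg_hours"]["value"] for bucket in buckets}


# With a live cluster (not run here; 7.x-style client call, URL is an assumption):
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://localhost:9200")
#   response = es.search(index="user_log", body=build_avg_query())

# Hand-written fragment in the shape a terms/avg response takes.
sample_response = {
    "aggregations": {
        "users": {
            "buckets": [
                {"key": "prathameshsalap#gmail.com", "doc_count": 2,
                 "avg_hours": {"value": 505.0}},
                {"key": "vaishusawant143#gmail.com", "doc_count": 2,
                 "avg_hours": {"value": 405.0}},
            ]
        }
    }
}
print(averages_by_user(sample_response))
```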
If it is not convenient to compute the working_minutes field in your client code, you can achieve the same thing using an ingest pipeline. Let's define the pipeline first:
PUT _ingest/pipeline/working-hours
{
"processors": [
{
"dissect": {
"field": "working_hours",
"pattern": "%{?date} %{tmp_hours}:%{tmp_minutes}:%{?seconds}"
}
},
{
"convert": {
"field": "tmp_hours",
"type": "integer"
}
},
{
"convert": {
"field": "tmp_minutes",
"type": "integer"
}
},
{
"script": {
"source": """
ctx.working_minutes = (ctx.tmp_hours * 60) + ctx.tmp_minutes;
"""
}
},
{
"remove": {
"field": [
"tmp_hours",
"tmp_minutes"
]
}
}
]
}
Then you need to update your Python client code to use the new pipeline, which will create the working_minutes field for you:
helpers.bulk(es, reader, index='user_log', doc_type='logs', pipeline='working-hours')
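The `reader` passed to `helpers.bulk` is not shown in the thread. As a sketch only, it could be any iterable of document dicts, for example a generator like the hypothetical `generate_actions` below. Only the raw fields are sent, and the pipeline is left to derive `working_minutes`.

```python
def generate_actions(rows):
    """Yield one document per row; rows is any iterable of dicts."""
    for row in rows:
        # working_minutes is intentionally absent: the ingest pipeline adds it.
        yield {
            "user_name": row["user_name"],
            "working_hours": row["working_hours"],
        }


rows = [
    {"user_name": "prathameshsalap#gmail.com", "working_hours": "2019-10-21 09:00:01"},
    {"user_name": "vaishusawant143#gmail.com", "working_hours": "2019-10-21 09:15:01"},
]

# With a live cluster (not run here), the generator would replace `reader`:
#   helpers.bulk(es, generate_actions(rows), index='user_log',
#                doc_type='logs', pipeline='working-hours')
for action in generate_actions(rows):
    print(action)
```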

Related

How to get sum of different fields / array values in elasticsearch?

Using Elasticsearch 7.9.0
My document looks like this
{
"student": {
"marks": [
{
"sub": 80
},
{
"sub": 90
},
{
"sub": 100
}
]
}
}
I need one more field, total_marks, in the response of the GET API,
something like this:
{
"hits": [
{
"_index": "abc",
"_type": "_doc",
"_id": "blabla",
"_score": null,
"_source": {
"student": {
"marks": [
{
"sub": 80
},
{
"sub": 90
},
{
"sub": 100
}
]
}
},
"total_marks": 270
}
]
}
I tried using script_fields
My query is
GET sample/_search
{
"query": {
"match_all": {}
},
"script_fields": {
"total_marks": {
"script": {
"source": """double sum = 0.0;
for( item in params._source.student.marks)
{ sum = sum + item.sub }
return sum;"""
}
}
}
}
I got response as
{
"hits": [
{
"_index": "abc",
"_type": "_doc",
"_id": "blabla",
"_score": null,
"_source": {
"student": {
"marks": [
{
"sub": 80
},
{
"sub": 90
},
{
"sub": 100
}
]
}
},
"fields": {
"total_marks": [
270
]
}
}
]
}
Is there any way to get the response in the expected format?
Any better or more optimal solution would help a lot.
Thank you.
A terms aggregation combined with a sum aggregation can be used to find the total marks per group:
{
"aggs": {
"students": {
"terms": {
"field": "student.id.keyword",
"size": 10
},
"aggs": {
"total_marks": {
"sum": {
"field": "student.marks.sub"
}
}
}
}
}
}
Result
"aggregations" : {
"students" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1",
"doc_count" : 1,
"total_marks" : {
"value" : 270.0
}
}
]
}
}
This will be faster than a script, but pagination is easier with a query than with an aggregation, so choose accordingly.
The best option may be to calculate the total at index time, if those fields do not change frequently.
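Calculating the total at index time can be as small as the sketch below, run in the client before the document is sent. The function name `with_total_marks` is hypothetical, and placing `total_marks` at the top level of the document is an assumption; it could equally sit under `student`.

```python
def with_total_marks(doc):
    """Return a copy of doc with a precomputed total_marks field."""
    marks = doc["student"]["marks"]
    total = sum(item["sub"] for item in marks)
    return {**doc, "total_marks": total}


doc = {"student": {"marks": [{"sub": 80}, {"sub": 90}, {"sub": 100}]}}
print(with_total_marks(doc)["total_marks"])  # 270
```

A stored total comes back inside `_source` like any other field, so no script or aggregation is needed at search time. The trade-off is that it has to be recomputed whenever the marks change.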

Count API: count query field A with distinct field B value

For instance, given this result for a search, reduced to a size of 3 hits for brevity:
{
"hits": {
"total": {
"value": 51812937,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "desc-imunizacao",
"_type": "_doc",
"_id": "7d0ac34a-1d4f-435a-9e5f-6dc2d77bb251-i0b0",
"_score": 1.0,
"_source": {
"vacina_descricao_dose": "    2ª Dose",
"estabelecimento_uf": "BA",
"document_id": "7d0ac34a-1d4f-435a-9e5f-6dc2d77bb251-i0b0"
}
},
{
"_index": "desc-imunizacao",
"_type": "_doc",
"_id": "2dc55c6a-5ac1-4550-8990-5ca611808e8a-i0b0",
"_score": 1.0,
"_source": {
"vacina_descricao_dose": "    1ª Dose",
"estabelecimento_uf": "SE",
"document_id": "2dc55c6a-5ac1-4550-8990-5ca611808e8a-i0b0"
}
},
{
"_index": "desc-imunizacao",
"_type": "_doc",
"_id": "d7e9b381-2873-4d0a-8b2d-5fa5034b7a80-i0b0",
"_score": 1.0,
"_source": {
"vacina_descricao_dose": "    1ª Dose",
"estabelecimento_uf": "SE",
"document_id": "d7e9b381-2873-4d0a-8b2d-5fa5034b7a80-i0b0"
}
}
]
}
}
If I wanted to query for "estabelecimento_uf": "SE" and keep only one result for duplicates of "document_id", I would issue:
{
"_source": ["document_id", "estabelecimento_uf", "vacina_descricao_dose"],
"query": {
"match": {
"estabelecimento_uf": {
"query": "SE"
}
}
},
"collapse": {
"field": "document_id",
"inner_hits": {
"name": "latest",
"size": 1
}
}
}
Is there a way to achieve this with Elasticsearch's Count API? Meaning: count query for field A (estabelecimento_uf) and count for unique values of field B (document_id), knowing that document_id has duplicates of course.
This is a public API: https://imunizacao-es.saude.gov.br/_search
This is the authentication:
User: imunizacao_public
Pass: qlto5t&7r_#+#Tlstigi
You can combine a filter aggregation with a cardinality aggregation to get a count of unique document IDs that match a filter:
{
"size": 0,
"aggs": {
"filter_agg": {
"filter": {
"term": {
"estabelecimento_uf.keyword": "SE"
}
},
"aggs": {
"count_docid": {
"cardinality": {
"field": "document_id.keyword"
}
}
}
}
}
}
As far as I know, you cannot get the count of distinct field values using the Count API. You can either use the field collapsing feature (as done in the question) or use a cardinality aggregation.
Here is a working example with index data, search queries and search results:
{
"vacina_descricao_dose": " 2ª Dose",
"estabelecimento_uf": "BA",
"document_id": "7d0ac34a-1d4f-435a-9e5f-6dc2d77bb251-i0b0"
}
{
"vacina_descricao_dose": " 1ª Dose",
"estabelecimento_uf": "SE",
"document_id": "2dc55c6a-5ac1-4550-8990-5ca611808e8a-i0b0"
}
{
"vacina_descricao_dose": " 1ª Dose",
"estabelecimento_uf": "SE",
"document_id": "d7e9b381-2873-4d0a-8b2d-5fa5034b7a80-i0b0"
}
{
"vacina_descricao_dose": " 1ª Dose",
"estabelecimento_uf": "SE",
"document_id": "d7e9b381-2873-4d0a-8b2d-5fa5034b7a80-i0b0"
}
Search Query 1:
{
"size": 0,
"query": {
"match": {
"estabelecimento_uf": "SE"
}
},
"aggs": {
"count_doc_id": {
"cardinality": {
"field": "document_id.keyword"
}
}
}
}
Search Result:
"aggregations": {
"count_doc_id": {
"value": 2 // note this
}
}
Search Query 2:
{
"size": 0,
"aggs": {
"filter_agg": {
"filter": {
"term": {
"estabelecimento_uf.keyword": "SE"
}
},
"aggs": {
"count_docid": {
"cardinality": {
"field": "document_id.keyword"
}
}
}
}
}
}
Search Result:
"aggregations": {
"filter_agg": {
"doc_count": 3,
"count_docid": {
"value": 2 // note this
}
}
}
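To make the two numbers in that result concrete, the same filter-then-count-distinct logic can be reproduced in plain Python over the four sample documents. This is only an in-memory illustration with a hypothetical helper name. The real cardinality aggregation returns an approximate count on large datasets, whereas this check is exact.

```python
# The four sample documents from the example above (only the fields that matter).
docs = [
    {"estabelecimento_uf": "BA", "document_id": "7d0ac34a-1d4f-435a-9e5f-6dc2d77bb251-i0b0"},
    {"estabelecimento_uf": "SE", "document_id": "2dc55c6a-5ac1-4550-8990-5ca611808e8a-i0b0"},
    {"estabelecimento_uf": "SE", "document_id": "d7e9b381-2873-4d0a-8b2d-5fa5034b7a80-i0b0"},
    {"estabelecimento_uf": "SE", "document_id": "d7e9b381-2873-4d0a-8b2d-5fa5034b7a80-i0b0"},
]


def filtered_distinct_count(documents, uf):
    """Return (matching documents, distinct document_id values) for one state."""
    matching = [d for d in documents if d["estabelecimento_uf"] == uf]
    distinct_ids = {d["document_id"] for d in matching}
    return len(matching), len(distinct_ids)


doc_count, unique_ids = filtered_distinct_count(docs, "SE")
print(doc_count, unique_ids)  # 3 2
```

The first number corresponds to `doc_count` of the filter aggregation and the second to the `value` of the cardinality aggregation.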

elasticsearch - get intermediate scores within 'function_score'

Here's my index
POST /blogs/1
{
"name" : "learn java",
"popularity" : 100
}
POST /blogs/2
{
"name" : "learn elasticsearch",
"popularity" : 10
}
My search query:
GET /blogs/_search
{
"query": {
"function_score": {
"query": {
"match": {
"name": "learn"
}
},
"script_score": {
"script": {
"source": "_score*(1+Math.log(1+doc['popularity'].value))"
}
}
}
}
}
which returns:
[
{
"_index": "blogs",
"_type": "1",
"_id": "AW5fxnperVbDy5wjSDBC",
"_score": 0.58024323,
"_source": {
"name": "learn elastic search",
"popularity": 100
}
},
{
"_index": "blogs",
"_type": "1",
"_id": "AW5fxqmL8cCMCxtBYOyC",
"_score": 0.43638366,
"_source": {
"name": "learn java",
"popularity": 10
}
}
]
Problem: I need to return an extra field in the results that gives me the raw score (just TF/IDF, which doesn't take popularity into account).
Things I have explored: script_fields (which doesn't give access to _score at fetch time).
The problem is in the way you are querying: function_score overwrites the _score variable. If you use sort instead, _score isn't changed and can be retrieved within the same query.
You can try querying this way:
{
"query": {
"match": {
"name": "learn"
}
},
"sort": [
{
"_script": {
"type": "number",
"script": {
"lang": "painless",
"source": "_score*(1+Math.log(1+doc['popularity'].value))"
},
"order": "desc"
}
},
"_score"
]
}
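The idea behind that sort clause can be sketched in plain Python: the popularity-adjusted value is used only as a sort key, while the raw relevance score is carried along unchanged as the second key. The raw scores of 0.25 below are made up for illustration, and the helper names are hypothetical. `math.log` is the natural logarithm, the same as `Math.log` in the script.

```python
import math


def popularity_sort_key(raw_score, popularity):
    """Same formula as the sort script: _score * (1 + ln(1 + popularity))."""
    return raw_score * (1 + math.log(1 + popularity))


def rank(hits):
    """Order hits by the boosted key, then by raw score, keeping _score untouched."""
    ranked = []
    for hit in hits:
        boosted = popularity_sort_key(hit["_score"], hit["popularity"])
        ranked.append({**hit, "sort": [boosted, hit["_score"]]})
    return sorted(ranked, key=lambda h: h["sort"], reverse=True)


# Raw scores here are invented; popularity values come from the index in the question.
hits = [
    {"name": "learn elasticsearch", "popularity": 10, "_score": 0.25},
    {"name": "learn java", "popularity": 100, "_score": 0.25},
]
for hit in rank(hits):
    print(hit["name"], hit["_score"], hit["sort"])
```

Each ranked entry carries both numbers, which is the point of the sort-based approach: the ordering reflects popularity, and the raw score is still available.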

ElasticSearch: Restrict Aggregations to Query String

I'm struggling to get my aggregations to be restricted to my query.
I, of course, tried:
{
"_source": ["burger.id", "burger.user_name", "burger.timestamp"],
"query": {
"query_string": {
"query": "burger.user_name:Bob"
}
},
"aggs": {
"burger_count": {
"cardinality": {
"field": "burger.id.keyword"
}
},
"min_dtm": {
"min": {
"field": "burger.timestamp"
}
},
"max_dtm": {
"max": {
"field": "burger.timestamp"
}
}
}
}
I am very set on using "query_string" for filtering, as we have a very nice front-end that allows users to easily build queries that are then turned into a "query_string."
Unfortunately, I have not found a way to combine query_string and aggregations so that the aggregations are only over the results of the query!
I've read through many SO posts about doing this, but they are all very old and outdated as they all suggest the deprecated way of Filtered Queries, but even that doesn't implement query_string.
UPDATE
Here are some example documents.
It appears that my results are not being filtered by my query. Is there a setting that I am missing?
I also changed all of the fields to be about burgers...
{
"_index": "burgers",
"_type": "burger",
"_id": "123",
"_score": 5.3759894,
"_source": {
"inference": {
"id": "1",
"user_name": "Jonathan",
"timestamp": 1541521691847
}
}
},
{
"_index": "burgers",
"_type": "burger",
"_id": "456",
"_score": 5.3759894,
"_source": {
"inference": {
"id": "2",
"user_name": "Ryan",
"timestamp": 1542416601153
}
}
},
{
"_index": "burgers",
"_type": "burger",
"_id": "789",
"_score": 5.3759894,
"_source": {
"inference": {
"id": "3",
"user_name": "Grant",
"timestamp": 1542237715511
}
}
}
Found my answer!
It appears that the issue was caused by querying the text field of burger.user_name instead of the keyword field: burger.user_name.keyword.
Changing my query_string to use the keyword for each text field solved my issue.
{
"_source": ["burger.id", "burger.user_name", "burger.timestamp"],
"query": {
"query_string": {
"query": "burger.user_name.keyword:Bob"
}
},
"aggs": {
"burger_count": {
"cardinality": {
"field": "burger.id.keyword"
}
},
"min_dtm": {
"min": {
"field": "burger.timestamp"
}
},
"max_dtm": {
"max": {
"field": "burger.timestamp"
}
}
}
}
This SO answer gives a great, brief explanation why.

How to get total score specific to each row

I need an Elasticsearch GET query that shows the total score of each student, adding up the marks they earned in all subjects. Instead, I am getting the total score of all students across every subject.
GET /testindex/testindex/_search
{
"query" : {
"filtered" : {
"query" : {
"match_all" : {}
}
}
},
"aggs": {
"total": {
"sum": {
"script" : "doc['physics'].value + doc['maths'].value + doc['chemistry'].value"
}
}
}
}
Output
{
....
"hits": [
{
"_index": "testindex",
"_type": "testindex",
"_id": "1",
"_score": 1,
"_source": {
"personalDetails": {
"name": "viswa",
"age": "33"
},
"marks": {
"physics": 18,
"maths": 5,
"chemistry": 34
},
"remarks": [
"hard working",
"intelligent"
]
}
},
{
"_index": "testindex",
"_type": "testindex",
"_id": "2",
"_score": 1,
"_source": {
"personalDetails": {
"name": "bob",
"age": "13"
},
"marks": {
"physics": 48,
"maths": 45,
"chemistry": 44
},
"remarks": [
"hard working",
"intelligent"
]
}
}
]
},
"aggregations": {
"total": {
"value": 194
}
}
}
Expected Output:
I would like to get the total marks earned across subjects by each student, rather than the total for all students.
What changes do I need to make to the query to achieve this?
{
"query": {
"filtered": {
"query": {
"match_all": {}
}
}
},
"aggs": {
"student": {
"terms": {
"field": "personalDetails.name",
"size": 10
},
"aggs": {
"total": {
"sum": {
"script": "doc['physics'].value + doc['maths'].value + doc['chemistry'].value"
}
}
}
}
}
}
But be careful: for the student terms aggregation you need a field that uniquely identifies each student (a personal ID, for example). It could be the _id itself, but you need to store it.
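To illustrate why the grouping key should be unique, here is a small client-side sketch that sums the marks per hit and keys the result by `_id` rather than by name, so two students who share a name would not be merged. The function name `totals_by_student` is hypothetical, and the hits are the two documents from the question's output.

```python
def totals_by_student(hits):
    """Sum all subject marks per hit, keyed by the unique document _id."""
    totals = {}
    for hit in hits:
        source = hit["_source"]
        totals[hit["_id"]] = {
            "name": source["personalDetails"]["name"],
            "total": sum(source["marks"].values()),
        }
    return totals


hits = [
    {"_id": "1", "_source": {
        "personalDetails": {"name": "viswa", "age": "33"},
        "marks": {"physics": 18, "maths": 5, "chemistry": 34}}},
    {"_id": "2", "_source": {
        "personalDetails": {"name": "bob", "age": "13"},
        "marks": {"physics": 48, "maths": 45, "chemistry": 44}}},
]
print(totals_by_student(hits))
```

The two per-student totals (57 and 137) add up to the 194 that the original, ungrouped sum aggregation returned.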
