How to get total score specific to each row - elasticsearch

I need an Elasticsearch GET query that shows the total score of each student by adding up the marks they earned in all subjects. Instead, I am getting the total score of all students across every subject.
GET /testindex/testindex/_search
{
"query" : {
"filtered" : {
"query" : {
"match_all" : {}
}
}
},
"aggs": {
"total": {
"sum": {
"script" : "doc['physics'].value + doc['maths'].value + doc['chemistry'].value"
}
}
}
}
Output
{
....
"hits": [
{
"_index": "testindex",
"_type": "testindex",
"_id": "1",
"_score": 1,
"_source": {
"personalDetails": {
"name": "viswa",
"age": "33"
},
"marks": {
"physics": 18,
"maths": 5,
"chemistry": 34
},
"remarks": [
"hard working",
"intelligent"
]
}
},
{
"_index": "testindex",
"_type": "testindex",
"_id": "2",
"_score": 1,
"_source": {
"personalDetails": {
"name": "bob",
"age": "13"
},
"marks": {
"physics": 48,
"maths": 45,
"chemistry": 44
},
"remarks": [
"hard working",
"intelligent"
]
}
}
]
},
"aggregations": {
"total": {
"value": 194
}
}
}
Expected Output:
I would like to get the total marks earned across subjects by each individual student, rather than the total for all students.
What changes do I need to make to the query to achieve this?

{
"query": {
"filtered": {
"query": {
"match_all": {}
}
}
},
"aggs": {
"student": {
"terms": {
"field": "personalDetails.name",
"size": 10
},
"aggs": {
"total": {
"sum": {
"script": "doc['physics'].value + doc['maths'].value + doc['chemistry'].value"
}
}
}
}
}
}
Be careful, though: the student terms aggregation needs a field that uniquely identifies each student, such as a personal ID. A name is not enough, since two students called "bob" would be merged into one bucket. The _id itself may work, but you need to store it as a field.
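To see what the extra terms bucket changes, here is a small Python sketch that reproduces the arithmetic on the two sample documents from the question. It does not talk to Elasticsearch; the function names `grand_total` and `total_per_student` are made up for illustration, and grouping by `_id` is an assumption about which field is unique.

```python
# Sample documents copied from the question's search output (only the fields needed here).
hits = [
    {"_id": "1", "_source": {"personalDetails": {"name": "viswa"},
                             "marks": {"physics": 18, "maths": 5, "chemistry": 34}}},
    {"_id": "2", "_source": {"personalDetails": {"name": "bob"},
                             "marks": {"physics": 48, "maths": 45, "chemistry": 44}}},
]

def grand_total(hits):
    """Mimic a top-level sum aggregation: one number for all documents."""
    return sum(sum(h["_source"]["marks"].values()) for h in hits)

def total_per_student(hits):
    """Mimic terms + sum: one total per unique key (here, _id, an assumption)."""
    totals = {}
    for h in hits:
        key = h["_id"]
        totals[key] = totals.get(key, 0) + sum(h["_source"]["marks"].values())
    return totals

print(grand_total(hits))        # 194, the value the original query returned
print(total_per_student(hits))  # {'1': 57, '2': 137}
```

The first function corresponds to the original query, the second to the query with the student bucket wrapped around the sum.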

Related

How to get sum of different fields / array values in Elasticsearch?

Using Elasticsearch 7.9.0
My document looks like this
{
"student": {
"marks": [
{
"sub": 80
},
{
"sub": 90
},
{
"sub": 100
}
]
}
}
I need one more field, total_marks, in the response of the GET API.
Something like this:
{
"hits": [
{
"_index": "abc",
"_type": "_doc",
"_id": "blabla",
"_score": null,
"_source": {
"student": {
"marks": [
{
"sub": 80
},
{
"sub": 90
},
{
"sub": 100
}
]
}
},
"total_marks": 270
}
]
}
I tried using script_fields
My query is
GET sample/_search
{
"query": {
"match_all": {}
},
"script_fields": {
"total_marks": {
"script": {
"source": """double sum = 0.0;
for( item in params._source.student.marks)
{ sum = sum + item.sub }
return sum;"""
}
}
}
}
I got response as
{
"hits": [
{
"_index": "abc",
"_type": "_doc",
"_id": "blabla",
"_score": null,
"_source": {
"student": {
"marks": [
{
"sub": 80
},
{
"sub": 90
},
{
"sub": 100
}
]
}
},
"fields": {
"total_marks": [
270
]
}
}
]
}
Is there any way to get the response in the expected shape?
Any better or more optimal solution would help a lot.
Thank you.
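A terms aggregation combined with a sum aggregation can be used to find the total marks per group. This assumes each document has a unique student.id field, which the sample document does not show: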
{
"aggs": {
"students": {
"terms": {
"field": "student.id.keyword",
"size": 10
},
"aggs": {
"total_marks": {
"sum": {
"field": "student.marks.sub"
}
}
}
}
}
}
Result
"aggregations" : {
"students" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1",
"doc_count" : 1,
"total_marks" : {
"value" : 270.0
}
}
]
}
}
This will be faster than a script, but pagination is easier with a query than with an aggregation, so choose accordingly.
The best option may be to calculate the total at index time, provided those fields do not change frequently.
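A minimal sketch of the index-time option, done in client code before the document is sent to Elasticsearch. The helper name `with_total_marks` is hypothetical, and placing `total_marks` at the top level of the document is an assumption; put it wherever suits your mapping.

```python
import copy

def with_total_marks(doc):
    """Return a copy of the document with a precomputed total_marks field."""
    enriched = copy.deepcopy(doc)  # leave the caller's document untouched
    marks = enriched["student"]["marks"]
    # Sum the "sub" value of every entry in the marks array.
    enriched["total_marks"] = sum(item["sub"] for item in marks)
    return enriched

doc = {"student": {"marks": [{"sub": 80}, {"sub": 90}, {"sub": 100}]}}
print(with_total_marks(doc)["total_marks"])  # 270
```

With the field stored, it comes back in `_source` like any other field, and it can be sorted, filtered, and paginated on without a script.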

Count API: count query field A with distinct field B value

For instance, given this result for a search, reduced to a size of 3 hits for brevity:
{
"hits": {
"total": {
"value": 51812937,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "desc-imunizacao",
"_type": "_doc",
"_id": "7d0ac34a-1d4f-435a-9e5f-6dc2d77bb251-i0b0",
"_score": 1.0,
"_source": {
"vacina_descricao_dose": "    2ª Dose",
"estabelecimento_uf": "BA",
"document_id": "7d0ac34a-1d4f-435a-9e5f-6dc2d77bb251-i0b0"
}
},
{
"_index": "desc-imunizacao",
"_type": "_doc",
"_id": "2dc55c6a-5ac1-4550-8990-5ca611808e8a-i0b0",
"_score": 1.0,
"_source": {
"vacina_descricao_dose": "    1ª Dose",
"estabelecimento_uf": "SE",
"document_id": "2dc55c6a-5ac1-4550-8990-5ca611808e8a-i0b0"
}
},
{
"_index": "desc-imunizacao",
"_type": "_doc",
"_id": "d7e9b381-2873-4d0a-8b2d-5fa5034b7a80-i0b0",
"_score": 1.0,
"_source": {
"vacina_descricao_dose": "    1ª Dose",
"estabelecimento_uf": "SE",
"document_id": "d7e9b381-2873-4d0a-8b2d-5fa5034b7a80-i0b0"
}
}
]
}
}
If I wanted to query for "estabelecimento_uf": "SE" and keep only one result for duplicates of "document_id", I would issue:
{
"_source": ["document_id", "estabelecimento_uf", "vacina_descricao_dose"],
"query": {
"match": {
"estabelecimento_uf": {
"query": "SE"
}
}
},
"collapse": {
"field": "document_id",
"inner_hits": {
"name": "latest",
"size": 1
}
}
}
Is there a way to achieve this with Elasticsearch's Count API? Meaning: count query for field A (estabelecimento_uf) and count for unique values of field B (document_id), knowing that document_id has duplicates of course.
This is a public API: https://imunizacao-es.saude.gov.br/_search
This is the authentication:
User: imunizacao_public
Pass: qlto5t&7r_#+#Tlstigi
You can combine a filter aggregation with a cardinality aggregation to get a count of unique document IDs that match a filter:
{
"size": 0,
"aggs": {
"filter_agg": {
"filter": {
"term": {
"estabelecimento_uf.keyword": "SE"
}
},
"aggs": {
"count_docid": {
"cardinality": {
"field": "document_id.keyword"
}
}
}
}
}
}
As far as I know, you cannot get the count of distinct field values using the Count API. You can either use the field collapsing feature (as done in the question) or use a cardinality aggregation.
Here is a working example with index data, search queries, and search results:
{
"vacina_descricao_dose": " 2ª Dose",
"estabelecimento_uf": "BA",
"document_id": "7d0ac34a-1d4f-435a-9e5f-6dc2d77bb251-i0b0"
}
{
"vacina_descricao_dose": " 1ª Dose",
"estabelecimento_uf": "SE",
"document_id": "2dc55c6a-5ac1-4550-8990-5ca611808e8a-i0b0"
}
{
"vacina_descricao_dose": " 1ª Dose",
"estabelecimento_uf": "SE",
"document_id": "d7e9b381-2873-4d0a-8b2d-5fa5034b7a80-i0b0"
}
{
"vacina_descricao_dose": " 1ª Dose",
"estabelecimento_uf": "SE",
"document_id": "d7e9b381-2873-4d0a-8b2d-5fa5034b7a80-i0b0"
}
Search Query 1:
{
"size": 0,
"query": {
"match": {
"estabelecimento_uf": "SE"
}
},
"aggs": {
"count_doc_id": {
"cardinality": {
"field": "document_id.keyword"
}
}
}
}
Search Result:
"aggregations": {
"count_doc_id": {
"value": 2 // note this
}
}
Search Query 2:
{
"size": 0,
"aggs": {
"filter_agg": {
"filter": {
"term": {
"estabelecimento_uf.keyword": "SE"
}
},
"aggs": {
"count_docid": {
"cardinality": {
"field": "document_id.keyword"
}
}
}
}
}
}
Search Result:
"aggregations": {
"filter_agg": {
"doc_count": 3,
"count_docid": {
"value": 2 // note this
}
}
}
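The two numbers in the last result (`doc_count` and the cardinality value) can be checked by hand. The sketch below recomputes them in plain Python over the four sample documents; `count_and_distinct` is a made-up helper name, and nothing here calls Elasticsearch.

```python
# The four sample documents indexed in the example (dose field omitted).
docs = [
    {"estabelecimento_uf": "BA", "document_id": "7d0ac34a-1d4f-435a-9e5f-6dc2d77bb251-i0b0"},
    {"estabelecimento_uf": "SE", "document_id": "2dc55c6a-5ac1-4550-8990-5ca611808e8a-i0b0"},
    {"estabelecimento_uf": "SE", "document_id": "d7e9b381-2873-4d0a-8b2d-5fa5034b7a80-i0b0"},
    {"estabelecimento_uf": "SE", "document_id": "d7e9b381-2873-4d0a-8b2d-5fa5034b7a80-i0b0"},
]

def count_and_distinct(docs, uf):
    """Return (matching documents, distinct document_id values) for one state."""
    matching = [d for d in docs if d["estabelecimento_uf"] == uf]
    distinct_ids = {d["document_id"] for d in matching}
    return len(matching), len(distinct_ids)

print(count_and_distinct(docs, "SE"))  # (3, 2)
```

Note that this set-based count is exact, whereas the cardinality aggregation returns an approximate count on large data sets, so on an index with tens of millions of documents the reported value may be slightly off.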

ElasticSearch: Avg aggregation for datetime format

I am stuck on an Elasticsearch query that I want to run from Python.
I have data such as:
{
"_index": "user_log",
"_type": "logs",
"_id": "gdUJpXIBAoADuwvHTK29",
"_score": 1,
"_source": {
"user_name": "prathameshsalap#gmail.com",
"working_hours": "2019-10-21 09:00:01"
}
}
{
"_index": "user_log",
"_type": "logs",
"_id": "gtUJpXIBAoADuwvHTK29",
"_version": 1,
"_score": 0,
"_source": {
"user_name": "vaishusawant143#gmail.com",
"working_hours": "2019-10-21 09:15:01"
}
}
{
"_index": "user_log",
"_type": "logs",
"_id": "g9UJpXIBAoADuwvHTK29",
"_version": 1,
"_score": 0,
"_source": {
"user_name": "prathameshsalap#gmail.com",
"working_hours": "2019-10-22 07:50:00"
}
}
{
"_index": "user_log",
"_type": "logs",
"_id": "g8UJpXIBAoADuwvHTK29",
"_version": 1,
"_score": 0,
"_source": {
"user_name": "vaishusawant143#gmail.com",
"working_hours": "2019-10-22 04:15:01"
}
}
Here, each user has working hours recorded for different dates (the 21st and 22nd). I want to take the average of each user's working hours.
{
"size": 0,
"query" : {"match_all": {}},
"aggs": {
"users": {
"terms": {
"field": "user_name"
},
"aggs": {
"avg_hours": {
"avg": {
"field": "working_hours"
}
}
}
}
}
}
This query is not working. How can I find the average working hours for each user across all dates? I also want to run this query using the Python Elasticsearch client.
Update
When I use an ingest pipeline as @Val mentioned, I get an error:
{
"error" : {
"root_cause" : [
{
"type" : "script_exception",
"reason" : "compile error",
"processor_type" : "script",
"script_stack" : [
"\n def workDate = /\\s+/.split(ctx.working_h ...",
" ^---- HERE"
],
"script" : "\n def workDate = /\\s+/.split(ctx.working_hours);\n def workHours = /:/.split(workDate[1]);\n ctx.working_minutes = (Integer.parseInt(workHours[0]) * 60) + Integer.parseInt(workHours[1]);\n ",
"lang" : "painless",
"position" : {
"offset" : 24,
"start" : 0,
"end" : 49
}
}
.....
How can I solve it?
The problem is that your working_hours field is a point in time and does not denote a duration.
For this use case, it's best to store the working day and working hours in two separate fields and store the working hours in minutes.
So instead of having documents like this:
{
"user_name": "prathameshsalap#gmail.com",
"working_hours": "2019-10-21 09:00:01"
}
Create documents like this:
{
"user_name": "prathameshsalap#gmail.com",
"working_day": "2019-10-21",
"working_hours": "09:00:01",
"working_minutes": 540
}
Then you can use your query on the working_minutes field:
{
"size": 0,
"query" : {"match_all": {}},
"aggs": {
"users": {
"terms": {
"field": "user_name.keyword",
"order": {
"avg_hours": "desc"
}
},
"aggs": {
"avg_hours": {
"avg": {
"field": "working_minutes"
}
}
}
}
}
}
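Computing the new fields in client code could look like the following. This is a minimal sketch that assumes the timestamp always has the `YYYY-MM-DD HH:MM:SS` shape shown in the question; the function name `split_working_hours` and the example user name are made up for illustration.

```python
from datetime import datetime

def split_working_hours(doc):
    """Split a 'YYYY-MM-DD HH:MM:SS' value into day, time, and minutes."""
    ts = datetime.strptime(doc["working_hours"], "%Y-%m-%d %H:%M:%S")
    return {
        "user_name": doc["user_name"],
        "working_day": ts.strftime("%Y-%m-%d"),
        "working_hours": ts.strftime("%H:%M:%S"),
        # Seconds are dropped, so 09:00:01 becomes 540 minutes.
        "working_minutes": ts.hour * 60 + ts.minute,
    }

original = {"user_name": "user_a", "working_hours": "2019-10-21 09:00:01"}
print(split_working_hours(original))
```

Each document would be passed through this function before it is indexed.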
If it is not convenient to compute the working_minutes field in your client code, you can achieve the same thing using an ingest pipeline. Let's define the pipeline first:
PUT _ingest/pipeline/working-hours
{
"processors": [
{
"dissect": {
"field": "working_hours",
"pattern": "%{?date} %{tmp_hours}:%{tmp_minutes}:%{?seconds}"
}
},
{
"convert": {
"field": "tmp_hours",
"type": "integer"
}
},
{
"convert": {
"field": "tmp_minutes",
"type": "integer"
}
},
{
"script": {
"source": """
ctx.working_minutes = (ctx.tmp_hours * 60) + ctx.tmp_minutes;
"""
}
},
{
"remove": {
"field": [
"tmp_hours",
"tmp_minutes"
]
}
}
]
}
Then update your Python client code to use the new pipeline, which will create the working_minutes field for you:
helpers.bulk(es, reader, index='user_log', doc_type='logs', pipeline='working-hours')
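The question also asks how to run the aggregation from Python. Below is a minimal sketch, assuming the official `elasticsearch` Python client and a 7.x-style `es.search(index=..., body=...)` call. The function name `average_minutes_per_user` and the `FakeClient` stand-in are invented here so the example runs without a cluster. The stub's values are the averages worked out by hand from the four timestamps in the question, with the users renamed.

```python
def average_minutes_per_user(es, index="user_log"):
    """Run the terms + avg aggregation and return {user_name: average minutes}."""
    body = {
        "size": 0,
        "aggs": {
            "users": {
                "terms": {"field": "user_name.keyword"},
                "aggs": {"avg_hours": {"avg": {"field": "working_minutes"}}},
            }
        },
    }
    response = es.search(index=index, body=body)
    buckets = response["aggregations"]["users"]["buckets"]
    return {b["key"]: b["avg_hours"]["value"] for b in buckets}

class FakeClient:
    """Stand-in for a real client; returns a canned aggregation response."""
    def search(self, index, body):
        return {"aggregations": {"users": {"buckets": [
            {"key": "user_a", "doc_count": 2, "avg_hours": {"value": 505.0}},  # (540 + 470) / 2
            {"key": "user_b", "doc_count": 2, "avg_hours": {"value": 405.0}},  # (555 + 255) / 2
        ]}}}

print(average_minutes_per_user(FakeClient()))
```

With a real cluster you would pass your own client instance instead of `FakeClient()`.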

Elasticsearch: Querying nested objects

Dear Elasticsearch experts,
I have a problem querying nested objects. Let's use the following simplified mapping:
{
"mappings" : {
"_doc" : {
"properties" : {
"companies" : {
"type": "nested",
"properties" : {
"company_id": { "type": "long" },
"name": { "type": "text" }
}
},
"title": { "type": "text" }
}
}
}
}
And put some documents in the index:
PUT my_index/_doc/1
{
"title" : "CPU release",
"companies" : [
{ "company_id" : 1, "name" : "AMD" },
{ "company_id" : 2, "name" : "Intel" }
]
}
PUT my_index/_doc/2
{
"title" : "GPU release 2018-01-10",
"companies" : [
{ "company_id" : 1, "name" : "AMD" },
{ "company_id" : 3, "name" : "Nvidia" }
]
}
PUT my_index/_doc/3
{
"title" : "GPU release 2018-03-01",
"companies" : [
{ "company_id" : 3, "name" : "Nvidia" }
]
}
PUT my_index/_doc/4
{
"title" : "Chipset release",
"companies" : [
{ "company_id" : 2, "name" : "Intel" }
]
}
Now I want to execute queries like this:
{
"query": {
"bool": {
"must": [
{ "match": { "title": "GPU" } },
{ "nested": {
"path": "companies",
"query": {
"bool": {
"must": [
{ "match": { "companies.name": "AMD" } }
]
}
},
"inner_hits" : {}
}
}
]
}
}
}
As result I want to get the matching companies with the number of matching documents. So the above query should give me:
[
{ "company_id" : 1, "name" : "AMD", "matched_documents": 1 }
]
The following query:
{
"query": {
"bool": {
"must": [
{ "match": { "title": "GPU" } },
{ "nested": {
"path": "companies",
"query": { "match_all": {} },
"inner_hits" : {}
}
}
]
}
}
}
should give me all companies assigned to a document whose title contains "GPU", with the number of matching documents:
[
{ "company_id" : 1, "name" : "AMD", "matched_documents": 1 },
{ "company_id" : 3, "name" : "Nvidia", "matched_documents": 2 }
]
Is there any possibility with good performance to achieve this result? I'm explicitly not interested in the matching documents, only in the number of matched documents and the nested objects.
Thanks for your help.
What you need to do in terms of Elasticsearch is:
filter "parent" documents on desired criteria (like having GPU in title, or also mentioning Nvidia in the companies list);
group "nested" documents by a certain criterion, a bucket (e.g. company_id);
count how many "nested" documents there are in each bucket.
Each nested object in the array is indexed as a separate hidden document, which complicates life a bit. Let's see how to aggregate on them.
So how to aggregate and count the nested documents?
You can achieve this with a combination of a nested, terms and top_hits aggregation:
POST my_index/_doc/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"title": "GPU"
}
},
{
"nested": {
"path": "companies",
"query": {
"match_all": {}
}
}
}
]
}
},
"aggs": {
"Extract nested": {
"nested": {
"path": "companies"
},
"aggs": {
"By company id": {
"terms": {
"field": "companies.company_id"
},
"aggs": {
"Examples of such company_id": {
"top_hits": {
"size": 1
}
}
}
}
}
}
}
}
This will give the following output:
{
...
"hits": { ... },
"aggregations": {
"Extract nested": {
"doc_count": 3, <== How many "nested" documents there were?
"By company id": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 3, <== this bucket's key: "company_id": 3
"doc_count": 2, <== how many "nested" documents there were with such company_id?
"Examples of such company_id": {
"hits": {
"total": 2,
"max_score": 1.5897496,
"hits": [ <== an example, "top hit" for such company_id
{
"_nested": {
"field": "companies",
"offset": 1
},
"_score": 1.5897496,
"_source": {
"company_id": 3,
"name": "Nvidia"
}
}
]
}
}
},
{
"key": 1,
"doc_count": 1,
"Examples of such company_id": {
"hits": {
"total": 1,
"max_score": 1.5897496,
"hits": [
{
"_nested": {
"field": "companies",
"offset": 0
},
"_score": 1.5897496,
"_source": {
"company_id": 1,
"name": "AMD"
}
}
]
}
}
}
]
}
}
}
}
Notice that for Nvidia we have "doc_count": 2.
But what if we want to count the number of "parent" objects that mention Nvidia vs. Intel?
What if we want to count parent objects based on a nested bucket?
It can be achieved with reverse_nested aggregation.
We need to change our query just a little bit:
POST my_index/_doc/_search
{
"query": { ... },
"aggs": {
"Extract nested": {
"nested": {
"path": "companies"
},
"aggs": {
"By company id": {
"terms": {
"field": "companies.company_id"
},
"aggs": {
"Examples of such company_id": {
"top_hits": {
"size": 1
}
},
"original doc count": { <== we ask ES to count how many parent docs there are
"reverse_nested": {}
}
}
}
}
}
}
}
The result will look like this:
{
...
"hits": { ... },
"aggregations": {
"Extract nested": {
"doc_count": 3,
"By company id": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 3,
"doc_count": 2,
"original doc count": {
"doc_count": 2 <== how many "parent" documents have such company_id
},
"Examples of such company_id": {
"hits": {
"total": 2,
"max_score": 1.5897496,
"hits": [
{
"_nested": {
"field": "companies",
"offset": 1
},
"_score": 1.5897496,
"_source": {
"company_id": 3,
"name": "Nvidia"
}
}
]
}
}
},
{
"key": 1,
"doc_count": 1,
"original doc count": {
"doc_count": 1
},
"Examples of such company_id": {
"hits": {
"total": 1,
"max_score": 1.5897496,
"hits": [
{
"_nested": {
"field": "companies",
"offset": 0
},
"_score": 1.5897496,
"_source": {
"company_id": 1,
"name": "AMD"
}
}
]
}
}
}
]
}
}
}
}
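To turn this response into the flat list the question asked for, a little client-side reshaping is enough. The sketch below is one possible way, assuming the aggregation names used above; the helper `companies_with_counts` is hypothetical, and `response` is a trimmed copy of the result shown above.

```python
def companies_with_counts(response):
    """Flatten the nested/terms/top_hits/reverse_nested result into a list."""
    buckets = response["aggregations"]["Extract nested"]["By company id"]["buckets"]
    result = []
    for bucket in buckets:
        # The single top hit carries the nested object's own fields.
        example = bucket["Examples of such company_id"]["hits"]["hits"][0]["_source"]
        result.append({
            "company_id": example["company_id"],
            "name": example["name"],
            # reverse_nested doc_count = number of parent documents.
            "matched_documents": bucket["original doc count"]["doc_count"],
        })
    return result

# Trimmed copy of the response above, keeping only the keys the helper reads.
response = {
    "aggregations": {"Extract nested": {"doc_count": 3, "By company id": {"buckets": [
        {"key": 3, "doc_count": 2,
         "original doc count": {"doc_count": 2},
         "Examples of such company_id": {"hits": {"hits": [
             {"_source": {"company_id": 3, "name": "Nvidia"}}]}}},
        {"key": 1, "doc_count": 1,
         "original doc count": {"doc_count": 1},
         "Examples of such company_id": {"hits": {"hits": [
             {"_source": {"company_id": 1, "name": "AMD"}}]}}},
    ]}}}
}

print(companies_with_counts(response))
```

Since only the aggregation is needed, adding "size": 0 to the search request avoids fetching the parent hits at all.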
How can I spot the difference?
To make the difference evident, let's change the data a bit and add another Nvidia item in the document list:
PUT my_index/_doc/2
{
"title" : "GPU release 2018-01-10",
"companies" : [
{ "company_id" : 1, "name" : "AMD" },
{ "company_id" : 3, "name" : "Nvidia" },
{ "company_id" : 3, "name" : "Nvidia" }
]
}
The last query (the one with reverse_nested) will give us the following:
"By company id": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 3,
"doc_count": 3, <== 3 "nested" documents with Nvidia
"original doc count": {
"doc_count": 2 <== but only 2 "parent" documents
},
"Examples of such company_id": {
"hits": {
"total": 3,
"max_score": 1.5897496,
"hits": [
{
"_nested": {
"field": "companies",
"offset": 2
},
"_score": 1.5897496,
"_source": {
"company_id": 3,
"name": "Nvidia"
}
}
]
}
}
},
As you can see, this is a subtle difference that is hard to grasp, but it changes the semantics completely.
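The two counts can be reproduced without Elasticsearch. The following sketch models the two GPU documents (after the extra Nvidia entry was added) as plain Python data; the function names `nested_counts` and `parent_counts` are invented for illustration.

```python
from collections import Counter

# The two documents matching "GPU", reduced to their company_id lists.
parents = [
    {"title": "GPU release 2018-01-10", "companies": [1, 3, 3]},
    {"title": "GPU release 2018-03-01", "companies": [3]},
]

def nested_counts(parents):
    """Count every nested object, like doc_count inside the nested aggregation."""
    return Counter(cid for p in parents for cid in p["companies"])

def parent_counts(parents):
    """Count each parent once per company, like reverse_nested does."""
    return Counter(cid for p in parents for cid in set(p["companies"]))

print(nested_counts(parents)[3])  # 3 nested Nvidia objects
print(parent_counts(parents)[3])  # 2 parent documents mentioning Nvidia
```

The only difference between the two functions is the `set(...)` call, which collapses repeated nested objects within the same parent.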
What about performance?
While the performance of nested queries and aggregations should be sufficient for most cases, it does come at a cost. It is therefore recommended to avoid nested or parent-child types when tuning for search speed.
In Elasticsearch the best performance is often achieved through denormalization, although there is no single recipe and you should select the data model depending on your needs.
Hope this clarifies this nested thing for you a bit!

Complex aggregations with Elastic Search

Supposing this is my elasticsearch structure:
{
"_index": "my_index",
"_type": "person",
"_id": "ID",
"_source": {
...DATA...
}
}
{
"_index": "my_index",
"_type": "result",
"_id": "ID",
"_source": {
"personID": "personID",
"date": "timestamp",
"result": "integer",
"speciality": "categoryID"
}
}
I would like to get the 10 most "influential" people based on:
number of competitions in the last 30 days
number of competitions in the last year
competition results in the last 30 days
number of different specialities in the last 30 days
I'm thinking about using _score, but I don't know how to influence the score using values aggregated from the documents of type "result". This is what I'm trying to achieve:
POST my_index/_search?search_type=dfs_query_then_fetch
{
"size": 10,
"query": {
"function_score": {
"query": {
"bool": {
"must": [
{
"term": {
"_type": {
"value": "person"
}
}
}
]
}
}
},
"functions": [
{
"field_value_factor": {
"field": {
"query": {
//competitions in the last 30 days
},
"aggs": {
//count
}
},
"factor": 1
},
"weight": 0.1
}
]
}
}
Is this possible with just 1 request?
Is this a good approach?
Any tip on what to look at is appreciated
