Count API: count query field A with distinct field B value - elasticsearch

For instance, given this result for a search, reduced to a size of 3 hits for brevity:
{
"hits": {
"total": {
"value": 51812937,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "desc-imunizacao",
"_type": "_doc",
"_id": "7d0ac34a-1d4f-435a-9e5f-6dc2d77bb251-i0b0",
"_score": 1.0,
"_source": {
"vacina_descricao_dose": "    2ª Dose",
"estabelecimento_uf": "BA",
"document_id": "7d0ac34a-1d4f-435a-9e5f-6dc2d77bb251-i0b0"
}
},
{
"_index": "desc-imunizacao",
"_type": "_doc",
"_id": "2dc55c6a-5ac1-4550-8990-5ca611808e8a-i0b0",
"_score": 1.0,
"_source": {
"vacina_descricao_dose": "    1ª Dose",
"estabelecimento_uf": "SE",
"document_id": "2dc55c6a-5ac1-4550-8990-5ca611808e8a-i0b0"
}
},
{
"_index": "desc-imunizacao",
"_type": "_doc",
"_id": "d7e9b381-2873-4d0a-8b2d-5fa5034b7a80-i0b0",
"_score": 1.0,
"_source": {
"vacina_descricao_dose": "    1ª Dose",
"estabelecimento_uf": "SE",
"document_id": "d7e9b381-2873-4d0a-8b2d-5fa5034b7a80-i0b0"
}
}
]
}
}
If I wanted to query for "estabelecimento_uf": "SE" and keep only one result for duplicates of "document_id", I would issue:
{
"_source": ["document_id", "estabelecimento_uf", "vacina_descricao_dose"],
"query": {
"match": {
"estabelecimento_uf": {
"query": "SE"
}
}
},
"collapse": {
"field": "document_id",
"inner_hits": {
"name": "latest",
"size": 1
}
}
}
Is there a way to achieve this with Elasticsearch's Count API? That is: count documents matching a query on field A (estabelecimento_uf), but only count unique values of field B (document_id), knowing of course that document_id contains duplicates.
This is a public API: https://imunizacao-es.saude.gov.br/_search
This is the authentication:
User: imunizacao_public
Pass: qlto5t&7r_#+#Tlstigi

You can use a combination of the filter aggregation and the cardinality aggregation to get the count of unique document_id values matching a filter:
{
"size": 0,
"aggs": {
"filter_agg": {
"filter": {
"term": {
"estabelecimento_uf.keyword": "SE"
}
},
"aggs": {
"count_docid": {
"cardinality": {
"field": "document_id.keyword"
}
}
}
}
}
}
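For instance, a minimal sketch of sending this aggregation over HTTP with Python's requests library, using the public endpoint and credentials given in the question, and reading the distinct count from the response:

import requests

# Sketch: POST the filter + cardinality aggregation to the public endpoint
# from the question and read the distinct document_id count.
ENDPOINT = "https://imunizacao-es.saude.gov.br/_search"
AUTH = ("imunizacao_public", "qlto5t&7r_#+#Tlstigi")

body = {
    "size": 0,
    "aggs": {
        "filter_agg": {
            "filter": {"term": {"estabelecimento_uf.keyword": "SE"}},
            "aggs": {
                "count_docid": {"cardinality": {"field": "document_id.keyword"}}
            },
        }
    },
}

response = requests.post(ENDPOINT, json=body, auth=AUTH)
response.raise_for_status()
filter_agg = response.json()["aggregations"]["filter_agg"]
print("documents matching the filter:", filter_agg["doc_count"])
print("distinct document_id values:", filter_agg["count_docid"]["value"])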
As far as I know, you cannot get the count of distinct field values using the Count API; you can either use the field collapsing feature (as done in the question) or use a cardinality aggregation.
Adding a working example with index data, search query and search result
{
"vacina_descricao_dose": " 2ª Dose",
"estabelecimento_uf": "BA",
"document_id": "7d0ac34a-1d4f-435a-9e5f-6dc2d77bb251-i0b0"
}
{
"vacina_descricao_dose": " 1ª Dose",
"estabelecimento_uf": "SE",
"document_id": "2dc55c6a-5ac1-4550-8990-5ca611808e8a-i0b0"
}
{
"vacina_descricao_dose": " 1ª Dose",
"estabelecimento_uf": "SE",
"document_id": "d7e9b381-2873-4d0a-8b2d-5fa5034b7a80-i0b0"
}
{
"vacina_descricao_dose": " 1ª Dose",
"estabelecimento_uf": "SE",
"document_id": "d7e9b381-2873-4d0a-8b2d-5fa5034b7a80-i0b0"
}
Search Query 1:
{
"size": 0,
"query": {
"match": {
"estabelecimento_uf": "SE"
}
},
"aggs": {
"count_doc_id": {
"cardinality": {
"field": "document_id.keyword"
}
}
}
}
Search Result:
"aggregations": {
"count_doc_id": {
"value": 2 // note this
}
}
Search Query 2:
{
"size": 0,
"aggs": {
"filter_agg": {
"filter": {
"term": {
"estabelecimento_uf.keyword": "SE"
}
},
"aggs": {
"count_docid": {
"cardinality": {
"field": "document_id.keyword"
}
}
}
}
}
}
Search Result:
"aggregations": {
"filter_agg": {
"doc_count": 3,
"count_docid": {
"value": 2 // note this
}
}
}
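One caveat: the cardinality aggregation is approximate (it is based on HyperLogLog++), which can matter on an index with tens of millions of documents. If that is a concern, the precision_threshold parameter trades memory for accuracy: counts below the threshold are expected to be close to exact, and the maximum accepted value is 40000. A sketch of the same request body with it set:

# Sketch: same filter + cardinality aggregation, with a higher
# precision_threshold (maximum accepted value is 40000).
body = {
    "size": 0,
    "aggs": {
        "filter_agg": {
            "filter": {"term": {"estabelecimento_uf.keyword": "SE"}},
            "aggs": {
                "count_docid": {
                    "cardinality": {
                        "field": "document_id.keyword",
                        "precision_threshold": 40000,
                    }
                }
            },
        }
    },
}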

Related

Elasticsearch - Find documents missing two fields

I'm trying to create a query that returns information about how many documents don't have data for two fields (date.new and date.old). I have tried the query below, but it works with OR logic, where all documents missing either date.new or date.old are returned. Does anyone know how I can make it return only documents missing both fields?
{
"aggs":{
"Missing_field_count1":{
"missing":{
"field":"date.new"
}
},
"Missing_field_count2":{
"missing":{
"field":"date.old"
}
}
}
}
Aggregations are not the feature to use for this. You need to use the exists query wrapped within a bool/must_not query, like this:
GET index/_count
{
"query": {
"bool": {
"must_not": [
{
"exists": {
"field": "date.new"
}
},
{
"exists": {
"field": "date.old"
}
}
]
}
}
}
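For reference, a minimal sketch of running this count from Python with the elasticsearch-py client (the index name is a placeholder and the cluster address is assumed):

from elasticsearch import Elasticsearch

# Sketch: count documents missing BOTH date.new and date.old via the _count API.
# "my-index" is a placeholder; client usage follows elasticsearch-py 7.x.
es = Elasticsearch("http://localhost:9200")  # assumed local cluster

body = {
    "query": {
        "bool": {
            "must_not": [
                {"exists": {"field": "date.new"}},
                {"exists": {"field": "date.old"}},
            ]
        }
    }
}

resp = es.count(index="my-index", body=body)
print(resp["count"])  # number of documents missing both fields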
If you run the same bool/must_not query through the _search API instead (as in the example below), hits.total.value indicates the count of the documents that match: value is the number of matching hits and relation indicates whether that value is accurate (eq) or a lower bound (gte).
Index Data:
{
"data": {
"new": 1501,
"old": 10
}
}
{
"title": "elasticsearch"
}
{
"title": "elasticsearch-query"
}
{
"date": {
"new": 1400
}
}
The search query given by @Val shows how to achieve your use case.
Search Result:
"hits": {
"total": {
"value": 2, <-- note this
"relation": "eq"
},
"max_score": 0.0,
"hits": [
{
"_index": "65112793",
"_type": "_doc",
"_id": "2",
"_score": 0.0,
"_source": {
"title": "elasticsearch"
}
},
{
"_index": "65112793",
"_type": "_doc",
"_id": "5",
"_score": 0.0,
"_source": {
"title": "elasticsearch-query"
}
}
]
}

Elasticsearch: bool query with multiple values on one field

This works:
GET /bitbucket$$pull-request-activity/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"prid": "12343"
}
},
{
"match": {
"repoSlug": "com.xxx.vserver"
}
}
]
}
}
}
But I would like to capture multiple prids in one call.
This does not work however:
GET /bitbucket$$pull-request-activity/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"prid": "[12343, 11234, 13421]"
}
},
{
"match": {
"repoSlug": "com.xxx.vserver"
}
}
]
}
}
}
any hints?
As you are using must in your bool query, which represents a logical AND, every document matched on the prid field must also match "repoSlug": "com.xxx.vserver".
If none of the documents match "repoSlug": "com.xxx.vserver", then no results will be returned.
And if only 2 of the documents match, then only those 2 will be returned in the search result, not all of them.
Adding a working example with index data, search query, and search result
Index Sample Data :
{
"id":"1",
"message":"hello"
}
{
"id":"2",
"message":"hello"
}
{
"id":"3",
"message":"hello-bye"
}
Search Query:
{
"query": {
"bool": {
"must": [
{
"match": {
"id": "[1, 2, 3]"
}
},
{
"match": {
"message": "hello"
}
}
]
}
}
}
Search Result :
"hits": [
{
"_index": "foo14",
"_type": "_doc",
"_id": "1",
"_score": 1.5924306,
"_source": {
"id": "1",
"message": "hello"
}
},
{
"_index": "foo14",
"_type": "_doc",
"_id": "3",
"_score": 1.4903541,
"_source": {
"id": "3",
"message": "hello-bye"
}
},
{
"_index": "foo14",
"_type": "_doc",
"_id": "2",
"_score": 1.081605,
"_source": {
"id": "2",
"message": "hello"
}
}
]
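Note that the match on "[1, 2, 3]" only works because the string is analyzed into the tokens 1, 2 and 3, which are OR'ed by default. If the goal is to match any of several exact prid values, a terms query is the more explicit option. A minimal sketch with the elasticsearch-py client, assuming prid is mapped as a keyword or numeric field (use prid.keyword if it is a text field):

from elasticsearch import Elasticsearch

# Sketch: match any of several exact prid values with a terms query instead of
# packing them into a single match string. Index and field names follow the
# question; client usage follows elasticsearch-py 7.x.
es = Elasticsearch("http://localhost:9200")  # assumed local cluster

body = {
    "query": {
        "bool": {
            "must": [
                {"terms": {"prid": ["12343", "11234", "13421"]}},
                {"match": {"repoSlug": "com.xxx.vserver"}},
            ]
        }
    }
}

resp = es.search(index="bitbucket$$pull-request-activity", body=body)
print(resp["hits"]["total"])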

ElasticSearch: Avg aggregation for datetime format

I am stuck on an Elasticsearch query using Python.
I have data such as:
{
"_index": "user_log",
"_type": "logs",
"_id": "gdUJpXIBAoADuwvHTK29",
"_score": 1,
"_source": {
"user_name": "prathameshsalap#gmail.com",
"working_hours": "2019-10-21 09:00:01"
}
}
{
"_index": "user_log",
"_type": "logs",
"_id": "gtUJpXIBAoADuwvHTK29",
"_version": 1,
"_score": 0,
"_source": {
"user_name": "vaishusawant143#gmail.com",
"working_hours": "2019-10-21 09:15:01"
}
}
{
"_index": "user_log",
"_type": "logs",
"_id": "g9UJpXIBAoADuwvHTK29",
"_version": 1,
"_score": 0,
"_source": {
"user_name": "prathameshsalap#gmail.com",
"working_hours": "2019-10-22 07:50:00"
}
}
{
"_index": "user_log",
"_type": "logs",
"_id": "g8UJpXIBAoADuwvHTK29",
"_version": 1,
"_score": 0,
"_source": {
"user_name": "vaishusawant143#gmail.com",
"working_hours": "2019-10-22 04:15:01"
}
}
Here, each user has working hours on different dates (the 21st and the 22nd). I want to take the average of each user's working hours.
{
"size": 0,
"query" : {"match_all": {}},
"aggs": {
"users": {
"terms": {
"field": "user_name"
},
"aggs": {
"avg_hours": {
"avg": {
"field": "working_hours"
}
}
}
}
}
}
This query is not working. How do I find the average working hours of each user across all dates? I also want to run this query using python-elasticsearch.
Updated
When I use the ingest pipeline as @Val mentions, I am getting an error:
{
"error" : {
"root_cause" : [
{
"type" : "script_exception",
"reason" : "compile error",
"processor_type" : "script",
"script_stack" : [
"\n def workDate = /\\s+/.split(ctx.working_h ...",
" ^---- HERE"
],
"script" : "\n def workDate = /\\s+/.split(ctx.working_hours);\n def workHours = /:/.split(workDate[1]);\n ctx.working_minutes = (Integer.parseInt(workHours[0]) * 60) + Integer.parseInt(workHours[1]);\n ",
"lang" : "painless",
"position" : {
"offset" : 24,
"start" : 0,
"end" : 49
}
}
.....
How can I solve it?
The problem is that your working_hours field is a point in time and does not denote a duration.
For this use case, it's best to store the working day and working hours in two separate fields and store the working hours in minutes.
So instead of having documents like this:
{
"user_name": "prathameshsalap#gmail.com",
"working_hours": "2019-10-21 09:00:01",
}
Create documents like this:
{
"user_name": "prathameshsalap#gmail.com",
"working_day": "2019-10-21",
"working_hours": "09:00:01",
"working_minutes": 540
}
Then you can use your query on the working_minutes field:
{
"size": 0,
"query" : {"match_all": {}},
"aggs": {
"users": {
"terms": {
"field": "user_name.keyword",
"order": {
"avg_hours": "desc"
}
},
"aggs": {
"avg_hours": {
"avg": {
"field": "working_minutes"
}
}
}
}
}
}
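Computing working_minutes on the client side before indexing could look like the following sketch, assuming the timestamps keep the "YYYY-MM-DD HH:MM:SS" format shown in the question:

from datetime import datetime

def split_working_hours(doc):
    # Derive working_day, working_hours and working_minutes from the original
    # "YYYY-MM-DD HH:MM:SS" working_hours value (format assumed from the question).
    ts = datetime.strptime(doc["working_hours"], "%Y-%m-%d %H:%M:%S")
    doc["working_day"] = ts.strftime("%Y-%m-%d")
    doc["working_hours"] = ts.strftime("%H:%M:%S")
    doc["working_minutes"] = ts.hour * 60 + ts.minute
    return doc

# Example: yields working_day '2019-10-21', working_hours '09:00:01', working_minutes 540
print(split_working_hours({
    "user_name": "prathameshsalap#gmail.com",
    "working_hours": "2019-10-21 09:00:01",
}))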
If it is not convenient to compute the working_minutes field in your client code, you can achieve the same thing using an ingest pipeline. Let's define the pipeline first:
PUT _ingest/pipeline/working-hours
{
"processors": [
{
"dissect": {
"field": "working_hours",
"pattern": "%{?date} %{tmp_hours}:%{tmp_minutes}:%{?seconds}"
}
},
{
"convert": {
"field": "tmp_hours",
"type": "integer"
}
},
{
"convert": {
"field": "tmp_minutes",
"type": "integer"
}
},
{
"script": {
"source": """
ctx.working_minutes = (ctx.tmp_hours * 60) + ctx.tmp_minutes;
"""
}
},
{
"remove": {
"field": [
"tmp_hours",
"tmp_minutes"
]
}
}
]
}
Then you need to update your Python client code to use the new pipeline, which will create the working_minutes field for you:
helpers.bulk(es, reader, index='user_log', doc_type='logs', pipeline='working-hours')
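To run the terms/avg aggregation from Python and convert the averages back to hours, a sketch along these lines should work, assuming the elasticsearch-py client and the index and field names used above:

from elasticsearch import Elasticsearch

# Sketch: per-user average of working_minutes, printed in hours.
# Client usage follows elasticsearch-py 7.x; cluster address is assumed.
es = Elasticsearch("http://localhost:9200")

body = {
    "size": 0,
    "aggs": {
        "users": {
            "terms": {"field": "user_name.keyword", "order": {"avg_hours": "desc"}},
            "aggs": {"avg_hours": {"avg": {"field": "working_minutes"}}},
        }
    },
}

resp = es.search(index="user_log", body=body)
for bucket in resp["aggregations"]["users"]["buckets"]:
    avg_minutes = bucket["avg_hours"]["value"]
    print(f"{bucket['key']}: {avg_minutes / 60:.2f} hours on average")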

ElasticSearch: Restrict Aggregations to Query String

I'm struggling to get my aggregations to be restricted to my query.
I, of course, tried:
{
"_source": ["burger.id", "burger.user_name", "burger.timestamp"],
"query": {
"query_string": {
"query": "burger.user_name:Bob"
}
},
"aggs": {
"burger_count": {
"cardinality": {
"field": "burger.id.keyword"
}
},
"min_dtm": {
"min": {
"field": "burger.timestamp"
}
},
"max_dtm": {
"max": {
"field": "burger.timestamp"
}
}
}
}
I am very set on using "query_string" for filtering, as we have a very nice front-end that allows users to easily build queries that are then turned into a "query_string."
Unfortunately, I have not found a way to combine query_string and aggregations so that the aggregations are only over the results of the query!
I've read through many SO posts about this, but they are all old and outdated: they all suggest the deprecated Filtered Query approach, and even that doesn't use query_string.
UPDATE
Here are some example documents.
It appears that my results are not being filtered by my query. Is there a setting that I am missing?
I also changed all of the fields to be about burgers...
{
"_index": "burgers",
"_type": "burger",
"_id": "123",
"_score": 5.3759894,
"_source": {
"inference": {
"id": "1",
"user_name": "Jonathan",
"timestamp": 1541521691847
}
}
},
{
"_index": "burgers",
"_type": "burger",
"_id": "456",
"_score": 5.3759894,
"_source": {
"inference": {
"id": "2",
"user_name": "Ryan",
"timestamp": 1542416601153
}
}
},
{
"_index": "burgers",
"_type": "burger",
"_id": "789",
"_score": 5.3759894,
"_source": {
"inference": {
"id": "3",
"user_name": "Grant",
"timestamp": 1542237715511
}
}
}
Found my answer!
It appears that the issue was caused by querying the text field of burger.user_name instead of the keyword field: burger.user_name.keyword.
Changing my query_string to use the keyword for each text field solved my issue.
{
"_source": ["burger.id", "burger.user_name", "burger.timestamp"],
"query": {
"query_string": {
"query": "burger.user_name.keyword:Bob"
}
},
"aggs": {
"burger_count": {
"cardinality": {
"field": "burger.id.keyword"
}
},
"min_dtm": {
"min": {
"field": "burger.timestamp"
}
},
"max_dtm": {
"max": {
"field": "burger.timestamp"
}
}
}
}
This SO answer gives a great, brief explanation why.
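For completeness, a sketch of running the corrected request from the elasticsearch-py client and reading the three aggregation values; the index name and fields follow the examples above, and the cluster address is assumed:

from elasticsearch import Elasticsearch

# Sketch: query_string filter plus cardinality/min/max aggregations,
# with the aggregations computed only over the matching documents.
es = Elasticsearch("http://localhost:9200")  # assumed local cluster

body = {
    "size": 0,
    "query": {"query_string": {"query": "burger.user_name.keyword:Bob"}},
    "aggs": {
        "burger_count": {"cardinality": {"field": "burger.id.keyword"}},
        "min_dtm": {"min": {"field": "burger.timestamp"}},
        "max_dtm": {"max": {"field": "burger.timestamp"}},
    },
}

resp = es.search(index="burgers", body=body)
aggs = resp["aggregations"]
print(aggs["burger_count"]["value"], aggs["min_dtm"]["value"], aggs["max_dtm"]["value"])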

How to get total score specific to each row

I need an Elasticsearch GET query that shows the total score of each student, adding up the marks they earned in all subjects; instead, I am getting the total score of all students combined across every subject.
GET /testindex/testindex/_search
{
"query" : {
"filtered" : {
"query" : {
"match_all" : {}
}
}
},
"aggs": {
"total": {
"sum": {
"script" : "doc['physics'].value + doc['maths'].value + doc['chemistry'].value"
}
}
}
}
Output
{
....
"hits": [
{
"_index": "testindex",
"_type": "testindex",
"_id": "1",
"_score": 1,
"_source": {
"personalDetails": {
"name": "viswa",
"age": "33"
},
"marks": {
"physics": 18,
"maths": 5,
"chemistry": 34
},
"remarks": [
"hard working",
"intelligent"
]
}
},
{
"_index": "testindex",
"_type": "testindex",
"_id": "2",
"_score": 1,
"_source": {
"personalDetails": {
"name": "bob",
"age": "13"
},
"marks": {
"physics": 48,
"maths": 45,
"chemistry": 44
},
"remarks": [
"hard working",
"intelligent"
]
}
}
]
},
"aggregations": {
"total": {
"value": 194
}
}
}
Expected Output:
I would like to get the total marks earned by each individual student rather than the total across all students.
What changes do I need to make to the query to achieve this?
{
"query": {
"filtered": {
"query": {
"match_all": {}
}
}
},
"aggs": {
"student": {
"terms": {
"field": "personalDetails.name",
"size": 10
},
"aggs": {
"total": {
"sum": {
"script": "doc['physics'].value + doc['maths'].value + doc['chemistry'].value"
}
}
}
}
}
}
But be careful: for the student terms aggregation you need a field that makes each student unique (a personal ID or similar), maybe the _id itself, but you need to store it.
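For what it's worth, the filtered query was removed in Elasticsearch 5.0, so on recent versions the same per-student total would be written with a plain query plus the terms and sum aggregations. A sketch from the Python client, with field paths (personalDetails.name.keyword, marks.*) assumed from the sample documents; adjust them to your actual mapping:

from elasticsearch import Elasticsearch

# Sketch for Elasticsearch 5.x and later: per-student sum of the three marks.
# Client usage follows elasticsearch-py 7.x; cluster address is assumed.
es = Elasticsearch("http://localhost:9200")

body = {
    "size": 0,
    "query": {"match_all": {}},
    "aggs": {
        "student": {
            "terms": {"field": "personalDetails.name.keyword", "size": 10},
            "aggs": {
                "total": {
                    "sum": {
                        "script": {
                            "source": (
                                "doc['marks.physics'].value"
                                " + doc['marks.maths'].value"
                                " + doc['marks.chemistry'].value"
                            )
                        }
                    }
                }
            },
        }
    },
}

resp = es.search(index="testindex", body=body)
for bucket in resp["aggregations"]["student"]["buckets"]:
    print(bucket["key"], bucket["total"]["value"])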
