Finding number of files in latest batch using aggregations - elasticsearch

I've asked this same question in the Elasticsearch forums, but haven't gotten an answer yet.
Here is the scenario:
I have an index that contains documents representing files uploaded to a file store. These documents are uploaded in batches, and each doc is tagged with a batch_id field as well as the time it was uploaded. Each doc also has a status field that tracks whether the file passed or failed during ingestion.
Some example docs:
{
"batch_id" : "a",
"file_name" : "file_1_from_batch_a",
"#timestamp" : "2021-10-12T18:12:54.331Z",
"status" : "success"
}
{
"batch_id" : "a",
"file_name" : "file_2_from_batch_a",
"#timestamp" : "2021-10-12T00:00:00.000Z",
"status" : "success"
}
{
"batch_id" : "b",
"file_name" : "file_1_from_batch_b",
"#timestamp" : "2021-10-13T18:13:00.000Z",
"status" : "failure"
}
{
"batch_id" : "b",
"file_name" : "file_2_from_batch_b",
"#timestamp" : "2021-10-13T18:10:22.450Z",
"status" : "failure"
}
I wish to perform an aggregation query over the index to find out how many failures have occurred in the latest batch of files.
Here's what I've come up with so far, but sadly it's not giving the right answer:
GET my-index/_search
{
"size": 0,
"aggs": {
"most_recent" : {
"terms": {
"field" : "#timestamp",
"order": { "_term": "desc" },
"size": 1
},
"aggs": {
"execution_id": {
"terms": {
"field": "batch_id.keyword"
},
"aggs": {
"failures": {
"filter": {"term": {"status.keyword": "failure"}}
}
}
}
}
}
}
}
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"most_recent" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 3,
"buckets" : [
{
"key" : 1634148780000,
"key_as_string" : "2021-10-13T18:13:00.000Z",
"doc_count" : 1,
"execution_id" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "b",
"doc_count" : 1,
"failures" : {
"doc_count" : 1
}
}
]
}
}
]
}
}
}
The query is giving me the batch_id of the most recent batch (which is good), but is incorrectly telling me how many files in that batch failed (it should be 2).
I would appreciate any help on this!

You're aggregating on the timestamp, which always results in a doc_count of 1 unless two documents have exactly the same timestamp.
Because the outer aggregation buckets on unique timestamps, the documents are never actually grouped into batches by batch_id first.
To prove this, remove the size parameter from your terms aggregation and you'll see that the result of the search includes 2 a's and 2 b's, each "grouped" by its own timestamp.
"terms": {
"field" : "#timestamp",
"order": { "_term": "desc" }
}
That's the reason for it, but I can't currently test a working version, so I'm leaving this here as a draft.
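One way the grouping could be restructured, as an untested sketch: bucket on batch_id first, order the buckets by a max sub-aggregation on the timestamp, keep only the top bucket, and count failures inside it. The aggregation names (latest_batch, latest_upload, failures) are made up for illustration. The pure-Python function only mimics the same logic on the four example docs from the question, so the expected count can be checked without a cluster.

```python
from collections import defaultdict

# Request body for GET my-index/_search (untested sketch; aggregation names are hypothetical).
query = {
    "size": 0,
    "aggs": {
        "latest_batch": {
            # Group by batch first, newest batch on top, keep only that one.
            "terms": {
                "field": "batch_id.keyword",
                "order": {"latest_upload": "desc"},
                "size": 1,
            },
            "aggs": {
                "latest_upload": {"max": {"field": "#timestamp"}},
                "failures": {"filter": {"term": {"status.keyword": "failure"}}},
            },
        }
    },
}

# The four example docs from the question.
docs = [
    {"batch_id": "a", "#timestamp": "2021-10-12T18:12:54.331Z", "status": "success"},
    {"batch_id": "a", "#timestamp": "2021-10-12T00:00:00.000Z", "status": "success"},
    {"batch_id": "b", "#timestamp": "2021-10-13T18:13:00.000Z", "status": "failure"},
    {"batch_id": "b", "#timestamp": "2021-10-13T18:10:22.450Z", "status": "failure"},
]


def failures_in_latest_batch(docs):
    """Group by batch_id, pick the batch with the newest timestamp, count its failures."""
    batches = defaultdict(list)
    for doc in docs:
        batches[doc["batch_id"]].append(doc)
    # ISO-8601 UTC timestamps in one fixed format compare correctly as strings.
    latest_id = max(batches, key=lambda b: max(d["#timestamp"] for d in batches[b]))
    failures = sum(1 for d in batches[latest_id] if d["status"] == "failure")
    return latest_id, failures


print(failures_in_latest_batch(docs))  # ('b', 2)
```

The point of the design is that the failure count is computed per batch bucket rather than per timestamp bucket, which is what the original query was missing.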

Related

Get an aggregate count in elasticsearch based on particular uniqueid field

I have created an index and indexed documents in Elasticsearch, and that works fine. The challenge is that I have to get an aggregate count of the Category field for each unique UserID. My sample documents are below.
{
"UserID":"A1001",
"Category":"initiated",
"policyno":"5221"
},
{
"UserID":"A1001",
"Category":"pending",
"policyno":"5222"
},
{
"UserID":"A1001",
"Category":"pending",
"policyno":"5223"
},
{
"UserID":"A1002",
"Category":"completed",
"policyno":"5224"
}
**Sample output for UserID - "A1001"**
initiated-1
pending-2
**Sample output for UserID - "A1002"**
completed-1
How can I get the aggregate counts from the JSON documents above, in the form of the sample output?
I suggest a terms aggregation as shown in the following:
{
"size": 0,
"aggs": {
"By_ID": {
"terms": {
"field": "UserID.keyword"
},
"aggs": {
"By_Category": {
"terms": {
"field": "Category.keyword"
}
}
}
}
}
}
Here is a snippet of the response:
"hits" : {
"total" : {
"value" : 4,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"By_ID" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "A1001",
"doc_count" : 3,
"By_Category" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "pending",
"doc_count" : 2
},
{
"key" : "initiated",
"doc_count" : 1
}
]
}
},
{
"key" : "A1002",
"doc_count" : 1,
"By_Category" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "completed",
"doc_count" : 1
}
]
}
}
]
}
}
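To turn that response into the "category-count" lines from the question, client code could walk the nested buckets. This is a minimal sketch that assumes the response has already been parsed into a Python dict; the function name summarize is hypothetical, and the dict below copies only the relevant fields of the response snippet.

```python
# Relevant part of the response above, as a parsed dict.
response = {
    "aggregations": {
        "By_ID": {
            "buckets": [
                {
                    "key": "A1001",
                    "doc_count": 3,
                    "By_Category": {
                        "buckets": [
                            {"key": "pending", "doc_count": 2},
                            {"key": "initiated", "doc_count": 1},
                        ]
                    },
                },
                {
                    "key": "A1002",
                    "doc_count": 1,
                    "By_Category": {
                        "buckets": [{"key": "completed", "doc_count": 1}]
                    },
                },
            ]
        }
    }
}


def summarize(response):
    """Map each UserID to its 'category-count' strings, in bucket order."""
    summary = {}
    for user in response["aggregations"]["By_ID"]["buckets"]:
        summary[user["key"]] = [
            f'{cat["key"]}-{cat["doc_count"]}'
            for cat in user["By_Category"]["buckets"]
        ]
    return summary


for user_id, lines in summarize(response).items():
    print(user_id, lines)
```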

What is the difference in these elasticsearch queries?

I have the following elasticsearch query that returns plenty of results.
{
"query": {
"multi_match": {
"query": "swartz",
"fields": ["notes"]
}
},
"size": 20,
"from": 0,
"sort": {
"last_modified_date": {
"order": "desc"
}
}
}
I'm trying to redo it as a bool query so I can add should and must_not, but am getting no results and I'm not sure why.
{
"query": {
"bool": {
"must": [
{"term": { "notes": "swartz" }}
]
}
},
"size": 20,
"from": 0,
"sort": {
"last_modified_date": {
"order": "desc"
}
}
}
Instead of results, what I do get is this.
"took" : 6,
"timed_out" : false,
"_shards" : {
"total" : 6,
"successful" : 5,
"skipped" : 0,
"failed" : 1,
"failures" : [
{
"shard" : 0,
"index" : ".kibana_1",
"node" : "E2fjoon_Smm5m7LFcQp9XQ",
"reason" : {
"type" : "query_shard_exception",
"reason" : "No mapping found for [last_modified_date] in order to sort on",
"index_uuid" : "0pZdhm_nRXWiWGcqFgvvHQ",
"index" : ".kibana_1"
}
}
]
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
First, I'm not sure why I get results and it orders properly with the first query, and secondly, even if I take the sort out of the second query I still get no results.
In the first query you use a match query, which looks for any occurrence of "swartz" somewhere in the content of "notes".
In SQL terms it's roughly something like:
where notes ilike '%swartz%'
In the second query you use a term query, which looks for an exact match on the field.
In SQL:
where notes = 'swartz'
That probably explains the behavior you're seeing:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html
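The difference can be pictured with a toy model, offered only as a rough illustration. It assumes "notes" is an analyzed text field and uses a very simplified stand-in for the standard analyzer (lowercase, split on non-alphanumeric characters). The function names and the sample note text are invented for the example.

```python
import re


def analyze(text):
    """Crude stand-in for the standard analyzer: lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


def match_query(field_text, query):
    # match analyzes the query string too, then looks for any resulting token.
    indexed = set(analyze(field_text))
    return any(tok in indexed for tok in analyze(query))


def term_query(field_text, term):
    # term does not analyze its input: it is compared verbatim to the indexed tokens.
    return term in set(analyze(field_text))


notes = "Meeting with Aaron Swartz on Tuesday"  # hypothetical field value

print(match_query(notes, "Swartz"))  # True: the query is lowercased before lookup
print(term_query(notes, "Swartz"))   # False: no indexed token is spelled "Swartz"
print(term_query(notes, "swartz"))   # True: exact match against the indexed token
```

In this model a term query only succeeds when its input is exactly one of the indexed tokens, so whether it matches depends on how the field was mapped and analyzed.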

Is it possible with aggregation to amalgamate all values of an array property from all grouped documents into the coalesced document?

I have documents with the format similar to the following:
[
{
"name": "fred",
"title": "engineer",
"division_id": 20,
"skills": [
"walking",
"talking"
]
},
{
"name": "ed",
"title": "ticket-taker",
"division_id": 20,
"skills": [
"smiling"
]
}
]
I would like to run an aggs query that would show the complete set of skills for the division, i.e. something like:
{
"aggs":{
"distinct_skills":{
"cardinality":{
"field":"division_id"
}
}
},
"_source":{
"includes":[
"division_id",
"skills"
]
}
}
.. so that the resulting hit would look like:
{
"division_id": 20,
"skills": [
"walking",
"talking",
"smiling"
]
}
I know I can retrieve inner_hits, iterate through the list, and amalgamate the values "manually". I assume it would perform better if I could do it in a query.
Just nest two terms aggregations as shown below:
POST <your_index_name>/_search
{
"size": 0,
"aggs": {
"my_division_ids": {
"terms": {
"field": "division_id",
"size": 10
},
"aggs": {
"my_skills": {
"terms": {
"field": "skills", <---- If it is not keyword field use `skills.keyword` field if using dynamic mapping.
"size": 10
}
}
}
}
}
}
Below is the sample response:
{
"took" : 490,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"my_division_ids" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 20, <---- division_id
"doc_count" : 2,
"my_skills" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ <---- Skills
{
"key" : "smiling",
"doc_count" : 1
},
{
"key" : "talking",
"doc_count" : 1
},
{
"key" : "walking",
"doc_count" : 1
}
]
}
}
]
}
}
}
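To get from these buckets to the shape asked for in the question, the client could flatten them. A small sketch, assuming the response is already a parsed dict (only the relevant fields are copied here) and using a hypothetical helper name, skills_by_division:

```python
# Relevant part of the sample response, as a parsed dict.
response = {
    "aggregations": {
        "my_division_ids": {
            "buckets": [
                {
                    "key": 20,
                    "doc_count": 2,
                    "my_skills": {
                        "buckets": [
                            {"key": "smiling", "doc_count": 1},
                            {"key": "talking", "doc_count": 1},
                            {"key": "walking", "doc_count": 1},
                        ]
                    },
                }
            ]
        }
    }
}


def skills_by_division(response):
    """Collapse each division bucket into {'division_id': ..., 'skills': [...]}."""
    return [
        {
            "division_id": division["key"],
            "skills": [skill["key"] for skill in division["my_skills"]["buckets"]],
        }
        for division in response["aggregations"]["my_division_ids"]["buckets"]
    ]


print(skills_by_division(response))
```

Note that with "size": 10 on the inner terms aggregation, at most ten skills come back per division, so the size would need to be raised for divisions with more distinct skills.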
Hope this helps!

filtering on 2 values of same field

I have a status field, which can have one of the following values,
I can filter for data that has status completed. I can also see data that has status ongoing.
But I want to display the data that has status completed and ongoing at the same time.
But I don't know how to add filters for 2 values on a single field.
How can I achieve this?
EDIT - Thanks for the answers, but that is not what I wanted.
Like here I have filtered for status:completed; I want to filter for 2 values in this exact way.
I know I can edit this filter and use your queries, but I need a simple way to do this (the query way is complex), as I have to show it to my marketing team and they don't have any idea about queries. I need to convince them.
If I understand your question correctly, you want to perform an aggregation on 2 values of a field.
This should be possible with a query similar to this one with a terms query:
{
"size" : 0,
"query" : {
"bool" : {
"must" : [ {
"terms" : {
"status" : [ "completed", "unpaid" ]
}
} ]
}
},
"aggs" : {
"freqs" : {
"terms" : {
"field" : "status"
}
}
}
}
This will give a result like this one:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 3,
"successful" : 3,
"failed" : 0
},
"hits" : {
"total" : 5,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"freqs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ {
"key" : "unpaid",
"doc_count" : 4
}, {
"key" : "completed",
"doc_count" : 1
} ]
}
}
}
Here is my toy mapping definition:
{
"bookings" : {
"properties" : {
"status" : {
"type" : "keyword"
}
}
}
}
You need a filter aggregation.
{
"size": 0,
"aggs": {
"agg_name": {
"filter": {
"bool": {
"should": [
{
"terms": {
"status": [
"completed",
"ongoing"
]
}
}
]
}
}
}
}
}
Use the above query to get results like this:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 8,
"max_score": 0,
"hits": []
},
"aggregations": {
"agg_name": {
"doc_count": 6
}
}
}
The result you want is the doc_count.
For your reference, should in an Elasticsearch bool query works like OR conditions:
{
"query":{
"bool":{
"should":[
{"term":{"status":"completed"}},
{"term":{"status":"ongoing"}}
]
}
},
"aggs" : {
"booking_status" : {
"terms" : {
"field" : "status"
}
}
}
}
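A terms filter with two values and a bool/should of two single-value term clauses are meant to select the same documents. The sketch below models both in plain Python to make that equivalence concrete. The booking documents and function names are invented for the example, and the model ignores scoring and analysis, so it only applies to an exact-value (keyword) status field.

```python
# Hypothetical bookings; only the status field matters here.
docs = [
    {"id": 1, "status": "completed"},
    {"id": 2, "status": "ongoing"},
    {"id": 3, "status": "unpaid"},
    {"id": 4, "status": "ongoing"},
    {"id": 5, "status": "cancelled"},
]


def terms_filter(docs, field, values):
    """Model of {"terms": {field: values}}: keep docs whose value is in the list."""
    wanted = set(values)
    return [d for d in docs if d.get(field) in wanted]


def should_of_terms(docs, field, values):
    """Model of bool/should with one term clause per value: keep docs matching any clause."""
    return [d for d in docs if any(d.get(field) == v for v in values)]


selected = terms_filter(docs, "status", ["completed", "ongoing"])
print(len(selected))  # 3, the equivalent of doc_count in the filter aggregation
print(selected == should_of_terms(docs, "status", ["completed", "ongoing"]))  # True
```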

ElasticSearch: retrieving documents belonging to buckets

I am trying to retrieve documents for the past year, bucketed into buckets 1 month wide. I will take the documents for each 1 month bucket and then analyze them further (out of scope of my problem here). From the description, it seems "Bucket Aggregation" is the way to go, but in the "bucket" response I am getting only the count of documents in each bucket, and not the raw documents themselves. What am I missing?
GET command
{
"aggs" : {
"DateHistogram" : {
"date_histogram" : {
"field" : "timestamp",
"interval": "month"
}
}
},
"size" : 0
}
Resulting Output
{
"took" : 138,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1313058,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"DateHistogram" : {
"buckets" : [ {
"key_as_string" : "2015-02-01T00:00:00.000Z",
"key" : 1422748800000,
"doc_count" : 270
}, {
"key_as_string" : "2015-03-01T00:00:00.000Z",
"key" : 1425168000000,
"doc_count" : 459
},
(...and all the other months...)
{
"key_as_string" : "2016-03-01T00:00:00.000Z",
"key" : 1456790400000,
"doc_count" : 136009
} ]
}
}
}
You're almost there; you simply need to add a top_hits sub-aggregation in order to retrieve some documents for each bucket:
POST /your_index/_search
{
"aggs" : {
"DateHistogram" : {
"date_histogram" : {
"field" : "timestamp",
"interval": "month"
},
"aggs": { <--- add this
"docs": {
"top_hits": {
"size": 10
}
}
}
}
},
"size" : 0
}
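What the combined date_histogram and top_hits request returns can be pictured with a plain-Python model: every document is counted in its month bucket, but only up to "size" documents are kept per bucket. This is a simplified sketch with invented documents and a hypothetical function name. It keeps the first documents it sees, whereas top_hits orders them by score unless a sort is given.

```python
# Hypothetical documents with ISO-8601 timestamps.
docs = [
    {"id": 1, "timestamp": "2015-02-03T10:00:00.000Z"},
    {"id": 2, "timestamp": "2015-02-17T08:30:00.000Z"},
    {"id": 3, "timestamp": "2015-03-01T00:00:00.000Z"},
]


def month_buckets_with_docs(docs, per_bucket=10):
    """Bucket docs by calendar month, keeping at most per_bucket docs in each bucket."""
    buckets = {}
    for doc in docs:
        month = doc["timestamp"][:7]  # "YYYY-MM" prefix of the ISO timestamp
        bucket = buckets.setdefault(month, {"doc_count": 0, "docs": []})
        bucket["doc_count"] += 1  # every doc is counted
        if len(bucket["docs"]) < per_bucket:
            bucket["docs"].append(doc)  # but only the first few are returned
    return buckets


result = month_buckets_with_docs(docs, per_bucket=1)
for month, bucket in result.items():
    print(month, bucket["doc_count"], [d["id"] for d in bucket["docs"]])
```

The same cap applies in the real request: with "size": 10, a month holding 136009 documents still returns only ten of them, so retrieving every document of a bucket needs a different approach, such as a separate range query per month.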
