ElasticSearch: count by item in array - elasticsearch

Every document has a category array that looks like this:
[{
"id": 1,
"level": 1
}, {
"id": 2,
"level": 2
}, {
"id": 3,
"level": 3
}]
How can I count the categories I have in every document according to the level 3 category.id?

The category array should be mapped as a nested field; the rest can be handled with aggregations. Try something like the code below, where levelFilter holds the filter condition and count is the count aggregation. Note that fields inside a nested aggregation are referenced by their full path.
"aggregations": {
"mainAgg": {
"nested": {
"path": "category"
},
"aggs": {
"levelFilter": {
"filter": {
"term": {
"category.level": 3
}
},
"aggs": {
"count": {
"value_count": {
"field": "category.level"
}
}
}
}
}
}
}
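To make the intended result concrete, here is a minimal client-side sketch in Python of a per-id count over level-3 categories, assuming that is what the question is after. The sample documents and the function name `count_by_level_id` are made up for illustration; this is not Elasticsearch code.

```python
from collections import Counter

def count_by_level_id(docs, level=3):
    """Count, per category id, the category entries that have the given level."""
    counts = Counter()
    for doc in docs:
        # Each array entry is inspected on its own, like a nested document.
        for category in doc.get("category", []):
            if category["level"] == level:
                counts[category["id"]] += 1
    return counts

# Hypothetical sample documents, shaped like the array in the question.
docs = [
    {"category": [{"id": 1, "level": 1}, {"id": 2, "level": 2}, {"id": 3, "level": 3}]},
    {"category": [{"id": 1, "level": 1}, {"id": 2, "level": 2}, {"id": 3, "level": 3}]},
    {"category": [{"id": 1, "level": 1}, {"id": 5, "level": 2}, {"id": 7, "level": 3}]},
]

result = count_by_level_id(docs)
print(dict(result))
```

Inside Elasticsearch, a per-id breakdown like this would likely need a terms aggregation on the category id under the level filter, rather than a single value_count.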

Related

Elasticsearch aggregation over children document field values

I'm facing the following problem of selecting and sorting parent documents based on an aggregated value over its children documents. The aggregation (e.g. sum) itself depends on a query string, i.e. which children documents are relevant for the aggregation.
Example: Given the documents basket A and basket B, for each basket document, I am looking to sum over the number field of its fruit children if the name field matches my query, e.g. apples.
PUT /baskets/_doc/0
{
"name": "basket A",
"fruit": [
{
"name": "apples",
"number": 2
},
{
"name": "oranges",
"number": 3
}
]
}
PUT /baskets/_doc/1
{
"name": "basket B",
"fruit": [
{
"name": "apples",
"number": 3
},
{
"name": "apples",
"number": 3
}
]
}
Mappings:
PUT /baskets
{
"mappings": {
"properties": {
"name": { "type": "text" },
"fruit": {
"type": "nested",
"properties": {
"name": { "type": "text" },
"number": { "type": "long" }
}
}
}
}
}
Use case 1: Which basket has (strictly) more than 5 apples? Would expect only basket B
Use case 2: Sort baskets by number of apples. Would expect basket B with a total of 6 apples, then basket A with a total of 2 apples.
How can one implement this using the Elasticsearch (7.8.0) query DSL?
I have tried so far with nested queries and aggregations without success.
Thanks!
Edit: Added mappings
Edit: Updated the numbers to better reflect the problem
Edit: Added a possible answer to Use case 2 (see the comments on @joe's answer):
GET /baskets/_search
{
"aggs": {
"aggs_baskets": {
"terms": {
"field": "name",
"order": {"nest > fruit_filter > fruit_sum": "desc"}
},
"aggs": {
"nest":{
"nested":{
"path": "fruit"
},
"aggs":{
"fruit_filter":{
"filter": {
"term": {"fruit.name": "apples"}
},
"aggs":{
"fruit_sum":{
"sum": {"field": "fruit.number"}
}
}
}
}
}
}
}
}
}
Use case 1:
GET baskets/_search
{
"query": {
"nested": {
"path": "fruit",
"inner_hits": {},
"query": {
"bool": {
"must": [
{
"term": {
"fruit.name": {
"value": "apples"
}
}
},
{
"range": {
"fruit.number": {
"gte": 5
}
}
}
]
}
}
}
}
}
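The query above uses gte (5 or more); for strictly more than 5, use gt instead. Note that the range is checked against each nested fruit entry individually, not against the sum over all entries of a basket.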
Also notice the inner_hits part -- this gives you the actual nested subdocument which caused this particular basket to match the query. It's not required but good-to-know.
Use case 2:
GET baskets/_search
{
"sort": [
{
"fruit.number": {
"nested_path": "fruit",
"order": "desc"
}
}
]
}
Use case 2 Edit:
There are probably cleaner ways of doing this but I'd go with the following:
GET baskets/_search
{
"size": 0,
"aggs": {
"multiply_and_add": {
"scripted_metric": {
"params": {
"only_fruit_name": "apples"
},
"init_script": "state.by_basket_name = [:]",
"map_script": """
def basket_name = params._source['name'];
def fruits = params._source['fruit'].findAll(group -> group.name == params.only_fruit_name);
for (def fruit_group : fruits) {
def number = fruit_group.number;
if (state.by_basket_name.containsKey(basket_name)) {
state.by_basket_name[basket_name] += number;
} else {
state.by_basket_name[basket_name] = number;
}
}
""",
"combine_script": "return state.by_basket_name",
"reduce_script": "return states"
}
}
}
}
yielding a hash map along the lines of
{
...
"aggregations":{
"multiply_and_add":{
"value":[
{
"basket A":2,
"basket B":6
}
]
}
}
}
Sorting can be done either in the reduce_script or in your post-processing of the ES response. You could of course choose to go with (sorted) lists and lambdas...
Notice the required nested_path in the Use case 2 sort query above.
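The post-processing option for sorting could look roughly like the Python sketch below. It assumes the response has the shape shown above (one map per shard in the `value` list); the function name `sort_baskets` is hypothetical.

```python
# Response shaped like the scripted_metric output shown above.
response = {
    "aggregations": {
        "multiply_and_add": {
            "value": [
                {"basket A": 2, "basket B": 6}
            ]
        }
    }
}

def sort_baskets(response, agg_name="multiply_and_add"):
    """Merge the per-shard maps and sort baskets by total, highest first."""
    merged = {}
    # "return states" in the reduce_script hands back one map per shard,
    # so the maps are merged before sorting.
    for shard_state in response["aggregations"][agg_name]["value"]:
        for basket, total in shard_state.items():
            merged[basket] = merged.get(basket, 0) + total
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)

ranked = sort_baskets(response)
print(ranked)
```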
After some searching and testing, here are (in addition to @joe's answer to use case 2) possible queries for both use cases. Note that both use cases require changing the mapping of the basket's name field to type keyword.
Use case 1: Which basket has (strictly) more than 5 apples? Would expect only basket B
For more information on filtering results by their aggregation value, see Bucket Selectors.
GET /baskets/_search
{
"aggs": {
"aggs_baskets": {
"terms": {
"field": "name"
},
"aggs": {
"nest":{
"nested":{
"path": "fruit"
},
"aggs":{
"fruit_filter":{
"filter": {
"match": {"fruit.name": "apples"}
},
"aggs":{
"fruit_sum":{
"sum": {"field": "fruit.number"}
}
}
}
}
},
"basket_sum_filter":{
"bucket_selector":{
"buckets_path":{
"fruitSum":"nest > fruit_filter > fruit_sum"
},
"script":"params.fruitSum > 5"
}
}
}
}
}
}
... yielding
...,
"buckets": [
{
"key": "basket B",
"doc_count": 1,
"nest": {
"doc_count": 2,
"fruit_filter": {
"doc_count": 2,
"fruit_sum": {
"value": 6
}
}
}
}
]
Use case 2: Sort baskets by number of apples. Would expect basket B with a total of 6 apples, then basket A with a total of 2 apples.
GET /baskets/_search
{
"aggs": {
"aggs_baskets": {
"terms": {
"field": "name",
"order": {"nest > fruit_filter > fruit_sum": "desc"}
},
"aggs": {
"nest":{
"nested":{
"path": "fruit"
},
"aggs":{
"fruit_filter":{
"filter": {
"term": {"fruit.name": "apples"}
},
"aggs":{
"fruit_sum":{
"sum": {"field": "fruit.number"}
}
}
}
}
}
}
}
}
}
... yielding
...,
"buckets": [
{
"key": "basket B",
"doc_count": 1,
"nest": {
"doc_count": 2,
"fruit_filter": {
"doc_count": 2,
"fruit_sum": {
"value": 6
}
}
}
},
{
"key": "basket A",
"doc_count": 1,
"nest": {
"doc_count": 2,
"fruit_filter": {
"doc_count": 1,
"fruit_sum": {
"value": 2
}
}
}
}
]
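As a cross-check of the two results above, the following Python sketch reproduces the logic of the aggregation chain client-side, using the two basket documents from the question. It only mirrors the logic (nested, filter, sum, then selecting or ordering buckets); the helper names are made up and nothing here talks to Elasticsearch.

```python
baskets = [
    {"name": "basket A", "fruit": [{"name": "apples", "number": 2},
                                   {"name": "oranges", "number": 3}]},
    {"name": "basket B", "fruit": [{"name": "apples", "number": 3},
                                   {"name": "apples", "number": 3}]},
]

def fruit_sum(basket, fruit_name):
    # nested + filter + sum: add up "number" over the matching fruit entries only
    return sum(f["number"] for f in basket["fruit"] if f["name"] == fruit_name)

def totals_by_basket(baskets, fruit_name="apples"):
    # terms aggregation on "name": one bucket per basket
    return {b["name"]: fruit_sum(b, fruit_name) for b in baskets}

totals = totals_by_basket(baskets)

# Use case 1: the bucket_selector keeps buckets where fruitSum > 5
more_than_five = [name for name, total in totals.items() if total > 5]

# Use case 2: order the buckets by the summed value, descending
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)

print(more_than_five)
print(ranked)
```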

Group by terms and get count of nested array property?

I would like to get the count from a document series where an array item matches some value.
I have documents like these:
{
"Name": "jason",
"Todos": [{
"State": "COMPLETED",
"Timer": 10
},{
"State": "PENDING",
"Timer": 5
}]
}
{
"Name": "jason",
"Todos": [{
"State": "COMPLETED",
"Timer": 5
},{
"State": "PENDING",
"Timer": 2
}]
}
{
"Name": "martin",
"Todos": [{
"State": "COMPLETED",
"Timer": 15
},{
"State": "PENDING",
"Timer": 10
}]
}
I would like to count the documents that have at least one Todo in the COMPLETED state, grouped by Name.
So from the above I would need to get:
jason: 2
martin: 1
Usually I do this with a terms aggregation on Name and a sub-aggregation for the other items:
"aggs": {
"statistics": {
"terms": {
"field": "Name"
},
"aggs": {
"test": {
"filter": {
"bool": {
"must": [{
"match_phrase": {
"SomeProperty.keyword": {
"query": "THEVALUE"
}
}
}
]
}
},
But not sure how to do this here as I have items in an array.
Elasticsearch has no problem with arrays because in fact it flattens them by default:
Arrays of inner object fields do not work the way you may expect. Lucene has no concept of inner objects, so Elasticsearch flattens object hierarchies into a simple list of field names and values.
So a query like the one you posted will do. I would use term query for keyword datatype, though:
POST mytodos/_search
{
"size": 0,
"aggs": {
"by name": {
"terms": {
"field": "Name"
},
"aggs": {
"how many completed": {
"filter": {
"term": {
"Todos.State": "COMPLETED"
}
}
}
}
}
}
}
I am assuming your mapping looks something like this:
PUT mytodos/_mappings
{
"properties": {
"Name": {
"type": "keyword"
},
"Todos": {
"properties": {
"State": {
"type": "keyword"
},
"Timer": {
"type": "integer"
}
}
}
}
}
The example documents that you posted will be transformed internally into something like this:
{
"Name": "jason",
"Todos.State": ["COMPLETED", "PENDING"],
"Todos.Timer": [10, 5]
}
However, if you need to query Todos.State and Todos.Timer together, for example to filter for items that are "COMPLETED" and also have Timer > 10, that is not possible with this mapping, because Elasticsearch loses the link between the fields of each object in the array.
In that case you would need to map the array with the nested datatype and query it with the nested query.
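The lost association can be illustrated with a small Python sketch. It mimics the flattening described above and compares a flattened match with a per-item (nested-style) match. The document below is a hypothetical one chosen to show the false positive, and the function names are made up.

```python
def flatten(doc, array_field):
    """Mimic how an array of objects is indexed without the nested type."""
    flat = {key: value for key, value in doc.items() if key != array_field}
    for item in doc[array_field]:
        for key, value in item.items():
            flat.setdefault(f"{array_field}.{key}", []).append(value)
    return flat

def matches_flattened(flat):
    # Each condition is checked against the whole value list independently.
    return ("COMPLETED" in flat["Todos.State"]
            and any(timer > 10 for timer in flat["Todos.Timer"]))

def matches_nested(doc):
    # Both conditions must hold on the same array item.
    return any(todo["State"] == "COMPLETED" and todo["Timer"] > 10
               for todo in doc["Todos"])

# Hypothetical document: the COMPLETED item has Timer 5, the PENDING one 15.
doc = {"Name": "jason",
       "Todos": [{"State": "COMPLETED", "Timer": 5},
                 {"State": "PENDING", "Timer": 15}]}

flat = flatten(doc, "Todos")
print(flat)
print(matches_flattened(flat), matches_nested(doc))
```

The flattened check matches even though no single Todo is both COMPLETED and above 10, which is the behaviour the nested datatype is meant to avoid.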
Hope that helps!

Calculate the counts of last snapshot of a record in ElasticSearch

I am storing snapshots of data in Elasticsearch. I want to run a count aggregation on the latest snapshot of each entry, so that I know what state my current (latest) data is in.
I have something like this
[
{
"id": 2,
"state": "deleted",
"timestamp": "2019-11-20T18:18:09+00:00"
},
{
"id": 2,
"state": "published",
"timestamp": "2019-11-19T18:18:09+00:00"
},
{
"id": 3,
"state": "published",
"timestamp": "2019-10-17T18:18:09+00:00"
},
{
"id": 3,
"state": "draft",
"timestamp": "2019-10-16T18:18:09+00:00"
}
]
I tried this
POST /snapshots/_search
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"2": {
"terms": {
"field": "state.keyword"
},
"aggs": {
"1": {
"top_hits": {
"size": 1,
"sort": [
{
"timestamp": {
"order": "desc"
}
}
]
}
}
}
}
}
}
But the problem is that it first creates a bucket per state and only then sorts and computes top_hits inside each bucket, so instead of
deleted = 1
published = 1
draft = 0
It returns
deleted = 1
published = 1
draft = 1
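To pin down the desired result, here is a client-side Python sketch of the computation: keep only the newest snapshot per id, then count states. It is not an Elasticsearch query, just a reference for what the aggregation should return; the function name `latest_state_counts` and the fixed list of states are assumptions.

```python
from collections import Counter
from datetime import datetime

snapshots = [
    {"id": 2, "state": "deleted", "timestamp": "2019-11-20T18:18:09+00:00"},
    {"id": 2, "state": "published", "timestamp": "2019-11-19T18:18:09+00:00"},
    {"id": 3, "state": "published", "timestamp": "2019-10-17T18:18:09+00:00"},
    {"id": 3, "state": "draft", "timestamp": "2019-10-16T18:18:09+00:00"},
]

def latest_state_counts(snapshots, states=("deleted", "published", "draft")):
    """Count states using only the most recent snapshot of each id."""
    latest = {}  # id -> (timestamp, state) of the newest snapshot seen so far
    for snap in snapshots:
        ts = datetime.fromisoformat(snap["timestamp"])
        current = latest.get(snap["id"])
        if current is None or ts > current[0]:
            latest[snap["id"]] = (ts, snap["state"])
    counts = Counter(state for _, state in latest.values())
    # Report every known state, including those with a count of zero.
    return {state: counts.get(state, 0) for state in states}

counts = latest_state_counts(snapshots)
print(counts)
```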

Filter elasticsearch bucket aggregation based on term field

I have a list of products (deal entities) and I'm attempting to create a bucket aggregation by categories, ordered by the sum of available_stock.
This all works fine, but I want to exclude such categories from the resulting aggregation that don't have level set to 1 (In other words, I only want to keep aggregations on category where level IS 1).
I am aware that elasticsearch provides "exclude" and "include" parameters, but these only work on the same field I'm aggregating on (deal.category.id in this case)
This is my sample deal document:
{
"_source": {
"id": 392745,
"category": [
{
"id": 17575,
"level": 2
},
{
"id": 17574,
"level": 1
},
{
"id": 17572,
"level": 0
}
],
"stats": {
"available_stock": 500
}
}
}
And this would be the query:
{
"query": {
"filtered": {
"query": {
"match_all": {}
}
}
},
"aggs": {
"mainAggregation": {
"terms": {
"field": "deal.category.id",
"order": {
"available_stock": "desc"
},
"size": 3
},
"aggs": {
"available_stock": {
"sum": {
"field": "deal.stats.available_stock"
}
}
}
}
},
"size": 0
}
And my resulting aggregation, sadly including category 17572 with level 0.
{
"aggregations": {
"mainAggregation": {
"buckets": [
{
"key": 17572,
"doc_count": 30,
"available_stock": {
"value": 24000
}
},
{
"key": 17598,
"doc_count": 10,
"available_stock": {
"value": 12000
}
},
{
"key": 17602,
"doc_count": 8,
"available_stock": {
"value": 6000
}
}
]
}
}
}
P.S.: Currently on ElasticSearch 1.6
Update 1: Still stuck on the problem after experimenting with various combinations of sub-aggregations.
I found this impossible to solve in a single query and decided to go with two separate queries.
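The post does not say what the two queries were. One plausible reading, sketched client-side in Python under that assumption, is to first collect the ids of the level-1 categories and then aggregate stock only for those ids. The second deal below and both function names are hypothetical.

```python
from collections import defaultdict

deals = [
    # Sample document from the question.
    {"id": 392745,
     "category": [{"id": 17575, "level": 2}, {"id": 17574, "level": 1},
                  {"id": 17572, "level": 0}],
     "stats": {"available_stock": 500}},
    # Hypothetical second deal, added for illustration only.
    {"id": 392746,
     "category": [{"id": 17598, "level": 1}, {"id": 17572, "level": 0}],
     "stats": {"available_stock": 300}},
]

def level_category_ids(deals, level=1):
    """Step 1: collect the ids of all categories with the given level."""
    return {c["id"] for deal in deals for c in deal["category"]
            if c["level"] == level}

def stock_by_category(deals, allowed_ids, size=3):
    """Step 2: sum available_stock per category id, restricted to allowed_ids."""
    totals = defaultdict(int)
    for deal in deals:
        for c in deal["category"]:
            if c["id"] in allowed_ids:
                totals[c["id"]] += deal["stats"]["available_stock"]
    # Highest stock first, keeping the top `size` buckets.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:size]

ids = level_category_ids(deals)
buckets = stock_by_category(deals, ids)
print(buckets)
```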

Elasticsearch query fails to return results when querying a nested object

I have an object which looks something like this:
{
"id": 123,
"language_id": 1,
"label": "Pablo de la Pena",
"office": {
"count": 2,
"data": [
{
"id": 1234,
"is_office_lead": false,
"office": {
"id": 1,
"address_line_1": "123 Main Street",
"address_line_2": "London",
"address_line_3": "",
"address_line_4": "UK",
"address_postcode": "E1 2BC",
"city_id": 1
}
},
{
"id": 5678,
"is_office_lead": false,
"office": {
"id": 2,
"address_line_1": "77 High Road",
"address_line_2": "Edinburgh",
"address_line_3": "",
"address_line_4": "UK",
"address_postcode": "EH1 2DE",
"city_id": 2
}
}
]
},
"primary_office": {
"id": 1,
"address_line_1": "123 Main Street",
"address_line_2": "London",
"address_line_3": "",
"address_line_4": "UK",
"address_postcode": "E1 2BC",
"city_id": 1
}
}
My Elasticsearch mapping looks like this:
"mappings": {
"item": {
"properties": {
"office": {
"properties": {
"data": {
"type": "nested"
}
}
}
}
}
}
My Elasticsearch query looks something like this:
GET consultant/item/_search
{
"from": 0,
"size": 24,
"query": {
"bool": {
"must": [
{
"term": {
"language_id": 1
}
},
{
"term": {
"office.data.office.city_id": 1
}
}
]
}
}
}
This returns zero results, however, if I remove the second term and leave it only with the language_id clause, then it works as expected.
I'm sure this is down to a misunderstanding on my part of how the nested object is flattened, but I'm out of ideas. I've tried all kinds of permutations of the query and mappings.
Any guidance hugely appreciated. I am using Elasticsearch 6.1.1.
I'm not sure whether you need the entire record or not. This solution returns every record that has language_id: 1 and an office.data.office.city_id: 1 value.
GET consultant/item/_search
{
"from": 0,
"size": 100,
"query": {
"bool":{
"must": [
{
"term": {
"language_id": {
"value": 1
}
}
},
{
"nested": {
"path": "office.data",
"query": {
"match": {
"office.data.office.city_id": 1
}
}
}
}
]
}
}
}
I put three different records in my test index to guard against false hits: the matching one, one with a different language_id, and one with different office ids. Only the matching one was returned.
If you only need the office data, then that's a bit different but still solvable.
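For reference, the request body above can also be assembled programmatically. The Python sketch below builds the same bool query with a nested clause as a plain dictionary; the helper names `nested_term` and `build_query` are made up, and the term query inside the nested clause is a swapped-in choice for an exact numeric match (the answer above uses match).

```python
import json

def nested_term(path, field, value):
    """Wrap a term query in a nested query for the given nested path."""
    return {"nested": {"path": path, "query": {"term": {field: value}}}}

def build_query(language_id, city_id, size=24):
    """Build a search body: top-level term plus a nested clause on office.data."""
    return {
        "from": 0,
        "size": size,
        "query": {
            "bool": {
                "must": [
                    # Plain field on the root document.
                    {"term": {"language_id": language_id}},
                    # Field inside the nested array: must go through "nested".
                    nested_term("office.data", "office.data.office.city_id", city_id),
                ]
            }
        },
    }

body = build_query(language_id=1, city_id=1)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of a GET consultant/item/_search request.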
