Elasticsearch: find multiple unique values in array with complex objects

Suppose there is an index with documents following a structure like:
{
"array": [
{
"field1": 1,
"field2": 2
},
{
"field1": 3,
"field2": 2
},
{
"field1": 3,
"field2": 2
},
...
]
}
Is it possible to define a query that returns documents having multiple unique values for a field?
For the example above, a query on field2 would not return the document because all elements have the same value, but a query on field1 would return it because it has the values 1 and 3.
The only thing I can think of is to store the unique values in the parent object and then query for its length, but, as it seems trivial, I'd hope to solve it without having to change the structure to something like:
{
"arrayField1Values" : [1, 3],
"arrayField2Values" : [2],
"array": [
{
"field1": 1,
"field2": 2
},
{
"field1": 3,
"field2": 2
},
{
"field1": 3,
"field2": 2
},
...
]
}
Thanks to anybody who can help!
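For reference, the denormalization idea described in the question could be done as a pre-indexing step in application code. The following is only a sketch: the helper name add_unique_value_fields is hypothetical, and the generated field names simply follow the arrayField1Values pattern from the example.

```python
def add_unique_value_fields(doc, array_key="array"):
    """Return a copy of doc with one '<array_key><Field>Values' list per inner field.

    Hypothetical helper: the naming scheme mirrors the question's example.
    """
    enriched = dict(doc)
    uniques = {}
    for item in doc.get(array_key, []):
        for field, value in item.items():
            uniques.setdefault(field, set()).add(value)
    for field, values in uniques.items():
        name = f"{array_key}{field[0].upper()}{field[1:]}Values"
        enriched[name] = sorted(values)
    return enriched


doc = {
    "array": [
        {"field1": 1, "field2": 2},
        {"field1": 3, "field2": 2},
        {"field1": 3, "field2": 2},
    ]
}
enriched = add_unique_value_fields(doc)
print(enriched["arrayField1Values"])  # [1, 3]
print(enriched["arrayField2Values"])  # [2]
```

A document with more than one entry in arrayField1Values is then one that has multiple unique values for field1.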

My hunch was to go with a nested datatype but then I realized you could do a simple distinct count on the array-values of fields 1 and 2 using query scripts and top_hits:
PUT array
POST array/_doc
{
"array": [
{
"field1": 1,
"field2": 2
},
{
"field1": 3,
"field2": 2
},
{
"field1": 3,
"field2": 2
}
]
}
GET array/_search
{
"size": 0,
"aggs": {
"field1_is_unique": {
"filter": {
"script": {
"script": {
"source": "def uniques = doc['array.field1'].stream().distinct().sorted().collect(Collectors.toList()); return uniques.length > 1 ;",
"lang": "painless"
}
}
},
"aggs": {
"top_hits_field1": {
"top_hits": {}
}
}
},
"field2_is_unique": {
"filter": {
"script": {
"script": {
"source": "def uniques = doc['array.field2'].stream().distinct().sorted().collect(Collectors.toList()); return uniques.length > 1 ;",
"lang": "painless"
}
}
},
"aggs": {
"top_hits_field2": {
"top_hits": {}
}
}
}
}
}
This yields one aggregation per field, each containing only the documents in which that field has more than one distinct value:
"aggregations" : {
"field1_is_unique" : {
"doc_count" : 1,
"top_hits_field1" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "array",
"_type" : "_doc",
"_id" : "WbJhgnEBVBaNYdXKNktL",
"_score" : 1.0,
"_source" : {
"array" : [
{
"field1" : 1,
"field2" : 2
},
{
"field1" : 3,
"field2" : 2
},
{
"field1" : 3,
"field2" : 2
}
]
}
}
]
}
}
},
"field2_is_unique" : {
"doc_count" : 0,
"top_hits_field2" : {
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
}
}
Hope it helps.
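If there are more than two fields to check, the request body above gets repetitive. A minimal sketch of generating it in Python follows; the function names are made up for illustration, the Painless source is copied from the query above with only the field name substituted, and the resulting dict would still have to be sent with whatever client you use.

```python
import json


def multi_value_filter_agg(field_path):
    """Build one filter aggregation in the same shape as the query above."""
    leaf = field_path.split(".")[-1]
    # Same Painless script as in the answer; only the field name varies.
    source = (
        f"def uniques = doc['{field_path}'].stream().distinct().sorted()"
        ".collect(Collectors.toList()); return uniques.length > 1 ;"
    )
    return {
        "filter": {"script": {"script": {"source": source, "lang": "painless"}}},
        "aggs": {f"top_hits_{leaf}": {"top_hits": {}}},
    }


def build_search_body(field_paths):
    """Combine one filter aggregation per field into a single search body."""
    aggs = {}
    for path in field_paths:
        leaf = path.split(".")[-1]
        aggs[f"{leaf}_is_unique"] = multi_value_filter_agg(path)
    return {"size": 0, "aggs": aggs}


body = build_search_body(["array.field1", "array.field2"])
print(json.dumps(body, indent=2))
```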

Related

Sort Aggregated Buckets From Nested Object Array By Specific Field

I have indexed documents such as
// doc 1
{
...,
"list": [{
"value": "a",
"order": 1
}, {
"value": "b",
"order": 2
}]
,...
}
// doc 2
{
...,
"list": [{
"value": "b",
"order": 2
}, {
"value": "c",
"order": 3
}]
,...
}
If I use the aggregation on the list.value:
{
"aggs": {
"values": {
"terms": {
"field": "list.value.keyword"
}
}
}
}
I get buckets in order b, a, c:
"buckets" : [
{
"key" : "b",
"doc_count" : 2
},
{
"key" : "a",
"doc_count" : 1
},
{
"key" : "c",
"doc_count" : 1
}
]
as keys would be sorted by the _count in desc order.
If I use the aggregation on the list.value with sub-aggregation for sorting in form of max(list.order):
{
"aggs": {
"values": {
"terms": {
"field": "list.value.keyword",
"order": { "max_order": "desc" }
},
"aggs": {
"max_order": { "max": { "field": "list.order" } }
}
}
}
}
I get buckets in order b, c, a
"buckets" : [
{
"key" : "b",
"doc_count" : 2,
"max_order" : {
"value" : 3.0
}
},
{
"key" : "c",
"doc_count" : 1,
"max_order" : {
"value" : 3.0
}
},
{
"key" : "a",
"doc_count" : 1,
"max_order" : {
"value" : 2.0
}
}
]
as both b and c have max order 3 in their lists of the object.
However, I want to write a query to get buckets in order c, b, a as their order is 3, 2, 1 respectively. How to achieve that?
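You need to use a nested aggregation to get the buckets in the order c, b, a. With the default object mapping the array is flattened, so list.value and list.order lose their pairing within a document; mapping list as nested keeps each value together with its own order.
Here is a working example with the index mapping, index data, search query, and search response.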
Index Mapping
PUT testidx1
{
"mappings":{
"properties": {
"list":{
"type": "nested"
}
}
}
}
Index Data:
POST testidx1/_doc/1
{
"list": [
{
"value": "a",
"order": 1
},
{
"value": "b",
"order": 2
}
]
}
POST testidx1/_doc/2
{
"list": [
{
"value": "b",
"order": 2
},
{
"value": "c",
"order": 3
}
]
}
Search Query:
POST testidx1/_search
{
"size": 0,
"aggs": {
"resellers": {
"nested": {
"path": "list"
},
"aggs": {
"unique_values": {
"terms": {
"field": "list.value.keyword",
"order": {
"max_order": "desc"
}
},
"aggs": {
"max_order": {
"max": {
"field": "list.order"
}
}
}
}
}
}
}
}
Search Response:
"aggregations" : {
"resellers" : {
"doc_count" : 4,
"unique_values" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "c",
"doc_count" : 1,
"max_order" : {
"value" : 3.0
}
},
{
"key" : "b",
"doc_count" : 2,
"max_order" : {
"value" : 2.0
}
},
{
"key" : "a",
"doc_count" : 1,
"max_order" : {
"value" : 1.0
}
}
]
}
}
}
}
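The difference between the two results can be reproduced outside Elasticsearch. The sketch below is plain Python, not an Elasticsearch API: it imitates the flattened behavior (every value in a document sees every order in that document) and the nested behavior (each value only sees its own order), using the two sample documents.

```python
docs = [
    {"list": [{"value": "a", "order": 1}, {"value": "b", "order": 2}]},
    {"list": [{"value": "b", "order": 2}, {"value": "c", "order": 3}]},
]


def max_order_flattened(docs):
    """Object mapping: a bucket's max covers all orders of its documents."""
    result = {}
    for doc in docs:
        doc_max = max(item["order"] for item in doc["list"])
        for value in {item["value"] for item in doc["list"]}:
            result[value] = max(result.get(value, doc_max), doc_max)
    return result


def max_order_nested(docs):
    """Nested mapping: a bucket's max covers only the matching inner objects."""
    result = {}
    for doc in docs:
        for item in doc["list"]:
            value, order = item["value"], item["order"]
            result[value] = max(result.get(value, order), order)
    return result


flattened = max_order_flattened(docs)
nested = max_order_nested(docs)
print(flattened)  # a -> 2, b -> 3, c -> 3
print(nested)     # a -> 1, b -> 2, c -> 3
print(sorted(nested, key=nested.get, reverse=True))  # ['c', 'b', 'a']
```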

How to search in a list of results on elasticsearch

How do I search inside a list such as an indexed array? I'm new to Elasticsearch, so please also tell me the correct names of the concepts involved: should I say array, object, properties, or list in my question?
This is my result in Kibana when I run GET car/_doc/4
{
"_index" : "car",
"_type" : "_doc",
"_id" : "4",
"_version" : 2,
"_seq_no" : 7,
"_primary_term" : 1,
"found" : true,
"_source" : {
"id" : 4,
"user_id" : 7,
"ads" : {
"0" : {
"id" : 1,
"priority" : 1,
"city_id" : 83,
"model_id" : 13
},
"2" : {
"id" : 4,
"priority" : 2,
"city_id" : 54,
"model_id" : 23
}
},
"status" : 1
}
}
And this is my result for GET car/_doc/15
{
"_index" : "car",
"_type" : "_doc",
"_id" : "15",
"_version" : 2,
"_seq_no" : 27,
"_primary_term" : 1,
"found" : true,
"_source" : {
"id" : 15,
"user_id" : 24,
"ads" : [
{
"id" : 5,
"priority" : 4,
"city_id" : 42,
"model_id" : 11
}
],
"status" : 1
}
}
As you can see, ads comes in two shapes. My question is how to search for documents where status is 1 and (ads.city_id = 83 OR ads.0.city_id = 83).
I can use:
GET car/_search
{
"query": {
"bool": {
"must": [
{
"terms": {
"ads.city.slug": ["LA"]
}
}
]
}
}
}
But it doesn't work for the other type of ads and I need to use something like this:
GET car/_search
{
"query": {
"bool": {
"must": [
{
"terms": {
"ads.2.city.slug": ["NewYork"]
}
}
]
}
}
}
How can I write the query without spelling out the index of the ad (such as 2)?
As SagarPatel suggested in a comment, normalize the ads field into an array of objects (as it is for the doc with id 15 in your example). After that, create an index with a static mapping like this:
PUT /your-index-name
{
"mappings": {
"properties": {
"ads": {
"type": "nested"
}
}
}
}
(BTW it is advisable to define static mappings for other fields as well)
The nested type indexes each ads object as a separate inner document. See the Elastic docs.
After adding documents to the created index you can run queries as follows:
GET /your-index-name/_search
{
"query": {
"bool": {
"should": [
{
"nested": {
"path": "ads",
"query": {
"term": {
"ads.city_id": {
"value": 83
}
}
}
}
},
{
"nested": {
"path": "ads",
"query": {
"term": {
"ads.city_id": {
"value": 94
}
}
}
}
}
],
"minimum_should_match" : 1 // OR clause
}
}
}
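The normalization step mentioned above could be done in application code before re-indexing. This is a sketch under the assumption that the object-shaped ads always uses numeric strings such as "0" and "2" as keys; normalize_ads is a hypothetical helper name.

```python
def normalize_ads(source):
    """Return a copy of the document source with 'ads' as a list of objects.

    Assumes that when 'ads' is a dict, its keys are numeric strings.
    """
    ads = source.get("ads", [])
    if isinstance(ads, dict):
        # Keep the original ordering implied by the numeric keys.
        ads = [ads[key] for key in sorted(ads, key=int)]
    normalized = dict(source)
    normalized["ads"] = list(ads)
    return normalized


doc_4 = {
    "id": 4,
    "user_id": 7,
    "ads": {
        "0": {"id": 1, "priority": 1, "city_id": 83, "model_id": 13},
        "2": {"id": 4, "priority": 2, "city_id": 54, "model_id": 23},
    },
    "status": 1,
}
doc_15 = {
    "id": 15,
    "user_id": 24,
    "ads": [{"id": 5, "priority": 4, "city_id": 42, "model_id": 11}],
    "status": 1,
}

print([ad["city_id"] for ad in normalize_ads(doc_4)["ads"]])   # [83, 54]
print([ad["city_id"] for ad in normalize_ads(doc_15)["ads"]])  # [42]
```

Once every document has the list shape, a single nested query on ads.city_id covers both cases.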

Count number of inner elements of array property (Including repeated values)

Given I have the following records.
[
{
"profile": "123",
"inner": [
{
"name": "John"
}
]
},
{
"profile": "456",
"inner": [
{
"name": "John"
},
{
"name": "John"
},
{
"name": "James"
}
]
}
]
I want to get something like:
"aggregations": {
"name": {
"buckets": [
{
"key": "John",
"doc_count": 3
},
{
"key": "James",
"doc_count": 1
}
]
}
}
I'm a beginner using Elasticsearch, and this seems to be a pretty simple operation to do, but I can't find how to achieve this.
If I try a simple aggs using term, it returns 2 for John, instead of 3.
Example request I'm trying:
{
"size": 0,
"aggs": {
"name": {
"terms": {
"field": "inner.name"
}
}
}
}
How can I possibly achieve this?
Additional Info: It will be used on Kibana later.
I can change mapping to whatever I want, but AFAIK Kibana doesn't like the "Nested" type. :(
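You need a value_count sub-aggregation. By default, terms only reports doc_count (the number of documents containing the term), whereas value_count counts the values of a given field. Note that it counts the values of inner.name across all documents in a bucket, not only those equal to the bucket key, so check the totals against your data (James shows 2 below although it appears once).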
So, for your purposes:
{
"size": 0,
"aggs": {
"name": {
"terms": {
"field": "inner.name"
},
"aggs": {
"total": {
"value_count": {
"field": "inner.name"
}
}
}
}
}
}
Which returns:
"aggregations" : {
"name" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "John",
"doc_count" : 2,
"total" : {
"value" : 3
}
},
{
"key" : "James",
"doc_count" : 1,
"total" : {
"value" : 2
}
}
]
}
}
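To make the difference between a per-document count and a per-occurrence count concrete, here is a plain-Python illustration on the sample records. It does not call Elasticsearch; it only shows which numbers the question is after (per occurrence) versus what doc_count reports (per document).

```python
from collections import Counter

records = [
    {"profile": "123", "inner": [{"name": "John"}]},
    {
        "profile": "456",
        "inner": [{"name": "John"}, {"name": "John"}, {"name": "James"}],
    },
]

per_document = Counter()    # what doc_count reports: documents containing the name
per_occurrence = Counter()  # what the question asks for: every repeated value counts

for record in records:
    names = [item["name"] for item in record["inner"]]
    per_occurrence.update(names)
    per_document.update(set(names))

print(per_document["John"], per_document["James"])      # 2 1
print(per_occurrence["John"], per_occurrence["James"])  # 3 1
```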

How to get per term statistics in Elasticsearch

I need to implement the following (on the backend): a user types a query and gets back hits as well as statistics for the hits. Below is a simplified example.
Suppose the query is Grif, then the user gets back (random words just for example)
Griffith
Griffin
Grif
Grift
Griffins
And frequency + number of documents a certain term occurs in, for example:
Griffith (freq 10, 3 docs)
Griffin (freq 17, 9 docs)
Grif (freq 6, 3 docs)
Grift (freq 9, 5 docs)
Griffins (freq 11, 4 docs)
I'm relatively new to Elasticsearch, so I'm not sure where to start to implement something like this. What type of query is the most suitable for this? What can I use to get that kind of statistics? Any other advice will be appreciated too.
There are multiple layers to this. You'd need:
n-gram / partial / search-as-you-type matching
a way to group the matched keywords by their original form
a mechanism to look up the document & term frequencies for those keywords.
I'm not aware of any way to achieve this in one go, but here's my take on it.
You could start off with a special, n-gram-powered analyzer, as explained in my other answer. There's the original content field, plus a multi-field mapping for the said analyzer, plus a keyword field to aggregate on down the line:
PUT my-index
{
"settings": {
"index": {
"max_ngram_diff": 20
},
"analysis": {
"tokenizer": {
"my_ngrams": {
"type": "ngram",
"min_gram": 3,
"max_gram": 20,
"token_chars": [
"letter",
"digit"
]
}
},
"analyzer": {
"my_ngrams_analyzer": {
"tokenizer": "my_ngrams",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"content": {
"type": "text",
"fields": {
"analyzed": {
"type": "text",
"analyzer": "my_ngrams_analyzer"
},
"keyword": {
"type": "keyword"
}
}
}
}
}
}
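To see why a term query for grif can match documents like Griffith once this analyzer is in place, here is a rough Python imitation of what the tokenizer and the lowercase filter produce. It is only an approximation for illustration (it ignores token positions and offsets), and ngram_tokens is a made-up helper name.

```python
import re


def ngram_tokens(text, min_gram=3, max_gram=20):
    """Approximate the n-grams emitted for letters and digits, lowercased."""
    tokens = set()
    # Split on anything that is not a letter or a digit, like token_chars does.
    for word in re.findall(r"[^\W_]+", text):
        word = word.lower()
        for size in range(min_gram, min(max_gram, len(word)) + 1):
            for start in range(len(word) - size + 1):
                tokens.add(word[start:start + size])
    return tokens


print(sorted(ngram_tokens("Grif")))       # ['gri', 'grif', 'rif']
print("grif" in ngram_tokens("Griffith"))  # True
print("grif" in ngram_tokens("Grift"))     # True
```

Every sample word starting with Grif yields the token grif, which is what the search in the next step relies on.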
Next, bulk-insert some sample docs containing text inside the content field. Note that each doc has an _id too — you'll need those later on.
POST _bulk
{"index":{"_index":"my-index", "_id":1}}
{"content":"Griffith"}
{"index":{"_index":"my-index", "_id":2}}
{"content":"Griffin"}
{"index":{"_index":"my-index", "_id":3}}
{"content":"Grif"}
{"index":{"_index":"my-index", "_id":4}}
{"content":"Grift"}
{"index":{"_index":"my-index", "_id":5}}
{"content":"Griffins"}
{"index":{"_index":"my-index", "_id":6}}
{"content":"Griffith"}
{"index":{"_index":"my-index", "_id":7}}
{"content":"Griffins"}
Search for n-grams in the .analyzed field and group the matched documents by the original terms through the terms aggregation. At the same time, retrieve the _id of one of the bucketed documents through the top_hits aggregation. By the way, it doesn't matter which _id is returned in a given bucket, since every document in that bucket contains the same bucketed term.
POST my-index/_search?filter_path=aggregations.*.buckets.key,aggregations.*.buckets.doc_count,aggregations.*.buckets.*.hits.hits._id
{
"size": 0,
"query": {
"term": {
"content.analyzed": "grif"
}
},
"aggs": {
"full_terms": {
"terms": {
"field": "content.keyword",
"size": 10
},
"aggs": {
"top_doc": {
"top_hits": {
"size": 1,
"_source": false
}
}
}
}
}
}
Observe the response. The filter_path URL parameter from the previous request reduces the response to just those attributes that we need — the untouched, original full_terms plus one of the underlying IDs:
{
"aggregations" : {
"full_terms" : {
"buckets" : [
{
"key" : "Griffins",
"doc_count" : 2,
"top_doc" : {
"hits" : {
"hits" : [
{
"_id" : "5"
}
]
}
}
},
{
"key" : "Griffith",
"doc_count" : 2,
"top_doc" : {
"hits" : {
"hits" : [
{
"_id" : "1"
}
]
}
}
},
{
"key" : "Grif",
"doc_count" : 1,
"top_doc" : {
"hits" : {
"hits" : [
{
"_id" : "3"
}
]
}
}
},
{
"key" : "Griffin",
"doc_count" : 1,
"top_doc" : {
"hits" : {
"hits" : [
{
"_id" : "2"
}
]
}
}
},
{
"key" : "Grift",
"doc_count" : 1,
"top_doc" : {
"hits" : {
"hits" : [
{
"_id" : "4"
}
]
}
}
}
]
}
}
}
Time for the fun part.
There's a specialized Elasticsearch API called Term Vectors which does exactly what you're after — it retrieves field & term stats from the whole index. In order for it to hand these stats over to you, it needs the document IDs — which you'll have obtained from the above aggregation!
Finally, since you've got multiple term vectors to work with, you can use the Multi term vectors API like so, again condensing the response through filter_path. Listing the _ids in the same order as the buckets above guarantees the response order:
POST /my-index/_mtermvectors?filter_path=docs.term_vectors.*.*.*.doc_freq,docs.term_vectors.*.*.*.term_freq
{
"docs": [
{
"_id": "5",
"fields": [
"content.keyword"
],
"payloads": false,
"positions": false,
"offsets": false,
"field_statistics": false,
"term_statistics": true
},
{
"_id": "1",
"fields": [
"content.keyword"
],
"payloads": false,
"positions": false,
"offsets": false,
"field_statistics": false,
"term_statistics": true
},
{
"_id": "3",
"fields": [
"content.keyword"
],
"payloads": false,
"positions": false,
"offsets": false,
"field_statistics": false,
"term_statistics": true
},
{
"_id": "2",
"fields": [
"content.keyword"
],
"payloads": false,
"positions": false,
"offsets": false,
"field_statistics": false,
"term_statistics": true
},
{
"_id": "4",
"fields": [
"content.keyword"
],
"payloads": false,
"positions": false,
"offsets": false,
"field_statistics": false,
"term_statistics": true
}
]
}
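Rather than writing this request body by hand, the docs list could be derived from the aggregation response of the previous step. A minimal sketch, assuming the aggregation names full_terms and top_doc used above; build_mtermvectors_body is a hypothetical helper and only builds the dict, it does not send anything.

```python
def build_mtermvectors_body(agg_response, agg_name="full_terms",
                            field="content.keyword"):
    """Turn the bucketed aggregation response into a multi term vectors body."""
    docs = []
    for bucket in agg_response["aggregations"][agg_name]["buckets"]:
        # One representative document per bucket, in bucket order.
        doc_id = bucket["top_doc"]["hits"]["hits"][0]["_id"]
        docs.append({
            "_id": doc_id,
            "fields": [field],
            "payloads": False,
            "positions": False,
            "offsets": False,
            "field_statistics": False,
            "term_statistics": True,
        })
    return {"docs": docs}


sample_response = {
    "aggregations": {
        "full_terms": {
            "buckets": [
                {"key": "Griffins", "doc_count": 2,
                 "top_doc": {"hits": {"hits": [{"_id": "5"}]}}},
                {"key": "Griffith", "doc_count": 2,
                 "top_doc": {"hits": {"hits": [{"_id": "1"}]}}},
            ]
        }
    }
}
mtv_body = build_mtermvectors_body(sample_response)
print([doc["_id"] for doc in mtv_body["docs"]])  # ['5', '1']
```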
The result can be post-processed in your backend to form your autocomplete response. You've got A) the full terms, B) the number of matching documents (doc_freq), and C) the term frequency (term_freq):
{
"docs" : [
{
"term_vectors" : {
"content.keyword" : {
"terms" : {
"Griffins" : { | term
"doc_freq" : 2, | <-- # of docs
"term_freq" : 1 | term frequency
}
}
}
}
},
{
"term_vectors" : {
"content.keyword" : {
"terms" : {
"Griffith" : {
"doc_freq" : 2,
"term_freq" : 1
}
}
}
}
},
{
"term_vectors" : {
"content.keyword" : {
"terms" : {
"Grif" : {
"doc_freq" : 1,
"term_freq" : 1
}
}
}
}
},
{
"term_vectors" : {
"content.keyword" : {
"terms" : {
"Griffin" : {
"doc_freq" : 1,
"term_freq" : 1
}
}
}
}
},
{
"term_vectors" : {
"content.keyword" : {
"terms" : {
"Grift" : {
"doc_freq" : 1,
"term_freq" : 1
}
}
}
}
}
]
}
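The post-processing mentioned above might look roughly like this in a Python backend. It is a sketch that assumes the response has exactly the filtered shape shown above; summarize_terms is a hypothetical name, and the output format imitates the one from the question.

```python
def summarize_terms(mtv_response, field="content.keyword"):
    """Flatten a filtered multi term vectors response into per-term stats."""
    summary = []
    for doc in mtv_response["docs"]:
        for term, stats in doc["term_vectors"][field]["terms"].items():
            summary.append({
                "term": term,
                "docs": stats["doc_freq"],
                "freq": stats["term_freq"],
            })
    return summary


sample = {
    "docs": [
        {"term_vectors": {"content.keyword": {"terms": {
            "Griffins": {"doc_freq": 2, "term_freq": 1}}}}},
        {"term_vectors": {"content.keyword": {"terms": {
            "Grif": {"doc_freq": 1, "term_freq": 1}}}}},
    ]
}

stats = summarize_terms(sample)
for entry in stats:
    print(f"{entry['term']} (freq {entry['freq']}, {entry['docs']} docs)")
```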

Elasticsearch nested sort based on minimum values of child of child arrays

I've two orders and these orders have multiple shipments and shipments have multiple products.
How can I sort the orders based on the minimum product.quantity in a shipment?
For example, when sorting in ascending order, orderNo = 2 should be listed first because it has a shipment that contains a product with quantity = 1, which is the minimum among all product.quantity values (productName doesn't matter).
{
"orders": [
{
"orderNo": "1",
"shipments": [
{
"products": [
{
"productName": "AAA",
"quantity": "2"
},
{
"productName": "AAA",
"quantity": "2"
}
]
},
{
"products": [
{
"productName": "AAA",
"quantity": "3"
},
{
"productName": "AAA",
"quantity": "6"
}
]
}
]
},
{
"orderNo": "2",
"shipments": [
{
"products": [
{
"productName": "AAA",
"quantity": "1"
},
{
"productName": "AAA",
"quantity": "6"
}
]
},
{
"products": [
{
"productName": "AAA",
"quantity": "4"
},
{
"productName": "AAA",
"quantity": "5"
}
]
}
]
}
]
}
Assuming that each order is a separate document, you could create an order-focused index where both shipments and products are nested fields to prevent array flattening.
The minimal index mapping could then look like:
PUT orders
{
"mappings": {
"properties": {
"shipments": {
"type": "nested",
"properties": {
"products": {
"type": "nested"
}
}
}
}
}
}
The next step is to ensure the quantity is always numeric -- not a string. When that's done, insert said docs:
POST orders/_doc
{"orderNo":"1","shipments":[{"products":[{"productName":"AAA","quantity":2},{"productName":"AAA","quantity":2}]},{"products":[{"productName":"AAA","quantity":3},{"productName":"AAA","quantity":6}]}]}
POST orders/_doc
{"orderNo":"2","shipments":[{"products":[{"productName":"AAA","quantity":1},{"productName":"AAA","quantity":6}]},{"products":[{"productName":"AAA","quantity":4},{"productName":"AAA","quantity":5}]}]}
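The string-to-number conversion mentioned above could be applied to each order before indexing. A small sketch, assuming every quantity is a string holding a whole number; coerce_quantities is a hypothetical helper.

```python
import copy


def coerce_quantities(order):
    """Return a copy of the order with every product quantity as an int."""
    fixed = copy.deepcopy(order)
    for shipment in fixed.get("shipments", []):
        for product in shipment.get("products", []):
            product["quantity"] = int(product["quantity"])
    return fixed


raw_order = {
    "orderNo": "2",
    "shipments": [
        {"products": [{"productName": "AAA", "quantity": "1"},
                      {"productName": "AAA", "quantity": "6"}]},
        {"products": [{"productName": "AAA", "quantity": "4"},
                      {"productName": "AAA", "quantity": "5"}]},
    ],
}
clean_order = coerce_quantities(raw_order)
print([p["quantity"] for s in clean_order["shipments"] for p in s["products"]])
# [1, 6, 4, 5]
```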
Finally, you can use nested sorting:
POST orders/_search
{
"sort": [
{
"shipments.products.quantity": {
"nested": {
"path": "shipments.products"
},
"order": "asc"
}
}
]
}
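For an ascending sort, Elasticsearch picks the minimum of the nested values by default, so the sort above orders each document by its smallest product quantity. The same ordering, imitated in plain Python on the two sample orders (an illustration only, not an Elasticsearch call):

```python
orders = [
    {"orderNo": "1", "shipments": [
        {"products": [{"productName": "AAA", "quantity": 2},
                      {"productName": "AAA", "quantity": 2}]},
        {"products": [{"productName": "AAA", "quantity": 3},
                      {"productName": "AAA", "quantity": 6}]},
    ]},
    {"orderNo": "2", "shipments": [
        {"products": [{"productName": "AAA", "quantity": 1},
                      {"productName": "AAA", "quantity": 6}]},
        {"products": [{"productName": "AAA", "quantity": 4},
                      {"productName": "AAA", "quantity": 5}]},
    ]},
]


def min_quantity(order):
    """Smallest product quantity across all shipments of one order."""
    return min(
        product["quantity"]
        for shipment in order["shipments"]
        for product in shipment["products"]
    )


sorted_orders = sorted(orders, key=min_quantity)
print([(o["orderNo"], min_quantity(o)) for o in sorted_orders])
# [('2', 1), ('1', 2)]
```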
Tip: To make the query even more useful, you could introduce sorted inner_hits to not only sort the top-level orders but also the individual products enclosed in a given order. These inner hits need a nested query so you could simply add a non-negative condition on shipments.products.quantity.
When you combine this query with the above sort and restrict the response to only relevant attributes with filter_path:
POST orders/_search?filter_path=hits.hits._id,hits.hits._source.orderNo,hits.hits.inner_hits.*.hits.hits._source
{
"_source": ["orderNo", "non_negative_quantities"],
"query": {
"nested": {
"path": "shipments.products",
"inner_hits": {
"name": "non_negative_quantities",
"sort": {
"shipments.products.quantity": "asc"
}
},
"query": {
"range": {
"shipments.products.quantity": {
"gte": 0
}
}
}
}
},
"sort": [
{
"shipments.products.quantity": {
"nested": {
"path": "shipments.products"
},
"order": "asc"
}
}
]
}
you'll end up with both sorted orders AND sorted products:
{
"hits" : {
"hits" : [
{
"_id" : "gVc0BHgBly0XYOUcZ4vd",
"_source" : {
"orderNo" : "2" <---
},
"inner_hits" : {
"non_negative_quantities" : {
"hits" : {
"hits" : [
{
"_source" : {
"quantity" : 1, <---
"productName" : "AAA"
}
},
{
"_source" : {
"quantity" : 4, <---
"productName" : "AAA"
}
},
{
"_source" : {
"quantity" : 5, <---
"productName" : "AAA"
}
}
]
}
}
}
},
{
"_id" : "gFc0BHgBly0XYOUcYosz",
"_source" : {
"orderNo" : "1"
},
"inner_hits" : {
"non_negative_quantities" : {
"hits" : {
"hits" : [
{
"_source" : {
"quantity" : 2,
"productName" : "AAA"
}
},
{
"_source" : {
"quantity" : 2,
"productName" : "AAA"
}
},
{
"_source" : {
"quantity" : 3,
"productName" : "AAA"
}
}
]
}
}
}
}
]
}
}
