Find distinct inner objects in Elasticsearch - elasticsearch

We're trying to find distinct inner objects in Elasticsearch. Here is a minimal example of our case.
We're stuck with something like the following mapping (changing types or indices or adding new fields wouldn't be a problem, but the structure should remain as it is):
{
"building": {
"properties": {
"street": {
"type": "string",
"store": "yes",
"index": "not_analyzed"
},
"house number": {
"type": "string",
"store": "yes",
"index": "not_analyzed"
},
"city": {
"type": "string",
"store": "yes",
"index": "not_analyzed"
},
"people": {
"type": "object",
"store": "yes",
"index": "not_analyzed",
"properties": {
"firstName": {
"type": "string",
"store": "yes",
"index": "not_analyzed"
},
"lastName": {
"type": "string",
"store": "yes",
"index": "not_analyzed"
}
}
}
}
}
}
Assuming we have this example data:
{
"buildings": [
{
"street": "Baker Street",
"house number": "221 B",
"city": "London",
"people": [
{
"firstName": "John",
"lastName": "Doe"
},
{
"firstName": "Jane",
"lastName": "Doe"
}
]
},
{
"street": "Baker Street",
"house number": "5",
"city": "London",
"people": [
{
"firstName": "John",
"lastName": "Doe"
}
]
},
{
"street": "Garden Street",
"house number": "1",
"city": "London",
"people": [
{
"firstName": "Jane",
"lastName": "Smith"
}
]
}
]
}
When we query for the street "Baker Street" (plus whatever additional options are needed), we expect to get the following list:
[
{
"firstName": "John",
"lastName": "Doe"
},
{
"firstName": "Jane",
"lastName": "Doe"
}
]
The format does not matter much, as long as we can parse the first and last name. Because our actual data set is much larger, we need the entries to be distinct.
We are using Elasticsearch 1.7.

We finally solved our problem.
Our solution is (as we expected) a pre-calculated people_all field. Instead of using copy_to or transform, we simply write it alongside the other fields when importing our data. The field looks as follows:
"people": {
"type": "nested",
..
"properties": {
"firstName": {
"type": "string",
"store": "yes",
"index": "not_analyzed"
},
"lastName": {
"type": "string",
"store": "yes",
"index": "not_analyzed"
},
"people_all": {
"type": "string",
"index": "not_analyzed"
}
}
}
Please pay attention to the "index": "not_analyzed" setting on the people_all field. It is needed to get complete buckets. Without it, the value is analyzed into separate tokens and our example returns 3 buckets: "john", "jane" and "doe".
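One way the field might be written at import time is sketched below. The importer itself is not shown in this post, so the function name add_people_all and the choice of Python are assumptions for illustration only. The value is the first and last name joined with a single space, which matches the bucket keys in the response shown later.

```python
def add_people_all(building):
    """Return a copy of a building document with people_all set on each person.

    Hypothetical helper: the real import code is not part of this post.
    """
    enriched_people = []
    for person in building.get("people", []):
        enriched = dict(person)
        # Join first and last name with one space, e.g. "John Doe".
        enriched["people_all"] = person["firstName"] + " " + person["lastName"]
        enriched_people.append(enriched)
    result = dict(building)
    result["people"] = enriched_people
    return result


building = {
    "street": "Baker Street",
    "house number": "221 B",
    "city": "London",
    "people": [
        {"firstName": "John", "lastName": "Doe"},
        {"firstName": "Jane", "lastName": "Doe"},
    ],
}
indexed_doc = add_people_all(building)
```

The original document is left untouched, so the same input can be reused if the import has to be retried.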
After writing this new field we can run an aggregation as follows:
{
"size": 0,
"query": {
"term": {
"street": "Baker Street"
}
},
"aggs": {
"people_distinct": {
"nested": {
"path": "people"
},
"aggs": {
"people_all_distinct": {
"terms": {
"field": "people.people_all",
"size": 0
}
}
}
}
}
}
And we get the following response:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.0,
"hits": []
},
"aggregations": {
"people_distinct": {
"doc_count": 3,
"people_name_distinct": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "John Doe",
"doc_count": 2
},
{
"key": "Jane Doe",
"doc_count": 1
}
]
}
}
}
}
From the buckets in the response we can now create the distinct people objects.
Please let us know if there is a better way to reach our goal.
Parsing the buckets is not an optimal solution; it would be nicer to have the fields firstName and lastName in each bucket.
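For illustration, the bucket parsing could look roughly like the Python sketch below. The helper name distinct_people and its parameters are hypothetical, and the split assumes that a first name never contains a space, which may not hold for real data (a separator that cannot occur in names would be safer).

```python
def distinct_people(response, outer, inner):
    """Turn terms buckets keyed by "<firstName> <lastName>" into person objects.

    Assumption: the first name contains no space, so the key is split
    at the first space only.
    """
    buckets = response["aggregations"][outer][inner]["buckets"]
    people = []
    for bucket in buckets:
        first_name, _, last_name = bucket["key"].partition(" ")
        people.append({"firstName": first_name, "lastName": last_name})
    return people


# Trimmed-down version of the response shown above.
sample_response = {
    "aggregations": {
        "people_distinct": {
            "doc_count": 3,
            "people_all_distinct": {
                "buckets": [
                    {"key": "John Doe", "doc_count": 2},
                    {"key": "Jane Doe", "doc_count": 1},
                ]
            },
        }
    }
}
people = distinct_people(sample_response, "people_distinct", "people_all_distinct")
```

The aggregation names are passed in rather than hard-coded, so the helper works whatever names the request used.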

As suggested in the comments, your mapping of people should be of type nested rather than object, because the object type can give unexpected results. You also need to reindex your data after changing it.
As for the question, you need to aggregate the results of your query:
{
"query": {
"term": {
"street": "Baker Street"
}
},
"aggs": {
"distinct_people": {
"terms": {
"field": "people",
"size": 1000
}
}
}
}
Please note that I have set size to 1000 inside the aggregation. You might have to use a bigger number to get all distinct people, because ES returns only 10 buckets by default.
You could set the query size to 0 or use the parameter search_type=count if you are only interested in the aggregated buckets.
You can read more about aggregations here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
I hope this helps! Let me know if this does not work out.

Related

How can i do nested field queries in Elastic search using Lucene query syntax

Here is a simple use case.
I have a system that sends Lucene queries to Elasticsearch. I have this mapping:
{
"mappings": {
"properties": {
"grocery_name":{
"type": "text"
},
"items": {
"type": "nested",
"properties": {
"name": {
"type": "text"
},
"stock": {
"type": "integer"
},
"category": {
"type": "text"
}
}
}
}
}
}
and the data looks like
{
"grocery_name": "Elastic Eats",
"items": [
{
"name": "banana",
"stock": "12",
"category": "fruit"
},
{
"name": "peach",
"stock": "10",
"category": "fruit"
},
{
"name": "carrot",
"stock": "9",
"category": "vegetable"
},
{
"name": "broccoli",
"stock": "5",
"category": "vegetable"
}
]
}
How can I query for all items where the item name is banana and stock > 10? In KQL I can write something like items:{ name:banana and stock > 10 }
The Lucene expression language doesn't support querying nested documents. That's why the KQL language fills that gap.
That's currently the only way to query nested documents via the Kibana search bar.
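If the calling system can send a Query DSL body instead of a Lucene query string (an assumption, since the question only mentions Lucene), the same condition can be expressed with a nested query. The Python sketch below only builds the request body. The helper name nested_items_query is hypothetical, and inner_hits is included so that the matching items are returned alongside the parent document.

```python
import json


def nested_items_query(name, min_stock):
    """Build a search body matching items with the given name and stock > min_stock.

    Both conditions sit inside one nested query, so they must hold
    for the same item rather than for any two items of the document.
    """
    return {
        "query": {
            "nested": {
                "path": "items",
                "query": {
                    "bool": {
                        "must": [
                            {"match": {"items.name": name}},
                            {"range": {"items.stock": {"gt": min_stock}}},
                        ]
                    }
                },
                # Ask for the matching nested items themselves.
                "inner_hits": {},
            }
        }
    }


body = nested_items_query("banana", 10)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request against the index that uses the mapping above.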

How to do aggregation on nested objects - Elasticsearch

I'm pretty new to Elasticsearch so please bear with me.
This is part of my document in ES.
{
"source": {
"detail": {
"attribute": {
"Size": ["32 Gb",4],
"Type": ["Tools",4],
"Brand": ["Sandisk",4],
"Color": ["Black",4],
"Model": ["Sdcz36-032g-b35",4],
"Manufacturer": ["Sandisk",4]
}
},
"title": {
"list": [
"Sandisk Cruzer 32gb Usb 32 Gb Flash Drive , Black - Sdcz36-032g"
]
}
}
}
So what I want to achieve is to find the best three or top three hits of the attribute object. For example, if I do a search for "sandisk", I want to get three attributes like ["Size", "Color", "Model"] or whatever attributes based on the top hits aggregation.
So I ran a query like this:
{
"size": 0,
"aggs": {
"categoryList": {
"filter": {
"bool": {
"filter": [
{
"term": {
"title.list": "sandisk"
}
}
]
}
},
"aggs": {
"results": {
"terms": {
"field": "detail.attribute",
"size": 3
}
}
}
}
}
}
But it does not seem to work. How do I fix this? Any hints would be much appreciated.
This is the _mappings. It is not the complete one, but I guess this would suffice.
{
"catalog2_0": {
"mappings": {
"product": {
"dynamic": "strict",
"dynamic_templates": [
{
"attributes": {
"path_match": "detail.attribute.*",
"mapping": {
"type": "text"
}
}
}
],
"properties": {
"detail": {
"properties": {
"attMaxScore": {
"type": "scaled_float",
"scaling_factor": 100
},
"attribute": {
"dynamic": "true",
"properties": {
"Brand": {
"type": "text"
},
"Color": {
"type": "text"
},
"MPN": {
"type": "text"
},
"Manufacturer": {
"type": "text"
},
"Model": {
"type": "text"
},
"Operating System": {
"type": "text"
},
"Size": {
"type": "text"
},
"Type": {
"type": "text"
}
}
},
"description": {
"type": "text"
},
"feature": {
"type": "text"
},
"tag": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
},
"title": {
"properties": {
"en": {
"type": "text"
}
}
}
}
}
}
}
}
According to the documentation, you can't aggregate on a field that has the text datatype; it must have the keyword datatype.
You also can't aggregate on the detail.attribute field that way. The detail.attribute field doesn't store any value: it is an object datatype (not a nested one, as you have written in the question), which means it is a container for other fields such as Size, Brand, etc. So you should aggregate on a field like detail.attribute.Size, provided it has the keyword datatype.
Another likely error is that you are running a term query on a text field (what is the datatype of the title.list field?). The term query is meant for fields with the keyword datatype, while the match query is used to query fields with the text datatype.
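Putting those two points together, a request might look roughly like the sketch below, which builds the body in Python. It assumes a keyword sub-field named detail.attribute.Brand.raw, which does not exist in the posted mapping and would have to be added (followed by a reindex). The helper name is hypothetical as well.

```python
def attribute_terms_request(search_text, keyword_field, size=3):
    """Build a search body: match on the title, terms aggregation on a keyword field.

    keyword_field must point at a keyword-typed field; the value used
    below is an assumed sub-field, not part of the original mapping.
    """
    return {
        "size": 0,
        # match analyzes the search text, which suits a text field.
        "query": {"match": {"title.list": search_text}},
        "aggs": {
            "results": {
                "terms": {"field": keyword_field, "size": size}
            }
        },
    }


request_body = attribute_terms_request("sandisk", "detail.attribute.Brand.raw")
```

This returns the top values of one attribute. Finding which attribute names are most common would need a different document structure.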
Here is what I have used for a nested aggs query, minus the actual value names.
The actual field is a keyword, which as already mentioned is required, that is part of a nested JSON object:
"STATUS_ID": {
"type": "keyword",
"index": "not_analyzed",
"doc_values": true
},
Query
GET index name/_search?size=200
{
"aggs": {
"panels": {
"nested": {
"path": "nested path"
},
"aggs": {
"statusCodes": {
"terms": {
"field": "nested path.STATUS.STATUS_ID",
"size": 50
}
}
}
}
}
}
Result
"aggregations": {
"status": {
"doc_count": 12108963,
"statusCodes": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "O",
"doc_count": 5912218
},
{
"key": "C",
"doc_count": 401586
},
{
"key": "E",
"doc_count": 135628
},
{
"key": "Y",
"doc_count": 3742
},
{
"key": "N",
"doc_count": 1012
},
{
"key": "L",
"doc_count": 719
},
{
"key": "R",
"doc_count": 243
},
{
"key": "H",
"doc_count": 86
}
]
}
}

elasticsearch sorting unexpected null returned

I followed the doc https://www.elastic.co/guide/en/elasticsearch/guide/current/multi-fields.html to add a sorting column for the name field. Unfortunately, it is not working.
These are the steps:
Add index mapping
PUT /staff
{
"mappings": {
"staff": {
"properties": {
"id": {
"type": "string",
"index": "not_analyzed"
},
"name": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
Add document
POST /staff/list
{
"id": 5,
"name": "abc"
}
Search for the name.raw
POST /staff_change/_search
{
"sort": "name.raw"
}
However, the sort field in the response returns null:
"_source": {
"id": 5,
"name": "abc"
},
"sort": [
null
]
}
I don't know why it is not working, and I can't find any relevant issue or documentation. Has anyone come across this issue?
Many thanks in advance.
Your mappings are incorrect. You create a mapping staff inside the index staff and then index documents under the mapping list inside the index staff. That works, but with a dynamic mapping, not the one you added. In the end you are searching across all the documents in the index staff. Try this:
PUT /staff
{
"mappings": {
"list": {
"properties": {
"id": {
"type": "string",
"index": "not_analyzed"
},
"name": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
Then index:
POST /staff/list
{
"id": 5,
"name": "abc aa"
}
And query:
POST /staff/list/_search
{
"sort": "name.raw"
}
Results in:
"hits": [
{
"sort": [
"abc aa"
]
}
...

ElasticSearch aggregations - sorting values

In this sample I have some cars with an unknown number of facets on them.
When doing aggregations I would like the values in the aggregations to be sorted alphabetically. However, some of the facets are integers, and that will produce these aggregations
Color
blue (2)
red (1)
Top speed
100 (1)
120 (1)
90 (1)
Year
2015 (1)
As you can see the topspeed facet is sorted wrong - 90 should be first.
Sample data
PUT /my_index
{
"mappings": {
"product": {
"properties": {
"displayname" :{"type": "string"},
"facets": {
"type": "nested",
"properties": {
"name": { "type": "string" },
"value": { "type": "string" },
"datatype": { "type": "string" }
}
}
}
}
}
}
PUT /my_index/product/1
{
"displayname": "HONDA",
"facets": [
{
"name": "topspeed",
"value": "100",
"datatype": "integer"
},
{
"name": "color",
"value": "Blue",
"datatype": "string"
}
]
}
PUT /my_index/product/2
{
"displayname": "WV",
"facets": [
{
"name": "topspeed",
"value": "90",
"datatype": "integer"
},
{
"name": "color",
"value": "Red",
"datatype": "string"
}
]
}
PUT /my_index/product/3
{
"displayname": "FORD",
"facets": [
{
"name": "topspeed",
"value": "120",
"datatype": "integer"
},
{
"name": "color",
"value": "Blue",
"datatype": "string"
},
{
"name": "year",
"value": "2015",
"datatype": "integer"
}
]
}
GET my_index/product/1
GET /my_index/product/_search
{
"size": 0,
"aggs": {
"facets": {
"nested": {
"path": "facets"
},
"aggs": {
"nested_facets": {
"terms": {
"field": "facets.name"
},
"aggs": {
"facet_value": {
"terms": {
"field": "facets.value",
"size": 0,
"order": {
"_term": "asc"
}
}
}
}
}
}
}
}
}
As you can see each facet has a datatype (integer or string).
Any ideas how I can get the sorting of values to be like this:
Color
blue (2)
red (1)
Top speed
90 (1)
100 (1)
120 (1)
Year
2015 (1)
I've played around with adding a new field "sortable_value" to the facet, where I pad the integer values like this "00000000090" at index time, but I could not get the aggregations to work.
Any help is appreciated.
That's an uncommon way of representing your data.
I'd suggest changing your data structure to the following:
{
"displayname": "FORD",
"facets": {
"topspeed": 120,
"color": "Blue",
"year": 2015
}
}
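As a rough illustration of that restructuring, the Python sketch below converts a document from the question's format into the suggested one. The helper name flatten_facets is hypothetical, and it assumes the only datatype values are "integer" and "string", as in the sample data. With real integers in the document, numeric fields can be mapped as numbers and sorted numerically.

```python
def flatten_facets(product):
    """Convert a list of {name, value, datatype} facets into a name -> value object.

    Values declared as "integer" are converted to int; everything else
    is kept as the original string.
    """
    facets = {}
    for facet in product.get("facets", []):
        value = facet["value"]
        if facet.get("datatype") == "integer":
            value = int(value)
        facets[facet["name"]] = value
    return {"displayname": product["displayname"], "facets": facets}


ford = {
    "displayname": "FORD",
    "facets": [
        {"name": "topspeed", "value": "120", "datatype": "integer"},
        {"name": "color", "value": "Blue", "datatype": "string"},
        {"name": "year", "value": "2015", "datatype": "integer"},
    ],
}
restructured = flatten_facets(ford)
```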

add score to elasticsearch completion suggester inputs

I need to implement the Elasticsearch completion suggester.
I have an index mapped like this:
{
"user": {
"properties": {
"username": {
"index": "not_analyzed",
"analyzer": "simple",
"type": "string"
},
"email": {
"index": "not_analyzed",
"analyzer": "simple",
"type": "string"
},
"name": {
"index": "not_analyzed",
"analyzer": "simple",
"type": "string"
},
"name_suggest": {
"payloads": true,
"type": "completion"
}
}
}
}
I add documents to the index like this:
{
"doc": {
"id": 1,
"username": "jack",
"name": "Jack Nicholson",
"email": "nick#myemail.com",
"name_suggest": {
"input": [
"jack",
"Jack Nicholson",
"nick#myemail.com"
],
"payload": {
"id": 1,
"name": "Jack Nicholson",
"username": "jack",
"email": "nick#myemail.com"
},
"output": "Jack Nicholson (jack) - nick#myemail.com"
}
},
"doc_as_upsert": true
}
And I send this request to my_index/_suggest:
{
"user": {
"text": "jack",
"completion": {
"field": "name_suggest"
}
}
}
I get the resulting options that look like this:
[
{
"text": "John Smith",
"score": 1.0,
"payload": {
"id": 11,
"name": "John Smith",
"username": "jack",
"email": "john#myemail.com"
}
},
{
"text": "Jack Nickolson",
"score": 1.0,
"payload": {
"id": 1,
"name": "Jack Nickolson",
"username": "jack.n",
"email": "nickolson#myemail.com"
}
},
{
"text": "Jackson Jermaine",
"score": 1.0,
"payload": {
"id": 10,
"name": "Jackson Jermaine",
"username": "jermaine",
"email": "jermaine#myemail.com"
}
},
{
"text": "Tito Jackson",
"score": 1.0,
"payload": {
"id": 9,
"name": "Tito Jackson",
"username": "tito",
"email": "jackson#myemail.com"
}
},
{
"text": "Michael Jackson",
"score": 1.0,
"payload": {
"id": 6,
"name": "Michael Jackson",
"username": "michael_jackson",
"email": "jackson_michael#myemail.com"
}
}
]
This works fine, but I need the options sorted so that those with a matching username come first. I could do it manually, but that would prevent me from using length and offset and would be slower.
Is it possible to add scoring to the individual inputs (not the whole suggestion), and that way affect the sorting? With the approach I use, it seems it is not.
A related question: is it possible to specify an array of fields in the input instead of an array of values, and that way avoid the duplication? If so, would a score set on the fields be taken into account when ES generates suggestions?
You can add a score to your input with the weight option:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-suggesters-completion.html#indexing
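A minimal sketch of what that could look like, written in Python and based on the name_suggest value from the question. The helper name with_weight is hypothetical, and the positive-integer check is a conservative assumption rather than something taken from the question. As far as I can tell, in this form the weight applies to the whole name_suggest entry rather than to one of its inputs, so check the linked documentation for your version.

```python
def with_weight(suggest_entry, weight):
    """Return a copy of a completion-field value with a weight attached.

    The check for a positive integer is a deliberately strict
    assumption made for this sketch.
    """
    if isinstance(weight, bool) or not isinstance(weight, int) or weight < 1:
        raise ValueError("weight must be a positive integer")
    entry = dict(suggest_entry)
    entry["weight"] = weight
    return entry


name_suggest = {
    "input": ["jack", "Jack Nicholson", "nick#myemail.com"],
    "output": "Jack Nicholson (jack) - nick#myemail.com",
    "payload": {"id": 1, "name": "Jack Nicholson", "username": "jack"},
}
weighted = with_weight(name_suggest, 10)
```

Entries with a higher weight should then rank ahead of entries with a lower one in the suggester's options.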
