Nested filtering in Elasticsearch with more than one term of the same nested type

I'm new to Elasticsearch, so maybe my approach is plain wrong, but I want to build an index of recipes and let the user narrow the results down by filtering on the aggregated ingredients that still occur in the current subset.
An example may make this clearer. Say I search for recipes with the term salt, which returns three recipes:
1. with ingredients: salt, flour, water
2. with ingredients: salt, pepper, egg
3. with ingredients: water, flour, egg, salt
The aggregation on the ingredients of these results returns salt, flour, water, pepper, egg. When I filter with flour, I only want recipes 1 and 3 to appear in the search results (and the aggregation on ingredients should only return salt, flour, water and egg). When I add another filter, egg, I want only recipe 3 to appear (and the aggregation should only return water, flour, egg, salt).
I can't get the latter to work: a single filter next to the default query narrows the results down as desired, but when I add the second term (egg) to the terms filter, the results start to include recipe 2 (recipe b) again, as if it were an OR filter. Setting the filter execution to AND, however, returns NO results at all. What am I doing wrong?
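To pin down the behaviour being asked for, here is a minimal in-memory sketch in plain Python (no Elasticsearch involved). The function name `drill_down` and the `RECIPES` structure are made up for illustration; the data is the three example recipes above.

```python
from collections import Counter

# The three example recipes from the question, keyed by recipe number.
RECIPES = {
    1: ["salt", "flour", "water"],
    2: ["salt", "pepper", "egg"],
    3: ["water", "flour", "egg", "salt"],
}


def drill_down(recipes, search_term, selected):
    """Return (matching recipe ids, ingredient counts over those recipes).

    A recipe matches when it contains the search term AND every selected
    ingredient, which is the AND behaviour the question is after.
    """
    required = {search_term, *selected}
    hits = sorted(rid for rid, names in recipes.items() if required <= set(names))
    facets = Counter(name for rid in hits for name in recipes[rid])
    return hits, facets


if __name__ == "__main__":
    for selected in ([], ["flour"], ["flour", "egg"]):
        hits, facets = drill_down(RECIPES, "salt", selected)
        print(selected, hits, sorted(facets))
```

Each added ingredient shrinks both the hit list and the set of ingredients offered as further filters, which is what the search request is expected to reproduce.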
My mapping:
{
"recipe": {
"properties": {
"title": {
"analyzer": "dutch",
"type": "string"
},
"ingredients": {
"type": "nested",
"properties": {
"name": {
"type": "string",
"analyzer": "dutch",
"include_in_parent": true,
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
}
My query:
{
"query": {
"filtered": {
"query": {
"bool": {
"should": [
{
"match": {
"_all": "salt"
}
}
]
}
},
"filter": {
"nested": {
"path": "ingredients",
"filter": {
"terms": {
"ingredients.name": [
"flour",
"egg"
],
"execution": "and"
}
}
}
}
}
},
"size": 50,
"aggregations": {
"ingredients": {
"nested": {
"path": "ingredients"
},
"aggregations": {
"count": {
"terms": {
"field": "ingredients.name.raw"
}
}
}
}
}
}

Why are you using a nested mapping here? Its main purpose is to keep the relations between the attributes of each sub-object, but your ingredients field has just one attribute and can be modeled simply as a string field.
So, if you update your mapping like this:
POST recipes
{
"mappings": {
"recipe": {
"properties": {
"title": {
"type": "string"
},
"ingredients": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
You can still index your recipes like this:
{
"title":"recipe b",
"ingredients":["salt","pepper","egg"]
}
And this query gives you the result you are expecting:
POST recipes/recipe/_search
{
"query": {
"filtered": {
"query": {
"match": {
"_all": "salt"
}
},
"filter": {
"terms": {
"ingredients": [
"flour",
"egg"
],
"execution": "and"
}
}
}
},
"size": 50,
"aggregations": {
"ingredients": {
"terms": {
"field": "ingredients"
}
}
}
}
The response is:
{
...
"hits": {
"total": 1,
"max_score": 0.22295055,
"hits": [
{
"_index": "recipes",
"_type": "recipe",
"_id": "PP195TTsSOy-5OweArNsvA",
"_score": 0.22295055,
"_source": {
"title": "recipe c",
"ingredients": [
"salt",
"flour",
"egg",
"water"
]
}
}
]
},
"aggregations": {
"ingredients": {
"buckets": [
{
"key": "egg",
"doc_count": 1
},
{
"key": "flour",
"doc_count": 1
},
{
"key": "salt",
"doc_count": 1
},
{
"key": "water",
"doc_count": 1
}
]
}
}
}
Hope this helps.
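For anyone who wants to keep the nested mapping, the following Python sketch models why the original filter behaved the way it did. It assumes the usual nested semantics: every ingredient is stored as its own hidden document holding a single name, and a nested filter matches a recipe when at least one of those hidden documents matches. The helper names are hypothetical, and `per_term_nested_filter` only builds a request body in the 1.x filter DSL used in the question; it has not been run against a cluster.

```python
# Each recipe's nested ingredients behave like separate hidden documents,
# one per ingredient, each holding a single "name" value.
RECIPES = {
    "a": ["salt", "flour", "water"],
    "b": ["salt", "pepper", "egg"],
    "c": ["water", "flour", "egg", "salt"],
}


def nested_filter(recipes, predicate):
    """Ids of recipes with at least one nested ingredient satisfying predicate."""
    return {rid for rid, names in recipes.items() if any(predicate(n) for n in names)}


wanted = ["flour", "egg"]

# One nested filter, terms with the default OR behaviour.
any_term = nested_filter(RECIPES, lambda name: name in wanted)

# One nested filter, terms with "execution": "and": a single ingredient
# document would have to equal every wanted term at the same time.
all_terms_in_one = nested_filter(RECIPES, lambda name: all(name == w for w in wanted))

# One nested filter per term, intersected at the recipe level.
per_term = set(RECIPES)
for w in wanted:
    per_term &= nested_filter(RECIPES, lambda name, w=w: name == w)


def per_term_nested_filter(path, field, terms):
    """Build a bool/must filter holding one nested term filter per ingredient."""
    return {
        "bool": {
            "must": [
                {"nested": {"path": path, "filter": {"term": {field: term}}}}
                for term in terms
            ]
        }
    }


if __name__ == "__main__":
    print(sorted(any_term), sorted(all_terms_in_one), sorted(per_term))
    print(per_term_nested_filter("ingredients", "ingredients.name", wanted))
```

In this model the OR variant brings recipe b back, the AND variant matches nothing, and only the per-term variant leaves recipe c, which mirrors what the question reports.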

Related

Distinct values from array-field matching filter in Elasticsearch 2.4

In short: I want to look up the distinct values of a document field, BUT only those matching some filter. The problem is with array fields.
Imagine the following documents in ES 2.4:
[
{
"states": [
"Washington (US-WA)",
"California (US-CA)"
]
},
{
"states": [
"Washington (US-WA)"
]
}
]
I'd like my users to be able to look up all possible states via typeahead, so I have the following query for the user input "wa":
{
"query": {
"wildcard": {
"states.raw": "*wa*"
}
},
"aggregations": {
"typed": {
"terms": {
"field": "states.raw"
},
"aggregations": {
"typed_hits": {
"top_hits": {
"_source": { "includes": ["states"] }
}
}
}
}
}
}
states.raw is a sub-field with the not_analyzed option.
This query works pretty well unless a document holds an array of values, as in the example: then it returns both Washington and California. I understand why this happens (the query and the aggregations operate on whole documents, and the document contains both values even though only one of them matched the filter), but I really only want to see Washington, and I don't want to add another layer of filtering of the ES results on the application side.
Is there a way to do this in a single ES 2.4 request?
You could use the "Filtering Values" feature (see https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-aggregations-bucket-terms-aggregation.html#_filtering_values_2).
So, your request could look like:
POST /index/collection/_search?size=0
{
"aggregations": {
"typed": {
"terms": {
"field": "states.raw",
"include": ".*wa.*" // You need to carefully quote the "wa" string because it'll be used as part of RegExp
},
"aggregations": {
"typed_hits": {
"top_hits": {
"_source": { "includes": ["states"] }
}
}
}
}
}
}
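The comment about quoting the user's input deserves a concrete illustration. The sketch below is an assumption-laden helper, not part of the original answer: `include_pattern` (a made-up name) escapes the characters that Lucene regular expressions treat as reserved and expands each letter into a two-character class, since the `include` pattern is matched case-sensitively against the not_analyzed values. Python's `re.fullmatch` stands in for Lucene's anchored matching here, which is only an approximation.

```python
import re
from collections import Counter

# Characters treated as reserved by Lucene regular expressions
# (assumption: the optional operators are enabled, so they are escaped too).
LUCENE_RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def include_pattern(user_input):
    """Build a case-insensitive 'contains' pattern for the terms include option."""
    parts = []
    for ch in user_input:
        if ch in LUCENE_RESERVED:
            parts.append("\\" + ch)
        elif ch.lower() != ch.upper():
            parts.append("[" + ch.lower() + ch.upper() + "]")
        else:
            parts.append(ch)
    return ".*" + "".join(parts) + ".*"


# The two example documents from the question.
DOCS = [
    {"states": ["Washington (US-WA)", "California (US-CA)"]},
    {"states": ["Washington (US-WA)"]},
]


def terms_agg(docs, field, include=None):
    """Count documents per distinct value, optionally keeping only included values."""
    counts = Counter()
    for doc in docs:
        for value in set(doc[field]):
            if include is None or re.fullmatch(include, value):
                counts[value] += 1
    return counts


if __name__ == "__main__":
    print(terms_agg(DOCS, "states"))
    print(include_pattern("wa"), terms_agg(DOCS, "states", include_pattern("wa")))
```

Without `include`, the aggregation counts every value of every matching document, which is why California shows up; with it, only the values that themselves match the pattern become buckets.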
That said, I can't help pointing out that a wildcard query with a leading wildcard is not the best solution. Please do consider using ngrams for this:
PUT states
{
"settings": {
"analysis": {
"filter": {
"ngrams": {
"type": "nGram",
"min_gram": "2",
"max_gram": "20"
}
},
"analyzer": {
"ngram_analyzer": {
"type": "custom",
"filter": [
"standard",
"lowercase",
"ngrams"
],
"tokenizer": "standard"
}
}
}
},
"mappings": {
"doc": {
"properties": {
"location": {
"properties": {
"states": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
},
"ngrams": {
"type": "string",
"analyzer": "ngram_analyzer"
}
}
}
}
}
}
}
}
}
POST states/doc/1
{
"text":"bla1",
"location": [
{
"states": [
"Washington (US-WA)",
"California (US-CA)"
]
},
{
"states": [
"Washington (US-WA)"
]
}
]
}
POST states/doc/2
{
"text":"bla2",
"location": [
{
"states": [
"Washington (US-WA)",
"California (US-CA)"
]
}
]
}
POST states/doc/3
{
"text":"bla3",
"location": [
{
"states": [
"California (US-CA)"
]
},
{
"states": [
"Illinois (US-IL)"
]
}
]
}
And the final query:
GET states/_search
{
"query": {
"term": {
"location.states.ngrams": {
"value": "sh"
}
}
},
"aggregations": {
"filtering_states": {
"terms": {
"field": "location.states.raw",
"include": ".*sh.*"
},
"aggs": {
"typed_hits": {
"top_hits": {
"_source": {
"includes": [
"location.states"
]
}
}
}
}
}
}
}
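To see why the term query on the ngrams sub-field finds "sh", here is a rough Python approximation of what ngram_analyzer produces. It is only a sketch: `re.findall(r"\w+", ...)` stands in for the standard tokenizer, and the function name `analyze` is made up.

```python
import re


def analyze(text, min_gram=2, max_gram=20):
    """Approximate ngram_analyzer: split into words, lowercase, emit n-grams."""
    grams = set()
    for token in re.findall(r"\w+", text.lower()):
        for size in range(min_gram, max_gram + 1):
            for start in range(len(token) - size + 1):
                grams.add(token[start:start + size])
    return grams


if __name__ == "__main__":
    for state in ("Washington (US-WA)", "California (US-CA)", "Illinois (US-IL)"):
        print(state, "sh" in analyze(state))
```

Because the n-grams are lowercased at index time and a term query is not analyzed, the typed prefix should be lowercased on the client before it is sent.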

Elastic Search query return terms within array of a specific type

I have an index with the following mapping:
{"tagged_index":{"mappings":{"tagged":{"properties":{"tags":{"properties":{"resources":{"properties":{"tagName":{"type":"string"},"type":{"type":"string"}}}}},"content":{"type":"string"}}}}}}
Here resources is an array that can hold multiple tags. For example:
{"_id":"82906194","_source":{"tags":{"resources":[{"type":"Person","tagName":"Kim_Kardashian",},{"type":"Person","tagName":"Kanye_West",},{"type":"City","tagName":"New_York",},...},"content":" Popular NEWS ..."}}
,
{"_id":"82906195","_source":{"tags":{"resources":[{"type":"City","tagName":"London",},{"type":"Country","tagName":"USA",},{"type":"Music","tagName":"Hello",},...},"content":" Adele's Hello..."}},
...
I know how to extract the important terms [tagName] with the query below, but I do not want the terms [tagName] of all types.
How can I extract only the terms of one type, for example cities only [type:City]? (I would like to get the list of tagName values where the type is City, i.e. London, New_York, Berlin, ...)
{"size":0,"query":{"filtered":{"query":{"query_string":{"query":"*","analyze_wildcard":true}}}},"aggs":{"Cities":{"terms":{"field":"tags.resources.tagName","size":10,"order":{"_count":"desc"}}}}}
The required output should look like this:
{"took":1200,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":5179261,"max_score":0.0,"hits":[]},"aggregations":{"Cities":{"doc_count_error_upper_bound":46737,"sum_other_doc_count":36037440,"buckets":[{"key":"London","doc_count":332820},{"key":"New_York","doc_count":211274},{"key":"Berlin","doc_count":156954},{"key":"Amsterdam","doc_count":132173},...
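Can you try this:
{
"_source": ["tags.resources.tagName"],
"query": {
"term": {
"tags.resources.type": {
"value": "city"
}
}
}
}
The query above fetches the documents that contain at least one resource of type City, provided resources is of object type. The value is lowercase because a term query is not analyzed, while the indexed values of an analyzed string field are lowercased.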
After the edit
The problem is now to group by tagName for the resources of type City. That cannot be achieved with your current mapping; you will have to change the resources field to the nested type.
The mapping would look like this:
"mappings": {
"resource": {
"properties": {
"tags": {
"properties": {
"content": {
"type": "string"
},
"resources": {
"type": "nested",
"properties": {
"tagName": {
"type": "string"
},
"type": {
"type": "string"
}
}
}
}
}
}
}
}
The final query would be:
{
"size": 0,
"query": {
"nested": {
"path": "tags.resources",
"query": {
"match": {
"tags.resources.type": "city"
}
}
}
},
"aggs": {
"resources Nested path": {
"nested": {
"path": "tags.resources"
},
"aggs": {
"city type": {
"filter": {
"term": {
"tags.resources.type": "city"
}
},
"aggs": {
"group By tagName": {
"terms": {
"field": "tags.resources.tagName"
}
}
}
}
}
}
}
}
The output would be:
"aggregations": {
"resources Nested path": {
"doc_count": 6,
"city type": {
"doc_count": 2,
"group By tagName": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "london",
"doc_count": 1
},
{
"key": "new_york",
"doc_count": 1
}
]
}
}
}
}
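The difference between the object mapping and the nested mapping can be modelled in a few lines of Python. This is a simplified sketch using only the resources shown in the question's two example documents (the elided ones are left out); the function names are made up, and lowercasing stands in for the default analyzer.

```python
from collections import Counter

# (type, tagName) pairs taken from the two example documents.
DOCS = [
    {"resources": [("Person", "Kim_Kardashian"), ("Person", "Kanye_West"), ("City", "New_York")]},
    {"resources": [("City", "London"), ("Country", "USA"), ("Music", "Hello")]},
]


def flat_object_agg(docs, wanted_type):
    """Object mapping: type and tagName are flattened into independent arrays,
    so every tagName of a matching document is counted."""
    counts = Counter()
    for doc in docs:
        types = {t.lower() for t, _ in doc["resources"]}
        names = {n.lower() for _, n in doc["resources"]}
        if wanted_type in types:
            counts.update(names)
    return counts


def nested_agg(docs, wanted_type):
    """Nested mapping: each resource is its own hidden document, so the type
    filter and the tagName bucket refer to the same resource."""
    nested_docs = [res for doc in docs for res in doc["resources"]]
    matching = [(t, n) for t, n in nested_docs if t.lower() == wanted_type]
    return len(nested_docs), len(matching), Counter(n.lower() for _, n in matching)


if __name__ == "__main__":
    print(flat_object_agg(DOCS, "city"))
    print(nested_agg(DOCS, "city"))
```

The nested variant yields 6 nested documents, 2 of type city, and one bucket each for london and new_york, which lines up with the doc_count values in the output above.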

Empty inner_hits in compound Elasticsearch filter

I'm seeing what appears to be aberrant behavior in inner_hits results within nested boolean queries.
Test data (abbreviated for brevity):
# MAPPING
PUT unit_testing
{
"mappings": {
"document": {
"properties": {
"display_name": {"type": "text"},
"metadata": {
"properties": {
"NAME": {"type": "text"}
}
}
}
},
"paragraph": {
"_parent": {"type": "document"},
"_routing": {"required": true},
"properties": {
"checksum": {"type": "text"},
"sentences": {
"type": "nested",
"properties": {
"text": {"type": "text"}
}
}
}
}
}
}
# DOCUMENT X 2 (d0, d1)
PUT unit_testing/document/doc_id_d0
{
"display_name": "Test Document d0",
"paragraphs": [
"para_id_d0p0",
"para_id_d0p1"
],
"metadata": {"NAME": "Test Document d0 Metadata"}
}
# PARAGRAPH X 2 (d0p0, d1p0)
PUT unit_testing/paragraph/para_id_d0p0?parent=doc_id_d0
{
"checksum": "para_checksum_d0p0",
"sentences": [
{"text": "Test sentence d0p0s0"},
{"text": "Test sentence d0p0s1 ODD"},
{"text": "Test sentence d0p0s2 EVEN"},
{"text": "Test sentence d0p0s3 ODD"},
{"text": "Test sentence d0p0s4 EVEN"}
]
}
This initial query behaves as I would expect (I'm aware that the metadata filter isn't actually necessary in this example case):
GET unit_testing/paragraph/_search
{
"_source": "false",
"query": {
"bool": {
"must": [
{
"has_parent": {
"query": {
"match_phrase": {
"metadata.NAME": "Test Document d0 Metadata"
}
},
"type": "document"
}
},
{
"nested": {
"inner_hits": {},
"path": "sentences",
"query": {
"match": {
"sentences.text": "d0p0s0"
}
}
}
}
]
}
}
}
It yields an inner_hits object containing the one sentence that matched the predicate (some fields removed for clarity):
{
"hits": {
"hits": [
{
"_source": {},
"inner_hits": {
"sentences": {
"hits": {
"hits": [
{
"_source": {
"text": "Test sentence d0p0s0"
}
}
]
}
}
}
}
]
}
}
The following query is an attempt to embed the query above within a parent "should" clause, to create a logical OR between the initial query, and an additional query that matches a single sentence:
GET unit_testing/paragraph/_search
{
"_source": "false",
"query": {
"bool": {
"should": [
{
"bool": {
"must": [
{
"has_parent": {
"query": {
"match_phrase": {
"metadata.NAME": "Test Document d0 Metadata"
}
},
"type": "document"
}
},
{
"nested": {
"inner_hits": {},
"path": "sentences",
"query": {
"match": {
"sentences.text": "d0p0s0"
}
}
}
}
]
}
},
{
"nested": {
"inner_hits": {},
"path": "sentences",
"query": {
"match": {
"sentences.text": "d1p0s0"
}
}
}
}
]
}
}
}
While the "d1" query outputs the result one would expect, with an inner_hits object containing the matching sentence, the original "d0" query now yields an empty inner_hits object:
{
"hits": {
"hits": [
{
"_source": {},
"inner_hits": {
"sentences": {
"hits": {
"total": 0,
"hits": []
}
}
}
},
{
"_source": {},
"inner_hits": {
"sentences": {
"hits": {
"hits": [
{
"_source": {
"text": "Test sentence d1p0s0"
}
}
]
}
}
}
}
]
}
}
Although I'm using the elasticsearch_dsl Python library to build and combine these queries, and I'm something of a novice with respect to the Query DSL, the query format looks solid to me.
What am I missing?
I think what is missing is the name parameter for inner_hits: you have two inner_hits clauses in two different queries, and since an inner hit is named after its nested path by default, both end up with the same name (sentences). Try giving each inner_hits its own name parameter [0].
[0] https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-inner-hits.html#_options
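A minimal sketch of that suggestion, building the request body as plain Python dicts (the question uses elasticsearch_dsl, but dicts keep the example dependency-free). The names d0_sentences and d1_sentences and the helper functions are hypothetical; the rest is copied from the query in the question.

```python
import json


def nested_sentence_clause(text, hits_name):
    """Nested clause on `sentences` whose inner_hits carries an explicit name."""
    return {
        "nested": {
            "path": "sentences",
            "query": {"match": {"sentences.text": text}},
            "inner_hits": {"name": hits_name},
        }
    }


def build_query():
    d0 = {
        "bool": {
            "must": [
                {
                    "has_parent": {
                        "type": "document",
                        "query": {
                            "match_phrase": {"metadata.NAME": "Test Document d0 Metadata"}
                        },
                    }
                },
                nested_sentence_clause("d0p0s0", "d0_sentences"),
            ]
        }
    }
    d1 = nested_sentence_clause("d1p0s0", "d1_sentences")
    return {"_source": False, "query": {"bool": {"should": [d0, d1]}}}


def inner_hits_names(node):
    """Collect every inner_hits name found anywhere in the request body."""
    names = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "inner_hits":
                names.append(value.get("name"))
            else:
                names.extend(inner_hits_names(value))
    elif isinstance(node, list):
        for item in node:
            names.extend(inner_hits_names(item))
    return names


if __name__ == "__main__":
    print(json.dumps(build_query(), indent=2))
    print(inner_hits_names(build_query()))
```

With distinct names, each hit's inner_hits object should be keyed by d0_sentences or d1_sentences rather than by the shared path name.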

Elasticsearch unexpected results when sorting against deeply nested attributes

I'm trying to perform some sorting based on the attributes of a document's deeply nested children.
Let's say we have an index filled with publisher documents. A publisher has a collection of books, and
each book has a title, a published flag, and a collection of genre scores. A genre_score represents how well
a particular book matches a particular genre, or in this case a genre_id.
First, let's define some mappings (for simplicity, we will only be explicit about the nested types):
curl -XPUT 'localhost:9200/book_index' -d '
{
"mappings": {
"publisher": {
"properties": {
"books": {
"type": "nested",
"properties": {
"genre_scores": {
"type": "nested"
}
}
}
}
}
}
}'
Here are our two publishers:
curl -XPUT 'localhost:9200/book_index/publisher/1' -d '
{
"name": "Best Books Publishing",
"books": [
{
"name": "Published with medium genre_id of 1",
"published": true,
"genre_scores": [
{ "genre_id": 1, "score": 50 },
{ "genre_id": 2, "score": 15 }
]
}
]
}'
curl -XPUT 'localhost:9200/book_index/publisher/2' -d '
{
"name": "Puffin Publishers",
"books": [
{
"name": "Published book with low genre_id of 1",
"published": true,
"genre_scores": [
{ "genre_id": 1, "score": 10 },
{ "genre_id": 4, "score": 10 }
]
},
{
"name": "Unpublished book with high genre_id of 1",
"published": false,
"genre_scores": [
{ "genre_id": 1, "score": 100 },
{ "genre_id": 2, "score": 35 }
]
}
]
}'
And here is the final definition of our index & mappings...
curl -XGET 'localhost:9200/book_index/_mappings?pretty=true'
...
{
"book_index": {
"mappings": {
"publisher": {
"properties": {
"books": {
"type": "nested",
"properties": {
"genre_scores": {
"type": "nested",
"properties": {
"genre_id": {
"type": "long"
},
"score": {
"type": "long"
}
}
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"published": {
"type": "boolean"
}
}
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
Now suppose we want to query for a list of publishers, and have them sorted by how well their books perform
in a particular genre. In other words, sort the publishers by the genre_score.score of one of their books
for the target genre_id.
We might write a search query like this...
curl -XGET 'localhost:9200/book_index/_search?pretty=true' -d '
{
"size": 5,
"from": 0,
"sort": [
{
"books.genre_scores.score": {
"order": "desc",
"nested_path": "books.genre_scores",
"nested_filter": {
"term": {
"books.genre_scores.genre_id": 1
}
}
}
}
],
"_source":false,
"query": {
"nested": {
"path": "books",
"query": {
"bool": {
"must": []
}
},
"inner_hits": {
"size": 5,
"sort": []
}
}
}
}'
This correctly returns Puffin (with a sort value of [100]) first and Best Books second (with a sort value of [50]).
But suppose we only want to consider books for which published is true. This would change our expectation to have Best Books first (with a sort value of [50]) and Puffin second (with a sort value of [10]).
Let's update our nested_filter and query to the following...
curl -XGET 'localhost:9200/book_index/_search?pretty=true' -d '
{
"size": 5,
"from": 0,
"sort": [
{
"books.genre_scores.score": {
"order": "desc",
"nested_path": "books.genre_scores",
"nested_filter": {
"bool": {
"must": [
{
"term": {
"books.genre_scores.genre_id": 1
}
}, {
"term": {
"books.published": true
}
}
]
}
}
}
}
],
"_source": false,
"query": {
"nested": {
"path": "books",
"query": {
"term": {
"books.published": true
}
},
"inner_hits": {
"size": 5,
"sort": []
}
}
}
}'
Suddenly, the sort values for both publishers have become [-9223372036854775808].
Why does adding an additional term to our nested_filter in the top-level sort have this impact?
Can anyone provide some insight as to why this behavior is happening? Additionally, are there any viable solutions for the proposed query/sort?
This occurs in both ES 1.x and ES 5.
Thanks!
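One plausible reading of the numbers above, sketched in Python rather than verified against a cluster: -9223372036854775808 is the smallest signed 64-bit value, which looks like the placeholder used when no nested document passes the nested_filter. The model below assumes the filter is evaluated only against the documents at nested_path (books.genre_scores), which carry genre_id and score but not the parent book's published flag. All function names are made up.

```python
LONG_MIN = -(2 ** 63)  # equals -9223372036854775808

# The two publishers from the question, reduced to the fields that matter.
PUBLISHERS = {
    "Best Books Publishing": [
        {"published": True,
         "genre_scores": [{"genre_id": 1, "score": 50}, {"genre_id": 2, "score": 15}]},
    ],
    "Puffin Publishers": [
        {"published": True,
         "genre_scores": [{"genre_id": 1, "score": 10}, {"genre_id": 4, "score": 10}]},
        {"published": False,
         "genre_scores": [{"genre_id": 1, "score": 100}, {"genre_id": 2, "score": 35}]},
    ],
}


def genre_only(gs):
    return gs.get("genre_id") == 1


def genre_and_published(gs):
    # A genre_scores document has no "published" field, so this never matches.
    return gs.get("genre_id") == 1 and gs.get("published") is True


def is_published(book):
    return book["published"]


def sort_value(books, genre_filter, book_filter=lambda book: True):
    """Highest score among genre_scores documents that pass the filters,
    or LONG_MIN when nothing passes (descending sort)."""
    scores = [
        gs["score"]
        for book in books
        if book_filter(book)
        for gs in book["genre_scores"]
        if genre_filter(gs)
    ]
    return max(scores, default=LONG_MIN)


if __name__ == "__main__":
    for name, books in PUBLISHERS.items():
        print(name,
              sort_value(books, genre_only),
              sort_value(books, genre_and_published),
              sort_value(books, genre_only, is_published))
```

Under that assumption the first filter gives 100 and 50, the combined filter gives LONG_MIN for both publishers, and the expected 50 and 10 only appear once the published condition is applied at the books level, separately from the genre_id condition at the genre_scores level.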

Function Score On Nested Object

I have this index blog with the following settings and mappings.
PUT /blog
{
"settings": {
"index": {
"number_of_shards": "1"
}
},
"mappings": {
"post": {
"_all": {
"enabled": false
},
"properties": {
"title": {
"type": "string"
},
"content": {
"type": "string"
},
"visitor": {
"type": "nested",
"properties": {
"id": {
"type": "string",
"index": "not_analyzed"
},
"last_visit": {
"type": "date",
"format": "yyyy-MM-dd"
}
}
}
}
}
}
}
I want to rank my posts based on relevancy and the visitor's last visit. I tried this query without success: it seems the gauss function cannot read the value of the visitor's last_visit. How can I get this to work?
POST /blog/post/_search
{
"query": {
"function_score": {
"functions": [
{
"gauss": {
"visitor.last_visit": {
"origin": "now/d",
"offset": "3d",
"scale": "4d",
"decay": 0.5
}
},
"filter": {
"nested": {
"path": "visitor",
"query": {
"term": {
"visitor.id": "1"
}
}
}
}
}
]
}
}
}
Here is a query that matches on a name and uses a nested object, which I wrote for a particular use case. I didn't use any date fields, but it does use a nested object, and it combines distance relevancy with a text match, so it's similar.
I structured my query after the answer to this question, as it matched what I was trying to do: Scoring documents by text match and distance
GET dev_search_core_data/_search?size=200
{
"query": {
"bool": {
"should": [
{
"match": {
"NAME": "Amy Smith"
}
},
{
"bool": {
"must": [
{
"function_score": {
"query": {
"nested": {
"path": "LOCATION",
"query": {
"term": {
"LOCATION.SOME_IND": {
"value": true
}
}
}
}
},
"functions": [
{
"gauss": {
"LOCATION.COORDINATES": {
"origin": "-118.309, 34.041",
"scale": "50km",
"offset": "10km",
"decay": 0.5
}
}
}
]
}
}
]
}
}
]
}
}
}
I think the problem is with the structure of your query. Whenever I have problems, I run this command first to validate my query and rule out syntax issues.
GET dev_search_core_data/_validate/query?explain
This was the result:
{
"valid": true,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"explanations": [
{
"index": "dev_search_core_data_b",
"valid": true,
"explanation": "filtered((NAME:amy NAME:smith) (+function score (ToParentBlockJoinQuery (filtered(LOCATION.SOME_IND:true)->random_access(_type:_LOCATION)),function=org.elasticsearch.index.query.functionscore.DecayFunctionParser$GeoFieldDataScoreFunction#274227b9)))->cache(org.elasticsearch.index.search.nested.NonNestedDocsFilter#1012ada6)"
}
]
}
I also looked at the docs for an in-depth explanation of how the function score worked. You don't mention your version, but I'm using ES 1.6.
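A different arrangement that may be worth trying for the blog index in the question is to put the function_score inside the nested query, so that the gauss function is evaluated on the nested visitor documents where visitor.last_visit actually lives. This is a sketch under that assumption, not a tested answer: the helper name is made up, the body is built as a plain Python dict, and score_mode "max" is just one possible choice.

```python
import json


def visitor_decay_query(visitor_id):
    """Nested query wrapping a function_score that decays on visitor.last_visit."""
    return {
        "query": {
            "nested": {
                "path": "visitor",
                "score_mode": "max",  # assumption: keep the best visitor score per post
                "query": {
                    "function_score": {
                        "query": {"term": {"visitor.id": visitor_id}},
                        "functions": [
                            {
                                "gauss": {
                                    "visitor.last_visit": {
                                        "origin": "now/d",
                                        "offset": "3d",
                                        "scale": "4d",
                                        "decay": 0.5,
                                    }
                                }
                            }
                        ],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(visitor_decay_query("1"), indent=2))
```

The printed body can be posted to /blog/post/_search and checked with _validate/query?explain as described above; to bring text relevancy in, the nested clause could be combined with a match query inside a bool query.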
