Understand Elasticsearch Multivalue Fields - elasticsearch

I am trying to understand the position_increment_gap as it is explained on the Elasticsearch documentation https://www.elastic.co/guide/en/elasticsearch/guide/current/_multivalue_fields_2.html
I created the same index as in the example and inserted a single document
PUT /my_index/groups/1
{
"names": [ "John Abraham", "Lincoln Smith", "Justin Trudeau"]
}
Then I try a phrase query for Abraham Lincoln and it matches, as expected
GET /my_index/groups/_search
{
"query": {
"match_phrase": {
"names": "Abraham Lincoln"
}
}
}
{
"took": 25,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.5753642,
"hits": [
{
"_index": "names",
"_type": "doc",
"_id": "1",
"_score": 0.5753642,
"_source": {
"names": [
"john abraham",
"lincoln smith",
"justin trudeau"
]
}
}
]
}
}
The documentation explains that the match occurs because ES produces the tokens john abraham lincoln smith justin trudeau and it recommends inserting a position_increment_gap of 100 to avoid matching abraham lincoln unless I have a slop of 100.
I changed the index to have a position_increment_gap of 1 as shown below:
PUT names
{
"mappings": {
"doc": {
"properties": {
"names": {
"type":"text",
"position_increment_gap": 1
}
}
}
}
}
If I'm understanding the documentation, using a gap of 1 should allow me to match "abraham smith". But it doesn't match. Nor does "abraham lincoln", "abraham justin", or "abraham trudeau". "lincoln smith", "john abraham" and "justin trudeau" all continue to match.
I must be misunderstanding the documentation.
Thanks for any suggestions.

Related

Function score ignored

I have two nearly identical documents, one of which has the fields CONSTRUCTION: 1 and EDUCATION: 0.1, the other with CONSTRUCTION: 0.1 and EDUCATION: 1. I want to be able to sort results by the value of either the CONSTRUCTION or EDUCATION field
GET /objects/_search
{
"query": {
"function_score": {
"query": {
"match": {
"name": {
"query": "Monkeys"
}
}
},
"field_value_factor": {
"field" : "CONSTRUCTION",
"missing": 1
}
}
},
"_source": ["name", "CONSTRUCTION", "EDUCATION"]
}
Returns the incorrect results:
{
"took": 8,
"timed_out": false,
"_shards": {
"total": 3,
"successful": 3,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1.7622693,
"hits": [
{
"_index": "objects__feed_id_key_pages__date_2019-12-10__timestamp_1575988952__batch_id_3gpnz7fc__",
"_type": "_doc",
"_id": "dit:greatDomesticUi:KeyPages:12",
"_score": 1.7622693,
"_source": {
"CONSTRUCTION": 0.1,
"name": "Space Monkeys - education",
"EDUCATION": 1
}
},
{
"_index": "objects__feed_id_key_pages__date_2019-12-10__timestamp_1575988952__batch_id_3gpnz7fc__",
"_type": "_doc",
"_id": "dit:greatDomesticUi:KeyPages:11",
"_score": 1.0226655,
"_source": {
"CONSTRUCTION": 1,
"name": "Space Monkeys - construction",
"EDUCATION": 0.1
}
}
]
}
}
This only always returns the same results. Indeed if you misspell the field_value_factor field, you get the same score "field_value_factor": { "field" : "WHATEVER",... }. This suggests the field simply isn't being read.
Dynamic mapping was turned off. The EDUCATION and CONSTRUCTION fields were not mapped. Mystery solved!

Elasticsearch query with fuzziness AUTO not working as expected

From the Elasticsearch documentation regarding fuzziness:
AUTO
Generates an edit distance based on the length of the term. Low and high distance arguments may be optionally provided AUTO:[low],[high]. If not specified, the default values are 3 and 6, equivalent to AUTO:3,6 that make for lengths:
0..2
Must match exactly
3..5
One edit allowed
>5
Two edits allowed
However, when I am trying to specify low and high distance arguments in the search query the result is not what I am expecting.
I am using Elasticsearch 6.6.0 with the following index mapping:
{
"fuzzy_test": {
"mappings": {
"_doc": {
"properties": {
"description": {
"type": "text"
},
"id": {
"type": "keyword"
}
}
}
}
}
}
Inserting a simple document:
{
"id": "1",
"description": "hello world"
}
And the following search query:
{
"size": 10,
"timeout": "30s",
"query": {
"match": {
"description": {
"query": "helqo",
"fuzziness": "AUTO:7,10"
}
}
}
}
I assumed that fuzziness:AUTO:7,10 would mean that for the input term with length <= 6 only documents with the exact match will be returned. However, here is a result of my query:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.23014566,
"hits": [
{
"_index": "fuzzy_test",
"_type": "_doc",
"_id": "OQtUu2oBABnEwrgM3Ejr",
"_score": 0.23014566,
"_source": {
"id": "1",
"description": "hello world"
}
}
]
}
}
This is strange but seems like that bug exists only in version the Elasticsearch 6.6.0. I've tried 6.4.2 and 6.6.2 and both of them work just fine.

Elasticsearch OR query with nested objects returns inner_hits not matching the criteria

I'm getting weird results when querying nested objects. Imagine the following structure:
{ owner.name = "fred",
...,
pets [
{ name = "daisy", ... },
{ name = "flopsy", ... }
]
}
If I only have the document shown above, and I search pets matching this criteria:
pets.name = "daisy" OR
(owner.name = "julie" and pet.name = "flopsy")
I would expect to only get one result ("daisy"), but I'm getting both pet names.
This is one way to reproduce this:
# Create nested mapping
PUT pet-owners
{
"mappings": {
"animals": {
"properties": {
"owner": {"type": "text"},
"pets": {
"type": "nested",
"properties": {
"name": {"type": "text", "fielddata": true}
}
}
}
}
}
}
# Insert nested object
PUT pet-owners/animals/1?op_type=create
{
"owner" : "fred",
"pets" : [
{ "name" : "daisy"},
{ "name" : "flopsy"}
]
}
# Query
GET pet-owners/_search
{ "from": 0, "size": 50,
"query": {
"constant_score": {
"filter": { "bool": {"must": [
{"bool": {"should": [
{"nested": {"query":
{"term": {"pets.name": "daisy"}},
"path":"pets",
"inner_hits": {
"name": "pets_hits_1",
"size": 99,
"_source": false,
"docvalue_fields": ["pets.name"]
}
}},
{"bool": {"must": [
{"term": {"owner": "julie"}},
{"nested": {"query":
{"term": {"pets.name": "flopsy"}},
"path":"pets",
"inner_hits": {
"name": "pets_hits_2",
"size": 99,
"_source": false,
"docvalue_fields": ["pets.name"]
}
}}
]}}
]}}
]}}}},
"_source": false
}
The query returns both pets names (as opposed to the expected one).
Is this behavior normal? Am I doing something wrong, or my reasoning about the nested structure or the query behavior is flawed?
Any help or guidance will be much appreciated.
I'm running this query under ElasticSearch 6.3.x
EDIT: I'm adding the response received, to better illustrate the case
{
"took": 16,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "pet-owners",
"_type": "animals",
"_id": "1",
"_score": 1,
"inner_hits": {
"pets_hits_1": {
"hits": {
"total": 1,
"max_score": 0.6931472,
"hits": [
{
"_index": "pet-owners",
"_type": "animals",
"_id": "1",
"_nested": {
"field": "pets",
"offset": 0
},
"_score": 0.6931472,
"fields": {
"pets.name": [
"daisy"
]
}
}
]
}
},
"pets_hits_2": {
"hits": {
"total": 1,
"max_score": 0.6931472,
"hits": [
{
"_index": "pet-owners",
"_type": "animals",
"_id": "1",
"_nested": {
"field": "pets",
"offset": 1
},
"_score": 0.6931472,
"fields": {
"pets.name": [
"flopsy"
]
}
}
]
}
}
}
}
]
}
}
So we can see that it's not that the query matches and returns the whole existing document, but that it returns each of the pets independently, one inside each of the inner_hits. It's this result that's surprising to me.
(edited) - in summary this issue is around the context of the 'inner_hits':
It looks like the inner_hits 'pets_hits_2' is returning a match because it is belonging to the nested query that simply searches the pets field for 'flopsy'.
As an independent query on our single document, that is a valid hit.
However, because that query is within a list of bool/must queries, where other queries will not match on our document, you may well expect that the inner_hits should pick up on this and therefore not return a hit.
I haven't been able to find any docs to clarify whether this is intentional behaviour or not - might be worth raising with elastic ...

How to filter out elements from an array that doesn’t match the query?

stackoverflow won't let me write that much example code so I put it on gist.
So I have this index
with this mapping
here is a sample document I insert into newly created mapping
this is my query
GET products/paramSuggestions/_search
{
"size": 10,
"query": {
"filtered": {
"query": {
"match": {
"paramName": {
"query": "col",
"operator": "and"
}
}
}
}
}
}
this is the unwanted result I get from previous query
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.33217794,
"hits": [
{
"_index": "products",
"_type": "paramSuggestions",
"_id": "1",
"_score": 0.33217794,
"_source": {
"productName": "iphone 6",
"params": [
{
"paramName": "color",
"value": "white"
},
{
"paramName": "capacity",
"value": "32GB"
}
]
}
}
]
}
}
and finally the wanted result, how I want the query result to look like
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.33217794,
"hits": [
{
"_index": "products",
"_type": "paramSuggestions",
"_id": "1",
"_score": 0.33217794,
"_source": {
"productName": "iphone 6",
"params": [
{
"paramName": "color",
"value": "white"
},
]
}
}
]
}
}
How should the query look like to achieve the wanted result with filtered array field which matches the query? In other words, all other non-matching array items should not appear in the final result.
The final result is the _source document that you indexed. There is no feature that lets you mask field elements of your document out of the Elasticsearch response.
That said, depending on your goal, you can look into how Highlighters and Suggesters identify result terms matching the query, or possibly, roll-your-own client-side masking using info returned from setting "explain": true in your query.

Fuzzy not functioning as expected (one term search, see example)

Consider the following results from:
curl -XGET 'http://localhost:9200/megacorp/employee/_search' -d
'{ "query" :
{"match":
{"last_name": "Smith"}
}
}'
Result:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.30685282,
"hits": [
{
"_index": "megacorp",
"_type": "employee",
"_id": "1",
"_score": 0.30685282,
"_source": {
"first_name": "John",
"last_name": "Smith",
"age": 25,
"about": "I love to go rock climbing on the weekends.",
"interests": [
"sports",
"music"
]
}
},
{
"_index": "megacorp",
"_type": "employee",
"_id": "2",
"_score": 0.30685282,
"_source": {
"first_name": "Jane",
"last_name": "Smith",
"age": 25,
"about": "I love to go rock climbing",
"interests": [
"sports",
"music"
]
}
}
]
}
}
Now when I execute the following query:
curl -XGET 'http://localhost:9200/megacorp/employee/_search' -d
'{ "query" :
{"fuzzy":
{"last_name":
{"value":"Smitt",
"fuzziness": 1
}
}
}
}'
Returns NO results despite the Levenshtein distance of "Smith" and "Smitt" being 1. The same thing results with a value of "Smit." If I put in a fuzziness value of 2, I get results. What am I missing here?
I assume that the last_name field your are querying is an analyzed string. The indexed term will though be smith and not Smith.
Returns NO results despite the Levenshtein distance of "Smith" and
"Smitt" being 1.
The fuzzy query don't analyze term, so actually, your Levenshtein distance is not 1 but 2 :
Smitt -> Smith
Smith -> smith
Try using this mapping, and your query with fuzziness = 1 will work :
PUT /megacorp/employee/_mapping
{
"employee":{
"properties":{
"last_name":{
"type":"string",
"index":"not_analyzed"
}
}
}
}
Hope this helps

Resources