Smartcase searches/highlights with ElasticSearch - elasticsearch

Context
I am trying to support smart-case search within our application, which uses elasticsearch. The use case I want to support is partial matching on any blob of text using smart-case semantics. I managed to configure my index in such a way that I can simulate smart-case search. It uses ngrams with a maximum length of 8 to keep storage requirements in check.
The way it works is that each document has both a generated case-sensitive and a generated case-insensitive field, populated via copy_to, each with its own indexing strategy. When searching on a given input, I split the input into parts. The split depends on the ngram length, whitespace and double-quote escaping. Each part is checked for capital letters. When a part contains a capital letter, a match filter is generated for that part on the case-sensitive field; otherwise the case-insensitive field is used.
This has proven to work very nicely, however I am having difficulties with getting highlighting to work the way I would like. To better explain the issue, I added an overview of my test setup below.
Settings
curl -X DELETE localhost:9200/custom
curl -X PUT localhost:9200/custom -d '
{
"settings": {
"analysis": {
"filter": {
"default_min_length": {
"type": "length",
"min": 1
},
"squash_spaces": {
"type": "pattern_replace",
"pattern": "\\s{2,}",
"replacement": " "
}
},
"tokenizer": {
"ngram_tokenizer": {
"type": "nGram",
"min_gram": "2",
"max_gram": "8"
}
},
"analyzer": {
"index_raw": {
"type": "custom",
"filter": ["lowercase","squash_spaces","trim","default_min_length"],
"tokenizer": "keyword"
},
"index_case_insensitive": {
"type": "custom",
"filter": ["lowercase","squash_spaces","trim","default_min_length"],
"tokenizer": "ngram_tokenizer"
},
"search_case_insensitive": {
"type": "custom",
"filter": ["lowercase","squash_spaces","trim"],
"tokenizer": "keyword"
},
"index_case_sensitive": {
"type": "custom",
"filter": ["squash_spaces","trim","default_min_length"],
"tokenizer": "ngram_tokenizer"
},
"search_case_sensitive": {
"type": "custom",
"filter": ["squash_spaces","trim"],
"tokenizer": "keyword"
}
}
}
},
"mappings": {
"_default_": {
"_all": { "enabled": false },
"date_detection": false,
"dynamic_templates": [
{
"case_insensitive": {
"match_mapping_type": "string",
"match": "case_insensitive",
"mapping": {
"type": "string",
"analyzer": "index_case_insensitive",
"search_analyzer": "search_case_insensitive"
}
}
},
{
"case_sensitive": {
"match_mapping_type": "string",
"match": "case_sensitive",
"mapping": {
"type": "string",
"analyzer": "index_case_sensitive",
"search_analyzer": "search_case_sensitive"
}
}
},
{
"text": {
"match_mapping_type": "string",
"mapping": {
"type": "string",
"analyzer": "index_raw",
"copy_to": ["case_insensitive","case_sensitive"],
"fields": {
"case_insensitive": {
"type": "string",
"analyzer": "index_case_insensitive",
"search_analyzer": "search_case_insensitive",
"term_vector": "with_positions_offsets"
},
"case_sensitive": {
"type": "string",
"analyzer": "index_case_sensitive",
"search_analyzer": "search_case_sensitive",
"term_vector": "with_positions_offsets"
}
}
}
}
}
]
}
}
}
'
Data
curl -X POST "http://localhost:9200/custom/test" -d '{ "text" : "tHis .is a! Test" }'
Query
The user searches for: tHis test, which gets split into two parts because ngrams have a maximum length of 8: (1) tHis and (2) test. Part (1) uses the case-sensitive field and part (2) uses the case-insensitive field.
curl -X POST "http://localhost:9200/_search" -d '
{
"size": 1,
"query": {
"bool": {
"must": [
{
"match": {
"case_sensitive": {
"query": "tHis",
"type": "boolean"
}
}
},
{
"match": {
"case_insensitive": {
"query": "test",
"type": "boolean"
}
}
}
]
}
},
"highlight": {
"pre_tags": [
"<em>"
],
"post_tags": [
"</em>"
],
"number_of_fragments": 0,
"require_field_match": false,
"fields": {
"*": {}
}
}
}
'
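A minimal sketch of how a request body like the one above might be generated from the user's input. The exact splitting rules of the real application are not shown in this post, so the windowing of long parts, the regular expression and the function names (split_parts, smartcase_query) are assumptions made for illustration only.

```python
import re

MAX_GRAM = 8  # same value as max_gram of ngram_tokenizer in the settings

# Either a double-quoted phrase (kept whole, spaces included)
# or a run of non-whitespace characters.
PART_RE = re.compile(r'"([^"]*)"|(\S+)')


def split_parts(user_input, max_len=MAX_GRAM):
    """Split the input into parts of at most max_len characters (assumed rules)."""
    parts = []
    for quoted, bare in PART_RE.findall(user_input):
        piece = quoted or bare
        if not piece:
            continue
        if len(piece) <= max_len:
            parts.append(piece)
            continue
        # Longer pieces become windows of max_len characters; the last
        # window is aligned to the end so that no window is shorter.
        starts = list(range(0, len(piece) - max_len, max_len))
        starts.append(len(piece) - max_len)
        parts.extend(piece[s:s + max_len] for s in starts)
    return parts


def smartcase_query(user_input):
    """Parts containing a capital letter go to the case-sensitive field."""
    must = []
    for part in split_parts(user_input):
        has_upper = any(ch.isupper() for ch in part)
        field = "case_sensitive" if has_upper else "case_insensitive"
        must.append({"match": {field: {"query": part, "type": "boolean"}}})
    return {"bool": {"must": must}}
```

With this sketch, smartcase_query("tHis test") yields the two match clauses used in the request above, one on case_sensitive for tHis and one on case_insensitive for test.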
Response
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.057534896,
"hits": [
{
"_index": "custom",
"_type": "test",
"_id": "1",
"_score": 0.057534896,
"_source": {
"text": "tHis .is a! Test"
},
"highlight": {
"text.case_sensitive": [
"<em>tHis</em> .is a! Test"
],
"text.case_insensitive": [
"tHis .is a!<em> Test</em>"
]
}
}
]
}
}
Problem: highlighting
As you can see, the response shows that the smart-case search works very well. However, I also want to give feedback to the user using highlighting. My current setup uses "term_vector": "with_positions_offsets" to generate highlights. This indeed gives back correct highlights. However, the highlights are returned as both case-sensitive and case-insensitive independently.
"highlight": {
"text.case_sensitive": [
"<em>tHis</em> .is a! Test"
],
"text.case_insensitive": [
"tHis .is a!<em> Test</em>"
]
}
This requires me to manually zip multiple highlights on the same field into one combined highlight before returning it to the user. This becomes very painful when highlights become more complicated and can overlap.
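For reference, the manual zipping could look roughly like the sketch below. It assumes that every highlighted variant covers the whole field value (number_of_fragments is 0), that the source text itself never contains the literal tags, and that touching spans may be fused into one; the function names are hypothetical.

```python
import re


def highlight_spans(highlighted, pre="<em>", post="</em>"):
    """Return (plain_text, [(start, end), ...]) for one highlighted string."""
    tag = re.compile(re.escape(pre) + "|" + re.escape(post))
    plain, spans = [], []
    offset = 0      # position in the tag-free text
    cursor = 0      # position in the highlighted string
    open_at = None  # offset at which the current highlight started
    for m in tag.finditer(highlighted):
        chunk = highlighted[cursor:m.start()]
        plain.append(chunk)
        offset += len(chunk)
        cursor = m.end()
        if m.group() == pre:
            open_at = offset
        elif open_at is not None:
            spans.append((open_at, offset))
            open_at = None
    plain.append(highlighted[cursor:])
    return "".join(plain), spans


def merge_highlights(variants, pre="<em>", post="</em>"):
    """Combine several highlighted copies of the same text into one."""
    text, spans = None, []
    for variant in variants:
        plain, found = highlight_spans(variant, pre, post)
        if text is None:
            text = plain
        elif plain != text:
            raise ValueError("highlights do not cover the same text")
        spans.extend(found)
    if text is None:
        return ""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:  # overlapping or touching
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    out, cursor = [], 0
    for start, end in merged:
        out.append(text[cursor:start])
        out.append(pre + text[start:end] + post)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)
```

Feeding it the two highlights shown above returns "<em>tHis</em> .is a!<em> Test</em>". It works for simple cases, but it is exactly the kind of client-side bookkeeping I would like to avoid.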
Question
Is there an alternative setup that returns the combined highlight directly? I.e. I would like to have this as part of my response:
"highlight": {
"text": [
"<em>tHis</em> .is a!<em> Test</em>"
]
}

Attempt
Make use of highlight query to get merged result:
curl -XPOST 'http://localhost:9200/_search' -d '
{
"size": 1,
"query": {
"bool": {
"must": [
{
"match": {
"case_sensitive": {
"query": "tHis",
"type": "boolean"
}
}
},
{
"match": {
"case_insensitive": {
"query": "test",
"type": "boolean"
}
}
}
]
}
},
"highlight": {
"pre_tags": [
"<em>"
],
"post_tags": [
"</em>"
],
"number_of_fragments": 0,
"require_field_match": false,
"fields": {
"*.case_insensitive": {
"highlight_query": {
"bool": {
"must": [
{
"match": {
"*.case_insensitive": {
"query": "tHis",
"type": "boolean"
}
}
},
{
"match": {
"*.case_insensitive": {
"query": "test",
"type": "boolean"
}
}
}
]
}
}
}
}
}
}
'
Response
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.9364339,
"hits": [
{
"_index": "custom",
"_type": "test",
"_id": "1",
"_score": 0.9364339,
"_source": {
"text": "tHis .is a! Test"
},
"highlight": {
"text.case_insensitive": [
"<em>tHis</em> .is a!<em> Test</em>"
]
}
}
]
}
}
Warning
When ingesting the following document, note the additional lower-case word this:
curl -X POST "http://localhost:9200/custom/test" -d '{ "text" : "tHis this .is a! Test" }'
The response to the same query becomes:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.9364339,
"hits": [
{
"_index": "custom",
"_type": "test",
"_id": "1",
"_score": 0.9364339,
"_source": {
"text": "tHis this .is a! Test"
},
"highlight": {
"text.case_insensitive": [
"<em>tHis</em><em> this</em> .is a!<em> Test</em>"
]
}
}
]
}
}
As you can see, the highlight now also includes the lower-case this. For such a test example, we do not mind. However, for complicated queries, the user might (and probably will) get confused about when and how smart-case has any effect, especially when the lower-case match would include a field that only matches on lower-case.
Conclusion
This solution will give you all highlights merged as one, but might include unwanted results.
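The highlight section used in this attempt can be generated from the same parts as the query. The sketch below is only an illustration of that structure: the helper name merged_highlight and its field_pattern parameter are made up here, and every part is deliberately sent to the case-insensitive sub-field, which is what causes the extra matches described in the warning.

```python
import json


def merged_highlight(parts, field_pattern="*.case_insensitive"):
    """Build a highlight section whose highlight_query matches every part
    against the case-insensitive sub-fields only."""
    clauses = [
        {"match": {field_pattern: {"query": part, "type": "boolean"}}}
        for part in parts
    ]
    return {
        "pre_tags": ["<em>"],
        "post_tags": ["</em>"],
        "number_of_fragments": 0,
        "require_field_match": False,
        "fields": {
            field_pattern: {"highlight_query": {"bool": {"must": clauses}}}
        },
    }


if __name__ == "__main__":
    # Prints the "highlight" part of the request body as JSON.
    print(json.dumps({"highlight": merged_highlight(["tHis", "test"])}, indent=2))
```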

Related

Trying to form an Elasticsearch query for autocomplete

I've read a lot and it seems that using EdgeNGrams is a good way to go for implementing an autocomplete feature for search applications. I've already configured the EdgeNGrams in my settings for my index.
PUT /bigtestindex
{
"settings":{
"analysis":{
"analyzer":{
"autocomplete":{
"type":"custom",
"tokenizer":"standard",
"filter":[ "standard", "stop", "kstem", "ngram" ]
}
},
"filter":{
"edgengram":{
"type":"ngram",
"min_gram":2,
"max_gram":15
}
},
"highlight": {
"pre_tags" : ["<em>"],
"post_tags" : ["</em>"],
"fields": {
"title.autocomplete": {
"number_of_fragments": 1,
"fragment_size": 250
}
}
}
}
}
}
So if in my settings I have the EdgeNGram filter configured how do I add that to the search query?
What I have so far is a match query with highlight:
GET /bigtestindex/doc/_search
{
"query": {
"match": {
"content": {
"query": "thing and another thing",
"operator": "and"
}
}
},
"highlight": {
"pre_tags" : ["<em>"],
"post_tags" : ["</em>"],
"field": {
"_source.content": {
"number_of_fragments": 1,
"fragment_size": 250
}
}
}
}
How would I add autocomplete to the search query using EdgeNGrams configured in the settings for the index?
UPDATE
For the mapping, would it be ideal to do something like this:
"title": {
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "standard"
},
Or do I need to use multi_field type:
"title": {
"type": "multi_field",
"fields": {
"title": {
"type": "string"
},
"autocomplete": {
"analyzer": "autocomplete",
"type": "string",
"index": "not_analyzed"
}
}
},
I'm using ES 1.4.1 and want to use the title field for autocomplete purposes.
Short answer: you need to use it in a field mapping. As in:
PUT /test_index
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"standard",
"stop",
"kstem",
"ngram"
]
}
},
"filter": {
"edgengram": {
"type": "ngram",
"min_gram": 2,
"max_gram": 15
}
}
}
},
"mappings": {
"doc": {
"properties": {
"field1": {
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
}
For a bit more discussion, see:
http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams
and
http://blog.qbox.io/an-introduction-to-ngrams-in-elasticsearch
Also, I don't think you want the "highlight" section in your index definition; that belongs in the query.
EDIT: Upon trying out your code, there are a couple of problems with it. One was the highlight issue I already mentioned. Another is that you named your filter "edgengram", even though it is of type "ngram" rather than type "edgeNGram", but then you referenced the filter "ngram" in your analyzer, which will use the default ngram filter, which probably doesn't give you what you want. (Hint: you can use term vectors to figure out what your analyzer is doing to your documents; you probably want to turn them off in production, though.)
So what you actually want is probably something like this:
PUT /test_index
{
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"standard",
"stop",
"kstem",
"edgengram_filter"
]
}
},
"filter": {
"edgengram_filter": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 15
}
}
}
},
"mappings": {
"doc": {
"properties": {
"content": {
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
}
When I indexed these two docs:
POST test_index/doc/_bulk
{"index":{"_id":1}}
{"content":"hello world"}
{"index":{"_id":2}}
{"content":"goodbye world"}
And ran this query (there was an error in your "highlight" block as well; it should have said "fields" rather than "field"):
POST /test_index/doc/_search
{
"query": {
"match": {
"content": {
"query": "good wor",
"operator": "and"
}
}
},
"highlight": {
"pre_tags": [
"<em>"
],
"post_tags": [
"</em>"
],
"fields": {
"content": {
"number_of_fragments": 1,
"fragment_size": 250
}
}
}
}
I get back this response, which seems to be what you're looking for, if I understand you correctly:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.2712221,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "2",
"_score": 0.2712221,
"_source": {
"content": "goodbye world"
},
"highlight": {
"content": [
"<em>goodbye</em> <em>world</em>"
]
}
}
]
}
}
Here is some code I used to test it out:
http://sense.qbox.io/gist/3092992993e0328f7c4ee80e768dd508a0bc053f
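To make the difference between the two filter types mentioned in the EDIT more concrete, here is a small offline sketch in plain Python. It does not call Elasticsearch; the two functions only imitate what an ngram-style and an edgeNGram-style filter emit for a single token, and the order of the emitted grams is illustrative. The min/max values of 1 and 2 for the unconfigured ngram filter are, as far as I know, its defaults.

```python
def ngrams(token, min_gram, max_gram):
    """All substrings of length min_gram..max_gram (ngram-style)."""
    return [
        token[i:i + n]
        for i in range(len(token))
        for n in range(min_gram, max_gram + 1)
        if i + n <= len(token)
    ]


def edge_ngrams(token, min_gram, max_gram):
    """Only prefixes of length min_gram..max_gram (edgeNGram-style)."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


# Referencing the plain "ngram" filter: grams of length 1 and 2 only.
default_ngram = ngrams("world", 1, 2)
# ['w', 'wo', 'o', 'or', 'r', 'rl', 'l', 'ld', 'd']

# The configured edge n-gram filter: prefixes of length 2 to 15.
edge = edge_ngrams("world", 2, 15)
# ['wo', 'wor', 'worl', 'world']
```

The query term "wor" appears in the second list but not in the first, which is why the autocomplete query only behaves as intended once the analyzer references the edge n-gram filter.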

edge_ngram filter and not analyzed to match search

I have the following elastic search configuration:
PUT /my_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
},
"snow_filter" : {
"type" : "snowball",
"language" : "English"
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"snow_filter",
"autocomplete_filter"
]
}
}
}
}
}
PUT /my_index/_mapping/my_type
{
"my_type": {
"properties": {
"name": {
"type": "multi_field",
"fields": {
"name": {
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "snowball"
},
"not": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "name": "Brown foxes" }
{ "index": { "_id": 2 }}
{ "name": "Yellow furballs" }
{ "index": { "_id": 3 }}
{ "name": "my discovery" }
{ "index": { "_id": 4 }}
{ "name": "myself is fun" }
{ "index": { "_id": 5 }}
{ "name": ["foxy", "foo"] }
{ "index": { "_id": 6 }}
{ "name": ["foo bar", "baz"] }
I am trying to get a search to only return item 6 that has a name of "foo bar" and I am not quite sure how. This is what I am doing right now:
GET /my_index/my_type/_search
{
"query": {
"match": {
"name": {
"query": "foo b"
}
}
}
}
I know it has to do with how the tokenizer splits the words, but I'm sort of lost on how to be both flexible and strict enough to match this. I am guessing I need to use multiple fields in my mapping of name, but I am not sure. How can I fix the query and/or my mapping to satisfy my needs?
You're already close. Since your edge_ngram analyzer generates tokens of a minimum length of 1, and your query gets tokenized into "foo" and "b", and the default match query operator is "or", your query matches each document that has a term starting with "b" (or "foo"), three of the docs.
Using the "and" operator seems to do what you want:
POST /my_index/my_type/_search
{
"query": {
"match": {
"name": {
"query": "foo b",
"operator": "and"
}
}
}
}
...
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1.4451914,
"hits": [
{
"_index": "test_index",
"_type": "my_type",
"_id": "6",
"_score": 1.4451914,
"_source": {
"name": [
"foo bar",
"baz"
]
}
}
]
}
}
Here's the code I used to test it:
http://sense.qbox.io/gist/4f6fb7c1fdc6942023091ee1433d7490e04e7dea
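The "or" versus "and" behavior described in this answer can also be reproduced offline with a rough simulation. The sketch below is plain Python, not Elasticsearch: it lowercases, splits on word characters and builds edge n-grams of length 1 to 20, and it leaves out the snowball stemming step as a simplification. The helper names are made up for the example.

```python
import re

DOCS = {
    1: ["Brown foxes"],
    2: ["Yellow furballs"],
    3: ["my discovery"],
    4: ["myself is fun"],
    5: ["foxy", "foo"],
    6: ["foo bar", "baz"],
}


def words(text):
    """Lowercase and split on word characters (stemming is omitted)."""
    return re.findall(r"\w+", text.lower())


def index_terms(values, min_gram=1, max_gram=20):
    """Edge n-grams of every word of every value, as the index would hold them."""
    terms = set()
    for value in values:
        for word in words(value):
            terms.update(word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1))
    return terms


def search(query, operator="or"):
    """Ids of the docs matching any ("or") or all ("and") query terms."""
    wanted = words(query)
    combine = all if operator == "and" else any
    return sorted(
        doc_id
        for doc_id, values in DOCS.items()
        if combine(term in index_terms(values) for term in wanted)
    )


print(search("foo b"))         # [1, 5, 6]
print(search("foo b", "and"))  # [6]
```

Doc 1 matches the "or" query only because "b" is an edge n-gram of "brown", and doc 5 only because of "foo"; requiring both terms leaves doc 6.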

Full text search for exact match_phrase (with leading and trailing whitespace) in elasticsearch

I'm new to Elasticsearch and here is my task at hand.
Given my index:
{
"my_index": {
"mappings": {
"_default_": {
"_all": {
"enabled": false
},
"properties": {}
},
"title": {
"_all": {
"enabled": false
},
"properties": {
"foo_id": {
"type": "long"
},
"title": {
"type": "string",
"analyzer": "english"
}
}
}
},
"settings": {
...
}
}
}
And sample records:
{"foo_id": 777, "title": "Equality"}
{"foo_id": 777, "title": "First Among Equals"}
{"foo_id": 777, "title": "AN EQUAL MUSIC"}
I would like to search for records that must:
have foo_id == 777
contain case-insensitive word "equal"
Meaning, I must find only the third record, which contains the exact phrase "equal". Titles containing the words "equality" and "equals" must not be returned. I'd like to avoid resorting to regexp.
I tried searching like this:
{
"query": {
"bool": {
"must": [
{"term": {"foo_id": 777}},
{"match_phrase": {"title": "equal"}}
]
}
}
}
but it returns all three results.
Additional question: how can I get results in the most efficient way, given that I don't care about relevancy of the results? Should I use search_type='scan' with scroll or maybe filtering? A snippet would be nice. Thanks.
Currently you're using the english analyser:
"title": {
"type": "string",
"analyzer": "english"
}
If you don't want to do stemming etc. (to avoid picking up "equals", "equality") then switch to a simpler analyser. For example use the Standard or Simple analyser instead - or even create your own.
"title": {
"type": "string",
"analyzer": "standard"
}
Once set up, use a match or query_string query to find the relevant document.
If you want to retain the stemming analyser but also support an alternative form of analysis, then you should use multi-fields.
For example:
"title": {
"type": "string",
"analyzer": "english",
"fields": {
"std": { "type": "string", "analyzer": "standard" }
}
}
When you want to do a search using the standard analyser, use the field title.std
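To illustrate why the stemmed field returns all three titles while an unstemmed sub-field returns only one, here is a toy simulation in plain Python. The toy_stem function is a crude stand-in written for this example and is not the algorithm the english analyser actually uses; it merely reduces "equality" and "equals" to "equal" the way a real stemmer would.

```python
import re

TITLES = ["Equality", "First Among Equals", "AN EQUAL MUSIC"]


def standard_terms(text):
    """Roughly what a standard-style analyser keeps: lowercased words."""
    return re.findall(r"\w+", text.lower())


def toy_stem(term):
    # Crude suffix stripping, for illustration only (NOT a real stemmer).
    for suffix in ("ity", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[:-len(suffix)]
    return term


def stemmed_terms(text):
    """Stand-in for a stemming analyser such as english."""
    return [toy_stem(term) for term in standard_terms(text)]


def term_matches(term, analyze):
    """Titles whose indexed terms contain the given term exactly."""
    return [title for title in TITLES if term in analyze(title)]


print(term_matches("equal", stemmed_terms))   # all three titles
print(term_matches("equal", standard_terms))  # ['AN EQUAL MUSIC']
```

Note that the lookup is done with the lowercase term "equal": an exact term lookup is not analysed, so searching for "EQUAL" would find nothing in either variant of this simulation.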
Here's one way you can do it. If you take out the english analyzer, the standard analyzer will be used instead, which seems to give you what you want.
curl -XPUT "http://localhost:9200/my_index" -d'
{
"settings": {
"number_of_shards": 2,
"number_of_replicas": 1
},
"mappings": {
"_default_": {
"_all": {
"enabled": false
},
"properties": {}
},
"title": {
"_all": {
"enabled": false
},
"properties": {
"foo_id": {
"type": "long"
},
"title": {
"type": "string"
}
}
}
}
}'
Then add the docs:
curl -XPUT "http://localhost:9200/my_index/title/1" -d'
{"foo_id": 777, "title": "Equality"}'
curl -XPUT "http://localhost:9200/my_index/title/2" -d'
{"foo_id": 777, "title": "First Among Equals"}'
curl -XPUT "http://localhost:9200/my_index/title/3" -d'
{"foo_id": 777, "title": "AN EQUAL MUSIC"}'
Then you can use a constant score query to avoid extra computation (if you don't care about the ranking of results), combined with a must bool filter to get the results you want:
curl -XPOST "http://localhost:9200/my_index/_search" -d'
{
"query": {
"constant_score": {
"filter": {
"bool": {
"must": [
{"term": {
"foo_id": 777
}},
{"term": {
"title": "equal"
}}
]
}
}
}
}
}'
yielding:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "my_index",
"_type": "title",
"_id": "3",
"_score": 1,
"_source": {
"foo_id": 777,
"title": "AN EQUAL MUSIC"
}
}
]
}
}
Here is the code I used:
http://sense.qbox.io/gist/179d737edf1de964090746a2fdae5ad52c935b31
EDIT: If you want to be able to use the english analyzer as well as the standard analyzer (or some other analyzer, or none, as is often the case for faceting or sorting) you can use a multi_field (deprecated name) as follows:
curl -XPUT "http://localhost:9200/my_index" -d'
{
"settings": {
"number_of_shards": 2,
"number_of_replicas": 1
},
"mappings": {
"_default_": {
"_all": {
"enabled": false
},
"properties": {}
},
"title": {
"_all": {
"enabled": false
},
"properties": {
"foo_id": {
"type": "long"
},
"title": {
"type": "string",
"analyzer": "english",
"fields": {
"unstemmed": {
"type": "string",
"analyzer": "standard"
}
}
}
}
}
}
}'
Now, if you search with { "term": { "title": "equal" } } you will get all three docs, but if you use { "term": { "title.unstemmed": "equal" } } you will get what you want:
curl -XPOST "http://localhost:9200/my_index/_search" -d'
{
"query": {
"constant_score": {
"filter": {
"bool": {
"must": [
{
"term": {
"foo_id": 777
}
},
{
"term": {
"title.unstemmed": "equal"
}
}
]
}
}
}
}
}'
...
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "my_index",
"_type": "title",
"_id": "3",
"_score": 1,
"_source": {
"foo_id": 777,
"title": "AN EQUAL MUSIC"
}
}
]
}
}
Here's the code:
http://sense.qbox.io/gist/40a145e94fd8e47b875525c7e095024f025dd1ab

Querying elasticsearch returns all documents

I wonder why a search for a specific term returns all documents of an index rather than only the documents containing the requested term.
Here's the index and how I set it up:
(using the elasticsearch head-plugin browser-interface)
{
"settings": {
"number_of_replicas": 1,
"number_of_shards": 1,
"analysis": {
"filter": {
"dutch_stemmer": {
"type": "dictionary_decompounder",
"word_list": [
"koud",
"plaat",
"staal",
"fabriek"
]
},
"snowball_nl": {
"type": "snowball",
"language": "dutch"
}
},
"analyzer": {
"dutch": {
"tokenizer": "standard",
"filter": [
"length",
"lowercase",
"asciifolding",
"dutch_stemmer",
"snowball_nl"
]
}
}
}
}
}
{
"properties": {
"test": {
"type": "string",
"fields": {
"dutch": {
"type": "string",
"analyzer": "dutch"
}
}
}
}
}
Then I added some docs:
{"test": "ijskoud"}
{"test": "plaatstaal"}
{"test": "kristalfabriek"}
So now when firing a search for "plaat" somehow one would expect the search would come back with the document containing "plaatstaal".
{
"match": {
"test": "plaat"
}
}
However, sparing me any further searching, elasticsearch returns all documents regardless of their text content.
Is there anything I am missing here?
Funny enough: there is a difference when using GET or POST. While using the latter brings back no hits, GET returns all documents.
Any help is much appreciated.
When you are using GET you do not pass the request body, so search is performed without any filter and all documents are returned.
When you are using POST your search query does get passed on. It doesn't return anything probably because your document is not getting analyzed as you intended it to.
You need to configure your index to use your custom analyzer:
PUT /some_index
{
"settings": {
...
},
"mappings": {
"doc": {
"properties": {
"test": {
"type": "string",
"analyzer": "dutch"
}
}
}
}
}
If you have more fields that use this analyzer and don't want to specify the analyzer for each of them, you can do it like this for a specific type in that index:
"mappings": {
"doc": {
"analyzer": "dutch"
}
}
If you want ALL your types in that index to use your custom analyzer:
"mappings": {
"_default_": {
"analyzer": "dutch"
}
}
To test your analyzer in a simple way:
GET /some_index/_analyze?text=plaatstaal&analyzer=dutch
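As a rough offline illustration of what the decompounder step of this analyzer is expected to emit, the following plain-Python sketch imitates a dictionary-based decompounder. It is an approximation written for this answer: the size limits are the defaults I believe the filter uses, and the other filters in the chain (lowercase, asciifolding, snowball) are left out.

```python
WORD_LIST = {"koud", "plaat", "staal", "fabriek"}
DOCS = ["ijskoud", "plaatstaal", "kristalfabriek"]


def decompound(token, min_word_size=5, min_subword_size=2, max_subword_size=15):
    """Keep the original token and add every dictionary word found inside it."""
    out = [token]
    if len(token) < min_word_size:
        return out
    for start in range(len(token)):
        for size in range(min_subword_size, max_subword_size + 1):
            if start + size > len(token):
                break
            sub = token[start:start + size]
            if sub in WORD_LIST:
                out.append(sub)
    return out


def whole_word(token):
    """What happens without the custom analyzer: the word stays in one piece."""
    return [token]


def hits(query, analyze):
    """Docs sharing at least one term with the analyzed query."""
    query_terms = set(analyze(query))
    return [doc for doc in DOCS if query_terms & set(analyze(doc))]


print(decompound("plaatstaal"))   # ['plaatstaal', 'plaat', 'staal']
print(hits("plaat", decompound))  # ['plaatstaal']
print(hits("plaat", whole_word))  # []
```

The last line mirrors the empty POST result: when the field is not analyzed with the custom analyzer, "plaatstaal" is indexed as a single term and "plaat" never matches it.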
This would be the full list of steps to perform:
DELETE /some_index
PUT /some_index
{
"settings": {
"number_of_replicas": 1,
"number_of_shards": 1,
"analysis": {
"filter": {
"dutch_stemmer": {
"type": "dictionary_decompounder",
"word_list": [
"koud",
"plaat",
"staal",
"fabriek"
]
},
"snowball_nl": {
"type": "snowball",
"language": "dutch"
}
},
"analyzer": {
"dutch": {
"tokenizer": "standard",
"filter": [
"length",
"lowercase",
"asciifolding",
"dutch_stemmer",
"snowball_nl"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"test": {
"type": "string",
"analyzer": "dutch"
}
}
}
}
}
POST /some_index/doc/_bulk
{"index":{}}
{"test": "ijskoud"}
{"index":{}}
{"test": "plaatstaal"}
{"index":{}}
{"test": "kristalfabriek"}
GET /some_index/doc/_search
{
"query": {
"match": {
"test": "plaat"
}
}
}
And the result of the search:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1.987628,
"hits": [
{
"_index": "some_index",
"_type": "doc",
"_id": "jlGkoJWoQfiVGiuT_TUCpg",
"_score": 1.987628,
"_source": {
"test": "plaatstaal"
}
}
]
}
}

ElasticSearch edgeNGram

I have the following settings and analyzer:
PUT /tests
{
"settings": {
"analysis": {
"analyzer": {
"standardWithEdgeNGram": {
"tokenizer": "standard",
"filter": ["lowercase", "edgeNGram"]
}
},
"tokenizer": {
"standard": {
"type": "standard"
}
},
"filter": {
"lowercase": {
"type": "lowercase"
},
"edgeNGram": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 15,
"token_chars": ["letter", "digit"]
}
}
}
},
"mappings": {
"test": {
"_all": {
"analyzer": "standardWithEdgeNGram"
},
"properties": {
"Name": {
"type": "string",
"analyzer": "standardWithEdgeNGram"
}
}
}
}
}
And I posted the following data into it:
POST /tests/test
{
"Name": "JACKSON v. FRENKEL"
}
And here is my query:
GET /tests/test/_search
{
"query": {
"match": {
"Name": "jax"
}
}
}
And I got this result:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.19178301,
"hits": [
{
"_index": "tests",
"_type": "test",
"_id": "lfOxb_5bS86_CMumo_ZLoA",
"_score": 0.19178301,
"_source": {
"Name": "JACKSON v. FRENKEL"
}
}
]
}
}
Can someone explain to me why it still gets a match, even though there is no "jax" anywhere in the "Name"?
Thanks in advance
A match query performs analysis on its given value. By default, "jax" is analyzed with standardWithEdgeNGram, which includes edge n-gram analysis that expands it into ["ja", "jax"], the first of which matches the "ja" produced from the analyzed "JACKSON v. FRENKEL".
If you don't want this behavior you can specify a different analyzer to match, using the analyzer field, for example keyword:
GET /tests/test/_search
{
"query": {
"match": {
"Name": {
"query": "jax",
"analyzer": "keyword"
}
}
}
}
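The effect of the search-time analyzer can be checked offline with a small simulation. The sketch below is plain Python rather than Elasticsearch: it imitates lowercasing, word splitting and edge n-grams of length 2 to 15 for the indexed value, and compares analyzing the query the same way against leaving it as a single keyword. The function names are invented for the example.

```python
import re


def edge_ngrams(word, min_gram=2, max_gram=15):
    """Prefixes of the word with length min_gram..max_gram."""
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]


def standard_with_edge_ngram(text):
    """Imitation of the custom analyzer: lowercase, split, edge n-grams."""
    terms = []
    for word in re.findall(r"\w+", text.lower()):
        terms.extend(edge_ngrams(word))
    return terms


def keyword(text):
    """Imitation of the keyword analyzer: the input stays a single term."""
    return [text]


indexed = set(standard_with_edge_ngram("JACKSON v. FRENKEL"))

# Query analyzed like the indexed text: "ja" is shared, so the doc matches.
shared_default = indexed & set(standard_with_edge_ngram("jax"))

# Query kept as one keyword: "jax" is not an indexed term, so no match.
shared_keyword = indexed & set(keyword("jax"))

print(sorted(shared_default))  # ['ja']
print(sorted(shared_keyword))  # []
```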
In ES 1.3.2, the query below (with "analyzer" placed next to the field name) gave an error:
GET /tests/test/_search
{
"query": {
"match": {
"Name": "jax",
"analyzer" : "keyword"
}
}
}
Error : query parsed in simplified form, with direct field name, but included more options than just the field name, possibly use its 'options' form, with 'query' element?]; }]
status: 400
I fixed the issue as below:
{
"query": {
"query_string": {
"fields": [
"Name"
],
"query": "jax",
"analyzer": "simple"
}
}
}

Resources