Elasticsearch char_filter not affecting search

My understanding of how the char_filter works must be wrong. My goal here is to treat all apostrophes and quote-like characters the same (in this case, remove them entirely) in Elasticsearch. (Apparently there are about five apostrophe-like Unicode characters... and my database has all of them :facepalm:)
Aside: This approach to the solution was inspired by this thread
So here is a toy problem that illustrates my issue.
I create an index with the char_filter, and then populate it with 3 documents:
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"quote_analyzer": {
"char_filter": [
"quotes"
],
"tokenizer": "standard"
}
},
"char_filter": {
"quotes": {
"mappings": [
"\u0091=>",
"\u0092=>",
"\u2018=>",
"\u2019=>"
],
"type": "mapping"
}
}
}
}
}
POST test/_doc
{
"name": "The King’s men",
"id": "1"
}
POST test/_doc
{
"name": "Zoom LeBron the Soldier 7 'King's Pride'",
"id": "2"
}
POST test/_doc
{
"name": "Kings Kings Kings",
"id": "3"
}
As you can see, each document contains some form of the word Kings. I then check that my analyzer is doing what I think it should be doing:
GET test/_analyze
{
"analyzer": "quote_analyzer",
"text": "King’s boat"
}
Which yields:
{
"tokens" : [
{
"token" : "Kings",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "boat",
"start_offset" : 7,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
}
]
}
It appears that the apostrophe in King’s has been removed and the token is Kings. Great! So now I want to search for King’s and since the analyzer is removing the apostrophe I should get all three results. Or at LEAST I would get just id:3 as the apostrophe was removed, and it only matches that Kings Kings Kings without the apostrophe. However, searching for:
GET test/_search
{
"query": {
"match": {
"name": "King’s boat"
}
}
}
Yields:
{
"took" : 1,
// collapsing ....
"hits" : {
// collapsing ....
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "1e2x_38Bn0QWlup8OIvp",
"_score" : 1.1220688,
"_source" : {
"name" : "The King’s men",
"id" : "1"
}
}
]
}
}
Similarly, searching Kings boat only retrieves id:3. And searching King's boat only retrieves id:2.
What am I missing? How do I accomplish the goal of treating all apostrophe characters the same?

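Your char_filter only maps the curly quote characters (and their \u0091/\u0092 variants); the plain ASCII apostrophe (', U+0027) used in King's is not in the list. Please modify your char_filter to accommodate both quotes and apostrophes by adding a mapping for the apostrophe, like you already did for the quotes. Also note that the index above defines quote_analyzer but never assigns it to the name field, so searches on name still go through the default standard analyzer; the field mapping needs "analyzer": "quote_analyzer" for the char_filter to take effect.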
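As a rough illustration of that fix, the sketch below mimics in plain Python what a mapping char_filter does once the ASCII apostrophe is added to the list. It is only a simulation under that assumption, not Elasticsearch code; the names QUOTE_CHARS and strip_quotes are made up for the example.

```python
# Simulation only: mimics the effect of a "mapping" char_filter on raw text.
# The last entry (U+0027, the ASCII apostrophe) is the assumed addition.
QUOTE_CHARS = ["\u0091", "\u0092", "\u2018", "\u2019", "\u0027"]

# Mapping a code point to None deletes it, like "X=>" in the mappings list.
STRIP_TABLE = {ord(ch): None for ch in QUOTE_CHARS}


def strip_quotes(text: str) -> str:
    """Remove every character listed in QUOTE_CHARS from the text."""
    return text.translate(STRIP_TABLE)


# All three spellings from the question collapse to the same word.
for name in ["The King\u2019s men", "'King's Pride'", "Kings Kings Kings"]:
    print(strip_quotes(name))
```

In the index settings, the equivalent change would be one more entry such as "\u0027=>" in the mappings array, with the analyzer applied to the name field at both index and search time.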

Related

How to access mapping sub-fields in painless?

Assuming there is an Elasticsearch mapping:
"mappings": {
"dynamic_templates": [],
"properties": {
"name": {
"type": "text",
"fields": {
"x": {
"type": "completion"
},
"y": {
"type": "keyword"
},
"z": {
"type": "text",
"analyzer": "shingles",
"fielddata": true
}
}
}
}
}
How can I access name.x/y/z data/tokens from painless script?
Below are not working:
ctx._source.name.x
ctx._source['name.fields.x']
ctx._source['name.x']
Oh I see, you want some kind of feedback loop in which whatever is produced in name.z is fed into name.x. First, I think there's a misconception about how multi-fields work. You only ever get to specify the value for the name field, and each sub-field gets its value according to its own type/analyzer; you cannot specify a value directly for any of those sub-fields.
What you should do instead is to specify the shingles analyzer in your completion field directly and that should do the trick.
UPDATE
What you want to do is not possible automatically, and not with multi-fields: you need a standalone completion field, not a sub-field of name. The process you want can be done like this, but it needs to be coded on the client side:
# 1. Index your document
PUT test/_doc/1
{ "name": "the big brown fox" }
# 2. For each indexed document, retrieve the term vectors for name.z
GET test/_termvectors/1?fields=name.z
=>
"term_vectors" : {
"name.z" : {
"field_statistics" : {
"sum_doc_freq" : 7,
"doc_count" : 1,
"sum_ttf" : 7
},
"terms" : {
"big" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 1,
"start_offset" : 4,
"end_offset" : 7
}
]
},
"big brown" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 1,
"start_offset" : 4,
"end_offset" : 13
}
]
},
# 3. Retrieve all the keys from the terms map obtained in step 2
"big", "big brown", ...
# 4. Feed the retrieved tokens back into the document's top-level completion field
POST test/_doc/1/_update
{
"doc": {
"completion": {
"input": [ "big", "big brown", ... ]
}
}
}
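Steps 3 and 4 could be coded on the client roughly as follows. This is a minimal sketch that assumes the _termvectors response has already been fetched and parsed into a dict; the helper name completion_update_body is hypothetical, and sending the body to the update endpoint is left out.

```python
import json


def completion_update_body(termvectors_response: dict, source_field: str) -> dict:
    """Build the partial-update body from a parsed _termvectors response."""
    terms = termvectors_response["term_vectors"][source_field]["terms"]
    # Step 3: the keys of the terms map are the tokens the analyzer produced.
    inputs = sorted(terms)
    # Step 4: feed them into the top-level completion field.
    return {"doc": {"completion": {"input": inputs}}}


# Trimmed-down sample response, shaped like the one shown in step 2.
sample = {
    "term_vectors": {
        "name.z": {
            "terms": {
                "big": {"term_freq": 1},
                "big brown": {"term_freq": 1},
            }
        }
    }
}
print(json.dumps(completion_update_body(sample, "name.z"), indent=2))
```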

How to combine completion, suggestion and match phrase across multiple text fields?

I've been reading about Elasticsearch suggesters, match phrase prefix and highlighting, and I'm a bit confused as to which to use for my problem.
Requirement: I have a bunch of different text fields, and need to be able to autocomplete and autosuggest across all of them, as well as handle misspellings. Basically the way Google works.
In the following Google snapshot, when we start typing "Can", it lists words like Canadian, Canada, etc. This is autocomplete. However, it also lists additional words like tire, post, post tracking, coronavirus, etc. This is autosuggest. It searches for the most relevant word in all fields. If we type "canxad", it should also suggest the same results despite the misspelling.
Could someone please give me some hints on how I can implement the above functionality across a bunch of text fields?
At first I tried this:
GET /myindex/_search
{
"query": {
"match_phrase_prefix": {
"myFieldThatIsCombinedViaCopyTo": "revis"
}
},
"highlight": {
"fields": {
"*": {}
},
"require_field_match" : false
}
}
but it returns highlights like this:
"In the aforesaid revision filed by the members of the Committee, the present revisionist was also party",
So that's not a "prefix" anymore...
Also tried this:
GET /myindex/_search
{
"query": {
"multi_match": {
"query": "revis",
"fields": ["myFieldThatIsCombinedViaCopyTo"],
"type": "phrase_prefix",
"operator": "and"
}
},
"highlight": {
"fields": {
"*": {}
}
}
}
But it still returns
"In the aforesaid revision filed by the members of the Committee, the present revisionist was also party",
Note: I have about 5 "text" fields that I need to search upon. One of those fields is quite long (1000s of words). If I break things up into keywords, I lose the phrase. So it's like I need match phrase prefix across a combined text field, with fuzziness?
EDIT
Here's an example of a document (some fields taken out, content snipped):
{
"id" : 1,
"respondent" : "Union of India",
"caseContent" : "<snip>..against the Union of India, through the ...<snip>"
}
As @Vlad suggested, I tried this:
POST /cases/_search
{
"suggest": {
"respondent-suggest": {
"prefix": "uni",
"completion": {
"field": "respondent.suggest",
"skip_duplicates": true
}
},
"caseContent-suggest": {
"prefix": "uni",
"completion": {
"field": "caseContent.suggest",
"skip_duplicates": true
}
}
}
}
Which returns this:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"suggest" : {
"caseContent-suggest" : [
{
"text" : "uni",
"offset" : 0,
"length" : 3,
"options" : [ ]
}
],
"respondent-suggest" : [
{
"text" : "uni",
"offset" : 0,
"length" : 3,
"options" : [
{
"text" : "Union of India",
"_index" : "cases",
"_type" : "_doc",
"_id" : "dI5hh3IBEqNFLVH6-aB9",
"_score" : 1.0,
"_ignored" : [
"headNote.suggest"
],
"_source" : {
<snip>
}
}
]
}
]
}
}
So looks like it matches on the respondent field, which is great! But, it didn't match on the caseContent field, even though the text (see above) includes the phrase "against the Union of India".. shouldn't it match there? or is it because how the text is broken up?
Since you need autocomplete/suggest on each field, then you need to run a suggest query on each field and not on the copy_to field. That way you're guaranteed to have the proper prefixes.
copy_to fields are great for searching in multiple fields, but not so good for auto-suggest/-complete type of queries.
The idea is that for each of your fields, you should have a completion sub-field so that you can get auto-complete results for each of them.
PUT index
{
"mappings": {
"properties": {
"text1": {
"type": "text",
"fields": {
"suggest": {
"type": "completion"
}
}
},
"text2": {
"type": "text",
"fields": {
"suggest": {
"type": "completion"
}
}
},
"text3": {
"type": "text",
"fields": {
"suggest": {
"type": "completion"
}
}
}
}
}
}
Your suggest queries would then run on all the sub-fields directly:
POST index/_search?pretty
{
"suggest": {
"text1-suggest" : {
"prefix" : "revis",
"completion" : {
"field" : "text1.suggest"
}
},
"text2-suggest" : {
"prefix" : "revis",
"completion" : {
"field" : "text2.suggest"
}
},
"text3-suggest" : {
"prefix" : "revis",
"completion" : {
"field" : "text3.suggest"
}
}
}
}
That takes care of the auto-complete/-suggest part. For misspellings, the suggest queries allow you to specify a fuzzy parameter as well.
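A minimal sketch of how such request bodies could be assembled with the fuzzy option, assuming the completion sub-fields are named <field>.suggest as in the mapping above. The helper fuzzy_suggest_body is hypothetical; it only builds the JSON body and does not talk to Elasticsearch.

```python
import json


def fuzzy_suggest_body(prefix: str, fields: list, fuzziness="AUTO") -> dict:
    """Build one fuzzy completion suggestion per field for the same prefix."""
    suggest = {}
    for field in fields:
        suggest[f"{field}-suggest"] = {
            "prefix": prefix,
            "completion": {
                "field": f"{field}.suggest",
                # The "fuzzy" object enables typo tolerance on the prefix.
                "fuzzy": {"fuzziness": fuzziness},
            },
        }
    return {"suggest": suggest}


print(json.dumps(fuzzy_suggest_body("revis", ["text1", "text2", "text3"]), indent=2))
```

The resulting body would be sent to the index's _search endpoint, like the non-fuzzy request above.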
UPDATE
If you need to do prefix search on all sentences within a body of text, the approach needs to change a bit.
The new mapping below creates a new completion field next to the text one. The idea is to apply a small transformation (i.e. split sentences) to what you're going to store in the completion field. So first create the index mapping like this:
PUT test
{
"mappings": {
"properties": {
"text1": {
"type": "text"
},
"text1Suggest": {
"type": "completion"
}
}
}
}
Then create an ingest pipeline that will populate the text1Suggest field with sentences from the text1 field:
PUT _ingest/pipeline/sentence
{
"processors": [
{
"split": {
"field": "text1",
"target_field": "text1Suggest.input",
"separator": "\\.\\s+"
}
}
]
}
Then we can index a document such as this one (with only the text1 field, as the completion field will be built by the pipeline):
PUT test/_doc/1?pipeline=sentence
{
"text1": "The crazy fox. The quick snail. John goes to the beach"
}
What gets indexed looks like this (your text1 field + another completion field optimized for sentence prefix completion):
{
"text1": "The crazy fox. The quick snail. John goes to the beach",
"text1Suggest": {
"input": [
"The crazy fox",
"The quick snail",
"John goes to the beach"
]
}
}
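To preview what the separator does, the split can be approximated offline with Python's re module. This is only a sketch: the ingest processor uses Java regular expressions, which agree with Python for this simple pattern but may differ for others (for example in how trailing empty strings are handled), and split_sentences is a made-up helper name.

```python
import re

# Same regex as the pipeline's "separator" ("\\.\\s+" once JSON-unescaped):
# a literal dot followed by one or more whitespace characters.
SENTENCE_SEPARATOR = re.compile(r"\.\s+")


def split_sentences(text: str) -> list:
    """Approximate the split processor: cut the text at each separator match."""
    return SENTENCE_SEPARATOR.split(text)


print(split_sentences("The crazy fox. The quick snail. John goes to the beach"))
```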
And finally you can search for prefixes of any sentence. Below we search for John, and you should get a suggestion:
POST test/_search?pretty
{
"suggest": {
"text1-suggest": {
"prefix": "John",
"completion": {
"field": "text1Suggest"
}
}
}
}

Elasticsearch mapping for UK postcodes, able to deal with spacing and capitalization

I am looking for a mapping/analyzer setup for UK postcodes in Elasticsearch 7. We do not require any fuzzy operator, but it should be able to deal with variance in capital letters and spacing.
Some examples:
Query string: "SN13 9ED" should return:
sn139ed
SN13 9ED
Sn13 9ed
but should not return:
SN13 1EP
SN131EP
The keyword analyzer is used by default and this seems to be sensitive to spacing issues, but not to capital letters. It also will return a match for SN13 1EP unless we specify a query as SN13 AND 9ED, which we do not want.
Additionally, with the keyword analyzer, a query of SN13 9ED returns a result of SN13 1EP with a higher relevance than SN13 9ED even though this should be the exact match. Why are 2 matches in the same string a lower relevance than just 1 match?
Mapping for postal code
"post_code": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
Query
{
"query": {
"query_string": {
"query": "KT2 7AJ"
}
}
}
Based on my comments, I believe you have already been able to filter out SN13 1EP when your search string is SN13 9ED.
I hope you are aware of what analysis is, how analyzers work on text fields, and how the standard analyzer is applied by default to produce the tokens that are eventually stored in the inverted index. Note that this applies only to text fields.
Looking at your mapping, if you had searched on post_code rather than post_code.keyword, I believe capitalization would have been resolved: for text fields, ES uses the standard analyzer by default, so your tokens are saved in the index in lowercase, and at query time the same analyzer is applied to the query string before it is looked up in the inverted index.
Note that by default, the analyzer configured in the mapping is applied at index time as well as at search time on that field.
For scenarios where you have sn131ep, I've made use of the Pattern Capture Token Filter with a regex that breaks the token into two parts of lengths 4 and 3, which are then saved in the inverted index; in this case they would be sn13 and 1ep. I'm also lowercasing them before they are stored in the inverted index.
Note that I'm assuming your postcodes have a fixed size of 7 characters. You can add more patterns if that is not the case.
Please see below for more details:
Mapping:
PUT my_postcode_index
{
"settings" : {
"analysis" : {
"filter" : {
"mypattern" : {
"type" : "pattern_capture",
"preserve_original" : true,
"patterns" : [
"(\\w{4}+)|(\\w{3}+)", <--- Note this and feel free to add more patterns
"\\s" <--- Filter based on whitespace
]
}
},
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "pattern",
"filter" : [ "mypattern", "lowercase" ] <--- Note the lowercase here
}
}
}
},
"mappings": {
"properties": {
"postcode":{
"type": "text",
"analyzer": "my_analyzer", <--- Note this
"fields":{
"keyword":{
"type": "keyword"
}
}
}
}
}
}
Sample Documents:
POST my_postcode_index/_doc/1
{
"postcode": "SN131EP"
}
POST my_postcode_index/_doc/2
{
"postcode": "sn13 1EP"
}
POST my_postcode_index/_doc/3
{
"postcode": "sn131ep"
}
Note that these documents are semantically the same.
Request Query:
POST my_postcode_index/_search
{
"query": {
"query_string": {
"default_field": "postcode",
"query": "SN13 1EP",
"default_operator": "AND"
}
}
}
Response:
{
"took" : 24,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 0.6246513,
"hits" : [
{
"_index" : "my_postcode_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.6246513,
"_source" : {
"postcode" : "SN131EP"
}
},
{
"_index" : "my_postcode_index",
"_type" : "_doc",
"_id" : "3",
"_score" : 0.6246513,
"_source" : {
"postcode" : "sn131ep"
}
},
{
"_index" : "my_postcode_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.5200585,
"_source" : {
"postcode" : "sn13 1EP"
}
}
]
}
}
Notice that all three documents are returned even with the queries sn131ep and sn13 1ep.
Additional Note:
You can make use of the Analyze API to figure out what tokens are created for a particular text:
POST my_postcode_index/_analyze
{
"analyzer": "my_analyzer",
"text": "sn139ed"
}
And you can see below what tokens are stored in inverted index.
{
"tokens" : [
{
"token" : "sn139ed",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 0
},
{
"token" : "sn13",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 0
},
{
"token" : "9ed",
"start_offset" : 0,
"end_offset" : 7,
"type" : "word",
"position" : 0
}
]
}
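The token output above can be approximated in Python to experiment with other patterns. This is a rough simulation, not Lucene's implementation: it assumes the pattern tokenizer splits on non-word characters (its default), simplifies the possessive \\w{4}+ to \\w{4}, and skips captured groups that were already emitted for the same token. The function name analyze_postcode is hypothetical.

```python
import re

TOKEN_SPLIT = re.compile(r"\W+")          # assumed default of the pattern tokenizer
CAPTURE = re.compile(r"(\w{4})|(\w{3})")  # simplified form of the answer's pattern


def analyze_postcode(text: str) -> list:
    """Approximate my_analyzer: split, capture 4- and 3-char parts, lowercase."""
    tokens = []
    for token in TOKEN_SPLIT.split(text):
        if not token:
            continue
        emitted = [token]  # preserve_original: true keeps the whole token
        for match in CAPTURE.finditer(token):
            for group in match.groups():
                # Skip unmatched groups and parts already emitted for this token.
                if group and group not in emitted:
                    emitted.append(group)
        tokens.extend(part.lower() for part in emitted)  # lowercase filter runs last
    return tokens


print(analyze_postcode("sn139ed"))   # ['sn139ed', 'sn13', '9ed']
print(analyze_postcode("SN13 9ED"))  # ['sn13', '9ed']
```

Both spellings share the tokens sn13 and 9ed, which is why a query with default_operator AND can match either form.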
Also:
You may also want to read about the Ngram Tokenizer. I'd advise you to play around with both solutions and see which best suits your inputs.
Please test it and let me know if you have any queries.
In addition to Opster's answer, the following can also be used to tackle the issue from the opposite angle. Opster's answer suggests splitting the value by a known postcode pattern, which is great.
If we do not know the pattern, the following can be used:
{
"analysis": {
"filter": {
"whitespace_remove": {
"pattern": " ",
"type": "pattern_replace",
"replacement": ""
}
},
"analyzer": {
"no_space_analyzer": {
"filter": [
"lowercase",
"whitespace_remove"
],
"tokenizer": "keyword"
}
}
}
}
{
"post_code": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
},
"analyzer": "no_space_analyzer"
}
}
This allows us to search with any kind of spacing, and with any case due to the lowercase filter.
sn13 1ep, s n 1 3 1 e p, sn131ep will all match against SN13 1EP
I think the main drawback to this option, however, is that we will no longer get any results for sn13, as the whole postcode is now indexed as a single token. sn13* would bring us back results, however.
Is it possible to mix both of these methods together so we can have the best of both worlds?
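To make the behaviour of no_space_analyzer concrete, here is a small Python sketch of the normalization it performs. It is an approximation under the settings shown above (keyword tokenizer, lowercase filter, pattern_replace of a single space); normalize_postcode is a made-up name.

```python
def normalize_postcode(text: str) -> str:
    """Approximate no_space_analyzer: one token, lowercased, spaces removed."""
    # The keyword tokenizer keeps the whole input as a single token,
    # so the two filters simply apply to the full string.
    return text.lower().replace(" ", "")


for raw in ["sn13 1ep", "s n 1 3 1 e p", "sn131ep", "SN13 1EP"]:
    print(raw, "->", normalize_postcode(raw))
```

Because every variant collapses to the single token sn131ep, an exact postcode matches regardless of spacing, while a partial input such as sn13 has no token of its own to match.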

Elasticsearch - extract info between tags in the Highlights field

We have a field in our ElasticSearch index called Terms Matched and we populate that field at query time with the values that are tagged in the Highlights field of a given result. The Highlights field is derived from our field called Free Text, which contains unstructured data. The query is not a match phrase query - it looks for the words in the query to be within a certain distance of each other via a span-multi query.
So right now, an example could look like this:
Query: John Smith
Result:
Free Text: "Once upon a time, John Alexander Smith went to the market..."
Highlights: "Once upon a time, <em>John</em> Alexander <em>Smith</em> went to the market..."
Terms Matched: John Smith
Currently, the Terms Matched field is just a concatenation of the tags from Highlights. What we want to do is have the Terms Matched field return the tags, AND anything between the tags, if there is more than one tag - so in the above example the Terms Matched field would show "John Alexander Smith."
How could we accomplish this in ElasticSearch?
So I think this is working as you would expect.
This is a mapping with a shingle token filter configured. Shingles will produce combinations of searchable tokens (2 to 4 tokens per shingle).
PUT /highlights
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"asciifolding",
"my_shingle"
]
}
},
"filter": {
"my_shingle": {
"type": "shingle",
"max_shingle_size": 4,
"min_shingle_size": 2
}
}
}
},
"mappings": {
"properties": {
"content": {
"type": "text",
"search_analyzer": "standard",
"analyzer": "my_custom_analyzer"
}
}
}
}
Dummy document
PUT /highlights/_doc/1
{
"content": "Once upon a time, John Alexander Smith went to the market..."
}
And basic search query
GET /highlights/_search
{
"query": {
"match": {
"content": "John Smith"
}
},
"highlight": {
"fields": {
"content": {
"type": "plain"
}
}
}
}
This is the response, with correctly (hopefully) highlighted text:
{
"took" : 46,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.8111373,
"hits" : [
{
"_index" : "highlights",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.8111373,
"_source" : {
"content" : "Once upon a time, John Alexander Smith went to the market..."
},
"highlight" : {
"content" : [
"Once upon a time, <em>John Alexander Smith</em> went to the market..."
]
}
}
]
}
}
Yet again, you might need to tweak this quite a lot, but this should put you on the right track.
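If changing the analyzer is not an option, a different route (not part of the answer above) is to derive the span on the client from the highlight fragment that is already returned. The sketch below assumes the default <em> and </em> highlight tags and a single fragment; matched_span is a hypothetical helper.

```python
def matched_span(fragment: str, pre_tag: str = "<em>", post_tag: str = "</em>") -> str:
    """Return the text from the first opening tag to the last closing tag, tags removed."""
    start = fragment.find(pre_tag)
    end = fragment.rfind(post_tag)
    if start == -1 or end == -1 or end < start:
        return ""  # nothing was highlighted in this fragment
    inner = fragment[start + len(pre_tag):end]
    # Drop any tags that sit between the first and the last match.
    return inner.replace(pre_tag, "").replace(post_tag, "")


fragment = "Once upon a time, <em>John</em> Alexander <em>Smith</em> went to the market..."
print(matched_span(fragment))  # John Alexander Smith
```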

Elasticsearch exact matches on analyzed fields

Is there a way to have ElasticSearch identify exact matches on analyzed fields? Ideally, I would like to lowercase, tokenize, stem and perhaps even phoneticize my docs, then have queries pull "exact" matches out.
What I mean is that if I index "Hamburger Buns" and "Hamburgers", they will be analyzed as ["hamburger","bun"] and ["hamburger"]. If I search for "Hamburger", it will only return the "hamburger" doc, as that's the "exact" match.
I've tried using the keyword tokenizer, but that won't stem the individual tokens. Do I need to do something to ensure that the number of tokens is equal or so?
I'm familiar with multi-fields and using the "not_analyzed" type, but this is more restrictive than I'm looking for. I'd like exact matching, post-analysis.
Use the shingle token filter together with stemming and whatever else you need. Add a sub-field of type token_count that will count the number of tokens in the field.
At search time, you need to add a filter that matches the number of tokens in the indexed field against the number of tokens in the search text. This means an additional step when you perform the actual search: counting the tokens in the search string. This is needed because shingles create multiple combinations of tokens, and you need to make sure that the matched one has the same size as your search text.
An attempt for this, just to give you an idea:
{
"settings": {
"analysis": {
"filter": {
"filter_shingle": {
"type": "shingle",
"max_shingle_size": 10,
"min_shingle_size": 2,
"output_unigrams": true
},
"filter_stemmer": {
"type": "porter_stem",
"language": "_english_"
}
},
"analyzer": {
"ShingleAnalyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"snowball",
"filter_stemmer",
"filter_shingle"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"text": {
"type": "string",
"analyzer": "ShingleAnalyzer",
"fields": {
"word_count": {
"type": "token_count",
"store": "yes",
"analyzer": "ShingleAnalyzer"
}
}
}
}
}
}
}
And the query:
{
"query": {
"filtered": {
"query": {
"match_phrase": {
"text": {
"query": "HaMbUrGeRs BUN"
}
}
},
"filter": {
"term": {
"text.word_count": "2"
}
}
}
}
}
The shingles filter is important here because it can create combinations of tokens. And more than that, these are combinations that keep the order of the tokens. IMO, the most difficult requirement to fulfill here is to change the tokens (stemming, lowercasing etc.) and also to assemble back the original text. Unless you define your own "concatenation" filter, I don't think there is any other way than using the shingles filter.
But with shingles there is another issue: it creates combinations that are not needed. For a text like "Hamburgers buns in Los Angeles" you end up with a long list of shingles:
"angeles",
"buns",
"buns in",
"buns in los",
"buns in los angeles",
"hamburgers",
"hamburgers buns",
"hamburgers buns in",
"hamburgers buns in los",
"hamburgers buns in los angeles",
"in",
"in los",
"in los angeles",
"los",
"los angeles"
If you are interested only in documents that match exactly, meaning the document above matches only when you search for "hamburgers buns in los angeles" (and doesn't match something like "any hamburgers buns in los angeles"), then you need a way to filter that long list of shingles. The way I see it is to use word_count.
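The shingle list above, and the reason a token count is needed to single out the full-length one, can be reproduced with a short Python sketch. It only imitates the shingle filter's output for illustration (no stemming, whitespace tokenization assumed); the function name shingles is made up.

```python
def shingles(tokens: list, min_size: int = 2, max_size: int = 10,
             output_unigrams: bool = True) -> list:
    """Generate order-preserving token combinations, like the shingle filter."""
    sizes = list(range(min_size, max_size + 1))
    if output_unigrams:
        sizes = [1] + sizes
    result = []
    for start in range(len(tokens)):
        for size in sizes:
            if start + size <= len(tokens):
                result.append(" ".join(tokens[start:start + size]))
    return result


tokens = "Hamburgers buns in Los Angeles".lower().split()
all_shingles = shingles(tokens)
print(len(all_shingles))  # 15

# Only a shingle with as many words as the document is an "exact" match;
# this is the role the token_count sub-field plays in the query.
exact = [s for s in all_shingles if len(s.split()) == len(tokens)]
print(exact)  # ['hamburgers buns in los angeles']
```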
You can use multi-fields for that purpose and have a not_analyzed sub-field within your analyzed field (let's call it item in this example). Your mapping would have to look like this:
{
"yourtype": {
"properties": {
"item": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
With this kind of mapping, you can check how each of the values Hamburgers and Hamburger Buns are "viewed" by the analyzer with respect to your multi-field item and item.raw
For Hamburger:
curl -XGET 'localhost:9200/yourtypes/_analyze?field=item&pretty' -d 'Hamburger'
{
"tokens" : [ {
"token" : "hamburger",
"start_offset" : 0,
"end_offset" : 10,
"type" : "<ALPHANUM>",
"position" : 1
} ]
}
curl -XGET 'localhost:9200/yourtypes/_analyze?field=item.raw&pretty' -d 'Hamburger'
{
"tokens" : [ {
"token" : "Hamburger",
"start_offset" : 0,
"end_offset" : 10,
"type" : "word",
"position" : 1
} ]
}
For Hamburger Buns:
curl -XGET 'localhost:9200/yourtypes/_analyze?field=item&pretty' -d 'Hamburger Buns'
{
"tokens" : [ {
"token" : "hamburger",
"start_offset" : 0,
"end_offset" : 10,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "buns",
"start_offset" : 11,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 2
} ]
}
curl -XGET 'localhost:9200/yourtypes/_analyze?field=item.raw&pretty' -d 'Hamburger Buns'
{
"tokens" : [ {
"token" : "Hamburger Buns",
"start_offset" : 0,
"end_offset" : 15,
"type" : "word",
"position" : 1
} ]
}
As you can see, the not_analyzed field is going to be indexed untouched exactly as it was input.
Now, let's index two sample documents to illustrate this:
curl -XPOST localhost:9200/yourtypes/_bulk -d '
{"index": {"_type": "yourtype", "_id": 1}}
{"item": "Hamburger"}
{"index": {"_type": "yourtype", "_id": 2}}
{"item": "Hamburger Buns"}
'
And finally, to answer your question, if you want to have an exact match on Hamburger, you can search within your sub-field item.raw like this (note that the case has to match, too):
curl -XPOST localhost:9200/yourtypes/yourtype/_search -d '{
"query": {
"term": {
"item.raw": "Hamburger"
}
}
}'
And you'll get:
{
...
"hits" : {
"total" : 1,
"max_score" : 0.30685282,
"hits" : [ {
"_index" : "yourtypes",
"_type" : "yourtype",
"_id" : "1",
"_score" : 0.30685282,
"_source":{"item": "Hamburger"}
} ]
}
}
UPDATE (see comments/discussion below and question re-edit)
Taking your example from the comments and trying to have HaMbUrGeR BuNs match Hamburger buns you could simply achieve it with a match query like this.
curl -XPOST localhost:9200/yourtypes/yourtype/_search?pretty -d '{
"query": {
"match": {
"item": {
"query": "HaMbUrGeR BuNs",
"operator": "and"
}
}
}
}'
Which based on the same two indexed documents above will yield
{
...
"hits" : {
"total" : 1,
"max_score" : 0.2712221,
"hits" : [ {
"_index" : "yourtypes",
"_type" : "yourtype",
"_id" : "2",
"_score" : 0.2712221,
"_source":{"item": "Hamburger Buns"}
} ]
}
}
You can keep the analyzer as you intended (lowercase, tokenize, stem, ...), and use query_string as the main query with match_phrase as a boosting query. Something like this:
{
"bool" : {
"should" : [
{
"query_string" : {
"default_field" : "your_field",
"default_operator" : "OR",
"phrase_slop" : 1,
"query" : "Hamburger"
}
},
{
"match_phrase": {
"your_field": {
"query": "Hamburger"
}
}
}
]
}
}
It will match both documents, and the exact match (match_phrase) will be on top, since it matches both should clauses (and gets a higher score).
default_operator is set to OR, which lets the query "Hamburger Buns" (match hamburger OR bun) also match the document "Hamburger".
phrase_slop is set to 1 to match terms with distance = 1 only, e.g. a search for Hamburger Buns will not match the document Hamburger Big Buns. You can adjust this depending on your requirements.
You can refer to "Closer is better" and "Query string" for more details.
