We're using ElasticSearch completion suggester with the Standard Analyzer, but it seems like the text is not tokenized.
e.g.
Texts: "First Example", "Second Example"
Search: "Fi" returns "First Example"
While
Search: "Ex" doesn't return any result returns "First Example"
As the Elasticsearch documentation says about the Completion Suggester:
The completion suggester is a so-called prefix suggester.
So when you send a keyword, it will look for the prefix of your texts.
E.g:
Search: "Fi" => "First Example"
Search: "Sec" => "Second Example"
But if you give Elasticsearch "Ex", it returns nothing, because it cannot find a text that begins with "Ex".
You can try other suggesters, such as the Term Suggester; a minimal request is sketched below.
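For reference, a bare-bones term suggester request could look like this (the index name and the title field are placeholders rather than part of the index above; the term suggester proposes spelling corrections from an ordinary analyzed text field, not prefix completions):
POST /my-index/_search
{
"suggest": {
"my-term-suggestion": {
"text": "Exampl",
"term": {
"field": "title"
}
}
}
}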
A great workaround is to tokenize the string yourself and put the tokens in a separate tokens field.
You can then use 2 suggestions in your suggest query to search both fields.
Example:
PUT /example
{
"mappings": {
"doc": {
"properties": {
"full": {
"type": "completion"
},
"tokens": {
"type": "completion"
}
}
}
}
}
POST /example/doc/_bulk
{ "index":{} }
{"full": {"input": "First Example"}, "tokens": {"input": ["First", "Example"]}}
{ "index":{} }
{"full": {"input": "Second Example"}, "tokens": {"input": ["Second", "Example"]}}
POST /example/_search
{
"suggest": {
"full-suggestion": {
"prefix" : "Ex",
"completion" : {
"field" : "full",
"fuzzy": true
}
},
"token-suggestion": {
"prefix": "Ex",
"completion" : {
"field" : "tokens",
"fuzzy": true
}
}
}
}
Search result:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 0,
"max_score": 0,
"hits": []
},
"suggest": {
"full-suggestion": [
{
"text": "Ex",
"offset": 0,
"length": 2,
"options": []
}
],
"token-suggestion": [
{
"text": "Ex",
"offset": 0,
"length": 2,
"options": [
{
"text": "Example",
"_index": "example",
"_type": "doc",
"_id": "Ikvk62ABd4o_n4U8G5yF",
"_score": 2,
"_source": {
"full": {
"input": "First Example"
},
"tokens": {
"input": [
"First",
"Example"
]
}
}
},
{
"text": "Example",
"_index": "example",
"_type": "doc",
"_id": "I0vk62ABd4o_n4U8G5yF",
"_score": 2,
"_source": {
"full": {
"input": "Second Example"
},
"tokens": {
"input": [
"Second",
"Example"
]
}
}
}
]
}
]
}
}
One approach to hack in suggestions from every position of the string is to shingle the string, keep only the shingles at position 0, and from each shingle keep only the last token.
PUT example
{
"settings": {
"index.max_shingle_diff": 10,
"analysis": {
"filter": {
"after_last_space": {
"type": "pattern_replace",
"pattern": "(.* )",
"replacement": ""
},
"preserve_only_first": {
"type": "predicate_token_filter",
"script": {
"source": "token.position == 0"
}
},
"big_shingling": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 10,
"output_unigrams": true
}
},
"analyzer": {
"dark_magic": {
"tokenizer": "standard",
"filter": [
"lowercase",
"big_shingling",
"preserve_only_first",
"after_last_space"
]
}
}
}
},
"mappings": {
"properties": {
"suggest": {
"type": "completion",
"analyzer": "dark_magic",
"search_analyzer": "standard"
}
}
}
}
This hack works for short strings (up to 10 tokens in the example).
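You can sanity-check what the analyzer emits with the _analyze API; a quick verification against the index above:
POST /example/_analyze
{
"analyzer": "dark_magic",
"text": "Second Example"
}
This should emit the tokens "second" and "example", both at position 0, which is what lets the completion field suggest the document for "Ex" as well as "Se".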
Related
I'm trying to search with a string that contains multiple comma-separated strings. [It might not match the whole field value; it can be partial, but the passed item should appear in the text.]
Note: I have tried n-grams as well, which do not give me the right data.
(Example: the search term "Data Science" returns everything matching "Data", "Science", or "data science".)
Doc In ES:
{
"_index": "questions_dev",
"_type": "_doc",
"_id": "188",
"_score": 6.6311107,
"_source": {
"questionId": 188,
"questionText": "What other social media platforms do you use on your own time?",
"domainId": 2,
"subdomainId": 25,
"type": "TEXT",
"difficulty": 1,
"time": 600,
"domain": "Domain Specific",
"subdomain": "Social Media Specialist",
"skill": ["social media"]
}
}
What I have done so far:
Index:
{
"settings": {
"number_of_shards": 1,
"analysis": {
"analyzer": {
"default": {
"tokenizer": "custom_tokenizer",
"filter": ["lowercase"]
}
},
"tokenizer": {
"custom_tokenizer": {
"type": "pattern",
"pattern": ",",
},
}
}
},
"mappings": {
"properties": {
"questionId": {
"type": "long"
},
"questionText": {
"type": "text",
},
"domain": {
"type": "text"
},
"subdomain": {
"type": "text"
},
"type":{
"type": "keyword"
},
"difficulty":{
"type": "keyword"
},
"totaltime":{
"type": "keyword"
},
"domainId":{
"type": "keyword"
},
"subdomainId":{
"type": "keyword"
}
}
}
}
Query:
{
"query": {
"bool": {
"should": [
{
"multi_match": {
"fields": ["questionText","skill"],
"query": "social media"
}
}
]
}
}
}
Output:
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
}
}
Expected Output:
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 6.6311107,
"hits": [
{
"_index": "questions_development",
"_type": "_doc",
"_id": "188",
"_score": 6.6311107,
"_source": {
"questionId": 188,
"questionText": "What other social media platforms do you use on your own time?",
"domainId": 2,
"subdomainId": 25,
"type": "TEXT",
"difficulty": 1,
"time": 600,
"domain": "Domain Specific",
"subdomain": "Social Media Specialist",
"skill": []
}
}
]
}
}
Goal:
Search with a string and return all the docs that contain that string.
Example:
If I search with "social media", it should return the doc above.
(In my case it is not returning.)
This search should also support a comma-separated search mechanism,
which means I can pass "social media, own time" and expect the output to contain docs whose questionText contains any of these strings.
The data you are indexing, social media, own time, contains a whitespace between the comma and own time. So the tokens generated with your previous mapping are:
{
"tokens": [
{
"token": " social media", <-- note the preceding whitespace here
"start_offset": 0,
"end_offset": 12,
"type": "word",
"position": 0
},
{
"token": " own time", <-- note the preceding whitespace here
"start_offset": 13,
"end_offset": 22,
"type": "word",
"position": 1
}
]
}
Therefore, when the search query uses "query": "social media" with no leading whitespace, no search results are shown. However, if you query with "query": " social media" (including a leading whitespace), the search result will be there.
To remove leading and trailing whitespace from each token in the stream, you can use the trim token filter.
Adding working example with index data, mapping and search query
Index Mapping:
{
"settings": {
"number_of_shards": 1,
"analysis": {
"analyzer": {
"default": {
"tokenizer": "custom_tokenizer",
"filter": [
"lowercase",
"trim" <-- note this
]
}
},
"tokenizer": {
"custom_tokenizer": {
"type": "pattern",
"pattern": ",",
"filter": [
"trim" <-- note this
]
}
}
}
},
"mappings": {
"properties": {
"questionText": {
"type": "text"
}
}
}
}
Index Data:
{ "questionText": "social media" }
{ "questionText": "social media, own time" }
Search Query:
{
"query": {
"bool": {
"should": [
{
"multi_match": {
"fields": [
"questionText"
],
"query": "own time" <-- no whitespace included in the
beginning
}
}
]
}
}
}
Search Result:
"hits": [
{
"_index": "my-index",
"_type": "_doc",
"_id": "2",
"_score": 0.60996956,
"_source": {
"questionText": "social media, own time"
}
}
Update 1:
Index Settings
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "pattern",
"pattern": ","
}
}
}
}
}
Index Data:
{
"questionText": "What other platforms do you use on your ?"
}
{
"questionText": "What other social time platforms do you use on your?"
}
{
"questionText": "What other social media platforms do you use on your?"
}
{
"questionText": "What other platforms do you use on your own time?"
}
Search Query:
{
"query": {
"bool": {
"should": [
{
"multi_match": {
"fields": "questionText",
"query": "social media, own time"
}
}
]
}
}
}
Search Result
"hits": [
{
"_index": "my-index3",
"_type": "_doc",
"_id": "1",
"_score": 2.5628972,
"_source": {
"questionText": "What other social media platforms do you use on your own time?"
}
},
{
"_index": "my-index3",
"_type": "_doc",
"_id": "2",
"_score": 1.3862944,
"_source": {
"questionText": "What other social media platforms do you use on your?"
}
},
{
"_index": "my-index3",
"_type": "_doc",
"_id": "3",
"_score": 1.3862944,
"_source": {
"questionText": "What other platforms do you use on your own time?"
}
}
]
I've set up a normalizer on an index field to support case-insensitive searches, but I can't seem to get it to work.
GET users/
Returns the following mapping:
{
"users": {
"aliases": {},
"mappings": {
"user": {
"properties": {
"active": {
"type": "boolean"
},
"first_name": {
"type": "keyword",
"fields": {
"normalize": {
"type": "keyword",
"normalizer": "search_normalizer"
}
}
}
}
}
},
"settings": {
"index": {
"number_of_shards": "5",
"provided_name": "users",
"creation_date": "1567936315432",
"analysis": {
"normalizer": {
"search_normalizer": {
"filter": [
"lowercase"
],
"type": "custom"
}
}
},
"number_of_replicas": "1",
"uuid": "5SknFdwJTpmF",
"version": {
"created": "6040299"
}
}
}
}
}
Although first_name is normalized to lowercase, queries on the first_name field are case sensitive.
Using the following query for a user with first name Dave
GET users/_search
{
"query": {
"bool": {
"should": [
{
"regexp": {
"first_name": {
"value": ".*dave.*"
}
}
}
]
}
}
}
GET users/_analyze
{
"analyzer" : "standard",
"text": "Dave"
}
returns
{
"tokens": [
{
"token": "dave",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
}
]
}
Although "Dave" is tokenized to "dave" the following query
GET users/_search
{
"query": {
"match": {
"first_name": "dave"
}
}
}
Returns no hits.
Is there an issue with my current mapping? or the query?
I think you have missed first_name.normalize in the query.
Indexing Records
{"first_name": "Daveraj"}
{"index": {}}
{"first_name": "RajdaveN"}
{"index": {}}
{"first_name": "Dave"}
Query
"query": {
"bool": {
"should": [
{
"regexp": {
"first_name.normalize": {
"value": ".*dave.*"
}
}
}
]
}
}
}
Result
"took": 10,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1.0,
"hits": [
{
"_index": "test3",
"_type": "test3_type",
"_id": "M8-lEG0BLCpzI1hbBWYC",
"_score": 1.0,
"_source": {
"first_name": "Dave"
}
},
{
"_index": "test3",
"_type": "test3_type",
"_id": "Mc-lEG0BLCpzI1hbBWYC",
"_score": 1.0,
"_source": {
"first_name": "Daveraj"
}
},
{
"_index": "test3",
"_type": "test3_type",
"_id": "Ms-lEG0BLCpzI1hbBWYC",
"_score": 1.0,
"_source": {
"first_name": "RajdaveN"
}
}
]
}
}
You have created a normalized multi-field, first_name.normalize, but you are searching on the original field first_name, which doesn't have any analyzer specified (it will default to the index-default or standard analyzer).
The examples given here might help:
https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html
You need to explicitly specify the multi-field you want to search on. Note that even though a multi-field can't have its own content, it indexes different terms than its parent (although not always), because it may be analyzed with different analyzers/char/token filters.
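For example, the match query from the question, pointed at the normalized multi-field instead (a sketch based on the mapping above), should find "Dave" regardless of case, because both the indexed value and the query term are lowercased by the normalizer:
GET users/_search
{
"query": {
"match": {
"first_name.normalize": "dave"
}
}
}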
I am basically trying to write a query that should return the document where
school is "holy international" AND grade is "second".
The issue with the current query is that it is not honouring the must match part: even though I don't specify the school, it still gives me this document, which should not be a match.
The query is giving me all the documents where the grade is "second".
I want only the document where school is "holy international" AND grade is "second".
Also, I have not specified anything in the match query for "schools.school", but it is still giving me results.
mapping
{
"settings": {
"analysis": {
"analyzer": {
"my_keyword_lowercase1": {
"tokenizer": "keyword",
"filter": ["lowercase", "my_pattern_replace1", "trim"]
},
"my_keyword_lowercase2": {
"tokenizer": "standard",
"filter": ["lowercase", "trim"]
}
},
"filter": {
"my_pattern_replace1": {
"type": "pattern_replace",
"pattern": ".",
"replacement": ""
}
}
}
},
"mappings": {
"test_data": {
"properties": {
"schools": {
"type": "nested",
"properties": {
"school": {
"type": "string",
"analyzer": "my_keyword_lowercase1"
},
"grade": {
"type": "string",
"analyzer": "my_keyword_lowercase2"
}
}
}
}
}
}
}
data
{
"_index": "data_index",
"_type": "test_data",
"_id": "57a33ebc1d41",
"_version": 1,
"found": true,
"_source": {
"summary": null,
"schools": [{
"school": "little flower",
"grade": "first",
"date": "2007-06-01",
},
{
"school": "holy international",
"grade": "second",
"date": "2007-06-01",
},
],
"first_name": "Adam",
"location": "Kansas City",
"last_name": "Roger",
"country": "US",
"name": "Adam Roger",
}
}
query
{
"_source": ["first_name"],
"query": {
"nested": {
"path": "schools",
"inner_hits": {
"_source": {
"includes": [
"schools.school",
"schools.grade"
]
}
},
"query": {
"bool": {
"must": {
"match": {
"schools.school": {
"query": "" <-----X didnt specify anything
}
}
},
"filter": {
"match": {
"schools.grade": {
"query": "second",
"operator": "and",
"minimum_should_match": "100%"
}
}
}
}
}
}
}
}
result
{
"took": 26,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.2876821,
"hits": [
{
"_index": "data_test",
"_type": "test_data",
"_id": "57a33ebc1d41",
"_score": 0.2876821,
"_source": {
"first_name": "Adam"
},
"inner_hits": {
"schools": {
"hits": {
"total": 1,
"max_score": 0.2876821,
"hits": [
{
"_nested": {
"field": "schools",
"offset": 0
},
"_score": 0.2876821,
"_source": {
"schools": {
"school": "holy international",
"grade": "second"
}
}
}
]
}
}
}
}
]
}
}
So, basically your problem is the analysis step; when I loaded everything and checked, it became very clear.
This filter completely wipes all strings from the schools.school field:
"filter": {
"my_pattern_replace1": {
"type": "pattern_replace",
"pattern": ".",
"replacement": ""
}
}
I think that's happening because . is a regex metacharacter that matches any character, so when I checked it:
POST /_analyze
{
"field": "schools.school",
"text": "holy international"
}
{
"tokens": [
{
"token": "",
"start_offset": 0,
"end_offset": 18,
"type": "word",
"position": 0
}
]
}
That's why you always get a match: every string you passed at indexing time and at search time becomes "". Some additional info from the Elastic docs: https://www.elastic.co/guide/en/elasticsearch/reference/5.1/analysis-pattern_replace-tokenfilter.html
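If the intent was to strip literal dots rather than every character, escaping the dot in the pattern would achieve that instead of removing the filter entirely; a sketch, not part of the original mapping:
"filter": {
"my_pattern_replace1": {
"type": "pattern_replace",
"pattern": "\\.",
"replacement": ""
}
}
With this, "holy international" is indexed unchanged, while a value like "st. mary" would become "st mary".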
After I removed the pattern_replace filter, this query returns everything as expected:
{
"_source": ["first_name"],
"query": {
"nested": {
"path": "schools",
"inner_hits": {
"_source": {
"includes": [
"schools.school",
"schools.grade"
]
}
},
"query": {
"bool": {
"must": {
"match": {
"schools.school": {
"query": "holy international"
}
}
},
"filter": {
"match": {
"schools.grade": {
"query": "second"
}
}
}
}
}
}
}
}
Context
I am trying to support smart-case search within our application which uses elasticsearch. The use case I want to support is to be able to partially match on any blob of text using smart-case semantics. I managed to configure my index in such a way that I am capable of simulating smart-case search. It uses ngrams of max length 8 to not overload storage requirements.
The way it works is that each document has both a generated case-sensitive field and a case-insensitive field, populated via copy_to, each with its own indexing strategy. When searching on a given input, I split the input into parts; this depends on the ngram length, whitespace, and double-quote escaping. Each part is checked for capital letters. When a capital letter is found, a match filter is generated for that part against the case-sensitive field; otherwise the case-insensitive field is used.
This has proven to work very nicely, however I am having difficulties with getting highlighting to work the way I would like. To better explain the issue, I added an overview of my test setup below.
Settings
curl -X DELETE localhost:9200/custom
curl -X PUT localhost:9200/custom -d '
{
"settings": {
"analysis": {
"filter": {
"default_min_length": {
"type": "length",
"min": 1
},
"squash_spaces": {
"type": "pattern_replace",
"pattern": "\\s{2,}",
"replacement": " "
}
},
"tokenizer": {
"ngram_tokenizer": {
"type": "nGram",
"min_gram": "2",
"max_gram": "8"
}
},
"analyzer": {
"index_raw": {
"type": "custom",
"filter": ["lowercase","squash_spaces","trim","default_min_length"],
"tokenizer": "keyword"
},
"index_case_insensitive": {
"type": "custom",
"filter": ["lowercase","squash_spaces","trim","default_min_length"],
"tokenizer": "ngram_tokenizer"
},
"search_case_insensitive": {
"type": "custom",
"filter": ["lowercase","squash_spaces","trim"],
"tokenizer": "keyword"
},
"index_case_sensitive": {
"type": "custom",
"filter": ["squash_spaces","trim","default_min_length"],
"tokenizer": "ngram_tokenizer"
},
"search_case_sensitive": {
"type": "custom",
"filter": ["squash_spaces","trim"],
"tokenizer": "keyword"
}
}
}
},
"mappings": {
"_default_": {
"_all": { "enabled": false },
"date_detection": false,
"dynamic_templates": [
{
"case_insensitive": {
"match_mapping_type": "string",
"match": "case_insensitive",
"mapping": {
"type": "string",
"analyzer": "index_case_insensitive",
"search_analyzer": "search_case_insensitive"
}
}
},
{
"case_sensitive": {
"match_mapping_type": "string",
"match": "case_sensitive",
"mapping": {
"type": "string",
"analyzer": "index_case_sensitive",
"search_analyzer": "search_case_sensitive"
}
}
},
{
"text": {
"match_mapping_type": "string",
"mapping": {
"type": "string",
"analyzer": "index_raw",
"copy_to": ["case_insensitive","case_sensitive"],
"fields": {
"case_insensitive": {
"type": "string",
"analyzer": "index_case_insensitive",
"search_analyzer": "search_case_insensitive",
"term_vector": "with_positions_offsets"
},
"case_sensitive": {
"type": "string",
"analyzer": "index_case_sensitive",
"search_analyzer": "search_case_sensitive",
"term_vector": "with_positions_offsets"
}
}
}
}
}
]
}
}
}
'
Data
curl -X POST "http://localhost:9200/custom/test" -d '{ "text" : "tHis .is a! Test" }'
Query
The user searches for tHis test, which gets split into two parts since ngrams are at most 8 in length: (1) tHis and (2) test. For (1) the case-sensitive field is used, and (2) uses the case-insensitive field.
curl -X POST "http://localhost:9200/_search" -d '
{
"size": 1,
"query": {
"bool": {
"must": [
{
"match": {
"case_sensitive": {
"query": "tHis",
"type": "boolean"
}
}
},
{
"match": {
"case_insensitive": {
"query": "test",
"type": "boolean"
}
}
}
]
}
},
"highlight": {
"pre_tags": [
"<em>"
],
"post_tags": [
"</em>"
],
"number_of_fragments": 0,
"require_field_match": false,
"fields": {
"*": {}
}
}
}
'
Response
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.057534896,
"hits": [
{
"_index": "custom",
"_type": "test",
"_id": "1",
"_score": 0.057534896,
"_source": {
"text": "tHis .is a! Test"
},
"highlight": {
"text.case_sensitive": [
"<em>tHis</em> .is a! Test"
],
"text.case_insensitive": [
"tHis .is a!<em> Test</em>"
]
}
}
]
}
}
Problem: highlighting
As you can see, the response shows that the smart-case search works very well. However, I also want to give feedback to the user using highlighting. My current setup uses "term_vector": "with_positions_offsets" to generate highlights. This indeed gives back correct highlights. However, the highlights are returned as both case-sensitive and case-insensitive independently.
"highlight": {
"text.case_sensitive": [
"<em>tHis</em> .is a! Test"
],
"text.case_insensitive": [
"tHis .is a!<em> Test</em>"
]
}
This requires me to manually zip multiple highlights on the same field into one combined highlight before returning it to the user. This becomes very painful when highlights become more complicated and can overlap.
Question
Is there an alternative setup to actually get back the combined highlight? I.e., I would like to have this as part of my response:
"highlight": {
"text": [
"<em>tHis</em> .is a!<em> Test</em>"
]
}
Attempt
Make use of highlight query to get merged result:
curl -XPOST 'http://localhost:9200/_search' -d '
{
"size": 1,
"query": {
"bool": {
"must": [
{
"match": {
"case_sensitive": {
"query": "tHis",
"type": "boolean"
}
}
},
{
"match": {
"case_insensitive": {
"query": "test",
"type": "boolean"
}
}
}
]
}
},
"highlight": {
"pre_tags": [
"<em>"
],
"post_tags": [
"</em>"
],
"number_of_fragments": 0,
"require_field_match": false,
"fields": {
"*.case_insensitive": {
"highlight_query": {
"bool": {
"must": [
{
"match": {
"*.case_insensitive": {
"query": "tHis",
"type": "boolean"
}
}
},
{
"match": {
"*.case_insensitive": {
"query": "test",
"type": "boolean"
}
}
}
]
}
}
}
}
}
}
'
Response
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.9364339,
"hits": [
{
"_index": "custom",
"_type": "test",
"_id": "1",
"_score": 0.9364339,
"_source": {
"text": "tHis .is a! Test"
},
"highlight": {
"text.case_insensitive": [
"<em>tHis</em> .is a!<em> Test</em>"
]
}
}
]
}
}
Warning
When ingesting the following document, note the additional lower-case this:
curl -X POST "http://localhost:9200/custom/test" -d '{ "text" : "tHis this .is a! Test" }'
The response to the same query becomes:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.9364339,
"hits": [
{
"_index": "custom",
"_type": "test",
"_id": "1",
"_score": 0.9364339,
"_source": {
"text": "tHis this .is a! Test"
},
"highlight": {
"text.case_insensitive": [
"<em>tHis</em><em> this</em> .is a!<em> Test</em>"
]
}
}
]
}
}
As you can see, the highlight now also includes the lower-case this. For such a test example, we do not mind. However, for complicated queries, the user might (and probably will) get confused when and how the smart-case has any effect. Especially when the lower-case match would include a field that only matches on lower-case.
Conclusion
This solution will give you all highlights merged as one, but might include unwanted results.
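Another avenue that may be worth testing is the fast vector highlighter's matched_fields option, which merges matches from several sub-fields into a single highlighted field. Both sub-fields in the mapping above already carry term_vector: with_positions_offsets, which the fvh highlighter requires; whether it avoids the unwanted lower-case matches described above would still need to be verified. A sketch of just the highlight section:
"highlight": {
"require_field_match": false,
"fields": {
"text.case_insensitive": {
"type": "fvh",
"matched_fields": ["text.case_insensitive", "text.case_sensitive"]
}
}
}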
I have the following Elasticsearch configuration:
PUT /my_index
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
},
"snow_filter" : {
"type" : "snowball",
"language" : "English"
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"snow_filter",
"autocomplete_filter"
]
}
}
}
}
}
PUT /my_index/_mapping/my_type
{
"my_type": {
"properties": {
"name": {
"type": "multi_field",
"fields": {
"name": {
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "snowball"
},
"not": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "name": "Brown foxes" }
{ "index": { "_id": 2 }}
{ "name": "Yellow furballs" }
{ "index": { "_id": 3 }}
{ "name": "my discovery" }
{ "index": { "_id": 4 }}
{ "name": "myself is fun" }
{ "index": { "_id": 5 }}
{ "name": ["foxy", "foo"] }
{ "index": { "_id": 6 }}
{ "name": ["foo bar", "baz"] }
I am trying to get a search to only return item 6 that has a name of "foo bar" and I am not quite sure how. This is what I am doing right now:
GET /my_index/my_type/_search
{
"query": {
"match": {
"name": {
"query": "foo b"
}
}
}
}
I know it's a combination of how the tokenizer is splitting the word, but I'm somewhat lost on how to be both flexible enough and strict enough to match this. I am guessing I need to use a multi-field on my name mapping, but I am not sure. How can I fix the query and/or my mapping to satisfy my needs?
You're already close. Since your edge_ngram analyzer generates tokens of a minimum length of 1, and your query gets tokenized into "foo" and "b", and the default match query operator is "or", your query matches each document that has a term starting with "b" (or "foo"), three of the docs.
Using the "and" operator seems to do what you want:
POST /my_index/my_type/_search
{
"query": {
"match": {
"name": {
"query": "foo b",
"operator": "and"
}
}
}
}
...
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1.4451914,
"hits": [
{
"_index": "test_index",
"_type": "my_type",
"_id": "6",
"_score": 1.4451914,
"_source": {
"name": [
"foo bar",
"baz"
]
}
}
]
}
}
Here's the code I used to test it:
http://sense.qbox.io/gist/4f6fb7c1fdc6942023091ee1433d7490e04e7dea
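As a side note, if you ever need an exact, unanalyzed match on the full value rather than prefix-style matching, the not_analyzed sub-field already defined in the mapping (name.not) can be queried with a term query; a sketch:
GET /my_index/my_type/_search
{
"query": {
"term": {
"name.not": "foo bar"
}
}
}
This only matches documents whose name contains exactly "foo bar" (case-sensitive, since that sub-field is not analyzed).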