Confusing query_string search results - elasticsearch

I've got Elasticsearch set up and am running queries against it, but I'm getting odd results, and can't figure out why:
For example, here's one relevant portion of my mapping:
"classification": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
},
And then here are some of the queries and results. For all of these, there are objects with a classification value of "Jewelry & Adornment":
Query:
"query": {
"bool": {
"must": [
{
"match_all": {}
},
{
"query_string": {
"query": "(classification:/jewel.*/)"
}
}
]
}
}
Result:
"hits": {
"total": 2541,
"max_score": 1.4142135,
"hits": [
{
...
Yet if I add "ry":
Query:
"query_string": {
"query": "(classification:/jewelry.*/)"
}
Result:
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
I've also tried running these queries:
"query_string": {
"query": "(classification\\*:/jewelry.*/)"
}
(should match either "classification" or "classification.raw")
And:
"query_string": {
"query": "(classification.raw:/jewelry.*/)"
}
I've also tried case variations, e.g. "Jewelry" vs. "jewelry", to no effect. All of these return no results. This makes no sense to me. Even when querying "classification.raw" with "Jewelry" (same case, and on a completely unanalyzed field), I get no results. Any ideas?
UPDATE
As per #keety's request, here's the analysis of "Jewelry & Adornment".
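A request along these lines reproduces it (the index name myindex is a placeholder; default_index is the analyzer from my settings):
POST /myindex/_analyze?analyzer=default_index
{
"text": ["Jewelry & Adornment"]
}
The output: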
{
"tokens": [
{
"token": "jewelri",
"start_offset": 0,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "adorn",
"start_offset": 10,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
}
]
}
I imagine the fact that it's stemming "jewelry" to "jewelri" is my problem, but I'm not sure why it's doing that or how to fix it.
UPDATE #2
These are the analyzers in play:
"analyzer": {
"default_index": {
"type": "custom",
"tokenizer": "icu_tokenizer",
"filter": [
"icu_folding",
"custom_stem",
"porter_stem",
"index_filter"
],
"char_filter": [
"html_strip",
"quotes"
]
},
"default_search": {
"type": "custom",
"tokenizer": "icu_tokenizer",
"filter": [
"icu_folding",
"custom_stem",
"porter_stem",
"search_filter"
],
"char_filter": [
"html_strip",
"quotes"
]
}
}
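(The porter_stem filter above is what reduces "jewelry" to "jewelri". For what it's worth, a keyword_marker filter placed before the stemmers can protect specific terms from stemming. A sketch only, with a made-up filter name:)
"filter": {
"protect_words": {
"type": "keyword_marker",
"keywords": ["jewelry"]
}
},
"analyzer": {
"default_index": {
"type": "custom",
"tokenizer": "icu_tokenizer",
"filter": [
"icu_folding",
"protect_words",
"custom_stem",
"porter_stem",
"index_filter"
]
}
}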
UPDATE #3
I ran an _explain query on one of the objects that should be matching but isn't, and got the following:
"matched": false,
"explanation": {
"value": 0,
"description": "Failure to meet condition(s) of required/prohibited clause(s)",
"details": [
{
"value": 0.70710677,
"description": "ConstantScore(*:*), product of:",
"details": [
{
"value": 1,
"description": "boost"
},
{
"value": 0.70710677,
"description": "queryNorm"
}
]
},
{
"value": 0,
"description": "no match on required clause (ConstantScore())"
}
]
}
I don't know what "required clause (ConstantScore())" refers to. The only related thing I can find is the Constant Score Query, but I'm not employing that particular query anywhere.
UPDATE #4
Okay, this is getting a little long-winded. Sorry about that. However, I just discovered that the problem seems to lie in using the regex syntax. If I just use a basic wildcard (along with "analyze_wildcard": true), then all my queries start working.
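For reference, the working form looks roughly like this (same bool structure as before):
"query_string": {
"query": "classification:jewelry*",
"analyze_wildcard": true
}
With "analyze_wildcard": true the text before the wildcard goes through the search analyzer, so "jewelry*" should become "jewelri*" and match the stemmed token, whereas the regex syntax is never analyzed.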

Related

Elasticsearch Term suggester is not returning correct suggestions when one character is missing (instead of misspelling)

I'm using the Elasticsearch term suggester for spell correction. My index contains a huge list of ads. Each ad has subject and body fields. I've found a problematic example for which the suggester is not returning correct suggestions.
I have lots of ads whose subject contains the word "soffa" and also 5 ads whose subject contains the word "sofa". Ideally, when I send "sofa" (the misspelling) as text to the suggester, it should return "soffa" (the correct spelling) as a suggestion, since "soffa" is the correct spelling and most ads contain "soffa" while only a few contain the misspelled "sofa".
Here is my suggester query body :
{
"suggest": {
"text": "sofa",
"subjectSuggester": {
"term": {
"field": "subject",
"suggest_mode": "popular",
"min_word_length": 1
}
}
}
}
When I send above query, I get below response :
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"suggest": {
"subjectSuggester": [
{
"text": "sof",
"offset": 0,
"length": 4,
"options": [
{
"text": "soff",
"score": 0.6666666,
"freq": 298
},
{
"text": "sol",
"score": 0.6666666,
"freq": 101
},
{
"text": "saf",
"score": 0.6666666,
"freq": 6
}
]
}
]
}
}
As you can see in the above response, it returned "soff" but not "soffa", although I have lots of docs whose subject contains "soffa".
I even played with parameters like suggest_mode and string_distance, but still no luck.
I also used the phrase suggester instead of the term suggester, but got the same result. Here is my phrase suggester query:
{
"suggest": {
"text": "sofa",
"subjectuggester": {
"phrase": {
"field": "subject",
"size": 10,
"gram_size": 3,
"direct_generator": [
{
"field": "subject.trigram",
"suggest_mode": "always",
"min_word_length":1
}
]
}
}
}
}
I somehow think it doesn't work when one character is missing, as opposed to being misspelled. In the "soffa" example, one "f" is missing.
It works fine for misspellings, e.g. for "vovlo": when I send "vovlo" it gives me "volvo".
Any help would be hugely appreciated.
Try changing the "string_distance".
{
"suggest": {
"text": "sof",
"subjectSuggester": {
"term": {
"field": "title",
"min_word_length":2,
"string_distance":"ngram"
}
}
}
}
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html#term-suggester
I've found a workaround myself.
I added a shingle filter and an analyzer with max_shingle_size 3 (i.e. trigrams), then added a subfield with that analyzer (trigram) and ran the suggester query on that subfield (instead of the actual field), and it worked.
Here are the mapping changes:
{
"settings": {
"analysis": {
"filter": {
"shingle": {
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 3
}
},
"analyzer": {
"trigram": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"shingle"
],
"char_filter": [
"diacritical_marks_filter"
]
}
}
}
},
"mappings": {
"properties": {
"subject": {
"type": "text",
"fields": {
"trigram": {
"type": "text",
"analyzer": "trigram"
}
}
}
}
}
}
And here is my corrected query:
{
"suggest": {
"text": "sofa",
"subjectSuggester": {
"term": {
"field": "subject.trigram",
"suggest_mode": "popular",
"min_word_length": 1,
"string_distance": "ngram"
}
}
}
}
Note that I'm running the suggester against subject.trigram instead of subject itself.
Here is the result :
{
"suggest": {
"subjectSuggester": [
{
"text": "sofa",
"offset": 0,
"length": 4,
"options": [
{
"text": "soffa",
"score": 0.8,
"freq": 282
},
{
"text": "soffan",
"score": 0.6666666,
"freq": 5
},
{
"text": "som",
"score": 0.625,
"freq": 102
},
{
"text": "sol",
"score": 0.625,
"freq": 82
},
{
"text": "sony",
"score": 0.625,
"freq": 50
}
]
}
]
}
}
As you can see above, "soffa" appears as the first suggestion.
There is something weird in your result from the term suggester for the word "sofa"; take a look at the text that is being corrected:
"suggest": {
"subjectSuggester": [
{
"text": "sof",
"offset": 0,
"length": 4,
"options": [
{
"text": "soff",
"score": 0.6666666,
"freq": 298
},
{
"text": "sol",
"score": 0.6666666,
"freq": 101
},
{
"text": "saf",
"score": 0.6666666,
"freq": 6
}
]
}
]
}
As you can see, it's "sof" and not "sofa", which means the correction is not for "sofa" but for "sof". So I doubt this issue is related to the analyzer you were using on this field, especially when looking at the result "soff" instead of "soffa": it's removing the last "a".
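To check what the field's analyzer actually does to the input, an _analyze call against the field should show the tokens (the index name my_index is a placeholder):
POST /my_index/_analyze
{
"field": "subject",
"text": "sofa"
}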

how to exclude search words in synonyms filter in elasticsearch

While adding "table" and "tables" as a synonym filter in Elasticsearch, I need to filter out the results for "table fan". How can I achieve this in Elasticsearch?
Could we build a taxonomy of inclusion and exclusion list filters in the settings, rather than in run-time queries, in Elasticsearch?
GET <indexName>/_search
{
"query": {
"bool": {
"must_not": [
{
"match": {
"<fieldName>": {
"query": "table fan", // <======= Below operator will applied b/w table(&synonyms) And fan(&synonyms)
"operator": "AND"
}
}
}
]
}
}
}
You can use the above query to exclude all documents containing both 'table' and 'fan' (and their corresponding synonyms).
OR:
If you want to play with multiple logical operators, e.g. "give me all the documents which don't contain either "table fan" or "ac"", you can use simple_query_string:
GET <indexName>/_search
{
"query": {
"bool": {
"must_not": [
{
"simple_query_string": {
"query": "(table + fan) | ac", // <=== '+'='and', '|'='or', '-'='not'
"fields": [
"<fieldName>" // <==== use multiple field names, wildcard also supported
]
}
}
]
}
}
}
Adding a working example with index data, mapping, search query and search result
Index Mapping:
{
"settings": {
"index": {
"analysis": {
"filter": {
"synonym_filter": {
"type": "synonym",
"synonyms": [
"table, tables"
]
}
},
"analyzer": {
"synonym_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"synonym_filter"
]
}
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "synonym_analyzer",
"search_analyzer": "standard"
}
}
}
}
Analyze API
POST /_analyze
{
"analyzer" : "synonym_analyzer",
"text" : "table fan"
}
The following tokens are generated:
{
"tokens": [
{
"token": "table",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "tables",
"start_offset": 0,
"end_offset": 5,
"type": "SYNONYM",
"position": 0
},
{
"token": "fan",
"start_offset": 6,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 1
}
]
}
Index Data:
{ "title": "table and fan" }
{ "title": "tables and fan" }
{ "title": "table fan" }
{ "title": "tables fan" }
{ "title": "table chair" }
Search Query:
{
"query": {
"bool": {
"must": {
"match": {
"title": "table"
}
},
"filter": {
"bool": {
"must_not": [
{
"match_phrase": {
"title": "table fan"
}
},
{
"match_phrase": {
"title": "table and fan"
}
}
]
}
}
}
}
}
You can also use a match query in place of the match_phrase query:
{
"query": {
"bool": {
"must": {
"match": {
"title": "table"
}
},
"filter": {
"bool": {
"must_not": [
{
"match": {
"title": {
"query": "table fan",
"operator": "AND"
}
}
}
]
}
}
}
}
}
Search Result:
"hits": [
{
"_index": "synonym",
"_type": "_doc",
"_id": "2",
"_score": 0.06783115,
"_source": {
"title": "table chair"
}
}
]
Update 1:
Could we build a taxonomy of inclusion and exclusion list filters in the settings, rather than in run-time queries, in Elasticsearch?
Mapping is the process of defining how a document, and the fields it contains, are stored and indexed. Refer to the ES documentation on mapping to understand what mapping is used to define.
Please refer to the documentation on dynamic templates, which allow you to define custom mappings that can be applied to dynamically added fields.
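For illustration, a sketch of a dynamic template that would apply the synonym_analyzer defined above to any dynamically added string field (the template name is made up):
PUT <indexName>
{
"mappings": {
"dynamic_templates": [
{
"strings_with_synonyms": {
"match_mapping_type": "string",
"mapping": {
"type": "text",
"analyzer": "synonym_analyzer",
"search_analyzer": "standard"
}
}
}
]
}
}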

Elasticsearch - Stop analyzer doesn't allow number

I'm trying to build a search utility using Elasticsearch 6.3.0 where any term can be searched within the database. I have applied the stop analyzer to exclude some of the generic words. However, with that analyzer in place, the system stopped returning terms containing numbers as well.
For example, if I search for news24, it removes the 24 and searches only for the term "news" in all records. I'm unsure why.
Below is the query I am using:
{
"from": 0,
"size": 10,
"explain": false,
"stored_fields": [
"_source"
],
"query": {
"function_score": {
"query": {
"multi_match": {
"query": "news24",
"analyzer": "stop",
"fields": [
"title",
"keywords",
"url"
]
}
},
"functions": [
{
"script_score": {
"script": "( (doc['isSponsered'].value == 'y') ? 100 : 0 )"
}
},
{
"script_score": {
"script": "doc['linksCount'].value"
}
}
],
"score_mode": "sum",
"boost_mode": "sum"
}
},
"script_fields": {
"custom_score": {
"script": {
"lang": "painless",
"source": "params._source.linksArray"
}
}
},
"highlight": {
"pre_tags": [
""
],
"post_tags": [
"<\/span>"
],
"fields": {
"title": {
"type": "plain"
},
"keywords": {
"type": "plain"
},
"description": {
"type": "plain"
},
"url": {
"type": "plain"
}
}
}
}
That is because the stop analyzer is just an extension of the Simple Analyzer, which makes use of the Lowercase Tokenizer. That tokenizer simply breaks the input into tokens whenever it encounters a character which is not a letter (of course also lowercasing all the terms).
So basically, if you have something like news24, it breaks it into news when it encounters the 2.
This is the default behaviour of the stop analyzer. If you intend to make use of stop words and still want to keep numerics in the picture, you need to create a custom analyzer as shown below:
Mapping:
POST sometestindex
{
"settings":{
"analysis":{
"analyzer":{
"my_english_analyzer":{
"type":"standard",
"stopwords":"_english_"
}
}
}
}
}
What it does is make use of the Standard Analyzer, which internally uses the Standard Tokenizer and also removes English stop words.
Analysis Query To Test
POST sometestindex/_analyze
{
"analyzer": "my_english_analyzer",
"text": "the name of the channel is news24"
}
Query Result
{
"tokens": [
{
"token": "name",
"start_offset": 4,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "channel",
"start_offset": 16,
"end_offset": 23,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "news24",
"start_offset": 27,
"end_offset": 33,
"type": "<ALPHANUM>",
"position": 6
}
]
}
You can see in the above tokens that news24 is preserved as a token.
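Assuming the index uses these settings, the original multi_match can then reference this analyzer instead of stop (a sketch):
"multi_match": {
"query": "news24",
"analyzer": "my_english_analyzer",
"fields": [
"title",
"keywords",
"url"
]
}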
Hope it helps!

Elasticsearch match certain fields exactly but not others

I need Elasticsearch to match certain fields exactly; I'm currently using multi_match.
For example, a user types in long beach chiropractor.
I want long beach to match the city field exactly, and not return results for seal beach or glass beach.
At the same time chiropractor should also match chiropractic.
Here is the current query I am using:
"query": {
"bool": {
"should": [
{
"multi_match": {
"fields": [
"title",
"location_address_address_1.value",
"location_address_city.value^2",
"location_address_state.value",
"specialty" // e.g. chiropractor
],
"query": "chiropractor long beach",
"boost": 6,
"type": "cross_fields"
}
}
]
}
},
The right approach would be to separate the searched term from the location, and store the location as a keyword type. If that's not possible, you can use a synonym filter to store locations as single tokens, but this requires having a list of all possible locations, e.g.:
{
"settings": {
"analysis": {
"filter": {
"my_synonym_filter": {
"type": "synonym",
"synonyms": [
"long beach=>long-beach"
]
}
},
"analyzer": {
"my_synonyms": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_synonym_filter"
]
}
}
}
}
}
Now if you call
POST /my_index/_analyze?analyzer=my_synonyms
{
"text": ["chiropractor long beach"]
}
the response is
{
"tokens": [
{
"token": "chiropractor",
"start_offset": 0,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "long-beach",
"start_offset": 13,
"end_offset": 23,
"type": "SYNONYM",
"position": 1
}
]
}
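To make this kick in at search time, the analyzer would also need to be wired into the mapping. A sketch, assuming location_address_city is an object with a value subfield, as in the question's query:
"mappings": {
"properties": {
"location_address_city": {
"properties": {
"value": {
"type": "text",
"analyzer": "my_synonyms"
}
}
}
}
}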

elasticsearch context suggester stopwords

Is there a way to analyze a field that is passed to the context suggester?
If, say, I have this in my mapping:
"mappings": {
"myitem": {
"properties": {
"title": {"type": "string"},
"content": {"type": "string"},
"user": {"type": "string", "index": "not_analyzed"},
"suggest_field": {
"type": "completion",
"payloads": false,
"context": {
"user": {
"type": "category",
"path": "user"
}
}
}
}
}
}
and I index this doc:
POST /myindex/myitem/1
{
"title": "The Post Title",
"content": ...,
"user": 123,
"suggest_field": {
"input": "The Post Title",
"context": {
"user": 123
}
}
}
I would like to analyze the input first: split it into separate words and run it through lowercase and stop-word filters, so that the context suggester actually gets
"suggest_field": {
"input": ["post", "title"],
"context": {
"user": 123
}
}
I know I can pass an array into the suggest field, but I would like to avoid lowercasing the text, splitting it, and running the stop-words filter in my application before passing it to ES. If possible, I would rather have ES do this for me. I did try adding an index_analyzer to the field mapping, but that didn't seem to achieve anything.
OR, is there another way to get autocomplete suggestions for words?
Okay, so this is pretty involved, but I think it does what you want, more or less. I'm not going to explain the whole thing, because that would take quite a bit of time. However, I will say that I started with this blog post and added a stop token filter. The "title" field has sub-fields (what used to be called a multi_field) that use different analyzers, or none. The query contains a couple of terms aggregations. Also notice that the aggregations results are filtered by the match query to only return results relevant to the text query.
Here is the index setup (spend some time looking through this; if you have specific questions I will try to answer them but I encourage you to go through the blog post first):
DELETE /test_index
PUT /test_index
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 2,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
},
"stop_filter": {
"type": "stop"
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"stop_filter",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"stop_filter"
]
},
"stopword_only_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"asciifolding",
"stop_filter"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"title": {
"type": "string",
"index_analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
},
"stopword_only": {
"type": "string",
"analyzer": "stopword_only_analyzer"
}
}
}
}
}
}
}
Then I added a few docs:
PUT /test_index/_bulk
{"index": {"_index":"test_index", "_type":"doc", "_id":1}}
{"title": "The Lion King"}
{"index": {"_index":"test_index", "_type":"doc", "_id":2}}
{"title": "Beauty and the Beast"}
{"index": {"_index":"test_index", "_type":"doc", "_id":3}}
{"title": "Alladin"}
{"index": {"_index":"test_index", "_type":"doc", "_id":4}}
{"title": "The Little Mermaid"}
{"index": {"_index":"test_index", "_type":"doc", "_id":5}}
{"title": "Lady and the Tramp"}
Now I can search the documents with word prefixes if I want (or the full words, capitalized or not), and use aggregations to return both the intact titles of the matching documents, as well as intact (non-lowercased) words, minus the stopwords:
POST /test_index/_search?search_type=count
{
"query": {
"match": {
"title": {
"query": "mer king",
"operator": "or"
}
}
},
"aggs": {
"word_tokens": {
"terms": { "field": "title.stopword_only" }
},
"intact_titles": {
"terms": { "field": "title.raw" }
}
}
}
...
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"intact_titles": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "The Lion King",
"doc_count": 1
},
{
"key": "The Little Mermaid",
"doc_count": 1
}
]
},
"word_tokens": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "The",
"doc_count": 2
},
{
"key": "King",
"doc_count": 1
},
{
"key": "Lion",
"doc_count": 1
},
{
"key": "Little",
"doc_count": 1
},
{
"key": "Mermaid",
"doc_count": 1
}
]
}
}
}
Notice that "The" gets returned. This seems to be because the default _english_ stopwords only contain "the". I didn't immediately find a way around this.
Here is the code I used:
http://sense.qbox.io/gist/2fbb8a16b2cd35370f5d5944aa9ea7381544be79
Let me know if that helps you solve your problem.
You can set up an analyzer which does this for you.
If you follow the tutorial called You Complete Me, there is a section about stopwords.
There has been a change in how Elasticsearch works since that article was written: the standard analyzer no longer does stopword removal, so you need to use the stop analyzer instead.
The mapping
curl -X DELETE localhost:9200/hotels
curl -X PUT localhost:9200/hotels -d '
{
"mappings": {
"hotel" : {
"properties" : {
"name" : { "type" : "string" },
"city" : { "type" : "string" },
"name_suggest" : {
"type" : "completion",
"index_analyzer" : "stop",//NOTE HERE THE DIFFERENCE
"search_analyzer" : "stop",//FROM THE ARTICELE!!
"preserve_position_increments": false,
"preserve_separators": false
}
}
}
}
}'
Getting suggestions
curl -X POST localhost:9200/hotels/_suggest -d '
{
"hotels" : {
"text" : "m",
"completion" : {
"field" : "name_suggest"
}
}
}'
Hope this helps. I have spent a long time looking for this answer myself.
