Querying an analyzed field doesn't work without specifying the analyzer in the query - elasticsearch

I'm using elasticsearch 7.14 and I want to perform a query using a custom analyzer. This is the index:
PUT /my-index-001
{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 0
    },
    "analysis": {
      "analyzer": {
        "alphanumeric_only_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "alphanumeric_only_filter"
          ],
          "filter": [
            "lowercase"
          ]
        }
      },
      "char_filter": {
        "alphanumeric_only_filter": {
          "type": "pattern_replace",
          "pattern": "[^A-Za-z0-9]",
          "replacement": ""
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "myField": {
        "type": "text",
        "analyzer": "alphanumeric_only_analyzer",
        "search_analyzer": "alphanumeric_only_analyzer"
      }
    }
  }
}
And 2 documents to test the queries:
POST /my-index-001/_doc
{
  "myField": "asd-9887"
}
POST /my-index-001/_doc
{
  "myField": "asd 9887"
}
Checking the analyzer, it works as expected, producing the token "asd9887":
POST my-index-001/_analyze
{
  "analyzer": "alphanumeric_only_analyzer",
  "text": "aSd 9887"
}
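The response shows the single expected token (the offsets below are what the pattern_replace char filter should map back to the original text; exact output may vary slightly across versions):
{
  "tokens": [
    {
      "token": "asd9887",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}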
Since everything is there and looks fine, let's start querying:
Query1: This finds both documents:
GET /my-index-001/_search
{
  "query": {
    "term": {
      "myField": "asd9887"
    }
  }
}
Query2: This doesn't find any documents:
GET /my-index-001/_search
{
  "query": {
    "term": {
      "myField": "asd 9887"
    }
  }
}
Query3: This finds both documents, but I had to specify which analyzer to use:
GET /my-index-001/_search
{
  "query": {
    "match": {
      "myField": {
        "query": "asd 9887",
        "analyzer": "alphanumeric_only_analyzer"
      }
    }
  }
}
Why am I required to do it this way, given that I created the mapping with search_analyzer set to alphanumeric_only_analyzer?
Is there a way to make Query2 work as is? I don't want my users to have to know analyzer names, and I want them to be able to find both documents when querying any value that, once analyzed, matches the analyzed document value.

Use match query instead of term query
The term query does not analyze the search term; it searches only for the exact term you provide. So it is looking for the literal string "asd 9887" among your tokens.
The match query analyzes the search term with the same analyzer as the field, producing the same tokens. So "asd 9887" is converted to "asd9887" at search time.
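For example, rewriting Query2 as a plain match query should find both documents; no analyzer parameter is needed, since the field's search_analyzer is applied automatically:
GET /my-index-001/_search
{
  "query": {
    "match": {
      "myField": "asd 9887"
    }
  }
}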

Related

elasticsearch n-gram tokenizer with match_phrase not giving expected result

I created an index as follows:
PUT /ngram_tokenizer
{
  "mappings": {
    "properties": {
      "test_name": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  },
  "settings": {
    "index": {
      "max_ngram_diff": 20
    },
    "analysis": {
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit",
            "whitespace",
            "symbol"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      }
    }
  }
}
Then indexed two documents as follows:
POST /ngram_tokenizer/_doc
{
  "test_name": "test document"
}
POST /ngram_tokenizer/_doc
{
  "test_name": "another document"
}
Then I ran a match_phrase query:
GET /ngram_tokenizer/_search
{
  "query": {
    "match_phrase": {
      "test_name": "document"
    }
  }
}
The query above returns both documents as expected, but the query below doesn't return any documents:
GET /ngram_tokenizer/_search
{
  "query": {
    "match_phrase": {
      "test_name": "test"
    }
  }
}
I also checked all the tokens it generates with the following query:
POST ngram_tokenizer/_analyze
{
  "analyzer": "my_analyzer",
  "text": "test document"
}
A match query works fine. Can you guys help me?
Update
When I want to search for a phrase, I have to use a match_phrase query, right? I used the n-gram tokenizer on that field because, if there is a typo in the search term, I can still get a similar doc. I know we can use fuzziness to overcome typo issues in search terms, but when I used fuzziness in match queries or fuzzy queries, there was a scoring issue, as mentioned here. What I actually want is this: when I do a match query, I want to get results even if there is a typo in the search term, and with the match_phrase query I should get proper results at least when I search without any typos.
It's because at search time, the analyzer used to analyze the input text is the same as the one used at indexing time, i.e. my_analyzer, and the match_phrase query is a bit more complex than the match query.
At search time, you should simply use the standard analyzer (or something other than the ngram analyzer) to analyze your query input.
The following query shows how to make it work as you expect.
GET /ngram_tokenizer/_search
{
  "query": {
    "match_phrase": {
      "test_name": {
        "query": "test",
        "analyzer": "standard"
      }
    }
  }
}
You can also specify the standard analyzer as a search_analyzer in your mapping:
"test_name": {
  "type": "text",
  "analyzer": "my_analyzer",
  "search_analyzer": "standard"
}
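To see why the match query worked while match_phrase failed, it can help to compare how the two analyzers tokenize the search term (a quick check; the token list below is illustrative):
POST ngram_tokenizer/_analyze
{
  "analyzer": "standard",
  "text": "test"
}
The standard analyzer emits the single token test, which exists among the indexed 2-20 character ngrams, so the phrase matches. Running the same text through my_analyzer instead yields grams like te, tes, test, es, ... at consecutive positions, and those positions don't line up with the positions of the same grams in the indexed documents, which is why the phrase query returns nothing.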

elasticsearch synonyms analyzer gives 0 results

I am using elasticsearch 7.0.0.
I am trying to get synonyms working with this configuration while creating the index:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "synonym": {
            "tokenizer": "whitespace",
            "filter": [
              "synonym"
            ]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "synonyms.txt"
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "address.state": {
        "type": "text",
        "analyzer": "synonym"
      },
      "location": {
        "type": "geo_point"
      }
    }
  }
}
Here's a document inserted into the index:
{
  "name": "Berry's Burritos",
  "description": "Best burritos in New York",
  "address": {
    "street": "230 W 4th St",
    "city": "New York",
    "state": "NY",
    "zip": "10014"
  },
  "location": [
    40.7543385,
    -73.976313
  ],
  "tags": [
    "mexican",
    "tacos",
    "burritos"
  ],
  "rating": "4.3"
}
Also, the content of synonyms.txt:
ny, new york, big apple
When I try searching for anything in the address.state property, I get an empty result.
Here's the query:
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "rating": {
            "gte": 4
          }
        }
      },
      "must": {
        "match": {
          "address.state": "ny"
        }
      }
    }
  }
}
Even with ny in the query (as is, no synonym expansion), the result is empty.
Before, when I created the index without mappings, the query used to return results, just without synonyms.
But now, with mappings, the result is empty even though the term is present.
This query is working though:
{
  "query": {
    "query_string": {
      "query": "tacos",
      "fields": [
        "tags"
      ]
    }
  }
}
I have read and researched many articles and tutorials to get this far.
What am I missing here now?
While indexing, you are passing the value as "state": "NY". Notice the case of NY. The synonym analyzer defined in the settings has only one filter, i.e. synonym. NY doesn't match any of the synonym sets defined in synonyms.txt because of its case; note that NY is not equal to ny. To overcome this problem (in other words, to make it case-insensitive), add the lowercase filter before the synonym filter in the synonym analyzer. This ensures that any input text is lowercased first and the synonym filter is applied afterwards. The same happens when you search on that field using full-text search queries.
So your settings will be as below:
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": [
"lowercase",
"synonym"
]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms_path": "synonyms.txt"
}
}
}
}
}
No changes are required in mapping.
Why did it initially work?
When you haven't defined any mapping, Elasticsearch maps address.state as a text field with no explicit analyzer defined for the field. In that case Elasticsearch uses the standard analyzer by default, which includes the lowercase token filter, and hence the query matched the document.
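To double-check the fix, you can run the _analyze API against the updated analyzer (assuming your index is called my-index; with the lowercase filter in place, NY should now be lowercased and then expanded by the synonym filter):
POST /my-index/_analyze
{
  "analyzer": "synonym",
  "text": "NY"
}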

Elasticsearch query returning false results when term exceeds ngram length

The requirement is to search partial phrases in a block of text. Most of the words will be standard length. I want to keep the max_gram value down to 10. But there may be the occasional id/code with more characters than that, and these show up if I type in a query where the first 10 characters match, but then the rest don't.
For example, here is the mapping:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "autocomplete"
        }
      }
    }
  }
}
and document:
POST my_index/doc/1
{
  "title": "Quick fox with id of ABCDEFGHIJKLMNOP"
}
If I run the query:
POST my_index/doc/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "fox wi"
      }
    }
  }
}
It returns the document as expected. However, if I run this:
POST my_index/doc/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "ABCDEFGHIJxxx"
      }
    }
  }
}
It also returns the document, when it shouldn't. It will do this if the x's are after the 10th character, but not before it. How can I avoid this?
I am using version 5.
By default, the analyzer that is used at index time is the same analyzer that is used at search time, meaning the edge_ngram analyzer is used on your search term. This is not what you want. You will end up with 10 tokens as the search terms, none of which contain those last 3 characters.
You will want to take a look at the Search Analyzer for your mapping. This documentation points out this specific use case:
Sometimes, though, it can make sense to use a different analyzer at search time, such as when using the edge_ngram tokenizer for autocomplete.
The standard analyzer may suit your needs:
{
  ...
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "autocomplete",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
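You can confirm the difference by analyzing the search term with both analyzers (a quick check: the autocomplete analyzer truncates the term into edge grams of at most 10 characters, so the trailing x's never make it into the query, while standard keeps the term whole):
POST my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "ABCDEFGHIJxxx"
}

POST my_index/_analyze
{
  "analyzer": "standard",
  "text": "ABCDEFGHIJxxx"
}
With the standard search analyzer, the query becomes the single token abcdefghijxxx, which was never indexed (the index only holds grams up to 10 characters), so the document no longer matches spuriously.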

Why does the match_phrase_prefix query return wrong results with different lengths of phrase?

I have a very simple query:
POST /indexX/document/_search
{
  "query": {
    "match_phrase_prefix": {
      "surname": "grab"
    }
  }
}
with mapping:
"surname": {
"type": "string",
"analyzer": "polish",
"copy_to": [
"full_name"
]
}
and this definition for the index (I use the Stempel (Polish) Analysis plugin for Elasticsearch):
POST /indexX
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "analysis/synonyms.txt"
          },
          "polish_stop": {
            "type": "stop",
            "stopwords_path": "analysis/stopwords.txt"
          },
          "polish_my_stem": {
            "type": "stemmer",
            "rules_path": "analysis/stems.txt"
          }
        },
        "analyzer": {
          "polish_with_synonym": {
            "tokenizer": "standard",
            "filter": [
              "synonym",
              "lowercase",
              "polish_stop",
              "polish_stem",
              "polish_my_stem"
            ]
          }
        }
      }
    }
  }
}
For this query I get zero results. When I change the phrase to GRA or GRABA, it returns 1 result (GRABARZ is the surname). Why is this happening?
I tried max_expansions with values even as high as 1200 and that didn't help.
At first glance, your analyzer stems the search term ("grab") and renders it unusable ("grabić").
Without going into detail on how to resolve this, please consider getting rid of the Polish analyzer here. We are talking about people's names, not "ordinary" Polish words.
I have seen different techniques used in this case: multi-field searches, fuzzy searches, phonetic searches, dedicated plugins.
Some links:
https://www.elastic.co/blog/multi-field-search-just-got-better
http://www.basistech.com/fuzzy-search-names-in-elasticsearch/
https://www.found.no/play/gist/6c6434c9c638a8596efa
But I guess that in the case of Polish names, some kind of prefix query on a non-analyzed field would suffice, as sketched below...
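For instance, a minimal sketch (the raw subfield name is my own choice, and this uses the pre-5.x string syntax from the question; note that prefix queries are not analyzed, so the casing must match what was indexed in the not_analyzed subfield):
"surname": {
  "type": "string",
  "analyzer": "polish",
  "copy_to": [
    "full_name"
  ],
  "fields": {
    "raw": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}
and then:
POST /indexX/document/_search
{
  "query": {
    "prefix": {
      "surname.raw": "GRAB"
    }
  }
}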

Elasticsearch: Unable to search with wordforms

I am trying to set up Elasticsearch. I created an index and added some records, but I cannot make it return results for word forms (for example, records with the substring "dreams" when I search for "dream").
My records look like this (index "myindex/movies"):
{
  "id": 1,
  "title": "What Dreams May Come",
  ... other fields
}
The configuration I tried to use:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "stem": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "stop",
            "porter_stem"
          ]
        }
      }
    }
  },
  "mappings": {
    "movies": {
      "dynamic": true,
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "stem"
        }
      }
    }
  }
}
And the query looks like this:
{
  "query": {
    "query_string": {
      "query": "Dream"
    }
  }
}
I can get results back using the word "dreams" but not "dream".
Am I doing something wrong?
Should I install porter_stem somehow first?
You haven't done anything wrong; it's just that you are searching the wrong field.
query_string searches the _all field by default, and _all has its own analyzer.
So you either need to apply the same analyzer to _all, or point your query at the title field as below:
{
  "query": {
    "query_string": {
      "query": "dream",
      "default_field": "title"
    }
  }
}
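Equivalently, a match query scoped to the title field will use the field's stem analyzer and should match both word forms (a sketch):
{
  "query": {
    "match": {
      "title": "dream"
    }
  }
}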
