analyzer for spelling mistakes - elasticsearch

I have saved the user inputs directly in elastcisearch. The name field has various spelling combinations for the same student.
PrabhuNath Prasad
PrabhuNathPrasad
Prabhu NathPrasad
Prabhu Nath Prashad
PrabhuNath Prashad
PrabhuNathPrashad
Prabhu NathPrashad
The real name of the student is "Prabhu Nath Prasad" and when I search by that name, I should get all the above results back. Is there any analyzer in elasticsearch that can take care of it?

You could do that custom_analyzer, This is my setup
POST name_index
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"char_filter": [
"space_removal"
],
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
]
}
},
"char_filter": {
"space_removal": {
"type": "pattern_replace",
"pattern": "\\s+",
"replacement": ""
}
}
}
},
"mappings": {
"your_type": {
"properties": {
"name": {
"type": "string",
"fields": {
"variation": {
"type": "string",
"analyzer": "my_custom_analyzer"
}
}
}
}
}
}
}
I have mapped name with both standard analyzer and custom_analyzer which uses keyword tokenizer and lowercase filter along with char_filter which removes space and joins the string. This char_filter will help us query different variations effectively.
I inserted all those 7 combinations you have given in index. This is my query
GET name_index/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"name": "Prabhu Nath Prasad"
}
},
{
"match": {
"name.variation": {
"query": "Prabhu Nath Prasad",
"fuzziness": "AUTO"
}
}
}
]
}
}
}
This handles all your possibilities and it will also give back prabhu, prasad etc.
Hope this helps!!

There is no analyzer for that however, what you can look into is the "fuzzy"..
In your query specify the fuzziness which can help you in getting the above record.
I will Suggest you to go through the links below
https://www.elastic.co/blog/found-fuzzy-search
https://www.elastic.co/guide/en/elasticsearch/guide/current/fuzzy-match-query.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/fuzziness.html
This will help you achieve what you want.
Also there wont be any straight way to get the record if the user have typed "PrabhuNath", because elastic will treat it as a single token, however you can use "phrase_prefix" query which help you fetch records while the user is typing..
Your query will look like this to get the basic spelling mistake
{
"query": {
"match": {
"name": {
"query":"PrabhuNath Prasad",
"fuzziness": 2
}
}
}
}

Related

How to search by words written together among data where these words are written apart in Elasticsearch?

I have documents which have, let's say, 1 field - name of this document. Name may consist of several words written apart, for example:
{
"name": "first document"
},
{
"name": "second document"
}
My goal is to be able to search for these documents by strings:
firstdocument, seconddocumen
As you can see, search strings are written wrong, but they still match those documents if we delete whitespaces from documents' names. This issue could be handled by creating another field with the same string but without whitespaces, but it seems like extra data unless there's no other way to do that.
I need something similar to this:
GET /_analyze
{
"tokenizer": "whitespace",
"filter": [
{
"type":"shingle",
"max_shingle_size":3,
"min_shingle_size":2,
"output_unigrams":"true",
"token_separator": ""
}
],
"text": "first document"
}
But the other way around. I need kind of apply this not to a search text, but for search objects (name of documents), so I could find documents with a little misspell in a search text. How should it be done?
I suggest using multi-fields with an analyzer for removing whitespaces.
Analyzer
"no_spaces": {
"filter": [
"lowercase"
],
"char_filter": [
"remove_spaces"
],
"tokenizer": "standard"
}
Char Filter
"remove_spaces": {
"type": "pattern_replace",
"pattern": "[ ]",
"replacement": ""
}
Field Mapping
"name": {
"type": "text",
"fields": {
"without_spaces": {
"type": "text",
"analyzer": "no_spaces"
}
}
}
Query
GET /_search
{
"query": {
"match": {
"name.without_spaces": {
"query": "seconddocumen",
"fuzziness": "AUTO"
}
}
}
}
EDIT:
For completion: An alternative to the remove_spaces filter could be the shingle filter:
"analysis": {
"filter": {
"shingle_filter": {
"type": "shingle",
"output_unigrams": "false",
"token_separator": ""
}
},
"analyzer": {
"shingle_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"shingle_filter"
]
}
}
}

How can I get auto-suggestions for synonyms match in elasticsearch

I'm using the code below and it does not give auto-suggestion as curd when i type "cu"
But it does match the document with yogurt which is correct.
How can I get both auto-complete for synonym words and document match for the same?
PUT products
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"synonym_graph"
]
}
},
"filter": {
"synonym_graph": {
"type": "synonym_graph",
"synonyms": [
"yogurt, curd, dahi"
]
}
}
}
}
}
}
PUT products/_mapping
{
"properties": {
"description": {
"type": "text",
"analyzer": "synonym_analyzer"
}
}
}
POST products/_doc
{
"description": "yogurt"
}
GET products/_search
{
"query": {
"match": {
"description": "cu"
}
}
}
When you provide a list of synonyms in a synonym_graph filter it simply means that ES will treat any of the synonyms interchangeably. But when they're analyzed via the standard analyzer, only full-word tokens will be produced:
POST products/_analyze?filter_path=tokens.token
{
"text": "yogurt",
"field": "description"
}
yielding:
{
"tokens" : [
{
"token" : "curd"
},
{
"token" : "dahi"
},
{
"token" : "yogurt"
}
]
}
As such, a regular match_query won't cut it here because the standard analyzer hasn't provided it with enough context in terms of matchable substrings (n-grams).
In the meantime you can replace match with match_phrase_prefix which does exactly what you're after -- match an ordered sequence of characters while taking into account the synonyms:
GET products/_search
{
"query": {
"match_phrase_prefix": {
"description": "cu"
}
}
}
But that, as the query name suggests, is only going to work for prefixes. If you fancy an autocomplete that suggests terms regardless of where the substring matches occur, have a look at my other answer where I talk about leveraging n-grams.

not able to search in compounding query using analyzer

I have a problem index which has multiple fields e.g tags (comma separated string of tags), author, tester. I am creating a global search where problems can be searched by all these fields at once.
I am using boolean query
e.g
{
"query": {
"bool": {
"must": [{
"match": {
"author": "author_username"
}
},
{
"match": {
"tester": "tester_username"
}
},
{
"match": {
"tags": "<tag1,tag2>"
}
}
]
}
}
}
Without Analyzer I am able to get the results but it uses space as separator e.g python 3 is getting searched as python or 3.
But I wanted to search Python 3 as single query. So, I have created an analyzer for tags so that every comma-separated tag is considered as one, not by standard whitespace.
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "pattern",
"pattern": ","
}
}
}
},
"mappings": {
"properties": {
"tags": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "standard"
}
}
}
}
But now I am not getting any results. Please let me know what I am missing here. I am not able to find the use of analyzer in compound queries in the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/compound-queries.html
Adding an example:
{
"query": {
"bool": {
"must": [{
"match": {
"author": "test1"
}
},
{
"match": {
"tester": "test2"
}
},
{
"match": {
"tags": "test3, abc 4"
}
}
]
}
}
}
Results should match all the fields but for the tags field there should be a union of tags and query should be comma-separated not by space. i.e query should match test and abc 4 but above query searching for test, abc and 4.
You need to either remove search_analyzer from your mapping or pass my_analyzer in match query
GET tags/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"tags": {
"query": "python 3",
"analyzer": "my_analyzer" --> by default search analyzer is used
}
}
}
]
}
}
}
By default, queries will use the analyzer defined in the field mapping, but this can be overridden with the search_analyzer setting.

elasticsearch synonyms analyzer gives 0 results

I am using elasticsearch 7.0.0.
I am trying to work on synonyms with this configuration while creating index.
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": [
"synonym"
]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms_path": "synonyms.txt"
}
}
}
}
},
"mappings": {
"properties": {
"address.state": {
"type": "text",
"analyzer": "synonym"
},
"location": {
"type": "geo_point"
}
}
}
}
Here's a document inserted into the index:
{
"name": "Berry's Burritos",
"description": "Best burritos in New York",
"address": {
"street": "230 W 4th St",
"city": "New York",
"state": "NY",
"zip": "10014"
},
"location": [
40.7543385,
-73.976313
],
"tags": [
"mexican",
"tacos",
"burritos"
],
"rating": "4.3"
}
Also content in synonyms.txt:
ny, new york, big apple
When I tried searching for anything in address.state property, I get empty result.
Here's the query:
{
"query": {
"bool": {
"filter": {
"range": {
"rating": {
"gte": 4
}
}
},
"must": {
"match": {
"address.state": "ny"
}
}
}
}
}
Even with ny (as it is:no synonym) in query, the result is empty.
Before, when I created index without mappings, the query used to give the result, only except for synonyms.
But now with mappings, the result is empty even though the term is present.
This query is working though:
{
"query": {
"query_string": {
"query": "tacos",
"fields": [
"tags"
]
}
}
}
I looked and researched into many articles/tutorials and came up this far.
What am I missing here now?
While indexing you are passing the value as "state":"NY". Notice the case of NY. The analyzer synonym define in the settings has only one filter i.e. synonym. NY doesn't match any set of synonyms in defined in synonym.txt due to case. NOTE that NY isn't equal to ny. To overcome this problem (or we can call making it case insensitive) add lowercase filter before synonym filter to synonym analyzer. This will ensure that any input text is lower cased first and then synonym filter is applied. Same will happen when you search on that field using full text search queries.
So you settings will be as below:
"settings": {
"index": {
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": [
"lowercase",
"synonym"
]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms_path": "synonyms.txt"
}
}
}
}
}
No changes are required in mapping.
Why it initially worked?
Answer to this is because when you haven't defined any mapping, elastic would map address.state as a text field with no explicit analyzer defined for the field. In such case elasticsearch by default uses standard analyzer which uses lowercase token filter as one of the filters. and hence the query matched the document.

Elasticsearch: Unable to search with wordforms

I am trying to setup Elasticsearch, created index, added some records but can not make it return results with word forms (for example: records with substring "dreams" when I search for "dream").
My records look like this (index "myindex/movies"):
{
"id": 1,
"title": "What Dreams May Come",
... other fields
}
The configuration I tried to use:
{
"settings": {
"analysis": {
"analyzer": {
"stem": {
"tokenizer": "standard",
"filter": [
"standard",
"lowercase",
"stop",
"porter_stem"
]
}
}
}
},
"mappings": {
"movies": {
"dynamic": true,
"properties": {
"title": {
"type": "string",
"analyzer": "stem"
}
}
}
}
}
And query look like this:
{
"query": {
"query_string": {
"query": "Dream"
}
}
}
I can get result back using word "dreams" but not "dream".
Do I do something wrong?
Should I install porter_stem somehow first?
You haven't done anything wrong , just that you are searching in wrong field.
query_string , does the search on _all by default. And _all is having its own analyzer.
So either you need to apply the same analyzer to _all or point your query to title field like below -
{
"query": {
"query_string": {
"query": "dream",
"default_field": "title"
}
}
}

Resources