Case-insensitive search in Elasticsearch with uppercase or lowercase input - elasticsearch

I am working with Elasticsearch and I am facing a problem; if anybody can give me a hint, I will be really thankful.
I want to analyze a field "name" or "description" which consists of different entries. For example, someone wants to search for Sara: whether they enter SARA, SAra or sara, they should get Sara back.
Elasticsearch uses an analyzer which makes everything lowercase.
I want to make the search case-insensitive: regardless of whether the user types the name in uppercase or lowercase, they should get results.
I am using an ngram filter to search names, together with lowercase, which makes it case-insensitive. But I want to make sure that a person gets results even if they enter uppercase or lowercase.
Is there any way to do this in Elasticsearch?
{"settings": {
"analysis": {
"filter": {
"ngram_filter": {
"type": "ngram",
"min_gram": 1,
"max_gram": 80
}
},
"analyzer": {
"index_ngram": {
"type": "custom",
"tokenizer": "keyword",
"filter": [ "ngram_filter", "lowercase" ]
},
I attach the example.js file, which includes a JSON example, and a search.txt file to explain my problem. I hope my problem is clearer now.
This is the link to the OneDrive folder where I keep both files:
https://1drv.ms/f/s!AsW4Pb3Y55Qjb34OtQI7qQotLzc

Is there any specific reason you are using ngram? Elasticsearch uses the same analyzer on the query as on the text you index, unless a search_analyzer is explicitly specified, as mentioned by @Adam in his answer. In your case it might be enough to use a standard tokenizer with a lowercase filter.
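For a quick check, the _analyze API shows that all casings collapse to the same token (a minimal sketch; on older versions the same test is done with query-string parameters instead of a request body):
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase" ],
  "text": "SARA SAra sara"
}
All three words come back as the token sara, so the query text and the indexed text meet on the same terms.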
I created an index with the following settings and mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "typehere": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "custom_analyzer"
        },
        "description": {
          "type": "string",
          "analyzer": "custom_analyzer"
        }
      }
    }
  }
}
Indexed two documents
Doc 1
PUT /test_index/test_mapping/1
{
  "name" : "Sara Connor",
  "Description" : "My real name is Sarah Connor."
}
Doc 2
PUT /test_index/test_mapping/2
{
  "name" : "John Connor",
  "Description" : "I might save humanity someday."
}
Do a simple search
POST /test_index/_search
{
  "query" : {
    "match" : {
      "name" : "SARA"
    }
  }
}
And I get back only the first document. I tried with "sara" and "Sara" as well, with the same results.
{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.19178301,
    "hits": [
      {
        "_index": "test_index",
        "_type": "test_mapping",
        "_id": "1",
        "_score": 0.19178301,
        "_source": {
          "name": "Sara Connor",
          "Description": "My real name is Sarah Connor."
        }
      }
    ]
  }
}

The analysis process is executed twice for full-text (analyzed) fields: first when the data is stored, and again when you search. It is worth noting that the input JSON is returned unchanged as the output of a search query; the analysis process is only used to create tokens for the inverted index. The key to your solution is the following two steps:
1) Create two analyzers: one with the ngram filter, and a second one without it, because you do not need to analyze the input search query with ngram - at search time you already have the exact value you want to find.
2) Define the mappings for your fields correctly. There are two mapping parameters that let you specify analyzers: one is used at index time (analyzer) and the second at search time (search_analyzer). If you specify only analyzer, that analyzer is used at both index and search time.
You can read more about it here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html
And your code should look like this:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "ngram_filter": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 5
        }
      },
      "analyzer": {
        "index_store_ngram": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "ngram_filter",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "index_store_ngram",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
POST /my_index/my_type/1
{
  "name": "Sara_11_01"
}
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "sara"
    }
  }
}
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "SARA"
    }
  }
}
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "SaRa"
    }
  }
}
Edit 1: updated code for a new example provided in the question

This answer is in the context of Elasticsearch 7.14. So let me rephrase the question another way:
Irrespective of the case used in the match query, you would like to get back the documents that have been analyzed with:
"tokenizer": "keyword",
"filter": [ "ngram_filter", "lowercase" ]
Now, coming to the answer part:
It will not be possible to get the match query to return the docs that were analyzed with the lowercase filter while the match query contains uppercase letters. The analysis you applied in the settings is used both while indexing and while searching. Although it is also possible to apply different analyzers for indexing and searching, I do not see that helping your case. You would have to convert the match query value to lowercase before making the query. So, if your filter is lowercase, you cannot match by, say, Sara or SARA or sAra; the match value should be all lowercase, just as it is in your analyzer.
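As a sketch (my_index is a placeholder, and the mapping is assumed to apply the question's index_ngram analyzer to the name field), you can check what is actually stored with a field-level _analyze call and then send the match value already lowercased:
GET /my_index/_analyze
{
  "field": "name",
  "text": "SARA"
}
GET /my_index/_search
{
  "query": {
    "match": {
      "name": "sara"
    }
  }
}
The _analyze response shows only lowercase ngrams, which is why the query value needs to be lowercase as well.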

Related

Search in Elasticsearch for a string containing the "not" keyword

I am using Elasticsearch on AWS (version 7.9) and I am trying to distinguish between two strings.
My main goal is to split the search results into "Found" and "Not found".
The general question is how to search for the "not" keyword.
You can see two example messages below.
"CachingServiceOne:Found in cache - Retrieve."
"CachingServiceThree:Not found in cache - Create new."
You can use the ngram tokenizer to search for "not" in the "title" field.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    },
    "max_ngram_diff": 10
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Index Data:
{
  "title": "CachingServiceThree:Not found in cache - Create new."
}
{
  "title": "CachingServiceOne:Found in cache - Retrieve."
}
Search Query:
{
  "query": {
    "match": {
      "title": "Not"
    }
  }
}
Search Result:
"hits": [
{
"_index": "67093372",
"_type": "_doc",
"_id": "2",
"_score": 0.6720003,
"_source": {
"title": "CachingServiceThree:Not found in cache - Create new."
}
}
]
Well, the problem does seem to be the way the default analyzer works, not the fact that I could not search for the word "not". That is why I accepted the answer. But I would like to add another take, for the sake of simplicity.
The default analyzer does not split words on ":".
That means we have to search for title:CachingServiceThree\:Not,
where title is the field name and the ":" must be escaped as \:.
What did the trick was title:*\:Not and title:*\:Found using the KQL syntax.
Using the wildcard did the trick of matching everything before the colon; I am wondering whether using an array of all the actual values would be quicker.
That translated through the Inspect panel into:
{
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              {
                "query_string": {
                  "fields": [
                    "title"
                  ],
                  "query": "*\\:Not"
                }
              }
            ],
            "minimum_should_match": 1
          }
        }
      ]
    }
  }
}
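To see the colon behaviour directly, here is a minimal _analyze sketch against the standard analyzer:
GET /_analyze
{
  "analyzer": "standard",
  "text": "CachingServiceThree:Not found in cache - Create new."
}
The first token comes back as cachingservicethree:not rather than two separate terms, which is why a plain match on "not" never sees it and the escaped \: prefix is needed.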

Elastic Enterprise Search: Search API doesn't match "33" for "033" in document title

I couldn't figure out how to match the title, e.g. "Test title No.033", with the query "33".
When I search for "033", it returns the document. But for just "33" it doesn't return anything.
The guide is not very helpful for me (Search API | Elastic App Search Documentation [7.12] | Elastic).
Could you please help me with this?
What other information should I provide?
If no analyzer is specified, Elasticsearch uses the standard analyzer.
The tokens generated will be "test", "title", "no", "033".
You can use the ngram tokenizer to do a partial match on the "title" field.
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 5,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    },
    "max_ngram_diff": 10
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Analyze API
GET /67091386/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Test title No.033"
}
The tokens generated will contain both "033" and "33"
Index Data:
{
  "title": "Test title No.033"
}
Search Query:
{
  "query": {
    "match": {
      "title": "33"
    }
  }
}
Search Result:
"hits": [
{
"_index": "67091386",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"title": "Test title No.033"
}
}
]
Elastic App Search doesn't allow you to configure the tokenizer the way Elasticsearch does. However, you can tune the results using the Relevance Tuning API.
You can give more weight to your title, and it will start showing up when you search for either "33" or "033".
Note: relevance tuning has multiple components, such as Weights and Functions, that you can apply. I'd recommend you try it out in the Elastic App Search Console.
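A rough sketch of what that weighting looks like through the App Search Search API (my-engine is a placeholder; check the App Search documentation for the exact endpoint and parameters of your version):
POST /api/as/v1/engines/my-engine/search
{
  "query": "33",
  "search_fields": {
    "title": { "weight": 10 }
  }
}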

How to search in Elasticsearch for words with or without an apostrophe, and deal with spelling mistakes?

I'm trying to move my full-text search logic from MySQL to Elasticsearch. In MySQL, to find all rows containing the word "woman", I would just write:
SELECT b.code
FROM BIBLE b
WHERE ((b.DISPLAY_NAME LIKE '%woman%')
OR (b.BRAND LIKE '%woman%')
OR (b.DESCRIPTION LIKE '%woman%'));
In Elasticsearch I tried something similar:
curl -X GET "localhost:9200/bible/_search" -H 'Content-Type: application/json' -d'
{
"query": { "multi_match": { "query": "WOMAN","fields": ["description","display_name","brand"] } }, "sort": { "code": {"order": "asc" } },"_source":["code"]
}
'
but it didn't return the same count. On further checking, I found that words like woman's weren't found by Elasticsearch but were found by MySQL. How do I solve this?
AND
How do I incorporate things like searching for words with spelling mistakes, or words which are phonetically the same?
Firstly, what does your mapping look like? Are you using any tokenizer? If not, and you want to do wildcard-style searches, I would suggest using the ngram tokenizer. It is mostly used for partial matches.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html
In Elasticsearch, you have to define the mapping for the fields before indexing the data. The mapping is how you tell Elasticsearch to index the data in a particular way so that you can retrieve it the way you want.
Try the below DSL query (JSON format) for creating custom analyzer and mapping:
PUT {YOUR_INDEX_NAME}
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    },
    "max_ngram_diff": 20
  },
  "mappings": {
    "properties": {
      "code": { "type": "long" },
      "description": {
        "type": "text",
        "analyzer": "my_analyzer"
      },
      "display_name": {
        "type": "text",
        "analyzer": "my_analyzer"
      },
      "brand": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
(The max_ngram_diff setting is needed for Elasticsearch v6 and above.)
Sample Query example:
GET {YOUR_INDEX_NAME}/_search
{
  "query": {
    "multi_match": {
      "query": "women",
      "fields": [ "description^3", "display_name", "brand" ]
    }
  }
}
I suggest you take a look at the fuzzy query for spelling mistakes.
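For example, here is a minimal sketch of adding fuzziness to the same multi_match (the misspelling "womern" is deliberate; note that fuzziness is not supported by the phrase-type multi_match variants):
GET {YOUR_INDEX_NAME}/_search
{
  "query": {
    "multi_match": {
      "query": "womern",
      "fields": [ "description^3", "display_name", "brand" ],
      "fuzziness": "AUTO"
    }
  }
}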
Try using the Kibana UI to test the index with DSL queries instead of cURL; it will save you time.
Hope this helps.

Why is my elastic search prefix query case-sensitive despite using lowercase filters on both index and search?

The Problem
I am working on an autocompleter using ElasticSearch 6.2.3. I would like my query results (a list of pages with a Name field) to be ordered using the following priority:
Prefix match at start of "Name" (Prefix query)
Any other exact (whole word) match within "Name" (Term query)
Fuzzy match (this is currently done on a different field from Name using an ngram tokenizer, so I assume it cannot be relevant to my problem, but I would like to apply this to the Name field as well)
My Attempted Solution
I will be using a Bool/Should query consisting of three queries (corresponding to the three priorities above), using boost to define relative importance.
The issue I am having is with the Prefix query - it appears not to lowercase the search query despite my search analyzer having the lowercase filter. For example, the query below returns "Harry Potter" for 'harry' but returns zero results for 'Harry':
{ "query": { "prefix": { "Name.raw" : "Harry" } } }
I have verified using the _analyze API that both my analyzers do indeed lowercase the text "Harry" to "harry". Where am I going wrong?
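The check was roughly of this form (myIndex stands in for the real index name, analyzers as in the settings further down):
GET myIndex/_analyze
{
  "analyzer": "pageSearchAnalyzer",
  "text": "Harry"
}
GET myIndex/_analyze
{
  "analyzer": "keywordAnalyzer",
  "text": "Harry"
}
Both return harry as the token.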
From the ES documentation I understand I need to analyze the Name field in two different ways to enable use of both Prefix and Term queries:
using the "keyword" tokenizer to enable the Prefix query (I have applied this on a .raw field)
using a standard analyzer to enable the Term query (I have applied this on the Name field)
I have checked duplicate questions such as this one but the answers have not helped
My mapping and settings are below
ES Index Mapping
{
  "myIndex": {
    "mappings": {
      "pages": {
        "properties": {
          "Id": {},
          "Name": {
            "type": "text",
            "fields": {
              "raw": {
                "type": "text",
                "analyzer": "keywordAnalyzer",
                "search_analyzer": "pageSearchAnalyzer"
              }
            },
            "analyzer": "pageSearchAnalyzer"
          },
          "Tokens": {} // Other fields not important for this question
        }
      }
    }
  }
}
ES Index Settings
{
  "myIndex": {
    "settings": {
      "index": {
        "analysis": {
          "filter": {
            "ngram": {
              "type": "edgeNGram",
              "min_gram": "2",
              "max_gram": "15"
            }
          },
          "analyzer": {
            "keywordAnalyzer": {
              "filter": [
                "trim",
                "lowercase",
                "asciifolding"
              ],
              "type": "custom",
              "tokenizer": "keyword"
            },
            "pageSearchAnalyzer": {
              "filter": [
                "trim",
                "lowercase",
                "asciifolding"
              ],
              "type": "custom",
              "tokenizer": "standard"
            },
            "pageIndexAnalyzer": {
              "filter": [
                "trim",
                "lowercase",
                "asciifolding",
                "ngram"
              ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "l2AXoENGRqafm42OSWWTAg",
        "version": {}
      }
    }
  }
}
Prefix queries don't analyze the search term, so the text you pass in bypasses whatever would be used as the search analyzer (in your case, the configured search_analyzer: pageSearchAnalyzer), and Harry is evaluated as-is directly against the keyword-tokenized, custom-filtered harry potter that resulted from the keywordAnalyzer applied at index time.
In your case here, you'll need to do one of a few different things:
Since you're using a lowercase filter on the field, you could just always use lowercase terms in your prefix query (lowercasing on the application side if necessary) - see the sketch just after this list
Run a match query against an edge_ngram-analyzed field instead of a prefix query, as described in the ES search_analyzer docs
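A sketch of the first option, reusing the query from the question with the term already lowercased:
{ "query": { "prefix": { "Name.raw" : "harry" } } }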
Here's an example of the latter:
1) Create the index w/ ngram analyzer and (recommended) standard search analyzer
PUT my_index
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "ngram": {
            "type": "edgeNGram",
            "min_gram": "2",
            "max_gram": "15"
          }
        },
        "analyzer": {
          "pageIndexAnalyzer": {
            "filter": [
              "trim",
              "lowercase",
              "asciifolding",
              "ngram"
            ],
            "type": "custom",
            "tokenizer": "keyword"
          }
        }
      }
    }
  },
  "mappings": {
    "pages": {
      "properties": {
        "name": {
          "type": "text",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "pageIndexAnalyzer",
              "search_analyzer": "standard"
            }
          }
        }
      }
    }
  }
}
2) Index some sample docs
POST my_index/pages/_bulk
{"index":{}}
{"name":"Harry Potter"}
{"index":{}}
{"name":"Hermione Granger"}
3) Run a match query against the ngram field
POST my_index/pages/_search
{
  "query": {
    "match": {
      "name.ngram": {
        "query": "Har",
        "operator": "and"
      }
    }
  }
}
I think it is better to use a match_phrase_prefix query without using the .keyword suffix. Check the docs here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase-prefix.html
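A minimal sketch of that suggestion, reusing the my_index/pages example from above (the top-level name field, analyzed with the default standard analyzer):
POST my_index/pages/_search
{
  "query": {
    "match_phrase_prefix": {
      "name": "Harry Pot"
    }
  }
}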

Why does the match_phrase_prefix query return wrong results with different lengths of phrase?

I have a very simple query:
POST /indexX/document/_search
{
  "query": {
    "match_phrase_prefix": {
      "surname": "grab"
    }
  }
}
with mapping:
"surname": {
"type": "string",
"analyzer": "polish",
"copy_to": [
"full_name"
]
}
and the index definition (I use the Stempel (Polish) Analysis plugin for Elasticsearch):
POST /indexX
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "analysis/synonyms.txt"
          },
          "polish_stop": {
            "type": "stop",
            "stopwords_path": "analysis/stopwords.txt"
          },
          "polish_my_stem": {
            "type": "stemmer",
            "rules_path": "analysis/stems.txt"
          }
        },
        "analyzer": {
          "polish_with_synonym": {
            "tokenizer": "standard",
            "filter": [
              "synonym",
              "lowercase",
              "polish_stop",
              "polish_stem",
              "polish_my_stem"
            ]
          }
        }
      }
    }
  }
}
For this query I get zero results. When I change phrase to GRA or GRABA it returns 1 result (GRABARZ is the surname). Why is this happening?
I tried max_expansions with values even as high as 1200 and that didn't help.
At first glance, your analyzer stems the search term ("grab") and renders it unusable ("grabić").
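You can verify this with the _analyze API (a sketch, assuming the index and field names from the question):
GET /indexX/_analyze
{
  "field": "surname",
  "text": "grab"
}
The returned token is the stemmed form rather than the literal prefix grab.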
Without going into details on how to resolve this, please consider getting rid of the Polish analyzer here. We are talking about people's names, not "ordinary" Polish words.
I saw different techniques used in this case: multi-field searches, fuzzy searches, phonetic searches, dedicated plugins.
Some links:
https://www.elastic.co/blog/multi-field-search-just-got-better
http://www.basistech.com/fuzzy-search-names-in-elasticsearch/
https://www.found.no/play/gist/6c6434c9c638a8596efa
But I guess in the case of Polish names some kind of prefix query on a non-analyzed field would suffice...
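For instance, a rough sketch of that idea: keep the Polish analyzer for normal matching but add a lowercased, keyword-tokenized subfield for prefix lookups (names and the pre-5.x string type are taken from the question; on newer versions a keyword field with a lowercase normalizer plays the same role):
PUT /indexX
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "document": {
      "properties": {
        "surname": {
          "type": "string",
          "analyzer": "polish",
          "fields": {
            "raw": {
              "type": "string",
              "analyzer": "lowercase_keyword"
            }
          }
        }
      }
    }
  }
}
POST /indexX/document/_search
{
  "query": {
    "prefix": {
      "surname.raw": "grab"
    }
  }
}
Prefix queries are not analyzed, so the term has to be lowercased on the application side (here it already is).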
