Tokenize a big word into a combination of words - Elasticsearch

Suppose Super Bowl is the value of a document's property in Elasticsearch. How can the term query superbowl match Super Bowl?
I read about the letter tokenizer and the word delimiter token filter, but neither seems to solve my problem. Basically, I want to be able to split a large word into a meaningful combination of words.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-letter-tokenizer.html

I know this is quite late, but you could use a synonym filter.
You could define that super bowl is the same as "s bowl", "SuperBowl", etc.
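For example (just a sketch - the index name, filter name, analyzer name, and title field below are made up, and the synonym list would need to cover whatever variants you care about), a mapping with a synonym filter could look like this:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "superbowl, super bowl"
          ]
        }
      },
      "analyzer": {
        "my_synonym_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_synonym_analyzer"
      }
    }
  }
}
With that in place, a match query for superbowl on title should also find documents containing Super Bowl. (The typeless mappings block assumes a 7.x+ cluster; on older versions you would nest it under the document type.)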

There are ways to do this without changing what you actually index. For example, if you are on at least 5.2 (where normalizers were introduced; earlier versions can work too, but 5.x makes it easier), you can define a normalizer that lowercases your text without otherwise changing it, and then use a fuzzy query at search time to account for the space between super and bowl. My solution, though, is specific to the example you have given. As is usually the case with Elasticsearch, one needs to think about what kind of data goes into Elasticsearch and what is required at search time.
In any case, if you are interested in such an approach, here it is:
DELETE test
PUT /test
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "normalizer": "my_normalizer"
            }
          }
        }
      }
    }
  }
}
POST test/test/1
{"title":"Super Bowl"}
GET /test/_search
{
  "query": {
    "fuzzy": {
      "title.keyword": "superbowl"
    }
  }
}
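If you want to check what actually gets stored in the keyword subfield, the _analyze API also accepts a normalizer (assuming the index created above):
GET /test/_analyze
{
  "normalizer": "my_normalizer",
  "text": "Super Bowl"
}
This should return the single token super bowl, which the fuzzy query for superbowl then matches within an edit distance of one (the missing space).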

Related

Elasticsearch Text with Path Hierarchy vs Keyword using Prefix query performance

I'm trying to find the best way to filter results based on folder hierarchies. We will use this to simulate a situation where we want to get all assets/documents in a provided folder and all of its subfolders (recursive search).
So for example for such a structure
/someFolder/someSubfolder/1
/someFolder/someSubfolder/1/subFolder
/someFolder/someSubfolder/2
/someFolder/someSubfolder/2/subFolder
If we search for /someFolder/someSubfolder/1
We want to get as results
/someFolder/someSubfolder/1
/someFolder/someSubfolder/1/subFolder
Now I've found two ways to do this, and I'm not sure which one would be better from a performance perspective:
Use a text property with the path_hierarchy tokenizer
Use a keyword property and a prefix query to get the results
Both of the above seem to work as I want them to (unless I missed something). On the one hand, I've read that filtering should be done on keywords. On the other hand, the path_hierarchy tokenizer seems to be created exactly for these scenarios, but it can only be used with a text field.
Below is some sample code.
Create index and push some test data into it.
PUT test-index-2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "folderPath": {
        "type": "text",
        "analyzer": "my_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
POST test-index-2/_doc/
{
  "folderPath": "8bf5ad7949a1_104d753b-0fdf-4b07-9213-534dec89112a/Folder with Spaces"
}
POST test-index-2/_doc/
{
  "folderPath": "8bf5ad7949a1_104d753b-0fdf-4b07-9213-534dec89112a/Folder with Spaces/SomeTestValue/11"
}
Now both of the below queries will return two results, matching the partial path hierarchy.
1.
GET test-index-2/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "folderPath": "8bf5ad7949a1_104d753b-0fdf-4b07-9213-534dec89112a/Folder with Spaces" }}
      ]
    }
  }
}
2.
GET test-index-2/_search
{
  "query": {
    "prefix": { "folderPath.keyword": "8bf5ad7949a1_104d753b-0fdf-4b07-9213-534dec89112a/Folder with Spaces" }
  }
}
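To double-check what the path_hierarchy tokenizer emits for a stored path (and therefore why the term filter in the first query matches a parent path), I can run _analyze against the index:
POST test-index-2/_analyze
{
  "analyzer": "my_analyzer",
  "text": "8bf5ad7949a1_104d753b-0fdf-4b07-9213-534dec89112a/Folder with Spaces/SomeTestValue/11"
}
The output should contain one token per hierarchy level, i.e. the full path plus each of its prefixes.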
Now the question would be: which solution is better if we want to get a subset of results?

Elasticsearch became case-sensitive after adding a synonym analyzer

After I added a synonym analyzer to my_index, the index became case-sensitive.
I have one property called nationality that uses the synonym analyzer, and it seems that this property became case-sensitive because of it.
Here is my /my_index/_mappings
{
  "my_index": {
    "mappings": {
      "items": {
        "properties": {
          ...
          "nationality": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            },
            "analyzer": "synonym"
          },
          ...
        }
      }
    }
  }
}
Inside the index, I have the phrase India COUNTRY. When I try to search for India nation using the command below, I get a result.
POST /my_index/_search
{
  "query": {
    "match": {
      "nationality": "India nation"
    }
  }
}
But when I search for india (notice the letter i is lowercase), I get nothing.
My assumption is that this happened because I put an uppercase filter before the synonym filter. I did this because the synonyms are uppercased, so the query India becomes INDIA after passing through this filter.
Here is my /my_index/_settings
{
  "my_index": {
    "settings": {
      "index": {
        "number_of_shards": "1",
        "provided_name": "my_index",
        "similarity": {
          "default": {
            "type": "BM25",
            "b": "0.9",
            "k1": "1.8"
          }
        },
        "creation_date": "1647924292297",
        "analysis": {
          "filter": {
            "synonym": {
              "type": "synonym",
              "lenient": "true",
              "synonyms": [
                "NATION, COUNTRY, FLAG"
              ]
            }
          },
          "analyzer": {
            "synonym": {
              "filter": [
                "uppercase",
                "synonym"
              ],
              "tokenizer": "whitespace"
            }
          }
        },
        "number_of_replicas": "1",
        "version": {
          "created": "6080099"
        }
      }
    }
  }
}
Is there a way I can keep this property case-insensitive? All the solutions I've found say that I should set all the text inside nationality to be either lowercase or uppercase. But what if I have both uppercase and lowercase letters inside the index?
Did you apply the synonym filter after adding your data to the index?
If so, the phrase "India COUNTRY" was probably indexed exactly as "India COUNTRY". When you sent the match query, it was analyzed and became "INDIA COUNTRY" because of your uppercase filter; it still matched because a match query only needs one of the words to match, and "COUNTRY" provides that.
But when you sent the one-word query "india", it was analyzed and converted to "INDIA" because of your uppercase filter, and there is no matching word in your index. You just have a document containing "India COUNTRY".
My answer involves a bit of assumption; I hope it helps you understand your problem.
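If you want to verify this assumption, you can run your analyzer directly with the _analyze API (analyzer name taken from your settings) and compare the output with what was indexed before you added the uppercase filter:
POST /my_index/_analyze
{
  "analyzer": "synonym",
  "text": "india"
}
With the current filter chain this should return INDIA, which cannot match the India token that was stored when the document was indexed with the old analyzer.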
I have found the solution!
I didn't realize that the filters I configured in the settings are only applied while indexing and searching data, not retroactively to documents that were already indexed. At first, I did these steps:
Create the index with the synonym filter
Insert data
Add uppercase before the synonym filter
By doing that, the uppercase filter was not applied to my already indexed data. What I should have done is:
Create the index with the uppercase & synonym filters (pay attention to the order)
Insert data
Then the filters will be applied to my data.
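For anyone hitting the same thing: since existing documents are not re-analyzed when the analysis settings change, another option (instead of deleting and re-inserting everything) is to create a new index with the corrected filter order and copy the data over with _reindex, roughly like this (my_index_v2 is just a placeholder for the new index, created beforehand with the uppercase filter placed before the synonym filter):
POST /_reindex
{
  "source": {
    "index": "my_index"
  },
  "dest": {
    "index": "my_index_v2"
  }
}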

Why does a fuzzy query return a match but a match query with fuzziness doesn't on the same input?

I created the following index in Elasticsearch:
PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "3_5_edgegrams"]
        }
      },
      "filter": {
        "3_5_edgegrams": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 10
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Then I inserted the following document:
{
  "name": "Nuvus Gro Corp"
}
When I make the following query (let's call it fuzzy_query):
GET /my-index/_search
{
  "query": {
    "fuzzy": {
      "name": {
        "value": "qnuv"
      }
    }
  }
}
I get a match for the above document.
When I make the following query (let's call it match_with_fuzziness):
GET /my-index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "qnuv",
        "fuzziness": "AUTO"
      }
    }
  }
}
I don't get a match. If I make the following query:
GET /my-index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "nuvq",
        "fuzziness": "AUTO"
      }
    }
  }
}
I again get a match. I don't understand why when I make the match_with_fuzziness query I don't get any matches.
EDIT: I analyzed the queries with Kibana Profiler and according to the profiler match_with_fuzziness is a SynonymQuery Synonym(name:qnu name:qnuv) query while fuzzy_query is a BoostQuery (name:nuv)^0.6666666
Very similar problem to the one explained in your other question.
The problem is that you haven't specified a separate search_analyzer, so at search time qnuv and nuvq also get analyzed by my_analyzer and edge-ngrammed as well, hence the matches you're seeing.
If we check the first query: since you're using the fuzzy query, qnuv (the search term) will match nuv (the first indexed edge-ngrammed token) with a distance of 1 (i.e. the first q is "tolerated"), which is what the fuzzy query does by default (with "fuzziness": "AUTO").
In the third query, nuv (the first edge-ngrammed token of the search term) will match nuv (the first indexed edge-ngrammed token).
The case of the second query is a bit special; quoted below is how the fuzziness parameter works in the context of match queries:
Fuzzy matching is not applied to terms with synonyms or in cases where the analysis process produces multiple tokens at the same position. Under the hood these terms are expanded to a special synonym query that blends term frequencies, which does not support fuzzy expansion.
The condition about the analysis process producing multiple tokens at the same position is what applies to your case. Since the search term qnuv is analyzed by my_analyzer, it produces the two tokens qnu and qnuv at the same position, and that does not support fuzzy matching.
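You can see this expansion directly with the _analyze API (using the my_analyzer from your index definition):
GET /my-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "qnuv"
}
It should return the two tokens qnu and qnuv, both at position 0, which is exactly the multiple-tokens-at-the-same-position case that disables fuzzy expansion in the match query.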
You need to change your mapping to this one instead and it will work the way you expect, i.e. all three queries will return your document:
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "standard" <---- add this line
}
}
}

Elasticsearch - Do searches for alternative country codes

I have a document with a field called 'countryCode'. I have a term query that searches for its keyword value, but I'm having some issues:
Some records say UK and some others say GB
Some records say US and some others say USA
And the list goes on...
Can I instruct my index to handle all those variations somehow, instead of me having to expand the terms in my query filter?
What you are looking for is a way to have your tokens treated as equivalent to similar tokens that may or may not share the same characters. This is only possible using synonyms.
Elasticsearch lets you configure synonyms and have your query use them, returning results accordingly.
I have configured a field with a custom analyzer that uses a synonym token filter. Below are a sample mapping and query so you can play with them and see if this fits your needs.
Mapping
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "usa, us",
            "uk, gb"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "mydocs": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "my_synonyms"
        }
      }
    }
  }
}
Sample Document
POST my_index/mydocs/1
{
  "name": "uk is pretty cool country"
}
And when you use the query below, it does return the above document as well.
Query
GET my_index/mydocs/_search
{
  "query": {
    "match": {
      "name": "gb"
    }
  }
}
Refer to the official documentation to learn more about this. Hope this helps!
To handle this within ES itself without using Logstash, I'd suggest using a simple ingest pipeline with a gsub processor to update the field in place:
{
  "gsub": {
    "field": "countryCode",
    "pattern": "GB",
    "replacement": "UK"
  }
}
https://www.elastic.co/guide/en/elasticsearch/reference/master/gsub-processor.html
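For context, a gsub processor lives inside a pipeline definition. A rough sketch of a full pipeline and of indexing a document through it could look like this (the pipeline name, index name, and the extra USA -> US rule are just illustrative):
PUT _ingest/pipeline/country-code-normalizer
{
  "description": "Normalize alternative country codes",
  "processors": [
    {
      "gsub": {
        "field": "countryCode",
        "pattern": "GB",
        "replacement": "UK"
      }
    },
    {
      "gsub": {
        "field": "countryCode",
        "pattern": "USA",
        "replacement": "US"
      }
    }
  ]
}
POST my-index/_doc?pipeline=country-code-normalizer
{
  "countryCode": "GB"
}
The stored document then contains UK in countryCode, so the existing term query keeps working against a single canonical code.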

Semi-exact (complete) match in ElasticSearch

Is there a way to require a complete (though not necessarily exact) match in ElasticSearch?
For instance, if a field has the term "I am a little teapot short and stout", I would like to match on " i am a LITTLE TeaPot short and stout! " but not just "teapot short and stout". I've tried the term filter, but that requires an actual exact match.
If your "not necessarily exact" definition refers to uppercase/lowercase letters combination and the punctuation marks (like ! you have in your example), this would be a solution, not too simple and obvious tough:
The mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_keyword_lowercase": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "trim",
            "my_pattern_replace"
          ]
        }
      },
      "filter": {
        "my_pattern_replace": {
          "type": "pattern_replace",
          "pattern": "!",
          "replacement": ""
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "my_keyword_lowercase"
        }
      }
    }
  }
}
The idea here is the following:
use a keyword tokenizer to keep the text as is, not split into tokens
use the lowercase filter to get rid of the mixed uppercase/lowercase characters
use the trim filter to get rid of leading and trailing whitespace
use a pattern_replace filter to get rid of the punctuation. This is needed because a keyword tokenizer won't touch the characters inside the text; a standard analyzer would handle the punctuation, but it would also split the text, whereas you need it kept as is
And this is the query you would use for the mapping above:
{
  "query": {
    "match": {
      "text": " i am a LITTLE TeaPot short and stout! "
    }
  }
}
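To sanity-check the analysis chain, you can run the input through _analyze (assuming the mapping above was created in an index called test):
POST /test/_analyze
{
  "analyzer": "my_keyword_lowercase",
  "text": " i am a LITTLE TeaPot short and stout! "
}
Both the indexed value and the query text should be reduced to the single token i am a little teapot short and stout, which is why the match query above finds the document while a partial phrase like teapot short and stout does not.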
