Rank Elasticsearch results by the shortest hit - sorting

I am building an ngram search example with ES. Is it possible to take into account the shortest length of all the hits?
Here's an example:
Documents:
{"aliases": ["ElonMuskTesla", "MuskTesla"]}
{"aliases": ["ElonMusk"]}
Default Result:
When searching for "Musk" against the field "aliases", the first document will have the highest score, because it has two hits matching "Musk".
What I want:
But I want the second document to appear at the top, because in my case it's more relevant to the search term (shorter means more similar).
I guess this might be achieved with a script score query, but I don't know exactly how, even after browsing a bunch of seemingly related questions.
[Appendix] Mapping & Settings:
{
"settings":{
"analysis":{
"tokenizer":{
"ngram":{
"type":"ngram",
"min_gram":2,
"max_gram":40
}
},
"analyzer":{
"ngram_analyzer":{
"tokenizer":"ngram",
"filter":[
"lowercase"
]
},
"lower_analyzer":{
"tokenizer":"keyword",
"filter":[
"lowercase"
]
}
}
}
},
"mappings":{
"properties":{
"aliases":{
"type":"text",
"analyzer":"ngram_analyzer",
"term_vector":"with_positions_offsets",
"search_analyzer":"lower_analyzer"
}
}
}
}
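
A hedged sketch of the script-score idea: divide the relevance score by the length of the shortest alias in each document. This assumes a keyword sub-field (called aliases.raw here, an assumed name) is added to the mapping so the alias strings are available as doc values, and uses the script_score query available in recent versions (older versions use function_score with a script_score function):

```json
GET my_index/_search
{
  "query": {
    "script_score": {
      "query": { "match": { "aliases": "musk" } },
      "script": {
        "source": "int shortest = Integer.MAX_VALUE; for (def a : doc['aliases.raw']) { if (a.length() < shortest) { shortest = a.length(); } } return _score / shortest;"
      }
    }
  }
}
```

With this, the document whose shortest alias is "ElonMusk" (8 chars) is divided by a smaller number than the one whose shortest alias is "MuskTesla" (9 chars), so it ranks first.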

Related

wildcard and term search returning different results based on case

I am using OpenSearch version 1.3.1 via the Docker image.
Here is my index and a document:
PUT index_0
{
"settings":{
"analysis":{
"analyzer":{
"keyword_lower":{
"type":"custom",
"tokenizer":"keyword",
"filter":"lowercase"
}
}
}
},
"mappings":{
"properties":{
"id":{
"type":"text",
"index":true
},
"name":{
"type":"text",
"index":true,
"analyzer":"keyword_lower"
}
}
}
}
PUT index_0/_doc/1
{
"id":"123",
"name":"FooBar"
}
If I run this query, I get results (notice the difference in case, lowercase b):
GET index_0/_search?pretty
{"query":{"wildcard":{"name":"Foobar"}}}
But if I run this query, I do not:
GET index_0/_search?pretty
{"query":{"term":{"name":"Foobar"}}}
Why does a term search seem to be case sensitive whereas a wildcard one is not, given the same field?
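
One diagnostic step (a sketch, not part of the original question): the _analyze API shows which token is actually stored. The keyword_lower analyzer emits a single lowercased token, and term queries are not analyzed, so they must match that stored token exactly:

```json
GET index_0/_analyze
{
  "analyzer": "keyword_lower",
  "text": "FooBar"
}
```

This returns the single token "foobar", which is why the unanalyzed term query for "Foobar" finds nothing.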

Exact Sub-String Match | ElasticSearch

We are migrating our search strategy from a database to Elasticsearch. During this migration we need to preserve the existing functionality of partially searching a field, similar to the SQL query below (including whitespace):
SELECT *
FROM customer
WHERE customer_id LIKE '%0995%';
I've gone through multiple articles about achieving this functionality in ES. Most of the articles I read recommended using an ngram analyzer/filter, so here is what the mapping and settings look like:
Note:
The max length of the customer_id field is VARCHAR2(100).
{
"customer-index":{
"aliases":{
},
"mappings":{
"customer":{
"properties":{
"customerName":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"customerId":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
},
"analyzer":"substring_analyzer"
}
}
}
},
"settings":{
"index":{
"number_of_shards":"3",
"provided_name":"customer-index",
"creation_date":"1573333835055",
"analysis":{
"filter":{
"substring":{
"type":"ngram",
"min_gram":"3",
"max_gram":"100"
}
},
"analyzer":{
"substring_analyzer":{
"filter":[
"lowercase",
"substring"
],
"type":"custom",
"tokenizer":"standard"
}
}
},
"number_of_replicas":"1",
"uuid":"XXXXXXXXXXXXXXXXX",
"version":{
"created":"5061699"
}
}
}
}
}
The request to query the data looks like this:
{
"from": 0,
"size": 10,
"sort": [
{
"name.keyword": {
"missing": "_first",
"order": "asc"
}
}
],
"query": {
"bool": {
"filter": [
{
"query_string": {
"query": "0995",
"fields": [
"customer_id"
],
"analyzer": "substring_analyzer"
}
}
]
}
}
}
With that said, here are a couple of questions/issues:
Let's say there are 3 records with customer_id:
0009950011214,
0009900011214,
0009920011214
When I search for "0995", ideally I expect to get only customer_id 0009950011214.
But I get all three records in the result set, and I believe it's due to the ngram analyzer and the way it splits the string (note: min_gram: 3 and max_gram: 100). Setting max_gram to 100 was meant to allow exact matches.
How should I fix this?
This brings me to my second point: is using an ngram analyzer the most effective strategy for this kind of requirement? My concern is the memory utilization of having min_gram = 3 and max_gram = 100. Is there a better way to implement the same?
P.S: I'm on NEST 5.5.
In your customerId field you can pass "search_analyzer": "standard". Then, in your search query, remove the line "analyzer": "substring_analyzer".
This will ensure that the searched customerId is not tokenized into ngrams and is searched as-is, while the indexed customerIds are tokenized into ngrams.
I believe that's the functionality you were trying to replicate from your SQL query.
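
As a sketch, the relevant part of the mapping would then look like this (the keyword sub-field is omitted here for brevity):

```json
"customerId": {
  "type": "text",
  "analyzer": "substring_analyzer",
  "search_analyzer": "standard"
}
```

The analyzer applies at index time, while search_analyzer overrides it at query time.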
From the mapping I can see that the field customerId is a text/keyword field (see: Difference between keyword and text in ElasticSearch).
So you can use a regexp filter as shown below to perform searches like the SQL query you gave as an example. Try this:
{
"query": {
"constant_score": {
"filter": {
"bool": {
"must": [
{
"regexp": {
"customerId": {
"value": ".*0995.*",
"flags": "ALL"
}
}
}
]
}
}
}
}
}
Notice the "." in the value of the regexp expression.
Wrapping the term in .* is the same as a contains search.
~(..) is the same as not-contains.
You can also append ".*" at the start or the end of the search term to do starts-with and ends-with type searches. Reference: https://www.elastic.co/guide/en/elasticsearch/reference/6.4/query-dsl-regexp-query.html

analyser with ngram token depending on term length

I'm building an analyzer to provide partial search on terms. I want to use a 2-5 ngram tokenizer at index time and a 5-5 ngram tokenizer at search time.
The rationale for using 2-5 ngrams at index time is that a partial term query of length 2 should match.
At search time, if the search term has a length lower than 5, the term can be looked up directly in the inverted index. If it has a length greater than 5, the term is tokenized into 5-grams and matches if all tokens match.
However, in Elastic, a 5-5 ngram tokenizer won't create any tokens if the query term has a length lower than 5.
A solution could be to use a 2-5 tokenizer at search time, the same as for indexing, but this would result in searching all the 2-gram, 3-gram and 4-gram tokens as well, which is useless (the 5-gram tokens are sufficient).
Here is my current index mapping:
{
"settings" : {
"analysis":{
"analyzer":{
"index_partial":{
"type":"custom",
"tokenizer":"2-5_ngram_token"
},
"search_partial":{
"type":"custom",
"tokenizer": "5-5_ngram_token"
}
},
"tokenizer":{
"2-5_ngram_token": {
"type":"nGram",
"min_gram":"2",
"max_gram":"5"
},
"5-5_ngram_token": {
"type":"nGram",
"min_gram":"5",
"max_gram":"5"
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"name_trans": {
"type": "text",
"fields": {
"partial": {
"type":"text",
"analyzer":"index_partial",
"search_analyzer":"search_partial"
}
}
}
}
}
}
So my question is: how can I create an analyzer that does a no-op if the search query has a length lower than 5, and creates 5-gram tokens if it has a length of 5 or greater?
---------------------- UPDATE WITH WORKAROUND SOLUTION -----------------------
It seems it is not possible to create an analyzer that does a no-op if len < 5 and produces 5-grams if len >= 5.
There are two workaround solutions for partial search:
1- As mentioned by @Amit Khandelwal, one solution is to use max ngrams at index time. If your field has 30 chars max, use a tokenizer with ngram 2-30 and, at search time, search for the exact term without processing it with the ngram analyzer (either via a term query or by setting the search analyzer to keyword).
The drawback of this solution is that it can result in a huge inverted index, depending on the max length.
2- The other solution is to create two fields:
- one for short search query terms, which can be looked up in the inverted index directly, without being tokenized
- one for longer search query terms, which shall be tokenized
Depending on the length of the search query term, the search is performed on one of those two fields.
Below is the mapping I used for solution 2 (the limit I chose between short and long terms is len=5):
PUT name_test
{
"settings" : {
"max_ngram_diff": 3,
"analysis":{
"analyzer":{
"2-4nGrams":{
"type":"custom",
"tokenizer":"2-4_ngram_token",
"filter": ["lowercase"]
},
"5-5nGrams":{
"type":"custom",
"tokenizer": "5-5_ngram_token",
"filter": ["lowercase"]
}
},
"tokenizer":{
"2-4_ngram_token": {
"type":"nGram",
"min_gram":"2",
"max_gram":"4"
},
"5-5_ngram_token": {
"type":"nGram",
"min_gram":"5",
"max_gram":"5"
}
}
}
},
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"name_trans": {
"type": "text",
"fields": {
"2-4partial": {
"type":"text",
"analyzer":"2-4nGrams",
"search_analyzer":"keyword"
},
"5-5partial": {
"type":"text",
"analyzer":"5-5nGrams"
}
}
}
}
}
}
And here are the two kinds of requests to use with this mapping, depending on the search term length:
GET name_test/_search
{
"query": {
"match": {
"name_trans.2-4partial": {
"query": "ema",
"operator": "and",
"fuzziness": 0
}
}
}
}
GET name_test/_search
{
"query": {
"match": {
"name_trans.5-5partial": {
"query": "emanue",
"operator": "and",
"fuzziness": 0
}
}
}
Maybe this will help someone someday :)
I am not sure if it's possible in Elasticsearch or not, but I can suggest a workaround which we also use in our application, although our use case was different.
Create a custom analyzer using a 2-5 ngram tokenizer on the fields you want to use for partial search. This will store the ngram tokens of the fields in the inverted index; for example, for a field containing foobar as a value, it will store fo, foo, foob, fooba, oo, oob, ooba, oobar, ob, oba, obar, ba, bar, ar.
Now, instead of a match query, use a term query on the partial fields, which is not analyzed (you can read about the difference between them here).
So in this case it doesn't matter whether the search term is shorter than 5 chars or not; it will still match the tokens and you will get the results.
Now let's dry-run this on a field containing foobar as a value and test it against some search terms.
Case 1: the search term contains fewer than 5 chars, like fo, oo, ar, bar, oob, oba and ooba; it will still match, as these tokens are present in the inverted index.
Case 2: the search term contains 5 or more chars, like fooba or oobar; it also returns the document, as the index contains these tokens.
Let me know if it's clear or you require additional clarification.
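
For illustration, a sketch of such a term query (the index name my_index and field name name.partial are assumptions; use whichever sub-field carries the 2-5 ngram analyzer):

```json
GET my_index/_search
{
  "query": {
    "term": {
      "name.partial": "oob"
    }
  }
}
```

Since the term query does not analyze "oob", it is matched directly against the ngram tokens in the inverted index.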

Elastic Search: Matching sub token default operator

Is there a way to set the default operator for sub-tokens (tokens generated by the analyzer)? It currently seems to default to OR, and setting operator does not work.
I'm using the validate API to see how Elasticsearch understands the query:
/myIndex/mapping/_validate/query?explain=true
{
"query":{
"multi_match":{
"type":"phrase_prefix",
"query":"test123",
"fields":[
"message"
],
"lenient":true,
"analyzer":"myAnalyzer"
}
}
}
Which returns
+(message:test123 message:test message:123)
What I want is
+message:test123 +message:test +message:123
Is there any way to do this without using a script or splitting the terms and creating a more complex query in the application?
EDIT
Using operator or minimum_should_match does not make a difference.
My Elasticsearch analysis settings for myAnalyzer are:
{
"analysis":{
"filter":{
"foldAscii":{
"type":"asciifolding",
"preserve_original":"1"
},
"capturePattern":{
"type":"pattern_capture",
"patterns":[
"(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+(?!\\p{Ll}+))",
"(\\d+)"
]
},
"noDuplicates":{
"type":"unique",
"only_on_same_position":"true"
}
},
"myAnalyzer":{
"filter":[
"capturePattern",
"lowercase",
"foldAscii",
"noDuplicates"
],
"tokenizer":"standard"
}
}
}

Keep non-stemmed tokens on Elasticsearch

I'm using a stemmer (for Brazilian Portuguese) when I index documents on Elasticsearch. This is what my default analyzer looks like (never mind minor mistakes here, because I've copied this by hand from my code on the server):
{
"analysis":{
"filter":{
"my_asciifolding": {
"type": "asciifolding",
"preserve_original": true,
},
"stop_pt":{
"type": "stop",
"ignore_case": true,
"stopwords": "_brazilian_"
},
"stemmer_pt": {
"type": "stemmer",
"language": "brazilian"
}
},
"analyzer": {
"default": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"my_asciifolding",
"stop_pt",
"stemmer_pt"
]
}
}
}
}
I haven't really touched my type mappings (apart from a few numeric fields I've declared as "type": "long"), so I expect most fields to be using the default analyzer specified above.
This works as expected, but some users are frustrated because, since tokens are stemmed, the query "vulnerabilities" and the query "vulnerable" return the same results. This is misleading, because they expect the results with an exact match to be ranked first.
What is the standard way (if any) to do this in Elasticsearch? (Maybe keep the unstemmed tokens in the index as well as the stemmed tokens?) I'm using version 1.5.1.
I ended up using multi-fields ("fields") to index my attributes in different ways. I'm not sure whether this is optimal, but this is how I'm handling it right now:
Add another analyzer (I called it "no_stem_analyzer") with all the filters that the "default" analyzer has, minus "stemmer".
For each attribute where I want to keep both the non-stemmed and stemmed variants, I did this (example for the field "DESCRIPTION"):
"mappings":{
"_default_":{
"properties":{
"DESCRIPTION":{
"type"=>"string",
"fields":{
"no_stem":{
"type":"string",
"index":"analyzed",
"analyzer":"no_stem_analyzer"
},
"stemmed":{
"type":"string",
"index":"analyzed",
"analyzer":"default"
}
}
}
},//.. other attributes here
}
}
At search time (using query_string), I must also indicate (using the "fields" parameter) that I want to search all sub-fields (e.g. "DESCRIPTION.*").
I also based my approach on this answer: elasticsearch customize score for synonyms/stemming.
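
For illustration, such a query_string search over the multi-fields might look like this (a sketch; the index name my_index and the boost value are assumptions):

```json
GET my_index/_search
{
  "query": {
    "query_string": {
      "query": "vulnerable",
      "fields": ["DESCRIPTION.no_stem^2", "DESCRIPTION.stemmed"]
    }
  }
}
```

Boosting the no_stem sub-field (here with ^2) is one way to rank exact, unstemmed matches above stemmed ones.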
