Elasticsearch Edge-NGrams Prefer Shorter Terms

I like the results I am getting from Elasticsearch using Edge-NGrams to index data and a different analyzer for searching. I would, however, prefer that shorter terms that match get ranked higher than longer terms.
For example, take the terms ABC100 and ABC100xxx. If I perform a query using the term ABC, I get back both of these documents as hits with the same score. What I would like is for ABC100 to be scored higher than ABC100xxx, because ABC is a closer match to ABC100 according to something like the Levenshtein distance algorithm.
Setting up the index:
PUT stackoverflow
{
"settings": {
"index": {
"number_of_replicas": 0,
"number_of_shards": 1
},
"analysis": {
"filter": {
"edge_ngram": {
"type": "edgeNGram",
"min_gram": "1",
"max_gram": "20"
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"edge_ngram"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"product": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "whitespace"
}
}
}
}
}
Inserting documents:
PUT stackoverflow/doc/1
{
"product": "ABC100"
}
PUT stackoverflow/doc/2
{
"product": "ABC100xxx"
}
Search query:
GET stackoverflow/_search?pretty
{
"query": {
"match": {
"product": "ABC"
}
}
}
Results:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.28247002,
"hits": [
{
"_index": "stackoverflow",
"_type": "doc",
"_id": "2",
"_score": 0.28247002,
"_source": {
"product": "ABC100xxx"
}
},
{
"_index": "stackoverflow",
"_type": "doc",
"_id": "1",
"_score": 0.28247002,
"_source": {
"product": "ABC100"
}
}
]
}
}
Does anyone know how I can have a shorter term such as ABC100 ranked higher than ABC100xxx?

After finding plenty of less-than-optimal solutions, such as storing the field length as a separate field or using a script query, I found the root of my problem: I was using the edge_ngram token filter instead of the edge_ngram tokenizer.
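For reference, a minimal sketch of the corrected settings (same index, field names, and whitespace search analyzer as above; the my_edge_tokenizer name and the token_chars choice are mine), with the n-gramming moved from a token filter to the edge_ngram tokenizer:
PUT stackoverflow
{
  "settings": {
    "index": {
      "number_of_replicas": 0,
      "number_of_shards": 1
    },
    "analysis": {
      "tokenizer": {
        "my_edge_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_edge_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "product": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "whitespace"
        }
      }
    }
  }
}
With this change, the query for ABC reportedly ranks ABC100 above ABC100xxx.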

Related

Normalizing keyword field: ascii should match diacritic, but not vice versa

I have a keyword field that can contain characters with diacritics. Queries without diacritics should return results with those diacritics, but not vice versa. The first part can be resolved by using a normalizer, the configuration for which is also described in a related question. If I use that for e.g. {"title": "Sulgi"} and {"title": "Šulgi"}, searching for "Sulgi" will (correctly) return both documents. However, searching for "Šulgi" also returns both documents, instead of just the one with the diacritic. It seems ES is also normalizing the query input, which is generally good, but is it possible to change that behavior?
PUT _template/test
{
"index_patterns": ["*"],
"settings": {
"analysis": {
"normalizer": {
"exact": {
"type": "custom",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "keyword",
"normalizer": "exact"
}
}
}
}
POST test/_doc/1
{
"title": "Sulgi"
}
POST test/_doc/2
{
"title": "Šulgi"
}
Example search query:
POST test/_search
{
"query": {
"term": {
"title":"Šulgi"
}
}
}
{
"took": 294,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 0.18232156,
"hits": [
{
"_index": "test",
"_type": "_doc",
"_id": "1",
"_score": 0.18232156,
"_source": {
"title": "Šulgi"
}
},
{
"_index": "test",
"_type": "_doc",
"_id": "2",
"_score": 0.18232156,
"_source": {
"title": "Sulgi"
}
}
]
}
}
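One possible direction (a sketch, not a confirmed fix; the diacritic_exact normalizer and the title.diacritic subfield names are illustrative): keep the folded normalizer for accent-insensitive matching, and add a keyword subfield whose normalizer only lower-cases, so diacritics survive indexing. Queries that contain diacritics can then target that subfield.
PUT _template/test
{
  "index_patterns": ["*"],
  "settings": {
    "analysis": {
      "normalizer": {
        "exact": {
          "type": "custom",
          "filter": ["lowercase", "asciifolding"]
        },
        "diacritic_exact": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "keyword",
        "normalizer": "exact",
        "fields": {
          "diacritic": {
            "type": "keyword",
            "normalizer": "diacritic_exact"
          }
        }
      }
    }
  }
}
A term query for "Šulgi" against title.diacritic is only lower-cased, not folded, so it should match only the document with the diacritic, while queries without diacritics can keep using title:
POST test/_search
{
  "query": {
    "term": {
      "title.diacritic": "Šulgi"
    }
  }
}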

Elasticsearch template to support case insensitive searches

I've set up a normalizer on an index field to support case-insensitive searches, but I can't seem to get it to work.
GET users/
Returns the following mapping:
{
  "users": {
    "aliases": {},
    "mappings": {
      "user": {
        "properties": {
          "active": {
            "type": "boolean"
          },
          "first_name": {
            "type": "keyword",
            "fields": {
              "normalize": {
                "type": "keyword",
                "normalizer": "search_normalizer"
              }
            }
          }
        }
      }
    },
    "settings": {
      "index": {
        "number_of_shards": "5",
        "provided_name": "users",
        "creation_date": "1567936315432",
        "analysis": {
          "normalizer": {
            "search_normalizer": {
              "filter": [
                "lowercase"
              ],
              "type": "custom"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "5SknFdwJTpmF",
        "version": {
          "created": "6040299"
        }
      }
    }
  }
}
Although first_name is normalized to lowercase, queries on the first_name field are case sensitive.
Using the following query for a user with first name Dave
GET users/_search
{
"query": {
"bool": {
"should": [
{
"regexp": {
"first_name": {
"value": ".*dave.*"
}
}
}
]
}
}
}
GET users/_analyze
{
"analyzer" : "standard",
"text": "Dave"
}
returns
{
"tokens": [
{
"token": "dave",
"start_offset": 0,
"end_offset": 4,
"type": "<ALPHANUM>",
"position": 0
}
]
}
Although "Dave" is tokenized to "dave" the following query
GET users/_search
{
"query": {
"match": {
"first_name": "dave"
}
}
}
Returns no hits.
Is there an issue with my current mapping, or with the query?
I think you have missed using first_name.normalize in the query.
Indexing records (the bulk API needs an action line before each document; the test3 index and type names below are taken from the result):
POST test3/test3_type/_bulk
{"index": {}}
{"first_name": "Daveraj"}
{"index": {}}
{"first_name": "RajdaveN"}
{"index": {}}
{"first_name": "Dave"}
Query
"query": {
"bool": {
"should": [
{
"regexp": {
"first_name.normalize": {
"value": ".*dave.*"
}
}
}
]
}
}
}
Result
"took": 10,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1.0,
"hits": [
{
"_index": "test3",
"_type": "test3_type",
"_id": "M8-lEG0BLCpzI1hbBWYC",
"_score": 1.0,
"_source": {
"first_name": "Dave"
}
},
{
"_index": "test3",
"_type": "test3_type",
"_id": "Mc-lEG0BLCpzI1hbBWYC",
"_score": 1.0,
"_source": {
"first_name": "Daveraj"
}
},
{
"_index": "test3",
"_type": "test3_type",
"_id": "Ms-lEG0BLCpzI1hbBWYC",
"_score": 1.0,
"_source": {
"first_name": "RajdaveN"
}
}
]
}
}
You have created a normalized multi-field, first_name.normalize, but you are searching on the original field first_name, which has no analyzer specified and therefore defaults to the index-default or standard analyzer.
The examples given here might help:
https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html
You need to explicitly specify the multi-field you want to search on. Note that even though a multi-field can't have its own content, it often indexes different terms than its parent (although not always), because it may be analyzed with different analyzers, character filters, or token filters.
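For example (a minimal sketch), a match query against the normalized subfield finds the user regardless of case:
GET users/_search
{
  "query": {
    "match": {
      "first_name.normalize": "dave"
    }
  }
}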

Elasticsearch concatenate two words into one

I have a field ManufacturerName
"ManufacturerName": {
"type": "keyword",
"normalizer" : "keyword_lowercase"
},
And a normalizer
"normalizer": {
"keyword_lowercase": {
"type": "custom",
"filter": ["lowercase"]
}
}
When searching for 'ripcurl' it matches; however, when searching for 'rip curl' it doesn't.
How/what would I use to concatenate certain words, i.e. 'rip curl' -> 'ripcurl'?
Apologies if this is a duplicate, I've spent some time seeking a solution to this.
You would want to make use of a text field for what you are looking for, and carry out this kind of requirement via the ngram tokenizer.
Below is a sample mapping, query and response:
Mapping:
PUT mysomeindex
{
"mappings": {
"mydocs":{
"properties": {
"ManufacturerName":{
"type": "text",
"analyzer": "my_analyzer",
"fields":{
"keyword":{
"type": "keyword",
"normalizer": "my_normalizer"
}
}
}
}
}
},
"settings": {
"analysis": {
"normalizer": {
"my_normalizer":{
"type": "custom",
"char_filter": [],
"filter": ["lowercase", "asciifolding"]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer",
"filter": [ "synonyms" ]
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 5,
"token_chars": [
"letter",
"digit"
]
}
},
"filter": {
"synonyms":{
"type": "synonym",
"synonyms" : ["henry loyd, henry loid, henry lloyd => henri lloyd"]
}
}
}
}
}
Notice that the field ManufacturerName is a multi-field with both a text type and a sibling keyword type. That way you can use the keyword field for exact matches and aggregation queries, while the text field handles this requirement.
Sample Document:
POST mysomeindex/mydocs/1
{
"ManufacturerName": "ripcurl"
}
POST mysomeindex/mydocs/2
{
"ManufacturerName": "henri lloyd"
}
What Elasticsearch does when you ingest the above documents is create tokens of length 3 to 5 and store them in the inverted index, e.g. rip, ipc, pcu, and so on.
You can execute the query below to see what tokens get created:
POST mysomeindex/_analyze
{
"text": "ripcurl",
"analyzer": "my_analyzer"
}
I'd also suggest you look into the edge_ngram tokenizer to see whether it fits your requirement better.
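A minimal sketch of that variant (only the tokenizer definition changes relative to the mapping above; whether edge grams fit depends on your search patterns):
"tokenizer": {
  "my_tokenizer": {
    "type": "edge_ngram",
    "min_gram": 3,
    "max_gram": 5,
    "token_chars": ["letter", "digit"]
  }
}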
Query:
POST mysomeindex/_search
{
"query": {
"match": {
"ManufacturerName": "rip curl"
}
}
}
Response:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.25316024,
"hits": [
{
"_index": "mysomeindex",
"_type": "mydocs",
"_id": "1",
"_score": 0.25316024,
"_source": {
"ManufacturerName": "ripcurl"
}
}
]
}
}
Query for Synonyms:
POST mysomeindex/_search
{
"query": {
"match": {
"ManufacturerName": "henri lloyd"
}
}
}
Response:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 2.2784421,
"hits": [
{
"_index": "mysomeindex",
"_type": "mydocs",
"_id": "2",
"_score": 2.2784421,
"_source": {
"ManufacturerName": "henry lloyd"
}
}
]
}
}
Note: if you intend to make use of synonyms, the best way is to keep them in a text file and reference it by a path relative to the config folder, as described in the Elasticsearch documentation.
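A sketch of that file-based setup (assuming a synonyms file at analysis/synonyms.txt under the node's config directory):
"filter": {
  "synonyms": {
    "type": "synonym",
    "synonyms_path": "analysis/synonyms.txt"
  }
}
with analysis/synonyms.txt containing lines such as:
henry loyd, henry loid, henry lloyd => henri lloyd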
Hope this helps!

How to control scoring or ordering of results while using ngram in Elasticsearch?

I am using Elasticsearch 6.x.
I have created an index test_index with index type doc as follows:
PUT /test_index
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "my_ngram_tokenizer"
}
},
"tokenizer": {
"my_ngram_tokenizer": {
"type": "nGram",
"min_gram": "1",
"max_gram": "7",
"token_chars": [
"letter",
"digit",
"punctuation"
]
}
}
}
},
"mappings": {
"doc": {
"properties": {
"my_text": {
"type": "text",
"fielddata": true,
"fields": {
"ngram": {
"type": "text",
"fielddata": true,
"analyzer": "my_analyzer"
}
}
}
}
}
}
}
I have indexed data as follows:
PUT /test_index/doc/1
{
"my_text": "ohio"
}
PUT /test_index/doc/2
{
"my_text": "ohlin"
}
PUT /test_index/doc/3
{
"my_text": "john"
}
Then I used search query:
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "oh",
"fields": [
"my_text^5",
"my_text.ngram"
]
}
}
]
}
}
}
And got the response:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 5,
"max_score": 1.0042334,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1.0042334,
"_source": {
"my_text": "ohio"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "3",
"_score": 0.97201055,
"_source": {
"my_text": "john"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "2",
"_score": 0.80404717,
"_source": {
"my_text": "ohlin"
}
}
]
}
}
Here we can see that when I searched for oh, I got results in the order:
-> ohio
-> john
-> ohlin
But I want the scoring and ordering of the results to give higher priority to a matching prefix:
-> ohio
-> ohlin
-> john
How can I achieve such a result? What approaches can I take here?
Thanks in advance.
You should add a new subfield with a new analyzer that uses the edge_ngram tokenizer, then add that subfield to your multi_match.
You then need to use the most_fields type for your multi_match query. Only documents starting with the search term will match on this subfield, so they will be boosted above the other matching documents.
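A sketch of that suggestion (the prefix subfield, the my_edge_* names, and the ^10 boost are illustrative choices, not from the answer):
PUT /test_index2
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer"
        },
        "my_edge_analyzer": {
          "type": "custom",
          "tokenizer": "my_edge_tokenizer"
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "1",
          "max_gram": "7",
          "token_chars": ["letter", "digit", "punctuation"]
        },
        "my_edge_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 7,
          "token_chars": ["letter", "digit", "punctuation"]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "my_text": {
          "type": "text",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "my_analyzer"
            },
            "prefix": {
              "type": "text",
              "analyzer": "my_edge_analyzer"
            }
          }
        }
      }
    }
  }
}

GET /test_index2/_search
{
  "query": {
    "multi_match": {
      "query": "oh",
      "type": "most_fields",
      "fields": ["my_text^5", "my_text.ngram", "my_text.prefix^10"]
    }
  }
}
Only ohio and ohlin begin with oh, so only they gain the my_text.prefix boost; john still matches on the plain n-grams but ranks below them.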

Facing problems with terms filter

My mapping looks like the following:
"BID": {
"type": "string"
},
"REGION": {
"type": "string"
},
Now I am trying to search for the records whose BID values are B100 or B302, using the query below. Though I have records with those ID values, I am not getting any results. Any clue as to what I am doing wrong?
{"query": {"filtered": {"filter": {"terms": {"BID": ["B100","B302"]}}}}}
Try using lower-case values, like:
{"query": {"filtered": {"filter": {"terms": {"BID": ["b100","b302"]}}}}}
You need to do this because, since you did not specify an analyzer in the definition of "BID" in your mapping, the default standard analyzer is used, which will convert letters to lower-case.
Alternatively, if you want to maintain the case in your index terms, you can add "index": "not_analyzed" to your mapping definition for "BID".
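You can check what the standard analyzer does with the _analyze API (a quick sketch):
GET /_analyze
{
  "analyzer": "standard",
  "text": "B100"
}
This returns the single lower-cased token b100, which is why the upper-case terms in your terms filter find nothing: the index only contains the lower-cased form.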
To test I set up an index like this:
PUT /test_index
{
"mappings": {
"doc": {
"properties": {
"BID": {
"type": "string",
"index": "not_analyzed"
},
"REGION": {
"type": "string"
}
}
}
}
}
added a few docs:
POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"REGION":"NA","BID":"B100"}
{"index":{"_id":2}}
{"REGION":"NA","BID":"B200"}
{"index":{"_id":3}}
{"REGION":"NA","BID":"B302"}
and now your query works as written:
POST /test_index/_search
{
"query": {
"filtered": {
"filter": {
"terms": {
"BID": [
"B100",
"B302"
]
}
}
}
}
}
...
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1,
"_source": {
"REGION": "NA",
"BID": "B100"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "3",
"_score": 1,
"_source": {
"REGION": "NA",
"BID": "B302"
}
}
]
}
}
Here is some code I used for testing:
http://sense.qbox.io/gist/b4b4767501df7ad8b6459c4d96809d737a8811ec
