Mapping international characters to multiple options - elasticsearch

What I want to achieve is the ability for people to search for individuals without being language aware, while not penalizing those who are. What I mean is:
Given I build an index containing:
1. Jorgensen
2. Jörgensen
3. Jørgensen
I want to allow such conversions:
ö to o
ö to oe
ø to o
ø to oe
so if someone searches for:
QUERY | RESULT (I include only IDs, but it would be full records in reality)
Jorgensen | returns 1, 2, 3
Jörgensen | returns 1, 2
Jørgensen | returns 1, 3
Joergensen | returns 2, 3
Starting with that, I tried to create an index with an analyzer and char filter:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "ö => o",
            "ö => oe"
          ]
        }
      }
    }
  }
}
But that is invalid, because it tries to map the same character to two different things.
What am I missing? Do I need multiple analyzers? Any direction would be appreciated.

Since a custom mapping isn't enough in your case, as the comments above show, let's play with your data and character normalization.
In your case, normalization (e.g. with unidecode or unicodedata) isn't enough, due to the ø and oe conversions. Example:
import unicodedata

def strip_accents(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

body_matches = [
    u'Jorgensen',
    u'Jörgensen',
    u'Jørgensen',
    u'Joergensen',
]
for b in body_matches:
    print b, strip_accents(b)

>>>> Jorgensen Jorgensen
>>>> Jörgensen Jorgensen
>>>> Jørgensen Jørgensen
>>>> Joergensen Joergensen
So we need a custom translation. For now I've only set the characters you've shown, but feel free to complete the list.
accented_letters = {
    u'ö' : [u'o', u'oe'],
    u'ø' : [u'o', u'oe'],
}
Then we can normalize words, store them in a special property, body_normalized for instance, and index them as a field of your Elasticsearch records.
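A minimal mapping sketch for such an index could look like the following (the index, type and field names here are just placeholders matching the example code below, not taken from the notebook):
PUT your_index
{
  "mappings": {
    "your_type": {
      "properties": {
        "body": { "type": "string" },
        "body_normalized": { "type": "string" }
      }
    }
  }
}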
Once they are inserted, you can perform two types of search:
exact search: the user input isn't normalized and the Elasticsearch query searches against the body field, which isn't normalized either.
similar search: the user input is normalized and we search against the body_normalized field.
Let's see an example:
from elasticsearch import Elasticsearch

# Assumed client setup so the snippet is self-contained.
es = Elasticsearch()

def normalize_word(word):
    # Sketch of the helper used below (the full implementation lives in the
    # linked notebook): replace the accented letters with a canonical base
    # form, then strip any remaining accents.
    for letter, replacements in accented_letters.items():
        word = word.replace(letter, replacements[0])
    return strip_accents(word)

body_matches = [
    u'Jorgensen',
    u'Jörgensen',
    u'Jørgensen',
    u'Joergensen',
]

print "------EXACT MATCH------"
for body_match in body_matches:
    elasticsearch_query = {
        "query": {
            "match" : {
                "body" : body_match
            }
        }
    }
    es_kwargs = {
        "doc_type" : "your_type",
        "index" : 'your_index',
        "body" : elasticsearch_query
    }
    res = es.search(**es_kwargs)
    print body_match," MATCHING BODIES=",res['hits']['total']
    for r in res['hits']['hits']:
        print "-",r['_source'].get('body','')

print "\n------SIMILAR MATCHES------"
for body_match in body_matches:
    body_match = normalize_word(body_match)
    elasticsearch_query = {
        "query": {
            "match" : {
                "body_normalized" : body_match
            }
        }
    }
    es_kwargs = {
        "doc_type" : "your_type",
        "index" : 'your_index',
        "body" : elasticsearch_query
    }
    res = es.search(**es_kwargs)
    print body_match," MATCHING NORMALIZED BODIES=",res['hits']['total']
    for r in res['hits']['hits']:
        print "-",r['_source'].get('body','')
You can see a running example in this notebook

After playing with it quite a bit more, this is the approach I came up with so far:
We cannot store multiple representations of data in one field. That makes sense, so instead, as was suggested, we store multiple representations of the same field in sub-fields. I did everything with Kibana and/or Postman.
Create the index with the following settings:
PUT surname
{
  "mappings": {
    "individual": {
      "_all": { "enabled": false },
      "properties": {
        "id": { "type": "integer" },
        "name": {
          "type": "string",
          "analyzer": "not_folded",
          "fields": {
            "double": {
              "type": "string",
              "analyzer": "double_folder"
            },
            "single": {
              "type": "string",
              "analyzer": "folded"
            }
          }
        }
      }
    }
  },
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "double_folder": {
          "tokenizer": "icu_tokenizer",
          "filter": [
            "icu_folding"
          ],
          "char_filter": [
            "my_char_filter"
          ]
        },
        "folded": {
          "tokenizer": "icu_tokenizer",
          "filter": [
            "icu_folding"
          ]
        },
        "not_folded": {
          "tokenizer": "icu_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "ö => oe"
          ]
        }
      }
    }
  }
}
In this case it stores all names in three different formats:
The way it was entered
Folded to multiple symbols where I want it to (ö => oe)
Folded to a single symbol
(Note that icu_tokenizer and icu_folding come from the analysis-icu plugin, which has to be installed.)
Setting the number of shards to one is the important bit for testing, since having multiple shards doesn't work well when there is not enough data. Read more in "Relevance is broken".
Then we can add test data to our index:
POST surname/individual/_bulk
{ "index": { "_id": 1}}
{ "id": "1", "name": "Matt Jorgensen"}
{ "index": { "_id": 2}}
{ "id": "2", "name": "Matt Jörgensen"}
{ "index": { "_id": 3}}
{ "id": "3", "name": "Matt Jørgensen"}
{ "index": { "_id": 4}}
{ "id": "4", "name": "Matt Joergensen"}
All that is left is to test whether we get the proper responses:
GET surname/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "Jorgensen",
      "fields": [ "name", "name.double", "name.single" ]
    }
  }
}
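As a quick sanity check (not part of the original setup, just an illustration), the _analyze API shows how each analyzer treats a name; for Jörgensen the three analyzers should produce roughly jörgensen (not_folded), joergensen (double_folder) and jorgensen (folded):
GET surname/_analyze
{
  "analyzer": "not_folded",
  "text": "Jörgensen"
}

GET surname/_analyze
{
  "analyzer": "double_folder",
  "text": "Jörgensen"
}

GET surname/_analyze
{
  "analyzer": "folded",
  "text": "Jörgensen"
}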


.NET Elastic Search Create NGram Index

I am trying to set up Elasticsearch as a prototype for a project that might use it.
The project needs to look through the contents of datasets and make them searchable.
What I have right now is the following:
Index documents
Search through all fields of the indexed documents for the full text
Missing right now is:
Search through all fields of the indexed documents for partial text
That means I can find this sample dataset from my database by searching for e.g. "Sofia", "sofia", "anderson" or "canada", but not by searching for "canad".
{
  "id": 46,
  "firstName": "Sofia",
  "lastName": "Anderson",
  "country": "Canada"
}
I am creating my index using the "Elastic.Clients.Elasticsearch" NuGet package.
I try to create an index with an NGram tokenizer and apply it to all fields.
That somehow does not seem to be working.
This is the code that I use to create the index:
Client.Indices.Create(IndexName, c => c
    .Settings(s => s
        .Analysis(a => a
            .Tokenizer(t => t.Add(TokenizerName, new Tokenizer(new TokenizerDefinitions(new Dictionary<string, ITokenizerDefinition>() { { TokenizerName, ngram } }))))
            .Analyzer(ad => ad
                .Custom(AnalyzerName, ca => ca
                    .Tokenizer(TokenizerName)
                )
            )
        )
    )
    .Mappings(m => m
        .AllField(all => all
            .Enabled()
            .Analyzer(AnalyzerName)
            .SearchAnalyzer(AnalyzerName)
        )
    )
);
with
private string TokenizerName => "my_tokenizer";
private string AnalyzerName => "my_analyzer";
and
var ngram = new NGramTokenizer() { MinGram = 3, MaxGram = 3, TokenChars = new List<TokenChar>() { TokenChar.Letter }, CustomTokenChars = "" };
With this code I get the behaviour described above.
Is there any error in my code?
Am I missing something?
Do you need further information?
Thanks in advance
Paul
I did not find a way to get this running in .NET.
However, what worked for me was to create the index using this API call:
URL:
https://{{elasticUrl}}/{{indexName}}
Body:
{
  "mappings": {
    "properties": {
      "firstName": {
        "type": "text",
        "analyzer": "index_ngram",
        "search_analyzer": "search_ngram"
      },
      "lastName": {
        "type": "text",
        "analyzer": "index_ngram",
        "search_analyzer": "search_ngram"
      },
      "country": {
        "type": "text",
        "analyzer": "index_ngram",
        "search_analyzer": "search_ngram"
      }
    }
  },
  "settings": {
    "index": {
      "max_ngram_diff": 50
    },
    "analysis": {
      "filter": {
        "ngram_filter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 25
        }
      },
      "analyzer": {
        "index_ngram": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "ngram_filter", "lowercase" ]
        },
        "search_ngram": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": "lowercase"
        }
      }
    }
  }
}
This results in n-grams with term lengths from 2 to 25 for the fields firstName, lastName and country.
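To verify what actually gets indexed, you can run the index-time analyzer through the _analyze API once the index exists (the index name below is just a placeholder for whatever {{indexName}} was); Canada comes back as every lowercase substring of length 2 to 6, including canad:
GET your_index/_analyze
{
  "analyzer": "index_ngram",
  "text": "Canada"
}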
I hope this helps someone in the future :)

How to correctly query inside of terms aggregate values in elasticsearch, using include and regex?

How do you filter out/search in aggregate results efficiently?
Imagine you have 1 million documents in elastic search. In those documents, you have a multi_field (keyword, text) tags:
{
...
tags: ['Race', 'Racing', 'Mountain Bike', 'Horizontal'],
...
},
{
...
tags: ['Tracey Chapman', 'Silverfish', 'Blue'],
...
},
{
...
tags: ['Surfing', 'Race', 'Disgrace'],
...
},
You can use these values as filters (facets) against a query to pull only the documents that contain a given tag:
...
"filter": [
{
"terms": {
"tags": [
"Race"
]
}
},
...
]
But you want the user to be able to query for possible tag filters. So if the user types race, the result should show (from the previous example) ['Race', 'Tracey Chapman', 'Disgrace']. That way, the user can query for a filter to use. In order to accomplish this, I had to use aggregations:
{
  "aggs": {
    "topics": {
      "terms": {
        "field": "tags",
        "include": ".*[Rr][Aa][Cc][Ee].*", // I have to dynamically form this
        "size": 6
      }
    }
  },
  "size": 0
}
This gives me exactly what I need! But it is slow, very slow. I've tried adding the execution_hint; it does not help me.
You may think, "Just use a query before the aggregation!" But the issue is that it'll pull all tag values for all documents matched by that query, meaning you can end up displaying tags that are completely unrelated. If I queried for race before the aggregation and did not use the include regex, I would end up with all those other values, like 'Horizontal', etc.
How can I rewrite this aggregation to work faster? Is there a better way to write this? Do I really have to make a separate index just for the values? (sad face) It seems like this would be a common issue, but I have found no answers in the documentation or through googling.
You certainly don't need a separate index just for the values...
Here's my take on it:
What you're doing with the regex is essentially what should've been done by a tokenizer -- i.e. constructing substrings (or N-grams) such that they can be targeted later.
This means that the keyword Race will need to be tokenized into the n-grams ["rac", "race", "ace"]. (It doesn't really make sense to go any lower than 3 characters -- most autocomplete libraries choose to ignore fewer than 3 characters because the possible matches balloon too quickly.)
Elasticsearch offers the N-gram tokenizer but we'll need to increase the default index-level setting called max_ngram_diff from 1 to (arbitrarily) 10 because we want to catch as many ngrams as is reasonable:
PUT tagindex
{
  "settings": {
    "index": {
      "max_ngram_diff": 10
    },
    "analysis": {
      "analyzer": {
        "my_ngrams_analyzer": {
          "tokenizer": "my_ngrams",
          "filter": [ "lowercase" ]
        }
      },
      "tokenizer": {
        "my_ngrams": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10,
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  },
  { "mappings": ... } --> see below
}
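To see the n-grams this produces (purely a sanity check, assuming the index above has been created), you can run the analyzer through the _analyze API:
GET tagindex/_analyze
{
  "analyzer": "my_ngrams_analyzer",
  "text": "Race"
}
which returns the tokens rac, race and ace.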
When your tags field is a list of keywords, it's simply not possible to aggregate on that field without resorting to the include option, which can take either exact matches or a regex (which you're already using). Now, we cannot guarantee exact matches, but we also don't want a regex! That's why we need to use a nested list, which will treat each tag separately.
Now, nested lists are expected to contain objects so
{
  "tags": ["Race", "Racing", "Mountain Bike", "Horizontal"]
}
will need to be converted to
{
  "tags": [
    { "tag": "Race" },
    { "tag": "Racing" },
    { "tag": "Mountain Bike" },
    { "tag": "Horizontal" }
  ]
}
After that we'll proceed with the multi field mapping, keeping the original tags intact but also adding a .tokenized field to search on and a .keyword field to aggregate on:
"index": { ... },
"analysis": { ... },
"mappings": {
"properties": {
"tags": {
"type": "nested",
"properties": {
"tag": {
"type": "text",
"fields": {
"tokenized": {
"type": "text",
"analyzer": "my_ngrams_analyzer"
},
"keyword": {
"type": "keyword"
}
}
}
}
}
}
}
We'll then add our adjusted tags docs:
POST tagindex/_doc
{"tags":[{"tag":"Race"},{"tag":"Racing"},{"tag":"Mountain Bike"},{"tag":"Horizontal"}]}
POST tagindex/_doc
{"tags":[{"tag":"Tracey Chapman"},{"tag":"Silverfish"},{"tag":"Blue"}]}
POST tagindex/_doc
{"tags":[{"tag":"Surfing"},{"tag":"Race"},{"tag":"Disgrace"}]}
and apply a nested filter terms aggregation:
GET tagindex/_search
{
  "aggs": {
    "topics_parent": {
      "nested": {
        "path": "tags"
      },
      "aggs": {
        "topics": {
          "filter": {
            "term": {
              "tags.tag.tokenized": "race"
            }
          },
          "aggs": {
            "topics": {
              "terms": {
                "field": "tags.tag.keyword",
                "size": 100
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}
yielding
{
  ...
  "topics_parent" : {
    ...
    "topics" : {
      ...
      "topics" : {
        ...
        "buckets" : [
          {
            "key" : "Race",
            "doc_count" : 2
          },
          {
            "key" : "Disgrace",
            "doc_count" : 1
          },
          {
            "key" : "Tracey Chapman",
            "doc_count" : 1
          }
        ]
      }
    }
  }
}
Caveats
in order for this to work, you'll have to reindex
ngrams will increase the storage footprint -- depending on how many tags-per-doc you have, it may become a concern
nested fields are internally treated as "separate documents" so this affects the disk space too
P.S.: This is an interesting use case. Let me know how the implementation went!

indexing suggestions using analyzer

Good day:
I'm trying to figure out how to index suggestions without splitting my text on a delimiter before storing it in the CompletionField:
List<string> inputs = new List<string>() {
    facility.City,
    facility.State,
    facility.ZipCode
};

inputs.AddRange(facility.Name.Split(' '));
inputs.AddRange(facility.Address.Split(' '));
inputs.AddRange(facilityType.Description.Split(' '));

var completionField = new CompletionField()
{
    Input = inputs.AsEnumerable<string>()
};

return completionField;
This isn't an optimal way of doing this, because I would rather let the analyzer handle it as opposed to splitting the text myself and then indexing it. Is there a way to send the entire text to Elasticsearch and let it analyze the text and store it in the completion field at index time, or something else?
Update
I've changed my code to index the entire text and to use the default analyzer; however, this is what was indexed, and the analyzer isn't breaking the text up:
"suggest": {
"input": [
"Reston",
"Virginia",
"20190",
"Facility 123456",
"22100 Sunset Hills Rd suite 150*"
]
},
My code:
List<string> inputs = new List<string>() {
    facility.City,
    facility.State,
    facility.ZipCode
};

inputs.Add(facility.Name);
inputs.Add(facility.Address);

if (facility.Description != null && facility.Description != "")
{
    inputs.Add(facility.Description);
}

var completionField = new CompletionField()
{
    Input = inputs.AsEnumerable<string>()
};

return completionField;
My mapping for the property:
"suggest": {
"type": "completion",
"analyzer": "simple",
"preserve_separators": true,
"preserve_position_increments": true,
"max_input_length": 50
},
But it's still not breaking up my input.
Just send all the text in input and specify a custom analyzer that uses the whitespace tokenizer and nothing else.
EDIT
First add the analyzer
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "suggest": {
          "type": "completion",
          "analyzer": "my_custom_analyzer"
        },
        "title": {
          "type": "keyword"
        }
      }
    }
  }
}
Then specify it on the suggest field, as in the mapping above.
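For completeness, a minimal completion query against that field might look like this (the suggestion name and prefix are just examples, not from the original answer):
POST my_index/_search
{
  "suggest": {
    "facility_suggest": {
      "prefix": "res",
      "completion": {
        "field": "suggest"
      }
    }
  }
}
With the asker's data indexed this way, a prefix such as res should return the document whose inputs include Reston.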

elasticsearch: how to map common character mistakes?

I would like to map common mistakes in my language, such as:
xampu -> shampoo
Shampoo is an English word, but it is commonly used in Brazil. In Portuguese, "ch" sounds like "x", and sometimes "s" sounds like "z". We also do not have "y" in our language, but it's common in names and foreign words; it sounds like "i".
So I would like to map a character replacement, but also keep the original word at the same position.
So a mapping table would be:
ch -> x
sh -> x
y -> i
ph -> f
s -> z
I have taken a look at character filters, but they seem to only support replacement:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html
I want to form derived words based on the original so users can find the correct word even if it is typed wrong. To achieve this, the following product name:
SHAMPOO NIVEA MEN
Should be tokenized as:
0: SHAMPOO, XAMPOO
1: NIVEA
2: MEN
I am using the synonym filter, but with synonyms I need to map every word.
Any way to do this?
Thanks.
For your use case, multi-fields seem to suit best. You can keep your field analyzed in two ways: one using the standard analyzer and the other using a custom analyzer built with a mapping char filter.
It would look like this:
Index Creation
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "ch => x",
            "sh => x",
            "y => i",
            "ph => f",
            "s => z"
          ]
        }
      }
    }
  }
}
MultiField creation
POST my_index/_mapping/my_type
{
  "properties": {
    "field_name": {
      "type": "text",
      "analyzer": "standard",
      "fields": {
        "mapped": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
The above mapping creates two versions of field_name: one analyzed with the standard analyzer and another analyzed with your custom analyzer.
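As a quick illustration (note that mapping char filter rules are literal and case-sensitive, so lowercase input is used here), you can check the custom analyzer with _analyze:
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "shampoo nivea men"
}
This returns the tokens xampoo, nivea and men, because the longest matching rule (sh => x) wins at each position.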
In order to query both versions, you can use should clauses on both:
GET my_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "field_name": "xampoo"
          }
        },
        {
          "match": {
            "field_name.mapped": "shampoo"
          }
        }
      ]
    }
  }
}
Hope this helps you!!

Elasticsearch Analysis token filter doesn't capture pattern

I made a custom analyzer in my test index:
PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "myFilter": {
          "type": "pattern_capture",
          "patterns": ["\\d+(,\\d+)*(\\.\\d+)?[%$€£¥]?"],
          "preserve_original": 1
        }
      },
      "analyzer": {
        "myAnalyzer": {
          "type": "custom",
          "tokenizer": "myTokenizer",
          "filters": ["myFilter"]
        }
      },
      "tokenizer": {
        "myTokenizer": {
          "type": "pattern",
          "pattern": "([^\\p{N}\\p{L}%$€£¥##'\\-&]+)|((?<=[^\\p{L}])['\\-&]|^['\\-&]|['\\-&](?=[^\\p{L}])|['\\-&]$)|((?<=[^\\p{N}])[$€£¥%]|^[$€£¥%]|(?<=[$€£¥%])(?=\\d))"
        }
      }
    }
  }
}
It is supposed to emit numbers like 123,234.56$ as a single token, but when such a number is provided it emits 3 tokens: 123, 234, 56$.
A sample of the failing test query:
GET test/Stam/_termvector?pretty=true
{
  "doc": {
    "Stam": {
      "fld": "John Doe",
      "txt": "100,234.54%"
    }
  },
  "per_field_analyzer": {
    "Stam.txt": "myAnalyzer"
  },
  "fields": ["Stam.txt"],
  "offsets": true,
  "positions": false,
  "payloads": false,
  "term_statistics": false,
  "field_statistics": false
}
Can anyone figure out the reason?
In every other case ',' and '.' definitely are delimiters; that is why I added a filter for that purpose, but unfortunately it doesn't work.
Thanks in advance.
The answer is quite simple: a token filter cannot combine tokens by design. This has to be done with char filters, which are applied to the character stream before the tokenizer starts splitting it into tokens.
I only had to make sure that the custom tokenizer would not split my tokens.
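To illustrate the principle with a simplified, hypothetical example (not the exact tokenizer from the question): a pattern_replace char filter can remove the thousands separators between digits before tokenization, so that even a plain whitespace tokenizer keeps the number in one piece:
GET _analyze
{
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "(?<=\\d),(?=\\d)",
      "replacement": ""
    }
  ],
  "tokenizer": "whitespace",
  "text": "price was 100,234.54% higher"
}
This returns the tokens price, was, 100234.54% and higher; the comma is already gone by the time the tokenizer runs.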
