Indexing suggestions using an analyzer - Elasticsearch

Good day:
I'm trying to figure out how to index suggestions without manually splitting my text on a delimiter before storing it in the CompletionField:
List<string> inputs = new List<string>() {
    facility.City,
    facility.State,
    facility.ZipCode
};
inputs.AddRange(facility.Name.Split(' '));
inputs.AddRange(facility.Address.Split(' '));
inputs.AddRange(facilityType.Description.Split(' '));
var completionField = new CompletionField()
{
    Input = inputs.AsEnumerable<string>()
};
return completionField;
This isn't an optimal way of doing this, because I would rather let the analyzer handle the splitting instead of doing it myself before indexing. Is there a way to send the entire text to Elasticsearch and let it analyze the text and store it in the completion field at index time, or something else?
Updated
I've updated my code to index the entire text and to use the default analyzer; however, this is what was indexed, and the analyzer isn't breaking the text up:
"suggest": {
"input": [
"Reston",
"Virginia",
"20190",
"Facility 123456",
"22100 Sunset Hills Rd suite 150*"
]
},
My code:
List<string> inputs = new List<string>() {
    facility.City,
    facility.State,
    facility.ZipCode
};
inputs.Add(facility.Name);
inputs.Add(facility.Address);
if (facility.Description != null && facility.Description != "")
{
    inputs.Add(facility.Description);
}
var completionField = new CompletionField()
{
    Input = inputs.AsEnumerable<string>()
};
return completionField;
My mapping for the property:
"suggest": {
"type": "completion",
"analyzer": "simple",
"preserve_separators": true,
"preserve_position_increments": true,
"max_input_length": 50
},
But it's still not breaking up my input.

Just send all the text in input and specify a custom analyzer that uses the whitespace tokenizer and nothing else.
EDIT
First, add the analyzer:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "suggest": {
          "type": "completion",
          "analyzer": "my_custom_analyzer"
        },
        "title": {
          "type": "keyword"
        }
      }
    }
  }
}
Then specify it on the suggest field, as in the mappings above.
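To double-check what the completion field will see at index time (a minimal sketch assuming the my_index definition above), you can run the _analyze API with the custom analyzer against a full address string:
POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "22100 Sunset Hills Rd suite 150"
}
The response lists one lowercased token per whitespace-separated word, which is how the analyzer will break up whatever you put into input.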


.NET Elastic Search Create NGram Index

I am trying to set up elastic search as a prototype for a project that might use it.
The project needs to look through the contents of datasets and make them searchable.
What I have right now is the following:
Index documents
Search through all fields of the indexed documents for the full text
Missing right now is:
Search through all fields of the indexed documents for partial text
That means I can find this sample dataset from my database by searching for e.g. "Sofia", "sofia", "anderson" or "canada", but not by searching for "canad".
{
  "id": 46,
  "firstName": "Sofia",
  "lastName": "Anderson",
  "country": "Canada"
}
I am creating my index using the "Elastic.Clients.Elasticsearch" NuGet package.
I am trying to create an index with an NGram tokenizer and apply it to all fields.
That does not seem to be working.
This is the code that I use to create the Index:
Client.Indices.Create(IndexName, c => c
    .Settings(s => s
        .Analysis(a => a
            .Tokenizer(t => t.Add(TokenizerName, new Tokenizer(new TokenizerDefinitions(new Dictionary<string, ITokenizerDefinition>() { { TokenizerName, ngram } }))))
            .Analyzer(ad => ad
                .Custom(AnalyzerName, ca => ca
                    .Tokenizer(TokenizerName)
                )
            )
        )
    )
    .Mappings(m => m
        .AllField(all => all
            .Enabled()
            .Analyzer(AnalyzerName)
            .SearchAnalyzer(AnalyzerName)
        )
    )
);
with
private string TokenizerName => "my_tokenizer";
private string AnalyzerName => "my_analyzer";
and
var ngram = new NGramTokenizer() { MinGram = 3, MaxGram = 3, TokenChars = new List<TokenChar>() { TokenChar.Letter }, CustomTokenChars = "" };
With this code I get the behaviour described above.
Is there any error in my code?
Am I missing something?
Do you need further information?
Thanks in advance
Paul
I did not find a way to get this running in .NET.
However, what worked for me was to create the index using this API call:
URL:
https://{{elasticUrl}}/{{indexName}}
Body:
{
  "mappings": {
    "properties": {
      "firstName": {
        "type": "text",
        "analyzer": "index_ngram",
        "search_analyzer": "search_ngram"
      },
      "lastName": {
        "type": "text",
        "analyzer": "index_ngram",
        "search_analyzer": "search_ngram"
      },
      "country": {
        "type": "text",
        "analyzer": "index_ngram",
        "search_analyzer": "search_ngram"
      }
    }
  },
  "settings": {
    "index": {
      "max_ngram_diff": 50
    },
    "analysis": {
      "filter": {
        "ngram_filter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 25
        }
      },
      "analyzer": {
        "index_ngram": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "ngram_filter", "lowercase" ]
        },
        "search_ngram": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": "lowercase"
        }
      }
    }
  }
}
This produces n-gram terms with lengths from 2 to 25 for the fields firstName, lastName, and country.
I hope this helps someone in the future :)
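A quick way to verify the partial-match behaviour (a sketch assuming the mapping above; the index name placeholder is the one from the URL) is a multi_match query for a fragment such as "canad":
POST {{indexName}}/_search
{
  "query": {
    "multi_match": {
      "query": "canad",
      "fields": [ "firstName", "lastName", "country" ]
    }
  }
}
Because search_ngram only lowercases the query while index_ngram has already expanded each field into 2-25 character grams, the fragment matches the stored grams of "Canada".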

ElasticSearch: preserve_position_increments not working

According to the docs
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-completion.html
preserve_position_increments=false is supposed to make consecutive keywords in a string searchable. But for me it's not working. Is this a bug? Steps to reproduce in Kibana:
PUT /example-index/
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_doc": {
      "properties": {
        "example-suggest-field": {
          "type": "completion",
          "analyzer": "stop",
          "preserve_position_increments": false,
          "max_input_length": 50
        }
      }
    }
  }
}
PUT /example-index/_doc/1
{
  "example-suggest-field": [
    {
      "input": "Nevermind Nirvana",
      "weight": 10
    }
  ]
}
POST /example-index/_search
{
  "suggest": {
    "bib-suggest": {
      "prefix": "nir",
      "completion": {
        "field": "example-suggest-field"
      }
    }
  }
}
POST /example-index/_search
{
  "suggest": {
    "bib-suggest": {
      "prefix": "nev",
      "completion": {
        "field": "example-suggest-field"
      }
    }
  }
}
If so, I will file a bug report.
It's not a bug. preserve_position_increments is only useful when you are removing stopwords and would like to search for the token coming after the stopword (e.g. search for Beat and find The Beatles).
In your case, you should probably index ["Nevermind", "Nirvana"] instead, i.e. an array of tokens.
If you try indexing "The Nirvana" instead, you'll find it by searching for nir.
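As a minimal sketch of that array-of-tokens suggestion (document id and weight are arbitrary here), each array element becomes its own suggestion input:
PUT /example-index/_doc/2
{
  "example-suggest-field": [
    {
      "input": ["Nevermind", "Nirvana"],
      "weight": 10
    }
  ]
}
With this document indexed, both the nev and the nir prefix queries above return a suggestion.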

Mapping international character to multiple options

What I want to achieve is the ability for people to search for individuals without being language-aware, while not punishing those who are. What I mean is:
Given I build an index:
Jorgensen
Jörgensen
Jørgensen
I want to be able to allow such conversions:
ö to o
ö to oe
ø to o
ø to oe
so if someone searches for:
QUERY      | RESULT (I include only IDs, but it would be full records in reality)
Jorgensen  | 1, 2, 3
Jörgensen  | 1, 2
Jørgensen  | 1, 3
Joergensen | 2, 3
Starting with that, I tried to create an index analyzer and char filter like this:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "ö => o",
            "ö => oe"
          ]
        }
      }
    }
  }
}
But that is invalid, because it tries to map the same character twice.
What am I missing? Do I need multiple analyzers? Any direction would be appreciated.
Since a custom mapping isn't enough in your case, as the comments above show, let's play with your data and character normalization.
In your case, normalization using unicodedata isn't enough because of the ø and oe conversions. Example:
import unicodedata

def strip_accents(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

body_matches = [
    u'Jorgensen',
    u'Jörgensen',
    u'Jørgensen',
    u'Joergensen',
]
for b in body_matches:
    print b, strip_accents(b)

>>>> Jorgensen Jorgensen
>>>> Jörgensen Jorgensen
>>>> Jørgensen Jørgensen
>>>> Joergensen Joergensen
So, we need a custom translation. For now I've only set the characters you showed, but feel free to complete the list.
accented_letters = {
    u'ö' : [u'o', u'oe'],
    u'ø' : [u'o', u'oe'],
}
Then, we can normalize words and store them in a special property, body_normalized for instance, and index them as a field of your Elasticsearch records.
Once they are inserted, you could perform two types of search:
exact search: user input isn't normalized and the Elasticsearch query searches against the body field, which isn't normalized either.
similar search: user input is normalized and we search against the body_normalized field.
Let's see an example:
body_matches = [
    u'Jorgensen',
    u'Jörgensen',
    u'Jørgensen',
    u'Joergensen',
]

print "------EXACT MATCH------"
for body_match in body_matches:
    elasticsearch_query = {
        "query": {
            "match": {
                "body": body_match
            }
        }
    }
    es_kwargs = {
        "doc_type": "your_type",
        "index": 'your_index',
        "body": elasticsearch_query
    }
    res = es.search(**es_kwargs)
    print body_match, " MATCHING BODIES=", res['hits']['total']
    for r in res['hits']['hits']:
        print "-", r['_source'].get('body', '')

print "\n------SIMILAR MATCHES------"
for body_match in body_matches:
    body_match = normalize_word(body_match)
    elasticsearch_query = {
        "query": {
            "match": {
                "body_normalized": body_match
            }
        }
    }
    es_kwargs = {
        "doc_type": "your_type",
        "index": 'your_index',
        "body": elasticsearch_query
    }
    res = es.search(**es_kwargs)
    print body_match, " MATCHING NORMALIZED BODIES=", res['hits']['total']
    for r in res['hits']['hits']:
        print "-", r['_source'].get('body', '')
You can see a running example in this notebook
After playing with it quite a bit more, here is the approach I came up with so far:
We cannot store multiple representations of the data in one field. That makes sense, so instead, as was suggested, we store multiple representations of the same field in sub-fields. I did everything with Kibana and/or Postman.
Create the index with the following settings:
PUT surname
{
  "mappings": {
    "individual": {
      "_all": { "enabled": false },
      "properties": {
        "id": { "type": "integer" },
        "name": {
          "type": "string",
          "analyzer": "not_folded",
          "fields": {
            "double": {
              "type": "string",
              "analyzer": "double_folder"
            },
            "single": {
              "type": "string",
              "analyzer": "folded"
            }
          }
        }
      }
    }
  },
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "double_folder": {
          "tokenizer": "icu_tokenizer",
          "filter": [
            "icu_folding"
          ],
          "char_filter": [
            "my_char_filter"
          ]
        },
        "folded": {
          "tokenizer": "icu_tokenizer",
          "filter": [
            "icu_folding"
          ]
        },
        "not_folded": {
          "tokenizer": "icu_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "ö => oe"
          ]
        }
      }
    }
  }
}
In this case it stores each name in 3 different formats:
The way it was entered
Folded to multiple symbols where I want it to
Folded to a single symbol
Setting the number of shards to one is the important bit for testing, since having multiple shards doesn't work well when there is not enough data. Read more in Relevance is broken.
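To see what each analyzer actually produces (a hedged sketch; it assumes the surname index above and the ICU analysis plugin being installed), the _analyze API can be called per analyzer:
POST surname/_analyze
{
  "analyzer": "double_folder",
  "text": "Jörgensen"
}
Swapping in folded or not_folded shows the single-symbol and as-entered variants of the same name.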
Then we can add test data to our index:
POST surname/individual/_bulk
{ "index": { "_id": 1}}
{ "id": "1", "name": "Matt Jorgensen"}
{ "index": { "_id": 2}}
{ "id": "2", "name": "Matt Jörgensen"}
{ "index": { "_id": 3}}
{ "id": "3", "name": "Matt Jørgensen"}
{ "index": { "_id": 4}}
{ "id": "4", "name": "Matt Joergensen"}
All that is left is to test whether we get the proper response:
GET surname/_search
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "Jorgensen",
      "fields": [ "name", "name.double", "name.single" ]
    }
  }
}

Elasticsearch Analysis token filter doesn't capture pattern

I made a custom analyzer in my test index:
PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "myFilter": {
          "type": "pattern_capture",
          "patterns": ["\\d+(,\\d+)*(\\.\\d+)?[%$€£¥]?"],
          "preserve_original": 1
        }
      },
      "analyzer": {
        "myAnalyzer": {
          "type": "custom",
          "tokenizer": "myTokenizer",
          "filters": ["myFilter"]
        }
      },
      "tokenizer": {
        "myTokenizer": {
          "type": "pattern",
          "pattern": "([^\\p{N}\\p{L}%$€£¥##'\\-&]+)|((?<=[^\\p{L}])['\\-&]|^['\\-&]|['\\-&](?=[^\\p{L}])|['\\-&]$)|((?<=[^\\p{N}])[$€£¥%]|^[$€£¥%]|(?<=[$€£¥%])(?=\\d))"
        }
      }
    }
  }
}
It is supposed to emit numbers like 123,234.56$ as a single token, but when such a number is provided it spits out 3 tokens: 123, 234 and 56$.
A sample of the failing test query:
GET test/Stam/_termvector?pretty=true
{
  "doc": {
    "Stam": {
      "fld": "John Doe",
      "txt": "100,234.54%"
    }
  },
  "per_field_analyzer": {
    "Stam.txt": "myAnalyzer"
  },
  "fields": ["Stam.txt"],
  "offsets": true,
  "positions": false,
  "payloads": false,
  "term_statistics": false,
  "field_statistics": false
}
Can anyone figure out what the reason is?
In every other case ',' and '.' are definitely delimiters, which is why I added the filter for this purpose, but unfortunately it doesn't work.
Thanks in advance.
The answer is quite simple: a token filter cannot combine tokens by design. This has to be done with char filters, which are applied to the character stream before the tokenizer starts splitting it into tokens.
I only had to make sure that the custom tokenizer would not split my tokens.
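A quick way to confirm that the split happens in the tokenizer, before the pattern_capture filter ever runs (a minimal sketch against the test index above), is the _analyze API:
POST test/_analyze
{
  "analyzer": "myAnalyzer",
  "text": "100,234.54%"
}
The output already contains the three separate tokens, which shows they are produced by the tokenizer and cannot be glued back together by a token filter.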

How to index both a string and its reverse?

I'm looking for a way to analyze the string "abc123" as ["abc123", "321cba"]. I've looked at the reverse token filter, but that only gets me ["321cba"]. Documentation on this filter is pretty sparse, only stating that
"A token filter of type reverse ... simply reverses each token."
(see http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-reverse-tokenfilter.html).
I've also tinkered with using the keyword_repeat filter, which gets me two instances. I don't know if that's useful, but for now all it does is reverse both instances.
How can I use the reverse token filter but keep the original token as well?
My analyzer:
{ "settings" : { "analysis" : {
"analyzer" : {
"phone" : {
"type" : "custom"
,"char_filter" : ["strip_non_numeric"]
,"tokenizer" : "keyword"
,"filter" : ["standard", "keyword_repeat", "reverse"]
}
}
,"char_filter" : {
"strip_non_numeric" : {
"type" : "pattern_replace"
,"pattern" : "[^0-9]"
,"replacement" : ""
}
}
}}}
Make and put an analyzer to reverse a string (say reverse_analyzer).
PUT index_name
{
  "settings": {
    "analysis": {
      "analyzer": {
        "reverse_analyzer": {
          "type": "custom",
          "char_filter": [
            "strip_non_numeric"
          ],
          "tokenizer": "keyword",
          "filter": [
            "standard",
            "keyword_repeat",
            "reverse"
          ]
        }
      },
      "char_filter": {
        "strip_non_numeric": {
          "type": "pattern_replace",
          "pattern": "[^0-9]",
          "replacement": ""
        }
      }
    }
  }
}
Then, for a field (say phone_no), use the following mapping (create a type and append the mapping for the phone field):
PUT index_name/type_name/_mapping
{
  "type_name": {
    "properties": {
      "phone_no": {
        "type": "string",
        "fields": {
          "reverse": {
            "type": "string",
            "analyzer": "reverse_analyzer"
          }
        }
      }
    }
  }
}
So, phone_no is a multi-field, which will store a string and its reverse. If you index
phone_no: 911220
then in Elasticsearch there will be fields
phone_no: 911220 and phone_no.reverse: 022119
so you can search and filter on either the reversed or the non-reversed field.
Hope this helps.
I don't believe you can do this directly, as I am unaware of any way to get the reverse token filter to also output the original.
However, you could use the fields parameter to index both the original and the reversed at the same time with no additional coding. You would then search both fields.
So let's say your field was called phone_number:
"phone_number": {
"type": "string",
"fields": {
"reverse": { "type": "string", "index": "phone" }
}
}
In this case we're indexing using the default analyzer (assume standard) and also indexing into reverse with your custom analyzer phone, which reverses. You then issue your queries against both fields.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_multi_fields.html
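For instance (a hedged sketch; the index name and sample number are assumptions, and the field names follow the mapping snippet above), a single multi_match query can cover both the original and the reversed sub-field:
POST index_name/_search
{
  "query": {
    "multi_match": {
      "query": "911220",
      "fields": [ "phone_number", "phone_number.reverse" ]
    }
  }
}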
I'm not sure it's possible to do this using the built-in set of token filters. I would recommend creating your own plugin. There is the ICU Analysis plugin, supported by the Elasticsearch team, that you can use as an example.
I wound up using the following two char_filters in my analyzer. It's an ugly abuse of regex, but it seems to work. It is limited to the first 20 numeric characters, but in my use case that is acceptable.
First it groups all numeric characters, then explicitly rebuilds the string with its own (numeric-only!) reverse. The space in the center of the replacement pattern then causes the tokenizer to split it into two tokens - the original and the reverse.
,"char_filter" : {
"strip_non_numeric" : {
"type" : "pattern_replace"
,"pattern" : "[^0-9]"
,"replacement" : ""
}
,"dupe_and_reverse" : {
"type" : "pattern_replace"
,"pattern" : "([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)"
,"replacement" : "$1$2$3$4$5$6$7$8$9$10$11$12$13$14$15$16$17$18$19$20 $20$19$18$17$16$15$14$13$12$11$10$9$8$7$6$5$4$3$2$1"
}
}
