.NET Elastic Search Create NGram Index

I am trying to set up Elasticsearch as a prototype for a project that might use it.
The project needs to look through the contents of datasets and make them searchable.
What I have right now is the following:
Index documents
Search through all fields of the indexed documents for the full text
Missing right now is:
Search through all fields of the indexed documents for partial text
That means I can find this sample dataset from my database by searching for e.g. "Sofia", "sofia", "anderson" or "canada", but not by searching for "canad".
{
"id": 46,
"firstName": "Sofia",
"lastName": "Anderson",
"country": "Canada" }
I am creating my index using the "Elastic.Clients.Elasticsearch" NuGet package.
I try to create an index with an NGram tokenizer and apply it to all fields.
However, that does not seem to be working.
This is the code that I use to create the Index:
Client.Indices.Create(IndexName, c => c
.Settings(s => s
.Analysis(a => a
.Tokenizer(t => t.Add(TokenizerName, new Tokenizer(new TokenizerDefinitions(new Dictionary<string, ITokenizerDefinition>() { { TokenizerName, ngram } }))))
.Analyzer(ad => ad
.Custom(AnalyzerName, ca => ca
.Tokenizer(TokenizerName)
)
)
)
)
.Mappings(m => m
.AllField(all => all
.Enabled()
.Analyzer(AnalyzerName)
.SearchAnalyzer(AnalyzerName)
)
)
);
with
private string TokenizerName => "my_tokenizer";
private string AnalyzerName => "my_analyzer";
and
var ngram = new NGramTokenizer() { MinGram = 3, MaxGram = 3, TokenChars = new List<TokenChar>() { TokenChar.Letter }, CustomTokenChars = "" };
With this code I get the behaviour described above.
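One way to check whether the analyzer actually made it into the index settings is to run it through the _analyze API ({{indexName}} stands for the index created above):
POST {{indexName}}/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Canada"
}
If the settings were applied, "Canada" should come back as the trigrams "Can", "ana", "nad" and "ada"; if the analyzer is unknown, the request fails instead.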
Is there any error in my code?
Am I missing something?
Do you need further information?
Thanks in advance
Paul

I did not find a way to get this running in .NET.
However, what worked for me was to create the index using this API call:
URL:
https://{{elasticUrl}}/{{indexName}}
Body:
{
"mappings": {
"properties": {
"firstName": {
"type":"text",
"analyzer":"index_ngram",
"search_analyzer":"search_ngram"
},
"lastName": {
"type":"text",
"analyzer":"index_ngram",
"search_analyzer":"search_ngram"
},
"country": {
"type":"text",
"analyzer":"index_ngram",
"search_analyzer":"search_ngram"
}
}
},
"settings": {
"index": {
"max_ngram_diff":50
},
"analysis": {
"filter": {
"ngram_filter": {
"type": "ngram",
"min_gram": 2,
"max_gram": 25
}
},
"analyzer": {
"index_ngram": {
"type": "custom",
"tokenizer": "keyword",
"filter": [ "ngram_filter", "lowercase" ]
},
"search_ngram": {
"type": "custom",
"tokenizer": "keyword",
"filter": "lowercase"
}
}
}
}
}
This produces ngrams with term lengths from 2 to 25 for the fields firstName, lastName, and country.
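With that index in place, a plain match query is enough to verify the partial matching, for example against the country field of the sample document ({{indexName}} again stands for the index; the search analyzer only lowercases, so "canad" matches one of the stored 2-25-grams of "Canada"):
POST {{indexName}}/_search
{
  "query": {
    "match": {
      "country": "canad"
    }
  }
}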
I hope this helps someone in the future :)

Related

How to perform partial word searches on a text input with Elasticsearch?

I have a query to search for records in the following format: TR000002_1_2020.
Users should be able to search for results in the following ways:
TR000002 or 2_1_2020 or TR000002_1_2020 or 2020. I am using Elasticsearch 6.8, so I cannot use the built-in search-as-you-type field introduced in Elasticsearch 7. Thus, I figured either wildcard searches or ngrams might best suit what I need. Here are my two approaches and why they did not work.
Wildcard
Property mapping:
.Text(t => t
.Name(tr => tr.TestRecordId)
)
Query:
m => m.Wildcard(w => w
.Field(tr => tr.TestRecordId)
.Value($"*{form.TestRecordId}*")
),
This works, but it is case-sensitive: if the user searches with tr000002_1_2020, then no results are returned (since the t and r are lowercased in the query).
ngram (search as you type equivalent)
Create a custom ngram analyzer
.Analysis(a => a
.Analyzers(aa => aa
.Custom("autocomplete", ca => ca
.Tokenizer("autocomplete")
.Filters(new string[] {
"lowercase"
})
)
.Custom("autocomplete_search", ca => ca
.Tokenizer("lowercase")
)
)
.Tokenizers(t => t
.NGram("autocomplete", e => e
.MinGram(2)
.MaxGram(16)
.TokenChars(new TokenChar[] {
TokenChar.Letter,
TokenChar.Digit,
TokenChar.Punctuation,
TokenChar.Symbol
})
)
)
)
Property Mapping
.Text(t => t
.Name(tr => tr.TestRecordId)
.Analyzer("autocomplete")
.SearchAnalyzer("autocomplete_search")
)
Query
m => m.Match(ma => ma
.Field(tr => tr.TestRecordId)
.Query(form.TestRecordId)
),
As described in this answer, this does not work since the tokenizer splits the characters up into elements like 20 and 02 and 2020, so as a result my queries returned all documents in my index that contained 2020, such as TR000002_1_2020 and TR000008_1_2020 and TR000003_6_2020.
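For illustration, running the reference through the autocomplete analyzer with the _analyze API (my-index is a stand-in for whatever index the mapping was applied to) produces a long list of grams, among them the 20, 02 and 2020 tokens mentioned above:
POST my-index/_analyze
{
  "analyzer": "autocomplete",
  "text": "TR000002_1_2020"
}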
What's the best utilization of Elasticsearch to allow my desired search behavior? I've seen query string used as well. Thanks!
Here is a simple way to address your requirements (I hope):
We use a pattern_replace char filter to remove the fixed part of the reference (TR000...).
We use a split tokenizer to split the reference on the "_" character.
We use a match_phrase query to ensure that the fragments of the reference are matched in order.
With this analysis chain, the reference TR000002_1_2020 produces the tokens ["2", "1", "2020"]. So it will match the queries ["TR000002_1_2020", "TR000002 1 2020", "2_1_2020", "1_2020"], but it will not match 3_1_2020 or 2_2_2020.
Here is an example of the mapping and analysis. It's not in NEST, but I think you will be able to make the translation.
PUT pattern_split_demo
{
"settings": {
"analysis": {
"char_filter": {
"replace_char_filter": {
"type": "pattern_replace",
"pattern": "^TR0*",
"replacement": ""
}
},
"tokenizer": {
"split_tokenizer": {
"type": "simple_pattern_split",
"pattern": "_"
}
},
"analyzer": {
"split_analyzer": {
"tokenizer": "split_tokenizer",
"filter": [
"lowercase"
],
"char_filter": [
"replace_char_filter"
]
}
}
}
},
"mappings": {
"properties": {
"content": {
"type": "text",
"analyzer": "split_analyzer"
}
}
}
}
POST pattern_split_demo/_analyze
{
"text": "TR000002_1_2020",
"analyzer": "split_analyzer"
} => ["2", "1", "2020"]
POST pattern_split_demo/_doc?refresh=true
{
"content": "TR000002_1_2020"
}
POST pattern_split_demo/_search
{
"query": {
"match_phrase": {
"content": "TR000002_1_2020"
}
}
}
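To check the behaviour described above, the same match_phrase query for a non-matching reference such as 3_1_2020 should return no hits:
POST pattern_split_demo/_search
{
  "query": {
    "match_phrase": {
      "content": "3_1_2020"
    }
  }
}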

indexing suggestions using analyzer

Good day:
I'm trying to figure out how to index suggestions without splitting my text on a delimiter and storing it in the CompletionField:
List<string> inputs = new List<string>() {
facility.City,
facility.State,
facility.ZipCode
};
inputs.AddRange(facility.Name.Split(' '));
inputs.AddRange(facility.Address.Split(' '));
inputs.AddRange(facilityType.Description.Split(' '));
var completionField = new CompletionField()
{
Input = inputs.AsEnumerable<string>()
};
return completionField;
This isn't an optimal way of doing this; I would rather let the analyzer handle it, as opposed to splitting the text myself and then indexing it. Is there a way to send the entire text to Elasticsearch and let it analyze the text and store it in the completion field at index time, or something else?
Update
I've updated my code to index the entire text and to use the default analyzer. However, this is what was indexed, and the analyzer isn't breaking the text up:
"suggest": {
"input": [
"Reston",
"Virginia",
"20190",
"Facility 123456",
"22100 Sunset Hills Rd suite 150*"
]
},
My code:
List<string> inputs = new List<string>() {
facility.City,
facility.State,
facility.ZipCode
};
inputs.Add(facility.Name);
inputs.Add(facility.Address);
if (facility.Description != null && facility.Description != "")
{
inputs.Add(facility.Description);
}
var completionField = new CompletionField()
{
Input = inputs.AsEnumerable<string>()
};
return completionField;
My mapping for the property:
"suggest": {
"type": "completion",
"analyzer": "simple",
"preserve_separators": true,
"preserve_position_increments": true,
"max_input_length": 50
},
But it still isn't breaking up my input.
Just send all the text as the input and specify a custom analyzer that uses the whitespace tokenizer and nothing else.
EDIT
First add the analyzer
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"_doc" : {
"properties" : {
"suggest" : {
"type" : "completion",
"analyzer": "my_custom_analyzer"
},
"title" : {
"type": "keyword"
}
}
}
}
}
Then specify it on the suggest field.
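As a rough sketch using the values from the question and the my_index mapping above (the suggest name facility_suggest is arbitrary), the document is indexed with the whole strings as input, and suggestions are retrieved with a prefix query; because the whitespace tokenizer plus lowercase filter is applied, a prefix such as "rest" finds the "Reston" entry:
PUT my_index/_doc/1
{
  "title": "Facility 123456",
  "suggest": {
    "input": [
      "Reston",
      "Virginia",
      "20190",
      "Facility 123456",
      "22100 Sunset Hills Rd suite 150"
    ]
  }
}

POST my_index/_search
{
  "suggest": {
    "facility_suggest": {
      "prefix": "rest",
      "completion": {
        "field": "suggest"
      }
    }
  }
}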

Mapping international character to multiple options

What I want to achieve is the ability for people to search for individuals without being language-aware, while not punishing those who are. What I mean is:
Given I build index:
Jorgensen
Jörgensen
Jørgensen
I want to be able to allow conversions such as:
ö to o
ö to oe
ø to o
ø to oe
so if someone searches for:
QUERY      | RESULT (I include only IDs, but it would be full records in reality)
Jorgensen  | 1, 2, 3
Jörgensen  | 1, 2
Jørgensen  | 1, 3
Joergensen | 2, 3
Starting with that, I tried to create an index with the following analyzer and char filter:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "keyword",
"char_filter": [
"my_char_filter"
]
}
},
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"ö => o",
"ö => oe"
]
}
}
}
}
}
But that is invalid, because it tries to map the same character twice.
What am I missing? Do I need multiple analyzers? Any direction would be appreciated.
Since a custom mapping isn't enough in your case, as the comments above show, let's play with your data and character normalization.
In your case, normalization using unidecode isn't enough due to the ø and oe conversions. Example:
import unicodedata

def strip_accents(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

body_matches = [
    u'Jorgensen',
    u'Jörgensen',
    u'Jørgensen',
    u'Joergensen',
]
for b in body_matches:
    print b, strip_accents(b)

>>>> Jorgensen Jorgensen
>>>> Jörgensen Jorgensen
>>>> Jørgensen Jørgensen
>>>> Joergensen Joergensen
So we need a custom translation. For now I've only included the characters you showed, but feel free to complete the list.
accented_letters = {
    u'ö' : [u'o', u'oe'],
    u'ø' : [u'o', u'oe'],
}
Then we can normalize words and store them in a special property, body_normalized for instance, and index them as a field of your Elasticsearch records.
Once they are inserted, you can perform two types of search:
exact search: the user input isn't normalized and the Elasticsearch query searches against the body field, which isn't normalized either.
similar search: the user input is normalized and we search against the body_normalized field.
Let's see an example:
body_matches = [
    u'Jorgensen',
    u'Jörgensen',
    u'Jørgensen',
    u'Joergensen',
]

print "------EXACT MATCH------"
for body_match in body_matches:
    elasticsearch_query = {
        "query": {
            "match": {
                "body": body_match
            }
        }
    }
    es_kwargs = {
        "doc_type": "your_type",
        "index": 'your_index',
        "body": elasticsearch_query
    }
    res = es.search(**es_kwargs)
    print body_match, " MATCHING BODIES=", res['hits']['total']
    for r in res['hits']['hits']:
        print "-", r['_source'].get('body', '')

print "\n------SIMILAR MATCHES------"
for body_match in body_matches:
    body_match = normalize_word(body_match)
    elasticsearch_query = {
        "query": {
            "match": {
                "body_normalized": body_match
            }
        }
    }
    es_kwargs = {
        "doc_type": "your_type",
        "index": 'your_index',
        "body": elasticsearch_query
    }
    res = es.search(**es_kwargs)
    print body_match, " MATCHING NORMALIZED BODIES=", res['hits']['total']
    for r in res['hits']['hits']:
        print "-", r['_source'].get('body', '')
You can see a running example in this notebook
After playing with it quite a bit more, I came up with the following approach:
We cannot store multiple representations of data in one field. That does make sense, so instead, as was suggested, we store multiple representations of the same field in sub-fields. I did everything with Kibana and/or Postman.
Create the index with the following settings:
PUT surname
{
"mappings": {
"individual": {
"_all": { "enabled": false },
"properties": {
"id": { "type": "integer" },
"name" : {
"type": "string",
"analyzer": "not_folded",
"fields": {
"double": {
"type": "string",
"analyzer": "double_folder"
},
"single": {
"type": "string",
"analyzer": "folded"
}
}
}
}
}
},
"settings": {
"number_of_shards": 1,
"analysis": {
"analyzer": {
"double_folder": {
"tokenizer": "icu_tokenizer",
"filter" : [
"icu_folding"
],
"char_filter": [
"my_char_filter"
]
},
"folded": {
"tokenizer": "icu_tokenizer",
"filter": [
"icu_folding"
]
},
"not_folded": {
"tokenizer": "icu_tokenizer",
"filter": [
"lowercase"
]
}
},
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"ö => oe"
]
}
}
}
}
}
In this case it stores all names in 3 different formats:
The way it was entered
Folded to multiple symbols where I want it to
Folded to a single symbol
Setting the number of shards to one is important for testing, since having multiple shards doesn't work well when there is not enough data. Read more in "Relevance is broken".
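A quick way to see the three representations side by side is to run the same name through each analyzer with the _analyze API (the exact _analyze syntax varies a little across Elasticsearch versions; on recent ones it looks like this, and you can swap the analyzer name for folded and not_folded). With the settings above, double_folder should return joergensen, folded should return jorgensen, and not_folded should return jörgensen:
GET surname/_analyze
{
  "analyzer": "double_folder",
  "text": "Jörgensen"
}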
Then we can add test data to our index:
POST surname/individual/_bulk
{ "index": { "_id": 1}}
{ "id": "1", "name": "Matt Jorgensen"}
{ "index": { "_id": 2}}
{ "id": "2", "name": "Matt Jörgensen"}
{ "index": { "_id": 3}}
{ "id": "3", "name": "Matt Jørgensen"}
{ "index": { "_id": 4}}
{ "id": "4", "name": "Matt Joergensen"}
All that is left is to test whether we get the proper response:
GET surname/_search
{
"query": {
"multi_match": {
"type": "most_fields",
"query": "Jorgensen",
"fields": [ "name","name.double", "name.single" ]
}
}
}

ElasticSearch 5 Sort by Keyword Field Case Insensitive

We are using ElasticSearch 5.
I have a field city using a custom analyzer and the following mapping.
Analyzer
"analysis": {
"analyzer": {
"lowercase_analyzer": {
"filter": [
"standard",
"lowercase",
"trim"
],
"type": "custom",
"tokenizer": "keyword"
}
}
}
Mapping
"city": {
"type": "text",
"analyzer": "lowercase_analyzer"
}
I am doing this so that I can do a case-insensitive sort on the city field. Here is an example query that I am trying to run:
{
"query": {
"term": {
"email": {
"value": "some_email#test.com"
}
}
},
"sort": [
{
"city": {
"order": "desc"
}
}
]
}
Here is the error I am getting:
"Fielddata is disabled on text fields by default. Set fielddata=true
on [city] in order to load fielddata in memory by uninverting the
inverted index. Note that this can however use significant memory."
I don't want to turn on FieldData and incur a performance hit in ElasticSearch. I would like to have a Keyword field that is not case sensitive, so that I can perform more meaningful aggregations and sorts on it. Is there no way to do this?
Yes, there is a way to do this, using multi_fields.
In Elasticsearch 5.0 onwards, string field types were split out into two separate types, text field types that are analyzed and can be used for search, and keyword field types that are not analyzed and are suited to use for sorting, aggregations and exact value matches.
With dynamic mapping in Elasticsearch 5.0 (i.e. letting Elasticsearch infer the type that a document property should be mapped to), a JSON string property is mapped to a text field type, with a "keyword" sub-field that is mapped as a keyword field type with the setting ignore_above: 256.
With NEST 5.x automapping, a string property on your POCO will be automapped in the same way that dynamic mapping in Elasticsearch maps it, as per above. For example, given the following document
public class Document
{
public string Property { get; set; }
}
automapping it
var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
var defaultIndex = "default-index";
var connectionSettings = new ConnectionSettings(pool)
.DefaultIndex(defaultIndex);
var client = new ElasticClient(connectionSettings);
client.CreateIndex(defaultIndex, c => c
.Mappings(m => m
.Map<Document>(mm => mm
.AutoMap()
)
)
);
produces
{
"mappings": {
"document": {
"properties": {
"property": {
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
},
"type": "text"
}
}
}
}
}
You can now use property for sorting using Field(f => f.Property.Suffix("keyword")). Take a look at Field Inference for more examples.
keyword field types have doc_values enabled by default, which means that a columnar data structure is built at index time and this is what provides efficient sorting and aggregations.
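For example, a sort on the keyword sub-field (a raw JSON sketch using the default-index and property names from the automapping example above) looks like this:
POST default-index/_search
{
  "sort": [
    {
      "property.keyword": {
        "order": "desc"
      }
    }
  ]
}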
To add a custom analyzer at index creation time, we can automap as before, but then provide overrides for fields that we want to control the mapping for with .Properties()
client.CreateIndex(defaultIndex, c => c
.Settings(s => s
.Analysis(a => a
.Analyzers(aa => aa
.Custom("lowercase_analyzer", ca => ca
.Tokenizer("keyword")
.Filters(
"standard",
"lowercase",
"trim"
)
)
)
)
)
.Mappings(m => m
.Map<Document>(mm => mm
.AutoMap()
.Properties(p => p
.Text(t => t
.Name(n => n.Property)
.Analyzer("lowercase_analyzer")
.Fields(f => f
.Keyword(k => k
.Name("keyword")
.IgnoreAbove(256)
)
)
)
)
)
)
);
which produces
{
"settings": {
"analysis": {
"analyzer": {
"lowercase_analyzer": {
"type": "custom",
"filter": [
"standard",
"lowercase",
"trim"
],
"tokenizer": "keyword"
}
}
}
},
"mappings": {
"document": {
"properties": {
"property": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
},
"analyzer": "lowercase_analyzer"
}
}
}
}
}

How to define a bucket aggregation where buckets are defined by arbitrary filters on a field (GROUP BY CASE equivalent)

ElasticSearch enables us to filter a set of documents by regex on any given field, and also to group the resulting documents by the terms in a given (same or different) field, using "bucket aggregations". For example, on an index that contains a "Url" field and a "UserAgent" field (some kind of web server log), the following will return the top document counts for terms found in the UserAgent field.
{
query: { filtered: { filter: { regexp: { Url : ".*interestingpage.*" } } } },
size: 0,
aggs: { myaggregation: { terms: { field: "UserAgent" } } }
}
What I'd like to do is use the power of the regexp filter (which operates on the whole field, not just terms within a field) to manually define my aggregation buckets, so that I can relatively reliably split my documents/counts/hits by "user agent type", rather than by the arbitrary terms parsed by Elasticsearch from the field.
Basically, I am looking for the equivalent of a CASE statement in a GROUP BY, in SQL terms. The SQL query that would express my intent would be something like:
SELECT Bucket, Count(*)
FROM (
SELECT CASE
WHEN UserAgent LIKE '%android%' OR UserAgent LIKE '%ipad%' OR UserAgent LIKE '%iphone%' OR UserAgent LIKE '%mobile%' THEN 'Mobile'
WHEN UserAgent LIKE '%msie 7.0%' then 'IE7'
WHEN UserAgent LIKE '%msie 8.0%' then 'IE8'
WHEN UserAgent LIKE '%firefox%' then 'FireFox'
ELSE 'OTHER'
END Bucket
FROM pagedata
WHERE Url LIKE '%interestingpage%'
) Buckets
GROUP BY Bucket
Can this be done in an ElasticSearch query?
This is an interesting use-case.
Here's a solution that is more the Elasticsearch way.
The idea is to do all the regex matching at indexing time so that search time is fast (scripts at search time do not perform well and will take time when there are many documents). Let me explain:
define a sub-field for your main field, in which the manipulation of terms is customized
this manipulation will be performed so that the only terms kept in the index are the ones you defined: FireFox, IE8, IE7, Mobile. Each document can have more than one of these terms. That means a text like msie 7.0 sucks and ipad rules will generate only two terms: IE7 and Mobile. All this is made possible by the keep token filter.
there should be another list of token filters that will actually perform the replacement. This is made possible by the pattern_replace token filter.
because you have two words that should be replaced (msie 7.0 for example), you need a way to capture these two words (msie and 7.0) next to each other. This is made possible by the shingle token filter.
Let me put all this together and provide the complete solution:
PUT /test
{
"settings": {
"analysis": {
"analyzer": {
"my_pattern_replace_analyzer": {
"tokenizer": "whitespace",
"filter": [
"filter_shingle",
"my_pattern_replace1",
"my_pattern_replace2",
"my_pattern_replace3",
"my_pattern_replace4",
"words_to_be_kept"
]
}
},
"filter": {
"filter_shingle": {
"type": "shingle",
"max_shingle_size": 10,
"min_shingle_size": 2,
"output_unigrams": true
},
"my_pattern_replace1": {
"type": "pattern_replace",
"pattern": "android|ipad|iphone|mobile",
"replacement": "Mobile"
},
"my_pattern_replace2": {
"type": "pattern_replace",
"pattern": "msie 7.0",
"replacement": "IE7"
},
"my_pattern_replace3": {
"type": "pattern_replace",
"pattern": "msie 8.0",
"replacement": "IE8"
},
"my_pattern_replace4": {
"type": "pattern_replace",
"pattern": "firefox",
"replacement": "FireFox"
},
"words_to_be_kept": {
"type": "keep",
"keep_words": [
"FireFox", "IE8", "IE7", "Mobile"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"UserAgent": {
"type": "string",
"fields": {
"custom": {
"analyzer": "my_pattern_replace_analyzer",
"type": "string"
}
}
}
}
}
}
}
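Once the index exists, the claim about the generated terms can be checked with the _analyze API (the exact _analyze request syntax varies a little between Elasticsearch versions; on recent ones it looks like this). For the text msie 7.0 sucks and ipad rules, only the terms IE7 and Mobile should survive the keep filter:
POST /test/_analyze
{
  "analyzer": "my_pattern_replace_analyzer",
  "text": "msie 7.0 sucks and ipad rules"
}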
Test data:
POST /test/test/_bulk
{"index":{"_id":1}}
{"UserAgent": "android OS is the best firefox"}
{"index":{"_id":2}}
{"UserAgent": "firefox is my favourite browser"}
{"index":{"_id":3}}
{"UserAgent": "msie 7.0 sucks and ipad rules"}
Query:
GET /test/test/_search?search_type=count
{
"aggs": {
"myaggregation": {
"terms": {
"field": "UserAgent.custom",
"size": 10
}
}
}
}
Results:
"hits": {
"total": 3,
"max_score": 0,
"hits": []
},
"aggregations": {
"myaggregation": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "FireFox",
"doc_count": 2
},
{
"key": "Mobile",
"doc_count": 2
},
{
"key": "IE7",
"doc_count": 1
}
]
}
}
You could use a terms aggregation with a script:
{
query: { filtered: { filter: { regexp: { Url : ".*interestingpage.*" } } } },
size: 0,
aggs: {
myaggregation: {
terms: {
script: "doc['UserAgent'] =~ /.*android.*/ || doc['UserAgent'] =~ /.*ipad.*/ || doc['UserAgent'] =~ /.*iphone.*/ || doc['UserAgent'] =~ /.*mobile.*/ ? 'Mobile' : doc['UserAgent'] =~ /.*msie 7.0.*/ ? 'IE7' : '...you got the idea by now...'"
}
}
}
}
But beware of the performance hit!
