ElasticSearch 5 Sort by Keyword Field Case Insensitive - elasticsearch

We are using ElasticSearch 5.
I have a field city using a custom analyzer and the following mapping.
Analyzer
"analysis": {
"analyzer": {
"lowercase_analyzer": {
"filter": [
"standard",
"lowercase",
"trim"
],
"type": "custom",
"tokenizer": "keyword"
}
}
Mapping
"city": {
"type": "text",
"analyzer": "lowercase_analyzer"
}
I am doing this so that I can do a case-insensitive sort on the city field. Here is an example query that I am trying to run:
{
  "query": {
    "term": {
      "email": {
        "value": "some_email#test.com"
      }
    }
  },
  "sort": [
    {
      "city": {
        "order": "desc"
      }
    }
  ]
}
Here is the error I am getting:
"Fielddata is disabled on text fields by default. Set fielddata=true
on [city] in order to load fielddata in memory by uninverting the
inverted index. Note that this can however use significant memory."
I don't want to turn on fielddata and incur a performance hit in Elasticsearch. I would like a keyword field that is not case-sensitive, so that I can perform more meaningful aggregations and sorts on it. Is there no way to do this?

Yes, there is a way to do this, using multi_fields.
In Elasticsearch 5.0 onwards, the string field type was split into two separate types: text, which is analyzed and suited to full-text search, and keyword, which is not analyzed and suited to sorting, aggregations and exact value matches.
With dynamic mapping in Elasticsearch 5.0 (i.e. letting Elasticsearch infer the type that a document property should be mapped to), a JSON string property is mapped to a text field with a "keyword" sub-field that is mapped as a keyword field with the setting ignore_above: 256.
With NEST 5.x automapping, a string property on your POCO is automapped in the same way as Elasticsearch's dynamic mapping above. For example, given the following document
public class Document
{
public string Property { get; set; }
}
automapping it
var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
var defaultIndex = "default-index";
var connectionSettings = new ConnectionSettings(pool)
    .DefaultIndex(defaultIndex);

var client = new ElasticClient(connectionSettings);

client.CreateIndex(defaultIndex, c => c
    .Mappings(m => m
        .Map<Document>(mm => mm
            .AutoMap()
        )
    )
);
produces
{
  "mappings": {
    "document": {
      "properties": {
        "property": {
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          },
          "type": "text"
        }
      }
    }
  }
}
You can now use the property for sorting with Field(f => f.Property.Suffix("keyword")). Take a look at Field Inference for more examples.
keyword field types have doc_values enabled by default, which means that a columnar data structure is built at index time and this is what provides efficient sorting and aggregations.
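For example, a minimal sketch of a NEST 5.x search that sorts on the keyword sub-field (assuming the client and Document POCO above; the match_all query is just a stand-in):
// Sort on the not-analyzed "keyword" sub-field rather than the analyzed text field;
// it is backed by doc_values, so no fielddata is needed.
var searchResponse = client.Search<Document>(s => s
    .Query(q => q.MatchAll())
    .Sort(so => so
        .Descending(f => f.Property.Suffix("keyword"))
    )
);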
To add a custom analyzer at index creation time, we can automap as before, but then provide overrides for the fields whose mapping we want to control, using .Properties():
client.CreateIndex(defaultIndex, c => c
    .Settings(s => s
        .Analysis(a => a
            .Analyzers(aa => aa
                .Custom("lowercase_analyzer", ca => ca
                    .Tokenizer("keyword")
                    .Filters(
                        "standard",
                        "lowercase",
                        "trim"
                    )
                )
            )
        )
    )
    .Mappings(m => m
        .Map<Document>(mm => mm
            .AutoMap()
            .Properties(p => p
                .Text(t => t
                    .Name(n => n.Property)
                    .Analyzer("lowercase_analyzer")
                    .Fields(f => f
                        .Keyword(k => k
                            .Name("keyword")
                            .IgnoreAbove(256)
                        )
                    )
                )
            )
        )
    )
);
which produces
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_analyzer": {
          "type": "custom",
          "filter": [
            "standard",
            "lowercase",
            "trim"
          ],
          "tokenizer": "keyword"
        }
      }
    }
  },
  "mappings": {
    "document": {
      "properties": {
        "property": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          },
          "analyzer": "lowercase_analyzer"
        }
      }
    }
  }
}

Related

.NET Elastic Search Create NGram Index

I am trying to set up Elasticsearch as a prototype for a project that might use it.
The project needs to look through the contents of datasets and make them searchable.
What I have right now is the following:
Index documents
Search through all fields of the indexed documents for the full text
Missing right now is:
Search through all fields of the indexed documents for partial text
That means I can find this sample dataset from my database by searching for e.g. "Sofia", "sofia", "anderson" or "canada", but not by searching for "canad".
{
  "id": 46,
  "firstName": "Sofia",
  "lastName": "Anderson",
  "country": "Canada"
}
I am creating my index using the "Elastic.Clients.Elasticsearch" NuGet package.
I am trying to create an index with an NGram tokenizer and apply it to all fields, but that somehow does not seem to be working.
This is the code that I use to create the Index:
Client.Indices.Create(IndexName, c => c
    .Settings(s => s
        .Analysis(a => a
            .Tokenizer(t => t.Add(TokenizerName,
                new Tokenizer(new TokenizerDefinitions(
                    new Dictionary<string, ITokenizerDefinition>() { { TokenizerName, ngram } }))))
            .Analyzer(ad => ad
                .Custom(AnalyzerName, ca => ca
                    .Tokenizer(TokenizerName)
                )
            )
        )
    )
    .Mappings(m => m
        .AllField(all => all
            .Enabled()
            .Analyzer(AnalyzerName)
            .SearchAnalyzer(AnalyzerName)
        )
    )
);
with
private string TokenizerName => "my_tokenizer";
private string AnalyzerName => "my_analyzer";
and
var ngram = new NGramTokenizer() { MinGram = 3, MaxGram = 3, TokenChars = new List<TokenChar>() { TokenChar.Letter }, CustomTokenChars = "" };
With this code I get the behaviour described above.
Is there any error in my code?
Am I missing something?
Do you need further information?
Thanks in advance
Paul
I did not find a way to get this running in .NET.
However, what worked for me was to create the index using this API call:
URL:
https://{{elasticUrl}}/{{indexName}}
Body:
{
  "mappings": {
    "properties": {
      "firstName": {
        "type": "text",
        "analyzer": "index_ngram",
        "search_analyzer": "search_ngram"
      },
      "lastName": {
        "type": "text",
        "analyzer": "index_ngram",
        "search_analyzer": "search_ngram"
      },
      "country": {
        "type": "text",
        "analyzer": "index_ngram",
        "search_analyzer": "search_ngram"
      }
    }
  },
  "settings": {
    "index": {
      "max_ngram_diff": 50
    },
    "analysis": {
      "filter": {
        "ngram_filter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 25
        }
      },
      "analyzer": {
        "index_ngram": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "ngram_filter", "lowercase" ]
        },
        "search_ngram": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": "lowercase"
        }
      }
    }
  }
}
This leads to ngram terms with lengths from 2 to 25 for the fields firstName, lastName and country.
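If you still want to drive that call from .NET without the client library, here is a minimal sketch using HttpClient (assuming this runs inside an async method, and that elasticUrl, indexName and indexBody, the JSON above, are already defined; authentication is omitted):
using System.Net.Http;
using System.Text;

// PUT the settings/mappings JSON above to create the index.
var httpClient = new HttpClient();
var response = await httpClient.PutAsync(
    $"https://{elasticUrl}/{indexName}",
    new StringContent(indexBody, Encoding.UTF8, "application/json"));
response.EnsureSuccessStatusCode();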
I hope this helps someone in the future :)

How to perform partial word searches on a text input with Elasticsearch?

I have a query to search for records in the following format: TR000002_1_2020.
Users should be able to search for results in the following ways:
TR000002 or 2_1_2020 or TR000002_1_2020 or 2020. I am using Elasticsearch 6.8, so I cannot use the built-in search-as-you-type field introduced in Elasticsearch 7. Thus, I figured either wildcard searches or ngrams may best suit what I need. Here were my two approaches and why they did not work.
Wildcard
Property mapping:
.Text(t => t
.Name(tr => tr.TestRecordId)
)
Query:
m => m.Wildcard(w => w
.Field(tr => tr.TestRecordId)
.Value($"*{form.TestRecordId}*")
),
This works, but it is case-sensitive: if the user searches with tr000002_1_2020, then no results are returned (since the t and r are lowercased in the query).
ngram (search as you type equivalent)
Create a custom ngram analyzer
.Analysis(a => a
    .Analyzers(aa => aa
        .Custom("autocomplete", ca => ca
            .Tokenizer("autocomplete")
            .Filters(new string[] {
                "lowercase"
            })
        )
        .Custom("autocomplete_search", ca => ca
            .Tokenizer("lowercase")
        )
    )
    .Tokenizers(t => t
        .NGram("autocomplete", e => e
            .MinGram(2)
            .MaxGram(16)
            .TokenChars(new TokenChar[] {
                TokenChar.Letter,
                TokenChar.Digit,
                TokenChar.Punctuation,
                TokenChar.Symbol
            })
        )
    )
)
Property Mapping
.Text(t => t
    .Name(tr => tr.TestRecordId)
    .Analyzer("autocomplete")
    .SearchAnalyzer("autocomplete_search")
)
Query
m => m.Match(ma => ma
    .Field(tr => tr.TestRecordId) // the field to match on
    .Query(form.TestRecordId)
),
As described in this answer, this does not work since the tokenizer splits the characters up into elements like 20 and 02 and 2020, so my queries returned all documents in my index that contained 2020, such as TR000002_1_2020, TR000008_1_2020 and TR000003_6_2020.
What's the best utilization of Elasticsearch to allow my desired search behavior? I've seen query string used as well. Thanks!
Here is a simple way to address your requirements (I hope):
we use a pattern_replace char filter to remove the fixed part of the reference (TR000...)
we use a split tokenizer to split the reference on the "_" character
we use a match_phrase query to ensure that the fragments of the reference are matched in order
With this analysis chain, the reference TR000002_1_2020 produces the tokens ["2", "1", "2020"], so it will match the queries ["TR000002_1_2020", "TR000002 1 2020", "2_1_2020", "1_2020"], but it will not match 3_1_2020 or 2_2_2020.
Here is an example of the mapping and analysis. It's not in NEST, but I think you will be able to make the translation (a rough attempt is sketched after the examples below).
PUT pattern_split_demo
{
  "settings": {
    "analysis": {
      "char_filter": {
        "replace_char_filter": {
          "type": "pattern_replace",
          "pattern": "^TR0*",
          "replacement": ""
        }
      },
      "tokenizer": {
        "split_tokenizer": {
          "type": "simple_pattern_split",
          "pattern": "_"
        }
      },
      "analyzer": {
        "split_analyzer": {
          "tokenizer": "split_tokenizer",
          "filter": [
            "lowercase"
          ],
          "char_filter": [
            "replace_char_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "split_analyzer"
      }
    }
  }
}
POST pattern_split_demo/_analyze
{
  "text": "TR000002_1_2020",
  "analyzer": "split_analyzer"
}
=> ["2", "1", "2020"]

POST pattern_split_demo/_doc?refresh=true
{
  "content": "TR000002_1_2020"
}

POST pattern_split_demo/_search
{
  "query": {
    "match_phrase": {
      "content": "TR000002_1_2020"
    }
  }
}
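A rough NEST 6.x translation of that index creation might look like the sketch below. Note the hedges: it uses the pattern tokenizer to split on "_" in place of simple_pattern_split, and the TestRecord POCO with a Content property is an assumption for illustration, not your actual model:
public class TestRecord
{
    public string Content { get; set; }
}

client.CreateIndex("pattern_split_demo", c => c
    .Settings(s => s
        .Analysis(a => a
            .CharFilters(cf => cf
                .PatternReplace("replace_char_filter", pr => pr
                    .Pattern("^TR0*")
                    .Replacement("")
                )
            )
            .Tokenizers(t => t
                // the pattern tokenizer splits on the regex, much like simple_pattern_split on "_"
                .Pattern("split_tokenizer", pt => pt
                    .Pattern("_")
                )
            )
            .Analyzers(an => an
                .Custom("split_analyzer", ca => ca
                    .CharFilters("replace_char_filter")
                    .Tokenizer("split_tokenizer")
                    .Filters("lowercase")
                )
            )
        )
    )
    .Mappings(m => m
        .Map<TestRecord>(mm => mm
            .Properties(p => p
                .Text(tx => tx
                    .Name(n => n.Content)
                    .Analyzer("split_analyzer")
                )
            )
        )
    )
);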

How do I get Keyword mapping working in NEST?

I'm on NEST v6.3.1, Elasticsearch v6.4.2.
I can't get my field to be indexed as a keyword.
I've tried both the attribute approach:
[Keyword]
public string Suburb { get; set; }
and the fluent approach:
client.CreateIndex(indexName, i => i
    .Mappings(ms => ms
        .Map<Listing>(m => m
            .Properties(ps => ps
                .Keyword(k => k
                    .Name(n => n.Suburb)))
            .AutoMap())
        .Map<Agent>(m => m
            .AutoMap())
        .Map<BuildingDetails>(m => m
            .AutoMap())
        .Map<LandDetails>(m => m
            .AutoMap())
    )
);
Both result in the same thing:
{
  "listings": {
    "mappings": {
      "listing": {
        "properties": {
          "suburb": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}
e.g. it doesn't match what I'm seeing here:
https://www.elastic.co/guide/en/elasticsearch/client/net-api/current/attribute-mapping.html
The same thing happens when I try to use [GeoPoint]. It should be type geo_point, but it's mapped to floats:
"latLong": {
"properties": {
"lat": {
"type": "float"
},
"lon": {
"type": "float"
}
}
}
So I'm missing something, just not sure what.
Any help?
Thanks
The index likely already exists, and a field mapping cannot be updated. Check .IsValid on the response from the create index call, and if invalid, take a look at the error and reason. You likely need to delete the index and create again.
Also note that multiple type mappings in one index are not allowed in Elasticsearch 6.x and will fail. Either create separate indices for different types, or, if the types have the same field structure and you wish to index and analyze them in the same way, consider introducing an additional discriminator field.
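For example, a minimal sketch (NEST 6.x, reusing the Listing mapping above; deleting and recreating is only safe outside production):
var createResponse = client.CreateIndex("listings", i => i
    .Mappings(ms => ms
        .Map<Listing>(m => m
            .Properties(ps => ps
                .Keyword(k => k
                    .Name(n => n.Suburb)))
            .AutoMap())));

if (!createResponse.IsValid)
{
    // Inspect the failure; "resource_already_exists_exception" means the index
    // (and its old mapping) is already there.
    Console.WriteLine(createResponse.ServerError?.Error?.Reason);

    // Development only: drop the index and create it again so the new mapping applies.
    client.DeleteIndex("listings");
}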

indexing suggestions using analyzer

Good day:
I'm trying to figure out how to index suggestions without splitting my text on a delimiter myself before storing it in the CompletionField:
List<string> inputs = new List<string>() {
    facility.City,
    facility.State,
    facility.ZipCode
};

inputs.AddRange(facility.Name.Split(' '));
inputs.AddRange(facility.Address.Split(' '));
inputs.AddRange(facilityType.Description.Split(' '));

var completionField = new CompletionField()
{
    Input = inputs.AsEnumerable<string>()
};

return completionField;
This isn't an optimal way of doing this, because I would rather let the analyzer handle it as opposed to doing this myself and then indexing it. Is there a way to send the entire text to Elasticsearch and let it analyze the text and store it in the completion field at index time, or something else?
Updated
I've changed my code to index the entire text and to use the default analyzer; however, this is what was indexed, and the analyzer isn't breaking the text up:
"suggest": {
"input": [
"Reston",
"Virginia",
"20190",
"Facility 123456",
"22100 Sunset Hills Rd suite 150*"
]
},
My code:
List<string> inputs = new List<string>() {
    facility.City,
    facility.State,
    facility.ZipCode
};

inputs.Add(facility.Name);
inputs.Add(facility.Address);

if (facility.Description != null && facility.Description != "")
{
    inputs.Add(facility.Description);
}

var completionField = new CompletionField()
{
    Input = inputs.AsEnumerable<string>()
};

return completionField;
My mapping for the property:
"suggest": {
"type": "completion",
"analyzer": "simple",
"preserve_separators": true,
"preserve_position_increments": true,
"max_input_length": 50
},
Yet it's still not breaking up my input.
Just send all the text in input and specify a custom analyzer that uses the whitespace tokenizer and nothing else.
EDIT
First add the analyzer
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "suggest": {
          "type": "completion",
          "analyzer": "my_custom_analyzer"
        },
        "title": {
          "type": "keyword"
        }
      }
    }
  }
}
Then specify it on the suggest field, as in the mapping above.
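In NEST terms, a minimal sketch of the same index might look like this (the MyDocument type and its property names are assumptions for illustration, not your actual model):
public class MyDocument
{
    public CompletionField Suggest { get; set; }
    public string Title { get; set; }
}

client.CreateIndex("my_index", c => c
    .Settings(s => s
        .Analysis(a => a
            .Analyzers(aa => aa
                .Custom("my_custom_analyzer", ca => ca
                    .Tokenizer("whitespace")
                    .Filters("lowercase")
                )
            )
        )
    )
    .Mappings(m => m
        .Map<MyDocument>(mm => mm
            .Properties(p => p
                .Completion(cp => cp
                    .Name(n => n.Suggest)
                    .Analyzer("my_custom_analyzer")
                )
                .Keyword(k => k
                    .Name(n => n.Title)
                )
            )
        )
    )
);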

Mapping and indexing Path hierarchy in Elastic NEST to search within directory paths

I need to search for files and folders within specific directories. In order to do that, Elasticsearch asks us to create an analyzer and set its tokenizer to path_hierarchy:
PUT /fs
{
  "settings": {
    "analysis": {
      "analyzer": {
        "paths": {
          "tokenizer": "path_hierarchy"
        }
      }
    }
  }
}
Then, create the mapping as illustrated below with two properties: name (holding the name of the file) and path (to store the directory path):
PUT /fs/_mapping/file
{
  "properties": {
    "name": {
      "type": "string",
      "index": "not_analyzed"
    },
    "path": {
      "type": "string",
      "index": "not_analyzed",
      "fields": {
        "tree": {
          "type": "string",
          "analyzer": "paths"
        }
      }
    }
  }
}
This requires us to index the path of the directory where the file lives:
PUT /fs/file/1
{
  "name": "README.txt",
  "path": "/clinton/projects/elasticsearch"
}
The Question:
How can I create this mapping using NEST in C#?
The analyzer is created by declaring a custom analyzer, and then setting its tokenizer to "path_tokenizer":
// name of the tokenizer: "path_tokenizer"
string pathTokenizerName = "path_tokenizer";

// the name of the analyzer
string pathAnalyzerName = "path";

PathHierarchyTokenizer pathTokenizer = new PathHierarchyTokenizer();

AnalyzerBase pathAnalyzer = new CustomAnalyzer
{
    Tokenizer = pathTokenizerName,
};
The second step is creating the index with the required analyzer and mapping. In the code below, the property "LogicalPath" holds the locations of directories in the system:
// Create the index
elastic.CreateIndex(_indexName, i => i
    .NumberOfShards(1).NumberOfReplicas(0)
    // Add the path analyzer to the index
    .Analysis(an => an
        .Tokenizers(tokenizers => tokenizers
            .Add(pathTokenizerName, pathTokenizer)
        )
        .Analyzers(analyzer => analyzer
            .Add(pathAnalyzerName, pathAnalyzer)
        )
    )
    // Add the mappings
    .AddMapping<Document>(t => t
        .MapFromAttributes()
        .Properties(props => props
            // associate the path analyzer with the "LogicalPath" property
            .String(myPathProperty => myPathProperty
                .Name(_t => _t.LogicalPath)
                .IndexAnalyzer(pathAnalyzerName)
            )
        )
    ));
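Once indexed this way, searching within a directory can be a simple term query on that property. A minimal sketch in the same (older) NEST syntax, assuming documents were indexed with LogicalPath values such as "/clinton/projects/elasticsearch":
// Because LogicalPath is analyzed with the path_hierarchy tokenizer, every ancestor
// directory becomes a term in the index, so a term query for a directory matches
// all documents stored anywhere beneath it.
var result = elastic.Search<Document>(s => s
    .Query(q => q
        .Term(d => d.LogicalPath, "/clinton/projects")
    )
);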
