How to create a custom analyzer to ignore accents and pt-br stopwords using elasticsearch nest api? - elasticsearch

First of all, consider that I am using a "News" class (Noticia in Portuguese) that has a string field called "Content" (Conteudo in Portuguese):
public class Noticia
{
public string Conteudo { get; set; }
}
I am trying to create an index that is configured to ignore accents and pt-br stopwords, and that allows up to 40 million characters to be analyzed in a highlighted query.
I can create such an index using this code:
var createIndexResponse = client.Indices.Create(indexName, c => c
.Settings(s => s
.Setting("highlight.max_analyzed_offset" , 40000000)
.Analysis(analysis => analysis
.TokenFilters(tokenfilters => tokenfilters
.AsciiFolding("folding-accent", ft => ft
)
.Stop("stoping-br", st => st
.StopWords("_brazilian_")
)
)
.Analyzers(analyzers => analyzers
.Custom("folding-analyzer", cc => cc
.Tokenizer("standard")
.Filters("folding-accent", "stoping-br")
)
)
)
)
.Map<Noticia>(mm => mm
.AutoMap()
.Properties(p => p
.Text(t => t
.Name(n => n.Conteudo)
.Analyzer("folding-analyzer")
)
)
)
);
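For reference, the settings and mapping this NEST call is expected to produce look roughly like the JSON below. This is a sketch reconstructed from the fluent code above, not captured output — verify against your cluster with GET <index>/_settings and GET <index>/_mapping:

```json
{
  "settings": {
    "highlight.max_analyzed_offset": 40000000,
    "analysis": {
      "filter": {
        "folding-accent": { "type": "asciifolding" },
        "stoping-br": { "type": "stop", "stopwords": "_brazilian_" }
      },
      "analyzer": {
        "folding-analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["folding-accent", "stoping-br"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "conteudo": { "type": "text", "analyzer": "folding-analyzer" }
    }
  }
}
```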
If I test this analyzer using Kibana Dev Tools, I get the result that I want: no accents, and stopwords removed!
POST intranet/_analyze
{
"analyzer": "folding-analyzer",
"text": "Férias de todos os funcionários"
}
Result:
{
"tokens" : [
{
"token" : "Ferias",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "funcionarios",
"start_offset" : 19,
"end_offset" : 31,
"type" : "<ALPHANUM>",
"position" : 4
}
]
}
The same (good) results are returned when I use NEST to analyze a query with my folding analyzer (the tokens "Ferias" and "funcionarios" are returned):
var analyzeResponse = client.Indices.Analyze(a => a
.Index(indexName)
.Analyzer("folding-analyzer")
.Text("Férias de todos os funcionários")
);
However, if I perform a search using the NEST Elasticsearch .NET client, terms like "Férias" (with accent) and "Ferias" (without accent) are being treated as different.
My goal is to perform a query that returns all results, no matter whether the word is Férias or Ferias.
This is the simplified code (C# NEST) I am using to query Elasticsearch:
var searchResponse = ElasticClient.Search<Noticia>(s => s
.Index(indexName)
.Query(q => q
.MultiMatch(m => m
.Fields(f => f
.Field(p => p.Titulo,4)
.Field(p => p.Conteudo,2)
)
.Query(termo)
)
)
);
and here is the extended API call associated with the searchResponse:
Successful (200) low level call on POST: /intranet/_search?pretty=true&error_trace=true&typed_keys=true
# Audit trail of this API call:
- [1] HealthyResponse: Node: ###NODE ADDRESS### Took: 00:00:00.3880295
# Request:
{"query":{"multi_match":{"fields":["categoria^1","titulo^4","ementa^3","conteudo^2","attachments.attachment.content^1"],"query":"Ferias"}},"size":100}
# Response:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 13.788051,
"hits" : [
{
"_index" : "intranet",
"_type" : "_doc",
"_id" : "4934",
"_score" : 13.788051,
"_source" : {
"conteudo" : "blablabla ferias blablabla",
"attachments" : [ ],
"categoria" : "Novidades da Biblioteca - DBD",
"publicadaEm" : "2008-10-14T00:00:00",
"titulo" : "INFORMATIVO DE DIREITO ADMINISTRATIVO E LRF - JUL/2008",
"ementa" : "blablabla",
"matriculaAutor" : 900794,
"atualizadaEm" : "2009-02-03T13:44:00",
"id" : 4934,
"indexacaoAtiva" : true,
"status" : "Disponível"
}
}
]
}
}
I have also tried to use multi-fields and Suffix in a query, without success:
.Map<Noticia>(mm => mm
.AutoMap()
.Properties(p => p
.Text(t => t
.Name(n => n.Conteudo)
.Analyzer("folding-analyzer")
.Fields(f => f
.Text(ss => ss
.Name("folding")
.Analyzer("folding-analyzer")
)
)
(...)
var searchResponse = ElasticClient.Search<Noticia>(s => s
.Index(indexName)
.Query(q => q
.MultiMatch(m => m
.Fields(f => f
.Field(p => p.Titulo,4)
.Field(p => p.Conteudo.Suffix("folding"),2)
)
.Query(termo)
)
)
);
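For reference, the Suffix("folding") call should make NEST target the sub-field, producing a request body roughly like this (a sketch; weights taken from the code above):

```json
{
  "query": {
    "multi_match": {
      "fields": ["titulo^4", "conteudo.folding^2"],
      "query": "Ferias"
    }
  }
}
```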
Any clue what I am doing wrong or what I can do to reach my goal?
Thanks a lot in advance!

After a few days I found out what I was doing wrong, and it was all about the mapping.
Here are the steps I took to approach the problem and solve it in the end.
1 - First of all, I opened the Kibana console and found out that only the last field of my mapped fields was being assigned to my custom analyzer (folding-analyzer).
To test each one of your fields, you can use the Get Field Mapping API with a command in Dev Tools like this:
GET /<index>/_mapping/field/<field>
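For example, using the index and field from this question, the call and the rough shape of the response would be (the exact response format varies by Elasticsearch version — treat this as a sketch):

```json
GET /intranet/_mapping/field/conteudo

{
  "intranet" : {
    "mappings" : {
      "conteudo" : {
        "full_name" : "conteudo",
        "mapping" : {
          "conteudo" : {
            "type" : "text",
            "analyzer" : "folding-analyzer"
          }
        }
      }
    }
  }
}
```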
then you'll be able to see whether your analyzer is being assigned to your field or not.
2 - After that, I discovered that the last field was the only one being assigned to my custom analyzer, and the reason was that I was messing up the fluent mapping in two ways:
First, I had to chain my text properties correctly.
Second, I was trying to map another POCO class in another Map<> clause, when I was supposed to use the Object<> clause.
The correct mapping that worked for me looks like this:
.Map<Noticia>(mm => mm
.AutoMap()
.Properties(p => p
.Text(t => t
.Name(n => n.Field1)
.Analyzer("folding-analyzer")
)
.Text(t => t
.Name(n => n.Field2)
.Analyzer("folding-analyzer")
)
.Object<NoticiaArquivo>(o => o
.Name(n => n.Arquivos)
.Properties(eps => eps
.Text(s => s
.Name(e => e.NAField1)
.Analyzer("folding-analyzer")
)
.Text(s => s
.Name(e => e.NAField2)
.Analyzer("folding-analyzer")
)
)
)
)
)
Finally, it's important to share that when you assign an analyzer using the .Analyzer("analyzerName") clause, you're telling Elasticsearch that you want to use that analyzer both at indexing and at search time.
If you want to use an analyzer only when you search, and not at indexing time, you should use the .SearchAnalyzer("analyzerName") clause.
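In mapping JSON terms, the two clauses map to the analyzer and search_analyzer parameters of a text field. A hypothetical sketch, reusing the field from this question, for a setup that folds accents at index time but searches with the standard analyzer:

```json
{
  "properties": {
    "conteudo": {
      "type": "text",
      "analyzer": "folding-analyzer",
      "search_analyzer": "standard"
    }
  }
}
```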

Related

Elastic search nest search query

I am getting stuck while implementing search using Elasticsearch NEST.
I have to implement a LIKE query with Elasticsearch, for example:
select * from table where username like '%abc xyz%'
As you can see in the above SQL query, I have applied a LIKE query with the string "abc", a space, and another string "xyz". I want the same query in Elasticsearch. Can anyone help me implement such a query in Elasticsearch NEST?
Below is the query
Client.Search<Video>(s => s
.Query(q => q
.Match(m => m
.OnField("video_parent")
.Query("0")
) && q
.Match(m => m
.OnField("video_status")
.Query(objVideosFilterCriteria.VideoStatus.ToString())
) && q
.MatchPhrase(ff=>ff
.OnField("video_title")
.Query(objVideosFilterCriteria.SearchString)
) && q
.Range(r => r
.OnField(f => f.video_date)
.GreaterOrEquals(fromDate)
.LowerOrEquals(toDate)
)
)
.From(objVideosFilterCriteria.PageIndex)
.Size(objVideosFilterCriteria.PageSize)
);
Above is the full query I am using. Within it, I am using
q.MatchPhrase(ff=>ff
.OnField("video_title")
.Query(objVideosFilterCriteria.SearchString)
)
for the LIKE behavior, but it doesn't seem to work.
I am using the below data set and want to filter data from this list.
"hits" : [
{
"_source" : {
"video_id" : 265006,
"video_title" : "nunchuk rockin roller II"
}
},
{
"_source" : {
"video_id" : 265013,
"video_title" : "?Shaggy?????Locks???7??????Alberto E. Interview {407} 967 ~ 8596?"
}
},
{
"_source" : {
"video_id" : 265014,
"video_title" : "Shakin' Stevens - Kalundborg Rocker"
}
},
{
"_source" : {
"video_id" : 265019,
"video_title" : "?Shaggy?????Locks? = 7??????Greg M. Interview {407} 967 ~ 8596?"
}
},
{
"_source" : {
"video_id" : 265023,
"video_title" : "?Shaggy?????Locks? = 7??????Jason M. Interview {407} 967 ~ 8596?"
}
}
]
For example, I would like to search with the keyword "kin rol" in the "video_title" field; with the above data it should fetch the one record at the first position in the list, but with my current query I get nothing.

NEST 2.0 with Elasticsearch for GeoDistance always returns all records

I have the below code using C# .NET 4.5 and NEST 2.0 via NuGet. This query always returns all documents of my type 'trackpointes', regardless of the distance search. I have 2,790 documents, and the count returned is exactly that; even with 1 centimeter as the distance unit it returns all 2,790 documents. My 'trackpointes' type has a location field of type geo_point, with geohash true and geohash_precision of 9.
I am just trying to filter results by distance, without any other search terms, and for my 2,790 records it returns them all regardless of the unit of measurement. So I have to be missing something (hopefully small). Any help is appreciated. The NEST examples I can find are a year or two old, and that syntax does not seem to work any more.
double distance = 4.0;
var geoResult = client.Search<TrackPointES>(s => s.From(0).Size(10000).Type("trackpointes")
.Query(query => query
.Bool( b => b.Filter(filter => filter
.GeoDistance(geo => geo
.Distance(distance, Nest.DistanceUnit.Kilometers).Location(35, -82)))
)
)
);
If I use Postman to connect to my instance of ES and POST a search with the below JSON, I get a return of 143 documents out of 2,790. So I know the data is right, as that is a realistic result.
{
"query" : {
"filtered" : {
"filter" : {
"geo_distance" : {
"distance" : "4km",
"location" : {
"top_left": {
"lat" : 35,
"lon" : -82
}
}
}
}
}
}
}
Looks like you didn't specify the field in your query. Try this one:
var geoResult = client.Search<Document>(s => s.From(0).Size(10000)
.Query(query => query
.Bool(b => b.Filter(filter => filter
.GeoDistance(geo => geo
.Field(f => f.Location) //<- this
.Distance(distance, Nest.DistanceUnit.Kilometers).Location(35, -82)))
)
)
);
I forgot to specify the field to search for the location. :( But I am posting here just in case someone else has the same issue and to shame myself into trying harder...
.Field(p => p.location) was the difference in the query.
var geoResult = client.Search<TrackPointES>(s => s.From(0).Size(10000).Type("trackpointes")
.Query(query => query
.Bool( b => b.Filter(filter => filter
.GeoDistance(geo => geo.Field(p => p.location).DistanceType(Nest.GeoDistanceType.SloppyArc)
.Distance(distance, Nest.DistanceUnit.Kilometers).Location(35, -82)))
)
)
);
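For comparison, the query body this corrected NEST call should produce looks roughly like the following. This is a sketch; the exact JSON varies by NEST/Elasticsearch version:

```json
{
  "query": {
    "bool": {
      "filter": [
        {
          "geo_distance": {
            "distance": "4km",
            "distance_type": "sloppy_arc",
            "location": { "lat": 35.0, "lon": -82.0 }
          }
        }
      ]
    }
  }
}
```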

How to write an expression for a special KV string in the logstash kv filter?

I have plenty of log lines like this one:
uid[118930] pageview h5_act, actTag[cyts] corpId[2] inviteType[0] clientId[3] clientVer[2.3.0] uniqueId[d317de16a78a0089b0d94d684e7a9585565ffa236138c0.85354991] srcId[0] subSrc[]
Most of the content consists of key-value expressions in the KEY[VALUE] form.
I have read the documentation but still cannot figure out how to write the configuration.
Any help would be appreciated!
You can simply configure your kv filter using the value_split and trim settings, like below:
filter {
kv {
value_split => "\["
trim => "\]"
}
}
For the sample log line you've given, you'll get:
{
"message" => "uid[118930] pageview h5_act, actTag[cyts] corpId[2] inviteType[0] clientId[3] clientVer[2.3.0] uniqueId[d317de16a78a0089b0d94d684e7a9585565ffa236138c0.85354991] srcId[0] subSrc[]",
"#version" => "1",
"#timestamp" => "2015-12-12T05:04:00.888Z",
"host" => "iMac.local",
"uid" => "118930",
"actTag" => "cyts",
"corpId" => "2",
"inviteType" => "0",
"clientId" => "3",
"clientVer" => "2.3.0",
"uniqueId" => "d317de16a78a0089b0d94d684e7a9585565ffa236138c0.85354991",
"srcId" => "0",
"subSrc" => ""
}
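Note that in newer Logstash versions the kv filter's trim option was split into trim_key and trim_value; if trim is rejected on your version, an equivalent configuration would be something like this (check the kv filter docs for your Logstash version):

```text
filter {
  kv {
    value_split => "\["
    trim_value => "\]"
  }
}
```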

Partial matching of an entire word using Elasticsearch - matching the end or middle part

I've read through so much of the documentation, but I'm now getting a bit confused about how to match part of a word in a search. I understand there are many techniques, but most talk about matching the first part of a word, such as 'quick' matching 'quick brown fox'.
Well, what if I have the word 'endgame' and the input query is 'game'? I've tried the standard, keyword, whitespace, etc. tokenizers, but I'm not getting it.
I'm sure I'm missing something simple.
Update
I was able to implement this with the help of John. Here's the implementation using NEST...
var ngramTokenFilter = new NgramTokenFilter
{
MinGram = 2,
MaxGram = 3
};
var nGramTokenizer = new NGramTokenizer
{
MinGram = 2, MaxGram = 3, TokenChars = new List<string>{"letter", "digit"}
};
var nGramAnalyzer = new CustomAnalyzer
{
Tokenizer = "nGramTokenizer",
Filter = new[] { "ngram", "standard", "lowercase" }
};
client.CreateIndex("myindex", i =>
{
i
.Analysis(a => a.Analyzers(an => an
.Add("ngramAnalyer", nGramAnalyzer)
)
.Tokenizers(tkn => tkn
.Add("nGramTokenizer", nGramTokenizer)
)
.TokenFilters(x => x
.Add("ngram", ngramTokenFilter)
)
)
...
And here is my POCO; I'm actually creating a multi-field: one not analyzed, and one with my ngram tokenizer analyzer:
pm.Properties(props => props
.MultiField(mf => mf
.Name("myfield")
.Fields(f => f
.String(s => s.Name("myfield").Analyzer("ngramAnalyer"))
.String(s => s.Name("raw").Index(FieldIndexOption.not_analyzed))
)
)
);
I would try an ngram tokenizer: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html
The example below is fairly extreme (it creates 2 and 3 letter tokens) but should give you an idea of how it works:
curl -XPUT 'localhost:9200/test' -d '
{
"settings" : {
"analysis" : {
"analyzer" : {
"my_ngram_analyzer" : {
"tokenizer" : "my_ngram_tokenizer"
}
},
"tokenizer" : {
"my_ngram_tokenizer" : {
"type" : "nGram",
"min_gram" : "2",
"max_gram" : "3",
"token_chars": [ "letter", "digit" ]
}
}
}
}
}'
curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'
Result: FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04
This will allow you to break up your tokens into smaller tokens of a configurable size and search on them. You'll want to play with min_gram and max_gram for your use case.
This can have some memory impact, but it tends to be a lot faster than a wildcard search with trailing or leading wildcards (or both). http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-wildcard-query.html
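For comparison, a wildcard query with both a leading and a trailing wildcard — the slow alternative mentioned above — would look like this against the not_analyzed "raw" sub-field from the update (field names taken from that mapping):

```json
{
  "query": {
    "wildcard": {
      "myfield.raw": "*game*"
    }
  }
}
```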

Creating an index Nest

How would I recreate the following index using Elasticsearch Nest API?
Here is the JSON for the index, including the mapping:
{
"settings": {
"analysis": {
"filter": {
"trigrams_filter": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3
}
},
"analyzer": {
"trigrams": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"trigrams_filter"
]
}
}
}
},
"mappings": {
"data": {
"_all" : {"enabled" : true},
"properties": {
"text": {
"type": "string",
"analyzer": "trigrams"
}
}
}
}
}
Here is my attempt:
var newIndex = client.CreateIndexAsync(indexName, index => index
.NumberOfReplicas(replicas)
.NumberOfShards(shards)
.Settings(settings => settings
.Add("merge.policy.merge_factor", "10")
.Add("search.slowlog.threshold.fetch.warn", "1s")
.Add("mapping.allow_type_wrapper", true))
.AddMapping<Object>(mapping => mapping
.IndexAnalyzer("trigram")
.Type("string"))
);
The documentation does not seem to mention anything about this.
UPDATE:
I found this post that uses
var index = new IndexSettings()
and then adds the analysis section as a string literal of JSON:
index.Add("analysis", @"{json}");
Where can one find more examples like this one, and does this approach work?
Creating an index in older versions
There are two main ways you can accomplish this, as outlined in the NEST Create Index documentation.
Here is the way where you directly declare the index settings as fluent dictionary entries, just like you are doing in your example above. I tested this locally and it produces index settings that match your JSON above.
var response = client.CreateIndex(indexName, s => s
.NumberOfReplicas(replicas)
.NumberOfShards(shards)
.Settings(settings => settings
.Add("merge.policy.merge_factor", "10")
.Add("search.slowlog.threshold.fetch.warn", "1s")
.Add("mapping.allow_type_wrapper", true)
.Add("analysis.filter.trigrams_filter.type", "nGram")
.Add("analysis.filter.trigrams_filter.min_gram", "3")
.Add("analysis.filter.trigrams_filter.max_gram", "3")
.Add("analysis.analyzer.trigrams.type", "custom")
.Add("analysis.analyzer.trigrams.tokenizer", "standard")
.Add("analysis.analyzer.trigrams.filter.0", "lowercase")
.Add("analysis.analyzer.trigrams.filter.1", "trigrams_filter")
)
.AddMapping<Object>(mapping => mapping
.Type("data")
.AllField(af => af.Enabled())
.Properties(prop => prop
.String(sprop => sprop
.Name("text")
.IndexAnalyzer("trigrams")
)
)
)
);
Please note that NEST also includes the ability to create index settings using strongly typed classes. I will post an example of that later if I have time to work through it.
Creating index with NEST 7.x
Please also note that in NEST 7.x the CreateIndex method is removed; use Indices.Create instead. Here's an example:
_client.Indices
.Create(indexName, s => s
.Settings(se => se
.NumberOfReplicas(replicas)
.NumberOfShards(shards)
.Setting("merge.policy.merge_factor", "10")));
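To recreate the trigram index from the question on NEST 7.x, the analysis configuration also moves under Settings, and the mapping is declared with Map<T>. A sketch, untested against a live cluster (MyDocument and its Text property are placeholder names):

```csharp
var response = _client.Indices.Create(indexName, c => c
    .Settings(s => s
        .NumberOfReplicas(replicas)
        .NumberOfShards(shards)
        .Analysis(a => a
            .TokenFilters(tf => tf
                .NGram("trigrams_filter", ng => ng
                    .MinGram(3)
                    .MaxGram(3)))
            .Analyzers(an => an
                .Custom("trigrams", ca => ca
                    .Tokenizer("standard")
                    .Filters("lowercase", "trigrams_filter")))))
    .Map<MyDocument>(m => m              // MyDocument is a placeholder POCO
        .Properties(p => p
            .Text(t => t
                .Name(n => n.Text)       // placeholder property
                .Analyzer("trigrams")))));
```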
In case people have NEST 2.0: .NumberOfReplicas(x) and .NumberOfShards(y) are now in the Settings area, so specify them within the lambda expression under Settings.
EsClient.CreateIndex("indexname", c => c
.Settings(s => s
.NumberOfReplicas(replicasNr)
.NumberOfShards(shardsNr)
)
);
NEST 2.0 has a lot of changes and moved things around a bit, so these answers are a great starting point for sure. You may need to adjust a little for the NEST 2.0 update.
A small example:
EsClient.CreateIndex("indexname", c => c
.NumberOfReplicas(replicasNr)
.NumberOfShards(shardsNr)
.Settings(s => s
.Add("merge.policy.merge_factor", "10")
.Add("search.slowlog.threshold.fetch.warn", "15s")
)
#region Analysis
.Analysis(descriptor => descriptor
.Analyzers(bases => bases
.Add("folded_word", new CustomAnalyzer()
{
Filter = new List<string> { "icu_folding", "trim" },
Tokenizer = "standard"
}
)
)
.TokenFilters(i => i
.Add("engram", new EdgeNGramTokenFilter
{
MinGram = 1,
MaxGram = 20
}
)
)
.CharFilters(cf => cf
.Add("drop_chars", new PatternReplaceCharFilter
{
Pattern = @"[^0-9]",
Replacement = ""
}
)
)
)
#endregion
#region Mapping Categories
.AddMapping<Categories>(m => m
.Properties(props => props
.MultiField(mf => mf
.Name(n => n.Label_en)
.Fields(fs => fs
.String(s => s.Name(t => t.Label_en).Analyzer("folded_word"))
)
)
)
)
#endregion
);
In case anyone has migrated to NEST 2.4 and has the same question: you need to define your custom filters and analyzers in the index settings like this:
elasticClient.CreateIndex(_indexName, i => i
.Settings(s => s
.Analysis(a => a
.TokenFilters(tf => tf
.EdgeNGram("edge_ngrams", e => e
.MinGram(1)
.MaxGram(50)
.Side(EdgeNGramSide.Front)))
.Analyzers(analyzer => analyzer
.Custom("partial_text", ca => ca
.Filters(new string[] { "lowercase", "edge_ngrams" })
.Tokenizer("standard"))
.Custom("full_text", ca => ca
.Filters(new string[] { "standard", "lowercase" } )
.Tokenizer("standard"))))));
For 7.x and later, you can use the following code to create an index with shards, replicas, and automapping:
if (!_elasticClient.Indices.Exists(_elasticClientIndexName).Exists)
{
var response = _elasticClient.Indices
.Create(_elasticClientIndexName, s => s
.Settings(se => se
.NumberOfReplicas(1)
.NumberOfShards(shards)
).Map<YourDTO>(
x => x.AutoMap().DateDetection(false)
));
if (!response.IsValid)
{
// Elasticsearch index status is invalid, log an exception
}
}
