Partial matching of an entire word using Elasticsearch - matching the end or middle part

I've read through so much of the documentation, but I'm now getting a bit confused about how to match part of a word in a search. I understand there are many techniques, but most of them cover matching the first part of a word, such as 'quick' matching 'quick brown fox'.
Well, what if I'm looking for the word 'endgame' but the input query is 'game'? I've tried the standard, keyword, whitespace, etc. tokenizers, but I'm not getting it.
I'm sure I'm missing something simple.
Update
I was able to implement this with the help of John. Here's the implementation using NEST:
var ngramTokenFilter = new NgramTokenFilter
{
    MinGram = 2,
    MaxGram = 3
};

var nGramTokenizer = new NGramTokenizer
{
    MinGram = 2,
    MaxGram = 3,
    TokenChars = new List<string> { "letter", "digit" }
};

var nGramAnalyzer = new CustomAnalyzer
{
    Tokenizer = "nGramTokenizer",
    Filter = new[] { "ngram", "standard", "lowercase" }
};
client.CreateIndex("myindex", i => i
    .Analysis(a => a
        .Analyzers(an => an
            .Add("ngramAnalyer", nGramAnalyzer)
        )
        .Tokenizers(tkn => tkn
            .Add("nGramTokenizer", nGramTokenizer)
        )
        .TokenFilters(x => x
            .Add("ngram", ngramTokenFilter)
        )
    )
    ...
And here's my POCO mapping. I'm actually creating a multi-field: one sub-field not analyzed, and one analyzed with my ngram analyzer:
pm.Properties(props => props
    .MultiField(mf => mf
        .Name("myfield")
        .Fields(f => f
            .String(s => s.Name("myfield").Analyzer("ngramAnalyer"))
            .String(s => s.Name("raw").Index(FieldIndexOption.not_analyzed))
        )
    )
);
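For completeness, here is a sketch of how a search against this mapping might look (NEST 1.x syntax; the MyDoc class name is a placeholder for the actual POCO): the ngram-analyzed myfield serves partial matches, while the not_analyzed myfield.raw sub-field serves exact lookups.

// Partial match: the query "game" hits documents containing "endgame",
// because the indexed 2- and 3-grams of "endgame" include "ga", "gam", "am", "ame" and "me"
var partial = client.Search<MyDoc>(s => s
    .Query(q => q
        .Match(m => m.OnField("myfield").Query("game"))
    )
);

// Exact match against the not_analyzed sub-field
var exact = client.Search<MyDoc>(s => s
    .Query(q => q.Term("myfield.raw", "endgame"))
);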

I would try an ngram tokenizer: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html
The example below is fairly extreme (it creates 2 and 3 letter tokens) but should give you an idea of how it works:
curl -XPUT 'localhost:9200/test' -d '
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_ngram_analyzer" : {
                    "tokenizer" : "my_ngram_tokenizer"
                }
            },
            "tokenizer" : {
                "my_ngram_tokenizer" : {
                    "type" : "nGram",
                    "min_gram" : "2",
                    "max_gram" : "3",
                    "token_chars": [ "letter", "digit" ]
                }
            }
        }
    }
}'
curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'
Result: FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04
This will allow you to break up your tokens into smaller tokens of a configurable size and search on them. You'll want to play with min_gram and max_gram for your use case.
This can have some memory impact but tends to be a lot faster than a wildcard search that has trailing or leading wildcards (or both). http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-wildcard-query.html
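For comparison, the wildcard approach mentioned above would look roughly like this in NEST (a sketch only; MyDoc is a placeholder class). A leading wildcard cannot use the inverted index efficiently, which is why the ngram approach usually wins:

// Matches "endgame" for the input "game", but has to scan the term
// dictionary, so it can be slow on large indices
var result = client.Search<MyDoc>(s => s
    .Query(q => q.Wildcard("myfield", "*game*"))
);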

Related

How to make a field Keyword type and Text type at the same time to enable Aggregations and free text search simultaneously

We have an application with an ElasticSearch backend. I am trying to enable free text search for some fields and aggregate the same fields. The free text search is expecting the fields to be of the text type and the aggregator is expecting them to be of keyword type. Copy_to doesn't seem to be able to copy the keywords to a text field.
This aggregation works well with the keyword type:
var aggs = await _dataService.Search<Vehicle>(s => s
    .Size(50)
    .Query(q => q.MatchAll())
    .Aggregations(a => a
        .Terms("stockLocation_city", c => c
            .Field(f => f.StockLocation.City)
            .Size(int.MaxValue)
            .ShowTermDocCountError(false)
        )
        .Terms("stockLocation_country", c => c
            .Field(f => f.StockLocation.Country)
            .Size(int.MaxValue)
            .ShowTermDocCountError(false)
        )
    )
);
The schema looks like this:
"stockLocation": {
"type": "object",
"properties": {
"full_text": {
"type": "text",
"analyzer": "custom_analyzer"
},
"city": {
"type": "keyword",
"copy_to":"full_text"
},
"country": {
"type": "keyword",
"copy_to":"full_text"
}
}
}
The query for the full-text search which works with text fields copied to the full_text property:
var queryDescriptor = query.SimpleQueryString(p => p.Query(searchQuery.FreeText));
And the ElasticClient instantiation:
public ElasticSearchService(ElasticSearchOptions elasticSearchOptions)
{
    _elasticSearchOptions = elasticSearchOptions;
    var settings = new ConnectionSettings(
            _elasticSearchOptions.CloudId,
            new Elasticsearch.Net.BasicAuthenticationCredentials(_elasticSearchOptions.UserName, _elasticSearchOptions.Password)
        )
        .DefaultIndex(_elasticSearchOptions.IndexAliasName)
        .DefaultMappingFor<Vehicle>(m => m.IndexName(_elasticSearchOptions.IndexAliasName));
    _elasticClient = new ElasticClient(settings);
}
I have looked in the documentation but haven't seen this particular use case anywhere, so I must be doing something wrong.
How can I enable both aggregation and free text search on the same fields?
Cheers
What you are looking for is the "multi-fields" feature:
Multi-Fields
That way you have the same entry in the document and the engine indexes it twice: once as full text and once as keyword.
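A minimal sketch of what that mapping could look like in NEST 7.x, assuming the Vehicle and StockLocation POCOs from the question (the sub-field name "keyword" is just a convention):

var createResponse = _elasticClient.Indices.Create("vehicles", c => c
    .Map<Vehicle>(m => m
        .Properties(p => p
            .Object<StockLocation>(o => o
                .Name(n => n.StockLocation)
                .Properties(op => op
                    // Indexed twice: analyzed text for free-text search,
                    // plus a "city.keyword" sub-field for terms aggregations
                    .Text(t => t
                        .Name(sl => sl.City)
                        .Fields(f => f
                            .Keyword(k => k.Name("keyword"))
                        )
                    )
                )
            )
        )
    )
);

// Aggregations then target the sub-field:
// .Field(f => f.StockLocation.City.Suffix("keyword"))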

Extracting matching conditions from querystring

An Elasticsearch query is formed from a query string with multiple AND / OR operators, e.g. ((Condition 1 OR Condition 2) AND (Condition 3 OR Condition 4 OR Condition 5)); based on the conditions it returns multiple documents. To find the exact matching conditions, I currently loop through all the resultant documents again and mark the particular conditions. Is there any simple way to get the matching conditions specific to each document?
Can anyone provide a better example using the NEST API?
I think what you need is to highlight the data that made the hit on your query. The highlight functionality of Elasticsearch marks the matching text in each search result so the user can see why the document matched the query; the marked text is returned in the response.
Please refer to the Elasticsearch documentation to understand how this API works, and to the NEST documentation to see how to implement it with the NEST library.
For example, using the Elasticsearch API, consider the example below:
GET /someIndex/someType/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    },
    "highlight": {
        "fields" : {
            "about" : {}
        }
    }
}
The same with Nest:
var result = _client.Search<someIndex>(s => s
    .Query(q => q
        .MatchPhrase(qs => qs
            .OnField(e => e.about)
            .Query("rock climbing")
        )
    )
    .Highlight(h => h
        .OnFields(f => f
            .OnField(e => e.about)
        )
    )
);
The response will be of the below form for each search result (notice the highlight part)
"_score": 0.23013961,
"_source": {
"first_name": "John",
"last_name": "Smith",
"age": 25,
"about": "I love to go rock climbing",
"interests": [ "sports", "music" ]
},
"highlight": {
"about": [
"I love to go <em>rock</em> <em>climbing</em>"
]
}
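On the NEST side, the highlight fragments can then be read off each hit, roughly like this (a sketch; in recent NEST versions the property is called Highlight, while older 1.x versions exposed a Highlights dictionary, so adjust to your client version):

foreach (var hit in result.Hits)
{
    // Fragments are keyed by field name, e.g. "about"
    foreach (var field in hit.Highlight)
    {
        Console.WriteLine($"{field.Key}: {string.Join(" | ", field.Value)}");
    }
}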

Creating an index Nest

How would I recreate the following index using the Elasticsearch NEST API?
Here is the JSON for the index, including the mapping:
{
    "settings": {
        "analysis": {
            "filter": {
                "trigrams_filter": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 3
                }
            },
            "analyzer": {
                "trigrams": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "trigrams_filter"
                    ]
                }
            }
        }
    },
    "mappings": {
        "data": {
            "_all" : { "enabled" : true },
            "properties": {
                "text": {
                    "type": "string",
                    "analyzer": "trigrams"
                }
            }
        }
    }
}
Here is my attempt:
var newIndex = client.CreateIndexAsync(indexName, index => index
    .NumberOfReplicas(replicas)
    .NumberOfShards(shards)
    .Settings(settings => settings
        .Add("merge.policy.merge_factor", "10")
        .Add("search.slowlog.threshold.fetch.warn", "1s")
        .Add("mapping.allow_type_wrapper", true)
    )
    .AddMapping<Object>(mapping => mapping
        .IndexAnalyzer("trigram")
        .Type("string")
    )
);
The documentation does not seem to mention anything about this.
UPDATE:
I found this post that uses
var index = new IndexSettings()
and then adds the analysis settings as a JSON string literal:
index.Add("analysis", @"{json}");
Where can one find more examples like this one, and does this work?
Creating an index in older versions
There are two main ways that you can accomplish this as outlined in the Nest Create Index Documentation:
Here is the way where you directly declare the index settings as fluent dictionary entries, just like you are doing in your example above. I tested this locally and it produces index settings that match your JSON above.
var response = client.CreateIndex(indexName, s => s
    .NumberOfReplicas(replicas)
    .NumberOfShards(shards)
    .Settings(settings => settings
        .Add("merge.policy.merge_factor", "10")
        .Add("search.slowlog.threshold.fetch.warn", "1s")
        .Add("mapping.allow_type_wrapper", true)
        .Add("analysis.filter.trigrams_filter.type", "nGram")
        .Add("analysis.filter.trigrams_filter.min_gram", "3")
        .Add("analysis.filter.trigrams_filter.max_gram", "3")
        .Add("analysis.analyzer.trigrams.type", "custom")
        .Add("analysis.analyzer.trigrams.tokenizer", "standard")
        .Add("analysis.analyzer.trigrams.filter.0", "lowercase")
        .Add("analysis.analyzer.trigrams.filter.1", "trigrams_filter")
    )
    .AddMapping<Object>(mapping => mapping
        .Type("data")
        .AllField(af => af.Enabled())
        .Properties(prop => prop
            .String(sprop => sprop
                .Name("text")
                .IndexAnalyzer("trigrams")
            )
        )
    )
);
Please note that NEST also includes the ability to create index settings using strongly typed classes as well. I will post an example of that later, if I have time to work through it.
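In the meantime, here is a rough sketch of what that strongly typed variant might look like (untested; it assumes the 1.x-era IndexSettings object with an Analysis property, so exact property names may differ between versions):

var settings = new IndexSettings();
settings.NumberOfReplicas = replicas;
settings.NumberOfShards = shards;

// Strongly typed equivalents of the analysis.* dictionary entries above
settings.Analysis.TokenFilters.Add("trigrams_filter", new NgramTokenFilter
{
    MinGram = 3,
    MaxGram = 3
});
settings.Analysis.Analyzers.Add("trigrams", new CustomAnalyzer
{
    Tokenizer = "standard",
    Filter = new[] { "lowercase", "trigrams_filter" }
});

var response = client.CreateIndex(indexName, settings);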
Creating an index with NEST 7.x
Please also note that in NEST 7.x the CreateIndex method was removed. Use Indices.Create instead. Here's an example:
_client.Indices.Create(indexName, s => s
    .Settings(se => se
        .NumberOfReplicas(replicas)
        .NumberOfShards(shards)
        .Setting("merge.policy.merge_factor", "10")
    )
);
In case people have NEST 2.0: .NumberOfReplicas(x).NumberOfShards(y) are in the Settings area now, so specify them within the lambda expression under Settings.
EsClient.CreateIndex("indexname", c => c
    .Settings(s => s
        .NumberOfReplicas(replicasNr)
        .NumberOfShards(shardsNr)
    )
);
NEST 2.0 has a lot of changes and moved things around a bit, so these answers are a great starting point for sure, but you may need to adjust a little for the NEST 2.0 update.
Small example:
EsClient.CreateIndex("indexname", c => c
    .NumberOfReplicas(replicasNr)
    .NumberOfShards(shardsNr)
    .Settings(s => s
        .Add("merge.policy.merge_factor", "10")
        .Add("search.slowlog.threshold.fetch.warn", "15s")
    )
    #region Analysis
    .Analysis(descriptor => descriptor
        .Analyzers(bases => bases
            .Add("folded_word", new CustomAnalyzer()
            {
                Filter = new List<string> { "icu_folding", "trim" },
                Tokenizer = "standard"
            })
        )
        .TokenFilters(i => i
            .Add("engram", new EdgeNGramTokenFilter
            {
                MinGram = 1,
                MaxGram = 20
            })
        )
        .CharFilters(cf => cf
            .Add("drop_chars", new PatternReplaceCharFilter
            {
                Pattern = @"[^0-9]",
                Replacement = ""
            })
        )
    )
    #endregion
    #region Mapping Categories
    .AddMapping<Categories>(m => m
        .Properties(props => props
            .MultiField(mf => mf
                .Name(n => n.Label_en)
                .Fields(fs => fs
                    .String(s => s.Name(t => t.Label_en).Analyzer("folded_word"))
                )
            )
        )
    )
    #endregion
);
In case anyone has migrated to NEST 2.4 and has the same question - you would need to define your custom filters and analyzers in the index settings like this:
elasticClient.CreateIndex(_indexName, i => i
    .Settings(s => s
        .Analysis(a => a
            .TokenFilters(tf => tf
                .EdgeNGram("edge_ngrams", e => e
                    .MinGram(1)
                    .MaxGram(50)
                    .Side(EdgeNGramSide.Front)
                )
            )
            .Analyzers(analyzer => analyzer
                .Custom("partial_text", ca => ca
                    .Filters(new string[] { "lowercase", "edge_ngrams" })
                    .Tokenizer("standard")
                )
                .Custom("full_text", ca => ca
                    .Filters(new string[] { "standard", "lowercase" })
                    .Tokenizer("standard")
                )
            )
        )
    )
);
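The snippet above only defines the analyzers; a field still has to opt into them in its mapping. A possible NEST 2.x mapping (a sketch; MyDocument and Title are placeholder names) pairs the edge-ngram analyzer at index time with the plain analyzer at search time, so the query text itself is not ngrammed:

elasticClient.Map<MyDocument>(m => m
    .Properties(p => p
        .String(s => s
            .Name(d => d.Title)
            .Analyzer("partial_text")      // applied when indexing
            .SearchAnalyzer("full_text")   // applied to the query text
        )
    )
);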
For NEST 7.x and later, you can use the following code to create an index with shards, replicas, and automapping:
if (!_elasticClient.Indices.Exists(_elasticClientIndexName).Exists)
{
    var response = _elasticClient.Indices
        .Create(_elasticClientIndexName, s => s
            .Settings(se => se
                .NumberOfReplicas(1)
                .NumberOfShards(shards)
            )
            .Map<YourDTO>(x => x
                .AutoMap()
                .DateDetection(false)
            )
        );
    if (!response.IsValid)
    {
        // Elasticsearch index status is invalid, log an exception
    }
}
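When the response is invalid, NEST's DebugInformation is usually the quickest way to see what went wrong; it contains the request, the response, and the audit trail (the _logger below is a placeholder for whatever logging abstraction you use):

if (!response.IsValid)
{
    _logger.LogError(response.OriginalException,
        "Index creation failed: {Debug}", response.DebugInformation);
}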

Filter on empty string using ElasticSearch/Nest

This may be a silly question, but how do I filter on an empty string in Elasticsearch using NEST? Specifically, how do I recreate the following result:
curl http://localhost:9200/test/event/_search
{
    "filter" : { "term" : { "target" : "" }}
}
I've tried:
(f => f
    .Term("target", "")
);
which, according to ElasticSearch and Nest filtering, does not work: it is treated as a conditionless query and returns everything, while adding a .Strict() throws a DslException:
(f => f
    .Strict().Term("target", "")
);
I've also tried .Missing() and .Exists() to no avail.
The relevant section of my _mapping for reference:
{
    "event": {
        "dynamic": "false",
        "properties": {
            "target": {
                "type": "string",
                "index": "not_analyzed",
                "store": true,
                "omit_norms": true,
                "index_options": "docs"
            }
        }
    }
}
Any pointers would be greatly appreciated.
As the documentation on NEST and writing queries mentions, you can toggle Strict() mode to trigger exceptions if a part of your query turns out to be conditionless, but if that's what you really wanted then you were stuck, as you've found out.
I just committed a .Verbatim() construct which works exactly like .Strict() but instead of throwing an exception it will take the query as is and render it as specified.
(f => f
    .Verbatim()
    .Term("target", "")
);
This should thus disable the conditionless-query rewrite and insert the query literally as specified.
This will make it into the next version of NEST (after the current version, 0.12.0.0).
I will just remark that you have to use Verbatim() on every query, not just once at the top.
var searchResults = this.Client.Search<Project>(s => s
    .Query(q => q
        //.Verbatim() // no, here it won't work
        .Bool(b => b
            .Should(
                bs => bs.Match(p => p.Query("hello").Field("name").Verbatim()),
                bs => bs.Match(p => p.Query("world").Field("name").Verbatim())
            )
        )
    )
);

How do && and || work constructing queries in NEST?

According to http://nest.azurewebsites.net/concepts/writing-queries.html, the && and || operators can be used to combine two queries using the NEST library to communicate with Elasticsearch.
I have the following query set up:
var ssnQuery = Query<NameOnRecordDTO>.Match(q => q
    .OnField(f => f.SocialSecurityNumber)
    .QueryString(nameOnRecord.SocialSecurityNumber)
    .Fuzziness(0)
);
which is then combined with a Bool query as shown below:
var result = client.Search<NameOnRecordDTO>(body => body
    .Query(query => query
        .Bool(bq => bq
            .Should(
                q => q.Match(p => p
                    .OnField(f => f.Name.First)
                    .QueryString(nameOnRecord.Name.First)
                    .Fuzziness(fuzziness)
                ),
                q => q.Match(p => p
                    .OnField(f => f.Name.Last)
                    .QueryString(nameOnRecord.Name.Last)
                    .Fuzziness(fuzziness)
                )
            )
            .MinimumNumberShouldMatch(2)
        ) || ssnQuery
    )
);
What I think this query means is that if the SocialSecurityNumber matches, or both the Name.First and Name.Last fields match, then the record should be included in the results.
When I execute this query with the following data for the nameOnRecord object used in the calls to QueryString:
{
    "socialSecurityNumber": "123456789",
    "name" : {
        "first": "ryan"
    }
}
the results are the person with SSN 123456789, along with anyone with first name ryan.
If I remove the || ssnQuery from the query above, I get everyone whose first name is 'ryan'.
With the || ssnQuery in place and the following query:
{
    "socialSecurityNumber": "123456789",
    "name" : {
        "first": "ryan",
        "last": "smith"
    }
}
I appear to get the person with SSN 123456789 along with people whose first name is 'ryan' or last name is 'smith'.
So it does not appear that adding || ssnQuery is having the effect that I expected, and I don't know why.
Here is the definition of the index for the object in question:
"nameonrecord" : {
"properties": {
"name": {
"properties": {
"name.first": {
"type": "string"
},
"name.last": {
"type": "string"
}
}
},
"address" : {
"properties": {
"address.address1": {
"type": "string",
"index_analyzer": "address",
"search_analyzer": "address"
},
"address.address2": {
"type": "string",
"analyzer": "address"
},
"address.city" : {
"type": "string",
"analyzer": "standard"
},
"address.state" : {
"type": "string",
"analyzer": "standard"
},
"address.zip" : {
"type" : "string",
"analyzer": "standard"
}
}
},
"otherName": {
"type": "string"
},
"socialSecurityNumber" : {
"type": "string"
},
"contactInfo" : {
"properties": {
"contactInfo.phone": {
"type": "string"
},
"contactInfo.email": {
"type": "string"
}
}
}
}
}
I don't think the definition of the address analyzer is important, since the address fields are not being used in the query, but I can include it if someone wants to see it.
This was in fact a bug in NEST.
A precursor to how NEST helps translate boolean queries:
NEST allows you to use operator overloading to create verbose bool queries/filters easily, e.g. term && term will result in:

bool
    must
        term
        term
A naive implementation of this would rewrite term && term && term to:

bool
    must
        term
        bool
            must
                term
                term
As you can imagine, this becomes unwieldy quite fast as a query grows more complex. NEST can spot these and join them together to become:

bool
    must
        term
        term
        term
Likewise, term && term && term && !term simply becomes:

bool
    must
        term
        term
        term
    must_not
        term
Now, if in the previous example you pass in a boolean query directly, like so:
bool(must=term, term, term) && !term
it would still generate the same query. NEST will also do the same with shoulds when it sees that the boolean descriptors in play consist ONLY of should clauses. This is because the bool query does not quite follow the same boolean logic you expect from a programming language.
To summarize the latter:
term || term || term
becomes:

bool
    should
        term
        term
        term
but
term1 && (term2 || term3 || term4) will NOT become:

bool
    must
        term1
    should
        term2
        term3
        term4
This is because as soon as a boolean query has a must clause, the should clauses start acting as boosting factors. So in the previous case you could get back results that contain ONLY term1; this is clearly not what you want in the strict boolean sense of the input.
NEST therefore rewrites this query to:

bool
    must
        term1
        bool
            should
                term2
                term3
                term4
Now, where the bug came into play: in your situation you have
bool(should=[term1, term2], minimum_should_match=2) || term3
NEST identified that both sides of the OR operation contain only should clauses and joined them together, which gave a different meaning to the minimum_should_match parameter of the first boolean query.
I just pushed a fix for this and it will be included in the next release, 0.11.8.0.
Thanks for catching this one!
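To make the rewrites above concrete, here is roughly what the operator combinations look like in NEST code (an illustrative sketch reusing the question's DTO; each side is a query container that NEST flattens as described):

var first = Query<NameOnRecordDTO>.Match(m => m
    .OnField(f => f.Name.First).QueryString("ryan"));
var last = Query<NameOnRecordDTO>.Match(m => m
    .OnField(f => f.Name.Last).QueryString("smith"));
var ssn = Query<NameOnRecordDTO>.Match(m => m
    .OnField(f => f.SocialSecurityNumber).QueryString("123456789"));

// (first && last)        => bool.must over both match queries
// (first && last) || ssn => bool.should over the two sides
var combined = (first && last) || ssn;

var result = client.Search<NameOnRecordDTO>(s => s.Query(q => combined));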
