How do I build an elastic search query such that each token in a document field is matched? - elasticsearch

I need to make sure that each token of a field is matched by at least one token in a user's search.
This is a generalized example for the sake of simplification.
Let Store_Name = "Square Steakhouse"
It is simple to build a query that matches this document when the user searches for Square or Steakhouse. Furthermore, with a kstem filter attached to the default analyzer, Steakhouses is also likely to match.
{
"size": 30,
"query": {
"match": {
"Store_Name": {
"query": "Square",
"operator": "AND"
}
}
}
}
Unfortunately, I need each token of the Store_Name field to be matched. I need the following behavior:
Query: Square Steakhouse Result: Match
Query: Square Steakhouses Result: Match
Query: Squared Steakhouse Result: Match
Query: Square Result: No Match
Query: Steakhouse Result: No Match
In summary
It is not an option to use not_analyzed, as I do need to take advantage of analyzer features
I intend to use kstem, custom synonyms, a custom char_filter, a lowercase filter, as well as a standard tokenizer
However, I need to make sure that each token of a field is matched
Is this possible in Elasticsearch?

Here is a good method.
It is not perfect, but it is a good compromise in terms of simplicity, computation, and storage.
Index the token count of the field
Obtain the token count of the search text
Perform a filtered query that requires the stored token count of each result to equal the token count of the search text
You will want to use the analyze API in order to get the token count. Make sure to use the same analyzer as the field in question. Here is a VB.NET function to obtain token count:
Private Function GetTokenCount(ByVal RawString As String, Optional ByVal Analyzer As String = "default") As Integer
If Trim(RawString) = "" Then Return 0
Dim client = New ElasticConnection()
Dim result = client.Post("http://localhost:9200/myindex/_analyze?analyzer=" & Analyzer, RawString) 'Submit analyze request using PlainElastic.NET API
Dim J = JObject.Parse(result.ToString()) 'Populate JSON.NET JObject
Return (From X In J("tokens")).Count() 'returns token count using a JSON.NET JObject
End Function
You will want to use this at index-time to store the token count of the field in question. Make sure there is an entry in the mapping for TokenCount
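As a rough sketch of the same counting step in another language, the Python snippet below counts the tokens in an analyze response that has already been parsed into a dictionary. The function name and the sample response are illustrative assumptions; the only thing taken from the answer is that the response carries a "tokens" array.

```python
import json


def count_tokens(analyze_response: dict) -> int:
    """Return the number of tokens in a parsed _analyze response."""
    return len(analyze_response.get("tokens", []))


# Hypothetical response body, shaped like the output of the analyze API.
sample = json.loads("""
{"tokens": [
  {"token": "square", "position": 0},
  {"token": "steakhouse", "position": 1}
]}
""")

print(count_tokens(sample))  # 2
```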
Here is an Elasticsearch query that uses this new token count information:
{
"size": 30,
"query": {
"filtered": {
"query": {
"match": {
"MyField": {
"query": "[query]",
"operator": "AND"
}
}
},
"filter": {
"term": {
"TokenCount": [tokencount]
}
}
}
}
}
Replace [query] with the search terms
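Replace [tokencount] with the number of tokens in the search terms (using the GetTokenCount function above)
This makes sure that every token in MyField is matched: the AND operator requires every search token to match, and the equal token counts rule out documents that contain additional, unmatched tokens.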
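The filtered query used above was removed in Elasticsearch 5.0; on newer versions the same idea is usually written as a bool query with a filter clause. The Python sketch below builds such a body as a plain dictionary. The function name is hypothetical, and the field names are simply the ones used in this answer.

```python
import json


def build_all_tokens_query(field, count_field, text, token_count, size=30):
    """Build a search body that requires every query term to match and
    the stored token count to equal the token count of the query."""
    return {
        "size": size,
        "query": {
            "bool": {
                # All analyzed query terms must be present in the field.
                "must": {
                    "match": {field: {"query": text, "operator": "and"}}
                },
                # The stored count must equal the query's token count.
                "filter": {"term": {count_field: token_count}},
            }
        },
    }


body = build_all_tokens_query("MyField", "TokenCount", "Square Steakhouses", 2)
print(json.dumps(body, indent=2))
```

The token count passed in is assumed to come from the analyze API, using the same analyzer as the field.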
There are some drawbacks to the above. For example, if we are matching the field "blue red", and the user searches for "blue blue", the above will trigger a match. So, you may want to use a unique token filter. You may also wish to adjust the filter so that
Reference
Clinton Gormley inspired the solution

Related

Is there a way to get ElasticSearch to create n-gram tokens from truncated field?

Documents contain a url field with a full url. Users should be able to search for documents containing a given url by supplying a portion of the url string. The search string can be 3-15 characters long. An N-gram token filter with min_gram of 3 and max_gram of 15 would work but generates a large number of tokens for long urls. Is it possible to have ElasticSearch only generate tokens for the first 100 characters of the url field?
For example, the user should be able to search for documents containing the following url using a search string such as 'example.com' or '/foo/bar'.
https://click.example.com/foo/bar/55gft/?qs=1952934d0ee8e2368ec7f7a921e3c6202b39365b9a2d26774c8122b8555ca21fce9d2344fc08a8ba40caede5e6901a112c6e89ead40892109eb8290d70571eab
There are two ways to achieve what you want.
Option 1: Keep using ngrams as you do now, but insert a truncate token filter before the ngram filter, so that the url is cut to 100 characters before the n-grams are generated.
Option 2: Use the wildcard field type, which has been created exactly for cases like this.
In your index, you should first change the type of the URL field to wildcard:
PUT test
{
"mappings": {
"properties": {
"url": {
"type": "wildcard"
}
}
}
}
Then, you can search on that field, using the wildcard query, like this:
POST test/_search
{
"query": {
"wildcard": {
"url": "*foo/bar*"
}
}
}
Also, read the related blog post, which shows in detail how the wildcard field type performs.
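Option 1 could be sketched roughly as follows. The Python snippet builds index settings as a dictionary and adds a small local approximation of what the filter chain would emit. The analyzer and filter names are hypothetical, the keyword tokenizer is an assumption (it keeps the whole url as one token so that truncate can cut it), and the index.max_ngram_diff setting is included because recent versions reject an ngram filter whose max_gram and min_gram differ by more than the configured limit.

```python
def build_truncated_ngram_settings(max_chars=100, min_gram=3, max_gram=15):
    """Index settings: keyword tokenizer, then truncate, then ngram."""
    return {
        "settings": {
            "index": {"max_ngram_diff": max_gram - min_gram},
            "analysis": {
                "filter": {
                    # Hypothetical filter names, chosen for this sketch.
                    "url_truncate": {"type": "truncate", "length": max_chars},
                    "url_ngram": {
                        "type": "ngram",
                        "min_gram": min_gram,
                        "max_gram": max_gram,
                    },
                },
                "analyzer": {
                    "url_prefix_ngrams": {
                        "type": "custom",
                        "tokenizer": "keyword",
                        "filter": ["lowercase", "url_truncate", "url_ngram"],
                    }
                },
            },
        }
    }


def simulate_tokens(url, max_chars=100, min_gram=3, max_gram=15):
    """Local approximation of the chain above, for illustration only."""
    text = url.lower()[:max_chars]
    return [
        text[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(text) - n + 1)
    ]


# Made-up long url: a real-looking prefix followed by padding and a tail.
url = "https://click.example.com/foo/bar/55gft/?qs=" + "0" * 120 + "tail"
tokens = simulate_tokens(url)
print(len(tokens), "/foo/bar" in tokens, "tail" in tokens)
```

With this setup only the first 100 characters produce n-grams, so a search string that occurs only further along in the url would not match.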

Elasticsearch query filter/term not working when special characters are involved

The following query does not work when "metadata.name" contains "-", as in "demo-application-child3". If I remove the "-" and query for "demoapplicationchild3", it works. The same happens with the other field, metadata.version. I have data for both demoapplicationchild3 and demo-application-child3. Suggestions, please.
{
"query": {
"bool": {
"filter": [
{"term": { "metadata.name": "demo-application-child3" }},
{"term": { "metadata.version": "00.00.100" }}]
}
}
}
term queries are not analyzed; the official documentation states this clearly:
Returns documents that contain an exact term in a provided field.
This suggests that at index time you are using a custom analyzer that removes the - and joins the tokens, i.e. for demo-application-child3 your custom analyzer would generate the token demoapplicationchild3. You can easily confirm this using the Analyze API.
To get results, either change the term query to a match query, or use the .keyword suffix on your field if the mapping was generated dynamically, or create another field of type keyword, which is not analyzed.
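The two query-side alternatives could look roughly like this, built here as Python dictionaries. The function names are hypothetical, and the .keyword subfields are an assumption that only holds if the mapping was generated dynamically (or such subfields were defined explicitly).

```python
def exact_filter_query(name, version):
    """term filters against keyword subfields: the value is compared as-is."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"metadata.name.keyword": name}},
                    {"term": {"metadata.version.keyword": version}},
                ]
            }
        }
    }


def analyzed_filter_query(name, version):
    """match clauses: the value is run through the field's analyzer first."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"match": {"metadata.name": {"query": name, "operator": "and"}}},
                    {"match": {"metadata.version": {"query": version, "operator": "and"}}},
                ]
            }
        }
    }


print(exact_filter_query("demo-application-child3", "00.00.100"))
print(analyzed_filter_query("demo-application-child3", "00.00.100"))
```

The match variant is looser than an exact comparison, since it accepts any document whose analyzed field contains all of the analyzed query tokens.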

Custom score for exact, phonetic and fuzzy matching in elasticsearch

I have a requirement where there needs to be custom scoring on name. To keep it simple lets say, if I search for 'Smith' against names in the index, the logic should be:
if input = exact 'Smith' then score = 100%
else
if input = phonetic match then
score = <depending upon fuzziness match of input with name>%
end if
end if;
I'm able to search documents with a fuzziness of 1 but I don't know how to give it custom score depending upon how fuzzy it is. Thanks!
Update:
I went through a post that had the same requirement as mine and it was mentioned that the person solved it by using native scripts. My question still remains, how to actually get the score based on the similarity distance such that it can be used in the native scripts:
The post for reference:
https://discuss.elastic.co/t/fuzzy-query-scoring-based-on-levenshtein-distance/11116
The text to look for in the post:
"For future readers I solved this issue by creating a custom score query and
writing a (native) script to handle the scoring."
You can implement this search logic using the function_score query (docs here).
Here is a possible example:
{
"query": {
"function_score": {
"query": { "match": {
"input": "Smith"
} },
"boost": "5",
"functions": [
{
"filter": { "match": { "input.keyword": "Smith" } },
"weight": 23
}
]
}
}
}
In this example we have a mapping with the input field indexed both as text and as keyword (input.keyword is for exact matches). Documents that exactly match the term "Smith" are re-scored higher than the other documents matched by the first query (in the example it is a match query, but in your case it will be the query with fuzziness).
You can control the re-score effect by tuning the weight parameter.
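For the fuzzy case mentioned above, a variant of the query might be built like this in Python. It is only a sketch: the function name, the default weight and the choice of a term filter on input.keyword are assumptions, and it boosts exact matches rather than deriving a score from the edit distance itself.

```python
def build_name_score_query(name, exact_weight=10, fuzziness=1):
    """Fuzzy match on the text field, with extra weight for exact matches."""
    return {
        "query": {
            "function_score": {
                # Base query: tolerate small spelling differences.
                "query": {
                    "match": {"input": {"query": name, "fuzziness": fuzziness}}
                },
                "functions": [
                    {
                        # Applied only to documents whose keyword value
                        # equals the input exactly.
                        "filter": {"term": {"input.keyword": name}},
                        "weight": exact_weight,
                    }
                ],
                # Multiply the base score by the function result.
                "boost_mode": "multiply",
            }
        }
    }


print(build_name_score_query("Smith"))
```

Documents that match only through fuzziness keep their base relevance score, while exact matches have it multiplied by the weight.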

In Elasticsearch match query how to deal with slash

I have a match query searching for a type of doc:
{
"query": {
"bool": {
"should": {
"match": {
"ph1_enc": "EAAQnb1kMr/e2/ADqo"
}
}
}
}
}
"EAAQnb1kMr/e2/ADqo" is the string I'm trying to match; however, in the search results I can see that multiple records containing the substring "/e2/" are also returned.
It looks like "/e2/" is indexed separately, which would explain this. I thought the match query does a full-text match... Is it because I missed something when creating the template? Any ideas?
Add-on: instead of reindexing, how can I modify the query to match the exact value?
Which analyzer do you set in the mapping to index your data?
If you are using the default one (the standard analyzer), then according to the documentation it uses the standard tokenizer, which also splits the text on the slash ('/'). The documentation redirects here for more information about the tokenizer.
So, the following words will be indexed: 'EAAQnb1kMr', 'e2', and 'ADqo'. Your query value is analyzed the same way the field was indexed. That is why documents with 'e2' are also being returned.
If you don't need to tokenize the 'ph1_enc' field, you can just set its type in the mapping as 'keyword'.
"properties": {
"ph1_enc": {
"type": "keyword"
}
}
The field will then not be analyzed, and it will match only the exact value when you query.
I hope that it helps.
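To illustrate the difference, here is a small local simulation in Python. The regular expression is only a rough stand-in for the standard analyzer (it splits on anything that is not a letter or digit and lowercases the result), and the second field value is made up for the example.

```python
import re


def rough_standard_tokens(text):
    """Very rough approximation of the standard analyzer."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]


def match_like(query, field_value):
    """Mimic a default match query: any shared token counts as a hit."""
    shared = set(rough_standard_tokens(query)) & set(rough_standard_tokens(field_value))
    return bool(shared)


def keyword_like(query, field_value):
    """Mimic a keyword field: only the identical string is a hit."""
    return query == field_value


query = "EAAQnb1kMr/e2/ADqo"
other = "ZZZZ/e2/YYYY"  # hypothetical document value sharing only 'e2'

print(rough_standard_tokens(query))   # ['eaaqnb1kmr', 'e2', 'adqo']
print(match_like(query, other))       # True
print(keyword_like(query, other))     # False
```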

Performing an AND query in elastic search

I have tried looking for another solution to this, but the Bool query in ES seems to not do quite what I am looking for. Or I am just not using it correctly.
In our current implementation of search we are trying to boost performance/reduce memory footprint of each query by changing our query logic. Today, if you search for "The Red Ball" you may get back 5 million documents because ES returns any document that matches "the" OR "red" OR "ball" which means we get back WAAAAAY too many irrelevant documents (mostly because of the "the" term). I would like to change our query to instead use AND so ES would return only documents that match "the" AND "red" AND "ball".
I am using the NEST Client to do this with C# so an example using the client would be best since that seems to be where I cannot figure out what to do. Thanks
You can simply use the query string query with the AND operator (remove default_field if you want to search on all fields).
{
"query": {
"query_string": {
"default_field": "your_field",
"query": "the red ball",
"default_operator": "AND"
}
}
}
or simply
{
"query": {
"query_string": {
"query": "the AND red AND ball"
}
}
}
I do not know C#, but this is how it might look in NEST (everyone, feel free to edit):
client.Search<your_index>(q => q
.Query(qu => qu
.QueryString(qs=>qs
.OnField(x=>your_field).Query("the AND red AND ball")
)
)
);
I found the appropriate query to make using the NEST client:
SearchDescriptor<BackupEntitySearchDocument> desc = new SearchDescriptor<BackupEntitySearchDocument>();
desc.Query(qq => qq.MultiMatch(m => m.OnFields(_searchFields).Query(query).Operator(Operator.And)));
var searchResp = await _client.SearchAsync<BackupEntitySearchDocument>(desc).ConfigureAwait(false);
Where _searchFields is a List<string> containing the fields to match on and query is the term to search for.
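For readers not using NEST, the request body that such a multi-match call corresponds to can be sketched as a Python dictionary. The function name and the field names below are hypothetical.

```python
import json


def build_and_query(text, fields):
    """multi_match body in which every term of the query must match."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": list(fields),
                "operator": "and",
            }
        }
    }


body = build_and_query("the red ball", ["title", "description"])
print(json.dumps(body, indent=2))
```

Note that with the default multi_match type the operator is applied per field, so all terms have to occur within a single field for a document to match.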

Resources