How to search exact text without matching case in Elasticsearch - spring-boot

I want to search for a user name in Elasticsearch. I need to match the exact user name while ignoring its case, whether it is upper or lower case; I just want to find that user name. I'm using the following query for this:
QueryBuilder queryBuilder = QueryBuilders.termQuery("user_name.keyword", userName);
NativeSearchQuery build = new NativeSearchQueryBuilder().withQuery(queryBuilder).build();
List<User> users = elasticsearchTemplate.queryForList(build, User.class);
But it only matches the exact word including its case. For example: if the user name is "Ram" and I search for "ram", it does not return that name; if I search for "Ram", it does. I want it to match only the word, not the case of that word. Please, someone, help me solve this problem. I searched a lot but couldn't find any solution.

The issue is that you are using the user_name.keyword field with a term query. A term query matches the exact term (including its case); instead, you can use a match query (MatchQueryBuilder):
Code:
QueryBuilder queryBuilder = QueryBuilders.matchQuery("user_name", userName);
NativeSearchQuery build = new NativeSearchQueryBuilder().withQuery(queryBuilder).build();
List<User> users = elasticsearchTemplate.queryForList(build, User.class);
When you use the .keyword field, Elasticsearch does not analyze the text, but when you use the text field, Elasticsearch analyzes it with the default analyzer for that field. The default standard analyzer tokenizes the text on word boundaries and lowercases the tokens (it can also be configured to remove stop words). You can read about it here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html
Since you want a case-insensitive search, you don't need to use .keyword.
Also, a term query matches exact terms, but since you want a case-insensitive search you should use a match query, which by default analyzes (and therefore lowercases) your search text before searching the field.
Now that both your field and your search term are lowercased you get a case-insensitive search, but it is no longer an exact (whole-value) match.
For an exact, case-insensitive match you need to update your index and use a normalizer on your keyword field. A normalizer guarantees that the analysis chain produces a single token, which gives you an exact match that is also case-insensitive. You can read more about it in the Elasticsearch documentation on normalizers.
Index Creation:
curl -X PUT "localhost:9200/<index-name>" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "normalizer": {
        "case_insensitive_normalizer": {
          "type": "custom",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "user_name": {
        "type": "keyword",
        "normalizer": "case_insensitive_normalizer"
      }
    }
  }
}
'
I have indexed these documents:
Doc 1:
{
  "user_name": "Ram"
}
Doc 2:
{
  "user_name": "Ram Mohan"
}
Search Query:
{
  "query": {
    "match": {
      "user_name": "ram"
    }
  }
}
Result:
"hits": [
  {
    "_source": {
      "user_name": "Ram"
    }
  }
]
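With that index in place, the Spring side of the original question barely changes. A minimal sketch (assuming the index was recreated with the case_insensitive_normalizer above so that user_name is now a keyword field; the normalizer should be applied to the query term as well, so the case of userName no longer matters):
// Exact, whole-value match that ignores case thanks to the normalizer on user_name
QueryBuilder queryBuilder = QueryBuilders.termQuery("user_name", userName);
NativeSearchQuery build = new NativeSearchQueryBuilder().withQuery(queryBuilder).build();
List<User> users = elasticsearchTemplate.queryForList(build, User.class);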

Try using the lowercase token filter in your index mapping.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lowercase-tokenfilter.html
Such a token filter is applied at both index and search time, so "Ram" will be indexed as "ram", and if you then search for "rAm" it will also be converted to "ram" and will hit your document.
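For completeness, a rough Java sketch of creating such an index with the high-level REST client. The index name, analyzer name, and mapping below are assumptions for illustration, not part of the original answer (and the question itself used ElasticsearchTemplate rather than the REST client directly):
import java.io.IOException;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.common.xcontent.XContentType;

public class LowercaseIndexSetup {
    // Creates an index whose user_name field is analyzed with a custom analyzer
    // that applies the lowercase token filter at both index and search time.
    static void createIndex(RestHighLevelClient client) throws IOException {
        String body = "{"
            + "\"settings\": {\"analysis\": {\"analyzer\": {"
            + "\"lowercase_analyzer\": {\"type\": \"custom\", \"tokenizer\": \"standard\", \"filter\": [\"lowercase\"]}"
            + "}}},"
            + "\"mappings\": {\"properties\": {"
            + "\"user_name\": {\"type\": \"text\", \"analyzer\": \"lowercase_analyzer\"}"
            + "}}}";
        client.indices().create(new CreateIndexRequest("users").source(body, XContentType.JSON),
                RequestOptions.DEFAULT);
    }
}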

If you want to do a case-insensitive match on a keyword field, you can use a normalizer with a lowercase filter.
The normalizer property of keyword fields is similar to analyzer
except that it guarantees that the analysis chain produces a single
token.
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "normalizer": "lowercase_normalizer"
          }
        }
      }
    }
  }
}
Data
POST index41/_doc
{
"name":"Ram"
}
Query:
{
  "query": {
    "term": {
      "name.keyword": {
        "value": "ram"
      }
    }
  }
}
Result:
"hits" : [
{
"_index" : "index41",
"_type" : "_doc",
"_id" : "IyieGHIBZsF59xnAhb47",
"_score" : 0.6931471,
"_source" : {
"name" : "Ram"
}
}
]

You can simply use a text field for your user_name field. A text field uses the standard analyzer by default, which lowercases the tokens, and a match query applies the same analyzer that was used at index time (in this case, standard), which gives you a case-insensitive search.
Tokens generated using the standard analyzer
POST /_analyze
{
  "text": "ram",
  "analyzer": "standard"
}
Response:
{
  "tokens": [
    {
      "token": "ram",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}

Related

Elasticsearch term query issue

As the pictures show, the value of the dispatchvoucher field in the record is "True".
But when I searched for it with a term query, it couldn't find any record.
When I changed the value to "true", the result matched. What's the reason for this?
As mentioned in the documentation:
Avoid using the term query for text fields.
By default, Elasticsearch changes the values of text fields as part of
analysis. This can make finding exact matches for text field values
difficult.
To search text field values, use the match query instead.
The standard analyzer is the default analyzer which is used if none is specified. It provides grammar-based tokenization.
GET /_analyze
{
  "analyzer": "standard",
  "text": "True"
}
The token generated is:
{
  "tokens": [
    {
      "token": "true",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
A term query returns documents that contain an exact term in the provided field. Since "True" gets tokenized to "true", when you use a term query for "dispatchvoucher": "True" it will not return any results.
You can either change your index mapping to
{
  "mappings": {
    "properties": {
      "dispatchvoucher": {
        "type": "keyword"
      }
    }
  }
}
Or you can add .keyword to the dispatchvoucher field. This uses the keyword analyzer instead of the standard analyzer (notice the .keyword after the dispatchvoucher field).
Adding a working example with index data, search query, and search result
Index Data:
{
"dispatchvoucher": "True"
}
Search Query:
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "dispatchvoucher.keyword": "True"
        }
      }
    }
  }
}
Search Result:
"hits": [
{
"_index": "65605120",
"_type": "_doc",
"_id": "1",
"_score": 0.0,
"_source": {
"dispatchvoucher": "True"
}
}
]

Elasticsearch highlighter false positives

I am using an nGram tokenizer in ES 6.1.1 and getting some weird highlights:
multiple adjacent character ngram highlights are not merged into one
tra is incorrectly highlighted in doc 9
The query auftrag matches documents 7 and 9 as expected, but in doc 9 betrag is highlighted incorrectly. That's a problem with the highlighter: if the problem were with the query, doc 8 would also have been returned.
Example code
#!/usr/bin/env bash
# Example based on
# https://www.elastic.co/guide/en/elasticsearch/guide/current/ngrams-compound-words.html
# with suggestions from from
# https://github.com/elastic/elasticsearch/issues/21000
# DELETE INDEX IF EXISTS
curl -sS -XDELETE 'localhost:9200/my_index'
printf '\n-------------\n'
# CREATE NEW INDEX
curl -sS -XPUT 'localhost:9200/my_index?pretty' -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "trigrams": {
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "3",
          "max_gram": "3",
          "token_chars": [
            "letter",
            "digit",
            "symbol",
            "punctuation"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "trigrams",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}
'
printf '\n-------------\n'
# POPULATE INDEX
curl -sS -XPOST 'localhost:9200/my_index/my_type/_bulk?pretty' -H 'Content-Type: application/json' -d'
{ "index": { "_id": 7 }}
{ "text": "auftragen" }
{ "index": { "_id": 8 }}
{ "text": "betrag" }
{ "index": { "_id": 9 }}
{ "text": "betrag auftragen" }
'
printf '\n-------------\n'
sleep 1 # Give ES time to index
# QUERY
curl -sS -XGET 'localhost:9200/my_index/my_type/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "text": {
        "query": "auftrag",
        "minimum_should_match": "100%"
      }
    }
  },
  "highlight": {
    "fields": {
      "text": {
        "fragment_size": 120,
        "type": "fvh"
      }
    }
  }
}
'
The hits I get are (abbreviated):
"hits" : [
{
"_id" : "9",
"_source" : {
"text" : "betrag auftragen"
},
"highlight" : {
"text" : [
"be<em>tra</em>g <em>auf</em><em>tra</em>gen"
]
}
},
{
"_id" : "7",
"_source" : {
"text" : "auftragen"
},
"highlight" : {
"text" : [
"<em>auf</em><em>tra</em>gen"
]
}
}
]
I have tried various workarounds, such as using the unified/fvh highlighter and setting all options that seemed relevant, but no luck. Any hints are greatly appreciated.
The problem here is not with highlighting but with how you are using the nGram analyzer.
First of all, when you configure the mapping this way:
"mappings": {
"my_type": {
"properties": {
"text": {
"type" : "text",
"analyzer" : "trigrams",
"term_vector": "with_positions_offsets"
}
}
}
}
you are telling Elasticsearch that you want to use it both for the indexed text and for the search term. In your case this simply means that:
your text from document 9 = "betrag auftragen" is split into trigrams, so in the index you have something like: [bet, etr, tra, rag, auf, uft, ftr, tra, rag, age, gen]
your text from document 7 = "auftragen" is split into trigrams, so in the index you have something like: [auf, uft, ftr, tra, rag, age, gen]
your search term = "auftrag" is also split into trigrams and Elasticsearch sees it as: [auf, uft, ftr, tra, rag]
In the end Elasticsearch matches all the trigrams from the search with those from your index, and because of this you have 'auf' and 'tra' highlighted separately. 'uft', 'ftr', and 'rag' also match, but they overlap 'auf' and 'tra' and are not highlighted.
First, you need to tell Elasticsearch that you do not want to split the search term into grams. All you need to do is add a search_analyzer property to your mapping:
"mappings": {
"my_type": {
"properties": {
"text": {
"type" : "text",
"analyzer" : "trigrams",
"search_analyzer": "standard",
"term_vector" : "with_positions_offsets"
}
}
}
}
Now the words of a search term are treated by the standard analyzer as separate words, so in your case it will be just "auftrag".
But this single change alone will not help you. It will even break the search, because "auftrag" does not match any trigram in your index.
Now you need to improve your nGram tokenizer by increasing max_gram:
"tokenizer": {
"my_ngram_tokenizer": {
"type": "nGram",
"min_gram": "3",
"max_gram": "10",
"token_chars": [
"letter",
"digit",
"symbol",
"punctuation"
]
}
}
This way the texts in your index will be split into 3-grams, 4-grams, 5-grams, 6-grams, 7-grams, 8-grams, 9-grams, and 10-grams. Among the 7-grams you will find "auftrag", which is your search term.
After these two improvements, the highlighting in your search result should look as below:
"betrag <em>auftrag</em>en"
for document 9 and:
"<em>auftrag</em>en"
for document 7.
This is how ngrams and highlighting work together. I know that the ES documentation says:
It usually makes sense to set min_gram and max_gram to the same value. The smaller the length, the more documents will match but the lower the quality of the matches. The longer the length, the more specific the matches. A tri-gram (length 3) is a good place to start.
This is true. For performance reasons you need to experiment with this configuration, but I hope I have explained how it works.
I have the same problem here: with an ngram (trigram) tokenizer I got an incomplete highlight like:
query with `match`: samp
field data: sample
result highlight: <em>sam</em>ple
expected highlight: <em>samp</em>le
Using match_phrase together with the fvh highlight type, when the field's term_vector is set to with_positions_offsets, may produce the correct highlight:
<em>samp</em>le
I hope this can help you, since you do not need to change the tokenizer or increase max_gram.
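For what it's worth, a minimal Java sketch of that match_phrase + fvh combination (the index and field names follow the example above; the client is an Elasticsearch high-level REST client, and this only illustrates the suggestion, it is not a verified fix):
import java.io.IOException;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;

public class PhraseHighlightExample {
    // match_phrase requires the query grams to appear in order at consecutive
    // positions, and the fvh highlighter uses the stored term vectors
    // (with_positions_offsets), which may merge the highlighted grams into one span.
    static SearchResponse search(RestHighLevelClient client) throws IOException {
        SearchSourceBuilder source = new SearchSourceBuilder()
            .query(QueryBuilders.matchPhraseQuery("text", "samp"))
            .highlighter(new HighlightBuilder().field("text").highlighterType("fvh"));
        return client.search(new SearchRequest("my_index").source(source), RequestOptions.DEFAULT);
    }
}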
But my problem is that I want to use simple_query_string, which does not support phrase queries for the default field; the only way is to wrap the string in quotes, like "samp". Since there is some logic in the query string, I can't do that for the users, nor require the users to do it themselves.
The solution from @piotr-pradzynski may not help me either, as I have a lot of data and increasing max_gram would lead to a lot of additional storage usage.

Elasticsearch : Completion suggester not working with whitespace Analyzer

I am new to Elasticsearch and I am trying to create a demo of the completion suggester with the whitespace analyzer.
As per the documentation of the whitespace analyzer, it breaks text into terms whenever it encounters a whitespace character. So my question is: does it work with the completion suggester too?
For my completion suggester prefix "ela", I am expecting the output "Hello elastic search."
I know an easy solution for this is to add the input as separate words:
"suggest": {
"input": ["Hello","elastic","search"]
}
However, if this is the solution, then what is the point of using an analyzer? Does an analyzer make sense for the completion suggester at all?
My mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "completion_analyzer": {
          "type": "custom",
          "filter": [
            "lowercase"
          ],
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "my-type": {
      "properties": {
        "mytext": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "suggest": {
          "type": "completion",
          "analyzer": "completion_analyzer",
          "search_analyzer": "completion_analyzer",
          "max_input_length": 50
        }
      }
    }
  }
}
My document:
{
  "_index": "my-index",
  "_type": "my-type",
  "_id": "KTWJBGEBQk_Zl_sQdo9N",
  "_score": 1,
  "_source": {
    "mytext": "dummy text",
    "suggest": {
      "input": "Hello elastic search."
    }
  }
}
Search request:
{
  "suggest": {
    "test-suggest" : {
      "prefix" : "ela",
      "completion" : {
        "field" : "suggest",
        "skip_duplicates": true
      }
    }
  }
}
This search is not returning the correct output, but if I use the prefix 'hel' I get the correct output: "Hello elastic search."
In brief, I would like to know whether the whitespace analyzer works with the completion suggester, and if there is a way to make it work, please suggest it.
PS: I have already looked at these links but didn't find a useful answer:
ElasticSearch completion suggester Standard Analyzer not working
What Elasticsearch Analyzer to use for this completion suggester?
I found this link useful: Word-oriented completion suggester (ElasticSearch 5.x). However, they do not use the completion suggester there.
Thanks in advance.
Jimmy
The completion suggester cannot perform full-text queries, which means that it cannot return suggestions based on words in the middle of a multi-word field.
From ElasticSearch itself:
The reason is that an FST query is not the same as a full text query. We can't find words anywhere within a phrase. Instead, we have to start at the left of the graph and move towards the right.
As you discovered, the best alternative to the completion suggester that can match the middle of fields is an edge n-gram filter.
I know this question is ages old, but have you tried having multiple suggestions, one based on a prefix and the next one based on a regex?
Something like:
{
  "suggest": {
    "test-suggest-exact" : {
      "prefix" : "ela",
      "completion" : {
        "field" : "suggest",
        "skip_duplicates": true
      }
    },
    "test-suggest-regex" : {
      "regex" : ".*ela.*",
      "completion" : {
        "field" : "suggest",
        "skip_duplicates": true
      }
    }
  }
}
Use results from the second suggest when the first one is empty. The good thing is that meaningful phrases are returned by the Elasticsearch suggest.
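If you are building the request from Java, a rough sketch of issuing both suggestions in one call might look like this (the suggestion names and field mirror the example above; the fallback is plain application code, not an Elasticsearch feature):
import java.io.IOException;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.suggest.SuggestBuilder;
import org.elasticsearch.search.suggest.SuggestBuilders;

public class SuggestWithFallback {
    // Ask for an exact prefix suggestion and a regex suggestion in one request.
    static SearchResponse suggest(RestHighLevelClient client, String text) throws IOException {
        SuggestBuilder suggest = new SuggestBuilder()
            .addSuggestion("test-suggest-exact",
                SuggestBuilders.completionSuggestion("suggest").prefix(text).skipDuplicates(true))
            .addSuggestion("test-suggest-regex",
                SuggestBuilders.completionSuggestion("suggest").regex(".*" + text + ".*").skipDuplicates(true));
        SearchRequest request = new SearchRequest("my-index")
            .source(new SearchSourceBuilder().suggest(suggest));
        // Caller: read "test-suggest-exact" from the response first; if it has
        // no options, fall back to "test-suggest-regex".
        return client.search(request, RequestOptions.DEFAULT);
    }
}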
A shingle-based approach, using a full query search and then aggregating based on search terms, sometimes gives broken phrases that are contextually wrong. I can write more if you are interested.

How to handle wildcards in elastic search structured queries

My use case requires querying our Elasticsearch domain with trailing wildcards. I wanted to get your opinion on best practices for handling such wildcards in the queries.
Do you think adding the following clauses is a good practice for the queries?
"query" : {
"query_string" : {
"query" : "attribute:postfix*",
"analyze_wildcard" : true,
"allow_leading_wildcard" : false,
"use_dis_max" : false
}
}
I've disallowed leading wildcards since they are a heavy operation. However, I wanted to know how analyzing the wildcard for every query request behaves in the long run. My understanding is that analyze_wildcard has no impact if the query doesn't actually contain any wildcards. Is that correct?
If you have the possibility of changing your mapping type and index settings, the right way to go is to create a custom analyzer with an edge-n-gram token filter that would index all prefixes of the attribute field.
curl -XPUT http://localhost:9200/your_index -d '{
  "settings": {
    "analysis": {
      "filter": {
        "edge_filter": {
          "type": "edgeNGram",
          "min_gram": 1,
          "max_gram": 15
        }
      },
      "analyzer": {
        "attr_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "edge_filter"]
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "attribute": {
          "type": "string",
          "analyzer": "attr_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}'
Then, when you index a document, an attribute field value such as postfixing will be indexed as the following tokens: p, po, pos, post, postf, postfi, postfix, postfixi, postfixin, postfixing.
Finally, you can then easily query the attribute field for the postfix value using a simple match query like this. No need to use an under-performing wildcard in a query string query.
{
  "query": {
    "match" : {
      "attribute" : "postfix"
    }
  }
}

How to search with keyword analyzer?

I have the keyword analyzer as the default analyzer, like so:
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"default": {
"type": "keyword"
}}}}}}
But now I can't search for anything, e.g.:
{
"query": {
"query_string": {
"query": "cast"
}}}
This gives me 0 results although "cast" is a common value in the indexed documents (http://gist.github.com/baelter/b0720a52ee5a27e27d3a).
Searching for "*" works fine, by the way.
I only have explicit defaults in my mapping:
{
  "oceanography_point": {
    "_all" : {
      "enabled" : true
    },
    "properties" : {}
  }
}
The index behaves as if no fields are included in _all, because field:value queries work fine.
Am I misusing the keyword analyzer?
Using the keyword analyzer, you can only do an exact string match.
Let's assume that you have used the keyword analyzer and no filters.
In that case, for a string indexed as "Cast away in forest", neither a search for "cast" nor for "away" will work. You need to search for the exact string "Cast away in forest" to match it. (Assuming no lowercase filter is used, you need to give the right case too.)
A better approach would be to use multi-fields to declare one copy as keyword-analyzed and the other one analyzed normally.
You can then search on one of these fields and aggregate on the other.
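A minimal sketch of such a multi-field mapping, written in current Elasticsearch syntax (the index and field names here are only examples; in the older version used in this question the analyzed copy would be a string field and the exact copy a string field with index: not_analyzed):
import java.io.IOException;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.PutMappingRequest;
import org.elasticsearch.common.xcontent.XContentType;

public class MultiFieldMapping {
    // title is analyzed for full-text search; title.raw is an exact,
    // unanalyzed copy for exact matches and aggregations/facets.
    static void putMapping(RestHighLevelClient client) throws IOException {
        String body = "{"
            + "\"properties\": {"
            + "\"title\": {\"type\": \"text\","
            + "\"fields\": {\"raw\": {\"type\": \"keyword\"}}}"
            + "}}";
        client.indices().putMapping(
            new PutMappingRequest("my_index").source(body, XContentType.JSON),
            RequestOptions.DEFAULT);
    }
}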
OK, after some 15 hours of trial and error I can conclude that this works for search:
{
"settings": {
"index": {
"analysis": {
"tokenizer": {
"default": {
"type": "keyword"
}}}}}}
However, this breaks faceting, so I ended up using a dynamic template instead:
"dynamic_templates" : [
{
"strings_not_analyzed" : {
"match" : "*",
"match_mapping_type" : "string",
"mapping" : {
"type" : "string",
"index" : "not_analyzed"
}
}
}
],
