Elasticsearch highlighter false positives - elasticsearch

I am using an nGram tokenizer in ES 6.1.1 and getting some weird highlights:
multiple adjacent character ngram highlights are not merged into one
tra is incorrectly highlighted in doc 9
The query auftrag matches documents 7 and 9 as expected, but in doc 9 betrag is highlighted incorrectly. That's a problem with the highlighter: if the problem were with the query, doc 8 would also have been returned.
Example code
#!/usr/bin/env bash
# Example based on
# https://www.elastic.co/guide/en/elasticsearch/guide/current/ngrams-compound-words.html
# with suggestions from
# https://github.com/elastic/elasticsearch/issues/21000
# DELETE INDEX IF EXISTS
curl -sS -XDELETE 'localhost:9200/my_index'
printf '\n-------------\n'
# CREATE NEW INDEX
curl -sS -XPUT 'localhost:9200/my_index?pretty' -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "trigrams": {
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "3",
          "max_gram": "3",
          "token_chars": [
            "letter",
            "digit",
            "symbol",
            "punctuation"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "trigrams",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}
'
printf '\n-------------\n'
# POPULATE INDEX
curl -sS -XPOST 'localhost:9200/my_index/my_type/_bulk?pretty' -H 'Content-Type: application/json' -d'
{ "index": { "_id": 7 }}
{ "text": "auftragen" }
{ "index": { "_id": 8 }}
{ "text": "betrag" }
{ "index": { "_id": 9 }}
{ "text": "betrag auftragen" }
'
printf '\n-------------\n'
sleep 1 # Give ES time to index
# QUERY
curl -sS -XGET 'localhost:9200/my_index/my_type/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "text": {
        "query": "auftrag",
        "minimum_should_match": "100%"
      }
    }
  },
  "highlight": {
    "fields": {
      "text": {
        "fragment_size": 120,
        "type": "fvh"
      }
    }
  }
}
'
The hits I get are (abbreviated):
"hits" : [
{
"_id" : "9",
"_source" : {
"text" : "betrag auftragen"
},
"highlight" : {
"text" : [
"be<em>tra</em>g <em>auf</em><em>tra</em>gen"
]
}
},
{
"_id" : "7",
"_source" : {
"text" : "auftragen"
},
"highlight" : {
"text" : [
"<em>auf</em><em>tra</em>gen"
]
}
}
]
I have tried various workarounds, such as using the unified/fvh highlighter and setting all options that seemed relevant, but no luck. Any hints are greatly appreciated.

The problem here is not with highlighting but with how you are using the nGram analyzer.
First of all, when you configure the mapping this way:
"mappings": {
"my_type": {
"properties": {
"text": {
"type" : "text",
"analyzer" : "trigrams",
"term_vector": "with_positions_offsets"
}
}
}
}
you are telling Elasticsearch that you want to use the trigrams analyzer for both the indexed text and the search term. In your case, this means that:
the text from document 9 = "betrag auftragen" is split into trigrams, so in the index you have something like: [bet, etr, tra, rag, auf, uft, ftr, tra, rag, age, gen]
the text from document 7 = "auftragen" is split into trigrams, so in the index you have something like: [auf, uft, ftr, tra, rag, age, gen]
the search term = "auftrag" is also split into trigrams, and Elasticsearch sees it as: [auf, uft, ftr, tra, rag]
In the end, Elasticsearch matches all the trigrams from the search term against those from your index, and because of this you have 'auf' and 'tra' highlighted separately. 'uft', 'ftr', and 'rag' also match, but they overlap 'auf' and 'tra' and are not highlighted separately.
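You can verify this with the _analyze API; a quick check of what the trigrams analyzer produces for the search term might look like this (a sketch, using the index and analyzer names from the question):
curl -sS -XGET 'localhost:9200/my_index/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
  "analyzer": "trigrams",
  "text": "auftrag"
}
'
# Expected tokens: auf, uft, ftr, tra, rag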
First, you need to tell Elasticsearch that you do not want to split the search term into grams. All you need to do is add a search_analyzer property to your mapping:
"mappings": {
"my_type": {
"properties": {
"text": {
"type" : "text",
"analyzer" : "trigrams",
"search_analyzer": "standard",
"term_vector" : "with_positions_offsets"
}
}
}
}
Now the words from the search term are treated by the standard analyzer as separate words, so in your case it will be just "auftrag".
But this single change alone will not help you. It will even break the search, because "auftrag" does not match any trigram in your index.
Next, you need to improve your nGram tokenizer by increasing max_gram:
"tokenizer": {
"my_ngram_tokenizer": {
"type": "nGram",
"min_gram": "3",
"max_gram": "10",
"token_chars": [
"letter",
"digit",
"symbol",
"punctuation"
]
}
}
This way the texts in your index will be split into 3-grams, 4-grams, 5-grams, 6-grams, 7-grams, 8-grams, 9-grams, and 10-grams. Among the 7-grams you will find "auftrag", which is your search term.
After these two improvements, the highlighting in your search results should look like this:
"betrag <em>auftrag</em>en"
for document 9 and:
"<em>auftrag</em>en"
for document 7.
This is how ngrams and highlighting work together. I know the ES documentation says:
It usually makes sense to set min_gram and max_gram to the same value. The smaller the length, the more documents will match but the lower the quality of the matches. The longer the length, the more specific the matches. A tri-gram (length 3) is a good place to start.
This is true. For performance reasons you will need to experiment with this configuration, but I hope I have explained how it works.
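For reference, here is a sketch of the full index definition with both changes applied (same names as in the question; depending on your Elasticsearch version you may also need to raise index.max_ngram_diff to allow the spread between min_gram and max_gram):
curl -sS -XPUT 'localhost:9200/my_index?pretty' -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "max_ngram_diff": 7
    },
    "analysis": {
      "analyzer": {
        "trigrams": {
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "3",
          "max_gram": "10",
          "token_chars": ["letter", "digit", "symbol", "punctuation"]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "trigrams",
          "search_analyzer": "standard",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}
'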

I have the same problem here: with an ngram (trigram) tokenizer, I got an incomplete highlight like:
query with `match`: samp
field data: sample
result highlight: <em>sam</em>ple
expected highlight: <em>samp</em>le
Using match_phrase with the fvh highlighter type, and with the field's term_vector set to with_positions_offsets, may produce the correct highlight:
<em>samp</em>le
I hope this can help you, as you do not need to change the tokenizer or increase max_gram.
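A minimal sketch of what that match_phrase + fvh query might look like against the index from the original question (the text field there already has term_vector set to with_positions_offsets, which the fvh highlighter requires):
curl -sS -XGET 'localhost:9200/my_index/my_type/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_phrase": {
      "text": "auftrag"
    }
  },
  "highlight": {
    "fields": {
      "text": {
        "type": "fvh"
      }
    }
  }
}
'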
But my problem is that I want to use simple_query_string, which does not support a phrase query for the default field query; the only way is to wrap the string in quotes, like "samp". Since there is some logic in the query string, I can't do that for users, and I can't require users to do it either.
The solution from #piotr-pradzynski may not help me, as I have a lot of data and increasing max_gram would lead to a lot of extra storage usage.

Related

Elasticsearch - searching for punctuation terms over both text and keyword fields

Using Elasticsearch 7, I'm trying to use a simple query string query for searches over different fields, both text and keyword. Here's a minimal, reproducible example to show the initial setup and the problem:
mapping.json:
{
"dynamic": false,
"properties": {
"publicId": {
"type": "keyword"
},
"eventDate": {
"type": "date",
"format": "yyyy-MM-dd",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"name": {
"type": "text"
}
}
}
test-data1.json:
{
"publicId": "a1b2c3",
"eventDate": "2022-06-10",
"name": "Research & Development"
}
test-data2.json:
{
"publicId": "d4e5f6",
"eventDate": "2021-05-11",
"name": "F.inance"
}
Create index on ES running on localhost:19200:
#!/bin/bash -e
host=${1-localhost:19200}
dir=$( dirname `readlink -f $0` )
mapping=$(<${dir}/mapping.json);
param="{ \"mappings\": $mapping}"
curl -XPUT "http://${host}/test/" -H 'Content-Type: application/json' -d "$param"
curl -XPOST "http://${host}/test/_doc/a1b2c3" -H 'Content-Type: application/json' -d #${dir}/test-data1.json
curl -XPOST "http://${host}/test/_doc/d4e5f6" -H 'Content-Type: application/json' -d #${dir}/test-data2.json
Now the task is to support searches like "Research & Development", "Research & Development 2022-06-10", "Finance" (note the removed dot) or simply "a1b2c3". For example using a query like this:
{
"from": 0,
"size": 20,
"query": {
"bool": {
"must": [
{
"simple_query_string": {
"query": "Research & Development 2022-06-10",
"fields": [
"publicId^1.0",
"eventDate.keyword^1.0",
"name^1.0"
],
"flags": -1,
"default_operator": "and",
"analyze_wildcard": false,
"auto_generate_synonyms_phrase_query": true,
"fuzzy_prefix_length": 0,
"fuzzy_max_expansions": 50,
"fuzzy_transpositions": true,
"boost": 1.0
}
}
],
"adjust_pure_negative": true,
"boost": 1.0
}
},
"version": true
}
The problem with this setup is that the standard analyzer for the text field, which removes most punctuation, of course also removes the ampersand character. The simple query string query splits the query into three tokens [research, &, development] and searches over all fields using the and operator. There are two matches ("Research" and "Development") for the name text field, but no matches for the ampersand in any field. Thus the result is empty.
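You can see this by running the text through the _analyze API; with the standard analyzer the ampersand simply disappears (a sketch, using the host from the example):
curl -XGET "http://localhost:19200/test/_analyze?pretty" -H 'Content-Type: application/json' -d '{
  "analyzer": "standard",
  "text": "Research & Development"
}'
# tokens: [research, development] - the "&" is dropped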
Now I came up with a solution to add a second field for name with a different analyzer, the whitespace analyzer, that doesn't remove punctuation:
{
"dynamic": false,
"properties": {
"publicId": {
"type": "keyword"
},
"eventDate": {
"type": "date",
"format": "yyyy-MM-dd",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"name": {
"type": "text",
"fields": {
"whitespace": {
"type": "text",
"analyzer": "whitespace"
}
}
}
}
}
This way all searches work, including "Finance", which matches "F.inance" in the name field. Also, "Research & Development" matches in both the name field and name.whitespace, but most crucially "&" matches in name.whitespace and therefore a result is returned.
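With that mapping, the fields list in the query above would presumably also include the new subfield, e.g.:
"fields": [
  "publicId^1.0",
  "eventDate.keyword^1.0",
  "name^1.0",
  "name.whitespace^1.0"
]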
My question now is: given the fact that the real setup includes many more fields and a lot of data, adding an additional field and therefore indexing most terms in the same way twice seems quite heavy. Is there a way to only index analyzed terms to name.whitespace that differ from the standard analyzer's terms of name, i.e. that are not in the "parent" field? E.g. "Research & Development" results in the terms [research, development] for name and [research, development, &] for name.whitespace - ideally it would only index [&] for name.whitespace.
Or is there a more elegant/performant solution for this particular problem altogether?
I guess you can define a dynamic property mapping for all string fields and use the whitespace analyzer, since your use case requires searching on non-standard tokens. In addition, you can explicitly map those fields where you don't need the whitespace tokenizer.
This would ensure that explicitly mapped fields are analyzed using the standard analyzer while the others (dynamic or unmapped fields) are analyzed using whitespace, thus reducing complexity, field duplication, etc.
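A rough sketch of what such a mapping might look like (the template name and match rule are illustrative, not from the original setup): the explicitly mapped fields keep their types, while any other string field picks up the whitespace-analyzed text mapping.
{
  "dynamic": true,
  "dynamic_templates": [
    {
      "strings_as_whitespace": {
        "match_mapping_type": "string",
        "mapping": {
          "type": "text",
          "analyzer": "whitespace"
        }
      }
    }
  ],
  "properties": {
    "publicId": { "type": "keyword" },
    "eventDate": {
      "type": "date",
      "format": "yyyy-MM-dd",
      "fields": { "keyword": { "type": "keyword" } }
    }
  }
}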

Elasticsearch autocomplete searching middle word

I've been stuck on this for a while.
How can I get Elasticsearch suggestions to complete my word even when I type a middle term?
For example, in my data I have "Alan Turing is great", and when I start typing "turi" I would like to see the suggestion "Alan Turing is great".
I am using Elasticsearch 6.3.2, and I have tried queries similar to these:
curl -X GET "http://127.0.0.1:9200/my_index/_search" -H 'Content-Type: application/json' -d '{"_source":false,"suggest":{"show-suggest":{"prefix":"turi","completion":{"field":"auto_suggest"}}}}'
or
curl -X GET "http://127.0.0.1:9200/my_index/_search" -H 'Content-Type: application/json' -d '{"_source":false,"suggest":{"show-suggest":{"text":"turi","completion":{"field":"auto_suggest"}}}}'
but it only works if I search for "alan", and it shows all the terms.
index:
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 4,
"token_chars": [
"letter",
"digit"
]
}
}
}
"mappings": {
"poielement": {
"numeric_detection": false,
"date_detection": false,
"dynamic_templates": [
{
"suggestions": {
"match": "suggest_*",
"mapping": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "my_analyzer",
"copy_to": "auto_suggest",
"store": true
}
}
},
{
"property": {
"match": "*",
"mapping": {
"analyzer": "my_analyzer",
"search_analyzer": "my_analyzer"
}
}
}
],
"properties": {
"auto_suggest": {
"type": "completion"
},
"name_suggest": {
"type": "completion"
}
}
}
}
We had a very similar use case and this is how we solved it. What you are looking for is substring search.
Create a custom substring analyzer for your field like below; the Java (Lucene) code for it is:
// Whitespace-tokenize, lowercase, then emit all substrings of at least minSize characters
TokenStream result = new WhitespaceTokenizer(SearchManager.LUCENE_VERSION_301, reader);
result = new LowerCaseFilter(SearchManager.LUCENE_VERSION_301, result);
result = new SubstringFilter(result, minSize); // custom filter, not part of Lucene
return result;
In the above code, I first use the WhitespaceTokenizer and then pass the stream through a LowerCaseFilter and then through my custom SubstringFilter, which is configurable based on the minimum number of characters you want in your tokens.
The above code will generate the tokens shown below for a string like helloworld if you set the minimum substring length to 3.
Here is a public URL with the tokens it generates for the string helloworld with a minimum substring length of 3 (it generates a lot of tokens):
https://justpaste.it/4i6gh
You can also test the tokens produced by your custom analyzer using the _analyze API, https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
http://localhost:9200/jaipur/_analyze?text=helloworld&analyzer=substring
Here jaipur is my index name and helloworld is the string for which I want to generate tokens using the substring analyzer.
EDIT
As suggested by Nishant in the comments, you can use the built-in ngram token filter that Elasticsearch provides instead of the custom substring filter.
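A sketch of what that might look like as index settings (the names and gram sizes are illustrative; depending on your version you may also need to raise index.max_ngram_diff for a large min_gram/max_gram spread):
{
  "settings": {
    "analysis": {
      "filter": {
        "substring_filter": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10
        }
      },
      "analyzer": {
        "substring_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "substring_filter"]
        }
      }
    }
  }
}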

Elasticsearch : Completion suggester not working with whitespace Analyzer

I am new to Elasticsearch and I am trying to create a demo of the completion suggester with the whitespace analyzer.
As per the documentation of the whitespace analyzer, it breaks text into terms whenever it encounters a whitespace character. So my question is: does it work with the completion suggester too?
So for my completion suggester prefix "ela", I am expecting the output "Hello elastic search."
I know an easy solution for this is to provide multiple inputs, as in:
"suggest": {
"input": ["Hello","elastic","search"]
}
However, if this is the solution, then what is the point of using an analyzer? Does an analyzer make sense for the completion suggester?
My mapping :
{
"settings": {
"analysis": {
"analyzer": {
"completion_analyzer": {
"type": "custom",
"filter": [
"lowercase"
],
"tokenizer": "whitespace"
}
}
}
},
"mappings": {
"my-type": {
"properties": {
"mytext": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"suggest": {
"type": "completion",
"analyzer": "completion_analyzer",
"search_analyzer": "completion_analyzer",
"max_input_length": 50
}
}
}
}
}
My document :
{
"_index": "my-index",
"_type": "my-type",
"_id": "KTWJBGEBQk_Zl_sQdo9N",
"_score": 1,
"_source": {
"mytext": "dummy text",
"suggest": {
"input": "Hello elastic search."
}
}
}
Search request :
{
"suggest": {
"test-suggest" : {
"prefix" :"ela",
"completion" : {
"field" : "suggest",
"skip_duplicates": true
}
}
}
}
This search does not return the correct output, but if I use prefix = 'hel' I get the correct output: "Hello elastic search."
In brief, I would like to know: does the whitespace analyzer work with the completion suggester?
And if there is a way to make it work, can you please suggest it?
PS: I have already looked at these links but I didn't find a useful answer.
ElasticSearch completion suggester Standard Analyzer not working
What Elasticsearch Analyzer to use for this completion suggester?
I found this link useful: Word-oriented completion suggester (ElasticSearch 5.x). However, they do not use the completion suggester.
Thanks in advance.
Jimmy
The completion suggester cannot perform full-text queries, which means that it cannot return suggestions based on words in the middle of a multi-word field.
From ElasticSearch itself:
The reason is that an FST query is not the same as a full text query. We can't find words anywhere within a phrase. Instead, we have to start at the left of the graph and move towards the right.
As you discovered, the best alternative to the completion suggester that can match the middle of fields is an edge n-gram filter.
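A rough sketch of that alternative, using a plain text field with an edge n-gram analyzer instead of a completion field (all names and gram sizes here are illustrative):
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  },
  "mappings": {
    "my-type": {
      "properties": {
        "suggest_text": {
          "type": "text",
          "analyzer": "autocomplete",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
With such a mapping, a match query for "ela" on suggest_text would match "Hello elastic search.", because the edge n-grams of "elastic" (el, ela, elas, ...) are indexed.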
I know this question is ages old, but have you tried having multiple suggestions, one based on prefix and the next one based on regex?
Something like
{
"suggest": {
"test-suggest-exact" : {
"prefix" :"ela",
"completion" : {
"field" : "suggest",
"skip_duplicates": true
}
},
"test-suggest-regex" : {
"regex" :".*ela.*",
"completion" : {
"field" : "suggest",
"skip_duplicates": true
}
}
}
}
Use the results from the second suggest when the first one is empty. The good thing is that meaningful phrases are returned by the Elasticsearch suggester.
A shingle-based approach, using a full query search and then aggregating based on search terms, sometimes gives broken phrases which are contextually wrong. I can write more if you are interested.

Elasticsearch: match exact keywords with special characters

I am storing tags as an array of keywords:
...
Tags: {
type: "keyword"
},
...
Resulting in arrays like this:
Tags: [
"windows",
"opengl",
"unicode",
"c++",
"c",
"cross-platform",
"makefile",
"emacs"
]
I thought that since I am using the keyword type I could easily do exact term searches, as it is not supposed to use any analyzer.
Apparently I was wrong! This gives me results:
body.query.bool.must.push({term: {"_all": "c"}}); # 38 results
But this doesn't:
body.query.bool.must.push({term: {"_all": "c++"}}); # 0 results
Although there are obviously instances of this tag, as seen above.
If I use body.query.bool.must.push({match: {"_all": search}}); instead (using match instead of term) then "c" and "c++" return the exact same results, which is wrong as well.
The problem here is that you are using the _all field, which uses an analyzer (standard by default). Run a small test with your data to confirm:
Test 1:
curl -X POST http://127.0.0.1:9200/script/test/_search \
-d '{
"query": {
"term" : { "_all": "c++"}
}
}'
Test 2:
curl -X POST http://127.0.0.1:9200/script/test/_search \
-d '{
"query": {
"term" : { "tags": "c++"}
}
}'
In my test the second query returns documents, the first does not.
Do you really need to search across multiple fields? If so, you can override the default analyzer of the _all field - for a quick test I created an index with settings like this:
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"test" : {
"_all" : {"type" : "string", "index" : "not_analyzed", "analyzer" : "keyword"},
"properties": {
"tags": {
"type": "keyword"
}
}
}
}
}
Or you can create a custom _all field.
Solutions like the multi-field query, which allow you to define a list of fields to be searched over, would rather behave like your example with body.query.bool.must.push({match: {"_all": search}});.
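A sketch of the custom "_all"-style approach using copy_to (the all_tags field name is illustrative): the tags keep their keyword type and their values are also copied, unanalyzed, into a separate keyword field that you can run term queries against.
{
  "mappings": {
    "test": {
      "properties": {
        "tags": {
          "type": "keyword",
          "copy_to": "all_tags"
        },
        "all_tags": {
          "type": "keyword"
        }
      }
    }
  }
}
A term query such as {"term": {"all_tags": "c++"}} should then match exactly, special characters included.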

How to handle wildcards in elastic search structured queries

My use case requires querying our Elasticsearch domain with trailing wildcards. I wanted to get your opinion on best practices for handling such wildcards in queries.
Do you think adding the following clauses is a good practice for the queries:
"query" : {
"query_string" : {
"query" : "attribute:postfix*",
"analyze_wildcard" : true,
"allow_leading_wildcard" : false,
"use_dis_max" : false
}
}
I've disallowed leading wildcards since they are a heavy operation. However, I wanted to know how well analyzing wildcards for every query request performs in the long run. My understanding is that analyze_wildcard has no impact if the query doesn't actually contain any wildcards. Is that correct?
If you have the possibility of changing your mapping type and index settings, the right way to go is to create a custom analyzer with an edge-n-gram token filter that would index all prefixes of the attribute field.
curl -XPUT http://localhost:9200/your_index -d '{
"settings": {
"analysis": {
"filter": {
"edge_filter": {
"type": "edgeNGram",
"min_gram": 1,
"max_gram": 15
}
},
"analyzer": {
"attr_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "edge_filter"]
}
}
}
},
"mappings": {
"your_type": {
"properties": {
"attribute": {
"type": "string",
"analyzer": "attr_analyzer",
"search_analyzer": "standard"
}
}
}
}
}'
Then, when you index a document, the attribute field value (e.g. postfixing) will be indexed as the following tokens: p, po, pos, post, postf, postfi, postfix, postfixi, postfixin, postfixing.
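You can sanity-check the generated tokens with the _analyze API (a sketch, using the index and analyzer names from the mapping above; the exact request format depends on your Elasticsearch version):
curl -XGET 'http://localhost:9200/your_index/_analyze?analyzer=attr_analyzer&text=postfixing&pretty'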
Finally, you can then easily query the attribute field for the postfix value using a simple match query like this. No need to use an under-performing wildcard in a query string query.
{
"query": {
"match" : {
"attribute" : "postfix"
}
}
}
