Elasticsearch: Is there a way to exclude synonyms from highlighting?

I'm trying to exclude synonyms from highlighting. I created a copy of my current analyzer, this time with a synonym filter, so each field now has both an analyzer and a search_analyzer. The search analyzer is the new analyzer with all the same filters plus the synonym filter.
Any ideas? I am using Elasticsearch 5.2.
Mapping:
"mappings": {
"doc": {
"properties": {
"body": {
"type": "text",
"analyzer": "custom_analyzer",
"search_analyzer": "custom_analyzer_with_synonyms",
"fields": {
"plain": {
"type": "text",
"analyzer": "standard"
}
}
}
}
}
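(The analysis settings behind this mapping are not shown in the question; they would look roughly like the sketch below, where the tokenizer, filter chain, and synonym rules are all assumptions.)
"settings": {
  "analysis": {
    "filter": {
      "my_synonyms": {
        "type": "synonym",
        "synonyms": ["car, automobile"]
      }
    },
    "analyzer": {
      "custom_analyzer": {
        "tokenizer": "standard",
        "filter": ["lowercase"]
      },
      "custom_analyzer_with_synonyms": {
        "tokenizer": "standard",
        "filter": ["lowercase", "my_synonyms"]
      }
    }
  }
}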
Search Query:
{
  "query": {
    "match": {
      "body": "something"
    }
  },
  "highlight": {
    "pre_tags": "<strong>",
    "post_tags": "</strong>",
    "fields": {
      "body.plain": {
        "number_of_fragments": 1,
        "require_field_match": false
      }
    }
  }
}

I am not sure about the reason behind the problem. I'd have thought that simply highlighting on a non-synonym-analyzed field would have done it, but according to the comments it still highlights the synonyms. There are two possible reasons I can think of (I haven't looked into the highlighter source code):
It could be the multi-word synonym problem described in this link: https://www.elastic.co/guide/en/elasticsearch/guide/current/multi-word-synonyms.html The link is old, so this may have been fixed since; if not, it could be causing the highlighter to look at the wrong position offsets.
And/or it could be because the highlighted field is not used in the query: the highlighter might simply take the tokens emitted by the searched field's analyzer (which include synonyms) and look for those tokens in the highlighted field.
If it's the first problem, you could try changing your synonyms to use simple contraction. See: https://www.elastic.co/guide/en/elasticsearch/guide/current/synonyms-expand-or-contract.html#synonyms-contraction But contraction has its own problems with the frequencies of uncommon words and can be a lot of work.
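For illustration, a contracted rule maps every variant onto a single term inside the rule itself, so only that term is indexed (the rule below is a made-up example):
"filter": {
  "my_synonyms": {
    "type": "synonym",
    "synonyms": [
      "sea biscuit, sea biscit => seabiscuit"
    ]
  }
}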
The fix for the second case would be to use the body.plain field in the query, but you can't do that because it would affect your scores. Instead, specify a different query for the highlighter (so that scores are not affected) on the non-synonym field; that does the trick. It works even if the first case is the problem too, since we are not using synonyms in the highlight field.
So your query should look something like this:
{
  "query": {
    "match": {
      "body": "something"
    }
  },
  "highlight": {
    "pre_tags": "<strong>",
    "post_tags": "</strong>",
    "fields": {
      "body.plain": {
        "number_of_fragments": 1,
        "highlight_query": {
          "match": { "body.plain": "something" }
        }
      }
    }
  }
}
See: https://www.elastic.co/guide/en/elasticsearch/reference/5.4/search-request-highlighting.html#_highlight_query
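As a side note, you can check what each analyzer actually emits with the _analyze API; if the synonym-expanded tokens only show up for the search analyzer, the highlight_query approach above should behave as expected (the index name here is a placeholder):
GET /my_index/_analyze
{
  "analyzer": "custom_analyzer_with_synonyms",
  "text": "something"
}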

Related

Why does a fuzzy query return a match but a match query with fuzziness doesn't on the same input?

I created the following index in Elasticsearch:
PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "3_5_edgegrams"]
        }
      },
      "filter": {
        "3_5_edgegrams": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 10
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Then I inserted the following document:
{
  "name": "Nuvus Gro Corp"
}
When I make the following query (let's call it fuzzy_query):
GET /my-index/_search
{
  "query": {
    "fuzzy": {
      "name": {
        "value": "qnuv"
      }
    }
  }
}
I get a match for the above document.
When I make the query (let's call the query match_with_fuzziness):
GET /my-index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "qnuv",
        "fuzziness": "AUTO"
      }
    }
  }
}
I don't get a match. If I make the following query:
GET /my-index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "nuvq",
        "fuzziness": "AUTO"
      }
    }
  }
}
I again get a match. I don't understand why when I make the match_with_fuzziness query I don't get any matches.
EDIT: I analyzed the queries with Kibana Profiler and according to the profiler match_with_fuzziness is a SynonymQuery Synonym(name:qnu name:qnuv) query while fuzzy_query is a BoostQuery (name:nuv)^0.6666666
Very similar problem to the one explained in your other question.
The problem is that you haven't specified a search_analyzer, so at search time qnuv and nuvq also get analyzed by my_analyzer and edge-ngramed as well, hence the matches you're receiving.
If we check the first query: since you're using the fuzzy query (a term-level query whose input is not analyzed), qnuv (the search term) will match nuv (the first indexed edge-ngramed token) with an edit distance of 1 (i.e. the leading q is "tolerated"), which is what the fuzzy query does by default with "fuzziness": "AUTO".
In the third query, nuv (the first edge-ngramed token of the search term) will match nuv (the first indexed edge-ngramed token).
The case of the second query is a bit special; here is how the fuzziness parameter works in the context of match queries:
Fuzzy matching is not applied to terms with synonyms or in cases where the analysis process produces multiple tokens at the same position. Under the hood these terms are expanded to a special synonym query that blends term frequencies, which does not support fuzzy expansion.
The last part is what applies to your case: since the search term qnuv is analyzed by my_analyzer, it produces the two tokens qnu and qnuv at the same position, and that does not support fuzzy matching.
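You can see this with the _analyze API; with the mapping above, the search term is edge-ngramed into two tokens at the same position:
GET /my-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "qnuv"
}
This returns the tokens qnu and qnuv, both at position 0, which triggers the synonym-query code path instead of fuzzy expansion.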
You need to change your mapping to this one instead and it will work the way you expect, i.e. all three queries will return your document:
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "standard" <---- add this line
}
}
}
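To verify the fix, run the same term through the standard analyzer; it stays a single token, so fuzzy expansion applies again:
GET /my-index/_analyze
{
  "analyzer": "standard",
  "text": "qnuv"
}
This returns the single token qnuv, which the match query can then expand with "fuzziness": "AUTO".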

Only getting results when Elasticsearch is case-sensitive

I currently have a problem with my multi-match search in ES.
It's as simple as this: if I search for the city "Sachsen", I get results; if I search for "sachsen" (lowercase), I get no results.
How do I avoid this?
QUERY with no results
{
  "match": {
    "City": {
      "query": "sachsen"
    }
  }
}
My analyzer is analyzer_keyword. Should I add anything?
MAPPING
City: {
  type: "string",
  analyzer: "analyzer_keyword"
}
Your analyzer_keyword analyzer is most probably of type keyword, which means you can only perform exact matches on it.
It's standard practice to index multiple "variants" of a field: one that matches lowercased, possibly ASCII-folded tokens (think München -> munchen), and one that is not tokenized in any way (which is what you have in your analyzer_keyword).
Since you intend to search the lowercase version of Sachsen, your mapping could look something like
PUT sachsen
{
  "mappings": {
    "properties": {
      "City": {
        "type": "keyword", <----
        "fields": {
          "standard": {
            "type": "text",
            "analyzer": "standard" <----
          }
        }
      }
    }
  }
}
After indexing a doc
POST sachsen/_doc
{
  "City": "Sachsen"
}
The following will work for exact matches:
GET sachsen/_search
{
  "query": {
    "match": {
      "City": "Sachsen"
    }
  }
}
and this for lowercase
GET sachsen/_search
{
  "query": {
    "match": {
      "City.standard": "sachsen"
    }
  }
}
Note that I'm using the default standard analyzer here, but you can choose any analyzer you deem appropriate.
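If you also want exact matches to be case-insensitive (rather than going through the analyzed sub-field), a keyword normalizer is another option; a minimal sketch, assuming a lowercase-only normalizer is enough for your data (the index and normalizer names are made up):
PUT sachsen_ci
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "City": {
        "type": "keyword",
        "normalizer": "lowercase_normalizer"
      }
    }
  }
}
With this mapping, both "Sachsen" and "sachsen" match the same keyword value.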

Elasticsearch query_string (full-text search) term that can't be found

We have the document below. It can't be found by searching for financialmarkets, but it can be found by searching for industry_icon_financialmarkets.png. Can anyone tell me the reason?
content is a text type field.
document:
{
  "title": "test",
  "content": "industry_icon_financialmarkets.png"
}
Query:
{
  "from": 0,
  "size": 2,
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "\"industry_icon_financialmarkets.png\""
          }
        }
      ]
    }
  }
}
The default analyzer for a text field is standard, which won't break industry_icon_financialmarkets into tokens using _ as a delimiter. I would suggest using the simple analyzer instead, which breaks text into terms whenever it encounters a character that is not a letter.
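You can confirm the difference with the _analyze API, since the simple analyzer splits on every non-letter character:
POST _analyze
{
  "analyzer": "simple",
  "text": "industry_icon_financialmarkets.png"
}
This returns the tokens industry, icon, financialmarkets, and png, so a search for financialmarkets can now match.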
You can also add a sub-field of type keyword to retain the original value.
So the mapping of the field should be:
{
  "content": {
    "type": "text",
    "analyzer": "simple",
    "fields": {
      "keyword": {
        "type": "keyword"
      }
    }
  }
}
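With that mapping in place, both of the following should match the document (a sketch, assuming the index is called my_index):
GET my_index/_search
{
  "query": {
    "match": { "content": "financialmarkets" }
  }
}
GET my_index/_search
{
  "query": {
    "term": { "content.keyword": "industry_icon_financialmarkets.png" }
  }
}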
When creating the index, you should define your own mapping for each field based on its type to get the expected results.
Mapping
PUT relevance
{"mapping":{"ID":{"type":"long"},"title":
{"type":"keyword","analyzer":"my_analyzer"},
"content":
{"type":"string","analyzer":"my_analyzer","search_analyzer":"my_analyzer"}},
"settings":
{"analysis":
{"analyzer":
{"my_analyzer":
{"tokenizer":"my_tokenizer"}},
"tokenizer":
{"my_tokenizer":
{"type":"ngram","min_gram":3,"max_gram":30,"token_chars":
["letter","digit"]
}
}
},"number_of_shards":5,"number_of_replicas":2
}
}
Then start inserting documents,
POST relevance/_doc/1
{
  "name": "1elastic",
  "content": "working fine" // replace special characters with spaces in your program before inserting into the ES index
}
Query
GET relevance/_search
{"size":20,"query":{"bool":{"must":[{"match":{"content":
{"query":"fine","fuzziness":1}}}]}}}

Elasticsearch : Completion suggester not working with whitespace Analyzer

I am new to Elasticsearch and I am trying to create a demo of the completion suggester with the whitespace analyzer.
As per the documentation of the whitespace analyzer, it breaks text into terms whenever it encounters a whitespace character. So my question is: does it work with the completion suggester too?
For my completion suggester prefix "ela", I expect the output to be "Hello elastic search."
I know an easy solution for this is to provide multiple inputs:
"suggest": {
"input": ["Hello","elastic","search"]
}
However, if this is the solution, then what is the point of using an analyzer? Does an analyzer make sense for the completion suggester?
My mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "completion_analyzer": {
          "type": "custom",
          "filter": ["lowercase"],
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "my-type": {
      "properties": {
        "mytext": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "suggest": {
          "type": "completion",
          "analyzer": "completion_analyzer",
          "search_analyzer": "completion_analyzer",
          "max_input_length": 50
        }
      }
    }
  }
}
My document:
{
  "_index": "my-index",
  "_type": "my-type",
  "_id": "KTWJBGEBQk_Zl_sQdo9N",
  "_score": 1,
  "_source": {
    "mytext": "dummy text",
    "suggest": {
      "input": "Hello elastic search."
    }
  }
}
Search request:
{
  "suggest": {
    "test-suggest": {
      "prefix": "ela",
      "completion": {
        "field": "suggest",
        "skip_duplicates": true
      }
    }
  }
}
This search does not return the expected output, but if I use the prefix "hel" I get the correct result: "Hello elastic search."
In brief, I would like to know whether the whitespace analyzer works with the completion suggester, and if there is a way to make it work, please suggest one.
PS: I have already looked at these links but didn't find a useful answer:
ElasticSearch completion suggester Standard Analyzer not working
What Elasticsearch Analyzer to use for this completion suggester?
I found this link useful: Word-oriented completion suggester (ElasticSearch 5.x). However, they did not use the completion suggester.
Thanks in advance.
Jimmy
The completion suggester cannot perform full-text queries, which means that it cannot return suggestions based on words in the middle of a multi-word field.
From ElasticSearch itself:
The reason is that an FST query is not the same as a full text query. We can't find words anywhere within a phrase. Instead, we have to start at the left of the graph and move towards the right.
As you discovered, the best alternative to the completion suggester that can match the middle of fields is an edge n-gram filter.
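A minimal sketch of that alternative, assuming matching from the start of each word is enough (the index name, field name, and gram sizes are made up):
PUT suggest-index
{
  "settings": {
    "analysis": {
      "filter": {
        "edge_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "edge_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "edge_filter"]
        }
      }
    }
  },
  "mappings": {
    "my-type": {
      "properties": {
        "suggest_text": {
          "type": "text",
          "analyzer": "edge_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
A plain match query on suggest_text for "ela" would then find "Hello elastic search.", because "elastic" is indexed as el, ela, elas, and so on.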
I know this question is ages old, but have you tried having multiple suggestions, one based on a prefix and the next one based on a regex?
Something like
{
  "suggest": {
    "test-suggest-exact": {
      "prefix": "ela",
      "completion": {
        "field": "suggest",
        "skip_duplicates": true
      }
    },
    "test-suggest-regex": {
      "regex": ".*ela.*",
      "completion": {
        "field": "suggest",
        "skip_duplicates": true
      }
    }
  }
}
Use the results from the second suggester when the first one is empty. The good thing is that meaningful phrases are returned by the Elasticsearch suggester.
A shingle-based approach, using a full query search and then aggregating based on search terms, sometimes gives broken phrases that are contextually wrong. I can write more if you are interested.

What is the best way to handle common terms which contain special chars, like C#, C++?

I have some documents that contain c# or c++ in the title field, which uses the standard analyzer.
When I query for c# on the title field, I get all c# and c++ documents, and the c++ documents even score higher than the c# documents. That makes sense, since both '#' and '++' are stripped from the tokens by the standard analyzer.
What is the best way to handle these kinds of special terms? In my case specifically, I want c# documents to score higher than c++ documents when searching for "C#".
Here is an approach you can use:
Introduce a copy field that keeps the values with their special characters. For that you'll need to:
Introduce a custom analyzer (the whitespace tokenizer is important here: it preserves your special characters):
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
Create the copy field (the _wcc suffix stands for "with special characters"):
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "prog_lang": {
          "type": "text",
          "copy_to": "prog_lang_wcc",
          "analyzer": "standard"
        },
        "prog_lang_wcc": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
When issuing the query itself, combine it with a boost on the prog_lang_wcc field, like this (it could be either a multi_match or a pure boolean query plus boost):
GET /_search
{
  "query": {
    "multi_match": {
      "query": "c#",
      "type": "phrase",
      "fields": ["prog_lang_wcc^3", "prog_lang"]
    }
  }
}
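To see why this works, compare what the two analyzers emit for the same input (the text value is just an example):
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "C# developer"
}
GET my_index/_analyze
{
  "analyzer": "standard",
  "text": "C# developer"
}
The first returns c# and developer (the whitespace tokenizer keeps the #), while the second returns c and developer, which is why c# and c++ collapse to the same token in the standard-analyzed field.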
