Highlight words with whitespace in Elasticsearch 7.6

I would like to use Elasticsearch highlight to obtain matched keywords found inside a text.
These are my settings and mappings:
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"- => _"
]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_char_filter"
],
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
},
"description": {
"type": "text",
"analyzer": "my_analyzer",
"fielddata": true
}
}
}
}
I am using a char_filter to search and highlight hyphenated words.
This is my example document:
{
"_index": "test_tokenizer",
"_type": "_doc",
"_id": "DbBIxXEBL7VGAl98vIRl",
"_score": 1.0,
"_source": {
"title": "Best places: New Mexico and Sedro-Woolley",
"description": "This is an example text containing some cities like New York, Toronto, Rome and many other. So, there are also Milton-Freewater and Las Vegas!"
}
}
and this is the query I use
{
"query": {
"query_string" : {
"query" : "\"New York\" OR \"Rome\" OR \"Milton-Freewater\"",
"default_field": "description"
}
},
"highlight" : {
"pre_tags" : ["<key>"],
"post_tags" : ["</key>"],
"fields" : {
"description" : {
"number_of_fragments" : 0
}
}
}
}
and this is the output I have
...
"hits": [
{
"_index": "test_tokenizer",
"_type": "_doc",
"_id": "GrDNz3EBL7VGAl98EITg",
"_score": 0.72928625,
"_source": {
"title": "Best places: New Mexico and Sedro-Woolley",
"description": "This is an example text containing some cities like New York, Toronto, Rome and many other. So, there are also Milton-Freewater and Las Vegas!"
},
"highlight": {
"description": [
"This is an example text containing some cities like <key>New</key> <key>York</key>, Toronto, <key>Rome</key> and many other. So, there are also <key>Milton-Freewater</key> and Las Vegas!"
]
}
}
]
...
Rome and Milton-Freewater are highlighted correctly, but New York is not.
How can I have <key>New York</key> instead of <key>New</key> and <key>York</key>?

There is an open PR regarding this but I'd suggest the following interim solution:
Add a term_vector setting
PUT test_tokenizer
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"- => _"
]
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_char_filter"
],
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_analyzer"
},
"description": {
"type": "text",
"analyzer": "my_analyzer",
"term_vector": "with_positions_offsets",
"fielddata": true
}
}
}
}
Sync a doc
POST test_tokenizer/_doc
{"title":"Best places: New Mexico and Sedro-Woolley","description":"This is an example text containing some cities like New York, Toronto, Rome and many other. So, there are also Milton-Freewater and Las Vegas!"}
Keep the query_string for matching, but repeat its phrases as match_phrase clauses in a bool should inside the highlight_query, and set type: fvh
GET test_tokenizer/_search
{
"query": {
"query_string": {
"query": "\"New York\" OR \"Rome\" OR \"Milton-Freewater\"",
"default_field": "description"
}
},
"highlight": {
"pre_tags": [
"<key>"
],
"post_tags": [
"</key>"
],
"fields": {
"description": {
"highlight_query": {
"bool": {
"should": [
{
"match_phrase": {
"description": "New York"
}
},
{
"match_phrase": {
"description": "Rome"
}
},
{
"match_phrase": {
"description": "Milton-Freewater"
}
}
]
}
},
"type": "fvh",
"number_of_fragments": 0
}
}
}
}
yielding
{
"highlight":{
"description":[
"This is an example text containing some cities like <key>New York</key>, Toronto, <key>Rome</key> and many other. So, there are also <key>Milton-Freewater</key> and Las Vegas!"
]
}
}
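If the phrases come from user input, the highlight_query does not have to be written by hand. Here is a minimal Python sketch, assuming the phrase list is available separately from the query string. build_phrase_highlight is a hypothetical helper, not part of any Elasticsearch client, and it only builds the request body:

```python
import json


def build_phrase_highlight(field, phrases, pre_tag="<key>", post_tag="</key>"):
    """Build a highlight section with one match_phrase clause per phrase."""
    should = [{"match_phrase": {field: phrase}} for phrase in phrases]
    return {
        "pre_tags": [pre_tag],
        "post_tags": [post_tag],
        "fields": {
            field: {
                "highlight_query": {"bool": {"should": should}},
                # fvh relies on term_vector: with_positions_offsets in the mapping
                "type": "fvh",
                "number_of_fragments": 0,
            }
        },
    }


phrases = ["New York", "Rome", "Milton-Freewater"]
body = {
    "query": {
        "query_string": {
            # Assumes the phrases contain no double quotes that need escaping
            "query": " OR ".join(f'"{p}"' for p in phrases),
            "default_field": "description",
        }
    },
    "highlight": build_phrase_highlight("description", phrases),
}
print(json.dumps(body, indent=2))
```

The printed body can be sent to test_tokenizer/_search with whichever client you already use.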

Related

Elasticsearch Highlight the result of script fields

In my previous question I asked how to remove the HTML tags from my search results. After that I thought I could highlight the results with a regular query, but the highlight field still contains the HTML content that the script had removed. Could you please help me highlight the results without the HTML tags that I saved in my db?
My mapping and settings:
{
"settings": {
"analysis": {
"filter": {
"my_pattern_replace_filter": {
"type": "pattern_replace",
"pattern": "\n",
"replacement": ""
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase"
],
"char_filter": [
"html_strip"
]
},
"parsed_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"char_filter": [
"html_strip"
],
"filter": [
"my_pattern_replace_filter"
]
}
}
}
},
"mappings": {
"properties": {
"html": {
"type": "text",
"analyzer": "my_analyzer",
"fields": {
"raw": {
"type": "text",
"fielddata": true,
"analyzer": "parsed_analyzer"
}
}
}
}
}
}
Search Query:
POST idx_test/_search
{
"script_fields": {
"raw": {
"script": "doc['html.raw']"
}
},
"query": {
"match": {
"html": "more"
}
},"highlight": {
"fields": {
"*":{ "pre_tags" : ["<strong>"], "post_tags" : ["</strong>"] }
}
}
}
Result:
"hits": [
{
"_index": "idx_test2",
"_type": "_doc",
"_id": "GijDsYMBjgX3UBaguGxc",
"_score": 0.2876821,
"fields": {
"raw": [
"Test More test"
]
},
"highlight": {
"html": [
"<html><body><h1 style=\"font-family: Arial\">Test</h1> <span><strong>More</strong> test</span></body></html>"
]
}
}
]
Result that I want to get:
"hits": [
{
"_index": "idx_test2",
"_type": "_doc",
"_id": "GijDsYMBjgX3UBaguGxc",
"_score": 0.2876821,
"fields": {
"raw": [
"Test <strong>More</strong> test"
]
}
}
]

I thought of another solution. You could index two fields: the original html and an html_extract field that contains only the text.
You would use an ingest processor to index just the text extracted from the html field, and highlights would then work on it.
Mapping
PUT idx_html_strip
{
"mappings": {
"properties": {
"html": {
"type": "text"
},
"html_extract": {
"type": "text"
}
}
}
}
Processor Pipeline
PUT /_ingest/pipeline/pipe_html_strip
{
"description": "_description",
"processors": [
{
"html_strip": {
"field": "html",
"target_field": "html_extract"
}
},
{
"script": {
"lang": "painless",
"source": "ctx['html_extract'] = ctx['html_extract'].replace('\n',' ').trim()"
}
}
]
}
Index Data
Note the use of ?pipeline=pipe_html_strip
POST idx_html_strip/_doc?pipeline=pipe_html_strip
{
"html": """<html><body><h1 style=\"font-family: Arial\">Test</h1> <span><strong>More</strong> test</span></body></html>"""
}
Query
GET idx_html_strip/_search?filter_path=hits.hits._source,hits.hits.highlight
{
"query": {
"multi_match": {
"query": "More",
"fields": ["html", "html_extract"]
}
},"highlight": {
"fields": {
"*":{ "pre_tags" : ["<strong>"], "post_tags" : ["</strong>"] }
}
}
}
Results
{
"hits": {
"hits": [
{
"_source": {
"html": """<html><body><h1 style=\"font-family: Arial\">Test</h1> <span><strong>More</strong> test</span></body></html>""",
"html_extract": "Test More test"
},
"highlight": {
"html": [
"""<html><body><h1 style=\"font-family: Arial\">Test</h1> <span><strong><strong>More</strong></strong> test</span></body>"""
],
"html_extract": [
"Test <strong>More</strong> test"
]
}
}
]
}
}
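To see roughly what ends up in html_extract, the same idea can be approximated locally. This is only a sketch using Python's standard html.parser, under the assumption that dropping tags and keeping text nodes is close enough. The real html_strip processor may treat some tags differently, and extract_text is a hypothetical helper name:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(html):
    """Drop the tags, then mirror the pipeline's replace('\\n', ' ').trim() step."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).replace("\n", " ").strip()


doc = {
    "html": '<html><body><h1 style="font-family: Arial">Test</h1> '
            "<span><strong>More</strong> test</span></body></html>"
}
doc["html_extract"] = extract_text(doc["html"])
print(doc["html_extract"])
```

Because the highlighter runs on the tag-free field, its fragments contain only the tags you pass in pre_tags and post_tags.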

How to match when search term has more words than index?

I have an indexed field whose values are 2-4 characters with no spaces, but users often search for the "full term", which I don't have indexed and which has 3 extra characters after a blank space.
Ex: I index "A1" or "A1B" or "A1B2" and the "full term" is something like
"A1 11A" or "A1B ABA" or "A1B2 2C8".
This is current mapping:
"code": {
"type": "text"
},
If the user searches for "A1", it brings back all of them, which is correct. If they type "A1B", I want only the last two, and if they search for "A1B2 2C8", I want only the last one.
Is that possible? If so, what would be the best search/index strategy?

Index Mapping:
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"properties": {
"code": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
Index data:
{
"code": "A1"
}
{
"code": "A1B"
}
{
"code": "A1B2"
}
Search Query:
{
"query": {
"match": {
"code": {
"query": "A1B2 2C8"
}
}
}
}
Search Result:
"hits": [
{
"_index": "65067196",
"_type": "_doc",
"_id": "3",
"_score": 1.3486402,
"_source": {
"code": "A1B2"
}
}
]
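Why this works can be checked without a cluster. The sketch below simulates the analysis chain in plain Python, under the simplifying assumption that the standard tokenizer behaves like a whitespace split for these codes. All function names are illustrative:

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Prefixes of a token, as the edge_ngram filter would emit them."""
    longest = min(max_gram, len(token))
    return {token[:n] for n in range(min_gram, longest + 1)}


def index_terms(text):
    """Index-time analysis: split, lowercase, then edge n-grams."""
    terms = set()
    for token in text.lower().split():
        terms |= edge_ngrams(token)
    return terms


def search_terms(text):
    """Search-time analysis: split and lowercase only, no n-grams."""
    return set(text.lower().split())


def matching_codes(codes, query):
    """A match query with the default OR operator needs one shared term."""
    wanted = search_terms(query)
    return [code for code in codes if index_terms(code) & wanted]


codes = ["A1", "A1B", "A1B2"]
for query in ["A1", "A1B", "A1B2 2C8"]:
    print(query, "->", matching_codes(codes, query))
```

The extra "2C8" part of the query matches nothing in the index, so it does not prevent the "A1B2" term from finding its document.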

Elastic phrase prefix working phrase isnt

I am trying to return all documents that contain a string in the userName & documentName.
Data:
{
"userName" : "johnwick",
"documentName": "john",
"office":{
"name":"my_office"
}
},
{
"userName" : "johnsnow",
"documentName": "snowy",
"office": {
"name":"Abraham deVilliers"
}
},
{
"userName" : "johnnybravo",
"documentName": "bravo",
"office": {
"name":"blabla"
}
},
{
"userName" : "moana",
"documentName": "disney",
"office": {
"name":"deVilliers"
}
},
{
"userName" : "stark",
"documentName": "marvel",
"office": {
"name":"blabla"
}
}
I can perform an exact string match with:
{
"_source": [ "userName", "documentName"],
"query": {
"multi_match": {
"query": "johnsnow",
"fields": [ "userName", "documentName"]
}
}
}
This successfully returns:
{
"userName" : "johnsnow",
"documentName": "snowy",
"office": {
"name":"Abraham deVilliers"
}
}
If I use type: phrase_prefix with john, I also get 3 results back.
But then I try with:
{
"query": {
"multi_match": {
"query": "ohn", // <---- match all docs that contain 'ohn'
"type": "phrase_prefix",
"fields": [ "userName", "documentName"]
}
}
}
Zero results are returned.

What you are looking for is infix search. To achieve it you need an ngram token filter at index time, together with a separate search-time analyzer.
Complete example with your sample data
Index mapping and setting
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "ngram", --> note this
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
},
"index.max_ngram_diff" : 10 --> this you can reduce based on your requirement.
},
"mappings": {
"properties": {
"userName": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
},
"documentName": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
Index your sample docs and then run the same search query. I indexed only the first and last docs for brevity, and it returned the first doc:
"hits": [
{
"_index": "infix",
"_type": "_doc",
"_id": "1",
"_score": 5.7100673,
"_source": {
"userName": "johnwick",
"documentName": "john"
}
}
]
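The difference between prefix and infix matching is easy to see by generating the grams directly. This is a rough Python sketch of what the ngram filter emits for one lowercased token. The function names are illustrative and the user names are taken from the sample data:

```python
def ngrams(token, min_gram=1, max_gram=10):
    """All substrings of length min_gram..max_gram, like the ngram filter."""
    return {
        token[start:start + size]
        for size in range(min_gram, max_gram + 1)
        for start in range(len(token) - size + 1)
    }


def prefixes(token):
    """Only leading substrings, which is all a prefix query can reach."""
    return {token[:size] for size in range(1, len(token) + 1)}


users = ["johnwick", "johnsnow", "johnnybravo", "moana", "stark"]

infix_matches = [user for user in users if "ohn" in ngrams(user)]
prefix_matches = [user for user in users if "ohn" in prefixes(user)]

print("ngram index:", infix_matches)
print("prefix only:", prefix_matches)
```

With n-grams in the index, the search term "ohn" is itself an indexed term, so a plain match on the standard-analyzed query finds it.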

Field type as text and completion in Elasticsearch

I am trying to have the title field as both text and completion types in Elasticsearch, as shown below:
PUT playlist
{
"settings": {
"number_of_shards": 2,
"number_of_replicas": 2,
"analysis": {
"filter": {
"custom_english_stemmer": {
"type": "stemmer",
"name": "english"
},
"english_stop": {
"type": "stop",
"stopwords": "_english_"
}
},
"analyzer": {
"custom_lowercase_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"english_stop",
"custom_english_stemmer"
]
}
}
}
},
"mappings": {
"properties": {
"id": {
"type": "long",
"index": false,
"doc_values": false
},
"title": {
"type": "text",
"analyzer": "custom_lowercase_analyzer",
"fields": {
"raw": {
"type": "completion"
}
}
}
}
}
}
The below suggestion query works
POST media/_search
{
"_source": ["id", "title"],
"suggest": {
"job-suggest": {
"prefix": "sri",
"completion": {
"field": "title"
}
}
}
}
But normal search would fail on the same title
GET media/_search
{
"_source": ["id", "title"],
"query" : {
"query_string": {
"query" : "*sri*",
"fields" : [
"title"
]
}
}
}
Please help me solve this problem

Match all partial words in one field

I have this query:
{
"query": {
"match": {
"tag": {
"query": "john smith",
"operator": "and"
}
}
}
}
With the and operator I get documents where the words "john" and "smith" are both present in the tag field, in any position and any order. But I need to return documents where all partial words are present in the tag field, like "joh" and "smit". I tried this:
{
"query": {
"match": {
"tag": {
"query": "*joh* *smit*",
"operator": "and"
}
}
}
}
but it returns nothing. How can I solve this?

You can use an edge_ngram token filter and a bool query with multiple must clauses (based on your second example) to get the desired output.
Working example:
Index Def
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram", --> note this
"min_gram": 1,
"max_gram": 10
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
Index two sample docs, one that should match and one that shouldn't
{
"title" : "john bravo" --> shouldn't match
}
{
"title" : "john smith" --> should match
}
Boolean Search query with must clause
{
"query": {
"bool": {
"must": [ --> this means both the `joh` and `smit` match clauses must match; it replaces your `and` operator.
{
"match": {
"title": "joh"
}
},
{
"match": {
"title": "smit"
}
}
]
}
}
}
Search result
"hits": [
{
"_index": "so_partial",
"_type": "_doc",
"_id": "1",
"_score": 1.2840209,
"_source": {
"title": "john smith"
}
}
]
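When the partial words come from a search box, the must clauses can be generated from the input. Here is a small Python sketch, assuming the words are separated by whitespace. all_partials_query is a hypothetical helper that only builds the request body:

```python
import json


def all_partials_query(field, text):
    """Build a bool query with one must/match clause per partial word."""
    clauses = [{"match": {field: word}} for word in text.split()]
    return {"query": {"bool": {"must": clauses}}}


body = all_partials_query("title", "joh smit")
print(json.dumps(body, indent=2))
```

Each clause has to match on its own, so a document containing only "john" is rejected.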
