Elastic search: How to highlight the fragment after the search term? - elasticsearch

I am working on a search project which requires the highlight fragment after the search word.
My query is
{
"query": {
"multi_match" : {
"query" : "prawn",
"fields": ["name"]
, "operator": "and",
"use_dis_max": true
}
},
"_source": ["name"],
"highlight": {
"fields": {
"name": {
"pre_tags" : [""], "post_tags" : [""],
"fragment_size": 3,
"number_of_fragments": 1
}
}
}
}
Result:
{
"name" : "special prawn curry"
},
"highlight" : {
"name" : [
"special prawn"
]
}
Whereas, I want the result like
"name" : "special prawn curry"
},
"highlight" : {
"name" : [
"prawn curry"
]
}
i.e the fragment after the search word. Is it possible?

Well you can make use of the Plain highlighter (using "type":"plain") in highlight query and see if that works out.
This used to be the default highlighter till 6.0 release where they've made Unified as default highlighter.
POST <your_index_name>/_search
{
"query": {
"multi_match" : {
"query" : "prawn",
"fields": ["name"]
, "operator": "and",
"use_dis_max": true
}
},
"_source": ["name"],
"highlight": {
"fields": {
"name": {
"type": "plain", <---- Added this
"pre_tags" : [""], "post_tags" : [""],
"fragment_size": 3,
"number_of_fragments": 1,
"order": "score"
}
}
}
}
Hope this helps!

Related

ElasticSearch DSL Matching all elements of query in list of list of strings

I'm trying to query ElasticSearch to match every document that in a list of list contains all the values requested, but I can't seem to find the perfect query.
Mapping:
"id" : {
"type" : "keyword"
},
"mainlist" : {
"properties" : {
"format" : {
"type" : "keyword"
},
"tags" : {
"type" : "keyword"
}
}
},
...
Documents:
doc1 {
"id" : "abc",
"mainlist" : [
{
"type" : "big",
"tags" : [
"tag1",
"tag2"
]
},
{
"type" : "small",
"tags" : [
"tag1"
]
}
]
},
doc2 {
"id" : "abc",
"mainlist" : [
{
"type" : "big",
"tags" : [
"tag1"
]
},
{
"type" : "small",
"tags" : [
"tag2"
]
}
]
},
doc3 {
"id" : "abc",
"mainlist" : [
{
"type" : "big",
"tags" : [
"tag1"
]
}
]
}
The query I've tried that got me closest to the result is:
GET /index/_doc/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"mainlist.tags": "tag1"
}
},
{
"term": {
"mainlist.tags": "tag2"
}
}
]
}
}
}
although I get as result doc1 and doc2, while I'd only want doc1 as contains tag1 and tag2 in a single list element and not spread across both sublists.
How would I be able to achieve that?
Thanks for any help.
As mentioned by #caster, you need to use the nested data type and query as in normal way Elasticsearch treats them as object and relation between the elements are lost, as explained in offical doc.
You need to change both mapping and query to achieve the desired output as shown below.
Index mapping
{
"mappings": {
"properties": {
"id": {
"type": "keyword"
},
"mainlist" :{
"type" : "nested"
}
}
}
}
Sample Index doc according to your example, no change there
Query
{
"query": {
"nested": {
"path": "mainlist",
"query": {
"bool": {
"must": [
{
"term": {
"mainlist.tags": "tag1"
}
},
{
"match": {
"mainlist.tags": "tag2"
}
}
]
}
}
}
}
}
And result
hits": [
{
"_index": "71519931_new",
"_id": "1",
"_score": 0.9139043,
"_source": {
"id": "abc",
"mainlist": [
{
"type": "big",
"tags": [
"tag1",
"tag2"
]
},
{
"type": "small",
"tags": [
"tag1"
]
}
]
}
}
]
use nested field type,this is work for it
https://www.elastic.co/guide/en/elasticsearch/reference/8.1/nested.html

Word and phrase search on multiple fields in ElasticSearch

I'd like to search documents using Python through ElasticSearch. I am looking for documents which contains word and/or phrase in any one of three fields.
GET /my_docs/_search
{
"query": {
"multi_match": {
"query": "Ford \"lone star\"",
"fields": [
"title",
"description",
"news_content"
],
"minimum_should_match": "-1",
"operator": "AND"
}
}
}
In the above query, I'd like to get documents whose title, description, or news_content contain "Ford" and "lone star" (as a phrase).
However, it seems that it does not consider "lone star" as a phrase. It returns documents with "Ford", "lone", and "star".
So, I was able to reproduce your issue and solved it using the REST API of Elasticsearch as I am not familiar with the python syntax and glad you provided your search query in JSON format, and I built my solution on top of it.
Index def
{
"mappings": {
"properties": {
"title": {
"type": "text"
},
"description" :{
"type" : "text"
},
"news_content" : {
"type" : "text"
}
}
}
}
Sample docs
{
"title" : "Ford",
"news_content" : "lone star", --> note this matches your criteria
"description" : "foo bar"
}
{
"title" : "Ford",
"news_content" : "lone",
"description" : "star"
}
Search query you are looking for
{
"query": {
"bool": {
"must": [ --> note this, both clause must match
{
"multi_match": {
"query": "ford",
"fields": [
"title",
"description",
"news_content"
]
}
},
{
"multi_match": {
"query": "lone star",
"fields": [
"title",
"description",
"news_content"
],
"type": "phrase" --> note `lone star` must be phrase
}
}
]
}
}
}
Result contains just one doc from sample
"hits": [
{
"_index": "so_phrase",
"_type": "_doc",
"_id": "1",
"_score": 0.9527341,
"_source": {
"title": "Ford",
"news_content": "lone star",
"description": "foo bar"
}
}
]

How Elasticsearch relevance score gets calculated?

I am using multi_match with phrase_prefix for full text search in Elasticsearch 5.5. ES query looks like
{
query: {
bool: {
must: {
multi_match: {
query: "butt",
type: "phrase_prefix",
fields: ["item.name", "item.keywords"],
max_expansions: 10
}
}
}
}
}
I am getting following response
[
{
"_index": "items_index",
"_type": "item",
"_id": "2",
"_score": 0.61426216,
"_source": {
"item": {
"keywords": "amul butter, milk, butter milk, flavoured",
"name": "Flavoured Butter"
}
}
},
{
"_index": "items_index",
"_type": "item",
"_id": "1",
"_score": 0.39063013,
"_source": {
"item": {
"keywords": "amul butter, milk, butter milk",
"name": "Butter Milk"
}
}
}
]
Mappings is as follows(I am using default mappings)
{
"items_index" : {
"mappings" : {
"parent_doc": {
...
"properties": {
"item" : {
"properties" : {
"keywords" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
}
How item with "name": "Flavoured Butter" getting higher score of 0.61426216 against the document with "name": "Butter Milk" and score 0.39063013?
I tried applying boost to "item.name" and removing "item.keywords" form search fields getting same results.
How scores in Elasticsearch works? Are above results correct in terms of relavance?
The scoring for phrase_prefix is similar to that of best_fields, meaning that score of a document is the score obtained from the best_field, which here is item.keywords.
So, item.name isn't adding to score
Refer: multi-match-types
You can use 2 multi_match queries to combine the score from keywords and name.
{
"query": {
"bool": {
"must": [{
"multi_match": {
"query": "butt",
"type": "phrase_prefix",
"fields": [
"item.keywords"
],
"max_expansions": 10
}
},{
"multi_match": {
"query": "butt",
"type": "phrase_prefix",
"fields": [
"item.name"
],
"max_expansions": 10
}
}]
}
}
}

Is it possible to use a more-like-this query on nested fields?

I have an "event" type based on a (nested) press article, including the title, and the text, which both have multifields.
I've tried :
{
"query":{
"nested":{
"path":"article",
"query":{
"mlt":{
"fields":["article.title.search","article.text.search"],
"max_query_terms": 20,
"min_term_freq": 1,
"include": "false",
"like":[{
"_index":"myindex",
"_type":"event",
"doc":{
"article":{
"title":"this is the title",
"text":"this is the body of the article"
}
}]
}
}
}
}
}
But it always returns 0 hits
{
"query": {
"nested":{
"path":"articles",
"query":{
"more_like_this" : {
"fields" : ["articles.brand", "articles.category", "articles.material"],
"like" : [
{
"_index" : "$index",
"_type" : "$type",
"_id" : "$id"
}
],
"min_term_freq" : 1,
"max_query_terms" : 20
}
}
}
}
This Works for me, Taking in consideration that the mapping of the nested fields you are using must be defined as term vectors.
"brand": {
"type": "string",
"index": "not_analyzed",
"term_vector": "yes"
}
Refer to: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html

Analyzers in ElasticSearch not working

I am using ElasticSearch to store the Tweets I receive from the Twitter Streaming API. Before storing them I'd like to apply an english stemmer to the Tweet content, and to do that I'm trying to use ElasticSearch analyzers with no luck.
This is the current template I am using:
PUT _template/twitter
{
"template": "139*",
"settings" : {
"index":{
"analysis":{
"analyzer":{
"english":{
"type":"custom",
"tokenizer":"standard",
"filter":["lowercase", "en_stemmer", "stop_english", "asciifolding"]
}
},
"filter":{
"stop_english":{
"type":"stop",
"stopwords":["_english_"]
},
"en_stemmer" : {
"type" : "stemmer",
"name" : "english"
}
}
}
}
},
"mappings": {
"tweet": {
"_timestamp": {
"enabled": true,
"store": true,
"index": "analyzed"
},
"_index": {
"enabled": true,
"store": true,
"index": "analyzed"
},
"properties": {
"geo": {
"properties": {
"coordinates": {
"type": "geo_point"
}
}
},
"text": {
"type": "string",
"analyzer": "english"
}
}
}
}
}
When I start the Streaming and the index is created, all the mappings I've defined seem to apply correctly, but the text is stored as it comes from Twitter, completely raw. The index metadata shows:
"settings" : {
"index" : {
"uuid" : "xIOkEcoySAeZORr7pJeTNg",
"analysis" : {
"filter" : {
"en_stemmer" : {
"type" : "stemmer",
"name" : "english"
},
"stop_english" : {
"type" : "stop",
"stopwords" : [
"_english_"
]
}
},
"analyzer" : {
"english" : {
"type" : "custom",
"filter" : [
"lowercase",
"en_stemmer",
"stop_english",
"asciifolding"
],
"tokenizer" : "standard"
}
}
},
"number_of_replicas" : "1",
"number_of_shards" : "5",
"version" : {
"created" : "1010099"
}
}
},
"mappings" : {
"tweet" : {
[...]
"text" : {
"analyzer" : "english",
"type" : "string"
},
[...]
}
}
What am I doing wrong? The analyzers seems to be applied correctly, but nothing is happening :/
Thank you!
PS: The search query I use to realize the analyzer is not being applied:
curl -XGET 'http://localhost:9200/_all/_search?pretty' -d '{
"query": {
"filtered": {
"query": {
"bool": {
"should": [
{
"query_string": {
"query": "_index:1397574496990"
}
}
]
}
},
"filter": {
"bool": {
"must": [
{
"match_all": {}
},
{
"exists": {
"field": "geo.coordinates"
}
}
]
}
}
}
},
"fields": [
"geo.coordinates",
"text"
],
"size": 50000
}'
This should return the stemmed text as one of the fields, but the response is:
{
"took": 29,
"timed_out": false,
"_shards": {
"total": 47,
"successful": 47,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.97402453,
"hits": [
{
"_index": "1397574496990",
"_type": "tweet",
"_id": "456086643423068161",
"_score": 0.97402453,
"fields": {
"geo.coordinates": [
-118.21122533,
33.79349318
],
"text": [
"Happy turtle Tuesday ! The week is slowly crawling to Wednesday good morning everyone 🌊🐢🐢🐢☀️#turtles… http://t.co/wAVmcxnf76"
]
}
},
{
"_index": "1397574496990",
"_type": "tweet",
"_id": "456086701451259904",
"_score": 0.97333175,
"fields": {
"geo.coordinates": [
-81.017636,
33.998741
],
"text": [
"Tuesday is Twins Day over here, apparently (it's a far too often occurrence) #tuesdaytwinsday… http://t.co/Umhtp6SoX6"
]
}
}
]
}
}
The text field is exactly the same that came from Twitter (I'm using the streaming API). What I expect is the text fields stemmed, as the analyzer is applied.
Analyzers don't affect the way data is stored. So, no matter which analyzer you are using you will get the same text back from source and stored fields. Analyzer are applied when you search. So by searching for something like text:twin and finding records with the word Twins, you will know that stemmer was applied.

Resources