Highlight on ElasticSearch autocomplete - elasticsearch

I have the following data to be indexed on ElasticSearch.
I want to implement an autocomplete feature, and highlight why a specific document matched a query.
These are the settings of my index:
{
"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 15
}
},
"analyzer": {
"autocomplete": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"autocomplete_filter"
]
}
}
}
}
}
Index Analyzing
Splits text on word boundaries.
Removes punctuation.
Lowercases
Edge NGrams each token
So the Inverted Index looks like:
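A rough way to picture it is the following Python sketch, which mimics the steps listed above outside Elasticsearch. The function names are made up for illustration, the regex split is only an approximation of the standard tokenizer, and the lowercasing step follows the list above rather than the settings shown, which list only the edge n-gram filter.

```python
import re


def edge_ngrams(token, min_gram=1, max_gram=15):
    """Return the leading substrings of a token, like an edge_ngram filter."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def analyze_for_index(text):
    """Approximate the index-time analysis: split, lowercase, edge n-gram."""
    tokens = re.findall(r"\w+", text.lower())
    terms = []
    for token in tokens:
        terms.extend(edge_ngrams(token))
    return terms


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze_for_index(text):
            index.setdefault(term, set()).add(doc_id)
    return index


# Hypothetical document ids, names taken from the results in the question.
docs = {1: "Software AG", 2: "is soft ware ok"}
inverted = build_inverted_index(docs)
```

Looking up `inverted["soft"]` returns both document ids, because "software" contributes the prefix "soft" at index time while the query side stays a plain term.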
This is how I defined the mappings for a name field:
{
"index_type": {
"properties": {
"name": {
"type": "string",
"index_analyzer": "autocomplete",
"search_analyzer": "standard"
}
}
}
}
When I query:
GET http://localhost:9200/index/type/_search
{
"query": {
"match": {
"name": "soft"
}
},
"highlight": {
"fields" : {
"name" : {}
}
}
}
Search for: soft
After the standard analyzer is applied, "soft" is the term that is looked up in the inverted index. This search matches documents 1, 3, 4, 5, 6 and 7, which is correct, but I would expect only "soft" to be highlighted, not the whole word:
{
"hits": [
{
"_source": {
"name": "SoftwareRocks everytime"
},
"highlight": {
"name": [
"<em>SoftwareRocks</em> everytime"
]
}
},
{
"_source": {
"name": "Software AG"
},
"highlight": {
"name": [
"<em>Software</em> AG"
]
}
},
{
"_source": {
"name": "Software AG2"
},
"highlight": {
"name": [
"<em>Software</em> AG2"
]
}
},
{
"_source": {
"name": "Op Software AG good software better"
},
"highlight": {
"name": [
"Op <em>Software</em> AG good <em>software</em> better"
]
}
},
{
"_source": {
"name": "Op Software AG"
},
"highlight": {
"name": [
"Op <em>Software</em> AG"
]
}
},
{
"_source": {
"name": "is soft ware ok"
},
"highlight": {
"name": [
"is <em>soft</em> ware ok"
]
}
}
]
}
Search for: software ag
After the standard analyzer is applied, "software ag" is split into "software" and "ag", which are looked up in the inverted index. This search matches documents 1, 3, 4, 5 and 6, which is correct, but I would expect only "software" and "ag" to be highlighted, not the whole words that contain them:
{
"hits": [
{
"_source": {
"name": "Software AG"
},
"highlight": {
"name": [
"<em>Software</em> <em>AG</em>"
]
}
},
{
"_source": {
"name": "Software AG2"
},
"highlight": {
"name": [
"<em>Software</em> <em>AG2</em>"
]
}
},
{
"_source": {
"name": "Op Software AG"
},
"highlight": {
"name": [
"Op <em>Software</em> <em>AG</em>"
]
}
},
{
"_source": {
"name": "Op Software AG good software better"
},
"highlight": {
"name": [
"Op <em>Software</em> <em>AG</em> good <em>software</em> better"
]
}
},
{
"_source": {
"name": "SoftwareRocks everytime"
},
"highlight": {
"name": [
"<em>SoftwareRocks</em> everytime"
]
}
}
]
}
I read the highlight documentation on elasticsearch, but I cannot understand how the highlighting is performed. For the two examples above, I expect only the token matched in the inverted index to be highlighted, not the whole word.
Can anyone help me highlight only the value that was passed in?
Update
So, it seems that on the ElasticSearch website, the server-side autocomplete is similar to my implementation. However, it seems that they highlight the matched query on the client.
Given that, I started to think there is no proper way to do it on the ElasticSearch side, so I implemented the highlight feature on the server side instead of on the client side (as they seem to do).
My server-side implementation (using PHP) is:
public function search($term)
{
    $params = [
        'index' => $this->getIndexName(),
        'type' => $this->getIndexType(),
        'body' => [
            'query' => [
                'match' => [
                    'name' => $term
                ]
            ]
        ]
    ];
    $results = $this->client->search($params);
    $hits = $results['hits']['hits'];
    $data = [];
    $wrapBefore = '<strong>';
    $wrapAfter = '</strong>';
    // Escape regex metacharacters so the term is matched literally.
    $pattern = '/(' . preg_quote($term, '/') . ')/i';
    foreach ($hits as $hit) {
        $data[] = [
            $hit['_source']['id'],
            $hit['_source']['name'],
            // Wrap every case-insensitive occurrence of the term in the tag-stripped name.
            preg_replace($pattern, $wrapBefore . '$1' . $wrapAfter, strip_tags($hit['_source']['name']))
        ];
    }
    return $data;
}
It outputs what I was aiming for with this question:
I added a bounty to see if there is a solution at the ElasticSearch level to achieve what I described above.

As of now, with the latest version of Elasticsearch, this is not possible, as the highlight documentation doesn't mention any setting or query for it. I checked the Elastic autocomplete example in the browser console, under the XHR requests tab, and found the following response for the keyword "att".
url - https://search.elastic.co/suggest?q=att
{
"current_page": 1,
"last_page": 4,
"total_hits": 49,
"hits": [
{
"tags": [],
"url": "/elasticon/tour/2016/jp/not-attending",
"section": "Elasticon",
"title": "Not <em>Attending</em> - JP"
},
{
"section": "Elasticon",
"title": "<em>Attending</em> from Training - JP",
"tags": [],
"url": "/elasticon/tour/2016/jp/attending-training"
},
{
"tags": [],
"url": "/elasticon/tour/2016/jp/attending-keynote",
"title": "<em>Attending</em> from Keynote - JP",
"section": "Elasticon"
},
{
"tags": [],
"url": "/elasticon/tour/2016/not-attending",
"section": "Elasticon",
"title": "Thank You - Not <em>Attending</em>"
},
{
"tags": [],
"url": "/elasticon/tour/2016/attending",
"section": "Elasticon",
"title": "Thank You - <em>Attending</em>"
},
{
"section": "Blog",
"title": "What It's Like to <em>Attend</em> Elastic Training",
"tags": [],
"url": "/blog/what-its-like-to-attend-elastic-training"
},
{
"tags": "Elasticsearch",
"url": "/guide/en/elasticsearch/plugins/5.0/mapper-attachments-highlighting.html",
"section": "Docs/",
"title": "Highlighting <em>attachments</em>"
},
{
"title": "<em>attachments</em> » email",
"section": "Docs/",
"tags": "Logstash",
"url": "/guide/en/logstash/5.0/plugins-outputs-email.html#plugins-outputs-email-attachments"
},
{
"section": "Docs/",
"title": "Configuring Email <em>Attachments</em> » Actions",
"tags": "Watcher",
"url": "/guide/en/watcher/2.4/actions.html#configuring-email-attachments"
},
{
"url": "/guide/en/watcher/2.4/actions.html#hipchat-action-attributes",
"tags": "Watcher",
"title": "HipChat Action <em>Attributes</em> » Actions",
"section": "Docs/"
},
{
"title": "Slack Action <em>Attributes</em> » Actions",
"section": "Docs/",
"tags": "Watcher",
"url": "/guide/en/watcher/2.4/actions.html#slack-action-attributes"
}
],
"aggs": {
"sections": [
{
"Elasticon": 5
},
{
"Blog": 1
},
{
"Docs/": 43
}
],
"top_tags": [
{
"XPack": 14
},
{
"Elasticsearch": 12
},
{
"Watcher": 9
},
{
"Logstash": 4
},
{
"Clients": 3
},
{
"Shield": 1
}
]
}
}
But on the frontend, only "att" is highlighted in the autosuggest results. Hence they are handling the highlighting in the browser layer.
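A minimal sketch of that kind of post-query highlighting, written in Python rather than the PHP used in the question. The function name and the `<strong>` wrapper defaults are illustrative assumptions. Unlike the single-term regex in the question, it treats each whitespace-separated query word as its own prefix, matches only at word starts, and HTML-escapes the text it emits.

```python
import html
import re


def highlight_prefixes(name, query, before="<strong>", after="</strong>"):
    """Wrap each query word where it occurs at the start of a word in name."""
    words = [re.escape(word) for word in query.split()]
    if not words:
        return html.escape(name)
    # Longer words first, so the alternation prefers the longest prefix.
    words.sort(key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(words) + ")", re.IGNORECASE)

    pieces = []
    last = 0
    for match in pattern.finditer(name):
        # Escape the untouched text and the matched text separately,
        # so only the wrapper tags are emitted as markup.
        pieces.append(html.escape(name[last:match.start()]))
        pieces.append(before + html.escape(match.group(1)) + after)
        last = match.end()
    pieces.append(html.escape(name[last:]))
    return "".join(pieces)


example = highlight_prefixes("Software AG2", "software ag")
```

For the "Software AG2" hit from the question, `example` wraps only "Software" and "AG" and leaves the trailing "2" outside the tags. The `\b` anchor assumes query words start with a letter or digit.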

Related

Cannot seem to use must and must_not together in an elastic search query

If I run the following query:
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "boxing",
"fuzziness": 2,
"minimum_should_match": 2
}
}
],
"must_not": [
{
"terms_set": {
"allowedCountries": {
"terms": ["gb", "mx"],
"minimum_should_match_script": {
"source": "2"
}
}
}
}
],
"filter": [
{
"range": {
"expireTime": {
"gt": 1674061907954
}
}
},
{
"term": {
"region": {
"value": "row"
}
}
},
{
"term": {
"sourceType": {
"value": "article"
}
}
}
]
}
}
}
against an index with articles that look like:
{
"_index": "content-items-v10",
"_type": "_doc",
"_id": "e7hm75ui4dma1mm4j8q5v7914",
"_score": 4.3724976,
"_source": {
"allowedCountries": ["gb", "ie"],
"body": "Both Joshua Buatsi and Craig Richards join The DAZN Boxing Show ahead of their clash at London's O2 Arena. Matchroom's Eddie Hearn also gives his take on the night, as well as Chantelle Cameron previewing her contest with Victoria Noelia Bustos.",
"competitions": [
{
"id": "8lo6205qyio0fksjx9glqbdhj",
"name": "Buatsi v Richards"
}
],
"contestants": [
{
"id": "7rq59j3eiamxlm12vhxcsgujj",
"name": "Joshua Buatsi"
},
{
"id": "boby9oqe23g6qyuwphrxh8su5",
"name": "Craig Richards"
}
],
"countries": [
{
"id": "7yasa43laq1nb2e6f8bfuvxed",
"name": "World"
},
{
"id": "258l9t5sm55592i08mdpqzr3t",
"name": "United Kingdom"
}
],
"dotsLastUpdateTime": 1673979749396,
"expireTime": 4800000000000,
"fixtureDate": {},
"headline": "Buatsi vs. Richards: Preview",
"id": "e7hm75ui4dma1mm4j8q5v7914",
"importance": 0,
"languageKeys": ["en"],
"languages": ["en"],
"lastUpdateTime": {
"ts": 1653088281000,
"iso8601": "2022-05-20T23:11:21.000Z"
},
"promoImageUrl": null,
"publication": {
"typeId": "1plcw0iyhx9vn1fcanbm2ja3rf",
"typeName": "Shoulder"
},
"publishedTime": {
"ts": 1653088281000,
"iso8601": "2022-05-20T23:11:21.000Z"
},
"region": "row",
"shortHeadline": null,
"sourceType": "article",
"sports": [
{
"id": "2x2oqzx60orpoeugkd754ga17",
"name": "Boxing"
}
],
"teaser": "",
"thumbnailImageUrl": "https://images.daznservices.com/di/library/babcock_canada/45/3e/the-dazn-boxing-show-20052022_xc4jbfqi022l1shq9lu641h9e.png?t=-477976832",
"translations": {}
}
}
I get the following validation error from elasticsearch:
{
"ok": false,
"errors": {
"validation": [
{
"message": "\"query.bool.must_not\" is not allowed",
"path": [
"query",
"bool",
"must_not"
],
"type": "object.unknown",
"context": {
"child": "must_not",
"label": "query.bool.must_not",
"value": [
{
"terms_set": {
"allowedCountries": {
"terms": [
"gb",
"mx"
],
"minimum_should_match_script": {
"source": "2"
}
}
}
}
],
"key": "must_not"
}
}
]
},
"correlationId": "d29e9275-9ab3-4ff8-944d-852b98d4b503"
}
And I cannot figure out what the issue might be! From the elastic docs it should be OK.
I'm using ElasticSearch 7.9.3 running in a local docker container.
I'm hoping someone out there will give me a clue!
Cheers!
I would expect this to just work.
I'm trying to filter out articles that have both of the country codes gb and mx in the field allowedCountries.
I can include them easily enough in the results when I add the terms_set query to the bool.must section of the query.
It works well, you just need to enclose your query in the query section
{
"query": { <--- add this
"bool": { <--- your query starts here
"must": [
...
Thank you for responding!
I was helping with a system I did not have full context on - it turns out there is a proxy in the mix with validation that was blocking the must_not query. So, with the proxy fixed, it now works.
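To make the intent of the must_not clause concrete, here is a small Python sketch of the check that a terms_set query with a minimum of 2 performs on `allowedCountries`. It runs locally on plain dictionaries and does not talk to Elasticsearch; the helper names are assumptions made for illustration.

```python
def matches_terms_set(field_values, terms, minimum_should_match):
    """True when at least minimum_should_match of the terms occur in the field."""
    matched = set(field_values) & set(terms)
    return len(matched) >= minimum_should_match


def is_excluded(article):
    """Mirror the must_not clause: drop articles that list both gb and mx."""
    countries = article.get("allowedCountries", [])
    return matches_terms_set(countries, ["gb", "mx"], 2)


# The sample article from the question lists gb and ie, so it is kept.
sample = {"allowedCountries": ["gb", "ie"]}
kept = not is_excluded(sample)
```

An article whose `allowedCountries` contains both "gb" and "mx" would be excluded, while one containing only one of them stays in the results.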

Filter documents out of the facet count in enterprise search

We use enterprise search indexes to store items that can be tagged by multiple tenants.
e.g.
[
{
"id": 1,
"name": "document 1",
"tags": [
{ "company_id": 1, "tag_id": 1, "tag_name": "bla" },
{ "company_id": 2, "tag_id": 1, "tag_name": "bla" }
]
}
]
I'm looking for a way to retrieve all documents with only the tags of company 1.
This request:
{
"query": "",
"facets": {
"tags": {
"type": "value"
}
},
"sort": {
"created": "desc"
},
"page": {
"size": 20,
"current": 1
}
}
comes back with:
...
"facets": {
"tags": [
{
"type": "value",
"data": [
{
"value": "{\"company_id\":1,\"tag_id\":1,\"tag_name\":\"bla\"}",
"count": 1
},
{
"value": "{\"company_id\":2,\"tag_id\":1,\"tag_name\":\"bla\"}",
"count": 1
}
]
}
],
}
...
Can I modify the request in a way such that I get no tags by "company_id" = 2 ?
I have a solution that involves modifying the results to strip the extra data after they are retrieved but I'm looking for a better solution.
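For comparison, the post-processing workaround mentioned above could look roughly like this Python sketch. It assumes the facet response has the shape shown in the question, where each facet value is a JSON-encoded string; the function name is hypothetical. It is not the request-side solution being asked for.

```python
import json


def filter_facet_by_company(facets, company_id, facet_name="tags"):
    """Return a copy of the facets with only the given company's tag entries."""
    filtered = []
    for facet in facets.get(facet_name, []):
        kept = [
            entry
            for entry in facet.get("data", [])
            # Each value is a JSON string such as {"company_id":1,...}.
            if json.loads(entry["value"]).get("company_id") == company_id
        ]
        filtered.append({**facet, "data": kept})
    return {**facets, facet_name: filtered}


response_facets = {
    "tags": [
        {
            "type": "value",
            "data": [
                {"value": "{\"company_id\":1,\"tag_id\":1,\"tag_name\":\"bla\"}", "count": 1},
                {"value": "{\"company_id\":2,\"tag_id\":1,\"tag_name\":\"bla\"}", "count": 1},
            ],
        }
    ]
}
only_company_1 = filter_facet_by_company(response_facets, 1)
```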

Elastic Search Wildcard query with space failing 7.11

My data is indexed in Elasticsearch 7.11. This is the mapping I got when I added documents directly to my index.
{"properties":{"name":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}}
I haven't added the keyword part and have no idea where it came from.
I am running a wildcard query on this field, but I am unable to get data for keywords with spaces.
{
"query": {
"bool":{
"should":[
{"wildcard": {"name":"*hello world*"}}
]
}
}
}
I have seen many answers related to not_analyzed, and I tried setting {"index":"true"} in the mapping, but it did not help. How can I make the wildcard search work in this version of Elasticsearch?
I tried adding the wildcard field:
PUT http://localhost:9001/indexname/_mapping
{
"properties": {
"name": {
"type" :"wildcard"
}
}
}
And got the following response:
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "mapper [name] cannot be changed from type [text] to [wildcard]"
}
],
"type": "illegal_argument_exception",
"reason": "mapper [name] cannot be changed from type [text] to [wildcard]"
},
"status": 400
}
Adding a sample document to match
{
"_index": "accelerators",
"_type": "_doc",
"_id": "602ec047a70f7f30bcf75dec",
"_score": 1.0,
"_source": {
"acc_id": "602ec047a70f7f30bcf75dec",
"name": "hello world example",
"type": "Accelerator",
"description": "khdkhfk ldsjl klsdkl",
"teamMembers": [
{
"userId": "karthik.r#gmail.com",
"name": "Karthik Ganesh R",
"shortName": "KR",
"isOwner": true
},
{
"userId": "anand.sajan#gmail.com",
"name": "Anand Sajan",
"shortName": "AS",
"isOwner": false
}
],
"sectorObj": [
{
"item_id": 14,
"item_text": "Cross-sector"
}
],
"geographyObj": [
{
"item_id": 4,
"item_text": "Global"
}
],
"technologyObj": [
{
"item_id": 1,
"item_text": "Artificial Intelligence"
}
],
"themeColor": 1,
"mainImage": "assets/images/Graphics/Asset 35.svg",
"features": [
{
"name": "Ideation",
"icon": "Asset 1007.svg"
},
{
"name": "Innovation",
"icon": "Asset 1044.svg"
},
{
"name": "Strategy",
"icon": "Asset 1129.svg"
},
{
"name": "Intuitive",
"icon": "Asset 964.svg"
}
],
"logo": {
"actualFileName": "",
"fileExtension": "",
"fileName": "",
"fileSize": 0,
"fileUrl": ""
},
"customLogo": {
"logoColor": "#B9241C",
"logoText": "EC",
"logoTextColor": "#F6F6FA"
},
"collaborators": [
{
"userId": "muhammed.arif#gmail.com",
"name": "muhammed Arif P T",
"shortName": "MA"
},
{
"userId": "anand.sajan#gmail.com",
"name": "Anand Sajan",
"shortName": "AS"
}
],
"created_date": "2021-02-18T19:30:15.238000Z",
"modified_date": "2021-03-11T11:45:49.583000Z"
}
}
You cannot modify a field mapping once created. However, you can create another sub-field of type wildcard, like this:
PUT http://localhost:9001/indexname/_mapping
{
"properties": {
"name": {
"type": "text",
"fields": {
"wildcard": {
"type" :"wildcard"
},
"keyword": {
"type" :"keyword",
"ignore_above":256
}
}
}
}
}
When the mapping is updated, you need to reindex your data so that the new field gets indexed, like this:
POST http://localhost:9001/indexname/_update_by_query
And then when this finishes, you'll be able to query on this new field like this:
{
"query": {
"bool": {
"should": [
{
"wildcard": {
"name.wildcard": "*hello world*"
}
}
]
}
}
}
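The reason the original query finds nothing can be illustrated with a short Python sketch. A wildcard query is compared against individual indexed terms, and a `text` field is split into separate terms at index time, so no single term contains the space. The sketch uses `fnmatchcase` as a stand-in for wildcard matching and a lowercase-and-split function as a simplified stand-in for the standard analyzer; both are approximations, not Elasticsearch code.

```python
from fnmatch import fnmatchcase


def text_field_terms(value):
    """Simplified stand-in for how an analyzed text field is split into terms."""
    return value.lower().split()


def wildcard_matches(terms, pattern):
    """A wildcard pattern must match one whole indexed term."""
    return any(fnmatchcase(term, pattern) for term in terms)


value = "hello world example"
pattern = "*hello world*"

# Analyzed field: terms are "hello", "world", "example".
on_text_field = wildcard_matches(text_field_terms(value), pattern)

# Unanalyzed field (keyword or wildcard type): the whole string is one term.
on_whole_value = wildcard_matches([value], pattern)
```

Here `on_text_field` is false and `on_whole_value` is true, which is why querying a sub-field that keeps the whole string, such as `name.wildcard` above, matches the sample document. The `name.keyword` sub-field from the dynamic mapping also keeps the whole string, for values up to 256 characters.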

How to build an inverted 1:n elasticsearch index using reindex, ingest pipeline and processors

I have started experimenting with Elasticsearch ingest pipelines and processors as a possibly faster way to build what I can describe as an "inverted index".
Here's what I'm trying to do: I have a documents index. Each document is akin to the following:
{
"id": "DOC1",
"title": "Quiz no. 1",
"questions": [
{
"question": "Who was the first person to walk on the Moon?",
"choices": [
{ "answer": "Michael Jackson", "correct": false },
{ "answer": "Neil Armstrong", "correct": true }
]
},
{
"question": "Who wrote the Macbeth?",
"choices": [
{ "answer": "William Shakespeare", "correct": true },
{ "answer": "Dante Alighieri", "correct": false },
{ "answer": "Arthur Conan Doyle", "correct": false }
]
}
]
}
I am trying to understand if there is a magic combination of reindex, pipelines and processors that can allow me to automatically build a questions index. Here's an example of what that index would look like:
[
{
"question_id": "<randomly-generated-value-1>",
"document_id": "DOC1",
"question": "Who was the first person to walk on the Moon?",
"choices": [
{ "answer": "Michael Jackson", "correct": false },
{ "answer": "Neil Armstrong", "correct": true }
]
},
{
"question_id": "<randomly-generated-value-2>",
"document_id": "DOC1",
"question": "Who wrote the Macbeth?",
"choices": [
{ "answer": "William Shakespeare", "correct": true },
{ "answer": "Dante Alighieri", "correct": false },
{ "answer": "Arthur Conan Doyle", "correct": false }
]
}
]
In the Elasticsearch documentation, it's mentioned you can perform a REINDEX using a specific pipeline. Looking up the simulate pipeline docs, I'm trying a few processors, including the foreach one, but I can't understand if the resulting documents from the pipeline are still 1:1 to the original index or 1 source document can generate multiple destination documents (which is what I need).
Here's the simulated pipeline I'm trying:
{
"pipeline": {
"description": "Inverts the documents index into a questions index",
"processors": [
{
"rename": {
"field": "id",
"target_field": "document_id",
"ignore_missing": false
}
},
{
"foreach": {
"field": "questions",
"processor": {
"rename": {
"field": "_ingest._value.question",
"target_field": "question"
}
}
}
},
{
"foreach": {
"field": "questions",
"processor": {
"rename": {
"field": "_ingest._value.choices",
"target_field": "choices"
}
}
}
},
{
"remove": {
"field": "questions"
}
}
]
}
}
This is almost working. The problem with this approach is that there is only one resulting document, which corresponds to the first question. The second question is not present in the output of the simulated pipeline, hence my doubt whether a pipeline of processors can output multiple destination documents from 1 source document, or whether we are forced to maintain a 1:1 relationship.
This answer seems to suggest what I'm trying to achieve is not possible.
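If the fan-out cannot happen inside the pipeline, one alternative is to do it on the client before indexing. The Python sketch below shows only that transformation step, under the assumption that documents have the shape shown above; the function name is hypothetical, the `question_id` is a random UUID, and reading from the source index or writing to the questions index is left out.

```python
import uuid


def explode_questions(document):
    """Yield one question document per entry in document['questions']."""
    for question in document.get("questions", []):
        yield {
            "question_id": str(uuid.uuid4()),  # randomly generated id
            "document_id": document["id"],
            "question": question["question"],
            "choices": question["choices"],
        }


quiz = {
    "id": "DOC1",
    "title": "Quiz no. 1",
    "questions": [
        {
            "question": "Who was the first person to walk on the Moon?",
            "choices": [
                {"answer": "Michael Jackson", "correct": False},
                {"answer": "Neil Armstrong", "correct": True},
            ],
        },
        {
            "question": "Who wrote the Macbeth?",
            "choices": [
                {"answer": "William Shakespeare", "correct": True},
                {"answer": "Dante Alighieri", "correct": False},
                {"answer": "Arthur Conan Doyle", "correct": False},
            ],
        },
    ],
}
question_docs = list(explode_questions(quiz))
```

Each resulting dictionary could then be sent as its own document to the questions index, for example through the bulk API.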

Multi Match Query for multiple words with operator AND

My scenario is that my application has an inline search, just like the one in the header bar of the Udemy site, and the user can type more than one word in it. I want to take the multi-word search text entered by the user and query it against multiple fields.
The fields I am querying have the following mapping:
_mapping
{
"category": {
"type": "keyword"
},
"designers": {
"type": "nested",
"properties": {
"name": {
"type": "keyword"
}
}
},
"story": {
"type": "text"
},
"foundryName": {
"type": "text",
}
}
My problem is how to do a multi-word search like "designerFirstName1 category1 foundryName1" and get results where the matched document contains each word in at least one of the fields I am searching. Also, as I add more words, the result set should shrink.
Query
{
"query": {
"bool": {
"should": [
{
"nested": {
"path": "designers",
"query": {
"match": {
"designers.name": {
"query": "designerFirstName1 category1 foundryName1",
"fuzziness": "auto"
}
}
}
}
},
{
"multi_match": {
"query": "designerFirstName1 category1 foundryName1",
"type": "cross_fields",
"fields": [
"story",
"foundryName",
"category"
]
}
}
],
"minimum_should_match": 1
}
}
}
The expected result is that this kind of document ranks higher, and that results further down no longer contain all of the words in any of the fields (as shown below):
{
"category": [
"category1",
"category2"
],
"designers": [
{
"name": "designerFirstName1 designerLastName1"
},
{
"name": "designerFirstName2 designerLastName2"
}
],
"story": "Sphinx of black quartz, judge my vow! Sex-charged fop blew my junk TV quiz.",
"foundryName": "foundryName1"
},
{
"category": [
"category2",
"category3"
],
"designers": [
{
"name": "designerFirstName1 designerLastName1"
},
{
"name": "designerFirstName2 designerLastName2"
}
],
"story": "Sphinx of black quartz, judge my vow! Sex-charged fop blew my junk TV quiz.",
"foundryName": "foundryName1"
},
{
"category": [
"category1",
"category3"
],
"designers": [
{
"name": "designerFirstName3 designerLastName1"
},
{
"name": "designerFirstName2 designerLastName2"
}
],
"story": "Sphinx of black quartz, judge my vow! Sex-charged fop blew my junk TV quiz.",
"foundryName": "foundryName1"
},
{
"category": [
"category2",
"category3"
],
"designers": [
{
"name": "designerFirstName3 designerLastName1" /*changed here comparing with the above document*/
},
{
"name": "designerFirstName2 designerLastName2"
}
],
"story": "Sphinx of black quartz, judge my vow! Sex-charged fop blew my junk TV quiz.",
"foundryName": "foundryName1"
},
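One way to express "every word must match somewhere" is a bool query with one must clause per word, where each clause accepts a match in any of the fields. The Python sketch below only builds that request body as a dictionary; the function name and parameter defaults are assumptions, and it has not been run against the mapping above. Note that `category` and `designers.name` are keyword fields, which match whole values only, so a single word such as "designerFirstName1" would not match the stored value "designerFirstName1 designerLastName1" unless the mapping is changed.

```python
def build_all_words_query(text, fields, nested_path="designers",
                          nested_field="designers.name"):
    """Build a bool query in which every word must match at least one field."""
    must = []
    for word in text.split():
        must.append({
            "bool": {
                # The word may match a top-level field or the nested field.
                "should": [
                    {"multi_match": {"query": word, "fields": fields}},
                    {
                        "nested": {
                            "path": nested_path,
                            "query": {"match": {nested_field: word}},
                        }
                    },
                ],
                "minimum_should_match": 1,
            }
        })
    return {"query": {"bool": {"must": must}}}


body = build_all_words_query(
    "designerFirstName1 category1 foundryName1",
    ["story", "foundryName", "category"],
)
```

Because each additional word adds another must clause, typing more words can only narrow the result set.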
