Elastic Search Query for Relevancy Given a Phrase Rather Than Just One Word

Elastic Search querying/boosting is not working as I would expect it to...
I have an index where documents look like this:
{
"entity_id" : "x",
"entity_name" : "y",
"description": "search engine",
"keywords" : [
"Google"
]
}
I'm trying to get the document to show up with a relevancy score when querying with a search phrase that contains one of the keywords, like this:
{
"query": {
"bool": {
"should": [
{
"query_string": {
"query": "What are some of products for Google?",
"boost": 10,
"fields": ["keywords"]
}
}
],
"filter": {
"term" : { "entity_name" : "y" }
}
}
}
}
The problem is that my results are not as expected, for three reasons:
The result contains hits that have no relevance to "Google", "Products", or any other word in the search phrase.
The document that I expect to be returned has a _score of 0.0.
The document that I expect to be returned has a mysterious "_ignored" : [ "description.keyword" ] entry.
The response looks like this:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.0,
"hits" : [
{
"_score" : 0.0,
"_source": {
"entity_id" : "a",
"entity_name" : "y",
"description": "some other entity",
"keywords": ["Other"]
}
},
{
"_score" : 0.0,
"_ignored" : [
"description.keyword"
],
"_source": {
"entity_id" : "x",
"entity_name" : "y",
"description": "search engine",
"keywords": ["Google"]
}
}
]
}
}
What am I doing wrong?

TL;DR
You are using the wrong query type: query_string is not suitable for your needs. Consider using match instead.
To understand
First and foremost:
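_ignored is a metadata field that records every field of a document that was ignored at index time, for example because the value was malformed and ignore_malformed was enabled, or because a keyword value was longer than ignore_above. Those fields are therefore not searchable for that document. [doc]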
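To see which documents are affected, the _ignored field itself can be queried. The following is a minimal sketch that only builds the search bodies as Python dictionaries; the helper names are made up for illustration, and sending the bodies to a cluster is left out.

```python
import json


def docs_with_ignored_fields() -> dict:
    """Search body matching every document that has at least one ignored field."""
    return {"query": {"exists": {"field": "_ignored"}}}


def docs_ignoring(field_name: str) -> dict:
    """Search body matching documents in which `field_name` was ignored at index time."""
    return {"query": {"term": {"_ignored": field_name}}}


if __name__ == "__main__":
    # The field name is taken from the response shown in the question.
    print(json.dumps(docs_ignoring("description.keyword"), indent=2))
```

Running the second body against the index from the question should list the documents whose description.keyword value was skipped.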
Why is my score 0:
It is because of the query_string query, which the documentation describes as follows: [doc]
Returns documents based on a provided query string, using a parser with a strict syntax.
For example:
"query": "(new york city) OR (big apple)"
The query_string query splits (new york city) OR (big apple) into two parts: new york city and big apple.
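Because of that strict syntax, free text typed by a user can contain characters that the parser reads as operators. In the phrase from the question, the trailing ? is a single-character wildcard, which is a likely reason why "Google?" does not match the indexed keyword. If query_string has to be kept, one option is to escape the reserved characters first. The sketch below assumes the reserved-character list from the query_string documentation; the function name is hypothetical.

```python
import re

# Characters and operators that query_string treats as syntax.
# Each of them can be neutralised with a leading backslash.
_ESCAPABLE = re.compile(r'([+\-=!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_query_string(text: str) -> str:
    """Return `text` with query_string operators escaped.

    < and > cannot be escaped at all, so they are removed instead.
    """
    text = text.replace("<", "").replace(">", "")
    return _ESCAPABLE.sub(r"\\\1", text)


if __name__ == "__main__":
    print(escape_query_string("What are some of products for Google?"))
```

With the question mark escaped, the last word is analyzed as a plain term rather than a wildcard pattern.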
To illustrate my point, look at the example below:
POST /so_relevance_score/_doc
{
"entity_id" : "x",
"entity_name" : "y",
"description": "search engine",
"keywords" : [
"Google"
]
}
POST /so_relevance_score/_doc
{
"entity_id" : "x",
"entity_name" : "y",
"description": "consumer electronic",
"keywords" : [
"Apple"
]
}
GET /so_relevance_score/_search
{
"query": {
"bool": {
"should": [
{
"query_string": {
"query": "What are some of products for Google?",
"boost": 10,
"fields": ["keywords"]
}
}
],
"filter": {
"term" : { "entity_name" : "y" }
}
}
}
}
will return the following results:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.0,
"hits" : [
{
"_index" : "so_relevance_score",
"_type" : "_doc",
"_id" : "0uYgP34Bpf2xEaYqLYai",
"_score" : 0.0,
"_source" : {
"entity_id" : "x",
"entity_name" : "y",
"description" : "search engine",
"keywords" : [
"Google"
]
}
},
{
"_index" : "so_relevance_score",
"_type" : "_doc",
"_id" : "1eYmP34Bpf2xEaYquoZC",
"_score" : 0.0,
"_source" : {
"entity_id" : "x",
"entity_name" : "y",
"description" : "consumer electronic",
"keywords" : [
"Apple"
]
}
}
]
}
}
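The score is 0 for both documents, which means Elasticsearch considers them equally relevant to this query: the query_string clause matched neither of them, and because the bool query also contains a filter, the should clause is optional, so every document that passes the filter is returned.
But if you change the query type to match: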
GET /so_relevance_score/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"keywords": "What are some of products for Google?"
}
}
],
"filter": {
"term" : { "entity_name" : "y" }
}
}
}
}
I get:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.6931471,
"hits" : [
{
"_index" : "so_relevance_score",
"_type" : "_doc",
"_id" : "0uYgP34Bpf2xEaYqLYai",
"_score" : 0.6931471,
"_source" : {
"entity_id" : "x",
"entity_name" : "y",
"description" : "search engine",
"keywords" : [
"Google"
]
}
},
{
"_index" : "so_relevance_score",
"_type" : "_doc",
"_id" : "1eYmP34Bpf2xEaYquoZC",
"_score" : 0.0,
"_source" : {
"entity_id" : "x",
"entity_name" : "y",
"description" : "consumer electronic",
"keywords" : [
"Apple"
]
}
}
]
}
}
With a relevance score!
If you want to fine-tune your results, I suggest diving into the documentation for the query types. [doc]
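Note that the match version still returns the "Apple" document with a score of 0.0, because a should clause next to a filter is optional. If unrelated documents should be dropped entirely, one option is to move the match clause into must (setting minimum_should_match to 1 would be another). A minimal sketch that builds such a body, with a hypothetical helper name:

```python
import json


def build_keyword_search(phrase: str, entity_name: str) -> dict:
    """Build a search body that requires a keyword match.

    `must` makes the match clause mandatory, so documents that only
    pass the filter are no longer returned with a score of 0.
    """
    return {
        "query": {
            "bool": {
                "must": [{"match": {"keywords": phrase}}],
                "filter": [{"term": {"entity_name": entity_name}}],
            }
        }
    }


if __name__ == "__main__":
    body = build_keyword_search("What are some of products for Google?", "y")
    print(json.dumps(body, indent=2))
```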

Related

Range filter for count of documents with the same value for a field

In my index my-books, each document represents a book and has a field authorId, which uniquely represents the author of the book. I want to run a search query with a range filter on the total number of books authored by the book's author.
For example, say I have four authors: A, B, C, and D.
A is the author of books a1, a2, a3.
B is the author of book b1.
C is the author of books c1, c2.
D is the author of books d1, d2, d3, d4.
Let's say I want to retrieve all books for which the number of books written by the same author is greater than 1 but less than 4. Then my result hits are [a1, a2, a3, c1, c2].
How do I write such a query?

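You need to use:
a terms aggregation to group by author
top_hits to get the documents under each author
bucket_selector to keep only the author buckets whose doc count is less than 4 (extend the script to params.values > 1 && params.values < 4 to also enforce the lower bound from the question)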
{
"aggs": {
"NAME": {
"terms": {
"field": "author.keyword",
"size": 10
},
"aggs": {
"books": {
"top_hits": {
"size": 10
}
},
"final_filter": {
"bucket_selector": {
"buckets_path": {
"values": "_count"
},
"script": "params.values < 4"
}
}
}
}
}
}
Result
"aggregations" : {
"NAME" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "A",
"doc_count" : 2,
"books" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "index148",
"_type" : "_doc",
"_id" : "-_pOUHoBVZyA6L_G1XrM",
"_score" : 1.0,
"_source" : {
"book" : "a1",
"author" : "A"
}
},
{
"_index" : "index148",
"_type" : "_doc",
"_id" : "_PpPUHoBVZyA6L_GL3q5",
"_score" : 1.0,
"_source" : {
"book" : "a3",
"author" : "A"
}
}
]
}
}
},
{
"key" : "B",
"doc_count" : 1,
"books" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "index148",
"_type" : "_doc",
"_id" : "_fpPUHoBVZyA6L_GWHpg",
"_score" : 1.0,
"_source" : {
"book" : "b1",
"author" : "B"
}
}
]
}
}
},
{
"key" : "C",
"doc_count" : 1,
"books" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "index148",
"_type" : "_doc",
"_id" : "_vpPUHoBVZyA6L_Gmnoj",
"_score" : 1.0,
"_source" : {
"book" : "c1",
"author" : "C"
}
}
]
}
}
}
]
}
}
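The result above still contains authors with a single book, because the script only checks the upper bound. The sketch below reproduces the intended logic locally on the data from the question and shows a bucket_selector with both bounds; the constant and function names are illustrative and nothing here talks to a cluster.

```python
from collections import Counter

# (book, author) pairs from the question.
BOOKS = [
    ("a1", "A"), ("a2", "A"), ("a3", "A"),
    ("b1", "B"),
    ("c1", "C"), ("c2", "C"),
    ("d1", "D"), ("d2", "D"), ("d3", "D"), ("d4", "D"),
]

# bucket_selector with both bounds, to be used in place of "final_filter".
FINAL_FILTER = {
    "bucket_selector": {
        "buckets_path": {"values": "_count"},
        "script": "params.values > 1 && params.values < 4",
    }
}


def books_by_author_count(books, lower=1, upper=4):
    """Keep books whose author wrote more than `lower` and fewer than `upper` books."""
    counts = Counter(author for _, author in books)
    return [title for title, author in books if lower < counts[author] < upper]


if __name__ == "__main__":
    print(books_by_author_count(BOOKS))
```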

No matches when querying Elastic Search

I'm trying to run a query against Elasticsearch. When I run this query
GET accounts/_search/
{
"query": {
"term": {
"address_line_1": "1000"
}
}
}
I get back multiple records like
"hits" : [
{
"_index" : "accounts",
"_type" : "_doc",
"_id" : "...",
"_score" : 8.355149,
"_source" : {
"state_id" : 35,
"first_name" : "...",
"last_name" : "...",
"middle_name" : "P",
"dob" : "...",
"status" : "ACTIVE",
"address_line_1" : "1000 BROADROCK CT",
"address_line_2" : "",
"address_city" : "PARMA",
"address_zip" : "",
"address_zip_plus_4" : ""
}
},
But when I try to extend the search term as shown below, I don't get any matches
GET accounts/_search/
{
"query": {
"term": {
"address_line_1": "1000 B"
}
}
}
The response is
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}

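The term query looks for exact matches and does not analyze the search text. Your address_line_* fields were most probably indexed with the standard analyzer, which splits the text into tokens and lowercases them, so "1000 BROADROCK CT" is stored as the tokens 1000, broadrock, and ct. The term 1000 matches one of those tokens, but 1000 B matches none of them.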
So either use
GET accounts/_search/
{
"query": {
"match": { <--
"address_line_1": "1000 B"
}
}
}
which analyzes the query text the same way as the field and therefore does not really 'care' whether B is lower or upper case, or adjust your field analyzers such that the capitalization is preserved.
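The difference between the two query types can be reproduced without a cluster. The sketch below uses a deliberately rough stand-in for the standard analyzer (split on non-alphanumeric characters, then lowercase); the real analyzer follows Unicode segmentation rules, so treat this as an approximation for plain ASCII addresses only.

```python
import re


def standard_analyze(text: str) -> list:
    """Rough approximation of the standard analyzer for ASCII text."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def term_matches(indexed_value: str, term: str) -> bool:
    """term query: the search text is compared to the indexed tokens as-is."""
    return term in standard_analyze(indexed_value)


def match_matches(indexed_value: str, query_text: str) -> bool:
    """match query (default OR operator): the search text is analyzed first."""
    indexed = set(standard_analyze(indexed_value))
    return any(token in indexed for token in standard_analyze(query_text))


if __name__ == "__main__":
    address = "1000 BROADROCK CT"
    print(standard_analyze(address))
    print(term_matches(address, "1000"), term_matches(address, "1000 B"))
    print(match_matches(address, "1000 B"))
```

In this model the match query succeeds because the token 1000 is found; the token b on its own does not match broadrock. If prefix behaviour on the last word is wanted, match_phrase_prefix may be closer to the intent.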

How can I extend an Elasticsearch date range histogram aggregation query?

Hi, I have an Elasticsearch index named mep-report.
Each document has a status field. The possible values for the status field are "ENROUTE", "SUBMITTED", "DELIVERED", and "FAILED". Below is a sample of the index with 6 documents.
{
"took" : 10,
"timed_out" : false,
"_shards" : {
"total" : 13,
"successful" : 13,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1094313,
"max_score" : 1.0,
"hits" : [
{
"_index" : "mep-reports-2019.09.11",
"_type" : "doc",
"_id" : "68e8e03f-baf8-4bfc-a920-58e26edf835c-353899837500",
"_score" : 1.0,
"_source" : {
"status" : "ENROUTE",
"#timestamp" : "2019-09-11T10:21:26.000Z"
}
},
{
"_index" : "mep-reports-2019.09.11",
"_type" : "doc",
"_id" : "68e8e03f-baf8-4bfc-a920-58e26edf835c-353899837501",
"_score" : 1.0,
"_source" : {
"status" : "ENROUTE",
"#timestamp" : "2019-09-11T10:21:26.000Z"
}
},
{
"_index" : "mep-reports-2019.09.11",
"_type" : "doc",
"_id" : "68e8e03f-baf8-4bfc-a920-58e26edf835c-353899837502",
"_score" : 1.0,
"_source" : {
"status" : "SUBMITTED",
"#timestamp" : "2019-09-11T10:21:26.000Z"
}
},
{
"_index" : "mep-reports-2019.09.11",
"_type" : "doc",
"_id" : "68e8e03f-baf8-4bfc-a920-58e26edf835c-353899837503",
"_score" : 1.0,
"_source" : {
"status" : "DELIVERED",
"#timestamp" : "2019-09-11T10:21:26.000Z"
}
},
{
"_index" : "mep-reports-2019.09.11",
"_type" : "doc",
"_id" : "68e8e03f-baf8-4bfc-a920-58e26edf835c-353899837504",
"_score" : 1.0,
"_source" : {
"status" : "FAILED",
"#timestamp" : "2019-09-11T10:21:26.000Z"
}
},
{
"_index" : "mep-reports-2019.09.11",
"_type" : "doc",
"_id" : "68e8e03f-baf8-4bfc-a920-58e26edf835c-353899837504",
"_score" : 1.0,
"_source" : {
"status" : "FAILED",
"#timestamp" : "2019-09-11T10:21:26.000Z"
}
}
]
}
}
I would like to get a histogram aggregation that breaks each bucket down into message_processed, message_delivered, and message_failed:
message_processed: 3 (2 documents with status ENROUTE + 1 document with status SUBMITTED)
message_delivered: 1 (1 document with status DELIVERED)
message_failed: 2 (2 documents with status FAILED)
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 13,
"successful" : 13,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 21300,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"performance_over_time" : {
"buckets" : [
{
"key_as_string" : "2020-02-21",
"key" : 1582243200000,
"doc_count" : 6,
"message_processed": 3,
"message_delivered": 1,
"message_failed": 2
}
]
}
}
}
The following is my current query. I would like to modify it to get the additional statistics message_processed, message_delivered, and message_failed. Kindly let me know how.
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"#timestamp": {
"from": "2020-02-21T00:00Z",
"to": "2020-02-21T23:59:59.999Z",
"include_lower": true,
"include_upper": true,
"format": "yyyy-MM-dd'T'HH:mm:ss.SSSZ ||yyyy-MM-dd'T'HH:mmZ",
"boost": 1.0
}
}
}
],
"adjust_pure_negative": true,
"boost": 1.0
}
},
"aggregations": {
"performance_over_time": {
"date_histogram": {
"field": "#timestamp",
"format": "yyyy-MM-dd",
"interval": "1d",
"offset": 0,
"order": {
"_key": "asc"
},
"keyed": false,
"min_doc_count": 0
}
}
}
}

You are almost there with the query; you just need to add a Terms Aggregation. Looking at your request, I've come up with a Scripted Terms Aggregation.
I've also changed the date histogram's interval parameter to calendar_interval so that the buckets follow calendar days.
Query Request:
POST <your_index_name>/_search
{
"size": 0,
"query":{
"bool":{
"must":[
{
"range":{
"#timestamp":{
"from":"2019-09-10",
"to":"2019-09-12",
"include_lower":true,
"include_upper":true,
"boost":1.0
}
}
}
],
"adjust_pure_negative":true,
"boost":1.0
}
},
"aggs":{
"message_processed":{
"date_histogram": {
"field": "#timestamp",
"calendar_interval": "1d" <----- Note this
},
"aggs": {
"my_messages": {
"terms": {
"script": { <----- Core Logic of Terms Agg
"source": """
if(doc['status'].value=="ENROUTE" || doc['status'].value == "SUBMITTED"){
return "message_processed";
}else if(doc['status'].value=="DELIVERED"){
return "message_delivered";
}else {
return "message_failed";
}
""",
"lang": "painless"
},
"size": 10
}
}
}
}
}
}
Note that the core logic you are looking for is inside the scripted terms aggregation. The logic is self-explanatory if you go through it. Feel free to modify it to fit your needs.
For the sample data you've shared, you would get the result in the format below:
Response:
{
"took" : 144,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"message_processed" : {
"buckets" : [
{
"key_as_string" : "2019-09-11T00:00:00.000Z",
"key" : 1568160000000,
"doc_count" : 6,
"my_messages" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "message_processed",
"doc_count" : 3
},
{
"key" : "message_failed",
"doc_count" : 2
},
{
"key" : "message_delivered",
"doc_count" : 1
}
]
}
}
]
}
}
}
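If scripting is not desired, the same grouping can probably be expressed with a filters aggregation that has one named bucket per group. This is a swapped-in alternative, not the technique used in the answer above. The sketch assumes that status is mapped as a keyword field and that only the four listed statuses occur; it builds the aggregation body and checks the expected counts locally on the six sample documents.

```python
from collections import Counter

# Which raw statuses feed which output bucket (taken from the question).
STATUS_GROUPS = {
    "message_processed": ["ENROUTE", "SUBMITTED"],
    "message_delivered": ["DELIVERED"],
    "message_failed": ["FAILED"],
}


def filters_aggregation(groups: dict, field: str = "status") -> dict:
    """Build a filters aggregation with one named bucket per status group."""
    return {
        "filters": {
            "filters": {
                name: {"terms": {field: values}} for name, values in groups.items()
            }
        }
    }


def count_locally(statuses: list, groups: dict) -> Counter:
    """Mimic the bucket counts for a list of status values."""
    lookup = {status: name for name, values in groups.items() for status in values}
    return Counter(lookup[s] for s in statuses if s in lookup)


if __name__ == "__main__":
    sample = ["ENROUTE", "ENROUTE", "SUBMITTED", "DELIVERED", "FAILED", "FAILED"]
    print(dict(count_locally(sample, STATUS_GROUPS)))
```

The dictionary returned by filters_aggregation would take the place of the scripted terms aggregation under the date histogram's aggs.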

Elasticsearch, terms aggs according to sibling nested fields

Elasticsearch v7.5
Hello and good day!
We have 2 indices named socialmedia and influencers
Sample contents:
socialmedia:
{
'_id' : 1001,
'title' : "Title 1",
'smp_id' : 1,
"latest" : [
{
"soc_mm_score" : "5",
}
]
},
{
'_id' : 1002,
'title' : "Title 2",
'smp_id' : 2,
"latest" : [
{
"soc_mm_score" : "10",
}
]
},
{
'_id' : 1003,
'title' : "Title 3",
'smp_id' : 3,
"latest" : [
{
"soc_mm_score" : "35",
}
]
},
{
'_id' : 1004,
'title' : "Title 4",
'smp_id' : 2,
"latest" : [
{
"soc_mm_score" : "30",
}
]
}
//omitted some other fields
influencers:
{
'_id' : 1,
'name' : "John",
'smp_id' : 1
},
{
'_id' : 2,
'name' : "Peter",
'smp_id' : 2
},
{
'_id' : 3,
'name' : "Mark",
'smp_id' : 3
}
Now I have this simple query that determines which documents in the socialmedia index have the highest latest.soc_mm_score value, and also displays their corresponding influencers, determined by the smp_id:
GET socialmedia/_search
{
"size": 0,
"_source": "latest",
"query": {
"match_all": {}
},
"aggs": {
"LATEST": {
"nested": {
"path": "latest"
},
"aggs": {
"MM_SCORE": {
"terms": {
"field": "latest.soc_mm_score",
"order": {
"_key": "desc"
},
"size": 3
},
"aggs": {
"REVERSE": {
"reverse_nested": {},
"aggs": {
"SMP_ID": {
"top_hits": {
"_source": ["smp_id"],
"size": 1
}
}
}
}
}
}
}
}
}
}
SAMPLE OUTPUT:
"aggregations" : {
"LATEST" : {
"doc_count" : //omitted,
"MM_SCORE" : {
"doc_count_error_upper_bound" : //omitted,
"sum_other_doc_count" : //omitted,
"buckets" : [
{
"key" : 35,
"doc_count" : 1,
"REVERSE" : {
"doc_count" : 1,
"SMP_ID" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "socialmedia",
"_type" : "index",
"_id" : "1003",
"_score" : 1.0,
"_source" : {
"smp_id" : "3"
}
}
]
}
}
}
},
{
"key" : 30,
"doc_count" : 1,
"REVERSE" : {
"doc_count" : 1,
"SMP_ID" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "socialmedia",
"_type" : "index",
"_id" : "1004",
"_score" : 1.0,
"_source" : {
"smp_id" : "2"
}
}
]
}
}
}
},
{
"key" : 10,
"doc_count" : 1,
"REVERSE" : {
"doc_count" : 1,
"SMP_ID" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "socialmedia",
"_type" : "index",
"_id" : "1002",
"_score" : 1.0,
"_source" : {
"smp_id" : "2"
}
}
]
}
}
}
}
]
}
}
}
With the query above, I was able to display which documents have the highest latest.soc_mm_score values.
The sample output above only displays DOCUMENTS, implying that the influencers (a.k.a. smp_id) related to them are the TOP INFLUENCERS according to latest.soc_mm_score.
Ideally, just using this aggs query,
"terms" : {
"field" : "smp_id"
}
portrays the concept of which influencers are the top according to the doc_count.
Now, running the terms aggregation on latest.soc_mm_score displays the TOP DOCUMENTS:
"terms" : {
"field" : "latest.soc_mm_score"
}
REAL OBJECTIVE:
I want to display the TOP INFLUENCERS according to latest.soc_mm_score in the socialmedia index. Elasticsearch can count all the documents for each unique smp_id; is there a way for it to sum all latest.soc_mm_score values per smp_id and order the terms by that sum?
My objective above should output these:
smp_id 2 as the Top Influencer because he has 2 posts (with soc_mm_score of 30 and 10), adding them gets him 40 soc_mm_score
smp_id 3 as the 2nd Top Influencer, he has 1 post with 35 soc_mm_score
smp_id 1 as the 3rd Top Influencer, he has 1 post with 5 soc_mm_score
Is there a proper query to meet this objective?

FINALLY! FOUND AN ANSWER!!!
"aggs": {
"INFS": {
"terms": {
"field": "smp_id.keyword",
"order": {
"LATEST > SUM_SVALUE": "desc"
}
},
"aggs": {
"LATEST": {
"nested": {
"path": "latest"
},
"aggs": {
"SUM_SVALUE": {
"sum" : {
"field": "latest.soc_mm_score"
}
}
}
}
}
}
}
This sorts the smp_id buckets by the sum of their latest.soc_mm_score values, highest first.
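As a sanity check on what the aggregation should return, the same sum-and-sort can be done locally on the four sample documents from the question. The function name is illustrative, and the string scores are converted to integers here on the assumption that the field is mapped as a number in the index.

```python
from collections import defaultdict

# Sample socialmedia documents from the question (other fields omitted).
POSTS = [
    {"_id": 1001, "smp_id": 1, "latest": [{"soc_mm_score": "5"}]},
    {"_id": 1002, "smp_id": 2, "latest": [{"soc_mm_score": "10"}]},
    {"_id": 1003, "smp_id": 3, "latest": [{"soc_mm_score": "35"}]},
    {"_id": 1004, "smp_id": 2, "latest": [{"soc_mm_score": "30"}]},
]


def top_influencers(posts: list) -> list:
    """Sum latest.soc_mm_score per smp_id and sort by that sum, highest first."""
    totals = defaultdict(int)
    for post in posts:
        for entry in post["latest"]:
            totals[post["smp_id"]] += int(entry["soc_mm_score"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for smp_id, total in top_influencers(POSTS):
        print(smp_id, total)
```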

How to compare 2 fields in Elasticsearch

OK, I have this example result from my data in Elasticsearch:
"hits" : [
{
"_index" : "solutionpedia_data",
"_type" : "doc",
"_id" : "nyODP24BA840z5O6WguE",
"_score" : 46.63439,
"_source" : {
"ID" : "1",
"PRODUCT_NAME" : "ATM",
"UPDATEDATE" : "13-FEB-18",
"PROPOSAL" : [
{
}
],
"MARKETING_KIT" : [ ],
"VIDEO" : [ ]
}
},
{
"_index" : "classification",
"_type" : "doc",
"_id" : "5M-r5m4BNYha4zuWalJa",
"_score" : 39.25268,
"_source" : {
"productId" : "1",
"productName" : "ATM",
"productIconUrl" : "media/8ae0f0c3-1402-4559-901e-7ec9b874ce68-prod032.webp",
"type" : "nonconnectivity",
"businessLineId" : "",
"subsidiaries" : "",
"segment" : [],
"productType" : "Efisien",
"tariff" : null,
"tags" : [ ],
"contact" : [],
"mediaId" : [
"Med391"
],
"documentId" : [
"doc260",
"doc261"
],
"createdAt" : "2019-09-22T05:22:46.956Z",
"updatedAt" : "2019-09-22T05:22:46.956Z",
"totalClick" : 46
}
}
]
This is a result from my alias. Can we search for the same data based on 2 different fields? In the example above they are the ID and productId fields. Can we put these 2 objects into one bucket, or compare them?
I tried some aggregations, but got nothing:
{
"query": {
"match_all": {}
},
"size": 0,
"aggregations": {
"product catalog": {
"terms": {
"field": "productId.keyword",
"min_doc_count": 2,
"size": 100
},
"aggregations": {
"product solped": {
"terms": {
"field": "ID.keyword",
"min_doc_count": 2
}
}
}
}
}
}
result :
{
"took" : 9,
"timed_out" : false,
"_shards" : {
"total" : 10,
"successful" : 10,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1276,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"product catalog" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
}
}

You can achieve this with a Scripted Bucket Aggregation, using script logic to define your buckets (pseudo code: if field a exists value of field a, if field b exists value of field b).
Another (and better) way to achieve this is to change your data model and indexing logic on the Elasticsearch side so that the information is stored in a field of the same name in both indices.
You could also consider the alias data type to make fields with different names in different indices accessible under one common field name. This is also the approach Elastic takes with the Elastic Common Schema specification.
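The pseudo code for the scripted bucket can be tried out locally before writing it as a script. The sketch below picks whichever of the two fields is present as the bucket key and groups the two sample documents from the question; the function names are hypothetical and the documents are trimmed to the relevant fields.

```python
from collections import defaultdict


def product_key(doc: dict):
    """Return the value of ID if present, otherwise productId, otherwise None."""
    if "ID" in doc:
        return doc["ID"]
    if "productId" in doc:
        return doc["productId"]
    return None


def bucket_by_product(docs: list) -> dict:
    """Group documents from both indices under one common product key."""
    buckets = defaultdict(list)
    for doc in docs:
        key = product_key(doc)
        if key is not None:
            buckets[key].append(doc)
    return dict(buckets)


if __name__ == "__main__":
    docs = [
        {"_index": "solutionpedia_data", "ID": "1", "PRODUCT_NAME": "ATM"},
        {"_index": "classification", "productId": "1", "productName": "ATM"},
    ]
    for key, members in bucket_by_product(docs).items():
        print(key, [d["_index"] for d in members])
```

Both documents end up in the bucket for key "1", which is the behaviour the nested terms aggregations in the question could not produce, since each document only carries one of the two fields.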
