How to compare 2 fields in Elasticsearch - elasticsearch

OK, here is an example result from my data in Elasticsearch:
"hits" : [
{
"_index" : "solutionpedia_data",
"_type" : "doc",
"_id" : "nyODP24BA840z5O6WguE",
"_score" : 46.63439,
"_source" : {
"ID" : "1",
"PRODUCT_NAME" : "ATM",
"UPDATEDATE" : "13-FEB-18",
"PROPOSAL" : [
{
}
],
"MARKETING_KIT" : [ ],
"VIDEO" : [ ]
}
},
{
"_index" : "classification",
"_type" : "doc",
"_id" : "5M-r5m4BNYha4zuWalJa",
"_score" : 39.25268,
"_source" : {
"productId" : "1",
"productName" : "ATM",
"productIconUrl" : "media/8ae0f0c3-1402-4559-901e-7ec9b874ce68-prod032.webp",
"type" : "nonconnectivity",
"businessLineId" : "",
"subsidiaries" : "",
"segment" : [],
"productType" : "Efisien",
"tariff" : null,
"tags" : [ ],
"contact" : [],
"mediaId" : [
"Med391"
],
"documentId" : [
"doc260",
"doc261"
],
"createdAt" : "2019-09-22T05:22:46.956Z",
"updatedAt" : "2019-09-22T05:22:46.956Z",
"totalClick" : 46
}
}
]
This is the result from my alias. Can we search for the same data based on 2 different fields? In the example above these are the ID and productId fields. Can we put these 2 objects in one bucket, or compare them?
I tried with some aggregations but got nothing:
{
"query": {
"match_all": {}
},
"size": 0,
"aggregations": {
"product catalog": {
"terms": {
"field": "productId.keyword",
"min_doc_count": 2,
"size": 100
},
"aggregations": {
"product solped": {
"terms": {
"field": "ID.keyword",
"min_doc_count": 2
}
}
}
}
}
}
The result:
{
"took" : 9,
"timed_out" : false,
"_shards" : {
"total" : 10,
"successful" : 10,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1276,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"product catalog" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
}
}

You can achieve this with a scripted bucket (terms) aggregation, using script logic to define your buckets (pseudo code: if field a exists, use the value of field a; if field b exists, use the value of field b).
Another (and better) way to achieve this is to change your data model and indexing logic on the Elasticsearch side and store the information in a field of the same name.
You could also consider the alias field type to make fields with different names in different indices accessible under one common field name. This is also the approach Elastic takes with the Elastic Common Schema specification.
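A minimal sketch of the scripted approach, assuming both indices sit behind one alias (or are queried together) and that keyword sub-fields exist for ID and productId under the default dynamic mapping:

```json
GET solutionpedia_data,classification/_search
{
  "size": 0,
  "aggs": {
    "product": {
      "terms": {
        "script": {
          "lang": "painless",
          "source": "doc.containsKey('ID.keyword') && doc['ID.keyword'].size() > 0 ? doc['ID.keyword'].value : doc['productId.keyword'].value"
        },
        "min_doc_count": 2,
        "size": 100
      }
    }
  }
}
```

Each bucket with doc_count >= 2 then groups a solutionpedia_data document and its classification counterpart under the same id value. Note that scripted bucketing runs per document and is noticeably slower than a plain terms aggregation on a shared field name.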

Related

Elastic Search Query for Relevancy Given a Phrase Rather Than Just One Word

Elastic Search querying/boosting is not working as I would expect it to...
I have an index where documents look like this:
{
"entity_id" : "x",
"entity_name" : "y",
"description": "search engine",
"keywords" : [
"Google"
]
}
I'm trying to get the document to show up with a relevancy score when querying with a search phrase that contains one of the keywords, like this:
{
"query": {
"bool": {
"should": [
{
"query_string": {
"query": "What are some of products for Google?",
"boost": 10,
"fields": ["keywords"]
}
}
],
"filter": {
"term" : { "entity_name" : "y" }
}
}
}
}
The problem is that my results are not as expected for three reasons:
The result contains hits that do not have any relevance to "Google" or "Products" or any other words in the search phrase.
The document that I am expecting to get returned has _score = 0.0.
The document that I am expecting to get returned has a mysterious "_ignored" : [ "description.keyword" ].
The response looks like this:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.0,
"hits" : [
{
"_score" : 0.0,
"_source": {
"entity_id" : "a",
"entity_name" : "y",
"description": "some other entity",
"keywords": ["Other"]
}
},
{
"_score" : 0.0,
"_ignored" : [
"description.keyword"
],
"_source": {
"entity_id" : "x",
"entity_name" : "y",
"description": "search engine",
"keywords": ["Google"]
}
}
]
}
}
What am I doing wrong?
TL;DR
You are using the wrong query type: query_string is not suitable for your needs; consider match instead.
To understand
First and foremost:
_ignored is a field that tracks all the fields that were malformed at index time, and are thus ignored at search time. [doc]
Why is my score 0:
It is because of the query_string query. [doc]
Returns documents based on a provided query string, using a parser with a strict syntax.
e.g.:
"query": "(new york city) OR (big apple)"
The query_string query splits (new york city) OR (big apple) into two parts: new york city and big apple.
To illustrate my point, look at the example below:
POST /so_relevance_score/_doc
{
"entity_id" : "x",
"entity_name" : "y",
"description": "search engine",
"keywords" : [
"Google"
]
}
POST /so_relevance_score/_doc
{
"entity_id" : "x",
"entity_name" : "y",
"description": "consumer electronic",
"keywords" : [
"Apple"
]
}
GET /so_relevance_score/_search
{
"query": {
"bool": {
"should": [
{
"query_string": {
"query": "What are some of products for Google?",
"boost": 10,
"fields": ["keywords"]
}
}
],
"filter": {
"term" : { "entity_name" : "y" }
}
}
}
}
will return the following results:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.0,
"hits" : [
{
"_index" : "so_relevance_score",
"_type" : "_doc",
"_id" : "0uYgP34Bpf2xEaYqLYai",
"_score" : 0.0,
"_source" : {
"entity_id" : "x",
"entity_name" : "y",
"description" : "search engine",
"keywords" : [
"Google"
]
}
},
{
"_index" : "so_relevance_score",
"_type" : "_doc",
"_id" : "1eYmP34Bpf2xEaYquoZC",
"_score" : 0.0,
"_source" : {
"entity_id" : "x",
"entity_name" : "y",
"description" : "consumer electronic",
"keywords" : [
"Apple"
]
}
}
]
}
}
The score is 0 for both documents, which means both are equally relevant for this query as far as Elasticsearch is concerned.
But if you change the query type to match:
GET /so_relevance_score/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"keywords": "What are some of products for Google?"
}
}
],
"filter": {
"term" : { "entity_name" : "y" }
}
}
}
}
I get:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.6931471,
"hits" : [
{
"_index" : "so_relevance_score",
"_type" : "_doc",
"_id" : "0uYgP34Bpf2xEaYqLYai",
"_score" : 0.6931471,
"_source" : {
"entity_id" : "x",
"entity_name" : "y",
"description" : "search engine",
"keywords" : [
"Google"
]
}
},
{
"_index" : "so_relevance_score",
"_type" : "_doc",
"_id" : "1eYmP34Bpf2xEaYquoZC",
"_score" : 0.0,
"_source" : {
"entity_id" : "x",
"entity_name" : "y",
"description" : "consumer electronic",
"keywords" : [
"Apple"
]
}
}
]
}
}
With a relevance score!
If you want to fine-tune your results, I suggest diving into the documentation for query types. [doc]

Elasticsearch: retrieve only document _id where field doesn't exist

I would like to retrieve all document _ids (without other fields) where field "name" doesn't exist:
I know I can search for where field "name" doesn't exist like this:
"query": {
"bool": {
"must_not": {
"exists": {
"field": "name"
}
}
}
}
and I think that to get only the _id of the document, without any other fields, I need to use (correct me if I'm wrong):
"fields": []
How do I combine these 2 parts to make a query that works?
You can just add _source set to false, since Elasticsearch returns the entire JSON object in that field by default:
"_source": false,
"query":{
...
}
and this will retrieve just the metadata from your specified index, so your hits array will contain _index, _type, _id and _score for each result, e.g.:
{
"took" : 11,
"timed_out" : false,
"_shards" : {
"total" : 12,
"successful" : 12,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 20,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "filebeat-7.8.1-2021.01.28",
"_type" : "_doc"
"_id" : "SomeUniqeuId86aa",
"_score" : 1.0
},
{
"_index" : "filebeat-7.8.1-2021.01.28",
"_type" : "_doc"
"_id" : "An0therrUniqueiD",
"_score" : 1.0
}
]
}
}
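Putting the two pieces together, the full request might look like this (index name assumed):

```json
GET my-index/_search
{
  "_source": false,
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "name"
        }
      }
    }
  }
}
```

Add a "size" parameter if you need more than the default 10 hits per page.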

No matches when querying Elastic Search

I'm trying to run a query against Elasticsearch. When I run this query
GET accounts/_search/
{
"query": {
"term": {
"address_line_1": "1000"
}
}
}
I get back multiple records like
"hits" : [
{
"_index" : "accounts",
"_type" : "_doc",
"_id" : "...",
"_score" : 8.355149,
"_source" : {
"state_id" : 35,
"first_name" : "...",
"last_name" : "...",
"middle_name" : "P",
"dob" : "...",
"status" : "ACTIVE",
"address_line_1" : "1000 BROADROCK CT",
"address_line_2" : "",
"address_city" : "PARMA",
"address_zip" : "",
"address_zip_plus_4" : ""
}
},
But when I try to expand it to include more, like below, I don't get any matches:
GET accounts/_search/
{
"query": {
"term": {
"address_line_1": "1000 B"
}
}
}
The response is
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
The term query looks for exact matches. Your address_line_* fields were most probably indexed with the standard analyzer, which lowercases all the letters and in turn prevents the query from matching.
So either use
GET accounts/_search/
{
"query": {
"match": { <--
"address_line_1": "1000 B"
}
}
}
which does not really 'care' about B being lower or upper case, or adjust your field analyzers so that capitalization is preserved.
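If you do need exact matching instead, a term query against the keyword sub-field (assuming the default dynamic mapping created one) works on the original, un-analyzed value:

```json
GET accounts/_search
{
  "query": {
    "term": {
      "address_line_1.keyword": "1000 BROADROCK CT"
    }
  }
}
```

This matches only documents whose address_line_1 is exactly that string, casing included.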

How to design the index to perform viewed by and viewed by me information

UserA views UserB
UserA views UserC
UserD views UserA
Who Viewed You queries:-
Who Viewed You should show UserD for UserA
Who Viewed You should show UserA for UserB
Who Viewed You should show UserA for UserC
Viewed By Me queries:-
Viewed By Me should show UserA for UserD
How should we model the users index to fetch the above information?
The users index contains first_name, last_name, gender, ...
I would just save an array in a visitors field (or visited, depending on which has the lower cardinality).
I guess the docs can get huge, so to optimize (and avoid a large number of updates) I would keep "visits_logs" indices with just the logs and an ILM policy with a short delete phase (one index a day, keeping one week of data before deletion):
{"visitor": "userA", "visited": "userB", "#timestamp": 12345678990}
Then at night, use a transform or a manual aggregation to populate an aggregation index per period:
PUT visits/_doc
{
"visitor": "UserA",
"#timestamp": "today",
"visited": {
"users": ["UserB", "UserC", "UserD"],
"quantity": 3
}
}
Details really depends on your real use case and volume of your data.
But I think it's a robust solution.
UPDATE:
The queries would be:
If you want to know all users visited by UserA
GET test/_search
{
"query": {
"match": {
"visitor": "UserA"
}
}
}
The response will look like this, and you just have to merge the visited arrays:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.4700036,
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "5k-z3XQBDjdqjSSDl_K5",
"_score" : 0.4700036,
"_source" : {
"#timestamp" : "today",
"visited" : {
"users" : [
"UserB",
"UserC",
"UserD"
],
"quantity" : 3
},
"visitor" : "UserA"
}
},
{
"_index" : "test",
"_type" : "_doc",
"_id" : "Ksaz3XQBk-8NpR_boPe2",
"_score" : 0.4700036,
"_source" : {
"#timestamp" : "today",
"visited" : {
"users" : [
"UserB",
"UserC",
"UserD"
],
"quantity" : 3
},
"visitor" : "UserA"
}
}
]
}
}
If you want to get "who visit userB"
GET test/_search
{
"query": {
"match": {
"visited.users": "UserB"
}
},
"_source": ["#timestamp", "visitor"]
}
And the answers are then the visitors.
You can get a more qualified result with aggregations:
GET test/_search
{
"size": 0,
"query": {
"match": {
"visited.users": "UserB"
}
},
"aggs": {
"visitors": {
"terms": {
"field": "visitor.keyword",
"size": 10
}
}
}
}
With a result like
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"visitors" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "UserA",
"doc_count" : 2
}
]
}
}
}
and for visited
GET test/_search
{
"size": 0,
"query": {
"match": {
"visitor": "UserA"
}
},
"aggs": {
"visits": {
"terms": {
"field": "visited.users.keyword",
"size": 10
}
}
}
}
with a result like:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"visits" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "UserB",
"doc_count" : 2
},
{
"key" : "UserC",
"doc_count" : 2
},
{
"key" : "UserD",
"doc_count" : 2
}
]
}
}
}

How can I extend an Elasticsearch date range histogram aggregation query?

Hi, I have an Elasticsearch index named mep-report.
Each document has a status field. The possible values for the status field are "ENROUTE", "SUBMITTED", "DELIVERED" and "FAILED". Below is a sample from the index with 6 documents.
{
"took" : 10,
"timed_out" : false,
"_shards" : {
"total" : 13,
"successful" : 13,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1094313,
"max_score" : 1.0,
"hits" : [
{
"_index" : "mep-reports-2019.09.11",
"_type" : "doc",
"_id" : "68e8e03f-baf8-4bfc-a920-58e26edf835c-353899837500",
"_score" : 1.0,
"_source" : {
"status" : "ENROUTE",
"#timestamp" : "2019-09-11T10:21:26.000Z"
}
},
{
"_index" : "mep-reports-2019.09.11",
"_type" : "doc",
"_id" : "68e8e03f-baf8-4bfc-a920-58e26edf835c-353899837501",
"_score" : 1.0,
"_source" : {
"status" : "ENROUTE",
"#timestamp" : "2019-09-11T10:21:26.000Z"
}
},
{
"_index" : "mep-reports-2019.09.11",
"_type" : "doc",
"_id" : "68e8e03f-baf8-4bfc-a920-58e26edf835c-353899837502",
"_score" : 1.0,
"_source" : {
"status" : "SUBMITTED",
"#timestamp" : "2019-09-11T10:21:26.000Z"
}
},
{
"_index" : "mep-reports-2019.09.11",
"_type" : "doc",
"_id" : "68e8e03f-baf8-4bfc-a920-58e26edf835c-353899837503",
"_score" : 1.0,
"_source" : {
"status" : "DELIVERED",
"#timestamp" : "2019-09-11T10:21:26.000Z"
}
},
{
"_index" : "mep-reports-2019.09.11",
"_type" : "doc",
"_id" : "68e8e03f-baf8-4bfc-a920-58e26edf835c-353899837504",
"_score" : 1.0,
"_source" : {
"status" : "FAILED",
"#timestamp" : "2019-09-11T10:21:26.000Z"
}
},
{
"_index" : "mep-reports-2019.09.11",
"_type" : "doc",
"_id" : "68e8e03f-baf8-4bfc-a920-58e26edf835c-353899837504",
"_score" : 1.0,
"_source" : {
"status" : "FAILED",
"#timestamp" : "2019-09-11T10:21:26.000Z"
}
}
]
}
}
I would like to get an aggregation histogram distribution with something like messages_processed, messages_delivered and messages_failed:
messages_processed: 3 (2 documents with status ENROUTE + 1 document with status SUBMITTED)
messages_delivered: 1 (1 document with status DELIVERED)
messages_failed: 2 (2 documents with status FAILED)
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 13,
"successful" : 13,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 21300,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"performance_over_time" : {
"buckets" : [
{
"key_as_string" : "2020-02-21",
"key" : 1582243200000,
"doc_count" : 6,
"message_processed": 3,
"message_delivered": 1,
"message_failed": 2
}
]
}
}
}
The following is my current query. I would like to modify it to get the additional statistics message_processed, message_delivered and message_failed. Kindly let me know.
{ "size": 0, "query": { "bool": { "must": [ { "range": { "#timestamp": { "from": "2020-02-21T00:00Z", "to": "2020-02-21T23:59:59.999Z", "include_lower": true, "include_upper": true, "format": "yyyy-MM-dd'T'HH:mm:ss.SSSZ ||yyyy-MM-dd'T'HH:mmZ", "boost": 1.0 } } } ], "adjust_pure_negative": true, "boost": 1.0 } }, "aggregations": { "performance_over_time": { "date_histogram": { "field": "#timestamp", "format": "yyyy-MM-dd", "interval": "1d", "offset": 0, "order": { "_key": "asc" }, "keyed": false, "min_doc_count": 0 } } } }
You are almost there with the query; you just need to add a terms aggregation, and looking at your request I've come up with a scripted terms aggregation.
I've also changed the date histogram's interval parameter to calendar_interval so that you get buckets per calendar date.
Query Request:
POST <your_index_name>/_search
{
"size": 0,
"query":{
"bool":{
"must":[
{
"range":{
"#timestamp":{
"from":"2019-09-10",
"to":"2019-09-12",
"include_lower":true,
"include_upper":true,
"boost":1.0
}
}
}
],
"adjust_pure_negative":true,
"boost":1.0
}
},
"aggs":{
"message_processed":{
"date_histogram": {
"field": "#timestamp",
"calendar_interval": "1d" <----- Note this
},
"aggs": {
"my_messages": {
"terms": {
"script": { <----- Core Logic of Terms Agg
"source": """
if(doc['status'].value=="ENROUTE" || doc['status'].value == "SUBMITTED"){
return "message_processed";
}else if(doc['status'].value=="DELIVERED"){
return "message_delivered"
}else {
return "message_failed"
}
""",
"lang": "painless"
},
"size": 10
}
}
}
}
}
}
Note that the core logic you are looking for is inside the scripted terms aggregation. The logic is self-explanatory if you go through it; feel free to modify it to fit your needs.
For the sample data you've shared, you would get the result in the below format:
Response:
{
"took" : 144,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"message_processed" : {
"buckets" : [
{
"key_as_string" : "2019-09-11T00:00:00.000Z",
"key" : 1568160000000,
"doc_count" : 6,
"my_messages" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "message_processed",
"doc_count" : 3
},
{
"key" : "message_failed",
"doc_count" : 2
},
{
"key" : "message_delivered",
"doc_count" : 1
}
]
}
}
]
}
}
}
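If you would rather avoid scripting, a filters aggregation nested under the same date histogram yields named buckets directly; a sketch, assuming status has a keyword sub-field for exact matching:

```json
POST <your_index_name>/_search
{
  "size": 0,
  "aggs": {
    "performance_over_time": {
      "date_histogram": {
        "field": "#timestamp",
        "calendar_interval": "1d"
      },
      "aggs": {
        "my_messages": {
          "filters": {
            "filters": {
              "message_processed": {
                "terms": { "status.keyword": ["ENROUTE", "SUBMITTED"] }
              },
              "message_delivered": {
                "term": { "status.keyword": "DELIVERED" }
              },
              "message_failed": {
                "term": { "status.keyword": "FAILED" }
              }
            }
          }
        }
      }
    }
  }
}
```

Each named filter becomes its own bucket with a doc_count, which avoids the per-document script cost of the terms variant.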
