Elasticsearch v7.0
Hello and good day!
I'm trying to create a query with a condition: if a nested field has only 1 element, get that first element; if a nested field has 2 or more elements, get the nested element that matches a condition.
Scenario:
I have an index named socialmedia with a nested field named cms, which stores a sentiment for that document.
An example document with the cms field looks like this:
"_id" : 1,
"cms" : [
{
"cli_id" : 0,
"cmx_sentiment" : "Negative"
}
]
This cms field contains "cli_id" : 0 by default as its 1st element (meaning it is visible to all clients/users), but sooner or later it looks like this:
"_id": 1,
"cms" : [
{
"cli_id" : 0,
"cmx_sentiment" : "Negative"
},
{
"cli_id" : 1,
"cmx_sentiment" : "Positive"
},
{
"cli_id" : 2,
"cmx_sentiment" : "Neutral"
}
]
The 2nd and 3rd elements show that the clients with cli_id equal to 1 and 2 have set a sentiment for that document.
Now, I want to formulate a query such that, if the logged-in client has no sentiment yet for a specific document, it fetches the cmx_sentiment that has "cli_id" : 0.
BUT, if the logged-in client has a sentiment for the documents fetched by his filters, the query fetches the cmx_sentiment with the matching cli_id of the logged-in client.
For example:
the client who has a cli_id of 2, will get the cmx_sentiment of **Neutral** according to the given document above
the client who has a cli_id of 5, will get the cmx_sentiment of **Negative** because he hasn't given a sentiment to the document
PSEUDO CODE :
If the client has set a sentiment on a document, get the cmx_sentiment of the element whose cli_id equals the client's ID
If a document is fresh or the client HAS NOT yet set a sentiment on it, get the cmx_sentiment of the element that has cli_id == 0
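The rule in the pseudocode can be written down per document as a minimal Python sketch. It only illustrates the selection logic on a single cms array and makes no Elasticsearch calls; the helper name pick_sentiment is an assumption made for illustration.

```python
from typing import Dict, List, Optional

DEFAULT_CLI_ID = 0  # the default entry that every client can see, per the description above


def pick_sentiment(cms: List[Dict], client_id: int) -> Optional[str]:
    """Return the client's own sentiment if present, otherwise the default one."""
    # Index the nested entries by cli_id (one entry per client is assumed).
    by_client = {entry["cli_id"]: entry["cmx_sentiment"] for entry in cms}
    if client_id in by_client:
        return by_client[client_id]
    return by_client.get(DEFAULT_CLI_ID)


doc_cms = [
    {"cli_id": 0, "cmx_sentiment": "Negative"},
    {"cli_id": 1, "cmx_sentiment": "Positive"},
    {"cli_id": 2, "cmx_sentiment": "Neutral"},
]

print(pick_sentiment(doc_cms, 2))  # Neutral
print(pick_sentiment(doc_cms, 5))  # Negative
```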
I need a query that implements the pseudocode above.
Here's my sample query:
"aggs" => [
"CMS" => [
"nested" => [
"path" => "cms",
],
"aggs" => [
"FILTER" => [
"filter" => [
"bool" => [
"should" => [
[
"match" => [
"cms.cli_id" => 0
]
],
[
"bool" => [
"must" => [
[
// I'm planning to add a bool clause here to test whether cli_id is equal to the logged-in client's ID
]
]
]
]
]
]
],
"aggs"=> [
"TONALITY"=> [
"terms"=> [
"field" => "cms.cmx_sentiment"
],
]
]
]
]
]
]
Is my query correct?
The problem with the query I have provided is that it SUMS all the elements instead of picking only one.
The query above produces this scenario:
The client with cli_id 2 logs in
Both the Neutral and Negative cmx_sentiment are being retrieved, instead of the Neutral alone
After the discussion with OP I'm rewriting this answer.
To get the desired result you will have to consider the following to build the query and aggregation:
Query:
This will contain any filters applied by the logged-in user. For the purposes of the example I'm using match_all, since every document has at least one nested doc in the cms field, i.e. the one for cli_id: 0.
Aggregation:
Here we have to divide the aggregations into two:
default_only
sentiment_only
default_only
In this aggregation we count the documents that don't have a nested document for cli_id: <logged in client id>, i.e. the docs for which the default nested doc (cli_id: 0) applies.
To do this we follow the steps below:
default_only : Use a filter aggregation to get the documents that do not have a nested document for cli_id: <logged in client id>, i.e. using must_not => cli_id: <logged in client id>
default_nested : Add a nested sub-aggregation, since the sentiment we need to count is a field of the nested documents.
sentiment_for_cli_id : Add a filter sub-aggregation to default_nested in order to keep the sentiment only for the default client, i.e. for cli_id: 0.
default : Add this terms sub-aggregation to sentiment_for_cli_id to get the counts per sentiment. Note that this is a count of nested docs; since there is always only one nested doc per cli_id, it happens to equal the count of parent docs, but it is not the same thing.
the_doc_count : Add this reverse_nested aggregation to step out of the nested context and get the count of parent docs. We add it as a sub-aggregation of the default aggregation.
sentiment_only
This aggregation gives the count for each sentiment where cli_id: <logged in client id> is present. For this we follow the same approach as for the default_only aggregation, with the following tweaks:
sentiment_only : must => cli_id: <logged in client id>
sentiment_nested : same reason as above
sentiment_for_cli_id: same but instead of default we filter for cli_id: <logged in client id>
sentiment: same as default
the_doc_count: same as above
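The two branches described above differ only in the outer bool clause and in the cli_id that is counted, so they can be generated for any client id. The following is a minimal Python sketch that builds the "aggs" part of the request body as a plain dict; the helper names (nested_cli_filter, sentiment_branch, build_sentiment_aggs) are assumptions made for illustration, and sending the body to Elasticsearch is left out.

```python
from typing import Any, Dict


def nested_cli_filter(client_id: int) -> Dict[str, Any]:
    """Nested query matching parent docs that have a cms entry for client_id."""
    return {"nested": {"path": "cms", "query": {"term": {"cms.cli_id": client_id}}}}


def sentiment_branch(parent_clause: str, client_id: int, counted_cli_id: int,
                     nested_name: str, terms_name: str) -> Dict[str, Any]:
    """Build one branch: filter parents, enter cms, keep one cli_id, count sentiments."""
    return {
        "filter": {"bool": {parent_clause: [nested_cli_filter(client_id)]}},
        "aggs": {
            nested_name: {
                "nested": {"path": "cms"},
                "aggs": {
                    "sentiment_for_cli_id": {
                        "filter": {"term": {"cms.cli_id": counted_cli_id}},
                        "aggs": {
                            terms_name: {
                                "terms": {"field": "cms.cmx_sentiment"},
                                # Step back to the parent docs to count them.
                                "aggs": {"the_doc_count": {"reverse_nested": {}}},
                            }
                        },
                    }
                },
            }
        },
    }


def build_sentiment_aggs(client_id: int) -> Dict[str, Any]:
    """Return the default_only and sentiment_only aggregations for one client."""
    return {
        "default_only": sentiment_branch(
            "must_not", client_id, 0, "default_nested", "default"),
        "sentiment_only": sentiment_branch(
            "must", client_id, client_id, "sentiment_nested", "sentiment"),
    }


aggs = build_sentiment_aggs(2)
print(sorted(aggs))  # ['default_only', 'sentiment_only']
```

The resulting dict can be placed under the "aggs" key of the search body next to whatever query the logged-in user's filters produce.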
Example:
PUT socialmedia/_bulk
{"index":{"_id": 1}}
{"cms":[{"cli_id":0,"cmx_sentiment":"Positive"}]}
{"index":{"_id": 2}}
{"cms":[{"cli_id":0,"cmx_sentiment":"Positive"},{"cli_id":2,"cmx_sentiment":"Neutral"}]}
{"index":{"_id": 3}}
{"cms":[{"cli_id":0,"cmx_sentiment":"Positive"},{"cli_id":2,"cmx_sentiment":"Negative"}]}
{"index":{"_id": 4}}
{"cms":[{"cli_id":0,"cmx_sentiment":"Positive"},{"cli_id":2,"cmx_sentiment":"Neutral"}]}
Query:
GET socialmedia/_search
{
"query": {
"match_all": {}
},
"aggs": {
"default_only": {
"filter": {
"bool": {
"must_not": [
{
"nested": {
"path": "cms",
"query": {
"term": {
"cms.cli_id": 2
}
}
}
}
]
}
},
"aggs": {
"default_nested": {
"nested": {
"path": "cms"
},
"aggs": {
"sentiment_for_cli_id": {
"filter": {
"term": {
"cms.cli_id": 0
}
},
"aggs": {
"default": {
"terms": {
"field": "cms.cmx_sentiment"
},
"aggs": {
"the_doc_count": {
"reverse_nested": {}
}
}
}
}
}
}
}
}
},
"sentiment_only": {
"filter": {
"bool": {
"must": [
{
"nested": {
"path": "cms",
"query": {
"term": {
"cms.cli_id": 2
}
}
}
}
]
}
},
"aggs": {
"sentiment_nested": {
"nested": {
"path": "cms"
},
"aggs": {
"sentiment_for_cli_id": {
"filter": {
"term": {
"cms.cli_id": 2
}
},
"aggs": {
"sentiment": {
"terms": {
"field": "cms.cmx_sentiment"
},
"aggs": {
"the_doc_count": {
"reverse_nested": {}
}
}
}
}
}
}
}
}
}
}
}
Agg Output:
"aggregations" : {
"default_only" : {
"doc_count" : 1,
"default_nested" : {
"doc_count" : 1,
"sentiment_for_cli_id" : {
"doc_count" : 1,
"default" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Positive",
"doc_count" : 1,
"the_doc_count" : {
"doc_count" : 1
}
}
]
}
}
}
},
"sentiment_only" : {
"doc_count" : 3,
"sentiment_nested" : {
"doc_count" : 6,
"sentiment_for_cli_id" : {
"doc_count" : 3,
"sentiment" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Neutral",
"doc_count" : 2,
"the_doc_count" : {
"doc_count" : 2
}
},
{
"key" : "Negative",
"doc_count" : 1,
"the_doc_count" : {
"doc_count" : 1
}
}
]
}
}
}
}
}
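Since the two branches cover disjoint sets of parent documents, their buckets can be added together on the client side to get one count per sentiment. The sketch below assumes the response shape shown in the output above (trimmed to the fields it reads); the function name combined_sentiment_counts is an assumption made for illustration.

```python
from collections import Counter
from typing import Any, Dict


def combined_sentiment_counts(aggregations: Dict[str, Any]) -> Dict[str, int]:
    """Sum the parent-doc counts of both branches into one count per sentiment."""
    totals: Counter = Counter()
    branches = (
        ("default_only", "default_nested", "default"),
        ("sentiment_only", "sentiment_nested", "sentiment"),
    )
    for outer, nested, terms in branches:
        buckets = aggregations[outer][nested]["sentiment_for_cli_id"][terms]["buckets"]
        for bucket in buckets:
            # the_doc_count is the reverse_nested (parent document) count.
            totals[bucket["key"]] += bucket["the_doc_count"]["doc_count"]
    return dict(totals)


# Trimmed copy of the aggregation output shown above.
aggregations = {
    "default_only": {"default_nested": {"sentiment_for_cli_id": {"default": {"buckets": [
        {"key": "Positive", "doc_count": 1, "the_doc_count": {"doc_count": 1}},
    ]}}}},
    "sentiment_only": {"sentiment_nested": {"sentiment_for_cli_id": {"sentiment": {"buckets": [
        {"key": "Neutral", "doc_count": 2, "the_doc_count": {"doc_count": 2}},
        {"key": "Negative", "doc_count": 1, "the_doc_count": {"doc_count": 1}},
    ]}}}},
}

print(combined_sentiment_counts(aggregations))
# {'Positive': 1, 'Neutral': 2, 'Negative': 1}
```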
Related
I have an index in Elasticsearch with this kind of documents:
"transactionId" : 5588,
"clientId" : "1",
"transactionType" : 1,
"transactionStatus" : 51,
"locationId" : 12,
"images" : [
{
"imageId" : 5773,
"imagePath" : "http://some/url/path",
"imageType" : "dummyData",
"ocrRead" : "XYZ999",
"imageName" : "SOMENUMBERSANDCHARACTERS.jpg",
"ocrConfidence" : "94.6",
"ROITopLeftCoordinate" : "839x251",
"ROIBottomRightCoordinate" : "999x323"
}
],
"creationTimestamp" : 1669645709130,
"current" : true,
"timestamp" : 1669646359686
It's an "add only" type of stack, where a record is never updated. For instance:
- A new record is added with "transactionStatus": 10
- When the transactionID changes status, a new record is added for the same transactionID with "transactionStatus": 51
and so on.
What I want to achieve is to get a list of 10 records whose last status is 51, but I can't write the correct query.
Here is what I've tried:
{ "size": 10,
"query": {
"match_all": {}
},
"collapse": {
"field": "transactionId",
"inner_hits": {
"name": "most_recent",
"size": 1,
"sort": [{"timestamp": "desc"}]
}
},
"post_filter": {
"term": {
"transactionStatus": "51"
}
}
}
If I change the "transactionStatus": 51 in the post_filter term to, let's say, 10, it gives me a transactionID whose last record does not have status 10.
I don't know if I explained this properly. I apologize for my English; it is not my native language.
GET test_status/_search
{
"query": {
"bool": {
"filter": [
{
"term": {
"transactionStatus": 51
}
}
]
}
},
"sort": [
{
"timestamp": {
"order": "desc"
}
}
]
}
This one will filter and then sort by timestamp. Let me know if there is something missing.
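For comparison, the selection rule stated in the question ("records whose last status is 51") can be spelled out in plain Python. This is only a sketch of the intended semantics on an in-memory list, not an Elasticsearch query; the function name latest_with_status is an assumption, and the sample records other than transactionId 5588 are hypothetical.

```python
from typing import Dict, Iterable, List


def latest_with_status(records: Iterable[Dict], status: int, limit: int = 10) -> List[Dict]:
    """Keep the newest record per transactionId, then keep those with the given status."""
    latest: Dict[int, Dict] = {}
    for rec in records:
        current = latest.get(rec["transactionId"])
        if current is None or rec["timestamp"] > current["timestamp"]:
            latest[rec["transactionId"]] = rec
    matching = [r for r in latest.values() if r["transactionStatus"] == status]
    matching.sort(key=lambda r: r["timestamp"], reverse=True)  # newest first
    return matching[:limit]


records = [
    {"transactionId": 5588, "transactionStatus": 10, "timestamp": 1669645709130},
    {"transactionId": 5588, "transactionStatus": 51, "timestamp": 1669646359686},
    # Hypothetical transaction that passed through 51 but has since moved on.
    {"transactionId": 5589, "transactionStatus": 51, "timestamp": 1669646000000},
    {"transactionId": 5589, "transactionStatus": 60, "timestamp": 1669646400000},
]

print([r["transactionId"] for r in latest_with_status(records, 51)])  # [5588]
```

A plain term filter on transactionStatus would also return the older status-51 record of transaction 5589, which is the difference the question is about.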
I have an index where I store all the places used in my documents. I want to use this index to see if the user mentioned one of the places in the text query I receive.
Unfortunately, I have two documents whose name is similar enough to trick Elasticsearch scoring: Stockholm and Stockholm-Arlanda.
My test phrase is intyg stockholm and this is the query I use to get the best matching document.
{
"size": 1,
"query": {
"bool": {
"should": [
{
"match": {
"name": "intyg stockholm"
}
}
],
"must": [
{
"term": {
"type": {
"value": "4"
}
}
},
{
"terms": {
"name": [
"intyg",
"stockholm"
]
}
},
{
"exists": {
"field": "data.coordinates"
}
}
]
}
}
}
As you can see, I use a terms query to find the interesting documents and I use a match query in the should part of the root bool query to use scoring to get the document I want (Stockholm) on top.
This code worked locally (where I run ES in a container) but it broke when I started testing on a cluster hosted in AWS (where I have the exact same dataset). I found this explaining what happens and adding the search type argument actually fixes the issue.
Since the workaround is best not used in production, I'm looking for other ways to get the expected result.
Here are the two documents:
// Stockholm
{
"type" : 4,
"name" : "Stockholm",
"id" : "42",
"searchableNames" : [
"Stockholm"
],
"uniqueId" : "Place:42",
"data" : {
"coordinates" : "59.32932349999999,18.0685808"
}
}
// Stockholm-Arlanda
{
"type" : 4,
"name" : "Stockholm-Arlanda",
"id" : "1832",
"searchableNames" : [
"Stockholm-Arlanda"
],
"uniqueId" : "Place:1832",
"data" : {
"coordinates" : "59.6497622,17.9237807"
}
}
I have two relational MySQL tables and I need to store their data in Elasticsearch.
I stored it like this, and I wanted to ask whether there is a better way:
POST categories/_doc
{
"id" : 1,
"name" : "Phones"
}
POST categories/_doc
{
"id" : 2,
"name" : "TV"
}
PUT products
{
"mappings": {
"properties": {
"attributes": {
"type": "nested"
}
}
}
}
POST products/_doc
{
"id" : 3,
"category_id" : 1,
"name" : "IPhone 5S",
"attributes" : [
{
"color" : "red",
"stock" : 4
},
{
"color" : "blue",
"stock" : 2
}
]
}
POST products/_doc
{
"id" : 5,
"category_id" : 2,
"name" : "Samsung TV",
"attributes" : [
{
"color" : "red",
"stock" : 2
},
{
"color" : "yellow",
"stock" : 4
}
]
}
And I use two queries for searching:
I first search the categories index, and after that I send the category id values to the products index
GET products/_search
{
"query": {
"bool": {
"filter": [
{
"nested": {
"path": "attributes",
"query": {
"terms": {
"attributes.color": [
"red",
"blue"
]
}
},
"inner_hits": {}
}
},
{
"term": {
"category_id": 2
}
}
]
}
}
}
Can you please share your comments on this topic?
Thank you in advance
It is not good practice to do it like this, since you will have a really difficult time doing sorting and pagination in your application. In Elasticsearch it is really important to flatten the data for performance reasons and store it in one index; if you have truly separate content, you can keep it in multiple indices. Also keep these points in mind:
Elasticsearch is not a relational database. You will not be able to join indices the way you join tables.
Denormalization is not natural, but it is key to efficiency in an Elasticsearch application.
Thinking about your data mapping from the very beginning will allow your app to fly for years.
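As an illustration of the flattening advice, here is a minimal Python sketch that copies the category name into each product before indexing, so a single products index can be searched without a first lookup in categories. The field name category_name and the helper denormalize_products are assumptions made for illustration; the indexing call itself is left out.

```python
from typing import Dict, List


def denormalize_products(categories: List[Dict], products: List[Dict]) -> List[Dict]:
    """Return copies of the products with the category name embedded."""
    names = {c["id"]: c["name"] for c in categories}
    flattened = []
    for product in products:
        doc = dict(product)  # shallow copy, the input list is left untouched
        doc["category_name"] = names.get(product["category_id"])
        flattened.append(doc)
    return flattened


categories = [{"id": 1, "name": "Phones"}, {"id": 2, "name": "TV"}]
products = [
    {"id": 3, "category_id": 1, "name": "IPhone 5S",
     "attributes": [{"color": "red", "stock": 4}, {"color": "blue", "stock": 2}]},
    {"id": 5, "category_id": 2, "name": "Samsung TV",
     "attributes": [{"color": "red", "stock": 2}, {"color": "yellow", "stock": 4}]},
]

for doc in denormalize_products(categories, products):
    print(doc["name"], "->", doc["category_name"])
```

With the category stored on the product, the category condition and the nested attributes condition can go into one query against one index.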
I am trying to solve an issue where I have to get distinct results in a search.
{
"name" : "ABC",
"favorite_cars" : [ "ferrari","toyota" ]
}, {
"name" : "ABC",
"favorite_cars" : [ "ferrari","toyota" ]
}, {
"name" : "GEORGE",
"favorite_cars" : [ "honda","Hyundae" ]
}
When I perform a term query on favorite_cars for "ferrari", I get two results whose name is ABC. I simply want only one result to be returned in this case. So my requirement is to apply a distinct on the name field so that I receive 1 result.
Thanks
One way to achieve what you want is to use a terms aggregation on the name field and then a top_hits sub-aggregation with size 1, like this:
{
"size": 0,
"query": {
"term": {
"favorite_cars": "ferrari"
}
},
"aggs": {
"names": {
"terms": {
"field": "name"
},
"aggs": {
"single_result": {
"top_hits": {
"size": 1
}
}
}
}
}
}
That way, you'll get a single bucket for the term ABC and, nested inside it, a single matching document.
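Reading the distinct documents back out of such a response could look like the following minimal Python sketch. The response literal is a hypothetical, trimmed example of what the aggregation above returns (only the fields the helper reads), and the helper name one_hit_per_name is an assumption.

```python
from typing import Any, Dict, List


def one_hit_per_name(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the _source of the single top hit inside each names bucket."""
    buckets = response["aggregations"]["names"]["buckets"]
    return [b["single_result"]["hits"]["hits"][0]["_source"] for b in buckets]


# Hypothetical, trimmed response for the query above.
response = {
    "aggregations": {
        "names": {
            "buckets": [
                {
                    "key": "ABC",
                    "doc_count": 2,
                    "single_result": {"hits": {"hits": [
                        {"_source": {"name": "ABC", "favorite_cars": ["ferrari", "toyota"]}},
                    ]}},
                }
            ]
        }
    }
}

print(one_hit_per_name(response))
# [{'name': 'ABC', 'favorite_cars': ['ferrari', 'toyota']}]
```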
Executive summary - why is this Elastic Search query...
{
"query": {
"filtered": {
"filter": {
"bool": {
"should": [
{
"term": {
"document.company_id": 197
}
},
{
"term": {
"change.company_id": 197
}
},
{
"bool": {
"must": [
{
"missing": {
"field": "document.company_id"
}
},
{
"missing": {
"field": "changes.company_id"
}
},
{
"terms": {
"user.id": [
2165, 2976, ...
]
}
}
]
... (closing braces here on)
...returning this record?
"_source" : {
"date" : "2015-03-27T09:36:41.716+00:00",
"change" : {
"company_id" : 12,
"id" : "CC-12-51"
},
"action" : "change-control-approved",
"description" : "blah blah",
"user" : {
"full_name" : "Martin Wtorkowski",
"email" : "mwtorkowski#getzendoc.com",
"id" : 40
},
"_date" : 1427445401,
"id" : 57879,
"invalid" : null
},
Given the fact that must corresponds to AND and should corresponds to OR, ...
The record doesn't have a document.company_id of 197 (so the first OR term doesn't apply)
The record doesn't have a change.company_id of 197 (it has a change.company_id of 12 - so the second OR term doesn't apply either)
The third term says: MUST (therefore AND) for 3 conditions: (a) field document.company_id must be missing - and it is indeed missing (b) field change.company_id must be missing - and IT IS NOT MISSING (c) field user.id must have one of a set of values.
I am probably missing some intricate detail of the ES API - but since the 2nd of the 3 must conditions fails, this record should not have passed.
What am I doing wrong?
In your second missing filter, there's a typo.
If you modify changes.company_id to change.company_id, it should work as expected.
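The effect of the typo can be reproduced with a small Python sketch that evaluates the same should/must logic against the record from the question. It is a simulation of the filter semantics, not Elasticsearch code; the helper names are assumptions, and the user id list is an assumption as well, because the original list is elided (40 is included here only because the record could not have matched otherwise).

```python
from typing import Any, Dict, List


def get_path(doc: Dict[str, Any], path: str) -> Any:
    """Resolve a dotted field path, returning None when any part is absent."""
    node: Any = doc
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def is_missing(doc: Dict[str, Any], path: str) -> bool:
    """Mimic the missing filter: true when the field is absent or null."""
    return get_path(doc, path) is None


def passes_filter(doc: Dict[str, Any], company_id: int, user_ids: List[int],
                  second_missing_field: str) -> bool:
    """should = OR of three clauses; the third clause is a must = AND of three checks."""
    return (
        get_path(doc, "document.company_id") == company_id
        or get_path(doc, "change.company_id") == company_id
        or (
            is_missing(doc, "document.company_id")
            and is_missing(doc, second_missing_field)
            and get_path(doc, "user.id") in user_ids
        )
    )


record = {"change": {"company_id": 12, "id": "CC-12-51"}, "user": {"id": 40}}
user_ids = [2165, 2976, 40]  # assumed: the real list is elided in the question

print(passes_filter(record, 197, user_ids, "changes.company_id"))  # True (typo)
print(passes_filter(record, 197, user_ids, "change.company_id"))   # False (fixed)
```

With the misspelled field name, the second missing check is always satisfied because no document has a changes.company_id field, so the third clause reduces to "document.company_id is missing and user.id is in the list".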