Import only arrays that match conditions in Elastic Search - elasticsearch-painless

I'm dealing with nested nested data in ElasticSearch.
I want it to work like a SELECT * from where in an RDBMS.
If you have the following data
POST test-stack/test/1234_5678
{
"Id" : 1234,
"availables":
[
{
"Id" : 4444,
"date" : "2019-09-10",
"time" : [
{
"dateTime" : "2019-09-10T09:30:00+09:00",
"Count" : 50
},
{
"dateTime" : "2019-09-10T10:00:00+09:00",
"Count" : 50
},
{
"dateTime" : "2019-09-10T10:30:00+09:00",
"Count" : 50
}
]
},
{
"Id" : 5555,
"date" : "2019-09-11",
"time" : [
{
"dateTime" : "2019-09-11T09:30:00+09:00",
"Count" : 50
},
{
"dateTime" : "2019-09-11T10:00:00+09:00",
"Count" : 50
},
{
"dateTime" : "2019-09-11T10:30:00+09:00",
"Count" : 50
}
]
},
{
"Id" : 6666,
"date" : "2019-09-12",
"time" : [
{
"dateTime" : "2019-09-12T09:30:00+09:00",
"Count" : 50
},
{
"dateTime" : "2019-09-12T10:00:00+09:00",
"Count" : 50
},
{
"dateTime" : "2019-09-12T10:30:00+09:00",
"Count" : 50
}
]
}
]
}
If I do this,
Select * from test t where t.availables.date == '2019-09-10';
So, I want to get this answer,
"Id" : 4444,
"date" : "2019-09-10",
"time" : [
{
"dateTime" : "2019-09-10T09:30:00+09:00",
"Count" : 50
},
{
"dateTime" : "2019-09-10T10:00:00+09:00",
"Count" : 50
},
{
"dateTime" : "2019-09-10T10:30:00+09:00",
"Count" : 50
}
]
}
I'm a beginner in Elastic Search and I wonder if this is possible in Elastic Search.
I've studied painless scripts but still don't know.

You need to use nested query and inner hits.
Nested query will help you to filter on nested field and inner hits will return matching nested document
Mapping:
PUT testindex11/_mapping
{
"properties": {
"Id": {
"type": "text"
},
"availables": {
"type": "nested",
"properties": {
"Id": {
"type": "text"
},
"date": {
"type": "date",
"format": "yyyy-MM-dd"
},
"time":{
"type": "nested",
"properties": {
"dateTime" :{
"type":"date",
"format":"yyyy-MM-dd'T'HH:mm:ss"
},
"count":{
"type":"integer"
}
}
}
}
}
}
}
Query:
GET testindex11/_search
{
"query": {
"nested": {
"path": "availables",
"query": {
"term": {
"availables.date": {
"value": "2019-09-10"
}
}
},
"inner_hits": {}
}
}
}
Result:
[
{
"_index" : "testindex11",
"_type" : "_doc",
"_id" : "PXuHQm0B4boMRQnoJOpR",
"_score" : 1.0,
"_source" : {
"Id" : 1234,
"availables" : [
{
"Id" : 4444,
"date" : "2019-09-10",
"time" : [
{
"dateTime" : "2019-09-10T09:30:00",
"Count" : 50
},
{
"dateTime" : "2019-09-10T10:00:00",
"Count" : 50
},
{
"dateTime" : "2019-09-10T10:30:00",
"Count" : 50
}
]
},
{
"Id" : 5555,
"date" : "2019-09-11",
"time" : [
{
"dateTime" : "2019-09-11T09:30:00",
"Count" : 50
},
{
"dateTime" : "2019-09-11T10:00:00",
"Count" : 50
},
{
"dateTime" : "2019-09-11T10:30:00",
"Count" : 50
}
]
},
{
"Id" : 6666,
"date" : "2019-09-12",
"time" : [
{
"dateTime" : "2019-09-12T09:30:00",
"Count" : 50
},
{
"dateTime" : "2019-09-12T10:00:00",
"Count" : 50
},
{
"dateTime" : "2019-09-12T10:30:00",
"Count" : 50
}
]
}
]
},
"inner_hits" : {
"availables" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "testindex11",
"_type" : "_doc",
"_id" : "PXuHQm0B4boMRQnoJOpR",
"_nested" : {
"field" : "availables",
"offset" : 0
},
"_score" : 1.0,
"_source" : {
"Id" : 4444,
"date" : "2019-09-10",
"time" : [
{
"dateTime" : "2019-09-10T09:30:00",
"Count" : 50
},
{
"dateTime" : "2019-09-10T10:00:00",
"Count" : 50
},
{
"dateTime" : "2019-09-10T10:30:00",
"Count" : 50
}
]
}
}
]
}
}
}
}
]

Related

ElasticSearch - how to get aggregation of aggregation

I am very new to elasticsearch
I work for a dating website that has data as follows:
Single - with fields: name, signUpDate, state, and other data fields.
Encounter - with fields: state, encounterDate, singlesInvolved, and other data fields.
These are my 2 indexes
Now I have to write a query that returns as follows:
For every state, how many singles, how many encounters, the longest time a single has been part of our website, and the average time a single has been part of our website
And also return one result that is that same average for all states
Like this example:
[
{ //this one is the average of all states
"singles": 45,
"dates": 18,
"minWaitingTime": 1644677979530,
"avgWaitingTime": 15603
},
{ //these are the averages of each state
"state": "MA",
"singles": 50,
"dates": 23,
"minWaitingTime": 1644677979530,
"avgWaitingTime": 15603
},
{
"state": "NY",
"singles": 39,
"dates": 13,
"minWaitingTime": 1644850558872,
"avgWaitingTime": 6033
}
]
I've been working on the query for each state individually but i dont know how to get an average of all states
so far what i have is this:
GET /single,encounter/_search
{
"size": 0,
"aggs": {
"bystate": {
"terms": {
"field": "state",
"size": 59
},
"aggs": {
"group-by-index": {
"terms": {
"field": "_index"
}
},
"min_date": {
"min": {
"field": "signedUpAt"
}
},
"avg_date": {
"avg": {
"field": "signedUpAt"
}
}
}
}
}
}
I don't know if there is a better way to do this, likewise I don't know how to calculate the average (singles, encounters, min_date and average_date average) for all states using this result
Every result of the previous query looks like this:
{
"key" : "MA",
"doc_count" : 164,
"avg_date" : {
"value" : 1.6457900076508965E12,
"value_as_string" : "2022-02-25T11:53:27.650"
},
"min_date" : {
"value" : 1.64467797953E12,
"value_as_string" : "2022-02-12T14:59:39.530"
},
"group-by-index" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "single",
"doc_count" : 135
},
{
"key" : "encounter",
"doc_count" : 29
}
]
}
},
I would really appreciate help on this one
Addition: index mapping.
Encounter:
{
"encounter" : {
"aliases" : { },
"mappings" : {
"properties" : {
"_class" : {
"type" : "keyword",
"index" : false,
"doc_values" : false
},
"avgAge" : {
"type" : "integer",
"index" : false,
"doc_values" : false
},
"application" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"createdAt" : {
"type" : "date",
"format" : "date_hour_minute_second_millis"
},
"encounterId" : {
"type" : "keyword"
},
"locationType" : {
"type" : "keyword",
"index" : false,
"doc_values" : false
},
"singleOneId" : {
"type" : "keyword",
"index" : false,
"doc_values" : false
},
"singleTwoId" : {
"type" : "keyword",
"index" : false,
"doc_values" : false
},
"serviceLine" : {
"type" : "keyword"
},
"state" : {
"type" : "keyword"
},
"rating" : {
"type" : "keyword"
}
}
},
"settings" : {
"index" : {
"refresh_interval" : "1s",
"number_of_shards" : "1",
"provided_name" : "encounter",
"creation_date" : "1643704661932",
"number_of_replicas" : "1",
"uuid" : "MliXQL_bRBKDN7_d8G_BYw",
"version" : {
"created" : "7100299"
}
}
}
}
}
And Single:
{
"single" : {
"aliases" : { },
"mappings" : {
"properties" : {
"_class" : {
"type" : "keyword",
"index" : false,
"doc_values" : false
},
"id" : {
"type" : "keyword"
},
"singleId" : {
"type" : "keyword"
},
"state" : {
"type" : "keyword"
},
"preferedGender" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
},
"settings" : {
"index" : {
"refresh_interval" : "1s",
"number_of_shards" : "1",
"provided_name" : "single",
"creation_date" : "1643704662136",
"number_of_replicas" : "1",
"uuid" : "Js_tqZfRRx-IxbjVRRN4wQ",
"version" : {
"created" : "7100299"
}
}
}
}
}
You can use avg bucket aggregation, where you can provide bucket_path and based on value it will calculate avg of entire aggregation.
Below is sample query:
{
"size": 0,
"aggs": {
"bystate": {
"terms": {
"field": "state",
"size": 59
},
"aggs": {
"group-by-index": {
"terms": {
"field": "_index"
}
},
"min_date": {
"min": {
"field": "signedUpAt"
}
},
"avg_date": {
"avg": {
"field": "signedUpAt"
}
}
}
},
"avg_all_state": {
"avg_bucket": {
"buckets_path": "bystate>avg_date"
}
}
}
}

ElasticSearch join data within the same index

I am quite new with ElasticSearch and I am collecting some application logs within the same index which have this format
{
"_index" : "app_logs",
"_type" : "_doc",
"_id" : "JVMYi20B0a2qSId4rt12",
"_source" : {
"username" : "mapred",
"app_id" : "application_1569623930006_490200",
"event_type" : "STARTED",
"ts" : "2019-10-02T08:11:53Z"
}
I can have different event types. In this case I am interested in STARTED and FINISHED. I would like to query ES in order to get all the app that started in a certain day and enrich them with their end time. Basically I want to create couples of start/end (an end might also be missing, but that's fine).
I have realized join relations in sql cannot be used in ES and I was wondering if I can exploit some other feature in order to get this result in one query.
Edit: these are the details of the index mapping
{
“app_logs" : {
"mappings" : {
"_doc" : {
"properties" : {
"event_type" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
“app_id" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"ts" : {
"type" : "date"
},
“event_type” : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}}}}
What I understood is that you would want to collate list of documents having same app_id along with the status as either STARTED or FINISHED.
I do not think Elasticsearch is not meant to perform JOIN operations. I mean you can but then you have to design your documents as mentioned in this link.
What you would need is an Aggregation query.
Below is the sample mapping, documents, the aggregation query and the response as how it appears, which would actually help you get the desired result.
Mapping:
PUT mystatusindex
{
"mappings": {
"properties": {
"username":{
"type": "keyword"
},
"app_id":{
"type": "keyword"
},
"event_type":{
"type":"keyword"
},
"ts":{
"type": "date"
}
}
}
}
Sample Documents
POST mystatusindex/_doc/1
{
"username" : "mapred",
"app_id" : "application_1569623930006_490200",
"event_type" : "STARTED",
"ts" : "2019-10-02T08:11:53Z"
}
POST mystatusindex/_doc/2
{
"username" : "mapred",
"app_id" : "application_1569623930006_490200",
"event_type" : "FINISHED",
"ts" : "2019-10-02T08:12:53Z"
}
POST mystatusindex/_doc/3
{
"username" : "mapred",
"app_id" : "application_1569623930006_490201",
"event_type" : "STARTED",
"ts" : "2019-10-02T09:30:53Z"
}
POST mystatusindex/_doc/4
{
"username" : "mapred",
"app_id" : "application_1569623930006_490202",
"event_type" : "STARTED",
"ts" : "2019-10-02T09:45:53Z"
}
POST mystatusindex/_doc/5
{
"username" : "mapred",
"app_id" : "application_1569623930006_490202",
"event_type" : "FINISHED",
"ts" : "2019-10-02T09:45:53Z"
}
POST mystatusindex/_doc/6
{
"username" : "mapred",
"app_id" : "application_1569623930006_490203",
"event_type" : "STARTED",
"ts" : "2019-10-03T09:30:53Z"
}
POST mystatusindex/_doc/7
{
"username" : "mapred",
"app_id" : "application_1569623930006_490203",
"event_type" : "FINISHED",
"ts" : "2019-10-03T09:45:53Z"
}
Query:
POST mystatusindex/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"ts": {
"gte": "2019-10-02T00:00:00Z",
"lte": "2019-10-02T23:59:59Z"
}
}
}
],
"should": [
{
"match": {
"event_type": "STARTED"
}
},
{
"match": {
"event_type": "FINISHED"
}
}
]
}
},
"aggs": {
"application_IDs": {
"terms": {
"field": "app_id"
},
"aggs": {
"ids": {
"top_hits": {
"size": 10,
"_source": ["event_type", "app_id"],
"sort": [
{ "event_type": { "order": "desc"}}
]
}
}
}
}
}
}
Notice that for filtering I've made use of Range Query as you only want to filter documents for that date and also added a bool should logic to filter based on STARTED and FINISHED.
Once I have the documents, I've made use of Terms Aggregation and Top Hits Aggregation to get the desired result.
Result
{
"took" : 12,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 5,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"application_IDs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "application_1569623930006_490200", <----- APP ID
"doc_count" : 2,
"ids" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "mystatusindex",
"_type" : "_doc",
"_id" : "1", <--- Document with STARTED status
"_score" : null,
"_source" : {
"event_type" : "STARTED",
"app_id" : "application_1569623930006_490200"
},
"sort" : [
"STARTED"
]
},
{
"_index" : "mystatusindex",
"_type" : "_doc",
"_id" : "2", <--- Document with FINISHED status
"_score" : null,
"_source" : {
"event_type" : "FINISHED",
"app_id" : "application_1569623930006_490200"
},
"sort" : [
"FINISHED"
]
}
]
}
}
},
{
"key" : "application_1569623930006_490202",
"doc_count" : 2,
"ids" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "mystatusindex",
"_type" : "_doc",
"_id" : "4",
"_score" : null,
"_source" : {
"event_type" : "STARTED",
"app_id" : "application_1569623930006_490202"
},
"sort" : [
"STARTED"
]
},
{
"_index" : "mystatusindex",
"_type" : "_doc",
"_id" : "5",
"_score" : null,
"_source" : {
"event_type" : "FINISHED",
"app_id" : "application_1569623930006_490202"
},
"sort" : [
"FINISHED"
]
}
]
}
}
},
{
"key" : "application_1569623930006_490201",
"doc_count" : 1,
"ids" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "mystatusindex",
"_type" : "_doc",
"_id" : "3",
"_score" : null,
"_source" : {
"event_type" : "STARTED",
"app_id" : "application_1569623930006_490201"
},
"sort" : [
"STARTED"
]
}
]
}
}
}
]
}
}
}
Note that the last document with only STARTED appears in the aggregation result as well.
Updated Answer
{
"size":0,
"query":{
"bool":{
"must":[
{
"range":{
"ts":{
"gte":"2019-10-02T00:00:00Z",
"lte":"2019-10-02T23:59:59Z"
}
}
}
],
"should":[
{
"term":{
"event_type.keyword":"STARTED" <----- Changed this
}
},
{
"term":{
"event_type.keyword":"FINISHED" <----- Changed this
}
}
]
}
},
"aggs":{
"application_IDs":{
"terms":{
"field":"app_id.keyword" <----- Changed this
},
"aggs":{
"ids":{
"top_hits":{
"size":10,
"_source":[
"event_type",
"app_id"
],
"sort":[
{
"event_type.keyword":{ <----- Changed this
"order":"desc"
}
}
]
}
}
}
}
}
}
Note the changes I've made. Whenever you would need exact matches or want to make use of aggregation, you would need to make use of keyword type.
In the mapping you've shared, there is no username field but two event_type fields. I'm assuming its just a human err and that one of the field should be username.
Now if you notice carefully, the field event_type has a text and its sibling keyword field. I've just modified the query to make use of the keyword field and when I am doing that, I'm use Term Query.
Try this out and let me know if it helps!

elastic: query the sum of a filtered subset of nested documents

Consider a document (post) like this in elasticsearch index:
{
title: "I love ice cream!"
comments: [
{
body: "me too!",
reaction: 'positive',
likes: 20
},
{
body: "huh!",
reaction: 'sarcastic',
likes: 5
}
]
}
The comments is a field of nested type.
How can elastic answer this:
Give me all posts, where the total sum of likes on "sarcastic" comments is greater than 100.
I'm open to any other way of modelling data which helps answer such queries.
This can be solved using bucket selector aggregation.
Mapping:
{
"index1" : {
"mappings" : {
"properties" : {
"comments" : {
"type" : "nested",
"properties" : {
"body" : {
"type" : "text"
},
"likes" : {
"type" : "integer"
},
"reaction" : {
"type" : "text"
}
}
},
"title" : {
"type" : "keyword"
}
}
}
}
}
Data:
"hits" : [
{
"_index" : "index1",
"_type" : "_doc",
"_id" : "p0y9DGsBfPdKzuAGdQrm",
"_score" : 1.0,
"_source" : {
"title" : "I love ice cream!",
"comments" : [
{
"body" : "me too!",
"reaction" : "positive",
"likes" : 20
},
{
"body" : "huh!",
"reaction" : "sarcastic",
"likes" : 5
}
]
}
},
{
"_index" : "index1",
"_type" : "_doc",
"_id" : "qEy9DGsBfPdKzuAGnwox",
"_score" : 1.0,
"_source" : {
"title" : "I hate ice cream!",
"comments" : [
{
"body" : "me too!",
"reaction" : "positive",
"likes" : 10
},
{
"body" : "huh!",
"reaction" : "sarcastic",
"likes" : 5
}
]
}
}
]
}
Query:
GET index1/_search
{
"size": 0,
"aggs": {
"title": {
"terms": {
"field": "title"
},
"aggs": {
"comments": {
"nested": {
"path": "comments"
},
"aggs": {
"reaction": {
"filter": {
"term": {
"comments.reaction": "positive"
}
},
"aggs": {
"total_likes": {
"sum": {
"field": "comments.likes"
}
}
}
}
}
},
"total_likes_filter": {
"bucket_selector": {
"buckets_path": {
"likes": "comments>reaction>total_likes"
},
"script": "params.likes > 15"
}
}
}
}
}
}
Result:
"aggregations" : {
"title" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "I love ice cream!",
"doc_count" : 1,
"comments" : {
"doc_count" : 2,
"reaction" : {
"doc_count" : 1,
"total_likes" : {
"value" : 20.0
}
}
}
}
]
}
}
}
Bucket contains only "I love ice cream!" where total likes for reaction positive is greater than 20.
I hate ice cream! has total sum 5 for positive reaction so it is not included.

Elasticsearch - Conditional nested fetching

I have index mapping:
{
"dev.directory.3" : {
"mappings" : {
"profile" : {
"properties" : {
"email" : {
"type" : "string",
"index" : "not_analyzed"
},
"events" : {
"type" : "nested",
"properties" : {
"id" : {
"type" : "integer"
},
"name" : {
"type" : "string",
"index" : "not_analyzed"
},
}
}
}
}
}
}
}
with data:
"hits" : [ {
"_index" : "dev.directory.3",
"_type" : "profile",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"email" : "test#dummy.com",
"events" : [
{
"id" : 111,
"name" : "ABC",
},
{
"id" : 222,
"name" : "DEF",
}
],
}
}]
I'd like to filter only matched nested elements instead of returning all events array - is this possible in ES?
Example query:
{
"nested" : {
"path" : "events",
"query" : {
"bool" : {
"filter" : [
{ "match" : { "events.id" : 222 } },
]
}
}
}
}
Eg. If I query for events.id=222 there should be only single element on the result list returned.
What strategy for would be the best to achieve this kind of requirement?
You can use inner_hits to only get the nested records which matched the query.
{
"query": {
"nested": {
"path": "events",
"query": {
"bool": {
"filter": [
{
"match": {
"events.id": 222
}
}
]
}
},
"inner_hits": {}
}
},
"_source": false
}
I am also excluding the source to get only nested hits

Mysteriously wrong values of numerical fields in ElasticSearch

I've spent the last 2 days investigating this mind-bending issue:I have an index with custom mappings on which I perform some aggregations. The problem is that in the results of the aggregation on numerical fields,it returns values that do not appear in the database from which the data was imported, even though the number of results is the same.
I found a similar issue here where the problem was inconsistent mapping of a field across an index, but in my case it is mapped as the same type. The problem happens with the fields: award.value.amount, award.value.x_amountEur, tender.value.x_amountEur as far as I have checked.This is my current mapping as stated by curl -XGET 'http://localhost:9200/documents/_mappings?pretty&human'
(the part that contains the target fields):
{
"documents" : {
"mappings" : {
"document" : {
"properties" : {
"additionalIdentifiers" : {
"type" : "string",
"index" : "not_analyzed"
},
"award" : {
"type" : "nested",
"properties" : {
"_id" : {
"properties" : {
"$oid" : {
"type" : "string"
}
}
},
"contract_number" : {
"type" : "string",
"index" : "not_analyzed"
},
"date" : {
"type" : "nested",
"properties" : {
"_id" : {
"properties" : {
"$oid" : {
"type" : "string"
}
}
},
"x_day" : {
"type" : "integer"
},
"x_month" : {
"type" : "integer"
},
"x_year" : {
"type" : "integer"
}
}
},
"description" : {
"type" : "string"
},
"initialValue" : {
"type" : "nested",
"properties" : {
"_id" : {
"properties" : {
"$oid" : {
"type" : "string"
}
}
},
"amount" : {
"type" : "float"
},
"currency" : {
"type" : "string"
},
"x_vat" : {
"type" : "float"
}
}
},
"minValue" : {
"type" : "nested",
"properties" : {
"_id" : {
"properties" : {
"$oid" : {
"type" : "string"
}
}
},
"amount" : {
"type" : "float"
},
"x_amountEur" : {
"type" : "float"
}
}
},
"title" : {
"type" : "string"
},
"value" : {
"type" : "nested",
"properties" : {
"_id" : {
"properties" : {
"$oid" : {
"type" : "string"
}
}
},
"amount" : {
"type" : "float"
},
"currency" : {
"type" : "string"
},
"x_amountEur" : {
"type" : "float"
},
"x_vat" : {
"type" : "float"
},
"x_vatbool" : {
"type" : "boolean"
}
}
},
"x_initialValue" : {
"type" : "nested",
"properties" : {
"_id" : {
"properties" : {
"$oid" : {
"type" : "string"
}
}
},
"x_amountEur" : {
"type" : "float"
},
"x_vatbool" : {
"type" : "boolean"
}
}
}
}
},
"awardCriteria" : {
"type" : "string"
},
"contract_number" : {
"type" : "string"
},
"document_id" : {
"type" : "string",
"index" : "not_analyzed"
},
"numberOfTenderers" : {
"type" : "string"
},
"procurementMethod" : {
"type" : "string"
},
"procuring_entity" : {
"type" : "nested",
"properties" : {
"_id" : {
"properties" : {
"$oid" : {
"type" : "string"
}
}
},
"address" : {
"type" : "nested",
"properties" : {
"_id" : {
"properties" : {
"$oid" : {
"type" : "string"
}
}
},
"country" : {
"type" : "string"
},
"countryName" : {
"type" : "string",
"index" : "not_analyzed"
},
"email" : {
"type" : "string"
},
"locality" : {
"type" : "string"
},
"postalCode" : {
"type" : "string"
},
"streetAddress" : {
"type" : "string"
},
"telephone" : {
"type" : "string"
},
"x_url" : {
"type" : "string"
}
}
},
"name" : {
"type" : "string"
},
"x_slug" : {
"type" : "string",
"index" : "not_analyzed"
}
}
},
"suppliers" : {
"type" : "nested",
"properties" : {
"_id" : {
"properties" : {
"$oid" : {
"type" : "string"
}
}
},
"address" : {
"properties" : {
"_id" : {
"properties" : {
"$oid" : {
"type" : "string"
}
}
},
"email" : {
"type" : "string"
},
"locality" : {
"type" : "string"
},
"postalCode" : {
"type" : "string"
},
"streetAddress" : {
"type" : "string"
},
"telephone" : {
"type" : "string"
},
"x_url" : {
"type" : "string"
}
}
},
"name" : {
"type" : "string"
},
"x_slug" : {
"type" : "string",
"index" : "not_analyzed"
}
}
},
"tender" : {
"type" : "nested",
"properties" : {
"_id" : {
"properties" : {
"$oid" : {
"type" : "string"
}
}
},
"value" : {
"type" : "nested",
"properties" : {
"_id" : {
"properties" : {
"$oid" : {
"type" : "string"
}
}
},
"amount" : {
"type" : "float"
},
"currency" : {
"type" : "string"
},
"x_amountEur" : {
"type" : "float"
},
"x_vat" : {
"type" : "float"
},
"x_vatbool" : {
"type" : "boolean"
}
}
}
}
}
This is the aggregation I am using in order to get the values of contracts between each pair of supplier - procuring_entity:
Document.es.search({
"search_type": "count" ,
"body":{
"aggregations": {
"entities":{
"nested": {
"path": "procuring_entity"
},
"aggs": {
"procuring_entity_names": {
"terms": {
"field": "procuring_entity.x_slug",
"size": 0
},
"aggs": {
"suppliers": {
"nested": {
"path": "suppliers"
},
"aggs": {
"suppliers_names": {
"terms":{
"field": "suppliers.x_slug",
"size": 0
},
"aggs": {
"awards": {
"nested": {
"path": "award.value"
},
"aggs": {
"award_amounts": {
"terms":{
"field": "award.value.x_amountEur",
"size": 0
}
}
}
}
}
}
}
}
}
}
}
}
}
}})
The result with type float is :
{"entities"=>
{"doc_count"=>24300,
"procuring_entity_names"=>
{"doc_count_error_upper_bound"=>0,
"sum_other_doc_count"=>0,
"buckets"=>
[{"key"=>"vsia-bernu-kliniska-universitates-slimnica",
"doc_count"=>1360,
"suppliers"=>
{"doc_count"=>1360,
"suppliers_names"=>
{"doc_count_error_upper_bound"=>0,
"sum_other_doc_count"=>0,
"buckets"=>
[{"key"=>"recipe-plus-as",
"doc_count"=>388,
"awards"=>
{"doc_count"=>388,
"awards"=>
{"doc_count_error_upper_bound"=>0,
"sum_other_doc_count"=>0,
"buckets"=>
[{"key"=>3679.086669921875, "doc_count"=>373},
{"key"=>0.0, "doc_count"=>12},
{"key"=>73610.3203125, "doc_count"=>1},
{"key"=>244000.0, "doc_count"=>1},
{"key"=>342348.9375, "doc_count"=>1}]}}}
The problem is that in MongoDB the same query returns 388 documents that all have award.value.x_amountEur = 3679.08661250056 , as presented by Mongoid query:
Document.where(:"procuring_entity.x_slug" => "vsia-bernu-kliniska-universitates-slimnica")
.keep_if{|doc| doc.suppliers.first.x_slug == "recipe-plus-as"}
.map{|doc| doc.award.value.x_amountEur}.uniq
=>[3679.08661250056]
A query directly into MongoDB returns the same.
I have also tried to map the targeted fields as double, which gave the same result and as long which returned the following (even more incorrect result):
{"entities"=>
{"doc_count"=>24300,
"procuring_entity_names"=>
{"doc_count_error_upper_bound"=>0,
"sum_other_doc_count"=>0,
"buckets"=>
[{"key"=>"vsia-bernu-kliniska-universitates-slimnica",
"doc_count"=>1360,
"suppliers"=>
{"doc_count"=>1360,
"suppliers_names"=>
{"doc_count_error_upper_bound"=>0,
"sum_other_doc_count"=>0,
"buckets"=>
[{"key"=>"recipe-plus-as",
"doc_count"=>388,
"awards"=>
{"doc_count"=>388,
"awards"=>
{"doc_count_error_upper_bound"=>0,
"sum_other_doc_count"=>0,
"buckets"=>
[{"key"=>3679, "doc_count"=>371},
{"key"=>0, "doc_count"=>12},
{"key"=>44300, "doc_count"=>1},
{"key"=>80472, "doc_count"=>1},
{"key"=>331636, "doc_count"=>1},
{"key"=>342348, "doc_count"=>1},
{"key"=>1658805, "doc_count"=>1}]}}}
I'm using Elasticsearch 2.0, mongoid 5.0.1 and mongoid-elasticsearch for indexing. I can't think of anything else to do so any suggestion is welcomed and appreciated.
I tried to test your scenario with ES 2.0 and there is something that I'm missing. I cannot make it create buckets for the award.value.x_amountEur unless I use a reverse_nested aggregation to "get out" from one nested path and into another.
So, instead of the awards aggregation that you have I'm using the same aggregation but "wrapped" in a reverse_nested aggregation:
"aggs": {
"getting_back": {
"reverse_nested": {},
"aggs": {
"awards": {
"nested": {
"path": "award.value"
},
"aggs": {
"award_amounts": {
"terms": {
"field": "award.value.x_amountEur"
}
}
}
}
}
}
}
And for this one I am seeing something ok.
Later edit: following mine and more general #Val's suggestion, the complete solution was to use reverse_nested on both awards and suppliers aggregations.

Resources