Filtering Nested Objects in Elastic Search - elasticsearch

I have an index with the following mapping:
{
"mappings": {
"data": {
"date_detection": false,
"_all": {
"enabled": false
},
"properties": {
"DocumentId": {
"type": "string"
},
"SubscriptionId": {
"type": "long"
},
"AccountId": {
"type": "long"
},
"SubscriptionStartDateTime": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss"
},
"SubscriptionEndDateTime": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss"
},
"DateWiseMetrics": {
"type": "nested",
"properties": {
"Date": {
"type": "date",
"format": "yyyy-MM-dd"
},
"Usage": {
"type": "long"
}
}
}
}
}
}
}
What I'm trying to do is:
1. For a given account ID, start date, and end date, find all subscriptions that are active within that period.
2. Find the usage in that period for all the active subscriptions found in step 1.
For example, if start date = "2017-08-01 00:00:00" and end date = "2017-09-01 00:00:00", I want to find all subscriptions active in that period and then see their daily usage from 2017-08-01 to 2017-09-01.
I was able to get the active subscriptions (step 1) right, but I am not getting correct values when I filter further by Date in the nested object (step 2):
{
"_source" : false,
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"bool": {
"must": [
{ "term" : { "AccountId" : 290804220029 } },
{
"nested": {
"path": "DateWiseMetrics",
"filter":
{ "range" : { "DateWiseMetrics.Date": { "gte": "2017-01-01", "lte": "2017-08-01" } }
}, "inner_hits" : {}
}
}
],
"must_not" : [
{ "range" : { "SubscriptionStartDateTime": { "gt": "2017-08-01 00:00:00" } }},
{ "range" : { "SubscriptionEndDateTime": { "lt": "2017-01-01 00:00:00" } }}
]
}
}
}
}
}
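To make the intended result concrete, here is a rough sketch of the two steps in plain Python rather than Elasticsearch. The sample documents and the helper names `is_active` and `daily_usage` are made up for illustration; only the field names come from the mapping above.

```python
from datetime import date, datetime

FMT = "%Y-%m-%d %H:%M:%S"

def is_active(sub, start, end):
    """Step 1: a subscription overlaps [start, end] unless it starts after
    the period ends or ends before the period starts (the two must_not clauses)."""
    sub_start = datetime.strptime(sub["SubscriptionStartDateTime"], FMT)
    sub_end = datetime.strptime(sub["SubscriptionEndDateTime"], FMT)
    return not (sub_start > end) and not (sub_end < start)

def daily_usage(sub, start, end):
    """Step 2: keep only the nested DateWiseMetrics entries whose Date is in range."""
    return [
        m for m in sub["DateWiseMetrics"]
        if start.date() <= date.fromisoformat(m["Date"]) <= end.date()
    ]

# Hypothetical sample documents, shaped like the mapping above.
subscriptions = [
    {"SubscriptionId": 1, "AccountId": 290804220029,
     "SubscriptionStartDateTime": "2017-07-15 00:00:00",
     "SubscriptionEndDateTime": "2017-08-15 00:00:00",
     "DateWiseMetrics": [{"Date": "2017-07-31", "Usage": 5},
                         {"Date": "2017-08-02", "Usage": 7}]},
    {"SubscriptionId": 2, "AccountId": 290804220029,
     "SubscriptionStartDateTime": "2017-10-01 00:00:00",
     "SubscriptionEndDateTime": "2017-11-01 00:00:00",
     "DateWiseMetrics": [{"Date": "2017-10-02", "Usage": 3}]},
]

start = datetime(2017, 8, 1)
end = datetime(2017, 9, 1)
report = {
    s["SubscriptionId"]: daily_usage(s, start, end)
    for s in subscriptions
    if s["AccountId"] == 290804220029 and is_active(s, start, end)
}
print(report)
```

In this sketch, subscription 1 is active in the period and only its in-range daily entry is kept, while subscription 2 is excluded entirely. That is the shape of result I am hoping to get back from the nested filter plus inner_hits.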
I think the way I am filtering the nested objects is wrong. Please help.
Also, is there a way to show a few parent fields (e.g. AccountId, SubscriptionId) along with the inner hit results? I observed that when you add "_source": false, the parent fields no longer appear.
Thanks

Related

Include joined children with Elasticsearch GET request

I have an Elasticsearch index events that has a join field so that an event can have multiple instances (i.e. the same event can occur on different dates). In this simplified mapping, an event doc has fields for title and url while an instance doc has start/end date fields:
{
"mappings": {
"properties": {
"title": {
"type": "text"
},
"url": {
"type": "keyword"
},
"dt": {
"type": "date"
},
"end_dt": {
"type": "date"
},
"event_or_instance": {
"type": "join",
"eager_global_ordinals": true,
"relations": {
"event": "instance"
}
}
}
}
}
I know how to get an event and include all of its instances using has_child:
GET /events/_search
{
"query" : {
"bool": {
"filter": [
{
"term": {
"_id": {
"value": "c8871a79-1907-46c0-958c-9731c529b93e"
}
}
},
{
"has_child" : {
"type" : "instance",
"query" : { "match_all": {} },
"inner_hits" : {
"_source": true,
"sort": [{"dt": "asc"}]
}
}
}
]
}
},
"_source": true
}
This works fine, but is there a way to do this using the Get/Multi-get API instead of the Search API?

Elasticsearch connect range and term to same array item

I have a user document with a field called experiences which is an array of objects, like:
{
"experiences": [
{
"end_date": "2017-03-02",
"is_valid": false
},
{
"end_date": "2015-03-02",
"is_valid": true
}
]
}
With this document, I have to find users who have an experience whose end_date is within the last year and whose is_valid is true.
At the moment I have a bool query with two must clauses: a range on end_date and a term on is_valid.
{
"query": {
"bool": {
"must": [
{
"term": {
"experiences.is_valid": true
}
},
{
"range": {
"experiences.end_date": {
"gte": "now-1y",
"lte": "now"
}
}
}
]
}
}
}
The result is that this user is selected, because he has one experience with an end_date in the last year (the first one) and another experience with is_valid set to true.
Of course this is not what I need: end_date and is_valid must both match within the same object. How can we do this in Elasticsearch?
Mapping:
"experiences": {
"properties": {
"comment": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"end_date": {
"type": "date"
},
"id": {
"type": "long"
},
"is_valid": {
"type": "boolean"
},
"start_date": {
"type": "date"
}
}
}
You need to change the type of experiences to the nested data type.
Then apply a nested query:
{
"query": {
"nested": {
"path": "experiences",
"query": {
"bool": {
"must": [
{
"term": {
"experiences.is_valid": true
}
},
{
"range": {
"experiences.end_date": {
"gte": "now-1y",
"lte": "now"
}
}
}
]
}
}
}
}
}
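This is needed because of the way arrays of objects are flattened in Elasticsearch: with the default object type, each property becomes a single multi-valued field, so the link between end_date and is_valid within one experience is lost.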
Study more here
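The flattening behaviour can be illustrated with a small Python sketch. This is only a simplified model, not Elasticsearch code: `flatten`, `matches_flattened`, and `matches_nested` are hypothetical helpers, the fixed `cutoff` stands in for now-1y, and the upper bound of the range is left out for brevity.

```python
def flatten(doc, field):
    """Mimic how an array of plain objects is indexed: one multi-valued
    field per property, with no record of which values belonged together."""
    flat = {}
    for obj in doc[field]:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat

def matches_flattened(doc, cutoff):
    """Each condition may be satisfied by a different array element."""
    flat = flatten(doc, "experiences")
    return (any(v is True for v in flat["experiences.is_valid"])
            and any(d >= cutoff for d in flat["experiences.end_date"]))

def matches_nested(doc, cutoff):
    """Both conditions must hold within the same array element."""
    return any(e["is_valid"] and e["end_date"] >= cutoff
               for e in doc["experiences"])

user = {"experiences": [
    {"end_date": "2017-03-02", "is_valid": False},
    {"end_date": "2015-03-02", "is_valid": True},
]}
cutoff = "2016-06-01"  # assumed stand-in for now-1y; ISO dates compare correctly as strings

print(matches_flattened(user, cutoff))  # True: conditions met by different objects
print(matches_nested(user, cutoff))     # False: no single object meets both
```

The flattened check reproduces the unwanted match from the question, while the per-object check behaves like the nested query.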

Elasticsearch index type doesn't change after updating status

I've successfully made some _bulk inserts, and now I'm trying to run a query with a date range and a filter, something like:
{
"query": {
"bool": {
"must": [{
"terms": {
"mt_id": [613]
}
},
{
"range": {
"time": {
"gt": 1470009600000,
"lt": 1470009600000
}
}
}]
}
}
}
Unfortunately I got no results. I then noticed that the index mapping created by the bulk insert is as follows:
{
"agg__ex_2016_8_3": {
"mappings": {
"player": {
"properties": {
"adLoad": {
"type": "long"
},
"mt_id": {
"type": "long"
},
"time": {
"type": "string"
}
}
},
As a solution I tried to change the index mapping with:
PUT /agg__ex_2016_8_3/_mapping/player
{
"properties" : {
"mt_id" : {
"type" : "long",
"index": "not_analyzed"
}
}
}
got
{
"acknowledged": true
}
and
PUT /agg__ex_2016_8_3/_mapping/player
{
"properties" : {
"time" : {
"type" : "date",
"format" : "yyyy/MM/dd HH:mm:ss"
}
}
}
got:
{
"error": {
"root_cause": [
{
"type": "remote_transport_exception",
"reason": "[vj_es_c1-esc13][10.132.69.145:9300][indices:admin/mapping/put]"
}
],
"type": "illegal_argument_exception",
"reason": "mapper [time] of different type, current_type [string], merged_type [date]"
},
"status": 400
}
So the mapping did not change, and I still don't get any results.
What am I doing wrong? (I have to use plain HTTP requests, not curl.)
Thanks!
Try this:
# 1. delete index
DELETE agg__ex_2016_8_3
# 2. recreate it with the proper mapping
PUT agg__ex_2016_8_3
{
"mappings": {
"player": {
"properties": {
"adLoad": {
"type": "long"
},
"mt_id": {
"type": "long"
},
"time": {
"type": "date"
}
}
}
}
}
# 3. create doc
PUT agg__ex_2016_8_3/player/104
{
"time": "1470009600000",
"domain": "organisemyhouse.com",
"master_domain": "613###organisemyhouse.com",
"playerRequets": 4,
"playerLoads": 0,
"c_Id": 0,
"cb_Id": 0,
"mt_Id": 613
}
# 4. search
POST agg__ex_2016_8_3/_search
{
"query": {
"bool": {
"must": [
{
"terms": {
"mt_Id": [
613
]
}
},
{
"range": {
"time": {
"gte": 1470009600000,
"lte": 1470009600000
}
}
}
]
}
}
}
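Besides mapping time as a date, the search above also swaps the exclusive bounds from the question (gt/lt) for inclusive ones (gte/lte). The following Python sketch shows why that matters when both bounds are the same value. The helper `in_range` is a hypothetical, simplified model of range bounds on a single number, not an Elasticsearch API.

```python
from datetime import datetime, timezone

def in_range(value, gt=None, gte=None, lt=None, lte=None):
    """Check a number against optional exclusive (gt/lt) and inclusive (gte/lte) bounds."""
    if gt is not None and not value > gt:
        return False
    if gte is not None and not value >= gte:
        return False
    if lt is not None and not value < lt:
        return False
    if lte is not None and not value <= lte:
        return False
    return True

ts = 1470009600000  # epoch milliseconds used in the question and the answer

# Show which instant the value represents.
print(datetime.fromtimestamp(ts / 1000, tz=timezone.utc).isoformat())

print(in_range(ts, gt=ts, lt=ts))    # False: nothing is both > ts and < ts
print(in_range(ts, gte=ts, lte=ts))  # True: the document's own timestamp matches
```

With gt and lt set to the same value, no document can match regardless of the mapping, so both changes are needed for the query to return the document.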

ElasticSearch query on tags

I am trying to crack the Elasticsearch query language, and so far I'm not doing very well.
I've got the following mapping for my documents.
{
"mappings": {
"jsondoc": {
"properties": {
"header" : {
"type" : "nested",
"properties" : {
"plainText" : { "type" : "string" },
"title" : { "type" : "string" },
"year" : { "type" : "string" },
"pages" : { "type" : "string" }
}
},
"sentences": {
"type": "nested",
"properties": {
"id": { "type": "integer" },
"text": { "type": "string" },
"tokens": { "type": "nested" },
"rhetoricalClass": { "type": "string" },
"babelSynsetsOcc": {
"type": "nested",
"properties" : {
"id" : { "type" : "integer" },
"text" : { "type" : "string" },
"synsetID" : { "type" : "string" }
}
}
}
}
}
}
}
}
Each document is essentially a JSON representation of a PDF document.
I have been trying to write queries with aggregations, and so far it is going great. I have got to the point of grouping by (aggregating on) rhetoricalClass and getting the total number of occurrences of babelSynsetsOcc.synsetID. I can even run the same query while grouping the whole result by header.year.
But right now I am struggling with first filtering the documents that contain a term and then running the same query.
So, how could I write a query that groups by rhetoricalClass but only takes into account those documents whose header.plainText field contains any of ["Computational", "Compositional", "Semantics"]? I mean contains, not equals.
A rough translation to SQL would be something like:
SELECT count(sentences.babelSynsetsOcc.synsetID)
FROM jsondoc
WHERE header.plainText like '%Computational%' OR header.plainText like '%Compositional%' OR header.plainText like '%Semantics%'
GROUP BY sentences.rhetoricalClass
WHERE clauses are just standard structured queries, so they translate to queries in Elasticsearch.
GROUP BY and HAVING loosely translate to aggregations in Elasticsearch's DSL. Functions like count, min, max, and sum operate on the groups produced by GROUP BY, so they are also expressed as aggregations.
The fact that you're using nested objects may be necessary, but it adds an extra layer to each part that touches them. If those nested objects are not arrays, then do not use nested; use object in that case.
I would probably look at translating your query to:
{
"query": {
"nested": {
"path": "header",
"query": {
"bool": {
"should": [
{
"match": {
"header.plainText" : "Computational"
}
},
{
"match": {
"header.plainText" : "Compositional"
}
},
{
"match": {
"header.plainText" : "Semantics"
}
}
]
}
}
}
}
}
Alternatively, it could be rewritten as follows, although the intent is a little less obvious:
{
"query": {
"nested": {
"path": "header",
"query": {
"match": {
"header.plainText": "Computational Compositional Semantics"
}
}
}
}
}
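The two forms are equivalent because a match query with several words uses the or operator by default. The Python sketch below models that under simplifying assumptions: `analyze` is a crude stand-in for the standard analyzer (lowercase, split on non-alphanumerics), and `match_or` and `bool_should` are hypothetical helpers, not Elasticsearch calls.

```python
import re

def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def match_or(field_text, query_text):
    """match with the default operator (or): any query token appears in the field."""
    field_tokens = set(analyze(field_text))
    return any(tok in field_tokens for tok in analyze(query_text))

def bool_should(field_text, terms):
    """bool/should over one match clause per term."""
    return any(match_or(field_text, term) for term in terms)

terms = ["Computational", "Compositional", "Semantics"]
docs = ["Compositional models of meaning", "Syntax only", "Computationally hard"]

# Both query shapes agree on every sample text.
for text in docs:
    assert match_or(text, " ".join(terms)) == bool_should(text, terms)

results = [match_or(text, " ".join(terms)) for text in docs]
print(results)  # [True, False, False]
```

Note that the last sample does not match: match compares whole analyzed tokens, so it is not a substring test like SQL's LIKE '%...%'. If true substring matching is required, a different analysis setup would be needed.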
The aggregation would then be:
{
"aggs": {
"nested_sentences": {
"nested": {
"path": "sentences"
},
"aggs": {
"group_by_rhetorical_class": {
"terms": {
"field": "sentences.rhetoricalClass",
"size": 10
},
"aggs": {
"nested_babel": {
"nested": {
"path": "sentences.babelSynsetsOcc"
},
"aggs": {
"count_synset_id": {
"value_count": {
"field": "sentences.babelSynsetsOcc.synsetID"
}
}
}
}
}
}
}
}
}
}
Now, if you combine them and throw away hits (since you're just looking for the aggregated result), then it looks like this:
{
"size": 0,
"query": {
"nested": {
"path": "header",
"query": {
"match": {
"header.plainText": "Computational Compositional Semantics"
}
}
}
},
"aggs": {
"nested_sentences": {
"nested": {
"path": "sentences"
},
"aggs": {
"group_by_rhetorical_class": {
"terms": {
"field": "sentences.rhetoricalClass",
"size": 10
},
"aggs": {
"nested_babel": {
"nested": {
"path": "sentences.babelSynsetsOcc"
},
"aggs": {
"count_synset_id": {
"value_count": {
"field": "sentences.babelSynsetsOcc.synsetID"
}
}
}
}
}
}
}
}
}
}

Using a custom_score to sort by a nested child's timestamp

I'm pretty new to elasticsearch and have been banging my head trying to get this sorting to work. The general idea is to search email message threads with nested messages and nested participants. The goal is to display search results at the thread level, sorting by the participant who is doing the search and either the last_received_at or last_sent_at column depending on which mailbox they are in.
My understanding is that you can't sort by a single child's value among many nested children. So in order to do this I saw a couple of suggestions for using a custom_score with a script, then sorting on the score. My plan is to dynamically change the sort column and then run a nested custom_score query that will return the date of one of the participants as the score. I've been noticing some issues with both the score format being strange (eg. always has 4 zeros at the end) and it may not be returning the date that I was expecting.
Below are simplified versions of the index and the query in question. If anyone has any suggestions, I'd be very grateful. (FYI - I am using elasticsearch version 0.20.6.)
Index:
mappings: {
message_thread: {
properties: {
id: {
type: long
}
subject: {
dynamic: true
properties: {
id: {
type: long
}
name: {
type: string
}
}
}
participants: {
dynamic: true
properties: {
id: {
type: long
}
name: {
type: string
}
last_sent_at: {
format: dateOptionalTime
type: date
}
last_received_at: {
format: dateOptionalTime
type: date
}
}
}
messages: {
dynamic: true
properties: {
sender: {
dynamic: true
properties: {
id: {
type: long
}
}
}
id: {
type: long
}
body: {
type: string
}
created_at: {
format: dateOptionalTime
type: date
}
recipient: {
dynamic: true
properties: {
id: {
type: long
}
}
}
}
}
version: {
type: long
}
}
}
}
Query:
{
"query": {
"bool": {
"must": [
{
"term": { "participants.id": 3785 }
},
{
"custom_score": {
"query": {
"filtered": {
"query": { "match_all": {} },
"filter": {
"term": { "participants.id": 3785 }
}
}
},
"params": { "sort_column": "participants.last_received_at" },
"script": "doc[sort_column].value"
}
}
]
}
},
"filter": {
"bool": {
"must": [
{
"term": { "messages.recipient.id": 3785 }
}
]
}
},
"sort": [ "_score" ]
}
Solution:
Thanks to @imotov, here is the final result. The participants were not properly nested in the index (while the messages didn't need to be). In addition, include_in_root was used for the participants to simplify the query (participants are small records and not a real size issue, although @imotov also provided an example without it). He then restructured the JSON request to use a dis_max query.
curl -XDELETE "localhost:9200/test-idx"
curl -XPUT "localhost:9200/test-idx" -d '{
"mappings": {
"message_thread": {
"properties": {
"id": {
"type": "long"
},
"messages": {
"properties": {
"body": {
"type": "string",
"analyzer": "standard"
},
"created_at": {
"type": "date",
"format": "yyyy-MM-dd'\''T'\''HH:mm:ss'\''Z'\''"
},
"id": {
"type": "long"
},
"recipient": {
"dynamic": "true",
"properties": {
"id": {
"type": "long"
}
}
},
"sender": {
"dynamic": "true",
"properties": {
"id": {
"type": "long"
}
}
}
}
},
"messages_count": {
"type": "long"
},
"participants": {
"type": "nested",
"include_in_root": true,
"properties": {
"id": {
"type": "long"
},
"last_received_at": {
"type": "date",
"format": "yyyy-MM-dd'\''T'\''HH:mm:ss'\''Z'\''"
},
"last_sent_at": {
"type": "date",
"format": "yyyy-MM-dd'\''T'\''HH:mm:ss'\''Z'\''"
},
"name": {
"type": "string",
"analyzer": "standard"
}
}
},
"subject": {
"properties": {
"id": {
"type": "long"
},
"name": {
"type": "string"
}
}
}
}
}
}
}'
curl -XPUT "localhost:9200/test-idx/message_thread/1" -d '{
"id" : 1,
"subject" : {"name": "Test Thread"},
"participants" : [
{"id" : 87793, "name" : "John Smith", "last_received_at" : null, "last_sent_at" : "2010-10-27T17:26:58Z"},
{"id" : 3785, "name" : "David Jones", "last_received_at" : "2010-10-27T17:26:58Z", "last_sent_at" : null}
],
"messages" : [{
"id" : 1,
"body" : "This is a test.",
"sender" : { "id" : 87793 },
"recipient" : { "id" : 3785},
"created_at" : "2010-10-27T17:26:58Z"
}]
}'
curl -XPUT "localhost:9200/test-idx/message_thread/2" -d '{
"id" : 2,
"subject" : {"name": "Elastic"},
"participants" : [
{"id" : 57834, "name" : "Paul Johnson", "last_received_at" : "2010-11-25T17:26:58Z", "last_sent_at" : "2010-10-25T17:26:58Z"},
{"id" : 3785, "name" : "David Jones", "last_received_at" : "2010-10-25T17:26:58Z", "last_sent_at" : "2010-11-25T17:26:58Z"}
],
"messages" : [{
"id" : 2,
"body" : "More testing of elasticsearch.",
"sender" : { "id" : 57834 },
"recipient" : { "id" : 3785},
"created_at" : "2010-10-25T17:26:58Z"
},{
"id" : 3,
"body" : "Reply message.",
"sender" : { "id" : 3785 },
"recipient" : { "id" : 57834},
"created_at" : "2010-11-25T17:26:58Z"
}]
}'
curl -XPOST localhost:9200/test-idx/_refresh
echo
# Using include in root
curl "localhost:9200/test-idx/message_thread/_search?pretty=true" -d '{
"query": {
"filtered": {
"query": {
"nested": {
"path": "participants",
"score_mode": "max",
"query": {
"custom_score": {
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"term": {
"participants.id": 3785
}
}
}
},
"params": {
"sort_column": "participants.last_received_at"
},
"script": "doc[sort_column].value"
}
}
}
},
"filter": {
"query": {
"multi_match": {
"query": "test",
"fields": ["subject.name", "participants.name", "messages.body"],
"operator": "and",
"use_dis_max": true
}
}
}
}
},
"sort": ["_score"],
"fields": []
}
'
# Not using include in root
curl "localhost:9200/test-idx/message_thread/_search?pretty=true" -d '{
"query": {
"filtered": {
"query": {
"nested": {
"path": "participants",
"score_mode": "max",
"query": {
"custom_score": {
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"term": {
"participants.id": 3785
}
}
}
},
"params": {
"sort_column": "participants.last_received_at"
},
"script": "doc[sort_column].value"
}
}
}
},
"filter": {
"query": {
"bool": {
"should": [{
"match": {
"subject.name":"test"
}
}, {
"nested" : {
"path": "participants",
"query": {
"match": {
"name":"test"
}
}
}
}, {
"match": {
"messages.body":"test"
}
}
]
}
}
}
}
},
"sort": ["_score"],
"fields": []
}
'
There are a couple of issues here. You are asking about nested objects, but participants is not defined as a nested object in your mapping. The second possible issue is that the score has type float, so it might not have enough precision to represent a timestamp as is. If you can figure out how to fit this value into a float, take a look at this example: Elastic search - tagging strength (nested/child document boosting). However, if you are developing a new system, it might be prudent to upgrade to 0.90.0.Beta1, which supports sorting on nested fields.
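The precision concern can be checked with a short Python sketch. It assumes the script returns the date as epoch milliseconds and that the score is held as a 32-bit float; `to_float32` is a hypothetical helper that round-trips a number through that width.

```python
import struct
from datetime import datetime, timezone

def to_float32(x):
    """Round-trip a number through a 32-bit float, as a float-typed score would hold it."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]

# last_received_at of participant 3785 in thread 1, as epoch milliseconds.
ts_ms = int(datetime(2010, 10, 27, 17, 26, 58, tzinfo=timezone.utc).timestamp() * 1000)
stored = to_float32(ts_ms)

print(ts_ms)                # exact timestamp
print(int(stored))          # what survives as a 32-bit float
print(int(stored) - ts_ms)  # rounding error in milliseconds
```

A 32-bit float has a 24-bit significand, so values of this magnitude (between 2^40 and 2^41) can only be represented in steps of 2^17 = 131072 ms, roughly two minutes. Timestamps closer together than that may collapse to the same score, which would make the resulting sort order unreliable.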
