Elasticsearch filter by multiple fields in an object which is in an array field - elasticsearch

The goal is to filter products with multiple prices.
The data looks like this:
{
"name":"a",
"price":[
{
"membershipLevel":"Gold",
"price":"5"
},
{
"membershipLevel":"Silver",
"price":"50"
},
{
"membershipLevel":"Bronze",
"price":"100"
}
]
}
I would like to filter by membershipLevel and price. For example, if I am a silver member and query price range 0-10, the product should not appear, but if I am a gold member, the product "a" should appear. Is this kind of query supported by Elasticsearch?

You need to make use of nested datatype for price and make use of nested query for your use case.
Please see the below mapping, sample document, query and response:
Mapping:
PUT my_price_index
{
"mappings": {
"properties": {
"name":{
"type":"text"
},
"price":{
"type":"nested",
"properties": {
"membershipLevel":{
"type":"keyword"
},
"price":{
"type":"double"
}
}
}
}
}
}
Sample Document:
POST my_price_index/_doc/1
{
"name":"a",
"price":[
{
"membershipLevel":"Gold",
"price":"5"
},
{
"membershipLevel":"Silver",
"price":"50"
},
{
"membershipLevel":"Bronze",
"price":"100"
}
]
}
Query:
POST my_price_index/_search
{
"query": {
"nested": {
"path": "price",
"query": {
"bool": {
"must": [
{
"term": {
"price.membershipLevel": "Gold"
}
},
{
"range": {
"price.price": {
"gte": 0,
"lte": 10
}
}
}
]
}
},
"inner_hits": {} <---- Do note this.
}
}
}
The above query means, I want to return all the documents having price.price range from 0 to 10 and price.membershipLevel as Gold.
Notice that I've made use of inner_hits. The reason is despite being a nested document, ES as response would return the entire set of document instead of only the document specific to where the query clause is applicable.
In order to find the exact nested doc that has been matched, you would need to make use of inner_hits.
Below is how the response would return.
Response:
{
"took" : 128,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.9808291,
"hits" : [
{
"_index" : "my_price_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.9808291,
"_source" : {
"name" : "a",
"price" : [
{
"membershipLevel" : "Gold",
"price" : "5"
},
{
"membershipLevel" : "Silver",
"price" : "50"
},
{
"membershipLevel" : "Bronze",
"price" : "100"
}
]
},
"inner_hits" : {
"price" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.9808291,
"hits" : [
{
"_index" : "my_price_index",
"_type" : "_doc",
"_id" : "1",
"_nested" : {
"field" : "price",
"offset" : 0
},
"_score" : 1.9808291,
"_source" : {
"membershipLevel" : "Gold",
"price" : "5"
}
}
]
}
}
}
}
]
}
}
Hope this helps!

Let me take show you how to do it, using the nested fields and query and filter context. I will take your example to show, you how to define index mapping, index sample documents, and search query.
It's important to note the include_in_parent param in Elasticsearch mapping, which allows us to use these nested fields without using the nested fields.
Please refer to Elasticsearch documentation about it.
If true, all fields in the nested object are also added to the parent
document as standard (flat) fields. Defaults to false.
Index Def
{
"mappings": {
"properties": {
"product": {
"type": "nested",
"include_in_parent": true
}
}
}
}
Index sample docs
{
"product": {
"price" : 5,
"membershipLevel" : "Gold"
}
}
{
"product": {
"price" : 50,
"membershipLevel" : "Silver"
}
}
{
"product": {
"price" : 100,
"membershipLevel" : "Bronze"
}
}
Search query to show Gold with price range 0-10
{
"query": {
"bool": {
"must": [
{
"match": {
"product.membershipLevel": "Gold"
}
}
],
"filter": [
{
"range": {
"product.price": {
"gte": 0,
"lte" : 10
}
}
}
]
}
}
}
Result
"hits": [
{
"_index": "so-60620921-nested",
"_type": "_doc",
"_id": "1",
"_score": 1.0296195,
"_source": {
"product": {
"price": 5,
"membershipLevel": "Gold"
}
}
}
]
Search query to exclude Silver, with same price range
{
"query": {
"bool": {
"must": [
{
"match": {
"product.membershipLevel": "Silver"
}
}
],
"filter": [
{
"range": {
"product.price": {
"gte": 0,
"lte" : 10
}
}
}
]
}
}
}
Above query doesn't return any result as there isn't any matching result.
P.S :- this SO answer might help you to understand nested fields and query on them in detail.

You have to use Nested fields and nested query to archive this: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-nested-query.html
Define you Price property with type "Nested" and then you will be able to filter by every property of nested object

Related

SQl equivalent Correlated query for ElasticSearch Aggregation

I have a use case for writing an aggregation which if written in SQL can be achieved using correlated queries.
I have a index called listings where the properties/columns are ListDate, ListPrice, SoldDate, SoldPrice, OffMarketDate.
ListDate is not nullable, but SoldDate,SoldPrice, OffMarketDate can be nullable.
I want to aggregate stats from the above index based on the following requirement.
I want to have monthly stats, which I see can be achieved by
DateHistogramAggregation
For each month from the
DateHistogramAggregation, I want to find the listings as follows:
Example: For Jan 2019, get all the listings where (ListDate< Feb 1st, 2019) and (SoldDate is null or SoldDate<Jan 1st, 2019) and (OffMarketDate is null or OffMarketDate< Jan 1st, 2019)
Then run the aggregation function for those lists each month.
I appreciate any suggestions to implement this use case. Thanks in advance for the help.
Please see the below details and info as how you can approach this problem:
Mapping:
PUT listings
{
"mappings": {
"properties": {
"listDate":{
"type": "date"
},
"listPrice":{
"type": "long"
},
"soldDate":{
"type": "date"
},
"soldPrice": {
"type": "long"
},
"offMarketDate": {
"type": "date"
}
}
}
}
Note that I've constructed the above mapping looking at your question.
Sample Documents:
POST listings/_doc/1
{
"listDate": "2020-01-01",
"listPrice": "100.00",
"soldDate": "2019-12-25",
"soldPrice": "120.00",
"offMarketDate": "2019-12-20"
}
POST listings/_doc/2
{
"listDate": "2020-01-01",
"listPrice": "100.00",
"soldDate": "2019-12-24",
"soldPrice": "122.00",
"offMarketDate": "2019-12-20"
}
POST listings/_doc/3
{
"listDate": "2020-01-25",
"listPrice": "120.00",
"soldDate": "2020-01-30",
"soldPrice": "140.00",
"offMarketDate": "2020-01-26"
}
POST listings/_doc/4
{
"listDate": "2020-01-25",
"listPrice": "120.00",
"soldDate": "2020-02-02",
"soldPrice": "135.00",
"offMarketDate": "2020-01-26"
}
POST listings/_doc/5
{
"listDate": "2020-01-25",
"listPrice": "120.00"
}
POST listings/_doc/6
{
"listDate": "2020-02-02",
"listPrice": "120.00"
}
Note how I've not added the soldDate and offMarketDate in the docs 5 and 6 as that would be better option than having it with null value.
Request Query:
So I've come up with the below query for your use-case.
Also for the sake of aggregation, let's say I've calculated the total soldPrice for the docs having
listDate in the month of Jan 2020 AND
(soldDate either null OR soldDate before the month of Jan 2020) AND.
(offMarketDate either null OR offMarketDate before the month of Jan 2020).
Below is the query:
POST listings/_search
{
"query": {
"bool": {
"must": [
{
"range": {
"listDate": {
"gte": "2020-01-01",
"lte": "2020-02-01"
}
}
},
{
"bool": {
"should": [
{
"range": {
"soldDate": {
"lte": "2020-01-01"
}
}
},
{
"bool": {
"must_not": [
{
"exists": {
"field": "soldDate"
}
}
]
}
}
],
"minimum_should_match": 1
}
},
{
"bool": {
"should": [
{
"range": {
"offMarketDate": {
"lte": "2020-01-01"
}
}
},
{
"bool": {
"must_not": [
{
"exists": {
"field": "offMarketDate"
}
}
]
}
}
],
"minimum_should_match": 1
}
}
]
}
},
"aggs": {
"my_histogram": {
"date_histogram": {
"field": "listDate",
"calendar_interval": "month"
},
"aggs": {
"total_sales_price": {
"sum": {
"field": "soldPrice"
}
}
}
}
}
}
The query above is very easily readable and self explanatory. I'd suggest reading about the below different queries which I've made use of:
Bool Query
Range Query
Field Exists Query to verify if field exists or not.
Data Histogram Aggregation
Sum Metrics Aggregation
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 3.0,
"hits" : [
{
"_index" : "listings",
"_type" : "_doc",
"_id" : "1",
"_score" : 3.0,
"_source" : {
"listDate" : "2020-01-01",
"listPrice" : "100.00",
"soldDate" : "2019-12-25",
"soldPrice" : "120.00",
"offMarketDate" : "2019-12-20"
}
},
{
"_index" : "listings",
"_type" : "_doc",
"_id" : "2",
"_score" : 3.0,
"_source" : {
"listDate" : "2020-01-01",
"listPrice" : "100.00",
"soldDate" : "2019-12-24",
"soldPrice" : "122.00",
"offMarketDate" : "2019-12-20"
}
},
{
"_index" : "listings",
"_type" : "_doc",
"_id" : "5",
"_score" : 1.0,
"_source" : {
"listDate" : "2020-01-25",
"listPrice" : "120.00"
}
}
]
},
"aggregations" : {
"my_histogram" : {
"buckets" : [
{
"key_as_string" : "2020-01-01T00:00:00.000Z",
"key" : 1577836800000,
"doc_count" : 3,
"total_sales_price" : {
"value" : 242.0
}
}
]
}
}
}
As expected, documents 1,2 and 5 are showing up with the correct aggregated sum of soldPrice.
Hope that helps!

Combining nested query get illegal_state_exception failed to find nested object under path

I'm creating a query on Elasticsearch, for find documents through all indices.
I need to combine should, must and nested query on Elasticsearch, i get the right result but i get an error inside the result.
This is the query I'm using
GET _all/_search
{
"query": {
"bool": {
"minimum_should_match": 1,
"should": [
{ "term": { "trimmed_final_url": "https://www.repubblica.it/t.../" } }
],
"must": [
{
"nested": {
"path": "entities",
"query": {
"bool": {
"must": [
{ "term": { "entities.id": "138511" } }
]
}
}
}
},
{
"term": {
"language": { "value": "it" }
}
}
]
}
}
And this is the result
{
"_shards" : {
"total" : 38,
"successful" : 14,
"skipped" : 0,
"failed" : 24,
"failures" : [
{
"shard" : 0,
"index" : ".kibana_1",
"node" : "7twsq85TSK60LkY0UiuWzA",
"reason" : {
"type" : "query_shard_exception",
"reason" : """
failed to create query: {
...
"index_uuid" : "HoHi97QFSaSCp09iSKY1DQ",
"index" : ".reporting-2019.06.02",
"caused_by" : {
"type" : "illegal_state_exception",
"reason" : "[nested] failed to find nested object under path [entities]"
}
}
},
...
"hits" : {
"total" : {
"value" : 50,
"relation" : "eq"
},
"max_score" : 16.90015,
"hits" : [
{
"_index" : "i_201906_v1",
"_type" : "_doc",
"_id" : "MugcbmsBAzi8a0oJt96Q",
"_score" : 16.90015,
"_source" : {
"language" : "it",
"entities" : [
{
"id" : 101580,
},
{
"id" : 156822,
},
...
I didn't write some fields because the code is too long
I am new to StackOverFlow (made this account to answer this question :D) so if this answer is out of line bear with me. I have been dabbling in nested fields in Elasticsearch recently so I have some ideas as to how this error could be appearing.
Have you defined a mapping for your document type? I don't believe Elasticsearch will recognize the field as nested if you do not tell it to do so in the mapping:
PUT INDEX_NAME
{
"mappings": {
"DOC_TYPE": {
"properties": {
"entities": {"type": "nested"}
}
}
}
}
You may have to specify this mapping for each index and document type. Not sure if there is a way to do that all with one request.
I also noticed you have a "should" clause with minimum matches set to 1. I believe this is exactly the same as a "must" clause so I am not sure what purpose this achieves (correct me if I'm wrong). If your mapping is specified, the query should look something like this:
GET /_all/_search
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "entities",
"query": {
"term": {
"entities.id": {
"value": "138511"
}
}
}
}
},
{
"term": {
"language": {
"value": "it"
}
}
},
{
"term": {
"trimmed_final_url": {
"value": "https://www.repubblica.it/t.../"
}
}
}
]
}
}
}

access query value from function_score to compute new score

I need to customize ES score. The score function I need to implement is:
score = len(document_term) - len(query_term)
For instance, one of my document in the ES index is :
{
"name": "foobar"
}
And the search query
{
"query": {
"function_score": {
"query": {
"match": {
"name": {
"query": "foo"
}
}
},
"functions": [
{
"script_score": {
"script": {
"source": "doc['name'].value.length() - ?LEN(query_tem)?"
}
}
}
],
"boost_mode": "replace"
}
}
}
The above search should provide a score of 6 - 3 = 3. But I didn't find a solution to get access the value of the query term.
Is it possible to access the value of the query term in a function_score context ?
There is no direct way to do this, however you can achieve that in the below way where you would need to add the query parameters in two different parts of the query.
Before that one important note, you cannot apply the doc['myfield'].value if the field is of type text, instead you would need to have its sibling field created as keyword and refer that in the script, which again I've mentioned below:
Mapping:
PUT myindex
{
"mappings" : {
"properties" : {
"myfield" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
Sample Document:
POST myquery/_doc/1
{
"myfield": "I've become comfortably numb"
}
Query:
POST <your_index_name>/_search
{
"query": {
"function_score": {
"query": {
"match": {
"myfield": "numb"
}
},
"functions": [
{
"script_score": {
"script": {
"source": "return doc['myfield.keyword'].value.length() - params.myquery.length()",
"params": {
"myquery": "numb" <---- Add the query string here as well
}
}
}
}
],
"boost_mode": "replace"
}
}
}
Response:
{
"took" : 558,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 24.0,
"hits" : [
{
"_index" : "myindex",
"_type" : "_doc",
"_id" : "1",
"_score" : 24.0,
"_source" : {
"myfield" : "I've become comfortably numb"
}
}
]
}
}
Hope this helps!

Elasticsearch Match Date Range or Number in Array

My goal is to filter my records by date and a day of the week (Mo = 1, Tue = 2, Thu = 3, ..., Sun = 7). In this case, either the date or the weekday should match any of the days in the array. Or both, of course. I am new to Elasticsearch and seem to have a number of mistakes in my query. I documented everything here, as far as I got and hope for a couple of helpful insights. Thanks in advance.
Current Mapping
{
"index":{
"mappings":{
"entity":{
"_meta":{
"model":"AppBundle\\Entity\\Entity"
},
"properties":{
"subEntity":{
"properties":{
"date":{
"type":"date",
"format":"strict_date_optional_time||epoch_millis"
},
"days":{
"properties":{
"day":{
"type":"string"
}
}
}
}
}
}
}
}
}
}
Current Records
curl -XGET 'localhost:9200/index/_search?pretty=1'
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 4,
"max_score" : 1.0,
"hits" : [ {
"_index" : "index",
"_type" : "entity",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"subEntity" : [ {
"date" : "2016-09-20T00:00:00+02:00",
"days" : [ ]
}, {
"date" : "2016-09-21T00:00:00+02:00",
"days" : [ ]
}, {
"date" : "2016-09-22T00:00:00+02:00",
"days" : [ {
"day" : 4
}, {
"day" : 5
}, {
"day" : 6
} ]
}, {
"date" : "2016-09-20T00:00:00+02:00",
"days" : [ ]
} ]
}
},
[...]
}
}
Current Request
{
"query":{
"should":{
"filter":[ {
"range":{
"entity.subEntity.date":{
"gte":"2016-09-20",
"lte":"2016-09-21"
}
}
}, {
"term":{
"entity.subEntity.days.day": 2
}
} ]
}
}
}
MySQL Equivalent
SELECT entity
FROM entity
LEFT JOIN subEntity ON (subEntity.entity_id = entity.id)
LEFT JOIN day ON (day.subEntity_id = subEntity.id)
WHERE subEntity.date BETWEEN 2016-09-20 AND 2016-09-21
OR day = 2
If you want to query across properties of a sub-object within a document (where a document may have a collection of such sub-objects), you need to map subEntity as a nested type. In your example, since you are only looking for documents that are within the date range or match the day value, you can use an object mapping as have, but if you need to combine queries with an and operation, then you would need a nested type mapping. If you need to do this, it would make sense to map as a nested type. Additionally, since day is a numeric value, you should map it as a byte.
{
"index":{
"mappings":{
"entity":{
"_meta":{
"model":"AppBundle\\Entity\\Entity"
},
"properties":{
"subEntity":{
"type": "nested",
"properties":{
"date":{
"type":"date",
"format":"strict_date_optional_time||epoch_millis"
},
"days":{
"properties":{
"day":{
"type":"byte"
}
}
}
}
}
}
}
}
}
}
Now that subEntity is mapped as a nested type, a nested query needs to be used to query against it, so the query becomes
{
"query": {
"nested": {
"query": {
"bool": {
"should": [
{
"bool": {
"filter": [
{
"range": {
"subEntity.date": {
"gte": "2016-09-20",
"lte": "2016-09-21"
}
}
}
]
}
},
{
"bool": {
"filter": [
{
"terms": {
"subEntity.days.day": [
2
]
}
}
]
}
}
]
}
},
"path": "subEntity"
}
}
}
Both queries are issued as bool filter queries as we don't need to calculate a relevancy score for either, we simply need to know if a document matches or not i.e. a simple yes/no answer. Warpping a query in a bool filter means that the query runs in a filter context.
Next, either query can match, so we add both as should clauses to an outer bool query.
As a complete example:
Create index and mapping
PUT http://localhost:9200/entities?pretty=true
{
"settings": {
"index.number_of_replicas": 0,
"index.number_of_shards": 1
},
"mappings": {
"entity": {
"properties": {
"id": {
"type": "integer"
},
"subEntity": {
"type": "nested",
"properties": {
"date": {
"type": "date"
},
"days": {
"properties": {
"day": {
"type": "short"
}
},
"type": "object"
}
}
}
}
}
}
}
Bulk index four entities
POST http://localhost:9200/_bulk?pretty=true
{"index":{"_index":"entities","_type":"entity","_id":"1"}}
{"subEntity":{"date":"2016-09-19T05:00:00+00:00"}}
{"index":{"_index":"entities","_type":"entity","_id":"2"}}
{"subEntity":{"date":"2016-09-20T05:00:00+00:00"}}
{"index":{"_index":"entities","_type":"entity","_id":"3"}}
{"subEntity":{"date":"2016-09-18T18:00:00+00:00","days":[{"day":2},{"day":5}]}}
{"index":{"_index":"entities","_type":"entity","_id":"4"}}
{"subEntity":{"date":"2016-09-18T18:00:00+00:00","days":[{"day":3},{"day":4}]}}
Issue the search query above
POST http://localhost:9200/entities/entity/_search?pretty=true
{
"query": {
"nested": {
"query": {
"bool": {
"should": [
{
"bool": {
"filter": [
{
"range": {
"subEntity.date": {
"gte": "2016-09-20",
"lte": "2016-09-21"
}
}
}
]
}
},
{
"bool": {
"filter": [
{
"terms": {
"subEntity.days.day": [
2
]
}
}
]
}
}
]
}
},
"path": "subEntity"
}
}
}
We should only get back entities with ids 2 and 3; id 2 matches on date and id 3 matches on day
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.0,
"hits" : [ {
"_index" : "entities",
"_type" : "entity",
"_id" : "2",
"_score" : 0.0,
"_source" : {
"subEntity" : {
"date" : "2016-09-20T05:00:00+00:00"
}
}
}, {
"_index" : "entities",
"_type" : "entity",
"_id" : "3",
"_score" : 0.0,
"_source" : {
"subEntity" : {
"date" : "2016-09-18T18:00:00+00:00",
"days" : [ {
"day" : 2
}, {
"day" : 5
} ]
}
}
} ]
}
}
Your Solution can be easily achieved using "or" query but now in es 2.0.0 onwards "or" query is deprecated. in-place of using or query we can use "bool" query now. Sample query is given below
{
"query": {
"bool" : {
"should" : [
{
"term" : { "CREAT_DT": "2015-11-03T07:49:07.000Z" }
},
{
"term" : { "TableName": "dwd" }
}
],
"minimum_should_match" : 1,
"boost" : 1.0
}
}
}
More details about it's uses can be found in below link
https://www.elastic.co/guide/en/elasticsearch/reference/2.0/query-dsl-bool-query.html

Elasticsearch Filtering Parents by Filtered Child Document Count

I'm attempting to do some elasticsearch query fu on a set of data I have.
I have a user document that is the parent to many child page view documents. I'm looking to return all users that have viewed a specific page an arbitrary amount of times (defined by user input box). So far, I've got a has_child query that will return me all the users that have a page view with certain ids. However, this will return those parents with all their children. Next, I've tried to write an aggregation on those query results, that will essentially do the same has_child query in aggregation form. Now, I have the right document count for my filtered child documents. I need to use this document count to go back and filter the parents. To explain the query in words, "return to me all the users that have viewed a specific page more than 4 times". It's possible that I may need to restructure my data. Any thoughts?
Here is my query thus far:
curl -XGET 'http://localhost:9200/development_users/_search?pretty=true' -d '
{
"query" : {
"has_child" : {
"type" : "page_view",
"query" : {
"terms" : {
"viewed_id" : [175,180]
}
}
}
},
"aggs" : {
"to_page_view": {
"children": {
"type" : "page_view"
},
"aggs" : {
"page_views_that_match" : {
"filter" : { "terms": { "viewed_id" : [175,180] } }
}
}
}
}
}'
This returns me a response like:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ {
"_index" : "development_users",
"_type" : "user",
"_id" : "22548",
"_score" : 1.0,
"_source":{"id":22548,"account_id":1009}
} ]
},
"aggregations" : {
"to_page_view" : {
"doc_count" : 53,
"page_views_that_match" : {
"doc_count" : 2
}
}
}
}
Associated Mappings:
{
"development_users" : {
"mappings" : {
"page_view" : {
"dynamic" : "false",
"_parent" : {
"type" : "user"
},
"_routing" : {
"required" : true
},
"properties" : {
"created_at" : {
"type" : "date",
"format" : "date_time"
},
"id" : {
"type" : "integer"
},
"viewed_id" : {
"type" : "integer"
},
"time_on_page" : {
"type" : "integer"
},
"title" : {
"type" : "string"
},
"type" : {
"type" : "string"
},
"updated_at" : {
"type" : "date",
"format" : "date_time"
},
"url" : {
"type" : "string"
}
}
},
"user" : {
"dynamic" : "false",
"properties" : {
"account_id" : {
"type" : "integer"
},
"id" : {
"type" : "integer"
}
}
}
}
}
}
Okay, so this is kind of involved. I made a few simplifications to keep it straight in my head. First, I used this mapping:
PUT /test_index
{
"mappings": {
"page_view": {
"_parent": {
"type": "development_user"
},
"properties": {
"viewed_id": {
"type": "string"
}
}
},
"development_user": {
"properties": {
"id": {
"type": "string"
}
}
}
}
}
Then I added some data. In this little universe, I have three users and two pages. I want to find users who have viewed "page_a" at least twice, so if I construct the correct query only user 3 will be returned.
POST /test_index/development_user/_bulk
{"index":{"_type":"development_user","_id":1}}
{"id":"user_1"}
{"index":{"_type":"page_view","_parent":1}}
{"viewed_id":"page_a"}
{"index":{"_type":"development_user","_id":2}}
{"id":"user_2"}
{"index":{"_type":"page_view","_parent":2}}
{"viewed_id":"page_b"}
{"index":{"_type":"development_user","_id":3}}
{"id":"user_3"}
{"index":{"_type":"page_view","_parent":3}}
{"viewed_id":"page_a"}
{"index":{"_type":"page_view","_parent":3}}
{"viewed_id":"page_a"}
{"index":{"_type":"page_view","_parent":3}}
{"viewed_id":"page_b"}
To get that answer we'll use aggregations. Notice that I don't want documents returned (the normal way), but I do want to filter down the documents we analyze, because it will make things more efficient. So I use the same basic filter you had before.
So the aggregation tree starts with terms_parent_id which will just separate parent documents. Inside that I have children_page_view which filters the child documents down to the ones I want ("page_a"), and next to it in the hierarchy is bucket_selector_page_id_term_count which uses a bucket selector (you'll need ES 2.x) to filter the parent documents by those meeting the criterium, and then finally a top hits aggregation which shows us the documents that match the requirements.
POST /test_index/development_user/_search
{
"size": 0,
"query": {
"has_child": {
"type": "page_view",
"query": {
"terms": {
"viewed_id": [
"page_a"
]
}
}
}
},
"aggs": {
"terms_parent_id": {
"terms": {
"field": "id"
},
"aggs": {
"children_page_view": {
"children": {
"type": "page_view"
},
"aggs": {
"filter_page_ids": {
"filter": {
"terms": {
"viewed_id": [
"page_a"
]
}
}
}
}
},
"bucket_selector_page_id_term_count": {
"bucket_selector": {
"buckets_path": {
"children_count": "children_page_view>filter_page_ids._count"
},
"script": "children_count >= 2"
}
},
"top_hits_users": {
"top_hits": {
"_source": {
"include": [
"id"
]
}
}
}
}
}
}
}
which returns:
{
"took": 14,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"terms_parent_id": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "user_3",
"doc_count": 1,
"children_page_view": {
"doc_count": 3,
"filter_page_ids": {
"doc_count": 2
}
},
"top_hits_users": {
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "development_user",
"_id": "3",
"_score": 1,
"_source": {
"id": "user_3"
}
}
]
}
}
}
]
}
}
}
Here's all the code I used:
http://sense.qbox.io/gist/43f24461448519dc884039db40ebd8e2f5b7304f

Resources