How to use value of nested documents in script scoring - elasticsearch

Schema looks like this:
"mappings": {
"_doc": {
"_all": {
"enabled": false
},
"properties": {
"category_boost": {
"type": "nested",
"properties" : {
"category": {
"type": "text",
"index": false
},
"boost": {
"type": "integer",
"index": false
}
}
}
}
}
}
The document in elastic does have data:
"category_boost": [
{
"category": "A",
"boost": 98
},
{
"category": "B",
"boost": 96
},
{
"category": "C",
"boost": 94
},
],
Inside scoring function:
for (int i=0; i<doc['"'category_boost.boost'"'].size(); ++i) {
if (doc['"'category_boost.category'"'][i].value.equals(params.category)) {
boost = doc['"'category_boost.boost'"'][i].value;
}
}
I also tried length to get the size of the array, but it did not help. Since the loop had no effect on the results, I tried dividing by size(), which threw a division-by-zero error, so I conclude the size is 0.
Overall problem: I have a category->boost map that is dynamic, so I cannot hardcode it into the schema. I tried the object type with a JSON object, but it turned out those objects cannot be accessed in scoring functions, so I went with arrays with defined types.

The nested datatype creates sub-documents to represent the items of your collection. Accessing their doc values in a script is possible, but only from inside a nested query.
Here is one way of doing it; I hope it fulfills your requirements. This example only returns the document, with a score that depends on the chosen category.
NB: I used Elasticsearch 7 locally, so you will have to modify the mapping to add your "_doc" entry, etc.
Here is the modified mapping. I removed index: false from the nested properties since we now use them in queries:
PUT test-score_nested
{
"mappings": {
"properties": {
"category_boost": {
"type": "nested",
"properties": {
"category": {
"type": "keyword"
},
"boost": {
"type": "integer"
}
}
}
}
}
}
Then I add your sample data :
POST test-score_nested/_doc
{
"category_boost": [
{
"category": "A",
"boost": 98
},
{
"category": "B",
"boost": 96
},
{
"category": "C",
"boost": 94
}
]
}
And then the query:
We go one level deep into the nested collection.
Inside the collection, we use a function score query with the replace boost mode.
Inside the function score, we use a term query to select the requested category and use its boost for the scoring.
POST test-score_nested/_search
{
"query": {
"nested": {
"path": "category_boost",
"query": {
"function_score": {
"boost_mode": "replace",
"query": {
"term": {
"category_boost.category": {
"value": "A"
}
}
},
"functions": [
{
"field_value_factor": {
"field": "category_boost.boost"
}
}
]
}
}
}
}
}
returns
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 98.0,
"hits" : [
{
"_index" : "test-score_nested",
"_type" : "_doc",
"_id" : "v3Smqm0BZ7nyeX7PPevA",
"_score" : 98.0,
"_source" : {
"category_boost" : [
{
"category" : "A",
"boost" : 98
},
{
"category" : "B",
"boost" : 96
},
{
"category" : "C",
"boost" : 94
}
]
}
}
]
}
}
I hope it will help you!
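Since the category changes per request, the query body above can be generated rather than hand-edited. The following is a minimal sketch, assuming the field names from the mapping above; the helper name build_category_boost_query is hypothetical and not part of any Elasticsearch client. It only builds the request body and does not contact a cluster.

```python
import json


def build_category_boost_query(category, path="category_boost"):
    """Build the nested function_score body shown above for one category.

    `path` is the nested field; `category` and `boost` are the sub-field
    names taken from the mapping in this answer.
    """
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {
                    "function_score": {
                        # Replace the query score with the function result.
                        "boost_mode": "replace",
                        "query": {
                            "term": {f"{path}.category": {"value": category}}
                        },
                        "functions": [
                            {"field_value_factor": {"field": f"{path}.boost"}}
                        ],
                    }
                },
            }
        }
    }


body = build_category_boost_query("A")
print(json.dumps(body, indent=2))
```

The resulting dictionary can be serialized and sent to the _search endpoint with whichever HTTP or Elasticsearch client you already use.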

Related

ElasticSearch Query fields based on conditions on another field

Mapping
PUT /employee
{
"mappings": {
"post": {
"properties": {
"name": {
"type": "keyword"
},
"email_ids": {
"properties":{
"id" : { "type" : "integer"},
"value" : { "type" : "keyword"}
}
},
"primary_email_id":{
"type": "integer"
}
}
}
}
}
Data
POST employee/post/1
{
"name": "John",
"email_ids": [
{
"id" : 1,
"value" : "1#email.com"
},
{
"id" : 2,
"value" : "2#email.com"
}
],
"primary_email_id": 2 // Here 2 refers to the id field of email_ids.id (2#email.com).
}
I need help forming a query to check whether an email ID is already taken as a primary email.
E.g., if I query for 1#email.com, I should get No, as 1#email.com is not a primary email ID.
If I query for 2#email.com, I should get Yes, as 2#email.com is the primary email ID for John.
As far as I know, you cannot achieve what you are expecting with this mapping.
However, you can make email_ids a nested field and add one more field, such as isPrimary, set to true whenever the email is the primary email.
Index Mapping
PUT employee
{
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"email_ids": {
"type": "nested",
"properties": {
"id": {
"type": "integer"
},
"value": {
"type": "keyword"
},
"isPrimary":{
"type": "boolean"
}
}
},
"primary_email_id": {
"type": "integer"
}
}
}
}
Sample Document
POST employee/_doc/1
{
"name": "John",
"email_ids": [
{
"id": 1,
"value": "1#email.com"
},
{
"id": 2,
"value": "2#email.com",
"isPrimary": true
}
],
"primary_email_id": 2
}
Query
Keep the query below as it is and change only the email address to check whether that email is primary.
POST employee/_search
{
"_source": false,
"query": {
"nested": {
"path": "email_ids",
"query": {
"bool": {
"must": [
{
"term": {
"email_ids.value": {
"value": "2#email.com"
}
}
},
{
"term": {
"email_ids.isPrimary": {
"value": "true"
}
}
}
]
}
}
}
}
}
Result
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.98082924,
"hits" : [
{
"_index" : "employee",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.98082924
}
]
}
}
Interpret Result:
Elasticsearch will not return a boolean result such as true or false, but you can implement that at the application level. Check hits.total.value in the result: if it is 0, treat it as false; otherwise, treat it as true.
PS: Answer is based on ES version 7.10.
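The application-level check described above could look roughly like this. It is a sketch that operates on an already-parsed search response (a plain dictionary); the function name is_primary_email is hypothetical. As an extra assumption beyond the answer, it also accepts the bare integer form of hits.total that older Elasticsearch versions return.

```python
def is_primary_email(response):
    """Return True when the nested isPrimary query matched at least one document."""
    total = response["hits"]["total"]
    # ES 7 returns {"value": N, "relation": "eq"}; older versions return an int.
    count = total["value"] if isinstance(total, dict) else total
    return count > 0


# Abridged copy of the response shown above.
sample_response = {
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "max_score": 0.98082924,
        "hits": [{"_index": "employee", "_id": "1", "_score": 0.98082924}],
    }
}

print(is_primary_email(sample_response))
```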

How to get per term statistics in Elasticsearch

I need to implement the following (on the backend): a user types a query and gets back hits as well as statistics for the hits. Below is a simplified example.
Suppose the query is Grif, then the user gets back (random words just for example)
Griffith
Griffin
Grif
Grift
Griffins
And frequency + number of documents a certain term occurs in, for example:
Griffith (freq 10, 3 docs)
Griffin (freq 17, 9 docs)
Grif (freq 6, 3 docs)
Grift (freq 9, 5 docs)
Griffins (freq 11, 4 docs)
I'm relatively new to Elasticsearch, so I'm not sure where to start to implement something like this. What type of query is the most suitable for this? What can I use to get that kind of statistics? Any other advice will be appreciated too.
There are multiple layers to this. You'd need:
n-gram / partial / search-as-you-type matching
a way to group the matched keywords by their original form
a mechanism to reversely look up the document & term frequencies.
I'm not aware of any way to achieve this in one go, but here's my take on it.
You could start off with a special, n-gram-powered analyzer, as explained in my other answer. There's the original content field, plus a multi-field mapping for the said analyzer, plus a keyword field to aggregate on down the line:
PUT my-index
{
"settings": {
"index": {
"max_ngram_diff": 20
},
"analysis": {
"tokenizer": {
"my_ngrams": {
"type": "ngram",
"min_gram": 3,
"max_gram": 20,
"token_chars": [
"letter",
"digit"
]
}
},
"analyzer": {
"my_ngrams_analyzer": {
"tokenizer": "my_ngrams",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"properties": {
"content": {
"type": "text",
"fields": {
"analyzed": {
"type": "text",
"analyzer": "my_ngrams_analyzer"
},
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Next, bulk-insert some sample docs containing text inside the content field. Note that each doc has an _id too — you'll need those later on.
POST _bulk
{"index":{"_index":"my-index", "_id":1}}
{"content":"Griffith"}
{"index":{"_index":"my-index", "_id":2}}
{"content":"Griffin"}
{"index":{"_index":"my-index", "_id":3}}
{"content":"Grif"}
{"index":{"_index":"my-index", "_id":4}}
{"content":"Grift"}
{"index":{"_index":"my-index", "_id":5}}
{"content":"Griffins"}
{"index":{"_index":"my-index", "_id":6}}
{"content":"Griffith"}
{"index":{"_index":"my-index", "_id":7}}
{"content":"Griffins"}
Search for n-grams in the .analyzed field and group the matched documents by their original terms through the terms aggregation. At the same time, retrieve the _id of one of the bucketed documents through the top_hits aggregation. It doesn't matter which _id is returned in a given bucket, since every document in the bucket contains the same term.
POST my-index/_search?filter_path=aggregations.*.buckets.key,aggregations.*.buckets.doc_count,aggregations.*.buckets.*.hits.hits._id
{
"size": 0,
"query": {
"term": {
"content.analyzed": "grif"
}
},
"aggs": {
"full_terms": {
"terms": {
"field": "content.keyword",
"size": 10
},
"aggs": {
"top_doc": {
"top_hits": {
"size": 1,
"_source": false
}
}
}
}
}
}
Observe the response. The filter_path URL parameter from the previous request reduces the response to just those attributes that we need — the untouched, original full_terms plus one of the underlying IDs:
{
"aggregations" : {
"full_terms" : {
"buckets" : [
{
"key" : "Griffins",
"doc_count" : 2,
"top_doc" : {
"hits" : {
"hits" : [
{
"_id" : "5"
}
]
}
}
},
{
"key" : "Griffith",
"doc_count" : 2,
"top_doc" : {
"hits" : {
"hits" : [
{
"_id" : "1"
}
]
}
}
},
{
"key" : "Grif",
"doc_count" : 1,
"top_doc" : {
"hits" : {
"hits" : [
{
"_id" : "3"
}
]
}
}
},
{
"key" : "Griffin",
"doc_count" : 1,
"top_doc" : {
"hits" : {
"hits" : [
{
"_id" : "2"
}
]
}
}
},
{
"key" : "Grift",
"doc_count" : 1,
"top_doc" : {
"hits" : {
"hits" : [
{
"_id" : "4"
}
]
}
}
}
]
}
}
}
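In the backend, the bucket keys and representative document IDs can be pulled out of this response with a few lines of code. The sketch below assumes the aggregation names used above (full_terms and top_doc); the function name extract_terms is hypothetical, and the input is the parsed JSON response.

```python
def extract_terms(agg_response, agg_name="full_terms", hits_name="top_doc"):
    """Return one row per bucket: the original term, its bucket size, and one _id."""
    rows = []
    for bucket in agg_response["aggregations"][agg_name]["buckets"]:
        first_hit = bucket[hits_name]["hits"]["hits"][0]
        rows.append(
            {
                "term": bucket["key"],
                "matching_docs": bucket["doc_count"],
                "doc_id": first_hit["_id"],
            }
        )
    return rows


# Abridged copy of the response shown above (two of the five buckets).
sample_aggs = {
    "aggregations": {
        "full_terms": {
            "buckets": [
                {"key": "Griffins", "doc_count": 2,
                 "top_doc": {"hits": {"hits": [{"_id": "5"}]}}},
                {"key": "Grif", "doc_count": 1,
                 "top_doc": {"hits": {"hits": [{"_id": "3"}]}}},
            ]
        }
    }
}

rows = extract_terms(sample_aggs)
doc_ids = [row["doc_id"] for row in rows]
print(rows)
```

Keeping the rows in bucket order matters if you later want to line them up with responses that are returned in request order.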
Time for the fun part.
There's a specialized Elasticsearch API called Term Vectors which does exactly what you're after — it retrieves field & term stats from the whole index. In order for it to hand these stats over to you, it needs the document IDs — which you'll have obtained from the above aggregation!
Finally, since you've got multiple term vectors to work with, you can use the Multi term vectors API like so — again condensing the response thru filter_path:
POST /my-index/_mtermvectors?filter_path=docs.term_vectors.*.*.*.doc_freq,docs.term_vectors.*.*.*.term_freq
{
"docs": [
{
"_id": "5", <--- guaranteeing
"fields": [
"content.keyword"
],
"payloads": false,
"positions": false,
"offsets": false,
"field_statistics": false,
"term_statistics": true
},
{
"_id": "1", <--- the response
"fields": [
"content.keyword"
],
"payloads": false,
"positions": false,
"offsets": false,
"field_statistics": false,
"term_statistics": true
},
{
"_id": "3", <--- order
"fields": [
"content.keyword"
],
"payloads": false,
"positions": false,
"offsets": false,
"field_statistics": false,
"term_statistics": true
},
{
"_id": "2",
"fields": [
"content.keyword"
],
"payloads": false,
"positions": false,
"offsets": false,
"field_statistics": false,
"term_statistics": true
},
{
"_id": "4",
"fields": [
"content.keyword"
],
"payloads": false,
"positions": false,
"offsets": false,
"field_statistics": false,
"term_statistics": true
}
]
}
The result can be post-processed in your backend to form your autocomplete response. You've got A) the full terms, B) the number of matching documents (doc_freq), and C) the term frequency (term_freq):
{
"docs" : [
{
"term_vectors" : {
"content.keyword" : {
"terms" : {
"Griffins" : { | term
"doc_freq" : 2, | <-- # of docs
"term_freq" : 1 | term frequency
}
}
}
}
},
{
"term_vectors" : {
"content.keyword" : {
"terms" : {
"Griffith" : {
"doc_freq" : 2,
"term_freq" : 1
}
}
}
}
},
{
"term_vectors" : {
"content.keyword" : {
"terms" : {
"Grif" : {
"doc_freq" : 1,
"term_freq" : 1
}
}
}
}
},
{
"term_vectors" : {
"content.keyword" : {
"terms" : {
"Griffin" : {
"doc_freq" : 1,
"term_freq" : 1
}
}
}
}
},
{
"term_vectors" : {
"content.keyword" : {
"terms" : {
"Grift" : {
"doc_freq" : 1,
"term_freq" : 1
}
}
}
}
}
]
}
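The request body and the post-processing step can both be generated in code. This is a rough sketch under the assumptions of the example above (field content.keyword, one term per keyword field); the function names build_mtermvectors_body and collect_term_stats are hypothetical. Note that term_freq is the frequency within the single document that was looked up; if an index-wide total is needed, the term vectors response also carries a ttf value when term_statistics is enabled, which the filter_path above leaves out.

```python
def build_mtermvectors_body(doc_ids, field="content.keyword"):
    """Build the _mtermvectors body shown above for a list of document IDs."""
    return {
        "docs": [
            {
                "_id": doc_id,
                "fields": [field],
                "payloads": False,
                "positions": False,
                "offsets": False,
                "field_statistics": False,
                "term_statistics": True,
            }
            for doc_id in doc_ids
        ]
    }


def collect_term_stats(mtv_response, field="content.keyword"):
    """Flatten the filtered _mtermvectors response into {term: stats}."""
    stats = {}
    for doc in mtv_response["docs"]:
        for term, info in doc["term_vectors"][field]["terms"].items():
            stats[term] = {
                "doc_freq": info["doc_freq"],
                "term_freq": info["term_freq"],
            }
    return stats


# Abridged copy of the response shown above (two of the five docs).
sample_mtv = {
    "docs": [
        {"term_vectors": {"content.keyword": {"terms": {
            "Griffins": {"doc_freq": 2, "term_freq": 1}}}}},
        {"term_vectors": {"content.keyword": {"terms": {
            "Grif": {"doc_freq": 1, "term_freq": 1}}}}},
    ]
}

request_body = build_mtermvectors_body(["5", "3"])
term_stats = collect_term_stats(sample_mtv)
print(term_stats)
```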
Shameless plug: if you're new to Elasticsearch and, just like me, learn best from real-world examples, consider buying my Elasticsearch Handbook.

Elasticsearch filter by multiple fields in an object which is in an array field

The goal is to filter products with multiple prices.
The data looks like this:
{
"name":"a",
"price":[
{
"membershipLevel":"Gold",
"price":"5"
},
{
"membershipLevel":"Silver",
"price":"50"
},
{
"membershipLevel":"Bronze",
"price":"100"
}
]
}
I would like to filter by membershipLevel and price. For example, if I am a silver member and query price range 0-10, the product should not appear, but if I am a gold member, the product "a" should appear. Is this kind of query supported by Elasticsearch?
You need to make use of nested datatype for price and make use of nested query for your use case.
Please see the below mapping, sample document, query and response:
Mapping:
PUT my_price_index
{
"mappings": {
"properties": {
"name":{
"type":"text"
},
"price":{
"type":"nested",
"properties": {
"membershipLevel":{
"type":"keyword"
},
"price":{
"type":"double"
}
}
}
}
}
}
Sample Document:
POST my_price_index/_doc/1
{
"name":"a",
"price":[
{
"membershipLevel":"Gold",
"price":"5"
},
{
"membershipLevel":"Silver",
"price":"50"
},
{
"membershipLevel":"Bronze",
"price":"100"
}
]
}
Query:
POST my_price_index/_search
{
"query": {
"nested": {
"path": "price",
"query": {
"bool": {
"must": [
{
"term": {
"price.membershipLevel": "Gold"
}
},
{
"range": {
"price.price": {
"gte": 0,
"lte": 10
}
}
}
]
}
},
"inner_hits": {} <---- Do note this.
}
}
}
The above query returns all documents that have a nested price entry with price.price between 0 and 10 and price.membershipLevel equal to Gold.
Notice that I've made use of inner_hits. The reason is that, even though the match happens on a nested document, ES returns the entire parent document in the response rather than only the nested document that matched the query clause.
To find the exact nested document that matched, you need inner_hits.
Below is the response.
Response:
{
"took" : 128,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.9808291,
"hits" : [
{
"_index" : "my_price_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.9808291,
"_source" : {
"name" : "a",
"price" : [
{
"membershipLevel" : "Gold",
"price" : "5"
},
{
"membershipLevel" : "Silver",
"price" : "50"
},
{
"membershipLevel" : "Bronze",
"price" : "100"
}
]
},
"inner_hits" : {
"price" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.9808291,
"hits" : [
{
"_index" : "my_price_index",
"_type" : "_doc",
"_id" : "1",
"_nested" : {
"field" : "price",
"offset" : 0
},
"_score" : 1.9808291,
"_source" : {
"membershipLevel" : "Gold",
"price" : "5"
}
}
]
}
}
}
}
]
}
}
Hope this helps!
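Reading the matched nested entries out of inner_hits is a matter of walking the response. A minimal sketch, assuming the response shape shown above and the nested path price; the function name matched_nested_docs is hypothetical.

```python
def matched_nested_docs(response, path="price"):
    """Collect the nested objects that actually matched, per parent document."""
    results = []
    for hit in response["hits"]["hits"]:
        inner = hit.get("inner_hits", {}).get(path, {})
        for nested_hit in inner.get("hits", {}).get("hits", []):
            results.append(
                {
                    "parent_id": hit["_id"],
                    # Position of the matched object in the parent's array.
                    "offset": nested_hit["_nested"]["offset"],
                    "source": nested_hit["_source"],
                }
            )
    return results


# Abridged copy of the response shown above.
sample_response = {
    "hits": {
        "hits": [
            {
                "_id": "1",
                "_source": {"name": "a"},
                "inner_hits": {
                    "price": {
                        "hits": {
                            "hits": [
                                {
                                    "_id": "1",
                                    "_nested": {"field": "price", "offset": 0},
                                    "_source": {"membershipLevel": "Gold", "price": "5"},
                                }
                            ]
                        }
                    }
                },
            }
        ]
    }
}

matches = matched_nested_docs(sample_response)
print(matches)
```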
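Let me show you how to do it using nested fields together with query and filter context. I will use your example to show how to define the index mapping, index sample documents, and write the search query.
It's important to note the include_in_parent param in the Elasticsearch mapping, which allows us to query these nested fields without using the nested query. Because the fields are copied to the parent as flat fields, such a query does not tie membershipLevel and price to the same object when one document holds several of them; the sample documents here each hold a single product.
Please refer to the Elasticsearch documentation about it: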
If true, all fields in the nested object are also added to the parent
document as standard (flat) fields. Defaults to false.
Index Def
{
"mappings": {
"properties": {
"product": {
"type": "nested",
"include_in_parent": true
}
}
}
}
Index sample docs
{
"product": {
"price" : 5,
"membershipLevel" : "Gold"
}
}
{
"product": {
"price" : 50,
"membershipLevel" : "Silver"
}
}
{
"product": {
"price" : 100,
"membershipLevel" : "Bronze"
}
}
Search query to show Gold with price range 0-10
{
"query": {
"bool": {
"must": [
{
"match": {
"product.membershipLevel": "Gold"
}
}
],
"filter": [
{
"range": {
"product.price": {
"gte": 0,
"lte" : 10
}
}
}
]
}
}
}
Result
"hits": [
{
"_index": "so-60620921-nested",
"_type": "_doc",
"_id": "1",
"_score": 1.0296195,
"_source": {
"product": {
"price": 5,
"membershipLevel": "Gold"
}
}
}
]
Search query to exclude Silver, with same price range
{
"query": {
"bool": {
"must": [
{
"match": {
"product.membershipLevel": "Silver"
}
}
],
"filter": [
{
"range": {
"product.price": {
"gte": 0,
"lte" : 10
}
}
}
]
}
}
}
The above query doesn't return any results, as no document matches.
P.S.: this SO answer might help you understand nested fields and how to query them in detail.
You have to use nested fields and a nested query to achieve this: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-nested-query.html
Define your price property with type "nested", and then you will be able to filter by every property of the nested object.
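To see why the nested type matters for this question, it helps to compare the two matching behaviours on the sample product. The following is a plain-Python simulation, not Elasticsearch code: matches_flat models how an array of objects is flattened into independent multi-value fields, and matches_nested models per-object matching. Both function names are hypothetical, and prices are written as numbers for simplicity.

```python
product = {
    "name": "a",
    "price": [
        {"membershipLevel": "Gold", "price": 5},
        {"membershipLevel": "Silver", "price": 50},
        {"membershipLevel": "Bronze", "price": 100},
    ],
}


def matches_flat(doc, level, low, high):
    """Flattened fields: level and price are checked independently of each other."""
    levels = [entry["membershipLevel"] for entry in doc["price"]]
    prices = [entry["price"] for entry in doc["price"]]
    return level in levels and any(low <= p <= high for p in prices)


def matches_nested(doc, level, low, high):
    """Nested objects: both conditions must hold within the same entry."""
    return any(
        entry["membershipLevel"] == level and low <= entry["price"] <= high
        for entry in doc["price"]
    )


print(matches_flat(product, "Silver", 0, 10))    # True: a false positive
print(matches_nested(product, "Silver", 0, 10))  # False: Silver costs 50
print(matches_nested(product, "Gold", 0, 10))    # True: Gold costs 5
```

The flat version matches a Silver member in the 0-10 range because "Silver" and the price 5 both exist somewhere in the document, which is exactly the result the question wants to avoid.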

Counting non-unique items in an Elasticsearch aggregation?

I'm trying to use an Elasticsearch aggregation to return all non-unique counts for each term within a bucket.
Given a mapping:-
{
"properties": {
"addresses": {
"properties": {
"meta": {
"properties": {
"types": {
"properties": {
"type": {
"type": "keyword"
}
}
}
}
}
}
}
}
}
And a document:-
{
"id": 3,
"first_name": "James",
"last_name": "Smith",
"addresses": [
{
"meta": {
"types": [
{
"type": "Home"
},
{
"type": "Home"
},
{
"type": "Business"
},
{
"type": "Business"
},
{
"type": "Business"
},
{
"type": "Fax"
}
]
}
}
]
}
The following terms aggregation:-
GET /test/_search
{
"size": 0,
"query": {
"match": {
"id": 3
}
},
"aggs": {
"types": {
"terms": {
"field": "addresses.meta.types.type"
}
}
}
}
Gives this result:-
"aggregations" : {
"types" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Business",
"doc_count" : 1
},
{
"key" : "Fax",
"doc_count" : 1
},
{
"key" : "Home",
"doc_count" : 1
}
]
}
}
As you can see, the terms are unique, and I'm really after a total count of each, e.g. Home: 2, Business: 3 and Fax: 1.
Is this possible?
I had a look at value_count, but as it's not a bucket aggregation it seems a little less convenient to use. Alternatively, a script might possibly do it, but I'm not too sure about the syntax.
Thanks!
I doubt that is possible using the object type in Elasticsearch. The reason is that the doc_count of a bucket reflects the number of documents in which a term occurs, not the number of times the term occurs within a document.
You may have to change the type of your types field to nested so that ES stores each entry inside types as a separate document.
I've provided a sample mapping, document (no change in representation), aggregation query and response below.
Sample Mapping:
PUT nested_test
{
"mappings":{
"properties":{
"id":{
"type":"integer"
},
"first_name":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword"
}
}
},
"last_name":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword"
}
}
},
"addresses":{
"properties":{
"meta":{
"properties":{
"types":{
"type":"nested", <----- Note this
"properties":{
"type":{
"type":"keyword"
}
}
}
}
}
}
}
}
}
}
Sample Document (No change)
POST nested_test/_doc/1
{
"id": 3,
"first_name": "James",
"last_name": "Smith",
"addresses": [
{
"meta": {
"types": [
{
"type": "Home"
},
{
"type": "Home"
},
{
"type": "Business"
},
{
"type": "Business"
},
{
"type": "Business"
},
{
"type": "Fax"
}
]
}
}
]
}
Note that every type above is now considered as a separate document linked to the main document.
Aggregation Query:
All that would be required is to make use of Nested Aggregation + Terms Aggregation
POST nested_test/_search
{
"size": 0,
"aggs": {
"myterms": {
"nested": {
"path": "addresses.meta.types"
},
"aggs": {
"myterms": {
"terms": {
"field": "addresses.meta.types.type",
"size": 10,
"min_doc_count": 2 <----- Note this to filter only values with non unique counts
}
}
}
}
}
}
Note that in the above query I've made use of min_doc_count in order to restrict the results as per what you are looking for.
Response:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"myterms" : {
"doc_count" : 6,
"myterms" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Business",
"doc_count" : 3
},
{
"key" : "Home",
"doc_count" : 2
}
]
}
}
}
}
Hope that helps!
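If you want every count, including values that occur only once such as Fax, the same aggregation can be issued with min_doc_count set to 1 (the terms aggregation default). Below is a small sketch that builds the request body and turns the buckets into a dictionary; the function names build_type_count_agg and counts_from_response are hypothetical, and the aggregation names follow the query above.

```python
def build_type_count_agg(path="addresses.meta.types", field="type", min_doc_count=1):
    """Build the nested + terms aggregation shown above with a configurable threshold."""
    return {
        "size": 0,
        "aggs": {
            "myterms": {
                "nested": {"path": path},
                "aggs": {
                    "myterms": {
                        "terms": {
                            "field": f"{path}.{field}",
                            "size": 10,
                            "min_doc_count": min_doc_count,
                        }
                    }
                },
            }
        },
    }


def counts_from_response(response):
    """Map each bucket key to its doc_count (one nested document per occurrence)."""
    buckets = response["aggregations"]["myterms"]["myterms"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Abridged copy of the response shown above.
sample_response = {
    "aggregations": {
        "myterms": {
            "doc_count": 6,
            "myterms": {
                "buckets": [
                    {"key": "Business", "doc_count": 3},
                    {"key": "Home", "doc_count": 2},
                ]
            },
        }
    }
}

body = build_type_count_agg(min_doc_count=2)
counts = counts_from_response(sample_response)
print(counts)
```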

Elasticsearch Filtering Parents by Filtered Child Document Count

I'm attempting to do some elasticsearch query fu on a set of data I have.
I have a user document that is the parent of many child page view documents. I'm looking to return all users that have viewed a specific page an arbitrary number of times (defined by a user input box). So far, I've got a has_child query that returns all the users that have a page view with certain IDs. However, this will return those parents with all their children. Next, I tried to write an aggregation on those query results that essentially does the same has_child query in aggregation form. Now I have the right document count for my filtered child documents, and I need to use this count to go back and filter the parents. To explain the query in words: "return all the users that have viewed a specific page more than 4 times". It's possible that I may need to restructure my data. Any thoughts?
Here is my query thus far:
curl -XGET 'http://localhost:9200/development_users/_search?pretty=true' -d '
{
"query" : {
"has_child" : {
"type" : "page_view",
"query" : {
"terms" : {
"viewed_id" : [175,180]
}
}
}
},
"aggs" : {
"to_page_view": {
"children": {
"type" : "page_view"
},
"aggs" : {
"page_views_that_match" : {
"filter" : { "terms": { "viewed_id" : [175,180] } }
}
}
}
}
}'
This returns me a response like:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ {
"_index" : "development_users",
"_type" : "user",
"_id" : "22548",
"_score" : 1.0,
"_source":{"id":22548,"account_id":1009}
} ]
},
"aggregations" : {
"to_page_view" : {
"doc_count" : 53,
"page_views_that_match" : {
"doc_count" : 2
}
}
}
}
Associated Mappings:
{
"development_users" : {
"mappings" : {
"page_view" : {
"dynamic" : "false",
"_parent" : {
"type" : "user"
},
"_routing" : {
"required" : true
},
"properties" : {
"created_at" : {
"type" : "date",
"format" : "date_time"
},
"id" : {
"type" : "integer"
},
"viewed_id" : {
"type" : "integer"
},
"time_on_page" : {
"type" : "integer"
},
"title" : {
"type" : "string"
},
"type" : {
"type" : "string"
},
"updated_at" : {
"type" : "date",
"format" : "date_time"
},
"url" : {
"type" : "string"
}
}
},
"user" : {
"dynamic" : "false",
"properties" : {
"account_id" : {
"type" : "integer"
},
"id" : {
"type" : "integer"
}
}
}
}
}
}
Okay, so this is kind of involved. I made a few simplifications to keep it straight in my head. First, I used this mapping:
PUT /test_index
{
"mappings": {
"page_view": {
"_parent": {
"type": "development_user"
},
"properties": {
"viewed_id": {
"type": "string"
}
}
},
"development_user": {
"properties": {
"id": {
"type": "string"
}
}
}
}
}
Then I added some data. In this little universe, I have three users and two pages. I want to find users who have viewed "page_a" at least twice, so if I construct the correct query only user 3 will be returned.
POST /test_index/development_user/_bulk
{"index":{"_type":"development_user","_id":1}}
{"id":"user_1"}
{"index":{"_type":"page_view","_parent":1}}
{"viewed_id":"page_a"}
{"index":{"_type":"development_user","_id":2}}
{"id":"user_2"}
{"index":{"_type":"page_view","_parent":2}}
{"viewed_id":"page_b"}
{"index":{"_type":"development_user","_id":3}}
{"id":"user_3"}
{"index":{"_type":"page_view","_parent":3}}
{"viewed_id":"page_a"}
{"index":{"_type":"page_view","_parent":3}}
{"viewed_id":"page_a"}
{"index":{"_type":"page_view","_parent":3}}
{"viewed_id":"page_b"}
To get that answer we'll use aggregations. Notice that I don't want documents returned (the normal way), but I do want to filter down the documents we analyze, because it will make things more efficient. So I use the same basic filter you had before.
The aggregation tree starts with terms_parent_id, which simply puts each parent document in its own bucket. Inside that, children_page_view with its filter_page_ids sub-aggregation narrows the child documents down to the ones I want ("page_a"). Next to it in the hierarchy is bucket_selector_page_id_term_count, which uses a bucket selector (you'll need ES 2.x) to keep only the parent buckets that meet the criterion. Finally, a top hits aggregation shows the documents that match the requirements.
POST /test_index/development_user/_search
{
"size": 0,
"query": {
"has_child": {
"type": "page_view",
"query": {
"terms": {
"viewed_id": [
"page_a"
]
}
}
}
},
"aggs": {
"terms_parent_id": {
"terms": {
"field": "id"
},
"aggs": {
"children_page_view": {
"children": {
"type": "page_view"
},
"aggs": {
"filter_page_ids": {
"filter": {
"terms": {
"viewed_id": [
"page_a"
]
}
}
}
}
},
"bucket_selector_page_id_term_count": {
"bucket_selector": {
"buckets_path": {
"children_count": "children_page_view>filter_page_ids._count"
},
"script": "children_count >= 2"
}
},
"top_hits_users": {
"top_hits": {
"_source": {
"include": [
"id"
]
}
}
}
}
}
}
}
which returns:
{
"took": 14,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"terms_parent_id": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "user_3",
"doc_count": 1,
"children_page_view": {
"doc_count": 3,
"filter_page_ids": {
"doc_count": 2
}
},
"top_hits_users": {
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "development_user",
"_id": "3",
"_score": 1,
"_source": {
"id": "user_3"
}
}
]
}
}
}
]
}
}
}
Here's all the code I used:
http://sense.qbox.io/gist/43f24461448519dc884039db40ebd8e2f5b7304f
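Because the page IDs and the view threshold come from user input, the request body can be assembled in code. The sketch below reproduces the aggregation tree from the query above (minus the top hits part) with both values as parameters; the function names build_min_views_query and users_from_response are hypothetical. The script string keeps the ES 2.x form used in this answer, which is an assumption to revisit on other versions.

```python
def build_min_views_query(page_ids, min_views):
    """Build the has_child + bucket_selector body for a page list and a threshold."""

    def page_filter():
        # A fresh dict each time so the two usages do not share state.
        return {"terms": {"viewed_id": list(page_ids)}}

    return {
        "size": 0,
        "query": {"has_child": {"type": "page_view", "query": page_filter()}},
        "aggs": {
            "terms_parent_id": {
                "terms": {"field": "id"},
                "aggs": {
                    "children_page_view": {
                        "children": {"type": "page_view"},
                        "aggs": {"filter_page_ids": {"filter": page_filter()}},
                    },
                    "bucket_selector_page_id_term_count": {
                        "bucket_selector": {
                            "buckets_path": {
                                "children_count": "children_page_view>filter_page_ids._count"
                            },
                            # int() guards against non-numeric user input.
                            "script": f"children_count >= {int(min_views)}",
                        }
                    },
                },
            }
        },
    }


def users_from_response(response):
    """Return the parent keys whose buckets survived the bucket selector."""
    buckets = response["aggregations"]["terms_parent_id"]["buckets"]
    return [bucket["key"] for bucket in buckets]


# Abridged copy of the response shown above.
sample_response = {
    "aggregations": {
        "terms_parent_id": {
            "buckets": [
                {
                    "key": "user_3",
                    "doc_count": 1,
                    "children_page_view": {
                        "doc_count": 3,
                        "filter_page_ids": {"doc_count": 2},
                    },
                }
            ]
        }
    }
}

body = build_min_views_query(["page_a"], 2)
users = users_from_response(sample_response)
print(users)
```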
