Elasticsearch: querying/filtering nested arrays

I have stored nested data of the following type in my index test_agg in ES:
{
"Date": "2015-10-21",
"Domain": "abc.com",
"Processed_at": "10/23/2015 9:47",
"Events": [
{
"Name": "visit",
"Count": "188",
"Value_Aggregations": [
{
"Value": "red",
"Count": "100"
}
]
},
{
"Name": "order_created",
"Count": "159",
"Value_Aggregations": [
{
"Value": "$125",
"Count": "50"
}
]
}
]
}
The mapping of the nested item is:
curl -XPOST localhost:9200/test_agg/nested_evt/_mapping -d '{
"nested_evt":{
"properties":{
"Events": {
"type": "nested"
}
}
}
}'
I am trying to get "Events.Count" and "Events.Value_Aggregations.Count" where Events.Name='visit', using the query below:
{
"fields" : ["Events.Count","Events.Value_Aggregations.Count"],
"query": {
"filtered": {
"query": {
"match": { "Domain": "abc.com" }
},
"filter": {
"nested": {
"path": "Events",
"query": {
"match": { "Events.Name": "visit" }
}
}
}
}
}
}
Instead of returning a single value for each field,
Events.Count=[188] Events.Value_Aggregations.Count=[100]
it returns
Events.Count=[188,159] Events.Value_Aggregations.Count=[100,50]
What is the correct query structure to get my desired output?

The problem here is that the nested filter you are applying selects parent documents based on attributes of their nested child documents. ES finds the parent document that matches your query (based on the document's nested children). Then, instead of returning the entire document, since you have specified "fields", it picks out only the fields you asked for. Those fields happen to be nested fields, and since the parent document has two nested children, it finds two values for each field you specified and returns them all. To my knowledge there is no way to return only the matching child documents instead, at least with a nested architecture.
One solution to this problem would be to use the parent/child relationship instead. You could then use a has_parent query, in combination with the other filters, against the child type to get what you want. That would probably be a cleaner way to do this, as long as the schema architecture doesn't conflict with your other needs.
However, there is a way to do sort of what you are asking, with your current schema, with a nested aggregation combined with a filter aggregation. It's kind of involved (and slightly ambiguous in this case; see explanation below), but here's the query:
POST /test_index/_search
{
"size": 0,
"query": {
"filtered": {
"query": {
"match": {
"Domain": "abc.com"
}
},
"filter": {
"nested": {
"path": "Events",
"query": {
"match": {
"Events.Name": "visit"
}
}
}
}
}
},
"aggs": {
"nested_events": {
"nested": {
"path": "Events"
},
"aggs": {
"filtered_events": {
"filter": {
"term": {
"Events.Name": "visit"
}
},
"aggs": {
"events_count_terms": {
"terms": {
"field": "Events.Count"
}
},
"value_aggregations_count_terms": {
"terms": {
"field": "Events.Value_Aggregations.Count"
}
}
}
}
}
}
}
}
which returns:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0,
"hits": []
},
"aggregations": {
"nested_events": {
"doc_count": 2,
"filtered_events": {
"doc_count": 1,
"value_aggregations_count_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "100",
"doc_count": 1
}
]
},
"events_count_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "188",
"doc_count": 1
}
]
}
}
}
}
}
Caveat: it's not clear to me whether you actually need the "filter": { "nested": { ... } } clause of the "query" in what I've shown here. If this part filters out parent documents in a useful way, then you need it. If your only intention was to select which nested child documents to return fields from, then it's redundant here, since the filter aggregation takes care of that part.
Here is the code I used to test it:
http://sense.qbox.io/gist/dcc46e50117031de300b6f91c647fe9b729a5283
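To make the behaviour described in this answer concrete, here is a small plain-Python sketch (no Elasticsearch involved) that mimics the two steps: a nested filter picks whole parent documents, while a per-child filter picks individual nested objects. The function names are made up for illustration; the data is the sample document from the question.

```python
# Sample parent document from the question (only the relevant fields).
doc = {
    "Domain": "abc.com",
    "Events": [
        {"Name": "visit", "Count": "188",
         "Value_Aggregations": [{"Value": "red", "Count": "100"}]},
        {"Name": "order_created", "Count": "159",
         "Value_Aggregations": [{"Value": "$125", "Count": "50"}]},
    ],
}


def parent_matches(document, name):
    """Mimic the nested filter: true if ANY child event has this name."""
    return any(event["Name"] == name for event in document["Events"])


def collect_counts(events):
    """Gather Events.Count and Events.Value_Aggregations.Count values."""
    event_counts = [e["Count"] for e in events]
    value_counts = [v["Count"] for e in events for v in e["Value_Aggregations"]]
    return event_counts, value_counts


# Step 1: the parent matches, and "fields" then reads from every child.
all_counts = collect_counts(doc["Events"]) if parent_matches(doc, "visit") else ([], [])

# Step 2: filter the children themselves (roughly what the filter aggregation does).
visit_events = [e for e in doc["Events"] if e["Name"] == "visit"]
visit_counts = collect_counts(visit_events)

print(all_counts)    # (['188', '159'], ['100', '50'])
print(visit_counts)  # (['188'], ['100'])
```

The first result corresponds to what the original query returned; the second corresponds to the buckets produced by the nested plus filter aggregation.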

Here is the parent/child relationship query that produced my desired output:
{
"query": {
"filtered": {
"query": {
"bool": {"must": [
{"term": {"Name": "visit"}}
]}
},
"filter":{
"has_parent": {
"type": "domain_info",
"query" : {
"filtered": {
"query": { "match_all": {}},
"filter" : {
"and": [
{"term": {"Domain": "abc.com"}}
]
}
}
}
}
}
}
}
}

Related

Nested Filter aggregation includes the null documents in the doc_count

I have an elasticsearch index with following mapping :
{
"properties":{
"asset":{
"properties":{
"customerId":{
"type":"long"
}
}
},
"software":{
"type": "nested",
"properties":{
"id":{
"type":"long"
},
... (more properties)
}
}
}
}
Some documents may have "software": null.
When a nested filter aggregation is performed on a software attribute, say id, the doc_count of the filter aggregation also includes the documents whose software is null.
The aggregation looks like this:
"aggregations": {
"aggs": {
"nested": {
"path": "software"
},
"aggregations": {
"filtered": {
"filter": {
"term": {
"software.type": {
"value": "Application",
"boost": 1.0
}
}
},
"aggregations": {
"software_ids": {
"terms": {
"field": "software.id",
"min_doc_count": 1,
"shard_min_doc_count": 0
}
}
}
}
}
}
}
The part of the response :
"aggregations": {
"aggs": {
"doc_count": 129958,
"filtered": {
"doc_count": 7094,
This doc_count includes the documents with "software": null.
Is there a way to exclude them?
Edit: I have considered using the "missing" parameter on the inner terms aggregation (i.e. the aggregation inside the filter aggregation), but I would like to know whether there is any way to exclude such nested nulls from the aggregations altogether.
The "missing" parameter to the rescue.
With "missing", you can specify which value a document should be given when the field is absent. If you specify a value such as "JUNK", those documents will end up in a JUNK bucket in your aggregation.
The following should work:
"aggregations": {
"aggs": {
"nested": {
"path": "software"
},
"aggregations": {
"filtered": {
"filter": {
"term": {
"software.type": {
"value": "Application",
"boost": 1.0
}
}
},
"aggregations": {
"software_ids": {
"terms": {
"field": "software.id",
"min_doc_count": 1,
"shard_min_doc_count": 0,
"missing": "JUNK"
}
}
}
}
}
}
}
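If the placeholder approach is used, the placeholder bucket still has to be separated from the real ones on the client side. Below is a minimal sketch, assuming a terms-aggregation response with the usual list of key/doc_count buckets. The bucket values and the function name split_placeholder are invented for illustration and do not come from a real response.

```python
def split_placeholder(buckets, placeholder):
    """Split terms buckets into (real buckets, doc_count under the placeholder)."""
    real = [b for b in buckets if b["key"] != placeholder]
    missing = sum(b["doc_count"] for b in buckets if b["key"] == placeholder)
    return real, missing


# Hypothetical buckets; not taken from a real Elasticsearch response.
buckets = [
    {"key": 101, "doc_count": 40},
    {"key": "JUNK", "doc_count": 7},
    {"key": 202, "doc_count": 12},
]

real, missing = split_placeholder(buckets, "JUNK")
print([b["key"] for b in real])  # [101, 202]
print(missing)                   # 7
```

Since software.id is mapped as long, a numeric placeholder (for example -1) may be needed instead of a string; that is an assumption worth checking against your Elasticsearch version.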

Elasticsearch nested cardinality aggregation

I have a mapping with a nested schema. I am trying to aggregate on a nested field and order by the distinct docid count.
select name, count(distinct docid) as uniqueid from table
group by name
order by uniqueid desc
The above is what I am trying to do; this is my query:
{
"size": 0,
"aggs": {
"samples": {
"nested": {
"path": "sample"
},
"aggs": {
"sample": {
"terms": {
"field": "sample.name",
"order": {
"DocCounts": "desc"
}
},
"aggs": {
"DocCounts": {
"cardinality": {
"field": "docid"
}
}
}
}
}
}
}
}
But the result is not what I expect:
"buckets": [
{
"key": "xxxxx",
"doc_count": 173256,
"DocCounts": {
"value": 0
}
},
{
"key": "yyyyy",
"doc_count": 63,
"DocCounts": {
"value": 0
}
}
]
DocCounts comes back as 0, which is not expected. What is wrong with my query?
I think your last nested aggregation is one too many. Try getting rid of it:
{
"size": 0,
"aggs": {
"samples": {
"nested": {
"path": "sample"
},
"aggs": {
"sample": {
"terms": {
"field": "sample.name",
"order": {
"DocCounts": "desc"
}
},
"DocCounts": {
"cardinality": {
"field": "docid"
}
}
}
}
}
}
}
In general, when aggregating on a nested type by a value from the upper scope, we found that we needed to copy the value from the upper scope onto the nested type when storing the document.
In your case the aggregation would then look like:
"aggs": {
"DocCounts": {
"cardinality": {
"field": "sample.docid"
}
}
}
This works at least on version 1.7 of Elasticsearch.
You can wrap the cardinality aggregation (DocCounts) in a reverse nested aggregation. When a nested aggregation is applied, the query runs against the nested documents, so to access a field of the parent document from inside the nested context, a reverse nested aggregation is needed. Check the ES reference for more information.
Your cardinality aggregation will then look like:
"aggs": {
"internal_DocCounts": {
"reverse_nested": { },
"aggs": {
"DocCounts": {
"cardinality": {
"field": "docid"
}
}
}
}
}
The response will look like:
"buckets": [
{
"key": "xxxxx",
"doc_count": 173256,
"internal_DocCounts": {
"doc_count": 173256,
"DocCounts": {
"value": <some_value>
}
}
},
{
"key": "yyyyy",
"doc_count": 63,
"internal_DocCounts": {
"doc_count": 63,
"DocCounts": {
"value": <some_value>
}
}
},
.....
Check this similar thread
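For reference, here is a sketch of how the reverse nested suggestion might be combined with the terms aggregation from the question, built as a Python dict so it can be printed as JSON. The function name is hypothetical, and the order path "internal_DocCounts>DocCounts" is my assumption of how the original sort key would need to be adapted once DocCounts is no longer a direct child of the terms aggregation. It has not been run against a cluster.

```python
import json


def distinct_docs_per_name(path, name_field, id_field):
    """Build a search body: terms on a nested field, ordered by the number
    of distinct parent ids, counted through a reverse_nested aggregation."""
    return {
        "size": 0,
        "aggs": {
            "samples": {
                "nested": {"path": path},
                "aggs": {
                    "sample": {
                        "terms": {
                            "field": name_field,
                            # ">" steps into the single-bucket reverse_nested agg
                            "order": {"internal_DocCounts>DocCounts": "desc"},
                        },
                        "aggs": {
                            "internal_DocCounts": {
                                "reverse_nested": {},
                                "aggs": {
                                    "DocCounts": {
                                        "cardinality": {"field": id_field}
                                    }
                                },
                            }
                        },
                    }
                },
            }
        },
    }


body = distinct_docs_per_name("sample", "sample.name", "docid")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request.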

Document count aggregation via query in Elasticsearch (like facet.query in solr)

I have a main query and I need the number of matches for a couple of sub-queries.
In Solr terms, I need a facet.query. What I am missing is a simple doc_count aggregation, analogous to the value_count aggregation.
Any suggestions?
I found two possible solutions, neither of which I like:
Solution 1: use a filter aggregation with a value_count metric on _id, for example:
GET _search
{
"query": {
"match_main": {}
},
"aggs": {
"facetvalue1": {
"filter": {
"bool": {
"should": [
{"match": { "name": "fred" }},
{"term": { "lastname": "krueger" }}
]
}
},
"aggs": {
"count": {
"value_count": {
"field": "_id"
}
}
}
},
"facetvalue2": {
"filter": {
"term": { "name": "freddy" }
},
"aggs": {
"count": {
"value_count": {
"field": "_id"
}
}
}
}
}
}
Solution 2: use the Multi Search API, for example:
GET _msearch
{"index":"myindex"}
{"query":{"match_main": {}}}
{"index":"myindex"}
{"size": 0, "query":{"match_main": {}}, "filter": {"bool": {"should":[{"match": { "name": "fred" }},{"term": { "lastname": "krueger" }}]}}}
{"index":"myindex"}
{"size": 0, "query":{"match_main": {}},"filter": {"term": { "name": "freddy" }}}
I see that solution 2 is faster, but imagine match_main being a complex query!
So I would prefer solution 1 if there were a doc_count: {} instead of value_count: {"field": "_id"}.
But back to my basic question: what is the counterpart of Solr's facet.query in Elasticsearch?
You can use a filters aggregation for this. Note the additional s; it is different from the filter aggregation you already mentioned.
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"values": {
"filters": {
"filters": {
"value1": {
"bool": {
"should": [
{
"match": {
"name": "fred"
}
},
{
"term": {
"lastname": "krueger"
}
}
]
}
},
"value2": {
"term": {
"name": "freddy"
}
}
}
}
}
}
}
This will return something like
"aggregations": {
"values": {
"buckets": {
"value1": {
"doc_count": 4
},
"value2": {
"doc_count": 1
}
}
}
}
Edit: As a general note, you don't have to use a metric aggregation on your bucket aggregations. If you don't provide any subaggregations, you will just get the document count. In this case, filters will provide the buckets, but multiple filter aggregations should work as well.
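Reading the per-filter counts out of such a response is straightforward. Here is a minimal Python sketch, using the sample response shown in this answer; the helper name facet_counts is made up for illustration.

```python
def facet_counts(response, agg_name):
    """Map each named filter bucket to its doc_count."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {name: bucket["doc_count"] for name, bucket in buckets.items()}


# The sample response from the answer, as a Python dict.
response = {
    "aggregations": {
        "values": {
            "buckets": {
                "value1": {"doc_count": 4},
                "value2": {"doc_count": 1},
            }
        }
    }
}

counts = facet_counts(response, "values")
print(counts)  # {'value1': 4, 'value2': 1}
```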

How to use ElasticSearch to bucket historical data from midnight to now?

So I have an index with timestamps in the following format:
2015-03-20T12:00:00+0500
The SQL equivalent of what I would like to do is:
select date(timestamp), sum(orders)
from data
where time(timestamp) < time(now)
group by date(timestamp)
I know I need an aggregation, but for now I've tried the basic search query below, and I'm getting a malformed-query error:
{
"size": 0,
"query":
{
"filtered":
{
"query":
{
"match_all" : {}
},
"filter":
{
"range":
{
"#timestamp":
{
"from": "00:00:01.000",
"to": "15:00:00.000"
}
}
}
}
}
}
You do indeed want an aggregation, specifically the date histogram aggregation. Something like
{
"query": {"match_all": {}},
"aggs": {
"by_date": {
"date_histogram": {
"field": "timestamp",
"interval": "day"
},
"aggs": {
"order_sum": {
"sum": {"field": "foo"}
}
}
}
}
}
First comes a bucketing aggregation that groups your documents by date; inside it, a metric aggregation computes a value (in this case a sum) for each bucket.
This would return data of the form:
{
...
"aggregations": {
"by_date": {
"buckets": [
{
"key_as_string": "2015-03-01T00:00:00.000Z",
"key": 1425168000000,
"doc_count": 8644,
"order_sum": {
"value": 1234
}
},
{
"key_as_string": "2015-03-02T00:00:00.000Z",
"key": 1425254400000,
"doc_count": 8819,
"order_sum": {
"value": 45678
}
},
...
]
}
}
}
There is a good intro to aggregations on the elasticsearch blog (part 1 and part 2) if you want to do some more reading.
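To turn a response of that shape into (date, sum) rows comparable to the SQL output, something like the following Python sketch could be used. It assumes the bucket keys are epoch milliseconds in UTC, as in the sample above; the function name daily_sums and its default parameters are only illustrative.

```python
from datetime import datetime, timezone


def daily_sums(response, agg_name="by_date", metric="order_sum"):
    """Convert date_histogram buckets into (ISO date, metric value) pairs."""
    rows = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        # Bucket keys are milliseconds since the epoch.
        day = datetime.fromtimestamp(bucket["key"] / 1000, tz=timezone.utc).date()
        rows.append((day.isoformat(), bucket[metric]["value"]))
    return rows


# The two buckets shown in the sample response above.
response = {
    "aggregations": {
        "by_date": {
            "buckets": [
                {"key": 1425168000000, "doc_count": 8644,
                 "order_sum": {"value": 1234}},
                {"key": 1425254400000, "doc_count": 8819,
                 "order_sum": {"value": 45678}},
            ]
        }
    }
}

rows = daily_sums(response)
print(rows)  # [('2015-03-01', 1234), ('2015-03-02', 45678)]
```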

Trying to extract a leaf field from Elasticsearch

I have an object in elasticsearch which resembles something like this:
{
"text": "something something something",
"entities": { "hashtags":["test","test123"]}
}
The problem is that not every document has the entities attribute set. So I want to write a query which:
- must contain a keyword in the text field
- must have the entities field
- extracts the entities.hashtags field
I'm trying to extract the leaf field using the query below; the problem is that I still get documents which don't have an entities field.
As for the second part of the question: how do I extract only the entities.hashtags field? I tried something like "fields": ["entities.hashtags"], but it didn't work.
{
"size": 2000,
"query": {
"filtered": {
"query": {
"match_all": {
}
},
"filter": {
"bool": {
"must": [{
"term": {
"text": "something"
}
},
{
"missing": {
"field": "entities",
"existence": true
}
}]
}
}
}
}
}
This seems to do what you want, if I'm understanding you correctly. A "term" filter on the "text" field and an "exists" filter on the "entities" field filters the docs, and a "terms" aggregation on "entities.hashtags" extracts the values. I'll just post the full example I used:
DELETE /test_index
PUT /test_index
{
"settings": {
"number_of_shards": 1
}
}
PUT /test_index/doc/1
{
"text": "something something something",
"entities": { "hashtags": ["test","test123"] }
}
PUT /test_index/doc/2
{
"text": "another doc",
"entities": { "hashtags": ["testagain","testagain123"] }
}
PUT /test_index/doc/3
{
"text": "doc with no entities"
}
POST /test_index/_search
{
"size": 0,
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"bool": {
"must": [
{ "term": { "text": "something" } },
{ "exists": { "field": "entities" } }
]
}
}
}
},
"aggs": {
"hashtags": {
"terms": {
"field": "entities.hashtags"
}
}
}
}
...
{
"took": 35,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0,
"hits": []
},
"aggregations": {
"hashtags": {
"buckets": [
{
"key": "test",
"doc_count": 1
},
{
"key": "test123",
"doc_count": 1
}
]
}
}
}
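As a sanity check on the result above, the same logic can be mirrored in plain Python over the three test documents: keep documents whose text contains the keyword and which have an entities field, then count each hashtag once per document. This is only a simulation of the term and exists filters plus the terms aggregation, not Elasticsearch code; splitting the text on whitespace is a simplifying stand-in for the analyzer.

```python
from collections import Counter

# The three documents indexed in the example above.
docs = [
    {"text": "something something something",
     "entities": {"hashtags": ["test", "test123"]}},
    {"text": "another doc",
     "entities": {"hashtags": ["testagain", "testagain123"]}},
    {"text": "doc with no entities"},
]


def hashtag_counts(documents, keyword):
    """Count hashtags over documents that match the keyword and have entities."""
    counts = Counter()
    for d in documents:
        has_keyword = keyword in d["text"].lower().split()  # stands in for "term"
        has_entities = "entities" in d                       # stands in for "exists"
        if has_keyword and has_entities:
            # set(): a document contributes at most once per hashtag, like doc_count
            counts.update(set(d["entities"].get("hashtags", [])))
    return dict(counts)


result = hashtag_counts(docs, "something")
print(sorted(result.items()))  # [('test', 1), ('test123', 1)]
```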
