Querying after aggregation in elasticsearch? - elasticsearch

We have a database of contacts outside of Elasticsearch. Each of these contacts has many dynamic attributes (gender:male, yearOfBirth:1985, carColor:blue, etc.).
We wanted to integrate Elasticsearch into our setup and we decided to index by attribute for scalability. So an example document in Elasticsearch would look like this:
{
  "contactId": "XYZ",
  "attribute": "gender",
  "value": "male"
}
That way, we can add unlimited attributes to any contact without having to reindex any documents.
Our problem comes when it's time to search within those documents. We want to be able to list contacts by passing attribute definitions to Elasticsearch, i.e. list contacts that are male AND have blue cars AND so on.
So we would want to do something like:
Aggregate documents by contactId
Write the query for the attribute criteria
Paginate the results
We came up with something like this.
{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "must": [
              { "match": { "attribute": "gender" } },
              { "match": { "value": "f" } }
            ]
          }
        },
        {
          "bool": {
            "must": [
              { "match": { "attribute": "carColor" } },
              { "match": { "value": "blue" } }
            ]
          }
        }
      ],
      "minimum_should_match": 2
    }
  },
  "aggs": {
    "contacts": {
      "composite": {
        "size": 15,
        "sources": [
          {
            "contactId": {
              "terms": {
                "field": "contactId"
              }
            }
          }
        ]
      }
    }
  }
}
But we can't really get to the result we want.
Does anyone have an idea of what we're doing wrong and/or how we could improve this query?
Thanks a lot!

The main issue is that Elasticsearch can't join data. Your query likely doesn't match anything because Elasticsearch applies the criteria to each document one by one. Since the attribute-value pairs are stored in individual documents, a multi-criteria query like the one below can't match any single document:
(attribute:gender AND value:male) AND (attribute:carColor AND value:blue)
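To see why, apply that expression to each one-attribute-per-document record by hand. A minimal Python sketch (with illustrative in-memory documents) shows that no individual document can satisfy both attribute/value pairs at once:

```python
# Each Elasticsearch document holds exactly one attribute/value pair.
docs = [
    {"contactId": "XYZ", "attribute": "gender", "value": "male"},
    {"contactId": "XYZ", "attribute": "carColor", "value": "blue"},
]

def matches_both(doc):
    # (attribute:gender AND value:male) AND (attribute:carColor AND value:blue)
    return (doc["attribute"] == "gender" and doc["value"] == "male"
            and doc["attribute"] == "carColor" and doc["value"] == "blue")

# "attribute" can't equal two different strings in one document.
print([d for d in docs if matches_both(d)])  # []
```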
Going forward, one option is to get all the documents that match the given criteria and then aggregate. I guess this is what you're trying to achieve.
Assuming the data:
attribute | contactId | value
----------|-----------|-------
gender    | XYZ       | male
carColor  | XYZ       | blue
gender    | ABS       | female
pet       | ABS       | tiger
carColor  | ABS       | red
pet       | XYZ       | dog
gender    | XXX       | female
carColor  | XXX       | blue
With the following query:
{
"size": 0,
"query": {
"query_string": {
"query": "(attribute:gender AND value:male) OR (attribute:carColor AND value:blue)" // Lets read them as (C1) or (C2)
}
},
"aggs": {
"contacts": {
"terms": {
"field": "contactId",
"order": {
"_count": "asc"
},
"size": 5
},
"aggs": {
"attribute": {
"terms": {
"field": "attribute",
"order": {
"_count": "asc"
},
"size": 5
},
"aggs": {
"attribute_value": {
"terms": {
"field": "value",
"order": {
"_count": "asc"
},
"size": 5
}
}
}
}
}
}
}
}
Results in:
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"contacts" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "XXX",
"doc_count" : 1,
"attribute" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "carColor",
"doc_count" : 1,
"attribute_value" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "blue",
"doc_count" : 1
}
]
}
}
]
}
},
{
"key" : "XYZ",
"doc_count" : 2,
"attribute" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "carColor",
"doc_count" : 1,
"attribute_value" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "blue",
"doc_count" : 1
}
]
}
},
{
"key" : "gender",
"doc_count" : 1,
"attribute_value" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "male",
"doc_count" : 1
}
]
}
}
]
}
}
]
}
}
This way, you can get a top-level grouping by the "contactId". The caveats are:
The nested aggregations are needed to join the data (perhaps the only way to join data in ES dynamically).
Result buckets will contain any grouping which can be made from records that match either (C1) or (C2). The buckets then need to be post-filtered to remove those that don't match all the given criteria (in the above example, buckets with a single match need to be dropped).
For N criteria, the nesting will have a depth of N and will become painfully slow for N > 10.
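That post-filtering step can be done client-side once the nested buckets come back. A minimal Python sketch (the bucket structure mirrors the response above; `post_filter_contacts` and `required` are illustrative names):

```python
def post_filter_contacts(contact_buckets, required):
    """Keep contactIds whose nested attribute/value buckets cover
    every required (attribute, value) pair."""
    kept = []
    for contact in contact_buckets:
        found = set()
        for attr in contact["attribute"]["buckets"]:
            for val in attr["attribute_value"]["buckets"]:
                found.add((attr["key"], val["key"]))
        if required <= found:  # all criteria matched
            kept.append(contact["key"])
    return kept

# Shape taken from the aggregation response above.
buckets = [
    {"key": "XXX", "attribute": {"buckets": [
        {"key": "carColor", "attribute_value": {"buckets": [{"key": "blue"}]}},
    ]}},
    {"key": "XYZ", "attribute": {"buckets": [
        {"key": "carColor", "attribute_value": {"buckets": [{"key": "blue"}]}},
        {"key": "gender", "attribute_value": {"buckets": [{"key": "male"}]}},
    ]}},
]
required = {("gender", "male"), ("carColor", "blue")}
print(post_filter_contacts(buckets, required))  # ['XYZ']
```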
Finally, I'd suggest that re-indexing is probably not that big of a deal, especially if your original source already has these fields together in a single document/row. The pros for this approach are:
Querying ES would be cheaper (implementation & computation wise)
Updating the document in ElasticSearch is as simple as indexing it the first time.
In conclusion, the question boils down to this: Is the pain worth it? In other words: is the cost of re-indexing the documents more or less than the cost of writing & maintaining these queries in the first place (plus the query-time computation cost that follows).
PS: And yeah, you can just paste the query part into Kibana & use the data-table visualization to generate the nested-aggregation query. This way, you can also get an idea of when the queries start to become really slow as you increase your nesting level (adding more columns in the data-table does that for you).
adios

We decided to reindex our whole dataset and make all attribute documents children of parent documents that represent our contacts. That way we can query easily without complex aggregations.
We have to specify the min_children parameter on has_child because we want both should clauses to match among a contact's children.
{
  "size": 10,
  "from": 1,
  "query": {
    "has_child": {
      "type": "attributedocument",
      "query": {
        "bool": {
          "should": [
            {
              "bool": {
                "must": [
                  { "match": { "attribute": "gender" } },
                  { "match": { "value": "f" } }
                ]
              }
            },
            {
              "bool": {
                "must": [
                  { "match": { "attribute": "carColor" } },
                  { "match": { "value": "Blue" } }
                ]
              }
            }
          ]
        }
      },
      "min_children": 2
    }
  }
}

Related

Elasticsearch Query with subquery

I'm relatively new to Elasticsearch. I am able to make simple queries in dev tools. I need help converting the following SQL into an ES query:
select c.conversationid from conversations c
where c.conversationid not in
(select s.conversationid from conversations s
where s.type='end' and s.conversationid=c.conversationid)
Index looks like below.
conversationid | type
---------------|------
1              | start
2              | start
1              | end
3              | start
If I execute the above query I will get the following results:
conversationid
--------------
2
3
I have used the following:
Terms aggregation
Bucket Selector
Query
{
"aggs": {
"conversations": {
"terms": {
"field": "conversationid",
"size": 10
},
"aggs": { --> subaggregation where type == end
"types": {
"terms": {
"field": "type.keyword",
"include": [
"end"
],
"size": 10
}
},
"select": { --> select those terms where there is no bucket for "end"
"bucket_selector": {
"buckets_path": {
"path": "types._bucket_count"
},
"script": "params.path==0"
}
}
}
}
}
}
Result
"aggregations" : {
"conversations" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 2,
"doc_count" : 1,
"types" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
},
{
"key" : 3,
"doc_count" : 1,
"types" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
}
]
}
}
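The bucket_selector logic above (keep only conversations that have no "end" document) can be checked against the sample data with a small Python sketch:

```python
# (conversationid, type) rows from the sample index.
rows = [(1, "start"), (2, "start"), (1, "end"), (3, "start")]

# Conversations that have at least one "end" document.
ended = {cid for cid, typ in rows if typ == "end"}

# Everything else -- what the empty "types" buckets select.
still_open = sorted({cid for cid, _ in rows} - ended)
print(still_open)  # [2, 3]
```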

ELASTICSEARCH - Total doc_count aggregations

I am looking for a way to sum up the total of an aggregation that I have defined in the query.
For example:
{
"name" : false,
"surname" : false
},
{
"name" : false,
"surname" : false
}
Query:
GET index/_search?size=0
{"query": {
"bool": {
"must": [
{"term": {"name": false}},
{"term": {"surname": false}}
]
}
},
"aggs": {
"name": {
"terms": {
"field": "name"
}
},
"surname": {
"terms": {
"field": "surname"
}
}
}
}
The query returns, for each of the fields "name" and "surname", the bucket with value "false".
"aggregations" : {
"name" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 0,
"key_as_string" : "false",
"doc_count" : 2 <---------
}
]
},
"surname" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 0,
"key_as_string" : "false",
"doc_count" : 2 <---------
}
]
}
}
}
Is it possible to return the total sum of doc_count, so that in this situation it would be "doc_count" : 2 + "doc_count" : 2 == 4?
I've been trying to do it with a script, but since they are boolean values it doesn't work.
The functionality that most closely resembles the solution I am looking for is sum_bucket.
GET index/_search?filter_path=aggregations
{
"aggs": {
"surname_field": {
"terms": {
"field": "surname",
"size": 1
}
},
"sum": {
"sum_bucket" : {
"buckets_path":"surname_field>_count"
}
}
}
}
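Equivalently, the doc_count values can be summed client-side across both terms aggregations, which a single sum_bucket cannot do in one shot (it targets one buckets_path). A minimal Python sketch over the parsed response:

```python
def total_doc_count(aggregations):
    """Sum doc_count across the buckets of every top-level terms agg."""
    return sum(bucket["doc_count"]
               for agg in aggregations.values()
               for bucket in agg.get("buckets", []))

# Shape taken from the response above: 2 for "name", 2 for "surname".
aggregations = {
    "name": {"buckets": [{"key_as_string": "false", "doc_count": 2}]},
    "surname": {"buckets": [{"key_as_string": "false", "doc_count": 2}]},
}
print(total_doc_count(aggregations))  # 4
```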
For this specific case, with such simple documents, the result of the query is the same as hits.total.value (the number of documents) when filtering on the boolean fields surname:false or name:false.
But for documents with more fields, this approach can count the number of times each value occurs across our database.
With this result I wanted to find the total number of field hits, not the number of documents in the result.

Perform a pipeline aggregation over the full set of potential buckets

When using the _search API of Elasticsearch, if you set size to 10, and perform an avg metric aggregation, the average will be of all values across the dataset matching the query, not just the average of the 10 items returned in the hits array.
On the other hand, if you perform a terms aggregation and set the size of the terms aggregation to be 10, then performing an avg_buckets aggregation on those terms buckets will calculate an average over only those 10 buckets - not all potential buckets.
How can I calculate an average of some field across all potential buckets, but still only have 10 items in the buckets array?
To make my question more concrete, consider this example: Suppose that I am a hat maker. Multiple stores carry my hats. I have an Elasticsearch index hat-sales which has one document for each time one of my hats is sold. Included in this document is the price and the store at which the hat was sold.
Here are two examples of the documents I tested this on:
{
"type": "top",
"color": "black",
"price": 19,
"store": "Macy's"
}
{
"type": "fez",
"color": "red",
"price": 94,
"store": "Walmart"
}
If I want to find the average price of all the hats I have sold, I can run this:
GET hat-sales/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"average_hat_price": {
"avg": {
"field": "price"
}
}
}
}
And average_hat_price will be the same whether size is set to 0, 3, or whatever.
OK, now I want to find the top 3 stores which have sold the most number of hats. I also want to compare them with the average number of hats sold at a store. So I want to do something like this:
GET hat-sales/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"by_store": {
"terms": {
"field": "store.keyword",
"size": 3
},
"aggs": {
"sales_count": {
"cardinality": {
"field": "_id"
}
}
}
},
"avg sales at a store": {
"avg_bucket": {
"buckets_path": "by_store>sales_count"
}
}
}
}
which yields a response of
"aggregations" : {
"by_store" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 8,
"buckets" : [
{
"key" : "Macy's",
"doc_count" : 6,
"sales_count" : {
"value" : 6
}
},
{
"key" : "Walmart",
"doc_count" : 5,
"sales_count" : {
"value" : 5
}
},
{
"key" : "Dillard's",
"doc_count" : 3,
"sales_count" : {
"value" : 3
}
}
]
},
"avg sales at a store" : {
"value" : 4.666666666666667
}
}
The problem is that avg sales at a store is calculated over only Macy's, Walmart, and Dillard's. If I want to find the average over all store, I have to set aggs.by_store.terms.size to 65536. (65536 because that is the default maximum number of terms buckets and I do not know a priori how many buckets there may be.) This gives a result of:
"aggregations" : {
"by_store" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Macy's",
"doc_count" : 6,
"sales_count" : {
"value" : 6
}
},
{
"key" : "Walmart",
"doc_count" : 5,
"sales_count" : {
"value" : 5
}
},
{
"key" : "Dillard's",
"doc_count" : 3,
"sales_count" : {
"value" : 3
}
},
{
"key" : "Target",
"doc_count" : 3,
"sales_count" : {
"value" : 3
}
},
{
"key" : "Harrod's",
"doc_count" : 2,
"sales_count" : {
"value" : 2
}
},
{
"key" : "Men's Warehouse",
"doc_count" : 2,
"sales_count" : {
"value" : 2
}
},
{
"key" : "Sears",
"doc_count" : 1,
"sales_count" : {
"value" : 1
}
}
]
},
"avg sales at a store" : {
"value" : 3.142857142857143
}
}
So the average number of hats sold per store is 3.1, not 4.6. But in the buckets array I want to see only the top 3 stores.
You can achieve what you are aiming at without a pipeline aggregation. It sort of cheats the aggregation framework, but it works.
Here is the data setup:
PUT hat_sales
{
"mappings": {
"properties": {
"storename": {
"type": "keyword"
}
}
}
}
POST hat_sales/_bulk?refresh=true
{"index": {}}
{"storename": "foo"}
{"index": {}}
{"storename": "foo"}
{"index": {}}
{"storename": "bar"}
{"index": {}}
{"storename": "baz"}
{"index": {}}
{"storename": "baz"}
{"index": {}}
{"storename": "baz"}
Here is the tricky query:
GET hat_sales/_search?size=0
{
"aggs": {
"stores": {
"terms": {
"field": "storename",
"size": 2
}
},
"average_sales_count": {
"avg_bucket": {
"buckets_path": "stores>_count"
}
},
"cheat": {
"filters": {
"filters": {
"all": {
"exists": {
"field": "storename"
}
}
}
},
"aggs": {
"count": {
"value_count": {
"field": "storename"
}
},
"unique_count": {
"cardinality": {
"field": "storename"
}
},
"total_average": {
"bucket_script": {
"buckets_path": {
"total": "count",
"unique": "unique_count"
},
"script": "params.total / params.unique"
}
}
}
}
}
}
This is a small abuse of the aggs framework. But the idea is that you effectively want num_docs / num_stores. I restricted num_docs to only docs that actually have the storename field.
I got around some validations by using the filters agg which is technically a multi-bucket agg (though I only care about one bucket).
Then I get the unique count through cardinality (num stores) and the total count (value_count) and use a bucket_script to finish it off.
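The arithmetic the bucket_script performs is just total documents divided by distinct stores. A quick Python sketch with the bulk-indexed sample data:

```python
# The six sample docs: two "foo", one "bar", three "baz".
stores = ["foo", "foo", "bar", "baz", "baz", "baz"]

total = len(stores)        # what value_count returns
unique = len(set(stores))  # what cardinality returns
print(total / unique)      # 2.0 -- the true per-store average
```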
All in all, here is the slightly mangled result :D
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"cheat" : {
"buckets" : {
"all" : {
"doc_count" : 6,
"count" : {
"value" : 6
},
"unique_count" : {
"value" : 3
},
"total_average" : {
"value" : 2.0
}
}
}
},
"stores" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 1,
"buckets" : [
{
"key" : "baz",
"doc_count" : 3
},
{
"key" : "foo",
"doc_count" : 2
}
]
},
"average_sales_count" : {
"value" : 2.5
}
}
}
Note that cheat.buckets.all.total_average is 2.0 (the true average) while the old way (pipeline average) gives the non-global average of 2.5.

Elasticsearch Terms Aggregation - for dynamic keys of an object

Documents Structure
Doc_1 {
"title":"hello",
"myObject":{
"key1":"value1",
"key2":"value2"
}
}
Doc_2 {
"title":"hello world",
"myObject":{
"key2":"value4",
"key3":"value3"
}
}
Doc_3 {
"title":"hello world2",
"myObject":{
"key1":"value1",
"key3":"value3"
}
}
Information: myObject contains dynamic key-value pair.
Objective: My objective is to write an aggregation query that returns the counts of all unique dynamic key-value pairs.
Attempt and explanation: I can easily get results for known keys in this way.
{
"size":0,
"query":{
"match":{"title":"hello"}
},
"aggs":{
"key1Agg":{
"terms":{"field":"myObject.key1.keyword"}
},
"key2Agg":{
"terms":{"field":"myObject.key2.keyword"}
},
"key3Agg":{
"terms":{"field":"myObject.key3.keyword"}
}
}
}
This is the typical result of the above hardcoded nested keys aggregation.
{
...
"aggregations": {
"key1Agg": {
...
"buckets": [
{
"key": "value1",
"doc_count": 2
}
]
},
"key2Agg": {
...
"buckets": [
{
"key": "value2",
"doc_count": 1
},
{
"key": "value4",
"doc_count": 1
}
]
},
"key3Agg": {
...
"buckets": [
{
"key": "value3",
"doc_count": 2
}
]
}
}
}
Now all I want is to return the count of all dynamic key-value pairs, i.e. without putting any hardcoded key names in the aggregation query.
I am using ES 6.3. Thanks in advance!
From the information you have provided, it appears that myObject seems to be of object datatype and not nested datatype.
Well, there is no easy way to do this without modifying the data you have. What you can do, and possibly the simplest solution, would be to include an additional field, say myObject_list, of type keyword, so that the documents would be as follows:
Sample Documents:
POST test_index/_doc/1
{
"title":"hello",
"myObject":{
"key1":"value1",
"key2":"value2"
},
"myObject_list": ["key1_value1", "key2_value2"] <--- Note this
}
POST test_index/_doc/2
{
"title":"hello world",
"myObject":{
"key2":"value4",
"key3":"value3"
},
"myObject_list": ["key2_value4", "key3_value3"] <--- Note this
}
POST test_index/_doc/3
{
"title":"hello world2",
"myObject":{
"key1":"value1",
"key3":"value3"
},
"myObject_list": ["key1_value1", "key3_value3"] <--- Note this
}
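The extra field can be derived from the object at index time. A minimal Python sketch (to_kv_list is an illustrative helper, not part of any library):

```python
def to_kv_list(my_object):
    """Flatten a dynamic key/value object into 'key_value' strings
    suitable for a keyword field."""
    return [f"{key}_{value}" for key, value in my_object.items()]

# Build the extra field before indexing the document.
doc = {"title": "hello", "myObject": {"key1": "value1", "key2": "value2"}}
doc["myObject_list"] = to_kv_list(doc["myObject"])
print(doc["myObject_list"])  # ['key1_value1', 'key2_value2']
```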
You can have a query as simple as below:
Request Query:
POST test_index/_search
{
"size": 0,
"aggs": {
"key_value_aggregation": {
"terms": {
"field": "myObject_list", <--- Make sure this is of keyword type
"size": 10
}
}
}
}
Note that I've used a terms aggregation here.
Response:
{
"took" : 406,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"key_value_aggregation" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "key1_value1",
"doc_count" : 2
},
{
"key" : "key3_value3",
"doc_count" : 2
},
{
"key" : "key2_value2",
"doc_count" : 1
},
{
"key" : "key2_value4",
"doc_count" : 1
}
]
}
}
}
Hope this helps!

Elasticsearch, how to return unique values of two fields

I have an index with 20 different fields. I need to be able to pull unique docs where combination of fields "cat" and "sub" are unique.
In SQL it would look this way: select unique cat, sub from table A;
I can do it for one field this way:
{
  "size": 0,
  "aggs": {
    "unique_set": {
      "terms": { "field": "cat" }
    }
  }
}
but how do I add another field to check uniqueness across two fields?
Thanks,
SQL's SELECT DISTINCT [cat], [sub] can be imitated with a Composite Aggregation.
{
"size": 0,
"aggs": {
"cat_sub": {
"composite": {
"sources": [
{ "cat": { "terms": { "field": "cat" } } },
{ "sub": { "terms": { "field": "sub" } } }
]
}
}
}
}
Returns...
"buckets" : [
{
"key" : {
"cat" : "a",
"sub" : "x"
},
"doc_count" : 1
},
{
"key" : {
"cat" : "a",
"sub" : "y"
},
"doc_count" : 2
},
{
"key" : {
"cat" : "b",
"sub" : "y"
},
"doc_count" : 3
}
]
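What the composite aggregation computes is a grouped distinct over the pair: one bucket per (cat, sub) combination, with its document count. A Python sketch with illustrative rows that reproduce the buckets above:

```python
from collections import Counter

# Illustrative (cat, sub) rows matching the bucket counts above.
rows = [("a", "x"), ("a", "y"), ("a", "y"),
        ("b", "y"), ("b", "y"), ("b", "y")]

# One bucket per distinct pair, ordered by key like composite does.
buckets = [{"key": {"cat": cat, "sub": sub}, "doc_count": n}
           for (cat, sub), n in sorted(Counter(rows).items())]
print(buckets)
```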
The only way to solve this is probably nested aggregations:
{
  "size": 0,
  "aggs": {
    "unique_set_1": {
      "terms": { "field": "cat" },
      "aggregations": {
        "unique_set_2": {
          "terms": { "field": "sub" }
        }
      }
    }
  }
}
Quote:
I need to be able to pull unique docs where combination of fields "cat" and "sub" are unique.
This is ambiguous as stated. You can have tens of unique pairs {cat, sub}, hundreds of unique triplets {cat, sub, field_3}, and thousands of unique documents Doc{cat, sub, field3, field4, ...}.
If you are interested in document counts per unique pair {"Category X", "Subcategory Y"}, then you can use a cardinality aggregation. For two or more fields you will need to use scripting, which comes with a performance hit.
Example:
{
  "aggs": {
    "multi_field_cardinality": {
      "cardinality": {
        "script": "doc['cat'].value + ' _my_custom_separator_ ' + doc['sub'].value"
      }
    }
  }
}
Alternative solution: use nested terms aggregations.
