Elasticsearch, how to return unique values of two fields

I have an index with 20 different fields. I need to be able to pull the documents where the combination of the fields "cat" and "sub" is unique.
In SQL it would look like this: SELECT DISTINCT cat, sub FROM tableA;
I can do it for one field this way:
{
  "size": 0,
  "aggs": {
    "unique_set": {
      "terms": { "field": "cat" }
    }
  }
}
but how do I add another field to check uniqueness across two fields?
Thanks,

SQL's SELECT DISTINCT [cat], [sub] can be imitated with a Composite Aggregation.
{
  "size": 0,
  "aggs": {
    "cat_sub": {
      "composite": {
        "sources": [
          { "cat": { "terms": { "field": "cat" } } },
          { "sub": { "terms": { "field": "sub" } } }
        ]
      }
    }
  }
}
Returns...
"buckets" : [
{
"key" : {
"cat" : "a",
"sub" : "x"
},
"doc_count" : 1
},
{
"key" : {
"cat" : "a",
"sub" : "y"
},
"doc_count" : 2
},
{
"key" : {
"cat" : "b",
"sub" : "y"
},
"doc_count" : 3
}
]
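Note that a composite aggregation returns its buckets in pages. To walk all unique pairs you pass the after_key from the previous response back in an "after" clause and repeat until no after_key is returned. A minimal sketch, reusing the last key from the result above:
{
  "size": 0,
  "aggs": {
    "cat_sub": {
      "composite": {
        "sources": [
          { "cat": { "terms": { "field": "cat" } } },
          { "sub": { "terms": { "field": "sub" } } }
        ],
        "after": { "cat": "b", "sub": "y" }
      }
    }
  }
}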

The only way to solve this is probably nested aggregations:
{
  "size": 0,
  "aggs": {
    "unique_set_1": {
      "terms": {
        "field": "cat"
      },
      "aggregations": {
        "unique_set_2": {
          "terms": { "field": "sub" }
        }
      }
    }
  }
}

Quote:
I need to be able to pull unique docs where combination of fields "cat" and "sub" are unique.
As stated, this is ambiguous. You can have tens of unique pairs {cat, sub}, hundreds of unique triplets {cat, sub, field_3}, and thousands of unique documents Doc{cat, sub, field3, field4, ...}.
If you are interested in document counts per unique pair {"Category X", "Subcategory Y"}, then you can use a cardinality aggregation. For two or more fields you will need to use scripting, which comes with a performance hit.
Example:
{
  "aggs": {
    "multi_field_cardinality": {
      "cardinality": {
        "script": "doc['cat'].value + ' _my_custom_separator_ ' + doc['sub'].value"
      }
    }
  }
}
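On recent Elasticsearch versions, where Painless is the default script language, the same idea would be written with an explicit script object. A sketch, assuming cat and sub are keyword fields and using an arbitrary separator:
{
  "aggs": {
    "multi_field_cardinality": {
      "cardinality": {
        "script": {
          "source": "doc['cat'].value + '|' + doc['sub'].value"
        }
      }
    }
  }
}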
Alternate solution: use nested terms aggregations, as shown above.

Related

How to query distinct count distribution in elasticsearch

A cardinality aggregation calculates an approximate count of distinct values. How can we calculate the distribution of distinct counts across documents?
For example suppose we have:
a,a,a,b,b,b,c,c,d,d,e
and the distinct count distribution is:
3: 2 # count of distinct elements that occur 3 times (a, b)
2: 2 # c, d
1: 1 # e
Actually, you cannot do this with aggregations alone.
But using the transform API (https://www.elastic.co/guide/en/elasticsearch/reference/current/transform-examples.html) you can create a new index and then run a simple terms aggregation against it:
PUT _transform/so
{
  "dest": {
    "index": "my-so"
  },
  "source": {
    "index": "my-index"
  },
  "pivot": {
    "group_by": {
      "country": {
        "terms": {
          "field": "letter"
        }
      }
    },
    "aggregations": {
      "cardinality": {
        "value_count": {
          "field": "letter"
        }
      }
    }
  }
}
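Note that creating a transform does not populate the destination index by itself; the transform has to be started first:
POST _transform/so/_start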
This will give you:
[
  {
    "country" : "a",
    "cardinality" : 22
  },
  {
    "country" : "b",
    "cardinality" : 4
  },
  {
    "country" : "c",
    "cardinality" : 5049
  }...
Then, you can use a simple terms or histogram aggregation:
GET /my-so/_search
{
  "size": 0,
  "aggs": {
    "cc": {
      "terms": {
        "field": "cardinality"
      }
    }
  }
}
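If the counts spread over many distinct values, the histogram variant mentioned above may read better. A sketch with a bucket interval of 1:
GET /my-so/_search
{
  "size": 0,
  "aggs": {
    "cc": {
      "histogram": {
        "field": "cardinality",
        "interval": 1,
        "min_doc_count": 1
      }
    }
  }
}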

Querying after aggregation in elasticsearch?

We have a database of contacts outside of Elasticsearch. Each of these contacts has many dynamic attributes (gender:male, yearOfBirth:1985, carColor:blue, etc.).
We wanted to integrate Elasticsearch into our setup and we decided to index by attributes for scalability. So an example document in Elasticsearch would look like this:
{
  "contactId": "XYZ",
  "attribute": "gender",
  "value": "male"
}
That way, we can add unlimited attributes for any contacts without having to reindex any documents.
Our problem comes when it's time to search within those documents. We want to be able to list contacts by passing attribute criteria to Elasticsearch (i.e. list contacts that are male AND have blue cars AND so on).
So we would want to do something like:
Aggregate documents by contactId
Write the query for the attribute needs
Paginate the results
We came up with something like this.
{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "must": [
              { "match": { "attribute": "gender" } },
              { "match": { "value": "f" } }
            ]
          }
        },
        {
          "bool": {
            "must": [
              { "match": { "attribute": "carColor" } },
              { "match": { "value": "blue" } }
            ]
          }
        }
      ],
      "minimum_should_match": 2
    }
  },
  "aggs": {
    "contacts": {
      "composite": {
        "size": 15,
        "sources": [
          {
            "contactId": {
              "terms": {
                "field": "contactId"
              }
            }
          }
        ]
      }
    }
  }
}
But we can't really get to the result we want.
Does anyone have an idea of what we're doing wrong and/or how we could improve this query?
Thanks a lot!
The main issue you're having is that Elasticsearch can't join data. I imagine your query doesn't match anything because Elasticsearch applies the criteria to each document one by one. Since the attribute-value pairs are stored in individual documents, a multi-criteria query like the one below can't match any single document:
(attribute:gender AND value:male) AND (attribute:carColor AND value:blue)
Going forward, one option is to get all the documents that match the given criteria and then aggregate. I guess this is what you're trying to achieve.
Assuming the data:
attribute      | contactId     | value
---------------+---------------+---------------
gender         | XYZ           | male
carColor       | XYZ           | blue
gender         | ABS           | female
pet            | ABS           | tiger
carColor       | ABS           | red
pet            | XYZ           | dog
gender         | XXX           | female
carColor       | XXX           | blue
With the following query:
{
  "size": 0,
  "query": {
    "query_string": {
      "query": "(attribute:gender AND value:male) OR (attribute:carColor AND value:blue)" // Let's read these as (C1) OR (C2)
    }
  },
  "aggs": {
    "contacts": {
      "terms": {
        "field": "contactId",
        "order": {
          "_count": "asc"
        },
        "size": 5
      },
      "aggs": {
        "attribute": {
          "terms": {
            "field": "attribute",
            "order": {
              "_count": "asc"
            },
            "size": 5
          },
          "aggs": {
            "attribute_value": {
              "terms": {
                "field": "value",
                "order": {
                  "_count": "asc"
                },
                "size": 5
              }
            }
          }
        }
      }
    }
  }
}
Results in:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "contacts" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "XXX",
          "doc_count" : 1,
          "attribute" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "carColor",
                "doc_count" : 1,
                "attribute_value" : {
                  "doc_count_error_upper_bound" : 0,
                  "sum_other_doc_count" : 0,
                  "buckets" : [
                    {
                      "key" : "blue",
                      "doc_count" : 1
                    }
                  ]
                }
              }
            ]
          }
        },
        {
          "key" : "XYZ",
          "doc_count" : 2,
          "attribute" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "carColor",
                "doc_count" : 1,
                "attribute_value" : {
                  "doc_count_error_upper_bound" : 0,
                  "sum_other_doc_count" : 0,
                  "buckets" : [
                    {
                      "key" : "blue",
                      "doc_count" : 1
                    }
                  ]
                }
              },
              {
                "key" : "gender",
                "doc_count" : 1,
                "attribute_value" : {
                  "doc_count_error_upper_bound" : 0,
                  "sum_other_doc_count" : 0,
                  "buckets" : [
                    {
                      "key" : "male",
                      "doc_count" : 1
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  }
}
This way, you can get a top-level grouping by contactId. The caveats are:
The nested aggregations are needed to join the data (perhaps the only way to join data in ES dynamically).
Result buckets will contain any grouping that can be made from records matching either (C1) or (C2). The buckets then need to be post-filtered to remove those that don't match all the given criteria (in the above example, buckets with a single match need to be dropped); see the sketch after this list.
For N criteria, the nesting will have a depth of N and will become painfully slow for N > 10.
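The post-filtering can also be pushed into the request itself with a bucket_selector pipeline aggregation. A sketch, assuming each attribute occurs at most once per contact, so a contact matching both (C1) and (C2) contributes exactly 2 documents to its bucket:
{
  "size": 0,
  "query": {
    "query_string": {
      "query": "(attribute:gender AND value:male) OR (attribute:carColor AND value:blue)"
    }
  },
  "aggs": {
    "contacts": {
      "terms": { "field": "contactId" },
      "aggs": {
        "all_criteria_matched": {
          "bucket_selector": {
            "buckets_path": { "matched": "_count" },
            "script": "params.matched == 2"
          }
        }
      }
    }
  }
}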
Finally, I'd suggest that re-indexing is probably not that big of a deal, especially if your original source already has these fields together in a single document/row. The pros of this approach are:
Querying ES would be cheaper (implementation- and computation-wise).
Updating a document in Elasticsearch is as simple as indexing it the first time.
In conclusion, the question boils down to this: Is the pain worth it? In other words: is the cost of re-indexing the documents more or less than the cost of writing & maintaining these queries in the first place (plus the query-time computation cost that follows).
PS: And yeah, you can just paste the query part into Kibana & use the data-table visualization to generate the nested-aggregation query. This way, you can also get an idea of when the queries start to become really slow as you increase your nesting level (adding more columns in the data-table does that for you).
adios
We decided to reindex our whole dataset and make all attribute documents children of documents that represent our contacts. That way we can do easy querying without complex aggregations.
We have to specify the min_children parameter on has_child because we want both should clauses to match within a contact.
{
  "size": 10,
  "from": 1,
  "query": {
    "has_child": {
      "type": "attributedocument",
      "query": {
        "bool": {
          "should": [
            {
              "bool": {
                "must": [
                  { "match": { "attribute": "gender" } },
                  { "match": { "value": "f" } }
                ]
              }
            },
            {
              "bool": {
                "must": [
                  { "match": { "attribute": "carColor" } },
                  { "match": { "value": "Blue" } }
                ]
              }
            }
          ]
        }
      },
      "min_children": 2
    }
  }
}
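For reference, has_child requires a join field in the mapping. A minimal sketch of what such a mapping might look like; the index name my-contacts, the field name relation, and the parent type contact are assumptions here, with the child side matching the attributedocument type used above:
PUT my-contacts
{
  "mappings": {
    "properties": {
      "attribute": { "type": "keyword" },
      "value": { "type": "keyword" },
      "relation": {
        "type": "join",
        "relations": { "contact": "attributedocument" }
      }
    }
  }
}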

Elasticsearch: sort terms aggregation buckets by non-key column

Data
I have objects persisted in an ES index. Each of them has myKey and myName string fields (persisted as keyword fields). There is no guarantee that myName will always be the same for the same myKey. E.g. the first two of the following entries share the same myKey but have different myName values:
{
  "myKey": "123asd",
  "myName": "United States",
  ...
},
{
  "myKey": "123asd",
  "myName": "United States of America",
  ...
},
{
  "myKey": "456fgh",
  "myName": "United Kingdom",
  ...
}
Challenge
I need to select and return all distinct myKey values, find and display the most likely myName (most occurrences within the context of myKey), AND sort the resulting buckets by myName.
So far I managed the following:
Select the distinct myKey values by using a terms aggregation.
Select the corresponding first myName value for each myKey by using a top_hits aggregation.
Sort by myKey using the order clause of the terms aggregation.
This is the code of the aggregation:
"aggs": {
"distinct": {
"terms": {
"field": "myKey",
"order": {
"_key": "desc" <----- this sorts the buckets by myKey
}
},
"aggs": {
"tops": {
"top_hits": {
"size": 1,
"_source": {
"includes": ["myName"]
}
}
}
}
}
I read up on the ES documentation explaining how one can sort by a second aggregation returning a single metric. That appears to address numeric fields only, though; myName is not numeric.
Is there a way to sort the buckets in ES by myName?
Any help greatly appreciated.
Edit on 2 Sept 2020
At the request of user joe, the current and the expected results are as follows.
Current result
As is apparent, the sorting of the buckets is based on the key: 123asd comes before 456fgh:
"aggregations" : {
"distinct" : {
"buckets" : [
{
"key" : "123asd",
"tops" : {
"hits" : {
"hits" : [
{
"_source" : {
"myName" : "United States"
}
}
]
}
}
},
{
"key" : "456fgh",
"tops" : {
"hits" : {
"hits" : [
{
"_source" : {
"myName" : "United Kingdom"
}
}
]
}
}
}
]
}
}
Expected result
The task is to sort the buckets based on the extra selected field myName: United Kingdom comes before United States:
"aggregations" : {
"distinct" : {
"buckets" : [
{
"key" : "456fgh",
"tops" : {
"hits" : {
"hits" : [
{
"_source" : {
"myName" : "United Kingdom"
}
}
]
}
}
},
{
"key" : "123asd",
"tops" : {
"hits" : {
"hits" : [
{
"_source" : {
"myName" : "United States"
}
}
]
}
}
}
]
}
}
By ordering on _key: desc, you've only ordered the parent agg by its keys...
Have you tried the following, which looks for the most frequent myNames under a given myKey?
{
  "size": 0,
  "aggs": {
    "by_key": {
      "terms": {
        "field": "myKey",
        "order": {
          "_key": "desc"
        }
      },
      "aggs": {
        "by_name": {
          "terms": {
            "field": "myName",
            "order": {
              "_count": "desc"
            }
          }
        }
      }
    }
  }
}
Or are you looking to sort the parent myKey agg by the result of the child myName agg?
EDIT
Sorting a parent agg by the result of a multi-bucket child aggregation results in the following error:
Buckets can only be sorted on a sub-aggregator path that is built out
of zero or more single-bucket aggregations within the path and a final
single-bucket or a metrics aggregation at the path end.
In other words, what you're trying to achieve is not possible, and the error message above explains why.
Had your child aggregation been numeric (or single-bucket), it would've been possible; see the illustrative sketch below.
For now your only option appears to be post-processing (or rather post-sorting) the current response in the frontend (or wherever you're consuming these aggs).
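For illustration only: had there been a numeric companion field (say a hypothetical myNameRank reflecting the alphabetical order of myName), ordering the parent terms agg by a single-value metric sub-aggregation would be allowed:
{
  "size": 0,
  "aggs": {
    "distinct": {
      "terms": {
        "field": "myKey",
        "order": { "name_rank": "asc" }
      },
      "aggs": {
        "name_rank": {
          "max": { "field": "myNameRank" }
        },
        "tops": {
          "top_hits": {
            "size": 1,
            "_source": { "includes": ["myName"] }
          }
        }
      }
    }
  }
}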

combine output of a first filter as input of a second filter

We have an elasticsearch instance whose entries have two tagged fields:
sessionid
message
In a first filter, I find all entries where the message contains a certain substring. Each of those entries contains a sessionid.
In a second filter, I want to find all messages where the sessionid matches one of the sessionids returned by the first filter. This filter should go through all entries a second time.
Example, in the log below (sessionid;message)
1234;miss 1
2456;miss 2
1234;match
When filtering for the string "match" in the message part, I would get as output of the combined query:
1234;miss 1
1234;match
We are using KQL.
Background: We want an easy way to follow complete flows with an error-string in a message, in a multithreaded environment.
I understand why you'd want to do that in one go, but it's not possible in Elasticsearch. You cannot "revisit" documents which you've already ruled out by a different query -- searching for match would disqualify all the misses.
It's unfortunate you have the log message combined with the ID but you can try this:
Find all that match match (pun intended) -- I'm assuming you do have a keyword field available
GET your_index/_search
{
  "query": {
    "regexp": {
      "separated_msg.keyword": ".*\\;match.*"
    }
  }
}
Post-process the hits and extract the session IDs
Run session ID matching:
GET your_index/_search
{
  "query": {
    "regexp": {
      "separated_msg.keyword": "1234;.*"
    }
  }
}
or on multiple IDs using a bool should:
GET your_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "regexp": {
            "separated_msg.keyword": "1234;.*"
          }
        },
        {
          "regexp": {
            "separated_msg.keyword": "4567;.*"
          }
        }
      ]
    }
  }
}
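Since the question says sessionid is its own tagged field, the second pass can also be a plain terms query instead of regexps over the combined message. A sketch, assuming a numeric sessionid field:
GET your_index/_search
{
  "query": {
    "terms": {
      "sessionid": [1234, 4567]
    }
  }
}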
If a unique numeric value can be assigned to each message (e.g. 2 for "match", 1 for "miss 1"), then bucket_selector and top_hits can be used.
{
  "size": 0,
  "aggs": {
    "sessionid": {
      "terms": {
        "field": "sessionid", --> first get all unique sessionids
        "size": 10
      },
      "aggs": {
        "documents": {
          "top_hits": {
            "size": 10
          }
        },
        "messageid": {
          "terms": {
            "field": "messageid", --> then get the unique messageids per session
            "size": 10
          },
          "aggs": {
            "matching_messageid": { --> keep only the bucket whose key (messageid) is 2
              "bucket_selector": {
                "buckets_path": {
                  "key": "_key"
                },
                "script": "params.key == 2"
              }
            }
          }
        },
        "my_bucket": {
          "bucket_selector": {
            "buckets_path": {
              "hits": "messageid._bucket_count"
            },
            "script": "params.hits > 0" --> if that bucket is not empty, keep the sessionid
          }
        }
      }
    }
  }
}
Result
"aggregations" : {
"sessionid" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 1234,
"doc_count" : 2,
"documents" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "index31",
"_type" : "_doc",
"_id" : "MTAYpnABheSAx2q_eNEF",
"_score" : 1.0,
"_source" : {
"sessionid" : 1234,
"message" : "miss 1",
"messageid" : 1
}
},
{
"_index" : "index31",
"_type" : "_doc",
"_id" : "MjAYpnABheSAx2q_n9FW",
"_score" : 1.0,
"_source" : {
"sessionid" : 1234,
"message" : "match",
"messageid" : 2
}
}
]
}
},
"messageid" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 2,
"doc_count" : 1
}
]
}
}
]
}
}
If a given message has a distinguishing timestamp, a max or min metric referenced via buckets_path can similarly be used to select buckets containing that message.
The best approach to the above problem would be to use nested documents:
{
  "sessionid": 1234,
  "messages": [
    {
      "message": "match"
    },
    {
      "message": "miss 1"
    }
  ]
}
Then the problem can be resolved with a nested query; a sketch follows. If Logstash is used, the above structure can be generated while indexing.
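A minimal sketch of such a nested query, assuming messages is mapped with "type": "nested". Matching parents come back with all of their messages, which is exactly the "complete flow" the question asks for:
GET your_index/_search
{
  "query": {
    "nested": {
      "path": "messages",
      "query": {
        "match": { "messages.message": "match" }
      }
    }
  }
}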

How to get all unique tags from 2 collections in Elasticsearch?

I have a set of tags stored in document.tags and document.fields.articleTags.
This is how I get all the tags from both namespaces, but how can I get the result merged into one array in the response from ES?
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "tags": {
      "terms": { "field": "tags" }
    },
    "articleTags": {
      "terms": { "field": "articleTags" }
    }
  }
}
Result
I get the tags listed in articleTags.buckets and tags.buckets. Is it possible to have the result delivered in one bucket?
{
  "aggregations": {
    "articleTags": {
      "buckets": [
        {
          "key": "halloween"
        }
      ]
    },
    "tags": {
      "buckets": [
        {
          "key": "news"
        }
Yes, you can, by using a single terms aggregation with a script that "joins" the two arrays (i.e. concatenates them). It goes like this:
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "all_tags": {
      "terms": { "script": "doc.tags.values + doc.articleTags.values" }
    }
  }
}
Note that you need to make sure to enable dynamic scripting in order for this query to work.
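A script-free alternative is to merge the two fields at index time with copy_to, after which a plain terms aggregation on the merged field does the job. A sketch, assuming keyword fields and a hypothetical all_tags destination field:
PUT my-index
{
  "mappings": {
    "properties": {
      "tags": { "type": "keyword", "copy_to": "all_tags" },
      "articleTags": { "type": "keyword", "copy_to": "all_tags" },
      "all_tags": { "type": "keyword" }
    }
  }
}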
