Elastic query to find similar tags in content from different organizations - elasticsearch

I consume content sources from different organizations, which all supply metadata tags. I would like a list of terms, that are supplied by different organizations.
A sample of data in Elasticsearch:
doc1: {
"tags":["tag1", "tag5", "tag6", "tag4"],
"organization" : "A"
}
doc2: {
"tags":["tag1", "tag2", "tag4"],
"organization" : "B"
}
Desired query result:
{
"tag": "tag1",
"organization" : ["A", "B"]
},
{
"tag": "tag4",
"organization" : ["A", "B"]
}
What i got so far
With the suggestion below, i got a list of results containing keywords that are used by one organization, and keywords that are used by different organizations.
To clarify, this a is a part of the result:
{
"key": "someKeyWord",
"doc_count": 66,
"organization_list": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Organization A",
"doc_count": 62
},
{
"key": "Organization B",
"doc_count": 4
}
]
}
},
{
"key": "someOtherKeyword",
"doc_count": 62,
"organization_list": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Organization A",
"doc_count": 62
}
]
}
}
Now i only want the first result, which has two buckets from the organization_list aggregation. Because that keyword is used by two different organizations.
I tried like this:
"number_buckets_filter": {
"bucket_selector": {
"buckets_path": {
"my_var": "organization_list"
},
"script": "params.my_var > 1"
}
}
But that gets me an exception: "buckets_path must reference either a number value or a single value numeric metric aggregation, got: org.elasticsearch.search.aggregations.bucket.terms.StringTerms"
Is there any way to filter the results? Thanks in advance for any help.
Kind regards,
Oskar uit de Bos

You can use the following query to bucket first on tags and then to sub bucket on organizations
{
"size": 0,
"aggs": {
"tags_list": {
"terms": {
"field": "tags",
"size": 100
},"aggs": {
"organization_list": {
"terms": {
"field": "organization",
"size": 100
}
}
}
}
}
}
mappings
{
"mappings": {
"product": {
"properties": {
"tags": {
"type": "text",
"fielddata": true
},
"organization": {
"type": "text",
"fielddata": true
}
}
}
}
}
Note - make sure the have both tags and organization as not analyzed for aggregations. also set fielddata=true in mappings to avoid heavy memory usages.

Related

Nested array of objects aggregation in Elasticsearch

Documents in the Elasticsearch are indexed as such
Document 1
{
"task_completed": 10
"tagged_object": [
{
"category": "cat",
"count": 10
},
{
"category": "cars",
"count": 20
}
]
}
Document 2
{
"task_completed": 50
"tagged_object": [
{
"category": "cars",
"count": 100
},
{
"category": "dog",
"count": 5
}
]
}
As you can see that the value of the category key is dynamic in nature. I want to perform a similar aggregation like in SQL with the group by category and return the sum of the count of each category.
In the above example, the aggregation should return
cat: 10,
cars: 120 and
dog: 5
Wanted to know how to write this aggregation query in Elasticsearch if it is possible. Thanks in advance.
You can achieve your required result, using nested, terms, and sum aggregation.
Adding a working example with index mapping, search query and search result
Index Mapping:
{
"mappings": {
"properties": {
"tagged_object": {
"type": "nested"
}
}
}
}
Search Query:
{
"size": 0,
"aggs": {
"resellers": {
"nested": {
"path": "tagged_object"
},
"aggs": {
"books": {
"terms": {
"field": "tagged_object.category.keyword"
},
"aggs":{
"sum_of_count":{
"sum":{
"field":"tagged_object.count"
}
}
}
}
}
}
}
}
Search Result:
"aggregations": {
"resellers": {
"doc_count": 4,
"books": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "cars",
"doc_count": 2,
"sum_of_count": {
"value": 120.0
}
},
{
"key": "cat",
"doc_count": 1,
"sum_of_count": {
"value": 10.0
}
},
{
"key": "dog",
"doc_count": 1,
"sum_of_count": {
"value": 5.0
}
}
]
}
}
}

Elasticsearch: Aggregate distinct values in array

I am using Elasticsearch to store click traffic and each row includes topics of the page which has been visited. A typical row looks like:
{
"date": "2017-09-10T12:26:53.998Z",
"pageid": "10263779",
"loc_ll": [
-73.6487,
45.4671
],
"ua_type": "Computer",
"topics": [
"Trains",
"Planes",
"Electric Cars"
]
}
I want each topics to be a keyword so if I search for cars nothing will be returned. Only Electric Cars would return a result.
I also want to run a distinct query on all topics in all rows so I have a list of all topics used.
Doing this on a pageid would look like like the following, but I am unsure how to approach this for the topics array.
{
"aggs": {
"ids": {
"terms": {
"field": pageid,
"size": 10
}
}
}
}
Your approach to querying and getting the available terms looks fine. Probably you should check your mapping. If you get results for cars this looks as your mapping for topics is an analyzed string (e.g. type text instead of keyword). So please check your mapping for this field.
PUT keywordarray
{
"mappings": {
"item": {
"properties": {
"id": {
"type": "integer"
},
"topics": {
"type": "keyword"
}
}
}
}
}
With this sample data
POST keywordarray/item
{
"id": 123,
"topics": [
"first topic", "second topic", "another"
]
}
and this aggregation:
GET keywordarray/item/_search
{
"size": 0,
"aggs": {
"topics": {
"terms": {
"field": "topics"
}
}
}
}
will result in this:
"aggregations": {
"topics": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "another",
"doc_count": 1
},
{
"key": "first topic",
"doc_count": 1
},
{
"key": "second topic",
"doc_count": 1
}
]
}
}
It is very therapeutic asking on SO. Simply changing the mapping type to keyword allowed me to achieve what I needed.
A part of me thought that it would concatenate the array into a string. But it doesn't
{
"mappings": {
"view": {
"properties": {
"topics": {
"type": "keyword"
},...
}
}
}
}
and a search query like
{
"aggs": {
"ids": {
"terms": {
"field": pageid,
"size": 10
}
}
}
}
Will return a distinct list of all elements in a fields array.

Aggregating with multiple fields returned in ElasticSearch

Suppose I have a relative simple index with the following fields...
"testdata": {
"properties": {
"code": {
"type": "integer"
},
"name": {
"type": "string"
},
"year": {
"type": "integer"
},
"value": {
"type": "integer"
}
}
}
I can write a query to get the total sum of the values aggregated by the code like so:
{
"from":0,
"size":0,
"aggs": {
"by_code": {
"terms": {
"field": "code"
},
"aggs": {
"total_value": {
"sum": {
"field": "value"
}
}
}
}
}
}
And this returns the following (abridged) results:
"aggregations": {
"by_code": {
"doc_count_error_upper_bound": 478,
"sum_other_doc_count": 328116,
"buckets": [
{
"key": 236948,
"doc_count": 739,
"total_value": {
"value": 12537
}
},
However, this data is being fed to a web front-end, where it is required both the code and the name is displayed. So, the question is, is it possible to amend the query somehow to also return the name field, as well as the code field, in the results?
So, for example, the results can look a bit like this:
"aggregations": {
"by_code": {
"doc_count_error_upper_bound": 478,
"sum_other_doc_count": 328116,
"buckets": [
{
"key": 236948,
"code": 236948,
"name": "Test Name",
"doc_count": 739,
"total_value": {
"value": 12537
}
},
I've read up on sub-aggregations, but in this case there is a one-to-one relationship between code and name (so, you wouldn't have different names for the same key). Also, in my real case, there are 5 other fields, like description, that I would like to return, so I am wondering if there was another way to do it.
In SQL (from which this data originally came from before it was swapped to ElasticSearch) I would write the following query
SELECT Code, Name, SUM(Value) AS Total_Value
FROM [TestData]
GROUP BY Code, Name
You can achieve this using scripting, i.e. instead of specifying a field, you specify a combination of fields:
{
"from":0,
"size":0,
"aggs": {
"by_code": {
"terms": {
"script": "[doc.code.value, doc.name.value].join('-')"
},
"aggs": {
"total_value": {
"sum": {
"field": "value"
}
}
}
}
}
}
note: you need to make sure to enable dynamic scripting for this to work

ElasticSearch 2.1.0 - Deep 'children' aggregation with 'sum' metric returning empty results

I have a hierarchy of document types two levels deep. The documents are related by parent-child relationships as follows: category > sub_category > item i.e. each sub_category has a _parent field referring to a category id, and each item has a _parent field referring to a sub_category id.
Each item has a price field. Given a query for categories, which includes conditions for sub-categories and items, I want to calculate a total price for each sub_category.
My query looks something like this:
{
"query": {
"has_child": {
"child_type": "sub_category",
"query": {
"has_child": {
"child_type": "item",
"query": {
"range": {
"price": {
"gte": 100,
"lte": 150
}
}
}
}
}
}
}
}
My aggregation to calculate the price for each sub-category looks like this:
{
"aggs": {
"categories": {
"terms": {
"field": "id"
},
"aggs": {
"sub_categories": {
"children": {
"type": "sub_category"
},
"aggs": {
"sub_category_ids": {
"terms": {
"field": "id"
},
"aggs": {
"items": {
"children": {
"type": "item"
},
"aggs": {
"price": {
"sum": {
"field": "price"
}
}
}
}
}
}
}
}
}
}
}
}
Despite the query response listing matching results, the aggregation response doesn't match any items:
{
"aggregations": {
"categories": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "category1",
"doc_count": 1,
"sub_categories": {
"doc_count": 3,
"sub_category_ids": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "subcat1",
"doc_count": 1,
"items": {
"doc_count": 0,
"price": {
"value": 0
}
}
},
{
"key": "subcat2",
"doc_count": 1,
"items": {
"doc_count": 0,
"price": {
"value": 0
}
}
},
{
"key": "subcat3",
"doc_count": 1,
"items": {
"doc_count": 0,
"price": {
"value": 0
}
}
}
]
}
}
}]
}
}
}
However, omitting the sub_category_ids aggregation does cause the items to appear and for prices to be summed at the level of the categories aggregation. I would expect including the sub_category_ids aggregation to simply change the level at which the prices are summed.
Am I misunderstanding how the aggregation is evaluated, and if so how could I modify it to display the summed prices for each sub-category?
I opened an issue #15413, regarding children aggregation as I and other folks were facing similar issues in ES 2.0
Apparently the problem according to ES developer #martijnvg was that
The children agg makes an assumption (that all segments are being seen by children agg) that was true in 1.x but not in 2.x
PR #15457 fixed this issue, again from #martijnvg
Before we only evaluated segments that yielded matches in parent aggs, which caused us to miss to evaluate child docs in segments we didn't have parent matches for.
The fix for this is stop remember in what segments we have matches for
and simply evaluate all segments. This makes the code simpler and we
can still quickly see if a segment doesn't hold child docs like we did
before
This pull request has been merged and it has also been back ported to the 2.x, 2.1 and 2.0 branches.

How to get an Elasticsearch aggregation with multiple fields

I'm attempting to find related tags to the one currently being viewed. Every document in our index is tagged. Each tag is formed of two parts - an ID and text name:
{
...
meta: {
...
tags: [
{
id: 123,
name: 'Biscuits'
},
{
id: 456,
name: 'Cakes'
},
{
id: 789,
name: 'Breads'
}
]
}
}
To fetch the related tags I am simply querying the documents and getting an aggregate of their tags:
{
"query": {
"bool": {
"must": [
{
"match": {
"item.meta.tags.id": "123"
}
},
{
...
}
]
}
},
"aggs": {
"baked_goods": {
"terms": {
"field": "item.meta.tags.id",
"min_doc_count": 2
}
}
}
}
This works perfectly, I am getting the results I want. However, I require both the tag ID and name to do anything useful. I have explored how to accomplish this, the solutions seem to be:
Combine the fields when indexing
A script to munge together the fields
A nested aggregation
Option one and two are are not available to me so I have been going with 3 but it's not responding in an expected manner. Given the following query (still searching for documents also tagged with 'Biscuits'):
{
...
"aggs": {
"baked_goods": {
"terms": {
"field": "item.meta.tags.id",
"min_doc_count": 2
},
"aggs": {
"name": {
"terms": {
"field": "item.meta.tags.name"
}
}
}
}
}
}
I will get this result:
{
...
"aggregations": {
"baked_goods": {
"buckets": [
{
"key": "456",
"doc_count": 11,
"name": {
"buckets": [
{
"key": "Biscuits",
"doc_count": 11
},
{
"key": "Cakes",
"doc_count": 11
}
]
}
}
]
}
}
}
The nested aggregation includes both the search term and the tag I'm after (returned in alphabetical order).
I have tried to mitigate this by adding an exclude to the nested aggregation but this slowed the query down far too much (around 100 times for 500000 docs). So far the fastest solution is to de-dupe the result manually.
What is the best way to get an aggregation of tags with both the tag ID and tag name in the response?
Thanks for making it this far!
By the looks of it, your tags is not nested.
For this aggregation to work, you need it nested so that there is an association between an id and a name. Without nested the list of ids is just an array and the list of names is another array:
"item": {
"properties": {
"meta": {
"properties": {
"tags": {
"type": "nested", <-- nested field
"include_in_parent": true, <-- to, also, keep the flat array-like structure
"properties": {
"id": {
"type": "integer"
},
"name": {
"type": "string"
}
}
}
}
}
}
}
Also, note that I've added to the mapping this line "include_in_parent": true which means that your nested tags will, also, behave like a "flat" array-like structure.
So, everything you had so far in your queries will still work without any changes to the queries.
But, for this particular query of yours, the aggregation needs to change to something like this:
{
"aggs": {
"baked_goods": {
"nested": {
"path": "item.meta.tags"
},
"aggs": {
"name": {
"terms": {
"field": "item.meta.tags.id"
},
"aggs": {
"name": {
"terms": {
"field": "item.meta.tags.name"
}
}
}
}
}
}
}
}
And the result is like this:
"aggregations": {
"baked_goods": {
"doc_count": 9,
"name": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 123,
"doc_count": 3,
"name": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "biscuits",
"doc_count": 3
}
]
}
},
{
"key": 456,
"doc_count": 2,
"name": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "cakes",
"doc_count": 2
}
]
}
},
.....

Resources