aggregation on fields values (regex) - elasticsearch

I am trying to perform an aggregation to group documents by the first two letters of a specific field value.
I successfully aggregated my documents by a specific field name, but I don't know how to work with the values.
For example, for the docs:
[
{
"name": "John"
},
{
"name": "Jog"
},
{
"name": "James"
},
{
"name": "Robert"
},
{
"name": "Jessica"
}
]
I would like to get the following response:
[
{
"key": "Jo",
"doc_count": 2
},
{
"key": "Ja",
"doc_count": 1
},
{
"key": "Ro",
"doc_count": 1
},
{
"key": "Je",
"doc_count": 1
}
]
Is there an aggregation query able to do that?

You could use a terms aggregation with a script instead of a field, like this:
{
"size": 0,
"aggs": {
"first_two": {
"terms": {
"script": "doc.name.value?.size() >=2 ? doc.name.value?.substring(0, 2) : doc.name.value"
}
}
}
}
Note that if your name fields all have at least two characters, the script could simply be doc.name.value?.substring(0, 2). My script above accounts for single character names.
Also make sure to enable dynamic scripting in order for this to work.
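For newer Elasticsearch versions, where Painless replaced Groovy as the default scripting language, the same idea can be sketched as below. This is a minimal sketch and not part of the original answer: the `name.keyword` sub-field, the parameter names `f` and `n`, and the helper names are assumptions, and the script assumes every document has a value for the field. The second helper only reproduces the grouping locally so the expected buckets can be checked without a cluster.

```python
from collections import Counter


def build_prefix_agg(field="name.keyword", length=2):
    """Build a request body that buckets documents by the first
    `length` characters of `field` (hypothetical field name)."""
    # Painless script: fall back to the whole value when it is shorter
    # than the requested prefix length. Assumes the field is never missing.
    source = (
        "def v = doc[params.f].value; "
        "return v.length() >= params.n ? v.substring(0, params.n) : v;"
    )
    return {
        "size": 0,
        "aggs": {
            "first_two": {
                "terms": {
                    "script": {
                        "lang": "painless",
                        "source": source,
                        "params": {"f": field, "n": length},
                    }
                }
            }
        },
    }


def local_prefix_counts(names, length=2):
    """Reproduce the grouping in plain Python, for checking expectations."""
    return dict(Counter(name[:length] for name in names))


if __name__ == "__main__":
    print(build_prefix_agg())
    print(local_prefix_counts(["John", "Jog", "James", "Robert", "Jessica"]))
```

The dictionary returned by `build_prefix_agg` would be sent as the body of a `_search` request with whatever client you already use.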

Related

Elastic query to find similar tags in content from different organizations

I consume content sources from different organizations, which all supply metadata tags. I would like a list of terms, that are supplied by different organizations.
A sample of data in Elasticsearch:
doc1: {
"tags":["tag1", "tag5", "tag6", "tag4"],
"organization" : "A"
}
doc2: {
"tags":["tag1", "tag2", "tag4"],
"organization" : "B"
}
Desired query result:
{
"tag": "tag1",
"organization" : ["A", "B"]
},
{
"tag": "tag4",
"organization" : ["A", "B"]
}
What I got so far
With the suggestion below, I got a list of results containing keywords that are used by one organization, and keywords that are used by different organizations.
To clarify, this is a part of the result:
{
"key": "someKeyWord",
"doc_count": 66,
"organization_list": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Organization A",
"doc_count": 62
},
{
"key": "Organization B",
"doc_count": 4
}
]
}
},
{
"key": "someOtherKeyword",
"doc_count": 62,
"organization_list": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Organization A",
"doc_count": 62
}
]
}
}
Now I only want the first result, which has two buckets from the organization_list aggregation, because that keyword is used by two different organizations.
I tried like this:
"number_buckets_filter": {
"bucket_selector": {
"buckets_path": {
"my_var": "organization_list"
},
"script": "params.my_var > 1"
}
}
But that gets me an exception: "buckets_path must reference either a number value or a single value numeric metric aggregation, got: org.elasticsearch.search.aggregations.bucket.terms.StringTerms"
Is there any way to filter the results? Thanks in advance for any help.
Kind regards,
Oskar uit de Bos
You can use the following query to bucket first on tags and then sub-bucket on organizations:
{
"size": 0,
"aggs": {
"tags_list": {
"terms": {
"field": "tags",
"size": 100
},
"aggs": {
"organization_list": {
"terms": {
"field": "organization",
"size": 100
}
}
}
}
}
}
Mappings:
{
"mappings": {
"product": {
"properties": {
"tags": {
"type": "text",
"fielddata": true
},
"organization": {
"type": "text",
"fielddata": true
}
}
}
}
}
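Note: make sure both tags and organization are not analyzed, so that the aggregations bucket on whole values rather than on individual tokens. Setting fielddata to true is what allows aggregating on a text field at all, but it loads the field's terms into heap and can use a lot of memory; mapping the fields as keyword avoids that cost.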
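To keep only the tags shared by more than one organization (the part the bucket_selector attempt in the question was aiming for), one option is to point buckets_path at the special `_bucket_count` path of the sub-aggregation instead of at the aggregation itself, since a terms aggregation is not a numeric value. The sketch below builds such a request and also shows a client-side filter over a response that was fetched without the selector. The function names and the `org_count` variable are made up for illustration; the aggregation names follow the query above.

```python
def build_shared_tags_query(min_orgs=2, size=100):
    """Request body: tags -> organizations, keeping only tag buckets
    that contain at least `min_orgs` organization buckets."""
    return {
        "size": 0,
        "aggs": {
            "tags_list": {
                "terms": {"field": "tags", "size": size},
                "aggs": {
                    "organization_list": {
                        "terms": {"field": "organization", "size": size}
                    },
                    "number_buckets_filter": {
                        "bucket_selector": {
                            # _bucket_count yields the number of buckets
                            # produced by the sibling terms aggregation.
                            "buckets_path": {
                                "org_count": "organization_list._bucket_count"
                            },
                            "script": f"params.org_count >= {min_orgs}",
                        }
                    },
                },
            }
        },
    }


def shared_tags(tag_buckets, min_orgs=2):
    """Client-side alternative: filter already-returned tag buckets."""
    result = []
    for bucket in tag_buckets:
        orgs = [o["key"] for o in bucket["organization_list"]["buckets"]]
        if len(orgs) >= min_orgs:
            result.append({"tag": bucket["key"], "organization": orgs})
    return result


if __name__ == "__main__":
    sample = [
        {"key": "someKeyWord", "doc_count": 66, "organization_list": {"buckets": [
            {"key": "Organization A", "doc_count": 62},
            {"key": "Organization B", "doc_count": 4}]}},
        {"key": "someOtherKeyword", "doc_count": 62, "organization_list": {"buckets": [
            {"key": "Organization A", "doc_count": 62}]}},
    ]
    print(shared_tags(sample))
```

Filtering on the server keeps the response small; the client-side version is handy when the unfiltered buckets are needed for something else anyway.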

Aggregating with multiple fields returned in ElasticSearch

Suppose I have a relatively simple index with the following fields...
"testdata": {
"properties": {
"code": {
"type": "integer"
},
"name": {
"type": "string"
},
"year": {
"type": "integer"
},
"value": {
"type": "integer"
}
}
}
I can write a query to get the total sum of the values aggregated by the code like so:
{
"from":0,
"size":0,
"aggs": {
"by_code": {
"terms": {
"field": "code"
},
"aggs": {
"total_value": {
"sum": {
"field": "value"
}
}
}
}
}
}
And this returns the following (abridged) results:
"aggregations": {
"by_code": {
"doc_count_error_upper_bound": 478,
"sum_other_doc_count": 328116,
"buckets": [
{
"key": 236948,
"doc_count": 739,
"total_value": {
"value": 12537
}
},
However, this data is fed to a web front end that must display both the code and the name. So, the question is: is it possible to amend the query to also return the name field, as well as the code field, in the results?
So, for example, the results can look a bit like this:
"aggregations": {
"by_code": {
"doc_count_error_upper_bound": 478,
"sum_other_doc_count": 328116,
"buckets": [
{
"key": 236948,
"code": 236948,
"name": "Test Name",
"doc_count": 739,
"total_value": {
"value": 12537
}
},
I've read up on sub-aggregations, but in this case there is a one-to-one relationship between code and name (so, you wouldn't have different names for the same key). Also, in my real case, there are 5 other fields, like description, that I would like to return, so I am wondering if there was another way to do it.
In SQL (where this data originally lived before it was moved to ElasticSearch) I would write the following query:
SELECT Code, Name, SUM(Value) AS Total_Value
FROM [TestData]
GROUP BY Code, Name
You can achieve this using scripting, i.e. instead of specifying a field, you specify a combination of fields:
{
"from":0,
"size":0,
"aggs": {
"by_code": {
"terms": {
"script": "[doc.code.value, doc.name.value].join('-')"
},
"aggs": {
"total_value": {
"sum": {
"field": "value"
}
}
}
}
}
}
Note: you need to enable dynamic scripting for this to work.
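With the scripted key above, each bucket key comes back as a single string such as "236948-Test Name", so the front end still has to split it into its parts. A minimal sketch of that step follows; the function name is made up, and it assumes the code is a non-negative integer so that the first separator always marks the end of the code, even when the name itself contains a hyphen.

```python
def split_code_name(buckets, sep="-"):
    """Turn buckets keyed by 'code-name' into rows with separate fields."""
    rows = []
    for bucket in buckets:
        # partition splits on the first separator only, so hyphens
        # inside the name are preserved.
        code, _, name = str(bucket["key"]).partition(sep)
        rows.append({
            "code": int(code),
            "name": name,
            "doc_count": bucket["doc_count"],
            "total_value": bucket["total_value"]["value"],
        })
    return rows


if __name__ == "__main__":
    buckets = [
        {"key": "236948-Test Name", "doc_count": 739,
         "total_value": {"value": 12537}},
    ]
    print(split_code_name(buckets))
```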

How to get an Elasticsearch aggregation with multiple fields

I'm attempting to find related tags to the one currently being viewed. Every document in our index is tagged. Each tag is formed of two parts - an ID and text name:
{
...
meta: {
...
tags: [
{
id: 123,
name: 'Biscuits'
},
{
id: 456,
name: 'Cakes'
},
{
id: 789,
name: 'Breads'
}
]
}
}
To fetch the related tags I am simply querying the documents and getting an aggregate of their tags:
{
"query": {
"bool": {
"must": [
{
"match": {
"item.meta.tags.id": "123"
}
},
{
...
}
]
}
},
"aggs": {
"baked_goods": {
"terms": {
"field": "item.meta.tags.id",
"min_doc_count": 2
}
}
}
}
This works perfectly, I am getting the results I want. However, I require both the tag ID and name to do anything useful. I have explored how to accomplish this, the solutions seem to be:
Combine the fields when indexing
A script to munge together the fields
A nested aggregation
Options one and two are not available to me, so I have been going with option three, but it's not responding in the expected manner. Given the following query (still searching for documents also tagged with 'Biscuits'):
{
...
"aggs": {
"baked_goods": {
"terms": {
"field": "item.meta.tags.id",
"min_doc_count": 2
},
"aggs": {
"name": {
"terms": {
"field": "item.meta.tags.name"
}
}
}
}
}
}
I will get this result:
{
...
"aggregations": {
"baked_goods": {
"buckets": [
{
"key": "456",
"doc_count": 11,
"name": {
"buckets": [
{
"key": "Biscuits",
"doc_count": 11
},
{
"key": "Cakes",
"doc_count": 11
}
]
}
}
]
}
}
}
The nested aggregation includes both the search term and the tag I'm after (returned in alphabetical order).
I have tried to mitigate this by adding an exclude to the nested aggregation but this slowed the query down far too much (around 100 times for 500000 docs). So far the fastest solution is to de-dupe the result manually.
What is the best way to get an aggregation of tags with both the tag ID and tag name in the response?
Thanks for making it this far!
By the looks of it, your tags field is not mapped as nested.
For this aggregation to work, you need it nested so that there is an association between an id and a name. Without nested the list of ids is just an array and the list of names is another array:
"item": {
"properties": {
"meta": {
"properties": {
"tags": {
"type": "nested", <-- nested field
"include_in_parent": true, <-- to, also, keep the flat array-like structure
"properties": {
"id": {
"type": "integer"
},
"name": {
"type": "string"
}
}
}
}
}
}
}
Also, note that I've added "include_in_parent": true to the mapping, which means that your nested tags will also behave like a "flat" array-like structure.
So, everything you had so far in your queries will still work without any changes to the queries.
But, for this particular query of yours, the aggregation needs to change to something like this:
{
"aggs": {
"baked_goods": {
"nested": {
"path": "item.meta.tags"
},
"aggs": {
"name": {
"terms": {
"field": "item.meta.tags.id"
},
"aggs": {
"name": {
"terms": {
"field": "item.meta.tags.name"
}
}
}
}
}
}
}
}
And the result is like this:
"aggregations": {
"baked_goods": {
"doc_count": 9,
"name": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 123,
"doc_count": 3,
"name": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "biscuits",
"doc_count": 3
}
]
}
},
{
"key": 456,
"doc_count": 2,
"name": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "cakes",
"doc_count": 2
}
]
}
},
.....
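Once the response has this shape, each id bucket carries exactly the names that belong to it, so pairing them up on the client is straightforward. A small sketch, assuming the aggregation names used above (`baked_goods`, then `name` twice) and a hypothetical helper name:

```python
def tag_pairs(response):
    """Extract (id, name, doc_count) records from the nested aggregation."""
    id_buckets = response["aggregations"]["baked_goods"]["name"]["buckets"]
    pairs = []
    for bucket in id_buckets:
        name_buckets = bucket["name"]["buckets"]
        pairs.append({
            "id": bucket["key"],
            # With a nested mapping each id should map to a single name;
            # take the first bucket and tolerate an empty list.
            "name": name_buckets[0]["key"] if name_buckets else None,
            "doc_count": bucket["doc_count"],
        })
    return pairs


if __name__ == "__main__":
    response = {"aggregations": {"baked_goods": {"doc_count": 9, "name": {"buckets": [
        {"key": 123, "doc_count": 3,
         "name": {"buckets": [{"key": "biscuits", "doc_count": 3}]}},
        {"key": 456, "doc_count": 2,
         "name": {"buckets": [{"key": "cakes", "doc_count": 2}]}},
    ]}}}}
    print(tag_pairs(response))
```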

ElasticSearch aggregation function

Is it possible to define an aggregation function in Elasticsearch?
E.g. for data:
author  weekday  status
me      monday   ok
me      tuesday  ok
me      monday   bad
I want to get an aggregation based on author and weekday, and as a value I want to get concatenation of status field:
agg1  agg2     value
me    monday   ok,bad
me    tuesday  ok
I know you can do a count, but is it possible to define another function to use for the aggregation?
EDIT/ANSWER: It looks like there is no multi-row aggregation support in ES, so we had to use sub-aggregations on the last field (see Akshay's example). If you need a more complex aggregation function, aggregate by id (note that you won't be able to use _id, so you'll have to duplicate it in another field); that way you'll be able to do advanced aggregation on the individual items in each bucket.
You can get roughly what you want by using the sub-aggregations available in 1.0. Assuming the documents are structured as author, weekday and status, you could use the aggregation below:
{
"size": 0,
"aggs": {
"author": {
"terms": {
"field": "author"
},
"aggs": {
"days": {
"terms": {
"field": "weekday"
},
"aggs": {
"status": {
"terms": {
"field": "status"
}
}
}
}
}
}
}
}
Which gives you the following result:
{
...
"aggregations": {
"author": {
"buckets": [
{
"key": "me",
"doc_count": 3,
"days": {
"buckets": [
{
"key": "monday",
"doc_count": 2,
"status": {
"buckets": [
{
"key": "bad",
"doc_count": 1
},
{
"key": "ok",
"doc_count": 1
}
]
}
},
{
"key": "tuesday",
"doc_count": 1,
"status": {
"buckets": [
{
"key": "ok",
"doc_count": 1
}
]
}
}
]
}
}
]
}
}
}
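The concatenated value from the question can then be produced on the client by walking the three bucket levels. The sketch below is one way to do that; the function name is made up, and the statuses come out in whatever order the buckets were returned, which for the response above is "bad" before "ok".

```python
def concat_status(response, sep=","):
    """Flatten author -> days -> status buckets into
    (author, weekday, joined statuses) rows."""
    rows = []
    for author in response["aggregations"]["author"]["buckets"]:
        for day in author["days"]["buckets"]:
            statuses = sep.join(s["key"] for s in day["status"]["buckets"])
            rows.append((author["key"], day["key"], statuses))
    return rows


if __name__ == "__main__":
    response = {"aggregations": {"author": {"buckets": [
        {"key": "me", "doc_count": 3, "days": {"buckets": [
            {"key": "monday", "doc_count": 2, "status": {"buckets": [
                {"key": "bad", "doc_count": 1},
                {"key": "ok", "doc_count": 1}]}},
            {"key": "tuesday", "doc_count": 1, "status": {"buckets": [
                {"key": "ok", "doc_count": 1}]}},
        ]}},
    ]}}}
    for row in concat_status(response):
        print(*row)
```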

Elastic Search Grouped Queries

I'm indexing an array of key-value pairs. The key is always a UUID and the value is a user-entered value. I've been crawling through the documentation but I can't figure out exactly how to query in this scenario.
Example schema:
{
"id": 1,
"owner_id": 1,
"values": [
{ "key": "k3kfa23rewf", "value": "the red card" },
{ "key": "23a2dd23108", "value": "purple balloons" }
]
},
{
"id": 2,
"owner_id": 1,
"values": [
{ "key": "k3kfa23rewf", "value": "the blue card" },
{ "key": "23a2dd23108", "value": "purple balloons" }
]
}
I would like to query:
{ "term": { "owner_id": 1 } },
{ "term": { "values.key": "23a2dd23108" }, "match": { "values.value": "purple" } },
{ "term": { "values.key": "k3kfa23rewf" }, "match": { "values.value": "blue" } }
So that the record with ID 2 is returned. Any suggestions?
I think you need to use nested documents here.
That way, you will be able to build a bool query with one must clause holding a term query on owner_id, plus two more must clauses, each a nested query that combines a term query on values.key with a match query on values.value.
Does it help?
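To make that concrete, here is a rough sketch of the request body, built as a Python dict. It assumes the `values` field is mapped with "type": "nested"; the helper names are made up for illustration. The last function is only a local approximation of the per-element matching (whole-word, lower-cased), included to show why record 2 matches and record 1 does not.

```python
def pair_clause(key, text, path="values"):
    """Nested clause: the same array element must have this key
    and a value matching `text`."""
    return {
        "nested": {
            "path": path,
            "query": {
                "bool": {
                    "must": [
                        {"term": {f"{path}.key": key}},
                        {"match": {f"{path}.value": text}},
                    ]
                }
            },
        }
    }


def build_query(owner_id, pairs):
    """Bool query: owner term plus one nested clause per (key, text) pair."""
    must = [{"term": {"owner_id": owner_id}}]
    must.extend(pair_clause(key, text) for key, text in pairs)
    return {"query": {"bool": {"must": must}}}


def matches_locally(doc, owner_id, pairs):
    """Rough local stand-in for the query, for checking expectations."""
    if doc["owner_id"] != owner_id:
        return False
    return all(
        any(v["key"] == key and text.lower() in v["value"].lower().split()
            for v in doc["values"])
        for key, text in pairs
    )


if __name__ == "__main__":
    pairs = [("23a2dd23108", "purple"), ("k3kfa23rewf", "blue")]
    print(build_query(1, pairs))
```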
