Elasticsearch aggregation by array size

I need some stats from Elasticsearch, but I can't figure out the request.
I would like to know the number of persons per appointment, i.e. how many appointments have 1 person, 2 persons, 3 persons, and so on.
A sample document from the appointment index:
{
"id" : "383577",
"persons" : [
{
"id" : "1",
},
{
"id" : "2",
}
]
}
What I would like:
"buckets" : [
{
"key" : "1", <--- appointment of 1 person
"doc_count" : 1241891
},
{
"key" : "2", <--- appointment of 2 persons
"doc_count" : 10137
},
{
"key" : "3", <--- appointment of 3 persons
"doc_count" : 8064
}
]
Thank you

The easiest way to do this is to create another integer field containing the length of the persons array and to aggregate on that field.
{
"id" : "383577",
"personsCount": 2, <---- add this field
"persons" : [
{
"id" : "1",
},
{
"id" : "2",
}
]
}
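With that field in place, the aggregation itself is a plain terms aggregation on it (a minimal sketch; the index name appointments is an assumption, personsCount is the field added above):
GET appointments/_search
{
  "size": 0,
  "aggs": {
    "appointments_per_person_count": {
      "terms": {
        "field": "personsCount"
      }
    }
  }
}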
A less optimal way of achieving what you expect is to use a script that returns the length of the persons array dynamically. Be aware that this is sub-optimal and can potentially harm your cluster depending on the volume of data you have:
GET /_search
{
"aggs": {
"persons": {
"terms": {
"script": "doc['persons.id'].size()"
}
}
}
}
If you want to update all your documents to create that field you can do it like this:
POST index/_update_by_query
{
"script": {
"source": "ctx._source.personsCount = ctx._source.persons.length"
}
}
However, you'll also need to modify the logic of your indexing application to create that new field.
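If changing the indexing application is not practical, another option is a small ingest pipeline that computes the field at index time (a minimal sketch; the pipeline name persons-count is hypothetical):
PUT _ingest/pipeline/persons-count
{
  "description": "Compute personsCount from the persons array",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "if (ctx.persons != null) { ctx.personsCount = ctx.persons.size(); }"
      }
    }
  ]
}
Documents indexed with ?pipeline=persons-count (or with the pipeline configured as the index default pipeline) then get personsCount populated automatically.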

Related

Counting unique buckets from aggregation

I am trying to get the unique count for all labels used on a set of documents. In order to do that, and have the JSON returned in the bucket (cardinality doesn't return the JSON and the count together), I need to write a pipeline query.
My query gets me half way there, but I'm missing the second part that counts the number of buckets a label is in.
Here's my query
{
"size": 0,
"aggs": {
"unique_count": {
"composite": {
"sources": [
{ "metadataId": { "terms": { "field": "document.metadata.id" } } },
{ "label": { "terms": { "field": "document.label" } } }
]
}
}
}
}
This produces
...
"buckets" : [
{
"key" : {
"metadataId" : "1",
"label" : "label one"
},
"doc_count" : 2
},
{
"key" : {
"metadataId" : "2",
"label" : "label one"
},
"doc_count" : 1
},
{
"key" : {
"metadataId" : "3",
"label" : "label three"
},
"doc_count" : 3
}
]
...
The problem I'm facing is that each bucket is considered unique, and what I would like to return is the number of buckets each label appears in. For example, in the buckets above the label "label one" is contained within two buckets, so its doc_count should be 2, while "label three" should have a doc_count of 1.
After the last phase in the pipeline I'd like to see the following output:
"buckets" : [
{
"label" : "label one"
"doc_count" : 2
},
{
"label" : "label three"
"doc_count" : 1
}
]
I've tried all sorts of things, but they're just not getting me close to the output I need. Can anyone point me in the right direction?
Try nested terms aggregations, where the first-level aggregation is on the label field and the second level on the metadataId field. The aggs block should look something like this:
"aggs" : {
"labels": {
"terms": {
"field": "label.keyword",
"size": 1000
},
"aggs": {
"metadata": {
"terms": {
"field": metadataId.keyword",
"size": 1000
}
}
}
}
}
As output, you will get buckets of labels, with key as the label value and doc_count as the count of docs matching that label. Each label bucket will contain nested buckets of metadataId, with key as the metadataId value and doc_count as the count of docs matching both that label and that metadataId.
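If only the distinct count of metadataId values per label is needed, rather than the nested buckets themselves, another option is a cardinality sub-aggregation under the label terms (a sketch under the same .keyword field assumptions; note that cardinality counts are approximate):
GET /_search
{
  "size": 0,
  "aggs": {
    "labels": {
      "terms": { "field": "label.keyword", "size": 1000 },
      "aggs": {
        "unique_metadata_ids": {
          "cardinality": { "field": "metadataId.keyword" }
        }
      }
    }
  }
}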

Convert two repeated values in array into a string

I have some old documents where a field has an array of two values repeated, something like this:
"task" : [
"first_task",
"first_task"
],
I'm trying to convert this array into a string because both elements are the same value. I've seen the following question: Convert array with 2 equal values to single value, but in my case this can't be fixed through Logstash because it only affects old documents that are already stored.
I was thinking of doing something like this:
POST _ingest/pipeline/_simulate
{
"pipeline": {
"processors": [
{
"script": {
"description": "Change task field from array to first element of this one",
"lang": "painless",
"source": """
if (ctx['task'][0] == ctx['task'][1]) {
ctx['task'] = ctx['task'][0];
}
"""
}
}
]
},
"docs": [
{
"_index" : "tasks",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"#timestamp" : "2022-05-03T07:33:44.652Z",
"task" : ["first_task", "first_task"]
}
}
]
}
The result document is the following:
{
"docs" : [
{
"doc" : {
"_index" : "tasks",
"_type" : "_doc",
"_id" : "1",
"_source" : {
"#timestamp" : "2022-05-03T07:33:44.652Z",
"task" : "first_task"
},
"_ingest" : {
"timestamp" : "2022-05-11T09:08:48.150815183Z"
}
}
}
]
}
We can see that the task field has been reassigned and now holds the first element of the array as its value.
Is there a way to manipulate the data already stored in Elasticsearch and convert all documents with this characteristic using DSL queries?
Thanks.
You can achieve this with the _update_by_query endpoint. Here is an example:
POST tasks/_update_by_query
{
"script": {
"source": """
if (ctx._source['task'][0] == ctx._source['task'][1]) {
ctx._source['task'] = ctx._source['task'][0];
}
""",
"lang": "painless"
},
"query": {
"match_all": {}
}
}
The match_all query updates all documents; you can instead filter which documents get updated by changing the conditions in the query.
Keep in mind that running a script to update all documents in the index may cause some performance issues while the update process is running.
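If some documents might have task missing, not an array, or with fewer than two elements, a slightly more defensive version of the script could look like this (a sketch, not tested against your data):
POST tasks/_update_by_query
{
  "script": {
    "lang": "painless",
    "source": """
      def t = ctx._source['task'];
      // only touch documents where task is a two-element array with equal values
      if (t instanceof List && t.size() == 2 && t[0] == t[1]) {
        ctx._source['task'] = t[0];
      } else {
        // leave everything else untouched
        ctx.op = 'noop';
      }
    """
  }
}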

Does it make sense to use two separate relational indexes in Elasticsearch?

I have two relational MySQL tables and I need to store their data in Elasticsearch.
I stored them like this and I wanted to ask if there is a better way:
POST categories/_doc
{
"id" : 1,
"name" : "Phones"
}
POST categories/_doc
{
"id" : 2,
"name" : "TV"
}
PUT products
{
"mappings": {
"properties": {
"attributes": {
"type": "nested"
}
}
}
}
POST products/_doc
{
"id" : 3
"category_id" : 1
"name" : "IPhone 5S",
"attributes" : [
{
"color" : "red",
"stock" : 4
},
{
"color" : "blue",
"stock" : 2
}
]
}
POST products/_doc
{
"id" : 5
"category_id" : 2
"name" : "Samsung TV",
"attributes" : [
{
"color" : "red",
"stock" : 2
},
{
"color" : "yellow",
"stock" : 4
}
]
}
And I use two queries for searching:
I first search the categories index, and then I send the matching category id values to the products index:
GET products/_search
{
"query": {
"bool": {
"filter": [
{
"nested": {
"path": "attributes",
"query": {
"terms": {
"attributes.color": [
"red",
"blue"
]
}
},
"inner_hits": {}
}
},
{
"term": {
"category_id": 2
}
}
]
}
}
}
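For reference, the first query against the categories index could look something like this (a minimal sketch; the match on the name field is an assumption, not from the question):
GET categories/_search
{
  "query": {
    "match": {
      "name": "TV"
    }
  }
}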
Can you please share your thoughts on this approach?
Thank you in advance.
It is not good practice to do it like this, since you will have a really difficult time with sorting and pagination in your application. In Elasticsearch it is really important to flatten the data for performance reasons and store it in one index; if you have truly separate content, you can keep it in multiple indices. Also keep these points in mind:
Elasticsearch is not a relational database. You will not be able to join indices the way you are used to joining tables.
Denormalization is not natural, but it is key to efficiency in an Elasticsearch application.
Thinking about your data mapping from the very beginning will allow your app to fly for years.
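For example, a denormalized product document could embed the category directly (a minimal sketch based on the documents above; the embedded category object is an assumption about how you might model it):
POST products/_doc
{
  "id": 3,
  "name": "IPhone 5S",
  "category": {
    "id": 1,
    "name": "Phones"
  },
  "attributes": [
    { "color": "red", "stock": 4 },
    { "color": "blue", "stock": 2 }
  ]
}
With the category data stored on the product itself, a single query on the products index can filter by category and attributes at once, avoiding the two-step lookup.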

I would like to combine the duplicate values in Elasticsearch into one and see the results with a different filter

I'm collecting logs through Elasticsearch. The logs are collected as below.
For example:
{
"name" : "John",
"team" : "IT",
"startTime" : "21:00",
"result" : "pass"
},
{
"name" : "James",
"team" : "HR",
"startTime" : "21:04",
"result" : "pass"
},
{
"name" : "Paul",
"team" : "IT",
"startTime" : "21:05",
"result" : "pass"
},
{
"name" : "Jackson",
"team" : "Marketing",
"startTime" : "21:30",
"result" : "fail"
},
{
"name" : "John",
"team" : "IT",
"startTime" : "21:41",
"result" : "pass"
},
.....and so on
If you run the query below on these collected logs,
GET logData/_search
{
"size": 0,
"aggs": {
"Documents_per_team": {
"terms": {
"field": "team"
}
}
}
}
The following results are returned:
"aggregations" : {
"Documents_per_team" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "IT",
"doc_count" : 70
},
{
"key" : "Marketing",
"doc_count" : 55
},
{
"key" : "HR",
"doc_count" : 11
}
]
}
}
}
What I want is to eliminate duplicates when the same name appears in multiple documents of this result.
[AS-IS]
As shown above, the IT team count is 70.
[The result I want]
If John performed 50 times, Kate performed 10 times, and Paul performed 10 times, the IT team count should be 3 (because there are three IT team members).
Can I get a team-by-team result after removing duplicates?
Thanks
You've got two options:
a cardinality sub-aggregation (straightforward, but approximate; the approximation only matters in very specific/advanced situations)
or a scripted metric aggregation (slower and more verbose, but exact).
Both approaches assume that names are unique within a team; if they're not, you'll need to adjust accordingly. It is also assumed that name is mapped as type keyword, just like team. If not, you'll need to replace the field names with your_field.keyword.
1. Cardinality
{
"size": 0,
"aggs": {
"Documents_per_team": {
"terms": {
"field": "team"
},
"aggs": {
"unique_names_per_team": {
"cardinality": {
"field": "name"
}
}
}
}
}
}
2. Scripted Metric
{
"size": 0,
"aggs": {
"Documents_per_team": {
"scripted_metric": {
"init_script": "state.by_department = [:]; state.dept_vs_name = [:];",
"map_script": """
def dept = doc['team'].value;
def name = doc['name'].value;
def name_already_considered = state.by_department.containsKey(dept) && state.dept_vs_name[dept].containsKey(name);
if (name_already_considered) {
return;
}
if (state.by_department.containsKey(dept)) {
state.by_department[dept] += 1;
} else {
state.by_department[dept] = 1;
}
if (!state.dept_vs_name.containsKey(dept)) {
// init new map & set is first member
state.dept_vs_name[dept] = [name:true];
} else if (!state.dept_vs_name[dept].containsKey(name)) {
state.dept_vs_name[dept][name] = true;
}
""",
"combine_script": "return state.by_department",
"reduce_script": "return states"
}
}
}
}
Note: If you also wish to see the underlying dept vs. name breakdown, you can modify the combine_script to return the whole state, i.e. return state.

Terms aggregation on an inner object and retrieving bucket metadata

We index the following products:
{
"id": "1",
"name": "the-name",
"categories": [
{
"id" : 10,
"name" : "cat-1"
},
{
"id" : 20,
"name" : "cat-2"
}
]
}
We are doing an aggregation on categories.id using:
REQUEST:
//...
"aggs": {
"by_cat": {
"terms": {
"field": "categories.id",
"size": 10
}
}
}
---
RESPONSE:
// ...
"by_cat" : {
"buckets" : [
{
"key" : 10,
"doc_count" : 804
},
{
"key" : 20,
"doc_count" : 327
},
It works well; however, each bucket contains only the categories.id in the key field. What we would like is to also have the name of the category in each bucket, for example:
// ...
"buckets" : [
{
"key" : 10,
"metadata": {
"name": "cat-1"
},
"doc_count" : 804
},
{
"key" : 20,
"metadata": {
"name": "cat-2"
},
"doc_count" : 327
},
What is the right way to do that? We found two ways to get this information, but they both look "hackish":
Using top_hits with size 1 and _source limited to categories: it retrieves one document per bucket containing the information we need (see the sketch after this question). This first solution doesn't look good performance-wise, and the more aggregations we have, the more bloated the response becomes.
Adding a new field id_name that concatenates id and name, and doing the terms aggregation on it. This looks more like a hack, and may get complicated with many fields.
We also tried mixing field and script in terms, but it doesn't help.
metadata looked exactly like what we wanted, but it is global for all the buckets and not dynamic.
Is there another way to retrieve this information?
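A minimal sketch of the first workaround mentioned above, using a top_hits sub-aggregation (field names are from the question; the index name products and the sizes are assumptions):
GET products/_search
{
  "size": 0,
  "aggs": {
    "by_cat": {
      "terms": {
        "field": "categories.id",
        "size": 10
      },
      "aggs": {
        "category_doc": {
          "top_hits": {
            "size": 1,
            "_source": ["categories"]
          }
        }
      }
    }
  }
}
Note that because categories is an array on the document, the returned _source contains all of that document's categories, so the client still has to pick out the one matching the bucket key, which is part of why this approach feels hackish.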
