How can I aggregate the whole field value in Elasticsearch

I am using Elasticsearch 7.15 and need to aggregate on a field and sort the results by document count.
The documents stored in Elasticsearch look like this:
{
"logGroup" : "/aws/lambda/myLambda1",
...
},
{
"logGroup" : "/aws/lambda/myLambda2",
...
}
I need to find out which logGroup has the most documents. To do that, I tried a terms aggregation:
GET /my-index/_search?size=0
{
"aggs": {
"types_count": {
"terms": {
"field": "logGroup",
"size": 10000
}
}
}
}
The output of this query looks like:
"aggregations" : {
"types_count" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "aws",
"doc_count" : 26303620
},
{
"key" : "lambda",
"doc_count" : 25554470
},
{
"key" : "myLambda1",
"doc_count" : 25279201
}
...
}
As you can see from the output above, the logGroup value is split into terms and the aggregation is computed per term, not on the whole string. Is there a way to aggregate on the whole string?
I expect the output to look like this:
"buckets" : [
{
"key" : "/aws/lambda/myLambda1",
"doc_count" : 26303620
},
{
"key" : "/aws/lambda/myLambda2",
"doc_count" : 25554470
},
The logGroup field in the index mapping is:
"logGroup" : {
"type" : "text",
"fielddata" : true
},
Can I achieve it without updating the index?

In order to get what you expect you need to change your mapping to this:
"logGroup" : {
"type" : "keyword"
},
Otherwise, the logGroup values are analyzed by the standard analyzer, which splits each string into separate tokens, so you cannot aggregate on the full log group name.
If you don't want to, or can't, change the mapping and reindex everything, you can do the following:
First, add a keyword sub-field to your mapping, like this:
PUT /my-index/_mapping
{
"properties": {
"logGroup" : {
"type" : "text",
"fields": {
"keyword": {
"type" : "keyword"
}
}
}
}
}
And then run the following so that all existing documents pick up this new field:
POST my-index/_update_by_query?wait_for_completion=false
Finally, you'll be able to achieve what you want with the following query:
GET /my-index/_search
{
"size": 0,
"aggs": {
"types_count": {
"terms": {
"field": "logGroup.keyword",
"size": 10000
}
}
}
}
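If the request is sent from code rather than from the console, the same idea can be sketched in Python. This is only a minimal illustration: it builds the request body and reads buckets from a response dictionary, without talking to a cluster. The helper names `build_terms_agg` and `top_bucket` are made up for this example, and the sample response is hand-written from the counts in the question, not real output.

```python
from typing import Any, Dict, Optional


def build_terms_agg(field: str, size: int = 10000) -> Dict[str, Any]:
    """Build a search body that counts documents per distinct value of `field`."""
    return {
        "size": 0,  # no hits needed, only the aggregation
        "aggs": {"types_count": {"terms": {"field": field, "size": size}}},
    }


def top_bucket(response: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the bucket with the highest doc_count, or None if there are none."""
    buckets = response["aggregations"]["types_count"]["buckets"]
    return max(buckets, key=lambda b: b["doc_count"], default=None)


# Hand-written stand-in for a response (illustrative values only).
sample_response = {
    "aggregations": {
        "types_count": {
            "buckets": [
                {"key": "/aws/lambda/myLambda1", "doc_count": 26303620},
                {"key": "/aws/lambda/myLambda2", "doc_count": 25554470},
            ]
        }
    }
}

body = build_terms_agg("logGroup.keyword")  # aggregate on the keyword sub-field
best = top_bucket(sample_response)
print(body["aggs"]["types_count"]["terms"]["field"])
print(best["key"], best["doc_count"])
```

The important part is the field name: the aggregation targets `logGroup.keyword`, not the analyzed `logGroup` field.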

Can Elasticsearch do aggregations within a document?

I have a mapping like this:
"mappings": {
"seller": {
"properties": {
"overallRating": {"type": "byte"},
"items": {
"properties": {
"itemName": {"type": "string"},
"itemRating": {"type": "byte"}
}
}
}
}
}
Each item will only have one itemRating. Each seller will only have one overall rating. There can be many items, and at most I'm expecting maybe 50 items with itemRatings. Not all items have to have an itemRating.
I'm trying to get an average rating for each seller that combines all itemRatings and the overallRating. I have looked into aggregations, but all I have seen are aggregations across all documents. The aggregation I'm looking for is within the document itself, and I am not sure if that is possible. Any tips would be appreciated.
Yes, this is very much possible with Elasticsearch. To produce a combined rating, you simply need to sub-aggregate by the document id. Each bucket then contains exactly one document, which is what you want.
Here is an example:
Create the index:
PUT /ratings
{
"mappings": {
"properties": {
"overallRating": {"type" : "float"},
"items": {
"type" : "nested",
"properties": {
"itemName" : {"type" : "keyword"},
"itemRating" : {"type" : "float"},
"overallRating": {"type" : "float"}
}
}
}
}
}
Add some data:
POST ratings/_doc/
{
"overallRating" : 1,
"items" : [
{
"itemName" : "labrador",
"itemRating" : 10,
"overallRating" : 1
},
{
"itemName" : "saint bernard",
"itemRating" : 20,
"overallRating" : 1
}
]
}
POST ratings/_doc/
{
"overallRating" : 1,
"items" : [
{
"itemName" : "cat",
"itemRating" : 5,
"overallRating" : 1
},
{
"itemName" : "rat",
"itemRating" : 10,
"overallRating" : 1
}
]
}
Query the index for a combined rating per document:
GET ratings/_search
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"average_rating": {
"composite": {
"sources": [
{
"ids": {
"terms": {
"field": "_id"
}
}
}
]
},
"aggs": {
"average_rating": {
"nested": {
"path": "items"
},
"aggs": {
"avg": {
"avg": {
"field": "items.compound"
}
}
}
}
}
}
},
"runtime_mappings": {
"items.compound": {
"type": "double",
"script": {
"source": "emit(doc['items.overallRating'].value + doc['items.itemRating'].value)"
}
}
}
}
The result (please note that I changed the exact rating values between writing the answer and running it in the console, so the averages are a bit different):
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"average_rating" : {
"after_key" : {
"ids" : "3vUp44EBbR3hrRYkA8pj"
},
"buckets" : [
{
"key" : {
"ids" : "3_Up44EBbR3hrRYkLsrC"
},
"doc_count" : 1,
"average_rating" : {
"doc_count" : 2,
"avg" : {
"value" : 151.0
}
}
},
{
"key" : {
"ids" : "3vUp44EBbR3hrRYkA8pj"
},
"doc_count" : 1,
"average_rating" : {
"doc_count" : 2,
"avg" : {
"value" : 8.5
}
}
}
]
}
}
}
One change for convenience:
I edited your mapping to add overallRating to each items entry. This simplifies the subsequent calculation, because you only look inside the nested scope and never have to step out of it.
I also used a runtime mapping to combine overallRating and itemRating: it sums the two values for each item, and the nested avg then averages those sums across all items of the document.
I used a top-level composite aggregation on _id so that results are produced per document (which is what you want).
There is some pretty heavy lifting happening here, but it works and is easy to adapt as you require.
HTH.
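To sanity-check what the query computes, the same arithmetic can be done outside Elasticsearch. The following Python sketch is only an illustration: the function name `combined_average` is hypothetical, and it assumes that every item carries both itemRating and a copy of overallRating (as in the edited mapping) and that each document has at least one item.

```python
from typing import Any, Dict, List


def combined_average(doc: Dict[str, Any]) -> float:
    """Average of (overallRating + itemRating) over the items of one document.

    Mirrors the runtime field `items.compound` followed by the nested `avg`.
    Assumes every item has both fields and that `items` is not empty.
    """
    totals = [item["overallRating"] + item["itemRating"] for item in doc["items"]]
    return sum(totals) / len(totals)


# The two sample documents indexed above.
docs: List[Dict[str, Any]] = [
    {
        "overallRating": 1,
        "items": [
            {"itemName": "labrador", "itemRating": 10, "overallRating": 1},
            {"itemName": "saint bernard", "itemRating": 20, "overallRating": 1},
        ],
    },
    {
        "overallRating": 1,
        "items": [
            {"itemName": "cat", "itemRating": 5, "overallRating": 1},
            {"itemName": "rat", "itemRating": 10, "overallRating": 1},
        ],
    },
]

for doc in docs:
    print(combined_average(doc))  # prints 16.0, then 8.5
```

The 8.5 for the second document matches the corresponding bucket in the result above; the other bucket differs because the ratings of that document were changed before the query was run.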

ElasticSearch query fields in disabled object

I have an Elasticsearch 6.8.7 cluster.
I have a field with this mapping:
"event_object": { "enabled": false, "type": "object" }
I want to search for records that match certain other criteria, and also have a particular value for a particular field in this object.
So far, I have tried variations of doing a normal search for the indexed fields, and a filter script for the unindexed ones:
GET /my_index/_search
{
"query":{
"bool":{
"must":{
"query_string": {
"query": "foo:bar"
}
},
"filter": {
"script": {
"script": {
"source": "doc[\"event_object\"][\"state\"].value == \"R\""
}
}
}
}
},
"terminate_after":1000,
"from":0,
"size":1000
}
This is a hodgepodge pieced together by trial and error from Google searches, but I can't get it to even compile, let alone run and filter.
It is not possible to access the content of JSON objects that have enabled: false. From the official documentation:
Elasticsearch skips parsing of the contents of the field entirely. The JSON can still be retrieved from the _source field, but it is not searchable or stored in any other way
So a script that reads doc values, as in your filter, will not help here.
However, there is one way to read this disabled data from _source with a script in a terms aggregation (using the include parameter and a top_hits sub-aggregation):
POST test/_search
{
"query": {
"match_all": {}
},
"aggs": {
"state": {
"terms": {
"script": "params._source.event_object.state",
"size": 100,
"include": "R"
},
"aggs": {
"hits": {
"top_hits": {
"size": 10
}
}
}
}
}
}
And you'd get a response like this one:
"aggregations" : {
"state" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "R",
"doc_count" : 1,
"hits" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"event_object" : {
"state" : "R"
},
"test" : "hello"
}
}
]
}
}
}
]
}
}
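If the response is processed in code, the matching documents can be pulled out of the bucket as in this minimal Python sketch. The function name `sources_for_key` is hypothetical, and the sample dictionary is a trimmed copy of the response above.

```python
from typing import Any, Dict, List


def sources_for_key(response: Dict[str, Any], agg_name: str, key: str) -> List[Dict[str, Any]]:
    """Collect the _source of every top_hits document in the bucket matching `key`."""
    for bucket in response["aggregations"][agg_name]["buckets"]:
        if bucket["key"] == key:
            # The sub-aggregation is named "hits"; its result nests hits.hits.
            return [hit["_source"] for hit in bucket["hits"]["hits"]["hits"]]
    return []  # no bucket for this key


# Trimmed copy of the response shown above.
sample_response = {
    "aggregations": {
        "state": {
            "buckets": [
                {
                    "key": "R",
                    "doc_count": 1,
                    "hits": {
                        "hits": {
                            "hits": [
                                {
                                    "_id": "1",
                                    "_source": {
                                        "event_object": {"state": "R"},
                                        "test": "hello",
                                    },
                                }
                            ]
                        }
                    },
                }
            ]
        }
    }
}

matches = sources_for_key(sample_response, "state", "R")
print(matches)
```

Keep in mind that top_hits only returns as many documents per bucket as its size parameter allows (10 in the request above).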

access query value from function_score to compute new score

I need to customize ES score. The score function I need to implement is:
score = len(document_term) - len(query_term)
For instance, one of the documents in my ES index is:
{
"name": "foobar"
}
And the search query is:
{
"query": {
"function_score": {
"query": {
"match": {
"name": {
"query": "foo"
}
}
},
"functions": [
{
"script_score": {
"script": {
"source": "doc['name'].value.length() - ?LEN(query_term)?"
}
}
}
],
"boost_mode": "replace"
}
}
}
The above search should produce a score of 6 - 3 = 3, but I didn't find a way to access the value of the query term.
Is it possible to access the value of the query term in a function_score context?
There is no direct way to do this. However, you can achieve it by passing the query string in two different parts of the query: once in the match query and once in the script params.
Before that, one important note: you cannot use doc['myfield'].value if the field is of type text. Instead, you need a keyword sub-field and must refer to that in the script, as shown below:
Mapping:
PUT myindex
{
"mappings" : {
"properties" : {
"myfield" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
Sample Document:
POST myindex/_doc/1
{
"myfield": "I've become comfortably numb"
}
Query:
POST <your_index_name>/_search
{
"query": {
"function_score": {
"query": {
"match": {
"myfield": "numb"
}
},
"functions": [
{
"script_score": {
"script": {
"source": "return doc['myfield.keyword'].value.length() - params.myquery.length()",
"params": {
"myquery": "numb"
}
}
}
}
],
"boost_mode": "replace"
}
}
}
Response:
{
"took" : 558,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 24.0,
"hits" : [
{
"_index" : "myindex",
"_type" : "_doc",
"_id" : "1",
"_score" : 24.0,
"_source" : {
"myfield" : "I've become comfortably numb"
}
}
]
}
}
Hope this helps!
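Because the query string has to appear twice, it is easy for the two copies to drift apart when the request is written by hand. A small Python sketch of building the body from a single variable follows. The helper names `build_length_diff_query` and `expected_score` are made up for this example, and nothing here contacts a cluster.

```python
from typing import Any, Dict


def build_length_diff_query(field: str, keyword_field: str, term: str) -> Dict[str, Any]:
    """Build a function_score body whose score is len(document value) - len(term).

    `term` is injected twice: into the match query and into the script params.
    """
    source = f"return doc['{keyword_field}'].value.length() - params.myquery.length()"
    return {
        "query": {
            "function_score": {
                "query": {"match": {field: term}},
                "functions": [
                    {"script_score": {"script": {"source": source, "params": {"myquery": term}}}}
                ],
                "boost_mode": "replace",
            }
        }
    }


def expected_score(document_value: str, term: str) -> int:
    """Client-side check of the score; equals the script result for ASCII text."""
    return len(document_value) - len(term)


body = build_length_diff_query("myfield", "myfield.keyword", "numb")
print(body["query"]["function_score"]["functions"][0]["script_score"]["script"]["params"])
print(expected_score("I've become comfortably numb", "numb"))  # 28 - 4 = 24
```

The value 24 agrees with the `_score` of 24.0 in the response above.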

Query on nested type with aggregation on nested types returns unexpected results

We are using Elasticsearch 5.6.4. As mentioned in the ES documentation, an aggregation operates in the context of the query scope, so any filter applied to the query also applies to the aggregation.
Here is what I have:
An index with this mapping:
{
"properties":{
"asset":{
"properties":{
"customerId":{
"type":"long"
}
}
},
"software":{
"type": "nested",
"properties":{
"id":{
"type":"long"
},
"name":{
"type":"text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
I have created several documents to perform various tests here. Docs are indexed on customerId. I have 10 documents in all, each with 2 or more software entries. To test aggregations on software, I created software entries with the same id across multiple documents. For example, software with id 12 appears in the documents with customerId 1, 2 and 3, and the document with customerId 2 contains it twice.
So there are 4 software entries with id 12 across documents 1, 2 and 3.
When this query is run, the aggregation result includes only the document with customerId 1, and not 2 and 3:
{
"query" : {
"term":{
"asset.customerId":1
}
},
"aggregations" : {
"aggs" : {
"nested" : {
"path" : "software"
},
"aggregations" : {
"software.id.agg" : {
"terms" : {
"field" : "software.id",
"size" : 10,
"min_doc_count" : 1,
"shard_min_doc_count" : 0,
"show_term_doc_count_error" : false,
"order" : [
{
"_count" : "desc"
},
{
"_term" : "asc"
}
]
}
}
}
}
}
}
But when the query filter is applied to a nested type (software.id), the aggregation result includes all the docs (1, 2 and 3), so the buckets that should have been filtered out by the query are also present:
{
"query" : {
"nested" : {
"query" : {
"match_phrase_prefix" : {
"software.id" : {
"query" : 12,
"slop" : 100,
"max_expansions" : 50,
"boost" : 1.0
}
}
},
"path" : "software",
"ignore_unmapped" : false,
"score_mode" : "none",
"boost" : 1.0
}
},
"aggregations" : {
"aggs" : {
"nested" : {
"path" : "software"
},
"aggregations" : {
"software.id.agg" : {
"terms" : {
"field" : "software.id",
"size" : 10,
"min_doc_count" : 1,
"shard_min_doc_count" : 0,
"show_term_doc_count_error" : false,
"order" : [
{
"_count" : "desc"
},
{
"_term" : "asc"
}
]
}
}
}
}
}
}
What's the correct way to provide the query filter on nested type so that it is applied on aggregation?

Getting _id fields of aggregated records in Elastic Search

I am using ES to aggregate results based on a field. In addition, I would like to retrieve the _id of the records that went into each aggregated bucket. Is that possible?
For example: for the following query
{
"aggs" : {
"genders" : {
"terms" : { "field" : "gender" }
}
}
}
the response would be something like this
{
...
"aggregations" : {
"genders" : {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets" : [
{
"key" : "male",
"doc_count" : 14
},
{
"key" : "female",
"doc_count" : 14
}
]
}
}
}
Here I also want the _id of all 14 male and 14 female records that make up the aggregation.
Why would I need that?
Say, because I need to do some post-processing on these records, i.e. insert a new field into them based on their gender. Of course, it's not as trivial as that, but my use case is along those lines.
Thanks in advance!
Create a sub-aggregation inside the genders aggregation, something like this:
{
"aggs" : {
"genders" : {
"terms" : { "field" : "gender" },
"aggs" : {
"ids" : {
"terms" : { "field" : "_uid" }
}
}
}
}
}
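For the post-processing step, the ids can then be read per bucket from the response. The Python sketch below only shows the shape of that parsing: the function name `ids_by_bucket` is hypothetical, and the sample response, including the id values, is invented for illustration.

```python
from typing import Any, Dict, List


def ids_by_bucket(
    response: Dict[str, Any], outer: str = "genders", inner: str = "ids"
) -> Dict[str, List[str]]:
    """Map each outer bucket key to the keys of its inner terms buckets."""
    result: Dict[str, List[str]] = {}
    for bucket in response["aggregations"][outer]["buckets"]:
        result[bucket["key"]] = [sub["key"] for sub in bucket[inner]["buckets"]]
    return result


# Invented sample response; the id values are placeholders.
sample_response = {
    "aggregations": {
        "genders": {
            "buckets": [
                {
                    "key": "male",
                    "doc_count": 2,
                    "ids": {"buckets": [{"key": "person#1", "doc_count": 1},
                                        {"key": "person#3", "doc_count": 1}]},
                },
                {
                    "key": "female",
                    "doc_count": 1,
                    "ids": {"buckets": [{"key": "person#2", "doc_count": 1}]},
                },
            ]
        }
    }
}

grouped = ids_by_bucket(sample_response)
print(grouped)
```

Note that a terms aggregation returns only its top buckets (10 by default), so the size of the ids sub-aggregation has to be set high enough to cover every record in a bucket.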
