How to use aggregations with Elastic Search - elasticsearch

I'm using Elastic Search to create a search filter and I need to find all the values saved in the database of the "cambio" column without repeating the values.
The values are saved as follows: "Manual de 5 marchas" or "Manual de 6 marchas"....
I created this query to return all saved values:
GET /crawler10/crawler-vehicles10/_search
{
"size": 0,
"aggregations": {
"my_agg": {
"terms": {
"field": "cambio"
}
}
}
}
But when I run it, the returned values look like this:
"aggregations": {
"my_agg": {
"doc_count_error_upper_bound": 2,
"sum_other_doc_count": 2613,
"buckets": [
{
"key": "de",
"doc_count": 2755
},
{
"key": "marchas",
"doc_count": 2714
},
{
"key": "manual",
"doc_count": 2222
},
{
"key": "modo",
"doc_count": 1097
},
{
"key": "5",
"doc_count": 1071
},
{
"key": "d",
"doc_count": 1002
},
{
"key": "n",
"doc_count": 1002
},
{
"key": "automática",
"doc_count": 935
},
{
"key": "com",
"doc_count": 919
},
{
"key": "6",
"doc_count": 698
}
]
}
}

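A terms aggregation works on the terms that were indexed for the field, which depends on the field's mapping. The cambio field appears to be analyzed (the default), so each value is split into separate tokens such as "manual", "de" and "marchas". Map the cambio field as not_analyzed so that each value is indexed as a single term.
You can create the index with a PUT request as below (if your ES version is earlier than 5). Because the mapping of an existing field cannot be changed in place, you will then need to re-index your data into the crawler10 index.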
PUT crawler10
{
"mappings": {
"crawler-vehicles10": {
"properties": {
"cambio": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
For ES 5 or later:
PUT crawler10
{
"mappings": {
"crawler-vehicles10": {
"properties": {
"cambio": {
"type": "keyword"
}
}
}
}
}
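To see why the analyzed field produces buckets such as "de" and "marchas", here is a minimal Python sketch. It is not Elasticsearch code: the `analyzed_terms` helper is a hypothetical, simplified stand-in for an analyzer (lowercase, then split on non-word characters), and the three sample values are made up from the ones quoted in the question.

```python
import re
from collections import Counter


def analyzed_terms(value):
    # Simplified stand-in for an analyzer (an assumption, not the real one):
    # lowercase the value and split it on runs of non-word characters.
    return {token for token in re.split(r"\W+", value.lower()) if token}


def terms_buckets(values, analyzed):
    # Count, per term, how many documents contain it (like doc_count).
    counts = Counter()
    for value in values:
        terms = analyzed_terms(value) if analyzed else {value}
        counts.update(terms)
    return counts


docs = ["Manual de 5 marchas", "Manual de 6 marchas", "Manual de 5 marchas"]

analyzed = terms_buckets(docs, analyzed=True)   # buckets per token
exact = terms_buckets(docs, analyzed=False)     # buckets per whole value

print(analyzed.most_common())
print(exact.most_common())
```

With the analyzed variant the buckets are individual words, while the exact variant keeps one bucket per distinct value, which is what the not_analyzed / keyword mapping is meant to give you.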

Related

Nested array of objects aggregation in Elasticsearch

Documents in Elasticsearch are indexed like this:
Document 1
{
"task_completed": 10,
"tagged_object": [
{
"category": "cat",
"count": 10
},
{
"category": "cars",
"count": 20
}
]
}
Document 2
{
"task_completed": 50,
"tagged_object": [
{
"category": "cars",
"count": 100
},
{
"category": "dog",
"count": 5
}
]
}
As you can see, the value of the category key is dynamic. I want to perform an aggregation similar to a SQL GROUP BY on category, returning the sum of count for each category.
In the above example, the aggregation should return
cat: 10,
cars: 120 and
dog: 5
I would like to know how to write this aggregation query in Elasticsearch, if it is possible. Thanks in advance.
You can get the required result by combining the nested, terms, and sum aggregations.
Here is a working example with the index mapping, search query and search result.
Index Mapping:
{
"mappings": {
"properties": {
"tagged_object": {
"type": "nested"
}
}
}
}
Search Query:
{
"size": 0,
"aggs": {
"resellers": {
"nested": {
"path": "tagged_object"
},
"aggs": {
"books": {
"terms": {
"field": "tagged_object.category.keyword"
},
"aggs":{
"sum_of_count":{
"sum":{
"field":"tagged_object.count"
}
}
}
}
}
}
}
}
Search Result:
"aggregations": {
"resellers": {
"doc_count": 4,
"books": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "cars",
"doc_count": 2,
"sum_of_count": {
"value": 120.0
}
},
{
"key": "cat",
"doc_count": 1,
"sum_of_count": {
"value": 10.0
}
},
{
"key": "dog",
"doc_count": 1,
"sum_of_count": {
"value": 5.0
}
}
]
}
}
}
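As a cross-check of the numbers above, the same grouping can be sketched in plain Python over the two sample documents from the question. This only mirrors the logic of the nested, terms and sum steps; the function name `sum_count_by_category` is made up for illustration and nothing here talks to Elasticsearch.

```python
from collections import defaultdict

# The two sample documents from the question.
docs = [
    {"task_completed": 10, "tagged_object": [
        {"category": "cat", "count": 10},
        {"category": "cars", "count": 20},
    ]},
    {"task_completed": 50, "tagged_object": [
        {"category": "cars", "count": 100},
        {"category": "dog", "count": 5},
    ]},
]


def sum_count_by_category(documents):
    totals = defaultdict(int)
    for doc in documents:
        # "nested" step: treat every tagged_object entry as its own unit.
        for entry in doc.get("tagged_object", []):
            # "terms" step groups by category, "sum" step adds up count.
            totals[entry["category"]] += entry["count"]
    return dict(totals)


totals = sum_count_by_category(docs)
print(totals)
```

The totals match the sum_of_count values in the search result: 120 for cars, 10 for cat and 5 for dog.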

Elasticsearch group-by field

I have some squid data like below:
{"requestresultcode": "TCP_MISS/200"},
{"requestresultcode": "TCP_MISS/200"},
{"requestresultcode": "TCP_MISS/302"},
{"requestresultcode": "TCP_MISS/504"},
{"requestresultcode": "TCP_MISS/200"},
{"requestresultcode": "ERR_CLIENT_ABORT/000"},
{"requestresultcode": "ERR_CLIENT_ABORT/200"},
{"requestresultcode": "ERR_CLIENT_ABORT/302"},
{"requestresultcode": "ERR_CLIENT_ABORT/502"},
{"requestresultcode": "ERR_CONNECT_FAIL/502"}
I want to group by the field, so I used a terms aggregation:
{
"aggs": {
"agg1": {
"terms": {
"field": "cacheresultcode"
}
}
}
}
I got the result:
"aggregations": {
"agg1": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "200",
"doc_count": 2011
},
{
"key": "tcp_miss",
"doc_count": 1740
},
{
"key": "err_client_abort",
"doc_count": 705
},
{
"key": "302",
"doc_count": 244
},
{
"key": "000",
"doc_count": 185
},
{
"key": "502",
"doc_count": 24
},
{
"key": "err_connect_fail",
"doc_count": 23
},
{
"key": "504",
"doc_count": 4
}
]
}
}
This is different from what I would get with SQL. I think the result should look like this:
ERR_CLIENT_ABORT/000
ERR_CLIENT_ABORT/200
ERR_CLIENT_ABORT/302
ERR_CLIENT_ABORT/502
ERR_CONNECT_FAIL/502
TCP_MISS/200
TCP_MISS/302
TCP_MISS/504
How should I do this?
Thanks for your help!
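The field is analyzed, so the terms aggregation buckets on individual tokens rather than on the whole value. If you still need the analyzed field elsewhere, you can use multi-fields to add a keyword sub-field to cacheresultcode and aggregate on that.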
Mappings
{
"mappings": {
"document_type" : {
"properties": {
"cacheresultcode":{
"type": "text",
"fields": {
"keyword" : {
"type": "keyword"
}
}
}
}
}
}
}
Query
{
"aggs": {
"agg1": {
"terms": {
"field": "cacheresultcode.keyword"
}
}
}
}
Hope this helps.

Elastic query to find similar tags in content from different organizations

I consume content sources from different organizations, which all supply metadata tags. I would like a list of the tags that are supplied by more than one organization.
A sample of data in Elasticsearch:
doc1: {
"tags":["tag1", "tag5", "tag6", "tag4"],
"organization" : "A"
}
doc2: {
"tags":["tag1", "tag2", "tag4"],
"organization" : "B"
}
Desired query result:
{
"tag": "tag1",
"organization" : ["A", "B"]
},
{
"tag": "tag4",
"organization" : ["A", "B"]
}
What I have so far
With the suggestion below, I get a list of results containing both keywords that are used by one organization and keywords that are used by several organizations.
To clarify, this is a part of the result:
{
"key": "someKeyWord",
"doc_count": 66,
"organization_list": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Organization A",
"doc_count": 62
},
{
"key": "Organization B",
"doc_count": 4
}
]
}
},
{
"key": "someOtherKeyword",
"doc_count": 62,
"organization_list": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Organization A",
"doc_count": 62
}
]
}
}
Now I only want the first result, which has two buckets in the organization_list aggregation, because that keyword is used by two different organizations.
I tried like this:
"number_buckets_filter": {
"bucket_selector": {
"buckets_path": {
"my_var": "organization_list"
},
"script": "params.my_var > 1"
}
}
But that gets me an exception: "buckets_path must reference either a number value or a single value numeric metric aggregation, got: org.elasticsearch.search.aggregations.bucket.terms.StringTerms"
Is there any way to filter the results? Thanks in advance for any help.
Kind regards,
Oskar uit de Bos
You can use the following query to bucket first on tags and then sub-bucket on organizations:
{
"size": 0,
"aggs": {
"tags_list": {
"terms": {
"field": "tags",
"size": 100
},"aggs": {
"organization_list": {
"terms": {
"field": "organization",
"size": 100
}
}
}
}
}
}
Mappings
{
"mappings": {
"product": {
"properties": {
"tags": {
"type": "text",
"fielddata": true
},
"organization": {
"type": "text",
"fielddata": true
}
}
}
}
}
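Note: for aggregations, make sure both tags and organization are not analyzed, otherwise the values are split into tokens. Setting fielddata to true, as in the mappings above, allows aggregating on text fields, but field data is loaded into heap memory and can use a lot of it, so a keyword field is usually the better choice.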
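The query above returns every tag, including those used by a single organization. One way to keep only the shared tags is to filter the response on the client. The sketch below assumes the response has already been parsed into Python dicts shaped like the partial result in the question; the function name `tags_shared_by_multiple_orgs` is made up for illustration. On the server side, the Elasticsearch documentation describes a special `_bucket_count` path for `buckets_path` (for example `organization_list._bucket_count`) that may let a bucket_selector do the same thing; check that against the documentation for your version.

```python
def tags_shared_by_multiple_orgs(tag_buckets, min_orgs=2):
    # Keep only tag buckets whose organization_list has at least min_orgs buckets.
    shared = []
    for bucket in tag_buckets:
        orgs = [b["key"] for b in bucket["organization_list"]["buckets"]]
        if len(orgs) >= min_orgs:
            shared.append({"tag": bucket["key"], "organization": orgs})
    return shared


# Buckets copied from the partial result shown in the question.
tag_buckets = [
    {"key": "someKeyWord", "doc_count": 66, "organization_list": {"buckets": [
        {"key": "Organization A", "doc_count": 62},
        {"key": "Organization B", "doc_count": 4},
    ]}},
    {"key": "someOtherKeyword", "doc_count": 62, "organization_list": {"buckets": [
        {"key": "Organization A", "doc_count": 62},
    ]}},
]

shared = tags_shared_by_multiple_orgs(tag_buckets)
print(shared)
```

Keep in mind that the sub-aggregation in the query is limited to 100 organizations per tag, so the client only sees the buckets that the query actually returned.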

Is it possible to returns other fields when you aggregate results on Elasticsearch?

Here are the mappings of my index PublicationsLikes:
id : String
account : String
api : String
date : Date
I'm currently making an aggregation on ES where I group the results counts by the id (of the publication).
{
"key": "<publicationId-1>",
"doc_count": 25
},
{
"key": "<publicationId-2>",
"doc_count": 387
},
{
"key": "<publicationId-3>",
"doc_count": 7831
}
The returned "key" (the id) is useful, but I also need to select other fields of the publication, such as account and api. Something like this:
{
"key": "<publicationId-1>",
"api": "Facebook",
"accountId": "65465z4fe6ezf456ezdf",
"doc_count": 25
},
{
"key": "<publicationId-2>",
"api": "Twitter",
"accountId": "afaez5f4eaz",
"doc_count": 387
}
How can I manage this?
Thanks.
This requirement is best met with a top_hits sub-aggregation, which lets you sort the documents in each bucket, keep only the first one, and control which fields are returned:
{
"size": 0,
"aggs": {
"publications": {
"terms": {
"field": "id"
},
"aggs": {
"sample": {
"top_hits": {
"size": 1,
"_source": ["api","accountId"]
}
}
}
}
}
}
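To illustrate what the terms plus top_hits combination gives you per bucket, here is a rough Python equivalent over a few made-up documents. The sample data and the `group_with_sample` helper are assumptions for illustration only; the real response nests the sample document under hits.hits inside each bucket.

```python
# Made-up "like" documents using the field names from the question's mapping.
likes = [
    {"id": "pub-1", "account": "acc-1", "api": "Facebook"},
    {"id": "pub-1", "account": "acc-1", "api": "Facebook"},
    {"id": "pub-2", "account": "acc-2", "api": "Twitter"},
]


def group_with_sample(documents, key_field, source_fields):
    # Group documents by key_field, count them, and keep the selected
    # fields of the first document seen in each group (like top_hits size 1).
    buckets = {}
    for doc in documents:
        key = doc[key_field]
        bucket = buckets.setdefault(key, {"key": key, "doc_count": 0, "sample": None})
        bucket["doc_count"] += 1
        if bucket["sample"] is None:
            bucket["sample"] = {field: doc[field] for field in source_fields}
    return list(buckets.values())


buckets = group_with_sample(likes, "id", ["api", "account"])
for bucket in buckets:
    print(bucket)
```

This works well when every document in a bucket carries the same api and account values, as is the case for likes on a single publication.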
You can use subaggregation for this.
GET /PublicationsLikes/_search
{
"aggs" : {
"ids": {
"terms": {
"field": "id"
},
"aggs": {
"accounts": {
"terms": {
"field": "account",
"size": 1
}
}
}
}
}
}
The result will not be exactly what you want, but it will be similar:
{
"key": "<publicationId-1>",
"doc_count": 25,
"accounts": {
"buckets": [
{
"key": "<account-1>",
"doc_count": 25
}
]
}
},
{
"key": "<publicationId-2>",
"doc_count": 387,
"accounts": {
"buckets": [
{
"key": "<account-2>",
"doc_count": 387
}
]
}
},
{
"key": "<publicationId-3>",
"doc_count": 7831,
"accounts": {
"buckets": [
{
"key": "<account-3>",
"doc_count": 7831
}
]
}
}
You can also check the link to find more information
Thanks to both of you for your quick replies. I think the first solution is the most "beautiful" (both in terms of the request and of how the results are retrieved), but both appear to be sub-aggregation queries.
{
"size": 0,
"aggs": {
"publications": {
"terms": {
"size": 0,
"field": "publicationId"
},
"aggs": {
"sample": {
"top_hits": {
"size": 1,
"_source": ["accountId", "api"]
}
}
}
}
}
}
I think I need to be careful with the size=0 parameter, so, because I work with the Java API, I decided to use the maximum integer value instead of 0.
Thanks a lot, guys.

ElasticSearch aggregation query customized field

I am wondering whether, in an ES aggregation query, it is possible to use the returned bucket values in a further computation. For example, if I have a response like this:
{
"key": "test",
"doc_count": 2000,
"child": {
"value": 1000
}
}
And I want to get the ratio of doc_count and value, so I am looking for a way to generate another field/aggregation to do the math of those two fields, like this:
{
"key": "test",
"doc_count": 2000,
"child": {
"value": 1000
},
"ratio" : 2
}
or
{
"key": "test",
"doc_count": 1997,
"child": {
"value": 817
},
"buckets": [
{
"key": "ratio",
"value": 2
}
]
}
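One option that does not depend on any particular Elasticsearch feature is to compute the ratio on the client after the response comes back. The sketch below assumes each bucket has been parsed into a Python dict shaped like the first example above; `add_ratio` is a made-up helper name. For a server-side computation, the bucket_script pipeline aggregation is documented for per-bucket arithmetic, with `_count` as the path for doc_count, and is worth checking against the documentation for your version.

```python
def add_ratio(bucket, metric_name="child"):
    # Return a copy of the bucket with doc_count / metric value added as "ratio".
    metric_value = bucket[metric_name]["value"]
    enriched = dict(bucket)
    # Guard against division by zero or a missing (None) metric value.
    enriched["ratio"] = bucket["doc_count"] / metric_value if metric_value else None
    return enriched


bucket = {"key": "test", "doc_count": 2000, "child": {"value": 1000}}
enriched = add_ratio(bucket)
print(enriched)
```

The original bucket is left unchanged, so the helper can be mapped over the whole buckets list of a response.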

Resources