Elastic query group by - elasticsearch

I've started learning Elasticsearch and was wondering if somebody could help me shortcut the process by providing some examples of how I would build a couple of queries.
Here's my example schema...
PUT /sales/_mapping
{
  "sale": {
    "properties": {
      "productCode": {"type": "string"},
      "productTitle": {"type": "string"},
      "quantity": {"type": "integer"},
      "unitPrice": {"type": "double"}
    }
  }
}
POST /sales/1
{"productCode": "A", "productTitle": "Widget", "quantity": 5, "unitPrice": 5.50}
POST /sales/2
{"productCode": "B", "productTitle": "Gizmo", "quantity": 10, "unitPrice": 1.10}
POST /sales/3
{"productCode": "C", "productTitle": "Spanner", "quantity": 5, "unitPrice": 9.00}
POST /sales/4
{"productCode": "A", "productTitle": "Widget", "quantity": 15, "unitPrice": 5.40}
POST /sales/5
{"productCode": "B", "productTitle": "Gizmo", "quantity": 20, "unitPrice": 1.00}
POST /sales/6
{"productCode": "B", "productTitle": "Gizmo", "quantity": 30, "unitPrice": 0.90}
POST /sales/7
{"productCode": "B", "productTitle": "Gizmo", "quantity": 40, "unitPrice": 0.80}
POST /sales/8
{"productCode": "C", "productTitle": "Spanner", "quantity": 100, "unitPrice": 7.50}
POST /sales/9
{"productCode": "C", "productTitle": "Spanner", "quantity": 200, "unitPrice": 5.50}
What query would I need to generate the following results?
a) Show the number of documents grouped by product code, i.e.
Product code  Title    Count
A             Widget   2
B             Gizmo    4
C             Spanner  3
b) Show the total units sold by product code, i.e.
Product code  Title    Total units sold
A             Widget   20
B             Gizmo    100
C             Spanner  305
TIA

You can accomplish that using aggregations, in particular the terms aggregation. Both results can be produced in a single request: to have ES generate analytic data, include an aggregations object (or aggs) in your query and specify within it the type of aggregations you would like ES to run on your data. Here, a terms aggregation on productCode creates one bucket per product code, and a nested sum aggregation on quantity totals the units sold within each bucket.
{
"query": {
"match_all": {}
},
"aggs": {
"group_by_product": {
"terms": {
"field": "productCode"
},
"aggs": {
"units_sold": {
"sum": {
"field": "quantity"
}
}
}
}
}
}
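To make the intent of the query above concrete, here is a minimal Python sketch that mimics the same grouping locally over the nine sample documents from the question. It does not talk to Elasticsearch; the function name group_by_product and the tuple layout are assumptions chosen for illustration only.

```python
from collections import defaultdict

# The nine sample sales from the question: (productCode, productTitle, quantity).
sales = [
    ("A", "Widget", 5), ("B", "Gizmo", 10), ("C", "Spanner", 5),
    ("A", "Widget", 15), ("B", "Gizmo", 20), ("B", "Gizmo", 30),
    ("B", "Gizmo", 40), ("C", "Spanner", 100), ("C", "Spanner", 200),
]

def group_by_product(rows):
    """Mimic one terms bucket per productCode with a nested sum on quantity."""
    buckets = defaultdict(lambda: {"doc_count": 0, "units_sold": 0})
    for code, _title, quantity in rows:
        buckets[code]["doc_count"] += 1          # what the terms bucket counts
        buckets[code]["units_sold"] += quantity  # what the sum sub-aggregation adds up
    return dict(buckets)

result = group_by_product(sales)
for code in sorted(result):
    print(code, result[code]["doc_count"], result[code]["units_sold"])
```

For the question's data this yields the counts and totals from tables a) and b): two documents and 20 units for A, four and 100 for B, three and 305 for C.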
Running the search request above returns, besides the hits from your search (in this case a match_all), an additional aggregations object within the response, holding the aggregation results. For example:
{
...
"hits": {
"total": 6,
"max_score": 1,
"hits": [ ... ]
},
"aggregations": {
"group_by_product": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "b",
"doc_count": 3,
"units_sold": {
"value": 60
}
},
{
"key": "a",
"doc_count": 2,
"units_sold": {
"value": 20
}
},
{
"key": "c",
"doc_count": 1,
"units_sold": {
"value": 5
}
}
]
}
}
}
I omitted some details from the response object for brevity, and to highlight the important part, which is the aggregations object. The aggregated data consists of buckets, one for each distinct product code found in your documents (identified by key). doc_count holds the number of documents per product code, and the units_sold object holds the total units sold for that product code.
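As a rough illustration of how a client might read those buckets, the following Python sketch turns the aggregations part of the sample response into plain rows. The helper name buckets_to_rows is hypothetical, and the response dictionary is a trimmed copy of the example above, not output from a live cluster.

```python
# Trimmed copy of the aggregations part of the sample response shown above.
response = {
    "aggregations": {
        "group_by_product": {
            "buckets": [
                {"key": "b", "doc_count": 3, "units_sold": {"value": 60}},
                {"key": "a", "doc_count": 2, "units_sold": {"value": 20}},
                {"key": "c", "doc_count": 1, "units_sold": {"value": 5}},
            ]
        }
    }
}

def buckets_to_rows(resp, agg_name="group_by_product", metric="units_sold"):
    """Return (key, doc_count, metric value) tuples, one per bucket."""
    buckets = resp["aggregations"][agg_name]["buckets"]
    return [(b["key"], b["doc_count"], b[metric]["value"]) for b in buckets]

rows = buckets_to_rows(response)
for key, count, units in rows:
    print(key, count, units)
```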
One important thing to keep in mind is that, in order to run aggregations on string or text fields, you need to enable the fielddata setting in the field mapping, as that setting is disabled by default on all text-based fields. To update the mapping, e.g. for the product code field, you just need to send a PUT request to the corresponding mapping type within the index, for example:
PUT http://localhost:9200/sales/sale/_mapping
{
"properties": {
"productCode": {
"type": "string",
"fielddata": true
}
}
}
(more info about the fielddata setting)

Related

Want to get distinct records in hits section from elasticsearch

I want to get all the distinct records by "departmentNo".
Please check the index data below (it is dummy data):
{'departmentNo': 1, 'departmentName': 'Food', 'departmentLoc': "I1", "departmentScore": "5", "employeeid" : 1, "employeeName": "vijay", ...}
{'departmentNo': 1, 'departmentName': 'Food', 'departmentLoc': "I1", "departmentScore": "5", "employeeid" : 2, "employeeName": "rathod", ...}
{'departmentNo': 2, 'departmentName': 'Non-Food', 'departmentLoc': "I2", "departmentScore": "6", "employeeid" : 3, "employeeName": "ajay", ...}
{'departmentNo': 2, 'departmentName': 'Non-Food', 'departmentLoc': "I2", "departmentScore": "6", "employeeid" : 4, "employeeName": "kamal", ...}
{'departmentNo': 1, 'departmentName': 'Food', 'departmentLoc': "I1", "departmentScore": "5", "employeeid" : 5, "employeeName": "rahul", ...}
I want the below output.
{'departmentNo': 1, 'departmentName': 'Food', 'departmentLoc': "I1", "departmentScore": "5", "employeeid" : 1, "employeeName": "vijay", ...}
{'departmentNo': 2, 'departmentName': 'Non-Food', 'departmentLoc': "I2", "departmentScore": "6", "employeeid" : 3, "employeeName": "ajay", ...}
I was trying to get the data in the hits section, but didn't find a way.
So I tried an aggregation, using the query below:
{
"size": 0,
"aggs": {
"Group_By_Dept": {
"terms": {
"field": "departmentNo"
},
"aggs": {
"group_docs": {
"top_hits": {
"size": 1
}
}
}
}
}
}
The above query returns the data, but I want all the distinct records, with support for pagination and sorting.
In Elasticsearch 6.0 we could use bucket_sort, but I am using 5.6.7, so I can't use bucket_sort.
Can I do it in any other way?
Getting the data in the hits section would be ideal.
(I don't want to change my index mapping. The mapping shown here is a dummy, but the use case is the same.)
You can do that by using field collapsing:
{
"query": { ... },
"from": 153,
"size": 27,
"collapse": {
"field": "departmentNo"
}
}
This leaves only one document for each distinct value of that field. You can control which document is kept using a standard sort (i.e. the top-sorted document within each collapsed group is returned).
Please note that there is additional functionality called inner hits, which you may want to use in the future. Be aware that it multiplies document fetches and negatively affects performance.
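The effect of collapsing combined with sort and from/size can be sketched locally in Python. This is only a simulation of the behavior described above, not how Elasticsearch implements it; the collapse function, its parameters, and the reduced field set of the dummy documents are assumptions for illustration.

```python
# Dummy employees from the question, reduced to the fields needed here.
employees = [
    {"departmentNo": 1, "departmentName": "Food", "employeeid": 1, "employeeName": "vijay"},
    {"departmentNo": 1, "departmentName": "Food", "employeeid": 2, "employeeName": "rathod"},
    {"departmentNo": 2, "departmentName": "Non-Food", "employeeid": 3, "employeeName": "ajay"},
    {"departmentNo": 2, "departmentName": "Non-Food", "employeeid": 4, "employeeName": "kamal"},
    {"departmentNo": 1, "departmentName": "Food", "employeeid": 5, "employeeName": "rahul"},
]

def collapse(docs, field, sort_key, start=0, size=10):
    """Keep the first document per distinct `field` value in sort order, then page."""
    seen = set()
    collapsed = []
    for doc in sorted(docs, key=sort_key):
        if doc[field] not in seen:
            seen.add(doc[field])
            collapsed.append(doc)
    # start/size play the role of "from"/"size" in the search request.
    return collapsed[start:start + size]

page = collapse(employees, "departmentNo", sort_key=lambda d: d["employeeid"])
for doc in page:
    print(doc["departmentNo"], doc["employeeName"])
```

Sorting by employeeid ascending keeps vijay for department 1 and ajay for department 2, which matches the output requested in the question.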

Merge / flatten sub aggs into main agg

Is there a way in Elasticsearch to get the results back in a flattened form (with multiple child/sub aggs)?
For instance, I am currently trying to get back all product types and their status (online / offline).
This is what i end up with:
aggs
[
  { key: SuperProduct, doc_count:3, subagg:[
      {status:online, doc_count:1},
      {status:offline, doc_count:2}
    ]
  },
  { key: SuperProduct2, doc_count:10, subagg:[
      {status:online, doc_count:7},
      {status:offline, doc_count:3}
    ]
  }
]
Charting libraries tend to like it flattened, so I was wondering if Elasticsearch could provide it in this sort of manner:
[
{ products_key: 'SuperProduct', status_key:'online', doc_count:1},
{ products_key: 'SuperProduct', status_key:'offline', doc_count:2},
{ products_key: 'SuperProduct2', status_key:'online', doc_count:7},
{ products_key: 'SuperProduct2', status_key:'offline', doc_count:3}
]
Thanks
It is possible with composite aggregation which you can use to link two terms aggregations:
// POST /i/_search
{
"size": 0,
"aggregations": {
"distribution": {
"composite": {
"sources": [
{"product": {"terms": {"field": "product.keyword"}}},
{"status": {"terms": {"field": "status.keyword"}}}
]
}
}
}
}
This results in following structure:
{
"aggregations": {
"distribution": {
"after_key": {
"product": "B",
"status": "online"
},
"buckets": [
{
"key": {
"product": "A",
"status": "offline"
},
"doc_count": 3
},
{
"key": {
"product": "A",
"status": "online"
},
"doc_count": 2
},
{
"key": {
"product": "B",
"status": "offline"
},
"doc_count": 1
},
{
"key": {
"product": "B",
"status": "online"
},
"doc_count": 4
}
]
}
}
}
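Turning that structure into the flat rows the question asks for takes only a small client-side step. The Python sketch below shows one way to do it; flatten_composite is a hypothetical helper, the "_key" suffix is an assumption made to resemble the desired output, and the response dictionary is a copy of the example above.

```python
# Copy of the composite aggregation response shown above.
composite_response = {
    "aggregations": {
        "distribution": {
            "after_key": {"product": "B", "status": "online"},
            "buckets": [
                {"key": {"product": "A", "status": "offline"}, "doc_count": 3},
                {"key": {"product": "A", "status": "online"}, "doc_count": 2},
                {"key": {"product": "B", "status": "offline"}, "doc_count": 1},
                {"key": {"product": "B", "status": "online"}, "doc_count": 4},
            ],
        }
    }
}

def flatten_composite(resp, agg_name="distribution"):
    """Merge each bucket's composite key and doc_count into one flat dict."""
    rows = []
    for bucket in resp["aggregations"][agg_name]["buckets"]:
        row = {f"{source}_key": value for source, value in bucket["key"].items()}
        row["doc_count"] = bucket["doc_count"]
        rows.append(row)
    return rows

flat_rows = flatten_composite(composite_response)
for row in flat_rows:
    print(row)
```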
If for any reason the composite aggregation doesn't fulfill your needs, you can create (via copy_to or by concatenation) or simulate (via scripted fields) a field that uniquely identifies the bucket. In our project we went with concatenation (partly because we needed to collapse on this field), e.g. {"bucket": "SuperProductA:online"}, which results in dirtier output (you'll have to decode that field back or use top hits to get the original values) but still does the job.

Elasticsearch Sorting Tiebreakers

Say I am creating a search engine for a photo sharing social network and the documents of the site have the following schema
{
"id": 123456,
"name": "Foo",
"num_followers": 123456,
"num_photos": 123456
}
I would like my search results to satisfy the following requirements:
Only have results where the search query strings matches the "name" field in the document
Rank the search results by number of followers descending
In the case where multiple customers have the same number of followers, rank by number of photos descending
For example, say I have the following documents in my index:
{
"id": 1,
"name": "Customer",
"num_followers": 3,
"num_photos": 27
}
{
"id": 2,
"name": "Customer",
"num_followers": 25,
"num_photos": 1
}
{
"id": 3,
"name": "Customer",
"num_followers": 8,
"num_photos": 2
}
{
"id": 4,
"name": "Customer",
"num_followers": 8,
"num_photos": 5
}
{
"id": 5,
"name": "FooBar",
"num_followers": 10000,
"num_photos": 20000
}
If I search "Customer" in the search bar of the site, the ES hits should be in the following order:
{
"id": 2,
"name": "Customer",
"num_followers": 25,
"num_photos": 1
}
{
"id": 4,
"name": "Customer",
"num_followers": 8,
"num_photos": 5
}
{
"id": 3,
"name": "Customer",
"num_followers": 8,
"num_photos": 2
}
{
"id": 1,
"name": "Customer",
"num_followers": 3,
"num_photos": 27
}
I'm assuming I will need to perform some sort of compact query to create this "tiebreaker" logic. What clauses should I be using? If anyone had an example of something similar that would be amazing. Thanks in advance.
This sounds like a pretty standard sorting use case. Elasticsearch can sort on multiple fields in a predefined priority order. See documentation here.
GET /my_index/_search
{
"sort" : [
{ "num_followers" : {"order" : "desc"}},
{ "num_photos" : "desc" }
],
"query" : {
"term" : { "name" : "Customer" }
}
}
Obviously this is just a simple term query -- you may want that to be a keyword search instead based on the wording of your question.
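The ordering produced by a two-field sort like this can be checked locally. The Python sketch below applies the same priority (followers descending, then photos descending) to the sample documents from the question; the search function and its exact-name filter are simplifications assumed for illustration and stand in for the real query.

```python
# Sample documents from the question.
customers = [
    {"id": 1, "name": "Customer", "num_followers": 3, "num_photos": 27},
    {"id": 2, "name": "Customer", "num_followers": 25, "num_photos": 1},
    {"id": 3, "name": "Customer", "num_followers": 8, "num_photos": 2},
    {"id": 4, "name": "Customer", "num_followers": 8, "num_photos": 5},
    {"id": 5, "name": "FooBar", "num_followers": 10000, "num_photos": 20000},
]

def search(docs, name):
    """Filter by exact name, then sort by followers desc with photos desc as tiebreaker."""
    matches = [d for d in docs if d["name"] == name]
    return sorted(matches, key=lambda d: (-d["num_followers"], -d["num_photos"]))

ordered_ids = [d["id"] for d in search(customers, "Customer")]
print(ordered_ids)
```

The tuple key plays the same role as the ordered entries in the sort array: the second element only matters when the first one ties, as it does for ids 3 and 4.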

SonarQube Component Tree response data

I'm having some trouble understanding some of the data in the response from the SonarQube GET api/measures/component_tree API.
Some metrics have a value attribute while others don't. I've figured out that the value displayed in the UI is the "value" unless it does not exist, then the value at the earliest period is used. The other periods are then basically deltas between measurements. Would anyone be able to provide some details around what the response values actually mean? Unfortunately, the actual API documentation that SonarQube provides doesn't give any detail around response data. Specifically, I'm wondering when a value attribute would and would not be there, what the index means since not all have the same indexes (ie. some go 1-4, others have just 3,4), and what the period data represents.
{
"metric": "new_lines_to_cover",
"periods": [
{
"index": 1,
"value": "572"
},
{
"index": 2,
"value": "572"
},
{
"index": 3,
"value": "8206"
},
{
"index": 4,
"value": "186574"
}
]
},
{
"metric": "duplicated_lines",
"value": "80819",
"periods": [
{
"index": 1,
"value": "-158"
},
{
"index": 2,
"value": "-158"
},
{
"index": 3,
"value": "-10544"
},
{
"index": 4,
"value": "-6871"
}
]
},
{
"metric": "new_line_coverage",
"periods": [
{
"index": 3,
"value": "3.9900249376558605"
},
{
"index": 4,
"value": "17.221615720524017"
}
]
},
The heuristic is very close to the truth:
if the metric name starts with "new_", it is a metric that computes new elements over a period of time. Starting with 6.3, only the leak period is supported
otherwise, the "value" represents the raw value.
For example, to compute the number of issues:
violations computes the total number of issues
new_violations computes the number of new issues on the leak period
To know more about the leak period concept in SonarQube, please check this article.
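The heuristic from the question (use "value" when present, otherwise fall back to the earliest period) can be written down as a short Python sketch. This encodes only the asker's observation, not documented SonarQube behavior; displayed_value is a hypothetical helper and the measures are abbreviated from the sample response.

```python
# Measures abbreviated from the sample response in the question.
measures = [
    {"metric": "new_lines_to_cover",
     "periods": [{"index": 1, "value": "572"}, {"index": 4, "value": "186574"}]},
    {"metric": "duplicated_lines", "value": "80819",
     "periods": [{"index": 1, "value": "-158"}, {"index": 4, "value": "-6871"}]},
    {"metric": "new_line_coverage",
     "periods": [{"index": 3, "value": "3.9900249376558605"},
                 {"index": 4, "value": "17.221615720524017"}]},
]

def displayed_value(measure):
    """Prefer the raw "value"; otherwise use the period with the lowest index."""
    if "value" in measure:
        return measure["value"]
    earliest = min(measure["periods"], key=lambda p: p["index"])
    return earliest["value"]

shown = {m["metric"]: displayed_value(m) for m in measures}
for metric, value in shown.items():
    print(metric, value)
```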

Elastic Search. Search by sub-collection value

Need help with specific ES query.
I have objects at Elastic Search index. Example of one of them (Participant):
{
"_id": null,
"ObjectID": 6008,
"EventID": null,
"IndexName": "crmws",
"version_id": 66244,
"ObjectData": {
"PARTICIPANTTYPE": "2",
"STATE": "ACTIVE",
"EXTERNALID": "01010111",
"CREATORID": 1006,
"partAttributeList":
[
{
"SYSNAME": "A",
"VALUE": "V1"
},
{
"SYSNAME": "B",
"VALUE": "V2"
},
{
"SYSNAME": "C",
"VALUE": "V2"
}
],
....
I need to find only the entities whose partAttributeList contains a matching element. For example, the whole Participant entity where SYSNAME=A and VALUE=V1 occur in the same partAttributeList element.
If I use the usual matches:
{"match": {"ObjectData.partAttributeList.SYSNAME": "A"}},
{"match": {"ObjectData.partAttributeList.VALUE": "V1"}}
Of course I will find more objects than I really need. Example of a redundant object that can be found:
...
{
"SYSNAME": "A",
"VALUE": "X"
},
{
"SYSNAME": "B",
"VALUE": "V1"
}..
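The over-matching described above can be reproduced locally with a small Python sketch. It assumes partAttributeList is treated as a plain array of objects, so each match clause is checked against all elements independently; both function names are hypothetical and nothing here calls Elasticsearch.

```python
# The wanted document and the redundant one from the question (attributes only).
wanted = {"partAttributeList": [
    {"SYSNAME": "A", "VALUE": "V1"},
    {"SYSNAME": "B", "VALUE": "V2"},
    {"SYSNAME": "C", "VALUE": "V2"},
]}
redundant = {"partAttributeList": [
    {"SYSNAME": "A", "VALUE": "X"},
    {"SYSNAME": "B", "VALUE": "V1"},
]}

def matches_independently(doc, sysname, value):
    """Two separate conditions: each may be satisfied by a different element."""
    attrs = doc["partAttributeList"]
    return (any(a["SYSNAME"] == sysname for a in attrs)
            and any(a["VALUE"] == value for a in attrs))

def matches_same_element(doc, sysname, value):
    """Both conditions must hold within one and the same element."""
    return any(a["SYSNAME"] == sysname and a["VALUE"] == value
               for a in doc["partAttributeList"])

print(matches_independently(redundant, "A", "V1"))  # the unwanted match
print(matches_same_element(redundant, "A", "V1"))   # the behavior being asked for
```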
As I understand it, you are trying to search multiple fields of the same object for exact matches of a piece of text, so please try this out:
https://www.elastic.co/guide/en/elasticsearch/guide/current/multi-query-strings.html
