Aggregate over top hits ElasticSearch - elasticsearch

My documents are structured in the following way:
{
  "chefInfo": {
    "id": int,
    "employed": String,
    ... Some more recipe information ...
  },
  "recipe": {
    ... Some recipe information ...
  }
}
If a chef has multiple recipes, the nested chefInfo block will be identical in each document. My problem is that I want to aggregate on a field in the chefInfo part of the document, but the aggregation doesn't take into account the fact that the chefInfo block is duplicated.
So, if the chef with the id of 1 is on 5 recipes and I am aggregating on the employed field, then this particular chef will account for 5 of the counts in the aggregation, whereas I want them to count only once.
I thought about doing a top_hits aggregation on the chef_id and then a sub-aggregation over all of the buckets, but I can't work out how to do the counts over the results of all the buckets.
Is what I want to do possible?

For Elasticsearch, every document is unique in itself. In your case you want to define uniqueness based on a different field, here chefInfo.id. To get a unique count based on this field, you have to use a cardinality aggregation.
You can apply the aggregation as below (the nested wrapper assumes chefInfo is mapped as a nested type):
{
  "aggs": {
    "employed": {
      "nested": {
        "path": "chefInfo"
      },
      "aggs": {
        "employed": {
          "terms": {
            "field": "chefInfo.employed.keyword"
          },
          "aggs": {
            "employed_unique": {
              "cardinality": {
                "field": "chefInfo.id"
              }
            }
          }
        }
      }
    }
  }
}
In the result, employed_unique gives you the expected count for each employed value. Note that the cardinality aggregation returns an approximate count.
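To make clear what the terms plus cardinality combination computes, here is a minimal Python sketch that reproduces the same counting in memory. The sample documents and the function name unique_chefs_per_employed are assumptions made for illustration only; unlike Elasticsearch, this sketch counts exactly.

```python
from collections import defaultdict


def unique_chefs_per_employed(docs):
    """Count distinct chefInfo.id values for each chefInfo.employed value."""
    ids_by_employed = defaultdict(set)
    for doc in docs:
        chef = doc["chefInfo"]
        # A set keeps each chef id once, however many recipes repeat it.
        ids_by_employed[chef["employed"]].add(chef["id"])
    return {employed: len(ids) for employed, ids in ids_by_employed.items()}


# Hypothetical sample data: chef 1 appears on three recipes, chef 2 on one.
docs = [
    {"chefInfo": {"id": 1, "employed": "yes"}, "recipe": {"name": "soup"}},
    {"chefInfo": {"id": 1, "employed": "yes"}, "recipe": {"name": "stew"}},
    {"chefInfo": {"id": 1, "employed": "yes"}, "recipe": {"name": "salad"}},
    {"chefInfo": {"id": 2, "employed": "no"}, "recipe": {"name": "bread"}},
]

unique_counts = unique_chefs_per_employed(docs)
```

A plain terms aggregation would report 3 documents for "yes" here; the per-bucket distinct count is what the employed_unique sub-aggregation adds.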

Related

Elasticsearch agg filter using an array of values

{ "colors":["red","black","blue"] }
{ "colors":["red","black"] }
{ "colors":["red"] }
{ "colors":["orange","green"] }
{ "colors":["purple"] }
How can I run an agg that filters for specific values contained in the array field?
For example, I only want the count of "red" and wish to exclude its other siblings from the aggregation result.
Note: I cannot use an "include" pattern for "red". This example is simplistic, the real-world example has a long list of string values that are unique.
I would like to filter the agg using an array of string values.
From the docs:
For matching based on exact values the include and exclude parameters can simply take an array of strings that represent the terms as they are found in the index:
{
"aggs": {
"colors": {
"terms": {
"field": "colors",
"include": [ "red","black" ]
}
}
}
}
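As a rough illustration of how an exact-value include list restricts the buckets, the following Python sketch mimics the counting in memory. The function name terms_with_include and the sample data are assumptions for this example, not part of Elasticsearch.

```python
from collections import Counter


def terms_with_include(docs, field, include):
    """Count documents per term, keeping only the terms listed in include."""
    wanted = set(include)
    counts = Counter()
    for doc in docs:
        # A document contributes at most once per distinct term it contains.
        for term in set(doc.get(field, [])):
            if term in wanted:
                counts[term] += 1
    return dict(counts)


# Hypothetical sample data modelled on the documents in the question.
docs = [
    {"colors": ["red", "black", "blue"]},
    {"colors": ["red", "black"]},
    {"colors": ["red"]},
    {"colors": ["orange", "green"]},
    {"colors": ["purple"]},
]

red_only = terms_with_include(docs, "colors", ["red"])
red_and_black = terms_with_include(docs, "colors", ["red", "black"])
```

Sibling values such as "blue" never produce a bucket, even though they appear in the same documents as "red".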

ElasticSearch how to get docs with 10 or more fields in them?

I want to get all docs that have 10 or more fields in them. I'm guessing something like this:
{
"query": {
"range": {
"fields": {
"gt": 1000
}
}
}
}
What you can do is run a script query like this:
{
"query": {
"script": {
"script": {
"source": "params._source.size() >= 10"
}
}
}
}
However, be advised that depending on the number of documents you have and the hardware that supports your cluster, this can negatively impact the performance of your cluster.
A better idea would be to add another integer field that contains the number of fields that the document contains, so you can simply run a range query on it, like in your question.
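A minimal sketch of that idea in Python, assuming the count is computed in the application before indexing. The field name field_count and both helper names are hypothetical, and only top-level fields are counted.

```python
def with_field_count(doc, count_field="field_count"):
    """Return a copy of doc with an extra integer field holding its
    number of top-level fields (the added field itself is not counted)."""
    enriched = dict(doc)
    enriched[count_field] = len(doc)
    return enriched


def min_fields_query(minimum, count_field="field_count"):
    """Build a range query body matching documents with at least `minimum` fields."""
    return {"query": {"range": {count_field: {"gte": minimum}}}}


enriched = with_field_count({"title": "a", "tags": ["x"], "views": 3})
query_body = min_fields_query(10)
```

The enriched document is what would be indexed, and query_body is the request body for the search, so no script has to run at query time.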
According to the documentation for the _source field, you can either do it as shown above or you cannot get results based on the field count at all.
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html

How to find all duplicate documents in ElasticSearch

We have a need to walk over all of the documents in our AWS ElasticSearch cluster, version 6.0, and gather a count of all the duplicate user ids.
I have tried using a Data Visualization to aggregate counts on the user ids and export them, but the numbers don't match another source of our data that is searchable via traditional SQL.
What we would like to see is something like this:
USER ID     COUNT
userid1     4
userid22    3
...
I am not an advanced Lucene query person and have yet to find an answer to this question. If anyone can provide some insight into how to do this, I would be appreciative.
The following query counts the documents for each id and filters out the ids that occur fewer than 2 times, so you'll get something along the lines of:
id:2, count:2
id:4, count:15
GET /index/_search
{
"query":{
"match_all":{}
},
"aggs":{
"user_id":{
"terms":{
"field":"user_id",
"size":100000,
"min_doc_count":2
}
}
}
}
More here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html
If you want to get all duplicate userids with their counts:
First, you need to know the size to use for the terms aggregation.
Find the number of distinct userid values with a cardinality aggregation:
GET index/type/_search
{
"size": 0,
"aggs": {
"maximum_match_counts": {
"cardinality": {
"field": "userid",
"precision_threshold": 100
}
}
}
}
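Take the value returned by the maximum_match_counts aggregation.
Now you can get all duplicate userids by substituting that value for the maximum_match_counts placeholder in the size parameter below: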
GET index/type/_search
{
"size": 0,
"aggs": {
"userIds": {
"terms": {
"field": "userid",
"size": maximum_match_counts,
"min_doc_count": 2
}
}
}
}
When you use a terms aggregation (Bharat's suggestion) and set the aggregation size to more than 10K, you will get a warning that this approach will throw an error in future releases.
Instead of a terms aggregation, you should use a composite aggregation to scan all of your documents with the pagination/after_key method.
The composite aggregation can be used to paginate all buckets from a multi-level aggregation efficiently. This aggregation provides a way to stream all buckets of a specific aggregation, similarly to what scroll does for documents.
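The pagination loop could look roughly like the Python sketch below, assuming a cluster version that supports the composite aggregation. The search argument is a hypothetical callable that sends a request body to the index's _search endpoint and returns the parsed JSON response; the names find_duplicates, ids, and by_value are also made up for this example. Because the composite aggregation has no min_doc_count option, the filtering happens on the client.

```python
def find_duplicates(search, field="user_id", page_size=1000):
    """Page through a composite aggregation and return {value: doc_count}
    for every value of `field` that occurs in two or more documents.

    `search` is a hypothetical callable: request body (dict) -> response (dict).
    """
    duplicates = {}
    after_key = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{"by_value": {"terms": {"field": field}}}],
        }
        if after_key is not None:
            # Resume right after the last bucket of the previous page.
            composite["after"] = after_key
        body = {"size": 0, "aggs": {"ids": {"composite": composite}}}
        result = search(body)["aggregations"]["ids"]
        for bucket in result["buckets"]:
            # Keep only values seen at least twice (client-side filter).
            if bucket["doc_count"] >= 2:
                duplicates[bucket["key"]["by_value"]] = bucket["doc_count"]
        after_key = result.get("after_key")
        if not result["buckets"] or after_key is None:
            break
    return duplicates
```

With a real client, search would wrap whatever HTTP or client-library call you already use; the loop itself only depends on the buckets and after_key fields of the response.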

Write a query which sort term aggregations buckets based on inner sorted top_hits results

I am having trouble writing a specific query in Elasticsearch.
The context:
I have an index where each document represents a "SKU": a variant of a product (identified by pId).
For example, the first 3 documents are color and price variants of product 235.
BS stands for "Best SKU": for a given product, SKUs are ranked from the most representative to the least representative.
After a search, only the best SKUs matching the search should be used for further sorting or aggregations.
This is a script to create a test index:
POST /test/skus/DOC_1
{
"pId":235,
"BS":3,
"color":"red",
"price":59.00
}
POST /test/skus/DOC_2
{
"pId":235,
"BS":2,
"color":"red",
"price":29.00
}
POST /test/skus/DOC_3
{
"pId":235,
"BS":1,
"color":"green",
"price":69.00
}
POST /test/skus/DOC_4
{
"pId":236,
"BS":2,
"color":"blue",
"price":19.00
}
POST /test/skus/DOC_5
{
"pId":236,
"BS":1,
"color":"red",
"price":99.00
}
POST /test/skus/DOC_6
{
"pId":236,
"BS":3,
"color":"red",
"price":39.00
}
POST /test/skus/DOC_7
{
"pId":237,
"BS":2,
"color":"red",
"price":10.00
}
POST /test/skus/DOC_8
{
"pId":237,
"BS":1,
"color":"blue",
"price":50.00
}
POST /test/skus/DOC_9
{
"pId":237,
"BS":3,
"color":"green",
"price":20.00
}
The query I'm trying to write should search for, say, the red SKUs, aggregate them by product (using a terms aggregation on pId), retain only the best SKU in each bucket, and THEN sort those buckets by the price of the best SKU.
Here is what I've got so far:
GET /test/skus/_search
{
  "size": 0,
  "query": {
    "term": {
      "color": {
        "value": "red"
      }
    }
  },
  "aggs": {
    "bypId": {
      "terms": {
        "field": "pId",
        "size": 10
      },
      "aggs": {
        "mytophits": {
          "top_hits": {
            "size": 1,
            "sort": ["BS"]
          }
        }
      }
    }
  }
}
From here, I don't know how to sort the buckets by price.
I've taken some screenshots to better explain what I'm trying to achieve: screenshot1, screenshot2, screenshot3, screenshot4, screenshot5.
Update: Still stuck.
An answer that tells me that it is not possible to do such a thing is also welcome :)
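One possible workaround, not confirmed anywhere in this thread, is to run the query above and sort the returned buckets in the application. The Python sketch below assumes the response shape produced by that query (a bypId terms aggregation with a mytophits sub-aggregation); the function name and the sample response are hypothetical. It only orders the buckets that were returned, so the terms size must be large enough to cover every matching product.

```python
def sort_buckets_by_best_sku_price(response, terms_agg="bypId", hits_agg="mytophits"):
    """Order product buckets by the price of their single top hit (the best SKU)."""
    buckets = response["aggregations"][terms_agg]["buckets"]

    def best_price(bucket):
        # top_hits was requested with size 1, so the first hit is the best SKU.
        return bucket[hits_agg]["hits"]["hits"][0]["_source"]["price"]

    return sorted(buckets, key=best_price)


def _bucket(p_id, bs, price):
    """Build a trimmed-down, hypothetical bucket for the sample response."""
    source = {"pId": p_id, "BS": bs, "color": "red", "price": price}
    return {"key": p_id, "mytophits": {"hits": {"hits": [{"_source": source}]}}}


# Best red SKU per product, derived from the test documents in the question.
sample_response = {
    "aggregations": {
        "bypId": {
            "buckets": [
                _bucket(235, 2, 29.00),
                _bucket(236, 1, 99.00),
                _bucket(237, 2, 10.00),
            ]
        }
    }
}

ordered_ids = [b["key"] for b in sort_buckets_by_best_sku_price(sample_response)]
```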

Elastic Search filter with aggregate like Max or Min

I have simple documents with a ScheduleId. I would like to get the count of documents for the most recent ScheduleId. Assuming the max ScheduleId is the most recent, how would we write that query? I have been searching and reading for a few hours and could not get it to work.
{
"aggs": {
"max_schedule": {
"max": {
"field": "ScheduleId"
}
}
}
}
That gets me the max ScheduleId and the total count of documents outside of that aggregate.
I would appreciate it if someone could help me with how to take this aggregate value and apply it as a filter (like a subquery in SQL!).
This should do it:
{
"aggs": {
"max_ScheduleId": {
"terms": {
"field": "ScheduleId",
"order" : { "_term" : "desc" },
"size": 1
}
}
}
}
The terms aggregation will give you document counts for each term, and it works for integers. You just need to order the results by the term instead of by the count (the default). And since you only want the highest ScheduleId, "size": 1 is adequate. On Elasticsearch 6.0 and later, use "_key" in place of "_term", which is deprecated.
Here is the code I used to test it:
http://sense.qbox.io/gist/93fb979393754b8bd9b19cb903a64027cba40ece
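Reading the answer out of the response could be sketched in Python as follows, assuming the usual terms aggregation response shape with a buckets list under the aggregation name. The helper name latest_schedule_count and the sample response are hypothetical.

```python
def latest_schedule_count(response, agg_name="max_ScheduleId"):
    """Return (ScheduleId, doc_count) for the first bucket, or None if there is none.

    With descending key order and size 1, the first bucket is the highest ScheduleId.
    """
    buckets = response["aggregations"][agg_name]["buckets"]
    if not buckets:
        return None
    top = buckets[0]
    return top["key"], top["doc_count"]


# Hypothetical response for an index whose highest ScheduleId is 42.
sample = {"aggregations": {"max_ScheduleId": {"buckets": [{"key": 42, "doc_count": 7}]}}}
latest = latest_schedule_count(sample)
```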
