Elasticsearch filter with aggregate like Max or Min

I have simple documents with a ScheduleId. I would like to get the count of documents for the most recent ScheduleId. Assuming the max ScheduleId is the most recent, how would we write that query? I have been searching and reading for a few hours and could not get it to work.
{
  "aggs": {
    "max_schedule": {
      "max": {
        "field": "ScheduleId"
      }
    }
  }
}
That gets me the max ScheduleId, and the total count of documents outside of that aggregate.
I would appreciate it if someone could help me with how to take this aggregate value and apply it as a filter (like a subquery in SQL!).

This should do it:
{
  "aggs": {
    "max_ScheduleId": {
      "terms": {
        "field": "ScheduleId",
        "order": { "_term": "desc" },
        "size": 1
      }
    }
  }
}
The terms aggregation will give you document counts for each term, and it works for integers. You just need to order the results by the term instead of by the count (the default). And since you only want the highest ScheduleId, "size": 1 is adequate.
Here is the code I used to test it:
http://sense.qbox.io/gist/93fb979393754b8bd9b19cb903a64027cba40ece
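One note for newer clusters: ordering by _term was deprecated in Elasticsearch 6.x and removed in 7.0 in favor of _key, so on a recent version the same aggregation would look like this:
{
  "aggs": {
    "max_ScheduleId": {
      "terms": {
        "field": "ScheduleId",
        "order": { "_key": "desc" },
        "size": 1
      }
    }
  }
}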

Related

Aggregate over top hits ElasticSearch

My documents are structured in the following way:
{
"chefInfo": {
"id": int,
"employed": String
... Some more recipe information ...
}
"recipe": {
... Some recipe information ...
}
}
If a chef has multiple recipes, the nested chefInfo block is identical in each document. My problem is that I want to aggregate on a field in the chefInfo part of the document, but this doesn't account for the fact that the chefInfo block is duplicated.
So, if the chef with id 1 is on 5 recipes and I aggregate on the employed field, then this particular chef will represent 5 of the counts in the aggregation, whereas I want them to count only once.
I thought about doing a top_hits aggregation on chef_id and then a sub-aggregation over all of the buckets, but I can't work out how to do the counts over the results of all the buckets.
Is it possible what I want to do?
For Elasticsearch, every document is unique in itself. In your case you want to define uniqueness based on a different field, here chefInfo.id. To find a unique count based on this field, you have to make use of the cardinality aggregation.
You can apply the aggregation as below:
{
  "aggs": {
    "employed": {
      "nested": {
        "path": "chefInfo"
      },
      "aggs": {
        "employed": {
          "terms": {
            "field": "chefInfo.employed.keyword"
          },
          "aggs": {
            "employed_unique": {
              "cardinality": {
                "field": "chefInfo.id"
              }
            }
          }
        }
      }
    }
  }
}
In the result, employed_unique gives you the expected count.
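One caveat worth adding: the cardinality aggregation is approximate (it uses HyperLogLog++), so with very large numbers of chefs the counts can be slightly off. The precision_threshold setting (up to 40000) trades memory for accuracy, e.g.:
"employed_unique": {
  "cardinality": {
    "field": "chefInfo.id",
    "precision_threshold": 40000
  }
}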

Elasticsearch: get documents only when value changes

I have an ES index with such kind of documents:
from_1,to_1,timestamp_1
from_1,to_1,timestamp_2
from_1,to_2,timestamp_3
from_2,to_3,timestamp_4
from_1,to_2,timestamp_5
from_2,to_3,timestamp_6
from_1,to_1,timestamp_7
from_2,to_4,timestamp_8
I need a query that would return a document only if its combination of from and to values is different from that of the previously seen document with the same from value.
So with the provided sample above:
The document with timestamp_1 should be in the result because there is no earlier document with the from_1+to_1 combination.
The document with timestamp_2 must be skipped because its from+to combination is exactly the same as the last seen document with from = from_1.
The document with timestamp_3 should be in the result because its to field (to_2) is different from the value of the last seen document with the same from (to_1 in the document with timestamp_1).
The document with timestamp_4 should be in the result.
The document with timestamp_5 must not be in the result because it has the same from+to combination as the last seen document with from_1 (the document with timestamp_3).
The document with timestamp_6 must not be in the result because it has the same from+to combination as the last seen document with from_2 (the document with timestamp_4).
The document with timestamp_7 should be in the result because it has a different from+to combination than the last seen document with from_1 (the document with timestamp_3).
The document with timestamp_8 should be in the result because its combination is completely new so far.
I need to fetch all such "semi-unique" documents from the index, so it would be nice if it were possible to use a scroll request, or after_key if an aggregation is used.
Any ideas on how to approach this?
The closest thing I could come up with is the following (let me know if it does not work with your data).
{
  "size": 0,
  "aggs": {
    "from_and_to": {
      "composite": {
        "size": 5,
        "sources": [
          {
            "from_to_collected": {
              "terms": {
                "script": {
                  "lang": "painless",
                  "source": "doc['from'].value + '_' + doc['to'].value"
                }
              }
            }
          }
        ]
      },
      "aggs": {
        "top_from_and_to_hits": {
          "top_hits": {
            "size": 1,
            "sort": [{ "timestamp": { "order": "asc" } }],
            "_source": { "includes": ["_id"] }
          }
        }
      }
    }
  }
}
Keep in mind that doc counts from the terms aggregation are approximate when the documents are spread across shards.
The composite aggregation lets you page to the next set of buckets using the after_key returned for from_to_collected.
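A follow-up page would then look something like this sketch, feeding the after_key value from the previous response back in via after (the value shown is a placeholder consistent with the sample data):
{
  "size": 0,
  "aggs": {
    "from_and_to": {
      "composite": {
        "size": 5,
        "after": { "from_to_collected": "from_1_to_2" },
        "sources": [
          {
            "from_to_collected": {
              "terms": {
                "script": {
                  "lang": "painless",
                  "source": "doc['from'].value + '_' + doc['to'].value"
                }
              }
            }
          }
        ]
      },
      "aggs": {
        "top_from_and_to_hits": {
          "top_hits": {
            "size": 1,
            "sort": [{ "timestamp": { "order": "asc" } }],
            "_source": { "includes": ["_id"] }
          }
        }
      }
    }
  }
}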

Write a query which sorts terms aggregation buckets based on inner sorted top_hits results

I'm having trouble writing a specific query in Elasticsearch.
The context:
I have an index where each document represents an "SKU": a variant of a product (identified by pId).
For example, the first 3 documents are color and price variants of product 235.
BS stands for "Best SKU": for a given product, SKUs are ranked from the most representative to the least representative.
After a search, only the best SKUs matching the search should be used for further sorting or aggregations.
Here is a script to create a test index:
POST /test/skus/DOC_1
{
  "pId": 235,
  "BS": 3,
  "color": "red",
  "price": 59.00
}
POST /test/skus/DOC_2
{
  "pId": 235,
  "BS": 2,
  "color": "red",
  "price": 29.00
}
POST /test/skus/DOC_3
{
  "pId": 235,
  "BS": 1,
  "color": "green",
  "price": 69.00
}
POST /test/skus/DOC_4
{
  "pId": 236,
  "BS": 2,
  "color": "blue",
  "price": 19.00
}
POST /test/skus/DOC_5
{
  "pId": 236,
  "BS": 1,
  "color": "red",
  "price": 99.00
}
POST /test/skus/DOC_6
{
  "pId": 236,
  "BS": 3,
  "color": "red",
  "price": 39.00
}
POST /test/skus/DOC_7
{
  "pId": 237,
  "BS": 2,
  "color": "red",
  "price": 10.00
}
POST /test/skus/DOC_8
{
  "pId": 237,
  "BS": 1,
  "color": "blue",
  "price": 50.00
}
POST /test/skus/DOC_9
{
  "pId": 237,
  "BS": 3,
  "color": "green",
  "price": 20.00
}
The query I'm trying to write is one that searches for, say, the red SKUs, aggregates by product (using a terms aggregation on pId), only retains the best SKU in each bucket, and THEN sorts those buckets on the price of that best SKU.
Here is what I've got so far:
GET /test/skus/_search
{
  "size": 0,
  "query": {
    "term": {
      "color": {
        "value": "red"
      }
    }
  },
  "aggs": {
    "bypId": {
      "terms": {
        "field": "pId",
        "size": 10
      },
      "aggs": {
        "mytophits": {
          "top_hits": {
            "size": 1,
            "sort": ["BS"]
          }
        }
      }
    }
  }
}
From here, I don't know how to sort the buckets on that price.
I've made some screenshots to better explain what I'm trying to achieve.
Update: Still stuck.
An answer that tells me that it is not possible to do such a thing is also welcomed :)
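For what it's worth, one possible answer on newer Elasticsearch (7.7+, where top_metrics is available): the top_metrics aggregation can drive the order of its parent terms buckets, which seems to be exactly the missing piece. A sketch against the test index above:
GET /test/skus/_search
{
  "size": 0,
  "query": {
    "term": {
      "color": {
        "value": "red"
      }
    }
  },
  "aggs": {
    "bypId": {
      "terms": {
        "field": "pId",
        "size": 10,
        "order": { "best_sku.price": "asc" }
      },
      "aggs": {
        "best_sku": {
          "top_metrics": {
            "metrics": { "field": "price" },
            "sort": { "BS": "asc" }
          }
        }
      }
    }
  }
}
Each bypId bucket keeps only the price of its best-ranked (lowest BS) matching SKU, and the buckets are sorted by that price.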

How to calculate the overlap / elapsed time range in elasticsearch?

I have some records in ES; they are different online meeting records where people join/leave at different times.
{name:"p1", join:'2017-11-17T00:01:00.293Z', leave: "2017-11-17T00:06:00.293Z"}
{name:"p2", join:'2017-11-17T00:02:00.293Z', leave: "2017-11-17T00:04:00.293Z"}
{name:"p3", join:'2017-11-17T00:03:00.293Z', leave: "2017-11-17T00:05:00.293Z"}
Time range could be something like this:
p1: [============================================]
p2: [=================]
p3: [==================]
The question is how to calculate the overlapping time range (the common/shared meeting time), which should be 3 min here.
A further question: is it possible to know from when to when there are 1/2/3 people present? Here that would be 2 minutes with 2 people and 1 minute with 3 people.
I don't think it's possible with ES alone, simply because the search would have to visit every matching document and calculate across all of them.
I would do it in the following steps.
1. Before indexing a new document, search for existing documents that overlap it. Two intervals overlap when each one starts before the other ends, so in the query below the bounds are placeholders for the new document's leave and join times:
GET /meetings/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "join": {
              "lte": "2007-10-01T01:00:00"
            }
          }
        },
        {
          "range": {
            "leave": {
              "gte": "2007-10-01T00:00:00"
            }
          }
        }
      ]
    }
  }
}
2. Calculate everything you need on the back end for all documents that overlap.
3. Save the overlap metadata you need to the documents as a nested object.
You can do the first part easily using max(join) and min(leave):
GET your_index/your_type/_search
{
  "size": 0,
  "aggs": {
    "startTime": {
      "max": {
        "field": "join"
      }
    },
    "endTime": {
      "min": {
        "field": "leave"
      }
    }
  }
}
And then you can compute endTime - startTime either when you process the Elasticsearch response or using a bucket script aggregation. It may be negative, in which case there is no overlap.
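For the bucket_script variant, a hedged sketch: since bucket_script needs parent buckets to operate on, you can wrap the two metrics in a single-bucket filters aggregation (the names whole_index and overlapMillis are illustrative, not from the original answer). max/min on date fields return epoch milliseconds, so the difference is in milliseconds:
GET your_index/your_type/_search
{
  "size": 0,
  "aggs": {
    "whole_index": {
      "filters": {
        "filters": { "all": { "match_all": {} } }
      },
      "aggs": {
        "startTime": { "max": { "field": "join" } },
        "endTime": { "min": { "field": "leave" } },
        "overlapMillis": {
          "bucket_script": {
            "buckets_path": { "start": "startTime", "end": "endTime" },
            "script": "params.end - params.start"
          }
        }
      }
    }
  }
}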
For the second one, it depends on what you want:
If you want the exact boundaries, which may be hard to read, you can do it using a Scripted Metric Aggregation.
If you want the number per slot (per hour, for instance), it may be easier to use a Date Histogram Aggregation.
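For the per-slot option, a hedged sketch: assuming the interval were also indexed as a date_range field (here called presence, built from join and leave, which the thread does not do), a date histogram on that field counts each document in every bucket its range overlaps, giving the number of people per slot:
GET your_index/_search
{
  "size": 0,
  "aggs": {
    "people_per_minute": {
      "date_histogram": {
        "field": "presence",
        "fixed_interval": "1m"
      }
    }
  }
}
Note that histograms over range fields require Elasticsearch 7.4+, and fixed_interval 7.2+ (older versions use interval).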

Random document in ElasticSearch

Is there a way to get a truly random sample from an Elasticsearch index? I.e., a query that retrieves any document from the index with probability 1/N (where N is the number of documents currently indexed)?
And as a follow-up question: if all documents have some numeric field s, is there a way to get a document through weighted random sampling, i.e. where the probability to get document i with value s_i is equal to s_i / sum(s_j for j in index)?
I know it is an old question, but it is now possible to use random_score with the following search query:
{
  "size": 1,
  "query": {
    "function_score": {
      "functions": [
        {
          "random_score": {
            "seed": "1477072619038"
          }
        }
      ]
    }
  }
}
For me it is very fast with about 2 million documents.
I use the current timestamp as seed, but you can use anything you like. Best of all, if you use the same seed you will get the same results, so you can use your user's session id as the seed and each user will see their own stable, distinct order.
The only way I know of to get random documents from an index (at least in versions <= 1.3.1) is to use a script:
"sort": {
  "_script": {
    "script": "Math.random() * 200000",
    "type": "number",
    "params": {},
    "order": "asc"
  }
}
You can adapt that script to weight the ordering by some field of the record.
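For the weighted follow-up in the question, one standard trick (Efraimidis–Spirakis weighted sampling; not from the original answer) is to sort descending by rand^(1/s_i), assuming a positive numeric field s:
"sort": {
  "_script": {
    "script": "Math.pow(Math.random(), 1.0 / doc['s'].value)",
    "type": "number",
    "params": {},
    "order": "desc"
  }
}
Taking the top hit under this key selects document i with probability s_i / sum(s_j), which matches the question's definition.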
It's possible that in the future they might add something more complicated, but you'd likely have to request that from the ES team.
You can use random_score with a function_score query.
{
  "size": 1,
  "query": {
    "function_score": {
      "functions": [
        {
          "random_score": {
            "seed": 11
          }
        }
      ],
      "score_mode": "sum"
    }
  }
}
The bad part is that this will apply a random score to every document, sort the documents, and then return the first one. I don't know of anything that is smart enough to just pick a random document.
NEST way:
var result = _elastic.Search<dynamic>(s => s
    .Query(q => q
        .FunctionScore(fs => fs
            .Functions(f => f.RandomScore())
            .Query(fq => fq.MatchAll()))));
Raw query way:
GET index-name/_search
{
  "size": 1,
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "random_score": {}
    }
  }
}
You can use random_score to randomly order responses or retrieve a document with roughly 1/N probability.
Additional notes:
https://github.com/elastic/elasticsearch/issues/1170
https://github.com/elastic/elasticsearch/issues/7783
