How to calculate the overlap / elapsed time range in Elasticsearch?

I have some records in ES. They are records of online meetings that people join and leave at different times:
{"name": "p1", "join": "2017-11-17T00:01:00.293Z", "leave": "2017-11-17T00:06:00.293Z"}
{"name": "p2", "join": "2017-11-17T00:02:00.293Z", "leave": "2017-11-17T00:04:00.293Z"}
{"name": "p3", "join": "2017-11-17T00:03:00.293Z", "leave": "2017-11-17T00:05:00.293Z"}
The time ranges look something like this:
p1: [=========================]
p2:      [==========]
p3:           [==========]
The question is how to calculate the overlapping time range (the common/shared meeting time), which should be 3 minutes here.
A further question: is it possible to know from when to when there were 1/2/3 people present? Here that would be 2 minutes with 2 people and 1 minute with 3 people.

I don't think it's possible with ES alone, simply because the calculation has to visit every matching document and work across all of them.
I would do it in the following steps:
1. Before indexing a new document, search for documents that overlap it. An existing meeting overlaps the new one if its join is before the new leave and its leave is after the new join:
GET /meetings/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "join": {
              "lt": "2017-11-17T00:06:00.293Z"
            }
          }
        },
        {
          "range": {
            "leave": {
              "gt": "2017-11-17T00:01:00.293Z"
            }
          }
        }
      ]
    }
  }
}
(The two bounds are the new document's leave and join; here p1's values from the example stand in for them.)
2. Calculate whatever you need on the back-end across all documents that overlap; a sketch of such a calculation follows below.
3. Save the overlap metadata you need back onto the documents as a nested object.
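Here is a minimal back-end sketch of step 2 in Python (the timestamp format and the hard-coded intervals are taken from the example above; in practice you would read them from the search response). A sweep line over the join/leave events yields both the shared time and how long exactly 1/2/3 people were present:
from datetime import datetime
from collections import defaultdict

def concurrency_durations(intervals):
    # Sweep over (join, leave) pairs and return the number of seconds
    # spent at each concurrency level (1 person, 2 people, ...).
    events = []
    for join, leave in intervals:
        events.append((join, 1))    # someone joins
        events.append((leave, -1))  # someone leaves
    events.sort()
    durations = defaultdict(float)  # concurrency level -> seconds
    active, prev = 0, None
    for t, delta in events:
        if active > 0:
            durations[active] += (t - prev).total_seconds()
        active += delta
        prev = t
    return dict(durations)

fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
meetings = [
    ("2017-11-17T00:01:00.293Z", "2017-11-17T00:06:00.293Z"),  # p1
    ("2017-11-17T00:02:00.293Z", "2017-11-17T00:04:00.293Z"),  # p2
    ("2017-11-17T00:03:00.293Z", "2017-11-17T00:05:00.293Z"),  # p3
]
parsed = [(datetime.strptime(j, fmt), datetime.strptime(l, fmt)) for j, l in meetings]
by_level = concurrency_durations(parsed)
print(by_level)  # {1: 120.0, 2: 120.0, 3: 60.0} -> 2 min with 2 people, 1 min with 3
shared = sum(secs for level, secs in by_level.items() if level >= 2)
print(shared)    # 180.0 seconds, i.e. the 3-minute overlap from the question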

You can do the first part easily using max(join) and min(leave):
GET your_index/your_type/_search
{
  "size": 0,
  "aggs": {
    "startTime": {
      "max": {
        "field": "join"
      }
    },
    "endTime": {
      "min": {
        "field": "leave"
      }
    }
  }
}
And then you can compute endTime - startTime either when you process the Elasticsearch response or with a bucket script aggregation. It may be negative, in which case there is no overlap.
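If you want Elasticsearch itself to do the subtraction, here is a possible sketch using a bucket script aggregation; since pipeline aggregations need a multi-bucket parent, the two metrics are wrapped in a single-bucket filters aggregation (index and field names as above):
GET your_index/your_type/_search
{
  "size": 0,
  "aggs": {
    "all_docs": {
      "filters": { "filters": { "all": { "match_all": {} } } },
      "aggs": {
        "startTime": { "max": { "field": "join" } },
        "endTime": { "min": { "field": "leave" } },
        "overlap_millis": {
          "bucket_script": {
            "buckets_path": { "start": "startTime", "end": "endTime" },
            "script": "params.end - params.start"
          }
        }
      }
    }
  }
}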
For the second one, it depends on what you want:
If you want the exact boundaries, which may be hard to read, you can do it using a Scripted Metric Aggregation.
If you want the number of people per slot (per hour, for instance), it may be easier to use a Date Histogram Aggregation; see the sketch below.
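Here is a possible sketch of the Date Histogram approach. It assumes each presence is (re)indexed as a date_range field (the presence field name is made up) and a recent Elasticsearch version (7.4+), where histogram aggregations work over range fields: each document then counts toward every minute bucket its range overlaps, so the doc_count of a bucket is the number of people present during that minute.
PUT /meetings
{
  "mappings": {
    "properties": {
      "name": { "type": "keyword" },
      "presence": { "type": "date_range" }
    }
  }
}

GET /meetings/_search
{
  "size": 0,
  "aggs": {
    "people_per_minute": {
      "date_histogram": {
        "field": "presence",
        "fixed_interval": "1m"
      }
    }
  }
}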

Related

Get documents irrespective of years in Elasticsearch

I want to get all documents for 8 Dec irrespective of the year. I have tried two queries but both fail. Is there any way to do this?
First Query
GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "myDate": {
              "gte": "12-08",
              "lte": "12-08",
              "format": "MM-dd"
            }
          }
        }
      ]
    }
  }
}
Second Query
GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "mydate": "12-08"
          }
        }
      ]
    }
  }
}
Unfortunately, I don't think that will be easily possible. Date datatypes are actually just long numbers internally. The range query will also transform the given input into a number, for example now -> 1497541939892. See https://www.elastic.co/guide/en/elasticsearch/reference/current/date.html for more information, specifically this:
Internally, dates are converted to UTC (if the time-zone is specified) and stored as a long number representing milliseconds-since-the-epoch.
With that in mind, you would have to subtract 1 (or x) years (in milliseconds) for every subquery, which doesn't sound practical.
I think your best bet would be to additionally index the day and month (and maybe the year as well) separately. Then you could query just by month/day, which would be integer values. I don't know how easily that is done in your case, but I really have no other idea right now.
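For example, a sketch of that approach (the myMonth and myDay field names are made up; you would populate them at index time):
PUT /my_index/my_type/1
{
  "myDate": "2016-12-08T10:00:00",
  "myMonth": 12,
  "myDay": 8
}

GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "myMonth": 12 } },
        { "term": { "myDay": 8 } }
      ]
    }
  }
}
This would match documents from 8 December of any year.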

Kibana: Report on different date intervals such as Today, Yesterday, Last Week

I need to display a report in Kibana that aggregates results over multiple date intervals. Times are mapped as a float datatype along with the timestamp.
Example:
Jobs, Yesterday, Last Week, Last Quarter
Job 1, 5 hr, 10 hr, 60 hr
What is the best way to do this with ES and Kibana?
Given that you want it to display as:
job N | range 1 | range 2 | range 3 | ... | range N
This may be difficult to get from Kibana in exactly that shape because of how it likes to split up the data table, but it's best to know how to get the data before you even try to visualize it:
{
  "size": 0,
  "aggs": {
    "per_job": {
      "terms": {
        "field": "job",
        "size": 10
      },
      "aggs": {
        "ranges": {
          "date_range": {
            "field": "timestamp",
            "ranges": [
              { "from": "now-1d/d" },
              { "from": "now-7d/d" },
              { "from": "now-3M/M" }
            ]
          },
          "aggs": {
            "worked": {
              "sum": {
                "field": "hours"
              }
            }
          }
        }
      }
    }
  }
}
What is this providing? It groups by each job, then splits each job into three bucketed date ranges, each a longer version of the previous one (notice there is no "to" specified, which you could set as "to": "now"). Finally, each date-range bucket sums up the field of interest, which I assume is named hours.
How can you use this in Kibana? Well, Kibana is just a visualization tool to build these aggregations and chart or otherwise display them.
The top level aggregation is therefore going to be a Terms aggregation. The secondary or "sub-bucket" will be the Date Range, and finally the metric (above the buckets) will be the Sum.
Unfortunately, given that you seem to want a table view of it, there's no way that I am aware of to get the separate date ranges to just add another row, unless you accept one table per job.

Elastic Search filter with aggregate like Max or Min

I have simple documents with a ScheduleId. I would like to get the count of documents for the most recent ScheduleId. Assuming the max ScheduleId is the most recent, how would we write that query? I have been searching and reading for a few hours and couldn't get it to work.
{
  "aggs": {
    "max_schedule": {
      "max": {
        "field": "ScheduleId"
      }
    }
  }
}
That is getting me the max ScheduleId and the total count of documents outside of that aggregate.
I would appreciate it if someone could help me with how to take this aggregate value and apply it as a filter (like a subquery in SQL!).
This should do it:
{
  "aggs": {
    "max_ScheduleId": {
      "terms": {
        "field": "ScheduleId",
        "order": { "_term": "desc" },
        "size": 1
      }
    }
  }
}
The terms aggregation will give you document counts for each term, and it works for integers. You just need to order the results by the term instead of by the count (the default). And since you only want the highest ScheduleId, "size": 1 is adequate.
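For illustration, the relevant part of the response would look roughly like this (the key and doc_count values here are made up); doc_count is the count you are after:
"aggregations": {
  "max_ScheduleId": {
    "buckets": [
      { "key": 1042, "doc_count": 57 }
    ]
  }
}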
Here is the code I used to test it:
http://sense.qbox.io/gist/93fb979393754b8bd9b19cb903a64027cba40ece

Random document in ElasticSearch

Is there a way to get a truly random sample from an elasticsearch index? i.e. a query that retrieves any document from the index with probability 1/N (where N is the number of documents currently indexed)?
And as a follow-up question: if all documents have some numeric field s, is there a way to get a document through weighted random sampling, i.e. where the probability to get document i with value s_i is equal to s_i / sum(s_j for j in index)?
I know it is an old question, but it is now possible to use random_score with the following search query:
{
  "size": 1,
  "query": {
    "function_score": {
      "functions": [
        {
          "random_score": {
            "seed": "1477072619038"
          }
        }
      ]
    }
  }
}
For me it is very fast with about 2 million documents.
I use the current timestamp as the seed, but you can use anything you like. The nice part is that with the same seed you get the same results, so you can use your user's session id as the seed and every user will get a different, but stable, order.
The only way I know of to get random documents from an index (at least in versions <= 1.3.1) is to use a script:
"sort": {
  "_script": {
    "script": "Math.random() * 200000",
    "type": "number",
    "params": {},
    "order": "asc"
  }
}
You can use that script to make some weighting based on some field of the record.
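For instance, in the same legacy script-sort style, a sketch that biases the ordering toward documents with larger values of a numeric field s (the field name is made up, and this favors high-s documents rather than sampling exactly proportionally to s):
"sort": {
  "_script": {
    "script": "Math.random() * doc['s'].value",
    "type": "number",
    "order": "desc"
  }
}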
It's possible that in the future they might add something more complicated, but you'd likely have to request that from the ES team.
You can use random_score with a function_score query.
{
  "size": 1,
  "query": {
    "function_score": {
      "functions": [
        {
          "random_score": {
            "seed": 11
          }
        }
      ],
      "score_mode": "sum"
    }
  }
}
The bad part is that this will apply a random score to every document, sort the documents, and then return the first one. I don't know of anything that is smart enough to just pick a random document.
NEST way:
var result = _elastic.Search<dynamic>(s => s
    .Query(q => q
        .FunctionScore(fs => fs
            .Functions(f => f.RandomScore())
            .Query(fq => fq.MatchAll()))));
Raw query way:
GET index-name/_search
{
  "size": 1,
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "random_score": {}
    }
  }
}
You can use random_score to randomly order responses or retrieve a document with roughly 1/N probability.
Additional notes:
https://github.com/elastic/elasticsearch/issues/1170
https://github.com/elastic/elasticsearch/issues/7783

Elasticsearch and aggregation of subqueries

I know that Elasticsearch allows sub-aggregations (i.e. nested aggregations); however, I would like to apply an aggregation on the result of a first aggregation (or, in general, of any query, aggregation or not).
Concrete example: I log events about user actions (for simplicity, documents with user_id and action). I can make a query that counts the number of actions executed by each user. However, I would like to find out the percentage (or count) of "active users" (e.g. users that have executed more than 10 actions). The ideal result would be a histogram over all users showing how active they are.
Is there a way to create such a query? Or is there any other approach I can take, other than storing the aggregated results of the subquery and computing the histogram out of that?
Note: I have seen the Elastic Search and "sub queries" question, but it was about something else, it is over one and a half years old, and Elasticsearch is being actively developed.
Additionally, it seems that version 1.4 will have a scripted metric aggregation available, but that would anyway require storing a counter for every user until the reduce phase. An "approximate solution" is good enough for me, similar to what ES uses internally for its aggregations.
Here is the query I have used; note the "min_doc_count" setting in the aggregation.
{
  "query": {
    "filtered": {
      "filter": {
        "and": [
          { "term": { "name": "did x" } },
          { "range": { "created_at": { "gte": "now-7d", "lte": "now" } } }
        ]
      }
    }
  },
  "aggregations": {
    "my_agg": {
      "terms": {
        "field": "user_id",
        "min_doc_count": 10,
        "size": 0
      }
    }
  }
}
This query returns the list of buckets (users) with more than 9 events in the specified time period. Just count the resulting buckets to get the number of active users.
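For instance, counting the buckets client-side, sketched here in Python (raw_body stands in for the real response body):
import json

raw_body = """{"aggregations": {"my_agg": {"buckets": [
  {"key": "user-1", "doc_count": 42},
  {"key": "user-2", "doc_count": 17}
]}}}"""  # stand-in for the real Elasticsearch response

resp = json.loads(raw_body)
active_users = len(resp["aggregations"]["my_agg"]["buckets"])
print(active_users)  # 2 active users in this made-up response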
I have tested this approach with thousands of events and it works well. At a certain scale you will have to use Hadoop.
