Can this be done in an Elasticsearch query? - sorting

I am currently solving this by pre-calculating a score before inserting the events into Elasticsearch. However, because it is based on a date, I have to recalculate the score daily. Would it be possible to do this calculation during a query?
Data:
{
  "title": "event 1",
  "rank": 1034, // pre-calculated score
  "score": 34,
  "date": "2015-10-10 00:00:00",
  "meta": [
    {
      "date": "2015-10-10 00:00:00",
      "type": "insert"
    },
    {
      "date": "2015-12-10 00:00:00",
      "type": "outsert"
    },
    {
      "date": "2015-05-10 00:00:00",
      "type": "other"
    }
  ]
}
Ranking:
There are 4 "buckets" created using the insert date.
Events under 5 days old
Events over 5 days and under 10 days old
Events over 10 days and under 15 days old
Events older than 15 days
Events in each bucket need to be sorted by the score field DESC.
The pre-calculated rank is made by adding 1000, 2000, or 3000 to the score, depending on which bucket the event falls into.
When a query is made the results are sorted by Rank.
How would I do this without using a pre-calculated rank?

I think you can achieve this. The real pain with your current pre-calculated scoring logic is that an event cannot move into the next bucket after it expires from its current one without being re-indexed. Since your buckets follow a regular 5-day spacing, use a function_score query with a decay function (the linked docs cover linear and gauss) and a scale of 5 days.
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html
{
  "gauss": {
    "date_field": {
      "origin": "2013-09-17",
      "scale": "5d",
      "offset": "0d",
      "decay": 0.5
    }
  }
}
Replace origin with the current date when querying the data, and set boost_mode and score_mode according to the documentation linked above.
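For reference, here is a rough sketch of how that function might sit inside a complete request. The index name in the URL, the match_all query, and the choice of boost_mode are illustrative assumptions rather than anything prescribed above; "now" stands in for the current date:
POST http://localhost:9200/events/_search
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "gauss": {
        "date_field": {
          "origin": "now",
          "scale": "5d",
          "offset": "0d",
          "decay": 0.5
        }
      },
      "boost_mode": "multiply"
    }
  }
}
To also fold the documents' own score field into the ranking, a field_value_factor function could be listed next to the decay function in a functions array, with score_mode controlling how the two are combined.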
Hope this works.

Related

Calculate an average value for the bar chart using the most frequent value

If I plot the values of the "Price" data (with Python) using a bar chart with bins, I get this result:
So, the price is between 0 and 15. Let's imagine that this is the distribution of the price values for some particular hour of the day.
In Kibana I want to create a line plot that will calculate an average price per hour. If I apply Average Bucket or Average, then the mean value is calculated from the data. However, in my case the "average" is really the most frequent value from the histogram.
For example, in the above-given chart the average value would be 1.1 because it has more than 800 entries.
How can I calculate this kind of "average" in Kibana?
Let's straighten out your goal. Your goal is to:
Find the most frequent price within each hour.
That is too high-level to implement in Kibana directly, so let's restate it in Elasticsearch terms:
Set the x-axis to a date histogram with an hourly interval.
Find the most frequent price within each hourly bucket.
This can be done with the settings below.
y-Axis
Set to Metric Aggregations - Average
x-Axis
Set to Date Histogram with
Field : YOUR TIMESTAMP
Interval : hourly
Below is the important part for your case.
Split Series
Set to Terms
Field : price
Order By : Custom Metric (count)
Order : Descending
Size :1
"aggs": {
"2": {
"date_histogram": {
"field": "#timestamp",
"interval": "1h",
"time_zone": "Asia/Tokyo",
"min_doc_count": 1
},
"aggs": {
"3": {
"terms": {
"field": "price",
"size": 1,
"order": {
"_count": "desc"
}
},
"aggs": {
"1": {
"avg": {
"field": "price"
}
}
}
}
}
}
}
If you look at the query Kibana generates, you can see that the terms aggregation is just returning the bucket with the highest count.

Elasticsearch stats aggregation group by date on timeseries

I am having some trouble getting a query working. I want to aggregate a weather station's timeseries data in Elasticsearch. I have a value (double) for each day of the year. I would like a query that provides the sum, min, and max of my value field, grouped by month.
My document has a stationid field and a timeseries object array:
PUT /stations/rainfall/2
{
  "stationid": "5678",
  "timeseries": [
    {
      "value": 91.3,
      "date": "2016-05-01"
    },
    {
      "value": 82.2,
      "date": "2016-05-02"
    },
    {
      "value": 74.3,
      "date": "2016-06-01"
    },
    {
      "value": 34.3,
      "date": "2016-06-02"
    }
  ]
}
So I am hoping to be able to query by stationid: "5678" (or by document id 2)
and see something like: stationid: 5678, monthlystats: [ month: 5, avg: x, sum: y, max: z ]
Many thanks in advance for any help. Also happy to take any advice on my document structure too.
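One possible shape for this, sketched here as an untested starting point: assuming the timeseries array is mapped as a nested type (plain object arrays are flattened at index time, which is worth checking for this structure), a nested aggregation can wrap a monthly date_histogram with a stats sub-aggregation. The aggregation names here are placeholders:
POST /stations/rainfall/_search
{
  "size": 0,
  "query": {
    "term": { "stationid": "5678" }
  },
  "aggs": {
    "ts": {
      "nested": { "path": "timeseries" },
      "aggs": {
        "per_month": {
          "date_histogram": {
            "field": "timeseries.date",
            "interval": "month"
          },
          "aggs": {
            "monthly_stats": {
              "stats": { "field": "timeseries.value" }
            }
          }
        }
      }
    }
  }
}
The stats aggregation returns count, min, max, avg, and sum per monthly bucket, which matches the monthlystats shape described above.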

How to specify "precision" when sorting on date fields in Elasticsearch?

I have a field of type date (input format epoch_second) in my Elasticsearch mapping (I'm using ES 2.1). I know that I can sort on that field like this:
{
  "sort": [
    {
      "myDateField": {
        "order": "desc"
      }
    }
  ]
}
But this sorts with seconds precision. I'd like to sort by "week interval" (7-day intervals backward from now) and, within the same week, by score again, like this (pseudocode):
{
  "sort": [
    {
      "myDateField": {
        "order": "desc",
        "precision": "week"
      }
    },
    "_score"
  ]
}
So, all hits being within the last 7 days should be ranked equally, all hits older than 7 days and younger than 14 days in the next "sort group" and so on. And each "week group" should be sorted by score again.
In words: "What are the most relevant (to the current query) documents from the last 7 days (but don't filter out older ones completely)?"
Background: An event search, where obviously more recent events should matter most.
How can I achieve this?
You may find a decay function on a function score query useful in your situation. It is specifically designed to adjust the score of a document the "further away" one of its fields is from some defined starting point.
This works with dates as well as numbers and geo point fields. It accepts an origin option which sets the reference date from which other documents will be compared. Conveniently, if you don't provide an origin for a date field, it defaults to the current date (which should work for your scenario).
You would probably want to set your offset to 7 days. That way, all documents for the last 7 days will be scored equally. Outside of that range, the score begins to drop, depending on the decay function you use.
Try something like this:
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "field1": "search goes here"
        }
      },
      "functions": [
        {
          "exp": {
            "myDateField": {
              "offset": "7d",
              "scale": "14d",
              "decay": 0.5
            }
          }
        }
      ]
    }
  }
}
I read about scripted sorting, and this is the solution that works for me:
{
  "sort": [
    {
      "_script": {
        "lang": "expression",
        "type": "number",
        "script": "doc['myDateField'].value - doc['myDateField'].value % 604800000",
        "order": "desc"
      }
    },
    "_score"
  ]
}
What I'm doing here is taking the date value modulo the desired span in milliseconds and subtracting that remainder from the actual date (Elasticsearch stores a date field internally as a long holding milliseconds, and 604800000 ms = 7 × 24 × 3600 × 1000, i.e. one week). This truncates all dates to "0 am on (0 to 6) days before", so all documents within the same 7-day interval carry the same truncated timestamp and sort equally on it. Finally, I append the regular score sort as the secondary order condition.
I'm not sure how the performance of this scales, but for the few thousand documents that need this sorting I was unable to notice any delay compared to not sorting at all.

Nested query in Elasticsearch?

My team owns several dashboards and is considering moving to Elasticsearch in order to consolidate the software stacks. One common type of chart we expose answers questions like "What's the pending workflow by the end of each day?". Here is some example data:
day       workflow_id  version  status
20151101  1            1        In Progress
20151101  2            1        In Progress
20151102  1            2        In Progress
20151102  3            1        In Progress
20151102  4            1        In Progress
20151102  2            2        Completed
20151103  1            3        Completed
20151103  3            2        In Progress
20151104  3            3        Completed
20151105  4            2        Completed
Every time something changes in a workflow, a new record is inserted, which may or may not change the status. The record with max(version) is the most recent data for a given workflow_id.
The goal is to have a chart showing the total number of 'In Progress' and 'Completed' workflows at the end of each day. For each day, only the record with the largest version number up to that day should be considered. This can be done in SQL with nested queries:
with
snapshot_dates as
(select distinct day from workflow),
snapshot as
(select d.day, w.workflow_id, max(w.version) as max_version
from snapshot_dates d, workflow w
where d.day >= w.day
group by d.day, w.workflow_id
order by d.day, w.workflow_id)
select s.day, w.status, count(1)
from workflow w join snapshot s on w.workflow_id=s.workflow_id and w.version = s.max_version
group by s.day, w.status
order by s.day, w.status;
Here is the expected output from the query:
day,status,count
20151101,In Progress,2
20151102,Completed,1
20151102,In Progress,3
20151103,Completed,2
20151103,In Progress,2
20151104,Completed,3
20151104,In Progress,1
20151105,Completed,4
I am still new to Elasticsearch and wonder if it can do a similar query without application-side logic, by properly defining the mapping and query. More generally, what is the best practice for solving such problems with Elasticsearch?
I tried to find a solution using the bucket selector aggregation, but I got stuck at one point and discussed it on the Elasticsearch forum. The following is what Christian Dahlqvist suggested:
In addition to this you also index the record into a workflow-centric index with a unique identifier, e.g. workflow id, as the document id. If several updates come in for the same workflow, each will result in an update and the latest state will be preserved. Running aggregations across this index to find the current or latest state will be considerably more efficient and scalable as you only have a single record per workflow and do not need to filter out documents based on relationships to other documents.
So, per this suggestion, you should use the workflow id as the document id while indexing, and whenever there is an update for that workflow, you update its version and date via the workflow id. Let's say the index name is workflow and the index type is workflow_status. The mapping of this workflow_status type will be as follows:
{
  "workflow_status": {
    "properties": {
      "date": {
        "type": "date",
        "format": "strict_date_optional_time||epoch_millis"
      },
      "status": {
        "type": "string",
        "index": "not_analyzed"
      },
      "version": {
        "type": "long"
      },
      "workFlowId": {
        "type": "long"
      }
    }
  }
}
Keep adding/updating documents in this index type, keeping workFlowId as the document id.
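For illustration, the latest update for workflow 1 from the sample data could be indexed like this (the URL follows the same style as the search request further below; indexing again under the same document id overwrites the previous state):
PUT http://localhost:9200/workflow/workflow_status/1
{
  "workFlowId": 1,
  "version": 3,
  "status": "Completed",
  "date": "2015-11-03"
}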
Now, to show a chart day by day, you may need to create another index type, say per_day_workflow, with the following mapping:
{
  "per_day_workflow": {
    "properties": {
      "date": {
        "type": "date",
        "format": "strict_date_optional_time||epoch_millis"
      },
      "in_progress": {
        "type": "long"
      },
      "completed": {
        "type": "long"
      }
    }
  }
}
This index will hold the data for each day. You need to create a job that runs at the end of each day and fetches the total "In Progress" and "Completed" workflows from the workflow_status index type using the following aggregation search:
POST http://localhost:9200/workflow/workflow_status/_search?search_type=count
{
  "aggs": {
    "per_status": {
      "terms": {
        "field": "status"
      }
    }
  }
}
The response will look as follows (I ran this against your sample data as of 2015-11-02):
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "per_status": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "In Progress",
          "doc_count": 3
        },
        {
          "key": "Completed",
          "doc_count": 1
        }
      ]
    }
  }
}
From this response you need to extract the In Progress and Completed counts and index them into the per_day_workflow type with today's date.
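For example, with the counts from the response above, the end-of-day job might index a document like this (using the date as the document id is just one convenient choice on my part, since it makes re-running the job idempotent):
PUT http://localhost:9200/workflow/per_day_workflow/2015-11-02
{
  "date": "2015-11-02",
  "in_progress": 3,
  "completed": 1
}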
Now, whenever you need per-day data for your graph, you can fetch it easily from this per_day_workflow index type.
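A minimal sketch of that fetch, sorted chronologically (the size value is an arbitrary placeholder):
POST http://localhost:9200/workflow/per_day_workflow/_search
{
  "size": 100,
  "sort": [
    { "date": { "order": "asc" } }
  ]
}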

ElasticSearch Facets / Aggregations sorting / ordering

Does anyone know how to order aggregation / facet buckets from a range into a predictable order, i.e. the order they were added to the facet in?
Currently the 1.4 branch (and possibly older branches) orders the buckets by "doc_count", which is not predictable. I want to be able to output the buckets in a pre-defined order.
A simple way could be to order them on your end from the Elasticsearch response. Another way could be to order by term (the key of the aggregation), as in the sketch below.
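For a terms aggregation, that would look roughly like this (the field and aggregation names are placeholders; on the 1.x branch the order key is _term, which later versions renamed to _key):
{
  "aggs": {
    "statuses": {
      "terms": {
        "field": "status",
        "order": { "_term": "asc" }
      }
    }
  }
}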
Update:
If you are using a date range aggregation with a query like the one below, then the result will automatically be in the chronological order "3 days ago", "yesterday", "today", irrespective of doc_count.
{
  "aggs": {
    "timerange": {
      "date_range": {
        "field": "day",
        "keyed": true,
        "ranges": [
          {
            "key": "today",
            "from": "now/d"
          },
          {
            "key": "yesterday",
            "from": "now-1d/d",
            "to": "now/d"
          },
          {
            "key": "3 days ago",
            "from": "now-3d/d",
            "to": "now-2d/d"
          }
        ]
      }
    }
  }
}
If you are interested in daily data, then a date histogram will be more convenient; see the sketch below.
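A minimal sketch, reusing the day field from the aggregation above; date histogram buckets are always returned in chronological order:
{
  "aggs": {
    "per_day": {
      "date_histogram": {
        "field": "day",
        "interval": "day"
      }
    }
  }
}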
