Calculate change rate of Time Series values - elasticsearch

I have an application which writes Time Series data to Elasticsearch. The (simplified) data looks like the following:
{
  "timestamp": 1425369600000,
  "shares": 12271
},
{
  "timestamp": 1425370200000,
  "shares": 12575
},
{
  "timestamp": 1425370800000,
  "shares": 12725
},
...
I now would like to use an aggregation to calculate the change rate of the shares field per time "bucket". For example, the change rate of the share values within the last 10-minute "bucket" could, IMHO, be calculated as
# of shares t1
--------------
# of shares t0
I tried the Date Histogram aggregation, but I guess that's not what I need to calculate the change rates, because it would only give me the doc_count, and it's not clear to me how I could calculate the change rate from those counts:
{
  "aggs": {
    "shares_over_time": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "10m"
      }
    }
  }
}
Is there a way to achieve my goal with aggregations within Elasticsearch? I searched the docs but didn't find a matching method.
Thanks a lot for any help!

I think it is hard to achieve with the out-of-the-box aggregate functions. However, you can take a look at the percentile_ranks aggregation and add your own modifications to the script to create point-in-time rates.
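For what it's worth, here is a minimal sketch of the kind of bucketing this would build on (not the percentile_ranks approach, and the sub-aggregation name latest_shares is my own): a date_histogram with a max sub-aggregation on shares, so each 10-minute bucket at least carries a shares value; the t1/t0 ratio between consecutive buckets would still have to be computed client-side or with additional scripting:
{
  "size": 0,
  "aggs": {
    "shares_over_time": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "10m"
      },
      "aggs": {
        "latest_shares": {
          "max": { "field": "shares" }
        }
      }
    }
  }
}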
Also, sorry for the off-topic question, but I wonder: is Elasticsearch the best fit for this kind of thing? As I understand it, at any given point in time you only need the previous sample's data to calculate the correct rate for the current sample. This sounds to me like a better fit for a real-time sliding-window implementation (even on a relational DB like Postgres), where you keep a fixed number of time buckets and the counters you are interested in inside each bucket. Once a new sample arrives, you update (slide) the window and calculate the updated rate for the most recent time bucket.

Related

Elasticsearch date based function scoring boosting the wrong way

I would like to boost scores of documents based on how "recent" a document is. I am trying to do this using a function_score. Here is an example of me doing this on a field called updated_at:
{
  "function_score": {
    "boost_mode": "sum",
    "functions": [
      {
        "exp": {
          "updated_at": {
            "origin": "now",
            "scale": "1h",
            "decay": 0.01,
          },
        },
        "weight": 1,
      }
    ],
    "query": query
  },
}
I would expect documents whose updated_at is close to now to have a score closer to 1, and documents about scale away to have a score closer to decay (as described in the docs). Therefore, I'm using the boost_mode sum to keep the original document scores and increase them depending on how close to now the updated_at value is. (Also, the query score is useful, so I would rather add than multiply, which is the default.)
To test this scenario, I create a document (A) that returns a query score of about 2. I then duplicate it (B) and modify the new document's updated_at timestamp to be an hour in the past.
In this scenario, I would expect (A) to have a higher score and (B) to have a lower score. However, when I run this scenario, I get the exact opposite. (B) ends up with a score of 3 and (A) ends up with a score of 2.
What am I misunderstanding here to cause this to happen? And how would I modify my function score to do what I would like?
This turned out to be a timezone issue.
I ended up using the explain API to look at what was contributing to the score. When doing that, I noticed that the origin set to now was actually in a different timezone to the one I was setting in the documents.
I fixed this by manually providing a UTC timestamp in the Elasticsearch query rather than using now as the value.
(If there is a better way to do this, please let me know)
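For illustration only, a sketch of the same function with an explicit UTC origin instead of now (the timestamp below is a made-up placeholder, not a value from the question):
{
  "function_score": {
    "boost_mode": "sum",
    "functions": [
      {
        "exp": {
          "updated_at": {
            "origin": "2021-06-01T12:00:00Z",  # placeholder UTC timestamp
            "scale": "1h",
            "decay": 0.01,
          },
        },
        "weight": 1,
      }
    ],
    "query": query
  },
}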

Elasticsearch Multi Index Query and Filter

I have 2 indexes, one that stores data about an event and one that stores the availability of that event. I am trying to create a single query that gets events by a query but only returns ones that are available, and I am having difficulty doing so.
The events index stores
{
  "id": "152ce52d-e975-4ebd-849a-0a12f535e644",
  "createdAt": 1.5519999143126902E12,
  "description": "A very not so concise description",
  "geoHash": "dnh00x6x5",
  "name": "a name",
  ...etc...
}
The availability index stores availability like so:
{
  "eventId": "152ce52d-e975-4ebd-849a-0a12f535e644",
  "maxGuests": 8,
  "availability": {
    "lte": "2019-10-18T22:15:00.000Z",
    "gte": "2019-10-18T02:30:00.000Z"
  }
}
I am trying to create a query like below, but what I can't figure out is how to filter by listings that meet the criteria in the events index AND are available in the availability index.
GET events,availability/_search
{
  "size": 5,
  "from": 0,
  "_source": [
    "id"
  ],
  "query": {
    "bool": {
      "must": [
        {
          "geo_distance": {
            "distance": "25mi",
            "geoHash": {
              "lat": 34.0389,
              "lon": -84.3826
            }
          }
        }
      ],
      "should": [],
      "filter": [
        {
          "range": {
            "availability": {
              "gte": "2019-10-31",
              "lte": "2020-11-01",
              "relation": "within"
            }
          }
        }
      ]
    }
  }
}
--
The reason I want to only do one query is that the client is expecting a certain specified number of events. If I filter out the unavailable events after I get the event data then I am likely to be left with fewer events than the client expected and would need to do yet another search to fill the gap.
Also, of course, I could merge the two indices so that the event also stores the availability info, but I originally set them up this way because the availability info may have hundreds or thousands of entries per event.
What you want to accomplish is the equivalent of a SQL foreign key (a join). There is no way to have exactly what you want, meaning to filter documents from index A by querying index B. Your options are:
Solve it at the application level, as you've mentioned (although this causes other problems for you, so it's not really a solution).
Merge the data into one index and have duplicated event information. Although it seems expensive, duplication of data in a NoSQL database is to be expected. If you need a relational model then maybe you should use a SQL solution.
Use parent/child (the join datatype). The problem here is that you will still need to have all the data in the same index. Moreover, parent and child will be stored in the same shard as well.
One approach (a bit more complex, though) that I believe would work for you is to use the nested datatype, which is actually a more compact take on option 2 (combine your data in one index, but store the root information only once). Make the event the root document and model availability as nested. When you want to add one availability entry you can use the update API, and when you query, you can search by both the root fields and the nested ones. If you need to retrieve the specific availability entries that matched for an event, you can use inner hits.
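A rough sketch of what that could look like (treat this as an assumption, not your actual mappings: the nested layout and the period field name are made up, only the field names from your examples are reused). Availability entries live inside each event document, and the date filter runs as a nested query:
PUT events
{
  "mappings": {
    "properties": {
      "name":      { "type": "text" },
      "geoHash":   { "type": "geo_point" },
      "availability": {
        "type": "nested",
        "properties": {
          "maxGuests": { "type": "integer" },
          "period":    { "type": "date_range" }
        }
      }
    }
  }
}
GET events/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "geo_distance": {
            "distance": "25mi",
            "geoHash": { "lat": 34.0389, "lon": -84.3826 }
          }
        }
      ],
      "filter": [
        {
          "nested": {
            "path": "availability",
            "query": {
              "range": {
                "availability.period": {
                  "gte": "2019-10-31",
                  "lte": "2020-11-01",
                  "relation": "within"
                }
              }
            },
            "inner_hits": {}
          }
        }
      ]
    }
  }
}
Adding a new availability entry then becomes an update to the parent event document rather than a new document in a second index.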
What you are trying to do (a multi-index search) will not join your data automatically; it will not work. Elasticsearch doesn't work that way, and the relational model is not what this product is suited for.
One last thing, it's a good thing to plan ahead, but it's a bad thing to try to optimize early on.
The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.
An interesting read that summarizes the above

Elasticsearch filter vs term query for many ids

I have an index of documents connected with some product_id, and I would like to find all documents for a specific set of ids (around 100,000 product_ids to look up, out of roughly 100 million documents in the index in total).
Would the filter query be the fastest and best option in that case?
"query": {
"bool": {
"filter": {"terms": {"product_id": product_ids}
}
}
Or is it better to split the ids into chunks and use plain terms queries, or something else?
The question is probably kind of a duplicate, but I would be very grateful for the best practice advice (and a bit of reasoning).
After some testing and more reading I found an answer:
A filter query works much, much faster than chunked plain terms queries.
But making the filter really big can slow down getting the result a lot.
In my case, using a filter query with chunks of 10,000 ids was 10 times faster than using a filter query with all 100,000 ids at once (btw, this number of terms is already restricted in Elasticsearch 6).
Also, from the official Elasticsearch documentation:
Potentially the amount of ids specified in the terms filter can be a lot. In this scenario it makes sense to use the terms filter’s terms lookup mechanism.
The only disadvantage to take into account is that the filter query results are stored in the cache. (The cache implements an LRU eviction policy: when the cache becomes full, the least recently used data is evicted to make way for new data.)
P.S. In all cases I always used scroll.
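For reference, a sketch of the terms lookup variant mentioned in the quoted docs (the lookup index product_id_lists, the document id batch_1, and the field product_ids are made-up names): the id list lives in a separate document and the terms filter fetches it from there instead of carrying 100,000 values in every request:
GET my_index/_search
{
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "product_id": {
            "index": "product_id_lists",
            "id": "batch_1",
            "path": "product_ids"
          }
        }
      }
    }
  }
}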
you can use "paging" or "scrolling" feature of elastic search query for very large result sets.
Use "from - to" query : https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html
or "scroll" query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
I think that "From / To" is a more efficient way to go unless you want to return thousands of results each time (which could be many many MB of data so you probably don't want that)
Edit:
You can make a query like this in bulks:
GET my_index/_search
{
  "query": {
    "terms": {
      "_id": [ "1", "2", "3", .... "10000" ] // tune for the best array length
    }
  }
}
If your document id is sequential, or in some other numeric form that you can easily order by and have available as a field, you can do a "range query":
GET _search
{
  "query": {
    "range": {
      "document_id_that_is_a_number": {
        "gte": 0, // bump this on each query by "lte" step factor
        "lte": 10000 // find a good number here
      }
    }
  }
}

ElasticSearch circuit_breaking_exception (Data too large) with significant_terms aggregation

The query:
{
  "aggregations": {
    "sigTerms": {
      "significant_terms": {
        "field": "translatedTitle"
      },
      "aggs": {
        "assocs": {
          "significant_terms": {
            "field": "translatedTitle"
          }
        }
      }
    }
  },
  "size": 0,
  "from": 0,
  "query": {
    "range": {
      "timestamp": {
        "lt": "now+1d/d",
        "gte": "now/d"
      }
    }
  },
  "track_scores": false
}
Error:
{
  "bytes_limit": 6844055552,
  "bytes_wanted": 6844240272,
  "reason": "[request] Data too large, data for [<reused_arrays>] would be larger than limit of [6844055552/6.3gb]",
  "type": "circuit_breaking_exception"
}
Index size is 5G. How much memory does the cluster need to execute this query?
You can try to increase the request circuit breaker limit to 41% (default is 40%) in your elasticsearch.yml config file and restart your cluster:
indices.breaker.request.limit: 41%
Or if you prefer to not restart your cluster you can change the setting dynamically using:
curl -XPUT localhost:9200/_cluster/settings -d '{
  "persistent" : {
    "indices.breaker.request.limit" : "41%"
  }
}'
Judging by the numbers showing up (i.e. "bytes_limit": 6844055552, "bytes_wanted": 6844240272), you're just missing ~180 KB of heap, so increasing the limit by 1% to 41% should give you roughly 170 MB of additional room for your request breaker (your total heap is ~17 GB), which should be sufficient.
Just make sure to not increase this value too high, as you run the risk of going OOM since the request circuit breaker also shares the heap with the fielddata circuit breaker and other components.
I am not sure what you are trying to do, but I'm curious to find out. Since you get that exception, I can assume the cardinality of that field is not small. You are basically trying to see, I guess, the relationships between all the terms in that field, based on significance.
The first significant_terms aggregation will consider all the terms from that field and establish how "significant" they are (calculating frequencies of that term in the whole index and then comparing those with the frequencies from the range query set of documents).
After it does that (for all the terms), you want a second significant_terms aggregation that repeats the first step, but now considering each term and running another significant_terms aggregation for it. That's gonna be painful. Basically, you are computing number_of_terms * number_of_terms significant_terms calculations.
The big question is what are you trying to do?
If you want to see a relationship between all the terms in that field, that's gonna be expensive for the reasons explained above. My suggestion is to run a first significant_terms aggregation, take the first 10 terms or so and then run a second query with another significant_terms aggregation but limiting the terms by probably doing a parent terms aggregation and include only those 10 from the first query.
You can, also, take a look at the sampler aggregation and use that as a parent for a single significant_terms aggregation.
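For example, a minimal sketch of the sampler variant (the shard_size value is arbitrary and needs tuning for your data):
{
  "size": 0,
  "aggregations": {
    "sample": {
      "sampler": { "shard_size": 200 },
      "aggs": {
        "sigTerms": {
          "significant_terms": { "field": "translatedTitle" }
        }
      }
    }
  },
  "query": {
    "range": {
      "timestamp": {
        "lt": "now+1d/d",
        "gte": "now/d"
      }
    }
  }
}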
Also, I don't think increasing the circuit breaker limit is the real solution. Those limits were chosen for a reason. You can increase it and maybe it will work, but it has to make you ask yourself whether that's the right query for your use case (as it doesn't sound like it is). The limit value in the exception might not be the final one either: reused_arrays refers to an array class in Elasticsearch that is resizeable, so if more elements are needed, the array size is increased and you may hit the circuit breaker again, for another value.
Circuit breakers are designed to deal with situations where request processing needs more memory than is available. You can set the limit by using the following query:
PUT /_cluster/settings
{
  "persistent" : {
    "indices.breaker.request.limit" : "45%"
  }
}
You can get more information on
https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html
https://www.elastic.co/guide/en/elasticsearch/reference/1.4/index-modules-fielddata.html

Histogram on the basis of facet counts

I am currently working on a project in which I am storing user activity logs in Elasticsearch. The user field in the log looks like {"user": "abc#yahoo.com"}. I have a timestamp field for each activity that describes when the activity was recorded. Can I generate a date histogram based on the number of users in a particular time period, e.g. each histogram entry showing the number of users in that interval? I can implement this by obtaining facet counts, but I need to get counts over various intervals and various ranges with a minimum number of queries. Please guide me in this regard. Thanks.
Add a facet to your query, something like the following:
{"facets": {
"daily_volume": {
"date_histogram": {
"size": 100,
"field": "created_at",
"interval": "day"
"order": "time"
}
}
}
This returns a nice set of ordered data for the number of items per day.
I then feed this to a Google Chart (the ColumnChart works nicely for histograms), converting the returned timestamp integer to a Date type understood correctly by the JavaScript charts API.
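As a side note, on newer Elasticsearch versions where facets were replaced by aggregations, the "unique users per interval" part of the question can be expressed with a cardinality sub-aggregation. A sketch (field names taken from the question, assuming user is indexed as a single not_analyzed token, and keeping in mind that cardinality counts are approximate):
{
  "size": 0,
  "aggs": {
    "daily_volume": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "day"
      },
      "aggs": {
        "unique_users": {
          "cardinality": { "field": "user" }
        }
      }
    }
  }
}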
