Elasticsearch - Rolling Calculations

I have a series of JSON documents like {"type":"A","value":2}, {"type":"B","value":3}, and {"type":"C","value":7}, and I feed them into Elasticsearch.
Let's say I want to run one query to average value over all documents with "type": "A".
What is the difference between how Elasticsearch calculates the count and how, say, Mongo would?
Is Elasticsearch:
1. Automatically creating a "rolling count" for all those types and incrementing something like "typeA_sum", "typeA_count", and "typeA_avg" as new data is fed in? If so, that would be sweet, because then it's not actually having to calculate anything.
2. Just creating an index over type and actually calculating the sum each time the query is run?
3. Doing #2 in the background (i.e. precalculating) and just updating some cached value, so that when the query runs it has the result pretty quickly?

It is closest to your #2; however, the results are cached, so if they are useful in a subsequent query, that query will be very quick. There is no way Elasticsearch could know beforehand what query you are going to run, so #1 is impossible, and #3 would be wasteful.
However, for your example use case you probably do not need two queries; one is enough. See, for instance, the stats aggregation, which returns count, min, max, average, and sum. Combine that with a terms aggregation (and perhaps a missing aggregation) to group the documents on your type field, and you'll get count and average (and min, max, sum) for all types separately with a single query.
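For example, a single request along these lines would do it (a sketch in Kibana Dev Tools syntax; the index name docs is an assumption, and with the default dynamic mapping the field would be type.keyword):

# Group by "type", computing count/min/max/avg/sum of "value" per group.
# The index name "docs" and the keyword mapping of "type" are assumptions.
GET docs/_search
{
  "size": 0,
  "aggs": {
    "by_type": {
      "terms": { "field": "type" },
      "aggs": {
        "value_stats": { "stats": { "field": "value" } }
      }
    }
  }
}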

Related

Elasticsearch Stats and Sum Aggregation under the hood

I want to understand how Elasticsearch works under the hood for the stats aggregation and the sum aggregation.
My use case needs a date histogram aggregation as the primary aggregation, with a sum aggregation or stats aggregation nested inside it. I executed queries using both aggregations on the same amount of data in Kibana, and the execution time was similar for both. So, for all our use cases, we might use the stats aggregation all the time if there is no performance difference between stats and sum.
I couldn't find any detailed information about the internal workings of these aggregations. I'd appreciate any information on them, or a pointer to documentation describing how these aggregations work under the hood.
Elasticsearch version: 7.1
Thank you
When in doubt, go to the source.
If you look at the implementation of StatsAggregator.java and SumAggregator.java, you'll see that they are very similar.
SumAggregator only computes a sum, while StatsAggregator computes sum, min, max, count, and avg. Even though the latter seems to do more work, it also iterates only once through the data to compute the additional metrics, and those computations are not expensive.
So if you know you need just the sum, use SumAggregator; but if you also need min, max, count, or avg, go for StatsAggregator instead, so you only iterate once through the data.
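As a rough sketch of the use case described above, a date histogram with a nested stats aggregation (the index and field names @timestamp and value are assumptions, and swapping stats for sum is a one-line change):

# One bucket per day, with full stats over "value" in each bucket.
# Index and field names are assumptions, not from the original question.
GET my-index/_search
{
  "size": 0,
  "aggs": {
    "per_day": {
      "date_histogram": { "field": "@timestamp", "interval": "day" },
      "aggs": {
        "value_stats": { "stats": { "field": "value" } }
      }
    }
  }
}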

Elastic search calculation with data from different indexes

Good day, everyone. I have a bit of a strange use case for Elasticsearch.
There are two different indexes, and each index contains one data type.
The first type contains the following fields relevant to this case:
keyword (text, keyword),
URL (text, keyword),
position (number).
The second type contains the following fields:
keyword (text, keyword),
numberValue (number).
I need to do the following:
1. Group the data from the first index by URL.
2. For each object in a group, calculate a new metric (metric A) with the simple formula: position * numberValue * Param.
3. For each group, calculate the sum of the metric A values computed in stage 2.
4. Order the resulting groups descending by the sums calculated in stage 3.
5. Take some interval of the resulting groups.
Param is a parameter I need to set for the calculation; it is not stored in Elasticsearch.
It is not a difficult algorithm, but the data is in different indices, I don't know how to do it fast, and I'd prefer to do it at the Elasticsearch level.
I don't know how to build an efficient search or data-processing pipeline that can help me implement this case.
I use ES version 6.2.3, if that matters.
Please give me some advice on how I can implement this algorithm.
Judging by step 2, you seem to assume keyword is some sort of primary key. Elasticsearch is not an RDB and can only reason over one document at a time, so unless numberValue and position are (indexed) fields of the same document, you can't combine them.
The rest of the items seem achievable with the help of aggregations; see the sketch below.
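To illustrate: if position and numberValue were denormalized into the same document, steps 1 through 4 could look roughly like this (a sketch for ES 6.x; the index name merged is hypothetical, URL must be aggregatable, and Param is passed in as a script parameter):

# Group by URL, sum position * numberValue * Param per group,
# and order the groups by that sum, descending.
# Index name "merged" and the Param value 0.5 are assumptions.
GET merged/_search
{
  "size": 0,
  "aggs": {
    "by_url": {
      "terms": {
        "field": "URL",
        "size": 20,
        "order": { "metricA_sum": "desc" }
      },
      "aggs": {
        "metricA_sum": {
          "sum": {
            "script": {
              "lang": "painless",
              "source": "doc['position'].value * doc['numberValue'].value * params.Param",
              "params": { "Param": 0.5 }
            }
          }
        }
      }
    }
  }
}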

Retrieve length of an array field in ElasticSearch

A field in an index I'm building gets regularly appended to. I'd like to be able to query Elasticsearch for the number of current items. Is it possible to query for the length of an array field? I could write something to bring back the field and count the items, but some documents have a large number of entries, so I'm looking for something done in place in ES.
Here's what I recommend. Perform a terms aggregation over _uid, then perform another aggregation over all the values in the array field and sum the doc_counts.
But the cost of such an operation really depends on the number of records you need to query; it can be an expensive operation.
Another option is to store the count of array elements as another field and query it directly; given that you are already storing large arrays in your documents, having an integer field for the size seems a fair trade-off. If you need the count to filter records, I would recommend using a scripted filter, as explained here and sketched below.
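As a sketch of the scripted-filter idea in current syntax (the tags field name is hypothetical; note that doc values on a keyword field are deduplicated, so doc['tags'].size() counts distinct values, which is one more reason to prefer a stored count field):

# Keep only documents whose "tags" array has more than min_count
# distinct values. The field name and threshold are assumptions.
GET my-index/_search
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "source": "doc['tags'].size() > params.min_count",
            "params": { "min_count": 10 }
          }
        }
      }
    }
  }
}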

Performance of Elastic Time Range Queries against Out of Range Indexes

It is common to have Elasticsearch indices with dates in their names, in particular from something like Logstash.
So for example, you have indices like foo-2016.05.01, foo-2016.05.02, etc...
When doing a time range query for data, what is the cost of querying indexes that I already know won't have data for that time range?
So, for example, suppose a time range query only asks for data from 2016.05.02, but I also include the foo-2016.05.01 index in my query.
Is that basically a quick one-op per index where the index knows it has no data in that date range, or will doing this be costly to performance? I'm hoping not only to know the yes/no answer, but to understand why it behaves the way it does.
Short version: it's probably going to be expensive. The cost will be n, where n is the number of distinct field values for the date data. If all entries in the index had an identical date field value, it would be a cheap query of one check (and would be pointless, since it would be a binary "all or nothing" response at that point). Of course, the reality is usually that every single doc has a unique date field value (which increments, such as in a log), depending on how granular the date is (assuming here that the time is included to seconds or milliseconds). Elasticsearch will check each aggregated, unique date field value of the included indices to try to find documents that match the field by satisfying the predicates of the range query. This is the nature of the inverted index (indexing documents by their fields).
An easy way to improve performance is to change the range query to a range filter, which caches results and improves performance for requests beyond the first one. Of course, this is only valuable if you repeat the same range filter over time (the cache is read more than it is written), and if the range is not part of scoring the documents (that is to say, those in range are not more valuable than those not in range when returning a set of both, also known as "boosting").
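In the ES 1.x syntax this answer dates from, that looks roughly like the following (the @timestamp field name is an assumption):

# A filtered query with a cacheable range filter (ES 1.x era syntax).
# The "@timestamp" field name is an assumption.
GET foo-2016.05.02/_search
{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "range": {
          "@timestamp": {
            "gte": "2016-05-02T00:00:00",
            "lt": "2016-05-03T00:00:00"
          }
        }
      }
    }
  }
}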
Another way to improve performance is by convention. If you query by day, store each day in its own rolling index and then do pre-search logic to select the indexes to query. This eliminates the need for the filter or query entirely.
Elasticsearch doesn't care about the index name (even if it includes the date), and it doesn't automagically exclude that index from your range query. It will query all the shards (a copy, be it replica or primary) of all the indices specified in the query. Period.
Kibana, on the other hand, knows based on the time range selected to query specific indices only.
If you know your range will not make sense for some indices, exclude those before building the query.
A common approach for the logging use case, where the current day is queried most frequently, is to create an alias. Give it a significant name, like today, that always points to today's index. Also common with time-based indices is a retention period. For these two tasks, managing the aliases and deleting "expired" indices, you can use Curator.
If most of the time you care about the current day, use that alias and thus get rid of the days before today.
If not, filter the indices based on the range before deciding which indices to run the query against.
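For the alias approach, the daily rollover can be a single atomic request (index names follow the foo-YYYY.MM.DD pattern from the question; today is the alias name suggested above):

# Atomically repoint the "today" alias to the current day's index.
POST /_aliases
{
  "actions": [
    { "remove": { "index": "foo-2016.05.01", "alias": "today" } },
    { "add": { "index": "foo-2016.05.02", "alias": "today" } }
  ]
}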

How do I compute facets/aggregations for the top n documents, with pagination in Elasticsearch?

Suppose I have an index for cars on a dealer's car lot. Each document resembles the following:
{
  "color": "red",
  "model_year": "2015",
  "date_added": "2015-07-20"
}
Suppose I have a million cars.
Suppose I want to present a view of the most recently added 1000 cars, along with facets over those 1000 cars.
I could just use from and size to paginate the results up to a fixed limit of 1000, but in doing so the totals and facets on model_year and color (i.e. aggregations) that I get back from Elasticsearch aren't right: they're computed over the entire matched set.
How do I limit my search to the most recently added 1000 documents for pagination and aggregation?
As you probably saw in the documentation, aggregations are performed over the scope of the query itself. If no query is given, aggregations are performed over a match_all set of results. Even if you use size at the query level, it still won't give you what you need, because size is just a way of returning a subset of documents from everything the query matched. Aggregations operate on what the query matches.
This feature request is not new and has been asked for before.
In 1.7 there is no straightforward solution. Maybe you can use the limit filter or the terminate_after in-body request parameter, but these will not return documents that have also been sorted: they give you the first terminate_after docs that matched the query, and that number is per shard; it is not applied after sorting.
In ES 2.0 there is also the sampler aggregation, which works more or less the same way as terminate_after, but it takes into consideration the score of the documents to be considered from each shard. If you just sort by date_added and the query is a match_all, all the documents will have the same score, and it will return an irrelevant set of documents.
In conclusion:
- There is no good solution for this; there are workarounds involving the number of docs per shard. So, if you want 1000 cars, take this number, divide it by the number of primary shards, and use the result with the sampler aggregation or with terminate_after to get a set of documents.
- My suggestion is to use a query that limits the number of documents (cars) by a different criterion instead. For example, show (and aggregate) the cars from the last 30 days or something similar, as sketched below. The criterion should be included in the query itself, so that the resulting set of documents is the one you want aggregated. Applying aggregations to a certain number of documents, after they have been sorted, is not easy.
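A sketch of that suggested workaround, with the limiting criterion in the query itself (the index name cars is an assumption, and with a default dynamic mapping the terms fields would be color.keyword and model_year.keyword):

# Fetch one page of the cars added in the last 30 days, with facets
# computed over that same matched set. Index name is an assumption.
GET cars/_search
{
  "size": 100,
  "sort": [ { "date_added": "desc" } ],
  "query": {
    "range": { "date_added": { "gte": "now-30d/d" } }
  },
  "aggs": {
    "by_color": { "terms": { "field": "color" } },
    "by_model_year": { "terms": { "field": "model_year" } }
  }
}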
