I know how Elasticsearch indexes words and strings, but I wonder whether there's a different behaviour for timestamps.
We have an internal Elasticsearch instance that indexes events (millions of events per day).
I want to pull, once every X seconds, all the events that we received in the last X seconds.
Does Elasticsearch index timestamps in an efficient way, so that we don't need to traverse all the documents to return the relevant results? How does it index this data?
Anything numeric, like date fields, integer fields, geo fields, etc., is not stored in the inverted index but in BKD trees (since ES 5), which are especially suited for range queries, i.e. for finding the unordered set of doc IDs that meet the time range conditions.
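As a minimal sketch of the polling pattern from the question (the index name events and the field name @timestamp are placeholders for your own), a range query like this lets the BKD tree do the work instead of a document scan:

# index and field names below are placeholders
GET events/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-30s",
        "lt": "now"
      }
    }
  }
}

Here now-30s stands in for "the last X seconds" with X=30; the date math is resolved at query time, so the same request body can be reused on every poll.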
Related
With our current implementation of the search engine we do something like:
search by date range from/to (by @timestamp)
get all indices by some prefix (e.g. technical-logs*)
filter out only those indices that fall within the from/to range (e.g. if from=20230101 and to=20230118, then we select all indices in that range with the prefix technical-logs-yyyyMMdd)
It seems like data streams could be beneficial for us. The problem I see is that all indices created by data streams are hidden by default, so I won't be able to see them (by default) and therefore won't be able to query only the indices I'm interested in (from/to).
Is there some easy mechanism to select only the indices we want, or does ES have some functionality for that? I know that there is a @timestamp field, but I don't know whether it is also used to filter out only the indices that contain a given date.
That's the whole point of data streams: you don't need to know which indices to query. You just query the data stream (i.e. like an alias), or a subset thereof (technical-logs*), and ES will make sure to only query the underlying indices that satisfy your constraints (from/to time interval, etc.).
Time-series data streams use time-bound indices. Each of those backing indices is sorted by @timestamp, so that when you search for a specific time interval, ES will only query the relevant backing indices.
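For illustration (assuming your data stream is named technical-logs; adjust to your naming), the manual index selection collapses into a single search against the stream:

# the stream name and the date bounds are taken from the question
GET technical-logs/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2023-01-01",
        "lte": "2023-01-18"
      }
    }
  }
}

ES can then skip any backing index whose time bounds cannot overlap the requested interval, without you ever naming the hidden indices.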
I have data with the following format:
{
  timestamp: Date,
  x: number
}
I want to display these values simply as a line, without any aggregation over x, but Kibana always requires me to select some kind of aggregation, like average.
It is possible to create the line chart you request, but for Kibana to create a visualization, I'm afraid an aggregation is necessary.
Kibana bases its visualizations on buckets (Date, x-axis) and metrics (x, y-axis). Buckets are aggregations of documents over a specified search (almost 30 aggregation methods). Metrics are values computed from the documents contained in each bucket (almost 20 aggregation methods) (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html).
However, you could try to create buckets with a date_histogram whose time interval is small enough that each bucket contains at most one document. Then, for the metric aggregation, you could select a min or max aggregation. (Note: this assumes that the timestamp is unique for each document.)
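A sketch of that workaround (the index name my-data is a placeholder; fixed_interval assumes ES 7+, older versions call it interval):

# my-data is a placeholder index name
GET my-data/_search
{
  "size": 0,
  "aggs": {
    "over_time": {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "1s"
      },
      "aggs": {
        "x": {
          "max": {
            "field": "x"
          }
        }
      }
    }
  }
}

With at most one document per bucket, the max of x is just the raw value of x, so a line drawn from these buckets shows the unaggregated series.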
A field in an index I'm building gets regularly appended to. I'd like to be able to query Elasticsearch to count the number of current items. Is it possible to query for the length of an array field? I can write something to bring back the field and count the items, but some of them have a large number of entries, so I am looking for something that is done in place in ES.
Here's what I recommend: perform a terms aggregation over _uid, then another aggregation over the values of the array field, and sum the doc_counts.
Whether that is reasonable really depends on the number of records you need to query; otherwise it can be an expensive operation.
Another option is to store the count of array elements as another field and query it directly; given that you are already storing large arrays in your documents, an integer field for the size seems a fair trade-off. If you need the count to filter records, I would recommend using a scripted filter, as explained here.
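In current query DSL, the scripted filter is a script query in the filter context of a bool query. A sketch (the index name my-index, the field name items, and its keyword sub-field are assumptions about your mapping):

# my-index and items.keyword are assumed names; adjust to your mapping
GET my-index/_search
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "source": "doc['items.keyword'].size() >= params.min_count",
            "params": {
              "min_count": 10
            }
          }
        }
      }
    }
  }
}

One caveat: doc values for keyword fields are stored as a sorted set per document, so duplicate entries in the array are counted once; if duplicates matter, the indexed count field is the more faithful option.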
It is common to have Elasticsearch indices with dates in their names, in particular from something like Logstash.
So for example, you have indices like foo-2016.05.01, foo-2016.05.02, etc...
When doing a time range query for data, what is the cost of querying indices that I already know won't have data for that time range?
So for example, the time range query only asks for data from 2016.05.02, but I also include the foo-2016.05.01 index in my query.
Is that basically a quick one-op per index where the index knows it has no data in that date range, or will doing this be costly to performance? I'm hoping not only to know the yes/no answer, but to understand why it behaves the way it does.
Short version: it's probably going to be expensive. The cost will be n, where n is the number of distinct values of the date field. If all entries in the index had an identical date field value, it'd be a cheap query of one check (and a pointless one, since the response would be a binary all-or-nothing at that point). In reality, usually every single doc has a unique date field value (incrementing, as in a log), depending on how granular the date is (assuming here that the time is included to seconds or milliseconds). Elasticsearch will check each unique date field value of the included indices to find the documents that satisfy the predicates of the range query. This is the nature of the inverted index (documents are indexed by their field values).
An easy way to improve performance is to change the range query to a range filter, which caches results and improves performance for requests beyond the first one. Of course, this is only valuable if you repeat the same range filter over time (the cache is read more than it is written), and if the range is not part of scoring the documents (that is to say, docs in range are not more valuable than those out of range when returning a set of both, also known as "boosting").
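Range filters are Elasticsearch 1.x syntax; in current versions the equivalent is a range clause in the filter context of a bool query, which skips scoring and is eligible for caching. A sketch, reusing the index pattern from the example (the field name timestamp is an assumption; Logstash typically uses @timestamp):

# field name is an assumption about the mapping
GET foo-*/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "timestamp": {
            "gte": "2016-05-02",
            "lt": "2016-05-03"
          }
        }
      }
    }
  }
}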
Another way to improve performance is by convention. If you query by day, store each day in its own rolling index and then do pre-search logic to select the indexes to query. This eliminates the need for the filter or query entirely.
Elasticsearch doesn't care about the index name (even when it includes the date), and it doesn't automagically exclude that index from your range query. It will query one copy, be it a replica or a primary, of every shard of all the indices specified in the query. Period.
Kibana, on the other hand, uses the selected time range to decide which specific indices to query.
If you know your range will not make sense on some indices, then exclude those from the query before creating the query.
A common approach for the logging use case, when the current day is queried most frequently, is to create an alias. Give it a meaningful name - like today - that will always point to today's index. Also common with time-based indices is a retention period. For these two tasks - managing the aliases and deleting the "expired" indices - you can use Curator.
If most of the time you care about the current day, use that alias and thus get rid of the days before today.
If not, then filter the set of indices to be queried based on the range before deciding which indices to run the query against.
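For reference, the today alias can be rolled over in a single atomic _aliases call (the index names are illustrative; Curator automates exactly this kind of step):

# index names are illustrative
POST _aliases
{
  "actions": [
    { "remove": { "index": "foo-2016.05.01", "alias": "today" } },
    { "add": { "index": "foo-2016.05.02", "alias": "today" } }
  ]
}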
Can I use an Elasticsearch facet (or aggregation) to get, e.g., the average of the maximum value in each document, rather than the average of all values across all documents?
While you can use a script_field for this, I'd recommend determining the maximum at index time and then storing it in a separate field to facet/aggregate on. It'll be faster, and the memory requirement for doing the facet will be much smaller.
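If you do want to compute it at query time instead, the aggregation-era counterpart is a scripted avg aggregation. A sketch (assumes a numeric array field named values with doc values, Painless as the script language, and that every document has at least one value; docs without the field would need an extra guard):

# my-index and values are assumed names
GET my-index/_search
{
  "size": 0,
  "aggs": {
    "avg_of_per_doc_max": {
      "avg": {
        "script": {
          "source": "double m = Double.NEGATIVE_INFINITY; for (def v : doc['values']) { if (v > m) { m = v; } } return m;"
        }
      }
    }
  }
}

The precomputed-field approach from the answer amounts to indexing that per-document maximum as, say, max_value and running a plain avg aggregation on it, which avoids running the script for every document on every request.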