Detect Elasticsearch index last modified time

I'm wondering what would be an efficient way to detect the last modified timestamp of an index in Elasticsearch. I have read posts about adding a timestamp field in an ingest pipeline, but that approach has limitations (e.g. it seems only newly indexed documents get the timestamp?).
If only a handful of indices need their last modified time tracked, what would be the most efficient way? Would periodically querying and comparing results between queries give us an approximate last modified time? Are there other ways to track these events in ES?

There is a creation_date index setting, but no comparable update_date. The reasoning behind this is that updating such a setting on every indexing event would be very expensive, even more so in a distributed environment.
You could use something like the _meta mapping field, but it has the same limitation as adding a timestamp to individual documents: you would have to update it yourself on every write.
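Absent a built-in update_date, a minimal sketch of the "periodically query and compare" idea from the question might look like this. The snapshot shape ({'docs': ..., 'max_ts': ...}) and the idea of polling the _stats API and a max(@timestamp) aggregation are assumptions for illustration, not an Elasticsearch API:

```python
from datetime import datetime, timezone

def infer_last_modified(prev, curr, now=None):
    """Compare two polled snapshots of an index, each a dict like
    {'docs': <doc count>, 'max_ts': <max @timestamp or None>}.

    If anything changed between polls, treat the current poll time as the
    approximate last-modified time; otherwise keep the previous estimate.
    In a real setup the snapshots would come from the _stats API plus a
    max-aggregation query (both hypothetical here).
    """
    now = now or datetime.now(timezone.utc)
    changed = prev["docs"] != curr["docs"] or prev["max_ts"] != curr["max_ts"]
    return now if changed else prev.get("last_modified")
```

The granularity is bounded by the polling interval, so this only ever yields an approximate last-modified time, as the question anticipates.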

Elasticsearch data-stream selecting indices to search

With our current implementation of the search engine we do something like:
search by a date range from/to (by @timestamp)
get all indices with some prefix (e.g. technical-logs*)
filter out only those indices that fall within the from/to range (e.g. if from=20230101 and to=20230118, we select all indices in that range with the prefix technical-logs-yyyyMMdd)
It seems like data streams could be beneficial for us. The problem I see is that the backing indices created by a data stream are hidden by default, so I won't see them (by default) and therefore won't be able to query only the indices I'm interested in (from/to).
Is there some easy mechanism to select only the indices we want, or does ES have some functionality for that? I know there is the @timestamp field, but I don't know whether it is also used to filter out only the indices that contain a given date.
That's the whole point of data streams: you don't need to know which indices to query, you just query the data stream itself (it behaves like an alias, e.g. technical-logs*) and ES will make sure to only query the backing indices that satisfy your constraints (the from/to time interval, etc.).
Time-series data streams use time-bound backing indices. Each backing index is sorted by @timestamp, so when you search a specific time interval, ES only queries the relevant backing indices.
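For reference, the manual pre-filtering step described in the question can be sketched as plain date arithmetic on the index names (the prefix and yyyyMMdd suffix convention are taken from the question; with data streams this step becomes unnecessary):

```python
from datetime import date, datetime

def indices_in_range(names, prefix, start, end):
    """Filter daily indices like 'technical-logs-20230105' to those whose
    yyyyMMdd suffix falls within [start, end] (both datetime.date)."""
    selected = []
    for name in names:
        if not name.startswith(prefix):
            continue
        try:
            d = datetime.strptime(name[len(prefix):], "%Y%m%d").date()
        except ValueError:
            # Suffix is not a date; skip names that don't match the convention.
            continue
        if start <= d <= end:
            selected.append(name)
    return selected
```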

Using Search After without Index Sorting

I use Elasticsearch 6.2.4 and currently use Scroll to handle queries that return more than 10,000 documents. Since most queries (90%) return fewer than that and involve real-time usage, building a scroll context is inefficient, so I'm considering switching to the search_after feature.
I noticed that the example in the search_after docs uses a sort, and if I understand correctly it will be applied to every query. Will Elasticsearch re-sort the results again and again for every page? That could have a huge performance impact.
I read about index sorting; can it solve my problem, since each shard would then already be sorted?
Is there any reason to use search_after without index sorting?
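For context, search_after is stateless: each page is an ordinary sorted search whose request body carries the sort values of the last hit from the previous page. A minimal sketch of building those request bodies (the field names in the usage note are illustrative, not from the question):

```python
def next_search_body(base_query, sort, last_hit_sort=None, size=100):
    """Build the request body for one search_after page.

    `sort` should end with a tiebreaker field (e.g. _id in 6.x) so page
    boundaries are stable. `last_hit_sort` is the `sort` array returned
    with the last hit of the previous page; omit it for the first page.
    """
    body = {"query": base_query, "sort": sort, "size": size}
    if last_hit_sort is not None:
        body["search_after"] = last_hit_sort
    return body
```

Each page is still a full sorted query, but with an index sort matching the request sort, shards can stream hits in order instead of sorting at query time, which is exactly the combination the question is weighing.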

Performance of Elastic Time Range Queries against Out of Range Indexes

It is common to have Elasticsearch indices partitioned by date, in particular from something like Logstash.
So, for example, you have indices like foo-2016.05.01, foo-2016.05.02, etc.
When doing a time range query, what is the cost of querying indexes that I already know won't have data for that time range?
For example, suppose a time range query only asks for data from 2016.05.02, but I also include the foo-2016.05.01 index in the query.
Is that basically a quick one-op per index, where the index knows it has no data in that date range, or will this be costly to performance? I'm hoping not only to learn the yes/no answer, but to understand why it behaves the way it does.
Short version: it's probably going to be expensive. The cost will be roughly n, where n is the number of distinct values of the date field. If all entries in the index had an identical date value, it would be a cheap query of one check (and would be pointless, since it would be a binary all-or-nothing response at that point). In reality, usually every single document has a unique date value (such as an incrementing timestamp in a log), depending on how granular the date is (assuming the time is included to seconds or milliseconds). Elasticsearch will check each aggregated, unique date value in the included indices to find documents that satisfy the predicates of the range query. This is the nature of the inverted index (documents are indexed by their field values).
An easy way to improve performance is to change the range query to a range filter, which caches results and improves performance for requests beyond the first one. Of course, this is only valuable if you repeat the same range filter over time (the cache is read more than it is written), and if the range is not part of scoring the documents (that is, documents in range are not considered more valuable than those out of range when returning a set of both - also known as "boosting").
Another way to improve performance is by convention. If you query by day, store each day in its own rolling index and then do pre-search logic to select the indexes to query. This eliminates the need for the filter or query entirely.
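The "pre-search logic" for the convention-based approach can be sketched as generating only the index names inside the requested range (the foo-yyyy.MM.dd naming follows the question's example):

```python
from datetime import date, timedelta

def daily_indices(prefix, start, end):
    """Expand a [start, end] date range into daily index names like
    'foo-2016.05.01', so only in-range indices are passed to the search."""
    names = []
    d = start
    while d <= end:
        names.append(f"{prefix}-{d:%Y.%m.%d}")
        d += timedelta(days=1)
    return names
```

The resulting list would then be joined with commas into the search URL's index part, so out-of-range indices are never touched at all.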
Elasticsearch doesn't care about the index name (even if it includes the date) and it doesn't automagically exclude that index from your range query. It will query all the shards (one copy, be it replica or primary) of all the indices specified in the query. Period.
Kibana, on the other hand, knows based on the time range selected to query specific indices only.
If you know your range will not make sense for some indices, exclude those before building the query.
A common approach for the logging use case, when the current day is queried most frequently, is to create an alias. Give it a meaningful name - like today - that always points to today's index. Also common with time-based indices is a retention period. For these two tasks - managing the aliases and deleting the "expired" indices - you can use Curator.
If most of the time you only care about the current day, use that alias and you automatically exclude the days before today.
If not, filter the indices to be queried based on the range before deciding which indices to run the query against.

Elasticsearch: Modifying Field Normalization at Query Time (omit_norms in queries)

Elasticsearch takes the length of a document into account when ranking (they call this field-length normalization). The default behavior is to rank shorter matching documents higher than longer matching documents.
Is there any way to turn off or modify field-length normalization at query time? I am aware of the index-time omit_norms option, but I would prefer not to reindex everything just to try this out.
Also, instead of simply turning off field-length normalization, I would like to try out a few things. I want to take field length into account, but not as heavily as Elasticsearch currently does. With the default behavior, a document will rank 2 times higher than a document which is two times longer. I want to try a non-linear relationship between ranking and length.
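As an illustration of what is being tuned (not a query-time switch, which is what the question asks for): Lucene's classic similarity applies a 1/sqrt(length) norm, while BM25 (the default in later Elasticsearch versions) exposes a b parameter that scales the length penalty non-linearly, which is close to what the question wants. A sketch of the two formulas, under the standard BM25 defaults:

```python
import math

def classic_length_norm(num_terms):
    """Lucene classic (TF-IDF) length norm: 1 / sqrt(field length in terms)."""
    return 1.0 / math.sqrt(num_terms)

def bm25_tf_weight(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """BM25 term-frequency weight; b scales the length penalty.

    b=0 disables length normalization entirely; values between 0 and 1
    penalize longer-than-average documents progressively, not linearly.
    """
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
```

Lowering b softens the length penalty without removing it, whereas omit_norms is all-or-nothing; changing either, however, is a per-field similarity setting rather than something applied per query.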

index size impact on search speed (to store or not to store)

Right now we are using Solr as a fulltext index, where all fields of the documents are indexed but not stored.
There are some million documents; the index size is 50 GB. Average query time is around 100 ms.
To use features like highlighting, we are thinking about additionally storing the text. But that could double the size of the index files.
I know there is no (linear) relation between index size and query time. Increasing the number of documents by a factor of 10 results in nearly no difference in query time.
But still, the system (Solr/Lucene/Linux/...) has to handle more information - the index files, for example, occupy many more inodes, and so on.
So I'm sure there is some impact of index size on query time. (But: is it noticeable?)
1st:
Do you think I'm right?
Do you have any experience with index size and search speed with vs. without stored text?
Is it smart and reasonable to blow up the index by storing the documents?
2nd:
Do you know how Solr/Lucene handles stored text? Maybe in separate files? (So that there is no impact on simple searches where no stored text is needed!?)
Thank you.
Yes, it's absolutely true that the index grows if you make big fields stored, but if you want to highlight them, you don't have another way. I don't think speed will decrease that much; you may just need to transfer more data when retrieving results, but that's not that relevant.
Regarding the Lucene index format and the different files within the index, have a look at the Lucene file formats documentation: the stored fields are kept in their own dedicated files.
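A sketch of what that looks like in a Solr schema.xml, choosing per field what is stored (the field and type names here are illustrative, not from the question):

```xml
<!-- Hypothetical sketch: index everything, but store only what you need
     to highlight or display. Stored field data lives in separate files of
     the Lucene index, so searches that don't fetch it don't read it. -->
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="body"  type="text_general" indexed="true" stored="true"/>
<field name="keywords" type="text_general" indexed="true" stored="false"/>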
