Elasticsearch index by date search performance - to split or not to split

Elasticsearch index by date search performance - to split or not to split - elasticsearch

I am currently playing around with Elasticsearch (ES). We are ingesting sensor data and for 3 years we have approximately 1,000,000,000 documents in one index, making the index about 50GB in size. Indexing performance is not that important as new data only arrives every 15 minutes per sensor on average, therefore I want to focus on searching and aggregating performance. We are running a front-end showing basically a dashboard about average values from last week compared to one year before etc.
I am using ES on AWS and after performance on one machine was quite slow, I spun up a cluster with 3 data nodes (each 2 cores, 8 GB mem), and gave the index 3 primary shards and one replica. Throwing computing power at the data certainly improved the situation and more power would help more, but my question is:
Would splitting the index for example by month increase the performance? Or being more specific: is querying (esp. by date) a smaller index faster if I adjust the queries adequatly, or does ES already 'know' where to find specific dates in a shard?
(I know about other benefits of having smaller indices, like being able to roll over and keep only a specific time interval, etc.)

1/ Elasticsearch only knows where to find a specific date in an index if your index is sorted by your date field. You can check the documentation here.
In your use case, it can improve drastically search performance. And since all the data will be added at the "end of the index" since its date sorted, you should not see much of indexation overhead.
2/ Without index sort, smaller time-bounded indices will work better (even if you target all your indices) since it will often allow a rewrite or your range query to a match_all / match_none internal query.
For more information about this behavior you should read this blog post :
Instant Aggregations: Rewriting Queries for Fun and Profit

Related

Tradeoffs between large and small indexes on Elasticsearch

Generally speaking, which are the tradeoffs (in terms of performance and memory usage) between large and small indexes in Elasticsearch?
Elaborating a little:
Consider a cluster with 8 nodes, each node with 1 shard and 30Gb allocated to the JVM.
Consider also a scenario with 50 million of documents per day (all with the same structure and using doc-values), retained for 90 days. Each day of documents has about 35Gb on disk.
I want to run some queries in these cluster, covering a total of 12 hours of data.
These queries are composed by some nested aggregations: a date-histogram, followed by a cardinality and a percentile aggregation.
Considering the amount of data, which is better: use daily-indexes or only a single index?
PS: I know that is a "vague" question. My question is more theoretical.
I want to understand better what occur during an aggregation and how this relates to the number of indexes.

ElasticSearch indexing with hundreds of indices

I have the following scenario:
More than 100 million items and counting (10 million added each month).
8 Elastic servers
12 Shards for our one index
Until now, all of those items were indexed in the same index (under different types). In order to improve the environment, we decided to index items by geohash code when our mantra was - not more than 30GB per shard.
The current status is that we have more than 1500 indices, 12 shards per index, and every item will be inserted into one of those indices. The number of shards surpassed 20000 as you can understand....
Our indices are in the format <Base_Index_Name>_<geohash>
My question is raised due to performance problems which made me question our method. Simple count query in the format of GET */_count
takes seconds!
If my intentions is to question many indices, is this implementation bad? How many indices should a cluster with 8 virtual servers have? How many shards? We have a lot of data and growing fast.

Actually it is depends on your usage. Query to all of the indices takes long time because query should go to all of the shards and results should be merged afterwards. 20K shard is not an easy task to query.
If your data is time based , I would advise to add month or date information to the index name and change your query to GET indexname201602/search or GET *201602.
That way you can drastically reduce the number of shards that your query executes and it will take much less time

ElasticSearch Scale Forever

ElasticSearch Community:
Suppose I have a customer named Twetter who has hired me today to build out their search capability for a 181 word social media site.
Assume I cannot predict the number of shards I will need for future scaling and the storage size is already in tens of terabytes.
Assume I do not need to edit any documents once they are indexed. This is strictly for searching.
Referencing the image above, there seems to be some documents which point to 'rolling indexes' ref1 ref2 ref3 whereby I may create a single index (ea. index named tweets1 -> N) on-the-fly. When one index fills up, I can simply add a new machine, with a new index, and add it to the same cluster and alias for searching.
Does this architecture hold water in production?
Are there any long term ramifications to this 'rolling index' architecture as opposed to predicting a shard count and scaling within that estimate?

A shard in elasticsearch is just a lucene index. An elasticsearch index is just a collection of lucene indices (shards). Given that, for capacity planning in your situation you simply need to figure out how many documents you can store in an index with only one shard and still get the query performance you want.
It is the underlying lucene indices that use up resources. Based on how your documents are indexed within the lucene indices, there is a finite number of shards that any single node in your cluster will be able to handle. You can always scale by adding more nodes to the cluster. Just monitor resource usage and query response times to know when to add more nodes.
It is perfectly reasonable to create indices named tweet_1, tweet_2, tweet_3, etc. rolling forward instead of worrying about resharding your data. It accomplishes the same thing in the end. Just use an index alias to hide the numbers.
Once you figure out how many documents you can store per shard to get your query performance, then decide how many shards per index you want to have and then multiply those numbers and cap the index at that number of documents in your code. Once you reach the cap you just roll over to a new index. Here is what I do in my code to determine which index to send a document to (I have sequential ids):
$index = 'file_' . (int)($fid / $docsPerIndex);
Note that I am using index templates so it can automatically create a new index without me having to manually roll over when the cap is reached.
One other consideration is what type of queries you will be performing. As the data grows you have two options for scaling.
You need to have enough nodes in your cluster for parallelizing the query that it can easily search across all indices and still respond quickly.
or
You need to name your indices such that you know which to query and only need to query a subset of the indices in the cluster.
Keep in mind that if you have sequential or predictable ids then elasticsearch can perform id based queries efficiently without actually having to query the whole cluster. If you let ES automatically assign ids (assuming you are using ES >=1.4.0) it will use predictable ids (flake ids) already. This also speeds up indexing. Random ids create a worst case scenario.
If your queries are going to be time based then it will have to search the entire set of indices for each query under this scheme. For time based queries you want to roll your indices over based on some amount of time (e.g. each day or month depending on how much data you receive in that time frame) and name them something like tweets_2015_01, tweets_2015_02, etc. By doing so you can narrow the set of indices you have to search at query time based on the requested search time range.

Should I control the Index size in Elastic Search?

I have a fast growing database and I'm using Elastic Search to manage it.it has only one index and gets 200 K new documents per day. each document contains of about 5 KB text.
Should I keep using only one index or it's better to have one index for each day or something else?
If so, what's the benefits of having multiple indices?

You should definitely worry about the maximum size of your shards/index. We use daily indexes for stuff where we are inserting millions of records per day and monthly indexes where were are inserting millions per month.
A good rule of thumb is that shards should max out around 4 GB (remember there are a configurable number of shards per index).
The advantage is that when you have daily/weekly/monthly indexes, you can eventually close/delete them when your cluster becomes too big or the data isn't useful anymore. If your data is time series data, you can craft your queries to only hit the indexes that are used for the given data. Also if you've made a mistake in how many shards you really need, you can correct it going forward (because you create a new index periodically).
The disadvantage is then that you have to manage all of the extra indexes, but there are tools to do that (elasticsearch-curator for example).

Performance issues using Elasticsearch as a time window storage

We are using elastic search almost as a cache, storing documents found in a time window. We continuously insert a lot of documents of different sizes and then we search in the ES using text queries combined with a date filter so the current thread does not get documents it has already seen. Something like this:
"((word1 AND word 2) OR (word3 AND word4)) AND insertedDate > 1389000"
We maintain the data in the elastic search for 30 minutes, using the TTL feature. Today we have at least 3 machines inserting new documents in bulk requests every minute for each machine and searching using queries like the one above pratically continuously.
We are having a lot of trouble indexing and retrieving these documents, we are not getting a good throughput volume of documents being indexed and returned by ES. We can't get even 200 documents indexed per second.
We believe the problem lies in the simultaneous queries, inserts and TTL deletes. We don't need to keep old data in elastic, we just need a small time window of documents indexed in elastic at a given time.
What should we do to improve our performance?
Thanks in advance
Machine type:
An Amazon EC2 medium instance (3.7 GB of RAM)
Additional information:
The code used to build the index is something like this:
https://gist.github.com/dggc/6523411
Our elasticsearch.json configuration file:
https://gist.github.com/dggc/6523421
EDIT
Sorry about the long delay to give you guys some feedback. Things were kind of hectic here at our company, and I chose to wait for calmer times to give a more detailed account of how we solved our issue. We still have to do some benchmarks to measure the actual improvements, but the point is that we solved the issue :)
First of all, I believe the indexing performance issues were caused by a usage error on out part. As I told before, we used Elasticsearch as a sort of a cache, to look for documents inside a 30 minutes time window. We looked for documents in elasticsearch whose content matched some query, and whose insert date was within some range. Elastic would then return us the full document json (which had a whole lot of data, besides the indexed content). Our configuration had elastic indexing the document json field by mistake (besides the content and insertDate fields), which we believe was the main cause of the indexing performance issues.
However, we also did a number of modifications, as suggested by the answers here, which we believe also improved the performance:
We now do not use the TTL feature, and instead use two "rolling indexes" under a common alias. When an index gets old, we create a new one, assign the alias to it, and delete the old one.
Our application does a huge number of queries per second. We believe this hits elastic hard, and degrades the indexing performance (since we only use one node for elastic search). We were using 10 shards for the node, which caused each query we fired to elastic to be translated into 10 queries, one for each shard. Since we can discard the data in elastic at any moment (thus making changes in the number of shards not a problem to us), we just changed the number of shards to 1, greatly reducing the number of queries in our elastic node.
We had 9 mappings in our index, and each query would be fired to a specific mapping. Of those 9 mappings, about 90% of the documents inserted went to two of those mappings. We created a separate rolling index for each of those mappings, and left the other 7 in the same index.
Not really a modification, but we installed SPM (Scalable Performance Monitoring) from Sematext, which allowed us to closely monitor elastic search and learn important metrics, such as the number of queries fired -> sematext.com/spm/index.html
Our usage numbers are relatively small. We have about 100 documents/second arriving which have to be indexed, with peaks of 400 documents/second. As for searches, we have about 1500 searches per minute (15000 before changing the number of shards). Before those modifications, we were hitting those performance issues, but not anymore.

TTL to time-series based indexes
You should consider using time-series-based indexes rather than the TTL feature. Given that you only care about the most recent 30 minute window of documents, create a new index for every 30 minutes using a date/time based naming convention: ie. docs-201309120000, docs-201309120030, docs-201309120100, docs-201309120130, etc. (Note the 30 minute increments in the naming convention.)
Using Elasticsearch's index aliasing feature (http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases/), you can alias docs to the most recently created index so that when you are bulk indexing, you always use the alias docs, but they'll get written to docs-201309120130, for example.
When querying, you would filter on a datetime field to ensure only the most recent 30 mins of documents are returned, and you'd need to query against the 2 most recently created indexes to ensure you get your full 30 minutes of documents - you could create another alias here to point to the two indexes, or just query against the two index names directly.
With this model, you don't have the overhead of TTL usage, and you can just delete the old, unused indexes from over an hour in the past.
There are other ways to improve bulk indexing and querying speed as well, but I think removal of TTL is going to be the biggest win - plus, your indexes only have a limited amount of data to filter/query against, which should provide a nice speed boost.
Elasticsearch settings (eg. memory, etc.)
Here are some setting that I commonly adjust for servers running ES - http://pastebin.com/mNUGQCLY, note that it's only for a 1GB VPS, so you'll need to adjust.
Node roles
Looking into master vs data vs 'client' ES node types might help you as well - http://www.elasticsearch.org/guide/reference/modules/node/
Indexing settings
When doing bulk inserts, consider modifying the values of both index.refresh_interval index.merge.policy.merge_factor - I see that you've modified refresh_interval to 5s, but consider setting it to -1 before the bulk indexing operation, and then back to your desired interval. Or, consider just doing a manual _refresh API hit after your bulk operation is done, particularly if you're only doing bulk inserts every minute - it's a controlled environment in that case.
With index.merge.policy.merge_factor, setting it to a higher value reduces the amount of segment merging ES does in the background, then back to its default after the bulk operation restores normal behaviour. A setting of 30 is commonly recommended for bulk inserts and the default value is 10.

Some other ways to improve Elasticsearch performance:
increase index refresh interval. Going from 1 second to 10 or 30 seconds can make a big difference in performance.
throttle merging if it's being overly aggressive. You can also reduce the number of concurrent merges by lowering index.merge.policy.max_merge_at_once and index.merge.policy.max_merge_at_once_explicit. Lowering the index.merge.scheduler.max_thread_count can help as well
It's good to see you are using SPM. Its URL in your EDIT was not hyperlink - it's at http://sematext.com/spm . "Indexing" graphs will show how changing of the merge-related settings affects performance.

I would fire up an additional ES instance and have it form a cluster with your current node. Then I would split the work between the two machines, use one for indexing and the other for querying. See how that works out for you. You might need to scale out even more for your specific usage patterns.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio