Ways to improve first time indexing in ElasticSearch

Ways to improve first time indexing in ElasticSearch - elasticsearch

In my application, I have a need to re-index all of the data from time to time. I have noticed that the time it takes to index data the first time (via bulk index) is significantly slower than subsequent re-indexing. In one scenario, it takes about 2 hours to perform the indexing the first time, and about 15 minutes (indexing the same data) with subsequent indexing.
While the 2 hours to index the first time is reasonable, I am curious why subsequent iterations to re-index are significantly faster. And more so, I am wondering if there's anything I can do to improve the performance for when indexing the first time, e.g. perhaps by indicating how large the index will be, etc.
Thanks,
Eric

Have you defined a mapping for your types? If not, everytime ES find a new field, the mapping must be updated (and this impact the whole index).
On subsequent indexing, the mapping is already complete. So what you could do is explicitly mapping your types.
Also, you can improve speed of re-indexing by setting the refresh_interval to an higher value, look at this benchmark.

Edited to strike out references to merge_factor as it has been removed in ES 2.0: https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking_20_setting_changes.html#_merge_and_merge_throttling_settings
As Damien indicates, you can indeed influence (bulk) indexing settings - refresh_interval can be set to -1 temporarily and set back to the default value of 1s after you complete your bulk indexing. Another setting to modify is the merge.policy.merge_factor; set it to a higher value such as 30 and then back to the default of 10 once done.
There are a number of tutorials and mailing list discussions about optimizing bulk indexing, but here's some official doc links to start with:
http://www.elasticsearch.org/guide/reference/index-modules/merge/
http://www.elasticsearch.org/guide/reference/api/admin-indices-update-settings/
If you haven't already tuned the memory settings for your JVM, you should. Although specific to a 512mb VPS running Ubuntu 10.04 server, these settings (http://pastebin.com/mNUGQCLY) should point you in the right direction. Basically, allocating the desired amount of RAM to Elasticsearch upon startup can improve JVM memory allocation/GC timing.

Related

ElasticSearch - Setting readonly to an index improves performance?

I have this scenario where a system generates an index every day inside a node. Each index has 1 primary shard.
So, all the documents indexed in a day goes to a certain index, and after the day passed, a new index is created.
I keep the indices of the last 60 days (so this means that I always have 60 shards in the node). I can't close the old indices because I want them to support searches. After the 60th day passed then I delete them.
As I was reading the following article I noticed this about the index buffer:
It defaults to 10%, meaning that 10% of the total memory allocated to a node will be used as the indexing buffer size. This amount is then divided between all the different shards
This means that for the index of the day I have 10% / 60 of buffer index memory. So I am not really using the 10%.
The question is, what happens if I set to read only the older indices:
index.blocks.read_only
Set to true to make the index and index metadata read only, false to allow writes and metadata changes.
Will I see a benefit for doing this? Like having the entire 10% of index buffer in my only writable index? Or an improve in the searches of the other indices since they can be merged in an only segmented as they do not longer recieves writes?
Thanks!
Pablo
PD: Im using ElasticSearch 1.7.3 but Im looking forward to migrate to 2.2 in a future. I would like to know, if possible, if I have a benefit if this would be translated to 2.2

No, you won't get a benefit from adding the read_only block. The only way you would free up the buffer is to close the index entirely.
Also, you may be interested to know about https://github.com/elastic/elasticsearch/pull/14121 which gives "actively indexing" shards a bigger share of the buffer. Coming in Elasticsearch 5.0.

Bulk insert performance in MongoDB for large collections

I'm using the BulkWriteOperation (java driver) to store data in large chunks. At first it seems to be working fine, but when the collection grows in size, the inserts can take quite a lot of time.
Currently for a collection of 20M documents, bulk insert of 1000 documents could take about 10 seconds.
Is there a way to make inserts independent of collection size?
I don't have any updates or upserts, it's always new data I'm inserting.
Judging from the log, there doesn't seem to be any issue with locks.
Each document has a time field which is indexed, but it's linearly growing so I don't see any need for mongo to take the time to reorganize the indexes.
I'd love to hear some ideas for improving the performance
Thanks

You believe that the indexing does not require any document reorganisation and the way you described the index suggests that a right handed index is ok. So, indexing seems to be ruled out as an issue. You could of course - as suggested above - definitively rule this out by dropping the index and re running your bulk writes.
Aside from indexing, I would …
Consider whether your disk can keep up with the volume of data you are persisting. More details on this in the Mongo docs
Use profiling to understand what’s happening with your writes

Do have any index in your collection?
If yes, it has to take time to build index tree.
is data time-series?
if yes, use updates more than inserts. Please read this blog. The blog suggests in-place updates more efficient than inserts (https://www.mongodb.com/blog/post/schema-design-for-time-series-data-in-mongodb)
do you have a capability to setup sharded collections?
if yes, it would reduce time (tested it in 3 sharded servers with 15million ip geo entry records)

Disk utilization & CPU: Check the disk utilization and CPU and see if any of these are maxing out.
Apparently, it should be the disk which is causing this issue for you.
Mongo log:
Also, if a 1000 bulk query is taking 10sec, then check for mongo log if there are any few inserts in the 1000 bulk that are taking time. If there are any such queries, then you can narrow down your analysis
Another thing that's not clear is the order of queries that happen on your Mongo instance. Is inserts the only operation that happens or there are other find queries that run too? If yes, then you should look at scaling up whatever resource is maxing out.

Elasticsearch - queries throttle cpu

Recently our cluster has seen extreme performance degradation. We had 3 nodes, 64 GB, 4CPU (2 core) each for an index that is 250M records, 60GB large. Performance was acceptable for months.
Since then we've:
1. Added a fourth server, same configuration.
2. Split the index into two indexes, query them with an alias
3. Disable paging (windows server 2012)
4. Added synonym analysis on one field
Our cluster can now survive for a few hours before it's basically useless. I have to restart elastic on each node to rectify the problem. We tried bumping each node to 8 cpus (2 cores) with little to no gain.
One issue is that EVERY QUERY uses up 100% of the cpu of whatever node it hits. Every query is facetted on 3+ fields, which hasn't changed since our cluster was healthy. Unfortunately I'm not sure if this was an happening before, but certainly it seems like an issue. We need to be able to respond to more than one request every few seconds obviously. When multiple requests come in at the same time the performance doesn't seem to get worse for those particular responses. Again, over time, the performance slows to a crawl; the CPU (all cores) stays maxed out indefinitely.
I'm using elasticsearch 1.3.4 and the plugin elasticsearch-analysis-phonetic 2.3.0 on every box and have been even when our performance wasn't so terrible.
Any ideas?
UPDATE:
it seems like the performance issue is due to index aliasing. When I pointed the site to a single index that ultimately stores about 80% of the data, the CPU wasn't being throttled. There were still a few 100% spikes, but they were much shorter. When I pointed it back to the alias (which points to two indexes total), I could literally bring the cluster down by refreshing the page a dozen times quickly: CPU usage goes to 100% every query and gets stuck there with many in a row.
Is there a known issue with elastic search aliases? Am I using the alias incorrectly?
UPDATE 2:
Found the cause in the logs. Paging queries are TERRIBLE. Is this a known bug in elastic? If I run an empty query then try and view the last page (from 100,000,000 e.g.) it brings the whole cluster down. That SINGLE QUERY. It gets through the first 1.5M results then quits, all the while taking up 100% of the CPU for over a minute.
UPDATE 3:
So here's somethings else strange. Pointing to an old index on dev (same size, no aliases) and trying to reproduce the paging issue; the cluster doesn't get hit immediately. It has 1% cpu usage for the first 20 seconds after the query. The query returns with an error before the CPU usage every goes up. About 2 minutes later, CPU usage spikes to 100% and server basically crashes (can't do anything else because CPU is so over taxed). On the production index this CPU load is instantaneous (it happens immediately after a query is made)

Without checking certain metrics it is very difficult to identify the cause of slow response or any other issue. But from the data you have mentioned it looks like there are to many cache evictions happening thereby increasing the number of Garbage Collection on your nodes. A frequent Garbage Collection (mainly the old GC) will consume lot of CPU. This in turn will start to affect all cluster.
As you have mentioned it started giving issues only after you added another node. This surprises me. Is there any increase in the traffic?.
Can you include the output of _stats API taken at the time when your cluster slows down. It will have lot of information from which I can make a better diagnosis. Also include a sample of the query.
I suggest you to install bigdesk so that you can have a graphical view of your cluster health more easily.

elasticsearch ttl vs daily dropping tables

I understand that there are two dominant patterns for keeping a rolling window of data inside elasticsearch:
creating daily indices, as suggested by logstash, and dropping old indices, and therefore all the records they contain, when they fall out of the window
using elasticsearch's TTL feature and a single index, having elasticsearch automatically remove old records individually as they fall out of the window
Instinctively I go with 2, as:
I don't have to write a cron job
a single big index is easier to communicate to my colleagues and for them to query (I think?)
any nightmare stream dynamics, that cause old log events to show up, don't lead to the creation of new indices and the old events only hang around for the 60s period that elasticsearch uses to do ttl cleanup.
But my gut tells me that dropping an index at a time is probably a lot less computationally intensive, though tbh I've no idea how much less intensive, nor how costly the ttl is.
For context, my inbound streams will rarely peak above 4K messages per second (mps) and are much more likely to hang around 1-2K mps.
Does anyone have any experience with comparing these two approaches? As you can probably tell I'm new to this world! Would appreciate any help, including even help with what the correct approach is to thinking about this sort of thing.
Cheers!

Short answer is, go with option 1 and simply delete indexes that are no longer needed.
Long answer is it somewhat depends on the volume of documents that you're adding to the index and your sharding and replication settings. If your index throughput is fairly low, TTLs can be performant but as you start to write more docs to Elasticsearch (or if you a high replication factor) you'll run into two issues.
Deleting documents with a TTL requires that Elasticsearch runs a periodic service (IndicesTTLService) to find documents that are expired across all shards and issue deletes for all those docs. Searching a large index can be a pretty taxing operation (especially if you're heavily sharded), but worse are the deletes.
Deletes are not performed instantly within Elasticsearch (Lucene, really) and instead documents are "marked for deletion". A segment merge is required to expunge the deleted documents and reclaim disk space. If you have large number of deletes in the index, it'll put much much more pressure on your segment merge operations to the point where it will severely affect other thread pools.
We originally went the TTL route and had an ES cluster that was completely unusable and began rejecting search and indexing requests due to greedy merge threads.
You can experiment with "what document throughput is too much?" but judging from your use case, I'd recommend saving some time and just going with the index deletion route which is much more performant.

I would go with option 1 - i.e. daily dropping of indices.
Daily Dropping Indices
pros:
This is the most efficient way of deleting data
If you need to restructure your index (e.g. apply a new mapping, increase number of shards) any changes are easily applied to the new index
Details of the current index (i.e. the name) is hidden from clients by using aliases
Time based searches can be directed to search only a specific small index
Index templates simplify the process of creating the daily index.
These benefits are also detailed in the Time-Based Data Guide, see also Retiring Data
cons:
Needs more work to set up (e.g. set up of cron jobs), but there is a plugin (curator) that can help with this.
If you perform updates on data then all versions of a document data will need to sit in the same index, i.e. multiple indexes won't work for you.
Use of TTL or Queries to delete data
pros:
Simple to understand and easily implemented
cons:
When you delete a document, it is only marked as deleted. It won’t be physically deleted until the segment containing it is merged away. This is very inefficient as the deleted data will consume disk space, CPU and memory.

Performance issues using Elasticsearch as a time window storage

We are using elastic search almost as a cache, storing documents found in a time window. We continuously insert a lot of documents of different sizes and then we search in the ES using text queries combined with a date filter so the current thread does not get documents it has already seen. Something like this:
"((word1 AND word 2) OR (word3 AND word4)) AND insertedDate > 1389000"
We maintain the data in the elastic search for 30 minutes, using the TTL feature. Today we have at least 3 machines inserting new documents in bulk requests every minute for each machine and searching using queries like the one above pratically continuously.
We are having a lot of trouble indexing and retrieving these documents, we are not getting a good throughput volume of documents being indexed and returned by ES. We can't get even 200 documents indexed per second.
We believe the problem lies in the simultaneous queries, inserts and TTL deletes. We don't need to keep old data in elastic, we just need a small time window of documents indexed in elastic at a given time.
What should we do to improve our performance?
Thanks in advance
Machine type:
An Amazon EC2 medium instance (3.7 GB of RAM)
Additional information:
The code used to build the index is something like this:
https://gist.github.com/dggc/6523411
Our elasticsearch.json configuration file:
https://gist.github.com/dggc/6523421
EDIT
Sorry about the long delay to give you guys some feedback. Things were kind of hectic here at our company, and I chose to wait for calmer times to give a more detailed account of how we solved our issue. We still have to do some benchmarks to measure the actual improvements, but the point is that we solved the issue :)
First of all, I believe the indexing performance issues were caused by a usage error on out part. As I told before, we used Elasticsearch as a sort of a cache, to look for documents inside a 30 minutes time window. We looked for documents in elasticsearch whose content matched some query, and whose insert date was within some range. Elastic would then return us the full document json (which had a whole lot of data, besides the indexed content). Our configuration had elastic indexing the document json field by mistake (besides the content and insertDate fields), which we believe was the main cause of the indexing performance issues.
However, we also did a number of modifications, as suggested by the answers here, which we believe also improved the performance:
We now do not use the TTL feature, and instead use two "rolling indexes" under a common alias. When an index gets old, we create a new one, assign the alias to it, and delete the old one.
Our application does a huge number of queries per second. We believe this hits elastic hard, and degrades the indexing performance (since we only use one node for elastic search). We were using 10 shards for the node, which caused each query we fired to elastic to be translated into 10 queries, one for each shard. Since we can discard the data in elastic at any moment (thus making changes in the number of shards not a problem to us), we just changed the number of shards to 1, greatly reducing the number of queries in our elastic node.
We had 9 mappings in our index, and each query would be fired to a specific mapping. Of those 9 mappings, about 90% of the documents inserted went to two of those mappings. We created a separate rolling index for each of those mappings, and left the other 7 in the same index.
Not really a modification, but we installed SPM (Scalable Performance Monitoring) from Sematext, which allowed us to closely monitor elastic search and learn important metrics, such as the number of queries fired -> sematext.com/spm/index.html
Our usage numbers are relatively small. We have about 100 documents/second arriving which have to be indexed, with peaks of 400 documents/second. As for searches, we have about 1500 searches per minute (15000 before changing the number of shards). Before those modifications, we were hitting those performance issues, but not anymore.

TTL to time-series based indexes
You should consider using time-series-based indexes rather than the TTL feature. Given that you only care about the most recent 30 minute window of documents, create a new index for every 30 minutes using a date/time based naming convention: ie. docs-201309120000, docs-201309120030, docs-201309120100, docs-201309120130, etc. (Note the 30 minute increments in the naming convention.)
Using Elasticsearch's index aliasing feature (http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases/), you can alias docs to the most recently created index so that when you are bulk indexing, you always use the alias docs, but they'll get written to docs-201309120130, for example.
When querying, you would filter on a datetime field to ensure only the most recent 30 mins of documents are returned, and you'd need to query against the 2 most recently created indexes to ensure you get your full 30 minutes of documents - you could create another alias here to point to the two indexes, or just query against the two index names directly.
With this model, you don't have the overhead of TTL usage, and you can just delete the old, unused indexes from over an hour in the past.
There are other ways to improve bulk indexing and querying speed as well, but I think removal of TTL is going to be the biggest win - plus, your indexes only have a limited amount of data to filter/query against, which should provide a nice speed boost.
Elasticsearch settings (eg. memory, etc.)
Here are some setting that I commonly adjust for servers running ES - http://pastebin.com/mNUGQCLY, note that it's only for a 1GB VPS, so you'll need to adjust.
Node roles
Looking into master vs data vs 'client' ES node types might help you as well - http://www.elasticsearch.org/guide/reference/modules/node/
Indexing settings
When doing bulk inserts, consider modifying the values of both index.refresh_interval index.merge.policy.merge_factor - I see that you've modified refresh_interval to 5s, but consider setting it to -1 before the bulk indexing operation, and then back to your desired interval. Or, consider just doing a manual _refresh API hit after your bulk operation is done, particularly if you're only doing bulk inserts every minute - it's a controlled environment in that case.
With index.merge.policy.merge_factor, setting it to a higher value reduces the amount of segment merging ES does in the background, then back to its default after the bulk operation restores normal behaviour. A setting of 30 is commonly recommended for bulk inserts and the default value is 10.

Some other ways to improve Elasticsearch performance:
increase index refresh interval. Going from 1 second to 10 or 30 seconds can make a big difference in performance.
throttle merging if it's being overly aggressive. You can also reduce the number of concurrent merges by lowering index.merge.policy.max_merge_at_once and index.merge.policy.max_merge_at_once_explicit. Lowering the index.merge.scheduler.max_thread_count can help as well
It's good to see you are using SPM. Its URL in your EDIT was not hyperlink - it's at http://sematext.com/spm . "Indexing" graphs will show how changing of the merge-related settings affects performance.

I would fire up an additional ES instance and have it form a cluster with your current node. Then I would split the work between the two machines, use one for indexing and the other for querying. See how that works out for you. You might need to scale out even more for your specific usage patterns.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio