Write-heavy Elasticsearch

I am writing a real-time analytics tool using Kafka, Storm, and Elasticsearch, and I want an Elasticsearch cluster that is write-optimized for about 50K inserts/sec. For the purpose of a POC I tried bulk-inserting documents into Elasticsearch and reached 10K inserts per second.
I am running ES on a large Amazon EC2 instance.
I have tweaked the properties as below:
indices.memory.index_buffer_size: 30%
indices.memory.min_shard_index_buffer_size: 30mb
indices.memory.min_index_buffer_size: 96mb
threadpool.bulk.type: fixed
threadpool.bulk.size: 100
threadpool.bulk.queue_size: 2000
bootstrap.mlockall: true
But I want write performance on the order of 50K, not 10K, to keep my Storm topology flowing normally. Can anyone suggest how to configure a heavily write-optimized ES cluster?

The scripts located here may help you improve indexing performance. There are many options and configurations to try; I write about some here, but this isn't a comprehensive list. Reducing replicas and increasing shards improves indexing performance, but reduces availability and search performance during indexing.
Perhaps sending HTTP bulk requests to several nodes rather than just the master node could help you get the figures you desire.
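As a rough illustration of that idea, here is a minimal Python sketch using the elasticsearch client and its bulk helper. The host names, index name, batch size, and settings values are assumptions for the example; restore your normal refresh interval and replica count once the load is done.

from elasticsearch import Elasticsearch, helpers

# Point the client at several data nodes rather than a single one,
# so bulk requests are spread across the cluster.
es = Elasticsearch([
    "http://es-node1:9200",
    "http://es-node2:9200",
    "http://es-node3:9200",
])

INDEX = "events"  # example index name

# While bulk loading, disabling refresh and replicas reduces per-document overhead.
# Remember to set them back (e.g. refresh_interval "1s", replicas 1) afterwards.
es.indices.put_settings(index=INDEX, body={
    "index": {"refresh_interval": "-1", "number_of_replicas": 0}
})

def actions(docs):
    for doc in docs:
        yield {"_index": INDEX, "_source": doc}

docs = ({"value": i} for i in range(100000))  # stand-in for the real event stream

# helpers.bulk batches the documents into _bulk requests for you.
helpers.bulk(es, actions(docs), chunk_size=5000)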
Hope this helps somewhat. 10K/sec inserts is good, better than what most people have achieved, though whether they got to use a large Amazon instance I don't know.

Related

Measuring performance of CouchDb replication

I have several PCs on the network with an application that uses CouchDB. CouchDB is configured to replicate data with the CouchDB instances on all the other nodes. I would like to measure the performance of data replication between the nodes. I tried to find out whether CouchDB exposes any data about the time spent replicating, but I didn't find anything helpful except the _scheduler/jobs endpoint, which can show how many documents are waiting to be replicated and a sequence number.
My current idea is a very naive script that queries each CouchDB instance's _scheduler/jobs endpoint frequently; based on the numbers returned in the changes_pending and docs_written fields, I can get a rough estimate of how long it takes to replicate data. This is, however, inaccurate and takes some effort to set up. Maybe you know some easier ways/tools that can help me?
Questions:
Is there any way to fetch information about how long replication of documents took in CouchDB?
Also, maybe you know some tools that can help me measure the performance of CouchDB replication?
There was a performance improvement for document replication in the 3.3.0 release. The corresponding pull requests also document how the replication performance was tested.
If you want to test replication performance yourself, I suggest you follow their setup using couchdyno.
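If you do end up with the polling approach described in the question, a minimal sketch could look like the following. The node URLs, credentials, and poll interval are placeholders, and docs/sec derived this way is only an approximation, as the question already notes.

import time
import requests

NODES = ["http://admin:password@couch1:5984", "http://admin:password@couch2:5984"]
INTERVAL = 5  # seconds between polls

def poll(node):
    # _scheduler/jobs lists active replication jobs with changes_pending and docs_written.
    jobs = requests.get(f"{node}/_scheduler/jobs").json().get("jobs", [])
    for job in jobs:
        info = job.get("info") or {}
        yield job["id"], info.get("changes_pending"), info.get("docs_written")

previous = {}
while True:
    for node in NODES:
        for job_id, pending, written in poll(node):
            last = previous.get(job_id)
            if last is not None and written is not None:
                rate = (written - last) / INTERVAL
                print(f"{node} {job_id}: pending={pending} docs/sec~{rate:.1f}")
            previous[job_id] = written
    time.sleep(INTERVAL)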

Optimization of old indices collecting logs from my apps

I have an Elastic cluster with 3 nodes (each with 6 CPUs, a 31 GB heap, and 64 GB RAM) collecting 25 GB of logs per day, but after 3 months I realized my dashboards become very slow when checking stats for past weeks. Please advise whether there is an option to improve the indices' read performance so that my dashboard stats are calculated faster.
Thanks!
I would suggest you try to increase the number of shards.
When you have more shards, Elasticsearch splits your data across them, so a search is sent out as multiple parallel requests, each over a smaller slice of data.
For the shard count, you could size it based on your heap memory.
Regardless of the actual JVM heap size, the upper bound on the shard count should be about 20 shards per 1 GB of heap configured on the node; with your 31 GB heaps that works out to roughly 620 shards per node (a reindexing sketch for applying a new shard count to existing indices follows the links below).
ElasticSearch - Optimal number of Shards per node
https://qbox.io/blog/optimizing-elasticsearch-how-many-shards-per-index
https://opster.com/elasticsearch-glossary/elasticsearch-choose-number-of-shards/
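One caveat worth adding: the number of primary shards is fixed once an index is created, so for existing log indices this advice usually means creating a new index with more primaries and reindexing into it. A rough sketch using Python's requests; the endpoint, index names, and shard count are purely illustrative.

import requests

ES = "http://localhost:9200"  # example endpoint
OLD, NEW = "app-logs-2021.01", "app-logs-2021.01-resharded"  # example index names

# Create the new index with a higher primary shard count.
requests.put(f"{ES}/{NEW}", json={
    "settings": {"index": {"number_of_shards": 6, "number_of_replicas": 1}}
})

# Copy the data across; _reindex on large indices can take a while,
# so run it asynchronously and check the task status later.
requests.post(f"{ES}/_reindex?wait_for_completion=false", json={
    "source": {"index": OLD},
    "dest": {"index": NEW}
})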
It seems that the amount of data that you accumulated and use for your dashboard is causing performance problems.
A straightforward option is to increase your cluster's resources, but then you're bound to hit the same problem again, so you should rather rethink your data retention policy.
Chances are that you are really only interested in the most recent data. You need to answer the question of what "recent" means in your use case and simply discard anything older than that.
Elasticsearch has tools to automate this; look into Index Lifecycle Management (ILM).
What you probably need is to create an index template and apply a lifecycle policy to it. Elasticsearch will then handle automatic rollover of indices, eviction of old data, and even migration through data tiers in a hot-warm-cold architecture if you really want very long retention periods.
All this will lead to a more predictable performance of your cluster.
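To make that concrete, here is a hedged sketch of such a setup against a reasonably recent Elasticsearch (7.8 or later, which has composable index templates), using Python's requests. The policy name, alias, rollover thresholds, and retention period are example values to adapt.

import requests

ES = "http://localhost:9200"  # example endpoint

# 1. A lifecycle policy: roll the write index over daily or at ~50 GB,
#    and delete indices once they are 30 days old.
requests.put(f"{ES}/_ilm/policy/app-logs-policy", json={
    "policy": {"phases": {
        "hot": {"actions": {"rollover": {"max_age": "1d", "max_size": "50gb"}}},
        "delete": {"min_age": "30d", "actions": {"delete": {}}}
    }}
})

# 2. An index template that attaches the policy to every matching index.
requests.put(f"{ES}/_index_template/app-logs-template", json={
    "index_patterns": ["app-logs-*"],
    "template": {"settings": {
        "index.lifecycle.name": "app-logs-policy",
        "index.lifecycle.rollover_alias": "app-logs"
    }}
})

# 3. Bootstrap the first index behind the write alias; ILM handles rollover from here on.
requests.put(f"{ES}/app-logs-000001", json={
    "aliases": {"app-logs": {"is_write_index": True}}
})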

Figure out the problematic index in an ES cluster?

I have an Elasticsearch cluster that hosts more than 15 indices, and a Datadog integration that gives me a dashboard view of the cluster.
We have an alert integration with DD (Datadog) that alerts us if overall CPU usage goes beyond 60%, and our application also starts alerting when the Elasticsearch cluster is under stress, since in that case our response time increases beyond a configured threshold.
Now my problem is how to know which indices are consuming the most ES cluster resources, so that we can either throttle the requests from those indices or optimize their queries.
Some things which we did:
Looked at the slow query log, which doesn't point to a culprit: under heavy load or high CPU usage we get slow query logs from almost all of the big indices.
The DD dashboard shows a spike in the bulk queue, but this is cluster-wide and not specific to a particular ES index.
So my problem is very simple: all I want is some metric from DD or Elastic that can easily tell me which indices are consuming the most resources on the Elasticsearch cluster.
Unfortunately I cannot propose an exact solution/workaround, but you might have a look at the following documentation/APIs:
Indices Stats API
Cluster Stats API
Nodes Stats API
The cpu usage is not included in the exported fields but maybe you can derive a high cpu usage behaviour from the other fields.
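For example, a small sketch like this (the endpoint is a placeholder) pulls the Indices Stats API and ranks indices by cumulative indexing and search time, which in practice tends to track which indices are consuming the most CPU; taking deltas between two snapshots gives a per-interval picture rather than totals since node start.

import requests

ES = "http://localhost:9200"  # example endpoint

stats = requests.get(f"{ES}/_stats/indexing,search").json()["indices"]

rows = []
for name, s in stats.items():
    total = s["total"]
    rows.append((
        name,
        total["indexing"]["index_time_in_millis"],
        total["search"]["query_time_in_millis"],
    ))

# Show the ten indices with the highest combined indexing + search time.
for name, index_ms, query_ms in sorted(rows, key=lambda r: r[1] + r[2], reverse=True)[:10]:
    print(f"{name}: indexing={index_ms}ms search={query_ms}ms")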
I hope I could help you in some way.

ElasticSearch/Logstash/Kibana How to deal with spikes in log traffic

What is the best way to deal with a surge in log messages being written to an ElasticSearch cluster in a standard ELK setup?
We use a standard ELK (ElasticSearch/Logstash/Kibana) set-up in AWS for our website's logging needs.
We have an autoscaling group of Logstash instances behind a load balancer, that log to an autoscaling group of ElasticSearch instances behind another load balancer. We then have a single instance serving Kibana.
For day to day business we run 2 Logstash instances and 2 ElasticSearch instances.
Our website experiences short periods of high traffic during events; traffic increases by about 2000% during these events. We know about these events well in advance.
Currently we just increase the number of ElasticSearch instances temporarily during the event. However, we have had issues where we have subsequently scaled down too quickly, meaning we have lost shards and corrupted our indices.
I've been thinking of setting the auto_expand_replicas setting to "1-all" to ensure each node has a copy of all the data, so we don't need to worry about how quickly we scale up or down. How significant would the overhead of transferring all the data to new nodes be? We currently only keep about 2 weeks of log data; this works out to around 50 GB in all.
I've also seen people mention using a separate auto scaling group of non-data nodes to deal with increases of search traffic, while keep the number of data nodes the same. Would this help in a write heavy situation, such as the event I previously mentioned?
My Advice
Your best bet is using Redis as a broker between Logstash and Elasticsearch.
This is described in some old Logstash docs but is still pretty relevant.
Yes, you will see a delay between the logs being produced and them eventually landing in Elasticsearch, but it should be minimal, as the latency between Redis and Logstash is relatively small. In my experience Logstash tends to work through the backlog on Redis pretty quickly.
This also gives you a more robust setup, where even if Logstash goes down you're still accepting events through Redis.
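One way to keep an eye on that backlog is to watch the length of the Redis list the indexing Logstash reads from. A tiny sketch; the host and the key name "logstash" are assumptions based on the common list-based broker setup, so use whatever your shipper's redis output is configured with.

import time
import redis

r = redis.Redis(host="redis-broker", port=6379)

while True:
    # Number of events sitting in the broker, waiting to be picked up by Logstash.
    backlog = r.llen("logstash")
    print(f"events waiting in Redis: {backlog}")
    time.sleep(10)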
Just scaling Elasticsearch
As to your question on whether or not extra non-data nodes will help in write-heavy periods: I don't believe so, no. Non-data nodes are great when you're seeing lots of searches (reads) being performed, as they delegate the search to all the data nodes, and then aggregate the results before sending them back to the client. They take away the load of aggregating the results from the data nodes.
Writes will always involve your data nodes.
I don't think adding and removing nodes is a great way to cater for this.
You can try to tweak the thread pools and queues in your peak periods. Let's say normally you have the following:
threadpool:
    index:
        type: fixed
        size: 30
        queue_size: 1000
    search:
        type: fixed
        size: 30
        queue_size: 1000
So you have an equal number of search and index threads available. Just before your peak time, you can change the settings on the fly to the following:
threadpool:
    index:
        type: fixed
        size: 50
        queue_size: 2000
    search:
        type: fixed
        size: 10
        queue_size: 500
Now you have a lot more threads doing indexing, allowing for a faster indexing throughput, while search is put on the backburner. For good measure I've also increased the queue_size to allow for more of a backlog to build up. This might not work as expected, though, and experimentation and tweaking is recommended.
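If you do try this, note that updating thread pool settings on the fly through the cluster settings API was only possible on the older (1.x-era) Elasticsearch releases this answer appears to target; later versions turned them into static node settings that require a restart. A sketch of the change and the rollback, assuming such a version, with the endpoint as a placeholder:

import requests

ES = "http://localhost:9200"  # example endpoint

# Shift capacity towards indexing just before the peak...
requests.put(f"{ES}/_cluster/settings", json={
    "transient": {
        "threadpool.index.size": 50,
        "threadpool.index.queue_size": 2000,
        "threadpool.search.size": 10,
        "threadpool.search.queue_size": 500,
    }
})

# ...and restore the normal balance once the event is over.
requests.put(f"{ES}/_cluster/settings", json={
    "transient": {
        "threadpool.index.size": 30,
        "threadpool.index.queue_size": 1000,
        "threadpool.search.size": 30,
        "threadpool.search.queue_size": 1000,
    }
})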

Which server parameters to tweak in Solr if I expect heavy writes and light reads?

I am facing scalability issues designing a new Solr cluster, and I need the master to be able to handle a relatively high rate of updates with almost no reads; reads can be done via the slaves.
My existing Solr instance occupies a huge amount of RAM; in fact, it started swapping at only 4.5 million docs. I am interested in making the RAM footprint as small as possible, even if it affects search performance.
So, which Solr config values can I tweak in order to accomplish this?
Thank you.
It's hard to say without knowing the specifics of your environment (like the schema, custom indexers, query functions, etc.) and what counts as a huge amount of RAM, but you could start by:
setting filterCache, queryResultCache, and documentCache to 0 in solrconfig.xml. This will severely impact the performance of queries executed in Solr.
setting compression to true for the TextField and StrField types that you store, and setting compressThreshold to a low integer value. This will decrease the size of the stored documents at the cost of increased CPU usage (see http://wiki.apache.org/solr/SchemaXml#head-73cdcd26354f1e31c6268b365023f21ee8796613 for more details).
turning off all autowarming queries and not doing any read queries
making sure you commit often enough
Obviously these are all things to do on the master, not on the slaves.
