Find ES bottleneck in bulk (bigdesk screenshots attached) - amazon-ec2

Update: beware, long post.
Before I go big and move to a bigger server I want to understand what's wrong with this one.
This is an MVP of an Elasticsearch server on AWS (EC2): two t1.micro instances with just ~600 MB of RAM each.
I want to understand what's wrong with this configuration. As you can see, there's a bottleneck in the bulk command. The OS memory is quite full, heap memory is still low, and although the process CPU is running at maximum, the OS CPU is low.
I reduced the complexity of each document in the bulk feed and set unwanted fields to not be indexed. The screenshots below are from my last attempt.
Is it an I/O bottleneck? I store the data in an S3 bucket.
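For reference, here is a rough sketch of what "not indexing unwanted fields" can look like with py-elasticsearch; the index, type, and field names below are made up for illustration:

from elasticsearch import Elasticsearch

# Hypothetical host/index/field names; the "index": "no" part is the point.
es = Elasticsearch(["http://IP_HERE:9200"])

es.indices.create(
    index="docs",
    body={
        "mappings": {
            "doc": {  # mapping type (this is a pre-2.x cluster)
                "properties": {
                    "title": {"type": "string"},
                    # kept in _source but not analyzed/indexed, so bulk indexing does less work
                    "raw_payload": {"type": "string", "index": "no"},
                }
            }
        }
    },
)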
Server Info:
2 nodes (one on each server), 3 indexes, each running with 2 shards and 1 replica. So it's a primary node with a running backup. Strangely, the "Iron man" node never took over a shard.
I ran the feeder again with the above cluster state, and the bottleneck seems to be on both nodes:
Here is the beginning of the feeder:
Primary:
Secondary (secondary has the bottleneck):
After 5 minutes of feeding:
Primary (now the primary has the bottleneck):
Secondary (secondary now is better):
I'm using py-elasticsearch, so requests are auto-throttled in the streamer. However, after the big bottleneck below it threw this error:
elasticsearch.exceptions.ConnectionError:
ConnectionError(HTTPConnectionPool(host='IP_HERE', port=9200):
Read timed out. (read timeout=10)) caused by:
ReadTimeoutError(HTTPConnectionPool(host='IP_HERE', port=9200):
Read timed out. (read timeout=10))
And below is a very interesting screenshot of the same bulk feed. The queue reached 20, the Python client threw the exception above, and the refresh command is still running as I write this.
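For what it's worth, here is a sketch of how the feeder could be made more tolerant of a slow bulk queue with py-elasticsearch: a longer read timeout, retries on timeouts, and smaller bulk chunks. The index name, chunk size, and timeout values are only illustrative.

from elasticsearch import Elasticsearch, helpers

# Illustrative values only: a longer read timeout plus retries, and smaller chunks,
# so a busy bulk queue shows up as back-pressure instead of ReadTimeoutError.
es = Elasticsearch(["http://IP_HERE:9200"], timeout=60,
                   max_retries=3, retry_on_timeout=True)

docs = [{"title": "example %d" % i} for i in range(1000)]  # stand-in for the real feed

def actions():
    for doc in docs:
        yield {"_index": "docs", "_type": "doc", "_source": doc}

for ok, item in helpers.streaming_bulk(es, actions(), chunk_size=200, raise_on_error=False):
    if not ok:
        print("failed:", item)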
My objective is to understand which resource (CPU, RAM, disk, network...) is inadequate, or better yet, how to use the existing resources more efficiently.

So Nate's script was (among other things) reducing the refresh interval. Let me add some other findings as well:
The refresh rate was stressing the cluster, but I kept digging and found more "errors". One gotcha was that I was using the deprecated S3 gateway. S3 is persistent but slower than the EC2 volume.
Not only was I using S3 as the data store, it was also in a different region (EC2 in Virginia -> S3 in Oregon), so documents were being sent over the network between regions. I ended up with that setup because some old tutorials list S3 as a cloud data storage option.
After fixing that, the "Documents deleted" figure below looked much better; when I was using S3 it was around 30%. This is from the ElasticSearch HQ plugin.
Now that I/O is optimized, let's see what else we can do.
I found out that CPU is an issue. Although BigDesk shows the workload as minimal, t1.micros are not meant for sustained CPU usage. Even though the charts show the CPU as not fully used, that is because Amazon throttles it in intervals; in reality the CPUs are fully used.
If you feed it bigger, more complex documents, it will stress the server.
Happy dev-oping.

Can you run the IndexPerfES.sh script against the index you are bulk indexing into? We can then see if the performance improves. I think the refresh rate is degrading performance and is perhaps causing stress on the cluster, leading to problems. Let me know and we can work this out.
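For reference, the refresh-interval part of that idea looks roughly like this with py-elasticsearch. This is only a sketch; the index name and feeder call are placeholders, and it is not necessarily what IndexPerfES.sh does.

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://IP_HERE:9200"])

def run_bulk_feed():
    pass  # placeholder for the existing bulk feeder

# Disable refresh while bulk loading, then restore it afterwards.
es.indices.put_settings(index="docs", body={"index": {"refresh_interval": "-1"}})
try:
    run_bulk_feed()
finally:
    es.indices.put_settings(index="docs", body={"index": {"refresh_interval": "1s"}})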

Related

Kubernetes number of replicas vs performance

I have just gotten into Kubernetes and am really liking its ability to orchestrate containers. I had assumed that when the app starts to grow, I can simply increase the replicas to handle the demand. However, now that I have run some benchmarks, the results confuse me.
I am running Laravel 6.2 with Apache on GKE with a single g1-small machine as the node. I'm only using a NodePort service to expose the app, since a LoadBalancer seems expensive.
The benchmarking tools used are wrk and ab. When the replicas are increased to 2, requests/s somehow drop. I would expect requests/s to increase, since there are 2 pods available to serve requests. Is there a bottleneck occurring somewhere, or is my understanding flawed? I hope someone can point out what I'm missing.
A g1-small instance is really tiny: you get 50% utilization of a single core and 1.7 GB of RAM. You don't describe what your application does or how you've profiled it, but if it's CPU-bound, then adding more replicas of the process won't help you at all; you're still limited by the amount of CPU that GCP gives you. If you're hitting the memory limit of the instance that will dramatically reduce your performance, whether you swap or one of the replicas gets OOM-killed.
The other thing that can affect this benchmark is that, sometimes, for a limited time, you can be allowed to burst up to 100% CPU utilization. So if you got an instance and ran the first benchmark, it might have used a burst period and seen higher performance, but then re-running the second benchmark on the same instance might not get to do that.
In short, you can't just crank up the replica count on a Deployment and expect better performance. You need to identify where in the system the actual bottleneck is. Monitoring tools like Prometheus that can report high-level statistics on per-pod CPU utilization can help. In a typical database-backed Web application the database itself is the bottleneck, and there's nothing you can do about that at the Kubernetes level.
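As a rough illustration (assuming a Prometheus server that scrapes the kubelet/cAdvisor metrics is reachable; the URL, namespace, and label names below are assumptions and vary by cluster), per-pod CPU usage can be pulled like this:

import requests

# Hypothetical Prometheus endpoint; label names (pod, namespace) depend on the setup.
PROM_URL = "http://prometheus.example.internal:9090/api/v1/query"
QUERY = 'sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="default"}[5m]))'

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
for series in resp.json()["data"]["result"]:
    # value is [timestamp, "cores"], e.g. "0.37" means roughly 37% of one core
    print(series["metric"].get("pod"), series["value"][1])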

Elasticsearch - queries throttle cpu

Recently our cluster has seen extreme performance degradation. We had 3 nodes, each with 64 GB of RAM and 4 CPUs (2 cores), for an index that is 250M records and 60 GB in size. Performance was acceptable for months.
Since then we've:
1. Added a fourth server, same configuration.
2. Split the index into two indexes, queried through an alias.
3. Disabled paging (Windows Server 2012).
4. Added synonym analysis on one field.
Our cluster can now survive only a few hours before it's basically useless. I have to restart Elasticsearch on each node to rectify the problem. We tried bumping each node to 8 CPUs (2 cores) with little to no gain.
One issue is that EVERY QUERY uses up 100% of the CPU of whatever node it hits. Every query is faceted on 3+ fields, which hasn't changed since our cluster was healthy. Unfortunately I'm not sure if this was happening before, but it certainly seems like an issue. We obviously need to be able to respond to more than one request every few seconds. When multiple requests come in at the same time, performance doesn't seem to get worse for those particular responses. Again, over time the performance slows to a crawl; the CPU (all cores) stays maxed out indefinitely.
I'm using elasticsearch 1.3.4 and the plugin elasticsearch-analysis-phonetic 2.3.0 on every box and have been even when our performance wasn't so terrible.
Any ideas?
UPDATE:
It seems like the performance issue is due to index aliasing. When I pointed the site to a single index that ultimately stores about 80% of the data, the CPU wasn't being throttled. There were still a few 100% spikes, but they were much shorter. When I pointed it back to the alias (which points to two indexes in total), I could literally bring the cluster down by refreshing the page a dozen times quickly: CPU usage goes to 100% on every query and gets stuck there with many in a row.
Is there a known issue with Elasticsearch aliases? Am I using the alias incorrectly?
UPDATE 2:
Found the cause in the logs. Paging queries are TERRIBLE. Is this a known bug in Elasticsearch? If I run an empty query and then try to view the last page (e.g. from=100,000,000), it brings the whole cluster down. That SINGLE QUERY. It gets through the first 1.5M results then quits, all the while taking up 100% of the CPU for over a minute.
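(For comparison, the usual way to walk a large result set without huge from values is a scroll; a minimal sketch with the Python client, index name made up:)

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Each page of a deep from/size query makes every shard collect and sort
# from+size hits; a scroll streams the results in fixed-size batches instead.
resp = es.search(index="myindex", scroll="1m", size=1000,
                 body={"query": {"match_all": {}}})
while resp["hits"]["hits"]:
    for hit in resp["hits"]["hits"]:
        pass  # process hit
    resp = es.scroll(scroll_id=resp["_scroll_id"], scroll="1m")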
UPDATE 3:
So here's something else strange. Pointing to an old index on dev (same size, no aliases) and trying to reproduce the paging issue, the cluster doesn't get hit immediately. It sits at 1% CPU usage for the first 20 seconds after the query. The query returns with an error before CPU usage ever goes up. About 2 minutes later, CPU usage spikes to 100% and the server basically crashes (it can't do anything else because the CPU is so overtaxed). On the production index this CPU load is instantaneous (it happens immediately after a query is made).
Without checking certain metrics it is very difficult to identify the cause of slow responses or any other issue. But from the data you have mentioned, it looks like there are too many cache evictions happening, thereby increasing the amount of garbage collection on your nodes. Frequent garbage collection (mainly old-generation GC) will consume a lot of CPU. This in turn will start to affect the whole cluster.
As you have mentioned, it started giving issues only after you added another node. This surprises me. Was there any increase in traffic?
Can you include the output of the _stats API, taken at the time when your cluster slows down? It will have a lot of information from which I can make a better diagnosis. Also include a sample of the query.
I suggest you install bigdesk so that you can get a graphical view of your cluster health more easily.
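For example, the GC and cache-eviction numbers can be pulled with the Python client like this (a sketch; the exact response keys vary a bit between Elasticsearch versions):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

stats = es.nodes.stats(metric="jvm,indices")
for node_id, node in stats["nodes"].items():
    old_gc = node["jvm"]["gc"]["collectors"]["old"]  # 1.x-style key names
    print(node["name"],
          "old-gen GC count:", old_gc["collection_count"],
          "old-gen GC time (ms):", old_gc["collection_time_in_millis"],
          "fielddata evictions:", node["indices"]["fielddata"]["evictions"],
          "filter cache evictions:", node["indices"]["filter_cache"]["evictions"])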

ElasticSearch/Logstash/Kibana How to deal with spikes in log traffic

What is the best way to deal with a surge in log messages being written to an ElasticSearch cluster in a standard ELK setup?
We use a standard ELK (ElasticSearch/Logstash/Kibana) set-up in AWS for our website's logging needs.
We have an autoscaling group of Logstash instances behind a load balancer, which log to an autoscaling group of ElasticSearch instances behind another load balancer. We then have a single instance serving Kibana.
For day-to-day business we run 2 Logstash instances and 2 ElasticSearch instances.
Our website experiences short periods of very high traffic during events - our traffic increases by about 2000% during them. We know about these events well in advance.
Currently we just increase the number of ElasticSearch instances temporarily during the event. However, we have had issues where we have subsequently scaled down too quickly, meaning we have lost shards and corrupted our indexes.
I've been thinking of setting auto_expand_replicas to "1-all" to ensure each node has a copy of all the data, so we don't need to worry about how quickly we scale up or down. How significant would the overhead of transferring all the data to new nodes be? We currently only keep about 2 weeks of log data - this works out to around 50 GB in all.
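For reference, a sketch of how that setting could be applied with the Python client (the host and index pattern below are made up):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://elasticsearch-lb.internal:9200"])  # hypothetical endpoint

# Expand replicas so every data node holds a copy of each shard.
es.indices.put_settings(
    index="logstash-*",
    body={"index": {"auto_expand_replicas": "1-all"}},
)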
I've also seen people mention using a separate autoscaling group of non-data nodes to deal with increases in search traffic, while keeping the number of data nodes the same. Would this help in a write-heavy situation, such as the events I mentioned above?
My Advice
Your best bet is using Redis as a broker in between Logstash and Elasticsearch:
This is described on some old Logstash docs but is still pretty relevant.
Yes, there will be a slight delay between the logs being produced and them eventually landing in Elasticsearch, but it should be minimal, since the latency between Redis and Logstash is relatively small. In my experience Logstash tends to work through the backlog on Redis pretty quickly.
This kind of setup also gives you a more robust setup where even if Logstash goes down, you're still accepting the events through Redis.
Just scaling Elasticsearch
As to your question on whether or not extra non-data nodes will help in write-heavy periods: I don't believe so, no. Non-data nodes are great when you're seeing lots of searches (reads) being performed, as they delegate the search to all the data nodes, and then aggregate the results before sending them back to the client. They take away the load of aggregating the results from the data nodes.
Writes will always involve your data nodes.
I don't think adding and removing nodes is a great way to cater for this.
You can try to tweak the thread pools and queues in your peak periods. Let's say normally you have the following:
threadpool:
    index:
        type: fixed
        size: 30
        queue_size: 1000
    search:
        type: fixed
        size: 30
        queue_size: 1000
So you have an equal number of search and index threads available. Just before your peak time, you can change the settings (on the fly) to the following:
threadpool:
    index:
        type: fixed
        size: 50
        queue_size: 2000
    search:
        type: fixed
        size: 10
        queue_size: 500
Now you have a lot more threads doing indexing, allowing for faster indexing throughput, while search is put on the back burner. For good measure I've also increased the queue_size to allow more of a backlog to build up. This might not work as expected, though, so experimentation and tweaking are recommended.
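On the 1.x versions where these thread pool settings were still dynamic, the peak-time values above could be pushed as transient cluster settings with the Python client, roughly like this (the host is a placeholder):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://elasticsearch-lb.internal:9200"])  # hypothetical endpoint

# Transient settings revert to the configured values after a full cluster restart.
es.cluster.put_settings(body={
    "transient": {
        "threadpool.index.size": 50,
        "threadpool.index.queue_size": 2000,
        "threadpool.search.size": 10,
        "threadpool.search.queue_size": 500,
    }
})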

Performance issue for batch insertion into marklogic

I have a requirement to insert 10,000 docs into MarkLogic in less than 10 seconds.
I tested on a single-node MarkLogic server in the following way:
use xdmp:spawn to pass each doc insertion task to the task server;
use xdmp:document-insert without specifying a forest explicitly;
the task server has 8 threads to process tasks;
we have CPF enabled.
The performance is very bad: it took 2 minutes to finish creating the 10,000 docs.
I'm sure the performance would be better if I tested it in a cluster environment, but I'm not sure whether it could finish in less than 10 seconds.
Please advise on how to improve the performance.
I would start by gathering more information. What version of MarkLogic is this? What OS is it running on? What's the CPU? RAM? What's the storage subsystem? How many forests are attached to the database?
Then gather OS-level metrics, to see if one of the subsystems is an obvious bottleneck. For now I won't speculate beyond that.
If you need a fast load, I wouldn't use xdmp:spawn for each individual document, nor use CPF. That said, 2 minutes for 10k docs doesn't necessarily sound slow. On the other hand, I have reached up to 3k docs/sec, but without range indexes, transforms, or anything else in the way, and with a very fast disk (e.g. SSD).
HTH!
Assuming a 2-socket server, 128-256 GB of RAM, and fast IO (400-800 MB/sec sustained):
An appropriate number of forests (12 primary, or 6 primary / 6 secondary)
More than 8 threads, assuming enough cores
CPF off
Turn on performance history, look in the metrics, and you will see where the bottleneck is.
SSD is not required - just IO throughput... which multiple spinning disks provide without issue.

Are Amazon's micro instances (Linux, 64bit) good for MongoDB servers?

Do you think using an EC2 instance (Micro, 64bit) would be good for MongoDB replica sets?
Seems like if that is all they did, and with 600+ MB of RAM, one could use them for a nice replica set.
Also, would they make good primary (write) servers too?
My database is only 1-2 gigs now but I see it growing to 20-40 gigs this year (hopefully).
Thanks
They COULD be good, depending on your data set, but most likely they will not be very good.
For starters, you don't get much RAM with those instances. Consider that you will be running an entire operating system and all related services - 613 MB of RAM can fill up very quickly.
MongoDB tries to keep as much data in RAM as possible, and that won't be possible if your data set is 1-2 GB; it becomes even more of a problem if your data set grows to 20-40 GB.
Secondly, they are labeled as "Low I/O performance", so when your data swaps to disk (and it will, given the size of that data set), you are going to suffer from slow reads due to low I/O throughput.
Be aware that micro instances are designed for spiky CPU usage, and you will be throttled to the "low background level" if you exceed the allotment.
The AWS Micro Documentation has good information of what they are intended for.
Between the CPU and the not-very-good I/O performance, my experience using micros for development/testing has not been good (larger instance types have been fine, though), but a micro may work for your use case.
However, there are exceptions for config or arbiter nodes; I believe a micro should be good enough for those types of machines.
There is also some mongodb documentation specific to EC2 which might help.
