I am stuck with the problem of monitoring the HTTP requests of a website behind an internet-facing load balancer. To be specific, I have hosted a website on a server farm of AWS EC2 instances with a load balancer (ELB) at the front. Now I want to get an idea of the request arrival rate per second (or per minute) in order to scale the server farm.
I have thought of an approach to perform this task online. The idea is to fetch the ELB log each minute and parse it for the HTTP request count of the last minute. I'm just wondering whether there is a more efficient way to do this online.
Any help would be highly appreciated.
Your best bet is to use AWS's CloudWatch to do the monitoring for you:
http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/US_MonitoringLoadBalancerWithCW.html
Elastic Load Balancing publishes data points to Amazon CloudWatch
about your load balancers and your back-end application instances.
CloudWatch allows you to retrieve statistics about those data points
as an ordered set of time-series data, known as metrics. Think of a
metric as a variable to monitor, and the data points represent the
values of that variable over time. Each data point has an associated
time stamp and (optionally) a unit of measurement. For example, total
number of healthy EC2 instances behind a load balancer over a
specified time period can be a metric.
Amazon CloudWatch provides statistics based on the metric data points
published by Elastic Load Balancing. Statistics are metric data
aggregations over specified periods of time. The following statistics
are available: Minimum (min), Maximum (max), Sum, Average, and Count.
When you request statistics, the returned data stream is identified by
the metric name and a dimension. A dimension is a name/value pair that
helps you to uniquely identify a metric. For example, you can request
statistics of all the healthy EC2 instances behind a load balancer
launched in a specific Availability Zone.
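In practice, the metric you want here is RequestCount for your ELB, summed over one-minute periods. A minimal sketch with boto3 (the load balancer name, region and time window are placeholders, not from your setup):

# Hypothetical sketch: pull per-minute request counts for a Classic ELB
# from CloudWatch using boto3. Replace "my-website-elb" with your ELB name.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

now = datetime.now(timezone.utc)
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ELB",
    MetricName="RequestCount",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "my-website-elb"}],
    StartTime=now - timedelta(minutes=10),
    EndTime=now,
    Period=60,           # one data point per minute
    Statistics=["Sum"],  # Sum of RequestCount = requests per minute
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], int(point["Sum"]), "requests/minute")

You could feed those sums straight into your scaling logic instead of downloading and parsing the access logs yourself.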
I have an Elasticsearch cluster which hosts more than 15 indices, and I have a Datadog integration which shows me a dashboard view of the cluster.
We have an alert integration with DD (Datadog) which alerts us if overall CPU usage goes beyond 60%, and we also start getting alerts in our application when the Elasticsearch cluster is under stress, since in that case our response time increases beyond a configured threshold.
Now my problem is how to know which indices are consuming the most ES cluster resources, so that we can either throttle the requests from those indices or optimize their requests.
Some things which we did:
Looked at the slow query log: this doesn't give us the culprit, because under heavy load or high CPU usage we get slow query log entries from almost all the big indices.
In the DD dashboard there is a spike in the bulk queue, but this is cluster-wide and not specific to a particular ES index.
So my problem is very simple: all I want is some metric from DD or Elasticsearch which can easily tell me which indices are consuming the most resources on the cluster.
Unfortunately I cannot propose an exact solution/workaround, but you might have a look at the following documentation/APIs:
Indices Stats API
Cluster Stats API
Nodes Stats API
CPU usage is not included in the exported fields, but maybe you can derive high-CPU behaviour from the other fields.
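As a rough sketch of how the Indices Stats API could be used for this (the host URL is an assumption, and time spent indexing/searching is only a crude proxy for CPU consumed):

# Hypothetical sketch: rank indices by cumulative indexing and search time
# using the Indices Stats API. The host URL is an assumption.
import requests

STATS_URL = "http://localhost:9200/_stats/indexing,search"

stats = requests.get(STATS_URL).json()["indices"]

def cost(index_stats):
    total = index_stats["total"]
    # Time spent indexing + searching as a rough proxy for resources consumed.
    return (total["indexing"]["index_time_in_millis"]
            + total["search"]["query_time_in_millis"])

for name in sorted(stats, key=lambda n: cost(stats[n]), reverse=True)[:10]:
    print(name, cost(stats[name]), "ms of indexing+search time")

Note that these counters are cumulative, so sampling them twice and taking the difference gives you a per-interval comparison between indices.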
I hope I could help you in some way.
What is the best way to deal with a surge in log messages being written to an ElasticSearch cluster in a standard ELK setup?
We use a standard ELK (ElasticSearch/Logstash/Kibana) set-up in AWS for our website's logging needs.
We have an autoscaling group of Logstash instances behind a load balancer, that log to an autoscaling group of ElasticSearch instances behind another load balancer. We then have a single instance serving Kibana.
For day to day business we run 2 Logstash instances and 2 ElasticSearch instances.
Our website experiences short periods of very high traffic during events - our traffic increases by about 2000% during these events. We know about these events well in advance.
Currently we just increase the number of ElasticSearch instances temporarily during the event. However, we have had issues where we subsequently scaled down too quickly, meaning we lost shards and corrupted our indexes.
I've been thinking of setting the auto_expand_replicas setting to "1-all" to ensure each node has a copy of all the data, so we don't need to worry about how quickly we scale up or down. How significant would the overhead of transferring all the data to new nodes be? We currently only keep about 2 weeks of log data - this works out to around 50 GB in all.
I've also seen people mention using a separate auto scaling group of non-data nodes to deal with increases in search traffic, while keeping the number of data nodes the same. Would this help in a write-heavy situation, such as the event I previously mentioned?
My Advice
Your best bet is using Redis as a broker in between Logstash and Elasticsearch:
This is described on some old Logstash docs but is still pretty relevant.
Yes, you will see a delay between the logs being produced and them eventually landing in Elasticsearch, but it should be minimal, as the latency between Redis and Logstash is relatively small. In my experience Logstash tends to work through the backlog on Redis pretty quickly.
This kind of setup also gives you a more robust setup where even if Logstash goes down, you're still accepting the events through Redis.
Just scaling Elasticsearch
As to your question on whether or not extra non-data nodes will help in write-heavy periods: I don't believe so, no. Non-data nodes are great when you're seeing lots of searches (reads) being performed, as they delegate the search to all the data nodes, and then aggregate the results before sending them back to the client. They take away the load of aggregating the results from the data nodes.
Writes will always involve your data nodes.
I don't think adding and removing nodes is a great way to cater for this.
You can try to tweak the thread pools and queues in your peak periods. Let's say normally you have the following:
threadpool:
    index:
        type: fixed
        size: 30
        queue_size: 1000
    search:
        type: fixed
        size: 30
        queue_size: 1000
So you have an even number of search and index threads available. Just before your peak time, you can change the settings (on the fly) to the following:
threadpool:
    index:
        type: fixed
        size: 50
        queue_size: 2000
    search:
        type: fixed
        size: 10
        queue_size: 500
Now you have a lot more threads doing indexing, allowing for faster indexing throughput, while search is put on the back burner. For good measure I've also increased the queue_size to allow more of a backlog to build up. This might not work as expected, though, and experimentation and tweaking are recommended.
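As a sketch of how the on-the-fly change could be applied, assuming an Elasticsearch version where thread pool settings are still dynamic (they became static in later releases) and a node reachable at localhost:9200:

# Hypothetical sketch: apply the "peak period" thread pool settings at runtime
# via the cluster settings API. Only works where these settings are dynamic.
import requests

peak_settings = {
    "transient": {
        "threadpool.index.size": 50,
        "threadpool.index.queue_size": 2000,
        "threadpool.search.size": 10,
        "threadpool.search.queue_size": 500,
    }
}

resp = requests.put("http://localhost:9200/_cluster/settings", json=peak_settings)
print(resp.json())

After the event you would apply the original values the same way, or clear the transient settings.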
We have three EC2 instances - one in each availability zone (AZ) in the eu-west-1 region. They are load balanced using an ELB. We'd like to monitor how many instances are registered with the load balancer, using CloudWatch. The problem is: I don't really understand the HealthyHostCount metric.
For a deployment, we'd like to be able to de-register a single instance (take it out of the LB) without being notified. So the alarm would be: notify if there is only 1 healthy instance left behind the load balancer for 5 minutes.
As far as I understand, HealthyHostCount (HHC) is the number of healthy instances that are registered with a given ELB, averaged over all AZs. If everything is okay, the HHC should be 1 (no matter over what period of time) because there is 1 instance in each AZ.
A couple of days ago, someone deployed without re-registering the instances, so there was only 1 instance being balanced. When we noticed that, we created an alarm that was to notify us when the average HHC sank below 0.6 after 5 minutes. (If only 1 instance is registered with the ELB, the HHC should average 0.33 for any period of time.) However, the alarm never changed to the "ALARM" state.
When I checked the HHC in CloudWatch, the values didn't make sense (a Sum of 10.0 for a 5-minute interval is all I remember now).
It's all a big mess to me. Any time I think I understand the metric, the CloudWatch charts are all gibberish to me.
Could someone please explain how to use HHC to get an alarm when only 1 instance is registered? Is average HHC the way to go or should I use another metric?
The HealthyHostCount metric records one data value with the count of available hosts for each availability zone, each time a health check is executed. Your ELB health check has an Interval parameter that defines how many health checks are executed per minute.
If you are watching a Per-AZ metric, with a health check Interval of 10 seconds, with 2 healthy hosts in that AZ, you will see 6 data points per minute (60/10) with a value of 2. The average, max and min will be 2, but the sum will be 6*2=12.
If you have 3 AZs with 2 hosts each, again with an Interval of 10, but you are looking at the Per-LB metric, you will see 3*6=18 data points per minute, each one with a value of 2. The average, max and min will be 2, but the sum will be 18*2=36.
I recommend you set up an Interval value that divides 60 seconds evenly (either 5, 6, 10, 15, 20, 30 or 60 seconds).
In your case, if your Interval is 30 seconds and you have 3 AZs with 1 server per AZ, you should expect 2 data points per AZ per minute. So set up a Per-LB alarm, with a Period of 1 minute, on the Sum of HealthyHostCount, that triggers when the value is LowerOrEqual than 2 (2 data values * 1 healthy AZ * 1 healthy server = 2; the other 4 data values from the unhealthy AZs should be 0, so they won't affect the sum).
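A sketch of that alarm created with boto3 (the load balancer name and SNS topic ARN are placeholders):

# Hypothetical sketch: Per-LB alarm on the Sum of HealthyHostCount over
# 1-minute periods, as described above (3 AZs, 1 server per AZ, Interval=30s).
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

cloudwatch.put_metric_alarm(
    AlarmName="only-one-healthy-instance",
    Namespace="AWS/ELB",
    MetricName="HealthyHostCount",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "my-load-balancer"}],
    Statistic="Sum",
    Period=60,            # one-minute periods
    EvaluationPeriods=5,  # sustained for 5 minutes
    Threshold=2,
    ComparisonOperator="LessThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:ops-alerts"],  # placeholder
)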
UPDATE:
It turns out that the number of health checks executed also depends on the number of internal instances that make up the ELB (usually one per AZ), so if you are experiencing a traffic spike, or enough load to saturate a single ELB-internal instance, the number of internal servers inside the ELB will grow and you will unexpectedly get more data points. This may affect the Sum value, but only if you have lots of traffic. I didn't see this issue with a peak load of 6k RPM distributed across 3 AZs. If this is your scenario, then using Average is a safer bet, but I would recommend that you use LowerThan 0.65 as your threshold.
The link also makes me wonder how the Cross-Zone Load Balancing feature affects the number of data points...
This is an area where the CloudWatch web console doesn't expose everything that CloudWatch can do. As the docs explain, HealthyHostCount is a per-availability-zone metric. The console lets you view HealthyHostCount by availability zone (but across all load balancers) or by load balancer (but across all zones), but not sliced both ways.
If you only have one load balancer, the simplest thing would be to set up one alarm on each of the per-zone metrics. If you have multiple availability zones then you should be able to use the API to create an alarm slicing across availability zone and load balancer (again, one alarm per load balancer), but you can't do this from the web UI as far as I know.
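For example, a sketch with boto3 that creates one alarm per zone, sliced by both load balancer and availability zone (names and the SNS topic are placeholders):

# Hypothetical sketch: one alarm per AZ, dimensioned by both load balancer and
# availability zone -- something the web console does not let you set up.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

for zone in ["eu-west-1a", "eu-west-1b", "eu-west-1c"]:
    cloudwatch.put_metric_alarm(
        AlarmName=f"no-healthy-host-{zone}",
        Namespace="AWS/ELB",
        MetricName="HealthyHostCount",
        Dimensions=[
            {"Name": "LoadBalancerName", "Value": "my-load-balancer"},
            {"Name": "AvailabilityZone", "Value": zone},
        ],
        Statistic="Minimum",
        Period=60,
        EvaluationPeriods=5,
        Threshold=1,
        ComparisonOperator="LessThanThreshold",  # alarm when the zone has 0 healthy hosts
        AlarmActions=["arn:aws:sns:eu-west-1:123456789012:ops-alerts"],  # placeholder
    )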
I want to understand how ELB load balances between multiple availability zones. For example, if I have 4 instances (a1, a2, a3, a4) in zone us-east-1a and a single instance d1 in us-east-1d behind an ELB, how is the traffic distributed between the two availability zones? i.e., would d1 get nearly 50% of all the traffic or 1/5th of the traffic?
If you enable ELB Cross-Zone Load Balancing, d1 will get 20% of the traffic.
Here's what happens without Cross-Zone Load Balancing enabled:
d1 would get nearly 50% of the traffic. This is why Amazon recommends adding the same number of instances from each AZ to your ELB.
The following excerpt is extracted from Overview of Elastic Load Balancing:
Incoming traffic is load balanced equally across all Availability Zones enabled for your load balancer, so it is important to have approximately equivalent numbers of instances in each zone. For example, if you have ten instances in Availability Zone us-east-1a and two instances in us-east-1b, the traffic will still be equally distributed between the two Availability Zones. As a result, the two instances in us-east-1b will have to serve the same amount of traffic as the ten instances in us-east-1a. As a best practice, we recommend you keep an equivalent or nearly equivalent number of instances in each of your Availability Zones. So in the example, rather than having ten instances in us-east-1a and two in us-east-1b, you could distribute your instances so that you have six instances in each Availability Zone.
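If you want d1 to receive its proportional 1/5th share instead, enabling Cross-Zone Load Balancing on a Classic ELB is a single API call; a sketch with boto3 (the load balancer name is a placeholder):

# Hypothetical sketch: enable Cross-Zone Load Balancing on a Classic ELB
# so traffic is spread across instances rather than evenly across zones.
import boto3

elb = boto3.client("elb", region_name="us-east-1")

elb.modify_load_balancer_attributes(
    LoadBalancerName="my-load-balancer",  # placeholder
    LoadBalancerAttributes={
        "CrossZoneLoadBalancing": {"Enabled": True},
    },
)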
The load balancing between different availability zones is done via DNS. When a DNS resolver on the client asks for the IP address of the ELB, it gets two addresses and chooses to use one of them (usually the first). The DNS server usually responds with the addresses in a random order, so the first IP is not always the same one; each IP is used only part of the time (half the time for 2 addresses, a third of the time for 3, etc.).
Then behind these IP addresses you have an ELB server in each availability zone that has your instances connected to it. This is the reason why a zone with just a single instance will get the same amount of traffic as all the instances in another zone.
When you get to the point where you have a very large number of instances, ELB can decide to create two such servers in a single availability zone, but in this case it will split your instances so that each server handles half (or some other equal share) of them.
For my application I am using Auto Scaling without Elastic Load Balancing. Is there any performance issue with using Auto Scaling directly, without an ELB?
Adi,
David is right.
Auto Scaling allows you to scale instances (based on CloudWatch metrics, a single event, or a recurring schedule).
Suppose you have three instances running (scaled with Auto Scaling): how is traffic going to reach them? You need to implement load balancing somewhere, which is why Elastic Load Balancing is so useful.
Without that, your traffic can only be directed in a poorly-engineered manner.
See Slide #5 of this presentation on slideshare, to get a sense of the architecture: http://www.slideshare.net/harishganesan/scale-new-business-peaks-with-auto-scaling
Best,
Autoscaling determines, based on some measurement (CPU load is a common measurement), whether or not to increase/decrease the number of instances running.
Load balancing relates to how you distribute traffic to your instances based on domain name lookup, etc. Somewhere you must have knowledge of which IP addresses are those currently assigned to the instances that the autoscaling creates.
You can have multiple IP address entries for A records in your DNS settings, and machines will be allocated in a roughly round-robin fashion from that pool. But keeping the pool up to date in real time is hard.
The load balancer gives you an easy mechanism to provide a single interface/IP address to the outside world and it has knowledge of which instances it is load balancing in real time.
If you are using autoscaling, unless you are going to create a fairly complex monitoring and DNS updating system, you can reasonably assume that you must use a load balancer as well.
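If you go that route, attaching an ELB to an existing Auto Scaling group is a single call; a sketch with boto3 (the group and load balancer names are placeholders):

# Hypothetical sketch: attach a Classic ELB to an existing Auto Scaling group
# so new instances are registered with the load balancer automatically.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.attach_load_balancers(
    AutoScalingGroupName="my-asg",           # placeholder
    LoadBalancerNames=["my-load-balancer"],  # placeholder
)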