I am trying to understand why some requests in my Python Heroku app take more than 30 seconds, even simple requests that do absolutely nothing.
One of the things I've done is look into the load average on my dynos. I did three things:
1) Look at the Heroku logs. Once in a while, it will print the load. Here are examples:
Mar 16 11:44:50 d.0b1adf0a-0597-4f5c-8901-dfe7cda9bce0 heroku[web.2] Dyno load average (1m): 11.900
Mar 16 11:45:11 d.0b1adf0a-0597-4f5c-8901-dfe7cda9bce0 heroku[web.2] Dyno load average (1m): 8.386
Mar 16 11:45:32 d.0b1adf0a-0597-4f5c-8901-dfe7cda9bce0 heroku[web.2] Dyno load average (1m): 6.798
Mar 16 11:45:53 d.0b1adf0a-0597-4f5c-8901-dfe7cda9bce0 heroku[web.2] Dyno load average (1m): 8.031
2) Run "heroku run uptime" several times, each time hitting a different machine (verified by running "hostname"). Here is sample output from just now:
13:22:09 up 3 days, 13:57, 0 users, load average: 15.33, 20.55, 22.51
3) Measure the load average on the machines on which my dynos live by using psutil to send metrics to graphite. The graphs confirm numbers of anywhere between 5 and 20.
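For reference, step 3 can be sketched roughly as below. This is a minimal sketch, not the exact code used: it reads the load with os.getloadavg() from the standard library instead of psutil, the metric prefix "dynos.web1" is a hypothetical name, and it only formats lines in Graphite's plaintext protocol ("path value timestamp") rather than opening a socket to a real server.

```python
import os
import time


def graphite_line(metric_path, value, timestamp):
    """Format one metric as a Graphite plaintext line: path, value, timestamp."""
    return f"{metric_path} {value:.3f} {int(timestamp)}\n"


def load_average_lines(prefix, loads, timestamp):
    """Build one Graphite line per load-average window (1, 5 and 15 minutes)."""
    windows = ("1m", "5m", "15m")
    return [
        graphite_line(f"{prefix}.load.{window}", load, timestamp)
        for window, load in zip(windows, loads)
    ]


if __name__ == "__main__":
    # os.getloadavg() returns the same three numbers that `uptime` prints.
    # "dynos.web1" is a made-up prefix used only for illustration.
    for line in load_average_lines("dynos.web1", os.getloadavg(), time.time()):
        print(line, end="")
```

Each line would then be written to the Graphite listener, which is why the graphs show the host-level numbers that uptime reports.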
I am not sure whether this explains simple requests taking very long or not, but can anyone say why the load average numbers on Heroku are so high?
Heroku sub-virtualizes its hosts into the guest dynos you use via LXC. When you run 'uptime' you are seeing the whole host's load, NOT your container's, and as pointed out by @jon-mountjoy, 'heroku run' gives you a new LXC container each time, not one of your running dynos.
https://devcenter.heroku.com/articles/dynos#isolation-and-security
Heroku’s dyno load calculation also differs from the traditional UNIX/LINUX load calculation.
The Heroku load average reflects the number of CPU tasks that are in the ready queue (i.e. waiting to be processed). The dyno manager takes the count of runnable tasks for each dyno roughly every 20 seconds. An exponentially damped moving average is computed with the count of runnable tasks from the previous 30 minutes, where period is either 1, 5, or 15 minutes (in seconds), count_of_runnable_tasks is an entry of the number of tasks in the queue at a given point in time, and avg is the previously calculated exponential load average from the last iteration.
https://devcenter.heroku.com/articles/log-runtime-metrics#understanding-load-averages
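To make the quoted description concrete, here is a minimal sketch of an exponentially damped moving average. It assumes the update takes the usual form new_avg = (1 - e^(-dt/period)) * count + e^(-dt/period) * avg; the function names and the fixed 20-second sampling interval are illustrative assumptions, not Heroku's actual code.

```python
import math


def damped_load_average(previous_avg, runnable_tasks, elapsed_seconds, period_seconds):
    """One update step of an exponentially damped moving average."""
    # The older the previous average is relative to the period, the less it counts.
    decay = math.exp(-elapsed_seconds / period_seconds)
    return (1 - decay) * runnable_tasks + decay * previous_avg


def replay(samples, period_seconds, interval_seconds=20):
    """Fold a sequence of runnable-task counts into a single load average."""
    avg = 0.0
    for count in samples:
        avg = damped_load_average(avg, count, interval_seconds, period_seconds)
    return avg


if __name__ == "__main__":
    # A dyno that constantly has 4 runnable tasks converges towards a load of 4.
    print(replay([4] * 30, period_seconds=60))
```

Because only runnable tasks are counted, a process blocked on I/O adds nothing to this average, which is the difference from the Linux figure described next.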
The difference between Heroku's load average and Linux is that Linux also includes processes in uninterruptible sleep states (usually waiting for disk activity), which can lead to markedly different results if many processes remain blocked in I/O due to a busy or stalled I/O system.
On CPU-bound dynos I would presume this doesn't make much difference. On an I/O-bound dyno, the load averages reported by Heroku would be much lower than what you would see if you could get a TRUE uptime reading from inside an LXC container.
You can also have periodic load messages for your running dynos sent to your logs by enabling log-runtime-metrics.
Perhaps it's expected dyno idling?
PS. I suspect there's no point running heroku run uptime - that will run it in a new one-off dyno every time.
I am using a Private-L dyno on Heroku.
I have a question about a metric called dyno load.
The dyno load is shown below.
Here is the documentation's description of dyno load:
If your application has four physical cores and is executing four concurrent threads, the load value will show 4
https://devcenter.heroku.com/articles/metrics
I am using a Private-L dyno, so I have 4 physical cores.
But as you can see in the image, the dyno load is about 0.3 on average.
image
Am I correct in assuming that this means I am not taking advantage of the high-performance CPU of the Private-L at all?
I am new to Sidekiq. My requirement is that there can be as many high-priority jobs as the number of users logged into the system. Let's say each user expects a notification as soon as their job is processed.
I have one Sidekiq daemon running with a concurrency of 50, so I can have at most 50 jobs processing at a time? I have read that the wiki says we should have multiple Sidekiqs running.
What is the upper limit on the number of Sidekiqs to run?
How will I be able to match the number of users logged in with the number of concurrent workers?
Is there a technology stack I can use to launch these workers? Something like Unicorn to have a pool of workers? Can I even use Unicorn with Sidekiq?
What is the upper limit on the number of sidekiqs to run?
You will want a max of one Sidekiq per processor core. If you have a dual-core processor, then 2 Sidekiqs. However, if your server is also doing other stuff such as running a webserver, you will want to leave some cores available for that.
How will I be able to match the number of users logged in with the number of concurrent workers?
With Sidekiq, you pre-emptively create your threads. You essentially have a thread-pool of X idle threads which are ready to deploy at any moment should a huge surge of jobs come in. You will need to create as many threads as the max number of jobs you think you will have at any time. However going over 50 threads per core is not a good idea for performance reasons (the amount of time switching between a huge number of threads significantly cuts into the CPU time allocated for the threads to do actual work).
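As a rough illustration of the sizing rules above (at most one Sidekiq process per available core, at most about 50 threads per process), the arithmetic can be sketched like this. The function name, the reserved_cores parameter and the example numbers are assumptions made for illustration; Sidekiq itself provides nothing like this.

```python
import math


def plan_sidekiq_pool(peak_jobs, cores, reserved_cores=0, max_threads_per_process=50):
    """Suggest a process count and per-process concurrency for a peak job count."""
    # Leave some cores free for the web server or other work on the same box.
    usable_cores = max(cores - reserved_cores, 1)
    # Never run more processes than usable cores, and never fewer than one.
    wanted = math.ceil(peak_jobs / max_threads_per_process)
    processes = max(1, min(usable_cores, wanted))
    # Spread the peak over the processes, capped at the per-process thread limit.
    threads = max(1, min(max_threads_per_process, math.ceil(peak_jobs / processes)))
    capacity = processes * threads
    return {
        "processes": processes,
        "threads_per_process": threads,
        "capacity": capacity,
        # Jobs beyond capacity simply wait in the queue until a thread frees up.
        "queued_at_peak": max(peak_jobs - capacity, 0),
    }


if __name__ == "__main__":
    print(plan_sidekiq_pool(peak_jobs=120, cores=4, reserved_cores=1))
    print(plan_sidekiq_pool(peak_jobs=500, cores=2))
```

The second call shows the practical consequence: with 500 simultaneous jobs on two cores, only 100 run at once and the rest queue, so matching threads one-to-one with logged-in users is not possible past a certain point.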
Is there a technology stack I can use to launch these workers? Something like Unicorn to have a pool of workers? Can I even use Unicorn with Sidekiq?
You can't use Unicorn for this. You need a process supervisor to handle starting and restarting Sidekiq. Their wiki recommends Upstart or systemd, but I've found that Supervisor works incredibly well and is really easy to set up.
Our application has a feature that requires a rake task to run every day at a specific time over all the users. This involves computing some of their attributes, running DB queries, and sending a push notification to every user. The task has been designed to run in O(n), but that still means the total time to finish grows with the user base. We want the task to finish in no more than a minute, yet it already takes 8 minutes at 14000 users and its CPU utilization keeps increasing (throughout the rest of the day the average CPU utilization sits around 10%, but it goes up to 50% when the task runs). I want to solve two problems here: make the task run in less time, and reduce the CPU utilization spike that the task causes.
Tech Specs - Sinatra application serving an API for the app, running on Phusion Passenger (nginx module), using MongoDB and deployed on a c3.large ec2 instance.
P.S. I don't know much about how parallel processing and threading are done in Ruby or whether they can solve this issue, but could bucketing the users and computing those buckets in parallel be an answer? If so, how do I go about doing something like that? I want to avoid buying a bigger server just for this purpose, since the rest of the time the current one handles requests quite easily, as I pointed out above.
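The bucketing idea in the P.S. can be sketched as follows. This is shown in Python purely for illustration (the application itself is Ruby), it assumes the per-user work is mostly waiting on the database and the push service so that threads help, and process_bucket is a hypothetical placeholder for the real work.

```python
from concurrent.futures import ThreadPoolExecutor


def make_buckets(user_ids, bucket_size):
    """Split the user ids into consecutive buckets of at most bucket_size."""
    return [user_ids[i:i + bucket_size] for i in range(0, len(user_ids), bucket_size)]


def process_bucket(bucket):
    """Hypothetical placeholder: compute attributes, query the DB, send pushes."""
    return len(bucket)  # report how many users were handled


def run_daily_task(user_ids, bucket_size=1000, workers=4):
    """Process all buckets concurrently and return the number of users handled."""
    buckets = make_buckets(user_ids, bucket_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_bucket, buckets))


if __name__ == "__main__":
    print(run_daily_task(list(range(14000))))
```

Note that running buckets in parallel shortens the wall-clock time but does not reduce the total CPU work, so on its own it would make the CPU spike shorter and taller rather than smaller.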
I'm trying to stress-test my Spring RESTful Web Service.
I run my Tomcat server on an Intel Core 2 Duo notebook with 4 GB of RAM. I know it's not a real server machine, but it's all I have and it's only for study purposes.
For the test, I run JMeter on a remote machine, and communication goes through a private WLAN with a central wireless router. I prefer to test over a wireless connection because the service will be accessed from mobile clients. With JMeter I run a group of 50 threads, starting one thread per second, so after 50 seconds all threads are running. Each thread repeatedly sends an HTTP request to the server, containing a small JSON object to be processed, and sleeps on each iteration for an amount of time equal to the sum of a constant 100 millisecond delay and a random value from a Gaussian distribution with a standard deviation of 100 milliseconds. I use some JMeter plugins for the graphs.
Here are the results:
I can't figure out why my hits per second don't pass the 100 threshold (in the graph they are multiplied by 10), because with this configuration the value should have been higher (50 threads each sending at least three requests per second would generate 150 hits/sec). I don't get any error message from the server, and everything seems to work well. I've tried many other configurations, but I can't get more than 100 hits/sec.
Why?
[EDIT] Many times I notice a substantial performance degradation from some point on without any visible cause: no error responses on the client, only OK HTTP responses, and everything seems to work well on the server too, but looking at the reports:
As you can see, something happens between 01:54 and 02:14: hits per second decrease and response time increases. It could be a server overload, but then why does the CPU usage decrease? That is not compatible with the congestion hypothesis.
I want to note that you've chosen very well which rows to display on the Composite Graph. It's enough to draw some conclusions:
Note that Hits Per Second correlates perfectly with CPU usage. This means you have a "CPU-bound" system, and the maximum performance is mostly limited by the CPU. This is very important to remember: server resources are consumed by hits, not by active users. You could disable your sleep timers entirely and you would still get the same 80-90 hits/s.
The maximum CPU level is somewhere around 80%, so I assume you run Windows (Win7?) on your machine. In my experience it is impossible to reach 100% CPU utilization on a Windows machine; it just does not let you spend the last 20%. If you have reached that maximum, then you are seeing your installation's capacity limit: it simply does not have enough CPU resources to serve more requests. To fight this bottleneck you should either provide more CPU (use another server with better CPU hardware), configure the OS to let you use up to 100% (I don't know whether that is possible), or optimize your system (code, OS settings) so that it spends less CPU per request.
For the second graph, I'd suppose something is being downloaded via the router, or something is happening on the JMeter machine. "Something happens" means some task is running. It may be a friend who just wanted to run "grep error.log", or some scheduled task. To pin this down you should look at the router's resources and the JMeter machine's resources while the degradation is happening. There must be a process that swallows CPU, disk, or network.
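The point that hits, not users, consume server resources can be checked with a little closed-loop arithmetic: with N threads that each wait for a response and then sleep, throughput is roughly N / (response time + think time). The sketch below assumes an average think time of 0.1 s (the constant delay, with the Gaussian part averaging out to about zero) and takes 90 hits/s as the observed ceiling; both figures are assumptions read off the description, and the function names are illustrative.

```python
def throughput(threads, response_seconds, think_seconds):
    """Hits per second for a closed loop of threads that wait, then sleep."""
    return threads / (response_seconds + think_seconds)


def implied_response_time(threads, hits_per_second, think_seconds):
    """Average response time implied by an observed throughput."""
    return threads / hits_per_second - think_seconds


if __name__ == "__main__":
    # If responses were instantaneous, 50 threads could generate far more load.
    print(throughput(50, response_seconds=0.0, think_seconds=0.1))
    # At about 90 hits/s, each request must be taking roughly this long.
    print(implied_response_time(50, hits_per_second=90, think_seconds=0.1))
```

So the expected "150 hits/sec" only holds if each request completes in well under a quarter of a second; once the CPU saturates, response time grows and the threads spend most of each cycle waiting, which caps the hit rate regardless of the timers.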
How do you configure AWS autoscaling to scale up quickly? I've set up an AWS autoscaling group with an ELB. All is working well, except that it takes several minutes before new instances are added and come online. I came across the following in a post about Puppet and autoscaling:
The time to scale can be lowered from several minutes to a few seconds if the AMI you use for a group of nodes is already up to date.
http://puppetlabs.com/blog/rapid-scaling-with-auto-generated-amis-using-puppet/
Is this true? Can time to scale be reduced to a few seconds? Would using puppet add any performance boosts?
I also read that smaller instances start quicker than larger ones:
Small Instance 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of instance storage, 32-bit platform with a base install of CentOS 5.3 AMI
Amount of time from launch of instance to availability:
Between 5 and 6 minutes us-east-1c
Large Instance 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of instance storage, 64-bit platform with a base install of CentOS 5.3 AMI
Amount of time from launch of instance to availability:
Between 11 and 18 minutes us-east-1c
Both were started via the command line using Amazon's tools.
http://www.philchen.com/2009/04/21/how-long-does-it-take-to-launch-an-amazon-ec2-instance
I note that the article is old and my c1.xlarge instances are certainly not taking 18min to launch. Nonetheless, would configuring an autoscale group with 50 micro instances (with an up scale policy of 100% capacity increase) be more efficient than one with 20 large instances? Or potentially creating two autoscale groups, one of micros for quick launch time and one of large instances to add CPU grunt a few minutes later? All else being equal, how much quicker does a t1.micro come online than a c1.xlarge?
You can increase or decrease the reaction time of an autoscaler by adjusting the
"--cooldown" value (in seconds).
Regarding the instance types to use, this depends mostly on the application type, and a decision should be made only after close performance monitoring and production tuning.
The time to scale can be lowered from several minutes to a few seconds
if the AMI you use for a group of nodes is already up to date. This
way, when Puppet runs on boot, it has to do very little, if anything,
to configure the instance with the node’s assigned role.
The advice here is about keeping your AMI (the snapshot of your operating system) as up to date as possible. That way, when autoscaling brings up a new machine, Puppet doesn't have to install lots of software as it normally would on a blank AMI; it may just need to pull some updated application files.
Depending on how much work your Puppet scripts do (apt-get install, compiling software, etc) this could save you 5-20 minutes.
The two other factors you have to worry about are:
How long it takes your scaling policy to determine that you need more resources (e.g. a policy that dictates "new machines should be added when CPU is above 90% for more than 5 minutes" would be less responsive and more likely to lead to timeouts than "new machines should be added when CPU is above 60% for more than 1 minute")
How long it takes to provision a new EC2 instance (smaller instance types tend to take less time to provision)
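The difference between the two example policies can be illustrated with a small sketch that replays a CPU trace, one sample per minute, and reports when each policy would fire. The trace values are made up for illustration, and the function treats "for more than N minutes" as N consecutive samples above the threshold, which is a simplifying assumption rather than how any particular alarm is evaluated.

```python
def minutes_until_alarm(cpu_by_minute, threshold, sustained_minutes):
    """Return the 1-based minute at which the policy fires, or None if it never does."""
    streak = 0
    for minute, cpu in enumerate(cpu_by_minute, start=1):
        # Count consecutive samples above the threshold; any dip resets the count.
        streak = streak + 1 if cpu > threshold else 0
        if streak >= sustained_minutes:
            return minute
    return None


if __name__ == "__main__":
    # Hypothetical CPU percentages during a traffic ramp, one sample per minute.
    trace = [40, 55, 65, 70, 80, 92, 95, 96, 97, 98, 99]
    print(minutes_until_alarm(trace, threshold=60, sustained_minutes=1))
    print(minutes_until_alarm(trace, threshold=90, sustained_minutes=5))
```

On this trace the stricter policy reacts several minutes later than the looser one, and the provisioning time from the second factor is added on top of that in both cases.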
How soon ASG responds would depend on 3 things:
1. Step - how much to increase by, as a percentage or a fixed number. With a large step you can increase capacity rapidly, because the ASG launches the entire step in one go.
2. Cooldown period - this determines how soon the next increase can happen. If the previous increase is still within the defined cooldown period (in seconds), the ASG waits and does not take action on the next increase yet. A small cooldown period lets the next step happen sooner.
3. AMI type - how long an AMI takes to launch. This depends on the type of AMI, and many factors come into play. All things being equal, fully baked AMIs launch much faster.
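How the three factors combine can be sketched with a deliberately simplified model: the first step fires immediately, each further step waits one cooldown, and the last batch of instances is in service one launch time after its step. Real ASG behaviour is more involved, and the function name, parameters and numbers below are illustrative assumptions only.

```python
import math


def seconds_to_reach(current, target, step_percent, cooldown_seconds, launch_seconds):
    """Estimate seconds until capacity reaches target under the simplified model."""
    elapsed = 0
    steps = 0
    while current < target:
        if steps > 0:
            # Every step after the first has to wait out the cooldown.
            elapsed += cooldown_seconds
        # A percentage step always adds at least one instance.
        current += max(1, math.ceil(current * step_percent / 100))
        steps += 1
    # The final batch still needs its launch time (AMI boot) before it is useful.
    return elapsed + launch_seconds if steps else 0


if __name__ == "__main__":
    # Growing from 2 to 16 instances with a 5-minute cooldown and a 2-minute launch.
    print(seconds_to_reach(2, 16, step_percent=100, cooldown_seconds=300, launch_seconds=120))
    print(seconds_to_reach(2, 16, step_percent=50, cooldown_seconds=300, launch_seconds=120))
```

In this model a larger step or a shorter cooldown reduces the number of waits, while a faster-launching AMI only shortens the final term, which matches the ordering of the three points above.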