ArgumentError during creation of many EC2 instances using knife - amazon-ec2

I'm invoking "knife ec2 server create" to create many EC2 instances, with a delay of 10 seconds between invocations. It works well for a few instances (approx. 10). However, if I create more instances (on the order of 30), I start getting the following argument error:
.INFO: SIGHUP received, reconfiguring
ERROR: ArgumentError: You must pass :on, :tail, or :head to :on
The error seems to happen during random phases. Sometimes while waiting for the ec2 instance, sometimes later when executing my recipe.
Is there a limit on the number of knife processes or Chef API calls I should have running at the same time?

I suspect this has nothing to do with Chef (although the error you're getting is being swallowed by Chef). I think the EC2 API is rate limiting you. You may need to add a splay or delay between calls or perform them in smaller batches.
If you continue to experience this error, I would recommend opening a ticket at https://tickets.opscode.com
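The batching and splay suggested above could be sketched roughly as follows. This is only an illustration: `launch_in_batches`, `batches`, and `knife_create` are hypothetical helper names, the batch size and delays are arbitrary, and the `knife` invocation is assumed to need your own image, flavor, and key options on top of the node name. Unlike the setup in the question, this sketch runs the launches one after another.

```python
import random
import subprocess
import time


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def launch_in_batches(names, launch, batch_size=5, delay=10.0, splay=5.0,
                      sleep=time.sleep):
    """Call `launch(name)` for every name, pausing between batches.

    The pause is `delay` plus a random splay in [0, splay] seconds, so that
    bursts of API calls are spread out instead of arriving together.
    """
    launched = []
    groups = list(batches(names, batch_size))
    for index, group in enumerate(groups):
        for name in group:
            launch(name)
            launched.append(name)
        if index < len(groups) - 1:  # no need to wait after the last batch
            sleep(delay + random.uniform(0, splay))
    return launched


def knife_create(name):
    # Hypothetical wrapper: a real call also needs image, flavor, SSH key
    # and run-list options appropriate to your environment.
    subprocess.run(
        ["knife", "ec2", "server", "create", "--node-name", name],
        check=True,
    )
```

A call such as `launch_in_batches(node_names, knife_create, batch_size=5)` would then create the instances five at a time. If the errors disappear with smaller batches, that supports the rate-limiting theory.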

Why does my Redis key show up only minutes after being stored?

I have a handler function on AWS Lambda that is connecting to a Redis instance to store a single key in the cache. The function has completed successfully but the key in Redis shows up minutes (or more) after the fact.
This behavior is observable on both Heroku Redis and Redis Cloud, both of which are hosted solutions.
I can't for the life of me figure out what's causing this lag. My Redis knowledge is practically zero: I know how to store a list using LPUSH and how to trim that list using LTRIM.
The writer to Redis uses this Node client, while I observe the lag using redis-cli on my local machine.
Is it common to experience this kind of lag in the setup I describe? What can I do to debug this?
I'm purposefully ignoring most of the information in the question and would like to refer only to the alleged symptom, namely that
"key show up only minutes after being stored"
This behavior is impossible with Redis: by design, any change to the data is immediately visible. That said, the only scenario in which what you're describing could be remotely possible is when you're writing to a Redis master server and reading from a very badly lagged slave. I can assure you that this is not the case with Redis Cloud, however.
The main reason is that the Lambda container goes to sleep as soon as your function terminates, and the Redis client you are using exposes only asynchronous APIs.
Note that the API is entirely asynchronous. To get data back from the server, you'll need to use a callback.
I'm assuming that the asynchronous SET is the last action performed in your Lambda function. Once it is called, the underlying Lambda container goes to sleep, and most likely the actual SET has not finished yet. The record will therefore not show up in Redis until the exact same Lambda container is invoked to execute your function again and finishes the work left over from the previous execution. This is probably the lag that you are experiencing.
To test whether this is true, add a sleep of a couple of seconds at the end of your function so that the Lambda container does not go to sleep immediately, and see if the lag is still there.
I would also recommend not using asynchronous APIs inside Lambda functions. They add state to your Lambda computation, which AWS itself advises against in the Lambda documentation.
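The difference between returning before the write has completed and returning after it can be illustrated with a small analogy. The question uses a Node client; this sketch uses Python's asyncio instead and a made-up `FakeRedis` class with simulated latency, so it shows only the ordering problem and not actual Lambda or Redis behavior.

```python
import asyncio


class FakeRedis:
    """Stand-in for a remote cache; `set` takes a moment, like a network call."""

    def __init__(self):
        self.data = {}

    async def set(self, key, value):
        await asyncio.sleep(0.05)  # simulated network latency
        self.data[key] = value


async def handler_fire_and_forget(cache):
    # Schedules the write but returns immediately, before it has completed.
    # If the runtime is frozen at this point, the write is left pending.
    return asyncio.ensure_future(cache.set("greeting", "hello"))


async def handler_awaited(cache):
    # Returns only after the write has finished.
    await cache.set("greeting", "hello")
```

In the first handler the key is not yet stored when the handler returns. In the second it is. With a callback-based client, the equivalent is to signal completion of the function from inside the SET callback.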

Amazon EC2 instances network issues

We have a few instances of our system on EC2. Some of them are application servers; others run Memcached, databases, etc.
A few weeks after an instance is created, it starts raising a large number of network-related errors, such as "MEMCACHED TIMEOUT ERROR", "RABBITMQ connection error", and similar. The errors come from a single instance only. After creating a copy of this instance, the errors go away.
Has anybody had the same problem?
I have experienced this before. I think it has to do with problems in the network stack of the host; at least that is as much information as I could get from AWS.
If you are using EBS-backed instances, simply stopping and then starting the instance should solve the problem. The instance gets assigned to a new host in that case.

EC2 Amazon Server getting stuck

I have my website hosted on Amazon EC2.
Overall it's working fine, but sometimes (approx. once per hour) it seems to get stuck. I'm not even able to type commands on the server console when it's in that state.
I moved from the micro instance to the small one expecting some improvement, but the same thing keeps happening.
Any guidance on where I should look to resolve this?
This depends on various factors.
Areas you should be looking at:
If you are not able to connect (SSH) to your instance: check your system log from your management console.
If you are experiencing slow response times: check your CloudWatch metrics from your console.
Verify the running processes on your instance and find out which process is using the most CPU % / memory %.
You can do this with top or ps -auwx.
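To capture the heaviest processes periodically instead of watching top by hand, the output of ps can be sorted by CPU or memory. The sketch below assumes a `ps` whose `aux` output starts with a header row containing `%CPU` and `%MEM` and ends with a `COMMAND` column; `top_processes` and `snapshot` are hypothetical helper names.

```python
import subprocess


def top_processes(ps_output, key="%CPU", limit=5):
    """Return the `limit` rows of `ps aux`-style output with the highest `key`."""
    lines = ps_output.strip().splitlines()
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        # COMMAND is the last column and may contain spaces, so split
        # only as many times as there are other columns.
        fields = line.split(None, len(header) - 1)
        if len(fields) == len(header):
            rows.append(dict(zip(header, fields)))
    rows.sort(key=lambda row: float(row[key]), reverse=True)
    return rows[:limit]


def snapshot():
    # Assumes a Linux or BSD style `ps` that accepts the `aux` option set.
    result = subprocess.run(["ps", "aux"], capture_output=True, text=True,
                            check=True)
    return result.stdout
```

Running `top_processes(snapshot())` from cron every minute and logging the result would show what was consuming resources around the time the server stopped responding.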

Amazon EC2 autoscaling down with graceful shutdown?

We're looking at using EC2 autoscaling to deal with spikes in load. In our case we want to scale up instances based on an SQS queue size and then scale down when the queue size gets back under control. Each SQS message defines a potentially long-running job (sometimes up to 20 minutes per message) that must complete before the instance can be terminated.
Our software handles the shutdown process gracefully, so issuing sudo service ourapp stop will wait for the app to complete before returning.
My question: when autoscaling starts scaling down, it issues a terminate (which apparently is like hitting the power button). Will it wait for our app to exit completely before the instance is 'powered off'?
https://forums.aws.amazon.com/message.jspa?messageID=180674 <- that and other things I've found seem to suggest that it doesn't
On most newer AMIs, the machines are given the equivalent of a 'halt' (or 'shutdown -h now') command so that the services are shut down gracefully. As long as your program plays nicely with the startup/shutdown scripts, you should be fine. However, if your program takes more than 20 seconds to terminate, you may find that Amazon kills the instance completely.
Amazon's documentation for autoscaling doesn't specify the termination process, but AWS's documentation for EC2 in general does describe what happens during termination: the machine is given a 'shutdown' command, and the default shutdown time on most systems is 30 seconds.
In mid-2014 AWS introduced 'lifecycle hooks', which allow for full control of the termination process.
Our high-level scale-down process is:
Auto Scaling sends a message to an SQS queue with an instance ID
Controller app picks up the message
Controller app issues a 'stop instance' request
Controller app re-queues the SQS message while the instance is stopping
Controller app picks up the message again, checks if the instance has stopped (or re-queues the message to try again later)
Controller app notifies Auto Scaling to 'PROCEED' with the termination
Controller app deletes the message from the SQS queue
More details: http://docs.aws.amazon.com/autoscaling/latest/userguide/lifecycle-hooks.html
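The controller steps above could be sketched roughly as follows. This is an illustration under assumptions, not the poster's actual code: `handle_termination_message` and `instance_state` are hypothetical names, `ec2` and `autoscaling` are assumed to be boto3-style clients, the message body is assumed to be the lifecycle notification JSON, and the 'PROCEED' step is assumed to map to the `CONTINUE` result of the complete-lifecycle-action call. Receiving, re-queueing, and deleting the SQS message are left to the caller.

```python
import json


def instance_state(ec2, instance_id):
    """Return the state name ('running', 'stopping', 'stopped', ...)."""
    response = ec2.describe_instances(InstanceIds=[instance_id])
    return response["Reservations"][0]["Instances"][0]["State"]["Name"]


def handle_termination_message(body, ec2, autoscaling):
    """Process one lifecycle notification.

    Returns "requeue" while the instance is still stopping, and "done" once
    Auto Scaling has been told to go ahead, so the caller can delete the
    message from the queue.
    """
    notice = json.loads(body)
    instance_id = notice["EC2InstanceId"]

    state = instance_state(ec2, instance_id)
    if state == "running":
        # Stopping triggers the graceful shutdown of the app on the instance.
        ec2.stop_instances(InstanceIds=[instance_id])
        return "requeue"
    if state != "stopped":
        return "requeue"  # still shutting down, check again later

    autoscaling.complete_lifecycle_action(
        LifecycleHookName=notice["LifecycleHookName"],
        AutoScalingGroupName=notice["AutoScalingGroupName"],
        LifecycleActionToken=notice["LifecycleActionToken"],
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )
    return "done"
```

The lifecycle hook's timeout would need to exceed the longest job (20 minutes in the question); otherwise Auto Scaling carries on with the termination by itself.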
Use the ReplaceUnhealthy option in autoscaling.
Refer to:
http://alestic.com/2011/11/ec2-schedule-instance
In particular, see this comment.

How to get a stack trace on all running ruby threads on passenger

I have a production ruby sinatra app running on nginx/passenger, and I frequently see requests get inexplicably stalled. I wrote a script to call passenger-status on my cluster of machines every ten seconds and plot the results on a graph. This is what I see:
The blue line shows the global queue waiting spiking constantly to 60. This is an average across 4 machines, so when the blue line hits 60, it means every machine is maxed out. I have the current passenger_max_pool_size set to 20, so it's getting to 3x the max pool size, and then presumably dropping subsequent requests.
My app depends on two key external resources - an Amazon RDS mysql backend and a Redis instance. Perhaps one of these is periodically becoming slow or unresponsive and thereby causing this behavior?
Can anyone advise me on how to get a stack trace to see if the bottleneck here is Amazon RDS, Redis, or something else?
Thanks!
I figured it out: I had a SAVE config parameter in Redis that was firing once a minute. Evidently the forking/saving operations of Redis block my app. I changed the config param to "3600 1", meaning I only save my database once an hour, which is OK because I am using it as a cache (the data is persisted in MySQL).
To answer your original question, it is possible to get "all stack traces" for the running Ruby processes that Passenger is shepherding. Basically, send a SIGQUIT signal to each one, and they'll spit out all their backtraces into the apache/nginx log file, e.g.:
https://gist.github.com/rdp/905759f88134229c2969b9f242188615
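Sending the signal to several worker processes at once could look like the sketch below. It assumes you have already collected the worker PIDs yourself (for example, by reading them from passenger-status) and that you have permission to signal them; `dump_backtraces` is a hypothetical helper name.

```python
import os
import signal


def dump_backtraces(pids, kill=os.kill):
    """Send SIGQUIT to each PID; return the PIDs that could not be signalled."""
    failed = []
    for pid in pids:
        try:
            kill(pid, signal.SIGQUIT)
        except (ProcessLookupError, PermissionError):
            # The process has already exited, or belongs to another user.
            failed.append(pid)
    return failed
```

After calling it, the backtraces should appear in the web server's error log, as described above.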
