Heroku: Really slow connect times to RDS - heroku

We're stumped by this, been trying to solve it for a month. It looks like every once in a while (maybe one in 100 requests) our connection time to RDS or RabbitMQ takes > 0.5 seconds.
We're using Heroku with Django, amazon RDS and S3 (we do tons of image uploads).
Here's a sample slow connection (fast connections are really fast, like < 20 ms):
It's worth mentioning both Heroku and RDS are in the us-east region. Also when we use Redis Cloud (Heroku add-on) we see slow connection speeds to that too.
UPDATE:
I've also found that these same delays exist for Memcached. In fact, on average Memcached takes roughly the same amount as a request to the database. I'm just using Memcached for storing API keys and there's plenty of memory left.

It resembles something I read earlier on a blog where it seemed that on Heroku a request might be routed to a dyno that is handling a long running request. That way the old request blocks the execution of the next request. The routing isn't quite as "intelligent" as it is advertised.
I'm not 100% sure that this is the problem you are experiencing as is seems that the slowness shouldn't be in the connection but in the handling of the request to Heroku.
You can find more details on the problem describes here: http://news.rapgenius.com/James-somers-herokus-ugly-secret-lyrics

Related

AWS Aurora / Lambda serverless production environment exhibiting occasional spikes

We've been running our production web app off AWS Lambda / API Gateway, with an Aurora serverless database. Things had been running smoothly for over a year, but recently (coinciding with much increased periods of peak usage) we've experienced temporary slowness, and in the worst case unavailability, due to some kind of bottleneck that results in a spike in the number of DB connections and 4XX and 5XX from our two APIs.
We're using the serverless-mysql library to execute queries and manage DB connections.
Some potential causes of the issue that have been eliminated:
There are no long-running queries locking up tables or anything of that sort (as demonstrated by show full processlist in MySQL), in fact no query runs longer than 1s accordingly to our slow_log
All calls to await serverlessMysql.query() are immediately followed by await serverlessMysql.end()
Our database manager class is instantiated outside the Lambda handler, so it isn't reinstantiated every time a Lambda instance is reused
We've adjusted the config options for serverless-mysql so that retries aren't so aggresive. The default config makes it very aggressive in retrying to connect, both in frequency and number of retries. This has definitely helped, but has not eliminated the problem.
What details can I post that might help someone diagnose this problem? It's a major pain in the ass.
It would be helpful to see the load this application is getting. Which I know is easier said than done with Lambda.
You sort of hinted at it, but it's possible you're hitting the Max Connections() on the 'capacity class' your aurora serverless instance is set to. I've hit this a few times. It's hard to discover with lambda and serverless aurora because you don't have the same logging you would traditionally have.
Outside of that, the core issue you're experiencing seems to be related to spikes created from your application - so you need to discover if a query is maybe just inefficient, and running too many times at once. These are almost impossible to troubleshoot with Lambda logs. But db locks still occur with aurora serverless.
To help track down the issue, you could try the following:
Setup APM
I highly, highly, recommend getting something like NewRelic setup and monitoring your Lambda function.
I'm pretty sure NR has a free trial option, and tracking down a problem like this would be seemingly simple with an APM. I can't tell you how much easier problems like this are to solve with a solid apm.
Monitor traffic ingress
Again, I'm not sure of what this application is doing, but it could be possible that a spike in network traffic from a particular user kicks off a load of queries that make things go awry. Setup a free Cloudflare account or some other proxy if you can, and determine network traffic more easily.
Hope this helps.

Possible explanation of sudden spike in memcached get

Newbie in Newrelic here. I have an API service hosted on Heroku and being monitored at Newrelic.
While I was studying how to use newrelic. I found out my 2 workers are being underutilised with very low RPM and low transaction time. So I decided to cut down to one worker which saves me $36 a month. =]
Shortly after that I received tonnes of logEntries emails stating request timeouts of one of my web dynos. Looking into Newrelic. I found out that one of my actions are being called suspciously high number of times for 2-3 minutes.
The action being V1::CarsController#Index, which basically shows a collection of cars.
While I was not sure whether the deletion of one worker dyno has caused memcached to do something, I also suspect that may be someone is trying scrap the data off the database. I am not too sure how to further investigate into the issue. I wonder if I can track down the request IP and see it is the same? or how can I further investigate?
If further information is needed I am happy to provide in Edits!
Thanks

How many users should a EC2 Micro Instance be able to handle only with a nginx server?

I have a iOS Social App.
This app talks to my server to do updates & retrieval fairly often. Mostly small text as JSON. Sometimes users will upload pictures that my web-server will then upload to a S3 Bucket. No pictures or any other type of file will be retrieved from the web-server
The EC2 Micro Ubuntu 13.04 Instance runs PHP 5.5, PHP-FPM and NGINX. Cache is handled by Elastic Cache using Redis and the database connects to a separate m1.large MongoDB server. The content can be fairly dynamic as newsfeed can be dynamic.
I am a total newbie in regards to configuring NGINX for performance and I am trying to see whether I've configured my server properly or not.
I am using Siege to test my server load but I can't find any type of statistics on how many concurrent users / page loads should my system be able to handle so that I know that I've done something right or something wrong.
What amount of concurrent users / page load should my server be able to handle?
I guess if I cant get hold on statistic from experience what should be easy, medium, and extreme for my micro instance?
I am aware that there are several other questions asking similar things. But none provide any sort of estimates for a similar system, which is what I am looking for.
I haven't tried nginx on microinstance for the reasons Jonathan pointed out. If you consume cpu burst you will be throttled very hard and your app will become unusable.
IF you want to follow that path I would recommend:
Try to cap cpu usage for nginx and php5-fpm to make sure you do not go over the thereshold of cpu penalities. I have no ideia what that thereshold is. I believe the main problem with micro instance is to maintain a consistent cpu availability. If you go over the cap you are screwed.
Try to use fastcgi_cache, if possible. You want to hit php5-fpm only if really needed.
Keep in mind that gzipping on the fly will eat alot of cpu. I mean alot of cpu (for a instance that has almost none cpu power). If you can use gzip_static, do it. But I believe you cannot.
As for statistics, you will need to do that yourself. I have statistics for m1.small but none for micro. Start by making nginx serve a static html file with very few kb. Do a siege benchmark mode with 10 concurrent users for 10 minutes and measure. Make sure you are sieging from a stronger machine.
siege -b -c10 -t600s 'http:// private-ip /test.html'
You will probably see the effects of cpu throttle by just doing that! What you want to keep an eye on is the transactions per second and how much throughput can the nginx serve. Keep in mind that m1small max is 35mb/s so m1.micro will be even less.
Then, move to a json response. Try gzipping. See how much concurrent requests per second you can get.
And dont forget to come back here and report your numbers.
Best regards.
Micro instances are unique in that they use a burstable profile. While you may get up two 2 ECU's in terms of performance for a short period of time, after it uses its burstable allotment it will be limited to around 0.1 or 0.2 ECU. Eventually the allotment resets and you can get 2 ECU's again.
Much of this is going to come down to how CPU/Memory heavy your application is. It sounds like you have it pretty well optimized already.

How does Heroku dyno caching works with Play framework

I have a Play application hosted by Heroku with 5 dynos. Its seems like my dynos were been restarted randomly in different time period. For example, three of them were restarted by itself 22 hours ago, and two of them were restarted 10 hours ago (not sure if this time was triggered by clearing the cache). It seems like that cached data isn't persistent between dynos. My problem is when I sent same request to my Heroku application multiple times, I get different cached response, in the response, some are most up to date data, others were old data. I assume this is because my request was processed by different dyno. After restart all my dyno fixed the problem(I assume this also clear cache in all dynos).
So I want to know what triggered the random dyno restart, and why it does that?
How to solve cached data inconsistency in this case?
Thanks
I think you should use mutualised cache in order to avoid this kind of problem when you scale horizontally.
Couchbase is a good solution to do it. We use this internally at Clever Cloud (http://www.clever-cloud.com/en/), that is the reason why we released a Couchbase as a service.
As for dyno restarts, did you try the documentation? Dyno's are cycled at least once per day

How can I increase SSL performance with Elastic Beanstalk

I really like Elastic Beanstalk and managed to get my webapp (Spring MVC, Hibernate, ...) up and running using SSL on a Tomcat7 64-bit container.
A major concern to me is performance (I thought using the Amazon cloud would help here).
To benchmark my server performance I am using blitz.io (which uses the amazon cloud to have multiple clients access my webservice simultaneously).
My very first simple performance test already got me wondering:
I benchmarked a health check url (which basically just prints "I'm ok").
Without SSL: Looks fine.
13 Hits/s with a response time of 9ms
230 Hits/s with a response time of 8ms
With SSL: Not so fine.
13 Hits/s with a response time of 44ms (Ok, this should be a bit larger due to encryption overhead)
30 Hits/s with a response time of 3.6s!
Going higher left me with connection timeouts (timeout = 10s).
I tried using a larger EC2 instance in the background with essentially the same result.
If I am not mistaken, the Load Balancer before the EC2 Instances serves as an endpoint for SSL encryption. How do I increase this performance?
Can this be done with elastic beanstalk? Or do I need to setup my own load balancer etc.?
I also did some tests using Heroku (albeith with a slightly different technology stack, play! vs. SpringMVC). Here I also saw the increased response time, but it stayed mostly constant. I am assuming they are using quite performant SSL endpoints. How do I get that for Elastic Beanstalk?
It seems my testing method was flawed.
Amazon's Elastic Load Balancers seem to go up to 10k SSL requests per second.
See this great writeup:
http://blog.mattheworiordan.com/post/24620577877/part-2-how-elastic-are-amazon-elastic-load-balancers
SSL requires a handshaking before a secure transmission channel is opened. Once the handshaking is done, which involves several roundtrips, the data is transmitted.
When you are just hitting a page using a load tester, it is doing the handshake for each and every hit. It is not reusing an already established session.
That's not how browsers are going to do. Browse will do handshake once and then reuse the open encrypted session for all the subsequent requests for a certain duration.
So, I would not be very worried about the results. I suggest you try a tool like www.browsermob.com to see how long a full page with many image, js, css etc takes to load over SSL vs non-SSL. That will be a fair comparison.
Helps?

Resources