Disclaimer: Please tell me if the question is too broad, and I will do my best to narrow it down.
We have a Heroku app running two web 1X dynos. This infrastructure has been running for the last nine months.
However, in the last few weeks, we have had several episodes where the app's response times skyrocket for about an hour before returning to normal, without us doing anything about it.
In the screenshots below, you can see an extract of the Heroku metrics during one of these "episodes", which happened yesterday afternoon.
As you can see, response times climb until eventually almost every request to the server times out. During the event, it was barely possible to even load the home page of our website, which is hosted on this app. Most of the time, we got the Heroku "Application Error" page.
What I see is:
The number of requests to the server (failed or not) was not unusually high (fewer than 1,000 every 10 minutes). For this reason, I think a DDoS attack is out of the picture.
All that the Heroku logs show is that the failed requests get a 503 (Service Unavailable) error, which makes me think of an overload.
The dynos do not seem overloaded. Memory usage is low and the dyno load is reasonable; nothing unusual.
Heroku reported no issues during our crash event, as https://status.heroku.com/ confirms (the last incident was on the 1st of July).
Restarting the dynos through several methods (from the interface, from the command line, or by triggering an automatic deployment via our GitLab repository) had no effect.
I am quite unsure how to interpret these metrics, or what the solution would be to ensure this kind of episode does not happen again.
So my question is: where should I look? Is there some kind of documentation about how to investigate crashes on Heroku apps?
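For what it's worth, a first place to look is the router log stream: the Heroku router tags every failed request with an H error code (H12 = request timeout, H10 = app crashed, and so on), which narrows the failure mode down considerably. A minimal sketch using the Heroku CLI, with my-app standing in for the real app name:

heroku logs --ps router --tail --app my-app    # stream only the router's lines
heroku logs --num 1500 --app my-app | grep -oE 'code=H[0-9]+' | sort | uniq -c    # tally recent H codes

The first command watches router lines live; the second counts which error codes occurred in the most recent log output.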
Related
We are working on an ASP.NET 5 Web API project that is now in production, but we are experiencing an issue where it intermittently becomes unresponsive throughout the day.
A few notes about the application architecture. It is an ASP.NET Web API project using a MariaDB database on a separate EC2 instance within the same private network. The connection string uses the private IP of the database server to avoid any name resolution issues. The site is hosted via IIS 10.
The application itself has been developed carefully, following the best practices provided by Microsoft: a heavy focus on async operations, minimizing query response times, and offloading more expensive operations to background services.
The app is extremely responsive. It performs with sub-100 ms responses on almost all requests, even the more complicated ones, and this level of performance holds right up until it becomes unresponsive. We tend to see 10-30 requests per second and 300-500 SELECT queries per second at peak usage, so nothing too extreme. However, at random (2-3 times over a 24-hour period) it will begin hanging on requests and simply not respond. During this time, the database is still extremely responsive, and we are never over 300 connections out of our 512-connection limit.
The resources on the application server itself are never really taxed much at all. The CPU never gets above ~20% and the memory usage sits around 20-30%.
If I stop the site in IIS and start it again while this is happening, it quickly comes back online. If I don't, it stays down for a few minutes until IIS finally kills it due to a failed health check. No real errors are generated in response to the issue, other than the typical errors caused by the hanging process, such as connection-terminated errors. The only thing that gave me pause was that there were a few connection timeouts when getting a connection from the pool, but as I said, the connections to the server are never close to the limit.
Also, this app and version have been in production for months, and it wasn't until the traffic volume started to grow that we began seeing these issues. At this point, I am at a loss for next troubleshooting steps and am seeking suggestions.
In the IIS App Pool advanced settings, set Start Mode to AlwaysRunning.
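If you prefer to script that setting rather than click through the UI, appcmd can apply it (the app pool name here is a placeholder):

%windir%\system32\inetsrv\appcmd.exe set apppool "MyAppPool" /startMode:AlwaysRunning

Note that AlwaysRunning only avoids cold starts after idle or recycle; it does not by itself explain the hangs.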
I never found a root cause for this issue; however, after updating to newer versions of .NET MVC, the issue went away. My best guess is that changes in Kestrel resolved it, although I have no idea what specific change that might have been. I have gone through the change logs a few times and didn't see anything that specifically jumped out at me.
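For future readers: one classic cause of exactly these symptoms (requests hang while CPU, memory, and the database all look healthy, and an app-pool restart clears it) is thread pool starvation from blocking on async code somewhere in the stack. This is only a guess, not a confirmed diagnosis of the app above; the controller, route, and repository names below are made up for illustration, sketched against modern ASP.NET Core namespaces:

using Microsoft.AspNetCore.Mvc;
using System.Threading.Tasks;

// Hypothetical repository interface, standing in for the app's data layer.
public interface ICarRepository
{
    Task<object> GetCarAsync(int id);
}

public class CarsController : ControllerBase
{
    private readonly ICarRepository _repository;

    public CarsController(ICarRepository repository) => _repository = repository;

    // Anti-pattern: .Result blocks a thread-pool thread until the query finishes.
    // Under load, enough blocked threads starve the pool and requests simply hang,
    // while CPU, memory, and the database all look perfectly healthy.
    [HttpGet("cars-blocking/{id}")]
    public IActionResult GetCarBlocking(int id)
    {
        var car = _repository.GetCarAsync(id).Result;
        return Ok(car);
    }

    // Fix: stay async end-to-end so the thread returns to the pool while waiting.
    [HttpGet("cars/{id}")]
    public async Task<IActionResult> GetCar(int id)
    {
        var car = await _repository.GetCarAsync(id);
        return Ok(car);
    }
}

A cheap diagnostic is to periodically log the output of ThreadPool.GetAvailableThreads and compare the numbers while healthy versus while hung; a worker count near zero during a hang points straight at starvation.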
New Relic newbie here. I have an API service hosted on Heroku and monitored with New Relic.
While studying how to use New Relic, I found out that my two workers were being underutilised, with very low RPM and low transaction times. So I decided to cut down to one worker, which saves me $36 a month. =]
Shortly after that, I received tonnes of Logentries emails reporting request timeouts on one of my web dynos. Looking into New Relic, I found that one of my actions was being called a suspiciously high number of times over 2-3 minutes.
The action in question is V1::CarsController#Index, which basically shows a collection of cars.
While I am not sure whether deleting one worker dyno caused memcached to do something, I also suspect that someone may be trying to scrape the data out of the database. I am not too sure how to investigate the issue further. I wonder if I can track down the request IPs and see whether they are the same? Or how else can I investigate?
If further information is needed, I am happy to provide it in edits!
Thanks
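A note for anyone investigating the same thing: the Heroku router logs each request with the client address in a fwd= field, so filtering the router lines by the suspect path (assuming here the route is /v1/cars) shows whether the hits come from a single IP:

heroku logs --ps router --num 1500 | grep '/v1/cars'    # each line carries fwd="<client ip>"

The matching lines also include the method, path, status, and service time, which is usually enough to spot a scraper.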
I have been testing some viewport issues for mobile and probably ran
git push heroku master
about 50 times in the last 3 hours. I am now seeing from the google speed tests that:
Reduce server response time: In our test, your server responded in 8.9 seconds. There are many factors that can slow down your server response time. Please read our recommendations to learn how you can monitor and measure where your server is spending the most time.
This wasn't popping up earlier this morning, when the response time was under 0.5 seconds. Did I damage one of my dynos on the Heroku servers? My site isn't really getting any traffic yet, so I haven't been doing any staging tests.
What is the best way to test production?
I was reading through this but was wondering if there is a better way to test production quickly.
https://devcenter.heroku.com/articles/multiple-environments
Thanks,
Jeff
There's nothing wrong with pushing many times in a row, but every time you push, your dynos will cycle. This takes something like 5 to 15 seconds depending on the size of your slug.
Generally this means that the first query sent to your app at the moment your dynos are cycling might hang for about that long. If Google checked your server's speed at that time, then that explains the response time. However, there shouldn't be any lasting effects after you finish pushing repeatedly.
If I recall correctly, there is a Heroku Labs option to cycle dynos so as to eliminate this pause (basically taking down some of your dynos and cycling them while the others are still up), but I do not recommend using it: it makes code pushes very unpredictable and can result in two versions of your app being live at the same time.
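The labs feature being described is presumably preboot; if so, it was enabled per app like this (the app name is a placeholder):

heroku labs:enable preboot --app my-app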
I am using Meteor on Heroku (free tier) with MongoHQ. My app is very simple right now; it loads 3-4 entries from a collection, but when I deploy it to Heroku, I see ridiculous load times (1-2 minutes). The HTML is rendered immediately. When I deploy to Meteor.com's free server, load times are a lot lower, but still about 15 seconds for 4 tiny pieces of data. I'm not seeing this at all when I run locally; the app pulls data from the DB right away.
It is worth noting that I don't think it's an "idling" issue on Heroku's side. Even if I already have one browser window with the app just opened, if I use a different browser and try again, I still get 1-2 minute load times. Once the data is loaded, however, performance goes back to being great; I can read and write with no problems.
What am I missing? I'm not seeing any errors in the console, mongo shows several queries in the logs and shows that it is responding quickly with 4 documents, but apparently somewhere in the middle there's a traffic jam. Any help with this is greatly appreciated, if I can't get past this Meteor is useless for my needs right now.
UPDATE: I've been watching it closely in Firebug, and the performance is largely inconsistent. Sometimes a simple refresh takes 1 minute, sometimes 10 seconds. What I've noticed is that when it's slow, it GETs the sockjs/info file and then, right after that, the sockjs POST is aborted (sometimes multiple times). When it runs fast, the POST and subsequent POSTs run smoothly.
Slow:
GET http://pocleaderboard.herokuapp.com/sockjs/info 200 OK 22ms
POST http://pocleaderboard.herokuapp.com/sockjs/029/su0d77fb/xhr Aborted
GET http://pocleaderboard.herokuapp.com/sockjs/info 200 OK 27ms
POST http://pocleaderboard.herokuapp.com/sockjs/132/uljqusxd/xhr Aborted
GET http://pocleaderboard.herokuapp.com/sockjs/info 200 OK 28ms
POST http://pocleaderboard.herokuapp.com/sockjs/154/kcbr6a5p/xhr Aborted
Fast(er):
GET http://pocleaderboard.herokuapp.com/sockjs/info 200 OK 1.08s
POST http://pocleaderboard.herokuapp.com/sockjs/755/xiggb555/xhr 200 OK 1.02s
Meteor loads that fast locally because it doesn't depend on your internet connection: the files can simply be read from your hard drive and don't need to be downloaded.
And once the data is loaded, it's the same wherever you host, because the client (you) performs all actions on your cached local Mongo database and then just waits for the server to say whether the action was alright or not.
But as for the Heroku loading times, I have no idea. Sorry!
UPDATE:
Those aborted requests are the long-polls from SockJS, which Meteor uses.
Normally these polls only get aborted on a hot code push (when a file is added/changed/removed).
So either you or Heroku seems to be writing or changing something in the app directory, which would cause Meteor to initiate a hot code push.
Heroku may not support WebSockets, which means you're stuck with the slower polling approach. See this:
https://devcenter.heroku.com/articles/using-socket-io-with-node-js-on-heroku
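(The article linked above, from the socket.io 0.9 era, essentially boils down to forcing long-polling instead of WebSockets. Treat the following as a historical sketch rather than current API:

// socket.io 0.9-era configuration; 'app' is your Node HTTP server
var io = require('socket.io').listen(app);
io.configure(function () {
  io.set("transports", ["xhr-polling"]); // disable WebSockets, poll instead
  io.set("polling duration", 10);        // seconds per long-poll
});

Meteor itself rides on SockJS rather than socket.io, but the underlying point is the same: without WebSockets, the client falls back to polling.)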
I created two very simple Heroku apps to test out the service, but it's often taking several seconds to load the page when I first visit them:
Cropify - Basic Sinatra App (on GitHub)
Textile2HTML - Even more basic Sinatra App (on GitHub)
All I did was create a simple Sinatra app and deploy it. I haven't done anything to mess with or test the Heroku servers. What can I do to improve response time? It's very slow right now and I'm not sure where to start. The code for the projects is on GitHub if that helps.
If your application is unused for a while, it gets unloaded from server memory.
On the first hit it gets loaded, and it stays loaded until some time passes without anyone accessing it.
This is done to save server resources. If no one uses your app, why keep resources busy instead of letting someone who really needs them use them?
If your app has a lot of continuous traffic, it will never be unloaded.
There is an official note about this.
You might also want to investigate the caching options you have on Heroku with Varnish and Memcached. These are persisted independently of the dynos.
For example, if you have an unchanging homepage, you can cache that for extended periods in Varnish by adding Cache-Control headers to the response. Then your users won't experience the load hit until they want to "do something" rather than when they arrive.
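As a minimal sketch of the Cache-Control idea in a Sinatra app like the ones above (the route, template, and one-hour max-age are made up for illustration):

require 'sinatra'

get '/' do
  # Ask any intermediate cache (e.g. Heroku's Varnish layer) to serve this
  # page for up to an hour without hitting the dyno.
  cache_control :public, max_age: 3600
  erb :home
end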
You should check out Tom Robinson's answer to "Scalability: How Does Heroku Work?" on Quora: http://www.quora.com/Scalability/How-does-Heroku-work
Heroku divides up server resources among many different customers/applications. Your app is allotted blocks of computing power. Heroku partitions based on resource demand. When you have a popular application that demands more power, you can pay for more 'dynos' (application containers) and then get a larger chunk of the pie in return.
In your case, though, you are running a free app that few people, if any besides you, are visiting or using. Therefore, Heroku cuts down on the resources you're getting by unloading your app (essentially putting it into hibernation) until a request is made to your address. When that happens after the app has been idling for a long time, it takes time to reload.
If that reload time matters to you, add one extra dyno to keep your app from falling asleep.
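With the Heroku CLI, that's a one-liner (the app name is a placeholder):

heroku ps:scale web=2 --app my-app

Keep in mind that going beyond one dyno takes the app off the free tier.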
I am having the same problem. I deployed a Rails 3 (Ruby 1.9.2) app last night and it's slow. I know that Ruby 1.9.2/Rails 3 is in beta on Heroku, but the support ticket said it should be fine using some instructions they sent me.
I understand that the first request after a long time takes the longest. Makes sense. But pages that don't even connect to a DB sometimes taking 10 seconds to load is pretty bad.
Anyway, you might want to try what I'm going to do: profile my app and see how long it takes locally. If it takes 400 ms locally, then something is wrong with the app itself. But if I get 50 ms locally and it still takes 10 seconds on Heroku, then something is definitely wrong on Heroku's end.
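A quick-and-dirty way to get both numbers is curl's built-in timer (both URLs are placeholders):

time curl -s -o /dev/null http://localhost:3000/                         # local timing
curl -s -o /dev/null -w "%{time_total}\n" http://my-app.herokuapp.com/   # Heroku timing

Run each a few times so the first-hit idle penalty doesn't skew the comparison.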
Obviously caching helps, but you only get 5 MB for free, and once again: with ONE person using the site, it shouldn't be this slow.
I had the same problem with every app I have deployed on Heroku's free account. There are now options for adding dynos so that your app does not get unloaded after sitting idle for a while, and you can also try Redis or Memcached for caching. But for my small-scale project I used a hacky solution: I built a web scraper, put it inside an infinite loop with a sleep, and tada, the website actually had much better response times (I guess it was no longer getting unloaded, thanks to the script).
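For the record, you don't even need a scraper for that effect; a shell loop that pings the app (the URL and the five-minute interval are arbitrary) does the same job:

# Hypothetical stand-in for the scraper: hit the app every 5 minutes
while true; do
  curl -s https://my-app.herokuapp.com > /dev/null
  sleep 300
done

It's still a hack, of course: you're keeping a dyno busy purely to defeat the idling policy.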