Time spent in network exploding randomly - (website speed) - performance

We are a PHP web app hosted with Amazon in Dublin. Lately we have had a very strange problem.
The problem:
All of a sudden our site gets incredibly slow, sometimes to the point where it is no longer available. This typically lasts for a couple of minutes and then everything is fine again. It seems to happen randomly. Sometimes it happens several times in one day, and then we won't have the problem for a period of days.
We track our website speed with New Relic. In the monitoring I found that the "time spent in network" seems to explode all of a sudden (definition here: https://newrelic.com/docs/features/how-does-real-user-monitoring-work#what-does-the-network-time-include). It is normally around 0.5 seconds per request. This value explodes to anywhere between 9-15 seconds. After about 10-15 minutes it goes back to 0.5 seconds.
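To line these episodes up against other logs (deploys, cron jobs, database metrics), it can help to turn the raw per-request timings into a list of slow periods. The following is only a minimal sketch: the function name, the 5-second threshold and the 120-second merge gap are assumptions chosen for illustration, and the input is assumed to be (timestamp, seconds) pairs exported from whatever monitoring is in use.

```python
from typing import List, Tuple


def find_slow_periods(
    samples: List[Tuple[float, float]],
    threshold_s: float = 5.0,   # assumed cut-off between "normal" and "exploded"
    max_gap_s: float = 120.0,   # assumed: slow samples this close belong together
) -> List[Tuple[float, float, float]]:
    """Group slow samples into (start, end, peak_latency) periods.

    `samples` must be (unix_timestamp, latency_seconds) pairs sorted by time.
    """
    periods = []
    current = None  # [start, end, peak] of the period being built
    for ts, latency in samples:
        if latency < threshold_s:
            continue  # normal request, ignore
        if current is not None and ts - current[1] <= max_gap_s:
            # Still inside the same episode: extend it.
            current[1] = ts
            current[2] = max(current[2], latency)
        else:
            # A new episode starts; store the previous one first.
            if current is not None:
                periods.append(tuple(current))
            current = [ts, ts, latency]
    if current is not None:
        periods.append(tuple(current))
    return periods
```

Each returned tuple gives the first and last slow timestamp and the worst latency seen, which is enough to check whether anything else happened in the same window.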
Things that I can exclude as causes:
There are no traffic spikes causing this (it also happens during normal load) + we have sufficient CPU and DB power so that small spikes shouldn't be creating issues.
There aren't any expensive internal scripts running that are causing the problem.
It doesn't seem to be related to external software being unresponsive. Even individual pages that have no 3rd party components implemented (they don't even have Google Analytics) are incredibly slow.
What I think it could be:
To be entirely honest I am kind of lost.
The only thing I can imagine is that the time spent between the application and the database is very high for an unknown reason, e.g. because they end up in different Amazon availability zones for a period of time. But that kind of problem should then affect everyone once in a while, and I don't really know how to detect or solve it.
I have reached out to Amazon, but am still waiting for an answer.
Have you had a similar problem or any idea what could be the cause for this issue?
Many thanks for any tips.

Related

How to figure out what's causing long DOM processing time?

I have a landing page to which I'm driving traffic through PPC. For a variety of reasons, I've come to believe that, even though the site is highly performant for me, it isn't for my actual visitors. So, I turned on AWS CloudWatch.
For me, with cleared caches, the page loads at around 0.9 seconds in Safari, 1.75 seconds in Firefox, and 2.25 seconds in Chrome. If I were in micro-optimization mode, I'd worry about that Chrome number, but right now, my issue is much bigger. According to CloudWatch, my real users are experiencing an average load time of 12.1 seconds! And of that 12.1 seconds, DOM processing is taking about 11 seconds:
Now, I'm not sure if I have to worry about the full 11 seconds, or just the part I've marked "A" — the processing that happens before it starts loading the DOM — but either way, how do I figure out what's causing this?
I know there's a way to simulate low network speed in devtools, but maybe there's also a way to throttle CPU? Or maybe there's a way to "look" at the waterfall in devtools and figure out which pieces are blocking the DOMContentLoaded event? Then, even though they're fast for me, I can try to optimise those pieces. Or maybe there is a deeper level of diagnostics I can enable on CloudWatch? Or maybe there's some other option I haven't considered.
FWIW, almost all of my visitors are on Android devices, and they're about 70% Chrome, 10% Android Browser.
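One way to see where a long load actually goes is to split it using the browser's Navigation Timing marks (the attributes of `performance.timing`, which can be copied out of the devtools console). The sketch below is only arithmetic on those marks. The segment names are my own labels, and how a given monitoring tool defines "DOM processing" is an assumption that should be checked against its documentation.

```python
def split_page_load(t: dict) -> dict:
    """Split a page load into consecutive segments, in milliseconds.

    `t` maps Navigation Timing attribute names to timestamps in ms.
    The segment names on the left are illustrative labels, not a standard.
    """
    return {
        # Redirects, DNS, connect, request and response download.
        "network_and_backend": t["responseEnd"] - t["navigationStart"],
        # HTML parsing, including synchronous scripts that block the parser.
        "parsing": t["domInteractive"] - t["responseEnd"],
        # Deferred scripts that run before DOMContentLoaded fires.
        "deferred_scripts": t["domContentLoadedEventStart"] - t["domInteractive"],
        # DOMContentLoaded event handlers.
        "dcl_handlers": t["domContentLoadedEventEnd"] - t["domContentLoadedEventStart"],
        # Images, fonts, iframes and other subresources still loading.
        "subresources": t["domComplete"] - t["domContentLoadedEventEnd"],
        # load event handlers.
        "load_event": t["loadEventEnd"] - t["domComplete"],
    }


# Made-up numbers for illustration only, not measurements from the page above.
example = {
    "navigationStart": 0,
    "responseEnd": 900,
    "domInteractive": 4000,
    "domContentLoadedEventStart": 4200,
    "domContentLoadedEventEnd": 4300,
    "domComplete": 11900,
    "loadEventEnd": 12100,
}
breakdown = split_page_load(example)
```

If most of the time lands in `parsing`, blocking scripts are the likely suspects. If it lands in `subresources`, slow third-party assets or large images on mobile connections are more likely.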

Updating the complication gradually degrades the Apple Watch app performance in watchOS3

I've been stressing over this issue for about a week now, trying to pinpoint the source of a slow but steady Apple Watch app performance degradation. Over the course of about two days, my app's UI would become progressively more sluggish. I've narrowed it down to the complication update code. Even if I strip down the complication update to an absolute bare minimum, this problem still happens, albeit more slowly than if I update the complication with some actual data. I update the complication every 10 minutes. Once the new data comes in, I simply execute
for (CLKComplication *comp in [CLKComplicationServer sharedInstance].activeComplications) {
    [[CLKComplicationServer sharedInstance] reloadTimelineForComplication:comp];
}
which in turn calls:
- (void)getCurrentTimelineEntryForComplication:(CLKComplication *)complication withHandler:(void(^)(CLKComplicationTimelineEntry * __nullable))handler {
...
}
This works fine, the new data displays, but when repeated a few dozen times, the UI responsiveness of the main app begins to noticeably degrade, and when it's repeated about a hundred times (which happens in less than a day with 10 minute updates) the UI really slows down significantly.
I have nothing fancy going on with the complication structure - no time travel, just display the current data, and everything is set up for that. To make sure I'm not looking at the wrong place, I've made a test that reloads the timeline every second, and in this test, my getCurrentTimelineEntryForComplication looks like this:
- (void)getCurrentTimelineEntryForComplication:(CLKComplication *)complication withHandler:(void(^)(CLKComplicationTimelineEntry * __nullable))handler {
    handler(nil);
}
so there's literally nothing going on there, it just calls the handler with nil. Yet, even in this scenario, after a hundred or so timeline reloads, the main app's UI slows down visibly.
Some other things to note:
If I'm not updating the complication, the app's UI performance never degrades, no matter how many times I open it, or how long I use it, or how many times the data fetching code runs in the background.
When testing this in the simulator, I can't get the performance degradation to happen, but I can consistently see that there's a small but steady memory leak coming from the complication update (again, this happens no matter how simple an update I do inside the getCurrentTimelineEntryForComplication method).
Has anyone else noticed this, and is there any hope to deal with it? Am I doing something wrong? For the time being I make sure only to update the complication if the data has changed, but that just postpones the problem, rather than solving it.
Oct 24 edit
I've done more careful testing on a real watch, and while before for some reason I didn't notice the memory leak associated with this on a real watch, I have now definitely seen it happen. The real device mirrors the issue seen on the simulator completely, just with a different initial amount of memory allocation.
Again, all I do is call reloadTimelineForComplication on a constant loop, and the complication is updated with a single line of text from a cached data object, and the complication controller is otherwise stripped to a bare minimum. When the complication is removed from the watch face, memory leak predictably stops.
My main project is written in Objective-C, but I have repeated the test with a test project made in Swift, and there are no differences. Also, the problem persists with the latest Xcode 8.1 GM and the watchOS 3.1 beta that's supplied with the simulator that comes with it, as well as when running it on a real watch with watchOS 3.1 installed.
Jan 24, 2017 edit
Sadly, the issue persists in watchOS 3.1.3, completely unchanged. In the meantime I've contacted Apple's code-level support, sent them sample code, and they've confirmed that the problem exists, and told me to file a bug report. I did file a bug report about two months ago, but up until now it remains unclassified, which I guess means no one looked at it yet.
Jan 31, 2017 edit
Apple has fixed the problem in watchOS3.2 beta 1. I've been testing it both in the simulator and on real watch. Everything's working great, no memory leaks or performance degradation anymore. In the end there were no workarounds for this, until they decided to fix it.
I noticed that when using the native calendar complication, everything I do becomes very sluggish. So maybe it's a bug in the new watchOS.
After using the calendar complication for a couple of days, it's impossible to use that watch face. Even if I change to another complication and switch back to the calendar one, it doesn't "reset" the performance. The only thing that solves it is to reboot the watch (or forget about the calendar and use another complication instead).

Why does a cached file have a high Waiting (TTFB) or Content Loaded ms value?

I'm looking at a waterfall in Chrome's Developer Tools of several CSS and JavaScript files.
When refreshing the page, several of the files load from the browser cache, as expected. These take 1ms to load most of the time. However, some files, and it seems to be the same offenders on each refresh, take quite a bit longer: somewhere between 400ms and 800ms.
The waterfall timeline in Chrome's network tab shows that in some cases this time is spent in TTFB (time to first byte). This doesn't make any sense to me. If the file is coming from the browser cache, it should be read from the hard drive, not the server, so why is there a TTFB?
For other files, or sometimes on a different refresh, I see the time is blamed on content download time. Again, coming from cache this should be pretty much instantaneous, yet I'm seeing it take over half a second sometimes.
Can anyone shed some light on what's happening here?
This is a web app I'm working on so I don't have a link I can share I'm afraid.
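Since the page itself can't be shared, one option is to export the network tab as a HAR file and list the same offenders from it across several refreshes. Below is a rough sketch that relies only on fields from the HAR 1.2 format (`log.entries`, `request.url`, `response.status`, `timings.wait`, `timings.receive`). The function name and the 100ms threshold are assumptions, and the sketch does not itself decide whether an entry was served from cache.

```python
def slow_entries(har: dict, threshold_ms: float = 100.0) -> list:
    """Return (url, status, wait_ms, receive_ms) for entries whose wait
    (TTFB) or receive (content download) time reaches the threshold,
    slowest first."""
    rows = []
    for entry in har["log"]["entries"]:
        timings = entry["timings"]
        # HAR marks timings that do not apply as -1, so clamp defensively.
        wait = max(timings.get("wait", 0), 0)
        receive = max(timings.get("receive", 0), 0)
        if wait >= threshold_ms or receive >= threshold_ms:
            rows.append(
                (entry["request"]["url"], entry["response"]["status"], wait, receive)
            )
    return sorted(rows, key=lambda row: row[2] + row[3], reverse=True)
```

Loading the export with `json.load` and passing the result to `slow_entries` gives a short list that can be compared between refreshes to confirm whether the same files are slow every time.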

Web Server occasionally taking a very long time to serve pages

I am on a virtual dedicated webserver and my site has very little traffic at the moment. Everything was fine up until a couple of weeks ago, my pages would always load very quickly but now a problem has developed.
About one in every three page loads takes 10 seconds or more for pages that usually take about a second. In Google Chrome, the grey icon that precedes the favicon spends a long time rotating anticlockwise but behaves normally once it begins rotating clockwise, which leads me to believe that there is a problem connecting to the server. This problem occurs wherever I or anyone else connects from, so it is not a problem with my own internet connection or anything like that.
Strangely everything is always fine loading index.php, the problems only occur with the other pages on the site.
I've been suspecting it has something to do with DNS but I don't know much about it.
I have been trying hard to work out what the problem is for myself, but I am not much of an expert on servers etc., and most of the stuff I find with Google is about slow internet connections. If anyone has had this problem themselves and managed to solve it, or can help in any way, I would be very grateful.
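To test the DNS and connection suspicion, a small probe can time each phase of a request separately and be run repeatedly against both index.php and one of the slow pages. This is a sketch under assumptions: the function name is made up, it only handles plain http:// URLs (no TLS), and it takes the first address returned by the resolver.

```python
import socket
import time
from urllib.parse import urlsplit


def time_request_phases(url: str, timeout: float = 30.0) -> dict:
    """Time DNS lookup, TCP connect, time to first byte and download
    for one GET request. Returns seconds per phase."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        raise ValueError("this sketch only handles plain http:// URLs")
    host = parts.hostname
    port = parts.port or 80
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query

    t0 = time.perf_counter()
    # Name resolution; a slow DNS server would show up here.
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    t1 = time.perf_counter()

    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        sock.connect(sockaddr)  # TCP handshake
        t2 = time.perf_counter()

        request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        sock.recv(1)  # blocks until the server sends its first byte
        t3 = time.perf_counter()

        while sock.recv(65536):  # drain the rest until the server closes
            pass
        t4 = time.perf_counter()

    return {"dns": t1 - t0, "connect": t2 - t1, "ttfb": t3 - t2, "download": t4 - t3}
```

If the slow loads show a large `dns` or `connect` value, the problem is in front of the web server. If `ttfb` dominates, the server accepted the connection quickly and the delay is in generating the page.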

Performance issue with site speed: 'w3wp.exe' process reaches 99/100%

My site slows down and stops accessing certain services externally. If we check the process monitor, we see that this is normally due to the 'w3wp.exe' process, which is the background process that runs the website. It regularly reaches 99/100%. Killing the process or restarting the Web Publishing service resolves this. My web host says this can only be due to bad coding. Can someone comment on this?
When performance testing a reasonably straightforward website (coded in ASP.Net) I saw it slow to a crawl with memory use going through the roof over time. Each time recycling the w3wp process restored performance back to normal.
I never got around to figuring out why (the load we were testing with was way above normal, and it could have been worked around anyway by recycling the w3wp process more frequently), but my bet would have been that viewstate was causing the slowdown. A lot of pages had very large viewstate which wasn't being used in any way. I can feasibly see how loading large viewstate values might cause memory-related performance degradation over time.
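A quick way to check whether viewstate is a plausible suspect is to measure how large it is on each page. ASP.NET Web Forms emits it as a hidden input named `__VIEWSTATE`. The sketch below just measures the length of that value in saved HTML, with a hypothetical function name, and "large" is left for you to judge.

```python
from html.parser import HTMLParser


class _ViewStateFinder(HTMLParser):
    """Remembers the value of the hidden __VIEWSTATE input, if present."""

    def __init__(self):
        super().__init__()
        self.value = None

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        if attributes.get("name") == "__VIEWSTATE":
            self.value = attributes.get("value") or ""


def viewstate_length(html: str) -> int:
    """Return the number of characters of viewstate in the page (0 if none).

    This is the encoded size that travels with every page and postback.
    """
    finder = _ViewStateFinder()
    finder.feed(html)
    return 0 if finder.value is None else len(finder.value)
```

Running this over the HTML of the busiest pages shows which ones carry the most viewstate, and those are the first candidates for disabling it on controls that don't need it.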
What language is the site coded in? I recently ran into the same problem on a server running IIS6/PHP and found the following bug -
http://bugs.php.net/bug.php?id=37575
Upgrading PHP to 5.3 solved the issue.

Resources