I am using Windows Azure SDK 2.2 and have created an Azure cloud service that uses an in-role cache.
I have 2 instances of the service running under normal conditions.
When the services scales (up to 3 instances, or back down to 2 instances), I get lots of DataCacheExceptions. These are often accompanied by Azure db connection failures from the process going in inside the cache. (If I don't find the entry I want in the cache, I get it from the db and put it into the cache. All standard stuff.)
I have implemented retry processes on the cache gets and puts, and use the ReliableSqlConnection object with a retry process for db connection using the Transient Fault Handling application block.
The retry process uses a fixed interval retrying every second for 5 tries.
The failures are typically;
Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:There is a temporary failure. Please retry later
Any idea why the scaling might cause these exceptions?
Should I try a less aggressive retry policy?
Any help appreciated.
I have also noticed that I am getting a high percentage (> 70%) cache miss rate and when the system is struggling, there is high cpu utilisation (> 80%).
Well, I haven't been able to find out any reason for the errors I am seeing, but I have 'fixed' the problem, sort of!
When looking at the last few days processing stats, it is clear the high cpu usage corresponds with the cloud service having 'problems'. I have changed the service to use two medium instances instead of two small instances.
This seems to have solved the problem, and the service has been running quite happily, low cpu usage, low memory usage, no exceptions.
So, whilst still not discovering what the source of the problems were, I seem to have overcome them by providing a bigger environment for the service to run in.
--Late news!!! I noticed this morning that from about 06:30, the cpu usage started to climb, along with the time taken for the service to process as it should. Errors started appearing and I had to restart the service at 10:30 to get things back to 'normal'. Also, when restarting the service, the OnRoleRun process threw loads of DataCacheExceptions before it started running again, 45 minutes later.
Now all seems well again, and I will monitor for the next hours/days...
There seems to be no explanation for this, remote desktop to the instances show no exceptions in the event log, other logging is not showing application problems, so I am still stumped.
Related
We are working on an ASP.NET 5 Web API project that is in production now but we are experiencing an issue where it becomes unresponsive intermittently throughout the day.
A few notes about the application architecture. It is an ASP.NET Web API project using a MariaDB database on a separate EC2 instance within the same private network. The connection string uses the private IP of the database server to avoid any name resolution issues. The site is hosted via IIS 10.
The application itself has been developed carefully following the best practices provided by Microsoft. Heavy focus on async operations, minimizing query response times and offloading more expensive operations into background services.
The app is extremely responsive. It performs with sub 100ms responses on almost all requests, even the more complicated requests, and all the way up until it becomes unresponsive this high level of performance remains the same. We tend to see between 10-30 requests per second and 300-500 select queries per second at peak usage so not too extreme. However, randomly (2-3 times over a 24 hour period) it will begin hanging on requests and simply not respond to the request. During this time, the database is still extremely responsive and we are never over 300 connections out of our 512 connection limit.
The resources on the application server itself are never really taxed much at all. The CPU never gets above ~20% and the memory usage sits around 20-30%.
If I were to stop the site in IIS and start it again while this is happening, it will quickly come back online. If I don't it will be down for a few minutes until IIS finally kills it due to a failed health check. There are no real errors generated as a response to the issue other than typical errors caused by the hanging of the process such as connection terminated errors. The only thing I have seen before that gave me pause was the fact that there a few connection timeouts when getting the connection from the pool, but like I said, the connections to the server are never close to the limit.
Also, this app and version has been in production for months and it wasn't until the traffic volume started to grow that we started seeing these issues. At this point, I am at a loss for next steps of troubleshooting and I'm seeking suggestions.
In IIS App Pool advanced settings set Start Mode to AlwaysRunning
I never found a root cause for this issue, however, after updating to newer versions of .NET MVC this issue went away. My best guess is that changes with the Kestrel possibly resolved this issue, although, I have no idea what specific change that might have been. I have gone through the change logs a few times and didn't see anything that specifically jumped out at me.
We deployed our WebAPI as an azure website under the standard plan and have turned on Always On. After getting multiple memory and CPU alerts we decided on checking the logs via xyz.scm.azurewebsites.net. It seems Always ON has a high response time. Could this be causing high memory and CPU issues. Sometimes the alerts come when none is even using the system and auto resolve within 5 mins.
The always on feature only invokes the root of your web app every 5 minutes.
If this is causing high memory or cpu it could be a memory leak within your application because if you don't use the always on feature your process gets recycled on idle.
You should check what your app does if you invoke it with the root path and determine why this is causing high response time.
I have deployed a simple Spring boot app in Google App Engine Flexible. The app. has two APIs, one to add the user data into the DB (xxx.appspot.com/add) the other to get all the user data from the DB (xxx.appspot.com/all).
I wanted to see how GAE scales for the load, hence used JMeter to create a load with 100 user concurrency ramped up in 10 seconds and calls these two APIs in half a second delay, forever. While it runs fine for sometime (with just one instance), it starts to fail after 30 seconds or so with a "java.net.SocketException" or "The server responded with a status of 502".
After this error, when I try to access the same API from the browser, it displays,
Error: Server Error
The server encountered a temporary error and could not complete your
request. Please try again in 30 seconds.
The service is back to normal after 30 mins or so, and whenever the load test happens it repeats the same behavior as mentioned above. I expect GAE to auto-scale based on the load coming in to handle it without any down time (using multiple instances), instead it just crashes or blocks the service (without any information in the log). My app.yaml configuration is,
runtime: java
env: flex
service: hello-service
automatic_scaling:
min_num_instances: 1
max_num_instances: 10
I am a bit stuck with this one, Any help would be greatly appreciated. Thanks in advance.
The solution was to increase the resource configuration, details below.
Given that I did not set a resource parameter, it defaulted to the pre-defined values for both CPU and Memory. In this case, the default
memory was set at 0.6GB. App Engine Flex instances uses about 0.4GB
for overhead processes. Given Java is known to consume higher memory, there is a
great likelihood that the overhead processes consumed more than the
approximate 0.4GB value. Now instances in App Engine are restarted due
to a variety of reasons including optimization due to memory use. This
explains why your instances went off and it shows Tomcat is starting
up (they got restarted) and ends up in 502 error due to the nginx is
not able to complete the request. Fixing the above may lessen if not completely eliminate the 502s.
After I have specified the resources attribute and increased the configuration in app.yaml 502 error seems to be gone.
I have an asp.net mvc4 web api interface that gets about 54k requests a day.
http://myserv.x.com/api/123/getstuff?whatstuff=thisstuff
I have 3 web servers behind a load balancer that are setup to handle the http requests.
On average response times are ~300ms. However, lately something has gone awry (or maybe it has always been there) as there is sporadic behavior of response times coming back in 10-20sec. This would be for the same request hitting the same server directly instead of through the load balancer.
GIVEN:
- System has been passed down to me so there may be gaps with IIS confiuration, etc,.
- Database: SQL Server 2008R2
- Web Servers: Windows Server 2008R2 Enterprise SP1
- IIS 7.5
- Using MemoryCache aggressively with Model and Business Objects with eviction set to 2hrs
- Looked at the logs but really don't see anything significantly relevant
- One application pool...no other LOB applications running on this server
Assumptions & Ask:
Somehow I'm thinking that something is recycling the application pool or IIS worker threads are shutting down and restarting thus causing each new request to warmup and recache itself. It's so sporadic that it's tough to trouble shoot right now. The same request to the same server comes back fast as expected (back to back N requests) since it was cached in about 300ms....but wait about 5-10-20min and that same request to the same server takes 16seconds.
I have limited tracing to go by as these are prod systems so I can only expose so much logging details. Any help and information attacking this or similar behavior somebody else has run into is appreciated. Thx
UPDATE:
The w3wpe.exe process grows to ~3G. Somehow it gets wiped out and the PID changes so itself or something is killing it every 3-4min I see tons of warnings in my webserver (IIS) log:
A process serving application pool 'MyApplication' suffered a fatal
communication error with the Windows Process Activation Service. The
process id was '1732'. The data field contains the error number.
After 4-5 days of assessing IIS and configuration vs internal code issues I finally found the issue with little to no help with windbg or debugdiag IIS tools. Those tools contain so much information even with mini dumps or log trace stacks that they can be red herrings. Best bet was to reproduce it by setting up a "copy intelligently" instance of a production system, which we did not have at the time and took a bit for ops to set something up.
Needless to say the problem had to do with over cacheing business objects. There was one race condition where updates on a certain table were updating an attribute to that corresponding business object (updates were coming from multiple servers) which was causing an OOC stackoverflow that pretty much caused the cacheing to recursively cache itself to death thus causing the w3wp.exe process to die and psuedo-recycle itself. It was one of those edge cases that was incredibly hard to test and repro in a non-production environment.
I implemented a web application to start the Tomcat service works very quickly, but spending hours and when more users are entering is getting slow (up to 15 users approx.).
Checking RAM usage statistics (20%), CPU (25%)
Server Features:
RAM 8GB
Processor i7
Windows Server 2008 64bit
Tomcat 7
MySql 5.0
Struts2
-Xms1024m
-Xmx1024m
PermGen = 1024
MaxPernGen = 1024
I do not use Web server, we publish directly on Tomcat.
Entering midnight slowness is still maintained (only 1 user online)
The solution I have is to restart the Tomcat service and response time is again excellent.
Is there anyone who has experienced this issue? Any clue would be appreciated.
Not enough details provided. Need more information :(
Use htop or top to find memory and CPU usage per process & per thread.
CPU
A constant 25% CPU usage in a 4 cores system can indicate that a single-core application/thread is running 100% CPU on the only core it is able to use.
Which application is eating the CPU ?
Memory
20% memory is ~1.6GB. It is a bit more than I expect for an idle server running only tomcat + mysql. The -Xms1024 tells tomcat to preallocate 1GB memory so that explains it.
Change tomcat settings to -Xms512 and -Xmx2048. Watch tomcat memory usage while you throw some users at it. If it keeps growing until it reaches 2GB... then freezes, that can indicate a memory leak.
Disk
Use df -h to check disk usage. A full partition can make the issues you are experiencing.
Filesystem Size Used Avail Usage% Mounted on
/cygdrive/c 149G 149G 414M 100% /
(If you just discovered in this example that my laptop is running out of space. You're doing it right :D)
Logs
Logs are awesome. Yet they have a bad habit to fill up the disk. Check logs disk usage. Are logs being written/erased/rotated properly when new users connect ? Does erasing logs fix the issue ? (copy them somewhere for future analysis before you erase them)
If not. Logs are STILL awesome. They have the good habit to help you track bugs. Check tomcat logs. You may want to set logging level to debug. What happens last when the website die ? Any useful error message ? Do user connections are still received and accepted by tomcat ?
Application
I suppose that the 25% CPU goes to tomcat (and not mysql). Tomcat doesn't fail by itself. The application running on it must be failing. Try removing the application from tomcat (you can eventually put an hello world instead). Can tomcat keep working overnight without your application ? It probably can, in which case the fault is on the application.
Enable full debug logging in your application and try to track the issue. Run it straight from eclipse in debug mode and throw users at it. Does it fail consistently in the same way ?
If yes, hit "pause" in the eclipse debugger and check what the application is doing. Look at the piece of code each thread is currently running + its call stack. Repeat that a few times. If there is a deadlock, an infinite loop, or similar, you can find it this way.
You will have found the issue by now if you are lucky. If not, you're unfortunate and it's a tricky bug that might be deep inside the application. That can get tricky to trace. Determination will lead to success. Good luck =)
For performance related issue, we need to follow the given rules:
You can equalize and emphasize the size of xms and xmx for effectiveness.
-Xms2048m
-Xmx2048m
You can also enable the PermGen to be garbage collected.
-XX:+UseConcMarkSweepGC -XX:+CMSPermGenSweepingEnabled -XX:+CMSClassUnloadingEnabled
If the page changes too frequently to make this option logical, try temporarily caching the dynamic content, so that it doesn't need to be regenerated over and over again. Any techniques you can use to cache work that's already been done instead of doing it again should be used - this is the key to achieving the best Tomcat performance.
If there any database related issue, then can follow sql query perfomance tuning
rotating the Catalina.out log file, without restarting Tomcat.
In details,There are two ways.
The first, which is more direct, is that you can rotate Catalina.out by adding a simple pipe to the log rotation tool of your choice in Catalina's startup shell script. This will look something like:
"$CATALINA_BASE"/logs/catalina.out WeaponOfChoice 2>&1 &
Simply replace "WeaponOfChoice" with your favorite log rotation tool.
The second way is less direct, but ultimately better. The best way to handle the rotation of Catalina.out is to make sure it never needs to rotate. Simply set the "swallowOutput" property to true for all Contexts in "server.xml".
This will route System.err and System.out to whatever Logging implementation you have configured, or JULI, if you haven't configured.
See more at: Tomcat Catalina Out
I experienced a very slow stock Tomcat dashboard on a clean Centos7 install and found the following cause and solution:
Slow start up times for Tomcat are often related to Java's
SecureRandom implementation. By default, it uses /dev/random as an
entropy source. This can be slow as it uses system events to gather
entropy (e.g. disk reads, key presses, etc). As the urandom manpage
states:
When the entropy pool is empty, reads from /dev/random will block until additional environmental noise is gathered.
Source: https://www.digitalocean.com/community/questions/tomcat-8-5-9-restart-is-really-slow-on-my-centos-7-2-droplet
Fix it by adding the following configuration option to your tomcat.conf or (preferred) a custom file into /tomcat/conf/conf.d/:
JAVA_OPTS="-Djava.security.egd=file:/dev/./urandom"
We encountered a similar problem, the cause was "catalina.out". It is the standard destination log file for "System.out" and "System.err". It's size kept on increasing thus slowing things down and ultimately tomcat crashed. This problem was solved by rotating "catalina.out". We were using redhat so we made a shell script to rotate "catalina.out".
Here are some links:-
Mulesoft article on catalina (also contains two methods of rotating):
Tomcat Catalina Introduction
If "catalina.out" is not the problem then try this instead:-
Mulesoft article on optimizing tomcat:
Tuning Tomcat Performance For Optimum Speed
We had a problem, which looks similar to yours. Tomcat was slow to respond, but access log showed just milliseconds for answer. The problem was streaming responses. One of our services returned real-time data that user could subscribe to. EPOLL were becoming bloated. Network requests couldn't get to the Tomcat. And whats more interesting, CPU was mostly idle (since no one could ask server to do anything) and acceptor/poller threads were sitting in WAIT, not RUNNING or IN_NATIVE.
At the time we just limited amount of such requests and everything became normal.