The issue is with a long-running Azure WebJob on a daily schedule. Each run takes 2-4 hours doing data analytics. The only dependencies are on an Azure SQL database via EF and on Azure Storage; I just set the AzureWebJobsDashboard and AzureWebJobsStorage connection strings in App.config, a standard Visual Studio setup with the WebJobs SDK. Most of the WebJob's time is spent in EF's SaveChanges().
I also do a significant amount of logging to monitor progress, approximately 3,000 lines of Console output.
The web app is configured with Always On, and WEBJOBS_IDLE_TIMEOUT is set to a very high number.
The following is the log of the error:
[10/20/2016 07:48:17 > 492c46: ERR ] Unhandled Exception: Microsoft.WindowsAzure.Storage.StorageException: The client could not finish the operation within specified timeout. ---> System.TimeoutException: The client could not finish the operation within specified timeout.
[10/20/2016 07:48:17 > 492c46: ERR ] --- End of inner exception stack trace ---
[10/20/2016 07:48:17 > 492c46: ERR ] at Microsoft.WindowsAzure.Storage.Core.Util.StorageAsyncResult`1.End()
[10/20/2016 07:48:17 > 492c46: ERR ] at Microsoft.WindowsAzure.Storage.Blob.CloudBlockBlob.EndUploadText(IAsyncResult asyncResult)
[10/20/2016 07:48:17 > 492c46: ERR ] at Microsoft.WindowsAzure.Storage.Core.Util.AsyncExtensions.<>c__DisplayClass4.b__3(IAsyncResult ar)
This is a few months old, but for those that come later...
You mention you do a lot of logging. There is an issue that was logged dealing with something similar. Apparently the WebJobs SDK does a periodic save of the log data to blob storage. If you are using a lot of bandwidth or otherwise consuming a lot of resources you may run into timeouts from the SDK trying to save to blob storage. Note the upload call in your stack trace.
I'm seeing this sporadically in a process that is punishing the wire pretty good so I am disabling logging via the WebJobs logging facility.
A triggered WebJob is aborted if it is idle, i.e. produces no CPU time or output, for a certain amount of time. Try increasing that by setting the WEBJOBS_IDLE_TIMEOUT configuration setting to a large number, for instance 3600 (seconds).
The job could also be aborted if your web app doesn't have Always On enabled.
If that doesn't help, you should try to reduce the amount of logging. Could it be that you write too many messages too fast? Have a look at this answer to see if that could be the case.
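One way to keep a chatty job from overwhelming the log uploader is to cap how many lines per second actually reach the console. This is a minimal sketch of that idea in Java for illustration (the WebJob itself would be .NET; the class and its budget are my own invention, not part of any SDK):

```java
// Minimal sketch: drop log lines once a per-second budget is exhausted,
// so a chatty job cannot flood the WebJobs log uploader.
public class ThrottledLogger {
    private final int maxLinesPerSecond;
    private long windowStart = 0;
    private long linesInWindow = 0;
    private long dropped = 0;

    public ThrottledLogger(int maxLinesPerSecond) {
        this.maxLinesPerSecond = maxLinesPerSecond;
    }

    // Returns true if the line was written, false if it was dropped.
    // The caller passes in the current time so the window logic is testable.
    public synchronized boolean log(String line, long nowMillis) {
        if (nowMillis - windowStart >= 1000) {
            windowStart = nowMillis;   // start a fresh one-second window
            linesInWindow = 0;
        }
        if (linesInWindow >= maxLinesPerSecond) {
            dropped++;                 // over budget: swallow the line
            return false;
        }
        linesInWindow++;
        System.out.println(line);
        return true;
    }

    public synchronized long droppedCount() { return dropped; }
}
```

In a real job you would call `log(msg, System.currentTimeMillis())` and perhaps report `droppedCount()` once at the end, so you still know how much output was suppressed.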
Hi, we have created a syslog drain from PCF to Logstash, but sometimes we get this error:
2018-07-19T15:09:53.524+05:30 [LGR/] [ERR] Syslog Drain: Error when writing. Backing off for 4ms.
What is this and why does it happen?
I suspect that it's a communication problem between the logging system in your Cloud Foundry platform and your Logstash. The message doesn't give you the exact error, though. To find that, you would need to be a platform operator and look at the Loggregator logs to see why it's failing. If you're not the CF platform operator, reach out to your operator for assistance.
When you see errors like this I would suggest checking for two things:
How often do you see this message?
How large does the number in "Backing off for XXms." get?
When an error occurs sending logs the platform will back off, but as errors continue to occur the backoff timeout will get larger. If you see a large value in the backoff timeout, that means you have a prolonged problem. This could be something like you've configured the log drain incorrectly, your LogStash server is down or the network to it is down. If you see the errors frequently, but the number stays low, it means it's only intermittently failing (some logs go OK, some don't) which could point to a flaky network connection, one that's up/down a lot.
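The relationship between consecutive failures and the backoff value can be sketched like this (the base delay, doubling factor, and cap below are illustrative assumptions, not Loggregator's actual constants):

```java
// Sketch of an exponential backoff with a cap: each consecutive failure
// doubles the wait, so a large "Backing off for XXms" value implies many
// failures in a row, while a persistently small value implies the sends
// keep recovering in between. Base delay and cap are assumptions.
public class Backoff {
    static final long BASE_MS = 1;      // hypothetical base delay
    static final long CAP_MS = 60_000;  // hypothetical upper bound

    // Delay after `failures` consecutive failures: BASE * 2^(failures-1), capped.
    public static long delayMs(int failures) {
        if (failures <= 0) return 0;
        long delay = BASE_MS << Math.min(failures - 1, 62);
        return Math.min(delay, CAP_MS);
    }
}
```

Under these assumed constants, a "Backing off for 4ms" message would correspond to only a few failures in a row, while a value near the cap would mean the drain has been failing for a while.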
I'm trying to handle Couchbase bootstrap failure gracefully and not fail the application startup. The idea is to use "Couchbase as a service", so that if I can't connect to it, I should still be able to return a degraded response. I've been able to somewhat achieve this by using the Couchbase async API; RxJava FTW.
Problem is, when the server is down, the Couchbase Java client goes crazy and keeps trying to connect to the server; from what I see, the class that does this is ConfigEndpoint and there's no limit to how many times it tries before giving up. This is flooding the logs with java.net.ConnectException: Connection refused errors. What I'd like, is for it to try a few times, and then stop.
Got any ideas that can help?
Edit:
Here's a sample app.
Steps to reproduce the problem:
svn export https://github.com/asarkar/spring/trunk/beer-demo.
From the beer-demo directory, run ./gradlew bootRun. Wait for the application to start up.
From another console, run curl -H "Accept: application/json" "http://localhost:8080/beers". The client request is going to timeout due to the failure to connect to Couchbase, but Couchbase client is going to flood the console continuously.
The reason we choose to have the client continue connecting is that Couchbase is typically deployed in high-availability clustered situations. Most people who run our SDK want it to keep trying to work. We do it pretty intelligently, I think, in that we do an exponential backoff and have tuneables so it's reasonable out of the box and can be adjusted to your environment.
As to what you're trying to do, one of the tuneables is related to retry. With adjustment of the timeout value and the retry, you can have the client referenceable by the application and simply fast fail if it can't service the request.
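The fast-fail idea can be illustrated generically: bound the lookup with a timeout and return a degraded value instead of propagating the failure. This is a stdlib-only sketch; the `Supplier` stands in for a Couchbase get, and none of these names are the SDK's API (the SDK's own tuneables for timeout/retry are the proper way to do this in production):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Sketch: wrap a potentially-slow lookup so the caller gets a degraded
// fallback value instead of an error when the backing store is slow or down.
public class DegradedLookup {
    // Daemon threads so a hung lookup cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    public static String getOrFallback(Supplier<String> lookup,
                                       long timeoutMs,
                                       String fallback) {
        Callable<String> task = lookup::get;
        Future<String> f = POOL.submit(task);
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            f.cancel(true);      // fast fail: give up on the slow call
            return fallback;     // serve the degraded response
        }
    }
}
```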
The other option is that we do have a way to let your application know what node would handle the request (or null if the bootstrap hasn't been done) and you can use this to implement circuit breaker like functionality. For a future release, we're looking to add circuit breakers directly to the SDK.
All of that said, these are not the normal path as the intent is that your Couchbase Cluster is up, running and accessible most of the time. Failures trigger failovers through auto-failover, which brings things back to availability. By design, Couchbase trades off some availability for consistency of data being accessed, with replica reads from exception handlers and other intentionally stale reads for you to buy into if you need them.
Hope that helps and glad to get any feedback on what you think we should do differently.
Solved this issue myself. The client I designed handles the following use cases:
The client startup must be resilient of CB failure/availability.
The client must not fail the request, but return a degraded response instead, if CB is not available.
The client must reconnect should a CB failover happen.
I've created a blog post here. I understand it's preferable to copy-paste rather than linking to an external URL, but the content is too big for an SO answer.
Start a separate thread and keep calling ping on it every 10 or 20 seconds. Once CB is down, ping will start failing; have a check like "if ping fails 5-6 times in a row, then close all the CB connections/resources".
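The counting logic in that suggestion can be sketched like this (the ping is a stand-in `BooleanSupplier`, not the Couchbase SDK's diagnostics call, and `onDead` is where you would close the client's connections/resources):

```java
import java.util.function.BooleanSupplier;

// Sketch of the "close resources after N consecutive ping failures" idea.
// A healthy ping resets the counter, so only a sustained outage trips it.
public class PingMonitor {
    private final BooleanSupplier ping;
    private final Runnable onDead;
    private final int threshold;
    private int consecutiveFailures = 0;

    public PingMonitor(BooleanSupplier ping, int threshold, Runnable onDead) {
        this.ping = ping;
        this.threshold = threshold;
        this.onDead = onDead;
    }

    // Call this every 10-20 seconds from a scheduled thread.
    public void check() {
        if (ping.getAsBoolean()) {
            consecutiveFailures = 0;    // healthy again, reset the count
        } else if (++consecutiveFailures == threshold) {
            onDead.run();               // e.g. close all CB connections
        }
    }

    public int failures() { return consecutiveFailures; }
}
```

In practice you would drive `check()` from a `ScheduledExecutorService` at a fixed rate.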
I have deployed a simple Spring Boot app in Google App Engine Flexible. The app has two APIs: one to add user data into the DB (xxx.appspot.com/add), the other to get all user data from the DB (xxx.appspot.com/all).
I wanted to see how GAE scales under load, so I used JMeter to create a load with 100-user concurrency, ramped up in 10 seconds, calling these two APIs with a half-second delay, forever. While it runs fine for some time (with just one instance), it starts to fail after 30 seconds or so with a java.net.SocketException or "The server responded with a status of 502".
After this error, when I try to access the same API from the browser, it displays,
Error: Server Error
The server encountered a temporary error and could not complete your
request. Please try again in 30 seconds.
The service is back to normal after 30 minutes or so, and whenever the load test runs it repeats the same behavior as above. I expect GAE to auto-scale based on the incoming load and handle it without any downtime (using multiple instances); instead it just crashes or blocks the service (without any information in the log). My app.yaml configuration is:
runtime: java
env: flex
service: hello-service
automatic_scaling:
min_num_instances: 1
max_num_instances: 10
I am a bit stuck with this one, Any help would be greatly appreciated. Thanks in advance.
The solution was to increase the resource configuration, details below.
Given that I did not set a resources parameter, it defaulted to the pre-defined values for both CPU and memory. In this case, the default memory was 0.6GB. App Engine Flex instances use about 0.4GB for overhead processes. Given that Java is known to consume a lot of memory, there is a great likelihood that the overhead processes consumed more than the approximate 0.4GB value. Instances in App Engine are restarted for a variety of reasons, including optimization due to memory use. This explains why your instances went away and why the log shows Tomcat starting up (they got restarted), ending in 502 errors because nginx could not complete the request. Fixing the above may lessen, if not completely eliminate, the 502s.
After I specified the resources attribute and increased the configuration in app.yaml, the 502 errors seem to be gone.
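For anyone hitting the same defaults, a resources block in app.yaml looks like this (the values below are illustrative sizes for this kind of workload, not a recommendation; remember that roughly 0.4GB of the memory goes to instance overhead):

```yaml
runtime: java
env: flex
service: hello-service
automatic_scaling:
  min_num_instances: 1
  max_num_instances: 10
resources:
  cpu: 2
  memory_gb: 2.3
  disk_size_gb: 10
```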
I am using Windows Azure SDK 2.2 and have created an Azure cloud service that uses an in-role cache.
I have 2 instances of the service running under normal conditions.
When the service scales (up to 3 instances, or back down to 2 instances), I get lots of DataCacheExceptions. These are often accompanied by Azure DB connection failures from the processing going on inside the cache access code. (If I don't find the entry I want in the cache, I get it from the DB and put it into the cache. All standard stuff.)
I have implemented retry processes on the cache gets and puts, and use the ReliableSqlConnection object with a retry process for db connection using the Transient Fault Handling application block.
The retry process uses a fixed interval retrying every second for 5 tries.
The failures are typically;
Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:There is a temporary failure. Please retry later
Any idea why the scaling might cause these exceptions?
Should I try a less aggressive retry policy?
Any help appreciated.
I have also noticed that I am getting a high percentage (> 70%) cache miss rate and when the system is struggling, there is high cpu utilisation (> 80%).
Well, I haven't been able to find out any reason for the errors I am seeing, but I have 'fixed' the problem, sort of!
When looking at the last few days processing stats, it is clear the high cpu usage corresponds with the cloud service having 'problems'. I have changed the service to use two medium instances instead of two small instances.
This seems to have solved the problem, and the service has been running quite happily, low cpu usage, low memory usage, no exceptions.
So, whilst still not discovering what the source of the problems were, I seem to have overcome them by providing a bigger environment for the service to run in.
--Late news!!! I noticed this morning that from about 06:30, the cpu usage started to climb, along with the time taken for the service to process as it should. Errors started appearing and I had to restart the service at 10:30 to get things back to 'normal'. Also, when restarting the service, the OnRoleRun process threw loads of DataCacheExceptions before it started running again, 45 minutes later.
Now all seems well again, and I will monitor for the next hours/days...
There seems to be no explanation for this, remote desktop to the instances show no exceptions in the event log, other logging is not showing application problems, so I am still stumped.
We are using AppFabric Cache in our project, and we ran into 2 major problems.
First: we are using named caches (no explicitly created regions). One of them, created with Expirable=false, Eviction=none, TTL=525600, is used for objects that should always be available (populated at application start via the Put method). But from time to time (I couldn't identify an exact timespan, nor a connection to certain actions in the application), all objects in this cache suddenly expire. I can see this from the performance counters: the object count for this cache drops to 0, and the total expired objects counter increases by the number of objects in this cache at the same time. Am I missing some other setting? I tried both inserting them via Put() without a timespan, and Put() with a timespan of a year. They still expire after several minutes...
The second problem: when I tried to solve the first one, I decided to use the ETW trace logging feature to see in a log what is happening. I created a trace log via logman and started it, waited for the cache to expire, stopped the log, and used tracerpt to create a dump file from the .etl. Everything OK so far. But this dump file is useless, because there is no readable data, only 4400690073007400720....... After some quick research, I figured out that I need to supply a PDB or TMF file to tracerpt so it can decode the binary event data into readable event data. Is it possible to get one of these for AppFabric Cache? Or is there some other way to use ETW with AppFabric to get a useful, readable log?
I found out what the problem is when your cache is expiring nearly instantly.
If your memory is low the cache gets cleared.
Check in eventvwr -> Application and Services Logs -> Microsoft -> Windows -> Application Server-System Services and select Operational.
Look for warnings like:
Service available memory low - Cache private bytes percent {2} Cache working set percent {1} Cache data size percent {0} Available memory percent {21} CLR Generation2 count {2013} Released memory percent {0}.
There is an explanation here of how to convert the log file to CSV: http://msdn.microsoft.com/en-us/library/ff921010.aspx
But I cannot even use the tracelog tool.