MlflowException: API request to ...URL... failed to return code 200 after 3 tries - azure-databricks

I am currently trying to track my machine learning model metrics using the MLFlow API in Azure Databricks.
I registered the experiment under my team's machine learning workspace and tried a few metric log commands as a test, which worked.
My notebook then ran a for loop that logs metrics for each calculation within the loop.
It ran for a while (3-5 seconds) before raising the error.
When I looked at the experiment metrics, it seemed to have logged some of the for loop's metrics before crashing.
I'm not sure why this happens, and now the same exception is also thrown by my earlier test calls that log metrics.


What could cause AWS S3 MultiObjectDeleteException?

In our Spring Boot app, we are using AmazonS3Client.deleteObjects() to delete multiple objects in a bucket. From time to time, the request throws MultiObjectDeleteException and one or more objects are not deleted. It does not happen often, about 5 failures among thousands of requests, but it could still be a problem. What could lead to the exception?
I also have no idea how to debug it. The log from our app follows the data flow but does not show much useful information. The exception is simply thrown after the request. Please help.
Another thing is that the exception comes back with a 200 code. How could this be possible?
One or more objects could not be deleted (Service: null; Status Code: 200; Error Code: null; Request ID: xxxx; S3 Extended Request ID: yyyy; Proxy: null)
TLDR: Some error rate is normal and the application should handle it. 500 and 503 errors are retriable. The MultiObjectDeleteException should provide a clue, and getDeletedObjects() gives you a list of the deleted objects. The rest you should generally retry later.
The MultiObjectDeleteException documentation says that the exception should include an explanation of the issue that caused the error:
Exception for partial or total failure of the multi-object delete API, including the errors that occurred. For successfully deleted objects, refer to getDeletedObjects().
According to the Amazon S3 Service Level Agreement, AWS does not guarantee 100% availability. Quoting that document:
• “Error Rate” means: (i) the total number of internal server errors returned by the Amazon S3 Service as error status “InternalError” or “ServiceUnavailable” divided by (ii) the total number of requests for the applicable request type during that 5-minute interval. We will calculate the Error Rate for each Amazon S3 Service account as a percentage for each 5-minute interval in the monthly billing cycle. The calculation of the number of internal server errors will not include errors that arise directly or indirectly as a result of any of the Amazon S3 SLA Exclusions.
We usually think about an SLA in terms of downtime, so it is easy to assume that AWS means the same. That is not the case here: some number of errors is normal and should be expected. In many documents AWS suggests implementing a combination of slowdowns and retries.
Some 500 and 503 errors are, again, part of normal operation.
The documentation specifically says:
Because Amazon S3 is a distributed service, a very small percentage of 5xx errors is expected during normal use of the service. All requests that return 5xx errors from Amazon S3 can be retried. This means that it's a best practice to have a fault-tolerance mechanism or to implement retry logic for any applications making requests to Amazon S3. By doing so, S3 can recover from these errors.
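The retry logic described in the quote can be sketched as follows. This is a minimal illustration in Python rather than the Java SDK, and it is not the SDK's own retry policy: `RetriableError`, `backoff_delays` and `call_with_retries` are hypothetical names, and the "full jitter" exponential backoff is one common choice among several.

```python
import random
import time


class RetriableError(Exception):
    """Hypothetical error type that carries an HTTP status code."""

    def __init__(self, status_code):
        super().__init__(f"request failed with status {status_code}")
        self.status_code = status_code


def backoff_delays(max_attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield one delay per retry: rng() scaled by min(cap, base * 2**n)."""
    for attempt in range(max_attempts - 1):
        yield rng() * min(cap, base * (2 ** attempt))


def call_with_retries(request, max_attempts=5, sleep=time.sleep, **backoff):
    """Call request(); retry on 5xx status codes, re-raise anything else."""
    delays = backoff_delays(max_attempts, **backoff)
    while True:
        try:
            return request()
        except RetriableError as err:
            if not 500 <= err.status_code < 600:
                raise  # non-5xx errors are not retried
            delay = next(delays, None)
            if delay is None:
                raise  # attempts exhausted, surface the last error
            sleep(delay)
```

The random factor spreads retries from many clients over time, so they do not all hit the service again at the same instant.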
Edit: A question was added later: "How is it possible that the API call returned status code 200 while some objects were not deleted?"
The answer to that is very simple: this is how the API is defined. From the SDK reference page for deleteObjects you can go directly to the AWS API documentation page,
which says that this is the expected behavior. Status code 200 means that the high-level API call succeeded and was able to request the deletion of the listed objects. Some of those deletions did fail, but the API call reported them in the response.
Why does the Java API throw an exception then? The authors of the AWS Java SDK had to translate the response into Java, and they clearly thought that, while the AWS API works with a non-zero error rate as part of the service agreement, Java developers are more used to anything short of 100% success ending in an exception.
Both abstractions are well documented, and it is the programmer who is responsible for a precise implementation. The engineering rule is cheap, fast, reliable: choose two. AWS was able to provide a service which has all three, with the reasonable concession that part of the reliability is implemented on the client side through retries and slowdowns.
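Handling the partial failure on the client side could look roughly like the sketch below. It is written in Python for brevity and does not use the AWS SDK: `delete_batch` is a hypothetical callable standing in for a deleteObjects() call, returning the keys that were deleted (what getDeletedObjects() reports) and the keys that failed.

```python
def delete_until_done(keys, delete_batch, max_rounds=3):
    """Re-submit only the keys that failed, for up to max_rounds rounds.

    delete_batch(keys) is a hypothetical callable returning (deleted, failed),
    two lists of keys. Returns (deleted, still_failed).
    """
    remaining = list(keys)
    deleted = []
    for _ in range(max_rounds):
        if not remaining:
            break
        done, failed = delete_batch(remaining)
        deleted.extend(done)
        remaining = list(failed)  # only these go into the next round
    return deleted, remaining
```

A real client would also wait between rounds and log or alert on whatever is still left in the second list after the last round.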

AWS Aurora / Lambda serverless production environment exhibiting occasional spikes

We've been running our production web app off AWS Lambda / API Gateway, with an Aurora serverless database. Things had been running smoothly for over a year, but recently (coinciding with much increased periods of peak usage) we've experienced temporary slowness, and in the worst case unavailability, due to some kind of bottleneck that results in a spike in the number of DB connections and 4XX and 5XX from our two APIs.
We're using the serverless-mysql library to execute queries and manage DB connections.
Some potential causes of the issue that have been eliminated:
There are no long-running queries locking up tables or anything of that sort (as demonstrated by show full processlist in MySQL); in fact, no query runs longer than 1s according to our slow_log
All calls to await serverlessMysql.query() are immediately followed by await serverlessMysql.end()
Our database manager class is instantiated outside the Lambda handler, so it isn't reinstantiated every time a Lambda instance is reused
We've adjusted the config options for serverless-mysql so that retries aren't so aggressive. The default config retries connections very aggressively, both in frequency and in number of retries. This has definitely helped, but has not eliminated the problem.
What details can I post that might help someone diagnose this problem? It's a major pain in the ass.
It would be helpful to see the load this application is getting, which I know is easier said than done with Lambda.
You sort of hinted at it, but it's possible you're hitting the max connections limit for the 'capacity class' your Aurora Serverless instance is set to. I've hit this a few times. It's hard to discover with Lambda and Aurora Serverless because you don't have the same logging you would traditionally have.
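A rough way to sanity-check this is to compare peak connection demand with the limit. Each concurrently running Lambda instance holds its own connection(s), so demand is roughly concurrent instances times connections per instance. The helper below is a hypothetical sketch of that arithmetic; the inputs would have to come from your own measurements, for example SHOW VARIABLES LIKE 'max_connections' and SHOW STATUS LIKE 'Max_used_connections' in MySQL, and the Lambda concurrency metrics in CloudWatch.

```python
def connection_headroom(concurrent_lambdas, conns_per_lambda, max_connections,
                        reserved=0):
    """Return connections left at peak; a negative value means over the limit."""
    peak = concurrent_lambdas * conns_per_lambda
    return max_connections - reserved - peak


# Illustrative numbers only, not taken from the question
print(connection_headroom(150, 1, 100))  # prints -50
```

A negative result would suggest that new invocations are queuing or retrying for a connection during spikes.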
Outside of that, the core issue you're experiencing seems to be related to spikes created by your application, so you need to discover whether a query is maybe just inefficient and running too many times at once. These are almost impossible to troubleshoot with Lambda logs, and DB locks still occur with Aurora Serverless.
To help track down the issue, you could try the following:
Set up APM
I highly, highly recommend getting something like New Relic set up and monitoring your Lambda function.
I'm pretty sure NR has a free trial option, and tracking down a problem like this would be much simpler with an APM. I can't tell you how much easier problems like this are to solve with a solid APM.
Monitor traffic ingress
Again, I'm not sure what this application is doing, but it's possible that a spike in network traffic from a particular user kicks off a load of queries that make things go awry. Set up a free Cloudflare account or some other proxy if you can, so you can inspect network traffic more easily.
Hope this helps.

502 server error in Google App Engine Flexible when load testing with JMeter

I have deployed a simple Spring Boot app in Google App Engine Flexible. The app has two APIs: one to add user data to the DB and another to get all user data from the DB.
I wanted to see how GAE scales under load, so I used JMeter to generate load with 100 concurrent users ramped up over 10 seconds, calling these two APIs with a half-second delay, forever. It runs fine for some time (with just one instance), but starts to fail after 30 seconds or so with a "" or "The server responded with a status of 502".
After this error, when I try to access the same API from the browser, it displays:
Error: Server Error
The server encountered a temporary error and could not complete your
request. Please try again in 30 seconds.
The service is back to normal after 30 mins or so, and whenever the load test runs it repeats the same behavior. I expected GAE to auto-scale (using multiple instances) to handle the incoming load without any downtime; instead it just crashes or blocks the service (without any information in the log). My app.yaml configuration is:
runtime: java
env: flex
service: hello-service
min_num_instances: 1
max_num_instances: 10
I am a bit stuck with this one. Any help would be greatly appreciated. Thanks in advance.
The solution was to increase the resource configuration, details below.
Given that I did not set a resources parameter, it defaulted to the predefined values for both CPU and memory. In this case, the default memory was 0.6 GB. App Engine Flex instances use about 0.4 GB for overhead processes. Given that Java is known to consume more memory, there is a great likelihood that the overhead processes consumed more than the approximate 0.4 GB value. Instances in App Engine are restarted for a variety of reasons, including optimization due to memory use. This explains why the instances went off and why it shows Tomcat starting up (they got restarted), which ends in a 502 error because nginx is not able to complete the request. Fixing the above may lessen, if not completely eliminate, the 502s.
After I specified the resources attribute and increased the configuration in app.yaml, the 502 errors seem to be gone.
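For reference, a resources block in app.yaml might look like the fragment below. The values are illustrative assumptions, not the ones actually used here; cpu, memory_gb and disk_size_gb are the resource settings of the App Engine flexible environment, and suitable numbers depend on how much memory the JVM needs under load.

```yaml
# Fragment to add to app.yaml; values are illustrative, not from the post
resources:
  cpu: 2
  memory_gb: 4
  disk_size_gb: 10
```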

Azure in role cache exceptions when service scales

I am using Windows Azure SDK 2.2 and have created an Azure cloud service that uses an in-role cache.
I have 2 instances of the service running under normal conditions.
When the service scales (up to 3 instances, or back down to 2 instances), I get lots of DataCacheExceptions. These are often accompanied by Azure DB connection failures from the process going on inside the cache. (If I don't find the entry I want in the cache, I get it from the DB and put it into the cache. All standard stuff.)
I have implemented retry processes on the cache gets and puts, and use the ReliableSqlConnection object with a retry process for db connection using the Transient Fault Handling application block.
The retry process uses a fixed interval retrying every second for 5 tries.
The failures are typically:
Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:There is a temporary failure. Please retry later
Any idea why the scaling might cause these exceptions?
Should I try a less aggressive retry policy?
Any help appreciated.
I have also noticed that I am getting a high percentage (> 70%) cache miss rate and when the system is struggling, there is high cpu utilisation (> 80%).
Well, I haven't been able to find out any reason for the errors I am seeing, but I have 'fixed' the problem, sort of!
Looking at the last few days' processing stats, it is clear that the high CPU usage corresponds with the cloud service having 'problems'. I have changed the service to use two medium instances instead of two small instances.
This seems to have solved the problem, and the service has been running quite happily, low cpu usage, low memory usage, no exceptions.
So, whilst I still have not discovered what the source of the problems was, I seem to have overcome them by providing a bigger environment for the service to run in.
--Late news!!! I noticed this morning that from about 06:30, the cpu usage started to climb, along with the time taken for the service to process as it should. Errors started appearing and I had to restart the service at 10:30 to get things back to 'normal'. Also, when restarting the service, the OnRoleRun process threw loads of DataCacheExceptions before it started running again, 45 minutes later.
Now all seems well again, and I will monitor for the next hours/days...
There seems to be no explanation for this. Remote desktop to the instances shows no exceptions in the event log, and other logging is not showing application problems, so I am still stumped.

EC2 Amazon Server getting stuck

I have my website hosted on an Amazon EC2 instance.
Overall it's working fine, but sometimes (about once per hour) it seems to get stuck. I'm not even able to type commands on the server console when it's in that state.
I moved from the micro instance to the small one expecting some improvement, but the same thing keeps happening.
Any guidance where I should look to resolve this?
This depends on various factors.
Areas you should be looking:
If you are not able to connect (SSH) to your instance: check your system log from your management console.
If you are experiencing slow response times: check your CloudWatch metrics from your console.
Verify the running processes on your instance and find out which process is taking the most CPU % / memory %. You can do this with top or ps -auwx.
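Because the hang only shows up about once an hour, it can help to capture process snapshots on a schedule and inspect them afterwards. The sketch below is one hypothetical way to do that in Python: `snapshot` and `top_processes` are illustrative names, and the `ps -eo pid,pcpu,pmem,comm` invocation assumes a Linux host with the usual procps `ps`.

```python
import subprocess


def snapshot():
    """Return raw `ps` output with PID, %CPU, %MEM and command columns."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,pmem,comm"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def top_processes(ps_output, n=5):
    """Parse the output of snapshot() and return the n heaviest CPU users."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        if not line.strip():
            continue
        pid, cpu, mem, command = line.split(None, 3)
        rows.append((float(cpu), float(mem), int(pid), command.strip()))
    rows.sort(reverse=True)  # highest CPU first
    return rows[:n]
```

Calling top_processes(snapshot()) from a cron job every minute and appending the result to a file gives a record to look at after the next freeze.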
