What could cause AWS S3 MultiObjectDeleteException? - spring-boot

In our Spring Boot app, we are using AmazonS3Client.deleteObjects() to delete multiple objects in a bucket. From time to time, the request throws MultiObjectDeleteException and one or many objects won't be deleted. It is not often, about 5 failures among thousands of requests. But still it could be a problem. What could lead to the exception?
And I have no idea how to debug. The log from our app follows the data flow but not showing much useful information. It suddenly throws the exception after the request. Please help.
Another thing is that the exception comes back with a 200 code. How could this be possible?
com.amazonaws.services.s3.model.MultiObjectDeleteException: One or
more objects could not be deleted (Service: null; Status Code: 200;
Error Code: null; Request ID: xxxx; S3 Extended Request ID: yyyy;
Proxy: null)

TLDR: Some error rates are normal and the application should handle them. 500 and 503 errors are retriable. The MultiObjectDeleteException should provide a clue and getDeletedObjects() gives you a list of the deleted objects. The rest you should mostly try later.
In the MultiObjectDeleteException documentation is said that exception should have an explanation of the issue which caused the error
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/MultiObjectDeleteException.html
Exception for partial or total failure of the multi-object delete API, including the errors that occurred. For successfully deleted objects, refer to getDeletedObjects().
According to https://aws.amazon.com/s3/sla/ AWS does not guarantee 100% availability. Again, according to that document:
• “Error Rate” means: (i) the total number of internal server errors returned by the Amazon S3 Service as error status “InternalError” or “ServiceUnavailable” divided by (ii) the total number of requests for the applicable request type during that 5-minute interval. We will calculate the Error Rate for each Amazon S3 Service account as a percentage for each 5-minute interval in the monthly billing cycle. The calculation of the number of internal server errors will not include errors that arise directly or indirectly as a result of any of the Amazon S3 SLA Exclusions.
Usually we think about SLA in the terms of downtimes so it is easy to assume that AWS does mean the same. But that's not the case here. Some number of errors is normal and should be expected. In many documents AWS does suggest that you should implement a combination of slowdowns and retries e.g. here https://docs.aws.amazon.com/AmazonS3/latest/userguide/ErrorBestPractices.html
Some 500 and 503 errors are, again, part of the normal operation https://aws.amazon.com/premiumsupport/knowledge-center/http-5xx-errors-s3/
The documents specifically says:
Because Amazon S3 is a distributed service, a very small percentage of 5xx errors is expected during normal use of the service. All requests that return 5xx errors from Amazon S3 can be retried. This means that it's a best practice to have a fault-tolerance mechanism or to implement retry logic for any applications making requests to Amazon S3. By doing so, S3 can recover from these errors.
Edit: Later was added a question: "How is it possible that the API call returned status code 200 while some objects were not deleted."
And the answer to that is very simple: This is how the API is defined. From the JDK reference page for deleteObjects you can go directly to the AWS API documentation page https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html
Which says that this is the expected behavior. Status code 200 means that the high level API code succeeded and was able to request the deletion of the listed objects. Well, some of these actions did fail and, but the API call did create a report about it in the response.
Why does the Java API throw an exception then? Again, the authors of the AWS Java SDK tried to translate the response to the Java programming language and they clearly thought that while AWS API works with a non-zero error rate as part of the service agreement, Java developers are more used to a situation that anything but 100% success should end up by an exception.
Both of the abstractions are well documented and it is the programmer who is responsible for a precise implementation. The engineering rule is cheap, fast, reliable - chose two. AWS was able to provide a service which has all three with a reasonable concession that part of the reliability will be implemented on the client side - retries and slow-downs.

Related

Laravel - Efficiently consuming large external API into database

I'm attempting to consume the Paypal API transaction endpoint.
I want to grab ALL transactions for a given account. This number could potentially be in the 10's of millions of transactions. For each of these transactions, I need to store it in the database for processing by a queued job. I've been trying to figure out the best way to pull this many records with Laravel. Paypal has a max request items limit of 20 per page.
I initially started off with the idea of creating a job when a user gives me their API credentials that gets the first 20 items and processes them, then dispatches a job from the first job that contains the starting index to use. This would loop forever until it errored out. This doesn't seem to be working well though as it causes a gateway timeout on saving those API credentials and the request to the API eventually times out (before getting all transactions). I should also mention that the total number of transactions is unknown, so chaining doesn't seem to be the answer as there is no way to know how many jobs to dispatch...
Thoughts? Is getting API data best suited for a job?
Yes job is way to go . I’m not familiar with paypal api but it’s seems requests are rate limited paypal rate limiting.. you might want to delay your api requests a bit.. also you can make a class to monitor your api requests consumption by tracking the latest requests you made and in the job you can determine when to fire the next request and record it in the database...
My humble advise
please don’t pull all the data your database will get bloated quickly and you’ll need to scale each time you have a new account it’s not easy task.
You could dispatch the same job at the end of the first job which queries your current database to find the starting index of the transactions for that job.
So even if your job errors out, you could dispatch it again, then it will resume from where it was ended previously
May be you will need link your app with another data engine like AWS, anyway I think the best idea is creating an APi, pull only the most important data, indexed, and keep the all big data in another endpoint, where you can reach them if you need

Google's RuntimeConfig API responds with 'Our systems have detected unusual traffic from your computer network'

Since today (november 20 2018) we get error responses from Google's RuntimeConfig API:
Our systems have detected unusual traffic from your computer network. This page checks to see if it's really you sending the requests, and not a robot...
(check this link for complete HTML error)
We retrieve variables from Google's RuntimeConfig using the API in our code. We do quite a few request, but not more than before:
A developer starts his server locally, which retrieves all the needed variables (+- 30 everytime you start).
Requesting RuntimeConfig variables via GCloud results in the same HTML error:
gcloud beta runtime-config configs variables get-value databaseHost --config-name database --project=your-test-environment
Other gcloud api requests work (projects describe, gsutil, etc).
How can I verify if I violated any terms? I can only find a usage limit in GCloud Console of 6000 calls per minute.
You can find the quotas for Runtime Configurator and how much of those you are using in the Cloud Console under IAM & Admin. In the Quotas section you can filter on Service = Cloud Runtime Configuration API and you should see all the quotas and how close to those you are for this API. There are 4 quotas that may affect you (docs here):
1200 Queries Per Minute (QPM) for delete, create, and update requests
600 QPM for watch requests
6000 QPM for get and list requests.
4MB of data per project, which consists of all data written to the Runtime Configurator service and accompanying metadata.
We had the exact same issue on November 20th when a large amount of our preemptibles were reallocated at the same time.
Our startup-scripts make use of the gcloud beta runtime-config...-commands and they all responded with 503.
These commands responded correctly again after a few hours.
We have had a support-ticket with Google and there was a problem with their internal quota mechanisms at the time which since is fixed so the issue is resolved.

Google API Key giving Query over limit error

We have an web application which was working fine till yesterday. But since yesterday afternoon , one of our projects in google api console , all the keys started giving OVER_QUERY_LIMIT error.
And we cross checked that the quotas for that project and api are still not full. Can anybody help me to understand what may have caused this.
And after a days use also the API keys are still giving the same error.
Just to give more information we are using Geocoding API and Distance Matrix API in our application.
If you exceed the usage limits you will get an OVER_QUERY_LIMIT status code as a response. This means that the web service will stop providing normal responses and switch to returning only status code OVER_QUERY_LIMIT until more usage is allowed again. This can happen:
Within a few seconds, if the error was received because your application sent too many requests per second.
Within the next 24 hours, if the error was received because your application sent too many requests per day. The daily quotas are reset at midnight, Pacific Time.
This screencast provides a step-by-step explanation of proper request throttling and error handling, which is applicable to all web services.
Upon receiving a response with status code OVER_QUERY_LIMIT, your application should determine which usage limit has been exceeded. This can be done by pausing for 2 seconds and resending the same request. If status code is still OVER_QUERY_LIMIT, your application is sending too many requests per day. Otherwise, your application is sending too many requests per second.
Note: It is also possible to get the OVER_QUERY_LIMIT error:
From the Google Maps Elevation API when more than 512 points per request are provided.
From the Google Maps Distance Matrix API when more than 625 elements per request are provided.
Applications should ensure these limits are not reached before sending requests.
Documentation usage limits

502 server error in Google App Engine Flexible when load testing with JMeter

I have deployed a simple Spring boot app in Google App Engine Flexible. The app. has two APIs, one to add the user data into the DB (xxx.appspot.com/add) the other to get all the user data from the DB (xxx.appspot.com/all).
I wanted to see how GAE scales for the load, hence used JMeter to create a load with 100 user concurrency ramped up in 10 seconds and calls these two APIs in half a second delay, forever. While it runs fine for sometime (with just one instance), it starts to fail after 30 seconds or so with a "java.net.SocketException" or "The server responded with a status of 502".
After this error, when I try to access the same API from the browser, it displays,
Error: Server Error
The server encountered a temporary error and could not complete your
request. Please try again in 30 seconds.
The service is back to normal after 30 mins or so, and whenever the load test happens it repeats the same behavior as mentioned above. I expect GAE to auto-scale based on the load coming in to handle it without any down time (using multiple instances), instead it just crashes or blocks the service (without any information in the log). My app.yaml configuration is,
runtime: java
env: flex
service: hello-service
automatic_scaling:
min_num_instances: 1
max_num_instances: 10
I am a bit stuck with this one, Any help would be greatly appreciated. Thanks in advance.
The solution was to increase the resource configuration, details below.
Given that I did not set a resource parameter, it defaulted to the pre-defined values for both CPU and Memory. In this case, the default
memory was set at 0.6GB. App Engine Flex instances uses about 0.4GB
for overhead processes. Given Java is known to consume higher memory, there is a
great likelihood that the overhead processes consumed more than the
approximate 0.4GB value. Now instances in App Engine are restarted due
to a variety of reasons including optimization due to memory use. This
explains why your instances went off and it shows Tomcat is starting
up (they got restarted) and ends up in 502 error due to the nginx is
not able to complete the request. Fixing the above may lessen if not completely eliminate the 502s.
After I have specified the resources attribute and increased the configuration in app.yaml 502 error seems to be gone.

Google Classroom 503 Service Unavailable Backend Error while Creating Course

I have a service that takes a queue of courses created in my SIS and am trying to automatically create them via the Google Classroom API. I was able to create around 1000 courses and now I am getting the error below:
Google.Apis.Requests.RequestError
The service is currently unavailable. [503]
Errors [
Message[The service is currently unavailable.] Location[ - ] Reason[backendError] Domain[global]
]
It does not seem to matter what I do, the error still occurs.
This is a regular occurrence with Google APIs. It's the method which Google servers use to say "you're going to fast slow down". In order to handle this, well behaved API clients should implement exponential backoff.
So for example, your script can create courses as fast as it can as long as it's getting HTTP 2xx success responses from Google. As soon as it sees a 503 back end error, it should pause all calls for 1 second and then retry the failed operation. Very often on 2nd try the operation will succeed but if it doesn't your script should pause 2 seconds, then 4, 8, etc, etc until success. I recommend maxing out at 10 tries and then failing with an error.
If your script does not do backoff and just continues to retry API calls with no pause, you are likely to see an increase in errors like this and your script may be blacklisted eventually.

Resources