boto ElasticMapReduce throttling and rate limiting - amazon-ec2

I've run into rate limiting from Amazon EMR a few times via the boto API, with the following:
boto.exception.EmrResponseError: EmrResponseError: 400 Bad Request
<ErrorResponse xmlns="http://elasticmapreduce.amazonaws.com/doc/2009-03-31">
<Error>
<Type>Sender</Type>
<Code>Throttling</Code>
<Message>Rate exceeded</Message>
</Error>
<RequestId>69d74a63-7de3-11e0-aafc-2b540b1e5f42</RequestId>
</ErrorResponse>
The operation is a one-time request for the state of a jobflow, so there shouldn't be any rate limiting involved. Has anyone else run into this issue? Also, there doesn't seem to be much documentation on EC2 and EMR throttling/rate limiting...

Almost all (if not all) AWS APIs are rate limited. Even reading data puts a load on their services (some more than others) so they protect themselves by limiting the rate of requests each account is allowed to make. According to the AWS docs the recommended approach to dealing with a throttling response is to implement exponential backoff in your retry logic.
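For the boto case in the question, that boils down to catching the Throttling error and sleeping progressively longer between attempts. A minimal sketch only, using the boto 2-style EMR API; the region, retry budget, and jitter values are assumptions rather than anything from the original post:

import random
import time

import boto.emr
from boto.exception import EmrResponseError

def describe_jobflow_with_backoff(jobflow_id, max_retries=5):
    """Fetch job flow state, backing off exponentially on Throttling errors."""
    conn = boto.emr.connect_to_region('us-east-1')  # assumed region
    for attempt in range(max_retries):
        try:
            return conn.describe_jobflow(jobflow_id)
        except EmrResponseError as e:
            # The throttling response shown above carries <Code>Throttling</Code>;
            # boto surfaces it as error_code. Re-raise anything else.
            if getattr(e, 'error_code', '') != 'Throttling':
                raise
            # Exponential backoff with a little jitter: ~1s, 2s, 4s, 8s, ...
            time.sleep((2 ** attempt) + random.random())
    raise RuntimeError('Still throttled after %d attempts' % max_retries)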

Related

What could cause AWS S3 MultiObjectDeleteException?

In our Spring Boot app, we are using AmazonS3Client.deleteObjects() to delete multiple objects in a bucket. From time to time the request throws MultiObjectDeleteException and one or more objects aren't deleted. It doesn't happen often (about 5 failures among thousands of requests), but it could still be a problem. What could lead to the exception?
I have no idea how to debug this. The log from our app follows the data flow but doesn't show much useful information; it suddenly throws the exception after the request. Please help.
Another thing is that the exception comes back with a 200 status code. How is this possible?
com.amazonaws.services.s3.model.MultiObjectDeleteException: One or
more objects could not be deleted (Service: null; Status Code: 200;
Error Code: null; Request ID: xxxx; S3 Extended Request ID: yyyy;
Proxy: null)
TL;DR: Some rate of errors is normal and the application should handle it. 500 and 503 errors are retriable. The MultiObjectDeleteException should provide a clue, and getDeletedObjects() gives you the list of objects that were deleted; the rest you should mostly retry later.
The MultiObjectDeleteException documentation says the exception should include an explanation of the issue that caused the error:
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/MultiObjectDeleteException.html
Exception for partial or total failure of the multi-object delete API, including the errors that occurred. For successfully deleted objects, refer to getDeletedObjects().
According to https://aws.amazon.com/s3/sla/, AWS does not guarantee 100% availability. Quoting that document:
• “Error Rate” means: (i) the total number of internal server errors returned by the Amazon S3 Service as error status “InternalError” or “ServiceUnavailable” divided by (ii) the total number of requests for the applicable request type during that 5-minute interval. We will calculate the Error Rate for each Amazon S3 Service account as a percentage for each 5-minute interval in the monthly billing cycle. The calculation of the number of internal server errors will not include errors that arise directly or indirectly as a result of any of the Amazon S3 SLA Exclusions.
Usually we think about an SLA in terms of downtime, so it is easy to assume AWS means the same. That's not the case here: some number of errors is normal and should be expected. In many documents AWS suggests implementing a combination of slowdowns and retries, e.g. https://docs.aws.amazon.com/AmazonS3/latest/userguide/ErrorBestPractices.html
Some 500 and 503 errors are, again, part of normal operation: https://aws.amazon.com/premiumsupport/knowledge-center/http-5xx-errors-s3/
That document specifically says:
Because Amazon S3 is a distributed service, a very small percentage of 5xx errors is expected during normal use of the service. All requests that return 5xx errors from Amazon S3 can be retried. This means that it's a best practice to have a fault-tolerance mechanism or to implement retry logic for any applications making requests to Amazon S3. By doing so, S3 can recover from these errors.
Edit: A question was added later: "How is it possible that the API call returned status code 200 while some objects were not deleted?"
The answer is very simple: this is how the API is defined. From the Java SDK reference page for deleteObjects you can go directly to the AWS API documentation page https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html
which says that this is the expected behavior. Status code 200 means that the high-level API call succeeded and was able to request the deletion of the listed objects. Some of those individual deletions did fail, but the API call reports them in the response.
Why does the Java API throw an exception then? The authors of the AWS Java SDK had to translate the response into Java idioms, and they clearly decided that, while the AWS API treats a non-zero error rate as part of the service agreement, Java developers expect anything short of 100% success to surface as an exception.
Both abstractions are well documented, and it is the programmer who is responsible for handling them precisely. The engineering rule is cheap, fast, reliable: choose two. AWS manages to provide a service with all three, with the reasonable concession that part of the reliability is implemented on the client side as retries and slowdowns.
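In practice that means reading the partial-failure report and retrying only the keys that failed. The question uses the Java SDK; below is a rough boto3 (Python) sketch of the same pattern, with the bucket name and the three-attempt budget as placeholder assumptions. Note that when you inspect the boto3 response directly, failures show up in its Errors list rather than as an exception:

import boto3

def delete_keys(bucket, keys, attempts=3):
    """Delete objects in one call, then retry only the keys reported as failed."""
    s3 = boto3.client('s3')
    pending = list(keys)  # assumes at most 1000 keys per call
    while pending and attempts > 0:
        resp = s3.delete_objects(
            Bucket=bucket,
            Delete={'Objects': [{'Key': k} for k in pending]},
        )
        # Successfully deleted keys come back under 'Deleted'; failures under
        # 'Errors' with a code such as 'InternalError', which is retriable.
        pending = [err['Key'] for err in resp.get('Errors', [])]
        attempts -= 1
    if pending:
        raise RuntimeError('Could not delete: %s' % pending)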

AWS Aurora / Lambda serverless production environment exhibiting occasional spikes

We've been running our production web app off AWS Lambda / API Gateway, with an Aurora Serverless database. Things had been running smoothly for over a year, but recently (coinciding with much heavier peak usage) we've experienced temporary slowness, and in the worst case unavailability, due to some kind of bottleneck that results in a spike in the number of DB connections and in 4XX and 5XX responses from our two APIs.
We're using the serverless-mysql library to execute queries and manage DB connections.
Some potential causes of the issue that have been eliminated:
There are no long-running queries locking up tables or anything of that sort (as demonstrated by show full processlist in MySQL); in fact no query runs longer than 1s according to our slow_log
All calls to await serverlessMysql.query() are immediately followed by await serverlessMysql.end()
Our database manager class is instantiated outside the Lambda handler, so it isn't reinstantiated every time a Lambda instance is reused
We've adjusted the config options for serverless-mysql so that retries aren't so aggressive. The default config makes it very aggressive in retrying to connect, both in frequency and number of retries. This has definitely helped, but has not eliminated the problem.
What details can I post that might help someone diagnose this problem? It's a major pain in the ass.
It would be helpful to see the load this application is getting, which I know is easier said than done with Lambda.
You sort of hinted at it, but it's possible you're hitting the max connections limit for the capacity class your Aurora Serverless instance is set to. I've hit this a few times (a quick way to check is sketched below). It's hard to discover with Lambda and Serverless Aurora because you don't have the same logging you would traditionally have.
Outside of that, the core issue you're experiencing seems to be related to spikes created by your application, so you need to discover whether a query is maybe just inefficient and running too many times at once. These are almost impossible to troubleshoot with Lambda logs, but DB locks still occur with Aurora Serverless.
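One cheap way to test the max-connections theory is to pull the cluster's connection count from CloudWatch and line it up with the 4XX/5XX spikes. A minimal boto3 sketch; the cluster identifier and time window are placeholders, and I'm assuming the standard AWS/RDS DatabaseConnections metric, which Aurora clusters publish per DBClusterIdentifier:

from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client('cloudwatch')

stats = cloudwatch.get_metric_statistics(
    Namespace='AWS/RDS',
    MetricName='DatabaseConnections',
    Dimensions=[{'Name': 'DBClusterIdentifier', 'Value': 'my-cluster'}],  # placeholder
    StartTime=datetime.utcnow() - timedelta(hours=6),
    EndTime=datetime.utcnow(),
    Period=300,                 # 5-minute buckets
    Statistics=['Maximum'],
)

# Print the peak connection count per bucket so it can be compared with the
# timestamps of the API errors.
for point in sorted(stats['Datapoints'], key=lambda p: p['Timestamp']):
    print(point['Timestamp'], point['Maximum'])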
To help track down the issue, you could try the following:
Set up an APM
I highly, highly recommend getting something like New Relic set up and monitoring your Lambda function.
I'm pretty sure NR has a free trial option, and tracking down a problem like this would be seemingly simple with an APM. I can't tell you how much easier problems like this are to solve with a solid APM.
Monitor traffic ingress
Again, I'm not sure what this application is doing, but it could be possible that a spike in network traffic from a particular user kicks off a load of queries that make things go awry. Set up a free Cloudflare account or some other proxy if you can, and determine network traffic more easily.
Hope this helps.

Google's RuntimeConfig API responds with 'Our systems have detected unusual traffic from your computer network'

Since today (November 20, 2018) we have been getting error responses from Google's RuntimeConfig API:
Our systems have detected unusual traffic from your computer network. This page checks to see if it's really you sending the requests, and not a robot...
(check this link for complete HTML error)
We retrieve variables from Google's RuntimeConfig using the API in our code. We make quite a few requests, but not more than before:
A developer starts his server locally, which retrieves all the needed variables (about 30 every time you start).
Requesting RuntimeConfig variables via GCloud results in the same HTML error:
gcloud beta runtime-config configs variables get-value databaseHost --config-name database --project=your-test-environment
Other gcloud API requests work (projects describe, gsutil, etc.).
How can I verify whether I violated any terms? The only usage limit I can find in the GCloud Console is 6000 calls per minute.
You can find the quotas for the Runtime Configurator, and how much of them you are using, in the Cloud Console under IAM & Admin. In the Quotas section you can filter on Service = Cloud Runtime Configuration API and you should see all the quotas for this API and how close to them you are. There are 4 quotas that may affect you (docs here):
1200 Queries Per Minute (QPM) for delete, create, and update requests
600 QPM for watch requests
6000 QPM for get and list requests.
4MB of data per project, which consists of all data written to the Runtime Configurator service and accompanying metadata.
We had the exact same issue on November 20th, when a large number of our preemptible instances were reallocated at the same time.
Our startup scripts make use of the gcloud beta runtime-config... commands, and they all responded with 503.
These commands responded correctly again after a few hours.
We had a support ticket with Google; there was a problem with their internal quota mechanisms at the time, which has since been fixed, so the issue is resolved.
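Even when the root cause is on Google's side, a jittered retry around the same gcloud call shown in the question keeps a transient 503 from failing an entire server start. A rough Python sketch; the variable, config, and project names are placeholders:

import random
import subprocess
import time

def get_runtime_config_value(variable, config, project, attempts=5):
    """Fetch a Runtime Config value via gcloud, retrying with jittered backoff."""
    cmd = [
        'gcloud', 'beta', 'runtime-config', 'configs', 'variables',
        'get-value', variable,
        '--config-name', config,
        '--project', project,
    ]
    for attempt in range(attempts):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout.strip()
        # Back off roughly 1s, 2s, 4s, ... plus jitter before trying again.
        time.sleep((2 ** attempt) + random.random())
    raise RuntimeError('gcloud kept failing: %s' % result.stderr)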

Why did Google invalidate all OAuth2 access tokens today?

What happened to your OAuth2/token servers today around 7am PST?
URL: https://accounts.google.com/o/oauth2/token
Token refreshing failed (credential.refreshToken() == false) for all our Google users, resulting in lost access tokens and connectivity for hundreds of our clients!
Then, for a while, our users could not re-authorize, as Google was returning 503 Unavailable and a CAPTCHA in response to the following API call:
GoogleAuthorizationCodeFlow gacf = new GoogleAuthorizationCodeFlow.Builder(
        new NetHttpTransport(), new JacksonFactory(), appId, appSecret, scopes)
        .build();
tokenResponse = gacf.newTokenRequest(code)
        .setRedirectUri(<callback_url>)
        .execute();
com.google.api.client.auth.oauth2.TokenResponseException: 503 Service Unavailable
Our systems have detected unusual traffic from your computer network. This page checks to see if it's really you sending the requests, and not a robot.
After a while everything started working again, only now we have hundreds of clients for whom our service stopped working because there are no valid OAuth2 access tokens anymore.
Can you please explain what Google did earlier today and how it managed to invalidate all the tokens? How can we find out what "unusual traffic" it detected?
Thank you
Are you on App Engine?
If so, maybe it's this...
SUMMARY:
On Thursday 5 March 2015, for a duration of 84 minutes, Google App Engine applications that accessed some Google APIs over HTTP experienced elevated error rates. We apologize for any impact this incident had on your service or application, and have made immediate changes to prevent this issue from recurring.
DETAILED DESCRIPTION OF IMPACT:
On Thursday 5 March, from 07:04 AM to 08:28 AM, some Google App Engine applications making calls to other Google APIs via HTTP experienced elevated error rates. During the incident, the global error rate for all API calls remained under 1%, and in total, the outage affected 2% of applications that were active during the incident. The effect on those applications was significant: requests to issue OAuth tokens experienced an error rate of over 85%. In addition, the HTTP APIs to googleapis.com/storage and googleapis.com/gmail received error rates between 50% and 60%. Other googleapis.com endpoints were affected with error rates of 10% to 20%.
ROOT CAUSE:
A component in Google’s shared HTTP load balancing fabric experienced a non-malicious increase in traffic, exceeding its provisioned capacity. This triggered an automatic DoS protection which shunted a portion of the incoming traffic to a CAPTCHA. The unexpected response caused some clients to issue automated retries, exacerbating the problem.
REMEDIATION AND PREVENTION:
Google Engineers were alerted to the issue by automated monitoring at 07:02, as the load balancing system detected excess traffic and attempted to automatically mitigate it. At 07:46, Google Engineers enabled standby load balancing capacity to rectify the issue. From 08:15 to 08:40, Google Engineers continued to provision additional resources in the load balancing fabric in order to serve the increased traffic. During this period, at 08:28, Google engineers determined that sufficient capacity was in place to serve both regular and retry traffic, and instructed the load balancing system to cease mitigation and resume normal traffic serving. This action marked the end of the event.
To prevent this issue from recurring, Google engineers are comprehensively re-examining the affected load balancing fabric to ensure it is and remains correctly provisioned. Additionally, Google engineers are improving monitoring rules to provide an early warning of capacity shortfall. Finally, Google engineers are examining the services that depend on this load balancing system, and will move some services to a separate pool of more easily scalable load balancers where appropriate.

Heroku: Really slow connect times to RDS

We're stumped by this and have been trying to solve it for a month. It looks like every once in a while (maybe one in 100 requests) establishing a connection to RDS or RabbitMQ takes > 0.5 seconds.
We're using Heroku with Django, amazon RDS and S3 (we do tons of image uploads).
Here's a sample slow connection (fast connections are really fast, like < 20 ms):
It's worth mentioning both Heroku and RDS are in the us-east region. Also when we use Redis Cloud (Heroku add-on) we see slow connection speeds to that too.
UPDATE:
I've also found that these same delays exist for Memcached. In fact, on average a Memcached request takes roughly the same amount of time as a request to the database. I'm just using Memcached for storing API keys, and there's plenty of memory left.
It resembles something I read earlier on a blog, where it seemed that on Heroku a request might be routed to a dyno that is still handling a long-running request. That way the old request blocks the execution of the next one. The routing isn't quite as "intelligent" as advertised.
I'm not 100% sure this is the problem you are experiencing, as it seems the slowness shouldn't be in the connection itself but in Heroku's handling of the request.
You can find more details on the problem described here: http://news.rapgenius.com/James-somers-herokus-ugly-secret-lyrics
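If you want to check whether the delay really is in establishing the connection rather than in Heroku's routing, timing raw TCP connects to the RDS endpoint from a one-off dyno (heroku run python ...) is a quick test. A rough sketch; the hostname and port are placeholders:

import socket
import time

HOST = 'mydb.example.us-east-1.rds.amazonaws.com'  # placeholder RDS endpoint
PORT = 3306

samples = []
for _ in range(200):
    start = time.time()
    sock = socket.create_connection((HOST, PORT), timeout=5)
    samples.append(time.time() - start)
    sock.close()
    time.sleep(0.1)

# If the p95/max here stays low while app-level connects spike, the tail
# latency is more likely in dyno routing or driver setup than in the network.
samples.sort()
print('p50 %.1f ms  p95 %.1f ms  max %.1f ms' % (
    samples[len(samples) // 2] * 1000,
    samples[int(len(samples) * 0.95)] * 1000,
    samples[-1] * 1000,
))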
