Minimise / eradicate initial Fn spin-up delay for OCI API Gateway authorisation

We are using an Fn function for OCI API Gateway authorisation (https://docs.cloud.oracle.com/en-us/iaas/Content/APIGateway/Tasks/apigatewayusingauthorizerfunction.htm). We're finding that there is a slight delay in the auth process when the function hasn't been triggered for a while, as an instance of the function container spins up, which is expected. As the Oracle documentation states:
When the function has finished executing and after a period being idle, the Docker container is removed. If Oracle Functions receives another call to the same function before the container is removed, the second request is routed to the same running container. If Oracle Functions receives a call to a function that is currently executing inside a running container, Oracle Functions scales horizontally to serve both incoming requests and a second Docker container is started. (https://docs.cloud.oracle.com/en-us/iaas/Content/Functions/Concepts/functionshowitworks.htm)
We would like to minimise or ideally eradicate this initial delay, for instance by keeping one instance of the Function running all the time. What would be the best approach?

I doubt you could keep the Fn container hot without repeatedly invoking it in the first place. One crude option would be to keep calling it after every "sleep" interval, but this has to be traded off against the associated Fn invocation cost per month.
Other options depend on how long the actual operation runs. For instance, the work could be split into two operations represented by two functions. An Fn function can call another Fn function, so you should be able to sequence invoking them one by one if that is achievable for your intended task.
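For illustration, here is a minimal sketch of the periodic-ping idea in Python; the gateway URL is hypothetical, and in practice you would drive this from a scheduler (cron, a monitoring service, etc.) rather than an infinite loop:

import time
import requests  # third-party HTTP client (pip install requests)

GATEWAY_URL = "https://example-gateway.example.com/status"  # hypothetical endpoint

while True:
    try:
        resp = requests.get(GATEWAY_URL, timeout=10)
        print("ping:", resp.status_code)
    except requests.RequestException as exc:
        print("ping failed:", exc)
    # Trade-off: a shorter sleep keeps the container warmer but raises invocation cost.
    time.sleep(300)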

That is known in serverless as a "cold start", and reducing the initial startup time is something that is actively being worked on. Until then, a health check can be used to periodically ping the function.
Essentially, create a case in the function for when the URL ends in something like /status or /healthcheck. In that case
return response.Response(ctx,
                         response_data=json.dumps({"status": "OK"}),
                         headers={"Content-Type": "application/json"})
In API Gateway, create a route for /status (or /healthcheck) that invokes the function, making sure to enable anonymous access on it.
Then set up a health check to periodically invoke the API on the /status or /healthcheck endpoint. This both keeps the function active and monitors its health; your case could perform any needed validation rather than just returning an OK response.
Another thing to keep in mind is that API Gateway can cache responses, so depending on your chosen TTL you may need to adjust your health-check timing accordingly.
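Put together, a minimal sketch of such a handler using the Python FDK might look like the following; the /status path convention is from above, while the handler structure and the non-health-check response are assumptions for illustration:

import io
import json
from fdk import response

def handler(ctx, data: io.BytesIO = None):
    # The Fn HTTP gateway populates the request URL on the invoke context;
    # route health-check calls to a lightweight response.
    url = ctx.RequestURL() or ""
    if url.endswith("/status") or url.endswith("/healthcheck"):
        return response.Response(ctx,
                                 response_data=json.dumps({"status": "OK"}),
                                 headers={"Content-Type": "application/json"})
    # ... the normal authoriser logic would go here (hypothetical placeholder) ...
    return response.Response(ctx,
                             response_data=json.dumps({"active": False}),
                             headers={"Content-Type": "application/json"})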

This 'hot start' requirement is now covered by Oracle Cloud's "Provisioned Concurrency" feature for Functions:
https://docs.oracle.com/en-us/iaas/Content/Functions/Tasks/functionsusingprovisionedconcurrency.htm
From the documentation:
Provisioned concurrency is the ability of OCI Functions to always have available the execution infrastructure for at least a certain minimum number of concurrent function invocations.

Related

DynamoDB:PutItem calls silently ignored

I have a Lambda function bound to CodeBuild notifications; a Lambda instance writes details of the notification that triggered it to a DynamoDB table (BillingMode PAY_PER_REQUEST).
Each CodeBuild notification spawns an independent Lambda instance. A CodeBuild build can spawn 7-8 separate notifications/Lambda instances, many of which often happen simultaneously.
The Lambda function uses DynamoDB:PutItem to put details of the notification to DynamoDB. What I find is that out of 7-8 notifications in a 30-second period, sometimes all 7-8 get written to DynamoDB, but sometimes it can be as low as 0-1; many calls to DynamoDB:PutItem simply seem to be "ignored".
Why is this happening?
My guess is that DynamoDB simply shouldn't be accessed by multiple Lambda instances in this way; that best practice is to push the updates to an SQS queue bound to a separate Lambda, and have that separate Lambda write many updates to DynamoDB as part of a transaction.
Is that right? Why might parallel independent calls to DynamoDB:PutItem fail silently?
TIA.
DynamoDB is accessed over a web endpoint and can therefore handle any number of concurrent connections, so the issue is not with how many Lambdas are writing.
I typically see this happen when the Lambda isn't allowed to wait until the API requests are complete and the container gets shut down prematurely. I would first check your code and ensure that your Lambda stays alive until all items are processed; you can verify this by adding some simple logging to your code.
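For illustration, a minimal sketch in Python (the table name and the SNS-style event shape are assumptions): boto3's put_item is synchronous, so each call completes or raises before the handler returns, and the surrounding log lines make any dropped write visible in CloudWatch.

import logging
import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)
dynamodb = boto3.client("dynamodb")

def handler(event, context):
    for record in event.get("Records", []):
        item = {
            "Id": {"S": record["Sns"]["MessageId"]},  # key schema is hypothetical
            "Body": {"S": record["Sns"]["Message"]},
        }
        logger.info("putting item %s", item["Id"]["S"])
        dynamodb.put_item(TableName="BuildNotifications", Item=item)  # hypothetical table
        logger.info("put succeeded for %s", item["Id"]["S"])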
What you are describing is a good use case for Step Functions.
As much as Lambda functions are great glue between services, they have their overheads and their limitations. With Step Functions, you can call DynamoDB:PutItem directly, and you can handle various scenarios and flows, such as async calls. These flows are possible to implement in a Lambda function, however with less visibility and less traceability.
BTW, you can also call a Lambda function from Step Functions; however, I recommend you try to use the direct service call to maximize the benefits of the Step Functions service.
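As a rough sketch of what that direct call looks like (the table name and input paths are hypothetical), a state machine step calling DynamoDB:PutItem is defined in Amazon States Language like this:

{
  "StartAt": "PutNotification",
  "States": {
    "PutNotification": {
      "Type": "Task",
      "Resource": "arn:aws:states:::dynamodb:putItem",
      "Parameters": {
        "TableName": "BuildNotifications",
        "Item": {
          "Id": {"S.$": "$.messageId"},
          "Body": {"S.$": "$.message"}
        }
      },
      "End": true
    }
  }
}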
My mistake, I had a separate issue which was messing up some of the range keys and causing updates to "fail" silently. But thanks for the tip regarding timeouts.

Use Cases for LRA

I am attempting to accomplish something along these lines with Quarkus and Narayana:
client calls service to start a process that takes a while: /lra/start
This call sets off an LRA, and returns an LRA id used to track the status of the action
client can keep polling some endpoint to determine status
service eventually finishes and marks the action done through the coordinator
client sees that the action has completed, is given the result or makes another request to get that result
Is this a valid use case? Am I visualizing the correct way this tool can work? Based on how the linked guide reads, it seems that the endpoints are more of a passthrough to the coordinator, notifying it that we start and end an LRA. Is there a more programmatic way to interact with the coordinator?
Yes, it might be a valid use case, but in any case please read the MicroProfile LRA specification - https://github.com/eclipse/microprofile-lra.
What you describe is more or less one LRA participant executing in a new LRA while the client polls the status of that execution. This is not exactly what LRA is intended for, but it can certainly be used this way.
The main idea of LRA is the composition of distributed transactions based on the saga pattern. Basically, the point is to coordinate multiple services to achieve consistent results with an eventual consistency guarantee. So the main benefit arises when you can propagate the LRA through different services that either all complete their actions or all have their compensation callbacks called in case of failures (and, of course, only for the services that executed their actions in the first place). Here is also an example with LRA propagation: https://github.com/xstefank/quarkus-lra-trip-example.
EDIT: Sorry, I forgot to add that there is a programmatic API that allows the same interactions as the annotations - https://github.com/jbosstm/narayana/blob/master/rts/lra/client/src/main/java/io/narayana/lra/client/NarayanaLRAClient.java. Note, however, that it is not in the specification and is specific to Narayana.

Lambdas timing out

I have a specific Lambda function, invoked by SNS events, that repeatedly times out in about half of its instances without appearing to run any of the handler code.
What's peculiar is that I have a number of log statements at the very start of the function handler that should be getting triggered, yet nothing is logged.
I've tried increasing the timeout to 120 seconds, but this doesn't fix anything. I've also looked at the Lambda init logic (the code outside the main handler method), but it's just simple imports and class initialisation; no database connections or HTTP requests that might be causing a timeout.
The handler logic does include database connections and network requests, but if those were timing out I'd expect to also see some logs prior to the timeouts.
When I view the Lambda logs by stream, around half of them look like the above and just time out, whereas the other half run as expected. Are streams specific to individual Lambda containers? If so, then it looks as if there are a number of "dead" containers.
Has anyone experienced an issue like this in the past or has any idea what is going on?
This issue was fixed after realising that the Lambda was attached to two different subnets, one of which didn't have a NAT gateway. After moving the Lambda to a single subnet with a NAT gateway, the timeouts have stopped.

How can I trigger one AWS Lambda function from another, guaranteeing the second only runs once?

I've built a bit of a pipeline of AWS Lambda functions using the Serverless framework. There are currently five steps/functions, and I need them to run in order and each run exactly once. Roughly, the functions are:
Trigger function by an HTTP request, respond with an ID.
Access an API to get the URL of a resource to download.
Download that resource and upload a copy to S3.
Alter that resource and upload the altered copy to S3.
Submit the altered resource to a different API.
The specifics aren't important, but the question is: What's the best event/trigger to use to move along down this line of functions? The first one is triggered by an HTTP call, but the first one needs to trigger the second somehow, then the second triggers the third, and so on.
I wrote all the code using AWS SNS, but now that I've deployed it to staging I see that SNS often triggers more than once. I could add a bunch of code to detect this, but I'd rather not. And the problem is also compounding -- if the second function gets triggered twice, it sends two SNS notifications to trigger step three. If either of those notifications gets doubled... it's not unreasonable that the last function could be called ten times instead of once.
So what's my best option here? Trigger the chain through HTTP? Kinesis maybe? I have never worked with a trigger other than HTTP or SNS, so I'm not really sure what my options are, and which options are guaranteed to only trigger the function once.
AWS Step Functions seems pretty well targeted at this use-case of tying together separate AWS operations into a coherent workflow with well-defined error handling.
Not sure if the pricing will work for you (can be pricey for millions+ operations) but it may be worth looking at.
Also not sure about performance overhead or other limitations, so YMMV.
You can simply trigger the next Lambda asynchronously in your Lambda function after you complete the required processing in that step.
So, the first Lambda is triggered by an HTTP call, and in that Lambda's execution, after you finish processing the step, just launch the next Lambda function asynchronously instead of sending the trigger through SNS or Kinesis. Repeat this process in each of your steps. This would guarantee single execution of all the steps by Lambda.
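For illustration, a direct asynchronous invocation with boto3 might look like this; the function name and do_work are hypothetical placeholders:

import json
import boto3

lambda_client = boto3.client("lambda")

def do_work(event):
    # Hypothetical processing for this pipeline step.
    return {"previous": event}

def handler(event, context):
    result = do_work(event)
    # InvocationType="Event" makes this a fire-and-forget asynchronous invoke.
    lambda_client.invoke(
        FunctionName="pipeline-step-2",  # hypothetical next step
        InvocationType="Event",
        Payload=json.dumps(result).encode("utf-8"),
    )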
Eventful Lambda triggers (SNS, S3, CloudWatch, ...) generally guarantee at-least-once invocation, not exactly-once. As you noted you'd have to handle deduplication manually by, for example, keeping track of event IDs in DynamoDB (using strongly consistent reads!), or by implementing idempotent Lambdas, meaning functions that have no additional effects even when invoked several times with the same input. In your example step 4 is essentially idempotent providing that the function doesn't have any side effects apart from storing the altered copy, and that the new copy overwrites any previously stored copies with the same event ID.
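As a sketch of the event-ID tracking approach (the table and attribute names are hypothetical), a conditional put makes the check-and-record step atomic, so only the first invocation for a given event proceeds:

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def seen_before(event_id):
    # The condition fails if this event ID already exists in the table.
    try:
        dynamodb.put_item(
            TableName="ProcessedEvents",  # hypothetical table
            Item={"EventId": {"S": event_id}},
            ConditionExpression="attribute_not_exists(EventId)",
        )
        return False
    except ClientError as exc:
        if exc.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return True
        raise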
One service that does guarantee exactly-once delivery out of the box is SQS FIFO. This service unfortunately cannot be used to trigger Lambdas directly so you'd have to set up a scheduled Lambda to poll the FIFO queue periodically (as per this answer). In your case you could handle step 5 with this arrangement, since I'm assuming you don't want to submit the same resource to the target API several times.
So in summary here's how I'd go about it:
Lambda A, invoked via HTTP, responds with ID and proceeds to asynchronously fetch resource from the API and store it to S3
Lambda B, invoked by S3 upload event, downloads the uploaded resource, alters it, stores the altered copy to S3 and finally pushes a message into the FIFO SQS queue using the altered resource's filename as the distinct deduplication ID
Lambda C, invoked by CloudWatch scheduler, polls the FIFO SQS queue and upon a new message fetches the specified altered resource from S3 and submits it to the other API
With this arrangement even if Lambda B is occasionally executed twice or more by the same S3 upload event there's no harm done since the FIFO SQS queue handles deduplication for you before the flow reaches Lambda C.
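For reference, Lambda B's push into the FIFO queue might look like this (the queue URL and filename are hypothetical); the MessageDeduplicationId is what provides exactly-once delivery within SQS's five-minute deduplication window:

import boto3

sqs = boto3.client("sqs")

# MessageGroupId is required for FIFO queues; messages sharing a
# MessageDeduplicationId are delivered only once within the window.
sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/altered-resources.fifo",
    MessageBody="altered/resource-abc123.png",
    MessageGroupId="altered-resources",
    MessageDeduplicationId="resource-abc123.png",
)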
AWS Step Functions is meant for you: https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html
You execute the steps you want based on the outputs of previous steps.
Each task/step just needs to output JSON correctly into the wanted "state".
https://docs.aws.amazon.com/step-functions/latest/dg/concepts-states.html
Based on the state, your workflow moves on. You can create your workflow easily and trigger Lambdas or ECS tasks.
ECS tasks are your own "Lambda" environment, running without the constraints of the AWS Lambda environment.
With ECS tasks you can run on bare metal, on your own EC2 machines, or in Docker containers on ECS, and thus have extensible resource limits.
Compare this to Lambda, where the limits are pretty strict: 500 MB of disk, limited execution time, etc.

Use of self.clients.claim() and self.skipWaiting() in service worker may increase the load time?

After reading many tutorials, I found that self.skipWaiting() is used to immediately apply an update to an existing service worker, and self.clients.claim() is used to take control immediately on the first load.
self.addEventListener('install', function(event) {
  // Activate the updated service worker as soon as installation finishes.
  event.waitUntil(self.skipWaiting());
});
self.addEventListener('activate', function(event) {
  // Take control of all open clients without waiting for a navigation.
  event.waitUntil(self.clients.claim());
});
Does it look for an update on every request, or how does it work internally? Does the use of self.clients.claim() and self.skipWaiting() have any impact on load time or service worker performance?
There's no impact from a performance perspective.
Both self.skipWaiting() and self.clients.claim() work by taking some action and, at the same time, immediately resolving with an undefined value.
In the case of self.skipWaiting(), the action taken is effectively just flipping an internal flag and causing the service worker to attempt to activate. In the case of self.clients.claim(), the action is to go through all the clients and attempt to have the currently executing service worker take control.
The actual promise that's returned by both of those methods is irrelevant, and you don't have to wrap them in event.waitUntil() (although it doesn't hurt, and many examples of service worker usage continue to do so).
Additionally, because your code is only making calls to those methods inside of install and activate listeners, the code in question won't even execute most of the time that the service worker thread starts up—only when there's an updated service worker.
