Is there any way to trigger an AWS Lambda function at the end of an AWS Glue job? - aws-lambda

Currently I'm using an AWS Glue job to load data into Redshift, but after that load I need to run some data cleansing tasks, probably using an AWS Lambda function. Is there any way to trigger a Lambda function at the end of a Glue job? Lambda functions can be triggered using SNS messages, but I couldn't find a way to send an SNS message at the end of the Glue job.

@oreoluwa is right, this can be done using CloudWatch Events.
From the CloudWatch console:
Click on 'Rules' in the left menu and create a new rule.
For 'Event Source', choose 'Event Pattern', and for 'Service Name' choose 'Glue'.
For 'Event Type', choose 'Glue Job State Change'.
On the right side of the page, in the 'Targets' section, click 'Add Target' -> 'Lambda Function' and then choose your function.
The event you'll get in Lambda will be of the format:
{
    "version": "0",
    "id": "a9bc90be-xx00-03e0-9bc5-a0a0a0a0a0a0",
    "detail-type": "Glue Job State Change",
    "source": "aws.glue",
    "account": "xxxxxxxxxx",
    "time": "2018-05-10T16:17:03Z",
    "region": "us-east-2",
    "resources": [],
    "detail": {
        "jobName": "xxxx_myjobname_yyyy",
        "severity": "INFO",
        "state": "SUCCEEDED",
        "jobRunId": "jr_565465465446788dfdsdf546545454654546546465454654",
        "message": "Job run succeeded"
    }
}
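For reference, a minimal Python handler sketch for the data-cleansing Lambda, assuming nothing beyond the event shape above, could pull the job name and state out of the rule's event like this:

def lambda_handler(event, context):
    # CloudWatch Events / EventBridge delivers the Glue job state change event shown above.
    detail = event.get('detail', {})
    job_name = detail.get('jobName')
    state = detail.get('state')
    print(f"Glue job {job_name} finished with state {state}")

    if state == 'SUCCEEDED':
        # run the data cleansing logic here
        pass

    return {'jobName': job_name, 'state': state}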

Since AWS Glue has started supporting Python, you can follow the approach below to achieve what you want. The sample script shows how to do it:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import boto3 ## Step-2
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## Do all ETL work here
## Once the ETL completes
lambda_client = boto3.client('lambda') ## Step-3
response = lambda_client.invoke(FunctionName='string') ## Step-4: replace 'string' with your function name
job.commit()
Create a Python-based Glue job (to perform the ETL on Redshift).
In the job script, import boto3 (you need to place this package as a script library).
Make a connection to Lambda using boto3.
Invoke the Lambda function with boto3's invoke() once the ETL completes.
Make sure the role you use when creating the Glue job has permission to invoke Lambda functions.
Refer to the Boto3 Lambda documentation for details.

No. Currently you can't trigger a Lambda function directly at the end of a Glue job, because AWS does not provide Glue as a Lambda trigger. If you look at the list of AWS Lambda triggers after you create a Lambda function, you will see that it includes most AWS services, but not AWS Glue. So, for now, it is not possible, but maybe it will be in the future.
But I would like to mention that you can control the flow of Glue scripts from your Lambda function (I did it in Python; other supported languages should work too). My use case was: whenever I uploaded an object to an S3 bucket, a Lambda function was triggered, which read the object and started my Glue job. Once the Glue job's status was complete, I wrote my output file back to the S3 bucket linked to this Lambda function.
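For illustration, a rough boto3 sketch of that flow, with a placeholder job name and no error handling, run inside the S3-triggered Lambda (keep in mind the Lambda's own timeout limits how long you can poll):

import time
import boto3

glue = boto3.client('glue')

def lambda_handler(event, context):
    # Triggered by the S3 upload; start the Glue job.
    run = glue.start_job_run(JobName='my-etl-job')  # placeholder job name
    run_id = run['JobRunId']

    # Poll until the job reaches a terminal state.
    while True:
        job_run = glue.get_job_run(JobName='my-etl-job', RunId=run_id)
        state = job_run['JobRun']['JobRunState']
        if state in ('SUCCEEDED', 'FAILED', 'STOPPED', 'TIMEOUT'):
            break
        time.sleep(30)

    # Write the result file back to the bucket here once the job is complete.
    return state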

@ace and @adeel have part of the solution, but you can get this resolved by creating the CloudWatch rule with the following event pattern:
{
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {
        "jobName": ["<YourJobName>"],
        "state": ["SUCCEEDED"]
    }
}
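If you would rather create the rule programmatically than in the console, a boto3 sketch along these lines should work (the rule name, function ARN and account ID are placeholders). Note that you also have to grant EventBridge permission to invoke the function, which the console otherwise does for you:

import json
import boto3

events = boto3.client('events')

pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {"jobName": ["<YourJobName>"], "state": ["SUCCEEDED"]}
}

# Create (or update) the rule with the event pattern above.
events.put_rule(Name='glue-job-succeeded', EventPattern=json.dumps(pattern))

# Point the rule at the cleanup Lambda function.
events.put_targets(
    Rule='glue-job-succeeded',
    Targets=[{
        'Id': 'cleanup-lambda',
        'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:my-cleanup-function'
    }]
)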

Lambda can be triggered on an S3 put. You can write a dummy file to S3 as the last step of the Glue job, which would in turn trigger the Lambda. I have tested this.
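A sketch of what that last step could look like inside the Glue script, with a placeholder bucket and key; the Lambda would then be subscribed to s3:ObjectCreated events on that prefix:

import boto3

s3 = boto3.client('s3')

# Once the ETL work is done, drop a marker object; the S3 put event
# on this prefix triggers the downstream Lambda function.
s3.put_object(Bucket='my-etl-bucket', Key='markers/job-finished.flag', Body=b'done')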

You can orchestrate your AWS Glue Jobs and AWS Lambda functions by using AWS Step Functions. Here is a blog post that explains how to do it and gives an example: https://aws.amazon.com/blogs/big-data/orchestrate-multiple-etl-jobs-using-aws-step-functions-and-aws-lambda/
In essence, when the Glue job finishes (success or failure), your Step Functions workflow can catch the outcome and invoke your Lambda function.
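As a rough sketch of that idea (this is not the blog post's exact workflow; the job name, function name and role ARN are placeholders), the state machine could run the Glue job synchronously and then invoke the Lambda whether the job succeeded or failed:

import json
import boto3

definition = {
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            # .sync makes Step Functions wait for the Glue job to finish.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "my-etl-job"},
            # On failure, still run the cleanup/notification Lambda.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "InvokeCleanupLambda"}],
            "Next": "InvokeCleanupLambda"
        },
        "InvokeCleanupLambda": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": "my-cleanup-function", "Payload.$": "$"},
            "End": True
        }
    }
}

sfn = boto3.client('stepfunctions')
sfn.create_state_machine(
    name='glue-then-lambda',
    definition=json.dumps(definition),
    roleArn='arn:aws:iam::123456789012:role/StepFunctionsGlueLambdaRole'  # placeholder role
)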

Yes, it is possible to trigger it, but for this we have to use EventBridge.
Please follow the instructions below.
Go to EventBridge. Under Events you will find Rules; click it, then click Create rule. Give your rule a suitable name, make sure the radio button 'Rule with an event pattern' is selected, and click Next. The event source will be 'AWS events or EventBridge partner events', and under creation method select 'Use pattern form'.
In the event pattern, select 'AWS services' as the event source and Glue as the AWS service; a new drop-down selection will be enabled, where you select 'Glue Job State Change'.
Then, on the right side, the event pattern is shown; click 'Edit pattern' and change it as you need:
{
    "detail-type": ["Glue Job State Change"],
    "source": ["aws.glue"],
    "detail": {
        "jobName": ["Your glue Name"],
        "state": ["FAILED"]
    }
}
For state you can choose any of STARTING, RUNNING, STOPPING, STOPPED, SUCCEEDED, FAILED, ERROR, WAITING and TIMEOUT.
Don't use any other fields; only if you are watching an EC2 instance do you need the resources field, which you can place next to source.
Then click Next, select 'AWS service' as the target type, select 'Lambda function', choose your Lambda function name in the drop-down that appears after selecting the target, then Next, Next and save.
Congratulations, you have successfully created the configuration to trigger a Lambda function based on a Glue job.
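One note if you create this rule from the CLI or an SDK rather than the console: the console automatically grants EventBridge permission to invoke your function, but otherwise you have to add that resource-based permission yourself. A minimal boto3 sketch, with placeholder names and ARNs, assuming the rule already exists:

import boto3

lambda_client = boto3.client('lambda')

# Allow the EventBridge rule to invoke the target Lambda function.
lambda_client.add_permission(
    FunctionName='my-glue-alert-function',  # placeholder function name
    StatementId='allow-eventbridge-glue-rule',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn='arn:aws:events:us-east-1:123456789012:rule/my-glue-rule'  # placeholder rule ARN
)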

Related

How to trigger an AWS Lambda by sending an event to EventBridge

I have an AWS Lambda whose trigger is an EventBridge rule.
The rule looks like this:
{
    "detail-type": ["ECS Task State Change"],
    "source": ["aws.ecs"],
    "detail": {
        "stopCode": ["EssentialContainerExited", "UserInitiated"],
        "clusterArn": ["arn:aws:ecs:.........."],
        "containers": {
            "name": ["some name"]
        },
        "lastStatus": ["DEACTIVATING"],
        "desiredStatus": ["STOPPED"]
    }
}
This event is normally triggered when an ECS task's status changes (in this case, when a task is killed).
My questions are:
Can I simulate this event from command line?
maybe by running aws events put-events --entries file://putevents.json
(What should I write in the putevents.json file?)
Can I simulate this event from Javascript code?
TL;DR Yes and yes, provided you deal with the limitation that user-generated events cannot have a source that begins with aws.
Send custom events to EventBridge with the PutEvents API. The API is available in the CLI as well as in the SDKs (see AWS JS SDK). The list of custom events you pass in the entries parameter must have three fields at a minimum:
[
    {
        "source": "my-custom-event", // cannot start with aws !!
        "detail-type": "ECS Task State Change",
        "detail": {} // copy from the ECS sample events docs
    }
]
The ECS task state change event samples in the ECS documentation make handy templates for your custom events. You can safely prune any non-required field that you don't need for pattern matching.
Custom events are not permitted to mimic the aws system event sources. So amend your rule to also match on your custom source name:
"source": ["aws.ecs", "my-custom-event"],

Winston Force flush before ending lambda execution

I'm trying to use Winston to send logs to Datadog from an AWS Lambda. The problem with Lambdas is that once we return a response, the Lambda execution stops and it doesn't give Winston time to flush the logs.
Is there a way I can force the flush before returning? I'm trying this, but it doesn't seem to do the trick:
async function handler (event): Promise<FormattedJSONResponse> {
  const logger = getLogger()
  // do some work
  await closeLogger(logger)
  return awsResponse
}

function closeLogger (logger: Logger): Promise<any> {
  const loggerDone = new Promise((resolve, _) => {
    logger.on('finish', () => {
      resolve(logger)
    })
  })
  logger.end()
  logger.close()
  return loggerDone
}
Versions:
AWS Lambda with nodejs 12
Winston: 3.3.3
Thanks for your help
First of all, I don't understand why you would want to send your logs from within your Lambda function. If you do so, your Lambda function will run longer in order to process the logs, meaning you will be charged for the time it takes to send the logs to Datadog.
Instead, you could save the logs to CloudWatch. To avoid high CloudWatch charges, set the retention to a rather short time, maybe one day. On the CloudWatch log group you can then add a subscription whose target is another Lambda function. This "log-processor" Lambda function will process and transform the logs and send them to Datadog. With this architecture your first Lambda function, which contains the business logic, won't fail if Datadog cannot be reached, for instance. It makes your architecture more resilient and gives better separation of concerns. Yan Cui wrote a great article on "Centralised logging for AWS Lambda".
Another approach, which still separates your logging from your Lambda function's business logic to some degree, builds upon Lambda extensions, namely the Lambda Logs API.
Put simply, Lambda extensions add an extra layer to your function but are not part of the Lambda function's code itself. Probably the best part for you: Datadog already offers a ready-to-use extension, which is responsible for:
Pushing real-time enhanced Lambda metrics, custom metrics, and traces from the Datadog Lambda Library to Datadog.
Forwarding logs from your Lambda function to Datadog.
For more info on Lambda extensions follow the links mentioned above or have a look at Yan Cui's post "Lambda Logs API: a new way to process Lambda logs in real-time"
After spending 4 hours on this issue, I found no other way (that works, isn't buggy and is transport-agnostic) than to use an arbitrary timeout before returning a response.
This example is for Next.js, but you can easily remove res: NextApiResponse.
export const gracefulExit = (response: any, res: NextApiResponse) => {
  setTimeout(() => {
    res.send({ ...response, sessionId });
  }, 400);
};
Then in all my serverless functions I don't do res.send({x}) but rather gracefulExit({x}, res)

Lambda Step Functions: Fire & Forget pattern

I have a Python-based Lambda (core Lambda) serving a synchronous API. The API is triggered from a user-interactive application. I now need to add some logging & metrics (slightly compute-intensive) to the Lambda. I don't want the core Lambda to be delayed by this, so I want to push this into a new Lambda (logging Lambda). What I want is: the core Lambda completes its work, triggers the logging Lambda (fire & forget), and returns the response to the API call immediately. The end state (success/failure) of the logging Lambda is irrelevant.
Can Step Functions achieve this? The core and logging Lambdas each have their own end state, and I'm not sure if the Step Functions pattern can accommodate this.
You can start an asynchronous Lambda function invocation using "InvocationType": "Event" in your Invoke parameters. To do that in Step Functions, the ASL code looks like this:
{
  "StartAt": "Invoke Lambda function asynchronously",
  "States": {
    "Invoke Lambda function asynchronously": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "myFunction",
        "Payload.$": "$",
        "InvocationType": "Event"
      },
      "End": true
    }
  }
}
Having an async Lambda Task (as shown above) after your core Lambda Task seems like it should work. To make sure the logging Lambda failing doesn't affect the overall workflow, you can add a Catcher to it on States.ALL and redirect to a Succeed state.
https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html#error-handling-fallback-states
If the secondary Lambda is purely invoked for logging purposes and the state machine is not dependent on its output, you could invoke the secondary Lambda from within your primary Lambda, then return from the primary Lambda. This way your state machine doesn't need to know about the logging steps and you can "fire and forget" before resuming your workflow.
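Since the core Lambda is Python, a minimal fire-and-forget sketch of that second option, with a placeholder logging function name and a stubbed-out business-logic helper, could be:

import json
import boto3

lambda_client = boto3.client('lambda')

def do_core_work(event):
    # placeholder for the existing business logic
    return {'ok': True}

def handler(event, context):
    result = do_core_work(event)

    # 'Event' invocation returns immediately (HTTP 202); the core Lambda
    # neither waits for the logging Lambda nor cares whether it fails.
    lambda_client.invoke(
        FunctionName='logging-lambda',  # placeholder
        InvocationType='Event',
        Payload=json.dumps({'metrics_for': result})
    )

    return result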

When should I use a DynamoDB trigger over calling the Lambda with another?

I currently have one AWS Lambda function that is updating a DynamoDB table, and I need another Lambda function that needs to run after the data is updated. Is there any benefit to using a DynamoDB trigger in this case instead of invoking the second Lambda using the first one?
It looks like the programmatic invocation would give me more control over when the Lambda is called (i.e. I could wait for several updates to occur before calling), and reading from a DynamoDB stream costs money while simply invoking the Lambda does not.
So, is there a benefit to using a trigger here? Or would I be better off invoking the Lambda myself?
A DynamoDB stream seems to be the better practice because:
you take the responsibility of invoking the post-processor function away from your writer Lambda, which keeps the writer simpler (and faster);
you make it easier to connect new external writers to the same table; otherwise you would have to implement the logic to call the post-processor in all of them as well;
you guarantee that all data is post-processed (even if somebody adds a new item in the DynamoDB web console :) );
moneywise, the execution time you spend issuing the invoke() call from the writer Lambda will likely cover the cost of the stream;
unless you use DynamoDB transactions, your data may not yet be available to the post-processor if you call it from the writer too soon; and if your business logic doesn't need transactions, using them just to cover this problem means extra time/cost.
P.S. You can of course batch from the DynamoDB stream out of the box with a simple setting; you are not obliged to invoke the post-processor for every write operation.
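For illustration, a post-processor reading from the stream might look roughly like this sketch (the 'id' attribute is just an assumed example key, not tied to any particular table schema):

def lambda_handler(event, context):
    # Each record carries the stream event for one write to the table.
    for record in event['Records']:
        if record['eventName'] in ('INSERT', 'MODIFY'):
            new_image = record['dynamodb'].get('NewImage', {})
            # Attribute values arrive in DynamoDB JSON form, e.g. {'S': 'abc'}.
            item_id = new_image.get('id', {}).get('S')
            print(f"post-processing item {item_id}")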
Alternatively, after the data is updated you can publish an SQS message and configure the second function to read from Amazon SQS by creating an SQS trigger in the Lambda console.
To create a trigger
Open the Lambda console Functions page.
Choose a function.
Under Designer, choose Add trigger.
Choose a trigger type.
Configure the required options and then choose Add.
Lambda supports the following options for Amazon SQS event sources.
Event Source Options
SQS queue – The Amazon SQS queue to read records from.
Batch size – The number of items to read from the queue in each batch, up to 10. The event may contain fewer items if the batch that Lambda read from the queue had fewer items.
Enabled – Disable the event source to stop processing items.
var QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/{AWS_ACCOUNT_ID}/matsuoy-lambda';
var AWS = require('aws-sdk');
var sqs = new AWS.SQS({region : 'us-east-1'});

exports.handler = function(event, context) {
  var params = {
    MessageBody: JSON.stringify(event),
    QueueUrl: QUEUE_URL
  };
  sqs.sendMessage(params, function(err, data) {
    if (err) {
      console.log('error:', "Fail Send Message" + err);
      context.done('error', "ERROR Put SQS"); // ERROR with message
    } else {
      console.log('data:', data.MessageId);
      context.done(null, ''); // SUCCESS
    }
  });
};
Please don't forget to add a trigger from the other function to this SQS queue. That function will then receive the SQS messages automatically and can handle them.
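On the receiving side, the handler only needs to loop over the delivered records; a minimal Python sketch (the sending code above is Node.js, but the consumer can be in any runtime):

import json

def lambda_handler(event, context):
    for record in event['Records']:
        # The body is the JSON string the first function passed to sendMessage.
        payload = json.loads(record['body'])
        print('received update:', payload)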

one AWS CloudWatch event to control multiple things

I have multiple cloudwatch events. Each of them triggers the same Lambda called app with different inputs at the same time: i.e.
event1 triggers lambda app at a schedule using input: app_name=app1
event2 triggers lambda app at the same schedule using input: app_name=app2.
event3 triggers lambda app at the same schedule using input: app_name=app3.
As you can see, all the events have the same schedule, so I really do not need so many duplicated events.
Is there any way I can use one CloudWatch event to trigger one Lambda with multiple inputs? I.e. at the same time, the same event triggers the Lambda app with input app1, with input app2, and with input app3?
It would make my structure neat: one event, one Lambda (with different inputs) for multiple apps.
You can have one CloudWatch rule with a Schedule event source and one Lambda function target. You will need to configure the target's input to use 'Constant (JSON text)' with an array of data (for example, a JSON array of the values your function should act on).
Then in your Lambda function the event will be your constant. Example with Node.js 8.10 to start EC2 instances:
const AWS = require('aws-sdk');
const ec2 = new AWS.EC2();

exports.handler = async (event) => {
  console.log('Starting instances: %j', event);
  const data = await ec2.startInstances({ InstanceIds: event }).promise();
  console.log(data);
};
