How to pass machine input and Lambda function output to map states

Current Setup:
I currently have a Step Functions state machine that kicks off a Task state (which calls a Lambda function), followed by a Map state (which submits a job to Batch), defined as follows:
State Machine Definition
(note: region and account ID have been omitted and replaced with the placeholder ACCOUNT_INFO)
{
"StartAt": "Populate EFS",
"States": {
"Populate EFS": {
"Next": "MapState",
"Type": "Task",
"InputPath": "$",
"ResultPath": "$.populate_efs_result",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": {
"FunctionName": "arn:aws:lambda:{ACCOUNT_INFO}:function:PopulateEFSLambda",
"Payload.$": "$"
}
},
"MapState": {
"Type": "Map",
"End": true,
"ResultPath": "$.metadata.run_info",
"InputPath": "$",
"Iterator": {
"StartAt": "TaskState",
"States": {
"TaskState": {
"Type": "Task",
"End": true,
"InputPath": "$",
"ResultPath": null,
"Resource": "arn:aws:states:::batch:submitJob.sync",
"Parameters": {
"JobDefinition": "arn:aws:batch:{ACCOUNT_INFO}:job-definition/BatchJobDefCfn:1",
"JobName": "test",
"JobQueue": "arn:aws:batch:{ACCOUNT_INFO}:job-queue/BatchQueue123",
"ContainerOverrides": {
"Command": [
"sh",
"-c",
"entrypoint.pl -i /NGS/${sequencer}/${run_id}/ -s ${sample_name}"
],
"Environment": [
{
"Name": "run_id",
"Value.$": "$.run_id"
},
{
"Name": "sample_name",
"Value.$": "$.sample_name"
},
{
"Name": "sequencer",
"Value.$": "$.sequencer"
}
]
}
}
}
}
}
}
}
}
State machine input
{
"metadata": {
"run_info": [
{
"sample_name": "SAMPLE_X",
"sequencer": "Nextseq"
},
{
"sample_name": "SAMPLE_Y",
"sequencer": "Nextseq"
},
{
"sample_name": "SAMPLE_Z",
"sequencer": "Nextseq"
}
]
}
}
Lambda output (shortened for simplicity)
{"populate_efs_result": {
"ExecutedVersion": "$LATEST",
"Payload": "RUN_1"}
}
Expected Outcome:
The second step (MapState) needs information from the machine input (sample_name and sequencer), as well as what the Lambda function returns in populate_efs_result.Payload (run_id), therefore both need to be included in the event object for the Map state input. However, in my attempts so far, the input for the map state has been either the machine input or the Lambda output, not both.
I've tried changing the InputPath and ItemsPath parameters in the Map state definition, and have also tried including the following in the Map state definition, but none of these attempts worked: Parameters: {"new_run_id.$": "$.populate_efs_result.Payload"}.

A simple but not very elegant solution is to move your Lambda step inside the Map state. The advantage is that the Lambda response is then available within each iteration of the Map state (which you may need anyway if the response is specific to each iteration). The disadvantage is that the Lambda function is executed once per iteration; Lambda invocations are fast and cheap, but it is still not a perfect solution.
Another approach is to pass the execution input to the Lambda and extend the Lambda so that it adds the required data to each element of the run_info array. Then point the Map state at this modified array using its InputPath (or ItemsPath).
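A minimal sketch of the second approach, assuming the Lambda is written in Python and receives the execution input as its event (which is what "Payload.$": "$" in the question does). The helper determine_run_id is hypothetical and only stands in for whatever PopulateEFSLambda already does to produce RUN_1; the returned field names are assumptions too.

```python
import copy


def determine_run_id(event):
    # Hypothetical placeholder: the real function would derive the run ID
    # while populating EFS. "RUN_1" mirrors the shortened output in the question.
    return "RUN_1"


def lambda_handler(event, context):
    """Return a copy of run_info with run_id added to every sample."""
    run_id = determine_run_id(event)
    # Copy so the incoming event is left untouched.
    run_info = copy.deepcopy(event["metadata"]["run_info"])
    for sample in run_info:
        sample["run_id"] = run_id
    return {"run_id": run_id, "run_info": run_info}


if __name__ == "__main__":
    example_event = {
        "metadata": {
            "run_info": [
                {"sample_name": "SAMPLE_X", "sequencer": "Nextseq"},
                {"sample_name": "SAMPLE_Y", "sequencer": "Nextseq"},
            ]
        }
    }
    print(lambda_handler(example_event, None))
```

With the ResultPath from the question left at "$.populate_efs_result", the enriched array would sit at $.populate_efs_result.Payload.run_info, so that is the path the Map state would need to iterate over. Each iteration would then see sample_name, sequencer and run_id together.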

Related

How to ensure a Step Function executes Parameterized Query properly in AWS?

I'm currently trying to execute an Athena Query during a State Machine. The query itself needs a date variable to use in several WHERE statements so I'm using a Lambda to generate it.
When I run EXECUTE prepared-statement USING 'date', 'date', 'date'; directly in Athena, I get the results I expect so I know the query is formed correctly, but when I try to do it in the state machine, it gives me the following error:
SYNTAX_ERROR: line 19:37: Unexpected parameters (integer) for function date. Expected: date(varchar(x)) , date(timestamp) , date(timestamp with time zone)
So my best guess is that I'm somehow not passing the execution parameters correctly.
The Lambda that calculates the date returns it as a string in the format %Y-%m-%d, and in the State Machine I make sure to pass it to the output of every State that needs it. Then I get a named query and create a prepared statement from it within the state machine. I then use that prepared statement to run an EXECUTE query that requires the date multiple times, so I use an intrinsic function to turn it into an array:
{
"StartAt": "calculate_date",
"States": {
"calculate_date": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": {
"Payload.$": "$",
"FunctionName": "arn:aws:lambda:::function:calculate_date:$LATEST"
},
"Retry": [
{
"ErrorEquals": [
"Lambda.ServiceException",
"Lambda.AWSLambdaException",
"Lambda.SdkClientException",
"Lambda.TooManyRequestsException"
],
"IntervalSeconds": 2,
"MaxAttempts": 6,
"BackoffRate": 2
}
],
"Next": "get_query",
"ResultSelector": {
"ExecDate.$": "$.Payload.body.exec_date"
}
},
"get_query": {
"Type": "Task",
"Next": "prepare_query",
"Parameters": {
"NamedQueryId": "abc123"
},
"Resource": "arn:aws:states:::aws-sdk:athena:getNamedQuery",
"ResultPath": "$.Payload"
},
"prepare_query": {
"Type": "Task",
"Next": "execute_query",
"Parameters": {
"QueryStatement.$": "$.Payload.NamedQuery.QueryString",
"StatementName": "PreparedStatementName",
"WorkGroup": "athena-workgroup"
},
"Resource": "arn:aws:states:::aws-sdk:athena:createPreparedStatement",
"ResultPath": "$.Payload"
},
"execute_query": {
"Type": "Task",
"Resource": "arn:aws:states:::athena:startQueryExecution",
"Parameters": {
"ExecutionParameters.$": "States.Array($.ExecDate, $.ExecDate, $.ExecDate)",
"QueryExecutionContext": {
"Catalog": "catalog_name",
"Database": "database_name"
},
"QueryString": "EXECUTE PreparedStatementName",
"WorkGroup": "athena-workgroup",
"ResultConfiguration": {
"OutputLocation": "s3://bucket"
}
},
"End": true
}
}
}
The execution of the State Machine completes successfully, but the query doesn't export the results to the bucket, and when I click on the "Athena query execution" link in the list of events, it takes me to the Athena editor page, where I see the error listed above.
https://i.stack.imgur.com/pxxOm.png
Am I generating the ExecutionParameters wrong? Does the createPreparedStatement resource need a different syntax for the query parameters? I'm truly at a loss here, so any help is greatly appreciated.
I just solved my problem, and I'm posting this answer in case anyone comes across the same issue.
Apparently, the ExecutionParameters parameter in an Athena StartQueryExecution state does not respect the variable type of a JSONPath variable, so you need to add the single quotes manually when forming your array. I solved this by adding a secondary output from the Lambda, with the date wrapped in single quotes, so that when I build the array using intrinsic functions and pass it to the query execution, it forms the query string correctly.
Transform the output from the lambda like this:
"ExecDateQuery.$": "States.Format('\\'{}\\'', $.Payload.body.exec_date)"
And use ExecDateQuery in the array intrinsic function, instead of ExecDate.
The final State Machine would look like this:
{
"StartAt": "calculate_date",
"States": {
"calculate_date": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters": {
"Payload.$": "$",
"FunctionName": "arn:aws:lambda:::function:calculate_date:$LATEST"
},
"Retry": [
{
"ErrorEquals": [
"Lambda.ServiceException",
"Lambda.AWSLambdaException",
"Lambda.SdkClientException",
"Lambda.TooManyRequestsException"
],
"IntervalSeconds": 2,
"MaxAttempts": 6,
"BackoffRate": 2
}
],
"Next": "get_query",
"ResultSelector": {
"ExecDate.$": "$.Payload.body.exec_date",
"ExecDateQuery.$": "States.Format('\\'{}\\'', $.Payload.body.exec_date)"
}
},
"get_query": {
"Type": "Task",
"Next": "prepare_query",
"Parameters": {
"NamedQueryId": "abc123"
},
"Resource": "arn:aws:states:::aws-sdk:athena:getNamedQuery",
"ResultPath": "$.Payload"
},
"prepare_query": {
"Type": "Task",
"Next": "execute_query",
"Parameters": {
"QueryStatement.$": "$.Payload.NamedQuery.QueryString",
"StatementName": "PreparedStatementName",
"WorkGroup": "athena-workgroup"
},
"Resource": "arn:aws:states:::aws-sdk:athena:createPreparedStatement",
"ResultPath": "$.Payload"
},
"execute_query": {
"Type": "Task",
"Resource": "arn:aws:states:::athena:startQueryExecution",
"Parameters": {
"ExecutionParameters.$": "States.Array($.ExecDateQuery, $.ExecDateQuery, $.ExecDateQuery)",
"QueryExecutionContext": {
"Catalog": "catalog_name",
"Database": "database_name"
},
"QueryString": "EXECUTE PreparedStatementName",
"WorkGroup": "athena-workgroup",
"ResultConfiguration": {
"OutputLocation": "s3://bucket"
}
},
"End": true
}
}
}
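As an alternative to States.Format, the quoting could also be done inside the Lambda itself. This is only a sketch: it assumes a Python Lambda that returns today's date (the real calculation may differ), and the field name exec_date_quoted is made up for illustration.

```python
from datetime import date


def sql_quote(value):
    """Wrap a string in single quotes, doubling any embedded single quote."""
    return "'" + value.replace("'", "''") + "'"


def lambda_handler(event, context):
    # Assumption: the date of interest is "today"; replace with the real logic.
    exec_date = date.today().strftime("%Y-%m-%d")
    return {
        "body": {
            "exec_date": exec_date,                    # plain value, e.g. for logging
            "exec_date_quoted": sql_quote(exec_date),  # value for ExecutionParameters
        }
    }


if __name__ == "__main__":
    print(lambda_handler({}, None))
```

The ResultSelector could then read "ExecDateQuery.$": "$.Payload.body.exec_date_quoted" instead of calling States.Format, and the rest of the state machine would stay the same.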

How to create a step function that iterates over a list of strings

I'm trying to create my first AWS step function. It needs to iterate over a list of strings and pass each value to a Lambda function. I've gotten started, but I don't understand how to reference the current element in the list so I can pass it as a parameter to the Lambda function.
{
"StartAt": "Handle Loaders",
"States": {
"Handle Loaders": {
"Type": "Map",
"ItemsPath": "$.InputData",
"Iterator": {
"StartAt": "Execute Loader",
"Execute Loader": {
"Type": "Task",
"Resource": !Ref DataLoader,
"Parameters": {
"SeedData": <same for every iteration>,
"Loader": <the current string iterated over>
},
"End": true
}
},
"End": true
}
}
}
I imagine the input would look something like this:
{
"InputData": {
"SeedData": <somedata_values>,
"Loaders": ["Makes", "Models", "Styles"]
}
}
I think this is the definition that you are looking for:
{
"StartAt": "Handle Loaders",
"States": {
"Handle Loaders": {
"Type": "Map",
"ItemsPath": "$.InputData.Loaders",
"Parameters": {
"loader.$": "$$.Map.Item.Value"
},
"Iterator": {
"StartAt": "Execute Loader",
"States": {
"Execute Loader": {
"Type": "Task",
"Resource": "<arn of your function>",
"Parameters": {
"SeedData.$": "$$.Execution.Input.InputData.SeedData",
"Loader.$": "$.loader"
},
"End": true
}
}
},
"End": true
}
}
}
Here are some useful links that will help you understand it:
https://docs.aws.amazon.com/step-functions/latest/dg/input-output-contextobject.html
https://docs.aws.amazon.com/step-functions/latest/dg/concepts-input-output-filtering.html
https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-map-state.html
Also, given that you are just starting with Step Functions, I'd recommend two useful tools in the console:
Data Flow Simulator
Workflow Studio
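To see what the definition above hands to the Lambda on each iteration, the data flow can be emulated in plain Python. This is not a Step Functions API, only a sketch of the three paths involved, and the SeedData value in the example is invented.

```python
def iteration_payloads(execution_input):
    """Emulate the payload each 'Execute Loader' task would receive."""
    # "SeedData.$": "$$.Execution.Input.InputData.SeedData"
    seed_data = execution_input["InputData"]["SeedData"]
    # "ItemsPath": "$.InputData.Loaders"
    loaders = execution_input["InputData"]["Loaders"]

    payloads = []
    for item in loaders:
        # Map "Parameters": {"loader.$": "$$.Map.Item.Value"}
        iteration_input = {"loader": item}
        # Task "Parameters": SeedData from the execution input, Loader from $.loader
        payloads.append({"SeedData": seed_data, "Loader": iteration_input["loader"]})
    return payloads


if __name__ == "__main__":
    example = {
        "InputData": {
            "SeedData": {"year": 2020},  # made-up seed data
            "Loaders": ["Makes", "Models", "Styles"],
        }
    }
    for payload in iteration_payloads(example):
        print(payload)
```

Each iteration gets the same SeedData and exactly one of the strings, which is the behaviour the question asks for.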
You're missing the part of the definition that describes the states inside the Map:
"Iterator": {
"StartAt": "Prepare Test Data",
"States": {
"Prepare Test Data": {
"Type": "Task",
"Resource": "arn:aws:states:::lambda:invoke",
"OutputPath": "$.TestData",
"Parameters": {
"Payload.$": "$",
"FunctionName": "arn:aws:your:function"
},
"Next": "Call_Test_System"
},
"Call_Test_System": {
"Type": "Task",
... etc.
Specifically, note that after the StartAt key there is a States key: it holds the mini definition of the tasks/states that run within the Map iterator.
The OutputPath defined on a state (in this case, the TestData key at the root $ of the state's JSON output) determines what is passed on; the next state receives it as its input and can narrow it further with its own InputPath.
The Map state also has an OutputPath; the Map result is a list containing the output of every iteration.

Pass Step Function variable to AWS Glue Job Not Working

I'm trying to pass an AWS Step Function variable to a Glue Job parameter, similar to this:
aws-passing-job-parameters-value-to-glue-job-from-step-function
However, this is not working for me. The glue job error message indicates that it's getting the passed variable name--not the actual value of the variable. Here's my Step Function code:
{
"Comment": "Converts CSV files to parquet for a date range.",
"StartAt": "ConfigureCount",
"States": {
"ConfigureCount": {
"Type": "Pass",
"Result": {
"start": 201601,
"end": 201602,
"index": 201601
},
"ResultPath": "$.iterator",
"Next": "Iterator"
},
"Iterator": {
"Type": "Task",
"Resource": "arn:aws:lambda:eu-west-1:123456789:function:date-iterator",
"ResultPath": "$.iterator",
"Next": "IsCountReached"
},
"IsCountReached": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.iterator.continue",
"BooleanEquals": true,
"Next": "ConvertToParquet"
}
],
"OutputPath": "$.iterator",
"Default": "Done"
},
"ConvertToParquet": {
"Comment": "Your application logic, to run a specific number of times",
"Type": "Task",
"Resource": "arn:aws:states:::glue:startJobRun.sync",
"Parameters": {
"JobName": "convert-to-parquet",
"Arguments": {
"--DATE_RANGE": "$.iterator.index"
}
},
"ResultPath": "$.iterator.index",
"Next": "Iterator"
},
"Done": {
"Type": "Pass",
"End": true
}
}
}
The "Iterator" step calls a Lambda called "date-iterator", which returns JSON similar to the following:
{
"start": "201601",
"end": "201602",
"index": "201601"
}
This was based on this article, so that I can loop through values: Iterating a Loop Using Lambda
My Step Function fails, saying "$.iterator.index" is not a valid date.
How do I pass this value, and not the variable name?
From the Amazon States Language specification (https://states-language.net/spec.html):
If any field within the Payload Template (however deeply nested) has a name ending with the characters ".$", its value is transformed according to rules below and the field is renamed to strip the ".$" suffix.
Based on that, adding .$ to the field name should solve your issue:
"Parameters": {
"JobName": "convert-to-parquet",
"Arguments": {
"--DATE_RANGE.$": "$.iterator.index"
}
},
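The rule quoted above can be illustrated with a deliberately simplified emulation in Python. It only handles nested objects and plain paths such as $.a.b; the real service also supports things like intrinsic functions and the context object, which this sketch ignores.

```python
def resolve_path(path, data):
    """Resolve a simple path such as "$.a.b" against a dict (no filters or indexes)."""
    if path == "$":
        return data
    current = data
    for key in path[2:].split("."):  # drop the leading "$."
        current = current[key]
    return current


def apply_template(template, state_input):
    """Emulate the ".$" rule: resolve suffixed fields, pass everything else through."""
    result = {}
    for name, value in template.items():
        if isinstance(value, dict):
            result[name] = apply_template(value, state_input)
        elif name.endswith(".$"):
            # Field is renamed to strip ".$" and its value is looked up in the input.
            result[name[:-2]] = resolve_path(value, state_input)
        else:
            # No suffix: the value is passed through as a literal string.
            result[name] = value
    return result


if __name__ == "__main__":
    state_input = {"iterator": {"index": "201601"}}
    print(apply_template({"Arguments": {"--DATE_RANGE": "$.iterator.index"}}, state_input))
    print(apply_template({"Arguments": {"--DATE_RANGE.$": "$.iterator.index"}}, state_input))
```

Without the suffix, the Glue job receives the literal text $.iterator.index, which matches the error in the question; with the suffix, it receives the value stored at that path.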

AWS step function: choosing a Resource dynamically

I would like to dynamically choose an AWS Lambda worker based on the result coming from a previous step. Something like {"Resource": "$.worker_arn"}.
"RunWorkers": {
"Type": "Map",
"MaxConcurrency": 0,
"InputPath": "$.output",
"ResultPath": "$.raw_result",
"Iterator": {
"StartAt": "CallWorkerLambda",
"States": {
"CallWorkerLambda": {
"Type": "Task",
"Resource": "$$.worker_arn",
"End": true
}
}
},
"Next": "Aggregate"
},
The input from the previous step is expected to look as follows:
[{"worker_arn":..., "output":1}, {"worker_arn":..., "output":1}, ...],
where worker_arn is the same among all workers.
When I write a pipeline like this, the linter complains that it expects an ARN.
Are there any options better than wrapping my worker lambda into another lambda?
Using "Resource": "arn:aws:states:::lambda:invoke", you can set the "FunctionName" field in "Parameters" at runtime using a Path.
{
"StartAt":"CallLambda",
"States":{
"CallLambda":{
"Type":"Task",
"Resource": "arn:aws:states:::lambda:invoke",
"Parameters":{
"FunctionName.$":"$.MyFunction",
"Payload.$": "$"
},
"End":true
}
}
}
https://docs.aws.amazon.com/step-functions/latest/dg/connect-lambda.html
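Applied to the RunWorkers Map state from the question, this might look as follows. The sketch builds the state as a Python dict and prints it as JSON; it assumes every array item carries a worker_arn field, as described in the question. The OutputPath line is an optional addition: it unwraps the function's return value from the invoke response, so the results have the same shape as with a direct Lambda ARN.

```python
import json


def build_run_workers_state():
    """Build the Map state with the Lambda ARN taken from each item at runtime."""
    call_worker = {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {
            "FunctionName.$": "$.worker_arn",  # chosen per item at runtime
            "Payload.$": "$",                  # the whole item becomes the Lambda event
        },
        "OutputPath": "$.Payload",             # optional: keep only the function's result
        "End": True,
    }
    return {
        "Type": "Map",
        "MaxConcurrency": 0,
        "InputPath": "$.output",
        "ResultPath": "$.raw_result",
        "Iterator": {
            "StartAt": "CallWorkerLambda",
            "States": {"CallWorkerLambda": call_worker},
        },
        "Next": "Aggregate",
    }


if __name__ == "__main__":
    print(json.dumps({"RunWorkers": build_run_workers_state()}, indent=2))
```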

Step function dynamic execution

I have two Lambda functions. The first one runs and creates a list of specific times at which function 2 should run, e.g. {"2020-09-04T01:59:00Z","2020-09-04T02:59:00Z","2020-09-04T03:59:00Z","2020-09-04T04:59:00Z"}
I have only managed to build a state machine that handles a single input, for example:
{
"Comment": "Fixtures Wait state",
"StartAt": "FirstState",
"States": {
"FirstState": {
"Type": "Task",
"Resource": "arn:aws:lambda:${aws_region}:${aws_account_id}:function:hello",
"ResultPath": "$.first",
"Next": "wait_using_timestamp_path"
},
"wait_using_timestamp_path": {
"Type": "Wait",
"TimestampPath": "$.expirydate",
"Next": "FinalState"
},
"FinalState": {
"Type": "Task",
"Resource": "arn:aws:lambda:${aws_region}:${aws_account_id}:function:hello",
"End": true
}
}
}
Is it possible to have it process all the inputs, or am I thinking about this the wrong way?
The Map state can be used to execute an iterator over an array of inputs. After the first Lambda Task, put the steps you want to execute for each item in the output array in a Map state.
https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-map-state.html
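A rough sketch of how that could look for the state machine in the question, built as a Python dict and printed as JSON. It assumes the first Lambda returns a JSON array of ISO 8601 timestamps, which ResultPath stores under $.first. The state name ForEachTimestamp, the run_at field and the two ARN arguments are placeholders introduced for this example.

```python
import json


def build_definition(first_function_arn, second_function_arn):
    """Wrap the Wait and final Task in a Map that runs once per timestamp."""
    per_timestamp = {
        "StartAt": "wait_using_timestamp_path",
        "States": {
            "wait_using_timestamp_path": {
                "Type": "Wait",
                "TimestampPath": "$.run_at",  # set by the Map Parameters below
                "Next": "FinalState",
            },
            "FinalState": {
                "Type": "Task",
                "Resource": second_function_arn,
                "End": True,
            },
        },
    }
    return {
        "Comment": "Fixtures Wait state",
        "StartAt": "FirstState",
        "States": {
            "FirstState": {
                "Type": "Task",
                "Resource": first_function_arn,
                "ResultPath": "$.first",
                "Next": "ForEachTimestamp",
            },
            "ForEachTimestamp": {
                "Type": "Map",
                "ItemsPath": "$.first",  # the array returned by the first Lambda
                "Parameters": {"run_at.$": "$$.Map.Item.Value"},
                "Iterator": per_timestamp,
                "End": True,
            },
        },
    }


if __name__ == "__main__":
    # Placeholder ARNs for illustration only.
    print(json.dumps(build_definition("<arn of function 1>", "<arn of function 2>"), indent=2))
```

Map iterations can run concurrently, so each one should be able to wait for its own timestamp independently before calling the second function.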

Resources