Redrive data in Kinesis stream to Lambda function - aws-lambda

I have a very simple Lambda right now that is triggered by Kinesis, and it's all hooked up and working fine... but I want to work through the case where I find a bug in my Lambda code and need to re-run data that is still available in the stream (my stream is set up to retain data for 7 days).
Is there an easy way to do this? I was hoping there would be something in the console to "reset" the sequence position for the Lambda, but I couldn't find that.
One method I've tested is to delete the original trigger and add a new one with the position set to TRIM_HORIZON, but I'm wondering if there's an easier way to do this (my original trigger was set up with LATEST).

If you have to reprocess all the data from the Kinesis stream, there is no way around recreating the trigger.
The starting position can't be updated on an existing trigger; only certain properties can be changed with the UpdateEventSourceMapping API.
Changing the starting position of an event source mapping would affect the checkpoints committed against the Kinesis stream, since they determine which records have been processed and where the current position is. Whenever a new trigger is created, checkpointing starts over from its configured starting position.
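For reference, a minimal boto3 sketch of the delete-and-recreate approach; the function name and stream ARN below are placeholders for your own resources:

```python
import boto3

lambda_client = boto3.client("lambda")

FUNCTION_NAME = "my-kinesis-consumer"  # placeholder function name
STREAM_ARN = "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream"  # placeholder ARN

# Find and delete the existing trigger (event source mapping).
mappings = lambda_client.list_event_source_mappings(
    FunctionName=FUNCTION_NAME, EventSourceArn=STREAM_ARN
)["EventSourceMappings"]
for m in mappings:
    lambda_client.delete_event_source_mapping(UUID=m["UUID"])

# Recreate it starting from the oldest record still retained in the stream.
# (Deletion can take a few seconds to complete before this call succeeds.)
lambda_client.create_event_source_mapping(
    FunctionName=FUNCTION_NAME,
    EventSourceArn=STREAM_ARN,
    StartingPosition="TRIM_HORIZON",
    BatchSize=100,
)
```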

Related

How to place dynamo db records in a SQS queue to trigger lambda

I have a Lambda that scans through the items in a DynamoDB table and does some post-processing with them. This works fine for now because of the small number of entries in the table, but the table will soon grow and the 15-minute timeout will be reached.
I am considering using SQS, but I'm not sure how I can place records from the table into SQS, which would then trigger the Lambda concurrently.
Is this a feasible solution? Or should I just create threads within the Lambda and process the items that way? Again, I'm unsure whether this would count towards the 15-minute limit.
Any suggestions will be appreciated, thanks.
DynamoDB Streams is a perfect fit for this use case: every item added or modified enters the stream and in turn triggers your Lambda function that does the processing, though of course this strongly depends on your particular use case.
If, for example, you require all the data from the table, you can build useful aggregations and keep those aggregates in a single item. Then, instead of having to Scan the table to get all the items, you just do a single GetItem request, which already holds your aggregate data.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html
As @LeeHannigan says, use DynamoDB Streams to capture your table's CRUD events. Streams has traditionally had two targets for consuming these change events: Lambda and Kinesis.
But what about an SQS destination? EventBridge Pipes adds EventBridge as another way to consume DynamoDB streams. EB Pipes, a newer integration service, would have the DynamoDB stream as its source and SQS as its target.
The flow would be DynamoDB Streams -> EB Pipes -> SQS -> Lambda.
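For illustration, a hedged boto3 sketch of creating such a pipe with the "pipes" client; all ARNs and names below are placeholders, and the role must be allowed to read the stream and send to the queue:

```python
import boto3

pipes = boto3.client("pipes")

# All ARNs below are placeholders for your own resources.
pipes.create_pipe(
    Name="ddb-stream-to-sqs",
    RoleArn="arn:aws:iam::123456789012:role/my-pipes-role",
    Source="arn:aws:dynamodb:us-east-1:123456789012:table/MyTable/stream/2023-01-01T00:00:00.000",
    SourceParameters={
        "DynamoDBStreamParameters": {
            "StartingPosition": "LATEST",  # or TRIM_HORIZON to include retained records
            "BatchSize": 10,
        }
    },
    Target="arn:aws:sqs:us-east-1:123456789012:my-queue",
)
```

The SQS queue can then be wired to the Lambda with a regular event source mapping.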

How to avoid concurrent requests to a lambda

I have a ReportGeneration Lambda that takes requests from clients and adds the following entries to a DDB table.
Customer ID <hash key>
ReportGenerationRequestID(UUID) <sort key>
ExecutionStartTime
ReportExecutionStatus <workflow status>
I have enabled a DDB stream trigger on this table, and creating an entry in the table kicks off the report generation workflow. This is a multi-step workflow that takes a while to complete.
Here ReportExecutionStatus is the status of the report processing workflow.
I am supposed to maintain the history of all report generation requests that a customer has initiated.
Now what I am trying to do is avoid concurrent processing requests by the same customer, so if a report for a customer is already being generated, don't create another record in DDB.
Option considered:
Query DDB for the customer ID (consistent read):
- From the list, see if any entry is either InProgress or Scheduled
- If not, create a new one (consistent write)
- Otherwise, return the already existing one
Issue: If the customer clicks twice within a split second to generate a report, two Lambdas can be triggered, causing two entries in DDB, and two parallel workflows can be initiated, something that I don't want.
Can someone recommend the best approach to ensure that there are no concurrent executions (two workflows) for the same report from the same customer?
In short when one execution is in progress another one should not start.
You can use a ConditionExpression to only create the entry if it doesn't already exist. If you need to check different items, then you can use DynamoDB transactions to check whether another item already exists and, if not, create your item.
Those would be the ways to do it with DynamoDB itself, giving you stronger consistency.
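As a rough illustration of the ConditionExpression approach, here is a minimal boto3 sketch; the table name, key names, and the "lock item" design are assumptions for the example, not the asker's actual schema:

```python
import boto3
from botocore.exceptions import ClientError

ddb = boto3.client("dynamodb")

def start_report_job(customer_id, request_id):
    try:
        # Hypothetical "active jobs" lock table keyed only by CustomerID:
        # the conditional write succeeds for at most one concurrent request.
        ddb.put_item(
            TableName="ReportJobLocks",  # assumed table name
            Item={
                "CustomerID": {"S": customer_id},
                "RequestID": {"S": request_id},
                "Status": {"S": "InProgress"},
            },
            ConditionExpression="attribute_not_exists(CustomerID)",
        )
        return True  # lock acquired, safe to start the workflow
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another report is already running for this customer
        raise
```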
Another option would be to use SQS FIFO queues. You can group them by the customer ID, then you wouldn't have concurrent processing of messages for the same customer. Additionally with this SQS solution you get all the advantages of using SQS - like automated retry mechanisms or a dead letter queue.
Limiting the number of concurrent Lambda executions is not possible as far as I know. That is the whole point of AWS Lambda, to easily scale and run multiple Lambdas concurrently.
That said, there is probably a better solution for your problem using a DynamoDB feature called "strongly consistent reads".
By default, reads from DynamoDB (if you use the AWS SDK) are eventually consistent, causing the behaviour you observed: two writes to the same table are made, but your Lambda was only able to notice one of them.
For strongly consistent reads, the documentation states:
When you request a strongly consistent read, DynamoDB returns a response with the most up-to-date data, reflecting the updates from all prior write operations that were successful.
So your Lambda needs to do a strongly consistent read against your table to check whether the customer already has a job running. If there is already a job running, the Lambda does not create a new one.
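A minimal boto3 sketch of such a check, assuming a table keyed on CustomerID with a ReportExecutionStatus attribute as described in the question (the table name is a placeholder):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("ReportGenerationRequests")  # assumed table name

def has_active_report(customer_id):
    # ConsistentRead forces a strongly consistent read instead of the
    # default eventually consistent one.
    resp = table.query(
        KeyConditionExpression=Key("CustomerID").eq(customer_id),
        ConsistentRead=True,
    )
    return any(
        item.get("ReportExecutionStatus") in ("InProgress", "Scheduled")
        for item in resp["Items"]
    )
```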

Clear a DynamoDb trigger iterator

Is it possible to "purge"/"clear" the iterator of a DynamoDB table's trigger?
Context:
A Lambda is processing updates on a given table using a trigger.
Badly formatted records are inserted into the table and the Lambda is unable to process them, stalling the iterator.
Some time later, a fix is issued so that correctly formatted records are inserted into the table. So we want to "fast forward" the processing of updates and skip the old ones.
I suppose deleting/recreating the trigger would do that. Is there a "better" way?
Old question, but this can help somebody else:
As stated in the comments, I don't think it's possible to simply purge the queue.
However, when you re-create it, you can choose the "starting position" between:
"Latest" (will skip all previously stored messages)
"Trim horizon" (will process all the available messages)
Re-creating the trigger is pretty easy; that's what I did in a similar case recently.

Export existing DynamoDB items to Lambda Function

Is there any AWS managed solution which would allow be to perform what is essentially a data migration using DynamoDB as the source and a Lambda function as the sink?
I’m setting up a Lambda to process DynamoDB streams, and I’d like to be able to use that same Lambda to process all the existing items as well rather than having to rewrite the same logic in a Spark or Hive job for AWS Glue, Data Pipeline, or Batch. (I’m okay with the input to the Lambda being different than a DynamoDB stream record—I can handle that in my Lambda—I’m just trying to avoid re-implementing my business logic elsewhere.)
I know that I could build my own setup to run a full table scan, but I’m also trying to avoid any undifferentiated heavy lifting.
Edit: One possibility is to update all of the items in DynamoDB so that it triggers a DynamoDB Stream event. However, my question still remains—is there an AWS managed service that can do this for me?
You can create a new Kinesis data stream and add it as a trigger to your existing Lambda function. Then create a new, simple Lambda function that scans the entire table and puts the records into that stream. That's it.
Your business logic stays in your original function. You are just sending the existing data from DynamoDB to it via Kinesis.
Ref: https://aws.amazon.com/blogs/compute/indexing-amazon-dynamodb-content-with-amazon-elasticsearch-service-using-aws-lambda/
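A rough sketch of what that backfill function could look like in boto3; the table name, hash key attribute, and stream name are placeholders, and the records put on the stream keep DynamoDB's typed-JSON shape rather than the stream-record shape:

```python
import json
import boto3

dynamodb = boto3.client("dynamodb")
kinesis = boto3.client("kinesis")

TABLE_NAME = "my-table"          # placeholder table name
STREAM_NAME = "backfill-stream"  # placeholder: the stream wired to the existing Lambda
HASH_KEY = "Id"                  # placeholder string hash key attribute

def handler(event, context):
    paginator = dynamodb.get_paginator("scan")
    for page in paginator.paginate(TableName=TABLE_NAME):
        records = [
            {
                "Data": json.dumps(item).encode("utf-8"),
                "PartitionKey": item[HASH_KEY]["S"],
            }
            for item in page["Items"]
        ]
        # PutRecords accepts at most 500 records per call.
        for i in range(0, len(records), 500):
            kinesis.put_records(StreamName=STREAM_NAME, Records=records[i : i + 500])
```

For a very large table this scan may exceed the 15-minute Lambda limit, in which case running the same script outside Lambda works just as well.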

How to add pre-existing data from DynamoDB to Elasticsearch?

I set up Elasticsearch Service and a DynamoDB stream as described in this blog post.
Now I need to add pre-existing data from DynamoDB to Elasticsearch.
I saw the "Indexing pre-existing content" part of the article, but I don't know what to do with that Python code or where to execute it.
What is the best option in this case to add the pre-existing data?
Populating existing items into Elasticsearch is not straightforward, since DynamoDB Streams captures item changes, not pre-existing records.
Here are a few approaches, with pros and cons.
1. Scan all the existing items from DynamoDB and send them to Elasticsearch
We can scan all the existing items and run Python code hosted on an EC2 machine to send the data to ES.
Pros:
a. Simple solution, nothing much required.
Cons:
a. Cannot be run in a Lambda function, since the job may time out if the number of records is too large.
b. This approach is a one-time thing and cannot be used for incremental changes (say we want to keep updating ES as the DynamoDB data changes).
2. Use DynamoDB Streams
We can enable DynamoDB Streams and build the pipeline as explained here.
Now we can update some flag on the existing items so that all the records flow through the pipeline and the data goes to ES (a minimal sketch of this is shown after the list of approaches).
Pros:
a. The pipeline can be used for incremental DynamoDB changes.
b. No code duplication or one-time effort: every time we need to update an item in ES, we update the item and it gets indexed in ES.
c. No redundant, untested, one-time code. (Maintaining such code is a huge issue in the software world.)
Cons:
a. Changing prod data can be dangerous and may not be allowed, depending on the use case.
3. A slight modification of the above approach
Instead of changing items in the prod table, we can create a temporary table and enable a stream on it, utilizing the pipeline mentioned in the 2nd approach. Then we copy items from the prod table to the temporary table; the data flows through the existing pipeline and gets indexed in ES.
Pros:
a. No prod data change is required, and this pipeline can be used for incremental changes as well.
b. Same as approach 2.
Cons:
a. Copying data from one table to another may take a lot of time, depending on the data size.
b. Copying data from one table to another is a one-time script and hence has maintainability issues.
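A minimal boto3 sketch of the "update a flag" idea from approach 2; the table name, hash key attribute, and flag name are assumptions, and a simple (non-composite) primary key is assumed:

```python
import time
import boto3

table = boto3.resource("dynamodb").Table("my-table")  # assumed table name
HASH_KEY = "Id"                                       # assumed hash key attribute

# Touch every existing item so the change flows through the stream pipeline
# and gets indexed in ES by the existing Lambda.
scan_kwargs = {"ProjectionExpression": HASH_KEY}
while True:
    page = table.scan(**scan_kwargs)
    for item in page["Items"]:
        table.update_item(
            Key={HASH_KEY: item[HASH_KEY]},
            UpdateExpression="SET reindexed_at = :t",  # hypothetical flag attribute
            ExpressionAttributeValues={":t": int(time.time())},
        )
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```

The same script works against a temporary table for approach 3; only the table name changes.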
Feel free to edit or suggest other approaches in the comments.
This post describes how to add pre-existing data from DynamoDB to Elasticsearch.