Reading AWS DynamoDB Stream - aws-lambda

I want to do an incremental DynamoDB backup to S3 using DynamoDB Streams. I have a Lambda that reads the DynamoDB stream and writes files to S3. To mark shards that have already been read, I log the ExclusiveStartShardId to a configuration file.
What I do is the following (a rough code sketch follows these steps):
Describe the stream (using the logged ExclusiveStartShardId)
Get stream's shards
For all shards that are CLOSED (i.e. they have an EndingSequenceNumber) I do the following:
Get a shard iterator for that shard (shardIteratorType: 'TRIM_HORIZON')
Iterate through the shard and fetch records until NextShardIterator becomes null
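Roughly, in code (a simplified boto3 sketch; the stream ARN, the config-file handling and the S3 write are stubbed placeholders):

import boto3

streams = boto3.client('dynamodbstreams')

STREAM_ARN = 'arn:aws:dynamodb:region:account-id:table/MyTable/stream/label'  # placeholder
last_shard_id = read_shard_id_from_config()  # stub: the logged ExclusiveStartShardId

description = streams.describe_stream(
    StreamArn=STREAM_ARN,
    ExclusiveStartShardId=last_shard_id,
)['StreamDescription']

for shard in description['Shards']:
    # only CLOSED shards, i.e. those that have an EndingSequenceNumber
    if 'EndingSequenceNumber' not in shard['SequenceNumberRange']:
        continue

    iterator = streams.get_shard_iterator(
        StreamArn=STREAM_ARN,
        ShardId=shard['ShardId'],
        ShardIteratorType='TRIM_HORIZON',
    )['ShardIterator']

    while iterator is not None:
        response = streams.get_records(ShardIterator=iterator)
        write_records_to_s3(response['Records'])      # stub: the S3 write
        iterator = response.get('NextShardIterator')

    save_shard_id_to_config(shard['ShardId'])         # stub: remember the processed shard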
The problem here is that I only read closed shards, so in order to get new records I must wait an undetermined amount of time for the current shard to be closed.
It seems that the last shard is usually in the OPEN state (it has NO EndingSequenceNumber). If I remove the EndingSequenceNumber check from the pseudo code above, I end up with an infinite loop, because when I hit the last shard NextShardIterator is always present. I also cannot stop when zero items are fetched, because there can be "gaps" in the shard.
In this tutorial, numChanges is used to stop the infinite loop: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.LowLevel.Walkthrough.html#Streams.LowLevel.Walkthrough.Step5
What is the best approach in this situation?
I also found a similar question: Reading data from dynamodb streams. Unfortunately it does not answer my question.

Why not attach the DynamoDB stream as an event source for your Lambda function? Then Lambda will take care of polling the stream and calling your function when necessary. See this for details.
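For example, a minimal sketch (stream ARN, function name, and the S3 write are placeholders):

import boto3

lambda_client = boto3.client('lambda')

# One-time setup: attach the DynamoDB stream as an event source for the function.
lambda_client.create_event_source_mapping(
    EventSourceArn='arn:aws:dynamodb:region:account-id:table/MyTable/stream/label',
    FunctionName='my-backup-function',
    StartingPosition='TRIM_HORIZON',
    BatchSize=100,
)

# The function then receives batches of stream records; no shard iteration
# or checkpointing code is needed.
def handler(event, context):
    for record in event['Records']:
        # record['dynamodb'] carries Keys / NewImage / OldImage, depending on the stream view type
        write_record_to_s3(record)  # placeholder for the S3 write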

Related

Choosing between DynamoDB Streams vs. Kinesis Streams for IoT Sensor data

I have a fleet of 250 Wifi-enabled IoT sensors streaming weight data. Each device samples once per second. I am requesting help choosing between AWS DynamoDB Streams and AWS Kinesis Streams to store and process this data in real time. Here are some additional requirements:
I need to keep all raw data in a SQL-accessible table.
I also need to clean the raw stream data with Python's Pandas library to recognize device-level events based on weight changes (e.g. if the weight on sensor #1 increases, record it as "sensor #1 increased by x lbs at XX:XX PM"; if there is no change, do nothing).
I need that change-event data (interpreted from the raw data streams with the library) to be accessible in a real-time dashboard (e.g. device #1's weight just went to zero, prompting an employee to refill container #1).
Either DDB Streams or Kinesis Streams can support Lambda functions, which is what I'll use for the data cleaning, but I've read the documentation and comparison articles and can't distinguish which is best for my use case. Cost is not a key consideration. Thanks in advance!!
Unfortunately, I think you will need a few pieces of infrastructure for a full solution.
I think you could use Kinesis and Firehose to write to a database that stores the raw data in a way that can be queried with SQL.
For the data cleaning step, I think you will need to use a stateful stream processor like Flink or Bytewax; the transformed data can then be written to a real-time database or back to Kinesis so that it can be consumed in a dashboard.
DynamoDB Streams works with DynamoDB: it streams row changes to be picked up by downstream services like Lambda. You mentioned that you want the data stored in a SQL database; DynamoDB is a NoSQL database, so you can exclude that service.
I'm not sure why you want the data in a SQL database. If it is time-series data, you would probably store it in a time-series database like Timestream.
If you are using AWS IoT Core to send data over MQTT to AWS, you can forward those messages to a Kinesis Data Stream (or SQS). Then you can have a Lambda triggered on messages received in Kinesis; this Lambda can process the data and store it in the database you want.
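A minimal sketch of such a Lambda (the payload format and store_to_db are placeholders that depend on what your IoT rule forwards and which database you choose):

import base64
import json

def handler(event, context):
    # Kinesis delivers record payloads base64-encoded in the 'data' field.
    for record in event['Records']:
        payload = json.loads(base64.b64decode(record['kinesis']['data']))
        # e.g. payload = {"device_id": "sensor-1", "weight_lbs": 12.3, "ts": "..."}
        store_to_db(payload)  # placeholder: write to Timestream / RDS / etc.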

Kafka Elasticsearch Connector for bulk operations

I am using the Elasticsearch Sink Connector for operations (index, update, delete) on single records.
Elasticsearch also has a /_bulk endpoint which can be used to create, update, index, or delete multiple records at once. Documentation here.
Does the Elasticsearch Sink Connector support these types of bulk operations? If so, what is the configuration I need, or is there any sample code I can review?
Internally the Elasticsearch sink connector creates a bulk processor that is used to send records in batches. To control this processor you need to configure the following properties (a sample connector config follows the list):
batch.size: The number of records to process as a batch when writing to Elasticsearch.
max.in.flight.requests: The maximum number of indexing requests that can be in-flight to Elasticsearch before blocking further requests.
max.buffered.records: The maximum number of records each task will buffer before blocking acceptance of more records. This config can be used to limit the memory usage for each task.
linger.ms: Records that arrive in between request transmissions are batched into a single bulk indexing request, based on the batch.size configuration. Normally this only occurs under load when records arrive faster than they can be sent out. However it may be desirable to reduce the number of requests even under light load and benefit from bulk indexing. This setting helps accomplish that - when a pending batch is not full, rather than immediately sending it out the task will wait up to the given delay to allow other records to be added so that they can be batched into a single request.
flush.timeout.ms: The timeout in milliseconds to use for periodic flushing, and when waiting for buffer space to be made available by completed requests as records are added. If this timeout is exceeded the task will fail.
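For example, a connector config that tunes these properties could look like the following sketch, posted to the Kafka Connect REST API (host, topic, connection URL, and the numeric values are placeholders, not recommendations):

import requests

connector = {
    "name": "elasticsearch-sink",
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": "my-topic",
        "connection.url": "http://elasticsearch:9200",
        # bulk / batching tuning
        "batch.size": "2000",
        "max.in.flight.requests": "5",
        "max.buffered.records": "20000",
        "linger.ms": "1000",
        "flush.timeout.ms": "10000",
    },
}

response = requests.post("http://localhost:8083/connectors", json=connector)
response.raise_for_status()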

How to maintain order and avoid duplicate records when copying from DynamoDB Streams to Kinesis Data Streams?

I currently have a use case to copy data from DDB Streams to Kinesis Data Streams (just to increase the data retention period). DDB Streams offer only 24-hour retention, versus up to 7 days with Kinesis Data Streams.
So, I was thinking of a Lambda to copy the items from DDB Streams to Kinesis Data Streams, but I'm not sure whether ordering / duplicate records would come into play when I do the copy: I'm guessing "consumer" failures (i.e. Lambda failures) might result in out-of-order delivery of the stream records, and there might also be duplicate records in the Kinesis Data Stream? Is there an AWS-provided or community-built solution to handle this, or any workaround?
Also, the reason I opted for Kinesis Data Streams / DDB Streams is that I'm going to have a Lambda work off the stream, and I'd like the Lambdas to be triggered per shard.
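Roughly, the copy Lambda I had in mind looks like this sketch (stream name is a placeholder; it deliberately ignores the retry/duplicate concerns above):

import json
import boto3

kinesis = boto3.client('kinesis')

def handler(event, context):
    for record in event['Records']:
        # Use the item's DynamoDB key as the Kinesis partition key, so all
        # changes to the same item map to the same Kinesis shard.
        partition_key = json.dumps(record['dynamodb']['Keys'], sort_keys=True)
        kinesis.put_record(
            StreamName='my-extended-retention-stream',  # placeholder
            Data=json.dumps(record),
            PartitionKey=partition_key,
        )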
Since you have one producer, which is the DynamoDB stream, what you can do is have a Lambda function that consumes the stream and inserts the records into a FIFO SQS queue. You can then deduplicate events by following the post below:
https://dev.to/napicella/deduplicating-messages-exactly-once-processing-4o2
By the way, you can set the SQS retention period to 14 days, so you can use it instead of Kinesis if you're not looking for a real-time solution.
A sample use case:
https://fernandomc.com/posts/aws-first-in-first-out-queues/
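A rough sketch of the Lambda-to-FIFO-SQS idea (queue URL is a placeholder; the stream record's SequenceNumber doubles as the deduplication id):

import json
import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.region.amazonaws.com/account-id/ddb-changes.fifo'  # placeholder

def handler(event, context):
    for record in event['Records']:
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps(record),
            # Same item -> same message group, so FIFO preserves per-item order.
            MessageGroupId=json.dumps(record['dynamodb']['Keys'], sort_keys=True),
            # The SequenceNumber is unique per stream record, so retried
            # deliveries of the same record are dropped within the dedup window.
            MessageDeduplicationId=record['dynamodb']['SequenceNumber'],
        )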

Triggering lambda from kinesis data stream

I have a Kinesis data stream with 5 shards that triggers a Lambda with batch size 1. Looking at the logs, the Lambda function is triggered asynchronously with one record even though there are multiple records in the data stream distributed among different shards.
While looking at the documentation, it is mentioned that "Lambda reads records from the data stream and invokes your function synchronously with an event that contains stream records. Lambda reads records in batches and invokes your function to process records from the batch.".
I am trying to understand this behaviour of the Lambda but could not find the reason. Is there something I am missing?
Found out that each shard triggers the Lambda separately. So in this case, 5 Lambda invocations happen synchronously, one for each shard.
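This maps to the event source mapping configuration: by default Lambda processes at most one batch per shard at a time, and ParallelizationFactor can raise that. A boto3 sketch (stream ARN and function name are placeholders):

import boto3

lambda_client = boto3.client('lambda')

# One mapping per stream; Lambda polls every shard and, by default, invokes
# the function with at most one in-flight batch per shard.
lambda_client.create_event_source_mapping(
    EventSourceArn='arn:aws:kinesis:region:account-id:stream/my-stream',
    FunctionName='my-function',
    StartingPosition='LATEST',
    BatchSize=1,
    ParallelizationFactor=1,  # raise up to 10 for more concurrent batches per shard
)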

AWS Hive + Kinesis on EMR = Understanding check-pointing

I have an AWS Kinesis stream and I created an external table in Hive pointing at it. I then created a DynamoDB table for the checkpoints, and in my Hive query I set the following properties as described here:
set kinesis.checkpoint.enabled=true;
set kinesis.checkpoint.metastore.table.name=my_dynamodb_table;
set kinesis.checkpoint.metastore.hash.key.name=HashKey;
set kinesis.checkpoint.metastore.range.key.name=RangeKey;
set kinesis.checkpoint.logical.name=my_logical_name;
set kinesis.checkpoint.iteration.no=0;
I have the following questions:
Do I always have to start with iteration.no set to 0?
Does this always start from the beginning of the stream (the oldest Kinesis record, about to be evicted)?
Imagine I set up a cron to schedule the execution of the script, how do I retrieve the 'next' iteration number?
To re-execute the script on the same data, is it enough to re run the query with the same execution number?
If I execute a select * from kinesis_ext_table limit 100 with iteration.no=0 over and over, will I get different/weird results once the first Kinesis records start to be evicted?
Given the DynamoDB checkpoint entry:
{"startSeqNo":"1234",
"endSeqNo":"5678",
"closed":false}
What's the meaning of the closed field?
Are sequence numbers incremental, and is there a relation between the start and end (e.g. end - start = number of records read)?
I noticed that sometimes there is only the endSeqNo (no startSeqNo); how should I interpret that?
I know that it's a lot of questions but I could not find these answers on the documentation.
Check out the Kinesis documentation and the Kinesis Storage Handler Readme which contains answers to many of your questions.
Do I always have to start with iteration.no set to 0?
Yes, unless you are doing some advanced logic which requires you to skip a known or already processed part of the stream
Does this always start from the beginning of the stream (oldest Kinesis record about to be evicted)?
Yes
Imagine I set up a cron to schedule the execution of the script, how do I retrieve the 'next' iteration number?
This is handled by the Hive script, since it queries all data in the Kinesis stream at each run.
To re-execute the script on the same data, is it enough to re run the query with the same execution number?
As Kinesis data is a 24-hour time window, the data has (possibly) changed since your last query, so you probably would want to query all records again in the Hive job
If I execute a select * from kinesis_ext_table limit 100 with iteration.no=0 over and over, will I get different/weird results once the first Kinesis records start to be evicted?
Yes, you would expect the results to change as the stream changes
Given the DynamoDB checkpoint entry:
What's the meaning of the closed field?
Although this is an internal detail of the Kinesis Storage Handler, I believe this indicates whether the shard is a parent shard, i.e. whether it is open and accepting new data or closed and no longer accepting new data. If you have scaled your stream up or down, parent shards exist for 24 hours and contain all data from before you scaled; however, no new data will be inserted into these shards.
Are sequence number incremental and is there a relation between the start and end (EG: end - start = number of records read)?
That sequence numbers generally increase over time is the only guidance that Amazon provides on this.
I noticed that sometimes there is only the endSeqNo (no startSeqNo); how should I interpret that?
This means the shard is open and still accepting new data (not a parent shard)
