"Amazon Kinesis Data Streams" not completely realtime stream? - aws-lambda

I expect a Kinesis data stream (and downstream processing such as Kinesis Data Firehose) to deliver data sequentially in real time.
However, when I check the data sequence, it is not what I expected.
I capture audio data with pyaudio and send it to KDS in a while loop, so the records are produced in a continuous sequence.
However, when I check the CloudWatch logs of the Lambda function (which is triggered when a record arrives at KDS), the sequence numbers are not continuous.
Could someone tell me why this happens and how to make processing sequential?
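For reference, here is a minimal sketch of such a producer loop (assuming boto3; the stream name and partition key are hypothetical). Records are only ordered within a shard, so sending everything with one fixed partition key keeps the records on a single shard and preserves their sequence:

import boto3
import pyaudio

kinesis = boto3.client("kinesis")
pa = pyaudio.PyAudio()
mic = pa.open(format=pyaudio.paInt16, channels=1, rate=16000,
              input=True, frames_per_buffer=1024)

while True:
    chunk = mic.read(1024)
    # A fixed partition key routes every record to the same shard,
    # which is the only way KDS guarantees ordering.
    kinesis.put_record(
        StreamName="audio-stream",      # hypothetical stream name
        Data=chunk,
        PartitionKey="audio-source-1",  # hypothetical fixed key
    )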

Related

API waiting for a specific record on DynamoDb without polling

I am inheriting a workflow that has a reasonable amount of data stored in DynamoDb. The data is periodically refreshed by Lambdas calling third parties when needed. The lambdas are triggered by both SQS and DynamoDB streams and go through four or five steps before the data is updated.
I'm given the task to write an API that can forcibly update N items and return their status. The obvious way to do this without reinventing the wheel and honoring DRY is to trigger an event that spawns off a refresh for each item so that the lambdas can do their thing.
The trouble is that I'm not sure of the best pub/sub approach for being notified that the end state of each workflow has been reached. Do I read from an update/insert stream of DynamoDB to see if the records are updated? Do I create some sort of pub/sub model with Redis or SNS to listen for the end state of each Lambda being triggered?
Since I'm writing a REST API, timeouts, if there are failures along the line, are fine. But at the same time I want to make sure I can handle the following.
Be guaranteed that I can be notified that an update occurred for my targets after my call (in the case of multiple forced updates being called at once, I only care about the first one to arrive).
Not be bogged down by listening for record updates that are not contextually relevant to the API call in question.
Have an amortized time complexity of O(1).
In other words, in terms of the CAP theorem I care about C and A but not P (because a 502 isn't that big a deal). But getting the timing wrong or missing a subscription is a problem.
I know I can just listen to a DynamoDB event stream, but I'm concerned that when things get noisy there will be more irrelevant stuff slowing me down. And I'm not sure if having every single record get its own topic is scalable (or how messy that would be).
You can use DynamoDB streams in combination with Lambda Event Filtering so the Lambda function only executes for the relevant change you are interested in. More information is available here:
https://aws.amazon.com/about-aws/whats-new/2021/11/aws-lambda-event-filtering-amazon-sqs-dynamodb-kinesis-sources/
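As a rough sketch (using boto3; the stream ARN, function name, and filtered attribute are hypothetical), the filter is attached when creating the event source mapping, so the function is only invoked when an item reaches the state you care about:

import json
import boto3

lambda_client = boto3.client("lambda")

# Only invoke the function when an item's "status" attribute is set to "COMPLETE".
pattern = json.dumps({
    "eventName": ["MODIFY"],
    "dynamodb": {"NewImage": {"status": {"S": ["COMPLETE"]}}},
})

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:dynamodb:us-east-1:123456789012:table/Items/stream/2021-11-01T00:00:00.000",
    FunctionName="notify-refresh-complete",   # hypothetical function
    StartingPosition="LATEST",
    FilterCriteria={"Filters": [{"Pattern": pattern}]},
)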

DynamoDb re-processing records

I just inherited someone else's code that uses a serverless Lambda function to process records from DynamoDB. The original developer is using DynamoDB much like RabbitMQ: as a temporary staging area with some level of fault tolerance, with a Lambda function that will process the records at a later date.
We currently have a way to delay message publication in RabbitMQ at my company, but this feature is missing on the AWS side of the fence.
I wrote some code in my serverless Lambda function so that it checks a special property called ProcessAfter (a UTC DateTime) and effectively skips processing any given DynamoDB record if the current UTC date/time is earlier than the one specified by ProcessAfter. However, DynamoDB never sends me that record again. It appears that DynamoDB only ever allows a single attempt at processing a record (excluding the built-in retries on exceptions), so I'm stuck with my attempted solution for implementing a delay capability.
Is there any way to replicate the delay functionality in DynamoDB, or in my Lambda function, so that messages are skipped and then re-processed as often as necessary until the delay is over and the record is successfully processed?
It looks like you are listening to DynamoDB Streams. They work by sending an event (insert, update, etc., whichever is configured) for a record to a listener for processing whenever that event happens.
Now, talking about your specific scenario: you need an SQS queue in place for processing a record later if you do not wish to process it as soon as you receive it.
A better architecture I would advise is to add an extra SQS queue and Lambda. That Lambda listens to the DynamoDB stream event, compares ProcessAfter with the current time to compute the delay, sets that delay as DelaySeconds, and sends the message to SQS (a sketch is shown below).
Finally, a Lambda listener on the queue picks up the message and processes it after the specified delay, or with zero delay, as required.
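A minimal sketch of that forwarding Lambda (assuming boto3, a hypothetical queue URL in an environment variable, and ProcessAfter stored as a timezone-aware ISO-8601 string):

import json
import os
from datetime import datetime, timezone

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["DELAY_QUEUE_URL"]  # hypothetical environment variable

def handler(event, context):
    for record in event["Records"]:
        image = record["dynamodb"].get("NewImage", {})
        process_after = datetime.fromisoformat(image["ProcessAfter"]["S"])
        delay = (process_after - datetime.now(timezone.utc)).total_seconds()
        # SQS caps per-message delay at 900 seconds; anything longer
        # would need to be re-queued by the consumer.
        delay_seconds = max(0, min(int(delay), 900))
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps(image),
            DelaySeconds=delay_seconds,
        )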

Which guarantees does Kafka Streams provide when using a RocksDB state store with changelog?

I'm building a Kafka Streams application that generates change events by comparing every new calculated object with the last known object.
So for every message on the input topic, I update an object in a state store and every once in a while (using punctuate), I apply a calculation on this object and compare the result with the previous calculation result (coming from another state store).
To make sure this operation is consistent, I do the following after the punctuate triggers:
write a tuple to the state store
compare the two values, create change events and context.forward them. So the events go to the results topic.
replace the tuple with the new_value and write that to the state store
I use this tuple for scenarios where the application crashes or rebalances, so I can always send out the correct set of events before continuing.
Now, I noticed the resulting events are not always consistent, especially if the application frequently rebalances. It looks like in rare cases the Kafka Streams application emits events to the results topic, but the changelog topic is not up to date yet. In other words, I produced something to the results topic, but my changelog topic is not at the same state yet.
So, when I do a stateStore.put() and the method call returns successfully, are there any guarantees when it will be on the changelog topic?
Can I enforce a changelog flush? When I do context.commit(), when will that flush+commit happen?
To get complete consistency, you will need to enable processing.guarantee="exactly_once" -- otherwise, in case of an error, you might get inconsistent results.
If you want to stay with "at_least_once", you might want to use a single store and update the store after processing is done (i.e., after calling forward()). This minimizes the time window in which you can get inconsistencies.
And yes, when you call context.commit(), all stores are flushed to disk and all pending producer writes are flushed before the input topic offsets are committed.

AWS Lambda processing stream from DynamoDB

I'm trying to create a Lambda function that consumes a stream from a DynamoDB table. However, I was wondering what the best practice is for handling data that may not have been processed because of errors during execution. For example, if my Lambda failed and I lost part of the stream, what is the best way to reprocess the lost data?
This is handled for you. DynamoDB Streams, like Kinesis Streams, will resend records until they have been successfully processed. When you are using Lambda to process the stream, that means successfully exiting the function. If there is an error and the function exits unexpectedly, the DynamoDB stream will simply resend the record that was being processed.
The good thing is that you are guaranteed at-least-once processing; however, there are some things you need to look out for. Like Kinesis Streams, DynamoDB Streams are guaranteed to process records in order. As a side effect, when a record fails to process, it is retried until it is successfully processed or it expires from the stream (possibly days) before any records behind it in the stream are processed.
How you solve this depends on the needs of your application. If you need at-least-once processing but don't need to guarantee that all records are processed in order, I would just drop the records into an SQS queue and do the processing off of the queue. SQS queues will also retry records that aren't successfully processed; however, unlike DynamoDB and Kinesis Streams, records will not block each other in the queue. If you encounter an error when transferring a record from the DynamoDB Stream to the SQS queue, you can just retry, though this may introduce duplicates in the SQS queue.
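A minimal sketch of that stream-to-SQS hand-off (assuming boto3 and a hypothetical queue URL in an environment variable), so that downstream processing failures never block the shard:

import json
import os

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["WORK_QUEUE_URL"]  # hypothetical environment variable

def handler(event, context):
    # Forward every stream record to SQS; the heavy processing happens
    # in a separate consumer reading from the queue.
    for record in event["Records"]:
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps(record["dynamodb"]),
        )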
If order is critical or duplicates can't be tolerated, you can use an SQS FIFO queue. SQS FIFO queues are similar to (standard) SQS queues, except that they are guaranteed to deliver messages to the consumer in order and have a deduplication window (5 minutes) in which any duplicates added to the queue are discarded.
In both cases, when using SQS queues to process messages, you can set up a dead-letter queue to which messages are automatically sent if they fail to be processed N times.
TLDR: Use SQS Queues.
Updating this thread as all the existing answers are stale.
AWS Lambda now supports DLQs for synchronous stream reads from a DynamoDB table stream.
With this feature in context, here is the flow that I would recommend:
Configure the event source mapping to include the DLQ ARN and set the retry-attempt count. After that many retries, the batch metadata is moved to the DLQ (see the sketch after this list).
Set up an alarm on DLQ message visibility to get alerted about impacted records.
The DLQ message can be used to retrieve the impacted stream record using the KCL library.
Pro tip: you can use the "Bisect on Function Error" attribute to enable batch splitting. With this option, Lambda can narrow down the impacted record.
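A rough sketch of that event source mapping configuration with boto3 (the ARNs and function name are hypothetical):

import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:dynamodb:us-east-1:123456789012:table/Orders/stream/2021-11-01T00:00:00.000",
    FunctionName="process-orders-stream",  # hypothetical function
    StartingPosition="TRIM_HORIZON",
    MaximumRetryAttempts=3,            # after the retries are exhausted, batch metadata goes to the destination
    BisectBatchOnFunctionError=True,   # "Bisect on Function Error": split failing batches to isolate the bad record
    DestinationConfig={
        "OnFailure": {"Destination": "arn:aws:sqs:us-east-1:123456789012:orders-stream-dlq"}
    },
)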
DynamoDB Streams invokes the Lambda function for each event until it is successfully processed (until the code calls the success callback).
If an error occurs during execution, you need to handle it in code; otherwise the Lambda won't continue with the remaining messages in the stream.
If there is a situation where you need to process a message separately because of an error, you can push the message to a dead-letter queue (with Amazon SQS) and continue with the remaining items in the stream. You can have separate logic to process the messages in that queue.
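A minimal sketch of that per-record handling (the queue URL and the process_record function are hypothetical placeholders):

import json
import os

import boto3

sqs = boto3.client("sqs")
DLQ_URL = os.environ["STREAM_DLQ_URL"]  # hypothetical environment variable

def process_record(record):
    ...  # placeholder for the real business logic

def handler(event, context):
    for record in event["Records"]:
        try:
            process_record(record)
        except Exception:
            # Park the failed record in SQS and keep going, so one bad
            # record does not block the rest of the batch.
            sqs.send_message(QueueUrl=DLQ_URL, MessageBody=json.dumps(record))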

Storm: how to set up various metrics for the same data source

I'm trying to set up Storm to aggregate a stream, but with various (DRPC-available) metrics on the same stream.
E.g., the stream consists of messages that have a sender, a recipient, the channel through which the message arrived, and a gateway through which it was delivered. I'm having trouble deciding how to organize one or more topologies that could give me, e.g., the total count of messages by gateway and/or by channel. And besides the totals, counts per minute would be nice too.
The basic idea is to have a spout that accepts messaging events, and from there aggregate the data as needed. Currently I'm playing around with Trident and DRPC, and I've come up with two possible topologies that solve the problem at this stage. I can't decide which approach is better, if either.
The entire source is available at this gist.
It has three classes:
RandomMessageSpout
used to emit the messaging data
simulates the real data source
SeparateTopology
creates a separate DRPC stream for each metric needed
also a separate query state is created for each metric
they all use the same spout instance
CombinedTopology
creates a single DRPC stream with all the metrics needed
creates a separate query state for each metric
each query state extracts the desired metric and groups results for it
Now, for the problems and questions:
SeparateTopology
is it necessary to use the same spout instance or can I just say new RandomMessageSpout() each time?
I like the idea that I don't need to persist grouped data by all the metrics, but just the groupings we need to extract later
is the spout emitted data actually processed by all the state/query combinations, e.g. not the first one that comes?
would this also later enable dynamic addition of new state/query combinations at runtime?
CombinedTopology
I don't really like the idea that I need to persist data grouped by all the metrics since I don't need all the combinations
it came as a surprise that all the metrics always return the same data
e.g. channel and gateway inquiries return status metrics data
I found that this was always the data grouped by the first field in state definition
this topic explains the reasoning behind this behaviour
but I'm wondering if this is a good way of doing things in the first place (and I will find a way around this issue if need be)
SnapshotGet vs TupleCollectionGet in stateQuery
with SnapshotGet things tended to work, but not always; only TupleCollectionGet solved the issue
any pointers as to what is the correct way of doing this?
I guess this is a longish question / topic, but any help is really appreciated!
Also, if I missed the architecture entirely, suggestions on how to accomplish this would be most welcome.
Thanks in advance :-)
You can't actually split a stream in SeparateTopology by invoking newStream() using the same spout instance, since that would create new instances of the same RandomMessageSpout spout, which would result in duplicate values being emitted to your topology by multiple, separate spout instances. (Spout parallelization is only possible in Storm with partitioned spouts, where each spout instance processes a partition of the whole dataset -- a Kafka partition, for example).
The correct approach here is to modify the CombinedTopology to split the stream into multiple streams as needed for each metric you need (see below), and then do a groupBy() by that metric's field and persistentAggregate() on each newly branched stream.
From the Trident FAQ,
"each" returns a Stream object, which you can store in a variable. You can then run multiple eaches on the same Stream to split it, e.g.:
Stream s = topology.each(...).groupBy(...).aggregate(...)
Stream branch1 = s.each(...)
Stream branch2 = s.each(...)
See this thread on Storm's mailing list, and this one for more information.
