AWS Kinesis Stream Aggregating Based on Time Spans - algorithm

I currently have a Kinesis stream that is populated with JSON messages that are in the form of:
{"datetime": "2017-09-29T20:12:01.755z", "payload":"4"}
{"datetime": "2017-09-29T20:12:07.755z", "payload":"5"}
{"datetime": "2017-09-29T20:12:09.755z", "payload":"12"}
etc...
What I'm trying to accomplish here is to aggregate the data into time chunks. In this case, I'd like to group the averages into 10 minute spans. For example, from 12:00 to 12:10, I want to average the payload values and save the result as the 12:10 value.
For example, the above data would produce:
Datetime: 2017-09-29T20:12:10.00z
Average: 7
The method that I'm thinking of is to use caching at the service level plus some way to track the time. Whenever the messages move into the next 10 minute timespan, I average the cached data, store it in the DB and then delete that cache value.
Currently, my service sees 20,000 messages every minute, with higher volume expected in the future. I'm a little stuck on how to implement this so that I'm guaranteed to get all the values for a given 10 minute period from Kinesis. Those of you who are more familiar with Kinesis and AWS: is there a simple way to go about this?
The reason for doing this is to shorten the query times for data from large timespans, such as 1 year. I wouldn't want to grab millions of values but rather a few aggregated values.
Edit:
I have to keep track of many different averages at the same time. For example, the above JSON may pertain to just one 'set', such as the average temperature per city in 10 minute timespans. This requires me to keep track of each city's averages for every timespan.
Toronto (12:01 - 12:10): average_temp
New York (12:01 - 12:10): average_temp
Toronto (12:11 - 12:20): average_temp
New York (12:11 - 12:20): average_temp
etc...
This could pertain to any city worldwide. If new temperatures arrive for, say, Toronto, and they pertain to the 12:01 - 12:10 timespan, I have to recalculate and store that average.
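For illustration, deriving the cache/DB key for one set (such as a city) and its 10 minute window could look like the minimal Java sketch below; the class and method names are hypothetical, not part of the original design.
import java.time.Duration;
import java.time.Instant;

public class WindowKeys {
    private static final long WINDOW_MILLIS = Duration.ofMinutes(10).toMillis();

    /**
     * Returns the end of the 10 minute window that the event falls into,
     * e.g. 2017-09-29T20:12:01.755Z -> 2017-09-29T20:20:00Z.
     */
    static Instant windowEnd(String datetime) {
        long eventMillis = Instant.parse(datetime).toEpochMilli();
        long windowStart = eventMillis - (eventMillis % WINDOW_MILLIS);
        return Instant.ofEpochMilli(windowStart + WINDOW_MILLIS);
    }

    /** Cache/DB key for one set (e.g. a city) in one window. */
    static String bucketKey(String set, String datetime) {
        return set + "#" + windowEnd(datetime);
    }

    public static void main(String[] args) {
        // Prints "Toronto#2017-09-29T20:20:00Z"
        System.out.println(bucketKey("Toronto", "2017-09-29T20:12:01.755Z"));
    }
}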

This is how I would do it. Thanks for the interesting question.
Kinesis Stream --> Lambda (Event Inserter) --> DynamoDB (stream enabled) --> Lambda (Count and Value Incrementor) --> DynamoDB (stream enabled) --> Lambda (Average Updater)
DynamoDB Table Structure:
{
  Timestamp: 1506794597,
  Count: 3,
  TotalValue: 21,
  Average: 7,
  Event{timestamp}-{guid}: { event }
}
timestamp -- the timestamp of the actual event
guid -- avoids collisions when multiple events share the same timestamp
Event{timestamp}-{guid} -- this attribute should be removed by the Count and Value Incrementor
If a fourth record for that timespan arrives:
Round its timestamp to the enclosing 10 minute timespan, then increment the Count and the TotalValue. Never read the value and then write the incremented result back; that will produce errors unless you use strongly consistent reads (which are costly). Instead, perform the update as an atomic increment.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItems.html#WorkingWithItems.AtomicCounters
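As a concrete example, the Count and Value Incrementor's atomic update might look like the following sketch using the AWS SDK for Java v2; the table name and attribute names mirror the structure above, but the class and method are assumptions, not a definitive implementation.
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.UpdateItemRequest;

import java.util.Map;

public class CounterIncrementor {
    private final DynamoDbClient dynamo = DynamoDbClient.create();

    /** Atomically adds one event's payload to the running totals for a timespan. */
    public void increment(String tableName, long timespanStart, double payload) {
        dynamo.updateItem(UpdateItemRequest.builder()
                .tableName(tableName)
                .key(Map.of("Timestamp", AttributeValue.builder().n(Long.toString(timespanStart)).build()))
                // ADD is atomic: no read-modify-write race between concurrent Lambda invocations
                .updateExpression("ADD #c :one, #t :val")
                .expressionAttributeNames(Map.of("#c", "Count", "#t", "TotalValue"))
                .expressionAttributeValues(Map.of(
                        ":one", AttributeValue.builder().n("1").build(),
                        ":val", AttributeValue.builder().n(Double.toString(payload)).build()))
                .build());
    }
}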
Enable a DynamoDB stream on the above table and listen to it with another Lambda; that Lambda calculates the average value and updates it.
When you calculate the average, don't read from the table. The data is already available in the stream record, so you just need to compute the average and update it (overwriting the previous average value).
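Correspondingly, a minimal sketch of the Average Updater, assuming Count and TotalValue are taken from the stream record's new image rather than a table read (again, the names here are illustrative):
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.UpdateItemRequest;

import java.util.Map;

public class AverageUpdater {
    private final DynamoDbClient dynamo = DynamoDbClient.create();

    /** count and totalValue come straight from the stream record's new image; no table read needed. */
    public void updateAverage(String tableName, long timespanStart, long count, double totalValue) {
        double average = totalValue / count;
        dynamo.updateItem(UpdateItemRequest.builder()
                .tableName(tableName)
                .key(Map.of("Timestamp", AttributeValue.builder().n(Long.toString(timespanStart)).build()))
                // Overwrite the previous average; last writer wins, which is fine because
                // every stream record carries the latest Count and TotalValue.
                .updateExpression("SET #a = :avg")
                .expressionAttributeNames(Map.of("#a", "Average"))
                .expressionAttributeValues(Map.of(
                        ":avg", AttributeValue.builder().n(Double.toString(average)).build()))
                .build());
    }
}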
This will work at any scale and with high availability.
Hope it helps.
EDIT1:
Since the OP is not familiar with AWS services, here is documentation for the AWS cloud services used in the solution:
Lambda documentation:
https://aws.amazon.com/lambda/
DynamoDB documentation:
https://aws.amazon.com/dynamodb/

Related

Kafka stream time and window expire - KStreamSessionWindowAggregate skipping records

I am a newbie to Kafka Streams and I am experimenting with it to process a stream of messages.
Scenario
Incoming payload structure is:
"building-<M>, sensor-<N>.<parameter>, value, timestamp".
For example:
"building-1, sensor-1.temperature, 18, 2020-06-12T15:01:05Z"
"building-1, sensor-1.humidity, 75, 2020-06-12T15:01:05Z"
"building-1, sensor-2.temperature, 20, 2020-06-12T15:01:05Z"
"building-1, sensor-2.humidity, 70, 2020-06-12T15:01:05Z"
The message key in Kafka is the building id.
The stream transforms this into a POJO for further downstream processing:
SensorData {
    buildingId = "building-1"
    sensorId = "sensor-1"
    parameterName = "temperature"
    parameterValue = 18
    timestamp = 1592048743000
    ..
    ..
}
Each sensor sends all of its parameters at the same time as separate records, and each sensor sends a new set of readings every 5 minutes.
The timestamp extractor is set to take the time from the payload. It also rejects a record if its timestamp is way off (say, a 1 hour deviation from the current stream time).
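For reference, a payload-based timestamp extractor along these lines might look like the sketch below; the CSV parsing and the use of the wall clock instead of stream time are simplifications, not the poster's actual code.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

import java.time.Duration;
import java.time.Instant;

public class PayloadTimestampExtractor implements TimestampExtractor {
    private static final long MAX_DEVIATION_MS = Duration.ofHours(1).toMillis();

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        // Assumes the raw value is the CSV-like payload shown above; parsing is simplified.
        String[] fields = record.value().toString().split(",");
        long eventTime = Instant.parse(fields[3].trim()).toEpochMilli();

        // Reject records whose timestamp deviates too far from the clock.
        // Kafka Streams drops records whose extracted timestamp is negative.
        if (Math.abs(System.currentTimeMillis() - eventTime) > MAX_DEVIATION_MS) {
            return -1L;
        }
        return eventTime;
    }
}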
In my topology, at one point, I want to perform an aggregate operation combining all the data from one sensor. For example, in the above sample, I want to perform an aggregation for each sensor using the temperature and humidity reported by that sensor.
Topology
I group by "buildingId" and "sensorId", then apply a session window with a 2 minute gap and a 1 minute grace period.
kStreamBuilder
    .stream("building-sensor-updates", ...)
    // Had to clean up the key and also needed some data from the context
    .transform(() -> new String2SensorObjectConvertor())
    // triggers another re-partition
    .groupBy((key, value) -> value.buildingId + "-" + value.sensorId, ...)
    .windowedBy(SessionWindows.with(..))
    .aggregate(
        () -> new SensorDataAggregator(),
        ...,
        Materialized.<String, SensorDataAggregator, SessionStore<Bytes, byte[]>>as("session_aggregate_store"))
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
    .toStream()
    ...
    ...
As expected, this triggers a re-partition, and the sub-stream consumes records from the re-partition topic "sensor_data_processor-session_aggregate_store-repartition". I am seeing an issue there, as explained later.
Test input data
I am testing a scenario where past data is re-processed from storage or from a Kafka offset. For testing, I feed data from a CSV file using Kafka-spool-connect. The timestamps of the records in the input CSV file are kept in ascending order, and for a given sensor, the next set of records has a timestamp 5 minutes later.
"building-1, sensor-1.temperature, 18, 2020-06-12T15:01:02Z"
"building-1, sensor-1.humidity, 75, 2020-06-12T15:01:05Z"
"building-1, sensor-2.temperature, 20, 2020-06-12T15:01:03Z"
"building-1, sensor-2.humidity, 70, 2020-06-12T15:01:06Z"
"building-1, sensor-1.temperature, 19, 2020-06-12T15:06:04Z"
"building-1, sensor-1.humidity, 65, 2020-06-12T15:06:08Z"
"building-1, sensor-2.temperature, 21, 2020-06-12T15:06:05Z"
"building-1, sensor-2.humidity, 73, 2020-06-12T15:06:09Z"
I inject the test data in bulk (200,000 records) without any delay.
Issue
When the sub-stream processes the records from this re-partition topic, I see the following WARNING message from KStreamSessionWindowAggregate and the records get skipped.
WARN
org.apache.kafka.streams.kstream.internals.KStreamSessionWindowAggregate
- Skipping record for expired window. key=[BUILDING-ID-1003-sensor-1] topic=[sensor_data_processor-session_aggregate_store-repartition]
partition=[0] offset=[1870] timestamp=[1591872043000]
window=[1591872043000,1591872043000] expiration=[1591951243000]
streamTime=[1591951303000]
If you look at the timestamps in the WARNING message:
The timestamp of the message is "June 11, 2020 10:40:43Z"
The window expiration is "June 12, 2020 08:40:43Z"
The stream time has already advanced to "June 12, 2020 08:41:43Z"
I also tried a time window of 7 minutes with a 2 minute advance, and had a similar issue there as well.
Observations
As the key of the original messages is "building-id", all records from the same building (and hence the same sensor) should go into one partition, and the records from each sensor should be in order.
I am also doing a transform() at the beginning of the topology. I had to clean up the key and also wanted some data from the context. Though this may trigger a re-partition, it should not change the order of records within a sensor, as it only cleans up the key, so the resulting partitions would keep the same elements. I will get rid of this transform() with some optimization.
My window grouping is based on building-id + sensor-id, so the elements from the same sensor in each re-partitioned group should also arrive in order.
Given all this, I was hoping that each partition/group's stream time would progress monotonically according to the timestamps of the events in that partition, since their order is maintained. But I see a jump in the stream time. I looked at org.apache.kafka.streams.kstream.internals.KStreamSessionWindowAggregate and some Kafka Streams documentation:
It appears to me that monotonic stream time is maintained per stream task, not per partition, and the same stream task may be used for processing multiple topic partitions. Because the records are injected in quick succession, a task may process a bulk of records from one partition, and when it picks up another topic partition, the stream time might already be far ahead of the timestamps of the records in the new partition, which results in them being expired.
Questions
For replaying records like this, how can this be handled other than setting a large grace period for the window?
Even in a real-time scenario, this issue might happen when there is back pressure. Using a large grace period is not an option, as the results will be delayed since I am using Suppressed.untilWindowCloses(). What would be the best way to handle this?
If stream time is maintained per stream task and the same task may be used for multiple topic partitions, is there any way to keep a 1:1 mapping and stickiness between stream tasks and topic partitions? If so, what would be the implications other than potential performance issues?
Why wouldn't Kafka Streams maintain stream time per topic partition instead of per stream task?
When I looked at the "sensor_data_processor-session_aggregate_store-repartition" topic mentioned in the warning message, I saw that mostly "temperature" records alone were getting published to it (yes, for each group, "temperature" comes first in the test data set). Why do only temperature records go into that topic? Is it just a timing coincidence?
For replaying records like this, how can this be handled other than setting a large grace period for the window?
I guess you cannot. If you process today's data and later data from yesterday, yesterday's data would be discarded. What you could do is start a new application. In that case, on startup the app has no stream time, so it will initialize its stream time with "yesterday" and the data won't be discarded.
Even in a real-time scenario, this issue might happen when there is back pressure. Using a large grace period is not an option, as the results will be delayed since I am using Suppressed.untilWindowCloses(). What would be the best way to handle this?
Well, you have to pick your poison... Or you fall back to the Processor API and implement whatever logic you need manually.
If stream time is maintained per stream task and the same task may be used for multiple topic partitions, is there any way to keep a 1:1 mapping and stickiness between stream tasks and topic partitions? If so, what would be the implications other than potential performance issues?
Stream time is definitely maintained per task, and there is a 1:1 mapping between tasks and partitions. Maybe the data is shuffled unexpectedly. "My window grouping is based on building-id + sensor-id, so the elements from the same sensor in each re-partitioned group should also arrive in order": agreed; however, the data would still be shuffled, and thus, if one upstream task processed data faster than its "parallel" peers, it would lead to a fast advance of stream time in all downstream tasks, too.
Why wouldn't Kafka Streams maintain stream time per topic partition instead of per stream task?
Not sure I can follow. Each task tracks stream time individually, and there is a 1:1 mapping between tasks and partitions. Hence, it seems both (tracking per partition or tracking per task, assuming there is only one input partition per task) are the same.

Aggregate timeseries data over various timeframes

I have a question about how to aggregate time series data that is coming into DynamoDB.
Currently energy usage data is coming into DynamoDB every 30 seconds per device. The devices are also spread across many timezones.
I want to show the aggregate energy usage over one hour, one day, one month, and one year.
I know one way I can do it is to run a Lambda on a 1 hour cron job that takes all of the readings for the previous hour, adds them together, and records the total in a different table.
In the same cron job, the Lambda can check whether any device's timezone just had its day end and, if so, roll up the previous 24 hours into a single daily reading.
The same goes for month and year.
But something tells me there is another, better way to do all this (probably using some other AWS service that I am not thinking of).
Instead of a cron job, you can use DynamoDB Streams.
In this case, when a record comes into your data collection table, it can kick off a Lambda function that updates your aggregate tables. That will give you more timely updates in the aggregate tables. The logic for which hour/day/month/year bucket a record gets aggregated into should live in that Lambda.
Also, I'd use a CloudWatch Events rule instead of cron...
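To make the stream-triggered approach concrete, here is a minimal sketch of the aggregation logic such a Lambda might run for each stream record; the table name, key layout, and method names are assumptions, and the DynamoDB Streams event unmarshalling is omitted.
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.UpdateItemRequest;

import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Map;

public class UsageAggregator {
    private static final DateTimeFormatter HOUR = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH");
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy-MM-dd");
    private static final DateTimeFormatter MONTH = DateTimeFormatter.ofPattern("yyyy-MM");

    private final DynamoDbClient dynamo = DynamoDbClient.create();

    /** Called once per stream record; adds the reading to its hour/day/month buckets. */
    public void onReading(String deviceId, Instant readingTime, double kwh, ZoneId deviceZone) {
        // The day/month buckets are derived in the device's own timezone,
        // so "end of day" matches the device rather than UTC.
        ZonedDateTime local = readingTime.atZone(deviceZone);
        addToBucket(deviceId, "HOUR#" + HOUR.format(local), kwh);
        addToBucket(deviceId, "DAY#" + DAY.format(local), kwh);
        addToBucket(deviceId, "MONTH#" + MONTH.format(local), kwh);
    }

    private void addToBucket(String deviceId, String bucket, double kwh) {
        dynamo.updateItem(UpdateItemRequest.builder()
                .tableName("EnergyUsageAggregates")   // hypothetical table name
                .key(Map.of(
                        "DeviceId", AttributeValue.builder().s(deviceId).build(),
                        "Bucket", AttributeValue.builder().s(bucket).build()))
                // ADD is atomic, so concurrent stream-triggered invocations don't clobber each other.
                .updateExpression("ADD TotalKwh :v")
                .expressionAttributeValues(Map.of(
                        ":v", AttributeValue.builder().n(Double.toString(kwh)).build()))
                .build());
    }
}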

a data structure to query number of events in different time interval

My program receives thousands of events per second of different types, for example 100k API accesses per second from users with millions of different IP addresses. I want to keep statistics and limit the number of accesses in 1 minute, 1 hour, 1 day and so on. So I need event counts for the last minute, hour or day for every user, and I want it to behave like a sliding window. In this case, the type of event is the user's address.
I started using a time series database, InfluxDB, but it failed to insert 100k events per second, and aggregate queries to find event counts in a minute or an hour are even worse. I am sure InfluxDB is not capable of inserting 100k events per second while performing 300k aggregate queries at the same time.
I don't want the events retrieved from the database, because each one is just a simple address. I just want to count them as fast as possible over different time intervals: the number of events of type x in a specific time interval (for example, the past 1 hour).
I don't need to store the statistics on disk, so maybe a data structure that keeps event counts over different time intervals would suit me. On the other hand, it needs to behave like a sliding window.
Storing all the events in RAM in a linked list and iterating over it to answer queries is another solution that comes to mind, but because the number of events is so high, keeping all of them in RAM is not a good idea.
Is there any good data structure or even a database for this purpose?
You didn't provide enough detail on the event input format and how events are delivered to the statistics backend: is it a stream of UDP messages, HTTP PUT/POST requests, or something else?
One possible solution would be to use the Yandex ClickHouse database.
Rough description of suggested pattern:
Load incoming raw events from your application into a memory-based table Events with the Buffer storage engine
Create a materialized view with per-minute aggregation in another memory-based table EventsPerMinute with the Buffer engine
Do the same for hourly aggregation of data in EventsPerHour
Optionally, use Grafana with the ClickHouse datasource plugin to build dashboards
In ClickHouse, a Buffer storage engine table that is not associated with any on-disk table is kept entirely in memory, and older data is automatically replaced with fresh data. This gives you simple housekeeping for the raw data.
The tables (materialized views) EventsPerMinute and EventsPerHour can also be created with the MergeTree storage engine in case you want to keep statistics on disk. ClickHouse can easily handle billions of records.
At 100K events/second you may need some kind of shaper/load balancer in front of the database.
You can consider a Hazelcast cluster instead of plain RAM. Graylog or plain Elasticsearch might also work, but with this kind of load you should test. You can also think about your data structure: construct an hour map for each address and put each event into its hour bucket. When the hour has passed, you can calculate the count and cache it in that hour's bucket. When you need minute granularity, you go to the hour bucket and count the events in that hour's list. A sketch of this kind of bucketed counter follows below.
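Here is a minimal Java sketch of that idea, using a fixed ring of per-minute buckets per key so memory stays bounded regardless of event volume; the class name and the coarse synchronization are illustrative only (at 100k events/second you would shard keys across threads or nodes, e.g. in Hazelcast).
import java.util.HashMap;
import java.util.Map;

/**
 * Sliding-window event counter: a ring of per-minute buckets per key (e.g. per IP address).
 * Memory stays O(keys * windowMinutes) no matter how many events arrive.
 */
public class SlidingWindowCounter {
    private static class Ring {
        final long[] minuteEpoch;   // which minute each slot currently holds
        final long[] count;
        Ring(int slots) { minuteEpoch = new long[slots]; count = new long[slots]; }
    }

    private final int windowMinutes;
    private final Map<String, Ring> rings = new HashMap<>();

    public SlidingWindowCounter(int windowMinutes) { this.windowMinutes = windowMinutes; }

    /** Record one event for the given key at the given time (epoch millis). */
    public synchronized void record(String key, long eventTimeMillis) {
        long minute = eventTimeMillis / 60_000;
        Ring ring = rings.computeIfAbsent(key, k -> new Ring(windowMinutes));
        int slot = (int) (minute % windowMinutes);
        if (ring.minuteEpoch[slot] != minute) {   // slot holds a stale minute: recycle it
            ring.minuteEpoch[slot] = minute;
            ring.count[slot] = 0;
        }
        ring.count[slot]++;
    }

    /** Count events for the key in the last `minutes` minutes (minutes <= windowMinutes). */
    public synchronized long countLast(String key, int minutes, long nowMillis) {
        Ring ring = rings.get(key);
        if (ring == null) return 0;
        long nowMinute = nowMillis / 60_000;
        long total = 0;
        for (int i = 0; i < windowMinutes; i++) {
            // Only slots whose minute falls inside the requested window contribute.
            if (nowMinute - ring.minuteEpoch[i] < minutes) total += ring.count[i];
        }
        return total;
    }
}
For example, a counter built with new SlidingWindowCounter(60) can answer both "events from this address in the last minute" and "in the last hour" from the same buckets.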

high volume data storage and processing

I am building a new application where I am expecting a high volume of geo location data: something like a moving object sending geo coordinates every 5 seconds. This data needs to be stored in some database so that it can be used for tracking the moving object on a map at any time. I am expecting about 250 coordinates per moving object per route, each object can run about 50 routes a day, and I have 900 such objects to track. That comes to about 11.5 million geo coordinates to store per day, and I have to keep at least one week of data in my database.
This data will basically be used for simple queries, like finding all the geo coordinates for a particular object and a particular route. So the queries are not very complicated, and this data will not be used for any analysis.
So my question is: should I just go with a normal Oracle database like 12c distributed over two VMs, or should I think about big data technologies like NoSQL or Hadoop?
One of the key requirements is high performance. Each query has to respond within 1 second.
Since you know the volume of data (11.5 million rows per day), you can easily simulate your whole scenario in an Oracle DB and test it well beforehand.
My suggestion is to go for day-level partitions with sub-partitioning on object and route. All your business SQL should always hit the right partitions.
You will also need to clear out older days' data, or create some sort of aggregation over past days and delete the raw data.
It's well doable in 12c.

Aerospike: get upsert time without explicitly storing it for records with a TTL

Aerospike is blazingly fast and reliable, but expensive. The cost, for us, is based on the amount of data stored.
We'd like the ability to query records based on their upsert time. Currently, when we add or update a record, we set a bin to the current epoch time and can run scan queries on this bin.
It just occurred to me that Aerospike knows when to expire a record based on when it was upserted, and since we can query the TTL value from the record metadata via a simple UDF, it might be possible to infer the upsert time for records with a TTL. We're effectively using space to store a value that's already known.
Is it possible to access record creation or expiry time, via UDF, without explicitly storing it?
At this point, Aerospike only stores the void time along with the record (the time when the record expires). So the upsert time is unfortunately not available. Stay tuned, though, as I heard there were some plans to have some new features that may help you. (I am part of Aerospike's OPS/Support team).
Void time: this tracks the life of a key in the system. It is the time at which the key should expire, and it is used by the eviction subsystem.
So the TTL is derived from the void time.
Given the TTL from a record, we can only calculate the void time (now + ttl).
Based on what you have, I think you can derive the upsert time from the TTL only if you apply the same expiration to all your records, say CONSTANT_EXPIRATION_TIME.
In that case:
upsert_time = now - (CONSTANT_EXPIRATION_TIME - ttl)
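For illustration, the arithmetic in Java; the 30-day constant and the class/method names are hypothetical, and the estimate is only valid if every record is written with that same TTL.
import java.time.Instant;

public class UpsertTimeEstimator {
    // Only valid if every record is written with this same TTL (hypothetical: 30 days).
    private static final long CONSTANT_EXPIRATION_TIME_SECONDS = 30L * 24 * 60 * 60;

    /**
     * ttlSeconds is the remaining TTL read from the record metadata
     * (e.g. Record.getTimeToLive() in the Aerospike Java client).
     */
    public static Instant estimateUpsertTime(long ttlSeconds, Instant now) {
        // upsert_time = now - (CONSTANT_EXPIRATION_TIME - ttl)
        long elapsedSeconds = CONSTANT_EXPIRATION_TIME_SECONDS - ttlSeconds;
        return now.minusSeconds(elapsedSeconds);
    }

    public static void main(String[] args) {
        // A record written with a 30-day TTL that now has 29 days left
        // was upserted roughly 1 day ago.
        Instant upserted = estimateUpsertTime(29L * 24 * 60 * 60, Instant.now());
        System.out.println("Estimated upsert time: " + upserted);
    }
}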
HTH
