Kafka Streams stream time and window expiration - KStreamSessionWindowAggregate skipping records - apache-kafka-streams

I am a newbie to Kafka Streams and I am experimenting with it to process a stream of messages.
Scenario
Incoming payload structure is:
"building-<M>, sensor-<N>.<parameter>, value, timestamp".
For example:
"building-1, sensor-1.temperature, 18, 2020-06-12T15:01:05Z"
"building-1, sensor-1.humidity, 75, 2020-06-12T15:01:05Z"
"building-1, sensor-2.temperature, 20, 2020-06-12T15:01:05Z"
"building-1, sensor-2.humidity, 70, 2020-06-12T15:01:05Z"
The message key in Kafka is the building ID.
The stream transforms this into a POJO for further downstream processing:
SensorData {
buildingId = "building-1"
sensorId = "sensor-1"
parameterName = "temperature"
parameterValue = 18
timestamp = 1592048743000
..
..
}
Each sensor sends all of its parameters at the same time as separate records, and each sensor sends a set of readings every 5 minutes.
The timestamp extractor is set to take the time from the payload. It also rejects a record if its timestamp deviates too far from the current stream time (say, by more than 1 hour).
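For reference, such an extractor could look roughly like this (a sketch only; the class name, the direct SensorData field access, and the 1-hour threshold are assumptions, not the actual implementation):
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

// Sketch of a payload-based extractor as described above. Returning a negative
// timestamp causes Kafka Streams to skip the record.
public class SensorDataTimestampExtractor implements TimestampExtractor {

    private static final long MAX_DEVIATION_MS = 60 * 60 * 1000L; // 1 hour

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        if (record.value() instanceof SensorData) {
            long eventTime = ((SensorData) record.value()).timestamp;
            // partitionTime is the highest timestamp seen so far on this partition (-1 at start)
            if (partitionTime > 0 && Math.abs(eventTime - partitionTime) > MAX_DEVIATION_MS) {
                return -1L; // reject records that deviate too far
            }
            return eventTime;
        }
        return partitionTime; // fall back to the previously observed time
    }
}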
In my topology, at one point, I want to perform an aggregate operation combining all the data from one sensor. For example, in the above sample, I want to perform an aggregation for each sensor using the temperature and humidity reported by that sensor.
Topology
I group by "buildingId" and "sensorId", then apply a session window with a 2-minute inactivity gap and a 1-minute grace period.
kStreamBuilder
    .stream("building-sensor-updates", ...)
    // Had to clean up the key and also needed some data from the context
    .transform(() -> new String2SensorObjectConvertor())
    // triggers another re-partition
    .groupBy((key, value) -> value.buildingId + "-" + value.sensorId, ...)
    .windowedBy(SessionWindows.with(..))
    .aggregate(
        () -> new SensorDataAggregator(),
        ...,
        Materialized.<String, SensorDataAggregator,
            SessionStore<Bytes, byte[]>>as("session_aggregate_store"))
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
    .toStream()
    ...
    ...
As expected, this triggers a re-partition, and the sub-stream consumes records from the re-partition topic "sensor_data_processor-session_aggregate_store-repartition". I am seeing an issue there, as explained later.
Test input data
I am testing a scenario where past data is re-processed from storage or from a Kafka offset. For testing, I feed data from CSV files using Kafka-spool-connect. The timestamp of each record in the input CSV file is kept in ascending order. For the same sensor, the next set of records has a timestamp 5 minutes later.
"building-1, sensor-1.temperature, 18, 2020-06-12T15:01:02Z"
"building-1, sensor-1.humidity, 75, 2020-06-12T15:01:05Z"
"building-1, sensor-2.temperature, 20, 2020-06-12T15:01:03Z"
"building-1, sensor-2.humidity, 70, 2020-06-12T15:01:06Z"
"building-1, sensor-1.temperature, 19, 2020-06-12T15:06:04Z"
"building-1, sensor-1.humidity, 65, 2020-06-12T15:06:08Z"
"building-1, sensor-2.temperature, 21, 2020-06-12T15:06:05Z"
"building-1, sensor-2.humidity, 73, 2020-06-12T15:06:09Z"
I inject the test data in bulk (200,000 records) without any delay.
Issue
When the sub-stream processes the records from this re-partition topic, I see the following WARNING message from KStreamSessionWindowAggregate and the records get skipped.
WARN
org.apache.kafka.streams.kstream.internals.KStreamSessionWindowAggregate
- Skipping record for expired window. key=[BUILDING-ID-1003-sensor-1] topic=[sensor_data_processor-session_aggregate_store-repartition]
partition=[0] offset=[1870] timestamp=[1591872043000]
window=[1591872043000,1591872043000] expiration=[1591951243000]
streamTime=[1591951303000]
If you look at the timestamps in the WARNING message:
The timestamp of the message is June 11, 2020 10:40:43Z.
The window expiration is June 12, 2020 08:40:43Z.
The stream time has already passed June 12, 2020 08:41:43Z.
I also tried a time window of 7 minutes with a 2-minute advance and saw a similar issue there.
Observations
As the key of the original messages is the building ID, all records from the same building (and hence the same sensor) should go into one partition, and the records from each sensor should be in order.
I am also doing a transform() at the beginning of the topology. I had to clean up the key and also wanted some data from the context. Though this may trigger a re-partition, it should not change the order of records within a sensor, as it only cleans up the key, so the resulting partitions would keep the same elements. I will get rid of this transform() with some optimization.
My window grouping is based on building-id + sensor-id, so the elements from the same sensor in each re-partitioned group should also arrive in order.
Given all this, I was hoping that each partition/group's stream time would progress monotonically with the timestamps of the events in that partition, since their order is maintained. But I see a jump in stream time. I looked at org.apache.kafka.streams.kstream.internals.KStreamSessionWindowAggregate and some Kafka Streams documentation:
It appears to me that monotonic stream time is maintained per stream task, not per partition, and the same stream task may be used for processing multiple topic partitions. Because the records are injected in quick succession, the task may process a bulk of records from one partition, and when it picks up another topic partition, the stream time might already have advanced far past the timestamps of the records in the new partition, which results in them being expired.
Questions
For replaying records like this, how can this be handled other than setting a large grace period for the window?
Even in a real-time scenario, this issue might happen when there is back pressure. Using a large grace period is not an option, as results would be delayed because I am using Suppressed.untilWindowCloses(). What would be the best way to handle this?
If stream time is maintained per stream task and the same task may be used for multiple topic partitions, is there any way to keep a 1:1 mapping and stickiness between stream tasks and topic partitions? If so, what would be the implications other than potential performance issues?
Why doesn't Kafka Streams maintain stream time per topic partition instead of per stream task?
When I looked at the "sensor_data_processor-session_aggregate_store-repartition" topic mentioned in the warning message, I saw that mostly "temperature" records alone were being published to that topic (yes, for each group, "temperature" comes first in the test data set). Why do only temperature records go into that topic? Is it just a timing coincidence?

For replaying records like this, how can this be handled other than setting a large grace period for the window?
I guess you cannot. If you process data from today, and later data from yesterday, the data from yesterday would be discarded. What you could do is start a new application. In this case, on startup the app has no stream time, so it will initialize its stream time with "yesterday", and the data won't be discarded.
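For completeness, if a larger grace period were acceptable for replay-only runs, it could be set directly on the window definition, e.g. (a sketch against the 2.x DSL; the 24-hour value is just a placeholder):
// Sketch only: same 2-minute inactivity gap, but with a deliberately generous
// grace period so that replayed historical records are not dropped.
SessionWindows replayWindow = SessionWindows
        .with(Duration.ofMinutes(2))
        .grace(Duration.ofHours(24));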
Even in a real-time scenario, this issue might happen when there is back pressure. Using a large grace period is not an option, as results would be delayed because I am using Suppressed.untilWindowCloses(). What would be the best way to handle this?
Well, you have to pick your poison... Or you fall back to the Processor API and implement whatever logic you need manually.
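For instance, a rough sketch of such a manual fallback could be a transformer that keeps its own per-key aggregate in a state store and closes a "session" after a per-key inactivity gap, independent of task-level stream time (all names, the lastSeenMs field, and the flush logic are assumptions, not a drop-in implementation):
import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

// Sketch: aggregate per "buildingId-sensorId" key in a plain key-value store and
// close a session once no record for that key has arrived for GAP_MS, instead of
// relying on the task-wide stream time.
public class ManualSessionAggregator
        implements Transformer<String, SensorData, KeyValue<String, SensorDataAggregator>> {

    private static final long GAP_MS = Duration.ofMinutes(2).toMillis();

    private ProcessorContext context;
    private KeyValueStore<String, SensorDataAggregator> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.store = (KeyValueStore<String, SensorDataAggregator>) context.getStateStore("manual_session_store");
        // Periodically look for keys that have been quiet long enough to emit.
        context.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME, this::emitClosedSessions);
    }

    @Override
    public KeyValue<String, SensorDataAggregator> transform(String key, SensorData value) {
        SensorDataAggregator agg = store.get(key);
        if (agg == null) {
            agg = new SensorDataAggregator();
        }
        agg.add(value);                              // assumed aggregation method
        agg.lastSeenMs = System.currentTimeMillis(); // assumed bookkeeping field
        store.put(key, agg);
        return null;                                 // results are emitted by the punctuator
    }

    private void emitClosedSessions(long now) {
        try (KeyValueIterator<String, SensorDataAggregator> it = store.all()) {
            while (it.hasNext()) {
                KeyValue<String, SensorDataAggregator> entry = it.next();
                if (now - entry.value.lastSeenMs > GAP_MS) {
                    context.forward(entry.key, entry.value);
                    store.delete(entry.key);
                }
            }
        }
    }

    @Override
    public void close() { }
}
The store would have to be registered with StreamsBuilder#addStateStore and referenced by name in the transform() call; whether wall-clock inactivity or some other notion of "session closed" is appropriate depends on the replay scenario.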
If stream time is maintained per stream task and the same task may be used for multiple topic partitions, is there any way to keep a 1:1 mapping and stickiness between stream tasks and topic partitions? If so, what would be the implications other than potential performance issues?
Stream time is definitely maintained per task, and there is a 1:1 mapping between tasks and partitions. Maybe the data is shuffled unexpectedly. Regarding "my window grouping is based on building-id + sensor-id, so the elements from the same sensor in each re-partitioned group should also arrive in order": agreed; however, the data would still be shuffled, so if one upstream task processed data faster than its "parallel" peers, it would lead to a fast advance of stream time in all downstream tasks, too.
Why doesn't Kafka Streams maintain stream time per topic partition instead of per stream task?
Not sure I can follow. Each task tracks stream time individually, and there is a 1:1 mapping between tasks and partitions. Hence, tracking per partition and tracking per task (assuming there is only one input partition per task) amount to the same thing.

Related

Aggregate timeseries data over various timeframes

I have a question about how to aggregate time series data that is coming into DynamoDB.
Currently energy usage data is coming into DynamoDB every 30 seconds per device. The devices are also spread across many timezones.
I want to show the aggregate energy usage over one hour, one day, one month, and one year.
I know one way I can do it: run a Lambda on a 1-hour cron job that takes all of the readings for the previous hour, adds them together, and records the result in a different table.
At the same time, in that cron job, the Lambda can check whether any device's timezone just had its day end, and if so, batch up the previous 24 hours into a single day reading.
The same goes for month and year.
But something tells me there is another, better way to do all this (probably using some other AWS service which I am not thinking of).
Instead of a cron job, you can use DynamoDB Streams.
In this case, when a record comes into your data collection table, it can kick off a Lambda function that updates your aggregate tables. That will allow you to get more timely updates into the aggregate tables. The logic for which hour/day/month/year your record gets aggregated into should live in that Lambda.
Also, I'd use a CloudWatch Events rule instead of cron...
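A minimal Java sketch of what that stream-triggered Lambda could look like (the table name "EnergyHourly", the key and attribute names, and the hourly granularity are assumptions, not an existing schema):
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.dynamodbv2.document.spec.UpdateItemSpec;
import com.amazonaws.services.dynamodbv2.document.utils.ValueMap;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;

// Sketch of the stream-triggered aggregator described above: every new reading
// that lands in the raw table is added to its hourly aggregate row.
public class HourlyAggregator implements RequestHandler<DynamodbEvent, Void> {

    private final Table hourly =
            new DynamoDB(AmazonDynamoDBClientBuilder.defaultClient()).getTable("EnergyHourly");

    @Override
    public Void handleRequest(DynamodbEvent event, Context context) {
        event.getRecords().forEach(record -> {
            if (record.getDynamodb().getNewImage() == null) {
                return; // ignore deletes
            }
            String deviceId = record.getDynamodb().getNewImage().get("deviceId").getS();
            long epochSeconds = Long.parseLong(record.getDynamodb().getNewImage().get("timestamp").getN());
            double usage = Double.parseDouble(record.getDynamodb().getNewImage().get("usage").getN());
            long hourBucket = (epochSeconds / 3600) * 3600; // truncate to the start of the hour

            // Atomic ADD keeps the hourly sum correct even with concurrent invocations.
            hourly.updateItem(new UpdateItemSpec()
                    .withPrimaryKey("deviceId", deviceId, "hour", hourBucket)
                    .withUpdateExpression("ADD totalUsage :u, readingCount :one")
                    .withValueMap(new ValueMap().withNumber(":u", usage).withNumber(":one", 1)));
        });
        return null;
    }
}
Daily, monthly, and yearly tables could be maintained the same way, or derived from the hourly table on read.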

a data structure to query number of events in different time interval

My program receives thousands of events per second of different types, for example 100k API accesses per second from users with millions of different IP addresses. I want to keep statistics and limit the number of accesses in 1 minute, 1 hour, 1 day, and so on. So I need event counts for the last minute, hour, or day for every user, and I want it to behave like a sliding window. In this case, the type of event is the user address.
I started with a time series database, InfluxDB, but it failed to insert 100k events per second, and aggregate queries to find event counts in a minute or an hour are even worse. I am sure InfluxDB is not capable of inserting 100k events per second while performing 300k aggregate queries at the same time.
I don't want events retrieved from the database, because each event is just a simple address; I just want to count them as fast as possible over different time intervals. I want to get the number of events of type x in a specific time interval (for example, the past 1 hour).
I don't need to store the statistics on disk, so maybe a data structure keeping event counts for different time intervals is good for me. On the other hand, I need it to behave like a sliding window.
Storing all the events in RAM in a linked list and iterating over it to answer queries is another solution that comes to mind, but because the number of events is so high, keeping all of them in RAM is not a good idea.
Is there any good data structure or even a database for this purpose?
You didn't provide enough details on the event input format and how events are delivered to the statistics backend: is it a stream of UDP messages, HTTP PUT/POST requests, or something else?
One possible solution would be to use the Yandex ClickHouse database.
A rough description of the suggested pattern:
Load incoming raw events from your application into a memory-based table Events with the Buffer storage engine.
Create a materialized view with per-minute aggregation into another memory-based table EventsPerMinute with the Buffer engine.
Do the same for hourly aggregation of data in EventsPerHour.
Optionally, use Grafana with the ClickHouse datasource plugin to build dashboards.
In ClickHouse, a Buffer table not associated with any on-disk table is kept entirely in memory, and older data is automatically replaced with fresh data. This gives you simple housekeeping for the raw data.
The tables (materialized views) EventsPerMinute and EventsPerHour can also be created with the MergeTree storage engine in case you want to keep the statistics on disk. ClickHouse can easily handle billions of records.
At 100K events/second you may need some kind of shaper/load balancer in front of the database.
You can think of a Hazelcast cluster instead of plain RAM. I would also consider Graylog or plain Elasticsearch, but with this kind of load you should test. You can think about your data structure as well: construct an hour map for each address and put each event into its hour bucket. When the hour passes, you can calculate the count and cache it in that hour's bucket. When you need minute granularity, you go to the hour bucket and count the events in that hour's list.
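As a sketch of that in-memory bucket idea, a per-key ring buffer of per-minute counters gives a sliding window with constant memory per key (the class name and the 60-bucket sizing are illustrative):
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLongArray;

// Sliding-window counter: one bucket per minute, kept for the last 60 minutes.
// Stale buckets are overwritten in place, so memory per key stays constant.
public class SlidingWindowCounter {

    private static final int BUCKETS = 60; // one hour of per-minute buckets

    private final ConcurrentHashMap<String, Window> windows = new ConcurrentHashMap<>();

    private static final class Window {
        final AtomicLongArray counts = new AtomicLongArray(BUCKETS);
        final AtomicLongArray bucketMinute = new AtomicLongArray(BUCKETS); // which minute each slot holds
    }

    // Record one event for the given key (e.g. an IP address) at the given time.
    public void increment(String key, long epochMillis) {
        long minute = epochMillis / 60_000;
        Window w = windows.computeIfAbsent(key, k -> new Window());
        int slot = (int) (minute % BUCKETS);
        // Take over a slot that still holds an older minute; the check-then-set is
        // not fully atomic, which is acceptable for approximate rate limiting.
        if (w.bucketMinute.get(slot) != minute) {
            w.bucketMinute.set(slot, minute);
            w.counts.set(slot, 0);
        }
        w.counts.incrementAndGet(slot);
    }

    // Count of events for the key in the last `minutes` minutes (up to 60).
    public long count(String key, long nowMillis, int minutes) {
        Window w = windows.get(key);
        if (w == null) {
            return 0;
        }
        long nowMinute = nowMillis / 60_000;
        long total = 0;
        for (int i = 0; i < Math.min(minutes, BUCKETS); i++) {
            long minute = nowMinute - i;
            int slot = (int) (minute % BUCKETS);
            if (w.bucketMinute.get(slot) == minute) { // only count slots that are still current
                total += w.counts.get(slot);
            }
        }
        return total;
    }
}
The same structure works for hourly and daily windows by changing the bucket width; a distributed cache like Hazelcast would replace the local ConcurrentHashMap if the counters must be shared across nodes.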

AWS Kinesis Stream Aggregating Based on Time Spans

I currently have a Kinesis stream that is populated with JSON messages that are in the form of:
{"datetime": "2017-09-29T20:12:01.755z", "payload":"4"}
{"datetime": "2017-09-29T20:12:07.755z", "payload":"5"}
{"datetime": "2017-09-29T20:12:09.755z", "payload":"12"}
etc...
What I'm trying to accomplish here is to aggregate the data into time chunks. In this case, I'd like to group the averages into 10-minute spans. For example, from 12:00 to 12:10, I want to average the payload values and save the result as the 12:10 value.
For example, the above data would produce:
Datetime: 2017-09-29T20:12:10.00z
Average: 7
The method I'm thinking of is to use caching at the service level along with some way to track the time. If the messages ever move into the next 10-minute timespan, I average the cached data, store it in the DB, and then delete that cache value.
Currently, my service sees 20,000 messages every minute, with higher volume expected in the future. I'm a little stuck on how to implement this so that I'm guaranteed to get all the values for a given 10-minute period from Kinesis. For those of you more familiar with Kinesis and AWS, is there a simple way to go about this?
The reason for doing this is to shorten query times for data over large timespans, such as 1 year. I wouldn't want to grab millions of values, but rather a few aggregated values.
Edit:
I have to keep track of many different averages at the same time. For example, the above JSON may pertain to just one 'set', such as the average temperature per city in 10-minute timespans. This requires me to keep track of each city's averages for every timespan.
Toronto (12:01 - 12:10): average_temp
New York (12:01 - 12:10): average_temp
Toronto (12:11 - 12:20): average_temp
New York (12:11 - 12:20): average_temp
etc...
This could pertain to any city worldwide. If new temperatures arrive for, say, Toronto, and they pertain to the 12:01 - 12:10 timespan, I have to recalculate and store that average.
This is how I would do it. Thanks for the interesting question.
Kinesis Streams --> Lambda (Event Insertor) --> DynamoDB(Streams) --> Lambda(Count and Value incrementor) --> DynamoDB(streams) --> Average (Updater)
DynamoDB Table Structure:
{
Timestamp: 1506794597
Count: 3
TotalValue: 21
Average: 7
Event{timestamp}-{guid}: { event }
}
timestamp -- timestamp of the actual event
guid -- avoids collisions between events that occurred at the same timestamp
Event{timestamp}-{guid} -- this should be removed by the count and value incrementor
When, say, the fourth record for that timestamp arrives:
Round the time to the nearest 10-minute timespan, then increment the count and increment the total value. Never read the value and then increment it; that will produce errors unless you use strongly consistent reads (which are very costly). Instead, perform the increment as an atomic counter update.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItems.html#WorkingWithItems.AtomicCounters
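A sketch of that atomic increment with the Java SDK (the table name "Averages", the key schema, and the 10-minute bucketing are assumptions):
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.UpdateItemRequest;
import java.util.HashMap;
import java.util.Map;

// Sketch of the atomic increment for one incoming reading. The 600-second
// bucket gives the 10-minute timespan from the question.
public class BucketIncrementer {

    private final AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

    public void record(String city, long epochSeconds, double value) {
        long bucket = (epochSeconds / 600) * 600; // round down to the 10-minute span

        Map<String, AttributeValue> key = new HashMap<>();
        key.put("city", new AttributeValue().withS(city));
        key.put("bucket", new AttributeValue().withN(Long.toString(bucket)));

        Map<String, AttributeValue> values = new HashMap<>();
        values.put(":one", new AttributeValue().withN("1"));
        values.put(":v", new AttributeValue().withN(Double.toString(value)));

        // ADD is applied server-side, so concurrent writers never clobber each other;
        // no read-modify-write and no strongly consistent read is needed.
        dynamo.updateItem(new UpdateItemRequest()
                .withTableName("Averages")
                .withKey(key)
                .withUpdateExpression("ADD readingCount :one, totalValue :v")
                .withExpressionAttributeValues(values));
    }
}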
Create a DynamoDB stream from the above table and listen to it in another Lambda; that Lambda calculates the average value and updates it.
When you calculate the average, don't perform a read from the table. The data is already available on the stream; you just need to calculate the average and update it (overwriting the previous average value).
This will work on any scale and with high availability.
Hope it helps.
EDIT1:
Since the OP is not familiar with AWS services, here is documentation for the AWS cloud services used in the solution:
Lambda: https://aws.amazon.com/lambda/
DynamoDB: https://aws.amazon.com/dynamodb/

Data structure and file structure for storing append only messages?

A message is a bundle of data of variable size with a unique message ID (an integer). I'd like a design/data structure/algorithm to:
be able to efficiently store the messages on disk; the number of messages can be very large and their length is variable, but there is no update or modification of stored ones
be able to retrieve a message by its message ID, i.e. return the stored message
recently stored messages are queried more often than old ones
each message has a TTL, so I need a way to truncate files containing old messages
What is the proper data structure and file structure for this need?
If we're talking five messages per second, then you're talking on the order of a half million messages per day.
What I've done in the past is maintain multiple files. If the TTL for messages is measured in days, I have one messages file per day. The process that reads and stores messages creates a new file for the first message of a new day. This is trivial to implement by keeping track of the date and time the last message was received.
I also maintain a paired index file with each messages file. This, too, is a simple sequential file that contains the message ID and position of each message. So to look up a message for a particular day, you load that day's index, do a binary search for the message ID, and then use the corresponding position to look up the message in the messages file. Lookup within the index is very fast; if the message IDs are sequential with no numbers missing you can even compute the entry's position directly, and if numbers can be missing, binary search works well. With only about 512K messages per day, binary search will be very fast.
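As a sketch of that index lookup (assuming fixed-width 16-byte entries, an 8-byte message ID followed by an 8-byte file offset; the class and file layout are illustrative):
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Each index entry is 16 bytes: 8-byte message ID followed by the 8-byte
// offset of that message in the day's messages file.
public class DailyIndex {

    private static final int ENTRY_SIZE = 16;
    private final FileChannel index;

    public DailyIndex(Path indexFile) throws IOException {
        this.index = FileChannel.open(indexFile, StandardOpenOption.READ);
    }

    // Binary search the sorted index; returns the message's file offset, or -1 if absent.
    public long findOffset(long messageId) throws IOException {
        long lo = 0;
        long hi = index.size() / ENTRY_SIZE - 1;
        ByteBuffer entry = ByteBuffer.allocate(ENTRY_SIZE);
        while (lo <= hi) {
            long mid = (lo + hi) >>> 1;
            entry.clear();
            index.read(entry, mid * ENTRY_SIZE); // positional read, no shared seek state
            entry.flip();
            long id = entry.getLong();
            if (id == messageId) {
                return entry.getLong();          // the offset follows the ID
            } else if (id < messageId) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return -1;
    }
}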
To handle multiple days, you have the lookup program's startup sequence scan the directory for all daily message indexes and build a meta-index that contains the IDs for the first message in each day.
To delete old messages, you have the lookup program delete old files on startup, or have it do that at midnight every day. At that time it can also get the ID for the first message in the next day's file.
Or, the message gatherer can spawn a task to delete old files when it receives the first message for a new day. You can also make it notify the lookup program of the new day so that the lookup program can update its meta index.
With only 512K messages per day (5 per second is about a half million per day), you should be able to keep 10 day's worth of index entries in memory without trouble. Your index will contain a message ID and file offset, so figure 16 bytes per entry. Times 5 million for 10 days, that's like 80 megabytes: pocket change. To remove old entries (once per day), just delete that day's index from memory.
If messages have varying TTL, then you keep older messages around but keep track of their TTL. When somebody looks up an expired message, you'll have to do a secondary check on the expiration date before returning it. And of course you'll have to keep track of the longest TTL for each day so that you can delete the file when all of its messages have expired.
This is a pretty low-tech solution, but you can code it up in a day and it works and performs surprisingly well. I've used it in several projects, to great effect.

Multiple small inserts in clickhouse

I have an events table (MergeTree) in ClickHouse and want to run a lot of small inserts at the same time. However, the server becomes overloaded and unresponsive. Moreover, some of the inserts are lost. There are a lot of records like this in the ClickHouse error log:
01:43:01.668 [ 16 ] <Error> events (Merger): Part 20161109_20161109_240760_266738_51 intersects previous part
Is there a way to optimize such queries? I know I can use bulk inserts for some types of events; basically, running one insert with many records, which ClickHouse handles pretty well. However, some of the events, such as clicks or opens, cannot be handled in this way.
The other question: why does ClickHouse decide that similar records exist when they don't? There are similar records at the time of insert which have the same fields as in the index, but other fields differ.
From time to time I also receive the following error:
Caused by: ru.yandex.clickhouse.except.ClickHouseUnknownException: ClickHouse exception, message: Connect to localhost:8123 [ip6-localhost/0:0:0:0:0:0:0:1] timed out, host: localhost, port: 8123; Connect to ip6-localhost:8123 [ip6-localhost/0:0:0:0:0:0:0:1] timed out
... 36 more
This happens mostly during project builds, when tests against the ClickHouse database are run.
ClickHouse has a special type of table for this: Buffer. It is stored in memory and allows many small inserts without problems. We have nearly 200 different inserts per second and it works fine.
Buffer table:
CREATE TABLE logs.log_buffer (rid String, created DateTime, some String, d Date MATERIALIZED toDate(created))
ENGINE = Buffer('logs', 'log_main', 16, 5, 30, 1000, 10000, 1000000, 10000000);
Main table:
CREATE TABLE logs.log_main (rid String, created DateTime, some String, d Date)
ENGINE = MergeTree(d, sipHash128(rid), (created, sipHash128(rid)), 8192);
Details in manual: https://clickhouse.yandex/docs/en/operations/table_engines/buffer/
This is a known issue when processing a large number of small inserts into a (non-replicated) MergeTree.
This is a bug; we need to investigate and fix it.
As a workaround, you should send inserts in larger batches, as recommended, about one batch per second: https://clickhouse.tech/docs/en/introduction/performance/#performance-when-inserting-data.
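As a sketch of client-side batching with the ClickHouse JDBC driver, reusing the log_main table from the answer above (standard JDBC batch API; the Event holder and the batching policy are assumptions):
import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.List;

// Accumulate events in memory and write them as one INSERT per batch,
// e.g. once per second or once per N events, instead of row-by-row.
public class BatchedEventWriter {

    public void writeBatch(List<Event> events) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/logs");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO logs.log_main (rid, created, some, d) VALUES (?, ?, ?, ?)")) {
            for (Event e : events) {
                ps.setString(1, e.rid);
                ps.setTimestamp(2, new Timestamp(e.createdMillis));
                ps.setString(3, e.some);
                ps.setDate(4, new Date(e.createdMillis));
                ps.addBatch();
            }
            ps.executeBatch(); // one request instead of many small inserts
        }
    }

    // Minimal event holder for the sketch.
    public static class Event {
        String rid;
        long createdMillis;
        String some;
    }
}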
I've had a similar problem, although not as bad - making ~20 inserts per second caused the server to reach a high load average, memory consumption, and CPU use. I created a Buffer table which buffers the inserts in memory, and they are then flushed periodically to the "real" on-disk table. And just like magic, everything went quiet: load average, memory, and CPU usage came down to normal levels. The nice thing is that you can run queries against the Buffer table and get back matching rows from both memory and disk, so clients are unaffected by the buffering. See https://clickhouse.tech/docs/en/engines/table-engines/special/buffer/
Alternatively, you can use something like https://github.com/nikepan/clickhouse-bulk: it will buffer multiple inserts and flush them all together according to user policy.
The design of ClickHouse merge engines is not meant to take small writes concurrently. As far as I understand, MergeTree merges the parts of data written to a table based on partitions and then re-organizes the parts for better aggregated reads. If you do small writes often, you will encounter another exception from the merger:
Error: 500: Code: 252, e.displayText() = DB::Exception: Too many parts (300). Merges are processing significantly slower than inserts
When you try to understand why the above exception is thrown, the idea becomes a lot clearer: ClickHouse needs to merge data, and there is an upper limit on how many parts can exist. Every write in a batch is added as a new part and is eventually merged with the partitioned table.
SELECT
table, count() as cnt
FROM system.parts
WHERE database = 'dbname' GROUP BY `table` order by cnt desc
The above query can help you monitor parts: observe while writing how the parts increase and eventually merge down.
My best bet for the above would be buffering the data set and periodically flushing it to the DB, but then that means no real-time analytics.
Using a Buffer table is good; however, please consider these points:
If the server is restarted abnormally, the data in the buffer is lost.
FINAL and SAMPLE do not work correctly for Buffer tables. These conditions are passed to the destination table but are not used for processing data in the buffer.
When adding data to a Buffer, one of the buffers is locked (so no reads).
If the destination table is replicated, some expected characteristics of replicated tables are lost when writing to a Buffer table (no deduplication).
Please read the documentation thoroughly; it's a special-case engine: https://clickhouse.tech/docs/en/engines/table-engines/special/buffer/
