Kafka Windowed State Stores not cleaning up after retention - apache-kafka-streams

For some reason my old state stores are not being cleaned up after the retention period expires. I am testing locally, so I am just sending a single test message every 5 minutes or so. I have the retention durations set low just for testing: retentionPeriod = 120, retentionWindowSize = 15, and I assume retain duplicates should be false. When should that be true?
Stores.persistentWindowStore(storeName,
        Duration.of(retentionPeriod, ChronoUnit.SECONDS),
        Duration.of(retentionWindowSize, ChronoUnit.SECONDS),
        false)
When I ls in the state store directory I see the old stores well after the retention period has expired, for example store.1554238740000 (assuming the number is epoch ms). I am well past the 2 minute retention time and that directory is still there.
What am I missing?
Note, it does eventually clean up, just a lot later than I was expecting. What triggers the cleanup?

Retention time is a minimum guarantee for how long data is stored. To make expiration efficient, so-called segments are used to partition the timeline into "buckets". A segment is only dropped once all the data in it can be expired. By default, Kafka Streams uses 3 segments. Thus, for your example with a retention time of 120 seconds, each segment will be 60 seconds long (not 40 seconds). The reason is that the oldest segment can only be deleted once all data in it has passed the retention time. If the segment size were only 40 seconds, 4 segments would be required to achieve this:
S1 [0,40) -- S2 [40,80) -- S3 [80,120)
If a record with timestamp 121 should be stored, S1 cannot be deleted yet, because it contains data with timestamps 1 to 40 that have not yet passed the retention period. Thus, a new segment S4 would be required. With a segment size of 60, 3 segments are sufficient:
S1 [0,60) -- S2 [60,120) -- S3 [120,180)
In this case, if a record with timestamp 181 arrives, all data in the first segment is older than 181 - 120 = 61 and has thus passed the retention time, so S1 can be deleted before S4 is created.
Note that since Kafka 2.1 the internal mechanism is still the same; however, Kafka Streams enforces the retention period strictly at the application level, i.e., writes are dropped and reads return null for all data past the retention period (even if the data is physically still there, because its segment is still in use).
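As a concrete illustration of the arithmetic above (a sketch of the reasoning in the answer, not Kafka Streams' actual internal code):

public class SegmentArithmetic {
    public static void main(String[] args) {
        long retentionMs = 120_000L;   // retention period from the question (120 seconds)
        int numSegments = 3;           // Kafka Streams default mentioned in the answer

        // 120s spread over (3 - 1) segment boundaries -> 60-second segments, not 40
        long segmentSizeMs = retentionMs / (numSegments - 1);

        // A record with timestamp 181s arrives: everything older than 181 - 120 = 61s
        // has passed retention, so the whole [0, 60) segment can be dropped.
        long recordTs = 181_000L;
        long expiryThreshold = recordTs - retentionMs;

        System.out.println("segment size (s):         " + segmentSizeMs / 1000);
        System.out.println("expiry threshold (s):     " + expiryThreshold / 1000);
        System.out.println("segment [0,60) droppable: " + (60_000L <= expiryThreshold));
    }
}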

Related

NiFi MergeRecords leaving out one file

I'm using NiFi to take in some user data and combine all the JSONs into one record. The MergeRecord processor is working just like I need, except it always leaves out one record (usually the same one every time). The processor is set to run every 60 seconds. I can't understand why, because there are only 56 records to merge. I've included images below for any help y'all may have.
Firstly, you have 56 FlowFiles; that does not necessarily mean 56 Records unless you have 1 Record per FlowFile.
You are using MergeRecord which counts Records, not files.
Your current config is set to Min 50 - Max 1000 Records
If you have 56 files with 1 Record in each, then merging 50 files is enough to meet the Minimum condition and release the bucket.
You also say Merge is set to run every 60 seconds, and perhaps this is not doing what you think it is. In almost all cases, Merge should be left to the default 0 sec schedule.
NiFi has no idea what "all" means; it takes an input and works on it - it does not know if or when the next input will come.
If every FlowFile is 1 Record, and it is categorically always 56 and that will never change, then your setting could be Min 56 - Max 56 and it will always merge exactly 56 Records at a time.
However, that is very inflexible to change - if it suddenly changed to 57, you need to modify the flow.
Instead, you could set the Min-Max to very high numbers, say 10,000-20,000 and then set a Max Bin Age to 60 seconds (and the processor scheduling back to 0 sec). This would have the effect of merging every Record that enters the processor until A) 10-20k Records have been merged, or B) 60 seconds expire.
Example scenarios:
A) All 56 arrive within the first 2 seconds of the flow starting
All 56 are merged into 1 file 60 seconds after the first file arrives
B) 53 arrive within the first 60 seconds, 3 arrive in the second 60 seconds
The first 53 are merged into 1 file 60 seconds after the first file arrives; the last 3 are merged into another file 60 seconds after the first of those 3 arrives
C) 10,000 arrive in the first 5 seconds
All 10k will merge immediately into 1 file, they will not wait for 60 seconds
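As a toy illustration of the release conditions described above (a Java model of the behaviour, not NiFi's actual MergeRecord code): a bin becomes eligible to merge once it reaches the minimum record count, and is flushed anyway once the Max Bin Age expires.

import java.time.Duration;
import java.time.Instant;

// Toy model of the bin-release conditions discussed above; not NiFi code.
class Bin {
    private final int minRecords;      // e.g. 50, or 10_000 in the suggested config
    private final int maxRecords;      // e.g. 1000, or 20_000 in the suggested config
    private final Duration maxBinAge;  // e.g. 60 seconds
    private final Instant createdAt = Instant.now();
    private int recordCount = 0;

    Bin(int minRecords, int maxRecords, Duration maxBinAge) {
        this.minRecords = minRecords;
        this.maxRecords = maxRecords;
        this.maxBinAge = maxBinAge;
    }

    void add(int records) {
        recordCount += records;
    }

    // The bin releases when the record thresholds or the age limit is hit.
    boolean shouldMerge(Instant now) {
        boolean minReached = recordCount >= minRecords;
        boolean full = recordCount >= maxRecords;     // reaching the max also forces the bin closed
        boolean expired = Duration.between(createdAt, now).compareTo(maxBinAge) >= 0;
        return minReached || full || expired;
    }

    public static void main(String[] args) {
        // The question's config: Min 50, Max 1000 Records
        Bin bin = new Bin(50, 1000, Duration.ofSeconds(60));
        bin.add(50);                                        // the first 50 records arrive
        System.out.println(bin.shouldMerge(Instant.now())); // true: the bin releases before
                                                            // the remaining 6 records arrive
    }
}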

Why are queries getting progressively slower when using Postgres with Spring Batch?

I'm running a job using Spring Batch 4.2.0 with Postgres (11.2) as the backend, all wrapped in a Spring Boot app. I have 5 steps, and each one uses a simple partitioning strategy to divide the data by id ranges into partitions that are processed by separate threads. There are about 18M rows in the table; each step reads all 18M rows, changes a few fields, and writes them back. The issue I'm facing is that the queries that pull data into each thread scan by id range, like:
select field_1, field_2, field_66 from table where id >= 1 and id < 10000.
In this case each thread processes 10,000 rows at a time. When there's no traffic the query takes less than a second to read all 10,000 rows. But when the job runs there are about 70 threads reading all that data in, and it gets progressively slower, to almost a minute and a half. Any ideas where to start troubleshooting this?
I do see autovacuum running in the background for almost the whole duration of the job. The app definitely has enough memory to hold all that data (about 6GB max heap). Postgres has shared_buffers of 2GB and max_wal_size of 2GB, but I'm not sure if that in itself is sufficient. Another thing I see is loads of COMMIT queries hanging around when checking pg_stat_activity, usually as many as the number of partitions. So instead of 70 connections being used by 70 partitions, there are 140 connections in use, with 70 of them running COMMIT. As time progresses these COMMITs get progressively slower too.
You are probably hitting https://github.com/spring-projects/spring-batch/issues/3634.
This issue has been fixed and will be part of version 4.2.3 planned to be released this week.
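For context, this is roughly the shape of the id-range partitioning the question describes; the class, table, and key names below are illustrative assumptions, not the asker's actual code:

import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Splits the id space into fixed-size ranges; each partition becomes the
// "id >= ? and id < ?" predicate of the query shown in the question.
public class IdRangePartitioner implements Partitioner {

    private final long minId;
    private final long maxId;
    private final long rangeSize; // e.g. 10,000 ids per partition

    public IdRangePartitioner(long minId, long maxId, long rangeSize) {
        this.minId = minId;
        this.maxId = maxId;
        this.rangeSize = rangeSize;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        int i = 0;
        for (long start = minId; start <= maxId; start += rangeSize, i++) {
            ExecutionContext ctx = new ExecutionContext();
            ctx.putLong("fromId", start);            // id >= fromId
            ctx.putLong("toId", start + rangeSize);  // id <  toId
            partitions.put("partition" + i, ctx);
        }
        return partitions;
    }
}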

What is a reasonable value for StreamsConfig.COMMIT_INTERVAL_MS_CONFIG for Kafka Streams

I was looking at some Confluent examples for Kafka Streams, and the different values used for the configuration 'StreamsConfig.COMMIT_INTERVAL_MS_CONFIG' confused me a little bit.
For example, in the microservices example,
config.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1); //commit as fast as possible
https://github.com/confluentinc/kafka-streams-examples/blob/5.1.0-post/src/main/java/io/confluent/examples/streams/microservices/util/MicroserviceUtils.java
Another one,
// Records should be flushed every 10 seconds. This is less than the default
// in order to keep this example interactive.
streamsConfiguration.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10 * 1000);
https://github.com/confluentinc/kafka-streams-examples/blob/5.1.0-post/src/main/java/io/confluent/examples/streams/WordCountLambdaExample.java
Another one,
// Set the commit interval to 500ms so that any changes are flushed frequently
// and the top five charts are updated with low latency.
streamsConfiguration.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 500);
https://github.com/confluentinc/kafka-streams-examples/blob/5.1.0-post/src/main/java/io/confluent/examples/streams/interactivequeries/kafkamusic/KafkaMusicExample.java
In the examples the interval ranges from 1 ms to 10000 ms. What I am really interested in is the 1 ms value: in a system that is under heavy load all the time, could it be dangerous to go with a 1 ms commit interval?
Thx for answers..
Well, it depends on how frequently you want to commit your records. It is closely related to record caching in memory:
https://kafka.apache.org/21/documentation/streams/developer-guide/memory-mgmt.html#record-caches-in-the-dsl
If you want to see every record as output, you can set it to the lowest value. In scenarios where you want output for each event, the lowest value makes sense. But in scenarios where it is okay to consolidate events and produce fewer outputs, you can set it to a higher value.
Also be aware that record caching is affected by these two configurations:
commit.interval.ms and cache.max.bytes.buffering
The semantics of caching is that data is flushed to the state store and forwarded to the next downstream processor node whenever the earlier of commit.interval.ms or cache.max.bytes.buffering (cache pressure) is hit.
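For instance, a minimal sketch of a Streams configuration that sets both properties together (the values below are arbitrary examples, not recommendations):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class CommitIntervalConfig {
    public static Properties streamsProperties() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // Flush/commit at most every 5 seconds...
        props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 5 * 1000);
        // ...or earlier, if the record caches fill up (10 MB across all threads).
        props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024L);
        return props;
    }
}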

Storm: Min/max aggregation across several sliding windows with varying sizes

I wonder what the best practice is to approach the following problem in Apache Storm.
I have a single spout that generates a stream of integer values with an explicit timestamp attached. The goal is to perform min/max aggregation with sliding windows of several sizes over this stream:
last hour
last day, i.e. last 24 hours
Last hour is easy:
topology.setBolt("1h", ...)
.shuffleGrouping("spout")
.withWindow(Duration.hours(1), Duration.seconds(10))
.withTimestampField("timestamp"));
However, for longer periods I am concerned about the queue sizes of the windows. When I consume the tuples directly from the spout as with the last-hour aggregation, every single tuple would end up in the queue.
One possibility would be to consume the tuples from the pre-aggregated "1h" bolt. However, since I am using explicit timestamps, late tuples arriving from the "1h" bolt are ignored. A 1 hour lag is not an option as this delays the evaluation of the window. Is there a way to "allow" late tuples without impacting the timeliness of the results?
Of course I could also store away an aggregate every hour and then compute the minimum over the last 24 hours including the latest value from the "1h" stream. But I am curious if there is a way to do this properly using Storm means.
Update 1
Thanks to arunmahadevan's answer I changed the 1h min bolt to emit the minimum tuple with the maximum timestamp of all tuples in the respective 1h window. That way the consuming bolt does not discard the tuple due to late arrival. I also introduced a new field original-timestamp to retain the original timestamp of the minimum tuple.
Update 2
I finally found an even better way by only emitting state changes in the 1h min bolt. Storm does not advance the time in the consuming bolt as long as no new tuples are received, hence the late-arrival issue is prevented. Also, I get to keep the original timestamp without copying it into a separate field.
I think periodically emitting the min from "1h" to "24h" bolt should work and keep the "24h" queue size in check.
If you configure a lag, the bolt's execute is invoked only after that lag (i.e. when the event time crosses the sliding interval + lag).
Let's say the "1h" bolt is configured with a lag of 1 min; execute will then be invoked for the tuples between 01:00 and 02:00 only after the event time crosses 02:01 (i.e. the bolt has seen an event with timestamp >= 02:01). execute will, however, only receive the tuples between 01:00 and 02:00.
Now if you compute the last one-hour minimum and emit the result to a "24h" bolt that has a sliding interval of, say, 1 hr and lag = 0, it will trigger once an incoming event's timestamp crosses the next hour. If you emitted the 01:00-02:00 min with a timestamp of 02:00, the "24h" window will trigger (for the events between 02:00 the previous day and 02:00) as soon as it receives the min event, since the event time crossed the next hour and the configured lag is 0.
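A rough sketch of how the two windowed bolts from this discussion could be wired together; MinBolt and ValueSpout are hypothetical placeholders for the asker's spout and min-aggregation bolt (which would extend BaseWindowedBolt), not code from the question:

import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseWindowedBolt.Duration;

public class MinMaxTopology {
    public static TopologyBuilder build() {
        TopologyBuilder topology = new TopologyBuilder();
        topology.setSpout("spout", new ValueSpout());   // hypothetical spout

        // 1h window sliding every 10s, with a 1 min lag as in the example above
        topology.setBolt("1h", new MinBolt()            // hypothetical windowed bolt
                .withWindow(Duration.hours(1), Duration.seconds(10))
                .withTimestampField("timestamp")
                .withLag(Duration.minutes(1)))
            .shuffleGrouping("spout");

        // 24h window fed by the pre-aggregated 1h minima, sliding every hour, lag = 0
        topology.setBolt("24h", new MinBolt()
                .withWindow(Duration.hours(24), Duration.hours(1))
                .withTimestampField("timestamp")
                .withLag(Duration.seconds(0)))
            .shuffleGrouping("1h");

        return topology;
    }
}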

Spreading/smoothing periodic tasks out over time

I have a database table with N records, each of which needs to be refreshed every 4 hours. The "refresh" operation is pretty resource-intensive. I'd like to write a scheduled task that runs occasionally and refreshes them, while smoothing out the spikes of load.
The simplest task I started with is this (pseudocode):
every 10 minutes:
    find all records that haven't been refreshed in 4 hours
    for each record:
        refresh it
        set its last refresh time to now
(Technical detail: "refresh it" above is asynchronous; it just queues a task for a worker thread pool to pick up and execute.)
What this causes is a huge resource (CPU/IO) usage spike every 4 hours, with the machine idling the rest of the time. Since the machine also does other stuff, this is bad.
I'm trying to figure out a way to get these refreshes to be more or less evenly spaced out -- that is, on every run I'd want around N/(4 hours / 10 mins) = N/24 of those records to be refreshed. Of course, it doesn't need to be exact.
Notes:
I'm fine with the algorithm taking time to start working (so say, for the first 24 hours there will be spikes but those will smooth out over time), as I only rarely expect to take the scheduler offline.
Records are constantly being added and removed by other threads, so we can't assume anything about the value of N between iterations.
I'm fine with records being refreshed every 4 hours +/- 20 minutes.
Do a full refresh, to get all your timestamps in sync. From that point on, every 10 minutes, refresh the oldest N/24 records.
The load will be steady from the start, and after 24 runs (4 hours), all your records will be updating at 4-hour intervals (if N is fixed). Insertions will decrease refresh intervals; deletions may cause increases or decreases, depending on the deleted record's timestamp. But I suspect you'd need to be deleting quite a lot (like, 10% of your table at a time) before you start pushing anything outside your 40-minute window. To be on the safe side, you could do a few more than N/24 each run.
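A minimal sketch of that schedule in plain JDBC; the records table, the last_refreshed_at column, and enqueueRefresh are assumptions standing in for the question's table and worker-pool hand-off:

import java.sql.*;

// Sketch only: table/column names and the refresh hand-off are illustrative.
class SmoothedRefresher {
    private final Connection conn;

    SmoothedRefresher(Connection conn) { this.conn = conn; }

    // Called every 10 minutes; refreshes roughly N/24 of the oldest records,
    // so a full pass over the table takes about 4 hours.
    void refreshBatch() throws SQLException {
        long total;
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT count(*) FROM records")) {
            rs.next();
            total = rs.getLong(1);
        }
        long batchSize = Math.max(1, (long) Math.ceil(total / 24.0));

        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT id FROM records ORDER BY last_refreshed_at LIMIT ?")) {
            ps.setLong(1, batchSize);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    enqueueRefresh(rs.getLong(1)); // async hand-off to the worker pool
                }
            }
        }
    }

    private void enqueueRefresh(long id) {
        // placeholder for the question's "queue a task for a worker thread pool"
    }
}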
Each minute:
    take all records older than 4:10 and refresh them
    if the previous step did not find a lot of records:
        take some of the oldest records older than 3:40 and refresh them
This should eventually make the last update times more evenly spaced out. What "a lot" and "some" mean is up to you to decide (possibly based on N).
Give each record its own refresh interval, which is a random number between 3:40 and 4:20.
