Kafka Streams: Handle Aging of events in a stream on window expiry - apache-kafka-streams

I'm currently using Kafka Streams to collate related events within a window. If not all of the related events arrive within the window, is there a way in Kafka Streams to get a handle on the events that have expired? This would help in handling, or notifying the downstream application, that not all related events arrived for collation. Appreciate your response.
Below are the examples
Example-1:
- GroupID: g1
- Events arrival: E1 at 10am, E2 at 10:01am and E3 at 10:02am
- Window: Session Window of inactivity duration of 5 mins.
- Result: All the events are collated successfully.
Example-2:
- Events arrival: E1 at 10am, E2 at 10:01am; E3 doesn't arrive
- Window: Session Window of inactivity duration of 5 mins.
- Result: Trigger an action OR get notified via a listener for partial
collation upon window expiry for E1 and E2 at 10:06 am

Windows in Kafka Streams don't "expire"; they are kept open to allow the handling of late-arriving data.
Compare How to send final kafka-streams aggregation result of a time windowed KTable?
It's not possible to register any callback:
- not for the case that "stream time" advances past the "window end time", and
- not for the case that a window is finally dropped (i.e., after the retention period has passed).

I have not tried it, but it seems like window final results might do it:
https://kafka.apache.org/24/documentation/streams/developer-guide/dsl-api.html#window-final-results
The idea is to check if all events have arrived when the window closes and trigger some action if this is not the case.
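A minimal sketch of that idea, assuming an input topic named "events", records already keyed by group ID, and an expected count of 3 related events per group (the topic name and the count are illustrative, not from the question):

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.SessionWindows;
import org.apache.kafka.streams.kstream.Suppressed;
import java.time.Duration;

StreamsBuilder builder = new StreamsBuilder();
// Hypothetical input topic; records are assumed to be keyed by group ID (e.g. "g1")
KStream<String, String> events = builder.stream("events");

events
    .groupByKey()
    // 5-minute inactivity gap; grace(0) lets the window close as soon as stream time passes its end
    .windowedBy(SessionWindows.with(Duration.ofMinutes(5)).grace(Duration.ZERO))
    .count()
    // Emit only the final result per session window, once the window has closed
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
    .toStream()
    .foreach((windowedGroupId, count) -> {
        if (count < 3) { // 3 = expected number of related events (assumption)
            // Only a partial set of events arrived before the window closed;
            // notify the downstream application here.
        }
    });
```

Note that suppress() is driven by stream time, so the final (possibly partial) result is only emitted once newer records move stream time past the window end plus the grace period.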

Related

How to limit Message consumption rate of Kafka Consumer in SpringBoot? (Kafka Stream)

I want to limit my Kafka consumer's message consumption rate to 1 message per 10 seconds. I'm using Kafka Streams in Spring Boot.
Following are the properties I tried to make this work, but it didn't work out as expected (it consumed many messages at once).
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, brokersUrl);
config.put(StreamsConfig.APPLICATION_ID_CONFIG, applicationId);
config.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, autoOffsetReset);
//
config.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG,1);
config.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 10000);
Is there any way to manually ACK (manual offset commits) in Kafka Streams? That would be useful to control the message consumption rate.
Please note that I'm using KStreams (Kafka Streams).
Any help is really appreciated. :)
I think you misunderstand what MAX_POLL_INTERVAL_MS_CONFIG actually does.
That is the maximum time the client is allowed to take between polls, not a delay it waits.
From the docs:
"controls the maximum time between poll invocations before the consumer will proactively leave the group (5 minutes by default). The value of the configuration request.timeout.ms (default to 30 seconds) must always be smaller than max.poll.interval.ms (default to 5 minutes), since that is the maximum time that a JoinGroup request can block on the server while the consumer is rebalancing"
It says "maximum time", not any "delay" between poll invocations.
Kafka Streams will constantly poll; you cannot easily pause/resume it or delay record polling.
To read an event every 10 seconds without dropping consumers from the group due to missed heartbeats, you should use the Consumer API: set max.poll.records=1, call pause(), Thread.sleep() for 10 seconds, then resume() and poll() again.
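A minimal sketch of that pause/sleep/resume loop with the plain Consumer API; the topic name, group ID and bootstrap servers are placeholders, and the record "processing" is just a print:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ThrottledConsumer {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "throttled-consumer");      // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1);                 // one record per poll

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));        // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.value());                       // process the record here
                }
                if (!records.isEmpty()) {
                    // Pause so subsequent polls return nothing, wait 10 seconds, then resume
                    consumer.pause(consumer.assignment());
                    Thread.sleep(Duration.ofSeconds(10).toMillis());
                    consumer.resume(consumer.assignment());
                }
            }
        }
    }
}
```

With a 10-second sleep well below max.poll.interval.ms, the pause()/resume() calls are mostly belt-and-braces; they matter when the delay approaches the poll interval and you want to keep calling poll() while paused.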
Finally, I achieved the desired message consumption limit using Thread.sleep().
Since there is no way to control the message consumption rate using Kafka config properties alone, I had to use my application code to control the rate of consumption.
Example: say I want to limit the record consumption rate to 4 messages per 10 seconds. I just consume 4 messages (keeping a count in parallel); once 4 records are consumed, I make the thread sleep for 10 seconds and then repeat the same process over again.
I know it's not a good solution but there was no other way.
thank you OneCricketeer
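A minimal sketch of the count-and-sleep loop described above, assuming the consumer is already configured and subscribed (the burst size of 4 and the print-as-processing are illustrative):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;

// Consume in bursts of 4 records, then sleep for 10 seconds before polling again
static void consumeInBursts(KafkaConsumer<String, String> consumer) throws InterruptedException {
    int consumed = 0;
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            System.out.println(record.value()); // process the record here
            consumed++;
            if (consumed == 4) {                // burst limit reached
                Thread.sleep(10_000);           // wait 10 seconds
                consumed = 0;
            }
        }
    }
}
```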

Dataflow job has high data freshness and events are dropped due to lateness

I deployed an Apache Beam pipeline to GCP Dataflow in a DEV environment and everything worked well. Then I deployed it to production in a Europe environment (to be specific: job region europe-west1, worker location europe-west1-d) where we get high data velocity, and things started to get complicated.
I am using a session window to group events into sessions. The session key is the tenantId/visitorId and its gap is 30 minutes. I am also using a trigger to emit events every 30 seconds to release events sooner than the end of session (writing them to BigQuery).
The problem appears to happen in EventToSession/GroupPairsByKey. In this step there are thousands of events under the droppedDueToLateness counter, and the dataFreshness keeps increasing (it has been increasing since I deployed the job). All steps before this one operate fine, and all steps after it are affected by it but don't seem to have any other problems.
I looked into some metrics and see that the EventToSession/GroupPairsByKey step is processing between 100K and 200K keys per second (depending on the time of day), which seems like quite a lot to me. CPU utilization doesn't go over 70% and I am using the streaming engine. The number of workers is 2 most of the time. Max worker memory capacity is 32GB while max worker memory usage currently stands at 23GB. I am using the e2-standard-8 machine type.
I don't have any hot keys since each session contains at most a few dozen events.
My biggest suspicion is the huge number of keys being processed in the EventToSession/GroupPairsByKey step. But on the other hand, a session usually relates to a single customer, so Google should expect to handle this number of keys per second, no?
I would like to get suggestions on how to solve the dataFreshness and droppedDueToLateness issues.
Adding the piece of code that generates the sessions:
input = input
    .apply("SetEventTimestamp", WithTimestamps.of(event -> Instant.parse(getEventTimestamp(event)))
        .withAllowedTimestampSkew(new Duration(Long.MAX_VALUE)))
    .apply("SetKeyForRow", WithKeys.of(event -> getSessionKey(event)))
    .setCoder(KvCoder.of(StringUtf8Coder.of(), input.getCoder()))
    .apply("CreatingWindow", Window.<KV<String, TableRow>>into(Sessions.withGapDuration(Duration.standardMinutes(30)))
        .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(30))))
        .discardingFiredPanes()
        .withAllowedLateness(Duration.standardDays(30)))
    .apply("GroupPairsByKey", GroupByKey.create())
    .apply("CreateCollectionOfValuesOnly", Values.create())
    .apply("FlattenTheValues", Flatten.iterables());
After doing some research I found the following:
Regarding the constantly increasing data freshness: as long as a session window allows late data to arrive, that specific window persists in memory. This means that allowing 30 days of late data keeps every session in memory for at least 30 days, which obviously can overload the system. Moreover, I found we had some ever-lasting sessions created by bots visiting and taking actions on the websites we monitor. These bots can hold sessions open forever, which can also overload the system. The solution was decreasing the allowed lateness to 2 days and using bounded sessions (look for "bounded sessions").
Regarding events dropped due to lateness: these are events that, at their time of arrival, belong to an expired window, i.e. a window whose end the watermark has already passed (see the documentation for droppedDueToLateness here). These events are dropped in the first GroupByKey after the session window function and can't be processed later. We didn't want to drop any late data, so the solution was to check each event's timestamp before it enters the sessions part and stream to the sessions part only events that won't be dropped, i.e. events that meet this condition: event_timestamp >= event_arrival_time - (gap_duration + allowed_lateness). The rest are written to BigQuery without the session data. (Apparently Apache Beam drops an event if the event's timestamp is before event_arrival_time - (gap_duration + allowed_lateness), even if there is a live session the event belongs to...) A sketch of such a pre-filter is shown below.
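A minimal sketch of that pre-filter, assuming the same getEventTimestamp() helper as in the pipeline above, using processing time (Instant.now()) as an approximation of the event arrival time, and the reduced 2-day allowed lateness mentioned above:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;
import org.joda.time.Instant;

Duration gapDuration = Duration.standardMinutes(30);
Duration allowedLateness = Duration.standardDays(2);

// Keep only events the session window can still accept; the rest can be
// written to BigQuery directly, without session data.
PCollection<TableRow> sessionable = input.apply("FilterDroppableEvents",
    Filter.by(event -> {
        Instant eventTime = Instant.parse(getEventTimestamp(event));
        Instant cutoff = Instant.now().minus(gapDuration.plus(allowedLateness));
        return !eventTime.isBefore(cutoff); // event_timestamp >= arrival_time - (gap + lateness)
    }));
```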
P.S. - in the bounded-sessions part, where the author demonstrates how to implement a time-bounded session, I believe there is a bug that allows a session to grow beyond the provided max size. Once a session has exceeded the max size, one can send late data that intersects the session and precedes it, which moves the session's start time earlier and thereby expands the session. Furthermore, once a session has exceeded the max size, events that belong to it but don't extend it can no longer be added.
To fix that, I switched the order of the current window span and the if-statement, and edited the if-statement (the one checking the session max size) in the mergeWindows function in the window-spanning part, so that a session can't pass the max size and can only receive data that doesn't extend it beyond the max size. This is my implementation:
public void mergeWindows(MergeContext c) throws Exception {
    // Sort candidate windows by start time
    List<IntervalWindow> sortedWindows = new ArrayList<>();
    for (IntervalWindow window : c.windows()) {
        sortedWindows.add(window);
    }
    Collections.sort(sortedWindows);

    // Merge intersecting windows, but only while the merged span stays within maxSize + gapDuration
    List<MergeCandidate> merges = new ArrayList<>();
    MergeCandidate current = new MergeCandidate();
    for (IntervalWindow window : sortedWindows) {
        MergeCandidate next = new MergeCandidate(window);
        if (current.intersects(window)) {
            if (current.union == null
                    || new Duration(current.union.start(), window.end()).getMillis()
                        <= maxSize.plus(gapDuration).getMillis()) {
                current.add(window);
                continue;
            }
        }
        // Either no intersection or the size limit would be exceeded: close the current candidate
        merges.add(current);
        current = next;
    }
    merges.add(current);
    for (MergeCandidate merge : merges) {
        merge.apply(c);
    }
}

KafkaConsumer poll() behavior understanding

Trying to understand (new to Kafka) how the poll event loop in Kafka works.
Use case: 25 records on the topic, max poll size is set to 5.
max.poll.interval.ms = 5000  // 5 seconds
max.poll.records = 5
Sequence of tasks
1. Poll the records from the topic.
2. Process the records in a for loop.
3. Some processing logic that would either pass or fail.
4. If the logic passes, the offset will be added to a map.
5. Then it will be committed using a commitSync call.
6. If it fails, the loop will break and whatever succeeded before this will be committed. The problem starts after this.
The next poll just keeps moving in batches of 5 even after the error; is that expected?
What we basically expect is that the loop breaks, the offsets up to the last successfully processed message get committed, and then the next poll continues from the failed message.
Example: in the 1st batch, 5 messages are polled, offsets 1 and 2 succeed and are committed, then the 3rd fails. Yet the poll calls keep moving to the next batches, like 5-10 and 10-15. If there are any errors in between, we expect polling to stop at that point: the next poll should start from 3 in the first case, or, if it fails at 8 in the 2nd batch, from offset 8, not from the next max-poll batch (which would be 5 records further in this case). If it matters, this is a Spring Boot project and enable.auto.commit is false.
I have tried finding this in the documentation but with no luck.
I also tried tweaking max.poll.interval.ms, but it didn't help.
EDIT: Did not accept the answer because there is no direct solution for a custom consumer. Keeping this for informational purposes.
max.poll.interval.ms is in milliseconds, not seconds, so it should be 5000.
Once the records have been returned by the poll (and offsets not committed), they won't be returned again unless you restart the consumer or perform seek() operations on the consumer to reset the offset to the unprocessed ones.
The Spring for Apache Kafka project provides a SeekToCurrentErrorHandler to perform this task for you.
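A minimal configuration sketch of that approach, assuming Spring for Apache Kafka 2.3+ (where SeekToCurrentErrorHandler accepts a BackOff); the bean wiring and retry values are illustrative:

```java
import org.springframework.context.annotation.Bean;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.listener.SeekToCurrentErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
        ConsumerFactory<String, String> consumerFactory) {
    ConcurrentKafkaListenerContainerFactory<String, String> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory);
    // On a listener exception, re-seek the unprocessed records so the failed record
    // is redelivered; retry 3 times with a 1-second back-off, then give up.
    factory.setErrorHandler(new SeekToCurrentErrorHandler(new FixedBackOff(1000L, 3)));
    return factory;
}
```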
If you are using the consumer yourself (which it sounds like), you must do the seeks.
You can manually seek to the beginning offset of the poll for all the assigned partitions on failure. I am not sure about the Spring consumer.
Sample code for seeking the offset back to the beginning for a plain consumer:
In the code below I am getting the records list per partition and then getting the offset of the first record to seek to.
import scala.collection.JavaConverters._

// Seek each assigned partition back to the first record returned by this poll
def seekBack(records: ConsumerRecords[String, String]): Unit = {
  records.partitions().asScala.foreach { partition =>
    val partitionedRecords = records.records(partition)
    val offset = partitionedRecords.get(0).offset()
    consumer.seek(partition, offset)
  }
}
One problem: doing this blindly in production is bad, since you only want to seek back when you hit a transient error; otherwise you will end up retrying infinitely.

FB Messenger API - Receiving double requests

I have a working FB Bot built with Ruby which allows players to play a scavenger hunt.
Sometimes though, when I have multiple players in a team, FB is sending me a player's 'Answer' webhook twice. I have looked into it and at first thought it was to do with the 20-second timeout if FB gets no 200 OK response (docs here). After checking the logs though, I am receiving the second webhook from FB only 14 seconds later. See below:
# Webhook #1
{"object"=>"page", "entry"=>[{"id"=>"252445748474312", "time"=>1532153642358, "messaging"=>[{"sender"=>{"id"=>"1709242109154907"}, "recipient"=>{"id"=>"252445748474312"}, "timestamp"=>1532153641935, "message"=>{"mid"=>"0FeOChulGjuPgg3YJqEgajNsY8kMfNRt_bpIdeegEeE54h-KB8szcd-EQ-UHUT3850RwHgH4TxVYFkoFwxqhtg", "seq"=>402953, "text"=>"Larrikins"}}]}]}
# Webhook #2 (14 seconds later)
{"object"=>"page", "entry"=>[{"id"=>"252445748474312", "time"=>1532153656901, "messaging"=>[{"sender"=>{"id"=>"1709242109154907"}, "recipient"=>{"id"=>"252445748474312"}, "timestamp"=>1532153641935, "message"=>{"mid"=>"0FeOChulGjuPgg3YJqEgajNsY8kMfNRt_bpIdeegEeE54h-KB8szcd-EQ-UHUT3850RwHgH4TxVYFkoFwxqhtg", "seq"=>402953, "text"=>"Larrikins"}}]}]}
Notice both are exactly the same apart from the first "time" attribute (14 secs later).
Due to a number of methods and calls that I process after receiving the first webhook, the 200 OK response is only being sent back to FB once I have finished sending my messages in response (hence the 14 second delay).
So I have two questions:
Is the 14-second delay too long, and is that why FB is resending? If so, how can I send a 200 OK response straight away (head :ok)?
Is it another issue entirely?
Also ensure that "Echo" is disabled: go to Settings > Webhooks and edit the subscribed events.
An asynchronous approach, like NodeJS, is recommended. In my case I work with AWS SQS: I have workers that process the requests without blocking (they don't wait), and I return 200 "ok" to FB right away to avoid FB sending the message to my webhook again.
Another approach may be to store the mid in a database and check on each request whether that mid already exists; if it does, don't process the message. I used DynamoDB (AWS) with TTL enabled, so the database cleans itself every hour by erasing old requests.
I think it is the ~15-second wait before replying; it was also happening to me, as Facebook auto-retries when you don't reply fast enough. Te EEe Te's idea is solid: write some mechanism to cache mids and check whether a message is a duplicate before processing it.

How to use RabbitMQ http api to see what queue had a messages in a ready state

I have a RabbitMQ server setup with thousands of queues, of which only about 5 are persistent queues. Every now and then there is a backup of a queue that will have about 5-10 messages in a ready state. These messages do not appear to be in the persistent queues. I want to find out which queues had the messages in a ready state, but the only indication that it is happening is on the overview page of the web management console, which covers all queues.
Is there a way to query Rabbit to tell me the stat info for messages that were in a ready state for a period of minutes and which queue they were in?
I would use the HTTP API.
http://rabbit-broker:15672/api/queues
This will give you a list of the current queue states in JSON, so you'll have to keep polling it. Store the "messages_ready" for a given queue "name" for the period you want to monitor. Then you'll be able to see which queues have that backlog spike.
You can use simple curl, or whichever platform you prefer with an HTTP client.
Please note: the user you connect with will need the monitoring tag to access all the queue information.
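For example, a minimal polling sketch in Java, assuming Java 11+, the default management port 15672, and a monitoring user (the host and credentials are placeholders); JSON parsing is left to whatever library you prefer:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class QueueBacklogPoller {
    public static void main(String[] args) throws Exception {
        String credentials = Base64.getEncoder()
                .encodeToString("monitor-user:secret".getBytes());           // placeholders
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://rabbit-broker:15672/api/queues"))
                .header("Authorization", "Basic " + credentials)
                .build();
        while (true) {
            // The body is a JSON array with one object per queue; record "name" and
            // "messages_ready" per sample to spot which queues build up a backlog.
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
            Thread.sleep(60_000);                                            // poll once per minute
        }
    }
}
```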
Out of the box there is no easy way AFAIK, you'd have to manually click through the queues and look at their graphs in the UI for the last hour, which is tedious.
I had similar requirements and I found a better way than polling. The docs say that you may get raw samples via the API if you use special parameters in the request.
For example, in your case, since you are interested in messages in the ready state, you can ask your queue for a history of queue lengths, e.g. the last 60 seconds with samples every 1 second (note that 15672 is the default port used by rabbitmq_management):
http://rabbitHost:15672/api/queues/vhost/queue?lengths_age=60&lengths_incr=1
For default vhost=/ it will be:
http://rabbitHost:15672/api/queues/%2F/queue?lengths_age=60&lengths_incr=1
Then in the result json there will be some additional _details objects like this:
"messages_ready_details": {
"avg": 8.524590163934427,
"avg_rate": 0.08333333333333333,
"samples": [{
"timestamp": 1532699694000,
"sample": 5
}, {
"timestamp": 1532699693000,
"sample": 11
},
<... more samples ...>
],
"rate": -6.0
},
"messages_ready": 5,
Then on this raw data you may do any stats you need.
Other raw data samples appear if you use different parameters in the request:
- Messages sent and received: msg_rates_age / msg_rates_incr
- Bytes sent and received: data_rates_age / data_rates_incr
- Queue lengths: lengths_age / lengths_incr
- Node statistics (e.g. file descriptors, disk space free): node_stats_age / node_stats_incr
