We are planning to use the suppress operator over a session-windowed KTable, and we are wondering about its fault tolerance.
We understand that a buffer is used to store events/aggregations until the window closes.
Now let us say a rebalance happens and the active task is moved to a different machine. What happens to this (in-memory?) buffer?
Let us say we are tracking click counts by user, and we configured the session window's inactivity gap to be 3 minutes. A session window has started for the key alice, and aggregations have happened for that key for 2 minutes. For example, the buffer holds the entry (alice -> 5), representing that alice has made 5 clicks in this session so far.
And say there is no activity from alice after that.
If things are working fine, then once the session is over, the downstream processor will get the event alice -> 5.
But what if there is a rebalance now, and the active task that is maintaining the session window for alice is moved to a new machine?
Since there is no further activity from alice, will the downstream processor running on the new machine miss this event alice -> 5?
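For context, a minimal sketch of the topology we have in mind (hypothetical names, assuming a Kafka Streams version with suppress(), i.e. 2.1+):

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> clicks = builder.stream("clicks");  // hypothetical input topic: userId -> click payload

KTable<Windowed<String>, Long> clicksPerSession = clicks
        .groupByKey()
        .windowedBy(SessionWindows.with(Duration.ofMinutes(3))  // 3-minute inactivity gap
                .grace(Duration.ZERO))                          // close the window as soon as the gap expires
        .count()
        // buffer updates and emit only the final count once the session window closes
        .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()));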
The suppress operator provides fault tolerance similarly to any other state store in Streams. Although the active data structure is in memory, the suppression buffer maintains a changelog (an internal Kafka topic).
So, when you have that rebalance, the previous active task flushes its state to the changelog and discards the in-memory buffer. The new active task re-creates the state by replaying the changelog topic, resulting in the exact same buffered contents as if there had been no rebalance.
In other words, just like in-memory state stores, the suppression buffer is made durable (in a Kafka topic) even though it is not persistent (on the local disk).
Does that make sense?
Related
I wonder what happens to state stores when there is a rebalance in a Kafka Streams application. Let's say an instance goes off for a while and then comes back, either outside the static membership time window, or with no static membership at all. Do the old state stores corresponding to the old task assignment get deleted, or do they live there together with the new state stores corresponding to the new task assignment?
I need to implement logic similar to session windows using the Processor API in order to have full control over the state store. Since the Processor API doesn't provide a windowing abstraction, this needs to be done manually. However, I cannot find the source code for the Kafka Streams session window logic to get some initial ideas (specifically regarding session timeouts).
I was expecting to use the punctuate method, but it's a per-processor timer rather than a per-key timer. Additionally, SessionStore<K, AGG> doesn't provide an API to traverse the database for all keys.
[UPDATE]
As an example, assume a processor instance is processing K1 and stream time is incremented, which causes the session for K2 to time out. K2 may or may not exist at all. How do you know that a specific key (like K2) exists when stream time is incremented (while processing a different key)? In other words, when stream time is incremented, how do you figure out which windows are expired (given that you don't know whether those keys exist)?
This is the DSL code: https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KStreamSessionWindowAggregate.java -- hope it helps.
It's unclear what your question is though -- it's mostly statements. So let me try to give a general answer.
In the DSL, sessions are closed based on "stream time" progress. Relying only on the input data makes the operation deterministic; using wall-clock time would introduce non-determinism. Hence, a punctuation is not necessary in the DSL implementation.
Additionally SessionStore<K, AGG> doesn't provide an API to traverse the database for all keys.
Sessions in the DSL are based on keys and thus it's sufficient to scan the store on a per-key basis over a time range (as done via findSessions(...)).
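For illustration, a hedged sketch of that per-key lookup from a custom Processor (the store name, types, and variables are hypothetical; findSessions(...) is the SessionStore method referred to above):

// inside Processor#process(); "session-store" was registered via addStateStore()
final SessionStore<String, Long> store =
        (SessionStore<String, Long>) context.getStateStore("session-store");
final long inactivityGapMs = Duration.ofMinutes(3).toMillis();
final long ts = context.timestamp();  // timestamp of the record currently being processed
// find all existing sessions for this key that could merge with the new record
try (final KeyValueIterator<Windowed<String>, Long> it =
             store.findSessions(key, ts - inactivityGapMs, ts + inactivityGapMs)) {
    while (it.hasNext()) {
        final KeyValue<Windowed<String>, Long> session = it.next();
        // merge the new record into this session: store.remove(...) the old session
        // and store.put(...) the merged one
    }
}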
Update:
In the DSL, each time a session window is updated, a corresponding update event is sent downstream immediately. Hence, the DSL implementation does not wait for "stream time" to advance any further but publishes the current (potentially intermediate) result right away.
To obey the grace period, the record timestamp is compared to "stream time", and if the corresponding session window is already closed, the record is skipped (cf. https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KStreamSessionWindowAggregate.java#L146). I.e., closing a window is just a logical step (not an actual operation); the session will still be stored, and if a window is closed, no additional event needs to be sent downstream because the final result was already sent downstream in the last update to the window.
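Conceptually, the grace-period check described above boils down to something like this (a simplified sketch, not the actual Kafka Streams code; observedStreamTime, gracePeriodMs, and sessionWindowEnd are assumed to be tracked by the operator):

final long windowCloseTime = observedStreamTime - gracePeriodMs;
if (sessionWindowEnd < windowCloseTime) {
    // the window is already closed: the late record is skipped (dropped)
} else {
    // the window is still open: merge the record into the session and emit the updated result downstream
}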
Retention time itself does not need to be handled by the Processor implementation because it's a built-in feature of the SessionStore: internally, the session store maintains so-called "segments" that store sessions for a certain time period. Each time a put() is done, the store checks whether old segments can be dropped (based on the timestamp provided by put()). I.e., old sessions are deleted lazily and as bulk deletes (all sessions of the whole segment are deleted at once), as this is more efficient than individual deletes.
I am using Kafka Streams 2.3.1 suppress() operator to limit the number of updates being sent to the underlying KTable.
The use case here is that in my processing logic I want to make an HTTP call; however, to limit the number of calls, I am windowing the stream and aggregating source topic messages that fall into the same time window, in order to make a single API call.
The code looks roughly as follows:
KStream<String, Event> outputStream = inputKStream
        .groupByKey()
        .windowedBy(TimeWindows.of(Duration.ofSeconds(30)).grace(Duration.ofSeconds(5)))
        .aggregate(Aggregator::new, (key, value, aggregate) -> aggregate.aggregate(value), stateStore)
        .suppress(Suppressed.untilTimeLimit(Duration.ofSeconds(5), Suppressed.BufferConfig.maxRecords(500).emitEarlyWhenFull()))
        .mapValues((windowedKey, groupedTriggerAggregator) -> { /* code here returning a list */ })
        .toStream((k, v) -> k.key())
        .flatMapValues((readOnlyKey, value) -> value);
The problem I am running into is that while the windows exceeding the record limit are emitted, the state is preserved. At some point the state for a single time window grows to multiple MBs, causing the suppress store changelog message to exceed the topic's max.message.bytes limit. For our use case, as soon as the window is emitted we actually don't care about the leftover state, and it would be safe to drop it.
As we are sharing the Kafka cluster between multiple teams, the team running the cluster is hesitant to increase the cluster-level max.message.bytes property beyond the 10 MB that we require.
Do I have any options other than implementing my logic using transformValues? If not, are there any future Kafka Streams enhancements that would be able to handle this more out of the box?
For our use case, as soon as the window is emitted we actually don't care about the leftover state, and it would be safe to drop it.
For this case, you can reduce the store retention time (default is 1 day) down to its minimum, i.e. the window size plus the specified grace period, via the aggregate() parameter Materialized#withRetention(...).
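A hedged sketch of what that could look like for the aggregation in the question (the store name is hypothetical; the retention must cover at least the window size plus the grace period, i.e. 35 seconds here):

.aggregate(
        Aggregator::new,
        (key, value, aggregate) -> aggregate.aggregate(value),
        Materialized.<String, Aggregator, WindowStore<Bytes, byte[]>>as("event-agg-store")  // hypothetical store name
                .withRetention(Duration.ofSeconds(35)))  // 30s window + 5s grace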
The problem I am running into is that while the windows exceeding the record limit are emitted, the state is preserved. At some point the state for a single time window grows to multiple MBs, causing the suppress store changelog message to exceed the topic's max.message.bytes limit.
This is actually an interesting statement, and looking at your code, I just want to clarify something: as you limit by time and allow emitting early based on cache size, it seems that you have a lot of out-of-order records that keep updating the state even after an intermediate result was emitted. If you purge the state via the retention time as described above, you need to consider the following:
Purging state won't affect any emits that are triggered based on cache size, because the state will only be purged after the retention time has passed.
Furthermore, purging state implies that all out-of-order records that appear after purging would not be processed at all, but would be dropped (because the retention time implicitly marks input records with smaller timestamps as "late").
However, overall it seems that you don't really care about out-of-order data and event-time windows, as it's OK for you to "arbitrarily" put records into a window, since the only goal is to reduce the number of external API calls. Hence, it seems appropriate to switch to processing-time semantics by using WallclockTimestampExtractor (instead of the default extractor). To ensure that each record is only emitted once, you should change the suppress() configuration to only emit "final" results.
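A hedged sketch of those two changes (only the Streams config key and the suppress() call are from the public API; everything else is hypothetical):

// switch to processing-time semantics via the wall-clock timestamp extractor
final Properties props = new Properties();
props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, WallclockTimestampExtractor.class);

// and in the topology, emit only one final result per window, once it closes
.suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))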
I'm working on a project that uses Flink (version 1.4.2) for bulk data ingestion into my graph database (JanusGraph). Data ingestion has two phases: vertex data ingestion and edge data ingestion into the graph DB. Vertex data ingestion runs without any issue, but during edge ingestion I'm getting an error saying Lost connection to task manager taskmanagerName. The detailed error traceback from flink-taskmanager-b6f46f6c8-fgtlw is attached below:
2019-08-01 18:13:26,025 ERROR org.apache.flink.runtime.operators.BatchTask
- Error in task code: CHAIN Join(Remap EDGES id: TO) -> Map (Key Extractor) -> Combine (Deduplicate edges including bi-directional edges) (62/80)
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Lost connection to task manager 'flink-taskmanager-b6f46f6c8-gcxnm/10.xx.xx.xx:6121'.
This indicates that the remote task manager was lost.
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.exceptionCaught(PartitionRequestClientHandler.java:146)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at org.apache.flink.shaded.netty4.io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:79)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:835)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:87)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:162)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.apache.flink.shaded.netty4.io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
... 6 more
For ease of understanding, let's say:
flink-taskmanager-b6f46f6c8-gcxnm as TM1 and
flink-taskmanager-b6f46f6c8-fgtlw as TM2
While debugging I was able to find that TM1 requested a ResultPartition (RP) from TM2, and TM2 started to send the ResultPartition to TM1. But on checking the logs from TM1, we found that it waited a long time to get the RP from TM2, and after some time it started to deregister the accepted task. We believe that deregistering the task after the Netty remote transport exception caused TM2 to report the Lost Taskmanager error for the specific job.
Both task managers are running in separate EC2 instances (m4.2xlarge). I have verified the CPU and memory utilization of both instances, and all metrics were within limits.
Can you please tell me why the task manager is acting like this, and how to fix this issue?
Thanks in advance.
Flushing Buffers to Netty
In the picture above, the credit-based flow control mechanics actually sit inside the “Netty Server” (and “Netty Client”) components and the buffer the RecordWriter is writing to is always added to the result subpartition in an empty state and then gradually filled with (serialised) records. But when does Netty actually get the buffer? Obviously, it cannot take bytes whenever they become available since that would not only add substantial costs due to cross-thread communication and synchronisation, but also make the whole buffering obsolete.
In Flink, there are three situations that make a buffer available for consumption by the Netty server:
a buffer becomes full when writing a record to it, or
the buffer timeout hits, or
a special event such as a checkpoint barrier is sent.
Flush after Buffer Full
The RecordWriter works with a local serialisation buffer for the current record and will gradually write these bytes to one or more network buffers sitting at the appropriate result subpartition queue. Although a RecordWriter can work on multiple subpartitions, each subpartition has only one RecordWriter writing data to it. The Netty server, on the other hand, is reading from multiple result subpartitions and multiplexing the appropriate ones into a single channel as described above. This is a classical producer-consumer pattern with the network buffers in the middle, as shown by the next picture. After (1) serialising and (2) writing data to the buffer, the RecordWriter updates the buffer's writer index accordingly. Once the buffer is completely filled, the record writer will (3) acquire a new buffer from its local buffer pool for any remaining bytes of the current record - or for the next one - and add the new one to the subpartition queue. This will (4) notify the Netty server of data being available if it is not aware yet (we can assume it already got the notification if there are more finished buffers in the queue). Whenever Netty has capacity to handle this notification, it will (5) take the buffer and send it along the appropriate TCP channel.
Image 1
Flush after Buffer Timeout
In order to support low-latency use cases, we cannot rely only on buffers being full in order to send data downstream. There may be cases where a certain communication channel does not have too many records flowing through, which would unnecessarily increase the latency of the few records you actually have. Therefore, a periodic process will flush whatever data is available down the stack: the output flusher. The periodic interval can be configured via StreamExecutionEnvironment#setBufferTimeout and acts as an upper bound on the latency (for low-throughput channels). The following picture shows how it interacts with the other components: the RecordWriter serialises and writes into network buffers as before, but concurrently, the output flusher may (3,4) notify the Netty server of data being available if Netty is not already aware (similar to the "buffer full" scenario above). When Netty handles this notification (5), it will consume the available data from the buffer and update the buffer's reader index. The buffer stays in the queue - any further operation on this buffer from the Netty server side will continue reading from the reader index next time.
Image 2
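For reference, the buffer timeout mentioned above is set on the job's execution environment (the value below is illustrative):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// flush partially filled network buffers at most every 10 ms (the default is 100 ms);
// lower values reduce latency at the cost of more, smaller transfers
env.setBufferTimeout(10);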
Reference:
The link below may help you out.
flink-network-stack-details
Can you check the GC logs of TM1 and TM2 to see if there were full GCs, which may cause heartbeat timeouts?
I'm building a Kafka Streams application that generates change events by comparing every new calculated object with the last known object.
So for every message on the input topic, I update an object in a state store and every once in a while (using punctuate), I apply a calculation on this object and compare the result with the previous calculation result (coming from another state store).
To make sure this operation is consistent, I do the following after the punctuate triggers:
write a tuple to the state store
compare the two values, create change events and context.forward them. So the events go to the results topic.
replace the tuple with the new_value and write it to the state store
I use this tuple for scenarios where the application crashes or rebalances, so I can always send out the correct set of events before continuing.
Now, I noticed the resulting events are not always consistent, especially if the application frequently rebalances. It looks like in rare cases the Kafka Streams application emits events to the results topic, but the changelog topic is not up to date yet. In other words, I produced something to the results topic, but my changelog topic is not at the same state yet.
So, when I do a stateStore.put() and the method call returns successfully, are there any guarantees when it will be on the changelog topic?
Can I enforce a changelog flush? When I do context.commit(), when will that flush+commit happen?
To get complete consistency, you will need to enable processing.guarantee="exactly_once" -- otherwise, if a failure occurs, you might get inconsistent results.
If you want to stay with "at_least_once", you might want to use a single store and update the store after processing is done (i.e., after calling forward()). This minimizes the time window in which inconsistencies can occur.
And yes, if you call context.commit(): before the input topic offsets are committed, all stores will be flushed to disk, and all pending producer writes will also be flushed.
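A minimal sketch of enabling that guarantee (the application id and bootstrap servers are placeholders; all other properties are omitted):

final Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "change-event-generator");  // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");       // placeholder
// commit input offsets, changelog writes, and output records atomically
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);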