Kafka Streams DSL Cache - Handle Tombstones - apache-kafka-streams

I need to use the Kafka Streams DSL cache to reduce the amount of write volume to downstream processors. However, our app processes tombstones, which introduces a complication. For example, given the following records for a single key, K1:
<K1, V1>
<K1, V2>
<K1, V3>
The DSL cache may only emit the final record of:
<K1, V3>
With the DSL cache turned off, of course, it would emit all of the intermediate records:
<K1, V1>
<K1, V2>
<K1, V3>
Everything is working as expected so far. But, with tombstones, the raw sequence becomes:
<K1, V1>
<K1, V2>
<K1, V3>
<K1, NULL>
So depending on when the cache is flushed, we may never see the final count. For example:
<K1, V1> | cached
<K1, V2> | flushed
<K1, V3> | cached
<K1, NULL> | deleted
would mean <K1, V2> is flushed, but never <K1, V3>. The semantics I'm trying to achieve involves flushing the latest record for a given key in the cache whenever a tombstone is received for that key.
<K1, V1> | cached
<K1, V2> | flushed
<K1, V3> | cached
<K1, NULL> | emit the latest record (`<K1, V3>`), then delete.
I have not been able to do this with the DSL, and the Processor API doesn't expose the underlying cache, so I can't do it there either. I'm thinking about implementing a custom in-memory cache and using that with the Processor API, but it gets complicated because it seems there could be data loss if the app is shut down ungracefully (e.g. SIGKILL). I'm not sure how the DSL cache handles ungraceful shutdowns either (maybe there's data loss there too), so perhaps the implementation I have in mind could be modeled after the DSL cache.
Anyway, am I overthinking this problem? Is there a way to flush the latest record from the DSL cache when a tombstone is received, instead of implementing a custom cache?

we may never see the final count
I understand what you are saying; however, for this case the "final" record is the tombstone, so you do see the final one. What you want is a specific intermediate result. The DSL does not allow such fine-grained configuration.
the Processor API doesn't expose the underlying cache
Well, it does. Using Stores.keyValueStoreBuilder() you can call withCachingEnabled() on the returned StoreBuilder. Note that for this case, by default no records are emitted downstream, and you need to implement the emit logic manually. That is, you don't know when the cache is flushed, and when it is flushed it only flushes to local disk and to the changelog topic; no data is emitted downstream on flush.
You could register a punctuation to emit data at regular time intervals. Also, each time you process a tombstone, you can emit the currently stored value from the store before you delete the key, as sketched below.
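A minimal sketch of that approach, using the classic Processor API and assuming a 2.x Kafka Streams version; the store name "latest-store", the String serdes, and the class name are placeholder assumptions, not anything from the question:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

public class TombstoneFlushProcessor implements Processor<String, String> {

    // Caching-enabled store to register with the topology alongside this processor.
    public static StoreBuilder<KeyValueStore<String, String>> storeBuilder() {
        return Stores.keyValueStoreBuilder(
                        Stores.persistentKeyValueStore("latest-store"),
                        Serdes.String(),
                        Serdes.String())
                .withCachingEnabled();
    }

    private ProcessorContext context;
    private KeyValueStore<String, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.store = (KeyValueStore<String, String>) context.getStateStore("latest-store");
        // A punctuation could also be registered here to emit buffered values periodically.
    }

    @Override
    public void process(String key, String value) {
        if (value == null) {                  // tombstone
            String latest = store.get(key);
            if (latest != null) {
                context.forward(key, latest); // emit the latest buffered record first
            }
            store.delete(key);
            context.forward(key, null);       // then (optionally) propagate the tombstone itself
        } else {
            store.put(key, value);            // buffer only; nothing is emitted downstream here
        }
    }

    @Override
    public void close() { }
}

Note that the caching here only reduces writes to local disk and the changelog topic; what gets forwarded downstream is entirely up to the processor's own logic.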

Related

Does Kafka DSL context.forward() ensure delivery?

I'm writing a custom KStream transformer and I want to store certain records in a state store and later (from punctuate) resend them.
Yet I don't want to keep them in the state store - and I have no tombstone in my flow to clean them from the store later - so I want to explicitly remove the records from the store after they are forwarded:
context.forward(key, value);
store.delete(key);
I'm not sure whether the message could be lost if there is a failure later in the topology.
Does Kafka Streams still ensure at-least-once delivery with such usage? For example, by:
context.forward() returning only after successful processing
context.forward() returning immediately, but the record being deleted from the store only when all messages from the context are processed
any other mechanism....
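For concreteness, the pattern being asked about could look roughly like this (a sketch only, assuming Kafka Streams 2.1+ for the Duration-based schedule; the store name "buffer-store", the 10-second wall-clock punctuation, and the String types are assumptions, and the sketch itself does not settle the delivery-guarantee question):

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class BufferAndResendTransformer
        implements Transformer<String, String, KeyValue<String, String>> {

    private ProcessorContext context;
    private KeyValueStore<String, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.store = (KeyValueStore<String, String>) context.getStateStore("buffer-store");
        context.schedule(Duration.ofSeconds(10), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            // Collect the buffered entries first, then forward and delete them.
            List<KeyValue<String, String>> buffered = new ArrayList<>();
            try (KeyValueIterator<String, String> it = store.all()) {
                while (it.hasNext()) {
                    buffered.add(it.next());
                }
            }
            for (KeyValue<String, String> entry : buffered) {
                context.forward(entry.key, entry.value);
                store.delete(entry.key);   // the delete the question is worried about
            }
        });
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        store.put(key, value);   // buffer now, resend later from the punctuation
        return null;             // nothing is emitted downstream at this point
    }

    @Override
    public void close() { }
}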

How to parallelize a Flink job with Guava cache?

I have written a Flink job which uses Guava cache. The cache object is created and used in a run() function called in the main() function.
It is something like:
main() {
    run(some, params);
}

run() {
    // create and use the Guava cache object here
}
If I run this Flink job, with some level of parallelism, will all of the parallel tasks, use the same cache object? If not, how can I make them all use a single cache?
The cache is used inside a process() function for a stream. So it's like
incoming_stream.process(new ProcessFunction() { //Use Guava Cache here })
You can think of my use case as of cache based deduping, so I want all of the parallel tasks to refer to a single cache object
Using a Guava cache with Flink is usually an anti-pattern. Not that it can't be made to work, but there's probably a simpler and more scalable solution.
The standard approach to deduplicating in a thoroughly scalable, performant way with Flink is to partition the stream by some key (using keyBy), and then use keyed state to remember the keys that have been seen. Flink's keyed state is managed by Flink in a way that makes it fault tolerant and rescalable, while keeping it local. Flink's keyed state is a sharded key/value store, with each instance handling all of the events for some portion of the key space. You are guaranteed that for each key, all events for the same key will be processed by the same instance -- which is why this works well for deduplication.
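A minimal sketch of that keyed-state deduplication, assuming for simplicity a stream of String keys (the function and state names are made up, not from the question):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Emits each key only the first time it is seen; the "seen" flag lives in keyed state,
// so it is local, fault tolerant, and rescalable.
public class DedupFunction extends KeyedProcessFunction<String, String, String> {

    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        seen = getRuntimeContext().getState(
                new ValueStateDescriptor<>("seen", Types.BOOLEAN));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        if (seen.value() == null) {
            seen.update(true);
            out.collect(value);
        }
    }
}

// usage: incoming_stream.keyBy(s -> s).process(new DedupFunction())

If the key space is unbounded, state TTL can additionally be configured on the state descriptor so that old entries eventually expire.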
If you need instead that all of the parallel instances have a complete copy of some (possibly evolving) data set, that's what broadcast state is for.
Flink tasks run in multiple JVMs or on multiple machines, so the issue is how to share objects between JVMs.
Normally, you can acquire objects from a remote JVM with an RPC (TCP) or REST (HTTP) call.
Alternatively, you can serialize objects and store them in a database like Redis, then read them back from the database and deserialize them into objects.
In Flink there is a more graceful way to achieve this: you can store objects in state, and broadcast state may fit your case.
Broadcast state was introduced to support use cases where some data coming from one stream needs to be broadcast to all downstream tasks.
Hope this helps.

How to avoid timestamps extracted from a changelog topic triggering Processor::punctuate

I'm writing a KafkaStreams (v2.1.0) application which reads messages from a regular Kafka topic, left-joins these messages with metadata from a compacted changelog topic, and performs some stateful processing on the result using a Processor which also runs a punctuate call at regular event-time intervals. The messages in the data topic have a timestamp, and a custom TimestampExtractor is defined (which defines the event time as I'd like to use it in the punctuate call). The messages in the metadata topic, however, don't have a timestamp, but it seems KafkaStreams requires a TimestampExtractor to be defined anyhow. Now, if I use the metadata embedded in the Kafka messages (ExtractRecordMetadataTimestamp) or just a WallclockTimestampExtractor, it breaks the logic of my application, because these timestamps from the metadata topic also trigger the punctuate call in my processor, while the event time in my data topic can be hours behind the wall-clock time and I only want the punctuate call to be triggered by the data-topic timestamps.
My question is how I can prevent the punctuate call from being triggered by timestamps extracted from this compacted metadata topic.
One trick that seems to work at first sight is to always return 0 as the timestamp of these metadata messages, but I'm not sure it won't have unwanted side effects. Another workaround is not to rely on punctuate at all and implement my own tracking of the event time, but I'd prefer to use the standard KafkaStreams approach. So maybe there is another way to solve this?
This is the structure of my application:
KStream<String, Data> input =
    streamsBuilder.stream(dataTopic,
                          Consumed.with(Serdes.String(),
                                        new DataSerde(),
                                        new DataTimestampExtractor(),
                                        Topology.AutoOffsetReset.LATEST));

KTable<String, Metadata> metadataTable =
    streamsBuilder.table(metadataTopic,
                         Consumed.with(Serdes.String(),
                                       new MetadataSerde(),
                                       new WhichTimestampExtractorToUse(),
                                       Topology.AutoOffsetReset.EARLIEST));

input.leftJoin(metadataTable, this::joiner)
     .process(new ProcessorUsingPunctuateOnEventTime());
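For reference, the "always return 0" workaround mentioned above would look something like the following (a sketch; the class name is invented, and whether it has unwanted side effects is exactly the open question):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

// Pins every metadata record to timestamp 0, so these records never advance the
// event time used for punctuation on their own.
public class ZeroTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long previousTimestamp) {
        return 0L;
    }
}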

What is the behaviour of ProcessorContext.getStateStore(String name) & ReadOnlyKeyValueStore.get(String key) in Kafka Streams

I have a 1.0.0 Kafka Streams application with two classes, as described in How to evaluate consuming time in kafka stream application. In my application, I read events, perform some conditional checks and forward them to another topic on the same Kafka cluster. During my evaluation, I fetch some expressions from Kafka with the help of a global table store. I observed that most of the time is spent getting the value from the store (sample code is below).
Is it read from Kafka only once and then maintained in a local store?
or
Is it read from Kafka every time we call the org.apache.kafka.streams.state.ReadOnlyKeyValueStore.get(String key) API? If so, how can I maintain a local store instead of reading from Kafka every time?
Please help.
Ex:
private KeyValueStore<String, List<String>> policyStore =
        (KeyValueStore<String, List<String>>) this.context.getStateStore(policyGlobalTableName);
List<String> policyIds = policyStore.get(event.getCustomerCode());
By default, stores use an application-local RocksDB instance to buffer data. Thus, if you query the store with a get(), it will not go over the network to the brokers, but only to the local RocksDB instance.
You can try to change the RocksDB settings to improve performance, but I have no guidelines at the moment on which configs you might want to change. Configuring RocksDB is quite tricky, but you might want to search the Internet for further information about it.
You can pass in RocksDB configs via StreamsConfig (cf. https://docs.confluent.io/current/streams/developer-guide/config-streams.html#rocksdb-config-setter)
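A sketch of what that wiring could look like (the class name and the specific RocksDB option are illustrative only, not a tuning recommendation):

import java.util.Map;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.Options;

public class CustomRocksDBConfig implements RocksDBConfigSetter {
    @Override
    public void setConfig(String storeName, Options options, Map<String, Object> configs) {
        // Example knob only; the right values depend entirely on your workload.
        options.setMaxWriteBufferNumber(3);
    }
}

// In the application configuration:
// props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, CustomRocksDBConfig.class);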
As an alternative, you could also try to reconfigure Streams to use in-memory stores instead of RocksDB. Note that this will increase your rebalance time, as there is no locally buffered state if you use in-memory stores instead of RocksDB. (cf. https://docs.confluent.io/current/streams/developer-guide/processor-api.html#defining-and-creating-a-state-store)
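A sketch of an in-memory store definition for the Processor API (the store name and serdes are placeholders); for a DSL table the in-memory store supplier would instead be passed via Materialized:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

// Keeps all entries on the heap: no RocksDB on disk, so after a rebalance the state
// must be rebuilt from the changelog topic, hence the longer rebalance times noted above.
StoreBuilder<KeyValueStore<String, String>> inMemoryStore =
        Stores.keyValueStoreBuilder(
                Stores.inMemoryKeyValueStore("policy-store"),
                Serdes.String(),
                Serdes.String());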

Low loading into cache speed

I'm using Infinispan 6.0.0 in a 3-node setup (distributed caching with 2 replicas for each entry, no writes into a persistent store) and I'm just reading a file line by line and storing each line's contents in the cache. The speed seems a bit low to me (I can achieve more writes onto the SSD (persistent storage) than into RAM with Infinispan), but there isn't any obvious bottleneck in the test code (I'm using buffered input streams, and their limits certainly aren't reached). As of now, I'm able to write 100K entries every ~45 seconds, and that doesn't satisfy me. Assume this simplified code snippet:
while ((s = reader.readLine()) != null) {
    cache.put(s.substring(0, 2), s.substring(2, 5));
}
And CacheManager is created as follows:
return new DefaultCacheManager(
        GlobalConfigurationBuilder.defaultClusteredBuilder()
                .transport().addProperty("configurationFile", "jgroups.xml")
                .build(),
        new ConfigurationBuilder()
                .clustering().cacheMode(CacheMode.DIST_ASYNC).hash().numOwners(2)
                .transaction().transactionMode(TransactionMode.TRANSACTIONAL).lockingMode(LockingMode.OPTIMISTIC)
                .build());
What could I be possibly doing wrong?
I am not fully aware of all the asynchronous mode specifics, but I am afraid that something in the two-phase commit (prepare and commit) might force a blocking RPC => waiting for network latency => slowdown.
Do you need transactional behaviour? If not, switch it off. If you really need it, you may disable just the autocommit feature and load the cluster via non-transactional operations. Or you may try one-phase commits.
Another option could be mass loading via putAll (with tens or hundreds of entries per call, depending on your entry size), but the routing of this message is not really smart. In transactional mode it could behave a bit better, I guess.
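A rough sketch of the putAll batching idea applied to the loop from the question (the batch size of 100 is arbitrary; reader and cache are as in the original snippet):

import java.util.HashMap;
import java.util.Map;

Map<String, String> batch = new HashMap<>();
String s;
while ((s = reader.readLine()) != null) {
    batch.put(s.substring(0, 2), s.substring(2, 5));
    if (batch.size() >= 100) {
        cache.putAll(batch);   // one bulk operation instead of many single puts
        batch.clear();
    }
}
if (!batch.isEmpty()) {
    cache.putAll(batch);       // flush the remainder
}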
The last option, if you just want to load the cluster fast and then operate on it, could be transferring the bulk data to each node without Infinispan (using your own JGroups channel, or just with sockets) and loading all nodes with the CACHE_MODE_LOCAL flag.
By default Infinispan follows the Map.put() contract of returning the previous value, so even though you are using the DIST_ASYNC cache mode you're still implicitly performing a synchronous cache.get() for every put.
You can avoid this in two ways:
configurationBuilder.unsafe().unreliableReturnValues(true) will suppress the remote lookup for all the operations on the cache.
cache.getAdvancedCache().withFlags(Flag.IGNORE_RETURN_VALUES).put(k, v) will suppress the remote lookup for a single operation.
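For example, the per-operation flag could be applied to the load loop from the question like this (a sketch; it assumes the String keys and values from the original snippet):

import org.infinispan.AdvancedCache;
import org.infinispan.context.Flag;

AdvancedCache<String, String> noReturn =
        cache.getAdvancedCache().withFlags(Flag.IGNORE_RETURN_VALUES);
String s;
while ((s = reader.readLine()) != null) {
    noReturn.put(s.substring(0, 2), s.substring(2, 5));   // skips the implicit remote get()
}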
