Kafka Connect MaskField SMT fails on tombstones - apache-kafka-connect

I want to apply the MaskField SMT on a field in a specific topic. The MaskField SMT fails with Only Map objects supported in absence of schema for [mask fields], found: null when encountering tombstone events (null value). To apply masking only on the selected topic, I'm using the TopicNameMatches predicate, but I don't find a way to apply a negated RecordIsTombstone as well. I found that for some SMTs there is an option to let tombstone through without being affected by the transformation, but can't seem to do anything about it for MaskField. Is there a way to achieve this without writing a custom SMT (using Kafka Connect 6.0.1)?
My config currently:
transforms: mask_trf
transforms.mask_trf.type: org.apache.kafka.connect.transforms.MaskField$Value
transforms.mask_trf.fields: foo
transforms.mask_trf.replacement: ***
transforms.mask_trf.predicate: pred
predicates: pred
predicates.pred.type: org.apache.kafka.connect.transforms.predicates.TopicNameMatches
predicates.pred.pattern: t1

I ended up creating a custom SMT that extends MaskField. The only change is that in apply() it checks if it's a tombstone, and if so, returns the record unchanged. Otherwise it calls MaskField's apply.

Related

How to access DSL-created KTable/GlobalKTable using Processor API?

I am using a Processor API (PAPI) topology.
Is it possible to access a KTable (or GlobalKTable) created with DSL from within the Processor API (even if read-only)?
I.e. using the:
val builder = new StreamsBuilder()
val KTable = builder.table("topicname")
I get a KTable, but the Topology only allows you to use addStateStore with a StoreBuilder, not the KTable itself.
.addStateStore(myStoreBuilder, MY_PROCESSOR_NAME)
So I could build one by doing this:
def keyValueStoreBuilder[K, V](storeName: String, keySerde: Serde[K], valueSerde: Serde[V]): StoreBuilder[KeyValueStore[K, V]] = {
Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore(storeName),
keySerde,
valueSerde)
}
But, how to cleanly obtain the storeName in this case?
When you create a KTable it will automatically create a store internally, with a generated name. (You can get the name via Topology#describe()). You can also assign a name to the store via table() method using Materialized parameter.
It's a little unclear to me, what you mean by "access a KTable within the Processor API" though? If you mean "access the KTable store within a Processor" you can use Topology#connectProcessorAndStateStores() to give the processor access to the store. Note, that the processor should never write into the KTable store, as the table() operator is responsible to maintain the table's state. If you do write into the store, there are not guarantees and you might loose data in case of a failure.

Different serde for Kafka Streams KTable state store

As part of our application logic, we use Kafka Streams state store for range lookups, data is loaded from Kafka topic using builder.table() method.
The problem is that source topic's key is serialised as JSON and doesn't suite well to binary key comparisons used internally in RocksDB based state store.
We were hoping to use a separate serde for keys by passing it to Materialized.as(). However, it looks like that streams implementation resets whatever is passed to the original serdes used to load from the table topic.
This is what I can see in streams builder internals:
public synchronized <K, V> KTable<K, V> table(final String topic,
final Consumed<K, V> cons,
final Materialized<K, V, KeyValueStore<Bytes, byte[]>> materialized) {
Objects.requireNonNull(topic, "topic can't be null");
Objects.requireNonNull(consumed, "consumed can't be null");
Objects.requireNonNull(materialized, "materialized can't be null");
materialized.withKeySerde(consumed.keySerde).withValueSerde(consumed.valueSerde);
return internalStreamsBuilder.table(topic,
new ConsumedInternal<>(consumed),
new MaterializedInternal<>(materialized, internalStreamsBuilder, topic + "-"));
}
Anybody knows why it's done this way, and if it's possible to use a different serde for a DSL state store?
Please don't propose using Processor API, this route is well explored. I would like to avoid writing a processor and a custom state store every time when I need to massage data before saving it into a state store.
After some digging through streams sources, I found out that I can pass a custom Materialized.as to the filter with always true predicate. But it smells a bit hackerish.
This is my code, that unfortunately doesn't work as we hoped to, because of "serdes reset" described above.
Serde<Value> valueSerde = new JSONValueSerde()
KTable<Key, Value> table = builder.table(
tableTopic,
Consumed.with(new JSONKeySerde(), valueSerde)
Materialized.as(cacheStoreName)
.withKeySerde(new BinaryComparisonsCompatibleKeySerde())
.withValueSerde(valueSerde)
)
The code works by design. From a streams point of view, there is no reason to use a different Serde for the store are for reading the data from the topic, because it's know to be the same data. Thus, if one does not use the default Serdes from the StreamsConfig, it's sufficient to specify the Serde once (in Consumed) and it's not required to specify it in Materialized again.
For you special case, you could read the topic as a stream a do a "dummy aggregation" that just return the latest value per record (instead of computing an actual aggregate). This allows you to specify a different Serde for the result type.

What is the purpose of RocksDBStore with Serdes.Bytes() and Serdes.ByteArray()?

RocksDBStore<K,V> stores keys and values as byte[] on disk. It converts to/from K and V typed objects using Serdes provided while constructing the object of RocksDBStore<K,V>.
Given this, please help me understand the purpose of the following code in RocksDbKeyValueBytesStoreSupplier:
return new RocksDBStore<>(name,
Serdes.Bytes(),
Serdes.ByteArray());
Providing Serdes.Bytes() and Serdes.ByteArray() looks redundant.
RocksDbKeyValueBytesStoreSupplier is introduced in KAFKA-5650 (Kafka Streams 1.0.0) as part of KIP-182: Reduce Streams DSL overloads and allow easier use of custom storage engines.
In KIP-182, there is the following sentence :
The new Interface BytesStoreSupplier supersedes the existing StateStoreSupplier (which will remain untouched). This so we can provide a convenient way for users creating custom state stores to wrap them with caching/logging etc if they chose. In order to do this we need to force the inner most store, i.e, the custom store, to be a store of type <Bytes, byte[]>.
Please help me understand why we need to force custom stores to be of type <Bytes, byte[]>?
Another place (KAFKA-5749) where I found similar sentence:
In order to support bytes store we need to create a MeteredSessionStore and ChangeloggingSessionStore. We then need to refactor the current SessionStore implementations to use this. All inner stores should by of type < Bytes, byte[] >
Why?
Your observation is correct -- the PR implementing KIP-182 did miss to remove the Serdes from RocksDBStore that are not required anymore. This was fixed in 1.1 release already.

How to get ordering for different MessagePostProcessors in SimpleMessageListenerContainer

I have multiple MessagePostProcessors in SpringAMQP which i set them using SimpleMessageListenerContainer.setAfterReceivePostProcessors API , now my query is does these MessagePostProcessors are called in order I have mentioned.
Pseoudo code
SimpleMessageListenerContainer container = // api returing SimpleMessageListenerContainer object
container.setAfterReceivePostProcessors(new MessagePostProcessor[] {
messagePostProcessors1 , messagePostProcessors2});
So does Spring AMQP call messagePostProcessors1 followed messagePostProcessors2 in sequence or does it randomly selects the same ?
If it randomly selects is there any way that we can order the same i.e messagePostProcessors2 always gets called after messagePostProcessors1
Akshat , the order is based on the order that is set in the processor.Quoting the document here. When i look at the concrete implementation of the processors , i find there is a setOrder method (form interface ordered i think). May be setting that in your message post processor will do the trick.
public void setAfterReceivePostProcessors(MessagePostProcessor...
afterReceivePostProcessors)
Set a MessagePostProcessor that will be
invoked immediately after a Channel#basicGet() and before any message
conversion is performed. May be used for operations such as
decompression Processors are invoked in order, depending on
PriorityOrder, Order and finally unordered.

Hadoop Cascading : CascadeException "no loops allowed in cascade" when cogroup pipes twice

I'm trying to write a Casacading(v1.2) casade (http://docs.cascading.org/cascading/1.2/userguide/htmlsingle/#N20844) consisting of two flows:
1) The first flow outputs urls to a db table, (in which they are automatically assigned id's via an auto-incrementing id value).
This flow also outputs pairs of urls into a SequenceFile with field names "urlTo", "urlFrom".
2) The second flow reads from both these sources and tries to do a CoGroup on "urlTo" (from the SequenceFile) and "url" (from the db source) to get the db record "id" for each "urlTo".
It then does a CoGroup on "urlFrom" and "url" to get the db record "id" for each "urlFrom".
The two flows work individually - if I call flow.complete() on the first before running the second flow. But if I put the two flows in a cascade object I get the error
cascading.cascade.CascadeException: no loops allowed in cascade, flow: urlLink*url*url, source: JDBCTap{connectionUrl='jdbc:mysql://localhost:3306/mydb', driverClassName='com.mysql.jdbc.Driver', tableDesc=TableDesc{tableName='urls', columnNames=null, columnDefs=null, primaryKeys=null}}, sink: JDBCTap{connectionUrl='jdbc:mysql://localhost:3306/mydb', driverClassName='com.mysql.jdbc.Driver', tableDesc=TableDesc{tableName='url_link', columnNames=[urlLinkFrom, urlLinkTo], columnDefs=[bigint(20), bigint(20)], primaryKeys=[urlLinkFrom, urlLinkTo]}}
on trying to configure the cascade.
I can see it's coming from the addEdgeFor function of the CascadeConnector but I'm not clear on how to resolve this problem.
I've never used Cascade / CascadeConnector before. Is there something I'm missing?
It seems like your some paths for source and sinks are the same.
A Cascade uses the concept of Direct Graphs to build the Cascade itself so if you have a flow source and a sink source pointing to the same location that in essence creates a loop and is disallowed in the concept of Directed Graphs since
it does not go from:
Source Location A to Sink Location B
but instead goes from:
Source Location A to Sink Location A.
"A Tap is not given an explicit name by design. This is so a given Tap instance can be re-used in different {#link Flow}s that may expect a source or sink by a different logical name, but are the same physical resource."
"In general, two instances of the same Tap class must have differing Identifiers (and different #equals)."
It turns out that JDBCTaps generate their identifier from the connection url alone (and do not include the table name). So as I was reading from one table and writing to a different table in the same database it seemed like I was reading from and writing to the same Tap and causing a loop.
As a work-around, I'm going to subclass the JDBCTap and override the getIdentifier() method to include the table name.

Resources