How to access DSL-created KTable/GlobalKTable using Processor API?

How to access DSL-created KTable/GlobalKTable using Processor API? - apache-kafka-streams

I am using a Processor API (PAPI) topology.
Is it possible to access a KTable (or GlobalKTable) created with DSL from within the Processor API (even if read-only)?
I.e. using the:
val builder = new StreamsBuilder()
val KTable = builder.table("topicname")
I get a KTable, but the Topology only allows you to use addStateStore with a StoreBuilder, not the KTable itself.
.addStateStore(myStoreBuilder, MY_PROCESSOR_NAME)
So I could build one by doing this:
def keyValueStoreBuilder[K, V](storeName: String, keySerde: Serde[K], valueSerde: Serde[V]): StoreBuilder[KeyValueStore[K, V]] = {
Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore(storeName),
keySerde,
valueSerde)
}
But, how to cleanly obtain the storeName in this case?

When you create a KTable it will automatically create a store internally, with a generated name. (You can get the name via Topology#describe()). You can also assign a name to the store via table() method using Materialized parameter.
It's a little unclear to me, what you mean by "access a KTable within the Processor API" though? If you mean "access the KTable store within a Processor" you can use Topology#connectProcessorAndStateStores() to give the processor access to the store. Note, that the processor should never write into the KTable store, as the table() operator is responsible to maintain the table's state. If you do write into the store, there are not guarantees and you might loose data in case of a failure.

Related

Kafka streams: Using the DSL api, within a transform, how can I send two messages to different topics/separate DSL downstream processors

I'm using the DSL api and I have a use case where I need to check a condition and then if true, send an additional message to a separate topic from the happy path. My question is, how can I attach child processors to parents in the DSL api? Is it as simple as caching a stream variable and using it in two subsequent places, and naming those stream processors? Here's some brief code that explains what I'm trying to do. I am using the DSL api because I need the use of the foreignKeyJoin.
var myStream = stream.process(myProcessorSupplier); //3.3 returns a stream
stream.to("happyThingTopic"); Q: will the forward ever land here?
stream.map( myKvMapper, new Named("what-is-this")).to("myOtherTopic"); //will the forward land here?
public KeyValue<String, Object> process(Object key, Object value){
if (value.hasFlag){
processorContext.forward(key, new OtherThing(), "what-is-this?");
}
return new KeyValue(key, HappyThing(value));
}

BizTalk: How to promote 2 fields in custom pipeline?

I'm trying to write a custom pipeline that will promote 2 fields on my map so that the concatenation of those 2 fields could be blocked in the filter, I have a list of words that needs to be blocked.
How do I go about doing this?

How can you promote 2 values that you needs to be joined then?
Set them up as regular Promoted Properties.
Write a custom Pipeline Component to run after the Xml Disassembler to read those Properties and Write/Promote the third Property.
Important Note: In the custom Pipeline Component, you must ensure the entire stream has been read by the XmlDisassembler to guarantee the Propeties have been Promoted. You can do this by just copying the incoming stream to a new stream and resetting the pointer back to 0.

1 . You need write in Custom Pipeline
Write a custom Pipeline Component to run after the Xml Disassembler.
After code below promove properties in your Custom Pipeline.
outMessage.Context.Promote("MessageType", systemPropertiesNamespace, namespaceURI );

Different serde for Kafka Streams KTable state store

As part of our application logic, we use Kafka Streams state store for range lookups, data is loaded from Kafka topic using builder.table() method.
The problem is that source topic's key is serialised as JSON and doesn't suite well to binary key comparisons used internally in RocksDB based state store.
We were hoping to use a separate serde for keys by passing it to Materialized.as(). However, it looks like that streams implementation resets whatever is passed to the original serdes used to load from the table topic.
This is what I can see in streams builder internals:
public synchronized <K, V> KTable<K, V> table(final String topic,
final Consumed<K, V> cons,
final Materialized<K, V, KeyValueStore<Bytes, byte[]>> materialized) {
Objects.requireNonNull(topic, "topic can't be null");
Objects.requireNonNull(consumed, "consumed can't be null");
Objects.requireNonNull(materialized, "materialized can't be null");
materialized.withKeySerde(consumed.keySerde).withValueSerde(consumed.valueSerde);
return internalStreamsBuilder.table(topic,
new ConsumedInternal<>(consumed),
new MaterializedInternal<>(materialized, internalStreamsBuilder, topic + "-"));
}
Anybody knows why it's done this way, and if it's possible to use a different serde for a DSL state store?
Please don't propose using Processor API, this route is well explored. I would like to avoid writing a processor and a custom state store every time when I need to massage data before saving it into a state store.
After some digging through streams sources, I found out that I can pass a custom Materialized.as to the filter with always true predicate. But it smells a bit hackerish.
This is my code, that unfortunately doesn't work as we hoped to, because of "serdes reset" described above.
Serde<Value> valueSerde = new JSONValueSerde()
KTable<Key, Value> table = builder.table(
tableTopic,
Consumed.with(new JSONKeySerde(), valueSerde)
Materialized.as(cacheStoreName)
.withKeySerde(new BinaryComparisonsCompatibleKeySerde())
.withValueSerde(valueSerde)
)

The code works by design. From a streams point of view, there is no reason to use a different Serde for the store are for reading the data from the topic, because it's know to be the same data. Thus, if one does not use the default Serdes from the StreamsConfig, it's sufficient to specify the Serde once (in Consumed) and it's not required to specify it in Materialized again.
For you special case, you could read the topic as a stream a do a "dummy aggregation" that just return the latest value per record (instead of computing an actual aggregate). This allows you to specify a different Serde for the result type.

What is the purpose of RocksDBStore with Serdes.Bytes() and Serdes.ByteArray()?

RocksDBStore<K,V> stores keys and values as byte[] on disk. It converts to/from K and V typed objects using Serdes provided while constructing the object of RocksDBStore<K,V>.
Given this, please help me understand the purpose of the following code in RocksDbKeyValueBytesStoreSupplier:
return new RocksDBStore<>(name,
Serdes.Bytes(),
Serdes.ByteArray());
Providing Serdes.Bytes() and Serdes.ByteArray() looks redundant.
RocksDbKeyValueBytesStoreSupplier is introduced in KAFKA-5650 (Kafka Streams 1.0.0) as part of KIP-182: Reduce Streams DSL overloads and allow easier use of custom storage engines.
In KIP-182, there is the following sentence :
The new Interface BytesStoreSupplier supersedes the existing StateStoreSupplier (which will remain untouched). This so we can provide a convenient way for users creating custom state stores to wrap them with caching/logging etc if they chose. In order to do this we need to force the inner most store, i.e, the custom store, to be a store of type <Bytes, byte[]>.
Please help me understand why we need to force custom stores to be of type <Bytes, byte[]>?
Another place (KAFKA-5749) where I found similar sentence:
In order to support bytes store we need to create a MeteredSessionStore and ChangeloggingSessionStore. We then need to refactor the current SessionStore implementations to use this. All inner stores should by of type < Bytes, byte[] >
Why?

Your observation is correct -- the PR implementing KIP-182 did miss to remove the Serdes from RocksDBStore that are not required anymore. This was fixed in 1.1 release already.

Hadoop Cascading : CascadeException "no loops allowed in cascade" when cogroup pipes twice

I'm trying to write a Casacading(v1.2) casade (http://docs.cascading.org/cascading/1.2/userguide/htmlsingle/#N20844) consisting of two flows:
1) The first flow outputs urls to a db table, (in which they are automatically assigned id's via an auto-incrementing id value).
This flow also outputs pairs of urls into a SequenceFile with field names "urlTo", "urlFrom".
2) The second flow reads from both these sources and tries to do a CoGroup on "urlTo" (from the SequenceFile) and "url" (from the db source) to get the db record "id" for each "urlTo".
It then does a CoGroup on "urlFrom" and "url" to get the db record "id" for each "urlFrom".
The two flows work individually - if I call flow.complete() on the first before running the second flow. But if I put the two flows in a cascade object I get the error
cascading.cascade.CascadeException: no loops allowed in cascade, flow: urlLink*url*url, source: JDBCTap{connectionUrl='jdbc:mysql://localhost:3306/mydb', driverClassName='com.mysql.jdbc.Driver', tableDesc=TableDesc{tableName='urls', columnNames=null, columnDefs=null, primaryKeys=null}}, sink: JDBCTap{connectionUrl='jdbc:mysql://localhost:3306/mydb', driverClassName='com.mysql.jdbc.Driver', tableDesc=TableDesc{tableName='url_link', columnNames=[urlLinkFrom, urlLinkTo], columnDefs=[bigint(20), bigint(20)], primaryKeys=[urlLinkFrom, urlLinkTo]}}
on trying to configure the cascade.
I can see it's coming from the addEdgeFor function of the CascadeConnector but I'm not clear on how to resolve this problem.
I've never used Cascade / CascadeConnector before. Is there something I'm missing?

It seems like your some paths for source and sinks are the same.
A Cascade uses the concept of Direct Graphs to build the Cascade itself so if you have a flow source and a sink source pointing to the same location that in essence creates a loop and is disallowed in the concept of Directed Graphs since
it does not go from:
Source Location A to Sink Location B
but instead goes from:
Source Location A to Sink Location A.

"A Tap is not given an explicit name by design. This is so a given Tap instance can be re-used in different {#link Flow}s that may expect a source or sink by a different logical name, but are the same physical resource."
"In general, two instances of the same Tap class must have differing Identifiers (and different #equals)."
It turns out that JDBCTaps generate their identifier from the connection url alone (and do not include the table name). So as I was reading from one table and writing to a different table in the same database it seemed like I was reading from and writing to the same Tap and causing a loop.
As a work-around, I'm going to subclass the JDBCTap and override the getIdentifier() method to include the table name.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

How to access DSL-created KTable/GlobalKTable using Processor API? - apache-kafka-streams

Related

Kafka streams: Using the DSL api, within a transform, how can I send two messages to different topics/separate DSL downstream processors

BizTalk: How to promote 2 fields in custom pipeline?

Different serde for Kafka Streams KTable state store

What is the purpose of RocksDBStore with Serdes.Bytes() and Serdes.ByteArray()?

Hadoop Cascading : CascadeException "no loops allowed in cascade" when cogroup pipes twice

Categories

Resources