Apache Camel data validation issue

I am new to Camel and am reading a data file (a flat text file) using multiple threads.
I am able to read the data and store it in POJOs, but I have a problem with validation.
While reading line by line, I store the records in a map in the context. Validation has to work across lines: for example, some records refer to IDs that should be defined in earlier records. Here is my route:
get("readerService","fileReader")
    .log(...)
    .process(e -> readerService.init(...))
    .split(...)
        .streaming()
        .parallelProcessing(...)
        .threads(...)
        .process(e -> readerService.read(e, ctx))
        // transforms the read record to a POJO and stores it in the map in ctx
        .process(e -> recTransformer(e, ctx))
        .process(e -> myValidator(e, ctx))
    .end()
The problem is that the read service reads 10 records in random order, and the transformer reads each one, creates a POJO and stores it in the map in the context. But the validator is also called every time, before the whole file has been read.
In some cases it tries to validate a record when an earlier record has not yet been stored in the map, because of the parallel processing.
I would like to know how to call .process(e -> myValidator(e, ctx))
after the entire file has been read and the map is complete. That would resolve my issue. Any suggestions would be great.
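The "validate only after the map is complete" idea can be sketched outside Camel with plain java.util.concurrent. This is only an illustration under stated assumptions: the "id,parentId" line layout, the class name TwoPhaseValidation and the method findInvalid are hypothetical and do not come from the route above. Phase 1 parses lines in parallel into a concurrent map, a barrier waits for every task, and phase 2 runs the cross-record check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for the route: not Camel code.
class TwoPhaseValidation {

    // Assumed line layout: "id,parentId" where parentId may be empty.
    // Returns the sorted ids of records whose parentId is defined nowhere in the input.
    static List<String> findInvalid(List<String> lines, int threads) {
        Map<String, String> parentById = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Phase 1: read and transform in parallel; arrival order does not matter.
            List<Future<?>> pending = new ArrayList<>();
            for (String line : lines) {
                pending.add(pool.submit(() -> {
                    String[] parts = line.split(",", -1);
                    parentById.put(parts[0].trim(), parts[1].trim());
                }));
            }
            // Barrier: block until every record has been stored in the map.
            for (Future<?> task : pending) {
                task.get();
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("reading failed", e);
        } finally {
            pool.shutdown();
        }
        // Phase 2: the map is complete, so cross-record checks are safe.
        List<String> invalid = new ArrayList<>();
        for (Map.Entry<String, String> entry : parentById.entrySet()) {
            String parent = entry.getValue();
            if (!parent.isEmpty() && !parentById.containsKey(parent)) {
                invalid.add(entry.getKey());
            }
        }
        invalid.sort(String::compareTo);
        return invalid;
    }

    public static void main(String[] args) {
        // "B" refers to "A", which appears later in the file; "C" refers to an undefined "Z".
        System.out.println(findInvalid(List.of("B,A", "A,", "C,Z"), 4));
    }
}
```

In a Camel route the analogous move would presumably be to run the validator after the split's .end() rather than inside it; that is an assumption about the route, which the question only shows in elided form.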

Related

Kafka Streams: using the DSL API, within a transform, how can I send two messages to different topics/separate DSL downstream processors?

I'm using the DSL API and I have a use case where I need to check a condition and, if it is true, send an additional message to a topic separate from the happy path. My question is: how can I attach child processors to parents in the DSL API? Is it as simple as caching a stream variable, using it in two subsequent places, and naming those stream processors? Here's some brief code that explains what I'm trying to do. I am using the DSL API because I need foreignKeyJoin.
var myStream = stream.process(myProcessorSupplier); // 3.3 returns a stream
stream.to("happyThingTopic"); // Q: will the forward ever land here?
stream.map(myKvMapper, Named.as("what-is-this")).to("myOtherTopic"); // will the forward land here?
public KeyValue<String, Object> process(Object key, Object value) {
    if (value.hasFlag) {
        processorContext.forward(key, new OtherThing(), "what-is-this?");
    }
    return new KeyValue<>(key, new HappyThing(value));
}

How to access DSL-created KTable/GlobalKTable using Processor API?

I am using a Processor API (PAPI) topology.
Is it possible to access a KTable (or GlobalKTable) created with DSL from within the Processor API (even if read-only)?
I.e., using:
val builder = new StreamsBuilder()
val KTable = builder.table("topicname")
I get a KTable, but the Topology only allows you to use addStateStore with a StoreBuilder, not the KTable itself.
.addStateStore(myStoreBuilder, MY_PROCESSOR_NAME)
So I could build one by doing this:
def keyValueStoreBuilder[K, V](storeName: String, keySerde: Serde[K], valueSerde: Serde[V]): StoreBuilder[KeyValueStore[K, V]] = {
  Stores.keyValueStoreBuilder(
    Stores.persistentKeyValueStore(storeName),
    keySerde,
    valueSerde)
}
But, how to cleanly obtain the storeName in this case?

When you create a KTable, it automatically creates a store internally, with a generated name. (You can get the name via Topology#describe().) You can also assign a name to the store via the table() method, using the Materialized parameter.
It's a little unclear to me what you mean by "access a KTable within the Processor API", though. If you mean "access the KTable store within a Processor", you can use Topology#connectProcessorAndStateStores() to give the processor access to the store. Note that the processor should never write into the KTable store, as the table() operator is responsible for maintaining the table's state. If you do write into the store, there are no guarantees and you might lose data in case of a failure.

Different serde for Kafka Streams KTable state store

As part of our application logic, we use a Kafka Streams state store for range lookups; the data is loaded from a Kafka topic using the builder.table() method.
The problem is that the source topic's key is serialised as JSON, which is not well suited to the binary key comparisons used internally in the RocksDB-based state store.
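A small, self-contained sketch of why a JSON-serialised key can misbehave under byte-wise comparison. The {"id":N} key layout and the fixed-width big-endian alternative are assumptions made for illustration; they are not the actual serdes from this question.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustration only: hypothetical key encodings, no Kafka or RocksDB involved.
class KeyOrderingDemo {

    // Assumed JSON key layout, e.g. {"id":9}.
    static byte[] jsonKey(long id) {
        return ("{\"id\":" + id + "}").getBytes(StandardCharsets.UTF_8);
    }

    // Fixed-width big-endian encoding; for non-negative ids the byte order
    // matches the numeric order.
    static byte[] binaryKey(long id) {
        return ByteBuffer.allocate(Long.BYTES).putLong(id).array();
    }

    // Unsigned lexicographic comparison, the kind of ordering a byte-oriented
    // store applies to its keys.
    static int compareBytes(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    public static void main(String[] args) {
        // As JSON text, 9 sorts after 10 because '9' > '1' at the first differing byte.
        System.out.println(compareBytes(jsonKey(9), jsonKey(10)) > 0);    // true
        // With the fixed-width binary encoding, 9 sorts before 10.
        System.out.println(compareBytes(binaryKey(9), binaryKey(10)) < 0); // true
    }
}
```

With keys ordered like the JSON variant, a range scan between two ids can skip or include the wrong entries, which is why a comparison-friendly key encoding matters for range lookups.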
We were hoping to use a separate serde for keys by passing it to Materialized.as(). However, it looks like the Streams implementation resets whatever is passed to the original serdes used to load from the table topic.
This is what I can see in streams builder internals:
public synchronized <K, V> KTable<K, V> table(final String topic,
                                              final Consumed<K, V> consumed,
                                              final Materialized<K, V, KeyValueStore<Bytes, byte[]>> materialized) {
    Objects.requireNonNull(topic, "topic can't be null");
    Objects.requireNonNull(consumed, "consumed can't be null");
    Objects.requireNonNull(materialized, "materialized can't be null");
    materialized.withKeySerde(consumed.keySerde).withValueSerde(consumed.valueSerde);
    return internalStreamsBuilder.table(topic,
                                        new ConsumedInternal<>(consumed),
                                        new MaterializedInternal<>(materialized, internalStreamsBuilder, topic + "-"));
}
Does anybody know why it's done this way, and whether it's possible to use a different serde for a DSL state store?
Please don't propose using the Processor API; that route is well explored. I would like to avoid writing a processor and a custom state store every time I need to massage data before saving it into a state store.
After some digging through the Streams sources, I found out that I can pass a custom Materialized.as to a filter with an always-true predicate. But it smells a bit hackish.
This is my code, which unfortunately doesn't work as we hoped, because of the "serdes reset" described above.
Serde<Value> valueSerde = new JSONValueSerde();
KTable<Key, Value> table = builder.table(
    tableTopic,
    Consumed.with(new JSONKeySerde(), valueSerde),
    Materialized.as(cacheStoreName)
        .withKeySerde(new BinaryComparisonsCompatibleKeySerde())
        .withValueSerde(valueSerde)
);

The code works by design. From a Streams point of view, there is no reason to use a different Serde for the store than for reading the data from the topic, because it's known to be the same data. Thus, if one does not use the default Serdes from the StreamsConfig, it's sufficient to specify the Serde once (in Consumed), and it's not required to specify it in Materialized again.
For your special case, you could read the topic as a stream and do a "dummy aggregation" that just returns the latest value per record (instead of computing an actual aggregate). This allows you to specify a different Serde for the result type.

How to get ordering for different MessagePostProcessors in SimpleMessageListenerContainer

I have multiple MessagePostProcessors in Spring AMQP, which I set using the SimpleMessageListenerContainer.setAfterReceivePostProcessors API. My question is: are these MessagePostProcessors called in the order in which I specified them?
Pseudocode:
SimpleMessageListenerContainer container = // API returning a SimpleMessageListenerContainer object
container.setAfterReceivePostProcessors(new MessagePostProcessor[] {
    messagePostProcessors1, messagePostProcessors2});
So does Spring AMQP call messagePostProcessors1 followed by messagePostProcessors2 in sequence, or does it pick the order arbitrarily?
If the order is arbitrary, is there any way to enforce it, i.e. so that messagePostProcessors2 always gets called after messagePostProcessors1?

Akshat, the order is based on the order that is set in the processor; I quote the documentation below. When I look at the concrete implementations of the processors, I find there is a setOrder method (from the Ordered interface, I think). Setting that in your message post processor may do the trick.
public void setAfterReceivePostProcessors(MessagePostProcessor... afterReceivePostProcessors)
Set a MessagePostProcessor that will be invoked immediately after a Channel#basicGet() and before any message conversion is performed. May be used for operations such as decompression. Processors are invoked in order, depending on PriorityOrder, Order and finally unordered.
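The ordering rule quoted above (priority-ordered first, then ordered, then unordered) can be sketched with local stand-in types. The interfaces Processor, Ordered and PriorityOrdered below are defined only for this example and are not the Spring interfaces; keeping unordered processors in their supplied order is an assumption of the sketch, not something the quoted documentation states.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Local stand-ins for illustration only; these are NOT the Spring types.
class PostProcessorOrderingDemo {

    interface Processor { String name(); }
    interface Ordered extends Processor { int order(); }
    interface PriorityOrdered extends Ordered {}

    static Processor plain(String name) {
        return () -> name;
    }

    static Processor ordered(String name, int order) {
        return new Ordered() {
            public String name() { return name; }
            public int order() { return order; }
        };
    }

    static Processor priority(String name, int order) {
        return new PriorityOrdered() {
            public String name() { return name; }
            public int order() { return order; }
        };
    }

    // Returns processor names in invocation order: priority-ordered group,
    // then ordered group (each sorted by order value), then unordered.
    static List<String> invocationOrder(List<Processor> processors) {
        List<Ordered> priorityGroup = new ArrayList<>();
        List<Ordered> orderedGroup = new ArrayList<>();
        List<Processor> unorderedGroup = new ArrayList<>();
        for (Processor p : processors) {
            if (p instanceof PriorityOrdered) {
                priorityGroup.add((Ordered) p);
            } else if (p instanceof Ordered) {
                orderedGroup.add((Ordered) p);
            } else {
                unorderedGroup.add(p); // assumption: supplied order is kept
            }
        }
        Comparator<Ordered> byOrder = Comparator.comparingInt(Ordered::order);
        priorityGroup.sort(byOrder);
        orderedGroup.sort(byOrder);
        List<String> names = new ArrayList<>();
        priorityGroup.forEach(p -> names.add(p.name()));
        orderedGroup.forEach(p -> names.add(p.name()));
        unorderedGroup.forEach(p -> names.add(p.name()));
        return names;
    }

    public static void main(String[] args) {
        System.out.println(invocationOrder(List.of(
                plain("u"), ordered("o2", 2), priority("p", 5), ordered("o1", 1))));
    }
}
```

Under this rule, giving messagePostProcessors1 a lower order value than messagePostProcessors2 would place it first within the same group.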

Hadoop Cascading : CascadeException "no loops allowed in cascade" when cogroup pipes twice

I'm trying to write a Cascading (v1.2) cascade (http://docs.cascading.org/cascading/1.2/userguide/htmlsingle/#N20844) consisting of two flows:
1) The first flow outputs urls to a db table (in which they are automatically assigned IDs via an auto-incrementing id value).
This flow also outputs pairs of urls into a SequenceFile with field names "urlTo", "urlFrom".
2) The second flow reads from both these sources and tries to do a CoGroup on "urlTo" (from the SequenceFile) and "url" (from the db source) to get the db record "id" for each "urlTo".
It then does a CoGroup on "urlFrom" and "url" to get the db record "id" for each "urlFrom".
The two flows work individually if I call flow.complete() on the first before running the second. But if I put the two flows in a Cascade object, I get the error
cascading.cascade.CascadeException: no loops allowed in cascade, flow: urlLink*url*url, source: JDBCTap{connectionUrl='jdbc:mysql://localhost:3306/mydb', driverClassName='com.mysql.jdbc.Driver', tableDesc=TableDesc{tableName='urls', columnNames=null, columnDefs=null, primaryKeys=null}}, sink: JDBCTap{connectionUrl='jdbc:mysql://localhost:3306/mydb', driverClassName='com.mysql.jdbc.Driver', tableDesc=TableDesc{tableName='url_link', columnNames=[urlLinkFrom, urlLinkTo], columnDefs=[bigint(20), bigint(20)], primaryKeys=[urlLinkFrom, urlLinkTo]}}
on trying to configure the cascade.
I can see it's coming from the addEdgeFor function of the CascadeConnector but I'm not clear on how to resolve this problem.
I've never used Cascade / CascadeConnector before. Is there something I'm missing?

It seems like some of your source and sink paths are the same.
A Cascade builds itself as a directed graph in which loops are not allowed, so if a flow has a source and a sink pointing to the same location, that in essence creates a loop and is rejected, since
it does not go from:
Source Location A to Sink Location B
but instead goes from:
Source Location A to Sink Location A.

"A Tap is not given an explicit name by design. This is so a given Tap instance can be re-used in different Flows that may expect a source or sink by a different logical name, but are the same physical resource."
"In general, two instances of the same Tap class must have differing Identifiers (and different #equals)."
It turns out that JDBCTaps generate their identifier from the connection url alone (and do not include the table name). So as I was reading from one table and writing to a different table in the same database it seemed like I was reading from and writing to the same Tap and causing a loop.
As a work-around, I'm going to subclass the JDBCTap and override the getIdentifier() method to include the table name.
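A minimal sketch of the identifier collision and of the work-around, using plain strings instead of Cascading classes. The helper names urlOnlyIdentifier, tableAwareIdentifier and looksLikeLoop are hypothetical and only mimic the behaviour described above; the "/" separator is likewise an arbitrary choice for this example.

```java
// Illustration only: plain strings stand in for Tap identifiers.
class TapIdentifierDemo {

    // Mimics an identifier derived from the connection URL alone.
    static String urlOnlyIdentifier(String connectionUrl, String tableName) {
        return connectionUrl;
    }

    // Mimics the work-around: the identifier also includes the table name.
    static String tableAwareIdentifier(String connectionUrl, String tableName) {
        return connectionUrl + "/" + tableName;
    }

    // A source and a sink with the same identifier look like one resource,
    // so a flow reading from one and writing to the other looks like a loop.
    static boolean looksLikeLoop(String sourceId, String sinkId) {
        return sourceId.equals(sinkId);
    }

    public static void main(String[] args) {
        String url = "jdbc:mysql://localhost:3306/mydb";
        // Different tables, same URL: the URL-only identifiers collide.
        System.out.println(looksLikeLoop(
                urlOnlyIdentifier(url, "urls"), urlOnlyIdentifier(url, "url_link")));       // true
        // Including the table name keeps the two taps distinct.
        System.out.println(looksLikeLoop(
                tableAwareIdentifier(url, "urls"), tableAwareIdentifier(url, "url_link"))); // false
    }
}
```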
