Significance of boolean direct in OutputFieldsDeclarer.declare(boolean direct, Fields fields) - apache-storm

Looking at the OutputFieldsDeclarer class, I see there is an overloaded method for declare(...) with a boolean flag direct.
If I use the method declare(Fields fields), it sets this boolean flag as false.
I am not sure how Storm interprets this boolean field internally while processing with Spouts and Bolts.
Can somebody explain the significance of this flag to me?

If you declare a direct stream (i.e., set the flag to true), you need to emit tuples via the
collector.emitDirect(...)
methods (collector.emit(...) is not allowed for direct streams). Those methods require you to specify the ID of the consumer task that should receive the tuple.
Furthermore, when connecting a consumer to a direct stream, you need to specify
builder.setBolt(...).directGrouping("direct-emitting-bolt", "direct-stream-id");
No other connection patterns are allowed on direct streams.
Direct streams have the advantage that you get fine-grained control over the data distribution from producer to consumer; you can implement any imaginable distribution pattern. Of course, direct streams are also more difficult to handle. For example, you need to know the task IDs of the subscribed consumers (those can be looked up in the TopologyContext provided via Bolt.prepare).
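To make this concrete, here is a minimal sketch of a direct stream (not from the question; it assumes Storm 1.x+ package names, and MessageSpout, ConsumerBolt, the stream ID, and the field names are made-up placeholders):

import java.util.List;
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Producer bolt: declares a direct stream and therefore must emit with emitDirect(...).
public class DirectEmittingBolt extends BaseRichBolt {
    private OutputCollector collector;
    private List<Integer> consumerTaskIds;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // Task IDs of the consumer bolt subscribed to the direct stream.
        this.consumerTaskIds = context.getComponentTasks("consumer-bolt");
    }

    @Override
    public void execute(Tuple input) {
        String word = input.getStringByField("word");  // "word" is an assumed spout field
        // Choose the target task explicitly -- any distribution logic is possible here.
        int target = consumerTaskIds.get(Math.abs(word.hashCode()) % consumerTaskIds.size());
        collector.emitDirect(target, "direct-stream-id", input, new Values(word));
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // 'true' marks the stream as a direct stream.
        declarer.declareStream("direct-stream-id", true, new Fields("word"));
    }

    // Wiring: the consumer has to subscribe via directGrouping on that stream.
    // MessageSpout and ConsumerBolt are placeholder classes for illustration only.
    public static TopologyBuilder buildTopology() {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new MessageSpout());
        builder.setBolt("direct-emitting-bolt", new DirectEmittingBolt()).shuffleGrouping("spout");
        builder.setBolt("consumer-bolt", new ConsumerBolt(), 2)
               .directGrouping("direct-emitting-bolt", "direct-stream-id");
        return builder;
    }
}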

Related

Best practices for transforming a batch of records using KStream

I am new to KStream and would like to know best practices or guidance on how to optimally process a batch of n records using KStream. I have working code as shown below, but it only works for single messages at a time.
KStream<String, String> sourceStream = builder.stream("upstream-kafka-topic",
        Consumed.with(Serdes.String(), Serdes.String()));
// transform sourceStream using an implementation of ValueTransformer<String, String>
sourceStream.transformValues(() -> new MyValueTransformer())
        .to("downstream-kafka-topic",
            Produced.with(Serdes.String(), Serdes.String()));
The above code works with single records, as MyValueTransformer (which implements ValueTransformer) transforms a single String value. How do I make the above code work for a collection of String values?
You would need to somehow "buffer / aggregate" the messages. For example, you could add a state store to your transformer and store N messages inside the store. As long as the store contains fewer than N messages, you don't do any processing and also don't emit any output (you might want to use flatTransformValues, which allows you to emit zero or more results).
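A minimal sketch of that idea (assuming Kafka Streams 2.2+ for flatTransformValues; this is not the original MyValueTransformer, and the batch size N, store name, topic names, and transformBatch() are placeholder assumptions):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.ValueTransformer;
import org.apache.kafka.streams.kstream.ValueTransformerSupplier;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class BufferingExample {

    static final int N = 10;                     // assumed batch size
    static final String STORE = "buffer-store";  // assumed store name

    // Buffers values in a state store and only emits once N values have arrived.
    static class BufferingTransformer implements ValueTransformer<String, Iterable<String>> {
        private KeyValueStore<Integer, String> buffer;
        private int buffered;

        @Override
        @SuppressWarnings("unchecked")
        public void init(final ProcessorContext context) {
            buffer = (KeyValueStore<Integer, String>) context.getStateStore(STORE);
            // Recover the current buffer size after a restart.
            buffered = 0;
            try (final KeyValueIterator<Integer, String> it = buffer.all()) {
                while (it.hasNext()) { it.next(); buffered++; }
            }
        }

        @Override
        public Iterable<String> transform(final String value) {
            buffer.put(buffered++, value);
            if (buffered < N) {
                return Collections.emptyList();  // keep buffering, emit nothing yet
            }
            // Buffer is full: drain it and hand the whole batch to the batch logic.
            final List<String> batch = new ArrayList<>();
            for (int i = 0; i < buffered; i++) {
                batch.add(buffer.get(i));
                buffer.delete(i);
            }
            buffered = 0;
            return transformBatch(batch);
        }

        @Override
        public void close() { }
    }

    // Placeholder for the actual batch transformation.
    static List<String> transformBatch(final List<String> values) {
        return values;
    }

    public static void buildTopology(final StreamsBuilder builder) {
        builder.addStateStore(Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore(STORE), Serdes.Integer(), Serdes.String()));

        final ValueTransformerSupplier<String, Iterable<String>> supplier = BufferingTransformer::new;

        builder.stream("upstream-kafka-topic", Consumed.with(Serdes.String(), Serdes.String()))
               .flatTransformValues(supplier, STORE)
               .to("downstream-kafka-topic", Produced.with(Serdes.String(), Serdes.String()));
    }
}

The buffer size is rebuilt from the store in init(), so buffered records survive a restart; what "processing the batch" actually means stays inside transformBatch().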
Not sure what you're trying to achieve. Kafka Streams is by design built to process one record at a time. If you want to process a collection or batch of messages, you have a few options.
You might not actually need Kafka Streams, as the example you mentioned doesn't do much with the message; in this case you can leverage a normal Consumer, which will enable you to process in batches (sketched below). Check the Spring Kafka implementation of this here -> https://docs.spring.io/spring-kafka/docs/current/reference/html/#receiving-messages (Kafka fetches batches on the network layer, but normally you would process one record at a time; with a standard client, however, it is possible to process batches). Alternatively, you might model your value object to contain an array of messages, so for each record you receive an object with an embedded collection, which you could then process with Kafka Streams. Check the array type for Avro -> https://avro.apache.org/docs/current/spec.html#Arrays
Check this part of the documentation to better understand the Kafka Streams concepts -> https://kafka.apache.org/31/documentation/streams/core-concepts
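If you go the plain-consumer route, a rough sketch of a Spring Kafka batch listener might look like the following (the topic, group ID, and bean names are assumptions, not taken from the question):

import java.util.List;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.stereotype.Component;

@Configuration
class BatchListenerConfig {

    // Container factory configured for batch delivery.
    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> batchFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        factory.setBatchListener(true);  // deliver a List of records per poll
        return factory;
    }
}

@Component
class BatchProcessor {

    // Receives the whole poll() result as one list instead of record-by-record.
    @KafkaListener(topics = "upstream-kafka-topic", groupId = "batch-app",
                   containerFactory = "batchFactory")
    public void onBatch(List<String> values) {
        // process the batch of n records here
        System.out.println("Received batch of size " + values.size());
    }
}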

Confusion about MQ Put options (IBM MQ)

I am trying to understand some of the IBM MQ put options:
I have used https://www.ibm.com/docs/en/ibm-mq/9.2?topic=interfaces-mqputmessageoptionsnet-class for documentation.
It seems that MQPMO_ASYNC_RESPONSE and MQPMO_SYNC_RESPONSE are in fact mutually exclusive, yet these options have two different IDs (bit 12 and bit 13). What does MQ do when both options are set, or when neither one is set?
It seems that MQPMO_SYNCPOINT and MQPMO_NO_SYNCPOINT are in fact mutually exclusive, yet these options have two different IDs (bit 2 and bit 3). What does MQ do when both options are set, or when neither one is set?
MQPMO_RESPONSE_AS_Q_DEF makes things even more confusing for me. As I understand from the documentation, this bit defers control of whether the request is synchronous or asynchronous to the queue definition, thereby ignoring the options set with MQPMO_ASYNC_RESPONSE and MQPMO_SYNC_RESPONSE. But the documentation states the following:
For an MQDestination.put call, this option takes the put response type from DEFPRESP attribute of the queue.
For an MQQueueManager.put call, this option causes the call to be made synchronously.
And the documentation for DEFPRESP at https://www.ibm.com/docs/en/ibm-mq/9.1?topic=queues-defpresp-mqlong states this:
The default put response type (DEFPRESP) attribute defines the value used by applications when the PutResponseType within MQPMO has been set to MQPMO_RESPONSE_AS_Q_DEF. This attribute is valid for all queue types.
The value is one of the following:
SYNC
The put operation is issued synchronously returning a response.
ASYNC
The put operation is issued asynchronously, returning a subset of MQMD fields.
But the other documentation says setting this option makes the call synchronous.
So in short: What happens with the seemingly mutually exclusive options and what does the MQPMO_RESPONSE_AS_Q_DEF really do?
If you combine option flags that are not supposed to be combined, such as MQPMO_SYNCPOINT and MQPMO_NO_SYNCPOINT, the MQPUT call will return MQRC_OPTIONS_ERROR. You can see the documentation for this in IBM Docs here.
You are correct, the use of MQPMO_RESPONSE_AS_Q_DEF tells the queue manager to take the value for Put Response from the queue attribute DEFPRESP. The queue manager will look up the queue definition, and if it is SYNC it will effectively use MQPMO_SYNC_RESPONSE and if it is ASYNC it will effectively use MQPMO_ASYNC_RESPONSE.
The documentation you pointed us to states the following:-
MQC.MQPMO_RESPONSE_AS_Q_DEF
For an MQDestination.put call, this option takes the put response type from DEFPRESP attribute of the queue.
For an MQQueueManager.put call, this option causes the call to be made synchronously.
I don't know why it would be different depending on the class used, but I can tell you that if the MQPMO_RESPONSE_AS_Q_DEF makes it to the queue manager, it will change it as described above. This documentation suggests that the MQQueueManager class is changing it itself, which is an odd decision.
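For reference, a minimal sketch using the IBM MQ classes for Java (the question links the .NET docs, but the option constants are the same; the queue manager and queue names are made up), showing one valid, non-conflicting combination of put options:

import com.ibm.mq.MQException;
import com.ibm.mq.MQMessage;
import com.ibm.mq.MQPutMessageOptions;
import com.ibm.mq.MQQueue;
import com.ibm.mq.MQQueueManager;
import com.ibm.mq.constants.CMQC;

public class PutOptionsExample {
    public static void main(String[] args) throws Exception {
        MQQueueManager qmgr = new MQQueueManager("QM1");            // assumed queue manager
        MQQueue queue = qmgr.accessQueue("APP.REQUEST",             // assumed queue
                CMQC.MQOO_OUTPUT | CMQC.MQOO_FAIL_IF_QUIESCING);

        MQMessage msg = new MQMessage();
        msg.writeString("hello");

        MQPutMessageOptions pmo = new MQPutMessageOptions();
        // Pick exactly one of each mutually exclusive pair; combining both
        // (e.g. MQPMO_SYNCPOINT | MQPMO_NO_SYNCPOINT) fails with MQRC_OPTIONS_ERROR.
        pmo.options = CMQC.MQPMO_NO_SYNCPOINT        // outside a unit of work
                    | CMQC.MQPMO_RESPONSE_AS_Q_DEF   // let DEFPRESP on the queue decide sync/async
                    | CMQC.MQPMO_FAIL_IF_QUIESCING;

        try {
            queue.put(msg, pmo);
        } catch (MQException e) {
            // e.reasonCode would be MQRC_OPTIONS_ERROR (2046) for conflicting options
            System.err.println("MQPUT failed, reason " + e.reasonCode);
        } finally {
            queue.close();
            qmgr.disconnect();
        }
    }
}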

Read ruleset topic/partition by multiple kafka stream instances of the same app

I have a Kafka Streams app that does some processing on a main event topic, and I also have a side topic that
is used to apply a ruleset to the main event topic.
Until now the app was running as a single instance, and when
a rule was applied a static variable was set so that the other processing operator (the main topic consumer) could continue
evaluating rules as expected. This was necessary since the rule stream would be written to a single partition depending
on the rule key literal, e.g. <"MODE", value>, and therefore (through the static variable) all the other tasks
involved would be made aware of the change.
When deploying the application to multiple nodes, however, this approach cannot work: with a
single consumer group (e.g. two app instances), only one instance would set its static variable to
the correct value while the other instance would never consume that rule value. (Giving each instance a
different group ID would lead to the unwanted side effect of consuming the main topic twice.)
On the other hand, using the rule topic as a global table would mean the main processing
operator has to query the global table every time it consumes an event in order to retrieve the latest rules.
Is it possible to use some sort of global table listener that executes callback code and sets a static variable when a value arrives on that topic?
Is there a better/alternative approach to resolve this issue?
Instead of a GlobalKTable, you can fall back to addGlobalStore(), which allows you to execute custom code.
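A minimal sketch of that approach (assuming Kafka Streams 3.x and its newer Processor API; the topic name, store name, and callback are made-up placeholders): the state-update processor writes each rule into the global store and can additionally trigger arbitrary callback code in the local JVM.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.ProcessorSupplier;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class RuleStoreExample {

    // Stand-in for the static variable mentioned in the question.
    public static volatile String currentMode = "DEFAULT";

    // State-update processor: copies each record as-is into the global store,
    // and may additionally run callback code as a side effect.
    static class RuleUpdateProcessor implements Processor<String, String, Void, Void> {
        private KeyValueStore<String, String> store;

        @Override
        public void init(final ProcessorContext<Void, Void> context) {
            store = context.getStateStore("rules-store");
        }

        @Override
        public void process(final Record<String, String> record) {
            store.put(record.key(), record.value());
            if ("MODE".equals(record.key())) {
                currentMode = record.value();   // the "listener" side effect
            }
        }
    }

    public static void addRuleStore(final StreamsBuilder builder) {
        final ProcessorSupplier<String, String, Void, Void> supplier = RuleUpdateProcessor::new;
        builder.addGlobalStore(
                Stores.keyValueStoreBuilder(
                        Stores.inMemoryKeyValueStore("rules-store"),
                        Serdes.String(), Serdes.String())
                      .withLoggingDisabled(),             // global stores must disable logging
                "rules-topic",                            // assumed rule topic
                Consumed.with(Serdes.String(), Serdes.String()),
                supplier);
    }
}

Because the global store is maintained on every instance by its own GlobalStreamThread, the as-is copy keeps all instances consistent, and the extra callback gives the "listener" behaviour without a second consumer group.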

Adding a global store for a transformer to consume

Is there a way to add a global store for a Transformer to use? In the docs for transformer it says:
"Transform each record of the input stream into zero or more records in the output stream (both key and value type can be altered arbitrarily). A Transformer (provided by the given TransformerSupplier) is applied to each input record and computes zero or more output records. In order to assign a state, the state must be created and registered beforehand via stores added via addStateStore or addGlobalStore before they can be connected to the Transformer"
Yet the API for addGlobalStore only takes a ProcessorSupplier?
addGlobalStore(storeBuilder: StoreBuilder[_ <: StateStore],
topic: String,
consumed: Consumed[_, _],
stateUpdateSupplier: ProcessorSupplier[_, _])
My end goal is to use the Kafka Streams DSL with a transformer, since I need a flatMap and to transform both keys and values for my output topic. I do not have a processor in my topology, though.
I would expect something like this:
addGlobalStore(storeBuilder: StoreBuilder[_ <: StateStore], topic: String, consumed: Consumed[_, _], stateUpdateSupplier: TransformerSupplier[_, _])
The Processor that is passed into addGlobalStore() is used to maintain (i.e., write) the store. Note that it's currently expected that this Processor copies the data as-is into the store (cf. https://issues.apache.org/jira/browse/KAFKA-7663).
After you have added a global store, you can also add a Transformer, and the Transformer can access the store. Note that it's not required to connect a global store to make it available (only "regular" stores would need to be connected). Also note that a Transformer only gets read access to global stores.
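As a rough sketch of that pattern (the store name "rules-store" and the topic are placeholders carried over from the previous example), a Transformer can simply look the global store up in init() without connecting it:

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

// Reads from the global store added via addGlobalStore(); no connection needed.
public class EnrichingTransformer implements Transformer<String, String, KeyValue<String, String>> {

    private KeyValueStore<String, String> globalStore;

    @Override
    @SuppressWarnings("unchecked")
    public void init(final ProcessorContext context) {
        // Global stores are available to all processors/transformers, read-only.
        globalStore = (KeyValueStore<String, String>) context.getStateStore("rules-store");
    }

    @Override
    public KeyValue<String, String> transform(final String key, final String value) {
        final String rule = globalStore.get(key);                       // read access only
        final String enriched = rule == null ? value : value + "|" + rule;
        return KeyValue.pair(key, enriched);                            // key and value may change
    }

    @Override
    public void close() { }
}

// Usage in the DSL; no store name is passed, since global stores need no connection:
// stream.transform(EnrichingTransformer::new).to("output-topic");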
Use a Processor instead of a Transformer for all the transformations you want to perform on the input topic, whenever there is a use case of looking up data from a GlobalStateStore. Use context.forward(key, value, childName) to send the data to the downstream nodes. context.forward(key, value, childName) may be called multiple times in process() and punctuate(), so as to send multiple records to a downstream node. If there is a requirement to update the GlobalStateStore, do this only in the Processor passed to addGlobalStore(..), because there is a GlobalStreamThread associated with the GlobalStateStore which keeps the state of the store consistent across all the running KStream instances.
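A short sketch of that Processor-API alternative, forwarding several records per input to a named child node (using the newer Processor API of Kafka Streams 3.x, where the equivalent call is context.forward(record, childName); the node and topic names are invented for illustration):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.ProcessorSupplier;
import org.apache.kafka.streams.processor.api.Record;

public class FanOutTopology {

    // Forwards each input record multiple times to the downstream sink node.
    static class FanOutProcessor implements Processor<String, String, String, String> {
        private ProcessorContext<String, String> context;

        @Override
        public void init(final ProcessorContext<String, String> context) {
            this.context = context;
        }

        @Override
        public void process(final Record<String, String> record) {
            // forward() may be called any number of times per input record.
            for (final String part : record.value().split(",")) {
                context.forward(record.withValue(part), "output-sink");
            }
        }
    }

    public static Topology build() {
        final ProcessorSupplier<String, String, String, String> fanOut = FanOutProcessor::new;
        final Topology topology = new Topology();
        topology.addSource("input-source", Serdes.String().deserializer(),
                Serdes.String().deserializer(), "input-topic");
        topology.addProcessor("fan-out", fanOut, "input-source");
        topology.addSink("output-sink", "output-topic",
                Serdes.String().serializer(), Serdes.String().serializer(), "fan-out");
        return topology;
    }
}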

Storm: how to set up various metrics for the same data source

I'm trying to set up Storm to aggregate a stream, but with various (DRPC-available) metrics on the same stream.
E.g. the stream consists of messages that have a sender, a recipient, the channel through which the message arrived, and a gateway through which it was delivered. I'm having trouble deciding how to organize one or more topologies that could give me e.g. the total count of messages by gateway and/or by channel. And besides the totals, counts per minute would be nice too.
The basic idea is to have a spout that will accept messaging events, and to aggregate the data as needed from there. Currently I'm playing around with Trident and DRPC and I've come up with two possible topologies that solve the problem at this stage. I can't decide which approach is better, if any.
The entire source is available at this gist.
It has three classes:
RandomMessageSpout
used to emit the messaging data
simulates the real data source
SeparateTopology
creates a separate DRPC stream for each metric needed
also a separate query state is created for each metric
they all use the same spout instance
CombinedTopology
creates a single DRPC stream with all the metrics needed
creates a separate query state for each metric
each query state extracts the desired metric and groups results for it
Now, for the problems and questions:
SeparateTopology
is it necessary to use the same spout instance or can I just say new RandomMessageSpout() each time?
I like the idea that I don't need to persist grouped data by all the metrics, but just the groupings we need to extract later
is the spout-emitted data actually processed by all the state/query combinations, i.e., not just the first one that comes?
would this also later enable dynamic addition of new state/query combinations at runtime?
CombinedTopology
I don't really like the idea that I need to persist data grouped by all the metrics since I don't need all the combinations
it came as a surprise that all the metrics always return the same data
e.g. channel and gateway inquiries return status metrics data
I found that this was always the data grouped by the first field in state definition
this topic explains the reasoning behind this behaviour
but I'm wondering if this is a good way of doing things in the first place (and will find a way around this issue if need be)
SnapshotGet vs TupleCollectionGet in stateQuery
with SnapshotGet things tended to work, but not always; only TupleCollectionGet solved the issue
any pointers as to the correct way of doing that?
I guess this is a longish question / topic, but any help is really appreciated!
Also, if I missed the architecture entirely, suggestions on how to accomplish this would be most welcome.
Thanks in advance :-)
You can't actually split a stream in SeparateTopology by invoking newStream() using the same spout instance, since that would create new instances of the same RandomMessageSpout spout, which would result in duplicate values being emitted to your topology by multiple, separate spout instances. (Spout parallelization is only possible in Storm with partitioned spouts, where each spout instance processes a partition of the whole dataset -- a Kafka partition, for example).
The correct approach here is to modify the CombinedTopology to split the stream into multiple streams as needed for each metric you need (see below), and then do a groupBy() by that metric's field and persistentAggregate() on each newly branched stream.
From the Trident FAQ,
"each" returns a Stream object, which you can store in a variable. You can then run multiple eaches on the same Stream to split it, e.g.:
Stream s = topology.each(...).groupBy(...).aggregate(...)
Stream branch1 = s.each(...)
Stream branch2 = s.each(...)
See this thread on Storm's mailing list, and this one for more information.
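A hedged sketch of that branching approach in classic Trident (assuming Storm 1.x+ package names, that the gist's RandomMessageSpout is a regular IRichSpout, and that the field names "gateway" and "channel" match its declared fields; the DRPC function names are invented):

import org.apache.storm.LocalDRPC;
import org.apache.storm.topology.IRichSpout;
import org.apache.storm.trident.Stream;
import org.apache.storm.trident.TridentState;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.operation.builtin.MapGet;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;

public class MetricsTopology {

    public static TridentTopology build(IRichSpout messageSpout, LocalDRPC drpc) {
        TridentTopology topology = new TridentTopology();

        // Single spout, single stream stored in a variable...
        Stream messages = topology.newStream("messages", messageSpout);

        // ...branched once per metric, each branch persisting only its own grouping.
        TridentState countsByGateway = messages
                .groupBy(new Fields("gateway"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));

        TridentState countsByChannel = messages
                .groupBy(new Fields("channel"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));

        // One DRPC query per metric, each reading only the state it needs.
        topology.newDRPCStream("gateway-count", drpc)
                .groupBy(new Fields("args"))
                .stateQuery(countsByGateway, new Fields("args"), new MapGet(), new Fields("count"));

        topology.newDRPCStream("channel-count", drpc)
                .groupBy(new Fields("args"))
                .stateQuery(countsByChannel, new Fields("args"), new MapGet(), new Fields("count"));

        return topology;
    }
}

Each branch persists only the grouping it needs, so no state holds the full combination of all metrics.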
