Why do we use the TIBCO Mapper activity?

The tibco documentation says
The Mapper activity adds a new process variable to the process definition. This variable can be a simple datatype, a TIBCO ActiveEnterprise schema, an XML schema, or a complex structure.
So my question is: does the TIBCO Mapper only perform this simple function? We can also create process variables in the process definition (by right-clicking on the process definition). I searched Google, but nobody clearly explains why to use this activity; I also tried YouTube, where there is only one video and it does not explain it clearly. I am looking for an example of how it is used in large organizations, along with a real-world example. Thanks in advance.

The term "process variable" is a bit overloaded I guess:
The process variables that you define in the Process properties are stateful. You can use (read) their values anywhere in the process and you can change their values during the process using the Assign task (yellow diamond with a black equals sign).
The mapper activity produces a new output variable of that task that you can only use (read) in activities that are downstream from it. You cannot change its value after the mapper activity, as for any other activity's output.
The mapper activity is mainly useful to perform complex and reusable data mappings in it rather than in the mappers of other activities. For example, you have a process that has to map its input data into a different data structure and then has to both send this via a JMS message and log it to a file. The mapper allows you to perform the mapping only once rather than doing it twice (both in the Send JMS and Write to File activity).
You'll find that in real-world projects, the mapper activity is quite often used to perform data mapping independently of other activities; it simply gives a nicer structure to the processes. In contrast, the process variables defined in the Process properties, together with the Assign task, are used much less frequently.
Here's a very simple example, where you use the mapper activity once to set a process variable (here the filename) and then use it in two different following activities (create CSV File and Write File). Obviously, the mapper activity becomes more interesting if the mapping is not as trivial as here (though even in this simple example, you only have one place to change how the filename is generated rather than two):
Mapper Activity
First use of the filename variable in Create File
Second use of the filename variable in Write File

Process Variable/Assign Activity Vs Mapper Activity
The primary purpose of an Assign task is to store a variable at the process level. Any variable in an Assign task can be modified N times in a process. A Mapper, in contrast, is specifically used for introducing a new variable; you cannot change the same Mapper variable multiple times in a process.
Memory is allocated to a process variable when the process instance is created, but in the case of the TIBCO Mapper, memory is allocated only when the Mapper activity is executed in the process instance.
A process variable is allocated a single slot of memory which is used to update/modify the schema throughout the process instance's execution, i.e. N Assign activities will access the same memory allocated to the variable, whereas using N Mappers for the same schema will allocate memory N times.
An Assign activity can be used to accumulate the output of a TIBCO activity inside a group.

Related

How to go about parallelizing my processing using tbb::parallel_for and tbb::dataflow?

I have a source of files that I need to process.
From each file, my code generates a variable number of data objects, let's call it N.
I have K number of processing objects that can be used to process the N data objects.
I'm thinking of doing the following using tbb::dataflow:
Create a function_node with concurrency K and put my K processing objects into a concurrent_queue.
Use input_node to read file, generate the N data objects, and try_put each into the function_node.
The function_node body dequeues a processing object, uses it to process a data object, then returns the processing object back to the concurrent_queue when done.
Another way I can think of is possibly like so:
Create a function_node with serial concurrency.
Use input_node to read file, generate the N data objects, put the data objects into a collection and send over to the function_node.
At the function_node, partition the N objects into K ranges and use each of the K processing objects to process each range concurrently - not sure if it is possible to customize parallel_for for this purpose.
The advantage of the first method is probably lower latency because I can start sending data objects through the dataflow the moment they are generated rather than have to wait for all N data objects to be generated.
What do you think is the best way to go about parallelizing this processing?
Yes, you are right that the first method has the advantage of not waiting for all of the data objects to be generated before starting their processing. However, it also has the advantage of not waiting for parallel_for to finish processing all of the data objects passed to it. This becomes especially visible if the speed of processing varies per data object and/or per processing object.
Also, it seems enough to have a buffer_node followed by a (perhaps reserving) join_node, instead of a concurrent_queue, for storing processing objects for later reuse. In this case, the function_node would return the processing object back to the buffer_node once it finishes processing the data object. So the graph will look like the following:
input_node -> input_port<0>(join_node);
buffer_node -> input_port<1>(join_node);
join_node -> function_node;
function_node -> buffer_node;
In this case, the concurrency of the function_node can be left unlimited, as it is automatically throttled by the number of processing objects (available tokens) that exist in the graph.
Also, note that generating data objects from different files can be done in parallel as well. If you see a benefit from that, consider using a function_node instead of an input_node, as the latter is always serial. However, in this case, use a join_node with the queueing policy, since a function_node is not reservable.
Also, please consider using tbb::parallel_pipeline instead as it seems you have a classic pipelining scheme of processing. In particular, this and that link might be useful.

Access ProcessContext::forward from multiple user threads

Given: DSL topology with KStream::transform. As part of Transformer::transform execution multiple messages are generated from the input one (it could be thousands of output messages from the single input message).
New messages are generated based on the data retrieved from the database. To speed up the process I would like to create multiple user threads to access data in DB in parallel. Upon generating a new message the thread will call ProcessContext::forward to send the message downstream.
Is it safe to call ProcessContext::forward from the different threads?
It is not safe and not allowed to call ProcessorContext#forward() from a different thread. If you try it, an exception will be thrown.
As a workaround, you could let all threads "buffer" their result data, and collect all data in the next call to process(). As an alternative, you could also schedule a punctuation that collects and forwards the data from the different threads.
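A minimal, Kafka-free sketch of that buffering workaround, assuming string messages and a fixed worker pool (all class and method names here are illustrative): worker threads only append to a thread-safe buffer, and a single thread drains it and does the forwarding. In a real topology, the body of punctuate() would live inside a callback scheduled via ProcessorContext#schedule(), running on the stream thread, and would call forward() for each drained element.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Worker threads never forward directly; they only buffer their results.
// A single thread (the stream thread, via a punctuator) drains and forwards.
public class BufferedForwarding {
    private final ConcurrentLinkedQueue<String> buffer = new ConcurrentLinkedQueue<>();
    private final List<String> forwarded = new ArrayList<>(); // stands in for context.forward()

    public void processInParallel(List<String> inputs) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String in : inputs) {
            // Each worker buffers its result instead of forwarding it.
            pool.submit(() -> buffer.add(in.toUpperCase()));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }

    // In Kafka Streams this would be the scheduled punctuation callback,
    // calling ProcessorContext#forward() for each drained element.
    public List<String> punctuate() {
        String msg;
        while ((msg = buffer.poll()) != null) {
            forwarded.add(msg);
        }
        return forwarded;
    }
}
```

The key property is that only one thread ever "forwards", which is what Kafka Streams requires.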

Read ruleset topic/partition by multiple kafka stream instances of the same app

I have a Kafka Streams app that does some processing on a main event topic, and I also have a side topic that is used to apply a ruleset to the main event topic.
Until now the app was running as a single instance, and when a rule was applied, a static variable was set so that the other processing operator (the main topic consumer) could continue evaluating rules as expected. This was necessary since the rule stream would be written to a single partition, depending on the rule key literal, e.g. <"MODE", value>, and that way (through the static variable) all the other tasks involved would be made aware of the change.
Apparently, though, when deploying the application to multiple nodes this approach cannot work: with a single consumer group (across e.g. two app instances), only one instance would set its static variable to the correct value, and the other instance would never consume that rule value. (Setting each instance to a different group id would instead lead to the unwanted side-effect of consuming the main topic twice.)
On the other hand, a solution that uses the rule topic as a global table would mean the main processing operator has to query the global table every time an event is consumed, in order to retrieve the latest rules.
Is it possible to use some sort of global table listener that executes callback code and sets a static variable when a value is introduced in that topic?
Is there a better/alternative approach to resolve this issue?
Instead of a GlobalKTable, you can fall back to addGlobalStore() that allows you to execute custom code.
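A sketch of that wiring, as a topology fragment: the store name, topic name, serdes, and the RulesUpdateProcessor class are illustrative assumptions. The custom update processor receives every record from the rules topic on every app instance, writes it into the store, and can run arbitrary callback code at that point.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

// Register the rules topic as a global store with a custom update processor.
StreamsBuilder builder = new StreamsBuilder();

StoreBuilder<KeyValueStore<String, String>> rulesStore =
        Stores.keyValueStoreBuilder(
                Stores.inMemoryKeyValueStore("rules-store"),   // name is illustrative
                Serdes.String(), Serdes.String())
              .withLoggingDisabled();  // global stores must not have a changelog topic

builder.addGlobalStore(
        rulesStore,
        "rules-topic",                                         // the side/ruleset topic
        Consumed.with(Serdes.String(), Serdes.String()),
        () -> new RulesUpdateProcessor());  // hypothetical Processor: writes each rule
                                            // into the store and may run callback code
```

Because the global store is populated by a dedicated GlobalStreamThread on every instance, each instance observes every rule, avoiding the single-consumer-per-partition problem described above.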

Adding a global store for a transformer to consume

Is there a way to add a global store for a Transformer to use? In the docs for transformer it says:
"Transform each record of the input stream into zero or more records in the output stream (both key and value type can be altered arbitrarily). A Transformer (provided by the given TransformerSupplier) is applied to each input record and computes zero or more output records. In order to assign a state, the state must be created and registered beforehand via stores added via addStateStore or addGlobalStore before they can be connected to the Transformer"
yet the API for addGlobalStore takes a ProcessorSupplier:
addGlobalStore(storeBuilder: StoreBuilder[_ <: StateStore],
topic: String,
consumed: Consumed[_, _],
stateUpdateSupplier: ProcessorSupplier[_, _])
My end goal is to use the Kafka Streams DSL with a Transformer, since I need a flatMap and to transform both keys and values to my output topic. I do not have a Processor in my topology, though.
I would expect something like this:
addGlobalStore(storeBuilder: StoreBuilder[_ <: StateStore], topic: String, consumed: Consumed[_, _], stateUpdateSupplier: TransformerSupplier[_, _])
The Processor that is passed into addGlobalStore() is used to maintain (i.e., write) the store. Note that it is currently expected that this Processor copies the data as-is into the store (cf. https://issues.apache.org/jira/browse/KAFKA-7663).
After you have added a global store, you can also add a Transformer, and the Transformer can access the store. Note that it's not required to connect a global store to make it available (only "regular" stores need to be connected). Also note that a Transformer only gets read access to global stores.
Use a Processor instead of a Transformer for all the transformations you want to perform on the input topic whenever there is a use case of looking up data from the GlobalStateStore. Use context.forward(key, value, childName) to send the data to the downstream nodes; it may be called multiple times within process() and punctuate(), so as to send multiple records to a downstream node. If there is a requirement to update the GlobalStateStore, do this only in the Processor passed to addGlobalStore(..), because there is a GlobalStreamThread associated with the GlobalStateStore which keeps the state of the store consistent across all the running KStream instances.
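A sketch of such a Processor, assuming string keys and values; the store name ("rules-store") and child node names are illustrative. It reads from the global store and forwards more than one record per input, which a Transformer's single return value cannot express:

```java
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.To;
import org.apache.kafka.streams.state.KeyValueStore;

// A Processor that looks up rules in a global store and forwards multiple
// records downstream via named child nodes.
public class RuleLookupProcessor implements Processor<String, String> {
    private ProcessorContext context;
    private KeyValueStore<String, String> rules;  // global store: read-only here

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        // Global stores are available without being connected to this node.
        this.rules = (KeyValueStore<String, String>) context.getStateStore("rules-store");
    }

    @Override
    public void process(String key, String value) {
        String rule = rules.get(key);  // lookup only; never write a global store here
        if (rule != null) {
            // forward() may be called any number of times per input record
            context.forward(key, value + "|" + rule, To.child("enriched-sink"));
            context.forward(key, rule, To.child("audit-sink"));
        }
    }

    @Override
    public void close() {}
}
```

The child names passed to To.child() must match the downstream node names given when wiring the topology.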

Clustering the Batch Job & distributing the data load

I have a batch processing project that I want to cluster on 5 machines.
Suppose my input source is a database containing 1000 records.
I want to split these records equally, i.e. 200 records per instance of the batch job.
How could we distribute the workload?
Given below is the workflow that you may want to follow.
Assumptions:
You have the necessary domain objects for the respective DB table.
You have a batch flow configured wherein there is a reader/writer/tasklet mechanism.
You have a messaging system (messaging queues are a great way to make distributed applications talk to each other).
The input object is an object placed on the queue that contains the set of input records, split as per the required size.
The result object is an object placed on the queue that contains the processed records or the result value (if scalar).
The chunkSize is configured in a property file. Here it is 200.
Design:
In the application,
Configure a queueReader to read from a queue
Configure a queueWriter to write to a queue
If using the task/tasklet mechanism, configure different queues to carry the input/result objects.
Configure a DB reader which reads from a DB
Logic in the DBReader
Read records from the DB one by one, maintaining a count of the records read. If (count % chunkSize == 0), write the accumulated records to the inputMessage object and write that object to the queue.
Logic in the queueReader
Read the messages one by one.
For each message, do the necessary processing.
Create a resultObject.
Logic in the queueWriter
Read the resultObject (batch frameworks usually provide a way to ensure that writers are able to read the output from readers).
If any applicable processing or downstream interaction is needed, add it here.
Write the result object to the outputQueue.
Deployment
Package once, deploy multiple instances. For better performance, ensure that the chunkSize is small to enable fast processing. The queues are managed by the messaging system (The available systems in the market provide ways to monitor the queues) where you will be able to see the message flow.
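The DBReader chunking step above (emit a chunk whenever count % chunkSize == 0) can be sketched in plain Java, independent of any particular batch or messaging framework; the Chunker class name is illustrative, and "writing to the queue" is modeled as collecting the chunks into a list:

```java
import java.util.ArrayList;
import java.util.List;

public class Chunker {
    // Split records into fixed-size chunks, mirroring the DBReader logic:
    // accumulate records until count % chunkSize == 0, then emit the chunk
    // (in the real flow, write it as an inputMessage object to the queue).
    public static <T> List<List<T>> chunk(List<T> records, int chunkSize) {
        List<List<T>> chunks = new ArrayList<>();
        List<T> current = new ArrayList<>();
        int count = 0;
        for (T record : records) {
            current.add(record);
            count++;
            if (count % chunkSize == 0) {      // chunk is full: emit it
                chunks.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {              // flush the final partial chunk
            chunks.add(current);
        }
        return chunks;
    }
}
```

With 1000 records and chunkSize = 200 this yields 5 chunks, one per batch-job instance.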
