For example, say there is a pipeline made of 3 processors: P1, P2, P3. When P2 produces an output flowfile, I want processor P3 to process it exactly 5 minutes later.
I can't use a fixed CRON schedule because the P2 processor can run at any time.
NiFi version: 1.9.1
Look at the RetryFlowFile processor with
Maximum Retries = 1, placed between P2 and P3.
It can penalize the flow file once retries are exceeded; with Maximum Retries = 1 that happens immediately.
Then set the Penalty Duration to 5 min.
All set: P3 will not take the flow file from the queue during those 5 minutes.
Option 2
You could use ExecuteGroovyScript in place of RetryFlowFile with the following script to penalize everything that goes through it.
def ff = session.get()
if( !ff ) return
ff = session.penalize(ff)  // apply the processor's Penalty Duration to the flow file
REL_SUCCESS << ff          // transfer the penalized flow file to the success relationship
PS: don't forget to set the Penalty Duration for this processor to 5 min.
Apologies if this has already been covered here; I couldn't find anything closely related. I have a Kafka Streams app which reads from multiple topics, persists the records to a DB and then publishes an event to an output topic. Pretty straightforward, and it's stateless in terms of Kafka local stores. (Topologies below.)
Topic1 (T1) has 5 partitions, Topic2 (T2) has a single partition. The issue is that, while consuming from two topics, if I want to go "full speed" with T1 (5 consumers), there is no guarantee that I will have a dedicated consumer for each partition of T1. The partitions of the two topics will be distributed across the consumers, and I might end up with unbalanced (and idle) consumers, something like below:
[c1: t1p1, t1p3], [c2: t1p2, t1p5], [c3: t1p4, t2p1], [c4: (idle consumer)], [c5: (idle consumer)]
[c1: t1p1, t1p2], [c2: t1p5], [c3: t1p4, t2p1], [c4: (idle consumer)], [c5: t1p3]
With that said:
Is it good practice to have a topology that reads from multiple topics within the same KafkaStreams instance?
Is there any way to achieve a partition assignment like the following if I want to go "full speed" for T1? [c1: t1p1, t2p1], [c2: t1p2], [c3: t1p3], [c4: t1p4], [c5: t1p5]
Which of the topologies below is better suited to what I want to achieve? Or is it completely unrelated?
Option A (Current topology)
Topologies:
   Sub-topology: 0
    Source: topic1-source (topics: [TOPIC1])
      --> topic1-processor
    Processor: topic1-processor (stores: [])
      --> topic1-sink
      <-- topic1-source
    Sink: topic1-sink (topic: OUTPUT-TOPIC)
      <-- topic1-processor
   Sub-topology: 1
    Source: topic2-source (topics: [TOPIC2])
      --> topic2-processor
    Processor: topic2-processor (stores: [])
      --> topic2-sink
      <-- topic2-source
    Sink: topic2-sink (topic: OUTPUT-TOPIC)
      <-- topic2-processor
Option B:
Topologies:
   Sub-topology: 0
    Source: topic1-source (topics: [TOPIC1])
      --> topic1-processor
    Source: topic2-source (topics: [TOPIC2])
      --> topic2-processor
    Processor: topic1-processor (stores: [])
      --> response-sink
      <-- topic1-source
    Processor: topic2-processor (stores: [])
      --> response-sink
      <-- topic2-source
    Sink: response-sink (topic: OUTPUT-TOPIC)
      <-- topic2-processor, topic1-processor
If I use two KafkaStreams instances, one per topic, instead of a single instance with multiple topics, would that work for what I am trying to achieve?
config1.put("application.id", "app1");
KafkaStreams stream1 = new KafkaStreams(config1, topologyTopic1);
stream1.start();
config2.put("application.id", "app2");
KafkaStreams stream2 = new KafkaStreams(config2, topologyTopic2);
stream2.start();
The initial assignments you describe would never happen with Kafka Streams (and also not with any default consumer config). If there are 5 partitions and you have 5 consumers, each consumer gets 1 partition assigned (with a plain consumer and a custom PartitionAssignor you could do the assignment differently, but all default implementations ensure proper load balancing).
Is it good practice to have a topology that reads from multiple topics within the same KafkaStreams instance?
There is no issue with that.
Is there any way to achieve a partition assignment like the following if I want to go "full speed" for T1? [c1: t1p1, t2p1], [c2: t1p2], [c3: t1p3], [c4: t1p4], [c5: t1p5]
Depending on how you write your topology, this would be the assignment Kafka Streams uses out of the box. Of your two options, Option B would result in this assignment.
Which of the topologies below is better suited to what I want to achieve? Or is it completely unrelated?
As mentioned above, Option B would result in the assignment above. For Option A, you could actually even use a 6th instance and each instance would process exactly one partition (because there are two sub-topologies, you get 6 tasks: 5 for sub-topology 0 and 1 for sub-topology 1; sub-topologies are scaled out independently of each other). For Option B, you only get 5 tasks, because there is only one sub-topology and thus the maximum number of partitions over both input topics (that is, 5) determines the number of tasks.
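For illustration, a minimal sketch of how an Option B style topology could be wired with the Processor API (the processor classes Topic1Processor and Topic2Processor are placeholders, not from the original post):
import org.apache.kafka.streams.Topology;

// Topic1Processor / Topic2Processor are placeholder Processor implementations.
Topology topology = new Topology();
topology.addSource("topic1-source", "TOPIC1");
topology.addSource("topic2-source", "TOPIC2");
topology.addProcessor("topic1-processor", Topic1Processor::new, "topic1-source");
topology.addProcessor("topic2-processor", Topic2Processor::new, "topic2-source");
// Connecting both processors to a single sink keeps everything in one
// sub-topology, so the task count is max(5, 1) = 5 and each task owns one
// partition of TOPIC1 (task 0 additionally owns the single TOPIC2 partition).
topology.addSink("response-sink", "OUTPUT-TOPIC", "topic1-processor", "topic2-processor");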
If I use two KafkaStreams instances, one per topic, instead of a single instance with multiple topics, would that work for what I am trying to achieve?
Yes, it would be basically the same as Option A -- however, you get two consumer groups and thus "two applications" instead of one.
Seeking advice on the COA correlation issue described below.
Background: there is an application A which feeds data to an application B via MQ (nothing special: a remote queue definition pointing to the local queue definition on the remote queue manager), where the sending app A requests COAs (confirm-on-arrival reports). That has been a stable setup working for years:
App A -> QM.A[Q1] -channel-> QM.B[Q2] -> App B
Here:
Q1 is a remote queue definition pointing to Q2.
Problem: there is an application C which requires exactly the same data feed that A is sending to B via MQ, so the data feed has to be duplicated, subject to the following constraint.
Constraint: neither the code nor the app config of applications A and B can be changed; duplication of the data feed from A to B must be transparent to both applications. A keeps putting messages to the same queue Q1 on QM.A, and B keeps getting messages from the same queue Q2 on QM.B.
Proposed solution: duplicate the feed on the MQ layer by creating a topic/subscriber configuration on app B's queue manager (a rough MQSC sketch follows the list below):
App A -> QM.A[Q1] -channel-> QM.B[QA->T->{S2,S3}->{Q2,Q3}] -> {App B, QM.C[Q4] -> App C}
Here:
Q1 - has its RNAME updated to point to QA (the alias for the topic) instead of Q2
QA - queue alias for topic T
T - topic
S2, S3 - subscriptions delivering the publications to Q2 and Q3
Q2 - unchanged, the same local queue definition App B consumes from
Q3 - remote queue definition pointing to Q4
Q4 - local queue definition on QM.C, the queue receiving the copy of the messages sent from A to B
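For illustration, a rough MQSC sketch of that configuration (object names, the topic string and the transmission queue name are assumptions for the example, not taken from the original setup):
* on QM.B (names and topic string below are example placeholders)
DEFINE TOPIC(FEED.TOPIC) TOPICSTR('feed/a.to.b')
DEFINE QALIAS(QA) TARGET(FEED.TOPIC) TARGTYPE(TOPIC)
DEFINE QREMOTE(Q3) RNAME(Q4) RQMNAME(QM.C) XMITQ(QM.C.XMITQ)
DEFINE SUB(S2) TOPICSTR('feed/a.to.b') DEST(Q2)
DEFINE SUB(S3) TOPICSTR('feed/a.to.b') DEST(Q3)
* on QM.A: repoint the existing remote queue at the alias
ALTER QREMOTE(Q1) RNAME(QA)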
With this setup, duplication of the messages from app A to apps B and C works fine.
But ... there is an issue.
Issue: application A is not able to correlate the COAs, and that is the problem.
I'm not sure whether app A is unable to correlate COAs at all, or (the more likely guess) it is unable to correlate the additional COAs, e.g. the ones coming from QM.C.
Any idea or advice is very much appreciated.
Let's assume I have two documents in the same collection/partition, both at "version 1": A1, B1.
I update A1 -> A2; the write operation returns a session token SA.
Using SA to read document A will guarantee I get version A2.
Now I update B1 -> B2, and get a new session token SB.
Using SB to read document B will guarantee I get version B2.
My question is:
Does using token SB guarantee I can see older writes as well?
I.e. will reading A with token SB always get me A2 ?
Yes. In your case SB > SA, and hence reading with SB will also ensure you see the latest version of A.
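For illustration, a minimal sketch of that flow with the Azure Cosmos DB Java SDK v4 (the endpoint, database/container names, the MyDoc type and the updatedB instance are placeholders assumed for the example):
import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.models.CosmosItemRequestOptions;
import com.azure.cosmos.models.CosmosItemResponse;
import com.azure.cosmos.models.PartitionKey;

// MyDoc and updatedB are placeholders for the document type and the updated B document.
CosmosClient client = new CosmosClientBuilder()
        .endpoint("https://<account>.documents.azure.com:443/")
        .key("<key>")
        .buildClient();
CosmosContainer container = client.getDatabase("db").getContainer("docs");

// Update B1 -> B2 and capture the session token returned by the write (SB).
CosmosItemResponse<MyDoc> writeB = container.upsertItem(updatedB);
String sb = writeB.getSessionToken();

// Read A using SB: the token is at least as recent as SA, so the read
// returns A2 (or newer), never A1.
CosmosItemRequestOptions options = new CosmosItemRequestOptions();
options.setSessionToken(sb);
CosmosItemResponse<MyDoc> readA =
        container.readItem("A", new PartitionKey("A"), options, MyDoc.class);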
Can 1 TaskTracker run multiple JVMs?
Here is the scenario:
Assume there are 2 files (A & B) and 2 Data nodes (D1 & D2).
When you load A, assume it is getting split into A1 & A2 on D1 & D2
and when you load B, assume it is getting split into B1 & B2 on D1 & D2.
For some reason let us assume D1 is busy with some other tasks
while D2 is available, and a couple of jobs are submitted,
one using file A and the other one using file B.
So now D2 is available and has blocks A2 & B2.
Will the JobTracker submit the code to the TaskTracker on D2 and run the tasks for A2 and B2 at the same time, or
will it first run A2 and only run B2 after it finishes?
If so, is it possible to run both tasks in parallel, which would mean 1 TaskTracker and 2 JVMs, or will it create/spawn 2 TaskTrackers on D2?
By default the TaskTracker spawns one JVM for each task.
You can reuse JVMs by setting this configuration parameter: mapred.job.reuse.jvm.num.tasks.
A TaskTracker (TT) can launch multiple map or reduce tasks in parallel on a single machine. By default a TT launches 2 map tasks (mapreduce.tasktracker.map.tasks.maximum) and 2 reduce tasks (mapreduce.tasktracker.reduce.tasks.maximum). The properties have to be configured in mapred-site.xml (mapred-default.xml only holds the shipped defaults).
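For reference, a sketch of those settings in mapred-site.xml (the values shown are just the defaults mentioned above; -1 for JVM reuse means no limit):
<configuration>
  <!-- Maximum number of map/reduce tasks a single TaskTracker runs in parallel -->
  <property>
    <name>mapreduce.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
  <!-- Reuse a JVM for this many tasks of the same job; -1 means unlimited reuse -->
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>
  </property>
</configuration>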