Apologies if this has already been covered here; I couldn't find anything closely related. I have a Kafka Streams app that reads from multiple topics, persists the records to a DB, and then publishes an event to an output topic. It is pretty straightforward and stateless in terms of Kafka local stores (topology below).
Topic1(T1) has 5 partitions, Topic2(T2) has a single partition. The issue here is, while consuming from two topics, if I want to go "full speed" with T1 (5 consumers), it doesn't guarantee that I will have dedicated consumers for each partition on T1. It will be distributed within the two topic partitions and I might end up with unbalanced consumers (and idle consumers), something like below:
[c1: t1p1, t1p3], [c2: t1p2, t1p5], [c3: t1p4, t2p1], [c4: (idle consumer)], [c5: (idle consumer)]
[c1: t1p1, t1p2], [c2: t1p5], [c3: t1p4, t2p1], [c4: (idle consumer)], [c5: t1p3]
With that said:
Is it a good practice having a topology that reads from multiple topics within the same KafkaStreams instance?
Is there any way to achieve a partition assignment like the following if I want to go "full speed" for T1? [c1: t1p1, t2p1], [c2: t1p2], [c3: t1p3], [c4: t1p4], [c5: t1p5]
Which of the topologies below is most optimal to what I want to achieve? Or is it completely unrelated?
Option A (Current topology)
Topologies:
Sub-topology: 0
Source: topic1-source (topics: [TOPIC1])
--> topic1-processor
Processor: topic1-processor (stores: [])
--> topic1-sink
<-- topic1-source
Sink: topic1-sink (topic: OUTPUT-TOPIC)
<-- topic1-processor
Sub-topology: 1
Source: topic2-source (topics: [TOPIC2])
--> topic2-processor
Processor: topic2-processor (stores: [])
--> topic2-sink
<-- topic2-source
Sink: topic2-sink (topic: OUTPUT-TOPIC)
<-- topic2-processor
Option B:
Topologies:
Sub-topology: 0
Source: topic1-source (topics: [TOPIC1])
--> topic1-processor
Source: topic2-source (topics: [TOPIC2])
--> topic2-processor
Processor: topic1-processor (stores: [])
--> response-sink
<-- topic1-source
Processor: topic2-processor (stores: [])
--> response-sink
<-- topic2-source
Sink: response-sink (topic: OUTPUT-TOPIC)
<-- topic2-processor, topic1-processor
If I use two streams, one for each topic, instead of a single stream with multiple topics, would that work for what I am trying to achieve?
config1.put("application.id", "app1");
KafkaStreams stream1 = new KafkaStreams(topologyTopic1, config1);
stream1.start();
config2.put("application.id", "app2");
KafkaStreams stream2 = new KafkaStreams(topologyTopic2, config2);
stream2.start();
The initial assignments you describe would never happen with Kafka Streams (nor with any default consumer configuration). If there are 5 partitions and you have 5 consumers, each consumer gets 1 partition assigned. For a plain consumer with a custom PartitionAssignor you could do the assignment differently, but all default implementations ensure proper load balancing.
Is it a good practice having a topology that reads from multiple topics within the same KafkaStreams instance?
There is no issue with that.
Is there any way to achieve a partition assignment like the following if I want to go "full speed" for T1? [c1: t1p1, t2p1], [c2: t1p2], [c3: t1p3], [c4: t1p4], [c5: t1p5]
Depending on how you write your topology, this would be the assignment Kafka Streams uses out of the box. Of your two options, Option B would result in this assignment.
Which of the topologies below is most optimal to what I want to achieve? Or is it completely unrelated?
As mentioned above, Option B would result in that assignment. For Option A, you could even use a 6th instance, and each instance would process exactly one partition: because there are two sub-topologies, you get 6 tasks (5 for sub-topology 0 and 1 for sub-topology 1), and sub-topologies are scaled out independently of each other. For Option B, you only get 5 tasks, because there is only one sub-topology and thus the maximum number of partitions over both input topics (that is, 5) determines the number of tasks.
If I use two streams, one for each topic, instead of a single stream with multiple topics, would that work for what I am trying to achieve?
Yes, it would be basically the same as Option A. However, you get two consumer groups and thus "two applications" instead of one.
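To make the task arithmetic concrete, here is a minimal sketch in plain Java (it does not use the Kafka Streams library). It assumes the rule stated in the answer: each sub-topology gets as many tasks as the largest partition count among its source topics, and sub-topologies are counted independently. The class and method names are hypothetical and exist only for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: models the task-count rule from the answer above.
final class TaskCountSketch {

    // Tasks for one sub-topology: the largest partition count among its source topics.
    static int tasksForSubTopology(List<String> sourceTopics, Map<String, Integer> partitions) {
        int max = 0;
        for (String topic : sourceTopics) {
            max = Math.max(max, partitions.get(topic));
        }
        return max;
    }

    // Sub-topologies scale independently, so the total is the sum over all of them.
    static int totalTasks(List<List<String>> subTopologies, Map<String, Integer> partitions) {
        int total = 0;
        for (List<String> sources : subTopologies) {
            total += tasksForSubTopology(sources, partitions);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> partitions = new LinkedHashMap<>();
        partitions.put("TOPIC1", 5);
        partitions.put("TOPIC2", 1);

        // Option A: one sub-topology per topic. Option B: both topics in one sub-topology.
        List<List<String>> optionA = List.of(List.of("TOPIC1"), List.of("TOPIC2"));
        List<List<String>> optionB = List.of(List.of("TOPIC1", "TOPIC2"));

        System.out.println("Option A tasks: " + totalTasks(optionA, partitions));
        System.out.println("Option B tasks: " + totalTasks(optionB, partitions));
    }
}
```

With the partition counts from the question, this yields 6 tasks for Option A and 5 for Option B, which is why a 6th instance is only useful with Option A.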
I have a broadcast function in Flink that accepts two Kinesis streams, one for element A and one for broadcast element B. I noticed that all element A records go into one task slot even though I have already set the env parallelism to 4.
Here is the main process function:
env.setParallelism(4);
BroadcastStream<ElementBroadcast> elementBroadcastStream =
env.addSource(elementBroadcastSource)
.uid("element-broadcast")
.name("broadcast")
.setParallelism(4)
.returns(ElementB.class)
.broadcast(Descriptors.ELEMENT_B_DESCRIPTORS);
DataStream<ElementA> elementAStream =
elementASourceStream
.connect(elementBroadcastStream)
.process(injector.getInstance(
ElementAElementBProcessFunction.class))
.uid("");
The strange thing is that when I check the Flink job or read the metrics I added inside ElementAElementBProcessFunction, only the metrics in processBroadcastElement() confirm that all 4 task slots receive element B. processElement() behaves like a single-threaded function, and you can also see from the attached screenshots that all the records (element A) are received on slot 3. The other three slots each receive the 2 broadcast elements (element B) from my application, but no element A at all.
Does anyone know why the multi-slot parallelism only appears inside processBroadcastElement() but not processElement()?
Thank you!
This might be because source A has only 1 partition (a single Kinesis shard); you can check that in your AWS Management Console, or call rebalance or rescale before process. As for element B, you broadcast it, which guarantees that all elements go to all downstream tasks.
I have a Topology :
Topology builder = new Topology();
builder.addSource("source",stringDeserializer,stringDeserializer, "TOPIC-DEV-ACH")
.addProcessor("process1", ProcessorOne::new , "source")
.addProcessor("process2", ProcessorTwo::new , "source")
.addProcessor("process3", ProcessorThree::new , "source")
.addSink("sink", "asink" ,stringSerializer, stringSerializer, "process1","process2","process3");
If I log:
Thread.currentThread().getName() in process(K var1, V var2)
result :
processor1 97527H7-e45cfcd3-6fb7-4fa9-b6a1-b3f5ed122304-StreamThread-1
processor2 97527H7-e45cfcd3-6fb7-4fa9-b6a1-b3f5ed122304-StreamThread-1
processor3 97527H7-e45cfcd3-6fb7-4fa9-b6a1-b3f5ed122304-StreamThread-1
I want multithreading, so that each Processor executes in its own thread and the results are then merged. Is that possible with the Kafka Streams library?
A KafkaStreams instance uses StreamThreads for stream processing.
The number of StreamThreads is controlled by StreamsConfig.NUM_STREAM_THREADS_CONFIG (num.stream.threads) configuration property that defaults to 1 and hence what you see.
Please note that although your Kafka Streams application can use multiple threads, a single topology (with all processors) is executed by a single thread.
A single thread executing a whole topology is simply a Kafka consumer of the source topics, so the number of threads that can usefully process a single topology is bounded by the number of partitions of those topics (shared among the KafkaStreams instances of the Kafka Streams application).
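As a small illustration of the setting mentioned above, the sketch below builds a configuration with num.stream.threads raised from its default of 1. It uses only java.util.Properties so that it runs without the Kafka Streams library; the application id and broker address are placeholder values, and the class name is hypothetical.

```java
import java.util.Properties;

// Illustration only: builds the properties one would pass to a KafkaStreams instance.
final class StreamThreadsConfigSketch {

    static Properties buildConfig(int streamThreads) {
        Properties props = new Properties();
        props.put("application.id", "my-streams-app");     // placeholder value
        props.put("bootstrap.servers", "localhost:9092");  // placeholder value
        // Same key as StreamsConfig.NUM_STREAM_THREADS_CONFIG; defaults to 1 when unset.
        props.put("num.stream.threads", String.valueOf(streamThreads));
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildConfig(3);
        System.out.println("num.stream.threads = " + props.getProperty("num.stream.threads"));
    }
}
```

Note that this only helps when the source topic has several partitions: extra threads take over whole tasks, while process1, process2 and process3 still run in the same thread for any given record.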
I am building a Storm topology and I'm dealing with strings in this format: "x-x-x-x", where x is some digit. I want the stream of strings to be split equally between 4 bolts.
The problem is that with the following code, all the bolts get all the tuples, instead of each tuple being sent to exactly one bolt:
builder.setSpout("digits-spout", new ReaderSpout());
builder.setBolt("level-1", new SomeBolt(1)).shuffleGrouping("digits-spout");
builder.setBolt("level-2", new SomeBolt(2)).shuffleGrouping("digits-spout");
builder.setBolt("level-3", new SomeBolt(3)).shuffleGrouping("digits-spout");
builder.setBolt("level-4", new SomeBolt(4)).shuffleGrouping("digits-spout");
As you can see, I use the same bolt class but with a different constructor argument.
Thanks!
As I understand your question, I would suggest adding an extra bolt, as in the following example:
builder.setSpout("digits-spout", new ReaderSpout());
builder.setBolt("stringSplitterBoltName", new StringSplitterBolt(1)).shuffleGrouping("digits-spout");
builder.setBolt("level-1", new SomeBolt(1)).shuffleGrouping("stringSplitterBoltName");
builder.setBolt("level-2", new SomeBolt(2)).shuffleGrouping("stringSplitterBoltName");
builder.setBolt("level-3", new SomeBolt(3)).shuffleGrouping("stringSplitterBoltName");
builder.setBolt("level-4", new SomeBolt(4)).shuffleGrouping("stringSplitterBoltName");
If you want the bolts to have different processing logic, you can just add 4 tasks of the same bolt. In this case, messages are distributed randomly between the bolt instances. You can check the string value within that bolt and take the appropriate execution path. This way you avoid a separate codebase for 4 bolts.
Alternatively, if you want to have separate bolt code for the strings, go for the above suggestion by zackeriya.
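The difference between "four bolts subscribed to the spout" and "one bolt with four tasks" can be shown with a toy model in plain Java. This is not Storm code: the class below is hypothetical, and round-robin stands in for the random but even distribution that shuffle grouping provides. In Storm itself, the second case would correspond to passing a parallelism hint of 4 as the third argument of setBolt.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: mimics how tuples are routed, without using Storm.
final class ShuffleGroupingToy {

    // Four separate bolts, each subscribed to the spout: every bolt sees every tuple.
    static List<List<String>> fourSubscribers(List<String> tuples) {
        List<List<String>> received = new ArrayList<>();
        for (int bolt = 0; bolt < 4; bolt++) {
            received.add(new ArrayList<>(tuples));
        }
        return received;
    }

    // One bolt with four tasks: each tuple goes to exactly one task.
    // Round-robin is used here as a stand-in for shuffle grouping.
    static List<List<String>> oneBoltFourTasks(List<String> tuples) {
        List<List<String>> received = new ArrayList<>();
        for (int task = 0; task < 4; task++) {
            received.add(new ArrayList<>());
        }
        for (int i = 0; i < tuples.size(); i++) {
            received.get(i % 4).add(tuples.get(i));
        }
        return received;
    }

    public static void main(String[] args) {
        List<String> tuples = List.of("1-2-3-4", "5-6-7-8", "9-0-1-2", "3-4-5-6");
        System.out.println("four subscribers: " + fourSubscribers(tuples));
        System.out.println("one bolt, four tasks: " + oneBoltFourTasks(tuples));
    }
}
```

In the first case every list contains all four strings, which matches the behaviour described in the question; in the second, each list contains exactly one.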
I am using Storm with a Trident topology, but I am not able to understand how the parallelism is attained; what I see in the Storm UI differs from my calculation.
Here's the code that assigns the number of workers:
public Config getTopologyConfiguration() {
Config conf = new Config();
//conf.setDebug(true);
conf.setNumWorkers(6);
conf.setMessageTimeoutSecs(100);
return conf;
}
And here is the stream processing code:
s.name("aggregation_stream")
.parallelismHint(invoiceAggregationConfig.getSpoutParallelism())
.partitionBy(groupedFields)
.partitionAggregate(aggregateInputFields,
new GenericAggregator(groupedFields, aggregatedFieldsList, aggregateFieldsOperationList),
aggregatorOutputFields)
.parallelismHint(invoiceAggregationConfig.getAggregationParallelism())
.shuffle()
.each(aggregatorOutputFields,
new CreatePaymentFromInvoices(paymentType, groupMap, aggMap, paymentExtraParams),
Const.PAYMENT_FIELD)
.each(TridentUtils.fieldsConcat(aggregatorOutputFields, Const.PAYMENT_FIELD),
new CreateApplicationFromPaymentAndInvoices(invoiceType),
Const.APPLICATIONS_FIELD)
.each(TridentUtils.fieldsConcat(aggregatorOutputFields, Const.PAYMENT_FIELD, Const.APPLICATIONS_FIELD),
new RestbusFilterForPaymentAndApplications(environment, bu, serviceConfiguration))
.parallelismHint(invoiceAggregationConfig.getPersistenceParallelism());
And the parallelism attributes used in the code above are:
spoutParallelism: 3
aggregationParallelism: 6
persistenceParallelism: 6
Now, according to my calculation, the number of executors should be
3*6 + 6 = 24
But the Storm UI shows 23. How?
EDITED
Adding a new screenshot with information about the individual components.
Here I can see that the number of executors and tasks is 50, but I didn't set any configuration for this. Does Storm provide this itself?
Secondly, the number of emitted tuples is huge. I am not producing this much data; this is more than 100 times as many tuples. How come this many tuples are showing in the UI?
The number of emitted tuples can be huge.
Reason: when a spout emits a tuple it expects an ack; if the ack is not received, it re-emits the tuple, so the emitted and transferred counts can be much higher than the amount of data you produce. (Check the acked count: it is small compared with the emitted count.)
Can the following design be accomplished in Storm?
Let's take the word count example found at the following URL:
https://github.com/nathanmarz/storm-starter/blob/master/src/jvm/storm/starter/WordCountTopology.java
I am changing the word generator spout to a file reader spout
The design for this Word Count Topology is
1. Spout to read file and create sentences line by line
2. Bolt to split sentences to words
3. Bolt to add unique words and give a word and its corresponding count
So in a way the topology is describing the flow a file needs to take to count the unique words it has.
If I have two files, file 1 and file 2, one should be able to call the same topology and create two instances of it to run the same word count.
In order to track whether the word count has indeed finished, the instances of the word count topology should have a completed status once the file has been processed.
In the current design of Storm, I find that the topology is the actual instance, so it is like a task.
One needs to make two different calls with different topology names, like
for file 1
StormSubmitter.submitTopology("WordCountTopology1", conf,builder.createTopology());
for file 2
StormSubmitter.submitTopology("WordCountTopology2", conf,builder.createTopology());
not to mention uploading the same jar again using the storm client
storm jar stormwordcount-1.0.0-jar-with-dependencies.jar com.company.WordCount1Main.App "server" "filepath1"
storm jar stormwordcount-1.0.0-jar-with-dependencies.jar com.company.WordCount2Main.App "server" "filepath2"
The other issue is that the topologies don't complete once the file is processed. They stay alive until we issue a kill on the topology
storm kill "WordCountTopology"
I understand that in a streaming world, where the messages come from a message queue like Kafka, there is no end of messages, but how is that relevant in the file world, where the entities/messages are fixed?
Is there an API that does the following?
//creates the topology, this is done one time using the storm to upload the respective jars
StormSubmitter.submitTopology("WordCountTopology", conf,builder.createTopology());
Once uploaded, the application code just instantiates the topology with the arguments
//creates an instance of the topology and give a status tracker
JobTracker tracker = StormSubmitter.runTopology("WordCountTopology", conf, args);
//Can query the Storm for the current job if its complete or not
JobStatus status = StormSubmitter.getTopologyStatus(conf, tracker);
For reusing the same topology twice, you have two possibilities:
1) Use a constructor parameter for your file spout and instantiate the same topology twice with different parameters:
private static StormTopology createMyTopology(String filename) {
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("File Spout", new FileSpout(filename));
// add further spouts and bolts etc.
return builder.createTopology();
}
public static void main(String[] args) {
String file1 = "/path/to/file1";
String file2 = "/path/to/file2";
Config c = new Config();
if(useFile1) {
StormSubmitter.submitTopology("T1", c, createMyTopology(file1));
} else {
StormSubmitter.submitTopology("T1", c, createMyTopology(file2));
}
}
2) As an alternative, you could configure your file spout in the open() method.
public class FileSpout implements IRichSpout {
@Override
public void open(Map conf, ...) {
String filename = (String)conf.get("FILENAME");
// ...
}
// other methods omitted
}
public static void main(String[] args) {
String file1 = "/path/to/file1";
String file2 = "/path/to/file2";
Config c = new Config();
if(useFile1) {
c.put("FILENAME", file1);
} else {
c.put("FILENAME", file2);
}
// assembly topology...
StormSubmitter.submitTopology("T", c, builder.createTopology());
}
For your second question: there is no API in Storm that terminates a topology automatically. You could use TopologyInfo and monitor the number of emitted tuples of the spout. If it does not change for some time, you can assume that the whole file has been read and then kill the topology.
Config cfg = new Config();
// set NIMBUS_HOST and NIMBUS_THRIFT_PORT in cfg
Client client = NimbusClient.getConfiguredClient(cfg).getClient();
TopologyInfo info = client.getTopologyInfo("topologyName");
// get emitted tuples...
client.killTopology("topologyName");
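The "does not change for some time" check can be sketched independently of Storm. In the sketch below, the LongSupplier is a hypothetical stand-in for whatever code reads the spout's emitted count out of TopologyInfo; the class name, method name and parameters are likewise assumptions made for this illustration.

```java
import java.util.function.LongSupplier;

// Illustration only: polls a counter and reports when it has stopped changing.
final class IdleSpoutDetector {

    // Returns true once the count was unchanged for `stablePolls` consecutive polls,
    // or false if that never happened within `maxPolls` polls (or on interruption).
    static boolean waitUntilStable(LongSupplier emittedCount, int stablePolls,
                                   int maxPolls, long pollIntervalMs) {
        long last = emittedCount.getAsLong();
        int unchanged = 0;
        for (int poll = 0; poll < maxPolls; poll++) {
            try {
                Thread.sleep(pollIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
            long current = emittedCount.getAsLong();
            if (current == last) {
                unchanged++;
                if (unchanged >= stablePolls) {
                    return true;
                }
            } else {
                unchanged = 0;
                last = current;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Fake counter: grows on the first reads, then stays at 5.
        long[] reads = {0};
        LongSupplier fake = () -> Math.min(++reads[0], 5);
        boolean idle = waitUntilStable(fake, 3, 20, 10);
        System.out.println(idle ? "spout looks idle; the topology could be killed now"
                                : "spout is still emitting");
    }
}
```

The choice of poll interval and the number of stable polls is a judgment call: too short, and a slow file read may be mistaken for the end of the file.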
The word count topology mentioned in the post doesn't do justice to the power of Storm. Storm is a stream processor, so it requires a stream, and files are by definition static. I sympathize with the Storm developers: a simple hello-world example was needed to showcase the topology concepts, and a non-streaming source like a file was chosen for it. For newcomers learning Storm, as I was at the time, that made it difficult to understand how to develop real applications from the example. The example is just a way to show how Storm concepts work, not a real-world application of how files would arrive or need to be processed.
So here is my take on what one solution could look like.
Since topologies run all the time, they can compute the word count for as long as one wants, i.e. within a file or across all files for any period of time.
To allow different files to come in, we need a streaming spout, so naturally you would need a Kafka message broker or similar to receive files as a stream. Depending on the size of the file and the limits the message broker imposes (Kafka, for example, limits message size to about 1 MB by default), we could either send the file itself as the payload or send a reference to the file, in which case you would need a distributed file system such as Hadoop DFS, or a NAS, to store the file.
We then read these files using a Kafka spout as opposed to a FileSpout.
We now have the following issues
1. Word Count Across Files
2. Word Count per File
3. Running Status on the word count till it is processed
4. When do we know if a file is processed or complete
Word Count Across Files
This is the use case the provided example targets: if we continue to stream the files and, for each file, read the lines, split them into words, and send the words to other bolts, the bolts count the words regardless of which file they came from.
File1 A quick brown fox jumped ...
File2 Once upon a time a fox ...
Field Grouping
quick
brown
fox
...
Once
upon
fox (not needed as it came in file 1)
...
Word Count Per File
To do this, the field used for the fields grouping needs to be the word combined with the fileId. So the example needs to change to include a fileId with each word it splits.
So, given
File1 A quick brown fox jumped ...
File2 Once upon a time a fox ...
the fields grouping on word would be (leaving out the noise words)
File1_quick
File1_brown
File1_fox
File2_once
File2_upon
File2_fox
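A minimal sketch of the composite key in plain Java (no Storm classes) might look as follows. The class name, the "_" separator and the lower-casing are assumptions chosen to match the keys listed above; in a topology, the same key would be what the fields grouping partitions on.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustration only: counts words per file by keying on fileId + "_" + word.
final class PerFileWordCount {
    private final Map<String, Integer> counts = new HashMap<>();

    // Composite key, e.g. key("File2", "Once") gives "File2_once".
    static String key(String fileId, String word) {
        return fileId + "_" + word.toLowerCase(Locale.ROOT);
    }

    void add(String fileId, String word) {
        counts.merge(key(fileId, word), 1, Integer::sum);
    }

    int count(String fileId, String word) {
        return counts.getOrDefault(key(fileId, word), 0);
    }

    public static void main(String[] args) {
        PerFileWordCount wc = new PerFileWordCount();
        for (String w : "A quick brown fox jumped".split(" ")) {
            wc.add("File1", w);
        }
        for (String w : "Once upon a time a fox".split(" ")) {
            wc.add("File2", w);
        }
        // "fox" is now counted separately for each file.
        System.out.println("File1_fox = " + wc.count("File1", "fox"));
        System.out.println("File2_fox = " + wc.count("File2", "fox"));
    }
}
```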
Running Status on the word count till it is processed
Since all these counts are in the memory of the bolt and we don't know the EOF, there is no way to get the status unless someone peeks into the bolt or we send the counts periodically to another data store where we can query them. This is exactly what we need to do: at periodic intervals, persist the in-memory bolt counts to a data store such as HBase, Elasticsearch, MongoDB, etc.
When do we know if a file is processed or complete
Perhaps this is the toughest question to answer in the streaming world. The stream processor doesn't know that the stream is finished: from its perspective, files keep coming in, and it needs to split each file into words and count them in the corresponding bolts. The bolts don't know what happened before or after a tuple reached them.
This entire thing needs to be done by the app developer.
One way to do this is to count the total words when each file is read and send a message:
File 1 : Total Words : 1000
File 2 : Total Words : 2000
Now, when we do the word count per file (File1_*), the sum of the individual word counts should match the total words before we say a file is complete. All of this is custom logic we would need to write before we can say it is complete.
So in essence, Storm provides the framework to do stream processing in a variety of ways. It is the application developer's job to work with that design and implement their own logic depending on the use case. It doesn't provide application use cases out of the box or a good reference implementation, which I think we need to build ourselves, as it is not a commercial product and depends on the community to champion it.
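The matching of counted words against the announced total could be sketched as below. This is plain Java rather than a Storm bolt, and the class and method names are hypothetical; it assumes the total-words message and the individual word tuples for a file can arrive in any order.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: a file is complete when the words counted equal the announced total.
final class FileCompletionTracker {
    private final Map<String, Long> expectedTotals = new HashMap<>();
    private final Map<String, Long> countedSoFar = new HashMap<>();

    // Called when the "File N : Total Words : X" message arrives.
    void announceTotal(String fileId, long totalWords) {
        expectedTotals.put(fileId, totalWords);
    }

    // Called once for every word counted for the given file.
    void recordWord(String fileId) {
        countedSoFar.merge(fileId, 1L, Long::sum);
    }

    // Complete only if a total was announced and the counted words have reached it.
    boolean isComplete(String fileId) {
        Long expected = expectedTotals.get(fileId);
        long counted = countedSoFar.getOrDefault(fileId, 0L);
        return expected != null && counted == expected;
    }

    public static void main(String[] args) {
        FileCompletionTracker tracker = new FileCompletionTracker();
        tracker.announceTotal("File1", 3);
        tracker.recordWord("File1");
        tracker.recordWord("File1");
        System.out.println("after 2 of 3 words: " + tracker.isComplete("File1"));
        tracker.recordWord("File1");
        System.out.println("after 3 of 3 words: " + tracker.isComplete("File1"));
    }
}
```

In a real topology this state would live in a bolt or an external store, and replayed tuples would need to be handled so that words are not counted twice.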