I have two Kafka topics, BACKUPDATA and LIVEDATA.
What is the best way to read both topics?
1. Two different topologies?
2. One topology with two spouts?
I tried two different topologies, but Storm is not allocating slots to the second topology.
Yes, you can use multiple spouts in a topology.
builder.setSpout("kafka-spout1", new KafkaSpout(spoutConf1), 1);
builder.setSpout("kafka-spout2", new KafkaSpout(spoutConf2), 1);
The right configuration depends on how you process the data.
If you create a separate topology for each topic, a failure in one topology won't affect the other, but it will increase the running cost.
With a single topology and multiple spouts, both are affected by each other's failures. If you want to combine the data from both topics at the same time, you should use multiple spouts in one topology.
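As a rough sketch of the single-topology option, staying with the SpoutConfig-based API used above (the ZooKeeper address, the zkRoot paths, and ProcessBolt are assumptions for illustration):
BrokerHosts hosts = new ZkHosts("localhost:2181");                        // assumed ZooKeeper address
SpoutConfig spoutConf1 = new SpoutConfig(hosts, "LIVEDATA", "/LIVEDATA", "live-spout-id");
SpoutConfig spoutConf2 = new SpoutConfig(hosts, "BACKUPDATA", "/BACKUPDATA", "backup-spout-id");
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("kafka-spout1", new KafkaSpout(spoutConf1), 1);
builder.setSpout("kafka-spout2", new KafkaSpout(spoutConf2), 1);
// One bolt subscribes to both spouts, so tuples from both topics meet in one place.
builder.setBolt("process-bolt", new ProcessBolt(), 2)                     // ProcessBolt is hypothetical
       .shuffleGrouping("kafka-spout1")
       .shuffleGrouping("kafka-spout2");
If the second topology was not getting slots, also check that the cluster actually has enough free worker slots (supervisor.slots.ports on each supervisor) to run both topologies.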
I want to use Apache Storm in one of my projects, and I have a question about its parallelism model. By definition, we can give hints about how many instances of each component we want to run.
For example, if there are 4 executors running the same spout, which reads data from an external source and transforms it into tuples, how does Storm ensure that no two (or more) spout instances get the same data?
Help would be appreciated.
Quick Background:
A customer can have multiple event processors (actions to be taken on a particular input), and each of these event processors can be changed independently. As an optimization, we have grouped all the processors for a single customer into a single topology. The advantage is isolation across customers; on the flip side, the entire topology for a customer needs to be redeployed even if a single processor changes, plus there is the downtime it takes to kill the topology and redeploy the new one.
Now the options I am contemplating are:
1. Dynamic topology: there is no easy way to change the spouts and bolts at runtime, and Storm's topology swap doesn't seem to be available just yet. Is there a way to dynamically update topologies without redeployment, or any way to hot-deploy topologies?
2. One topology per event processor per customer. That would end up being thousands, or even hundreds of thousands, of topologies, which obviously seems incorrect.
I have read through this old post, but it wasn't much help. What's the recommendation?
I have already read the related material about Storm parallelism, but a few things are still unclear. Suppose we take tweet processing as an example: generally, we retrieve the tweet stream, count the number of words in each tweet, and write the counts to a local file.
My question is how to understand the parallelism values of spouts and bolts. Within builder.setSpout and builder.setBolt we can assign the parallelism value. But in the tweet word-counting case, is it correct that only one spout should be set? Would more than one spout just be copies of the same spout, so that identical tweets flow into several spouts? If that is the case, what is the value of setting more than one spout?
Another unclear point is how work is assigned to bolts. Does the parallelism mechanism work by Storm finding a currently available bolt to process the next tuple a spout emits? I revised the basic tweet-counting code so that the final counts are written into a specific directory; however, all the results end up combined in one file on Nimbus. So after the data is processed on the supervisors, are all results sent back to Nimbus? If so, what is the communication mechanism between Nimbus and the supervisors?
I really want to figure these out! Any help is appreciated!
Setting the parallelism of a spout to more than one requires that the user code do different things in different instances. Otherwise (as you mentioned already), the same data is just sent through the topology multiple times. For example, you could have a list of ports you want to listen to (or a list of different Kafka topics). You then need to ensure that different instances listen to different ports or topics. This can be achieved in the open(...) method by looking at topology metadata such as the instance's own task ID and the degree of parallelism. As each instance has a unique ID, you can partition your ports/topics so that each instance picks different ports/topics from the overall list.
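A rough sketch of that partitioning inside open(...), assuming the spout holds a hypothetical allTopics list (with a myTopics field for the claimed subset) and using the org.apache.storm TopologyContext API:
@Override
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
    // Index of this instance among all instances of this spout (0 .. dop-1)
    int myIndex = context.getThisTaskIndex();
    // Degree of parallelism = number of tasks created for this component
    int dop = context.getComponentTasks(context.getThisComponentId()).size();
    // Each instance claims every dop-th entry, so no topic/port is read twice.
    myTopics = new ArrayList<>();
    for (int i = myIndex; i < allTopics.size(); i += dop) {
        myTopics.add(allTopics.get(i));
    }
}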
Regarding how work is assigned to bolts: this depends on the connection pattern you use when plugging your topology together. For example, using shuffleGrouping results in a round-robin distribution of the emitted tuples over the consuming bolt instances. In this case, Storm does not "look" at whether a bolt instance is available for processing; tuples are simply transferred and buffered at the receiver if necessary.
Furthermore, Nimbus and the supervisors only exchange metadata. There is no dataflow (i.e., flow of tuples) between them.
In some cases, such as Kafka's consumer groups, you get queue behaviour, which means that if one consumer reads a message from the queue, another consumer will read a different message from the queue.
This will distribute read load from the queue across all workers.
In those cases you can have multiple spout instances reading from the queue.
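For example, with the storm-kafka-client spout, all executors of one spout share a Kafka consumer group, so Kafka hands each of them different partitions. A sketch (the broker address, topic, and group id are assumptions, and the exact builder methods vary between storm-kafka-client versions):
KafkaSpoutConfig<String, String> spoutConf =
        KafkaSpoutConfig.builder("localhost:9092", "LIVEDATA")    // assumed broker and topic
                        .setProp("group.id", "live-consumers")    // one shared consumer group
                        .build();
builder.setSpout("kafka-spout", new KafkaSpout<>(spoutConf), 4);  // 4 executors split the topic's partitions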
I want to fire multiple web requests in parallel and then aggregate the data in a Storm topology. Which of the following approaches is preferred?
1) Create multiple threads within a bolt
2) Create multiple bolts and a merging bolt to aggregate the data
I would like to create multiple threads within a bolt, because merging the data in another bolt is not a simple process. But I have seen some concerns about that on the internet:
https://mail-archives.apache.org/mod_mbox/storm-user/201311.mbox/%3CCAAYLz+pUZ44GNsNNJ9O5hjTr2rZLW=CKM=FGvcfwBnw613r1qQ#mail.gmail.com%3E
but I didn't find a clear reason why multiple threads shouldn't be created. Any pointers would help.
On a side note, does that mean I should not use Java 8's parallel streams either, as described in https://docs.oracle.com/javase/tutorial/collections/streams/parallelism.html?
Increase the number of tasks for the bolt; that is like spawning multiple instances of it. Also increase the number of executors (threads) so the tasks are handled evenly.
Make sure #executors <= #tasks. Storm will do the rest for you.
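For example, a sketch with a hypothetical FetchBolt and an assumed upstream component id request-spout, running 8 tasks over 4 executors:
builder.setBolt("fetch-bolt", new FetchBolt(), 4)   // parallelism hint = 4 executors (threads)
       .setNumTasks(8)                              // 8 task instances, spread over the 4 executors
       .shuffleGrouping("request-spout");           // assumed upstream spout id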
I am a newbie to Storm. I am wondering whether I can use Storm to merge/join two tables from two different DBs (of course, the two tables have some sort of foreign-key relationship; they just happen to live in different DBs/systems). Any ideas on how I'd build the topology? For example, having two separate spouts reading periodically from the two DBs and a bolt to do the join work?
Is this even a proper use case for Storm?
Any ideas are appreciated!
This may be a good use of Storm, but it really depends on your dataset. If you just have two tables in separate DBMSs that you want to join and store in some third place (a DBMS or otherwise), Storm will only really make sense if this is a streaming join, i.e. the two tables are frequently written to and you want to join the stuff that was just recently written together.
Also, it almost goes without saying that you should only employ the complexity Storm will bring if this is for something relatively large and high volume.
If it's small, you will probably be better served with a traditional ETL tool, even if that's just some code you whip up to access the two databases and combine the data.
If the data set is large and you need to do joins across more than a short timeframe, I'd consider doing this another way, such as using a map-reduce job that pulls data from the two DBs and spreads the join out over a cluster.
like having two separated spouts reading periodically from two dbs and having a bolt to do the join work
Yes, this is very much possible. A Storm topology can have multiple spouts, and a bolt can consume any number of input streams, do some processing, and possibly emit new streams. Typically it's better to have your spouts read from a queue like Kafka or RabbitMQ (you can find spout integrations for most queuing systems); in that case you feed the queue with the data from the DBs and let the spouts consume it.
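As a sketch (the two DB spout classes, the join bolt, and the customer_id field are illustrative assumptions), you would group both streams by the foreign-key field so that matching rows from the two sources reach the same join-bolt instance:
builder.setSpout("db1-spout", new Db1Spout(), 1);                  // periodically polls the first DB
builder.setSpout("db2-spout", new Db2Spout(), 1);                  // periodically polls the second DB
builder.setBolt("join-bolt", new ForeignKeyJoinBolt(), 4)
       .fieldsGrouping("db1-spout", new Fields("customer_id"))     // same key -> same bolt instance
       .fieldsGrouping("db2-spout", new Fields("customer_id"));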
UPDATE:
Here is a very nice article about how Storm parallelism works.