I have tested Storm on 5 virtual machines on AWS EC2, but I don't know how to evaluate it after testing. In addition, the number of tuples in the Topology stats does not increment over time, and sometimes the number resets to a small value. Why is that?
Related
Our system has a cluster of 5 hosts (i.e., data nodes or compute slaves…). I want to allocate a different number of reducers to each of these hosts because one host is slow. We are using Hadoop YARN. The ResourceManager (the counterpart of the JobTracker in MapReduce 1) always distributes the reducers evenly across the 5 hosts. Is there any way I can limit the number of reducers on a specific host? For example, for a job with 40 reducers, I want the 4 fast machines to run 36 reducers (9 reducers each) and the slow machine to run only 4.
It is entirely possible, and quite common, to have heterogeneous systems in a Hadoop cluster. Typically, as the cluster grows and scales horizontally, new nodes with different hardware configurations get added to it.
In such scenarios, to apply settings to a specific node or a group of nodes, we need to modify the configuration accordingly on those hosts.
For example, in the case of Hortonworks Data Platform, where the cluster is managed through Ambari, the concept of host config groups can be leveraged for this purpose.
Please see the below link for further information:
https://docs.hortonworks.com/HDPDocuments/Ambari-2.1.1.0/bk_Ambari_Users_Guide/content/_using_host_config_groups.html
Also see the link below, which discusses increasing the number of YARN containers at the node level. The same mechanism applies in your case, even though your use case is the opposite: you want to decrease the containers on one node.
How to increase the number of containers in nodemanager in YARN
Another useful link:
http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
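As a rough illustration (assuming YARN; the numbers are placeholders), the per-node capacity that determines how many containers, and therefore reducers, a NodeManager can host is set in that node's yarn-site.xml, so the slow host simply gets smaller values than the fast ones:

    <!-- yarn-site.xml on the slow host (placeholder values) -->
    <property>
      <!-- Total memory this NodeManager offers to YARN containers -->
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>4096</value>
    </property>
    <property>
      <!-- Total vcores this NodeManager offers to YARN containers -->
      <name>yarn.nodemanager.resource.cpu-vcores</name>
      <value>4</value>
    </property>

With, say, 1 GB per reduce container (mapreduce.reduce.memory.mb), a node configured like this can hold roughly 4 concurrent reducers while the larger nodes take the rest. This caps what the slow host can run at any one time rather than pinning an exact 36/4 split.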
How does Apache Storm divide the tasks among its workers? I read that Storm does it by itself and that it's a function of parallelism, but what I don't know is how to figure out which node does what and how many nodes would run which task, basically so that I can calculate the optimal number of nodes required.
Assume that the hardware configuration is not the same on all nodes.
By default, Storm uses "round robin" scheduling, i.e., it loops over all supervisors with available slots and assigns the parallel instances of spouts/bolts. If no more free slots are available, single workers are assigned multiple spout/bolt instances.
You need to have a look at the Storm UI. The metrics (complete latency, capacity, execute latency, process latency, and failed tuples) will give you hints on how many executors and tasks you should allocate for each bolt.
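To make those parallelism knobs concrete, here is a minimal sketch of where executors, tasks, and workers are declared (Storm 1.x package names; older releases use backtype.storm, and MySpout/MyBolt and the component names are hypothetical placeholders). The executors requested this way are what the default scheduler spreads round-robin over the supervisors' free slots:

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.topology.TopologyBuilder;

    public class ParallelismSketch {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();

            // 1 executor for the spout; 4 executors (8 tasks) for the bolt.
            // MySpout/MyBolt are placeholder component implementations.
            builder.setSpout("my-spout", new MySpout(), 1);
            builder.setBolt("my-bolt", new MyBolt(), 4)
                   .setNumTasks(8)
                   .shuffleGrouping("my-spout");

            Config conf = new Config();
            // Worker JVMs (slots) over which the scheduler distributes the executors.
            conf.setNumWorkers(2);

            StormSubmitter.submitTopology("parallelism-sketch", conf, builder.createTopology());
        }
    }

Watching how the Storm UI metrics above change as you vary these three numbers (workers, executors, tasks) is the usual way to find where adding parallelism stops helping.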
We have Storm configured on a single-node development server, with most of the configuration left at defaults (not local mode).
Nimbus, the supervisor, and the workers all run on that single node, and the UI is configured as well.
AFAIK parallelism and configuration differ from topology to topology.
I think finding the right parallelism and configuration is a matter of trial and error.
So, to find the best parallelism, we started testing our Storm topology with various configurations on a single node.
Strangely, the results are not what we expected:
Our topology processes a stream of XML files from an HDFS directory.
It has a single spout (parallelism always 1) and four bolts.
Single worker
Whatever the topology parallelism, we get almost the same performance (rate of data processed).
Multiple workers
Whatever the topology parallelism, we get performance similar to a single worker for a while (in most cases about 10 minutes).
But after that, the complete topology restarts without any error traces.
We observed that data which a single worker processed in 20 minutes took 90 minutes with 5 workers at the same parallelism.
The topology also restarted 7 times with 5 workers.
And CPU usage is relatively high.
(Someone else has also faced this topology restart issue: http://search-hadoop.com/m/LrAq5ZWeaU, but there is no answer.)
After testing many configurations, we found that a single worker with low parallelism (2 or 3 instances per bolt) works better than high parallelism or more workers.
Ideally, the performance of a Storm topology should improve with more workers/parallelism.
Apparently that rule does not hold here.
Why can't we run more than a single worker on a single node?
What is the maximum number of workers that can run on a single node?
What Storm configuration changes are needed to scale the performance? (I have tried nimbus.childopts and worker.childopts.)
If your CPU usage is high on the one node, then you're not going to get any better performance as you increase parallelism. If you do increase parallelism, there will just be greater contention for a constant number of CPU cycles. Not knowing any more about your specific topology, I can only suggest that you look for ways to reduce the CPU usage across your bolts and spouts. Only then would it make sense to add more bolt and spout instances.
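On the question of how many workers a single node can run: that is governed by how many slots the supervisor advertises via supervisor.slots.ports in storm.yaml (one worker JVM per listed port). As a minimal sketch of the topology-level knobs discussed above (Storm 1.x package names; the heap size, worker count, and pending limit are placeholder values, not recommendations):

    import org.apache.storm.Config;

    public class WorkerTuningSketch {
        public static Config tuningConf() {
            Config conf = new Config();
            // One worker JVM is often enough on a single node; extra workers on the
            // same box mainly add inter-JVM serialization and CPU contention.
            conf.setNumWorkers(1);
            // Per-topology override of worker.childopts (heap size is a placeholder).
            conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx2g");
            // Throttle in-flight tuples so executors aren't overwhelmed (placeholder value).
            conf.setMaxSpoutPending(1000);
            return conf;
        }
    }

The inter-JVM serialization point is also a plausible explanation for what you observed: with 5 workers on one box, tuples that used to move within a single JVM now get serialized and sent between worker processes, which costs CPU you don't have to spare.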
I am starting to learn Hadoop. I have a Hadoop server, and it connects to 3 cluster nodes. If I run a MapReduce job, it works well. I need to set the priority for these nodes.
For example:
node1, node2, and node3 are the cluster nodes connected to my Hadoop server. If I run an MR job, I want it to split and assign the work according to that order of priority every time. Is that possible?
The cluster nodes have different memory capacities, so I need the high-memory node to handle the job first.
It's not possible to 'weight' certain servers so that the scheduler prefers them. However, each server can have a configuration that matches its memory, processor count, etc.
For example, if one server has 16 cores and another has 8 cores, you can configure the first server to run 12 tasks simultaneously and the second to run only 6. The same idea applies to memory.
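As a hedged sketch for a classic MapReduce v1 cluster (placeholder values chosen to match the 16-core example; total simultaneous tasks = map slots + reduce slots), the per-node limit lives in that node's mapred-site.xml:

    <!-- mapred-site.xml on the 16-core node (placeholder values: 8 + 4 = 12 tasks) -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>8</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>4</value>
    </property>

On the 8-core node the corresponding values might be 4 and 2. On a YARN cluster the equivalent levers are yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores in each node's yarn-site.xml, as in the earlier answer.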
I'm evaluating EC2/EMR for running a ~20-node Hadoop cluster (custom JAR cluster). I've run the simple WordCount example on a single-node 3.3 GHz, 2 GB RAM local VMware instance, where it takes less than 10 seconds to complete. The same WordCount example takes 3 minutes to complete on EMR with 2 c1.medium instances (excluding the startup time of 3-5 minutes), and the same time with 2 m1.small instances. There is some overhead for running a job on EMR, and maybe this problem size is too small, so this seems understandable.
At about what problem size do you begin to see the performance advantage of the cloud? Or at roughly how many nodes or compute units?
Spinning up an EMR job essentially means asking Amazon to provide you with an on-demand cluster of N machines, and simply provisioning and handing over those machines can easily take several minutes, not to mention that the machines need to be set up, can have bootstrap actions, and so on. I've rarely seen EMR jobs (even big ones) take more than 10 minutes to have the cluster ready, but I've also rarely seen a cluster come up in less than a couple of minutes.
If you have a job that you run frequently (for example, every hour), then the cost of setting up and shutting down your EMR cluster might be too high; in that case it would be a good idea to create your cluster with some reserved instances on EC2. With reserved instances, you have your own cluster always up and administered by you, so no time is lost setting up or shutting down the cluster, and it behaves like a regular Hadoop cluster.
What I've been doing for the past couple of years is use an EC2 cluster of reserved instances that is always up and runs all the jobs. For jobs that are very large and wouldn't fit on my cluster, I run them on EMR, where I can choose how many nodes I want; since these are large jobs, the time to set up and shut down the cluster is small compared to the total runtime. I would not recommend using EMR for small/frequent jobs.