Not all nodes are utilized in spark job - hadoop

I am running spark-submit in cluster mode with yarn on hadoop. This is all from apache. I tried the PI java example and found that of 4 spark slave nodes only one was used to do the actual computation(one one node had a log file and the output for the value of pi).
Trying another python app I found that only two nodes at a maximum were being used. That is two out of a maximum of four.
Is this normal behavior, or am I missing something? Thanks in advance.

Related

Comparison between Hadoop Classic and Yarn

I have two clusters each running different version of Hadoop. I am working on a POC were I need to understand how YARN provides the capability to run multiple applications simultaneously which was not accomplished with Classic Map Reduce Framework.
Hadoop Classic:
I have a wordcount.jar file and executed on a single cluster (2 Mappers & 2 Reducers). I started two jobs in parallel, the one lucky started first got both mappers, completed the task and then second job started. This is the expected behavior.
Hadoop Yarn:
Same wordcount.jar with a different cluster (4 cores, so total 4 machines). As Yarn does not pre-assign mapper and reducer, any core can be used as mapper or reducer. Here also I submitted two jobs in parallel.
Expected Behavior: Both the jobs should start with 2 mappers each or whichever config as resource manager assigns but atleast both the jobs should start.
Reality: One job starts with 3 mappers and 1 reducers. second job waits untill first is completed.
Can someone please help me understand the behavior, as well as does the parallelism behavior best reflected with multinode cluster?
Not sure if this is the exact reason, but the Classic Hadoop and YARN architectures use a different scheduler. Classic Hadoop uses a JobQueueTaskScheduler, while YARN uses CapacityScheduler by default.

why isn't hadoop distributing a file to all nodes?

I set up a 4 node hadoop cluster according to the walk-through in http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/. I used replication of 1 (the cluster is just for testing)
I copied a 2GB file from local. When browsing the file in the http interface I see it was split to 31 blocks, but all of them are on one node (the master)
Is this correct? How can I investigate the reason?
They are all on one node because by default Hadoop will write to the local node first by default. I'm going to guess you were using the Hadoop client from that node. Since you have a replication of one, it's only going to be on that node.
Since you are just playing around, you might want to force spreading the data out. To do this, you can run the rebalancer with hadoop rebalancer. Just control-C it after a few minutes.

hadoop node unused for map tasks

I've noticed that all map and all reduce tasks are running on a single node (node1). I tried creating a file consisting of a single hdfs block which resides on node2. When running a mapreduce tasks whose input consists only of this block resident on node2, the task still runs on node1. I was under the impression that hadoop prioritizes running tasks on the nodes that contain the input data. I see no errors reported in log files. Any idea what might be going on here?
I have a 3-node cluster running on kvms created by following the cloudera cdh4 distributed installation guide.
I was under the impression that hadoop prioritizes running tasks on
the nodes that contain the input data.
Well, there might be an exceptional case :
If the node holding the data block doesn't have any free CPU slots, it won't be able to start any mappers on that particular node. In such a scenario instead of waiting data block will be moved to a nearby node and get processed there. But before that framework will try to process the replica of that block, locally(If RF > 1).
HTH
I don't understand when you say "I tried creating a file consisting of a single hdfs block which resides on node2". I don't think you can "direct" hadoop cluster to store some block in some specific node.
Hadoop will decide number of mappers based on input's size. If input size is less than hdfs block size (default I think is 64m), it will spawn just one mapper.
You can set job param "mapred.max.split.size" to whatever size you want to force spawning multiple reducers (default should suffice in most cases).

starting and stopping hadoop daemons/processes in a cluster

I have a linux cluster with 9 nodes and I have installed hadoop 1.0.2. I have a GIS program that I am running using multiple slaves. I need to measure the speedUp of my program by using say 1, 2, 3, 4 .. 8 slave nodes. I use start-all.sh/stop-all.sh script to start/stop my cluster once I make changes in the conf/slaves file by varying the number of slaves.
But I am getting wierd errors while doing so, and it feels that I am not using the correct technique to add/remove slave nodes in the cluster.
Any help regarding the ideal "technique to make changes in slaves file and to restart the cluster" will be appreciated.
The problem likely is that you are not allowing Hadoop to gracefully remove the nodes from the system.
What you want to be doing is decommissioning the nodes so that HDFS has times to re-replicate the files elsewhere. The process is essentially to add some nodes to an excludes file. Then, you run bin/hadoop dfsadmin -refreshNodes, which reads the configurations and refreshes the cluster's view of the nodes.
When adding nodes and even perhaps when removing nodes, you should think about running the rebalancer. This will spread the data out evenly and would help in some performance you may see if new nodes don't have any data.

In Hadoop can we control the number of nodes per job programatically?

I am running a job timing analysis. I have a pre configured cluster with 8 nodes. I want to run a given job with 8 nodes, 6 nodes , 4 nodes and 2 nodes respectively and note down the corresponding run times. Is there a way i can do this programatically, i.e by using appropriate settings in the Job configuration in Java code ?
There are a couple of ways. Would prefer in the same order.
exclude files can be used to not allow some of the task trackers/data nodes connect to the job tracker/ name node. Check this faq. The properties to be used are mapreduce.jobtracker.hosts.exclude.filename and dfs.hosts.exclude. Note than once the files have been changed, the name node and the job tracker have to be refreshed using the mradmin and dfsadmin commands with the refreshNodes option and it might take some time for the cluster to settle because data blocks have to be moved from the excluded nodes.
Another way is to stop the task tracker on the nodes. Then the map/reduce tasks will not be scheduled on that node. But, the data will still be fetched from all the data nodes. So, the data nodes also need to be stopped. Make sure that the name node gets out of safe mode and the replication factor is also set properly (with 2 data nodes, the replication factor can't be 3).
A Capacity Scheduler can also be used to limit the usage of a cluster by a particular job. But, when resources are free/idle then the scheduler will allocate resources beyond capacity for better utilization of the cluster. I am not sure if this can be stopped.
Well are you good with scripting ? If so play around with start scripts of the daemons. Since this is an experimental setup, I think restarting hadoop for each experiment should be fine.

Resources