How does Hadoop dfs.replication work? - hadoop

I have a two-node Hadoop setup (one node is both master and slave, the other is a slave) and four input files of 1 GB each.
When I set dfs.replication to 2, the entire data set is copied to both nodes, which is understandable. My question is: how do I see improved performance (almost twice as good) over a single-node setup, given that in the two-node case MapReduce will still run over the complete data set on both systems, with the added overhead of channeling the inputs from two mappers to the reducers?
Also, when I set the replication to 1, the entire data set exists only on the master node, which is also understandable since it avoids Ethernet overhead. But even in this case I see a performance improvement compared to a single-node setup, which I find confusing: since MapReduce runs on local data, this scenario should essentially be the same as a single-node setup, with one MapReduce program running on the master node over the entire data set.
Can someone help me understand what I am missing here?
Thanks
Pawan

Pawan,
In the two-node case the MapReduce job will not run over the entire data set on each node. MapReduce operates on HDFS blocks, which are 64 MB or larger depending on your configuration. Each 1 GB file is split into blocks that are distributed across the cluster nodes. Some of these blocks are processed on node 1 and the others on node 2, with no duplication. The replication factor only increases the availability of the data and the tolerance to node failures. It does not duplicate tasks.
As a result, from the processing perspective the data is split between node 1 and node 2. This means that if you are using your processing power fully and correctly, you are theoretically doubling your speed.
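The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration: it assumes the 64 MB block size mentioned above, one map task per block, and a perfectly even spread across nodes. The function names are made up for the example and are not part of any Hadoop API. Note that the replication factor does not appear anywhere in the calculation.

```python
import math

MB = 1024 * 1024
GB = 1024 * MB

def blocks_in_file(file_size, block_size=64 * MB):
    """Number of HDFS blocks needed for one file (the last block may be partial)."""
    return math.ceil(file_size / block_size)

def map_tasks_per_node(file_sizes, nodes, block_size=64 * MB):
    """Idealized even spread: one map task per block, shared across the nodes."""
    total = sum(blocks_in_file(size, block_size) for size in file_sizes)
    return total, math.ceil(total / nodes)

files = [1 * GB] * 4  # the four 1 GB input files from the question
total, per_node = map_tasks_per_node(files, nodes=2)
print(total, per_node)  # 64 map tasks in total, about 32 per node
```

With a single node the same 64 tasks would all have to run on that one machine, which is where the theoretical doubling comes from.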
Cheers
Rags

Related

Switching off data locality for Hadoop MapReduce jobs

I have a YARN cluster and dozens of nodes in the cluster. My program is a map-only job.
Its Avro input is very small in size, with several million rows, but processing a single row requires a lot of CPU power. What I observe is that many map tasks run on a single node, whereas other nodes are not participating. That makes some nodes very slow and affects overall HDFS performance. I assume this behaviour is caused by Hadoop data locality.
I'm curious whether it's possible to switch it off, or whether there is another way to force YARN to distribute map tasks more uniformly across the cluster.
Thanks!
Assuming you can't easily redistribute the data more uniformly across the cluster (surely not all your data is on one node, right?!), this seems to be the easy way to relax locality:
yarn.scheduler.capacity.node-locality-delay
This setting should have a default of 40. Try setting it to 1 to see whether this has the desired effect. Perhaps even 0 could work.
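For reference, a small Python sketch that renders the property in the usual Hadoop `<property>` form. It assumes the cluster uses the CapacityScheduler, whose settings normally live in capacity-scheduler.xml; `hadoop_property` is a hypothetical helper written for this example, and the value 1 is simply the one suggested above.

```python
import xml.etree.ElementTree as ET

def hadoop_property(name, value):
    """Build a Hadoop-style <property> element (hypothetical helper)."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = str(value)
    return prop

prop = hadoop_property("yarn.scheduler.capacity.node-locality-delay", 1)
# Paste the printed element inside <configuration> in capacity-scheduler.xml.
print(ET.tostring(prop, encoding="unicode"))
```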

Can a slave node have multiple blocks of the same file in hadoop?

Say I have a Hadoop cluster where one node is the master node and the other is a data node. The slave node is an 8-core machine, just to make sure there are enough cores to process jobs in parallel. Can I still split the file into, say, 3 blocks and have the slave node store all three blocks separately? In other words, if we want to utilize all the slave nodes in a Hadoop cluster, is there a 1:1 relation between the number of slave nodes and the maximum number of blocks of a file? If yes, how would MapReduce work in such a case? Will the master node fire three map tasks at the slave node and have each mapper pick up one block on the slave node?
My question can be seen in a different way. If we have a 1GB file on a cluster with 3 data nodes then how do the 64 MB blocks get divided and how are they distributed between the three nodes?
The second question seems to be more understandable for me so I will take that first.
From HDFS Perspective:
With a 64 MB block size, a 1 GB file consists of 16 blocks. Blocks are placed somewhat randomly across the DataNodes. If you have more DataNodes than the replication factor, you can expect a roughly even distribution between the nodes, provided you do not load the data from one of the DataNodes. If you do, that DataNode will hold a replica of every block, and the other DataNodes will hold the remaining replicas, distributed more or less evenly (still randomly placed). So yes: if a file consists of 16 blocks and you have only 3 DataNodes with a replication factor of 3, all 3 DataNodes will hold all 16 blocks.
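A toy model of this placement, in Python. It is a deliberate simplification: each block's replicas go to distinct, randomly chosen DataNodes, and the real policy (rack awareness, preference for the writing node) is ignored. `place_blocks` and the node names are made up for the example.

```python
import random

def place_blocks(num_blocks, datanodes, replication, seed=0):
    """Put each block's replicas on distinct, randomly chosen DataNodes."""
    rng = random.Random(seed)
    copies = min(replication, len(datanodes))  # at most one replica per node
    holdings = {dn: set() for dn in datanodes}
    for block in range(num_blocks):
        for dn in rng.sample(datanodes, copies):
            holdings[dn].add(block)
    return holdings

nodes = ["dn1", "dn2", "dn3"]
full = place_blocks(16, nodes, replication=3)
print({dn: len(blocks) for dn, blocks in full.items()})  # every node holds all 16

single = place_blocks(16, nodes, replication=1)
print({dn: len(blocks) for dn, blocks in single.items()})  # 16 blocks spread over 3 nodes
```

With replication 3 on 3 nodes there is no choice to make, so every node ends up with every block. With replication 1 the 16 blocks are scattered, and the per-node counts depend on the random draw.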
From YARN's perspective when you run the MapReduce job:
YARN tries to find a container for each mapper on a node that holds the data locally. There is a configurable wait time for a free container on such nodes before YARN starts the mapper on a node that does not have the data.
YARN does not rely on physical cores directly. You can configure the number of virtual cores and the amount of memory a container uses, and based on these values YARN determines how many containers a NodeManager can host.
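As a back-of-the-envelope illustration of that last point, here is a Python sketch. The node and container sizes are invented for the example, and the model is simplified: depending on how the scheduler is configured, only memory may be taken into account.

```python
def containers_per_node(node_mem_mb, node_vcores, container_mem_mb, container_vcores=1):
    """Rough upper bound on concurrent containers for one NodeManager."""
    by_memory = node_mem_mb // container_mem_mb
    by_vcores = node_vcores // container_vcores
    return min(by_memory, by_vcores)

# Hypothetical 8-vcore node offering 16 GB to YARN.
print(containers_per_node(16384, 8, container_mem_mb=2048))  # 8, limited by both
print(containers_per_node(16384, 8, container_mem_mb=4096))  # 4, limited by memory
```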
Further reading on YARN tuning on Cloudera Engineering blog
However:
From the first part of the question, as I understand it, you want to achieve parallelism by choosing the block size that splits your data files.
MapReduce does not work with HDFS blocks directly; it has its own abstraction for splitting the input, called InputSplit. InputSplits are fed to the mappers by the InputFormat. Each InputSplit also reports the locations where its data is available locally, so that YARN can look for a container on a node that has the split in local storage. I suggest checking the API and the available InputFormat implementations, as they most likely suit your needs. If they do not, you can still write your own implementation and specify it via the job configuration.
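To make the block versus split distinction concrete, here is a Python sketch modelled on the rule the stock file-based input formats use, as far as I understand it: the split size is max(min size, min(max size, block size)), and the last split is allowed to run about 10% over. Treat it as an approximation; the function names are made up for the example.

```python
MB = 1024 * 1024

def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size defaults to the block size unless min/max override it."""
    return max(min_size, min(max_size, block_size))

def split_lengths(file_len, split_size, slop=1.1):
    """Cut a file into splits; a tail of up to 10% is merged into the last split."""
    lengths = []
    remaining = file_len
    while remaining / split_size > slop:
        lengths.append(split_size)
        remaining -= split_size
    if remaining > 0:
        lengths.append(remaining)
    return lengths

default_size = compute_split_size(block_size=64 * MB)
print(len(split_lengths(1024 * MB, default_size)))  # 16 splits, one per block

bigger = compute_split_size(block_size=64 * MB, min_size=128 * MB)
print(len(split_lengths(1024 * MB, bigger)))  # 8 splits, each spanning two blocks

print(len(split_lengths(70 * MB, default_size)))  # 1 split, although the file has 2 blocks
```

The last two cases show that the number of map tasks follows the splits, not the blocks.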

Hadoop, uneven load between machines

I have a cluster of 4 machines that I need to run a benchmark against.
I decided to use TeraSort as the benchmark.
However, when I run the benchmark, only one of the four machines is under load, while the other three are completely idle.
If I run the test another time, a different machine is completely under load while the other three are idle.
When I create the data set with TeraGen everything works just fine: the load is evenly distributed between all four machines.
What can be wrong in this configuration ?
Thanks
I assume your cluster of 4 nodes is set up as 1 NameNode, 1 secondary NameNode and 2 DataNodes.
The flow starts at the NameNode, and the JobTracker schedules the job's tasks on the TaskTrackers that hold the data blocks.
How much each DataNode is used depends on a few factors, such as the replication factor, the number of mappers and the number of blocks.
If there are many blocks, they will be spread evenly across all the DataNodes of your cluster. If the replication factor is 2, the blocks will be available on both DataNodes, so both can run the mappers that deal with those blocks.
If a file has two blocks, two mappers can run simultaneously on the DataNodes and use the resources properly.
In your case, the block size seems to be the problem. Try reducing it so that there are at least 2 blocks; that should improve utilization and therefore performance.
Hadoop can be tuned to your needs with the settings below.
dfs.replication in hdfs-site.xml
dfs.block.size in hdfs-site.xml (named dfs.blocksize in newer releases)
Good luck !!!

Can Hadoop tasks run in parallel on single node

I am new to Hadoop and I have the following questions.
This is what I have understood about Hadoop.
1) Whenever a file is written to Hadoop, it is stored across all the data nodes in chunks (64 MB by default).
2) When we run the MR job, a split is created from each block, and the split is processed on each data node.
3) From each split record reader will be used to generate key/value pair at mapper side.
Questions :
1) Can one data node process more than one split at a time ? What if data node capacity is more?
I think this was limitation in MR1, and with MR2 YARN we have better resource utilization.
2) Will a split be read in serial fashion at data node or can it be processed in parallel to generate key/value pair? [ By randomly accessing disk location in data node split]
3) What is 'slot' terminology in map/reduce architecture? I was reading through one of the blogs and it says YARN will provide better slot utilization in Datanode.
Let me first address the "what I have understood" part.
A file stored on the Hadoop file system is NOT stored across all data nodes. Yes, it is split into chunks (the default is 64 MB), but the number of DataNodes on which these chunks are stored depends on (a) the file size, (b) the current load on the DataNodes, (c) the replication factor and (d) physical proximity. The NameNode takes these factors into account when deciding which DataNodes will store the chunks of a file.
Again, each DataNode MAY NOT process a split. Firstly, DataNodes are only responsible for managing the storage of data, not for executing jobs or tasks. The TaskTracker is the slave daemon responsible for executing tasks on an individual node. Secondly, only the nodes that contain the data required for a particular job will process the splits, unless the load on these nodes is too high, in which case the data in the split is copied to another node and processed there.
Now coming to the questions:
1) Again, DataNodes are not responsible for processing jobs or tasks. We usually refer to a combination of DataNode + TaskTracker as a node, since they are commonly found on the same machine, handling different responsibilities (data storage and running tasks). A given node can process more than one split at a time. Usually a single split is assigned to a single map task. This translates to multiple map tasks running on a single node, which is possible.
2) Data from the input file is read in serial fashion.
3) A node's processing capacity is defined by its number of slots. If a node has 10 slots, it can process 10 tasks in parallel (these may be map or reduce tasks). The cluster administrator usually configures the number of slots per node based on the physical configuration of that node, such as memory, physical storage, number of processor cores, etc.
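The effect of slots on a job can be estimated with a short Python sketch. It assumes every slot is available to map tasks and that all tasks take about the same time, so the map phase runs in "waves" of tasks. `map_waves` is a made-up name, and the 64 splits are just an example figure.

```python
import math

def map_waves(num_splits, nodes, slots_per_node):
    """Number of rounds needed if every slot runs one map task at a time."""
    total_slots = nodes * slots_per_node
    return math.ceil(num_splits / total_slots)

print(map_waves(64, nodes=1, slots_per_node=10))  # 7 waves on one node
print(map_waves(64, nodes=2, slots_per_node=10))  # 4 waves on two nodes
```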

Hadoop data locality, counter-intuitive observation

Can anyone help me understand the following observation, which is the opposite of my understanding of Hadoop data locality?
A Hadoop cluster with 3 nodes:
master: 10.28.75.146
slave1: 10.157.6.202
slave2: 10.31.130.224
I ran a task successfully. From the job console:
Task Attempts:attempt_201304030122_0003_m_000000_0
Machine: /default-rack/10.31.130.224
Task log: INFO: consuming hdfs://10.28.75.146:9000/input/22.seq
We know 224 node is processing /input/22.seq data. By command:
$hadoop fsck /input -files -blocks -locations |grep -A 1 "22.seq"
/input/22.seq 61731242 bytes, 1 block(s): OK
0. blk_-8703092405392537739_1175 len=61731242 repl=1 [10.157.6.202:9200]
22.seq fits in one block, is smaller than the default HDFS block size (64 MB), and is not replicated to any other node.
Question: since 22.seq is not local to the 224 node, why does Hadoop assign the 224 node to process data that sits remotely on 202?
Note: this is not an exception. I notice that many data files are fetched remotely, and I observe huge network traffic on eth0 on both nodes. I was expecting near-zero traffic between the two nodes, since all my data files are smaller than 64 MB and the data should be processed locally.
FYI: This is observed on Amazon's AWS EMR.
I am not sure if this will answer your question fully, but I will attempt to shine some light.
The network traffic you saw may have been influenced by the way the MapReduce framework submits a job. Part of that process transfers, by default, 10 copies of your job JAR and all the libraries it contains across the cluster (in cases like yours, where there are fewer than 10 nodes, I am not sure how it behaves). There are also heartbeats, fetching of input split information and progress reporting, which seem like low-bandwidth operations, although I do not know the specifics of their network consumption.
Regarding the job you are running: if it is a map-only job, Hadoop tries to optimize for data locality and runs the task where the input split is located (it only tries, because resource limits on the data-local node may get in the way). In your case the file is smaller than the default 64 MB, so one split should cover your data, which in turn should result in one map task, since the number of maps is directly proportional to the number of splits. If your job has both map and reduce phases, however, the network traffic may include the HTTP traffic of the reduce-side copy and sort phase, which can end up on separate nodes.
N Input Splits = N Maps --output--> M partitions = M Reducers
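The line above can be illustrated with a small Python sketch of the shuffle: each map output record is routed to one of M partitions by hashing its key, so records from any map task may have to cross the network to reach their reducer. Hadoop's default hash partitioner works along these lines using the key's hash code; here `zlib.crc32` is only a stand-in, and the function names and sample records are made up.

```python
import zlib

def partition_for(key, num_reducers):
    """Map a key to a reducer index with a stable hash (crc32 as a stand-in)."""
    return (zlib.crc32(key.encode("utf-8")) & 0x7FFFFFFF) % num_reducers

def shuffle(map_outputs, num_reducers):
    """Route every (key, value) pair from N map outputs into M partitions."""
    partitions = [[] for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            partitions[partition_for(key, num_reducers)].append((key, value))
    return partitions

# Outputs of N = 2 map tasks, shuffled to M = 3 reducers.
map_outputs = [
    [("apple", 1), ("pear", 1)],
    [("apple", 1), ("plum", 1)],
]
partitions = shuffle(map_outputs, num_reducers=3)
print([len(p) for p in partitions])
```

All records with the same key land in the same partition regardless of which map task produced them, which is why a reduce phase adds traffic between nodes even when the maps are fully data-local.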
Of course, the network traffic and the data locality optimizations depend on the availability of the nodes' resources, so your test assumptions should take this into consideration.
Hope I was a tiny bit helpful.
Short answer: because the Hadoop scheduler sucks. It has no up-front global plan for which file split should go where. As nodes ask for work, it looks at the available splits and gives out the best match. There are parameters that tune how aggressive Hadoop is in finding a best match (i.e. when a request for work arrives, does it give out the best match available at that time, or does it wait for some time to see whether other, better-matching nodes also send requests?).
By default (and I am pretty sure this is the case with EMR), the scheduler always gives some work to a requesting node if any work is available. You can see that if your input is small (it spans only a few blocks/nodes) but the number of nodes is comparatively large, you will get very poor locality. On the other hand, if the input is large, your odds of getting good locality go up a lot.
The FairScheduler has parameters to delay scheduling so as to get better locality. However, I don't think that is the default scheduler with EMR.
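The trade-off described here can be shown with a toy Python simulation of delay scheduling. It is a simplified model, not the actual scheduler code: nodes ask for work one at a time, a node gets a local task if one exists, and otherwise the job may decline up to `max_skips` requests before accepting a non-local assignment. All names and the sample data are made up.

```python
def schedule(task_locations, heartbeats, max_skips):
    """Assign tasks as nodes ask for work; returns (task, node, is_local) tuples."""
    pending = dict(task_locations)  # task -> node that holds its block
    skips = 0
    assignments = []
    for node in heartbeats:
        if not pending:
            break
        local = [t for t, home in pending.items() if home == node]
        if local:
            task = local[0]
            skips = 0  # a local launch resets the patience counter
        elif skips >= max_skips:
            task = next(iter(pending))  # give up waiting and run remotely
        else:
            skips += 1  # decline this offer and wait for a better node
            continue
        assignments.append((task, node, pending.pop(task) == node))
    return assignments

tasks = {"t0": "n1", "t1": "n1", "t2": "n2"}
requests = ["n3", "n1", "n2", "n3", "n1"]

eager = schedule(tasks, requests, max_skips=0)
patient = schedule(tasks, requests, max_skips=1)
print(eager)    # t0 runs remotely on n3, the other two tasks run locally
print(patient)  # all three tasks run on the node that holds their data
```

With no delay, the first node to ask (n3, which holds no data) is handed work immediately, which mirrors the remote read in the question. Allowing a single skip is enough in this tiny example to make every task data-local.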
