Switching off data locality for Hadoop MapReduce jobs - hadoop

I have a YARN cluster and dozens of nodes in the cluster. My program is a map-only job.
Its Avro input is very small in size with several million rows, but processing a single row requires lots of CPU power. What I observe is that many maps tasks are running on a single node, whereas other nodes are not participating. That causes some nodes to be very slow and affecting overall HDFS performance. I assume this behaviour is because of the Hadoop data-locality.
I'm curious whether it's possible to switch it off, or is there another way to force YARN to distribute map tasks across more uniformly across cluster?
Thanks!

Assuming you can't easily redistribute the data more uniformly across the cluster (surely not all your data is on 1 node right?!) this seems to be the easy way to relax locality:
yarn.scheduler.capacity.node-locality-delay
This setting should have a default of 40, try setting it to 1 to see whether this has the desired effect. Perhaps even 0 could work.

Related

Hadoop, uneven load between machines

I have a cluster of 4 machines that I need to run a benchmark against.
I decide to use Terasort to benchmark.
However, when I run the benchmark, only one out of four machine is under load, while the other three are completely idle.
If I run the test another time, a different machine would be completely under load while the other three would be idle.
When I create the dataset with Teragen everything works just fine, the load is evenly distributed between all the four machine.
What can be wrong in this configuration ?
Thanks
I hope your cluster is distributed properly as 4 nodes (1 name node , 1 secondary name node, 2 data nodes)
The process flow happens like it starts with name-node and job tracker will schedule the job for the task trackers which has the data blocks.
The usage of data-nodes depends on few factors like number of replication, number of mappers and number of blocks.
If The number of blocks are many, it will be placed evenly in all the data nodes of your cluster. If the replication factor is 2, then the blocks will be available in both the data nodes. So both can run the mappers which deal with those blocks
If you have two blocks for a file and two mappers will run simultaneously in the data nodes and utilize the resources properly.
In your case, it seems block size is the problem. Try to reduce it. so there should be at least 2 blocks which makes utilization will be more and so is the performance.
Hadoop can be tuned as per your need with the below settings.
dfs.replication in hdfs-site.xml
dfs.block.size in hdfs-site.xml
Good luck !!!

Task scheduling with spark

I am running fairly large task on my 4 node cluster. I am reading around 4 GB of filtered data from a single table and running Naïve Baye’s training and prediction. I have HBase region server running on a single machine which is separate from the spark cluster running in fair scheduling mode, although HDFS is running on all machines.
While executing, I am experiencing strange task distribution in terms of the number of active tasks on the cluster. I observed that only one active task or at most two tasks are running on one/two machines at any point of time while the other are sitting idle. My expectation was that the data in the RDD will be divided and processed on all the nodes for operations like count and distinct etcetera. Why are all nodes not being used for large tasks of a single job? Does having HBase on a separate machine has anything to do with this?
Some things to check:
Presumably you are reading in your data using hadoopFile() or hadoopRDD(): consider setting the [optional] minPartitions parameter to make sure the number of partitions is equal to the number of nodes you want to use.
As you create other RDDs in your application, check the number of partitions of those RDDs and how evenly the data is distributed across them. (Sometimes an operation can create an RDD with the same number of partitions but can make the data within it badly unbalanced.) You can check this by calling the glom() method, printing the number of elements of the resulting RDD (the number of partitions) and then looping through it and printing the number of elements of each of the arrays. (This introduces communication so don't leave it in your production code.)
Many of the API calls on RDD have optional parameters for setting the number of partitions, and then there are calls like repartition() and coalesce() that can change the partitioning. Use them to fix problems you find using the above technique (but sometimes it will expose the need to rethink your algorithm.)
Check that you're actually using RDDs for all your large data, and haven't accidentally ended up with some big data structure on the master.
All of these assume that you have data skew problems rather than something more sinister. That's not guaranteed to be true, but you need to check your data skew situation before looking for something complicated. It's easy for data skew to creep in, especially given Spark's flexibility, and it can make a real mess.

Confusion of how hadoop splits work

We are Hadoop newbies, we realize that hadoop is for processing big data, and how Cartesian product is extremely expensive. However we are having some experiments where we are running a Cartesian product job similar to the one in the MapReduce Design Patterns book except with a reducer calculating avg of all intermediate results( including only upper half of A*B, so total is A*B/2).
Our setting: 3 node cluster, block size = 64M, we tested different data set sizes ranging from
5000 points (130KB) to 10000 points (260KB).
Observations:
1- All map tasks are running on one node, sometimes on the master machine, other times on one of the slaves, but it never processed on more than one machine.Is there a way to force hadoop to distribute the splits therefore map tasks among machines? Based on what factors dose hadoop decide which machine is going to process the map tasks( in our case once it decided the master, in another case it decided a slave).
2- In all cases where we are testing the same job on different data sizes, we are getting 4 map tasks. Where dose the number 4 comes from?since our data size is less than the block size, why are we having 4 splits not 1.
3- Is there a way to see more information about exact splits for a running job.
Thanks in advance
What version of Hadoop are you using? I am going to assume a later version that uses YARN.
1) Hadoop should distribute the map tasks among your cluster automatically and not favor any specific nodes. It will place a map task as close to the data as possible, i.e. it will choose a NodeManager on the same host as a DataNode hosting a block. If such a NodeManager isn't available, then it will just pick a node to run your task. This means you should see all of your slave nodes running tasks when your job is launched. There may be other factors blocking Hadoop from using a node, such as the NodeManager being down, or not enough memory to start up a JVM on a specific node.
2) Is your file size slightly above 64MB? Even one byte over 67,108,864 bytes will create two splits. The CartesianInputFormat first computes the cross product of all the blocks in your data set. Having a file that is two blocks will create four splits -- A1xB1, A1xB2, A2xB1, A2xB2. Try a smaller file and see if you are still getting four splits.
3) You can see the running job in the UI of your ResourceManager. https://:8088 will open the main page (jobtracker-host:50030 for MRv1) and you can navigate to your running job from there, which will get you to see individual tasks that are running. If you want more specifics on what the input format is doing, add some log statements to the CartesianInputFormat's getSplits method and re-run your code to see what is going on.

How HDFS works when running Hadoop on a single node cluster?

There is a lot of content explaining data locality and how MapReduce and HDFS works on multi-node clusters. But I can't find much information regarding a single node setup. In the past three months that I'm experimenting with Hadoop I'm always reading tutorials and threads regarding number of mappers and reducers and writing custom partitioners to optimize jobs, but I always think, does it apply to a single node cluster?
What is the loss of running MapReduce jobs on a single node cluster comparing to a multi-node cluster?
Does the parallelism that is provided by splitting the input data still applies in this case?
What's the difference of reading input from a single node HDFS and reading from the local filesystem?
I think due to my little experience I can't answer these questions clearly, so any help is appreciated!
Thanks in advance!
EDIT: I understand Hadoop is not suitable for a single node setup because of all the factors listed by #TC1. So, what's the benefit of setting up a pseudo-distributed Hadoop environment?
I'm always reading tutorials and threads regarding number of mappers and reducers and writing custom partitioners to optimize jobs, but I always think, does it apply to a single node cluster?
It depends. Combiners are run between mapping and reducing and you'd definitely feel the impact even on a single node if they were used right. Custom partitioners -- probably no, the data hits the same disk before reducing. They would affect the logic, i.e., what data your reducers receive, but probably not the performance
What is the loss of running MapReduce jobs on a single node cluster comparing to a multi-node cluster?
Processing capability. If you can get by with a single node setup for your data, you probably shouldn't be using Hadoop for your processing in the first place.
Does the parallelism that is provided by splitting the input data still applies in this case?
No, the bottleneck typically is I/O, i.e., accessing the disk. In this case, you're still accessing the same disk, only hitting it from more threads.
What's the difference of reading input from a single node HDFS and reading from the local filesystem?
Virtually non-existent. The idea of HDFS is to
store files in big, contiguous blocks, to avoid disk seeking
replicate these blocks among the nodes to provide resilience;
both of those are moot when running on a single node.
EDIT:
The difference between "single-node" and "pseudo-distributed" is that in single-mode all the Hadoop processes run on a single JVM. There's no network communication involved, not even through localhost etc. Even if simply testing a job on small data, I'd advise to use pseudo-distributed since that is essentially the same as a cluster.

when is it a good idea to increase/decrease the number of nodes interactively on a hadoop mapreduce job?

I have an intuition that increasing/decreasing
number of nodes interactively on running job can speed up map-heavy
jobs, but won't help wth reduce heavy jobs, where most of work is done
by reduce.
There's an faq about this but it doesn't really explain very well
http://aws.amazon.com/elasticmapreduce/faqs/#cluster-18
This question was answered by Christopher Smith, who gave me permission to post here.
As always... "it depends". One thing you can pretty much always count
on: adding nodes later on is not going to help you as much as having
the nodes from the get go.
When you create a Hadoop job, it gets split up in to tasks. These
tasks are effectively "atoms of work". Hadoop lets you tweak the # of
mapper and # of reducer tasks during job creation, but once the job is
created, it is static. Tasks are assigned to "slots". Traditionally,
each node is configured to have a certain number of slots for map
tasks, and a certain number of slots for reduce tasks, but you can
tweak that. Some newer versions of Hadoop don't require you to
designate the slots as being for map or reduce tasks. Anyway, the
JobTracker periodically assigns tasks to slots. Because this is done
dynamically, new nodes coming online can speed up the processing of a
job by providing more slots to execute the tasks.
This sets the stage for understanding the reality of adding new nodes.
There's obviously an Amdahl's law issue where having more slots than
pending tasks accomplishes little (if you have speculative execution
enabled, it does help somewhat, as Hadoop will schedule the same task
to run on many different nodes, so that a slow node's tasks can be
completed by faster nodes if there are spare resources). So, if you
didn't define your job with many map or reduce tasks, adding more
nodes isn't going to help much. Of course, each task imposes some
overhead, so you don't want to go crazy high either. That's why I
suggest a guideline for task size should be "something which takes
~2-5 minutes to execute".
Of course, when you add nodes dynamically, they have one other
disadvantage: they don't have any data local. Obviously, if you are at
the start of a EMR pipeline, none of the nodes have data in them, so
doesn't matter, but if you have an EMR pipeline made of many jobs,
with earlier jobs persisting their results to HDFS, you get a huge
performance boost because the JobTracker will favour shaping and
assigning tasks so nodes have that lovely locality of data (this is a
core trick of the whole MapReduce design to maximize performance). On
the reducer side, data is coming from other map tasks, so dynamically
added nodes are really at no disadvantage as compared to other nodes.
So, in principle, dynamically adding new nodes is actually less likely
to help with IO bound map tasks that are reading from HDFS.
Except...
Hadoop has a variety of cheats under the covers to optimize
performance. Once is that it starts transmitting map output data to
the reducers before the map task completes/the reducer starts. This
obviously is a critical optimization for jobs where the mappers
generate a lot of data. You can tweak when Hadoop starts to kick off
the transfers. Anyway, this means that a newly spun up node might be
at a disadvantage, because the existing nodes might already have such
a huge data advantage. Obviously, the more output that the mappers
have transmitted, the larger the disadvantage.
That's how it all really works. In practice though, a lot of Hadoop
jobs have mappers processing tons of data in a CPU intensive fashion,
but outputting comparatively little data to the reducers (or they
might send a lot of data to the reducers, but the reducers are still
very simple, so not CPU bound at all). Often jobs will have few
(sometimes even 0) reducer tasks, so even extra nodes could help, if
you already have a reduce slot available for every outstanding reduce
task, new nodes can't help. New nodes also disproportionately help out
with CPU bound work, for obvious reasons, so because that tends to
be map tasks more than reduce tasks, that's where people typically see
the win. If your mappers are I/O bound and pulling data from the
network, adding new nodes obviously increases the aggregate bandwidth
of the cluster, so it helps there, but if your map tasks are I/O bound
reading HDFS, the best thing is to have more initial nodes, with data
already spread over HDFS. It's not unusual to see reducers get I/O
bound because of poorly structured jobs, in which case adding more
nodes can help a lot, because it splits up the bandwidth again.
There's a caveat there too of course: with a really small cluster,
reducers get to read a lot of their data from the mappers running on
the local node, and adding more nodes shifts more of the data to being
pulled over the much slower network. You can also have cases where
reducers spend most of their time just multiplexing data processing
from all the mappers sending them data (although that is tunable as
well).
If you are asking questions like this, I'd highly recommend profiling
your job using something like Amazon's offering of KarmaSphere. It
will give you a better picture of where your bottlenecks are and what
are your best strategies for improving performance.

Resources