Hadoop / AWS Elastic MapReduce performance

I am looking for a ballpark figure, if anyone has experience with this.
Does anyone have benchmarks on the speed of AWS's Elastic MapReduce?
Let's say I have 100 million records and I am using Hadoop streaming (a PHP script) to map, group, and reduce (with some simple PHP calculations). The average group will contain 1-6 records.
Also, is it better or more cost-effective to run a bunch of small instances or fewer larger ones? I realize the work is broken up into nodes within an instance, but regardless, will larger nodes have higher I/O, meaning faster per node per server (and more cost-efficient)?
Also, with streaming, how is the ratio of mappers to reducers determined?

I don't know if you can give a meaningful benchmark -- it's kind of like asking how fast a computer program generally runs. It's not possible to say how fast your program will run without knowing anything about the script.
If you mean how fast the instances that power an EMR job are: they have the same spec as the underlying AWS instance types that you specify.
If you want a very rough take on how EMR performs: I'd say you will probably run into an I/O bottleneck before a CPU bottleneck.
In theory this means you should run many small instances and ask for rack diversity, in order to grab more I/O resources from across more machines rather than let them compete. In practice I've found that fewer, higher-I/O instances can be more effective. But even this impression doesn't always hold -- it really depends on how busy the zone is and where your jobs are scheduled.
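Since no general benchmark exists, one way to get a ballpark is to time the group-and-reduce step locally on a sample of the records. The sketch below is a Python stand-in for the PHP streaming script, which is not shown in the question; the tab-separated `key<TAB>value` layout and the summing reducer are assumptions made for illustration. It relies on Hadoop streaming handing the reducer its input sorted by key.

```python
from itertools import groupby


def parse(line):
    # Assumed record layout: "key<TAB>numeric value".
    key, value = line.rstrip("\n").split("\t", 1)
    return key, float(value)


def reduce_sorted(lines):
    """Yield (key, total) for each run of consecutive lines sharing a key.

    The input must already be sorted by key, as it is when Hadoop
    streaming feeds a reducer.
    """
    records = (parse(line) for line in lines if line.strip())
    for key, group in groupby(records, key=lambda kv: kv[0]):
        yield key, sum(value for _, value in group)


# Tiny hypothetical sample: two groups, of 2 and 1 records.
sample = ["a\t1\n", "a\t2\n", "b\t5\n"]
for key, total in reduce_sorted(sample):
    print(f"{key}\t{total}")
```

Used as an actual streaming reducer, the same function would be fed `sys.stdin` instead of the sample list. Timing it on a few million local records gives a per-record cost for the script itself, separate from cluster I/O.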

Related

Tasks taking longer over time in Apache Spark

I have a large dataset that I am trying to run with Apache Spark (around 5TB). I have noticed that when the job starts, it retrieves data really fast and the first stage of the job (a map transformation) gets done really fast.
However, after having processed around 500GB of data, that map transformation starts being slow and some of the tasks are taking several minutes or even hours to complete.
I am using 10 machines, each with 122 GB of memory and 16 CPUs, and I am allocating all resources to the worker nodes. I thought about increasing the number of machines, but is there anything else I could be missing?
I have tried with a small portion of my data set (30 GB) and it seemed to be working fine.
It seems that the stage completes faster on some nodes than on others. Based on that observation, here is what I would try:
1. Cache the RDD that you process. Do not forget to unpersist it when you don't need it anymore. See: Understanding caching, persisting in Spark.
2. Check whether the partitions are balanced, which doesn't seem to be the case (that would explain why some local stages complete much earlier than others). Having balanced partitions is the holy grail in distributed computing, isn't it? :) See: How to balance my data across the partitions?
3. Reduce the communication costs, i.e. use fewer workers than you do now, and see what happens. Of course that heavily depends on your application. Sometimes communication costs become so big that they dominate, so using fewer machines, for example, speeds up the job. However, I would do that only if steps 1 and 2 do not suffice.
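The balance check in the second suggestion can be reduced to one number. The following is a minimal sketch in plain Python; the per-partition record counts are hypothetical. With the RDD API they could be collected with something like `rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()`, which assumes the job is written against RDDs.

```python
def partition_skew(sizes):
    """Return the largest partition size divided by the mean size.

    1.0 means perfectly balanced; larger values mean one partition
    (and therefore one task) carries far more than its share.
    """
    if not sizes:
        raise ValueError("need at least one partition size")
    mean = sum(sizes) / len(sizes)
    if mean == 0:
        return 1.0  # all partitions empty: trivially balanced
    return max(sizes) / mean


# Hypothetical record counts per partition, for illustration only.
balanced = [100, 100, 100, 100]
skewed = [10, 10, 10, 370]
print(partition_skew(balanced))
print(partition_skew(skewed))
```

A ratio well above 1 suggests that a few tasks will keep running long after the rest have finished, which matches the symptom described in the question.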
Without any more information, it would seem that at some point in the computation your data gets spilled to disk because there is no more space in memory.
It's just a guess; you should check your Spark UI.

When is it a good idea to increase/decrease the number of nodes interactively on a Hadoop MapReduce job?

My intuition is that increasing or decreasing the number of nodes
interactively on a running job can speed up map-heavy jobs, but won't
help with reduce-heavy jobs, where most of the work is done by the
reducers.
There's an FAQ about this, but it doesn't explain it very well:
http://aws.amazon.com/elasticmapreduce/faqs/#cluster-18
This question was answered by Christopher Smith, who gave me permission to post here.
As always... "it depends". One thing you can pretty much always count
on: adding nodes later on is not going to help you as much as having
the nodes from the get go.
When you create a Hadoop job, it gets split up in to tasks. These
tasks are effectively "atoms of work". Hadoop lets you tweak the # of
mapper and # of reducer tasks during job creation, but once the job is
created, it is static. Tasks are assigned to "slots". Traditionally,
each node is configured to have a certain number of slots for map
tasks, and a certain number of slots for reduce tasks, but you can
tweak that. Some newer versions of Hadoop don't require you to
designate the slots as being for map or reduce tasks. Anyway, the
JobTracker periodically assigns tasks to slots. Because this is done
dynamically, new nodes coming online can speed up the processing of a
job by providing more slots to execute the tasks.
This sets the stage for understanding the reality of adding new nodes.
There's obviously an Amdahl's law issue where having more slots than
pending tasks accomplishes little (if you have speculative execution
enabled, it does help somewhat, as Hadoop will schedule the same task
to run on many different nodes, so that a slow node's tasks can be
completed by faster nodes if there are spare resources). So, if you
didn't define your job with many map or reduce tasks, adding more
nodes isn't going to help much. Of course, each task imposes some
overhead, so you don't want to go crazy high either. That's why I
suggest a guideline for task size should be "something which takes
~2-5 minutes to execute".
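The slot argument above can be made concrete with a toy model. This is only a sketch under strong assumptions: every task takes the same time, each task goes to whichever slot frees up first, and data locality, speculative execution, and scheduling delays are ignored. The function name and the numbers are hypothetical.

```python
import heapq


def makespan(num_tasks, task_minutes, slot_start_times):
    """Minutes until the last task finishes under greedy slot assignment.

    slot_start_times holds, for each slot, the minute at which that slot
    becomes available (0 for initial nodes, later for nodes added mid-job).
    """
    if not slot_start_times:
        raise ValueError("need at least one slot")
    free_at = list(slot_start_times)  # when each slot is next free
    heapq.heapify(free_at)
    finish = 0
    for _ in range(num_tasks):
        start = heapq.heappop(free_at)  # earliest available slot
        end = start + task_minutes
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


# 40 tasks of 3 minutes each.
print(makespan(40, 3, [0] * 10))             # 10 slots from the start
print(makespan(40, 3, [0] * 20))             # 20 slots from the start
print(makespan(40, 3, [0] * 10 + [5] * 10))  # 10 more slots join at minute 5
print(makespan(40, 3, [0] * 100))            # more slots than tasks
```

In this model the four cases finish in 12, 6, 9, and 3 minutes. Slots that arrive late help less than slots present from the start, and slots beyond the number of pending tasks add nothing.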
Of course, when you add nodes dynamically, they have one other
disadvantage: they don't have any data local. Obviously, if you are at
the start of an EMR pipeline, none of the nodes have data in them, so
it doesn't matter, but if you have an EMR pipeline made of many jobs,
with earlier jobs persisting their results to HDFS, you get a huge
performance boost because the JobTracker will favour shaping and
assigning tasks so nodes have that lovely locality of data (this is a
core trick of the whole MapReduce design to maximize performance). On
the reducer side, data is coming from other map tasks, so dynamically
added nodes are really at no disadvantage as compared to other nodes.
So, in principle, dynamically adding new nodes is actually less likely
to help with IO bound map tasks that are reading from HDFS.
Except...
Hadoop has a variety of cheats under the covers to optimize
performance. One is that it starts transmitting map output data to
the reducers before the map task completes/the reducer starts. This
obviously is a critical optimization for jobs where the mappers
generate a lot of data. You can tweak when Hadoop starts to kick off
the transfers. Anyway, this means that a newly spun up node might be
at a disadvantage, because the existing nodes might already have such
a huge data advantage. Obviously, the more output that the mappers
have transmitted, the larger the disadvantage.
That's how it all really works. In practice though, a lot of Hadoop
jobs have mappers processing tons of data in a CPU intensive fashion,
but outputting comparatively little data to the reducers (or they
might send a lot of data to the reducers, but the reducers are still
very simple, so not CPU bound at all). Often jobs will have few
(sometimes even 0) reducer tasks, so even where extra nodes could in
principle help, if you already have a reduce slot available for every
outstanding reduce task, new nodes can't help. New nodes also
disproportionately help out
with CPU bound work, for obvious reasons, so because that tends to
be map tasks more than reduce tasks, that's where people typically see
the win. If your mappers are I/O bound and pulling data from the
network, adding new nodes obviously increases the aggregate bandwidth
of the cluster, so it helps there, but if your map tasks are I/O bound
reading HDFS, the best thing is to have more initial nodes, with data
already spread over HDFS. It's not unusual to see reducers get I/O
bound because of poorly structured jobs, in which case adding more
nodes can help a lot, because it splits up the bandwidth again.
There's a caveat there too of course: with a really small cluster,
reducers get to read a lot of their data from the mappers running on
the local node, and adding more nodes shifts more of the data to being
pulled over the much slower network. You can also have cases where
reducers spend most of their time just multiplexing data processing
from all the mappers sending them data (although that is tunable as
well).
If you are asking questions like this, I'd highly recommend profiling
your job using something like Amazon's offering of KarmaSphere. It
will give you a better picture of where your bottlenecks are and what
are your best strategies for improving performance.

How to decrease number of map sweeps in a job (without changing data chunk size)?

The gist of my problem is: how does one decrease the number of map sweeps a job may need? The number of map tasks for a job is data_size/HDFS_BLOCK_SIZE. The number of sweeps it takes to complete them depends on how many map slots we have. Assuming I am running nothing else, just one job, I find that the per-node CPU utilization is low (implying I could actually run more map tasks per node). I played with the mapred.tasktracker.map.tasks.maximum parameter (for example, each of my nodes has 32 processors and I set it as high as 30), but I could never increase the number of map slots, and the overall CPU utilization is 60% or so. Are there any other parameters to play with? The data size I have is large enough (32 GB on an 8-node cluster, each node with 32 CPUs) and it does take two map sweeps (the first sweep does maps 1-130 and the second sweep completes the rest).
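The arithmetic in the question can be written out as a small sketch. The question does not state the HDFS block size, so the 128 MiB used here is an assumption, and 16 slots per node is only a guess inferred from roughly 130 maps running in the first sweep across 8 nodes.

```python
def map_task_count(data_bytes, block_bytes):
    """Number of map tasks when there is one task per HDFS block."""
    return -(-data_bytes // block_bytes)  # integer ceiling division


def sweeps(num_tasks, nodes, slots_per_node):
    """Number of waves needed to run num_tasks on the available map slots."""
    total_slots = nodes * slots_per_node
    return -(-num_tasks // total_slots)


GIB = 1024 ** 3
MIB = 1024 ** 2

tasks = map_task_count(32 * GIB, 128 * MIB)  # assumed block size
print(tasks)
for slots_per_node in (16, 30, 32):
    print(slots_per_node, sweeps(tasks, nodes=8, slots_per_node=slots_per_node))
```

Under these assumed numbers there are 256 map tasks, and the job needs two sweeps with 16 or 30 slots per node and one sweep with 32. Whether fewer sweeps actually shortens the job still depends on whether the nodes are I/O bound.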
In case no one has told you yet:
MapReduce is mainly I/O bound: it has to read a lot of data from disk, write it back, read it and write it again. In between the reads and writes it executes your map and reduce logic.
So, from what I have heard, raising the CPU usage means making the cluster no longer I/O bound:
RAID-0 or RAID-10 your hard disks, and get the fastest hard disks out there. In the consumer market there are the Western Digital VelociRaptors with 10k RPM.
SSDs don't contribute too much, since Hadoop is mostly optimized for sequential reads.
Give it as much network bandwidth as possible.
Lots of RAM for disk caching.
Even then, you should expect <100% CPU utilization, but it is much better and the performance will skyrocket.
However, CPU utilization is not a good metric for a Hadoop cluster, as you might conclude from the points above.
Hadoop is mainly about the reliable storage of data, with neat features to crunch it. It is not about giving you supercomputer performance; if you need that, get an MPI cluster and a Ph.D. to code your algorithms ;)
Sorry for the thrash, but something must have gone wrong with my installation. I happened to reinstall Hadoop and it now works as expected. I guess some parameters must have been conflicting.

What is the right way to identify bottlenecks in map/reduce jobs?

In normal java development, if I want to improve the performance of an application my usual procedure would be to run the program with a profiler attached, or alternatively embed within the application a collection of instrumentation marks. In either case, the immediate goal is to identify the hot spot of the application, and subsequently to be able to measure the effect of the changes that I make.
What is the correct analog when the application is a map/reduce job running in a hadoop cluster?
What options are available for collecting performance data when jobs appear to be running more slowly than you would predict from running equivalent logic in your development sandbox?
Map/Reduce Framework
Watch the Job in the Job-Tracker. Here you will see how long the mappers and reducers take. A common example would be if you do too much work in the reducers. In that case you will notice that the mappers finish quite soon while the reducers take forever.
It might also be interesting to see if all your mappers take a similar amount of time. Maybe the job is held up by a few slow tasks? This could indicate a hardware defect in the cluster (in which case speculative execution could be the answer) or the workload is not distributed evenly enough.
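Checking whether a few slow tasks hold up the job can be done by comparing each task against the median. This is a minimal sketch: the task IDs and durations are hypothetical values of the kind one might copy from the job's task list, and the factor of 2 is an arbitrary threshold.

```python
from statistics import median


def find_stragglers(durations, factor=2.0):
    """Return {task_id: seconds} for tasks slower than factor x the median."""
    if not durations:
        return {}
    typical = median(durations.values())
    return {
        task: secs
        for task, secs in durations.items()
        if secs > factor * typical
    }


# Hypothetical map task durations in seconds.
map_durations = {
    "m_000001": 62,
    "m_000002": 58,
    "m_000003": 65,
    "m_000004": 240,
}
print(find_stragglers(map_durations))
```

If the flagged tasks always land on the same node, hardware is the likely suspect. If they follow particular input splits, the workload distribution is.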
The Operating System
Watch the nodes (either with something as simple as top or with monitoring tools such as Munin or Ganglia) to see if your job is CPU bound or I/O bound. If, for example, your reduce phase is I/O bound, you can increase the number of reducers you use.
Something else you might detect here is when your tasks are using too much memory. If the TaskTrackers do not have enough RAM, increasing the number of tasks per node might actually hurt performance. A monitoring system might highlight the resulting swapping.
The Single Tasks
You can run a Mapper or Reducer in isolation for profiling. In this case you can use all the tools you already know.
If you think the performance problem appears only when the job is executed in the cluster, you can measure the time of relevant portions of the code with System.nanoTime() and use System.out to print some rough performance numbers.
Of course there is also the option of adding JVM parameters to the child JVMs and connecting a profiler remotely.
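The answer describes manual timing for Java tasks. For a streaming task written in a scripting language, the same idea might look like the sketch below. This is a Python analogue using `time.perf_counter_ns()`, not the Java approach itself, and the class and section names are hypothetical. The report goes to stderr because in a streaming task stdout carries the job's data.

```python
import sys
import time
from collections import defaultdict
from contextlib import contextmanager


class SectionTimer:
    """Accumulates wall-clock time per named section of a task."""

    def __init__(self):
        self.total_ns = defaultdict(int)
        self.calls = defaultdict(int)

    @contextmanager
    def section(self, name):
        start = time.perf_counter_ns()
        try:
            yield
        finally:
            # Record the elapsed time even if the section raised.
            self.total_ns[name] += time.perf_counter_ns() - start
            self.calls[name] += 1

    def report(self, stream=sys.stderr):
        for name in sorted(self.total_ns):
            ms = self.total_ns[name] / 1e6
            print(f"{name}: {self.calls[name]} calls, {ms:.1f} ms total",
                  file=stream)


timer = SectionTimer()
for record in ["1,2,3", "4,5,6"]:  # stand-in for lines read from stdin
    with timer.section("parse"):
        fields = [int(x) for x in record.split(",")]
    with timer.section("compute"):
        total = sum(fields)
timer.report()
```

Comparing these rough totals between a local run and a cluster run shows whether the slowdown is in the task's own logic or somewhere around it.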

Estimating Hadoop Scalability Performance on pseudo-distributed nodes?

Are there any tools, packages, or methodologies available to estimate / simulate the scalability performance of Hadoop using only a single machine using a pseudo-distributed architecture? Such a system would need to make accurate estimations based on jobs that do not interfere with each other in the simulation (e.g., with blocked I/O).
In my mind, how this would work is that I'd run all my map / reduce jobs sequentially, and use some metric to estimate how well the system is scaling (e.g., take the longest running map job and estimate that the run time will be bottlenecked by it).
Additionally, I have multiple map/reduce jobs which are being chained together to form the output.
I think it largely depends on the nature of your job. Let us take a few examples:
1. Your job has heavy input formatting and mapper processing, with minimal data passed to the reducer. In this case I would estimate that a pseudo-distributed cluster will realistically reflect real cluster performance (per slot), and you can assume that a 5-node cluster will have about 5x the performance. I would suggest using enough data that the job takes at least 5-10 times the job start-up time. This estimate will be better if you have enough splits to ensure data locality during processing.
If you plan to have a lot of relatively small files, put enough of them in your test to simulate the per-task overhead.
2. You are relying heavily on Hadoop's distributed sort capability (shuffling). Its performance on one node and on a real cluster can be quite different, and the factor is hard to estimate.
To summarize: the throughput of the mappers and, to some extent, the reducers, in terms of MB/sec per slot, can be estimated from the above. A real cluster will probably not have better performance per slot.
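For the map-heavy case, the per-slot extrapolation can be sketched as follows. This is a rough model that assumes linear scaling with the number of slots and a fixed start-up cost. It says nothing about the shuffle-heavy case, and all numbers are hypothetical.

```python
def estimate_job_seconds(input_mb, mb_per_sec_per_slot, slots_per_node,
                         nodes, startup_seconds=0.0):
    """Estimate job time from per-slot throughput measured on one machine."""
    if min(mb_per_sec_per_slot, slots_per_node, nodes) <= 0:
        raise ValueError("throughput, slots and nodes must be positive")
    cluster_mb_per_sec = mb_per_sec_per_slot * slots_per_node * nodes
    return startup_seconds + input_mb / cluster_mb_per_sec


# Hypothetical measurement: 10 MB/s per slot, 4 slots, 30 s job start-up.
single = estimate_job_seconds(100_000, 10, 4, nodes=1, startup_seconds=30)
five = estimate_job_seconds(100_000, 10, 4, nodes=5, startup_seconds=30)
print(single, five)
```

With these inputs the estimate drops from 2530 seconds on one node to 530 on five. That is slightly less than a 5x gain, because the start-up time does not shrink, which is why the test job should run much longer than its start-up time.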
