Which metrics to measure the efficiency of a MapReduce application? - performance

I wrote a MapReduce application which run on 6 nodes of computers.
I am sure that my MapReduce algorithm (run on a cluster of computers) outperforms the sequential algorithm (run on a single computer), but I think this does not mean that my MapReduce algorithm is efficiently enough, right?
I have searched around and found: speedup, scaleup, and sizeup metrics. Is it true that we normally consider these metrics when measuring the efficiency of MapReduce application? Is there any metric that we need to consider?
Thank you a lot.

Before specifically addressing your question, let's revisit the map-reduce model and see what's the real problem, it tries to solve. You can refer this answer (by me/ of course you can refer other good answers for the same problem), to get an idea of map-reduce model.
So what it really tries to solve? It deduces a generic model that can be applied to solve vast a range of problems that needs to process a massive amount of data (usually in GBs or even Peta Bytes). And the real deal of this model is, it can be easily parallelized and can even be easily distributed the execution among number of nodes. This article (by me) has some detailed explanation of whole model.
So let's go to your question, you are asking about measuring the efficiency of a map reduce program based on speed, memory-efficiency and scalability.
Speaking to the point, the efficiency of a map-reduce program always depend on how far it enjoys the parallelism given by the underlying computational power. This directly indicates that a map-reduce program runs on one cluster may not be the ideal program to run in a different cluster. So we need to have a good idea of our cluster, if we hope to build up our program to a precisely fine-tuned level. But practically its rare some one needs to get it tuned up to that much level.
Let's take your points one by one:
Speed up:
It depends on how you split your input to different portions. This directly deduces the amount of parallelism (in human control). So as I mentioned above, the speed-up directly depends on how your split logic going to be able to utilize your cluster.
Memory efficiency:
It mostly depends on how memory efficient your mapper logic and reducer logic are.
Scalability:
This is mostly out of concern. You can see that the map-reduce model is already highly scalable to a level that one would rarely think about an extra mile.
So speaking as a whole, efficiency of a map reduce program is rarely a concern (even speed and memory). Practically speaking the most valuable metric is the quality of its output. i.e. how good your analytic data are. (in place of marketing, research etc.)

Related

distribute processing to a cluster of heterogeneous compute nodes taking relative performance and cost of communication into account?

Given a cluster of truly heterogeneous compute nodes how is it possible to
distribute processing to them while taking into account both their relative performance
and cost of passing messages between them?
(I know optimising this is NP-complete in general)
Which concurrency platforms currently best support this?
You might rephrase/summarise the question as:
What algorithms make most efficient use of cpu, memory and communications resources for distributed computation in theory and what existing (open source) platforms come closest to realising this?
Obviously this depends somewhat on workload so understanding the trade-offs is critical.
Some Background
I find some on S/O want to understand the background so they can provide a more specific answer, so I've included quite a bit below, but its not necessary to the essence of the question.
A typical scenario I see is:
We have an application which runs on X nodes
each with Y cores. So we start with a homogeneous cluster.
Every so often the operations team buys one or more new servers.
The new servers are faster and may have more cores.
They are integrated into the cluster to make things run faster.
Some older servers may be re-purposed but the new cluster now contains machines with different performance characteristics.
The cluster is no-longer homogeneous but has more compute power overall.
I believe this scenario must be standard in big cloud data-centres as well.
Its how this kind of change in infrastructure can be best utilised that I'm really interested in.
In one application I work with the work is divided into a number of relative long tasks. Tasks are allocated to logical processors (we usually have one per core) as they become
available. While there are tasks to perform cores are generally not unoccupied but
for the most part those jobs can be classified as "embarassingly scalable".
This particular application is currently C++ with a roll your own concurrency platform using ssh and nfs for large task.
I'm considering the arguments for various alternative approaches.
Some parties prefer various hadoop mad/reduce options. I'm wondering how they shape up versus more C++/machine oriented approaches such as openMP, Cilk++. I'm more interested in the pros and cons than the answer for that specific case.
The task model itself seems scalable and sensible independent of platform.
So, I'm assuming a model where you divide work into tasks and a (probably distributed) scheduler tries to decide which processor to which allocate each task. I am open to alternatives.
There could be task queues for each node, possibly each processor and idle processors should allow work stealing (e.g. from processors with long queues).
However, when I look at the various models of high performance and cloud cluster computing I don't see this discussed so much.
Michael Wong classifies parallelism, ignoring hadoop, into two main camps (starting around 14min in).
https://isocpp.org/blog/2016/01/the-landscape-of-parallelism-michael-wong-meetingcpp-2015
HPC and multi-threaded applications in industry
The HPC community seems to favour openMP on a cluster of identical nodes.
This may still be heterogeneous if each node supports CUDA or has FPGA support but each node tends to be identical.
If that's the case do they upgrade their data centres in a big bang or what?
(E.g. supercomputer 1 = 100 nodes of type x. supercomputer v2.0 is on a different site with
200 nodes of type y).
OpenMP only supports a single physical computer by itself.
The HPC community gets around this either using MPI (which I consider too low level) or by creating a virtual machine from all the nodes
using a hypervisor like scaleMP or vNUMA (see for example - OpenMP program on different hosts).
(anyone know of a good open source hypervisor for doing this?)
I believe these are still considered the most powerful computing systems in the world.
I find that surprising as I don't see what prevents the map/reduce people creating an even bigger cluster more easily
that is much less efficient overall but wins on brute force due to the total number of cores utilised?
So which other concurrency platforms support truly heterogeneous nodes with widely varying characteristics and how do they deal with the performance mismatch (and similarly the distribution of data)?
I'm excluding MPI as an option as while powerful it is too low-level. You might as well say use sockets. A framework building on MPI would be acceptable (does X10 work this way?).
From the user's perspective the map/reduce
approach seems to be add enough nodes that it doesn't matter and not worry about using them at maximum efficiency.
Actually those details are kept under the hood in the implementation
of the schedulers and distributed file systems.
How/where is the cost of computation and message passing taken into account?
Is there any way in openMP (or your favourite concurrency platform)
to make effective use of information that this node is N times as fast as this node and the data transfer rate
to or from this node is on average X Mb/s?
In YARN you have Dominant Resource Fairness:
http://blog.cloudera.com/blog/2013/12/managing-multiple-resources-in-hadoop-2-with-yarn/
http://static.usenix.org/event/nsdi11/tech/full_papers/Ghodsi.pdf
This covers memory and cores using Linux Control Groups but it does not yet
cover disk and network I/O resources.
Are there equivalent or better approaches in other concurrency platforms? How do they compare to DRF?
Which concurrency platforms handle this best and why?
Are there any popular ones that are likely to be evolutionary dead ends?
OpenMP keeps surprising me by actively thriving. Could something like Cilk++ be made to scale this way?
Apologies in advance for combining several PhD thesis worth questions into one.
I'm basically looking for tips on what to look for for further reading
and advice on which platforms to investigate further (from the programmer's perspective).
A good summary of some platforms to investigate and/or links to papers or articles would suffice as a useful answer.

Is my application running efficiently?

The question is generic and can be extended to other frameworks or contexts beyond Spark & Machine Learning algorithms.
Regardless of the details, from a high-level point-of-view, the code is applied on a large dataset of labeled text documents. It passes by 9 iterations of cross-validation to tune some parameters of a Logistic Regression multi-class classifier.
It is expected that this kind of Machine Learning processing will be expensive in term of time and resources.
I am running now the code and everything seems to be OK, except that I have no idea if my application is running efficiently or not.
I couldn't find guidelines saying that for a certain type and amount of data, and for certain type of processing and computing resources the processing time should be in the approximate order of...
Is there any method that help in judging if my application is running slow or fast, or it is purely a matter of experience?
I had the same question and I didn't find a real answer/tool/way to test how good my performances were just looking "only inside" my application.
I mean, as far as I know, there's no tool like a speedtest or something like for the internet connection :-)
The only way I found is to re-write my app (if possible) with another stack in order to see if the difference (in terms of time) is THAT big.
Otherwise, I found very interesting 2 main resources, even if quite old:
1) A sort of 4 point guide to remember when coding:
Understanding the Performance of Spark Applications, SPark Summit 2013
2) A 2-episode article from Cloudera blog to tune at best your jobs:
episode1
episode2
Hoping it could help
FF
Your question is pretty generic, so I would also highlight few generic areas where you can look out for performance optimizations: -
Scheduling Delays - Are there significant scheduling delays in scheduling the tasks? if yes then you can analyze the reasons (may be your cluster needs more resources etc).
Utilization of Cluster - are your jobs utilizing the available cluster resources (like CPU, mem)? In case not then again look out for the reasons. May be creating more partitions helps in faster execution. May be there is significant time taken in serialization, so can you switch to Kyro Serialization.
JVM Tuning - Consider analyzing GC logs and tune if you find anomalies.
Executor Configurations - Analyze the memory/ cores provided to your executors. It should be sufficient to hold the data processed by the task/job.
your DAG and
Driver Configuration - Same as executors, Driver should also have enough memory to hold the results of certain functions like collect().
Shuffling - See how much time is spend in Shuffling and kind of Data Locality used by your task.
All the above are needed for the preliminary investigations and in some cases it can also increase the performance of your jobs to an extent but there could be complex issues for which the solution will depend upon case to case basis.
Please also see Spark Tuning Guide

Why does real world MapReduce jobs tend to have very small dataset sizes?

All the papers I have read suggest real world mapreduce jobs tend to operate on relatiely small data set sizes (mostly map only, tend to operate on KB-16GB for vast majority of jobs). If anyone working in production world could talk about how and why smaller data set tends to be the case, I would understand better. For small dataset (<128MB), are the files tend to be fragmented or contigous because it has some implication on the splits and number of map tasks spawned ? And if hadoop lets mapreduce to operate only on a section of file ?
Any pointers is much appreciated.
Typically small data is used to quickly check if the logic / code is good enough. The evaluations have to be done again and again until a good solution is obtained.
I work in production and we use small data for unit testing (order of MBs) and we have sample data sets of size 10-30 gigs which we use for integration testing at dev end. But this is way too small considering the actual data dealt with on prod servers (which is in order of terabytes). The dev environment is of low capacity as compared to prod environment so we cannot expect terabytes of data to run smoothly over it... plus its time consuming as it has to be executed for every release.
Moving to technical papers: Authors want real data: that too which is inclined towards the specific use cases that they attempt to solve. Its difficult to obtain huge data sets (10-100 gigs) focused to their problem. I have seen few papers where they used huge data sets but then those researchers where belonged to big companies and can easily get that data.

How efficient are opensource computation platform like Hadoop etc.?

How efficient are opensource distributed computation frameworks like Hadoop? By efficiency, I mean CPU cycles that can be used for the "actual job" in tasks that are mostly pure computation. In other words, how much CPU cycles are used for overhead, or wasted because of being not used? I'm not looking for specific numbers, just a rough picture. E.g. can I expect to use 90% of the cluster's CPU power? 99%? 99.9%?
To be more specific, let's say I want to calculate PI, and I have an algorithm X. When I perform this on a single core in a tight loop, let's say I get some performance Y. If I do this calculation in a distributed fashion using e.g. Hadoop, How much performance degradation can I expect?
I understand this would depend on many factors, but what would be the rough magnitude? I'm thinking of a cluster with maybe 10 - 100 servers (80 - 800 CPU cores total), if that matters.
Thanks!
Technically hadoop has considerable overheads in several dimensions:
a) Per task overhead which can be estimated from 1 to 3 seconds.
b) HDFS Data reading overhead, due to passing data via socket and CRC calculation. It is harder to estimate
These overheads can be very significant if you have a lot of small tasks, and/or if your data processing is light.
In the same time if your have big files (less tasks) and Your data processing is heavy (let say a few mb/sec per core) then Hadoop overhead can be negleted.
In a bottom line - Hadoop overhead is variable things which higly depends on the nature of processing you are doing.
This question is too broad and vague to answer usefully. There are many different open-source platforms, varying very widely in their quality. Some early Beowulfs were notoriously wasteful, for example, whereas modern MPI2 is pretty lean.
Also, "efficiency" means different things in different domains. It might mean the amount of CPU overhead spent on constructing and passing messages relative to the work payload (in which case you're comparing MPI vs Map/Reduce), or it might mean the number of CPU cycles wasted by the interpreter/VM, if any (in which case you're comparing C++ vs Python).
It depends on the problem you are trying to solve, too. In some domains, you have lots of little messages flying back and forth, in which case the CPU cost of constructing them matters a lot (like high-frequency trading). In others, you have relatively few but large work-blocks, so the cost of packing the messages is small compared to the computational efficiency of the math inside the work block (like Folding#Home).
So in summary, this is an impossible question to answer generally, because there's no one answer. It depends on specifically what you're trying to do with the distributed platform, and what machinery it is running on.
MapR is one of the alternative for Apache Hadoop and Srivas (CTO and founder of MapR) has compared MapR with Apache Hadoop. The below presentation and video have metrics comparing MapR and Apache Hadoop. Looks like the hardware is not efficiently used in Apache Hadoop.
http://www.slideshare.net/mcsrivas/design-scale-and-performance-of-maprs-distribution-for-hadoop
http://www.youtube.com/watch?v=fP4HnvZmpZI
Apache Hadoop seems to be inefficient in some dimensions, but there is a lot of activity in Apache Hadoop community around scalability/reliability/availability/efficiency. Next Generation MapReduce, HDFS Scalability/Availability are some of things being worked currently. These would be available in the Hadoop version 0.23.
Till some time back, the focus of the Hadoop community seemed to be on scalability, but now shifting towards efficiency also.

What are some scenarios for which MPI is a better fit than MapReduce?

As far as I understand, MPI gives me much more control over how exactly different nodes in the cluster will communicate.
In MapReduce/Hadoop, each node does some computation, exchanges data with other nodes, and then collates its partition of results. Seems simple, but since you can iterate the process, even algorithms like K-means or PageRank fit the model quite well. On a distributed file system with locality of scheduling, the performance is apparently good. In comparison, MPI gives me explicit control over how nodes send messages to each other.
Can anyone describe a cluster programming scenario where the more general MPI model is an obvious advantage over the simpler MapReduce model?
Almost any scientific code -- finite differences, finite elements, etc. Which kind of leads to the circular answer, that any distributed program which doesn't easily map to MapReduce would be better implemented with a more general MPI model. Not sure that's much help to you, I'll downvote this answer right after I post it.
Athough, this question has been answered, I would like to add/reiterate one very important point.
MPI is best suited for problems that require a lot of interprocess communication.
When Data becomes large (petabytes, anyone?), and there is little interprocess communication, MPI becomes a pain. This is so because the processes will spend all the time sending data to each other (bandwidth becomes a limiting factor) and your CPUs will remain idle. Perhaps an even bigger problem is reading all that data.
This is the fundamental reason behind having something like Hadoop. The Data also has to be distributed - Hadoop Distributed File System!
To say all this in short, MPI is good for task parallelism and Hadoop is good for Data Parallelism.
The best answer that I could come up with is that MPI is better than MapReduce in two cases:
For short tasks rather than batch processing. For example, MapReduce cannot be used to respond to individual queries - each job is expected to take minutes. I think that in MPI, you can build a query response system where machines send messages to each other to route the query and generate the answer.
For jobs nodes need to communicate more than what iterated MapReduce jobs support, but not too much so that the communication overheads make the computation impractical. I am not sure how often such cases occur in practice, though.
I expect that MPI beats MapReduce easily when the task is iterating over a data set whose size is comparable with the processor cache, and when communication with other tasks is frequently required. Lots of scientific domain-decomposition parallelization approaches fit this pattern. If MapReduce requires sequential processing and communication, or ending of processes, then the computational performance benefit from dealing with a cache-sized problem is lost.
When the computation and data that you are using have irregular behaviors that mostly translates to many message-passings between objects, or when you need low level hardware level accesses e.g. RDMA then MPI is better. In some answers that you see in here the latency of tasks or memory consistency model gets mentioned, frameworks like Spark or Actor Models like AKKA have shown that they can compete with MPI. Finally one should consider that MPI has benefit of being for years the main base for development of libraries needed for scientific computations (This are the most important missing parts missing from new frameworks using DAG/MapReduce Models).
All in all, I think the benefits that MapReduce/DAG models are bringing to the table like dynamic resource managers, and fault tolerance computation will make make them feasible for scientific computing groups.

Resources