Data Mining Library for MPI - hadoop

Is there any Data Mining library that uses (or can be used with) MPI (Message Passing Interface)? I am looking for something similar to Apache Mahout that can easily be integrated into an MPI environment.
The reason why I want to use MPI is that the configuration (compared to Hadoop) is easy.
Or does it not make sense to use MPI in a Data Mining scenario?

There is no reason why MPI (which is a specification, not a piece of software itself!) is necessarily easier to install than Hadoop/Mahout. Admittedly, the latter two are currently a mess, in particular because of their Java library chaos. Apache Bigtop tries to make them easier to install, and once you've figured out some basics it's quite OK.
However:
If your data is small (i.e. it can be processed on a single node), don't install a cluster solution; you would only pay for the overhead. Hadoop does not make much sense on a single host. Use Weka, ELKI, RapidMiner, KNIME or whatever.
If your data is large, you will want to minimize data transfer, and this is where the strength of Hadoop/Mahout lies: the computation is moved to the nodes that already hold the data. A typical message passing API cannot scale the same way for data-heavy operations.
There are some efforts such as Apache Hama that are quite similar to MPI stuff IMHO. It is based on messages, however they are bulk-processed via barrier synchronization. It might also have some message aggregation prior to sending to reduce traffic.
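To make the "bulk-processed via barrier synchronization" idea concrete, here is a minimal single-process sketch of the bulk synchronous parallel (BSP) pattern. It does not use Hama's actual API; the function name `bsp_max` and the ring topology are assumptions made purely for illustration. Each simulated worker sends its value to both ring neighbours, messages to the same destination are combined before delivery (the aggregation mentioned above), and nothing is delivered until every worker has finished the superstep.

```python
def bsp_max(values, supersteps):
    """Propagate the global maximum around a ring of workers, BSP style.

    Hypothetical illustration only: workers are simulated in one process.
    """
    n = len(values)
    state = list(values)
    for _ in range(supersteps):
        # Phase 1: every worker computes locally and buffers its sends.
        outbox = {}  # destination rank -> combined message
        for rank in range(n):
            for dest in ((rank - 1) % n, (rank + 1) % n):
                # Combiner: keep one message (the largest) per destination.
                outbox[dest] = max(outbox.get(dest, state[rank]), state[rank])
        # Phase 2: barrier. No message is visible before all sends are done,
        # which is why `state` is not modified during phase 1.
        # Phase 3: bulk delivery of the combined messages.
        for dest, message in outbox.items():
            state[dest] = max(state[dest], message)
    return state


print(bsp_max([3, 9, 1, 4, 7], 1))
print(bsp_max([3, 9, 1, 4, 7], 2))
```

With five workers the maximum reaches every worker after two supersteps, since each superstep moves information one hop in both directions.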

I strongly recommend GraphLab. Currently GraphLab, a distributed graph-parallel API, has toolkits including:
topic modeling
collaborative filtering
clustering
graphical model
http://docs.graphlab.org/toolkits.html
GraphLab is a graph-based, high performance, distributed computation framework written in C++. While GraphLab was originally developed for Machine Learning tasks, it has found great success at a broad range of other data-mining tasks, outperforming other abstractions by orders of magnitude.
GraphLab Features:
A unified multicore and distributed API: write once run efficiently in both shared and distributed memory systems
Tuned for performance: optimized C++ execution engine leverages extensive multi-threading and asynchronous IO
Scalable: GraphLab intelligently places data and computation using sophisticated new algorithms
HDFS Integration: Access your data directly from HDFS
Powerful Machine Learning Toolkits: Turn BigData into actionable knowledge with ease

This idea doesn't make sense, and I think you have some misconceptions. MPI is meant for tightly coupled systems, and I'm 99% sure it won't send messages to an external location. You can, however, process or analyze the data with MPI much more quickly (depending on your hardware). My 2 cents: you are better off using an open source message queue (one of the AMQP implementations, or ZeroMQ, which I would say is your best bet) and then processing all the data you get in R or Python or, if your data set is very, very large, MPI. Another option is to call serial libraries on different machines that are connected and running MPI, given that each of them is connected to the internet separately. R is really easy to call with MPI, and so is Python.

Related

distribute processing to a cluster of heterogeneous compute nodes taking relative performance and cost of communication into account?

Given a cluster of truly heterogeneous compute nodes how is it possible to
distribute processing to them while taking into account both their relative performance
and cost of passing messages between them?
(I know optimising this is NP-complete in general)
Which concurrency platforms currently best support this?
You might rephrase/summarise the question as:
What algorithms make most efficient use of cpu, memory and communications resources for distributed computation in theory and what existing (open source) platforms come closest to realising this?
Obviously this depends somewhat on workload so understanding the trade-offs is critical.
Some Background
I find some people on SO want to understand the background so they can provide a more specific answer, so I've included quite a bit below, but it's not necessary to the essence of the question.
A typical scenario I see is:
We have an application which runs on X nodes
each with Y cores. So we start with a homogeneous cluster.
Every so often the operations team buys one or more new servers.
The new servers are faster and may have more cores.
They are integrated into the cluster to make things run faster.
Some older servers may be re-purposed but the new cluster now contains machines with different performance characteristics.
The cluster is no-longer homogeneous but has more compute power overall.
I believe this scenario must be standard in big cloud data-centres as well.
It's how this kind of change in infrastructure can best be utilised that I'm really interested in.
In one application I work with, the work is divided into a number of relatively long tasks. Tasks are allocated to logical processors (we usually have one per core) as they become
available. While there are tasks to perform, cores are generally kept busy, and
for the most part those jobs can be classified as "embarrassingly scalable".
This particular application is currently C++ with a roll-your-own concurrency platform using ssh and NFS for large tasks.
I'm considering the arguments for various alternative approaches.
Some parties prefer various Hadoop map/reduce options. I'm wondering how they shape up versus more C++/machine-oriented approaches such as OpenMP or Cilk++. I'm more interested in the pros and cons than the answer for that specific case.
The task model itself seems scalable and sensible independent of platform.
So, I'm assuming a model where you divide work into tasks and a (probably distributed) scheduler tries to decide which processor to allocate each task to. I am open to alternatives.
There could be task queues for each node, possibly for each processor, and idle processors should be allowed to steal work (e.g. from processors with long queues).
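As a rough illustration of the per-processor queues with work stealing described here, the following is a small discrete-time simulation. It is only a sketch under assumptions that are mine, not the application's: task costs are in abstract work units, `speeds` gives work units per tick for each processor, and an idle processor steals one task from the tail of the longest queue.

```python
from collections import deque


def run_with_stealing(queues, speeds):
    """Simulate per-processor task queues with work stealing.

    queues: list of lists of task costs (hypothetical work units)
    speeds: work units each processor completes per tick
    Returns (ticks_elapsed, tasks_completed_per_processor).
    """
    n = len(queues)
    queues = [deque(q) for q in queues]
    remaining = [0.0] * n   # work left on the task each processor is running
    busy = [False] * n
    done = [0] * n
    ticks = 0
    while any(queues) or any(busy):
        for p in range(n):
            if busy[p]:
                continue
            if not queues[p]:
                # Idle with an empty queue: steal from the longest queue.
                victim = max(range(n), key=lambda v: len(queues[v]))
                if queues[victim]:
                    queues[p].append(queues[victim].pop())
            if queues[p]:
                remaining[p] = queues[p].popleft()
                busy[p] = True
        for p in range(n):
            if busy[p]:
                remaining[p] -= speeds[p]
                if remaining[p] <= 0:
                    busy[p] = False
                    done[p] += 1
        ticks += 1
    return ticks, done


# All work starts on a slow node; a node twice as fast starts empty.
ticks, done = run_with_stealing([[4, 4, 4, 4], []], speeds=[1, 2])
print(ticks, done)
```

Without stealing the slow node alone would need 16 ticks for this workload. The simulation ignores the cost of moving a stolen task, which is exactly the communication cost the question asks about.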
However, when I look at the various models of high performance and cloud cluster computing I don't see this discussed so much.
Michael Wong classifies parallelism, ignoring hadoop, into two main camps (starting around 14min in).
https://isocpp.org/blog/2016/01/the-landscape-of-parallelism-michael-wong-meetingcpp-2015
HPC and multi-threaded applications in industry
The HPC community seems to favour OpenMP on a cluster of identical nodes.
This may still be heterogeneous if each node supports CUDA or has FPGA support but each node tends to be identical.
If that's the case do they upgrade their data centres in a big bang or what?
(E.g. supercomputer 1 = 100 nodes of type x. supercomputer v2.0 is on a different site with
200 nodes of type y).
OpenMP only supports a single physical computer by itself.
The HPC community gets around this either by using MPI (which I consider too low level) or by creating a virtual machine from all the nodes
using a hypervisor like ScaleMP or vNUMA (see for example - OpenMP program on different hosts).
(anyone know of a good open source hypervisor for doing this?)
I believe these are still considered the most powerful computing systems in the world.
I find that surprising, as I don't see what prevents the map/reduce people from creating an even bigger cluster more easily,
one that is much less efficient overall but wins on brute force due to the total number of cores utilised.
So which other concurrency platforms support truly heterogeneous nodes with widely varying characteristics and how do they deal with the performance mismatch (and similarly the distribution of data)?
I'm excluding MPI as an option as while powerful it is too low-level. You might as well say use sockets. A framework building on MPI would be acceptable (does X10 work this way?).
From the user's perspective the map/reduce
approach seems to be: add enough nodes that it doesn't matter, and don't worry about using them at maximum efficiency.
Actually those details are kept under the hood in the implementation
of the schedulers and distributed file systems.
How/where is the cost of computation and message passing taken into account?
Is there any way in OpenMP (or your favourite concurrency platform)
to make effective use of the information that this node is N times as fast as that node and that the data transfer rate
to or from this node is on average X Mb/s?
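One simple way a scheduler could use exactly that information is a greedy "earliest finish time" heuristic: estimate, for each node, when a task would finish if its input were shipped there and it ran at that node's relative speed, then pick the minimum. The sketch below is not taken from any particular platform; the function name, the node table, and the units (MB and MB/s, work in abstract units) are assumptions for illustration.

```python
def assign_tasks(tasks, nodes):
    """Greedy earliest-finish-time assignment.

    tasks: list of (work_units, input_mb) tuples (hypothetical units)
    nodes: dict name -> (relative_speed, bandwidth_mb_per_s)
    Returns (plan, makespan) where plan lists (work, input_mb, node).
    """
    free_at = {name: 0.0 for name in nodes}
    plan = []
    # Place the longest tasks first (the usual LPT heuristic).
    for work, input_mb in sorted(tasks, reverse=True):

        def finish(name):
            speed, bandwidth = nodes[name]
            # Estimated completion: wait + transfer time + compute time.
            return free_at[name] + input_mb / bandwidth + work / speed

        best = min(nodes, key=finish)
        free_at[best] = finish(best)
        plan.append((work, input_mb, best))
    return plan, max(free_at.values())


# An old node and a new node that is three times as fast.
nodes = {"old": (1.0, 100.0), "new": (3.0, 100.0)}
plan, makespan = assign_tasks([(30, 0)] * 4, nodes)
print(plan, makespan)
```

Here three of the four equal tasks end up on the faster node. A real scheduler would also have to cope with estimates being wrong, which is one argument for combining a static plan like this with dynamic work stealing.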
In YARN you have Dominant Resource Fairness:
http://blog.cloudera.com/blog/2013/12/managing-multiple-resources-in-hadoop-2-with-yarn/
http://static.usenix.org/event/nsdi11/tech/full_papers/Ghodsi.pdf
This covers memory and cores using Linux Control Groups but it does not yet
cover disk and network I/O resources.
Are there equivalent or better approaches in other concurrency platforms? How do they compare to DRF?
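For readers unfamiliar with DRF, here is a minimal sketch of the allocation loop from the Ghodsi et al. paper linked above: repeatedly give one task's worth of resources to the user with the lowest dominant share. This is not YARN's implementation. One deliberate difference from the paper's pseudocode is labeled in the code: instead of stopping at the first task that does not fit, this version drops that user and keeps going. The resource names are arbitrary labels.

```python
def drf(capacity, demands):
    """Dominant Resource Fairness, simplified.

    capacity: dict resource -> total amount in the cluster
    demands:  dict user -> dict resource -> amount needed per task
    Returns dict user -> number of tasks allocated.
    """
    used = {r: 0 for r in capacity}
    tasks = {user: 0 for user in demands}

    def dominant_share(user):
        # Largest fraction of any single resource this user currently holds.
        return max(tasks[user] * demands[user][r] / capacity[r]
                   for r in capacity)

    active = set(demands)
    while active:
        user = min(sorted(active), key=dominant_share)
        need = demands[user]
        if all(used[r] + need[r] <= capacity[r] for r in capacity):
            for r in capacity:
                used[r] += need[r]
            tasks[user] += 1
        else:
            # Assumption of this sketch: skip users whose next task no
            # longer fits rather than terminating immediately.
            active.remove(user)
    return tasks


# The worked example from the DRF paper: 9 CPUs and 18 GB in total.
print(drf({"cpu": 9, "mem_gb": 18},
          {"A": {"cpu": 1, "mem_gb": 4}, "B": {"cpu": 3, "mem_gb": 1}}))
```

For the paper's example this yields three tasks for user A and two for user B, so both end up with a dominant share of 2/3.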
Which concurrency platforms handle this best and why?
Are there any popular ones that are likely to be evolutionary dead ends?
OpenMP keeps surprising me by actively thriving. Could something like Cilk++ be made to scale this way?
Apologies in advance for combining several PhD theses' worth of questions into one.
I'm basically looking for tips on what to look for in further reading
and advice on which platforms to investigate further (from the programmer's perspective).
A good summary of some platforms to investigate and/or links to papers or articles would suffice as a useful answer.

How to do load and performance testing of Hadoop cluster?

Are there any tools to generate an automated scenario with a predefined ramp-up of user requests (running the same map-reduce job) while monitoring some specific metrics of a Hadoop cluster under load? Ideally I am looking for something like LoadRunner, but free/open source.
The tool does not have to have a cool UI but rather an ability to record and save scenarios that include a ramp up and a rendezvous point for several users (wait until other users reach some point and do some action simultaneously).
The Hadoop distribution I am going to test is the latest MapR.
Searching the internet did not turn up any good free alternatives to HP LoadRunner. If you have experience with Hadoop (or MapR in particular) load testing, please share what tool you used.
Every solution you look at has both a tool quotient and a labor quotient in the total price. There are many open source tools which take the tool cost to zero, but the labor charge is so high that your total cost to deliver will be higher than purchasing a commercial tool with a lower labor charge. Also, many people look at performance testing tools as load generation alone, ignoring the automated collection of monitoring data and the analysis of the results, where you can tie an increase in response times to a correlated use of resources at the same time. This is a laborious process, and it takes even longer when you are using decoupled tools.
As you have mentioned LoadRunner, you should compare what is available in that tool to whatever tool you are provided. For instance,
there are Java, C, C++ & VB interfaces available in LoadRunner. You are going to need a way to exercise your map and reduce infrastructure. Compare the integrated monitoring capabilities (native/SNMP/terminal user with command line...) as well as analysis and reporting. Where capabilities do not exist you will either need to build the capability or acquire it elsewhere.
You have also brought up the concept of rendezvous. You will want to be careful with its application in any tool. Unless you have a very large population, the odds of simultaneous collision in the same area of code/action at the same time become quite small. Humans are chaotic instruments, arriving and departing independently from one another. On the other hand, if you are automating an agent which is based upon a clock tick then rendezvous makes a lot more sense. Taking a look at your job submission logs by IP address can provide an objective model for how many are submitted simultaneously (rendezvous) versus how many are running concurrently. I audit a lot of tests and rendezvous is the most abused item across tools, resulting in thousands of lost engineering hours chasing engineering ghosts that would never occur in natural use.
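If you do decide that a rendezvous is justified (for example, for clock-driven agents), the mechanism itself is small enough to build by hand. Below is a minimal sketch using Python's standard `threading.Barrier`. The function name `run_scenario` and the `submit_job` callable are hypothetical placeholders: in a real test `submit_job` would be whatever code submits your map-reduce job, which is not shown here.

```python
import threading
import time


def run_scenario(num_users, ramp_up_s, submit_job):
    """Ramp up simulated users, hold them at a rendezvous, then release.

    submit_job: hypothetical callable taking the user index.
    Returns (arrival_times, submission_times) for later analysis.
    """
    barrier = threading.Barrier(num_users)
    lock = threading.Lock()
    arrivals, submissions = [], []

    def user(index):
        time.sleep(index * ramp_up_s)      # staggered start (ramp-up)
        with lock:
            arrivals.append(time.monotonic())
        barrier.wait()                     # rendezvous point
        with lock:
            submissions.append(time.monotonic())
        submit_job(index)

    threads = [threading.Thread(target=user, args=(i,))
               for i in range(num_users)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return arrivals, submissions


arrivals, submissions = run_scenario(4, 0.01, lambda index: None)
print(len(arrivals), len(submissions))
```

No simulated user submits before the last one has arrived. This covers only load generation; collecting and correlating cluster metrics, which the answer stresses, still has to come from somewhere else.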

Will hadoop replace data warehousing?

I've heard reports that Hadoop is poised to replace data warehousing. So I was wondering if there were actual case studies done with success/failure rates or if some of the developers here had worked on a project where this was done, either totally or partially?
With the advent of "Big Data" there seems to be a lot of hype with it and I'm trying to figure out fact from fiction.
We have a huge database conversion in the works and I'm thinking this may be an alternative solution.
OK, so there are a lot of success stories out there with Big Data startups, especially in AdTech, though it's not so much that they "replace" the old expensive proprietary ways; they are just using Hadoop the first time round. This I guess is the benefit of being a startup - no legacy systems. Advertising, although somewhat boring from the outside, is very interesting from a technical and data science point of view. There is a huge amount of data and the challenge is to more efficiently segment users and bid for ad space. This usually means some machine learning is involved.
It's not just AdTech though, Hadoop is used in banks for fraud detection and various other transactional analysis.
So my two cents as to why this is happening I'll try to summarise with a comparison of my main experience, that is using HDFS with Spark and Scala, vs traditional approaches that use SAS, R & Teradata:
HDFS is a very very very effective way to store huge amounts of data in an easily accessible distributed way without the overhead of first structuring the data.
HDFS does not require custom hardware, it works on commodity hardware and is therefore cheaper per TB.
HDFS & the Hadoop ecosystem go hand in glove with dynamic and flexible cloud architectures. Google Cloud and Amazon AWS have such rich and cheap features that they completely eliminate the need for in-house DCs. There is no need to buy 20 powerful servers and 100s of TB of storage to then discover it's not enough, or it's too much, or it's only needed for 1 hour a day. Setting up a cluster with cloud services is getting easier and easier; there are even scripts out there that make it possible for those with only a small amount of sysadmin/devops experience.
Hadoop and Spark, particularly when used with a high-level statically typed language like Scala (but Java 8 is also OK-ish), mean data scientists can now do things they could never do with scripting languages like R, Python and SAS. First, they can wire up their modelling code with other production systems, all in one language, all in one virtual environment. Think about all the high velocity tools written in Scala: Kafka, Akka, Spray, Spark, SparkStreaming, GraphX etc., and in Java: HDFS, HBase, Cassandra - now all these tools are highly interoperable. What this means is that, for the first time in history, data analysts can reliably automate analytics and build stable products. They have the high-level functionality they need, but with the predictability and reliability of static typing, FP and unit testing. Try building a large complicated concurrent system in Python. Try writing unit tests in R or SAS. Try compiling your code, watching the tests pass, and concluding "hey it works! let's ship it" in a dynamically typed language.
These four points combined mean that A: storing data is now a lot cheaper, B: processing data is now a lot cheaper and C: human resource costs are much lower, as you no longer need several teams siloed off into analysts, modellers, engineers and developers; you can mash these skills together to make hybrids, ultimately needing to employ fewer people.
Things won't change overnight. Currently the labour market is majorly lacking two groups: good Big Data DevOps and Scala engineers/developers, and their rates clearly reflect that. Unfortunately the supply is quite low even though the demand is very high. Although I still conjecture Hadoop for warehousing is much cheaper, finding talent can be a big cost that is restricting the pace of transition.

Best method of having a single process distributed across a cluster

I'm very new to cluster computing, and wanted to know more about the various software used for cluster computing, and which is best for particular tasks. In particular, the problem I am trying to solve involves a Manager/Workers type scenario, where a single Manager is responsible for the creation of 100s to 1000s of jobs. Each job, while relatively large, must execute on a small frame-by-frame basis. I.e. the Manager will tell each job, "advance one frame and report back to me". The execution of a single frame will be very small, so latency between the Manager and the worker machines must be very small, on the order of microseconds.
Thank you! Any information would be appreciated, even stuff that doesn't perfectly fit the scenario I described, just to give me a starting point. Some that I have researched so far are Hadoop, HTCondor, and Akka.
Since communication latency is important to you, you should probably consider using MPI. It's not too difficult to write simple Master/Worker programs using MPI, and it will probably give you the best performance, especially if your cluster has high performance networking, such as InfiniBand.
If, as it seems, you're using Java, you will have to do some research to determine a good Java/MPI package. You'll find some suggestions here: Java openmpi.

What are some scenarios for which MPI is a better fit than MapReduce?

As far as I understand, MPI gives me much more control over how exactly different nodes in the cluster will communicate.
In MapReduce/Hadoop, each node does some computation, exchanges data with other nodes, and then collates its partition of results. Seems simple, but since you can iterate the process, even algorithms like K-means or PageRank fit the model quite well. On a distributed file system with locality of scheduling, the performance is apparently good. In comparison, MPI gives me explicit control over how nodes send messages to each other.
Can anyone describe a cluster programming scenario where the more general MPI model is an obvious advantage over the simpler MapReduce model?
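For reference, this is roughly what "K-means fits the iterated MapReduce model" means. The sketch below runs one iteration as a map step (assign each point to its nearest centroid), a shuffle (group by centroid index), and a reduce step (average each group). It is plain in-process Python rather than Hadoop code, and the function names are illustrative assumptions.

```python
from collections import defaultdict


def kmeans_map(point, centroids):
    """Map: emit (index of nearest centroid, (point, 1))."""
    nearest = min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )
    return nearest, (point, 1)


def kmeans_reduce(values):
    """Reduce: average all points assigned to one centroid."""
    count = sum(n for _, n in values)
    totals = [sum(coords) for coords in zip(*(p for p, _ in values))]
    return tuple(total / count for total in totals)


def kmeans_iteration(points, centroids):
    groups = defaultdict(list)
    for point in points:
        key, value = kmeans_map(point, centroids)
        groups[key].append(value)          # stands in for the shuffle phase
    # A centroid with no assigned points is left where it was.
    return [kmeans_reduce(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
print(kmeans_iteration(points, centroids))
```

Each call corresponds to one full job; iterating means launching the job again with the new centroids, which is where the per-job overhead discussed in the answers comes from.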
Almost any scientific code -- finite differences, finite elements, etc. Which kind of leads to the circular answer, that any distributed program which doesn't easily map to MapReduce would be better implemented with a more general MPI model. Not sure that's much help to you, I'll downvote this answer right after I post it.
Although this question has been answered, I would like to add/reiterate one very important point.
MPI is best suited for problems that require a lot of interprocess communication.
When Data becomes large (petabytes, anyone?), and there is little interprocess communication, MPI becomes a pain. This is so because the processes will spend all the time sending data to each other (bandwidth becomes a limiting factor) and your CPUs will remain idle. Perhaps an even bigger problem is reading all that data.
This is the fundamental reason behind having something like Hadoop. The Data also has to be distributed - Hadoop Distributed File System!
To say all this in short, MPI is good for task parallelism and Hadoop is good for Data Parallelism.
The best answer that I could come up with is that MPI is better than MapReduce in two cases:
For short tasks rather than batch processing. For example, MapReduce cannot be used to respond to individual queries - each job is expected to take minutes. I think that in MPI, you can build a query response system where machines send messages to each other to route the query and generate the answer.
For jobs where nodes need to communicate more than what iterated MapReduce jobs support, but not so much that the communication overheads make the computation impractical. I am not sure how often such cases occur in practice, though.
I expect that MPI beats MapReduce easily when the task is iterating over a data set whose size is comparable with the processor cache, and when communication with other tasks is frequently required. Lots of scientific domain-decomposition parallelization approaches fit this pattern. If MapReduce requires sequential processing and communication, or ending of processes, then the computational performance benefit from dealing with a cache-sized problem is lost.
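The domain-decomposition pattern mentioned here can be illustrated without an MPI installation. The sketch below splits a 1D array into chunks (one per imaginary rank) and performs a Jacobi averaging sweep in which each chunk needs only one boundary value from each neighbour, the so-called halo. In a real MPI program the two lines that read a neighbour's edge cell would be a send/receive pair; here they are plain list accesses, and all names are assumptions for illustration.

```python
def jacobi_step(u):
    """One Jacobi sweep; the first and last cells are held fixed."""
    interior = [(u[i - 1] + u[i + 1]) / 2 for i in range(1, len(u) - 1)]
    return [u[0]] + interior + [u[-1]]


def decomposed_step(chunks):
    """Same sweep on an array split into contiguous chunks (>= 2 cells total)."""
    new_chunks = []
    last = len(chunks) - 1
    for rank, chunk in enumerate(chunks):
        # Halo exchange: one cell from each neighbour (message passing in MPI).
        left = [chunks[rank - 1][-1]] if rank > 0 else []
        right = [chunks[rank + 1][0]] if rank < last else []
        updated = jacobi_step(left + chunk + right)
        # Drop the halo cells again; they belong to the neighbours.
        new_chunks.append(updated[len(left):len(updated) - len(right)])
    return new_chunks


u = [0.0, 0.0, 0.0, 0.0, 0.0, 8.0]
chunks = [u[:2], u[2:4], u[4:]]
for _ in range(3):
    u = jacobi_step(u)
    chunks = decomposed_step(chunks)
print(u)
print([x for chunk in chunks for x in chunk])
```

Both print statements show the same values, since the decomposed sweep reproduces the single-array sweep. The point for the MPI versus MapReduce comparison is that every sweep needs a tiny exchange with neighbours while the chunk itself stays in place (ideally in cache); a framework that re-reads or reshuffles the data each iteration loses that benefit.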
When the computation and data that you are using have irregular behaviors that mostly translate into many messages passed between objects, or when you need low-level hardware access (e.g. RDMA), then MPI is better. Some answers here mention the latency of tasks or the memory consistency model, but frameworks like Spark or actor models like Akka have shown that they can compete with MPI. Finally, one should consider that MPI has the benefit of having been for years the main base for developing the libraries needed for scientific computation (these are the most important parts missing from new frameworks using DAG/MapReduce models).
All in all, I think the benefits that MapReduce/DAG models bring to the table, like dynamic resource managers and fault-tolerant computation, will make them feasible for scientific computing groups.

Resources