Software needed to build a cluster - cluster-computing

I've been thinking about getting a little bit greener with my computers and using some lower power, mini-itx boards in my next computer. Some can generate under 10 watts and are pretty inexpensive.
So I thought, if one is such low cost and low power, why not try to make a cluster out of them? However, I'm not really sure what I would need to do in terms of Operating System or management software to make this happen?
Can anyone provide advice on existing software to do this or any ideas as to how to design my own?

What do you want to actually do with your cluster kind of decides what software you will need.
Do you need job scheduling?
Monitoring tools?
Do you need to deploy software across all nodes at once seamlessly?
One file system across all nodes (recommended).
You could just as easily install a linux or *BSD on the boards and just use ssh to manage and run jobs across all the nodes. No other software really required.
Software you might find useful:
PBS (mostly job scheduling, google)
Kerrighed (Single System Image based, Linux distro)
Rocks (cluster based distribution)
Mosix ( cluster Management, openMosix also )
Ganglia (Monitoring, probably over kill for you)
Lustre (Super fast, opensource cluster filesytem from Sun)

Take a look at beowulf to get started.
That being said, the best advice I can give is to carefully measure whether you are actually being more green with your cluster. I've been a little way down this road before, and in my experience, the losses involved in having many separate computers end up wiping out any energy savings. Keep in mind that every computer needs a power supply, which converts your household voltage down to a level that the computer wants. The conversion is inefficient, and wastes heat (this is why the power supplies have fans). The same can be said for each hard drive, RAM bank and motherboard that you need.
This isn't meant to discourage you from the project. Just be sure to profile. Exactly like writing software! :)

You can use Beowulf to run a cluster.

There's a lot to this question.
First, if you just want to get a cluster up and running, there are many suggestions listed already here. Once you have the cluster up and running, though, you're just starting.
At that point, you need to have software that will work correctly across the cluster. If you are working on your own software, you'll need to design it to be parallelized across a cluster, using something like MPI.
Without software written to run across the cluster, though, the cluster is nothing but a highly customized box that doens't do anything special...

Related

Hadoop virtual cluster vs single machine

I have a question regarding speed & performance of
using multiple virtualized nodes in a single machine VS single node on the single machine itself.
which one will perform better?
The reason why I ask this question is because I am currently learning hadoop on a single machine, and I see some tutorials on the internet that shows the use of multiple virtualized nodes in a single machine.
Thank you in advance
There is always some overhead that comes with virtualization, so unless really necessary I wouldn't advise to run Hadoop in a virtualized environment.
That being said, I know VMWare did a lot of work on making Hadoop work in a virtualized environment, and they have published some benchmarks in which they claim under certain conditions to have better performance with VMs that a native application. I haven't played much with vSphere, but this could be something to look at if you want to explore virtualization further. But don't take the numbers for granted, it really depends on the type of hardware you're running, so in some conditions I think you might gain some performance with VMs, but I'm guessing from experience that in most cases you won't gain anything.
If you're just getting started and testing with Hadoop, I think virtualizing is overkill. You can very easily run Hadoop in pseudo-distributed mode, which means that you can run multiple Hadoop daemons on the same box, each as a separate process. That's what I used to get started with Hadoop, and it's a good head start. You can find more info here (or might need another page depending on which Hadoop version you're running).
If you get to the point where you want to test with a real cluster, but don't have the resources, I would advise looking at Amazon Elastic Map/Reduce: it gives you a cluster on demand and it's pretty cheap. That way you can do more advanced tests. More info here.
the bottom line is, I think if the purpose is simply testing, you don't really need a virtual cluster.
A performance analysis case study conducted on this topic showed that a virtual Hadoop cluster is only around 4% less efficient compared to its native counterpart: Virtualized hadoop performance case study

What is the best software stack for a small scientific computing cluster

I work in a research group doing a lot of Machine Learning and Computational Biology.
We currently have a cluster, but it is poorly maintained, suffers from low I/O throughput, and most critically doesn't have any setup for scheduling or load-balancing. Therefore, to use it, you have to find a free node yourself, ssh into that node, run your script on the command line, and manually collect your results.
What is the best software stack to implement an easy to use scheduler and load-balancer, such that users can submit their job to a central queue, have it run automatically when resources are available, and easily get their results back?
There's a number of scheduler/resource manager options that are open source and well thought of:
Torque/Maui, descendants of the venerable PBS, now maintained by adaptive computing
Slurm, a newer project out of LLNL, which has the advantage that it scales very well
Open Grid Engine, née Sun Grid Engine
But there's also a number of entire software stacks that aim to make managing a cluster easier:
Warewulf, out of LBL
Rocks
I'm making this a community wiki for others who have suggestions.

Condor, Sun Grid Engine, or something else?

I'm trying to work out whether we should try out Condor or Sun Grid Engine at work (or possibly something else).
We often have lots of unused WinXp workstations. The hope is that we could use wake-on-LAN, run all our jobs, and then shut down automatically. We'd mainly be running Matlab, Java or Python simulations for either monte-carlo or parameter explorations.
With my limited knowledge of Condor, it sounds like using a the vm universe might be a convenient way of taking care of snapshots without having to modify existing code.
Is SGE or something else better than condor for this kind of work?
SGE doesn't really support windows. It comes with all kinds of caveats and missing bits on Windows.
I've been running Condor pools for many years now and it is a superb HTPC setup for both cycle-stealing and dedicated, always-on hardware, on Linux and Windows machines. The recent addition of their Rooster daemon lets you put machines to sleep between job cycles and wake them up when new work appears in the pool. They also have an active and very helpful support community. Checkpointing is the only Condor feature not available on Windows. Everything else is there. With the addition of the VM Universe, checkpointing is getting less and less useful. Really: to use checkpointing successfully you need to be able to relink your entire code stack. So if you're running Matlab jobs, even on Linux, checkpointing isn't going to be possible.
If you have specific questions about getting Condor running on Windows I'd be happy to answer them, share my experiences with it. I run Condor across 4 pools around the globe with a total of about 1500 dedicated machines in all the pools and some 1000 or so additional desktop machines that are available as users care to donate them.
I'd start with Condor. It has good support for Windows, and newer versions have built-in support for sending wake-on-lan in a very configurable way when jobs can run on certain machines. It can also shut the machines down based on user-defined policies.
After Oracle's takeover of SGE (Sun Grid Engine), there is the Open Grid Scheduler project that still offers open-source Grid Engine.
http://gridscheduler.sourceforge.net/
For dedicated hardware I'd go with Grid Engine.
For scavenging clock cycles on machines which may be in use I'd go with Condor.
For hardware which you have dedicated access to for fixed periods, such as overnight and at weekends, I'd probably still go with Condor but might be able to persuade myself to use Grid Engine.
I've had to choose between condor and SGE for a customer project recently. I was favoring SGE (because I was more familiar with that environment), but Condor won finally because:
the customer infrastructure is Windows oriented, and the SGE solution requires a Unix or Linux machine for the Central Manager, + installing MS Services for Unix on the computation hosts
support and installation process of Condor on Windows was much simpler.
However, you cannot use the most interesting features of Condor on Windows : checkpointing is not available, nor the Condor specific IOs. I'm not using the VM universe, so I cannot comment on that aspect.
I've only tried Condor, and it was a pain to attempt to set up. If you need all the clock cycles you can fully utiilize, go with Condor.
I'm about to try SGE, and I'll tell you how it goes. However at my company, people have had experience setting up SGE, so I'll probably say SGE is easier.
SGE doesn't exist... it's OGE, and it's very expensive. Go with Condor.

Can computer clusters be used for general everyday applications?

Does anyone know how a computer cluster can be used for everyday applications, like for example video games?
I would like to build a computer cluster that can run applications over the cluster that were not specifically designed for computer clusters and still see the performance increase. One use would be for video games, but I would also like to utilize the increased computing power for running a large network of virtualized machines.
It won't help, especially in the case of video games. You have to build around the cluster; the cluster does not work around you.
At any rate, video games require sub-50ms response time on input and response,and network propagation would just destroy any performance gains you might see. Video processing, on the other hand, benefits GREATLY from the cluster as the task is inherently geared toward parallelization. It does not require user input, and output is only measured in terms of the batch process.
If you have a program written for a single core, running it on a four-core processor won't help you (except that one core can be devoted to that program). For example, I have Visual Studio compiling on multiple cores on this machine, but linking is done on one core (and is annoyingly slow). In order to get use out of multiple cores, I have to either run something that can use multiple cores or run several separate programs.
Clusters are like that, only more so. All communication between the machines is explicit and must be programmed in. There are things you can do with a cluster (see Google's map-reduce algorithm), but they do require special programming and work.
Typical clusters are used either to specialize machines (one might be a database server and one a web server, for example), or to run large numbers of programs simultaneously.
You will not be able to easily run a video game on a cluster, unless it was already designed to work on multiple machines. I have not heard of such a game. You may have some luck creating a virtual server farm, but I doubt it will be easy to get it working perfectly. If you are interested in this, one example would be amazon's EC2 service. They offer virtual machines for "rent" by the hour. Behind the scenes, I assume they have a giant cluster that is supplying all of these virtual machines.
Unfortunately, unless you have some pretty clever operating system / software design in mind - simply connecting programs together via a cluster and hoping to get increased performance is not likely to work - especially not for video games. In order to get increased performance from running things in a cluster you have to program for it otherwise there is a good change you'd see a decrease in performance rather than an increase.

How can i connect two or more machines via tcp cable to form a network grid?

How can i connect two or more machines to form a network grid and how can i distribute work load to the two machines?
What operating systems do i need to run on the machines, and what application should i use to manage the load balancing?
NB: I read somewhere that google uses cheap machines to perform this fete, how do they connect two network cards( 'Teaming' ) and distribute load across the machines?
Good practical examples would serve me good, with actual code samples.
Pointers to some good site i might read this stuff will be highly appreciated.
An excellent place to start is with the Beowulf project. Basically an opensource cluster built on the Linux OS.
There are several software solutions in this expanding market. The term "cloud computing" is certainly gaining traction to describe what you want to do. Are you wanting a service, or do you want to run it in house?
I'm most familiar with Appistry EAF - Runs on commodity based hardware. Its available as a free download. Runs on windows or linux.
Another is GoGrid - I believe this is only available as a service, but I'm not as familiar with it.
There are many different approaches to parallel processing, and many types of system architectures you could use.
For commodity systems, there are clusters and grids, or you can even form a single system image from several pieces of commodity hardware. There is of course also load balancing, high availability, failover, etc.
It's pretty much impossible to answer this question without more detail. What exactly do you want to with these systems? The answer is very highly dependent on the application.
You might want to have a look at some of the stuff to do with Folding At Home and the SETI project, and some of their participants blogs, here is a pretty amazing cluster that a guy built in an IKEA cabinet:
http://helmer.sfe.se/
Might give you some ideas.
The question is too abstract.
One of the (imaginable) ways is to use MPI - a framework for parallel programming, the Wikipedia page includes examples in C++, and there are bindings for other languages.

Resources