Can computer clusters be used for general everyday applications? - cluster-computing

Does anyone know how a computer cluster can be used for everyday applications, such as video games?
I would like to build a computer cluster that can run applications that were not specifically designed for clusters and still see a performance increase. One use would be for video games, but I would also like to use the increased computing power to run a large network of virtualized machines.

It won't help, especially in the case of video games. You have to build around the cluster; the cluster does not work around you.
At any rate, video games require sub-50 ms response times to input, and network propagation delays would destroy any performance gains you might see. Video processing, on the other hand, benefits GREATLY from a cluster because the task is inherently geared toward parallelization: it does not require user input, and output is only measured in terms of the batch process.

If you have a program written for a single core, running it on a four-core processor won't help you (except that one core can be devoted to that program). For example, I have Visual Studio compiling on multiple cores on this machine, but linking is done on one core (and is annoyingly slow). In order to get use out of multiple cores, I have to either run something that can use multiple cores or run several separate programs.
Clusters are like that, only more so. All communication between the machines is explicit and must be programmed in. There are things you can do with a cluster (see Google's map-reduce algorithm), but they do require special programming and work.
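To make that concrete, here is a minimal single-machine sketch in Python (the workload and the numbers are made up for illustration): the parallel version only wins because the work is split up explicitly, which is exactly what an off-the-shelf game never does. On a cluster you would additionally have to ship the work and the results over the network by hand.

```python
# A toy illustration (not cluster code): even on one multi-core box,
# the work has to be divided explicitly before extra cores help.
from multiprocessing import Pool

def crunch(n):
    # stand-in for a CPU-bound task
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    jobs = [2_000_000] * 8

    # Serial: one core does everything.
    serial = [crunch(n) for n in jobs]

    # Parallel: the split across workers is explicit; nothing happens "for free".
    with Pool(processes=4) as pool:
        parallel = pool.map(crunch, jobs)

    assert serial == parallel
```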
Typical clusters are used either to specialize machines (one might be a database server and one a web server, for example), or to run large numbers of programs simultaneously.

You will not be able to easily run a video game on a cluster unless it was already designed to work on multiple machines, and I have not heard of such a game. You may have some luck creating a virtual server farm, but I doubt it will be easy to get it working perfectly. If you are interested in this, one example is Amazon's EC2 service: they offer virtual machines for "rent" by the hour, and behind the scenes I assume they have a giant cluster supplying all of these virtual machines.

Unfortunately, unless you have some pretty clever operating system or software design in mind, simply connecting machines together into a cluster and hoping for increased performance is not likely to work, especially not for video games. To get increased performance from running things on a cluster you have to program for it; otherwise there is a good chance you'd see a decrease in performance rather than an increase.

Related

Hadoop virtual cluster vs single machine

I have a question regarding the speed and performance of using multiple virtualized nodes on a single machine versus a single node on that same machine.
Which one will perform better?
The reason I ask is that I am currently learning Hadoop on a single machine, and I see some tutorials on the internet that show the use of multiple virtualized nodes on a single machine.
Thank you in advance
There is always some overhead that comes with virtualization, so unless it is really necessary I wouldn't advise running Hadoop in a virtualized environment.
That being said, I know VMware did a lot of work on making Hadoop run in a virtualized environment, and they have published benchmarks in which they claim that, under certain conditions, VMs outperform a native installation. I haven't played much with vSphere, but this could be something to look at if you want to explore virtualization further. Don't take the numbers for granted, though; it really depends on the type of hardware you're running, so under some conditions you might gain some performance with VMs, but I'm guessing from experience that in most cases you won't gain anything.
If you're just getting started and testing with Hadoop, I think virtualizing is overkill. You can very easily run Hadoop in pseudo-distributed mode, which means that you can run multiple Hadoop daemons on the same box, each as a separate process. That's what I used to get started with Hadoop, and it's a good head start. You can find more info here (or might need another page depending on which Hadoop version you're running).
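For a sense of what that "special programming" looks like at its simplest, here is a word-count mapper and reducer of the kind you could run against a pseudo-distributed install with Hadoop Streaming. This is a rough sketch; file names are placeholders and the streaming jar location varies between Hadoop versions.

```python
#!/usr/bin/env python
# mapper.py - emits "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python
# reducer.py - sums the counts per word; Hadoop delivers keys grouped and sorted
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")
```

You would then submit the job with something along the lines of `hadoop jar .../hadoop-streaming-*.jar -input in -output out -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py`; the jar path and exact flags depend on your version.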
If you get to the point where you want to test with a real cluster, but don't have the resources, I would advise looking at Amazon Elastic Map/Reduce: it gives you a cluster on demand and it's pretty cheap. That way you can do more advanced tests. More info here.
The bottom line is that if the purpose is simply testing, you don't really need a virtual cluster.
A performance analysis case study conducted on this topic showed that a virtual Hadoop cluster is only around 4% less efficient than its native counterpart: Virtualized Hadoop performance case study

VMware: performance of an application on one or more virtual servers

I am evaluating a system architecture where the application is split on two virtual servers on the same hardware.
The reasoning is that the overall system will perform better if a certain part of it runs on its own virtual server. Is this correct?
I would think that if the processes run on the same hardware, splitting them across two servers adds communication overhead compared with installing everything on one virtual server.
To make sure I understand: it sounds like you're asking, more or less, how two (or more) virtual machines might "compete" with one another, and how separating the processes that run on them might affect overall performance.
Quick Answer: The good news is that you can control how VMs "fight" over resources very tightly if you wish. This keeps VMs from competing with each other over things like RAM and CPU, and it can improve the overall performance of your application. I think you're interested in two main things: VM reservations/limits/shares and resource pools. Links are included below.
In-depth Answer: In general, it's a great idea to separate the functionality of your application. A web server and a DB server running on different machines is a perfect example of this. I don't know about your application in particular, but if it's not leveraging multi-threading (to enable the use of multiple processors) already, separating your application onto two servers might really help performance. Here's why:
VMs understand, in a sense, what hardware they're running on: they know the CPU, RAM and disk space available to them. Say your ESX server has 4 CPUs and 16 GB of RAM. When you create your VMs, you're free to give 2 CPUs and 8 GB of RAM to each, or you can alter the settings to meet your needs. In VMware you can guarantee a certain resource level using limits, shares, and reservations; documentation on them can be found at http://pubs.vmware.com/vsphere-4-esx-vcenter/index.jsp?topic=/com.vmware.vsphere.vmadmin.doc_41/vsp_vm_guide/configuring_virtual_machines/t_allocate_cpu_resources.html, among other places. These reservations help you guarantee that a certain VM always has access to a certain level of resources and keep VMs from causing contention over RAM, CPU, etc. VMware also offers an extension of this idea called "resource pools", which are pools of RAM, CPU, etc. that can be set aside for certain machines. You can read about them here: http://pubs.vmware.com/vsphere-4-esx-vcenter/index.jsp?topic=/com.vmware.vsphere.resourcemanagement.doc_41/managing_resource_pools/c_why_use_resource_pools.html.
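If you prefer scripting those settings over clicking through the vSphere client, they can also be applied programmatically. The sketch below uses pyVmomi (VMware's Python SDK) and assumes you have already connected and obtained a handle to the VM object; treat it as a rough outline rather than a tested recipe, and check the units (CPU reservations in MHz, memory in MB) against the documentation for your version.

```python
# Rough sketch with pyVmomi: reserve CPU/RAM for a VM so neighbours can't starve it.
# Assumes `vm` is the vim.VirtualMachine object you want to reconfigure.
from pyVmomi import vim

spec = vim.vm.ConfigSpec()

# Guarantee 2000 MHz of CPU, cap at 4000 MHz, with normal share priority.
spec.cpuAllocation = vim.ResourceAllocationInfo(
    reservation=2000,                      # MHz guaranteed
    limit=4000,                            # MHz ceiling (-1 would mean unlimited)
    shares=vim.SharesInfo(level="normal"),
)

# Guarantee 8 GB of RAM (reservation is in MB).
spec.memoryAllocation = vim.ResourceAllocationInfo(reservation=8192)

task = vm.ReconfigVM_Task(spec=spec)  # applied asynchronously; poll task.info for status
```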

Using Cloud/Distributed computing to share processor time - possibilities and methods

My question is one I have pondered while working on a demanding network application that explicitly shared a task across the network, using a server to assign the job to each computer individually and "share the load".
I wondered: could this be done in a more implicit manner?
Question
Is there a possibility of distributing processor intensive tasks around a voluntary and public network of computers to make the job run more efficiently without requiring the job's program or process to be installed on each computer?
Scenario
Let's say we have a ridiculously intensive mathematics scenario where I am trying to get my computer to calculate the prime factorization of every number from 1 to 10,000,000 and store the results in a database (assuming I have the space and that the algorithms are already implemented in their own class, program, dynamic link library or any runnable process).
Now, it would be more efficient to share this burdensome process across a network or run it on a multi-core supercomputer; however, these are both expensive. To my knowledge, you would need a specifically designed program to run the specific algorithm, have that program installed across the cloud/distributed computing network, and have a server keep track of what each computer is doing (i.e. which numbers it is currently factoring).
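For reference, the "specifically designed program" part is short when the workers are your own cores; the hard part the question is really about is doing the same hand-out-a-range pattern across untrusted machines on the internet. A rough single-machine sketch in Python (the limit and chunk size are arbitrary and scaled down):

```python
# Explicit work-splitting for the prime-factorization scenario:
# a coordinator hands out number ranges, workers factor them independently.
from multiprocessing import Pool

def factorize(n):
    """Return the prime factorization of n as a list of primes."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def factorize_range(bounds):
    lo, hi = bounds
    return {n: factorize(n) for n in range(lo, hi)}

if __name__ == "__main__":
    N, chunk = 1_000_000, 10_000   # scaled down from 10,000,000 for illustration
    ranges = [(lo, min(lo + chunk, N + 1)) for lo in range(2, N + 1, chunk)]
    with Pool() as pool:
        for partial in pool.imap_unordered(factorize_range, ranges):
            pass  # in the real scenario you would insert `partial` into the database
```

The coordinator/worker split here is exactly the role a central server would play in the distributed version: handing out ranges and collecting results.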
Conclusion
Overall:
Would it be possible to create a cloud program / OS / suite where you could share processor time for an unspecified type of process?
If so, how would you implement it, and where would you start?
Would you make an OS dedicated to running unspecified, non-explicit tasks, or could it be done with a cloud-enabled program installed on the computers of volunteers willing to share a percentage of their processor time to help the general community?
If this were implementable, would you be a voluntary part of the greater cloud?
I would love to hear everyone's thoughts and possible solutions as this would be a wonderful project to start.
I have been dealing with the same challenge for the last few months.
My conclusions thus far:
The main problem with using a public network (internet) for cloud computing is in addressing the computers through NATs and firewalls. Solving this is non-trivial.
Asking all participants to open ports in their firewalls and configure their router for port-forwarding is generally too much to ask for 95% of users and can pose severe security threats.
A solution is to use a registration server where all peers register themselves and can get in contact with others. Connections are kept open by the server and communication is routed through it. There are several variations; however, in all scenarios this requires vast resources to keep everything scalable and is therefore out of reach for anyone but large corporations.
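As a toy illustration of the registration-server idea (no relaying, no authentication, single process; the protocol and port are made up):

```python
# Toy rendezvous server: peers send "REGISTER name port" or "LIST".
# A real server would also relay traffic, authenticate peers and scale out.
import socketserver

peers = {}  # name -> (ip, port)

class RegistryHandler(socketserver.StreamRequestHandler):
    def handle(self):
        parts = self.rfile.readline().decode().split()
        if parts and parts[0] == "REGISTER" and len(parts) == 3:
            peers[parts[1]] = (self.client_address[0], int(parts[2]))
            self.wfile.write(b"OK\n")
        elif parts and parts[0] == "LIST":
            for name, (ip, port) in peers.items():
                self.wfile.write(f"{name} {ip} {port}\n".encode())

if __name__ == "__main__":
    with socketserver.TCPServer(("0.0.0.0", 9000), RegistryHandler) as srv:
        srv.serve_forever()
```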
More practical solutions are offered by commercial cloud platforms like Azure (for .Net) or just the .Net ServiceBus. Since everything runs in the cloud, there will be no problems with reaching computers behind NATs and firewalls, and even if you need to do so to reach "on-premise" computers or those of clients, this can be done through the ServiceBus.
Would you trust someone else's code to run on your computer?
It's more practical not to ask their permission: ;)
I once wrote a programming-competition solver in Haxe, embedded in a Flash banner on a friend's fairly popular website...
I would not expect a program that can "share processor time for an unspecified type of process". A prerequisite of sharing processor time is that the task can be divided into multiple subtasks.
To allow automatic sharing of processor time for any divisible task, the key to a solution would be an AI program clever enough to know how to divide a task into subtasks, which does not seem realistic.

Software needed to build a cluster

I've been thinking about getting a little greener with my computers and using some lower-power mini-ITX boards in my next computer. Some draw under 10 watts and are pretty inexpensive.
So I thought: if one is so low-cost and low-power, why not try to make a cluster out of them? However, I'm not really sure what I would need to do in terms of an operating system or management software to make this happen.
Can anyone provide advice on existing software to do this or any ideas as to how to design my own?
What you actually want to do with your cluster largely decides what software you will need.
Do you need job scheduling?
Monitoring tools?
Do you need to deploy software across all nodes at once seamlessly?
Do you need a single file system across all nodes (recommended)?
You could just as easily install Linux or a *BSD on the boards and use ssh to manage and run jobs across all the nodes. No other software is really required.
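As a sketch of that plain-ssh approach, assuming key-based ssh login to each board is already set up (the hostnames and the command are placeholders):

```python
# Fan a command out to every node over ssh and collect the results.
import subprocess
from concurrent.futures import ThreadPoolExecutor

NODES = ["node01", "node02", "node03"]   # placeholder hostnames

def run_on(host, command):
    result = subprocess.run(
        ["ssh", host, command],
        capture_output=True, text=True, timeout=300,
    )
    return host, result.returncode, result.stdout

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        for host, rc, out in pool.map(lambda h: run_on(h, "uname -a"), NODES):
            print(f"{host} [{rc}]: {out.strip()}")
```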
Software you might find useful:
PBS (mostly job scheduling; google it)
Kerrighed (single-system-image based Linux distro)
Rocks (cluster-based distribution)
Mosix (cluster management; also openMosix)
Ganglia (monitoring, probably overkill for you)
Lustre (super fast, open-source cluster filesystem from Sun)
Take a look at Beowulf to get started.
That being said, the best advice I can give is to carefully measure whether you are actually being more green with your cluster. I've been a little way down this road before, and in my experience, the losses involved in having many separate computers end up wiping out any energy savings. Keep in mind that every computer needs a power supply, which converts your household voltage down to a level that the computer wants. The conversion is inefficient, and wastes heat (this is why the power supplies have fans). The same can be said for each hard drive, RAM bank and motherboard that you need.
This isn't meant to discourage you from the project. Just be sure to profile. Exactly like writing software! :)
You can use Beowulf to run a cluster.
There's a lot to this question.
First, if you just want to get a cluster up and running, there are many suggestions listed already here. Once you have the cluster up and running, though, you're just starting.
At that point, you need to have software that will work correctly across the cluster. If you are working on your own software, you'll need to design it to be parallelized across a cluster, using something like MPI.
Without software written to run across the cluster, though, the cluster is nothing but a highly customized box that doesn't do anything special...

How can I connect two or more machines via TCP to form a network grid?

How can I connect two or more machines to form a network grid, and how can I distribute workload across the two machines?
What operating systems do I need to run on the machines, and what application should I use to manage the load balancing?
NB: I read somewhere that Google uses cheap machines to perform this feat. How do they connect two network cards ('teaming') and distribute load across the machines?
Good practical examples would serve me well, with actual code samples.
Pointers to a good site where I might read about this would be highly appreciated.
An excellent place to start is the Beowulf project: basically an open-source cluster built on the Linux OS.
There are several software solutions in this expanding market, and the term "cloud computing" is certainly gaining traction to describe what you want to do. Do you want a service, or do you want to run it in-house?
I'm most familiar with Appistry EAF. It runs on commodity hardware, is available as a free download, and runs on Windows or Linux.
Another is GoGrid - I believe this is only available as a service, but I'm not as familiar with it.
There are many different approaches to parallel processing, and many types of system architectures you could use.
For commodity systems, there are clusters and grids, or you can even form a single system image from several pieces of commodity hardware. There is of course also load balancing, high availability, failover, etc.
It's pretty much impossible to answer this question without more detail. What exactly do you want to do with these systems? The answer depends very heavily on the application.
You might want to have a look at some of the material on Folding@home and the SETI project, and some of their participants' blogs. Here is a pretty amazing cluster that a guy built in an IKEA cabinet:
http://helmer.sfe.se/
Might give you some ideas.
The question is too abstract.
One possible approach is to use MPI, a framework for parallel programming; the Wikipedia page includes examples in C++, and there are bindings for other languages.
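For instance, with the Python bindings (mpi4py), a minimal program that splits a sum across the processes of a cluster looks roughly like this (launched with something like `mpiexec -n 4 python sum.py`; the file name is arbitrary):

```python
# Minimal mpi4py sketch: each rank sums a slice of the range, rank 0 combines them.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N = 10_000_000
local = sum(range(rank, N, size))          # every rank takes a strided slice
total = comm.reduce(local, op=MPI.SUM, root=0)

if rank == 0:
    print(f"sum of 0..{N - 1} = {total}")
```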
