My question is one I have pondered while working on a demanding network application that explicitly shared a task across the network, using a server to assign the job to each computer individually and "share the load".
I wondered: could this be done in a more implicit manner?
Question
Is there a possibility of distributing processor-intensive tasks across a voluntary, public network of computers so that the job runs more efficiently, without requiring the job's program or process to be installed on each computer?
Scenario
Let's say we have a ridiculously intensive mathematics scenario where I am trying to get my computer to calculate the prime factorization of every number from 1 to 10,000,000 and store the results in a database (assuming I have the space and that the algorithms are already implemented in their own class, program, dynamic link library or other runnable process).
Now, it would be more efficient to share this burdensome process across a network or run it on a multi-core supercomputer, but both are expensive. To my knowledge, you would need a specifically designed program to run the specific algorithm, have that program installed across the cloud/distributed-computing network, and have a server keep track of what each computer is doing (i.e. which number it is currently factorizing).
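To make the scenario concrete, here is a minimal sketch (my illustration, not an existing system) of the server-assigns-work pattern described above, using Python's multiprocessing.managers so workers on other machines can pull number ranges over TCP. The port, authkey, and chunk size are arbitrary placeholders.

```python
# Sketch of a coordinator that hands out number ranges, and a worker
# that factorizes them. A real volunteer network would also need trust,
# NAT handling, retries, and result storage, none of which is here.
from multiprocessing.managers import BaseManager
from queue import Queue

class JobManager(BaseManager):
    pass

def factorize(n):
    """Trial-division prime factorization of n."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def run_coordinator():
    tasks, results = Queue(), Queue()
    JobManager.register("tasks", callable=lambda: tasks)
    JobManager.register("results", callable=lambda: results)
    # Split 1..10,000,000 into ranges that workers claim one at a time.
    for start in range(1, 10_000_001, 100_000):
        tasks.put((start, min(start + 100_000, 10_000_001)))
    manager = JobManager(address=("", 50000), authkey=b"example")
    manager.get_server().serve_forever()

def run_worker(host):
    # Runs on each volunteer machine; pulls ranges until none remain.
    JobManager.register("tasks")
    JobManager.register("results")
    manager = JobManager(address=(host, 50000), authkey=b"example")
    manager.connect()
    tasks, results = manager.tasks(), manager.results()
    while not tasks.empty():
        start, stop = tasks.get()
        results.put({n: factorize(n) for n in range(start, stop)})
```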
Conclusion
Overall:
Would it be possible to create a cloud program / OS / suite where you could share processor time for an unspecified type of process?
If so, how would you implement it, and where would you start?
Would you make an OS dedicated to running unspecified, non-explicit tasks, or would it be possible to do with a cloud-enabled program installed on the computers of volunteers willing to share a percentage of their processor clock to help the general community?
If this were implementable, would you be a voluntary part of the greater cloud?
I would love to hear everyone's thoughts and possible solutions as this would be a wonderful project to start.
I have been dealing with the same challenge for the last few months.
My conclusions thus far:
The main problem with using a public network (the internet) for cloud computing is addressing the computers through NATs and firewalls. Solving this is non-trivial.
Asking all participants to open ports in their firewalls and configure their routers for port-forwarding is generally too much to ask of 95% of users and can pose severe security threats.
A solution is to use a registration server where all peers register themselves and can get in contact with others. Connections are kept open by the server, and communication is routed through it. There are several variations; however, in all scenarios this requires vast resources to keep everything scalable, and it is therefore out of reach for anyone but large corporations.
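For illustration, a deliberately minimal sketch of such a registration server (my own assumption of a design, not a production system) could look like this; a real system would also need relaying, NAT traversal, authentication, and peer expiry:

```python
# Minimal registration server sketch: peers POST their address and GET
# the list of known peers. Everything here is illustrative.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

peers = set()

class Registry(BaseHTTPRequestHandler):
    def do_POST(self):
        # Record the address a peer claims to be reachable at.
        length = int(self.headers.get("Content-Length", 0))
        addr = self.rfile.read(length).decode()
        peers.add(addr)
        self.send_response(204)
        self.end_headers()

    def do_GET(self):
        # Return the current peer list as JSON.
        body = json.dumps(sorted(peers)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8000), Registry).serve_forever()
```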
More practical solutions are offered by commercial cloud platforms like Azure (for .NET) or just the .NET Service Bus. Since everything runs in the cloud, there are no problems reaching computers behind NATs and firewalls, and even if you need to reach "on-premise" computers or those of clients, this can be done through the Service Bus.
Would you trust someone else's code to run on your computer?
It's more practical not to ask their permission ;)
I once wrote a programming-competition solver in Haxe, running in a Flash banner on a friend's fairly popular website...
I would not expect a program that can "share processor time for an unspecified type of process". A prerequisite of sharing processor time is that the task can be divided into multiple subtasks.
To allow automatic sharing of processor time for any divisible task, the key to the solution would be an AI program clever enough to know how to divide a task into subtasks, which does not seem realistic.
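To illustrate the point: once the programmer has divided the work by hand, sharing processor time is mechanical. Here is a minimal sketch (local cores standing in for remote machines) where the division into number ranges is hand-written, not inferred by any tool:

```python
# The division into subtasks (number ranges) is explicit and manual;
# no program figured it out for us. That is the hard part.
from multiprocessing import Pool

def count_primes(bounds):
    """Count primes in [lo, hi) by trial division."""
    lo, hi = bounds
    count = 0
    for n in range(max(lo, 2), hi):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    chunks = [(i, i + 250_000) for i in range(1, 1_000_001, 250_000)]
    with Pool() as pool:
        print(sum(pool.map(count_primes, chunks)))
```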
Related
I understand why cold starts happen (bytecode needs to be turned into machine code through JIT compilation). However, with all the generated metadata available for binaries these days, I do not understand why there isn't a simple tool that automatically takes the bytecode and turns ALL PATHS THROUGH THE CODE (auto-discovered) into machine code for that target platform. That would mean the first request through any path (assume a REST API) would be fast and not require any further just-in-time compilation.
We can create an automated test suite or load test to JIT all the paths before allowing the machine into the load-balancer rotation (a good best practice anyway). We can also flip the "always on" setting in cloud hosting providers to keep the warmed application from being evicted from memory (which would require starting the entire process over again). However, it seems like an archaic process to still have in 2020.
Why isn't there a tool that does this? What is the limitation that prevents us from using metadata, debug symbols, and/or other means to understand how to generate machine code that is already warm and ready for users from the start?
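(For context, the warm-up workaround mentioned above can be as simple as the following sketch; the base URL and route list are hypothetical examples, not from any real application.)

```python
# Hit each route once so the JIT has compiled it before the instance
# joins the load balancer. Base URL and endpoints are placeholders.
import urllib.request

BASE = "http://localhost:5000"
ENDPOINTS = ["/health", "/api/users", "/api/orders"]  # hypothetical routes

for path in ENDPOINTS:
    try:
        with urllib.request.urlopen(BASE + path, timeout=10) as resp:
            print(path, resp.status)
    except Exception as exc:  # even an error response triggers JIT of the path
        print(path, "failed:", exc)
```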
So I have been asking some sharp minds around my professional network, and no one seems able to point out exactly what limitation makes this so hard to do. However, I did get a few tools on my radar that do what I'm looking for to some level.
Crossgen appears to be the most promising, but it's far from widely used among the many peers I've spoken to. I will have to take a closer look.
Also, several of them run some sort of startup task that performs class initialization and registers classes as singletons. I wouldn't consider those much different from just running integration or load tests against the application.
Most programming languages have some form of native-image compiler tool (Crossgen, mentioned above, is the .NET one). It's up to you to use them if that is what you are looking to do.
Providers are supposed to give you a platform for your application and there is a certain amount of isolation and privacy you should expect from your provider. They should not go digging into your application to figure out all its "paths". That would be very invasive.
Plus "warming up" all paths would be an overly resource intensive process for a provider to be obligated to perform for every application they host.
Here is my question:
Is there any service or technology for running parallel algorithms on multiple computers without knowing them?
For example: I write a parallel algorithm. My friends install a simple client app, and if they have an internet connection, they can help my calculation with their free processor capacity. I would like to see them as additional cores in my CPU.
If there is no technology like that, are there any unsolvable problems with developing one? (I know there must be a lot of problems with code transfer, operating systems, and compatibility.)
I believe that you can use BOINC to set up your own volunteer computing project. But I have no experience of this to report.
Does anyone know how a computer cluster can be used for everyday applications, like for example video games?
I would like to build a computer cluster that can run applications that were not specifically designed for computer clusters and still see a performance increase. One use would be for video games, but I would also like to utilize the increased computing power for running a large network of virtualized machines.
It won't help, especially in the case of video games. You have to build around the cluster; the cluster does not work around you.
At any rate, video games require sub-50 ms response to input, and network propagation delay would destroy any performance gains you might see. Video processing, on the other hand, benefits GREATLY from a cluster, as the task is inherently geared toward parallelization: it does not require user input, and output is measured only in terms of the batch process.
If you have a program written for a single core, running it on a four-core processor won't help you (except that one core can be devoted to that program). For example, I have Visual Studio compiling on multiple cores on this machine, but linking is done on one core (and is annoyingly slow). In order to get use out of multiple cores, I have to either run something that can use multiple cores or run several separate programs.
Clusters are like that, only more so. All communication between the machines is explicit and must be programmed in. There are things you can do with a cluster (see Google's MapReduce algorithm), but they do require special programming and work.
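To give a flavor of what that special programming looks like, here is a single-machine sketch in the map-reduce style (my illustration; a real cluster framework adds distribution, shuffling, and fault tolerance):

```python
# Word count in the MapReduce style. The point is that the map and
# reduce steps are explicit; nothing is parallelized "for free".
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_count(text):
    """Map step: count words in one shard of input."""
    return Counter(text.split())

def reduce_counts(a, b):
    """Reduce step: merge two partial counts."""
    a.update(b)
    return a

if __name__ == "__main__":
    shards = ["the quick brown fox", "the lazy dog", "the fox again"]
    with Pool() as pool:
        partials = pool.map(map_count, shards)
    print(reduce(reduce_counts, partials, Counter()))
```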
Typical clusters are used either to specialize machines (one might be a database server and one a web server, for example), or to run large numbers of programs simultaneously.
You will not be able to easily run a video game on a cluster unless it was already designed to work on multiple machines, and I have not heard of such a game. You may have some luck creating a virtual server farm, but I doubt it will be easy to get it working perfectly. If you are interested in this, one example is Amazon's EC2 service. They offer virtual machines for "rent" by the hour; behind the scenes, I assume they have a giant cluster supplying all of these virtual machines.
Unfortunately, unless you have some pretty clever operating system / software design in mind, simply connecting programs together via a cluster and hoping to get increased performance is not likely to work, especially not for video games. In order to get increased performance from running things on a cluster, you have to program for it; otherwise there is a good chance you'd see a decrease in performance rather than an increase.
I've been thinking about getting a little bit greener with my computers and using some low-power mini-ITX boards in my next computer. Some draw under 10 watts and are pretty inexpensive.
So I thought, if one is such low cost and low power, why not try to make a cluster out of them? However, I'm not really sure what I would need to do in terms of Operating System or management software to make this happen?
Can anyone provide advice on existing software to do this or any ideas as to how to design my own?
What you actually want to do with your cluster largely decides what software you will need.
Do you need job scheduling?
Monitoring tools?
Do you need to deploy software across all nodes at once seamlessly?
Do you want a single file system across all nodes (recommended)?
You could just as easily install Linux or a *BSD on the boards and use ssh to manage and run jobs across all the nodes; no other software is really required.
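As a sketch of that plain-ssh approach (the hostnames are placeholders, and key-based ssh login is assumed to be set up already):

```python
# Run one command on every node and print its output.
import subprocess

NODES = ["node1", "node2", "node3"]  # placeholder hostnames

def run_everywhere(command):
    for node in NODES:
        result = subprocess.run(
            ["ssh", node, command],
            capture_output=True, text=True, timeout=60,
        )
        print(f"{node}: {result.stdout.strip() or result.stderr.strip()}")

if __name__ == "__main__":
    run_everywhere("uptime")
```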
Software you might find useful:
PBS (mostly job scheduling; Google it)
Kerrighed (single-system-image based Linux distro)
Rocks (cluster-based distribution)
Mosix (cluster management; see also openMosix)
Ganglia (monitoring; probably overkill for you)
Lustre (super-fast, open-source cluster filesystem from Sun)
Take a look at Beowulf to get started.
That being said, the best advice I can give is to carefully measure whether you are actually being more green with your cluster. I've been a little way down this road before, and in my experience the losses involved in having many separate computers wipe out any energy savings. Keep in mind that every computer needs a power supply, which converts your household voltage down to the level the computer wants. That conversion is inefficient and wastes heat (this is why power supplies have fans). The same can be said for each hard drive, RAM bank and motherboard you need.
This isn't meant to discourage you from the project. Just be sure to profile. Exactly like writing software! :)
You can use Beowulf to run a cluster.
There's a lot to this question.
First, if you just want to get a cluster up and running, there are many suggestions listed already here. Once you have the cluster up and running, though, you're just starting.
At that point, you need to have software that will work correctly across the cluster. If you are working on your own software, you'll need to design it to be parallelized across a cluster, using something like MPI.
Without software written to run across the cluster, though, the cluster is nothing but a highly customized box that doesn't do anything special...
How can I connect two or more machines to form a network grid, and how can I distribute the workload between the two machines?
What operating systems do I need to run on the machines, and what application should I use to manage the load balancing?
NB: I read somewhere that Google uses cheap machines to perform this feat. How do they connect two network cards ("teaming") and distribute load across the machines?
Good practical examples would serve me well, with actual code samples.
Pointers to some good sites where I might read about this stuff would be highly appreciated.
An excellent place to start is the Beowulf project: basically an open-source cluster built on the Linux OS.
There are several software solutions in this expanding market. The term "cloud computing" is certainly gaining traction to describe what you want to do. Do you want a service, or do you want to run it in-house?
I'm most familiar with Appistry EAF, which runs on commodity hardware. It's available as a free download and runs on Windows or Linux.
Another is GoGrid - I believe this is only available as a service, but I'm not as familiar with it.
There are many different approaches to parallel processing, and many types of system architectures you could use.
For commodity systems, there are clusters and grids, or you can even form a single system image from several pieces of commodity hardware. There is of course also load balancing, high availability, failover, etc.
It's pretty much impossible to answer this question without more detail. What exactly do you want to do with these systems? The answer depends very heavily on the application.
You might want to have a look at some of the stuff to do with Folding@home and the SETI project, and some of their participants' blogs. Here is a pretty amazing cluster that a guy built in an IKEA cabinet:
http://helmer.sfe.se/
Might give you some ideas.
The question is too abstract.
One of the (imaginable) ways is to use MPI, a framework for parallel programming; the Wikipedia page includes examples in C++, and there are bindings for other languages.
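As a taste of the explicit style MPI imposes, here is a minimal sketch using mpi4py, one such binding for Python; the range and process count are arbitrary:

```python
# Sum 1..1,000,000 across MPI processes.
# Run with something like: mpiexec -n 4 python sum_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()  # this process's id
size = comm.Get_size()  # total number of processes

# Each rank sums its own interleaved slice of the range...
total = sum(range(rank + 1, 1_000_001, size))

# ...and the partial sums are combined on rank 0.
grand_total = comm.reduce(total, op=MPI.SUM, root=0)
if rank == 0:
    print(grand_total)  # 500000500000
```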