Does anyone know whether, when I run multiple Jupyter Notebooks (different files) simultaneously, they are automatically distributed over multiple cores? Or are they by default threaded on a single core? I am not explicitly using a multiprocessing package of any kind and not looking for a way to implement parallel computing. I'm just curious whether the computer automatically sends the calculations to different cores.
Related
I have a program written in C++11. On the current input it takes too long to run. Luckily, the data can be safely split into chunks for parallel processing, which makes it a good candidate for, say, a Map/Reduce service.
AWS EMR could be a possible solution. However, since my code uses many modern libraries, it's quite a pain to compile it on the instances that are assigned for Apache Hadoop clusters. For example, I want to use soci (not available at all), boost 1.58+ (1.53 is there), etc etc. I also need a modern C++ compiler.
Obviously, all libraries and compilers can be manually upgraded (and the process scripted), but this sounds like a lot of manual work. And what about slave nodes - will they get all the libraries? Somehow I'm not sure. And the whole process of initializing the environment can now take very long time - thus killing a lot of performance advantage that distributing the jobs was supposed to bring in to begin with.
On the other hand, I don't really need all the advanced functionality that Apache Hadoop provides. And I don't want to set up a personal permanent cluster with my own installation of Hadoop or similar, because I will need to run the tasks only periodically and most of the time the servers will be idle, wasting money.
So, what would be the best product (or overall strategy) that could do the following:
Grab the given binaries + set of input files
Run the binaries on a predefined number of instances, using a recent Linux, ideally Ubuntu 15.10
Put the resulting files in a predefined location (S3 bucket?)
Shut everything down
I am sure I could write a number of scripts using the aws tool to achieve that manually, but I really don't want to reinvent the wheel. Any thoughts?
Thanks in advance!
Honestly that would be pretty easy to script, and you'll need to probably use scripting to grab the latest code on the servers when they start up anyway. I would suggest looking into defining an AutoScaling group with scheduled scaling policies. Alternatively you could have a Lambda function scheduled to run and issue the API command to create your instances.
You could either have a startup script on the server AMI, or simply pass a user-data script when you create the instances, that pulls down the binaries and input files and runs the command. The final step of the script could be to copy results to S3 and shutdown the server.
The (relatively new) AWS Batch is made for this purpose specifically.
I have a Windows application which uses many third-party modules of questionable reliability. My app has to create many objects from those modules, and one bad object may cause a lot of problems.
I was thinking of a multi-process scheme where the objects are created in a separate process (the interfaces are basically all the same, so creating the cross-process communication shouldn't be so difficult). At the most extreme, I'm considering one object per process so I might end up with anywhere between 20 processes and a few hundred processes launch from my main app.
Would that cause Windows any problems? I'm using Windows 7, 64-bit, and memory and CPU power won't be an issue.
As long as you have enough CPU power and memory there are no problems. Having the general rules for distributed applications, multithreading (yes, multithreading), deadlocks, atomic operations and co. everything should be fine.
How can i run multiple computers as one?
i.e. one "master" which issues commands and one or more slaves who do what they are told to do.
also, How do the distributed computingsystems in supercomputers do this?
EDIT:
I found this, this and this and now i wonder, is there something similar that will run parallel programs like hash cracking? Mostly a software designed for these types of cloud computing systems.
In distributed computing systems, broadly speaking, there is no concept of master-slave, the way you describe.
It is a set of distinct autonomous machines (or to define it differently a set of HW or SW modules running at different computers) that work "together" to the same goal.
They achieve this by network communication.
It is as if you had a single software running (through all the machines) and the various processing modules of this software "running" in separate machines (as opossed to separate threads or processes in the same machine).
Parallel computing is not the same concept as distributed computing, a difference being that in distributed systems each machine uses its own memory.
Supercomputer is a term usually refering to hardware capabilities.
The other day I needed to archive a lot of data on our network and I was frustrated I had no immediate way to harness the power of multiple machines to speed-up the process.
I understand that creating a distributed job management system is a leap from a command-line archiving tool.
I'm now wondering what the simplest solution to this type of distributed performance scenario could be. Would a custom tool always be a requirement or are there ways to use standard utilities and somehow distribute their load transparently at a higher level?
Thanks for any suggestions.
One way to tackle this might be to use a distributed make system to run scripts across networked hardware. This is (or used to be) an experimental feature of (some implementations of) GNU Make. Solaris implements a dmake utility for the same purpose.
Another, more heavyweight, approach might be to use Condor to distribute your archiving jobs. But I think you wouldn't install Condor just for the twice-yearly archiving runs, it's more of a system for regularly scavenging spare cycles from networked hardware.
The SCons build system, which is really a Python-based replacement for make, could probably be persuaded to hand work off across the network.
Then again, you could use scripts to ssh to start jobs on networked PCs.
So there are a few ways you could approach this without having to take up parallel programming with all the fun that that entails.
Does anyone know how a computer cluster can be used for everyday applications, like for example video games?
I would like to build a computer cluster that can run applications over the cluster that were not specifically designed for computer clusters and still see the performance increase. One use would be for video games, but I would also like to utilize the increased computing power for running a large network of virtualized machines.
It won't help, especially in the case of video games. You have to build around the cluster; the cluster does not work around you.
At any rate, video games require sub-50ms response time on input and response,and network propagation would just destroy any performance gains you might see. Video processing, on the other hand, benefits GREATLY from the cluster as the task is inherently geared toward parallelization. It does not require user input, and output is only measured in terms of the batch process.
If you have a program written for a single core, running it on a four-core processor won't help you (except that one core can be devoted to that program). For example, I have Visual Studio compiling on multiple cores on this machine, but linking is done on one core (and is annoyingly slow). In order to get use out of multiple cores, I have to either run something that can use multiple cores or run several separate programs.
Clusters are like that, only more so. All communication between the machines is explicit and must be programmed in. There are things you can do with a cluster (see Google's map-reduce algorithm), but they do require special programming and work.
Typical clusters are used either to specialize machines (one might be a database server and one a web server, for example), or to run large numbers of programs simultaneously.
You will not be able to easily run a video game on a cluster, unless it was already designed to work on multiple machines. I have not heard of such a game. You may have some luck creating a virtual server farm, but I doubt it will be easy to get it working perfectly. If you are interested in this, one example would be amazon's EC2 service. They offer virtual machines for "rent" by the hour. Behind the scenes, I assume they have a giant cluster that is supplying all of these virtual machines.
Unfortunately, unless you have some pretty clever operating system / software design in mind - simply connecting programs together via a cluster and hoping to get increased performance is not likely to work - especially not for video games. In order to get increased performance from running things in a cluster you have to program for it otherwise there is a good change you'd see a decrease in performance rather than an increase.