MPI and process launching/binding - parallel-processing

In MPI, is it possible to schedule processes in such a way that one core is mapped to one process, the process finishes and then the core starts working on another process in a situation where the number of processes > number of cores? If so, how exactly is it done? As in, using mpiexec --bind-to-core -bycore or using a hostfile, rankfile etc.
Would really appreciate any input.

Related

How to control how many tasks to run per executor in PySpark [duplicate]

I don't quite understand the spark.task.cpus parameter. It seems to me that a "task" corresponds to a "thread" or a "process", if you will, within the executor. Suppose that I set spark.task.cpus to 2.
How can a thread utilize two CPUs simultaneously? Wouldn't it require locks and cause synchronization problems?
I'm looking at the launchTask() function in deploy/executor/Executor.scala, and I don't see any notion of "number of CPUs per task" there. So where and how does Spark eventually allocate more than one CPU to a task in standalone mode?
To the best of my knowledge, spark.task.cpus controls the parallelism of tasks in your cluster in the case where some particular tasks are known to have their own internal (custom) parallelism.
In more detail:
We know that spark.cores.max defines how many threads (aka cores) your application needs. If you leave spark.task.cpus = 1, then you will have up to spark.cores.max Spark tasks running concurrently.
You will only want to change spark.task.cpus if you know that your tasks are themselves parallelized (maybe each of your tasks spawns two threads, interacts with external tools, etc.). By setting spark.task.cpus accordingly, you become a good "citizen". Now if you have spark.cores.max=10 and spark.task.cpus=2, Spark will only create 10/2=5 concurrent tasks. Given that your tasks need (say) 2 threads internally, the total number of executing threads will never be more than 10. This means that you never go above your initial contract (defined by spark.cores.max).
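The arithmetic above can be sketched in a few lines of plain Python. This is only an illustration of the bookkeeping, not Spark's actual scheduler code: the function names and the threads_per_task parameter are hypothetical, and the sketch treats the core budget as one pool, whereas a real cluster divides it across executors.

```python
def max_concurrent_tasks(cores_max: int, task_cpus: int = 1) -> int:
    """Task slots available when each task reserves task_cpus of the core budget."""
    if cores_max < 1 or task_cpus < 1:
        raise ValueError("both settings must be positive integers")
    # A task only starts if a full reservation of task_cpus cores is free.
    return cores_max // task_cpus


def max_running_threads(cores_max: int, task_cpus: int, threads_per_task: int) -> int:
    """Upper bound on threads if every running task spawns threads_per_task threads."""
    return max_concurrent_tasks(cores_max, task_cpus) * threads_per_task


print(max_concurrent_tasks(10, 1))    # 10
print(max_concurrent_tasks(10, 2))    # 5
print(max_running_threads(10, 2, 2))  # 10
```

With spark.task.cpus matching the number of threads each task really uses, the last figure stays within the spark.cores.max budget.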

Puma clustering benefits for site which handles lots of uploads/downloads

I'm trying to understand the benefits of using Puma clustering. The GitHub page says that the number of Puma workers should be set to the number of CPU cores, and the default number of threads for each is 0-16. The worker processes can run in parallel while the threads run concurrently. It was my understanding that the MRI GIL only allows one thread across all cores to run Ruby code, so how does Puma enable things to run in parallel, or provide benefits over running one worker process with double the number of threads? The site I'm working on is heavily IO bound, handling several uploads and downloads at the same time. Any config suggestions for this setup are also welcome.
In clustered mode, each worker is a separate child process, and each process has its own "GIL". Only one thread in a single process can be running Ruby code at a time, so having one process per CPU core works well because each core can only be doing one thing at a time. It also makes sense to run multiple threads per process, because while one thread is waiting for IO another thread can execute.

parallel programs in Multi Core processors

Assume I'm trying to run a parallel program with 3 different tasks on a quad-core processor.
My question is: when these tasks run simultaneously, will each be computed on its own core of the processor,
or in what way are they executed simultaneously?
If you are using C# and the Task Parallel Library, then yes, they would get queued up in the thread pool and executed in parallel, but there are a few other factors that are very important to consider.
Such as:
- Is there any shared data?
- Does one process need to wait on another?
Also, the order of execution is not guaranteed.
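The queueing behaviour and the lack of ordering can be illustrated with a small sketch. It uses Python's concurrent.futures as a stand-in for the C# thread pool, so it is an analogy rather than the library the answer refers to; the task function and its random sleep are hypothetical placeholders for real work. Note that CPython threads share one interpreter lock, so this shows scheduling and completion order, not true multi-core computation.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def task(name: str) -> str:
    # Simulated work of varying duration (hypothetical stand-in for a real task).
    time.sleep(random.uniform(0.01, 0.05))
    return name


def run_all(names, workers=4):
    """Queue every task on a pool of workers and collect results as they finish."""
    finished = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task, n) for n in names]
        # as_completed yields futures in completion order, not submission order.
        for fut in as_completed(futures):
            finished.append(fut.result())
    return finished


order = run_all(["A", "B", "C"])
print(order)  # some permutation of A, B, C; it can differ between runs
```

With three tasks and four workers, all three start immediately. Which one finishes first depends on the scheduler and on how long each task takes.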

What is the relation between number of thread and number of processor cores?

I am writing a server application that is thread-pool based (IOCP), but I don't know how many threads are appropriate. Is the number of threads related to the number of processor cores?
If your work items never block, use threads = cores. If your threads never need to be descheduled you can max out all cores by creating one thread per core.
If your work items sometimes block (which they shouldn't do much if you want to make best use of IOCP) you need more threads. You need to measure how many.
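A minimal sketch of that sizing rule, in Python rather than a native IOCP server: the pool starts at one thread per core and grows only when work items block. The blocking_factor parameter is a hypothetical tuning knob, not a value from the answer; it would have to come from measurement.

```python
import math
import os
from concurrent.futures import ThreadPoolExecutor


def pool_size(blocking_factor: float = 0.0) -> int:
    """Threads to create: one per core, plus extra if work items block.

    blocking_factor is hypothetical: 0.0 means items never block,
    1.0 means roughly one extra thread per core is needed to cover waits.
    """
    if blocking_factor < 0:
        raise ValueError("blocking_factor must be non-negative")
    cores = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, math.ceil(cores * (1.0 + blocking_factor)))


def handle(item: int) -> int:
    # Placeholder for a non-blocking work item.
    return item * item


with ThreadPoolExecutor(max_workers=pool_size()) as pool:
    results = list(pool.map(handle, range(8)))

print(pool_size(), results)
```

The baseline covers the non-blocking case. Any value above it is a guess until the server has been profiled under a realistic load.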
Multiple threads make up a process, and the number of threads is not dependent on the number of cores. A single-core processor can handle a multi-threaded process using various scheduling schemes. That said, if you have multiple cores on your processor, different threads can run at the same time. So running multiple threads truly simultaneously requires multiple cores, whereas a single core can still run multiple threads by switching between them under a scheduler, which can seem simultaneous even though it is not.
Some useful wiki pages for you:
http://en.wikipedia.org/wiki/Computer_multitasking
http://en.wikipedia.org/wiki/Thread_%28computing%29
http://en.wikipedia.org/wiki/Input/output_completion_port
http://en.wikipedia.org/wiki/Scheduling_%28computing%29
http://en.wikipedia.org/wiki/Thread_pool_pattern

Prevent execution of non-SGE programs

From the point of view of the system administration of an SGE node, is it possible to force users to run long-running programs through qsub instead of running them stand-alone?
The problem is that the same machine is acting as the control node and the computation node. So, I can't distinguish a long-running program from a user who is compiling with "gcc". Ideally, I would like to force users to submit long-running jobs (i.e., more than an hour) through qsub. I don't even mind being a bit mean and killing jobs that have run longer than an hour but weren't submitted through qsub.
Until now, all that I can do is send e-mails out asking users to "Please use qsub!"...
I've looked through the SGE configuration and nothing seems relevant. But maybe I've just missed something...any help would be appreciated! Thanks!
I'm a little confused about your setup, but I'm assuming users are submitting jobs by logging into what is also a computation node. Here are some ideas, best to worst:
- Obviously, the best thing is to have a separate control node for users.
- Barring that, run a resource-limited VM as the control node.
- Configure user-level resource limits (e.g. ulimit) on the nodes. You can restrict CPU, memory, and process usage, which are probably what you care about rather than clock time.
It sounds like the last one may be best for you. It's not hard, either.
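To show what such a limit does to a single process, here is a sketch using Python's standard resource module on a Unix system. It caps CPU time for one child process, the same limit that `ulimit -t` sets in a shell. The helper name run_with_cpu_limit and the probe command are hypothetical. An administrator would normally set the limit system-wide for login sessions rather than per command, and this does not involve SGE itself.

```python
import resource
import subprocess
import sys


def run_with_cpu_limit(cmd, cpu_seconds):
    """Run cmd in a child process whose CPU time is capped at cpu_seconds."""
    _, hard = resource.getrlimit(resource.RLIMIT_CPU)
    if hard != resource.RLIM_INFINITY:
        # An unprivileged process cannot raise its own hard limit.
        cpu_seconds = min(cpu_seconds, hard)

    def apply_limit():
        # Runs in the child just before exec. Once the limit is exceeded,
        # the kernel signals the process (SIGXCPU, then SIGKILL).
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))

    return subprocess.run(cmd, preexec_fn=apply_limit,
                          capture_output=True, text=True)


# Hypothetical probe: the child reports the soft CPU limit it sees.
probe = [sys.executable, "-c",
         "import resource; print(resource.getrlimit(resource.RLIMIT_CPU)[0])"]
result = run_with_cpu_limit(probe, 3600)  # one hour of CPU time
print(result.stdout.strip())
```

The limit counts CPU seconds, not wall-clock time, so an idle editor session is unaffected while a long computation run outside qsub is eventually stopped.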

Resources