MPI: running a parallel program from an already parallel program

I have a question. I am developing a genetic algorithm that uses a second program to do certain calculations, and this second program is itself parallelized. My program calls the second program to do the necessary calculations, and with MPI I can run two of these calculations at the same time.
However, I would also like to take advantage of the second program's own parallelization. A simple scheme is shown below:
                      GENETIC
                         |
                         | ---> MPI HERE
                        / \
                       /   \
                      /     \
  1st machine - PROGRAM     PROGRAM - 2nd machine
                 /  \          /  \
                /    \        /    \
        parallel program    parallel program
That is, with MPI I reserve 6 cores on 2 machines (3 on each) to run my genetic algorithm. The genetic algorithm then launches the second program twice, one instance on each machine, and on each machine that instance of the second program does its calculations using the 3 cores.
Sorry for my English; I did my best to explain this in a simple way. If you still have questions about the process, I will try to explain it differently ...
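For reference, a minimal sketch of the structure I have in mind, assuming mpi4py and a placeholder executable name second_program (the real program, its arguments, and the way it reports results would differ):

# Hedged sketch, not actual code: assumes mpi4py, a placeholder executable
# "second_program", and that the spawned program sends one result back over MPI.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # one rank of the genetic algorithm per machine

# Each rank spawns one instance of the second program with 3 processes,
# so that instance uses the 3 cores reserved on its machine.
child = MPI.COMM_SELF.Spawn("second_program", args=[], maxprocs=3)

# Receive a result (e.g. a fitness value) from rank 0 of the spawned job.
fitness = child.recv(source=0)
child.Disconnect()

# Collect the results from both machines on rank 0 of the genetic algorithm.
results = comm.gather(fitness, root=0)
if rank == 0:
    print(results)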

Related

kdb matrix functions improvement

Has anyone used kdb's matrix functions? I found them quite slow compared to other tools.
For the matrix inverse function inv on a 1000 by 1000 float matrix, kdb+ took 16,682 milliseconds for 10 executions. For matrix multiplication of two matrices of the same 1000 by 1000 dimensions, it took 3,380 milliseconds.
I also tried the same experiment on DolphinDB, a similar time series database with built-in analytics features. DolphinDB's inv function is about 17 times faster, and its matrix multiplication is about 6 times faster.
I would think matrix operation optimization should be something well established. Can any kdb expert explain the reason behind this, or suggest ways to improve it?
I used the 64-bit version of KDB+ 4.0 and DolphinDB_Linux_V2.00.7 (DolphinDB community version: 2 cores and 8 GB memory). Both experiments were conducted using 2 CPU cores.
KDB implementation
// Start the server
rlwrap -r taskset -c 0,1 ./l64/q -p 5002 -s 2
// The code
ma:1000 cut (-5.0+ 1000000?10.0)
mb:1000 cut (-5.0+ 1000000?10.0)
\t do[10;inv ma]
16682
\t do[10;ma mmu mb]
3380
DolphinDB implementation
// Start the server
rlwrap -r ./dolphindb -localSite localhost:5002:local5002 -localExecutors 1
// The code
ma=(-5.0+ rand(10.0,1000000))$1000:1000;
mb=(-5.0+ rand(10.0,1000000))$1000:1000;
timer(10) ma.inv();
Time elapsed: 975.349 ms
timer(10) ma**mb;
Time elapsed: 581.618 ms
DolphinDB uses OpenBLAS for matrix operations such as inv, while kdb+ doesn't.
Nobody here is privy to what kdb and dolphin are doing under the covers for these operations, other than knowing that inv in kdb uses LU decomposition (previously it used Cholesky decomposition as well).
In general, kdb is not optimised for matrix operations; it is optimised for vector operations. Case in point: if you have enough worker threads and use peach:
q)\t do[10;ma mmu mb]
2182
q)\t do[10;mmu[;mb]peach ma]
375
If you wanted to do serious matrix operations in kdb you would load C libraries to carry out the heavy lifting. For example using the Q math library (http://althenia.net/qml):
q)\t inv ma
1588
q)\t .qml.minv ma
689

Testing Intel Extension for PyTorch (IPEX) in the multiple-choice example from huggingface/transformers

I am trying out a huggingface example with the SWAG dataset:
https://github.com/huggingface/transformers/tree/master/examples/pytorch/multiple-choice
I would like to use Intel Extension for PyTorch in my code to increase performance.
Here I am using the version that does not use the Trainer class (run_swag_no_trainer).
In run_swag_no_trainer.py, I made some changes to use ipex.
# Code before the change:
device = accelerator.device
model.to(device)

# After adding ipex:
import intel_pytorch_extension as ipex
device = ipex.DEVICE
model.to(device)
When I run the command below, it takes too much time.
export DATASET_NAME=swag
accelerate launch run_swag_no_trainer.py \
--model_name_or_path bert-base-cased \
--dataset_name $DATASET_NAME \
--max_seq_length 128 \
--per_device_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3 \
--output_dir /tmp/$DATASET_NAME/
Is there any other way to test this with Intel IPEX?
First you have to understand which factors actually increase the running time. These factors are:
Large input size.
The structure of the data: a shifted mean and unnormalized values.
Large network depth and/or width.
A large number of epochs.
A batch size that does not fit the physically available memory.
A very small or very high learning rate.
For faster runs, address these factors:
Reduce the input size to the smallest dimensions that ensure no loss of important features.
Always preprocess the input to make it zero-mean, and normalize it by dividing by the standard deviation or by the difference between the max and min values.
Keep the network depth and width neither too high nor too low, or use a standard architecture that is theoretically well proven.
Watch the number of epochs: if the error or accuracy no longer improves beyond a defined threshold, there is no need to run more epochs.
Choose the batch size based on the available memory and the number of CPUs/GPUs. If a batch cannot be fully loaded into memory, processing slows down because of heavy paging between memory and the filesystem.
Determine an appropriate learning rate by trying several and using the one that gives the best reduction in error with respect to the number of epochs.
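As for the IPEX part of the question, here is a minimal, hedged sketch of how the extension is typically applied on CPU. It assumes the newer intel_extension_for_pytorch package and its ipex.optimize call rather than the older intel_pytorch_extension / ipex.DEVICE approach shown in the question, and it uses a tiny stand-in model instead of the actual BERT model from run_swag_no_trainer.py:

# Hedged sketch: assumes the newer intel_extension_for_pytorch package,
# not the intel_pytorch_extension package used in the question.
import torch
import intel_extension_for_pytorch as ipex

model = torch.nn.Linear(128, 2)                      # tiny stand-in for the BERT model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
# ipex.optimize returns an optimized (model, optimizer) pair when an optimizer is passed.
model, optimizer = ipex.optimize(model, optimizer=optimizer)

x = torch.randn(32, 128)                             # stand-in for one tokenized batch
labels = torch.randint(0, 2, (32,))

logits = model(x)
loss = torch.nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()

In this newer style the model stays on the plain CPU device; as far as I understand, ipex.optimize rewrites the model and optimizer instead of moving them to a special ipex device.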

Why does GNU parallel affect script speed?

I have a Fortran script. I compile it with gfortran and then run it as time ./a.out.
The script completes, and time reports the runtime as:
real 0m36.037s
user 0m36.028s
sys 0m0.004s
i.e. ~36 seconds
Now suppose I want to run this script multiple times, in parallel. For this I am using GNU Parallel.
Using the lscpu command tells me that I have 8 CPUs, with 2 threads per core and 4 cores per socket.
I create some file example.txt of the form,
time ./a.out
time ./a.out
time ./a.out
time ./a.out
...
which goes on for 8 lines.
I can then run these in parallel on 8 cores as,
parallel -j 8 :::: example.txt
In this case I would expect the runtime for each script to still be 36 seconds, and the total runtime to be ~36 seconds. However, in actuality what happens is the run time for each script roughly doubles.
If I instead run on 4 cores instead of 8 (-j 4) the problem disappears, and each script reverts to taking 36 seconds to run.
What is the cause of this? I have heard talk of 'overheads' in the past, but I am not sure exactly what is meant by that.
What is happening is that you have only one socket with 4 physical cores in it.
Those are the real cores of your machine.
The total number of CPUs you see as output of lscpu is calculated using the following formula: #sockets * #cores_per_socket * #threads_per_core.
In your case it is 1*4*2=8.
Threads per core are a sort of virtual CPU, and they do not always perform like real CPUs, especially for compute-intensive processing (this feature is called hyper-threading).
Hence, when you try to squeeze two threads onto each core, they end up being executed almost serially.
Take a look at this article for more info.
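If you want to check this from a script rather than reading lscpu output, here is a small sketch (assuming the third-party psutil package is installed) that distinguishes logical CPUs from physical cores:

# Minimal sketch, assuming psutil is installed: compare logical CPUs
# (what lscpu calls "CPU(s)") with physical cores.
import os
import psutil

logical = os.cpu_count()                     # sockets * cores_per_socket * threads_per_core
physical = psutil.cpu_count(logical=False)   # real cores only

print("logical CPUs  :", logical)            # 8 on the machine described above
print("physical cores:", physical)           # 4 on the machine described above

For a CPU-bound job like the Fortran program, limiting GNU Parallel to the physical core count (-j 4 here) avoids the hyper-threading slowdown described above.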

When/how to benefit from parallel processing of scripts?

I have a sequence of scripts to run on a computer with 1 physical & logical core.
I have tried running them in sequence, and also forking them with something like the bash script below. The background processes, running in parallel, actually took longer than running them in sequence.
My question is: Under what circumstances should a workload like this be run in parallel? That is, must I have particular hardware or can I do this more efficiently with the single-processor computer that I have?
#!/bin/bash
# Run them in sequence...
T1=$(date +%s)
python proc_test1.py
python proc_test2.py
python proc_test3.py
T2=$(date +%s)
T=$((T2-T1))
echo "Scripts took ${T} seconds."
# Now fork them...
T1=$(date +%s)
python proc_test1.py &
python proc_test2.py &
python proc_test3.py &
wait
T2=$(date +%s)
T=$((T2-T1))
echo "Scripts took ${T} seconds."
exit 0
Parallel processing usually becomes beneficial when hardware is available. For example, if you had three logical CPUs, then three computations could occur simultaneously. Thus, ideally, if you ran proc_test1.py three times with forks with three available processors, all three would finish in the same time it would take to run just one instance of proc_test1.py.
In other words, given sufficient hardware, running three proc_test1.py's in serial will take three times as long as running them with forks.
Now, given that you just have one hardware cpu, it makes sense that the parallel jobs would run more slowly than the serial ones, as each python program will be competing with each other for cpu time. The cpu stopping one job and resuming another costs cpu time itself.
For example, say you had 6 oranges and two hands, and you had to hold all 6 oranges for 5 seconds total. Say it takes you one second to pick up or swap out oranges. You could do this task in serial, and pick up two oranges at a time for five seconds before swapping for a new pair. This would take you
1 + 5 + 1 + 5 + 1 + 5
= 3 * 5 + 3 = 18
seconds to complete.
Now suppose the parallel analogy. Then all 6 oranges are begging to be picked up, and your holding one does not mean that you won't immediately drop it for an alternative. There isn't necessarily an upper bound on how long it will take you to complete the task as we have defined it, so suppose that you have to hold the oranges for at least 2.5 seconds before you swap them out, and you only swap in pairs. Then, it will take you
1 + 2.5 + 1 + 2.5 + 1 + 2.5 + 1 + 2.5 + 1 + 2.5 + 1 + 2.5 + 1
= 3 * 5 + 7 = 22
seconds to complete. Note that by "forking", it takes you 22% longer to hold 6 oranges for 5 seconds each. Since you have two hands, it still takes you 15 seconds to complete the task, but there is a variable overhead in switching time based on your strategy. Note that if you had 6 hands, it would take only 7 seconds to complete the task.
Thus, when you have more processors, fork processes; otherwise you're just juggling jobs on limited hardware.
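To make that point concrete, here is a small hedged sketch (Python stand-ins for the proc_test scripts, not the actual workloads) that times the same CPU-bound work serially and with one process per job:

# Hypothetical benchmark sketch: serial vs. one-process-per-job for CPU-bound work.
import time
from multiprocessing import Pool

def cpu_bound(n):
    # Stand-in for proc_testN.py: pure computation, no I/O.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    jobs = [10_000_000] * 3

    t0 = time.time()
    for n in jobs:
        cpu_bound(n)
    print(f"serial:   {time.time() - t0:.2f} s")

    t0 = time.time()
    with Pool(processes=len(jobs)) as pool:
        pool.map(cpu_bound, jobs)
    print(f"parallel: {time.time() - t0:.2f} s")

On a single-core machine the parallel timing is usually no better, and often slightly worse because of process startup and context switching; with three or more cores it approaches a third of the serial time.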
Your simple question is in practice extremely hard to answer in general. In real life I would always measure to see if reality agrees with my theory.
An example where reality did not agree with my theory was on my Intel Core i7. It has 4 cores and has hyperthreading. This would suggest that running 8 threads in parallel will be optimal: You will be using the 4 processing units and use 4 additional threads for keeping the pipelines filled.
However, the Core i7 has 6 MB of shared cache. It just so happened that the working set of my data fit inside the 6 MB. So I saw an extreme speedup by running 1 thread instead of 2 or even 8: running more than 1 would simply flush the cache all the time. This would not have been true if the cache had not been shared, but it just shows that it is not simple to say when parallelizing will be faster.

Matlab parallel processing using a network computer

I'm familiar with matlabpool and parfor usage, but I still need to speed up the computation.
I have a more powerful computer on my 1 Gb network. Both computers have R2010b and have the same code and paths.
What is the simplest way to use both computers for parallel computation?
Example of the code I use today:
--- main.m ---
matlabpool('open', 3);
% ...
x = randn(1e5,1);
y = nan(size(x));
parfor k = 1 : length(x)
    y(k) = myfunc(x(k));
end

--- myfunc.m ---
function y = myfunc(x)
y = x; % some computation
return
For real cluster computing, you'll need the distributed computing toolbox, as you can read on the parallel computing info page:
Without changing the code, you can run the same application on a computer cluster or a grid computing service (using MATLAB Distributed Computing Server™). You can run parallel applications interactively or in batch.
But installing (=buying) a toolbox just for adding one computer to the worker pool might be a bit too expensive. Luckily there are also alternatives: http://www.mathworks.com/matlabcentral/fileexchange/13775
I personally haven't used this, but think it's definitely worth a look.
