Understanding MPI multi-host mode - performance

I have two hosts, each with 20 logical cores (10 physical + 10 hyper-threaded).
Just to test my MPI performance I use a simple matrix multiplication program.
My understanding of MPI with two hosts is "start as many processes as there are cores on every host."
The problem is the following:
In single-node mode I execute mpirun -n 20 mm and get a run time of roughly 0.5 sec.
Then in multi-node mode I execute mpirun -n 20 --host srv1:20,srv2:20 mm and again get roughly 0.5 sec.
So there is no performance improvement from using two hosts, even though I expected one.
What settings, options, configuration files (and so on) should I check and fix to get the expected result?
Thanks.
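One thing to check before tuning configuration files: with -n 20 and srv1:20 listed first, Open MPI's default by-slot mapping can place all 20 ranks on srv1, so the multi-node run may be using only one host, which would explain the identical timings. A sketch of how to verify the placement and spread the ranks (assuming Open MPI, with the host names from the question):

```shell
# Print which host each rank actually runs on
mpirun -n 20 --host srv1:20,srv2:20 hostname

# Spread ranks round-robin across the two hosts
mpirun -n 20 --host srv1:20,srv2:20 --map-by node mm

# Or use all 40 slots
mpirun -n 40 --host srv1:20,srv2:20 mm
```

Note also that a fixed-size matrix multiplication only speeds up with more hosts if the program actually distributes the work across ranks, and hyper-threaded "cores" rarely double throughput for compute-bound code.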

Worker node-status on a Ray EC2 cluster: update-failed

I now have a Ray cluster running on EC2 (Ubuntu 16.04) with a c4.8xlarge master node and one identical worker. I wanted to check whether multi-threading was being used, so I ran tests timing increasing numbers (n) of the same 9-second task. Since the instance has 18 CPUs, I expected the job to take about 9 s for n <= 35 (assuming one CPU for cluster management) and then either a fault, or an increase to about 18 s when switching over to the 36 vCPUs per node.
Instead, the cluster handled only up to 14 tasks in parallel; after that the execution time jumped to 40 s and kept increasing with n. When I tried a c4.xlarge master (4 CPUs), the times were directly proportional to n, i.e. the tasks ran serially. So I surmise that the master actually requires 4 CPUs for the system, and that the worker node is not being used at all. However, if I add a second worker, the times for n > 14 are about 40 s less than without it. I also tried a target_utilization_factor of less than 1.0, but that made no difference.
No errors were reported, but I did notice that the ray-node-status for the worker in the EC2 Instances console was "update-failed". Is this significant? Can anyone enlighten me about this behaviour?
The cluster was not using the workers, so the trace shows only the 18 actual CPUs dealing with the tasks. The monitor (ray exec ray_conf.yaml 'tail -n 100 -f /tmp/ray/session_/logs/monitor') showed that the "update-failed" status is indeed significant: the setup commands, called by the Ray updater.py, were failing on the worker nodes. Specifically, it was the attempt to install the build-essential compiler package on them that, presumably, exceeded the workers' memory allocation. I was only doing this to suppress a "setproctitle" installation warning, which I now understand can be safely ignored anyway.

slurm: limit number of spawned processes

I am a newbie trying to install/administer Slurm. I want to limit the amount of forking a Slurm job can do. I used the stress command to see the CPU utilization caused by a Slurm job.
When I run this batch script
#!/bin/bash
#SBATCH -p Test -c 1
stress -c 1
the job runs fine, with one core used at 100 percent. But this script
#!/bin/bash
#SBATCH -p Test -c 1
stress -c 20
also runs, except that top shows a list of 20 forked PIDs with a CPU utilization of 5 percent each. This makes sense, as the total utilization is still 100 percent of one core. It also makes the load average go crazy, which, as I learned by googling, is not an accurate view of system load anyway. I have two questions:
Is it possible to configure Slurm, from the admin side, to limit such behavior by killing the second kind of run? My various attempts have so far yielded nothing. Slurm is configured with cgroups and kills over-memory jobs fine. No MPI is used or configured.
Does this behavior cause inefficiency because of process waiting times?
I tried setting these drastic parameters to see if anything happens:
MaxStepCount=1
MaxTasksPerNode=2
But surprisingly nothing happens, and I can still submit many more jobs afterwards.
Slurm's job is to allocate computational resources to user jobs. The lowest manageable unit of computation is referred to in the documentation as the CPU. This means processing threads / execution cores, not physical cores. Slurm does not oversee how those resources are used by the job. So no, nothing in Slurm can kill a job with too many userland threads.
Running that many threads would probably affect efficiency, yes. All those threads cause increased context switching unless the job has enough CPU threads to handle them.
MaxStepCount and MaxTasksPerNode apply to jobs. "Tasks" in this context are not userland threads but separate processes launched by a job step.
I hope that helps.
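That said, the cgroup task plugin that Slurm is already using can confine (rather than kill) such a job to the CPUs it was allocated, which is exactly what produces the 5-percent-per-PID picture above. A minimal sketch of the relevant settings (assuming a cgroup-enabled Slurm build):

```
# slurm.conf
TaskPlugin=task/cgroup

# cgroup.conf
ConstrainCores=yes      # pin each job to its allocated CPUs
ConstrainRAMSpace=yes   # enforce the memory limits already working here
```

With ConstrainCores=yes the 20 stress workers still run, but they can never spill beyond the single allocated core.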

How does lftp calculate the throughput in parallel mode?

I'm using lftp (lftp --version shows Version 4.0.9) in mirror mode to test the performance of some SFTP servers. I'm especially interested in the throughput (bytes/sec) when I run lftp with different numbers of concurrent connections.
When I ran the test with 25 concurrent connections, it gave me a rather strange figure of 5866 seconds as the download time. To check the real time spent on the download, I used the time command (as suggested in a related question).
The output was:
$ time lftp -e 'mirror --parallel=25 (rest of the command-line)'
21732837094 bytes transferred in 5866 seconds (3.53M/s)
real 4m31.315s
user 1m25.977s
sys 1m38.041s
My first thought was that those 5866 seconds were the sum of the time spent by every connection, but dividing that by 25 gives 234.64 seconds (03m54.64s), which is rather far from 4m31.315s.
Does anyone have an insight on how the numbers from lftp are calculated?
Before lftp 4.5.0, mirror summed the overlapping durations of the parallel transfers (incorrectly). This was fixed in 4.5.0 to count the wall-clock time during which any of the transfers was active.
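The reported rate is consistent with that explanation: lftp divides the byte count by its own (summed) time, while the effective wall-clock throughput is far higher. Checking the numbers above:

```shell
# lftp's figure: bytes / reported (summed) time, in MiB/s
awk 'BEGIN { printf "%.2f\n", 21732837094 / 5866 / 1048576 }'      # prints 3.53, matching "3.53M/s"

# effective wall-clock rate, using time's "real" 4m31.315s = 271.315 s
awk 'BEGIN { printf "%.1f\n", 21732837094 / 271.315 / 1048576 }'   # prints 76.4 (MiB/s)
```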

Redis scaling with number of clients

I am testing Redis performance on my local machine, and I want to know how well Redis scales as the number of parallel connections increases. My machine has 24 cores.
At first I tested with -c 1; the benchmark command was ./redis-benchmark -c 1 -n 100000 -t set,get, and the result was around 70K requests/s. Then I ran ./redis-benchmark -c 8 -n 100000 -t set,get, and the result was 200K requests/s. Finally I ran ./redis-benchmark -c 10 -n 100000 -t set,get, and it was still around 200K requests/s. I expected the throughput to increase about 8 times when the number of parallel connections increased 8 times. Also, why is there no difference between -c 8 and -c 10? Many thanks for your time.
Redis is single-threaded, so the maximum QPS it can achieve is limited by the power of a single processor core. 200K might be the maximum QPS achievable on your hardware.
If you want higher QPS, you need a more powerful CPU or more Redis instances.
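If the goal is higher numbers from the benchmark itself, two things worth trying (flags as documented for redis-benchmark and redis-server): pipelining amortizes the network round trip on a single instance, and a second instance on another port can use another core:

```shell
# Pipeline 16 commands per round trip (-P is redis-benchmark's pipeline depth)
./redis-benchmark -c 8 -n 100000 -t set,get -P 16

# Start a second instance on port 6380 and benchmark it separately
redis-server --port 6380 --daemonize yes
./redis-benchmark -p 6380 -c 8 -n 100000 -t set,get
```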

bash loop with curl evidencing non-linear scaling of response times

I wrote this simple Bash script to detect incidence of error-pages:
date;
iterations=10000;
count_error=0;
count_expected=0;
for ((counter = 0; counter < iterations; ++counter)); do
if curl -s http://www.example.com/example/path | grep -iq error;
then
((count_error++));
else
((count_expected++));
fi;
sleep 0.1;
done;
date;
echo count_error=$count_error count_expected=$count_expected
I'm finding that total execution time does not scale linearly with the iteration count: 10 iterations take 00:00:12, 100 take 00:01:46, 1000 take 00:17:24, 10000 take about 50 minutes, and 100000 about 10 hours.
Can anyone provide insight into the non-linearity and/or suggest improvements to the script? Is curl unable to fire requests at a rate of 10/sec? Is garbage collection periodically having to clear internal buffers that fill up with response text?
Here are a few thoughts:
You are not creating 10 requests per second here (as stated in the question); instead, you are running the requests sequentially, i.e. each iteration takes as long as the request itself plus the 0.1 s sleep, so there is no fixed rate.
The ; at the end of each line is not required in Bash.
When testing your script from my machine against a different URL, 10 iterations take 3 seconds, 100 take 31 seconds, and 1000 take 323 seconds, so the execution time scales linearly in this range.
You could try using htop or top to identify performance bottlenecks on your client or server.
The Apache benchmark tool ab is a standard tool for benchmarking web servers and is available in most distributions. See the manpage ab(1) for more information.
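As a sketch of that suggestion (URL and request count taken from the question; on Debian/Ubuntu, ab ships in the apache2-utils package):

```shell
# 1000 requests, one at a time; ab reports timing percentiles and failure counts
ab -n 1000 -c 1 http://www.example.com/example/path
```

Note that ab counts failed and non-2xx responses for you, though it will not grep the response body for the word "error" the way the script does.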
