IntMap traverseWithKey faster than mapWithKey?

IntMap traverseWithKey faster than mapWithKey? - performance

I was reading this part of Parallel and Concurrent Programming in Haskell, and found the sequential version of the program to be far slower than the parallel one with one core:
Sequential:
$ cabal run fwsparse -O2 --ghc-options="-rtsopts" -- 1000 800 +RTS -s
...
Total time 11.469s ( 11.529s elapsed)
Parallel:
$ cabal run fwsparse1 -O2 --ghc-options="-rtsopts -threaded" -- 1000 800 +RTS -s -N1
...
Total time 4.906s ( 4.988s elapsed)
According to the book, the sequential one should be slightly faster than the parallel one (if it runs on a single core).
The only difference in the two programs was this function:
Sequential:
update g k = Map.mapWithKey shortmap g
Parallel:
update g k = runPar $ do
m <- Map.traverseWithKey (\i jmap -> spawn (return (shortmap i jmap))) g
traverse get m
Since the spawn function uses deepseq, I initially though it had something to do with strictness but the use of force didn't change the performance of the sequential program.
Finally, I managed to get the sequential one work faster:
$ cabal run fwsparse -O2 --ghc-options="-rtsopts" -- 1000 800 +RTS -s
...
Total time 3.891s ( 3.901s elapsed)
by changing the update function to this:
update g k = runIdentity $ Map.traverseWithKey (\i jmap -> pure $ shortmap i jmap) g
Why does using traverseWithKey in the Identity monad speed up performance? I checked the IntMap source code but couldn't figure out the reason.
Is this a bug or the expected behaviour?
(Also, this is my first question on StackOverflow so please tell me if I'm doing anything wrong.)
EDIT:
As per Ismor's comment, I turned off optimization (using -O0) and the sequential program runs in 19.5 seconds with either traverseWithKey or mapWithKey, while the parallel one runs in 21.5 seconds.
Why is traverseWithKey optimized to be so much faster than mapWithKey?
Which function should I use in practice?

Related

Problems with Orca and OpenMPI for parallel jobs

Hello to the community:
I recently started to use ORCA software for some quantum calculation but I have been having a lot of problems to lunch a parallel calculation in the cluster of my University.
To install Orca I used the static version:
orca_4_2_1_linux_x86-64_openmpi314.tar.xz.
In a shared direction of the cluster (/data/shared/opt/ORCA/).
And putted in my ~/.bash_profile:
export PATH="/data/shared/opt/ORCA/orca_4_2_1_linux_x86-64_openmpi314:$PATH"
export LD_LIBRARY_PATH="/data/shared/opt/ORCA/orca_4_2_1_linux_x86-64_openmpi314:$LD_LIBRARY_PATH"
For the installation of the corresponding OpenMPI version (3.1.4)
tar -xvf openmpi-3.1.4.tar.gz
cd openmpi-3.1.4
./configure --prefix="/data/shared/opt/ORCA/openmpi314/"
make -j 10
make install
When I use the frontend server all is wonderful:
With a .sh like this:
#! /bin/bash
export PATH="/data/shared/opt/ORCA/openmpi314/bin:$PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/data/shared/opt/ORCA/openmpi314/lib"
$(which orca) test.inp > test.out
and an input like this:
# Computation of myjob at b3lyp/6-31+G(d,p)
%pal nprocs 10 end
%maxcore 8192
! RKS B3LYP 6-31+G(d,p)
! TightSCF Grid5 NoFinalGrid
! Opt
! Freq
%cpcm
smd true
SMDsolvent "water"
end
* xyz 0 1
C 0 0 0
O 0 0 1.5
*
The problem appears when I use the nodes:
.inp file:
#! Computation at RKS B3LYP/6-31+G(d,p) for cis1_bh267_m_Cell_152
%pal nprocs 12 end
%maxcore 8192
! RKS B3LYP 6-31+G(d,p)
! TightSCF Grid5 NoFinalGrid
! Opt
! Freq
%cpcm
smd true
SMDsolvent "water"
end
* xyz 0 1
C -4.38728130 0.21799058 0.17853303
C -3.02072869 0.82609890 -0.29733316
F -2.96869122 2.10937041 0.07179384
F -3.01136328 0.87651596 -1.63230798
C -1.82118365 0.05327804 0.23420220
O -2.26240947 -0.92805650 1.01540713
C -0.53557484 0.33394113 -0.05236121
C 0.54692198 -0.46942807 0.50027196
O 0.31128292 -1.43114232 1.22440290
C 1.93990391 -0.12927675 0.16510948
C 2.87355011 -1.15536140 -0.00858832
C 4.18738231 -0.82592189 -0.32880964
C 4.53045856 0.52514329 -0.45102225
N 3.63662927 1.52101319 -0.26705841
C 2.36381718 1.20228695 0.03146190
F -4.51788749 0.24084604 1.49796862
F -4.53935644 -1.04617745 -0.19111502
F -5.43718443 0.87033190 -0.30564680
H -1.46980819 -1.48461498 1.39034280
H -0.26291843 1.15748249 -0.71875720
H 2.57132559 -2.20300864 0.10283592
H 4.93858460 -1.60267627 -0.48060140
H 5.55483009 0.83859415 -0.70271364
H 1.67507560 2.05019549 0.17738396
*
.sh file (Slurm job):
#!/bin/bash
#SBATCH -p deflt #which partition I want
#SBATCH -o cis1_bh267_m_Cell_152_myjob.out #path for the slurm output
#SBATCH -e cis1_bh267_m_Cell_152_myjob.err #path for the slurm error output
#SBATCH -c 12 #number of cpu(logical cores)/task (task is normally an MPI process, default is one and the option to change it is -n)
#SBATCH -t 2-00:00 #how many time I want the resources (this impacts the job priority as well)
#SBATCH --job-name=cis1_bh267_m_Cell_152 #(to recognize your jobs when checking them with "squeue -u USERID")
#SBATCH -N 1 #number of node, usually 1 when no parallelization over nodes
#SBATCH --nice=0 #lowering your priority if >0
#SBATCH --gpus=0 #number of gpu you want
# This block is echoing some SLURM variables
echo "Jobid = $SLURM_JOBID"
echo "Host = $SLURM_JOB_NODELIST"
echo "Jobname = $SLURM_JOB_NAME"
echo "Subcwd = $SLURM_SUBMIT_DIR"
echo "SLURM_CPUS_PER_TASK = $SLURM_CPUS_PER_TASK"
# This block is for the execution of the program
export PATH="/data/shared/opt/ORCA/openmpi314/bin:$PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/data/shared/opt/ORCA/openmpi314/lib"
$(which orca) ${SLURM_JOB_NAME}.inp > ${SLURM_JOB_NAME}.log --use-hwthread-cpus
I used the --use-hwthread-cpus flag as a recommendation but the same problem appears with and without this flag.
All the error is:
There are not enough slots available in the system to satisfy the 12 slots that were requested by the application: /data/shared/opt/ORCA/orca_4_2_1_linux_x86-64_openmpi314/orca_gtoint_mpi
Either request fewer slots for your application, or make more slots available for use. A "slot" is the Open MPI term for an allocatable unit where we can launch a process. The number of slots available are defined by the environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an RM is present, Open MPI defaults to the number of processor cores In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the --use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the number of available slots when deciding the number of processes to launch.
*[file orca_tools/qcmsg.cpp, line 458]:
.... aborting the run*
When I go to the output of the calculation, it looks like start to run but when launch the parallel jobs fail and give:
ORCA finished by error termination in GTOInt
Calling Command: mpirun -np 12 --use-hwthread-cpus /data/shared/opt/ORCA/orca_4_2_1_linux_x86-64_openmpi314/orca_gtoint_mpi cis1_bh267_m_Cell_448.int.tmp cis1_bh267_m_Cell_448
[file orca_tools/qcmsg.cpp, line 458]:
.... aborting the run
We have two kind of nodes on the cluster:
A punch of them are:
Xeon 6-core E-2136 # 3.30GHz (12 logical cores) and Nvidia GTX 1070Ti
And the other ones:
AMD Epyc 24-core (24 logical cores) and 4x Nvidia RTX 2080Ti
Using the command scontrol show node the details of one node of each group are:
First Group:
NodeName=fang1 Arch=x86_64 CoresPerSocket=6
CPUAlloc=12 CPUTot=12 CPULoad=12.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:gtx1070ti:1
NodeAddr=fang1 NodeHostName=fang1 Version=19.05.5
OS=Linux 5.7.12-arch1-1 #1 SMP PREEMPT Fri, 31 Jul 2020 17:38:22 +0000
RealMemory=15923 AllocMem=0 FreeMem=171 Sockets=1 Boards=1
State=ALLOCATED ThreadsPerCore=2 TmpDisk=7961 Weight=1 Owner=N/A MCS_label=N/A
Partitions=deflt,debug,long
BootTime=2020-10-27T09:56:18 SlurmdStartTime=2020-10-27T15:33:51
CfgTRES=cpu=12,mem=15923M,billing=12,gres/gpu=1,gres/gpu:gtx1070ti=1
AllocTRES=cpu=12,gres/gpu=1,gres/gpu:gtx1070ti=1
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Second Group
NodeName=fang50 Arch=x86_64 CoresPerSocket=24
CPUAlloc=48 CPUTot=48 CPULoad=48.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:rtx2080ti:4
NodeAddr=fang50 NodeHostName=fang50 Version=19.05.5
OS=Linux 5.7.12-arch1-1 #1 SMP PREEMPT Fri, 31 Jul 2020 17:38:22 +0000
RealMemory=64245 AllocMem=0 FreeMem=807 Sockets=1 Boards=1
State=ALLOCATED ThreadsPerCore=2 TmpDisk=32122 Weight=1 Owner=N/A MCS_label=N/A
Partitions=deflt,long
BootTime=2020-12-15T10:09:43 SlurmdStartTime=2020-12-15T10:14:17
CfgTRES=cpu=48,mem=64245M,billing=48,gres/gpu=4,gres/gpu:rtx2080ti=4
AllocTRES=cpu=48,gres/gpu=4,gres/gpu:rtx2080ti=4
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
I use in the script of Slurm the flag -c, --cpus-per-task = integer; and in the input for Orca the command %pal nprocs integer end. I tested different combinations of this two parameters in order to see if I am using more CPU than the available:
-c, --cpus-per-task = integer
%pal nprocs integer end
None
6
None
3
None
2
1
2
1
12
2
6
3
4
12
12
With different amount of memories: 8000 MBi and 2000 MBi (my total memory is around 15 GBi). And in all the cases the same error appears. I am not an expert user neither in ORCA non in informatic (but maybe you guess this for the extension of the question), so maybe the solution is simple but I really don’t have it, Idon't know what's going on!
A lot of thanks in advance,
Alejandro.

Faced the same issue.
Explicit declaration --prefix ${OMPI_HOME} directly as ORCA parameter and using of static linked ORCA version helps me:
export RSH_COMMAND="/usr/bin/ssh"
export PARAMS="--mca routed direct --oversubscribe -machinefile ${HOSTS_FILE} --prefix ${OMPI_HOME}"
$ORCA_DIR/orca $WORKDIR/$JOBFILE.inp "$PARAMS" > $WORKDIR/$JOBFILE.out
Also, It's better to build OpenMPI 3.1.x with --disable-builtin-atomics flag.

Thank you #Alexey for your answer. And sorry for the wrong Tag, like I said, I am pretty rookie on this stuff.
The problem was not in the Orca or OpenMPI configuration but in the bash script used for scheduled the Slurm job.
I thought that the entire Orca job itself was what Slurm call a "task". For that reason I declared the flag --cpus-per-task equal to the number of parallel jobs that I want to do with Orca. But the problem is that each parallel Orca job (that is launch using OpenMPI) is a task for Slurm. Therefore with my Slurm script I was reserving a node with at least 12 CPU, but when Orca launch their parallel jobs, each one ask for 12 CPU, so: "There are not enough slots available ..." because I needed 144 CPU.
The rest of the cases in the table of my Question fails for another reason. I was launching at the same time 5 different Orca calculation. Now, because --cpus-per-task could be None, 1, 2 or 3; the five calculation might enter in the same node or in another node with this amount of free CPU, but when Orca ask for the parallel jobs, fail again because there are not this amount of CPU on the node.
The solution that I found is pretty simple. On the .sh script for Slurm I putted this:
#SBATCH --mincpus=n*m
#SBATCH --ntasks=n
#SBATCH --cpus-per-task m
Instead of only:
#SBATCH --cpus-per-task m
Where n will be equal to the number of parallel jobs specified on the Orca input (%pal nprocs n end) and m the number of CPU that you want to use for each parallel Orca job.
In my case I used n = 12, m = 1. With the flag --mincpus I ensured to take a node with at least 12 CPU and allocated them. With the --cpus-per-task is pretty evident what this flag do (even for me :-) ), which, by the way, has a default value of 1 and I don't know if more than 1 CPU for each OpenMPI Orca job improve the velocity of the calculation. And --ntasks gives the information to Slurm of how many task you will do.
Of course if you know the number of task and the CPU per task is easy to know how many CPU you need to reserve, but I don't know if this is easy to Slurm too :-). So, to be sure that I allocate the correct number of CPU i used --mincpus flag, but maybe is not needed. The thing is that it works now ^_^.
It is also important to take into account the amount of memory that you declare in the input of Orca in order of do not exceed the available memory. For example, if you have 12 task and a RAM of 15000 MBi, the right amount of memory to declared should be no more than 15000/12 = 1250 MBi

I had a similar problem with parallel jobs before. The slurm also output not enough slots error.
My solution is to change parallel threads into parallel processes. For my system is to change
#SBATCH -c 24
into
#SBATCH -n 24
and everything works just fine.

NuSMV Simulation Using Random Traces

I am attempting to run "random" or non-deterministic simulations of a NuSMV model I have created. However between subsequent runs the trace that is produced is exactly the same.
Here is the model:
MODULE main
VAR x : 0..4;
VAR clk : 0..10;
DEFINE next_x :=
case
x = 0 : {0,1};
x = 1 : {1,2};
x = 2 : {1,0};
TRUE : {0};
esac;
DEFINE next_clk :=
case
(clk < 10) : (clk+1);
TRUE : clk;
esac;
INIT (x = 0);
INIT (clk = 0);
TRANS (next(x) in next_x);
TRANS next(clk) = next_clk;
CTLSPEC AG(clk < 10);
I am running this using the following commands in the interactive shell:
go
pick_state -r
simulate -k -r 30
show_traces 1
quit
Perhaps I have a mistake in my model? Or I am not running the correct commands in the shell.
Thanks in advance!

As far as I can tell after playing around with the tool, I would say that what you experience is a common behaviour due to using pseudo-random generators in a certain way.
Basically, I posit that each time one starts NuSMV void srand(unsigned int seed) is initialised with the same seed value. The obvious result is that NuSMV performs the exact same non-deterministic choices among independent runs, provided that you load the exact same model and perform exactly the same sequence of commands.
This kind of design is common among model checkers because it allows to reproduce potential bug traces reported by users more easily.
After looking at NuSMV -help and NuSMV documentation, it appears to me that the program has no option to manually set an arbitrary seed for the pseudo-random generator. (Note: you might want to contact NuSMV mailing list about this, it may be possible that there exists some internal variable to configure the random seed with the aid of the set command)
Therefore, I would like to propose the following work-around to help you achieving your goal of collecting different, non-deterministic execution traces from the same model. Try:
go
pick_state -r
simulate -r RANDOM_SEED
pick_state -r
simulate -r 30
show_traces 2
quit
Basically, the idea is to exploit the first simulation in order to move forward the pseudo-random generator to an arbitrary point in the pseudo-random chain. Each time you execute this script, you change the value of RANDOM_SEED, so that any two executions of NuSMV have a different starting-point in the pseudo-random generator for the second trace. In this way, NuSMV no longer repeats the same choices it has done in other executions for the second trace, unless that happens by pure chance.
Alternatively, you may obtain all the non-deterministic execution traces you want from a single run of the NuSMV solver:
go
pick_state -r
simulate -r 30
show_traces 1
pick_sate -r
simulate -r 30
show_traces 2
...
pick_state -r
simulate -r 30
show_traces N
quit
Note 1: your model has only one initial state, so pick_state -r always chooses the same initial state.
Note 2: your model reports the following error on my system:
TYPE ERROR file test.smv: line 23 :
illegal operand types of "=" : integer-set and integer
when I type pick_state -i.
Note 3: since NuSMV source code is available, another possible solution is to patch it so as to accept a novel option for setting an arbitrary seed to initialise the pseudo-random generator.

Speed of read from standard in versus file in Matlab Compiled code

I have the following code to read from standard in, in my compiled Matlab code, which I compiled using mcc -m -T link:exe -N -v -R -nojvm -d build test1.m -o test1
function test1()
% First, read input stream
tic;
stdin=char(0);
while ~strcmp(stdin,'EOF')
%while ~isempty(stdin)
stdin=input(char(0),'s');
end
toc;
% Now, read file
tic;
fid=fopen('test1read.txt');
tline=fgetl(fid);
while ischar(tline)
tline=fgetl(fid);
strcmp(tline,'EOF');
end
fclose(fid);
toc;
end
I executed it as type test1stream.txt | test1.exe (on Windows)
For an order 1.5MB file, I get timing output as
Elapsed time is 1.000616 seconds.
Elapsed time is 0.156772 seconds.
I was a bit surpised to see that reading from input is slower than reading from a file.
For an order 1.5GB file (longer and more lines), I get a timing as
Elapsed time is 78.877386 seconds.
Elapsed time is 95.457972 seconds
My first question would be: why is that? I think one reason is that input() is doing more things than necessary. Is this assumption correct?
My second question is: can I speed up the read from input in a stand-alone tool?
See also: Using standard io stream:stdin and stdout in a matlab exe

Linpack sometimes starting, sometimes not, but nothing changed

I installed Linpack on a 2-Node cluster with Xeon processors. Sometimes if I start Linpack with this command:
mpiexec -np 28 -print-rank-map -f /root/machines.HOSTS ./xhpl_intel64
linpack starts and prints the output, sometimes I only see the mpi mappings printed and then nothing following. To me this seems like random behaviour because I don't change anything between the calls and as already mentioned, Linpack sometimes starts, sometimes not.
In top I can see that xhpl_intel64processes have been created and they are heavily using the CPU but when watching the traffic between the nodes, iftop is telling me that it nothing is sent.
I am using MPICH2 as MPI implementation. This is my HPL.dat:
# cat HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
10000 Ns
1 # of NBs
250 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
2 Ps
14 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
edit2:
I now just let the program run for a while and after 30min it tells me:
# mpiexec -np 32 -print-rank-map -f /root/machines.HOSTS ./xhpl_intel64
(node-0:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
(node-1:16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31)
Assertion failed in file ../../socksm.c at line 2577: (it_plfd->revents & 0x008) == 0
internal ABORT - process 0
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
Is this a mpi problem?
Do you know what type of problem this could be?

I figured out what the problem was: MPICH2 uses different random ports each time it starts and if these are blocked your application wont start up correctly.
The solution for MPICH2 is to set the environment variable MPICH_PORT_RANGE to START:END, like this:
export MPICH_PORT_RANGE=50000:51000
Best,
heinrich

Is there a way to limit the memory, ghci can have?

I'm used to debug my code using ghci. Often, something like this happens (not so obvious, of course):
ghci> let f#(_:x) = 0:1:zipWith(+)f x
ghci> length f
Then, nothing happens for some time, and if I don't react fast enough, ghci has eaten maybe 2 GB of RAM, causing my system to freeze. If it's too late, the only way to solve this problem is [ALT] + [PRINT] + [K].
My question: Is there an easy way to limit the memory, which can be consumed by ghci to, let's say 1 GB? If limit is exceed, the calculation should ve aborted or ghci should be killed.

A platform independant way to accomplish this is to supply the -M option as on option to the Haskell runtime like this
ghci +RTS -M1m
see the GHC documentation’s page on how to control the RTS (runtime system) for details.
The ghci output now looks like:
>ghci +RTS -M10m
GHCi, version 6.12.3: http://www.haskell.org/ghc/ :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
Prelude> let f#(_:x) = 0:1:zipWith(+)f x
Prelude> length f
Heap exhausted;
Current maximum heap size is 10485760 bytes (10 MB);
use `+RTS -M<size>' to increase it.

Running it under a shell with ulimit -m set is a fairly easy way. If you want to run with some limit on a regular basis, you can create a wrapper script that does ulimit before running ghci.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio