mpiexec using wrong number of cpus - cluster-computing

I am trying to set up an MPI cluster, but the number of CPUs I specified per host in the mpd.hosts file is not being used correctly.
I have three Ubuntu servers.
opteron with 48 Cores
calc1 with 8 Cores
calc2 with 8 Cores.
My mpd.hosts looks like:
opteron:46
calc1:6
calc2:6
After booting (mpdboot -n 3 -f mpd.hosts), the system is running.
mpdtrace -> all three of them are listed.
But running a program like "mpiexec -n 58 raxmlHPC-MPI ..." results in calc1 and calc2 getting too many jobs while opteron gets too few.
What am I doing wrong?
Regards
Bjoern

I found a workaround.
I used the additional parameter "-machinefile /path/to/mpd.hosts" for the mpiexec command. And now, all nodes are running correctly.
One problem was that I got the following error message:
... MPIU_SHMW_Seg_create_attach_templ(671): open failed - No such file or directory ...
To fix it, I had to set the environment variable MPICH_NO_LOCAL=1

As you figured out, you must pass the machinefile to both mpdboot and mpiexec in order to use per-host process counts. The "open failed" issue is a known bug in MPD, the process manager you are using. Note that the MPICH_NO_LOCAL=1 workaround will work, but will probably result in a big performance penalty for intranode communication.
You are clearly using MPICH2 (or an MPICH2 derivative), but it's not clear what version you are using. If you can, I would strongly recommend upgrading to either MPICH2 1.2.1p1 or (better yet) 1.3.1. Both of these releases include a newer process manager called hydra that is much faster and more robust. In 1.3.1, hydra is the default process manager. It doesn't require an mpdboot phase, and it supports a $HYDRA_HOST_FILE environment variable so that you don't have to specify the machine file on every mpiexec.
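As a rough sketch of the advice above, the snippet below writes the hosts file from the question to a temporary location (the path is a hypothetical choice), sums the per-host counts, and shows in comments how the same file would be passed to mpdboot and mpiexec. The helper name total_slots is made up for illustration; the MPI commands are left as comments so the snippet runs on a machine without MPICH2.

```shell
#!/bin/sh
# Sum the per-host process counts in an MPD-style hosts file (one host:count per line).
# total_slots is a hypothetical helper, not part of MPICH2.
total_slots() {
    awk -F: 'NF >= 2 { sum += $2 } END { print sum + 0 }' "$1"
}

# Hypothetical copy of the hosts file from the question.
hosts_file="${TMPDIR:-/tmp}/mpd.hosts.example"
printf '%s\n' 'opteron:46' 'calc1:6' 'calc2:6' > "$hosts_file"

np=$(total_slots "$hosts_file")
echo "hosts file provides $np slots"

# Pass the same file to both steps (not run here):
#   mpdboot -n 3 -f "$hosts_file"
#   mpiexec -machinefile "$hosts_file" -n "$np" raxmlHPC-MPI ...
# With hydra, the answer notes the file can be named once instead:
#   export HYDRA_HOST_FILE="$hosts_file"
```

Checking that the -n value matches the sum of the counts (46 + 6 + 6 = 58 here) is a cheap sanity check before launching a long job.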

Related

How can I get condor collector to run

I have installed HTCondor on my cluster of Dell Optiplex 390s, all running CentOS 8, but I am not able to run condor_status. I get the following error --> Error: can't find collector
I am new to Condor. All I want is a master node that can manage jobs and execute them, with the rest of the nodes just executing jobs. I have opened port 9618/tcp on all the nodes to run the daemon.
Ok, well there are two possibilities: One, the collector isn't running, and two, it is running, but condor_status can't find it.
Let's start with potential problem number one. If you run
ps auxww | grep condor_collector
on the machine that should be the central manager, is there a collector process running?
If so, that's good.
For the second problem, set the condor_config variable COLLECTOR_HOST to point to this machine, e.g.
COLLECTOR_HOST = my_central_manager
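Both checks can be combined into a small script, sketched here under the assumption that it runs on the intended central manager. The helper name has_collector is hypothetical; condor_config_val is the standard HTCondor tool for printing a configuration value, and it is only called if it is found on the PATH.

```shell
#!/bin/sh
# Read a process listing on stdin and succeed if it mentions condor_collector.
# has_collector is a hypothetical helper name.
has_collector() {
    # The [c] bracket keeps grep from matching its own command line in ps output.
    grep -q '[c]ondor_collector'
}

# Check 1: is a collector process running on this machine?
if ps auxww 2>/dev/null | has_collector; then
    echo "collector process found"
else
    echo "no collector process on this machine"
fi

# Check 2: which host will condor_status try to contact?
if command -v condor_config_val >/dev/null 2>&1; then
    condor_config_val COLLECTOR_HOST
fi
```

If the second check prints a host other than the central manager, that points to the COLLECTOR_HOST setting rather than to the collector itself.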

Why are my bash scripts refusing to run until I type 'exit' ever since I used 'script /dev/null'?

I am working on a cluster, the 'home' machine of which runs Linux version 2.6.38.4 on Debian 4.3.2-1.1. Other machines on the cluster run more recent versions of Linux (3.x.x.x) but on differing flavours (some Redhat, some Debian etc).
As usual, I transferred to chos 8 on one of these machines and set a script running in a screen, but the server began denying that there were any sockets available when I went to reattach it. I followed advice I found online and typed ‘script /dev/null’ in order to retrieve my screens, but it keeps happening. Also, when I start a new screen now, the command prompt is preceded by ‘(base)’.
Now, if I try to run a bash script on anything other than the home machine, the scripts won't run until I follow the command with 'exit', as follows:
bash ~/DAPHNIA/Scripts/compare_BUSCO_depths.sh 2 21 3 3 2-WGS_Clone_21_CGCTATGT-GTGTCGGA_L001;
exit;
The contents of the script don't seem to matter - this irritating quirk now happens regardless of the script being run.
Does anybody have any idea a) what has caused this, and b) how I can fix it, please?

two greenplum installation on the same machine

I have an old version of greenplum and I would like to upgrade to version 5.0.0 since it has been released. https://github.com/greenplum-db/gpdb/releases/tag/5.0.0.
I have a huge machine, and I cannot simply get an equivalent one, so I would like to know how I can run both versions on the same machine. I have seen, for example, that gpseginstall distributes binaries to /usr/local/gpdb, which is already in use by the old version.
Regards
I have run multiple versions in parallel on a single node system.
You need to give each version its own gpinitsystem config file, with different segment/mirror directories, master port, starting port, etc.
You will also need two different OS profiles to source, so that when you log in as gpadmin you can source your 4.3 or 5.0 paths ($GPHOME, $MASTER_DATA_DIRECTORY) for gpstart, gpstop, psql, etc.
Hope this makes sense. I haven't tried it on a multi-node system, but the setup should be the same.
For example:
GPDB 4.3
ARRAY_NAME="GPDB"
MACHINE_LIST_FILE=./hostsfile
SEG_PREFIX=seg
PORT_BASE=40000
declare -a DATA_DIRECTORY=(/gpsegment4 /gpsegment4 /gpsegment4 /gpsegment4)
MASTER_HOSTNAME=mdw
MASTER_DIRECTORY=/gpmaster4
MASTER_PORT=5432
TRUSTED_SHELL=ssh
CHECK_POINT_SEGMENTS=8
ENCODING=UNICODE
DATABASE_NAME=gpadmin
#MIRROR_PORT_BASE=50000
REPLICATION_PORT_BASE=41000
#MIRROR_REPLICATION_PORT_BASE=51000
#declare -a MIRROR_DATA_DIRECTORY=(/mirror4 /mirror4 /mirror4 /mirror4)
GPDB 5.0
ARRAY_NAME="GPDB"
MACHINE_LIST_FILE=./hostsfile
SEG_PREFIX=seg
PORT_BASE=60000
declare -a DATA_DIRECTORY=(/gpsegment5 /gpsegment5 /gpsegment5 /gpsegment5)
MASTER_HOSTNAME=mdw
MASTER_DIRECTORY=/gpmaster5
MASTER_PORT=7432
TRUSTED_SHELL=ssh
CHECK_POINT_SEGMENTS=8
ENCODING=UNICODE
DATABASE_NAME=gpadmin
#MIRROR_PORT_BASE=70000
REPLICATION_PORT_BASE=61000
#MIRROR_REPLICATION_PORT_BASE=71000
#declare -a MIRROR_DATA_DIRECTORY=(/mirror5 /mirror5 /mirror5 /mirror5)
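The two OS profiles mentioned above could look something like the following sketch. The function name use_gpdb and the install paths under /usr/local are assumptions for illustration; the ports and master directories are taken from the two configs, and the master data directory is assumed to follow gpinitsystem's usual MASTER_DIRECTORY/SEG_PREFIX-1 layout.

```shell
#!/bin/sh
# Select one of the two Greenplum installations in the current shell.
# use_gpdb and the GPHOME paths are hypothetical; adjust them to the real install locations.
use_gpdb() {
    case "$1" in
        4.3) GPHOME=/usr/local/greenplum-db-4.3
             PGPORT=5432
             MASTER_DATA_DIRECTORY=/gpmaster4/seg-1 ;;
        5.0) GPHOME=/usr/local/greenplum-db-5.0.0
             PGPORT=7432
             MASTER_DATA_DIRECTORY=/gpmaster5/seg-1 ;;
        *)   echo "usage: use_gpdb 4.3|5.0" >&2
             return 1 ;;
    esac
    export GPHOME PGPORT MASTER_DATA_DIRECTORY
    # greenplum_path.sh ships with each install and sets PATH and library paths.
    if [ -f "$GPHOME/greenplum_path.sh" ]; then
        . "$GPHOME/greenplum_path.sh"
    fi
}
```

After sourcing this from the gpadmin profile, running use_gpdb 5.0 before gpstart, gpstop, or psql would point those tools at the 5.0 instance on port 7432, and use_gpdb 4.3 would switch back.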
I have seen setups where different versions are installed and the greenplum-db link is changed to point to the one you want to run. That link is referenced when you enter gpstart. I am not sure how you could have two different versions running at the same time on the same machine.
If your goal is to do upgrade dry runs and test on the new release, another alternative could be to deploy a Greenplum cluster using Microsoft Azure. This would deploy the latest version (5.0).
Sounds like you know how to build your own greenplum so you could delete that 5.0 install then install the version you are currently using, then practice the upgrade/migration as well as just kick the tires of 5.0.
You could also easily have side by side systems in Azure; one running your current release and the other running 5.0.
The smallest cluster you can deploy is 1 master and 1 segment which could be adequate depending on your requirements.
Hope this helps

Jenkins job windows batch execution 20 times slower than executing in cmd.exe

I just installed Jenkins 2.46.2 on a Windows 2012 Server \o/. It runs as a system service.
I created a job that executes a Windows batch (.bat) script to build a code project. This batch runs 2 mingw32-make.exe commands to clean and then build a full binary from source code.
Executing the batch manually on the machine, located on the same filesystem (same workspace as used by the Jenkins job, local disk, not a network disk), the clean-build takes ~50 seconds.
But when executed by Jenkins, the job takes more than 20x longer (~19 minutes). It terminates successfully, with the same behavior as when executed manually in cmd.exe.
I changed the launch arguments for the JVM in the jenkins.xml file to "-Xmx1024m -XX:MaxPermSize=512m", as I had read in the documentation that this improves performance. But it does not fix anything :-(
Also, when I monitor CPU/disk/RAM usage, they all stay very low while building, so I deduce that the raw performance of the machine is not the cause.
Whether or not I invoke the batch with a call statement in the Jenkins job build step does not change anything: the job always lasts 19 minutes.
Can anybody help me investigate why it is so slow?
Thanks in advance :)
I had a similar problem. I noticed that .bat files with echo Hello World ran fast and with no problem.
But once I tried to launch any grep.exe from a batch script, it took 24 seconds (in my case) to run even with no input files. If launched manually it finishes in no time.
I used grep.exe version 2.5.4 from MSys 1.0 distribution.
The solution in my case was rather unexpected: I updated grep to version 2.24, and now, when launched from Jenkins, it takes less than one second to process a log file of over 1 MB.
After a couple of days of investigation, I finally found the cause.
In my case, the cause was the Jenkins agent.
When I install the Jenkins agent as a Windows service on the slave machine, the execution time is huge, but when I start the Jenkins agent from the Windows command line, the execution time is the same as when running the batch file manually.
My env:
master: CentOS7
slave agent: win 7
I also tested this case on a Windows 10 slave agent for comparison.
There, the execution time via Jenkins is approximately the same as when running the batch file manually on the agent machine.
So I guess this is a compatibility issue between Windows 7 and Jenkins.
But since Jenkins officially no longer supports Windows 7 (Microsoft does not support Windows 7 either), we have put it aside for now.
Anyway, we found a way around this. I hope it helps you in a similar scenario.

Apache 2 - reload config on Windows

I have a PHP script that modifies my httpd.conf file, so I need to automatically reload it in Apache.
On Linux, there is graceful restart, but on Windows (I use the restart command) it terminates all the current connections. Is there an equivalent of graceful restart on Windows? Is there a workaround for this?
Yes, you should use the -k switch.
httpd.exe -k restart or apache.exe -k restart
More info here as well: http://www.zrinity.com/developers/apache/usage.cfm
Edit:
It shouldn't; that is the point of graceful. Notice I used the -k. That is not the same as a normal restart. It lets the current sessions complete their tasks while the config is being reread, so that it will start taking new requests immediately.
From the documentation:
The USR1 or graceful signal causes the parent process to advise the children to exit after their current request (or to exit immediately if they're not serving anything). The parent re-reads its configuration files and re-opens its log files. As each child dies off the parent replaces it with a child from the new generation of the configuration, which begins serving new requests immediately.
http://httpd.apache.org/docs/2.2/stopping.html#graceful
It's doing what you are asking for.
Edit 2:
I am adding this link and have given both possible versions because some people think there is only one specific way to do something instead of searching for themselves.
http://httpd.apache.org/docs/2.4/platform/windows.html#wincons
I think I'm just going to delete this answer because either people can't read or if it doesn't work for someone it gets a DV. There are different windows versions made by different developers. If it doesn't work look for the answer from them. Even Linux has different commands depending on the distro. geez
In the newest Apache 2.4.20 VC10 the "httpd -k restart" command actually DOES do a graceful restart. It won't drop any connections, for example if somebody is downloading something from your server, it WILL NOT interrupt this process. One more proof is that "-k restart" will not reset your server statistics that mod_status provides, won't even alter the "Restart Time" value.
Although the "httpd -k graceful" and "httpd -k graceful-stop" commands are available on Windows, they will not work and give the error "couldn't make a socket".
