Adding processes to two different remote hosts

I have several servers that I'm planning to use to run some simulations in Julia. The problem is, I can only add remote processes to a single server. If I try to add the processes to the next server I get an error. This is what I'm trying to do and what I get:
addprocs(["user#host1"], tunnel=true, dir="~/julia-483dbf5279/bin/", sshflags=`-p 6969`)
addprocs(["user#host2"], tunnel=true, dir="~/julia-483dbf5279/bin/", sshflags=`-p 6969`)
id: cannot find name for group ID 350
fatal error on 6: ERROR: connect: host is unreachable (EHOSTUNREACH)
in wait at ./task.jl:284
in wait at ./task.jl:194
in stream_wait at stream.jl:263
in wait_connected at stream.jl:301
in Worker at multi.jl:113
in anonymous at task.jl:905
Worker 6 terminated.
The host is reachable and I can connect to it via SSH. I had a similar problem when adding local processes, as I explained in this Stack Overflow question:
Combining local processes with remote processes in Julia
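One thing worth checking (an assumption about the setup, not something stated in the question) is whether the two remote hosts can reach each other directly: with the default all_to_all topology each Julia worker opens TCP connections to every other worker, not just back to the master, and tunnel=true only covers the master-to-worker SSH tunnels. A quick sanity check from the local machine could be:
ssh -p 6969 user@host1 'ping -c 3 host2'   # can host1 route packets to host2?
ssh -p 6969 user@host2 'ping -c 3 host1'   # and the other way around?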

Related

Running JMeter on master & slave machines, but JMeter script execution never ends on the master, so no report or results are generated

Running JMeter on master and slave machines: the slave shows that the script started and finished, but the master keeps showing "Waiting for possible Shutdown/StopTestNow/HeapDump/ThreadDump message on port 4445", so the script execution never ends on the master and no report or results are generated.
Although the script only contains an HTTP Request with a single thread and should take only a few seconds to execute, I waited a couple of hours but never got the result.
How can I solve this problem?
On both master and slave machines I configured:
installed JDK 1.8.0_271 and JMeter 5.3
in jmeter.properties I added: server_port=4000, client.rmi.localport=4000, server.rmi.port=4000, server.rmi.localport=4000
in user.properties I added: server.rmi.port=9999, server.rmi.localport=4000
"Though this script only contains a HTTP Request with a single thread, for execution it only needs few seconds" - you're running it with 100 threads and a 500-second ramp-up, so it will run for at least 8 minutes (plus the time required for the last user to execute the last iteration of the sampler).
The fact that the slave cannot report the test-finished event and the results back to the master means that the slave cannot properly communicate with the master.
Given all of the above, you need to open 2 ports on the slave, i.e.
4000 for the SERVER_PORT
5000 for the server.rmi.localport
and run your slave as jmeter-server -Dserver.rmi.localport=5000 -Dserver_port=4000 -Jclient.rmi.localport=4000
master should be executed as jmeter -Jclient.rmi.localport=4000
All the aforementioned ports must be opened in the firewall.
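For example, if the slave is a Linux machine running firewalld (an assumption; the question doesn't say which OS or firewall is in use), opening the two ports could look like this:
sudo firewall-cmd --permanent --add-port=4000/tcp   # SERVER_PORT and client.rmi.localport
sudo firewall-cmd --permanent --add-port=5000/tcp   # server.rmi.localport
sudo firewall-cmd --reload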
More information:
Remote hosts and RMI configuration
Using a different port
JMeter Distributed Testing with Docker

SSH timeout error on Azure DevOps CD pipeline

I am getting a timeout error when trying to deploy to a VM instance hosted on AWS. Manually, I can log in using
ssh -i myKeyFile.pem myuser@IP
Once I access the remote machine I can execute some Docker commands and everything works fine. But now that I need to automate that in the CD pipeline, I am getting the following error:
2020-06-02T21:37:12.6877276Z Trying to establish an SSH connection to ***@IP:port
2020-06-02T21:38:52.4629461Z ##[error]Failed to connect to remote machine. Verify the SSH service connection details. Error: Error: Timed out while waiting for handshake.
2020-06-02T21:38:52.4685976Z ##[section]Finishing: Run shell commands on remote machine
The steps I follow to make the SSH connection are:
I created an SSH service connection in the project settings in Azure DevOps
I created the CD pipeline
I added an SSH task with the following parameters
When I manually trigger it to test whether it works, the release starts fine, but after roughly 1:43 minutes I get the error:
Then, when I review the logs, it is the same error I pasted at the beginning:
[error]Failed to connect to remote machine. Verify the SSH service connection details. Error: Error: Timed out while waiting for handshake
I've increased the handshake timeout setting from the default (20000) to 90000, but no luck.
Has anyone faced this problem before?
It seems there is an ongoing issue with the default agent pools in Azure DevOps. A lot of people have reported this, and the Azure DevOps team is working on it at the time this post is being written (I couldn't find the post where all of that is detailed; I will add it later on).
The workaround is to create a self-hosted agent.
After it has been created, you will need to re-create your CD pipeline using the new self-hosted agent.
The rest of the SSH task configuration depends on your needs, but if you want to test that the SSH connection works, just print something:
echo "I'm connected"
After this, your CD pipeline should be working fine.
More details on how to create the self-hosted agent on Windows are in the Azure DevOps documentation; there are also links for Linux and Mac.
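As a rough sketch of what registering a self-hosted Linux agent looks like (the organization URL, PAT, pool name, and package version below are placeholders, not values from this question):
mkdir myagent && cd myagent
tar zxvf ~/Downloads/vsts-agent-linux-x64-2.x.x.tar.gz   # agent package downloaded from Azure DevOps
./config.sh --url https://dev.azure.com/yourorg --auth pat --token <PAT> --pool Default
./run.sh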
I had a similar issue with a VM in Azure. It turned out I had set the security group to only allow SSH in from my local network, while Azure DevOps agents obviously run in a Microsoft network and were coming from a different IP address range. The solution was to open up SSH to all source IP addresses. You can get the list of IP address ranges DevOps agents use, but they appear to change every week, which isn't very helpful.
See https://learn.microsoft.com/en-us/azure/devops/organizations/security/allow-list-ip-url?view=azure-devops#microsoft-hosted-agents
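If you go the route of opening SSH up to all sources, a hedged example with the Azure CLI (the resource group and NSG names are placeholders) would be:
az network nsg rule create --resource-group myResourceGroup --nsg-name myVmNsg \
  --name AllowSshInbound --priority 300 --direction Inbound --access Allow \
  --protocol Tcp --destination-port-ranges 22 --source-address-prefixes '*'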

How can I get condor collector to run

I have installed HTCondor on my cluster of Dell OptiPlex 390s, all running CentOS 8, and I am not able to run condor_status; I get the following error: Error: can't find collector
I am new to using Condor, and all I want is a master node that can manage jobs and execute them, with the rest of the nodes just executing jobs. I have opened port 9618/tcp on all the nodes to run the daemon.
OK, well there are two possibilities: one, the collector isn't running, and two, it is running but condor_status can't find it.
Let's start with potential problem number one. If you run
ps auxww | grep condor_collector
on the machine that should be the central manager, is there a collector process running?
If so, that's good.
The fix for problem two is to set the condor_config variable COLLECTOR_HOST to point to this machine, e.g.
COLLECTOR_HOST = my_central_manager
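As a minimal sketch (the host name and domain below are placeholders, and the exact knobs can vary between HTCondor versions), the central manager and the execute nodes could share a local config fragment along these lines:
# on every node, e.g. in /etc/condor/config.d/10-local.conf
CONDOR_HOST = my_central_manager
COLLECTOR_HOST = $(CONDOR_HOST)
ALLOW_WRITE = *.my.domain
# central manager only:
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD
# execute nodes only:
DAEMON_LIST = MASTER, STARTD
After restarting the condor service on each node, condor_status should then be able to find the collector on my_central_manager.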

RabbitMQ Erlang distribution failed

I have two Windows Server 2012 R2 machines located in one of the client's datacenters. Both servers are domain-joined. They both have RabbitMQ 3.6.0 installed on them. RabbitMQ is running as a Windows service on both machines. I've tried to cluster these two machines for a long time now without success. I always get the following error when I try to cluster them.
On the first machine, nodeA, I run the command 'rabbitmqctl join_cluster rabbit@nodeB'. This is what I get:
Clustering node 'rabbit@nodeA' with 'rabbit@nodeB' ...
Error: unable to connect to nodes ['rabbit@nodeB']: nodedown
DIAGNOSTICS
===========
attempted to contact: ['rabbit@nodeB']
rabbit@nodeB:
* connected to epmd (port 4369) on nodeB
* epmd reports node 'rabbit' running on port 25672
* TCP connection succeeded but Erlang distribution failed
* suggestion: hostname mismatch?
* suggestion: is the cookie set correctly?
* suggestion: is the Erlang distribution using TLS?
current node details:
- node name: 'rabbitmq-cli-3892@nodeA'
- home dir: C:\Users\mydirectory
- cookie hash: l+SSu57+cRyAQ03AJdwAbQ==
I've tried this setup with Azure Virtual Machines within an Azure Virtual Network and succeeded in clustering the two VMs; however, it seems I cannot connect these two (the customer's machines) together.
This is what I have done and ensured:
There isn't any firewall blocking connections
Added host names to the hosts file located at C:\Windows\system32\drivers\etc
Tried to refer to host names as FQDNs without adding anything to the hosts file
Tried to refer to host names with CAPITAL letters and without
Copied the same exact .erlang.cookie to C:\Windows and C:\Users\mydirectory on both machines.
I've read, understood, and applied the RabbitMQ Clustering Guide: https://www.rabbitmq.com/clustering.html
Stopped, restarted, reinstalled RabbitMQ on both machines.
It seems I can't get it to work. On the Azure machines, which were not domain-joined, clustering worked beautifully. I am really running out of options... Any help?
I had the same problem. You need to install RabbitMQ as an admin: uninstall, then reinstall as admin, and it should work fine.
Try to connect to each of the RabbitMQ nodes via a remote shell and check whether the value of the cookie is the same (the cookie can be set in 3 different ways; .erlang.cookie is one of them).
erl -remsh 'rabbitmq-cli-3892@nodeA' -name 'test@nodeA'
erlang:get_cookie().
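A quick, hedged way to compare the cookie files themselves on the two Windows machines (the paths are the ones mentioned in the question) is:
rem run these on both nodeA and nodeB; every value must be identical
type C:\Windows\.erlang.cookie
type C:\Users\mydirectory\.erlang.cookie
If any of them differ, the Erlang distribution handshake will fail exactly as shown in the diagnostics above.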

Running MPI on two hosts

I've looked through many examples and I'm still confused. I've compiled a simple latency check program from here, and it runs perfectly on one host, but when I try to run it on two hosts it hangs. However, running something like hostname runs fine:
[hamiltont@4 latency]$ mpirun --report-bindings --hostfile hostfile --rankfile rankfile -np 2 hostname
[4:16622] [[5908,0],0] odls:default:fork binding child [[5908,1],0] to slot_list 0
4
[5:12661] [[5908,0],1] odls:default:fork binding child [[5908,1],1] to slot_list 0
5
But here is the compiled latency program:
[hamiltont@4 latency]$ mpirun --report-bindings --hostfile hostfile --rankfile rankfile -np 2 latency
[4:16543] [[5989,0],0] odls:default:fork binding child [[5989,1],0] to slot_list 0
[5:12582] [[5989,0],1] odls:default:fork binding child [[5989,1],1] to slot_list 0
[4][[5989,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 10.0.2.5 failed: Connection timed out (110)
My current guess is that there is something wrong with my firewall rules (e.g. hostname does not communicate between hosts, but the latency program does).
[hamiltont@4 latency]$ cat rankfile
rank 0=10.0.2.4 slot=0
rank 1=10.0.2.5 slot=0
[hamiltont@4 latency]$ cat hostfile
10.0.2.4 slots=2
10.0.2.5 slots=2
There are two kinds of communication involved in running an Open MPI job. First the job has to be launched. Open MPI uses a special framework to support many kinds of launches and you are probably using the rsh remote login launch mechanism over SSH. Obviously your firewall is correctly set up to allow SSH connections.
When an Open MPI job is launched and the processes are true MPI programs, they connect back to the mpirun process that spawned the job and learn all about the other processes in the job, most importantly the available network endpoints at each process. This message:
[4][[5989,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 10.0.2.5 failed: Connection timed out (110)
indicates that the process which runs on host 4 is unable to open a TCP connection to the process which runs on host 5. The most common reason for that is the presence of a firewall, which limits the inbound connections. So checking your firewall is the first thing to do.
Another common reason is if on both nodes there are additional network interfaces configured and up, with compatible network addresses, but without the possibility to establish connection between them. This often happens on newer Linux setups where various virtual and/or tunnelling interfaces are being brought up by default. One can instruct Open MPI to skip those interfaces by listing them (either as interface names or as CIDR network addresses) in the btl_tcp_if_exclude MCA parameter, e.g.:
$ mpirun --mca btl_tcp_if_exclude "127.0.0.1/8,tun0" ...
(one always has to add the loopback interface when setting btl_tcp_if_exclude)
or one can explicitly specify which interfaces should be used for communication by listing them in the btl_tcp_if_include MCA parameter:
$ mpirun --mca btl_tcp_if_include eth0 ...
Since the IP address in the error message matches the address of your second host in the hostfile, the problem most likely comes from an active firewall rule.
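As a hedged example of opening that path up, assuming plain iptables is what's active on these hosts (Open MPI's TCP BTL listens on random high ports by default, so the rule simply allows all TCP from the peer node):
# on 10.0.2.5, allow inbound TCP from the other node
sudo iptables -I INPUT -p tcp -s 10.0.2.4 -j ACCEPT
# and the mirror rule on 10.0.2.4
sudo iptables -I INPUT -p tcp -s 10.0.2.5 -j ACCEPT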

Resources