parallel ipython/ipcluster through head node

I want to use the parallel capabilities of ipython on a remote computer cluster. Only the head node is accessible from the outside. I have set up ssh keys so that I can connect to the head node with e.g. ssh head and from there I can also ssh into any node without entering a password, e.g. ssh node3. So I can basically run any commands on the nodes by doing:
ssh head ssh node3 command
Now what I really want to do is to be able to run jobs on the cluster from IPython on my own computer. The way to set up the hosts to use in ipcluster is:
send_furl = True
engines = { 'host1.example.com' : 2,
            'host2.example.com' : 5,
            'host3.example.com' : 1,
            'host4.example.com' : 8 }
But since I only have a host name for the head node, I don't think I can do this. One option is to set up SSH tunneling on the head node, but I cannot do this in my case, since this requires enough ports to be open to accommodate all the nodes (and this is not the case). Are there any alternatives?

I use ipcluster on the NERSC clusters through the PBS queue:
http://ipython.org/ipython-doc/stable/parallel/parallel_process.html#using-ipcluster-in-pbs-mode
In summary, you submit jobs that run mpiexec ipengine (after having launched ipcontroller on the login node). Do you have PBS on your cluster?
This was working fine with IPython 0.10; it is currently broken in the 0.11 alpha.
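For reference, PBS mode boils down to submitting a batch job that launches the engines once ipcontroller is up on the login node. A minimal sketch of such a job script (the node counts, walltime, and engine count are placeholders of mine, not from the linked docs):
#!/bin/sh
#PBS -N ipython-engines
#PBS -l nodes=2:ppn=4
#PBS -l walltime=00:30:00
cd $PBS_O_WORKDIR
# start 8 engines; each connects back to the ipcontroller on the login node
mpiexec -n 8 ipengine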

I would set up a VPN server on the master, and connect to that with a VPN client on my local machine. Once established, the virtual private network will allow all of the slaves to appear as if they're on the same LAN as my local machine (on a "virtual" network interface, in a "virtual" subnet), and it should be possible to ssh to them.
You could possibly establish that VPN over SSH ("ssh tunneling", as you mention); other options are OpenVPN and IPsec.
I don't understand what you mean by "this requires enough ports to be open to accommodate all the nodes". You will need: (i) one inbound port on the master, to provide the VPN/tunnel, (ii) inbound SSH on each slave, accessible from the master, (iii) another inbound port on each slave, over which the master drives the IPython engines. Wouldn't (ii) and (iii) be required in any setup? So all we've added is (i).
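For what it's worth, the "VPN over SSH" variant can be done with OpenSSH's built-in tun support alone. A rough sketch, assuming root on both ends, PermitTunnel yes in the master's sshd_config, and placeholder addresses throughout:
# on the local machine (as root); the tunnel lives as long as this connection
ssh -w 0:0 root@head
# still on the local machine: bring up the local end and route the cluster subnet
ip addr add 10.1.1.1/30 dev tun0 && ip link set tun0 up
ip route add 192.168.0.0/24 via 10.1.1.2
# on head: bring up the remote end and let it forward traffic to the nodes
ip addr add 10.1.1.2/30 dev tun0 && ip link set tun0 up
sysctl -w net.ipv4.ip_forward=1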

NTP syncing time from master to child Pis

Syncing Pis to a Master in Arch Linux Bash.
I am trying to configure a master Pi as an ntp server (which uses an external pool for time).
I have several child Pis that I need to configure to use the master Pi as their sync source.
Ideally I would like the master to act as the time source for all the Pis connected to it, and to gather the results of ntpq -pn from all Pis on the master.
I have set up the configuration on the master to use the pool to get the time. I have three child Pis that are configured to get time from the master's IP.
Any ideas on how to do this or links would be greatly appreciated.
Thanks,
Ron
It sounds like you're trying to gather the results of running ntpq -pn on all of the Pis. This kind of problem is not NTP-specific, but rather a general remote-execution task. SSH is well-documented and standard; Salt Stack is more advanced and powerful.
SSH
Set up SSH keys between the master and all other Pis.
Loop over the hosts with: for h in child1 child2; do ssh "$h" ntpq -pn > "${h}.output"; done
Salt Stack
Set up the Salt master on the master Pi.
Set up the child Pis as Salt minions.
Use Salt to run a command on all minions: salt '*' cmd.run 'ntpq -pn'
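For completeness, the NTP side of the setup described in the question usually amounts to a few lines of /etc/ntp.conf; a rough sketch, with placeholder pool names, subnet, and master address:
# on the master Pi: take time from an external pool...
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
# ...and let the child Pis' subnet query this server
restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap
# on each child Pi: use the master as the time source
server 192.168.1.10 iburst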

Creating a multi-headnode HPC cluster

I have an HPC cluster where several webapps are installed in Docker containers; the queue is managed using Torque. Every app submits jobs to the HPC cluster by connecting to it through SSH and then running qsub: ssh user@cluster qsub bla blabla. There are shared folders for exchanging data.
I am not satisfied with this setup and I'd like to know if it is possible to have a master node running in each Docker container and use qsub directly inside it, without making an SSH connection. I'd prefer to use Torque but I am open to other solutions.
Torque permits multiple submission hosts.
The names or addresses of the hosts should be added to the submit_hosts variable in the Torque server configuration; here is a relevant page from the manual.
qmgr -c 'set server submit_hosts = headnode'
qmgr -c 'set server submit_hosts += app1'
qmgr -c 'set server submit_hosts += app2'
This assumes app1 and app2 are resolvable names of the Docker containers, so you will need to configure name resolution (one option is sketched below).
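If you are not running DNS for the containers, one minimal option is static entries in /etc/hosts on the Torque server host; the bridge IPs below are hypothetical:
172.17.0.2 app1
172.17.0.3 app2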
For more details and other options, see the Torque manual.

RabbitMQ Erlang distribution failed

I have two Windows Server 2012 R2 machines located in one of the client's datacenters. Both servers are domain-joined. They both have RabbitMQ 3.6.0 installed on them. RabbitMQ is running as a Windows Service on both machines. I've tried to cluster these two machines for a long time now without success. I always get the following error when I try to cluster them.
On the first machine, nodeA, I run the command 'rabbitmqctl join_cluster rabbit@nodeB'. This is what I get:
Clustering node 'rabbit@nodeA' with 'rabbit@nodeB' ...
Error: unable to connect to nodes ['rabbit@nodeB']: nodedown
DIAGNOSTICS
===========
attempted to contact: ['rabbit@nodeB']
rabbit@nodeB:
* connected to epmd (port 4369) on nodeB
* epmd reports node 'rabbit' running on port 25672
* TCP connection succeeded but Erlang distribution failed
* suggestion: hostname mismatch?
* suggestion: is the cookie set correctly?
* suggestion: is the Erlang distribution using TLS?
current node details:
- node name: 'rabbitmq-cli-3892@nodeA'
- home dir: C:\Users\mydirectory
- cookie hash: l+SSu57+cRyAQ03AJdwAbQ==
I've tried this setup with Azure Virtual Machines within an Azure Virtual Network and succeeded in clustering the two VMs; however, it seems I cannot connect these two (the customer's) machines together.
This is what I have done and ensured:
There isn't any firewall blocking connections
Added host names to the hosts file located at C:\Windows\system32\drivers\etc
Tried referring to the host names as FQDNs without adding anything to the hosts file
Tried referring to the host names with CAPITAL letters and without
Copied the same exact .erlang.cookie to C:\Windows and C:\Users\mydirectory on both machines.
I've read, understood and applied the RabbitMQ Clustering Guide: https://www.rabbitmq.com/clustering.html
Stopped, restarted, reinstalled RabbitMQ on both machines.
It seems I can't get it to work. On the Azure machines, which were not domain-joined, clustering worked beautifully. I am really running out of options... Any help?
I had the same problem. You need to install RabbitMQ as an admin: uninstall, then reinstall as admin, and it should work fine.
Try to connect to each of the RabbitMQ nodes via a remote shell and check whether the value of the cookie is the same (the cookie can be set in 3 different ways; .erlang.cookie is one of them).
erl -remsh 'rabbitmq-cli-3892@nodeA' -name 'test@nodeA'
erlang:get_cookie().
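You can also compare the cookie files directly on each machine instead of going through a remote shell; a quick check from a Windows command prompt, using the paths from the question:
type C:\Windows\.erlang.cookie
type C:\Users\mydirectory\.erlang.cookie
rem run the same on nodeB; every copy must contain the identical string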

Jmeter remote connection throwing "Connection refused to host"

I set up a distributed load testing environment using JMeter on Ubuntu machines.
->Master: the system running the JMeter GUI, which controls each slave.
->Slave: the system running jmeter-server, which receives commands from the master and sends requests to the server under test.
->Target: the web server under test, which gets requests from the slaves.
The basic requirements are met:
-The firewalls on the systems are turned off
-All the planned master and slave machines are in the same subnet
-The JMeter server can access the target.
-The same version of JMeter is on all the systems (version 2.3.4).
I did the following:
1) Tried pinging from master to slave and vice versa in the Ubuntu terminal; it works.
2) Added the following to client (master) jmeter.properties:
# Remote hosts and RMI configuration
remote_hosts=192.168.0.139:1099
# RMI port to be used by the server (must start rmiregistry with same port)
server_port=1099
3) Added the following to server (Slave) jmeter.properties:
# On the server(s)
set server_port=1234
start rmiregistry with port 1234
4) Now started the Jmeter engine on Master.
a) Started Jmeter on master machine (GUI)
b) Created a test plan --> (added thread group, samplers and required listeners)
c) Now start the Slave(s) from the GUI
-click Run at the top
-select Remote start
-select the IP address
But an error popup appeared:
"Connection refused to host : 192.168.0.139; nested exception is : java.net.ConnectionException : Connection Refused"
What may be the reason for not connecting to the remote slave (here: 192.168.0.139)?
Do I need to do any more configuration in the jmeter.properties file or in any other files (on both slave and master)?
I think you forgot to start the slave in "slave mode".
In command-line mode, go to the jmeter/bin directory and execute jmeter-server.bat (jmeter-server on Linux).
That will start the slave process and keep it listening for commands.
Then you can go forward, loading and launching the script.
have a look at:
http://jmeter.apache.org/usermanual/jmeter_distributed_testing_step_by_step.pdf
Also be aware that:
- the two systems MUST run the same JMeter version
- the two systems MUST be on the same subnetwork
- the two systems SHOULD be as similar as possible: same OS, same directory tree, etc.
- "remote_hosts" requires only the address; the port is specified by the "server_port" parameter (see the sketch below)

Riak node no longer working after changing IP address

I'm using an Amazon EC2 instance running Ubuntu 12.04 as my single Riak node. I've gone through all the proper stages of setting up Riak on the instance using the guide on the Basho website here. Where x.x.x.x is the private IP address of the instance, this included:
Installation
Using sudo su - to gain root privileges (EC2 logs me in as 'Ubuntu').
Installing the SSL Lib with:
sudo apt-get install libssl0.9.8
Downloading the 64-bit package for 12.04:
wget http://downloads.basho.com.s3-website-us-east-1.amazonaws.com/riak/CURRENT/ubuntu/precise/riak_1.2.1-1_amd64.deb
Then unpacking via:
sudo dpkg -i riak_1.2.1-1_amd64.deb
As instructed in the Basho guide, I updated these two files (using vi):
vm.args
Changing -name riak@x.x.x.x to the private IP of my instance.
app.config
Changing {http, [ {"x.x.x.x", 8098 } ]} to the private IP of my instance.
Changing {pb_ip, "x.x.x.x"} to the private IP of my instance.
The Riak node was working fine when I first set up the server and performed the above: I could connect to the node, and running riak start then riak-admin test returned successfully with:
>Attempting to restart script through sudo -H -u riak
>Successfully completed 1 read/write cycle to 'riak@x.x.x.x'
The next day I fired up the instance, repeated the above process (ignoring installation) with the instance's new IP address y.y.y.y (the private IP of the instance changes every time it stops/starts) and typed riak start into the terminal, only to be greeted with:
>Attempting to restart script through sudo -H -u riak
>Riak failed to start within 15 seconds,
>see the output of 'riak console' for more information.
>If you want to wait longer, set the environment variable
>WAIT_FOR_ERLANG to the number of seconds to wait
In the riak console the error given is:
>gen_server riak_core_capability terminated with reason: no function clause matching orddict:fetch('riak@y.y.y.y', [{'riak@x.x.x.x',[{{riak_core,staged_joins},[true,false]},{{riak_core,vnode_routing},[proxy,...]},...]}])
Where y.y.y.y is the new instance IP address and x.x.x.x was the old one.
I've been scratching my head over this for a while now and can't find anything on the topic. The only solution I can think of is to re-install Riak on the off chance my PATH directories have gone wrong. If that fails, my last resort would be to terminate the instance and reconfigure Riak on a new one. So before I jump the gun, what I would like to ask is:
After updating the fields in app.config and vm.args with the new instance IP address, why is the riak start command no longer successful?
Is there any way for an Ubuntu EC2 instance to be assigned a static private IP? Not only would this help solve the problem, it would save me from having to update app.config and vm.args every time I start/stop the instance.
So after some more digging around and intense reading, I've found a solution:
You need to remove the Riak ring and start Riak again to reset riak_core.
You can do this by using this command in the terminal:
rm -rf /var/lib/riak/ring/*
NOTE: This should be done after you've updated app.config and vm.args with the new server IP, nasty side-effects can occur otherwise.
Then
riak start
I was no longer thrown a 'failed to connect' error, and after issuing a riak-admin test command I pleasantly received (where y.y.y.y is my instance's private IP):
>Attempting to restart script through sudo -H -u riak
>Successfully completed 1 read/write cycle to 'riak@y.y.y.y'
I should note that this solution applies to virtual servers as well as physical ones. Although I would imagine the reassigning of IP's would be a much rarer occurrence in physical servers.
Now while that solves the issue, it still means that whenever I need to reboot the instance I have to edit the app.config and vm.args files to change the private IP address (remember, the private IP changes every time an Ubuntu instance is started/stopped) and then clear the Riak ring using the command above, so it's not exactly an elegant solution.
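One stopgap is to script those edits so they run at each boot. A rough sketch, assuming Riak 1.2's default config paths; the sed patterns depend on the exact spacing in your files, so treat this as illustrative only:
#!/bin/sh
# fetch the instance's current private IP from the EC2 metadata service
NEW_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
# rewrite the node name and the HTTP/PB listen addresses
sed -i "s/^-name riak@.*/-name riak@$NEW_IP/" /etc/riak/vm.args
sed -i "s/\"[0-9.]\+\", 8098/\"$NEW_IP\", 8098/" /etc/riak/app.config
sed -i "s/{pb_ip, \"[0-9.]\+\"}/{pb_ip, \"$NEW_IP\"}/" /etc/riak/app.config
# clear the old ring (see the note above) and restart
rm -rf /var/lib/riak/ring/*
riak start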
If anyone knows a way to set a static private IP to an EC2 instance (or another solution that tackles both hurdles?) it would solve this problem outright.
EDIT: 14/12/12
A limited solution to assigning a static IP to an EC2 instance:
Amazon Web Services allows the association of Elastic IP's to EC2 instances (of any kind). Therefore, if an instance has an elastic IP associated with it, even if it is rebooted, that IP will remain associated with that instance. You can find the documentation on elastic IP's here.
If you're under Amazon's free usage tier, creating an Elastic IP shouldn't charge you as long as it's associated with a running instance. If an Elastic IP is disassociated, Amazon charges for each hour that the unused Elastic IP remains disassociated. For example, terminating an instance disassociates its Elastic IP; unless that Elastic IP is re-associated or released, the above charges apply. Stopping your instance entirely and then starting it at a later time will also disassociate an Elastic IP.
You can have a maximum of one Elastic IP per instance; any more will incur charges.
For those interested, you can find more information on Elastic IP pricing here, under Elastic IP Addresses.
As of Riak 1.3, riak-admin reip is deprecated and riak-admin cluster force-replace is the recommended way of replacing a node's name.
These are the commands I had to issue:
riak stop # stop the node
riak-admin down riak@127.0.0.1 # take it down
sudo rm -rf /var/lib/riak/ring/* # delete the riak ring
sudo sed -i "s/127.0.0.1/`hostname -i`/g" /etc/riak/vm.args # change the name in the config
riak-admin cluster force-replace riak@127.0.0.1 riak@"`hostname -i`" # replace the name
riak start # start the node
That should set the node's name to riak@[your EC2 internal IP address].
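To confirm the rename took effect, riak-admin member-status should now list only the new node name (assuming a single-node cluster):
riak-admin member-status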
As well as changing the PB and HTTP IPs in app.config and the IP in vm.args, I also had to run riak-admin reip:
http://docs.basho.com/riak/1.2.0/references/Command-Line-Tools---riak-admin/#reip
Without doing this, the old IP was still present in the error log when running riak console and looking at the output.
