unable to connect worker node to the local cluster - cluster-computing

I am very new to Ray, and I have a project that requires more resources, so I am using Ray to set up a local cluster.
Computer 1 (head)
OS: Ubuntu 16.04
Ray: 1.12
Python: 3.9
Computer 2 (worker)
OS: Windows 10
Ray: 1.12
Python: 3.9
I start the cluster manually by typing ray start --head on computer 1, and it shows the node IP address and all the other details.
After that I go to computer 2 and in cmd type ray start --address="192.168.x.10:6379" (the head node's IP address).
It says it is unable to connect to the GCS server at the given address, and to check whether the firewall is off or whether there is a GCS address mismatch.
The firewall is off on both computers. I googled and tried a lot of YouTube videos, but nothing helped.
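One way to narrow this down is to check from the worker machine whether a plain TCP connection to the head node's port can be opened at all, independently of Ray. The following is a minimal sketch using only the Python standard library. The helper name can_connect is hypothetical, and the host passed to it has to be the real head node IP (the question elides it as 192.168.x.10).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example call on the worker (replace the placeholder with the head node IP):
# print(can_connect("<head-node-ip>", 6379))
```

If this returns False, the problem is probably at the network level (firewall, routing, or the head node listening on a different interface) rather than in Ray itself.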

Related

CDH cluster installation failing in "distributing" stage- failed due to stall on seeded torrent

Hi,
We are trying to install a CDH cluster on a Red Hat 7 remote server using the cloudera-installer.bin file, in standalone mode (we have only 1 host). We specify the hostname/IP address of the machine during installation, and it is able to resolve it. But the installation halts during the parcel distribution stage. Here are the logs of cloudera-scm-agent (we tried both the Cloudera Express edition and the Enterprise trial version too):
['http://INHUSZ1-V250152:7180/cmf/parcel/download/CDH-5.15.1-1.cdh5.15.1.p0.4-el7.parcel'] location=/opt/cloudera/parcels/.flood/CDH-5.15.1-1.cdh5.15.1.p0.4-el7.parcel progress=0]
[03/Oct/2018 10:11:55 +0000] 28315 Thread-13 downloader INFO Current state: CDH-5.15.1-1.cdh5.15.1.p0.4-el7.parcel [totalDownloaded=0 totalSize=2120090032 upload=0 state=downloading seed=['http://INHUSZ1-V250152:7180/cmf/parcel/download/CDH-5.15.1-1.cdh5.15.1.p0.4-el7.parcel'] location=/opt/cloudera/parcels/.flood/CDH-5.15.1-1.cdh5.15.1.p0.4-el7.parcel progress=0]
[03/Oct/2018 10:11:57 +0000] 28315 Thread-13 downloader INFO Current state: CDH-5.15.1-1.cdh5.15.1.p0.4-el7.parcel [totalDownloaded=0 totalSize=2120090032 upload=0 state=downloading seed=['http://INHUSZ1-V250152:7180/cmf/parcel/download/CDH-5.15.1-1.cdh5.15.1.p0.4-el7.parcel'] location=/opt/cloudera/parcels/.flood/CDH-5.15.1-1.cdh5.15.1.p0.4-el7.parcel progress=0]
[03/Oct/2018 10:11:59 +0000] 28315 Thread-13 downloader INFO Current state: CDH-5.15.1-1.cdh5.15.1.p0.4-el7.parcel [totalDownloaded=0 totalSize=2120090032 upload=0 state=downloading seed=['http://INHUSZ1-V250152:7180/cmf/parcel/download/CDH-5.15.1-1.cdh5.15.1.p0.4-el7.parcel'] location=/opt/cloudera/parcels/.flood/CDH-5.15.1-1.cdh5.15.1.p0.4-el7.parcel progress=0]
Please let us know what can be done
I just had the same error message and stall during install at parcel distribution stage.
Installing a single node (test) cluster on CentOS 7.5 with CDH Express 5.15.
Solution that worked for me was adding the node IP and FQDN to /etc/hosts (previously it only contained entries for 127.0.0.1 localhost):
[root@mynode ~]# vi /etc/hosts
192.168.1.1 myhostname.mydomain
Then restarted Cloudera SCM Agent:
[root@mynode ~]# service cloudera-scm-agent restart
Installation then continued successfully.
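To see at a glance whether a hosts file contains anything besides loopback entries, the file can be parsed with a few lines of Python. This is only a sketch: the function name non_loopback_hosts is hypothetical, and the sample text in the usage comment simply mirrors the entry shown in the answer above.

```python
import ipaddress


def non_loopback_hosts(hosts_text: str) -> dict:
    """Map each hostname in hosts-file text to its address, skipping loopback entries."""
    result = {}
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        try:
            if ipaddress.ip_address(address).is_loopback:
                continue  # 127.0.0.0/8 or ::1
        except ValueError:
            continue  # first field is not a valid IP address; ignore the line
        for name in names:
            result.setdefault(name, address)  # first entry for a name wins
    return result


# Example with the real file:
# with open("/etc/hosts") as f:
#     print(non_loopback_hosts(f.read()))
```

An empty result means the file only maps names to loopback addresses, which is the situation the answer describes before the fix.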
Do the following:
Stop all services.
Deactivate all in-use parcels.
Shut down the Cloudera Manager Agent on all hosts.
Move the existing parcels to the new location.
Configure the host parcel directory.
Start the Cloudera Manager Agents.
Activate the parcels.
Start all services.
Delete the corresponding parcel files, including the .torrent file, from the folder below:
/opt/cloudera/parcels/.flood/
Then download and distribute the parcel again.
This happens because the .torrent file is corrupted.

RabbitMQ Erlang distribution failed

I have two Windows Server 2012 R2 machines located in one of the client's datacenters. Both servers are domain-joined. They both have RabbitMQ 3.6.0 installed on them. RabbitMQ is running as a Windows Service on both machines. I've tried to cluster these two machines for a long time now without success. I always get the following error when I try to cluster them.
On the first machine, nodeA, I run the command 'rabbitmqctl join_cluster rabbit@nodeB'. This is what I get:
Clustering node 'rabbit@nodeA' with 'rabbit@nodeB' ...
Error: unable to connect to nodes ['rabbit@nodeB']: nodedown
DIAGNOSTICS
===========
attempted to contact: ['rabbit@nodeB']
rabbit@nodeB:
* connected to epmd (port 4369) on nodeB
* epmd reports node 'rabbit' running on port 25672
* TCP connection succeeded but Erlang distribution failed
* suggestion: hostname mismatch?
* suggestion: is the cookie set correctly?
* suggestion: is the Erlang distribution using TLS?
current node details:
- node name: 'rabbitmq-cli-3892@nodeA'
- home dir: C:\Users\mydirectory
- cookie hash: l+SSu57+cRyAQ03AJdwAbQ==
I've tried this setup with Azure Virtual Machines within Azure Virtual Network and I succeeded to cluster the two VM's, however it seems I cannot connect these two (customer's machines) together.
This is what I have done and ensured:
There isn't any firewall blocking connections
Added host names to hosts file located on C:\Windows\system32\drivers\etc
Tried to refer to host names as FQDN without adding anything to hosts file
Tried to refer to host names with CAPITAL letters and without
Copied the same exact .erlang.cookie to C:\Windows and C:\Users\mydirectory on both machines.
I've read, understood and applied RabbitMQ Clustering Guide https://www.rabbitmq.com/clustering.html
Stopped, restarted, reinstalled RabbitMQ on both machines.
It seems I can't get it to work. On Azure machines, which were not domain-joined clustering worked beautifully. I am really running out of options... Any help?
I had the same problem. You need to install RabbitMQ as an administrator: uninstall it, then reinstall it as admin, and it should work fine.
Try to connect to each of RabbitMQ nodes via remote shell and check if value of cookie is the same (cookie can be set in 3 different ways: .erlang.cookie is one of them).
erl -remsh 'rabbitmq-cli-3892@nodeA' -name 'test@nodeA'
erlang:get_cookie().
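Another way to compare cookies without opening a remote shell is to compute the "cookie hash" value locally and compare it with the one printed in the diagnostics on each machine. The sketch below assumes that this value is the Base64-encoded MD5 digest of the cookie string; that is my understanding of how RabbitMQ derives it, but it is not confirmed anywhere in this thread, so treat it as an assumption. The function name cookie_hash is hypothetical.

```python
import base64
import hashlib


def cookie_hash(cookie: str) -> str:
    """Base64 of the MD5 digest of the cookie (assumed to match the diagnostics line)."""
    digest = hashlib.md5(cookie.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


# Example: read the cookie file on each machine and compare the printed values.
# with open(r"C:\Windows\.erlang.cookie") as f:
#     print(cookie_hash(f.read().strip()))
```

If the two machines print different values, the nodes are not using the same cookie, whatever the files on disk look like.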

Installing Kubernetes on mac with vagrant and virtualbox

This is my first attempt to install and use Kubernetes. I am trying to install an environment on Mac for developing my own apps and deploying them for test locally with Kubernetes. I am familiar with using Vagrant, VirtualBox and Docker for the same purpose. When I saw this page https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/getting-started-guides/vagrant.md I assumed it would be trivial. I executed these lines:
export KUBERNETES_PROVIDER=vagrant
curl -sS https://get.k8s.io | bash
This created a master VM and a Minion, but Kubernetes seems to have failed to start on the master. On the master /var/log/salt/master is full of python Traceback errors, like this:
2015-07-17 22:14:42,629 [cherrypy.error ][INFO ][3252] [17/Jul/2015:22:14:42] ENGINE Started monitor thread '_TimeoutMonitor'.
2015-07-17 22:14:42,736 [cherrypy.error ][ERROR ][3252] [17/Jul/2015:22:14:42] ENGINE Error in HTTP server: shutting down
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/cherrypy/process/servers.py", line 187, in _start_http_thread
self.httpserver.start()
File "/usr/lib/python2.7/site-packages/cherrypy/wsgiserver/wsgiserver2.py", line 1824, in start
raise socket.error(msg)
error: No socket could be created
Vagrant is version 1.7.3. VirtualBox is version 4.3.30
Have I made an obvious stupid mistake?
I don't yet know the fix but I know what is going wrong since it happens to me as well:
OS X 10.10.3
Vagrant 1.7.4
VirtualBox 4.3.30
Kubernetes 1.0.1
When I run the default configuration of this (which creates one "master" and one "minion" VM) I see that the static IP address is not being assigned to the "eth1" interface, and I also see that the Salt API server is sitting in what appears to be an infinite retry loop because it is trying to listen on that IP address.
Also, the following message happened during boot:
[vagrant@kubernetes-master ~]$ dmesg | grep eth1
[ 9.321496] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
So basically, the static IP address didn't get assigned because eth1 wasn't ready when the system first booted, and Salt is waiting for it to get assigned.
I could fix this after boot by sshing to the box using "vagrant ssh" and running the command:
sudo /etc/init.d/network restart
on each host.
This "fixes" eth1 by assigning the static IP address, and after that Salt begins to do its thing, installs Docker, boots various containers, and so on.
What I don't know is how to make this work every time without manual intervention. It appears to be some sort of a race condition between Vagrant and VirtualBox.
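The condition Salt is stuck on (the static IP not yet being assigned to a local interface) can be tested from Python by trying to bind a socket to that address. The sketch below only illustrates the check and a simple polling wait; it does not fix the race itself. The function names are hypothetical, and the IP address to wait for is whatever static address the Vagrant setup assigns, which is not given in this thread.

```python
import socket
import time


def address_is_assigned(ip: str) -> bool:
    """Return True if a socket can be bound to ip, i.e. a local interface has it."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.bind((ip, 0))  # port 0 lets the OS pick any free port
            return True
        except OSError:
            return False  # typically "cannot assign requested address"


def wait_for_address(ip: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll until ip is assigned locally or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if address_is_assigned(ip):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A service that must listen on the static address could call wait_for_address first instead of retrying the listen call indefinitely.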
If you just want to kick the tires with Kubernetes, I'd recommend installing boot2docker and then following the Running kubernetes locally via Docker getting started guide. Once you are comfortable interacting with the Kubernetes API and want a more complex local setup, you can then work on installing Vagrant.
If the Vagrant instructions aren't working, you should also feel free to file a bug in the github repository.
The tutorial Robert pointed to is really easy to run. Just change the version to 0.21.2 (maybe 0.21.3 works too).
Else, if you prefer a vagrant solution, try with pires cluster on vagrant. It runs with almost nothing to change.
Running Kubernetes inside VirtualBox requires 4 networks and some adjustments to the configuration:
The VirtualBox HOST ONLY network will be the network used to access the Kubernetes master and nodes from the Mac or PC.
The NAT Network to download packages from the Internet.
The internal connections between Kubernetes pods use a tunnel (TUN) network.
The Kubernetes Cluster IP Network is a private IP range used inside the cluster to give each Kubernetes service a dedicated IP
Vagrantfile needs to pass the node public IPs to the Ansible roles that configure Kubernetes to set KUBELET_EXTRA_ARGS environment variable with the public IP of each node (required for reading logs using kubectl).
NodePort needs to be used to publish applications running inside the Kubernetes cluster as Load Balancers are not available in VirtualBox.
See the full example and download the code at Building a Kubernetes Cluster with Vagrant and Ansible (without Minikube). It has been tested on Ubuntu but should work on a Mac as well.

windows cluster - SSH seems to be failing

Two physical systems, each is running Server 2008
Installed DataStax Community (version 2.0.7 64-bit) on each (that is the version number in the DataStax package I downloaded according to the file name)
OpsCenter running locally shows a running 1-node cluster. I can execute IO on the system at the command line (using cassandra-stress).
The system names are "5017-cassandra-1" and "5017-cassandra-2"
I'd like to create a cluster in which both nodes participate. This is not a production environment (I'm just trying to learn).
From OpsCenter on 5017-cassandra-1 I go to Nodes (I see 1 node, of course), then Add Nodes.
I leave the "Package" drop-down at its default (but the latest version shown in the drop-down is 2.0.6) and enter the IP address of 5017-cassandra-2. I add the Administrator user name and password in the "Node Credentials (sudo)" fields, press "Add Nodes", and get:
Error provisioning cluster: Unable to SSH to some of the hosts
Unable to SSH to 10.108.14.224:
global name 'get_output' is not defined
After reading that I needed to add OpenSSL, I installed the runtime redistributables (on both systems) and Win64 OpenSSL-1_0_1h.
The error persists.
Any suggestions or a link to a step-by-step guide would be appreciated.

parallel ipython/ipcluster through head node

I want to use the parallel capabilities of ipython on a remote computer cluster. Only the head node is accessible from the outside. I have set up ssh keys so that I can connect to the head node with e.g. ssh head and from there I can also ssh into any node without entering a password, e.g. ssh node3. So I can basically run any commands on the nodes by doing:
ssh head ssh node3 command
Now what I really want to do is to be able to run jobs on the cluster from my own computer from ipython. The way to set up the hosts to use in ipcluster is:
send_furl = True
engines = { 'host1.example.com' : 2,
'host2.example.com' : 5,
'host3.example.com' : 1,
'host4.example.com' : 8 }
But since I only have a host name for the head node, I don't think I can do this. One option is to set up ssh tunneling on the head node, but I cannot do this in my case, since this requires enough ports to be open to accommodate all the nodes (and this is not the case). Are there any alternatives?
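For reference, the two-hop invocation described in the question (ssh head ssh node3 command) can be built programmatically, with the quoting each intermediate shell needs. This is only a sketch of that command construction, not an ipcluster configuration; the function name hop_argv is hypothetical, and the host names are the ones used in the question.

```python
import shlex


def hop_argv(hops, remote_command):
    """Build an argv list that runs remote_command on the last host in hops,
    going through each earlier host with a nested ssh call."""
    if not hops:
        raise ValueError("at least one host is required")
    inner = remote_command
    # Wrap from the innermost hop outwards; each wrap adds one level of
    # shell quoting, because every intermediate host re-parses the string.
    for host in reversed(hops[1:]):
        inner = "ssh " + shlex.quote(host) + " " + shlex.quote(inner)
    return ["ssh", hops[0], inner]


# Example (not executed here):
# import subprocess
# subprocess.run(hop_argv(["head", "node3"], "hostname"), check=True)
```

For a simple command this produces the same thing as typing ssh head ssh node3 command; the extra quoting only matters once the command contains spaces or shell metacharacters.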
I use ipcluster on the NERSC clusters by using the PBS queue:
http://ipython.org/ipython-doc/stable/parallel/parallel_process.html#using-ipcluster-in-pbs-mode
In summary, you submit jobs that run mpiexec ipengine (after having launched ipcontroller on the login node). Do you have PBS on your cluster?
This was working fine with IPython 0.10; it is now broken in the 0.11 alpha.
I would set up a VPN server on the master, and connect to that with a VPN client on my local machine. Once established, the virtual private network will allow all of the slaves to appear as if they're on the same LAN as my local machine (on a "virtual" network interface, in a "virtual" subnet), and it should be possible to ssh to them.
You could possibly establish that VPN over SSH ("ssh tunneling", as you mention); other options are OpenVPN and IPsec.
I don't understand what you mean by "this requires enough ports to be open to accommodate all the nodes". You will need: (i) one inbound port on the master, to provide the VPN/tunnel, (ii) inbound SSH on each slave, accessible from the master, (iii) another inbound port on each slave, over which the master drives the IPython engines. Wouldn't (ii) and (iii) be required in any setup? So all we've added is (i).
