I installed spark on a cluster of machines w/o public DNS (just created machines on a cloud).
Hadoop looks to be installed and worked correctly, but Sparks listens on 7077 and 6066 as 127.0.0.1 instead of public ip so worker nodes can't connect to it.
What is wrong?
My /etc/hosts on the master node looks like:
127.0.1.1 namenode namenode
127.0.0.1 localhost
XX.XX.XX.XX namenode-public
YY.YY.YY.YY hadoop-2
ZZ.ZZ.ZZ.ZZ hadoop-1
My $SPARK_HOME/conf/spark-env.sh looks like:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export SPARK_PUBLIC_DNS=namenode-public
export SPARK_WORKER_CORES=6
export SPARK_LOCAL_IP=XX.XX.XX.XX
sudo netstat -pan|grep 7077
tcp 0 0 127.0.1.1:7077 0.0.0.0:* LISTEN 6670/java
You should specify SPARK_MASTER_HOST in spark-env.sh (it must be the address of your machine that is visible to the slave nodes). Moreover, you may need to add rules for ports 7077 and 6066 in iptables.
Related
When I build hadoop cluster based on vmware, and I use sbin/start-dfs.sh command, I meet the problem about ssh. It says,
ssh: Could not resolve hostname now.: No address associated with hostname
I have used vi /etc/hosts command to check the hostname and IP address, and vi /etc/profile command. I ensure that there is no fault.
Few suggestions
Check if the hostnames in hdfs-site.xml is set correctly. If you are running with single host setup, and you set namenode host as localhost, you need to make sure localhost mapped to 127.0.0.1 in your /etc/hosts. If you are setting multiple nodes, make sure you use FQDN of each host in your configuration, and make sure each FQDN mapped to the correct IP address in /etc/hosts.
Setup passwordless SSH. Note start-dfs.sh requires that you have passwordless SSH setup from the host where you run this command to rest of cluster nodes. Verify this by ssh hostx date and it doesn't ask for a password.
Check the hostname in the error message (maybe you did not paste the complete log), for the problematic hostname, run SSH command manually to make sure it can be resolved. If not, check /etc/hosts. A common /etc/hosts setup looks like
127.0.0.1 localhost localhost.localdomain
::1 localhost localhost.localdomain
172.16.151.224 host1.test.com host1
172.16.152.238 host2.test.com host2
172.16.153.108 host3.test.com host3
I installed Hadoop (HDP 2.5.3) on 4 VMs with Ambari (1 Ambari Server and 3 Ambari Clients; with the DNS entries server, node0, node1, node2) with HDFS, YARN, MapReduce and Zookeeper.
However, YARN doesn't want to start. When starting the Resource Manager on node1 I get the following error:
resource_management.core.exceptions.ExecutionFailed: Execution of 'curl -sS -L -w '%{http_code}' -X GET 'http://node0:50070/webhdfs/v1/ats/done/?op=GETFILESTATUS&user.name=hdfs' 1>/tmp/tmpgsiRLj 2>/tmp/tmpMENUFa' returned 7. curl: (7) Failed to connect to node0 port 50070: connection refused 000
App Timeline Server and History Server on node1 don't want to start either. Zookeeper, NameNode, DataNode and Nodemanager on Node0 is up. The nodes can reach each other (tried with ping) so that shouldn't be the problem.
Hopefully one can help me. I'm really new to this topic and not really familiar with the system.
You should check the host file (/etc/hosts), see the host name and FNDN, check if there any duplicates name, IP address.
Could you also confirm the firewall activity by steps:
sudo ufw status
And also check the port in iptables (or allow port in firewall: udp, tcp).
I am trying to setup a 3-workers 1 master hadoop cluster using 2.7.1. When I start the cluster, the master has the following daemons running:
2792 NameNode
3611 NodeManager
4362 Jps
3346 ResourceManager
2962 DataNode
3169 SecondaryNameNode
And in the three worker nodes,
2163 NodeManager
2030 DataNode
2303 Jps
Problem is when I look at the web UI, the cluster does not recognize the 3 workers. It says 1 live data node and that is the master itself. Please have a look here:
http://master:50070/dfshealth.html#tab-overview
Question is, what are the daemon processes that are suppose to be running on workers node? I tried to look at the log files and didnt find anything helpful because it contains log only related to running daemons and the log files does not have any errors or Fatal errors.
I thought secondary namenode should be running in workers and ports are not letting it to communicate. So I tried to open up Port 9000 and 9001 in master by
sudo iptables -I INPUT -p tcp --dport 9000 -j ACCEPT
sudo iptables -I INPUT -p tcp --dport 9001 -j ACCEPT
iptables-save
but this didnt help much. Still facing the same problem. Log files in workers are not helpful either.
Appreicate your help in fixing this.
Edit 1:
The following is my configuration at core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9001</value> <!-- slave1, 2 & 3 in position of master -->
</property>
</configuration>
This is my /etc/hosts file:
127.0.0.1 localhost math2
127.0.1.1 math2
192.168.1.2 master
192.168.1.3 worker1
192.168.1.7 worker5
192.168.1.8 worker6
This is my configuration at /etc/network/interfaces
# interfaces(5) file used by ifup(8) and ifdown(8)
auto lo
iface lo inet loopback
address 192.168.1.2 (3,5,6 instead of 2 for slaves)
netmask 255.255.255.0
gateway 192.168.1.1
broadcast 192.168.1.255
Here is the log output for one of the datanodes:
2016-02-05 17:54:12,655 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in secureMain
java.net.BindException: Problem binding to [0.0.0.0:50010] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
Did you put the ip-address of all the nodes in /etc/hosts. The /etc/hosts file of all nodes(master and slaves) should contain the ip-address of all the nodes in your cluster.
For example if we have three data nodes and a master node the /etc/hosts file should be:
192.168.0.1 master
192.168.0.2 datanode1
192.168.0.3 datanode2
192.168.0.4 datanode33
On Datanode you can see below process when you run jps command
19728 DataNode
19819 Jps
and when you run ps -aef |grep -i datanode on Datanode, it should show the two process one with root user and another with HDFS user
I am experiencing some difficulties connecting two RabbitMQ nodes on amazon EC2.
The two nodes are controlled using puppet, here is my rabbit.config file:
[
{mnesia, [{dump_log_write_threshold, 1000}]},
{rabbit, [
{tcp_listeners, [5672]},
{kernel, [{inet_dist_listen_min, 55700},{inet_dist_listen_max, 55800}]} ,
{cluster_nodes, ['rabbit#server1', 'rabbit#server2']}
]
}
].
I believe the rights ports for the cluster to connect are open. I am able to telnet from server2 to server1 on both 5672 and 4369.
I have the same /var/lib/rabbitmq/.erlang.cookie on both servers.
And from erlang command line when I net_admin:ping the other node I get pang back.
However, when I run cluster_status on any node they do not look like they are aware of each other. Doing stop_app, reset,rabbitmqctl cluster rabbit#server1 I always get the following error:
Error: {no_running_cluster_nodes...
Has anybody solved a similar problem, or know how to solve it?
Have you opened the ports between 55700 and 55800?
Try checking this to understand what other ports RabbitMQ listens on:
netstat -plten | grep beam
And I'd double-check the cookie...
Like Ivan suggests, you can check which ports the servers are listening on first and then add those TCP rules to Security Groups for servers. That's a good first step.
netstat -plten | grep beam
Returns the following (if server still running and not stop_app)
tcp 0 0 0.0.0.0:37419 0.0.0.0:* LISTEN 498 118739 15519/beam
tcp 0 0 0.0.0.0:15672 0.0.0.0:* LISTEN 498 119032 15519/beam
tcp 0 0 0.0.0.0:55672 0.0.0.0:* LISTEN 498 119029 15519/beam
tcp 0 0 :::5672 :::* LISTEN 498 119018 15519/beam
Notice the common ports 5672 15672 55672 for amqp and web server and the other port is the port the cluster is listening on. Check your other instances and make sure your range includes both of them, then retry and it will work.
Security Group > Inbound > TCP Rule:
30000-65535 and the Security Group allowed sg-XXXXXX and repeat for reciprocating security groups and don't forget to "Apply Rules".
Next make sure you share the /var/lib/rabbitmq/.erlang.cookie (just copy from one server to all others and restart instances)
Then on your command line:
[root#ip-172-31-27-150 ~]# rabbitmqctl stop_app
Stopping node 'rabbit#ip-172-31-27-150' ...
...done.
[root#ip-172-31-27-150 ~]# rabbitmqctl reset
Resetting node 'rabbit#ip-172-31-27-150' ...
...done.
[root#ip-172-31-27-150 ~]# rabbitmqctl join_cluster rabbit#ip-172-31-28-79
Clustering node 'rabbit#ip-172-31-27-150' with 'rabbit#ip-172-31-28-79' ...
...done.
Lastly, don't forget to restart your instance rabbitmqctl start_app
This worked for me on 5 EC2 instances.
thanks for your answer, what I did is to remove the content of this directory except .erlang.cookie ( rm -R /var/lib/rabbitmq/ ). And the cluster connected successfully.
Cheers!
I'm setting up Hadoop (0.20.2). For starters, I just want it to run on a single machine - I'll probably need a cluster at some point, but I'll worry about that when I get there. I got it to the point where my client code can connect to the job tracker and start jobs, but there's one problem: the job tracker is only accessible from the same machine that it's running on. I actually did a port scan with nmap, and it shows port 9001 open when scanning from the Hadoop machine, and closed when it's from somewhere else.
I tried this on three machines (one Mac, one Ubuntu, and an Ubuntu VM running in VirtualBox), it's the same. None of them have any firewalls set up, so I'm pretty sure it's a Hadoop problem. Any suggestions?
In your hadoop configuration files, does fs.default.name and mapred.job.tracker refer to localhost?
If so, then Hadoop will only listen to port 9000 and 9001 on the loopback interface, which is inaccessible from any other host. Make sure fs.default.name and mapred.job.tracker refer to your machine's externally accessible host name.
Make sure that you have not double listed your master in the /etc/hosts file.
I had the following which only allowed master to listen on 127.0.1.1
127.0.1.1 hostname master
192.168.x.x hostname master
192.168.x.x slave-1
192.168.x.x slave-2
The above answer caused the problem. I changed my /ect/hosts file to the following to make it work.
127.0.1.1 hostname
192.168.x.x hostname master
192.168.x.x slave-1
192.168.x.x slave-2
Use the command netstat -an | grep :9000 to verify your connections are working!
In addition to the above answer, I found that in /etc/hosts on the master (running ubuntu) had the line:
127.0.1.1 master
Which meant that running nslookup master on the master returned a local address - so in spite of using master in mapred-site.xml I suffered the same problem. My solution (there are probably better ones) was to create an alias in my DNS server and use that instead. I guess you could probably also change the IP address in /etc/hosts to the external one, but I haven't tried this - I'm not sure what implications it would have for other services.