Spark Cluster starting issue - hadoop

I'm new to Spark and I'm trying to set up a Spark cluster. I did the following to set it up and to check its status, but I'm not sure whether the cluster is actually up.
I tried opening master-ip:8081 (also 8080, 4040, 4041) in the browser, but didn't see any results. To start with, I set up and started a Hadoop cluster.
jps gives:
2436 SecondaryNameNode
2708 NodeManager
2151 NameNode
5495 Master
2252 DataNode
2606 ResourceManager
5710 Jps
Question: was it necessary to start Hadoop?
On the master, /usr/local/spark/conf/slaves contains:
localhost
slave-node-1
slave-node-2
Now, to start Spark, I started the master with:
$SPARK_HOME/sbin/start-master.sh
And tested with
ps -ef|grep spark
hduser 5495 1 0 18:12 pts/0 00:00:04 /usr/local/java/bin/java -cp /usr/local/spark/conf/:/usr/local/spark/jars/*:/usr/local/hadoop/etc/hadoop/ -Xmx1g org.apache.spark.deploy.master.Master --host master-hostname --port 7077 --webui-port 8080
On slave node 1
$SPARK_HOME/sbin/start-slave.sh spark://205.147.102.19:7077
Tested with
ps -ef|grep spark
hduser 1847 1 20 18:24 pts/0 00:00:04 /usr/local/java/bin/java -cp /usr/local/spark/conf/:/usr/local/spark/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://master-ip:7077
Same on slave-node 2
$SPARK_HOME/sbin/start-slave.sh spark://master-ip:7077
ps -ef|grep spark
hduser 1948 1 3 18:18 pts/0 00:00:03 /usr/local/java/bin/java -cp /usr/local/spark/conf/:/usr/local/spark/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://master-ip:7077
I was not able to see anything on the Spark web console, so I thought the problem might be the firewall. Here is my iptables output:
iptables -L -nv
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
6136 587K fail2ban-ssh tcp -- * * 0.0.0.0/0 0.0.0.0/0 multiport dports 22
151K 25M ACCEPT all -- * * 0.0.0.0/0 0.0.0.0/0 state RELATED,ESTABLISHED
6 280 ACCEPT icmp -- * * 0.0.0.0/0 0.0.0.0/0
579 34740 ACCEPT all -- lo * 0.0.0.0/0 0.0.0.0/0
34860 2856K ACCEPT all -- eth1 * 0.0.0.0/0 0.0.0.0/0
145 7608 ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 state NEW tcp dpt:22
56156 5994K REJECT all -- * * 0.0.0.0/0 0.0.0.0/0 reject-with icmp-host-prohibited
0 0 ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:8080
0 0 ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:8081
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
0 0 REJECT all -- * * 0.0.0.0/0 0.0.0.0/0 reject-with icmp-host-prohibited
Chain OUTPUT (policy ACCEPT 3531 packets, 464K bytes)
pkts bytes target prot opt in out source destination
Chain fail2ban-ssh (1 references)
pkts bytes target prot opt in out source destination
2 120 REJECT all -- * * 218.87.109.153 0.0.0.0/0 reject-with icmp-port-unreachable
5794 554K RETURN all -- * * 0.0.0.0/0 0.0.0.0/0
I'm trying everything I can to tell whether the Spark cluster is set up, and how to check it properly. And if the cluster is set up, why can't I see it on the web console? What could be wrong? Any pointers would be helpful.
EDIT: adding logs after running spark-shell --master local (on the master):
17/01/11 18:12:46 INFO util.Utils: Successfully started service 'sparkMaster' on port 7077.
17/01/11 18:12:47 INFO master.Master: Starting Spark master at spark://master:7077
17/01/11 18:12:47 INFO master.Master: Running Spark version 2.1.0
17/01/11 18:12:47 INFO util.log: Logging initialized #3326ms
17/01/11 18:12:47 INFO server.Server: jetty-9.2.z-SNAPSHOT
17/01/11 18:12:47 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#20f0b5ff{/app,null,AVAILABLE}
17/01/11 18:12:47 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#734e74b2{/app/json,null,AVAILABLE}
17/01/11 18:12:47 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#1bc45d76{/,null,AVAILABLE}
17/01/11 18:12:47 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#6a274a23{/json,null,AVAILABLE}
17/01/11 18:12:47 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#4f5d45d5{/static,null,AVAILABLE}
17/01/11 18:12:47 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#4fb65368{/app/kill,null,AVAILABLE}
17/01/11 18:12:47 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#76208805{/driver/kill,null,AVAILABLE}
17/01/11 18:12:47 INFO server.ServerConnector: Started ServerConnector#258dbadd{HTTP/1.1}{0.0.0.0:8080}
17/01/11 18:12:47 INFO server.Server: Started #3580ms
17/01/11 18:12:47 INFO util.Utils: Successfully started service 'MasterUI' on port 8080.
17/01/11 18:12:47 INFO ui.MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://master:8080
17/01/11 18:12:47 INFO server.Server: jetty-9.2.z-SNAPSHOT
17/01/11 18:12:47 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#1cfbb7e9{/,null,AVAILABLE}
17/01/11 18:12:47 INFO server.ServerConnector: Started ServerConnector#2f7af4e{HTTP/1.1}{master:6066}
17/01/11 18:12:47 INFO server.Server: Started #3628ms
17/01/11 18:12:47 INFO util.Utils: Successfully started service on port 6066.
17/01/11 18:12:47 INFO rest.StandaloneRestServer: Started REST server for submitting applications on port 6066
17/01/11 18:12:47 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#799d5f4f{/metrics/master/json,null,AVAILABLE}
17/01/11 18:12:47 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#647c46e3{/metrics/applications/json,null,AVAILABLE}
17/01/11 18:12:47 INFO master.Master: I have been elected leader! New state: ALIVE
On the slave nodes:
17/01/11 18:22:46 INFO Worker: Connecting to master master:7077...
17/01/11 18:22:46 WARN Worker: Failed to connect to master master:7077
Tonnes of Java errors follow, and then:
17/01/11 18:31:18 ERROR Worker: All masters are unresponsive! Giving up.

The Spark web UI starts when you create a SparkContext.
Try running spark-shell --master spark://your-master:7077 and then open the Spark UI. You can also use spark-submit to submit an application; that also creates a SparkContext.
An example spark-submit, from the Spark documentation:
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--deploy-mode cluster \
--supervise \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000
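If the UI still isn't reachable from your laptop, you can also check from the master machine itself whether the workers registered at all. A rough sketch (this assumes the standalone master web UI's JSON endpoint on the same port as the UI, and the default master log location):
# Run these on the master itself, so the firewall is not in the way
curl http://localhost:8080/json
# The "workers" array in the response should list slave-node-1 and slave-node-2;
# if it is empty, the workers never reached spark://master-ip:7077
grep -i "registering worker" $SPARK_HOME/logs/spark-*-org.apache.spark.deploy.master.Master-*.out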
Answer to your first question: you must start the Hadoop components only if you want to use HDFS or YARN; otherwise they can stay stopped.
You can also edit /etc/hosts and remove the line that maps your hostname to 127.0.0.1, or set the SPARK_MASTER_IP (or SPARK_MASTER_HOST) variable in the Spark configuration to the proper host name.
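A minimal sketch of that spark-env.sh change (assuming Spark 2.x, where the setting is called SPARK_MASTER_HOST; the host name here is a placeholder):
# $SPARK_HOME/conf/spark-env.sh on the master
export SPARK_MASTER_HOST=master-hostname   # must resolve to the machine's real interface IP, not 127.0.0.1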

The problem was iptables; most other things were fine. I just followed the instructions at https://wiki.debian.org/iptables to fix the rules, and it worked for me. The only thing you need to know is which ports Spark/Hadoop will use. I opened 8080, 54310, 50070 and 7077 (common defaults used by many Hadoop and Spark installations).
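For reference, a sketch of the kind of rules that fix the listing above (not the exact commands I ran). The important detail is that the ACCEPT rules for 8080/8081 shown earlier sit below the catch-all REJECT, so they never match; new rules have to be inserted above it:
# Insert ACCEPT rules above the "REJECT all ... icmp-host-prohibited" rule (rule 7 in the INPUT listing above)
iptables -I INPUT 7 -p tcp --dport 7077 -j ACCEPT    # Spark master RPC
iptables -I INPUT 7 -p tcp --dport 8080 -j ACCEPT    # Spark master web UI
iptables -I INPUT 7 -p tcp --dport 8081 -j ACCEPT    # Spark worker web UI
iptables -I INPUT 7 -p tcp --dport 50070 -j ACCEPT   # HDFS NameNode web UI
iptables -I INPUT 7 -p tcp --dport 54310 -j ACCEPT   # HDFS NameNode RPC (fs.defaultFS)
iptables-save > /etc/iptables/rules.v4               # persist (path used by iptables-persistent on Debian)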

Related

Web browser times out when trying to connect to remote Hadoop NameNode web UI

No firewall is active on the remote machine:
[p512788@dev09901 ~]$ sudo systemctl status firewalld
[sudo] password for p512788:
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:firewalld(1)
The port is open and the application is listening on the port as far as I can tell:
[p512788@dev09901 ~]$ lsof -i:9870
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 11183 p512788 265u IPv4 67057 0t0 TCP dev09901.resbank.co.za:9870 (LISTEN)
My core-site.xml configuration is:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://dev09901.resbank.co.za:8020</value>
</property>
</configuration>
My hdfs-site.xml configuration is:
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/p512788/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>dev09901.resbank.co.za:9870</value>
</property>
</configuration>
Running jps returns:
[p512788@dev09901 ~]$ jps
12960 Jps
11425 SecondaryNameNode
12649 ResourceManager
11183 NameNode
The (partial) output of $HADOOP_HOME/logs/hadoop-p512788-namenode-dev09901.resbank.co.za.log:
2020-10-20 02:20:11,885 INFO org.apache.hadoop.hdfs.DFSUtil: Starting Web-server for hdfs at: http://dev09901.resbank.co.za:9870
2020-10-20 02:20:11,895 INFO org.eclipse.jetty.util.log: Logging initialized #766ms
2020-10-20 02:20:11,967 INFO org.apache.hadoop.security.authentication.server.AuthenticationFilter: Unable to initialize FileSignerSecretProvider, falling back to use random secrets.
2020-10-20 02:20:11,975 INFO org.apache.hadoop.http.HttpRequestLog: Http request log for http.requests.namenode is not defined
2020-10-20 02:20:11,982 INFO org.apache.hadoop.http.HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
2020-10-20 02:20:11,983 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context hdfs
2020-10-20 02:20:11,983 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context logs
2020-10-20 02:20:11,984 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
2020-10-20 02:20:11,999 INFO org.apache.hadoop.http.HttpServer2: Added filter 'org.apache.hadoop.hdfs.web.AuthFilter' (class=org.apache.hadoop.hdfs.web.AuthFilter)
2020-10-20 02:20:11,999 INFO org.apache.hadoop.http.HttpServer2: addJerseyResourcePackage: packageName=org.apache.hadoop.hdfs.server.namenode.web.resources;org.apache.hadoop.hdfs.web.resources, pathSpec=/webhdfs/v1/*
2020-10-20 02:20:12,005 INFO org.apache.hadoop.http.HttpServer2: Jetty bound to port 9870
2020-10-20 02:20:12,007 INFO org.eclipse.jetty.server.Server: jetty-9.3.24.v20180605, build timestamp: 2018-06-05T19:11:56+02:00, git hash: 84205aa28f11a4f31f2a3b86d1bba2cc8ab69827
2020-10-20 02:20:12,028 INFO org.eclipse.jetty.server.handler.ContextHandler: Started o.e.j.s.ServletContextHandler#4b741d6d{/logs,file:///home/p512788/hadoop-3.2.1/logs/,AVAILABLE}
2020-10-20 02:20:12,029 INFO org.eclipse.jetty.server.handler.ContextHandler: Started o.e.j.s.ServletContextHandler#8f2ef19{/static,file:///home/p512788/hadoop-3.2.1/share/hadoop/hdfs/webapps/static/,AVAILABLE}
2020-10-20 02:20:12,074 INFO org.eclipse.jetty.server.handler.ContextHandler: Started o.e.j.w.WebAppContext#5d908d47{/,file:///home/p512788/hadoop-3.2.1/share/hadoop/hdfs/webapps/hdfs/,AVAILABLE}{/hdfs}
2020-10-20 02:20:12,078 INFO org.eclipse.jetty.server.AbstractConnector: Started ServerConnector#9816741{HTTP/1.1,[http/1.1]}{dev09901.resbank.co.za:9870}
The output of telnet and nmap from one of the DataNodes:
[p512788@dev09902 ~]$ nmap -p 9870 dev09901.resbank.co.za
Starting Nmap 6.40 ( http://nmap.org ) at 2020-10-20 00:36 SAST
Nmap scan report for dev09901.resbank.co.za (10.36.16.101)
Host is up (0.00027s latency).
PORT STATE SERVICE
9870/tcp open unknown
Nmap done: 1 IP address (1 host up) scanned in 0.03 seconds
[p512788@dev09902 ~]$ telnet dev09901.resbank.co.za 9870
Trying 10.36.16.101...
Connected to dev09901.resbank.co.za.
Escape character is '^]'.
Connection closed by foreign host.
Running jps on a DataNode gives:
[p512788@dev09902 ~]$ jps
24741 NodeManager
23578 DataNode
25455 Jps
I just faced the same problem. You can try:
sudo ufw allow 9870/tcp
Then verify that the port is open using:
sudo ufw status verbose
You can try adding a line to the /etc/hosts file whose content is "IP Hostname".
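A sketch using the NameNode's IP from the nmap output above (adjust to your own host and IP):
# /etc/hosts on the machine where the browser runs
10.36.16.101   dev09901.resbank.co.za   dev09901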

HDFS NFS startup error: “ERROR mount.MountdBase: Failed to start the TCP server...ChannelException: Failed to bind..."

I'm attempting to start up and use HDFS NFS following the docs (I ignored the instructions to stop the rpcbind service and did not start the hadoop portmap service, since the OS is neither SLES 11 nor RHEL 6.2), but I'm running into an error when starting the hdfs nfs3 service:
[root@HW02 ~]#
[root@HW02 ~]#
[root@HW02 ~]# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
[root@HW02 ~]#
[root@HW02 ~]#
[root@HW02 ~]# service nfs status
Redirecting to /bin/systemctl status nfs.service
Unit nfs.service could not be found.
[root@HW02 ~]#
[root@HW02 ~]#
[root@HW02 ~]# service nfs stop
Redirecting to /bin/systemctl stop nfs.service
Failed to stop nfs.service: Unit nfs.service not loaded.
[root@HW02 ~]#
[root@HW02 ~]#
[root@HW02 ~]# service rpcbind status
Redirecting to /bin/systemctl status rpcbind.service
● rpcbind.service - RPC bind service
Loaded: loaded (/usr/lib/systemd/system/rpcbind.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2019-07-23 13:48:54 HST; 28s ago
Process: 27337 ExecStart=/sbin/rpcbind -w $RPCBIND_ARGS (code=exited, status=0/SUCCESS)
Main PID: 27338 (rpcbind)
CGroup: /system.slice/rpcbind.service
└─27338 /sbin/rpcbind -w
Jul 23 13:48:54 HW02.ucera.local systemd[1]: Starting RPC bind service...
Jul 23 13:48:54 HW02.ucera.local systemd[1]: Started RPC bind service.
[root@HW02 ~]#
[root@HW02 ~]#
[root@HW02 ~]# hdfs nfs3
19/07/23 13:49:33 INFO nfs3.Nfs3Base: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting Nfs3
STARTUP_MSG: host = HW02.ucera.local/172.18.4.47
STARTUP_MSG: args = []
STARTUP_MSG: version = 3.1.1.3.1.0.0-78
STARTUP_MSG: classpath = /usr/hdp/3.1.0.0-78/hadoop/conf:/usr/hdp/3.1.0.0-78/hadoop/lib/jersey-server-1.19.jar:/usr/hdp/3.1.0.0-78/hadoop/lib/ranger-hdfs-plugin-shim-1.2.0.3.1.0.0-78.jar:
...
<a bunch of other jars>
...
STARTUP_MSG: build = git@github.com:hortonworks/hadoop.git -r e4f82af51faec922b4804d0232a637422ec29e64; compiled by 'jenkins' on 2018-12-06T12:26Z
STARTUP_MSG: java = 1.8.0_112
************************************************************/
19/07/23 13:49:33 INFO nfs3.Nfs3Base: registered UNIX signal handlers for [TERM, HUP, INT]
19/07/23 13:49:33 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
19/07/23 13:49:33 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
19/07/23 13:49:33 INFO impl.MetricsSystemImpl: Nfs3 metrics system started
19/07/23 13:49:33 INFO oncrpc.RpcProgram: Will accept client connections from unprivileged ports
19/07/23 13:49:33 INFO security.ShellBasedIdMapping: Not doing static UID/GID mapping because '/etc/nfs.map' does not exist.
19/07/23 13:49:33 INFO nfs3.WriteManager: Stream timeout is 600000ms.
19/07/23 13:49:33 INFO nfs3.WriteManager: Maximum open streams is 256
19/07/23 13:49:33 INFO nfs3.OpenFileCtxCache: Maximum open streams is 256
19/07/23 13:49:34 INFO nfs3.DFSClientCache: Added export: / FileSystem URI: / with namenodeId: -1408097406
19/07/23 13:49:34 INFO nfs3.RpcProgramNfs3: Configured HDFS superuser is
19/07/23 13:49:34 INFO nfs3.RpcProgramNfs3: Delete current dump directory /tmp/.hdfs-nfs
19/07/23 13:49:34 INFO nfs3.RpcProgramNfs3: Create new dump directory /tmp/.hdfs-nfs
19/07/23 13:49:34 INFO nfs3.Nfs3Base: NFS server port set to: 2049
19/07/23 13:49:34 INFO oncrpc.RpcProgram: Will accept client connections from unprivileged ports
19/07/23 13:49:34 INFO mount.RpcProgramMountd: FS:hdfs adding export Path:/ with URI: hdfs://hw01.ucera.local:8020/
19/07/23 13:49:34 INFO oncrpc.SimpleUdpServer: Started listening to UDP requests at port 4242 for Rpc program: mountd at localhost:4242 with workerCount 1
19/07/23 13:49:34 ERROR mount.MountdBase: Failed to start the TCP server.
org.jboss.netty.channel.ChannelException: Failed to bind to: 0.0.0.0/0.0.0.0:4242
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at org.apache.hadoop.oncrpc.SimpleTcpServer.run(SimpleTcpServer.java:89)
at org.apache.hadoop.mount.MountdBase.startTCPServer(MountdBase.java:83)
at org.apache.hadoop.mount.MountdBase.start(MountdBase.java:98)
at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.startServiceInternal(Nfs3.java:56)
at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.startService(Nfs3.java:69)
at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.main(Nfs3.java:79)
Caused by: java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
...
...
19/07/23 13:49:34 INFO util.ExitUtil: Exiting with status 1: org.jboss.netty.channel.ChannelException: Failed to bind to: 0.0.0.0/0.0.0.0:4242
19/07/23 13:49:34 INFO nfs3.Nfs3Base: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down Nfs3 at HW02.ucera.local/172.18.4.47
************************************************************/
I'm not sure how to interpret any of the errors seen here (and I have not installed any packages like nfs-utils, assuming that Ambari would have installed everything needed when the cluster was initially set up).
Any debugging suggestions or solutions for what to do about this?
** UPDATE:
After looking at the error, I can see
Caused by: java.net.BindException: Address already in use
and looking into what is already using it, we see...
[root@HW02 ~]# netstat -ltnp | grep 4242
tcp 0 0 0.0.0.0:4242 0.0.0.0:* LISTEN 98067/jsvc.exec
The process jsvc.exec appears to be related to running Java applications. Given that Hadoop runs on Java, I assume it would be bad to just kill the process. Is it not supposed to be on this port (since it interferes with the NFS gateway)? I'm not sure what to do about this.
TL;DR: the NFS gateway service was already running (by default, apparently), and the process that I thought was blocking the hadoop nfs3 service from starting (jsvc.exec) was, I'm assuming, part of that already-running service.
What made me suspect this was that the service also stopped when I shut down the cluster, plus the fact that it was using the port I needed for NFS. I confirmed it by following the verification steps in the docs and seeing that my output was similar to what should be expected:
[root@HW02 ~]# rpcinfo -p hw02
program vers proto port service
100000 4 tcp 111 portmapper
100000 3 tcp 111 portmapper
100000 2 tcp 111 portmapper
100000 4 udp 111 portmapper
100000 3 udp 111 portmapper
100000 2 udp 111 portmapper
100005 1 udp 4242 mountd
100005 2 udp 4242 mountd
100005 3 udp 4242 mountd
100005 1 tcp 4242 mountd
100005 2 tcp 4242 mountd
100005 3 tcp 4242 mountd
100003 3 tcp 2049 nfs
[root@HW02 ~]# showmount -e hw02
Export list for hw02:
/ *
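Since the gateway was in fact already running, another quick sanity check (a sketch following the mount options in the HDFS NFS gateway docs; the mount point is arbitrary) is simply to mount it:
# On any NFS client; hw02 is the gateway host from the rpcinfo/showmount output above
mkdir -p /mnt/hdfs
mount -t nfs -o vers=3,proto=tcp,nolock hw02:/ /mnt/hdfs
ls /mnt/hdfs    # should list the HDFS root directories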
Another thing that could have told me that the jsvc process was part of an already-running HDFS NFS service would have been checking the process info:
[root@HW02 ~]# ps -feww | grep jsvc
root 61106 59083 0 14:27 pts/2 00:00:00 grep --color=auto jsvc
root 163179 1 0 12:14 ? 00:00:00 jsvc.exec -Dproc_nfs3 -outfile /var/log/hadoop/root/hadoop-hdfs-root-nfs3-HW02.ucera.local.out -errfile /var/log/hadoop/root/privileged-root-nfs3-HW02.ucera.local.err -pidfile /var/run/hadoop/root/hadoop-hdfs-root-nfs3.pid -nodetach -user hdfs -cp /usr/hdp/3.1.0.0-78/hadoop/conf:...
...
hdfs 163193 163179 0 12:14 ? 00:00:17 jsvc.exec -Dproc_nfs3 -outfile /var/log/hadoop/root/hadoop-hdfs-root-nfs3-HW02.ucera.local.out -errfile /var/log/hadoop/root/privileged-root-nfs3-HW02.ucera.local.err -pidfile /var/run/hadoop/root/hadoop-hdfs-root-nfs3.pid -nodetach -user hdfs -cp /usr/hdp/3.1.0.0-78/hadoop/conf:...
Seeing jsvc.exec -Dproc_nfs3 ... is the hint that jsvc (which is apparently used for running Java applications as services on Linux) was running the very nfs3 service I was trying to start.
And for anyone else with this problem, note that I did not stop all the services that the docs tell you to stop (since I'm using CentOS 7):
[root@HW01 /]# service nfs status
Redirecting to /bin/systemctl status nfs.service
● nfs-server.service - NFS server and services
Loaded: loaded (/usr/lib/systemd/system/nfs-server.service; disabled; vendor preset: disabled)
Active: inactive (dead)
[root@HW01 /]# service rpcbind status
Redirecting to /bin/systemctl status rpcbind.service
● rpcbind.service - RPC bind service
Loaded: loaded (/usr/lib/systemd/system/rpcbind.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2019-07-19 15:17:02 HST; 6 days ago
Main PID: 2155 (rpcbind)
CGroup: /system.slice/rpcbind.service
└─2155 /sbin/rpcbind -w
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
Also note that I did not apply any of the config file settings recommended in the docs, and some of the properties mentioned in the docs could not even be found in the Ambari-managed HDFS configs (so if anyone can explain why this still works for me despite that, please do).
** Update:
After talking with some people more experienced with HDP (v3.1) than me, it seems the docs I linked to for setting up NFS for HDFS may not be totally up to date (at least when setting up NFS via Ambari management):
You can have a cluster node act as an NFS gateway by checking it off as an NFS node in the Ambari host management UI.
The needed configs can be set in the HDFS management UI.
You can confirm that the HDFS NFS gateway is running by looking at the Host > Summary > Components section in Ambari.

Hadoop Kerberos: DataNode cannot connect to NameNode. DataNode started via jsvc to bind privileged ports (not using SASL)

I've set up an HA Hadoop cluster that worked, but after adding Kerberos authentication the DataNode cannot connect to the NameNode.
I verified that the NameNode servers start successfully and log no errors. I start all services as user 'hduser'.
$ sudo netstat -tuplen
...
tcp 0 0 10.28.94.150:8019 0.0.0.0:* LISTEN 1001 20218 1518/java
tcp 0 0 10.28.94.150:50070 0.0.0.0:* LISTEN 1001 20207 1447/java
tcp 0 0 10.28.94.150:9000 0.0.0.0:* LISTEN 1001 20235 1447/java
DataNode
I start the DataNode as root, using jsvc to bind the service to privileged ports (ref. Secure DataNode):
$ sudo -E sbin/hadoop-daemon.sh start datanode
starting datanode, logging to /opt/hadoop-2.7.3/logs//hadoop-hduser-datanode-STWHDDN01.out
I got errors showing that the DataNode cannot connect to the NameNodes:
...
2018-01-08 09:25:40,051 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: dnUserName = hduser
2018-01-08 09:25:40,052 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: supergroup = supergroup
2018-01-08 09:25:40,114 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
2018-01-08 09:25:40,125 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 50020
2018-01-08 09:25:40,152 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Opened IPC server at /0.0.0.0:50020
2018-01-08 09:25:40,219 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Refresh request received for nameservices: ha-cluster
2018-01-08 09:25:41,189 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Starting BPOfferServices for nameservices: ha-cluster
2018-01-08 09:25:41,226 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2018-01-08 09:25:41,227 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50020: starting
2018-01-08 09:25:42,297 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: STWHDRM02/10.28.94.151:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-01-08 09:25:42,300 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: STWHDRM01/10.28.94.150:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
datanode hdfs-site.xml (excerpt):
<property>
<name>dfs.block.access.token.enable</name>
<value>true</value>
</property>
<property>
<name>dfs.datanode.keytab.file</name>
<value>/opt/hadoop/etc/hadoop/hdfs.keytab</value>
</property>
<property>
<name>dfs.datanode.kerberos.principal</name>
<value>hduser/_HOST@FDATA.COM</value>
</property>
<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:1004</value>
</property>
<property>
<name>dfs.datanode.http.address</name>
<value>0.0.0.0:1006</value>
</property>
<property>
<name>dfs.datanode.data.dir.perm</name>
<value>700</value>
</property>
I have set HADOOP_SECURE_DN_USER=hduser and JSVC_HOME in hadoop-env.sh
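For reference, those hadoop-env.sh lines look roughly like this (a sketch; the JSVC_HOME path is an example and should point at wherever the jsvc binary actually lives):
# hadoop-env.sh - secure DataNode settings
export HADOOP_SECURE_DN_USER=hduser
export JSVC_HOME=/usr/bin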
hdfs.keytab on datanode:
$ klist -ke etc/hadoop/hdfs.keytab
Keytab name: FILE:etc/hadoop/hdfs.keytab
KVNO Principal
---- --------------------------------------------------------------------------
1 hduser/stwhddn01@FDATA.COM (aes256-cts-hmac-sha1-96)
1 hduser/stwhddn01@FDATA.COM (aes128-cts-hmac-sha1-96)
1 hduser/stwhddn01@FDATA.COM (des3-cbc-sha1)
1 hduser/stwhddn01@FDATA.COM (arcfour-hmac)
1 hduser/stwhddn01@FDATA.COM (des-hmac-sha1)
1 hduser/stwhddn01@FDATA.COM (des-cbc-md5)
1 HTTP/stwhddn01@FDATA.COM (aes256-cts-hmac-sha1-96)
1 HTTP/stwhddn01@FDATA.COM (aes128-cts-hmac-sha1-96)
1 HTTP/stwhddn01@FDATA.COM (des3-cbc-sha1)
1 HTTP/stwhddn01@FDATA.COM (arcfour-hmac)
1 HTTP/stwhddn01@FDATA.COM (des-hmac-sha1)
1 HTTP/stwhddn01@FDATA.COM (des-cbc-md5)
OS: Centos 7
Hadoop: 2.7.3
Kerberos: MIT 1.5.1
I guess that when running the DataNode as user root it does not authenticate with Kerberos.
Any ideas?
I found the problem: I needed to change /etc/hosts to map 127.0.0.1 to localhost only.
Before
127.0.0.1 STWHDDD01
127.0.0.1 localhost
...
After
127.0.0.1 localhost
...
I still wonder why the old mapping worked when Kerberos authentication was not in use.

Spark shell connect to Mesos hangs: No credentials provided. Attempting to register without authentication

I installed Mesos in an OpenStack environment using these instructions from Mesosphere: https://open.mesosphere.com/getting-started/datacenter/install/. I ran the verification test as described and it was successful. The UIs for both Mesos and Marathon are working as expected.
When I run the Spark shell from my laptop I cannot connect. The shell hangs with the output below. I don't see anything in the Mesos master or slave logs that would indicate an error, so I'm not sure what to investigate next.
Any help would be appreciated.
TOMWATER-M-60SN:bin tomwater$ ./spark-shell --master mesos://zk://10.93.193.78:2181,10.93.193.79:2181,10.93.193.80:2181/mesos
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/08/06 15:39:02 INFO SecurityManager: Changing view acls to: tomwater
15/08/06 15:39:02 INFO SecurityManager: Changing modify acls to: tomwater
15/08/06 15:39:02 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(tomwater); users with modify permissions: Set(tomwater)
15/08/06 15:39:02 INFO HttpServer: Starting HTTP Server
15/08/06 15:39:02 INFO Utils: Successfully started service 'HTTP class server' on port 63056.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.4.1
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
15/08/06 15:39:05 INFO SparkContext: Running Spark version 1.4.1
15/08/06 15:39:05 INFO SecurityManager: Changing view acls to: tomwater
15/08/06 15:39:05 INFO SecurityManager: Changing modify acls to: tomwater
15/08/06 15:39:05 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(tomwater); users with modify permissions: Set(tomwater)
15/08/06 15:39:05 INFO Slf4jLogger: Slf4jLogger started
15/08/06 15:39:05 INFO Remoting: Starting remoting
15/08/06 15:39:05 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver#10.93.235.120:63057]
15/08/06 15:39:05 INFO Utils: Successfully started service 'sparkDriver' on port 63057.
15/08/06 15:39:05 INFO SparkEnv: Registering MapOutputTracker
15/08/06 15:39:05 INFO SparkEnv: Registering BlockManagerMaster
15/08/06 15:39:05 INFO DiskBlockManager: Created local directory at /private/var/folders/7g/p1nw5zg94yx5cck_6c4jgwh80000gp/T/spark-74145a91-396f-4989-b2c0-5902e32e9e16/blockmgr-511d3fdf-f84a-40dc-b6e5-daace4d3f786
15/08/06 15:39:05 INFO MemoryStore: MemoryStore started with capacity 265.1 MB
15/08/06 15:39:05 INFO HttpFileServer: HTTP File server directory is /private/var/folders/7g/p1nw5zg94yx5cck_6c4jgwh80000gp/T/spark-74145a91-396f-4989-b2c0-5902e32e9e16/httpd-4ce76073-5636-4656-9fba-633fbc1c16f4
15/08/06 15:39:05 INFO HttpServer: Starting HTTP Server
15/08/06 15:39:05 INFO Utils: Successfully started service 'HTTP file server' on port 63058.
15/08/06 15:39:05 INFO SparkEnv: Registering OutputCommitCoordinator
15/08/06 15:39:05 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/08/06 15:39:05 INFO SparkUI: Started SparkUI at http://10.93.235.120:4040
2015-08-06 15:39:06,236:30782(0x1210e7000):ZOO_INFO#log_env#712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2015-08-06 15:39:06,236:30782(0x1210e7000):ZOO_INFO#log_env#716: Client environment:host.name=TOMWATER-M-60SN
2015-08-06 15:39:06,236:30782(0x1210e7000):ZOO_INFO#log_env#723: Client environment:os.name=Darwin
2015-08-06 15:39:06,236:30782(0x1210e7000):ZOO_INFO#log_env#724: Client environment:os.arch=14.4.0
2015-08-06 15:39:06,236:30782(0x1210e7000):ZOO_INFO#log_env#725: Client environment:os.version=Darwin Kernel Version 14.4.0: Thu May 28 11:35:04 PDT 2015; root:xnu-2782.30.5~1/RELEASE_X86_64
2015-08-06 15:39:06,236:30782(0x1210e7000):ZOO_INFO#log_env#733: Client environment:user.name=tomwater
I0806 15:39:06.235976 547205120 sched.cpp:157] Version: 0.23.0
2015-08-06 15:39:06,236:30782(0x1210e7000):ZOO_INFO#log_env#741: Client environment:user.home=/Users/tomwater
2015-08-06 15:39:06,236:30782(0x1210e7000):ZOO_INFO#log_env#753: Client environment:user.dir=/Users/tomwater/development/tools/spark-1.4.1-bin-hadoop2.6/bin
2015-08-06 15:39:06,236:30782(0x1210e7000):ZOO_INFO#zookeeper_init#786: Initiating client connection, host=10.93.193.78:2181,10.93.193.79:2181,10.93.193.80:2181 sessionTimeout=10000 watcher=0x11eca0d00 sessionId=0 sessionPasswd=<null> context=0x7f8f7cffbaf0 flags=0
2015-08-06 15:39:06,333:30782(0x12147c000):ZOO_INFO#check_events#1703: initiated connection to server [10.93.193.78:2181]
2015-08-06 15:39:06,705:30782(0x12147c000):ZOO_INFO#check_events#1750: session establishment complete on server [10.93.193.78:2181], sessionId=0x14f0502209a0006, negotiated timeout=10000
I0806 15:39:06.707475 544960512 group.cpp:313] Group process (group(1)#10.93.235.120:63059) connected to ZooKeeper
I0806 15:39:06.707785 544960512 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0806 15:39:06.707952 544960512 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
I0806 15:39:06.712241 547741696 detector.cpp:138] Detected a new leader: (id='126')
I0806 15:39:06.712530 555130880 group.cpp:656] Trying to get '/mesos/info_0000000126' in ZooKeeper
W0806 15:39:06.714071 544960512 detector.cpp:444] Leading master master#192.168.1.69:5050 is using a Protobuf binary format when registering with ZooKeeper (info): this will be deprecated as of Mesos 0.24 (see MESOS-2340)
I0806 15:39:06.714269 544960512 detector.cpp:481] A new leading master (UPID=master#192.168.1.69:5050) is detected
I0806 15:39:06.714498 544960512 sched.cpp:254] New master detected at master#192.168.1.69:5050
I0806 15:39:06.714643 544960512 sched.cpp:264] No credentials provided. Attempting to register without authentication
I've just had this. Obviously, check that you can talk to the Mesos master node (on port 5050, normally). However, you also need to allow the Mesos master to talk back to your client (it's an ephemeral port, annoyingly).
If you strace it you can see what's going on:
strace -e trace=network -f -s 16384 -o /tmp/strace.log pyspark
Looking at strace.log: first we ask for a random socket and listen on it:
28462 socket(PF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 254
28462 setsockopt(254, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
28462 bind(254, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
28462 getsockname(254, {sa_family=AF_INET, sin_port=htons(46975), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
28462 listen(254, 500000) = 0
Now comes the interesting part: we talk to the Mesos master (10.1.201.191:5050) and tell it our IP and the port we opened (10.1.200.212:46975).
It then talks back to us (the accept()):
28507 connect(258, {sa_family=AF_INET, sin_port=htons(5050), sin_addr=inet_addr("10.1.201.191")}, 16) = -1 EINPROGRESS (Operation now in progress)
28510 getsockopt(258, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
28510 sendto(258, "POST /master/mesos.scheduler.Call HTTP/1.1\r\nUser-Agent: libprocess/scheduler-52db362e-d5dd-4109-97d3-e28e80f2391b#10.1.200.212:46975\r\nLibproce
ss-From: scheduler-52db362e-d5dd-4109-97d3-e28e80f2391b#10.1.200.212:46975\r\nConnection: Keep-Alive\r\nHost: \r\nTransfer-Encoding: chunked\r\n\r\n54\r\n\20\1\32P\n
N\n\6ubuntu\22\fPySparkShell:\34ip-10-1-200-212.ec2.internalJ\30http://10.1.200.212:4040\r\n0\r\n\r\n", 375, MSG_NOSIGNAL, NULL, 0) = 375
28510 accept(254, {sa_family=AF_INET, sin_port=htons(33743), sin_addr=inet_addr("10.1.201.191")}, [16]) = 259
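If opening the whole ephemeral range back from the master is not an option, one workaround (not part of the answer above, just a sketch based on the LIBPROCESS_IP/LIBPROCESS_PORT environment variables that the Mesos scheduler library reads) is to pin the callback address and port before starting the shell, and then allow only that port through the firewall:
# Pin the address/port the framework advertises to the Mesos master
export LIBPROCESS_IP=10.1.200.212   # the client's routable IP (example value from the trace above)
export LIBPROCESS_PORT=9050         # any fixed port you can open in the firewall
./spark-shell --master mesos://zk://10.93.193.78:2181,10.93.193.79:2181,10.93.193.80:2181/mesos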

Pig keeps trying to connect to job history server (and fails)

I'm running a Pig job that fails to connect to the Hadoop job history server.
The task (usually any task with a GROUP BY) runs for a while and then it starts printing messages like:
2015-04-21 19:05:22,825 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2015-04-21 19:05:26,721 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2015-04-21 19:05:29,721 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
It then keeps retrying the connection for a while. Sometimes it proceeds further with the job; other times it throws this exception:
2015-04-21 19:05:55,822 [main] WARN org.apache.pig.tools.pigstats.mapreduce.MRJobStats - Unable to get job counters
java.io.IOException: java.io.IOException: java.net.NoRouteToHostException: No Route to Host from cluster-01/10.10.10.11 to 0.0.0.0:10020 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost
at org.apache.pig.backend.hadoop.executionengine.shims.HadoopShims.getCounters(HadoopShims.java:132)
at org.apache.pig.tools.pigstats.mapreduce.MRJobStats.addCounters(MRJobStats.java:284)
at org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil.addSuccessJobStats(MRPigStatsUtil.java:235)
at org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil.accumulateStats(MRPigStatsUtil.java:165)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:360)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:280)
I found this question here, but in my case the job history server is started. If I run netstat, I find:
tcp 0 0 0.0.0.0:10020 0.0.0.0:* LISTEN 12073/java off (0.00/0/0)
where process 12073 is:
12073 pts/4 Sl 0:07 /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Dproc_historyserver -Xmx1000m -Djava.library.path=/data/hadoop/hadoop/lib -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/data/hadoop/hadoop-2.3.0/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/data/hadoop/hadoop-2.3.0 -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,console -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/data/hadoop/hadoop/logs -Dhadoop.log.file=mapred-hadoop-historyserver-cluster-01.log -Dhadoop.root.logger=INFO,RFA -Dmapred.jobsummary.logger=INFO,JSA -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer
I tried opening port 10020 in case it was a firewall issue:
ACCEPT tcp -- anywhere anywhere tcp dpt:10020
... but no luck.
After a few minutes, some of the tasks just arbitrarily continue to the next part.
I'm using Hadoop 2.3 and Pig 0.14.
My question is:
1) What are the possible reasons why Pig cannot connect to the job history server (JHS), given that the JHS is listening on the very port where Pig looks for it?
... or failing that ...
2) Is there any way to just tell Pig to stop trying to connect to the JHS and continue with the task?
It seems that most Hadoop installation/configuration guides neglect to mention configuring the Job History Server, and Pig in particular relies on this server. It also seems that the default (local) settings for the JHS won't work in a multi-node cluster.
The solution was to add the hostname of the server to the configuration in mapred-site.xml, to make sure it could be accessed from the other machines. (In my version of the file, the lines had to be added as new entries; there were no previous settings.)
<property>
<name>mapreduce.jobhistory.address</name>
<value>cm:10020</value>
<description>Host and port for Job History Server (default 0.0.0.0:10020)</description>
</property>
Then restart the job history server:
mr-jobhistory-daemon.sh stop historyserver
mr-jobhistory-daemon.sh start historyserver
If you get a bind exception (port in use), it means the stop didn't work. Either
Use ps ax | grep -e JobHistory to get the process and kill it manually with kill -9 [pid]. Then call the start command above again. Or
Use a different port in the configuration
Pig should pick up the new settings automatically. Run a Pig script and hope for the best.
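To confirm the new address actually took effect, you can probe port 10020 from one of the worker nodes (a sketch; "cm" is the history-server host from the mapred-site.xml snippet above):
# From a node that runs MapReduce tasks
nc -zv cm 10020     # should report the port as open
# or, without netcat:
telnet cm 10020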
Start the history server from the Hadoop bin directory using the command below:
bin$ ./mr-jobhistory-daemon.sh start historyserver
Then run Pig using the command below:
$ pig
Configure mapreduce.jobhistory.address in hadoop/etc/hadoop/mapred-site.xml,
then:
mapred --daemon start historyserver
The solution was that the history server was not running:
[user@vm9 sbin]$ ./mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /home/user/hadoop-2.7.7/logs/mapred-user-historyserver-vm9.out
[user@vm9 sbin]$ jps
5683 NameNode
6309 NodeManager
5974 SecondaryNameNode
8075 RunJar
6204 ResourceManager
8509 JobHistoryServer
5821 DataNode
8542 Jps
[user@vm9 sbin]$
Now Pig runs properly, it connects to the job history server, and the dump command works fine.
