Unable to access Cloudera Manager 5 web console after installation - hadoop

I am setting up a Hadoop (2.6) cluster with three nodes on CentOS 7, and the cluster itself is running fine. However, I cannot access the Cloudera Manager (5.6) web console after completing the CM installation, even though its services seem to be running.
Below are my findings. What could be the possible reasons?
All processes are up and running:
[root#vm-txxxxxx1 ~]# jps
27978 ResourceManager
15368 Main
27052 Jps
27400 DataNode
27639 SecondaryNameNode
28106 NodeManager
27258 NameNode
Firewall stopped
[root#vm-txxxxx1 ~]# service iptables stop
Redirecting to /bin/systemctl stop iptables.service
[root#vm-txxxxxx1 ~]# service iptabes status
Redirecting to /bin/systemctl status iptabes.service
iptabes.service
Loaded: not-found (Reason: No such file or directory)
Active: inactive (dead)
Mar 24 19:24:05 vm-txxxxx1 systemd[1]: Stopped IPv4 firewall with iptables.
Port 7180 is listening, and I tested it locally:
[root#vm-txxxxxx1 ~]# netstat -tulpn | grep 7180
tcp 0 0 0.0.0.0:7180 0.0.0.0:* LISTEN 15368/java
[root#vm-txxxxx1 ~]# telnet localhost 7180
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
SELINUX Disabled:
[root#vm-txxxxxx1 ~]# getenforce
Disabled
Hostfile entries
[root#vm-txxxxxx1 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4
172.16.xx.x1 vm-txxxxxx1
172.16.xx.x2 vm-xxxxxxx2
172.16.xx.x4 del1-vm-poc04
Verify if Cloudera Manager is running:
[root#vm-txxxxxx1 ~]# service cloudera-scm-server status
cloudera-scm-server.service - LSB: Cloudera SCM Server
Loaded: loaded (/etc/rc.d/init.d/cloudera-scm-server)
Active: active (exited) since Tue 2016-03-22 17:09:55 IST; 2 days ago
Process: 15344 ExecStart=/etc/rc.d/init.d/cloudera-scm-server start (code=exited, status=0/SUCCESS)
Mar 22 17:09:50 vm-txxxxxx1 systemd[1]: Starting LSB: Cloudera SCM Server...
Mar 22 17:09:50 vm-txxxxx1 su[15366]: (to cloudera-scm) root on none
Mar 22 17:09:55 vm-txxxxxx1 cloudera-scm-server[15344]: Starting cloudera-scm-server:...]
Mar 22 17:09:55 vm-txxxxxx1 systemd[1]: Started LSB: Cloudera SCM Server.
Hint: Some lines were ellipsized, use -l to show in full.
Below are lines from the Cloudera server log:
[root#vm-txxxxx1 ~]# tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log
2016-03-24 18:21:00,398 INFO StaleEntityEviction:com.cloudera.server.cmf.StaleEntityEvictionThread: Reaped total of 0 deleted commands
2016-03-24 18:21:00,400 INFO StaleEntityEviction:com.cloudera.server.cmf.StaleEntityEvictionThread: Found no commands older than 2014-03-25T12:51:00.399Z to reap.
2016-03-24 18:21:00,400 INFO StaleEntityEviction:com.cloudera.server.cmf.StaleEntityEvictionThread: Wizard is active, not reaping scanners or configurators
I am accessing the Cloudera Manager page at http://172.16.xx.1x:7180
Eventually the browser says "The connection has timed out". It looks like my HTTP request is not reaching the server, which would explain why nothing shows up in the logs. Please suggest what I might be missing.
Thanks in advance!
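The symptoms described above (a local telnet to port 7180 succeeds, while a remote browser times out) can be narrowed down by probing the port from the client machine. The following is a minimal Python sketch, not part of the original post; the function name probe_port is hypothetical. It separates a timeout (packets are silently dropped somewhere on the path, which is typical of a firewall or routing problem) from a refusal (the host is reachable but nothing accepts connections on that port).

```python
import socket


def probe_port(host: str, port: int, timeout: float = 3.0) -> str:
    """Try one TCP connect and return 'open', 'refused', 'timeout' or 'error: ...'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: reachable, but no listener on the port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout: packets are probably being dropped.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. "No route to host" or a name resolution failure.
        return f"error: {exc}"
```

Run from the machine where the browser is, a call such as probe_port("<CM host IP>", 7180) that returns "timeout" while the same call on the server itself returns "open" would point at something between the two machines rather than at Cloudera Manager.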
@Havnar: Thanks for the suggestion. I have confirmed that SSL is not enabled,
and I am sharing the curl result.
[root#vm-txxxx1 ~]# curl -i -u 'admin:admin' http://localhost:7180/api/v1/tools/echo
HTTP/1.1 200 OK
Expires: Thu, 01-Jan-1970 00:00:00 GMT
Set-Cookie: CLOUDERA_MANAGER_SESSIONID=1etaj5o42vprlndf43ua7rbaf;Path=/;HttpOnly
Content-Type: application/json
Date: Fri, 25 Mar 2016 05:50:36 GMT
Transfer-Encoding: chunked
Server: Jetty(6.1.26.cloudera.4)
{
"message" : "Hello, World!"
}
I stopped and restarted the Cloudera service and found nothing suspicious apart from one warning. I searched for it on Google, but nothing relevant came up.
[root#vm-txxxxx1 ~]# vi /var/log/cloudera-scm-server/cloudera-scm-server.log
2016-03-24 20:22:29,002 WARN main:org.hibernate.cache.ehcache.AbstractEhcacheRegionFactory: HHH020003: Could not find a specific ehcache configuration for cache named [org.hibernate.cache.internal.StandardQueryCache]; using defaults.
2016-03-24 20:22:28,581 INFO main:org.hibernate.engine.jdbc.internal.LobCreatorBuilder: HHH000424: Disabling contextual LOB creation as createClob() method threw error : java.lang.reflect.InvocationTargetException
@Havnar: I didn't understand what you meant by "try a cat on the machine running the CM". Let me know if anything else needs to be checked.
Thanks

Related

AWS Linux 2 AMI Failed to get D-Bus connection: No such file or directory

I have an AWS Linux 2 AMI EC2 instance.
When running systemctl --user status I get the message:
Failed to get D-Bus connection: No such file or directory
I then ran systemctl start dbus.socket, which gave me this message:
Failed to start dbus.socket: The name org.freedesktop.PolicyKit1 was not provided by any .service files
See system logs and 'systemctl status dbus.socket' for details.
I then ran systemctl status dbus.socket -l which returned this:
dbus.socket - D-Bus System Message Bus Socket
Loaded: loaded (/usr/lib/systemd/system/dbus.socket; static; vendor preset: disabled)
Active: active (running) since Thu 2022-03-31 21:26:42 UTC; 14h ago
Listen: /run/dbus/system_bus_socket (Stream)
Mar 31 21:26:42 ip-10-0-0-193.ec2.internal systemd[1]: Listening on D-Bus System Message Bus Socket.
Mar 31 21:26:42 ip-10-0-0-193.ec2.internal systemd[1]: Starting D-Bus System Message Bus Socket.
Running sudo systemctl --user status gives a different error:
Failed to get D-Bus connection: Connection refused
I'm unsure of what to investigate next or what steps to take to resolve the issue.

Cannot start Oracle NoSQL Database on localhost

I am trying to install Oracle NoSQL 18.1.27 on a Mac.
Setup:
$ java -version
java version "1.8.0_221"
Java(TM) SE Runtime Environment (build 1.8.0_221-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.221-b11, mixed mode)
$ echo $KVROOT
/Users/sn/Software/oraclenosql/kvroot
$ echo $KVHOME
/Users/sn/Software/oraclenosql/kv-18.1.27
Used this command to install:
java -jar $KVHOME/lib/kvstore.jar makebootconfig -root $KVROOT -port 5000 -host localhost -storagedir $KVHOME/kvdata/ -harange 5010,5030 -storagedirsize "1 gb" -store-security none
Test using jps:
$ jps -m
8866 Jps -m
8826 kvstore.jar start -root /Users/sn/Software/oraclenosql/kvroot
8831 ManagedService -root /Users/sn/Software/oraclenosql/kvroot -class Admin -service BootstrapAdmin.5000 -config config.xml
Trying to start the db
$ java -jar $KVHOME/lib/kvstore.jar ping -host localhost -port 5000
Could not connect to registry at localhost:5000 Unable to connect to the storage node agent at host localhost, port 5000, which may not be running; nested exception is:
java.rmi.ConnectException: Connection refused to host: localhost; nested exception is:
java.net.ConnectException: Connection refused (Connection refused)
Can't find store topology: Could not contact any RepNode at: [localhost:5000]
And when trying to ping:
SNA at hostname: localhost, registry port: 5000 is not registered.
No further information is available
Can't find store topology: Could not contact any RepNode at: [localhost:5000]
Logs show these:
adminboot.log
2020-03-26 20:05:07.344 UTC INFO [BootstrapAdmin] Starting in bootstrap mode
2020-03-26 20:05:07.348 UTC INFO [BootstrapAdmin] Starting commandService on rmi://localhost:5000/commandService
2020-03-26 20:05:07.448 UTC INFO [BootstrapAdmin] Successfully created a secure proxy for commandService
2020-03-26 20:05:07.531 UTC INFO [BootstrapAdmin] Starting admin:CLIENT_ADMIN on rmi://localhost:5000/admin:CLIENT_ADMIN
2020-03-26 20:05:07.640 UTC INFO [BootstrapAdmin] Successfully created a secure proxy for admin:CLIENT_ADMIN
2020-03-26 20:05:07.713 UTC INFO [BootstrapAdmin] Started AdminService
What am I missing?
You are not starting Oracle NoSQL; you are only trying to ping it. To start it:
$ jps -m    # shows no kvstore service if it is not started
$ nohup java -jar $KVHOME/lib/kvstore.jar start -root $KVROOT &
Press Enter again to get the prompt back after nohup.
$ jps -m    # run it again; it should now show the running process
Note: if everything is configured properly there should be no issue; otherwise it will throw errors. Follow the official documentation and search for the error message.
Thanks,

filebeat failed to connect to elasticsearch

I have Elasticsearch running on Kubernetes (EKS), with filebeat running as daemonset on Kubernetes.
Now I am trying to collect logs from other EC2 machines (outside of EKS), so I installed the same version of Filebeat on EC2 and configured it to send logs to the Elasticsearch instance running on Kubernetes.
However, I cannot see any logs in Elasticsearch (Kibana). Here are the Filebeat logs:
2019-08-26T18:18:16.005Z INFO instance/beat.go:292 Setup Beat: filebeat; Version: 7.2.1
2019-08-26T18:18:16.005Z INFO [index-management] idxmgmt/std.go:178 Set output.elasticsearch.index to 'filebeat-7.2.1' as ILM is enabled.
2019-08-26T18:18:16.005Z INFO elasticsearch/client.go:166 Elasticsearch url: http://elasticsearch.dev.domain.net:9200
2019-08-26T18:18:16.005Z INFO add_cloud_metadata/add_cloud_metadata.go:351 add_cloud_metadata: hosting provider type detected as aws, metadata={"availability_zone":"us-west-2a","instance":{"id":"i-0185e1d68306f95b4"},"machine":{"type":"t2.medium"},"provider":"aws","region":"us-west-2"}
2019-08-26T18:18:16.005Z INFO [publisher] pipeline/module.go:97 Beat name: dev-web1
2019-08-26T18:18:16.006Z INFO elasticsearch/client.go:166 Elasticsearch url: http://elasticsearch.dev.domain.net:9200
There is not much information in the logs.
Then I noticed:
root#dev-web1:~# sudo systemctl status filebeat
● filebeat.service - Filebeat sends log files to Logstash or directly to Elasticsearch.
Loaded: loaded (/lib/systemd/system/filebeat.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2019-08-26 18:18:47 UTC; 18min ago
Docs: https://www.elastic.co/products/beats/filebeat
Main PID: 7768 (filebeat)
CGroup: /system.slice/filebeat.service
└─7768 /usr/share/filebeat/bin/filebeat -e -c /etc/filebeat/filebeat.yml -path.home /usr/share/filebeat -path.config /etc/filebeat -path.data /var/lib/filebeat -path.logs
Aug 26 18:35:38 dev-web1 filebeat[7768]: 2019-08-26T18:35:38.156Z ERROR pipeline/output.go:100 Failed to connect to backoff(elasticsearch(http://elasticsear
Aug 26 18:35:38 dev-web1 filebeat[7768]: 2019-08-26T18:35:38.156Z INFO pipeline/output.go:93 Attempting to reconnect to backoff(elasticsearch(http://elastic
Aug 26 18:35:38 dev-web1 filebeat[7768]: 2019-08-26T18:35:38.156Z INFO [publisher] pipeline/retry.go:189 retryer: send unwait-signal to consumer
Aug 26 18:35:38 dev-web1 filebeat[7768]: 2019-08-26T18:35:38.157Z INFO [publisher] pipeline/retry.go:191 done
Aug 26 18:35:38 dev-web1 filebeat[7768]: 2019-08-26T18:35:38.157Z INFO [publisher] pipeline/retry.go:166 retryer: send wait signal to consumer
Aug 26 18:35:38 dev-web1 filebeat[7768]: 2019-08-26T18:35:38.157Z INFO [publisher] pipeline/retry.go:168 done
Aug 26 18:35:47 dev-web1 filebeat[7768]: 2019-08-26T18:35:47.028Z INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitori
Aug 26 18:36:17 dev-web1 filebeat[7768]: 2019-08-26T18:36:17.028Z INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitori
root#dev-web1:~#
However, the lines in the status output above are truncated, so I cannot read them completely.
So I tried:
root#dev-web1:~# curl elasticsearch.dev.domain.net/_cat/health
1566844775 18:39:35 dev-eks-logs green 3 3 48 24 0 0 0 0 - 100.0%
root#dev-web1:~#
That worked, but the same request with the port did not:
root#dev-web1:~# curl elasticsearch.dev.domain.net:9200/_cat/health
Filebeat has the following config:
output.elasticsearch:
hosts: ["elasticsearch.dev.domain.net"]
username: "elastic"
password: "changeme"
How can I fix this on the Filebeat side?
Telnet Test :
root#dev-web1:~# telnet <ip> 5044
Trying <ip>...
telnet: Unable to connect to remote host: Connection refused
root#dev-web1:~# telnet localhost 5044
Trying 127.0.0.1...
telnet: Unable to connect to remote host: Connection refused
root#dev-web1:~#
https://www.elastic.co/guide/en/beats/filebeat/current/elasticsearch-output.html#hosts-option says:
hosts...If no port is specified, 9200 is used.
Adding hosts: ["elasticsearch.dev.domain.net:80"] to the Filebeat configuration should resolve the issue.
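To make the default-port rule quoted from the docs concrete, here is a small Python sketch. It is an illustration only and not Filebeat's actual implementation; the helper name effective_endpoint and the assumption of an http scheme for entries without one are mine. It shows why a hosts entry without a port ends up pointing at 9200, matching the "Elasticsearch url" line in the Filebeat log, while the working curl request went to port 80.

```python
from urllib.parse import urlsplit

DEFAULT_ES_PORT = 9200  # default named in the Filebeat docs quoted above


def effective_endpoint(host: str, default_port: int = DEFAULT_ES_PORT) -> str:
    """Return 'scheme://hostname:port' for a hosts entry, filling in defaults."""
    # Entries may omit the scheme; assume http so urlsplit can find the netloc.
    if "://" not in host:
        host = "http://" + host
    parts = urlsplit(host)
    port = parts.port if parts.port is not None else default_port
    return f"{parts.scheme}://{parts.hostname}:{port}"


print(effective_endpoint("elasticsearch.dev.domain.net"))
print(effective_endpoint("elasticsearch.dev.domain.net:80"))
```

The first call yields the :9200 URL that Filebeat logs at startup; the second yields the :80 endpoint that curl reached successfully without an explicit port.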
I think this is a network problem. Check with a telnet to localhost/IP on port 5044.

HDFS NFS startup error: “ERROR mount.MountdBase: Failed to start the TCP server...ChannelException: Failed to bind..."

I am trying to start the HDFS NFS gateway by following the docs. I ignored the instructions to stop the rpcbind service and did not start the hadoop portmap service, since the OS is neither SLES 11 nor RHEL 6.2. However, I run into an error when starting the hdfs nfs3 service:
[root#HW02 ~]#
[root#HW02 ~]#
[root#HW02 ~]# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
[root#HW02 ~]#
[root#HW02 ~]#
[root#HW02 ~]# service nfs status
Redirecting to /bin/systemctl status nfs.service
Unit nfs.service could not be found.
[root#HW02 ~]#
[root#HW02 ~]#
[root#HW02 ~]# service nfs stop
Redirecting to /bin/systemctl stop nfs.service
Failed to stop nfs.service: Unit nfs.service not loaded.
[root#HW02 ~]#
[root#HW02 ~]#
[root#HW02 ~]# service rpcbind status
Redirecting to /bin/systemctl status rpcbind.service
● rpcbind.service - RPC bind service
Loaded: loaded (/usr/lib/systemd/system/rpcbind.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2019-07-23 13:48:54 HST; 28s ago
Process: 27337 ExecStart=/sbin/rpcbind -w $RPCBIND_ARGS (code=exited, status=0/SUCCESS)
Main PID: 27338 (rpcbind)
CGroup: /system.slice/rpcbind.service
└─27338 /sbin/rpcbind -w
Jul 23 13:48:54 HW02.ucera.local systemd[1]: Starting RPC bind service...
Jul 23 13:48:54 HW02.ucera.local systemd[1]: Started RPC bind service.
[root#HW02 ~]#
[root#HW02 ~]#
[root#HW02 ~]# hdfs nfs3
19/07/23 13:49:33 INFO nfs3.Nfs3Base: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting Nfs3
STARTUP_MSG: host = HW02.ucera.local/172.18.4.47
STARTUP_MSG: args = []
STARTUP_MSG: version = 3.1.1.3.1.0.0-78
STARTUP_MSG: classpath = /usr/hdp/3.1.0.0-78/hadoop/conf:/usr/hdp/3.1.0.0-78/hadoop/lib/jersey-server-1.19.jar:/usr/hdp/3.1.0.0-78/hadoop/lib/ranger-hdfs-plugin-shim-1.2.0.3.1.0.0-78.jar:
...
<a bunch of other jars>
...
STARTUP_MSG: build = git#github.com:hortonworks/hadoop.git -r e4f82af51faec922b4804d0232a637422ec29e64; compiled by 'jenkins' on 2018-12-06T12:26Z
STARTUP_MSG: java = 1.8.0_112
************************************************************/
19/07/23 13:49:33 INFO nfs3.Nfs3Base: registered UNIX signal handlers for [TERM, HUP, INT]
19/07/23 13:49:33 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
19/07/23 13:49:33 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
19/07/23 13:49:33 INFO impl.MetricsSystemImpl: Nfs3 metrics system started
19/07/23 13:49:33 INFO oncrpc.RpcProgram: Will accept client connections from unprivileged ports
19/07/23 13:49:33 INFO security.ShellBasedIdMapping: Not doing static UID/GID mapping because '/etc/nfs.map' does not exist.
19/07/23 13:49:33 INFO nfs3.WriteManager: Stream timeout is 600000ms.
19/07/23 13:49:33 INFO nfs3.WriteManager: Maximum open streams is 256
19/07/23 13:49:33 INFO nfs3.OpenFileCtxCache: Maximum open streams is 256
19/07/23 13:49:34 INFO nfs3.DFSClientCache: Added export: / FileSystem URI: / with namenodeId: -1408097406
19/07/23 13:49:34 INFO nfs3.RpcProgramNfs3: Configured HDFS superuser is
19/07/23 13:49:34 INFO nfs3.RpcProgramNfs3: Delete current dump directory /tmp/.hdfs-nfs
19/07/23 13:49:34 INFO nfs3.RpcProgramNfs3: Create new dump directory /tmp/.hdfs-nfs
19/07/23 13:49:34 INFO nfs3.Nfs3Base: NFS server port set to: 2049
19/07/23 13:49:34 INFO oncrpc.RpcProgram: Will accept client connections from unprivileged ports
19/07/23 13:49:34 INFO mount.RpcProgramMountd: FS:hdfs adding export Path:/ with URI: hdfs://hw01.ucera.local:8020/
19/07/23 13:49:34 INFO oncrpc.SimpleUdpServer: Started listening to UDP requests at port 4242 for Rpc program: mountd at localhost:4242 with workerCount 1
19/07/23 13:49:34 ERROR mount.MountdBase: Failed to start the TCP server.
org.jboss.netty.channel.ChannelException: Failed to bind to: 0.0.0.0/0.0.0.0:4242
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at org.apache.hadoop.oncrpc.SimpleTcpServer.run(SimpleTcpServer.java:89)
at org.apache.hadoop.mount.MountdBase.startTCPServer(MountdBase.java:83)
at org.apache.hadoop.mount.MountdBase.start(MountdBase.java:98)
at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.startServiceInternal(Nfs3.java:56)
at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.startService(Nfs3.java:69)
at org.apache.hadoop.hdfs.nfs.nfs3.Nfs3.main(Nfs3.java:79)
Caused by: java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
...
...
19/07/23 13:49:34 INFO util.ExitUtil: Exiting with status 1: org.jboss.netty.channel.ChannelException: Failed to bind to: 0.0.0.0/0.0.0.0:4242
19/07/23 13:49:34 INFO nfs3.Nfs3Base: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down Nfs3 at HW02.ucera.local/172.18.4.47
************************************************************/
I am not sure how to interpret the errors seen here. I have not installed any packages such as nfs-utils, assuming that Ambari would have installed all needed packages when the cluster was initially installed.
Any debugging suggestions or solutions for this?
** UPDATE:
After looking at the error, I can see
Caused by: java.net.BindException: Address already in use
and looking into what is already using it, we see...
[root#HW02 ~]# netstat -ltnp | grep 4242
tcp 0 0 0.0.0.0:4242 0.0.0.0:* LISTEN 98067/jsvc.exec
The process jsvc.exec appears to be related to running Java applications. Given that Hadoop runs on Java, I assume it would be bad to just kill the process. Is it not supposed to be on this port (since it interferes with the NFS gateway)? I am not sure what to do about this.
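The BindException in the trace means that another socket already holds the port the mountd TCP server wants. As a rough illustration, the same condition can be reproduced with a few lines of Python. This is a sketch under the assumption that a plain IPv4 TCP bind is a fair stand-in for what the gateway does; the helper name port_in_use is hypothetical. It only reports whether the port is taken, not by whom, which is what the netstat -ltnp call above is for.

```python
import errno
import socket


def port_in_use(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if binding a TCP socket to (host, port) fails with EADDRINUSE."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # some other failure, e.g. permission denied on a privileged port
        return False
```

On the node from the post, port_in_use(4242) would be expected to return True for as long as the existing jsvc.exec process keeps listening there.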
TLDR: the NFS gateway service was already running (by default, apparently), and the process that I thought was blocking the hadoop nfs3 service from starting (jsvc.exec) was, I assume, part of that already-running service.
What made me suspect this was that the process also stopped when I shut down the cluster, plus the fact that it was using the port I needed for NFS. I confirmed it by following the verification steps in the docs and seeing that my output was similar to what is expected:
[root#HW02 ~]# rpcinfo -p hw02
program vers proto port service
100000 4 tcp 111 portmapper
100000 3 tcp 111 portmapper
100000 2 tcp 111 portmapper
100000 4 udp 111 portmapper
100000 3 udp 111 portmapper
100000 2 udp 111 portmapper
100005 1 udp 4242 mountd
100005 2 udp 4242 mountd
100005 3 udp 4242 mountd
100005 1 tcp 4242 mountd
100005 2 tcp 4242 mountd
100005 3 tcp 4242 mountd
100003 3 tcp 2049 nfs
[root#HW02 ~]# showmount -e hw02
Export list for hw02:
/ *
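The verification step above can also be checked mechanically. The sketch below parses rpcinfo -p output and tests whether mountd and nfs are registered on the ports seen in this post. It is only an illustration: the function name parse_rpcinfo is hypothetical, the sample text is an abridged copy of the output above, and the expected ports (4242 and 2049) are taken from this cluster's logs rather than from any general rule.

```python
from typing import Set, Tuple


def parse_rpcinfo(text: str) -> Set[Tuple[str, str, int]]:
    """Parse `rpcinfo -p` output into a set of (service, proto, port) tuples."""
    entries = set()
    for line in text.splitlines():
        fields = line.split()
        # Data rows have five columns: program, vers, proto, port, service.
        # The header row is skipped because its first field is not numeric.
        if len(fields) == 5 and fields[0].isdigit():
            _program, _vers, proto, port, service = fields
            entries.add((service, proto, int(port)))
    return entries


# Abridged sample copied from the output in this post.
SAMPLE = """\
program vers proto port service
100000 4 tcp 111 portmapper
100005 3 udp 4242 mountd
100005 3 tcp 4242 mountd
100003 3 tcp 2049 nfs
"""

registered = parse_rpcinfo(SAMPLE)
gateway_registered = {("mountd", "tcp", 4242), ("nfs", "tcp", 2049)} <= registered
print(gateway_registered)
```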
Another thing that could have told me that the jsvc process was part of an already running HDFS NFS service would have been checking the process info:
[root#HW02 ~]# ps -feww | grep jsvc
root 61106 59083 0 14:27 pts/2 00:00:00 grep --color=auto jsvc
root 163179 1 0 12:14 ? 00:00:00 jsvc.exec -Dproc_nfs3 -outfile /var/log/hadoop/root/hadoop-hdfs-root-nfs3-HW02.ucera.local.out -errfile /var/log/hadoop/root/privileged-root-nfs3-HW02.ucera.local.err -pidfile /var/run/hadoop/root/hadoop-hdfs-root-nfs3.pid -nodetach -user hdfs -cp /usr/hdp/3.1.0.0-78/hadoop/conf:...
...
hdfs 163193 163179 0 12:14 ? 00:00:17 jsvc.exec -Dproc_nfs3 -outfile /var/log/hadoop/root/hadoop-hdfs-root-nfs3-HW02.ucera.local.out -errfile /var/log/hadoop/root/privileged-root-nfs3-HW02.ucera.local.err -pidfile /var/run/hadoop/root/hadoop-hdfs-root-nfs3.pid -nodetach -user hdfs -cp /usr/hdp/3.1.0.0-78/hadoop/conf:...
Seeing jsvc.exec -Dproc_nfs3 ... there gives the hint that jsvc (which apparently is for running Java apps on Linux) was being used to run the very nfs3 service I was trying to start.
For anyone else with this problem, note that I did not stop all the services that the docs tell you to stop (since I am using CentOS 7):
[root#HW01 /]# service nfs status
Redirecting to /bin/systemctl status nfs.service
● nfs-server.service - NFS server and services
Loaded: loaded (/usr/lib/systemd/system/nfs-server.service; disabled; vendor preset: disabled)
Active: inactive (dead)
[root#HW01 /]# service rpcbind status
Redirecting to /bin/systemctl status rpcbind.service
● rpcbind.service - RPC bind service
Loaded: loaded (/usr/lib/systemd/system/rpcbind.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2019-07-19 15:17:02 HST; 6 days ago
Main PID: 2155 (rpcbind)
CGroup: /system.slice/rpcbind.service
└─2155 /sbin/rpcbind -w
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
Also note that I did not apply any of the config file settings recommended in the docs, and some of the properties mentioned there could not even be found in the Ambari-managed HDFS configs. If anyone can explain why this still works for me despite that, please do.
** Update:
After talking with some people more experienced with HDP (v3.1) than me, it seems the docs that I linked to for setting up NFS for HDFS may not be totally up to date (at least when setting up NFS via Ambari management).
A cluster node can act as an NFS gateway by checking it off as an NFS node in the Ambari host management UI.
The needed configs can be set in the HDFS management UI.
You can confirm that the HDFS NFS gateway is running by looking at the Host > Summary > Components section in Ambari.

One node in hadoop cluster failure

I recently configured a 10-node HDP Hadoop cluster; each node runs SLES 11.
On the master node I have configured all master services and clients, as well as ambari-server. The remaining nodes run the slave services and their clients.
NTP sync is on, and the other prerequisites are also fine.
I am experiencing weird behavior on the cluster: within a few hours of starting all the services, one of the nodes goes down.
The first time this happened, I restarted that particular node and added it back to the cluster.
Now my master node has the same issue, which brings the whole cluster down. I have checked the logs, but I see no indication of what caused the failure.
I am clueless about the root cause of these node failures in the Hadoop cluster.
Below are the logs.
the system which went down:
/var/log/messages
these are /var/log/messages:
notice)=0', processed='source(src)=6830'
Apr 23 05:22:43 lnx1863 SuSEfirewall2: SuSEfirewall2 not active
Apr 23 05:23:49 lnx1863 SuSEfirewall2: SuSEfirewall2 not active
Apr 23 05:24:17 lnx1863 sudo: root : TTY=pts/0 ; PWD=/ ; USER=root ; COMMAND=/usr/bin/du -h /
Apr 23 05:24:55 lnx1863 SuSEfirewall2: SuSEfirewall2 not active
Apr 23 05:25:22 lnx1863 kernel: [248531.127254] megasas: Found FW in FAULT state, will reset adapter.
Apr 23 05:25:22 lnx1863 kernel: [248531.127260] megaraid_sas: resetting fusion adapter.
Apr 23 05:25:22 lnx1863 kernel: [248531.127427] megaraid_sas: Reset not supported, killing adapter.
namenode logs:-
INFO 2015-04-23 05:27:43,665 Heartbeat.py:78 - Building Heartbeat: {responseId = 7607, timestamp = 1429781263665, commandsInProgress = False, componentsMapped = True}
INFO 2015-04-23 05:28:44,053 security.py:135 - Encountered communication error. Details: SSLError('The read operation timed out',)
ERROR 2015-04-23 05:28:44,053 Controller.py:278 - Connection to http://localhost was lost (details=Request to https://localhost:8441/agent/v1/heartbeat/localhostip failed due to Error occured during connecting to the server: The read operation timed out)
INFO 2015-04-23 05:29:16,061 NetUtil.py:48 - Connecting to https://localhost:8440/connection_info
INFO 2015-04-23 05:29:16,118 security.py:93 - SSL Connect being called.. connecting to the server

Resources