One node in hadoop cluster failure - hadoop

I recently configured a 10-node HDP Hadoop cluster; every node runs SLES 11.
On the master node I configured all master services and clients, as well as the ambari-server. The remaining nodes run the slave services and their clients.
NTP sync is on, and the other prerequisites are also in place.
I am seeing weird behavior on the cluster: within a few hours of starting all the services, one of the nodes goes down.
The first time this happened, I restarted that particular node and added it back to the cluster.
Now my master node is showing the same issue, which brings the whole cluster down. I have checked the logs, but I see nothing that points to the failure.
What could be the root cause of a node failing like this in a Hadoop cluster?
Below are the logs.
The system which went down:
/var/log/messages:
notice)=0', processed='source(src)=6830'
Apr 23 05:22:43 lnx1863 SuSEfirewall2: SuSEfirewall2 not active
Apr 23 05:23:49 lnx1863 SuSEfirewall2: SuSEfirewall2 not active
Apr 23 05:24:17 lnx1863 sudo: root : TTY=pts/0 ; PWD=/ ; USER=root ; COMMAND=/usr/bin/du -h /
Apr 23 05:24:55 lnx1863 SuSEfirewall2: SuSEfirewall2 not active
Apr 23 05:25:22 lnx1863 kernel: [248531.127254] megasas: Found FW in FAULT state, will reset adapter.
Apr 23 05:25:22 lnx1863 kernel: [248531.127260] megaraid_sas: resetting fusion adapter.
Apr 23 05:25:22 lnx1863 kernel: [248531.127427] megaraid_sas: Reset not supported, killing adapter.
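The kernel messages above are the only entries in this excerpt that report a hardware-level event shortly before the node went down. As a minimal sketch, assuming the syslog format shown above, lines like these can be pulled out of a log with a small filter. The names `RAID_EVENT` and `find_raid_faults` are made up for this illustration and are not part of any Hadoop or Ambari tooling:

```python
import re

# Matches kernel messages from the MegaRAID SAS driver, as seen in the log above.
RAID_EVENT = re.compile(r"kernel:.*\b(megasas|megaraid_sas)\b")


def find_raid_faults(lines):
    """Return the syslog lines in which the kernel reports a MegaRAID driver event."""
    return [line.rstrip("\n") for line in lines if RAID_EVENT.search(line)]


# Two entries taken from the excerpt above, used here as sample input.
sample = [
    "Apr 23 05:24:55 lnx1863 SuSEfirewall2: SuSEfirewall2 not active",
    "Apr 23 05:25:22 lnx1863 kernel: [248531.127254] megasas: Found FW in FAULT state, will reset adapter.",
]
for hit in find_raid_faults(sample):
    print(hit)
```

To scan a real log, pass an open file instead of the sample list, for example `find_raid_faults(open("/var/log/messages", errors="replace"))`.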
namenode logs:
INFO 2015-04-23 05:27:43,665 Heartbeat.py:78 - Building Heartbeat: {responseId = 7607, timestamp = 1429781263665, commandsInProgress = False, componentsMapped = True}
INFO 2015-04-23 05:28:44,053 security.py:135 - Encountered communication error. Details: SSLError('The read operation timed out',)
ERROR 2015-04-23 05:28:44,053 Controller.py:278 - Connection to http://localhost was lost (details=Request to https://localhost:8441/agent/v1/heartbeat/localhostip failed due to Error occured during connecting to the server: The read operation timed out)
INFO 2015-04-23 05:29:16,061 NetUtil.py:48 - Connecting to https://localhost:8440/connection_info
INFO 2015-04-23 05:29:16,118 security.py:93 - SSL Connect being called.. connecting to the server

Related

filebeat failed to connect to elasticsearch

I have Elasticsearch running on Kubernetes (EKS), with filebeat running as a daemonset on Kubernetes.
Now I am trying to collect logs from other EC2 machines (outside of EKS), so I installed the same version of filebeat on an EC2 instance and configured it to send logs to the Elasticsearch running on Kubernetes.
But I am not able to see any logs in Elasticsearch (Kibana). Here are the filebeat logs:
2019-08-26T18:18:16.005Z INFO instance/beat.go:292 Setup Beat: filebeat; Version: 7.2.1
2019-08-26T18:18:16.005Z INFO [index-management] idxmgmt/std.go:178 Set output.elasticsearch.index to 'filebeat-7.2.1' as ILM is enabled.
2019-08-26T18:18:16.005Z INFO elasticsearch/client.go:166 Elasticsearch url: http://elasticsearch.dev.domain.net:9200
2019-08-26T18:18:16.005Z INFO add_cloud_metadata/add_cloud_metadata.go:351 add_cloud_metadata: hosting provider type detected as aws, metadata={"availability_zone":"us-west-2a","instance":{"id":"i-0185e1d68306f95b4"},"machine":{"type":"t2.medium"},"provider":"aws","region":"us-west-2"}
2019-08-26T18:18:16.005Z INFO [publisher] pipeline/module.go:97 Beat name: dev-web1
2019-08-26T18:18:16.006Z INFO elasticsearch/client.go:166 Elasticsearch url: http://elasticsearch.dev.domain.net:9200
Not much info in the logs.
Then I noticed:
root@dev-web1:~# sudo systemctl status filebeat
● filebeat.service - Filebeat sends log files to Logstash or directly to Elasticsearch.
Loaded: loaded (/lib/systemd/system/filebeat.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2019-08-26 18:18:47 UTC; 18min ago
Docs: https://www.elastic.co/products/beats/filebeat
Main PID: 7768 (filebeat)
CGroup: /system.slice/filebeat.service
└─7768 /usr/share/filebeat/bin/filebeat -e -c /etc/filebeat/filebeat.yml -path.home /usr/share/filebeat -path.config /etc/filebeat -path.data /var/lib/filebeat -path.logs
Aug 26 18:35:38 dev-web1 filebeat[7768]: 2019-08-26T18:35:38.156Z ERROR pipeline/output.go:100 Failed to connect to backoff(elasticsearch(http://elasticsear
Aug 26 18:35:38 dev-web1 filebeat[7768]: 2019-08-26T18:35:38.156Z INFO pipeline/output.go:93 Attempting to reconnect to backoff(elasticsearch(http://elastic
Aug 26 18:35:38 dev-web1 filebeat[7768]: 2019-08-26T18:35:38.156Z INFO [publisher] pipeline/retry.go:189 retryer: send unwait-signal to consumer
Aug 26 18:35:38 dev-web1 filebeat[7768]: 2019-08-26T18:35:38.157Z INFO [publisher] pipeline/retry.go:191 done
Aug 26 18:35:38 dev-web1 filebeat[7768]: 2019-08-26T18:35:38.157Z INFO [publisher] pipeline/retry.go:166 retryer: send wait signal to consumer
Aug 26 18:35:38 dev-web1 filebeat[7768]: 2019-08-26T18:35:38.157Z INFO [publisher] pipeline/retry.go:168 done
Aug 26 18:35:47 dev-web1 filebeat[7768]: 2019-08-26T18:35:47.028Z INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitori
Aug 26 18:36:17 dev-web1 filebeat[7768]: 2019-08-26T18:36:17.028Z INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitori
root@dev-web1:~#
But I can't read the complete lines in the status output above.
So I tried:
root@dev-web1:~# curl elasticsearch.dev.domain.net/_cat/health
1566844775 18:39:35 dev-eks-logs green 3 3 48 24 0 0 0 0 - 100.0%
root@dev-web1:~#
which worked, but the same request with the port did not:
root@dev-web1:~# curl elasticsearch.dev.domain.net:9200/_cat/health
filebeat has the following config:
output.elasticsearch:
  hosts: ["elasticsearch.dev.domain.net"]
  username: "elastic"
  password: "changeme"
How can I fix this on the filebeat side?
Telnet test:
root@dev-web1:~# telnet <ip> 5044
Trying <ip>...
telnet: Unable to connect to remote host: Connection refused
root@dev-web1:~# telnet localhost 5044
Trying 127.0.0.1...
telnet: Unable to connect to remote host: Connection refused
root@dev-web1:~#
https://www.elastic.co/guide/en/beats/filebeat/current/elasticsearch-output.html#hosts-option says:
hosts...If no port is specified, 9200 is used.
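Your curl without a port succeeded, which means Elasticsearch is reachable on port 80 (the HTTP default) rather than 9200. Setting hosts: ["elasticsearch.dev.domain.net:80"] in the filebeat configuration should resolve the issue.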
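To make the default-port rule concrete, here is a small sketch of how a host entry without a port ends up pointing at 9200. This is not filebeat's own code: `effective_host` is a hypothetical helper written for illustration, and it only handles plain `name` and `name:port` entries (no scheme, path, or IPv6 literals).

```python
DEFAULT_ES_PORT = 9200  # default stated in the filebeat documentation quoted above


def effective_host(host: str, default_port: int = DEFAULT_ES_PORT) -> str:
    """Return "name:port", filling in the default port when none is given."""
    name, sep, port = host.partition(":")
    return f"{name}:{port}" if sep and port else f"{name}:{default_port}"


print(effective_host("elasticsearch.dev.domain.net"))     # elasticsearch.dev.domain.net:9200
print(effective_host("elasticsearch.dev.domain.net:80"))  # elasticsearch.dev.domain.net:80
```

The first call shows why the config in the question ends up on port 9200, where nothing answers; the second shows the entry with an explicit port.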
I think this is a network problem; try a telnet to localhost/IP on port 5044.

Elastic 2.3.4. Node Startup Quiet Failure

We are using a 5 node cluster hosted in Google Cloud (Ubuntu 16.04 LTS) and we noticed that one of the node's disk space was at 90%+ so we shut down the node with:
sudo service elasticsearch stop
and then stopped the instance in the GCP console.
After upgrading the node's disk space, we tried starting elastic again using:
sudo service elasticsearch start
This command seems to fail silently, and the SSH session terminates after freezing momentarily. Nothing shows in the node's elasticsearch logs, and nothing shows up in the current cluster's master elasticsearch logs either. The only hint we can find of something going wrong is in the node's syslog:
Jan 25 15:48:29 elasticsearch-1-vm systemd[1]: Started Cleanup of Temporary Directories.
Jan 25 15:48:29 elasticsearch-1-vm systemd[1]: Starting Elasticsearch...
Jan 25 15:48:29 elasticsearch-1-vm systemd[1]: Started Elasticsearch.
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.597729] kernel tried to execute NX-protected page - exploit attempt? (uid: 113)
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.605545] BUG: unable to handle kernel paging request at 00007f896d5467c0
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.612621] IP: 0x7f896d5467c0
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.615779] PGD 80000003050ee067
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.615780] P4D 80000003050ee067
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.619199] PUD 30508d067
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.622626] PMD 305162067
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.625438] PTE 80000003df15b867
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.628245]
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.633174] Oops: 0011 [#1] SMP PTI
The cluster health with 4 nodes is green, and we can't seem to figure out why this may be happening.
Any ideas on why this may be happening would be very helpful.
Here is our config located in /etc/default/elasticsearch:
https://gist.github.com/deppi/58826c38ea8414d301eb034e9a29cd54
Also here is our /etc/elasticsearch/elasticsearch.yml
https://gist.github.com/deppi/17b1f28e649ee528b0fe2ca93a2ff19c
The only thing I can think that might be causing this issue is discovery.zen.minimum_master_nodes: 2
When maybe it should be configured as
discovery.zen.minimum_master_nodes: 3
But we are not certain this is the issue, and we don't want to risk breaking our Elasticsearch cluster any further.
From experience, shutting a node down with the elasticsearch service command is not always reliable: we had issues with nodes that were not entirely down and then tried to take over the master role. That may be why you can see 2 nodes while your node is no longer part of the cluster.
What you should do is shut down the Elasticsearch process on each node, unless you are still indexing on the two nodes. In that case, shut your cluster down properly:
Stop log collection first every time you need to stop Elasticsearch (that is, Logstash, if you are using the stack)
Then stop Elasticsearch itself: https://www.elastic.co/guide/en/elasticsearch/reference/master/stopping-elasticsearch.html
Start your first nodes and give the protocol time to take place
Start Elasticsearch on the other nodes and check that all of them join the cluster
If they do not, your config might be the problem; I would use 1 master node and 3 slaves, and a different data path. Whenever you need to shut down your cluster, stop the collection, stop the queuing, then stop the storage (Elasticsearch), node by node.
This seems to be an issue with a new kernel that has been deployed on GCP for the Ubuntu 16.04 LTS OS.
Problem Kernel:
uname -a
Linux elasticsearch-1-vm 4.13.0-1007-gcp #10-Ubuntu SMP Fri Jan 12 13:56:47 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Proper Kernel:
uname -a
Linux elasticsearch-1-vm 4.13.0-1006-gcp #9-Ubuntu SMP Mon Jan 8 21:13:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
To fix the issue with the GCP instances, I ran:
sudo apt remove 4.13.0-1007-gcp
sudo apt install 4.13.0-1006-gcp
exit
Then restart the instance in the Google Cloud console, SSH back in, and run:
sudo service elasticsearch start
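Before starting Elasticsearch on the other nodes, it may help to check whether they picked up the same kernel. A minimal sketch, assuming the only problematic release is the one reported above; `KNOWN_BAD_RELEASES` and `is_known_bad_kernel` are names invented for this example:

```python
import platform

# Kernel release reported as problematic in the answer above.
KNOWN_BAD_RELEASES = {"4.13.0-1007-gcp"}


def is_known_bad_kernel(release: str) -> bool:
    """Return True if the given kernel release is on the known-bad list."""
    return release in KNOWN_BAD_RELEASES


current = platform.release()  # the kernel release, as printed by `uname -r`
status = "known bad" if is_known_bad_kernel(current) else "not on the known-bad list"
print(f"{current}: {status}")
```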

KrbException connecting to Hadoop cluster with Zookeeper client - UNKNOWN_SERVER

My Zookeeper client is having trouble connecting to the Hadoop cluster.
This works fine from a Linux VM, but I am using a Mac.
I set the -Dsun.security.krb5.debug=true flag on the JVM and get the following output:
Found ticket for solr@DDA.MYCO.COM to go to krbtgt/DDA.MYCO.COM@DDA.MYCO.COM expiring on Sat Apr 29 03:15:04 BST 2017
Entered Krb5Context.initSecContext with state=STATE_NEW
Found ticket for solr@DDA.MYCO.COM to go to krbtgt/DDA.MYCO.COM@DDA.MYCO.COM expiring on Sat Apr 29 03:15:04 BST 2017
Service ticket not found in the subject
>>> Credentials acquireServiceCreds: same realm
Using builtin default etypes for default_tgs_enctypes
default etypes for default_tgs_enctypes: 17 16 23.
>>> CksumType: sun.security.krb5.internal.crypto.RsaMd5CksumType
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
>>> KrbKdcReq send: kdc=oc-10-252-132-139.nat-ucfc2z3b.usdv1.mycloud.com UDP:88, timeout=30000, number of retries =3, #bytes=682
>>> KDCCommunication: kdc=oc-10-252-132-139.nat-ucfc2z3b.usdv1.mycloud.com UDP:88, timeout=30000,Attempt =1, #bytes=682
>>> KrbKdcReq send: #bytes read=217
>>> KdcAccessibility: remove oc-10-252-132-139.nat-ucfc2z3b.usdv1.mycloud.com
>>> KDCRep: init() encoding tag is 126 req type is 13
>>>KRBError:
cTime is Thu Dec 24 11:18:15 GMT 2015 1450955895000
sTime is Fri Apr 28 15:15:06 BST 2017 1493388906000
suSec is 925863
error code is 7
error Message is Server not found in Kerberos database
cname is solr@DDA.MYCO.COM
sname is zookeeper/oc-10-252-132-160.nat-ucfc2z3b.usdv1.mycloud.com@DDA.MYCO.COM
msgType is 30
KrbException: Server not found in Kerberos database (7) - UNKNOWN_SERVER
at sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:73)
at sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:251)
at sun.security.krb5.KrbTgsReq.sendAndGetCreds(KrbTgsReq.java:262)
at sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:308)
at sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:126)
at sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:458)
at sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:693)
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:248)
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
at org.apache.zookeeper.client.ZooKeeperSaslClient$2.run(ZooKeeperSaslClient.java:366)
at org.apache.zookeeper.client.ZooKeeperSaslClient$2.run(ZooKeeperSaslClient.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.zookeeper.client.ZooKeeperSaslClient.createSaslToken(ZooKeeperSaslClient.java:362)
at org.apache.zookeeper.client.ZooKeeperSaslClient.createSaslToken(ZooKeeperSaslClient.java:348)
at org.apache.zookeeper.client.ZooKeeperSaslClient.sendSaslPacket(ZooKeeperSaslClient.java:420)
at org.apache.zookeeper.client.ZooKeeperSaslClient.initialize(ZooKeeperSaslClient.java:458)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1057)
Caused by: KrbException: Identifier doesn't match expected value (906)
at sun.security.krb5.internal.KDCRep.init(KDCRep.java:140)
at sun.security.krb5.internal.TGSRep.init(TGSRep.java:65)
at sun.security.krb5.internal.TGSRep.<init>(TGSRep.java:60)
at sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:55)
... 18 more
ERROR 2017-04-28 15:15:07,046 5539 org.apache.zookeeper.client.ZooKeeperSaslClient [main-SendThread(oc-10-252-132-160.nat-ucfc2z3b.usdv1.mycloud.com:2181)]
An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed
[Caused by GSSException: No valid credentials provided
(Mechanism level: Server not found in Kerberos database (7) - UNKNOWN_SERVER)])
occurred when evaluating Zookeeper Quorum Member's received SASL token.
This may be caused by Java's being unable to resolve the Zookeeper Quorum Member's hostname correctly.
You may want to try to adding '-Dsun.net.spi.nameservice.provider.1=dns,sun' to your client's JVMFLAGS environment.
Zookeeper Client will go to AUTH_FAILED state.
I've tested Kerberos config as follows:
>kinit -kt /etc/security/keytabs/solr.headless.keytab solr
>klist
Credentials cache: API:3451691D-7D5E-49FD-A27C-135816F33E4D
Principal: solr@DDA.MYCO.COM
Issued Expires Principal
Apr 28 16:58:02 2017 Apr 29 04:58:02 2017 krbtgt/DDA.MYCO.COM@DDA.MYCO.COM
Following the instructions from hortonworks I managed to get the kerberos ticket stored in a file:
>klist -c FILE:/tmp/krb5cc_501
Credentials cache: FILE:/tmp/krb5cc_501
Principal: solr@DDA.MYCO.COM
Issued Expires Principal
Apr 28 17:10:25 2017 Apr 29 05:10:25 2017 krbtgt/DDA.MYCO.COM@DDA.MYCO.COM
I also tried the JVM option suggested in the stack trace (-Dsun.net.spi.nameservice.provider.1=dns,sun), but this led to a different error along the lines of "Client session timed out", which suggests that this JVM parameter prevents the client from connecting correctly in the first place.
==EDIT==
Seems that the Mac version of Kerberos is not the latest:
> krb5-config --version
Kerberos 5 release 1.7-prerelease
I just tried brew install krb5 to install a newer version, then adjusting the path to point to the new version.
> krb5-config --version
Kerberos 5 release 1.15.1
This has had no effect whatsoever on the outcome.
NB this works fine from a linux VM on my Mac, using exactly the same jaas.conf, keytab files, and krb5.conf.
krb5.conf:
[libdefaults]
  renew_lifetime = 7d
  forwardable = true
  default_realm = DDA.MYCO.COM
  ticket_lifetime = 24h
  dns_lookup_realm = false
  dns_lookup_kdc = false
[realms]
  DDA.MYCO.COM = {
    admin_server = oc-10-252-132-139.nat-ucfc2z3b.usdv1.mycloud.com
    kdc = oc-10-252-132-139.nat-ucfc2z3b.usdv1.mycloud.com
  }
Reverse DNS:
I checked that the FQDN hostname I'm connecting to can be found using a reverse DNS lookup:
> host 10.252.132.160
160.132.252.10.in-addr.arpa domain name pointer oc-10-252-132-160.nat-ucfc2z3b.usdv1.mycloud.com.
This is exactly as per the response to the same command from the linux VM.
===WIRESHARK ANALYSIS===
Using Wireshark configured to use the system key tabs allows a bit more detail in the analysis.
Here I have found that a failed call looks like this:
client -> host AS-REQ
host -> client AS-REP
client -> host AS-REQ
host -> client AS-REP
client -> host TGS-REQ <-- this call is detailed below
host -> client KRB error KRB5KDC_ERR_S_PRINCIPAL_UNKNOWN
The erroneous TGS-REQ call shows the following:
Kerberos
tgs-req
pvno: 5
msg-type: krb-tgs-req (12)
padata: 1 item
req-body
Padding: 0
kdc-options: 40000000 (forwardable)
realm: DDA.MYCO.COM
sname
name-type: kRB5-NT-UNKNOWN (0)
sname-string: 2 items
SNameString: zookeeper
SNameString: oc-10-252-134-51.nat-ucfc2z3b.usdv1.mycloud.com
till: 1970-01-01 00:00:00 (UTC)
nonce: 797021964
etype: 3 items
ENCTYPE: eTYPE-AES128-CTS-HMAC-SHA1-96 (17)
ENCTYPE: eTYPE-DES3-CBC-SHA1 (16)
ENCTYPE: eTYPE-ARCFOUR-HMAC-MD5 (23)
Here is the corresponding successful call from the linux box, which is followed by several more exchanges.
Kerberos
tgs-req
pvno: 5
msg-type: krb-tgs-req (12)
padata: 1 item
req-body
Padding: 0
kdc-options: 40000000 (forwardable)
realm: DDA.MYCO.COM
sname
name-type: kRB5-NT-UNKNOWN (0)
sname-string: 2 items
SNameString: zookeeper
SNameString: d59407.ddapoc.ucfc2z3b.usdv1.mycloud.com
till: 1970-01-01 00:00:00 (UTC)
nonce: 681936272
etype: 3 items
ENCTYPE: eTYPE-AES128-CTS-HMAC-SHA1-96 (17)
ENCTYPE: eTYPE-DES3-CBC-SHA1 (16)
ENCTYPE: eTYPE-ARCFOUR-HMAC-MD5 (23)
So it looks like the client is sending
oc-10-252-134-51.nat-ucfc2z3b.usdv1.mycloud.com
as the server host, when it should be sending:
d59407.ddapoc.ucfc2z3b.usdv1.mycloud.com
So the question is, how do I fix that? Bear in mind this is a Java piece of code.
My /etc/hosts has the following:
10.252.132.160 b3e073.ddapoc.ucfc2z3b.usdv1.mycloud.com
10.252.134.51 d59407.ddapoc.ucfc2z3b.usdv1.mycloud.com
10.252.132.139 d7cc18.ddapoc.ucfc2z3b.usdv1.mycloud.com
And my krb5.conf file has:
kdc = d7cc18.ddapoc.ucfc2z3b.usdv1.mycloud.com
kdc = b3e073.ddapoc.ucfc2z3b.usdv1.mycloud.com
kdc = d59407.ddapoc.ucfc2z3b.usdv1.mycloud.com
I tried adding -Dsun.net.spi.nameservice.provider.1=file,dns as a JVM param but got the same result.
I fixed this by setting up a local dnsmasq instance to supply the forward and reverse DNS lookups.
So now from the command line, host d59407.ddapoc.ucfc2z3b.usdv1.mycloud.com returns 10.252.134.51
See also here and here.
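The failed request in the question carried a different host name than the working one, which is consistent with the service name being derived from a reverse lookup of the server's IP. A quick way to spot that kind of mismatch is to resolve a name forward and then back again. The sketch below is a hypothetical helper, not part of ZooKeeper or the JDK; the resolvers are injectable so that the example can run with the names and addresses quoted in the question instead of hitting real DNS.

```python
import socket


def dns_round_trip(hostname, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve hostname -> IP -> name and report whether the two names agree."""
    ip = forward(hostname)
    reverse_name = reverse(ip)[0]  # gethostbyaddr returns (name, aliases, addresses)
    same = reverse_name.rstrip(".").lower() == hostname.rstrip(".").lower()
    return ip, reverse_name, same


# Stand-in resolvers that replay the values quoted in the question
# (the /etc/hosts entry and the `host 10.252.132.160` output).
def fake_forward(name):
    return "10.252.132.160"


def fake_reverse(ip):
    return ("oc-10-252-132-160.nat-ucfc2z3b.usdv1.mycloud.com", [], [ip])


print(dns_round_trip("b3e073.ddapoc.ucfc2z3b.usdv1.mycloud.com", fake_forward, fake_reverse))
```

Called with just a host name, the function uses the system resolver; a `False` in the last position means the forward and reverse names disagree, which is the situation a consistent DNS setup (such as the dnsmasq instance above) removes.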
This looks like a DNS issue.
Could this SO question help you resolve your problem?
Also, here is a Q&A about the problem.
It could also be caused by using a non-Sun JVM.

Unable to access Cloudera Manager 5 web console after installation

I am setting up a Hadoop (2.6) cluster with three nodes on CentOS 7 machines, and the cluster is running fine now. However, I am not able to access the Cloudera Manager (5.6) web console after completing the CM installation, even though its services seem to be running.
Below are my findings; please help me identify the possible reasons.
All processes are up and running:
[root@vm-txxxxxx1 ~]# jps
27978 ResourceManager
15368 Main
27052 Jps
27400 DataNode
27639 SecondaryNameNode
28106 NodeManager
27258 NameNode
Firewall stopped:
[root@vm-txxxxx1 ~]# service iptables stop
Redirecting to /bin/systemctl stop iptables.service
[root@vm-txxxxxx1 ~]# service iptabes status
Redirecting to /bin/systemctl status iptabes.service
iptabes.service
Loaded: not-found (Reason: No such file or directory)
Active: inactive (dead)
Mar 24 19:24:05 vm-txxxxx1 systemd[1]: Stopped IPv4 firewall with iptables.
The server is listening on port 7180, and I tested it locally:
[root@vm-txxxxxx1 ~]# netstat -tulpn | grep 7180
tcp 0 0 0.0.0.0:7180 0.0.0.0:* LISTEN 15368/java
[root@vm-txxxxx1 ~]# telnet localhost 7180
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
SELinux disabled:
[root@vm-txxxxxx1 ~]# getenforce
Disabled
Hosts file entries:
[root@vm-txxxxxx1 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4
172.16.xx.x1 vm-txxxxxx1
172.16.xx.x2 vm-xxxxxxx2
172.16.xx.x4 del1-vm-poc04
Checking whether Cloudera Manager is running:
[root@vm-txxxxxx1 ~]# service cloudera-scm-server status
cloudera-scm-server.service - LSB: Cloudera SCM Server
Loaded: loaded (/etc/rc.d/init.d/cloudera-scm-server)
Active: active (exited) since Tue 2016-03-22 17:09:55 IST; 2 days ago
Process: 15344 ExecStart=/etc/rc.d/init.d/cloudera-scm-server start (code=exited, status=0/SUCCESS)
Mar 22 17:09:50 vm-txxxxxx1 systemd[1]: Starting LSB: Cloudera SCM Server...
Mar 22 17:09:50 vm-txxxxx1 su[15366]: (to cloudera-scm) root on none
Mar 22 17:09:55 vm-txxxxxx1 cloudera-scm-server[15344]: Starting cloudera-scm-server:...]
Mar 22 17:09:55 vm-txxxxxx1 systemd[1]: Started LSB: Cloudera SCM Server.
Hint: Some lines were ellipsized, use -l to show in full.
Below are the lines from the Cloudera server logs:
[root@vm-txxxxx1 ~]# tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log
2016-03-24 18:21:00,398 INFO StaleEntityEviction:com.cloudera.server.cmf.StaleEntityEvictionThread: Reaped total of 0 deleted commands
2016-03-24 18:21:00,400 INFO StaleEntityEviction:com.cloudera.server.cmf.StaleEntityEvictionThread: Found no commands older than 2014-03-25T12:51:00.399Z to reap.
2016-03-24 18:21:00,400 INFO StaleEntityEviction:com.cloudera.server.cmf.StaleEntityEvictionThread: Wizard is active, not reaping scanners or configurators
I am accessing the Cloudera Manager page at http://172.16.xx.1x:7180
and in the end the browser reports "The connection has timed out". It looks like my HTTP request is not reaching the server, which would explain why nothing shows up in the logs. Please let me know if I am missing something.
Thanks in advance!
@Havnar: Thanks for the suggestion. I can confirm that SSL is not enabled,
and here is the curl result:
[root@vm-txxxx1 ~]# curl -i -u 'admin:admin' http://localhost:7180/api/v1/tools/echo
HTTP/1.1 200 OK
Expires: Thu, 01-Jan-1970 00:00:00 GMT
Set-Cookie: CLOUDERA_MANAGER_SESSIONID=1etaj5o42vprlndf43ua7rbaf;Path=/;HttpOnly
Content-Type: application/json
Date: Fri, 25 Mar 2016 05:50:36 GMT
Transfer-Encoding: chunked
Server: Jetty(6.1.26.cloudera.4)
{
"message" : "Hello, World!"
}
I stopped and restarted the Cloudera service and found nothing suspicious, apart from one warning that looks a little odd. I searched for it on Google, but nothing relevant came up.
[root@vm-txxxxx1 ~]# vi /var/log/cloudera-scm-server/cloudera-scm-server.log
2016-03-24 20:22:29,002 WARN main:org.hibernate.cache.ehcache.AbstractEhcacheRegionFactory: HHH020003: Could not find a specific ehcache configuration for cache named [org.hibernate.cache.internal.StandardQueryCache]; using defaults.
2016-03-24 20:22:28,581 INFO main:org.hibernate.engine.jdbc.internal.LobCreatorBuilder: HHH000424: Disabling contextual LOB creation as createClob() method threw error : java.lang.reflect.InvocationTargetException
@Havnar: I didn't understand what you meant by "try a cat on the machine running the CM". Let me know if anything else needs to be checked.
Thanks
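Since the local checks succeed and only the remote browser times out, one more data point is whether a plain TCP connection to port 7180 can be opened from the machine where the browser runs. A minimal sketch for that check; `tcp_reachable` is a name made up for this example, and the host in the usage comment is a placeholder to be replaced with the server's address:

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and resolution failures
        return False


# Usage, run from the client machine rather than from the server itself:
#   tcp_reachable("<cloudera-manager-ip>", 7180)
```

A `False` result from the client combined with a working `telnet localhost 7180` on the server would point at something between the two machines (for example a firewall or routing rule) rather than at Cloudera Manager itself.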

Running Selenium Grid through Vagrant

I'm trying to migrate from running my Selenium server and client all on my Mac to having the servers run in a Vagrant VM and the clients run locally on my Mac.
I'm using Vagrant 1.4.3 running on Mac OS X 10.9.1 to launch an Ubuntu 13.10 VM. Once the VM is launched, I install Java, Node.js and a few other dependencies that are required for my testing environment. After installing Selenium 2.39.0 (the latest as of this writing), here are the relevant configurations.
I SSH into my Vagrant VM and run the following:
java -jar /usr/local/bin/selenium-server-standalone-*.jar \
-role hub \
-trustAllSSLCertificates \
-hubConfig /vagrant/hub.json
/vagrant on the VM maps to the root of my project directory on my local Mac. Here's the relevant config from my Vagrantfile.
config.vm.box = "saucy64"
config.vm.box_url = "http://cloud-images.ubuntu.com/vagrant/saucy/20140202/saucy-server-cloudimg-amd64-vagrant-disk1.box"
# ...
config.vm.define "testing" do |test|
  test.vm.network :forwarded_port, guest: 3444, host: 4444
  test.vm.network :private_network, ip: "192.168.50.6"
  # ...
end
Here is the Hub config that the Selenium Grid Hub is using on the Vagrant VM. Selenium Hub uses port 3444 inside the VM, which is portmapped to 4444 outside the VM, facing my Mac.
{
"browserTimeout": 180000,
"capabilityMatcher": "org.openqa.grid.internal.utils.DefaultCapabilityMatcher",
"cleanUpCycle": 2000,
"maxSession": 5,
"newSessionWaitTimeout": -1,
"nodePolling": 2000,
"port": 3444,
"throwOnCapabilityNotPresent": true,
"timeout": 30000
}
Here's how I launch Selenium on my Mac as a node.
java -jar selenium-server-standalone-*.jar \
-role node \
-trustAllSSLCertificates \
-nodeConfig node.mac.json
And here's the node config which tries to talk to the Hub running inside Vagrant.
{
"capabilities": [
{
"platform": "MAC",
"seleniumProtocol": "WebDriver",
"browserName": "firefox",
"maxInstances": 1
},
{
"platform": "MAC",
"seleniumProtocol": "WebDriver",
"browserName": "chrome",
"maxInstances": 1
}
],
"configuration": {
"proxy": "org.openqa.grid.selenium.proxy.DefaultRemoteProxy",
"hubHost": "127.0.0.1",
"hubPort": 4444,
"hub": "http://127.0.0.1:4444/grid/register",
"maxSession": 1,
"port": 4445,
"register": true,
"registerCycle": 2000,
"remoteHost": "http://127.0.0.1:4445",
"role": "node",
"url": "http://127.0.0.1:4445"
}
}
Here's what I get in the Terminal on the Mac side.
Feb 02, 2014 9:29:07 PM org.openqa.grid.selenium.GridLauncher main
INFO: Launching a selenium grid node
21:29:18.706 INFO - Java: Oracle Corporation 24.51-b03
21:29:18.706 INFO - OS: Mac OS X 10.9.1 x86_64
21:29:18.713 INFO - v2.39.0, with Core v2.39.0. Built from revision ff23eac
21:29:18.773 INFO - Default driver org.openqa.selenium.ie.InternetExplorerDriver registration is skipped: registration capabilities Capabilities [{platform=WINDOWS, ensureCleanSession=true, browserName=internet explorer, version=}] does not match with current platform: MAC
21:29:18.802 INFO - RemoteWebDriver instances should connect to: http://127.0.0.1:4445/wd/hub
21:29:18.803 INFO - Version Jetty/5.1.x
21:29:18.804 INFO - Started HttpContext[/selenium-server/driver,/selenium-server/driver]
21:29:18.804 INFO - Started HttpContext[/selenium-server,/selenium-server]
21:29:18.804 INFO - Started HttpContext[/,/]
21:29:18.864 INFO - Started org.openqa.jetty.jetty.servlet.ServletHandler@593aa24f
21:29:18.864 INFO - Started HttpContext[/wd,/wd]
21:29:18.866 INFO - Started SocketListener on 0.0.0.0:4445
21:29:18.867 INFO - Started org.openqa.jetty.jetty.Server@48ef85f3
21:29:18.867 INFO - using the json request : {"class":"org.openqa.grid.common.RegistrationRequest","capabilities":[{"platform":"MAC","seleniumProtocol":"WebDriver","browserName":"firefox","maxInstances":1},{"platform":"MAC","seleniumProtocol":"WebDriver","browserName":"chrome","maxInstances":1},{"platform":"MAC","seleniumProtocol":"WebDriver","browserName":"iphone","maxInstances":1},{"platform":"MAC","seleniumProtocol":"WebDriver","browserName":"ipad","maxInstances":1}],"configuration":{"nodeConfig":"node.mac.json","port":4445,"host":"192.168.50.1","hubHost":"127.0.0.1","registerCycle":2000,"trustAllSSLCertificates":"","hub":"http://127.0.0.1:4444/grid/register","url":"http://127.0.0.1:4445","remoteHost":"http://127.0.0.1:4445","register":true,"proxy":"org.openqa.grid.selenium.proxy.DefaultRemoteProxy","maxSession":1,"role":"node","hubPort":4444}}
21:29:18.868 INFO - Starting auto register thread. Will try to register every 2000 ms.
21:29:18.868 INFO - Registering the node to hub :http://127.0.0.1:4444/grid/register
21:30:25.079 INFO - Registering the node to hub :http://127.0.0.1:4444/grid/register
21:31:31.254 INFO - Registering the node to hub :http://127.0.0.1:4444/grid/register
21:32:35.416 INFO - Registering the node to hub :http://127.0.0.1:4444/grid/register
21:33:41.581 INFO - Registering the node to hub :http://127.0.0.1:4444/grid/register
21:34:47.752 INFO - Registering the node to hub :http://127.0.0.1:4444/grid/register
21:35:51.908 INFO - Registering the node to hub :http://127.0.0.1:4444/grid/register
21:36:56.045 INFO - Registering the node to hub :http://127.0.0.1:4444/grid/register
21:38:00.189 INFO - Registering the node to hub :http://127.0.0.1:4444/grid/register
Lastly, here's what I get in the Terminal on the Vagrant VM side.
Feb 03, 2014 5:28:53 AM org.openqa.grid.selenium.GridLauncher main
INFO: Launching a selenium grid server
2014-02-03 05:28:54.780:INFO:osjs.Server:jetty-7.x.y-SNAPSHOT
2014-02-03 05:28:54.811:INFO:osjsh.ContextHandler:started o.s.j.s.ServletContextHandler{/,null}
2014-02-03 05:28:54.823:INFO:osjs.AbstractConnector:Started SocketConnector@0.0.0.0:3444
Feb 03, 2014 5:29:20 AM org.openqa.grid.selenium.proxy.DefaultRemoteProxy isAlive
WARNING: Failed to check status of node: Connection refused
Feb 03, 2014 5:29:22 AM org.openqa.grid.selenium.proxy.DefaultRemoteProxy isAlive
WARNING: Failed to check status of node: Connection refused
Feb 03, 2014 5:29:22 AM org.openqa.grid.selenium.proxy.DefaultRemoteProxy onEvent
WARNING: Marking the node as down. Cannot reach the node for 2 tries.
Feb 03, 2014 5:29:24 AM org.openqa.grid.selenium.proxy.DefaultRemoteProxy isAlive
WARNING: Failed to check status of node: Connection refused
Feb 03, 2014 5:29:26 AM org.openqa.grid.selenium.proxy.DefaultRemoteProxy isAlive
WARNING: Failed to check status of node: Connection refused
Feb 03, 2014 5:29:28 AM org.openqa.grid.selenium.proxy.DefaultRemoteProxy isAlive
WARNING: Failed to check status of node: Connection refused
Feb 03, 2014 5:29:30 AM org.openqa.grid.selenium.proxy.DefaultRemoteProxy isAlive
WARNING: Failed to check status of node: Connection refused
Feb 03, 2014 5:29:32 AM org.openqa.grid.selenium.proxy.DefaultRemoteProxy isAlive
WARNING: Failed to check status of node: Connection refused
Google returns nothing useful for this situation. Can anybody help me determine why the Hub and the Node can't talk to each other?
I have a similar setup where my Selenium server (the hub) is on a remote VM and a client (the node) is on my local machine. I've been seeing the same error:
Feb 04, 2014 5:29:22 PM org.openqa.grid.selenium.proxy.DefaultRemoteProxy isAlive
WARNING: Failed to check status of node: Connection refused
Feb 04, 2014 5:29:22 PM org.openqa.grid.selenium.proxy.DefaultRemoteProxy onEvent
WARNING: Marking the node as down. Cannot reach the node for 2 tries.
I talked to our Ops team, and they told me that my VM sits on a different network in a different location. Even though the node machine can reach the hub, the hub can never reach the node; it is like a one-way street. They suggested getting another VM that sits on the same network.
Hope it helps.
I don't know much about Selenium, but I guess the issue is the use of 127.0.0.1. In particular, the VM has no way to connect back to the host at that address, and you don't forward port 4445.
As you already specify a private_network address (192.168.50.6), you could try to use it directly, without any port forwarding.
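As a rough sketch of that suggestion, the node config could be generated with the private network addresses instead of 127.0.0.1. The hub IP and ports come from the Vagrantfile and hub config in the question. The node IP is an assumption: 192.168.50.1 is the "host" value that appears in the node's registration log, and it should be verified before use. `build_node_config` is a helper name invented here, not a Selenium API.

```python
import json

HUB_IP = "192.168.50.6"   # private_network address from the Vagrantfile
HUB_PORT = 3444           # port the hub listens on inside the VM
NODE_IP = "192.168.50.1"  # assumption: the Mac's address on that network ("host" in the node log)
NODE_PORT = 4445          # port the node listens on, as in the question


def build_node_config():
    """Build a node config dict that uses the private network in both directions."""
    node_url = f"http://{NODE_IP}:{NODE_PORT}"
    return {
        "capabilities": [
            {
                "platform": "MAC",
                "seleniumProtocol": "WebDriver",
                "browserName": browser,
                "maxInstances": 1,
            }
            for browser in ("firefox", "chrome")
        ],
        "configuration": {
            "proxy": "org.openqa.grid.selenium.proxy.DefaultRemoteProxy",
            "host": NODE_IP,
            "hubHost": HUB_IP,
            "hubPort": HUB_PORT,
            "hub": f"http://{HUB_IP}:{HUB_PORT}/grid/register",
            "maxSession": 1,
            "port": NODE_PORT,
            "register": True,
            "registerCycle": 2000,
            "remoteHost": node_url,
            "role": "node",
            "url": node_url,
        },
    }


print(json.dumps(build_node_config(), indent=2))
```

The printed JSON can be saved as node.mac.json. With this layout the hub reaches the node at 192.168.50.1:4445 and the node reaches the hub at 192.168.50.6:3444, so the forwarded port is no longer involved.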
The first answer was partially correct. You have to ensure that the communication path is clear in both directions, from the node to the hub and from the hub to the node, on the specific ports. Technically you are running two servers: one on the node listening on one port, and one on the hub listening on another.
Try this:
I had the same problem, but fixed it by adding the host field:
"host": [ip or hostname of node],
Here is my node config file:
{
"capabilities":[
{
"platform":"MAC",
"browserName":"firefox",
"version":"28",
"maxInstances":1
},
{
"platform":"MAC",
"browserName":"chrome",
"version":"34",
"maxInstances":1
}
],
"configuration":{
"port": 5556,
"hubPort": 5555,
"host": "10.50.10.101", //this is the ip of my node
"hubHost":"10.50.10.100", //this is ip of my grid hub
"nodePolling":2500,
"registerCycle":10500,
"register":true,
"cleanUpCycle":2500,
"maxSession":5,
"role":"node"
}
}
