PostgreSQL-12.1 :: Streaming Replication Error

I have set up streaming replication on PostgreSQL 12.1.
The master and slave are configured as below, and WAL files are accumulating on the master.
However, something is wrong: I get complaints that the WAL files are missing after a pg_restore on the MASTER.
MASTER
postgres@srvm:~$
2019-12-20 16:35:07.910 CET [1334] replicator@[unknown] ERROR: requested WAL segment 000000010000000100000076 has already been removed
2019-12-20 16:35:12.920 CET [1338] replicator@[unknown] ERROR: requested WAL segment 000000010000000100000076 has already been removed
2019-12-20 16:35:17.925 CET [1340] replicator@[unknown] ERROR: requested WAL segment 000000010000000100000076 has already been removed
2019-12-20 16:35:22.932 CET [1362] replicator@[unknown] ERROR: requested WAL segment 000000010000000100000076 has already been removed
2019-12-20 16:35:27.935 CET [1364] replicator@[unknown] ERROR: requested WAL segment 000000010000000100000076 has already been removed
2019-12-20 16:35:32.942 CET [1365] replicator@[unknown] ERROR: requested WAL segment 000000010000000100000076 has already been removed
2019-12-20 16:35:37.948 CET [1366] replicator@[unknown] ERROR: requested WAL segment 000000010000000100000076 has already been removed
2019-12-20 16:35:42.954 CET [1367] replicator@[unknown] ERROR: requested WAL segment 000000010000000100000076 has already been removed
SLAVE
postgres@srvs:~$
2019-12-20 16:36:53.027 CET [21978] LOG: started streaming WAL from primary at 1/76000000 on timeline 1
2019-12-20 16:36:53.027 CET [21978] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000100000076 has already been removed
2019-12-20 16:36:58.029 CET [21979] LOG: started streaming WAL from primary at 1/76000000 on timeline 1
2019-12-20 16:36:58.029 CET [21979] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000100000076 has already been removed
2019-12-20 16:37:03.040 CET [21980] LOG: started streaming WAL from primary at 1/76000000 on timeline 1
2019-12-20 16:37:03.040 CET [21980] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000100000076 has already been removed
2019-12-20 16:37:08.042 CET [21981] LOG: started streaming WAL from primary at 1/76000000 on timeline 1
2019-12-20 16:37:08.042 CET [21981] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000100000076 has already been removed
Then pg_basebackup is run and the slave is started.
The slave has all the data as of the time of the backup, but no new data from the WAL files, and it keeps logging the error above.
max_wal_senders = 10
wal_keep_segments = 120
What have I mis-configured? Do we need to enable archive_mode = on for streaming replication?
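A hedged sketch, not the invocation used above: pg_basebackup can stream the WAL generated during the backup with -X stream, and on PostgreSQL 12 the -R flag writes primary_conninfo and standby.signal for you; that closes most of the window in which the master can recycle segments the new slave still needs. The data directory below is an assumption; adjust to your layout.

# Run on the slave, with PostgreSQL stopped and an empty data directory
pg_basebackup -h [PRIMARY_IP] -p 5432 -U replicator -D /var/lib/postgresql/12/main -X stream -R -P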

Here is my configuration for streaming replication:
Master config:
listen_addresses = 'localhost,[IP_ADDRESS_OF_PRIMARY_ON_LAN]' # what IP address(es) to listen on;
wal_level = 'replica'
archive_mode = on
archive_command = 'cd .'
max_wal_senders = 5
primary_conninfo = 'host=[REPLICA_IP] port=5432 user=replication password=[REPLICATION PASSWORD]'
hot_standby = on
max_wal_senders = 10
wal_keep_segments = 48
Slave config:
listen_addresses = 'localhost,[IP_ADDRESS_OF_REPLICA_ON_LAN]' # what IP address(es) to listen on;
max_connections = 100 # Ensure that this value is the same as the primary's
wal_level = 'replica'
archive_mode = on
archive_command = 'cd .'
max_wal_senders = 5
primary_conninfo = 'host=[PRIMARY_IP] port=5432 user=replication password=[REPLICATION PASSWORD]'
hot_standby = on
max_wal_senders = 10
wal_keep_segments = 48
I set up my server using the following resource: https://www.gab.lc/articles/postgresql-12-replication/
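A hedged suggestion rather than a confirmed fix: on PostgreSQL 12 the master only keeps wal_keep_segments worth of WAL for standbys unless a physical replication slot pins it, so a standby that falls further behind than that (for example during a long restore) sees exactly "requested WAL segment ... has already been removed". A minimal sketch, assuming a slot named standby1 (any unused name works):

# On the master, create the slot once:
psql -c "SELECT pg_create_physical_replication_slot('standby1');"
# On the slave, reference it next to primary_conninfo in postgresql.conf:
#   primary_slot_name = 'standby1'

The trade-off is that the master then retains WAL for as long as the slave is down, so disk usage needs watching. Also, archive_mode = on is not required for streaming replication itself; WAL archiving is a separate safety net.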

Related

PostgreSQL: could not stat directory "base/<db_oid>": Unknown error

I am getting an error when starting the PostgreSQL server after a machine restart on Windows Server 2019.
This server has multiple databases, and the logs show an error suggesting that one of the databases may be corrupted.
I tried restarting the server a few times, but it shuts down again after some time. How can I bring the server up again with the other databases running?
2022-04-20 07:32:49.843 PDT [3720] LOG: starting PostgreSQL 13.3, compiled by Visual C++ build 1914, 64-bit
2022-04-20 07:32:49.844 PDT [3720] LOG: listening on IPv6 address "::", port 5432
2022-04-20 07:32:49.844 PDT [3720] LOG: listening on IPv4 address "0.0.0.0", port 5432
2022-04-20 07:32:57.949 PDT [6600] LOG: database system was interrupted while in recovery at 2022-04-20 06:29:00 PDT
2022-04-20 07:32:57.949 PDT [6600] HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.
2022-04-20 07:34:29.599 PDT [852] FATAL: the database system is starting up
2022-04-20 07:44:04.349 PDT [2480] FATAL: the database system is starting up
2022-04-20 07:52:28.107 PDT [4800] FATAL: the database system is starting up
2022-04-20 08:09:30.847 PDT [1872] FATAL: the database system is starting up
2022-04-20 08:21:41.884 PDT [6600] LOG: could not stat file "./base/22504705": Unknown error
2022-04-20 08:21:47.121 PDT [6600] LOG: database system was not properly shut down; automatic recovery in progress
2022-04-20 08:22:01.957 PDT [6600] LOG: redo starts at 22E/5419D678
2022-04-20 08:22:01.994 PDT [6600] FATAL: could not stat directory "base/22504705": Unknown error
2022-04-20 08:22:01.994 PDT [6600] CONTEXT: WAL redo at 22E/5422EDD8 for Heap/LOCK: off 3: xid 26378859: flags 0x00 LOCK_ONLY KEYSHR_LOCK
2022-04-20 08:22:02.000 PDT [3720] LOG: startup process (PID 6600) exited with exit code 1
2022-04-20 08:22:02.000 PDT [3720] LOG: aborting startup due to startup process failure
2022-04-20 08:22:02.085 PDT [3720] LOG: database system is shut down
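Not an authoritative fix, but the FATAL line shows WAL redo failing because the directory for database OID 22504705 cannot be stat'ed. A hedged first check (the path assumes a default PostgreSQL 13 install on Windows) is whether that directory still exists and is readable by the service account:

:: With the service stopped, check the directory redo is complaining about
dir "C:\Program Files\PostgreSQL\13\data\base\22504705"

:: If it is missing or unreadable, recovery cannot replay WAL that touches it.
:: Restoring that directory (or the whole cluster) from the last file-level
:: backup is the usual way forward, as the HINT in the log suggests.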

OKD4.4 compute node failed with "Internal Server Error"

I tried to deploy OKD 4.4 on my home cluster using the following doc =>
https://medium.com/@craig_robinson/openshift-4-4-okd-bare-metal-install-on-vmware-home-lab-6841ce2d37eb
The "services", "bootstrap" and "control-plane" nodes went smoothly (at least the output on screen is similar to those in the doc).
However, when I deployed the "compute" (worker) nodes, it failed to startup with the following error:
ignition[xxx]: GET https://api-int.lab.xxxtest.com:22623/config/worker: attempt #xxx
ignition[xxx]: GET result: Internal Server Error
A check on the bootstrap node (journalctl -u bootkube | grep bootkube.sh | tail):
[root@okd4-bootstrap openshift]# journalctl -u bootkube | grep bootkube.sh | tail
Apr 07 05:22:14 okd4-bootstrap.lab.xxxtest.com bootkube.sh[4838]: Error: unhealthy cluster
Apr 07 05:22:14 okd4-bootstrap.lab.xxxtest.com bootkube.sh[4838]: etcdctl failed. Retrying in 5 seconds...
Apr 07 05:22:24 okd4-bootstrap.lab.xxxtest.com bootkube.sh[4838]: {"level":"warn","ts":"2020-04-07T05:22:24.872Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-57584517-34e6-40c3-b945-0b920fb059e6/localhost:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Apr 07 05:22:24 okd4-bootstrap.lab.xxxtest.com bootkube.sh[4838]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Apr 07 05:22:24 okd4-bootstrap.lab.xxxtest.com bootkube.sh[4838]: Error: unhealthy cluster
Apr 07 05:22:24 okd4-bootstrap.lab.xxxtest.com bootkube.sh[4838]: etcdctl failed. Retrying in 5 seconds...
Apr 07 05:22:35 okd4-bootstrap.lab.xxxtest.com bootkube.sh[4838]: {"level":"warn","ts":"2020-04-07T05:22:35.347Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-304bfb54-2184-4c01-acdb-86850fbe9b8d/localhost:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp [::1]:2379: connect: connection refused\""}
Apr 07 05:22:35 okd4-bootstrap.lab.xxxtest.com bootkube.sh[4838]: https://localhost:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Apr 07 05:22:35 okd4-bootstrap.lab.xxxtest.com bootkube.sh[4838]: Error: unhealthy cluster
Apr 07 05:22:35 okd4-bootstrap.lab.xxxtest.com bootkube.sh[4838]: etcdctl failed. Retrying in 5 seconds...
[root@okd4-bootstrap openshift]#
Any idea what could have gone wrong?
It seems the bootstrap node is trying to start/connect to etcd on localhost (the bootstrap node itself).
Thanks.
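Not a definitive diagnosis, but the worker's ignition config is served by the machine-config-server behind api-int on port 22623, and the bootkube log says etcd never became healthy. Two hedged checks (hostnames taken from the question) are whether api-int resolves where you expect, and whether the endpoint answers at all:

# Does api-int point at the load balancer / bootstrap as intended?
dig +short api-int.lab.xxxtest.com

# Fetch the worker config by hand; -k skips TLS verification
curl -k -I https://api-int.lab.xxxtest.com:22623/config/worker

# On a control-plane node: has the etcd container actually started?
sudo crictl ps | grep etcd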

ERROR: hbase:meta table is not consistent

Error while creating a table in HBase: "ERROR: java.io.IOException: Table Namespace Manager not ready yet, try again later."
hbck -fix shows ERROR: hbase:meta is not found on any region.
The error appeared after a fresh start of the HBase shell session; no errors were reported in the master log during startup.
The last HBase session closed properly, but ZooKeeper did not (I suspect this as the reason for the meta table corruption).
I am able to list the tables created earlier:
hbase(main):001:0> list
TABLE
IDX_STOCK_SYMBOL
Patient
STOCK_SYMBOL
STOCK_SYMBOL_BKP
SYSTEM.CATALOG
SYSTEM.FUNCTION
SYSTEM.SEQUENCE
SYSTEM.STATS
8 row(s) in 1.7930 seconds
Creating a table named custmaster:
hbase(main):002:0> create 'custmaster', 'customer'
ERROR: java.io.IOException: Table Namespace Manager not ready yet, try again later
at org.apache.hadoop.hbase.master.HMaster.getNamespaceDescriptor(HMaster.java:3179)
at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:1735)
at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:1774)
at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:40470)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Workaround: running hbck to identify inconsistencies
[hduser@master ~]$ hbase hbck
>Version: 0.98.4-hadoop2
>Number of live region servers: 2
>Number of dead region servers: 0
>Master: master,60000,1538793456542
>Number of backup masters: 0
>Average load: 0.0
>Number of requests: 11
>Number of regions: 0
>Number of regions in transition: 1
>
>ERROR: META region or some of its attributes are null.
>ERROR: hbase:meta is not found on any region.
>ERROR: hbase:meta table is not consistent. Run HBCK with proper fix options to fix hbase:meta inconsistency. Exiting...
.
.
.
>Summary: >
>3 inconsistencies detected.
>Status: INCONSISTENT
Ran hbck with the -details option to identify the tables involved:
[hduser@master ~]$ hbase hbck -details
>ERROR: META region or some of its attributes are null.
>ERROR: hbase:meta is not found on any region.
>ERROR: hbase:meta table is not consistent. Run HBCK with proper fix options to fix hbase:meta inconsistency. Exiting...
>Summary:
>3 inconsistencies detected.
>Status: INCONSISTENT
The output of -details clearly shows that meta is not found on any region.
Tried running hbase hbck -fixMeta, but it returned the same result as above.
Hence tried hbase hbck -fix.
This command ran for some time with the prompt "Trying to fix a problem with hbase:meta.." and resulted in the error below:
[hduser@master ~]$ hbase hbck -fix
Version: 0.98.4-hadoop2
Number of live region servers: 2
Number of dead region servers: 0
Master: master,60000,1538793456542
Number of backup masters: 0
Average load: 0.0
Number of requests: 19
Number of regions: 0
Number of regions in transition: 1
ERROR: META region or some of its attributes are null.
ERROR: hbase:meta is not found on any region.
Trying to fix a problem with hbase:meta..
2018-10-06 09:01:03,424 INFO [main] client.HConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
2018-10-06 09:01:03,425 INFO [main] client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x166473bbe720005
2018-10-06 09:01:03,432 INFO [main] zookeeper.ZooKeeper: Session: 0x166473bbe720005 closed
2018-10-06 09:01:03,432 INFO [main-EventThread] zookeeper.ClientCnxn: EventThread shut down
Exception in thread "main" org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=35, exceptions:
Sat Oct 06 08:52:13 IST 2018, org.apache.hadoop.hbase.client.RpcRetryingCaller@18920cc, org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.PleaseHoldException): org.apache.hadoop.hbase.PleaseHoldException: Master is initializing
at org.apache.hadoop.hbase.master.HMaster.checkInitialized(HMaster.java:2416)
at org.apache.hadoop.hbase.master.HMaster.assignRegion(HMaster.java:2472)
at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:40456)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Sat Oct 06 08:52:13 IST 2018, org.apache.hadoop.hbase.client.RpcRetryingCaller@18920cc, org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.PleaseHoldException): org.apache.hadoop.hbase.PleaseHoldException: Master is initializing
at org.apache.hadoop.hbase.master.HMaster.checkInitialized(HMaster.java:2416)
at org.apache.hadoop.hbase.master.HMaster.assignRegion(HMaster.java:2472)
at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:40456)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
How can I resolve this issue?
Thanks in advance!
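A hedged aside for this HBase version (0.98), before the workaround that follows: when hbck -fix cannot run because the master itself is stuck initializing, the offline meta rebuild tool is commonly suggested; it rebuilds hbase:meta from the .regioninfo files in HDFS and must only be run with the cluster completely stopped:

# With HBase fully stopped (master and all region servers):
hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair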
I had not checked the NameNode and DataNode logs, but when I did, the real issue turned out to be corrupt files in HDFS.
I ran hadoop fsck / to check the health of the file system:
[hduser@master ~]$ hadoop fsck /
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
18/10/06 09:52:00 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Connecting to namenode via http://master:50070/fsck?ugi=hduser&path=%2F
FSCK started by hduser (auth:SIMPLE) from /192.168.1.11 for path / at Sat Oct 06 09:52:02 IST 2018
.................................................................................
/user/hduser/hbase/.hbck/hbase-1538798774320/data/hbase/meta/1588230740/info/359783d4cd07419598264506bac92dcf: CORRUPT blockpool BP-1664228054-192.168.1.11-1535828595216 block blk_1073744002
/user/hduser/hbase/.hbck/hbase-1538798774320/data/hbase/meta/1588230740/info/359783d4cd07419598264506bac92dcf: MISSING 1 blocks of total size 3934 B.........
/user/hduser/hbase/data/default/IDX_STOCK_SYMBOL/a27db76f84487a05f3e1b8b74c13fa78/0/c595bf49443f4daf952df6cdaad79181: CORRUPT blockpool BP-1664228054-192.168.1.11-1535828595216 block blk_1073744000
/user/hduser/hbase/data/default/IDX_STOCK_SYMBOL/a27db76f84487a05f3e1b8b74c13fa78/0/c595bf49443f4daf952df6cdaad79181: MISSING 1 blocks of total size 1354 B............
/user/hduser/hbase/data/default/SYSTEM.CATALOG/d63574fdd00e8bf3882fcb6bd53c3d83/0/dcb68bbb5e394d19b06db7f298810de0: CORRUPT blockpool BP-1664228054-192.168.1.11-1535828595216 block blk_1073744001
/user/hduser/hbase/data/default/SYSTEM.CATALOG/d63574fdd00e8bf3882fcb6bd53c3d83/0/dcb68bbb5e394d19b06db7f298810de0: MISSING 1 blocks of total size 2283 B...........................Status: CORRUPT
Total size: 4232998 B
Total dirs: 109
Total files: 129
Total symlinks: 0
Total blocks (validated): 125 (avg. block size 33863 B)
********************************
UNDER MIN REPL'D BLOCKS: 3 (2.4 %)
dfs.namenode.replication.min: 1
CORRUPT FILES: 3
MISSING BLOCKS: 3
MISSING SIZE: 7571 B
CORRUPT BLOCKS: 3
********************************
Minimally replicated blocks: 122 (97.6 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 1.952
Corrupt blocks: 3
Missing replicas: 0 (0.0 %)
Number of data-nodes: 2
Number of racks: 1
FSCK ended at Sat Oct 06 09:52:02 IST 2018 in 66 milliseconds
The filesystem under path '/' is CORRUPT
And I ran hdfs fsck with the -delete option to remove the corrupt files, which fixed the issue.
A detailed explanation of cleaning up the HDFS filesystem is available here --> How to fix corrupt HDFS Files
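For reference, a minimal sketch of that cleanup using stock HDFS flags; note that -delete permanently removes the affected files, so it is only appropriate when the data is expendable or restorable:

# First list just the corrupt files/blocks
hdfs fsck / -list-corruptfileblocks

# Then delete the corrupt files (irreversible data loss for those files)
hdfs fsck / -delete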

KrbException connecting to Hadoop cluster with Zookeeper client - UNKNOWN_SERVER

My Zookeeper client is having trouble connecting to the Hadoop cluster.
This works fine from a Linux VM, but I am using a Mac.
I set the -Dsun.security.krb5.debug=true flag on the JVM and get the following output:
Found ticket for solr@DDA.MYCO.COM to go to krbtgt/DDA.MYCO.COM@DDA.MYCO.COM expiring on Sat Apr 29 03:15:04 BST 2017
Entered Krb5Context.initSecContext with state=STATE_NEW
Found ticket for solr@DDA.MYCO.COM to go to krbtgt/DDA.MYCO.COM@DDA.MYCO.COM expiring on Sat Apr 29 03:15:04 BST 2017
Service ticket not found in the subject
>>> Credentials acquireServiceCreds: same realm
Using builtin default etypes for default_tgs_enctypes
default etypes for default_tgs_enctypes: 17 16 23.
>>> CksumType: sun.security.krb5.internal.crypto.RsaMd5CksumType
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
>>> KrbKdcReq send: kdc=oc-10-252-132-139.nat-ucfc2z3b.usdv1.mycloud.com UDP:88, timeout=30000, number of retries =3, #bytes=682
>>> KDCCommunication: kdc=oc-10-252-132-139.nat-ucfc2z3b.usdv1.mycloud.com UDP:88, timeout=30000,Attempt =1, #bytes=682
>>> KrbKdcReq send: #bytes read=217
>>> KdcAccessibility: remove oc-10-252-132-139.nat-ucfc2z3b.usdv1.mycloud.com
>>> KDCRep: init() encoding tag is 126 req type is 13
>>>KRBError:
cTime is Thu Dec 24 11:18:15 GMT 2015 1450955895000
sTime is Fri Apr 28 15:15:06 BST 2017 1493388906000
suSec is 925863
error code is 7
error Message is Server not found in Kerberos database
cname is solr@DDA.MYCO.COM
sname is zookeeper/oc-10-252-132-160.nat-ucfc2z3b.usdv1.mycloud.com@DDA.MYCO.COM
msgType is 30
KrbException: Server not found in Kerberos database (7) - UNKNOWN_SERVER
at sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:73)
at sun.security.krb5.KrbTgsReq.getReply(KrbTgsReq.java:251)
at sun.security.krb5.KrbTgsReq.sendAndGetCreds(KrbTgsReq.java:262)
at sun.security.krb5.internal.CredentialsUtil.serviceCreds(CredentialsUtil.java:308)
at sun.security.krb5.internal.CredentialsUtil.acquireServiceCreds(CredentialsUtil.java:126)
at sun.security.krb5.Credentials.acquireServiceCreds(Credentials.java:458)
at sun.security.jgss.krb5.Krb5Context.initSecContext(Krb5Context.java:693)
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:248)
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
at org.apache.zookeeper.client.ZooKeeperSaslClient$2.run(ZooKeeperSaslClient.java:366)
at org.apache.zookeeper.client.ZooKeeperSaslClient$2.run(ZooKeeperSaslClient.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.zookeeper.client.ZooKeeperSaslClient.createSaslToken(ZooKeeperSaslClient.java:362)
at org.apache.zookeeper.client.ZooKeeperSaslClient.createSaslToken(ZooKeeperSaslClient.java:348)
at org.apache.zookeeper.client.ZooKeeperSaslClient.sendSaslPacket(ZooKeeperSaslClient.java:420)
at org.apache.zookeeper.client.ZooKeeperSaslClient.initialize(ZooKeeperSaslClient.java:458)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1057)
Caused by: KrbException: Identifier doesn't match expected value (906)
at sun.security.krb5.internal.KDCRep.init(KDCRep.java:140)
at sun.security.krb5.internal.TGSRep.init(TGSRep.java:65)
at sun.security.krb5.internal.TGSRep.<init>(TGSRep.java:60)
at sun.security.krb5.KrbTgsRep.<init>(KrbTgsRep.java:55)
... 18 more
ERROR 2017-04-28 15:15:07,046 5539 org.apache.zookeeper.client.ZooKeeperSaslClient [main-SendThread(oc-10-252-132-160.nat-ucfc2z3b.usdv1.mycloud.com:2181)]
An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed
[Caused by GSSException: No valid credentials provided
(Mechanism level: Server not found in Kerberos database (7) - UNKNOWN_SERVER)])
occurred when evaluating Zookeeper Quorum Member's received SASL token.
This may be caused by Java's being unable to resolve the Zookeeper Quorum Member's hostname correctly.
You may want to try to adding '-Dsun.net.spi.nameservice.provider.1=dns,sun' to your client's JVMFLAGS environment.
Zookeeper Client will go to AUTH_FAILED state.
I've tested Kerberos config as follows:
>kinit -kt /etc/security/keytabs/solr.headless.keytab solr
>klist
Credentials cache: API:3451691D-7D5E-49FD-A27C-135816F33E4D
Principal: solr@DDA.MYCO.COM
Issued Expires Principal
Apr 28 16:58:02 2017 Apr 29 04:58:02 2017 krbtgt/DDA.MYCO.COM@DDA.MYCO.COM
Following the instructions from Hortonworks, I managed to get the Kerberos ticket stored in a file:
>klist -c FILE:/tmp/krb5cc_501
Credentials cache: FILE:/tmp/krb5cc_501
Principal: solr@DDA.MYCO.COM
Issued Expires Principal
Apr 28 17:10:25 2017 Apr 29 05:10:25 2017 krbtgt/DDA.MYCO.COM@DDA.MYCO.COM
I also tried the JVM option suggested in the stack trace (-Dsun.net.spi.nameservice.provider.1=dns,sun), but this led to a different error along the lines of "Client session timed out", which suggests that this JVM param prevents the client from connecting in the first place.
==EDIT==
Seems that the Mac version of Kerberos is not the latest:
> krb5-config --version
Kerberos 5 release 1.7-prerelease
I just tried brew install krb5 to install a newer version, then adjusted the path to point to the new version.
> krb5-config --version
Kerberos 5 release 1.15.1
This has had no effect whatsoever on the outcome.
NB: this works fine from a Linux VM on my Mac, using exactly the same jaas.conf, keytab files, and krb5.conf.
krb5.conf:
[libdefaults]
renew_lifetime = 7d
forwardable = true
default_realm = DDA.MYCO.COM
ticket_lifetime = 24h
dns_lookup_realm = false
dns_lookup_kdc = false
[realms]
DDA.MYCO.COM = {
admin_server = oc-10-252-132-139.nat-ucfc2z3b.usdv1.mycloud.com
kdc = oc-10-252-132-139.nat-ucfc2z3b.usdv1.mycloud.com
}
Reverse DNS:
I checked that the FQDN hostname I'm connecting to can be found using a reverse DNS lookup:
> host 10.252.132.160
160.132.252.10.in-addr.arpa domain name pointer oc-10-252-132-160.nat-ucfc2z3b.usdv1.mycloud.com.
This is exactly as per the response to the same command from the linux VM.
===WIRESHARK ANALYSIS===
Using Wireshark configured to use the system keytabs allows a bit more detail in the analysis.
Here I have found that a failed call looks like this:
client -> host AS-REQ
host -> client AS-REP
client -> host AS-REQ
host -> client AS-REP
client -> host TGS-REQ <-- this call is detailed below
host -> client KRB error KRB5KDC_ERR_S_PRINCIPAL_UNKNOWN
The erroneous TGS-REQ call shows the following:
Kerberos
tgs-req
pvno: 5
msg-type: krb-tgs-req (12)
padata: 1 item
req-body
Padding: 0
kdc-options: 40000000 (forwardable)
realm: DDA.MYCO.COM
sname
name-type: kRB5-NT-UNKNOWN (0)
sname-string: 2 items
SNameString: zookeeper
SNameString: oc-10-252-134-51.nat-ucfc2z3b.usdv1.mycloud.com
till: 1970-01-01 00:00:00 (UTC)
nonce: 797021964
etype: 3 items
ENCTYPE: eTYPE-AES128-CTS-HMAC-SHA1-96 (17)
ENCTYPE: eTYPE-DES3-CBC-SHA1 (16)
ENCTYPE: eTYPE-ARCFOUR-HMAC-MD5 (23)
Here is the corresponding successful call from the Linux box, which is followed by several more exchanges.
Kerberos
tgs-req
pvno: 5
msg-type: krb-tgs-req (12)
padata: 1 item
req-body
Padding: 0
kdc-options: 40000000 (forwardable)
realm: DDA.MYCO.COM
sname
name-type: kRB5-NT-UNKNOWN (0)
sname-string: 2 items
SNameString: zookeeper
SNameString: d59407.ddapoc.ucfc2z3b.usdv1.mycloud.com
till: 1970-01-01 00:00:00 (UTC)
nonce: 681936272
etype: 3 items
ENCTYPE: eTYPE-AES128-CTS-HMAC-SHA1-96 (17)
ENCTYPE: eTYPE-DES3-CBC-SHA1 (16)
ENCTYPE: eTYPE-ARCFOUR-HMAC-MD5 (23)
So it looks like the client is sending
oc-10-252-134-51.nat-ucfc2z3b.usdv1.mycloud.com
as the server host, when it should be sending:
d59407.ddapoc.ucfc2z3b.usdv1.mycloud.com
So the question is, how do I fix that? Bear in mind this is Java code.
My /etc/hosts has the following:
10.252.132.160 b3e073.ddapoc.ucfc2z3b.usdv1.mycloud.com
10.252.134.51 d59407.ddapoc.ucfc2z3b.usdv1.mycloud.com
10.252.132.139 d7cc18.ddapoc.ucfc2z3b.usdv1.mycloud.com
And my krb5.conf file has:
kdc = d7cc18.ddapoc.ucfc2z3b.usdv1.mycloud.com
kdc = b3e073.ddapoc.ucfc2z3b.usdv1.mycloud.com
kdc = d59407.ddapoc.ucfc2z3b.usdv1.mycloud.com
I tried adding -Dsun.net.spi.nameservice.provider.1=file,dns as a JVM param but got the same result.
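One hedged way to confirm, outside the JVM, which service principal the KDC actually knows about is MIT Kerberos' kvno tool (principals taken from the traces above; the first would be expected to succeed, the second to reproduce UNKNOWN_SERVER):

# Principal derived from the /etc/hosts name -- should exist in the KDC
kvno zookeeper/d59407.ddapoc.ucfc2z3b.usdv1.mycloud.com@DDA.MYCO.COM

# Principal the Mac client was requesting -- expected to fail
kvno zookeeper/oc-10-252-134-51.nat-ucfc2z3b.usdv1.mycloud.com@DDA.MYCO.COM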
I fixed this by setting up a local dnsmasq instance to supply the forward and reverse DNS lookups.
So now from the command line, host d59407.ddapoc.ucfc2z3b.usdv1.mycloud.com returns 10.252.134.51
See also here and here.
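A minimal sketch of such a dnsmasq configuration; the exact entries are an assumption based on the /etc/hosts mappings above, and each host-record yields both the forward (A) and reverse (PTR) record that Java's hostname canonicalization needs:

# /etc/dnsmasq.conf
host-record=b3e073.ddapoc.ucfc2z3b.usdv1.mycloud.com,10.252.132.160
host-record=d59407.ddapoc.ucfc2z3b.usdv1.mycloud.com,10.252.134.51
host-record=d7cc18.ddapoc.ucfc2z3b.usdv1.mycloud.com,10.252.132.139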
Looks like some DNS issue.
Could this SO question help you resolve your problem?
Also, here is a Q&A about the problem.
It could also be caused by a non-Sun JVM.

One node in hadoop cluster failure

I recently configured a 10-node HDP Hadoop cluster; each node runs SLES 11.
On the master node I have configured all the master services and clients, as well as the ambari-server. The remaining nodes run the slave services and their clients.
NTP sync is on, and the other prerequisites are also fine.
I am experiencing weird behavior on the Hadoop cluster: after starting all the services, within a few hours one of the nodes goes down.
When I experienced this the first time, I restarted that particular node and added it back to the cluster.
Now my master node is causing the same issue, due to which the whole cluster is down. I have checked the logs, but there are no indications related to the failure.
I am clueless about the root cause of the node failures in the Hadoop cluster.
Below are the logs from the system that went down, from /var/log/messages:
...notice)=0', processed='source(src)=6830'
Apr 23 05:22:43 lnx1863 SuSEfirewall2: SuSEfirewall2 not active
Apr 23 05:23:49 lnx1863 SuSEfirewall2: SuSEfirewall2 not active
Apr 23 05:24:17 lnx1863 sudo: root : TTY=pts/0 ; PWD=/ ; USER=root ; COMMAND=/usr/bin/du -h /
Apr 23 05:24:55 lnx1863 SuSEfirewall2: SuSEfirewall2 not active
Apr 23 05:25:22 lnx1863 kernel: [248531.127254] megasas: Found FW in FAULT state, will reset adapter.
Apr 23 05:25:22 lnx1863 kernel: [248531.127260] megaraid_sas: resetting fusion adapter.
Apr 23 05:25:22 lnx1863 kernel: [248531.127427] megaraid_sas: Reset not supported, killing adapter.
namenode logs:
INFO 2015-04-23 05:27:43,665 Heartbeat.py:78 - Building Heartbeat: {responseId = 7607, timestamp = 1429781263665, commandsInProgress = False, componentsMapped = True}
INFO 2015-04-23 05:28:44,053 security.py:135 - Encountered communication error. Details: SSLError('The read operation timed out',)
ERROR 2015-04-23 05:28:44,053 Controller.py:278 - Connection to http://localhost was lost (details=Request to https://localhost:8441/agent/v1/heartbeat/localhostip failed due to Error occured during connecting to the server: The read operation timed out)
INFO 2015-04-23 05:29:16,061 NetUtil.py:48 - Connecting to https://localhost:8440/connection_info
INFO 2015-04-23 05:29:16,118 security.py:93 - SSL Connect being called.. connecting to the server
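Not part of the original question, but the kernel lines above ("megasas: Found FW in FAULT state", "Reset not supported, killing adapter") point at the RAID controller rather than Hadoop itself, so a hedged first step is to check the storage layer on the affected node:

# Look for RAID controller faults around the time the node dropped
grep -iE 'megaraid|megasas' /var/log/messages

# Check the kernel ring buffer for disk/adapter errors since boot
dmesg | grep -iE 'fault|reset|killing adapter'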
