Solr - lost configuration after recovering Zookeeper - hadoop

I just inherited a Hadoop cluster (I have never worked with Hadoop before) consisting of 7 servers administered through Ambari.
Today Ambari lost its heartbeat with all services on server3; the ZooKeeper services (hosted on servers 1, 2, and 3), ZKFailover (hosted on servers 1 and 2), and the ZooKeeper clients (hosted on servers 4, 5, 6, and 7) all stopped and refused to start. This also caused the Solr services to stop working.
After some investigating I found that ZooKeeper on server3 was erroring on a recent snapshot due to a CRC failure. After some more reading I removed the old snapshot files in .../zookeeper/version-2/ and ran 'zk -formatZK' (on server1). The ZooKeeper services are now able to start and heartbeats from server3 are being received.
The problem I see now is that all the Solr services are no longer configured properly: "...ZooKeeperException: Could not find configName for collection xxxx found:null". I haven't had much success figuring out how to get the previous Solr configurations back into ZooKeeper. I'm trying to use the 'zkcli.sh' I found in the Solr directory ('/opt/solr/xxxx/scripts/cloud-scripts/'), but it doesn't work like the zkCli described in the Hadoop documentation.
My question is, how do I set up the Solr servers using the existing config files? If I can't, how can I go about reconstructing the following configuration:
                /--- server5
       /--shard1
      /         \--- server7
core <
      \         /--- server4
       \--shard2
                \--- server6
Thanks.

So after trial and error I found that zkcli.sh should be used in the following manner:
./zkcli.sh -zkhost server1:2181,server2:2181,server3:2181 -cmd upconfig -confdir .../solr/<corename>/conf -confname <configname>
This should upload any existing config to all ZK nodes.
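If a collection still reports "Could not find configName" after the upload, the config can also be linked to the collection explicitly (a sketch, assuming the same cloud-scripts zkcli.sh; the collection and config names are placeholders):
./zkcli.sh -zkhost server1:2181,server2:2181,server3:2181 -cmd linkconfig -collection <corename> -confname <configname>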

Related

HBase fully distributed mode [Zookeeper error while executing HBase shell]

Following these two tutorials (tutorial 1 and tutorial 2), I was able to set up an HBase cluster in fully-distributed mode. Initially the cluster seemed to work okay.
[Screenshot: the jps output on the HMaster/NameNode]
[Screenshot: the jps output on the DataNodes/RegionServers]
Nevertheless, whenever I try to execute the HBase shell, it seems that the HBase processes are interrupted by a ZooKeeper error. The error is pasted below:
2021-03-13 11:52:26,047 ERROR [main] zookeeper.RecoverableZooKeeper: ZooKeeper exists failed after 4 attempts
2021-03-13 11:52:26,048 WARN [main] zookeeper.ZKUtil: hconnection-0x4375b0130x0, quorum=137.43.49.59:2181,137.43.49.58:2181,137.43.49.50:2181,137.43.49.49:2181, baseZNode=/hbase Unable to set watcher on znode (/hbase/hbaseid)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:221)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:417)
    at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:6...
I made several attempts to solve this issue (including trying different compatible HBase/Hadoop versions), but still no progress.
I would like to have your input on this.
Other required information is shared below:
In the /etc/hosts file:
(I already tried commenting out the HBase-related hosts in /etc/hosts; it still didn't work.)
In hbase-site.xml:
After 5 days of hustle, I learned what went wrong. Posting my solution here; hope it can help other developers too. I would also like to thank @VV_FS for the comments.
In my scenario, I used virtual machines that I borrowed from an external party, so there were certain firewalls and other security measures in place. If you follow a similar experimental setup, these steps might help you.
To set up the HBase cluster, follow these tutorials:
Set up Hadoop in distributed mode.
Notes when setting up Hadoop in fully distributed mode:
Make sure to open all the ports mentioned in the post. For example, use sudo ufw allow 9000 to open port 9000. Repeat the command for every port involved in running Hadoop, as in the sketch below.
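Under ufw the whole list can be opened in one loop (a sketch, assuming the default Hadoop 2.x port numbers; check your *-site.xml files for the ports your cluster actually uses):
# Open the default Hadoop 2.x ports: NameNode (9000, 50070), DataNode
# (50010, 50020, 50075), SecondaryNameNode (50090), YARN (8030-8033, 8088).
for port in 9000 50070 50010 50020 50075 50090 8030 8031 8032 8033 8088; do
  sudo ufw allow "$port"
done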
Set up Zookeeper in distributed mode.
Notes when setting up Zookeeper in fully distributed mode:
Make sure to open all the ports mentioned in the post. For example, use sudo ufw allow 3888 to open port 3888. Repeat the command for every port involved in running Zookeeper (see the sketch after these notes).
DO NOT START THE ZOOKEEPER NODES AFTER INSTALLATION. ZOOKEEPER WILL BE MANAGED BY HBASE INTERNALLY. THEREFORE, DON'T START ZOOKEEPER AT THIS STAGE.
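For reference, the standard ZooKeeper ports can be opened the same way (a sketch, assuming ufw and the stock ports; adjust if your zoo.cfg differs):
# 2181 = client connections, 2888 = peer connections, 3888 = leader election.
for port in 2181 2888 3888; do
  sudo ufw allow "$port"
done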
Set up HBase in distributed mode.
When setting values in hbase-site.xml, use port number 60000 for the hbase.master tag, not 60010 (thanks @VV_FS for pointing this out in the earlier discussion).
Make sure to open all the ports mentioned in the post. For example, use sudo ufw allow 60000 to open port 60000. Repeat the command for every port involved in running HBase, as sketched below.
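A sketch for the classic HBase ports used in this post (adjust if your hbase-site.xml differs):
# 60000 = master, 60010 = master UI, 60020 = region server, 60030 = region server UI.
for port in 60000 60010 60020 60030; do
  sudo ufw allow "$port"
done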
[Important thoughts]: If you encounter errors, always refer to the HBase logs. In my case, hbase-master-xxxxx.log and zookeeper-master-xxx.log helped me track down the exact errors.

SYBASE Cluster on RedHat Pacemaker Cluster

I am trying to set up a SYBASE cluster on RedHat 7.5 using Pacemaker. I want Active/Passive mode, where SYBASE runs on only a single node at a time. It works fine during configuration, but when the standby node reboots, the SYBASE resource tries to start on node 2, which should not happen once it is up and running on node 1.
I have configured Pacemaker as follows:
Resource Group: sybase-rg
  lvm-sybasedev-res (ocf::heartbeat:LVM): Started sdp-1
  lvm-databasedev-res (ocf::heartbeat:LVM): Started sdp-1
  sybase-IP (ocf::heartbeat:IPaddr2): Started sdp-1
  sybase-res (ocf::heartbeat:sybaseASE): Started sdp-1
- lvm-sybasedev-res and lvm-databasedev-res are there to give shared-volume (iSCSI) access to the node where SYBASE is running at the time.
- The sybase-res resource was created using the command below:
pcs resource create sybase-res ocf:heartbeat:sybaseASE server_name="SYBASE" db_user="sa" \
  db_passwd="password" sybase_home="/global/sdp/sybase" sybase_ase="ASE-15_0" \
  sybase_ocs="OCS-15_0" interfaces_file="/global/sdp/sybase/interfaces" \
  sybase_user="sybase" --group sybase-rg --disabled
I have a colocation constraint set up to keep all resources in the sybase-rg resource group on the same node.
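For reference, constraints of the kind described might look like this (a sketch; the resource names are taken from the status output above, and note that resources inside a single group are already kept together implicitly):
# Keep the SYBASE server with its virtual IP, and the IP with the shared volumes.
pcs constraint colocation add sybase-res with sybase-IP INFINITY
pcs constraint colocation add sybase-IP with lvm-sybasedev-res INFINITY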
I was expecting that if sybase-rg is up and running on node 1 (sdp-1), then even if node 2 (sdp-2) reboots it should not affect sybase-res, because it is the inactive node that is rebooting.
Am I missing something? Any help is welcome.
Regards,

Ambari won't restart: DB check failed

When I restarted my cluster, Ambari didn't start because a database consistency check failed:
sudo service ambari-server restart --skip-database-check
Using python /usr/bin/python
Restarting ambari-server
Waiting for server stop...
Ambari Server stopped
Ambari Server running with administrator privileges.
Organizing resource files at /var/lib/ambari-server/resources...
Ambari Server is starting with the database consistency check skipped. Do not make any changes to your cluster topology or perform a cluster upgrade until you correct the database consistency issues. See "/var/log/ambari-server/ambari-server-check-database.log" for more details on the consistency issues.
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Waiting for server start.....................
DB configs consistency check failed. Run "ambari-server start --skip-database-check" to skip. You may try --auto-fix-database flag to attempt to fix issues automatically. If you use this "--skip-database-check" option, do not make any changes to your cluster topology or perform a cluster upgrade until you correct the database consistency issues. See /var/log/ambari-server/ambari-server-check-database.log for more details on the consistency issues.
ERROR: Exiting with exit code -1.
REASON: Ambari Server java process has stopped. Please check the logs for more information.
I looked in the logs in "/var/log/ambari-server/ambari-server-check-database.log", and I saw:
2017-08-23 08:16:13,445 INFO - Checking Topology tables
2017-08-23 08:16:13,447 ERROR - Your topology request hierarchy is not complete for each row in topology_request should exist at least one raw in topology_logical_request, topology_host_request, topology_host_task, topology_logical_task.
I tried both the --auto-fix-database and --skip-database-check options; neither worked.
It seemed that postgresql hadn't started correctly. Even though the Ambari log made no mention of postgresql being down or unavailable, it was odd that Ambari couldn't access the topology configuration stored in it.
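One quick sanity check (a sketch, assuming Ambari's default embedded Postgres database, named ambari) is to query one of the topology tables named in the log directly:
# If Postgres is down or the table is unreachable, this fails immediately.
sudo -u postgres psql ambari -c "SELECT count(*) FROM topology_request;"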
sudo service postgresql restart
Stopping postgresql service: [ OK ]
Starting postgresql service: [ OK ]
That did the trick:
sudo service ambari-server restart
Using python /usr/bin/python
Restarting ambari-server
Ambari Server is not running
Ambari Server running with administrator privileges.
Organizing resource files at /var/lib/ambari-server/resources...
Ambari database consistency check started...
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Waiting for server start.........
Server started listening on 8080

Apache Phoenix Installation not done properly

We are trying to install Phoenix 4.4.0 on HBase 1.0.0-cdh5.4.4 (a four-node CDH5.5.5 cluster) following this installation document: Phoenix installation
Based on that, we copied phoenix-server-4.4.0-HBase-1.0.jar into the HBase lib folder, /opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hbase/lib, on the master and the three region servers.
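The copy step can be scripted (a sketch; the hostnames master, rs1, rs2, and rs3 are placeholders for the actual nodes):
# Push the Phoenix server jar into the HBase lib folder on every server.
for host in master rs1 rs2 rs3; do
  scp phoenix-server-4.4.0-HBase-1.0.jar "$host":/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hbase/lib/
done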
After that we restarted the HBase service via Cloudera Manager.
Everything seems to be OK, but when we try to access the Phoenix shell via the ./sqlline.py localhost command, we get a ZooKeeper error like this:
15/09/09 14:20:51 WARN client.ZooKeeperRegistry: Can't retrieve clusterId from Zookeeper
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
So we are not sure that the installation was done properly. Is any further configuration necessary?
We are not even sure whether we are using the sqlline command properly.
Any help will be appreciated.
After reinstalling the four-node cluster on AWS, Phoenix is now working properly.
It's a pity that we don't know exactly what happened, but we think that after several changes to our config we broke something that prevented Phoenix from working.
One thing to take into consideration is that the sqlline command has to be executed against a host that is in the ZooKeeper quorum, and this is something we were doing wrong: we were running it on the namenode, which wasn't in the ZooKeeper quorum. Once we ran sqlline.py from a datanode, everything worked fine.
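Instead of relying on localhost, sqlline.py can also be pointed at the quorum explicitly (a sketch; zk1, zk2, and zk3 are placeholders for the actual quorum members):
# Connect to the ZooKeeper quorum directly rather than to localhost.
./sqlline.py zk1,zk2,zk3:2181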
By the way, the installation guide that we finally followed is Phoenix Installation.

Setting up Kerberos on HDP 2.1

I have a 2-node Ambari Hadoop cluster running on CentOS 6. Recently I set up Kerberos for the services in the cluster as per the instructions detailed here:
http://docs.hortonworks.com/HDPDocuments/Ambari-1.6.0.0/bk_ambari_security/content/ambari-kerb.html
In addition to the above documentation, I found that you have to add extra configuration for the NameNode web UI and so on (the QuickLinks in the Ambari server console for each of the Hadoop services) to work. Hence I followed the configuration options listed in the question portion of this article to set up HTTP authentication: Hadoop Web Authentication using Kerberos
To create the HTTP secret file, I ran the following command on node 1 and then copied the resulting file to the same folder location on node 2 of the cluster:
sudo dd if=/dev/urandom of=/etc/security/keytabs/http_secret bs=1024 count=1
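The copy to node 2 can be done with scp (a sketch, assuming SSH access to node 2 under a sufficiently privileged account; node2 is a placeholder hostname):
# Copy the generated secret to the same path on the second node.
scp /etc/security/keytabs/http_secret node2:/etc/security/keytabs/http_secret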
I updated the ZooKeeper JAAS client file, /etc/zookeeper/conf/zookeeper_client_jaas.conf, to add the following:
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=false
  useTicketCache=true
  keyTab="/etc/security/keytabs/zk.service.keytab"
  principal="zookeeper/host@realm-name";
};
This step followed from the article: http://blog.spryinc.com/2014/03/configuring-kerberos-security-in.html
When I restarted my Hadoop services, I got a 401 Authentication Required error when trying to access the NameNode UI, NameNode logs, NameNode JMX, and so on. None of the links in the QuickLinks drop-down can connect and pull up the data.
Any thoughts to resolve this error?
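(One way to test whether SPNEGO authentication works at all, independently of the browser, is curl's built-in Kerberos support; a sketch, assuming curl was built with GSS support and 50070 is the NameNode UI port, with placeholder hostname and principal:)
# Obtain a ticket first, then let curl negotiate with it.
kinit myuser@REALM-NAME
curl --negotiate -u : http://namenode-host:50070/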
