Errors in setting up HBase on Distributed Hadoop, ZooKeeperServer not running - hadoop

I'm trying to set up HBase on Hadoop and have been following various great tutorials online, by Michael G. Noll and here. Basically everything is fine: my HDFS and MapReduce work well, and the web interface shows that I have 2 nodes (my NameNode is both a NameNode and a DataNode, but that's just for testing purposes).
When I got to the point of installing HBase, that's where I ran into problems; I get lots of different errors. The latest one appears in the log file on my slave node:
INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.2.xx.xx:43089 (no session established for client)
INFO org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
But when I type in
$ zkServer.sh status
it shows the mode that both machines are running in!
Does anyone have any idea what this problem is? Or does anyone know of another guide/tutorial that I can follow to set this up? I've tried following the HBase documentation on setting up HBase on a distributed HDFS, but that doesn't work either.
Thanks for any help offered!

Are both ZooKeeper servers configured in a quorum? If so, have they managed to connect to one another and vote on who's the leader? (This should all be in the logs for both servers.)
ZooKeeper may be running, but if the peers can't communicate with one another (firewall rules or misconfiguration, for example), then ZooKeeper will not accept incoming client connections.
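For what it's worth, here is a minimal sketch of how to check this, assuming a standard ZooKeeper install where both peers are listed in each zoo.cfg (hostnames below are placeholders, not taken from your setup):

# On each ZooKeeper host: confirm the peer entries exist and are identical on both machines.
grep '^server\.' $ZOOKEEPER_HOME/conf/zoo.cfg
# e.g. (placeholders):
#   server.1=master-node:2888:3888
#   server.2=slave-node:2888:3888

# Ask each peer what role it thinks it has; a healthy quorum shows one
# "Mode: leader" and the rest "Mode: follower".
echo srvr | nc master-node 2181
echo srvr | nc slave-node 2181

# Verify the quorum (2888) and election (3888) ports are reachable between the peers.
nc -zv master-node 2888
nc -zv master-node 3888
nc -zv slave-node 2888
nc -zv slave-node 3888

If the peers cannot reach each other on 2888/3888, that would explain why the server refuses client sessions with "ZooKeeperServer not running".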

Related

JobTracker web UI - not working in pseudo-distributed mode v 2.7.1

I have installed Hadoop 2.7.1 in pseudo-distributed mode (all daemons on a single machine). It's up and running: I'm able to access HDFS through the command line, run jobs, and see the output.
I can access http://localhost:50070/dfshealth.html#tab-overview. It shows the version and cluster status, and I can access the Hadoop file system.
I found one link and applied its accepted solution, but that does not work for me. When I try to access http://127.0.0.1:54310, I get the error message below:
It looks like you are making an HTTP request to a Hadoop IPC port. This is
not the correct port for the web interface on this daemon.
Any help is appreciated.
Thanks..
I am using MR2 and am not able to track my job on port 8088. When I run a MapReduce job, it submits the job at http://localhost:8080, and that URL does not open to track the job.
Use port 50030 if you are using MRv1; for YARN (MR2), use port 8088 to access the ResourceManager.
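As a quick sanity check, a sketch assuming the default ports of that Hadoop era (54310 is the fs.default.name RPC port, which is why it answers with the "HTTP request to a Hadoop IPC port" message instead of a page):

# NameNode web UI (HDFS health page)
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:50070/
# YARN ResourceManager web UI (job tracking for MR2)
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8088/cluster
# JobTracker web UI (only exists on MRv1 clusters)
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:50030/

A 200 on 8088 means the ResourceManager UI is where you should be tracking MR2 jobs.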

Submitting jobs to Spark EC2 cluster remotely

I've set up the EC2 cluster with Spark. Everything works, all master/slaves are up and running.
I'm trying to submit a sample job (SparkPi). When I ssh to the cluster and submit it from there, everything works fine. However, when the driver is created on a remote host (my laptop), it doesn't work. I've tried both modes for --deploy-mode:
--deploy-mode=client:
From my laptop:
./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
This results in the following warnings/errors, which repeat indefinitely:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
15/02/22 18:30:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 0
15/02/22 18:30:45 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 1
...and failed drivers appear in the Spark Web UI under "Completed Drivers" with "State=ERROR".
I've tried to pass limits for cores and memory to the submit script, but it didn't help...
--deploy-mode=cluster:
From my laptop:
./bin/spark-submit --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 --deploy-mode cluster --class SparkPi ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
The result is:
.... Driver successfully submitted as driver-20150223023734-0007 ... waiting before polling master for driver state ... polling master for driver state
State of driver-20150223023734-0007 is ERROR
Exception from cluster was: java.io.FileNotFoundException: File file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar does not exist.
java.io.FileNotFoundException: File file:/home/oleg/spark/spark12/ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:329)
    at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)
    at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:75)
So, I'd appreciate any pointers on what is going wrong and some guidance on how to deploy jobs from a remote client. Thanks.
UPDATE:
So for the second issue, in cluster mode, the file must be visible to every cluster node, so it has to be in a location accessible to all of them. This solves the IOException but leads to the same issue as in client mode.
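One hedged way to make the jar globally visible (assuming HDFS is running on the cluster and is the default filesystem for the workers; the /jars path below is just a placeholder) is to push it to HDFS and submit that URL instead of a laptop-local file: path:

# copy the application jar somewhere every worker can read
hdfs dfs -mkdir -p /jars
hdfs dfs -put ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar /jars/

# submit with a URL that resolves on every node
./bin/spark-submit \
  --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 \
  --deploy-mode cluster \
  --class SparkPi \
  hdfs:///jars/ec2test_2.10-0.0.1.jar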
The documentation at:
http://spark.apache.org/docs/latest/security.html#configuring-ports-for-network-security
lists all the different communication channels used in a Spark cluster. As you can see, there are a bunch where the connection is made from the Executor(s) to the Driver. When you run with --deploy-mode=client, the driver runs on your laptop, so the executors will try to make a connection to your laptop. If the AWS security group that your executors run under blocks outbound traffic to your laptop (which the default security group created by the Spark EC2 scripts doesn't), or you are behind a router/firewall (more likely), they fail to connect and you get the errors you are seeing.
So to resolve it, you have to forward all the necessary ports to your laptop or reconfigure your firewall to allow connections on those ports. Since a bunch of the ports are chosen at random, this means opening up a wide range of ports, if not all of them. So using --deploy-mode=cluster, or client mode from within the cluster, is probably less painful.
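If you do want to try client mode from the laptop anyway, here is a sketch of the port-pinning approach, so that only a small, known set of ports has to be reachable from the executors. spark.driver.host, spark.driver.port and spark.blockManager.port are standard Spark properties; YOUR_PUBLIC_IP and the port numbers are placeholders you would have to open/forward yourself:

./bin/spark-submit \
  --master spark://ec2-52-10-82-218.us-west-2.compute.amazonaws.com:7077 \
  --conf spark.driver.host=YOUR_PUBLIC_IP \
  --conf spark.driver.port=7001 \
  --conf spark.blockManager.port=7005 \
  --class SparkPi \
  ec2test/target/scala-2.10/ec2test_2.10-0.0.1.jar
# then allow inbound connections on 7001 and 7005 from the EC2 nodes to your laptop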
I advise against submitting Spark jobs remotely using the port-opening strategy, because it can create security problems and is, in my experience, more trouble than it's worth, especially because you end up troubleshooting the communication layer.
Alternatives:
1) Livy - now an Apache project! http://livy.io or http://livy.incubator.apache.org/
2) Spark Job server - https://github.com/spark-jobserver/spark-jobserver
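As an illustration of the Livy route, a sketch of a remote batch submission over Livy's REST API (this assumes a Livy server is already running on the cluster at its default port 8998, that LIVY_HOST is a placeholder for its address, and that the jar has been uploaded to HDFS as shown earlier):

curl -s -X POST -H 'Content-Type: application/json' \
  -d '{"file": "hdfs:///jars/ec2test_2.10-0.0.1.jar", "className": "SparkPi"}' \
  http://LIVY_HOST:8998/batches
# poll the batch state
curl -s http://LIVY_HOST:8998/batches

The point of this design is that only one well-known HTTP port needs to be reachable from the outside, instead of the many random driver/executor ports.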

After restarting HBase, ZooKeeper logs Quorum.Learner: Got zxid 0x100000001 expected 0x1

I am performing some tests using HBase and Hadoop. I set up a cluster with one master, two ZooKeeper nodes, and four region servers. Up until yesterday everything was working perfectly well; starting today it simply doesn't start anymore.
When executing start-hbase, all the processes come up:
HMaster using ports 8020 and 60010
HQuorumPeer using ports 2181 and 3888
HRegionServer
However, when I take a look at the server logs, it seems the servers got stuck for some reason...
. HRegionServer stops after printing a WARNING about a native library that I was supposed to be using
. HQuorumPeer on node 1 prints a WARNING about getting a zxid 0x10000000001 when it expected 0x1
. HQuorumPeer on node 1 has not printed anything at all
Does someone have any idea about this?
Thanks.
Well, I am far, far away from being considered an HBase/Hadoop expert. In fact, this is just the first time I am playing around with it. Probably the problem I faced was related to an improper shutdown or to corrupt files from the HBase/Hadoop pair.
So here is my tip if you find yourself in the same situation (a combined command sketch follows the list):
clean up all HBase logs, in my case at $HBASE_INSTALL/logs/*
clean up all ZooKeeper data, in my case at /var/zookeeper/*
clean up all Hadoop data, in my case at /var/hadoop/*
clean up all HDFS logs, in my case at /var/hdfs/log/*
clean up all HDFS NameNode data, in my case at /var/hdfs/namenode/*
clean up all HDFS DataNode data, in my case at /var/hdfs/datanode/*
format your HDFS cluster by typing the command hdfs namenode -format
IMPORTANT: Don't do this if you have data; you will probably lose all of it. I could do it because I am only using the cluster for test purposes.
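A combined sketch of the steps above, using the same paths I mentioned (adjust them to your own layout; again, this wipes everything):

rm -rf $HBASE_INSTALL/logs/*
rm -rf /var/zookeeper/*
rm -rf /var/hadoop/*
rm -rf /var/hdfs/log/*
rm -rf /var/hdfs/namenode/*
rm -rf /var/hdfs/datanode/*
hdfs namenode -format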
I will keep reading about HBase/Hadoop in order to understand it better; anyway, I can guarantee that it is a tool far from "plug and play" when compared to Cassandra.
Hope this can help.
Regards

How does Apache Spark handle system failure when deployed on YARN?

Preconditions
Let's assume Apache Spark is deployed on a Hadoop cluster using YARN and a Spark execution is running. How does Spark handle the situations listed below?
Cases & Questions
One node of the Hadoop cluster fails due to a disk error. However, replication is high enough and no data was lost.
What will happen to the tasks that were running on that node?
One node of the Hadoop cluster fails due to a disk error. Replication was not high enough and data was lost: Spark simply couldn't find a file anymore that was pre-configured as a resource for the workflow.
How will it handle this situation?
During execution the primary NameNode fails over.
Does Spark automatically use the failover NameNode?
What happens when the secondary NameNode fails as well?
For some reason, during a workflow, the cluster is totally shut down.
Will Spark restart with the cluster automatically?
Will it resume from the last "save" point in the workflow?
I know some of these questions might sound odd. Anyway, I hope you can answer some or all of them.
Thanks in advance. :)
Here are the answers given on the mailing list to these questions (answers were provided by Sandy Ryza of Cloudera):
"Spark will rerun those tasks on a different node."
"After a number of failed task attempts trying to read the block, Spark would pass up whatever error HDFS is returning and fail the job."
"Spark accesses HDFS through the normal HDFS client APIs. Under an HA configuration, these will automatically fail over to the new namenode. If no namenodes are left, the Spark job will fail."
Restart is part of administration and "Spark has support for checkpointing to HDFS, so you would be able to go back to the last time checkpoint was called that HDFS was available."

HBase NoServerForRegionException?

I am getting this exception after I haven't communicated with HBase for a while:
org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out trying to locate root region because: Connection refused
Is this related to session expiry? If so, how can I extend the session lifetime?
Run bin/hbase hbck and find out on which machine the ROOT RegionServer is running.
You should see "-ROOT- is okay" in the hbck output. Make sure that all your RegionServers are up and running.
Use start regionserver to bring a RegionServer back up (see the sketch below).
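A short sketch of those checks (hbase-daemon.sh is the standard per-node start script; run the start command on the machine whose RegionServer is down):

bin/hbase hbck                          # should report that -ROOT- is okay and 0 inconsistencies
bin/hbase-daemon.sh start regionserver  # bring a dead RegionServer back up on that node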
I don't think this has anything to do with session lifetime.
Check your cluster to make sure that it is up and working correctly and all region servers are alive. Then check the logs to make sure that they are not reporting some error state.
HBase is complex software -- without more detailed information it is very difficult to diagnose what is going on. And often you can discover the problem by collecting the more detailed information.
This error shows that the client is not able to talk to the RegionServer.
Check which RegionServer is associated with the region it is trying to connect to, and check that it is up.
To identify the RegionServer associated with the region, please go through http://hbase.apache.org/0.94/book/regions.arch.html#regions.arch.assignment
Some factors have played a role here.
Please note the steps below, which occur when you try to connect to HBase from a client:
HBase connects to ZooKeeper to get the IPs of the RegionServers which host the ROOT table.
The client caches this information about the IPs so that it doesn't have to contact ZooKeeper again.
Your problem is that your client is trying to connect to ZooKeeper to get that IP, and one of the things below may be going wrong:
Your client is not able to connect to ZooKeeper.
The information about ROOT contained inside the znode in ZooKeeper is wrong.
Possible fixes:
Check that your ZooKeeper is working fine.
Delete the znode for HBase in your ZooKeeper and restart the cluster (a sketch follows). Don't worry, this won't delete your data.
Once this is done, the client can get the ROOT information and then query the META table without any issue.
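If you go the "delete the znode" route, here is a minimal sketch. It assumes the default /hbase parent znode and the zkCli.sh client that ships with ZooKeeper; ZK_HOST is a placeholder for one of your ZooKeeper nodes:

bin/stop-hbase.sh
bin/zkCli.sh -server ZK_HOST:2181
# inside the zkCli shell:
#   rmr /hbase
bin/start-hbase.sh
# on startup HBase recreates the znode and republishes the ROOT/META locations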
