Cannot start Nutch crawling - hadoop

I'm trying to deploy Nutch 2.3 + ElasticSearch 1.4 + HBase 0.94 on Ubuntu 14.04, following this tutorial. When I try to start the crawl by injecting the URLs with:
$NUTCH_ROOT/runtime/local/bin/nutch inject urls
I get:
InjectorJob: starting at 2017-10-12 19:27:48
InjectorJob: Injecting urlDir: urls
and the process remains there for hours.
How do I know what's going on?
Configuration files:
nutch-site.xml
<configuration>
<property>
<name>http.agent.name</name>
<value>mycrawlername</value> <!-- this can be changed to something more sane if you like -->
</property>
<property>
<name>http.robots.agents</name>
<value>mycrawlername</value> <!-- this is the robot name we're looking for in robots.txt files -->
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
</property>
<property>
<name>plugin.includes</name>
<!-- do **NOT** enable the parse-html plugin, if you want proper HTML parsing. Use something like parse-tika! -->
<value>protocol-httpclient|urlfilter-regex|parse-(text|tika|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic</value>
</property>
<property>
<name>db.ignore.external.links</name>
<value>true</value> <!-- do not leave the seeded domains (optional) -->
</property>
<property>
<name>elastic.host</name>
<value>localhost</value> <!-- where is ElasticSearch listening -->
</property>
</configuration>
hbase-site.xml
<configuration>
<property>
<name>hbase.rootdir</name>
<value>/home/kike/RIWS/hbase-0.94.14/</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>false</value>
</property>
</configuration>
Log files:
HBase master log
2017-10-12 19:27:49,593 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:47778
2017-10-12 19:27:49,596 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /127.0.0.1:47778
2017-10-12 19:27:49,609 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x15f11684f3f0017 with negotiated timeout 40000 for client /127.0.0.1:47778
2017-10-12 19:31:11,092 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Stats: total=1.99 MB, free=239.7 MB, max=241.69 MB, blocks=2, accesses=18, hits=16, hitRatio=88,88%, , cachingAccesses=18, cachingHits=16, cachingHitsRatio=88,88%, , evictions=0, evicted=0, evictedPerRun=NaN
2017-10-12 19:31:24,623 DEBUG org.apache.hadoop.hbase.client.MetaScanner: Scanning .META. starting at row= for max=2147483647 rows using org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation#1646b7c
2017-10-12 19:31:24,630 DEBUG org.apache.hadoop.hbase.master.CatalogJanitor: Scanned 0 catalog row(s) and gc'd 0 unreferenced parent region(s)
2017-10-12 19:32:13,832 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x15f11684f3f0017
2017-10-12 19:32:13,849 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:47778 which had sessionid 0x15f11684f3f0017
2017-10-12 19:32:14,852 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:47817
2017-10-12 19:32:14,853 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /127.0.0.1:47817
2017-10-12 19:32:14,880 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x15f11684f3f0018 with negotiated timeout 40000 for client /127.0.0.1:47817
Hadoop log
2017-10-12 19:27:48,871 INFO crawl.InjectorJob - InjectorJob: starting at 2017-10-12 19:27:48
2017-10-12 19:27:48,871 INFO crawl.InjectorJob - InjectorJob: Injecting urlDir: urls
EDIT:
After a while, the Hadoop log shows:
2017-10-12 20:34:59,333 ERROR crawl.InjectorJob - InjectorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:78)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:218)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times
at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:133)
at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
... 7 more
Caused by: org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times
at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:139)
at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:115)
... 9 more
But if I type jps I can see the HMaster running:
31672 Jps
20553 HMaster
19739 Elasticsearch

Your error log shows an hbase.MasterNotRunningException:
org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:78)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:218)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
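Even though jps shows an HMaster process, the client clearly cannot reach it. A quick way to confirm this (a sketch, assuming the default master port 60000 and the hbase launcher on your PATH):
# Is anything listening on the HBase master port?
netstat -lnt | grep 60000
# Can a client actually reach the master?
echo "status" | hbase shell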
We need to set up HBase.
Open ~/Desktop/Nutch/hbase/conf/hbase-site.xml and add the following two properties: we need to tell HBase the rootdir of the install and also specify a data directory for ZooKeeper.
open ~/Desktop/Nutch/hbase/conf/hbase-site.xml
<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///Users/sntiwari/Desktop/Nutch/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/Users/sntiwari/Desktop/Nutch/zookeeper</value>
</property>
</configuration>
Next, we need to tell Gora to use HBase as its default data store.
open ~/Desktop/Nutch/nutch/conf/gora.properties
# open ~/Desktop/Nutch/nutch/runtime/local/conf/gora.properties
# Add this line under `HBaseStore properties` (to keep things organised)
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
We need to add/uncomment the gora-hbase dependency in our ivy.xml (around line 118).
open ~/Desktop/Nutch/nutch/ivy/ivy.xml
# Find and uncomment this line (approx. line 118)
<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />
**Test your HBase**
# Start it up!
~/Desktop/Nutch/hbase/bin/start-hbase.sh
# Stop it (Can take a while, be patient)
~/Desktop/Nutch/hbase/bin/stop-hbase.sh
# Access the shell
~/Desktop/Nutch/hbase/bin/hbase shell
# list = list all tables
# disable 'webpage' = disable the table (before dropping)
# drop 'webpage' = drop the table (webpage is created & used by nutch)
# exit = exit from hbase
# For the next part, we need to start hbase
~/Desktop/Nutch/hbase/bin/start-hbase.sh
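Before re-running the inject, it helps to confirm that the master and its ZooKeeper are actually answering. A minimal check (a sketch; the nc probe assumes ZooKeeper on the default port 2181):
# The master process should be listed
jps | grep HMaster
# ZooKeeper should answer "imok"
echo ruok | nc localhost 2181
# The shell's status command should report a live server
echo "status" | ~/Desktop/Nutch/hbase/bin/hbase shell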
Also follow these testing steps:
First, check version compatibility.
Make sure the JAVA_HOME and NUTCH_JAVA_HOME environment variables are set.
Compile Nutch with Ant (ant runtime); a minimal sequence is sketched below.
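A minimal end-to-end sequence, assuming the tutorial's ~/Desktop/Nutch layout (adjust the paths for your install):
cd ~/Desktop/Nutch/nutch
ant runtime                                # rebuild runtime/local after the config changes
~/Desktop/Nutch/hbase/bin/start-hbase.sh   # make sure HBase is up first
runtime/local/bin/nutch inject urls        # then re-run the inject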

Related

Hbase connection about zookeeper error

Environment: Ubuntu 14.04, hadoop-2.2.0, hbase-0.98.7
When I start Hadoop and HBase (single-node mode), both start successfully (I also checked the web UIs: port 8088 for Hadoop, 60010 for HBase).
jps
4507 SecondaryNameNode
5350 HRegionServer
4197 NameNode
4795 NodeManager
3948 QuorumPeerMain
5209 HMaster
4678 ResourceManager
5831 Jps
4310 DataNode
But when I check hbase-hadoop-master-localhost.log, I find the following information:
2014-10-23 14:16:11,392 INFO [main-SendThread(localhost:2181)] zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2014-10-23 14:16:11,426 INFO [main-SendThread(localhost:2181)] zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
I have googled a lot of websites about that "unknown error" problem, but I can't solve it.
Following is my Hadoop and HBase configuration.
Hadoop:
slaves file content: localhost
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:8020</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>localhost:9001</value>
<description>host is the hostname of the resource manager and
port is the port on which the NodeManagers contact the Resource Manager.
</description>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>localhost:9002</value>
<description>host is the hostname of the resourcemanager and port is the port
on which the Applications in the cluster talk to the Resource Manager.
</description>
</property>
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
<description>In case you do not want to use the default scheduler</description>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>localhost:9003</value>
<description>the host is the hostname of the ResourceManager and the port is the port on
which the clients can talk to the Resource Manager. </description>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value></value>
<description>the local directories used by the nodemanager</description>
</property>
<property>
<name>yarn.nodemanager.address</name>
<value>localhost:9004</value>
<description>the nodemanagers bind to this port</description>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>10240</value>
<description>the amount of memory on the NodeManager, in MB</description>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/app-logs</value>
<description>directory on hdfs where the application logs are moved to </description>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value></value>
<description>the directories used by Nodemanagers as log directories</description>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>shuffle service that needs to be set for Map Reduce to run </description>
</property>
</configuration>
Hbase:
hbase-env.sh :
..
export JAVA_HOME="/usr/lib/jvm/java-7-oracle"
..
export HBASE_MANAGES_ZK=true
..
hbase-site.xml
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:8020/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
</configuration>
regionserver content : localhost
my /etc/hosts content:
127.0.0.1 localhost
#127.0.1.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
I have tried lots of methods to solve it, but all fail; please help me solve it, I really need to know how.
Originally, I ran a MapReduce program, and at map 67% reduce 0% it printed out some INFO messages, some of which follow:
14/10/23 15:50:41 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=60000 watcher=org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher#ce1472
14/10/23 15:50:41 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
14/10/23 15:50:41 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
14/10/23 15:50:41 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x1493be510380007, negotiated timeout = 40000
14/10/23 15:50:43 INFO mapred.LocalJobRunner: map > sort
14/10/23 15:50:46 INFO mapred.LocalJobRunner: map > sort
Then it crashes. I think the program may be deadlocked, and that is why I want to solve the ZooKeeper problem above.
If you want any other configuration file I set in Hadoop, HBase, or elsewhere, just tell me and I'll post it.
Thanks!
I don't think ZooKeeper is your problem. You should look at your other logs for more information about your map/reduce job status. Check the DataNode and NameNode logs for errors, along with the YARN log messages through the YARN job tracker UI.
ZooKeeper Messages
Those messages come from ZooKeeper trying to connect with the ZooKeeper SASL client. If SASL is not configured, the client will still be able to connect, but the connection won't be authenticated.
The message comes from this file, ZooKeeperSaslClient.java:
// The user did not override the default context. It might be that they just don't intend to use SASL,
// so log at INFO, not WARN, since they don't expect any SASL-related information.
String msg = "Will not attempt to authenticate using SASL ";
if (runtimeException != null) {
    msg += "(" + runtimeException + ")";
} else {
    msg += "(unknown error)";
}
this.configStatus = msg;
this.isSASLConfigured = false;
If you want to get rid of the message you will have to configure ZooKeeper to use SASL. Sorry, I don't have any experience with configuring SASL for ZooKeeper.
ZooKeeper SASL Configuration
Add the following properties to the hbase-site.xml file:
<property>
<name>hbase.zookeeper.quorum</name>
<value>192.168.56.101</value> <!-- this is my server IP -->
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
restart ./start-hbase.sh
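After the restart, one way to confirm ZooKeeper is actually reachable on the quorum address you configured (a sketch using ZooKeeper's ruok four-letter command):
echo ruok | nc 192.168.56.101 2181   # expect "imok" back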
This is how I solved it, after trying to include hbase-site.xml in the classpath and passing the ZooKeeper quorum value with -Dhbase.zookeeper.quorum, neither of which worked.
I copied hbase-site.xml to the same folder as my jar and then did
jar uf myjar.jar hbase-site.xml
And then I ran hadoop jar myjar.jar Blah
This fixed the problem
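An alternative that achieves the same thing without repacking the jar (a sketch; /etc/hbase/conf stands in for wherever your hbase-site.xml lives):
# Put the directory containing hbase-site.xml on the job classpath instead
export HADOOP_CLASSPATH=/etc/hbase/conf:$HADOOP_CLASSPATH
hadoop jar myjar.jar Blah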

Hbase master keeps dying, claims a hbase:namespace already exists

In today's episode of "HBase is bringing me to my wits' end", we have an issue where the HBase master starts and then very quickly dies. My master log is like so:
2014-06-20 12:52:40,469 FATAL [master:hdev01:60000] master.HMaster: Master server abort: loaded coprocessors are: []
2014-06-20 12:52:40,470 FATAL [master:hdev01:60000] master.HMaster: Unhandled exception. Starting shutdown.
org.apache.hadoop.hbase.TableExistsException: hbase:namespace
at org.apache.hadoop.hbase.master.handler.CreateTableHandler.prepare(CreateTableHandler.java:120)
at org.apache.hadoop.hbase.master.TableNamespaceManager.createNamespaceTable(TableNamespaceManager.java:232)
at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:86)
at org.apache.hadoop.hbase.master.HMaster.initNamespace(HMaster.java:1062)
at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:926)
at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:615)
at java.lang.Thread.run(Thread.java:662)
2014-06-20 12:52:40,473 INFO [master:hdev01:60000] master.HMaster: Aborting
2014-06-20 12:52:40,473 DEBUG [master:hdev01:60000] master.HMaster: Stopping service threads
2014-06-20 12:52:40,473 INFO [master:hdev01:60000] ipc.RpcServer: Stopping server on 60000
2014-06-20 12:52:40,473 INFO [CatalogJanitor-hdev01:60000] master.CatalogJanitor: CatalogJanitor-hdev01:60000 exiting
2014-06-20 12:52:40,473 INFO [hdev01,60000,1403283149823-BalancerChore] balancer.BalancerChore: hdev01,60000,1403283149823-BalancerChore exiting
2014-06-20 12:52:40,474 INFO [RpcServer.listener,port=60000] ipc.RpcServer: RpcServer.listener,port=60000: stopping
2014-06-20 12:52:40,474 INFO [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopped
2014-06-20 12:52:40,474 INFO [master:hdev01:60000] master.HMaster: Stopping infoServer
2014-06-20 12:52:40,474 INFO [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopping
2014-06-20 12:52:40,474 INFO [master:hdev01:60000.oldLogCleaner] cleaner.LogCleaner: master:hdev01:60000.oldLogCleaner exiting
2014-06-20 12:52:40,475 INFO [hdev01,60000,1403283149823-ClusterStatusChore] balancer.ClusterStatusChore: hdev01,60000,1403283149823-ClusterStatusChore exiting
2014-06-20 12:52:40,476 INFO [master:hdev01:60000.oldLogCleaner] master.ReplicationLogCleaner: Stopping replicationLogCleaner-0x246ba2ab1e4001c, quorum=hdev02:5181,hdev01:5181,hdev03:5181, baseZNode=/hbase
2014-06-20 12:52:40,479 INFO [master:hdev01:60000] mortbay.log: Stopped SelectChannelConnector#0.0.0.0:16010
2014-06-20 12:52:40,478 INFO [master:hdev01:60000.archivedHFileCleaner] cleaner.HFileCleaner: master:hdev01:60000.archivedHFileCleaner exiting
2014-06-20 12:52:40,483 INFO [master:hdev01:60000.oldLogCleaner] zookeeper.ZooKeeper: Session: 0x246ba2ab1e4001c closed
2014-06-20 12:52:40,484 INFO [master:hdev01:60000-EventThread] zookeeper.ClientCnxn: EventThread shut down
2014-06-20 12:52:40,589 DEBUG [master:hdev01:60000] catalog.CatalogTracker: Stopping catalog tracker org.apache.hadoop.hbase.catalog.CatalogTracker#f3f348b
2014-06-20 12:52:40,591 INFO [master:hdev01:60000] client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x246ba2ab1e4001b
2014-06-20 12:52:40,592 INFO [master:hdev01:60000] zookeeper.ZooKeeper: Session: 0x246ba2ab1e4001b closed
2014-06-20 12:52:40,592 INFO [master:hdev01:60000-EventThread] zookeeper.ClientCnxn: EventThread shut down
2014-06-20 12:52:40,695 INFO [hdev01,60000,1403283149823.splitLogManagerTimeoutMonitor] master.SplitLogManager$TimeoutMonitor: hdev01,60000,1403283149823.splitLogManagerTimeoutMonitor exiting
2014-06-20 12:52:40,696 INFO [master:hdev01:60000] zookeeper.ZooKeeper: Session: 0x246ba2ab1e4001a closed
2014-06-20 12:52:40,696 INFO [main-EventThread] zookeeper.ClientCnxn: EventThread shut down
2014-06-20 12:52:40,696 INFO [master:hdev01:60000] master.HMaster: HMaster main thread exiting
2014-06-20 12:52:40,697 ERROR [main] master.HMasterCommandLine: Master exiting
java.lang.RuntimeException: HMaster Aborted
at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:194)
at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:135)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2803)
I thought this might be some remnant of an old run, so I deleted the files in HBase's data directory, ZooKeeper's data directory, and my HDFS. I still got the same error. Strangely, my HMaster popped back up again temporarily when I ran stop-hbase.sh, although there wasn't much I could do with it.
My HBase version is 0.98.3 and my Hadoop is 2.2.0. My hbase-site.xml is:
<configuration>
<property>
<name>hbase.master</name>
<value>hdev01:60000</value>
<description>The host and port that the HBase master runs at.
A value of 'local' runs the master and a regionserver
in a single process.
</description>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://hdev01:9000/hbase</value>
<description>The directory shared by region servers.</description>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
<description>The mode the cluster will be in. Possible values are
false: standalone and pseudo-distributed setups with managed
Zookeeper true: fully-distributed with unmanaged Zookeeper
Quorum (see hbase-env.sh)
</description>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>5181</value>
<description>Property from ZooKeeper's config zoo.cfg.
The port at which the clients will connect.
</description>
</property>
<property>
<name>zookeeper.session.timeout</name>
<value>10000</value>
<description></description>
</property>
<property>
<name>hbase.client.retries.number</name>
<value>10</value>
<description></description>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>hdev01,hdev02,hdev03</value>
<description>Comma separated list of servers in the ZooKeeper Quorum. For example, "host1.mydomain.com,host2.mydomain.com". By default this is set to localhost for local and pseudo-distributed modes of operation. For a fully-distributed setup, this should be set to a full list of ZooKeeper quorum servers. If
HBASE_MANAGES_ZK is set in hbase-env.sh
this is the list of servers which we will start/stop
ZooKeeper on.
</description>
</property>
</configuration>
EDIT
I attempted hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair; my error now is "HBase file layout needs to be upgraded. You have version null and I want version 8. Is your hbase.rootdir valid? If so, you may need to run 'hbase hbck -fixVersionFile'".
Which is unhelpful, since without a master hbck will not actually run.
EDIT 2
I nuked and restarted my DFS and then tried repairing and starting things again; I am now back where I started.
hbase:namespace is the internal namespace HBase uses for its own management tables. Try to run the offline repair tool from the $HBASE_HOME directory:
./bin/hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair
su - hdfs
hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair
(Restart the HBase master. If you are still facing the issue, then do the following:)
zookeeper-client (enter)
rmr /hbase
quit
Then restart the hbase master service
#shash:
When HBase manages ZooKeeper (i.e. HBASE_MANAGES_ZK=true), the command to access and clean HBase's data is hbase zkcli. Afterwards you clean HBase using the command rmr /hbase, then you quit.
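Put together, the cleanup looks roughly like this (a sketch; rmr /hbase deletes HBase's ZooKeeper state, so run it only with the master stopped):
bin/hbase zkcli        # opens the ZooKeeper CLI against HBase's managed quorum
# at the zkcli prompt:
#   rmr /hbase
#   quit
bin/start-hbase.sh     # then restart HBase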

How to setup titan over hbase in a single node hadoop

I have a single-node Hadoop setup and have also installed HBase on my Ubuntu 12.04. Now I want to install Titan over HBase. I have set up hadoop-1.0.3, hbase-0.94.18, and titan/hbase-0.4.2.
I have added a user mnit. My /usr/local/ folder contains hadoop2, hbase2, titan2. First I start my Hadoop using the command bin/start-all.sh, and then I start HBase using the command bin/start-hbase.sh. After that, when I do jps I find the following:
mnit#aman:/usr/local$ jps
9921 DataNode
11386 HRegionServer
11041 HQuorumPeer
11537 Jps
11115 HMaster
10153 SecondaryNameNode
10252 JobTracker
9691 NameNode
10483 TaskTracker
Now I start gremlin.sh in titan2 using the command bin/gremlin.sh.
I applied the following commands:
mnit#aman:/usr/local/titan2$ bin/gremlin.sh
gremlin> conf = new BaseConfiguration();
==>org.apache.commons.configuration.BaseConfiguration#19288c2
gremlin> conf.setProperty("storage.backend","hbase");
==>null
gremlin> conf.setProperty("storage.hostname","127.0.0.1");
==>null
gremlin> g = TitanFactory.open(conf);
WARN org.apache.zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper - Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
WARN org.apache.zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
When I searched on this problem, I found that there is a file named pom.xml, but the Titan that I downloaded does not contain pom.xml. Please tell me whether this is a problem due to pom.xml, whether I am doing something wrong, or whether there is some other issue.
Thanks in advance.
ZooKeeper is managed by HBase on my system. I have added the following line in bin/hbase-env.sh:
export HBASE_MANAGES_ZK=true
The content of my hbase-site.xml is as follows:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:54310/user/hbase</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2222</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
<property>
<name>hbase.zookeeper.property.datadir</name>
<value>/app/hadoop/tmp/zookeeper</value>
</property>
</configuration>
Your Titan and HBase configurations appear to be inconsistent. Your hbase-site.xml overrides the default ZK port (2181) to 2222, but it seems you haven't told Titan to use this non-default ZK port by setting storage.port in your Titan config file. Naturally, they can't talk to each other in that state. This doesn't have anything to do with pom.xml.
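A quick way to confirm the mismatch (a sketch): check which port the HBase-managed ZooKeeper is actually listening on and compare it with what Titan is trying to reach (2181 by default).
netstat -lnt | grep -E ':2181|:2222'   # which port is ZooKeeper really on?
echo ruok | nc 127.0.0.1 2222          # expect "imok" if it is on 2222
If it answers on 2222, pointing Titan at that port via storage.port should line the two up.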
By the way, please don't simultaneously crosspost to SO and the aureliusgraphs Google Group. They're both good venues with slightly different purposes, but you seem to have just copy-pasted between this SO question and your subjectless thread on the aureliusgraphs list.

Fully Distributed HBase Error

I'm trying to set up HBase 0.96 to run on top of my Hadoop 2.2.0 cluster. I run start-hbase.sh and the master, along with the region servers, starts up. I can log into each region server and see the processes running. However, when I check how many region servers are up, either through the web UI or a shell command, I get a response of 0. Based on the logs it looks like the region servers are starting up but are unable to notify the master that they are running. I confirmed that the master is listening on port 60000 and that ports 60000 and 60020 are both open. I've included my hbase-site file along with the logs from a region server.
<property>
<name>hbase.rootdir</name>
<value>hdfs://master:9000/hbase</value>
<description>The directory shared by RegionServers.
</description>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
<description>The mode the cluster will be in. Possible values are
false: standalone and pseudo-distributed setups with managed Zookeeper
true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
</description>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>master</value>
</property>
<property>
<name>zookeeper.znode.parent</name>
<value>/master</value>
</property>
Log File:
2013-11-08 20:08:58,357 INFO [regionserver60020] regionserver.HRegionServer: reportForDuty to master=10.119.102.58,60000,1383941300240 with port=60020, startcode=1383941300420
2013-11-08 20:09:18,636 WARN [regionserver60020] regionserver.HRegionServer: error telling master we are up
com.google.protobuf.ServiceException: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connec$
at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1667)
at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1708)
at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:5402)
at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1924)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:790)
at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending local=/100.65.$
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:532)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupConnection(RpcClient.java:573)
at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupIOstreams(RpcClient.java:858)
at org.apache.hadoop.hbase.ipc.RpcClient.getConnection(RpcClient.java:1532)
at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1421)
at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1650)
... 5 more
2013-11-08 20:09:18,676 WARN [regionserver60020] regionserver.HRegionServer: reportForDuty failed; sleeping and then retrying.
I don't think the hbase.zookeeper.quorum is set correctly, which may cause the connection timeout. If you just want to test 0.96, start it in standalone mode first and make sure the ZooKeeper cluster is running before you change to distributed mode.
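A rough sequence for that standalone sanity check (a sketch; standalone mode needs hbase.cluster.distributed=false and a local file:// rootdir in conf/hbase-site.xml):
bin/stop-hbase.sh
# edit conf/hbase-site.xml for standalone mode, then:
bin/start-hbase.sh
echo "status" | bin/hbase shell   # should report one live server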
The HRegionServer complains that it can't connect to the HMaster in order to report that it is up.
It's probable that the HMaster process is not running, so you may want to start it; or, if you already started it, check the master log file.
Check that your master server is listening on port 60000 by using the following command:
netstat -l
tcp6 0 0 Vostro-350:60000 : LISTEN
If the server is listening on IPv6, then disable it.
To disable it, append the following to the file /etc/sysctl.conf:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
After reboot you should validate that IPV6 is really off by:
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
(0 = IPV6 on ; 1 = IPV6 off)
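The new settings can also be applied without waiting for a reboot (a sketch):
sudo sysctl -p                                   # reload /etc/sysctl.conf
cat /proc/sys/net/ipv6/conf/all/disable_ipv6     # 1 means IPv6 is now off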
Ref link : Connecting and Persisting to HBase

Zookeeper connecting to localhost

While running a Pig script that inserts data into HBase, I got the following error:
2013-10-25 14:57:03,147 [main-SendThread(localhost:2181)] INFO
org.apache.zookeeper.ClientCnxn - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2013-10-25 14:57:03,147 [main-SendThread(localhost:2181)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established to localhost/127.0.0.1:2181, initiating session
2013-10-25 14:57:03,169 [main-SendThread(localhost:2181)] INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x41ead7bb150092, negotiated timeout = 180000
Also, when I look at the slave logs, they say they are unable to establish a connection to the master, but the master says the connection is established successfully. Below are the logs of both:
Master Log :
zookeeper.ZooKeeper: Initiating client connection, connectString=slave:2181,hadoop-master:2181,ubuntu:2181 sessionTimeout=180000 watcher=hconnection
13/10/24 19:51:56 INFO zookeeper.ClientCnxn: Opening socket connection to server slave:2181. Will not attempt to authenticate using SASL (unknown error)
13/10/24 19:51:56 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 104744#hadoop-master
13/10/24 19:51:56 INFO zookeeper.ClientCnxn: Socket connection established to slave:2181, initiating session
13/10/24 19:51:56 INFO zookeeper.ClientCnxn: Session establishment complete on server slave:2181, sessionid = 0x141ead77c250002, negotiated timeout = 180000
Slave Log
2013-10-24 19:51:32,174 INFO org.apache.zookeeper.server.quorum.FastLeaderElection: New election. My id = 1, proposed zxid=0x80000002a
2013-10-24 19:51:32,180 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Cannot open channel to 0 at election address hadoop-master:3888
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
Below are my configurations:
hbase-env.sh
export HBASE_MANAGES_ZK=true
hbase-site.xml
<configuration>
<property>
<name>hbase.master</name>
<value>hadoop-master:60000</value>
<description>
The host and port that the HBase master runs at.
A value of 'local' runs the master and a regionserver in a single process.
</description>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://hadoop-master:54310/hbase</value>
<description>
The directory shared by RegionServers.
</description>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
<description>
The mode the cluster will be in. Possible values are
false: standalone and pseudo-distributed setups with managed Zookeeper
true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
</description>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
<description>
Property from ZooKeeper's config zoo.cfg.
The port at which the clients will connect.
</description>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>hadoop-master,slave,ubuntu</value>
<description>
Comma separated list of servers in the ZooKeeper Quorum.
For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
By default this is set to localhost for local and pseudo-distributed modes
of operation. For a fully-distributed setup, this should be set to a full
list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in hbase-env.sh
this is the list of servers which we will start/stop ZooKeeper on.
</description>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
<description>
Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>
By entering the command jps I get the following output:
At Master:
104744 HMaster
91841 JobTracker
80184 TaskTracker
91475 DataNode
91222 NameNode
105062 HRegionServer
91747 SecondaryNameNode
104666 HQuorumPeer
At Slave:
11533 HQuorumPeer
2444 SecondaryNameNode
5970 TaskTracker
11756 HRegionServer
This is probably an issue with your HBase config not being on the classpath when running the Pig script. One thing you can try is to set the zookeeper connection settings in the Pig script to see if it connects to the correct instance. If it does, you can add the HBase config to the classpath before launching your script.
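One way to do the latter (a sketch; the conf path and script name are placeholders): point Pig's classpath at the directory that holds the correct hbase-site.xml before launching.
export PIG_CLASSPATH=/etc/hbase/conf:$PIG_CLASSPATH
pig -x mapreduce my_script.pig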
