Oozie Hive action with Kerberos on HDP-1.3.3 - hadoop

I'm trying to execute a Hive script from an Oozie Hive action in a Kerberos-enabled environment.
Here is my workflow.xml:
<action name="hive-to-hdfs">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<job-xml>hive-site.xml</job-xml>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<script>script.q</script>
<param>HIVE_EXPORT_TIME=${hiveExportTime}</param>
</hive>
<ok to="pass"/>
<error to="fail"/>
I'm facing an issue when trying to connect to the Hive metastore.
6870 [main] INFO hive.metastore - Trying to connect to metastore with URI thrift://10.0.0.242:9083
Heart beat
Heart beat
67016 [main] WARN hive.metastore - set_ugi() not successful, Likely cause: new client talking to old server. Continuing without it.
org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
67018 [main] INFO hive.metastore - Waiting 1 seconds before next connection attempt.
68018 [main] INFO hive.metastore - Connected to metastore.
Heart beat
Heart beat
128338 [main] WARN org.apache.hadoop.hive.metastore.RetryingMetaStoreClient - MetaStoreClient lost connection. Attempting to reconnect.
org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
129339 [main] INFO hive.metastore - Trying to connect to metastore with URI thrift://10.0.0.242:9083
Heart beat
Heart beat
189390 [main] WARN hive.metastore - set_ugi() not successful, Likely cause: new client talking to old server. Continuing without it.
org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
189391 [main] INFO hive.metastore - Waiting 1 seconds before next connection attempt.
190391 [main] INFO hive.metastore - Connected to metastore.
Heart beat
Heart beat
250449 [main] ERROR org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table SESSION_MASTER
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:953)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:887)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1083)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1059)
When I disable Kerberos security, the workflow works fine.

To enable your Oozie Hive action to function on a secured cluster, you need to include a <credentials> section with a credential of type 'hcat' in your workflow.
Your workflow would then look something like:
<workflow-app name='workflow' xmlns='uri:oozie:workflow:0.1'>
    <credentials>
        <credential name='hcat' type='hcat'>
            <property>
                <name>hcat.metastore.uri</name>
                <value>HCAT_URI</value>
            </property>
            <property>
                <name>hcat.metastore.principal</name>
                <value>HCAT_PRINCIPAL</value>
            </property>
        </credential>
    </credentials>
    <action name="hive-to-hdfs" cred="hcat">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>hive-site.xml</job-xml>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <script>script.q</script>
            <param>HIVE_EXPORT_TIME=${hiveExportTime}</param>
        </hive>
        <ok to="pass"/>
        <error to="fail"/>
    </action>
</workflow-app>
There is also Oozie documentation about this feature.
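For the credential to work, the 'hcat' credential type also has to be registered on the Oozie server. A minimal oozie-site.xml sketch (the class name below is the standard one shipped with Oozie; verify it against your version):
<property>
    <name>oozie.credentials.credentialclasses</name>
    <value>hcat=org.apache.oozie.action.hadoop.HCatCredentials</value>
</property>
HCAT_URI is typically the metastore Thrift URI (thrift://10.0.0.242:9083 in your log), and HCAT_PRINCIPAL is the metastore Kerberos principal, usually of the form hive/_HOST@YOUR-REALM.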

Related

Cannot start Nutch crawling

I'm trying to deploy Nutch 2.3 + ElasticSearch 1.4 + HBase 0.94 on Ubuntu 14.04 following this tutorial. When I try to start the crawl by injecting the URLs with:
$NUTCH_ROOT/runtime/local/bin/nutch inject urls
I get:
InjectorJob: starting at 2017-10-12 19:27:48
InjectorJob: Injecting urlDir: urls
and the process remains there for hours.
How do I know what's going on?
Configuration files:
nutch-site.xml
<configuration>
    <property>
        <name>http.agent.name</name>
        <value>mycrawlername</value> <!-- this can be changed to something more sane if you like -->
    </property>
    <property>
        <name>http.robots.agents</name>
        <value>mycrawlername</value> <!-- this is the robot name we're looking for in robots.txt files -->
    </property>
    <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.hbase.store.HBaseStore</value>
    </property>
    <property>
        <name>plugin.includes</name>
        <!-- do **NOT** enable the parse-html plugin, if you want proper HTML parsing. Use something like parse-tika! -->
        <value>protocol-httpclient|urlfilter-regex|parse-(text|tika|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic</value>
    </property>
    <property>
        <name>db.ignore.external.links</name>
        <value>true</value> <!-- do not leave the seeded domains (optional) -->
    </property>
    <property>
        <name>elastic.host</name>
        <value>localhost</value> <!-- where is ElasticSearch listening -->
    </property>
</configuration>
hbase-site.xml
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>/home/kike/RIWS/hbase-0.94.14/</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>false</value>
    </property>
</configuration>
Log files:
HBase master log
2017-10-12 19:27:49,593 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:47778
2017-10-12 19:27:49,596 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /127.0.0.1:47778
2017-10-12 19:27:49,609 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x15f11684f3f0017 with negotiated timeout 40000 for client /127.0.0.1:47778
2017-10-12 19:31:11,092 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Stats: total=1.99 MB, free=239.7 MB, max=241.69 MB, blocks=2, accesses=18, hits=16, hitRatio=88,88%, , cachingAccesses=18, cachingHits=16, cachingHitsRatio=88,88%, , evictions=0, evicted=0, evictedPerRun=NaN
2017-10-12 19:31:24,623 DEBUG org.apache.hadoop.hbase.client.MetaScanner: Scanning .META. starting at row= for max=2147483647 rows using org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation#1646b7c
2017-10-12 19:31:24,630 DEBUG org.apache.hadoop.hbase.master.CatalogJanitor: Scanned 0 catalog row(s) and gc'd 0 unreferenced parent region(s)
2017-10-12 19:32:13,832 INFO org.apache.zookeeper.server.PrepRequestProcessor: Processed session termination for sessionid: 0x15f11684f3f0017
2017-10-12 19:32:13,849 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /127.0.0.1:47778 which had sessionid 0x15f11684f3f0017
2017-10-12 19:32:14,852 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /127.0.0.1:47817
2017-10-12 19:32:14,853 INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /127.0.0.1:47817
2017-10-12 19:32:14,880 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x15f11684f3f0018 with negotiated timeout 40000 for client /127.0.0.1:47817
Hadoop log
2017-10-12 19:27:48,871 INFO crawl.InjectorJob - InjectorJob: starting at 2017-10-12 19:27:48
2017-10-12 19:27:48,871 INFO crawl.InjectorJob - InjectorJob: Injecting urlDir: urls
EDIT:
After some time, the Hadoop log shows:
2017-10-12 20:34:59,333 ERROR crawl.InjectorJob - InjectorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:78)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:218)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times
at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:133)
at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
... 7 more
Caused by: org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times
at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:139)
at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:115)
... 9 more
But if I type jps I can see the HMaster running:
31672 Jps
20553 HMaster
19739 Elasticsearch
Your error log shows (hbase.MasterNotRunningException):
org.apache.hadoop.hbase.MasterNotRunningException: Retried 14 times
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:78)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:218)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:252)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:275)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:284)
We need to set up HBase. We need to tell HBase the rootdir of the install and also specify a data directory for ZooKeeper, so open hbase-site.xml and add the following two properties:
open ~/Desktop/Nutch/hbase/conf/hbase-site.xml
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>file:///Users/sntiwari/Desktop/Nutch/hbase</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/Users/sntiwari/Desktop/Nutch/zookeeper</value>
    </property>
</configuration>
Next, we need to tell Gora to use HBase for its default data store.
open ~/Desktop/Nutch/nutch/conf/gora.properties
# open ~/Desktop/Nutch/nutch/runtime/local/conf/gora.properties
# Add this line under `HBaseStore properties` (to keep things organised)
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
We need to add/uncomment the gora-hbase dependency in our ivy.xml (around line 118).
open ~/Desktop/Nutch/nutch/ivy/ivy.xml
# Find and uncomment this line (approx. line 118)
<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />
Test your HBase:
# Start it up!
~/Desktop/Nutch/hbase/bin/start-hbase.sh
# Stop it (Can take a while, be patient)
~/Desktop/Nutch/hbase/bin/stop-hbase.sh
# Access the shell
~/Desktop/Nutch/hbase/bin/hbase shell
# list = list all tables
# disable 'webpage' = disable the table (before dropping)
# drop 'webpage' = drop the table (webpage is created & used by nutch)
# exit = exit from hbase
# For the next part, we need to start hbase
~/Desktop/Nutch/hbase/bin/start-hbase.sh
Follow some testing steps as well:
First, check version compatibility.
Make sure the JAVA_HOME and NUTCH_JAVA_HOME environment variables are set.
Compile Nutch with ant (ant runtime); a sketch of this follows.
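A minimal sketch of that last step, assuming the directory layout used above; the JDK path is an assumption, so point it at your own install:
# JDK location is an assumption; replace it with your actual install
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export NUTCH_JAVA_HOME=$JAVA_HOME
cd ~/Desktop/Nutch/nutch
ant runtime   # builds runtime/local, which bin/nutch above runs from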

oozie sqoop action fails to import

I am facing an issue while executing an Oozie Sqoop action. In the logs I can see that Sqoop is able to import the data to a temp directory, and then creates Hive scripts to import the data.
It fails while importing the data into Hive.
Below is the Sqoop action I am using.
<action name="import" retry-max="2" retry-interval="5">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${jobQueue}</value>
</property>
</configuration>
<arg>import</arg>
<arg>-D</arg>
<arg>sqoop.mapred.auto.progress.max=300000</arg>
<arg>-D</arg>
<arg>map.retry.exponentialBackOff=TRUE</arg>
<arg>-D</arg>
<arg>map.retry.numRetries=3</arg>
<arg>--options-file</arg>
<arg>${odsparamFileName}</arg>
<arg>--table</arg>
<arg>${odsTableName}</arg>
<arg>--where</arg>
<arg>${ods_data_pull_column} BETWEEN TO_DATE(${wf:actionData('getDates')['prevMonthBegin']},'YYYY-MM-DD hh24:mi:ss') AND TO_DATE(${wf:actionData('prevMonthEnd')['endDate']},'YYYY-MM-DD hh24:mi:ss')</arg>
<arg>--hive-import</arg>
<arg>--hive-overwrite</arg>
<arg>--hive-table</arg>
<arg>${stgTable}</arg>
<arg>--hive-drop-import-delims</arg>
<arg>--warehouse-dir</arg>
<arg>${sqoopStgDir}</arg>
<arg>--delete-target-dir</arg>
<arg>--null-string</arg>
<arg>\\N</arg>
<arg>--null-non-string</arg>
<arg>\\N</arg>
<arg>--compress</arg>
<arg>--compression-codec</arg>
<arg>gzip</arg>
<arg>--num-mappers</arg>
<arg>1</arg>
<arg>--verbose</arg>
<file>${odsSqoopConnectionParamsFileLocation}</file>
</sqoop>
<ok to="rev"/>
<error to="fail"/>
</action>
Below is the error I am getting in the mapred logs:
20078 [main] DEBUG org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat - Creating input split with lower bound '1=1' and upper bound '1=1'
Heart beat
Heart beat
Heart beat
Heart beat
151160 [main] INFO org.apache.sqoop.mapreduce.ImportJobBase - Transferred 0 bytes in 135.345 seconds (0 bytes/sec)
151164 [main] INFO org.apache.sqoop.mapreduce.ImportJobBase - Retrieved 0 records.
151164 [main] ERROR org.apache.sqoop.tool.ImportTool - Error during import: Import job failed!
Intercepting System.exit(1)
<<< Invocation of Main class completed <<<
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SqoopMain], exit code [1]
Oozie Launcher failed, finishing Hadoop job gracefully
Please suggest a solution.
You can import the table to an HDFS path using --target-dir and set the location of your Hive table to point to that path. I fixed it using this approach. Hope it helps you as well.
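A sketch of that approach, reusing the variables from your workflow (the staging path is made up, and older Hive versions may want a fully qualified hdfs:// URI in SET LOCATION):
# import to an explicit HDFS directory instead of --hive-import
sqoop import \
  --options-file ${odsparamFileName} \
  --table ${odsTableName} \
  --target-dir /staging/${odsTableName} \
  --num-mappers 1

# then point the Hive table at that directory
hive -e "ALTER TABLE ${stgTable} SET LOCATION '/staging/${odsTableName}'"
Defining the Hive table as EXTERNAL with that LOCATION avoids the fragile hive-import step entirely.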

Hadoop job stuck due to connection timeout

I am new to Hadoop. I have set up Hadoop on my Mac, and I am trying to run the following:
hadoop jar wordcount.jar /usr/joy/input /usr/joy/output
In response to the command, the following messages were printed in the terminal:
16/03/18 17:13:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/18 17:13:20 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
16/03/18 17:13:21 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/03/18 17:13:21 INFO input.FileInputFormat: Total input paths to process : 1
16/03/18 17:13:21 INFO mapreduce.JobSubmitter: number of splits:1
16/03/18 17:13:21 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1458279089418_0002
16/03/18 17:13:21 INFO impl.YarnClientImpl: Submitted application application_1458279089418_0002
16/03/18 17:13:21 INFO mapreduce.Job: The url to track the job: http://EN-AbhishekM:8088/proxy/application_1458279089418_0002/
Now when I check the status of the job in the browser, I find the following error in the logs:
Application application_1458279089418_0001 failed 2 times due to Error launching appattempt_1458279089418_0001_000002. Got exception: org.apache.hadoop.net.ConnectTimeoutException: Call From EN-AbhishekM/192.168.0.102 to 192.168.43.66:61029 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=192.168.43.66/192.168.43.66:61029];....
I am pasting configuration files here:
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
yarn-site.xml
<?xml version="1.0"?>
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
I formatted the filesystem with:
bin/hdfs namenode -format
started the NameNode and DataNode daemons with:
sbin/start-dfs.sh
and started the ResourceManager and NodeManager daemons with:
sbin/start-yarn.sh
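For reference, on a working pseudo-distributed setup, jps would at this point typically list something like the following (the process ids here are placeholders):
12001 NameNode
12102 DataNode
12230 SecondaryNameNode
12401 ResourceManager
12503 NodeManager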
Can anyone please suggest what mistake I am making here?

Scheduling/running mahout command in oozie

I'm trying to run the Mahout command seq2sparse using the Oozie scheduler, but it is giving an error.
I tried running the Mahout command using the Oozie shell action, but nothing worked.
Following is the Oozie workflow:
<action name="mahoutSeq2Sparse">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>mahout seq2sparse</exec>
<argument>-i</argument>
<argument>${nameNode}/tmp/Clustering/seqOutput</argument>
<argument>-o</argument>
<argument>${nameNode}/tmp/Clustering/seqToSparse</argument>
<argument>-ow</argument>
<argument>-nv</argument>
<argument>-x</argument>
<argument>100</argument>
<argument>-n</argument>
<argument>2</argument>
<argument>-wt</argument>
<argument>tf</argument>
<capture-output/>
</shell>
<ok to="brandCanopyInitialCluster" />
<error to="fail" />
</action>
I also tried creating a shell script and running it in Oozie:
<action name="mahoutSeq2Sparse">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>${EXEC}</exec>
<file>${EXEC}#${EXEC}</file>
</shell>
<ok to="brandCanopyInitialCluster" />
<error to="fail" />
</action>
with job.properties as
nameNode=hdfs://abc02:8020
jobTracker=http://abc02:8050/
clusteringJobInput=hdfs://abc02:8020/tmp/Activity/000000_0
queueName=default
oozie.wf.application.path=hdfs://abc02:8020/tmp/workflow/
oozie.use.system.libpath=true
EXEC=generatingBrandSparseFile.sh
and generatingBrandSparseFile.sh is
export INPUT_PATH="hdfs://abc02:8020/tmp/Clustering/seqOutput"
export OUTPUT_PATH="hdfs://abc02:8020/tmp/Clustering/seqToSparse"
sudo -u hdfs hadoop fs -chmod -R 777 "hdfs://abc02:8020/tmp/Clustering/seqOutput"
mahout seq2sparse -i ${INPUT_PATH} -o ${OUTPUT_PATH} -ow -nv -x 100 -n 2 -wt tf
sudo -u hdfs hadoop fs -chmod -R 777 ${OUTPUT_PATH}
but none of these options is working.
The error with the latter one is:
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
sudo: no tty present and no askpass program specified
15/06/05 12:23:59 WARN driver.MahoutDriver: No seq2sparse.props found on classpath, will use command-line arguments only
15/06/05 12:24:01 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
For the 'sudo: no tty present' error, I have added the following to /etc/sudoers:
Defaults !requiretty
Mahout is installed on the node where the Oozie server is installed.
Also, the following Oozie workflow is not valid:
<workflow-app xmlns="uri:oozie:workflow:0.4" name="map-reduce-wf">
<action name="mahoutSeq2Sparse">
<ssh>
<host>rootUserName#abc05.ad.abc.com<host>
<command>mahout seq2sparse</command>
<args>-i</arg>
<args>${nameNode}/tmp/Clustering/seqOutput</arg>
<args>-o</arg>
<args>${nameNode}/tmp/Clustering/seqToSparse</arg>
<args>-ow</args>
<args>-nv</args>
<args>-x</args>
<args>100</args>
<args>-n</args>
<args>2</args>
<args>-wt</args>
<args>tf</args>
<capture-output/>
</ssh>
<ok to="brandCanopyInitialCluster" />
<error to="fail" />
</action>
The error is: Error: E0701 : E0701: XML schema error, cvc-complex-type.2.4.a: Invalid content was found starting with element 'ssh'. One of '{"uri:oozie:workflow:0.4":map-reduce, "uri:oozie:workflow:0.4":pig, "uri:oozie:workflow:0.4":sub-workflow, "uri:oozie:workflow:0.4":fs, "uri:oozie:workflow:0.4":java, WC[##other:"uri:oozie:workflow:0.4"]}' is expected.
Will installing Mahout on all the nodes help? (Oozie can run the script on any node.)
Is there a way to make Mahout available on the whole Hadoop cluster?
Any other solution is also welcome.
Thanks in advance.
Edit:
I have changed the approach slightly, and now I am calling the seq2sparse class directly. The workflow is:
<action name="mahoutSeq2Sparse">
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<main-class>org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles</main-class>
<arg>-i</arg>
<arg>${nameNode}/tmp/OozieData/Clustering/seqOutput</arg>
<arg>-o</arg>
<arg>${nameNode}/tmp/OozieData/Clustering/seqToSparse</arg>
<arg>-ow</arg>
<arg>-nv</arg>
<arg>-x</arg>
<arg>100</arg>
<arg>-n</arg>
<arg>2</arg>
<arg>-wt</arg>
<arg>tf</arg>
</java>
<ok to="CanopyInitialCluster"/>
<error to="fail"/>
</action>
Still the job is not running; the error is:
>>> Invoking Main class now >>>
Main class : org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles
Arguments :
-i
hdfs://abc:8020/tmp/OozieData/Clustering/seqOutput
-o
hdfs://abc:8020/tmp/OozieData/Clustering/seqToSparse
-ow
-nv
-x
100
-n
2
-wt
tf
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
Heart beat
<<< Invocation of Main class completed <<<
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.JavaMain], main() threw exception, java.lang.IllegalStateException: Job failed!
org.apache.oozie.action.hadoop.JavaMainException: java.lang.IllegalStateException: Job failed!
at org.apache.oozie.action.hadoop.JavaMain.run(JavaMain.java:58)
at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:39)
at org.apache.oozie.action.hadoop.JavaMain.main(JavaMain.java:36)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:226)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.IllegalStateException: Job failed!
at org.apache.mahout.vectorizer.DictionaryVectorizer.startWordCounting(DictionaryVectorizer.java:368)
at org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:179)
at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:288)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:56)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.oozie.action.hadoop.JavaMain.run(JavaMain.java:55)
... 15 more
Oozie Launcher failed, finishing Hadoop job gracefully
Oozie Launcher, uploading action data to HDFS sequence file: hdfs://vchniecnveg02:8020/user/root/oozie-oozi/0000054-150604142118313-oozie-oozi-W/mahoutSeq2Sparse--java/action-data.seq
Oozie Launcher ends
These errors in Oozie are very frustrating. From my experience, most of them are produced by a typo in the XML or in the parameter order.
In your last workflow, you didn't close the host tag:
<host>rootUserName#abc05.ad.abc.com<host>
should be
<host>rootUserName#abc05.ad.abc.com</host>
For the shell error, I first recommend using schema version 0.2 (defined here: https://oozie.apache.org/docs/4.0.0/DG_ShellActionExtension.html#AE.A_Appendix_A_Shell_XML-Schema) and removing all the parameters and everything not needed just to start the action (don't worry about the results yet).
You need to use:
<shell xmlns="uri:oozie:shell-action:0.2">

Loading data into HBASE using importtsv causes error

I am trying to load a data from CSV File into HBASE using importtsv tool. I have set up a cluster of 3 machines.
This is my hbase-site.xml file
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://ec2-54-190-103-64.us-west-2.compute.amazonaws.com:9000/hbase</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>ec2-54-203-95-235.us-west-2.compute.amazonaws.com</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.clientPort</name>
        <value>2222</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/home/ubuntu/zookeeper</value>
        <description>Property from ZooKeeper's config zoo.cfg.
        The directory where the snapshot is stored.
        </description>
    </property>
</configuration>
When I start the cluster and run jps, I see HMaster on the master node, and HQuorumPeer and HRegionServer on the datanodes.
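(The exact importtsv command is not shown; a typical invocation looks like the following, where the table name, column mapping, and input path are placeholders:)
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator=',' \
  -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1,cf:col2 \
  mytable /user/ubuntu/input.csv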
When I try to load the data I get the following error:
INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
14/08/11 07:41:28 WARN zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
14/08/11 07:41:28 WARN zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
14/08/11 07:41:28 INFO util.RetryCounter: Sleeping 8000ms before retry #3...
14/08/11 07:41:29 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
I am not sure what the issue with ZooKeeper is. Thanks in advance.
You need to add the following property to your hbase-site.xml file (on both master and slaves):
<property>
    <name>hbase.zookeeper.property.maxClientCnxns</name>
    <value>1000</value>
</property>
