Oozie Hive Action Using -i init script - hadoop

How can I run an Oozie Hive or Hive2 Action with init scripts?
On the CLI this is usually done via the -i init.hive argument; however, when using this in an Oozie action via <argument>-i init.hive</argument>, the workflow stops with an error.
I linked the init.hive file with the <file>init.hive#init.hive</file> property and it is available in the local appcache directory.
$ ll appcache/application_1480609892100_0274/container_e55_1480609892100_0274_01_000001/ | grep init
> lrwxrwxrwx 1 root root 42 Jan 12 12:24 init.hive -> /hadoop/yarn/local/filecache/519/init.hive
The error (in the local appcache) is the following
Connecting to jdbc:hive2://localhost:10000/
Connected to: Apache Hive (version 1.2.1000.2.4.0.0-169)
Driver: Hive JDBC (version 1.2.1000.2.4.0.0-169)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Running init script init.hive
init.hive (No such file or directory)
The hive2 action looks like this (the complete workflow can be found on GitHub: https://github.com/chaosmail/oozie-bugs/tree/master/simple-hive-init/simple-hive-init-wf):
<action name="test-action">
<hive2 xmlns="uri:oozie:hive2-action:0.1">
<jdbc-url>${jdbcURL}</jdbc-url>
<script>query.hive</script>
<argument>-i init.hive</argument>
<file>init.hive#init.hive</file>
</hive2>
<ok to="end"/>
<error to="fail"/>
</action>
Edit 1: added workflow action

[Recap of the comments thread above, plus some extra stuff in retrospect]
The Oozie documentation states that you may have multiple <argument> elements in your Action, which hints that the arguments must be provided separately.
In retrospect, it makes sense -- on a command line, it's the shell that would parse the list of arguments into an args[] array for the Java executable, but Oozie is not a shell interpreter...
And experience shows that Beeline accepts two syntax variants for its command-line args...
-xValue (one arg) means option -x with associated Value
-x followed by Value (two args) means the same thing
So you have two correct ways to pass command-line arguments to Beeline via Oozie:
<argument>-xValue</argument>
<argument>-x</argument> <argument>Value</argument>
On the other hand, <argument>-x Value</argument> would fail, because in single-arg syntax, Beeline considers that the separator space should be part of the value...!
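Applied to the action in the question, the flag and its value therefore go into two separate <argument> elements:
<action name="test-action">
<hive2 xmlns="uri:oozie:hive2-action:0.1">
<jdbc-url>${jdbcURL}</jdbc-url>
<script>query.hive</script>
<argument>-i</argument>
<argument>init.hive</argument>
<file>init.hive#init.hive</file>
</hive2>
<ok to="end"/>
<error to="fail"/>
</action>
(By the same logic described above, the single-argument form <argument>-iinit.hive</argument> should also work.)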

Related

Could not find or load main class hdfs problem

I am trying to use Apache Rya for some tests (https://rya.apache.org/).
For those who are familiar with Rya and RDF stores, I am trying to do a bulk loading which is explained here: https://github.com/apache/rya/blob/master/extras/rya.manual/src/site/markdown/loaddata.md.
Briefly, I should copy a jar file, 'mapreduce/target/rya.mapreduce-<version>-shaded.jar', into an HDFS volume and then run the following command:
hadoop hdfs://volume/rya.mapreduce-<version>-shaded.jar org.apache.rya.accumulo.mr.tools.RdfFileInputTool -Dac.zk=localhost:2181 -Dac.instance=accumulo -Dac.username=root -Dac.pwd=secret -Drdf.tablePrefix=rya_ -Drdf.format=N-Triples hdfs://volume/dir1,hdfs://volume/dir2,hdfs://volume/file1.nt
Well, I copied the needed jar and the input files into HDFS using the bin/hadoop fs -put command and verified that they are really there. My problem is that when I run the command from the official example, I get the following error lines, which I could not understand or resolve:
/project/hadoop/libexec/hadoop-functions.sh: line 2393: HADOOP_HDFS://LOCALHOST:9000/USER/RYA.MAPREDUCE-4.0.0-INCUBATING-SHADED.JAR_USER: invalid variable name
/project/hadoop/libexec/hadoop-functions.sh: line 2358: HADOOP_HDFS://LOCALHOST:9000/USER/RYA.MAPREDUCE-4.0.0-INCUBATING-SHADED.JAR_USER: invalid variable name
/project/hadoop/libexec/hadoop-functions.sh: line 2453: HADOOP_HDFS://LOCALHOST:9000/USER/RYA.MAPREDUCE-4.0.0-INCUBATING-SHADED.JAR_OPTS: invalid variable name
Error: Could not find or load main class hdfs:..localhost:9000.user.rya.mapreduce-4.0.0-incubating-shaded.jar
For information: all environment variables (HADOOP_HOME and HADOOP_PREFIX) are properly set.

Listing MS SQL Server table in OOZIE via SQOOP Action

I am able to execute the following SQOOP command in CLI perfectly.
sqoop list-tables
--connect 'jdbc:sqlserver://xx.xx.xx.xx\MSSQLSERVER2012:1433;username=usr;password=xxx;database=db'
--connection-manager org.apache.sqoop.manager.SQLServerManager
--driver com.microsoft.sqlserver.jdbc.SQLServerDriver
-- --schema schma
But I get errors while trying the same in Oozie (Hue):
2055 [main] ERROR org.apache.sqoop.manager.CatalogQueryManager -
Failed to list tables java.sql.SQLException: No suitable driver found
for 'jdbc:sqlserver://xx.xx.xx.xx\MSSQLSERVER2012:1433;username=usr;password=xxx;database=db'
-
2057 [main] ERROR org.apache.sqoop.Sqoop - Got exception running
Sqoop: java.lang.RuntimeException: java.sql.SQLException: No suitable
driver found for 'jdbc:sqlserver://xx.xx.xx.xx\MSSQLSERVER2012:1433;username=usr;password=xxx;database=db'
How can we get it to work in Oozie?
(I am working on the Cloudera Hadoop distribution.)
This worked for me using CDH 5.11 and the Hue Workflow Editor to create an Oozie Sqoop 1 workflow, but it REQUIRES you to hard-code the UserName and Password arguments. Screenshots are included below.
Here is the Step-by-Step:
1. Open the Hue > Workflow Editor.
2. Create a new workflow.
3. Drag the Sqoop 1 action into the "drop your action here" grey box.
4. Ignore the default Sqoop command box and instead hit the + to the right of the ARGUMENTS section below the Sqoop command box to add a new argument.
5. Add "import" without the double quote marks as the very first argument.
6. Delete the entire content of the Sqoop command box; it needs to be empty.
7. Add a new argument with the value of "--connect" without the double quotes.
8. Add a new argument with the value of "jdbc:sqlserver://YourServerNameHere;database=YourDatabaseNameHere".
9. Add a new argument with the value of "--username".
10. Add a new argument with the value of "YourSQLServerNamedUserNameHere".
11. Add a new argument with the value of "--password".
12. Add a new argument with the value of "--query".
13. Add a new argument with the value of "Select * from OptionalDBNameHere.SchemaNameHere.TableNameHere Where $CONDITIONS".
14. Add a new argument with the value of "--delete-target-dir".
15. Add a new argument with the value of "--target-dir".
16. Add a new argument with the value of "hdfs://FDQServerName:PortNumber8020IsDefault/User/full/path/to/where/you/want/the/csv/file/placed/in/hdfs/NewFolderForThisTableHere" -- the last folder will be deleted and re-created each time you run the Sqoop job.
17. Add a new argument with the value of "--num-mappers".
18. Add a new argument with the value of "1".
Important:
A. The "Where $CONDITIONS" clause is critical to have at the end of the SQL Select statement in step 13. The job will not run without it.
B. This uses a SQL Server named user account with access to the database server, database, and table you want to Sqoop.
C. Entering arguments like this is required if your named user does not have its default schema set to "dbo", or if the schema of your table is not the default schema for the database and user.
D. Make sure the SQL Server JDBC driver is placed correctly in your installation. For my particular version of Cloudera the location is "/opt/cloudera/parcels/CDH-5.11.0-1.cdh5.11.0.p0.34/lib/sqoop/lib/sqljdbc41.jar", but you may also try putting it in "/var/lib/oozie" or "/var/lib/sqoop"; I am not sure either of those works on its own.
E. I have not been successful at replacing the hard-coded UserName and Password arguments with values from a job.properties file. I believe it is possible, but I have been unable to find anyone who can clearly show how to do it, and days of brute-force trial and error have been unsuccessful.
Here are screenshots showing what this looks like when done: SqoopCommandAsArguments and SqoopCommandAsArgumentsSuccess.
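For reference, configuring the action this way in Hue corresponds roughly to an Oozie Sqoop action like the sketch below. This is not the actual generated XML: the node names, the ${jobTracker}/${nameNode} properties, the shortened target-dir path, and the password value (which the steps imply but do not list) are placeholders.
<action name="sqoop-import">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<arg>import</arg>
<arg>--connect</arg>
<arg>jdbc:sqlserver://YourServerNameHere;database=YourDatabaseNameHere</arg>
<arg>--username</arg>
<arg>YourSQLServerNamedUserNameHere</arg>
<arg>--password</arg>
<arg>YourSQLServerNamedUserPasswordHere</arg>
<arg>--query</arg>
<arg>Select * from OptionalDBNameHere.SchemaNameHere.TableNameHere Where $CONDITIONS</arg>
<arg>--delete-target-dir</arg>
<arg>--target-dir</arg>
<arg>hdfs://FDQServerName:8020/User/full/path/to/NewFolderForThisTableHere</arg>
<arg>--num-mappers</arg>
<arg>1</arg>
</sqoop>
<ok to="end"/>
<error to="fail"/>
</action>
Each <arg> element maps to one of the arguments added in the Hue editor, one per element, which is the same split-the-flag-and-value rule seen in the hive2 question above.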

Hive issue using yarn

I am running Hive SQL on YARN, and it throws an error with a join condition. I am able to create external as well as internal tables, but it fails to create a table when I use a CREATE TABLE ... AS SELECT statement such as:
create table <new_table> AS SELECT name from student;
When running the same query through the Hive CLI it works fine, but with a Spring job it throws this error:
2016-03-28 04:26:50,692 [Thread-17] WARN
org.apache.hadoop.hive.shims.HadoopShimsSecure - Can't fetch tasklog:
TaskLogServlet is not supported in MR2 mode.
Task with the most failures(4):
-----
Task ID:
task_1458863269455_90083_m_000638
-----
Diagnostic Messages for this Task:
AttemptID:attempt_1458863269455_90083_m_000638_3 Timed out after 1 secs
2016-03-28 04:26:50,842 [main] INFO
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Killed application
application_1458863269455_90083
2016-03-28 04:26:50,849 [main] ERROR com.mapr.fs.MapRFileSystem - Failed to
delete path maprfs:/home/pro/amit/warehouse/scratdir/hive_2016-03-28_04-
24-32_038_8553676376881087939-1/_task_tmp.-mr-10003, error: No such file or
directory (2)
2016-03-28 04:26:50,852 [main] ERROR org.apache.hadoop.hive.ql.Driver -
FAILED: Execution Error, return code 2 from
As per my findings, I think there is some issue with the scratch directory (scratdir).
Kindly suggest a solution if anyone has faced the same issue.
This issue occurs if the recursive directory does not exist; Hive does not automatically create directories recursively.
Please check the existence of the directories from the root down to the child/table level.
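For example, assuming the scratch directory shown in the error log above, you could check and pre-create the full path with hadoop fs (a sketch; adjust the path to your own scratch directory):
hadoop fs -ls maprfs:/home/pro/amit/warehouse/scratdir
hadoop fs -mkdir -p maprfs:/home/pro/amit/warehouse/scratdir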
I faced a similar issue while running the below Hive query
select * from <db_name>.<internal_tbl_name> where <field_name_of_double_type> in (<list_of_double_values>) order by <list_of_order_fields> limit 10;
I performed an explain on the above statement and below was the result.
fs.FileUtil: Failed to delete file or dir [/hdfs/Hadoop_Misc_Logs/Edge01/local_scratch/<hive_username>/41289638-cd53-4d4b-88c9-3359e9ec99e2/hive_2017-05-08_04-26-36_658_6626096693992380903-1/.nfs0000000057b93e2d00001590]: it still exists.
2017-05-08 04:26:37,969 WARN [41289638-cd53-4d4b-88c9-3359e9ec99e2 main] fs.FileUtil: Failed to delete file or dir [/hdfs/Hadoop_Misc_Logs/Edge01/local_scratch/<hive_username>/41289638-cd53-4d4b-88c9-3359e9ec99e2/hive_2017-05-08_04-26-36_658_6626096693992380903-1/.nfs0000000057b93e2700001591]: it still exists.
Time taken: 0.886 seconds, Fetched: 24 row(s)
And I checked the logs with
yarn logs -applicationId application_1458863269455_90083
The error happened after a MapR upgrade by the admin team. It is probably due to some upgrade or installation issue with the Tez configuration (as suggested by line 873 in the log, quoted below), or possibly the Hive query syntactically does not support the Tez optimization. I say so because another Hive query on an external table runs fine in my case. I have to check a bit deeper, though.
Though I am not sure, the error line in the logs that looks most relevant is the following:
2017-05-08 00:01:47,873 [ERROR] [main] |web.WebUIService|: Tez UI History URL is not set
Solution:
It is probably happening due to some open files or applications that are using some resources. Please check https://unix.stackexchange.com/questions/11238/how-to-get-over-device-or-resource-busy
1. Run explain <your_Hive_statement>.
2. In the resulting execution plan, you can come across the file names/directories that the Hive execution engine fails to delete, e.g.:
2017-05-08 04:26:37,969 WARN [41289638-cd53-4d4b-88c9-3359e9ec99e2 main] fs.FileUtil: Failed to delete file or dir [/hdfs/Hadoop_Misc_Logs/Edge01/local_scratch/<hive_username>/41289638-cd53-4d4b-88c9-3359e9ec99e2/hive_2017-05-08_04-26-36_658_6626096693992380903-1/.nfs0000000057b93e2d00001590]: it still exists.
3. Go to the path given in step 2, e.g. /hdfs/Hadoop_Misc_Logs/Edge01/local_scratch/<hive_username>/41289638-cd53-4d4b-88c9-3359e9ec99e2/hive_2017-05-08_04-26-36_658_6626096693992380903-1/
4. At that path, doing ls -a or lsof +D /path will show the open process IDs blocking the files from being deleted.
5. If you run ps -ef | grep <pid>, you get:
hive_username <pid> 19463 1 05:19 pts/8 00:00:35 /opt/mapr/tools/jdk1.7.0_51/jre/bin/java -Xmx256m -Dhiveserver2.auth=PAM -Dhiveserver2.authentication.pam.services=login -Dmapr_sec_enabled=true -Dhadoop.login=maprsasl -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/mapr/hadoop/hadoop-2.7.0/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/mapr/hadoop/hadoop-2.7.0 -Dhadoop.id.str=hive_username -Dhadoop.root.logger=INFO,console -Djava.library.path=/opt/mapr/hadoop/hadoop-2.7.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xmx512m -Dlog4j.configurationFile=hive-log4j2.properties -Dlog4j.configurationFile=hive-log4j2.properties -Djava.util.logging.config.file=/opt/mapr/hive/hive-2.1/bin/../conf/parquet-logging.properties -Dhadoop.security.logger=INFO,NullAppender -Djava.security.auth.login.config=/opt/mapr/conf/mapr.login.conf -Dzookeeper.saslprovider=com.mapr.security.maprsasl.MaprSaslProvider -Djavax.net.ssl.trustStore=/opt/mapr/conf/ssl_truststore org.apache.hadoop.util.RunJar /opt/mapr/hive/hive-2.1//lib/hive-cli-2.1.1-mapr-1703.jar org.apache.hadoop.hive.cli.CliDriver
CONCLUSION:
The HiveCliDriver entry clearly shows that running "Hive on Spark" (or managed) tables through the Hive CLI is not supported any more from Hive 2.0 onwards and is going to be deprecated going forward. You have to use HiveContext in Spark for running Hive queries. But you can still run queries on Hive external tables through the Hive CLI.

Oozie Workflow and Coordinator

I have two properties files, one for the workflow and one for the coordinator:
./job.properties and ./coordinator/job.properties
The two files are identical, except that the coordinator file sets a few additional variables. Below are those variables:
coordstartTime=2013-04-08T18:40Z
coordendTime=2020-04-08T18:40Z
coordTimeZone=GMT
oozie.coord.application.path=${workflowRoot}/coordinator
wfPath=${workflowRoot}/workflow-master.xml
Everything is fine when I run the workflow, but I get an error when I run the coordinator:
Error: E0301 : E0301: Invalid resource [filename]
That filename exists, and when I do hadoop fs -ls [filename] it is listed.
What am I doing wrong here?
thanks
The problem was that both oozie.wf.application.path and oozie.coord.application.path existed in the coordinator properties file.
I removed oozie.wf.application.path and the coordinator worked.
thanks
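For reference, the coordinator's job.properties then only needs the coordinator application path; a sketch built from the variables in the question (workflowRoot and the other workflow variables stay as they are elsewhere in the file):
coordstartTime=2013-04-08T18:40Z
coordendTime=2020-04-08T18:40Z
coordTimeZone=GMT
wfPath=${workflowRoot}/workflow-master.xml
oozie.coord.application.path=${workflowRoot}/coordinator
The key point is that oozie.wf.application.path does not appear here at all.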

Using Oozie workflow and coordinator - E0302: Invalid parameter error

I'm trying to run a workflow using a coordinator, but when I try to set the workflow and coordinator XML file paths together, I get an error.
This is what my job.properties file looks like:
nameNode=hdfs://10.74.6.155:9000
jobTracker=10.74.6.155:9010
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/examples/apps/test/
oozie.coord.application.path=${nameNode}/user/${user.name}/examples/apps/test/
When I run my workflow with the command line:
bin\oozie job -oozie http://localhost:11000/oozie -config examples\apps\test\job.properties -run
I get the following error:
Error: E0302 : E0302: Invalid parameter [{0}]
What am I doing wrong?
Thanks!
The workflow and coordinator application paths cannot both exist in job.properties at the same time; you can run a job either as a workflow or as a coordinator.
Use only your coordinator path in your properties file, and use your workflow path in the coordinator.xml file:
oozie.use.system.libpath=true
workflowpath=${nameNode}/user/${user.name}/examples/apps/test/
oozie.coord.application.path=${nameNode}/user/${user.name}/examples/apps/test/
In your coordinator.xml file, add this line:
<app-path>${workflowpath}</app-path>
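For context, <app-path> sits inside the <workflow> element of the coordinator's <action>. A minimal coordinator.xml sketch (the app name, frequency, start/end times, and timezone here are placeholders, not values from the question):
<coordinator-app name="test-coord" frequency="${coord:days(1)}"
                 start="2021-01-01T00:00Z" end="2022-01-01T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <app-path>${workflowpath}</app-path>
    </workflow>
  </action>
</coordinator-app>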
