Running a spark2 job through an Oozie shell action? - shell

As the title says, I'm trying to run a shell action that kicks off a Spark job, but unfortunately I'm consistently getting the following error:
19/05/10 14:03:39 ERROR AbstractRpcClient: SASL authentication failed.
The most likely cause is missing or invalid credentials. Consider
'kinit'.
javax.security.sasl.SaslException: GSS initiate failed [Caused by
GSSException: No valid credentials provided (Mechanism level: Failed
to find any Kerberos tgt)]
java.io.IOException: Could not set up IO Streams to
<hbaseregionserver>
Fri May 10 14:03:39 BST 2019,
RpcRetryingCaller{globalStartTime=1557493419339, pause=100,
retries=2}, org.apache.hadoop.hbase.ipc.FailedServerException: This
server is in the failed servers list: <hbaseregionserver>
I've been playing around trying to get the script to take in the Kerberos ticket, but with no luck. As far as I can tell, the Oozie job is not able to pass the Kerberos ticket through to the shell action. Any ideas why it's not being picked up? I'm at a loss. The related code is below.
Oozie workflow action
<action name="sparkJ" cred="hive2Cred">
    <shell xmlns="uri:oozie:shell-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${oozieQueueName}</value>
            </property>
        </configuration>
        <exec>run.sh</exec>
        <file>/thePathToTheScript/run.sh#run.sh</file>
        <file>/thePathToTheProperties/myp.properties#myp.properties</file>
        <capture-output />
    </shell>
    <ok to="end" />
    <error to="fail" />
</action>
Shell script
#!/bin/sh
export job_name=SPARK_JOB
export configuration=myp.properties
export num_executors=10
export executor_memory=1G
export queue=YARNQ
export max_executors=50
kinit -kt KEYTAB KPRINCIPAL
echo "[[[[[[[[[[[[[ Starting Job - name:${job_name}, configuration:${configuration} ]]]]]]]]]]]]]]"
/usr/hdp/current/spark2-client/bin/spark-submit \
--name ${job_name} \
--driver-java-options "-Dlog4j.configuration=file:./log4j.properties" \
--num-executors ${num_executors} \
--executor-memory ${executor_memory} \
--master yarn \
--keytab KEYTAB \
--principal KPRINCIPAL \
--supervise \
--deploy-mode cluster \
--queue ${queue} \
--files "./${configuration},./hbase-site.xml,./log4j.properties" \
--conf spark.driver.extraClassPath="/usr/hdp/current/hive-client/lib/datanucleus-*.jar:/usr/hdp/current/tez-client/*.jar" \
--conf spark.executor.extraJavaOptions="-Djava.security.auth.login.config=./jaas.conf -Dlog4j.configuration=file:./log4j.properties" \
--conf spark.executor.extraClassPath="/usr/hdp/current/hive-client/lib/datanucleus-*.jar:/usr/hdp/current/tez-client/*.jar" \
--conf spark.streaming.stopGracefullyOnShutdown=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.maxExecutors=${max_executors} \
--conf spark.streaming.concurrentJobs=2 \
--conf spark.streaming.backpressure.enabled=true \
--conf spark.yarn.security.tokens.hive.enabled=true \
--conf spark.yarn.security.tokens.hbase.enabled=true \
--conf spark.streaming.kafka.maxRatePerPartition=5000 \
--conf spark.streaming.backpressure.pid.maxRate=3000 \
--conf spark.streaming.backpressure.pid.minRate=200 \
--conf spark.streaming.backpressure.initialRate=5000 \
--jars /usr/hdp/current/hbase-client/lib/guava-12.0.1.jar,/usr/hdp/current/hbase-client/lib/hbase-common.jar,/usr/hdp/current/hbase-client/lib/hbase-client.jar,/usr/hdp/current/hbase-client/lib/hbase-protocol.jar,/usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar \
--class myclass myjar.jar ./${configuration}
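Since the launcher container is where kinit runs, it helps to confirm there that a TGT was actually obtained before spark-submit fires. A minimal sketch of such a guard (KEYTAB and KPRINCIPAL are the placeholders from the script above, and the wrapper name is illustrative, not part of the original script):

```shell
#!/bin/sh
# Illustrative guard: wrap the keytab login so the Oozie launcher log
# records whether a TGT was actually obtained. KEYTAB/KPRINCIPAL are
# placeholders, as in the script above.
acquire_tgt() {
  keytab=$1
  principal=$2
  if ! kinit -kt "$keytab" "$principal"; then
    echo "kinit failed for $principal - launcher container has no TGT" >&2
    return 1
  fi
  # Log the ticket cache so the launcher stdout shows who we authenticated as
  klist
}
```

Calling `acquire_tgt "$KEYTAB" "$KPRINCIPAL" || exit 1` before spark-submit makes a missing or unreadable keytab fail loudly in the launcher log, instead of surfacing later as a SASL error from the HBase region server.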
Many thanks for any help you can provide.

Related

Cloudera Docker keeps retrying to connect to 8032

I use the Docker image from Cloudera, but the configuration does not seem quite right, because when I do this:
hadoop jar /usr/lib/hadoop*/contrib/streaming/hadoop-streaming*cdh*.jar \
-mapper mapper -reducer reducer \
-file mapper -file reducer \
-input input -output output
I get this every time:
18/03/14 02:34:33 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
This is what I did before running the process above:
Increase Docker memory to 8 GB.
Start the container by running this on the host:
docker run -p 7180:7180 \
--hostname=quickstart.cloudera --privileged=true \
-t -i cloudera/quickstart:latest \
/usr/bin/docker-quickstart
Start the manager
/home/cloudera/cloudera-manager --express
Open Cloudera Manager to start HDFS
Upload sample input into HDFS
You need to use Cloudera Manager to start not just HDFS but also YARN; the ResourceManager it keeps retrying on port 8032 is a YARN daemon.
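Since the retries target the ResourceManager port, a quick reachability probe before submitting saves waiting through ten retries. A sketch, assuming `nc` is available in the container (quickstart.cloudera:8032 are the QuickStart VM defaults; the function name is illustrative):

```shell
#!/bin/sh
# Probe the YARN ResourceManager port before submitting a job.
# quickstart.cloudera:8032 are the QuickStart defaults; adjust as needed.
check_rm() {
  host=${1:-quickstart.cloudera}
  port=${2:-8032}
  if nc -z "$host" "$port" 2>/dev/null; then
    echo "ResourceManager reachable at $host:$port"
  else
    echo "$host:$port is closed - start YARN from Cloudera Manager first"
    return 1
  fi
}
```

Running `check_rm` before `hadoop jar ...` turns ten minutes of silent retries into an immediate, explicit failure message.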

sqoop import-all-tables : not working

CDH version = 5.5.0-0
Hive process is up & running - No issues
I am trying to import tables from MySQL into Hive using the script below. The tables are not being imported into Hive. Can you please help me find the issue, or am I missing something?
sqoop import-all-tables \
--num-mappers 1 \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=reatil_dba \
--password=cloudera \
--hive-import \
--hive-overwrite \
--create-hive-table \
--compress \
--compresession-codec org.apache.hadoop.io.compress.SnappyCodec \
--outdir java_files
ERROR:
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
16/10/12 06:36:21 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.5.0
16/10/12 06:36:21 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
16/10/12 06:36:21 ERROR tool.BaseSqoopTool: Error parsing arguments for import-all-tables:
16/10/12 06:36:21 ERROR tool.BaseSqoopTool: Unrecognized argument: --compresession-codec
16/10/12 06:36:21 ERROR tool.BaseSqoopTool: Unrecognized argument: org.apache.hadoop.io.compress.SnappyCodec
16/10/12 06:36:21 ERROR tool.BaseSqoopTool: Unrecognized argument: --outdir
16/10/12 06:36:21 ERROR tool.BaseSqoopTool: Unrecognized argument: java_files
There is a typo: the argument name should be --compression-codec instead of --compresession-codec.
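This class of typo is easy to catch before submitting by checking the command's flags against a known list. A throwaway sketch (the allow-list below is a small illustrative subset covering only the flags in this question, not an authoritative list of Sqoop's options):

```shell
#!/bin/sh
# Compare the flags of a sqoop invocation against a small allow-list.
# The list is illustrative only - it covers just the flags in this question.
known_flags="--num-mappers --connect --username --password --hive-import \
--hive-overwrite --create-hive-table --compress --compression-codec --outdir"

check_flags() {
  for tok in "$@"; do
    case "$tok" in
      --*)
        echo "$known_flags" | grep -qw -- "$tok" || echo "unknown flag: $tok"
        ;;
    esac
  done
}
```

For example, `check_flags --compress --compresession-codec` flags the misspelled option while letting the valid ones pass, which is faster than waiting for Sqoop's own argument parser to reject the whole command.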

Apache Spark's deployment issue (cluster-mode) with Hive

EDIT:
I'm developing a Spark application that reads data from multiple structured schemas, and I'm trying to aggregate the information across those schemas. My application runs well locally, but when I run it on a cluster I have trouble with the configuration (most probably with hive-site.xml) or with the submit-command arguments. I've looked at other related posts but couldn't find a solution SPECIFIC to my scenario. I've detailed below which commands I tried and which errors I got. I'm new to Spark and might be missing something trivial, but I can provide more information to support my question.
Original Question:
I've been trying to run my Spark application on a 6-node Hadoop cluster bundled with HDP 2.3 components.
Here is component information that might be useful in suggesting solutions:
Cluster information: 6-node cluster:
128 GB RAM
24 cores
8 TB HDD
Components used in the application
HDP - 2.3
Spark - 1.3.1
$ hadoop version:
Hadoop 2.7.1.2.3.0.0-2557
Subversion git@github.com:hortonworks/hadoop.git -r 9f17d40a0f2046d217b2bff90ad6e2fc7e41f5e1
Compiled by jenkins on 2015-07-14T13:08Z
Compiled with protoc 2.5.0
From source with checksum 54f9bbb4492f92975e84e390599b881d
Scenario:
I'm trying to use SparkContext and HiveContext together to take full advantage of Spark's real-time querying over its data structures, such as DataFrames. The dependencies used in my application are:
<dependency> <!-- Spark dependency -->
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.3.1</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>1.3.1</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.10</artifactId>
    <version>1.3.1</version>
</dependency>
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-csv_2.10</artifactId>
    <version>1.4.0</version>
</dependency>
Below are the submit commands and the corresponding error logs I'm getting:
Submit Command1:
spark-submit --class working.path.to.Main \
--master yarn \
--deploy-mode cluster \
--num-executors 17 \
--executor-cores 8 \
--executor-memory 25g \
--driver-memory 25g \
--num-executors 5 \
application-with-all-dependencies.jar
Error Log1:
User class threw exception: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
Submit Command2:
spark-submit --class working.path.to.Main \
--master yarn \
--deploy-mode cluster \
--num-executors 17 \
--executor-cores 8 \
--executor-memory 25g \
--driver-memory 25g \
--num-executors 5 \
--files /etc/hive/conf/hive-site.xml \
application-with-all-dependencies.jar
Error Log2:
User class threw exception: java.lang.NumberFormatException: For input string: "5s"
Since I don't have administrative permissions, I cannot modify the configuration myself. I could contact the IT engineers and have the changes made, but I'm looking for a solution that involves as few configuration-file changes as possible!
Configuration changes were suggested here.
Then I tried passing various jar files as arguments as suggested in other discussion forums.
Submit Command3:
spark-submit --class working.path.to.Main \
--master yarn \
--deploy-mode cluster \
--num-executors 17 \
--executor-cores 8 \
--executor-memory 25g \
--driver-memory 25g \
--num-executors 5 \
--jars /usr/hdp/2.3.0.0-2557/spark/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/2.3.0.0-2557/spark/lib/datanucleus-core-3.2.10.jar,/usr/hdp/2.3.0.0-2557/spark/lib/datanucleus-rdbms-3.2.9.jar \
--files /etc/hive/conf/hive-site.xml \
application-with-all-dependencies.jar
Error Log3:
User class threw exception: java.lang.NumberFormatException: For input string: "5s"
I didn't understand what happened with the following command and couldn't analyze the error log.
Submit Command4:
spark-submit --class working.path.to.Main \
--master yarn \
--deploy-mode cluster \
--num-executors 17 \
--executor-cores 8 \
--executor-memory 25g \
--driver-memory 25g \
--num-executors 5 \
--jars /usr/hdp/2.3.0.0-2557/spark/lib/*.jar \
--files /etc/hive/conf/hive-site.xml \
application-with-all-dependencies.jar
Submit Log4:
Application application_1461686223085_0014 failed 2 times due to AM Container for appattempt_1461686223085_0014_000002 exited with exitCode: 10
For more detailed output, check application tracking page:http://cluster-host:XXXX/cluster/app/application_1461686223085_0014Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e10_1461686223085_0014_02_000001
Exit code: 10
Stack trace: ExitCodeException exitCode=10:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 10
Failing this attempt. Failing the application.
Any other possible options? Any kind of help will be highly appreciated. Please let me know if you need any other information.
Thank you.
The solution explained here worked for my case. There are two locations where hive-site.xml resides, which can be confusing. Use --files /usr/hdp/current/spark-client/conf/hive-site.xml instead of --files /etc/hive/conf/hive-site.xml. I didn't have to add the jars for my configuration. I hope this helps someone struggling with a similar problem. Thanks.
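A quick way to see why the two copies behave differently is simply to diff them; a plausible explanation for the NumberFormatException above is that the /etc/hive copy contains interval values such as "5s" that Spark 1.3's bundled Hive client cannot parse, while the spark-client copy does not. A sketch (the wrapper name is illustrative; the default paths are the two locations mentioned above):

```shell
#!/bin/sh
# Diff the two hive-site.xml copies to see which properties differ.
# Default paths are the two locations mentioned in the answer above.
compare_hive_confs() {
  a=${1:-/etc/hive/conf/hive-site.xml}
  b=${2:-/usr/hdp/current/spark-client/conf/hive-site.xml}
  diff "$a" "$b"
}
```

A non-empty diff pinpoints exactly which property values (and therefore which --files choice) the driver will choke on.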

Driver not found exception while querying oracle using spark submit in yarn-client mode

I am getting the following exception while submitting job to spark:
Exception in thread "main" java.sql.SQLException:
No suitable driver found for jdbc:oracle:thin:user/pass@x.x.x.x:1521:orcl
Command used to submit job:
bin/spark-submit --class com.mycompany.app.App
--master yarn-client
--conf spark.yarn.jar=hdfs://192.168.xxx.xx:9000/spark-assembly-1.3.1-hadoop2.6.0.jar
--driver-class-path hdfs://192.168.xxx.xx:9000/ojdbc7.jar
--jars hdfs://192.168.xxx.xx:9000/ojdbc7.jar
/home/impadmin/maventest-idw/my-app/target/my-app-1.0.jar
Any pointers?
--jars does not need to be used for HDFS, HTTP, or FTP files.
You can directly use SparkContext.addJar("path of jar file") instead.

Not able to run sqoop using oozie

When I run the sqoop command below from the CLI, I am able to export data to Oracle DB, but when I run the same command through the Oozie workflow, I get errors.
Command running directly from CLI:
sqoop export --connect jdbc:oracle:thin:@192.168.245.1:1521:XE --username HR --password HR --table HR.REVIEW_FINAL --export-dir /user/cloudera/Review/hive/review_final --input-fields-terminated-by '\001'
Below is what I am using through Oozie:
<command>export --connect jdbc:oracle:thin:@192.168.245.1:1521:XE --username HR --password HR --table HR.REVIEW_FINAL --export-dir /user/cloudera/Review/hive/review_final --input-fields-terminated-by '\001'</command>
Exception which I see in the logs:
java.io.IOException: Can't export data, please check failed map task logs
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.RuntimeException: Can't parse input data:
--input-fields-terminated-by '\001'
Alter it to:
--input-fields-terminated-by \001
Good luck!
In my case, not only --input-fields-terminated-by but also --input-null-non-string needed the single quotes removed.
<action name="load-to-mysql">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <command>export --connect jdbc:mysql://xtradb1.t1.shadc.yosemitecloud.com:3306/report_test --username root -password root --table location_stats --export-dir /user/hive/warehouse/report/test_location_stats_${DATE} --input-null-string \\N --input-null-non-string \\N --input-fields-terminated-by \001 --input-lines-terminated-by '\n' --driver com.mysql.jdbc.Driver --update-mode allowinsert</command>
    </sqoop>
    <ok to="remove-tmp-data"/>
    <error to="fail"/>
</action>