How to run sqoop and spark streaming jobs together - spark-streaming

I have a problem running Sqoop and Spark Streaming jobs together.
When I start a Spark Streaming job and then a Sqoop job, the Sqoop job stays in the "accepted" state and never starts. However, after I kill the Spark job, the Sqoop job runs properly.
I really don't know what the problem is.

Related

Sqoop command not found when running through Oozie

When I run a Sqoop script from the CLI, it runs fine without any issue. But when I run it using Oozie, it fails with "Sqoop command not found". It seems Sqoop is not installed on the other data nodes. So, to run a Sqoop script using Oozie, does Sqoop have to be installed on all data nodes, or is there an alternative? Currently we have one master and 2 data nodes.

Can sqoop run without hadoop?

Just wondering: can Sqoop run without a Hadoop cluster, in a sort of standalone mode? If anyone has tried to run Sqoop on Spark, please share your experience.
To run Sqoop commands (both Sqoop 1 and Sqoop 2), Hadoop is a mandatory prerequisite. You cannot run Sqoop commands without the Hadoop libraries.
Sqoop also works in local mode, so the Hadoop daemons do not have to be running. To run Sqoop in local mode:
sqoop [tool-name] -fs local -jt local [tool-arguments]
Sqoop on Spark is still in progress. See SQOOP-1532.
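As a rough illustration of the local-mode invocation quoted above, the following Python sketch assembles the argument list, with the generic -fs and -jt options placed between the tool name and the tool's own arguments. The helper name build_local_sqoop_command and the JDBC URL are hypothetical and only serve the example; the sketch builds the command line and does not run Sqoop.

```python
import shlex


def build_local_sqoop_command(tool_name, tool_args=()):
    """Return the argv list for running a Sqoop tool in local mode."""
    # The generic Hadoop options (-fs, -jt) come right after the tool name
    # and before any tool-specific arguments.
    return ["sqoop", tool_name, "-fs", "local", "-jt", "local", *tool_args]


if __name__ == "__main__":
    # Hypothetical JDBC URL, used for illustration only.
    argv = build_local_sqoop_command(
        "list-tables", ["--connect", "jdbc:mysql://localhost/testdb"]
    )
    # Print the command as a single shell-quoted string.
    print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run on a machine where the sqoop launcher and the Hadoop libraries are available.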

hadoop - How to kill a TEZ job started by hive?

Below is what I have found so far. The problem is that if we reuse a JDBC Hive session, all the Hive queries run under the same Application-Id. Is there a way to kill a single DAG?
Tez jobs can be listed using: yarn application -list
Tez jobs can be killed using: yarn application -kill Application-Id
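A minimal sketch of scripting the two yarn commands above from Python. The function names and the id check are assumptions made for illustration; the pattern follows ids of the form application_1480590511892_0007. Note that this kills the whole YARN application, so it does not answer the question of killing one DAG inside a shared session.

```python
import re
import subprocess

# YARN application ids look like application_<cluster timestamp>_<sequence>.
APP_ID_PATTERN = re.compile(r"application_\d+_\d+")


def list_command():
    """Return the argv list that lists YARN applications."""
    return ["yarn", "application", "-list"]


def kill_command(application_id):
    """Return the argv list that kills one YARN application."""
    # Reject anything that does not look like an application id
    # before it reaches the command line.
    if not APP_ID_PATTERN.fullmatch(application_id):
        raise ValueError(f"not a YARN application id: {application_id!r}")
    return ["yarn", "application", "-kill", application_id]


def run(argv):
    """Run a yarn command and return its stdout (needs yarn on PATH)."""
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout
```

For example, run(list_command()) would return the listing as text, and run(kill_command(app_id)) would stop the application with the given id.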

Apache Zeppelin running on Spark Cluster and YARN

I have created and run a %pyspark program in Apache Zeppelin running on a Spark cluster with yarn-client. The program reads a file from HDFS into a DataFrame, does a simple groupby, and prints the output successfully. I am using Zeppelin version 0.6.2 and Spark 2.0.0.
I can see the job running in YARN (see application_1480590511892_0007):
But when I check the Spark UI at the same time there is nothing at all for this job:
Question 1: Shouldn't this job appear in both of these windows?
Also, the completed applications in the Spark UI image just above were Zeppelin jobs with the %python interpreter that simply initialize a SparkSession and then stop it:
1st Zeppelin block:
%python
from pyspark.sql import SparkSession
from pyspark.sql import Row
import collections
spark = SparkSession.builder.appName("SparkSQL").getOrCreate()
2nd Zeppelin block:
%python
spark.stop()
Question 2: This job, in turn, has not appeared in the YARN UI. Does a job appearing in the Spark UI mean that it is running with the Spark resource manager?
Any insights for these questions are highly appreciated.
Zeppelin runs a continuous Spark application once the interpreter is first used. All the paragraphs will run in this one application. In your second paragraph you are stopping the SparkSession (spark.stop), so that would kill the application that was created when the interpreter was first used. So you can just see the jobs under the Completed Applications section. If you remove the spark.stop, you should see the job listed under Running Applications.

Pig job hangs on first failure

I'm encountering a problem with Pig and Oozie.
I have a Pig script that tries to read data from a non-existent table, so an exception is thrown in the initialize method of RecordReader. That is expected, since the table definitely doesn't exist.
The problem starts when such a script is launched via Oozie on a multi-node Hadoop cluster: after the first attempt the job just hangs and does nothing until another job is submitted to the cluster.
If launched from the command line (pig -f test.pig), it doesn't hang. It also doesn't hang if launched in local mode or on a single-node cluster (from the command line or via Oozie).
I really hope someone had a problem like this and can help me.
