Hadoop jobs in deadlock with PySpark and Oozie

I am trying to run PySpark on YARN with Oozie. After submitting the workflow, there are 2 jobs in the Hadoop job queue: one is the Oozie launcher job, with application type "map reduce", and the other is the job it triggers, with application type "Spark". While the first job is running, the second job remains in "accepted" status. Here comes the problem: the first job is waiting for the second job to finish before it proceeds, and the second job is waiting for the first one to finish before it can run, so I seem to be stuck in a deadlock. How can I get out of this? Is there any way for a Hadoop job with application type "map reduce" to run in parallel with jobs of other application types?
Any advice is appreciated, thanks!

Please check the value of the property below in the YARN scheduler configuration. I guess you need to increase it to something like 0.9 or so.
Property: yarn.scheduler.capacity.maximum-am-resource-percent
You would need to restart YARN, MapReduce and Oozie after updating the property.
More info: Setting Application Limits.
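For reference, this is roughly how that property would look in capacity-scheduler.xml, assuming the Capacity Scheduler is in use (the 0.9 value is only the suggestion above, not a universal recommendation):

<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <!-- maximum fraction of cluster resources that ApplicationMasters may use;
       a higher value lets the Oozie launcher AM and the Spark AM run at the same time -->
  <value>0.9</value>
</property>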

Related

Apache Spark job completes but Hadoop job still running

I'm running a large Spark job (about 20TB in, stored to HDFS) alongside Hadoop. The Spark console shows the job as complete, but Hadoop still thinks the job is running, both in its console and in the logs, which keep reporting 'running'.
How long should I wait before I start worrying?
You can try to stop the Spark context cleanly. If you haven't closed it, add a SparkContext stop() call at the end of the job. For example:
sc.stop()
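For instance, a minimal driver sketch (names here are illustrative, and it assumes a Java driver) that stops the context even if the job fails:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class JobDriver {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("large-hdfs-job"); // app name is illustrative
        JavaSparkContext sc = new JavaSparkContext(conf);
        try {
            // ... read the ~20 TB input, transform it, write the results to HDFS ...
        } finally {
            sc.stop(); // without this, YARN may keep reporting the application as running
        }
    }
}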

Spark Launcher jobs not starting because token can't be found in cache after 24 hours

I have a Java application which runs continuously and checks a table in a database for new records. When a new record is added to the table, the Java application unzips a file and puts it into an HDFS location, and then a Spark job gets triggered (I am programmatically triggering the Spark job using the "SparkLauncher" class inside the Java application), which does the processing for the newly added file in the HDFS location.
I have scheduled the Java application on the cluster using an Oozie Java action.
The cluster is a kerberized HDP cluster.
The job works perfectly fine for 24 hours: the unzip happens and the Spark job runs.
But after 24 hours the unzip still happens in the Java application, while the Spark job does not get triggered in the Resource Manager.
Exception : Exception encountered while connecting to the server :INFO: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (owner=****, renewer=oozie mr token, realUser=oozie, issueDate=1498798762481, maxDate=1499403562481, sequenceNumber=36550, masterKeyId=619) can't be found in cache
As per my understanding, after 24 hours Oozie is renewing the token, and that token is not getting updated for the SparkLauncher job. The SparkLauncher is still looking for the older token, which is no longer available in the cache.
Please help me understand how I can make SparkLauncher pick up the new token.
"As per my understanding, after 24 hours oozie is renewing the token"
Why? Can you point to any documentation, source code, or blog post?
Remember that Oozie is a scheduler for batch jobs, and its canonical use case (at Yahoo!) is for triggering hourly jobs.
Only a pathological batch job would run for more than 24h; therefore, renewal of the Hadoop delegation token is not really useful in Oozie.
But your Java thing acts as a service, running continuously, and needing automatic restart if it ever crashes. So you should consider...
either Slider, if you really want to run it inside YARN (although there are many, many drawbacks -- how do you inspect the logs of a running YARN job? how can you make sure that the app starts on time and is not delayed by a lack of resources? how can you make sure that your app will not be killed because YARN needs resources for a high-priority job?), but it is probably overkill for simply running your toy app
or a plain Linux service running on some Edge Node -- it's a Do-It-Yourself task, but not extremely complicated, and there are tutorials on the web
If you insist on using Oozie, in spite of all the limitations of both YARN and Oozie, then you have to change the way your app runs -- for instance, schedule the Coordinator to launch a job every 12h and pass the "nominal time" as a Workflow property, edit the Workflow to pass that time to the Java app, and edit the Java code so that the app exits at (arg + 11:58) and clears the way for the next exec.
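A rough sketch of that last approach, assuming the Workflow passes the Coordinator's nominal time to the Java action as its first argument (the epoch-millis format, class name and polling interval are all illustrative):

import java.time.Duration;
import java.time.Instant;

public class PollingService {
    public static void main(String[] args) throws InterruptedException {
        // nominal time handed down from the Coordinator via the Workflow (hypothetical format)
        Instant nominalTime = Instant.ofEpochMilli(Long.parseLong(args[0]));
        // stop just before the next 12h run is due, as suggested above (arg + 11:58)
        Instant deadline = nominalTime.plus(Duration.ofHours(11)).plus(Duration.ofMinutes(58));

        while (Instant.now().isBefore(deadline)) {
            // ... poll the database table, unzip new files into HDFS, launch the Spark job ...
            Thread.sleep(60_000); // check once a minute (illustrative)
        }
        // exit cleanly so the next Coordinator run takes over with a fresh Oozie token
    }
}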

Difference between job, application, task, task attempt logs in Hadoop, Oozie

I'm running an Oozie job with multiple actions and there's a part I could not make work. In the process of troubleshooting, I'm overwhelmed with lots of logs.
In the YARN UI (yarn.resourcemanager.webapp.address in yarn-site.xml, normally on port 8088), there are the application_<app_id> logs.
In the Job History Server (yarn.log.server.url in yarn-site.xml, ours on port 19888), there are the job_<job_id> logs. (These job logs should also show up in Hue's Job Browser, right?)
In Hue's Oozie workflow editor, there are the task and task_attempt logs (not sure if they're the same; everything's a mixed-up soup to me already), which redirect to the Job Browser if you click here and there.
Can someone explain what's the difference between these things from Hadoop/Oozie architectural standpoint?
P.S.
I've also seen container_<container_id> in the logs. Might as well include this in your explanation, in relation to the things above.
In terms of YARN, the programs that are being run on a cluster are called applications. In terms of MapReduce they are called jobs. So, if you are running MapReduce on YARN, job and application are the same thing (if you take a close look, job ids and application ids are the same).
A MapReduce job consists of several tasks (they can be either map or reduce tasks). If a task fails, it is launched again on another node; each such launch is a task attempt.
A container is a YARN term: it is a unit of resource allocation. For example, a MapReduce task would run in a single container.
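For illustration, here is how the different IDs line up for one hypothetical run (the numbers are made up, but the prefixes and structure follow the standard Hadoop formats):

application_1499666666666_0042          - the YARN application (ResourceManager UI)
job_1499666666666_0042                  - the same run seen as a MapReduce job (Job History Server / Hue Job Browser)
task_1499666666666_0042_m_000003        - one map task of that job
attempt_1499666666666_0042_m_000003_0   - the first attempt at running that task
container_1499666666666_0042_01_000007  - the YARN container that attempt ran in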

YARN Queue Can't Run More Than One Spark Job

I can run several jobs (MapReduce, Hive) in one queue. But if I run a Spark/Spark Streaming job, every job added after that stays in the ACCEPTED state and never goes to RUNNING. Only after I kill the Spark job do the other jobs start RUNNING.
I tried creating separate queues for Spark and non-Spark jobs, and that works as expected, but it is not what I want.
My questions:
1. Is this a YARN or a Spark config issue?
2. What is the right config to solve that issue?
Any help will be appreciated, thanks.

How to chain a mapred and a mapreduce job

I have two Hadoop jobs that need to be chained together. One is a mapred job (old API), the other a mapreduce job (new API); this is because of the external libraries we use for these two jobs.
I want to know whether there is a good way to chain these two jobs.
I have tried one way: first run the mapred job with JobClient.runJob(), and after it finishes, run the second one. But there is a problem when I submit this to the Hadoop cluster: if I close my local terminal, only the first job runs and the second one never does, because the driver Java code is running locally. Is there a good solution for this, so that I can just submit the whole chain to the cluster and the local program does not need to keep running?
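For reference, a minimal sketch of the chaining approach described above (class and job names are hypothetical); both calls block until the corresponding job finishes, which is exactly why the driver JVM has to stay alive somewhere:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        // First job: old mapred API, submitted and waited on by JobClient.runJob()
        JobConf firstJob = new JobConf(new Configuration(), ChainDriver.class);
        // ... configure mapper/reducer/input/output of the first job here ...
        JobClient.runJob(firstJob); // blocks until the first job completes

        // Second job: new mapreduce API, started only after the first one has finished
        Job secondJob = Job.getInstance(new Configuration(), "second-job"); // name is illustrative
        // ... configure mapper/reducer/input/output of the second job here ...
        System.exit(secondJob.waitForCompletion(true) ? 0 : 1);
    }
}

One common way to avoid keeping a local terminal open is to run a driver like this from an Oozie Java action, or under nohup on an edge node, so the chaining logic itself survives the disconnect.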
