DataStage: parallel jobs often get stuck

When running parallel jobs in DataStage, the jobs often get stuck.
It seems the jobs' log files or status files are locked, because when I clean up resources in Director (by clicking "log out all"), the jobs either go ahead (in a sequence job) or crash.
Thanks for any help.

Related

gcloud Dataflow Drain Command Wait Until Job Finished Draining

I'm currently building a CD pipeline that replaces an existing Google Cloud Dataflow streaming pipeline with a new one using a bash command. The old and new jobs have the same name. I wrote the bash command like this:
gcloud dataflow jobs drain "${JOB_ID}" --region asia-southeast2 && \
gcloud dataflow jobs run NAME --other-flags
The problem with this is that the first command doesn't wait until the job finishes draining, so the second command throws an error because of the duplicated job name.
Is there a way to wait until the Dataflow job finishes draining? Or is there a better way?
Thanks!
Seeing as this post hasn't garnered any attention, I will be posting my comment as a post:
Dataflow jobs are asynchronous with respect to the gcloud dataflow jobs commands, so when you use && the only thing you're waiting on is for the command itself to finish. Since that command just starts the process (be it draining a job or running one), it finishes earlier than the job/drain does.
There are a couple of ways you could wait for the job/drain to finish, both of which have some added cost:
You could use a Pub/Sub step as part of a larger Dataflow job (think of it as a parent of the jobs you are draining and running, with those jobs publishing a message to Pub/Sub about their status once it changes); you may find the cost of Pub/Sub [here].
You could set up some kind of loop that repeatedly checks the status of the job you're draining/running, likely inside a bash script (see the sketch below). That is a bit more tedious and not as neat as a listener, and it requires your own computer/connection to be maintained, or a GCE instance.
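To sketch that second option: assuming gcloud dataflow jobs describe exposes the job's currentState, a polling loop could look roughly like this (JOB_ID, the region, the 30-second interval and the final run flags are all placeholders to adjust):

#!/usr/bin/env bash
# Sketch only: drain the old job, poll until it reaches a terminal state,
# then start the replacement.
set -euo pipefail

JOB_ID="${1:?usage: $0 JOB_ID}"
REGION="asia-southeast2"

gcloud dataflow jobs drain "${JOB_ID}" --region "${REGION}"

# Keep polling while the job is still running/draining.
while true; do
  state="$(gcloud dataflow jobs describe "${JOB_ID}" \
            --region "${REGION}" \
            --format='value(currentState)')"
  echo "current state: ${state}"
  case "${state}" in
    JOB_STATE_DRAINED|JOB_STATE_DONE|JOB_STATE_CANCELLED|JOB_STATE_FAILED)
      break
      ;;
  esac
  sleep 30
done

# The old job name is free again, so the replacement can start.
gcloud dataflow jobs run NAME --other-flags

JOB_STATE_DRAINED is the state a successfully drained job ends in; the other states in the case arm just stop the loop if the job terminates some other way.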

Deleting a periodic job in Sidekiq

I am trying to delete a Sidekiq Enterprise periodic job for an app, and I'm not sure how one goes about deleting the periodic job itself after removing the schedule from the initializer and deleting the worker job.
I see this answer from earlier, but the app in question has other jobs (both periodic and regular Sidekiq jobs), so I cannot just globally blow away all scheduled periodic jobs, and I would prefer not to have to totally shut down and restart Sidekiq either. Is there a way I can remove just the specific job I am deleting from Redis so that it will no longer try to run at the next scheduled time?
You have to deploy your code change and restart Sidekiq for it to pick up periodic changes.

Hadoop / Spark: kill a task and do not retry

I would like to know if there is a way to stop a job (and leave it in a FAILED or KILLED state) when I detect something wrong within a map or a reduce task, without Hadoop retrying the task.
If possible, I would like to keep the behavior where YARN restarts the task on "normal" failures.
Currently I am throwing an exception, but Hadoop retries the task anyway.
My code is Scala/Spark, but an answer for Java/Hadoop may be useful too.
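For context, the retry behaviour seems to be governed by a couple of configuration properties. A minimal spark-submit sketch follows (the class and jar names are placeholders); note that setting these to 1 would also disable retries for "normal" failures, which is exactly what I want to avoid:

# Sketch of the retry-related knobs, not a solution.
# spark.task.maxFailures=1    -> a single task failure fails the job, no re-launch
# spark.yarn.maxAppAttempts=1 -> the YARN application itself is not re-attempted
# (the plain MapReduce equivalents are mapreduce.map.maxattempts and
#  mapreduce.reduce.maxattempts)
spark-submit \
  --master yarn \
  --class com.example.MyJob \
  --conf spark.task.maxFailures=1 \
  --conf spark.yarn.maxAppAttempts=1 \
  my-app.jar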
Thanks

YARN capacity-scheduler: parallel execution

Does the capacity scheduler in YARN run applications in parallel on the same queue for the same user?
For example: if we have two Hive CLIs in two terminals with the same user, and the same query is started in both, do they execute on the default queue in parallel or sequentially?
Currently, the UI shows one job running and one in the pending state.
Is there a way to run them in parallel?
The YARN capacity scheduler runs jobs in a FIFO manner within the same queue. For example, if both Hive CLIs submit to the default queue, whichever is able to secure resources first goes into the running state and the other waits (but only if the queue does not have enough resources for both).
If you want parallel execution:
1) You can run the other job in a different queue; you can specify the queue name when launching the job on YARN (see the sketch below).
2) You need to define queue resources so that both jobs can get the resources they need.
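A rough sketch of option 1, assuming a second queue (here called "etl") has already been defined in capacity-scheduler.xml; the query and table name are placeholders:

# Terminal 1: same user, default queue.
hive -e "SELECT count(*) FROM my_table;"

# Terminal 2: same user, but a different queue, so the two jobs can run side by side.
# mapreduce.job.queuename applies to the MR engine; with Hive on Tez use tez.queue.name instead.
hive --hiveconf mapreduce.job.queuename=etl -e "SELECT count(*) FROM my_table;"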

Difference between job, application, task, task attempt logs in Hadoop, Oozie

I'm running an Oozie job with multiple actions, and there's a part I could not make work. In the process of troubleshooting I'm overwhelmed with lots of logs.
In the YARN UI (yarn.resourcemanager.webapp.address in yarn-site.xml, normally on port 8088), there are the application_<app_id> logs.
In the Job History Server (yarn.log.server.url in yarn-site.xml, ours on port 19888), there are the job_<job_id> logs. (These job logs should also show up in Hue's Job Browser, right?)
In Hue's Oozie workflow editor, there are the task and task_attempt logs (not sure if they're the same; everything is a mixed-up soup to me already), which redirect to the Job Browser if you click here and there.
Can someone explain the difference between these things from a Hadoop/Oozie architectural standpoint?
P.S.
I've seen container_<container_id> in the logs as well. Might as well include this in your explanation in relation to the things above.
In terms of YARN, the programs that are being run on a cluster are called applications. In terms of MapReduce they are called jobs. So, if you are running MapReduce on YARN, job and application are the same thing (if you take a close look, job ids and application ids are the same).
A MapReduce job consists of several tasks (they can be either map or reduce tasks). If a task fails, it is launched again, possibly on another node; each launch is a task attempt.
A container is a YARN term for a unit of resource allocation. For example, a MapReduce task runs in a single container.
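To make the mapping concrete, here is a small sketch with made-up IDs (the timestamps and sequence numbers are placeholders, and yarn logs assumes log aggregation is enabled):

# The IDs all share the cluster timestamp and a sequence number, e.g.:
#   application_1510000000000_0042          <- YARN application
#   job_1510000000000_0042                  <- the same unit, seen as a MapReduce job
#   task_1510000000000_0042_m_000003        <- one map task of that job
#   attempt_1510000000000_0042_m_000003_0   <- first attempt of that task
#   container_1510000000000_0042_01_000005  <- container that attempt ran in

# Aggregated logs for the whole application (all containers):
yarn logs -applicationId application_1510000000000_0042

# Logs for a single container, i.e. roughly a single task attempt:
yarn logs -applicationId application_1510000000000_0042 \
          -containerId container_1510000000000_0042_01_000005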
