Oozie not cleaning up old jobs from Oozie database - hadoop

I have the following properties set in my oozie-site.xml (using the safety valve in Cloudera Manager):
oozie.services.ext - org.apache.oozie.service.PurgeService
oozie.service.PurgeService.older.than - 15
oozie.service.PurgeService.coord.older.than - 7
oozie.service.PurgeService.bundle.older.than - 7
oozie.service.PurgeService.purge.interval - 60
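For reference, a sketch of how those entries might look as the XML snippet pasted into the Oozie safety valve in Cloudera Manager (this simply restates the properties above; per the Oozie documentation the older.than values are in days and purge.interval is in seconds):

    <property>
      <name>oozie.services.ext</name>
      <value>org.apache.oozie.service.PurgeService</value>
    </property>
    <property>
      <name>oozie.service.PurgeService.older.than</name>
      <value>15</value>
    </property>
    <property>
      <name>oozie.service.PurgeService.coord.older.than</name>
      <value>7</value>
    </property>
    <property>
      <name>oozie.service.PurgeService.bundle.older.than</name>
      <value>7</value>
    </property>
    <property>
      <name>oozie.service.PurgeService.purge.interval</name>
      <value>60</value>
    </property>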
However, I still see old jobs, some KILLED and some completed, dating as far back as September 2014.
To give an example: I have a coordinator that is currently in the RUNNING state. When I use the Oozie Web Console to list the instances of that coordinator (i.e. click the Coordinators tab, then click my coordinator), the pop-up shows materialized workflow jobs (coordinator actions) dating back to September 2014.
I assume the property responsible for cleaning this up is oozie.service.PurgeService.older.than which I have set to 15 days.
So what am I missing here?

The problem occurs for long-running coordinator jobs with a high frequency: the child workflows are never purged because the coordinator job itself is still running.
The suggested workaround (quoting from the Cloudera Community link below):
What you can do as a workaround is split up your long-running Coordinators. For example, instead of making your Coordinator run for years (or forever), make it run for, say, 6 months, and have an identical Coordinator scheduled to start exactly when that one ends. This will allow Oozie to clean up the old child Workflows from that Coordinator every 6 months. Otherwise, you can schedule a cron job to manually delete old jobs from the database. However, please be careful about this. When deleting a workflow job from the WF_JOBS table, you'll also need to delete the workflow actions from the WF_ACTIONS table that belong to it, as well as the coordinator action from the COORD_ACTIONS table that it belongs to. If you miss something, it will likely cause problems.
References:
https://community.cloudera.com/t5/Batch-Processing-and-Workflow/Oozie-not-cleaning-up-old-jobs-from-Oozie-database/m-p/30692#U30692
https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/zkWa2kDMyyo
http://qnalist.com/questions/5404909/oozie-purging
JIRA Link:
https://issues.apache.org/jira/browse/OOZIE-1532

Related

Apache NiFi job is not terminating automatically

I am new to the Apache NiFi tool. I am trying to import data from MongoDB and put it into HDFS. I created 2 processors, one for MongoDB and a second for HDFS, and configured them correctly. The job runs successfully and stores the data into HDFS, but it should terminate automatically on success. It does not, and it creates too many files in HDFS. I want to know how to make an on-demand job in NiFi and how to determine that a job was successful.
GetMongo will continue to pull data from MongoDB based on the provided properties such as Query, Projection and Limit. It has no way of tracking the execution process, at least for now. What you can do, however, is change the Run Schedule and/or Scheduling Strategy. You can find them by right-clicking on the processor and clicking Configure. By default, the Run Schedule is 0 sec, which means running continuously. Changing it to, say, 60 min will make the processor run every hour. It will still read the same documents from MongoDB every hour, but since you mentioned that you only want to run the job once, I'm suggesting this approach.

Spark Launcher jobs not starting because the token can't be found in cache after 24 hours

I have a Java application which runs continuously and checks a table in a database for new records. When a new record is added to the table, the Java application unzips the file and puts it into an HDFS location, and then a Spark job is triggered (I am programmatically triggering the Spark job using the SparkLauncher class inside the Java application), which processes the newly added file in that HDFS location.
I have scheduled the Java application in the cluster using an Oozie Java action.
The cluster is a Kerberized HDP cluster.
The job works perfectly fine for 24 hours: all the unzipping happens and the Spark jobs run.
But after 24 hours the unzipping still happens in the Java application, but the Spark job no longer gets triggered in the Resource Manager.
Exception : Exception encountered while connecting to the server :INFO: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (owner=****, renewer=oozie mr token, realUser=oozie, issueDate=1498798762481, maxDate=1499403562481, sequenceNumber=36550, masterKeyId=619) can't be found in cache
As per my understanding, after 24 hours Oozie renews the token, and that renewed token is not picked up by the SparkLauncher job. The SparkLauncher is still presenting the older token, which is no longer available in the cache.
Please help me understand how I can make the SparkLauncher use the new token.
As per my understanding, after 24 hours oozie is renewing the token
Why? Can you point to any documentation, source code, blog?
Remember that Oozie is a scheduler for batch jobs, and its canonical use case (at Yahoo!) is for triggering hourly jobs.
Only a pathological batch job would run for more than 24h, therefore renewal of the Hadoop delegation token is not really useful in Oozie.
But your Java thing acts as a service, running continuously, and needing automatic restart if it ever crashes. So you should consider...
either Slider, if you really want to run it inside YARN (although there are many, many drawbacks -- how do you inspect the logs of a running YARN job? how can you make sure that the app starts on time and is not delayed by a lack of resources? how can you make sure that your app will not be killed because YARN needs resources for a high-priority job?) but it is probably overkill for simply running your toy app
or a plain Linux service running on some edge node -- it's a do-it-yourself task, but not extremely complicated, and there are tutorials on the web
If you insist on using Oozie, in spite of all the limitations of both YARN and Oozie, then you have to change the way your app runs: for instance, schedule the Coordinator to launch a job every 12h and pass the "nominal time" as a Workflow property, edit the Workflow to pass that time to the Java app, and edit the Java code so that the app exits at (nominal time + 11:58) and clears the way for the next execution.
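A minimal sketch of that last idea, assuming the Workflow passes ${coord:nominalTime()} (formatted as yyyy-MM-dd'T'HH:mm'Z') to the Java action as its first argument; the 11h58m budget, method name and 60-second polling interval are illustrative, not from the original answer:

    import java.time.Duration;
    import java.time.Instant;
    import java.time.LocalDateTime;
    import java.time.ZoneOffset;
    import java.time.format.DateTimeFormatter;

    public class PollingMain {

        public static void main(String[] args) throws InterruptedException {
            // args[0] is assumed to carry the coordinator's nominal time, e.g. "2017-07-01T00:00Z"
            DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm'Z'");
            Instant nominal = LocalDateTime.parse(args[0], fmt).toInstant(ZoneOffset.UTC);
            // Exit a couple of minutes before the next 12-hourly run is due.
            Instant deadline = nominal.plus(Duration.ofHours(11)).plus(Duration.ofMinutes(58));

            while (Instant.now().isBefore(deadline)) {
                checkTableAndLaunchSparkIfNeeded();  // placeholder for the existing DB polling + SparkLauncher logic
                Thread.sleep(60_000);                // poll once a minute
            }
            // Returning normally ends this Oozie action; the next coordinator action
            // takes over with a freshly issued delegation token.
        }

        private static void checkTableAndLaunchSparkIfNeeded() {
            // existing application logic goes here
        }
    }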

Hadoop jobs in deadlock with PySpark and Oozie

I am trying to run PySpark on YARN with Oozie. After submitting the workflow, there are 2 jobs in the Hadoop job queue: one is the Oozie launcher job, which has application type "MAPREDUCE", and the other, triggered by the first, has application type "SPARK". While the first job is running, the second job remains in ACCEPTED status. Here is the problem: the first job is waiting for the second job to finish before proceeding, and the second is waiting for the first one to finish before running, so I seem to be stuck in a deadlock. How can I get out of this situation? Is there any way for the Hadoop job with application type "mapreduce" to run in parallel with jobs of a different application type?
Any advice is appreciated, thanks!
Please check the value of the following property in the YARN scheduler configuration. I guess you need to increase it to something like 0.9 or so.
Property: yarn.scheduler.capacity.maximum-am-resource-percent
You would need to restart YARN, MapReduce and Oozie after updating the property.
More info: Setting Application Limits.
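For illustration, with the Capacity Scheduler this property lives in capacity-scheduler.xml (or the equivalent field in Ambari/Cloudera Manager); the 0.9 value is just the figure suggested above:

    <property>
      <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
      <value>0.9</value>
      <!-- maximum fraction of cluster resources usable by ApplicationMasters;
           raising it lets the Oozie launcher AM and the Spark AM run concurrently -->
    </property>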

Schedule a trigger for a job that is executed on every node in a cluster

I'm wondering if there is a simple workaround/hack in Quartz for triggering a job that is executed on every node in a cluster.
My situation:
My application caches some things and runs in a cluster with no distributed cache. Now I have situations where I want to refresh the caches on all nodes, triggered by a job.
As you have found out, Quartz always picks up a random instance to execute a scheduled job and this cannot be easily changed unless you want to hack its internals.
Probably the easiest way to achieve what you describe would be to implement some sort of a coordinator (or master) job that is aware of all Quartz instances in the cluster and "manually" triggers execution of the cache-sync job on every single node. The master job can easily do it via the RMI or JMX APIs exposed by Quartz.
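One possible shape for that master job (a sketch only: it fans out over a small application-level JMX MBean registered on each node rather than Quartz's own remote API, and the endpoint list, ObjectName and operation name are all illustrative):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;
    import org.quartz.Job;
    import org.quartz.JobExecutionContext;
    import org.quartz.JobExecutionException;

    // Master job: Quartz runs it on whichever node picks up the trigger,
    // and it then asks every node to refresh its local cache over JMX.
    public class CacheSyncMasterJob implements Job {

        // Hypothetical JMX endpoints, one per cluster node.
        private static final String[] NODES = {
            "service:jmx:rmi:///jndi/rmi://node1:9999/jmxrmi",
            "service:jmx:rmi:///jndi/rmi://node2:9999/jmxrmi"
        };

        @Override
        public void execute(JobExecutionContext context) throws JobExecutionException {
            for (String url : NODES) {
                try (JMXConnector connector =
                         JMXConnectorFactory.connect(new JMXServiceURL(url))) {
                    MBeanServerConnection mbs = connector.getMBeanServerConnection();
                    // "refreshCaches" is an operation on an MBean each node registers itself.
                    mbs.invoke(new ObjectName("myapp:type=CacheManager"),
                               "refreshCaches", new Object[0], new String[0]);
                } catch (Exception e) {
                    throw new JobExecutionException("Cache refresh failed for " + url, e, false);
                }
            }
        }
    }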
You may want to check this somewhat similar question.

Event Scheduler in PostgreSQL?

Is there a similar event scheduler from MySQL available in PostgreSQL?
While a lot of people just use cron, the closest thing to a built-in scheduler is pgAgent. It's a component of the pgAdmin GUI management tool. A good intro to it can be found at Setting up PgAgent and doing scheduled backups.
pg_cron is a simple, cron-based job scheduler for PostgreSQL that runs inside the database as an extension. A background worker initiates commands according to their schedule by connecting to the local database as the user that scheduled the job.
pg_cron can run multiple jobs in parallel, but it runs at most one instance of a job at a time. If a second run is supposed to start before the first one finishes, then the second run is queued and started as soon as the first run completes. This ensures that jobs run exactly as many times as scheduled and don't run concurrently with themselves.
If you set up pg_cron on a hot standby, then it will start running the cron jobs, which are stored in a table and thus replicated to the hot standby, as soon as the server is promoted. This means your periodic jobs automatically fail over with your PostgreSQL server.
Source: citusdata.com
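As a usage sketch, assuming pg_cron is installed and loaded via shared_preload_libraries: a job is registered by calling the cron.schedule(schedule, command) function, for example from Java over JDBC (connection details, table name and retention period below are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class SchedulePurgeJob {
        public static void main(String[] args) throws Exception {
            // Requires the PostgreSQL JDBC driver on the classpath.
            // pg_cron jobs are normally created in the database named in cron.database_name.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/postgres", "postgres", "secret");
                 PreparedStatement ps = conn.prepareStatement("SELECT cron.schedule(?, ?)")) {
                ps.setString(1, "0 3 * * *");  // every day at 03:00
                ps.setString(2, "DELETE FROM events WHERE created_at < now() - interval '30 days'");
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        System.out.println("Scheduled pg_cron job id: " + rs.getLong(1));
                    }
                }
            }
        }
    }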
