Rerun Oozie workflows every day - hadoop

I have an Oozie coordinator that when started sets its start time to 365 days ago and then runs its workflow with a daily frequency until reaching the latest date having input data. So the coordinator.xml looks something like this:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<coordinator-app name="${jobName}" frequency="${coord:days(1)}" start="${startTime}" end="${endTime}" timezone="${timezone}" xmlns="uri:oozie:coordinator:0.2">
<controls>
<timeout>-1</timeout>
<concurrency>30</concurrency>
<execution>FIFO</execution>
</controls>
<datasets>
<dataset name="myData" frequency="${coord:days(1)}" initial-instance="${initialTime}" timezone="${timezone}">
<uri-template>${outputPath}/${YEAR}${MONTH}${DAY}</uri-template>
</dataset>
</datasets>
<input-events>
<data-in name="myInput" dataset="myData">
<instance>${coord:current(-1)}</instance>
</data-in>
</input-events>
<action>
<workflow>
<app-path>${appPath}</app-path>
<configuration>
<property>
<name>myInput</name>
<value>${coord:dataIn('myInput')}</value>
</property>
<property>
<name>date</name>
<value>${coord:formatTime(coord:dateOffset(coord:nominalTime(), -1, 'DAY'), 'yyyyMMdd')}</value>
</property>
</configuration>
</workflow>
</action>
</coordinator-app>
On startup, this coordinator triggers 365 workflows. Then the next day, it just runs one workflow for that day's input data. And so on for the next day.
Now the requirements have changed and I need to run 365 workflows every day, not just the first time! So on startup, I want it to run like it does now. But then the next day, when the input data becomes available, I need to go back and run daily workflows for the last 365 days again. And so on for each subsequent day as soon as new input data is detected.
How can I achieve this?
Thanks,
Alvaro

In the data-in element of the coordinator, use start-instance and end-instance to specify the range of instances to consider as input.
start-instance: refers to the beginning of an instance range (the time for a synchronous dataset).
end-instance: refers to the end of an instance range (the time for a synchronous dataset).
<input-events>
<data-in name="myInput" dataset="myData">
<start-instance>${coord:current(-365)}</start-instance>
<end-instance>${coord:current(-1)}</end-instance>
</data-in>
</input-events>
With a range like this, ${coord:dataIn('myInput')} resolves to a comma-separated list of all the dataset URIs in the window, so every daily action sees the last 365 days of input. For more information, refer to the Synchronous Coordinator Application Definition section of the Oozie documentation.

Alternatively, you can use a sub-workflow action to call the main workflow recursively, as in the sketch below.
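Inside the workflow itself that could look like the following sketch; the action name, ${appPath} pointing back at this same workflow, and the end/fail targets are assumptions, and you need a decision node upstream that eventually routes past this action, otherwise the recursion never terminates:
<action name="recurse">
<sub-workflow>
<app-path>${appPath}</app-path>
<propagate-configuration/>
</sub-workflow>
<ok to="end"/>
<error to="fail"/>
</action>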

Related

How to pass coordinator time of Oozie to SLA argument nominal_time in a workflow?

I would like to pass the execution time of the coordinator into the SLA value of nominal time of a workflow. I pass the EL function in a property tag in the workflow and also in the coordinator, but I received the message:
EL_ERROR No function is mapped to the name "coord:nominalTime"
I have tried different EL functions but the result was the same.
Which function do I need to use in order to pass the time of the coordinator to the SLA argument of nominal time in the workflow?
To clarify, the SLA code snippet exists in workflow action, not in the coordinator.
It's hard to give you an answer without seeing your Oozie XML, but I think this is what you are asking for.
In the coordinator:
<action>
<workflow>
<app-path>wf.xml</app-path>
<configuration>
<property>
<name>nominalTime</name>
<value>${coord:nominalTime()}</value>
</property>
</configuration>
</workflow>
</action>
In the workflow:
<sla:info>
<sla:nominal-time>${nominalTime}</sla:nominal-time>
...
</sla:info>
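Note that the sla: prefix only resolves if the workflow-app root element declares the SLA namespace; a one-line sketch (the workflow name and schema versions here are illustrative, so match them to your Oozie version):
<workflow-app name="my-wf" xmlns="uri:oozie:workflow:0.5" xmlns:sla="uri:oozie:sla:0.2">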

YARN minimum-user-limit-percent not working?

I'm using the Capacity Scheduler in YARN and I saw that users can be guaranteed a minimum percentage of the queue by using the property minimum-user-limit-percent. I set this property to 20, and what I would expect is that resources would get equally distributed among up to 5 users, according to this:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_yarn-resource-management/content/setting_user_limits.html
But that's not the case when users start running applications. For instance, if I run an application when the queue is idle, YARN allocates all the requested resources for that application. When another user runs the same application afterwards, YARN allocates as many resources as are left in the queue, and the queue fills up. At this point, I thought that with the property set the second user would get 50% of the queue, and the first one would have fewer resources.
If a third user comes in, I would expect him/her to get 33% of the queue, but YARN doesn't even schedule the application because there are no available resources left.
Am I missing something? I thought this parameter made requests independent of the available resources until it hit the minimum percentage per user.
Here are my yarn-site.xml and capacity-scheduler.xml:
yarn-site.xml
<configuration>
<property>
<name>yarn.acl.enable</name>
<value>true</value>
</property>
<property>
<name>yarn.admin.acl</name>
<value>*</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoopLogin:8032</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>hadoopLogin:8033</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoopLogin:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoopLogin:8031</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hadoopLogin:8088</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.https.address</name>
<value>hadoopLogin:8090</value>
</property>
<property>
<name>yarn.resourcemanager.client.thread-count</name>
<value>50</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.client.thread-count</name>
<value>50</value>
</property>
<property>
<name>yarn.resourcemanager.admin.client.thread-count</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.scheduler.increment-allocation-mb</name>
<value>512</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>14336</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.increment-allocation-vcores</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>32</value>
</property>
<property>
<name>yarn.resourcemanager.amliveliness-monitor.interval-ms</name>
<value>1000</value>
</property>
<property>
<name>yarn.am.liveness-monitor.expiry-interval-ms</name>
<value>600000</value>
</property>
<property>
<name>yarn.resourcemanager.am.max-attempts</name>
<value>2</value>
</property>
<property>
<name>yarn.resourcemanager.container.liveness-monitor.interval-ms</name>
<value>600000</value>
</property>
<property>
<name>yarn.resourcemanager.nm.liveness-monitor.interval-ms</name>
<value>1000</value>
</property>
<property>
<name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
<value>600000</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.client.thread-count</name>
<value>50</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
<property>
<name>yarn.scheduler.fair.user-as-default-queue</name>
<value>true</value>
</property>
<property>
<name>yarn.scheduler.fair.preemption</name>
<value>false</value>
</property>
<property>
<name>yarn.scheduler.fair.sizebasedweight</name>
<value>false</value>
</property>
<property>
<name>yarn.scheduler.fair.assignmultiple</name>
<value>false</value>
</property>
<property>
<name>yarn.resourcemanager.max-completed-applications</name>
<value>10000</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>spark_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
</configuration>
capacity-scheduler.xml
<configuration>
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>batch,notebook</value>
<description>Definition of console and batch jobs (batch) and notebook jobs (notebook) queues</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.batch.capacity</name>
<value>50</value>
<description>Percentage of capacity for root.batch queue</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.notebook.capacity</name>
<value>50</value>
<description>Percentage of capacity for root.notebook queue</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.batch.maximum-capacity</name>
<value>55</value>
<description>Percentage of maximum capacity for root.batch queue</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.notebook.maximum-capacity</name>
<value>55</value>
<description>Percentage of maximum capacity for root.notebook queue</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.batch.state</name>
<value>RUNNING</value>
<description>Current state of the root.batch queue</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.notebook.state</name>
<value>RUNNING</value>
<description>Current state of the root.notebook queue</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.acl_submit_applications</name>
<value>hadoop,yarn,mapred,hdfs,spark</value>
<description>The ACL of who can submit jobs to the root queue.</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.batch.acl_submit_applications</name>
<value>scienceUser1 root,gaia,ub,ucm,uac,udc,esac,upo,une,inaf</value>
<description>The ACL of who can submit jobs to the root.batch queue.</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.notebook.acl_submit_applications</name>
<value>* root,gaia,ub,ucm,uac,udc,esac,upo,une,inaf</value>
<description>The ACL of who can submit jobs to the root.notebook queue.</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.batch.acl_administer_queue</name>
<value>gaia</value>
<description>The ACL of who can administer jobs on the root.batch queue.</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.notebook.acl_administer_queue</name>
<value>gaia</value>
<description>The ACL of who can administer jobs on the root.notebook queue.</description>
</property>
<property>
<name>yarn.scheduler.capacity.root.acl_administer_queue</name>
<value>gaia</value>
<description>The ACL of who can administer jobs on the root queue.</description>
</property>
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.notebook.minimum-user-limit-percent</name>
<value>33</value>
<description>Minimum percentage of resources a user gets from the queue.</description>
</property>
</configuration>
Actually this is not the way minimum-user-limit-percent should be read. Suppose you have 100 GB and there are 4 users: with an equal split, each can get up to 25 GB. But what if tomorrow 100 users come in? Dividing equally would give 1 GB to each user, which we may not want, since every process would starve for resources and performance would suffer. So we limit ourselves by setting minimum-user-limit-percent to 25: each user is guaranteed a minimum of 25 GB, and any further applications are kept in the accepted state. Now if 1 user comes in, it can go up to 100 GB; 2 users can take up to 50 GB each; 3 users up to 33 GB each; 4 users up to 25 GB each. But if a 5th user comes in, it is sent to waiting, because giving each user 20 GB would go below the limit we set, 25% of 100 GB, which is 25 GB. P.S. this is my understanding; if anyone finds that this is wrong, please comment. It is indeed a bit confusing.
Now in your scenario, let's say the first two users took the entire set of resources. That case is different from this one: the 25% limit only says that you can enter only when YARN has at least 25% available to allocate to you. To ensure two users do not take the full resources, you can tweak user-limit-factor: if set to 0.2, each user cannot take more than 20% of the queue, as in the sketch below.
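A minimal capacity-scheduler.xml sketch of that knob (the notebook queue name comes from your config above; the same property exists per queue):
<property>
<name>yarn.scheduler.capacity.root.notebook.user-limit-factor</name>
<value>0.2</value>
<description>No single user may take more than 20% of the queue capacity.</description>
</property>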
The behavior you're seeing is correct if you haven't enabled preemption.
Look in your yarn-site.xml to see whether yarn.resourcemanager.scheduler.monitor.enable is set to true.
If you haven't enabled it, refer to the following for guidance (a minimal configuration sketch follows the links):
http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html#Capacity_Scheduler_container_preemption
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_yarn-resource-management/content/preemption.html
https://community.hortonworks.com/questions/89894/capacity-scheduler-preemption-doesnt-work.html
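In yarn-site.xml the change those guides describe looks roughly like this (a sketch assuming the stock ProportionalCapacityPreemptionPolicy; the policy has further tuning knobs covered in the links):
<property>
<name>yarn.resourcemanager.scheduler.monitor.enable</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.monitor.policies</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy</value>
</property>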

Is there a way to set a timeout for the commit-interval on a spring batch job?

We have data streaming in on an irregular basis and in quantities that I cannot predict. I currently have the commit-interval set to 1 because we want data to be written as soon as we receive it. We sometimes get large numbers of items at a time (~1000-50000 items in a second) which I would like to commit in larger chunks, as it takes a while to write these individually. Is there a way to set a timeout on the commit-interval?
Goal: We set the commit-interval to 10000, we get 9900 items, and after 1 second it commits the 9900 items rather than waiting until it receives 100 more.
Currently, when we set the commit-interval greater than 1, we just see data waiting to be written until it hits the amount specified by the commit-interval.
How is your data streaming in? Is it being loaded to a work table? Added to a queue? Typically you'd just drain the work table or queue with whatever commit interval performs best, then re-run the job periodically to check whether a new batch of inbound records has been received.
Either way, I would typically leverage flow control to have your job loop and just process as many records as are ready to be processed for a given time interval:
<job id="job">
<decision id="decision" decider="decider">
<next on="PROCESS" to="processStep" />
<next on="DECIDE" to="decision" />
<end on="COMPLETED" />
<fail on="*" />
</decision>
<step id="processStep">
<!-- your step here -->
</step>
</job>
<beans:bean id="decider" class="com.package.MyDecider"/>
Then your decider would do something like this:
if (maxTimeReached) {
return END;
}
if (hasRecords) {
return PROCESS;
} else {
wait X seconds;
return DECIDE;
}
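In Java, a minimal sketch of com.package.MyDecider along those lines; MAX_RUNTIME_MS, WAIT_MS, and hasRecords() are placeholders you would wire to your own queue or work-table check:
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.job.flow.FlowExecutionStatus;
import org.springframework.batch.core.job.flow.JobExecutionDecider;

public class MyDecider implements JobExecutionDecider {

    private static final long MAX_RUNTIME_MS = 60_000L; // placeholder: stop looping after one minute
    private static final long WAIT_MS = 1_000L;         // placeholder: pause between checks
    private final long startTime = System.currentTimeMillis();

    @Override
    public FlowExecutionStatus decide(JobExecution jobExecution, StepExecution stepExecution) {
        if (System.currentTimeMillis() - startTime > MAX_RUNTIME_MS) {
            return FlowExecutionStatus.COMPLETED; // matches <end on="COMPLETED"/>
        }
        if (hasRecords()) {
            return new FlowExecutionStatus("PROCESS"); // run the processing step
        }
        try {
            Thread.sleep(WAIT_MS); // wait before re-checking
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new FlowExecutionStatus("DECIDE"); // loop back to this decision
    }

    // Placeholder: replace with a count query against your work table or queue.
    private boolean hasRecords() {
        return false;
    }
}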

Run same coordinator with different input values

I have a coordinator which I'd like to run with different input values.
For example, below are the 'input-events' provided. I would like to run the coordinator job with the following values of the param benzene_hbase_input_offset:
(-24) - looks at the data of the last 24 hours.
(-96) - looks at the data of the last 96 hours.
Currently I only run option 2. In order to avoid code duplication, I'd like to run the same coordinator with a different input for the start-instance param.
How can I do that?
<input-events>
<data-in name="benzene_dwell_calc-output" dataset="BENZENE_DWELL_CALC_OUTPUT">
<start-instance>${coord:current(benzene_hbase_input_offset)}</start-instance>
<end-instance>${coord:current(-1)}</end-instance>
</data-in>
</input-events>
Set this param in the job.properties.
In the coordinator, place it in the start-instance and also in the coordinator name so you can distinguish between the two jobs, as sketched below.
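A minimal sketch with hypothetical values: one job.properties per variant, both submitted against the same coordinator.xml.
# job.properties for the 24-hour variant
benzene_hbase_input_offset=-24
jobName=benzene-dwell-calc-24h
# job.properties for the 96-hour variant
benzene_hbase_input_offset=-96
jobName=benzene-dwell-calc-96h
The coordinator then picks both up, e.g. <coordinator-app name="${jobName}" ...> together with <start-instance>${coord:current(benzene_hbase_input_offset)}</start-instance>.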

Nutch segments folder grows every day

I have configured Nutch/Solr 1.6 to crawl and index an intranet with about 4000 documents and HTML pages every 12 hours.
If I execute the crawler with an empty database, the process takes about 30 minutes.
When the crawl has been running for several days, it becomes very slow.
Looking at the log file, it seems that last night the final step (SolrIndexer) started after 1 hour and 20 minutes and took a bit more than 1 hour.
Because the number of documents indexed doesn't grow, I'm wondering why it is so slow now.
Nutch is executed with the following command:
bin/nutch crawl -urlDir urls -solr http://localhost:8983/solr -dir nutchdb -depth 15 -topN 3000
The nutch-site.xml contains:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>Internet Site Agent</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(tika|metatags)|index-(basic|anchor|metadata|more|http-header)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<!-- Used only if plugin parse-metatags is enabled. -->
<property>
<name>metatags.names</name>
<value>description;keywords;published;modified</value>
<description> Names of the metatags to extract, separated by;.
Use '*' to extract all metatags. Prefixes the names with 'metatag.'
in the parse-metadata. For instance to index description and keywords,
you need to activate the plugin index-metadata and set the value of the
parameter 'index.parse.md' to 'metatag.description;metatag.keywords'.
</description>
</property>
<property>
<name>index.parse.md</name>
<value>metatag.description,metatag.keywords,metatag.published,metatag.modified</value>
<description> Comma-separated list of keys to be taken from the parse metadata to generate fields.
Can be used e.g. for 'description' or 'keywords' provided that these values are generated
by a parser (see parse-metatags plugin)
</description>
</property>
<property>
<name>db.ignore.external.links</name>
<value>true</value>
<description>Set this to false if you start crawling your website from
for example http://www.example.com but you would like to crawl
xyz.example.com. Set it to true otherwise if you want to exclude external links
</description>
</property>
<property>
<name>http.content.limit</name>
<value>10000000</value>
<description>The length limit for downloaded content using the http
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the file.content.limit setting.
</description>
</property>
<property>
<name>fetcher.max.crawl.delay</name>
<value>1</value>
<description>
If the Crawl-Delay in robots.txt is set to greater than this value (in
seconds) then the fetcher will skip this page, generating an error report.
If set to -1 the fetcher will never skip such pages and will wait the
amount of time retrieved from robots.txt Crawl-Delay, however long that
might be.
</description>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>10</value>
<description>The number of FetcherThreads the fetcher should use.
This is also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection). The total
number of threads running in distributed mode will be the number of
fetcher threads * number of nodes as fetcher has one map task per node.
</description>
</property>
<property>
<name>fetcher.server.delay</name>
<value>1.0</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server.</description>
</property>
<property>
<name>http.redirect.max</name>
<value>0</value>
<description>The maximum number of redirects the fetcher will follow when
trying to fetch a page. If set to negative or 0, fetcher won't immediately
follow redirected URLs, instead it will record them for later fetching.
</description>
</property>
<property>
<name>fetcher.threads.per.queue</name>
<value>2</value>
<description>This number is the maximum number of threads that
should be allowed to access a queue at one time. Replaces
deprecated parameter 'fetcher.threads.per.host'.
</description>
</property>
<property>
<name>link.delete.gone</name>
<value>true</value>
<description>Whether to delete gone pages from the web graph.</description>
</property>
<property>
<name>link.loops.depth</name>
<value>20</value>
<description>The depth for the loops algorithm.</description>
</property>
<!-- moreindexingfilter plugin properties -->
<property>
<name>moreIndexingFilter.indexMimeTypeParts</name>
<value>false</value>
<description>Determines whether the index-more plugin will split the mime-type
in sub parts, this requires the type field to be multi valued. Set to true for backward
compatibility. False will not split the mime-type.
</description>
</property>
<property>
<name>moreIndexingFilter.mapMimeTypes</name>
<value>false</value>
<description>Determines whether MIME-type mapping is enabled. It takes a
plain text file with mapped MIME-types. With it the user can map both
application/xhtml+xml and text/html to the same target MIME-type so it
can be treated equally in an index. See conf/contenttype-mapping.txt.
</description>
</property>
<!-- Fetch Schedule Configuration -->
<property>
<name>db.fetch.interval.default</name>
<!-- for now always re-fetch everything -->
<value>10</value>
<description>The default number of seconds between re-fetches of a page (less than 1 day).
</description>
</property>
<property>
<name>db.fetch.interval.max</name>
<!-- for now always re-fetch everything -->
<value>10</value>
<description>The maximum number of seconds between re-fetches of a page
(less than one day). After this period every page in the db will be re-tried, no
matter what is its status.
</description>
</property>
<!--property>
<name>db.fetch.schedule.class</name>
<value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
<description>The implementation of fetch schedule. DefaultFetchSchedule simply
adds the original fetchInterval to the last fetch time, regardless of
page changes.</description>
</property>
<property>
<name>db.fetch.schedule.adaptive.inc_rate</name>
<value>0.4</value>
<description>If a page is unmodified, its fetchInterval will be
increased by this rate. This value should not
exceed 0.5, otherwise the algorithm becomes unstable.</description>
</property>
<property>
<name>db.fetch.schedule.adaptive.dec_rate</name>
<value>0.2</value>
<description>If a page is modified, its fetchInterval will be
decreased by this rate. This value should not
exceed 0.5, otherwise the algorithm becomes unstable.</description>
</property>
<property>
<name>db.fetch.schedule.adaptive.min_interval</name>
<value>60.0</value>
<description>Minimum fetchInterval, in seconds.</description>
</property>
<property>
<name>db.fetch.schedule.adaptive.max_interval</name>
<value>31536000.0</value>
<description>Maximum fetchInterval, in seconds (365 days).
NOTE: this is limited by db.fetch.interval.max. Pages with
fetchInterval larger than db.fetch.interval.max
will be fetched anyway.</description>
</property>
<property>
<name>db.fetch.schedule.adaptive.sync_delta</name>
<value>true</value>
<description>If true, try to synchronize with the time of page change.
by shifting the next fetchTime by a fraction (sync_rate) of the difference
between the last modification time, and the last fetch time.</description>
</property>
<property>
<name>db.fetch.schedule.adaptive.sync_delta_rate</name>
<value>0.3</value>
<description>See sync_delta for description. This value should not
exceed 0.5, otherwise the algorithm becomes unstable.</description>
</property-->
<property>
<name>fetcher.threads.fetch</name>
<value>1</value>
<description>The number of FetcherThreads the fetcher should use.
This is also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection). The total
number of threads running in distributed mode will be the number of
fetcher threads * number of nodes as fetcher has one map task per node.
</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/apache-nutch/tmp/</value>
</property>
<!-- Boilerpipe -->
<property>
<name>tika.boilerpipe</name>
<value>true</value>
</property>
<property>
<name>tika.boilerpipe.extractor</name>
<value>ArticleExtractor</value>
</property>
</configuration>
As you can see, I have configured Nutch to always re-fetch all the documents.
Because the site is small, it should be OK for now to re-fetch everything (the first crawl takes only 30 minutes...).
I have noticed that every day more or less 40 new segments are created in the folder crawldb/segments, and the disk usage of the database is of course growing very fast.
Is this the expected behaviour? Is there something wrong with the configuration?
It is necessary to delete from the nutchdb the segments that are older than db.fetch.interval.default, the interval that defines when a page should be re-fetched.
Once a page has been re-fetched, its old segments can be deleted.
If the segments are not deleted, the SolrIndexer step has to read too many of them and becomes very slow (in my case one hour instead of 4 minutes).
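A minimal cleanup sketch, assuming the crawl directory from the command above (nutchdb) sits on the local filesystem and everything is re-fetched daily:
# delete segment directories older than one day; their pages have already been
# re-fetched into newer segments, so SolrIndexer no longer needs them
find nutchdb/segments -mindepth 1 -maxdepth 1 -type d -mtime +1 -exec rm -rf {} \;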
