I have two kinds of tasks in Spark: A and B.
In spark.scheduler.pool, I have two pools: APool and BPool.
I want task A to always be executed in APool and task B in BPool.
The resources in APool should be reserved for A.
The reason is that task B may take too many resources. Whenever B is executing, A has to wait. I want some resources to always be available for A, no matter when the task is submitted.
I am using Spark with Java in standalone mode. I submit the job like javaRDD.map(..).reduce... The javaRDD is a subclass that extends JavaRDD. Tasks A and B use different RDD classes, ARDD and BRDD. They run in the same Spark application.
The procedure is: the app starts up -> the Spark application is created, but no job runs -> I click "run A" in the app UI, and ARDD runs -> I click "run B" in the app UI, and BRDD runs in the same Spark application as A.
Setting 'common' properties for child tasks is not working
The SCDF version I'm using is 2.9.6.
I want to make a CTR flow A-B-C, where each task does the following:
A : SQL select on some source DB
B : process the DB data that A got
C : SQL insert on some target DB
The simplest way to make this work seems to be to define a shared work directory, "some_work_directory", and pass it as an application property to A, B, and C. Under {some_work_directory}, I store each task's result as a file, such as select.result, process.result, and insert.result, and each task reads its predecessor's file. If the preceding data is missing, I can assume something went wrong and make the task exit with 1.
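The file-based handoff described above could look roughly like the following sketch. It is not SCDF code: the stage names, the run_stage helper, and the payload are hypothetical, and only the result-file naming and the "exit with 1 when the preceding data is missing" rule come from the description.

```python
import sys
import tempfile
from pathlib import Path

# Hypothetical stage order; each stage writes "<stage>.result" into the work directory.
STAGES = ["select", "process", "insert"]


def run_stage(work_dir: Path, stage: str, payload: str) -> int:
    """Run one stage and return the exit code the task should finish with."""
    index = STAGES.index(stage)
    if index > 0:
        previous = work_dir / f"{STAGES[index - 1]}.result"
        if not previous.is_file():
            # The preceding task left no data, so treat this as a failure.
            print(f"missing {previous.name}", file=sys.stderr)
            return 1
    (work_dir / f"{stage}.result").write_text(payload)
    return 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        work_dir = Path(tmp)
        for stage in STAGES:
            print(stage, run_stage(work_dir, stage, "rows"))
```

Each real task would call sys.exit() with the returned code so that the composed task stops on a non-zero exit.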
I tried a composed task instance QWER with two tasks, named A and B, from the same application "global". This simple application prints the test.value application property to the console; it defaults to "test" when no other value is given.
If I set test.value in the global tab of the SCDF launch builder, it shows up as app.*.test.value in the composed task's log. However, the logs of child tasks A and B show that they do not pick up this configuration from the parent. Both fail to resolve the input given at launch time.
If I set test.value as a row in the launch builder and pass a value to A and B, as I did when the task was not a composed one, this fails too. I know this is not the 'global' setting I need, but it suggests that CTR is not working correctly with the SCDF launch builder.
The only workaround I found is manually setting app.QWER.A.test.value=AAAAA and app.QWER.B.test.value=BBBBB in the launch freetext. This way, the input is converted to app.QWER-A.app.global4.test.value=AAAAA and app.QWER-B.app.global4.test.value=BBBBB, and the values print correctly.
I understand that this way I can set detailed configuration for each child task at launch time. However, if I just want to set a 'global' property shared by all tasks in one CTR instance, there seems to be no feasible way.
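Until a real 'global' setting works, the per-child workaround can at least be generated instead of typed by hand. The sketch below is plain string building, not an SCDF API: the fan_out helper is hypothetical, and the app.<ctr>.<child>.<key> format is simply the one that worked in the freetext field above.

```python
def fan_out(ctr_name, children, shared):
    """Build one app.<ctr>.<child>.<key>=<value> entry per child and shared key."""
    return [
        f"app.{ctr_name}.{child}.{key}={value}"
        for child in children
        for key, value in shared.items()
    ]


if __name__ == "__main__":
    # "QWER", "A", "B" and test.value are the names used in this question.
    for line in fan_out("QWER", ["A", "B"], {"test.value": "shared"}):
        print(line)
```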
Am I missing something? Thanks for any information in advance.
CTR orchestrates the execution of a collection of tasks. There is no implicit data transfer between tasks. If you want the data from A to be the input to B, and the output of B to be the input to C, you can create one Task / Batch application that has readers and writers connected by a processor, or you can create a stream application for B and use a JDBC source and sink for A and C.
In Azure Databricks I have scheduled a job with a notebook attached to a simple Python file.
dbutils.widgets.text("input", "", "")
y = getArgument("input")
print("Param -'input':")
print(y)
Cluster: D8s_v3 ( 1 Worker)
Even though the code is quite simple, it takes about 9 to 10 seconds to execute as a Databricks job. If I run the Python file directly, it executes in under 1 second.
Please guide me on how to optimize it for Databricks Jobs.
I am programming a custom speculator. I reviewed the documentation: the default is "DefaultSpeculator.java", and it is set in the class "MRAppMaster.java" (function createSpeculator()) in Hadoop core. I want to know whether I can change the speculator at runtime when executing my job, because I need to test about 5 speculators.
Thanks !!!
The speculative execution can be turned on and off for map tasks and reduce tasks on a cluster-wide basis or on a per-job basis.
The speculator is instantiated in MRAppMaster (Map-Reduce Application Master). As you mentioned in your question, the following piece of code in the MRAppMaster::serviceInit() function instantiates the speculator:
if (conf.getBoolean(MRJobConfig.MAP_SPECULATIVE, false)
    || conf.getBoolean(MRJobConfig.REDUCE_SPECULATIVE, false)) {
  //optional service to speculate on task attempts' progress
  speculator = createSpeculator(conf, context);
}
It checks the JobConfig, to see if speculative execution is turned on for either Map or Reduce tasks and then creates the speculator.
Since the speculator is created inside the MRAppMaster, you can enable your custom speculator for each job.
Following are the speculative execution properties:
mapreduce.map.speculative: Enable speculative execution for map tasks
mapreduce.reduce.speculative: Enable speculative execution for reduce tasks
yarn.app.mapreduce.am.job.speculator.class: Speculator class
yarn.app.mapreduce.am.job.task.estimator.class: Estimator class. This is used by speculator for estimating the run time of a task.
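To switch between several speculators, one option is to pass these properties per job as -D generic options. The sketch below only builds the option strings; it assumes the job's driver parses generic options (for example through ToolRunner), and the class names in CANDIDATES are hypothetical placeholders for your own implementations.

```python
def speculator_options(speculator_class, map_on=True, reduce_on=True):
    """Return -D options that enable speculation and select a speculator class."""
    props = {
        "mapreduce.map.speculative": str(map_on).lower(),
        "mapreduce.reduce.speculative": str(reduce_on).lower(),
        "yarn.app.mapreduce.am.job.speculator.class": speculator_class,
    }
    return [f"-D{key}={value}" for key, value in props.items()]


# Hypothetical class names, one per speculator under test.
CANDIDATES = ["com.example.SpeculatorA", "com.example.SpeculatorB"]

if __name__ == "__main__":
    for cls in CANDIDATES:
        print(" ".join(speculator_options(cls)))
```

Each printed line can be appended to the job submission command, so every test run uses a different speculator without rebuilding the job.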
I need to register a custom execution hook in Apache Hive. Please let me know if somebody knows how to do it.
The current environment I am using is given below:
Hadoop : Cloudera version 4.1.2
Operating system : Centos
There are several types of hooks, depending on the stage at which you want to inject your custom code:
Driver run hooks (Pre/Post)
Semantic analyzer hooks (Pre/Post)
Execution hooks (Pre/Failure/Post)
Client statistics publisher
If you run a script, the processing flow looks as follows:
Driver.run() takes the command
Driver.compile() starts processing the command: creates the abstract syntax tree
Semantic analysis
Create and validate the query plan (physical plan)
Driver.execute() : ready to run the jobs
ExecDriver.execute() runs all the jobs
For each job at every HiveConf.ConfVars.HIVECOUNTERSPULLINTERVAL interval:
ClientStatsPublisher.run() is called to publish statistics
If a task fails: ExecuteWithHookContext.run()
Finish all the tasks
ExecuteWithHookContext.run() (HiveConf.ConfVars.POSTEXECHOOKS)
Before returning the result: HiveDriverRunHook.postDriverRun() (HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS)
Return the result.
For each hook I indicated the interface you have to implement. The brackets contain the corresponding configuration property key you have to set at the beginning of the script in order to register the class.
E.g. setting the pre-execution hook, which runs after Driver.execute() and before ExecDriver.execute() runs the jobs:
HiveConf.ConfVars.PREEXECHOOKS -> hive.exec.pre.hooks:
set hive.exec.pre.hooks=com.example.MyPreHook;
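If you register several execution hooks, the set statements can be generated from a small table. This is only a sketch: it covers the three execution-hook keys, the hook class names are hypothetical, and it assumes that several classes may be listed comma-separated. Check HiveConf in your Hive version for the exact keys.

```python
# Property keys for the execution hooks (pre / post / on failure).
EXEC_HOOK_KEYS = {
    "pre": "hive.exec.pre.hooks",
    "post": "hive.exec.post.hooks",
    "failure": "hive.exec.failure.hooks",
}


def hook_statements(hooks):
    """Turn {stage: [class names]} into Hive 'set' statements."""
    return [
        f"set {EXEC_HOOK_KEYS[stage]}={','.join(classes)};"
        for stage, classes in hooks.items()
    ]


if __name__ == "__main__":
    # com.example.* are hypothetical hook classes.
    for statement in hook_statements({
        "pre": ["com.example.MyPreHook"],
        "post": ["com.example.MyPostHook"],
    }):
        print(statement)
```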
Unfortunately these features aren't really documented, but you can always look into the Driver class to see the evaluation order of the hooks.
Remark: I assumed Hive 0.11.0 here; I don't think the Cloudera distribution differs (too much).
A good start: http://dharmeshkakadia.github.io/hive-hook/
There are examples there.
Note: the Hive CLI shows the hook's messages on the console. If you execute from Hue, add a logger and you can see the output in the HiveServer2 role log.
I'm writing a Rails 3 application which requires performing small tasks on a custom schedule for each user. The scheduled tasks will be defined dynamically. Right now my plan is to use resque-scheduler with Redis.
Once I set the schedule for a specific task (e.g. run task A every 48 hours), I would like that task to run indefinitely. So I would like to store those schedules in a DB or similar, so that if the app crashes, it loads and queues those tasks again when it restarts.
Is this something Resque supports by default by storing it in Redis, or do I need to write my own custom solution? I was also looking at ruby-taskr (http://code.google.com/p/ruby-taskr/). I am not sure whether taskr supports storing schedules in a database and registering them on start.
It would also be helpful if there are applications or demos that I can look at.
I have a similar setup for batch jobs. The user adds them on a web dashboard and they get run however often is specified.
I use ActiveRecord to store the scheduling definitions, Resque for execution, and a single cron entry that enqueues jobs through a rake task.
Then, in the rake task:
to_run = Report.daily
to_run += Report.weekly if Time.now.monday?
to_run += Report.monthly if Time.now.day == 1
to_run.each{|r| r.enqueue!}
where daily, weekly, monthly are named scopes on the model:
class Report < ActiveRecord::Base
  scope :daily, where(:when_to_run => 'daily')
  scope :weekly, where(:when_to_run => 'weekly')
  scope :monthly, where(:when_to_run => 'monthly')
end
This is a little hacky, but it works well and I stay within the stack nicely. Hope that is useful.
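The selection rule in the rake task (daily always, weekly on Mondays, monthly on the 1st) can be checked in isolation. The sketch below restates it in Python for illustration only; due_schedules is a hypothetical helper and the values match the when_to_run column used above.

```python
from datetime import date


def due_schedules(today):
    """Return the when_to_run values whose reports should be enqueued today."""
    due = ["daily"]
    if today.weekday() == 0:  # Monday
        due.append("weekly")
    if today.day == 1:  # first day of the month
        due.append("monthly")
    return due


if __name__ == "__main__":
    print(due_schedules(date.today()))
```

Because cron runs the task once a day, a report is enqueued at most once per matching schedule on any given date.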