Can 1 TaskTracker run multiple JVMs?
Here is the scenario:
Assume there are 2 files (A & B) and 2 Data nodes (D1 & D2).
When you load A, assume it is getting split into A1 & A2 on D1 & D2
and when you load B, assume it is getting split into B1 & B2 on D1 & D2.
For some reason, let us assume D1 is busy with some other tasks
while D2 is available, and a couple of jobs are submitted,
one using file A and the other using file B.
So now D2 is available and has blocks A2 & B2.
Will the JobTracker submit the code to the TaskTracker on D2 and run the tasks for A2 and B2 at the same time, or
will it first run A2 and only run B2 after it finishes?
If they can run at the same time, is it possible to run both tasks in parallel, which would mean 1 TaskTracker and 2 JVMs, or will it create/spawn 2 TaskTrackers on D2?
By default, the TaskTracker spawns one JVM for each task.
You can reuse JVMs by setting the configuration parameter mapred.job.reuse.jvm.num.tasks.
A TaskTracker (TT) can launch multiple map or reduce tasks in parallel on a single machine. By default, the TT launches 2 map tasks (mapreduce.tasktracker.map.tasks.maximum) and 2 reduce tasks (mapreduce.tasktracker.reduce.tasks.maximum). These properties are overridden in mapred-site.xml (mapred-default.xml holds the shipped defaults).
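For reference, a minimal sketch of how these properties might be set in mapred-site.xml; the values here are only illustrative, not recommendations:

<configuration>
  <!-- Reuse a JVM for up to 10 tasks of the same job (1 = no reuse, -1 = unlimited) -->
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>10</value>
  </property>
  <!-- Allow up to 4 map and 2 reduce tasks to run concurrently per TaskTracker -->
  <property>
    <name>mapreduce.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>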
I'm working on an app which runs remote tasks (Task A and Task B) on a few (10) servers (s1 to s0), and once those parts are complete, Task C is run on the local server. All these tasks could take a while to finish (from a minute to an hour), but Task A takes between 4 and 20 times longer than Task B (and this can change for each run).
I don't wish to run more than one task on any server at a time. I'm trying to be efficient with how this works, so I think Laravel 8's Queue would serve my purpose. My thinking: I have, say, 5 queues q1, q2, q3, q4, q5. I then add Task A to queues 1 to 3 for the first 3 servers and Task B to q4 and q5 for s4 and s5, then repeat for the remaining servers and tasks. After this my queues would look like this:
q1 q2 q3 q4 q5
s1ta s2ta s3ta s4tb s5tb
s6ta s7ta s8ta s9tb s0tb
s1tb s2tb s3tb s4ta s5ta
s6tb s7tb s8tb s9ta s0ta
--tc
While this looks good, what if q1 gets to Task C while the other queues are still running? Is there a way I can trigger Task C only when all queues are empty? Is there a better way to do this? Should I use something other than queues for this, and if so, what? Is an event triggered when a job in a queue finishes?
I await your thoughts and recommendations.
thanks
Craig
*** EDIT ***
Thinking more on this, it would make sense to run Task A and Task B on the same queue, one after the other, so:
q1 q2 q3 q4 q5
s1ta s2ta s3ta s4ta s5ta
s1tb s2tb s3tb s4tb s5tb
s6ta s7ta s8ta s9ta s0ta
s6tb s7tb s8tb s9tb s0tb
--tc
But the issue with Task C would still remain, and it would be good if a task could move to an empty queue if it hasn't started yet. Right now I've no idea where to begin...
For example, suppose there is a pipeline made of 3 processors: P1, P2, P3. When P2 produces an output flowfile, I want processor P3 to process it exactly 5 minutes later.
I can't use a fixed CRON schedule because the P2 processor can run at any time.
NiFi version: 1.9.1
Look at RetryFlowFile with Maximum Retries = 1, placed between P2 and P3.
It can penalize the flow file when the retry count is exceeded, and with Maximum Retries = 1 that happens immediately.
Then set the penalty duration to 5 minutes.
All set: P3 will not take the flow file from the queue during those 5 minutes.
Option 2
You could use ExecuteGroovyScript in place of RetryFlowFile, with the following script, to penalize everything that goes through it:
// take the next flow file from the incoming queue
def ff = session.get()
if (!ff) return
// apply this processor's configured Penalty Duration to the flow file
ff = session.penalize(ff)
// route the penalized flow file to the success relationship
REL_SUCCESS << ff
PS: don't forget to set the Penalty Duration for this processor.
I am trying to both store some raw data and parse-and-store it, using two strategies (serial & parallel):
Flux<PanasonicData> f = Flux.create(sink -> dataRepo.addConsumer(sink::next));
Flux.from(f).publishOn(Schedulers.single()).subscribe(this::save1);
Flux.from(f).publishOn(Schedulers.parallel()).map(MyClass::parse).subscribe(this::save2);
Or
ConnectableFlux<PanasonicData> cf = Flux.create(sink -> dataRepo.addConsumer(sink::next)).publish();
cf.autoConnect().publishOn(Schedulers.single()).subscribe(this::save1);
cf.autoConnect().publishOn(Schedulers.parallel()).map(MyClass::parse).subscribe(this::save2);
But the second task never runs!
How can I run these two tasks with these two different strategies?
You can specify the minimum number of subscribers via autoConnect(int minSubscribers):
Flux<PanasonicData> cf = Flux.create(sink -> dataRepo.addConsumer(sink::next)).publish().autoConnect(2);
cf.publishOn(Schedulers.single()).subscribe(this::save1);
cf.publishOn(Schedulers.parallel()).map(MyClass::parse).subscribe(this::save2);
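For illustration, here is a minimal, self-contained sketch of the same pattern; the Flux.range source, the printed "save" messages and the map step are stand-ins for the question's dataRepo, save1/save2 and MyClass::parse:

import reactor.core.publisher.Flux;
import reactor.core.scheduler.Schedulers;
import java.time.Duration;

public class AutoConnectDemo {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the real data source; nothing is emitted until autoConnect(2)
        // sees both subscribers, so neither strategy misses data.
        Flux<Integer> cf = Flux.range(1, 5)
                .delayElements(Duration.ofMillis(100))
                .publish()
                .autoConnect(2);

        // Serial strategy: store the raw value on the single scheduler.
        cf.publishOn(Schedulers.single())
          .subscribe(v -> System.out.println("save1 (serial): " + v));

        // Parallel strategy: parse (here: multiply) and then store.
        cf.publishOn(Schedulers.parallel())
          .map(v -> v * 10)
          .subscribe(v -> System.out.println("save2 (parallel): " + v));

        Thread.sleep(1000); // keep the JVM alive long enough to see both outputs
    }
}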
I have two kinds of tasks in Spark: A and B.
In spark.scheduler.pool, I have two pools: APool and BPool.
I want task A to always be executed in APool while B runs in BPool.
The resources in APool are reserved for A.
This is because task B may take too many resources to execute; every time B is executing, A needs to wait. I want there always to be some resources left for A, no matter when a task is submitted.
I am using Spark with Java in standalone mode. I submit a job like javaRDD.map(..).reduce... The javaRDD is a subclass extending JavaRDD. Tasks A and B have different RDD classes, such as ARDD and BRDD. They run in the same Spark application.
The procedure is: the app starts up -> a Spark application is created, but no job runs -> I click "run A" on the app UI, and the ARDD job runs -> I click "run B" on the app UI, and the BRDD job runs in the same Spark application as A.
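For illustration, a minimal sketch of how jobs are usually routed to fair-scheduler pools from application code; the pool names come from the question, while the local master, the toy RDD work and a fairscheduler.xml that defines APool and BPool are assumptions of the sketch:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class PoolRoutingSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("pool-routing-sketch")
                .setMaster("local[2]")                                  // local master just for the sketch
                .set("spark.scheduler.mode", "FAIR")                    // enable fair scheduling within the app
                .set("spark.scheduler.allocation.file", "fairscheduler.xml"); // assumed to define APool and BPool
        JavaSparkContext sc = new JavaSparkContext(conf);

        // "run A" handler: tag this thread's jobs with APool before triggering the action.
        sc.setLocalProperty("spark.scheduler.pool", "APool");
        long countA = sc.parallelize(Arrays.asList(1, 2, 3)).map(x -> x * 2).count();

        // "run B" handler (normally on a different thread): tag its jobs with BPool.
        sc.setLocalProperty("spark.scheduler.pool", "BPool");
        long countB = sc.parallelize(Arrays.asList(4, 5, 6)).map(x -> x + 1).count();

        System.out.println(countA + " / " + countB);
        sc.close();
    }
}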
I have an Oozie job that has 3 actions: A1, B1 and C1. I am running the three actions in parallel by configuring them in a fork. When A1 fails due to an EL_ERROR, the job fails. However, the other two actions, B1 and C1, are still shown as in progress and never complete. What could be the issue?
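For context, here is a minimal sketch of the kind of fork/join workflow described; the action bodies are placeholder fs steps, and the node names and paths are assumptions, not taken from the actual job:

<workflow-app xmlns="uri:oozie:workflow:0.5" name="fork-demo-wf">
    <start to="fork-actions"/>
    <fork name="fork-actions">
        <path start="A1"/>
        <path start="B1"/>
        <path start="C1"/>
    </fork>
    <action name="A1">
        <fs>
            <mkdir path="${nameNode}/tmp/fork-demo/a1"/>
        </fs>
        <ok to="join-actions"/>
        <error to="fail"/>
    </action>
    <action name="B1">
        <fs>
            <mkdir path="${nameNode}/tmp/fork-demo/b1"/>
        </fs>
        <ok to="join-actions"/>
        <error to="fail"/>
    </action>
    <action name="C1">
        <fs>
            <mkdir path="${nameNode}/tmp/fork-demo/c1"/>
        </fs>
        <ok to="join-actions"/>
        <error to="fail"/>
    </action>
    <join name="join-actions" to="end"/>
    <kill name="fail">
        <message>Action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>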