reuse JVM in Hadoop mapreduce jobs

reuse JVM in Hadoop mapreduce jobs - performance

I know we can set the property "mapred.job.reuse.jvm.num.tasks" to re-use JVM. My questions are:
(1) how to decide the number of tasks to be set here, -1 or some other positive integers?
(2) is it a good idea to already reuse JVMs and set this property to the value of -1 in mapreduce jobs?
Thank you very much!

If you have very small tasks that are definitely running after each other, it is useful to set this property to -1 (meaning that a spawned JVM will be reused unlimited times).
So you just spawn (number of task in your cluster available to your job)-JVMs instead of (number of tasks)-JVMs.
This is a huge performance improvement. In long running jobs the percentage of the runtime in comparision to setup a new JVM is very low, so it doesn't give you a huge performance boost.
Also in long running tasks it is good to recreate the task process, because of issues like heap fragmentation degrading your performance.
In addition, if you have some mid-time-running jobs, you could reuse just 2-3 of the tasks, having a good trade-off.

JVM reuse(only possible in MR1) should help with performance because it removes the startup lag of the JVM but it is only marginal and comes with a number of drawbacks(read side effects. Most tasks will run for a long time (tens of seconds or even minutes) and startup times are not the problem when you look at those task run times. You would like to start a new task on a clean slate. When you re-use a JVM there is a chance that the heap is not completely clean(it is fragmented from the previous runs). The fragmentation can lead to more GC's and nullify all the start up time gains. If there is a memory leak it could also affect the memory usage etc. So it's better to start a new JVM for the tasks(if the tasks are not reasonably small). In MR2(YARN) - new JVM is always started for the tasks. For Uber tasks - it will run the task in the local JVM only.

Related

How to control how many tasks to run per executor in PySpark [duplicate]

I don't quite understand spark.task.cpus parameter. It seems to me that a “task” corresponds to a “thread” or a "process", if you will, within the executor. Suppose that I set "spark.task.cpus" to 2.
How can a thread utilize two CPUs simultaneously? Couldn't it require locks and cause synchronization problems?
I'm looking at launchTask() function in deploy/executor/Executor.scala, and I don't see any notion of "number of cpus per task" here. So where/how does Spark eventually allocate more than one cpu to a task in the standalone mode?

To the best of my knowledge spark.task.cpus controls the parallelism of tasks in you cluster in the case where some particular tasks are known to have their own internal (custom) parallelism.
In more detail:
We know that spark.cores.max defines how many threads (aka cores) your application needs. If you leave spark.task.cpus = 1 then you will have #spark.cores.max number of concurrent Spark tasks running at the same time.
You will only want to change spark.task.cpus if you know that your tasks are themselves parallelized (maybe each of your task spawns two threads, interacts with external tools, etc.) By setting spark.task.cpus accordingly, you become a good "citizen". Now if you have spark.cores.max=10 and spark.task.cpus=2 Spark will only create 10/2=5 concurrent tasks. Given that your tasks need (say) 2 threads internally the total number of executing threads will never be more than 10. This means that you never go above your initial contract (defined by spark.cores.max).

Enabling Univa Grid Engine Resource Reservation without a time limit on jobs

My organization has a server cluster running Univa Grid Engine 8.4.1, with users submitting various kinds of jobs, some using a single CPU core, and some using OpenMPI to utilize multiple cores, all with varying and unpredictable run-times.
We've enabled a ticketing system so that one user can't hog the entire queue, but if the grid and queue are full of single-CPU jobs, no multi-CPU job can ever start (they just sit at the top of the queue waiting for the required number of cpu slots to become free, which generally never happens). We're looking to configure Resource Reservation such that, if the MPI job is the next in the queue, the grid will hold slots open as they become free until there's enough to submit the MPI job, rather than filling them with the single-CPU jobs that are further down in the queue.
I've read (here for example) that the grid makes the decision of which slots to "reserve" based on how much time is remaining on the jobs running in those slots. The problem we have is that our jobs have unknown run-times. Some take a few seconds, some take weeks, and while we have a rough idea how long a job will take, we can never be sure. Thus, we don't want to start running qsub with hard and soft time limits through -l h_rt and -l s_rt, or else our jobs could be killed prematurely. Resource Reservation appears to be using the default_duration, which we set to infinity for lack of a better number to use, and treating all jobs equally. Its picking slots filled by month-long jobs which have already been running for a few days, instead of slots filled by minute-long jobs which have only been running for a few seconds.
Is there a way to tell the scheduler to reserve slots for a multi-CPU MPI job as they become available, rather than pre-select slots based on some perceived run-time of the jobs in them?

Unfortunately I'm not aware of a way to do what you ask - I think that the reservation is created once at the time that the job is submitted, not progressively as slots become free. If you haven't already seen the design document for the Resource Reservation feature, it's worth a look to get oriented to the feature.
Instead, I'm going to suggest some strategies for confidently setting job runtimes. The main problem when none of your jobs have runtimes is that Grid Engine can't reserve space infinitely in the future, so even if you set some really rough runtimes (within an order of magnitude of the true runtime), you may get some positive results.
If you've run a similar job previously, one simple rule of thumb is to set max runtime to 150% of the typical or maximum runtime of the job, based on historical trends. Use qacct or parse the accounting file to get hard data. Of course, tweak that percentage to whatever suits your risk threshold.
Another rule of thumb is to set the max runtime not based on the job's true runtime, but based on a sense around "after this date, the results won't be useful" or "if it takes this long, something's definitely wrong". If you need an answer by Friday, there's no sense in setting the runtime limit for three months out. Similarly, if you're running md5sum on typically megabyte-sized files, there's no sense in setting a 1-day runtime limit; those jobs ought to only take a few seconds or minutes, and if it's really taking a long time, then something is broken.
If you really must allow true indefinite-length jobs, then one option is to divide your cluster into infinite and finite queues. Jobs specifying a finite runtime will be able to use both queues, while infinite jobs will have fewer resources available; this will incentivize users to work a little harder at picking runtimes, without forcing them to do so.
Finally, be sure that the multi-slot jobs are submitted with the -R y qsub flag to enable the resource reservation system. This could go in the system default sge_request file, but that's generally not recommended as it can reduce scheduling performance:
Since reservation scheduling performance consumption is known to grow with the number of pending jobs, use of -R y option is recommended only for those jobs actually queuing for bottleneck resources.

Apache Spark - How to avoid failing slow tasks

Dear fellow Apache Spark enthusiasts
I recently kicked off a sideline project with the goal of turning a couple of ODROID XU4 computers into a stand-alone Spark Cluster.
After setting up the cluster I ran into a problem that seems to be specific to heterogeneous multi processors. Spark executor tasks run extremely slow on the XU4 when using all 8 processors. The reason, as mentioned in a comment on my post below, is that Spark does not wait for the executors that have been kicked off on the slow processors.
http://forum.odroid.com/viewtopic.php?f=98&t=21369&sid=4276f7dc89a8d7825320e7f705011326&p=152415#p152415
One solution is to use fewer executor cores and to set the CPU affinity to not use the LITTLE processors. This is however a less than ideal solution.
Is there a way to ask Spark to wait a bit longer for feedback from slower executors? Obviously waiting too long will have a negative effect on performance. The positive effect of utilising all cores should however balance out the negative effect.
Thanks in advance for any help!

#Dikei response highlights two potential causes, but it turns out the problem is not the one he suspects. I have the same set up as the #TJVR, and it turns out the driver is missing heartbeats from executors. To address this, I added the following to spark-env.sh:
export SPARK_DAEMON_JAVA_OPTS="-Dspark.worker.timeout=600 -Dspark.akka.timeout=200 -Dspark.shuffle.consolidateFiles=true"
export SPARK_JAVA_OPTS="-Dspark.worker.timeout=600 -Dspark.akka.timeout=200 -Dspark.shuffle.consolidateFiles=true"
This changes the default timeouts for executor heartbeats. Also set spark.shuffle.consolidateFiles to true to improve performance on my ext4 filesystem. These defaults changes allowed me to increased the core usage above one and not frequently lose executors.

Spark does not kill slow executors, but will mark an executor as dead in two cases:
If the driver doesn't receive a heartbeat signal in a while (default: 120s): The executor have to regularly (default: 10s) send a heartbeat message to notify the driver that it is still alive. Network issues or large GC pause can prevent these heartbeat from happening.
The executor has crashed due to exception in the code or JVM runtime error, most likely due to GC pause as well.
In my opinion, it's probably that GC overhead has killed your slowed executor and the driver has to redo the task on a different executor. If this is the case, you can try splitting your data into smaller partitions, so that each executor has to process less data at a time.
Secondly, you should NOT set spark.speculation to 'true' without testing. It's 'false' by default for a reason, I've seen it do more harm than good in some case.
Lastly, the following assumption might not hold true.
The positive effect of utilising all cores should however balance out
the negative effect.
Slow executors (straggles) can cause the program to perform much worse, depending on workload. It's entirely possible that avoiding the slow cores will provide the best result.

why do pentaho jobs ran through kitchen take a lot of cpu resources?

could you please give some small explanation on what happens when kitchen.bat calls a job?
I can only guess that it instantiates it and that probably is the reason why my taskmgr spikes up whenever I call 5 jobs all at the same time. after a couple of seconds, the spikes would wind down.
or maybe not? would there be other reasons that the calling of jobs through kitchen uses a lot of resources?
would there be ways to save the cpu resources while taking advantage of parallelism (calling the jobs all at the same time)? are there optimizations that can be done?

How exactly are you calling 5 jobs at the same time? in the shell script? In which case the spike is because you're starting 5 JVM's at the same time - starting the JVM is relatively expensive. And there should be no need to do this - you can do it all in one JVM and do the parallelisation in the job?
Kitchen itself doesn't specifically use a lot of resources. If your transformation has a large number of steps, then getting that going can take some time, but not ages.
Is this really a problem? Why does it matter if your cpu spikes for a couple of seconds? The point of parallelism is generally to max out the CPU/box/resource!

Is there a way to configure timeout for speculative execution in Hadoop?

I have hadoop job with tasks that are expected to run for significant length of fime (few minues). However hadoop starts speculative execution too soon. I do not want to turn speculative execution completely off but I want to increase duration of time hadoop waits before considering job for speculative execution. Is there a config option to control this timeout?
Thanks

I don't believe the speculative execution time is currently configurable. On the other hand, there's probably no need to adjust it. Speculative execution is meant to bail you out of slow running tasks (usually due to degraded hardware performance). If you have available cluster resources such that spec exec is kicking in, what's the harm in letting it do so? Note that minutes is not considered "significant" and is more than normal for medium or larger size jobs.
It's also worth noting that while mapper spec exec is almost always fine and low overhead to the system, reducer spec exec can hurt and probably should be disabled. The rationale is that if a mapper is progressing slowly and there are available resources where the data is local (normal), there's no shared overhead. If a reducer is performing slowly, starting another attempt of the same task will simply double the network load - normally the most painful part of reducer execution. If the network is what is causing the reducer to be "slow," starting a second attempt only hurts both attempts.
If you truly have a use case for adjusting the spec exec time, it might be worth filing a jira at http://issues.apache.org.
Hope this helps.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio