dask.submit: Insufficient memory/load allocation in scheduler - memory-management

I have a Dask server running fine:
Then I start pre-filling some data on the workers using publish_dataset, and the memory looks pretty evenly allocated:
Next, I call futures.append(client.submit(fn, values)) in a loop and afterwards client.gather(futures) - which works! But it is not as fast as expected, because most of the computation/memory load is concentrated on two single processes (worker 4 and worker 1):
Questions:
Why is Dask doing it like this?
Is it possible to pre-load data on specific workers and then submit to those specific workers?
Should I somehow re-allocate the used memory between the workers to enable better parallelization?
Related: Dask Worker Stealing, Register Worker callbacks, and Dask Actors per worker

Why is Dask doing it like this?
Unfortunately I don't know enough about your problem to know why Dask is choosing to do what it's doing.
Is it possible to pre-load data on specific workers and then submit to those specific workers?
Yes. See the workers= keyword to scatter and submit.
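For illustration, a minimal sketch of pinning both the data and the computation to one worker (the addresses here are placeholders, and sum stands in for your fn):

from dask.distributed import Client

client = Client("tcp://scheduler:8786")            # placeholder scheduler address

data = list(range(1_000_000))                      # stand-in for your pre-filled data

# Place the data on a chosen worker, then route the computation to the same
# worker so no inter-worker transfer is needed.
[data_fut] = client.scatter([data], workers=["tcp://worker-1:9000"])
result_fut = client.submit(sum, data_fut, workers=["tcp://worker-1:9000"])
print(result_fut.result())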
Should I somehow re-allocate the used memory between the workers to enable better parallelization?
Dask should handle this for you. I recommend using scatter rather than publish_dataset, which doesn't do what you think it does (I recommend looking through the docstring for each method).
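Continuing the sketch above, the scatter-based version of your submit loop might look like this (chunks and sum are illustrative stand-ins for your inputs and fn):

# Reusing `client` from the sketch above. Scatter the inputs instead of
# publishing them; the scheduler then knows where every chunk lives and can
# run each task on a worker that already holds its data.
chunks = [list(range(i, i + 1000)) for i in range(0, 10_000, 1000)]
futures = client.scatter(chunks)                   # one future per chunk
results = client.gather([client.submit(sum, fut) for fut in futures])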

Related

Parallelism in Apache Storm with one worker node

I am trying to parallelize my topology using Apache Storm, but it gives me a java.util.ConcurrentModificationException error on the worker nodes if I increase the number of workers > 1. It works fine with 1 worker and in a local cluster. I want a way to parallelize my topology and measure different parameters like throughput, latency, emit rate, etc. using one worker node only.
Based on the stack trace you posted, it looks like Kryo is trying to serialize an ArrayList and hitting a ConcurrentModificationException. I would look for any place you emit an ArrayList and make sure that you don't modify it after you've passed it to OutputCollector.emit.
Likely the reason you're not seeing this issue when you only have one worker is that Storm only serializes emitted objects when they need to be sent to a different worker.
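The fix pattern is to emit a snapshot rather than the live list. A minimal Python sketch of the idea (emit below is a stand-in for Java's OutputCollector.emit, which may serialize the tuple on another thread):

from typing import Any, List

def emit(values: List[Any]) -> None:
    # Stand-in for OutputCollector.emit: in Storm, the emitted list may be
    # serialized by Kryo on a worker thread long after this call returns.
    pass

records: List[Any] = ["some-record"]

emit(records)           # unsafe: Kryo may still be reading `records`...
records.clear()         # ...while we mutate it -> ConcurrentModificationException

records.append("next-record")
emit(list(records))     # safe: emit a shallow copy (a snapshot)
records.clear()         # mutating the original can no longer race the serializer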

Parallel Processing with Starting New Task - front end screen timeout

I am running an ABAP program that works with a huge amount of data. The SAP documentation says that I should use Remote Function Modules with the addition STARTING NEW TASK to process the data.
So my program first selects all the data, breaks the data into packages and calls a function module with a package of data for further processing.
So that's my pseudo code:
SELECT keyfield FROM mysap_table
       INTO TABLE key_table
       PACKAGE SIZE 500.
  " Each 500-row package becomes one row of ALL_KEYS_TABLE
  APPEND key_table TO all_keys_table.
ENDSELECT.

LOOP AT all_keys_table ASSIGNING <fs_table>.
  CALL FUNCTION 'Z_MASS_PROCESSING'
    STARTING NEW TASK 'TEST'
    DESTINATION IN GROUP DEFAULT
    EXPORTING
      it_data = <fs_table>.
ENDLOOP.
But I am surprised to see that the calls to my function module use Dialog Processes instead of Background Processes.
So now I have encountered the problem that one of my Dialog Processes was killed after 60 minutes because of a timeout.
To me, it seems that STARTING NEW TASK is not the right solution for parallel processing of mass data.
What will be the alternative?
As already mentioned, that's not an easy topic that can be handled with a few lines of code. The general steps you have to carry out in a thoughtful way to gain the desired benefit are (see the sketch after this list):
1) Get free work processes available for parallel processing
2) Slice your data in packages to be processed
3) Call an RFC enabled function module asynchronously for each package with the available work processes. Handle waiting for free work processes, if packages > available processes
4) Receive your results asynchronously
5) Wait till everything is processed and merge the data together again and assure that every package was handled properly
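In ABAP, step 3 is the CALL FUNCTION ... STARTING NEW TASK call, and step 4 is typically a PERFORMING ... ON END OF TASK callback plus RECEIVE RESULTS FROM FUNCTION. Language-agnostically, the five steps amount to the following pattern; this is only a Python sketch, with process_package standing in for your RFC-enabled function module:

from concurrent.futures import ThreadPoolExecutor, as_completed

def process_package(package):
    # Illustrative stand-in for the RFC-enabled function module.
    return [row.upper() for row in package]

data = [f"row{i}" for i in range(5000)]
packages = [data[i:i + 500] for i in range(0, len(data), 500)]       # step 2

results = []
with ThreadPoolExecutor(max_workers=4) as pool:                      # step 1
    futures = [pool.submit(process_package, p) for p in packages]    # step 3
    for fut in as_completed(futures):                                # step 4
        results.extend(fut.result())     # .result() raises if a package failed

assert len(results) == len(data)         # step 5: merged and all accounted for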
Although it is bad practice to just post links, the code is very long and would make this answer very messy, therefore take a look at the following links:
Example1-aRFC
Example2-aRFC
Example3-aRFC
Other RFC variants (e.g. qRFC, tRFC, etc.) can be found here with a short description, but sadly I cannot give you further insight into them.
EDIT:
Regarding the process type of aRFC:
In parallel processing, a job step is started as usual in a background processing work process. (...) While the job itself runs in a background process, the parallel processing tasks that it starts run in dialog work processes. Such dialog work processes may be located on any SAP server.
The server group is specified with GROUP (default: parallel_generators, see transaction RZ12) and can have its own resources just for parallel processing. If your process times out, you have to slice your packages into different sizes.
I think the best way for parallel processing in SAP is the Bank Parallel Processing framework, as Jagger mentioned. Unfortunately it's rarely mentioned in any resource and it's not documented well.
Actually, the best documentation I found was in this book:
https://www.sap-press.com/abap-performance-tuning_2092/
Yes, it's tricky. It took me about 5 or 6 days to get it going. But the results were good.
All the stuff is situated in package BANK_PP_JOBCTRL and you can use its name for googling.
The main idea there is to divide all your work into steps (simplified):
1. Preparation
2. Parallel processing
   2.1. Processing preparation
   2.2. Processing
(Actually there are more steps there.)
The first step is not parallelized. Here you should prepare all your data for parallel processing and divide it into 'pieces' which will be processed in parallel.
The contents of the pieces, in turn, can be IDs or preloaded data as well.
After that, you can run step 2 in parallel processing.
The great benefit of all this is that an error in one piece of parallel work won't crash your entire processing.
I recommend you check the demo in function group BANK_API_PP_DEMO.
To implement parallel processing, you need to do a bit more than just add that clause. The information is contained in this help topic. A lot of design effort needs to be devoted to ensuring that the communication and result-merging overhead of the parallel processing does not negate the performance advantage gained by it in the first place, and that referential integrity of the data is maintained even when some of the parallel tasks fail. Do not underestimate the complexity of this task.
You could make use of the bgRFC technique. This is a newer method of background processing from SAP.
bgRFC offers, in addition to the already existing IN BACKGROUND TASK, the possibility to configure and monitor all calls which run through this method.
You can read more documentation on the different possibilities here. This all (of course) depends on your SAP version.

How to write rows asynchronously in Spark Streaming application to speed up batch execution?

I have a Spark job where I need to write the output of a SQL query every micro-batch. The write is an expensive operation performance-wise and is causing the batch execution time to exceed the batch interval.
I am looking for ways to improve the performance of write.
Is doing the write action in a separate thread asynchronously, as shown below, a good option?
Would this cause any side effects because Spark itself executes in a distributed manner?
Are there other/better ways of speeding up the write?
import java.util.concurrent.Executors

// Create a fixed thread pool to execute asynchronous tasks
val executorService = Executors.newFixedThreadPool(2)

dstream.foreachRDD { rdd =>
  import org.apache.spark.sql._
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate
  import spark.implicits._
  import spark.sql

  val records = rdd.toDF("record")
  records.createOrReplaceTempView("records")
  val result = spark.sql("select * from records")

  // Submit an asynchronous task to write
  executorService.submit {
    new Runnable {
      override def run(): Unit = {
        result.write.parquet(output)
      }
    }
  }
}
1 - Is doing the write action in a separate thread asynchronously, as shown below, a good option?
No. The key to understanding the issue here is to ask 'who is doing the write'. The write is done by the resources allocated for your job on the executors in the cluster. Placing the write command on an async threadpool is like adding a new office manager to an office with a fixed staff. Will two managers be able to do more work than one alone, given that they have to share the same staff? Well, one reasonable answer is 'only if the first manager was not giving them enough work, so there's some free capacity'.
Going back to our cluster, we are dealing with a write operation that is heavy on IO. Parallelizing write jobs will lead to contention for IO resources, making each independent job longer. Initially, our job might look better than the 'single manager version', but trouble will eventually hit us.
I've made a chart that attempts to illustrate how that works. Note that the parallel jobs will take longer proportionally to the amount of time that they are concurrent in the timeline.
Once we reach that point where jobs start getting delayed, we have an unstable job that will eventually fail.
2 - Would this cause any side effects because Spark itself executes in a distributed manner?
Some effects I can think of:
Probably higher cluster load and IO contention.
Jobs are queuing on the threadpool queue instead of on the Spark Streaming queue. We lose the ability to monitor our job through the Spark UI and monitoring API, as the delays are 'hidden' and all looks fine from the Spark Streaming point of view.
3 - Are there other/better ways of speeding up the write?
(ordered from cheap to expensive; a sketch of points 2 and 3 follows the list)
If you are appending to a parquet file, create a new file often. Appending gets expensive with time.
Increase your batch interval or use Window operations to write larger chunks of Parquet. Parquet likes large files
Tune the partition and distribution of your data => make sure that Spark can do the write in parallel
Increase cluster resources, add more nodes if necessary
Use faster storage
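As an example, points 2 and 3 might look like this in PySpark (the DataFrame and output path are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-tuning").getOrCreate()
df = spark.range(1_000_000).toDF("record")    # stand-in for one large windowed batch

# Repartition so each executor core writes its own Parquet part file in
# parallel, producing fewer, larger files per (windowed) batch.
df.repartition(8).write.mode("append").parquet("/tmp/records_parquet")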
Is doing the write action in a separate thread asynchronously, as shown below, a good option?
Yes. It's certainly something to consider when optimizing expensive queries and saving their results to external data stores.
Would this cause any side effects because Spark itself executes in a distributed manner?
I don't think so. SparkContext is thread-safe and supports this kind of query execution.
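As a minimal illustration of that thread-safety (paths and sizes are arbitrary):

import threading
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("threaded-writes").getOrCreate()

def write_range(path, n):
    # Each thread submits its own Spark job through the shared, thread-safe
    # SparkContext; the jobs are scheduled independently of each other.
    spark.range(n).write.mode("overwrite").parquet(path)

threads = [threading.Thread(target=write_range, args=(f"/tmp/out_{i}", 10_000))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()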
Are there other/better ways of speeding up the write?
YES! That's the key to understanding when to use the other (above) options. By default, Spark applications run in FIFO scheduling mode.
Quoting Scheduling Within an Application:
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
That means that to make room for executing multiple writes asynchronously and in parallel, you should configure your Spark application to use the FAIR scheduling mode (via the spark.scheduler.mode property).
You will have to configure so-called Fair Scheduler Pools to "partition" executor resources (CPU and memory) into pools that you can assign to jobs using spark.scheduler.pool property.
Quoting Fair Scheduler Pools:
Without any intervention, newly submitted jobs go into a default pool, but jobs’ pools can be set by adding the spark.scheduler.pool "local property" to the SparkContext in the thread that’s submitting them.
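Putting it together, a PySpark sketch (the pool name "writes" is arbitrary; pool definitions can additionally be supplied via the spark.scheduler.allocation.file property):

from pyspark.sql import SparkSession

# FAIR mode must be set before the SparkContext starts.
spark = (SparkSession.builder
         .appName("async-writes")
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())

# Each submitting thread picks its pool; jobs in different pools then share
# executor resources round-robin instead of queuing strictly FIFO.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "writes")
spark.range(100).write.mode("overwrite").parquet("/tmp/fair_pool_demo")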

How to increment a hadoop counter from outside a mapper or reducer?

I want to add something to the Hadoop counters from outside a mapper.
So, I want to access getCounter on the context object like this:
context.getCounter(counter, key).increment(amount)
I'm not able to get the context object from where I start the job. I can only do
job.getCounters().findCounter()
which doesn't let me add anything to the Hadoop counters.
You can only use/write to the counters from within the mapper/reducer tasks. The JobTracker has built-in capabilities to interact with the counters, and you don't really want to interfere with what is already a complex setup.
I had exactly this issue a few months ago, trying to use the counters to store interim information, but I decided to write the information I needed to a defined HDFS directory and read it back once my job was complete.
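(For completeness: from inside a task this is easy even without Java. A Hadoop Streaming mapper written in Python, for example, can bump a counter by printing a reporter line to stderr; the group and counter names below are illustrative.)

import sys

def increment_counter(group, counter, amount=1):
    # Hadoop Streaming scans task stderr for lines of this form and
    # increments the named counter; it only works while the task is running.
    sys.stderr.write(f"reporter:counter:{group},{counter},{amount}\n")

for line in sys.stdin:
    increment_counter("MyStats", "records_seen")
    sys.stdout.write(line)    # pass the record through unchanged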
EDIT: why and what do you want to use the counter for outside of the mapper?
EDIT #2: if you want stats from a finished job, then counters are not the right place for that, as a) they don't seem to be writable once the JobTracker is done collecting data and b) they are intended to be used for aggregating metrics across tasks. I had a similar need recently and ended up doing my stats sums in the job set-up class (on my edge node) and then writing the data to the logs.

Restrict Matlab's parallel toolbox from using all cores

Is there a possible way to deny Matlab access to all cores? There are 8 on the machine at present, but I want to reduce Matlab's usage to 3 per user so that one user doesn't start a job on all 8, slowing down others in the process.
I don't have a distributed computing server license... just plain old parallel proc toolbox
There's no way to enforce a strict limit from within MATLAB, but you can set the "ClusterSize" property of the local scheduler. Unfortunately, this must be done per user. Other than that, you would need to use an OS function, but I'm not sure if such a thing exists.
You can voluntarily restrict the number of workers used for a job by setting the MaximumNumberOfWorkers property of the job object before you submit it.
jobMgr = findResource(...appropriate parameters for your job manager here...);
job = createJob(jobMgr);
set(job, 'MaximumNumberOfWorkers', 3);
% create some tasks and add them to the job here
submit(job);
waitForState(job, 'finished');
results = getAllOutputArguments(job);
When you say per user, are you implying that multiple users are able to submit to this machine at once? This would certainly complicate things, but I have some example code that might reveal some commands you are not familiar with that will help you reach your goal.
Here is some code I use to grab as many cores as possible, up to 8:
ncores = feature('numCores');   % undocumented but widely used core-count query
nworkers = min(8, ncores);      % cap the number of workers at 8
