Why is parallel processing taking longer than the non-parallel code? - parallel-processing

The idea is to update a particular pre-trained word2vec model with different sets of new corpora. I have the following code:
from multiprocessing import Pool
import gensim

# c1, c2, ..., c10 are each a list of 100 files
filelist = [c1, c2, c3, c4, c5, c6, c7, c8, c9, c10]

def update_model(files):
    # load a pre-trained model
    trained_model = gensim.models.Word2Vec.load("model_both_100")
    # DocumentFeeder is an iterable over the given files
    docs = DocumentFeeder(files)
    trained_model.build_vocab(docs, update=True)
    trained_model.train(docs, total_examples=trained_model.corpus_count,
                        epochs=trained_model.epochs)

with Pool(processes=10) as P:
    P.map(update_model, filelist)
It takes about ~13 minutes to run, but the non-parallel version (looping over filelist) takes ~11 minutes. Why is this happening? I am running on a 12-core CPU.

Gensim's Word2Vec training already uses multiple threads – depending on the workers parameter at model creation. (The default is to use workers=3, but your model may have been initialized to use even more.)
So you are launching 10 (heavyweight) processes, each separately loading a full-size model. That could easily trigger heavy memory usage & thus virtual-memory swapping.
Then each of those models does its own (single-threaded) vocabulary-expansion, then its training (one manager thread and 3 or more worker threads). If they're all in training simultaneously, that means 40 threads active within 10 OS processes on your 12-core processor. There's no reason to expect a speedup in such a situation, and the contention of more threads than cores, all competing for access to totally different loaded-model memory ranges, could easily explain a slowdown.
Are you really trying to create 10 separate incrementally-updated models? (Do they get re-saved to 10 different filenames after the update-training?)
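If so, one baseline worth trying is sketched below, under the assumption that DocumentFeeder and the saved model name are exactly as in the question (the output filenames are made up for illustration): update the models one at a time and let gensim's own worker threads supply the parallelism, so only one full-size model is in memory at once.

import gensim

# Sequential baseline: one model in memory at a time; training itself is
# parallelized internally by gensim's worker threads (the `workers` setting).
for i, files in enumerate(filelist):
    model = gensim.models.Word2Vec.load("model_both_100")
    model.workers = 10                    # let one training use most of the 12 cores
    docs = DocumentFeeder(files)          # iterable from the question
    model.build_vocab(docs, update=True)
    model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)
    model.save("model_both_100_updated_%d" % i)   # hypothetical output names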

Related

Setting Hugging Face dataloader_num_workers for multi-GPU training

Should the HuggingFace transformers TrainingArguments dataloader_num_workers argument be set per GPU, or total across GPUs? And does the answer change depending on whether the training is running in DataParallel or DistributedDataParallel mode?
For example if I have a machine with 4 GPUs and 48 CPUs (running only this training task), would there be any expected value in setting dataloader_num_workers greater than 12 (48 / 4)? Or would they all start contending over the same resources?
As I understand it, when running in DDP mode (with torch.distributed.launch or similar), one training process manages each device, whereas in the default DP mode one lead process manages everything. So maybe the answer to this is 12 for DDP but ~47 for DP?
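For reference, this is where the argument in question gets set; a minimal sketch with hypothetical values for the 4-GPU/48-CPU machine described above (the 12 is just the question's own 48/4 arithmetic, not a recommendation):

from transformers import TrainingArguments

# Hypothetical settings for a 4-GPU, 48-CPU machine. In DDP, each of the 4
# training processes builds its own DataLoader with this many worker processes.
training_args = TrainingArguments(
    output_dir="out",                 # placeholder path
    per_device_train_batch_size=32,   # placeholder
    dataloader_num_workers=12,        # 48 CPUs / 4 GPUs, per the question
)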

Are all processor cores on a cache-coherent system required to see the same value of shared data at any point in time?

From what I've learnt, cache coherence is defined by the following 3 requirements:
1. A read R from an address X on a core C returns the value written by the most recent write W to X on C, if no other core has written to X between W and R.
2. If a core C1 writes to X and a core C2 reads after a sufficient time, and there are no other writes in between, C2's read returns the value from C1's write.
3. Writes to the same location are serialized: any two writes to X must be seen to occur in the same order on all cores.
As far as I understand these rules, they basically require all threads to see updates made by other threads within some reasonable time and in the same order, but there seems to be no requirement about seeing the same data at any point in time. For example, say thread A wrote a value to a shared memory location X, then thread B wrote another value to X. Threads C and D reading from X must see the same order of updates: A, B. Imagine that thread C has already seen both updates A and B, while thread D has only observed A (the event B is yet to be seen). Provided that the time interval between writes to X and reads from X is small enough (less than what we consider a sufficient time), this situation doesn't violate any rules of coherence, does it?
On the other hand, coherence protocols, e.g. MSI use write-invalidation to guarantee that all cores have an up-to-date value of a shared variable. Wiki says: "The intention is that two clients must never see different values for the same shared data". If what I wrote about the coherence rules is true, I don't understand where this point comes from. I mean, I realize it's useful, but don't see where it is defined.

H2O - Not seeing much speed-up after moving to a more powerful machine

I am running a Python program that calls H2O for deep learning (training and testing). The program runs in a loop of 20 iterations, and in each iteration calls H2ODeepLearningEstimator() 4 times, along with the associated predict() and model_performance() calls. I am doing h2o.remove_all() and cleaning up all data-related Python objects after each iteration.
Data size: training set of 80,000 rows with 122 features (all float), with 20% used for validation (10-fold CV); test set of 20,000 rows. Doing binary classification.
Machine 1: Windows 7, 4-core Xeon, 3.5 GHz per core, 32 GB memory
Takes about 24 hours to complete
Machine 2: CentOS 7, 20-core Xeon, 2.0 GHz per core, 128 GB memory
Takes about 17 hours to complete
I am using h2o.init(nthreads=-1, max_mem_size = 96)
So, the speed-up is not that much.
My questions:
1) Is the speed-up typical?
2) What can I do to achieve substantial speed-up?
2.1) Will adding more cores help?
2.2) Are there any H2O configuration or tips that I am missing?
Thanks very much.
- Mohammad,
Graduate student
If the training time is the main effort, and you have enough memory, then the speed-up will be roughly proportional to cores times core-speed. So you might have expected a 40/14 ≈ 2.85× speed-up (i.e. your 24 hrs coming down to the 8-10 hour range).
There is a typo in your h2o.init(): 96 should be "96g". However, I think that was a typo when writing the question, as h2o.init() would return an error message. (And H2O would fail to start if you'd tried "96", with the quotes but without the "g".)
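For clarity, the corrected call would look like this (assuming you really do want to give H2O 96 GB of machine 2's 128 GB):

import h2o

# max_mem_size takes a string with a unit suffix, e.g. "96g" for 96 GB
h2o.init(nthreads=-1, max_mem_size="96g")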
You didn't show your h2o.deeplearning() command, but I am guessing you are using early stopping. And that can be unpredictable. So, what might have happened is that your first 24hr run did, say, 1000 epochs, but your second 17hr run did 2000 epochs. (1000 vs. 2000 would be quite an extreme difference, though.)
It might be that you are spending too much time scoring. If you've not touched the defaults, this is unlikely. But you could experiment with train_samples_per_iteration (e.g. set it to 10 times the number of your training rows).
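As a rough sketch of that experiment (train_samples_per_iteration is an H2ODeepLearningEstimator parameter; the value simply follows the 10x suggestion above for 80,000 training rows):

from h2o.estimators.deeplearning import H2ODeepLearningEstimator

# Score less often: roughly 10 passes' worth of rows between scoring events
# (80,000 training rows x 10).
dl = H2ODeepLearningEstimator(train_samples_per_iteration=800000)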
What can I do to achieve substantial speed-up?
Stop using cross-validation. That might be a bit controversial, but personally I think 80,000 training rows is going to be enough to do an 80%/10%/10% split into train/valid/test. That will be 5-10 times quicker.
If it is for a paper, and you want to show more confidence in the results, once you have your final model, and you've checked that test score is close to valid score, then rebuild it a couple of times using a different seed for the 80/10/10 split, and confirm you end up with the same metrics. (*)
*: By the way, take a look at the score for each of the 10 cv models you've already made; if they are fairly close to each other, then this approach should work well. If they are all over the place, you might have to re-consider the train/valid/test splits - or just think about what it is in your data that might be causing that sensitivity.
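A minimal sketch of the 80/10/10 split suggested above, assuming data is the full H2OFrame and feature_cols / "label" stand in for your actual column names:

from h2o.estimators.deeplearning import H2ODeepLearningEstimator

# 80/10/10 split instead of 10-fold CV; split_frame returns the frames
# in the order of the ratios, with the remainder as the last frame.
train, valid, test = data.split_frame(ratios=[0.8, 0.1], seed=1234)

dl = H2ODeepLearningEstimator()
dl.train(x=feature_cols, y="label", training_frame=train, validation_frame=valid)
perf = dl.model_performance(test_data=test)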

What is the best general purpose computing practice in OpenCL for iterative problems?

When we have a program that requires lots of operations over a large data set, and the operations on each of the data elements are independent, OpenCL can be a good choice to make it faster. I have a program like the following:
while( function(b,c) != TRUE )
{
    [X,Y] = function1(BigData);
    M = functionA(X);
    b = function2(M);
    N = functionB(Y);
    c = function3(N);
}
Here function1 is applied to each element of BigData and produces two more big data sets (X, Y). The remaining functions are then applied element-wise along the X and Y branches (via the intermediate results M and N), respectively.
Since all of these functions operate on each element of the data sets independently, using a GPU might make it faster. So I came up with the following:
while( function(b,c) != TRUE )
{
    //[X,Y] = function1(BigData);
    1.  Load kernel1 and BigData on the GPU. Each thread works on one data
        element and saves its result to X and Y on the GPU.

    //M = functionA(X);
    2a. Load kernel2 on the GPU. Each thread works on one data element of X
        and saves its result to M on the GPU.
        (workItems = n1, workgroup size = y1)

    //b = function2(M);
    2b. Load kernel2 (same kernel) on the GPU. Each thread works on one data
        element of M and saves its result to B on the GPU.
        (workItems = n2, workgroup size = y2)

    3.  Read the data B into host variable b.

    //N = functionB(Y);
    4a. Load kernel3 on the GPU. Each thread works on one data element of Y
        and saves its result to N on the GPU.
        (workItems = n1, workgroup size = y1)

    //c = function3(N);
    4b. Load kernel3 (same kernel) on the GPU. Each thread works on one data
        element of N and saves its result to C on the GPU.
        (workItems = n2, workgroup size = y2)

    5.  Read the data C into host variable c.
}
However, the overhead involved in this code seems significant to me (I have implemented a test program and run it on a GPU). And if the kernels need some sort of synchronization, it might end up being even slower.
I also believe this workflow is fairly common. So what is the best practice for using OpenCL to speed up a program like this?
I don't think there's a general problem with the way you've split up the problem into kernels, although it's hard to say as you haven't been very specific. How often do you expect your while loop to run?
If your kernels do negligible work but the outer loop is doing a lot of iterations, you may wish to combine the kernels into one, and do some number of iterations within the kernel itself, if that works for your problem.
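To make that concrete, here is a minimal pyopencl sketch (an assumption on my part, since the question doesn't name a host API) of fusing the element-wise stages into a single kernel; the kernel body uses made-up arithmetic as a stand-in for function1/functionA/functionB:

import numpy as np
import pyopencl as cl

# function1 fused with functionA and functionB: one launch per loop
# iteration for this part, and no intermediate X/Y buffers on the device.
src = """
__kernel void fused(__global const float *big,
                    __global float *m,
                    __global float *n)
{
    int i = get_global_id(0);
    float x = big[i] * 2.0f;   // stand-in for function1 -> X
    float y = big[i] + 1.0f;   // stand-in for function1 -> Y
    m[i] = x * x;              // stand-in for functionA
    n[i] = y - 1.0f;           // stand-in for functionB
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, src).build()

big = np.random.rand(1_000_000).astype(np.float32)
mf = cl.mem_flags
big_d = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=big)
m_d = cl.Buffer(ctx, mf.WRITE_ONLY, big.nbytes)
n_d = cl.Buffer(ctx, mf.WRITE_ONLY, big.nbytes)

prg.fused(queue, big.shape, None, big_d, m_d, n_d)

m = np.empty_like(big)
n = np.empty_like(big)
cl.enqueue_copy(queue, m, m_d)
cl.enqueue_copy(queue, n, n_d)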
Otherwise:
If you're getting unexpectedly bad performance, you most likely need to be looking at the efficiency of each of your kernels, and possibly their data access patterns. Unless neighbouring work items are reading/writing neighbouring data (ideally: 16 work items read 4 bytes each from a 64-byte cache line at a time) you're probably wasting memory bandwidth. If your kernels contain lots of conditionals or non-constant loop iterations, that will cost you, etc.
You don't specify what kind of runtimes you're getting, on what kind of job size (tens? thousands? millions of arithmetic ops? how big are your data sets?), or on what hardware (compute card? laptop IGPU?). "Significant overhead" can mean a lot of different things. 5 ms? 1 second?
Intel, nVidia and AMD all publish optimisation guides - have you read these?

Why is MPI slower on my laptop?

I am running MPI on my laptop (Intel i7 quad-core 4700m, 12 GB RAM) and the efficiency drops even for codes that involve no inter-process communication. Obviously I cannot just throw 100 processes at it since my machine is only quad-core, but I thought it should scale well up to 8 processes (an Intel quad core presents itself as 8 logical cores???). For example, consider the simple toy Fortran code:
program test
    use mpi
    implicit none
    integer, parameter :: root = 0
    integer :: ierr, rank, nproc, tt, i
    integer :: n = 100000
    real :: s = 0.0, tstart, tend
    ! nproc below is replaced by hand with the actual number of processes
    ! before compiling (see the note after the code)
    complex, dimension(100000/nproc) :: u = 2.0, v = 0.0

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)

    call cpu_time(tstart)
    do tt = 1, 200000
        v = 0.0
        do i = 1, 100000/nproc
            v(i) = v(i) + 0.1*u(i)
        enddo
    enddo
    call cpu_time(tend)

    if (rank == root) then
        print *, 'total time was: ', tend - tstart
    endif

    call MPI_FINALIZE(ierr)
end program test
For 2 processes it takes half the time, but even with 4 processes (which should take a quarter of the time?) the result begins to become less efficient, and for 8 processes there is no improvement whatsoever. Basically I am wondering whether this is just because I am running on a laptop and it has something to do with shared memory, or whether I am making some fundamental mistake in my code. Thanks
Note: In the above example I manually change the nproc in the array declaration and the inner loop to be equal to the number of processors I am using.
A quad-core processor, thanks to hyperthreading, shows itself as having 8 hardware threads, but physically there are just 4 cores. The other 4 threads are scheduled by the hardware itself, using free slots in the execution pipelines.
Especially with compute-intensive loads, this approach often does not pay off at all, and under extreme loads it can even be counter-productive because of the overheads and less-than-optimal cache usage.
You can try disabling hyperthreading in the BIOS and comparing: you will have just 4 threads on 4 cores.
Even going from 1 to 4 processes, there are resources in competition. In particular, each core has its own L1 and L2 caches, while all 4 cores share the L3 cache.
And all the cores obviously share the memory channels.
So you cannot expect linear scaling as you occupy more and more cores, since they have to share resources that are dedicated to a single core/thread in the sequential case.
All of this without involving communications at all.
The same behavior happens on desktops/servers, in particular for memory-intensive loads like the one in your test case.
For example, it's less evident with matrix-matrix multiplication, which is compute-intensive: for an NxN matrix you have O(N^2) memory accesses but O(N^3) floating-point operations.
