I'm trying to use parallelism with word2vec as implemented in the gensim library. I notice that the more I increase the number of threads, the slower the training gets, and I don't know why.
Are there any settings I need to change?
I'm using:
- debian
- python 3.6.9
- cython
How can I benefit from parallelism?
Thanks in advance.
Gensim's default & original way of specifying a training corpus for Word2Vec (& related models) is via a single iterable object which can provide each text example in turn. Then, a single master thread reads from the iterable, parcelling out batches of texts to any number of worker threads (controlled by the workers parameter).
This still faces a few performance bottlenecks that prevent full utilization of large numbers of threads, especially as the number of threads grows.
First, if the iterable object is itself doing any time-consuming work to prepare each item – such as tokenization or preprocessing, or IO to a laggy/remote source – the single master thread may not send off texts as fast as the many workers can process them, becoming the limiting factor. (You can help this somewhat by ensuring your iterable does the least possible IO and regex/text-scanning/lookup work – for example, by using a corpus already tokenized in memory, or by reading an already tokenized/preprocessed corpus from disk that needs only simple splitting on whitespace/linebreaks – see the sketch below.)
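A minimal sketch of that iterable-plus-workers setup, assuming recent gensim and a hypothetical pre-tokenized file (one text per line, tokens separated by spaces):

import multiprocessing
from gensim.models import Word2Vec

class PreTokenizedCorpus:
    # Streams an already-tokenized corpus, doing only cheap whitespace splitting.
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path, encoding='utf-8') as f:
            for line in f:
                yield line.split()

corpus = PreTokenizedCorpus('corpus_tokenized.txt')  # hypothetical file name
model = Word2Vec(
    sentences=corpus,
    workers=4,  # with the iterable approach, more workers is not always faster
)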
Second, Python's 'Global Interpreter Lock' (GIL) means most pure-Python code can only be run by one thread at a time. Gensim uses Cython & library code to enable much of the workers' most-intense tasks to happen outside this GIL bottleneck, but some aspects of each thread's control-loops & inter-thread result handoffs still need the GIL. So as the number of worker threads grows, contention over the GIL becomes more of a limiting factor – and thus even with 16+ cores, training throughput often maxes out somewhere around 5-12 threads. (Some parameter choices that intensify certain aspects of the training – like larger vector sizes or more negative examples – can reduce the contention, but may not improve runtime, as those options just reclaim contended time for more calculation.)
Recent versions of gensim include an alternate method of supplying the corpus, if you can make your corpus available as a single file where each text is on its own line with all tokens separated by whitespace. In that case, every worker can open its own view onto a range of the file, allowing their training to proceed completely without the GIL/interthread handoffs.
To use this alternative, specify your corpus using the corpus_file parameter, as a path to the file.
This parameter is mentioned in the Word2Vec class docs and there's some more discussion of its usage in the release notes for gensim version 3.6.0.
With this option, training throughput should generally improve nearly linearly with each additional worker thread, up to the number of CPU cores available. (Note that the initial once-through vocabulary-building survey of the corpus is still single-threaded.)
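A hedged sketch of the corpus_file alternative (the path is hypothetical; corpus_file and workers are the parameters described in the gensim docs):

from gensim.models import Word2Vec

# corpus_file expects one text per line, tokens separated by whitespace (LineSentence format)
model = Word2Vec(
    corpus_file='corpus_tokenized.txt',
    workers=8,  # with corpus_file, throughput should scale close to linearly up to your core count
)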
I am trying to improve the speed of my reinforcement learning algorithm by using multiprocessing to have multiple workers generating experience at the same time. Each process just runs the forward pass of my neural net; no gradient computation is needed.
As I understand it, when passing Tensors and nn.Modules across process boundaries (using torch.multiprocessing.Queue or torch.multiprocessing.Pool), the tensor data is moved to shared memory, which shouldn't be any slower than non-shared memory.
However, when I run my multiprocess code with 2 processes (on an 8 core machine), I find that my pytorch operations become more than 30x slower, more than counteracting the speedup from running two processes simultaneously.
I profiled my application to find which operations specifically are slowing down. I found that much of my time was spent in nn.functional.linear(), specifically on this line, which is a Tensor.matmul call:
output = input.matmul(weight.t())
I added a timer just to this specific matmul call, and I found that when one process is running, this operation takes less than 0.3 milliseconds, but when two processes are running, it takes more than 10 milliseconds. Note that in both cases the weight matrix has been put in shared memory and passed across process boundaries to a worker process; the only difference is that in the second case there are two worker processes instead of one.
For reference, the shapes of input and weight tensors are torch.Size([1, 24, 180]) and torch.Size([31, 180]), respectively.
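For concreteness, the measurement amounts to something like this minimal stand-alone sketch with the shapes above (tensor contents are random; variable names are illustrative, not from my actual code):

import time
import torch

input = torch.randn(1, 24, 180)
weight = torch.randn(31, 180)

start = time.perf_counter()
output = input.matmul(weight.t())   # the call being timed
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"matmul took {elapsed_ms:.3f} ms, output shape {tuple(output.shape)}")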
What could be causing this drastic slowdown? Is there some subtlety to using torch multiprocessing or shared memory that is not mentioned in any of the documentation? I feel like there must be some hidden lock causing contention here, because this drastic slowdown makes no sense to me.
It seems like this was caused by a bad interaction of OpenMP (used by pytorch by default) and multiprocessing. This is a known issue in pytorch (https://github.com/pytorch/pytorch/issues/17199) and I was even hitting deadlocks in certain configurations I used to debug. Turning off OpenMP using torch.set_num_threads(1) fixed the issue for me, and immediately sped up my tensor operations in the multi-process case, presumably by bypassing internal locking that OpenMP was doing.
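A minimal sketch of how that fix might be applied, assuming a queue-based worker setup like the one described in the question (the worker function, queues, and tiny model here are illustrative stand-ins, not the original code):

import torch
import torch.multiprocessing as mp

def worker(model, in_queue, out_queue):
    # Restrict intra-op (OpenMP) parallelism inside each worker so the
    # processes don't contend over threads/locks.
    torch.set_num_threads(1)
    with torch.no_grad():  # forward passes only, no gradients needed
        while True:
            x = in_queue.get()
            if x is None:  # sentinel: shut down
                break
            out_queue.put(model(x))

if __name__ == "__main__":
    model = torch.nn.Linear(180, 31)
    model.share_memory()  # weights live in shared memory, passed once to each worker
    in_q, out_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(model, in_q, out_q)) for _ in range(2)]
    for p in procs:
        p.start()
    in_q.put(torch.randn(1, 24, 180))
    print(out_q.get().shape)  # torch.Size([1, 24, 31])
    for _ in procs:
        in_q.put(None)
    for p in procs:
        p.join()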
My MPI experience showed that the speedup does not increase linearly with the number of nodes we use (because of the cost of communication). My experience is similar to this: [speedup vs. number of nodes plot omitted].
Today a speaker said: "Magically (smiles), in some occasions we can get more speedup than the ideal one!".
He meant that ideally, when we use 4 nodes, we would get a speedup of 4. But on some occasions we can get a speedup greater than 4, with 4 nodes! The topic was related to MPI.
Is this true? If so, can anyone provide a simple example of that? Or maybe he was thinking about adding multithreading to the application (he ran out of time and had to leave right away, so we could not discuss it)?
Parallel efficiency (speed-up / number of parallel execution units) over unity is not at all uncommon.
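For reference, using the standard definitions (with $T_1$ the run time on one execution unit and $T_p$ the run time on $p$ units):

$$S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}$$

Super-linear speed-up is the case $S(p) > p$, i.e. parallel efficiency $E(p) > 1$.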
The main reason for that is the total cache size available to the parallel program. With more CPUs (or cores), one has access to more cache memory. At some point, a large portion of the data fits inside the cache and this speeds up the computation considerably. Another way to look at it is that the more CPUs/cores you use, the smaller the portion of the data each one gets, until that portion could actually fit inside the cache of the individual CPU. This is sooner or later cancelled by the communication overhead though.
Also, your data shows the speed-up compared to the execution on a single node. Using OpenMP could remove some of the overhead when using MPI for intranode data exchange and therefore result in better speed-up compared to the pure MPI code.
The problem comes from the incorrectly used term ideal speed-up. Ideally, one would account for cache effects. I would rather use the term linear speed-up instead.
Not too sure this is on-topic here, but here goes nothing...
This super-linearity in speed-up can typically occur when you parallelise your code while distributing the data in memory with MPI. In some cases, by distributing the data across several nodes / processes, you end up with sufficiently small chunks of data for each individual process that they fit in the processor's cache. This cache effect can have a huge impact on the code's performance, leading to great speed-ups and compensating for the increased need for MPI communications... This can be observed in many situations, but it isn't something you can really count on to compensate for poor scalability.
Another case where you can observe this sort of super-linear scalability is when you have an algorithm where you distribute the task of finding a specific element in a large collection: by distributing the work, one of the processes/threads can end up finding the result almost immediately, just because it happened to be given a range of indexes starting very close to the answer. But this case is even less reliable than the aforementioned cache effect.
Hope that gives you a flavour of what super-linearity is.
Cache has been mentioned, but it's not the only possible reason. For instance, you could imagine a parallel program which does not have sufficient memory to store all its data structures at low node counts, but does at high node counts. Thus at low node counts the programmer may have been forced to write intermediate values to disk and then read them back in again, or alternatively re-calculate the data when required. At high node counts, however, these games are no longer required and the program can store all its data in memory. Thus super-linear speed-up is a possibility because at higher node counts the code is simply doing less work, using the extra memory to avoid I/O or recalculation.
Really this is the same as the cache effects noted in the other answers: using extra resources as they become available. And this is really the trick – more nodes doesn't just mean more cores, it also means more of all your resources. So while speed-up nominally measures how well you use the cores, if you can also put those other extra resources to good effect, you can achieve super-linear speed-up.
I'm creating a multi-threaded application in Linux. Here is the scenario:
Suppose I have x instances of a class BloomFilter and some y GB of data (greater than the memory available). I need to test membership for this y GB of data against each of the bloom filter instances. It is pretty clear that parallel programming will help speed up the task; moreover, since I am only reading the data, it can be shared across all processes or threads.
Now I am confused about which one to use: Cilk, Cilk++, or OpenMP (which one is better?). I am also confused about which way to go: multithreading or multiprocessing.
Cilk Plus is the current implementation of Cilk by Intel.
They are both multithreaded environments, i.e., multiple threads are spawned during execution.
If you are new to parallel programming probably OpenMP is better for you since it allows an easier parallelization of already developed sequential code. Do you already have a sequential version of your code?
OpenMP uses pragmas to instruct the compiler which portions of the code have to run in parallel. If I understand your problem correctly, you probably need something like this:
/* data: the items to test; check() queries every bloom filter for one item */
#pragma omp parallel for firstprivate(array_of_bloom_filters)
for (size_t i = 0; i < data_len; i++) {
    check(data[i], array_of_bloom_filters);
}
The instances of the different bloom filters are replicated in every thread in order to avoid contention, while the data is shared among threads.
Update:
The paper actually considers an application which is very unbalanced, i.e., different tasks (allocated on different threads) may incur very different workloads. Citing the paper that you mentioned: "a highly unbalanced task graph that challenges scheduling, load balancing, termination detection, and task coarsening strategies". Consider that in order to balance computation among threads it is necessary to reduce the task size and therefore increase the time spent in synchronization.
In other words, good load balancing always comes at a cost. The description of your problem is not very detailed, but it seems to me that your problem is quite balanced. If this is not the case, then go for Cilk; its work-stealing approach is probably the best solution for unbalanced workloads.
At the time this was posted, Intel was putting a lot of effort into boosting Cilk(tm) Plus; more recently, some effort has been diverted toward OpenMP 4.0.
It's difficult in general to contrast OpenMP with Cilk(tm) Plus.
If it's not possible to distribute work evenly across threads, one would likely set schedule(runtime) in an OpenMP version, and then at run time try various values of the OMP_SCHEDULE environment variable, such as OMP_SCHEDULE=guided, OMP_SCHEDULE=dynamic,2, or OMP_SCHEDULE=auto. Those are the closest OpenMP analogies to the way Cilk(tm) Plus work stealing works.
Some sparse matrix functions in the Intel MKL library do actually scan the job first and determine how much to allocate to each thread so as to balance work. For this method to be useful, the time spent in serial scanning and allocating has to be of lower order than the time spent in parallel work.
Work stealing, or dynamic scheduling, may lose much of OpenMP's potential advantage in promoting cache locality by pinning threads to cores, e.g. via OMP_PROC_BIND=close.
Poor cache locality becomes a bigger issue on a NUMA architecture where it may lead to significant time spent on remote memory access.
Both OpenMP and Cilk(tm) Plus have facilities for switching between serial and parallel execution.
Recently, while working in the parallel domain, I came to know that there are two terms, "vertical parallelism" and "horizontal parallelism". Some people describe OpenMP (shared memory parallelism) as vertical parallelism and MPI (distributed memory parallelism) as horizontal parallelism. Why are these terms called so? I am not getting the reason. Is it just terminology?
The terms don't seem to be widely used, perhaps because oftentimes a process or system uses both without distinction. These concepts are very generic, covering much more than the realm of MPI or openmp.
Vertical parallelism is the faculty of a system to employ several different devices at the same time. For instance, a program may have one thread doing heavy computation, while another is handling DB queries, and a third is doing IO. Most operating systems naturally expose this faculty.
Horizontal parallelism occurs when a single operation is executed over several similar items of data. This is the sort of parallelism that happens, for instance, when running several threads on the same piece of code, but with different data.
In the software world, an interesting example is actually the map-reduce algorithm, which uses both (a small sketch follows this list):
horizontal parallelism occurs at the map stage, when data is split and scattered across several CPUs for processing;
vertical parallelism happens between the map and reduce stages, where data is first divided into chunks, then processed by the map threads, and accumulated by the reduce thread.
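A toy sketch of that split (a hypothetical word-count job, written in Python for brevity): the pool.map call is the horizontal part, while the following reduce step is the next level of the vertical pipeline.

from functools import reduce
from multiprocessing import Pool

def count_words(chunk):
    # map step: the same operation applied to different pieces of data
    return len(chunk.split())

if __name__ == "__main__":
    chunks = ["one two three", "four five", "six seven eight nine"]
    with Pool(processes=3) as pool:
        partial_counts = pool.map(count_words, chunks)  # horizontal parallelism
    total = reduce(lambda a, b: a + b, partial_counts)  # reduce/accumulation stage
    print(total)  # 9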
Similarly, in the hardware world, superscalar pipelined CPUs use both variations, where pipelining is a particular instance of vertical parallelisation (just like the map/reduce staging, but with several more steps).
The reason behind the use of this terminology probably comes from the same reasons it is used with supply chains: values are produced by chaining different steps or levels of processing. The final product can be seen as the root of an abstract tree of constructions (from bottom to top) or dependencies (from top to bottom), where each node is the result of an intermediate level or step. You can easily see the analogy between supply chains and computation here.
I wrote a C program which reads a dataset from a file and then applies a data mining algorithm to find the clusters and classes in the data. At the moment I am trying to rewrite this sequential program as a multithreaded one with PThreads. I am a newbie to parallel programming and I have a question about the number of worker threads which has been puzzling me:
What is the best practice for finding the number of worker threads when you do parallel programming, and how do you determine it? Do you try different numbers of threads, look at the results, and then decide, or is there a procedure to find out the optimum number of threads? Of course I'm investigating this question from the performance point of view.
There are a couple of issues here.
As Alex says, the number of threads you can use is application-specific. But there are also constraints that come from the type of problem you are trying to solve. Do your threads need to communicate with one another, or can they all work in isolation on individual parts of the problem? If they need to exchange data, then there will be a maximum number of threads beyond which inter-thread communication will dominate, and you will see no further speed-up (in fact, the code will get slower!). If they don't need to exchange data then threads equal to the number of processors will probably be close to optimal.
Dynamically adjusting the thread pool to the underlying architecture for speed at runtime is not an easy task! You would need a whole lot of additional code to do runtime profiling of your functions. See for example the way FFTW works in parallel. This is certainly possible, but is pretty advanced, and will be hard if you are new to parallel programming. If instead the number of cores estimate is sufficient, then trying to determine this number from the OS at runtime and spawning your threads accordingly will be a much easier job.
To answer your question about technique: most big parallel codes run on supercomputers with a known architecture and take a long time to run. The best choice of processor count is not just a matter of how many there are, but also of the communication topology (how the processors are linked). Such codes therefore benefit from a testing phase where the best number of processors is determined by measuring the time taken on small problems. This is normally done by hand. If possible, profiling should always be preferred to guessing based on theoretical considerations.
You basically want to have as many ready-to-run threads as you have cores available, or at most 1 or 2 more to ensure no core that's available to you will ever be left idle. The trick is in estimating how many threads will typically be blocked waiting for something else (mostly I/O), as that is totally dependent on your application and even on external entities beyond your control (databases, other distributed services, etc, etc).
In the end, once you've determined roughly how many threads should be optimal, running benchmarks for thread pool sizes around your estimated value, as you suggest, is good practice (at the very least, it lets you double-check your assumptions), especially if, as it appears, you really do need to get the last drop of performance out of your system!
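A hedged sketch of that benchmarking approach, written in Python for brevity (the question is about PThreads, but the idea carries over directly): time the same workload with several pool sizes around the core count and keep the fastest. Processes are used here instead of Python threads so that a CPU-bound toy workload actually scales.

import os
import time
from multiprocessing import Pool

def work(n):
    # stand-in for one unit of the real clustering workload
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    tasks = [200_000] * 64
    cores = os.cpu_count() or 1
    for size in range(max(1, cores - 2), cores + 3):
        start = time.perf_counter()
        with Pool(processes=size) as pool:
            pool.map(work, tasks)
        print(f"{size} workers: {time.perf_counter() - start:.3f} s")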