Work sizes for completely independent calculations in OpenCL

I have a 2D matrix where I want to modify every value by applying a function that is only dependent on the coordinates in the matrix and values set at compile-time. Since no synchronization is necessary between each such calculation, it seems to me like the work group size could really be 1, and the number of work groups equal to the number of elements in the matrix.
My question is whether this will actually yield the desired result, or whether other forces are at play here that might make a different setting for these values better?

My recommendation: just set the global size to your 2D matrix size, and the local size to NULL. This lets the OpenCL implementation choose a suitable local size for you.
In your specific case, the local size does not need to have any particular shape. In fact, any valid value will do the job, but performance may differ. You can tune it manually for different hardware, but it is easier to let the implementation do this for you, and it is also more portable.

Related

How to plot variables with possibly wild variable values?

I want to build an application that does the equivalent of running lsof in a loop (perhaps with a different output format, since string processing may make it too slow for real-time use) and then associates each output line (an entry) with the iterations in which it was present. I will call these iterations frames from here on. My intention is that showing when applications have files open can reveal something about their structure without having a big impact on their execution, which is often a problem. One problem is processing the output, which would be a "frames X entries" table: I already anticipate wildly variable entry lengths. This runs into the usual problem of drawing things at very different scales: the small items become vanishingly small while the big ones become gigantic, and fragmentation makes it even worse. So my question is whether plotting libraries deal with this problem, and how they do it.

The easiest and most well-established technique for showing both small and large values in reasonable detail is a logarithmic scale. Instead of plotting raw values, plot their logarithms. This is notoriously problematic if you can have zero or even negative values, but as I understand your situation all your lengths would be strictly positive, so this should work.
Another statistical solution you could apply is to plot ranks instead of raw values. Take all the observed values and put them in a sorted list. When plotting any single data point, instead of plotting the value itself, you look up that value in the list (possibly using binary search, since the list is sorted) and plot the index at which you found it.
This is a monotonic transformation, so small values map to small indices and big values to big indices. On the other hand, it completely discards the actual magnitudes; only the relative comparisons matter.
If this is too radical, you could consider using it as an ingredient for something more tuneable. You could experiment with a linear combination, i.e. plot
a*x + b*log(x) + c*rank(x)
then tweak a, b and c till the result looks pleasing.
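As a rough illustration of the transformations above, here is a minimal Python sketch using only the standard library. The function names and the sample data are hypothetical, and ties are assumed to share a rank (the sorted list holds distinct values only); other tie-handling choices are equally possible.

```python
import math
from bisect import bisect_left


def rank_transform(values):
    """Map each value to its index in the sorted list of distinct observed values."""
    ordered = sorted(set(values))  # assumption: equal values share one rank
    return [bisect_left(ordered, v) for v in values]  # binary search lookup


def blended_scale(values, a=0.0, b=1.0, c=0.0):
    """Compute a*x + b*log(x) + c*rank(x); values must be strictly positive."""
    ranks = rank_transform(values)
    return [a * x + b * math.log(x) + c * r for x, r in zip(values, ranks)]


lengths = [1, 3, 3, 250, 40000]  # hypothetical entry lengths
print(rank_transform(lengths))   # [0, 1, 1, 2, 3]
print(blended_scale(lengths, a=0.0, b=1.0, c=0.5))
```

With a = c = 0 and b = 1 this reduces to a plain logarithmic scale, and with a = b = 0 and c = 1 to a pure rank plot, so the three weights interpolate between the options discussed.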

Is MPI shared memory a good solution for my problem?

I have several processes, each of which calculates certain sub-matrices of one global matrix. The problem is that the sub-matrices will overlap, and in general they do not necessarily form a contiguous block within the global matrix. Each task might also have more than one sub-matrix.
Finally, in order to obtain my final matrix I need to perform an element wise summation of these sub-matrices by considering the position within the global matrix.
So far I am doing the following:
each processor has its own copy of the global array (matrix)
each processor then calculates a sub-matrix of that global matrix and adds the elements to the right position in the local copy of the global array
with mpi_allreduce I am obtaining the final global matrix synchronized over all the tasks (this is my element wise summation to obtain my final result)
This works reasonably well as long as my global matrix is small. However, it quickly becomes a memory bottleneck, as allocating local copies of the global matrix becomes more and more expensive.
One constraint is that I have to solve this with MPI only.
Another constraint is that I need to perform operations on that global matrix afterwards, where different tasks access (this time read-only) different parts of the global matrix. These blocks are not the same as the sub-matrix blocks before.
I stumbled upon MPI-3 shared memory arrays. However, I am not sure whether this is the best solution for my problem, as several processes have to add small, overlapping local arrays simultaneously. On the other hand, for my operations afterwards, each process could read from that global matrix again.
I am relatively inexperienced in solving these kinds of problems, and I would be happy for any suggestions.
Thanks!

Force gensim's word2vec vectors to be positive?

Is there any way in gensim to force the learned word2vec vectors to be all positive (that is, every element of each vector positive)? I am working on a different task that needs these vectors to be positive (the reason is really complicated, so don't ask why).
So what is the easiest way for me to force gensim to learn positive vectors?

There is no built-in feature of Gensim that would allow this extra constraint/regularization to be applied during training.
You should probably try to explain your 'really complicated' reason for this idiosyncratic request. There might be a better way to achieve the real end-goal, rather than shoehorning vectors that are typically bushy-and-balanced around the origin into a non-negative representation.
Notably, a paper called 'All-but-the-Top: Simple and Effective Postprocessing for Word Representations' has suggested word-vectors can be improved by postprocessing to ensure they are more balanced around the origin, rather than less (as seems a reliable side-effect of typical negative-sampling configurations).
If you're still interested to experiment in the opposite direction – transforming usual word2vec word-vectors into a representation where all dimensions are positive – I can think of a number of trivial, superficial ways to achieve that. I have no idea whether they'd actually preserve, or ruin, beneficial properties in the vectors – but you could try them, and see. For example:
You could try simply setting all negative dimensions to 0.0 - truncation. (Loses lots of info but might give a quick indication if a dirt-simple experiment gives you any of the benefits you seek.)
You could find the most negative value that appears in any dimension of any of the vectors, then add its absolute value to every dimension of every vector. Voila! No vector dimension is now lower than 0.0. (You could also try this in a per-dimension manner - only correct dimension #0 with the lowest dimension #0 value. Or, try other re-scalings of each dimension such that the previously-highly-negative values are 0.0, and the previously-highly-positive values stay where they are or only shift a little.)
You could try turning every dimension in the original word-vectors into two dimensions in a transformed set: one that's the original positive value, or 0.0 if it was negative, and a 2nd dimension that's the absolute value of the original negative value, or 0.0 if it was positive. (Or similarly: one dimension that's the absolute-value of the original value, and one dimension that's 0.0 or 1.0 depending on whether original value was negative or positive.)
There are probably other more-sophisticated factorization/decompositions for re-representing the full set of word-vectors in a transformed array with only non-negative individual values, but I don't know them offhand, other than to think it might be worth searching for them.
And, whether any of these transformations work for your next steps, who knows? But it might be worth trying. (And if any of these offer surprisingly good results, it'd be great to hear in a followup comment!)
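The three simple transformations described above (truncation, a global shift, and splitting each dimension into a positive and a negative part) could be sketched as follows. This is only an illustration on plain Python lists standing in for word-vectors; the function names, the sample values, and the output layout of the split variant (all positive parts first, then all negative parts) are assumptions, and nothing here says whether the results remain useful downstream.

```python
def truncate(vectors):
    """Set every negative component to 0.0."""
    return [[max(x, 0.0) for x in v] for v in vectors]


def shift_global(vectors):
    """Add |most negative value| to every component of every vector."""
    lowest = min(min(v) for v in vectors)
    offset = -lowest if lowest < 0 else 0.0
    return [[x + offset for x in v] for v in vectors]


def split_signs(vectors):
    """Double the dimensionality: positive parts, then absolute negative parts."""
    return [[max(x, 0.0) for x in v] + [max(-x, 0.0) for x in v]
            for v in vectors]


vectors = [[0.5, -1.0], [-0.25, 2.0]]  # hypothetical 2-dimensional word-vectors
print(truncate(vectors))      # [[0.5, 0.0], [0.0, 2.0]]
print(shift_global(vectors))  # [[1.5, 0.0], [0.75, 3.0]]
print(split_signs(vectors))   # [[0.5, 0.0, 0.0, 1.0], [0.0, 2.0, 0.25, 0.0]]
```

Note that only the split variant is lossless: the original vector can be recovered by subtracting the second half from the first.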

How to save a matrix in C++ in a non-linear way

I have to program an optimized multi-threaded implementation of the Levenshtein distance problem. It can be computed using dynamic programming with a matrix; the Wikipedia page on Levenshtein distance covers that well enough.
Now, I can compute diagonal elements concurrently. That is all alright.
My problem now comes with caches. Matrices in C++ are normally stored in memory row by row, correct? Well, that is not good for me, as I need two elements of the previous row and one element of the current row to compute my result, which is horrible cache-wise. The cache will hold the current row (or part of it), then I ask for the previous one, which it will probably not hold anymore.
Then for another one, I need a different part of the diagonal, so yet again, I ask for completely different rows and the cache will not have those ready for me.
Therefore, I would like to store my matrix in memory in blocks or maybe diagonals. That will result in fewer cache misses and make my implementation faster again.
How do you do that? I tried searching the internet, but I could never find anything that would show me the way. Is it possible to tell C++ how to order that type in memory?
EDIT: Some of you seem confused about the nature of my question. I want to store a matrix (it does not matter whether I make it a 2D array or something else) in a custom layout in MEMORY. Normally, a 2D array is stored row after row; I need to work with diagonals, so caches will miss a lot on the huge matrices I will work with (possibly millions of rows and columns).

I believe you may have a misperception of how (CPU) caches work.
It's true that CPU caching is linear - that is, if you access an address in memory, it will bring into the cache some previous and some successive memory locations - which is like "guessing" that subsequent accesses will involve 1-dimensional-close elements. However, this is true on the micro-level. A CPU's cache is made up of a large number of small "lines" (64 Bytes on all cache levels in recent Intel CPUs). The locality is limited to the line; different cache lines can come from completely different places in memory.
Thus, if you "need two elements of the previous row and one element of the current row" of your matrix, then the cache should work very well for you: Some of the cache will hold elements of the previous row, and some will hold elements of the current row. And when you advance to the next element, the cache overall will usually contain the matrix elements you need to access. Just make sure your order of iteration agrees with the order of progression within the cache line.
Also, in some cases you could be faced with a situation where different threads are thrashing the same cache lines due to the mapping from main memory into the cache. Without getting into details, that is something you need to think about (but again, has nothing to do with 2D vs 1D data).
Edit: As geza notes, if your matrix' lines are long, you will still be reading each memory location twice with the straightforward approach: Once as the current-line, then again as the previous-line, since each value will be evicted from the cache before it's used as a previous-line value. If you want to avoid this, you can iterate over tiles of your matrix, whose size (length x width x sizeof(element)) fits into the L1 cache (along with whatever else needs to be there). You can also consider storing your data in tiles, but I don't think that would be too useful.

Preliminary comment: "Levenshtein distance" is edit distance (under the common definition). This is a very common problem; you probably don't even need to bother writing a solution yourself. Look for existing code.
Now, finally, for a proper answer... You don't actually need to have a matrix at all, and you certainly don't need to "save" it: It's enough to keep merely a "front" of your dynamic programming matrix rather than the whole thing.
But what "front" shall you choose, and how do you advance it? I suggest you use anti-diagonals as your front, and given each anti-diagonal, compute concurrently the next anti-diagonal. Thus it'll be {(0,0)}, then {(0,1),(1,0)}, then {(0,2),(1,1),(2,0)} and so on. Each anti-diagonal requires at most two earlier anti-diagonals - and if we keep the values of each anti-diagonal consecutively in memory, then the access pattern going up the next anti-diagonal is a linear progression along the previous anti-diagonals - which is great for the cache (see my other answer).
So, you'll "concurrentize" the computation by giving each thread a bunch of consecutive anti-diagonal elements to compute; that should do the trick. And at any time you will only keep three anti-diagonals in memory: the one you're working on and the two previous ones. You can cycle between three such buffers so you don't re-allocate memory all the time (but then make sure to pre-allocate buffers with the maximum anti-diagonal length).
This whole thing should work basically the same for the non-square case.
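To make the anti-diagonal scheme concrete, here is a minimal sequential sketch in Python (chosen only for brevity; the question targets C++). The function name is hypothetical, and the choice to index each buffer by row number, with the shorter string placed on the rows, is an assumption made so that three buffers of the maximum anti-diagonal length suffice. The inner loop is the part that would be divided among threads, since cells on one anti-diagonal do not depend on each other.

```python
def levenshtein_antidiagonal(s, t):
    """Edit distance computed one anti-diagonal at a time with 3 buffers."""
    # Buffers are indexed by row i, so keep the shorter string on the rows.
    if len(s) > len(t):
        s, t = t, s
    m, n = len(s), len(t)
    # Pre-allocated buffers for anti-diagonals k-2, k-1 and k (k = i + j).
    prev2, prev1, cur = [0] * (m + 1), [0] * (m + 1), [0] * (m + 1)
    for k in range(m + n + 1):
        lo, hi = max(0, k - n), min(k, m)  # valid rows on this anti-diagonal
        for i in range(lo, hi + 1):        # independent cells: parallelizable
            j = k - i
            if i == 0:
                cur[i] = j                 # first row: j insertions
            elif j == 0:
                cur[i] = i                 # first column: i deletions
            else:
                cost = 0 if s[i - 1] == t[j - 1] else 1
                cur[i] = min(prev1[i - 1] + 1,     # cell (i-1, j)
                             prev1[i] + 1,         # cell (i, j-1)
                             prev2[i - 1] + cost)  # cell (i-1, j-1)
        # Rotate the buffers instead of allocating new ones.
        prev2, prev1, cur = prev1, cur, prev2
    return prev1[m]  # after the last rotation, prev1 holds cell (m, n)


print(levenshtein_antidiagonal("kitten", "sitting"))  # 3
```

Every read from the two earlier buffers walks them in the same direction as the write to the current one, which is the linear access pattern the answer relies on.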

I'm not absolutely sure, but I think a matrix is stored as one long array, one row after the other, and is mapped to a matrix with pointer arithmetic, so you always refer to the same base address and calculate the offset in memory where your value is located.
Otherwise, you can easily implement it as your own type and provide operator()(int, int) for your matrix (C++ only accepts a multi-argument operator[] from C++23 onwards).
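The row-major mapping described in this answer can be sketched as a small wrapper type. The example is in Python purely for illustration (the question is about C++), and the class name and its methods are hypothetical; the point is only the offset formula row * cols + col applied to one flat buffer.

```python
class FlatMatrix:
    """A 2D matrix stored as one flat, row-major list."""

    def __init__(self, rows, cols, fill=0):
        self.rows, self.cols = rows, cols
        self.data = [fill] * (rows * cols)  # one contiguous buffer

    def _offset(self, key):
        row, col = key
        if not (0 <= row < self.rows and 0 <= col < self.cols):
            raise IndexError("matrix index out of range")
        return row * self.cols + col        # row-major address arithmetic

    def __getitem__(self, key):
        return self.data[self._offset(key)]

    def __setitem__(self, key, value):
        self.data[self._offset(key)] = value


m = FlatMatrix(2, 3)
m[1, 2] = 7
print(m.data)  # [0, 0, 0, 0, 0, 7]
```

A different storage order (for example by tiles or by anti-diagonals) would only change the body of `_offset`, leaving the calling code untouched.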

NULL values across a dimension in Support Vector Machine

I am designing a support vector machine with n dimensions. Along every dimension, the values lie in the range [0, 1]. Now, if for various reasons I am unable to determine the value along a particular dimension for a particular data point from the original data set, what should the value along that dimension be for the SVM? Can I just set it to -1 to indicate a missing value?
Thanks
Abhishek S

You would be better served leaving the missing value out altogether if the dimension won't be able to contribute to your machine's partitioning of the space. This is because the only thing the SVM can do is place zero weight on that dimension as far as classification power, as all of the points in that dimension are at the same place.
Thus each pass over that dimension is just wasted computational resources. If recovering this value is important, you may be able to use a regression model of some type to estimate it, but if that estimate is generated from your other data, then yet again it won't actually contribute to your SVM, because the data in that estimated dimension is nothing more than a summary of the data you used to generate it (which I would assume is in your SVM model already).
