Someone point me in the right direction. I'm looking to do some heavy-duty manipulation of some really large and often very sparse matrices, and I'm looking for the right tool for the job. These matrices will be much, much larger than the RAM of any single machine and will therefore likely be spread across several different machines. I will want to perform all of the common matrix operations: multiplication, transpose, inverse, pseudo-inverse, SVD, eigenvalue decomposition, etc. Probably key among my concerns is that since the matrices will very likely be spread among several machines, I will want to minimize information sharing, because network latency is probably my biggest enemy. I'm concerned that map-reduce (a la Hadoop) is not the right option because its focus is on streaming large amounts of data between machines. (This book provides a great intro to map-reduce from an algorithmic perspective.) Also, lots of matrix operations are akin to giant JOIN operations, which are known to be slow on map-reduce.
So... where should I go?
This paper, Design of Hadoop-based Large-Scale Matrix Computations, can help you with implementation guidelines. HBase is designed for storing sparse tables, so it might be the recommended storage option for the matrices.
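Not from the linked paper, but to make the "giant JOIN" structure concrete, here is a minimal single-process Python sketch of the classic one-round map/reduce matrix multiply over sparse inputs. The matrix contents and function names are illustrative only; in Hadoop the two phases below would be the actual map and reduce tasks.

```python
from collections import defaultdict

# Sparse matrices as {(row, col): value} dicts -- illustrative data only.
A = {(0, 0): 1.0, (0, 2): 2.0, (1, 1): 3.0}   # 2x3
B = {(0, 0): 4.0, (2, 1): 5.0, (1, 1): 6.0}   # 3x2

def matmul_mapreduce(A, B):
    # Map phase: key every A entry by its column and every B entry by its row,
    # so entries that must be multiplied together share a key (a join on k).
    groups = defaultdict(lambda: ([], []))
    for (i, k), v in A.items():
        groups[k][0].append((i, v))
    for (k, j), v in B.items():
        groups[k][1].append((j, v))

    # Reduce phase: within each key, emit partial products keyed by (i, j),
    # then sum the partial products for each output cell.
    C = defaultdict(float)
    for a_entries, b_entries in groups.values():
        for i, av in a_entries:
            for j, bv in b_entries:
                C[(i, j)] += av * bv
    return dict(C)

print(matmul_mapreduce(A, B))   # {(0, 0): 4.0, (0, 1): 10.0, (1, 1): 18.0}
```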
I am looking for an efficient algorithm to perform (dense) large matrix multiplications on GPUs. More specifically, for the case where the GPU does not have enough memory to hold all the matrices (e.g., m=n=k=100,000). I'm using cuBLAS to perform matrix multiplication in blocks, and I can think of many block-based approaches, but they are very inefficient because the A, B or C matrices have to be copied to/from the GPU multiple times.
I know that many efficient algorithms have been proposed (for example, here), but I was unable to find a concrete definition of the algorithm used. Is there an algorithm to perform this task without redundant copies (that is, copying A, B and C exactly once)? Any pointers to competitive approaches?
Such an algorithm is called an out-of-core algorithm, and this problem is generally solved using tiles. The idea is to first split A and B into relatively big tiles. Then, send two tiles to the GPU, perform the multiplication of the two, write the result into a preallocated tile (always the same), send it back to the CPU, and accumulate the result into a tile of the C matrix. Actually, this algorithm is the same as a standard blocked matrix multiplication, except that the items are tiles and you need to take care of sending/receiving data to/from the GPU. CUDA streams can be used to improve the execution time by overlapping communications with computations. Note that tiles need to be copied multiple times because you do not have enough memory on the GPU. Lebesgue curves (aka Z-tiling or Z-order curves) can be used to reduce the number of copies/communications. Doing all of this is a bit complex. Some runtime systems and tools can help you hide memory transfers more easily (e.g. StarPU, which is a research project).
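As a rough illustration of the tile loop (not the cuBLAS implementation itself), here is a NumPy sketch; the tile size is arbitrary, and the comments mark where host/device copies and the GPU GEMM would go in a real out-of-core version.

```python
import numpy as np

def tiled_matmul(A, B, tile=1024):
    """Blocked multiply: C[i, j] += A[i, k] @ B[k, j], tile by tile.
    In a GPU version, each A/B tile is copied to the device, multiplied there,
    and the result tile is copied back and accumulated into C on the host."""
    m, K = A.shape
    K2, n = B.shape
    assert K == K2
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            acc = np.zeros((min(tile, m - i), min(tile, n - j)), dtype=A.dtype)
            for k in range(0, K, tile):
                a = A[i:i+tile, k:k+tile]   # host -> device copy in the GPU version
                b = B[k:k+tile, j:j+tile]   # host -> device copy in the GPU version
                acc += a @ b                # cublasSgemm on the device
            C[i:i+tile, j:j+tile] = acc     # device -> host copy of the result tile
    return C

A = np.random.rand(2000, 1500).astype(np.float32)
B = np.random.rand(1500, 1800).astype(np.float32)
assert np.allclose(tiled_matmul(A, B, tile=512), A @ B, rtol=1e-3)
```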
This is kind of a long shot, but I am hoping that someone has been in a similar situation, as I am looking for some advice on how to efficiently bring a set of large word2vec models into a production environment.
We have a range of trained w2v models with a dimensionality of 300. Due to the underlying data - a huge corpus with POS-tagged words, and specialized vocabularies with up to 1 million words - these models became quite large, and we are currently looking into effective ways to expose these to our users without paying too high a price in infrastructure.
Besides trying to better control the vocabulary size, dimensionality reduction on the feature vectors would obviously be an option. Is anyone aware of publications around that, particularly on how this would affect model quality, and how best to measure this?
Another option is to pre-calculate the top X most similar words for each vocabulary word and provide a lookup table. With the model size being that big, this is currently also very inefficient. Are there any known heuristics that could be used to reduce the number of necessary distance calculations from n × (n-1) to a lower number?
Thank you very much!
There are pre-indexing techniques for similarity-search in high-dimensional spaces which can speed nearest-neighbor discovery, but usually at a cost of absolute accuracy. (They also need more memory for the index.)
An example is the ANNOY library. The gensim project includes a demo notebook showing its use with Word2Vec.
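As a rough sketch (using the ANNOY library directly rather than the gensim demo notebook), building an index over the word vectors might look like the following. It assumes a trained gensim `Word2Vec` model named `model`; the gensim attribute names shown are from recent versions and may differ in older ones, and the query word and tree count are illustrative.

```python
from annoy import AnnoyIndex

dims = model.wv.vector_size                  # e.g. 300
index = AnnoyIndex(dims, 'angular')          # angular distance ~ cosine similarity
for i, word in enumerate(model.wv.index_to_key):
    index.add_item(i, model.wv[word])
index.build(50)                              # more trees -> better recall, larger index
index.save('w2v.ann')                        # the saved index is memory-mapped on load

# Approximate top-10 neighbours of a word:
word_id = model.wv.key_to_index['computer']
neighbor_ids = index.get_nns_by_item(word_id, 10)
print([model.wv.index_to_key[i] for i in neighbor_ids])
```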
I once did some experiments using just 16-bit (rather than 32-bit) floats in a Word2Vec model. It saved memory in the idle state, and nearest-neighbor top-N results were nearly unchanged. But, perhaps because some behind-the-scenes up-conversion to 32-bit floats was still occurring during the one-against-all distance-calculations, speed of operations was actually reduced. (And this suggests that each distance-calculation may have caused a temporary memory expansion offsetting any idle-state savings.) So it's not a quick fix, but further research here – perhaps involving finding/implementing the right routines for float16 array operations – could maybe mean 50% model-size savings and equivalent or even better speed.
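A plain NumPy sketch of what that experiment amounts to (array sizes are illustrative): the idle memory halves, but the one-against-all scoring path typically still goes through wider intermediates, so it is not automatically faster.

```python
import numpy as np

vecs32 = np.random.rand(200_000, 300).astype(np.float32)   # ~240 MB idle
vecs16 = vecs32.astype(np.float16)                          # ~120 MB idle

query = vecs32[0]
scores32 = vecs32 @ query                          # usual float32 one-against-all scores
scores16 = vecs16 @ query.astype(np.float16)       # half-precision path, often no faster on CPU

print(vecs32.nbytes / 1e6, vecs16.nbytes / 1e6)    # memory in MB
print(np.argsort(-scores32)[:10], np.argsort(-scores16)[:10])   # top-N usually near-identical
```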
For many applications, discarding the least-frequent words doesn't hurt much – or even, when done before training, can improve the quality of the remaining vectors. As many implementations, including gensim, sort the word-vector array in most-to-least-frequent order, you can discard the tail-end of the array to save memory, or limit most_similar() searches to the first-N entries to speed calculations.
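In gensim, the `restrict_vocab` parameter of `most_similar()` does exactly this first-N limiting; the word and cutoff below are illustrative.

```python
# Vectors are stored most-to-least-frequent, so limiting the search to the
# first 100,000 rows only drops rare words from consideration.
model.wv.most_similar('computer', topn=10, restrict_vocab=100_000)
```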
Once you've minimized the vocabulary size, you want to be sure the full set is in RAM, and no swapping is triggered during the (typical) full-sweep distance-calculations. If you need multiple processes to serve answers from the same vector set, as in a web service on a multicore machine, gensim's memory-mapping operations can prevent each process from loading its own redundant copy of the vectors. You can see a discussion of this technique in this answer about speeding gensim Word2Vec loading time.
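A minimal sketch of the memory-mapping approach, assuming the vectors are saved once and then loaded in each worker process (file path and query word are illustrative):

```python
from gensim.models import KeyedVectors

# Save once; gensim stores the large numpy arrays in separate .npy files:
# model.wv.save('vectors.kv')

# In each worker process: mmap='r' maps the arrays read-only, so the OS can
# share one physical copy of the vector pages across all processes.
wv = KeyedVectors.load('vectors.kv', mmap='r')
print(wv.most_similar('computer', topn=10))
```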
Finally, while precomputing top-N neighbors for a larger vocabulary is both time-consuming and memory-intensive, if your pattern of access is such that some tokens are checked far more than others, a cache of the N most-recently or M most-frequently requested top-N could improve perceived performance a lot – making only less-frequently-requested neighbor-lists require the full distance calculations to every other token.
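A small sketch of such a cache using the standard library, assuming an existing `wv` KeyedVectors object; the cache size and top-N are illustrative.

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_neighbors(word, topn=10):
    # The full one-against-all distance calculation only runs on a cache miss.
    return tuple(wv.most_similar(word, topn=topn))

cached_neighbors('computer')   # computed
cached_neighbors('computer')   # served from the cache
```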
It's said that when we wish to compute statistics over paper references, map-reduce can do much better than traditional approaches, as the traditional approaches involve a lot of memory/disk swapping. I can't quite figure out why the traditional approaches are not good.
Suppose I run map-reduce on just one machine (no cluster); does it still solve some problems better than traditional approaches?
Or, in other words, does the algorithmic paradigm of "map-reduce" itself have some advantages in solving problems, from an algorithmic point of view?
Thanks.
At best, M/R allows re-applying the same algorithms as the advanced stats packages. But more typically, some sacrifices are made in the algorithms used, to allow for running in a distributed fashion. Map/Reduce provides no "magic" in terms of, say, providing a more uniformly randomized distribution during cross-fold sampling (or any other sampling methodology).
For a small dataset that fits in memory, M/R is usually worse than your traditional packages, due to compromises made in the algorithms for scalability. You start to see an advantage to M/R when using large datasets that are prohibitive to fully sample on a single machine. Using R / Matlab / SAS would typically require down-sampling, possibly by orders of magnitude.
I have around 10K points in 5-dimensional space. We can assume that the points are randomly distributed in the space between (0,0,0,0,0) and (100,100,100,100,100). Clearly, the whole data set can easily reside in memory.
I would like to know which algorithm for k-nearest-neighbour search would run faster: kd-tree or R-tree.
Although I have some very high-level idea of these two algorithms, I am not sure which will run faster, and why. I am open to exploring other algorithms, if any, that could run faster. Please, if possible, specify why an algorithm may run faster.
This depends on various parameters. Most importantly on your capability to implement these algorithms.
I've personally found bulk-loaded R*-trees to be faster for large data, probably because they have a better fan-out. Bulk-loaded R-trees are a fairer comparison, as kd-trees are commonly bulk-loaded (in fact, they don't support incremental operation very well at all).
For tiny data, kd-trees will likely be faster, plus they are much simpler to implement.
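For a data set of this size, a library kd-tree is also trivially easy to try. A minimal sketch with SciPy's bulk-loaded kd-tree (sizes and k are taken from the question; the query point is illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

points = np.random.uniform(0, 100, size=(10_000, 5))   # ~10K points in 5-D
tree = cKDTree(points)                                   # bulk-loaded kd-tree

query = np.random.uniform(0, 100, size=(1, 5))
dists, idx = tree.query(query, k=10)                     # 10 nearest neighbours
print(idx[0], dists[0])
```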
For other things, please refer to this earlier question / answer:
https://stackoverflow.com/a/11109467/1060350
I wrote a very simple distributed computing platform (based on the Map/Reduce paradigm), and I'm in the process of writing some demos and showcases. I have a very small team and have to prioritize which demos I'll write first.
To prioritize, I need to sort the demos according to roughly 70% weight on being a relevant, common, significant use case of distributed computing and 30% on being easy to write.
So far I have it ordered like this:
1. Discovering pi digits with Monte Carlo
2. Numerical integration with Monte Carlo
3. Large matrix multiplication (dense matrices)
4. Linear regressions
5. Large matrix inversion
6. Multiple regressions
7. Sorting
8. Clustering (K-Means)
9. Clustering (Hierarchical)
Number 1 is on the list because it took 10 minutes to write, although it's completely useless (I'm not sure, but I figure there aren't a lot of people trying to find more digits of pi).
Due to the nature of my platform, it will shine more on things that are, of course, embarrassingly parallel, and not I/O-bound or reduce-dominated.
How would you change my list? What would you add to it? Is sorting useful at all in the enterprise world or is it only for benchmarking distributed computing platforms?
Your list suggests that you are not distinguishing between parallel computing and distributed computing. This is not necessarily wrong but someone looking for a demonstration of the excellence of a distributed computing platform might be left tepidly enthused upon seeing parallel computations, such as your items 2 - 5, being performed.
Sorting is certainly useful everywhere there is data: large enterprises, small enterprises, in your desk drawers, across the Googlesphere. So too is searching, which is a surprising omission from your list. The other omission which strikes me immediately is any sort of data fusion, merging large datasets to get information from their intersections beyond what can be extracted from the datasets individually.
I second Mark in that you are mixing distributed computing and HPC. Here are some comments on each of your topics:
(1) There are people trying to compute as many digits of pi as they can, but the Monte Carlo algorithm is completely useless there, as its precision scales with the inverse square root of the number of trials: to get one more decimal digit of precision you would need roughly 100 times more trials. There are other algorithms - see if you can implement some of them using Map/Reduce.
(2) This one is fine, although seldom used - same problem with precision as (1).
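To see the precision problem from (1) and (2) concretely, here is a small Monte Carlo pi sketch (trial counts are illustrative); the error shrinks by roughly one decimal digit for every 100x more trials.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_pi(trials):
    # Fraction of random points in the unit square that fall inside the quarter circle.
    xy = rng.random((trials, 2))
    return 4.0 * np.count_nonzero((xy ** 2).sum(axis=1) <= 1.0) / trials

for trials in (1_000, 100_000, 10_000_000):
    print(trials, abs(mc_pi(trials) - np.pi))   # error shrinks roughly as 1/sqrt(trials)
```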
(5) Pure matrix inversions are seldom performed, mainly because of numerical instabilities. How about solving a dense system of linear equations instead?
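A tiny sketch of that substitution (matrix size is illustrative): solving the system directly gives the same answer as multiplying by an explicit inverse, with fewer flops and better numerical behaviour.

```python
import numpy as np

A = np.random.rand(2_000, 2_000)
b = np.random.rand(2_000)

x_inv   = np.linalg.inv(A) @ b      # what a "matrix inversion" demo usually does
x_solve = np.linalg.solve(A, b)     # LU-based solve, no explicit inverse formed
print(np.allclose(x_inv, x_solve, atol=1e-6))
```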
I would say that you are missing one of the main usages of M/R processing nowadays, namely graph processing (read: social and other network/flow analysis). Also, some more general optimisation problems might be nice, e.g. genetic algorithms.