Maximum matrix size in Scala Breeze or other Java/Scala matrix packages?

I am trying to write generic modelling code in Scala that relies on a grid (matrix or tensor). After much looking around, I decided to use Breeze for the matrix because (1) its API is in Scala, which is nice, (2) it is reasonably fast for what I need (not a lot of linear algebra, more a convenient data structure), and (3) it lets me store non-primitive values and preserves their types (not everything in my application is a numeric value).
However, I cannot find any information about the maximum matrix size. I managed to blow my heap a few times by creating large tensors (a matrix of 100'000 x 10'000 cells, each containing a Vector of dimension 5), but I overcame this by increasing the heap size.
Now the above matrix works, but I get an 'interesting' error when I try to create a matrix of 100'000 x 100'000 cells, each containing a Vector of dimension 5. This is what it gives me:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 1410065408 out of bounds for length 1410065408
I suspect it has to do with how Breeze handles indices internally, but I am not sure. I don't really mind having a size limit (that's life), but I need to know it so I can catch the problem before it crashes the application. Does anyone have an idea?
Or is there a better package out there for what I need? I played with ojAlgo, which is nice but slower than Breeze at creating the matrix, and it can only store primitives or boxed primitives, not objects. Maybe Spark?
Thanks!

Breeze's DenseMatrix representation is backed by a single Java array, so the total number of elements is capped at (a little less than) 2^31, because Java arrays are indexed by int. Your 100'000 x 100'000 matrix has 10^10 elements, well past that limit, and the linear index silently wraps around in 32-bit arithmetic, which is where the strange out-of-bounds index comes from.
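You can check that the index in the exception is exactly 100'000 x 100'000 wrapped to a signed 32-bit integer. A quick sketch (plain Python standing in for JVM int arithmetic, purely as an illustration):

```python
def jvm_int(x: int) -> int:
    """Wrap x to a signed 32-bit integer, the way Java int arithmetic does."""
    x &= 0xFFFFFFFF                      # keep the low 32 bits
    return x - (1 << 32) if x >= (1 << 31) else x

rows, cols = 100_000, 100_000
print(jvm_int(rows * cols))              # prints 1410065408, the index from the exception
```

So the practical limit to check against before allocating is rows * cols <= Int.MaxValue (2^31 - 1), computed in Long arithmetic.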

Related

How to save a matrix in C++ in a non-linear way

I have to program an optimized multi-threaded implementation of the Levenshtein distance problem. It can be computed using dynamic programming with a matrix; the Wikipedia page on Levenshtein distance covers that well enough.
Now, I can compute diagonal elements concurrently. That is all alright.
My problem now comes with caches. Matrices in C++ are normally stored in memory row by row, correct? Well, that is not good for me, as I need two elements of the previous row and one element of the current row to compute my result; that is horrible cache-wise. The cache will hold the current row (or part of it), then I ask for the previous one, which it will probably not hold anymore.
Then for another one, I need a different part of the diagonal, so yet again, I ask for completely different rows and the cache will not have those ready for me.
Therefore, I would like to save my matrix to memory in blocks, or maybe diagonals. That would result in fewer cache misses and make my implementation faster again.
How do you do that? I tried searching the internet, but I could never find anything that would show me the way. Is it possible to tell C++ how to lay out that type in memory?
EDIT: Some of you seem confused about the nature of my question. I want to save a matrix (it does not matter whether I make it a 2D array or something else) in a custom layout in MEMORY. Normally, a 2D array is stored row after row; I need to work with diagonals, so caches will miss a lot on the huge matrices I will work with (possibly millions of rows and columns).
I believe you may have a misconception about the (CPU) cache.
It's true that CPU caching is linear - that is, if you access an address in memory, the CPU will bring some preceding and some succeeding memory locations into the cache - in effect "guessing" that subsequent accesses will involve nearby elements in one dimension. However, this is only true at the micro level. A CPU's cache is made up of a large number of small "lines" (64 bytes per line on all cache levels in recent Intel CPUs). Locality is limited to a line; different cache lines can come from completely different places in memory.
Thus, if you "need two elements of the previous row and one element of the current row" of your matrix, then the cache should work very well for you: Some of the cache will hold elements of the previous row, and some will hold elements of the current row. And when you advance to the next element, the cache overall will usually contain the matrix elements you need to access. Just make sure your order of iteration agrees with the order of progression within the cache line.
Also, in some cases you could be faced with a situation where different threads are thrashing the same cache lines due to the mapping from main memory into the cache. Without getting into details, that is something you need to think about (but again, has nothing to do with 2D vs 1D data).
Edit: As geza notes, if your matrix's rows are long, you will still be reading each memory location twice with the straightforward approach: once as the current row, then again as the previous row, since each value will be evicted from the cache before it is reused. If you want to avoid this, you can iterate over tiles of your matrix whose size (length x width x sizeof(element)) fits into the L1 cache (along with whatever else needs to be there). You can also consider storing your data in tiles, but I don't think that would be too useful.
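To make the tiling idea concrete, here is a minimal sketch (in Python rather than C++, purely for illustration; the tile size is a placeholder you would tune so a tile's working set fits in L1):

```python
def tiled_indices(rows, cols, tile=64):
    """Yield (i, j) index pairs tile by tile rather than row by row,
    so each tile's working set can stay resident in cache."""
    for bi in range(0, rows, tile):
        for bj in range(0, cols, tile):
            for i in range(bi, min(bi + tile, rows)):
                for j in range(bj, min(bj + tile, cols)):
                    yield i, j
```

The traversal still visits every cell exactly once; only the order changes, which is the whole point of tiling.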
Preliminary comment: "Levenshtein distance" is edit distance (under the common definition). This is a very common problem; you probably don't even need to bother writing a solution yourself. Look for existing code.
Now, finally, for a proper answer... You don't actually need to have a matrix at all, and you certainly don't need to "save" it: it's enough to keep merely a "front" of your dynamic programming matrix rather than the whole thing.
But what "front" shall you choose, and how do you advance it? I suggest you use anti-diagonals as your front, and given each anti-diagonal, compute concurrently the next anti-diagonal. Thus it'll be {(0,0)}, then {(0,1),(1,0)}, then {(0,2),(1,1),(2,0)} and so on. Each anti-diagonal requires at most two earlier anti-diagonals - and if we keep the values of each anti-diagonal consecutively in memory, then the access pattern going up the next anti-diagonal is a linear progression along the previous anti-diagonals - which is great for the cache (see my other answer).
So, to "concurrentize" the computation, give each thread a bunch of consecutive anti-diagonal elements to compute; that should do the trick. And at any time you will only keep three anti-diagonals in memory: the one you're working on and the two previous ones. You can cycle between three such buffers so you don't re-allocate memory all the time (but make sure to pre-allocate the buffers with the maximum anti-diagonal length).
This whole thing should work basically the same for the non-square case.
I'm not absolutely sure, but I think a matrix is stored as one long array, one row after the other, and mapped to 2D indices with pointer arithmetic, so you always refer to the same base address and compute the offset in memory where your value is located.
Otherwise, you can easily implement it as your own type with an accessor taking (int, int) for your matrix.

Own fast Gamma Index implementation

My friends and I are writing our own implementation of the Gamma Index algorithm. It should compute the result within 1 s for standard-size 2D pictures (512 x 512), though it should also be able to handle 3D pictures, and it should be portable and easy to install and maintain.
Gamma Index, in case you haven't come across this topic, is a method for comparing pictures. As input we provide two pictures (reference and target); every picture consists of points distributed over a regular fine grid, and every point has a location and a value. As output we receive a picture of Gamma Index values. For each point of the target picture we evaluate a function (called gamma) against every point of the reference picture (in the original version), or against the reference points closest to the target point (in the version usually used in Gamma Index calculation software). The Gamma Index of a target point is the minimum of the gamma values calculated for it.
So far we have tried the following ideas, with these results:
use a GPU - calculation time decreased 10 times. The problem is that it's fairly difficult to install on machines without an NVIDIA graphics card
use a supercomputer or cluster - the problem is the maintenance of this solution. Plus, every picture has to be encrypted for travel through the network due to data sensitivity
iterate points ordered by their distance to the target point, with an extra stop criterion - this way we got 15 seconds in the best case (and it is actually not perfectly precise)
Currently we are writing in Python because of NumPy's awesome optimizations for matrix calculations, but we are open to other languages too.
Do you have any ideas how we can accelerate our algorithm(s) to meet the objectives? Do you think this level of performance is achievable?
Some more information about GI for anyone interested:
http://lcr.uerj.br/Manual_ABFM/A%20technique%20for%20the%20quantitative%20evaluation%20of%20dose%20distributions.pdf
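For readers unfamiliar with the computation, the brute-force "original version" described above can be sketched as follows (pure Python for clarity; the function name, the dose-difference and distance-to-agreement tolerances, and the grid handling are illustrative placeholders, not the asker's actual code):

```python
import math

def gamma_index(ref, tgt, spacing=1.0, dd=0.03, dta=3.0):
    """Brute-force 2D Gamma Index: for every target point, take the
    minimum gamma over ALL reference points (the 'original version').
    dd = dose-difference tolerance, dta = distance-to-agreement tolerance."""
    rows, cols = len(tgt), len(tgt[0])
    out = [[0.0] * cols for _ in range(rows)]
    for ti in range(rows):
        for tj in range(cols):
            best = math.inf
            for ri in range(rows):
                for rj in range(cols):
                    dist2 = ((ti - ri) ** 2 + (tj - rj) ** 2) * spacing ** 2
                    dose2 = (tgt[ti][tj] - ref[ri][rj]) ** 2
                    best = min(best, math.sqrt(dist2 / dta ** 2 + dose2 / dd ** 2))
            out[ti][tj] = best
    return out
```

Identical pictures give gamma 0 everywhere, and the O(N^2)-pairs inner loop is exactly the cost that the distance-ordered iteration with a stop criterion tries to cut down.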

How is a hash mapped to a location in a vector? What happens when the vector is resized?

I've been thinking about HashMaps/Dictionaries recently, and I realise there is a gap in my understanding of their implementation.
From my data structures classes, I was told that a hash value will map a key directly to a location in a vector of linked-list buckets. According to wikipedia, MurmurHash creates a 32-bit or 128-bit value. Obviously that value cannot map directly to a location in memory. How is that hash value used to assign a location in the underlying vector to the object being placed in the hash map?
After reading David Robinson's answer I want to expand my question:
If the mapping is based on the size of the underlying vector of lists, what happens when the vector is resized?
Typically, the result of the hash is reduced modulo N, where N is the size of the allocated vector of linked lists. Pseudocode:
linked_list = lists[MurmurHash(x) % len(lists)]
linked_list.append(x)
This lets the implementer decide on the length of the vector of linked lists (that is, how much he wants to trade space efficiency for time efficiency), while keeping the result pseudorandom.
A common alternative worth mentioning is bit masking - for example, keeping only the b least significant bits. (For instance, x & 7 keeps only the 3 least significant bits.) This is equivalent to x modulo 2^b; it just happens to be faster on most hardware.
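The equivalence is easy to check for non-negative x (the mask is 2^b - 1):

```python
b = 3
mask = (1 << b) - 1                      # 0b111 == 7
for x in range(1000):
    assert x & mask == x % (1 << b)      # x & 7 == x % 8
```

This is why hash-table capacities are so often powers of two: the modulo becomes a single AND instruction.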
To answer your second question: if the vector has to be resized, then each value stored in the dictionary does indeed need to be remapped.
There is an excellent blog post on the implementation of dictionaries in Python that explains how that language implements its built in hash table (which it calls dictionaries). In that implementation:
The dictionary is resized (made larger) if more than 2/3 of its slots are being used
The list of slots is resized to 4 times its current size
Every value in the old list of slots is remapped to the new list of slots.
There are many other useful optimizations described in that blog post; it gives an excellent view of the practical aspects of implementing a hash table.
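Both answers can be condensed into a toy chained hash table (a sketch in Python; the built-in hash stands in for MurmurHash, and the 2/3 threshold and 4x growth mirror the CPython figures quoted above):

```python
class ToyHashMap:
    """Chained hash table: modulo mapping into buckets, rehash on resize."""
    def __init__(self, nslots=8):
        self.slots = [[] for _ in range(nslots)]
        self.count = 0

    def put(self, key, value):
        bucket = self.slots[hash(key) % len(self.slots)]
        for pair in bucket:
            if pair[0] == key:
                pair[1] = value          # overwrite existing key
                return
        bucket.append([key, value])
        self.count += 1
        if self.count * 3 > len(self.slots) * 2:   # more than 2/3 full
            self._resize(len(self.slots) * 4)      # grow 4x

    def get(self, key):
        for k, v in self.slots[hash(key) % len(self.slots)]:
            if k == key:
                return v
        raise KeyError(key)

    def _resize(self, nslots):
        old = [p for bucket in self.slots for p in bucket]
        self.slots = [[] for _ in range(nslots)]
        for k, v in old:                           # every entry is remapped,
            self.slots[hash(k) % nslots].append([k, v])  # since N changed
```

Note how _resize has to rehash every entry: the slot index depends on N, so changing N invalidates all previous mappings, which is the answer to the second question.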

Data structure for tiled map

I want to make an infinite tiled map, from (-max_int, -max_int) to (max_int, max_int), so I'm going to make a basic structure: a chunk. Each chunk contains char tiles[w][h] plus its int x, y coordinates. So, for example, with h = w = 10, tile (15,5) is in chunk (1,0) at (5,5), and tile (-25,-17) is in chunk (-3,-2) at (5,3), and so on. There can be any number of chunks, and I need to store them and access them easily in O(log n) or better (O(1) if possible... but it's not...). It should be easy to: add, ??remove?? (not a must), and find. So what data structure should I use?
Read up on KD-trees or quadtrees (the 2D variant of octrees). Either might be a big help here.
So all your space is split into chunks (rectangular clusters). Generally, the problem is storing data in a sparse matrix (since the clusters are already implemented). Why not use a two-level dictionary-like container? I.e., an rb-tree by row index, where each value is an rb-tree by column index. Or, if you are lucky, you can use hashes and get your O(1). In both cases, if you can't find a row, you allocate it in the container and create a new container as its value, initially with only a single chunk. Of course, allocating a new chunk on an existing row will be a bit faster than on a new one, and I guess that's the only issue with this approach.
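If hashing is acceptable, the two-level idea can also collapse to a single hash map keyed by the chunk-coordinate pair. A sketch in Python (chunk size w = h = 10 as in the question; floor division gives the chunk, modulo gives the offset inside it, which reproduces the question's negative-coordinate examples):

```python
W = H = 10
chunks = {}   # (cx, cy) -> 2D tile array, created on demand

def locate(x, y):
    """Map a global tile coordinate to ((chunk coords), (offset in chunk))."""
    return (x // W, y // H), (x % W, y % H)

def get_tile(x, y):
    c, off = locate(x, y)
    # setdefault: O(1) expected lookup, allocating the chunk lazily
    chunk = chunks.setdefault(c, [[' '] * W for _ in range(H)])
    return chunk, off
```

One caveat if you port this to C++: Python's // and % already floor toward negative infinity, whereas C++ integer division truncates toward zero, so negative coordinates need an explicit floor adjustment.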

Choosing a Data structure for very large data

I have x (millions of) positive integers, whose values can be as large as allowed (+2,147,483,647). Assuming they are unique, what is the best way to store them for a lookup-intensive program?
So far I have thought of using a binary AVL tree or a hash table, where the integer is the key to the mapped data (a name). However, I am not sure whether I can handle such large keys in such large quantities with a hash table (wouldn't that create a >0.8 load factor, in addition to being prone to collisions?).
Could I get some advice on which data structure might be suitable for my situation?
The choice of structure depends heavily on how much memory you have available. I'm assuming based on the description that you need lookup but not to loop over them, find nearest, or other similar operations.
Best is probably a bucketed hash table. By placing hash collisions into buckets and keeping separate arrays in the bucket for keys and values, you can both reduce the size of the table proper and take advantage of CPU cache speedup when searching a bucket. Linear search within a bucket may even end up faster than binary search!
AVL trees are nice for data sets that are read-intensive but not read-only AND require ordered enumeration, find-nearest and similar operations, but they're an annoying amount of work to implement correctly. You may get better performance from a B-tree because of CPU cache behavior, though, especially a cache-oblivious B-tree algorithm.
Have you looked into B-trees? The efficiency runs between log_m(n) and log_(m/2)(n), so if you choose m to be around 8-10 you should be able to keep your search depth below 10.
A bit vector, with the index set if the number is present. You can tweak it to store the number of occurrences of each number. There is a nice column about bit vectors in Bentley's Programming Pearls.
If memory isn't an issue, a map is probably your best bet. Maps are O(1), meaning that as you scale up the number of items to be looked up, the time it takes to find a value stays the same.
A map where the key is the int, and the value is the name.
Do try hash tables first. There are some variants that can tolerate being very dense without significant slowdown (like Brent's variation).
If you only need to store the 32-bit integers and not any associated record, use a set rather than a map, like hash_set in most C++ libraries. It would use only 4-byte records plus some constant overhead and a little slack to avoid being 100% full. In the worst case, handling 'millions' of numbers would take a few tens of megabytes. Big, but nothing unmanageable.
If you need it to be much tighter, just store them sorted in a plain array and use binary search to fetch them. It will be O(log n) instead of O(1), but for 'millions' of records it's still just twenty-something steps to reach any one of them. In C you have bsearch(), which is as fast as it gets.
edit: I just saw that in your question you mention some 'mapped data (a name)'. Are those names unique? Do they also have to be in memory? If so, they would definitely dominate the memory requirements. Even so, if the names are typical English words, most would be 10 bytes or less, keeping the total size in the tens of megabytes; maybe up to a hundred megs, still very manageable.
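The sorted-array approach is only a few lines in any language; a sketch in Python using the standard bisect module (the C equivalent would be qsort() plus bsearch()):

```python
import bisect

def contains(sorted_arr, key):
    """Binary search membership test: O(log n), ~30 steps even for a
    billion elements."""
    i = bisect.bisect_left(sorted_arr, key)
    return i < len(sorted_arr) and sorted_arr[i] == key

nums = sorted([5, 2_147_483_647, 42, 7, 1_000_000])
print(contains(nums, 42), contains(nums, 43))   # prints: True False
```

The array costs exactly 4 bytes per value with no per-node overhead, which is why it is the tightest option short of a bit vector.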
